Merged
Changes from 34 commits
36 commits
2519852
add protein-related code from https://github.com/ChEB-AI/python-chebai
aditya0by0 Apr 3, 2025
7221a9e
remove chebi imports in init.py
aditya0by0 Apr 14, 2025
2422518
changes for fix
aditya0by0 Apr 14, 2025
64d7623
change loss module for protein data
aditya0by0 Apr 14, 2025
beaf74e
update trainer for protein reader
aditya0by0 Apr 14, 2025
9fd19a9
remove chebi imports and libraries
aditya0by0 Apr 14, 2025
7048cd0
remove chebi version param from base data class
aditya0by0 Apr 14, 2025
9c8521f
electra config: update vocab size & max pos for protein seq
aditya0by0 Apr 14, 2025
b45b266
add changes from `out_dim` PR (#74 in python-chebai)
sfluegel05 Apr 16, 2025
a293931
remove not required files
aditya0by0 Apr 23, 2025
9120538
Update .gitignore
aditya0by0 Apr 23, 2025
6d7e6bd
update readers for proteins
aditya0by0 Apr 23, 2025
83e3342
import offset constants from chebai + remove its worflow
aditya0by0 Apr 23, 2025
22815fb
rename base folder to chebai_proteins
aditya0by0 Apr 23, 2025
68d4040
update notebook for chebai_proteins root
aditya0by0 Apr 23, 2025
78d79da
add chebai repo to to setup.py
aditya0by0 Apr 23, 2025
8dce9cb
Update setup.py
aditya0by0 Apr 23, 2025
71e361e
update unit test
aditya0by0 Apr 23, 2025
3819fd3
fix imports from chebai_proteins
aditya0by0 Apr 23, 2025
ab9bd1c
BCELoss config for deepgo2
aditya0by0 Apr 24, 2025
dcbd578
scope esm2 config
aditya0by0 Apr 24, 2025
1b2856d
MultilabelAUROC for deepgo MLP
aditya0by0 Apr 24, 2025
31b6f45
update migration script
aditya0by0 Apr 24, 2025
6c2506d
update configs
aditya0by0 Apr 24, 2025
19ab4a7
make python dir
aditya0by0 Apr 24, 2025
add85e3
deepgo: raise error if no classes are selected
aditya0by0 May 4, 2025
c89f26d
rectify consistent naming of scope
aditya0by0 May 4, 2025
196d662
reader: add collator to esm reader
aditya0by0 May 7, 2025
5af20c8
set weight_only=False for esm reader
aditya0by0 May 7, 2025
d653f52
use `TokenIndexerReader` for `ProteinDataReader`
aditya0by0 May 10, 2025
71fa9fe
update test for protein reader for tokenindexer changes
aditya0by0 May 10, 2025
508a47a
fix protein test for mock open
aditya0by0 May 10, 2025
a8823c8
add abstract DataReader for proteins repo to override token path
aditya0by0 May 11, 2025
cd92ca5
proteins readme
aditya0by0 May 12, 2025
7cf059c
Revert "add abstract DataReader for proteins repo to override token p…
aditya0by0 May 12, 2025
979b4f2
Update .gitignore
aditya0by0 May 12, 2025
10 changes: 10 additions & 0 deletions .gitattributes
@@ -0,0 +1,10 @@
# Jupyter notebooks contain images and tables, and parsing them as text
# inflates the total language fraction unrealistically;
# 'Jupyter Notebook' then suddenly becomes a major part of the repo language.

# As Linguist is not going to parse notebooks any better
# (wont-fix = https://github.com/github/linguist/issues/3496),
# simply exclude these files from the language count for now:

notebooks/*.ipynb linguist-generated=true
stream_viz/tutorial/*.ipynb linguist-generated=true
10 changes: 10 additions & 0 deletions .github/workflows/black.yml
@@ -0,0 +1,10 @@
name: Lint

on: [push, pull_request]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: psf/black@stable
38 changes: 38 additions & 0 deletions .github/workflows/test.yml
@@ -0,0 +1,38 @@
name: Unittests

on: [pull_request]

jobs:
  build:

    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.9", "3.10", "3.11", "3.12"]

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          python -m pip install --upgrade pip setuptools wheel
          python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
          python -m pip install -e .

      - name: Display Python & Installed Packages
        run: |
          python --version
          pip freeze

      - name: Run Unit Tests
        run: python -m unittest discover -s tests/unit -v
        env:
          ACTIONS_STEP_DEBUG: true  # Enable debug logs
          ACTIONS_RUNNER_DEBUG: true  # Additional debug logs from Github Actions itself
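The `Run Unit Tests` step relies on `unittest` discovery under `tests/unit`. As a rough illustration of what discovery does, the sketch below builds a throwaway `tests/unit` folder containing one test module and runs the loader on it. The module name `test_sample_<hex>.py`, the class `TestSample`, and the helpers `run_discovery` and `demo` are hypothetical and not part of this PR.

```python
import tempfile
import textwrap
import unittest
import uuid
from pathlib import Path


def run_discovery(start_dir: str) -> unittest.TestResult:
    """Discover and run tests under start_dir, similar to `unittest discover -s`."""
    suite = unittest.TestLoader().discover(start_dir=start_dir, pattern="test*.py")
    result = unittest.TestResult()
    suite.run(result)
    return result


def demo() -> unittest.TestResult:
    """Create a hypothetical tests/unit folder with one test and run discovery on it."""
    with tempfile.TemporaryDirectory() as tmp:
        unit_dir = Path(tmp) / "tests" / "unit"
        unit_dir.mkdir(parents=True)
        # Hypothetical test module; the unique suffix avoids module-name clashes
        # if demo() is called more than once in the same process.
        module_file = unit_dir / f"test_sample_{uuid.uuid4().hex}.py"
        module_file.write_text(
            textwrap.dedent(
                """
                import unittest

                class TestSample(unittest.TestCase):
                    def test_addition(self):
                        self.assertEqual(1 + 1, 2)
                """
            )
        )
        # Run inside the `with` block so the files still exist during the run.
        return run_discovery(str(unit_dir))


if __name__ == "__main__":
    result = demo()
    print(f"ran={result.testsRun} ok={result.wasSuccessful()}")
```

Only files matching `test*.py` are picked up, which is also the default pattern of the command-line form used in the workflow.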
110 changes: 110 additions & 0 deletions .github/workflows/token_consistency.yaml
@@ -0,0 +1,110 @@
name: Check consistency of tokens.txt file

# Define the file paths under `paths` to trigger this check only when specific files are modified.
# This script will then execute checks only on files that have changed, rather than all files listed in `paths`.

# **Note** : To add a new token file for checks, include its path in:
# - `on` -> `push` and `pull_request` sections
# - `jobs` -> `check_tokens` -> `steps` -> Set global variable for multiple tokens.txt paths -> `TOKENS_FILES`

on:
  push:
    paths:
      - "chebai/preprocessing/bin/protein_token/tokens.txt"
      - "chebai/preprocessing/bin/protein_token_3_gram/tokens.txt"
  pull_request:
    paths:
      - "chebai/preprocessing/bin/protein_token/tokens.txt"
      - "chebai/preprocessing/bin/protein_token_3_gram/tokens.txt"

jobs:
  check_tokens:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Get list of changed files
        id: changed_files
        run: |
          git fetch origin dev

          # Get the list of changed files compared to origin/dev and save them to a file
          git diff --name-only origin/dev > changed_files.txt

          # Print the names of changed files on separate lines
          echo "Changed files:"
          while read -r line; do
            echo "Changed File name : $line"
          done < changed_files.txt

      - name: Set global variable for multiple tokens.txt paths
        run: |
          # All token files that need to be checked must be included here too, same as in `paths`.
          TOKENS_FILES=(
            "chebai/preprocessing/bin/protein_token/tokens.txt"
            "chebai/preprocessing/bin/protein_token_3_gram/tokens.txt"
          )
          echo "TOKENS_FILES=${TOKENS_FILES[*]}" >> $GITHUB_ENV

      - name: Process only changed tokens.txt files
        run: |
          # Convert the TOKENS_FILES environment variable into an array
          TOKENS_FILES=(${TOKENS_FILES})

          # Iterate over each token file path
          for TOKENS_FILE_PATH in "${TOKENS_FILES[@]}"; do
            # Check if the current token file path is in the list of changed files
            if grep -q "$TOKENS_FILE_PATH" changed_files.txt; then
              echo "----------------------- Processing $TOKENS_FILE_PATH -----------------------"

              # Get previous tokens.txt version
              git fetch origin dev
              git diff origin/dev -- $TOKENS_FILE_PATH > tokens_diff.txt || echo "No previous tokens.txt found for $TOKENS_FILE_PATH"

              # Check for deleted or added lines in tokens.txt
              if [ -f tokens_diff.txt ]; then

                # Check for deleted lines (lines starting with '-')
                deleted_lines=$(grep '^-' tokens_diff.txt | grep -v '^---' | sed 's/^-//' || true)
                if [ -n "$deleted_lines" ]; then
                  echo "Error: Lines have been deleted from $TOKENS_FILE_PATH."
                  echo -e "Deleted Lines: \n$deleted_lines"
                  exit 1
                fi

                # Check for added lines (lines starting with '+')
                added_lines=$(grep '^+' tokens_diff.txt | grep -v '^+++' | sed 's/^+//' || true)
                if [ -n "$added_lines" ]; then

                  # Count how many lines have been added
                  num_added_lines=$(echo "$added_lines" | wc -l)

                  # Get last `n` lines (equal to num_added_lines) of tokens.txt
                  last_lines=$(tail -n "$num_added_lines" $TOKENS_FILE_PATH)

                  # Check if the added lines are at the end of the file
                  if [ "$added_lines" != "$last_lines" ]; then

                    # Find lines that were added but not appended at the end of the file
                    non_appended_lines=$(diff <(echo "$added_lines") <(echo "$last_lines") | grep '^<' | sed 's/^< //')

                    echo "Error: New lines have been added to $TOKENS_FILE_PATH, but they are not at the end of the file."
                    echo -e "Added lines that are not at the end of the file: \n$non_appended_lines"
                    exit 1
                  fi
                fi

                if [ "$added_lines" == "" ]; then
                  echo "$TOKENS_FILE_PATH validation successful: No lines were deleted, and no new lines were added."
                else
                  echo "$TOKENS_FILE_PATH validation successful: No lines were deleted, and new lines were correctly appended at the end."
                fi
              else
                echo "No previous version of $TOKENS_FILE_PATH found."
              fi
            else
              echo "$TOKENS_FILE_PATH was not changed, skipping."
            fi
          done
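The workflow above enforces an append-only rule for each `tokens.txt`: no existing line may be deleted, and new lines may only appear at the end of the file. A minimal sketch of the same rule in Python follows. It is not the workflow's implementation: it compares the old and new file contents directly as a prefix check, whereas the workflow inspects `git diff` output. The function name `check_append_only` and its return format are hypothetical.

```python
from typing import List, Tuple


def check_append_only(old_lines: List[str], new_lines: List[str]) -> Tuple[bool, str]:
    """Return (ok, message); ok is True only if new_lines starts with old_lines."""
    # Fewer lines than before means at least one line was deleted.
    if len(new_lines) < len(old_lines):
        return False, "lines were deleted"

    # Every old line must keep its original position.
    for index, (old, new) in enumerate(zip(old_lines, new_lines), start=1):
        if old != new:
            return False, f"line {index} changed: {old!r} -> {new!r}"

    added = new_lines[len(old_lines):]
    if not added:
        return True, "no lines were added or deleted"
    return True, f"{len(added)} line(s) appended at the end"


if __name__ == "__main__":
    old = ["A", "C", "D"]
    print(check_append_only(old, ["A", "C", "D", "E"]))  # appended at the end
    print(check_append_only(old, ["A", "B", "C", "D"]))  # inserted in the middle
    print(check_append_only(old, ["A", "C"]))            # deleted
```

A prefix comparison is stricter than the diff-based check in one respect: it also rejects an in-place edit of an existing line, which a diff would report as one deletion plus one addition.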
171 changes: 171 additions & 0 deletions .gitignore
@@ -0,0 +1,171 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/
docs/build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

# configs/ # commented as new configs can be added as a part of a feature

/.idea
/data
/logs
/results_buffer
electra_pretrained.ckpt
.jupyter
.virtual_documents
25 changes: 25 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,25 @@
repos:
  - repo: https://github.com/psf/black
    rev: "24.2.0"
    hooks:
      - id: black
      - id: black-jupyter # for formatting jupyter-notebook

  - repo: https://github.com/pycqa/isort
    rev: 5.13.2
    hooks:
      - id: isort
        name: isort (python)
        args: ["--profile=black"]

  - repo: https://github.com/asottile/seed-isort-config
    rev: v2.2.0
    hooks:
      - id: seed-isort-config

  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: check-yaml
      - id: end-of-file-fixer
      - id: trailing-whitespace