Protein Prediction Codebase Migration #1

aditya0by0 · 2025-04-14T17:04:28Z

PR to move the protein-prediction code base from python-chebai to python-chebai-proteins.

Related Issues:

Related PRs

Protein function prediction with GO - Part 3 python-chebai#64 (Includes SCOPe changes)
Protein function prediction with GO - Part 2 python-chebai#57
Protein function prediction with GO python-chebai#39

aditya0by0 · 2025-04-14T18:38:48Z

@sfluegel05, please let me know whether the files below are needed in this repository, as I am not sure about them:

chebai/preprocessing/collect_all.py
chebai/result/analyse_sem.py
chebai/result/molplot.py
chebai/result in general

sfluegel05 · 2025-04-16T09:10:11Z

> @sfluegel05, Please let me know if the below files are needed in this repository, as I am not sure about them
> * chebai/preprocessing/collect_all.py
> * chebai/result/analyse_sem.py
> * chebai/result/molplot.py
> * `chebai/result` in general

I would say that none of them are needed here. I would even remove more files. We only need files here that are not in the python-chebai repository. The idea is that the proteins repository does not work by itself, but only adds some specific classes to the main repository. The only files I would duplicate are the GitHub workflows.

aditya0by0 · 2025-04-16T20:06:20Z

@sfluegel05 Please check the results from the latest training on the scope50 dataset

For this run, I made the following changes:

* Set the vocab size to 31 (21 amino acids for n_gram=1 + 10 special tokens for the LLM)
* Updated max_position_embeddings to 1000, aligning with the default maximum sequence length for proteins
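The two settings above can be sketched as a small calculation. This is only an illustration: the function name `electra_vocab_size`, the generalisation to `n_gram > 1` (one id per possible n-gram), and the dictionary `protein_electra_config` are assumptions made for the example, not code taken from the repository.

```python
def electra_vocab_size(num_amino_acids: int = 21,
                       n_gram: int = 1,
                       num_special_tokens: int = 10) -> int:
    """Number of token ids needed for an n-gram protein vocabulary.

    Assumption: every possible n-gram over the amino-acid alphabet gets its
    own id, plus a fixed block of special tokens.
    """
    if n_gram < 1:
        raise ValueError("n_gram must be >= 1")
    return num_amino_acids ** n_gram + num_special_tokens


# Hypothetical config fragment mirroring the values reported for this run.
protein_electra_config = {
    "vocab_size": electra_vocab_size(),   # 21 amino acids + 10 special tokens
    "max_position_embeddings": 1000,      # default maximum protein sequence length
}

print(protein_electra_config)
```

With the defaults, the helper reproduces the vocab size of 31 used in this run; a larger `n_gram` would grow the vocabulary quickly under the stated assumption.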

The training completed 81 epochs in 5 hours on a single GPU.

🔗WandB Run Overview

aditya0by0 · 2025-04-16T20:24:44Z

> The idea is that the proteins repository does not work by itself, but only adds some specific classes to the main repository. The only files I would duplicate are the GitHub workflows.

Thank you for the clarification.

Based on your explanation, I understand that the proteins repository is intended to complement the main repository by adding specific classes. If we follow the same pattern as with python-chebai-graph, and use python-chebai as a base library, I believe there are a few important considerations for protein-specific use cases.

Firstly, this approach may lead to the installation of several unnecessary dependencies such as RDKit, pysmiles, deepsmiles, selfies, etc., which are not required for protein sequence data.

Secondly, the base class XYDataBaseModule includes default parameters that are tailored to ChEBI data and may not fit protein-related models. For example, commit 7048cd0 removes the ChEBI version parameter from the base data class, because otherwise such unneeded parameters get logged in the hparam logs. Similarly, the vocabulary size and max_position_embeddings differ significantly between chemical and protein data. There are also some minor but relevant code changes specific to protein data, as shown in the following commits:

Given that protein sequences represent a fundamentally different data type compared to chemical molecules, I would suggest considering a standalone repository for proteins. This would help keep the dependency footprint minimal and allow more flexibility in handling domain-specific requirements.

Please let me know your thoughts.

aditya0by0 · 2025-04-23T15:38:57Z

Just a reminder: if we want to use ESM2 to generate embeddings for the SCOPe dataset, we can't use the ELECTRA model for further training, for the reasons discussed in ChEB-AI/python-chebai#64 (comment).

aditya0by0 · 2025-05-12T10:41:16Z

@sfluegel05, Please review and merge.

aditya0by0 · 2025-05-15T12:35:23Z

I think there are two branches, dev and main. This branch was merged into main, so we can remove the dev branch, which is currently set as the default.

add protein-related code from https://github.com/ChEB-AI/python-chebai

2519852

aditya0by0 self-assigned this Apr 14, 2025

aditya0by0 added 2 commits April 14, 2025 19:13

remove chebi imports in init.py

7221a9e

changes for fix

2422518

aditya0by0 marked this pull request as draft April 14, 2025 17:25

aditya0by0 requested a review from sfluegel05 April 14, 2025 17:25

aditya0by0 added 3 commits April 14, 2025 20:35

change loss module for protein data

64d7623

update trainer for protein reader

beaf74e

remove chebi imports and libraries

9fd19a9

aditya0by0 and others added 3 commits April 14, 2025 22:10

remove chebi version param from base data class

7048cd0

electra config: update vocab size & max pos for protein seq

9c8521f

add changes from out_dim PR (#74 in python-chebai)

b45b266

aditya0by0 added 9 commits April 23, 2025 16:33

remove not required files

a293931

Update .gitignore

9120538

update readers for proteins

6d7e6bd

import offset constants from chebai + remove its workflow

83e3342

rename base folder to chebai_proteins

22815fb

update notebook for chebai_proteins root

68d4040

add chebai repo to setup.py

78d79da

Update setup.py

8dce9cb

update unit test

71e361e

aditya0by0 added 4 commits April 23, 2025 17:56

fix imports from chebai_proteins

3819fd3

BCELoss config for deepgo2

ab9bd1c

scope esm2 config

dcbd578

MultilabelAUROC for deepgo MLP

1b2856d

aditya0by0 added 11 commits April 24, 2025 12:43

update migration script

31b6f45

update configs

6c2506d

make python dir

19ab4a7

deepgo: raise error if no classes are selected

add85e3

rectify consistent naming of scope

c89f26d

reader: add collator to esm reader

196d662

set weight_only=False for esm reader

5af20c8

use TokenIndexerReader for ProteinDataReader

d653f52

update test for protein reader for tokenindexer changes

71fa9fe

fix protein test for mock open

508a47a

add abstract DataReader for proteins repo to override token path

a8823c8

proteins readme

cd92ca5

sfluegel05 reviewed May 12, 2025

chebai_proteins/preprocessing/reader.py (outdated, resolved)

aditya0by0 mentioned this pull request May 12, 2025

Fix for reader directory for subclasses defined in other repository ChEB-AI/python-chebai#91

Merged

aditya0by0 added 2 commits May 12, 2025 20:00

Revert "add abstract DataReader for proteins repo to override token path"

7cf059c

This reverts commit a8823c8.

Update .gitignore

979b4f2

aditya0by0 requested a review from sfluegel05 May 15, 2025 10:45

sfluegel05 marked this pull request as ready for review May 15, 2025 11:53

sfluegel05 merged commit 4c2971c into main May 15, 2025
6 checks passed

sfluegel05 deleted the protein_prediction branch May 15, 2025 11:54

sfluegel05 mentioned this pull request May 19, 2025

From main branch to dev. Delete main after merge. #2

Merged

aditya0by0 added a commit that referenced this pull request Aug 11, 2025

electra config for scope

3c4a6a3

3 participants
