Skip to content

Conversation

@aditya0by0
Copy link
Member

@aditya0by0 aditya0by0 commented Apr 14, 2025

@aditya0by0 aditya0by0 self-assigned this Apr 14, 2025
@aditya0by0 aditya0by0 marked this pull request as draft April 14, 2025 17:25
@aditya0by0 aditya0by0 requested a review from sfluegel05 April 14, 2025 17:25
@aditya0by0
Copy link
Member Author

@sfluegel05, Please let me know if the below files are needed in this repository, as I am not sure about them

  • chebai/preprocessing/collect_all.py
  • chebai/result/analyse_sem.py
  • chebai/result/molplot.py
  • chebai/result in general

@sfluegel05
Copy link
Contributor

@sfluegel05, Please let me know if the below files are needed in this repository, as I am not sure about them

* chebai/preprocessing/collect_all.py

* chebai/result/analyse_sem.py

* chebai/result/molplot.py

* `chebai/result` in general

I would say that none of them are needed here. I would even remove more files. We only need files here that are not in the python-chebai repository. The idea is that the proteins repository does not work by itself, but only adds some specific classes to the main repository. The only files I would duplicate are the GItHub workflows.

@aditya0by0
Copy link
Member Author

aditya0by0 commented Apr 16, 2025

@sfluegel05 Please check the results from the latest training on the scope50 dataset

For this run, I made the following changes:

  • Set the vocab size to 31 (21 amino acids for n_gram=1 + 10 special tokens for the LLM)
  • Updated max_position_embeddings to 1000, aligning with the default maximum sequence length for proteins

The training completed 81 epochs in 5 hours on a single GPU.

🔗WandB Run Overview

@aditya0by0
Copy link
Member Author

The idea is that the proteins repository does not work by itself, but only adds some specific classes to the main repository. The only files I would duplicate are the GItHub workflows.

Thank you for the clarification.

Based on your explanation, I understand that the proteins repository is intended to complement the main repository by adding specific classes. If we follow the same pattern as with python-chebai-graph, and use python-chebai as a base library, I believe there are a few important considerations for protein-specific use cases.

Firstly, this approach may lead to the installation of several unnecessary dependencies such as RDKit, pysmiles, deepsmiles, selfies, etc., which are not required for protein sequence data.

Secondly, the base class XYDataBaseModule includes default parameters that are more tailored to ChEBI data, and may not align with the needs of protein-related models as here 7048cd0, this not-required params get logged in hparam logs . And also for instance, the vocabulary size and max_position_embeddings differ significantly between chemical and protein data. There are also some minor but relevant code changes specific to protein data, as shown in the following commits:

Given that protein sequences represent a fundamentally different data type compared to chemical molecules, I would suggest considering a standalone repository for proteins. This would help keep the dependency footprint minimal and allow more flexibility in handling domain-specific requirements.

Please let me know your thoughts.

@aditya0by0
Copy link
Member Author

Just want to recall that if we want to use ESM2 for generating embeddings for SCOPe dataset, we can't use ELECTRA model for further training due the following discussed reasons ChEB-AI/python-chebai#64 (comment)

@aditya0by0
Copy link
Member Author

@sfluegel05, Please review and merge.

@aditya0by0 aditya0by0 requested a review from sfluegel05 May 15, 2025 10:45
@sfluegel05 sfluegel05 marked this pull request as ready for review May 15, 2025 11:53
@sfluegel05 sfluegel05 merged commit 4c2971c into main May 15, 2025
6 checks passed
@sfluegel05 sfluegel05 deleted the protein_prediction branch May 15, 2025 11:54
@aditya0by0
Copy link
Member Author

I think there are 2 branches one is dev and another is main. This branch got merged into main. So we can remove the dev branch which is set as default.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants