Commit 89796c7

auto-generated MSA intergation
1 parent 6b40a66 commit 89796c7

13 files changed

+653
-103
lines changed

README.md

+27
@@ -63,3 +63,30 @@ We welcome external contributions and are eager to engage with the community. Co
6363
## License
6464

6565
Our model and code are released under MIT License, and can be freely used for both academic and commercial purposes.
66+
67+
68+
## Cite
69+
70+
If you use this code or the models in your research, please cite the following papers:
71+
72+
```bibtex
73+
@article{wohlwend2024boltz1,
74+
title={Boltz-1: Democratizing Biomolecular Interaction Modeling},
75+
author={Wohlwend, Jeremy and Corso, Gabriele and Passaro, Saro and Reveiz, Mateo and Leidal, Ken and Swiderski, Wojtek and Portnoi, Tally and Chinn, Itamar and Silterra, Jacob and Jaakkola, Tommi and Barzilay, Regina},
76+
journal={},
77+
year={2024},
78+
}
79+
```
80+
81+
```bibtex
82+
@article{mirdita2022colabfold,
83+
title={ColabFold: making protein folding accessible to all},
84+
author={Mirdita, Milot and Sch{\"u}tze, Konstantin and Moriwaki, Yoshitaka and Heo, Lim and Ovchinnikov, Sergey and Steinegger, Martin},
85+
journal={Nature methods},
86+
volume={19},
87+
number={6},
88+
pages={679--682},
89+
year={2022},
90+
publisher={Nature Publishing Group US New York}
91+
}
92+
```

docs/prediction.md

+32-32
@@ -20,37 +20,6 @@ Before diving into more details about the input formats, here are the key differ
2020

2121

2222

23-
## Fasta format
24-
25-
The fasta format should contain entries as follows:
26-
27-
```
28-
>CHAIN_ID|ENTITY_TYPE|MSA_PATH
29-
SEQUENCE
30-
```
31-
32-
The `CHAIN_ID` is a unique identifier for each input chain. The `ENTITY_TYPE` can be one of `protein`, `dna`, `rna`, `smiles`, `ccd` (note that we support both smiles and CCD code for ligands). The `MSA_PATH` is only specified for protein entities and is the path to the `.a3m` file containing a pre-computed MSA for the sequence of the protein. If you wish to explicitly run single sequence mode (which is generally advised against as it will hurt model performance), you may do so by using the special keyword `empty` for that protein (ex: `>A|protein|empty`).
33-
34-
For each of these cases, the corresponding `SEQUENCE` will contain an amino acid sequence (e.g. `EFKEAFSLF`), a sequence of nucleotide bases (e.g. `ATCG`), a smiles string (e.g. `CC1=CC=CC=C1`), or a CCD code (e.g. `ATP`), depending on the entity.
35-
36-
As an example:
37-
38-
```yaml
39-
>A|protein|./examples/msa/seq1.a3m
40-
MVTPEGNVSLVDESLLVGVTDEDRAVRSAHQFYERLIGLWAPAVMEAAHELGVFAALAEAPADSGELARRLDCDARAMRVLLDALYAYDVIDRIHDTNGFRYLLSAEARECLLPGTLFSLVGKFMHDINVAWPAWRNLAEVVRHGARDTSGAESPNGIAQEDYESLVGGINFWAPPIVTTLSRKLRASGRSGDATASVLDVGCGTGLYSQLLLREFPRWTATGLDVERIATLANAQALRLGVEERFATRAGDFWRGGWGTGYDLVLFANIFHLQTPASAVRLMRHAAACLAPDGLVAVVDQIVDADREPKTPQDRFALLFAASMTNTGGGDAYTFQEYEEWFTAAGLQRIETLDTPMHRILLARRATEPSAVPEGQASENLYFQ
41-
>B|protein|./examples/msa/seq1.a3m
42-
MVTPEGNVSLVDESLLVGVTDEDRAVRSAHQFYERLIGLWAPAVMEAAHELGVFAALAEAPADSGELARRLDCDARAMRVLLDALYAYDVIDRIHDTNGFRYLLSAEARECLLPGTLFSLVGKFMHDINVAWPAWRNLAEVVRHGARDTSGAESPNGIAQEDYESLVGGINFWAPPIVTTLSRKLRASGRSGDATASVLDVGCGTGLYSQLLLREFPRWTATGLDVERIATLANAQALRLGVEERFATRAGDFWRGGWGTGYDLVLFANIFHLQTPASAVRLMRHAAACLAPDGLVAVVDQIVDADREPKTPQDRFALLFAASMTNTGGGDAYTFQEYEEWFTAAGLQRIETLDTPMHRILLARRATEPSAVPEGQASENLYFQ
43-
>C|ccd
44-
SAH
45-
>D|ccd
46-
SAH
47-
>E|smiles
48-
N[C@@H](Cc1ccc(O)cc1)C(=O)O
49-
>F|smiles
50-
N[C@@H](Cc1ccc(O)cc1)C(=O)O
51-
```
52-
53-
5423
## YAML format
5524

5625
The YAML format is more flexible and allows for more complex inputs, particularly around covalent bonds. The schema of the YAML is the following:
@@ -78,7 +47,7 @@ constraints:
7847
binder: CHAIN_ID
7948
contacts: [[CHAIN_ID, RES_IDX], [CHAIN_ID, RES_IDX]]
8049
```
81-
`sequences` has one entry for every unique chain/molecule in the input. Each polymer entity as a `ENTITY_TYPE` either `protein`, `dna` or`rna` and have a `sequence` attribute. Non-polymer entities are indicated by `ENTITY_TYPE` equal to `ligand` and have a `smiles` or `ccd` attribute. `CHAIN_ID` is the unique identifier for each chain/molecule, and it should be set as a list in case of multiple identical entities in the structure. Protein entities should also contain an `msa` attribute with `MSA_PATH` indicating the path to the `.a3m` file containing a computed MSA for the sequence of the protein. If you wish to explicitly run single sequence mode (which is generally advised against as it will hurt model performance), you may do so by using the special keyword `empty` for that protein (ex: `msa: empty`).
50+
`sequences` has one entry for every unique chain/molecule in the input. Each polymer entity has an `ENTITY_TYPE` of `protein`, `dna` or `rna` and a `sequence` attribute. Non-polymer entities have `ENTITY_TYPE` equal to `ligand` and a `smiles` or `ccd` attribute. `CHAIN_ID` is the unique identifier for each chain/molecule; set it to a list when the structure contains multiple identical entities. For proteins, the `msa` key is optional. If it is unset, MSAs are generated automatically using the mmseqs2 server. To use a precomputed MSA, set `msa` to `MSA_PATH`, the path to the `.a3m` file containing the MSA for that protein. To explicitly run single-sequence mode (generally advised against, as it will hurt model performance), use the special keyword `empty` for that protein (e.g. `msa: empty`).
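
The three `msa` cases described above can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not the project's actual loader: the function name `msa_mode`, the returned labels, and the plain-dict representation of a protein entry are all assumptions made for the example.

```python
from typing import Optional


def msa_mode(protein_entry: dict) -> str:
    """Classify how the MSA for one protein entry would be obtained.

    `protein_entry` is a hypothetical dict holding the attributes of a
    single protein entry from the YAML `sequences` list.
    """
    msa: Optional[str] = protein_entry.get("msa")
    if msa is None:
        # `msa` key unset: the MSA is auto-generated via the mmseqs2 server.
        return "server"
    if msa == "empty":
        # Special keyword: explicit single-sequence mode.
        return "single_sequence"
    # Any other value is treated as MSA_PATH, a path to an .a3m file.
    return "precomputed"


entries = [
    {"sequence": "EFKEAFSLF"},
    {"sequence": "EFKEAFSLF", "msa": "./examples/msa/seq1.a3m"},
    {"sequence": "EFKEAFSLF", "msa": "empty"},
]
modes = [msa_mode(entry) for entry in entries]
print(modes)
```

Keeping the decision in one small function makes the default (server-generated MSA) explicit and easy to spot.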
8251

8352
The `modifications` field is optional and lets you specify modified residues in a polymer (`protein`, `dna` or `rna`). The `position` field specifies the index (starting from 1) of the residue, and `ccd` is the CCD code of the modified residue. This field is currently only supported for CCD ligands.
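
Because `position` is 1-based, it is easy to be off by one when mapping it onto a 0-based sequence index. The sketch below shows that mapping. The helper `apply_modifications` and the token-list output are hypothetical, chosen only to illustrate the indexing convention, and do not reflect how the model represents residues internally.

```python
from typing import Dict, List


def apply_modifications(sequence: str, modifications: List[Dict]) -> List[str]:
    """Return one token per residue, with modified positions replaced by
    their CCD code. Each modification is a dict with `position` (1-based)
    and `ccd`, mirroring the fields described above."""
    residues = list(sequence)
    for mod in modifications:
        position = mod["position"]
        if not 1 <= position <= len(residues):
            raise ValueError(f"position {position} is outside 1..{len(residues)}")
        # Convert the 1-based position into a 0-based list index.
        residues[position - 1] = mod["ccd"]
    return residues


# Replace the serine at position 7 with the CCD code SEP (phosphoserine).
tokens = apply_modifications("EFKEAFSLF", [{"position": 7, "ccd": "SEP"}])
print(tokens)
```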
8453

@@ -102,6 +71,37 @@ sequences:
10271
```
10372

10473

74+
## Fasta format
75+
76+
The fasta format is a little simpler, and should contain entries as follows:
77+
78+
```
79+
>CHAIN_ID|ENTITY_TYPE|MSA_PATH
80+
SEQUENCE
81+
```
82+
83+
The `CHAIN_ID` is a unique identifier for each input chain. The `ENTITY_TYPE` can be one of `protein`, `dna`, `rna`, `smiles`, `ccd` (note that both SMILES and CCD codes are supported for ligands). The `MSA_PATH` is optional and only applies to proteins. By default, MSAs are auto-generated using the mmseqs2 server. To use a custom MSA, set `MSA_PATH` to the path of the `.a3m` file containing a pre-computed MSA for this protein. To explicitly run single-sequence mode (generally advised against, as it will hurt model performance), use the special keyword `empty` for that protein (e.g. `>A|protein|empty`).
84+
85+
For each of these cases, the corresponding `SEQUENCE` will contain an amino acid sequence (e.g. `EFKEAFSLF`), a sequence of nucleotide bases (e.g. `ATCG`), a smiles string (e.g. `CC1=CC=CC=C1`), or a CCD code (e.g. `ATP`), depending on the entity.
86+
87+
As an example:
88+
89+
```yaml
90+
>A|protein|./examples/msa/seq1.a3m
91+
MVTPEGNVSLVDESLLVGVTDEDRAVRSAHQFYERLIGLWAPAVMEAAHELGVFAALAEAPADSGELARRLDCDARAMRVLLDALYAYDVIDRIHDTNGFRYLLSAEARECLLPGTLFSLVGKFMHDINVAWPAWRNLAEVVRHGARDTSGAESPNGIAQEDYESLVGGINFWAPPIVTTLSRKLRASGRSGDATASVLDVGCGTGLYSQLLLREFPRWTATGLDVERIATLANAQALRLGVEERFATRAGDFWRGGWGTGYDLVLFANIFHLQTPASAVRLMRHAAACLAPDGLVAVVDQIVDADREPKTPQDRFALLFAASMTNTGGGDAYTFQEYEEWFTAAGLQRIETLDTPMHRILLARRATEPSAVPEGQASENLYFQ
92+
>B|protein|./examples/msa/seq1.a3m
93+
MVTPEGNVSLVDESLLVGVTDEDRAVRSAHQFYERLIGLWAPAVMEAAHELGVFAALAEAPADSGELARRLDCDARAMRVLLDALYAYDVIDRIHDTNGFRYLLSAEARECLLPGTLFSLVGKFMHDINVAWPAWRNLAEVVRHGARDTSGAESPNGIAQEDYESLVGGINFWAPPIVTTLSRKLRASGRSGDATASVLDVGCGTGLYSQLLLREFPRWTATGLDVERIATLANAQALRLGVEERFATRAGDFWRGGWGTGYDLVLFANIFHLQTPASAVRLMRHAAACLAPDGLVAVVDQIVDADREPKTPQDRFALLFAASMTNTGGGDAYTFQEYEEWFTAAGLQRIETLDTPMHRILLARRATEPSAVPEGQASENLYFQ
94+
>C|ccd
95+
SAH
96+
>D|ccd
97+
SAH
98+
>E|smiles
99+
N[C@@H](Cc1ccc(O)cc1)C(=O)O
100+
>F|smiles
101+
N[C@@H](Cc1ccc(O)cc1)C(=O)O
102+
```
103+
104+
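
To make the header layout `>CHAIN_ID|ENTITY_TYPE|MSA_PATH` concrete, here is a minimal parsing sketch in Python. It is not the project's parser: `FastaHeader`, `parse_header`, and the validation rules (for example, rejecting an MSA field on non-protein entries) are assumptions based only on the description in this section.

```python
from typing import NamedTuple, Optional

# Entity types listed in the fasta format description.
ENTITY_TYPES = {"protein", "dna", "rna", "smiles", "ccd"}


class FastaHeader(NamedTuple):
    chain_id: str
    entity_type: str
    msa_path: Optional[str]  # None when the optional third field is absent


def parse_header(line: str) -> FastaHeader:
    """Split one `>CHAIN_ID|ENTITY_TYPE|MSA_PATH` header into its fields."""
    if not line.startswith(">"):
        raise ValueError("a fasta header must start with '>'")
    fields = line[1:].strip().split("|")
    if len(fields) not in (2, 3):
        raise ValueError(f"expected 2 or 3 '|'-separated fields, got {len(fields)}")
    chain_id, entity_type = fields[0], fields[1]
    if entity_type not in ENTITY_TYPES:
        raise ValueError(f"unknown entity type: {entity_type!r}")
    # The third field is optional; an empty string is treated as absent.
    msa_path = fields[2] if len(fields) == 3 and fields[2] else None
    if msa_path is not None and entity_type != "protein":
        raise ValueError("MSA_PATH only applies to protein entries")
    return FastaHeader(chain_id, entity_type, msa_path)


for header in (">A|protein|./examples/msa/seq1.a3m", ">A|protein|empty", ">C|ccd"):
    print(parse_header(header))
```

The parser returns the raw third field and leaves its interpretation (a path, the `empty` keyword, or absent) to the caller, which keeps the parsing step separate from the MSA policy.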
105105
## Options
106106

107107
The following options are available for the `predict` command:
