
Commit 2deeafa

committed
boltz-1
0 parents  commit 2deeafa


90 files changed

+18517
-0
lines changed

LICENSE

+21
@@ -0,0 +1,21 @@
1+
MIT License
2+
3+
Copyright (c) 2024 Jeremy Wohlwend, Gabriele Corso, Saro Passaro
4+
5+
Permission is hereby granted, free of charge, to any person obtaining a copy
6+
of this software and associated documentation files (the "Software"), to deal
7+
in the Software without restriction, including without limitation the rights
8+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9+
copies of the Software, and to permit persons to whom the Software is
10+
furnished to do so, subject to the following conditions:
11+
12+
The above copyright notice and this permission notice shall be included in all
13+
copies or substantial portions of the Software.
14+
15+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21+
SOFTWARE.

README.md

+65
@@ -0,0 +1,65 @@
1+
<h1 align="center">Boltz-1:
2+
3+
Democratizing Biomolecular Interaction Modeling
4+
</h1>
5+
6+
![](docs/boltz1_pred_figure.png)
7+
8+
Boltz-1 is an open-source model that predicts the 3D structure of proteins, RNA, DNA and small molecules. It handles modified residues, covalent ligands and glycans, and can condition the generation on pocket residues.
9+
10+
For more information about the model, see our [technical report](https://gcorso.github.io/assets/boltz1.pdf).
11+
12+
## Installation
13+
Install boltz with PyPI (recommended):
14+
15+
```
16+
pip install boltz
17+
```
18+
19+
or directly from GitHub for daily updates:
20+
21+
```
22+
git clone https://github.com/jwohlwend/boltz.git
23+
cd boltz; pip install -e .
24+
```
25+
> Note: we recommend installing boltz in a fresh Python environment.
26+
27+
## Inference
28+
29+
You can run inference using Boltz-1 with:
30+
31+
```
32+
boltz predict input_path
33+
```
34+
35+
Boltz currently accepts three input formats:
36+
37+
1. Fasta file, for most use cases
38+
39+
2. A comprehensive YAML schema, for more complex use cases
40+
41+
3. A directory containing files of the above formats, for batched processing
42+
43+
To see all available options, run `boltz predict --help`. For more information on these input formats, see our [prediction instructions](docs/prediction.md).
44+
45+
## Training
46+
47+
If you're interested in retraining the model, see our [training instructions](docs/training.md).
48+
49+
## Contributing
50+
51+
We welcome external contributions and are eager to engage with the community. Connect with us on our [Slack channel](https://boltz-community.slack.com/archives/C0818M6DWH2) to discuss advancements, share insights, and foster collaboration around Boltz-1.
52+
53+
## Coming very soon
54+
55+
- [ ] Pocket conditioning support
56+
- [ ] More examples
57+
- [ ] Full data processing pipeline
58+
- [ ] Colab notebook for inference
59+
- [ ] Confidence model checkpoint
60+
- [ ] Support for custom paired MSA
61+
- [ ] Kernel integration
62+
63+
## License
64+
65+
Our model and code are released under the MIT License and can be freely used for both academic and commercial purposes.

docs/boltz1_pred_figure.png

1.72 MB

docs/prediction.md

+140
@@ -0,0 +1,140 @@
1+
# Prediction
2+
3+
Once you have installed `boltz`, you can start making predictions by simply running:
4+
5+
`boltz predict <INPUT_PATH>`
6+
7+
where `<INPUT_PATH>` is a path to the input file or a directory. The input file can either be in fasta (enough for most use cases) or YAML format (for more complex inputs). If you specify a directory, `boltz` will run predictions on each `.yaml` or `.fasta` file in the directory.
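
The directory behaviour described above can be sketched as follows. This is a minimal illustration, not the `boltz` implementation: `collect_inputs` is a hypothetical helper, and the non-recursive, case-sensitive matching is an assumption.

```python
from pathlib import Path

# Extensions named in the text above; case-sensitive matching is an assumption.
SUPPORTED_SUFFIXES = {".yaml", ".fasta"}


def collect_inputs(input_path):
    """Return the list of input files for a file path or a directory path."""
    path = Path(input_path)
    if path.is_dir():
        # Assumption: only direct children are scanned (no recursion).
        return sorted(
            p for p in path.iterdir()
            if p.is_file() and p.suffix in SUPPORTED_SUFFIXES
        )
    # A single file is passed through unchanged.
    return [path]
```

Files with any other extension in the directory are ignored by this sketch.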
8+
9+
Before diving into more details about the input formats, here are the key differences in what they each support:
10+
11+
| Feature | Fasta | YAML |
| -------- |--------------------| ------- |
| Polymers | :white_check_mark: | :white_check_mark: |
| Smiles | :white_check_mark: | :white_check_mark: |
| CCD code | :white_check_mark: | :white_check_mark: |
| Custom MSA | :white_check_mark: | :white_check_mark: |
| Modified Residues | :x: | :white_check_mark: |
| Covalent bonds | :x: | :white_check_mark: |
| Pocket conditioning | :x: | :white_check_mark: |
20+
21+
22+
23+
## Fasta format
24+
25+
The fasta format should contain entries as follows:
26+
27+
```
28+
>CHAIN_ID|ENTITY_TYPE|MSA_PATH
29+
SEQUENCE
30+
```
31+
32+
Where `CHAIN_ID` is a unique identifier for each input chain, `ENTITY_TYPE` can be one of `protein`, `dna`, `rna`, `smiles`, `ccd` and `MSA_PATH` is only specified for protein entities and is the path to the `.a3m` file containing a computed MSA for the sequence of the protein. Note that we support both smiles and CCD code for ligands.
33+
34+
For each of these cases, the corresponding `SEQUENCE` will contain an amino acid sequence (e.g. `EFKEAFSLF`), a sequence of nucleotide bases (e.g. `ATCG`), a smiles string (e.g. `CC1=CC=CC=C1`), or a CCD code (e.g. `ATP`), depending on the entity.
35+
36+
As an example:
37+
38+
```
39+
>A|protein|./examples/msa/seq1.a3m
40+
MVTPEGNVSLVDESLLVGVTDEDRAVRSAHQFYERLIGLWAPAVMEAAHELGVFAALAEAPADSGELARRLDCDARAMRVLLDALYAYDVIDRIHDTNGFRYLLSAEARECLLPGTLFSLVGKFMHDINVAWPAWRNLAEVVRHGARDTSGAESPNGIAQEDYESLVGGINFWAPPIVTTLSRKLRASGRSGDATASVLDVGCGTGLYSQLLLREFPRWTATGLDVERIATLANAQALRLGVEERFATRAGDFWRGGWGTGYDLVLFANIFHLQTPASAVRLMRHAAACLAPDGLVAVVDQIVDADREPKTPQDRFALLFAASMTNTGGGDAYTFQEYEEWFTAAGLQRIETLDTPMHRILLARRATEPSAVPEGQASENLYFQ
41+
>B|protein|./examples/msa/seq1.a3m
42+
MVTPEGNVSLVDESLLVGVTDEDRAVRSAHQFYERLIGLWAPAVMEAAHELGVFAALAEAPADSGELARRLDCDARAMRVLLDALYAYDVIDRIHDTNGFRYLLSAEARECLLPGTLFSLVGKFMHDINVAWPAWRNLAEVVRHGARDTSGAESPNGIAQEDYESLVGGINFWAPPIVTTLSRKLRASGRSGDATASVLDVGCGTGLYSQLLLREFPRWTATGLDVERIATLANAQALRLGVEERFATRAGDFWRGGWGTGYDLVLFANIFHLQTPASAVRLMRHAAACLAPDGLVAVVDQIVDADREPKTPQDRFALLFAASMTNTGGGDAYTFQEYEEWFTAAGLQRIETLDTPMHRILLARRATEPSAVPEGQASENLYFQ
43+
>C|ccd
44+
SAH
45+
>D|ccd
46+
SAH
47+
>E|smiles
48+
N[C@@H](Cc1ccc(O)cc1)C(=O)O
49+
>F|smiles
50+
N[C@@H](Cc1ccc(O)cc1)C(=O)O
51+
```
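
To make the header layout concrete, here is a minimal parsing sketch for the `>CHAIN_ID|ENTITY_TYPE|MSA_PATH` records described above. It is an illustration only and not the parser used by `boltz`; the function name `parse_fasta_entries` and the returned dictionary keys are assumptions.

```python
# Entity types listed in the format description above.
ENTITY_TYPES = {"protein", "dna", "rna", "smiles", "ccd"}


def parse_fasta_entries(text):
    """Parse `>CHAIN_ID|ENTITY_TYPE|MSA_PATH` records into a list of dicts."""
    entries = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split("|")
            # MSA_PATH is optional, so a header has two or three fields.
            if len(fields) not in (2, 3):
                raise ValueError(f"malformed header: {line!r}")
            chain_id, entity_type = fields[0], fields[1]
            if entity_type not in ENTITY_TYPES:
                raise ValueError(f"unknown entity type: {entity_type!r}")
            msa_path = fields[2] if len(fields) == 3 else None
            entries.append(
                {"id": chain_id, "type": entity_type, "msa": msa_path, "sequence": ""}
            )
        elif entries:
            # Sequence data may span several lines; join them.
            entries[-1]["sequence"] += line
        else:
            raise ValueError("sequence data found before the first header")
    return entries
```

For the example above, this would yield six entries: two `protein` chains sharing one MSA path, two `ccd` ligands and two `smiles` ligands.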
52+
53+
54+
## YAML format
55+
56+
The YAML format is more flexible and allows for more complex inputs, particularly around covalent bonds. The schema of the YAML is the following:
57+
58+
```yaml
59+
sequences:
    - ENTITY_TYPE:
        id: CHAIN_ID
        sequence: SEQUENCE    # only for protein, dna, rna
        smiles: SMILES        # only for ligand, exclusive with ccd
        ccd: CCD              # only for ligand, exclusive with smiles
        msa: MSA_PATH         # only for protein
        modifications:
          - position: RES_IDX   # index of residue, starting from 1
            ccd: CCD            # CCD code of the modified residue

    - ENTITY_TYPE:
        id: [CHAIN_ID, CHAIN_ID]    # multiple ids in case of multiple identical entities
        ...
constraints:
    - bond:
        atom1: [CHAIN_ID, RES_IDX, ATOM_NAME]
        atom2: [CHAIN_ID, RES_IDX, ATOM_NAME]
    - pocket:
        binder: CHAIN_ID
        contacts: [[CHAIN_ID, RES_IDX], [CHAIN_ID, RES_IDX]]
80+
```
81+
`sequences` has one entry for every unique chain/molecule in the input. Each polymer entity has an `ENTITY_TYPE` of `protein`, `dna` or `rna` and a `sequence` attribute. Non-polymer entities are indicated by `ENTITY_TYPE` equal to `ligand` and have a `smiles` or `ccd` attribute. `CHAIN_ID` is the unique identifier for each chain/molecule, and it should be set as a list in case of multiple identical entities in the structure. Protein entities should also contain an `msa` attribute with `MSA_PATH` indicating the path to the `.a3m` file containing a computed MSA for the sequence of the protein.
82+
83+
The `modifications` field is an optional field that allows you to specify modified residues in the polymer (`protein`, `dna` or `rna`). The `position` field specifies the index (starting from 1) of the residue, and `ccd` is the CCD code of the modified residue. This field is currently only supported for CCD ligands.
84+
85+
`constraints` is an optional field that allows you to specify additional information about the input structure. Currently, we support just `bond`. The `bond` constraint specifies a covalent bond between two atoms (`atom1` and `atom2`) and is currently only supported for CCD ligands and canonical residues. `CHAIN_ID` refers to the id of the chain set above, `RES_IDX` is the index (starting from 1) of the residue (1 for ligands), and `ATOM_NAME` is the standardized atom name (which can be verified in the CIF file of that component on the RCSB website).
86+
87+
As an example:
88+
89+
```yaml
90+
version: 1
sequences:
  - protein:
      id: [A, B]
      sequence: MVTPEGNVSLVDESLLVGVTDEDRAVRSAHQFYERLIGLWAPAVMEAAHELGVFAALAEAPADSGELARRLDCDARAMRVLLDALYAYDVIDRIHDTNGFRYLLSAEARECLLPGTLFSLVGKFMHDINVAWPAWRNLAEVVRHGARDTSGAESPNGIAQEDYESLVGGINFWAPPIVTTLSRKLRASGRSGDATASVLDVGCGTGLYSQLLLREFPRWTATGLDVERIATLANAQALRLGVEERFATRAGDFWRGGWGTGYDLVLFANIFHLQTPASAVRLMRHAAACLAPDGLVAVVDQIVDADREPKTPQDRFALLFAASMTNTGGGDAYTFQEYEEWFTAAGLQRIETLDTPMHRILLARRATEPSAVPEGQASENLYFQ
      msa: ./examples/msa/seq1.a3m
  - ligand:
      id: [C, D]
      ccd: SAH
  - ligand:
      id: [E, F]
      smiles: N[C@@H](Cc1ccc(O)cc1)C(=O)O
102+
```
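
The rules stated in the schema comments (a `sequence` for polymers, exactly one of `smiles` or `ccd` for ligands, `msa` only for proteins, unique chain ids) can be checked on an already-loaded document with a short sketch like the one below. This is a hypothetical pre-flight check, not something `boltz` provides; `check_sequences` and its messages are assumptions, and the input is a plain dict as a YAML loader would return it.

```python
POLYMER_TYPES = {"protein", "dna", "rna"}


def check_sequences(doc):
    """Return a list of problems found in a schema-shaped dict (empty if none)."""
    problems = []
    seen_ids = set()
    for item in doc.get("sequences", []):
        # Each list item is a one-key mapping: {ENTITY_TYPE: {...}}.
        entity_type, body = next(iter(item.items()))
        ids = body["id"] if isinstance(body["id"], list) else [body["id"]]
        for chain_id in ids:
            if chain_id in seen_ids:
                problems.append(f"duplicate chain id: {chain_id}")
            seen_ids.add(chain_id)
        if entity_type in POLYMER_TYPES:
            if "sequence" not in body:
                problems.append(f"{entity_type} {ids}: missing sequence")
        elif entity_type == "ligand":
            # smiles and ccd are mutually exclusive, and one is required.
            if ("smiles" in body) == ("ccd" in body):
                problems.append(f"ligand {ids}: give exactly one of smiles or ccd")
        else:
            problems.append(f"unknown entity type: {entity_type}")
        if "msa" in body and entity_type != "protein":
            problems.append(f"{entity_type} {ids}: msa is only for protein")
    return problems
```

Run on the example above, the sketch would report no problems; a ligand entry carrying both `smiles` and `ccd` would produce one.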
103+
104+
105+
## Options
106+
107+
The following options are available for the `predict` command:
108+
109+
`boltz predict [OPTIONS] input_path`
110+
111+
| **Option**                  | **Type**        | **Default**        | **Description**                                                                 |
|-----------------------------|-----------------|--------------------|---------------------------------------------------------------------------------|
| `--out_dir PATH`            | `PATH`          | `./`               | The path where to save the predictions.                                         |
| `--cache PATH`              | `PATH`          | `~/.boltz`         | The directory where to download the data and model.                             |
| `--checkpoint PATH`         | `PATH`          | None               | An optional checkpoint. Uses the provided Boltz-1 model by default.             |
| `--devices INTEGER`         | `INTEGER`       | `1`                | The number of devices to use for prediction.                                    |
| `--accelerator`             | `[gpu,cpu,tpu]` | `gpu`              | The accelerator to use for prediction.                                          |
| `--recycling_steps INTEGER` | `INTEGER`       | `3`                | The number of recycling steps to use for prediction.                            |
| `--sampling_steps INTEGER`  | `INTEGER`       | `200`              | The number of sampling steps to use for prediction.                             |
| `--diffusion_samples INTEGER` | `INTEGER`     | `1`                | The number of diffusion samples to use for prediction.                          |
| `--output_format`           | `[pdb,mmcif]`   | `mmcif`            | The output format to use for the predictions.                                   |
| `--num_workers INTEGER`     | `INTEGER`       | `2`                | The number of dataloader workers to use for prediction.                         |
| `--override`                | `FLAG`          | `False`            | Whether to override existing predictions if found.                              |
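
When scripting predictions, the options above can be assembled into an argument list. The sketch below is a hypothetical wrapper, not part of `boltz`: `build_predict_command` is an assumed name, it covers only a subset of the options, and its defaults simply mirror the table.

```python
def build_predict_command(
    input_path,
    out_dir="./",
    recycling_steps=3,
    sampling_steps=200,
    diffusion_samples=1,
    output_format="mmcif",
    override=False,
):
    """Build an argument list of the form `boltz predict [OPTIONS] input_path`."""
    if output_format not in ("pdb", "mmcif"):
        raise ValueError(f"unsupported output format: {output_format!r}")
    cmd = [
        "boltz", "predict",
        "--out_dir", str(out_dir),
        "--recycling_steps", str(recycling_steps),
        "--sampling_steps", str(sampling_steps),
        "--diffusion_samples", str(diffusion_samples),
        "--output_format", output_format,
    ]
    if override:
        # --override is a flag and takes no value.
        cmd.append("--override")
    cmd.append(str(input_path))
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` in an environment where `boltz` is installed.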
124+
125+
## Output
126+
127+
After running the model, the generated outputs are organized into the output directory following the structure below:
128+
```
129+
out_dir/
├── lightning_logs/                                            # Logs generated during training or evaluation
├── predictions/                                               # Contains the model's predictions
    ├── [input_file1]/
        ├── [input_file1]_model_0.cif                          # The predicted structure in CIF format
        ...
        └── [input_file1]_model_[diffusion_samples-1].cif      # The predicted structure in CIF format
    └── [input_file2]/
        ...
└── processed/                                                 # Processed data used during execution
139+
```
140+
The `predictions` folder contains one folder for each input file. Each of these folders contains `diffusion_samples` predictions saved in the chosen `output_format`. The `processed` folder contains the processed input files that are used by the model during inference.
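
As a rough guide for downstream scripts, the layout above can be turned into expected file paths. The sketch below rests on two assumptions that the text does not confirm: that `[input_file1]` is the input file name without its extension, and that the `pdb` format uses a `.pdb` extension. `expected_prediction_files` is a hypothetical helper.

```python
from pathlib import Path

# Assumption: mmcif outputs end in ".cif" (as in the tree above), pdb in ".pdb".
EXTENSIONS = {"mmcif": ".cif", "pdb": ".pdb"}


def expected_prediction_files(out_dir, input_file, diffusion_samples=1, output_format="mmcif"):
    """List the structure files expected for one input, following the tree above."""
    # Assumption: the per-input folder is named after the input file's stem.
    stem = Path(input_file).stem
    folder = Path(out_dir) / "predictions" / stem
    ext = EXTENSIONS[output_format]
    # Samples are numbered from 0 to diffusion_samples - 1.
    return [folder / f"{stem}_model_{i}{ext}" for i in range(diffusion_samples)]
```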

docs/training.md

+47
@@ -0,0 +1,47 @@
1+
# Training
2+
3+
## Download processed data
4+
5+
Instructions on how to download the processed dataset for training are coming soon. We are currently uploading the data to shareable storage and will update this page when ready.
6+
7+
## Modify the configuration file
8+
9+
The training script requires a configuration file to run. This file specifies the paths to the data, the output directory, and other parameters of the data, model and training process.
10+
11+
Under `scripts/train/configs`, we provide template configuration files analogous to the ones we used for training the structure model (`structure.yaml`) and the confidence model (`confidence.yaml`).
12+
13+
The following are the main parameters that you should modify in the configuration file to get the structure model to train:
14+
15+
```yaml
16+
trainer:
  devices: 1

output: SET_PATH_HERE                # Path to the output directory
resume: PATH_TO_CHECKPOINT_FILE      # Path to a checkpoint file to resume training from, or null if there is none

data:
  datasets:
    - _target_: boltz.data.module.training.DatasetConfig
      target_dir: PATH_TO_TARGETS_DIR       # Path to the directory containing the processed structure files
      msa_dir: PATH_TO_MSA_DIR              # Path to the directory containing the processed MSA files

  symmetries: PATH_TO_SYMMETRY_FILE      # Path to the file containing the molecule symmetry information
  max_tokens: 512                        # Maximum number of tokens in the input sequence
  max_atoms: 4608                        # Maximum number of atoms in the input structure
31+
```
32+
33+
`max_tokens` and `max_atoms` are the maximum number of tokens and atoms in the crop. Depending on the size of the GPUs you are using (as well as the training speed desired), you may want to adjust these values. Other recommended values are 256 and 2304, or 384 and 3456 respectively.
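
All three recommended pairings keep the same ratio of 9 atoms per token, as the short check below shows. Using that ratio for other crop sizes is an extrapolation on our part and not a documented recommendation; `atoms_for_tokens` is a hypothetical helper.

```python
# (max_tokens, max_atoms) pairs quoted in the text above.
RECOMMENDED_CROPS = [(256, 2304), (384, 3456), (512, 4608)]

for max_tokens, max_atoms in RECOMMENDED_CROPS:
    # Each pairing works out to the same atoms-per-token ratio.
    print(max_tokens, max_atoms, max_atoms // max_tokens)


def atoms_for_tokens(max_tokens, atoms_per_token=9):
    """Scale max_atoms with max_tokens at the ratio of the recommended values."""
    return max_tokens * atoms_per_token
```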
34+
35+
## Run the training script
36+
37+
Before running the full training, we recommend using the debug flag. This turns off DDP (sets a single device), sets `num_workers` to 0 so that everything runs in a single process, and disables wandb:
38+
39+
python scripts/train/train.py scripts/train/configs/structure.yaml debug=1
40+
41+
Once that seems to run okay, you can kill it and launch the training run:
42+
43+
python scripts/train/train.py scripts/train/configs/structure.yaml
44+
45+
We also provide a different configuration file to train the confidence model:
46+
47+
python scripts/train/train.py scripts/train/configs/confidence.yaml

examples/ligand.fasta

+12
@@ -0,0 +1,12 @@
1+
>A|protein|./examples/msa/seq1.a3m
2+
MVTPEGNVSLVDESLLVGVTDEDRAVRSAHQFYERLIGLWAPAVMEAAHELGVFAALAEAPADSGELARRLDCDARAMRVLLDALYAYDVIDRIHDTNGFRYLLSAEARECLLPGTLFSLVGKFMHDINVAWPAWRNLAEVVRHGARDTSGAESPNGIAQEDYESLVGGINFWAPPIVTTLSRKLRASGRSGDATASVLDVGCGTGLYSQLLLREFPRWTATGLDVERIATLANAQALRLGVEERFATRAGDFWRGGWGTGYDLVLFANIFHLQTPASAVRLMRHAAACLAPDGLVAVVDQIVDADREPKTPQDRFALLFAASMTNTGGGDAYTFQEYEEWFTAAGLQRIETLDTPMHRILLARRATEPSAVPEGQASENLYFQ
3+
>B|protein|./examples/msa/seq1.a3m
4+
MVTPEGNVSLVDESLLVGVTDEDRAVRSAHQFYERLIGLWAPAVMEAAHELGVFAALAEAPADSGELARRLDCDARAMRVLLDALYAYDVIDRIHDTNGFRYLLSAEARECLLPGTLFSLVGKFMHDINVAWPAWRNLAEVVRHGARDTSGAESPNGIAQEDYESLVGGINFWAPPIVTTLSRKLRASGRSGDATASVLDVGCGTGLYSQLLLREFPRWTATGLDVERIATLANAQALRLGVEERFATRAGDFWRGGWGTGYDLVLFANIFHLQTPASAVRLMRHAAACLAPDGLVAVVDQIVDADREPKTPQDRFALLFAASMTNTGGGDAYTFQEYEEWFTAAGLQRIETLDTPMHRILLARRATEPSAVPEGQASENLYFQ
5+
>C|ccd
6+
SAH
7+
>D|ccd
8+
SAH
9+
>E|smiles
10+
N[C@@H](Cc1ccc(O)cc1)C(=O)O
11+
>F|smiles
12+
N[C@@H](Cc1ccc(O)cc1)C(=O)O

examples/ligand.yaml

+12
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,12 @@
1+
version: 1  # Optional, defaults to 1
sequences:
  - protein:
      id: [A, B]
      sequence: MVTPEGNVSLVDESLLVGVTDEDRAVRSAHQFYERLIGLWAPAVMEAAHELGVFAALAEAPADSGELARRLDCDARAMRVLLDALYAYDVIDRIHDTNGFRYLLSAEARECLLPGTLFSLVGKFMHDINVAWPAWRNLAEVVRHGARDTSGAESPNGIAQEDYESLVGGINFWAPPIVTTLSRKLRASGRSGDATASVLDVGCGTGLYSQLLLREFPRWTATGLDVERIATLANAQALRLGVEERFATRAGDFWRGGWGTGYDLVLFANIFHLQTPASAVRLMRHAAACLAPDGLVAVVDQIVDADREPKTPQDRFALLFAASMTNTGGGDAYTFQEYEEWFTAAGLQRIETLDTPMHRILLARRATEPSAVPEGQASENLYFQ
      msa: ./examples/msa/seq1.a3m
  - ligand:
      id: [C, D]
      ccd: SAH
  - ligand:
      id: [E, F]
      smiles: N[C@@H](Cc1ccc(O)cc1)C(=O)O
