|
| 1 | +# Prediction |
| 2 | + |
| 3 | +Once you have installed `boltz`, you can start making predictions by simply running: |
| 4 | + |
| 5 | +`boltz predict <INPUT_PATH>` |
| 6 | + |
| 7 | +where `<INPUT_PATH>` is a path to the input file or a directory. The input file can either be in fasta (enough for most use cases) or YAML format (for more complex inputs). If you specify a directory, `boltz` will run predictions on each `.yaml` or `.fasta` file in the directory. |
| 8 | + |
| 9 | +Before diving into more details about the input formats, here are the key differences in what they each support: |
| 10 | + |
| 11 | +| Feature | Fasta | YAML | |
| 12 | +| -------- |--------------------| ------- | |
| 13 | +| Polymers | :white_check_mark: | :white_check_mark: | |
| 14 | +| Smiles | :white_check_mark: | :white_check_mark: | |
| 15 | +| CCD code | :white_check_mark: | :white_check_mark: | |
| 16 | +| Custom MSA | :white_check_mark: | :white_check_mark: | |
| 17 | +| Modified Residues | :x: | :white_check_mark: | |
| 18 | +| Covalent bonds | :x: | :white_check_mark: | |
| 19 | +| Pocket conditioning | :x: | :white_check_mark: | |
| 20 | + |
| 21 | + |
| 22 | + |
| 23 | +## Fasta format |
| 24 | + |
| 25 | +The fasta format should contain entries as follows: |
| 26 | + |
| 27 | +``` |
| 28 | +>CHAIN_ID|ENTITY_TYPE|MSA_PATH |
| 29 | +SEQUENCE |
| 30 | +``` |
| 31 | + |
| 32 | +Here, `CHAIN_ID` is a unique identifier for each input chain and `ENTITY_TYPE` is one of `protein`, `dna`, `rna`, `smiles` or `ccd`. `MSA_PATH` is specified only for protein entities and is the path to the `.a3m` file containing a precomputed MSA for the protein sequence. Ligands can be given either as a SMILES string or as a CCD code. |
| 33 | + |
| 34 | +For each of these cases, the corresponding `SEQUENCE` will contain an amino acid sequence (e.g. `EFKEAFSLF`), a sequence of nucleotide bases (e.g. `ATCG`), a smiles string (e.g. `CC1=CC=CC=C1`), or a CCD code (e.g. `ATP`), depending on the entity. |
| 35 | + |
| 36 | +As an example: |
| 37 | + |
| 38 | +``` |
| 39 | +>A|protein|./examples/msa/seq1.a3m |
| 40 | +MVTPEGNVSLVDESLLVGVTDEDRAVRSAHQFYERLIGLWAPAVMEAAHELGVFAALAEAPADSGELARRLDCDARAMRVLLDALYAYDVIDRIHDTNGFRYLLSAEARECLLPGTLFSLVGKFMHDINVAWPAWRNLAEVVRHGARDTSGAESPNGIAQEDYESLVGGINFWAPPIVTTLSRKLRASGRSGDATASVLDVGCGTGLYSQLLLREFPRWTATGLDVERIATLANAQALRLGVEERFATRAGDFWRGGWGTGYDLVLFANIFHLQTPASAVRLMRHAAACLAPDGLVAVVDQIVDADREPKTPQDRFALLFAASMTNTGGGDAYTFQEYEEWFTAAGLQRIETLDTPMHRILLARRATEPSAVPEGQASENLYFQ |
| 41 | +>B|protein|./examples/msa/seq1.a3m |
| 42 | +MVTPEGNVSLVDESLLVGVTDEDRAVRSAHQFYERLIGLWAPAVMEAAHELGVFAALAEAPADSGELARRLDCDARAMRVLLDALYAYDVIDRIHDTNGFRYLLSAEARECLLPGTLFSLVGKFMHDINVAWPAWRNLAEVVRHGARDTSGAESPNGIAQEDYESLVGGINFWAPPIVTTLSRKLRASGRSGDATASVLDVGCGTGLYSQLLLREFPRWTATGLDVERIATLANAQALRLGVEERFATRAGDFWRGGWGTGYDLVLFANIFHLQTPASAVRLMRHAAACLAPDGLVAVVDQIVDADREPKTPQDRFALLFAASMTNTGGGDAYTFQEYEEWFTAAGLQRIETLDTPMHRILLARRATEPSAVPEGQASENLYFQ |
| 43 | +>C|ccd |
| 44 | +SAH |
| 45 | +>D|ccd |
| 46 | +SAH |
| 47 | +>E|smiles |
| 48 | +N[C@@H](Cc1ccc(O)cc1)C(=O)O |
| 49 | +>F|smiles |
| 50 | +N[C@@H](Cc1ccc(O)cc1)C(=O)O |
| 51 | +``` |
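
To make the header layout concrete, here is a minimal Python sketch that splits such a file into its fields. It is not the parser `boltz` uses internally; the names `FastaEntry` and `parse_boltz_fasta` are hypothetical, and the rule that a non-protein entry with an `MSA_PATH` is rejected is an assumption based on the description above.

```python
from dataclasses import dataclass
from typing import List, Optional

# Entity types listed in the fasta section above.
ENTITY_TYPES = {"protein", "dna", "rna", "smiles", "ccd"}


@dataclass
class FastaEntry:
    chain_id: str
    entity_type: str
    msa_path: Optional[str]
    sequence: str


def parse_boltz_fasta(text: str) -> List[FastaEntry]:
    """Parse '>CHAIN_ID|ENTITY_TYPE|MSA_PATH' records (illustrative only)."""
    records = []  # list of (header, [sequence lines])
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append((line[1:], []))
        elif records:
            records[-1][1].append(line)
        else:
            raise ValueError("sequence data found before the first header")

    entries = []
    for header, seq_lines in records:
        fields = [f.strip() for f in header.split("|")]
        if len(fields) < 2:
            raise ValueError(f"header needs CHAIN_ID|ENTITY_TYPE: {header!r}")
        chain_id, entity_type = fields[0], fields[1]
        if entity_type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {entity_type!r}")
        msa_path = fields[2] if len(fields) > 2 and fields[2] else None
        # Assumption: an MSA path only makes sense for protein entries.
        if msa_path is not None and entity_type != "protein":
            raise ValueError(f"MSA_PATH given for non-protein chain {chain_id!r}")
        entries.append(FastaEntry(chain_id, entity_type, msa_path, "".join(seq_lines)))
    return entries
```

Running it on the example above would yield six entries, with `msa_path` set only for chains `A` and `B`.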
| 52 | + |
| 53 | + |
| 54 | +## YAML format |
| 55 | + |
| 56 | +The YAML format is more flexible and allows for more complex inputs, particularly around covalent bonds. The schema of the YAML is the following: |
| 57 | + |
| 58 | +```yaml |
| 59 | +sequences: |
| 60 | +  - ENTITY_TYPE: |
| 61 | +      id: CHAIN_ID |
| 62 | +      sequence: SEQUENCE    # only for protein, dna, rna |
| 63 | +      smiles: SMILES        # only for ligand, exclusive with ccd |
| 64 | +      ccd: CCD              # only for ligand, exclusive with smiles |
| 65 | +      msa: MSA_PATH         # only for protein |
| 66 | +      modifications: |
| 67 | +        - position: RES_IDX   # index of residue, starting from 1 |
| 68 | +          ccd: CCD            # CCD code of the modified residue |
| 69 | + |
| 70 | +  - ENTITY_TYPE: |
| 71 | +      id: [CHAIN_ID, CHAIN_ID]    # multiple ids in case of multiple identical entities |
| 72 | +      ... |
| 73 | +constraints: |
| 74 | +  - bond: |
| 75 | +      atom1: [CHAIN_ID, RES_IDX, ATOM_NAME] |
| 76 | +      atom2: [CHAIN_ID, RES_IDX, ATOM_NAME] |
| 77 | +  - pocket: |
| 78 | +      binder: CHAIN_ID |
| 79 | +      contacts: [[CHAIN_ID, RES_IDX], [CHAIN_ID, RES_IDX]] |
| 80 | +``` |
| 81 | +`sequences` has one entry for every unique chain/molecule in the input. Each polymer entity has an `ENTITY_TYPE` of `protein`, `dna` or `rna` and a `sequence` attribute. Non-polymer entities are indicated by an `ENTITY_TYPE` of `ligand` and have either a `smiles` or a `ccd` attribute. `CHAIN_ID` is the unique identifier for each chain/molecule, and it should be given as a list when the structure contains multiple identical entities. Protein entities should also contain an `msa` attribute, with `MSA_PATH` indicating the path to the `.a3m` file containing a precomputed MSA for the protein sequence. |
| 82 | + |
| 83 | +The `modifications` field is optional and lets you specify modified residues in a polymer (`protein`, `dna` or `rna`). The `position` field specifies the index (starting from 1) of the residue, and `ccd` is the CCD code of the modified residue. Modifications can currently only be specified by CCD code. |
| 84 | + |
| 85 | +`constraints` is an optional field that allows you to specify additional information about the input structure. The schema above lists two constraint types, `bond` and `pocket`. The `bond` constraint specifies a covalent bond between two atoms (`atom1` and `atom2`) and is currently only supported for CCD ligands and canonical residues. `CHAIN_ID` refers to the id of the chain set above, `RES_IDX` is the index (starting from 1) of the residue (1 for ligands), and `ATOM_NAME` is the standardized atom name (which can be verified in the CIF file of that component on the RCSB website). The `pocket` constraint takes a `binder` chain and a list of `contacts`, each given as a `[CHAIN_ID, RES_IDX]` pair. |
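
As an illustration of the indexing rules for `bond`, the following Python sketch builds one constraint entry and rejects indices below 1. The helper name `bond_constraint` is hypothetical and not part of `boltz`, and the chain ids and atom names in the usage note are placeholders chosen only to show the shape of the data.

```python
from typing import Dict, List, Tuple, Union

# (CHAIN_ID, RES_IDX, ATOM_NAME), as in the schema above.
Atom = Tuple[str, int, str]


def bond_constraint(atom1: Atom, atom2: Atom) -> Dict[str, Dict[str, List[Union[str, int]]]]:
    """Return a dict mirroring one `bond` entry of the `constraints` list."""

    def check(atom: Atom) -> List[Union[str, int]]:
        chain_id, res_idx, atom_name = atom
        if not chain_id or not atom_name:
            raise ValueError("chain id and atom name must be non-empty")
        # Residue indices start from 1; ligands use index 1.
        if not isinstance(res_idx, int) or res_idx < 1:
            raise ValueError(f"residue index must be an integer >= 1, got {res_idx!r}")
        return [chain_id, res_idx, atom_name]

    return {"bond": {"atom1": check(atom1), "atom2": check(atom2)}}
```

For example, `bond_constraint(("A", 10, "SG"), ("C", 1, "C1"))` describes a bond between atom `SG` of residue 10 in chain `A` and atom `C1` of the ligand in chain `C`. Both atom names here are hypothetical and would need to be checked against the component's CIF file.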
| 86 | + |
| 87 | +As an example: |
| 88 | + |
| 89 | +```yaml |
| 90 | +version: 1 |
| 91 | +sequences: |
| 92 | +  - protein: |
| 93 | +      id: [A, B] |
| 94 | +      sequence: MVTPEGNVSLVDESLLVGVTDEDRAVRSAHQFYERLIGLWAPAVMEAAHELGVFAALAEAPADSGELARRLDCDARAMRVLLDALYAYDVIDRIHDTNGFRYLLSAEARECLLPGTLFSLVGKFMHDINVAWPAWRNLAEVVRHGARDTSGAESPNGIAQEDYESLVGGINFWAPPIVTTLSRKLRASGRSGDATASVLDVGCGTGLYSQLLLREFPRWTATGLDVERIATLANAQALRLGVEERFATRAGDFWRGGWGTGYDLVLFANIFHLQTPASAVRLMRHAAACLAPDGLVAVVDQIVDADREPKTPQDRFALLFAASMTNTGGGDAYTFQEYEEWFTAAGLQRIETLDTPMHRILLARRATEPSAVPEGQASENLYFQ |
| 95 | +      msa: ./examples/msa/seq1.a3m |
| 96 | +  - ligand: |
| 97 | +      id: [C, D] |
| 98 | +      ccd: SAH |
| 99 | +  - ligand: |
| 100 | +      id: [E, F] |
| 101 | +      smiles: N[C@@H](Cc1ccc(O)cc1)C(=O)O |
| 102 | +``` |
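
When inputs are generated programmatically, a small helper can emit this layout. The sketch below is one possible approach, not part of `boltz`: the function name `render_boltz_yaml` and the dict keys `type` and `ids` are hypothetical. It uses plain string formatting to avoid a YAML library dependency and assumes the values contain no characters that need YAML escaping (SMILES strings are single-quoted as a precaution).

```python
from typing import Dict, List

POLYMER_TYPES = ("protein", "dna", "rna")


def render_boltz_yaml(entities: List[Dict]) -> str:
    """Render entities as YAML text following the schema described above."""
    lines = ["version: 1", "sequences:"]
    for ent in entities:
        kind = ent["type"]
        if kind in POLYMER_TYPES:
            if "sequence" not in ent:
                raise ValueError(f"{kind} entity needs a 'sequence'")
        elif kind == "ligand":
            # 'smiles' and 'ccd' are mutually exclusive.
            if ("smiles" in ent) == ("ccd" in ent):
                raise ValueError("ligand needs exactly one of 'smiles' or 'ccd'")
        else:
            raise ValueError(f"unknown entity type: {kind!r}")

        ids = ent["ids"]
        # A single id is written as a scalar, several ids as a list.
        id_text = ids[0] if len(ids) == 1 else "[" + ", ".join(ids) + "]"
        lines.append(f"  - {kind}:")
        lines.append(f"      id: {id_text}")
        for key in ("sequence", "smiles", "ccd", "msa"):
            if key in ent:
                value = f"'{ent[key]}'" if key == "smiles" else ent[key]
                lines.append(f"      {key}: {value}")
    return "\n".join(lines) + "\n"
```

Passing a protein entity with `ids` set to `["A", "B"]` reproduces the `id: [A, B]` form used for identical chains in the example above.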
| 103 | + |
| 104 | + |
| 105 | +## Options |
| 106 | + |
| 107 | +The following options are available for the `predict` command: |
| 108 | + |
| 109 | + boltz predict [OPTIONS] input_path |
| 110 | + |
| 111 | +| **Option** | **Type** | **Default** | **Description** | |
| 112 | +|-----------------------------|-----------------|--------------------|---------------------------------------------------------------------------------| |
| 113 | +| `--out_dir PATH` | `PATH` | `./` | The path where to save the predictions. | |
| 114 | +| `--cache PATH` | `PATH` | `~/.boltz` | The directory where to download the data and model. | |
| 115 | +| `--checkpoint PATH` | `PATH` | None | An optional checkpoint. Uses the provided Boltz-1 model by default. | |
| 116 | +| `--devices INTEGER` | `INTEGER` | `1` | The number of devices to use for prediction. | |
| 117 | +| `--accelerator` | `[gpu,cpu,tpu]` | `gpu` | The accelerator to use for prediction. | |
| 118 | +| `--recycling_steps INTEGER` | `INTEGER` | `3` | The number of recycling steps to use for prediction. | |
| 119 | +| `--sampling_steps INTEGER` | `INTEGER` | `200` | The number of sampling steps to use for prediction. | |
| 120 | +| `--diffusion_samples INTEGER` | `INTEGER` | `1` | The number of diffusion samples to use for prediction. | |
| 121 | +| `--output_format` | `[pdb,mmcif]` | `mmcif` | The output format to use for the predictions. | |
| 122 | +| `--num_workers INTEGER` | `INTEGER` | `2` | The number of dataloader workers to use for prediction. | |
| 123 | +| `--override` | `FLAG` | `False` | Whether to override existing predictions if found. | |
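
For scripted runs, the options can be assembled into an argument list before handing it to a process launcher. This is a sketch under stated assumptions: `build_predict_command` is a hypothetical helper, it uses only flags from the table above with their documented defaults, and it does not invoke `boltz` itself.

```python
import shlex
from typing import List


def build_predict_command(
    input_path: str,
    out_dir: str = "./",
    devices: int = 1,
    accelerator: str = "gpu",
    recycling_steps: int = 3,
    sampling_steps: int = 200,
    diffusion_samples: int = 1,
    output_format: str = "mmcif",
    override: bool = False,
) -> List[str]:
    """Build the argv list for `boltz predict [OPTIONS] input_path`."""
    if accelerator not in {"gpu", "cpu", "tpu"}:
        raise ValueError(f"unsupported accelerator: {accelerator!r}")
    if output_format not in {"pdb", "mmcif"}:
        raise ValueError(f"unsupported output format: {output_format!r}")

    cmd = [
        "boltz", "predict",
        "--out_dir", out_dir,
        "--devices", str(devices),
        "--accelerator", accelerator,
        "--recycling_steps", str(recycling_steps),
        "--sampling_steps", str(sampling_steps),
        "--diffusion_samples", str(diffusion_samples),
        "--output_format", output_format,
    ]
    if override:
        cmd.append("--override")  # flag option, takes no value
    cmd.append(input_path)
    return cmd


def as_shell_string(cmd: List[str]) -> str:
    """Quote the argv list for display or logging."""
    return shlex.join(cmd)
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where `boltz` is installed.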
| 124 | + |
| 125 | +## Output |
| 126 | + |
| 127 | +After running the model, the generated outputs are organized into the output directory following the structure below: |
| 128 | +``` |
| 129 | +out_dir/ |
| 130 | +├── lightning_logs/                                            # Logs generated during training or evaluation |
| 131 | +├── predictions/                                               # Contains the model's predictions |
| 132 | +    ├── [input_file1]/ |
| 133 | +        ├── [input_file1]_model_0.cif                          # The predicted structure in CIF format |
| 134 | +        ... |
| 135 | +        └── [input_file1]_model_[diffusion_samples-1].cif      # The predicted structure in CIF format |
| 136 | +    └── [input_file2]/ |
| 137 | +        ... |
| 138 | +└── processed/                                                 # Processed data used during execution |
| 139 | +``` |
| 140 | +The `predictions` folder contains one folder per input file. Each of these folders contains `diffusion_samples` predictions saved in the chosen `output_format`. The `processed` folder contains the processed input files that are used by the model during inference. |
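
A short Python sketch for collecting the predicted structures from this layout follows. The helper name `collect_predictions` is hypothetical, and the file naming is taken from the tree above; the `.cif` suffix is the only one shown there, so any other suffix passed in is an assumption.

```python
from pathlib import Path
from typing import Dict, List


def collect_predictions(out_dir: str, suffix: str = ".cif") -> Dict[str, List[Path]]:
    """Map each input name to its structure files, ordered by model index."""
    root = Path(out_dir) / "predictions"
    results: Dict[str, List[Path]] = {}
    if not root.is_dir():
        return results

    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        prefix = f"{folder.name}_model_"
        indexed = []
        for path in folder.iterdir():
            if not (path.is_file() and path.suffix == suffix):
                continue
            if not path.stem.startswith(prefix):
                continue
            index = path.stem[len(prefix):]
            if index.isdigit():
                indexed.append((int(index), path))
        # Sort numerically so that model_10 comes after model_2.
        results[folder.name] = [path for _, path in sorted(indexed)]
    return results
```

Calling `collect_predictions("out_dir")` returns an empty dict if no `predictions` folder exists yet.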