Skip to content

Ensemblify is a computational tool that generates ensembles of intrinsically disordered regions from AlphaFold or user-defined models. It provides an user-friendly way to generate, analyze and reweight (against experimental data) ensembles, making it accessible to all members of the scientific community, indepent of their programming expertise.

License

Notifications You must be signed in to change notification settings

CordeiroLab/ensemblify

 
 

Repository files navigation

Ensemblify: A Python package for generating ensembles of intrinsically disordered regions of AlphaFold or user defined models

💡 What is Ensemblify?

Ensemblify is a Python package that can generate protein conformational ensembles by sampling dihedral angle values from a three-residue fragment database and inserting them into flexible regions of a protein of interest (e.g. intrinsically disordered regions (IDRs)).

It supports both user-defined models and AlphaFold[1] predictions, using the predicted Local Distance Difference Test (pLDDT) and Predicted Aligned Error (PAE) confidence metrics to guide conformational sampling. Designed to enhance the study of IDRs, it allows flexible customization of sampling parameters and works with single or multi-chain proteins, offering a powerful tool for protein structure research. Ensemble analysis and reweighting with experimental data is also available through interactive graphical dashboards.

🧰 How do I install Ensemblify?

Step-by-step instructions for installing Ensemblify are available in the Installation section.

After installing Ensemblify, make sure to visit the Tripeptide Database section to learn where you can get the database files required for ensemble generation.

💻 How can I use Ensemblify?

Ensemblify can be used either as a Command Line Interface (CLI):

conda activate ensemblify_env
ensemblify [options]

or as a Python library inside a script or Jupyter notebook:

import ensemblify as ey
ey.do_cool_stuff()

Check the Usage section for more details.

You can also check out the Quick Reference Guide notebook for a basic rundown of Ensemblify's features.

🔎 How does Ensemblify work?

A general overview of Ensemblify, descriptions of employed methods and applications can be found in the Ensemblify paper:

PAPER

🧰 Installation

1. Ensemblify Python Package

It is heavily recommended to install the ensemblify Python package in a dedicated virtual environment.

You can create a new virtual environment using your favorite virtual environment manager. Examples shown will use conda. If you want to download conda you can do so through their website. We recommend miniconda, a free minimal installer for conda.

To install the ensemblify Python package, you can follow these commands:

  1. Get the ensemblify source code. To do this you:

    1.1. Install Git if you haven't already:

    • On Linux: sudo apt-get install git
    • On macOS: brew install git # using Homebrew

    1.2. Clone this repository and cd into it:

    git clone https://github.com/npfernandes/ensemblify.git
    cd ensemblify
  2. Create your ensemblify_env conda environment with all of Ensemblify's Python dependencies:

    Using the provided environment file (recommended):

    conda env create -f environment_Linux.yml # or environment_macOS.yml, for macOS users
    conda activate ensemblify_env
    Creating the environment and manually installing the Python dependencies (not recommended):
    conda create --channel=conda-forge --name ensemblify_env python=3.10 MDAnalysis=2.6.1 mdtraj=1.9.9 numpy=1.26.4 pandas=2.2.2 pyarrow=13.0.0 scikit-learn=1.4.2 scipy=1.12.0 tqdm=4.66.2
    conda activate ensemblify_env
    pip install biopython==1.81 plotly==5.23.0 pyyaml==6.0.1 "ray[default]"==2.33.0
  3. Install the ensemblify python package into your newly created environment.

    pip install .

2. Third Party Software

Each of Ensemblify's modules has different dependencies to third party software, so if you only plan on using a certain module you do not have to install software required for others. The requirements are:

PyRosetta

PyRosetta[2] is a Python-based interface to the powerful Rosetta molecular modeling suite. Its functionalities are used through Ensemblify in order to generate conformational ensembles. You can install it by following these commands:

  1. Activate your ensemblify_env conda environment:

    conda activate ensemblify_env

    If you have not yet created it, check the Ensemblify Python Package section.

  2. Install the pyrosetta-installer Python package, kindly provided by RosettaCommons, to aid in the pyrosetta installation:

    pip install pyrosetta-installer 
  3. Use pyrosetta-installer to download (~ 1.6 GB) and install pyrosetta (note the distributed and serialization parameters):

    python -c 'import pyrosetta_installer; pyrosetta_installer.install_pyrosetta(distributed=True,serialization=True)'
  4. To test your pyrosetta installation, you can type in a terminal:

    python -c 'import pyrosetta.distributed; pyrosetta.distributed.init()'

If this step does not produce a complaint or error, your installation has been successful.

Remember to re-activate the ensemblify_env conda environment each time you wish to run code that uses pyrosetta.

FASPR

FASPR[3] is an ultra-fast and accurate program for deterministic protein sidechain packing. To compile the provided FASPR source-code, you can follow these commands:

  1. Activate your ensemblify_env conda environment:

    conda activate ensemblify_env

    If you have not yet created it, check the Ensemblify Python Package section.

  2. Navigate to where the FASPR source code is located:

    cd src/ensemblify/third_party/FASPR-master/ # assuming the cloned repository is your current working directory
  3. Compile the FASPR source code:

    For Linux users:

    g++ -O3 --fast-math -o FASPR src/*.cpp

    For macOS users:

    g++ -03 -fast-math -o FASPR src/*.cpp # if you get an error, remove -fast-math
  4. Add an environment variable with the path to your FASPR executable to your ensemblify_env conda environment:

    conda env config vars set FASPR_PATH=$(realpath FASPR)
    conda deactivate
    conda activate ensemblify_env
    echo $FASPR_PATH # to check if the variable has been set correctly

    this will allow Ensemblify to know where your FASPR executable is located.

PULCHRA

PULCHRA[4] (PowerfUL CHain Restoration Algorithm) is a program for reconstructing full-atom protein models from reduced representations. To compile the provided PULCHRA modified source-code, you can follow these commands:

  1. Activate your ensemblify_env conda environment:

    conda activate ensemblify_env

    If you have not yet created it, check the Ensemblify Python Package section.

  2. Navigate to where the PULCHRA source code is located:

    cd src/ensemblify/third_party/pulchra-master/ # assuming the cloned repository is your current working directory
  3. Compile the PULCHRA source code:

    cc -O3 -o pulchra pulchra_CHANGED.c pulchra_data.c -lm

    Do not be alarmed if some warnings show up on your screen; this is normal and they can be ignored.

  4. Add an environment variable with the path to your PULCHRA executable to your ensemblify_env conda environment:

    conda env config vars set PULCHRA_PATH=$(realpath pulchra)
    conda deactivate
    conda activate ensemblify_env
    echo $PULCHRA_PATH # to check if the variable has been set correctly

    this will allow Ensemblify to know where your PULCHRA executable is located.

GROMACS

GROMACS[5] is a molecular dynamics package mainly designed for simulations of proteins, lipids, and nucleic acids. It comes with a large selection of flexible tools for trajectory analysis and the output formats are also supported by all major analysis and visualisation packages.

To download and compile the GROMACS source code from their website you can follow these commands:

  1. Create and navigate into your desired GROMACS installation directory, for example:

    mkdir -p ~/software/GROMACS
    cd ~/software/GROMACS
  2. Download the GROMACS source code from their website:

    wget -O gromacs-2024.2.tar.gz https://zenodo.org/records/11148655/files/gromacs-2024.2.tar.gz?download=1
  3. Follow the GROMACS installation instructions to compile the GROMACS source code (this could take a while):

    tar xfz gromacs-2024.2.tar.gz
    cd gromacs-2024.2
    mkdir build
    cd build
    cmake .. -DGMX_BUILD_OWN_FFTW=ON -DREGRESSIONTEST_DOWNLOAD=ON
    make -j $(nproc)
    make check
    sudo make install
    source /usr/local/gromacs/bin/GMXRC

    Environment variables that will allow Ensemblify to know where GROMACS is located will have already been added to your shell configuration file.

PEPSI-SAXS

Pepsi-SAXS[6] (Polynomial Expansions of Protein Structures and Interactions - SAXS) is an adaptive method for rapid and accurate computation of small-angle X-ray scattering (SAXS) profiles from atomistic protein models.

To download the Pepsi-SAXS executable from their website you can follow these commands:

  1. Create and navigate into your desired Pepsi-SAXS installation directory, for example:

    mkdir -p ~/software/Pepsi-SAXS/
    cd ~/software/Pepsi-SAXS/
  2. Download and extract the Pepsi-SAXS Linux executable:

    For Linux users:

    wget -O Pepsi-SAXS-Linux.zip https://files.inria.fr/NanoDFiles/Website/Software/Pepsi-SAXS/Linux/3.0/Pepsi-SAXS-Linux.zip
    unzip Pepsi-SAXS-Linux.zip

    For macOS users:

    curl -O Pepsi-SAXS-MacOS.zip https://files.inria.fr/NanoDFiles/Website/Software/Pepsi-SAXS/MacOS/2.6/Pepsi-SAXS.zip
    unzip Pepsi-SAXS-MacOS.zip
  3. Add an environment variable with the path to your Pepsi-SAXS executable to your ensemblify_env conda environment:

    conda activate ensemblify_env
    conda env config vars set PEPSI_SAXS_PATH=$(realpath Pepsi-SAXS)
    conda deactivate
    conda activate ensemblify_env
    echo $PEPSI_SAXS_PATH # to check if the variable has been set correctly

    this will allow Ensemblify to know where your Pepsi-SAXS executable is located.

BIFT

Bayesian indirect Fourier transformation (BIFT) of small-angle experimental data allows for an estimation of parameters that describe the data[7]. Larsen et al. show in [8] that BIFT can identify whether the experimental error in small-angle scattering data is over or underestimated. Here we use their implementation of this method to make this determination and scale the error values accordingly.

To compile the provided BIFT source code, you can follow these commands:

  1. Activate your ensemblify_env conda environment:

    conda activate ensemblify_env

    If you have not yet created it, check the Ensemblify Python Package section.

  2. Navigate to where the BIFT source code is located:

    cd src/ensemblify/third_party/BIFT/ # assuming the cloned repository is your current working directory
  3. Compile the BIFT source code:

    gfortran -march=native -O3 bift.f -o bift

    the -march=native flag may be replaced with -m64 or -m32, and it may be necessary to include the -static flag depending on which system you are on.

  4. Add an environment variable with the path to your BIFT executable to your ensemblify_env conda environment:

    conda env config vars set BIFT_PATH=$(realpath bift)
    conda deactivate
    conda activate ensemblify_env
    echo $BIFT_PATH # to check if the variable has been set correctly

    this will allow Ensemblify to know where your BIFT executable is located.

Do not forget to visit the Tripeptide Database section to learn where you can get the database files that are required for conformational ensemble generation.

🗃 Tripeptide Database

Ensemblify provides a three-residue fragment (tripeptide) database from which to sample dihedral angles, found here [link].

This database was originally created and published by González-Delgado et al. and, as described in [9], it was built by extracting dihedral angles from structures taken from the SCOPe[10] [11] 2.07 release, a curated database of high-resolution experimentally determined protein structures. In total, 6,740,433 tripeptide dihedral angle values were extracted, making up the all dataset. A structurally filtered dataset, coil, was generated by removing tripeptides contained in α-helices or β-strands, reducing the number of tripeptide dihedral angle values to 3,141,877.

Using your own database

Ensemblify can sample dihedral angles from any file in a supported format (currently .parquet, .pkl or .csv), structured according to Database Structure. Tripeptide sampling mode will only work if a tripeptide database is provided. However, single residue sampling mode will work even when you provide a tripeptide database.

Database Structure

Tripeptide Database

Your database must contain at least 10 columns: 9 containing the Phi, Psi and Omega angles for each residue of the triplet (in radians) and 1 with the string identification of the fragment they make up. Any additional columns will be ignored.

FRAG OMG1 PHI1 PSI1 OMG2 PHI2 PSI2 OMG3 PHI3 PSI3
AAA 3.136433 -1.696219 1.100253 -3.140388 -2.765840 2.675006 3.140606 -2.006085 2.063136
... ... ... ... ... ... ... ... ... ...
VYV -3.135116 -2.503945 -0.949731 -3.119968 1.407456 1.979130 -3.112883 -2.592680 2.573798

Single Residue Database

Your database must contain at least 4 columns: 3 containing the Phi, Psi and Omega angles for each residue (in radians) and 1 with the string identification of the residue. Any additional columns will be ignored. Note the '2' suffix in the column names which helps with compatibility between single residue and tripeptide sampling modes.

FRAG OMG2 PHI2 PSI2
A -3.140388 -2.765840 2.675006
... ... ... ...
Y -3.119968 1.407456 1.979130

💻 Usage

Ensemblify offers four main modules, all of which can be accessed either through the command line or from inside a Python script/Jupyter Notebook.

The generation module

With the generation module, you can generate conformational ensembles for your protein of interest.

Before generating an ensemble, you must create a parameters file either:

Check the parameters file setup section for more details.

To generate an ensemble, provide Ensemblify with the path to your parameters file.

Using the ensemblify command in a terminal:

ensemblify generation -p parameters_file.yaml

Inside a Python script or Jupyter Notebook:

from ensemblify.generation import generate_ensemble
generate_ensemble('parameters_file.yaml')

Check the generationmodule documentation for more detailed usage examples.

📝 Setting up your parameters file

An .html form is provided to aid you in building your parameters file.

Parameters Form Preview

alt text

If you prefer to create your own parameters file from scratch, a template file is also provided.

The conversion module

With the conversion module, you can convert your generated .pdb structures into a .xtc trajectory file, enabling you to easily store and analyze your conformational ensemble.

To do this, provide:

  • the name for your created trajectory;
  • the directory where the ensemble is stored;
  • the directory where the trajectory file should be created.

Using the ensemblify command in a terminal:

ensemblify conversion -j trajectory_name -e ensemble_dir -t trajectory_dir

Inside a Python script or Jupyter Notebook:

from ensemblify.conversion import ensemble2traj
ensemble2traj('trajectory_name','ensemble_dir','trajectory_dir')

Check the conversionmodule documentation for more detailed usage examples.

The analysis module

With the analysis module, you can create an interactive graphical dashboard displaying structural information calculated from the conformational ensemble of your protein of interest.

To do this, provide:

  • your ensemble in trajectory format;
  • your trajectory's corresponding topology file;
  • the name you want to use for your protein in the graphical dashboard.

Using the ensemblify command in a terminal:

ensemblify analysis -trj trajectory.xtc -top topology.pdb -tid trajectory_name

Inside a Python script or Jupyter Notebook:

from ensemblify.analysis import analyze_trajectory
analyze_trajectory('trajectory.xtc','topology.pdb','trajectory_name')

Check the analysismodule documentation for more detailed usage examples.

The reweighting module

With the reweighting module, you can use experimental SAXS data to reweigh your conformational ensemble following the Bayesian Maximum Entropy method [12].

To do this, provide:

  • your ensemble in trajectory format;
  • your trajectory's corresponding topology file;
  • the name you want to use for your protein in the graphical dashboard;
  • the experimental SAXS data of your protein.

Using the ensemblify command in a terminal:

ensemblify reweighting -trj trajectory.xtc -top topology.pdb -tid trajectory_name -exp exp_SAXS_data.dat

Inside a Python script or Jupyter Notebook:

from ensemblify.reweighting import reweight_ensemble
reweight_ensemble('trajectory.xtc','topology.pdb','trajectory_name','exp_SAXS_data.dat')

Check the reweightingmodule documentation for more detailed usage examples.

📚 Accessing Documentation

Ensemblify's documentation is available together with an API reference at https://ensemblify.readthedocs.io. Alternatively, the source-code contains docstrings with relevant information.

🗨️ Citation and Publications

If you use Ensemblify, please cite its original publication:

PUB

🤝 Acknowledgements

We would like to thank the DeepMind team for developing AlphaFold.

We would also like to thank the team at the Juan Cortés lab in the LAAS-CNRS institute for creating the tripeptide database used in the development of this tool. Check out their work at https://moma.laas.fr/.

✍️ Authors

Nuno P. Fernandes (Main Developer) [GitHub]

Tiago Lopes Gomes (Initial prototyping, Supervisor) [GitHub]

Tiago N. Cordeiro (Supervisor) [GitHub]

📖 References

[1] J. Jumper, R. Evans, A. Pritzel et al., "Highly accurate protein structure prediction with AlphaFold," Nature, vol. 596, pp. 583–589, 2021. [Link]

[2] S. Chaudhury, S. Lyskov and J. J. Gray, "PyRosetta: a script-based interface for implementing molecular modeling algorithms using Rosetta," Bioinformatics, vol. 26, no. 5, pp. 689-691, Mar. 2010 [Link]

[3] X. Huang, R. Pearce and Y. Zhang, "FASPR: an open-source tool for fast and accurate protein side-chain packing," Bioinformatics, vol. 36, no. 12, pp. 3758-3765, Jun. 2020 [Link]

[4] P. Rotkiewicz and J. Skolnick, "Fast procedure for reconstruction of full-atom protein models from reduced representations," Journal of Computational Chemistry, vol. 29, no. 9, pp. 1460-1465, Jul. 2008 [Link]

[5] S. Pronk, S. Páll, R. Schulz, P. Larsson, P. Bjelkmar, R. Apostolov, M.R. Shirts, and J.C. Smith et al., “GROMACS 4.5: A high-throughput and highly parallel open source molecular simulation toolkit,” Bioinformatics, vol. 29, no. 7, pp. 845–854, 2013 [Link].

[6] S. Grudinin, M. Garkavenko and A. Kazennov, "Pepsi-SAXS: an adaptive method for rapid and accurate computation of small-angle X-ray scattering profiles," Structural Biology, vol. 73, no. 5, pp. 449-464, May 2017 [Link]

[7] B. Vestergaard and S. Hansen, "Application of Bayesian analysis to indirect Fourier transformation in small-angle scattering," Journal of Applied Crystallography, vol. 39, no. 6, pp. 797-804, Dec. 2006 [Link]

[8] A. H. Larsen and M. C. Pedersen, "Experimental noise in small-angle scattering can be assessed using the Bayesian indirect Fourier transformation," Journal of Applied Crystallography, vol. 54, no. 5, pp. 1281-1289, Oct. 2021 [Link]

[9] J. González-Delgado , P. Bernadó , P. Neuvial and J. Cortés, "Statistical proofs of the interdependence between nearest neighbor effects on polypeptide backbone conformations," Journal of Structural Biology, vol. 214, no. 4, p. 107907, Dec. 2022 [Link]

[10] N. K. Fox, S. E. Brenner and J. M. Chandonia, "SCOPe: Structural Classification of Proteins—extended, integrating SCOP and ASTRAL data and classification of new structures," Nucleic Acids Research, vol. 42, no. D1, pp. D304-D309, Jan. 2014 [Link]

[11] J. M. Chandonia, N. K. Fox and S. E. Brenner, "SCOPe: classification of large macromolecular structures in the structural classification of proteins—extended database," Nucleic Acids Research, vol. 47, no. D1, pp. D475–D481, Jan. 2019 [Link]

[12] S. Bottaro , T. Bengsten and K. Lindorff-Larsen, "Integrating Molecular Simulation and Experimental Data: A Bayesian/Maximum Entropy Reweighting Approach," pp. 219-240, Feb. 2020. In: Z. Gáspári, (eds) Structural Bioinformatics, Methods in Molecular Biology, vol. 2112, Humana, New York, NY. [Link]

About

Ensemblify is a computational tool that generates ensembles of intrinsically disordered regions from AlphaFold or user-defined models. It provides an user-friendly way to generate, analyze and reweight (against experimental data) ensembles, making it accessible to all members of the scientific community, indepent of their programming expertise.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%