mitoclub/nemu-pipeline-nf

NeMu-pipeline

A Nextflow-based bioinformatics pipeline for sequence analysis and phylogenetic inference.

Workflow schematic representation

(Figure: pipeline workflow scheme)

Dependencies

All dependencies are specified in environment.yml and include:

  • Nextflow (with Java)
  • Python
  • IQtree
  • SeqKit
  • TaxonKit
  • BLAST+
  • MAFFT and MACSE
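
After installation, a quick check that the tools are on PATH can save a failed run. The following is a minimal sketch, not part of the pipeline: missing_tools is a hypothetical helper, and the executable names are assumptions based on common packaging (BLAST+ is checked via tblastn; IQ-TREE is left out because its executable name varies, for example iqtree or iqtree2), so adjust them to match environment.yml.

```shell
# Print every tool from the argument list that is not found on PATH.
# The names passed in are assumptions; adjust them to your environment.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing=$(missing_tools nextflow java python seqkit taxonkit tblastn mafft macse)
if [ -n "$missing" ]; then
    echo "Missing tools:"
    echo "$missing"
else
    echo "All listed tools found"
fi
```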

Installation

Conda

Clone the repository and install the dependencies with conda or mamba.

Conda and mamba can be installed by following their official installation instructions.

git clone https://github.com/mitoclub/nemu-pipeline-nf.git
cd nemu-pipeline-nf

conda env create -f environment.yml -n nemu --yes
# OR
# mamba env create -f environment.yml -n nemu --yes

conda activate nemu

Docker

Build the image to run the pipeline in a container. You can either run the entire pipeline, Nextflow included, inside the container, or use the container only as an isolated environment for the pipeline's tools and run Nextflow on the host (this requires manually installing Nextflow 25.10 with Java OpenJDK 17).

docker build -t nemu-pipeline:latest .
#docker run -it nemu-pipeline:latest nextflow run /app/main.nf ...

Usage

When the input is a nucleotide multi-FASTA file

Input sequences must be orthologous; they may be pre-aligned, in which case pass the --aligned parameter.

Several input FASTA files can be passed at once with a "*" glob pattern. The pattern must be enclosed in quotes so that it reaches Nextflow unexpanded.
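
To illustrate why the quotes matter: an unquoted pattern is expanded by the shell before Nextflow sees it, so Nextflow would receive several separate arguments instead of one pattern. The sketch below does not call Nextflow; count_args is a hypothetical helper and the files are empty placeholders in a scratch directory.

```shell
# count_args is a hypothetical helper that reports how many arguments it received.
count_args() { echo "$#"; }

# Create three empty FASTA files in a scratch directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/a.fasta" "$demo_dir/b.fasta" "$demo_dir/c.fasta"

# Unquoted: the shell expands the pattern into three separate arguments.
unquoted=$(count_args "$demo_dir"/*.fasta)

# Quoted: the pattern is passed through as a single literal argument,
# which is the form used for --input in the command that follows.
quoted=$(count_args "$demo_dir/*.fasta")

echo "unquoted: $unquoted, quoted: $quoted"
rm -r "$demo_dir"
```

This prints "unquoted: 3, quoted: 1".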

nextflow run main.nf \
  -o results_ecoli \
  -process.cpus=20 \
  --input-type nucleotide_coding \
  --input "sample_input_ecoli/*.fasta" \
  --gencode 1 \
  --outgroup-id outgroup
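
Before launching, it can help to confirm that each input file contains the outgroup ID and enough sequences. The following is a minimal sketch under stated assumptions: the outgroup ID is taken to be the first whitespace-delimited token of a header line, the threshold is the documented --min-seqs default of 4, and check_fasta is a hypothetical helper rather than part of the pipeline.

```shell
# check_fasta FILE OUTGROUP_ID [MIN_SEQS]
# Prints a message and returns 1 if the file has too few sequences
# or lacks the outgroup; otherwise prints an "ok" line.
check_fasta() {
    file=$1; outgroup=$2; min_seqs=${3:-4}
    # Count header lines, i.e. sequences (grep -c exits 1 when the count is 0).
    n=$(grep -c '^>' "$file" || true)
    if [ "$n" -lt "$min_seqs" ]; then
        echo "$file: only $n sequences (need $min_seqs)"
        return 1
    fi
    # Compare the first token of each header (without '>') to the outgroup ID.
    if ! awk -v id="$outgroup" '/^>/ { if (substr($1, 2) == id) found = 1 } END { exit !found }' "$file"; then
        echo "$file: outgroup '$outgroup' not found"
        return 1
    fi
    echo "$file: ok ($n sequences)"
}

for f in sample_input_ecoli/*.fasta; do
    # Skip the literal pattern if no file matched.
    [ -e "$f" ] || continue
    # Keep checking the remaining files even if one fails.
    check_fasta "$f" outgroup || true
done
```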

When the input is protein sequences

nextflow run main.nf \
  -o results_test \
  -process.cpus=20 \
  -resume \
  -with-trace \
  --input-type protein \
  --input "test_data/test_proteins.fasta" \
  --gencode 2 \
  --db path_to_nuc_database \
  --taxdump "$HOME/.taxonkit"
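
Two prerequisites of the protein mode can be checked up front: the header format ">ID [Species name]" described under Command line options, and the contents of the taxdump directory. The following is a minimal sketch: bad_headers and check_taxdump are hypothetical helpers, the header check assumes the header is exactly an ID, one space, and the bracketed species name, and the file list assumes the standard NCBI taxdump files that TaxonKit reads (names.dmp, nodes.dmp, delnodes.dmp, merged.dmp).

```shell
# Print headers that do not match ">ID [Species name]".
# In the pattern, the final "]" is a literal closing bracket.
bad_headers() {
    grep '^>' "$1" | grep -Ev '^>[^[:space:]]+ \[[^]]+]$' || true
}

# Report taxdump files missing from the given directory.
check_taxdump() {
    for f in names.dmp nodes.dmp delnodes.dmp merged.dmp; do
        [ -f "$1/$f" ] || echo "missing: $1/$f"
    done
}

# Paths below are the ones used in the example command above.
if [ -f test_data/test_proteins.fasta ]; then
    bad_headers test_data/test_proteins.fasta
fi
check_taxdump "$HOME/.taxonkit"
```

No output from either check means nothing was flagged.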

Recommendations for comparative-species analysis: --branch-spectra, --model, OUTGRP, consCatCutoff etc. TODO

Command line options

Options can be set either on the command line or by editing the nextflow.config file.
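
As an illustration of the config route, settings can also be kept in a small override file and passed with Nextflow's -c option, with command-line values taking precedence over the file. The following is a minimal sketch: the file name nemu_overrides.config is hypothetical, and only single-word parameters are shown because their config names follow directly from the flags (for example, --gencode sets params.gencode).

```shell
# Write a hypothetical override config; values mirror the documented options.
cat > nemu_overrides.config <<'EOF'
params {
    gencode = 2
    model   = 'GTR+FO+G4+I'
    plot    = true
}
EOF

# Launch with the override file; flags given on the command line still win:
#   nextflow run main.nf -c nemu_overrides.config \
#     --input "sample_input_ecoli/*.fasta" --outgroup-id outgroup
cat nemu_overrides.config
```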

Usage: nextflow run main.nf --input <input_fasta> [options]

TODO update to latest

Main options:
    --input FILE            Input FASTA file (required)
                            If input type is protein, a multi-FASTA with one or several
                            sequences is required (header format: ">ID [Species name]").
                            If input type is nucleotide, one or several FASTA files with
                            orthologous sequences (including the outgroup) are required
    --input-type STRING     Type of input sequences:
                            protein, nucleotide_coding, nucleotide_noncoding (default: nucleotide_coding)
    --gencode NUM           Genetic code table (default: 1)
                            Used for codon-aware alignment and annotation of mutations

Required options for nucleotide input:
    --outgroup-id STRING    Outgroup sequence ID (default: OUTGRP)
                            Specify outgroup sequence ID in the alignment for rooting the tree
    --aligned BOOL          Input sequences are pre-aligned (default: false)

Required options for protein input:
    --db PATH               BLAST database path (required for protein input)
    --taxdump DIR           Taxdump directory path (required for protein input)
                            TODO integrate to the container
    --species-name STRING   Override species name. Useful when you work with proteins 
                            from single species
    --max-target-seqs NUM   Max target sequences for BLAST (default: 2000)
                            tblastn will collect no more than this number of sequences

Nextflow options:
    -with-report FILE       Generate execution report
    -with-trace FILE        Generate execution trace
    -with-timeline FILE     Generate execution timeline
    -output-dir DIR         Specify output directory (default: ./results)

Common options:
    --threads NUM           Number of threads to use (default: 1) TODO delete if not needed
    --save-intermeds BOOL   Save intermediate files TODO implement
    --help                  Show this help message and exit

Options for MSA & Phylogeny:
    --msa-mode STRING       MSA mode: auto, macse, mafft_macse, mafft (default: auto)
    --min-seqs NUM          Minimum number of sequences to proceed with phylogenetic inference (default: 4)
    --treefile FILE         Input tree file (optional; default: build tree de novo)
    --model STRING          IQ-TREE substitution model (default: GTR+FO+G4+I)
    --model-asr STRING      ASR substitution model (default: GTR+FO+G4+I)
    --run-treeshrink BOOL   Run TreeShrink to prune long branches (default: true)

Options for Mutation Extraction:
    --cons-cat-cutoff NUM   Conservation category cutoff for mutation extraction (default: 0)
                            0 = no cutoff; only mutations in sites with rate category 
                            less than or equal to this value will be used
    --proba-arg BOOL        Use probabilities in mutation extraction (default: true)
    --uncertainty-coef BOOL Use phylogeny uncertainty coefficient in mutation extraction 
                            (default: true)

Options for Mutation Spectra Derivation:
    --plot BOOL             Generate barcharts of mutation spectra (default: true)
    --internal BOOL         Derive spectra for internal branches (default: false)
    --terminal BOOL         Derive spectra for terminal branches (default: false)
    --branch-spectra BOOL   Derive spectra for individual branches (default: false)

TODO

  • Add parsing of taxon IDs from FASTA headers (configurable option)
  • Add comprehensive tests for pipeline (Nextflow-based test files)
  • Run and validate tests
  • Finalize pipeline configuration
  • Prepare and optimize container image

About

Pipeline for accurate reconstruction of neutral mutation spectra from evolutionary data
