Update snakemake_assemble to be useful lab-wide #1

@tanaes

Background

I've been using snakemake workflows for processing of shotgun data. I've found them to be extremely useful for fast, robust, and repeatable processing and analysis. In particular, these workflows have been useful for rapidly integrating and comparing new or alternative tools.

However, my initial attempts -- snakemake_shotqual and snakemake_anvio -- didn't take full advantage of Snakemake's useful features, and needed to be made more modular and updated for better testing and usability.

This repository is my initial attempt at modularizing and organizing a snakemake-based shotgun metagenomics toolset so that it can be more useful across a larger number of datasets, and perhaps form a more useful starting point for both production-level data analysis and integration of shotgun analysis steps into other tools (such as Qiita and Qiime2).

General concepts

I've set up the steps of the overall processing pipeline in a set of modular snakefiles, located in ./bin/snakefiles. These individual modules are each included in the parent ./Snakefile, which also imports global variables from a configuration file (config.yaml) and specifies a single, canonical end-to-end workflow in the top-level all rule.
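
For illustration, here's roughly what that layout looks like. The module names match the structure described above, but the config keys and the targets requested by all are placeholder guesses rather than the repo's actual contents.

# ./Snakefile (top level) -- minimal sketch; config key names are assumptions
configfile: "config.yaml"

samples = config["samples"]                          # hypothetical key listing sample IDs
qc_dir = config["output_dir"] + "/qc/"               # hypothetical output layout
assemble_dir = config["output_dir"] + "/assemble/"

# each processing module lives in its own snakefile under ./bin/snakefiles
include: "bin/snakefiles/qc"
include: "bin/snakefiles/assemble"

# the canonical end-to-end workflow: "all" simply requests the endpoints
# declared by each module's top-level rule
rule all:
    input:
        rules.qc.input,
        rules.assemble.input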

Within each of the individual modules -- for example, qc or assemble -- there is a top-level rule that specifies the endpoints of that particular analysis module. For qc, that means ending with trimmed and host-filtered fastq files, plus a MultiQC report summarizing the results of those steps:

rule qc:
    input:
        expand(qc_dir + "{sample}/skewer_trimmed/{sample}.trimmed.R1.fastq.gz", sample=samples),
        expand(qc_dir + "{sample}/skewer_trimmed/{sample}.trimmed.R2.fastq.gz", sample=samples),
        expand(qc_dir + "{sample}/filtered/{sample}.R1.trimmed.filtered.fastq.gz", sample=samples),
        expand(qc_dir + "{sample}/filtered/{sample}.R2.trimmed.filtered.fastq.gz", sample=samples),
        qc_dir + "multiQC_per_sample/multiqc_report.html"

For assemble, it means ending with an assembly from each specified assembler, plus Quast and MetaQuast reports, for each of the samples chosen for assembly:

rule assemble:
    input:
        expand(assemble_dir + "{sample}/{assembler}/{sample}.contigs.fa",
               sample=samples, assembler=config['assemblers']),
        expand(assemble_dir + "{sample}/metaquast.tar.gz",
               sample=samples),
        expand(assemble_dir + "{sample}/quast.tar.gz",
               sample=samples)

Current features

Right now, the workflow is set up for performing QC, assembly, binning, and binning visualization with Anvi'o. I have also been working on adding functional profiling with HUMAnN2 and taxonomic profiling with MetaPhlAn2, Centrifuge, Kraken, and Shogun.

Next steps

Several additional steps need to be taken before the codebase is generally useful lab-wide. Here's what I think needs to happen, and where I could use your help.

Installation

Right now, I've included a requirements.yaml file and an install script that should set up most of the required tools in a local conda environment for execution. However, a recent update to Snakemake allows rule-specific conda environments to be specified and created automatically, which I think will vastly improve the usability and portability of the pipeline (a rough sketch follows the list below).

  • create conda requirements files for rules (per module?)
  • update snakefiles to invoke environments
  • update problematic conda recipes (having trouble with Anvi'o, MaxBin2, and Quast)
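
To make the rule-specific environments concrete, here's a rough sketch assuming one environment file per module; the file location, channels, and package list are placeholders rather than the repo's actual requirements.

# bin/envs/qc.yaml -- hypothetical per-module environment file
channels:
  - bioconda
  - conda-forge
dependencies:
  - skewer
  - multiqc

A rule then points at that file with the conda directive; running snakemake with --use-conda builds and activates the environment automatically before the rule's command executes:

# hypothetical excerpt of a qc rule using the conda directive; the raw input
# paths and the trimming options are placeholders
rule skewer_trim:
    input:
        r1 = raw_dir + "{sample}.R1.fastq.gz",
        r2 = raw_dir + "{sample}.R2.fastq.gz"
    output:
        r1 = qc_dir + "{sample}/skewer_trimmed/{sample}.trimmed.R1.fastq.gz",
        r2 = qc_dir + "{sample}/skewer_trimmed/{sample}.trimmed.R2.fastq.gz"
    conda:
        "../envs/qc.yaml"   # resolved relative to the snakefile defining the rule
    shell:
        "skewer ... {input.r1} {input.r2}"   # actual trimming options elided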

Configuration

All of the project-specific information is encoded in a config.yaml file. This needs to be better documented, and it would be good to have some sort of generic helper tool (previously I was using an ipynb) to generate it for a dataset. A rough sketch of what the file might contain follows the list below.

  • clean up config.yaml specification
  • create helper function for config creation. Better ipynb?
  • add alternative launch scripts for running on different HPC environments, e.g. Comet/SLURM or local execution
  • install critical databases for use on Barnacle, preferably on local storage per-node to minimize filesystem IO
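
For reference, a config.yaml for a small dataset might look roughly like this; the key names, paths, and values are illustrative guesses, not the current specification.

# config.yaml -- hypothetical sketch; keys and paths are placeholders
output_dir: ./output
host_db: /databases/host/bowtie2_index   # host-filtering reference (placeholder)

assemblers:
  - megahit
  - metaspades

samples:
  sample_A:
    forward: /path/to/sample_A.R1.fastq.gz
    reverse: /path/to/sample_A.R2.fastq.gz
  sample_B:
    forward: /path/to/sample_B.R1.fastq.gz
    reverse: /path/to/sample_B.R2.fastq.gz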

Stability and testing

There are currently a few failure modes that could be caught more gracefully, and it would be useful to integrate automated testing.

  • add mechanism for more gracefully tolerating failed or low-coverage samples. In config prep ipynb?
  • add unit testing for python sub-rules
  • add unit or endpoint testing for rules
  • add Travis build instructions and testing for GitHub-integrated testing (see the sketch after this list)
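
As a starting point for the Travis item above, a .travis.yml along these lines could bootstrap a conda environment and run the workflow against a tiny test dataset; everything here, especially the test config path and targets, is a rough sketch rather than a working setup.

# .travis.yml -- hypothetical sketch; paths and targets are placeholders
language: python
python:
  - "3.5"
install:
  - wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
  - bash miniconda.sh -b -p "$HOME/miniconda"
  - export PATH="$HOME/miniconda/bin:$PATH"
  - conda env create -n test-env -f requirements.yaml
  - source activate test-env
script:
  # dry run to check that the DAG builds, then run a single module on test data
  - snakemake -n --configfile tests/test_config.yaml
  - snakemake --use-conda -j 2 --configfile tests/test_config.yaml qc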

Feature improvements

A couple of additional features would greatly facilitate the push-button utility of the workflow. Right now I think the major needs have to do with generating data products that can be plugged into downstream applications.

  • complete addition of taxonomy profilers
  • add taxonomy annotation of assembly bins
  • add QIIME-compatible BIOM outputs for relevant rules (e.g. humann2 and taxonomy steps); a rough sketch of one approach follows this list
  • add better binning visualizations (e.g. ICoVer)
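
On the BIOM output point, the biom-format Python package makes the conversion fairly mechanical once a profiler's table has been parsed; the sketch below assumes a taxon-by-sample abundance table already loaded into a pandas DataFrame (e.g. from merged MetaPhlAn2 outputs) and is only illustrative.

# hypothetical helper for writing a QIIME-compatible BIOM table; the parsing
# and merging of profiler output is assumed to have happened upstream
from biom.table import Table
from biom.util import biom_open

def write_biom(abundances, out_fp, generated_by="snakemake_assemble"):
    """abundances: pandas DataFrame with taxa as rows and samples as columns."""
    # positional args: data matrix, observation (taxon) IDs, sample IDs
    table = Table(abundances.values,
                  [str(i) for i in abundances.index],
                  [str(c) for c in abundances.columns])
    with biom_open(out_fp, 'w') as f:
        table.to_hdf5(f, generated_by)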
