Update snakemake_assemble to be useful lab-wide #1

@tanaes

Background

I've been using snakemake workflows for processing of shotgun data. I've found them to be extremely useful for fast, robust, and repeatable processing and analysis. In particular, these workflows have been useful for rapidly integrating and comparing new or alternative tools.

However, my initial attempts -- snakemake_shotqual and snakemake_anvio -- didn't take full advantage of Snakemake's useful features, and needed to be made more modular and updated for better testing and usability.

This repository is my initial attempt at modularizing and organizing a snakemake-based shotgun metagenomics toolset so that it can be more useful across a larger number of datasets, and perhaps form a more useful starting point for both production-level data analysis and integration of shotgun analysis steps into other tools (such as Qiita and Qiime2).

General concepts

I've set up the steps of the overall processing pipeline in a set of modular snakefiles, located in ./bin/snakefiles. These individual modules are each included in the parent ./Snakefile, which also imports global variables from a configuration file (config.yaml) and specifies a single, canonical end-to-end workflow in the top-level all rule.
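
For illustration, here's roughly what that layout looks like. The module names match the structure described above, but the config keys and the targets requested by all are placeholder guesses rather than the repo's actual contents.

# ./Snakefile (top level) -- minimal sketch; config key names are assumptions
configfile: "config.yaml"

samples = config["samples"]                          # hypothetical key listing sample IDs
qc_dir = config["output_dir"] + "/qc/"               # hypothetical output layout
assemble_dir = config["output_dir"] + "/assemble/"

# each processing module lives in its own snakefile under ./bin/snakefiles
include: "bin/snakefiles/qc"
include: "bin/snakefiles/assemble"

# the canonical end-to-end workflow: "all" simply requests the endpoints
# declared by each module's top-level rule
rule all:
    input:
        rules.qc.input,
        rules.assemble.input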

Within each of the individual modules -- for example, qc or assemble -- there is a top-level rule that specifies the endpoints of that particular analysis module. For qc, that means ending with trimmed and host-filtered fastq files, plus a MultiQC report summarizing the results of those steps:

rule qc:
    input:
        expand(qc_dir + "{sample}/skewer_trimmed/{sample}.trimmed.R1.fastq.gz", sample=samples),
        expand(qc_dir + "{sample}/skewer_trimmed/{sample}.trimmed.R2.fastq.gz", sample=samples),
        expand(qc_dir + "{sample}/filtered/{sample}.R1.trimmed.filtered.fastq.gz", sample=samples),
        expand(qc_dir + "{sample}/filtered/{sample}.R2.trimmed.filtered.fastq.gz", sample=samples),
        qc_dir + "multiQC_per_sample/multiqc_report.html"

For assemble, it means ending with an assembly from each specified assembler, plus Quast and MetaQuast reports, for each of the samples chosen for assembly:

rule assemble:
    input:
        expand(assemble_dir + "{sample}/{assembler}/{sample}.contigs.fa",
               sample=samples, assembler=config['assemblers']),
        expand(assemble_dir + "{sample}/metaquast.tar.gz",
               sample=samples),
        expand(assemble_dir + "{sample}/quast.tar.gz",
               sample=samples)

Current features

Right now, the workflow is set up for performing QC, assembly, binning, and binning visualization with Anvi'o. I have also been working on adding functional profiling with HUMAnN2 and taxonomic profiling with MetaPhlAn2, Centrifuge, Kraken, and Shogun.

Next steps

Several additional steps need to be taken before the codebase is generally useful lab-wide. Here's what I think needs to happen, and where I could use your help.

Installation

Right now, I've included a requirements.yaml file and an install script that should set up most of the required tools in a local conda environment for execution. However, a recent update to Snakemake allows rule-specific conda environments to be specified and created automatically, which I think will vastly improve the usability and portability of the pipeline (a rough sketch follows the list below).

  • create conda requirements files for rules (per module?)
  • update snakefiles to invoke environments
  • update problematic conda recipes (having trouble with Anvi'o, MaxBin2, and Quast)
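
To make the rule-specific environments concrete, here's a rough sketch assuming one environment file per module; the file location, channels, and package list are placeholders rather than the repo's actual requirements.

# bin/envs/qc.yaml -- hypothetical per-module environment file
channels:
  - bioconda
  - conda-forge
dependencies:
  - skewer
  - multiqc

A rule then points at that file with the conda directive; running snakemake with --use-conda builds and activates the environment automatically before the rule's command executes:

# hypothetical excerpt of a qc rule using the conda directive; the raw input
# paths and the trimming options are placeholders
rule skewer_trim:
    input:
        r1 = raw_dir + "{sample}.R1.fastq.gz",
        r2 = raw_dir + "{sample}.R2.fastq.gz"
    output:
        r1 = qc_dir + "{sample}/skewer_trimmed/{sample}.trimmed.R1.fastq.gz",
        r2 = qc_dir + "{sample}/skewer_trimmed/{sample}.trimmed.R2.fastq.gz"
    conda:
        "../envs/qc.yaml"   # resolved relative to the snakefile defining the rule
    shell:
        "skewer ... {input.r1} {input.r2}"   # actual trimming options elided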

Configuration

All of the project-specific information is encoded in a config.yaml file. This needs to be better documented, and it would be good to have some sort of generic helper tool (previously I was using an ipynb) to generate it for a dataset. A rough sketch of what the file might contain follows the list below.

  • clean up config.yaml specification
  • create helper function for config creation. Better ipynb?
  • add alternative launch scripts for running on different HPC environments, e.g. Comet/SLURM or local execution
  • install critical databases for use on Barnacle, preferably on local storage per-node to minimize filesystem IO
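
For reference, a config.yaml for a small dataset might look roughly like this; the key names, paths, and values are illustrative guesses, not the current specification.

# config.yaml -- hypothetical sketch; keys and paths are placeholders
output_dir: ./output
host_db: /databases/host/bowtie2_index   # host-filtering reference (placeholder)

assemblers:
  - megahit
  - metaspades

samples:
  sample_A:
    forward: /path/to/sample_A.R1.fastq.gz
    reverse: /path/to/sample_A.R2.fastq.gz
  sample_B:
    forward: /path/to/sample_B.R1.fastq.gz
    reverse: /path/to/sample_B.R2.fastq.gz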

Stability and testing

There are currently a few failure modes that could be caught more gracefully, and it would be useful to integrate automated testing.

  • add mechanism for more gracefully tolerating failed or low-coverage samples. In config prep ipynb?
  • add unit testing for python sub-rules
  • add unit or endpoint testing for rules
  • add Travis build instructions and testing for GitHub-integrated testing (see the sketch after this list)
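
As a starting point for the Travis item above, a .travis.yml along these lines could bootstrap a conda environment and run the workflow against a tiny test dataset; everything here, especially the test config path and targets, is a rough sketch rather than a working setup.

# .travis.yml -- hypothetical sketch; paths and targets are placeholders
language: python
python:
  - "3.5"
install:
  - wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
  - bash miniconda.sh -b -p "$HOME/miniconda"
  - export PATH="$HOME/miniconda/bin:$PATH"
  - conda env create -n test-env -f requirements.yaml
  - source activate test-env
script:
  # dry run to check that the DAG builds, then run a single module on test data
  - snakemake -n --configfile tests/test_config.yaml
  - snakemake --use-conda -j 2 --configfile tests/test_config.yaml qc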

Feature improvements

A couple of additional features would greatly facilitate the push-button utility of the workflow. Right now I think the major needs have to do with generating data products that can be plugged into downstream applications.

  • complete addition of taxonomy profilers
  • add taxonomy annotation of assembly bins
  • add QIIME-compatible BIOM outputs for relevant rules (e.g. humann2 and taxonomy steps); a rough sketch of one approach follows this list
  • add better binning visualizations (e.g. ICoVer)
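
On the BIOM output point, the biom-format Python package makes the conversion fairly mechanical once a profiler's table has been parsed; the sketch below assumes a taxon-by-sample abundance table already loaded into a pandas DataFrame (e.g. from merged MetaPhlAn2 outputs) and is only illustrative.

# hypothetical helper for writing a QIIME-compatible BIOM table; the parsing
# and merging of profiler output is assumed to have happened upstream
from biom.table import Table
from biom.util import biom_open

def write_biom(abundances, out_fp, generated_by="snakemake_assemble"):
    """abundances: pandas DataFrame with taxa as rows and samples as columns."""
    # positional args: data matrix, observation (taxon) IDs, sample IDs
    table = Table(abundances.values,
                  [str(i) for i in abundances.index],
                  [str(c) for c in abundances.columns])
    with biom_open(out_fp, 'w') as f:
        table.to_hdf5(f, generated_by)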
