39 changes: 36 additions & 3 deletions README.md
@@ -22,11 +22,12 @@
## Introduction

**nf-core/seqsubmit** is a Nextflow pipeline for submitting sequence data to [ENA](https://www.ebi.ac.uk/ena/browser/home).
-Currently, the pipeline supports three submission modes, each routed to a dedicated workflow and requiring its own input samplesheet structure:
+Currently, the pipeline supports four submission modes, each routed to a dedicated workflow and requiring its own input samplesheet structure:

- `mags` for Metagenome Assembled Genomes (MAGs) submission with `GENOMESUBMIT` workflow
- `bins` for bins submission with `GENOMESUBMIT` workflow
- `metagenomic_assemblies` for assembly submission with `ASSEMBLYSUBMIT` workflow
- `reads` for raw sequencing reads submission with `READSUBMIT` workflow

![seqsubmit workflow diagram](assets/seqsubmit_schema.png)

@@ -123,6 +124,38 @@ assembly_2,data/contigs_2.fasta.gz,,,42.7,ERR011323,MEGAHIT,1.2.9
> [!IMPORTANT]
> **Samplesheet column requirements**: All columns shown in the example above must be present in your samplesheet, even if some values are empty. Columns must be in exactly the same order as shown.

### `reads` mode (`READSUBMIT`)

The input must follow `assets/schema_input_reads.json`.

Required columns:

- `sample`
- `sample_accession`
- `fastq_1`
- `fastq_2` (the column must be present; leave the value empty for single-end reads)
- `platform`
- `instrument`
- `library_source`
- `library_selection`
- `library_strategy`

Optional columns:

- `insert_size`
- `library_name`
- `description`

Example `samplesheet_reads.csv`:

```csv
sample,sample_accession,fastq_1,fastq_2,platform,instrument,library_source,library_selection,library_strategy,insert_size,library_name,description
illumina_run_001,SAMEA1234567,data/reads_R1.fastq.gz,data/reads_R2.fastq.gz,ILLUMINA,Illumina HiSeq 2000,GENOMIC,RANDOM,WGS,500,HiSeq_library_001,Illumina sequencing of sample XYZ
```

> [!IMPORTANT]
> **Samplesheet column requirements**: All columns shown in the example above must be present in your samplesheet, even if some values are empty. Columns must be in exactly the same order as shown.
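
The column-order requirement can be checked before launching the pipeline. The following is a minimal sketch, not part of the pipeline: `check_header` is a hypothetical helper, and the expected column list is copied from the example `samplesheet_reads.csv` above.

```python
import csv
import io

# Column order copied from the example samplesheet_reads.csv in the README.
EXPECTED_COLUMNS = [
    "sample", "sample_accession", "fastq_1", "fastq_2", "platform",
    "instrument", "library_source", "library_selection", "library_strategy",
    "insert_size", "library_name", "description",
]


def check_header(handle):
    """Hypothetical helper: return a list of header problems (empty if none)."""
    # Read only the first CSV row; an empty file yields an empty header.
    header = next(csv.reader(handle), [])
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in header]
    if missing:
        problems.append("missing columns: " + ", ".join(missing))
    elif header != EXPECTED_COLUMNS:
        problems.append("columns do not match the expected order exactly")
    return problems


# In-memory examples; a real file would be opened with open(path, newline="").
good = io.StringIO(",".join(EXPECTED_COLUMNS) + "\n")
bad = io.StringIO("sample,fastq_1,sample_accession\n")
print(check_header(good))
print(check_header(bad))
```

The check looks only at the header row. Validation of the values themselves is left to the schema in `assets/schema_input_reads.json`.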

## Usage

> [!NOTE]
@@ -142,7 +175,7 @@ The `mags`/`bins` workflow requires databases for completeness/contamination est

| Parameter | Description |
| ------------------------------------------ | ----------------------------------------------------------------------------------------------------------------- |
-| `--mode` | Type of the data to be submitted. Options: `[mags, bins, metagenomic_assemblies]` |
+| `--mode` | Type of the data to be submitted. Options: `[mags, bins, metagenomic_assemblies, reads]` |
| `--input` | Path to the samplesheet describing the data to be submitted |
| `--outdir` | Path to the output directory for pipeline results |
| `--submission_study` OR `--study_metadata` | ENA study accession (PRJ/ERP) to submit the data to OR metadata file in JSON/TSV/CSV format to register new study |
@@ -161,7 +194,7 @@ General command template:
```bash
nextflow run nf-core/seqsubmit \
-profile <docker/singularity/...> \
--mode <mags|bins|metagenomic_assemblies> \
--mode <mags|bins|metagenomic_assemblies|reads> \
--input <samplesheet.csv> \
--centre_name <your_centre> \
--submission_study <your_study> \
127 changes: 127 additions & 0 deletions assets/schema_input_reads.json
@@ -0,0 +1,127 @@
{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "$id": "https://raw.githubusercontent.com/nf-core/seqsubmit/main/assets/schema_input_reads.json",
    "title": "nf-core/seqsubmit pipeline - params.input schema",
    "description": "Schema for the sample sheet provided with params.input if params.mode is set to 'reads'",
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "sample": {
                "type": "string",
                "pattern": "^\\S+$",
                "errorMessage": "Sample must be provided and cannot contain spaces",
                "meta": ["id"],
                "description": "Unique experiment/run name"
            },
            "sample_accession": {
                "type": "string",
                "pattern": "^\\S+$",
                "errorMessage": "Sample accession must be provided and cannot contain spaces",
                "description": "ENA sample accession of the sample used to generate the reads"
            },
            "fastq_1": {
                "type": "string",
                "format": "file-path",
                "exists": true,
                "pattern": "^\\S+\\.(fq|fastq)(\\.gz)?$",
                "errorMessage": "FASTQ file must have extension '.fq' or '.fastq' (optionally gzipped)",
                "description": "Forward reads FASTQ file (single-end or paired-end)"
            },
            "fastq_2": {
                "anyOf": [
                    {
                        "type": "string",
                        "format": "file-path",
                        "exists": true,
                        "pattern": "^\\S+\\.(fq|fastq)(\\.gz)?$"
                    },
                    {
                        "type": "string",
                        "maxLength": 0
                    }
                ],
                "errorMessage": "FASTQ file for reverse reads must have extension '.fq' or '.fastq' (optionally gzipped)",
                "description": "Reverse reads FASTQ file if paired-end. Leave empty for single-end reads"
            },
            "platform": {
                "type": "string",
                "pattern": "^\\S+$",
                "errorMessage": "Platform must be provided and cannot contain spaces",
                "description": "Sequencing platform (e.g., ILLUMINA, PACBIO_SMRT, OXFORD_NANOPORE, ION_TORRENT)"
            },
            "instrument": {
                "type": "string",
                "pattern": "^[^\\n]+$",
                "errorMessage": "Instrument must be provided and cannot span multiple lines",
                "description": "Sequencer model (e.g., 'Illumina HiSeq 2000', 'PacBio Sequel')"
            },
            "library_source": {
                "type": "string",
                "pattern": "^\\S+$",
                "errorMessage": "Library source must be provided and cannot contain spaces",
                "description": "Library source (GENOMIC, METAGENOMIC, TRANSCRIPTOMIC, etc.)"
            },
            "library_selection": {
                "type": "string",
                "pattern": "^\\S+$",
                "errorMessage": "Library selection must be provided and cannot contain spaces",
                "description": "Library selection (RANDOM, PCR, cDNA, etc.)"
            },
            "library_strategy": {
                "type": "string",
                "pattern": "^\\S+$",
                "errorMessage": "Library strategy must be provided and cannot contain spaces",
                "description": "Library strategy (WGS, RNA-Seq, AMPLICON, etc.)"
            },
            "insert_size": {
                "anyOf": [
                    {
                        "type": "number",
                        "minimum": 0
                    },
                    {
                        "type": "string",
                        "maxLength": 0
                    }
                ],
                "errorMessage": "Insert size must be a non-negative number or empty",
                "description": "Fragment/insert size for paired-end reads (optional)"
            },
            "library_name": {
                "anyOf": [
                    {
                        "type": "string"
                    },
                    {
                        "type": "string",
                        "maxLength": 0
                    }
                ],
                "description": "Descriptive library name (optional)"
            },
            "description": {
                "anyOf": [
                    {
                        "type": "string"
                    },
                    {
                        "type": "string",
                        "maxLength": 0
                    }
                ],
                "description": "Free-text description of the experiment (optional)"
            }
        },
        "required": [
            "sample",
            "sample_accession",
            "fastq_1",
            "platform",
            "instrument",
            "library_source",
            "library_selection",
            "library_strategy"
        ]
    }
}
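
To illustrate what the `fastq_1`/`fastq_2` and `insert_size` rules in this schema accept, here is a rough Python sketch. The regular expression is copied from the schema. The helper names `fastq_name_ok` and `insert_size_ok` are hypothetical and are not part of the pipeline, and the sketch does not reproduce the `exists: true` file check.

```python
import re

# Pattern copied from the fastq_1 / fastq_2 entries of the schema
# (JSON string escapes removed).
FASTQ_PATTERN = re.compile(r"^\S+\.(fq|fastq)(\.gz)?$")


def fastq_name_ok(value, allow_empty=False):
    """Hypothetical helper: check a FASTQ path against the schema pattern.

    allow_empty mirrors the fastq_2 rule, where an empty string marks
    single-end reads; fastq_1 must always be a non-empty matching path.
    """
    if value == "":
        return allow_empty
    return FASTQ_PATTERN.search(value) is not None


def insert_size_ok(value):
    """Hypothetical helper: accept an empty value or a number >= 0."""
    if value == "":
        return True
    try:
        return float(value) >= 0
    except ValueError:
        return False


for path in ["data/reads_R1.fastq.gz", "reads.fq", "reads.fasta.gz", "my reads.fastq"]:
    print(path, fastq_name_ok(path))
print(insert_size_ok("500"), insert_size_ok(""), insert_size_ok("-1"))
```

Under these rules, a single-end row passes with an empty `fastq_2`, while `fastq_1` must always name a `.fq` or `.fastq` file, optionally gzipped.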
2 changes: 1 addition & 1 deletion conf/modules.config
@@ -176,7 +176,7 @@ process {
        ]
    }

-    withName: 'REGISTERSTUDY|GENERATE_ASSEMBLY_MANIFEST' {
+    withName: 'REGISTERSTUDY|GENERATE_ASSEMBLY_MANIFEST|CREATE_READS_MANIFEST' {
        publishDir = [
            enabled: false
        ]
34 changes: 34 additions & 0 deletions conf/test_reads_paired.config
@@ -0,0 +1,34 @@
/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Nextflow config file for running minimal tests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Defines input files and everything required to run a fast and simple pipeline test.

Use as follows:
nextflow run nf-core/seqsubmit -profile test_reads,<docker/singularity> --outdir <OUTDIR>

----------------------------------------------------------------------------------------
*/

process {
    resourceLimits = [
        cpus: 2,
        memory: '8.GB',
        time: '1.h'
    ]
}

params {
    config_profile_name = 'Test --mode reads profile'
    config_profile_description = 'Minimal test profile for reads submission'

    // Input data
    input = "${projectDir}/assets/samplesheet_reads.csv"
    outdir = 'test_output'

    mode = "reads"
    submission_study = "PRJEB98843"
    centre_name = "TEST_CENTER"

    test_upload = true
}
16 changes: 15 additions & 1 deletion docs/output.md
@@ -8,7 +8,7 @@ The directories listed below will be created in the results directory (set with

## Pipeline overview

-The pipeline is built using [Nextflow](https://www.nextflow.io/) and performs automated submission of sequence data to ENA. Exact steps and generated outputs depend on the data type and `--mode` executed (`mags`, `bins` or `metagenomic_assemblies`).
+The pipeline is built using [Nextflow](https://www.nextflow.io/) and performs automated submission of sequence data to ENA. Exact steps and generated outputs depend on the data type and `--mode` executed (`mags`, `bins`, `metagenomic_assemblies` or `reads`).

## `mags` and `bins` outputs

@@ -59,6 +59,20 @@ Assembly study registration, manifest generation, and Webin-CLI submission are e
> Users should read the ENA documentation on referencing submitted data: \
> metagenomic assemblies: https://ena-docs.readthedocs.io/en/latest/submit/assembly/metagenome/primary.html#assigned-accession-numbers

## `reads` outputs

When `--mode reads` is used, results are written under `reads/`.

<details markdown="1">
<summary>Output files</summary>

- `reads/`
  - `upload/reads_accessions.tsv`: run accessions assigned to submitted reads.

</details>

Manifest generation and Webin-CLI submission are executed by the workflow, but their intermediate outputs are not currently published into `--outdir` by the pipeline.

## Common outputs

### MultiQC