Skip to content

Fix --save_output_as_bam flag#2154

Open
FriederikeHanssen wants to merge 18 commits into
devfrom
save-output-bam
Open

Fix --save_output_as_bam flag#2154
FriederikeHanssen wants to merge 18 commits into
devfrom
save-output-bam

Conversation

@FriederikeHanssen
Copy link
Copy Markdown
Contributor

@FriederikeHanssen FriederikeHanssen commented Mar 9, 2026

What this fixes

--save_output_as_bam was broken in several ways. This PR makes it work end-to-end and tightens an existing safeguard.

User-visible behaviour

  • With --save_output_as_bam: markduplicates, sentieon dedup, and recalibration all publish BAM directly. No more silently-skipped variant calling, no more duplicate-emission errors.
  • Without --save_output_as_bam: BAM files no longer appear in preprocessing/converted/cram_to_bam/ as a side-effect of cnvkit/msisensor2/muse running. The conversion still happens for those tools (they need BAM input), it's just not published.
  • --use_gatk_spark markduplicates + --save_mapped now errors at parameter validation, regardless of --save_output_as_bam. Previously it produced a name-sorted, unindexable mapped BAM/CRAM (Potentially incompatible outputs are generated with BAM output and Spark MarkDuplicates #1949).

How

Instead of always producing CRAM and converting back to BAM, the relevant processes now output the requested format directly:

  • APPLYBQSR, GATK4_MARKDUPLICATES, GATK4SPARK_MARKDUPLICATES, and SENTIEON_DEDUP have ext.prefix/ext.suffix conditional on params.save_output_as_bam.
  • bam_applybqsr, bam_markduplicates, bam_markduplicates_spark, and bam_sentieon_dedup each expose a single alignment channel that mixes their BAM and CRAM output branches — only one is ever populated per sample, so there is no duplicate emission to join on.
  • CSV restart files derive the data type from the actual filename instead of a brittle .minus(".cram") hack.
  • The conditional CRAM_TO_BAM/CRAM_TO_BAM_RECAL conversion steps at the preprocessing stages are gone. The variant-calling-stage CRAM_TO_BAM (needed for cnvkit/msisensor2/muse BAM input) still runs but now only publishes its output when --save_output_as_bam is set.

Issues closed

Test plan

  • nf-test test tests/save_output_as_bam.nf.test --profile debug,test,docker — both scenarios pass
  • nf-test test tests/default.nf.test --profile debug,test,docker — default test passes (no BAM artifacts without flag)
  • With --save_output_as_bam: BAM files in preprocessing/markduplicates/ and preprocessing/recalibrated/, variant calling runs
  • Without flag: only CRAM files, no preprocessing/converted/cram_to_bam/ artifacts
  • With intervals (WGS): no duplicate emission errors
  • With sentieon dedup: BAM published to preprocessing/sentieon_dedup/ when flag is set
  • --use_gatk_spark markduplicates --save_mapped errors out at parameter validation

🤖 Generated with Claude Code

…g, unnecessary conversions

Closes #2136, #2064, #2149, #2148

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@nf-core-bot
Copy link
Copy Markdown
Member

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.5.1.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.

FriederikeHanssen and others added 4 commits March 12, 2026 11:25
Resolve conflicts:
- CHANGELOG.md: keep entries from both sides
- fastq_preprocess_gatk/main.nf: keep our fix removing old CRAM_TO_BAM_RECAL block

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The process was removed in #2154, so the sample log output
in usage.md should no longer list it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of always producing CRAM and converting back to BAM:
- Make GATK4_MARKDUPLICATES/GATK4SPARK_MARKDUPLICATES ext.prefix
  conditional on save_output_as_bam (same pattern as APPLYBQSR)
- Emit unified `alignment` channel from bam_markduplicates subworkflows
- Remove CRAM_TO_BAM conversion step at markduplicates stage
- Fix BAM_TO_CRAM_MAPPING ext.when to skip conversion when
  save_output_as_bam is set with skip_markduplicates
- Fix CSV create subworkflows to derive file type from actual
  filenames instead of using fragile .minus(".cram") hack
- Add BAM publishDir patterns for markduplicates configs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove dead BAM_TO_CRAM config block (no process uses this alias)
- Remove unused save_output_as_bam parameter from
  CHANNEL_MARKDUPLICATES_CREATE_CSV and
  CHANNEL_BASERECALIBRATOR_CREATE_CSV (type is now derived
  from filename, making the parameter redundant)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 12, 2026

nf-core pipelines lint overall result: Passed ✅ ⚠️

Posted for pipeline commit a245a21

+| ✅ 224 tests passed       |+
#| ❔  13 tests were ignored |#
!| ❗   7 tests had warnings |!
Details

❗ Test warnings:

  • pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
  • pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
  • pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!
  • pipeline_todos - TODO string in base.config: Check the defaults for all processes
  • schema_lint - Input mimetype is missing or empty
  • schema_description - No description provided in schema for parameter: markduplicates_pixel_distance
  • schema_description - No description provided in schema for parameter: gatk_pcr_indel_model

❔ Tests ignored:

✅ Tests passed:

Run details

  • nf-core/tools version 3.5.1
  • Run at 2026-05-29 13:00:52

GATK4_MARKDUPLICATES does not auto-index BAM output (only indexes
when converting to CRAM). Add explicit INDEX_MARKDUPLICATES step
for the BAM path, matching the pattern already used in the spark
variant.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread tests/save_output_as_bam.nf.test
Comment thread tests/save_output_as_bam.nf.test
Comment thread CHANGELOG.md Outdated
FriederikeHanssen and others added 2 commits March 12, 2026 16:33
The skip QC/recal/md test now correctly skips BAM_TO_CRAM_MAPPING
when --save_output_as_bam is set, reducing process count from 10 to 9.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- alignment_from_everything: Remove CRAM_TO_BAM, add INDEX_MARKDUPLICATES,
  update file listings and stable_content md5s for BAM output
- alignment_to_fastq: Same structural changes, update multiqc aggregate md5s
- save_output_as_bam: Fix warning field to match CI behavior

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
FriederikeHanssen and others added 2 commits March 20, 2026 16:01
- alignment_from_everything/alignment_to_fastq: Fix .bam.metrics md5s
  (values were extracted from wrong side of CI diff)
- save_output_as_bam: Add missing variant calling snapshot, fix warnings

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
GATK4_MARKDUPLICATES .bam.metrics output includes timestamps,
making md5 values change between CI runs. Add to .nftignore
(same as .cram.metrics) and remove from snapshot stable_content.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Member

@maxulysse maxulysse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good but order of the PR in changelog

@gorgitko
Copy link
Copy Markdown

Hi, what's the time estimate for merging & releasing this fix? Thanks!

@FriederikeHanssen
Copy link
Copy Markdown
Contributor Author

I asked the issue authors above to test. @amizeranschi ran into something that i haven't had time to investigate: #2064 it would be good too exclude first if the changes in this PR are causing the stuck issue and/or validate that all the other scenarios are functioning now, if you have time to run any of them

FriederikeHanssen and others added 3 commits May 28, 2026 11:03
- Sentieon dedup now respects --save_output_as_bam: conditional ext.prefix
  produces dedup.bam/dedup.cram natively; BAM_SENTIEON_DEDUP exposes a
  unified `alignment` emit mixing the BAM and CRAM branches (#2078)
- Variant-calling-stage CRAM_TO_BAM no longer publishes converted BAMs to
  preprocessing/converted/cram_to_bam when --save_output_as_bam is unset;
  the conversion still runs for cnvkit/msisensor2/muse (#2148)
- Widen the --use_gatk_spark markduplicates + --save_mapped check to error
  regardless of --save_output_as_bam: Spark requires name-sorted input, so
  the saved mapped alignment (BAM or CRAM) is name-sorted and unindexable
  (#1949). Updated nextflow_schema.json and docs/usage.md accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Align indentation of SAMTOOLS_STATS withName block with sibling
- Make SAMTOOLS_STATS ext.prefix respect --sentieon_consensus, matching
  the SENTIEON_DEDUP prefix that was already conditional

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@maxulysse
Copy link
Copy Markdown
Member

Let's merge #2197 before to fix the tests

maxulysse and others added 4 commits May 28, 2026 16:29
The previous mix of bam/bai + cram/crai joins fails because
SENTIEON_DEDUP.out.bai is non-optional in the module (sentieon driver
always writes a .bai alongside .crai). With --save_output_as_bam unset,
.out.bam is empty but .out.bai has an entry — failOnMismatch on the bam
side triggers a Join mismatch error before the pipeline can progress.

Branch on params.save_output_as_bam instead: only the populated path
runs, so the empty-side mismatch never happens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- sentieon_dedup: SAMTOOLS_STATS prefix now respects --sentieon_consensus
  (test.consensus.cram.stats instead of test.dedup.cram.stats)
- variant_calling_{cnvkit,muse,all}: drop preprocessing/converted/cram_to_bam
  paths since CRAM_TO_BAM no longer publishes when --save_output_as_bam is unset
- alignment_{from_everything,to_fastq}: refresh MultiQC samtools md5s
- save_output_as_bam: add MultiQC bcftools_stats + samtools_insert_size
  entries now that variant calling actually runs (no longer silently skipped)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Derive file type from `file.name.endsWith('.bam')` like
channel_markduplicates_create_csv and channel_baserecalibrator_create_csv
already do, and drop the now-unused save_output_as_bam parameter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants