Open
Description
Bug report
The scenario: a pipeline step succeeds on a retry attempt, the pipeline continues and fails on a later step, and on resume the step that succeeded on retry is not picked up from the cache, so it runs again. This doesn't always happen; it only occurs when that step had already been attempted at least once in a previous execution (i.e., a work folder for that module/step had already been created). Relevant blog post here
Expected behavior and actual behavior
Expected behaviour: The step that succeeded on retry is properly cached, and the pipeline resumes from where it left off.
Actual behaviour: Even though the step completed successfully on a retry, it is not cached, and it re-runs.
Steps to reproduce the problem
- clone proteinfamilies
- alter conf/test.config here to make it fail on the 1st attempt, but pass on the 2nd:
    withName: 'NFCORE_PROTEINFAMILIES:PROTEINFAMILIES:FAA_SEQFU_SEQKIT:SEQKIT_SEQ' {
        memory = { task.attempt == 1 ? 1.MB : 2.GB } // memory = { 1.4.GB * task.attempt }
        time = { 2.m * task.attempt }
    }
- Your executor has to be able to stop a process when it uses more memory than was requested (or use docker), so I used slurm with this slurm.config:
    profiles {
        slurm {
            executor {
                name = "slurm"
                queueSize = 100
                queueGlobalStatus = true
            }
            workDir = "/path/to/work_proteinfamilies_cache_test/"
            process {
                queue = 'standard'
                cache = 'lenient'
            }
            params {
                // Boilerplate options
                outdir = "${launchDir}/results"
            }
        }
    }
- Run with command:
    NXF_VER=25.10.4 nextflow run proteinfamilies -c conf/slurm.config -profile singularity,test,slurm -resume
- As soon as the work folder for SEQKIT_SEQ is created, kill the pipeline (Ctrl+C). This step is important: if SEQKIT_SEQ passes on the 2nd attempt without having been attempted at least once in an earlier run, the cache works properly.
- Resume the pipeline with the same command, and wait for SEQKIT_SEQ to complete successfully on its 2nd attempt.
- Kill the pipeline again.
- Resume the pipeline with the same command.
Verdict: SEQKIT_SEQ will begin to run again, while SEQFU_STATS_BEFORE, which ran in parallel and completed on its first attempt, will have been cached properly.
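The steps above rely on SEQKIT_SEQ being retried after its first attempt is killed for exceeding memory. nf-core pipelines usually enable retries in their base config, but that is an assumption here and not something checked for this report. If retries are not already enabled, a minimal sketch of the equivalent settings (the `errorStrategy` and `maxRetries` values below are illustrative, not taken from the pipeline) could be added next to the `withName` override:

```groovy
process {
    withName: 'NFCORE_PROTEINFAMILIES:PROTEINFAMILIES:FAA_SEQFU_SEQKIT:SEQKIT_SEQ' {
        // Assumption: retry on any failure; the pipeline's own config may
        // restrict retries to specific exit codes instead.
        errorStrategy = 'retry'
        // Assumption: one retry is enough, since attempt 2 requests 2.GB.
        maxRetries = 1
    }
}
```

With this in place, attempt 1 should fail on the 1.MB limit and attempt 2 should run with 2.GB, which is the retry-then-succeed state the resume steps depend on.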
Program output
Irrelevant
Environment
- Nextflow version: NXF_VER=25.10.4
- Java version:
openjdk 21.0.3 2024-04-16 LTS
OpenJDK Runtime Environment Temurin-21.0.3+9 (build 21.0.3+9-LTS)
OpenJDK 64-Bit Server VM Temurin-21.0.3+9 (build 21.0.3+9-LTS, mixed mode, sharing)
- Operating system: Linux, slurm
- Bash version:
GNU bash, version 5.1.8(1)-release (x86_64-redhat-linux-gnu)
Additional context
Dummy bot PR with attempt to pinpoint the bug in the codebase here