Cache fail when specific conditions are met #6884

@vagkaratzas

Description

Bug report

The scenario: a pipeline step succeeds, but only on a retry attempt. The pipeline then continues and fails on a later step. On resume, the step that succeeded on retry is not picked up from the cache and runs again. This does not always happen; it only occurs when that step had already been attempted at least once in a previous execution (i.e., a work folder for that module/step had already been created). Relevant blog post here

Expected behavior and actual behavior

Expected behaviour: The step that succeeded on retry is properly cached, and the pipeline resumes from where it left off.
Actual behaviour: Even though the step completed successfully on a retry, it is not cached and runs again.

Steps to reproduce the problem

  1. clone proteinfamilies
  2. alter conf/test.config here to make SEQKIT_SEQ fail on the 1st attempt, but pass on the 2nd:
withName: 'NFCORE_PROTEINFAMILIES:PROTEINFAMILIES:FAA_SEQFU_SEQKIT:SEQKIT_SEQ' {
    memory = { task.attempt == 1 ? 1.MB : 2.GB } // memory = { 1.4.GB * task.attempt }
    time   = { 2.m * task.attempt }
}
  3. Your executor must be able to kill a process that uses more memory than it requested (or use docker), so I used slurm with this slurm.config:
profiles {
   slurm {
       executor {
           name              = "slurm"
           queueSize         = 100
           queueGlobalStatus = true

       }
       workDir = "/path/to/work_proteinfamilies_cache_test/"
       process {
           queue  = 'standard'
           cache  = 'lenient'
       }
       params {
           // Boilerplate options
           outdir = "${launchDir}/results"
       }
   }
}
  4. Run with command:
    NXF_VER=25.10.4 nextflow run proteinfamilies -c conf/slurm.config -profile singularity,test,slurm -resume
  5. As soon as the work folder for SEQKIT_SEQ is created, kill the pipeline (Ctrl+C). This step is important; otherwise, if SEQKIT_SEQ passes on the 2nd attempt without having previously failed at least once, the cache works properly.
  6. Resume the pipeline with the same command, and wait for SEQKIT_SEQ to complete successfully on its 2nd attempt.
  7. Kill the pipeline again.
  8. Resume the pipeline with the same command. Verdict: SEQKIT_SEQ starts running again, while SEQFU_STATS_BEFORE, which ran in parallel and completed on its first attempt, is cached properly.
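
To confirm the verdict of the final resume without watching the console, the task records of the run can be filtered for anything that was not served from the cache. The following is a minimal sketch, not part of the original report: `not_cached` is a hypothetical helper name, and it assumes that `nextflow log` prints the requested fields tab-separated and reports cached tasks with the status `CACHED`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original report).
# Reads tab-separated rows of: name, status, attempt, workdir
# and prints every task whose status is not CACHED.
not_cached() {
  awk -F'\t' '$2 != "CACHED" { print $1 "\t" $2 "\tattempt=" $3 }'
}

# Usage, after the final resume (assumption: `last` resolves to the most
# recent run; otherwise substitute the run name listed by `nextflow log`):
#   nextflow log last -f name,status,attempt,workdir | not_cached
```

If the bug is reproduced, SEQKIT_SEQ would be expected to show up in this output, while SEQFU_STATS_BEFORE would not.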

Program output

Irrelevant

Environment

  • Nextflow version: NXF_VER=25.10.4
  • Java version:
openjdk 21.0.3 2024-04-16 LTS
OpenJDK Runtime Environment Temurin-21.0.3+9 (build 21.0.3+9-LTS)
OpenJDK 64-Bit Server VM Temurin-21.0.3+9 (build 21.0.3+9-LTS, mixed mode, sharing)
  • Operating system: Linux, slurm
  • Bash version: GNU bash, version 5.1.8(1)-release (x86_64-redhat-linux-gnu)

Additional context

A dummy bot PR attempting to pinpoint the bug in the codebase is here
