bot attempt to figure the bug, based on reproducible example #6882

Open
vagkaratzas wants to merge 1 commit into nextflow-io:master from vagkaratzas:bug-fix-cache-fail-after-retry

@vagkaratzas vagkaratzas commented Mar 3, 2026

Related to issue #6884

My team and I have long suffered from seemingly random Nextflow cache failures. I think I have tracked down the culprit, and I have posted a reproducible example in the issue linked above and in a blog post.

I fed this example to the bots and they tried to pinpoint the culprit, so this PR is meant to point in the direction of where the bug fix might lie.

Hope this helps!

Bot report:

Root Cause

  The bug is a hash chain mismatch between the retry path and the resume path.

  checkCachedOrLaunchTask builds a chain:
  H1 = hash(H_base + 1)
  H2 = hash(H1   + 2)
  H3 = hash(H2   + 3)   ← tries increments each iteration
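
  The chain above can be modelled in a few lines. This is a minimal Python sketch, not Nextflow code: `link`, `build_chain`, the use of SHA-256, and the 4-byte encoding of the counter are all assumptions chosen for illustration. Only the shape of the chain (each hash derived from the previous hash plus the `tries` counter) comes from the report.

```python
import hashlib


def link(prev_hash: bytes, tries: int) -> bytes:
    """One chain link: hash the previous digest together with the tries counter.

    SHA-256 and the 4-byte big-endian counter are stand-ins (assumptions),
    not the hashing scheme Nextflow actually uses.
    """
    return hashlib.sha256(prev_hash + tries.to_bytes(4, "big")).digest()


def build_chain(h_base: bytes, length: int) -> list:
    """Return [H1, H2, ..., H_length], starting from h_base with tries = 1."""
    chain = []
    current = h_base
    for tries in range(1, length + 1):
        current = link(current, tries)  # tries increments each iteration
        chain.append(current)
    return chain


# Hypothetical base hash standing in for the hash of the task's inputs.
h_base = hashlib.sha256(b"example-task").digest()
h1, h2, h3 = build_chain(h_base, 3)
```

  Each position in the chain depends on every earlier position, so two paths land on the same hash only if they start from the same base and use the same counter at every step.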

  The resume always starts with tries=1 from H_base (a fresh task with attempt=1), correctly traversing the chain.

  The retry (old code) called checkCachedOrLaunchTask(taskCopy, taskCopy.hash, false), where taskCopy.hash was the chained hash of the last
  failed attempt (e.g. H2), not H_base. Inside that call, the loop then started from tries = task.failCount + 1.

  In the user's scenario (one killed prior attempt + one OOM failure before the successful retry):

  ┌─────────────────┬───────────────┬────────────┬──────────────────────────┐
  │      Path       │ Starting hash │ tries used │          Result          │
  ├─────────────────┼───────────────┼────────────┼──────────────────────────┤
  │ Retry (old)     │ H2            │ 2          │ hash(H2 + 2) = H3_retry  │
  ├─────────────────┼───────────────┼────────────┼──────────────────────────┤
  │ Resume (step 3) │ H_base → ...  │ 3          │ hash(H2 + 3) = H3_resume │
  └─────────────────┴───────────────┴────────────┴──────────────────────────┘

  H3_retry ≠ H3_resume — the resume never finds the cached result.
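
  The two rows of the table can be reproduced with a small model. This is a hedged Python sketch under stated assumptions: `link` is a hypothetical stand-in for the real hashing step (SHA-256 over the previous digest plus a 4-byte counter), and `fail_count = 1` mirrors the single OOM failure in the scenario. It is not taken from TaskProcessor.groovy.

```python
import hashlib


def link(prev_hash: bytes, tries: int) -> bytes:
    """Stand-in for one chain step (assumption: SHA-256 over hash + counter)."""
    return hashlib.sha256(prev_hash + tries.to_bytes(4, "big")).digest()


# Hypothetical base hash for a fresh task with attempt=1.
h_base = hashlib.sha256(b"example-task").digest()
h1 = link(h_base, 1)  # position taken by the killed prior attempt
h2 = link(h1, 2)      # position taken by the OOM failure

# Retry (old): starts from the last failed attempt's hash (H2)
# and uses tries = failCount + 1 = 2.
fail_count = 1
h3_retry = link(h2, fail_count + 1)

# Resume: walks from H_base past H1 and H2, so it reaches H2 with tries = 3.
h3_resume = link(h2, 3)

print(h3_retry == h3_resume)  # False
```

  Both paths hash on top of H2, but with different counters, so the successful retry is stored under a hash that the resume never computes.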

  Fix (3 changes to TaskProcessor.groovy)

  1. checkCachedOrLaunchTask (line 799): tries = task.failCount + 1 → tries = 1. The loop always starts at position 1 and naturally advances
  past any existing directories.
  2. Path 1 retry (spot preemption / transient errors): Inside the submit block, set taskCopy.config.attempt = 1 and compute retryHash = new
  TaskHasher(taskCopy).compute() (= H_base). Pass retryHash instead of taskCopy.hash.
  3. Path 2 retry (RETRY error strategy): Same pattern — compute H_base with attempt=1 before taskCopy.config.attempt is bumped to
  taskErrCount+1. The loop then traverses H1→H2→H3 exactly as the resume will, landing on the same hash position.
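
  The intent of these three changes can be checked against a toy model. The following Python sketch rests on several assumptions: `link`, `walk`, and the `work_dirs` dictionary are hypothetical names, SHA-256 stands in for the real hasher, and the loop is a simplified reading of "advance past any existing directories". It only illustrates that a retry and a resume which both start from H_base with tries = 1 land on the same hash.

```python
import hashlib


def link(prev_hash: bytes, tries: int) -> bytes:
    """Stand-in for one chain step (assumption: SHA-256 over hash + counter)."""
    return hashlib.sha256(prev_hash + tries.to_bytes(4, "big")).digest()


def walk(start_hash: bytes, first_tries: int, work_dirs: dict):
    """Follow the chain until a cached result or a free position is found.

    work_dirs maps a hash to "failed" or "ok" (a hypothetical model of the
    work directories). Returns (hash, cache_hit).
    """
    current, tries = start_hash, first_tries
    while True:
        current = link(current, tries)
        status = work_dirs.get(current)
        if status == "ok":
            return current, True    # completed result found: cache hit
        if status is None:
            return current, False   # free position: the task would run here
        tries += 1                  # occupied by a failed attempt: move on


h_base = hashlib.sha256(b"example-task").digest()
h1 = link(h_base, 1)
h2 = link(h1, 2)
work_dirs = {h1: "failed", h2: "failed"}  # killed attempt + OOM failure

# Retry with the proposed fix: start from H_base with tries = 1.
retry_hash, _ = walk(h_base, 1, work_dirs)
work_dirs[retry_hash] = "ok"  # the retry succeeds

# Resume: a fresh task, also starting from H_base with tries = 1.
resume_hash, cache_hit = walk(h_base, 1, work_dirs)
print(cache_hit, resume_hash == retry_hash)  # True True
```

  In this model the retry skips H1 and H2 and stores its result at hash(H2 + 3), which is exactly the position the resume reaches. Whether the real loop behaves the same way depends on details of checkCachedOrLaunchTask that this sketch does not capture.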


netlify bot commented Mar 3, 2026

Deploy Preview for nextflow-docs-staging ready!

Name Link
🔨 Latest commit faf89cd
🔍 Latest deploy log https://app.netlify.com/projects/nextflow-docs-staging/deploys/69a709f87b84a80009ad9fbd
😎 Deploy Preview https://deploy-preview-6882--nextflow-docs-staging.netlify.app
