Fix cached task after abort#6903

Open
jorgee wants to merge 4 commits into master from 6884-fix-retries-with-abort

Conversation

jorgee commented Mar 9, 2026

Closes #6884
Alternative to #6882
This pull request enhances task retry logic in Nextflow by improving how task failures and aborts are tracked and incorporated into cache key calculation. The changes ensure that both failure and abort counts are considered, making task resumption and retry behavior more robust and predictable.

Task retry and cache logic improvements:

  • The hash used for determining task cache keys now includes both failCount and a new abortedCount to ensure that retries after aborts or failures use a distinct hash.
  • When a cached entry is found with a status of FAILED or ABORTED, the corresponding counters (failCount or abortedCount) on the TaskRun object are incremented, ensuring accurate retry attempts and cache key updates.
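
The two bullets above can be sketched as a small standalone model. This is Python for illustration only, not Nextflow's Groovy implementation: `task_hash`, `resolve_cache_key`, `base_key`, and the dict-based cache are hypothetical names, and only the two counters and the FAILED/ABORTED statuses come from the description.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class TaskRun:
    """Hypothetical stand-in for the task state described above."""
    base_key: str            # assumed: whatever normally identifies the task
    fail_count: int = 0      # models failCount
    aborted_count: int = 0   # models the new abortedCount


def task_hash(task: TaskRun) -> str:
    # Both counters feed the hash, so a retry after a failure or an abort
    # maps to a different cache entry than the execution that preceded it.
    payload = f"{task.base_key}|{task.fail_count}|{task.aborted_count}"
    return hashlib.sha256(payload.encode()).hexdigest()


def resolve_cache_key(task: TaskRun, cache: dict) -> tuple:
    """Walk past cached FAILED/ABORTED entries.

    Returns (key, status), where status is None when nothing is cached
    under the final key and the task therefore has to be executed.
    """
    while True:
        key = task_hash(task)
        status = cache.get(key)
        if status == "FAILED":
            task.fail_count += 1
        elif status == "ABORTED":
            task.aborted_count += 1
        else:
            # No entry, or an entry that is neither failed nor aborted.
            return key, status
```

In this model, each cached failure or abort advances one counter and the lookup is repeated with the new hash, so the resumed task ends up on a key that no earlier execution has used.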

Task state tracking enhancements:

  • Added a new abortedCount field to the TaskRun class to track the number of times a task execution has been aborted.
  • Introduced isAborted() and isFailed() helper methods to the TraceRecord class for clearer and more maintainable status checks.
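
As a rough illustration of why such helpers make status checks easier to maintain, here is a minimal Python model. It assumes the record stores its status as a plain string; the class layout and the `count_outcomes` function are hypothetical and are not taken from the Nextflow source.

```python
from dataclasses import dataclass


@dataclass
class TraceRecord:
    """Hypothetical model of a trace record holding a status string."""
    status: str

    def is_failed(self) -> bool:
        return self.status == "FAILED"

    def is_aborted(self) -> bool:
        return self.status == "ABORTED"


def count_outcomes(records):
    """Return (fail_count, aborted_count) for a sequence of cached records."""
    fail_count = sum(1 for r in records if r.is_failed())
    aborted_count = sum(1 for r in records if r.is_aborted())
    return fail_count, aborted_count
```

Call sites then read as `record.is_failed()` instead of repeating the string comparison, which keeps the status literals in one place.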

TODO:

  • Check attempts
  • Add test to reproduce failure and check fix

… when task previously aborted or retired

Signed-off-by: jorgee <jorge.ejarque@seqera.io>

netlify bot commented Mar 9, 2026

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit 7ee0159
🔍 Latest deploy log https://app.netlify.com/projects/nextflow-docs-staging/deploys/69b0160b7f52510008097af2

Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Comment on lines +847 to +850
if( task.failCount > 0 && task.config.getAttempt() != task.failCount + 1 ) {
    task.config.attempt = task.failCount + 1
    task.resolve(taskBody)
}

This part is not strictly required to fix the case described in the issue.
However, when a task has previous failures but never completed, it is not cached and is re-executed. Currently that re-execution runs with task.attempt = 1; with this code it is re-executed with task.attempt = failCount + 1.

This case occurs in the added test. A process is defined to fail on the first attempt and succeed on every later one, so it should execute twice in total. In the test, the execution is aborted after the first retry. Without this code, the resumed task is executed twice more (the first execution fails and the second succeeds). With this code, the previously failed execution is counted as an attempt, so the task runs only once.

I am not sure whether re-executing with attempt = 1 was intended, or whether the attempt should be updated according to cached failures, as this code does. @bentsherman @pditommaso what's your opinion?
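
The scenario in this comment can be modeled with a few lines of Python. This is a simplified sketch, not the Nextflow test: `process_succeeds`, `executions_on_resume`, `sync_attempt`, and `max_attempts` are hypothetical names, and the model only counts executions after resume.

```python
def process_succeeds(attempt: int) -> bool:
    # Test process from the comment: fails on the first attempt,
    # succeeds on every later one.
    return attempt > 1


def executions_on_resume(cached_fail_count: int, sync_attempt: bool,
                         max_attempts: int = 5) -> int:
    """Count how many times the task runs after resume until it succeeds."""
    if sync_attempt and cached_fail_count > 0:
        # Mirrors the guarded assignment in the diff: attempt = failCount + 1.
        attempt = cached_fail_count + 1
    else:
        # Behavior without the change: start again from the first attempt.
        attempt = 1
    runs = 0
    while attempt <= max_attempts:
        runs += 1
        if process_succeeds(attempt):
            return runs
        attempt += 1
    raise RuntimeError("task never succeeded within max_attempts")
```

Under this model, `executions_on_resume(1, sync_attempt=False)` gives 2 and `executions_on_resume(1, sync_attempt=True)` gives 1, which matches the two outcomes described above.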

@jorgee jorgee marked this pull request as ready for review March 10, 2026 08:54
Signed-off-by: jorgee <jorge.ejarque@seqera.io>

Development

Successfully merging this pull request may close these issues.

Cache fail when specific conditions are met

1 participant