Add K8s failure classification, terminationReason, and retryOn directive #6919
Open
rhassaine wants to merge 1 commit into nextflow-io:master
…On directive

When running Nextflow pipelines on Kubernetes, all task failures are treated identically: there is no distinction between infrastructure failures (OOMKilled, Evicted, Preempted) and application errors (exit code 1, script bugs). This makes it impossible to implement smart retry strategies like "retry OOM with more memory" or "fail fast on application errors" without resorting to fragile exit-code matching in error strategy closures.

Additionally, `task.terminationReason` was not accessible in error strategy closures because the value was set on `TaskRun` but never plumbed through to `TaskConfig` (the object exposed as `task` in closures).

The K8s task handler now extracts the container termination reason from the K8s API and classifies failures into categories:

- **Application failures** (OOMKilled, Error, etc.): `terminationReason` is set on the task but there is no special handling; the user's error strategy is respected.
- **Infrastructure failures** (Evicted, Preempting, DeadlineExceeded, Shutdown): sets `task.aborted = true` and throws `NodeTerminationException` for automatic retry, since these are transient platform issues not caused by the task itself.
- **Inferred reasons**: when K8s doesn't provide a reason but the exit code is informative (137 = SIGKILL/OOM, 143 = SIGTERM), a synthetic reason is set (e.g. `OOMKilled(exit137)`).

Also fixes the `exitcode` → `exitCode` typo in `K8sClient.jobStateFallback0()` and the `0 ?: readExitFile()` Groovy truthiness bug (0 is falsy in Groovy, so exit code 0 was incorrectly falling through to `readExitFile()`).

This builds on the work in nextflow-io#6436 and nextflow-io#6442, which introduced K8s exit code reading from the container terminated state.
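The classification and the exit-code fix described above can be sketched in plain Groovy. This is a minimal illustration under stated assumptions, not the PR's code: `FailureSketch`, `resolveExitCode`, `inferReason`, and `isInfrastructure` are hypothetical names, and only the 137 mapping is shown because the message spells out its synthetic reason string.

```groovy
// Illustrative sketch only; class and method names are hypothetical.
class FailureSketch {

    // Reasons the commit message lists as infrastructure failures
    static final Set<String> INFRA_REASONS =
        ['Evicted', 'Preempting', 'DeadlineExceeded', 'Shutdown'] as Set

    // Null-safe fallback. Writing `fromApi ?: readExitFile.call()` would be wrong,
    // because the Elvis operator treats Integer 0 as falsy.
    static Integer resolveExitCode(Integer fromApi, Closure<Integer> readExitFile) {
        return fromApi != null ? fromApi : readExitFile.call()
    }

    // Keep the reason reported by K8s when present; otherwise infer one from the exit code.
    // Exit code 143 (SIGTERM) is also inferred per the message, but its exact
    // reason string is not given there, so it is left out of this sketch.
    static String inferReason(String reason, Integer exitCode) {
        if (reason) return reason
        if (exitCode == 137) return 'OOMKilled(exit137)'
        return null
    }

    // True for reasons that should mark the task as aborted and be retried automatically
    static boolean isInfrastructure(String reason) {
        return reason in INFRA_REASONS
    }
}

// The truthiness pitfall in isolation: 0 falls through to the right-hand side
def buggy = 0 ?: 1
assert buggy == 1

// The null-safe version keeps a successful exit code of 0
assert FailureSketch.resolveExitCode(0, { 1 }) == 0
```

Keeping the infrastructure reasons in a single set makes the split between the two failure categories easy to audit in one place.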
- Added `volatile String terminationReason` field to `TaskRun`
- Plumbed `terminationReason` and `aborted` from `TaskRun` to `TaskConfig` in `TaskProcessor.resumeOrDie()`, following the existing `exitStatus` pattern
- `terminationReason` is now accessible as `task.terminationReason` in error strategy closures
- Native logging: retry messages now include `[reason: OOMKilled]` and hard failure error blocks include a "Termination reason" section, so no custom error strategy closure is needed for visibility

New process directive `retryOn` that provides a declarative way to retry based on termination reasons without writing Groovy closures:

    process FOO {
        retryOn 'OOMKilled'
        memory { 2.GB * task.attempt }
        ...
    }

Or in config files (assignment syntax):

    process {
        retryOn = ['OOMKilled', 'OOMKilled(exit137)']
    }

When the task's `terminationReason` matches any value in the `retryOn` list, the error strategy is overridden to RETRY. The user's error strategy closure is still evaluated first (preserving side effects like logging), but the return value is overridden.

Tested on GKE with nf-core/oncoanalyser using a config that forces OOM:

    process {
        withName: "BWAMEM2_ALIGN" {
            memory = { 1500.MB * task.attempt }
            retryOn = ['OOMKilled']
        }
    }

BWAMEM2_ALIGN OOMs at 1500MB, Nextflow logs:

    [2b/54e1be] NOTE: Process `BWAMEM2_ALIGN (...)` terminated with an error exit status (137) [reason: OOMKilled] -- Execution is retried (1)

The task is retried with 3000MB (attempt 2), then 4500MB (attempt 3) if needed.
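The override order described above (evaluate the user's closure first, then replace its result when the reason matches) can be sketched as follows. `resolveStrategy` is a hypothetical helper written for illustration; per the commit message the real logic lives in `TaskProcessor.checkErrorStrategy()`, whose code is not shown here.

```groovy
// Illustrative sketch only; `resolveStrategy` is a hypothetical name.
String resolveStrategy(Closure<String> userStrategy, Map task, List<String> retryOn) {
    // Evaluate the user's closure first so its side effects (e.g. logging) still happen
    String chosen = userStrategy.call(task)
    // Override the returned strategy when the termination reason is in the retryOn list
    if (task.terminationReason != null && task.terminationReason in retryOn) {
        return 'RETRY'
    }
    return chosen
}

// A user strategy that would normally terminate the run
def userStrategy = { t -> 'TERMINATE' }

assert resolveStrategy(userStrategy, [terminationReason: 'OOMKilled'], ['OOMKilled']) == 'RETRY'
assert resolveStrategy(userStrategy, [terminationReason: 'Error'], ['OOMKilled']) == 'TERMINATE'
```

Calling the closure unconditionally is what preserves any logging the user put in it, even when its return value is discarded.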
- TaskRun.groovy: added terminationReason field, cleared on makeCopy()
- TaskConfig.groovy: added getRetryOn() getter
- TaskProcessor.groovy: plumbed terminationReason/aborted to config, implemented retryOn override in checkErrorStrategy(), added native terminationReason logging
- ProcessBuilder.groovy: registered retryOn as valid directive
- K8sTaskHandler.groovy: failure classification, terminationReason extraction, infrastructure failure detection, K8sOutOfCpu/MemoryException in catch clause
- K8sClient.groovy: fixed exitcode→exitCode typo, explicit exitCode: 0
- K8sTaskHandlerTest.groovy: 6 new tests + 1 updated test
- K8sClientTest.groovy: 1 new test for job fallback exit code

Signed-off-by: Rayan Hassaine <r.hassaine@hartwigmedicalfoundation.nl>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Rayan Hassaïne <r.hassaine@hartwigmedicalfoundation.nl>
✅ Deploy Preview for nextflow-docs-staging canceled.
Summary
- `terminationReason` plumbing: New `task.terminationReason` field accessible in error strategy closures, plus native logging in retry messages (`[reason: OOMKilled]`) and hard failure error blocks
- `retryOn` directive: New declarative process directive to retry based on termination reason without Groovy closures (e.g. `retryOn 'OOMKilled'` or `retryOn = ['OOMKilled', 'OOMKilled(exit137)']` in config)

Motivation
When running pipelines on Kubernetes, all task failures are treated identically, with no distinction between OOMKilled, Evicted, or application errors. This makes it impossible to implement smart retry strategies like "retry OOM with more memory" or "fail fast on app errors" without fragile exit-code matching. This PR builds on #6436 and #6442, which introduced K8s exit code reading from the container terminated state.
Changes
nf-k8s plugin
- `K8sTaskHandler`: Extract termination reason from container state, classify infrastructure vs application failures, infer reason from exit codes (137→OOMKilled, 143→SignalTerm). Generic reasons (`Error`, `Completed`) are filtered out to avoid polluting logs.
- `K8sClient`: Fix `exitcode`→`exitCode` typo and `0 ?: readExitFile()` Groovy truthiness bug in `jobStateFallback0()`

Core (nextflow module)

- `TaskRun`: Added `volatile String terminationReason` field
- `TaskConfig`: Added `getRetryOn()` getter for the new directive
- `TaskProcessor`: Plumbed terminationReason/aborted to TaskConfig, implemented retryOn override in `checkErrorStrategy()`, added native terminationReason logging in retry and error messages
- `TaskErrorFormatter`: Added "Termination reason" section to hard failure error blocks
- `ProcessBuilder`: Registered `retryOn` as valid process directive
- `ProcessDsl.DirectiveDsl`: Added `retryOn` methods for config validation

Tests

- `K8sTaskHandlerTest` (OOM, CPU, infrastructure classification, Evicted/aborted, OOMKilled/not-aborted, application failure)
- `K8sClientTest` (job fallback exit code)

Example usage
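In a process definition, the directive takes the form given in the commit message. The snippet below repeats that form as a fragment rather than a complete process: the `...` stands for the rest of the process body and is left elided, and `retryOn` only exists with this PR applied.

```groovy
process FOO {
    // Retry this process when the container was OOM-killed
    retryOn 'OOMKilled'
    // Scale memory with each attempt so the retry has more headroom
    memory { 2.GB * task.attempt }
    ...
}
```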
Or in config:
    process {
        retryOn = ['OOMKilled', 'OOMKilled(exit137)']
        memory = { 4.GB * task.attempt }
    }

Test plan
- … (`make test`)
- … `[reason: OOMKilled]` in native retry logs and successful retry with scaled memory (200MB on attempt 2)
- … `[reason: OOMKilled]` in logs and successful retry with scaled memory
- … `task.aborted = true`
- Generic reasons (`Error`, `Completed`) are filtered out, so only meaningful reasons appear in logs

🤖 Generated with Claude Code