Add K8s failure classification, terminationReason, and retryOn directive by rhassaine · Pull Request #6919 · nextflow-io/nextflow

rhassaine · 2026-03-13T13:53:17Z

Summary

K8s failure classification: The nf-k8s plugin now extracts container termination reasons from the K8s API and classifies failures as infrastructure (Evicted, Preempting, DeadlineExceeded, Shutdown → auto-retry) vs application (OOMKilled, Error → respect user's error strategy)
terminationReason plumbing: New task.terminationReason field accessible in error strategy closures, plus native logging in retry messages ([reason: OOMKilled]) and hard failure error blocks
retryOn directive: New declarative process directive to retry based on termination reason without Groovy closures (e.g. retryOn 'OOMKilled' or retryOn = ['OOMKilled', 'OOMKilled(exit137)'] in config)
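The split between infrastructure and application failures summarized above can be pictured with a small Groovy sketch. The enum, the class, and its method names are hypothetical illustrations, not the plugin's actual code; only the reason strings are taken from this PR.

```groovy
// Hypothetical sketch: class and method names are illustrative, not from nf-k8s.
enum FailureKind { INFRASTRUCTURE, APPLICATION }

class FailureClassifier {
    // Reasons this PR treats as transient platform issues (auto-retry).
    static final Set<String> INFRA_REASONS =
        ['Evicted', 'Preempting', 'DeadlineExceeded', 'Shutdown'] as Set

    // Generic reasons this PR filters out of log output.
    static final Set<String> GENERIC_REASONS = ['Error', 'Completed'] as Set

    // Anything not in the infrastructure set is left to the user's error strategy.
    static FailureKind classify(String reason) {
        INFRA_REASONS.contains(reason) ? FailureKind.INFRASTRUCTURE : FailureKind.APPLICATION
    }

    // Returns the reason to show in logs, or null when it carries no information.
    static String loggable(String reason) {
        (reason == null || GENERIC_REASONS.contains(reason)) ? null : reason
    }
}

assert FailureClassifier.classify('Evicted') == FailureKind.INFRASTRUCTURE
assert FailureClassifier.classify('OOMKilled') == FailureKind.APPLICATION
assert FailureClassifier.loggable('Error') == null
assert FailureClassifier.loggable('OOMKilled') == 'OOMKilled'
```

Keeping the two sets as data makes it easy to see which reasons trigger an automatic retry and which ones are only surfaced to the user.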

Motivation

When running pipelines on Kubernetes, all task failures are treated identically: Nextflow does not distinguish between OOMKilled, Evicted, and application errors. This makes it impossible to implement smart retry strategies like "retry OOM with more memory" or "fail fast on app errors" without fragile exit-code matching. This PR builds on the work in #6436 and #6442, which introduced K8s exit code reading from the container terminated state.
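For context, the exit-code matching that the paragraph above calls fragile typically looks like the following dynamic error strategy, which works without this PR. The process name, the script body, and the choice of exit codes are illustrative assumptions.

```groovy
process ALIGN {
    // Retry only when the exit code looks like a kill signal; otherwise stop.
    // 137 = 128 + SIGKILL(9), 143 = 128 + SIGTERM(15).
    errorStrategy { task.exitStatus in [137, 143] ? 'retry' : 'terminate' }
    memory { 2.GB * task.attempt }
    maxRetries 3

    script:
    """
    bwa-mem2 mem ...
    """
}
```

An exit code of 137 only says that the process received SIGKILL, so this closure cannot tell an OOM kill apart from other causes of the same signal.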

Changes

nf-k8s plugin

K8sTaskHandler: Extract termination reason from container state, classify infrastructure vs application failures, infer reason from exit codes (137→OOMKilled, 143→SignalTerm). Generic reasons (Error, Completed) are filtered out to avoid polluting logs.
K8sClient: Fix the exitcode→exitCode typo and the 0 ?: readExitFile() Groovy truthiness bug in jobStateFallback0() (0 is falsy in Groovy, so an exit code of 0 incorrectly fell through to readExitFile())
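Two details from this list are easy to show in plain Groovy: the Elvis-operator pitfall with an exit code of 0, and the inference of a reason from the exit code. This is a sketch under stated assumptions: the stand-in `readExitFile()`, the `inferReason` helper, and the null-check shown as a safer pattern are hypothetical and not the plugin's code. The exact synthetic string for exit code 143 is also an assumption based on the wording of the list above.

```groovy
// Hypothetical stand-in for the fallback that reads the exit file.
Integer readExitFile() { 1 }

Integer reported = 0   // the API reported a successful exit

// Buggy pattern: the Elvis operator tests Groovy truth, and 0 is falsy,
// so a genuine exit code of 0 falls through to the fallback.
Integer buggy = reported ?: readExitFile()
assert buggy == 1

// Safer pattern (assumption): fall back only when no value was reported.
Integer fixed = reported != null ? reported : readExitFile()
assert fixed == 0

// Hypothetical helper: infer a synthetic reason when K8s provides none.
String inferReason(String reason, Integer exitCode) {
    if (reason) return reason
    switch (exitCode) {
        case 137: return 'OOMKilled(exit137)'   // 128 + SIGKILL(9)
        case 143: return 'SignalTerm'           // 128 + SIGTERM(15); string assumed
        default:  return null
    }
}

assert inferReason(null, 137) == 'OOMKilled(exit137)'
assert inferReason('Evicted', 137) == 'Evicted'
assert inferReason(null, 1) == null
```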

Core (nextflow module)

TaskRun: Added volatile String terminationReason field
TaskConfig: Added getRetryOn() getter for the new directive
TaskProcessor: Plumbed terminationReason/aborted to TaskConfig, implemented retryOn override in checkErrorStrategy(), added native terminationReason logging in retry and error messages
TaskErrorFormatter: Added "Termination reason" section to hard failure error blocks
ProcessBuilder: Registered retryOn as valid process directive
ProcessDsl.DirectiveDsl: Added retryOn methods for config validation
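The retryOn override listed for checkErrorStrategy() can be sketched as follows. The function name and signature are hypothetical and do not reflect the actual TaskProcessor code; the behaviour shown (evaluate the user's strategy first, then override the result on a match) follows the commit message of this PR.

```groovy
// Hypothetical sketch of the retryOn override; not the actual TaskProcessor code.
String resolveStrategy(Closure<String> userStrategy, String terminationReason, List<String> retryOn) {
    // Evaluate the user's strategy first so its side effects (e.g. logging) still run.
    String chosen = userStrategy.call()
    // Override the result when the termination reason is listed in retryOn.
    if (terminationReason != null && retryOn?.contains(terminationReason))
        return 'retry'
    return chosen
}

def retryOn = ['OOMKilled', 'OOMKilled(exit137)']
assert resolveStrategy({ 'terminate' }, 'OOMKilled', retryOn) == 'retry'
assert resolveStrategy({ 'terminate' }, null, retryOn) == 'terminate'
assert resolveStrategy({ 'ignore' }, 'Evicted', retryOn) == 'ignore'
```

The sketch leaves out how the override interacts with maxRetries, which this description does not specify.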

Tests

6 new + 1 updated test in K8sTaskHandlerTest (OOM, CPU, infrastructure classification, Evicted/aborted, OOMKilled/not-aborted, application failure)
1 new test in K8sClientTest (job fallback exit code)

Example usage

process ALIGN {
    retryOn 'OOMKilled'
    memory { 2.GB * task.attempt }
    maxRetries 3

    script:
    """
    bwa-mem2 mem ...
    """
}

Or in config:

process {
    retryOn = ['OOMKilled', 'OOMKilled(exit137)']
    memory = { 4.GB * task.attempt }
}
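For cases that need more than a fixed list, the summary notes that task.terminationReason is also readable inside error strategy closures. A minimal sketch, assuming this PR is applied (the field is introduced here); the process body mirrors the example above, and using `finish` as the fallback is an illustrative choice.

```groovy
process ALIGN {
    // Assumes the task.terminationReason field added by this PR.
    errorStrategy { task.terminationReason == 'OOMKilled' ? 'retry' : 'finish' }
    memory { 2.GB * task.attempt }
    maxRetries 3

    script:
    """
    bwa-mem2 mem ...
    """
}
```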

Test plan

All 148 unit tests pass (make test)
Tested on GKE with custom OOM pipeline: process allocates 150MB against a 100MB memory limit, confirmed [reason: OOMKilled] in native retry logs and successful retry with scaled memory (200MB on attempt 2)
Tested on GKE with nf-core/oncoanalyser (on latest stable Nextflow): forced OOM on BWAMEM2_ALIGN at 1500MB, confirmed [reason: OOMKilled] in logs and successful retry with scaled memory
Infrastructure failures (Evicted, Preempting) correctly set task.aborted = true
Application failures (exit 1) do not trigger retryOn or set aborted
Generic K8s reasons (Error, Completed) are filtered out — only meaningful reasons appear in logs

🤖 Generated with Claude Code

…On directive

When running Nextflow pipelines on Kubernetes, all task failures are treated identically: there is no distinction between infrastructure failures (OOMKilled, Evicted, Preempted) and application errors (exit code 1, script bugs). This makes it impossible to implement smart retry strategies like "retry OOM with more memory" or "fail fast on application errors" without resorting to fragile exit-code matching in error strategy closures.

Additionally, task.terminationReason was not accessible in error strategy closures because the value was set on TaskRun but never plumbed through to TaskConfig (the object exposed as `task` in closures).

The K8s task handler now extracts the container termination reason from the K8s API and classifies failures into categories:

- **Application failures** (OOMKilled, Error, etc.): `terminationReason` is set on the task but there is no special handling; the user's error strategy is respected.
- **Infrastructure failures** (Evicted, Preempting, DeadlineExceeded, Shutdown): Sets `task.aborted = true` and throws `NodeTerminationException` for automatic retry, since these are transient platform issues not caused by the task itself.
- **Inferred reasons**: When K8s doesn't provide a reason but the exit code is informative (137 = SIGKILL/OOM, 143 = SIGTERM), a synthetic reason is set (e.g. `OOMKilled(exit137)`).

Also fixes the `exitcode` → `exitCode` typo in `K8sClient.jobStateFallback0()` and the `0 ?: readExitFile()` Groovy truthiness bug (0 is falsy in Groovy, so exit code 0 was incorrectly falling through to readExitFile).

Building on the work in nextflow-io#6436 and nextflow-io#6442 which introduced K8s exit code reading from the container terminated state.

- Added `volatile String terminationReason` field to `TaskRun`
- Plumbed `terminationReason` and `aborted` from `TaskRun` to `TaskConfig` in `TaskProcessor.resumeOrDie()`, following the existing `exitStatus` pattern
- `terminationReason` is now accessible as `task.terminationReason` in error strategy closures
- Native logging: retry messages now include `[reason: OOMKilled]` and hard failure error blocks include a "Termination reason" section, so no custom error strategy closure is needed for visibility

New process directive `retryOn` that provides a declarative way to retry based on termination reasons without writing Groovy closures:

    process FOO {
        retryOn 'OOMKilled'
        memory { 2.GB * task.attempt }
        ...
    }

Or in config files (assignment syntax):

    process {
        retryOn = ['OOMKilled', 'OOMKilled(exit137)']
    }

When the task's `terminationReason` matches any value in the `retryOn` list, the error strategy is overridden to RETRY. The user's error strategy closure is still evaluated first (preserving side effects like logging), but the return value is overridden.

Tested on GKE with nf-core/oncoanalyser using a config that forces OOM:

    process {
        withName: "BWAMEM2_ALIGN" {
            memory = { 1500.MB * task.attempt }
            retryOn = ['OOMKilled']
        }
    }

BWAMEM2_ALIGN OOMs at 1500MB, Nextflow logs:

    [2b/54e1be] NOTE: Process `BWAMEM2_ALIGN (...)` terminated with an error exit status (137) [reason: OOMKilled] -- Execution is retried (1)

The task is retried with 3000MB (attempt 2), then 4500MB (attempt 3) if needed.

- TaskRun.groovy: added terminationReason field, cleared on makeCopy()
- TaskConfig.groovy: added getRetryOn() getter
- TaskProcessor.groovy: plumbed terminationReason/aborted to config, implemented retryOn override in checkErrorStrategy(), added native terminationReason logging
- ProcessBuilder.groovy: registered retryOn as valid directive
- K8sTaskHandler.groovy: failure classification, terminationReason extraction, infrastructure failure detection, K8sOutOfCpu/MemoryException in catch clause
- K8sClient.groovy: fixed exitcode→exitCode typo, explicit exitCode: 0
- K8sTaskHandlerTest.groovy: 6 new tests + 1 updated test
- K8sClientTest.groovy: 1 new test for job fallback exit code

Signed-off-by: Rayan Hassaine <r.hassaine@hartwigmedicalfoundation.nl>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Rayan Hassaïne <r.hassaine@hartwigmedicalfoundation.nl>

netlify · 2026-03-13T13:53:24Z

✅ Deploy Preview for nextflow-docs-staging canceled.

Name	Link
🔨 Latest commit	`1d6afcb`
🔍 Latest deploy log	https://app.netlify.com/projects/nextflow-docs-staging/deploys/69b416d19f94d00008d7ccb4

rhassaine requested a review from a team as a code owner March 13, 2026 13:53

bentsherman added the executor/k8s label Mar 13, 2026

pditommaso force-pushed the master branch from 6fe40e1 to ea1f4ea Compare March 17, 2026 19:46

rhassaine wants to merge 1 commit into nextflow-io:master from rhassaine:k8s-failure-classification
