Add K8s failure classification, terminationReason, and retryOn directive #6919

Open
rhassaine wants to merge 1 commit into nextflow-io:master from rhassaine:k8s-failure-classification
@rhassaine

Summary

  • K8s failure classification: The nf-k8s plugin now extracts container termination reasons from the K8s API and classifies failures as infrastructure (Evicted, Preempting, DeadlineExceeded, Shutdown → auto-retry) vs application (OOMKilled, Error → respect user's error strategy)
  • terminationReason plumbing: New task.terminationReason field accessible in error strategy closures, plus native logging in retry messages ([reason: OOMKilled]) and hard failure error blocks
  • retryOn directive: New declarative process directive to retry based on termination reason without Groovy closures (e.g. retryOn 'OOMKilled' or retryOn = ['OOMKilled', 'OOMKilled(exit137)'] in config)

Motivation

When running pipelines on Kubernetes, all task failures are treated identically: there is no distinction between OOMKilled, Evicted, or application errors. This makes it impossible to implement smart retry strategies like "retry OOM with more memory" or "fail fast on app errors" without fragile exit-code matching. This PR builds on the work in #6436 and #6442, which introduced K8s exit code reading from the container terminated state.

Changes

nf-k8s plugin

  • K8sTaskHandler: Extract termination reason from container state, classify infrastructure vs application failures, infer reason from exit codes (137→OOMKilled, 143→SignalTerm). Generic reasons (Error, Completed) are filtered out to avoid polluting logs.
  • K8sClient: Fix exitcode → exitCode typo and 0 ?: readExitFile() Groovy truthiness bug in jobStateFallback0()

Core (nextflow module)

  • TaskRun: Added volatile String terminationReason field
  • TaskConfig: Added getRetryOn() getter for the new directive
  • TaskProcessor: Plumbed terminationReason/aborted to TaskConfig, implemented retryOn override in checkErrorStrategy(), added native terminationReason logging in retry and error messages
  • TaskErrorFormatter: Added "Termination reason" section to hard failure error blocks
  • ProcessBuilder: Registered retryOn as valid process directive
  • ProcessDsl.DirectiveDsl: Added retryOn methods for config validation

Tests

  • 6 new + 1 updated test in K8sTaskHandlerTest (OOM, CPU, infrastructure classification, Evicted/aborted, OOMKilled/not-aborted, application failure)
  • 1 new test in K8sClientTest (job fallback exit code)

Example usage

process ALIGN {
    retryOn 'OOMKilled'
    memory { 2.GB * task.attempt }
    maxRetries 3

    script:
    """
    bwa-mem2 mem ...
    """
}

Or in config:

process {
    retryOn = ['OOMKilled', 'OOMKilled(exit137)']
    memory = { 4.GB * task.attempt }
}

Test plan

  • All 148 unit tests pass (make test)
  • Tested on GKE with custom OOM pipeline: process allocates 150MB against a 100MB memory limit, confirmed [reason: OOMKilled] in native retry logs and successful retry with scaled memory (200MB on attempt 2)
  • Tested on GKE with nf-core/oncoanalyser (on latest stable Nextflow): forced OOM on BWAMEM2_ALIGN at 1500MB, confirmed [reason: OOMKilled] in logs and successful retry with scaled memory
  • Infrastructure failures (Evicted, Preempting) correctly set task.aborted = true
  • Application failures (exit 1) do not trigger retryOn or set aborted
  • Generic K8s reasons (Error, Completed) are filtered out — only meaningful reasons appear in logs

🤖 Generated with Claude Code

…On directive

When running Nextflow pipelines on Kubernetes, all task failures are treated
identically: there is no distinction between terminations reported by the
platform (OOMKilled, Evicted, Preempting) and ordinary application errors (exit
code 1, script bugs). This makes it impossible to implement smart retry strategies
like "retry OOM with more memory" or "fail fast on application errors" without
resorting to fragile exit-code matching in error strategy closures.

Additionally, task.terminationReason was not accessible in error strategy closures
because the value was set on TaskRun but never plumbed through to TaskConfig (the
object exposed as `task` in closures).

The K8s task handler now extracts the container termination reason from the K8s API
and classifies failures into categories:

- **Application failures** (OOMKilled, Error, etc.): `terminationReason` is set on
  the task with no special handling, so the user's error strategy applies.
- **Infrastructure failures** (Evicted, Preempting, DeadlineExceeded, Shutdown):
  Sets `task.aborted = true` and throws `NodeTerminationException` for automatic
  retry, since these are transient platform issues not caused by the task itself.
- **Inferred reasons**: When K8s doesn't provide a reason but the exit code is
  informative (137 = SIGKILL/OOM, 143 = SIGTERM), a synthetic reason is set
  (e.g. `OOMKilled(exit137)`).
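
The three categories above can be sketched roughly as follows. This is an illustrative Groovy sketch, not the code in `K8sTaskHandler`: the `FailureClassifier` class, its `classify` method, and the returned map are hypothetical names. The order of the checks is also an assumption: generic reasons are dropped first, then the exit code is consulted. The exact synthetic string for exit code 143 is not given in the description either, so `SignalTerm` is used as a placeholder.

```groovy
// Hypothetical sketch of the classification rules; names are NOT from nf-k8s.
class FailureClassifier {
    // Reasons treated as transient platform issues (taken from the list above)
    static final Set<String> INFRA_REASONS =
        ['Evicted', 'Preempting', 'DeadlineExceeded', 'Shutdown'] as Set
    // Generic reasons that carry no diagnostic value
    static final Set<String> GENERIC_REASONS = ['Error', 'Completed'] as Set
    // Exit codes that hint at a cause; the 143 label is an assumed placeholder
    static final Map<Integer, String> INFERRED_REASONS =
        [137: 'OOMKilled(exit137)', 143: 'SignalTerm']

    /** Returns a map with the resolved reason (or null) and an infrastructure flag. */
    static Map classify(String k8sReason, Integer exitCode) {
        // Drop generic reasons so they do not pollute logs
        String reason = (k8sReason in GENERIC_REASONS) ? null : k8sReason
        // Fall back to an exit-code based guess when no specific reason is left
        if (reason == null && exitCode != null)
            reason = INFERRED_REASONS.get(exitCode)
        return [reason: reason, infrastructure: (reason in INFRA_REASONS)]
    }
}

assert FailureClassifier.classify('Evicted', null) == [reason: 'Evicted', infrastructure: true]
assert FailureClassifier.classify('OOMKilled', 137) == [reason: 'OOMKilled', infrastructure: false]
assert FailureClassifier.classify(null, 137).reason == 'OOMKilled(exit137)'
```

In this sketch a caller would set `task.aborted = true` and raise the retryable exception only when `infrastructure` is true, leaving every other case to the user's error strategy.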

Also fixes the `exitcode` → `exitCode` typo in `K8sClient.jobStateFallback0()`
and the `0 ?: readExitFile()` Groovy truthiness bug (0 is falsy in Groovy, so
exit code 0 was incorrectly falling through to readExitFile).
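
A minimal, standalone illustration of the truthiness pitfall described above. `readExitFileStub` is a hypothetical stand-in for `readExitFile()`. The null check shown is just one generic way to avoid the problem and is not necessarily how the commit fixes it (the file list describes its fix as an explicit `exitCode: 0`).

```groovy
// Groovy truth treats the number 0 as false, so the Elvis operator
// discards a legitimate exit code of 0 and evaluates the fallback.
Integer readExitFileStub() { 42 }   // hypothetical stand-in for readExitFile()

Integer exitCode = 0
def buggy = exitCode ?: readExitFileStub()                    // 0 is falsy, so the fallback wins
def fixed = exitCode != null ? exitCode : readExitFileStub()  // only null triggers the fallback

assert buggy == 42
assert fixed == 0
```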

This builds on the work in nextflow-io#6436 and nextflow-io#6442, which introduced
K8s exit code reading from the container terminated state.

- Added `volatile String terminationReason` field to `TaskRun`
- Plumbed `terminationReason` and `aborted` from `TaskRun` to `TaskConfig` in
  `TaskProcessor.resumeOrDie()`, following the existing `exitStatus` pattern
- `terminationReason` is now accessible as `task.terminationReason` in error
  strategy closures
- Native logging: retry messages now include `[reason: OOMKilled]` and hard
  failure error blocks include a "Termination reason" section — no custom
  error strategy closure needed for visibility
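
As an illustration of the closure access described above, a process could branch on the new field like this. It is a sketch that assumes this PR's `task.terminationReason` is available. The `startsWith` check is a choice made here so that the inferred `OOMKilled(exit137)` value also matches. The other directives (`errorStrategy`, `memory`, `maxRetries`) are existing Nextflow directives.

```groovy
process ALIGN {
    // Retry only when the container was OOM-killed (reported or inferred);
    // any other failure terminates the run. Assumes task.terminationReason
    // from this PR; it may be null when no meaningful reason was reported.
    errorStrategy { task.terminationReason?.startsWith('OOMKilled') ? 'retry' : 'terminate' }
    memory { 2.GB * task.attempt }
    maxRetries 3

    script:
    """
    bwa-mem2 mem ...
    """
}
```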

New process directive `retryOn` that provides a declarative way to retry based on
termination reasons without writing Groovy closures:

    process FOO {
        retryOn 'OOMKilled'
        memory { 2.GB * task.attempt }
        ...
    }

Or in config files (assignment syntax):

    process {
        retryOn = ['OOMKilled', 'OOMKilled(exit137)']
    }

When the task's `terminationReason` matches any value in the `retryOn` list, the
error strategy is overridden to RETRY. The user's error strategy closure is still
evaluated first (preserving side effects like logging), but the return value is
overridden.
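
The override order described above can be sketched in plain Groovy. The `resolveStrategy` function and its parameters are hypothetical and do not reflect the actual signature of `checkErrorStrategy()`. The sketch only shows that the user's closure runs first and that its result is replaced when the reason matches.

```groovy
// Hypothetical sketch of the override order; not TaskProcessor's API.
String resolveStrategy(Closure<String> userStrategy, String terminationReason, Object retryOn) {
    // Evaluate the user's closure first so its side effects (e.g. logging) still happen
    String result = userStrategy.call()
    // Accept both the single-value and the list form of the directive
    List<String> reasons = retryOn == null ? [] : [retryOn].flatten().collect { it.toString() }
    // Replace the closure's result only when the termination reason matches
    return (terminationReason != null && (terminationReason in reasons)) ? 'RETRY' : result
}

def calls = 0
def strategy = { calls++; 'TERMINATE' }

assert resolveStrategy(strategy, 'OOMKilled', ['OOMKilled', 'OOMKilled(exit137)']) == 'RETRY'
assert resolveStrategy(strategy, null, 'OOMKilled') == 'TERMINATE'
assert calls == 2   // the user's closure ran both times
```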

Tested on GKE with nf-core/oncoanalyser using a config that forces OOM:

    process {
        withName: "BWAMEM2_ALIGN" {
            memory = { 1500.MB * task.attempt }
            retryOn = ['OOMKilled']
        }
    }

BWAMEM2_ALIGN is OOM-killed at 1500MB, and Nextflow logs:
    [2b/54e1be] NOTE: Process `BWAMEM2_ALIGN (...)` terminated with an error
    exit status (137) [reason: OOMKilled] -- Execution is retried (1)

The task is retried with 3000MB (attempt 2), then 4500MB (attempt 3) if needed.

- TaskRun.groovy: added terminationReason field, cleared on makeCopy()
- TaskConfig.groovy: added getRetryOn() getter
- TaskProcessor.groovy: plumbed terminationReason/aborted to config, implemented
  retryOn override in checkErrorStrategy(), added native terminationReason logging
- ProcessBuilder.groovy: registered retryOn as valid directive
- K8sTaskHandler.groovy: failure classification, terminationReason extraction,
  infrastructure failure detection, K8sOutOfCpu/MemoryException in catch clause
- K8sClient.groovy: fixed exitcode→exitCode typo, explicit exitCode: 0
- K8sTaskHandlerTest.groovy: 6 new tests + 1 updated test
- K8sClientTest.groovy: 1 new test for job fallback exit code

Signed-off-by: Rayan Hassaine <r.hassaine@hartwigmedicalfoundation.nl>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Rayan Hassaïne <r.hassaine@hartwigmedicalfoundation.nl>
@rhassaine rhassaine requested a review from a team as a code owner March 13, 2026 13:53
@netlify

netlify bot commented Mar 13, 2026

Deploy Preview for nextflow-docs-staging canceled.

| Name | Link |
|---|---|
| 🔨 Latest commit | 1d6afcb |
| 🔍 Latest deploy log | https://app.netlify.com/projects/nextflow-docs-staging/deploys/69b416d19f94d00008d7ccb4 |
