
Conversation

nojnhuh

@nojnhuh nojnhuh commented Sep 19, 2025

  • One-line PR description: Add KEP to define how DRA drivers can report permanent failures and how the kubelet should handle those.
  • Other comments:

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 19, 2025
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 19, 2025
KEP-5322: DRA: Handle permanent driver allocation failures
@nojnhuh
Author

nojnhuh commented Sep 19, 2025

I've filled out the Summary, Motivation, and User Stories sections in case there's any feedback there. I'm continuing to fill out the rest of the KEP.

/cc @klueska @SergeyKanzhelev @lauralorenz @pohly @jackfrancis

@nojnhuh
Author

nojnhuh commented Sep 19, 2025

/wg device-management

@k8s-ci-robot k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label Sep 19, 2025
@nojnhuh nojnhuh moved this from 🆕 New to 🏗 In progress in Dynamic Resource Allocation Sep 19, 2025
@pohly pohly moved this from 🏗 In progress to 👀 In review in Dynamic Resource Allocation Sep 22, 2025
Fill out the rest of the doc
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 29, 2025

### Kubelet Handling
Member


How do we "recommend" protecting against re-scheduling to that same node by the scheduler? Should we recommend that the DRA driver MUST taint the resource (or delete it) before returning the permanent failure?

Or should we restart the conversation on marking a node as "not suitable" for a Pod for a while?

Author


Yes, I think a taint is the recommended approach when the issue is with the device that got allocated. There may be cases though where the issue is in the opaque parameters associated with a request. In those instances where a driver could successfully allocate the same device with a different set of parameters, a taint might get in the way more than it would help.

I'll mention this in the doc.

for miscategorized errors to cause a Pod to terminally fail even when a
subsequent attempt might allow the Pod's startup to progress.

## Design Details
Member


What do we put into the pod status fields introduced in #4680 in this case?

Author


Since #4680 describes only the health of devices and this KEP also takes into account the other parameters associated with a request, I think we should let DRA drivers decide how the Pod status is utilized independently of this KEP. "This device is unhealthy" I think is different enough from "this request cannot be fulfilled" that coupling the two features in some way might be confusing.

If the issue is that these permanent errors from the DRA driver will be hard to diagnose, would generating an Event when one of a Pod's requests permanently fails be enough?

Member


I want to make sure we surface as much information as possible to explain why the pod didn't fit. The Pod admission failure message may have some details, but it may be too shallow. I wonder if we would be better off adding details into the Pod Status for each device. That way we can surface information when ALL devices cannot fit vs. a single device failing. Trying to squeeze all this information into a single Event or a single admission failure message may be very messy.

In any case, this KEP needs to write down exactly where the information goes so users can diagnose the admission problem.

Author


I think an Event is actually already generated in this case, but I can try forcing this situation to make sure: https://github.com/kubernetes/kubernetes/blob/8ebc216c595158389fa20c4fff75a8c84cbe3fff/pkg/kubelet/kuberuntime/kuberuntime_manager.go#L1328

Overall though, I agree that having a clear path to identify these issues is necessary. I'll document here that the kubelet logs and Events for the Pod will show the error but may need to be updated to indicate whether the error is permanent. Happy to add whatever else we need to make these more visible though.

Member


So will we be able to tell the customer about all DRA devices that are misconfigured (various plugins and different devices)? Or will the error be generic and only cover the first plugin? What if events are throttled after continuous rescheduling of a Pod to the same Node over and over? Imagine that after an hour of retries a user tries to understand what happened, but there are no more events. Will the pod admission failure clearly indicate what happened? Please specify.

Will it be possible to map the allocation error to the plugin and device (as in the fields in #4680)? That would make it very easy to understand.

Also, please add test cases ensuring that the exposed errors have some minimal level of detail.

Addressing Sergey's feedback
Contributor

@pohly pohly left a comment


Retitle to "KEP-5322: DRA: Handle permanent driver failures" to match the feature gate name? Please also rename the issue.

"allocation failure" is misleading because allocation happens during scheduling.

-->

- [X] Feature gate (also fill in values in `kep.yaml`)
- Feature gate name: DRAHandlePermanentDriverFailures
Contributor


I was about to suggest shortening to DRAPermanentDriverFailures, but `--feature-gates=DRAPermanentDriverFailures=true` looks like it asks for failures, so better not 😁

Rename, scrub usage of "allocate"
Address Patrick's other comments
Contributor

@swatisehgal swatisehgal left a comment


Please add the PRR approvers by creating a file at keps/prod-readiness/sig-node/5322.yaml so this PR can be properly tracked for PRR review. You can refer to the examples in keps/prod-readiness/sig-node for guidance on how to populate the file.

@nojnhuh
Author

nojnhuh commented Oct 2, 2025

Please add the PRR approvers by creating a file at keps/prod-readiness/sig-node/5322.yaml so this PR can be properly tracked for PRR review. You can refer to the examples in keps/prod-readiness/sig-node for guidance on how to populate the file.

Thank you! I've started a thread on Slack to see if any PRR reviewers are available to take this one.

@nojnhuh
Author

nojnhuh commented Oct 2, 2025

Retitle to "KEP-5322: DRA: Handle permanent driver failures" to match the feature gate name? Please also rename the issue.

I think I made this change everywhere except the title of this PR:

/retitle KEP-5322: DRA: Handle permanent driver failures

@k8s-ci-robot k8s-ci-robot changed the title KEP-5322: DRA: Handle permanent driver allocation failures KEP-5322: DRA: Handle permanent driver failures Oct 2, 2025
@jpbetz
Contributor

jpbetz commented Oct 3, 2025

@nojnhuh Would you also add a corresponding file under https://github.com/kubernetes/enhancements/tree/master/keps/prod-readiness/sig-node for this KEP and list me as the PRR reviewer?

@nojnhuh
Author

nojnhuh commented Oct 4, 2025

@nojnhuh Would you also add a corresponding file under https://github.com/kubernetes/enhancements/tree/master/keps/prod-readiness/sig-node for this KEP and list me as the PRR reviewer?

Done, thanks!

Remove beta PRR approver
Contributor

@jpbetz jpbetz left a comment


/approve
For Alpha PRR

This appears to handle a new opt-in feature controlled by a field in the recommended way.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jpbetz, nojnhuh
Once this PR has been reviewed and has the lgtm label, please assign derekwaynecarr for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Update kep.yaml
// When true and a non-empty error is returned, indicates that the
// error is permanent. Permanent errors are expected to fail
// consistently for a given request and should not be retried.
bool permanent_error = 3;
Contributor


Instead of bool, can this be an enum of error type?

Author


I was wondering whether this permanent/non-permanent distinction is binary enough that a bool makes the most sense, but @johnbelamaric's point during the WG Device Management meeting today, about potentially handling user errors (e.g. bad opaque config) differently from things like hardware problems, is definitely harder to address without some more context.

I've updated to make this an enum to at least give us that extra flexibility.
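For illustration only (the Go names below are hypothetical mirrors of the proposed proto enum, not the real generated API), a driver-side helper classifying errors into these categories might look roughly like this:

```go
package driver

import "errors"

// Hypothetical mirror of the proposed proto enum; the real values would
// come from the generated DRA kubelet plugin API.
type ErrorType int32

const (
	ErrorTypeUnknown   ErrorType = 0 // possibly transient; the kubelet keeps retrying
	ErrorTypePermanent ErrorType = 1 // expected to fail the same way on every retry
)

// Hypothetical internal driver errors, used only for this sketch.
var (
	errBadOpaqueConfig = errors.New("invalid opaque device configuration")
	errDeviceFault     = errors.New("device reported an unrecoverable fault")
)

// classify maps an internal driver error to the error type reported back
// to the kubelet in the NodePrepareResources response.
func classify(err error) ErrorType {
	switch {
	case errors.Is(err, errBadOpaqueConfig), errors.Is(err, errDeviceFault):
		// Retrying the same request is not expected to help.
		return ErrorTypePermanent
	default:
		// Anything unclassified stays Unknown so the kubelet can retry.
		return ErrorTypeUnknown
	}
}
```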

`bool permanent_error` -> `error_type` enum
@k8s-ci-robot k8s-ci-robot added the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Oct 14, 2025
@k8s-ci-robot
Contributor

Keywords which can automatically close issues and at(@) or hashtag(#) mentions are not allowed in commit messages.

The list of commits with invalid commit messages:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@nojnhuh
Author

nojnhuh commented Oct 14, 2025

Keywords which can automatically close issues and at(@) or hashtag(#) mentions are not allowed in commit messages.

The list of commits with invalid commit messages:

I'll squash these commits once the PR is ready to merge and remove all the intermediate commit messages. I've been adding those mostly for myself to track changes here.

Unknown = 0;
// Permanent errors are expected to fail consistently
// for a given request and should not be retried.
Permanent = 1;
Member


Should we distinguish "bad config" from "failed device"?

Author


I'm trying to think of how the kubelet would behave differently if it knew about both.

Marking a Pod as Failed when a device has failed and triggering a higher-level controller to delete that Pod and create a new one (which lands on a different node after the DRA driver taints that device) seems like a reasonable course of action.

How should bad config be handled? If rescheduling an equivalent Pod/ResourceClaim won't help, is marking the Pod Failed not the right thing to do? Since ResourceClaims, ResourceClaimTemplates, and the references to them in Pods are all immutable, I don't know if there's a more appropriate status. The Pod and its ResourceClaims need to be deleted anyway.

If a Pod transitioning to a Failed phase is common to both scenarios, then I think the only difference would be whether or not a taint is applied to the device. Since drivers can manage taints themselves, maybe the kubelet doesn't need to know the difference?

Member


My suggestion here (along the lines of #5549 (comment)) is to introduce that distinction early. At a minimum, the message to the user will be different. At maximum, we may need a different reaction to it: retrying slowly for the device failure (waiting for the controller to decide what to do based on taints) vs. failing the admission for bad config and marking the node-device combination as not suitable.

Author


The message from the driver does bubble up to an Event on the Pod at least. This is a synthetic example from the dra-example-driver showing invalid config; the message could probably be made more explicit that it indicates a user error. The message might instead indicate a problem with the device or some other kind of error. Is there a better place to mirror this message from the driver?

Events:
  Type     Reason                         Age   From               Message
  ----     ------                         ----  ----               -------
  Normal   Scheduled                      8s    default-scheduler  Successfully assigned gpu-test5/dep0-798858f588-nnpsj to dra-example-driver-cluster-worker
  Warning  FailedPrepareDynamicResources  8s    kubelet            Failed to prepare dynamic resources: prepare dynamic resources: NodePrepareResources failed for ResourceClaim dep0-798858f588-nnpsj-shared-gpus-jblsv: error preparing devices for claim 3cf849a7-5421-49e1-bb03-92fab77ad38f: prepare failed: error applying GPU config: too many partitions

I'm wondering though, if a Pod+ResourceClaim can never start because of how a user defined it, whether Failed is the right terminal outcome, since that means Pod controllers will continually delete and recreate the same thing. But it's also wasteful for the kubelet to continuously check in on a Pod that we know won't ever succeed. Is there an existing well-known way we can express that a Pod will not be able to start on this Node or any other Node?

@klueska Can you think of anything more that drivers can do to prevent misconfiguration in the first place? Are there cases that webhook validation can't catch?

prevent the replacement Pod from being allocated the same device which is
expected to fail again the same way.

When [device taints and tolerations](https://kep.k8s.io/5055) are available, DRA
Member


So once tainted, the Pod will be deleted, correct? Why would we then need the kubelet to handle this case in a special way? Would it be best to let the controller delete the Pod in this case universally?

This way, the only error a DRA plugin would report as permanent is "bad config" (an unfit device).

And all permanent failure race conditions will be handled the same way by the controller outside of this KEP.

Author


I don't think the kubelet needs to do anything special here other than mark the Pod as Failed when a DRA driver returns a permanent error. This is the rough sequence of events I'd expect when a driver encounters a problem with the allocated device.

  1. Pod is scheduled onto a Node
  2. kubelet invokes DRA driver to prepare resources
  3. DRA driver encounters a permanent error while preparing resources
  4. DRA driver taints the device
  5. DRA driver returns the permanent error to the kubelet
  6. kubelet marks the Pod as Failed
  7. A higher-level Pod controller sees one of its Pods is Failed and tries to replace it
  8. Scheduler allocates an untainted device for the replacement Pod

When the user-provided config is invalid and rescheduling won't help though, I can see where marking the Pod Failed could cause Pod controllers to spin on the same error forever. @klueska How was the NVIDIA driver planning to handle permanent errors caused by user misconfiguration?
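As a minimal sketch of step 6 above, assuming the error classification is plumbed through to the kubelet's resource-preparation path (all names here are illustrative, not the actual kubelet code):

```go
package kubelet

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// Same hypothetical classification as in the earlier sketch.
type ErrorType int32

const ErrorTypePermanent ErrorType = 1

// handlePrepareError sketches how the kubelet could react to a
// NodePrepareResources error, based on the driver's classification.
func handlePrepareError(pod *v1.Pod, errType ErrorType, prepErr error) (retry bool) {
	if errType == ErrorTypePermanent {
		// Permanent: fail the Pod so a higher-level controller replaces it;
		// the replacement lands on an untainted device once the driver has
		// tainted the failed one.
		pod.Status.Phase = v1.PodFailed
		pod.Status.Reason = "DRAResourcePreparationFailed" // hypothetical reason string
		pod.Status.Message = fmt.Sprintf("preparing DRA resources: %v", prepErr)
		return false
	}
	// Unknown (possibly transient): keep today's behavior and retry
	// preparation on the next sync.
	return true
}
```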

Member

@SergeyKanzhelev SergeyKanzhelev Oct 14, 2025


My point here is that tainting of a device WILL result in pod deletion by the taint controller:

A new controller watches tainted devices and deletes pods using them unless they tolerate the device taint

Why would we have two different ways to delete a Pod when the device is tainted as permanently failed? It would be best to keep those unified. Otherwise, a millisecond difference in device health detection results in different error messages on the Pod.

This KEP may be useful for misconfiguration-based issues. And this is where the discussion on handling the crash loop backoff is very important.

Author


The KEP mentions using NoSchedule taints as a lowest common denominator, which will prevent the same error from occurring later without affecting other Pods that happen to already be running successfully with that device. I see, though, how using a NoExecute taint accomplishes the same overall goal.

@klueska Can you think of any reason we wouldn't instead encourage drivers to do what Sergey is suggesting and simply put a NoExecute taint on the device and let the eviction controller delete the Pod? Such as if it's likely that a "permanent error" related to the device itself only affects new Pods vs. Pods already using it?

Contributor


Yeah, the only reason I can see for doing this rather than going the taint direction is if a driver can allocate devices differently to different Pods, like some Pods' requests would fail but others would not. I would be kind of surprised if that happened, and maybe we need a more concrete use case to catch that situation if so.

// Permanent errors are expected to fail consistently
// for a given request and should not be retried.
Permanent = 1;
}
Contributor


Shouldn't there be a transient type or something? It seems odd to classify all non-permanent failures as unknown.

@nojnhuh
Author

nojnhuh commented Oct 16, 2025

Overall based on the feedback from @SergeyKanzhelev and @haircommander in #5549 (comment) about using NoExecute taints to express an error being "permanent" and my hunch from #5549 (comment) that webhook validation can catch most kinds of misconfiguration, I'm leaning toward recommending we try those approaches for 1.35 instead of implementing the changes described in this KEP. The use cases requiring more advanced error handling aren't clear yet so I can't quite justify the added complexity in the kubelet.

@pohly pohly moved this from 👀 In review to ✅ Done in Dynamic Resource Allocation Oct 17, 2025