Add KEP for DRA: Extended Resource #5136

Open · wants to merge 45 commits into master

Conversation


@yliaog yliaog commented Feb 5, 2025

  • One-line PR description:
    Add new KEP for supporting extended resource requests in DRA

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Feb 5, 2025
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 5, 2025
@yliaog
Author

yliaog commented Feb 5, 2025

/assign @johnbelamaric


@johnbelamaric johnbelamaric left a comment

Awesome, thanks @yliaog

@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Feb 6, 2025
* It is a singleton. There is at most one resource claim object for a given
extended resource in a given namespace.
* It is not owned by a pod; its owner reference is null.
* Its field `status.allocation.devices` is used; other fields are unused,
Member

The DeviceRequestAllocationResult has a Request field that is a reference to the request in the ResourceClaim spec. But in this situation, there is no corresponding request in the spec (since the spec is empty). Drivers might use this value, among other things for looking up any device configuration. How do we plan to handle this?
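
For reference, a minimal sketch of the type in question, assuming the v1beta1 `resource.k8s.io` Go types (abridged; field sets vary across API versions):

// DeviceRequestAllocationResult records one allocated device.
type DeviceRequestAllocationResult struct {
    // Request names the request in the claim's spec that this result
    // satisfies. With an empty spec there is no request to point at,
    // which is the gap raised above.
    Request string
    // Driver, Pool and Device identify the allocated device.
    Driver string
    Pool   string
    Device string
}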

Author

A DRA driver has two parts:
1. The first part handles DRA resource claim actuation.
2. The second part handles extended resource actuation.

The first part does not need to act on this special resource claim object; it is ignored there.

The second part needs to read the list of allocated devices from the special resource claim object and use only the devices in that list for extended resource actuation.
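
A minimal sketch of that second part, assuming the v1beta1 `resource.k8s.io` Go types; `myDriverName` is a placeholder:

import resourceapi "k8s.io/api/resource/v1beta1"

// devicesForDriver returns only the devices from the special claim's
// allocation result that were allocated through this driver.
func devicesForDriver(claim *resourceapi.ResourceClaim, myDriverName string) []string {
    var devices []string
    if claim.Status.Allocation == nil {
        return devices // nothing allocated yet
    }
    for _, result := range claim.Status.Allocation.Devices.Results {
        if result.Driver == myDriverName {
            devices = append(devices, result.Device)
        }
    }
    return devices
}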

Contributor

This pushes the responsibility of choosing a device back into the DRA driver. I don't think this is the right direction from an architectural perspective.

cc @klueska

I also very much dislike that DRA drivers need to be updated at all.

Member

@johnbelamaric johnbelamaric Feb 7, 2025

No, I think you misunderstood (or I did). As I read it, the allocations are made by the scheduler and stored in the special singleton resource claim, not decided by the driver.

Author

The DRA driver has to be updated to advertise devices as 'extended resources'.

That said, I revised the design based on the feedback, thanks for the comments!

A cluster node installs either a DRA driver or a device plugin driver for a given named resource. Devices are picked at scheduling time and communicated to the kubelet and the DRA driver through the special resource claim.

Contributor

No, I think you misunderstood (or I did). As I read it, the allocations are made by the scheduler and stored in the special singleton resource claim, not decided by the driver.

That's a bit different from what I understood. If it's still the scheduler which picks devices and tells the driver about it, then it's fine.

From #5136 (review):

When you and I talked on Friday, we discussed allowing kubelet to remain unchanged. Instead, it would call the Device Plugin grpc API for those with extended resources, but since that grpc API would be handled by the DRA driver, it would know, when receiving calls on that API, that it should look for the special resource claim.

That probably goes back to my comment above: "I also very much dislike that DRA drivers need to be updated at all."

Why should we put additional work on all DRA drivers, now and in the future, if we can instead do something once in the kubelet? It has implications for the graduation of this feature, but should that be a deciding factor?

I know that I've said that we want to keep the kubelet as dumb as possible, but in this case I think it's simpler overall to do this in the kubelet. I'm also a bit worried about the implications of skipping some of the usual admission checks that the kubelet does for claims, like "reserved for". Perhaps that doesn't matter, but then we need to explain in the KEP why.

Author

My initial thought was to minimize kubelet changes and shift the work to the device driver. But after some more thought, I agree it's better to do the work once, in the kubelet, so the device driver has less work to do and fewer chances to implement it wrongly.

@pohly
Contributor

pohly commented Feb 6, 2025

/cc

@yliaog yliaog force-pushed the master branch 11 times, most recently from 7ccd621 to a1d3c16 Compare February 6, 2025 23:30

@johnbelamaric
Member

/approve

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 7, 2025

@pohly pohly left a comment

/hold

For SIG Scheduling approval (currently has approval through @johnbelamaric for PRR, but not from the SIG), and for squashing into one commit before merging.

@github-project-automation github-project-automation bot moved this from Needs Triage to In Progress in SIG Apps May 8, 2025
@github-project-automation github-project-automation bot moved this from !SIG Auth to Changes Requested in SIG Auth May 8, 2025
@github-project-automation github-project-automation bot moved this from Needs Triage to In Progress in SIG CLI May 8, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 8, 2025

@johnbelamaric johnbelamaric left a comment

@pohly thanks for the hold, I should have waited on the approve :)

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnbelamaric, yliaog

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@yliaog
Author

yliaog commented May 8, 2025

/assign @dom4ha

@yliaog
Author

yliaog commented May 14, 2025

/assign @klueska

@yliaog
Author

yliaog commented May 14, 2025

/assign @thockin

@johnbelamaric
Member

/assign @liggitt

- "@thockin" # API Review

see-also:
- "/keps/sig-node/4381-dra-structured-parameters"
Member

Are there other references to the original design(s) for extended resources that would be helpful to reference? Either KEPs, or original designs if they predate KEPs?

A quick search in the KEPs folder showed these; I wasn't sure whether they were related:

special `ResourceClaim` for the pod with the allocation result recording
which devices were picked. More details on this special `ResourceClaim`
follow below. When using extended resources advertised for a node by device
plugin, the existing resource tracking reserves them.
Member

  1. Does that mean the scheduler needs to already know which node the pod will go to before it creates the resourceclaim, to know whether the node is providing the extended resource via a legacy driver or a DRA driver?
  2. Is the scheduler able to run its full node selection algorithm in the absence of this resource claim, then create it and satisfy it, without invalidating already-run parts of the scheduler pipeline?
  3. Does this require relocating the existing handling of extended resource matching in the scheduler?
  4. Consider how the selection / resourceclaim creation approach will work in the scheduler when there are multiple extended resources requested, some of which use DRA and some of which use legacy device plugins.
  5. Describe and test the failure / recovery / cleanup path if the scheduler succeeds in creating some ResourceClaim objects for a pod's extended resources and fails to create others.

which devices were picked. More details on this special `ResourceClaim`
follow below. When using extended resources advertised for a node by device
plugin, the existing resource tracking reserves them.
1. Introduce a field `ExtendedResourceClaimStatus` to pod's `Status`, such that:
Member

plural, since there can be more than one?

Suggested change
1. Introduce a field `ExtendedResourceClaimStatus` to pod's `Status`, such that:
1. Introduce a field `ExtendedResourceClaimStatuses` to pod's `Status`, such that:

// Status of extended resource claim backed by DRA.
// +featureGate=DynamicResourceAllocation
// +optional
ExtendedResourceClaimStatus *PodExtendedResourceClaimStatus
Member

Is it correct to assume that a single resourceclaim is sufficient for the extended resources requested by a pod?

A pod requesting a GPU and a high-bandwidth network device could have two extended resources, satisfied by two different DRA drivers ... is it normal to batch those requests into a single resource claim?
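
For context, the API does permit multiple requests in one claim; a minimal sketch assuming the v1beta1 types, with placeholder class names (whether batching them is the right design is the open question above):

import resourceapi "k8s.io/api/resource/v1beta1"

// One claim whose spec carries two requests, each resolved through a
// different DeviceClass and therefore potentially a different driver.
spec := resourceapi.ResourceClaimSpec{
    Devices: resourceapi.DeviceClaim{
        Requests: []resourceapi.DeviceRequest{
            {Name: "gpu", DeviceClassName: "example.com-gpu"},
            {Name: "nic", DeviceClassName: "example.com-hbn"},
        },
    },
}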

Comment on lines +428 to +429
// ContainerName is the unique container name within the pod.
ContainerName string
Member

I can't recall: are container names guaranteed unique across container types (initContainers, containers, ephemeralContainers)?

If the pod still needs to be considered by the plugin, then it checks if the
special resource claim for extended resources backed by DRA has been created
before by scheduler, by checking resource claim name having pod name in the
annotation `resource.kubernetes.io/extened-resource-claim: pod-name`.
Member

Suggested change
annotation `resource.kubernetes.io/extened-resource-claim: pod-name`.
annotation `resource.kubernetes.io/extended-resource-claim: pod-name`.

Member

Does it do this check using its local informer-fed cache? If so, be aware that the cache could be stale and might not have observed a resourceclaim it created in a previous loop.

Describe how the scheduler recovers from a scenario where two resourceclaims like this are incorrectly created for the same pod.

Member

If the resource claim already exists, it will have been created for a specific node, right? Does this step have to make sure the node the claim was created for is still in the set of viable nodes here? If it isn't, what happens?

before by scheduler, by checking resource claim name having pod name in the
annotation `resource.kubernetes.io/extened-resource-claim: pod-name`.

If found, scheduler would reuse it. If not found, scheduler would create a
Member

Clarify whether the "create" mentioned here means "create in the API" or "create in-memory". Since ResourceClaimSpec is immutable in the API on update, I assume it means in-memory in the scheduler. That means the scheduler bookkeeping needs to know whether the claim has been persisted yet, i.e. whether its spec is still mutable.
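
A hypothetical sketch of the bookkeeping that implies; the names are illustrative, not from the KEP:

import resourceapi "k8s.io/api/resource/v1beta1"

// specialClaimState tracks one special claim the scheduler is managing.
type specialClaimState struct {
    claim *resourceapi.ResourceClaim
    // persisted is true once the claim has been written to the API;
    // from then on its spec is immutable and can only be changed by
    // deleting and recreating the claim.
    persisted bool
}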

container name and the extended resource backed by DRA name inside the container.
* Its `status.allocation.devices` and `status.allocation.reservedFor` are
used.
* It does not have annotation `resource.kubernetes.io/pod-claim-name:` as
Member

Also, the scheduler can encounter scenarios where it creates the resourceclaim object, a subsequent step fails, it has to retry later, and in the meantime the DeviceClass mappings to extended resources have changed.

Would having one resourceclaim per {pod, extended resource type} combination make this situation easier to deal with?

If a pod has an extended resource backed by DRA, and the node does *not* have
a device plugin to provide the capacity for the resource, then the
dynamicresource plugin needs to try to allocate the resource by filling in the
special claim's `Spec.Devices.Requests` field.
Member

Is this manipulating a separate copy of the special claim per node (since whether to use DRA for the extended resource requests depends on whether the particular node in consideration has device plugins for those extended resources or not)?


The allocator needs to be modified to allow for the special resource claim for
extended resource backed by DRA, which could vary by node. The `Allocate`
method takes the claim as a parameter, in addition to the node parameter. The
Member

It looks like that code assumes a single extended resource referenced from a pod, which I don't think is correct. I agree it would be better to limit this to informing what is in the claimsToAllocate list handled inside Allocate(). That would need to become node-specific, but would be far more contained, I think.

Manipulating / needing to deepcopy a separate copy of a single special claim per node seems problematic (since whether to use DRA for the extended resource requests depends on whether the particular node in consideration has device plugins for a particular combination of extended resources or not)

This seems like another place where partitioning claims by extended resource type would help:

  1. in PreFilter, for every DRA-backed extended resource a pod references, make an in-memory ResourceClaim for that extended resource
  2. when filtering nodes in DynamicResources#Filter, compute claimsToAllocate inside Allocate() by adding in the in-memory claims for resources the node doesn't advertise legacy capacity for
  3. when filtering nodes in Fit#Filter, allow fitsRequest to ignore DRA-backed extended resources the node doesn't advertise legacy capacity for
  4. in Reserve, create the spec and update the status for the in-memory ResourceClaims for the extended resources the selected node did not advertise capacity for
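
A rough sketch of steps 1 and 2 of that flow; the helpers `draBackedExtendedResources`, `newInMemoryClaim`, and `advertisesCapacity` are illustrative, not actual scheduler-framework APIs:

import (
    v1 "k8s.io/api/core/v1"
    resourceapi "k8s.io/api/resource/v1beta1"
)

// PreFilter: one in-memory claim per DRA-backed extended resource.
func preFilter(pod *v1.Pod) map[v1.ResourceName]*resourceapi.ResourceClaim {
    claims := map[v1.ResourceName]*resourceapi.ResourceClaim{}
    for resName := range draBackedExtendedResources(pod) {
        claims[resName] = newInMemoryClaim(pod, resName)
    }
    return claims
}

// Filter: allocate only the claims for resources the node does not
// already advertise via a device plugin; steps 3 and 4 then persist
// the claims for the selected node in Reserve.
func claimsToAllocate(node *v1.Node, claims map[v1.ResourceName]*resourceapi.ResourceClaim) []*resourceapi.ResourceClaim {
    var toAllocate []*resourceapi.ResourceClaim
    for resName, claim := range claims {
        if !advertisesCapacity(node, resName) {
            toAllocate = append(toAllocate, claim)
        }
    }
    return toAllocate
}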
