Add KEP for DRA: Extended Resource #5136

Open · wants to merge 45 commits into master

Conversation


@yliaog yliaog commented Feb 5, 2025

  • One-line PR description:
    Add new KEP for supporting extended resource requests in DRA

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Feb 5, 2025
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 5, 2025
@yliaog
Author

yliaog commented Feb 5, 2025

/assign @johnbelamaric


@johnbelamaric johnbelamaric left a comment

Awesome, thanks @yliaog

@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Feb 6, 2025
* It is a singleton. There is at most one resource claim object for a given
extended resource in a given namespace.
* It is not owned by a pod; its owner reference is null.
* Its field `status.allocation.devices` is used; other fields are unused,
Member

The DeviceRequestAllocationResult has a Request field that is a reference to the request in the ResourceClaim spec. But in this situation, there is no corresponding request in the spec (since the spec is empty). Drivers might use this value, among other things for looking up any device configuration. How do we plan to handle this?
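
For reference, a minimal sketch of the type in question, assuming the v1beta1 `resource.k8s.io` Go types (abridged; field sets vary across API versions):

// DeviceRequestAllocationResult records one allocated device.
type DeviceRequestAllocationResult struct {
    // Request names the request in the claim's spec that this result
    // satisfies. With an empty spec there is no request to point at,
    // which is the gap raised above.
    Request string
    // Driver, Pool and Device identify the allocated device.
    Driver string
    Pool   string
    Device string
}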

Author

A DRA driver has two parts:
1. The first part handles DRA resource claim actuation.
2. The second part handles extended resource actuation.

The first part does not need to act on this special resource claim object; it is ignored there.

The second part needs to read the list of allocated devices from the special resource claim object and use only the devices in that list for extended resource actuation.
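
A minimal sketch of that second part, assuming the v1beta1 `resource.k8s.io` Go types; `myDriverName` is a placeholder:

import resourceapi "k8s.io/api/resource/v1beta1"

// devicesForDriver returns only the devices from the special claim's
// allocation result that were allocated through this driver.
func devicesForDriver(claim *resourceapi.ResourceClaim, myDriverName string) []string {
    var devices []string
    if claim.Status.Allocation == nil {
        return devices // nothing allocated yet
    }
    for _, result := range claim.Status.Allocation.Devices.Results {
        if result.Driver == myDriverName {
            devices = append(devices, result.Device)
        }
    }
    return devices
}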

Contributor

This pushes the responsibility of choosing a device back into the DRA driver. I don't think this is the right direction from an architectural perspective.

cc @klueska

I also very much dislike that DRA drivers need to be updated at all.

Member

@johnbelamaric johnbelamaric Feb 7, 2025

No, I think you misunderstood (or I did). As I read it, the allocations are made by the scheduler and stored in the special singleton resource claim, not decided by the driver.

Author

The DRA driver has to be updated to advertise devices as 'extended resources'.

That said, I revised the design based on the feedback, thanks for the comments!

A cluster node installs either a DRA driver or a device plugin driver for a given named resource. Devices are picked at scheduling time and communicated to the kubelet and the DRA driver through the special resource claim.

Contributor

No, I think you misunderstood (or I did). As I read it, the allocations are made by the scheduler and stored in the special singleton resource claim, not decided by the driver.

That's a bit different from what I understood. If it's still the scheduler which picks devices and tells the driver about it, then it's fine.

From #5136 (review):

When you and I talked on Friday, we discussed allowing kubelet to remain unchanged. Instead, it would call the Device Plugin grpc API for those with extended resources, but since that grpc API would be handled by the DRA driver, it would know, when receiving calls on that API, that it should look for the special resource claim.

That probably goes back to my comment above: "I also very much dislike that DRA drivers need to be updated at all."

Why should we put additional work on all DRA drivers, now and in the future, if we can instead do something once in the kubelet? It has implications for the graduation of this feature, but should that be a deciding factor?

I know that I've said that we want to keep the kubelet as dumb as possible, but in this case I think it's simpler overall to do this in the kubelet. I'm also a bit worried about the implications of skipping some of the usual admission checks that the kubelet does for claims, like "reserved for". Perhaps that doesn't matter, but then we need to explain in the KEP why.

Author

My initial thought was to minimize kubelet changes and shift the work to the device driver. But after some more thought, I agree it's better to do the work once, in the kubelet, so the device driver has less work to do and fewer chances to implement it wrongly.

@pohly
Contributor

pohly commented Feb 6, 2025

/cc

@yliaog yliaog force-pushed the master branch 11 times, most recently from 7ccd621 to a1d3c16 Compare February 6, 2025 23:30

@johnbelamaric
Member

/approve

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 7, 2025

@pohly pohly left a comment

/hold

For SIG Scheduling approval (currently has approval through @johnbelamaric for PRR, but not from the SIG), and for squashing into one commit before merging.

@github-project-automation github-project-automation bot moved this from Needs Triage to In Progress in SIG Apps May 8, 2025
@github-project-automation github-project-automation bot moved this from !SIG Auth to Changes Requested in SIG Auth May 8, 2025
@github-project-automation github-project-automation bot moved this from Needs Triage to In Progress in SIG CLI May 8, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 8, 2025

@johnbelamaric johnbelamaric left a comment

@pohly thanks for the hold, I should have waited on the approve :)

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnbelamaric, yliaog

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@yliaog
Author

yliaog commented May 8, 2025

/assign @dom4ha

@yliaog
Author

yliaog commented May 14, 2025

/assign @klueska

@yliaog
Author

yliaog commented May 14, 2025

/assign @thockin

@johnbelamaric
Member

/assign @liggitt

- "@thockin" # API Review

see-also:
- "/keps/sig-node/4381-dra-structured-parameters"
Member

Are there other references to the original design(s) for extended resources that would be helpful to reference? Either KEPs, or original designs if they predate KEPs?

A quick search in the KEPs folder showed these; I wasn't sure whether they were related:

special `ResourceClaim` for the pod with the allocation result recording
which devices were picked. More details on this special `ResourceClaim`
follow below. When using extended resources advertised for a node by device
plugin, the existing resource tracking reserves them.
Member

  1. Does that mean the scheduler needs to already know which node the pod will go to before it creates the resourceclaim, to know whether the node is providing the extended resource via a legacy driver or a DRA driver?
  2. Is the scheduler able to run its full node selection algorithm in the absence of this resource claim, then create it and satisfy it, without invalidating already-run parts of the scheduler pipeline?
  3. Does this require relocating the existing handling of extended resource matching in the scheduler?
  4. Consider how the selection / resourceclaim creation approach will work in the scheduler when there are multiple extended resources requested, some of which use DRA and some of which use legacy device plugins.
  5. Describe and test the failure / recovery / cleanup path if the scheduler succeeds in creating some ResourceClaim objects for a pod's extended resources and fails to create others.

which devices were picked. More details on this special `ResourceClaim`
follow below. When using extended resources advertised for a node by device
plugin, the existing resource tracking reserves them.
1. Introduce a field `ExtendedResourceClaimStatus` to pod's `Status`, such that:
Member

plural, since there can be more than one?

Suggested change
1. Introduce a field `ExtendedResourceClaimStatus` to pod's `Status`, such that:
1. Introduce a field `ExtendedResourceClaimStatuses` to pod's `Status`, such that:

// Status of extended resource claim backed by DRA.
// +featureGate=DynamicResourceAllocation
// +optional
ExtendedResourceClaimStatus *PodExtendedResourceClaimStatus
Member

Is it correct to assume that a single resourceclaim is sufficient for the extended resources requested by a pod?

A pod requesting a GPU and a high-bandwidth network device could have two extended resources, satisfied by two different DRA drivers ... is it normal to batch those requests into a single resource claim?
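
For context, the API does permit multiple requests in one claim; a minimal sketch assuming the v1beta1 types, with placeholder class names (whether batching them is the right design is the open question above):

import resourceapi "k8s.io/api/resource/v1beta1"

// One claim whose spec carries two requests, each resolved through a
// different DeviceClass and therefore potentially a different driver.
spec := resourceapi.ResourceClaimSpec{
    Devices: resourceapi.DeviceClaim{
        Requests: []resourceapi.DeviceRequest{
            {Name: "gpu", DeviceClassName: "example.com-gpu"},
            {Name: "nic", DeviceClassName: "example.com-hbn"},
        },
    },
}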

Comment on lines +428 to +429
// ContainerName is the unique container name within the pod.
ContainerName string
Member

I can't recall: are container names guaranteed unique across container types (initContainers, containers, ephemeralContainers)?

If the pod still needs to be considered by the plugin, then it checks if the
special resource claim for extended resources backed by DRA has been created
before by scheduler, by checking resource claim name having pod name in the
annotation `resource.kubernetes.io/extened-resource-claim: pod-name`.
Member

Suggested change
annotation `resource.kubernetes.io/extened-resource-claim: pod-name`.
annotation `resource.kubernetes.io/extended-resource-claim: pod-name`.

Member

Does it do this check using its local informer-fed cache? If so, be aware that the cache could be stale and might not have observed a resourceclaim it created in a previous loop.

Describe how the scheduler recovers from a scenario where two resourceclaims like this are incorrectly created for the same pod.

Member

If the resource claim already exists, it will have been created for a specific node, right? Does this step have to make sure the node the claim was created for is still in the set of viable nodes here? If it isn't, what happens?

before by scheduler, by checking resource claim name having pod name in the
annotation `resource.kubernetes.io/extened-resource-claim: pod-name`.

If found, scheduler would reuse it. If not found, scheduler would create a
Member

Clarify whether the "create" mentioned here means "create in the API" or "create in-memory". Since ResourceClaimSpec is immutable in the API on update, I assume it means in-memory in the scheduler. That means the scheduler bookkeeping needs to know whether the claim has been persisted yet, i.e. whether its spec is still mutable.
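
A hypothetical sketch of the bookkeeping that implies; the names are illustrative, not from the KEP:

import resourceapi "k8s.io/api/resource/v1beta1"

// specialClaimState tracks one special claim the scheduler is managing.
type specialClaimState struct {
    claim *resourceapi.ResourceClaim
    // persisted is true once the claim has been written to the API;
    // from then on its spec is immutable and can only be changed by
    // deleting and recreating the claim.
    persisted bool
}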

container name and the extended resource backed by DRA name inside the container.
* Its `status.allocation.devices` and `status.allocation.reservedFor` are
used.
* It does not have annotation `resource.kubernetes.io/pod-claim-name:` as
Member

Also, the scheduler can encounter scenarios where it creates the resourceclaim object, a subsequent step fails, it has to retry later, and in the meantime the DeviceClass mappings to extended resources have changed.

Would having one resourceclaim per {pod, extended resource type} combination make this situation easier to deal with?

If a pod has an extended resource backed by DRA, and the node does *not* have
a device plugin to provide the capacity for the resource, then the
dynamicresource plugin needs to try to allocate the resource by filling in the
special claim's `Spec.Devices.Requests` field.
Member

Is this manipulating a separate copy of the special claim per node (since whether to use DRA for the extended resource requests depends on whether the particular node in consideration has device plugins for those extended resources or not)?


The allocator needs to be modified to allow for the special resource claim for
extended resource backed by DRA, which could vary by node. The `Allocate`
method takes the claim as a parameter, in addition to the node parameter. The
Member

It looks like that code assumes a single extended resource referenced from a pod, which I don't think is correct. I agree it would be better to limit this to informing what is in the claimsToAllocate list handled inside Allocate(). That would need to become node-specific, but would be far more contained, I think.

Manipulating / needing to deepcopy a separate copy of a single special claim per node seems problematic (since whether to use DRA for the extended resource requests depends on whether the particular node in consideration has device plugins for a particular combination of extended resources or not)

This seems like another place where partitioning claims by extended resource type would help:

  1. in PreFilter, for every DRA-backed extended resource a pod references, make an in-memory ResourceClaim for that extended resource
  2. when filtering nodes in DynamicResources#Filter, compute claimsToAllocate inside Allocate() by adding in the in-memory claims for resources the node doesn't advertise legacy capacity for
  3. when filtering nodes in Fit#Filter, allow fitsRequest to ignore DRA-backed extended resources the node doesn't advertise legacy capacity for
  4. in Reserve, create the spec and update the status for the in-memory ResourceClaims for the extended resources the selected node did not advertise capacity for
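
A rough sketch of steps 1 and 2 of that flow; the helpers `draBackedExtendedResources`, `newInMemoryClaim`, and `advertisesCapacity` are illustrative, not actual scheduler-framework APIs:

import (
    v1 "k8s.io/api/core/v1"
    resourceapi "k8s.io/api/resource/v1beta1"
)

// PreFilter: one in-memory claim per DRA-backed extended resource.
func preFilter(pod *v1.Pod) map[v1.ResourceName]*resourceapi.ResourceClaim {
    claims := map[v1.ResourceName]*resourceapi.ResourceClaim{}
    for resName := range draBackedExtendedResources(pod) {
        claims[resName] = newInMemoryClaim(pod, resName)
    }
    return claims
}

// Filter: allocate only the claims for resources the node does not
// already advertise via a device plugin; steps 3 and 4 then persist
// the claims for the selected node in Reserve.
func claimsToAllocate(node *v1.Node, claims map[v1.ResourceName]*resourceapi.ResourceClaim) []*resourceapi.ResourceClaim {
    var toAllocate []*resourceapi.ResourceClaim
    for resName, claim := range claims {
        if !advertisesCapacity(node, resName) {
            toAllocate = append(toAllocate, claim)
        }
    }
    return toAllocate
}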
