
KEP-5224: Node Resource Discovery #5319


Open

wants to merge 2 commits into base: master

Conversation

Contributor

@marquiz commented May 19, 2025

  • One-line PR description: Node Resource Discovery KEP
  • Other comments:

@k8s-ci-robot added the cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA) label May 19, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: marquiz
Once this PR has been reviewed and has the lgtm label, please assign jpbetz, mrunalp for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the kind/kep (Categorizes KEP tracking issues and PRs modifying the KEP directory), sig/node (Categorizes an issue or PR as relevant to SIG Node), and size/XXL (Denotes a PR that changes 1000+ lines, ignoring generated files) labels May 19, 2025
@marquiz
Contributor Author

marquiz commented May 19, 2025

@k8s-ci-robot
Contributor

@marquiz: GitHub didn't allow me to request PR reviews from the following users: Tal-or.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @mrunalp @haircommander @SergeyKanzhelev @yujuhong @ffromani @tallclair @Karthik-K-N @kad @Tal-or @kannon92 @pbetkier

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@marquiz mentioned this pull request May 6, 2025

```proto
// GetDynamicRuntimeConfig is a streaming interface for receiving dynamically
// changing runtime and node configuration.
rpc GetDynamicRuntimeConfig(DynamicRuntimeConfigRequest) returns (stream DynamicRuntimeConfigResponse) {}
```
Contributor

hmm this doesn't really seem like a runtime config..

Contributor Author

This name came into existence with the background idea of creating a more generic "channel" for the runtime to inform the kubelet about changes (without the kubelet polling). But I agree, the name is bad and probably the background idea too.

Related, @ffromani suggested having a completely separate service (e.g. DiscoveryService, in addition to RuntimeService and ImageService) for handling this.

Thoughts?

Contributor

yeah or maybe ResourceService or NodeService or something?

Contributor

but big +1 on separate CRI service

Contributor Author

Check, I will change this in the next update. I don't have strong opinions on this. One detail is that with a separate service there is the possibility to connect to a 3rd-party NodeService/ResourceService agent. That may be a good thing for some special users(?)
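For illustration of the separate-service idea discussed in this thread, below is a rough Go-interface sketch in the spirit of the kubelet's existing internal RuntimeService/ImageService abstractions. All names here (NodeService, ResourceTopologyZone, WatchResourceTopology) are hypothetical and not part of the KEP text.

```go
// Hypothetical sketch only: names and shapes are illustrative, not the
// KEP's actual API. It mirrors the idea of a third CRI service, next to
// RuntimeService and ImageService, dedicated to node resource discovery.
package discovery

import "context"

// ResourceTopologyZone is one node of the resource topology tree
// (for example a NUMA node or a core).
type ResourceTopologyZone struct {
	Name       string
	Type       string            // e.g. "Node", "Core"
	Parent     string            // name of the parent zone, "" for the root
	Resources  map[string]int64  // resource name -> capacity
	Attributes map[string]string // e.g. MachineID, BootID, SystemUUID on the root
}

// NodeService is a possible separate CRI service for node resource discovery.
type NodeService interface {
	// GetResourceTopology returns a snapshot of the node's resource topology.
	GetResourceTopology(ctx context.Context) ([]ResourceTopologyZone, error)
	// WatchResourceTopology streams a new snapshot whenever the runtime
	// detects a change, removing the need for kubelet-side polling.
	WatchResourceTopology(ctx context.Context) (<-chan []ResourceTopologyZone, error)
}
```

A dedicated service along these lines would also make it possible to point the kubelet at a third-party discovery agent, as noted above.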


```go
const (
ResourceTopologyZoneCore = "Core"
```

Contributor

these currently are a subset of the cadvisor machine info fields. is this all the kubelet uses?

Contributor Author

Yup, picked the ones that the kubelet is interested in. Of course we can add all that we can imagine (e.g. the current CPU topology levels from the Linux kernel, i.e. package, die, cluster, core).
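If the Linux CPU topology levels mentioned above were exposed, the constant set could grow roughly along these lines (a sketch only; apart from Core, these names are not defined in the KEP):

```go
// Illustrative only: possible zone types if the kernel's CPU topology
// levels were added; only "Core" appears in the quoted KEP snippet.
const (
	ResourceTopologyZonePackage = "Package"
	ResourceTopologyZoneDie     = "Die"
	ResourceTopologyZoneCluster = "Cluster"
	ResourceTopologyZoneCore    = "Core"
)
```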


The kubelet reads MachineID, BootID and SystemUUID from the attributes of the
resource topology tree. If this information is not present, the kubelet uses the
cAdvisor MachineInfo as a fallback.
Contributor

forever? or will it stop after GA? What if this info is never available? should the kubelet exit?

Contributor

oops I should read the section above: set the node not ready for the last question

Contributor Author

I think this is a good question in general: should MachineID, BootID and SystemUUID come from the runtime, or should the kubelet figure these out itself? I put them here to be able to ditch cAdvisor MachineInfo completely. Glad to hear opinions on this.
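As one concrete reading of the quoted paragraph, here is a minimal sketch of the fallback, assuming the identity fields arrive as plain string attributes on the topology tree and treating the fallback as all-or-nothing; the attribute keys, types and helper names are hypothetical, not the KEP's actual implementation.

```go
// Hypothetical sketch of the fallback described in the quoted text.
package nodeinfo

// machineIdentity is a minimal local stand-in for the fields the kubelet
// needs (MachineID, BootID, SystemUUID), whether they come from the CRI
// runtime's resource topology attributes or from cAdvisor MachineInfo.
type machineIdentity struct {
	MachineID  string
	BootID     string
	SystemUUID string
}

// identityFromTopology uses the attributes of the resource topology tree if
// all three fields are present, and otherwise falls back to the identity
// derived from cAdvisor MachineInfo, mirroring the quoted paragraph.
func identityFromTopology(attrs map[string]string, cadvisorFallback machineIdentity) machineIdentity {
	id := machineIdentity{
		MachineID:  attrs["MachineID"],
		BootID:     attrs["BootID"],
		SystemUUID: attrs["SystemUUID"],
	}
	if id.MachineID == "" || id.BootID == "" || id.SystemUUID == "" {
		return cadvisorFallback
	}
	return id
}
```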


A rollout could fail, e.g. because of a bug in the CRI runtime that makes it
return data the kubelet cannot consume. In this case the node will be set into
the NotReady state. Existing workloads should not be affected but new pods
Contributor

What should the cluster admin do in this case? Is NotReady the correct signal?

Contributor Author

The cluster admin can roll back. Regarding NotReady, I'm open to guidance and suggestions. I think the node should not be ready, as new stuff shouldn't be scheduled there. But what else? Events, conditions?


TBD.

- [ ] Events
Contributor

we probably want a metric if the kubelet falls back to cadvisor for a certain resource

Contributor Author

You mean a metric for falling back to cAdvisor? The thinking in this proposal is that it's all or nothing: either all resources (native, i.e. cpu, memory, hugepages and swap) come from cAdvisor, or none do.
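A gauge along the lines suggested here could look roughly like the sketch below, following k8s.io/component-base/metrics conventions; the metric name, help text and registration point are hypothetical and not part of the KEP.

```go
// Hypothetical sketch of the metric suggested above.
package metrics

import (
	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

// CAdvisorFallback is set to 1 when the kubelet falls back to cAdvisor
// MachineInfo for node resources (all-or-nothing, per the comment above),
// and 0 when resources come from the CRI runtime.
var CAdvisorFallback = metrics.NewGauge(
	&metrics.GaugeOpts{
		Subsystem:      "kubelet",
		Name:           "node_resource_discovery_cadvisor_fallback",
		Help:           "1 if the kubelet fell back to cAdvisor MachineInfo for node resources, 0 otherwise.",
		StabilityLevel: metrics.ALPHA,
	},
)

func init() {
	legacyregistry.MustRegister(CAdvisorFallback)
}
```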


If Node Resource Hotplug ([KEP-3953][kep-3953]) is enabled in tandem, the node
Contributor

this creates a dependency on this for 3953. @Karthik-K-N do you +1 this?

Contributor Author

I can also remove this snippet. This proposal (KEP-5224) alone does not cause this. But with both features enabled at the same time, this is what will happen.


Yeah, these two KEPs complement each other if enabled together; otherwise they work as expected independently.


A rollout could fail e.g. because of a bug in the CRI runtime, the runtime
returning data that the kubelet cannot consume. In this case the node will be
set into NotReady state. Existing workloads should not be affected but new pods
Contributor

When a node becomes NotReady, leaving its pods in the Running state is considered controversial behavior. We may correct this in future releases.
ref: kubernetes/kubernetes#125618

Contributor Author

Thanks @HirazawaUi for pointing this out. I'll add this detail in the next update.


### Goals

- Ability for kubelet to get node resources (capacity) from the CRI runtime
Contributor

Do your goals include maintaining the availability of the kubelet's /metrics/resource endpoint after migrating to CRI-based node resource discovery?

Contributor Author

Yes, this is completely independent of the stats/metrics stuff

Contributor

@HirazawaUi commented May 26, 2025

I think there should be a correlation here.

There is an ambiguity: in the future, will the data returned by the /stats and /metrics/resource endpoints represent the hardware resources owned by the Kubernetes Node resource object, or the actual physical hardware resources of the node?

Prior to this KEP, these were nearly equivalent. However, once a user enables the feature gate proposed in this KEP, it might allocate only a subset of the node's resources (e.g., a portion of CPU cores) to the kubelet through extensible mechanisms. This could create ambiguity in the reported metrics.

@fmuyassarov
Member

/cc

@k8s-ci-robot requested a review from fmuyassarov May 27, 2025 06:53
- Ability for kubelet to get node resources (capacity) from the CRI runtime
- Retain current functionality of cpu, memory and topology managers
- API that can support dynamic node capacity changes

Member

Explicitly state that we plan to have CPU topology info; this is enough and compatible with Slurm-like HPC workloads.
