@rahulgurnani
Contributor

@rahulgurnani rahulgurnani commented Oct 31, 2025

Add PrepareData and AdmitRequest plugins based on recommendations in evolving datalayer changes

The prepare data plugins are executed in dependency order: a plugin that depends on other plugins waits for those plugins to finish before it runs.
Furthermore, the dependency graph (DAG) of data producers and consumers is validated for cycles at startup. If there is a cycle, startup fails with an error.
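
The ordering and startup validation described above can be sketched as a topological sort (Kahn's algorithm) over plugin dependencies. This is a minimal sketch under stated assumptions, not the PR's implementation: the `orderPlugins` function and its map-based input (plugin name mapped to the names of the plugins whose data it consumes) are hypothetical.

```go
package main

import (
	"errors"
	"sort"
)

// orderPlugins returns plugin names in an order where every plugin appears
// after all plugins it depends on. deps maps a plugin name to the names of
// the plugins whose data it consumes (a hypothetical representation).
// It returns an error if the dependency graph contains a cycle.
func orderPlugins(deps map[string][]string) ([]string, error) {
	indegree := make(map[string]int)        // unmet dependencies per plugin
	dependents := make(map[string][]string) // producer -> its consumers
	for name, ds := range deps {
		if _, ok := indegree[name]; !ok {
			indegree[name] = 0
		}
		for _, d := range ds {
			if _, ok := indegree[d]; !ok {
				indegree[d] = 0
			}
			indegree[name]++
			dependents[d] = append(dependents[d], name)
		}
	}

	// Start with plugins that have no dependencies; sort for determinism.
	var ready []string
	for name, n := range indegree {
		if n == 0 {
			ready = append(ready, name)
		}
	}
	sort.Strings(ready)

	order := make([]string, 0, len(indegree))
	for len(ready) > 0 {
		name := ready[0]
		ready = ready[1:]
		order = append(order, name)
		next := dependents[name]
		sort.Strings(next)
		for _, consumer := range next {
			indegree[consumer]--
			if indegree[consumer] == 0 {
				ready = append(ready, consumer)
			}
		}
	}

	// Any plugin left unordered is part of, or downstream of, a cycle.
	if len(order) != len(indegree) {
		return nil, errors.New("cycle detected in plugin dependency graph")
	}
	return order, nil
}
```

Calling `orderPlugins` once at startup and refusing to start on a non-nil error gives the fail-fast behavior described above; the returned slice is then the execution order for each request.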

The PR also refactors some of the director.go code.

cc @BenjaminBraunDev

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR is needed to make it easier to implement plugins that produce data consumed by other plugins, for instance the latency predictor and the prefix match plugin.

Read the doc "evolving datalayer changes" for more details.

Which issue(s) this PR fixes:

Addresses #1743

Does this PR introduce a user-facing change?:

Yes, enables writing prepare data and admit request plugins for users of IGW.

Add prepare data and admit request plugins

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 31, 2025
@netlify

netlify bot commented Oct 31, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit b85113f
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/6912784e6fab4300074d2679
😎 Deploy Preview https://deploy-preview-1796--gateway-api-inference-extension.netlify.app

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 31, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rahulgurnani
Once this PR has been reviewed and has the lgtm label, please assign nirrozenbaum for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from ahg-g and elevran October 31, 2025 16:22
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 31, 2025
@k8s-ci-robot
Contributor

Hi @rahulgurnani. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 31, 2025
@kfswain
Collaborator

kfswain commented Oct 31, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 31, 2025
loggerDebug := log.FromContext(ctx).V(logutil.DEBUG)
for _, plugin := range d.requestControlPlugins.admitRequestPlugins {
	loggerDebug.Info("Running AdmitRequest plugin", "plugin", plugin.TypedName())
	if !plugin.Admit(ctx, request, pods) {
Collaborator

We should allow the plugin to return a string explaining the reason for rejection. We can then just treat the empty string as the allow mechanism (less opinionated on this part tho).
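
The suggestion above can be sketched as follows. This is a minimal illustration, not the interface that was merged: the `AdmitRequest` interface shape, the `maxPromptLength` plugin, and `runAdmitPlugins` are hypothetical, and the request is reduced to a prompt string for brevity.

```go
package main

import "context"

// AdmitRequest is a hypothetical plugin interface following the suggestion:
// Admit returns a human-readable rejection reason, and the empty string
// means the request is admitted.
type AdmitRequest interface {
	Name() string
	Admit(ctx context.Context, prompt string) string
}

// maxPromptLength is a toy plugin that rejects prompts over a length limit.
type maxPromptLength struct{ limit int }

func (p maxPromptLength) Name() string { return "max-prompt-length" }

func (p maxPromptLength) Admit(_ context.Context, prompt string) string {
	if len(prompt) > p.limit {
		return "prompt exceeds the configured length limit"
	}
	return ""
}

// runAdmitPlugins runs plugins in order and stops at the first rejection,
// returning the rejecting plugin's name and its reason. Two empty strings
// mean every plugin admitted the request.
func runAdmitPlugins(ctx context.Context, plugins []AdmitRequest, prompt string) (rejectedBy, reason string) {
	for _, p := range plugins {
		if r := p.Admit(ctx, prompt); r != "" {
			return p.Name(), r
		}
	}
	return "", ""
}
```

Returning the reason lets the caller log it or surface it in the error response. An `error` return would be another common Go shape for the same idea, with `nil` meaning admitted.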

Contributor Author

Updated, thanks!


// Attributes provides a goroutine-safe implementation of AttributeMap.
type Attributes struct {
	data sync.Map // key: attribute name (string), value: attribute value (opaque, Cloneable)
Collaborator

Typically I am all for using prebuilt libs to handle this type of complexity.

But since writes to specific attributes will lock the entire data object, we may have high lock contention here. Did we explore having a lock per attribute key?

That would let locks be at the granularity of a specific endpoint & a specific attribute, which should hopefully reduce lock contention & let our system be more performant.
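
For illustration, a lock-per-attribute-key store might look like the sketch below. All names here (`perKeyAttributes`, `entry`, `Put`, `Get`) are hypothetical and not taken from the PR. The outer lock guards only the set of keys, while each value has its own lock.

```go
package main

import "sync"

// entry pairs one attribute value with its own lock, so writers to
// different keys do not block each other.
type entry struct {
	mu    sync.RWMutex
	value any
	set   bool // true once a value has been stored
}

// perKeyAttributes is a hypothetical alternative to a sync.Map-backed store.
type perKeyAttributes struct {
	mu      sync.RWMutex // guards the entries map, not the values
	entries map[string]*entry
}

func newPerKeyAttributes() *perKeyAttributes {
	return &perKeyAttributes{entries: make(map[string]*entry)}
}

// entryFor returns the entry for key, creating it on first use.
func (a *perKeyAttributes) entryFor(key string) *entry {
	a.mu.RLock()
	e, ok := a.entries[key]
	a.mu.RUnlock()
	if ok {
		return e
	}
	a.mu.Lock()
	defer a.mu.Unlock()
	// Re-check under the write lock: another goroutine may have created it.
	if e, ok = a.entries[key]; !ok {
		e = &entry{}
		a.entries[key] = e
	}
	return e
}

// Put stores value under key, holding only that key's lock while writing.
func (a *perKeyAttributes) Put(key string, value any) {
	e := a.entryFor(key)
	e.mu.Lock()
	e.value, e.set = value, true
	e.mu.Unlock()
}

// Get returns the value stored under key, if any.
func (a *perKeyAttributes) Get(key string) (any, bool) {
	a.mu.RLock()
	e, ok := a.entries[key]
	a.mu.RUnlock()
	if !ok {
		return nil, false
	}
	e.mu.RLock()
	defer e.mu.RUnlock()
	return e.value, e.set
}
```

Whether this outperforms `sync.Map` depends on the access pattern, so it would need the kind of benchmark comparison discussed in this thread before being adopted.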

Contributor Author

Thanks for the suggestion. This attribute map is a per-request copy: we take a snapshot of the attributes so that we can use them in the scheduling layer. Given the per-request nature of the map, I don't think it will see contention, because updating the map should take less than a microsecond. I think it's reasonable to use sync.Map here.
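
The per-request snapshot idea can be sketched like this. The `Attributes` struct and the `Cloneable` name come from the excerpt quoted in this thread; the method set (`Put`, `Get`, `Clone`) and the `tokens` value type are assumptions made for illustration.

```go
package main

import "sync"

// Cloneable is assumed to be implemented by every attribute value so that
// a snapshot never shares mutable state with the original.
type Cloneable interface {
	Clone() Cloneable
}

// Attributes mirrors the struct quoted above; its methods are assumptions.
type Attributes struct {
	data sync.Map // key: attribute name (string), value: Cloneable
}

// Put stores an attribute value under key.
func (a *Attributes) Put(key string, v Cloneable) { a.data.Store(key, v) }

// Get returns the attribute stored under key, if any.
func (a *Attributes) Get(key string) (Cloneable, bool) {
	v, ok := a.data.Load(key)
	if !ok {
		return nil, false
	}
	return v.(Cloneable), true
}

// Clone copies every attribute into a fresh map. Taking this copy once per
// request keeps request-specific writes away from the shared attributes.
func (a *Attributes) Clone() *Attributes {
	out := &Attributes{}
	a.data.Range(func(k, v any) bool {
		out.data.Store(k, v.(Cloneable).Clone())
		return true
	})
	return out
}

// tokens is a toy attribute value with mutable backing storage.
type tokens []int

// Clone returns a copy that does not share the underlying array.
func (t tokens) Clone() Cloneable { return append(tokens(nil), t...) }
```

Because each request works on its own clone, a plugin that mutates a snapshot value cannot affect the shared attributes or any other in-flight request.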

Collaborator

Can we run a scale test to ensure we don't have any regression? We can use the metrics we have here as a baseline: #1458

Contributor Author

I plan to do this in follow-up PRs, since we are not actually using the map in this change. Thanks!

@rahulgurnani rahulgurnani force-pushed the preparedata-changes branch 2 times, most recently from e9f0b7f to 3e16069 Compare November 3, 2025 03:05
return reqCtx, errutil.Error{Code: errutil.ServiceUnavailable, Msg: "failed to find candidate pods for serving the request"}
}

// TODO(rahulgurnani/lukevandrie): Perhaps, refactor/implement Admit plugin for Admission control.
Contributor

this comment is important. what is the relation between admission controller Admit and the admitRequest plugins?

Collaborator

++

This would be a break in the contract of how flow control operates. This specific plugin is for request-specific semantics. We currently do not have Flow Control considering request-specific semantics, and there hasn't been a proposal suggesting this change. I think we should remove this TODO until we have strong reasoning to actually do this work.

Contributor

My apologies, but I don't understand the intention of this new Admission plugin.
I thought we want to have admission check pluggable, but it seems now that we have two types of admission checks, with two different interfaces.
this seems wrong.

Contributor Author

@rahulgurnani rahulgurnani Nov 3, 2025

Removed the comment. Thanks for the catch!

result, err := d.scheduler.Schedule(ctx, reqCtx.SchedulingRequest, d.toSchedulerPodMetrics(candidatePods))
// Prepare per request data
// TODO(rahulgurnani): Add retries and timeout in the preparedata step.
d.runPrepareDataPlugins(ctx, reqCtx.SchedulingRequest, snapshotOfCandidatePods)
Contributor

@nirrozenbaum nirrozenbaum Nov 3, 2025

why do we create snapshot of candidate pods?
we should work with candidatePods and create a snapshot only when calling the scheduler.
this is true for both prep data and admit request.
helper functions in the director should not rely on the internal scheduler representation of the endpoints.

Collaborator

@kfswain kfswain Nov 3, 2025

both prep data and admit request are request specific, and so if we add request specific data to the shared endpoints that could risk data corruption.

Snapshotting before these steps ensures that this data lifecycle is only in the context it is consumed in.

Contributor

the intention of converting the endpoints representation to scheduler internal structure was only for the purpose of sending it to the scheduler.
PodMetrics has MetricsState behind atomic pointer and reading the metrics is an atomic operation (read all metrics in one operation).
I must be missing something although I've read the proposal doc.
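
The atomic-pointer pattern mentioned in this comment can be illustrated with the sketch below. It is not the project's `PodMetrics` code: the `podMetrics` and `metricsState` types and their fields are hypothetical, and it assumes Go 1.19 or later for `atomic.Pointer`.

```go
package main

import "sync/atomic"

// metricsState is a hypothetical bundle of endpoint metrics that is treated
// as immutable once published.
type metricsState struct {
	WaitingQueueSize int
	KVCacheUsage     float64
}

// podMetrics publishes its metrics through an atomic pointer, so a reader
// always sees one complete state and never a mix of old and new fields.
type podMetrics struct {
	state atomic.Pointer[metricsState]
}

// update publishes a new state. The argument is passed by value, so the
// stored pointer refers to a private copy the caller cannot mutate.
func (p *podMetrics) update(s metricsState) { p.state.Store(&s) }

// read returns a copy of the latest published state, or the zero value if
// nothing has been published yet.
func (p *podMetrics) read() metricsState {
	if s := p.state.Load(); s != nil {
		return *s
	}
	return metricsState{}
}
```

This covers consistent reads of refreshed metrics. It does not by itself isolate request-specific data written by plugins, which is the concern the snapshot argument in this thread addresses.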

@rahulgurnani rahulgurnani changed the title Add DataProducer (PrepareData) and Admission control plugins [WIP] Add DataProducer (PrepareData) and Admission control plugins Nov 7, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 7, 2025
@rahulgurnani rahulgurnani force-pushed the preparedata-changes branch 2 times, most recently from a9976c9 to cf26496 Compare November 8, 2025 00:40
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 8, 2025
@rahulgurnani rahulgurnani changed the title [WIP] Add DataProducer (PrepareData) and Admission control plugins Add DataProducer (PrepareData) and Admission control plugins Nov 8, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 8, 2025
@rahulgurnani rahulgurnani force-pushed the preparedata-changes branch 4 times, most recently from 853ef84 to 0b0b7b7 Compare November 9, 2025 01:48
@rahulgurnani rahulgurnani changed the title Add DataProducer (PrepareData) and Admission control plugins Add DataProducer PrepareData and Admission control plugins Nov 9, 2025
@rahulgurnani rahulgurnani force-pushed the preparedata-changes branch 2 times, most recently from 15f6758 to 4cc246d Compare November 10, 2025 21:11
@rahulgurnani rahulgurnani force-pushed the preparedata-changes branch 2 times, most recently from 404a257 to f28020a Compare November 10, 2025 23:38
…timeout logic back since it got removed in previous commit.

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

4 participants