Add DataProducer PrepareData and Admission control plugins #1796
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: rahulgurnani. Needs approval from an approver in each of these files.

Hi @rahulgurnani. Thanks for your PR. I'm waiting for a github.com member to verify that this patch is reasonable to test.

/ok-to-test
pkg/epp/requestcontrol/director.go
Outdated
loggerDebug := log.FromContext(ctx).V(logutil.DEBUG)
for _, plugin := range d.requestControlPlugins.admitRequestPlugins {
	loggerDebug.Info("Running AdmitRequest plugin", "plugin", plugin.TypedName())
	if !plugin.Admit(ctx, request, pods) {
We should allow the plugin to return a string explaining the reason for rejection. We can then treat the empty string as the allow signal (less opinionated on this part, though).
Updated, thanks!
pkg/epp/scheduling/types/types.go
Outdated
// Attributes provides a goroutine-safe implementation of AttributeMap.
type Attributes struct {
	data sync.Map // key: attribute name (string), value: attribute value (opaque, Cloneable)
Typically I am all for using prebuilt libs to handle this type of complexity.
But since writes to specific attributes will lock the entire data object, we may have high lock contention here. Did we explore having a lock per attribute key?
That would put locks at the granularity of a specific endpoint and a specific attribute, which should reduce lock contention and make the system more performant.
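The per-attribute-key locking being suggested could be sketched as below: each key owns its own mutex, so writers to different attributes never contend on a single lock, and the outer map lock is held only briefly to find or create the entry. This is an illustrative sketch, not the PR's code; all type and method names are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// attrEntry pairs one attribute's value with its own lock.
type attrEntry struct {
	mu  sync.Mutex
	val any
}

// shardedAttributes locks per attribute key instead of per map.
type shardedAttributes struct {
	mu      sync.RWMutex          // guards the entries map itself, not the values
	entries map[string]*attrEntry // one lock per attribute key
}

func newShardedAttributes() *shardedAttributes {
	return &shardedAttributes{entries: map[string]*attrEntry{}}
}

// entryFor finds or lazily creates the entry for key, double-checking under
// the write lock to avoid racing creations.
func (a *shardedAttributes) entryFor(key string) *attrEntry {
	a.mu.RLock()
	e, ok := a.entries[key]
	a.mu.RUnlock()
	if ok {
		return e
	}
	a.mu.Lock()
	defer a.mu.Unlock()
	if e, ok := a.entries[key]; ok {
		return e
	}
	e = &attrEntry{}
	a.entries[key] = e
	return e
}

func (a *shardedAttributes) Put(key string, v any) {
	e := a.entryFor(key)
	e.mu.Lock()
	e.val = v
	e.mu.Unlock()
}

func (a *shardedAttributes) Get(key string) (any, bool) {
	a.mu.RLock()
	e, ok := a.entries[key]
	a.mu.RUnlock()
	if !ok {
		return nil, false
	}
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.val, e.val != nil
}

func main() {
	a := newShardedAttributes()
	a.Put("queue-depth", 3)
	v, ok := a.Get("queue-depth")
	fmt.Println(v, ok)
}
```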
Thanks for the suggestion. This attribute map is a per-request copy: we take a snapshot of the attributes so that we can use them in the scheduling layer. Given the per-request nature of the map, I don't think it will see contention, since updating the map takes well under a microsecond. I think it's reasonable to use sync.Map here.
Can we scale-test to ensure we don't have any regression? We can use the metrics we have here as a baseline: #1458
I plan to do this in the next PRs, since we are not actually using the map in this change. Thanks!
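The per-request usage described above could be sketched as follows: attributes live in a `sync.Map`, and a `Snapshot` copies them into a plain map for one request, so the scheduling layer never reads the shared structure while writers mutate it. Names here are illustrative assumptions, not the PR's actual API.

```go
package main

import (
	"fmt"
	"sync"
)

// attributes is a goroutine-safe attribute map, per the snippet in the diff.
type attributes struct {
	data sync.Map // key: attribute name (string), value: attribute value
}

func (a *attributes) Put(key string, v any)      { a.data.Store(key, v) }
func (a *attributes) Get(key string) (any, bool) { return a.data.Load(key) }

// Snapshot copies the current attributes into a plain map for a single
// request; later writes to the shared map do not affect the snapshot.
func (a *attributes) Snapshot() map[string]any {
	out := map[string]any{}
	a.data.Range(func(k, v any) bool {
		out[k.(string)] = v
		return true
	})
	return out
}

func main() {
	a := &attributes{}
	a.Put("prefix-cache-score", 0.8)
	snap := a.Snapshot()
	a.Put("prefix-cache-score", 0.1) // the snapshot is unaffected
	fmt.Println(snap["prefix-cache-score"])
}
```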
pkg/epp/requestcontrol/director.go
Outdated
return reqCtx, errutil.Error{Code: errutil.ServiceUnavailable, Msg: "failed to find candidate pods for serving the request"}
}

// TODO(rahulgurnani/lukevandrie): Perhaps, refactor/implement Admit plugin for Admission control.
This comment is important. What is the relation between the admission controller's Admit and the AdmitRequest plugins?
++
This would break the contract of how flow control operates. This specific plugin is for request-specific semantics. We currently do not have Flow Control considering request-specific semantics, and there hasn't been a proposal suggesting this change. I think we should remove this TODO until we have strong reasoning to actually do this work.
My apologies, but I don't understand the intention of this new Admission plugin.
I thought we wanted to make admission checks pluggable, but it seems now that we have two types of admission checks, with two different interfaces.
This seems wrong.
Removed the comment. Thanks for the catch!
pkg/epp/requestcontrol/director.go
Outdated
result, err := d.scheduler.Schedule(ctx, reqCtx.SchedulingRequest, d.toSchedulerPodMetrics(candidatePods))
// Prepare per request data
// TODO(rahulgurnani): Add retries and timeout in the preparedata step.
d.runPrepareDataPlugins(ctx, reqCtx.SchedulingRequest, snapshotOfCandidatePods)
Why do we create a snapshot of candidate pods?
We should work with candidatePods and create a snapshot only when calling the scheduler.
This is true for both prep data and admit request.
Helper functions in the director should not rely on the scheduler's internal representation of the endpoints.
Both prep data and admit request are request-specific, so if we add request-specific data to the shared endpoints, we risk data corruption.
Snapshotting before these steps ensures that this data's lifecycle is confined to the context it is consumed in.
The intention of converting the endpoint representation to the scheduler's internal structure was only for the purpose of sending it to the scheduler.
PodMetrics keeps MetricsState behind an atomic pointer, and reading the metrics is an atomic operation (all metrics are read in one operation).
I must be missing something, although I've read the proposal doc.
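The atomic-pointer pattern described above can be sketched roughly as follows: the whole metrics state sits behind an `atomic.Pointer`, a background refresher swaps in a complete new state, and readers get a consistent view in one atomic load. The types here are illustrative assumptions, not the actual PodMetrics API.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// metricsState is a hypothetical immutable bundle of endpoint metrics.
type metricsState struct {
	KVCacheUsage float64
	QueueDepth   int
}

// podMetrics publishes metrics via an atomic pointer, so readers never need a lock.
type podMetrics struct {
	state atomic.Pointer[metricsState]
}

// Update atomically swaps in a fresh, fully populated state.
func (p *podMetrics) Update(s *metricsState) { p.state.Store(s) }

// Metrics returns all metrics in one atomic read; the returned state must be
// treated as immutable by callers.
func (p *podMetrics) Metrics() *metricsState { return p.state.Load() }

func main() {
	p := &podMetrics{}
	p.Update(&metricsState{KVCacheUsage: 0.4, QueueDepth: 3})
	fmt.Println(p.Metrics().QueueDepth)
}
```

Because each `Update` replaces the whole state rather than mutating fields in place, a reader can never observe a half-written mix of old and new metrics.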
Add PrepareData and AdmitRequest plugins based on recommendations in evolving datalayer changes.
The PrepareData plugins are executed in dependency order: a plugin that depends on other plugins waits for those plugins to finish executing.
Furthermore, the dependency graph (DAG) of data producers and consumers is validated at startup for cycles. If there is a cycle, startup fails with an error.
The PR also refactors some of the director.go code.
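The startup cycle check described above can be sketched as a standard three-color DFS over the producer/consumer dependency graph: a back edge to a node on the current DFS path means a cycle, and startup should fail with an error. This is an illustrative sketch under assumed names, not the PR's implementation, and the plugin names in the example are hypothetical.

```go
package main

import "fmt"

// validateNoCycles returns an error if the plugin dependency graph contains a
// cycle. deps maps each plugin to the plugins it consumes data from.
func validateNoCycles(deps map[string][]string) error {
	const (
		white = iota // not yet visited
		gray         // on the current DFS path
		black        // fully processed
	)
	color := map[string]int{}
	var visit func(n string) error
	visit = func(n string) error {
		color[n] = gray
		for _, m := range deps[n] {
			switch color[m] {
			case gray: // back edge: m is already on the current path
				return fmt.Errorf("dependency cycle through %q and %q", n, m)
			case white:
				if err := visit(m); err != nil {
					return err
				}
			}
		}
		color[n] = black
		return nil
	}
	for n := range deps {
		if color[n] == white {
			if err := visit(n); err != nil {
				return err
			}
		}
	}
	return nil
}

func main() {
	ok := map[string][]string{"latency-predictor": {"prefix-match"}, "prefix-match": nil}
	fmt.Println(validateNoCycles(ok))
	bad := map[string][]string{"a": {"b"}, "b": {"a"}}
	fmt.Println(validateNoCycles(bad) != nil)
}
```

Running the check once at startup keeps the per-request path free of graph validation; the same coloring also yields a topological execution order for the PrepareData plugins if computed in post-order.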
cc @BenjaminBraunDev
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR is needed to make it easier to implement plugins that produce data consumed by other plugins, for instance the latency predictor and the prefix-match plugin.
Read the evolving datalayer changes doc for more details.
Which issue(s) this PR fixes:
Addresses #1743
Does this PR introduce a user-facing change?:
Yes: enables users of IGW to write PrepareData and AdmitRequest plugins.