✨ Add new metrics and testing documentation#271
Conversation
✅ Deploy Preview for node-readiness-controller canceled.
|
|
/cc @AvineshTripathi |
AvineshTripathi
left a comment
There was a problem hiding this comment.
Thanks for the PR @Karthik-K-N. I have put some comments! PTAL
|
|
||
| // ReconciliationLatency tracks end-to-end latency from condition change to taint operation. | ||
| // This measures how quickly the controller responds to node condition changes. | ||
| ReconciliationLatency = prometheus.NewHistogramVec( |
There was a problem hiding this comment.
looking at this, I was wondering if we should club this and https://github.com/kubernetes-sigs/node-readiness-controller/pull/271/changes#diff-6e56d7c057f6a96124b78f2b669dd7dface2718c304aece205017d114b3a7088R43?? What do you think?
I feel both share the same purpose
There was a problem hiding this comment.
which one are you referring to? I am not able to trace.
There was a problem hiding this comment.
There was a problem hiding this comment.
No Issues thanks.
EvaluationDuration is an internal profiling metric, it tells us how fast our Go code executes a single rule evaluation (usually in milliseconds). ReconciliationLatency is our external Service Level Indicator (SLI). It measures the total end-to-end reaction time from the moment the node changed state in the real world to the moment we successfully patched the taint (which includes API server lag, rate limiting, and queue wait times, usually in seconds).
Keeping them separate allows us to instantly know if a slowdown is caused by inefficient code vs. Kubernetes API queueing.
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: Karthik-K-N The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Thanks for looping me into this PR (somehow missed it earlier😅). I went through the changes, and this aligns quite well with the Headlamp project. The separation between Also nice to see Looking forward to building on top of this once it lands. |
|
In a high level, metric shapes look good to me. But I'm not sure if updating state / metric per event is scalable for the controller. We may need to take a second look at the compute cost for these |
|
May be can we get this in to unblock or not to overlap with other work and based on the scale test results we can revisit and update the metrics? |
Could we create separate feature branches for the LFX projects? it'd help with the velocity of the projects as well. |
Description
This PR adds few metrics into the controller.
Contains only metrics part from PR: #185
Related Issue
Type of Change
/kind feature
Testing
Checklist
make testpassesmake lintpassesDoes this PR introduce a user-facing change?
Doc #(issue)