Do not assign targets to a collector pod that has not been Ready to accept traffic for a while #3807
+260
−13
Description:
Ideally, TargetAllocator should not assign targets to an unhealthy pod that is not ready to accept traffic.
After discussing with Mikolaj in #3781, we believe it is a good idea to NOT assign targets to an unhealthy collector pod that has stayed in the bad state for longer than a grace period of `30s` or so.

Note that `unhealthy` here does not only mean `Pod Phase is non-Running`; it mainly means that the pod's `PodCondition Ready is non-True`.

- `Running` is a Pod Phase that is defined as: "The Pod has been bound to a node, and all of the containers have been created. At least one container is still running, or is in the process of starting or restarting."
- `Ready` is a Pod Condition that is defined as: "The Pod is able to serve requests and should be added to the load balancing pools of all matching Services."

I believe the check on the `Ready` PodCondition already covers the check on the `Running` PodPhase, but I am fine with keeping both for now. Let me know your opinion if you have a strong one.

Link to tracking Issue(s):
Resolves: #3781
Testing:
Unit Test:
- Given three pods, (1) a running pod, (2) a non-running pod that is still within the grace period, and (3) a non-running pod that is already over the grace period, the test passed.
- Given three pods, (1) a ready pod, (2) a non-ready pod that is still within the grace period, and (3) a non-ready pod that is already over the grace period, the test passed.

Local Test:
Local Test Step By Step:
1. By port-forwarding and opening `http://localhost:8080/jobs/kube-state-metrics/targets`, I could see that 10 collectors were candidates.
2. `collector-9` went into a non-Running state.
3. The Pod Condition `Ready` also went `False` shortly after.
4. By port-forwarding and opening `http://localhost:8080/jobs/kube-state-metrics/targets` again, I could see that only 9 collectors were candidates.