Do not assign targets to a collector pod that has not been Ready to accept traffic for longer than a grace period #3807

Open
wants to merge 2 commits into base: main

Conversation

@rice-junhaoyu commented Mar 15, 2025

Description:

Ideally, the TargetAllocator should not assign targets to an unhealthy pod that is not ready to accept traffic.

After discussing with Mikolaj in #3781, we believe it is a good idea to NOT assign targets to an unhealthy collector pod that has remained in that bad state for longer than a grace period (on the order of 30s).

Note that "unhealthy" here does not only mean the Pod Phase is non-Running; more precisely, it means the pod's Ready PodCondition is not True.

  • Running is a Pod Phase that is defined as: "The Pod has been bound to a node, and all of the containers have been created. At least one container is still running, or is in the process of starting or restarting."
  • Ready is a Pod Condition that is defined as: "The Pod is able to serve requests and should be added to the load balancing pools of all matching Services."

Although I believe the check on the Ready PodCondition already covers the check on the Running PodPhase, I am fine with keeping both for now. Let me know if you have a strong opinion either way.
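
For illustration, here is a minimal sketch of the eligibility check being described, assuming a 30s grace period. The identifiers (`isPodEligible`, `readyGracePeriod`) are hypothetical and do not refer to the actual names used in this PR:

```go
package allocation

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// readyGracePeriod is how long an un-Ready pod is still tolerated before the
// allocator stops assigning targets to it (30s in the discussion above).
const readyGracePeriod = 30 * time.Second

// isPodEligible reports whether a collector pod should still receive targets.
// A pod stays eligible while its Ready condition is True, or while it has been
// not-Ready for less than the grace period.
func isPodEligible(pod *corev1.Pod, now time.Time) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type != corev1.PodReady {
			continue
		}
		if cond.Status == corev1.ConditionTrue {
			return true
		}
		// Not Ready: tolerate it only within the grace period, measured from
		// the last recorded condition transition.
		return now.Sub(cond.LastTransitionTime.Time) < readyGracePeriod
	}
	// No Ready condition recorded yet (e.g. pod still Pending): not eligible.
	return false
}
```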

Link to tracking Issue(s):

Resolves: #3781

Testing:

Unit Test:

  • I added two unit tests (a rough sketch of this kind of test is shown after this list):
    • given three pods, (1) a Running pod, (2) a non-Running pod that is still within the grace period, and (3) a non-Running pod that is already past the grace period, the test passed.
    • given three pods, (1) a Ready pod, (2) a non-Ready pod that is still within the grace period, and (3) a non-Ready pod that is already past the grace period, the test passed.
  • I also want to call out that existing unit tests are not impacted.
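
As a rough illustration of the second scenario above, a table-driven test could look like the following. It reuses the hypothetical `isPodEligible` helper from the sketch in the description and is not the PR's actual test code:

```go
package allocation

import (
	"testing"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// podWithReady builds a pod whose Ready condition has the given status and
// last transition time.
func podWithReady(status corev1.ConditionStatus, lastTransition time.Time) *corev1.Pod {
	return &corev1.Pod{
		Status: corev1.PodStatus{
			Conditions: []corev1.PodCondition{{
				Type:               corev1.PodReady,
				Status:             status,
				LastTransitionTime: metav1.NewTime(lastTransition),
			}},
		},
	}
}

func TestReadyGracePeriod(t *testing.T) {
	now := time.Now()
	tests := []struct {
		name string
		pod  *corev1.Pod
		want bool
	}{
		{"ready pod", podWithReady(corev1.ConditionTrue, now.Add(-time.Hour)), true},
		{"not ready, within grace period", podWithReady(corev1.ConditionFalse, now.Add(-10*time.Second)), true},
		{"not ready, past grace period", podWithReady(corev1.ConditionFalse, now.Add(-time.Minute)), false},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			if got := isPodEligible(tt.pod, now); got != tt.want {
				t.Errorf("isPodEligible() = %v, want %v", got, tt.want)
			}
		})
	}
}
```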

Local Test:

  • I built a Docker image with this change and verified that if a pod is non-Running for longer than the grace period, the TargetAllocator stops assigning targets to it.

Local Test Step By Step:

  1. The initial state was 10 available collector pods and two target-allocator pods running the change in this PR:
otel-collector-with-ta-kube-state-metrics-collector-0             1/1     Running       0            31h
otel-collector-with-ta-kube-state-metrics-collector-1             1/1     Running       0            31h
otel-collector-with-ta-kube-state-metrics-collector-2             1/1     Running       0            31h
otel-collector-with-ta-kube-state-metrics-collector-3             1/1     Running       0            31h
otel-collector-with-ta-kube-state-metrics-collector-4             1/1     Running       0            31h
otel-collector-with-ta-kube-state-metrics-collector-5             1/1     Running       0            31h
otel-collector-with-ta-kube-state-metrics-collector-6             1/1     Running       0            31h
otel-collector-with-ta-kube-state-metrics-collector-7             1/1     Running       0            31h
otel-collector-with-ta-kube-state-metrics-collector-8             1/1     Running       0            31h
otel-collector-with-ta-kube-state-metrics-collector-9             1/1     Running       0            61s
otel-collector-with-ta-kube-state-metrics-targetallocator-kp5xm   1/1     Running       0            5m15s
otel-collector-with-ta-kube-state-metrics-targetallocator-p9kvl   1/1     Running       0            3m50s

By port-forwarding to http://localhost:8080/jobs/kube-state-metrics/targets, I could see that all 10 collectors were candidates.

[Screenshot, 2025-03-14 7:04 PM: targets page listing all 10 collectors as candidates]

  2. I changed the collectors to use a bad image, after which collector-9 went into a non-Running state:
otel-collector-with-ta-kube-state-metrics-collector-0             1/1     Running        0            31h
otel-collector-with-ta-kube-state-metrics-collector-1             1/1     Running        0            31h
otel-collector-with-ta-kube-state-metrics-collector-2             1/1     Running        0            31h
otel-collector-with-ta-kube-state-metrics-collector-3             1/1     Running        0            31h
otel-collector-with-ta-kube-state-metrics-collector-4             1/1     Running        0            31h
otel-collector-with-ta-kube-state-metrics-collector-5             1/1     Running        0            31h
otel-collector-with-ta-kube-state-metrics-collector-6             1/1     Running        0            31h
otel-collector-with-ta-kube-state-metrics-collector-7             1/1     Running        0            31h
otel-collector-with-ta-kube-state-metrics-collector-8             1/1     Running        0            31h
otel-collector-with-ta-kube-state-metrics-collector-9             0/1     ErrImagePull   0            37s
otel-collector-with-ta-kube-state-metrics-targetallocator-kp5xm   1/1     Running        0            5m55s
otel-collector-with-ta-kube-state-metrics-targetallocator-p9kvl   1/1     Running        0            4m30s

The pod's Ready condition also went False shortly after.

[Screenshot: the pod's conditions showing Ready as False]

By port-forwarding to http://localhost:8080/jobs/kube-state-metrics/targets, I could see that only 9 collectors remained as candidates (a small script for checking this programmatically is sketched after the screenshot below).

[Screenshot, 2025-03-14 7:08 PM: targets page listing only 9 collectors as candidates]
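
For completeness, the same check can be done without screenshots by counting the top-level keys returned by the targets endpoint. This is a small illustrative script, assuming the port-forward from the steps above is active on localhost:8080 and that the top-level JSON keys are collector names (as observed in this test); it is not part of the PR:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	resp, err := http.Get("http://localhost:8080/jobs/kube-state-metrics/targets")
	if err != nil {
		log.Fatalf("querying target allocator: %v", err)
	}
	defer resp.Body.Close()

	// Decode only the top level; each key is a collector that is currently a candidate.
	var byCollector map[string]json.RawMessage
	if err := json.NewDecoder(resp.Body).Decode(&byCollector); err != nil {
		log.Fatalf("decoding response: %v", err)
	}

	fmt.Printf("%d collector(s) are currently target candidates:\n", len(byCollector))
	for name := range byCollector {
		fmt.Println(" -", name)
	}
}
```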

@rice-junhaoyu changed the title from "Do not assign targets to a collector pod that are not Running for more than a while" to "Do not assign targets to a collector pod that are not Ready for accepting traffic more than a while" on Mar 15, 2025