
HPA scales up stable replicas when doing canary deployment with Argo Rollouts #3849

Open
anup1384 opened this issue Sep 24, 2024 · 3 comments
Labels
bug Something isn't working

Comments

anup1384 commented Sep 24, 2024

We use canary deployments through Argo Rollouts to deploy our services. For services that utilize the Kubernetes Horizontal Pod Autoscaler (HPA) with CPU-based scaling, we observe the stable ReplicaSet scaling up during each deployment and then scaling back down after the deployment completes.

However, when reviewing the metrics for the service using both `kubectl describe hpa` and `kubectl get hpa` during these scale-ups, the reported metrics never exceed the configured threshold.
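
For context, the service is scaled through a KEDA ScaledObject that targets the Rollout, and the HPA shown below is the one KEDA creates from it. The manifest here is a minimal sketch, not our exact configuration: the CPU target and min/max replicas match the HPA output below, while the cron trigger's schedule and desired replica count are placeholders.

```yaml
# Minimal sketch of the ScaledObject; schedule and desiredReplicas are placeholders.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: abc-app
spec:
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: abc-app
  minReplicaCount: 40
  maxReplicaCount: 120
  triggers:
    - type: cpu                 # CPU-based scaling at 60% utilization
      metricType: Utilization
      metadata:
        value: "60"
    - type: cron                # appears in the HPA events as the external s3-cron-... metric
      metadata:
        timezone: Asia/Kolkata
        start: "0 16 * * *"     # placeholder schedule
        end: "0 0 * * *"        # placeholder schedule
        desiredReplicas: "90"   # placeholder
```

The `kubectl describe hpa` output during one of these scale-ups: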

```
Reference:                 Rollout/abc-app
Target CPU utilization:    60%
Current CPU utilization:   4%
Min replicas:              40
Max replicas:              120
Rollout pods:              94 current / 94 desired
Events:
Type     Reason             Age                  From                       Message
----     ------             ----                 ----                       -------
Normal SuccessfulRescale 44m (x2 over 23h) horizontal-pod-autoscaler New size: 86; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 44m (x2 over 23h) horizontal-pod-autoscaler New size: 87; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 44m (x2 over 23h) horizontal-pod-autoscaler New size: 88; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 44m (x2 over 23h) horizontal-pod-autoscaler New size: 89; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 43m horizontal-pod-autoscaler New size: 90; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 43m horizontal-pod-autoscaler New size: 91; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 37m (x73 over 4d23h) horizontal-pod-autoscaler (combined from similar events): New size: 91; reason: All metrics below target
Normal SuccessfulRescale 13m (x6 over 42m) horizontal-pod-autoscaler New size: 94; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 8m41s (x8 over 23h) horizontal-pod-autoscaler New size: 92; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 8m26s (x8 over 23h) horizontal-pod-autoscaler New size: 93; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 3m9s (x6 over 32m) horizontal-pod-autoscaler New size: 91; reason: All metrics below target
```

KEDA version - 2.13.2
Argo Rollouts - v1.2.1
Kubernetes - 1.28

@anup1384 anup1384 added the bug Something isn't working label Sep 24, 2024
@anup1384 anup1384 changed the title HPA scales up stable replicas when doing canary deployment HPA scales up stable replicas when doing canary deployment with Argo Rollouts Sep 24, 2024

FaLxy commented Oct 16, 2024

Were you able to resolve this issue @anup1384?

anup1384 (Author)

Hi @FaLxy
Not yet, we are still facing the issue.


fernandrone commented Nov 13, 2024

FWIW, we've observed this in our infrastructure as well. In this instance, a service was configured with Argo Rollouts using three canary steps: 1%, 10%, and 65% of traffic. When traffic was increased from 10% to 65%, we saw a huge spike in both the canary and the stable ReplicaSets.

Here are the steps as they show on the Rollouts dashboard:

(screenshot of the canary steps in the Rollouts dashboard)

At this point the stable ReplicaSet scaled up to 42 replicas, which is the maximum allowed by the HPA, while the canary scaled up to 28 (which is 65% of 42, rounded up; I assume Argo Rollouts takes the traffic percentage of either the current or the total number of replicas).
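
If I'm reading the Argo Rollouts docs correctly, that lines up with the canary being sized as ceil(spec.replicas * weight / 100), with the stable ReplicaSet kept at full scale unless dynamicStableScale is enabled. A rough sketch follows; only the step weights reflect our actual rollout, the rest is illustrative.

```yaml
# Illustrative canary strategy; only the setWeight values match our rollout.
strategy:
  canary:
    # With the HPA pushing spec.replicas to 42, the 65% step yields
    # ceil(42 * 0.65) = 28 canary pods, while the stable ReplicaSet
    # stays at 42 since dynamicStableScale is not set here.
    # dynamicStableScale: true   # would shrink stable as the canary weight grows
    steps:
      - setWeight: 1
      - pause: {}
      - setWeight: 10
      - pause: {}
      - setWeight: 65
      - pause: {}
```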

Inspecting the HPA, we can see that both CPU and memory (it scales on both) are below the desired targets. It stayed like that for hours, throughout the whole canary, and never scaled down.

```
Reference:                                                                         Rollout/<redacted>
Metrics:                                                                           ( current / target )
  resource cpu of container "<redacted>" on pods  (as a percentage of request):     4% (41m) / 100%
  resource memory of container "<redacted>" on pods  (as a percentage of request):  42% (1834665691428m) / 60%
Min replicas:                                                                      10
Max replicas:                                                                      42
Rollout pods:                                                                      42 current / 42 desired
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    ReadyForNewScale  recommended size matches current size
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from memory container resource utilization (percentage of request)
  ScalingLimited  True    TooManyReplicas   the desired replica count is more than the maximum replica count
Events:           <none>
```

This is how it looks in Grafana: at about 13:30, when the traffic shift went from 10% to 65%, we saw the huge spike in replicas, both stable and canary, going from a total of 22 to 70.

(Grafana graph of stable and canary replica counts around the 10% to 65% traffic shift)


Argo Rollouts v1.7.1
Argo CD v2.9.1+58b04e5
Kubernetes v1.29.4
