
HPA scales up stable replicas when doing canary deployment with Argo Rollouts #3849

Open
anup1384 opened this issue Sep 24, 2024 · 3 comments
Labels
bug Something isn't working

Comments

anup1384 commented Sep 24, 2024

We use canary deployments through Argo Rollouts to deploy our services. For services that utilize the Kubernetes Horizontal Pod Autoscaler (HPA) with CPU-based scaling, we observe the stable ReplicaSet scaling up during each deployment and then scaling back down after the deployment completes.

However, when reviewing the metrics for the service using both `kubectl describe hpa` and `kubectl get hpa` during these scale-ups, the reported metrics never exceed the configured threshold.
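
For context, the service is scaled through a KEDA ScaledObject that targets the Rollout, and the HPA shown below is the one KEDA creates from it. The manifest here is a minimal sketch, not our exact configuration: the CPU target and min/max replicas match the HPA output below, while the cron trigger's schedule and desired replica count are placeholders.

```yaml
# Minimal sketch of the ScaledObject; schedule and desiredReplicas are placeholders.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: abc-app
spec:
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: abc-app
  minReplicaCount: 40
  maxReplicaCount: 120
  triggers:
    - type: cpu                 # CPU-based scaling at 60% utilization
      metricType: Utilization
      metadata:
        value: "60"
    - type: cron                # appears in the HPA events as the external s3-cron-... metric
      metadata:
        timezone: Asia/Kolkata
        start: "0 16 * * *"     # placeholder schedule
        end: "0 0 * * *"        # placeholder schedule
        desiredReplicas: "90"   # placeholder
```

The `kubectl describe hpa` output during one of these scale-ups: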

```
Reference:                 Rollout/abc-app
Target CPU utilization:    60%
Current CPU utilization:   4%
Min replicas:              40
Max replicas:              120
Rollout pods:              94 current / 94 desired
Events:
Type     Reason             Age                  From                       Message
----     ------             ----                 ----                       -------
Normal SuccessfulRescale 44m (x2 over 23h) horizontal-pod-autoscaler New size: 86; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 44m (x2 over 23h) horizontal-pod-autoscaler New size: 87; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 44m (x2 over 23h) horizontal-pod-autoscaler New size: 88; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 44m (x2 over 23h) horizontal-pod-autoscaler New size: 89; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 43m horizontal-pod-autoscaler New size: 90; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 43m horizontal-pod-autoscaler New size: 91; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 37m (x73 over 4d23h) horizontal-pod-autoscaler (combined from similar events): New size: 91; reason: All metrics below target
Normal SuccessfulRescale 13m (x6 over 42m) horizontal-pod-autoscaler New size: 94; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 8m41s (x8 over 23h) horizontal-pod-autoscaler New size: 92; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 8m26s (x8 over 23h) horizontal-pod-autoscaler New size: 93; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 3m9s (x6 over 32m) horizontal-pod-autoscaler New size: 91; reason: All metrics below target
```

KEDA version - 2.13.2
Argo Rollouts - v1.2.1
Kubernetes - 1.28

@anup1384 anup1384 added the bug Something isn't working label Sep 24, 2024
@anup1384 anup1384 changed the title HPA scales up stable replicas when doing canary deployment HPA scales up stable replicas when doing canary deployment with Argo Rollouts Sep 24, 2024

FaLxy commented Oct 16, 2024

Were you able to resolve this issue @anup1384?

anup1384 (Author)

Hi @FaLxy
Not yet, we are still facing the issue.


fernandrone commented Nov 13, 2024

FWIW, we've observed this in our infrastructure as well. In this instance, a service was configured with Argo Rollouts using three canary steps: 1%, 10%, and 65% of traffic. When traffic was increased from 10% to 65%, we saw a huge spike in both the canary and the stable ReplicaSets.

Here are the steps as they show on the Rollouts dashboard:

(screenshot of the canary steps in the Rollouts dashboard)

At this point the stable ReplicaSet scaled up to 42 replicas, which is the maximum allowed by the HPA, while the canary scaled up to 28 (which is 65% of 42, rounded up; I assume Argo Rollouts takes the traffic percentage of either the current or the total number of replicas).
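
If I'm reading the Argo Rollouts docs correctly, that lines up with the canary being sized as ceil(spec.replicas * weight / 100), with the stable ReplicaSet kept at full scale unless dynamicStableScale is enabled. A rough sketch follows; only the step weights reflect our actual rollout, the rest is illustrative.

```yaml
# Illustrative canary strategy; only the setWeight values match our rollout.
strategy:
  canary:
    # With the HPA pushing spec.replicas to 42, the 65% step yields
    # ceil(42 * 0.65) = 28 canary pods, while the stable ReplicaSet
    # stays at 42 since dynamicStableScale is not set here.
    # dynamicStableScale: true   # would shrink stable as the canary weight grows
    steps:
      - setWeight: 1
      - pause: {}
      - setWeight: 10
      - pause: {}
      - setWeight: 65
      - pause: {}
```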

Inspecting the HPA, we can see that both CPU and memory (it scales on both) are below the desired targets. It stayed like that for hours, throughout the whole canary, and never scaled down.

```
Reference:                                                                         Rollout/<redacted>
Metrics:                                                                           ( current / target )
  resource cpu of container "<redacted>" on pods  (as a percentage of request):     4% (41m) / 100%
  resource memory of container "<redacted>" on pods  (as a percentage of request):  42% (1834665691428m) / 60%
Min replicas:                                                                      10
Max replicas:                                                                      42
Rollout pods:                                                                      42 current / 42 desired
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    ReadyForNewScale  recommended size matches current size
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from memory container resource utilization (percentage of request)
  ScalingLimited  True    TooManyReplicas   the desired replica count is more than the maximum replica count
Events:           <none>
```

This is how it looks in Grafana: at about 13:30, when the traffic shift went from 10% to 65%, we saw the huge spike in replicas, both stable and canary, going from a total of 22 to 70.

(Grafana graph of stable and canary replica counts around the 10% to 65% traffic shift)


Argo Rollouts v1.7.1
Argo CD v2.9.1+58b04e5
Kubernetes v1.29.4
