
Scaleset controllers stuck every night with a "RunnerScaleSetSessionConflictException: there is already an active session" #3922

Closed
4 tasks done
mmuth opened this issue Feb 11, 2025 · 17 comments
Labels
bug Something isn't working gha-runner-scale-set Related to the gha-runner-scale-set mode

Comments

@mmuth

mmuth commented Feb 11, 2025

Checks

Controller Version

0.9.3

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently introduced backward-incompatible changes.

To Reproduce

- Deploy ARC and a scale set (a minimal values sketch follows below)
- Schedule workflows that run very frequently
- Observe the scale set controller failing with the exception after some random amount of time
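
For reference, a minimal gha-runner-scale-set values sketch for this kind of setup; the GitHub URL, secret name, and runner counts below are placeholders, not taken from the issue:

  githubConfigUrl: https://github.com/your-org/your-repo   # placeholder org/repo
  githubConfigSecret: gha-runner-github-secret              # placeholder secret holding a PAT or GitHub App credentials
  minRunners: 0
  maxRunners: 10
  runnerScaleSetName: gha-standard-scale-set                # scale set name that appears in the error below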

Describe the bug

Hi, since February 9th we have been observing the following logs in our scale set controllers on a daily basis:

2025/02/11 07:37:41 Application returned an error: createSession failed: failed to create session: actions error: StatusCode 409, AcivityId "....": GitHub.Actions.Runtime.WebApi.RunnerScaleSetSessionConflictException, GitHub.Actions.Runtime.WebApi: The actions runner scaleset gha-standard-scale-set already has an active session.

This does not occur for all scale sets, only for those that are used rather frequently (e.g. translation jobs that run every 3 minutes).
It has a major impact, as a lot of jobs queue up and all development teams are then stuck. The surprising thing is that it had worked like a charm for months.

Does anyone have the same problem?

Describe the expected behavior

  • The scale set controller should just continue to work (or retry acquiring a session)

Additional Context

-

Controller Logs

2025/02/11 07:37:41 Application returned an error: createSession failed: failed to create session: actions error: StatusCode 409, AcivityId "....": GitHub.Actions.Runtime.WebApi.RunnerScaleSetSessionConflictException, GitHub.Actions.Runtime.WebApi: The actions runner scaleset gha-standard-scale-set already has an active session.

Runner Pod Logs

(no runners launched)
@mmuth mmuth added bug Something isn't working gha-runner-scale-set Related to the gha-runner-scale-set mode needs triage Requires review from the maintainers labels Feb 11, 2025
Contributor

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

@sb185296

sb185296 commented Feb 11, 2025

I have the same issue!
For the past 2 days I have found myself stuck with it.

ghcr.io/actions/gha-runner-scale-set-controller:0.8.1

(two screenshots attached)

  chart: gha-runner-scale-set-controller
  sourceRef:
    kind: HelmRepository
    name: actions-runner-controller-charts
  version: '0.8.1'


  chart: gha-runner-scale-set
  sourceRef:
    kind: HelmRepository
    name: actions-runner-controller-charts
  version: '0.8.1'
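
For context, fragments like the above typically sit under spec.chart.spec of a Flux HelmRelease; the metadata name, namespaces, and interval in this sketch are illustrative assumptions, not taken from this comment:

  apiVersion: helm.toolkit.fluxcd.io/v2beta1
  kind: HelmRelease
  metadata:
    name: gha-runner-scale-set-controller
    namespace: arc-systems            # illustrative namespace
  spec:
    interval: 5m
    chart:
      spec:
        chart: gha-runner-scale-set-controller
        sourceRef:
          kind: HelmRepository
          name: actions-runner-controller-charts
          namespace: flux-system      # illustrative namespace
        version: '0.8.1'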

@matteovivona

I have the same issue with version 0.10.1

@kyrylomiro

Same issue here in the past two days. @Link- @nikola-jokic, do you folks know about any changes in the API, or what the problem could be?

@konck2

konck2 commented Feb 11, 2025

Confirmed here as well, runner version 0.9.3. The issue has been happening since February 7th, 2025.

@nikola-jokic
Collaborator

Hey everyone,

We rolled out changes to our API for stricter session validation that resulted in some unexpected conflict exceptions. We’ve rolled those changes back now. The listener should restart on its own, so the scale set should recover without any manual intervention. If not, please let me know, since the listener should recover from situations like this.

@nikola-jokic nikola-jokic removed the needs triage Requires review from the maintainers label Feb 11, 2025
@tomhaynes

🙏 Hoping that rollback works... this issue has caused serious disruption and stress for a large team of engineers for 4 days... maybe Jenkins wasn't so bad after all.

@nikola-jokic
Collaborator

Thank you, @tomhaynes, for letting us know!

And to everyone, we are very sorry for the disruption caused by the rollout. 😞

@tomhaynes

Thanks @nikola-jokic for explaining the issue; you were far more forthcoming than the various support requests we raised with our reps. We're a fairly large enterprise customer, and we rely on these runners for our production workflows. Well, we currently do - we will be re-evaluating this.

It is a bit of a concern that it took so long for the issue to be identified and rolled back - can you confirm that this will be looked at, so that such an incident doesn't recur, or is at least caught much more quickly?

@mmuth
Author

mmuth commented Feb 12, 2025

@tomhaynes yes, I would also be happy about better monitoring/rollback etc.
We had a lot of stress here, as 2,000 workflows queued up and 7 development teams were blocked in their daily work.

@nikola-jokic thanks for the rollback

@matteovivona

I want to side with the maintainers here. Instead of complaining about an OSS project, it would be enough to build a fallback system for all critical workflows that falls back to GitHub-hosted runners, which are almost always available :)
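
As an illustration of that idea, here is a minimal sketch (not from this thread) where the runner label is a workflow_dispatch input, so a critical workflow can be re-run on GitHub-hosted runners if the self-hosted scale set is stuck; the workflow name and default label are assumptions:

  name: critical-job
  on:
    workflow_dispatch:
      inputs:
        runner:
          description: 'Runner label to use'
          type: string
          default: 'gha-standard-scale-set'   # self-hosted scale set label; pass 'ubuntu-latest' to fall back to GitHub-hosted
  jobs:
    build:
      runs-on: ${{ inputs.runner }}           # evaluated before the job is scheduled
      steps:
        - uses: actions/checkout@v4
        - run: echo "Running on ${{ runner.name }}"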

Also, if we are talking about "Enterprise", you should look at other paid solutions rather than complaining about a FOSS project that is "currently developed and maintained in collaboration with the GitHub Actions team, external maintainers @mumoshu and @toast-gear, various contributors, and the awesome community" (cit.).

my 2 cents

@nikola-jokic
Collaborator

Hey @mmuth @tomhaynes, we added an alert to help us catch this kind of issue in the future, and the fix is rolled out.

Also, thank you, @matteovivona, for your support!

Closing this issue now since it seems to be resolved. We can always re-open it.

@tomhaynes

@matteovivona we heavily use both types of runner; we have network security requirements that mean we have to use self-hosted runners for some use cases. Could you point to the section of the GitHub docs that recommends not using these runners for critical workloads?

Regarding it being OSS: unless I'm misunderstanding, the issue was with the GitHub SaaS API, not the ARC controllers.

@nikola-jokic
Collaborator

Hey @tomhaynes,

You are right; the issue was on the API side. Using ARC or any self-hosted runner solution for critical workloads is perfectly fine. The point I think @matteovivona raised is that you should probably have a fallback mechanism in case an issue like this occurs, so it doesn't cause disruptions.

@marktuckcp

Hello @nikola-jokic, we're starting to see the same errors that were occurring last time. Do you know if the same change (or a change like the one that broke self-hosted runners) has been made again?

@nikola-jokic
Collaborator

Hey @marktuckcp,

Can you please send the listener log so we can investigate it?

@marktuckcp

Sorry, I missed this @nikola-jokic; you've been helping my colleague on this issue - #3935 - thanks.
