Scaleset controllers stuck every night with a "RunnerScaleSetSessionConflictException: there is already an active session" #3922
Comments
Hello! Thank you for filing an issue. The maintainers will triage your issue shortly. In the meantime, please take a look at the troubleshooting guide for bug reports. If this is a feature request, please review our contribution guidelines.
I have the same issue! ghcr.io/actions/gha-runner-scale-set-controller:0.8.1
I have the same issue with version 0.10.1
Same issue in the past two days. @Link- @nikola-jokic, do you folks know about any changes in the API, or what else could be the problem?
Confirmed here as well, runner version 0.9.3. The issue has been happening since 7 February 2025.
Hey everyone, we rolled out changes to our API for stricter session validation that resulted in some unexpected conflict exceptions. We've rolled those changes back now. The listener should restart on its own, so the scale set should recover without any manual intervention. If not, please let me know, since the listener should recover from situations like this.
🙏 hoping that rollback works.. this issue has caused serious disruption and stress for a large team of engineers for four days.. maybe Jenkins wasn't so bad after all
Thank you, @tomhaynes, for letting us know! And to everyone, we are very sorry for the disruption caused by the rollout. 😞
Thanks @nikola-jokic for explaining the issue; you were far more forthcoming than the various support requests we raised with our reps. We're a fairly large enterprise customer, and we rely on these runners for our production workflows.. well, we currently do; we will be re-evaluating this. It is a bit of a concern that it took so long for the issue to be identified and rolled back. Can you confirm that this will be looked at, so that such an incident doesn't recur, or is at least caught much more quickly?
@tomhaynes yes, I would also be happy about better monitoring, rollback handling, etc. @nikola-jokic, thanks for the rollback.
I want to side with the maintainers here. Instead of complaining about an OSS project, it would be enough to develop a fallback system for all critical workflows that falls back to GitHub-hosted runners, which are almost always available :) Also, if we are talking about "Enterprise", you should aim for other paid solutions rather than complaining about a FOSS project. My 2 cents.
Hey @mmuth @tomhaynes, we added an alert to help us catch this kind of issue in the future, and the fix is rolled out. Also, thank you, @matteovivona, for your support! Closing this issue now since it seems to be resolved. We can always re-open it.
@matteovivona we heavily use both types of runner.. we have network security requirements that mean we have to use self-hosted runners for some use cases. Could you point to the section of the GitHub docs that recommends not using these runners for critical workloads? Regarding it being OSS: unless I'm misunderstanding, the issue was with the GitHub SaaS API, not the ARC controllers..
Hey @tomhaynes, you are right; the issue was on the API side. Using ARC or any self-hosted runner solution for critical workloads is perfectly fine. The point I think @matteovivona raised is that you should probably have a fallback mechanism in case an issue like this occurs, so it doesn't cause disruptions.
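For anyone considering that kind of fallback, here is a minimal, hypothetical sketch (not taken from this thread): a workflow whose primary job targets the self-hosted scale set, with a second job that only runs on GitHub-hosted runners if the primary job fails. The `gha-standard-scale-set` label is borrowed from the logs in this issue, and `./build.sh` is a placeholder for your actual build step.

```yaml
# Hypothetical sketch: fall back to GitHub-hosted runners if the self-hosted job fails.
name: build-with-fallback

on: [push]

jobs:
  build-self-hosted:
    # Label assumed to match the ARC runner scale set from this issue.
    runs-on: gha-standard-scale-set
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4
      # Placeholder build step.
      - run: ./build.sh

  build-hosted-fallback:
    needs: build-self-hosted
    # Runs only when the self-hosted job has failed.
    if: ${{ failure() }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./build.sh
```

One caveat: if the self-hosted job is never picked up at all, it may simply sit queued rather than fail, so a fallback keyed on job failure alone may not cover every outage mode.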
Hello @nikola-jokic, we're starting to see the same errors that were occurring last time. Do you know whether the same change, or a change like the one that broke self-hosted runners, has been made again?
Hey @marktuckcp, can you please send the listener log so we can investigate it?
Sorry, I missed this @nikola-jokic; you've been helping my colleague on this issue - #3935 - thanks.
Checks
Controller Version
0.9.3
Deployment Method
Helm
Checks
To Reproduce
Describe the bug
Hi, since February 9th we have been observing the following logs in our scale set controllers on a daily basis:
2025/02/11 07:37:41 Application returned an error: createSession failed: failed to create session: actions error: StatusCode 409, AcivityId "....": GitHub.Actions.Runtime.WebApi.RunnerScaleSetSessionConflictException, GitHub.Actions.Runtime.WebApi: The actions runner scaleset gha-standard-scale-set already has an active session.
This does not occur for all scale sets, but only for those that are used rather frequently (e.g. translation jobs that run every 3 minutes).
It has a major impact, as a lot of jobs queue up and all development teams are then stuck. It had worked like a charm for months, which is the surprising thing about it.
Does anyone have the same problem?
Describe the expected behavior
Additional Context
Controller Logs
2025/02/11 07:37:41 Application returned an error: createSession failed: failed to create session: actions error: StatusCode 409, AcivityId "....": GitHub.Actions.Runtime.WebApi.RunnerScaleSetSessionConflictException, GitHub.Actions.Runtime.WebApi: The actions runner scaleset gha-standard-scale-set already has an active session.
Runner Pod Logs