
Scaleset controllers stuck every night with a "RunnerScaleSetSessionConflictException: there is already an active session" #3922

Closed
4 tasks done
mmuth opened this issue Feb 11, 2025 · 17 comments
Labels
bug Something isn't working gha-runner-scale-set Related to the gha-runner-scale-set mode

Comments

@mmuth

mmuth commented Feb 11, 2025

Checks

Controller Version

0.9.3

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently introduced backward-incompatible changes.

To Reproduce

- Deploy ARC and a scale set (a minimal values sketch follows below)
- Schedule workflows that run very frequently
- Observe the scale set controller failing with the exception after some random amount of time
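
For reference, a minimal gha-runner-scale-set values sketch for this kind of setup; the GitHub URL, secret name, and runner counts below are placeholders, not taken from the issue:

  githubConfigUrl: https://github.com/your-org/your-repo   # placeholder org/repo
  githubConfigSecret: gha-runner-github-secret              # placeholder secret holding a PAT or GitHub App credentials
  minRunners: 0
  maxRunners: 10
  runnerScaleSetName: gha-standard-scale-set                # scale set name that appears in the error below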

Describe the bug

Hi, since February 9th we have been observing the following logs in our scale set controllers on a daily basis:

2025/02/11 07:37:41 Application returned an error: createSession failed: failed to create session: actions error: StatusCode 409, AcivityId "....": GitHub.Actions.Runtime.WebApi.RunnerScaleSetSessionConflictException, GitHub.Actions.Runtime.WebApi: The actions runner scaleset gha-standard-scale-set already has an active session.

This does not occur for all scale sets, only for those that are used rather frequently (e.g. translation jobs that run every 3 minutes).
It has a major impact, as a lot of jobs queue up and all development teams are then stuck. The surprising thing is that it had worked like a charm for months.

Does anyone have the same problem?

Describe the expected behavior

  • The scale set controller should just continue to work (or retry acquiring a session)

Additional Context

-

Controller Logs

2025/02/11 07:37:41 Application returned an error: createSession failed: failed to create session: actions error: StatusCode 409, AcivityId "....": GitHub.Actions.Runtime.WebApi.RunnerScaleSetSessionConflictException, GitHub.Actions.Runtime.WebApi: The actions runner scaleset gha-standard-scale-set already has an active session.

Runner Pod Logs

(no runners launched)
@mmuth mmuth added bug Something isn't working gha-runner-scale-set Related to the gha-runner-scale-set mode needs triage Requires review from the maintainers labels Feb 11, 2025
Contributor

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

@sb185296

sb185296 commented Feb 11, 2025

I have the same issue!
For the past 2 days I have found myself stuck with it.

ghcr.io/actions/gha-runner-scale-set-controller:0.8.1

(two screenshots attached)

  chart: gha-runner-scale-set-controller
  sourceRef:
    kind: HelmRepository
    name: actions-runner-controller-charts
  version: '0.8.1'


  chart: gha-runner-scale-set
  sourceRef:
    kind: HelmRepository
    name: actions-runner-controller-charts
  version: '0.8.1'
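
For context, fragments like the above typically sit under spec.chart.spec of a Flux HelmRelease; the metadata name, namespaces, and interval in this sketch are illustrative assumptions, not taken from this comment:

  apiVersion: helm.toolkit.fluxcd.io/v2beta1
  kind: HelmRelease
  metadata:
    name: gha-runner-scale-set-controller
    namespace: arc-systems            # illustrative namespace
  spec:
    interval: 5m
    chart:
      spec:
        chart: gha-runner-scale-set-controller
        sourceRef:
          kind: HelmRepository
          name: actions-runner-controller-charts
          namespace: flux-system      # illustrative namespace
        version: '0.8.1'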

@matteovivona

I have the same issue with version 0.10.1

@kyrylomiro

Same issue here in the past two days. @Link- @nikola-jokic, do you folks know about any changes in the API, or what the problem could be?

@konck2

konck2 commented Feb 11, 2025

Confirmed here as well, runner version 0.9.3. The issue has been happening since February 7th, 2025.

@nikola-jokic
Collaborator

Hey everyone,

We rolled out changes to our API for stricter session validation that resulted in some unexpected conflict exceptions. We’ve rolled those changes back now. The listener should restart on its own, so the scale set should recover without any manual intervention. If not, please let me know, since the listener should recover from situations like this.

@nikola-jokic nikola-jokic removed the needs triage Requires review from the maintainers label Feb 11, 2025
@tomhaynes

🙏 Hoping that rollback works... this issue has caused serious disruption and stress for a large team of engineers for 4 days... maybe Jenkins wasn't so bad after all.

@nikola-jokic
Collaborator

Thank you, @tomhaynes, for letting us know!

And to everyone, we are very sorry for the disruption caused by the rollout. 😞

@tomhaynes

Thanks @nikola-jokic for explaining the issue; you were far more forthcoming than the various support requests we raised with our reps. We're a fairly large enterprise customer, and we rely on these runners for our production workflows. Well, we currently do - we will be re-evaluating this.

It is a bit of a concern that it took so long for the issue to be identified and rolled back - can you confirm that this will be looked at, so that such an incident doesn't recur, or is at least caught much more quickly?

@mmuth
Author

mmuth commented Feb 12, 2025

@tomhaynes yes, I would also be happy about better monitoring/rollback etc.
We had a lot of stress here, as 2,000 workflows queued up and 7 development teams were blocked in their daily work.

@nikola-jokic thanks for the rollback

@matteovivona

I want to side with the maintainers here. Instead of complaining about an OSS project, it would be enough to build a fallback system for all critical workflows that falls back to GitHub-hosted runners, which are almost always available :)
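
As an illustration of that idea, here is a minimal sketch (not from this thread) where the runner label is a workflow_dispatch input, so a critical workflow can be re-run on GitHub-hosted runners if the self-hosted scale set is stuck; the workflow name and default label are assumptions:

  name: critical-job
  on:
    workflow_dispatch:
      inputs:
        runner:
          description: 'Runner label to use'
          type: string
          default: 'gha-standard-scale-set'   # self-hosted scale set label; pass 'ubuntu-latest' to fall back to GitHub-hosted
  jobs:
    build:
      runs-on: ${{ inputs.runner }}           # evaluated before the job is scheduled
      steps:
        - uses: actions/checkout@v4
        - run: echo "Running on ${{ runner.name }}"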

Also, if we are talking about "Enterprise", you should look at other paid solutions rather than complaining about a FOSS project that is "currently developed and maintained in collaboration with the GitHub Actions team, external maintainers @mumoshu and @toast-gear, various contributors, and the awesome community" (cit.).

my 2 cents

@nikola-jokic
Collaborator

Hey @mmuth @tomhaynes, we added an alert to help us catch this kind of issue in the future, and the fix is rolled out.

Also, thank you, @matteovivona, for your support!

Closing this issue now since it seems to be resolved. We can always re-open it.

@tomhaynes

@matteovivona we heavily use both types of runner; we have network security requirements that mean we have to use self-hosted runners for some use cases. Could you point to the section of the GitHub docs that recommends not using these runners for critical workloads?

Regarding it being OSS: unless I'm misunderstanding, the issue was with the GitHub SaaS API, not the ARC controllers.

@nikola-jokic
Collaborator

Hey @tomhaynes,

You are right; the issue was on the API side. Using ARC or any self-hosted runner solution for critical workloads is perfectly fine. The point I think @matteovivona raised is that you should probably have a fallback mechanism in case an issue like this occurs, so it doesn't cause disruptions.

@marktuckcp

Hello @nikola-jokic, we're starting to see the same errors that were occurring last time. Do you know if the same change (or a change like the one that broke self-hosted runners) has been made again?

@nikola-jokic
Collaborator

Hey @marktuckcp,

Can you please send the listener log so we can investigate it?

@marktuckcp

Sorry, I missed this @nikola-jokic; you've been helping my colleague on this issue - #3935 - thanks.
