
"inprogress" jobs aren't actually executing after a worker terminates abnormally #542

Open
Ed1lan opened this issue Mar 25, 2025 · 4 comments

Comments


Ed1lan commented Mar 25, 2025

Hi! I am working with SolidQueue in a project and recently ran into a problem: the mission_control-jobs gem shows some jobs as running, but when I check the system processes, they don't exist.

Investigating further, I discovered a few things. I don't know why, but the workers shut down and restarted automatically, and then tried to reclaim the jobs that were running at that moment. This failed with the following logs:

Image

I read the part of the documentation that talks about the case where "someone pulls the cable", and confirmed that, as described, the jobs that were running at the time are still in the SolidQueue::ClaimedExecution table. If I check the current status of each of those jobs, it shows as "in progress", but the associated SolidQueue::Process does not exist.

I ran a little code to show the status of the jobs claimed by processes that no longer exist:

Image

Image

Image
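The check in the screenshots can be sketched as self-contained Ruby. The `WorkerProcess` and `Claim` structs below are stand-ins for the real `SolidQueue::Process` and `SolidQueue::ClaimedExecution` ActiveRecord models (which need a Rails app and a database); the logic is the same: find claims whose owning process is gone.

```ruby
# Stand-ins for SolidQueue::Process and SolidQueue::ClaimedExecution.
# In a real app these are ActiveRecord models backed by the database.
WorkerProcess = Struct.new(:id)
Claim         = Struct.new(:job_id, :process_id)

processes = [WorkerProcess.new(1)]           # only process 1 is alive
claims    = [Claim.new(100, 1),              # healthy claim
             Claim.new(101, 42)]             # claimed by a process that no longer exists

alive_ids = processes.map(&:id)
orphaned  = claims.reject { |c| alive_ids.include?(c.process_id) }

orphaned.each do |c|
  puts "job #{c.job_id} is claimed by missing process #{c.process_id}"
end
```

In a Rails console the equivalent would be a query over `SolidQueue::ClaimedExecution` filtering out rows whose `process_id` matches an existing `SolidQueue::Process`.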

To see how it behaved, I modified the "finished_at" field of one of the jobs, which made it be marked internally as :finished, but it still appears in mission_control's list of in-progress jobs and it is still in the SolidQueue::ClaimedExecution table.

Image

Additionally, if I inspect that specific job, it now comes up as :finished.

Image

How can I release these jobs correctly? How can I prevent this from happening? Am I missing something?
For now I plan to mark those jobs as finished manually, by setting the finished_at date and removing them from the SolidQueue::ClaimedExecution table, but I would like to know how to prevent this from happening, or whether this is a bug.
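The manual cleanup described above can be sketched the same way, with plain structs standing in for the `SolidQueue::Job` and `SolidQueue::ClaimedExecution` models (in a real console this would be an update of `finished_at` plus a delete of the stale claim rows):

```ruby
# Stand-ins for SolidQueue::Job and SolidQueue::ClaimedExecution.
Job   = Struct.new(:id, :finished_at)
Claim = Struct.new(:job_id, :process_id)

jobs   = [Job.new(101, nil)]
claims = [Claim.new(101, 42)]      # claim left behind by a dead worker

orphaned_job_ids = [101]           # found as in the previous check

# Mark the orphaned jobs as finished and drop their stale claims.
jobs.each { |j| j.finished_at = Time.now if orphaned_job_ids.include?(j.id) }
claims.reject! { |c| orphaned_job_ids.include?(c.job_id) }
```

Whether marking them finished (rather than failed) is the right resolution depends on whether the work actually completed before the worker died.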

PS: Sorry if my English is poor.

rosa (Member) commented Mar 25, 2025

Hey @Ed1lan, sorry about that! Those jobs should be released automatically (marked as failed) the next time the supervisor starts. Is this happening in development only? Was your computer going to sleep or something when all those workers died?

Ed1lan (Author) commented Mar 25, 2025

Hey @rosa, thank you for your fast reply! This happened in the production environment, and at that moment a backup snapshot of the server was in progress. Maybe the backup shut down the workers? But after that they didn't get marked as failed as they were supposed to.

rosa (Member) commented Mar 25, 2025

Maybe the backup shut down the workers? But after that they didn't get marked as failed as they were supposed to.

Hmm no, that shouldn't be related 🤔 They could have crashed or something, but that shouldn't happen 🤔

I just noticed in your first screenshot above that the code that should have marked your in-progress jobs as failed did run; these are the lines that say:

Fail claimed jobs (..) job_ids: ...

followed by a list of job IDs. You should see similar lines for the jobs in your other screenshots, the in-progress ones whose process doesn't exist. Did that not happen?

Ed1lan (Author) commented Mar 26, 2025

No, it didn't. Those jobs stayed in progress and claimed, as shown in the other screenshots.
