[14.x] Add worker keepalive support for lease-based queue drivers #60637
Open
brecht-vermeersch wants to merge 8 commits into
For now, this is only implemented for the daemon worker path. It does not currently apply to |
This got a bit longer than intended, but I wanted to capture the background and tradeoffs clearly :)
Summary
This adds opt-in keepalive support to the queue worker for drivers that use a lease, visibility timeout, or reservation window while a job is in progress.
I ran into this while working on an Azure Storage Queue driver. That driver has the same basic problem as other lease-based queues: if a job runs longer than the queue’s lease, the message can become visible again and get picked up a second time even though the first worker is still processing it.
Today, the usual way to deal with this in Laravel is to make sure the worker timeout and the queue lease are configured with enough headroom.
For most queue drivers, that means setting the connection's retry_after value high enough to cover the longest expected job, and keeping the worker --timeout a little lower so a stuck worker is killed before the job is made available again. For SQS, Laravel relies on the queue's own visibility timeout instead of a retry_after setting.
That works, but it is still a static limit. If a job occasionally runs longer than expected, the message can still become visible again before the worker finishes processing it.
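To make the static setup concrete, a connection entry in config/queue.php might look like the fragment below. The redis driver and the 90-second value are illustrative assumptions, not recommendations from this PR.

```php
<?php

// config/queue.php (excerpt). Values are illustrative only.
return [
    'connections' => [
        'redis' => [
            'driver' => 'redis',
            'connection' => 'default',
            'queue' => 'default',
            // Seconds before a reserved job is released back onto the queue.
            // This has to cover the longest expected job.
            'retry_after' => 90,
        ],
    ],
];
```

A worker for that connection would then run with a slightly lower timeout, for example: php artisan queue:work redis --timeout=60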
The idea here is to give queue connections a small contract they can implement if they know how to renew that lease while the job is still running.
Why this belongs in the worker
This is mainly useful for queues where “in progress” is time-bound on the transport side. A few examples that could benefit from this:
I used Symfony Messenger’s keepalive support as a reference point here. Symfony has a similar feature for transports that can mark a message as still being processed, and that was a useful starting point.
Design
The feature is opt-in.
A connection advertises support by implementing Illuminate\Contracts\Queue\KeepsJobsAlive, and the worker only attempts keepalive calls when the current connection implements that contract. The keepalive interval is configured at the worker level through WorkerOptions and the --keepalive option on queue:work / queue:listen.
I intentionally kept this worker-level for now. I started out exploring per-job configuration, but after stepping back it felt like extra API surface without a strong use case.
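A rough sketch of the opt-in shape, for illustration only. The interface name KeepsJobsAliveSketch, its keepAlive() signature, the InMemoryLeaseQueue class, and the maybeKeepAlive() helper are all hypothetical stand-ins; the actual contract in this PR may look different.

```php
<?php

// Hypothetical stand-in for Illuminate\Contracts\Queue\KeepsJobsAlive.
// The real signature is not shown here and may differ.
interface KeepsJobsAliveSketch
{
    /** Extend the lease on a job that is still being processed. */
    public function keepAlive(string $jobId, int $now): void;
}

// Toy in-memory connection that tracks a lease expiry per job.
final class InMemoryLeaseQueue implements KeepsJobsAliveSketch
{
    /** @var array<string, int> job id => lease expiry (Unix timestamp) */
    public array $leases = [];

    public function __construct(private int $leaseSeconds)
    {
    }

    public function reserve(string $jobId, int $now): void
    {
        $this->leases[$jobId] = $now + $this->leaseSeconds;
    }

    public function keepAlive(string $jobId, int $now): void
    {
        // Renew the lease for a full window starting from "now".
        $this->leases[$jobId] = $now + $this->leaseSeconds;
    }
}

// Worker-side check: only call keepalive when the connection opted in.
function maybeKeepAlive(object $connection, string $jobId, int $now): bool
{
    if (! $connection instanceof KeepsJobsAliveSketch) {
        return false; // Driver did not opt in; nothing to do.
    }

    $connection->keepAlive($jobId, $now);

    return true;
}
```

Connections that do not implement the contract are left untouched, which is what keeps the feature opt-in.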
Implementation notes
The main wrinkle is that the worker already uses SIGALRM for job timeouts.
Because of that, I could not just bolt on a second independent alarm loop. Instead, the worker now tracks two deadlines for the current job: the job timeout and the next keepalive.
It arms a single alarm for whichever one comes first. When the alarm fires, timeout still wins if both deadlines have been reached.
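The single-alarm scheduling can be sketched as two small pure functions. This is not the code from the PR; the function names secondsUntilNextAlarm() and actionOnAlarm() and the timestamp-based signatures are assumptions made for illustration.

```php
<?php

/**
 * Seconds until the earlier of the two deadlines, for one shared alarm.
 * A null keepalive deadline means keepalive is disabled.
 */
function secondsUntilNextAlarm(int $now, int $timeoutAt, ?int $keepAliveAt): int
{
    $next = $keepAliveAt === null ? $timeoutAt : min($timeoutAt, $keepAliveAt);

    // pcntl_alarm(0) cancels a pending alarm, so never return less than 1.
    return max(1, $next - $now);
}

/** Decide what to do when the alarm fires; timeout wins over keepalive. */
function actionOnAlarm(int $now, int $timeoutAt, ?int $keepAliveAt): string
{
    if ($now >= $timeoutAt) {
        return 'timeout';
    }

    if ($keepAliveAt !== null && $now >= $keepAliveAt) {
        return 'keepalive';
    }

    // Woke up early; arm the alarm again for the remaining time.
    return 'rearm';
}
```

In a real worker the returned delay would be handed to pcntl_alarm(), with the decision made inside a handler registered through pcntl_signal(SIGALRM, ...). After a keepalive, the next keepalive deadline moves forward by one interval and the alarm is armed again.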
There is an internal breaking change in Worker to make that work. In particular, one protected method was removed as part of simplifying the signal handling path, so this is aimed at a new major release. Even with that change, I tried to keep the overall diff as small as I could and avoid turning this into a larger scheduler abstraction.
Tradeoffs
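I looked at two directions here: renewing the lease from a separate child process, or renewing it from the worker's existing signal-based alarm path.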
A child-process approach would avoid doing more work from the alarm path, but it adds a lot more coordination, process management, and failure handling. I went with signals because it fits the worker’s existing timeout model, keeps the implementation much smaller, and is also the direction Symfony took for Messenger’s keepalive support.
That does come with an important caveat: a transport’s keepAlive() implementation should stay cheap.
For transports that renew a lease over HTTP, long blocking calls in the keepalive path are a real concern. Drivers should use aggressive request timeouts and avoid treating keepalive as a general-purpose API call. If a transport cannot renew its lease quickly and predictably, it may not be a good fit for this model.
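As a rough illustration of "aggressive request timeouts", a driver that renews over HTTP could bound the request as in the sketch below. The function names, the PUT method, and the 2-second default are assumptions; a real driver would more likely use its own SDK or HTTP client, with equivalent limits.

```php
<?php

/**
 * Build stream context options for a short-lived lease renewal request.
 * The 2-second default is an illustrative assumption.
 */
function leaseRenewalHttpOptions(float $timeoutSeconds = 2.0): array
{
    return [
        'http' => [
            'method' => 'PUT',
            'timeout' => $timeoutSeconds, // Read timeout in seconds.
            'ignore_errors' => false,     // Treat 4xx/5xx responses as failures.
        ],
    ];
}

/** Try to renew a lease; report failure instead of blocking for long. */
function renewLeaseOverHttp(string $renewUrl, float $timeoutSeconds = 2.0): bool
{
    $context = stream_context_create(leaseRenewalHttpOptions($timeoutSeconds));

    // Suppress the warning on failure; the caller only needs success or failure.
    $response = @file_get_contents($renewUrl, false, $context);

    return $response !== false;
}
```

The point is only that the renewal either succeeds quickly or gives up quickly, so the alarm path is never held up by a slow remote call.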
Driver guidance
One thing worth calling out for driver authors: the transport lease should be longer than the keepalive interval.
Symfony’s Amazon SQS transport enforces the weaker rule that the queue visibility timeout must not be smaller than the keepalive interval. I think the practical guidance should be a bit stricter than that: leave some headroom. Alarms are not perfectly punctual, and network-backed renewals can be delayed. Running the lease and keepalive cadence edge-to-edge leaves very little margin for jitter.
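A driver could check this headroom when it is configured. The helper below is a minimal sketch; the name hasLeaseHeadroom() and the default factor of 2 are assumptions, not something this PR prescribes.

```php
<?php

/**
 * Check that the lease leaves headroom over the keepalive interval.
 * A factor of 1.0 is the weaker "lease not smaller than interval" rule;
 * the default of 2.0 is an illustrative assumption.
 */
function hasLeaseHeadroom(int $leaseSeconds, int $keepAliveSeconds, float $factor = 2.0): bool
{
    if ($keepAliveSeconds <= 0) {
        return true; // Keepalive disabled; nothing to check.
    }

    return $leaseSeconds >= $keepAliveSeconds * $factor;
}
```

With these defaults, a 30-second lease passes with a 10-second keepalive interval but fails with a 30-second one, which is the edge-to-edge case described above.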
Scope
This PR adds the worker support and the opt-in contract, but does not update any specific transport yet. I think transport-specific implementations are easier to review as follow-up changes once the worker-level behavior is agreed on.