refactor!: Align RequestQueueClient interface with its Python counterpart#3729

Draft
janbuchar wants to merge 8 commits into v4 from refactor/align-request-queue-client-interface

@janbuchar janbuchar commented Jun 10, 2026

Contributor

@janbuchar janbuchar force-pushed the refactor/align-request-queue-client-interface branch from 6803f1f to acc2337 Compare June 10, 2026 10:13
Reduce `RequestQueueClient` from 12 methods to 9.
Replace `listHead`/`addRequest`/`batchAddRequests`/`updateRequest`/
`listAndLockHead`/`prolongRequestLock`/`deleteRequestLock` with
`addBatchOfRequests`/`fetchNextRequest`/`markRequestAsHandled`/
`reclaimRequest`/`isEmpty`, and drop the now-dead head/lock option types
(`QueueHead`, `ListOptions`, `ListAndLockOptions`, `ListAndLockHeadResult`,
`ProlongRequestLockOptions`, `ProlongRequestLockResult`,
`DeleteRequestLockOptions`, `RequestQueueHeadItem`).

Locking/coordination of multiple clients on the same queue is now an
internal concern of the client implementation, not part of the interface.

Part of #3075.
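
For illustration, the fetch/handle/reclaim lifecycle implied by the slimmed-down interface could look roughly like the sketch below. Only the five method names come from this PR; the `QueueRequest` shape, the parameter and return types, and the `processAll` helper are assumptions made for the example and may not match the actual signatures.

```typescript
// Hypothetical request shape, for illustration only.
interface QueueRequest {
    uniqueKey: string;
    url: string;
}

// Assumed signatures; only the method names are taken from the PR description.
interface SlimRequestQueueClient {
    addBatchOfRequests(requests: QueueRequest[]): Promise<void>;
    fetchNextRequest(): Promise<QueueRequest | null>;
    markRequestAsHandled(request: QueueRequest): Promise<QueueRequest | null>;
    reclaimRequest(request: QueueRequest): Promise<QueueRequest | null>;
    isEmpty(): Promise<boolean>;
}

// Hypothetical consumer loop: fetch a request, then either mark it as handled
// or hand it back to the queue. Locking never appears on the caller's side.
async function processAll(
    client: SlimRequestQueueClient,
    handler: (request: QueueRequest) => Promise<void>,
): Promise<number> {
    let handled = 0;
    for (;;) {
        const request = await client.fetchNextRequest();
        if (request === null) break; // nothing fetchable right now
        try {
            await handler(request);
            await client.markRequestAsHandled(request);
            handled++;
        } catch {
            // Return the request to the queue. A real crawler would cap retries,
            // otherwise a permanently failing request loops forever.
            await client.reclaimRequest(request);
        }
    }
    return handled;
}
```

The point of the sketch is that the caller only sees requests and their outcome; how the client keeps two consumers from fetching the same request is left to the implementation.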
@janbuchar janbuchar force-pushed the refactor/align-request-queue-client-interface branch 2 times, most recently from c16bb30 to 01d189a Compare June 10, 2026 13:16
Reimplement the in-memory request queue client against the new interface.
The client now owns the pending/in-progress/handled bookkeeping (an
`inProgress` set on top of the existing `orderNo`-based ordering):

- `addBatchOfRequests` replaces `addRequest`/`batchAddRequests`
- `fetchNextRequest` pops the next pending request and marks it in progress
- `markRequestAsHandled`/`reclaimRequest` replace `updateRequest` and
  operate on in-progress requests (returning `null` when not in progress)
- `getRequest` is keyed by `uniqueKey`
- `isEmpty` reports whether any pending request remains (in-progress
  requests are not counted)

The lock-based `listHead`/`listAndLockHead`/`prolongRequestLock`/
`deleteRequestLock` methods are removed.

Part of #3075.
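
A toy model of the bookkeeping described in this commit, as a rough sketch rather than the actual implementation. It is synchronous for brevity, uses a counter instead of timestamps for `orderNo`, and assumes that forefront requests get a negative `orderNo` and that a reclaimed request goes to the back of the queue. The `ToyQueueClient` and `Entry` names are hypothetical.

```typescript
// Hypothetical entry shape; `orderNo` is null once the request is handled,
// and lower values are fetched first.
interface Entry {
    uniqueKey: string;
    url: string;
    orderNo: number | null;
}

class ToyQueueClient {
    private entries = new Map<string, Entry>();
    private inProgress = new Set<string>();
    private counter = 0; // stands in for a timestamp

    addBatchOfRequests(requests: { uniqueKey: string; url: string }[], forefront = false): void {
        for (const { uniqueKey, url } of requests) {
            if (this.entries.has(uniqueKey)) continue; // deduplicate by uniqueKey
            const tick = ++this.counter;
            this.entries.set(uniqueKey, { uniqueKey, url, orderNo: forefront ? -tick : tick });
        }
    }

    // Pick the pending request with the lowest orderNo and mark it in progress.
    fetchNextRequest(): Entry | null {
        let next: Entry | null = null;
        let bestOrder = Infinity;
        for (const entry of this.entries.values()) {
            if (entry.orderNo === null || this.inProgress.has(entry.uniqueKey)) continue;
            if (entry.orderNo < bestOrder) {
                next = entry;
                bestOrder = entry.orderNo;
            }
        }
        if (next !== null) this.inProgress.add(next.uniqueKey);
        return next;
    }

    // Returns null when the request is not in progress.
    markRequestAsHandled(uniqueKey: string): Entry | null {
        if (!this.inProgress.delete(uniqueKey)) return null;
        const entry = this.entries.get(uniqueKey)!;
        entry.orderNo = null;
        return entry;
    }

    // Returns null when the request is not in progress.
    reclaimRequest(uniqueKey: string, forefront = false): Entry | null {
        if (!this.inProgress.delete(uniqueKey)) return null;
        const entry = this.entries.get(uniqueKey)!;
        const tick = ++this.counter;
        entry.orderNo = forefront ? -tick : tick;
        return entry;
    }

    getRequest(uniqueKey: string): Entry | null {
        return this.entries.get(uniqueKey) ?? null;
    }

    // Semantics as described in this commit: in-progress requests are not counted.
    isEmpty(): boolean {
        for (const entry of this.entries.values()) {
            if (entry.orderNo !== null && !this.inProgress.has(entry.uniqueKey)) return false;
        }
        return true;
    }
}
```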
…estQueue` class

`RequestProvider`, `RequestQueueV1` and `RequestQueueV2` no longer differ in
behaviour — request coordination (locking, queue-head management) is now an
internal concern of the storage client — so they are merged into a single
concrete `RequestQueue` class:

- `RequestProvider` becomes `RequestQueue`; the `request_provider.ts` and
  `request_queue_v2.ts` modules are removed and the implementation lives in
  `request_queue.ts`.
- `fetchNextRequest`/`markRequestHandled`/`reclaimRequest`/`isEmpty` delegate to
  the slim client; the queue-head, locking, consistency and
  `recentlyHandledRequests` bookkeeping is gone.
- `isFinished` returns `false` while a background add batch is in flight,
  otherwise `client.isEmpty()`.
- `isEmpty()` reflects only pending requests (the next `fetchNextRequest()`
  would return `null`), preserving the crawler's task-scheduling contract.
- `getRequest` is keyed by `uniqueKey`.

The `RequestProvider`/`RequestQueueV1`/`RequestQueueV2` exports are removed; use
`RequestQueue` instead. `RequestProviderOptions` is renamed to
`RequestQueueOptions`.

Part of #3075.
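
The `isFinished` rule above can be sketched as follows. This is a minimal illustration, not the code from the PR: the `QueueFinishTracker` class, the `ClientLike` interface and the in-flight counter are hypothetical names, and the real `RequestQueue` may track background adds differently.

```typescript
// Hypothetical minimal view of the client; only `isEmpty` matters here.
interface ClientLike {
    isEmpty(): Promise<boolean>;
}

class QueueFinishTracker {
    private readonly client: ClientLike;
    // Number of background add-batch operations still running (assumed bookkeeping).
    private inFlightAddBatches = 0;

    constructor(client: ClientLike) {
        this.client = client;
    }

    // Wrap a background add so the counter is decremented even if the add fails.
    trackBackgroundAdd<T>(work: Promise<T>): Promise<T> {
        this.inFlightAddBatches++;
        return work.finally(() => {
            this.inFlightAddBatches--;
        });
    }

    // Not finished while a background add is in flight; otherwise defer to the client.
    async isFinished(): Promise<boolean> {
        if (this.inFlightAddBatches > 0) return false;
        return this.client.isEmpty();
    }
}
```

Checking the counter first matters because the client can look empty for a moment while requests from a background batch have not been written yet.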
Adapt `BasicCrawler` to the slim `RequestQueueClient` and the merged
`RequestQueue` class:

- Use `RequestQueue` in place of the removed `RequestProvider`/`RequestQueueV1`
  (instanceof checks, `open()`, parameter and field types).
- The same-domain-delay path no longer pokes the queue's private `inProgress`
  set; it relies on `reclaimRequest` to return the request to the queue.
- The error-handling safety net reclaims the request via `reclaimRequest`
  instead of calling the removed `deleteRequestLock`. Reclaiming a request that
  is no longer in progress is a harmless no-op on the client.

Part of #3075.
Rewrite the request-queue test suites against the new API. Obsolete white-box
tests of the removed locking/queue-head machinery are replaced with behavioral
tests using a real MemoryStorage-backed client:

- memory-storage forefront/handledRequestCount/ignore-non-json tests now drive
  `addBatchOfRequests`/`fetchNextRequest`/`markRequestAsHandled`/`reclaimRequest`/
  `isEmpty`.
- core `request-queue-v2` and `request_queue` tests cover the new lifecycle and
  `isEmpty`/`isFinished` semantics.
- `MemoryStorageEmulator.getRequestQueueItems` drains pending requests via
  `fetchNextRequest` and reclaims them, since `listHead` no longer exists.

Part of #3075.
Expand the `RequestQueueClient` migration table in the v4 upgrading guide with
the full method mapping, the new fetch/handle/reclaim lifecycle, the `isEmpty`
semantics, and the merge of `RequestProvider`/`RequestQueueV1`/`RequestQueueV2`
into a single `RequestQueue` class. List the removed head/lock types.

Drop the obsolete request-locking experiment guide (locking is no longer an
opt-in experiment) and remove its now-empty "Experiments" sidebar category.
Update the parallel scraping guide to use `RequestQueue` and drop the
`requestLocking` experiment flag.

Closes #3075.
@janbuchar janbuchar force-pushed the refactor/align-request-queue-client-interface branch from 01d189a to 1d3ccf5 Compare June 10, 2026 14:35
@janbuchar janbuchar marked this pull request as draft June 10, 2026 14:40
janbuchar commented Jun 10, 2026

Contributor Author

@vdusek — a heads-up on a RequestQueueClient.is_empty semantic that surfaced while porting this RQ-client rework to crawlee-js. It's not a bug, but the contract is subtler than the method name and the base docstring suggest, and I think it's worth documenting on the Python side too.

TL;DR — In crawlee-python, RequestQueueClient.is_empty() does not mean "the next fetch_next_request() would return None". It means "there is no outstanding work left at all" — including requests currently in progress / locked (fetched but not yet handled or reclaimed). It's effectively a building block for is_finished(), not an "is there anything fetchable right now" check.

Where this comes from — all three clients agree, even though the base interface doesn't state it:

  • FS client: returns False as soon as len(state.in_progress_requests) > 0.
  • Apify single client: return not self._head_requests and not self._requests_in_progress.
  • Apify shared client: return len(head.items) == 0 and not self._queue_has_locked_requests.

The RequestQueue.is_empty() storage-wrapper docstring states it correctly ("either pending or being processed"), but the RequestQueueClient.is_empty() base interface docstring just says "True if the request queue is empty" — which reads like the weaker guarantee. The fetch_next_request docstrings ("a None return value does not mean processing finished … use is_finished") reinforce the wrong mental model.

Why it matters (multi-consumer): is_empty() feeds is_finished(), and the autoscaled pool calls is_finished() on every orchestrator loop iteration, not only when idle. So a narrower is_empty() would let a consumer stop while it (or, in the shared case, another consumer) still holds the last requests under a lock. queue_has_locked_requests is the multi-consumer analogue of the single client's _requests_in_progress set.

> is_empty() feeds is_finished()

... and also this is kinda weird in itself.

A request fetched via `fetchNextRequest` is locked (in progress) for 3
minutes by persisting a future `orderNo` to disk. If the process ends
before the request is handled or reclaimed, that lock used to linger
until it expired, blocking the request for the next consumer of the same
on-disk queue.

The in-memory client now tracks the requests it has locked and releases
them in `MemoryStorage.teardown()`, resetting their `orderNo` (sign
preserved, so forefront/normal ordering survives) so they become
immediately fetchable again.

Only this client needs the cleanup: the Apify platform releases a run's
locks automatically on migrate/abort, and the file-system storage does
not lock at all.
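
A rough sketch of the lock-and-release arithmetic described above, under stated assumptions: the magnitude of `orderNo` is treated as a millisecond timestamp, forefront requests carry a negative `orderNo`, and a request counts as fetchable once that magnitude is no longer in the future. The `LockTracker` class and the helper functions are hypothetical and only mirror the idea of releasing tracked locks on teardown.

```typescript
const LOCK_MILLIS = 3 * 60 * 1000; // the 3-minute lock mentioned above

// Assumed convention: |orderNo| is a timestamp in ms, negative for forefront requests.
interface StoredRequest {
    uniqueKey: string;
    orderNo: number;
}

// Push the timestamp into the future while keeping the sign.
function lockOrderNo(orderNo: number, now: number): number {
    return Math.sign(orderNo) * (now + LOCK_MILLIS);
}

// Reset the timestamp to "now" while keeping the sign.
function releaseOrderNo(orderNo: number, now: number): number {
    return Math.sign(orderNo) * now;
}

function isFetchable(orderNo: number, now: number): boolean {
    return Math.abs(orderNo) <= now;
}

// Hypothetical tracker of the requests this client instance has locked.
class LockTracker {
    private locked = new Set<StoredRequest>();

    lock(request: StoredRequest, now: number): void {
        request.orderNo = lockOrderNo(request.orderNo, now);
        this.locked.add(request);
    }

    // Called once a request is handled or reclaimed, so teardown leaves it alone.
    settle(request: StoredRequest): void {
        this.locked.delete(request);
    }

    // Called on teardown: make every still-locked request fetchable again.
    releaseAll(now: number): void {
        for (const request of this.locked) {
            request.orderNo = releaseOrderNo(request.orderNo, now);
        }
        this.locked.clear();
    }
}
```

Because only the magnitude is rewritten, a forefront request is still ordered ahead of normal requests after its lock is released.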
`RequestQueueClient.isEmpty()` previously reported `true` as soon as no
request was immediately fetchable, ignoring requests that are merely
locked (in progress) by another consumer. With multiple consumers
sharing a queue, a consumer could therefore see the queue as empty —
and let the crawler finish — while another consumer still held the last
requests under a lock.

`listPendingHead` now also reports whether it skipped any unhandled-but-
locked request, and `isEmpty()` returns `true` only when nothing is
pending AND nothing is locked. This mirrors the Apify platform shared
client, whose `isEmpty` accounts for `queueHasLockedRequests`.

Tests that encoded the old "in-progress means empty" semantics are
updated accordingly (an in-progress request now keeps the queue
non-empty and unfinished until it is handled).
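
The stricter check can be illustrated with a small sketch. It assumes that a lock is encoded as an `orderNo` whose magnitude lies in the future and that handled requests have `orderNo === null`; the `hadLocked` field name and the function signatures are hypothetical, not the ones used in the PR.

```typescript
// Assumed storage shape: null orderNo = handled, |orderNo| > now = locked.
interface StoredRequest {
    uniqueKey: string;
    orderNo: number | null;
}

interface PendingHead {
    items: StoredRequest[]; // requests that are fetchable right now
    hadLocked: boolean; // true if an unhandled but locked request was skipped
}

function listPendingHead(requests: StoredRequest[], now: number): PendingHead {
    const items: StoredRequest[] = [];
    let hadLocked = false;
    for (const request of requests) {
        if (request.orderNo === null) continue; // already handled
        if (Math.abs(request.orderNo) > now) {
            hadLocked = true; // unhandled, but held by some consumer
            continue;
        }
        items.push(request);
    }
    return { items, hadLocked };
}

// Empty only when nothing is pending AND nothing is locked.
function isEmpty(requests: StoredRequest[], now: number): boolean {
    const head = listPendingHead(requests, now);
    return head.items.length === 0 && !head.hadLocked;
}
```

With only a locked request left, `items` is empty but `hadLocked` is `true`, so `isEmpty` returns `false` where the previous check would have returned `true`.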