refactor!: Align RequestQueueClient interface with its Python counterpart#3729
refactor!: Align RequestQueueClient interface with its Python counterpart#3729janbuchar wants to merge 8 commits into
RequestQueueClient interface with its Python counterpart#3729Conversation
6803f1f to
acc2337
Compare
Reduce `RequestQueueClient` from 12 methods to 9. Replace `listHead`/`addRequest`/`batchAddRequests`/`updateRequest`/ `listAndLockHead`/`prolongRequestLock`/`deleteRequestLock` with `addBatchOfRequests`/`fetchNextRequest`/`markRequestAsHandled`/ `reclaimRequest`/`isEmpty`, and drop the now-dead head/lock option types (`QueueHead`, `ListOptions`, `ListAndLockOptions`, `ListAndLockHeadResult`, `ProlongRequestLockOptions`, `ProlongRequestLockResult`, `DeleteRequestLockOptions`, `RequestQueueHeadItem`). Locking/coordination of multiple clients on the same queue is now an internal concern of the client implementation, not part of the interface. Part of #3075.
c16bb30 to
01d189a
Compare
Reimplement the in-memory request queue client against the new interface. The client now owns the pending/in-progress/handled bookkeeping (an `inProgress` set on top of the existing `orderNo`-based ordering): - `addBatchOfRequests` replaces `addRequest`/`batchAddRequests` - `fetchNextRequest` pops the next pending request and marks it in progress - `markRequestAsHandled`/`reclaimRequest` replace `updateRequest` and operate on in-progress requests (returning `null` when not in progress) - `getRequest` is keyed by `uniqueKey` - `isEmpty` reports whether any pending request remains (in-progress requests are not counted) The lock-based `listHead`/`listAndLockHead`/`prolongRequestLock`/ `deleteRequestLock` methods are removed. Part of #3075.
…estQueue` class `RequestProvider`, `RequestQueueV1` and `RequestQueueV2` no longer differ in behaviour — request coordination (locking, queue-head management) is now an internal concern of the storage client — so they are merged into a single concrete `RequestQueue` class: - `RequestProvider` becomes `RequestQueue`; the `request_provider.ts` and `request_queue_v2.ts` modules are removed and the implementation lives in `request_queue.ts`. - `fetchNextRequest`/`markRequestHandled`/`reclaimRequest`/`isEmpty` delegate to the slim client; the queue-head, locking, consistency and `recentlyHandledRequests` bookkeeping is gone. - `isFinished` returns `false` while a background add batch is in flight, otherwise `client.isEmpty()`. - `isEmpty()` reflects only pending requests (the next `fetchNextRequest()` would return `null`), preserving the crawler's task-scheduling contract. - `getRequest` is keyed by `uniqueKey`. The `RequestProvider`/`RequestQueueV1`/`RequestQueueV2` exports are removed; use `RequestQueue` instead. `RequestProviderOptions` is renamed to `RequestQueueOptions`. Part of #3075.
Adapt `BasicCrawler` to the slim `RequestQueueClient` and the merged `RequestQueue` class: - Use `RequestQueue` in place of the removed `RequestProvider`/`RequestQueueV1` (instanceof checks, `open()`, parameter and field types). - The same-domain-delay path no longer pokes the queue's private `inProgress` set; it relies on `reclaimRequest` to return the request to the queue. - The error-handling safety net reclaims the request via `reclaimRequest` instead of calling the removed `deleteRequestLock`. Reclaiming a request that is no longer in progress is a harmless no-op on the client. Part of #3075.
Rewrite the request-queue test suites against the new API. Obsolete white-box tests of the removed locking/queue-head machinery are replaced with behavioral tests using a real MemoryStorage-backed client: - memory-storage forefront/handledRequestCount/ignore-non-json tests now drive `addBatchOfRequests`/`fetchNextRequest`/`markRequestAsHandled`/`reclaimRequest`/ `isEmpty`. - core `request-queue-v2` and `request_queue` tests cover the new lifecycle and `isEmpty`/`isFinished` semantics. - `MemoryStorageEmulator.getRequestQueueItems` drains pending requests via `fetchNextRequest` and reclaims them, since `listHead` no longer exists. Part of #3075.
Expand the `RequestQueueClient` migration table in the v4 upgrading guide with the full method mapping, the new fetch/handle/reclaim lifecycle, the `isEmpty` semantics, and the merge of `RequestProvider`/`RequestQueueV1`/`RequestQueueV2` into a single `RequestQueue` class. List the removed head/lock types. Drop the obsolete request-locking experiment guide (locking is no longer an opt-in experiment) and remove its now-empty "Experiments" sidebar category. Update the parallel scraping guide to use `RequestQueue` and drop the `requestLocking` experiment flag. Closes #3075.
01d189a to
1d3ccf5
Compare
|
@vdusek — a heads-up on a TL;DR — In crawlee-python, Where this comes from — all three clients agree, even though the base interface doesn't state it:
The Why it matters (multi-consumer) —
... and also this is kinda weird in itself. |
A request fetched via `fetchNextRequest` is locked (in progress) for 3 minutes by persisting a future `orderNo` to disk. If the process ends before the request is handled or reclaimed, that lock used to linger until it expired, blocking the request for the next consumer of the same on-disk queue. The in-memory client now tracks the requests it has locked and releases them in `MemoryStorage.teardown()`, resetting their `orderNo` (sign preserved, so forefront/normal ordering survives) so they become immediately fetchable again. Only this client needs the cleanup: the Apify platform releases a run's locks automatically on migrate/abort, and the file-system storage does not lock at all.
`RequestQueueClient.isEmpty()` previously reported `true` as soon as no request was immediately fetchable, ignoring requests that are merely locked (in progress) by another consumer. With multiple consumers sharing a queue, a consumer could therefore see the queue as empty — and let the crawler finish — while another consumer still held the last requests under a lock. `listPendingHead` now also reports whether it skipped any unhandled-but- locked request, and `isEmpty()` returns `true` only when nothing is pending AND nothing is locked. This mirrors the Apify platform shared client, whose `isEmpty` accounts for `queueHasLockedRequests`. Tests that encoded the old "in-progress means empty" semantics are updated accordingly (an in-progress request now keeps the queue non-empty and unfinished until it is handled).
DatasetClientandKeyValueStoreClientinterfaces with their Python counterparts #3627BasicCrawler#2840MemoryStorageEmulator, which is slated for removal in Split off FileSystemStorage from MemoryStorage #3593. ItsgetRequestQueueItemshelper now drains the queue viafetchNextRequest()and reclaims everything afterwards. BecausereclaimRequestrecomputesorderNofrom scratch, this does not faithfully restore the original ordering (forefront requests lose their sign/priority, and same-millisecond reclaims can collapse relative order). That's fine here: all current callers use non-forefront requests with order-tolerant assertions, and the helper is going away with Split off FileSystemStorage from MemoryStorage #3593 — so it's not worth fixing.