perf(redis_admin): lazy clear-by-filter preview (count off the request thread) #907
Conversation
The clear-by-filter preview page used to call `streaming_celery_count` +
materialise up to `CELERY_BROKER_DISPLAY_LIMIT` messages inside the GET
handler before painting anything; on a 1M-deep queue that's tens of
seconds of blocking work in the gunicorn request thread.
Four-tier fix, layered:
1. Lazy preview load. GET handler renders the form + skeleton in
<200ms, no Redis scans. New JSON endpoint
`clear-by-filter/preview/` computes the count + sample on demand;
`celery_clear_by_filter_preview.{js,css}` fills them in once they
arrive. Submit buttons start disabled and unlock on count
completion.
2. Background count job for huge queues. Above
REDIS_ADMIN_CLEAR_BY_FILTER_PREVIEW_INLINE_LIMIT (default 200_000)
the preview endpoint kicks off a `dry_run=True` chunked-clear job
(existing start_celery_broker_clear_job machinery — no new
infrastructure) and returns {job_id, status_url}; JS polls the
existing status hash and reads `matched` as the count on
completion.
3. 60s synchronous-mode count cache in cache Redis under
`redis_admin:preview_count:<sha256(queue|task|repoid|commitid)>`.
Refresh / Back-Forward / multi-tab return the cached count
instantly with a `cached_at` field. `?bust=1` skips the cache.
4. Skip-count escape hatch. Muted-italic link routes to a
tiny re-render of the preview page with `count_skipped=1` set;
the operator types the queue name and proceeds without waiting
for the count.
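The tier-3 cache key shape above can be sketched as follows. This is a minimal illustration built only from the key shape given in the commit message; the function name is hypothetical (the PR's own helper is referenced elsewhere as `_clear_by_filter_preview_cache_key`):

```python
import hashlib

def preview_count_cache_key(queue: str, task: str, repoid: str, commitid: str) -> str:
    """Build the tier-3 count-cache key: a SHA-256 over the canonical
    pipe-joined filter tuple, namespaced under redis_admin:preview_count:.
    (Illustrative sketch of the documented key shape, not the PR's code.)"""
    digest = hashlib.sha256(
        "|".join((queue, task, repoid, commitid)).encode()
    ).hexdigest()
    return f"redis_admin:preview_count:{digest}"
```

Because the digest is over the full filter tuple, two operators previewing the same filter share one cache entry, while any change to any filter field yields a fresh key.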
Coverage: 12 new tests in
`redis_admin/tests/test_clear_by_filter_preview.py`, all under
`-m 'not django_db'`. No `@django_db` tests added — the audit-log
integration for the job-mode dry-run is already covered by
`test_celery_broker_clear_job.py`'s chunked-mode audit test, which
runs in CI.
PR #907 made the clear-by-filter confirmation page render a skeleton on initial GET; the count + sample now load via the JSON preview endpoint that the JS hits on DOMContentLoaded. The four failing tests were still asserting the old eager-render "Matched: N message(s)" string against the GET body. Update them to:

* pin the skeleton markers (`#celery-clear-by-filter-preview-root`, `.celery-clear-preview-loader`, "Counting matches…")
* pin the dry-run flash text ("would clear N of M matching message(s)"), which still surfaces synchronously via the Django messages framework on the dry-run response
* pin the keep-one button's server-rendered hidden state (`.celery-clear-keep-one.celery-clear-preview-hidden`) — the JS-driven `match_count <= 1` invariant moved to the preview JSON endpoint's tests
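The skeleton-marker pinning described above could look like the sketch below. The selector strings come straight from the commit message; the helper name and the idea of asserting against a raw HTML string are illustrative, not the PR's actual test body:

```python
# Hedged sketch: pin the three server-rendered skeleton markers on the
# GET body instead of the old eager "Matched: N message(s)" string.
def assert_skeleton_rendered(html: str) -> None:
    assert "celery-clear-by-filter-preview-root" in html
    assert "celery-clear-preview-loader" in html
    assert "Counting matches" in html
```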
Codecov Report: ❌ Patch coverage is
@@           Coverage Diff           @@
##             main     #907     +/- ##
========================================
+ Coverage   91.89%   91.90%   +0.01%
========================================
  Files        1316     1316
  Lines       50586    50741     +155
  Branches     1625     1625
========================================
+ Hits        46485    46633     +148
- Misses       3795     3802       +7
  Partials      306      306
Bugbot caught a HIGH severity bug in the lazy clear-by-filter preview commit (46b6e5f): when `streaming_celery_count` raised on the destructive POST path, the fallback set `match_count = 0`, which then triggered the very next `match_count == 0` safety check and silently 302'd to "No messages match the filter — nothing to clear." A transient Redis blip during the count probe would convert a typed-confirmed `clear_all` / `clear_keep_one` into a misleading no-op redirect, with no signal to the operator that anything had gone wrong.

Track count failure with an explicit `count_failed` flag and short-circuit the count-based safety checks (`match_count == 0` early-return; `clear_keep_one` single-match guard) when the count probe raised. The destructive path keeps gating on the typed confirmation, and `start_celery_broker_clear_job` re-snapshots `LLEN(queue)` itself so the progress page total is still accurate. The dry-run banner gets a count-failed variant that drops the "of {match_count}" denominator and surfaces "(count probe failed; ...)" so the operator knows why the matched total is missing.

Two new TestCase tests pin the behavior:

* `test_clear_all_proceeds_when_count_probe_raises` — POST `clear_all` with a typed confirmation must redirect to the background-job progress page even when `streaming_celery_count` raises, NOT to the changelist with a "nothing to clear" flash.
* `test_dry_run_surfaces_count_failed_banner_when_count_raises` — dry-run banner drops the "of N" denominator and includes "count probe failed" when the count probe raised.
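The `count_failed` pattern above can be sketched like this. The names mirror the commit message, but the surrounding Django view is elided and this is a simplified illustration, not the shipped code:

```python
# Sketch of the count_failed fix: track probe failure explicitly instead
# of collapsing it into match_count = 0.
def probe_match_count(queue_filter, counter):
    """Run the count probe; return (match_count, count_failed)."""
    try:
        return counter(queue_filter), False
    except Exception:
        # Failed probe: match_count is meaningless, flag it instead.
        return 0, True

def should_abort_as_empty(match_count: int, count_failed: bool) -> bool:
    # The match_count == 0 early-return only fires when the probe
    # actually succeeded; a failed probe must not masquerade as "empty".
    return match_count == 0 and not count_failed
```

With the flag in place, a transient Redis blip during the probe no longer short-circuits a typed-confirmed destructive action into a "nothing to clear" redirect.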
The new `test_clear_all_proceeds_when_count_probe_raises` test (added in 87af8f5) asserted the destructive POST redirect landed under `/clear-by-filter/progress/<uuid>/`, but the actual URL pattern wired up in `CeleryBrokerQueueAdmin.get_urls` is `/clear-by-filter/job/<uuid>/` (the URL name is `_clear_by_filter_progress` but the path segment is `job/`). CI run 25241620362 surfaced the mismatch — the redirect Location was `/admin/redis_admin/celerybrokerqueue/clear-by-filter/job/044680cf-…/`, which proves the bug fix itself works (the destructive action proceeded past the count failure and started the background job) but the assertion substring was wrong. Update the assertion to pin `/clear-by-filter/job/` and add a guardrail that the location is NOT the `?queue_name__exact=…` changelist redirect that the original (buggy) commit produced.
Three Bugbot findings on commit 87af8f5 — addressing all of them in one focused commit since they all touch `clear_by_filter_preview_view`:

* **Medium — sample materialise bubbles HTML 500 from a JSON endpoint.** `_materialise_sample_targets` was unwrapped, so a broker outage during the sample LRANGE raised through to Django's HTML 500 page from a JSON endpoint — defeating the carefully shaped `{"error": ...}` JSON the JS client unmarshals from the count / LLEN paths. Wrap it in try/except mirroring the LLEN fallback pattern so a flaky sample lookup degrades to an empty sample table while the count payload still ships.
* **Medium — deep-queue preview spawns duplicate background jobs on refresh.** `start_celery_broker_clear_job` was called unconditionally on every preview request above the inline threshold, so a page refresh / second tab / JS retry kicked off N parallel full-queue scans for the same filter. Add `get_or_start_clear_by_filter_preview_job` in `services.py` — best-effort dedup keyed on a SHA-256 of the filter tuple (mirrors `_clear_by_filter_preview_cache_key` for operator-mental-model consistency), with a 5-minute TTL (new `CLEAR_BY_FILTER_PREVIEW_JOB_LOCK_TTL_SECONDS` setting). Stale-id guard: if the cached `job_id` no longer resolves to a live job hash, fall through to spawn a fresh job rather than hand the JS a dead id. A cache outage falls through to always-spawn so the preview never returns a 500 because of a flaky lock.
* **Low — duplicated URL query-building across preview / skip-count.** The `urlencode({k: v for k, v in {…}.items() if v != ""})` comprehension was copy-pasted across three call sites. Extract `CeleryBrokerQueueAdmin._build_clear_by_filter_querystring` with an `extras={...}` slot so `clear_by_filter_skip_count_view` can add `count_skipped=1` without re-spelling the filter dict. Pure refactor — output is byte-identical to the old comprehension.
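The best-effort dedup lock described in the second finding can be sketched as below. `cache` stands in for the Django cache backend, and `start_job` / `job_is_live` for the PR's job machinery; only the lock logic is shown, under assumed names:

```python
import hashlib

def get_or_start_preview_job(cache, filters, start_job, job_is_live, ttl=300):
    """Best-effort dedup: reuse an in-flight preview job for the same
    filter tuple within the lock TTL (illustrative sketch)."""
    lock_key = "redis_admin:preview_job:" + hashlib.sha256(
        "|".join(filters).encode()
    ).hexdigest()
    try:
        cached_id = cache.get(lock_key)
    except Exception:
        cached_id = None  # cache outage: fall through to always-spawn
    if cached_id and job_is_live(cached_id):
        return cached_id  # dedup hit: hand back the in-flight job id
    job_id = start_job(filters)  # missing or stale id: spawn fresh
    try:
        cache.set(lock_key, job_id, ttl)
    except Exception:
        pass  # best-effort lock; losing it only costs a duplicate scan
    return job_id
```

The key design choice is that every failure mode degrades toward spawning a job rather than returning an error: a dead lock entry or a flaky cache can never break the preview, only cost an extra scan.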
Three new pytest tests pin the behaviour:

* `test_preview_endpoint_dedups_job_for_same_filter_within_lock_ttl` — two preview requests for the same filter return the same `job_id` and `start_celery_broker_clear_job` is called exactly once; a third request with a different `repoid` spawns a fresh job (the lock is keyed on the full filter tuple).
* `test_preview_endpoint_dedup_falls_through_when_cached_job_is_stale` — pins the stale-id guard: deleting the job hash between two preview calls forces the second call to spawn a fresh job rather than hand the JS a dead `job_id`.
* `test_preview_endpoint_returns_empty_sample_when_materialise_raises` — JSON endpoint contract preserved (200 + `application/json`) when `_materialise_sample_targets` raises; sample degrades to `[]` and the count still ships.
Two Bugbot Low-severity findings on commit dd8ad8d:

* **Sample-table heading is hardcoded "20".** PR #907 replaced the pre-PR dynamic `<p>The first {{ sample_targets|length }} matching message(s):</p>` with a static `"The first 20"` — misleading when fewer than 20 messages match (the table below it would show only e.g. 3 rows). The pre-PR template also had a `"Showing first X of Y"` note when the sample window was narrower than the match set, lost in the rewrite. Make the heading JS-rewritten on the synchronous-mode preview response. The new `renderSampleHeader(sampleLength, matchCount)` helper produces:
  - `"Showing first {sample} of {match} matching message(s):"` when the sample window is narrower than the match set,
  - `"The first {sample} matching message(s):"` when the sample covers the whole match set,
  - the harmless fallback `"Matching messages:"` when the JS preview never lands (Tier 4 skip-count render path, which intentionally ships an empty sample).
* **Duplicated SHA-256 logic between count cache + job lock.** `_clear_by_filter_preview_cache_key` and the new `_clear_by_filter_preview_job_lock_key` had identical bodies — same canonical `"|".join(...)` shape, same `hashlib.sha256(...)` call. Extract `_clear_by_filter_filter_digest` as the single chokepoint so the two key families can never accidentally diverge in their hashing logic. Pure refactor — both keys produce the same bytes as before.
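The three-branch heading logic reads like this. The PR's helper is JavaScript; the Python port below is an illustrative sketch of the same branches, with `match_count=None` standing in for "the JS preview never landed":

```python
def render_sample_header(sample_length, match_count):
    """Python port of the renderSampleHeader(sampleLength, matchCount)
    branches described above (sketch only; the shipped helper is JS)."""
    if match_count is None:
        # Preview never landed (Tier 4 skip-count path, empty sample).
        return "Matching messages:"
    if sample_length < match_count:
        return f"Showing first {sample_length} of {match_count} matching message(s):"
    return f"The first {sample_length} matching message(s):"
```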
Bugbot flagged a HIGH-severity false positive on commit ea75f1b ("Job-mode match count always reads as zero"), claiming `snapshot.matched` arrives at the JS as a string. It does not — `_job_progress_context._int("matched")` casts it to int before the `JsonResponse`, mirroring `_int("processed")` and `_int("total_estimated")`. The existing `test_clear_job_status_view_returns_json_for_running_job` already pins `processed` / `total_estimated` as JSON ints; extend it to also pin `matched` / `drifted` / `passes_run` so a future regression that drops the `_int(...)` cast lands here loudly instead of silently re-rendering "0 matching messages" on the operator's preview page. (Reply on the inline review comment explains the analysis.)
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit ca453c2.
When the lazy clear-by-filter preview enters Tier 2 deep-queue mode it calls `show()` on `#celery-clear-preview-job-card` to surface the chunked-job progress bar (line ~367 of celery_clear_by_filter_preview.js). On any terminal state the polling loop hands off to either `fillSynchronousResult` (on `completed`) or `showError` (on `cancelled` / `failed`). Neither helper hid the job card, so the operator saw the now-stale progress card stacked alongside the resolved count card or the error card. Hide the job card from both helpers so the terminal-state UI replaces the progress card cleanly, instead of layering on top of it. No-op for the synchronous tier (the job card was never shown there). Bug surfaced by Cursor Bugbot on PR #907 (BUGBOT_BUG_ID 2115197d-d079-4f62-a9ed-7f1be226af88, MEDIUM).
Folded into #905; keeping this open as the per-feature review record.

Summary
Makes the `/admin/redis_admin/celerybrokerqueue/clear-by-filter/` preview page render instantly even on million-message queues. The previous synchronous GET path called `streaming_celery_count` (a full `LLEN(queue)` walk) + materialised up to `CELERY_BROKER_DISPLAY_LIMIT` messages from the queryset before painting anything; on a 1M-deep queue that's 30s+ of blocking work in the request thread.

Four tiers, layered:

1. **Lazy preview load.** The GET handler renders the form + skeleton in < 200ms. A new JSON endpoint (`clear-by-filter/preview/?queue_name=…`) runs the count + sample on demand. JS fills them in once they arrive; submit buttons start disabled and unlock on count completion.
2. **Background count job for deep queues.** Above `REDIS_ADMIN_CLEAR_BY_FILTER_PREVIEW_INLINE_LIMIT` (default 200k), the preview endpoint kicks off a `dry_run=True` chunked clear job and returns `{job_id, status_url}` instead of the count. JS polls the existing clear-job status hash; on `status=completed` it reads `matched` as the match count. Reuses the #904 machinery (`start_celery_broker_clear_job`, `clear_by_filter_status_view`) — no new audit modes, no new infrastructure.
3. **Count cache.** `redis_admin:preview_count:<sha256(queue|task|repoid|commitid)>` with a `REDIS_ADMIN_CLEAR_BY_FILTER_PREVIEW_CACHE_TTL_SECONDS` TTL (default 60). Refresh / Back-Forward returns the cached count instantly with a `cached_at` field. `?bust=1` skips the cache.
4. **Skip-count escape hatch.** A `clear-by-filter/skip-count/` view that 302s back to the preview page with `count_skipped=1` set; the GET handler then suppresses the count card and unlocks the submit buttons.

The four tiers stack: a deep queue with a warm cache hits Tier 3 first; a deep queue with a cold cache + inline-limit-busting depth hits Tier 2 and polls; a typical queue hits Tier 1 inline; the operator can fall back to Tier 4 at any point if the count surface is broken or just too slow.
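The tier stacking can be sketched as a single dispatch. Function and parameter names here are assumptions for illustration; the PR's actual view code is not shown in this description:

```python
def preview_payload(queue_len, inline_limit, cached, count_fn, start_job_fn):
    """Illustrative sketch of the tier dispatch described above."""
    if cached is not None:
        # Tier 3: a warm cache answers instantly and advertises cached_at.
        return {"mode": "synchronous", "match_count": cached["count"],
                "cached_at": cached["at"]}
    if queue_len > inline_limit:
        # Tier 2: deep queue hands back a dry-run job id for the JS to poll.
        return {"mode": "job", "job_id": start_job_fn()}
    # Tier 1: typical queue counts inline.
    return {"mode": "synchronous", "match_count": count_fn()}
```

Tier 4 (skip-count) doesn't appear here because it bypasses the preview endpoint entirely: the operator takes the escape-hatch view instead of waiting for any of these payloads.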
Test plan
* `test_clear_by_filter_get_renders_skeleton_without_redis_scans` — `streaming_celery_count` and queryset materialisation both monkeypatched to raise; GET still renders 200 with skeleton placeholders + data attrs + disabled submit buttons.
* `test_preview_endpoint_returns_synchronous_count_for_small_queue` — pushes 100 envelopes, hits the endpoint, asserts `mode="synchronous"`, `match_count=100`, sample length 20.
* `test_preview_endpoint_returns_job_mode_for_deep_queue` — lowers the threshold under test, asserts `mode="job"` with a UUID-shaped `job_id` and a `status_url` ending in `/job/<uuid>/status/`.
* `test_preview_endpoint_inline_limit_setting_overrides_threshold` — the same push pattern flips from synchronous → job mode when the setting drops below the queue depth.
* `test_preview_endpoint_uses_cache_on_second_call` — monkeypatches `streaming_celery_count` with a counter; the second call doesn't re-walk the queue and the response advertises `cached_at`.
* `test_preview_endpoint_bust_param_skips_cache` — `?bust=1` forces a recompute even when the cache is warm.
* `test_skip_count_view_redirects_to_clear_form_with_count_skipped_flag` — pins the 302 location format.
* `test_clear_by_filter_get_with_count_skipped_renders_form_without_loader` — the GET handler suppresses the loader card and unlocks submit buttons when `count_skipped=1` is set.
* `test_preview_endpoint_validates_required_filters`, `test_preview_endpoint_repoid_must_be_integer`, `test_skip_count_view_validates_filters` — same invariants as `clear_by_filter_view`, surfaced as JSON 400 on the preview endpoint.
* `test_preview_endpoint_requires_superuser` — staff-only and anonymous users both get `PermissionDenied`; a superuser sails through.
* Audit-log coverage for the job-mode dry-run rides on `test_celery_broker_clear_job.py`'s existing chunked-mode audit test (the preview endpoint just calls `start_celery_broker_clear_job(..., dry_run=True)`, which is the path that test already covers). Not duplicated here. Verified in CI's `@pytest.mark.django_db` job.

Sandbox results: 189 passed, 30 deselected, 90 errors in `redis_admin/tests/` (errors are all the documented "could not translate host name 'postgres'" failures for `@pytest.mark.django_db` tests — same shape as on main; the +12 new tests all pass and bump the green count from 177 → 189).

Settings introduced
* `REDIS_ADMIN_CLEAR_BY_FILTER_PREVIEW_INLINE_LIMIT` (default `200_000`) — `LLEN(queue)` threshold above which the preview endpoint switches to background-job mode.
* `REDIS_ADMIN_CLEAR_BY_FILTER_PREVIEW_CACHE_TTL_SECONDS` (default `60`) — TTL for the preview count cache.
Notes for review
* `{% static %}` + `defer` from `clear_by_filter.html`, with no inline `<script>` — same CSP-compatible pattern as `celery_clear_progress.{js,css}` from feat(redis_admin): chunked celery broker clear with progress bar + wildcard sentinel fix #904.
* `_parse_clear_by_filter_params` is the single chokepoint for filter validation; both the HTML and JSON entrypoints route through it so a future filter rule (e.g. tightening `commitid` length) lands in one place.
* `LRANGE 0 N` is fast even on million-message queues, so the preview table renders immediately even while the count job is still running.
* The POST path still computes `match_count` synchronously for the safety checks (`match_count == 0` early-return, `clear_keep_one` single-match guard) and the dry-run banner. That's intentional: those checks need the live count, and `start_celery_broker_clear_job` itself re-snapshots `LLEN(queue)` so the progress page total isn't tied to the preview's count.

Made with Cursor
Note
Medium Risk
Touches superuser-only admin flows that can trigger destructive queue-clears, and adds new JSON/job orchestration + caching that could affect operator UX and Redis load if incorrect.
Overview
Speeds up the `celerybrokerqueue/clear-by-filter/` preview by removing synchronous Redis scans from the GET handler and replacing them with a skeleton page populated by new lazy-preview JS.

Adds a superuser-only `clear-by-filter/preview/` JSON endpoint that returns either an inline count (with a short-lived Redis cache and `?bust=1`) or a `dry_run` chunked-job id for deep queues (with job-id deduping), plus a `clear-by-filter/skip-count/` escape hatch that unlocks destructive actions without waiting for counts. Updates the template/UI accordingly, introduces new settings knobs for thresholds/TTLs, and adds/adjusts tests to pin the new response shapes and failure-mode behavior.

Reviewed by Cursor Bugbot for commit 308ea62.