Skip to content

perf(redis_admin): lazy clear-by-filter preview (count off the request thread)#907

Open
thomasrockhu-codecov wants to merge 8 commits intomainfrom
redis-admin/lazy-clear-by-filter-preview
Open

perf(redis_admin): lazy clear-by-filter preview (count off the request thread)#907
thomasrockhu-codecov wants to merge 8 commits intomainfrom
redis-admin/lazy-clear-by-filter-preview

Conversation

@thomasrockhu-codecov
Copy link
Copy Markdown
Contributor

@thomasrockhu-codecov thomasrockhu-codecov commented May 1, 2026

Summary

Makes the /admin/redis_admin/celerybrokerqueue/clear-by-filter/ preview page render instantly even on million-message queues. The previous synchronous GET path called streaming_celery_count (full LLEN(queue) walk) + materialised up to CELERY_BROKER_DISPLAY_LIMIT messages from the queryset before painting anything; on a 1M-deep queue that's 30s+ of blocking work in the request thread.

Four tiers, layered:

  1. Lazy preview load. GET handler now renders only the filter summary + skeleton placeholders in < 200ms. A new JSON endpoint (clear-by-filter/preview/?queue_name=…) runs the count + sample on demand. JS fills them in once they arrive; submit buttons start disabled and unlock on count completion.
  2. Background count job for huge queues. Above REDIS_ADMIN_CLEAR_BY_FILTER_PREVIEW_INLINE_LIMIT (default 200k), the preview endpoint kicks off a dry_run=True chunked-clear-job and returns {job_id, status_url} instead of the count. JS polls the existing clear-job status hash; on status=completed it reads matched as the match count. Reuses feat(redis_admin): chunked celery broker clear with progress bar + wildcard sentinel fix #904 machinery (start_celery_broker_clear_job, clear_by_filter_status_view) — no new audit modes, no new infrastructure.
  3. 60s count cache. Synchronous-mode counts are cached in cache Redis under redis_admin:preview_count:<sha256(queue|task|repoid|commitid)> with a REDIS_ADMIN_CLEAR_BY_FILTER_PREVIEW_CACHE_TTL_SECONDS TTL (default 60). Refresh / Back-Forward returns the cached count instantly with a cached_at field. ?bust=1 skips the cache.
  4. Skip-count escape hatch. Muted-italic "Trust the filter, skip the count →" link on the preview, visible from page-load. Routes through a tiny clear-by-filter/skip-count/ view that 302s back to the preview page with count_skipped=1 set; the GET handler then suppresses the count card and unlocks the submit buttons.

The four tiers stack: a deep queue with a warm cache hits Tier 3 first; a deep queue with a cold cache + inline-limit-busting depth hits Tier 2 and polls; a typical queue hits Tier 1 inline; the operator can fall back to Tier 4 at any point if the count surface is broken or just too slow.

Test plan

  • Tier 1: test_clear_by_filter_get_renders_skeleton_without_redis_scansstreaming_celery_count and queryset materialisation both monkeypatched to raise; GET still renders 200 with skeleton placeholders + data attrs + disabled submit buttons.
  • Tier 1: test_preview_endpoint_returns_synchronous_count_for_small_queue — pushes 100 envelopes, hits the endpoint, asserts mode="synchronous", match_count=100, sample length 20.
  • Tier 2: test_preview_endpoint_returns_job_mode_for_deep_queue — lowers the threshold under test, asserts mode="job" with a UUID-shaped job_id and a status_url ending in /job/<uuid>/status/.
  • Tier 2: test_preview_endpoint_inline_limit_setting_overrides_threshold — same push pattern flips from synchronous → job mode when the setting drops below the queue depth.
  • Tier 3: test_preview_endpoint_uses_cache_on_second_call — monkeypatches streaming_celery_count with a counter; second call doesn't re-walk the queue and the response advertises cached_at.
  • Tier 3: test_preview_endpoint_bust_param_skips_cache?bust=1 forces a recompute even when the cache is warm.
  • Tier 4: test_skip_count_view_redirects_to_clear_form_with_count_skipped_flag — pins the 302 location format.
  • Tier 4: test_clear_by_filter_get_with_count_skipped_renders_form_without_loader — the GET handler suppresses the loader card and unlocks submit buttons when count_skipped=1 is set.
  • Validation parity: test_preview_endpoint_validates_required_filters, test_preview_endpoint_repoid_must_be_integer, test_skip_count_view_validates_filters — same invariants as clear_by_filter_view, surfaced as JSON 400 on the preview endpoint.
  • Auth: test_preview_endpoint_requires_superuser — staff-only and anonymous users both PermissionDenied; superuser sails through.
  • Audit-log integration for the job-mode dry-run path lives in test_celery_broker_clear_job.py's existing chunked-mode audit test (the preview endpoint just calls start_celery_broker_clear_job(..., dry_run=True), which is the path that test already covers). Not duplicated here. Verified in CI's @pytest.mark.django_db job.

Sandbox results: 189 passed, 30 deselected, 90 errors in redis_admin/tests/ (errors are all the documented "could not translate host name 'postgres'" failures for @pytest.mark.django_db tests — same shape as on main; the +12 new tests all pass and bump the green count from 177 → 189).

RUN_ENV=TESTING DJANGO_SETTINGS_MODULE=codecov.settings_test uv run pytest \
    -m 'not django_db' redis_admin/tests/test_clear_by_filter_preview.py -v
============================== 12 passed in 1.63s ==============================

Settings introduced

Setting Default Purpose
REDIS_ADMIN_CLEAR_BY_FILTER_PREVIEW_INLINE_LIMIT 200_000 LLEN(queue) threshold above which the preview endpoint switches to background-job mode.
REDIS_ADMIN_CLEAR_BY_FILTER_PREVIEW_CACHE_TTL_SECONDS 60 TTL for the synchronous-mode count cache.

Files

 apps/codecov-api/redis_admin/README.md             |  70 ++-
 apps/codecov-api/redis_admin/admin.py              | 624 +++++++++++++++------
 apps/codecov-api/redis_admin/services.py           | 207 +++++++
 apps/codecov-api/redis_admin/settings.py           |  26 +
 .../css/celery_clear_by_filter_preview.css         | 173 ++++++
 .../js/celery_clear_by_filter_preview.js           | 360 ++++++++++++
 .../celerybrokerqueue/clear_by_filter.html         | 237 +++++---
 .../tests/test_clear_by_filter_preview.py          | 554 ++++++++++++++++++
 8 files changed, 2017 insertions(+), 234 deletions(-)

Notes for review

  • The new JS / CSS load via {% static %} + defer from clear_by_filter.html, with no inline <script> — same CSP-compatible pattern as celery_clear_progress.{js,css} from feat(redis_admin): chunked celery broker clear with progress bar + wildcard sentinel fix #904.
  • _parse_clear_by_filter_params is the single chokepoint for filter validation; both the HTML and JSON entrypoints route through it so a future filter rule (e.g. tightening commitid length) lands in one place.
  • Job-mode response carries the synchronously-fetched sample (LRANGE 0 N is fast even on million-message queues), so the preview table renders immediately even while the count job is still running.
  • The destructive POST path still computes match_count synchronously for the safety checks (match_count == 0 early-return, clear_keep_one single-match guard) and the dry-run banner. That's intentional: those checks need the live count, and start_celery_broker_clear_job itself re-snapshots LLEN(queue) so the progress page total isn't tied to the preview's count.
  • No conflict with chore(redis_admin): lower frequency-chart sample limit to 20k #906 (already merged).

Made with Cursor


Note

Medium Risk
Touches superuser-only admin flows that can trigger destructive queue-clears, and adds new JSON/job orchestration + caching that could affect operator UX and Redis load if incorrect.

Overview
Speeds up the celerybrokerqueue/clear-by-filter/ preview by removing synchronous Redis scans from the GET handler and replacing them with a skeleton page populated by new lazy-preview JS.

Adds a superuser-only clear-by-filter/preview/ JSON endpoint that returns either an inline count (with a short-lived Redis cache and ?bust=1) or a dry_run chunked-job id for deep queues (with job-id deduping), plus a clear-by-filter/skip-count/ escape hatch that unlocks destructive actions without waiting for counts. Updates the template/UI accordingly, introduces new settings knobs for thresholds/TTLs, and adds/adjusts tests to pin the new response shapes and failure-mode behavior.

Reviewed by Cursor Bugbot for commit 308ea62. Bugbot is set up for automated code reviews on this repo. Configure here.

…t thread)

The clear-by-filter preview page used to call streaming_celery_count +
materialise up to CELERY_BROKER_DISPLAY_LIMIT messages inside the GET
handler before painting anything; on a 1M-deep queue that's tens of
seconds of blocking work in the gunicorn request thread.

Four-tier fix, layered:

1. Lazy preview load. GET handler renders the form + skeleton in
   <200ms, no Redis scans. New JSON endpoint
   `clear-by-filter/preview/` computes the count + sample on demand;
   `celery_clear_by_filter_preview.{js,css}` fills them in once they
   arrive. Submit buttons start disabled and unlock on count
   completion.
2. Background count job for huge queues. Above
   REDIS_ADMIN_CLEAR_BY_FILTER_PREVIEW_INLINE_LIMIT (default 200_000)
   the preview endpoint kicks off a `dry_run=True` chunked-clear job
   (existing start_celery_broker_clear_job machinery — no new
   infrastructure) and returns {job_id, status_url}; JS polls the
   existing status hash and reads `matched` as the count on
   completion.
3. 60s synchronous-mode count cache in cache Redis under
   `redis_admin:preview_count:<sha256(queue|task|repoid|commitid)>`.
   Refresh / Back-Forward / multi-tab return the cached count
   instantly with a `cached_at` field. `?bust=1` skips the cache.
4. Skip-count escape hatch. Muted-italic link routes to a
   tiny re-render of the preview page with `count_skipped=1` set;
   the operator types the queue name and proceeds without waiting
   for the count.

Coverage: 12 new tests in
`redis_admin/tests/test_clear_by_filter_preview.py`, all under
`-m 'not django_db'`. No `@django_db` tests added — the audit-log
integration for the job-mode dry-run is already covered by
`test_celery_broker_clear_job.py`'s chunked-mode audit test, which
runs in CI.
Comment thread apps/codecov-api/redis_admin/admin.py
PR #907 made the clear-by-filter confirmation page render a skeleton
on initial GET; the count + sample now load via the JSON preview
endpoint that the JS hits on DOMContentLoaded. The four failing
tests were still asserting the old eager-render "Matched: N
message(s)" string against the GET body.

Update them to:
  * pin the skeleton markers (#celery-clear-by-filter-preview-root,
    .celery-clear-preview-loader, "Counting matches…")
  * pin the dry-run flash text ("would clear N of M matching
    message(s)") which still surfaces synchronously via the Django
    messages framework on the dry-run response
  * pin the keep-one button's server-rendered hidden state
    (.celery-clear-keep-one.celery-clear-preview-hidden) — the
    JS-driven match_count<=1 invariant moved to the preview JSON
    endpoint's tests
@sentry
Copy link
Copy Markdown
Contributor

sentry Bot commented May 2, 2026

Codecov Report

❌ Patch coverage is 92.53731% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.90%. Comparing base (6fb832f) to head (308ea62).
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
apps/codecov-api/redis_admin/admin.py 92.91% 9 Missing ⚠️
apps/codecov-api/redis_admin/services.py 91.54% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #907      +/-   ##
==========================================
+ Coverage   91.89%   91.90%   +0.01%     
==========================================
  Files        1316     1316              
  Lines       50586    50741     +155     
  Branches     1625     1625              
==========================================
+ Hits        46485    46633     +148     
- Misses       3795     3802       +7     
  Partials      306      306              
Flag Coverage Δ
apiunit 94.90% <92.53%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@codecov-notifications
Copy link
Copy Markdown

codecov-notifications Bot commented May 2, 2026

Codecov Report

❌ Patch coverage is 92.53731% with 15 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
apps/codecov-api/redis_admin/admin.py 92.91% 9 Missing ⚠️
apps/codecov-api/redis_admin/services.py 91.54% 6 Missing ⚠️

📢 Thoughts on this report? Let us know!

Bugbot caught a HIGH severity bug in the lazy clear-by-filter
preview commit (46b6e5f): when `streaming_celery_count` raised
on the destructive POST path, the fallback set `match_count = 0`,
which then triggered the very next `match_count == 0` safety
check and silently 302'd to "No messages match the filter —
nothing to clear." A transient Redis blip during the count probe
would convert a typed-confirmed `clear_all` / `clear_keep_one`
into a misleading no-op redirect, with no signal to the operator
that anything had gone wrong.

Track count failure with an explicit `count_failed` flag and
short-circuit the count-based safety checks (`match_count == 0`
early-return; `clear_keep_one` single-match guard) when the count
probe raised. The destructive path keeps gating on the typed-
confirmation, and `start_celery_broker_clear_job` re-snapshots
`LLEN(queue)` itself so the progress page total is still
accurate.

The dry-run banner gets a count-failed variant that drops the
"of {match_count}" denominator and surfaces "(count probe
failed; ...)" so the operator knows why the matched-total is
missing.

Two new TestCase tests pin the behavior:
  * `test_clear_all_proceeds_when_count_probe_raises` — POST
    `clear_all` with a typed confirmation must redirect to the
    background-job progress page even when `streaming_celery_count`
    raises, NOT to the changelist with a "nothing to clear" flash.
  * `test_dry_run_surfaces_count_failed_banner_when_count_raises` —
    dry-run banner drops the "of N" denominator and includes
    "count probe failed" when the count probe raised.
Comment thread apps/codecov-api/redis_admin/admin.py Outdated
Comment thread apps/codecov-api/redis_admin/admin.py
Comment thread apps/codecov-api/redis_admin/admin.py Outdated
The new `test_clear_all_proceeds_when_count_probe_raises` test
(added in 87af8f5) asserted the destructive POST redirect
landed under `/clear-by-filter/progress/<uuid>/`, but the
actual URL pattern wired up in `CeleryBrokerQueueAdmin.get_urls`
is `/clear-by-filter/job/<uuid>/` (the URL name is
`_clear_by_filter_progress` but the path segment is `job/`).
CI run 25241620362 surfaced the mismatch — the redirect Location
was `/admin/redis_admin/celerybrokerqueue/clear-by-filter/job/
044680cf-…/`, which proves the bug fix itself works (the
destructive action proceeded past the count failure and started
the background job) but my assertion substring was wrong.

Update the assertion to pin `/clear-by-filter/job/` and add a
guardrail that the location is NOT the `?queue_name__exact=…`
changelist redirect that the original (buggy) commit produced.
Three Bugbot findings on commit 87af8f5 — addressing all of them
in one focused commit since they all touch
`clear_by_filter_preview_view`:

* **Medium — sample materialise bubbles HTML 500 from a JSON endpoint.**
  `_materialise_sample_targets` was unwrapped, so a broker outage
  during the sample LRANGE raised through to Django's HTML 500
  page from a JSON endpoint — defeating the carefully-shaped
  `{"error": ...}` JSON the JS client unmarshals from the count
  / LLEN paths. Wrap it in try/except mirroring the LLEN
  fallback pattern so a flaky sample lookup degrades to an empty
  sample table while the count payload still ships.

* **Medium — deep-queue preview spawns duplicate background jobs on
  refresh.** `start_celery_broker_clear_job` was called
  unconditionally on every preview request above the inline
  threshold, so a page refresh / second tab / JS retry kicked
  off N parallel full-queue scans for the same filter. Add
  `get_or_start_clear_by_filter_preview_job` in `services.py` —
  best-effort dedup keyed on a SHA-256 of the filter tuple
  (mirrors `_clear_by_filter_preview_cache_key` for operator-
  mental-model consistency), with a 5-minute TTL (new
  `CLEAR_BY_FILTER_PREVIEW_JOB_LOCK_TTL_SECONDS` setting).
  Stale-id guard: if the cached `job_id` no longer resolves to
  a live job hash, fall through to spawn a fresh job rather
  than hand the JS a dead id. Cache outage falls through to
  always-spawn so the preview never returns a 500 because of a
  flaky lock.

* **Low — duplicated URL query-building across preview / skip-count.**
  The `urlencode({k: v for k, v in {…}.items() if v != ""})`
  comprehension was copy-pasted across three call sites.
  Extract `CeleryBrokerQueueAdmin._build_clear_by_filter_querystring`
  with an `extras={...}` slot so `clear_by_filter_skip_count_view`
  can add `count_skipped=1` without re-spelling the filter dict.
  Pure refactor — output is byte-identical to the old comprehension.

Three new pytest tests pin the behaviour:

  * `test_preview_endpoint_dedups_job_for_same_filter_within_lock_ttl` —
    two preview requests for the same filter return the same
    `job_id` and `start_celery_broker_clear_job` is called exactly
    once; a third request with a different `repoid` spawns a
    fresh job (the lock is keyed on the full filter tuple).
  * `test_preview_endpoint_dedup_falls_through_when_cached_job_is_stale`
    — pin the stale-id guard: deleting the job hash between two
    preview calls forces the second call to spawn a fresh job
    rather than hand the JS a dead `job_id`.
  * `test_preview_endpoint_returns_empty_sample_when_materialise_raises` —
    JSON endpoint contract preserved (200 + `application/json`)
    when `_materialise_sample_targets` raises; sample degrades
    to `[]` and the count still ships.
Comment thread apps/codecov-api/redis_admin/services.py
Two Bugbot Low-severity findings on commit dd8ad8d:

* **Sample-table heading is hardcoded "20".** PR #907 replaced
  the pre-PR dynamic `<p>The first {{ sample_targets|length }}
  matching message(s):</p>` with a static `"The first 20"` —
  misleading when fewer than 20 messages match (the table below
  it would show only e.g. 3 rows). The pre-PR template also had
  a `"Showing first X of Y"` note when the sample window was
  narrower than the match set, lost in the rewrite.

  Make the heading JS-rewritten on the synchronous-mode preview
  response. The new `renderSampleHeader(sampleLength, matchCount)`
  helper produces:

  - `"Showing first {sample} of {match} matching message(s):"`
    when the sample window is narrower than the match set,
  - `"The first {sample} matching message(s):"` when the sample
    covers the whole match set,
  - the harmless fallback `"Matching messages:"` when the JS
    preview never lands (Tier 4 skip-count render path, which
    intentionally ships an empty sample).

* **Duplicated SHA-256 logic between count cache + job lock.**
  `_clear_by_filter_preview_cache_key` and the new
  `_clear_by_filter_preview_job_lock_key` had identical bodies
  — same canonical `"|".join(...)` shape, same `hashlib.sha256(...)`
  call. Extract `_clear_by_filter_filter_digest` as the single
  chokepoint so the two key families can never accidentally
  diverge in their hashing logic. Pure refactor — both keys
  produce the same bytes as before.
Bugbot flagged a HIGH-severity false positive on commit ea75f1b
("Job-mode match count always reads as zero"), claiming
`snapshot.matched` arrives at the JS as a string. It does not —
`_job_progress_context._int("matched")` casts it to int before
the `JsonResponse`, mirroring `_int("processed")` and
`_int("total_estimated")`. The existing
`test_clear_job_status_view_returns_json_for_running_job` already
pins `processed` / `total_estimated` as JSON ints; extend it to
also pin `matched` / `drifted` / `passes_run` so a future
regression that drops the `_int(...)` cast lands here loudly
instead of silently re-rendering "0 matching messages" on the
operator's preview page.

(Reply on the inline review comment explains the analysis.)
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit ca453c2. Configure here.

When the lazy clear-by-filter preview enters Tier 2 deep-queue mode it
calls `show()` on `#celery-clear-preview-job-card` to surface the
chunked-job progress bar (line ~367 of celery_clear_by_filter_preview.js).
On any terminal state the polling loop hands off to either
`fillSynchronousResult` (on `completed`) or `showError` (on
`cancelled` / `failed`). Neither helper hid the job card, so the
operator saw the now-stale progress card stacked alongside the
resolved count card or the error card.

Hide the job card from both helpers so the terminal-state UI replaces
the progress card cleanly, instead of layering on top of it. No-op for
the synchronous tier (the job card was never shown there).

Bug surfaced by Cursor Bugbot on PR #907 (BUGBOT_BUG_ID
2115197d-d079-4f62-a9ed-7f1be226af88, MEDIUM).
@thomasrockhu-codecov
Copy link
Copy Markdown
Contributor Author

Folded into #905; keeping this open as the per-feature review record.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant