feat(redis_admin): unacked queue per-message drill-down + frequency chart + clear-by-filter#911
feat(redis_admin): unacked queue per-message drill-down + frequency chart + clear-by-filter#911thomasrockhu-codecov wants to merge 6 commits intomainfrom
Conversation
…hart + clear-by-filter
The Grafana broker dashboard's `redis_key_size{key="unacked"}`
panel had no clickable surface — when it climbed, on-call had
to shell into Redis to figure out which queue's messages were
stuck and why. This mirrors the existing `CeleryBrokerQueueAdmin`
end-to-end for Kombu's `unacked` HASH + `unacked_index` ZSET so
the operator can drill in from the panel, see the per-message
detail (task / repoid / commitid / visibility-timeout deadline /
routing_key), inspect the `(routing_key, task, repoid, commitid)`
frequency chart, and trigger a chunked clear-by-filter job that
keeps the two paired keys in sync via `HDEL` + `ZREM` in one
pipeline.
The clear pipeline diverges from celery_broker's LSET-tombstone
idiom because HASH fields don't shift indices: we just `HDEL` +
`ZREM` per delivery_tag in batches. `keep_one` keeps the
lowest-deadline match (most-likely-to-be-restored-next entry, so
the operator preserves the example most likely to reappear in
the queue if they re-trigger work).
The `unacked` family is registered with `connection_kind="broker"`
(it lives next to the queue LISTs, not on cache) and excluded
from `RedisQueueAdmin.get_queryset` via
`.family_exclude("celery_broker", "unacked")` so the generic
queue browser doesn't surface a duplicate row that overlaps this
admin's surface.
Codecov Report❌ Patch coverage is ❌ Your patch check has failed because the patch coverage (78.88%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## main #911 +/- ##
==========================================
- Coverage 91.89% 91.60% -0.29%
==========================================
Files 1316 1316
Lines 50630 51774 +1144
Branches 1625 1625
==========================================
+ Hits 46525 47428 +903
- Misses 3799 4040 +241
Partials 306 306
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
… pass 1 Bugbot review on PR #911 caught a duplicate-pipeline bug in `_streaming_unacked_clear`'s keep_one branch: when pass 1 walked a chunk-buffer with multiple matches, `pass_matches` was accumulated twice — once via the per-field `pass_matches.append(field_name)` inside the loop, once via a trailing `pass_matches.extend(chunk_targets)` after the loop. The post-pass HDEL+ZREM pipeline then issued two commands per non-keeper. Idempotent on Redis (second HDEL/ZREM returns 0, so `total_hdel` / `total_zrem` stayed truthful) but doubled the round-trip volume on a long clear and contradicted the inline comment about "stash matches into pass_matches only". Drop the redundant `pass_matches.extend(chunk_targets)`. The per-field append already covers chunk-level matches exactly once; the post-pass pipeline now hits each non-keeper field exactly once. Add a regression test `test_unacked_clear_keep_one_pass_one_does_not_double_pipeline` that spies on `pipeline.hdel` and asserts each field appears in the recorded arg lists exactly once across the run.
…nto detail mode Bugbot review on PR #911 caught a self-referencing-link bug in the unacked summary changelist: the `messages_link` column rendered an admin URL with no `routing_key__exact` query param, but `_is_summary_request` returned True whenever `routing_key__exact` was absent. So clicking "view N unacked message(s) →" just re-rendered the summary page — a self-loop. The celery_broker equivalent works because each summary row carries a `queue_name` that gets passed as `queue_name__exact` to flip into detail mode. The unacked summary row has no single routing_key to point at because the unacked HASH is one global bucket carrying every queue's reservations. Add a dedicated `view_all=1` flag that `_is_summary_request` recognises as the explicit "show every reservation" mode, and have the summary's `messages_link` render `?view_all=1` so the link lands on detail mode. From there the operator narrows via the sidebar `routing_key` filter. Hide the lazy frequency chart in `view_all=1` mode (the chart-fragment URL requires a specific `routing_key` kwarg, and aggregating across every queue at once would mix unrelated buckets). Add a regression test pinning both halves of the contract: the rendered URL carries `view_all=1`, and `_is_summary_request` treats `view_all=1` as a detail-mode request even when `routing_key__exact` is empty.
a963651 to
8553086
Compare
…l mode Bugbot caught a HIGH-severity navigation bug on PR #911: the summary row's "view N unacked message(s) →" link landed on `?view_all=1`, and `_is_summary_request` correctly recognised that as detail mode for column rendering, but `UnackedQueueAdmin.get_queryset` only flipped the queryset out of summary mode when `routing_key__exact` was set. With `view_all=1` alone, the queryset's `routing_key` stayed `None`, `is_summary_mode()` returned `True`, and the admin rendered a single dashes-row under the per-message detail columns — breaking the feature at its primary navigation entry point. Fix: - Add `view_all: bool` to `UnackedQueueQuerySet`, propagate through `_clone`, and have `is_summary_mode()` return False when `view_all=True` even with no `routing_key`. - `_request_cache_key` carries the `view_all` flag so a routing_key-scoped render and a `view_all=1` render don't share an HSCAN snapshot. - `UnackedQueueAdmin.get_queryset` flips `queryset.view_all=True` when the request URL carries `view_all=1` so the admin and the queryset agree on detail mode. Also fixes two co-located Bugbot findings: - MEDIUM: keep_one progress counters never reflected post-pass mutations. Pass-1 keep_one defers all HDEL+ZREM to the post-pass pipeline; the per-chunk progress callback ran with zero deltas, and the post-pass drain never invoked the callback. Operator's progress page permanently showed `matched=0` / `zrem_removed=0` even on a successful clear. Now we capture pre-drain counters, run the post-pass pipeline, then call the progress callback so the job hash gets the cumulative `matched`/`zrem_removed`. - LOW: dead `pass1_match_fields` list in `_streaming_unacked_clear` was never read after the earlier de-dup fix landed. Removed. Adds `comparison_id: int | None = None` to `CeleryEnvelopeMeta` in families.py, populated from `kwargs.get("comparison_id")` in `parse_celery_envelope`. The unacked queryset already references `meta.comparison_id` for `ComputeComparisonTask` hydration, so this closes the families/queryset gap that was causing the `AttributeError: 'CeleryEnvelopeMeta' object has no attribute 'comparison_id'` test failures on CI. Regression test (`test_unacked_admin_view_all_drill_in_renders_per_message_rows`) seeds two reservations under different routing_keys, asserts `get_queryset` for `?view_all=1` returns a queryset that's NOT in summary mode, and verifies both rows surface as per-message detail rows (depth=None, both routing_keys preserved). Validation: 45 unacked tests pass locally; ruff check + format clean.
…nto detail mode Bugbot review on PR #911 caught a self-referencing-link bug in the unacked summary changelist: the `messages_link` column rendered an admin URL with no `routing_key__exact` query param, but `_is_summary_request` returned True whenever `routing_key__exact` was absent. So clicking "view N unacked message(s) →" just re-rendered the summary page — a self-loop. The celery_broker equivalent works because each summary row carries a `queue_name` that gets passed as `queue_name__exact` to flip into detail mode. The unacked summary row has no single routing_key to point at because the unacked HASH is one global bucket carrying every queue's reservations. Add a dedicated `view_all=1` flag that `_is_summary_request` recognises as the explicit "show every reservation" mode, and have the summary's `messages_link` render `?view_all=1` so the link lands on detail mode. From there the operator narrows via the sidebar `routing_key` filter. Hide the lazy frequency chart in `view_all=1` mode (the chart-fragment URL requires a specific `routing_key` kwarg, and aggregating across every queue at once would mix unrelated buckets). Add a regression test pinning both halves of the contract: the rendered URL carries `view_all=1`, and `_is_summary_request` treats `view_all=1` as a detail-mode request even when `routing_key__exact` is empty.
…l mode Bugbot caught a HIGH-severity navigation bug on PR #911: the summary row's "view N unacked message(s) →" link landed on `?view_all=1`, and `_is_summary_request` correctly recognised that as detail mode for column rendering, but `UnackedQueueAdmin.get_queryset` only flipped the queryset out of summary mode when `routing_key__exact` was set. With `view_all=1` alone, the queryset's `routing_key` stayed `None`, `is_summary_mode()` returned `True`, and the admin rendered a single dashes-row under the per-message detail columns — breaking the feature at its primary navigation entry point. Fix: - Add `view_all: bool` to `UnackedQueueQuerySet`, propagate through `_clone`, and have `is_summary_mode()` return False when `view_all=True` even with no `routing_key`. - `_request_cache_key` carries the `view_all` flag so a routing_key-scoped render and a `view_all=1` render don't share an HSCAN snapshot. - `UnackedQueueAdmin.get_queryset` flips `queryset.view_all=True` when the request URL carries `view_all=1` so the admin and the queryset agree on detail mode. Also fixes two co-located Bugbot findings: - MEDIUM: keep_one progress counters never reflected post-pass mutations. Pass-1 keep_one defers all HDEL+ZREM to the post-pass pipeline; the per-chunk progress callback ran with zero deltas, and the post-pass drain never invoked the callback. Operator's progress page permanently showed `matched=0` / `zrem_removed=0` even on a successful clear. Now we capture pre-drain counters, run the post-pass pipeline, then call the progress callback so the job hash gets the cumulative `matched`/`zrem_removed`. - LOW: dead `pass1_match_fields` list in `_streaming_unacked_clear` was never read after the earlier de-dup fix landed. Removed. Adds `comparison_id: int | None = None` to `CeleryEnvelopeMeta` in families.py, populated from `kwargs.get("comparison_id")` in `parse_celery_envelope`. The unacked queryset already references `meta.comparison_id` for `ComputeComparisonTask` hydration, so this closes the families/queryset gap that was causing the `AttributeError: 'CeleryEnvelopeMeta' object has no attribute 'comparison_id'` test failures on CI. Regression test (`test_unacked_admin_view_all_drill_in_renders_per_message_rows`) seeds two reservations under different routing_keys, asserts `get_queryset` for `?view_all=1` returns a queryset that's NOT in summary mode, and verifies both rows surface as per-message detail rows (depth=None, both routing_keys preserved). Validation: 45 unacked tests pass locally; ruff check + format clean.
…velopes Bugbot caught a brittle fake-meta object in `_streaming_unacked_clear`'s envelope-parse fallback: when an unacked HASH value can't be decoded as a JSON triple, we built a one-off type instance with only `task` / `repoid` / `commitid` attributes. The current `_unacked_envelope_matches_any_filter` only reads those three, so it works today, but any future filter expansion (e.g. matching on `comparison_id`, `task_id`, `ownerid`, `pullid`) would touch a missing attribute on the fake and raise `AttributeError` mid-clear. Use the canonical empty `CeleryEnvelopeMeta()` instead — same shape `parse_celery_envelope` returns when it gives up — so the fallback meta has every field defaulted to `None` and is robust against future filter additions. Mirrors how the queryset's `_materialise` handles the same case.
Bugbot flagged a percentage-display bug in `_stream_unacked_frequency_aggregate`: all-None unparseable envelopes incremented `total` (the percentage denominator) but never landed in `counter`, so the rendered chart's bars summed to less than 100% on a HASH that mixed parseable and unparseable messages. Fix mirrors the cached path's behaviour in `UnackedQueueQuerySet.frequency_by_routing_task_repo_commit`: re-normalise `total` from the surviving counter sum after the post-loop comparison-merge so the denominator only includes buckets that actually render in the chart. Regression test pins the contract: a HASH with one parseable real message and one all-None unparseable message must yield a single bucket whose `pct` is ~100% (i.e. sum of `pct` across visible buckets ∈ (99%, 101%)).
dfd2417 to
763ca71
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 763ca71. Configure here.
| ) | ||
| ) | ||
| self._write_request_cache(rows) | ||
| return rows |
There was a problem hiding this comment.
Display cap applied before routing_key filter on global HASH
Medium Severity
_materialise scans the global unacked HASH up to CELERY_BROKER_DISPLAY_LIMIT entries before _fetch_all applies _matches_filters to narrow by routing_key. Unlike CeleryBrokerQueueQuerySet which scans a single per-queue LIST (all entries match), the unacked HASH mixes every queue's reservations. In detail mode for a minority routing_key (e.g., 100 entries in a 100k HASH), the cap consumes its budget on unrelated entries, leaving the changelist with near-zero matching rows while the frequency chart (using the larger CELERY_BROKER_SCAN_LIMIT) correctly reports hundreds — a confusing mismatch for the operator investigating a specific queue.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 763ca71. Configure here.
|
|
||
| chunk_processed = len(chunk_buffer) | ||
| chunk_hdel_before = matches_hdel | ||
| chunk_zrem_before = matches_zrem |
There was a problem hiding this comment.
Unused variable chunk_zrem_before is a dead store
Low Severity
chunk_zrem_before is assigned from matches_zrem at the top of _flush_chunk but is never referenced anywhere in the closure. The matched_delta uses chunk_hdel_before and found_delta uses chunk_found_before, but there is no corresponding zrem_delta computation. This dead store suggests either a missing ZREM delta field in _UnackedChunkProgress or simply leftover copy-paste from the HDEL counterpart.
Reviewed by Cursor Bugbot for commit 763ca71. Configure here.


Summary
This maps the
redis_key_size{key="unacked", cluster="$var_cluster"}panel on the Grafana broker dashboard into a per-message admin surface, mirroring the existingCeleryBrokerQueueAdminend-to-end (PRs #894 / #895 / #899 / #905 lineage — reviewers familiar with that surface can review this one by analogy).When
unackedclimbs in Grafana, on-call now has a clickable next step instead of aredis-clishell session.Kombu unacked layout
kombu.transport.redis.Channel.unacked_key("unacked") andChannel.unacked_index_key("unacked_index") move worker-reserved messages onto two paired keys on the broker Redis:unacked— Redis HASH. Field =delivery_tag(UUID-like). Value = JSON[message_dict, exchange_str, routing_key_str].message_dict["body"]is base64 of the celery envelope, parsed by the existingparse_celery_envelope.unacked_index— Redis ZSET. Score = visibility-timeout deadline (unix ts). Member =delivery_tag. Used by Kombu's restore-loop to re-enqueue messages whose worker died.To clear an entry the admin issues both
HDEL unacked <delivery_tag>andZREM unacked_index <delivery_tag>in one pipeline. Skipping theZREMwould leave a phantom inunacked_indexthat costs aZPOPMINevery visibility-timeout tick.Chart shape
New URLs
/admin/redis_admin/unackedqueueitem/HLEN(unacked)/admin/redis_admin/unackedqueueitem/?routing_key__exact=<queue>/admin/redis_admin/unackedqueueitem/<routing_key>/chart-fragment//admin/redis_admin/unackedqueueitem/clear-by-filter//admin/redis_admin/unackedqueueitem/clear-by-filter/job/<uuid>//admin/redis_admin/unackedqueueitem/clear-by-filter/job/<uuid>/status//admin/redis_admin/unackedqueueitem/clear-by-filter/job/<uuid>/cancel/Notable design decisions
keep_onesemantic differs from celery_broker: keeping the lowest-deadline match (the entry most likely to be re-enqueued first by Kombu's restore-loop), so the operator preserves the example most likely to reappear if they re-trigger work. The celery_broker version keeps the lowest list index (oldest by FIFO), which doesn't translate to a HASH._FILTER_ANYextended to four slots(routing_key, task, repoid, commitid)via_substitute_filter_any_unackedrather than reusing the 3-slot helper; documented inline. The fourth slot is mandatory because routing_key is a per-message field, not a queue-level scope (one HASH carries every queue's reservations).unackedfamily is registered withconnection_kind="broker"and excluded fromRedisQueueAdmin.get_querysetvia.family_exclude("celery_broker", "unacked")so the generic queue browser doesn't surface a duplicate row._streaming_unacked_clearpaired-pipelineHDEL + ZREMper batch;redis_admin.unacked_clear: pass=N depth=M hdel_removed=K zrem_removed=Klog line per pass for postmortem traceability.celery_clear_progress.jsintounacked_clear_progress.jsbecause the status snapshot's mutation field iszrem_removed(paired-key removals), notdrifted(LSET-tombstone reconciliation).routing_keyso the operator can spot a single misbehaving queue inside the global HASH at a glance.Operator self-verify
Healthy:
HLEN ≈ ZCARD. A persistent gap means a previous clear half-completed; visit/admin/redis_admin/unackedqueueitem/?routing_key__exact=<queue>and re-run a clear job to re-pair the keys.Test plan
routing_key=""normalization), and the_substitute_filter_any_unackedhelper._streaming_unacked_clearcovered for HDEL+ZREM pairing, dry-run, keep-one, wildcard, and cancellation.unacked).Made with Cursor
Note
Medium Risk
Adds a new admin surface that can delete broker-reserved messages (
unacked/unacked_index) and introduces a chunked background clear job; mistakes could clear the wrong in-flight work or orphan index entries if bugs slip through.Overview
Adds a dedicated Django admin for Kombu/Celery reserved messages stored in Redis
unacked(HASH) paired withunacked_index(ZSET), including summary (HLEN) and per-routing-key drill-down views with a lazy-loaded frequency chart.Introduces clear tooling for unacked entries: per-row delete and a superuser-only clear-by-filter flow (dry-run + typed confirmation) that runs as a chunked background job with progress/status/cancel endpoints and paired
HDEL+ZREMsemantics to keep the HASH/ZSET consistent.Registers
unackedas a broker-connection Redis family (and excludes it from the genericRedisQueueAdminlist), extends envelope parsing metadata (comparison_id), and adds templates/static JS plus smoke tests covering the new admin modes and hiding behavior.Reviewed by Cursor Bugbot for commit 763ca71. Bugbot is set up for automated code reviews on this repo. Configure here.