Skip to content

feat(redis_admin): unacked queue per-message drill-down + frequency chart + clear-by-filter#911

Open
thomasrockhu-codecov wants to merge 6 commits intomainfrom
redis-admin/unacked-queue-drilldown
Open

feat(redis_admin): unacked queue per-message drill-down + frequency chart + clear-by-filter#911
thomasrockhu-codecov wants to merge 6 commits intomainfrom
redis-admin/unacked-queue-drilldown

Conversation

@thomasrockhu-codecov
Copy link
Copy Markdown
Contributor

@thomasrockhu-codecov thomasrockhu-codecov commented May 5, 2026

Summary

This maps the redis_key_size{key="unacked", cluster="$var_cluster"} panel on the Grafana broker dashboard into a per-message admin surface, mirroring the existing CeleryBrokerQueueAdmin end-to-end (PRs #894 / #895 / #899 / #905 lineage — reviewers familiar with that surface can review this one by analogy).

When unacked climbs in Grafana, on-call now has a clickable next step instead of a redis-cli shell session.

Kombu unacked layout

kombu.transport.redis.Channel.unacked_key ("unacked") and Channel.unacked_index_key ("unacked_index") move worker-reserved messages onto two paired keys on the broker Redis:

  • unacked — Redis HASH. Field = delivery_tag (UUID-like). Value = JSON [message_dict, exchange_str, routing_key_str]. message_dict["body"] is base64 of the celery envelope, parsed by the existing parse_celery_envelope.
  • unacked_index — Redis ZSET. Score = visibility-timeout deadline (unix ts). Member = delivery_tag. Used by Kombu's restore-loop to re-enqueue messages whose worker died.

To clear an entry the admin issues both HDEL unacked <delivery_tag> and ZREM unacked_index <delivery_tag> in one pipeline. Skipping the ZREM would leave a phantom in unacked_index that costs a ZPOPMIN every visibility-timeout tick.

Chart shape

┌─────────────────────────────────────────────────────────────────────────────┐
│  unacked frequency by (routing_key, task, repoid, commitid) — sample 20000  │
├──────────┬─────────────────────────┬────────┬──────────────┬──────┬─────────┤
│  routing │  task                    │ repoid │ commitid     │  n  │ pct     │
├──────────┼─────────────────────────┼────────┼──────────────┼──────┼─────────┤
│  notify  │ app.tasks.notify.Notify │ 1234   │ abcdef0123…  │ 412 │ 38.2%   │ [Clear]
│  celery  │ app.tasks.bundle.Proc   │ 5678   │ deadbeefcaf… │ 198 │ 18.4%   │ [Clear]
│  …       │ …                       │ …      │ …            │ …   │ …       │
└──────────┴─────────────────────────┴────────┴──────────────┴──────┴─────────┘
                                                              total visible 1077

New URLs

Path Purpose
/admin/redis_admin/unackedqueueitem/ Summary — one row carrying HLEN(unacked)
/admin/redis_admin/unackedqueueitem/?routing_key__exact=<queue> Per-message detail rows for one routing_key + frequency chart
/admin/redis_admin/unackedqueueitem/<routing_key>/chart-fragment/ Lazy-loaded frequency chart fragment
/admin/redis_admin/unackedqueueitem/clear-by-filter/ Confirmation + dry-run form (chart-hint URL params, no Redis I/O on GET)
/admin/redis_admin/unackedqueueitem/clear-by-filter/job/<uuid>/ Progress page for a chunked clear job
/admin/redis_admin/unackedqueueitem/clear-by-filter/job/<uuid>/status/ JSON snapshot of a chunked clear job
/admin/redis_admin/unackedqueueitem/clear-by-filter/job/<uuid>/cancel/ POST-only cancel endpoint

Notable design decisions

  • keep_one semantic differs from celery_broker: keeping the lowest-deadline match (the entry most likely to be re-enqueued first by Kombu's restore-loop), so the operator preserves the example most likely to reappear if they re-trigger work. The celery_broker version keeps the lowest list index (oldest by FIFO), which doesn't translate to a HASH.
  • _FILTER_ANY extended to four slots (routing_key, task, repoid, commitid) via _substitute_filter_any_unacked rather than reusing the 3-slot helper; documented inline. The fourth slot is mandatory because routing_key is a per-message field, not a queue-level scope (one HASH carries every queue's reservations).
  • unacked family is registered with connection_kind="broker" and excluded from RedisQueueAdmin.get_queryset via .family_exclude("celery_broker", "unacked") so the generic queue browser doesn't surface a duplicate row.
  • _streaming_unacked_clear paired-pipeline HDEL + ZREM per batch; redis_admin.unacked_clear: pass=N depth=M hdel_removed=K zrem_removed=K log line per pass for postmortem traceability.
  • Forked celery_clear_progress.js into unacked_clear_progress.js because the status snapshot's mutation field is zrem_removed (paired-key removals), not drifted (LSET-tombstone reconciliation).
  • Chart fragment template adds a fourth column for routing_key so the operator can spot a single misbehaving queue inside the global HASH at a glance.

Operator self-verify

from redis_admin import conn as _c

r = _c.get_connection(kind="broker")
print("unacked HLEN:", r.hlen("unacked"))
print("unacked_index ZCARD:", r.zcard("unacked_index"))

Healthy: HLEN ≈ ZCARD. A persistent gap means a previous clear half-completed; visit /admin/redis_admin/unackedqueueitem/?routing_key__exact=<queue> and re-run a clear job to re-pair the keys.

Test plan

  • Unit tests for queryset summary/detail two-mode behaviour, deadline ZSCORE join, frequency aggregator (including routing_key="" normalization), and the _substitute_filter_any_unacked helper.
  • _streaming_unacked_clear covered for HDEL+ZREM pairing, dry-run, keep-one, wildcard, and cancellation.
  • Chunked job worker covered for start / progress / completion / cancel / keep-one.
  • Admin endpoints covered: chart fragment, clear-by-filter (GET + POST + dry-run + typed-confirm), progress + status + cancel.
  • Smoke tests exercise the three changelist URLs (summary mode, detail mode, generic queue admin still hides unacked).
  • Sandbox limitation: TestCase-based smoke tests need Postgres → CI handles those (94 fakeredis-only tests pass locally).
  • CI green
  • Manual operator click-through against staging broker Redis

Made with Cursor


Note

Medium Risk
Adds a new admin surface that can delete broker-reserved messages (unacked/unacked_index) and introduces a chunked background clear job; mistakes could clear the wrong in-flight work or orphan index entries if bugs slip through.

Overview
Adds a dedicated Django admin for Kombu/Celery reserved messages stored in Redis unacked (HASH) paired with unacked_index (ZSET), including summary (HLEN) and per-routing-key drill-down views with a lazy-loaded frequency chart.

Introduces clear tooling for unacked entries: per-row delete and a superuser-only clear-by-filter flow (dry-run + typed confirmation) that runs as a chunked background job with progress/status/cancel endpoints and paired HDEL+ZREM semantics to keep the HASH/ZSET consistent.

Registers unacked as a broker-connection Redis family (and excludes it from the generic RedisQueueAdmin list), extends envelope parsing metadata (comparison_id), and adds templates/static JS plus smoke tests covering the new admin modes and hiding behavior.

Reviewed by Cursor Bugbot for commit 763ca71. Bugbot is set up for automated code reviews on this repo. Configure here.

…hart + clear-by-filter

The Grafana broker dashboard's `redis_key_size{key="unacked"}`
panel had no clickable surface — when it climbed, on-call had
to shell into Redis to figure out which queue's messages were
stuck and why. This mirrors the existing `CeleryBrokerQueueAdmin`
end-to-end for Kombu's `unacked` HASH + `unacked_index` ZSET so
the operator can drill in from the panel, see the per-message
detail (task / repoid / commitid / visibility-timeout deadline /
routing_key), inspect the `(routing_key, task, repoid, commitid)`
frequency chart, and trigger a chunked clear-by-filter job that
keeps the two paired keys in sync via `HDEL` + `ZREM` in one
pipeline.

The clear pipeline diverges from celery_broker's LSET-tombstone
idiom because HASH fields don't shift indices: we just `HDEL` +
`ZREM` per delivery_tag in batches. `keep_one` keeps the
lowest-deadline match (most-likely-to-be-restored-next entry, so
the operator preserves the example most likely to reappear in
the queue if they re-trigger work).

The `unacked` family is registered with `connection_kind="broker"`
(it lives next to the queue LISTs, not on cache) and excluded
from `RedisQueueAdmin.get_queryset` via
`.family_exclude("celery_broker", "unacked")` so the generic
queue browser doesn't surface a duplicate row that overlaps this
admin's surface.
Comment thread apps/codecov-api/redis_admin/services.py Outdated
@sentry
Copy link
Copy Markdown
Contributor

sentry Bot commented May 5, 2026

Codecov Report

❌ Patch coverage is 78.88792% with 243 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.60%. Comparing base (c88232d) to head (763ca71).
⚠️ Report is 1 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
apps/codecov-api/redis_admin/admin.py 73.38% 103 Missing ⚠️
apps/codecov-api/redis_admin/queryset.py 74.30% 92 Missing ⚠️
apps/codecov-api/redis_admin/services.py 88.69% 38 Missing ⚠️
apps/codecov-api/redis_admin/families.py 87.50% 5 Missing ⚠️
apps/codecov-api/redis_admin/models.py 83.33% 5 Missing ⚠️

❌ Your patch check has failed because the patch coverage (78.88%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #911      +/-   ##
==========================================
- Coverage   91.89%   91.60%   -0.29%     
==========================================
  Files        1316     1316              
  Lines       50630    51774    +1144     
  Branches     1625     1625              
==========================================
+ Hits        46525    47428     +903     
- Misses       3799     4040     +241     
  Partials      306      306              
Flag Coverage Δ
apiunit 94.07% <78.88%> (-0.87%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@codecov-notifications
Copy link
Copy Markdown

codecov-notifications Bot commented May 5, 2026

Codecov Report

❌ Patch coverage is 78.88792% with 243 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
apps/codecov-api/redis_admin/admin.py 73.38% 103 Missing ⚠️
apps/codecov-api/redis_admin/queryset.py 74.30% 92 Missing ⚠️
apps/codecov-api/redis_admin/services.py 88.69% 38 Missing ⚠️
apps/codecov-api/redis_admin/families.py 87.50% 5 Missing ⚠️
apps/codecov-api/redis_admin/models.py 83.33% 5 Missing ⚠️

📢 Thoughts on this report? Let us know!

… pass 1

Bugbot review on PR #911 caught a duplicate-pipeline bug in
`_streaming_unacked_clear`'s keep_one branch: when pass 1
walked a chunk-buffer with multiple matches, `pass_matches`
was accumulated twice — once via the per-field
`pass_matches.append(field_name)` inside the loop, once via a
trailing `pass_matches.extend(chunk_targets)` after the loop.
The post-pass HDEL+ZREM pipeline then issued two commands per
non-keeper. Idempotent on Redis (second HDEL/ZREM returns 0,
so `total_hdel` / `total_zrem` stayed truthful) but doubled
the round-trip volume on a long clear and contradicted the
inline comment about "stash matches into pass_matches only".

Drop the redundant `pass_matches.extend(chunk_targets)`. The
per-field append already covers chunk-level matches exactly
once; the post-pass pipeline now hits each non-keeper field
exactly once. Add a regression test
`test_unacked_clear_keep_one_pass_one_does_not_double_pipeline`
that spies on `pipeline.hdel` and asserts each field appears
in the recorded arg lists exactly once across the run.
Comment thread apps/codecov-api/redis_admin/admin.py
thomasrockhu-codecov added a commit that referenced this pull request May 5, 2026
…nto detail mode

Bugbot review on PR #911 caught a self-referencing-link bug in
the unacked summary changelist: the `messages_link` column
rendered an admin URL with no `routing_key__exact` query
param, but `_is_summary_request` returned True whenever
`routing_key__exact` was absent. So clicking "view N unacked
message(s) →" just re-rendered the summary page — a self-loop.

The celery_broker equivalent works because each summary row
carries a `queue_name` that gets passed as `queue_name__exact`
to flip into detail mode. The unacked summary row has no
single routing_key to point at because the unacked HASH is
one global bucket carrying every queue's reservations.

Add a dedicated `view_all=1` flag that `_is_summary_request`
recognises as the explicit "show every reservation" mode, and
have the summary's `messages_link` render `?view_all=1` so the
link lands on detail mode. From there the operator narrows via
the sidebar `routing_key` filter. Hide the lazy frequency
chart in `view_all=1` mode (the chart-fragment URL requires a
specific `routing_key` kwarg, and aggregating across every
queue at once would mix unrelated buckets).

Add a regression test pinning both halves of the contract: the
rendered URL carries `view_all=1`, and `_is_summary_request`
treats `view_all=1` as a detail-mode request even when
`routing_key__exact` is empty.
Comment thread apps/codecov-api/redis_admin/services.py
Comment thread apps/codecov-api/redis_admin/services.py Outdated
Comment thread apps/codecov-api/redis_admin/queryset.py Outdated
Comment thread apps/codecov-api/redis_admin/admin.py
@thomasrockhu-codecov thomasrockhu-codecov force-pushed the redis-admin/unacked-queue-drilldown branch from a963651 to 8553086 Compare May 5, 2026 15:44
thomasrockhu-codecov added a commit that referenced this pull request May 5, 2026
…l mode

Bugbot caught a HIGH-severity navigation bug on PR #911: the
summary row's "view N unacked message(s) →" link landed on
`?view_all=1`, and `_is_summary_request` correctly recognised
that as detail mode for column rendering, but
`UnackedQueueAdmin.get_queryset` only flipped the queryset out
of summary mode when `routing_key__exact` was set. With
`view_all=1` alone, the queryset's `routing_key` stayed `None`,
`is_summary_mode()` returned `True`, and the admin rendered a
single dashes-row under the per-message detail columns —
breaking the feature at its primary navigation entry point.

Fix:
- Add `view_all: bool` to `UnackedQueueQuerySet`, propagate
  through `_clone`, and have `is_summary_mode()` return False
  when `view_all=True` even with no `routing_key`.
- `_request_cache_key` carries the `view_all` flag so a
  routing_key-scoped render and a `view_all=1` render don't
  share an HSCAN snapshot.
- `UnackedQueueAdmin.get_queryset` flips `queryset.view_all=True`
  when the request URL carries `view_all=1` so the admin and the
  queryset agree on detail mode.

Also fixes two co-located Bugbot findings:
- MEDIUM: keep_one progress counters never reflected post-pass
  mutations. Pass-1 keep_one defers all HDEL+ZREM to the
  post-pass pipeline; the per-chunk progress callback ran with
  zero deltas, and the post-pass drain never invoked the
  callback. Operator's progress page permanently showed
  `matched=0` / `zrem_removed=0` even on a successful clear.
  Now we capture pre-drain counters, run the post-pass pipeline,
  then call the progress callback so the job hash gets the
  cumulative `matched`/`zrem_removed`.
- LOW: dead `pass1_match_fields` list in `_streaming_unacked_clear`
  was never read after the earlier de-dup fix landed. Removed.

Adds `comparison_id: int | None = None` to `CeleryEnvelopeMeta`
in families.py, populated from `kwargs.get("comparison_id")` in
`parse_celery_envelope`. The unacked queryset already references
`meta.comparison_id` for `ComputeComparisonTask` hydration, so
this closes the families/queryset gap that was causing the
`AttributeError: 'CeleryEnvelopeMeta' object has no attribute
'comparison_id'` test failures on CI.

Regression test (`test_unacked_admin_view_all_drill_in_renders_per_message_rows`)
seeds two reservations under different routing_keys, asserts
`get_queryset` for `?view_all=1` returns a queryset that's NOT
in summary mode, and verifies both rows surface as per-message
detail rows (depth=None, both routing_keys preserved).

Validation: 45 unacked tests pass locally; ruff check + format
clean.
Comment thread apps/codecov-api/redis_admin/services.py Outdated
Comment thread apps/codecov-api/redis_admin/queryset.py
…nto detail mode

Bugbot review on PR #911 caught a self-referencing-link bug in
the unacked summary changelist: the `messages_link` column
rendered an admin URL with no `routing_key__exact` query
param, but `_is_summary_request` returned True whenever
`routing_key__exact` was absent. So clicking "view N unacked
message(s) →" just re-rendered the summary page — a self-loop.

The celery_broker equivalent works because each summary row
carries a `queue_name` that gets passed as `queue_name__exact`
to flip into detail mode. The unacked summary row has no
single routing_key to point at because the unacked HASH is
one global bucket carrying every queue's reservations.

Add a dedicated `view_all=1` flag that `_is_summary_request`
recognises as the explicit "show every reservation" mode, and
have the summary's `messages_link` render `?view_all=1` so the
link lands on detail mode. From there the operator narrows via
the sidebar `routing_key` filter. Hide the lazy frequency
chart in `view_all=1` mode (the chart-fragment URL requires a
specific `routing_key` kwarg, and aggregating across every
queue at once would mix unrelated buckets).

Add a regression test pinning both halves of the contract: the
rendered URL carries `view_all=1`, and `_is_summary_request`
treats `view_all=1` as a detail-mode request even when
`routing_key__exact` is empty.
…l mode

Bugbot caught a HIGH-severity navigation bug on PR #911: the
summary row's "view N unacked message(s) →" link landed on
`?view_all=1`, and `_is_summary_request` correctly recognised
that as detail mode for column rendering, but
`UnackedQueueAdmin.get_queryset` only flipped the queryset out
of summary mode when `routing_key__exact` was set. With
`view_all=1` alone, the queryset's `routing_key` stayed `None`,
`is_summary_mode()` returned `True`, and the admin rendered a
single dashes-row under the per-message detail columns —
breaking the feature at its primary navigation entry point.

Fix:
- Add `view_all: bool` to `UnackedQueueQuerySet`, propagate
  through `_clone`, and have `is_summary_mode()` return False
  when `view_all=True` even with no `routing_key`.
- `_request_cache_key` carries the `view_all` flag so a
  routing_key-scoped render and a `view_all=1` render don't
  share an HSCAN snapshot.
- `UnackedQueueAdmin.get_queryset` flips `queryset.view_all=True`
  when the request URL carries `view_all=1` so the admin and the
  queryset agree on detail mode.

Also fixes two co-located Bugbot findings:
- MEDIUM: keep_one progress counters never reflected post-pass
  mutations. Pass-1 keep_one defers all HDEL+ZREM to the
  post-pass pipeline; the per-chunk progress callback ran with
  zero deltas, and the post-pass drain never invoked the
  callback. Operator's progress page permanently showed
  `matched=0` / `zrem_removed=0` even on a successful clear.
  Now we capture pre-drain counters, run the post-pass pipeline,
  then call the progress callback so the job hash gets the
  cumulative `matched`/`zrem_removed`.
- LOW: dead `pass1_match_fields` list in `_streaming_unacked_clear`
  was never read after the earlier de-dup fix landed. Removed.

Adds `comparison_id: int | None = None` to `CeleryEnvelopeMeta`
in families.py, populated from `kwargs.get("comparison_id")` in
`parse_celery_envelope`. The unacked queryset already references
`meta.comparison_id` for `ComputeComparisonTask` hydration, so
this closes the families/queryset gap that was causing the
`AttributeError: 'CeleryEnvelopeMeta' object has no attribute
'comparison_id'` test failures on CI.

Regression test (`test_unacked_admin_view_all_drill_in_renders_per_message_rows`)
seeds two reservations under different routing_keys, asserts
`get_queryset` for `?view_all=1` returns a queryset that's NOT
in summary mode, and verifies both rows surface as per-message
detail rows (depth=None, both routing_keys preserved).

Validation: 45 unacked tests pass locally; ruff check + format
clean.
…velopes

Bugbot caught a brittle fake-meta object in
`_streaming_unacked_clear`'s envelope-parse fallback: when an
unacked HASH value can't be decoded as a JSON triple, we built
a one-off type instance with only `task` / `repoid` / `commitid`
attributes. The current `_unacked_envelope_matches_any_filter`
only reads those three, so it works today, but any future
filter expansion (e.g. matching on `comparison_id`,
`task_id`, `ownerid`, `pullid`) would touch a missing
attribute on the fake and raise `AttributeError` mid-clear.

Use the canonical empty `CeleryEnvelopeMeta()` instead — same
shape `parse_celery_envelope` returns when it gives up — so
the fallback meta has every field defaulted to `None` and is
robust against future filter additions. Mirrors how the
queryset's `_materialise` handles the same case.
Bugbot flagged a percentage-display bug in
`_stream_unacked_frequency_aggregate`: all-None unparseable
envelopes incremented `total` (the percentage denominator) but
never landed in `counter`, so the rendered chart's bars summed
to less than 100% on a HASH that mixed parseable and
unparseable messages.

Fix mirrors the cached path's behaviour in
`UnackedQueueQuerySet.frequency_by_routing_task_repo_commit`:
re-normalise `total` from the surviving counter sum after the
post-loop comparison-merge so the denominator only includes
buckets that actually render in the chart.

Regression test pins the contract: a HASH with one parseable
real message and one all-None unparseable message must yield a
single bucket whose `pct` is ~100% (i.e. sum of `pct` across
visible buckets ∈ (99%, 101%)).
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 763ca71. Configure here.

)
)
self._write_request_cache(rows)
return rows
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Display cap applied before routing_key filter on global HASH

Medium Severity

_materialise scans the global unacked HASH up to CELERY_BROKER_DISPLAY_LIMIT entries before _fetch_all applies _matches_filters to narrow by routing_key. Unlike CeleryBrokerQueueQuerySet which scans a single per-queue LIST (all entries match), the unacked HASH mixes every queue's reservations. In detail mode for a minority routing_key (e.g., 100 entries in a 100k HASH), the cap consumes its budget on unrelated entries, leaving the changelist with near-zero matching rows while the frequency chart (using the larger CELERY_BROKER_SCAN_LIMIT) correctly reports hundreds — a confusing mismatch for the operator investigating a specific queue.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 763ca71. Configure here.


chunk_processed = len(chunk_buffer)
chunk_hdel_before = matches_hdel
chunk_zrem_before = matches_zrem
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused variable chunk_zrem_before is a dead store

Low Severity

chunk_zrem_before is assigned from matches_zrem at the top of _flush_chunk but is never referenced anywhere in the closure. The matched_delta uses chunk_hdel_before and found_delta uses chunk_found_before, but there is no corresponding zrem_delta computation. This dead store suggests either a missing ZREM delta field in _UnackedChunkProgress or simply leftover copy-paste from the HDEL counterpart.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 763ca71. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant