Skip to content

feat(lw-deletions): Add throttling mechanisms to slow down deletion processing#7771

Open
onewland wants to merge 1 commit intomasterfrom
feat/lw-deletion-throttling
Open

feat(lw-deletions): Add throttling mechanisms to slow down deletion processing#7771
onewland wants to merge 1 commit intomasterfrom
feat/lw-deletion-throttling

Conversation

@onewland
Copy link
Contributor

@onewland onewland commented Feb 25, 2026

(#7772 should get merged first, or these should be merged together)

The LW deletion consumer relies solely on system.mutations WHERE is_done=0 for backpressure, but with mutations_sync=0 DELETEs return instantly so mutations accumulate before ClickHouse reflects them. This creates a sawtooth pattern: bursts of mutations accumulate, backpressure eventually kicks in, the consumer pauses, mutations drain, then another burst. This is especially problematic in the partition-split path which can fire 50+ DELETEs in a single submit() call.

This PR adds three complementary throttling mechanisms:

1. Local in-flight mutation counter — Tracks mutations issued in-memory. Provides a fast-path check in _check_ongoing_mutations that raises immediately (without querying CH) when the local count exceeds the max and we recently queried CH. Reconciles with CH's system.mutations on each periodic check, trusting CH as the source of truth for completions.

2. Inter-delete delay — Configurable time.sleep() after each DELETE via lw_delete_inter_mutation_delay_ms runtime config. Defaults to 0 (disabled). Gives CH time to register mutations when the counter alone isn't sufficient.

3. Per-submit mutation budget — Caps DELETEs per submit() call in the partition-split path via lw_deletes_per_submit_budget (default 5). When exhausted, raises TooManyOngoingMutationsErrorMessageRejected, triggering Arroyo retry. The existing Redis partition tracking skips already-processed partitions on retry, naturally breaking large partition sets into chunks.

All three default to backward-compatible behavior (counter starts at 0, delay is 0, budget only constrains partition-split path).

Mechanism Solves Runtime config
Local counter Lag between issuing & detecting mutations max_ongoing_mutations_for_delete (existing)
Inter-delete delay Smooths burst rate lw_delete_inter_mutation_delay_ms
Per-submit budget Partition-split firing 50+ mutations lw_deletes_per_submit_budget

@onewland onewland force-pushed the feat/lw-deletion-throttling branch from 042733f to 1271829 Compare February 25, 2026 21:50
@onewland onewland changed the base branch from master to worktree-debug-split-issue February 25, 2026 21:50
@onewland onewland marked this pull request as ready for review February 25, 2026 21:52
@onewland onewland requested a review from a team as a code owner February 25, 2026 21:52
Comment on lines 301 to 308
attribution_info=self._get_attribute_info(),
query_settings=query_settings,
)
self.__local_inflight_count += 1
self.__mutations_issued_this_submit += 1
self.__metrics.timing(
"execute_delete_query_ms",
(time.time() - start) * 1000,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: The _execute_single_delete function catches and logs specific ClickhouseError exceptions but doesn't re-raise them, causing failed deletions to be incorrectly marked as successful and skipped on retry.
Severity: CRITICAL

Suggested Fix

Re-raise the exception after it is logged within the except block in the _execute_single_delete function. This will ensure that the failure propagates, preventing the partition from being marked as successfully processed in Redis and allowing the deletion to be retried.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: snuba/lw_deletions/strategy.py#L301-L308

Potential issue: In the `_execute_single_delete` function, when a delete operation fails
with a `ClickhouseError` whose error code is listed in
`LW_DELETE_NON_RETRYABLE_CLICKHOUSE_ERROR_CODES` (such as `TIMEOUT_EXCEEDED` or
`MEMORY_LIMIT_EXCEEDED`), the exception is caught and logged, but not re-raised. This
causes the system to incorrectly treat the failed deletion as successful. As a result,
the `partition_delete_executed` metric is incremented, and the partition is marked as
processed in Redis. This prevents any future attempts to delete the data in that
partition, leading to permanent data retention.

Did we get this right? 👍 / 👎 to inform future reviews.

Base automatically changed from worktree-debug-split-issue to master February 25, 2026 23:27
…rocessing

The LW deletion consumer relies solely on `system.mutations WHERE is_done=0`
for backpressure, but with `mutations_sync=0` DELETEs return instantly so
mutations accumulate before ClickHouse reflects them. This creates burst/pause
sawtooth patterns, especially in the partition-split path which can fire 50+
DELETEs in a single submit() call.

Adds three complementary mechanisms:

1. Local in-flight mutation counter: tracks mutations issued in-memory,
   provides a fast-path check without querying CH, and reconciles with
   CH on periodic checks.

2. Inter-delete delay: configurable sleep after each DELETE via
   `lw_delete_inter_mutation_delay_ms` (default 0, disabled).

3. Per-submit mutation budget: caps DELETEs per submit() in the
   partition-split path via `lw_deletes_per_submit_budget` (default 5).
   When exhausted, raises TooManyOngoingMutationsError which triggers
   Arroyo retry; existing Redis partition tracking skips already-processed
   partitions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@onewland onewland force-pushed the feat/lw-deletion-throttling branch from 1271829 to 26ecf4d Compare February 26, 2026 00:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant