
fix(lw-deletions): Query system.parts on storage node instead of query node#7772

Merged
onewland merged 2 commits into master from worktree-debug-split-issue on Feb 25, 2026

Conversation

@onewland
Contributor

_get_partition_dates was querying system.parts via
cluster.get_query_connection(), which connects to a query node. In our
cluster topology, query nodes only have distributed tables (_dist), not
the local tables (_local) listed in DeletionSettings.tables. This meant
system.parts always returned zero rows, causing every partition-split
delete to fall back to un-split.

The fix connects to a storage node via cluster.get_local_nodes()[0]
instead, where the local tables and their system.parts metadata actually
live. This mirrors how the optimize CLI handles the same problem (it
requires an explicit --clickhouse-host pointing at a storage node).

Also incorporates review feedback from #7766:

  • Redis client is now initialized once in __init__ instead of on every
    _execute_delete_by_partition call
  • Partition metric tags use relative week offset (e.g. -4, 0, +2)
    instead of full dates to reduce cardinality

Fixes the partition-split fallback observed in production.
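The topology problem the description explains can be sketched as follows. This is an illustrative model, not the actual Snuba code: the `Node`/`Cluster` classes and the `get_query_connection`/`get_local_nodes` method names are assumptions based on the PR description, standing in for Snuba's real cluster API.

```python
# Hypothetical model of the cluster topology described in the PR.
class Node:
    def __init__(self, name, tables):
        self.name = name
        self.tables = tables  # table name -> parts visible in system.parts

    def query_system_parts(self, table):
        # system.parts only lists tables that physically exist on this node
        return self.tables.get(table, [])

class Cluster:
    def __init__(self, query_nodes, storage_nodes):
        self._query_nodes = query_nodes
        self._storage_nodes = storage_nodes

    def get_query_connection(self):
        return self._query_nodes[0]

    def get_local_nodes(self):
        return self._storage_nodes

# Query nodes only hold the distributed (_dist) tables; storage nodes
# hold the _local tables whose parts appear in system.parts.
query_node = Node("query-0", {"errors_dist": []})
storage_node = Node("storage-0", {"errors_local": ["(90, '2026-02-23')"]})
cluster = Cluster([query_node], [storage_node])

# Before the fix: zero rows, so every split delete fell back to un-split.
before = cluster.get_query_connection().query_system_parts("errors_local")
# After the fix: ask a storage node, where the _local parts actually live.
after = cluster.get_local_nodes()[0].query_system_parts("errors_local")
assert before == [] and len(after) == 1
```

The same constraint explains why the optimize CLI requires an explicit `--clickhouse-host` pointing at a storage node: `system.parts` is per-node metadata, so the query must run where the local tables are.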

…y node

Query nodes only have distributed tables (_dist), so system.parts returns
no rows for _local tables, causing partition-split deletes to always fall
back to un-split. Connect to a storage node via get_local_nodes() instead.

Also addresses PR #7766 review feedback:
- Move Redis client initialization to __init__
- Use relative week offset for partition metrics to reduce tag cardinality

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@onewland onewland marked this pull request as ready for review February 25, 2026 21:18
@onewland onewland requested a review from a team as a code owner February 25, 2026 21:18
In the distributed test suite, get_local_nodes() queries system.clusters
which isn't available. Mock it to return a dummy node alongside the
existing get_node_connection mock.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
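The mock described in this commit message might look roughly like the sketch below. The real suite patches Snuba's cluster object; the `Cluster` class here is a stand-in, and the method names are assumptions, but the shape is the same: `get_local_nodes()` is mocked to return a dummy node, alongside the existing `get_node_connection` mock, because `system.clusters` is unavailable in the distributed test environment.

```python
from unittest import mock

# Stand-in for the real cluster object; in the distributed test suite
# these methods would fail because system.clusters is not available.
class Cluster:
    def get_local_nodes(self):
        raise RuntimeError("system.clusters not available in tests")

    def get_node_connection(self, node):
        raise RuntimeError("no real ClickHouse to connect to in tests")

cluster = Cluster()
dummy_node = object()

with mock.patch.object(Cluster, "get_local_nodes", return_value=[dummy_node]), \
     mock.patch.object(Cluster, "get_node_connection", return_value="fake-pool"):
    node = cluster.get_local_nodes()[0]
    conn = cluster.get_node_connection(node)

assert node is dummy_node and conn == "fake-pool"
```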

@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


assert isinstance(schema, TableSchema)
partition_format = schema.get_partition_format()
assert partition_format is not None
parts = [decode_part_str(part, partition_format) for (part,) in response.results]

Duplicated get_active_partitions logic instead of reusing existing function

Medium Severity

The new inline code in _get_partition_dates (the query, schema lookup, decode_part_str call) is an exact copy of get_active_partitions from snuba/cleanup.py. Since get_active_partitions accepts a ClickhousePool as its first parameter and get_node_connection returns a ClickhousePool, the fix only needed to change the connection passed to the existing function — not duplicate all 12+ lines. Maintaining two identical copies risks divergent bug fixes.
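Bugbot's suggestion reduces the fix to changing which connection is passed in. A minimal sketch, with `get_active_partitions` and the pool object as stand-ins for the real code in `snuba/cleanup.py` (the actual helper also does schema lookup and part decoding):

```python
# Hypothetical simplified version of the shared helper: one copy of the
# "query system.parts and decode parts" logic, parameterized on the pool.
def get_active_partitions(pool, table):
    rows = pool.execute(
        f"SELECT partition FROM system.parts WHERE table = '{table}' AND active"
    )
    return [part for (part,) in rows]

# Stand-in for ClickhousePool: returns canned rows instead of querying.
class FakePool:
    def __init__(self, rows):
        self._rows = rows

    def execute(self, sql):
        return self._rows

storage_pool = FakePool([("(90, '2026-02-23')",)])

# The fix only needs to swap in the storage-node pool -- no duplication,
# and any future bug fix to the helper applies to both call sites.
parts = get_active_partitions(storage_pool, "errors_local")
assert parts == ["(90, '2026-02-23')"]
```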


member = f"{table}:{partition_date}"
if redis_client.sismember(tracking_key, member):
days_delta = (datetime.strptime(partition_date, "%Y-%m-%d") - datetime.now()).days
partition_week = str(days_delta // 7)

Week offset miscalculated due to time-of-day component

Low Severity

datetime.strptime(partition_date, "%Y-%m-%d") produces midnight, while datetime.now() includes the current time. The .days attribute of the resulting timedelta is systematically one less than the actual calendar-day difference (e.g., today's partition yields .days = -1 instead of 0). This shifts all partition_week tags by roughly one day, so a partition from today gets week "-1" rather than the expected "0".
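The off-by-one is easy to demonstrate, along with the natural fix of comparing calendar dates instead of datetimes:

```python
from datetime import datetime, date

partition_date = date.today().isoformat()  # today's partition, "YYYY-MM-DD"

# Buggy: strptime yields midnight, now() includes the current time, so the
# timedelta is negative and .days is -1 (except at exactly midnight).
buggy_delta = (datetime.strptime(partition_date, "%Y-%m-%d") - datetime.now()).days

# Fix: compare dates, not datetimes, so today's partition yields 0.
fixed_delta = (date.fromisoformat(partition_date) - date.today()).days

buggy_week = str(buggy_delta // 7)  # "-1" -- today lands in last week's bucket
fixed_week = str(fixed_delta // 7)  # "0"  -- today lands in the current week
assert (buggy_week, fixed_week) == ("-1", "0")
```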


member = f"{table}:{partition_date}"
if redis_client.sismember(tracking_key, member):
days_delta = (datetime.strptime(partition_date, "%Y-%m-%d") - datetime.now()).days
partition_week = str(days_delta // 7)
@MeredithAnya (Member) commented Feb 25, 2026


I was thinking this would be the week number in the year, like datetime.now().isocalendar().week. Not sure if that makes more or less sense than what you have here.

@onewland (Contributor, Author)

That's actually what the AI did first, but I think it makes more sense to normalize it the way we do weeks_ago in some existing metrics.

@onewland onewland merged commit 0b93a40 into master Feb 25, 2026
35 checks passed
@onewland onewland deleted the worktree-debug-split-issue branch February 25, 2026 23:27