
fix(data export): properly handle 128-bit item id#7769

Merged
xurui-c merged 4 commits into master from rachel/loop
Feb 25, 2026

Conversation

Member

@xurui-c xurui-c commented Feb 25, 2026

EAP-427

GDPR export breaks when we pass in a 128-bit item id and the pagination advances within the same project id, item type, timestamp, and trace id.

Root cause

The HexIntColumnProcessor pipeline

item_id is stored as UInt128 in ClickHouse. HexIntColumnProcessor (configured with size=32) has two jobs:

  • _process_expressions — replaces any Column("item_id") reference with lower(leftPad(hex(item_id), 32, '0')), converting the raw UInt128 to a 32-char zero-padded
    lowercase hex string for display/comparison.
  • _translate_literal — converts a hex string literal to a ClickHouse-safe decimal string (e.g. '295990755636730491527085960987654321'), because ClickHouse cannot
    correctly parse bare 128-bit integer literals in SQL.
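The two jobs can be sketched in plain Python. This is a hypothetical mirror of the ClickHouse expressions for illustration, not Snuba's actual AST-rewriting code; the function names are invented:

```python
# Hypothetical Python mirror of HexIntColumnProcessor's two jobs
# (illustration only; the real processor rewrites query AST nodes).

def process_expression(item_id: int) -> str:
    # Mirrors lower(leftPad(hex(item_id), 32, '0')): render a UInt128
    # as a 32-char zero-padded lowercase hex string.
    return format(item_id, "032x")

def translate_literal(hex_literal: str) -> str:
    # Mirrors _translate_literal: turn a hex string literal into a
    # decimal string that ClickHouse can parse safely as 128-bit.
    return str(int(hex_literal, 16))

print(process_expression(0xDEADBEEF))  # '0' * 24 + 'deadbeef'
print(translate_literal("deadbeef"))   # '3735928559'
```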

The pagination cursor

The query uses a tuple comparison for the cursor:

f.greater(
    f.tuple(column("project_id"), column("item_type"), column("timestamp"),
            column("trace_id"), <item_id_expression>),
    f.tuple(literal(last_project_id), literal(last_item_type), literal(last_timestamp),
            literal(last_trace_id), literal(last_item_id)),
)

BaseTypeConverter.__process_optimizable_condition only pattern-matches simple op(col, literal) forms. A greater(tuple(...), tuple(...)) expression does not match, so
_translate_literal is never called on the cursor's literals. The only transformation that fires is _process_expressions walking the tree and replacing any Column("item_id")
it finds.

Before the fix

init

self.last_seen_item_id = int(last_seen_item_id, 16) # Python int, up to 128 bits

to_protobuf

value=AttributeValue(val_str=hex(self.last_seen_item_id)) # e.g. "0xdeadbeef..."

_build_query — LHS

f.reinterpretAsUInt128(f.reverse(f.unhex(column("item_id"))))

_build_query — RHS

literal(page_token.last_seen_item_id) # Literal(value=295990755636730491527085960987654321)

After _process_expressions replaces Column("item_id") → lower(leftPad(hex(item_id), 32, '0')), the final SQL is:

reinterpretAsUInt128(reverse(unhex(lower(leftPad(hex(item_id), 32, '0')))))
> 295990755636730491527085960987654321

The LHS is a round-trip identity (hex → bytes → reverse → reinterpret = original item_id as UInt128). The RHS is a bare decimal integer, serialized via str(exp.value).
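That identity can be spot-checked in Python, assuming the usual semantics of ClickHouse's hex/unhex/reverse/reinterpretAsUInt128 (big-endian hex, little-endian reinterpret):

```python
# Sketch of the LHS round trip: hex -> unhex -> reverse -> reinterpret.
item_id = 0xDEADBEEF00000000FFFFFFFFCAFEBABE  # an arbitrary 128-bit value

hex_str = format(item_id, "032x")      # lower(leftPad(hex(item_id), 32, '0'))
raw = bytes.fromhex(hex_str)           # unhex: big-endian byte string
flipped = raw[::-1]                    # reverse
round_trip = int.from_bytes(flipped, "little")  # reinterpretAsUInt128

print(round_trip == item_id)  # True: the pipeline is an identity
```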

The bug: ClickHouse cannot correctly parse bare 128-bit integer literals. They get interpreted as Float64, whose mantissa carries only 53 significant bits (52 explicitly stored) — the lower-order bits are lost. If Float64(item_id_X) rounds down from the true value:

item_id_X > Float64(item_id_X) → True

So when the cursor is set to item_id_X, the next query's condition is still satisfied by item_id_X itself. The same item is returned, the cursor never advances, and
pagination loops forever.
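The non-advancing cursor is reproducible in pure Python, since Python's float is the same IEEE-754 double as ClickHouse's Float64:

```python
# A 128-bit id whose low bit is lost when coerced to Float64.
# 2**100 + 1 needs 101 significant bits; a double mantissa holds 53,
# so it rounds down to exactly 2**100.
item_id_x = (1 << 100) + 1
as_float = float(item_id_x)

print(int(as_float) == (1 << 100))  # True: low-order bit dropped
print(item_id_x > as_float)         # True: the cursor row still satisfies
                                    # its own "greater than" condition,
                                    # so pagination never advances
```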

After the fix

init

self.last_seen_item_id = last_seen_item_id # hex string as-is from ClickHouse

to_protobuf

value=AttributeValue(val_str=self.last_seen_item_id) # e.g. "deadbeef0000...cafebabe"

_build_query — LHS

column("item_id") # _process_expressions converts this to lower(leftPad(hex(item_id), 32, '0'))

_build_query — RHS

literal(page_token.last_seen_item_id) # Literal(value="deadbeef0000...cafebabe")

The final SQL is:

lower(leftPad(hex(item_id), 32, '0')) > 'deadbeef0000...cafebabe'

Both sides are 32-char zero-padded lowercase hex strings. No _translate_literal needed, no 128-bit literal parsing, no float truncation. Lexicographic ordering of
fixed-width zero-padded hex strings is equivalent to numeric ordering of the underlying integers, so the comparison is consistent with the ORDER BY item_id (raw UInt128).
Pagination advances correctly.
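The ordering equivalence the fix relies on is easy to spot-check in Python (illustrative only, not part of the PR):

```python
import random

# For fixed-width zero-padded lowercase hex, lexicographic string order
# is identical to numeric order of the underlying 128-bit integers.
random.seed(0)
ids = [random.getrandbits(128) for _ in range(1000)]

sorted_as_strings = sorted(format(i, "032x") for i in ids)
sorted_as_ints = [format(i, "032x") for i in sorted(ids)]
print(sorted_as_strings == sorted_as_ints)  # True
```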

Rachel Chen added 2 commits February 24, 2026 16:47
@xurui-c xurui-c marked this pull request as ready for review February 25, 2026 19:53
@xurui-c xurui-c requested review from a team as code owners February 25, 2026 19:53
Copilot AI review requested due to automatic review settings February 25, 2026 19:53
Copilot AI left a comment


Pull request overview

This PR aims to fix GDPR export pagination for trace items when item_id is 128-bit, by adjusting how pagination compares and serializes the last-seen item_id.

Changes:

  • Adds a regression test intended to validate pagination progress with 128-bit item_ids.
  • Updates ExportTraceItemsPageToken to store/serialize last_seen_item_id as a string rather than converting to an int.
  • Changes the pagination “seek” tuple comparison to use column("item_id") instead of a reinterpretAsUInt128(reverse(unhex(...))) expression.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
snuba/web/rpc/v1/endpoint_export_trace_items.py Alters page-token handling and the pagination tuple comparison for item_id.
tests/web/rpc/v1/test_endpoint_export_trace_items.py Adds a new pagination test targeting 128-bit item_id scenarios.
Comments suppressed due to low confidence (1)

snuba/web/rpc/v1/endpoint_export_trace_items.py:244

  • The pagination filter compares column("item_id") (stored as UInt128 in eap_items) against literal(page_token.last_seen_item_id) which is a hex string coming from the selected sentry.item_id value. Because this comparison is inside a tuple, the HexIntColumnProcessor condition matcher will not translate the hex literal into a UInt128-compatible value, which can cause the tuple comparison to cast unexpectedly and pagination to stop making progress or skip/duplicate rows. Convert/cast the page-token item id to the same type as the item_id column (e.g., parse hex -> decimal and cast to UInt128) before putting it into the tuple, or store the numeric value in the page token and serialize it consistently.
                f.tuple(
                    column("project_id"),
                    column("item_type"),
                    column("timestamp"),
                    column("trace_id"),
                    column("item_id"),
                ),
                f.tuple(
                    literal(page_token.last_seen_project_id),
                    literal(page_token.last_seen_item_type),
                    literal(page_token.last_seen_timestamp),
                    literal(page_token.last_seen_trace_id),
                    literal(page_token.last_seen_item_id),
                ),


Comment on lines 160 to 164

      name="last_seen_item_id", type=AttributeKey.Type.TYPE_STRING
  ),
  op=ComparisonFilter.OP_EQUALS,
- value=AttributeValue(val_str=hex(self.last_seen_item_id)),
+ value=AttributeValue(val_str=self.last_seen_item_id),
  )
Copilot AI Feb 25, 2026


Changing the page token serialization for last_seen_item_id from hex(...) to the raw value alters the on-the-wire format (previous tokens likely included a 0x prefix and were not zero-padded). This can break clients retrying/resuming pagination with a previously issued token. Consider keeping the old format, or making from_protobuf accept both formats (e.g., strip an optional 0x prefix / normalize length) while writing the new normalized format.

Comment on lines 133 to 140
items_data = [
gen_item_message(
start_timestamp=BASE_TIME,
trace_id=uuid.uuid4().hex,
item_id=uuid.uuid4().int.to_bytes(16, byteorder="little"),
project_id=1,
)
for i in range(num_items)
Copilot AI Feb 25, 2026


This test uses a different trace_id for every generated item, but the reported pagination bug occurs when pagination advances within the same project_id, item_type, timestamp and trace_id (leaving item_id as the tie-breaker). With randomized trace_ids, pagination can still make progress even if the item_id tie-breaker is broken. To cover the regression, use a constant trace_id for all generated items (and keep timestamp/item_type constant) so that ordering/progress depends on item_id.


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


  column("timestamp"),
  column("trace_id"),
- f.reinterpretAsUInt128(f.reverse(f.unhex(column("item_id")))),
+ column("item_id"),
Contributor


Bug: A tuple comparison in a ClickHouse query compares a UInt128 column (item_id) with a string literal, creating a type mismatch that relies on implicit database coercion.
Severity: MEDIUM

Suggested Fix

Ensure type consistency in the tuple comparison. Convert the item_id column to a hex string within the query to explicitly match the type of the literal value. This can be done using ClickHouse functions like lower(hex(item_id)) to match the string literal's type and avoid relying on implicit coercion.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: snuba/web/rpc/v1/endpoint_export_trace_items.py#L236

Potential issue: The code change introduces a type mismatch in a ClickHouse tuple
comparison used for pagination. The query at `endpoint_export_trace_items.py:236` now
compares the `item_id` column, which is a `UInt128` type, with a string literal
representing the last seen item ID. ClickHouse tuple comparisons require matching types.
While tests indicate this may work due to implicit type coercion by the database,
relying on this behavior is fragile. This could lead to query errors or incorrect
sorting if ClickHouse's type coercion rules change or if it handles certain hex values
in a way that differs from a numeric comparison, breaking pagination.


Member Author


there is no typing mismatch; the generated condition is:
greater((project_id, item_type, timestamp, replaceAll(toString(trace_id), '-', ''), lower(leftPad(hex(item_id), if(greater(length(hex(item_id)), 16), 32, 16), '0'))), (1, 1, 1772038800.0, 'c0b108ace16742a3a74a6f3cfe0b3fb6', '00bb7052a98b41c9a4088a9355be9c0c'))

)
items: list[TraceItem] = []
seen_item_ids: set[bytes] = set()
while True:
Contributor


let's find another way to do this loop, so that if somebody breaks response.page_token.end_pagination it doesn't take a CI timeout for this to fail. for _i in range(1, 100) would probably always iterate through enough pages

Member Author


for this one specifically we would iterate 120 times because limit = 1, but yeah, good point — implemented this change in both tests

@xurui-c xurui-c changed the title from "fix(gdpr): 128-bit item id" to "fix(data export): properly handle 128-bit item id" Feb 25, 2026
Contributor

@onewland onewland left a comment


seems fine, though I'm curious if you've verified that test_pagination_with_128_bit_item_id fails without your change in snuba/web/rpc/v1/ (e.g. do we know we're solving the right problem?)

Member Author

xurui-c commented Feb 25, 2026

@onewland yep!

@xurui-c xurui-c enabled auto-merge (squash) February 25, 2026 21:57
@xurui-c xurui-c merged commit 0e623bf into master Feb 25, 2026
35 checks passed
@xurui-c xurui-c deleted the rachel/loop branch February 25, 2026 22:22