fix(data export): properly handle 128-bit item id #7769
Conversation
Pull request overview
This PR aims to fix GDPR export pagination for trace items when `item_id` is 128-bit, by adjusting how pagination compares and serializes the last-seen `item_id`.

Changes:
- Adds a regression test intended to validate pagination progress with 128-bit `item_id`s.
- Updates `ExportTraceItemsPageToken` to store/serialize `last_seen_item_id` as a string rather than converting to an int.
- Changes the pagination "seek" tuple comparison to use `column("item_id")` instead of a `reinterpretAsUInt128(reverse(unhex(...)))` expression.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `snuba/web/rpc/v1/endpoint_export_trace_items.py` | Alters page-token handling and the pagination tuple comparison for `item_id`. |
| `tests/web/rpc/v1/test_endpoint_export_trace_items.py` | Adds a new pagination test targeting 128-bit `item_id` scenarios. |
Comments suppressed due to low confidence (1)
snuba/web/rpc/v1/endpoint_export_trace_items.py:244
- The pagination filter compares `column("item_id")` (stored as `UInt128` in `eap_items`) against `literal(page_token.last_seen_item_id)`, which is a hex string coming from the selected `sentry.item_id` value. Because this comparison is inside a tuple, the `HexIntColumnProcessor` condition matcher will not translate the hex literal into a UInt128-compatible value, which can cause the tuple comparison to cast unexpectedly and pagination to stop making progress or skip/duplicate rows. Convert/cast the page-token item id to the same type as the `item_id` column (e.g., parse hex -> decimal and cast to UInt128) before putting it into the tuple, or store the numeric value in the page token and serialize it consistently.
```python
f.tuple(
    column("project_id"),
    column("item_type"),
    column("timestamp"),
    column("trace_id"),
    column("item_id"),
),
f.tuple(
    literal(page_token.last_seen_project_id),
    literal(page_token.last_seen_item_type),
    literal(page_token.last_seen_timestamp),
    literal(page_token.last_seen_trace_id),
    literal(page_token.last_seen_item_id),
),
```
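The conversion this comment suggests could look like the sketch below; `hex_token_to_decimal` is a hypothetical name for illustration, not code from this PR:

```python
def hex_token_to_decimal(token_hex: str) -> str:
    """Hypothetical helper: turn a hex page-token id into the decimal
    string form that ClickHouse can cast to a UInt128 literal."""
    return str(int(token_hex, 16))
```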
```diff
         name="last_seen_item_id", type=AttributeKey.Type.TYPE_STRING
     ),
     op=ComparisonFilter.OP_EQUALS,
-    value=AttributeValue(val_str=hex(self.last_seen_item_id)),
+    value=AttributeValue(val_str=self.last_seen_item_id),
 )
```
Changing the page token serialization for last_seen_item_id from hex(...) to the raw value alters the on-the-wire format (previous tokens likely included a 0x prefix and were not zero-padded). This can break clients retrying/resuming pagination with a previously issued token. Consider keeping the old format, or making from_protobuf accept both formats (e.g., strip an optional 0x prefix / normalize length) while writing the new normalized format.
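One way to stay backward compatible, sketched with a hypothetical helper name (the PR itself does not include this code): have `from_protobuf` normalize both the old `hex(...)` format and the new raw form before use.

```python
def normalize_item_id_token(val: str) -> str:
    """Hypothetical normalizer: accept old tokens ('0x' prefix, not
    zero-padded) as well as new ones (raw 32-char lowercase hex)."""
    s = val.lower()
    if s.startswith("0x"):
        s = s[2:]
    return s.zfill(32)  # left-pad to the fixed 32-char width
```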
```python
items_data = [
    gen_item_message(
        start_timestamp=BASE_TIME,
        trace_id=uuid.uuid4().hex,
        item_id=uuid.uuid4().int.to_bytes(16, byteorder="little"),
        project_id=1,
    )
    for i in range(num_items)
]
```
This test uses a different trace_id for every generated item, but the reported pagination bug occurs when pagination advances within the same project_id, item_type, timestamp and trace_id (leaving item_id as the tie-breaker). With randomized trace_ids, pagination can still make progress even if the item_id tie-breaker is broken. To cover the regression, use a constant trace_id for all generated items (and keep timestamp/item_type constant) so that ordering/progress depends on item_id.
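The suggested change might look like the following sketch; `gen_item_message`, `BASE_TIME`, and `num_items` stand in for the PR's test helpers and are stubbed here so the fragment is self-contained:

```python
import uuid

BASE_TIME = 1772038800.0  # stand-in for the test module's constant
num_items = 5

def gen_item_message(**kwargs):  # stub of the PR's helper, illustration only
    return kwargs

fixed_trace_id = uuid.uuid4().hex  # one trace_id shared by every item

items_data = [
    gen_item_message(
        start_timestamp=BASE_TIME,       # constant timestamp
        trace_id=fixed_trace_id,         # constant trace_id
        item_id=uuid.uuid4().int.to_bytes(16, byteorder="little"),
        project_id=1,                    # constant project
    )
    for _ in range(num_items)
]
# With everything else pinned, ordering and pagination progress can
# only depend on the item_id tie-breaker.
```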
Cursor Bugbot has reviewed your changes and found 1 potential issue.
```diff
     column("timestamp"),
     column("trace_id"),
-    f.reinterpretAsUInt128(f.reverse(f.unhex(column("item_id")))),
+    column("item_id"),
```
Bug: A tuple comparison in a ClickHouse query compares a UInt128 column (item_id) with a string literal, creating a type mismatch that relies on implicit database coercion.
Severity: MEDIUM
Suggested Fix
Ensure type consistency in the tuple comparison. Convert the item_id column to a hex string within the query to explicitly match the type of the literal value. This can be done using ClickHouse functions like lower(hex(item_id)) to match the string literal's type and avoid relying on implicit coercion.
Prompt for AI Agent

Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid.

Location: `snuba/web/rpc/v1/endpoint_export_trace_items.py#L236`

Potential issue: The code change introduces a type mismatch in a ClickHouse tuple comparison used for pagination. The query at `endpoint_export_trace_items.py:236` now compares the `item_id` column, which is a `UInt128` type, with a string literal representing the last seen item ID. ClickHouse tuple comparisons require matching types. While tests indicate this may work due to implicit type coercion by the database, relying on this behavior is fragile. This could lead to query errors or incorrect sorting if ClickHouse's type coercion rules change or if it handles certain hex values in a way that differs from a numeric comparison, breaking pagination.
there is no typing mismatch:

```sql
greater(
    (project_id, item_type, timestamp, replaceAll(toString(trace_id), '-', ''),
     lower(leftPad(hex(item_id), if(greater(length(hex(item_id)), 16), 32, 16), '0'))),
    (1, 1, 1772038800.0, 'c0b108ace16742a3a74a6f3cfe0b3fb6', '00bb7052a98b41c9a4088a9355be9c0c')
)
```
```python
)
items: list[TraceItem] = []
seen_item_ids: set[bytes] = set()
while True:
```
let's find another way to do this loop so if somebody breaks response.page_token.end_pagination in the response it doesn't require a CI timeout to cause this to fail. for _i in range(1,100) would probably always iterate through enough pages
for this one specifically, we will iterate 120 times because limit = 1, but yeah good point, implemented this change in both tests
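A bounded version of the loop, in the spirit of this suggestion, could be sketched like this (the `fetch_page` callback and page shapes are illustrative, not the test's real API):

```python
MAX_PAGES = 200  # comfortably above the ~120 pages seen with limit == 1

def paginate(fetch_page):
    """fetch_page(token) -> (items, next_token); an empty next_token
    signals end of pagination. Raises instead of spinning until a CI
    timeout if the end-of-pagination signal is ever broken."""
    collected, token = [], ""
    for _ in range(MAX_PAGES):
        page, token = fetch_page(token)
        collected.extend(page)
        if not token:
            return collected
    raise AssertionError("pagination did not terminate within MAX_PAGES")
```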
onewland left a comment
seems fine, though I'm curious if you've verified that test_pagination_with_128_bit_item_id fails without your change in snuba/web/rpc/v1/ (e.g. do we know we're solving the right problem?)
@onewland yep!
EAP-427
GDPR export breaks when we pass in a 128-bit item id and the pagination advances within the same project id, item type, timestamp, and trace id.
Root cause

The `HexIntColumnProcessor` pipeline

`item_id` is stored as `UInt128` in ClickHouse. `HexIntColumnProcessor` (configured with `size=32`) has two jobs:
- `_process_expressions` — replaces any `Column("item_id")` reference with `lower(leftPad(hex(item_id), 32, '0'))`, converting the raw UInt128 to a 32-char zero-padded lowercase hex string for display/comparison.
- `_translate_literal` — converts a hex string literal to a ClickHouse-safe decimal string (e.g. `'295990755636730491527085960987654321'`), because ClickHouse cannot correctly parse bare 128-bit integer literals in SQL.
The pagination cursor

The query uses a tuple comparison for the cursor. `BaseTypeConverter.__process_optimizable_condition` only pattern-matches simple `op(col, literal)` forms. A `greater(tuple(...), tuple(...))` expression does not match, so `_translate_literal` is never called on the cursor's literals. The only transformation that fires is `_process_expressions` walking the tree and replacing any `Column("item_id")` it finds.
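A toy model of that pattern match (these miniature classes are illustrative, not Snuba's real AST types) shows why the tuple form slips through:

```python
from dataclasses import dataclass

@dataclass
class Column:
    name: str

@dataclass
class Literal:
    value: object

@dataclass
class Func:
    name: str
    args: tuple

def is_optimizable(expr) -> bool:
    """Toy matcher: only the shape op(Column, Literal) qualifies."""
    return (
        isinstance(expr, Func)
        and len(expr.args) == 2
        and isinstance(expr.args[0], Column)
        and isinstance(expr.args[1], Literal)
    )

# Simple comparison: matched, so literal translation would run.
simple = Func("greater", (Column("item_id"), Literal("00bb...")))

# Tuple-vs-tuple cursor: both sides are function calls, so no match,
# and the cursor's literals are never translated.
cursor = Func(
    "greater",
    (
        Func("tuple", (Column("item_id"),)),
        Func("tuple", (Literal("00bb..."),)),
    ),
)
```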
Before the fix

`__init__`:

```python
self.last_seen_item_id = int(last_seen_item_id, 16)  # Python int, up to 128 bits
```

`to_protobuf`:

```python
value=AttributeValue(val_str=hex(self.last_seen_item_id))  # e.g. "0xdeadbeef..."
```

`_build_query` — LHS:

```python
f.reinterpretAsUInt128(f.reverse(f.unhex(column("item_id"))))
```

`_build_query` — RHS:

```python
literal(page_token.last_seen_item_id)  # Literal(value=295990755636730491527085960987654321)
```

After `_process_expressions` replaces `Column("item_id")` → `lower(leftPad(hex(item_id), 32, '0'))`, the final SQL is:

```sql
reinterpretAsUInt128(reverse(unhex(lower(leftPad(hex(item_id), 32, '0')))))
    > 295990755636730491527085960987654321
```

The LHS is a round-trip identity (hex → bytes → reverse → reinterpret = original `item_id` as UInt128). The RHS is a bare decimal integer, serialized via `str(exp.value)`.
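The round-trip identity can be checked in Python: big-endian hex bytes, reversed, read little-endian, mirroring ClickHouse's `unhex`/`reverse`/`reinterpretAsUInt128` chain.

```python
def round_trip(item_id: int) -> int:
    h = format(item_id, "032x")               # lower(leftPad(hex(x), 32, '0'))
    b = bytes.fromhex(h)                      # unhex: big-endian byte string
    return int.from_bytes(b[::-1], "little")  # reverse + reinterpretAsUInt128
```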
The bug: ClickHouse cannot correctly parse bare 128-bit integer literals. They get interpreted as `Float64`, which has only 52 bits of mantissa — the lower-order bits are lost. If `Float64(item_id_X)` rounds down from the true value:

`item_id_X > Float64(item_id_X)` → `True`

So when the cursor is set to `item_id_X`, the next query's condition is still satisfied by `item_id_X` itself. The same item is returned, the cursor never advances, and pagination loops forever.
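The truncation is easy to reproduce in Python, whose `float` is the same IEEE-754 double as ClickHouse's `Float64`:

```python
x = (1 << 100) + 1  # a >64-bit id whose low-order bit is set
f = float(x)        # nearest representable double is exactly 2**100

# The low-order bit does not fit in the 52-bit mantissa, so it is lost.
# Python compares int and float exactly, so `x > f` mirrors the SQL
# predicate that keeps matching the cursor row itself.
```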
After the fix

`__init__`:

```python
self.last_seen_item_id = last_seen_item_id  # hex string as-is from ClickHouse
```

`to_protobuf`:

```python
value=AttributeValue(val_str=self.last_seen_item_id)  # e.g. "deadbeef0000...cafebabe"
```

`_build_query` — LHS:

```python
column("item_id")  # _process_expressions converts this to lower(leftPad(hex(item_id), 32, '0'))
```

`_build_query` — RHS:

```python
literal(page_token.last_seen_item_id)  # Literal(value="deadbeef0000...cafebabe")
```

The final SQL is:

```sql
lower(leftPad(hex(item_id), 32, '0')) > 'deadbeef0000...cafebabe'
```

Both sides are 32-char zero-padded lowercase hex strings. No `_translate_literal` needed, no 128-bit literal parsing, no float truncation. Lexicographic ordering of fixed-width zero-padded hex strings is equivalent to numeric ordering of the underlying integers, so the comparison is consistent with the `ORDER BY item_id` (raw UInt128). Pagination advances correctly.
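The equivalence of lexicographic and numeric ordering for fixed-width hex strings can be spot-checked directly:

```python
import random

random.seed(0)  # deterministic sample of 128-bit ids
ids = [random.getrandbits(128) for _ in range(1000)]
hexed = [format(i, "032x") for i in ids]  # 32-char zero-padded lowercase hex

# Sorting the hex strings lexicographically yields the same order as
# sorting the underlying integers numerically.
```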