Skip to content

Normalize executor and utility query text with $N placeholders#51

Open
iskakaushik wants to merge 8 commits intomainfrom
feat/query-normalization
Open

Normalize executor and utility query text with $N placeholders#51
iskakaushik wants to merge 8 commits intomainfrom
feat/query-normalization

Conversation

@iskakaushik
Copy link
Copy Markdown
Collaborator

@iskakaushik iskakaushik commented Apr 3, 2026

Summary

  • add post_parse_analyze_hook query normalization based on PostgreSQL's JumbleState, porting generate_normalized_query / fill_in_constant_lengths from pg_stat_statements
  • carry normalized text to ExecutorEnd / ProcessUtility using exact statement-scoped state, including nested SPI reuse and multi-statement handling
  • scope normalization out of emit_log_hook; error events are still captured, but event.query is left empty instead of doing fuzzy reconstruction
  • fix duplicate escape-string warnings, nested normalization regressions, and PG16/17 build compatibility
  • add TAP coverage for normalization behavior, failed-statement isolation, warning behavior, and nested SPI reuse

Test plan

  • Builds against PG16, PG17, and PG18
  • ./scripts/run-tests.sh ../postgres/install_tap tap 008
  • ./scripts/run-tests.sh ../postgres/install_tap tap 010
  • ./scripts/run-tests.sh ../postgres/install_tap tap 022
  • ./scripts/run-tests.sh ../postgres/install_tap tap 027
  • ./scripts/run-tests.sh ../postgres/install_tap tap 028
  • ./scripts/run-tests.sh ../postgres/install_tap tap 029
  • ./scripts/run-tests.sh ../postgres/install_tap tap 030

Port pg_stat_statements' query normalization to pg_stat_ch so that
captured queries never contain raw literal values (passwords, PII,
tokens, etc.). Both ClickHouse and OTel exporters now receive
normalized text like:

  SELECT * FROM users WHERE email = $1 AND age > $2

Instead of:

  SELECT * FROM users WHERE email = 'alice@example.com' AND age > 30

Implementation:
- Add post_parse_analyze_hook to capture JumbleState at parse time
- Port generate_normalized_query / fill_in_constant_lengths from
  pg_stat_statements (static functions, not exported by PG)
- Store normalized text per-backend in TopMemoryContext, consumed
  by ExecutorEnd/ProcessUtility/emit_log_hook
- Schema names, table names, column names, operators, and SQL
  structure are preserved; only literal constants are replaced

Includes TAP test (027_query_normalization.pl) with 16 subtests
covering strings, numbers, negatives, floats, IN lists, LIKE,
BETWEEN, subqueries, multi-statement, INSERT/UPDATE/DELETE, and
schema-qualified names.
Copilot AI review requested due to automatic review settings April 3, 2026 01:59
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces literal-constant query normalization (replacing strings/numbers/etc. with $N placeholders) so that exported query text to ClickHouse and OpenTelemetry never contains raw parameter values, leveraging PostgreSQL’s JumbleState from post_parse_analyze_hook.

Changes:

  • Add a post_parse_analyze_hook to generate and stash a normalized query string during parse/analyze.
  • Port/implement pg_stat_statements-style query normalization utilities (PschNormalizeQuery) to replace constants with $N.
  • Add a TAP test suite (t/027_query_normalization.pl) covering a variety of constant forms and SQL shapes.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
t/027_query_normalization.pl Adds integration coverage validating that exported query text contains placeholders and not raw literals.
src/hooks/query_normalize.h Declares the normalization API used by hooks to parameterize query text.
src/hooks/query_normalize.cc Implements normalization logic (ported from pg_stat_statements approach) to rewrite constants into $N.
src/hooks/hooks.cc Installs/uses post_parse_analyze_hook and prefers normalized query text when populating events (including error events).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +476 to +505
// post_parse_analyze_hook — normalize query text at parse time.
// The JumbleState (with constant locations) is only available here, so we
// must generate the normalized text now and stash it for ExecutorEnd.
static void PschPostParseAnalyze(ParseState* pstate, Query* query, JumbleState* jstate) {
if (prev_post_parse_analyze != nullptr) {
prev_post_parse_analyze(pstate, query, jstate);
}

// Only normalize if enabled and the query has constants to replace
if (!psch_enabled || jstate == nullptr || jstate->clocations_count <= 0) {
return;
}

// Free any stale normalized text from a previous parse in this backend
ClearNormalizedQuery();

const char* query_text = pstate->p_sourcetext;
int query_loc = query->stmt_location;
int query_len = query->stmt_len;

// CleanQuerytext extracts the relevant portion for multi-statement strings
query_text = CleanQuerytext(query_text, &query_loc, &query_len);

// Allocate in TopMemoryContext so the normalized text survives until
// ExecutorEnd consumes it (the parse-time context is short-lived).
MemoryContext oldcxt = MemoryContextSwitchTo(TopMemoryContext);
normalized_query = PschNormalizeQuery(query_text, query_loc, &query_len, jstate);
MemoryContextSwitchTo(oldcxt);
normalized_query_len = query_len;
}
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

normalized_query is a single per-backend buffer that can become stale or be overwritten before it is consumed. In particular: (1) PschPostParseAnalyze() can run for statements that later hit the early-return path in PschExecutorEnd() (e.g. queryId == 0), leaving normalized_query set; the next captured query will then incorrectly use this stale normalized text. (2) nested statements (SPI / functions) can overwrite normalized_query before the outer statement reaches ExecutorEnd, causing the outer event to lose normalization or pick up the inner query’s text. Consider storing normalized text in a stack keyed by nesting_level, or in a small per-backend map keyed by queryId/statement identity, and ensure the entry is removed on all completion paths (including the PschExecutorEnd() early return and utility paths).

Copilot uses AI. Check for mistakes.
Comment on lines 839 to 848
// Use normalized query if available; fall back to debug_query_string
if (normalized_query != nullptr) {
size_t len = Min(static_cast<size_t>(normalized_query_len), sizeof(event.query) - 1);
memcpy(event.query, normalized_query, len);
event.query[len] = '\0';
event.query_len = static_cast<uint16>(len);
} else if (debug_query_string != nullptr && debug_query_string[0] != '\0') {
event.query_len =
static_cast<uint16>(CopyTrimmed(event.query, PSCH_MAX_QUERY_LEN, debug_query_string));
}
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CaptureLogEvent() copies from normalized_query but never frees/clears it. If an error occurs before the normal query event is built (or if the statement never reaches ExecutorEnd), this leaks TopMemoryContext memory and can cause later events to reuse the wrong normalized text. After copying into event.query, clear the stored normalized buffer (or pop the current entry if switching to a per-statement stack/map).

Copilot uses AI. Check for mistakes.
Comment on lines +104 to +138
// A constant is at least 1 byte; $N is at most 11 bytes (INT_MAX).
norm_query_buflen = query_len + (jstate->clocations_count * 10);
norm_query = static_cast<char*>(palloc(norm_query_buflen + 1));

for (int i = 0; i < jstate->clocations_count; i++) {
int off;
int tok_len;

// If we have an external param at this location but no squashed lists,
// skip it so the original $N text is preserved.
if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists) {
continue;
}

off = jstate->clocations[i].location;
off -= query_loc;
tok_len = jstate->clocations[i].length;

if (tok_len < 0) {
continue; // duplicate, ignore
}

// Copy the chunk between last constant and this one
len_to_wrt = off - last_off - last_tok_len;
Assert(len_to_wrt >= 0);
memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
n_quer_loc += len_to_wrt;

// Insert $N placeholder (and squashed-list comment if applicable)
n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
num_constants_replaced + 1 + jstate->highest_extern_param_id,
jstate->clocations[i].squashed ? " /*, ... */" : "");
num_constants_replaced++;

quer_loc = off + tok_len;
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

norm_query_buflen sizing (query_len + clocations_count * 10) is not a guaranteed upper bound once you account for the optional squashed-list suffix (" /*, ... */") and larger $N values. Combined with the use of sprintf(), this can lead to a write past the end of norm_query before the later Assert() triggers. Use a dynamically-growing buffer (e.g., StringInfo) or compute a strict upper bound and write with snprintf() using the remaining capacity, erroring out safely if it would overflow.

Copilot uses AI. Check for mistakes.
);

# Helper: run a query, flush, and return the captured query text from ClickHouse.
# Uses queryid to identify the specific query.
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The helper comment says it "Uses queryid to identify the specific query", but get_captured_query() actually selects the most recent non-extension query by timestamp (no queryid filtering). Either update the comment to match the implementation or add a queryid filter (e.g., capture pg_stat_activity.query_id/pg_query_id() equivalent) to make the selection deterministic when multiple statements are exported close together.

Suggested change
# Uses queryid to identify the specific query.
# Clears prior exported rows and returns the most recent non-extension query by timestamp.

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings April 3, 2026 03:19
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

static void FillInConstantLengths(JumbleState* jstate, const char* query, int query_loc) {
LocationLen* locs;
core_yyscan_t yyscanner;
core_yy_extra_type yyextra;
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

core_yy_extra_type yyextra is uninitialized when passed to scanner_init(). If scanner_init() doesn’t fully overwrite all fields, this risks undefined behavior. Consider zero-initializing yyextra before calling scanner_init().

Suggested change
core_yy_extra_type yyextra;
core_yy_extra_type yyextra = {};

Copilot uses AI. Check for mistakes.
Comment on lines +379 to +388
// The normalized registry is keyed by statement identity and reused across
// repeated executions of cached plans, so this helper first looks up the
// normalized entry stashed at parse time. If no match exists, it falls back to
// CopyRawStatementText, which preserves the literal SQL text for the current
// statement only.
static void CopyQueryText(PschEvent* event, const char* query_text, int stmt_location,
int stmt_len) {
if (PschCopyNormalizedQueryForStatement(&backend_state.normalized_queries, event->query,
sizeof(event->query), &event->query_len, query_text,
stmt_location, stmt_len, false)) {
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CopyQueryText() uses consume=false when calling PschCopyNormalizedQueryForStatement(), so normalized entries remembered at parse time are never removed on the successful path. Since those entries live in TopMemoryContext and are keyed by raw source_text, sessions that execute many literal-variant statements can accumulate entries and grow backend memory without bound. Consider consuming/removing entries after enqueue (when safe) and/or adding an eviction/size cap (or keying by queryId) to prevent unbounded growth.

Suggested change
// The normalized registry is keyed by statement identity and reused across
// repeated executions of cached plans, so this helper first looks up the
// normalized entry stashed at parse time. If no match exists, it falls back to
// CopyRawStatementText, which preserves the literal SQL text for the current
// statement only.
static void CopyQueryText(PschEvent* event, const char* query_text, int stmt_location,
int stmt_len) {
if (PschCopyNormalizedQueryForStatement(&backend_state.normalized_queries, event->query,
sizeof(event->query), &event->query_len, query_text,
stmt_location, stmt_len, false)) {
// The normalized entry is only needed until it has been copied into the event
// buffer, so consume it on a successful lookup to avoid unbounded retention in
// the per-backend registry. If no match exists, fall back to the raw statement
// text for the current statement only.
static void CopyQueryText(PschEvent* event, const char* query_text, int stmt_location,
int stmt_len) {
if (PschCopyNormalizedQueryForStatement(&backend_state.normalized_queries, event->query,
sizeof(event->query), &event->query_len, query_text,
stmt_location, stmt_len, true)) {

Copilot uses AI. Check for mistakes.
Comment on lines +876 to +880
if (PschCopyNormalizedQueryForLog(&backend_state.normalized_queries, event.query,
sizeof(event.query), &event.query_len, debug_query_string,
pgstat_get_my_query_id(), edata->cursorpos)) {
} else if (debug_query_string != nullptr && debug_query_string[0] != '\0') {
event.query_len =
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If PschCopyNormalizedQueryForLog() can’t find a match, the code falls back to copying debug_query_string verbatim into the event, which can still include raw literals (passwords/PII). This contradicts the PR description’s claim that “no raw parameter values are ever captured”. If the intent is a hard guarantee, consider emitting an empty/"" query (or gating raw capture behind a GUC) when normalization lookup fails.

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings April 3, 2026 04:04
@iskakaushik iskakaushik changed the title Normalize query text: replace constants with $N placeholders Normalize executor and utility query text with $N placeholders Apr 3, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +30 to +33
PschNormalizedQueryEntry* next;
};

PschStatementKey PschMakeStatementKey(const char* source_text, int stmt_location, int stmt_len) {
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SourceTextMatch() treats pointer equality (entry->source_text == source_text) as an unconditional match. Postgres can reuse memory addresses for new query strings across statements (e.g., after memory context resets), so pointer equality alone can incorrectly match a different statement and return the wrong normalized query. Consider removing the pointer fast-path, or at least validating length/content even when pointers match (e.g., compare lengths and memcmp against source_text_copy).

Suggested change
PschNormalizedQueryEntry* next;
};
PschStatementKey PschMakeStatementKey(const char* source_text, int stmt_location, int stmt_len) {

Copilot uses AI. Check for mistakes.
Comment on lines +384 to +393
static void CopyQueryText(PschEvent* event, const char* query_text, int stmt_location,
int stmt_len) {
const PschStatementKey statement_key = PschMakeStatementKey(query_text, stmt_location, stmt_len);
if (PschCopyNormalizedQueryForStatement(&backend_state.normalized_queries, event->query,
sizeof(event->query), &event->query_len, statement_key,
false)) {
return;
}

CopyRawStatementText(event, query_text, stmt_location, stmt_len);
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normalized query entries are always copied with consume=false, and successful executor/utility paths never call ForgetNormalizedStatement(). This makes the per-backend normalized query registry grow without bound over long-lived sessions that execute many distinct statements, which can lead to backend memory bloat. Consider consuming entries by default after they’re used (and only retaining them for known-reuse cases like cached plans/SPI), or enforcing an upper bound/eviction strategy for the registry.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants