Fix SQLInterpolator treating ? in comments and literals as placeholders + shorten next_changelog#1424
Conversation
When supportManyParameters=1, SQLInterpolator split the SQL on every '?', so question marks inside line/block comments, string literals, and quoted identifiers were counted as parameter placeholders. This caused "Parameter count does not match" errors for queries like: -- does this work? select 'hello?', * from mytable where id = ? Add SqlCommentParser.findPlaceholderPositions to locate only real '?' markers (state == NORMAL) and use it from SQLInterpolator. Fixes databricks#1331. Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
- Add IndexedSqlCharConsumer overload to forEachNonCommentChar so callers can recover the source-string index of each emitted character. Refactor findPlaceholderPositions to delegate to it, removing ~70 lines of duplicated state-machine logic. - Apply the same comment/literal-aware fix to surroundPlaceholdersWithQuotes, which previously used a regex that ignored comments, double-quoted identifiers, and backtick identifiers. A '?' inside any of those is now preserved instead of being wrapped in single quotes. - Drop the now-unused Matcher/Pattern imports from SQLInterpolator. - Add tests for CRLF line comments, nested block comments containing '?', adjacent placeholders, leading/trailing placeholders, escaped backticks, and the new surroundPlaceholdersWithQuotes cases. Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
| if (blockCommentDepth == 0) { | ||
| state = State.NORMAL; | ||
| consumer.accept(state, ' '); | ||
| consumer.accept(state, ' ', i); |
There was a problem hiding this comment.
Databricks SQL supports
Looking at the code, the state machine only knows about ', ", and `. Dollar-quoting isn't supported.
Impact: If a user writes select
Severity: Low — dollar-quoted strings aren't common in Databricks SQL. But the changelog and JavaDoc say "string literals" without qualification, which is misleading. Worth noting as a limitation, or adding
$$ support.
There was a problem hiding this comment.
Thanks for the suggestion, but I will just keep this out of scope for now.
| // --- findPlaceholderPositions --- | ||
|
|
||
| @Test | ||
| public void testFindPlaceholderPositionsNullAndEmpty() { |
There was a problem hiding this comment.
Missing tests:
🟡 #13: No test for unclosed quote / comment (Corner Case #10).
🟡 #14: No test for surroundPlaceholdersWithQuotes with adjacent placeholders like select ?,? from t.
🟡 #15: No test for SQL containing only comments (-- foo\n with 0 params).
🟡 #16: No test for mixed escape inside identifier ( `col? ``) — the escape happens between ? characters.
🟡 #17: No test for interpolateSQL with extra parameters (params.size() > placeholders) — does it throw the right error?
🟡 #18: No test for very large SQL (perf regression check).
🟡 #19: ? immediately after -- start (--?). The -- enters IN_LINE_COMMENT; subsequent chars including ? are consumed in IN_LINE_COMMENT state. Should be skipped. Not tested directly.
There was a problem hiding this comment.
Thanks — added the ones I think are worth tracking:
- ✅ [PECO-948] Adds iterator for ArrowResultChunk #13 (unclosed quote/comment): added
testFindPlaceholderPositionsUnclosedSingleQuote,…UnclosedDoubleQuote,…UnclosedBacktick,…UnclosedBlockComment— all assert 0 placeholders. - ✅ [PECO-948] Read Data from Input Stream #14 (adjacent placeholders for
surroundPlaceholdersWithQuotes): added aSELECT ?,?,? FROM t→SELECT '?','?','?' FROM tcase to the parameterized test. - ✅ [PECO-948] ArrowResultChunk Iterator implementation #15 (comments-only SQL): added
testFindPlaceholderPositionsCommentsOnlySqlcovering line-only, block-only, and mixed comment-only inputs. - ✅ [PECO-910] Implement parallel downloader for chunks as per configurable in-memory chunks #16 (mixed escape inside identifier): added
testFindPlaceholderPositionsBacktickWithEscapeBetweenQuestionMarkswith`a?b?`` — both?` chars stay inside the identifier. - ✅ [PECO-996] Adding implementation for Prepared Statement #19 (
--?immediately after line-comment start): addedtestFindPlaceholderPositionsLineCommentImmediatelyConsumesQuestionMarkcovering--?,--?\n, and--?\nselect ? from t.
Skipped:
- [PECO-948] Adds iterator logic to ArrowStreamResult and adds tests #17 (interpolateSQL with extra params) — already covered by
SQLInterpolatorTest#testExtraParameters(line 64). - [PECO-982] Changes to make the driver work with squirrel client for metadata #18 (very large SQL perf check) — skipping for now; perf assertions in JUnit tend to be flaky in CI, and a JMH micro-benchmark is the better venue if we ever want regression coverage.
gopalldb
left a comment
There was a problem hiding this comment.
Some minor comments, thanks for fixing this
Adds tests requested in databricks#1424 review: - unclosed single/double/backtick quote and unclosed block comment in findPlaceholderPositions - comments-only SQL (line-only, block-only, mixed) - `--?` immediately after line-comment start - backtick identifier with escape between ? characters (`a?``b?`) - adjacent placeholders for surroundPlaceholdersWithQuotes Co-authored-by: Isaac Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
Summary
Fixes #1331 + some cosmetic change on next changelog
When
supportManyParameters=1,SQLInterpolator.interpolateSQLsplit the SQL on every?and counted every occurrence as a placeholder. Question marks inside line/block comments, single-quoted strings, double-quoted identifiers, and backtick-quoted identifiers were all counted, producing spurious"Parameter count does not match"errors for queries such as:This had 3
?characters but only 1 real placeholder, so the driver rejected the perfectly valid binding of 1 parameter.What changed
SqlCommentParser.findPlaceholderPositions(sql)which walks the SQL with the existing comment/literal state machine and returns the source indices of?characters that appear inState.NORMALonly.SQLInterpolator.interpolateSQLto use those positions: a single pass that slices the original SQL between placeholders and substitutes formatted parameter values. Comments, string literals, and quoted identifiers are preserved verbatim in the output.countPlaceholdershelper that scanned every?.DatabricksParameterMetaData.countParametersalready usedSqlCommentParser.forEachNonCommentCharand is unaffected.Test plan
SQLInterpolatorTestcases (basic interpolation, mixed types, escaped quotes, mismatch errors, etc.) all pass.SQLInterpolatorcovering?inside line comments, block comments, single-quoted strings, double-quoted identifiers, backtick identifiers, and a combined case.SqlCommentParser.findPlaceholderPositionscovering null/empty input, basic positions, comments, literals, quoted identifiers, and escaped quotes.mvn spotless:applyclean.