[BUG] Fix regex transpiler truncating supplementary codepoints in \x{...} and \0NNN escapes (#14744) #14869
wjxiz1992 wants to merge 2 commits into
…...} and \0NNN escapes (NVIDIA#14744)

Hex escapes (`\x{NNNN}`) and octal escapes (`\0NNN`) for supplementary codepoints (cp > U+FFFF) were silently truncated to their low 16 bits via `.toChar` in the RegexParser / CudfRegexTranspiler. For example, the pattern `\x{1F600}` (grinning face 😀) was rewritten to `RegexChar(0xF600.toChar)`, a BMP private-use codepoint, so the GPU matched the wrong character.

cuDF's regex JNI cannot represent supplementary codepoints natively (see the deviation note below), so this PR makes the parser and transpiler throw `RegexUnsupportedException` for any hex/octal escape whose codepoint exceeds U+FFFF. spark-rapids then falls back to the CPU regex engine, where Java's `Pattern` handles these escapes correctly. This guarantees GPU == CPU parity (the supreme rule from CLAUDE.md) at the cost of one CPU fallback per affected pattern. Patterns containing only BMP codepoints (cp <= U+FFFF) are unaffected.

Four throw sites in `RegexParser.scala`:
1. `parseHex`, codepoint > 0xFFFF
2. `parseOctal`, codepoint > 0xFFFF
3. `CudfRegexTranspiler.rewrite`, `RegexOctalChar` arm
4. `CudfRegexTranspiler.rewrite`, `RegexHexDigit` arm

Deviation from suggested fix: the issue proposed encoding supplementary codepoints as multi-byte UTF-8 sequences at the AST level. That does not work end-to-end because cuDF's regex JNI consumes Unicode codepoints (not raw UTF-8 bytes), so a synthesized byte sequence still fails to match the actual supplementary codepoint in the data column. The truncation symptom would be replaced by a different wrong-match symptom (GPU matches nothing instead of matching U+F600 by accident). The CPU-fallback path used here is the same contract spark-rapids uses for every other unsupported regex feature.

Tests:
* Scala: `RegularExpressionTranspilerSuite` -> "issue-14744: supplementary codepoint hex/octal escapes fall back to CPU" asserts `RegexUnsupportedException` for six representative patterns (`\x{10000}`, `\x{1F600}`, `\x{10FFFF}`, embedded, in character class, in range) and adds a regression guard for the BMP boundary U+FFFF.
* Python IT: `regexp_test.py` adds `test_rlike_supplementary_codepoint_fallback_issue_14744` (4 patterns parametrized) and `test_regexp_replace_supplementary_codepoint_fallback_issue_14744` using `assert_gpu_fallback_collect`. Updates `test_regexp_hexadecimal_digits` to use `\x{0000ffff}` so its projection still runs fully on GPU.

Local validation:
* `mvn package -pl tests -am -Dbuildver=330 -DwildcardSuites=com.nvidia.spark.rapids.RegularExpressionTranspilerSuite` -> Tests: succeeded 95, failed 0, canceled 6, ignored 0, pending 0
* `run_pyspark_from_build.sh -k 'supplementary_codepoint_fallback_issue_14744 or test_regexp_hexadecimal_digits'` -> 6 passed, 39867 deselected, 8 warnings in 13.34s
* spark-shell end-to-end repro on the patched dist JAR: GPU == CPU == [true, false, true] for inputs ["😀", "a", "hello 😀 world"] matched against the pattern `\x{1F600}`.

Closes NVIDIA#14744
Contributes to NVIDIA#14733

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
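The deviation note in the commit message above can be illustrated with a small Python sketch. It is not the plugin's code: `utf8_byte_literals` is a hypothetical helper that mimics the proposed "one literal per UTF-8 byte" encoding, and the substring check stands in for an engine that compares whole codepoints, which is how the commit message describes cuDF.

```python
def utf8_byte_literals(code_point: int) -> str:
    """Hypothetical model of the rejected fix: one regex literal per UTF-8 byte."""
    utf8_bytes = chr(code_point).encode("utf-8")
    # Each byte becomes its own codepoint (U+0000..U+00FF), not a raw byte.
    return "".join(chr(b) for b in utf8_bytes)


GRINNING_FACE = 0x1F600

data = chr(GRINNING_FACE)                      # one codepoint in the column
synthesized = utf8_byte_literals(GRINNING_FACE)  # four separate codepoints

print([f"U+{ord(c):04X}" for c in synthesized])
# A codepoint-oriented matcher compares codepoints, so the four-codepoint
# literal never lines up with the single supplementary codepoint.
print("synthesized literal found in data:", synthesized in data)
```

Under these assumptions the synthesized literal is four codepoints long while the data holds one, which is why the byte-sequence approach would trade the truncation symptom for a "matches nothing" symptom.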
build
Pull request overview
Fixes a correctness bug in the GPU regex transpilation/parsing pipeline where hex/octal escapes could be converted via .toChar, truncating supplementary Unicode codepoints (> U+FFFF) and causing silent wrong-result matches. The new behavior explicitly rejects such escapes with RegexUnsupportedException so Spark-RAPIDS reliably falls back to the CPU regex engine for those patterns.
Changes:
- Add parser/transpiler guards that throw `RegexUnsupportedException` when `\x{...}` / `\0NNN` escapes resolve to codepoints > U+FFFF, preventing UTF-16 truncation.
- Add Scala unit coverage asserting CPU fallback for supplementary-codepoint escapes and a regression guard at the BMP boundary (U+FFFF).
- Add Python integration tests asserting GPU fallback for `rlike` and `regexp_replace` using supplementary-codepoint hex escapes; adjust an existing projection test to stay fully on-GPU by using `\x{0000ffff}`.
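The truncation described in this overview is plain 16-bit narrowing. A minimal Python sketch of the arithmetic, where `to_char` is a hypothetical stand-in for the JVM `Int`-to-`Char` conversion (which keeps only the low 16 bits):

```python
def to_char(code_point: int) -> int:
    """Model a JVM Int -> Char narrowing: keep only the low 16 bits."""
    return code_point & 0xFFFF


def is_supplementary(code_point: int) -> bool:
    """True for codepoints outside the Basic Multilingual Plane."""
    return code_point > 0xFFFF


for cp in (0x41, 0xFFFF, 0x10000, 0x1F600, 0x10FFFF):
    truncated = to_char(cp)
    status = "TRUNCATED" if truncated != cp else "unchanged"
    print(f"U+{cp:04X} -> U+{truncated:04X} ({status})")

# U+1F600 narrows to U+F600, which sits in the BMP private use area
# (U+E000..U+F8FF), matching the example in the PR description.
assert 0xE000 <= to_char(0x1F600) <= 0xF8FF
```

Every codepoint at or below U+FFFF survives the narrowing unchanged, which is why BMP-only patterns are unaffected by the new guards.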
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| tests/src/test/scala/com/nvidia/spark/rapids/RegularExpressionTranspilerSuite.scala | Adds regression tests asserting unsupported supplementary-codepoint escapes and keeps BMP boundary coverage. |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala | Prevents .toChar truncation by throwing RegexUnsupportedException for supplementary codepoints during parsing/transpilation of hex/octal escapes. |
| integration_tests/src/main/python/regexp_test.py | Adds ITs that assert GPU fallback for supplementary-codepoint patterns and updates an existing test to avoid mixed GPU/CPU projection behavior. |
Greptile Summary
Fixes a silent data-corruption bug where supplementary Unicode codepoints (cp > U+FFFF) encoded as
Confidence Score: 4/5
Safe to merge. The core change is a targeted exception-throw at four well-identified sites that previously produced wrong-result matches; BMP patterns are entirely unaffected. The parser and transpiler changes are minimal, surgical, and covered by both Scala unit tests and Python integration tests. The one finding is a misleading code comment in the Python test (the data generator likely does not produce actual supplementary Unicode characters), but this does not affect the validity of the fallback assertion or the correctness of the fix itself. The Python comment in
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["Spark SQL regex pattern\n(e.g. \\x{1F600})"] --> B{Contains\nhex/octal escape?}
B -- No --> C[Normal GPU transpilation path]
B -- Yes --> D{codePoint\n> U+FFFF?}
D -- No\n(BMP codepoint\n≤ U+FFFF) --> E["Transpile normally\n(RegexChar or RegexHexDigit)"]
E --> F[GPU execution via cuDF JNI]
D -- Yes\n(Supplementary codepoint) --> G["Throw RegexUnsupportedException\nat parser OR transpiler\n(4 guard sites)"]
G --> H[spark-rapids catches exception]
H --> I[CPU fallback via Java Pattern]
I --> J["Correct result\n(GPU == CPU)"]
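The decision flow in the chart can be sketched in a few lines of Python. This is only a model of the control flow, not the Scala implementation: `RegexUnsupportedError`, `check_escape`, and `choose_engine` are hypothetical names, and the real guards live at the four sites listed in the PR description.

```python
from typing import Iterable

MAX_BMP_CODE_POINT = 0xFFFF


class RegexUnsupportedError(Exception):
    """Hypothetical stand-in for the plugin's RegexUnsupportedException."""


def check_escape(code_point: int) -> str:
    """Reject supplementary codepoints instead of narrowing them to 16 bits."""
    if code_point > MAX_BMP_CODE_POINT:
        raise RegexUnsupportedError(
            f"escape resolves to supplementary codepoint U+{code_point:X}"
        )
    return chr(code_point)


def choose_engine(escape_code_points: Iterable[int]) -> str:
    """Return "GPU" if every escape is representable, else fall back to "CPU"."""
    try:
        for cp in escape_code_points:
            check_escape(cp)
    except RegexUnsupportedError:
        return "CPU"
    return "GPU"


print(choose_engine([0x61, 0xFFFF]))   # BMP-only pattern
print(choose_engine([0x61, 0x1F600]))  # contains a supplementary escape
```

Raising at the guard rather than returning a sentinel mirrors the chart: a single supplementary escape anywhere in the pattern sends the whole pattern to the CPU path.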
Reviews (1): Last reviewed commit: "[BUG] Fix regex transpiler truncating su..."
    # The data gen below seeds inputs that contain the actual
    # supplementary codepoints we test for, so CPU matches are real
    # rather than always-False.
    gen = mk_str_gen('[abcd]\\\\x{1F600}\\\\x{10000}\\\\x{10FFFF}[abcd]')
Misleading "real matches" comment for data generator
StringGen uses sre_yield, which delegates to Python's sre_parse. Python's re module does not support the \x{HHHH} hex-escape syntax — it only accepts \xHH (exactly 2 hex digits). As a result, mk_str_gen('[abcd]\\\\x{1F600}\\\\x{10000}\\\\x{10FFFF}[abcd]') produces strings containing the literal text \x{1F600} (backslash + x{1F600}), not the actual Unicode codepoints U+1F600/U+10000/U+10FFFF. The comment "CPU matches are real rather than always-False" is therefore incorrect — rlike(a, "\\x{1F600}") on Java's regex engine will not match literal backslash-x text, so the CPU column will be all-false. The test still validates the GPU-fallback behaviour correctly (that is its primary purpose), but the comment overstates what the data generator produces.
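The finding can be checked with Python's `re` module directly. The sketch below is an approximation: it uses a shortened version of the seed pattern and plain `re` rather than `mk_str_gen` or `sre_yield`, on the assumption (stated in the comment above) that the generator parses its seed with the same regex grammar.

```python
import re

# Shortened seed pattern, written with the same escaping as the test's generator.
# After Python string unescaping the regex is: [abcd]\\x{1F600}[abcd]
seed = '[abcd]\\\\x{1F600}[abcd]'
compiled = re.compile(seed)

literal_text = 'a\\x{1F600}b'     # backslash, 'x', '{1F600}' as plain characters
real_emoji_text = 'a\U0001F600b'  # the actual codepoint U+1F600

matches_literal = compiled.fullmatch(literal_text) is not None
matches_emoji = compiled.fullmatch(real_emoji_text) is not None
print("matches literal backslash-x text:", matches_literal)
print("matches the real codepoint:", matches_emoji)


def accepts_braced_hex_escape() -> bool:
    """Check whether `re` accepts the Java-style \\x{HHHH} escape at all."""
    try:
        re.compile(r'\x{1F600}')
        return True
    except re.error:
        return False


print("re accepts \\x{1F600} as an escape:", accepts_braced_hex_escape())
```

Either way the seed is read, strings produced from it cannot contain U+1F600 itself, so the CPU result column for `rlike(a, "\\x{1F600}")` would be all-false on such data.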
Addressed in 835458394: deleted the inaccurate "data gen below seeds inputs that contain the actual supplementary codepoints" paragraph. You are correct that `mk_str_gen` uses Python `re` for the seed regex, which does not understand `\x{HHHH}` (only `\xHH` with exactly 2 hex digits), so the generated inputs contain literal backslash-x text, not the codepoints themselves. The test still does its job: it verifies that `RegexUnsupportedException` triggers GPU→CPU fallback for these patterns and that the CPU returns the result Java's `Pattern` produces. The justification, however, was wrong. I removed the misleading paragraph rather than rewriting it, since the remaining comment block already explains the regression and the fallback expectation.
NOTE: release/26.06 has been created from main. Please retarget your PR to release/26.06 if it should be included in the release.
The deleted comment claimed `mk_str_gen` seeds inputs containing actual
supplementary codepoints. Python's `re` module (which `mk_str_gen` uses
for the seed regex) does not support the `\x{HHHH}` syntax — only `\xHH`
with exactly 2 hex digits — so the gen produces strings with literal
backslash-x text rather than real codepoints. The test still validates
the correct behavior (GPU falls back to CPU via RegexUnsupportedException
and CPU returns the same RLike result the Java engine produces), but the
"CPU matches are real" justification was factually wrong.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
Closes #14744
Contributes to #14733
What this fixes
`CudfRegexTranspiler.rewrite`'s arms for `RegexHexDigit` and `RegexOctalChar` rewrote codepoints `>= 0x80` as `RegexChar(r.codePoint.toChar)`. `Char` is a 16-bit UTF-16 code unit, so any codepoint above U+FFFF was silently truncated to its low 16 bits. For example, `\x{1F600}` (😀, codepoint U+1F600) was rewritten as `RegexChar(0xF600.toChar)`, a private-use BMP code unit, and cuDF matched U+F600 instead of the supplementary U+1F600. This produced silent wrong-result matches for any pattern containing emoji or many CJK supplementary characters: the issue body documents `GPU False` vs `CPU True` on input `"😀"` matched against `\x{1F600}`.

The parallel `parseHex` and `parseOctal` paths in `RegexParser` had the same `.toChar` truncation, so a `\x{...}` parsed directly into a `RegexChar` (not via `RegexHexDigit`) would also leak the truncation past the transpiler.

Approach
Deviated from the issue's suggested fix. The issue proposed encoding supplementary codepoints (cp > U+FFFF) as a UTF-8 byte sequence by emitting a `RegexSequence(encodeUtf8(cp).map(b => RegexChar(b.toChar)))`. This approach does not actually work end-to-end: cuDF's regex JNI consumes Unicode codepoints, not raw UTF-8 byte sequences, so synthesizing a multi-byte literal at the AST level still fails to match the actual supplementary codepoint in the data column (which cuDF iterates as one Unicode codepoint, not as 4 UTF-8 bytes). The truncation symptom would be replaced by a different wrong-match symptom (the GPU would match nothing instead of matching U+F600 by accident).

Instead, this PR makes the parser and transpiler throw `RegexUnsupportedException` for any hex/octal escape whose codepoint exceeds U+FFFF, at four sites in `RegexParser.scala`:

1. `RegexParser.parseHex`: `\x{...}` parsed directly into a `RegexChar`.
2. `RegexParser.parseOctal`: `\0NNN` parsed directly into a `RegexChar`.
3. `CudfRegexTranspiler.rewrite`: `RegexOctalChar` arm.
4. `CudfRegexTranspiler.rewrite`: `RegexHexDigit` arm.

`RegexUnsupportedException` is the contract spark-rapids uses to signal "GPU cannot transpile this pattern; fall back to CPU regex engine." This guarantees GPU == CPU parity (the supreme rule from CLAUDE.md) at the cost of one CPU fallback per affected pattern. Patterns containing only BMP codepoints (cp <= U+FFFF) are unaffected: `\x{FFFF}` and below still transpile and run on the GPU.

Tests added
- `RegularExpressionTranspilerSuite` → `issue-14744: supplementary codepoint hex/octal escapes fall back to CPU`: asserts `RegexUnsupportedException` for six representative patterns (`\x{10000}`, `\x{1F600}`, `\x{10FFFF}`, embedded, in character class, in range) and adds a regression guard for the BMP boundary U+FFFF.
- `regexp_test.py` adds `test_rlike_supplementary_codepoint_fallback_issue_14744` (4 patterns parametrized) and `test_regexp_replace_supplementary_codepoint_fallback_issue_14744` using `assert_gpu_fallback_collect`. Also updates `test_regexp_hexadecimal_digits` to use `\x{0000ffff}` so its projection still runs fully on GPU (the previous `\x{10ffff}` would now fall back). The new tests follow the extend-existing-IT convention except where the SQL shape differs (`regexp_replace` vs `rlike`), which justifies the separate test.

Local validation
Scala suite:
Python IT:
End-to-end repro re-run on the patched dist JAR via `spark-shell` (input column `["😀", "a", "hello 😀 world"]`, pattern `\x{1F600}`):
GPU output now matches CPU for every input listed in the sub-issue's "Observed" section.
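The expected outcome of the repro can be modelled without Spark. The sketch below is a simulation under stated assumptions, not the spark-shell session: `contains_code_point` is a hypothetical helper that approximates an unanchored `rlike` on a single-codepoint pattern, and the "truncated" column models the old behaviour by searching for the low 16 bits of the codepoint instead.

```python
def contains_code_point(text: str, code_point: int) -> bool:
    """Approximate an unanchored match of a one-codepoint pattern."""
    return chr(code_point) in text


GRINNING_FACE = 0x1F600
inputs = ["\U0001F600", "a", "hello \U0001F600 world"]

# What a codepoint-correct engine (the CPU path) reports.
correct = [contains_code_point(s, GRINNING_FACE) for s in inputs]

# What searching for the truncated codepoint U+F600 would report.
truncated = [contains_code_point(s, GRINNING_FACE & 0xFFFF) for s in inputs]

print("correct:  ", correct)
print("truncated:", truncated)
```

The first list corresponds to the `[true, false, true]` result quoted in the commit message; the second shows why the pre-fix GPU path disagreed on the inputs that contain the emoji.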
Documentation
Testing
Performance