[BUG] Fix regex transpiler truncating supplementary codepoints in \x{...} and \0NNN escapes (#14744) by wjxiz1992 · Pull Request #14869 · NVIDIA/spark-rapids

wjxiz1992 · 2026-05-25T04:21:38Z

Closes #14744
Contributes to #14733

What this fixes

CudfRegexTranspiler.rewrite's arms for RegexHexDigit and RegexOctalChar rewrote codepoints >= 0x80 as RegexChar(r.codePoint.toChar). Char is a 16-bit UTF-16 code unit, so any codepoint above U+FFFF was silently truncated to its low 16 bits. For example, \x{1F600} (😀, codepoint U+1F600) was rewritten as RegexChar(0xF600.toChar) — a private-use BMP code unit — and cuDF matched U+F600 instead of the supplementary U+1F600. This produced silent wrong-result matches for any pattern containing emoji or many CJK supplementary characters: the issue body documents GPU False vs CPU True on input "😀" matched against \x{1F600}.

The parallel parseHex and parseOctal paths in RegexParser had the same .toChar truncation, so a \x{...} parsed directly into a RegexChar (not via RegexHexDigit) would also leak the truncation past the transpiler.
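
The truncation is plain 16-bit narrowing and can be reproduced outside the plugin. The sketch below is a Python model, not the plugin's Scala code: `low_16_bits` stands in for what `.toChar` does to an integer codepoint, and both helper names are made up for illustration. `utf16_units` shows the surrogate pair that a correct UTF-16 encoding of the same codepoint needs.

```python
import unicodedata


def low_16_bits(cp):
    """Model of narrowing an int codepoint to one 16-bit UTF-16 code unit."""
    return cp & 0xFFFF


def utf16_units(cp):
    """Return the UTF-16 code units that actually encode the codepoint."""
    data = chr(cp).encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


GRINNING_FACE = 0x1F600

truncated = low_16_bits(GRINNING_FACE)
print(hex(truncated))                                # 0xf600
print(unicodedata.category(chr(truncated)))          # Co (private use)
print([hex(u) for u in utf16_units(GRINNING_FACE)])  # ['0xd83d', '0xde00']
print(low_16_bits(0xFFFF) == 0xFFFF)                 # True: BMP values survive
```

Only values above U+FFFF lose information, which is why the BMP boundary is the natural place for a guard.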

Approach

Deviated from the issue's suggested fix. The issue proposed encoding supplementary codepoints (cp > U+FFFF) as a UTF-8 byte sequence by emitting a RegexSequence(encodeUtf8(cp).map(b => RegexChar(b.toChar))). This approach does not actually work end-to-end: cuDF's regex JNI consumes Unicode codepoints, not raw UTF-8 byte sequences, so synthesizing a multi-byte literal at the AST level still fails to match the actual supplementary codepoint in the data column (which cuDF iterates as one Unicode codepoint, not as 4 UTF-8 bytes). The truncation symptom would be replaced by a different wrong-match symptom (the GPU would match nothing instead of matching U+F600 by accident).
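
To see why a per-byte literal cannot line up with codepoint-based matching, the Python sketch below expands U+1F600 into its UTF-8 bytes and turns each byte into its own character, which is roughly what one `RegexChar` per byte amounts to. It only illustrates the codepoint arithmetic; it does not call cuDF, and `bytes_as_chars` is a hypothetical helper.

```python
def bytes_as_chars(cp):
    """Turn each UTF-8 byte of the codepoint into a separate one-codepoint char."""
    return "".join(chr(b) for b in chr(cp).encode("utf-8"))


emoji = chr(0x1F600)
per_byte = bytes_as_chars(0x1F600)

print(emoji.encode("utf-8").hex())      # f09f9880
print([hex(ord(c)) for c in per_byte])  # ['0xf0', '0x9f', '0x98', '0x80']
print(len(emoji), len(per_byte))        # 1 4
print(per_byte == emoji)                # False
```

The per-byte sequence is four distinct codepoints (U+00F0, U+009F, U+0098, U+0080), so an engine that compares codepoints would be looking for different text than the single U+1F600 stored in the column.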

Instead, this PR makes the parser and transpiler throw RegexUnsupportedException for any hex/octal escape whose codepoint exceeds U+FFFF, at four sites in RegexParser.scala:

1. RegexParser.parseHex: \x{...} parsed directly into a RegexChar.
2. RegexParser.parseOctal: \0NNN parsed directly into a RegexChar.
3. CudfRegexTranspiler.rewrite: RegexOctalChar arm.
4. CudfRegexTranspiler.rewrite: RegexHexDigit arm.

RegexUnsupportedException is the contract spark-rapids uses to signal "GPU cannot transpile this pattern; fall back to CPU regex engine." This guarantees GPU == CPU parity (the supreme rule from CLAUDE.md) at the cost of one CPU fallback per affected pattern. Patterns containing only BMP codepoints (cp <= U+FFFF) are unaffected — \x{FFFF} and below still transpile and run on the GPU.
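
The guard-and-fall-back contract described above can be modelled in a few lines. This is a hedged Python sketch rather than the Scala in `RegexParser.scala`: the exception class, `bmp_char_for`, and `choose_engine` are hypothetical names, and only the `cp > 0xFFFF` check is taken from the PR.

```python
import re

BMP_MAX = 0xFFFF


class RegexUnsupported(Exception):
    """Hypothetical stand-in for the plugin's RegexUnsupportedException."""


def bmp_char_for(cp):
    """Return the single-char literal for a BMP codepoint, else refuse."""
    if cp > BMP_MAX:
        raise RegexUnsupported(f"codepoint U+{cp:04X} is above U+FFFF")
    return chr(cp)


def choose_engine(pattern):
    """Pick 'GPU' if every braced hex escape is in the BMP, otherwise 'CPU'."""
    try:
        # Find every \x{...} escape and run its value through the guard.
        for digits in re.findall(r"\\x\{([0-9A-Fa-f]+)\}", pattern):
            bmp_char_for(int(digits, 16))
        return "GPU"
    except RegexUnsupported:
        return "CPU"


print(choose_engine(r"\x{FFFF}"))      # GPU
print(choose_engine(r"\x{10000}"))     # CPU
print(choose_engine(r"a[\x{1F600}]"))  # CPU
```

Raising instead of approximating means a pattern either runs on the GPU with exact semantics or not at all, which is the trade the PR makes.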

Tests added

Scala unit: RegularExpressionTranspilerSuite → issue-14744: supplementary codepoint hex/octal escapes fall back to CPU — asserts RegexUnsupportedException for six representative patterns (\x{10000}, \x{1F600}, \x{10FFFF}, embedded, in character class, in range) and adds a regression guard for the BMP boundary U+FFFF.
Python IT: regexp_test.py adds test_rlike_supplementary_codepoint_fallback_issue_14744 (4 patterns parametrized) and test_regexp_replace_supplementary_codepoint_fallback_issue_14744 using assert_gpu_fallback_collect. Also updates test_regexp_hexadecimal_digits to use \x{0000ffff} so its projection still runs fully on GPU (the previous \x{10ffff} would now fall back). The new tests follow the extend-existing-IT convention except where the SQL shape differs (regexp_replace vs rlike), which justifies the separate test.

Local validation

Scala suite:

Tests: succeeded 95, failed 0, canceled 6, ignored 0, pending 0

Python IT:

6 passed, 39867 deselected, 8 warnings in 13.34s

End-to-end repro re-run on the patched dist JAR via spark-shell (input column ["😀", "a", "hello 😀 world"], pattern \x{1F600}):

===== rapids.sql.enabled=true (patched GPU plugin) =====
a=😀  matches=true
a=a  matches=false
a=hello 😀 world  matches=true
===== rapids.sql.enabled=false (CPU baseline) =====
a=😀  matches=true
a=a  matches=false
a=hello 😀 world  matches=true
GPU == CPU: true

GPU output now matches CPU for every input listed in the sub-issue's "Observed" section.
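
The expected column can be cross-checked without Spark. The snippet below uses Python's `re` with the `\U0001F600` string escape as a stand-in for the CPU result. It is a different regex engine from Java's `Pattern`, so treat it only as a sanity check of the expected values, not as a reproduction of the spark-shell run.

```python
import re

inputs = ["😀", "a", "hello 😀 world"]
needle = re.compile("\U0001F600")  # one supplementary codepoint, U+1F600

# rlike has "find" semantics, so search() is the closest analogue.
matches = [needle.search(s) is not None for s in inputs]
print(matches)  # [True, False, True]
```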

Documentation

Updated for new or modified user-facing features or behaviors
No user-facing change

Testing

Added or modified tests to cover new code paths
Covered by existing tests
Not required

Performance

Tests ran and results are added in the PR description
Issue filed with a link in the PR description
Not required

…...} and \0NNN escapes (NVIDIA#14744)

Hex escapes (\x{NNNN}) and octal escapes (\0NNN) for supplementary codepoints (cp > U+FFFF) were silently truncated to their low 16 bits via `.toChar` in the RegexParser / CudfRegexTranspiler. For example, the pattern `\x{1F600}` (grinning face 😀) was rewritten to `RegexChar(0xF600.toChar)`, a BMP private-use codepoint, so the GPU matched the wrong character.

cuDF's regex JNI cannot represent supplementary codepoints natively (see the deviation note below), so this PR makes the parser and transpiler throw `RegexUnsupportedException` for any hex/octal escape whose codepoint exceeds U+FFFF. spark-rapids then falls back to the CPU regex engine, which Java's `Pattern` handles correctly. This guarantees GPU == CPU parity (the supreme rule from CLAUDE.md) at the cost of one CPU fallback per affected pattern. Patterns containing only BMP codepoints (cp <= U+FFFF) are unaffected.

Four throw sites in `RegexParser.scala`:

1. `parseHex` codepoint > 0xFFFF
2. `parseOctal` codepoint > 0xFFFF
3. `CudfRegexTranspiler.rewrite`: `RegexOctalChar` arm
4. `CudfRegexTranspiler.rewrite`: `RegexHexDigit` arm

Deviation from suggested fix: the issue proposed encoding supplementary codepoints as multi-byte UTF-8 sequences at the AST level. That does not work end-to-end because cuDF's regex JNI consumes Unicode codepoints (not raw UTF-8 bytes), so a synthesized byte sequence still fails to match the actual supplementary codepoint in the data column. The truncation symptom would be replaced by a different wrong-match symptom (GPU matches nothing instead of matching U+F600 by accident). The CPU-fallback path used here is the same contract spark-rapids uses for every other unsupported regex feature.

Tests:

* Scala: `RegularExpressionTranspilerSuite` -> "issue-14744: supplementary codepoint hex/octal escapes fall back to CPU" asserts `RegexUnsupportedException` for six representative patterns (`\x{10000}`, `\x{1F600}`, `\x{10FFFF}`, embedded, in character class, in range) and adds a regression guard for the BMP boundary U+FFFF.
* Python IT: `regexp_test.py` adds `test_rlike_supplementary_codepoint_fallback_issue_14744` (4 patterns parametrized) and `test_regexp_replace_supplementary_codepoint_fallback_issue_14744` using `assert_gpu_fallback_collect`. Updates `test_regexp_hexadecimal_digits` to use `\x{0000ffff}` so its projection still runs fully on GPU.

Local validation:

* mvn package -pl tests -am -Dbuildver=330 -DwildcardSuites=com.nvidia.spark.rapids.RegularExpressionTranspilerSuite -> Tests: succeeded 95, failed 0, canceled 6, ignored 0, pending 0
* run_pyspark_from_build.sh -k 'supplementary_codepoint_fallback_issue_14744 or test_regexp_hexadecimal_digits' -> 6 passed, 39867 deselected, 8 warnings in 13.34s
* spark-shell end-to-end repro on the patched dist JAR: GPU == CPU == [true, false, true] for inputs ["😀", "a", "hello 😀 world"] matched against the pattern `\x{1F600}`.

Closes NVIDIA#14744
Contributes to NVIDIA#14733

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>

wjxiz1992 · 2026-05-25T04:22:36Z

build

wjxiz1992 · 2026-05-25T04:25:29Z

build

Copilot

Pull request overview

Fixes a correctness bug in the GPU regex transpilation/parsing pipeline where hex/octal escapes could be converted via .toChar, truncating supplementary Unicode codepoints (> U+FFFF) and causing silent wrong-result matches. The new behavior explicitly rejects such escapes with RegexUnsupportedException so Spark-RAPIDS reliably falls back to the CPU regex engine for those patterns.

Changes:

Add parser/transpiler guards that throw RegexUnsupportedException when \x{...} / \0NNN escapes resolve to codepoints > U+FFFF, preventing UTF-16 truncation.
Add Scala unit coverage asserting CPU fallback for supplementary-codepoint escapes and a regression guard at the BMP boundary (U+FFFF).
Add Python integration tests asserting GPU fallback for rlike and regexp_replace using supplementary-codepoint hex escapes; adjust an existing projection test to stay fully on-GPU by using \x{0000ffff}.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| tests/src/test/scala/com/nvidia/spark/rapids/RegularExpressionTranspilerSuite.scala | Adds regression tests asserting unsupported supplementary-codepoint escapes and keeps BMP boundary coverage. |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala | Prevents `.toChar` truncation by throwing `RegexUnsupportedException` for supplementary codepoints during parsing/transpilation of hex/octal escapes. |
| integration_tests/src/main/python/regexp_test.py | Adds ITs that assert GPU fallback for supplementary-codepoint patterns and updates an existing test to avoid mixed GPU/CPU projection behavior. |

greptile-apps · 2026-05-25T04:29:36Z

Greptile Summary

Fixes a silent data-corruption bug where supplementary Unicode codepoints (cp > U+FFFF) encoded as \x{...} or \0NNN regex escapes were truncated to their low-16-bit char value via .toChar, causing the GPU to match the wrong character (e.g., U+F600 instead of U+1F600). The fix throws RegexUnsupportedException at four sites — two in the character-class parser path and two in the transpiler's RegexOctalChar/RegexHexDigit arms — so spark-rapids falls back to the CPU engine for any pattern containing supplementary codepoints.

Core fix (RegexParser.scala): four guard clauses checking codePoint > 0xFFFF replace the silent .toChar truncation with a well-typed CPU fallback; BMP codepoints (cp ≤ U+FFFF) are completely unaffected.
Scala unit tests (RegularExpressionTranspilerSuite.scala): six supplementary-codepoint patterns now assert RegexUnsupportedException, and a BMP-boundary regression guard verifies \x{FFFF} still transpiles on GPU.
Python integration tests (regexp_test.py): two new assert_gpu_fallback_collect tests cover rlike and regexp_replace; the existing hex-digit GPU projection test is updated to use \x{0000ffff} (U+FFFF, still within BMP) instead of the now-unsupported \x{10ffff}.

Confidence Score: 4/5

Safe to merge. The core change is a targeted exception-throw at four well-identified sites that previously produced wrong-result matches; BMP patterns are entirely unaffected.

The parser and transpiler changes are minimal, surgical, and covered by both Scala unit tests and Python integration tests. The one finding is a misleading code comment in the Python test (the data generator likely does not produce actual supplementary Unicode characters), but this does not affect the validity of the fallback assertion or the correctness of the fix itself.

The Python comment in regexp_test.py around line 659 overstates what the data generator produces; worth a quick clarification but not a blocker.

Important Files Changed

| Filename | Overview |
| --- | --- |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala | Adds four guard clauses that throw RegexUnsupportedException for supplementary codepoints (cp > U+FFFF) in hex and octal escapes at both the parser (character-class path) and transpiler (RegexHexDigit / RegexOctalChar arms), correctly eliminating the silent .toChar truncation. |
| tests/src/test/scala/com/nvidia/spark/rapids/RegularExpressionTranspilerSuite.scala | Adds Scala unit coverage for six supplementary-codepoint hex patterns (including embedded, in character class, and in range) asserting RegexUnsupportedException, plus a BMP-boundary regression guard for U+FFFF; correctly removes the pre-fix \x{10FFFF} GPU pattern test. |
| integration_tests/src/main/python/regexp_test.py | Adds Python integration tests verifying GPU fallback for supplementary codepoints in rlike and regexp_replace; replaces \x{10ffff} with \x{0000ffff} in the GPU-only projection. One comment overstates that the data generator produces actual supplementary Unicode characters. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Spark SQL regex pattern\n(e.g. \\x{1F600})"] --> B{Contains\nhex/octal escape?}
    B -- No --> C[Normal GPU transpilation path]
    B -- Yes --> D{codePoint\n> U+FFFF?}
    D -- No\n(BMP codepoint\n≤ U+FFFF) --> E["Transpile normally\n(RegexChar or RegexHexDigit)"]
    E --> F[GPU execution via cuDF JNI]
    D -- Yes\n(Supplementary codepoint) --> G["Throw RegexUnsupportedException\nat parser OR transpiler\n(4 guard sites)"]
    G --> H[spark-rapids catches exception]
    H --> I[CPU fallback via Java Pattern]
    I --> J["Correct result\n(GPU == CPU)"]

Reviews (1): Last reviewed commit: "[BUG] Fix regex transpiler truncating su..."

greptile-apps · 2026-05-25T04:29:41Z

+    # The data gen below seeds inputs that contain the actual
+    # supplementary codepoints we test for, so CPU matches are real
+    # rather than always-False.
+    gen = mk_str_gen('[abcd]\\\\x{1F600}\\\\x{10000}\\\\x{10FFFF}[abcd]')


Misleading "real matches" comment for data generator

StringGen uses sre_yield, which delegates to Python's sre_parse. Python's re module does not support the \x{HHHH} hex-escape syntax — it only accepts \xHH (exactly 2 hex digits). As a result, mk_str_gen('[abcd]\\\\x{1F600}\\\\x{10000}\\\\x{10FFFF}[abcd]') produces strings containing the literal text \x{1F600} (backslash + x{1F600}), not the actual Unicode codepoints U+1F600/U+10000/U+10FFFF. The comment "CPU matches are real rather than always-False" is therefore incorrect — rlike(a, "\\x{1F600}") on Java's regex engine will not match literal backslash-x text, so the CPU column will be all-false. The test still validates the GPU-fallback behaviour correctly (that is its primary purpose), but the comment overstates what the data generator produces.

Addressed in 835458394: deleted the inaccurate "data gen below seeds inputs that contain the actual supplementary codepoints" paragraph. You are correct that mk_str_gen uses Python re for the seed regex, which does not understand \x{HHHH} (only \xHH with exactly 2 hex digits), so the generated inputs contain literal backslash-x text, not the codepoints themselves. The test still does its job (it verifies that RegexUnsupportedException triggers GPU→CPU fallback for these patterns and that the CPU returns the result Java's Pattern produces), but the justification was wrong. Removed the misleading paragraph rather than rewriting it, since the remaining comment block already explains the regression and the fallback expectation.
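
The behaviour of Python's `re` discussed in this thread can be checked directly with the standard library. This is a minimal sketch that uses `re` alone and does not involve `mk_str_gen` or `sre_yield`; the shortened seed and the sample strings are illustrative.

```python
import re

# `re` has no \x{...} form: \x must be followed by exactly two hex digits.
try:
    re.compile(r"\x{1F600}")
    braced_hex_supported = True
except re.error:
    braced_hex_supported = False
print(braced_hex_supported)  # False

# With the backslash escaped, the seed describes literal text.
seed = r"[abcd]\\x{1F600}[abcd]"
literal_text = "a" + "\\" + "x{1F600}" + "b"
print(re.fullmatch(seed, literal_text) is not None)    # True
print(re.fullmatch(seed, "a\U0001F600b") is not None)  # False

# A digits-only brace group is a repetition count, not literal text.
print(re.fullmatch(r"\\x{3}", "\\" + "xxx") is not None)  # True
```

The last check is a side observation: a brace group made only of decimal digits, such as the `{10000}` in the original seed, is read by `re` as a repeat count on the preceding `x`, so that part of a seed would not be literal brace text either.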

nvauto · 2026-05-25T05:03:00Z

NOTE: release/26.06 has been created from main. Please retarget your PR to release/26.06 if it should be included in the release.

The deleted comment claimed `mk_str_gen` seeds inputs containing actual supplementary codepoints. Python's `re` module (which `mk_str_gen` uses for the seed regex) does not support the `\x{HHHH}` syntax, only `\xHH` with exactly 2 hex digits, so the gen produces strings with literal backslash-x text rather than real codepoints. The test still validates the correct behavior (GPU falls back to CPU via RegexUnsupportedException and CPU returns the same RLike result the Java engine produces), but the "CPU matches are real" justification was factually wrong.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>

Copilot AI review requested due to automatic review settings May 25, 2026 04:21

Copilot started reviewing on behalf of wjxiz1992 May 25, 2026 04:21

Copilot AI reviewed May 25, 2026

greptile-apps Bot reviewed May 25, 2026

wjxiz1992 mentioned this pull request May 25, 2026

[BUG] Fix regex CharacterRange not recursing into endpoints (#14745) #14874

Draft

8 tasks

wjxiz1992 marked this pull request as draft May 25, 2026 10:07

wjxiz1992 wants to merge 2 commits into NVIDIA:main from wjxiz1992:fix/regex-14744

3 participants
