
[BUG] Fix regex transpiler truncating supplementary codepoints in \x{...} and \0NNN escapes (#14744) #14869

Draft
wjxiz1992 wants to merge 2 commits into NVIDIA:main from wjxiz1992:fix/regex-14744

Conversation

@wjxiz1992
Collaborator

Closes #14744
Contributes to #14733

What this fixes

CudfRegexTranspiler.rewrite's arms for RegexHexDigit and RegexOctalChar rewrote codepoints >= 0x80 as RegexChar(r.codePoint.toChar). Char is a 16-bit UTF-16 code unit, so any codepoint above U+FFFF was silently truncated to its low 16 bits. For example, \x{1F600} (😀, codepoint U+1F600) was rewritten as RegexChar(0xF600.toChar) — a private-use BMP code unit — and cuDF matched U+F600 instead of the supplementary U+1F600. This produced silent wrong-result matches for any pattern containing emoji or many CJK supplementary characters: the issue body documents GPU False vs CPU True on input "😀" matched against \x{1F600}.
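
The truncation can be reproduced outside the JVM. The following Python sketch is illustrative only and is not code from this PR; `to_char` and `utf16_units` are hypothetical helpers. `to_char` mimics `Int.toChar` by keeping the low 16 bits, and `utf16_units` shows the surrogate pair that a faithful UTF-16 representation of U+1F600 would need.

```python
import unicodedata

def to_char(code_point):
    """Mimic JVM Int.toChar: keep only the low 16 bits (one UTF-16 code unit)."""
    return code_point & 0xFFFF

def utf16_units(code_point):
    """Return the UTF-16 code units that represent code_point."""
    data = chr(code_point).encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

grinning = 0x1F600
truncated = to_char(grinning)

print(hex(truncated))                           # 0xf600
print(unicodedata.category(chr(truncated)))     # Co (private use)
print([hex(u) for u in utf16_units(grinning)])  # ['0xd83d', '0xde00']
```

BMP codepoints pass through `to_char` unchanged, which is why only escapes above U+FFFF were affected.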

The parallel parseHex and parseOctal paths in RegexParser had the same .toChar truncation, so a \x{...} parsed directly into a RegexChar (not via RegexHexDigit) would also leak the truncation past the transpiler.

Approach

Deviated from the issue's suggested fix. The issue proposed encoding supplementary codepoints (cp > U+FFFF) as a UTF-8 byte sequence by emitting a RegexSequence(encodeUtf8(cp).map(b => RegexChar(b.toChar))). This approach does not actually work end-to-end: cuDF's regex JNI consumes Unicode codepoints, not raw UTF-8 byte sequences, so synthesizing a multi-byte literal at the AST level still fails to match the actual supplementary codepoint in the data column (which cuDF iterates as one Unicode codepoint, not as 4 UTF-8 bytes). The truncation symptom would be replaced by a different wrong-match symptom (the GPU would match nothing instead of matching U+F600 by accident).
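
A rough illustration of why a byte-sequence literal cannot match in a codepoint-oriented engine. This sketch uses Python's `re` (which also matches codepoint by codepoint) as a stand-in; it does not exercise cuDF, and `utf8_bytes_as_chars` is a hypothetical analogue of the issue's `encodeUtf8(cp).map(...)` idea that treats each byte as an unsigned value.

```python
import re

def utf8_bytes_as_chars(code_point):
    """Turn each UTF-8 byte of code_point into its own one-codepoint literal."""
    return "".join(chr(b) for b in chr(code_point).encode("utf-8"))

data = "\U0001F600"                        # the column value: a single codepoint
byte_literal = utf8_bytes_as_chars(0x1F600)

print([hex(ord(c)) for c in byte_literal])  # ['0xf0', '0x9f', '0x98', '0x80']
print(len(data), len(byte_literal))         # 1 4
print(re.search(re.escape(byte_literal), data) is None)  # True
```

The synthesized literal is four separate codepoints (U+00F0, U+009F, U+0098, U+0080), so it never lines up with the one codepoint in the data.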

Instead, this PR makes the parser and transpiler throw RegexUnsupportedException for any hex/octal escape whose codepoint exceeds U+FFFF, at four sites in RegexParser.scala:

  1. RegexParser.parseHex: \x{...} parsed directly into a RegexChar.
  2. RegexParser.parseOctal: \0NNN parsed directly into a RegexChar.
  3. CudfRegexTranspiler.rewrite: RegexOctalChar arm.
  4. CudfRegexTranspiler.rewrite: RegexHexDigit arm.

RegexUnsupportedException is the contract spark-rapids uses to signal "GPU cannot transpile this pattern; fall back to CPU regex engine." This guarantees GPU == CPU parity (the supreme rule from CLAUDE.md) at the cost of one CPU fallback per affected pattern. Patterns containing only BMP codepoints (cp <= U+FFFF) are unaffected — \x{FFFF} and below still transpile and run on the GPU.
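
The guard-then-fall-back contract can be sketched in a few lines. This is a Python analogue for illustration, not the Scala code in this PR: `RegexUnsupportedError`, `check_escape_code_point`, and `choose_engine` are hypothetical names, and the real plugin decides on fallback at plan time rather than through a helper like this.

```python
class RegexUnsupportedError(Exception):
    """Hypothetical Python analogue of RegexUnsupportedException."""

MAX_BMP = 0xFFFF

def check_escape_code_point(code_point, escape_text):
    """Return the BMP character for the escape, or reject a supplementary one."""
    if code_point > MAX_BMP:
        raise RegexUnsupportedError(
            f"escape {escape_text} resolves to supplementary codepoint U+{code_point:X}")
    return chr(code_point)

def choose_engine(code_point, escape_text):
    """Return 'GPU' when the escape transpiles, 'CPU' when it must fall back."""
    try:
        check_escape_code_point(code_point, escape_text)
        return "GPU"
    except RegexUnsupportedError:
        return "CPU"

print(choose_engine(0xFFFF, r"\x{FFFF}"))    # GPU
print(choose_engine(0x10000, r"\x{10000}"))  # CPU
print(choose_engine(0x1F600, r"\x{1F600}"))  # CPU
```

Raising instead of truncating trades GPU execution of the affected pattern for a result that always agrees with the CPU engine.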

Tests added

  • Scala unit: RegularExpressionTranspilerSuite, test "issue-14744: supplementary codepoint hex/octal escapes fall back to CPU": asserts RegexUnsupportedException for six representative patterns (\x{10000}, \x{1F600}, \x{10FFFF}, embedded, in character class, in range) and adds a regression guard for the BMP boundary U+FFFF.
  • Python IT: regexp_test.py adds test_rlike_supplementary_codepoint_fallback_issue_14744 (4 patterns parametrized) and test_regexp_replace_supplementary_codepoint_fallback_issue_14744 using assert_gpu_fallback_collect. Also updates test_regexp_hexadecimal_digits to use \x{0000ffff} so its projection still runs fully on GPU (the previous \x{10ffff} would now fall back). The new tests follow the extend-existing-IT convention except where the SQL shape differs (regexp_replace vs rlike), which justifies the separate test.
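
The expectation these tests encode can be summarized with a small self-contained check. The sketch below is not taken from either test file: `needs_cpu_fallback` is a hypothetical helper, and the embedded, character-class, and range patterns are illustrative guesses, since the description names only the first three patterns explicitly. The scan is deliberately naive and ignores cases such as an escaped backslash before `x`.

```python
import re

# Matches the brace form of a hex escape, e.g. \x{1F600}, and captures the digits.
HEX_BRACE_ESCAPE = re.compile(r"\\x\{([0-9A-Fa-f]+)\}")

def needs_cpu_fallback(pattern):
    """True if any \\x{...} escape in pattern names a codepoint above U+FFFF."""
    return any(int(m.group(1), 16) > 0xFFFF
               for m in HEX_BRACE_ESCAPE.finditer(pattern))

cases = {
    r"\x{10000}": True,
    r"\x{1F600}": True,
    r"\x{10FFFF}": True,
    r"a\x{1F600}b": True,              # embedded (illustrative)
    r"[\x{1F600}]": True,              # in a character class (illustrative)
    r"[\x{10000}-\x{10FFFF}]": True,   # in a range (illustrative)
    r"\x{FFFF}": False,                # BMP boundary: still eligible for the GPU
}
for pattern, expected in cases.items():
    assert needs_cpu_fallback(pattern) == expected
```

Leading zeros do not change the outcome, so `\x{0000ffff}` is treated the same as `\x{FFFF}`.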

Local validation

Scala suite:

Tests: succeeded 95, failed 0, canceled 6, ignored 0, pending 0

Python IT:

6 passed, 39867 deselected, 8 warnings in 13.34s

End-to-end repro re-run on the patched dist JAR via spark-shell (input column ["😀", "a", "hello 😀 world"], pattern \x{1F600}):

===== rapids.sql.enabled=true (patched GPU plugin) =====
a=😀  matches=true
a=a  matches=false
a=hello 😀 world  matches=true
===== rapids.sql.enabled=false (CPU baseline) =====
a=😀  matches=true
a=a  matches=false
a=hello 😀 world  matches=true
GPU == CPU: true

GPU output now matches CPU for every input listed in the sub-issue's "Observed" section.

Documentation

  • Updated for new or modified user-facing features or behaviors
  • No user-facing change

Testing

  • Added or modified tests to cover new code paths
  • Covered by existing tests
  • Not required

Performance

  • Tests ran and results are added in the PR description
  • Issue filed with a link in the PR description
  • Not required

…...} and \0NNN escapes (NVIDIA#14744)

Hex escapes (\x{NNNN}) and octal escapes (\0NNN) for supplementary
codepoints (cp > U+FFFF) were silently truncated to their low 16 bits
via `.toChar` in the RegexParser / CudfRegexTranspiler. For example,
the pattern `\x{1F600}` (grinning face 😀) was rewritten to
`RegexChar(0xF600.toChar)` — a BMP private-use codepoint — so the GPU
matched the wrong character.

cuDF's regex JNI cannot represent supplementary codepoints natively
(see the deviation note below), so this PR makes the parser and
transpiler throw `RegexUnsupportedException` for any hex/octal escape
whose codepoint exceeds U+FFFF. spark-rapids then falls back to the
CPU regex engine, which Java's `Pattern` handles correctly. This
guarantees GPU == CPU parity (the supreme rule from CLAUDE.md) at the
cost of one CPU fallback per affected pattern. Patterns containing
only BMP codepoints (cp <= U+FFFF) are unaffected.

Four throw sites in `RegexParser.scala`:
1. `parseHex` codepoint > 0xFFFF
2. `parseOctal` codepoint > 0xFFFF
3. `CudfRegexTranspiler.rewrite` — `RegexOctalChar` arm
4. `CudfRegexTranspiler.rewrite` — `RegexHexDigit` arm

Deviation from suggested fix: the issue proposed encoding
supplementary codepoints as multi-byte UTF-8 sequences at the AST
level. That does not work end-to-end because cuDF's regex JNI consumes
Unicode codepoints (not raw UTF-8 bytes), so a synthesized byte
sequence still fails to match the actual supplementary codepoint in
the data column. The truncation symptom would be replaced by a
different wrong-match symptom (GPU matches nothing instead of matching
U+F600 by accident). The CPU-fallback path used here is the same
contract spark-rapids uses for every other unsupported regex feature.

Tests:
* Scala: `RegularExpressionTranspilerSuite` -> "issue-14744:
  supplementary codepoint hex/octal escapes fall back to CPU" asserts
  `RegexUnsupportedException` for six representative patterns
  (`\x{10000}`, `\x{1F600}`, `\x{10FFFF}`, embedded, in character
  class, in range) and adds a regression guard for the BMP boundary
  U+FFFF.
* Python IT: `regexp_test.py` adds
  `test_rlike_supplementary_codepoint_fallback_issue_14744` (4 patterns
  parametrized) and `test_regexp_replace_supplementary_codepoint_fallback_issue_14744`
  using `assert_gpu_fallback_collect`. Updates `test_regexp_hexadecimal_digits`
  to use `\x{0000ffff}` so its projection still runs fully on GPU.

Local validation:
* mvn package -pl tests -am -Dbuildver=330
  -DwildcardSuites=com.nvidia.spark.rapids.RegularExpressionTranspilerSuite
  -> Tests: succeeded 95, failed 0, canceled 6, ignored 0, pending 0
* run_pyspark_from_build.sh -k 'supplementary_codepoint_fallback_issue_14744
  or test_regexp_hexadecimal_digits'
  -> 6 passed, 39867 deselected, 8 warnings in 13.34s
* spark-shell end-to-end repro on the patched dist JAR:
  GPU == CPU == [true, false, true] for inputs ["😀", "a", "hello 😀 world"]
  matched against the pattern `\x{1F600}`.

Closes NVIDIA#14744
Contributes to NVIDIA#14733

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
Copilot AI review requested due to automatic review settings May 25, 2026 04:21
@wjxiz1992
Collaborator Author

build

1 similar comment
@wjxiz1992
Collaborator Author

build

Contributor

Copilot AI left a comment

Pull request overview

Fixes a correctness bug in the GPU regex transpilation/parsing pipeline where hex/octal escapes could be converted via .toChar, truncating supplementary Unicode codepoints (> U+FFFF) and causing silent wrong-result matches. The new behavior explicitly rejects such escapes with RegexUnsupportedException so Spark-RAPIDS reliably falls back to the CPU regex engine for those patterns.

Changes:

  • Add parser/transpiler guards that throw RegexUnsupportedException when \x{...} / \0NNN escapes resolve to codepoints > U+FFFF, preventing UTF-16 truncation.
  • Add Scala unit coverage asserting CPU fallback for supplementary-codepoint escapes and a regression guard at the BMP boundary (U+FFFF).
  • Add Python integration tests asserting GPU fallback for rlike and regexp_replace using supplementary-codepoint hex escapes; adjust an existing projection test to stay fully on-GPU by using \x{0000ffff}.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File | Description
tests/src/test/scala/com/nvidia/spark/rapids/RegularExpressionTranspilerSuite.scala | Adds regression tests asserting unsupported supplementary-codepoint escapes and keeps BMP boundary coverage.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala | Prevents .toChar truncation by throwing RegexUnsupportedException for supplementary codepoints during parsing/transpilation of hex/octal escapes.
integration_tests/src/main/python/regexp_test.py | Adds ITs that assert GPU fallback for supplementary-codepoint patterns and updates an existing test to avoid mixed GPU/CPU projection behavior.

@greptile-apps
Contributor

greptile-apps Bot commented May 25, 2026

Greptile Summary

Fixes a silent data-corruption bug where supplementary Unicode codepoints (cp > U+FFFF) encoded as \x{...} or \0NNN regex escapes were truncated to their low-16-bit char value via .toChar, causing the GPU to match the wrong character (e.g., U+F600 instead of U+1F600). The fix throws RegexUnsupportedException at four sites — two in the character-class parser path and two in the transpiler's RegexOctalChar/RegexHexDigit arms — so spark-rapids falls back to the CPU engine for any pattern containing supplementary codepoints.

  • Core fix (RegexParser.scala): four guard clauses checking codePoint > 0xFFFF replace the silent .toChar truncation with a well-typed CPU fallback; BMP codepoints (cp ≤ U+FFFF) are completely unaffected.
  • Scala unit tests (RegularExpressionTranspilerSuite.scala): six supplementary-codepoint patterns now assert RegexUnsupportedException, and a BMP-boundary regression guard verifies \x{FFFF} still transpiles on GPU.
  • Python integration tests (regexp_test.py): two new assert_gpu_fallback_collect tests cover rlike and regexp_replace; the existing hex-digit GPU projection test is updated to use \x{0000ffff} (U+FFFF, still within BMP) instead of the now-unsupported \x{10ffff}.

Confidence Score: 4/5

Safe to merge. The core change is a targeted exception-throw at four well-identified sites that previously produced wrong-result matches; BMP patterns are entirely unaffected.

The parser and transpiler changes are minimal, surgical, and covered by both Scala unit tests and Python integration tests. The one finding is a misleading code comment in the Python test (the data generator likely does not produce actual supplementary Unicode characters), but this does not affect the validity of the fallback assertion or the correctness of the fix itself.

The Python comment in regexp_test.py around line 659 overstates what the data generator produces; worth a quick clarification but not a blocker.

Important Files Changed

Filename | Overview
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala | Adds four guard clauses that throw RegexUnsupportedException for supplementary codepoints (cp > U+FFFF) in hex and octal escapes at both the parser (character-class path) and transpiler (RegexHexDigit / RegexOctalChar arms), correctly eliminating the silent .toChar truncation.
tests/src/test/scala/com/nvidia/spark/rapids/RegularExpressionTranspilerSuite.scala | Adds Scala unit coverage for six supplementary-codepoint hex patterns (including embedded, in character class, and in range) asserting RegexUnsupportedException, plus a BMP-boundary regression guard for U+FFFF; correctly removes the pre-fix \x{10FFFF} GPU pattern test.
integration_tests/src/main/python/regexp_test.py | Adds Python integration tests verifying GPU fallback for supplementary codepoints in rlike and regexp_replace; replaces \x{10ffff} with \x{0000ffff} in the GPU-only projection. One comment overstates that the data generator produces actual supplementary Unicode characters.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Spark SQL regex pattern\n(e.g. \\x{1F600})"] --> B{Contains\nhex/octal escape?}
    B -- No --> C[Normal GPU transpilation path]
    B -- Yes --> D{codePoint\n> U+FFFF?}
    D -- No\n(BMP codepoint\n≤ U+FFFF) --> E["Transpile normally\n(RegexChar or RegexHexDigit)"]
    E --> F[GPU execution via cuDF JNI]
    D -- Yes\n(Supplementary codepoint) --> G["Throw RegexUnsupportedException\nat parser OR transpiler\n(4 guard sites)"]
    G --> H[spark-rapids catches exception]
    H --> I[CPU fallback via Java Pattern]
    I --> J["Correct result\n(GPU == CPU)"]

Reviews (1): Last reviewed commit: "[BUG] Fix regex transpiler truncating su..." | Re-trigger Greptile

Comment on lines +659 to +662
# The data gen below seeds inputs that contain the actual
# supplementary codepoints we test for, so CPU matches are real
# rather than always-False.
gen = mk_str_gen('[abcd]\\\\x{1F600}\\\\x{10000}\\\\x{10FFFF}[abcd]')
Contributor

P2 Misleading "real matches" comment for data generator

StringGen uses sre_yield, which delegates to Python's sre_parse. Python's re module does not support the \x{HHHH} hex-escape syntax — it only accepts \xHH (exactly 2 hex digits). As a result, mk_str_gen('[abcd]\\\\x{1F600}\\\\x{10000}\\\\x{10FFFF}[abcd]') produces strings containing the literal text \x{1F600} (backslash + x{1F600}), not the actual Unicode codepoints U+1F600/U+10000/U+10FFFF. The comment "CPU matches are real rather than always-False" is therefore incorrect — rlike(a, "\\x{1F600}") on Java's regex engine will not match literal backslash-x text, so the CPU column will be all-false. The test still validates the GPU-fallback behaviour correctly (that is its primary purpose), but the comment overstates what the data generator produces.

Collaborator Author

Addressed in 835458394: deleted the inaccurate "data gen below seeds inputs that contain the actual supplementary codepoints" paragraph. You are correct that mk_str_gen uses Python re for the seed regex, which does not understand \x{HHHH} (only \xHH with exactly 2 hex digits), so the generated inputs contain literal backslash-x text, not the codepoints themselves. The test still does its job (it verifies that RegexUnsupportedException triggers GPU→CPU fallback for these patterns and that CPU returns the result Java's Pattern produces), but the justification was wrong. I removed the misleading paragraph rather than rewriting it, since the remaining comment block already explains the regression and the fallback expectation.

@nvauto
Collaborator

nvauto commented May 25, 2026

NOTE: release/26.06 has been created from main. Please retarget your PR to release/26.06 if it should be included in the release.

The deleted comment claimed `mk_str_gen` seeds inputs containing actual
supplementary codepoints. Python's `re` module (which `mk_str_gen` uses
for the seed regex) does not support the `\x{HHHH}` syntax — only `\xHH`
with exactly 2 hex digits — so the gen produces strings with literal
backslash-x text rather than real codepoints. The test still validates
the correct behavior (GPU falls back to CPU via RegexUnsupportedException
and CPU returns the same RLike result the Java engine produces), but the
"CPU matches are real" justification was factually wrong.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Transpiler truncates supplementary codepoints (\\x{1F600} becomes U+F600); silent wrong matches for non-BMP characters

3 participants