[BUG] Fix regex CharacterRange not recursing into endpoints (#14745)#14874
[BUG] Fix regex CharacterRange not recursing into endpoints (#14745)#14874wjxiz1992 wants to merge 3 commits into
Conversation
…...} and \0NNN escapes (NVIDIA#14744) Hex escapes (\x{NNNN}) and octal escapes (\0NNN) for supplementary codepoints (cp > U+FFFF) were silently truncated to their low 16 bits via `.toChar` in the RegexParser / CudfRegexTranspiler. For example, the pattern `\x{1F600}` (grinning face 😀) was rewritten to `RegexChar(0xF600.toChar)` — a BMP private-use codepoint — so the GPU matched the wrong character. cuDF's regex JNI cannot represent supplementary codepoints natively (see the deviation note below), so this PR makes the parser and transpiler throw `RegexUnsupportedException` for any hex/octal escape whose codepoint exceeds U+FFFF. spark-rapids then falls back to the CPU regex engine, which Java's `Pattern` handles correctly. This guarantees GPU == CPU parity (the supreme rule from CLAUDE.md) at the cost of one CPU fallback per affected pattern. Patterns containing only BMP codepoints (cp <= U+FFFF) are unaffected. Four throw sites in `RegexParser.scala`: 1. `parseHex` codepoint > 0xFFFF 2. `parseOctal` codepoint > 0xFFFF 3. `CudfRegexTranspiler.rewrite` — `RegexOctalChar` arm 4. `CudfRegexTranspiler.rewrite` — `RegexHexDigit` arm Deviation from suggested fix: the issue proposed encoding supplementary codepoints as multi-byte UTF-8 sequences at the AST level. That does not work end-to-end because cuDF's regex JNI consumes Unicode codepoints (not raw UTF-8 bytes), so a synthesized byte sequence still fails to match the actual supplementary codepoint in the data column. The truncation symptom would be replaced by a different wrong-match symptom (GPU matches nothing instead of matching U+F600 by accident). The CPU-fallback path used here is the same contract spark-rapids uses for every other unsupported regex feature. Tests: * Scala: `RegularExpressionTranspilerSuite` -> "issue-14744: supplementary codepoint hex/octal escapes fall back to CPU" asserts `RegexUnsupportedException` for six representative patterns (`\x{10000}`, `\x{1F600}`, `\x{10FFFF}`, embedded, in character class, in range) and adds a regression guard for the BMP boundary U+FFFF. * Python IT: `regexp_test.py` adds `test_rlike_supplementary_codepoint_fallback_issue_14744` (4 patterns parametrized) and `test_regexp_replace_supplementary_codepoint_fallback_issue_14744` using `assert_gpu_fallback_collect`. Updates `test_regexp_hexadecimal_digits` to use `\x{0000ffff}` so its projection still runs fully on GPU. Local validation: * mvn package -pl tests -am -Dbuildver=330 -DwildcardSuites=com.nvidia.spark.rapids.RegularExpressionTranspilerSuite -> Tests: succeeded 95, failed 0, canceled 6, ignored 0, pending 0 * run_pyspark_from_build.sh -k 'supplementary_codepoint_fallback_issue_14744 or test_regexp_hexadecimal_digits' -> 6 passed, 39867 deselected, 8 warnings in 13.34s * spark-shell end-to-end repro on the patched dist JAR: GPU == CPU == [true, false, true] for inputs ["😀", "a", "hello 😀 world"] matched against the pattern `\x{1F600}`. Closes NVIDIA#14744 Contributes to NVIDIA#14733 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Allen Xu <allxu@nvidia.com>
…VIDIA#14745) `CudfRegexTranspiler.rewrite` had a no-op arm for `RegexCharacterRange`: case RegexCharacterRange(_, _) => regex so range endpoints never received the same hex/octal/meta-character normalisation that standalone literals do. The arm now recurses on both endpoints, mirroring the `^$.` bypass that `RegexCharacterClass` already applies (those three characters are literal inside `[]`, never anchors or dot, so they must NOT be re-routed through the top-level RegexChar arm — omitting that bypass rewrites `RegexChar('$')` to `\Z` and produces `[\Z-\Z]`, which breaks cuDF compilation and shows up in the existing `string split fuzz` tests). Interaction with NVIDIA#14744: the supplementary-codepoint throw from NVIDIA#14744 fires on range endpoints via the recursive call, so ranges whose endpoints exceed U+FFFF (e.g. `[\x{1F600}-\x{1F64F}]`) now propagate the throw correctly and spark-rapids falls back to CPU. This restores the GPU == CPU parity supreme rule from CLAUDE.md. Stacking note: this branch is stacked on top of PR NVIDIA#14869 (the NVIDIA#14744 fix). Until NVIDIA#14869 merges, this PR's diff includes both commits. Reviewers should treat the CharacterRange recursion as the only change unique to this PR. Tests: * Scala: `RegularExpressionTranspilerSuite` -> "issue-14745: CudfRegexTranspiler.rewrite recurses into RegexCharacterRange" asserts (a) supplementary-endpoint ranges still surface as `RegexUnsupportedException` and (b) BMP ranges with non-trivial endpoint shapes (`[\x41-\x5a]`, `[\x{41}-\x{5a}]`, `[\x00-\x7f]`, `[\x{0000}-\x{007F}]`, `[a-\xff]`, `[\x20-z]`) still hit GPU == CPU parity through `assertCpuGpuMatchesRegexpFind`. * Python IT: extends `test_rlike_supplementary_codepoint_fallback_issue_14744`'s parametrize with `[\x{1F600}-\x{1F64F}]` and `[\x{10000}-\x{10FFFF}]` (literal-in-decorator per the `extend-existing-it-over-new` rule), so the same fallback contract that `regexp_test.py` covers for standalone supplementary escapes also covers ranges. Local validation (worktree-local .mvn-repo, gpu.allocFraction=0.3): * mvn package -DskipTests -pl sql-plugin -am -Dbuildver=330 -> BUILD SUCCESS (compile gate) * mvn package -pl tests -am -Dbuildver=330 -DwildcardSuites=com.nvidia.spark.rapids.RegularExpressionTranspilerSuite -> Tests: succeeded 96, failed 0, canceled 6, ignored 0, pending 0 * run_pyspark_from_build.sh -k 'supplementary_codepoint_fallback_issue_14744' -> 7 passed, 39868 deselected * spark-submit end-to-end repro on the patched dist JAR for `[\x{1F600}-\x{1F64F}]`: - CPU: ["U+1F600" -> True, "a" -> False, "U+1F602" -> True] - GPU: ["U+1F600" -> True, "a" -> False, "U+1F602" -> True] -> no divergence (GPU falls back to CPU via the NVIDIA#14744 throw, which this fix routes correctly through the recursive endpoint rewrite) Closes NVIDIA#14745 Contributes to NVIDIA#14733 Depends on NVIDIA#14869 ### Review notes Informational findings recorded during pre-commit self-review (no must-fix items): * Comment "RegexCharacterClass (a few lines below)" is geographically correct but brittle to future reordering — kept for cross-referencing the `^$.` bypass. * `bmpRangeInputs` contains two empty strings; harmless duplicate, no behavioural impact on `assertCpuGpuMatchesRegexpFind`. * Inline closure `rewriteEndpoint` could be replaced by two literal rewrite invocations; current form is preferred for readability. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Allen Xu <allxu@nvidia.com>
|
build |
There was a problem hiding this comment.
Pull request overview
Fixes a latent correctness bug in the cuDF regex transpiler where RegexCharacterRange endpoints were not rewritten, so escape/meta-character normalization (and supplementary-codepoint fallback behavior) could be skipped for range endpoints.
Changes:
- Update
CudfRegexTranspiler.rewriteto recurse intoRegexCharacterRangestart/end endpoints, with the same^$.literal bypass behavior used forRegexCharacterClass. - Add defensive validation so rewritten range endpoints must remain
RegexCharacterClassComponents (otherwise trigger CPU fallback viaRegexUnsupportedException). - Extend Scala + Python test coverage for supplementary-codepoint fallback and for the range-endpoint recursion regression.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala |
Make RegexCharacterRange rewriting recurse into endpoints and propagate unsupported supplementary-codepoint behavior safely. |
tests/src/test/scala/com/nvidia/spark/rapids/RegularExpressionTranspilerSuite.scala |
Add unit tests covering supplementary-codepoint fallback and the fixed range-endpoint recursion behavior. |
integration_tests/src/main/python/regexp_test.py |
Add/adjust integration tests to assert CPU fallback for supplementary-codepoint hex escapes (including in ranges) and keep GPU/CPU parity. |
Greptile SummaryThis PR fixes two related bugs in the regex transpiler: it adds CPU fallback (via
Confidence Score: 4/5Safe to merge; the CharacterRange recursion and ^$. bypass are correct and consistent with the adjacent CharacterClass arm, and the test coverage spans BMP boundary values and GPU fallback paths. The core logic — recursing into range endpoints with the ^$. bypass, and throwing on supplementary codepoints — correctly mirrors the existing CharacterClass arm. The two octal The dead octal guards in Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["Regex Pattern Input"] --> B["RegexParser.parseCharacterClass / parseEscapedCharacter"]
B --> C{"Hex escape \\x{...}?"}
C -->|"codePoint > U+FFFF"| D["throw RegexUnsupportedException\n(#14744 fix)"]
C -->|"codePoint <= U+FFFF"| E["RegexHexDigit / RegexChar node"]
B --> F{"Octal escape \\0nnn?"}
F -->|"max value 0xFF"| G["RegexOctalChar / RegexChar node"]
E --> H["CudfRegexTranspiler.rewrite"]
G --> H
H --> I{"AST node type?"}
I -->|"RegexHexDigit codePoint > U+FFFF"| D
I -->|"RegexOctalChar codePoint > U+FFFF (dead)"| D
I -->|"RegexCharacterRange(start, end)"| J["rewriteEndpoint(start)\nrewriteEndpoint(end)\n(#14745 fix — was no-op)"]
J --> K{"endpoint is RegexChar('^','$','.')"}
K -->|"yes — literal inside [ ]"| L["return as-is (^$. bypass)"]
K -->|"no"| M["recursive rewrite(endpoint, ...)"]
M --> N{"result is RegexCharacterClassComponent?"}
N -->|"yes"| O["RegexCharacterRange(rewritten endpoints)"]
N -->|"no e.g. \\s → CharClass"| D
I -->|"RegexCharacterClass"| P["rewrite each member (same ^$. bypass)"]
D --> Q["CPU Fallback (Java Pattern engine)"]
O --> R["cuDF Regex Compilation"]
P --> R
|
| case Some('0') => | ||
| val octalChar = parseOctalDigit | ||
| octalChar.codePoint match { | ||
| case 0 => RegexHexDigit("00") | ||
| case codePoint if codePoint > 0xFFFF => | ||
| throw new RegexUnsupportedException( | ||
| "cuDF does not support supplementary codepoints (cp > U+FFFF) " + | ||
| "in octal escapes; falling back to CPU", octalChar.position) | ||
| case codePoint => RegexChar(codePoint.toChar) | ||
| } |
There was a problem hiding this comment.
The
codePoint > 0xFFFF guard in the octal branch of the parser is unreachable. Java regex octal escapes top out at \0mnn with the constraint 0 <= m <= 3, giving a maximum encoded value of \0377 = 255 decimal (0xFF), which is far below U+FFFF. The dead guard will never fire and may mislead future readers into thinking octal can encode supplementary codepoints. The hex counterpart is correctly reachable and should stay; only the octal arm is dead code.
| case Some('0') => | |
| val octalChar = parseOctalDigit | |
| octalChar.codePoint match { | |
| case 0 => RegexHexDigit("00") | |
| case codePoint if codePoint > 0xFFFF => | |
| throw new RegexUnsupportedException( | |
| "cuDF does not support supplementary codepoints (cp > U+FFFF) " + | |
| "in octal escapes; falling back to CPU", octalChar.position) | |
| case codePoint => RegexChar(codePoint.toChar) | |
| } | |
| case Some('0') => | |
| val octalChar = parseOctalDigit | |
| octalChar.codePoint match { | |
| case 0 => RegexHexDigit("00") | |
| // Note: octal escapes max out at \0377 = 255 (U+00FF), so codePoint | |
| // can never exceed U+FFFF here; no supplementary-codepoint guard needed. | |
| case codePoint => RegexChar(codePoint.toChar) | |
| } |
There was a problem hiding this comment.
Addressed in 7c1505385. You are right — parseOctalDigit accepts at most \0mnn with 0 <= m <= 3 per the constraints at RegexParser.scala:572-573, so the maximum encoded codepoint is 0o377 = 255 = U+00FF, well below U+FFFF. Removed the dead guard and replaced it with a one-line invariant note so the asymmetry with the hex arm (which is correctly reachable) does not mislead future readers.
Verified locally: mvn package -pl tests -am -Dbuildver=330 ... -DwildcardSuites=com.nvidia.spark.rapids.RegularExpressionTranspilerSuite → 96 succeeded, 0 failed, 6 canceled (unrelated #7585 unicode-line-separator cancellations).
…NVIDIA#14745) `parseOctalDigit` accepts at most `\0mnn` with `0 <= m <= 3, 0 <= n <= 7` (see `parseOctalDigits(4)` constraints in RegexParser.scala:572-573), so the maximum encoded `codePoint` is 0o377 = 255 decimal = U+00FF. The `codePoint > 0xFFFF` guard at the same site as the hex branch's guard is therefore unreachable; replaced with a one-line invariant comment so the asymmetry with the hex arm doesn't mislead readers. Verified: `RegularExpressionTranspilerSuite` 96 succeeded, 0 failed, 6 canceled (unrelated NVIDIA#7585 unicode-line-separator cancellations). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Allen Xu <allxu@nvidia.com>
Summary
Closes #14745
Contributes to #14733
Depends on #14869
CudfRegexTranspiler.rewritehad a no-op arm forRegexCharacterRange(case RegexCharacterRange(_, _) => regex), so range endpoints never received the same hex/octal/meta-character normalisation that standalone literals do. With the supplementary-codepoint fix from #14744 applied, the issue's specific example ([\x{1F600}-\x{1F64F}]) was already CPU-fallback-rescued via the parser's early throw — but the transpiler arm is still latently buggy (it would surface for any future path that constructs aRegexCharacterRangewith a non-RegexCharendpoint requiring normalisation). This PR makes the arm recurse into both endpoints, eliminating the no-op and re-using the existing endpoint rewrite logic.Mirrors the
^$.bypass thatRegexCharacterClassalready applies (those three characters are literal inside[], never anchors or dot, so they must NOT be re-routed through the top-levelRegexChararm; omitting that bypass produces[\Z-\Z]and breaks cuDF compilation in the existingstring split fuzztests).Stacking note
This branch is stacked on PR #14869 (the #14744 supplementary-codepoint fix). Until #14869 merges, this PR's diff includes both commits. Reviewers should treat the CharacterRange recursion (commit
dbaf6f9a8) as the only change unique to this PR; commit388f79fb4is the #14869 commit and will rebase out automatically once #14869 lands.Deviation from issue's suggested fix
The issue suggested:
Applied as suggested with two additional safety checks:
^$.bypass on endpoint rewrites (matches the existingRegexCharacterClassarm).RegexAST → RegexCharacterClassComponentcast with an explicitRegexUnsupportedExceptionif rewrite produces a non-component (e.g. aRegexCharacterClassfrom\s). This propagates to CPU fallback rather than silently corrupting the AST.These guards reuse the same defensive pattern the immediately-following
RegexCharacterClassarm uses, so the two arms behave consistently.Local validation
mvn package -DskipTests -pl sql-plugin -am -Dbuildver=330-> BUILD SUCCESSmvn package -pl tests -am -Dbuildver=330 -DwildcardSuites=com.nvidia.spark.rapids.RegularExpressionTranspilerSuite -Drapids.test.gpu.allocFraction=0.3 -Drapids.test.gpu.maxAllocFraction=0.3 -Drapids.test.gpu.minAllocFraction=0->Tests: succeeded 96, failed 0, canceled 6, ignored 0, pending 0run_pyspark_from_build.sh -k 'supplementary_codepoint_fallback_issue_14744'-> 7 passed, 39868 deselected[\x{1F600}-\x{1F64F}]:["U+1F600" -> True, "a" -> False, "U+1F602" -> True]["U+1F600" -> True, "a" -> False, "U+1F602" -> True]\\x{1F600}becomes U+F600); silent wrong matches for non-BMP characters #14744 throw, which this fix routes correctly through the recursive endpoint rewrite.Documentation
Testing
Performance