[BUG] Fix regex CharacterRange not recursing into endpoints (#14745) by wjxiz1992 · Pull Request #14874 · NVIDIA/spark-rapids

wjxiz1992 · 2026-05-25T06:10:10Z

Summary

Closes #14745
Contributes to #14733
Depends on #14869

CudfRegexTranspiler.rewrite had a no-op arm for RegexCharacterRange (case RegexCharacterRange(_, _) => regex), so range endpoints never received the same hex/octal/meta-character normalisation that standalone literals do. With the supplementary-codepoint fix from #14744 applied, the issue's specific example ([\x{1F600}-\x{1F64F}]) was already CPU-fallback-rescued via the parser's early throw — but the transpiler arm is still latently buggy (it would surface for any future path that constructs a RegexCharacterRange with a non-RegexChar endpoint requiring normalisation). This PR makes the arm recurse into both endpoints, eliminating the no-op and re-using the existing endpoint rewrite logic.

Mirrors the ^$. bypass that RegexCharacterClass already applies (those three characters are literal inside [], never anchors or dot, so they must NOT be re-routed through the top-level RegexChar arm; omitting that bypass produces [\Z-\Z] and breaks cuDF compilation in the existing string split fuzz tests).

Stacking note

This branch is stacked on PR #14869 (the #14744 supplementary-codepoint fix). Until #14869 merges, this PR's diff includes both commits. Reviewers should treat the CharacterRange recursion (commit dbaf6f9a8) as the only change unique to this PR; commit 388f79fb4 is the #14869 commit and will rebase out automatically once #14869 lands.

Deviation from issue's suggested fix

The issue suggested:

case RegexCharacterRange(start, end) =>
  RegexCharacterRange(
    rewrite(start, replacement, None, flags),
    rewrite(end,   replacement, None, flags))

Applied as suggested with two additional safety checks:

^$. bypass on endpoint rewrites (matches the existing RegexCharacterClass arm).
RegexAST → RegexCharacterClassComponent cast with an explicit RegexUnsupportedException if rewrite produces a non-component (e.g. a RegexCharacterClass from \s). This propagates to CPU fallback rather than silently corrupting the AST.

These guards reuse the same defensive pattern the immediately-following RegexCharacterClass arm uses, so the two arms behave consistently.

Local validation

mvn package -DskipTests -pl sql-plugin -am -Dbuildver=330 -> BUILD SUCCESS
mvn package -pl tests -am -Dbuildver=330 -DwildcardSuites=com.nvidia.spark.rapids.RegularExpressionTranspilerSuite -Drapids.test.gpu.allocFraction=0.3 -Drapids.test.gpu.maxAllocFraction=0.3 -Drapids.test.gpu.minAllocFraction=0 -> Tests: succeeded 96, failed 0, canceled 6, ignored 0, pending 0
run_pyspark_from_build.sh -k 'supplementary_codepoint_fallback_issue_14744' -> 7 passed, 39868 deselected
spark-submit end-to-end repro on the patched dist JAR for [\x{1F600}-\x{1F64F}]:
- CPU: ["U+1F600" -> True, "a" -> False, "U+1F602" -> True]
- GPU: ["U+1F600" -> True, "a" -> False, "U+1F602" -> True]
- No divergence — GPU falls back to CPU via the [BUG] Transpiler truncates supplementary codepoints (\\x{1F600} becomes U+F600); silent wrong matches for non-BMP characters #14744 throw, which this fix routes correctly through the recursive endpoint rewrite.

Documentation

Updated for new or modified user-facing features or behaviors
No user-facing change

Testing

Added or modified tests to cover new code paths
Covered by existing tests
Not required

Performance

Tests ran and results are added in the PR description
Issue filed with a link in the PR description
Not required

…...} and \0NNN escapes (NVIDIA#14744) Hex escapes (\x{NNNN}) and octal escapes (\0NNN) for supplementary codepoints (cp > U+FFFF) were silently truncated to their low 16 bits via `.toChar` in the RegexParser / CudfRegexTranspiler. For example, the pattern `\x{1F600}` (grinning face 😀) was rewritten to `RegexChar(0xF600.toChar)` — a BMP private-use codepoint — so the GPU matched the wrong character. cuDF's regex JNI cannot represent supplementary codepoints natively (see the deviation note below), so this PR makes the parser and transpiler throw `RegexUnsupportedException` for any hex/octal escape whose codepoint exceeds U+FFFF. spark-rapids then falls back to the CPU regex engine, which Java's `Pattern` handles correctly. This guarantees GPU == CPU parity (the supreme rule from CLAUDE.md) at the cost of one CPU fallback per affected pattern. Patterns containing only BMP codepoints (cp <= U+FFFF) are unaffected. Four throw sites in `RegexParser.scala`: 1. `parseHex` codepoint > 0xFFFF 2. `parseOctal` codepoint > 0xFFFF 3. `CudfRegexTranspiler.rewrite` — `RegexOctalChar` arm 4. `CudfRegexTranspiler.rewrite` — `RegexHexDigit` arm Deviation from suggested fix: the issue proposed encoding supplementary codepoints as multi-byte UTF-8 sequences at the AST level. That does not work end-to-end because cuDF's regex JNI consumes Unicode codepoints (not raw UTF-8 bytes), so a synthesized byte sequence still fails to match the actual supplementary codepoint in the data column. The truncation symptom would be replaced by a different wrong-match symptom (GPU matches nothing instead of matching U+F600 by accident). The CPU-fallback path used here is the same contract spark-rapids uses for every other unsupported regex feature. Tests: * Scala: `RegularExpressionTranspilerSuite` -> "issue-14744: supplementary codepoint hex/octal escapes fall back to CPU" asserts `RegexUnsupportedException` for six representative patterns (`\x{10000}`, `\x{1F600}`, `\x{10FFFF}`, embedded, in character class, in range) and adds a regression guard for the BMP boundary U+FFFF. * Python IT: `regexp_test.py` adds `test_rlike_supplementary_codepoint_fallback_issue_14744` (4 patterns parametrized) and `test_regexp_replace_supplementary_codepoint_fallback_issue_14744` using `assert_gpu_fallback_collect`. Updates `test_regexp_hexadecimal_digits` to use `\x{0000ffff}` so its projection still runs fully on GPU. Local validation: * mvn package -pl tests -am -Dbuildver=330 -DwildcardSuites=com.nvidia.spark.rapids.RegularExpressionTranspilerSuite -> Tests: succeeded 95, failed 0, canceled 6, ignored 0, pending 0 * run_pyspark_from_build.sh -k 'supplementary_codepoint_fallback_issue_14744 or test_regexp_hexadecimal_digits' -> 6 passed, 39867 deselected, 8 warnings in 13.34s * spark-shell end-to-end repro on the patched dist JAR: GPU == CPU == [true, false, true] for inputs ["😀", "a", "hello 😀 world"] matched against the pattern `\x{1F600}`. Closes NVIDIA#14744 Contributes to NVIDIA#14733 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Allen Xu <allxu@nvidia.com>

…VIDIA#14745) `CudfRegexTranspiler.rewrite` had a no-op arm for `RegexCharacterRange`: case RegexCharacterRange(_, _) => regex so range endpoints never received the same hex/octal/meta-character normalisation that standalone literals do. The arm now recurses on both endpoints, mirroring the `^$.` bypass that `RegexCharacterClass` already applies (those three characters are literal inside `[]`, never anchors or dot, so they must NOT be re-routed through the top-level RegexChar arm — omitting that bypass rewrites `RegexChar('$')` to `\Z` and produces `[\Z-\Z]`, which breaks cuDF compilation and shows up in the existing `string split fuzz` tests). Interaction with NVIDIA#14744: the supplementary-codepoint throw from NVIDIA#14744 fires on range endpoints via the recursive call, so ranges whose endpoints exceed U+FFFF (e.g. `[\x{1F600}-\x{1F64F}]`) now propagate the throw correctly and spark-rapids falls back to CPU. This restores the GPU == CPU parity supreme rule from CLAUDE.md. Stacking note: this branch is stacked on top of PR NVIDIA#14869 (the NVIDIA#14744 fix). Until NVIDIA#14869 merges, this PR's diff includes both commits. Reviewers should treat the CharacterRange recursion as the only change unique to this PR. Tests: * Scala: `RegularExpressionTranspilerSuite` -> "issue-14745: CudfRegexTranspiler.rewrite recurses into RegexCharacterRange" asserts (a) supplementary-endpoint ranges still surface as `RegexUnsupportedException` and (b) BMP ranges with non-trivial endpoint shapes (`[\x41-\x5a]`, `[\x{41}-\x{5a}]`, `[\x00-\x7f]`, `[\x{0000}-\x{007F}]`, `[a-\xff]`, `[\x20-z]`) still hit GPU == CPU parity through `assertCpuGpuMatchesRegexpFind`. * Python IT: extends `test_rlike_supplementary_codepoint_fallback_issue_14744`'s parametrize with `[\x{1F600}-\x{1F64F}]` and `[\x{10000}-\x{10FFFF}]` (literal-in-decorator per the `extend-existing-it-over-new` rule), so the same fallback contract that `regexp_test.py` covers for standalone supplementary escapes also covers ranges. Local validation (worktree-local .mvn-repo, gpu.allocFraction=0.3): * mvn package -DskipTests -pl sql-plugin -am -Dbuildver=330 -> BUILD SUCCESS (compile gate) * mvn package -pl tests -am -Dbuildver=330 -DwildcardSuites=com.nvidia.spark.rapids.RegularExpressionTranspilerSuite -> Tests: succeeded 96, failed 0, canceled 6, ignored 0, pending 0 * run_pyspark_from_build.sh -k 'supplementary_codepoint_fallback_issue_14744' -> 7 passed, 39868 deselected * spark-submit end-to-end repro on the patched dist JAR for `[\x{1F600}-\x{1F64F}]`: - CPU: ["U+1F600" -> True, "a" -> False, "U+1F602" -> True] - GPU: ["U+1F600" -> True, "a" -> False, "U+1F602" -> True] -> no divergence (GPU falls back to CPU via the NVIDIA#14744 throw, which this fix routes correctly through the recursive endpoint rewrite) Closes NVIDIA#14745 Contributes to NVIDIA#14733 Depends on NVIDIA#14869 ### Review notes Informational findings recorded during pre-commit self-review (no must-fix items): * Comment "RegexCharacterClass (a few lines below)" is geographically correct but brittle to future reordering — kept for cross-referencing the `^$.` bypass. * `bmpRangeInputs` contains two empty strings; harmless duplicate, no behavioural impact on `assertCpuGpuMatchesRegexpFind`. * Inline closure `rewriteEndpoint` could be replaced by two literal rewrite invocations; current form is preferred for readability. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Allen Xu <allxu@nvidia.com>

wjxiz1992 · 2026-05-25T06:10:57Z

build

Copilot

Pull request overview

Fixes a latent correctness bug in the cuDF regex transpiler where RegexCharacterRange endpoints were not rewritten, so escape/meta-character normalization (and supplementary-codepoint fallback behavior) could be skipped for range endpoints.

Changes:

Update CudfRegexTranspiler.rewrite to recurse into RegexCharacterRange start/end endpoints, with the same ^$. literal bypass behavior used for RegexCharacterClass.
Add defensive validation so rewritten range endpoints must remain RegexCharacterClassComponents (otherwise trigger CPU fallback via RegexUnsupportedException).
Extend Scala + Python test coverage for supplementary-codepoint fallback and for the range-endpoint recursion regression.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File	Description
`sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala`	Make `RegexCharacterRange` rewriting recurse into endpoints and propagate unsupported supplementary-codepoint behavior safely.
`tests/src/test/scala/com/nvidia/spark/rapids/RegularExpressionTranspilerSuite.scala`	Add unit tests covering supplementary-codepoint fallback and the fixed range-endpoint recursion behavior.
`integration_tests/src/main/python/regexp_test.py`	Add/adjust integration tests to assert CPU fallback for supplementary-codepoint hex escapes (including in ranges) and keep GPU/CPU parity.

greptile-apps · 2026-05-25T06:14:37Z

Greptile Summary

This PR fixes two related bugs in the regex transpiler: it adds CPU fallback (via RegexUnsupportedException) for supplementary codepoints (cp > U+FFFF) in hex and octal escapes (#14744), and it eliminates the no-op RegexCharacterRange arm in CudfRegexTranspiler.rewrite so that range endpoints now receive the same hex/octal/meta-character normalisation as standalone literals (#14745).

RegexParser.scala: Parser and transpiler now throw RegexUnsupportedException for hex escapes whose codepoint exceeds U+FFFF, routing those patterns to CPU fallback rather than silently truncating via toChar. The RegexCharacterRange transpiler arm is changed from a no-op to a recursive rewrite of both endpoints, with the same ^$. bypass used by the adjacent RegexCharacterClass arm.
RegularExpressionTranspilerSuite.scala: Two new unit tests cover the supplementary-codepoint throw path and the BMP-range round-trip parity, including boundary values like U+FFFF.
regexp_test.py: Supplementary-codepoint patterns are moved from the GPU-only assertion into new assert_gpu_fallback_collect tests that verify CPU fallback correctness for both rlike and regexp_replace.

Confidence Score: 4/5

Safe to merge; the CharacterRange recursion and ^$. bypass are correct and consistent with the adjacent CharacterClass arm, and the test coverage spans BMP boundary values and GPU fallback paths.

The core logic — recursing into range endpoints with the ^$. bypass, and throwing on supplementary codepoints — correctly mirrors the existing CharacterClass arm. The two octal > 0xFFFF guards (in the parser and transpiler) are unreachable dead code since octal can encode at most U+00FF, but they are harmless. Everything else, including decorator ordering and assertion helpers in the integration tests, looks correct.

The dead octal guards in RegexParser.scala (both the parseCharacterClass branch and the transpiler's RegexOctalChar arm) are the only items worth revisiting; the test files are straightforward.

Important Files Changed

Filename	Overview
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala	Adds supplementary-codepoint (cp > U+FFFF) throw in parser and transpiler for hex/octal arms, and fixes the no-op RegexCharacterRange arm to recurse into both endpoints with correct ^$. bypass; the octal > 0xFFFF guard in both the parser and transpiler is unreachable dead code (max octal value is 255)
tests/src/test/scala/com/nvidia/spark/rapids/RegularExpressionTranspilerSuite.scala	Adds two new unit tests covering supplementary codepoint CPU fallback (issue #14744) and CharacterRange recursion (issue #14745), including BMP boundary regression guards
integration_tests/src/main/python/regexp_test.py	Removes supplementary codepoint patterns from GPU-only test, adds parametrized fallback tests for rlike/regexp_replace using assert_gpu_fallback_collect with correct @allow_non_gpu decorators

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Regex Pattern Input"] --> B["RegexParser.parseCharacterClass / parseEscapedCharacter"]
    B --> C{"Hex escape \\x{...}?"}
    C -->|"codePoint > U+FFFF"| D["throw RegexUnsupportedException\n(#14744 fix)"]
    C -->|"codePoint <= U+FFFF"| E["RegexHexDigit / RegexChar node"]
    B --> F{"Octal escape \\0nnn?"}
    F -->|"max value 0xFF"| G["RegexOctalChar / RegexChar node"]
    E --> H["CudfRegexTranspiler.rewrite"]
    G --> H
    H --> I{"AST node type?"}
    I -->|"RegexHexDigit codePoint > U+FFFF"| D
    I -->|"RegexOctalChar codePoint > U+FFFF (dead)"| D
    I -->|"RegexCharacterRange(start, end)"| J["rewriteEndpoint(start)\nrewriteEndpoint(end)\n(#14745 fix — was no-op)"]
    J --> K{"endpoint is RegexChar('^','$','.')"}
    K -->|"yes — literal inside [ ]"| L["return as-is (^$. bypass)"]
    K -->|"no"| M["recursive rewrite(endpoint, ...)"]
    M --> N{"result is RegexCharacterClassComponent?"}
    N -->|"yes"| O["RegexCharacterRange(rewritten endpoints)"]
    N -->|"no e.g. \\s → CharClass"| D
    I -->|"RegexCharacterClass"| P["rewrite each member (same ^$. bypass)"]
    D --> Q["CPU Fallback (Java Pattern engine)"]
    O --> R["cuDF Regex Compilation"]
    P --> R

Comments Outside Diff (1)

sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala, line 1135-1152 (link)

The same dead-code guard appears in the transpiler's RegexOctalChar arm. Since any RegexOctalChar node was produced from a 4-digit-max octal pattern (max value 255), the > 0xFFFF branch can never execute. Removing it keeps the octal and hex arms visually distinct and avoids misleading readers about octal's range.

_{Reviews (1): Last reviewed commit: "[BUG] Recurse CudfRegexTranspiler into R..." | Re-trigger Greptile}

greptile-apps · 2026-05-25T06:14:41Z

        case Some('0') =>
          val octalChar = parseOctalDigit
          octalChar.codePoint match {
            case 0 => RegexHexDigit("00")
+            case codePoint if codePoint > 0xFFFF =>
+              throw new RegexUnsupportedException(
+                "cuDF does not support supplementary codepoints (cp > U+FFFF) " +
+                "in octal escapes; falling back to CPU", octalChar.position)
            case codePoint => RegexChar(codePoint.toChar)
          }


The codePoint > 0xFFFF guard in the octal branch of the parser is unreachable. Java regex octal escapes top out at \0mnn with the constraint 0 <= m <= 3, giving a maximum encoded value of \0377 = 255 decimal (0xFF), which is far below U+FFFF. The dead guard will never fire and may mislead future readers into thinking octal can encode supplementary codepoints. The hex counterpart is correctly reachable and should stay; only the octal arm is dead code.

Suggested change

case Some('0') =>

val octalChar = parseOctalDigit

octalChar.codePoint match {

case 0 => RegexHexDigit("00")

case codePoint if codePoint > 0xFFFF =>

throw new RegexUnsupportedException(

"cuDF does not support supplementary codepoints (cp > U+FFFF) " +

"in octal escapes; falling back to CPU", octalChar.position)

case codePoint => RegexChar(codePoint.toChar)

}

case Some('0') =>

val octalChar = parseOctalDigit

octalChar.codePoint match {

case 0 => RegexHexDigit("00")

// Note: octal escapes max out at \0377 = 255 (U+00FF), so codePoint

// can never exceed U+FFFF here; no supplementary-codepoint guard needed.

case codePoint => RegexChar(codePoint.toChar)

}

Addressed in 7c1505385. You are right — parseOctalDigit accepts at most \0mnn with 0 <= m <= 3 per the constraints at RegexParser.scala:572-573, so the maximum encoded codepoint is 0o377 = 255 = U+00FF, well below U+FFFF. Removed the dead guard and replaced it with a one-line invariant note so the asymmetry with the hex arm (which is correctly reachable) does not mislead future readers.

Verified locally: mvn package -pl tests -am -Dbuildver=330 ... -DwildcardSuites=com.nvidia.spark.rapids.RegularExpressionTranspilerSuite → 96 succeeded, 0 failed, 6 canceled (unrelated #7585 unicode-line-separator cancellations).

…NVIDIA#14745) `parseOctalDigit` accepts at most `\0mnn` with `0 <= m <= 3, 0 <= n <= 7` (see `parseOctalDigits(4)` constraints in RegexParser.scala:572-573), so the maximum encoded `codePoint` is 0o377 = 255 decimal = U+00FF. The `codePoint > 0xFFFF` guard at the same site as the hex branch's guard is therefore unreachable; replaced with a one-line invariant comment so the asymmetry with the hex arm doesn't mislead readers. Verified: `RegularExpressionTranspilerSuite` 96 succeeded, 0 failed, 6 canceled (unrelated NVIDIA#7585 unicode-line-separator cancellations). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Allen Xu <allxu@nvidia.com>

wjxiz1992 and others added 2 commits May 25, 2026 12:20

Copilot AI review requested due to automatic review settings May 25, 2026 06:10

Copilot started reviewing on behalf of wjxiz1992 May 25, 2026 06:10 View session

Copilot AI reviewed May 25, 2026

View reviewed changes

greptile-apps Bot reviewed May 25, 2026

View reviewed changes

wjxiz1992 marked this pull request as draft May 25, 2026 10:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Fix regex CharacterRange not recursing into endpoints (#14745)#14874

[BUG] Fix regex CharacterRange not recursing into endpoints (#14745)#14874
wjxiz1992 wants to merge 3 commits into
NVIDIA:mainfrom
wjxiz1992:fix/regex-14745

wjxiz1992 commented May 25, 2026

Uh oh!

wjxiz1992 commented May 25, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

greptile-apps Bot commented May 25, 2026 •

edited

Loading

Comments Outside Diff (1)

Uh oh!

greptile-apps Bot May 25, 2026

Uh oh!

wjxiz1992 May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

wjxiz1992 commented May 25, 2026

Summary

Stacking note

Deviation from issue's suggested fix

Local validation

Documentation

Testing

Performance

Uh oh!

wjxiz1992 commented May 25, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

greptile-apps Bot commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Comments Outside Diff (1)

Uh oh!

greptile-apps Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

wjxiz1992 May 28, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

greptile-apps Bot commented May 25, 2026 •

edited

Loading