Skip to content

[BUG] Fix regex CharacterRange not recursing into endpoints (#14745)#14874

Draft
wjxiz1992 wants to merge 3 commits into
NVIDIA:mainfrom
wjxiz1992:fix/regex-14745
Draft

[BUG] Fix regex CharacterRange not recursing into endpoints (#14745)#14874
wjxiz1992 wants to merge 3 commits into
NVIDIA:mainfrom
wjxiz1992:fix/regex-14745

Conversation

@wjxiz1992
Copy link
Copy Markdown
Collaborator

Summary

Closes #14745
Contributes to #14733
Depends on #14869

CudfRegexTranspiler.rewrite had a no-op arm for RegexCharacterRange (case RegexCharacterRange(_, _) => regex), so range endpoints never received the same hex/octal/meta-character normalisation that standalone literals do. With the supplementary-codepoint fix from #14744 applied, the issue's specific example ([\x{1F600}-\x{1F64F}]) was already CPU-fallback-rescued via the parser's early throw — but the transpiler arm is still latently buggy (it would surface for any future path that constructs a RegexCharacterRange with a non-RegexChar endpoint requiring normalisation). This PR makes the arm recurse into both endpoints, eliminating the no-op and re-using the existing endpoint rewrite logic.

Mirrors the ^$. bypass that RegexCharacterClass already applies (those three characters are literal inside [], never anchors or dot, so they must NOT be re-routed through the top-level RegexChar arm; omitting that bypass produces [\Z-\Z] and breaks cuDF compilation in the existing string split fuzz tests).

Stacking note

This branch is stacked on PR #14869 (the #14744 supplementary-codepoint fix). Until #14869 merges, this PR's diff includes both commits. Reviewers should treat the CharacterRange recursion (commit dbaf6f9a8) as the only change unique to this PR; commit 388f79fb4 is the #14869 commit and will rebase out automatically once #14869 lands.

Deviation from issue's suggested fix

The issue suggested:

case RegexCharacterRange(start, end) =>
  RegexCharacterRange(
    rewrite(start, replacement, None, flags),
    rewrite(end,   replacement, None, flags))

Applied as suggested with two additional safety checks:

  1. ^$. bypass on endpoint rewrites (matches the existing RegexCharacterClass arm).
  2. RegexAST → RegexCharacterClassComponent cast with an explicit RegexUnsupportedException if rewrite produces a non-component (e.g. a RegexCharacterClass from \s). This propagates to CPU fallback rather than silently corrupting the AST.

These guards reuse the same defensive pattern the immediately-following RegexCharacterClass arm uses, so the two arms behave consistently.

Local validation

  • mvn package -DskipTests -pl sql-plugin -am -Dbuildver=330 -> BUILD SUCCESS
  • mvn package -pl tests -am -Dbuildver=330 -DwildcardSuites=com.nvidia.spark.rapids.RegularExpressionTranspilerSuite -Drapids.test.gpu.allocFraction=0.3 -Drapids.test.gpu.maxAllocFraction=0.3 -Drapids.test.gpu.minAllocFraction=0 -> Tests: succeeded 96, failed 0, canceled 6, ignored 0, pending 0
  • run_pyspark_from_build.sh -k 'supplementary_codepoint_fallback_issue_14744' -> 7 passed, 39868 deselected
  • spark-submit end-to-end repro on the patched dist JAR for [\x{1F600}-\x{1F64F}]:

Documentation

  • Updated for new or modified user-facing features or behaviors
  • No user-facing change

Testing

  • Added or modified tests to cover new code paths
  • Covered by existing tests
  • Not required

Performance

  • Tests ran and results are added in the PR description
  • Issue filed with a link in the PR description
  • Not required

wjxiz1992 and others added 2 commits May 25, 2026 12:20
…...} and \0NNN escapes (NVIDIA#14744)

Hex escapes (\x{NNNN}) and octal escapes (\0NNN) for supplementary
codepoints (cp > U+FFFF) were silently truncated to their low 16 bits
via `.toChar` in the RegexParser / CudfRegexTranspiler. For example,
the pattern `\x{1F600}` (grinning face 😀) was rewritten to
`RegexChar(0xF600.toChar)` — a BMP private-use codepoint — so the GPU
matched the wrong character.

cuDF's regex JNI cannot represent supplementary codepoints natively
(see the deviation note below), so this PR makes the parser and
transpiler throw `RegexUnsupportedException` for any hex/octal escape
whose codepoint exceeds U+FFFF. spark-rapids then falls back to the
CPU regex engine, which Java's `Pattern` handles correctly. This
guarantees GPU == CPU parity (the supreme rule from CLAUDE.md) at the
cost of one CPU fallback per affected pattern. Patterns containing
only BMP codepoints (cp <= U+FFFF) are unaffected.

Four throw sites in `RegexParser.scala`:
1. `parseHex` codepoint > 0xFFFF
2. `parseOctal` codepoint > 0xFFFF
3. `CudfRegexTranspiler.rewrite` — `RegexOctalChar` arm
4. `CudfRegexTranspiler.rewrite` — `RegexHexDigit` arm

Deviation from suggested fix: the issue proposed encoding
supplementary codepoints as multi-byte UTF-8 sequences at the AST
level. That does not work end-to-end because cuDF's regex JNI consumes
Unicode codepoints (not raw UTF-8 bytes), so a synthesized byte
sequence still fails to match the actual supplementary codepoint in
the data column. The truncation symptom would be replaced by a
different wrong-match symptom (GPU matches nothing instead of matching
U+F600 by accident). The CPU-fallback path used here is the same
contract spark-rapids uses for every other unsupported regex feature.

Tests:
* Scala: `RegularExpressionTranspilerSuite` -> "issue-14744:
  supplementary codepoint hex/octal escapes fall back to CPU" asserts
  `RegexUnsupportedException` for six representative patterns
  (`\x{10000}`, `\x{1F600}`, `\x{10FFFF}`, embedded, in character
  class, in range) and adds a regression guard for the BMP boundary
  U+FFFF.
* Python IT: `regexp_test.py` adds
  `test_rlike_supplementary_codepoint_fallback_issue_14744` (4 patterns
  parametrized) and `test_regexp_replace_supplementary_codepoint_fallback_issue_14744`
  using `assert_gpu_fallback_collect`. Updates `test_regexp_hexadecimal_digits`
  to use `\x{0000ffff}` so its projection still runs fully on GPU.

Local validation:
* mvn package -pl tests -am -Dbuildver=330
  -DwildcardSuites=com.nvidia.spark.rapids.RegularExpressionTranspilerSuite
  -> Tests: succeeded 95, failed 0, canceled 6, ignored 0, pending 0
* run_pyspark_from_build.sh -k 'supplementary_codepoint_fallback_issue_14744
  or test_regexp_hexadecimal_digits'
  -> 6 passed, 39867 deselected, 8 warnings in 13.34s
* spark-shell end-to-end repro on the patched dist JAR:
  GPU == CPU == [true, false, true] for inputs ["😀", "a", "hello 😀 world"]
  matched against the pattern `\x{1F600}`.

Closes NVIDIA#14744
Contributes to NVIDIA#14733

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
…VIDIA#14745)

`CudfRegexTranspiler.rewrite` had a no-op arm for `RegexCharacterRange`:

    case RegexCharacterRange(_, _) => regex

so range endpoints never received the same hex/octal/meta-character
normalisation that standalone literals do. The arm now recurses on both
endpoints, mirroring the `^$.` bypass that `RegexCharacterClass` already
applies (those three characters are literal inside `[]`, never anchors or
dot, so they must NOT be re-routed through the top-level RegexChar arm —
omitting that bypass rewrites `RegexChar('$')` to `\Z` and produces
`[\Z-\Z]`, which breaks cuDF compilation and shows up in the existing
`string split fuzz` tests).

Interaction with NVIDIA#14744: the supplementary-codepoint throw from NVIDIA#14744
fires on range endpoints via the recursive call, so ranges whose
endpoints exceed U+FFFF (e.g. `[\x{1F600}-\x{1F64F}]`) now propagate the
throw correctly and spark-rapids falls back to CPU. This restores the
GPU == CPU parity supreme rule from CLAUDE.md.

Stacking note: this branch is stacked on top of PR NVIDIA#14869 (the NVIDIA#14744
fix). Until NVIDIA#14869 merges, this PR's diff includes both commits.
Reviewers should treat the CharacterRange recursion as the only change
unique to this PR.

Tests:
* Scala: `RegularExpressionTranspilerSuite` -> "issue-14745:
  CudfRegexTranspiler.rewrite recurses into RegexCharacterRange"
  asserts (a) supplementary-endpoint ranges still surface as
  `RegexUnsupportedException` and (b) BMP ranges with non-trivial
  endpoint shapes (`[\x41-\x5a]`, `[\x{41}-\x{5a}]`, `[\x00-\x7f]`,
  `[\x{0000}-\x{007F}]`, `[a-\xff]`, `[\x20-z]`) still hit GPU == CPU
  parity through `assertCpuGpuMatchesRegexpFind`.
* Python IT: extends `test_rlike_supplementary_codepoint_fallback_issue_14744`'s
  parametrize with `[\x{1F600}-\x{1F64F}]` and `[\x{10000}-\x{10FFFF}]`
  (literal-in-decorator per the `extend-existing-it-over-new` rule), so
  the same fallback contract that `regexp_test.py` covers for standalone
  supplementary escapes also covers ranges.

Local validation (worktree-local .mvn-repo, gpu.allocFraction=0.3):

* mvn package -DskipTests -pl sql-plugin -am -Dbuildver=330
  -> BUILD SUCCESS (compile gate)
* mvn package -pl tests -am -Dbuildver=330
  -DwildcardSuites=com.nvidia.spark.rapids.RegularExpressionTranspilerSuite
  -> Tests: succeeded 96, failed 0, canceled 6, ignored 0, pending 0
* run_pyspark_from_build.sh -k 'supplementary_codepoint_fallback_issue_14744'
  -> 7 passed, 39868 deselected
* spark-submit end-to-end repro on the patched dist JAR for
  `[\x{1F600}-\x{1F64F}]`:
  - CPU: ["U+1F600" -> True, "a" -> False, "U+1F602" -> True]
  - GPU: ["U+1F600" -> True, "a" -> False, "U+1F602" -> True]
  -> no divergence (GPU falls back to CPU via the NVIDIA#14744 throw, which
     this fix routes correctly through the recursive endpoint rewrite)

Closes NVIDIA#14745
Contributes to NVIDIA#14733
Depends on NVIDIA#14869

### Review notes

Informational findings recorded during pre-commit self-review (no
must-fix items):

* Comment "RegexCharacterClass (a few lines below)" is geographically
  correct but brittle to future reordering — kept for cross-referencing
  the `^$.` bypass.
* `bmpRangeInputs` contains two empty strings; harmless duplicate, no
  behavioural impact on `assertCpuGpuMatchesRegexpFind`.
* Inline closure `rewriteEndpoint` could be replaced by two literal
  rewrite invocations; current form is preferred for readability.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
Copilot AI review requested due to automatic review settings May 25, 2026 06:10
@wjxiz1992
Copy link
Copy Markdown
Collaborator Author

build

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a latent correctness bug in the cuDF regex transpiler where RegexCharacterRange endpoints were not rewritten, so escape/meta-character normalization (and supplementary-codepoint fallback behavior) could be skipped for range endpoints.

Changes:

  • Update CudfRegexTranspiler.rewrite to recurse into RegexCharacterRange start/end endpoints, with the same ^$. literal bypass behavior used for RegexCharacterClass.
  • Add defensive validation so rewritten range endpoints must remain RegexCharacterClassComponents (otherwise trigger CPU fallback via RegexUnsupportedException).
  • Extend Scala + Python test coverage for supplementary-codepoint fallback and for the range-endpoint recursion regression.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala Make RegexCharacterRange rewriting recurse into endpoints and propagate unsupported supplementary-codepoint behavior safely.
tests/src/test/scala/com/nvidia/spark/rapids/RegularExpressionTranspilerSuite.scala Add unit tests covering supplementary-codepoint fallback and the fixed range-endpoint recursion behavior.
integration_tests/src/main/python/regexp_test.py Add/adjust integration tests to assert CPU fallback for supplementary-codepoint hex escapes (including in ranges) and keep GPU/CPU parity.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 25, 2026

Greptile Summary

This PR fixes two related bugs in the regex transpiler: it adds CPU fallback (via RegexUnsupportedException) for supplementary codepoints (cp > U+FFFF) in hex and octal escapes (#14744), and it eliminates the no-op RegexCharacterRange arm in CudfRegexTranspiler.rewrite so that range endpoints now receive the same hex/octal/meta-character normalisation as standalone literals (#14745).

  • RegexParser.scala: Parser and transpiler now throw RegexUnsupportedException for hex escapes whose codepoint exceeds U+FFFF, routing those patterns to CPU fallback rather than silently truncating via toChar. The RegexCharacterRange transpiler arm is changed from a no-op to a recursive rewrite of both endpoints, with the same ^$. bypass used by the adjacent RegexCharacterClass arm.
  • RegularExpressionTranspilerSuite.scala: Two new unit tests cover the supplementary-codepoint throw path and the BMP-range round-trip parity, including boundary values like U+FFFF.
  • regexp_test.py: Supplementary-codepoint patterns are moved from the GPU-only assertion into new assert_gpu_fallback_collect tests that verify CPU fallback correctness for both rlike and regexp_replace.

Confidence Score: 4/5

Safe to merge; the CharacterRange recursion and ^$. bypass are correct and consistent with the adjacent CharacterClass arm, and the test coverage spans BMP boundary values and GPU fallback paths.

The core logic — recursing into range endpoints with the ^$. bypass, and throwing on supplementary codepoints — correctly mirrors the existing CharacterClass arm. The two octal > 0xFFFF guards (in the parser and transpiler) are unreachable dead code since octal can encode at most U+00FF, but they are harmless. Everything else, including decorator ordering and assertion helpers in the integration tests, looks correct.

The dead octal guards in RegexParser.scala (both the parseCharacterClass branch and the transpiler's RegexOctalChar arm) are the only items worth revisiting; the test files are straightforward.

Important Files Changed

Filename Overview
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala Adds supplementary-codepoint (cp > U+FFFF) throw in parser and transpiler for hex/octal arms, and fixes the no-op RegexCharacterRange arm to recurse into both endpoints with correct ^$. bypass; the octal > 0xFFFF guard in both the parser and transpiler is unreachable dead code (max octal value is 255)
tests/src/test/scala/com/nvidia/spark/rapids/RegularExpressionTranspilerSuite.scala Adds two new unit tests covering supplementary codepoint CPU fallback (issue #14744) and CharacterRange recursion (issue #14745), including BMP boundary regression guards
integration_tests/src/main/python/regexp_test.py Removes supplementary codepoint patterns from GPU-only test, adds parametrized fallback tests for rlike/regexp_replace using assert_gpu_fallback_collect with correct @allow_non_gpu decorators

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Regex Pattern Input"] --> B["RegexParser.parseCharacterClass / parseEscapedCharacter"]
    B --> C{"Hex escape \\x{...}?"}
    C -->|"codePoint > U+FFFF"| D["throw RegexUnsupportedException\n(#14744 fix)"]
    C -->|"codePoint <= U+FFFF"| E["RegexHexDigit / RegexChar node"]
    B --> F{"Octal escape \\0nnn?"}
    F -->|"max value 0xFF"| G["RegexOctalChar / RegexChar node"]
    E --> H["CudfRegexTranspiler.rewrite"]
    G --> H
    H --> I{"AST node type?"}
    I -->|"RegexHexDigit codePoint > U+FFFF"| D
    I -->|"RegexOctalChar codePoint > U+FFFF (dead)"| D
    I -->|"RegexCharacterRange(start, end)"| J["rewriteEndpoint(start)\nrewriteEndpoint(end)\n(#14745 fix — was no-op)"]
    J --> K{"endpoint is RegexChar('^','$','.')"}
    K -->|"yes — literal inside [ ]"| L["return as-is (^$. bypass)"]
    K -->|"no"| M["recursive rewrite(endpoint, ...)"]
    M --> N{"result is RegexCharacterClassComponent?"}
    N -->|"yes"| O["RegexCharacterRange(rewritten endpoints)"]
    N -->|"no e.g. \\s → CharClass"| D
    I -->|"RegexCharacterClass"| P["rewrite each member (same ^$. bypass)"]
    D --> Q["CPU Fallback (Java Pattern engine)"]
    O --> R["cuDF Regex Compilation"]
    P --> R
Loading

Comments Outside Diff (1)

  1. sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala, line 1135-1152 (link)

    P2 The same dead-code guard appears in the transpiler's RegexOctalChar arm. Since any RegexOctalChar node was produced from a 4-digit-max octal pattern (max value 255), the > 0xFFFF branch can never execute. Removing it keeps the octal and hex arms visually distinct and avoids misleading readers about octal's range.

Reviews (1): Last reviewed commit: "[BUG] Recurse CudfRegexTranspiler into R..." | Re-trigger Greptile

Comment on lines 221 to 230
case Some('0') =>
val octalChar = parseOctalDigit
octalChar.codePoint match {
case 0 => RegexHexDigit("00")
case codePoint if codePoint > 0xFFFF =>
throw new RegexUnsupportedException(
"cuDF does not support supplementary codepoints (cp > U+FFFF) " +
"in octal escapes; falling back to CPU", octalChar.position)
case codePoint => RegexChar(codePoint.toChar)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The codePoint > 0xFFFF guard in the octal branch of the parser is unreachable. Java regex octal escapes top out at \0mnn with the constraint 0 <= m <= 3, giving a maximum encoded value of \0377 = 255 decimal (0xFF), which is far below U+FFFF. The dead guard will never fire and may mislead future readers into thinking octal can encode supplementary codepoints. The hex counterpart is correctly reachable and should stay; only the octal arm is dead code.

Suggested change
case Some('0') =>
val octalChar = parseOctalDigit
octalChar.codePoint match {
case 0 => RegexHexDigit("00")
case codePoint if codePoint > 0xFFFF =>
throw new RegexUnsupportedException(
"cuDF does not support supplementary codepoints (cp > U+FFFF) " +
"in octal escapes; falling back to CPU", octalChar.position)
case codePoint => RegexChar(codePoint.toChar)
}
case Some('0') =>
val octalChar = parseOctalDigit
octalChar.codePoint match {
case 0 => RegexHexDigit("00")
// Note: octal escapes max out at \0377 = 255 (U+00FF), so codePoint
// can never exceed U+FFFF here; no supplementary-codepoint guard needed.
case codePoint => RegexChar(codePoint.toChar)
}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 7c1505385. You are right — parseOctalDigit accepts at most \0mnn with 0 <= m <= 3 per the constraints at RegexParser.scala:572-573, so the maximum encoded codepoint is 0o377 = 255 = U+00FF, well below U+FFFF. Removed the dead guard and replaced it with a one-line invariant note so the asymmetry with the hex arm (which is correctly reachable) does not mislead future readers.

Verified locally: mvn package -pl tests -am -Dbuildver=330 ... -DwildcardSuites=com.nvidia.spark.rapids.RegularExpressionTranspilerSuite → 96 succeeded, 0 failed, 6 canceled (unrelated #7585 unicode-line-separator cancellations).

@wjxiz1992 wjxiz1992 marked this pull request as draft May 25, 2026 10:07
…NVIDIA#14745)

`parseOctalDigit` accepts at most `\0mnn` with `0 <= m <= 3, 0 <= n <= 7`
(see `parseOctalDigits(4)` constraints in RegexParser.scala:572-573), so
the maximum encoded `codePoint` is 0o377 = 255 decimal = U+00FF. The
`codePoint > 0xFFFF` guard at the same site as the hex branch's guard is
therefore unreachable; replaced with a one-line invariant comment so the
asymmetry with the hex arm doesn't mislead readers.

Verified: `RegularExpressionTranspilerSuite` 96 succeeded, 0 failed,
6 canceled (unrelated NVIDIA#7585 unicode-line-separator cancellations).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] CudfRegexTranspiler.rewrite does not recurse into RegexCharacterRange endpoints; non-BMP / non-ASCII range endpoints get the wrong match

3 participants