[BUG] Fix regex replacement-string parser Java spec gaps (#14742) by wjxiz1992 · Pull Request #14862 · NVIDIA/spark-rapids

wjxiz1992 · 2026-05-22T09:14:20Z

Summary

RegexParser.parseBackrefOrEscaped and parseBackrefOrLiteralDollar (plus the
downstream GpuRegExpUtils.backrefConversion helper) diverged from
java.util.regex.Matcher#appendReplacement in five observable ways. This PR
rewrites both parser methods and trims backrefConversion so the spark-rapids
replacement-string path matches the Java spec.

For each sub-bug the GPU now matches the CPU contract — either both produce the
same value or both engines throw the same error class. Sub-bugs 2-5 raise
RegexUnsupportedException during meta-tagging, which causes RegExpReplace
to fall back to CPU; Spark's CPU Matcher then raises the user-visible
IllegalArgumentException defined by Java's spec.

#	Input	Before (GPU)	After (GPU)	Java / Spark CPU
1	`\1` in replacement	`RegexBackref(1)` substitutes group 1	literal `1`	literal `1`
2	trailing `\`	literal `\` emitted	CPU fallback + Java throws	`IllegalArgumentException("character to be escaped is missing")`
3	`$X` (non-digit X)	literal `$X` emitted	CPU fallback + Java throws	`IllegalArgumentException("Illegal group reference")`
4	`${1}`	`RegexBackref(1)` substitutes group 1	CPU fallback + Java throws	`IllegalArgumentException("capturing group name {1} starts with digit character")`
5	`${name}`	literal `${name}` emitted	CPU fallback + Java throws	`IllegalArgumentException("No group with name {name}")`
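The "Java / Spark CPU" column can be modelled in a few lines. The following is a minimal sketch, not Java's implementation: `expand_replacement`, its arguments, and the use of `ValueError` in place of `IllegalArgumentException` are assumptions made for illustration, and the "malformed" message is invented wording. The other messages are the ones quoted in the table.

```python
ASCII_DIGITS = "0123456789"
ASCII_LETTERS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def expand_replacement(replacement, groups, named=None):
    """Expand a replacement string against one match (illustrative model).

    groups[0] is the whole match, groups[i] is capture group i.
    named maps group names to indices (empty by default).
    ValueError stands in for Java's IllegalArgumentException.
    """
    named = named or {}
    last_group = len(groups) - 1
    out = []
    i, n = 0, len(replacement)
    while i < n:
        ch = replacement[i]
        if ch == "\\":
            if i + 1 == n:
                raise ValueError("character to be escaped is missing")  # row 2
            out.append(replacement[i + 1])  # row 1: \X is the literal X
            i += 2
        elif ch == "$":
            i += 1
            nxt = replacement[i] if i < n else ""
            if nxt == "{":
                end = replacement.find("}", i)
                name = replacement[i + 1:end] if end != -1 else ""
                if not name or any(c not in ASCII_DIGITS + ASCII_LETTERS for c in name):
                    # Invented wording; only the messages in the table are quoted.
                    raise ValueError("malformed ${...} reference")
                if name[0] in ASCII_DIGITS:
                    raise ValueError(
                        "capturing group name {%s} starts with digit character" % name)  # row 4
                if name not in named:
                    raise ValueError("No group with name {%s}" % name)  # row 5
                ref, i = named[name], end + 1
            elif nxt and nxt in ASCII_DIGITS:
                ref, i = int(nxt), i + 1
                # Take further digits only while the group number still exists.
                while i < n and replacement[i] in ASCII_DIGITS:
                    candidate = ref * 10 + int(replacement[i])
                    if candidate > last_group:
                        break
                    ref, i = candidate, i + 1
                if ref > last_group:
                    raise IndexError("No group %d" % ref)
            else:
                raise ValueError("Illegal group reference")  # row 3 (and a bare trailing $)
            out.append(groups[ref] or "")
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

For a match whose group 1 is `a`, the model turns `\1` into the literal `1` and `$1` into `a`, while rows 2-5 raise with the messages listed above.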

The AST shape for \X is intentionally a two-node sequence
(RegexChar('\\') + RegexChar(X)); the existing unescapeReplaceString
downstream still collapses the \X pair to literal X. The only semantic delta
in backrefConversion is dropping the \digit -> ${digit} rewrite, which was
the root cause of sub-bug 1.
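A string-level sketch of the two downstream steps may help show why dropping the rewrite fixes sub-bug 1. This is a simplified model, not the Scala code: `backref_conversion`, `unescape_replace_string`, and `to_cudf_replacement` are Python stand-ins for the helpers named above, the `rewrite_backslash_digit` flag is hypothetical and exists only to reproduce the removed branch, and consuming every digit after `$` is an assumption.

```python
DIGITS = "0123456789"


def backref_conversion(rep, rewrite_backslash_digit=False):
    r"""Rewrite $<digits> to ${<digits>} and pass every \X pair through.

    rewrite_backslash_digit=True re-enables the removed \digit -> ${digit}
    branch, purely to show the old behaviour (hypothetical flag).
    """
    out, i = [], 0
    while i < len(rep):
        ch = rep[i]
        if ch == "\\" and i + 1 < len(rep):
            nxt = rep[i + 1]
            if rewrite_backslash_digit and nxt in DIGITS:
                out.append("${" + nxt + "}")  # pre-fix behaviour (sub-bug 1)
            else:
                out.append(ch + nxt)  # keep the \X pair verbatim
            i += 2
        elif ch == "$" and i + 1 < len(rep) and rep[i + 1] in DIGITS:
            j = i + 1
            # Assumption: take the whole digit run as the group number.
            while j < len(rep) and rep[j] in DIGITS:
                j += 1
            out.append("${" + rep[i + 1:j] + "}")
            i = j
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def unescape_replace_string(rep):
    r"""Collapse each \X pair to the literal X."""
    out, i = [], 0
    while i < len(rep):
        if rep[i] == "\\" and i + 1 < len(rep):
            out.append(rep[i + 1])
            i += 2
        else:
            out.append(rep[i])
            i += 1
    return "".join(out)


def to_cudf_replacement(rep, rewrite_backslash_digit=False):
    """Run both steps in the order described above."""
    return unescape_replace_string(backref_conversion(rep, rewrite_backslash_digit))
```

With the flag off, `\1` survives `backref_conversion` untouched and is then collapsed to `1`. With the flag on, the same input becomes `${1}`, which is the group substitution reported as sub-bug 1.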

Tests use the DataFrame API rather than selectExpr because Spark SQL's
Hive-inherited variable substitution silently expands \${...} inside SQL
string literals at parse time and would mask sub-bugs 4-5.

Test plan

13 new Scala unit tests in RegularExpressionParserSuite covering each
sub-bug plus regression cases (\$1, \$12, \\\\, \\\$).
5 new Python integration tests in regexp_test.py (test_regexp_replace_subbug{1..5}_*_14742).
Local Maven runs (Spark 3.3.0, GPU allocFraction 0.3):
- RegularExpressionParserSuite + RegularExpressionTranspilerSuite:
  Tests: succeeded 130, failed 0, canceled 6, ignored 0, pending 0.
- All regexp_replace / re_replace Python ITs:
  28 passed, 1 skipped, 0 failed.
- Targeted -k 14742 Python ITs: 5 passed.

Documentation

Updated for new or modified user-facing features or behaviors
No user-facing change

Testing

Added or modified tests to cover new code paths
Covered by existing tests
Not required

Performance

Tests ran and results are added in the PR description
Issue filed with a link in the PR description
Not required

RegexParser.parseBackrefOrEscaped and parseBackrefOrLiteralDollar diverged from java.util.regex.Matcher#appendReplacement in five ways. Rewrite both methods (plus the downstream GpuRegExpUtils.backrefConversion helper) to faithfully implement the Java spec:

1. \N for any N (digit or not) emits the literal character N. Previously \digit was incorrectly parsed as RegexBackref.
2. A trailing bare \ raises RegexUnsupportedException (Java throws "character to be escaped is missing"). Previously emitted a literal backslash silently.
3. $X for non-digit X (and EOF) raises RegexUnsupportedException (Java throws "Illegal group reference"). Previously silently emitted a literal '$' followed by X.
4. ${digits} raises RegexUnsupportedException (Java throws "capturing group name {N} starts with digit character"). Previously emitted RegexBackref(digits).
5. ${name} raises RegexUnsupportedException. spark-rapids has no named-group support, so any ${name} reference is unsupported on the GPU and falls back to CPU; Java's CPU Matcher then throws "No group with name {name}" if the name is unknown, preserving the user-visible error contract.

The exceptions thrown during meta-tagging are caught by GpuRegExpReplaceMeta and propagated as willNotWorkOnGpu, so the operator falls back to CPU rather than failing with an internal error.

The AST emission for \X is intentionally a two-node sequence (RegexChar('\\') + RegexChar(X)) so the downstream unescapeReplaceString pipeline keeps producing the same cuDF replacement string for non-backref escapes. The only semantic delta in backrefConversion is dropping the (incorrect) \digit -> backref rewrite; $digit is preserved.

Add 13 Scala unit tests in RegularExpressionParserSuite covering each sub-bug plus the still-supported $1 / $12 / \\ / \$ shapes, and 5 Python integration tests in regexp_test.py using the DataFrame API (selectExpr is avoided because Hive variable substitution masks ${...} at SQL parse time).

Closes NVIDIA#14742

Signed-off-by: Allen Xu <allxu@nvidia.com>
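The tag-then-fall-back behaviour described in the commit message can be sketched as follows. This is an illustrative Python model only: `ReplaceMeta`, `tag_for_gpu`, `choose_engine`, and the toy `parse_replacement` are hypothetical names, and only the trailing-backslash rule is modelled. The real GpuRegExpReplaceMeta is Scala code that is not shown in this PR page.

```python
class RegexUnsupportedException(Exception):
    """Raised by the sketched parser for shapes the GPU path rejects."""


def parse_replacement(rep):
    # Toy parser: only the trailing-backslash rule (sub-bug 2) is modelled.
    # An odd-length run of backslashes at the end leaves one unpaired.
    trailing = len(rep) - len(rep.rstrip("\\"))
    if trailing % 2 == 1:
        raise RegexUnsupportedException("character to be escaped is missing")
    return rep


class ReplaceMeta:
    """Hypothetical stand-in for the tagging role of GpuRegExpReplaceMeta."""

    def __init__(self, replacement):
        self.replacement = replacement
        self.fallback_reasons = []

    def will_not_work_on_gpu(self, reason):
        self.fallback_reasons.append(reason)

    def tag_for_gpu(self):
        try:
            parse_replacement(self.replacement)
        except RegexUnsupportedException as e:
            # Record the reason instead of letting the exception escape,
            # so planning continues and the operator stays on the CPU.
            self.will_not_work_on_gpu(str(e))

    @property
    def runs_on_gpu(self):
        return not self.fallback_reasons


def choose_engine(replacement):
    meta = ReplaceMeta(replacement)
    meta.tag_for_gpu()
    return "GPU" if meta.runs_on_gpu else "CPU"
```

In this model `choose_engine("$1")` stays on the GPU, while a replacement ending in a single backslash is routed to the CPU, where the CPU engine is then responsible for raising the user-visible error.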

wjxiz1992 · 2026-05-22T09:19:36Z

build

Copilot

Pull request overview

This PR fixes correctness gaps in spark-rapids’ regexp replacement-string parsing so GPU behavior matches java.util.regex.Matcher#appendReplacement semantics (or deliberately falls back to CPU to preserve Java/Spark error behavior), closing #14742.

Changes:

Reworked RegexParser replacement parsing for \X, $X, and ${...} so unsupported/illegal shapes trigger CPU fallback and Java-consistent errors.
Simplified GpuRegExpUtils.backrefConversion to only treat $<digits> as backrefs (no longer treating \digit as a backref).
Added Scala unit tests and Python integration tests covering the five sub-bugs and key regressions.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File	Description
tests/src/test/scala/com/nvidia/spark/rapids/RegularExpressionParserSuite.scala	Adds unit tests validating replacement-string parsing behavior for `\`, `$`, and `${...}` cases.
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala	Updates backref conversion logic/docs to align with Java replacement semantics (`$digit` only).
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala	Rewrites replacement parsing for `\X` and `$...` to match Java spec and enforce CPU fallback on illegal forms.
integration_tests/src/main/python/regexp_test.py	Adds DataFrame-API integration tests for sub-bugs 1–5 (avoiding SQL `${...}` substitution pitfalls).

wjxiz1992 · 2026-05-28T04:27:39Z

+      case Some(ch) if ch.isDigit =>
+        // Numbered backref. consumeInt always succeeds here because we peeked a digit.
+        RegexBackref(consumeInt().get)


Fixed in 7882b02. Restricted every digit-branch check in parseBackrefOrLiteralDollar (and the underlying consumeInt loop) to ASCII 0-9 via a new isAsciiDigit predicate. Non-ASCII Unicode digits now fall through to the catch-all and surface as RegexUnsupportedException("Illegal group reference"), which the tag path converts to CPU fallback. Java's Matcher.appendReplacement then handles the Unicode digit value itself. Added a Scala unit test (RegularExpressionParserSuite — "non-ASCII Unicode digit after $ triggers GPU fallback") that exercises Arabic-Indic, Devanagari, and Arabic-Persian digits.

greptile-apps · 2026-05-22T09:23:59Z

Greptile Summary

This PR fixes five spec gaps where RegexParser.parseBackrefOrEscaped / parseBackrefOrLiteralDollar (and the downstream backrefConversion helper) diverged from java.util.regex.Matcher#appendReplacement. The core fix is that \digit in a replacement string is now emitted as a literal \X pair (not RegexBackref) so downstream unescapeReplaceString collapses it to the bare digit, matching Java's semantics. Invalid replacement forms (${digit}, ${name}, bare trailing \, and $X with non-digit X) now throw RegexUnsupportedException to force a CPU fallback where Java raises the appropriate IllegalArgumentException.

parseBackrefOrEscaped now emits RegexSequence([RegexChar('\\'), RegexChar(X)]) for any \X, and throws on a trailing bare \ (sub-bugs 1 & 2).
parseBackrefOrLiteralDollar now strictly validates the ${...} form and throws on digit-leading names, unsupported named-group refs, empty/malformed braces, and bare $ not followed by a digit or { (sub-bugs 3, 4, & 5).
backrefConversion drops the old \digit → ${digit} rewrite branch, which was the root cause of sub-bug 1.
13 new Scala unit tests and 5 new Python integration tests cover every sub-bug plus regression cases.
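The parser behaviour summarised in the bullets above can be sketched roughly as below. It is a Python approximation under stated assumptions, not the Scala source: the node classes mirror the names in the summary, `parse_escaped` / `parse_dollar` stand in for parseBackrefOrEscaped / parseBackrefOrLiteralDollar, and every exception message other than "character to be escaped is missing" and "Illegal group reference" is invented wording.

```python
from dataclasses import dataclass

DIGITS = "0123456789"
LETTERS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


class RegexUnsupportedException(Exception):
    pass


@dataclass(frozen=True)
class RegexChar:
    ch: str


@dataclass(frozen=True)
class RegexBackref:
    num: int


@dataclass(frozen=True)
class RegexSequence:
    parts: tuple


def parse_escaped(rep, i):
    """rep[i] is a backslash. Return (node, next_index)."""
    if i + 1 == len(rep):
        raise RegexUnsupportedException("character to be escaped is missing")
    # Two-node sequence: the later unescape step collapses it to literal X.
    return RegexSequence((RegexChar("\\"), RegexChar(rep[i + 1]))), i + 2


def parse_dollar(rep, i):
    """rep[i] is '$'. Return (node, next_index) or raise to force fallback."""
    nxt = rep[i + 1] if i + 1 < len(rep) else ""
    if nxt and nxt in DIGITS:
        j = i + 1
        while j < len(rep) and rep[j] in DIGITS:
            j += 1
        return RegexBackref(int(rep[i + 1:j])), j
    if nxt == "{":
        # Simplification: every ${...} shape is rejected, so the name is
        # classified by its first character without being consumed.
        first = rep[i + 2] if i + 2 < len(rep) else ""
        if first and first in DIGITS:
            raise RegexUnsupportedException("group name starts with a digit")
        if first and first in LETTERS:
            raise RegexUnsupportedException("named group references are not supported")
        raise RegexUnsupportedException("empty or malformed ${...} reference")
    raise RegexUnsupportedException("Illegal group reference")


def parse_replacement(rep):
    nodes, i = [], 0
    while i < len(rep):
        if rep[i] == "\\":
            node, i = parse_escaped(rep, i)
        elif rep[i] == "$":
            node, i = parse_dollar(rep, i)
        else:
            node, i = RegexChar(rep[i]), i + 1
        nodes.append(node)
    return nodes
```

Only `$<digits>` yields a `RegexBackref` in this sketch. Every other `$` form raises, which is what pushes the operator back to the CPU.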

Confidence Score: 4/5

The replacement-string parser rewrites are logically correct and well-tested; the only change with any surface area is the removal of the \digit→backref branch in backrefConversion, which is the root-cause fix.

All five sub-bugs are addressed with matching unit and integration tests. The new parser correctly aligns with Java's appendReplacement semantics for every documented case. The only finding is a missing PEP-8 blank line in the test file, which does not affect correctness or CI.

No files require special attention; the regression cases in RegularExpressionParserSuite.scala cover the happy path and every error branch added in RegexParser.scala.

Important Files Changed

Filename	Overview
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala	Core logic fix: parseBackrefOrEscaped and parseBackrefOrLiteralDollar rewritten to match Java spec; adds isLetter helper (ASCII-only, consistent with Java's Matcher internals).
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala	backrefConversion simplified to strip the incorrect \digit→${digit} rewrite; comment updated to document the new contract. Change is minimal and correct.
tests/src/test/scala/com/nvidia/spark/rapids/RegularExpressionParserSuite.scala	13 new unit tests cover all 5 sub-bugs plus regression cases; assertions are precise and test both the happy path and every error path.
integration_tests/src/main/python/regexp_test.py	5 new Python integration tests; sub-bug 1 uses assert_gpu_and_cpu_are_equal_collect (GPU execution verified), sub-bugs 2–5 correctly use @allow_non_gpu + assert_gpu_and_cpu_error since those paths fall back to CPU. Missing blank line before the next test section.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Replacement string input"] --> B{First char?}
    B -- "\\" --> C["parseBackrefOrEscaped()"]
    B -- "$" --> D["parseBackrefOrLiteralDollar()"]
    B -- other --> E["RegexChar(ch)"]

    C --> C1{Next char?}
    C1 -- "EOF" --> C2["throw RegexUnsupportedException\n(trailing backslash)"]
    C1 -- "any X" --> C3["RegexSequence([RegexChar('\\'), RegexChar(X)])\n literal X after unescape"]

    D --> D1{Next char?}
    D1 -- "digit" --> D2["RegexBackref(consumeInt())\n cuDF ${n}"]
    D1 -- "'{'" --> D3{Name char?}
    D3 -- "digit" --> D4["throw digit-leading name\n CPU fallback"]
    D3 -- "letter" --> D5["consume name expect '}'\n throw named group not supported"]
    D3 -- "other/EOF" --> D6["throw empty/malformed\n CPU fallback"]
    D1 -- "other/EOF" --> D7["throw Illegal group ref\n CPU fallback"]

    C3 --> F["backrefConversion: leave \\X verbatim"]
    D2 --> G["backrefConversion: $digit to ${digit}"]
    F --> H["unescapeReplaceString: strip leading \\"]
    G --> I["cuDF: ${n} backref"]
    H --> J["cuDF: literal char"]

Reviews (1): Last reviewed commit: "Fix regex replacement-string parser Java..."

nvauto · 2026-05-25T05:03:00Z

NOTE: release/26.06 has been created from main. Please retarget your PR to release/26.06 if it should be included in the release.

Copilot review at NVIDIA#14862 (comment) flagged that `parseBackrefOrLiteralDollar` accepted any `Char.isDigit` codepoint (including non-ASCII Unicode digits such as Arabic-Indic ٠-٩ and Devanagari ०-९) and then called `consumeInt().get`. `consumeInt` uses `pattern.substring(...).toInt`, which only parses ASCII digits, so a non-ASCII Unicode digit would surface as an uncaught `NumberFormatException` at GPU tagging time instead of as the `RegexUnsupportedException` that triggers a CPU fallback.

Fix:
- Add `isAsciiDigit` predicate to gate every digit-branch check in `parseBackrefOrLiteralDollar` (the bare `$<digit>` numbered backref, the `${name}` named-group `name-starts-with-digit` check, and the `${name}` name-body loop).
- Also tighten `consumeInt`'s digit loop to ASCII so the routine cannot consume a non-ASCII digit and then throw `NumberFormatException` in `toInt`. This matches Java's regex parser, which only reads ASCII digits in numeric contexts.
- Non-ASCII digits now fall through to the `case _ =>` catch-all and raise `RegexUnsupportedException("Illegal group reference")`, which the tag path catches and converts to CPU fallback. CPU then runs Java's `Matcher.appendReplacement`, which honours the Unicode value of the digit (e.g. `'٢'.getNumericValue == 2`).

Local Scala unit test (37 tests, 0 failures):
com.nvidia.spark.rapids.RegularExpressionParserSuite::issue-14742: non-ASCII Unicode digit after `$` triggers GPU fallback PASSED

Signed-off-by: Allen Xu <allxu@nvidia.com>
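The ASCII-only gate described in this commit message can be illustrated in Python, where `str.isdigit` is similarly broader than the ASCII range. This is a sketch of the idea only: `is_ascii_digit` and `consume_int` are Python stand-ins for the Scala `isAsciiDigit` and `consumeInt`, and the `(value, position)` return shape is an assumption.

```python
def is_ascii_digit(ch):
    """True only for the ten ASCII digits 0-9."""
    return len(ch) == 1 and "0" <= ch <= "9"


def consume_int(text, pos):
    """Read the ASCII digit run starting at pos.

    Returns (value, next_pos), or (None, pos) when no ASCII digit is there.
    """
    end = pos
    while end < len(text) and is_ascii_digit(text[end]):
        end += 1
    if end == pos:
        return None, pos
    return int(text[pos:end]), end


# U+0662 ARABIC-INDIC DIGIT TWO: a digit by Unicode's definition,
# but rejected by the ASCII-only predicate.
ARABIC_INDIC_TWO = "\u0662"
```

Because the predicate and the digit loop agree on what counts as a digit, a caller that has peeked an ASCII digit can rely on `consume_int` returning a value, and anything else can be sent to the catch-all branch.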

The exception message, predicate name (`isAsciiDigit`), and test names already carry the WHY; drop the "how we got here" prose per reviewer feedback style. Signed-off-by: Allen Xu <allxu@nvidia.com>

Copilot AI review requested due to automatic review settings May 22, 2026 09:14

Copilot started reviewing on behalf of wjxiz1992 May 22, 2026 09:15

Copilot AI reviewed May 22, 2026


wjxiz1992 marked this pull request as draft May 25, 2026 10:07

wjxiz1992 added 2 commits May 28, 2026 12:27

chore: trim narrative comments from NVIDIA#14742 regex fix

0384a90


wjxiz1992 wants to merge 3 commits into NVIDIA:main from wjxiz1992:fix/regex-14742
