Skip to content

[BUG] Fix regex replacement-string parser Java spec gaps (#14742)#14862

Draft
wjxiz1992 wants to merge 3 commits into
NVIDIA:mainfrom
wjxiz1992:fix/regex-14742
Draft

[BUG] Fix regex replacement-string parser Java spec gaps (#14742)#14862
wjxiz1992 wants to merge 3 commits into
NVIDIA:mainfrom
wjxiz1992:fix/regex-14742

Conversation

@wjxiz1992
Copy link
Copy Markdown
Collaborator

Closes #14742

Summary

RegexParser.parseBackrefOrEscaped and parseBackrefOrLiteralDollar (plus the
downstream GpuRegExpUtils.backrefConversion helper) diverged from
java.util.regex.Matcher#appendReplacement in five observable ways. This PR
rewrites both parser methods and trims backrefConversion so the spark-rapids
replacement-string path matches the Java spec.

For each sub-bug the GPU now matches the CPU contract — either both produce the
same value or both engines throw the same error class. Sub-bugs 2-5 raise
RegexUnsupportedException during meta-tagging, which causes RegExpReplace
to fall back to CPU; Spark's CPU Matcher then raises the user-visible
IllegalArgumentException defined by Java's spec.

# Input Before (GPU) After (GPU) Java / Spark CPU
1 \1 in replacement RegexBackref(1) substitutes group 1 literal 1 literal 1
2 trailing \ literal \ emitted CPU fallback + Java throws IllegalArgumentException("character to be escaped is missing")
3 $X (non-digit X) literal $X emitted CPU fallback + Java throws IllegalArgumentException("Illegal group reference")
4 ${1} RegexBackref(1) substitutes group 1 CPU fallback + Java throws IllegalArgumentException("capturing group name {1} starts with digit character")
5 ${name} literal ${name} emitted CPU fallback + Java throws IllegalArgumentException("No group with name {name}")

The AST shape for \X is intentionally a two-node sequence
(RegexChar('\\') + RegexChar(X)); the existing unescapeReplaceString
downstream still collapses the \X pair to literal X. The only semantic delta
in backrefConversion is dropping the \digit -> ${digit} rewrite, which was
the root cause of sub-bug 1.

Tests use the DataFrame API rather than selectExpr because Spark SQL's
Hive-inherited variable substitution silently expands \${...} inside SQL
string literals at parse time and would mask sub-bugs 4-5.

Test plan

  • 13 new Scala unit tests in RegularExpressionParserSuite covering each
    sub-bug plus regression cases (\$1, \$12, \\\\, \\\$).
  • 5 new Python integration tests in regexp_test.py (test_regexp_replace_subbug{1..5}_*_14742).
  • Local Maven runs (Spark 3.3.0, GPU allocFraction 0.3):
    • RegularExpressionParserSuite + RegularExpressionTranspilerSuite:
      Tests: succeeded 130, failed 0, canceled 6, ignored 0, pending 0.
    • All regexp_replace / re_replace Python ITs:
      28 passed, 1 skipped, 0 failed.
    • Targeted -k 14742 Python ITs: 5 passed.

Documentation

  • Updated for new or modified user-facing features or behaviors
  • No user-facing change

Testing

  • Added or modified tests to cover new code paths
  • Covered by existing tests
  • Not required

Performance

  • Tests ran and results are added in the PR description
  • Issue filed with a link in the PR description
  • Not required

RegexParser.parseBackrefOrEscaped and parseBackrefOrLiteralDollar
diverged from java.util.regex.Matcher#appendReplacement in five ways.
Rewrite both methods (plus the downstream GpuRegExpUtils.backrefConversion
helper) to faithfully implement the Java spec:

  1. \N for any N (digit or not) emits the literal character N.
     Previously \digit was incorrectly parsed as RegexBackref.
  2. A trailing bare \ raises RegexUnsupportedException (Java throws
     "character to be escaped is missing"). Previously emitted a
     literal backslash silently.
  3. $X for non-digit X (and EOF) raises RegexUnsupportedException
     (Java throws "Illegal group reference"). Previously silently
     emitted a literal '$' followed by X.
  4. ${digits} raises RegexUnsupportedException (Java throws
     "capturing group name {N} starts with digit character").
     Previously emitted RegexBackref(digits).
  5. ${name} raises RegexUnsupportedException. spark-rapids has no
     named-group support, so any ${name} reference is unsupported on
     the GPU and falls back to CPU; Java's CPU Matcher then throws
     "No group with name {name}" if the name is unknown, preserving
     the user-visible error contract.

The exceptions thrown during meta-tagging are caught by
GpuRegExpReplaceMeta and propagated as willNotWorkOnGpu, so the
operator falls back to CPU rather than failing with an internal error.
The AST emission for \X is intentionally a two-node sequence
(RegexChar('\\') + RegexChar(X)) so the downstream
unescapeReplaceString pipeline keeps producing the same cuDF
replacement string for non-backref escapes. The only semantic delta
in backrefConversion is dropping the (incorrect) \digit -> backref
rewrite; $digit is preserved.

Add 13 Scala unit tests in RegularExpressionParserSuite covering each
sub-bug plus the still-supported $1 / $12 / \\ / \$ shapes, and 5
Python integration tests in regexp_test.py using the DataFrame API
(selectExpr is avoided because Hive variable substitution masks
${...} at SQL parse time).

Closes NVIDIA#14742

Signed-off-by: Allen Xu <allxu@nvidia.com>
Copilot AI review requested due to automatic review settings May 22, 2026 09:14
@wjxiz1992
Copy link
Copy Markdown
Collaborator Author

build

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes correctness gaps in spark-rapids’ regexp replacement-string parsing so GPU behavior matches java.util.regex.Matcher#appendReplacement semantics (or deliberately falls back to CPU to preserve Java/Spark error behavior), closing #14742.

Changes:

  • Reworked RegexParser replacement parsing for \X, $X, and ${...} so unsupported/illegal shapes trigger CPU fallback and Java-consistent errors.
  • Simplified GpuRegExpUtils.backrefConversion to only treat $<digits> as backrefs (no longer treating \digit as a backref).
  • Added Scala unit tests and Python integration tests covering the five sub-bugs and key regressions.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
tests/src/test/scala/com/nvidia/spark/rapids/RegularExpressionParserSuite.scala Adds unit tests validating replacement-string parsing behavior for \, $, and ${...} cases.
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala Updates backref conversion logic/docs to align with Java replacement semantics ($digit only).
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala Rewrites replacement parsing for \X and $... to match Java spec and enforce CPU fallback on illegal forms.
integration_tests/src/main/python/regexp_test.py Adds DataFrame-API integration tests for sub-bugs 1–5 (avoiding SQL ${...} substitution pitfalls).

Comment on lines +447 to +449
case Some(ch) if ch.isDigit =>
// Numbered backref. consumeInt always succeeds here because we peeked a digit.
RegexBackref(consumeInt().get)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 7882b02. Restricted every digit-branch check in parseBackrefOrLiteralDollar (and the underlying consumeInt loop) to ASCII 0-9 via a new isAsciiDigit predicate. Non-ASCII Unicode digits now fall through to the catch-all and surface as RegexUnsupportedException("Illegal group reference"), which the tag path converts to CPU fallback. Java's Matcher.appendReplacement then handles the Unicode digit value itself. Added a Scala unit test (RegularExpressionParserSuite — "non-ASCII Unicode digit after $ triggers GPU fallback") that exercises Arabic-Indic, Devanagari, and Arabic-Persian digits.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 22, 2026

Greptile Summary

This PR fixes five spec gaps where RegexParser.parseBackrefOrEscaped / parseBackrefOrLiteralDollar (and the downstream backrefConversion helper) diverged from java.util.regex.Matcher#appendReplacement. The core fix is that \digit in a replacement string is now emitted as a literal \X pair (not RegexBackref) so downstream unescapeReplaceString collapses it to the bare digit, matching Java's semantics. Invalid replacement forms (${digit}, ${name}, bare trailing \, and $X with non-digit X) now throw RegexUnsupportedException to force a CPU fallback where Java raises the appropriate IllegalArgumentException.

  • parseBackrefOrEscaped now emits RegexSequence([RegexChar('\\'), RegexChar(X)]) for any \X, and throws on a trailing bare \ (sub-bugs 1 & 2).
  • parseBackrefOrLiteralDollar now strictly validates the ${...} form and throws on digit-leading names, unsupported named-group refs, empty/malformed braces, and bare $ not followed by a digit or { (sub-bugs 3, 4, & 5).
  • backrefConversion drops the old \digit → ${digit} rewrite branch, which was the root cause of sub-bug 1.
  • 13 new Scala unit tests and 5 new Python integration tests cover every sub-bug plus regression cases.

Confidence Score: 4/5

The replacement-string parser rewrites are logically correct and well-tested; the only change with any surface area is the removal of the \digit→backref branch in backrefConversion, which is the root-cause fix.

All five sub-bugs are addressed with matching unit and integration tests. The new parser correctly aligns with Java's appendReplacement semantics for every documented case. The only finding is a missing PEP-8 blank line in the test file, which does not affect correctness or CI.

No files require special attention; the regression cases in RegularExpressionParserSuite.scala cover the happy path and every error branch added in RegexParser.scala.

Important Files Changed

Filename Overview
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala Core logic fix: parseBackrefOrEscaped and parseBackrefOrLiteralDollar rewritten to match Java spec; adds isLetter helper (ASCII-only, consistent with Java's Matcher internals).
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala backrefConversion simplified to strip the incorrect \digit→${digit} rewrite; comment updated to document the new contract. Change is minimal and correct.
tests/src/test/scala/com/nvidia/spark/rapids/RegularExpressionParserSuite.scala 13 new unit tests cover all 5 sub-bugs plus regression cases; assertions are precise and test both the happy path and every error path.
integration_tests/src/main/python/regexp_test.py 5 new Python integration tests; sub-bug 1 uses assert_gpu_and_cpu_are_equal_collect (GPU execution verified), sub-bugs 2–5 correctly use @allow_non_gpu + assert_gpu_and_cpu_error since those paths fall back to CPU. Missing blank line before the next test section.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Replacement string input"] --> B{First char?}
    B -- "\\" --> C["parseBackrefOrEscaped()"]
    B -- "$" --> D["parseBackrefOrLiteralDollar()"]
    B -- other --> E["RegexChar(ch)"]

    C --> C1{Next char?}
    C1 -- "EOF" --> C2["throw RegexUnsupportedException\n(trailing backslash)"]
    C1 -- "any X" --> C3["RegexSequence([RegexChar('\\'), RegexChar(X)])\n literal X after unescape"]

    D --> D1{Next char?}
    D1 -- "digit" --> D2["RegexBackref(consumeInt())\n cuDF ${n}"]
    D1 -- "'{'" --> D3{Name char?}
    D3 -- "digit" --> D4["throw digit-leading name\n CPU fallback"]
    D3 -- "letter" --> D5["consume name expect '}'\n throw named group not supported"]
    D3 -- "other/EOF" --> D6["throw empty/malformed\n CPU fallback"]
    D1 -- "other/EOF" --> D7["throw Illegal group ref\n CPU fallback"]

    C3 --> F["backrefConversion: leave \\X verbatim"]
    D2 --> G["backrefConversion: $digit to ${digit}"]
    F --> H["unescapeReplaceString: strip leading \\"]
    G --> I["cuDF: ${n} backref"]
    H --> J["cuDF: literal char"]
Loading

Reviews (1): Last reviewed commit: "Fix regex replacement-string parser Java..." | Re-trigger Greptile

@nvauto
Copy link
Copy Markdown
Collaborator

nvauto commented May 25, 2026

NOTE: release/26.06 has been created from main. Please retarget your PR to release/26.06 if it should be included in the release.

@wjxiz1992 wjxiz1992 marked this pull request as draft May 25, 2026 10:07
wjxiz1992 added 2 commits May 28, 2026 12:27
Copilot review at
NVIDIA#14862 (comment)
flagged that `parseBackrefOrLiteralDollar` accepted any `Char.isDigit`
codepoint (including non-ASCII Unicode digits such as Arabic-Indic ٠-٩
and Devanagari ०-९) and then called `consumeInt().get`. `consumeInt`
uses `pattern.substring(...).toInt`, which only parses ASCII digits, so
a non-ASCII Unicode digit would surface as an uncaught
`NumberFormatException` at GPU tagging time instead of as the
`RegexUnsupportedException` that triggers a CPU fallback.

Fix:
- Add `isAsciiDigit` predicate to gate every digit-branch check in
  `parseBackrefOrLiteralDollar` (both the bare `$<digit>` numbered
  backref and the `${name}` named-group `name-starts-with-digit` check
  and the `${name}` name-body loop).
- Also tighten `consumeInt`'s digit loop to ASCII so the routine cannot
  consume a non-ASCII digit and then throw `NumberFormatException` in
  `toInt`. This matches Java's regex parser, which only reads ASCII
  digits in numeric contexts.
- Non-ASCII digits now fall through to the `case _ =>` catch-all and
  raise `RegexUnsupportedException("Illegal group reference")`, which
  the tag path catches and converts to CPU fallback. CPU then runs
  Java's `Matcher.appendReplacement`, which honours the Unicode value of
  the digit (e.g. `'٢'.getNumericValue == 2`).

Local Scala unit test (37 tests, 0 failures):
  com.nvidia.spark.rapids.RegularExpressionParserSuite::issue-14742:
    non-ASCII Unicode digit after `$` triggers GPU fallback   PASSED

Signed-off-by: Allen Xu <allxu@nvidia.com>
The exception message, predicate name (`isAsciiDigit`), and test names
already carry the WHY; drop the "how we got here" prose per reviewer
feedback style.

Signed-off-by: Allen Xu <allxu@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Replacement-string parser diverges from Java spec in five places: \\N as backref, trailing \\, bare $X, and malformed ${...}

3 participants