[BUG] Fix regex replacement-string parser Java spec gaps (#14742)#14862
[BUG] Fix regex replacement-string parser Java spec gaps (#14742)#14862wjxiz1992 wants to merge 3 commits into
Conversation
RegexParser.parseBackrefOrEscaped and parseBackrefOrLiteralDollar
diverged from java.util.regex.Matcher#appendReplacement in five ways.
Rewrite both methods (plus the downstream GpuRegExpUtils.backrefConversion
helper) to faithfully implement the Java spec:
1. \N for any N (digit or not) emits the literal character N.
Previously \digit was incorrectly parsed as RegexBackref.
2. A trailing bare \ raises RegexUnsupportedException (Java throws
"character to be escaped is missing"). Previously emitted a
literal backslash silently.
3. $X for non-digit X (and EOF) raises RegexUnsupportedException
(Java throws "Illegal group reference"). Previously silently
emitted a literal '$' followed by X.
4. ${digits} raises RegexUnsupportedException (Java throws
"capturing group name {N} starts with digit character").
Previously emitted RegexBackref(digits).
5. ${name} raises RegexUnsupportedException. spark-rapids has no
named-group support, so any ${name} reference is unsupported on
the GPU and falls back to CPU; Java's CPU Matcher then throws
"No group with name {name}" if the name is unknown, preserving
the user-visible error contract.
The exceptions thrown during meta-tagging are caught by
GpuRegExpReplaceMeta and propagated as willNotWorkOnGpu, so the
operator falls back to CPU rather than failing with an internal error.
The AST emission for \X is intentionally a two-node sequence
(RegexChar('\\') + RegexChar(X)) so the downstream
unescapeReplaceString pipeline keeps producing the same cuDF
replacement string for non-backref escapes. The only semantic delta
in backrefConversion is dropping the (incorrect) \digit -> backref
rewrite; $digit is preserved.
Add 13 Scala unit tests in RegularExpressionParserSuite covering each
sub-bug plus the still-supported $1 / $12 / \\ / \$ shapes, and 5
Python integration tests in regexp_test.py using the DataFrame API
(selectExpr is avoided because Hive variable substitution masks
${...} at SQL parse time).
Closes NVIDIA#14742
Signed-off-by: Allen Xu <allxu@nvidia.com>
|
build |
There was a problem hiding this comment.
Pull request overview
This PR fixes correctness gaps in spark-rapids’ regexp replacement-string parsing so GPU behavior matches java.util.regex.Matcher#appendReplacement semantics (or deliberately falls back to CPU to preserve Java/Spark error behavior), closing #14742.
Changes:
- Reworked
RegexParserreplacement parsing for\X,$X, and${...}so unsupported/illegal shapes trigger CPU fallback and Java-consistent errors. - Simplified
GpuRegExpUtils.backrefConversionto only treat$<digits>as backrefs (no longer treating\digitas a backref). - Added Scala unit tests and Python integration tests covering the five sub-bugs and key regressions.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/src/test/scala/com/nvidia/spark/rapids/RegularExpressionParserSuite.scala | Adds unit tests validating replacement-string parsing behavior for \, $, and ${...} cases. |
| sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala | Updates backref conversion logic/docs to align with Java replacement semantics ($digit only). |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala | Rewrites replacement parsing for \X and $... to match Java spec and enforce CPU fallback on illegal forms. |
| integration_tests/src/main/python/regexp_test.py | Adds DataFrame-API integration tests for sub-bugs 1–5 (avoiding SQL ${...} substitution pitfalls). |
| case Some(ch) if ch.isDigit => | ||
| // Numbered backref. consumeInt always succeeds here because we peeked a digit. | ||
| RegexBackref(consumeInt().get) |
There was a problem hiding this comment.
Fixed in 7882b02. Restricted every digit-branch check in parseBackrefOrLiteralDollar (and the underlying consumeInt loop) to ASCII 0-9 via a new isAsciiDigit predicate. Non-ASCII Unicode digits now fall through to the catch-all and surface as RegexUnsupportedException("Illegal group reference"), which the tag path converts to CPU fallback. Java's Matcher.appendReplacement then handles the Unicode digit value itself. Added a Scala unit test (RegularExpressionParserSuite — "non-ASCII Unicode digit after $ triggers GPU fallback") that exercises Arabic-Indic, Devanagari, and Arabic-Persian digits.
Greptile SummaryThis PR fixes five spec gaps where
Confidence Score: 4/5The replacement-string parser rewrites are logically correct and well-tested; the only change with any surface area is the removal of the \digit→backref branch in backrefConversion, which is the root-cause fix. All five sub-bugs are addressed with matching unit and integration tests. The new parser correctly aligns with Java's appendReplacement semantics for every documented case. The only finding is a missing PEP-8 blank line in the test file, which does not affect correctness or CI. No files require special attention; the regression cases in RegularExpressionParserSuite.scala cover the happy path and every error branch added in RegexParser.scala. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["Replacement string input"] --> B{First char?}
B -- "\\" --> C["parseBackrefOrEscaped()"]
B -- "$" --> D["parseBackrefOrLiteralDollar()"]
B -- other --> E["RegexChar(ch)"]
C --> C1{Next char?}
C1 -- "EOF" --> C2["throw RegexUnsupportedException\n(trailing backslash)"]
C1 -- "any X" --> C3["RegexSequence([RegexChar('\\'), RegexChar(X)])\n literal X after unescape"]
D --> D1{Next char?}
D1 -- "digit" --> D2["RegexBackref(consumeInt())\n cuDF ${n}"]
D1 -- "'{'" --> D3{Name char?}
D3 -- "digit" --> D4["throw digit-leading name\n CPU fallback"]
D3 -- "letter" --> D5["consume name expect '}'\n throw named group not supported"]
D3 -- "other/EOF" --> D6["throw empty/malformed\n CPU fallback"]
D1 -- "other/EOF" --> D7["throw Illegal group ref\n CPU fallback"]
C3 --> F["backrefConversion: leave \\X verbatim"]
D2 --> G["backrefConversion: $digit to ${digit}"]
F --> H["unescapeReplaceString: strip leading \\"]
G --> I["cuDF: ${n} backref"]
H --> J["cuDF: literal char"]
Reviews (1): Last reviewed commit: "Fix regex replacement-string parser Java..." | Re-trigger Greptile |
|
NOTE: release/26.06 has been created from main. Please retarget your PR to release/26.06 if it should be included in the release. |
Copilot review at NVIDIA#14862 (comment) flagged that `parseBackrefOrLiteralDollar` accepted any `Char.isDigit` codepoint (including non-ASCII Unicode digits such as Arabic-Indic ٠-٩ and Devanagari ०-९) and then called `consumeInt().get`. `consumeInt` uses `pattern.substring(...).toInt`, which only parses ASCII digits, so a non-ASCII Unicode digit would surface as an uncaught `NumberFormatException` at GPU tagging time instead of as the `RegexUnsupportedException` that triggers a CPU fallback. Fix: - Add `isAsciiDigit` predicate to gate every digit-branch check in `parseBackrefOrLiteralDollar` (both the bare `$<digit>` numbered backref and the `${name}` named-group `name-starts-with-digit` check and the `${name}` name-body loop). - Also tighten `consumeInt`'s digit loop to ASCII so the routine cannot consume a non-ASCII digit and then throw `NumberFormatException` in `toInt`. This matches Java's regex parser, which only reads ASCII digits in numeric contexts. - Non-ASCII digits now fall through to the `case _ =>` catch-all and raise `RegexUnsupportedException("Illegal group reference")`, which the tag path catches and converts to CPU fallback. CPU then runs Java's `Matcher.appendReplacement`, which honours the Unicode value of the digit (e.g. `'٢'.getNumericValue == 2`). Local Scala unit test (37 tests, 0 failures): com.nvidia.spark.rapids.RegularExpressionParserSuite::issue-14742: non-ASCII Unicode digit after `$` triggers GPU fallback PASSED Signed-off-by: Allen Xu <allxu@nvidia.com>
The exception message, predicate name (`isAsciiDigit`), and test names already carry the WHY; drop the "how we got here" prose per reviewer feedback style. Signed-off-by: Allen Xu <allxu@nvidia.com>
Closes #14742
Summary
RegexParser.parseBackrefOrEscapedandparseBackrefOrLiteralDollar(plus thedownstream
GpuRegExpUtils.backrefConversionhelper) diverged fromjava.util.regex.Matcher#appendReplacementin five observable ways. This PRrewrites both parser methods and trims
backrefConversionso the spark-rapidsreplacement-string path matches the Java spec.
For each sub-bug the GPU now matches the CPU contract — either both produce the
same value or both engines throw the same error class. Sub-bugs 2-5 raise
RegexUnsupportedExceptionduring meta-tagging, which causesRegExpReplaceto fall back to CPU; Spark's CPU Matcher then raises the user-visible
IllegalArgumentExceptiondefined by Java's spec.\1in replacementRegexBackref(1)substitutes group 111\\emittedIllegalArgumentException("character to be escaped is missing")$X(non-digit X)$XemittedIllegalArgumentException("Illegal group reference")${1}RegexBackref(1)substitutes group 1IllegalArgumentException("capturing group name {1} starts with digit character")${name}${name}emittedIllegalArgumentException("No group with name {name}")The AST shape for
\Xis intentionally a two-node sequence(
RegexChar('\\') + RegexChar(X)); the existingunescapeReplaceStringdownstream still collapses the
\Xpair to literal X. The only semantic deltain
backrefConversionis dropping the\digit -> ${digit}rewrite, which wasthe root cause of sub-bug 1.
Tests use the DataFrame API rather than
selectExprbecause Spark SQL'sHive-inherited variable substitution silently expands
\${...}inside SQLstring literals at parse time and would mask sub-bugs 4-5.
Test plan
RegularExpressionParserSuitecovering eachsub-bug plus regression cases (
\$1,\$12,\\\\,\\\$).regexp_test.py(test_regexp_replace_subbug{1..5}_*_14742).RegularExpressionParserSuite+RegularExpressionTranspilerSuite:Tests: succeeded 130, failed 0, canceled 6, ignored 0, pending 0.regexp_replace/re_replacePython ITs:28 passed, 1 skipped, 0 failed.-k 14742Python ITs: 5 passed.Documentation
Testing
Performance