[BUG] Fix regex transpiler line-anchor backref index + countCaptureGroups recursion (#14735 #14855)#14856
[BUG] Fix regex transpiler line-anchor backref index + countCaptureGroups recursion (#14735 #14855)#14856wjxiz1992 wants to merge 1 commit into
Conversation
…reGroups `CudfRegexTranspiler.countCaptureGroups` was missing arms for `RegexChoice` and `RegexRepetition`, so capture groups under `|` or quantifiers were invisible. Adding those arms exposed a second, latent off-by-one in the line-anchor `$` rewrite: the synthesized `(\r\n)?` group's backref index was computed as `numCaptureGroups + 1`, assuming the synthesized group is appended at the END of the cuDF pattern, but it is positionally inserted at the `$` token. When source capture groups appear AFTER `$`, the synthesized index is less than `numCaptureGroups + 1`. Pre-fix, the two bugs cancelled out for some patterns; fixing only the first regressed `AST fuzz test - regexp_replace` on pattern NVIDIA#3712 `(?:\x{f91b}\>|[\02:#-,]$(\D))*`. The fix replaces the static formula with a running counter `emittedCaptureGroups` that is incremented at every cuDF capture-group emission site (source `(...)` groups, `\R` rewrite, `(\A)+` collapse, and the synthesized line-anchor group itself), reset at the top of `getTranspiledAST`. Tests: - `RegularExpressionTranspilerSuite::issue-14735` (NVIDIA#14735): CPU-GPU equality on `(foo)|(bar)$`, `(a)|(b)`, `(a)+`, `(ab)+|(cd)`. - `RegularExpressionTranspilerSuite::issue-14855` (NVIDIA#14855): direct assertion on the transpiled replacement's trailing synthesized backref for six `$`-positional patterns. - `regexp_test.test_re_replace_backrefs`: extended `parametrize` inline with the NVIDIA#14735 patterns. - AST fuzz `regexp_replace` (which originally failed on NVIDIA#3712) now passes. Mvn summary: Tests: succeeded 102, failed 0, canceled 6, ignored 0, pending 0. Python IT: 4 passed, 39858 deselected. ### Review notes - `emittedCaptureGroups` is incremented unconditionally even outside replace mode; safe (counter only read at the `$` synth site, guarded by `RegexReplaceMode`) but flagged by reviewer for cosmetic coupling. - Three `(\A)+` repetition-collapse arms now have near-identical `if (capture) emittedCaptureGroups += 1` insertions; a future refactor could extract a helper. - The `issue-14855` test asserts the transpiled output structure rather than running on GPU. The bug is only reachable end-to-end via patterns `checkUnsupported` rejects (e.g. `$\D`, `$(\D)` outside a Choice); the AST fuzz test exercises the only generative shape that bypasses that gate, so the existing structural assertion + the fuzz test together pin the fix. Closes NVIDIA#14735 Closes NVIDIA#14855 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Allen Xu <allxu@nvidia.com>
|
build |
There was a problem hiding this comment.
Pull request overview
Fixes two intertwined correctness bugs in the regex parser/transpiler that affect regexp_replace capture-group accounting and $ line-anchor rewrite backref indexing, ensuring GPU behavior matches Java/Spark semantics even for patterns with alternation and quantifiers.
Changes:
- Fix
countCaptureGroupsto recurse intoRegexChoiceandRegexRepetition, preventing under-counted capture groups and misindexed$Nreplacements. - Replace the
$-anchor synthesized backref index formula with a runningemittedCaptureGroupscounter that tracks cuDF capture-group emission order. - Add targeted Scala + Python integration tests covering the reported failing shapes and regressions.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala |
Correct capture-group counting and compute $-rewrite synthesized backref indices based on emitted cuDF group order. |
tests/src/test/scala/com/nvidia/spark/rapids/RegularExpressionTranspilerSuite.scala |
Add regression tests for capture-group counting under choice/repetition and $-anchor backref indexing with groups after $. |
integration_tests/src/main/python/regexp_test.py |
Extend integration coverage for backrefs with alternation and repetition patterns (issue #14735). |
Greptile SummaryThis PR fixes two interleaved bugs in
Confidence Score: 4/5Safe to merge; the two-bug fix is logically consistent and covered by targeted unit and integration tests. The emittedCaptureGroups running counter correctly tracks pre-order group openings across all emission sites. The reset at getTranspiledAST entry handles sequential reuse. The only non-blocking concern is the mutable instance variable making CudfRegexTranspiler non-thread-safe for concurrent callers; in current Spark usage this path is single-threaded, so it is not an active defect. No files require special attention; the main Scala change in RegexParser.scala is worth a second read-through to verify the pre-order counter placement at every group-open site. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["transpile(pattern, extractIndex, repl)"] --> B["getTranspiledAST(pattern, ...)"]
B --> C["emittedCaptureGroups = 0"]
C --> D["countCaptureGroups(regex)\n[now recurses into RegexChoice + RegexRepetition]"]
D --> E["parseReplacement(numCaptureGroups)"]
E --> F["rewrite(regex, replacement, ...)"]
F --> G{AST node type}
G -->|"RegexGroup(capture=true, term)"| H["emittedCaptureGroups += 1\nthen recurse into term"]
G -->|"RegexChar('$') in ReplaceMode"| I["emittedCaptureGroups += 1\nappendBackref(emittedCaptureGroups)\nemit (\r\n)? group"]
G -->|"RegexEscaped('R') — \\R"| J["emittedCaptureGroups += 1\nemit capture group"]
G -->|"(\\A)+ collapse arms"| K["if capture: emittedCaptureGroups += 1\nemit RegexGroup"]
G -->|"RegexChoice(l, r)"| L["rewrite(l) then rewrite(r)\ncounter accumulates across both branches"]
G -->|"RegexRepetition(base, q)"| M["rewrite(base) — RegexGroup arm\nwill increment if capture=true"]
H --> F
L --> F
M --> F
Reviews (1): Last reviewed commit: "Fix regex transpiler off-by-one in line-..." | Re-trigger Greptile |
| // getTranspiledAST. Used to compute the cuDF index of the synthesized (\r\n)? group | ||
| // inserted at the position of `$` so the appended backref points at the right group even | ||
| // when source capture groups appear AFTER `$` (issues #14855 and #14735). | ||
| private var emittedCaptureGroups: Int = 0 |
There was a problem hiding this comment.
Mutable instance state is not thread-safe for concurrent callers
emittedCaptureGroups is a private var on the CudfRegexTranspiler instance, reset at the top of getTranspiledAST for sequential reuse. If two threads call transpile() (or getTranspiledAST) on the same instance concurrently — e.g., if a transpiler is accidentally cached at a scope shared across tasks — they will corrupt each other's counter, producing wrong backref indices silently. Spark planning is single-threaded so this is unlikely to bite in production today, but making emittedCaptureGroups a local variable threaded through rewrite calls (or scoping it inside getTranspiledAST) would eliminate the risk entirely and also remove the need for the reset guard.
There was a problem hiding this comment.
Good point on the abstract risk. Looking at the actual call sites, CudfRegexTranspiler is instantiated per-call in every Spark GpuOverrides hot path (e.g. GpuRLike, GpuRegExpReplace, GpuStringSplit); we never cache a single instance across tasks/threads. The emittedCaptureGroups = 0 reset on line 744 exists specifically to keep that contract honest when the same test instance is reused inside RegularExpressionTranspilerSuite (which does transpiler.transpile(a); transpiler.transpile(b); in a single thread), not because real production code shares an instance across threads.
Lifting the counter to a local threaded through rewrite would touch all 8 mutation sites (1116, 1118, 1272, 1444, 1463, 1482, 1598) in recursive paths that already pass several other contextual parameters (previous, flags, RegexRewriteFlags, mode-specific state). The transpiler also has other instance-level mutable state (the parser allocator inside RegexParser), so even after this change concurrency wouldn't be safe — we'd be making one specific field safe without changing the broader contract.
I'd prefer to keep the per-call reset for now and file the broader thread-safety refactor as a separate cleanup if/when we ever cache transpiler instances. Open to revisiting if you're aware of a code path that does share instances across threads — I couldn't find one in git grep CudfRegexTranspiler, all uses are new CudfRegexTranspiler(mode).transpile(...) inline.
|
NOTE: release/26.06 has been created from main. Please retarget your PR to release/26.06 if it should be included in the release. |
revans2
left a comment
There was a problem hiding this comment.
I also found another related issue. The code that rejects $ inside of a capture group can be confused by choices or other operations within a capture group. (a$|b)(c) is accepted, but will produce incorrect answers, both because if the $ in the capture group capturing too much and also from the renumbering of the capture groups not working properly.
| # used to be invisible to countCaptureGroups, mis-indexing $N. | ||
| 'REGEXP_REPLACE(a, "(TEST)|(\ud720)", "[$1][$2]")', | ||
| 'REGEXP_REPLACE(a, "(TEST)+", "<$1>")', | ||
| 'REGEXP_REPLACE(a, "(T)(E)|(S)(T)", "{$1}{$2}{$3}{$4}")' |
There was a problem hiding this comment.
I believe that we likely still have some bugs in here. T$|(E) or (?:T$|(E)) with a replacement of [$1], I think would produce the wrong answer.
Source Java numbering says $1 is (E). After rewrite, the cuDF pattern effectively has the synthetic $ group before (E): T(\r\n)?$|(E).
Now cuDF group 1 is synthetic, and (E) is group 2.
There was a problem hiding this comment.
You are right — traced through and the bug reproduces. For T$|(E) with replacement [$1] on input "E":
- Java: right branch
(E)matches; group 1 =(E)= "E"; substitute$1→ output[E] - cuDF post-
5af9ff869: right branch(E)matches; cuDF group 1 = synth(\r\n)?(empty in right branch), cuDF group 2 = "E"; user$1emitted verbatim → looks up cuDF $1 → empty → output[]
This is a pre-existing remap gap, not introduced by 5af9ff869. Pre-fix had two cancelling off-by-ones (synth backref written as $(n+1) where n was the total-source count, so for T$|(E) the appended synth backref was $2 which coincidentally pointed at (E) for THIS pattern but was wrong for (?:\x{f91b}\>|[\02:#-,]$(\D))* fuzz #3712). 5af9ff869 fixes the synth-side correctly but leaves user-side $N emitted with Java numbering.
The proper fix is to maintain a sourceGroupToCudfIndex map during transpile (source (...) emits bump both the cuDF counter and a parallel source counter, recording the map entry) and then walk replacement.parts rewriting RegexBackref(N, isNew=false) → RegexBackref(sourceGroupToCudfIndex(N)) before returning. Roughly 30–50 lines + tests.
Two options — let me know which you prefer:
- Extend this PR to add the user-side remap + tests covering
T$|(E),(a)$|(b),(?:T$|(E)). - Land this PR as scoped (synth index + countCaptureGroups recursion), file a follow-up tracking issue for the user-side remap, and optionally add a defensive
checkUnsupportedarm here that rejects patterns where a source capture group is positionally AFTER a$so the GPU falls back to CPU instead of silently producing wrong output until the remap lands.
Slight preference for option 2 from me (keeps this PR scope-tight and the remap fix gets its own focused review), but happy to do option 1 if you would rather see them together.
Summary
Bundled fix for two entangled bugs in the regex parser/transpiler epic (#14733):
RegexChoiceandRegexRepetition; replacement-string semantics wrong for very common patterns #14735 —CudfRegexTranspiler.countCaptureGroupswas missing arms forRegexChoiceandRegexRepetition, so capture groups under|or quantifiers were invisible.numCaptureGroupson the parsedRegexReplacementwas under-counted, mangling$Nreferences.$rewrite. The synthesized(\r\n)?group's backref index was computed asnumCaptureGroups + 1, assuming the synthesized group is appended at the END of the cuDF pattern. In reality it is positionally inserted at the$token, so when source capture groups appear AFTER$, the synthesized index is less thannumCaptureGroups + 1.Pre-fix, the two bugs cancelled out for some patterns. Landing only the #14735 fix regressed
AST fuzz test - regexp_replaceon pattern #3712(?:\x{f91b}\>|[\02:#-,]$(\D))*— the cap-count fix made the synthesized backref dereference(\D)instead of(\r\n)?, leaking a stray\rinto the GPU output. Both fixes ship together.What the fix looks like
countCaptureGroupsgainsRegexChoiceandRegexRepetitionarms (issue's suggested fix, applied verbatim).rr.numCaptureGroups + 1at the$rewrite is replaced with a running counteremittedCaptureGroupsonCudfRegexTranspiler, reset at the top ofgetTranspiledASTand incremented at every cuDF-capture-group emission site (source(...)groups,\Rrewrite,(\A)+collapse arms, and the synthesized(\r\n)?group itself).Tests
RegularExpressionTranspilerSuite::issue-14735: CPU-GPU equality on(foo)|(bar)$,(a)|(b),(a)+,(ab)+|(cd).RegularExpressionTranspilerSuite::issue-14855: direct assertion on the transpiled replacement's trailing synthesized backref for six$-positional shapes.regexp_test.test_re_replace_backrefs: extendedparametrizeinline with [BUG] CudfRegexTranspiler.countCaptureGroups misses arms forRegexChoiceandRegexRepetition; replacement-string semantics wrong for very common patterns #14735 patterns.AST fuzz test - regexp_replace(which originally failed on [auto-merge] branch-21.10 to branch-21.12 [skip ci] [bot] #3712) now passes.Local validation
mvn package -DskipTests -pl sql-plugin -am -Dbuildver=330→BUILD SUCCESS.mvn package -pl tests -am -Dbuildver=330 -DwildcardSuites=RegularExpressionTranspilerSuite→Tests: succeeded 102, failed 0, canceled 6, ignored 0, pending 0.pytest regexp_test.py -k 'test_re_replace_backrefs or test_re_replace_anchors'→4 passed, 39858 deselected.RegexChoiceandRegexRepetition; replacement-string semantics wrong for very common patterns #14735 (regexp_replace(a, '(foo)|(bar)$', '[$1][$2]')on['foo', 'bar']) on the patched dist JAR:GPU result: ['[foo][]', '[][bar]']— matches CPU expectation in the issue body.Closes #14735
Closes #14855
Test plan
Documentation
Performance
🤖 Generated with Claude Code