Skip to content

[BUG] Fix regex transpiler line-anchor backref index + countCaptureGroups recursion (#14735 #14855)#14856

Open
wjxiz1992 wants to merge 1 commit into
NVIDIA:mainfrom
wjxiz1992:fix/regex-14735
Open

[BUG] Fix regex transpiler line-anchor backref index + countCaptureGroups recursion (#14735 #14855)#14856
wjxiz1992 wants to merge 1 commit into
NVIDIA:mainfrom
wjxiz1992:fix/regex-14735

Conversation

@wjxiz1992
Copy link
Copy Markdown
Collaborator

Summary

Bundled fix for two entangled bugs in the regex parser/transpiler epic (#14733):

Pre-fix, the two bugs cancelled out for some patterns. Landing only the #14735 fix regressed AST fuzz test - regexp_replace on pattern #3712 (?:\x{f91b}\>|[\02:#-,]$(\D))* — the cap-count fix made the synthesized backref dereference (\D) instead of (\r\n)?, leaking a stray \r into the GPU output. Both fixes ship together.

What the fix looks like

  • countCaptureGroups gains RegexChoice and RegexRepetition arms (issue's suggested fix, applied verbatim).
  • The static formula rr.numCaptureGroups + 1 at the $ rewrite is replaced with a running counter emittedCaptureGroups on CudfRegexTranspiler, reset at the top of getTranspiledAST and incremented at every cuDF-capture-group emission site (source (...) groups, \R rewrite, (\A)+ collapse arms, and the synthesized (\r\n)? group itself).

Tests

Local validation

Closes #14735
Closes #14855

Test plan

  • Added or modified tests to cover new code paths
  • Covered by existing tests
  • Not required

Documentation

  • Updated for new or modified user-facing features or behaviors
  • No user-facing change

Performance

  • Tests ran and results are added in the PR description
  • Issue filed with a link in the PR description
  • Not required

🤖 Generated with Claude Code

…reGroups

`CudfRegexTranspiler.countCaptureGroups` was missing arms for `RegexChoice`
and `RegexRepetition`, so capture groups under `|` or quantifiers were
invisible. Adding those arms exposed a second, latent off-by-one in the
line-anchor `$` rewrite: the synthesized `(\r\n)?` group's backref index
was computed as `numCaptureGroups + 1`, assuming the synthesized group is
appended at the END of the cuDF pattern, but it is positionally inserted
at the `$` token. When source capture groups appear AFTER `$`, the
synthesized index is less than `numCaptureGroups + 1`. Pre-fix, the two
bugs cancelled out for some patterns; fixing only the first regressed
`AST fuzz test - regexp_replace` on pattern NVIDIA#3712
`(?:\x{f91b}\>|[\02:#-,]$(\D))*`.

The fix replaces the static formula with a running counter
`emittedCaptureGroups` that is incremented at every cuDF capture-group
emission site (source `(...)` groups, `\R` rewrite, `(\A)+` collapse, and
the synthesized line-anchor group itself), reset at the top of
`getTranspiledAST`.

Tests:
- `RegularExpressionTranspilerSuite::issue-14735` (NVIDIA#14735): CPU-GPU equality
  on `(foo)|(bar)$`, `(a)|(b)`, `(a)+`, `(ab)+|(cd)`.
- `RegularExpressionTranspilerSuite::issue-14855` (NVIDIA#14855): direct assertion
  on the transpiled replacement's trailing synthesized backref for six
  `$`-positional patterns.
- `regexp_test.test_re_replace_backrefs`: extended `parametrize` inline with
  the NVIDIA#14735 patterns.
- AST fuzz `regexp_replace` (which originally failed on NVIDIA#3712) now passes.

Mvn summary: Tests: succeeded 102, failed 0, canceled 6, ignored 0, pending 0.
Python IT: 4 passed, 39858 deselected.

### Review notes
- `emittedCaptureGroups` is incremented unconditionally even outside replace
  mode; safe (counter only read at the `$` synth site, guarded by
  `RegexReplaceMode`) but flagged by reviewer for cosmetic coupling.
- Three `(\A)+` repetition-collapse arms now have near-identical
  `if (capture) emittedCaptureGroups += 1` insertions; a future refactor
  could extract a helper.
- The `issue-14855` test asserts the transpiled output structure rather than
  running on GPU. The bug is only reachable end-to-end via patterns
  `checkUnsupported` rejects (e.g. `$\D`, `$(\D)` outside a Choice); the AST
  fuzz test exercises the only generative shape that bypasses that gate, so
  the existing structural assertion + the fuzz test together pin the fix.

Closes NVIDIA#14735
Closes NVIDIA#14855

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
Copilot AI review requested due to automatic review settings May 22, 2026 05:15
@wjxiz1992
Copy link
Copy Markdown
Collaborator Author

build

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes two intertwined correctness bugs in the regex parser/transpiler that affect regexp_replace capture-group accounting and $ line-anchor rewrite backref indexing, ensuring GPU behavior matches Java/Spark semantics even for patterns with alternation and quantifiers.

Changes:

  • Fix countCaptureGroups to recurse into RegexChoice and RegexRepetition, preventing under-counted capture groups and misindexed $N replacements.
  • Replace the $-anchor synthesized backref index formula with a running emittedCaptureGroups counter that tracks cuDF capture-group emission order.
  • Add targeted Scala + Python integration tests covering the reported failing shapes and regressions.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala Correct capture-group counting and compute $-rewrite synthesized backref indices based on emitted cuDF group order.
tests/src/test/scala/com/nvidia/spark/rapids/RegularExpressionTranspilerSuite.scala Add regression tests for capture-group counting under choice/repetition and $-anchor backref indexing with groups after $.
integration_tests/src/main/python/regexp_test.py Extend integration coverage for backrefs with alternation and repetition patterns (issue #14735).

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 22, 2026

Greptile Summary

This PR fixes two interleaved bugs in CudfRegexTranspiler: countCaptureGroups was missing RegexChoice and RegexRepetition arms (causing $N backrefs to misindex under | or quantifiers), and the synthesized ( )? group inserted at $ used a static formula numCaptureGroups + 1 that was off-by-one when source capture groups appeared after $. The two bugs previously cancelled out for some patterns, so both fixes ship together.

  • countCaptureGroups gains RegexChoice and RegexRepetition cases so all capture groups nested under alternation or quantifiers are counted correctly.
  • A running emittedCaptureGroups counter (reset per getTranspiledAST call, incremented pre-order at every cuDF group-open site) replaces the static formula, correctly computing the synthesized backref index regardless of whether source groups appear before or after $.
  • New Scala unit tests verify direct transpiler output (backref index assertions) and CPU/GPU behavioral equality; existing Python integration tests are extended with RegexChoice/RegexRepetition patterns.

Confidence Score: 4/5

Safe to merge; the two-bug fix is logically consistent and covered by targeted unit and integration tests.

The emittedCaptureGroups running counter correctly tracks pre-order group openings across all emission sites. The reset at getTranspiledAST entry handles sequential reuse. The only non-blocking concern is the mutable instance variable making CudfRegexTranspiler non-thread-safe for concurrent callers; in current Spark usage this path is single-threaded, so it is not an active defect.

No files require special attention; the main Scala change in RegexParser.scala is worth a second read-through to verify the pre-order counter placement at every group-open site.

Important Files Changed

Filename Overview
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala Core fix: adds RegexChoice/RegexRepetition arms to countCaptureGroups and introduces a mutable emittedCaptureGroups counter reset per getTranspiledAST call; incremented pre-order at all cuDF capture-group-open sites (regular groups, \R rewrite, (\A)+ collapse arms, and the synthesized (\r\n)? group at $). Logic is correct but the instance-level mutable var makes the transpiler non-thread-safe for concurrent callers.
tests/src/test/scala/com/nvidia/spark/rapids/RegularExpressionTranspilerSuite.scala Adds two targeted tests: issue-14735 checks CPU/GPU behavioral equality for RegexChoice and RegexRepetition patterns; issue-14855 asserts the synthesized backref index in the transpiled replacement string for six positional shapes of $ relative to source capture groups.
integration_tests/src/main/python/regexp_test.py Extends test_re_replace_backrefs with three new REGEXP_REPLACE expressions covering RegexChoice and RegexRepetition patterns; copyright year updated to 2026.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["transpile(pattern, extractIndex, repl)"] --> B["getTranspiledAST(pattern, ...)"]
    B --> C["emittedCaptureGroups = 0"]
    C --> D["countCaptureGroups(regex)\n[now recurses into RegexChoice + RegexRepetition]"]
    D --> E["parseReplacement(numCaptureGroups)"]
    E --> F["rewrite(regex, replacement, ...)"]
    F --> G{AST node type}
    G -->|"RegexGroup(capture=true, term)"| H["emittedCaptureGroups += 1\nthen recurse into term"]
    G -->|"RegexChar('$') in ReplaceMode"| I["emittedCaptureGroups += 1\nappendBackref(emittedCaptureGroups)\nemit (\r\n)? group"]
    G -->|"RegexEscaped('R') — \\R"| J["emittedCaptureGroups += 1\nemit capture group"]
    G -->|"(\\A)+ collapse arms"| K["if capture: emittedCaptureGroups += 1\nemit RegexGroup"]
    G -->|"RegexChoice(l, r)"| L["rewrite(l) then rewrite(r)\ncounter accumulates across both branches"]
    G -->|"RegexRepetition(base, q)"| M["rewrite(base) — RegexGroup arm\nwill increment if capture=true"]
    H --> F
    L --> F
    M --> F
Loading

Reviews (1): Last reviewed commit: "Fix regex transpiler off-by-one in line-..." | Re-trigger Greptile

// getTranspiledAST. Used to compute the cuDF index of the synthesized (\r\n)? group
// inserted at the position of `$` so the appended backref points at the right group even
// when source capture groups appear AFTER `$` (issues #14855 and #14735).
private var emittedCaptureGroups: Int = 0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Mutable instance state is not thread-safe for concurrent callers

emittedCaptureGroups is a private var on the CudfRegexTranspiler instance, reset at the top of getTranspiledAST for sequential reuse. If two threads call transpile() (or getTranspiledAST) on the same instance concurrently — e.g., if a transpiler is accidentally cached at a scope shared across tasks — they will corrupt each other's counter, producing wrong backref indices silently. Spark planning is single-threaded so this is unlikely to bite in production today, but making emittedCaptureGroups a local variable threaded through rewrite calls (or scoping it inside getTranspiledAST) would eliminate the risk entirely and also remove the need for the reset guard.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point on the abstract risk. Looking at the actual call sites, CudfRegexTranspiler is instantiated per-call in every Spark GpuOverrides hot path (e.g. GpuRLike, GpuRegExpReplace, GpuStringSplit); we never cache a single instance across tasks/threads. The emittedCaptureGroups = 0 reset on line 744 exists specifically to keep that contract honest when the same test instance is reused inside RegularExpressionTranspilerSuite (which does transpiler.transpile(a); transpiler.transpile(b); in a single thread), not because real production code shares an instance across threads.

Lifting the counter to a local threaded through rewrite would touch all 8 mutation sites (1116, 1118, 1272, 1444, 1463, 1482, 1598) in recursive paths that already pass several other contextual parameters (previous, flags, RegexRewriteFlags, mode-specific state). The transpiler also has other instance-level mutable state (the parser allocator inside RegexParser), so even after this change concurrency wouldn't be safe — we'd be making one specific field safe without changing the broader contract.

I'd prefer to keep the per-call reset for now and file the broader thread-safety refactor as a separate cleanup if/when we ever cache transpiler instances. Open to revisiting if you're aware of a code path that does share instances across threads — I couldn't find one in git grep CudfRegexTranspiler, all uses are new CudfRegexTranspiler(mode).transpile(...) inline.

@sameerz sameerz added the bug Something isn't working label May 22, 2026
@wjxiz1992 wjxiz1992 requested review from NVnavkumar and revans2 May 25, 2026 03:34
@nvauto
Copy link
Copy Markdown
Collaborator

nvauto commented May 25, 2026

NOTE: release/26.06 has been created from main. Please retarget your PR to release/26.06 if it should be included in the release.

Copy link
Copy Markdown
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also found another related issue. The code that rejects $ inside of a capture group can be confused by choices or other operations within a capture group. (a$|b)(c) is accepted, but will produce incorrect answers, both because if the $ in the capture group capturing too much and also from the renumbering of the capture groups not working properly.

# used to be invisible to countCaptureGroups, mis-indexing $N.
'REGEXP_REPLACE(a, "(TEST)|(\ud720)", "[$1][$2]")',
'REGEXP_REPLACE(a, "(TEST)+", "<$1>")',
'REGEXP_REPLACE(a, "(T)(E)|(S)(T)", "{$1}{$2}{$3}{$4}")'
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that we likely still have some bugs in here. T$|(E) or (?:T$|(E)) with a replacement of [$1], I think would produce the wrong answer.
Source Java numbering says $1 is (E). After rewrite, the cuDF pattern effectively has the synthetic $ group before (E): T(\r\n)?$|(E).

Now cuDF group 1 is synthetic, and (E) is group 2.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right — traced through and the bug reproduces. For T$|(E) with replacement [$1] on input "E":

  • Java: right branch (E) matches; group 1 = (E) = "E"; substitute $1 → output [E]
  • cuDF post-5af9ff869: right branch (E) matches; cuDF group 1 = synth (\r\n)? (empty in right branch), cuDF group 2 = "E"; user $1 emitted verbatim → looks up cuDF $1 → empty → output []

This is a pre-existing remap gap, not introduced by 5af9ff869. Pre-fix had two cancelling off-by-ones (synth backref written as $(n+1) where n was the total-source count, so for T$|(E) the appended synth backref was $2 which coincidentally pointed at (E) for THIS pattern but was wrong for (?:\x{f91b}\>|[\02:#-,]$(\D))* fuzz #3712). 5af9ff869 fixes the synth-side correctly but leaves user-side $N emitted with Java numbering.

The proper fix is to maintain a sourceGroupToCudfIndex map during transpile (source (...) emits bump both the cuDF counter and a parallel source counter, recording the map entry) and then walk replacement.parts rewriting RegexBackref(N, isNew=false)RegexBackref(sourceGroupToCudfIndex(N)) before returning. Roughly 30–50 lines + tests.

Two options — let me know which you prefer:

  1. Extend this PR to add the user-side remap + tests covering T$|(E), (a)$|(b), (?:T$|(E)).
  2. Land this PR as scoped (synth index + countCaptureGroups recursion), file a follow-up tracking issue for the user-side remap, and optionally add a defensive checkUnsupported arm here that rejects patterns where a source capture group is positionally AFTER a $ so the GPU falls back to CPU instead of silently producing wrong output until the remap lands.

Slight preference for option 2 from me (keeps this PR scope-tight and the remap fix gets its own focused review), but happy to do option 1 if you would rather see them together.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

5 participants