Skip to content

[FEA] Pre-split regex in Project to keep complex patterns on GPU instead of CPU fallback#14888

Draft
wjxiz1992 wants to merge 5 commits into
NVIDIA:mainfrom
wjxiz1992:feat/regex-estimator-stats-hint
Draft

[FEA] Pre-split regex in Project to keep complex patterns on GPU instead of CPU fallback#14888
wjxiz1992 wants to merge 5 commits into
NVIDIA:mainfrom
wjxiz1992:feat/regex-estimator-stats-hint

Conversation

@wjxiz1992
Copy link
Copy Markdown
Collaborator

@wjxiz1992 wjxiz1992 commented May 26, 2026

Closes #14887

Summary

This supersedes the original rowsHint approach on this PR. Local verification showed the catalyst-rowCount hint is inert for the common case: a path-based spark.read.orc/parquet(path) reports rowCount = None even with CBO, so the hint was absent and the regex still fell back to CPU (see the findings comment and the design discussion).

Instead of rejecting a regex whose worst-case NFA scratch (numStates * rowsPerBatch * 2) exceeds spark.rapids.sql.regexp.maxStateMemoryBytes, GpuProjectExec now pre-splits the input batches to a per-batch row budget so the scratch fits, keeping the pattern on the GPU. This is stats-independent (works for path reads) and implements @thirtiseven's PreProjectSplitIterator suggestion from the issue.

How it works

  • RegexComplexityEstimator exposes numStates and adds maxRowsForStateBudget (= maxStateMemoryBytes / (numStates * 2)), estimatedSplits, and isValidWithPreSplit, plus a meta-side GpuRegexNumStates trait.
  • The RLike / RegExpReplace / RegExpExtract / RegExpExtractAll metas carry the transpiled-pattern state count. GpuProjectExec walks its child-expression meta tree for the maximum state count and pre-splits input batches to the budget via PreProjectSplitIterator.
  • The complexity gate is relaxed only for those regex metas inside a ProjectExec, and only within a new cap spark.rapids.sql.regexp.maxPreSplits (default 32). Filter/Join/Aggregate and split-with-regex keep the existing reject-to-CPU behavior. Setting maxPreSplits=1 disables pre-split entirely (kill switch).

Performance

Single RTX 5880 Ada, local[8], Spark 3.3.0, genuine non-rewritable 35-state pattern in a selectExpr Project; sum(...) over the result; median ms; default 2 GiB budget (the common case — no actual splitting needed at these sizes). cpu = vanilla Spark; fallback = today's reject-to-CPU island in an otherwise-GPU plan; presplit = this PR.

expr rows cpu fallback (today) presplit (this PR) vs fallback
rlike 1M 200 260 131 2.0×
rlike 10M 1219 1556 214 7.3×
regexp_replace 1M 255 340 175 1.9×
regexp_replace 10M 1582 1874 367 5.1×
regexp_extract 1M 219 289 136 2.1×
regexp_extract 10M 1461 1898 304 6.2×

The CPU-fallback island is the slowest option at every size (GPU scan → GpuColumnarToRow → CPU regex → GpuRowToColumnar); staying on GPU is 2× faster at 1M and 5–7× at 10M, and GPU scaling is near-flat. Split-overhead curve (10M rows, shrinking the budget to force splits): 0/8/96/992 splits → 208/205/336/1520 ms, i.e. ~1.3 ms/split, reaching break-even with CPU around ~1000 splits — which the maxPreSplits cap keeps you out of.

Validation

  • RegexComplexityEstimatorSuite (buildver 330): unit coverage for numStates, maxRowsForStateBudget, estimatedSplits (ceiling rounding), and the maxPreSplits cap / kill switch.
  • integration_tests/.../regexp_test.py: in-Project pre-split produces GPU==CPU results; maxPreSplits=1 falls back to CPU; an oversized regex in a filter (outside a Project) falls back to CPU.

Documentation

  • Updated for new or modified user-facing features or behaviors
  • No user-facing change

Testing

  • Added or modified tests to cover new code paths
  • Covered by existing tests
  • Not required

Performance

  • Tests ran and results are added in the PR description
  • Issue filed with a link in the PR description
  • Not required

wjxiz1992 and others added 2 commits May 26, 2026 17:21
)

The estimator's numRows previously defaulted to gpuTargetBatchSizeBytes /
StringType.defaultSize (~53M with the 1 GiB-batch / 20-byte defaults),
which over-rejected regex on small inputs even when no real memory risk
existed. Read stats.rowCount / sizeInBytes from the enclosing
SparkPlanMeta's logicalLink and pass it as an optional rowsHint. The
estimator always takes min(perBatchRows, rowsHint), so the hint can
only tighten — never inflate — the memory estimate.

For the 70k-row reproduction in issue NVIDIA#14887 with the
(quick brown|brown fox|...) pattern, catalyst reports the projected
Statistics(sizeInBytes=5.5 KiB) after column pruning, so bySize ≈ 281
and the per-batch ceiling drops from 6.4 GiB to ~35 KB. RLike now stays
on GPU (further rewritten to GpuContainsAny).

Local validation:
- mvn package -pl tests -am -Dbuildver=330 \
    -DwildcardSuites=com.nvidia.spark.rapids.RegexComplexityEstimatorSuite
  Tests: succeeded 5, failed 0
- spark-submit end-to-end with the user's regextestapp.py:
  Statistics(sizeInBytes=5.5 KiB) at the optimized Project; physical plan
  uses GpuRLike → GpuContainsAny; 70k-row query completes on GPU.

### Review notes
nt-code-review reported 0 must-fix, 6 should-fix, 8 suggestions. Applied:
NonFatal in the stats catch, @tailrec short-form, hoisted LongMaxBigInt,
added space after `if`, treat sizeInBytes==0 as a real empty input (0L)
rather than the unknown sentinel. Deferred for follow-up: caching the
hint per SparkPlanMeta (NVIDIA#14887 follow-up if profiling shows the stats
walk is hot), extracting a unit-testable helper, and an integration
test that exercises the meta-tree parent walk end-to-end. One reviewer
finding (drop asInstanceOf[SparkPlan]) was incorrect — the existential
wildcard SparkPlanMeta[_] erases the INPUT <: SparkPlan bound, so the
upcast is required for the return type; reverted with a comment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
Follow-up commit addressing the test-coverage and refactor findings from
the prior review that were deferred to keep the initial diff focused.

Changes:

- Extract `RegexComplexityEstimator.deriveRowsHint(rowCount, sizeInBytes)`
  as a pure helper, so the algorithm is unit-testable directly without
  standing up a real SparkPlan. `GpuRegExpUtils.estimateRowsHintFromStats`
  is now a thin wrapper that does the meta-tree walk and delegates the
  math. The previous "negative sizeInBytes treated as unknown" sentinel
  is preserved; an out-of-range sizeInBytes now returns None rather than
  Some(Long.MaxValue) so the estimator distinguishes "no hint" from
  "hint says unbounded" cleanly.

- Hoist `STRING_DEFAULT_ROW_BYTES` to the estimator object and reuse it
  from the helper. This removes the divergence between
  RegexComplexityEstimator.scala and stringFunctions.scala in how they
  read DataTypes.StringType.defaultSize.

- Add 9 unit tests for `deriveRowsHint` covering: no stats, both
  fields present (min), rowCount-only, sizeInBytes-only,
  sizeInBytes == 0 (empty), rowCount == 0 (empty after agg), negative
  sizeInBytes (unknown sentinel), rowCount overflow, sizeInBytes
  overflow.

Not changed:

- The reviewer's caching suggestion ("cache the hint per SparkPlanMeta
  to avoid O(N) full-subtree stats computations") is a misdiagnosis.
  `LogicalPlan.stats` is a `lazy val` per plan node, so repeated access
  across N regex expressions on the same Project is O(1) after the
  first call. The remaining per-expression cost is the parent-walk
  (parent-pointer chase, bounded by expression tree depth, typically
  1–3 levels) plus a small Option/BigInt allocation, neither of which
  is a meaningful planning-time hot spot. A cache would add a long-lived
  reference to SparkPlanMeta objects with no demonstrated benefit;
  documented inline.

- The proposed integration test (calibrated maxStateMemoryBytes that
  fails without stats and succeeds with) is hard to land reliably
  because the calibration depends on table size and Spark's stat
  estimation rules. The unit tests on `deriveRowsHint` cover the math;
  end-to-end coverage of the parent-walk is provided by the existing
  pyspark validation captured on PR #TBD.

Local validation:
- mvn package -pl tests -am -Dbuildver=330 \
    -DwildcardSuites=com.nvidia.spark.rapids.RegexComplexityEstimatorSuite
  Tests: succeeded 14, failed 0
- Rebuilt dist jar; re-ran the original 70k-row pyspark repro on GPU.
  Statistics(sizeInBytes=5.5 KiB), RLike runs on GPU (rewritten to
  GpuContainsAny) — same as before this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
Copilot AI review requested due to automatic review settings May 26, 2026 09:49
@wjxiz1992 wjxiz1992 marked this pull request as draft May 26, 2026 09:53
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 26, 2026

Greptile Summary

This PR fixes spurious CPU fallbacks for regex patterns on small inputs by feeding Catalyst plan statistics (rowCount/sizeInBytes) to RegexComplexityEstimator.isValid as a rowsHint, replacing the hard-coded worst-case ceiling of ~53 M rows per batch. The estimator always takes min(perBatchRows, rowsHint), so the hint can only tighten — never inflate — the memory estimate; plans with no usable stats are unaffected.

  • RegexComplexityEstimator: adds a rowsHint: Option[Long] parameter to estimateGpuMemory/isValid and a new public deriveRowsHint(rowCount, sizeInBytes) helper that converts BigInt Catalyst stats to a safe Option[Long], handling zero, overflow, and missing-rowCount edge cases.
  • GpuRegExpUtils: wires the hint by walking meta.parent up to the enclosing SparkPlanMeta, reading logicalLink.stats, and falling back gracefully on any NonFatal exception.
  • RegexComplexityEstimatorSuite: adds 9 unit tests covering the deriveRowsHint edge cases directly, without requiring a live SparkPlan.

Confidence Score: 4/5

Safe to merge; the min(perBatchRows, rowsHint) invariant ensures stats can only tighten — never widen — the regex memory estimate, so worst case is the old conservative behaviour.

The change is well-structured and safe: the parent-chain walk is guarded with NonFatal, the asInstanceOf[SparkPlan] is type-sound given SparkPlanMeta[INPUT <: SparkPlan], and the min guardrail prevents any regression for plans with no or inflated stats. The one gap is that the comment in deriveRowsHint claims Spark's Long.MaxValue no-stats sentinel returns None, but the sizeInBytes > LongMaxBigInt check is a strict greater-than so Long.MaxValue itself silently becomes Some(~4.6e17) — harmless in practice but the test named 'unknown sentinel' exercises BigInt(-1) rather than BigInt(Long.MaxValue), leaving the documented boundary untested.

RegexComplexityEstimator.scala lines 125–145 (the deriveRowsHint sentinel comment and boundary condition) and the corresponding test in RegexComplexityEstimatorSuite.scala.

Important Files Changed

Filename Overview
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexComplexityEstimator.scala Adds rowsHint: Option[Long] to estimateGpuMemory/isValid and the new deriveRowsHint pure helper for catalyst stats; logic is sound but the comment misstates how the Long.MaxValue sentinel is handled.
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala Wires estimateRowsHintFromStats into validateRegExpComplexity by walking the RapidsMeta parent chain to the enclosing SparkPlanMeta and reading logicalLink.stats; well-guarded with NonFatal catch and safe min semantics.
tests/src/test/scala/com/nvidia/spark/rapids/RegexComplexityEstimatorSuite.scala Adds 9 new deriveRowsHint edge-case tests; missing coverage for sizeInBytes = BigInt(Long.MaxValue) (the actual Spark "no-stats" sentinel) which the comment claims is treated as None.

Sequence Diagram

sequenceDiagram
    participant EM as ExprMeta (e.g. RLike)
    participant GRU as GpuRegExpUtils
    participant RPM as SparkPlanMeta
    participant LP as LogicalPlan
    participant RCE as RegexComplexityEstimator

    EM->>GRU: validateRegExpComplexity(meta, regex)
    GRU->>GRU: findEnclosingPlan(meta) walk parent chain
    GRU->>RPM: SparkPlanMeta.wrapped
    RPM->>LP: logicalLink (Option[LogicalPlan])
    LP-->>GRU: stats.rowCount, stats.sizeInBytes
    GRU->>RCE: deriveRowsHint(rowCount, sizeInBytes)
    RCE-->>GRU: Option[Long] rowsHint
    GRU->>RCE: isValid(conf, regex, rowsHint)
    RCE->>RCE: min(perBatchRows, rowsHint)
    RCE->>RCE: saturatedMultiply(numStates, numRows, 2)
    RCE-->>GRU: Boolean
    GRU-->>EM: willNotWorkOnGpu (if false)
Loading

Reviews (1): Last reviewed commit: "Address nt-code-review deferred findings..." | Re-trigger Greptile

Comment on lines +125 to +145
// Pure helper that derives a per-batch row hint from catalyst Statistics. Exposed so the
// meta-tree plumbing in `org.apache.spark.sql.rapids.GpuRegExpUtils` can call it AND the
// unit-test suite can exercise the edge cases (zero rows, overflow, missing rowCount, etc.)
// directly without standing up a real SparkPlan / LogicalPlan.
//
// Returns None when sizeInBytes carries no useful information (Spark's "no stats" sentinel
// is `defaultSizeInBytes = Long.MaxValue`; any negative value is treated the same). Returns
// Some(0L) when the relation is genuinely empty (rowCount == 0 or sizeInBytes == 0), in
// which case the estimator short-circuits to 0 memory.
def deriveRowsHint(
rowCount: Option[BigInt],
sizeInBytes: BigInt): Option[Long] = {
val byCount: Option[Long] = rowCount.map { c =>
if (c <= 0) 0L
else if (c > LongMaxBigInt) Long.MaxValue
else c.toLong
}
val bySize: Option[Long] =
if (sizeInBytes < 0) None
else if (sizeInBytes > LongMaxBigInt) None
else Some((sizeInBytes / STRING_DEFAULT_ROW_BYTES).toLong)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Long.MaxValue sentinel is not filtered out, but the comment says it is

The header comment states "Returns None when sizeInBytes carries no useful information (Spark's 'no stats' sentinel is defaultSizeInBytes = Long.MaxValue; any negative value is treated the same)". However, the check sizeInBytes > LongMaxBigInt is false when sizeInBytes == BigInt(Long.MaxValue) because LongMaxBigInt == BigInt(Long.MaxValue). So the sentinel is silently converted to Some(Long.MaxValue / STRING_DEFAULT_ROW_BYTES) (~4.6e17) instead of None.

This is functionally safe — min(perBatchRows ≈ 53M, 4.6e17) still returns perBatchRows, so the old ceiling is preserved — but the comment is factually wrong, and the test suite does not include a case for sizeInBytes = BigInt(Long.MaxValue) (the test named "unknown sentinel" uses BigInt(-1) instead). Either the guard should be changed to sizeInBytes >= LongMaxBigInt to match the documented intent, or the comment should be corrected to reflect the actual > boundary.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale — already addressed in 66d2734c1 (the previous follow-up commit). After the latest 511860139, the > Long.MaxValue sentinel logic and the negative-sentinel comment are gone entirely, since deriveRowsHint no longer takes sizeInBytes. Only rowCount is used now, with negative values defensively treated as 0L.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR tightens the regex GPU complexity check by incorporating Catalyst plan statistics (rowCount/sizeInBytes) as an optional rowsHint, reducing unnecessary CPU fallback for complex-but-safe regex patterns on small inputs.

Changes:

  • Adds rowsHint: Option[Long] to RegexComplexityEstimator.isValid and introduces deriveRowsHint as a pure helper for turning Catalyst Statistics into a hint.
  • Extracts stats from the enclosing SparkPlan’s logicalLink during expression tagging and passes the derived hint into the estimator.
  • Expands RegexComplexityEstimatorSuite with new tests covering rowsHint behavior and deriveRowsHint edge cases.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
tests/src/test/scala/com/nvidia/spark/rapids/RegexComplexityEstimatorSuite.scala Adds unit tests for rowsHint and deriveRowsHint edge cases.
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala Computes a rowsHint from enclosing plan stats and feeds it into regex complexity validation.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexComplexityEstimator.scala Extends estimator API to accept rowsHint and adds deriveRowsHint helper + constant.

Comment on lines +1158 to +1173
// Walks up `meta.parent` to the enclosing SparkPlanMeta, reads catalyst stats from its
// linked LogicalPlan, and delegates to RegexComplexityEstimator.deriveRowsHint for the
// actual math. Returns None when no logical link is available (e.g. synthetic plans) or
// when stats access throws. The estimator always takes min(perBatchRows, rowsHint), so a
// loose hint is safe — it can only tighten, never inflate, the memory estimate.
//
// LogicalPlan.stats is a lazy val per plan node, so repeated access across multiple regex
// expressions on the same Project is O(1) after the first read.
private def estimateRowsHintFromStats(meta: ExprMeta[_]): Option[Long] = {
findEnclosingPlan(meta).flatMap(_.logicalLink).flatMap { logical =>
val stats = try {
Some(logical.stats)
} catch {
case NonFatal(_) => None
}
stats.flatMap(s => RegexComplexityEstimator.deriveRowsHint(s.rowCount, s.sizeInBytes))
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 511860139. You are right — when stats come from a Project, sizeInBytes reflects the OUTPUT byte width (regex result + a small key for example) rather than the regex INPUT column. Deriving rows from that ratio could under-estimate and let the per-batch min clamp accept patterns that should be rejected.

deriveRowsHint now takes only rowCount: Option[BigInt] and returns None when unavailable. The sizeInBytes parameter, the sizeInBytes / STRING_DEFAULT_ROW_BYTES fallback, the STRING_DEFAULT_ROW_BYTES constant, and the four sizeInBytes-specific test cases are all gone.

Verified: Tests: succeeded 10, failed 0, canceled 0, ignored 0, pending 0 (com.nvidia.spark.rapids.RegexComplexityEstimatorSuite).

Comment on lines +1161 to +1162
// when stats access throws. The estimator always takes min(perBatchRows, rowsHint), so a
// loose hint is safe — it can only tighten, never inflate, the memory estimate.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 511860139. After the Thread #2 fix (drop sizeInBytes fallback), the rowsHint is now an exact rowCount from catalyst — not a sizeInBytes heuristic — so it is guaranteed to be an upper bound on the rows the regex will see in any single batch. The safety claim in the doc comment is now factually correct as written (updated wording: "the hint, when present, is an exact rowCount from catalyst, so the estimators min(perBatchRows, rowsHint) clamp can only tighten the memory estimate, never inflate").

Comment on lines +1179 to +1181
// The existential wildcard erases the INPUT <: SparkPlan bound at the pattern site, so the
// upcast is needed to satisfy the return type. See nt-code-review finding rebuttal in the
// PR description.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 511860139. Replaced the external nt-code-review reference with a self-contained comment that explains the Scala compiler limitation directly:

// The existential wildcard `SparkPlanMeta[_]` erases the INPUT <: SparkPlan bound at the
// pattern site, so the upcast is needed to satisfy the return type.
case sp: SparkPlanMeta[_] => Some(sp.wrapped.asInstanceOf[SparkPlan])

The PR description is no longer load-bearing for understanding the upcast.

else c.toLong
}
val bySize: Option[Long] =
if (sizeInBytes < 0) None
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale — the sizeInBytes sentinel branch is gone in 511860139. deriveRowsHint now only takes rowCount; the > LongMaxBigInt and < 0 sentinel arms for sizeInBytes were removed along with the parameter itself. Negative rowCount is defensively treated as 0L; sentinel handling for rowCount (which catalyst returns as None when unavailable) is unchanged from the original.

Comment on lines +76 to +77
test("deriveRowsHint returns None when no stats are available") {
assert(RegexComplexityEstimator.deriveRowsHint(None, BigInt(-1)).isEmpty)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale — the sizeInBytes-sentinel test rewrite is moot in 511860139. deriveRowsHint no longer accepts sizeInBytes, so the four sizeInBytes-specific test cases (fallback, min(byCount, bySize), unknown-sentinel, overflow) were removed. The remaining test set covers rowCount-only edge cases: missing, present, zero, negative (defensive 0L), and overflow to Long.MaxValue.

Copy link
Copy Markdown
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am quite concerned about the accuracy of this. The metrics in spark that you are using are notoriously inaccurate, even when CBO is enabled, which no one does. For 70k rows is the GPU even the right choice generally? What is the performance difference?

@wjxiz1992
Copy link
Copy Markdown
Collaborator Author

@revans2 Thanks for pointing this out! I did see the problem of performance and cpu vs. gpu choice during local tests. I put more info in the issue link to get more discussion.

The issue is raised by QA team, their tests failed due to the regex expression fallback to CPU so they were challenging like "why even a 70k rows table forces the regex operation fallback to CPU"? Then I started the whole story. For short workaround I've suggested they can modify the spark.rapids.sql.regexp.maxStateMemoryBytes to make it work on GPU.

For production case, I also think current configs/algo is fine, but not 100% sure it's ok to convince QA.

…VIDIA#14887)

Copilot review surfaced an under-estimation risk: when `rowCount` was
missing, the helper fell back to `sizeInBytes / STRING_DEFAULT_ROW_BYTES`
on the enclosing Project's stats. That sizeInBytes reflects the Project's
OUTPUT byte width — regex result + a small key, for example — not the
regex INPUT column's width, so the derived row count could be far below
actual rows and the per-batch `min` clamp would silently let oversized
patterns through to the GPU.

Changes:
- `deriveRowsHint` now takes only `rowCount: Option[BigInt]` and returns
  None when it is unavailable. The unused `sizeInBytes` parameter and the
  `STRING_DEFAULT_ROW_BYTES` constant are removed.
- `stringFunctions.estimateRowsHintFromStats` updates the caller and its
  doc comment to reflect that the hint is now an exact rowCount, so the
  estimator's `min(perBatchRows, rowsHint)` clamp can only tighten the
  memory estimate, never inflate (the safety claim Copilot flagged).
- `findEnclosingPlan` comment no longer references the external
  nt-code-review thread — explains the Scala existential-wildcard upcast
  inline.
- `RegexComplexityEstimatorSuite` drops the four sizeInBytes-specific
  test cases (fallback, min(byCount, bySize), unknown-sentinel, overflow)
  and keeps the rowCount-only edge cases.

Mvn summary: Tests: succeeded 10, failed 0, canceled 0, ignored 0, pending 0
  (com.nvidia.spark.rapids.RegexComplexityEstimatorSuite, buildver=330).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
@wjxiz1992
Copy link
Copy Markdown
Collaborator Author

wjxiz1992 commented May 28, 2026

@revans2 — on your stats-accuracy concern from the top-level review: agreed that Spark stats are unreliable in general, which is now reflected in the implementation. The rowsHint is intentionally conservative on three axes:

  1. It is optional — when catalyst returns rowCount = None (no CBO / no source stats), no hint is passed and the estimator falls back to the per-batch ceiling, exactly as before this PR.
  2. It is min-clamped with the per-batch ceiling, so a stale-large rowCount cannot inflate the memory estimate. A loose hint is always safe.
  3. It now derives only from exact rowCount (this commit) — not from the sizeInBytes / STRING_DEFAULT_ROW_BYTES heuristic that could under-estimate on a Project and falsely allow oversized patterns.

So the failure mode is: stats CLAIM 70k rows when actual is, say, 5M. In that case the estimator under-estimates memory; the per-batch ceiling still bounds at ≈53M rows / batch, which is the original pre-PR behavior; and if the regex actually overflows at runtime the cuDF kernel still throws — we have not removed any safety net. The hint can only move the rejection threshold tighter for cases where stats are accurate enough to make a small-input regex regress to CPU. Tightening is opt-in via existing stats infrastructure; if you want a stronger guard I can add a spark.rapids.sql.regexp.useCatalystRowsHint kill-switch defaulting to true so it can be disabled at a query level.

CI is green (49/49). Ready for re-review when you have a moment.

@wjxiz1992
Copy link
Copy Markdown
Collaborator Author

@revans2 — ran the GPU-vs-CPU comparison you asked for, and while doing it I verified what the current code actually does on the repro. Both are below, and I think the second part changes the plan.

Performance

Single RTX 5880 Ada, local[8], Spark 3.3.0, the #14887 pattern over ORC. Timed action is sum(cast(rlike(...) as int)) (forces the regex on every row); 7 runs, first 2 discarded, median shown. GPU-vs-CPU routing is forced via spark.rapids.sql.regexp.maxStateMemoryBytes so these are pure execution times. Caveats up front: this is the literal-alternation pattern that the rlike-rewrite turns into GpuContainsAny (fast — a true NFA pattern would have higher GPU absolute times); and this is single-host / warm-cache / single-tenant GPU, i.e. not the high-tenancy regime where the OOM risk actually bites.

cpu = vanilla Spark · fallback = RAPIDS on but regex rejected → CPU island in an otherwise-GPU plan (today's behavior for this pattern) · gpu = regex on GPU (what the hint is meant to enable).

rows cpu fallback (today) gpu gpu vs cpu
70k 74 ms 151 ms 157 ms 0.5× (~83 ms slower)
1M 243 ms 315 ms 126 ms 1.9×
10M 1769 ms 2109 ms 198 ms 8.9×
  1. You're right at 70k in isolation — CPU wins by ~80 ms; the crossover is below 1M rows.
  2. But the gate's rejection never delivers that CPU path. When it rejects the regex the rest of the plan stays on GPU, so you get the fallback island (GpuScan → GpuColumnarToRow → CPU RLike → GpuRowToColumnar → GpuHashAgg) — slower than pure CPU and no faster than GPU at every size. It's strictly dominated.
  3. The gate is row-count-blind, so the same pattern is rejected identically at 70k and at 10M — but at 10M GPU is ~9× faster than CPU and ~10× faster than the fallback. GPU stays ~flat (126→198 ms) while CPU scales linearly.

So "is GPU the right choice at 70k?" — marginally no in isolation, but the rejection produces the worst of the three outcomes, not the CPU one, and the same blind rule blacklists the large-N cases where GPU wins big.

…but the current implementation is inert for the repro

You were more right about stats than the PR assumed. I rebuilt at HEAD (511860139, rowCount-only after the sizeInBytes fallback was dropped) and checked whether the plan actually flips, with default maxStateMemoryBytes:

source Statistics.rowCount regex flips to GPU?
spark.read.orc(path), CBO off None no
spark.read.orc(path), CBO on None no
CREATE TABLE … + ANALYZE … COMPUTE STATISTICS 70000 yes

A path-based file read reports no rowCount even with CBO on, because there is no catalog entry to hold it. So as written the hint only helps ANALYZE-d catalog tables and is a no-op for path reads — including this issue's own repro. (The plan-flip in the PR description was generated with the first commit, which still had the sizeInBytes path; that's stale now.) For file scans the useful stat isn't just inaccurate, it's absent — which is your point, sharpened.

Where I'd like your design input

Given that, I don't want to ship the hint as-is. The options I currently see:

  • (a) Re-introduce a sizeInBytes signal, but read it from the leaf scan (the regex input column) rather than a Project output, with a perBatchRows/10 floor so a compressed-size underestimate can't open the gate too far. Smaller change, still a heuristic.
  • (b) @thirtiseven's idea: feed the regex estimate into PreProjectSplitIterator so batches are pre-split to the regex budget. No stats dependency, works for path reads, and bounds per-batch memory directly — which addresses the OOM concern far better than any stats heuristic.

I'm leaning (b) as the actual fix, with the perf data above as the justification for bothering at all. But you have the most context on the regex memory model and why the per-batch ceiling is shaped the way it is — is there a more sensible design than either of these? Happy to rework toward whatever you think is right (including "the ceiling is fine, just close this").

wjxiz1992 and others added 2 commits May 29, 2026 15:46
Instead of rejecting a regex whose worst-case NFA scratch
(numStates * rowsPerBatch * 2) exceeds spark.rapids.sql.regexp.maxStateMemoryBytes,
GpuProjectExec now pre-splits input batches to a per-batch row budget so the
scratch fits, keeping the pattern on the GPU instead of falling back to CPU.

- RegexComplexityEstimator: expose numStates; add maxRowsForStateBudget,
  estimatedSplits, isValidWithPreSplit, and the GpuRegexNumStates meta trait.
- New conf spark.rapids.sql.regexp.maxPreSplits (default 32; 1 disables pre-split)
  and the regenerated advanced_configs.md entry.
- The RLike / RegExpReplace / RegExpExtract / RegExpExtractAll metas mix
  GpuRegexNumStates; GpuProjectExec walks the child-expr meta tree for the max
  state count and pre-splits to the budget via PreProjectSplitIterator.
- Gate relaxed only for these regex metas inside a Project and only within the
  maxPreSplits cap; Filter/Join/Aggregate and split-with-regex stay as before.
- GpuProjectExec gains a 4th (defaulted) field; update the 3-arg GpuProjectExec
  extractor patterns in the delta-lake providers to the new arity.
- Tests: RegexComplexityEstimatorSuite covers the estimator math and the cap;
  regexp_test.py covers in-Project pre-split GPU==CPU, the maxPreSplits=1 kill
  switch, and outside-Project (filter) CPU fallback.

### Review notes (nt-code-review; informational, deferred)
- estimatedSplits / isValid / numStates each call countStates on the same AST at
  planning time (pure; left for clarity).
- The per-meta numStates field + override is repeated in 4 metas; could be hoisted
  into GpuRegexNumStates as a concrete field.
- ceilDiv is duplicated in PreProjectSplitIterator and estimatedSplits.
- countStates' SimpleQuantifier inner match has no default case (unchanged code;
  the parser only ever produces * + ?, so it is unreachable today).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
Supersedes the catalyst rowsHint approach (verified inert for path-based file
reads) with the GpuProjectExec pre-split implementation. The net diff versus
main is the pre-split change only; the earlier rowsHint commits remain in
history but their content is replaced by this merge's tree.

Signed-off-by: Allen Xu <allxu@nvidia.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@wjxiz1992 wjxiz1992 changed the title [FEA] Use catalyst stats as rowsHint to avoid regex CPU fallback on small inputs [FEA] Pre-split regex in Project to keep complex patterns on GPU instead of CPU fallback May 29, 2026
@revans2
Copy link
Copy Markdown
Collaborator

revans2 commented May 29, 2026

@wjxiz1992 I did a bit more digging and I think the right fix is to remove the complexity estimation entirely. It was added in as a part of #4061 to avoid the need for reserve stack space (Effectively it spilled from the kernel to GPU memory and that would cause problems with RMM using all of the memory). But rapidsai/cudf#10600 changed it so that the extra memory is now allocated through RMM. So the entire point of having the complexity stopped mattering 4 years ago and we have been wasting our time on it ever since. I think that the right fix now is to delete the complexity gate entirely and see what happens.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEA] Use catalyst stats as rowsHint in RegexComplexityEstimator to avoid CPU fallback on small inputs

4 participants