Skip to content

[AutoSparkUT] Propagate SQL query context for decimal-overflow exceptions (SPARK-39190)#14872

Open
wjxiz1992 wants to merge 1 commit into
NVIDIA:mainfrom
wjxiz1992:fix/14123-query-context-overflow
Open

[AutoSparkUT] Propagate SQL query context for decimal-overflow exceptions (SPARK-39190)#14872
wjxiz1992 wants to merge 1 commit into
NVIDIA:mainfrom
wjxiz1992:fix/14123-query-context-overflow

Conversation

@wjxiz1992
Copy link
Copy Markdown
Collaborator

Closes #14123

Summary

SPARK-39190 / SPARK-39208 / SPARK-39210 (introduced in Spark 3.3) require that SparkArithmeticException for decimal overflow include the original SQL fragment so it survives serialization to executors when WSCG is off. On GPU the captured Origin was being lost across the executor-side plan reconstruction (GpuBindReferencesmapChildrenmakeCopy), so decimal-overflow errors for select d / 0.1, select sum(d), and select avg(d) lacked the SQL fragment that the test's msg.contains(query) assertion required.

What changed

  • RoundingErrorUtil.cannotChangeDecimalPrecisionError now threads the query context via OriginContextShim.queryContext(CurrentOrigin.get) instead of dropping it. The dead context: String = "" parameter is removed; the body previously discarded it.
  • GpuCheckOverflow and GpuCheckOverflowAfterSum snapshot a capturedOrigin field and override withNewChild(ren)Internal to re-enter it around the case-class copy(). Spark's mapChildren / withNewChildren already wrap with CurrentOrigin.withOrigin(this.origin); the override is defensive against reflection paths that bypass that wrap.
  • GpuDecimalSum and GpuDecimalAverageBase capture their own origin and wrap postUpdate / postMerge / evaluateExpression lazy vals with CurrentOrigin.withOrigin(capturedOrigin) so the GpuCheckOverflow* nodes they construct inherit the parser-set context. GpuDecimal128Average applies the same wrap to its own postUpdate / postMerge overrides.
  • columnarEval on GpuCheckOverflow* enters the captured origin so the static GpuCast.checkNFixDecimalBoundsRoundingErrorUtil chain reads it via CurrentOrigin.get at throw time.
  • RapidsTestSettings.scala exclusion for SPARK-39190,SPARK-39208,SPARK-39210: Query context of decimal overflow error should be serialized to executors when WSCG is off is removed.

Why the capture pattern

val capturedOrigin = origin snapshots the parser-set Origin at construction (driver-side, inside RapidsMeta.convertToGpu's withOrigin(wrapped.origin)). The follow-up withNewChild(ren)Internal overrides preserve that snapshot across the case-class copy() that Spark drives via mapChildren/makeCopy, so the SQL context survives all the way to the executor-thrown exception.

Local validation

Inside the worktree (/home/allxu/work/spark-set/spark-rapids-14123-query-ctx):

```
mvn package -pl tests -am -Dbuildver=330 -Dmaven.repo.local=./.mvn-repo \
-DwildcardSuites=org.apache.spark.sql.rapids.suites.RapidsSQLQuerySuite \
-Drapids.test.gpu.allocFraction=0.3 -Drapids.test.gpu.maxAllocFraction=0.3 \
-Drapids.test.gpu.minAllocFraction=0
```

`Tests: succeeded 225, failed 0, canceled 0, ignored 9, pending 0`

Per-test trace for the recovered case (SPARK-39190,SPARK-39208,SPARK-39210: Query context of decimal overflow error should be serialized to executors when WSCG is off, Spark source sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala:4496): all three subqueries (select d / 0.1, select sum(d), select avg(d)) now throw a SparkArithmeticException whose getMessage contains the originating SQL fragment, matching CPU behavior.

Shim compile coverage:

  • mvn package -DskipTests -pl sql-plugin -am -Dbuildver=330: SUCCESS
  • mvn package -DskipTests -pl sql-plugin -am -Dbuildver=340: SUCCESS
  • mvn package -DskipTests -pl sql-plugin -am -Dbuildver=400 -f scala2.13/pom.xml: SUCCESS

Follow-ups (out of scope here)

  • GpuDecimalSum.windowOutput and scanCombine still call GpuCast.checkNFixDecimalBounds without a captured-origin wrap, so window/scan decimal-overflow messages will not carry the SQL fragment. SPARK-39190 does not exercise those paths; can be addressed separately.
  • The capturedOrigin + withNewChild(ren)Internal boilerplate is repeated across four classes. Consolidating into a small GpuOriginCapture trait, or migrating to the GpuCast-style otherCopyArgs + second-parameter-list origin pattern, is a worthwhile refactor follow-up but kept out of this fix to minimize scope.

Documentation

  • Updated for new or modified user-facing features or behaviors
  • No user-facing change

Testing

  • Added or modified tests to cover new code paths
  • Covered by existing tests
  • Not required

Performance

  • Tests ran and results are added in the PR description
  • Issue filed with a link in the PR description
  • Not required

…ions

Closes NVIDIA#14123

SPARK-39190/39208/39210: Spark 3.3 introduced query-context serialization
on executors so SparkArithmeticException messages include the original SQL
fragment.  On GPU, the captured Origin was lost across the executor-side
plan reconstruction (GpuBindReferences -> mapChildren -> makeCopy), so
decimal-overflow errors for `select d / 0.1`, `select sum(d)`, and
`select avg(d)` lacked the SQL fragment that the test's
`msg.contains(query)` assertion required.

Fix:
- `RoundingErrorUtil.cannotChangeDecimalPrecisionError` now threads the
  query context via `OriginContextShim.queryContext(CurrentOrigin.get)`
  instead of dropping it.
- `GpuCheckOverflow` and `GpuCheckOverflowAfterSum` snapshot a
  `capturedOrigin` and override `withNewChild(ren)Internal` to re-enter
  it around the case-class copy, defensively keeping the parser-set
  context attached to instances rebuilt by reflection-based paths.
- `GpuDecimalSum` and `GpuDecimalAverageBase` capture their own origin
  and wrap `postUpdate` / `postMerge` / `evaluateExpression` lazy vals
  with `CurrentOrigin.withOrigin(capturedOrigin)` so the
  `GpuCheckOverflow*` nodes they construct inherit the right context.
  `GpuDecimal128Average` overrides both lazy vals with the same wrap.
- `columnarEval` on `GpuCheckOverflow*` wraps the body in the same
  withOrigin to keep CurrentOrigin set when the static
  `GpuCast.checkNFixDecimalBounds` -> `RoundingErrorUtil` chain throws.
- Remove the SPARK-39190 entry from `RapidsTestSettings.scala` exclusion
  list.

Local validation (in worktree):
- `mvn package -pl tests -am -Dbuildver=330 -DwildcardSuites=...RapidsSQLQuerySuite`:
  `Tests: succeeded 225, failed 0, canceled 0, ignored 9, pending 0`.
- Shim compile coverage: `mvn package -DskipTests -pl sql-plugin -am
  -Dbuildver={330,340,400(-f scala2.13/pom.xml)}` all SUCCESS.

### Review notes (informational, not addressed in this commit)
- Architecture: could adopt the `GpuCast`-style `otherCopyArgs` /
  second-parameter-list `origin` pattern instead of the explicit
  `capturedOrigin` + `withNewChild(ren)Internal` overrides.  Followup.
- Refactor: extract a `GpuOriginCapture` trait to deduplicate the
  capturedOrigin pattern across the four classes.  Followup.
- Test coverage: a unit test directly exercising
  `withNewChild(ren)Internal` to assert capturedOrigin survives rebind
  would catch regressions earlier than the end-to-end SPARK-39190 test.
- Architecture: removing `context: String = ""` from
  `RoundingErrorUtil.cannotChangeDecimalPrecisionError` is a source-
  level signature change.  The parameter was dead (callers passed 3
  args and the body discarded it), so there is no behavior change.
- Architecture: `GpuDecimalSum.windowOutput` and `scanCombine` also call
  `GpuCast.checkNFixDecimalBounds` without a captured-origin wrap and
  will emit empty-context messages on window/scan paths.  Out of scope
  for SPARK-39190 (the test does not exercise window/scan).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
Copilot AI review requested due to automatic review settings May 25, 2026 06:02
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes GPU parity with Spark 3.3+ for ANSI decimal-overflow exceptions by ensuring the SQL query fragment (“query context”) is preserved through plan serialization/reconstruction and is available when the exception is thrown on executors (SPARK-39190 / SPARK-39208 / SPARK-39210).

Changes:

  • Thread SQL query context into decimal precision/overflow exceptions via OriginContextShim.queryContext(CurrentOrigin.get).
  • Capture and re-enter Origin across expression copy()/child-rewrite paths (withNewChildInternal / withNewChildrenInternal) and during evaluation so executor-thrown errors retain the original SQL fragment.
  • Re-enable the previously excluded Spark SQLQuerySuite test in Spark 3.3.0 test settings.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
tests/src/test/spark330/scala/org/apache/spark/sql/rapids/utils/RapidsTestSettings.scala Removes the exclusion for SPARK-39190/39208/39210 now that GPU preserves query context in overflow errors.
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/mathExpressions.scala Updates RoundingErrorUtil to pass query context derived from CurrentOrigin into Rapids/Spark error construction.
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/aggregate/aggregateFunctions.scala Captures Origin in decimal sum/avg paths and ensures constructed overflow-check expressions inherit and preserve SQL context across rewrites/evaluation.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/decimalExpressions.scala Captures/preserves Origin for GpuCheckOverflow across child rewrites and during doColumnar evaluation to keep exception query context intact.

@wjxiz1992 wjxiz1992 self-assigned this May 25, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 25, 2026

Greptile Summary

This PR propagates SQL query context through decimal-overflow exceptions on GPU to achieve parity with SPARK-39190/39208/39210. The root cause was that capturedOrigin was being dropped across GpuBindReferences → mapChildren → makeCopy plan reconstruction on the executor side, so SparkArithmeticException messages for select d / 0.1, select sum(d), and select avg(d) lacked the SQL fragment that CPU exceptions include.

  • GpuCheckOverflow and GpuCheckOverflowAfterSum now snapshot capturedOrigin = origin at construction, override withNewChild(ren)Internal to re-enter it around copy(), and wrap doColumnar/columnarEval with CurrentOrigin.withOrigin(capturedOrigin) so the active origin is correct when RoundingErrorUtil calls OriginContextShim.queryContext(CurrentOrigin.get).
  • GpuDecimalSum, GpuDecimalAverageBase, and GpuDecimal128Average capture their own origin and wrap postUpdate, postMerge, and evaluateExpression lazy vals with withOrigin(capturedOrigin) so the GpuCheckOverflow* nodes they construct inherit the parser-set context.
  • RoundingErrorUtil.cannotChangeDecimalPrecisionError removes its dead context: String = \"\" parameter and now reads OriginContextShim.queryContext(CurrentOrigin.get) directly.
  • The spark330 test exclusion for SPARK-39190/39208/39210 is removed.

Confidence Score: 4/5

The changes are narrowly scoped to origin capture and context threading, touch no GPU memory allocation or resource management, and are validated by the restored SPARK-39190/39208/39210 tests across three shim versions.

One code path in GpuCheckOverflowAfterSum.columnarEval — the ANSI-mode isEmpty overflow branch — still throws a bare ArithmeticException("Overflow in sum of decimals.") that does not carry the SQL query context, while the CPU counterpart would throw via QueryExecutionErrors. The SPARK-39190 test exercises the bounds-check path rather than the isEmpty path, so the gap is not caught by the restored tests. The rest of the fix is mechanically sound and correctly handles the three cases the tests verify.

aggregateFunctions.scala — the ANSI isEmpty-path overflow in GpuCheckOverflowAfterSum.columnarEval (line 874) is worth a follow-up to replace the bare ArithmeticException with a proper SparkArithmeticException carrying the captured SQL context.

Important Files Changed

Filename Overview
sql-plugin/src/main/scala/com/nvidia/spark/rapids/decimalExpressions.scala Adds capturedOrigin snapshot and withNewChildInternal override to GpuCheckOverflow, and wraps doColumnar in CurrentOrigin.withOrigin(capturedOrigin) so RoundingErrorUtil.cannotChangeDecimalPrecisionError reads the parser-set context at throw time.
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/aggregate/aggregateFunctions.scala Adds capturedOrigin snapshot to GpuCheckOverflowAfterSum, GpuDecimalSum, GpuDecimalAverageBase, and GpuDecimal128Average; wraps lazy vals constructing overflow-check nodes with CurrentOrigin.withOrigin(capturedOrigin); overrides withNewChildrenInternal defensively. One gap: the ANSI isEmpty-path overflow still throws a plain ArithmeticException without SQL context.
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/mathExpressions.scala Removes the dead context: String = "" parameter from RoundingErrorUtil.cannotChangeDecimalPrecisionError and replaces it with OriginContextShim.queryContext(CurrentOrigin.get), correctly threading the active origin through the shim-versioned error path.
tests/src/test/spark330/scala/org/apache/spark/sql/rapids/utils/RapidsTestSettings.scala Removes the SPARK-39190/39208/39210 exclusion from the spark330 test settings, enabling the query-context serialization test that the fix restores.

Sequence Diagram

sequenceDiagram
    participant Parser as Spark SQL Parser
    participant Origin as CurrentOrigin (thread-local)
    participant Meta as RapidsMeta.convertToGpu
    participant GPU as GpuCheckOverflow / GpuCheckOverflowAfterSum
    participant Sum as GpuDecimalSum / GpuDecimalAverageBase
    participant RU as RoundingErrorUtil
    participant Shim as OriginContextShim

    Parser->>Origin: set origin (SQL fragment)
    Parser->>Meta: convertToGpu(wrapped)
    Meta->>Origin: withOrigin(wrapped.origin)
    Origin-->>GPU: "origin = parser-set Origin"
    Note over GPU: capturedOrigin = origin (snapshotted)
    Meta->>Sum: "capturedOrigin = origin (snapshotted)"

    Note over Sum: Plan serialized to executor

    Sum->>Origin: withOrigin(capturedOrigin)
    Sum->>GPU: "construct GpuCheckOverflow* inside withOrigin"
    Note over GPU: capturedOrigin = capturedOrigin from Sum

    Note over GPU: columnarEval / doColumnar called on executor
    GPU->>Origin: withOrigin(capturedOrigin)
    GPU->>RU: checkNFixDecimalBounds
    RU->>Origin: CurrentOrigin.get()
    Origin-->>RU: capturedOrigin (parser-set)
    RU->>Shim: OriginContextShim.queryContext(capturedOrigin)
    Shim-->>RU: SQL query context string
    RU-->>GPU: SparkArithmeticException with SQL fragment
Loading

Reviews (1): Last reviewed commit: "[AutoSparkUT] Propagate SQL query contex..." | Re-trigger Greptile

withResource(problem) { problem =>
withResource(problem.any()) { anyProblem =>
if (anyProblem.isValid && anyProblem.getBoolean) {
throw new ArithmeticException("Overflow in sum of decimals.")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 isEmpty-path overflow still lacks SQL query context

The throw new ArithmeticException("Overflow in sum of decimals.") on this line is a plain ArithmeticException — it does not include the SQL query context introduced by SPARK-39190. The CPU equivalent in Spark 3.3 calls QueryExecutionErrors.overflowInSumOfDecimalsError() (a SparkArithmeticException) which carries CurrentOrigin.get().context in the exception message. The capturedOrigin is correctly held in CurrentOrigin at this point (we're inside withOrigin(capturedOrigin)), but the exception construction doesn't read it. The SPARK-39190 test passes because the test's select sum(d) overflow goes through the checkNFixDecimalBounds path (out-of-range result), not the isEmpty path (null from earlier overflow). Any sum overflow that produces null and is detected here would still emit a message without the SQL fragment.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks — you're right that this throw is missing the query context (CPU CheckOverflowInSum.eval at Spark 3.3 calls QueryExecutionErrors.overflowInSumOfDecimalError(queryContext), which carries the SQL fragment). The isEmpty path also isn't exercised by SPARK-39190 (the test's select sum(d) triggers the checkNFixDecimalBounds path, not this one), so I'm leaving it out of scope for this PR — same reason I kept GpuDecimalSum.windowOutput / scanCombine parity in the followup column (see the Review notes block in the PR body). I'll file a separate followup that bundles isEmpty-path + windowOutput + scanCombine context propagation together; thread the fix through RapidsErrorUtils + OriginContextShim.queryContext(CurrentOrigin.get) like the other call sites in this PR.

@wjxiz1992
Copy link
Copy Markdown
Collaborator Author

Follow-ups deferred from this PR are tracked at #14873.

@wjxiz1992
Copy link
Copy Markdown
Collaborator Author

build

Copy link
Copy Markdown
Collaborator

@thirtiseven thirtiseven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(from claude)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[AutoSparkUT]"SPARK-39166: Query context of binary arithmetic should be serialized to executors when WSCG is off" in SQLQuerySuite failed

4 participants