[AutoSparkUT] Propagate SQL query context for decimal-overflow exceptions (SPARK-39190)#14872
[AutoSparkUT] Propagate SQL query context for decimal-overflow exceptions (SPARK-39190)#14872wjxiz1992 wants to merge 1 commit into
Conversation
…ions Closes NVIDIA#14123 SPARK-39190/39208/39210: Spark 3.3 introduced query-context serialization on executors so SparkArithmeticException messages include the original SQL fragment. On GPU, the captured Origin was lost across the executor-side plan reconstruction (GpuBindReferences -> mapChildren -> makeCopy), so decimal-overflow errors for `select d / 0.1`, `select sum(d)`, and `select avg(d)` lacked the SQL fragment that the test's `msg.contains(query)` assertion required. Fix: - `RoundingErrorUtil.cannotChangeDecimalPrecisionError` now threads the query context via `OriginContextShim.queryContext(CurrentOrigin.get)` instead of dropping it. - `GpuCheckOverflow` and `GpuCheckOverflowAfterSum` snapshot a `capturedOrigin` and override `withNewChild(ren)Internal` to re-enter it around the case-class copy, defensively keeping the parser-set context attached to instances rebuilt by reflection-based paths. - `GpuDecimalSum` and `GpuDecimalAverageBase` capture their own origin and wrap `postUpdate` / `postMerge` / `evaluateExpression` lazy vals with `CurrentOrigin.withOrigin(capturedOrigin)` so the `GpuCheckOverflow*` nodes they construct inherit the right context. `GpuDecimal128Average` overrides both lazy vals with the same wrap. - `columnarEval` on `GpuCheckOverflow*` wraps the body in the same withOrigin to keep CurrentOrigin set when the static `GpuCast.checkNFixDecimalBounds` -> `RoundingErrorUtil` chain throws. - Remove the SPARK-39190 entry from `RapidsTestSettings.scala` exclusion list. Local validation (in worktree): - `mvn package -pl tests -am -Dbuildver=330 -DwildcardSuites=...RapidsSQLQuerySuite`: `Tests: succeeded 225, failed 0, canceled 0, ignored 9, pending 0`. - Shim compile coverage: `mvn package -DskipTests -pl sql-plugin -am -Dbuildver={330,340,400(-f scala2.13/pom.xml)}` all SUCCESS. ### Review notes (informational, not addressed in this commit) - Architecture: could adopt the `GpuCast`-style `otherCopyArgs` / second-parameter-list `origin` pattern instead of the explicit `capturedOrigin` + `withNewChild(ren)Internal` overrides. Followup. - Refactor: extract a `GpuOriginCapture` trait to deduplicate the capturedOrigin pattern across the four classes. Followup. - Test coverage: a unit test directly exercising `withNewChild(ren)Internal` to assert capturedOrigin survives rebind would catch regressions earlier than the end-to-end SPARK-39190 test. - Architecture: removing `context: String = ""` from `RoundingErrorUtil.cannotChangeDecimalPrecisionError` is a source- level signature change. The parameter was dead (callers passed 3 args and the body discarded it), so there is no behavior change. - Architecture: `GpuDecimalSum.windowOutput` and `scanCombine` also call `GpuCast.checkNFixDecimalBounds` without a captured-origin wrap and will emit empty-context messages on window/scan paths. Out of scope for SPARK-39190 (the test does not exercise window/scan). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Allen Xu <allxu@nvidia.com>
There was a problem hiding this comment.
Pull request overview
This PR fixes GPU parity with Spark 3.3+ for ANSI decimal-overflow exceptions by ensuring the SQL query fragment (“query context”) is preserved through plan serialization/reconstruction and is available when the exception is thrown on executors (SPARK-39190 / SPARK-39208 / SPARK-39210).
Changes:
- Thread SQL query context into decimal precision/overflow exceptions via
OriginContextShim.queryContext(CurrentOrigin.get). - Capture and re-enter
Originacross expressioncopy()/child-rewrite paths (withNewChildInternal/withNewChildrenInternal) and during evaluation so executor-thrown errors retain the original SQL fragment. - Re-enable the previously excluded Spark SQLQuerySuite test in Spark 3.3.0 test settings.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| tests/src/test/spark330/scala/org/apache/spark/sql/rapids/utils/RapidsTestSettings.scala | Removes the exclusion for SPARK-39190/39208/39210 now that GPU preserves query context in overflow errors. |
| sql-plugin/src/main/scala/org/apache/spark/sql/rapids/mathExpressions.scala | Updates RoundingErrorUtil to pass query context derived from CurrentOrigin into Rapids/Spark error construction. |
| sql-plugin/src/main/scala/org/apache/spark/sql/rapids/aggregate/aggregateFunctions.scala | Captures Origin in decimal sum/avg paths and ensures constructed overflow-check expressions inherit and preserve SQL context across rewrites/evaluation. |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/decimalExpressions.scala | Captures/preserves Origin for GpuCheckOverflow across child rewrites and during doColumnar evaluation to keep exception query context intact. |
Greptile SummaryThis PR propagates SQL query context through decimal-overflow exceptions on GPU to achieve parity with SPARK-39190/39208/39210. The root cause was that
Confidence Score: 4/5The changes are narrowly scoped to origin capture and context threading, touch no GPU memory allocation or resource management, and are validated by the restored SPARK-39190/39208/39210 tests across three shim versions. One code path in
Important Files Changed
Sequence DiagramsequenceDiagram
participant Parser as Spark SQL Parser
participant Origin as CurrentOrigin (thread-local)
participant Meta as RapidsMeta.convertToGpu
participant GPU as GpuCheckOverflow / GpuCheckOverflowAfterSum
participant Sum as GpuDecimalSum / GpuDecimalAverageBase
participant RU as RoundingErrorUtil
participant Shim as OriginContextShim
Parser->>Origin: set origin (SQL fragment)
Parser->>Meta: convertToGpu(wrapped)
Meta->>Origin: withOrigin(wrapped.origin)
Origin-->>GPU: "origin = parser-set Origin"
Note over GPU: capturedOrigin = origin (snapshotted)
Meta->>Sum: "capturedOrigin = origin (snapshotted)"
Note over Sum: Plan serialized to executor
Sum->>Origin: withOrigin(capturedOrigin)
Sum->>GPU: "construct GpuCheckOverflow* inside withOrigin"
Note over GPU: capturedOrigin = capturedOrigin from Sum
Note over GPU: columnarEval / doColumnar called on executor
GPU->>Origin: withOrigin(capturedOrigin)
GPU->>RU: checkNFixDecimalBounds
RU->>Origin: CurrentOrigin.get()
Origin-->>RU: capturedOrigin (parser-set)
RU->>Shim: OriginContextShim.queryContext(capturedOrigin)
Shim-->>RU: SQL query context string
RU-->>GPU: SparkArithmeticException with SQL fragment
Reviews (1): Last reviewed commit: "[AutoSparkUT] Propagate SQL query contex..." | Re-trigger Greptile |
| withResource(problem) { problem => | ||
| withResource(problem.any()) { anyProblem => | ||
| if (anyProblem.isValid && anyProblem.getBoolean) { | ||
| throw new ArithmeticException("Overflow in sum of decimals.") |
There was a problem hiding this comment.
isEmpty-path overflow still lacks SQL query context
The throw new ArithmeticException("Overflow in sum of decimals.") on this line is a plain ArithmeticException — it does not include the SQL query context introduced by SPARK-39190. The CPU equivalent in Spark 3.3 calls QueryExecutionErrors.overflowInSumOfDecimalsError() (a SparkArithmeticException) which carries CurrentOrigin.get().context in the exception message. The capturedOrigin is correctly held in CurrentOrigin at this point (we're inside withOrigin(capturedOrigin)), but the exception construction doesn't read it. The SPARK-39190 test passes because the test's select sum(d) overflow goes through the checkNFixDecimalBounds path (out-of-range result), not the isEmpty path (null from earlier overflow). Any sum overflow that produces null and is detected here would still emit a message without the SQL fragment.
There was a problem hiding this comment.
Thanks — you're right that this throw is missing the query context (CPU CheckOverflowInSum.eval at Spark 3.3 calls QueryExecutionErrors.overflowInSumOfDecimalError(queryContext), which carries the SQL fragment). The isEmpty path also isn't exercised by SPARK-39190 (the test's select sum(d) triggers the checkNFixDecimalBounds path, not this one), so I'm leaving it out of scope for this PR — same reason I kept GpuDecimalSum.windowOutput / scanCombine parity in the followup column (see the Review notes block in the PR body). I'll file a separate followup that bundles isEmpty-path + windowOutput + scanCombine context propagation together; thread the fix through RapidsErrorUtils + OriginContextShim.queryContext(CurrentOrigin.get) like the other call sites in this PR.
|
Follow-ups deferred from this PR are tracked at #14873. |
|
build |
Closes #14123
Summary
SPARK-39190 / SPARK-39208 / SPARK-39210 (introduced in Spark 3.3) require that
SparkArithmeticExceptionfor decimal overflow include the original SQL fragment so it survives serialization to executors when WSCG is off. On GPU the capturedOriginwas being lost across the executor-side plan reconstruction (GpuBindReferences→mapChildren→makeCopy), so decimal-overflow errors forselect d / 0.1,select sum(d), andselect avg(d)lacked the SQL fragment that the test'smsg.contains(query)assertion required.What changed
RoundingErrorUtil.cannotChangeDecimalPrecisionErrornow threads the query context viaOriginContextShim.queryContext(CurrentOrigin.get)instead of dropping it. The deadcontext: String = ""parameter is removed; the body previously discarded it.GpuCheckOverflowandGpuCheckOverflowAfterSumsnapshot acapturedOriginfield and overridewithNewChild(ren)Internalto re-enter it around the case-classcopy(). Spark'smapChildren/withNewChildrenalready wrap withCurrentOrigin.withOrigin(this.origin); the override is defensive against reflection paths that bypass that wrap.GpuDecimalSumandGpuDecimalAverageBasecapture their own origin and wrappostUpdate/postMerge/evaluateExpressionlazy vals withCurrentOrigin.withOrigin(capturedOrigin)so theGpuCheckOverflow*nodes they construct inherit the parser-set context.GpuDecimal128Averageapplies the same wrap to its ownpostUpdate/postMergeoverrides.columnarEvalonGpuCheckOverflow*enters the captured origin so the staticGpuCast.checkNFixDecimalBounds→RoundingErrorUtilchain reads it viaCurrentOrigin.getat throw time.RapidsTestSettings.scalaexclusion forSPARK-39190,SPARK-39208,SPARK-39210: Query context of decimal overflow error should be serialized to executors when WSCG is offis removed.Why the capture pattern
val capturedOrigin = originsnapshots the parser-set Origin at construction (driver-side, insideRapidsMeta.convertToGpu'swithOrigin(wrapped.origin)). The follow-upwithNewChild(ren)Internaloverrides preserve that snapshot across the case-classcopy()that Spark drives viamapChildren/makeCopy, so the SQL context survives all the way to the executor-thrown exception.Local validation
Inside the worktree (
/home/allxu/work/spark-set/spark-rapids-14123-query-ctx):```
mvn package -pl tests -am -Dbuildver=330 -Dmaven.repo.local=./.mvn-repo \
-DwildcardSuites=org.apache.spark.sql.rapids.suites.RapidsSQLQuerySuite \
-Drapids.test.gpu.allocFraction=0.3 -Drapids.test.gpu.maxAllocFraction=0.3 \
-Drapids.test.gpu.minAllocFraction=0
```
`Tests: succeeded 225, failed 0, canceled 0, ignored 9, pending 0`
Per-test trace for the recovered case (
SPARK-39190,SPARK-39208,SPARK-39210: Query context of decimal overflow error should be serialized to executors when WSCG is off, Spark sourcesql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala:4496): all three subqueries (select d / 0.1,select sum(d),select avg(d)) now throw aSparkArithmeticExceptionwhosegetMessagecontains the originating SQL fragment, matching CPU behavior.Shim compile coverage:
mvn package -DskipTests -pl sql-plugin -am -Dbuildver=330: SUCCESSmvn package -DskipTests -pl sql-plugin -am -Dbuildver=340: SUCCESSmvn package -DskipTests -pl sql-plugin -am -Dbuildver=400 -f scala2.13/pom.xml: SUCCESSFollow-ups (out of scope here)
GpuDecimalSum.windowOutputandscanCombinestill callGpuCast.checkNFixDecimalBoundswithout a captured-origin wrap, so window/scan decimal-overflow messages will not carry the SQL fragment. SPARK-39190 does not exercise those paths; can be addressed separately.capturedOrigin+withNewChild(ren)Internalboilerplate is repeated across four classes. Consolidating into a smallGpuOriginCapturetrait, or migrating to theGpuCast-styleotherCopyArgs+ second-parameter-listoriginpattern, is a worthwhile refactor follow-up but kept out of this fix to minimize scope.Documentation
Testing
Performance