Emit multiple batches from GpuProjectExec split-retry instead of concatenating #14877
thirtiseven wants to merge 6 commits
Adds a streaming variant of the split-retry path so that on GPU OOM the projection's sub-batches flow downstream as separate batches instead of being concatenated. The old runWithSplitRetry is preserved as a thin single-batch wrapper around the new streaming entry for callers that still need the single-batch contract (joins, aggregates, expand, etc.). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Greptile Summary
This PR removes the concat step in
Confidence Score: 5/5
The streaming iterator change is well-scoped: it only affects GpuProjectExec's own execution path, backward-compat callers are explicitly left unchanged, and the allowMultipleOutputBatches guard preserves the one-output-per-input invariant for Window operations. The core logic of routing split-retry pieces directly downstream instead of concatenating them is sound. Resource management for the normal execution path is correct. The two observations raised are narrow edge cases that do not affect correctness for any currently exercised code path. No files require special attention beyond the two minor points noted.
Sequence Diagram
sequenceDiagram
participant RDD as RDD partition
participant Exec as GpuProjectExec
participant Tier as GpuTieredProject
participant Stream as runStreamingWithSplitRetry
participant Retry as withRetry framework
RDD->>Exec: flatMap(split)
Exec->>Exec: SpillableColumnarBatch(split)
Exec->>Tier: projectAndCloseStreamingWithSplitRetry(sb, allowMultiple)
alt "areAllRetryable && PROJECT_SPLIT_RETRY_ENABLED && allowMultiple"
Tier->>Stream: runStreamingWithSplitRetry(sb, retryables, project)
Stream->>Retry: "withRetry(sb, splitSpillableInHalfByRows) { project }"
Stream-->>Tier: Iterator[ColumnarBatch] (lazy, task-completion guarded)
else fallback single-batch
Tier-->>Tier: wrap projectAndCloseWithRetrySingleBatch in lazy one-shot Iterator
end
Tier-->>Exec: pieces: Iterator[ColumnarBatch]
Exec->>Exec: wrap in metric-tracking Iterator
loop for each piece
Exec->>Exec: "NvtxIdWithMetrics { pieces.next() }"
Exec->>Retry: retryIter.next()
alt OOM on first attempt
Retry->>Retry: splitSpillableInHalfByRows(sb)
Retry->>Retry: project(half1), project(half2)
Retry-->>Exec: half1 batch
Note over Exec,Retry: half2 yielded on next iteration
else no OOM
Retry-->>Exec: full projected batch
end
Exec->>Exec: "numOutputBatches += 1"
Exec-->>RDD: ColumnarBatch
end
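The split-and-yield loop in the diagram can be illustrated without any GPU code. The following is a minimal sketch in plain Scala, not the plugin's implementation: `FakeOom`, `streamWithSplitRetry`, and the use of `Vector` as a stand-in for a batch are all hypothetical names chosen for illustration. It keeps a list of pending attempts, splits the failing attempt in half on a simulated OOM, and emits each successfully projected piece on its own.

```scala
object SplitRetrySketch {
  // Hypothetical stand-in for a GPU out-of-memory signal.
  final class FakeOom extends RuntimeException("simulated GPU OOM")

  // Returns one projected result per piece; nothing is concatenated.
  def streamWithSplitRetry[A, B](input: Vector[A])(project: Vector[A] => B): Iterator[B] =
    new Iterator[B] {
      // Attempts still to be projected, in row order.
      private var pending: List[Vector[A]] = List(input)

      override def hasNext: Boolean = pending.nonEmpty

      override def next(): B = {
        if (pending.isEmpty) throw new NoSuchElementException("no more pieces")
        var result: Option[B] = None
        while (result.isEmpty) {
          val attempt = pending.head
          try {
            result = Some(project(attempt))
            pending = pending.tail
          } catch {
            // Split by rows and retry the first half; a single-row attempt
            // cannot be split, so its failure propagates to the caller.
            case _: FakeOom if attempt.length > 1 =>
              val (first, second) = attempt.splitAt(attempt.length / 2)
              pending = first :: second :: pending.tail
          }
        }
        result.get
      }
    }
}
```

With a projection that fails on any piece larger than two rows, a five-row input comes out as three separate results, each produced only when the consumer asks for it.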
The non-retryable fallback returned Iterator.single(projectAndClose(...)) which evaluated the projection eagerly. That combined with the outer closeOnExcept(sb) to double-close sb when the projection threw (sb was already closed by withResource inside projectAndCloseWithRetrySingleBatch), relying on SpillableColumnarBatch.close idempotency (see issue NVIDIA#10161). The eager result also had no cleanup hook if the task was cancelled between hasNext returning true and the consumer's next() call. The fallback now returns a lazy one-shot iterator that defers the projection until next(), and installs an onTaskCompletion guard that closes sb if the iterator is abandoned before being iterated. Both branches of projectAndCloseStreamingWithSplitRetry are now lazy and self-own sb, so the outer closeOnExcept in internalDoExecuteColumnar is no longer needed. The inner NvtxIdWithMetrics now correctly covers the projection work on both branches. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
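The ownership rule in this commit message (defer the projection until next(), close the input exactly once, and close it from a completion hook if the iterator is abandoned) can be sketched in plain Scala. This is an illustrative model only: `FakeTask`, `FakeBatch`, and `lazyOneShot` are hypothetical stand-ins, not Spark's TaskContext or the plugin's SpillableColumnarBatch.

```scala
import scala.collection.mutable.ArrayBuffer

object LazyOneShotSketch {
  // Hypothetical stand-in for a task context with completion callbacks.
  final class FakeTask {
    private val listeners = ArrayBuffer.empty[() => Unit]
    def onCompletion(f: () => Unit): Unit = { listeners += f; () }
    def complete(): Unit = listeners.foreach(_.apply())
  }

  // Hypothetical stand-in for a spillable batch; counts close() calls.
  final class FakeBatch(val rows: Int) extends AutoCloseable {
    var closeCalls = 0
    override def close(): Unit = closeCalls += 1
  }

  // The iterator owns `input`: it is closed either after the projection runs
  // or by the completion guard if next() was never called.
  def lazyOneShot[B](task: FakeTask, input: FakeBatch)(project: FakeBatch => B): Iterator[B] =
    new Iterator[B] {
      private var started = false

      task.onCompletion(() => if (!started) { started = true; input.close() })

      override def hasNext: Boolean = !started

      override def next(): B = {
        if (started) throw new NoSuchElementException("one-shot iterator already consumed")
        started = true
        try project(input) finally input.close()
      }
    }
}
```

In this model the input is closed once on both paths, so nothing depends on close() being idempotent.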
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
new Iterator[ColumnarBatch] {
  @volatile private var started = false
  private val onClose = Option(TaskContext.get()).map { tc =>
    onTaskCompletion(tc) {
nit: This looks like it is working around a design gap in RmmRapidsRetryIterator.withRetry(input: T, ...).
For the single-input overload, the input is wrapped by SingleItemAutoCloseableIteratorInternal, which knows how to close the input if it was never pulled. However, the task-completion callback in AutoCloseableAttemptSpliterator only closes attemptStack; before the first next(), the single input has not yet been pushed onto that stack (push happens here).
A centralized fix there would also protect other lazy single-input retry users, e.g. GpuFilter, GpuColumnarToRowExec, GpuSortExec, and GpuRunningWindowExec.
Could you either file a follow-up issue, or fix it here if you want to make this common? I think the Project-specific guard is reasonable for this PR, but the streaming withRetry(sb, ...) case should ideally not need every caller to remember this ownership edge.
Fixes #14868.
Description
Follow-up to #14724.
GpuProjectExec.runWithSplitRetry previously concatenated the sub-batches produced by row-split retry back into a single output batch (via ConcatAndConsumeAll.buildNonEmptyBatchFromTypes). That concat exists only to preserve the single-batch contract of projectAndCloseWithRetrySingleBatch. It has two costs:
- Peak memory is sum(pieces) + concatenated output, so a workload that only fits when split can still OOM at concat time. The withRetryNoSplit wrap can spill but cannot split the concat.
- GpuProjectExec already returns Iterator[ColumnarBatch], so for the operator itself there is no reason to recombine.

This PR adds a streaming variant that returns the per-piece iterator from withRetry directly and switches GpuProjectExec.internalDoExecuteColumnar to flatMap over it. Other callers of projectAndCloseWithRetrySingleBatch (GpuExpandExec, GpuGenerateExec, GpuAggregateExec pre/post-step, GpuTakeOrderedAndProjectExec, GpuArrowEvalPythonExec, GpuBroadcastHashJoinExecBase) are left unchanged. Those operators embed a project inside a larger flow that aligns one output batch to one input batch (projection-index alignment in GpuExpandIterator, the SpillableColumnarBatch -> withRetry-split shape inside GpuAggregateIterator, etc.), and migrating them is case-by-case work tracked under the umbrella issue #7866.

Implementation
- GpuProjectExec.runStreamingWithSplitRetry returns Iterator[ColumnarBatch] (no concat).
- GpuProjectExec.runWithSplitRetry is reduced to a thin wrapper: drain the streaming iterator and concat. Behavior is unchanged for all current callers other than GpuProjectExec itself.
- GpuTieredProject.projectAndCloseStreamingWithSplitRetry dispatches to the streaming variant when areAllRetryable && PROJECT_SPLIT_RETRY_ENABLED; otherwise it wraps the single-batch fallback in Iterator.single so the caller can uniformly flatMap.
- GpuProjectExec.internalDoExecuteColumnar now flatMaps over the streaming iterator. The NVTX/opTime range is split into two parts so coverage matches the prior code: an outer range around spillable construction + retry-framework setup (which also captures the eager fallback projection), and an inner range around each lazy pieces.next() on the streaming path.
- closeOnExcept(sb) was added around the streaming-entry call to defend against a synchronous failure in addTaskCompletionListener on a cancelled task.

Behavior change to flag
numOutputBatches is now incremented once per emitted piece instead of once per input batch, which matches the long-standing per-batch counting in GpuFilterExec.filterAndCloseWithRetry. The same input under split-retry now reports N output batches; any tooling that compared input vs output batch counts on Project should be updated.

Happy path perf tests:
spark.rapids.sql.projectExec.splitRetry.enabled=false
spark.rapids.sql.projectExec.splitRetry.enabled=true

Checklists
Documentation
Testing
Performance
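The relationship between the streaming entry, the thin single-batch wrapper, and the per-piece batch count described under Implementation and Behavior change to flag can be modelled in a few lines of plain Scala. This is a rough sketch under stated assumptions: `runStreaming`, `runSingle`, `execute`, and `Counter` are hypothetical names, `Vector[Int]` stands in for a columnar batch, and the streaming function always splits its input in half purely to make the counting visible.

```scala
object WrapperAndMetricSketch {
  // Hypothetical streaming entry: emits projected pieces (here, always two
  // halves, each with its values doubled) instead of one combined batch.
  def runStreaming(batch: Vector[Int]): Iterator[Vector[Int]] = {
    val (a, b) = batch.splitAt(batch.length / 2)
    Iterator(a, b).filter(_.nonEmpty).map(_.map(_ * 2))
  }

  // Thin single-batch wrapper: drain the streaming iterator and concatenate,
  // for callers that still need exactly one output per input.
  def runSingle(batch: Vector[Int]): Vector[Int] =
    runStreaming(batch).foldLeft(Vector.empty[Int])(_ ++ _)

  // Hypothetical stand-in for an output-batch metric.
  final class Counter { var value = 0L }

  // Operator loop: flatMap over the pieces and count once per emitted piece.
  def execute(inputs: Iterator[Vector[Int]], numOutputBatches: Counter): Iterator[Vector[Int]] =
    inputs.flatMap { in =>
      runStreaming(in).map { piece =>
        numOutputBatches.value += 1
        piece
      }
    }
}
```

In this model two input batches produce four output batches, which is the kind of input-versus-output count mismatch that downstream tooling would now observe under split-retry.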