Use merge and set validity in cast struct to struct#14846
Conversation
Signed-off-by: Rishi Chandra <rishic@nvidia.com>
Greptile SummaryThis PR replaces the null-restoration logic in
Confidence Score: 3/5The optimization is straightforward but changes null-propagation semantics for nested struct columns in a way that may silently break Parquet/ORC serialization or any operation that accesses child columns without checking the parent validity mask. The replacement of isNull+ifElse with mergeAndSetValidity on a struct column leaves children with non-null values at rows where the parent struct is null. Spark-rapids issue #7485 was filed specifically about this pattern. All pre-existing mergeAndSetValidity calls in the same file target flat STRING results; this is the first use on a nested type. Existing tests exercise the cast path but do not directly assert child-column null masks, so a subtle regression in serialization or downstream nested-access could go undetected. sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuCast.scala — specifically the castStructToStruct method and how its output is consumed by Parquet/ORC writers and any nested struct projection paths. Important Files Changed
Sequence DiagramsequenceDiagram
participant Caller
participant castStructToStruct
participant doCast
participant cuDF
Caller->>castStructToStruct: input StructType with nulls
loop for each child field
castStructToStruct->>doCast: getChildColumnView(i)
doCast-->>castStructToStruct: castChild[i]
end
castStructToStruct->>cuDF: makeStructView(castChildren)
cuDF-->>castStructToStruct: castedView all rows valid
alt "input.getNullCount == 0"
castStructToStruct->>cuDF: castedView.copyToColumnVector()
else has nulls NEW path
castStructToStruct->>cuDF: castedView.mergeAndSetValidity BITWISE_AND input
Note over cuDF: Top-level struct mask patched only
Note over cuDF: Children keep cast values at null rows
end
castStructToStruct-->>Caller: ColumnVector
Reviews (1): Last reviewed commit: "use merge and set validity in cast struc..." | Re-trigger Greptile |
| isNull.ifElse(nullVal, casted) | ||
| } | ||
| } | ||
| casted.mergeAndSetValidity(BinaryOp.BITWISE_AND, input) |
There was a problem hiding this comment.
mergeAndSetValidity on nested struct type only updates top-level validity
mergeAndSetValidity replaces the top-level null mask of the struct, but the child columns' validity bitmaps are not updated. For any row where input was null, the casted children will retain their values (the results of casting whatever was in the child slots under the null parent) rather than being null. This is the same class of issue tracked in #7485, which specifically calls out mergeAndSetValidity on nested types.
Every other existing call to mergeAndSetValidity in this file is applied to the result of ColumnVector.stringConcatenate — a flat STRING column — not to a struct. The original isNull + ifElse(nullStruct, casted) path was correct because it materialized a fully-null struct (children included) at each null parent row. With the new path, a null parent struct can carry non-null child values, which may produce incorrect results if any downstream operation (e.g. Parquet/ORC writing, getChildColumnView(...).getNullCount(), or deeply-nested struct projections) accesses children without first consulting the parent validity mask.
There was a problem hiding this comment.
I looked at #7485 and #7698 and i'm just going to dump my understanding here. rapidsai/cudf#13335 helped make this API safe only for offset-based nested types, i.e. STRING or LIST: it added an identity gather to get fresh offsets, dropping any non-zero offsets where a parent was null. But if the child itself has a null mask, i.e. STRUCT or anything containing a STRUCT, we do nothing about lining up child null masks. And generally, cuDF makes no guarantees on a child's null mask being consistent with the parent null mask (with the contract being the parent's mask is authoritative), however, in a bunch of places Spark RAPIDS relies on child null masks directly and presumably that isn't going to change anytime soon.
TLDR; I will close this change; @revans2 I know back in #7698 you preferred getting rid of this everywhere, but since i'm starting to introduce it in more places, should we now update this API to throw on anything with a STRUCT?
There was a problem hiding this comment.
Lets fix structs properly instead. The issue is that nulls do not properly get pushed down, but cudf has APIs to do this cudf::structs::detail::superimpose_and_sanitize_nulls(...) and cudf::structs::detail::push_down_nulls(...). We should be able to make it work with a small change, and very little overhead.
There was a problem hiding this comment.
Gotcha. Filed rapidsai/cudf#22631, this PR will depend on its closure. I can take a crack at it.
|
NOTE: release/26.06 has been created from main. Please retarget your PR to release/26.06 if it should be included in the release. |
Contributes to #14588.
Description
The struct-to-struct path in
castStructToStructoriginally restored parent struct nulls withisNull/ifElseafter building the casted struct view. This change usesmergeAndSetValidityto set the casted struct validity from the input struct directly, preserving the same output values and parent null mask while avoiding the extra null-mask and conditional copy.This is covered by existing cast tests, e.g.
test_cast_day_time_interval_to_integral_no_overflowandtest_cast_integral_to_day_time_interval_no_overflow.Benchmark
Nanobenchmark
Isolating the rewritten block in a nanobenchmark shows 3.3x speedup.
struct-to-struct null restore nanobenchmark
Spark benchmark
An end-to-end Spark benchmark shows 1.14-1.15x speedup.
Checklists
Documentation
Testing
(Please provide the names of the existing tests in the PR description.)
Performance