[AutoSparkUT] Recover SPARK-10136 nested-list parquet reads (#11589, #11592)#14838
[AutoSparkUT] Recover SPARK-10136 nested-list parquet reads (#11589, #11592)#14838wjxiz1992 wants to merge 2 commits into
Conversation
Refs NVIDIA#11589, NVIDIA#11592. ParquetSchemaUtils.clipSparkArrayType already follows the Parquet LIST backward-compatibility rules, but the guard for the unannotated 1-level legacy branch checked only isRepetition(REPEATED) and missed the additional getLogicalTypeAnnotation == null clause that Spark CPU applies (ParquetReadSupport.clipParquetListType, lines 268-269). For a Thrift- or parquet-avro 1.7-written 2-level nested LIST shape: required group f (LIST) { repeated group f_tuple (LIST) { repeated int32 f_tuple_tuple; } } the outer call descended into f_tuple (the inner LIST-annotated REPEATED group) and the recursive call short-circuited at the missing guard, passing f_tuple to clipSparkType as if it were a primitive element type. clipSparkType then called f_tuple.asPrimitiveType() and threw ClassCastException: repeated group f_tuple (LIST) { repeated int32 f_tuple_tuple; } is not primitive. The fix adds the getOriginalType == null guard so a LIST-annotated REPEATED group correctly routes to the LIST-wrapper branch. The plugin fix alone surfaces a downstream cuDF issue: cuDF's SchemaElement::is_stub() also treats a LIST-annotated REPEATED group as a stub and collapses one nesting level (list<list<int>> -> list<int>). This PR is paired with rapidsai/cudf#22597 (closes rapidsai/cudf#22596), which excludes LIST/MAP-annotated REPEATED groups from is_stub(). Local Maven validation with both fixes applied: mvn package -pl tests -am -Dbuildver=330 \ -Dmaven.repo.local=./.mvn-repo \ -DwildcardSuites=org.apache.spark.sql.rapids.suites.RapidsParquetThriftCompatibilitySuite,org.apache.spark.sql.rapids.suites.RapidsParquetAvroCompatibilitySuite \ -Drapids.test.gpu.allocFraction=0.3 \ -Drapids.test.gpu.maxAllocFraction=0.3 \ -Drapids.test.gpu.minAllocFraction=0 RapidsParquetThriftCompatibilitySuite: - SPARK-10136 list of primitive list RapidsParquetAvroCompatibilitySuite: - SPARK-10136 array of primitive array Tests: succeeded 9, failed 0, canceled 0, ignored 2, pending 0 All tests passed. Recovered tests: - RapidsParquetThriftCompatibilitySuite.SPARK-10136 list of primitive list (Spark original: ParquetThriftCompatibilitySuite.scala:74-147) - RapidsParquetAvroCompatibilitySuite.SPARK-10136 array of primitive array (Spark original: ParquetAvroCompatibilitySuite.scala:172-191) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Allen Xu <allxu@nvidia.com>
Greptile SummaryThis PR fixes a
Confidence Score: 5/5The change is a minimal, targeted guard addition to a schema-clipping utility with no GPU resource allocation, no data path side-effects, and two unit tests recovered to confirm the fix end-to-end. The two-line condition change exactly mirrors the Parquet spec backward-compatibility rules, is confined to schema interpretation logic with no GPU/JNI/OOM concerns, and the PR description provides thorough before/after test evidence. The only merge prerequisite is the paired cuDF fix, which is clearly documented. No files require special attention. Both changed files are straightforward: the schema-utils fix is self-contained and the test-settings change simply un-excludes two now-passing tests. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[clipSparkArrayType] --> B{isRepetition REPEATED}
B -- No --> E[LIST-wrapper branch]
B -- Yes --> C{getOriginalType != LIST AND != MAP}
C -- Yes unannotated legacy --> D[clipSparkType with parquetList as element]
C -- No LIST or MAP annotated NEW guard --> E
E --> F{repeated isPrimitive}
F -- Yes --> G[clipSparkType on primitive]
F -- No --> H{multi-field or named array or tuple}
H -- Yes --> I[clipSparkType on repeatedGroup]
H -- No --> J[clipSparkType on single subfield]
Reviews (2): Last reviewed commit: "Broaden legacy-list guard to cover REPEA..." | Re-trigger Greptile |
There was a problem hiding this comment.
Pull request overview
This PR fixes a Parquet LIST backward-compatibility edge case in the RAPIDS Parquet schema clipping logic to prevent ClassCastException on nested-list schemas written by parquet-thrift / parquet-avro 1.7, and re-enables the corresponding recovered Spark compatibility tests for Spark 3.3.0.
Changes:
- Update
ParquetSchemaUtils.clipSparkArrayTypeto avoid treating LIST-annotated REPEATED groups as legacy 1-level lists. - Re-enable two previously excluded SPARK-10136 compatibility tests in Spark 3.3.0 test settings.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/src/test/spark330/scala/org/apache/spark/sql/rapids/utils/RapidsTestSettings.scala | Re-enables SPARK-10136 parquet-thrift/avro compatibility tests by removing exclusions. |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/parquet/ParquetSchemaUtils.scala | Adjusts array/list schema clipping to correctly route LIST-annotated REPEATED groups through the LIST-wrapper handling. |
| // A REPEATED group with no LIST/MAP annotation is the legacy 1-level list: the element type | ||
| // is the group/primitive itself. A REPEATED group that IS LIST-annotated (Thrift / Avro 1.7 | ||
| // nested-list style) must go through the LIST-wrapper branch below, otherwise the wrapper | ||
| // gets passed to clipSparkType as if it were the primitive element and asPrimitiveType() | ||
| // throws ClassCastException (issues #11589, #11592). | ||
| val newSparkType = if (parquetList.getOriginalType == null && | ||
| parquetList.isRepetition(Repetition.REPEATED)) { |
There was a problem hiding this comment.
Good catch — repeated binary x (UTF8) and similar primitive-with-annotation legacy 1-level shapes do exist and the strict getOriginalType == null guard would route them into the LIST-wrapper branch, where asGroupType() blows up. Loosened the predicate to "REPEATED unless explicitly LIST- or MAP-annotated", which matches the Parquet spec's backward-compatibility rules and covers both repeated binary x (UTF8) and repeated fixed_len_byte_array (DECIMAL).
The previous predicate `getOriginalType == null && isRepetition(REPEATED)` was too strict: legacy 1-level lists can be encoded as a REPEATED primitive with a non-null original type (e.g. `repeated binary x (UTF8)` for array<string>, `repeated fixed_len_byte_array (DECIMAL)` for array<decimal>). Those shapes would route into the LIST-wrapper branch and parquetList.asGroupType() would throw ClassCastException because the type is primitive. Per the Parquet spec backward-compatibility rules, any REPEATED field that isn't explicitly LIST- or MAP-annotated is the legacy 1-level encoding. Updated the predicate accordingly. Caught by @copilot-pull-request-reviewer. Signed-off-by: Allen Xu <allxu@nvidia.com>
|
Converting to draft — this PR depends on the cuDF schema fix in rapidsai/cudf#22597 (the 2-level legacy LIST |
|
NOTE: release/26.06 has been created from main. Please retarget your PR to release/26.06 if it should be included in the release. |
Refs #11589, #11592.
Description
ParquetSchemaUtils.clipSparkArrayTypealready follows the Parquet LIST backward-compatibility rules, but the guard for the unannotated 1-level legacy branch checked onlyisRepetition(REPEATED)and missed the additionalgetLogicalTypeAnnotation == nullclause that Spark CPU applies (ParquetReadSupport.clipParquetListTypelines 268-269).For a Thrift- or parquet-avro 1.7-written 2-level nested LIST shape:
the outer call descended into
f_tuple(the inner LIST-annotated REPEATED group) and the recursive call short-circuited at the missing guard, passingf_tupletoclipSparkTypeas if it were a primitive element type.clipSparkTypethen calledf_tuple.asPrimitiveType()and threwClassCastException: repeated group f_tuple (LIST) { repeated int32 f_tuple_tuple; } is not primitive.The fix adds the
getOriginalType == nullguard so a LIST-annotated REPEATED group correctly routes to the LIST-wrapper branch. Implementation usesgetOriginalType(rather thangetLogicalTypeAnnotation) to match the existing pattern in this file (the rest ofclipSparkArrayTypealready checksgetOriginalType == OriginalType.LIST).Paired cuDF fix
The plugin fix alone surfaces a downstream cuDF issue: cuDF's
SchemaElement::is_stub()also treats a LIST-annotated REPEATED group as a stub and collapses one nesting level (list<list<int>>->list<int>). This PR is paired with rapidsai/cudf#22597 (closes rapidsai/cudf#22596), which excludes LIST/MAP-annotated REPEATED groups fromis_stub(). Wait to merge until rapidsai/cudf#22597 is in the cuDF version this branch depends on.Recovered tests
SPARK-10136 list of primitive listSPARK-10136 list of primitive listParquetThriftCompatibilitySuite.scala:74-147SPARK-10136 array of primitive arraySPARK-10136 array of primitive arrayParquetAvroCompatibilitySuite.scala:172-191Local Maven validation
End-to-end validated with a locally-built spark-rapids-jni containing both the matching cuDF fix above and the existing rapidsai/cudf#22567 (which is the prerequisite for the sibling
RapidsParquetProtobufCompatibilitySuiterecovery in #14821):Coverage delta in the touched suites: 7/9 → 9/9 passing (the two SPARK-10136 tests recovered, no other tests affected).
Checklists
Documentation
Testing
(Please provide the names of the existing tests in the PR description.)
Performance