
[AutoSparkUT] Recover SPARK-10136 nested-list parquet reads (#11589, #11592) #14838

Draft
wjxiz1992 wants to merge 2 commits into NVIDIA:main from wjxiz1992:fix/11589-parquet-nested-list

@wjxiz1992 wjxiz1992 commented May 20, 2026

Refs #11589, #11592.

Description

ParquetSchemaUtils.clipSparkArrayType already follows the Parquet LIST backward-compatibility rules, but the guard for the unannotated 1-level legacy branch checked only isRepetition(REPEATED) and missed the additional getLogicalTypeAnnotation == null clause that Spark CPU applies (ParquetReadSupport.clipParquetListType lines 268-269).

For a Thrift- or parquet-avro 1.7-written 2-level nested LIST shape:

required group f (LIST) {
  repeated group f_tuple (LIST) {
    repeated int32 f_tuple_tuple;
  }
}

the outer call descended into f_tuple (the inner LIST-annotated REPEATED group) and the recursive call short-circuited at the missing guard, passing f_tuple to clipSparkType as if it were a primitive element type. clipSparkType then called f_tuple.asPrimitiveType() and threw ClassCastException: repeated group f_tuple (LIST) { repeated int32 f_tuple_tuple; } is not primitive.

The fix adds the getOriginalType == null guard so a LIST-annotated REPEATED group correctly routes to the LIST-wrapper branch. Implementation uses getOriginalType (rather than getLogicalTypeAnnotation) to match the existing pattern in this file (the rest of clipSparkArrayType already checks getOriginalType == OriginalType.LIST).
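
The routing difference described above can be sketched with a small self-contained model. The `PType`, `Prim`, and `Group` types and the `route` helper are hypothetical stand-ins for parquet-mr's schema classes, not the plugin's code; only the two guard expressions follow the description.

```scala
// Hypothetical stand-ins for parquet-mr schema types; not the plugin's actual classes.
sealed trait PType {
  def name: String
  def repeated: Boolean
  def originalType: Option[String] // None models getOriginalType == null
  // Models the failure mode above: a group cannot be viewed as a primitive.
  def asPrimitive: Prim = this match {
    case p: Prim => p
    case g: Group => throw new ClassCastException(s"${g.name} is not primitive")
  }
}
final case class Prim(name: String, repeated: Boolean,
    originalType: Option[String] = None) extends PType
final case class Group(name: String, repeated: Boolean,
    originalType: Option[String], fields: List[PType]) extends PType

object ListGuardSketch {
  // Guard before the fix: any REPEATED type is taken as a legacy 1-level element.
  def oldGuard(t: PType): Boolean = t.repeated
  // Guard described in this PR description: REPEATED and unannotated.
  def nullGuard(t: PType): Boolean = t.originalType.isEmpty && t.repeated

  // Names the branch a type takes under the given guard.
  def route(t: PType, guard: PType => Boolean): String =
    if (guard(t)) "legacy-1-level" else "list-wrapper"

  // repeated group f_tuple (LIST) { repeated int32 f_tuple_tuple; }
  val fTuple: Group = Group("f_tuple", repeated = true, Some("LIST"),
    List(Prim("f_tuple_tuple", repeated = true)))
}
```

In this model `route(fTuple, oldGuard)` selects the legacy branch, where calling `fTuple.asPrimitive` throws, while `route(fTuple, nullGuard)` selects the LIST-wrapper branch.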

Paired cuDF fix

The plugin fix alone surfaces a downstream cuDF issue: cuDF's SchemaElement::is_stub() also treats a LIST-annotated REPEATED group as a stub and collapses one nesting level (list<list<int>> -> list<int>). This PR is paired with rapidsai/cudf#22597 (closes rapidsai/cudf#22596), which excludes LIST/MAP-annotated REPEATED groups from is_stub(). Wait to merge until rapidsai/cudf#22597 is in the cuDF version this branch depends on.
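
The collapse can be illustrated with a toy model. This is not cuDF code: `Node`, `listDepth`, and the pre-fix rule in `isStubOld` (a REPEATED node with exactly one child) are assumptions made for illustration, and the depth count is a deliberately simplified stand-in for how a reader derives nesting.

```scala
object StubCollapseSketch {
  // Toy schema node; a hypothetical model, not cuDF's SchemaElement.
  final case class Node(name: String, repeated: Boolean,
      annotation: Option[String], children: List[Node])

  // Assumed pre-fix rule: a REPEATED node with exactly one child is a stub.
  def isStubOld(n: Node): Boolean = n.repeated && n.children.size == 1

  // Rule as described for the paired fix: LIST/MAP-annotated nodes are never stubs.
  def isStubNew(n: Node): Boolean =
    isStubOld(n) && !n.annotation.exists(a => a == "LIST" || a == "MAP")

  // Toy depth count: stubs are transparent; every other LIST-annotated node adds a level.
  def listDepth(n: Node, isStub: Node => Boolean): Int = {
    val own = if (!isStub(n) && n.annotation.contains("LIST")) 1 else 0
    own + n.children.map(c => listDepth(c, isStub)).foldLeft(0)((a, b) => math.max(a, b))
  }

  // required group f (LIST) { repeated group f_tuple (LIST) { repeated int32 f_tuple_tuple; } }
  val nested: Node = Node("f", repeated = false, Some("LIST"), List(
    Node("f_tuple", repeated = true, Some("LIST"), List(
      Node("f_tuple_tuple", repeated = true, None, Nil)))))
}
```

Under these assumptions `listDepth(nested, isStubOld)` counts one list level and `listDepth(nested, isStubNew)` counts two, matching the list<list<int>> -> list<int> collapse described above.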

Recovered tests

| RAPIDS test | Spark original | Source | Status |
| --- | --- | --- | --- |
| SPARK-10136 list of primitive list | SPARK-10136 list of primitive list | ParquetThriftCompatibilitySuite.scala:74-147 | RECOVERED |
| SPARK-10136 array of primitive array | SPARK-10136 array of primitive array | ParquetAvroCompatibilitySuite.scala:172-191 | RECOVERED |

Local Maven validation

mvn package -pl tests -am -Dbuildver=330 \
  -Dmaven.repo.local=./.mvn-repo \
  -DwildcardSuites=org.apache.spark.sql.rapids.suites.RapidsParquetThriftCompatibilitySuite,org.apache.spark.sql.rapids.suites.RapidsParquetAvroCompatibilitySuite \
  -Drapids.test.gpu.allocFraction=0.3 \
  -Drapids.test.gpu.maxAllocFraction=0.3 \
  -Drapids.test.gpu.minAllocFraction=0

End-to-end validated with a locally-built spark-rapids-jni containing both the matching cuDF fix above and the existing rapidsai/cudf#22567 (which is the prerequisite for the sibling RapidsParquetProtobufCompatibilitySuite recovery in #14821):

RapidsParquetThriftCompatibilitySuite:
- SPARK-10136 list of primitive list
RapidsParquetAvroCompatibilitySuite:
- SPARK-10136 array of primitive array
Tests: succeeded 9, failed 0, canceled 0, ignored 2, pending 0
All tests passed.
BUILD SUCCESS

Coverage delta in the touched suites: 7/9 → 9/9 passing (the two SPARK-10136 tests recovered, no other tests affected).

Checklists

Documentation

  • Updated for new or modified user-facing features or behaviors
  • No user-facing change

Testing

  • Added or modified tests to cover new code paths
  • Covered by existing tests
    (Please provide the names of the existing tests in the PR description.)
  • Not required

Performance

  • Tests ran and results are added in the PR description
  • Issue filed with a link in the PR description
  • Not required

Refs NVIDIA#11589, NVIDIA#11592.

ParquetSchemaUtils.clipSparkArrayType already follows the Parquet LIST
backward-compatibility rules, but the guard for the unannotated 1-level
legacy branch checked only isRepetition(REPEATED) and missed the
additional getLogicalTypeAnnotation == null clause that Spark CPU applies
(ParquetReadSupport.clipParquetListType, lines 268-269).

For a Thrift- or parquet-avro 1.7-written 2-level nested LIST shape:

    required group f (LIST) {
      repeated group f_tuple (LIST) {
        repeated int32 f_tuple_tuple;
      }
    }

the outer call descended into f_tuple (the inner LIST-annotated REPEATED
group) and the recursive call short-circuited at the missing guard,
passing f_tuple to clipSparkType as if it were a primitive element type.
clipSparkType then called f_tuple.asPrimitiveType() and threw
ClassCastException: repeated group f_tuple (LIST) { repeated int32
f_tuple_tuple; } is not primitive.

The fix adds the getOriginalType == null guard so a LIST-annotated
REPEATED group correctly routes to the LIST-wrapper branch.

The plugin fix alone surfaces a downstream cuDF issue: cuDF's
SchemaElement::is_stub() also treats a LIST-annotated REPEATED group as
a stub and collapses one nesting level (list<list<int>> -> list<int>).
This PR is paired with rapidsai/cudf#22597 (closes rapidsai/cudf#22596),
which excludes LIST/MAP-annotated REPEATED groups from is_stub().

Local Maven validation with both fixes applied:

  mvn package -pl tests -am -Dbuildver=330 \
    -Dmaven.repo.local=./.mvn-repo \
    -DwildcardSuites=org.apache.spark.sql.rapids.suites.RapidsParquetThriftCompatibilitySuite,org.apache.spark.sql.rapids.suites.RapidsParquetAvroCompatibilitySuite \
    -Drapids.test.gpu.allocFraction=0.3 \
    -Drapids.test.gpu.maxAllocFraction=0.3 \
    -Drapids.test.gpu.minAllocFraction=0

  RapidsParquetThriftCompatibilitySuite:
  - SPARK-10136 list of primitive list
  RapidsParquetAvroCompatibilitySuite:
  - SPARK-10136 array of primitive array
  Tests: succeeded 9, failed 0, canceled 0, ignored 2, pending 0
  All tests passed.

Recovered tests:
- RapidsParquetThriftCompatibilitySuite.SPARK-10136 list of primitive list
  (Spark original: ParquetThriftCompatibilitySuite.scala:74-147)
- RapidsParquetAvroCompatibilitySuite.SPARK-10136 array of primitive array
  (Spark original: ParquetAvroCompatibilitySuite.scala:172-191)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
Copilot AI review requested due to automatic review settings May 20, 2026 11:06

greptile-apps Bot commented May 20, 2026

Greptile Summary

This PR fixes a ClassCastException in ParquetSchemaUtils.clipSparkArrayType caused by a missing annotation guard: a LIST-annotated REPEATED group (written by Thrift/Avro 1.7 for nested-list schemas) was incorrectly routed to the unannotated legacy 1-level branch, where clipSparkType called asPrimitiveType() on a group type and threw. The fix adds getOriginalType != OriginalType.LIST && getOriginalType != OriginalType.MAP to the REPEATED branch predicate, aligning with the Parquet spec backward-compatibility rules, and removes the corresponding KNOWN_ISSUE exclusions from the Spark 3.3.0 test settings.

  • Schema-utils fix (ParquetSchemaUtils.scala): guards the legacy-element path with != LIST && != MAP so LIST-annotated REPEATED wrappers fall through to the LIST-group branch; replaces the stale TODO comment with a concise explanation of the Parquet backward-compatibility rules.
  • Test recovery (RapidsTestSettings.scala): removes the .exclude entries for SPARK-10136 list of primitive list and SPARK-10136 array of primitive array, recovering 2 of the 9 tests in the affected suites.

Confidence Score: 5/5

The change is a minimal, targeted guard addition to a schema-clipping utility with no GPU resource allocation, no data path side-effects, and two unit tests recovered to confirm the fix end-to-end.

The two-line condition change exactly mirrors the Parquet spec backward-compatibility rules, is confined to schema interpretation logic with no GPU/JNI/OOM concerns, and the PR description provides thorough before/after test evidence. The only merge prerequisite is the paired cuDF fix, which is clearly documented.

No files require special attention. Both changed files are straightforward: the schema-utils fix is self-contained and the test-settings change simply un-excludes two now-passing tests.

Important Files Changed

| Filename | Overview |
| --- | --- |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/parquet/ParquetSchemaUtils.scala | Adds getOriginalType != LIST && != MAP guards to clipSparkArrayType so LIST-annotated REPEATED groups route to the LIST-wrapper branch instead of the legacy element branch, fixing a ClassCastException on nested-list Parquet schemas. |
| tests/src/test/spark330/scala/org/apache/spark/sql/rapids/utils/RapidsTestSettings.scala | Removes the KNOWN_ISSUE exclusions for the two SPARK-10136 tests in RapidsParquetAvroCompatibilitySuite and RapidsParquetThriftCompatibilitySuite now that the underlying schema-utils bug is fixed. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[clipSparkArrayType] --> B{isRepetition REPEATED}
    B -- No --> E[LIST-wrapper branch]
    B -- Yes --> C{getOriginalType != LIST AND != MAP}
    C -- Yes unannotated legacy --> D[clipSparkType with parquetList as element]
    C -- No LIST or MAP annotated NEW guard --> E
    E --> F{repeated isPrimitive}
    F -- Yes --> G[clipSparkType on primitive]
    F -- No --> H{multi-field or named array or tuple}
    H -- Yes --> I[clipSparkType on repeatedGroup]
    H -- No --> J[clipSparkType on single subfield]
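
The decision tree in this flowchart can be sketched as a small routing function. `PType` and `route` are hypothetical names over a simplified model, not parquet-mr or plugin code, and the exact name test for the "named array or tuple" node is an assumption that follows Spark's CPU-side rule.

```scala
object ClipRoutingSketch {
  // Hypothetical simplified Parquet type; not a parquet-mr class.
  final case class PType(name: String, repeated: Boolean, primitive: Boolean,
      originalType: Option[String] = None, fields: List[PType] = Nil)

  private def listOrMap(t: PType): Boolean =
    t.originalType.exists(a => a == "LIST" || a == "MAP")

  // Returns the flowchart leaf (D, G, I, or J) reached for a given list type.
  def route(parquetList: PType): String =
    if (parquetList.repeated && !listOrMap(parquetList)) {
      "D: clipSparkType with parquetList as element"
    } else {
      // LIST-wrapper branch: inspect the single repeated child.
      val repeatedField = parquetList.fields.head
      if (repeatedField.primitive) {
        "G: clipSparkType on primitive"
      } else if (repeatedField.fields.size > 1 || repeatedField.name == "array" ||
          repeatedField.name == parquetList.name + "_tuple") {
        // Assumed name rule: the repeated group is itself the element.
        "I: clipSparkType on repeatedGroup"
      } else {
        "J: clipSparkType on single subfield"
      }
    }
}
```

For the nested-list schema from the PR description, the outer `f` reaches leaf I and the inner `f_tuple` then reaches leaf G instead of leaf D.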

Reviews (2): Last reviewed commit: "Broaden legacy-list guard to cover REPEA..."

Copilot AI left a comment

Pull request overview

This PR fixes a Parquet LIST backward-compatibility edge case in the RAPIDS Parquet schema clipping logic to prevent ClassCastException on nested-list schemas written by parquet-thrift / parquet-avro 1.7, and re-enables the corresponding recovered Spark compatibility tests for Spark 3.3.0.

Changes:

  • Update ParquetSchemaUtils.clipSparkArrayType to avoid treating LIST-annotated REPEATED groups as legacy 1-level lists.
  • Re-enable two previously excluded SPARK-10136 compatibility tests in Spark 3.3.0 test settings.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tests/src/test/spark330/scala/org/apache/spark/sql/rapids/utils/RapidsTestSettings.scala | Re-enables SPARK-10136 parquet-thrift/avro compatibility tests by removing exclusions. |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/parquet/ParquetSchemaUtils.scala | Adjusts array/list schema clipping to correctly route LIST-annotated REPEATED groups through the LIST-wrapper handling. |

Comment on lines +369 to +375
// A REPEATED group with no LIST/MAP annotation is the legacy 1-level list: the element type
// is the group/primitive itself. A REPEATED group that IS LIST-annotated (Thrift / Avro 1.7
// nested-list style) must go through the LIST-wrapper branch below, otherwise the wrapper
// gets passed to clipSparkType as if it were the primitive element and asPrimitiveType()
// throws ClassCastException (issues #11589, #11592).
val newSparkType = if (parquetList.getOriginalType == null &&
    parquetList.isRepetition(Repetition.REPEATED)) {
Collaborator Author

Good catch — repeated binary x (UTF8) and similar primitive-with-annotation legacy 1-level shapes do exist and the strict getOriginalType == null guard would route them into the LIST-wrapper branch, where asGroupType() blows up. Loosened the predicate to "REPEATED unless explicitly LIST- or MAP-annotated", which matches the Parquet spec's backward-compatibility rules and covers both repeated binary x (UTF8) and repeated fixed_len_byte_array (DECIMAL).
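
The difference between the strict and loosened predicates can be tabulated with a minimal sketch. `Field` is a hypothetical simplified record, not a parquet-mr class, and the `repeated int32 element` shape is an illustrative example that does not come from this PR.

```scala
object LoosenedGuardSketch {
  // Hypothetical, simplified view of a Parquet field; not a parquet-mr class.
  final case class Field(desc: String, repeated: Boolean, originalType: Option[String])

  // Earlier predicate: getOriginalType == null && isRepetition(REPEATED).
  def strictGuard(f: Field): Boolean = f.originalType.isEmpty && f.repeated

  // Loosened predicate: REPEATED unless explicitly LIST- or MAP-annotated.
  def loosenedGuard(f: Field): Boolean =
    f.repeated && !f.originalType.exists(a => a == "LIST" || a == "MAP")

  val shapes: List[Field] = List(
    Field("repeated int32 element", repeated = true, None), // illustrative shape
    Field("repeated binary x (UTF8)", repeated = true, Some("UTF8")),
    Field("repeated fixed_len_byte_array (DECIMAL)", repeated = true, Some("DECIMAL")),
    Field("repeated group f_tuple (LIST)", repeated = true, Some("LIST")))

  // Prints, per shape, whether each predicate selects the legacy 1-level branch.
  def main(args: Array[String]): Unit =
    shapes.foreach { f =>
      println(s"${f.desc}: strict=${strictGuard(f)}, loosened=${loosenedGuard(f)}")
    }
}
```

In this model the strict predicate rejects the UTF8 and DECIMAL shapes, sending them to the wrapper branch, whereas the loosened predicate accepts both and still rejects the LIST-annotated group.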

The previous predicate `getOriginalType == null && isRepetition(REPEATED)`
was too strict: legacy 1-level lists can be encoded as a REPEATED primitive
with a non-null original type (e.g. `repeated binary x (UTF8)` for
array<string>, `repeated fixed_len_byte_array (DECIMAL)` for array<decimal>).
Those shapes would route into the LIST-wrapper branch and
parquetList.asGroupType() would throw ClassCastException because the type
is primitive.

Per the Parquet spec backward-compatibility rules, any REPEATED field that
isn't explicitly LIST- or MAP-annotated is the legacy 1-level encoding.
Updated the predicate accordingly. Caught by @copilot-pull-request-reviewer.

Signed-off-by: Allen Xu <allxu@nvidia.com>
@wjxiz1992 wjxiz1992 marked this pull request as draft May 21, 2026 08:48
@wjxiz1992
Collaborator Author

Converting to draft — this PR depends on the cuDF schema fix in rapidsai/cudf#22597 (the 2-level legacy LIST is_stub() correction). Will mark ready for review once that lands and the new JNI/cuDF version is pulled in.

@nvauto
Collaborator

nvauto commented May 25, 2026

NOTE: release/26.06 has been created from main. Please retarget your PR to release/26.06 if it should be included in the release.

Development

Successfully merging this pull request may close these issues.

[BUG] Parquet reader collapses 2-level legacy nested LIST: list<list<int>> read as list<int>
