fix: fix various unit test failures in native_datafusion and native_iceberg_compat readers #1415

Merged
8 commits merged from the complex_test_failures_pr branch into apache:main on Feb 20, 2025

Conversation

@parthchandra (Contributor)

Major changes:

  • Allow UInt64 to Decimal and FixedWidthBinary to Binary conversions in complex readers
  • Do not enable prefetch reads in tests if the complex reader is enabled (see the sketch after this list)
  • Fix more incompatible checks in uint_8/uint_16 tests
  • Skip datetime rebase tests for complex readers (not supported; also sketched below)
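
A hedged sketch of the test-side gating described above. `usingDataFusionParquetExec` appears in this PR's diffs; the stub body, the `spark.comet.scan.impl` values, and the prefetch config key are assumptions for illustration, not the exact Comet definitions:

```scala
import org.apache.spark.sql.internal.SQLConf

object NativeReaderTestGating {
  // Stand-in for the helper used in this PR's diffs; the real implementation
  // inspects the Comet session config. Key and values here are assumptions.
  def usingDataFusionParquetExec(conf: SQLConf): Boolean =
    conf.getConfString("spark.comet.scan.impl", "native_comet") != "native_comet"

  // Only enable prefetch reads when the original reader is active
  // (the prefetch key below is illustrative, not the exact Comet key).
  def configurePrefetch(conf: SQLConf): Unit =
    if (!usingDataFusionParquetExec(conf)) {
      conf.setConfString("spark.comet.parquet.prefetch.enabled", "true")
    }
}

// Inside a ScalaTest body, the rebase tests can then be canceled for the
// native readers, which lines up with the "canceled" counts reported below:
//   if (usingDataFusionParquetExec(conf)) {
//     cancel("datetime rebase is not supported by the native readers")
//   }
```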

There may be conflicts between this and #1413 (@mbutrovich), which removes the cast_supported method, but they can be reconciled afterwards.

Without #1413 the failure counts are:

native_datafusion: Tests: succeeded 658, failed 19, canceled 4, ignored 54, pending 0
native_iceberg_compat: Tests: succeeded 662, failed 15, canceled 4, ignored 54, pending 0

@codecov-commenter commented Feb 18, 2025

Codecov Report

Attention: Patch coverage is 59.09091% with 9 lines in your changes missing coverage. Please review.

Project coverage is 57.84%. Comparing base (f09f8af) to head (4d2dd2d).
Report is 42 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...t/parquet/CometParquetPartitionReaderFactory.scala | 63.63% | 0 Missing and 4 partials ⚠️ |
| ...ala/org/apache/spark/sql/comet/CometScanExec.scala | 66.66% | 0 Missing and 3 partials ⚠️ |
| ...va/org/apache/comet/parquet/NativeBatchReader.java | 0.00% | 1 Missing ⚠️ |
| .../apache/comet/parquet/CometParquetFileFormat.scala | 0.00% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
```
@@             Coverage Diff              @@
##               main    #1415      +/-   ##
============================================
+ Coverage     56.12%   57.84%   +1.72%     
- Complexity      976      985       +9     
============================================
  Files           119      122       +3     
  Lines         11743    12130     +387     
  Branches       2251     2285      +34     
============================================
+ Hits           6591     7017     +426     
+ Misses         4012     3938      -74     
- Partials       1140     1175      +35     
```


```scala
super.test(testName, testTags: _*)(withSQLConf(SQLConf.USE_V1_SOURCE_LIST.key -> "") {
  testFun
})(pos)
// Datasource V2 is not supported by complex readers so force the scan impl back
```
Contributor:

Let's be consistent and call them native readers. We will likely use these for more than just complex types at some point.

Contributor Author:

Changed the comment.

@mbutrovich (Contributor):

#1413 shouldn't really conflict with this (I mean it will from a merge standpoint, but not from a logic standpoint). This just adds more true cases to cast_supported and #1413 removes the check entirely.

@parthchandra (Contributor Author):

> #1413 shouldn't really conflict with this (I mean it will from a merge standpoint, but not from a logic standpoint). This just adds more true cases to cast_supported and #1413 removes the check entirely.

I've updated the PR and removed the schema adapter changes. The number of failures has gone up, but they should be taken care of by #1413.
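
For readers following along, a Scala-flavored sketch of the kind of check being discussed. The real cast_supported check lives in the native (Rust) code and #1413 removes it entirely; the types and names below are illustrative only:

```scala
object CastSupportSketch {
  // Illustrative stand-ins for Arrow types; not the real native definitions.
  sealed trait ArrowT
  case object UInt64 extends ArrowT
  final case class Decimal(precision: Int, scale: Int) extends ArrowT
  case object FixedSizeBinary extends ArrowT
  case object Binary extends ArrowT

  // This PR effectively adds "true" cases like the first two below;
  // #1413 then removes the check entirely.
  def castSupported(from: ArrowT, to: ArrowT): Boolean = (from, to) match {
    case (UInt64, _: Decimal)      => true // newly allowed: UInt64 -> Decimal
    case (FixedSizeBinary, Binary) => true // newly allowed: FixedWidthBinary -> Binary
    case (a, b)                    => a == b // identical types are trivially supported
  }
}
```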

@parthchandra (Contributor Author):

@kazuyukitanimura @andygrove @huaxingao @comphead review requested.

@andygrove (Member) left a comment:

LGTM. Thanks @parthchandra

```diff
@@ -260,7 +260,8 @@ public void init() throws URISyntaxException, IOException {
     } ////// End get requested schema

     String timeZoneId = conf.get("spark.sql.session.timeZone");
     Schema arrowSchema = Utils$.MODULE$.toArrowSchema(sparkSchema, timeZoneId);
+    // Native code uses "UTC" always as the timeZoneId when converting from spark to arrow schema.
```
Contributor:
👍

```diff
@@ -260,7 +260,8 @@ public void init() throws URISyntaxException, IOException {
     } ////// End get requested schema

     String timeZoneId = conf.get("spark.sql.session.timeZone");
```
Contributor:

Wondering: should we always use UTC then, and remove reading the timezone from the Spark parameter?

Contributor Author:

We need to be using the session timezone parameter for at least some conversions (especially those involving timestamp_ntz). The conversions of timestamp/timestamp_ntz between Arrow and Spark are somewhat convoluted and sometimes require the timezone offset to be applied to the values to make them consistent.
It's safer to make sure the timezone parameter is passed to the native side so it can be applied when necessary.
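
To illustrate why the offset matters, a minimal, self-contained sketch using plain java.time rather than Comet's actual conversion code; the zone and value are made up:

```scala
import java.time.{LocalDateTime, ZoneId}

object NtzOffsetDemo extends App {
  // A TIMESTAMP_NTZ value is a "wall clock" reading with no zone attached.
  val wallClock = LocalDateTime.parse("2025-02-20T12:00:00")

  // Arrow-side values are UTC-based instants, so the same wall-clock reading
  // maps to different instants depending on which zone is applied:
  val inLA  = wallClock.atZone(ZoneId.of("America/Los_Angeles")).toInstant
  val inUTC = wallClock.atZone(ZoneId.of("UTC")).toInstant

  println(inLA)  // 2025-02-20T20:00:00Z
  println(inUTC) // 2025-02-20T12:00:00Z (an 8-hour skew if the wrong zone is used)
}
```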

@comphead (Contributor) left a comment:

LGTM, thanks @parthchandra

```diff
@@ -1001,7 +1012,7 @@ abstract class ParquetReadSuite extends CometTestBase {
         Seq(StructField("_1", LongType, false), StructField("_2", DoubleType, false)))

     withParquetDataFrame(data, schema = Some(readSchema)) { df =>
-      if (enableSchemaEvolution) {
+      if (enableSchemaEvolution || usingDataFusionParquetExec(conf)) {
```
Contributor:

Hmm, does this mean that when usingDataFusionParquetExec == true we always use schema evolution, regardless of COMET_SCHEMA_EVOLUTION_ENABLED? Does that mean it will be incompatible with Spark 3.x?

Contributor Author:

I suspect it may be incompatible. We will address this as we get to the point where we are able to run Spark tests with different versions of Spark. At the moment we are still clearing up the Comet test failures.

Contributor:

I feel we need a TODO comment or an issue ticket to track this.

Contributor Author:

Done.

@parthchandra force-pushed the complex_test_failures_pr branch from c6ccb59 to 4d2dd2d on February 20, 2025 at 11:52.
@parthchandra (Contributor Author):

@kazuyukitanimura addressed your last comment and also rebased (there were merge conflicts).
Test failure counts after rebase:

native_datafusion:     Tests: succeeded 661, failed 14, canceled 6, ignored 54, pending 0
native_iceberg_compat: Tests: succeeded 663, failed 12, canceled 6, ignored 54, pending 0

@kazuyukitanimura merged commit 55ae543 into apache:main on Feb 20, 2025. 74 checks passed.
@kazuyukitanimura (Contributor):

Merged, thank you @parthchandra @comphead @andygrove @mbutrovich
