fix: fix various unit test failures in native_datafusion and native_iceberg_compat readers #1415

Merged
8 commits merged from the complex_test_failures_pr branch into apache:main on Feb 20, 2025

Conversation

@parthchandra (Contributor)

Major changes:

  • Allow UInt64 to Decimal and FixedWidthBinary to Binary conversions in complex readers
  • Do not enable prefetch reads in tests if the complex reader is enabled (see the sketch after this list)
  • Fix more incompatible checks in uint_8/uint_16 tests
  • Skip datetime rebase tests for complex readers (not supported; also sketched below)
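
A hedged sketch of the test-side gating described above. `usingDataFusionParquetExec` appears in this PR's diffs; the stub body, the `spark.comet.scan.impl` values, and the prefetch config key are assumptions for illustration, not the exact Comet definitions:

```scala
import org.apache.spark.sql.internal.SQLConf

object NativeReaderTestGating {
  // Stand-in for the helper used in this PR's diffs; the real implementation
  // inspects the Comet session config. Key and values here are assumptions.
  def usingDataFusionParquetExec(conf: SQLConf): Boolean =
    conf.getConfString("spark.comet.scan.impl", "native_comet") != "native_comet"

  // Only enable prefetch reads when the original reader is active
  // (the prefetch key below is illustrative, not the exact Comet key).
  def configurePrefetch(conf: SQLConf): Unit =
    if (!usingDataFusionParquetExec(conf)) {
      conf.setConfString("spark.comet.parquet.prefetch.enabled", "true")
    }
}

// Inside a ScalaTest body, the rebase tests can then be canceled for the
// native readers, which lines up with the "canceled" counts reported below:
//   if (usingDataFusionParquetExec(conf)) {
//     cancel("datetime rebase is not supported by the native readers")
//   }
```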

There may be conflicts between this and #1413 (@mbutrovich), which removes the cast_supported method, but they can be reconciled afterwards.

Without #1413 the failure counts are:

native_datafusion: Tests: succeeded 658, failed 19, canceled 4, ignored 54, pending 0
native_iceberg_compat: Tests: succeeded 662, failed 15, canceled 4, ignored 54, pending 0

@codecov-commenter commented Feb 18, 2025

Codecov Report

Attention: Patch coverage is 59.09091% with 9 lines in your changes missing coverage. Please review.

Project coverage is 57.84%. Comparing base (f09f8af) to head (4d2dd2d).
Report is 42 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...t/parquet/CometParquetPartitionReaderFactory.scala | 63.63% | 0 Missing and 4 partials ⚠️ |
| ...ala/org/apache/spark/sql/comet/CometScanExec.scala | 66.66% | 0 Missing and 3 partials ⚠️ |
| ...va/org/apache/comet/parquet/NativeBatchReader.java | 0.00% | 1 Missing ⚠️ |
| .../apache/comet/parquet/CometParquetFileFormat.scala | 0.00% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
```
@@             Coverage Diff              @@
##               main    #1415      +/-   ##
============================================
+ Coverage     56.12%   57.84%   +1.72%     
- Complexity      976      985       +9     
============================================
  Files           119      122       +3     
  Lines         11743    12130     +387     
  Branches       2251     2285      +34     
============================================
+ Hits           6591     7017     +426     
+ Misses         4012     3938      -74     
- Partials       1140     1175      +35     
```


```scala
super.test(testName, testTags: _*)(withSQLConf(SQLConf.USE_V1_SOURCE_LIST.key -> "") {
  testFun
})(pos)
// Datasource V2 is not supported by complex readers so force the scan impl back
```
Contributor:

Let's be consistent and call them native readers. We will likely use these for more than just complex types at some point.

Contributor Author:

Changed the comment.

@mbutrovich (Contributor):

#1413 shouldn't really conflict with this (I mean it will from a merge standpoint, but not from a logic standpoint). This just adds more true cases to cast_supported and #1413 removes the check entirely.

@parthchandra (Contributor Author):

> #1413 shouldn't really conflict with this (I mean it will from a merge standpoint, but not from a logic standpoint). This just adds more true cases to cast_supported and #1413 removes the check entirely.

I've updated the PR and removed the schema adapter changes. The number of failures has gone up, but they should be taken care of by #1413.
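
For readers following along, a Scala-flavored sketch of the kind of check being discussed. The real cast_supported check lives in the native (Rust) code and #1413 removes it entirely; the types and names below are illustrative only:

```scala
object CastSupportSketch {
  // Illustrative stand-ins for Arrow types; not the real native definitions.
  sealed trait ArrowT
  case object UInt64 extends ArrowT
  final case class Decimal(precision: Int, scale: Int) extends ArrowT
  case object FixedSizeBinary extends ArrowT
  case object Binary extends ArrowT

  // This PR effectively adds "true" cases like the first two below;
  // #1413 then removes the check entirely.
  def castSupported(from: ArrowT, to: ArrowT): Boolean = (from, to) match {
    case (UInt64, _: Decimal)      => true // newly allowed: UInt64 -> Decimal
    case (FixedSizeBinary, Binary) => true // newly allowed: FixedWidthBinary -> Binary
    case (a, b)                    => a == b // identical types are trivially supported
  }
}
```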

@parthchandra (Contributor Author):

@kazuyukitanimura @andygrove @huaxingao @comphead review requested.

@andygrove (Member) left a comment:

LGTM. Thanks @parthchandra

```diff
@@ -260,7 +260,8 @@ public void init() throws URISyntaxException, IOException {
     } ////// End get requested schema

     String timeZoneId = conf.get("spark.sql.session.timeZone");
     Schema arrowSchema = Utils$.MODULE$.toArrowSchema(sparkSchema, timeZoneId);
+    // Native code uses "UTC" always as the timeZoneId when converting from spark to arrow schema.
```
Contributor:
👍

```diff
@@ -260,7 +260,8 @@ public void init() throws URISyntaxException, IOException {
     } ////// End get requested schema

     String timeZoneId = conf.get("spark.sql.session.timeZone");
```
Contributor:

Wondering: should we always use UTC then, and remove reading the timezone from the Spark parameter?

Contributor Author:

We need to be using the session timezone parameter for at least some conversions (especially those involving timestamp_ntz). The conversions of timestamp/timestamp_ntz between Arrow and Spark are somewhat convoluted and sometimes require the timezone offset to be applied to the values to make them consistent.
It's safer to make sure the timezone parameter is passed to the native side so it can be applied when necessary.
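
To illustrate why the offset matters, a minimal, self-contained sketch using plain java.time rather than Comet's actual conversion code; the zone and value are made up:

```scala
import java.time.{LocalDateTime, ZoneId}

object NtzOffsetDemo extends App {
  // A TIMESTAMP_NTZ value is a "wall clock" reading with no zone attached.
  val wallClock = LocalDateTime.parse("2025-02-20T12:00:00")

  // Arrow-side values are UTC-based instants, so the same wall-clock reading
  // maps to different instants depending on which zone is applied:
  val inLA  = wallClock.atZone(ZoneId.of("America/Los_Angeles")).toInstant
  val inUTC = wallClock.atZone(ZoneId.of("UTC")).toInstant

  println(inLA)  // 2025-02-20T20:00:00Z
  println(inUTC) // 2025-02-20T12:00:00Z (an 8-hour skew if the wrong zone is used)
}
```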

@comphead (Contributor) left a comment:

LGTM, thanks @parthchandra

```diff
@@ -1001,7 +1012,7 @@ abstract class ParquetReadSuite extends CometTestBase {
         Seq(StructField("_1", LongType, false), StructField("_2", DoubleType, false)))

     withParquetDataFrame(data, schema = Some(readSchema)) { df =>
-      if (enableSchemaEvolution) {
+      if (enableSchemaEvolution || usingDataFusionParquetExec(conf)) {
```
Contributor:

Hmm, does this mean that when usingDataFusionParquetExec == true we always use schema evolution, regardless of COMET_SCHEMA_EVOLUTION_ENABLED? Does that mean it will be incompatible with Spark 3.x?

Contributor Author:

I suspect it may be incompatible. We will address this as we get to the point where we are able to run Spark tests with different versions of Spark. At the moment we are still clearing up the Comet test failures.

Contributor:

I feel we need a TODO comment or an issue ticket to track this.

Contributor Author:

Done.

@parthchandra force-pushed the complex_test_failures_pr branch from c6ccb59 to 4d2dd2d on February 20, 2025 at 11:52.
@parthchandra (Contributor Author):

@kazuyukitanimura addressed your last comment and also rebased (there were merge conflicts).
Test failure counts after rebase:

native_datafusion:     Tests: succeeded 661, failed 14, canceled 6, ignored 54, pending 0
native_iceberg_compat: Tests: succeeded 663, failed 12, canceled 6, ignored 54, pending 0

@kazuyukitanimura merged commit 55ae543 into apache:main on Feb 20, 2025. 74 checks passed.
@kazuyukitanimura (Contributor):

Merged, thank you @parthchandra @comphead @andygrove @mbutrovich
