fix: disable checking for uint_8 and uint_16 if complex type readers are enabled #1376
Conversation
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files:

@@             Coverage Diff              @@
##              main    #1376       +/-   ##
=============================================
- Coverage     56.12%   39.07%    -17.06%
- Complexity      976     2077      +1101
=============================================
  Files           119      263       +144
  Lines         11743    60767     +49024
  Branches       2251    12921     +10670
=============================================
+ Hits           6591    23742     +17151
- Misses         4012    32534     +28522
- Partials       1140     4491      +3351
makeParquetFileAllTypes(path, dictionaryEnabled = dictionaryEnabled, valueRanges + 1)
withParquetTable(path.toString, "tbl") {
  if (CometSparkSessionExtensions.isComplexTypeReaderEnabled(conf)) {
    checkSparkAnswer("select _9, _10 FROM tbl order by _11")
Do we already have logic to fall back to Spark when the complex type reader is enabled and when the query references uint Parquet fields?
No, we don't, for two reasons. First, in the plan we get the schema as understood by Spark, so the signed int_8 and int_16 values are indistinguishable from the unsigned ones; as a result we fall back to Spark for both signed and unsigned integers. Second, too many unit tests fail because they check that the plan contains a Comet operator, and they would need to be modified.
I'm open to putting it back, though.
> As a result we fall back to Spark for both signed and unsigned integers.
Just 8 and 16 bit, or all integers? I'm fine with falling back for 8 and 16 bit for now, although it would be nice to have a config to override this (with the understanding that behavior is incorrect for unsigned integers).
Just 8 and 16 bit.
I started with the fallback to Spark and a compat override. The reason I reverted it is that I couldn't see a way to get to compatibility with Spark even if/after apache/arrow-rs#7040 is addressed.
Let me do as you suggest. Marking this as draft in the meantime.
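For context, here is a minimal sketch of the kind of fallback check plus compat override being discussed. The helper name is hypothetical, and only the spark.comet.scan.allowIncompatible key (shown later in this PR) is taken from the source; the actual Comet logic may differ.

```scala
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.{ByteType, ShortType, StructType}

// Hypothetical helper: fall back to Spark when the scan schema contains byte/short
// columns (which may be backed by unsigned Parquet ints) and the override is not set.
def shouldFallBackToSpark(scanSchema: StructType, conf: SQLConf): Boolean = {
  val hasSmallIntColumns = scanSchema.fields.exists { f =>
    f.dataType == ByteType || f.dataType == ShortType
  }
  val allowIncompatible =
    conf.getConfString("spark.comet.scan.allowIncompatible", "false").toBoolean
  hasSmallIntColumns && !allowIncompatible
}
```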
@andygrove updated this to fall back to Spark, updated the unit tests, and removed the draft tag.
pageSize: Int = 128,
randomSize: Int = 0): Unit = {
val schemaStr =
def getAllTypesParquetSchema: String = {
If we are renaming this method, I wonder if we should remove the AllTypes part since it does not generate all types. Perhaps getPrimitiveTypesParquetSchema?
done
sql("SELECT array_remove(array(_2, _3,_4), _3) from t1 where _3 is not null"))
checkSparkAnswerAndOperator(sql(
  "SELECT array_remove(case when _2 = _3 THEN array(_2, _3,_4) ELSE null END, _3) from t1"))
withSQLConf(CometConf.COMET_SCAN_ALLOW_INCOMPATIBLE.key -> "true") {
I wonder if we should default COMET_SCAN_ALLOW_INCOMPATIBLE=true in CometTestBase and then just disable it in specific tests?
I'd be okay with that. Most Spark users will not have unsigned ints, and defaulting this to false penalizes users who do not have any unsigned ints unless they explicitly set the allow-incompatible flag.
I'm changing this and reverting the unit tests that had to explicitly set the flag.
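A rough sketch of what that could look like, assuming CometTestBase builds on Spark's shared-session test utilities; the class and test names below are illustrative, not the actual Comet test code.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.test.SharedSparkSession

// Illustrative test base: enable the flag for all tests by default, and let a test
// that needs the strict behavior turn it back off in its own scope.
abstract class ExampleCometTestBase extends QueryTest with SharedSparkSession {
  override protected def sparkConf: SparkConf =
    super.sparkConf.set("spark.comet.scan.allowIncompatible", "true")

  test("strict unsigned int handling") {
    withSQLConf("spark.comet.scan.allowIncompatible" -> "false") {
      // assertions that expect a fallback to Spark would go here
    }
  }
}
```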
conf("spark.comet.scan.allowIncompatible")
  .doc(
    "Comet is not currently fully compatible with Spark for all datatypes. " +
      s"Set this config to true to allow them anyway. $COMPAT_GUIDE.")
We link to the Compatibility Guide here, but there is no new information in that guide about the handling of byte/short, so it would be good to add that. This could be done in a follow-on PR.
@@ -1352,6 +1352,15 @@ object CometSparkSessionExtensions extends Logging {
  org.apache.spark.SPARK_VERSION >= "4.0"
}

def isComplexTypeReaderEnabled(conf: SQLConf): Boolean = {
I find the naming confusing here. This method determines whether we are using native_datafusion or native_iceberg_compat (which both use DataFusion's ParquetExec). There is no logic related to complex types.
Complex type support was a big motivation for adding these new scans, but it doesn't seem to make sense to refer to complex types in the changes in this PR.
This is just a nit, and we can rename the methods in a future PR.
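To make the naming point concrete, here is a hedged sketch of what the helper effectively answers. The config key and values are assumptions based on this discussion, not necessarily the exact Comet code.

```scala
import org.apache.spark.sql.internal.SQLConf

// Assumed scan-implementation config: both of these values route Parquet scans through
// DataFusion's ParquetExec, which is what the method really detects, so a name that
// refers to the DataFusion scan would describe it better than "complex type reader".
def usesDataFusionParquetExec(conf: SQLConf): Boolean = {
  val scanImpl = conf.getConfString("spark.comet.scan.impl", "native_comet")
  scanImpl == "native_datafusion" || scanImpl == "native_iceberg_compat"
}
```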
LGTM. Thanks @parthchandra. I left some comments, but they are nits that we can address in follow-on PRs.
    "Comet is not currently fully compatible with Spark for all datatypes. " +
      s"Set this config to true to allow them anyway. $COMPAT_GUIDE.")
  .booleanConf
  .createWithDefault(true)
I think that we should default this to false because it is a correctness issue, and explicitly set it to true in CometTestBase.
I created a follow-on PR #1398.
Which issue does this PR close?
Partly addresses test failures caused by #1348
Rationale for this change
As the issue points out, DataFusion Comet returns different values from Spark for uint_8 and uint_16 Parquet types that may have the sign bit set.
What changes are included in this PR?
Rewrites the Parquet test files to not use the uint_8 and uint_16 types if the complex type readers are enabled.
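As an illustration of that change (the schema fragments and helper name below are made up for the example, not the actual test code), the generated test schema can swap unsigned 8/16-bit columns for signed ones when the DataFusion-based readers are active:

```scala
// Hypothetical helper: pick signed 8/16-bit Parquet columns instead of unsigned ones
// when the new readers are enabled, so the test data avoids the incompatible types.
def allTypesSchema(usesDataFusionScan: Boolean): String = {
  if (usesDataFusionScan) {
    """
      |message test {
      |  optional int32 _9  (INT_8);
      |  optional int32 _10 (INT_16);
      |}
    """.stripMargin
  } else {
    """
      |message test {
      |  optional int32 _9  (UINT_8);
      |  optional int32 _10 (UINT_16);
      |}
    """.stripMargin
  }
}
```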
How are these changes tested?
Locally, using existing unit tests. Note that the unit tests still fail, but not because of unsigned ints.