fix: disable checking for uint_8 and uint_16 if complex type readers are enabled #1376
Conversation
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files:

@@             Coverage Diff              @@
##              main    #1376       +/-   ##
=============================================
- Coverage     56.12%   39.07%    -17.06%
- Complexity      976     2077      +1101
=============================================
  Files           119      263       +144
  Lines         11743    60767     +49024
  Branches       2251    12921     +10670
=============================================
+ Hits           6591    23742     +17151
- Misses         4012    32534     +28522
- Partials       1140     4491      +3351
makeParquetFileAllTypes(path, dictionaryEnabled = dictionaryEnabled, valueRanges + 1)
withParquetTable(path.toString, "tbl") {
  if (CometSparkSessionExtensions.isComplexTypeReaderEnabled(conf)) {
    checkSparkAnswer("select _9, _10 FROM tbl order by _11")
Do we already have logic to fall back to Spark when the complex type reader is enabled and when the query references uint Parquet fields?
No, we don't, for two reasons. First, in the plan we get the schema as understood by Spark, so the signed int_8 and int_16 values are indistinguishable from the unsigned ones; as a result we fall back to Spark for both signed and unsigned integers. Second, too many unit tests fail because they check that the plan contains a Comet operator, and they would need to be modified.
I'm open to putting it back, though.
> As a result we fall back to Spark for both signed and unsigned integers.
Just 8 and 16 bit, or all integers? I'm fine with falling back for 8 and 16 bit for now, although it would be nice to have a config to override this (with the understanding that behavior is incorrect for unsigned integers).
Just 8 and 16 bit.
I started with the fallback to Spark and a compat override. The reason I reverted it is that I couldn't see a way to get to compatibility with Spark even if/after apache/arrow-rs#7040 is addressed.
Let me do as you suggest. Marking this as draft in the meantime.
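For context, here is a minimal sketch of the kind of fallback check plus compat override being discussed. The helper name is hypothetical, and only the spark.comet.scan.allowIncompatible key (shown later in this PR) is taken from the source; the actual Comet logic may differ.

```scala
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.{ByteType, ShortType, StructType}

// Hypothetical helper: fall back to Spark when the scan schema contains byte/short
// columns (which may be backed by unsigned Parquet ints) and the override is not set.
def shouldFallBackToSpark(scanSchema: StructType, conf: SQLConf): Boolean = {
  val hasSmallIntColumns = scanSchema.fields.exists { f =>
    f.dataType == ByteType || f.dataType == ShortType
  }
  val allowIncompatible =
    conf.getConfString("spark.comet.scan.allowIncompatible", "false").toBoolean
  hasSmallIntColumns && !allowIncompatible
}
```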
@andygrove updated this to fall back to Spark, updated the unit tests, and removed the draft tag.
pageSize: Int = 128,
randomSize: Int = 0): Unit = {
val schemaStr =
def getAllTypesParquetSchema: String = {
If we are renaming this method, I wonder if we should remove the AllTypes part since it does not generate all types. Perhaps getPrimitiveTypesParquetSchema?
done
sql("SELECT array_remove(array(_2, _3,_4), _3) from t1 where _3 is not null"))
checkSparkAnswerAndOperator(sql(
  "SELECT array_remove(case when _2 = _3 THEN array(_2, _3,_4) ELSE null END, _3) from t1"))
withSQLConf(CometConf.COMET_SCAN_ALLOW_INCOMPATIBLE.key -> "true") {
I wonder if we should default COMET_SCAN_ALLOW_INCOMPATIBLE=true in CometTestBase and then just disable it in specific tests?
I'd be okay with that. Most Spark users will not have unsigned ints, and defaulting this to false penalizes users who do not have any unsigned ints unless they explicitly set the allow-incompatible flag.
I'm changing this and reverting the unit tests that had to explicitly set the flag.
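A rough sketch of what that could look like, assuming CometTestBase builds on Spark's shared-session test utilities; the class and test names below are illustrative, not the actual Comet test code.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.test.SharedSparkSession

// Illustrative test base: enable the flag for all tests by default, and let a test
// that needs the strict behavior turn it back off in its own scope.
abstract class ExampleCometTestBase extends QueryTest with SharedSparkSession {
  override protected def sparkConf: SparkConf =
    super.sparkConf.set("spark.comet.scan.allowIncompatible", "true")

  test("strict unsigned int handling") {
    withSQLConf("spark.comet.scan.allowIncompatible" -> "false") {
      // assertions that expect a fallback to Spark would go here
    }
  }
}
```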
conf("spark.comet.scan.allowIncompatible")
  .doc(
    "Comet is not currently fully compatible with Spark for all datatypes. " +
      s"Set this config to true to allow them anyway. $COMPAT_GUIDE.")
We link to the Compatibility Guide here, but there is no new information in that guide about the handling of byte/short, so it would be good to add that. This could be done in a follow-on PR.
@@ -1352,6 +1352,15 @@ object CometSparkSessionExtensions extends Logging {
  org.apache.spark.SPARK_VERSION >= "4.0"
}

def isComplexTypeReaderEnabled(conf: SQLConf): Boolean = {
I find the naming confusing here. This method determines whether we are using native_datafusion or native_iceberg_compat (which both use DataFusion's ParquetExec). There is no logic related to complex types.
Complex type support was a big motivation for adding these new scans, but it doesn't seem to make sense to refer to complex types in the changes in this PR.
This is just a nit, and we can rename the methods in a future PR.
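To make the naming point concrete, here is a hedged sketch of what the helper effectively answers. The config key and values are assumptions based on this discussion, not necessarily the exact Comet code.

```scala
import org.apache.spark.sql.internal.SQLConf

// Assumed scan-implementation config: both of these values route Parquet scans through
// DataFusion's ParquetExec, which is what the method really detects, so a name that
// refers to the DataFusion scan would describe it better than "complex type reader".
def usesDataFusionParquetExec(conf: SQLConf): Boolean = {
  val scanImpl = conf.getConfString("spark.comet.scan.impl", "native_comet")
  scanImpl == "native_datafusion" || scanImpl == "native_iceberg_compat"
}
```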
LGTM. Thanks @parthchandra. I left some comments, but they are nits that we can address in follow-on PRs.
    "Comet is not currently fully compatible with Spark for all datatypes. " +
      s"Set this config to true to allow them anyway. $COMPAT_GUIDE.")
  .booleanConf
  .createWithDefault(true)
I think that we should default this to false because it is a correctness issue, and explicitly set it to true in CometTestBase.
I created a follow-on PR #1398.
Which issue does this PR close?
Partly addresses test failures caused by #1348
Rationale for this change
As the issue points out, DataFusion Comet returns different values from Spark for uint_8 and uint_16 Parquet types that may have the sign bit set.
What changes are included in this PR?
Rewrites the Parquet test files to not use the uint_8 and uint_16 types if the complex type readers are enabled.
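As an illustration of that change (the schema fragments and helper name below are made up for the example, not the actual test code), the generated test schema can swap unsigned 8/16-bit columns for signed ones when the DataFusion-based readers are active:

```scala
// Hypothetical helper: pick signed 8/16-bit Parquet columns instead of unsigned ones
// when the new readers are enabled, so the test data avoids the incompatible types.
def allTypesSchema(usesDataFusionScan: Boolean): String = {
  if (usesDataFusionScan) {
    """
      |message test {
      |  optional int32 _9  (INT_8);
      |  optional int32 _10 (INT_16);
      |}
    """.stripMargin
  } else {
    """
      |message test {
      |  optional int32 _9  (UINT_8);
      |  optional int32 _10 (UINT_16);
      |}
    """.stripMargin
  }
}
```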
How are these changes tested?
Locally, using existing unit tests. Note that the unit tests still fail, but not because of unsigned ints.