feat: Define function signatures in CometFuzz #2614
Conversation
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##              main    #2614      +/-  ##
============================================
+ Coverage    56.12%   59.15%   +3.02%
- Complexity     976     1444     +468
============================================
  Files          119      147      +28
  Lines        11743    13735    +1992
  Branches      2251     2356     +105
============================================
+ Hits          6591     8125    +1534
- Misses        4012     4386     +374
- Partials      1140     1224      +84

View full report in Codecov by Sentry.
How does this change with different Spark versions, or does it?
It really doesn't. For example, if we add a new function that only exists in Spark 4.0 and then run the fuzz test against an older version, the query will fail with both Spark and Comet, so that counts as a pass.
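Roughly, that policy looks like the following sketch (not the actual CometFuzz code; runSpark/runComet are hypothetical stand-ins for executing the generated query on each engine):

```scala
import scala.util.{Failure, Success, Try}

// Sketch of the pass/fail policy: if the query fails on both Spark and Comet
// (e.g. the function does not exist in the Spark version under test), it still counts as a pass.
def queryPasses(runSpark: () => Seq[String], runComet: () => Seq[String]): Boolean =
  (Try(runSpark()), Try(runComet())) match {
    case (Success(sparkRows), Success(cometRows)) => sparkRows == cometRows
    case (Failure(_), Failure(_)) => true
    case _ => false
  }
```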
What if the signature changes in a new Spark release? Then it would start failing for both Spark and Comet (and thus pass)? I'm just trying to understand what the maintenance process will be for future releases, and how we might document it.
val dateScalarFunc: Seq[Function] =
  Seq(Function("year", 1), Function("hour", 1), Function("minute", 1), Function("second", 1))
private def createFunctionWithInputs(name: String, inputs: Seq[SparkType]): Function = {
Suggested change:
- private def createFunctionWithInputs(name: String, inputs: Seq[SparkType]): Function = {
+ private def createFunctionWithInputParams(name: String, inputs: Seq[SparkType]): Function = {
inputs might be confused with input data
I renamed it to createFunctionWithInputTypes
That is true. This is all very manual at the moment. I briefly looked into using Spark APIs to get the signatures, but there are some challenges. We can look at classes to see if they extend certain traits.
}

// Math expressions (corresponds to mathExpressions in QueryPlanSerde)
val mathScalarFunc: Seq[Function] = Seq(
Perhaps we can validate that no expressions in QueryPlanSerde are left uncovered by the fuzzer? It would help us keep a consistent set of functions under fuzz testing, especially when someone adds a new function.
That would be nice. The main challenge is that in the expr map in QueryPlanSerde we just have class names and no mapping to SQL function names.
My current approach is to use AI to detect any expressions in QueryPlanSerde
that are not covered in the fuzz test.
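One possible starting point is Spark's built-in FunctionRegistry, which maps SQL function names to the expression class that implements them. This is only a sketch; the Comet class set and the fuzzed function set below are hypothetical placeholders:

```scala
import org.apache.spark.sql.catalyst.analysis.FunctionRegistry

// Map every built-in SQL function name to the implementing expression class name.
val sqlNameToClass: Map[String, String] = FunctionRegistry.builtin
  .listFunction()
  .flatMap(id => FunctionRegistry.builtin.lookupFunction(id).map(info => id.funcName -> info.getClassName))
  .toMap

// Hypothetical inputs: class names QueryPlanSerde can serialize, and functions the fuzzer already exercises.
val cometExprClasses: Set[String] = Set("org.apache.spark.sql.catalyst.expressions.Ascii")
val fuzzedFunctions: Set[String] = Set("ascii", "chr")

// SQL functions Comet claims to support that the fuzzer does not cover yet.
val uncovered = sqlNameToClass.collect {
  case (name, cls) if cometExprClasses.contains(cls) && !fuzzedFunctions.contains(name) => name
}
```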
I filed #2627 to find a way to automate this.
createUnaryStringFunction("ascii"),
createUnaryStringFunction("bit_length"),
createUnaryStringFunction("chr"),
createFunctionWithInputs("concat_ws", Seq(SparkStringType, SparkStringType)),
We should be supporting concat with string inputs in 50.3.0 (#2604), so we need to add it there. Btw @andygrove, concat supports string or array inputs; it looks like this design supports that?
Yes, the framework supports it. In this case we could add two signatures to the function. One that takes two strings and one that takes two arrays. I did not implement support for variadic functions yet. I will file an issue for that.
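For example, the two cases could be expressed as two signatures under the same SQL name. This is a sketch with placeholder types so it stands on its own; CometFuzz's real SparkType and Function definitions differ:

```scala
// Placeholder types for the sketch only; not the actual CometFuzz classes.
sealed trait SparkType
case object SparkStringType extends SparkType
case class SparkArrayType(element: SparkType) extends SparkType
case class Function(name: String, signatures: Seq[Seq[SparkType]])

// Two fixed-arity signatures for concat (strings and arrays) until variadic support exists.
val concat = Function(
  "concat",
  Seq(
    Seq(SparkStringType, SparkStringType),
    Seq(SparkArrayType(SparkStringType), SparkArrayType(SparkStringType))))
```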
I added support for concat with two arguments. According to the docs: "The function works with strings, numeric, binary and compatible array columns."
a.length == b.length && a.zip(b).forall(x => same(x._1, x._2))
case (a: Row, b: Row) =>
  // struct support
  format(a) == format(b)
I'm not sure what is being compared here. Is it the text representation of the structs?
Yes. This could probably be made more efficient.
I updated this.
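For reference, a field-by-field comparison could look roughly like this (a sketch, not the code that was merged; same mirrors the recursive comparison helper shown in the diff above):

```scala
import org.apache.spark.sql.Row

// Sketch: recurse into struct fields instead of comparing formatted strings.
def same(l: Any, r: Any): Boolean = (l, r) match {
  case (null, null) => true
  case (a: Row, b: Row) =>
    a.length == b.length && (0 until a.length).forall(i => same(a.get(i), b.get(i)))
  case (a, b) => a == b
}
```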
  return l == null && r == null
}
(l, r) match {
  case (a: Float, b: Float) if a.isInfinity => b.isInfinity
Should we also check negInfinity and posInfinity?
Added.
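For illustration, the float case with separate positive/negative infinity handling might look like this (a sketch; not the exact code in the PR):

```scala
// Sketch: treat positive and negative infinity separately instead of a single isInfinity check.
def sameFloat(a: Float, b: Float): Boolean =
  if (a.isNaN && b.isNaN) true
  else if (a.isPosInfinity) b.isPosInfinity
  else if (a.isNegInfinity) b.isNegInfinity
  else a == b
```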
@mbutrovich One option I would like to explore is to produce a summary report after running the queries, which would show how many successful queries ran for each expression and show error messages for any expressions that always failed. Edit: I filed an issue for this: #2618
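A minimal sketch of what such a report could look like (QueryResult is a hypothetical result type, not something in CometFuzz today):

```scala
// Hypothetical per-query result captured while fuzzing.
case class QueryResult(function: String, passed: Boolean, error: Option[String])

// Print pass counts per function and surface errors for functions that never passed.
def printSummary(results: Seq[QueryResult]): Unit =
  results.groupBy(_.function).toSeq.sortBy(_._1).foreach { case (fn, rs) =>
    val passes = rs.count(_.passed)
    println(s"$fn: $passes/${rs.length} queries passed")
    if (passes == 0) {
      rs.flatMap(_.error).distinct.foreach(e => println(s"  error: $e"))
    }
  }
```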
For comparison, should we delegate this to Spark itself for the simplest cases? 🤔
Will this involve re-executing the queries?
Yes, both would be actions and would trigger DataFrame evaluation. To avoid this we can cache/checkpoint the DataFrames so each is evaluated only once. Another option is to save both result DataFrames to disk and then read them back for the correctness checks; this option may also help in the future when Comet supports the Parquet writer.
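A sketch of both ideas (sparkDf/cometDf are hypothetical DataFrames, not CometFuzz's current code): caching avoids re-running the query, and exceptAll lets Spark do the row comparison, although it uses exact equality and so would not cover the NaN/Infinity handling discussed elsewhere in this PR.

```scala
import org.apache.spark.sql.DataFrame

// Sketch: cache both results so each query runs only once, then let Spark compare the rows.
def sameResults(sparkDf: DataFrame, cometDf: DataFrame): Boolean = {
  val a = sparkDf.cache()
  val b = cometDf.cache()
  a.count() == b.count() && a.exceptAll(b).isEmpty && b.exceptAll(a).isEmpty
}
```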
We may be able to obtain function signatures in these ways:
Thanks. I filed an issue for this: #2627
I'm not sure how we can run
@mbutrovich @comphead Thanks for the reviews so far. CometFuzz is still quite experimental/hacky, but this PR expands coverage of tested functions and reduces the number of invalid queries generated now that we have signatures, so it seems worth merging in my opinion. It would be better to automate the discovery of function signatures rather than hand-code them. I filed #2627 to explore this. Here are stats from a recent run of this version:
So far, this version of CometFuzz has found three bugs:
Moved to draft until #2629 is merged.
Which issue does this PR close?
Part of #2611
Rationale for this change
The fuzz tester currently passes random inputs to functions without checking whether they are the correct type. For example, it could try to pass a string to a numeric function. Although this can be a valid test (because Spark will add a cast to coerce the input type), it also means that many generated queries are not valid, so the approach is not very efficient.
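To make the idea concrete, here is a rough sketch of what choosing arguments by signature could look like (all names here are illustrative placeholders, not the actual CometFuzz classes):

```scala
import scala.util.Random

// Illustrative placeholder types; CometFuzz's real SparkType/Function definitions differ.
sealed trait ColumnType
case object StringColumn extends ColumnType
case object NumericColumn extends ColumnType

case class Signature(name: String, inputTypes: Seq[ColumnType])

// Pick one column of the required type for each argument, or None if the schema
// has no column of a required type, so we never generate an invalid call.
def pickArguments(
    sig: Signature,
    columnsByType: Map[ColumnType, Seq[String]],
    rand: Random): Option[Seq[String]] = {
  val picked = sig.inputTypes.map { t =>
    columnsByType.getOrElse(t, Seq.empty) match {
      case cols if cols.nonEmpty => Some(cols(rand.nextInt(cols.length)))
      case _ => None
    }
  }
  if (picked.forall(_.isDefined)) Some(picked.flatten) else None
}
```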
What changes are included in this PR?
How are these changes tested?
Manually