
Conversation

@andygrove (Member) commented Oct 20, 2025

Which issue does this PR close?

Part of #2611

Rationale for this change

The fuzz tester currently passes random inputs to functions without checking whether they are of the correct type. For example, it could try to pass a string to a numeric function. Although this can be a valid test (because Spark will add a cast to coerce the input type), it also means that many generated queries are not valid, so the process is not very efficient.

What changes are included in this PR?

  • Define signatures for functions (see the sketch below)
  • Add all functions that Comet currently supports
  • Update query generator to only generate queries using valid input columns for the functions
  • Improve error handling
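
As a rough illustration of what these signature definitions look like, here is a minimal sketch reconstructed from fragments visible in the review diff; it is not the PR's actual code, and the SparkType variants and Function fields shown are assumptions:

```scala
// A minimal sketch, not the PR's actual code: SparkType, Function, and the
// helpers are reconstructed from fragments visible in the review diff.
sealed trait SparkType
case object SparkStringType extends SparkType
case object SparkDateType extends SparkType

// One valid signature: a function name plus its expected input types.
case class Function(name: String, inputTypes: Seq[SparkType])

def createFunctionWithInputTypes(name: String, inputs: Seq[SparkType]): Function =
  Function(name, inputs)

def createUnaryStringFunction(name: String): Function =
  createFunctionWithInputTypes(name, Seq(SparkStringType))

// The query generator can then draw arguments only from columns whose types
// match a function's signature, instead of passing arbitrary columns:
val stringScalarFunc: Seq[Function] = Seq(
  createUnaryStringFunction("ascii"),
  createFunctionWithInputTypes("concat_ws", Seq(SparkStringType, SparkStringType)))
```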

How are these changes tested?

Manually

@codecov-commenter commented Oct 20, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 59.15%. Comparing base (f09f8af) to head (6058427).
⚠️ Report is 632 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2614      +/-   ##
============================================
+ Coverage     56.12%   59.15%   +3.02%     
- Complexity      976     1444     +468     
============================================
  Files           119      147      +28     
  Lines         11743    13735    +1992     
  Branches       2251     2356     +105     
============================================
+ Hits           6591     8125    +1534     
- Misses         4012     4386     +374     
- Partials       1140     1224      +84     


@andygrove andygrove marked this pull request as ready for review October 21, 2025 15:50
@andygrove andygrove changed the title from "feat: Define function signatures in CometFuzz [WIP]" to "feat: Define function signatures in CometFuzz" on Oct 21, 2025
@mbutrovich (Contributor) commented:

How does this change with different Spark versions, or does it?

@andygrove (Member, Author) commented:

> How does this change with different Spark versions, or does it?

It really doesn't. As an example, if we add a new function that only exists in Spark 4.0 and then run the fuzz test against an older version, the query will fail with both Spark and Comet, so that is a pass.

@mbutrovich (Contributor) commented Oct 21, 2025

> It really doesn't. As an example, if we add a new function that only exists in Spark 4.0 and then run the fuzz test against an older version, the query will fail with both Spark and Comet, so that is a pass.

What if the signature changes in a new Spark release? Then it would start failing for Spark and Comet (and thus pass)?

I'm just trying to understand what the maintenance process is for future releases, and how to potentially document that.


val dateScalarFunc: Seq[Function] =
Seq(Function("year", 1), Function("hour", 1), Function("minute", 1), Function("second", 1))
private def createFunctionWithInputs(name: String, inputs: Seq[SparkType]): Function = {
A Contributor suggested:

Suggested change:

- private def createFunctionWithInputs(name: String, inputs: Seq[SparkType]): Function = {
+ private def createFunctionWithInputParams(name: String, inputs: Seq[SparkType]): Function = {

The Contributor added:

inputs might be confused with input data

@andygrove replied:

I renamed it to createFunctionWithInputTypes.

@andygrove (Member, Author) commented:

> It really doesn't. As an example, if we add a new function that only exists in Spark 4.0 and then run the fuzz test against an older version, the query will fail with both Spark and Comet, so that is a pass.

> What if the signature changes in a new Spark release? Then it would start failing for Spark and Comet (and thus pass)?
>
> I'm just trying to understand what the maintenance process is for future releases, and how to potentially document that.

That is true. This is all very manual at the moment.

I briefly looked into using Spark APIs to get the signature, but there are some challenges. We can look at classes to see if they extend UnaryExpression or BinaryExpression, but to determine the valid input data types we would need to create an instance of the class, which seems challenging.
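
For reference, a sketch of what that reflection probe might look like; it assumes Spark catalyst is on the classpath, Hex is just an arbitrary example, and many expressions lack a simple single-Expression constructor, which is exactly the challenge:

```scala
import org.apache.spark.sql.catalyst.expressions.{ExpectsInputTypes, Expression, Literal, UnaryExpression}

// Arity can be read off the class hierarchy without an instance:
val cls = Class.forName("org.apache.spark.sql.catalyst.expressions.Hex")
val isUnary = classOf[UnaryExpression].isAssignableFrom(cls)

// ...but inputTypes lives on an instance, so we have to construct one, which
// only works when the expression has a simple Expression-only constructor:
val instance = cls
  .getConstructor(classOf[Expression])
  .newInstance(Literal(1L))
  .asInstanceOf[Expression]

val inputTypes = instance match {
  case e: ExpectsInputTypes => e.inputTypes // Seq[AbstractDataType]
  case _ => Seq.empty
}
```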

}

// Math expressions (corresponds to mathExpressions in QueryPlanSerde)
val mathScalarFunc: Seq[Function] = Seq(
A Contributor commented:

Perhaps we can validate that no expressions in QueryPlanSerde are left uncovered by the fuzzer? It would help us keep the set of fuzz-tested functions consistent, especially when someone adds a new function.

@andygrove replied:

That would be nice. The main challenge is that the expr map in QueryPlanSerde only has class names, with no mapping to the SQL function name.
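
One direction worth exploring, sketched here without verification against edge cases such as aliased or runtime-replaceable expressions: Spark's built-in FunctionRegistry exposes an ExpressionInfo per SQL name, whose class name could be joined against the classes in the expr map.

```scala
import org.apache.spark.sql.catalyst.analysis.FunctionRegistry

// Invert the built-in registry into expression-class-name -> SQL name(s),
// which could then be joined against the class names in QueryPlanSerde.
val classToSqlNames: Map[String, Seq[String]] = FunctionRegistry.builtin
  .listFunction()
  .flatMap(id => FunctionRegistry.builtin.lookupFunction(id).map(info => info.getClassName -> id.funcName))
  .groupBy { case (className, _) => className }
  .map { case (className, pairs) => className -> pairs.map { case (_, sqlName) => sqlName } }
```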

@andygrove added:

My current approach is to use AI to detect any expressions in QueryPlanSerde that are not covered in the fuzz test.

@andygrove added:

I filed #2627 to find a way to automate this.

createUnaryStringFunction("ascii"),
createUnaryStringFunction("bit_length"),
createUnaryStringFunction("chr"),
createFunctionWithInputs("concat_ws", Seq(SparkStringType, SparkStringType)),
A Contributor commented:

We should be supporting concat with string inputs in 50.3.0 (#2604), so it needs to be added there.

Btw @andygrove, concat supports strings or arrays as input; it looks like this design supports that?

@andygrove replied:

Yes, the framework supports it. In this case we could add two signatures to the function: one that takes two strings and one that takes two arrays. I have not implemented support for variadic functions yet; I will file an issue for that.
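
A hypothetical shape for that, reworking the earlier sketch so one function can carry several signatures (again an assumption, not the PR's code):

```scala
// Hypothetical extension of the signature model: one function, many signatures.
sealed trait SparkType
case object SparkStringType extends SparkType
case object SparkArrayType extends SparkType

case class FunctionSignature(inputTypes: Seq[SparkType])
case class Function(name: String, signatures: Seq[FunctionSignature])

// concat would accept either two strings or two compatible arrays:
val concat = Function(
  "concat",
  Seq(
    FunctionSignature(Seq(SparkStringType, SparkStringType)),
    FunctionSignature(Seq(SparkArrayType, SparkArrayType))))
```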

@andygrove commented:

I added support for concat with two arguments. According to the docs:

> The function works with strings, numeric, binary and compatible array columns.

        a.length == b.length && a.zip(b).forall(x => same(x._1, x._2))
      case (a: Row, b: Row) =>
        // struct support
        format(a) == format(b)
A Contributor commented:

I'm not sure what is compared here. Is it the text representation of the structs?

@andygrove replied:

Yes. This could probably be made more efficient.

@andygrove added:

I updated this.
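
The updated comparison is not shown in the thread; a plausible field-by-field version, assuming the surrounding element-comparison function is the `same` seen in the diff, would be:

```scala
import org.apache.spark.sql.Row

// Compare structs field by field, reusing the element comparison (`same`)
// instead of comparing formatted text representations.
def sameRow(a: Row, b: Row, same: (Any, Any) => Boolean): Boolean =
  a.length == b.length &&
    (0 until a.length).forall(i => same(a.get(i), b.get(i)))
```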

      return l == null && r == null
    }
    (l, r) match {
      case (a: Float, b: Float) if a.isInfinity => b.isInfinity
A Contributor commented:

Should we also check negInfinity and posInfinity?

@andygrove replied:

Added.
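
A sketch of what a sign-aware comparison could look like (isPosInfinity/isNegInfinity come from Scala's RichFloat; the exact change merged in the PR is not shown here):

```scala
// Distinguish positive and negative infinity instead of treating any two
// infinities as equal; NaN only matches NaN.
def sameFloat(a: Float, b: Float): Boolean =
  if (a.isNaN || b.isNaN) a.isNaN && b.isNaN
  else if (a.isInfinity || b.isInfinity)
    (a.isPosInfinity && b.isPosInfinity) || (a.isNegInfinity && b.isNegInfinity)
  else a == b // exact equality; a fuzzer might instead allow an epsilon
```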

@andygrove (Member, Author) commented Oct 21, 2025

> It really doesn't. As an example, if we add a new function that only exists in Spark 4.0 and then run the fuzz test against an older version, the query will fail with both Spark and Comet, so that is a pass.
>
> What if the signature changes in a new Spark release? Then it would start failing for Spark and Comet (and thus pass)?
>
> I'm just trying to understand what the maintenance process is for future releases, and how to potentially document that.
>
> That is true. This is all very manual at the moment.
>
> I briefly looked into using Spark APIs to get the signature, but there are some challenges. We can look at classes to see if they extend UnaryExpression or BinaryExpression, but to determine the valid input data types we would need to create an instance of the class, which seems challenging.

@mbutrovich One option I would like to explore is to produce a summary report after running the queries, which would show how many queries ran successfully for each expression, along with error messages for any expressions that always failed.

Edit: I filed an issue for this: #2618

@comphead (Contributor) commented:

For comparison, should we delegate this to Spark itself for simplest cases? 🤔

   sparkDf.count == cometDf.count // to see duplicates or missing rows
   sparkDf.except(cometDf).union(cometDf.except(sparkDf)) // column-level checks

@andygrove (Member, Author) commented:

> For comparison, should we delegate this to Spark itself for simplest cases? 🤔
>
>    sparkDf.count == cometDf.count // to see duplicates or missing rows
>    sparkDf.except(cometDf).union(cometDf.except(sparkDf)) // column-level checks

Will this involve re-executing the queries?

@comphead (Contributor) replied:

>> For comparison, should we delegate this to Spark itself for simplest cases? 🤔
>>
>>    sparkDf.count == cometDf.count // to see duplicates or missing rows
>>    sparkDf.except(cometDf).union(cometDf.except(sparkDf)) // column-level checks
>
> Will this involve re-executing the queries?

Yes, both calls are actions and would trigger DataFrame evaluation. To avoid this we can cache/checkpoint the DataFrames so each is evaluated only once. Another option is to save both result DataFrames to disk and then read them back for the correctness checks; this option may also help in the future when Comet supports the Parquet writer.
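
A rough sketch of the save-to-disk option, assuming `spark` is the active session and `query` is a generated query string; the paths are placeholders, spark.comet.enabled is toggled between runs, and note that with Comet enabled the comparison read itself would also run through Comet:

```scala
// Materialize each result under a different engine config, then compare the
// persisted outputs. Placeholder paths and query; sketch only.
spark.conf.set("spark.comet.enabled", "false")
spark.sql(query).write.mode("overwrite").parquet("/tmp/fuzz/spark")

spark.conf.set("spark.comet.enabled", "true")
spark.sql(query).write.mode("overwrite").parquet("/tmp/fuzz/comet")

val sparkDf = spark.read.parquet("/tmp/fuzz/spark")
val cometDf = spark.read.parquet("/tmp/fuzz/comet")
val diff = sparkDf.except(cometDf).union(cometDf.except(sparkDf))
assert(sparkDf.count() == cometDf.count() && diff.isEmpty)
```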

@wForget (Member) commented Oct 22, 2025

> I briefly looked into using Spark APIs to get the signature, but there are some challenges.

We may be able to obtain function signatures in these ways:

@andygrove (Member, Author) commented:

Thanks. I filed an issue for this: #2627

@andygrove (Member, Author) commented:

> For comparison, should we delegate this to Spark itself for simplest cases? 🤔
>
>    sparkDf.count == cometDf.count // to see duplicates or missing rows
>    sparkDf.except(cometDf).union(cometDf.except(sparkDf)) // column-level checks

I'm not sure how we can run cometDf.except(sparkDf) with cometDf running with Comet enabled and sparkDf running with Comet disabled.

@andygrove (Member, Author) commented Oct 22, 2025

@mbutrovich @comphead Thanks for the reviews so far. CometFuzz is still quite experimental/hacky, but this PR expands coverage of tested functions and reduces the number of invalid queries generated now that we have signatures, so it seems worth merging in my opinion. It would be better to automate the discovery of function signatures rather than hand-code them. I filed #2627 to explore this.

Here are stats from a recent run of this version:

Total queries: 853; Invalid queries: 317; Comet failed: 3; Comet succeeded: 533

So far, this version of CometFuzz has found three bugs:

@andygrove andygrove marked this pull request as draft October 22, 2025 15:10
@andygrove (Member, Author) commented:

Moved to draft until #2629 is merged

@andygrove andygrove marked this pull request as ready for review October 22, 2025 21:21