-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Signature::Coercible with user defined implicit casting #14440
Conversation
|
||
#[derive(Debug, Clone, Eq, PartialOrd, Hash)] | ||
pub struct ParameterType { | ||
pub param_type: LogicalTypeRef, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
param_type
: target type of function signature
allowed_casts
: implicit coercion allowed to cast to target type.
For example,
param_type
: string
allowed_casts
: binary, int
Valid: All are casted to string
func(string)
func(binary)
func(int)
Invalid:
func(float or other types)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am worried that Vec<Vec<ParameterType>>
might become challenging to reason about. It would also be confusing about when one use this new signature rather than Signature::Coercable
🤔
Given this seems very similar to Signature::Coercable
, and Signature::Coercable
mirrors what we want pretty well, could add some new information there on the allowed coercions rather than an entire new type signature. Something like extending Coercable
with rules could be used to coerce to the target type:
pub enum TypeSignature {
...
Coercible(Vec<Coercion>),
...
}
Where Coercion
looks like
struct Coercion {
desired_type: TypeSignatureClass,
allowed_casts: ... // also includes an option for User Defined?
}
🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to have a breaking change to Signature::Coercible
too, the only concern is whether this cause regression or large impact to downstream
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shehabgamin If we replace CoercibleV2
with Signature::Coercible
would it be a large change we should be concerned to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If not, I'll replace Signature::Coercible with Signature::CoercibleV2.
23953dd
to
806c6a6
Compare
Signed-off-by: Jay Zhan <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>
806c6a6
to
c99e986
Compare
vec![ | ||
TypeSignatureClass::Native(logical_string()), | ||
TypeSignatureClass::Native(logical_int64()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In v1 version, any integer is casted to the defined NativeType's default casted type which is i64 in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do like how much more explicit the new formualtion is
// Accept all integer types but cast them to i64 | ||
Coercion { | ||
desired_type: TypeSignatureClass::Native(logical_int64()), | ||
allowed_casts: vec![TypeSignatureClass::Integer], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without this, i32 is rejected.
signature: Signature::coercible_v2( | ||
vec![Coercion { | ||
desired_type: TypeSignatureClass::Native(logical_string()), | ||
allowed_casts: vec![TypeSignatureClass::Native(logical_binary())], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Coercing binary to string is now easily customizable
I will try and review this carefully over the weekend Maybe @shehabgamin has some time to take a look too |
I will review over the weekend as well! |
I took the liberty of merging up from main to get the CI to pass again. I hope to spend a bit longer today messing around with potential changes to Coercion for your consideration |
I was hoping it might be possible to make a 'wrapper' function that only overrides the signature but passes everything else in Though now that I type this, perhaps what is really needed is user defined coercion for operators (like |
Maybe we could make this a builder style API too -- I hope to mess around with it later todya |
@jayzhan211 I will re-review by tomorrow EOD, exciting progress! |
I ran out of time yesterday and I am out next week so I am not likely to be able to make much progress here. Sorry about that -- maybe one of you will have a chance to do something |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this, @jayzhan211. The new behavior of coercion signature makes sense to me.
I found I can't access Coercion
and ImplicitCoercion
outside the crate.
I believe we need to publicly use them in datafusion/expr/src/lib.rs
.
Maybe we can just set it as below for all struct in signature
.
pub use datafusion_expr_common::signature::*;
If you can access Signature, no reason you can't access for |
I see. I tried to use datafusion/datafusion/core/src/lib.rs Line 751 in 587277f
pub mod logical_expr_common {
pub use datafusion_expr_common::*;
} Then, user can only uses |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jayzhan211. It looks great to me.
I was hoping it might be possible to make a 'wrapper' function that only overrides the signature but passes everything else in
@alamb's idea looks great, but I think it may be another issue 🤔 Maybe we can focus on the behavior of coercible signature improvement in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jayzhan211 Well done!!! 🔥
My approval doesn't actually do anything, but I approved it anyway lol.
Side note: Is this logic from the old PR needed?
datafusion/datafusion/expr/src/type_coercion/functions.rs
Lines 610 to 626 in 1d94e5c
// Following the behavior of `TypeSignature::String`, we find the common string type. | |
let string_indices: Vec<_> = target_types.iter().enumerate() | |
.filter(|(_, t)| { | |
matches!(t, TypeSignatureClass::Native(n) if n.native() == &NativeType::String) | |
}) | |
.map(|(i, _)| i) | |
.collect(); | |
if !string_indices.is_empty() { | |
let mut coerced_string_type = new_types[string_indices[0]].to_owned(); | |
for &i in string_indices.iter().skip(1) { | |
coerced_string_type = | |
find_common_string_type(&coerced_string_type, &new_types[i])?; | |
} | |
for i in string_indices { | |
new_types[i] = coerced_string_type.clone(); | |
} | |
} |
Probably not |
Thanks @jayzhan211 @shehabgamin and @goldmedal 🚀 |
Thanks @alamb @shehabgamin @goldmedal |
/// Target coerce types in order | ||
pub fn coercible( | ||
target_types: Vec<TypeSignatureClass>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a breaking change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, label is added already
/// Coercion that only accepts arguments exactly matching the desired type. | ||
Exact { | ||
/// The required type for the argument | ||
desired_type: TypeSignatureClass, | ||
}, | ||
|
||
/// Coercion that accepts the desired type and can implicitly coerce from other types. | ||
Implicit { | ||
/// The primary desired type for the argument | ||
desired_type: TypeSignatureClass, | ||
/// Rules for implicit coercion from other types | ||
implicit_coercion: ImplicitCoercion, | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An implementor of a function `foo(s) wants to say:
foo(s)
accepts s being "a string" (eg via logical typeNativeType::String
or old schoolDataType::Utf8
)
so the query SELECT foo(s)
should succeed when s is a string or s is coercible to string.
Neither Coercion option gives that.
In theory Coercion::Implicit gives that, but alas, it requires the function implementor to know the coercion rules of the engine, to enumerate the "allowed source types".
How is it supposed to be used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite understand the question.
so the query SELECT foo(s) should succeed when s is a string or s is coercible to string.
This is true for this signature. For exact, string succeed, for implicit, string and other defined allow source types succeed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, but the burden to find "allowed source types" is on the function implementor. This is backwards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to find "allowed source types" is on the function implementor.
This is what we need, we don't have single rule that fits all the use case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this structure (Coercion as an enum) allows for introducing another variant if we need additional rules / ways to express coercion.
So in my mind this is a step forward, even if we still have more work to do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's consider example of substr(s, i) function.
The call to substr should succeed for i being any integer type coercible to UInt64 or Int64.
You’ve defined that the second argument of substr can be any integer type coercible to Int64. Isn’t this part of the function definition? By doing so, the function knows what coercion is needed. However, it's not enough to just make this definition possible. If we want to allow coercion from a string integer type or if we only expect coercion to Int32, those options should be possible as well. Given this, we can't decide how coercion should happen without being informed by the user or the function definition
For example, integer types should be coercible to broader integer types the same way regardless whether it's in context of UNION ALL, EXCEPT, a function call, or an operator.
For internal DataFusion use cases, this rule works well. However, from an extensibility perspective, restricting coercion rules reduces flexibility. What if they don't want broader integer type for some specific workflow because they know the max value is in i32? We should prioritize maximum flexibility while also providing built-in options for general use cases to ensure ease of use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we begin allowing user-defined coercion rules for other operations, not just functions, users would need to define coercion rules for operations like binary comparisons as well. In other hand, what you are suggesting could limit the extensibility of coercion rules within DataFusion.
However, while we aim to provide extensibility, basic rules—such as converting an integer to a broader integer type—can be implemented as default rules in DataFusion, as long as they align well with the internal logic of the system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me like we are perhaps not talking about the same things here.
Maybe someone can write up some example / test case that shows what feature is lacking. I suspect that will get us to a resolution the quickest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not a step forward. We removed ability for a function f(s) to express the most basic need: "f takes s being a type X, so the call succeeds for any type X' coercible to X"
This is, i believe, what the
TypeSignature::Coercible(logical type)
was doing. (At least this is what it should be doing by the name of it.)
@findepi This was the behavior before DataFusion 43. Starting in DataFusion 43, this was no longer the case. My old PR #14268 attempted to fix this regression. I agree that the naming does not match the behavior, and there is an extensive discussion in the PR I linked regarding that.
@jayzhan211 @alamb @findepi The relevant comment from my old PR: #14268 (comment).
With that in mind, this PR is still a significant step forward from the new behavior introduced in DataFusion 43. However, more work is needed to define functions that are explicitly documented, have clear and consistent naming, and sufficiently meet internal expectations while respecting system contracts and remaining flexible for downstream use cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not a step forward. We removed ability for a function f(s) to express the most basic need: "f takes s being a type X, so the call succeeds for any type X' coercible to X"
This is, i believe, what theTypeSignature::Coercible(logical type)
was doing. (At least this is what it should be doing by the name of it.)@findepi This was the behavior before DataFusion 43. Starting in DataFusion 43, this was no longer the case. My old PR #14268 attempted to fix this regression. I agree that the naming does not match the behavior, and there is an extensive discussion in the PR I linked regarding that.
@jayzhan211 @alamb @findepi The relevant comment from my old PR: #14268 (comment).
With that in mind, this PR is still a significant step forward from the new behavior introduced in DataFusion 43. However, more work is needed to define functions that are explicitly documented, have clear and consistent naming, and sufficiently meet internal expectations while respecting system contracts and remaining flexible for downstream use cases.
THanks @shehabgamin -- the writeup makes sense to me at a high level but I can't fully understand the practical implications / usecase.
In the past when we have had such issues, one thing we have done is to write up an example / usecase in https://github.com/apache/datafusion/tree/main/datafusion-examples that tries to show what is needed.
Perhaps in this case, someone could add an example to https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_udf.rs that shows how to create a function with the desired coercion behavior (or if that is not possible, shows why not)
I think that would make the principles of the #14268 (comment) clearer, at least to me, and we would have a specific example to evaluate potential additional improvements against
@@ -95,24 +96,29 @@ impl DatePartFunc { | |||
signature: Signature::one_of( | |||
vec![ | |||
TypeSignature::Coercible(vec![ | |||
TypeSignatureClass::Native(logical_string()), | |||
TypeSignatureClass::Timestamp, | |||
Coercion::new_exact(TypeSignatureClass::Native(logical_string())), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the function used to accept anything coercible to "logical string"
now it accepts anything that is "logical string"
Is this a breaking change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the function used to accept anything coercible to "logical string"
This is actually incorrect so we fix this
See #14268 for what is happening
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the function used to accept anything coercible to "logical string" now it accepts anything that is "logical string"
Is this a breaking change?
Anything coercible to "logical string" was behavior before DataFusion 43. Starting at DataFusion 43 the behavior changed to anything that is "logical string". #14268 attempted to change this back to anything coercible to "logical string", although I'm not sure if anything in there is relevant anymore after this PR.
Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?