-
Notifications
You must be signed in to change notification settings - Fork 1.7k
fix: array_distinct
inner nullability causing type mismatch
#18104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
833b860
to
01d5e71
Compare
28de6b0
to
a9c77c3
Compare
a9c77c3
to
f406fc2
Compare
} | ||
|
||
#[tokio::test] | ||
async fn array_distinct_on_list_with_inner_nullability_causing_type_mismatch( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like this test, I wish I could write slt
.
But I don't know how to construct a nullable array from sql.
Maybe with this improvement from arrow-rs
(apache/arrow-rs#8351),
we can write something like this in the future:
select arrow_cast([1,2,3], 'List(Int64)')
(for nullable = false)
and
select arrow_cast([1,2,3], 'List(nullable Int64)')
(for nullable = true)
(then lots of tests with nested data type can be rewritten into slt
tests)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also place this test in set_ops.rs
itself to get closer to the function, for example array_has
:
datafusion/datafusion/functions-nested/src/array_has.rs
Lines 753 to 791 in b98cad6
#[test] | |
fn test_array_has_list_empty_child() -> Result<(), DataFusionError> { | |
let haystack_field = Arc::new(Field::new_list( | |
"haystack", | |
Field::new_list("", Field::new("", DataType::Int32, true), true), | |
true, | |
)); | |
let needle_field = Arc::new(Field::new("needle", DataType::Int32, true)); | |
let return_field = Arc::new(Field::new_list( | |
"return", | |
Field::new("", DataType::Boolean, true), | |
true, | |
)); | |
let haystack = ListArray::new( | |
Field::new_list_field(DataType::Int32, true).into(), | |
OffsetBuffer::new(vec![0, 0].into()), | |
Arc::new(Int32Array::from(Vec::<i32>::new())) as ArrayRef, | |
Some(vec![false].into()), | |
); | |
let haystack = ColumnarValue::Array(Arc::new(haystack)); | |
let needle = ColumnarValue::Scalar(ScalarValue::Int32(Some(1))); | |
let result = ArrayHas::new().invoke_with_args(ScalarFunctionArgs { | |
args: vec![haystack, needle], | |
arg_fields: vec![haystack_field, needle_field], | |
number_rows: 1, | |
return_field, | |
config_options: Arc::new(ConfigOptions::default()), | |
})?; | |
let output = result.into_array(1)?; | |
let output = output.as_boolean(); | |
assert_eq!(output.len(), 1); | |
assert!(output.is_null(0)); | |
Ok(()) | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a test in set_ops.rs
, but it seems quite different with the issue description.
At first glance, to the issue description, I thought the error was caused by the data type mismatch between
array_distinct
and make_array
.
But the actual error was due to the type mismatch in the array_distinct
itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer deleting the test in core/tests/dataframe
(because it is not related to dataframe at all),
but I'm not sure if adding only one test in set_ops.rs
can capture the whole picture in the issue description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the test as is is sufficient; this is a minor fix and having a whole DataFrame test case might be a little overkill 😅
I verified the added test fail without the changes:
|
8b660a4
to
5a61d90
Compare
assert_eq!( | ||
result.data_type(), | ||
udf.return_type(&[input_field.data_type().clone()])? | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ScalarUdf::invoke_with_args
has a data type check in debug mode:
datafusion/datafusion/expr/src/udf.rs
Lines 241 to 249 in b98cad6
#[cfg(debug_assertions)] | |
{ | |
if &result.data_type() != return_field.data_type() { | |
return datafusion_common::internal_err!("Function '{}' returned value of type '{:?}' while the following type was promised at planning time and expected: '{:?}'", | |
self.name(), | |
result.data_type(), | |
return_field.data_type() | |
); | |
} |
But I think it is better to have an explicit assertion in the test code.
I verified the new test added fail without code changes:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, though I wonder if it is necessary to pull in rstest just for this single case?
I was thinking to add more cases (e.g. But it might be unnecessary, those can be removed later in favor of equivalent |
Thanks @dqkqd |
Thanks @Jefffrey for reviewing. |
Which issue does this PR close?
array_distinct
andmake_array
#17416.Rationale for this change
array_distinct
's inner return type is alwaysnullable
,however
general_array_distinct
maintain input nullability,causing type mismatch error.
I believe the same error happens for
array_union
andarray_intersect
(inset_ops.rs
).I can include the fix for those in this PR or maybe another separated PR.
What changes are included in this PR?
array_distinct
.Are these changes tested?
Yes.
I tried to add unit tests checking return types (similar to #15901),
but it wasn't clear to me whether the added tests could verify
the issue #17416. So I switched to the integration test.
List
with innernullability = true / false
.LargeList
, I don't think it neededbecause the code path for the return type is identical to
List
.Are there any user-facing changes?
I don't think so.