Contributor

@dqkqd dqkqd commented Oct 16, 2025

Which issue does this PR close?

Rationale for this change

The return type declared by array_distinct always has a nullable inner field,
but general_array_distinct preserves the inner nullability of its input,
which causes a type mismatch error when the input's inner field is non-nullable.
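
The mismatch can be reproduced in miniature without Arrow. The sketch below is a simplified stand-in, not DataFusion's code: `ListType`, `planned_return_type`, `runtime_type_preserving_input`, and `is_mismatch` are hypothetical names that model only the one property that matters here, namely that equality of a list type includes the nullability of its inner field.

```rust
/// Simplified stand-in for a list data type: the element type name plus
/// the nullability flag of the inner "item" field.
#[derive(Debug, Clone, PartialEq)]
pub struct ListType {
    pub item_type: &'static str,
    pub item_nullable: bool,
}

/// Models the type promised at planning time: the inner field is always
/// nullable, whatever the input looked like.
pub fn planned_return_type(input: &ListType) -> ListType {
    ListType {
        item_type: input.item_type,
        item_nullable: true,
    }
}

/// Models the behavior described above: the produced array keeps the
/// inner nullability of the input.
pub fn runtime_type_preserving_input(input: &ListType) -> ListType {
    input.clone()
}

/// True when the produced type differs from the promised one.
pub fn is_mismatch(input: &ListType) -> bool {
    planned_return_type(input) != runtime_type_preserving_input(input)
}
```

Under these assumptions, an input with a nullable inner field passes, while an input with a non-nullable inner field trips the mismatch. That is the same shape as the error reported later in this thread (`nullable: false` returned, `nullable: true` promised).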

I believe the same error happens for array_union and array_intersect (in set_ops.rs).
I can include the fix for those in this PR or in a separate PR.

What changes are included in this PR?

  • Match return type nullability for array_distinct.
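
A minimal sketch of what matching nullability could look like, assuming the runtime result is made to adopt the always-nullable inner field (the description above does not say which side was changed). `ListColumn` and this `array_distinct` are hypothetical, std-only stand-ins rather than the functions in set_ops.rs, and keeping first-occurrence order is a choice made for the sketch only.

```rust
/// Hypothetical stand-in for a list column of Int32 values.
#[derive(Debug, Clone, PartialEq)]
pub struct ListColumn {
    /// Nullability flag of the inner "item" field.
    pub item_nullable: bool,
    /// One Vec per row; `None` is a null element.
    pub rows: Vec<Vec<Option<i32>>>,
}

/// Removes duplicate elements within each row, keeping first occurrences.
/// The output's inner field is always marked nullable, so the output type
/// no longer depends on the nullability of the input.
pub fn array_distinct(input: &ListColumn) -> ListColumn {
    let rows = input
        .rows
        .iter()
        .map(|row| {
            let mut seen: Vec<Option<i32>> = Vec::new();
            for value in row {
                if !seen.contains(value) {
                    seen.push(*value);
                }
            }
            seen
        })
        .collect();
    ListColumn {
        item_nullable: true,
        rows,
    }
}
```

The design point is that the output field is built from a fixed rule instead of being copied from the input, so the planned type and the produced type cannot drift apart.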

Are these changes tested?

Yes.
I tried to add unit tests checking return types (similar to #15901),
but it wasn't clear to me whether those tests would actually exercise
issue #17416, so I switched to an integration test.

  • Added a test for List with inner nullability = true / false.
  • I did not add tests for LargeList; I don't think they are needed
    because the code path for the return type is identical to List.

Are there any user-facing changes?

I don't think so.

@dqkqd dqkqd force-pushed the array-distinct-type-mismatch branch 2 times, most recently from 833b860 to 01d5e71 Compare October 18, 2025 22:43
@github-actions github-actions bot added the core Core DataFusion crate label Oct 18, 2025
@dqkqd dqkqd force-pushed the array-distinct-type-mismatch branch 2 times, most recently from 28de6b0 to a9c77c3 Compare October 18, 2025 23:21
@dqkqd dqkqd force-pushed the array-distinct-type-mismatch branch from a9c77c3 to f406fc2 Compare October 18, 2025 23:22
}

#[tokio::test]
async fn array_distinct_on_list_with_inner_nullability_causing_type_mismatch(
Contributor Author

I don't like this test; I wish I could write it as an slt test.
But I don't know how to construct an array with a specific inner nullability from SQL.

Maybe with this improvement from arrow-rs (apache/arrow-rs#8351),
we can write something like this in the future:
select arrow_cast([1,2,3], 'List(Int64)') (for nullable = false)
and
select arrow_cast([1,2,3], 'List(nullable Int64)') (for nullable = true)

(then lots of tests with nested data type can be rewritten into slt tests)

Contributor

We could also place this test in set_ops.rs itself to get closer to the function, for example array_has:

    #[test]
    fn test_array_has_list_empty_child() -> Result<(), DataFusionError> {
        let haystack_field = Arc::new(Field::new_list(
            "haystack",
            Field::new_list("", Field::new("", DataType::Int32, true), true),
            true,
        ));
        let needle_field = Arc::new(Field::new("needle", DataType::Int32, true));
        let return_field = Arc::new(Field::new_list(
            "return",
            Field::new("", DataType::Boolean, true),
            true,
        ));
        let haystack = ListArray::new(
            Field::new_list_field(DataType::Int32, true).into(),
            OffsetBuffer::new(vec![0, 0].into()),
            Arc::new(Int32Array::from(Vec::<i32>::new())) as ArrayRef,
            Some(vec![false].into()),
        );
        let haystack = ColumnarValue::Array(Arc::new(haystack));
        let needle = ColumnarValue::Scalar(ScalarValue::Int32(Some(1)));
        let result = ArrayHas::new().invoke_with_args(ScalarFunctionArgs {
            args: vec![haystack, needle],
            arg_fields: vec![haystack_field, needle_field],
            number_rows: 1,
            return_field,
            config_options: Arc::new(ConfigOptions::default()),
        })?;
        let output = result.into_array(1)?;
        let output = output.as_boolean();
        assert_eq!(output.len(), 1);
        assert!(output.is_null(0));
        Ok(())
    }

Contributor Author

I added a test in set_ops.rs, but it looks quite different from the scenario in the issue description.
From a first reading of the issue, I thought the error was caused by a data type mismatch between
array_distinct and make_array.
But the actual error was due to a type mismatch within array_distinct itself.

Contributor Author

I would prefer to delete the test in core/tests/dataframe (it is not related to DataFrame at all),
but I'm not sure whether a single test in set_ops.rs captures the whole picture in the issue description.

Contributor

I think the test as is is sufficient; this is a minor fix and having a whole DataFrame test case might be a little overkill 😅

@dqkqd
Contributor Author

dqkqd commented Oct 18, 2025

I verified that the added test fails without the changes:

    running 1 test
    test dataframe::array_distinct_on_list_with_inner_nullability_causing_type_mismatch ... FAILED

    failures:

    failures:
        dataframe::array_distinct_on_list_with_inner_nullability_causing_type_mismatch

    test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 766 filtered out; finished in 0.29s

  stderr ───

    thread 'dataframe::array_distinct_on_list_with_inner_nullability_causing_type_mismatch' panicked at datafusion/core/tests/dataframe/mod.rs:6554:46:
    called `Result::unwrap()` on an `Err` value: Internal("Function 'array_distinct' returned value of type 'List(Field { name: \"item\", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} })' while the following type was promised at planning time and expected: 'List(Field { name: \"item\", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })'")
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

  Cancelling due to test failure:
────────────
     Summary [   0.331s] 1 test run: 0 passed, 1 failed, 1690 skipped
        FAIL [   0.298s] datafusion::core_integration dataframe::array_distinct_on_list_with_inner_nullability_causing_type_mismatch
error: test run failed

@dqkqd dqkqd marked this pull request as ready for review October 19, 2025 00:01
@dqkqd dqkqd force-pushed the array-distinct-type-mismatch branch from 8b660a4 to 5a61d90 Compare October 19, 2025 02:50
Comment on lines 601 to 604
    assert_eq!(
        result.data_type(),
        udf.return_type(&[input_field.data_type().clone()])?
    );
Contributor Author

ScalarUDF::invoke_with_args has a data type check that only runs in debug builds:

    #[cfg(debug_assertions)]
    {
        if &result.data_type() != return_field.data_type() {
            return datafusion_common::internal_err!(
                "Function '{}' returned value of type '{:?}' while the following type was promised at planning time and expected: '{:?}'",
                self.name(),
                result.data_type(),
                return_field.data_type()
            );
        }
    }

But I think it is better to have an explicit assertion in the test code.
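
The trade-off can be sketched with a std-only model. `DataType` and `invoke_checked` here are hypothetical stand-ins, not DataFusion's API, and the sketch uses the `cfg!` macro rather than the `#[cfg]` attribute quoted above. The point is only that a check gated on `debug_assertions` disappears in release builds, whereas an `assert_eq!` in the test compares the types in every build profile.

```rust
/// Hypothetical, simplified data type: a list whose only varying property
/// is the nullability of its inner field.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int32,
    List { item_nullable: bool },
}

/// Runs `func` and, in debug builds only, verifies that the type it
/// returns equals the type promised at planning time.
pub fn invoke_checked(
    name: &str,
    promised: &DataType,
    func: impl FnOnce() -> DataType,
) -> Result<DataType, String> {
    let actual = func();
    // `cfg!(debug_assertions)` is false in release builds, so the
    // comparison is skipped there and a mismatch goes unnoticed.
    if cfg!(debug_assertions) && &actual != promised {
        return Err(format!(
            "Function '{}' returned value of type '{:?}' while '{:?}' was promised at planning time",
            name, actual, promised
        ));
    }
    Ok(actual)
}
```

In a release build the mismatching call returns `Ok` with the unexpected type, which is the argument for asserting on the result type directly in the test.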

@dqkqd
Contributor Author

dqkqd commented Oct 19, 2025

I verified that the newly added test fails without the code changes:

    running 1 test
    test set_ops::tests::test_array_distinct_inner_nullability_result_type_match_return_type::case_2 ... FAILED

    failures:

    failures:
        set_ops::tests::test_array_distinct_inner_nullability_result_type_match_return_type::case_2

    test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 11 filtered out; finished in 0.00s

  stderr ───

    thread 'set_ops::tests::test_array_distinct_inner_nullability_result_type_match_return_type::case_2' panicked at datafusion/functions-nested/src/set_ops.rs:607:9:
    assertion `left == right` failed
      left: List(Field { name: "item", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} })
     right: List(Field { name: "item", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

@github-actions github-actions bot removed the core Core DataFusion crate label Oct 19, 2025
Contributor

@Jefffrey Jefffrey left a comment

Looks good, though I wonder if it is necessary to pull in rstest just for this single case?

@dqkqd
Contributor Author

dqkqd commented Oct 19, 2025

> Looks good, though I wonder if it is necessary to pull in rstest just for this single case?

I was planning to add more cases (e.g. DataType::LargeList) and more tests (for array_union and array_intersect).

But that might be unnecessary: these tests can be removed later in favor of equivalent slt tests, once we have a way to
construct nested data types with different inner nullability in SQL.
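
If the extra dependency is a concern in the meantime, the same cases can be written as a plain table-driven test. This is a sketch only: `ListKind`, `promised_item_nullable`, and `failing_cases` are hypothetical names, the `produced` closure stands in for invoking the function under test, and the case list (including LargeList) is illustrative rather than what was merged.

```rust
/// Hypothetical list flavors a case table might cover.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ListKind {
    List,
    LargeList,
}

/// Stand-in for the inner nullability promised at planning time:
/// always nullable, regardless of the input.
pub fn promised_item_nullable(_kind: ListKind, _input_nullable: bool) -> bool {
    true
}

/// Runs every (kind, input inner nullability) combination and returns the
/// cases where the produced inner nullability differs from the promise.
/// `produced` stands in for calling the function under test.
pub fn failing_cases(produced: impl Fn(ListKind, bool) -> bool) -> Vec<(ListKind, bool)> {
    let mut failures = Vec::new();
    for kind in [ListKind::List, ListKind::LargeList] {
        for input_nullable in [true, false] {
            let expected = promised_item_nullable(kind, input_nullable);
            if produced(kind, input_nullable) != expected {
                failures.push((kind, input_nullable));
            }
        }
    }
    failures
}
```

Collecting the failing cases instead of asserting inside the loop reports every broken combination in one run, which recovers most of what a parameterized-test macro provides.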

@Jefffrey Jefffrey added this pull request to the merge queue Oct 20, 2025
Merged via the queue into apache:main with commit 35b2e35 Oct 20, 2025
28 checks passed
@Jefffrey
Contributor

Thanks @dqkqd

@dqkqd
Contributor Author

dqkqd commented Oct 20, 2025

Thanks @Jefffrey for reviewing.

Development

Successfully merging this pull request may close these issues.

Unexpected "type mismatch" when filtering bool list column using array_distinct and make_array

2 participants