Substrait join consumer should not merge nullability of join keys#21121
Substrait join consumer should not merge nullability of join keys#21121hareshkh wants to merge 8 commits intoapache:mainfrom
Conversation
275a5ea to
6210816
Compare
9ac08b5 to
79130bd
Compare
|
Hey @gabotechs, @alamb, @LiaCastaneda! would you like to take a look at this one please? |
|
|
||
| let (join_keys, null_equality) = | ||
| match (eq_keys.is_empty(), indistinct_keys.is_empty()) { | ||
| // Mixed: use eq_keys as equijoin keys, demote indistinct keys to filter |
There was a problem hiding this comment.
We unconditionally favour Eq keys here, but if we have a case where there exists multiple (say 4) IS NOT DISTINCT FROM column pairs and 1 Eq column pair, this demotes all 4 to filter and keeps just the 1 eq key, right?
But, in this case, would the inverse (demote the single eq to filter) not allow more columns to participate in the hash partitioning/pruning and therefore be a bit more performant?
More selective hash key = frwer candidate pairs survive and need fewer row-by-row filter evaluation, if I understand correctly?
There was a problem hiding this comment.
The optimiser currently already has this behaviour of favouring Eq predicates over Indistinct predicates. Added a SLT to confirm that behaviour - https://github.com/apache/datafusion/pull/21121/changes#diff-63fc43cf735eb03abd4d114cfbbf24982939425938a74b354fb7db6da7d499d7R305, and replicating that behaviour in this change.
I also think that selectivity is a function of data i.e. having a hash join on 3 indistinct keys could produce more data than 1 eq key.
|
Thanks for the review @Satyr09 - I have addressed all your comments. |
|
Thanks for the review @gabotechs - I have addressed the comment as well. |
|
Nice, I'll leave this open until tomorrow in case someone wants to chime in, otherwise good work! |
Which issue does this PR close?
Rationale for this change
When a Substrait join expression contains both equal and is_not_distinct_from predicates (e.g. Spark pushes a null-safe filter into a join that already has a regular equality key), the
split_eq_and_noneq_join_predicate_with_nulls_equalityfunction uses a singlenulls_equal_nullsboolean that gets overwritten per-predicate. Whichever operator is processed last determines theNullEqualityfor all keys, silently dropping NULL-matching rows.Since NullEquality is a join-level setting (not per-key) across all physical join implementations (HashJoinExec, SortMergeJoinExec, SymmetricHashJoinExec), the correct fix is to match DataFusion's own SQL planner behavior: demote IS NOT DISTINCT FROM keys to the join filter when mixed with Eq keys. This is already correctly handled for SQL as shown in join_is_not_distinct_from.slt:L188
What changes are included in this PR?
datafusion/substrait/src/logical_plan/consumer/rel/join_rel.rs:Are these changes tested?
Yes, three levels of coverage:
Are there any user-facing changes?
No API changes. Substrait plans with mixed equal/is_not_distinct_from join predicates now correctly preserve null-safe semantics instead of silently dropping NULL-matching rows.