Skip to content

[SPARK-55848][SQL][4.0] Fix incorrect dedup results with SPJ partial clustering#54851

Open
naveenp2708 wants to merge 1 commit intoapache:branch-4.0from
naveenp2708:spark-55848-fix-branch-4.0
Open

[SPARK-55848][SQL][4.0] Fix incorrect dedup results with SPJ partial clustering#54851
naveenp2708 wants to merge 1 commit intoapache:branch-4.0from
naveenp2708:spark-55848-fix-branch-4.0

Conversation

@naveenp2708
Copy link

What changes were proposed in this pull request?

Backport fix for SPARK-55848 to branch-4.0. Same fix as merged in branch-4.1 via #54751.

The fix adds an isPartiallyClustered flag to KeyGroupedPartitioning and restructures satisfies0() to check ClusteredDistribution first, returning false when partially clustered. EnsureRequirements then inserts the necessary Exchange.

Why are the changes needed?

SPJ with partial clustering produces incorrect results for post-join dedup operations (dropDuplicates, Window row_number).

Does this PR introduce any user-facing change?

Yes. Queries using SPJ with partial clustering followed by dedup operations will now return correct results.

How was this patch tested?

Three regression tests added. All tests pass locally.

Was this patch authored or co-authored using generative AI tooling?

No.

@naveenp2708 naveenp2708 changed the title [SPARK-55848][SQL][4.0] Fix incorrect dedup results with SPJ partial … [SPARK-55848][SQL][4.0] Fix incorrect dedup results with SPJ partial clustering Mar 17, 2026
@naveenp2708
Copy link
Author

@peter-toth Backport to branch-4.0 as requested. Same fix as #54751. Branch-3.5 PR coming next.

required match {
case c @ ClusteredDistribution(requiredClustering, requireAllClusterKeys, _) =>
if (requireAllClusterKeys) {
// Checks whether this partitioning is partitioned on exactly same clustering keys of
Copy link
Contributor

@peter-toth peter-toth Mar 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed that you deleted 3 comments, but this change should be just a reorder of conditions. Can you please put those comments back?
Also, it seems those comments were deleted in #54751 as well, can you please restore them on branch-4.1 in a follow-up PR?

Copy link
Contributor

@peter-toth peter-toth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just minor request.

…clustering

When SPJ partial clustering splits a partition across multiple tasks,
post-join dedup operators (dropDuplicates, Window row_number) produce
incorrect results because KeyGroupedPartitioning.satisfies0() incorrectly
reports satisfaction of ClusteredDistribution via super.satisfies0()
short-circuiting the isPartiallyClustered guard.

This fix adds an isPartiallyClustered flag to KeyGroupedPartitioning and
restructures satisfies0() to check ClusteredDistribution first, returning
false when partially clustered. EnsureRequirements then inserts the
necessary Exchange. Plain SPJ joins without dedup are unaffected.

Closes apache#54378
@naveenp2708 naveenp2708 force-pushed the spark-55848-fix-branch-4.0 branch from bbd585d to ef11958 Compare March 17, 2026 21:57
@naveenp2708
Copy link
Author

@peter-toth Thanks for catching that! Restored all 3 deleted comments — the change is now a reorder only. Will open a follow-up PR for branch-4.1 to restore the comments there as well.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM (Pending CIs).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants