Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve RewriteJoin logic to calculate hash table size #1430

Open
andygrove opened this issue Feb 20, 2025 · 0 comments
Open

Improve RewriteJoin logic to calculate hash table size #1430

andygrove opened this issue Feb 20, 2025 · 0 comments
Labels
enhancement New feature or request

Comments

@andygrove
Copy link
Member

What is the problem the feature request solves?

This is a follow on issue based on discussions in #1424.

When choosing the smaller side of a join to use for the build-side, we just use the total table size based on the sizeInBytes that was computed in a completed query stage.

We can make some improvements to this approach:

  • Calculate the resulting hash table size based on the join keys and the columns from the table that will be used in the join. We can compute size based on rowCount * sum(estimated size of each column).
  • In cases where the input is now a completed query stage, we can look at the HadoopFsRelation contained by the LogicalRelation. From this, we can can sizeInBytes and infer a row count based on this and the estimated schema size

Describe the potential solution

No response

Additional context

No response

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant