Improve RewriteJoin logic to calculate hash table size #1430

andygrove · 2025-02-20T18:54:00Z

What is the problem the feature request solves?

This is a follow-on issue based on discussions in #1424.

When choosing the smaller side of a join to use as the build side, we currently just use the total table size, based on the sizeInBytes that was computed in a completed query stage.

We can make some improvements to this approach:

- Calculate the resulting hash table size based on the join keys and the columns from the table that will be used in the join. We can compute the size as rowCount * sum(estimated size of each column).
- In cases where the input is not a completed query stage, we can look at the HadoopFsRelation contained by the LogicalRelation. From this, we can get sizeInBytes and infer a row count based on this and the estimated schema size.
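
To make the two ideas above concrete, here is a minimal, self-contained sketch in Python. It is not Comet or Spark code: the `Column` class, the function names, and the per-type byte widths in `ASSUMED_TYPE_SIZE` are all hypothetical and chosen only for illustration. It infers a row count from a total size and an estimated row width, then estimates the hash table size from only the columns the join uses.

```python
from dataclasses import dataclass

# Assumed per-type widths in bytes. These are illustrative values only,
# not the estimates Spark or Comet actually use.
ASSUMED_TYPE_SIZE = {"int": 4, "long": 8, "double": 8, "string": 20}


@dataclass
class Column:
    """Hypothetical stand-in for a schema field."""
    name: str
    dtype: str


def estimated_row_size(columns):
    """Sum of the assumed per-column widths, in bytes."""
    return sum(ASSUMED_TYPE_SIZE[c.dtype] for c in columns)


def infer_row_count(size_in_bytes, schema):
    """Infer a row count from a total size and the estimated full-row width."""
    row_size = estimated_row_size(schema)
    return size_in_bytes // row_size if row_size > 0 else 0


def estimated_hash_table_size(row_count, schema, used_names):
    """rowCount * sum(estimated size of each column used by the join)."""
    used = [c for c in schema if c.name in used_names]
    return row_count * estimated_row_size(used)


def choose_build_side(left_size, right_size):
    """Pick the side with the smaller estimated size."""
    return "left" if left_size <= right_size else "right"


# Left side: wide rows, but only the join key is needed.
left_schema = [Column("id", "long"), Column("payload", "string"), Column("note", "string")]
left_total = 4800  # assumed sizeInBytes for the whole relation
left_rows = infer_row_count(left_total, left_schema)
left_ht = estimated_hash_table_size(left_rows, left_schema, {"id"})

# Right side: narrow rows, and every column is needed.
right_schema = [Column("id", "long"), Column("value", "int")]
right_total = 2400  # assumed sizeInBytes for the whole relation
right_rows = infer_row_count(right_total, right_schema)
right_ht = estimated_hash_table_size(right_rows, right_schema, {"id", "value"})

by_total_size = choose_build_side(left_total, right_total)
by_hash_table_size = choose_build_side(left_ht, right_ht)
```

With these made-up numbers, comparing total sizes picks the right side, while comparing estimated hash table sizes picks the left side, because most of the left side's bytes sit in columns the join never touches.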

Describe the potential solution

No response

Additional context

No response

andygrove added the enhancement (New feature or request) label Feb 20, 2025

andygrove mentioned this issue Feb 20, 2025

perf: Update RewriteJoin logic to choose optimal build side #1424

Draft
