Replies: 3 comments 2 replies
-
@yetingsky Ting, thank you for the question. FYI, https://facebookincubator.github.io/velox/develop/spilling.html describes the spilling mechanisms used by different operators. @xiaoxmeng Meng, can you help answer this question? |
Beta Was this translation helpful? Give feedback.
-
Aggregation can produce very large accumulator states, in extreme cases GB scale with things like array__agg. The occurrence of these states will lead to spilling, as will the occurrence of a large number of small states. Otherwise, sorted runs is a common, well understood and generally practiced way of spilling group by. |
Beta Was this translation helpful? Give feedback.
-
Spark makes a spilling hash join into a sort-merge. We do not. There are a few reasons, like:
If just doing a full switch from hash to sort-merge, there is a big cliff, whereas spilling a fraction of the key space is a rather small cliff.
We do not see advantage in sort-merge. We know of one case where sorted runs compress much better than unsorted ones, but this is only if there is just keys and no payload with significant information. Spark plans sort-merge joins. Presto does not. Spark with Photoon, far as I know, prefers hash joins. Planning a sort-merge below a group by that becomes streaming would make more sense that the sort-merge by itself. This question comes up from time to time. We should see the relative merits of the join types in real world workloads later this year. |
Beta Was this translation helpful? Give feedback.
-
Hash Aggregation will sort spilled files, but Hash Join uses a classic recursive hash partition method (which will not sort spilled files).
I wondered why to use different spill strategies for Hash Aggregation and Hash Join.
Beta Was this translation helpful? Give feedback.
All reactions