Why sort spill data in hash aggregation? #3735

yetingsky · 2023-01-18T07:28:16Z


Hash Aggregation sorts its spilled files, but Hash Join uses the classic recursive hash partitioning method, which does not sort spilled files.
Why do Hash Aggregation and Hash Join use different spill strategies?

mbasmanova · 2023-02-17T02:54:58Z

Collaborator

@yetingsky Ting, thank you for the question. FYI, https://facebookincubator.github.io/velox/develop/spilling.html describes the spilling mechanisms used by different operators.

@xiaoxmeng Meng, can you help answer this question?

1 reply

yetingsky Feb 17, 2023
Author

@mbasmanova Thank you for your reply. I have read the document, but it does not explain why different operators use different spilling mechanisms.

oerling · 2023-02-21T16:28:21Z

Collaborator

Aggregation can produce very large accumulator states, in extreme cases GB scale with things like array_agg. Such states will lead to spilling, as will a large number of small states.

In any case, unspilling will be memory efficient, i.e. it requires materializing only one accumulator state at a time. If the state of a single array_agg or similar does not fit in memory, we can only fail. But we do not need to fail when only a few of these states fit at a time.

Otherwise, sorted runs are a common, well understood and generally practiced way of spilling group by.
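To make the memory argument concrete, here is a minimal Python sketch of spilling group by as sorted runs and merging them back. It is an illustration only, not Velox code: the names `spill_sorted_run` and `merge_runs` are hypothetical, in-memory lists stand in for spill files, and list concatenation stands in for combining array_agg-style accumulators. The merge visits keys in order, so only one group's accumulator is being built at any time.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def spill_sorted_run(partial_states):
    """Turn an in-memory hash table {key: accumulator} into one sorted run.

    A real system would write the run to a file; a list stands in here.
    """
    return sorted(partial_states.items(), key=itemgetter(0))


def merge_runs(runs, combine):
    """Merge sorted runs, building one group's accumulator at a time."""
    merged = heapq.merge(*runs, key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        acc = None
        for _, state in group:
            acc = state if acc is None else combine(acc, state)
        # The finished group is emitted before the next key is touched.
        yield key, acc


# Two hash tables that were "spilled" at different times.
run1 = spill_sorted_run({"b": [1], "a": [2, 3]})
run2 = spill_sorted_run({"a": [4], "c": [5]})
result = list(merge_runs([run1, run2], lambda x, y: x + y))
```

Because equal keys from different runs arrive next to each other, no hash table is needed while unspilling, and a finished group never has to be revisited.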

Is there a better way of spilling group by? One could spill hash ranges without sorting, as is done with hash join. This would avoid the n*log(n) of sorting, but unspilling would then need a hash lookup and would be liable to re-spill. On the weight of evidence, sorted runs have fewer special cases, are easier to implement and have no obvious downsides compared to the alternatives. The only downside is the n*log(n) from sorting, but its absolute cost will be much smaller than the time to write out the data, so it is not expected to be significant.

Because spill files must be sorted when written, each one can only contain as much data as fits in memory. This may create a large number of files, more than is convenient to open at one time, which may lead to a multi-level merge: e.g. given 10K files, we merge groups of 100 files into one to end up with 100 files. This needs only 100 files' worth of buffer space at a time.
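The multi-level merge with 10K files and groups of 100 can be sketched as follows. This is a rough Python illustration under stated assumptions, not the Velox implementation: `merge_pass`, `multi_level_merge` and the `fan_in` parameter are hypothetical names, and in-memory lists stand in for spill files.

```python
import heapq


def merge_pass(runs, fan_in):
    """Merge groups of at most `fan_in` sorted runs into one run each."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    return [
        list(heapq.merge(*runs[i:i + fan_in]))
        for i in range(0, len(runs), fan_in)
    ]


def multi_level_merge(runs, fan_in):
    """Repeat merge passes until one run is left; return (run, passes)."""
    passes = 0
    while len(runs) > 1:
        runs = merge_pass(runs, fan_in)
        passes += 1
    return (runs[0] if runs else []), passes


# 10,000 tiny sorted runs merged 100 at a time: 10,000 -> 100 -> 1.
runs = [[i, i + 10_000] for i in range(10_000)]
after_first_pass = merge_pass(runs, fan_in=100)
final_run, passes = multi_level_merge(runs, fan_in=100)
```

At no point does a single merge read from more than `fan_in` inputs, which is what bounds the buffer space and the number of open files.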

1 reply

yetingsky Feb 22, 2023
Author

@oerling Thank you for your reply. I got your point: sorted runs are less likely to re-spill, and the sort time will be much smaller than the time to write out the data (which means sorting data before spilling is almost free). But I wonder why you are not using sorted runs for hash join (i.e. doing a merge join over the sorted runs).

oerling · 2023-02-27T16:38:13Z

Collaborator

Spark makes a spilling hash join into a sort-merge. We do not. There are a few reasons, like:

Spilling hash is linear and makes big spill files. It does not have the problem of thousands of files that need a multilevel merge.
Hash spill that spills just a little is almost like hash join without spill. One could sort and spill a range of hash numbers, of course, but then one would still have the problem of lots of small files for larger spills.

If just doing a full switch from hash to sort-merge, there is a big cliff, whereas spilling a fraction of the key space is a rather small cliff.
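As an illustration of spilling only a fraction of the key space, here is a minimal Python sketch. It is not Velox code and makes several assumptions: `NUM_PARTITIONS`, `partition_of` and `hash_join_with_spill` are hypothetical names, lists stand in for spill files, and the set of spilled partitions is chosen up front rather than under memory pressure.

```python
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical fan-out of the hash partitioning


def partition_of(key):
    return hash(key) % NUM_PARTITIONS


def hash_join_with_spill(build, probe, spilled):
    """Inner-join (key, value) rows; partitions in `spilled` are deferred."""
    table = defaultdict(list)          # build rows of in-memory partitions
    spilled_build = defaultdict(list)  # stands in for build-side spill files
    spilled_probe = defaultdict(list)  # stands in for probe-side spill files

    for key, value in build:
        p = partition_of(key)
        if p in spilled:
            spilled_build[p].append((key, value))
        else:
            table[key].append(value)

    out = []
    for key, value in probe:
        p = partition_of(key)
        if p in spilled:
            # Probe rows of a spilled partition must wait for its build side.
            spilled_probe[p].append((key, value))
        else:
            out.extend((key, b, value) for b in table.get(key, ()))

    # Later: restore one spilled partition at a time and join it on its own.
    for p, build_rows in spilled_build.items():
        small = defaultdict(list)
        for key, value in build_rows:
            small[key].append(value)
        for key, value in spilled_probe.get(p, ()):
            out.extend((key, b, value) for b in small.get(key, ()))
    return out


build = [(k, "b%d" % k) for k in range(8)]
probe = [(k, "p%d" % k) for k in (1, 2, 5, 9)]
no_spill = hash_join_with_spill(build, probe, spilled=set())
some_spill = hash_join_with_spill(build, probe, spilled={1})
```

Rows whose partition is not spilled take exactly the same path as in a join without spilling, which is why spilling one partition out of several costs only a proportional amount of extra work.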

Merge join is not very vectorizable. It is very local, but hash join has a shorter code path and better instruction level parallelism.
Because hash join is multithreaded on both build and probe, no matter how one spilled, one would have the same control structure and synchronization, so no win there. Group by is different in that each thread does its own slice of the key space and there is no coordination of spill between the threads on a worker.
Basically hash with and without spilling is more or less linear. Sorting is n*log(n). Hash join spill can become n*log(n) with multiple levels of recursive spilling, where the number of levels is the log of the data size, but we do not see this happening in practice.
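The claim that the number of recursive spill levels grows with the log of the data size can be checked with a small back-of-the-envelope calculation. The Python sketch below is an assumption-laden model, not a description of Velox: it assumes hashing splits data evenly, that every level repartitions with the same hypothetical `fanout`, and that recursion stops once a partition fits in `memory_bytes`.

```python
def spill_levels(data_bytes, memory_bytes, fanout):
    """Count repartitioning levels until one partition fits in memory.

    Assumes a uniform hash, so each level shrinks a partition by `fanout`.
    """
    if fanout < 2 or memory_bytes <= 0:
        raise ValueError("need fanout >= 2 and memory_bytes > 0")
    levels = 0
    partition = data_bytes
    while partition > memory_bytes:
        partition = -(-partition // fanout)  # ceiling division
        levels += 1
    return levels


GIB = 1 << 30
fits_in_memory = spill_levels(GIB // 2, GIB, fanout=8)
one_level = spill_levels(8 * GIB, GIB, fanout=8)
one_tib = spill_levels(1024 * GIB, GIB, fanout=8)
```

Under these assumptions a build side 1024 times larger than memory needs four levels with a fan-out of 8, since 8^3 = 512 is still below 1024, while anything up to 8 times memory needs a single level. Each level rewrites the data once, so total work is roughly the data size times the number of levels.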

We do not see an advantage in sort-merge. We know of one case where sorted runs compress much better than unsorted ones, but this holds only if there are just keys and no payload with significant information.

Spark plans sort-merge joins. Presto does not. Spark with Photon, as far as I know, prefers hash joins. Planning a sort-merge below a group by that becomes streaming would make more sense than the sort-merge by itself.

This question comes up from time to time. We should see the relative merits of the join types in real world workloads later this year.
