How to support rank and dense_rank functions in TopNRowNumber? #9404

JkSelf (Collaborator) · 2024-04-07T07:56:04Z

After Gluten was upgraded to Spark 3.5, we found that Spark 3.5 introduced the RankLimit operator, which optimizes the rank, dense_rank, and row_number functions. It extracts only the top N rows within each window partition, so the window operator only has to compute over those top N rows per partition instead of processing all the data. This improves performance and also reduces the risk of out-of-memory (OOM) errors when memory is constrained. We therefore plan to add support for the RankLimit operator in Gluten.

Currently, to implement the RankLimit operator in Gluten, we need to address the following two issues:

1. Velox's TopNRowNumber already implements a similar optimization for the row_number function, but not yet for rank and dense_rank. What is the reason for this? We have reviewed the code and believe that TopNRowNumber can support rank and dense_rank. In TopNRowNumber#getOutput, we can create the corresponding WindowFunction based on the function name, and then have each WindowPartition apply that WindowFunction to produce the final result. Do you think this solution is feasible?
2. As with the Window operator, Spark adds a Sort operator before RankLimit to sort the data by the partition keys and order-by keys. Therefore, TopNRowNumber does not need to sort the data again. We need to implement an operator similar to StreamingWindow that removes the sorting step from TopNRowNumber.
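
To make the second point concrete, here is a minimal Python sketch (not Velox or Spark code) of a streaming limit over input that is already sorted by partition key and order-by key. The function name `streaming_rank_limit` and the `(partition_key, order_key)` row layout are assumptions made for this illustration only.

```python
_UNSET = object()  # sentinel meaning "no partition / key seen yet"


def streaming_rank_limit(rows, function, limit):
    """Yield (partition, key, value) for rows whose window value is <= limit.

    `rows` must already be sorted by (partition_key, order_key), so no
    sorting or buffering of whole partitions happens here.
    """
    if function not in ("row_number", "rank", "dense_rank"):
        raise ValueError(f"unsupported function: {function}")

    current_partition = _UNSET
    prev_key = _UNSET
    row_number = rank = dense_rank = 0

    for partition, key in rows:
        if current_partition is _UNSET or partition != current_partition:
            # New partition: reset all counters.
            current_partition = partition
            prev_key = _UNSET
            row_number = rank = dense_rank = 0

        row_number += 1
        if prev_key is _UNSET or key != prev_key:
            # First row of a new peer group (rows with equal order keys).
            rank = row_number
            dense_rank += 1
            prev_key = key

        value = {"row_number": row_number, "rank": rank, "dense_rank": dense_rank}[function]
        if value <= limit:
            yield partition, key, value


rows = [("a", 1), ("a", 1), ("a", 2), ("a", 3), ("b", 7), ("b", 8)]
print(list(streaming_rank_limit(rows, "rank", 2)))
# [('a', 1, 1), ('a', 1, 1), ('b', 7, 1), ('b', 8, 2)]
```

Because the values never decrease within a partition, a real operator could also skip the remainder of a partition once the limit is exceeded; the sketch keeps scanning for simplicity.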

@mbasmanova @aditi-pandit @zhouyuan @ayushi-agarwal @PHILO-HE @rui-mo

mbasmanova (Collaborator) · 2024-04-08T13:03:04Z

@JkSelf At a high level, it makes sense to optimize rank <= N and dense_rank <= N queries. However, there are quite a few details to sort out. Would you create a Google doc to describe the proposed design and implementation in detail?

Specifically, the number of top rows that must be kept is quite different for these 3 functions. row_number <= 3 requires keeping only 3 top rows. However, it is not enough to keep 3 top rows for rank <= 3 or dense_rank <= 3.
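
As a small worked example of that difference, the following Python sketch counts how many rows of one sorted partition satisfy `row_number <= n`, `rank <= n`, and `dense_rank <= n`. The helper name `rows_to_keep` and the sample keys are made up for illustration.

```python
def rows_to_keep(sorted_keys, n):
    """Count rows with row_number <= n, rank <= n and dense_rank <= n.

    `sorted_keys` are the order-by keys of a single partition, already sorted.
    """
    counts = {"row_number": 0, "rank": 0, "dense_rank": 0}
    rank = dense_rank = 0
    prev = None
    for row_number, key in enumerate(sorted_keys, start=1):
        if row_number == 1 or key != prev:
            # A new peer group starts: rank jumps to the row position,
            # dense_rank advances by exactly one.
            rank = row_number
            dense_rank += 1
            prev = key
        if row_number <= n:
            counts["row_number"] += 1
        if rank <= n:
            counts["rank"] += 1
        if dense_rank <= n:
            counts["dense_rank"] += 1
    return counts


print(rows_to_keep([10, 20, 20, 20, 30, 40], 3))
# {'row_number': 3, 'rank': 4, 'dense_rank': 5}
```

With three rows tied on the key 20, `rank <= 3` keeps 4 rows and `dense_rank <= 3` keeps 5, so a fixed-size buffer of 3 rows is only sufficient for row_number.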

> Spark adds a Sort operator before RankLimit to sort the data according to the partition key and order by key.

It seems wasteful to sort all the data in this case.

1 reply

JkSelf Apr 24, 2024
Collaborator Author

@ayushi-agarwal apache/incubator-gluten#5398 has already been merged. Could you help follow up on this task? Thanks for your help.

liujiayi771 · 2024-07-04T09:55:03Z

@mbasmanova For rank and dense_rank, we need to store rows with duplicate top values in topRows. For each input row, we check whether it is equal to the current top row. If it is, we also insert it into topRows. If it is different, we pop the elements in topRows that are equal to the top row. Is my understanding correct?

1 reply

mbasmanova Jul 15, 2024
Collaborator

@liujiayi771 This understanding is accurate.
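
For readers following along, one possible reading of the approach described above for `rank <= N` can be sketched in Python. This is an illustration under stated assumptions, not the Velox implementation: the class name `TopRankRows`, the use of numeric order keys, the per-key tie counter, and `limit >= 1` are all choices made for this example.

```python
import heapq
from collections import Counter
from itertools import count


class TopRankRows:
    """Keeps every row of one partition whose rank() would be <= limit.

    Assumes numeric order keys (ascending order) and limit >= 1.
    """

    def __init__(self, limit):
        self.limit = limit
        self._heap = []         # max-heap via negated keys: (-key, seq, row)
        self._ties = Counter()  # order key -> number of kept rows with that key
        self._seq = count()     # tie-breaker so row payloads are never compared

    def add(self, key, row):
        if len(self._heap) >= self.limit and key > -self._heap[0][0]:
            # At least `limit` kept rows sort strictly before this row,
            # so its rank is already greater than `limit`.
            return
        heapq.heappush(self._heap, (-key, next(self._seq), row))
        self._ties[key] += 1
        top_key = -self._heap[0][0]
        # The rank of the top group is 1 + the number of rows strictly before it.
        if len(self._heap) - self._ties[top_key] >= self.limit:
            # The whole top group now has rank > limit: pop all of its duplicates.
            for _ in range(self._ties.pop(top_key)):
                heapq.heappop(self._heap)

    def rows(self):
        # Kept rows in ascending key order, ties in insertion order.
        return [row for _, _, row in sorted(self._heap, key=lambda e: (-e[0], e[1]))]


top = TopRankRows(limit=3)
for key in [5, 1, 3, 3, 2, 1, 4]:
    top.add(key, f"r{key}")
print(top.rows())  # ['r1', 'r1', 'r2']
```

Rows equal to the current top are inserted alongside it, and when a smaller row pushes the top group past the limit, every duplicate of the top is popped together.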

aditi-pandit (Collaborator) · 2024-08-12T05:56:17Z

@mbasmanova: Trino has also implemented a similar optimization for TopNRank (trinodb/trino#6333). We are very keen to implement this in Presto/Prestissimo.

@JkSelf @liujiayi771: Has work on this already started? If not, I can pick it up and will follow up with a design doc/implementation.

@amitkdutta

3 replies

liujiayi771 Aug 12, 2024

@aditi-pandit I haven't started this part of the work yet. I hacked some code internally for temporary use, but there is no complete design to support multiple window funcs. You can pick it up, and I look forward to your design doc.

JkSelf Aug 12, 2024
Collaborator Author

@aditi-pandit No, I don't have time for this task. Please go ahead and proceed with the work. I look forward to seeing your implementation. Thank you.

aditi-pandit Aug 12, 2024
Collaborator

@liujiayi771 @JkSelf : Thanks for confirming. I will follow up here.

liujiayi771 · 2024-10-04T04:19:51Z

Hi @aditi-pandit.
Are you currently developing support for rank and dense_rank? I have implemented a version that supports both row_number and rank. I introduced a boolean value keepDuplicateRows_ in TopNRowNumber to indicate whether to retain duplicate values in the priority queue. For rank, this boolean is set to true, and in the processInputRow method, I use different code logic based on the value of keepDuplicateRows_. When keepDuplicateRows_ is true, it allows storing duplicate rows in the priority queue.

However, this approach does not support dense_rank, as dense_rank requires additional information about unique rows. It may also require introducing a HashTable to keep track of unique rows. If you have already started development, could you share your design?
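
To illustrate why dense_rank needs to track unique rows, here is a hypothetical Python sketch that keeps one group of rows per distinct order key, using a dictionary in the role of the HashTable mentioned above. The class name `TopDenseRankRows`, numeric keys, and `limit >= 1` are assumptions for this example; it is not the design used in Velox.

```python
import heapq


class TopDenseRankRows:
    """Keeps every row of one partition whose dense_rank() would be <= limit.

    Assumes numeric order keys (ascending order) and limit >= 1.
    """

    def __init__(self, limit):
        self.limit = limit
        self._groups = {}    # distinct order key -> rows sharing that key
        self._max_heap = []  # negated distinct keys; top is the largest kept key

    def add(self, key, row):
        group = self._groups.get(key)
        if group is not None:
            # Same key as an already kept group: same dense_rank, always kept.
            group.append(row)
            return
        if len(self._groups) >= self.limit:
            largest = -self._max_heap[0]
            if key > largest:
                return  # would be distinct key number limit + 1 or later
            # The new smaller key pushes the largest group past the limit.
            heapq.heappop(self._max_heap)
            del self._groups[largest]
        self._groups[key] = [row]
        heapq.heappush(self._max_heap, -key)

    def rows(self):
        return [row for key in sorted(self._groups) for row in self._groups[key]]


top = TopDenseRankRows(limit=3)
for key in [5, 1, 3, 3, 2, 1, 4]:
    top.add(key, f"r{key}")
print(top.rows())  # ['r1', 'r1', 'r2', 'r3', 'r3']
```

The number of kept rows is bounded by the number of distinct keys rather than by the number of rows, which is the extra bookkeeping that a single duplicate-retaining flag does not provide.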

0 replies

aditi-pandit (Collaborator) · 2024-10-10T03:39:17Z

@liujiayi771 : Yes, I'm working on TopN for rank and dense_rank. I am far along on a prototype as well.

Yes, there are more steps beyond retaining the duplicate rows based on how rank values are assigned.

I will send out a design/prototype next week.

0 replies
