Replies: 5 comments 5 replies
-
@JkSelf At a high level, it makes sense to optimize rank <= N and dense_rank <= N queries. However, there are quite a few details to sort out. Would you create a Google doc to describe the proposed design and implementation in detail? Specifically, the number of top rows that must be kept is quite different for these 3 functions. row_number <= 3 requires keeping only 3 top rows. However, it is not enough to keep 3 top rows for rank <= 3 or dense_rank <= 3.
It seems wasteful to sort all the data in this case. |
Beta Was this translation helpful? Give feedback.
-
@mbasmanova For rank and dense_rank, we need to store the duplicate top values in the topRows. We need to check if the input row is the same as the topRow. If they are the same, we also need to insert them into topRows. If they are different, we need to pop out the elements in topRows that are the same as topRow. I understand it this way, right? |
Beta Was this translation helpful? Give feedback.
-
@mbasmanova : Trino has also implemented a similar optimization trinodb/trino#6333 for TopNRank. We are very keen to implement this in Presto/Prestissimo. @JkSelf @liujiayi771 : Has work started on this already ? If not, I can pick it up. Will follow up with a design doc/implementation. |
Beta Was this translation helpful? Give feedback.
-
Hi @aditi-pandit. However, this approach does not support |
Beta Was this translation helpful? Give feedback.
-
@liujiayi771 : Yes, I'm working on TopN for rank and dense_rank. I am far along on a prototype as well. Yes, there are more steps beyond retaining the duplicate rows based on how rank values are assigned. I will send out a design/prototype next week. |
Beta Was this translation helpful? Give feedback.
-
After Gluten was upgraded to Spark version 3.5, Spark 3.5 introduced the RankLimit operator here, which optimizes the performance of the rank, dense_rank, and row_number functions. It extracts only the top N data within each WindowPartition, and then in the window operator, it is only necessary to compute the top N data for each Partition without needing to process all the data. This approach not only improves performance but also reduces the risk of out-of-memory (OOM) issues when memory is constrained. Therefore, we plan to also introduce support for the RankLimit operator in Gluten.
Currently, to implement the RankLimit operator in Gluten, we need to address the following two issues:
@mbasmanova @aditi-pandit @zhouyuan @ayushi-agarwal @PHILO-HE @rui-mo
Beta Was this translation helpful? Give feedback.
All reactions