Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Enable nsmallest/nlargest on object dtype #61166

Open
Gri72 opened this issue Mar 22, 2025 · 20 comments
Open

ENH: Enable nsmallest/nlargest on object dtype #61166

Gri72 opened this issue Mar 22, 2025 · 20 comments
Assignees
Labels
Enhancement Filters e.g. head, tail, nth Sorting e.g. sort_index, sort_values

Comments

@Gri72
Copy link

Gri72 commented Mar 22, 2025

Is it possible to implement method DataFrame.nsorted that equivalent to df.sort_values(by=["col1", "col2"], ascending=[True, False]).head(k), but with time complexity O(n log k)? I know about nlargest and nsmallest methods but unfortunately they work only with numeric columns.

Example:

df.nsorted(k=6, by=["col1", "col2"], ascending=[True, False])
@rhshadrach
Copy link
Member

Thanks for the request. I see no reason nlargest and nsmallest cannot be made to work on object dtype. We'd need to add in a Cython implementation of nlargest as currently only nsmallest is implemented and we apply this to -values which will not work with object dtype. Certainly this would be preferred over adding a new method that does the same thing.

Further investigations and PRs to implement are welcome!

@rhshadrach rhshadrach added Enhancement Filters e.g. head, tail, nth Sorting e.g. sort_index, sort_values labels Mar 22, 2025
@rhshadrach rhshadrach changed the title ENH: nsorted method ENH: Enable nsmallest/nlargest on object dtype Mar 22, 2025
@Gri72
Copy link
Author

Gri72 commented Mar 22, 2025

Thank you for the reply. As far as I understood correctly, in the end it might look something like this?

df.nlargest(n=6, columns=['col1', 'col2'], ascending=[True, False])

@rhshadrach
Copy link
Member

I do not think we should add an ascending argument; this can be accomplished by sorting the result.

@RutujaGhodake
Copy link

take

@MartinBraquet
Copy link
Contributor

MartinBraquet commented Mar 25, 2025

@rhshadrach I'd like to push back a little bit to make sure the issue is clearly understood, as it appears to me that the request extends beyond simply enabling dtype objects in nsmallest/nlargest.

Indeed, I believe there is a second ENH request which solely concerns sorting, irrespective of the type (that is, it also concerns numeric values), which I'll try to explain below.

What is currently missing in those nlargest and nsmallest methods is a way to sort values where the order is ascending for one column and descending for another column. One can see it as a combination of nlargest and smallest, depending on the column, which is what is being proposed by OP by a new nsorted method.

Here is an example that, I believe, can only be achieved via sort_values, which OP would like to get more efficiently, and which none of nlargest or smallest can perform, irrespective of a post-hoc sorting.

import pandas as pd

df = pd.DataFrame({
    'c1': [1, 2, 2],
    'c2': [1, 2, 3],
})
   c1  c2
0   1   1
1   2   2
2   2   3
df.sort_values(['c1', 'c2'], ascending=[True, False]).head(2)
   c1  c2
0   1   1
2   2   3

For instance, nmallest returns a different result, which irreversibly (regardless of subsequent operations) prevents the user from obtaining the desired result above:

df.nsmallest(2, ['c1', 'c2'])
   c1  c2
0   1   1
1   2   2

Naturally, most users have managed their way through this issue by negating the columns for which they want nsmallest and then using nlargest on that modified dataframe; but this hack only works for numeric types, which, I assume, is the reason for the OP posting this ENH request.

Please let me know if I missed something.

If not, perhaps we should indeed have that nsorted method, with its ascending parameter, which would not only allow for dtypes but also have a cleaner code on the user end (without ad hoc negation of some columns).

@rhshadrach
Copy link
Member

Agreed that this is not doable in the API today. However given the input

   c1  c2
0   1   1
1   2   2
2   2   3

I do not think it makes sense to call the result:

   c1  c2
0   1   1
2   2   3

neither "smallest" nor "largest". In other words, I think it'd be trying to force the ability to do an operation into a method where it does not belong.

@MartinBraquet
Copy link
Contributor

MartinBraquet commented Mar 26, 2025

Right. Hence our suggestion from OP and I to make a new method called nsorted.

@rhshadrach
Copy link
Member

Ah, indeed, thanks! It does seem rather straightforward to parameterize SelectN to support a by-column ascending flag. Note this doesn't quite satisfy the OP, we'd also need to add Cython for object dtype. I'm +1 on adding nsorted, and having nsmallest and nlargest become convenience methods (e.g. nlargest is just ascending=False).

cc @pandas-dev/pandas-core

@MartinBraquet
Copy link
Contributor

Alright. So, to clarify, there are two independent upgrades needed, which can thus be addressed in two different PRs:

  1. Add nsorted method to Dataframe and Series, where the bulk of the change should be done in pandas.core.methods.selectn.SelectNFrame
  2. Enable dtypes for nsorted (and, as a consequence, nlargest and nsmallest), where the bulk of the change should be done in pandas.core.methods.selectn.SelectNSeries, and more specifically extending libalgos.kth_smallest, a Cython function, to dtypes

How should we handle two PRs required for the same issue? Shouldn't we create a new issue to close after upgrade 1 is merged, and then close the issue here when upgrade 2 is merged?

@rhshadrach
Copy link
Member

That proposal has my support.

How should we handle two PRs required for the same issue? Shouldn't we create a new issue to close after upgrade 1 is merged, and then close the issue here when upgrade 2 is merged?

No - a PR need not close any issue. One can add the line Part of #XYZ in the OP of a PR and merging the PR will not close the issue.

@MartinBraquet
Copy link
Contributor

Are you familiar with the sub-issue feature? Just asking to see if it's part of pandas' development workflow. I agree that it might not be very useful for two subtasks, but I see the value if we have an issue with 5+ subtasks, each requiring a different PR, in order to keep track of when the issue can be closed.

@MartinBraquet
Copy link
Contributor

@RutujaGhodake Do you have any plan on working on any of those subtasks?

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Mar 27, 2025

I agree with this proposal. However, I'm not sure that the current implementation of libalgos.kth_smallest is $O(n log k)$ . The OP was also asking for that as well....

@snitish
Copy link
Member

snitish commented Mar 28, 2025

@Dr-Irv I believe libalgos.kth_smallest uses the QuickSelect algorithm which has an average case time complexity of O(n) (worst case is O(n^2)). That, along with the subsequent merge sort on the remaining k elements means an effective average case time complexity of O(n logk) for nlargest/nsmallest.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Mar 28, 2025

@Dr-Irv I believe libalgos.kth_smallest uses the QuickSelect algorithm which has an average case time complexity of O(n) (worst case is O(n^2)). That, along with the subsequent merge sort on the remaining k elements means an effective average case time complexity of O(n logk) for nlargest/nsmallest.

OK, but the OP was asking for worse case complexity (at least, based on the formal definition of time complexity), so that was my concern - the worst case.

@snitish
Copy link
Member

snitish commented Mar 31, 2025

Got it. I'd be happy to work on a new O(n log k) (worst case) cython implementation of nlargest/nsmallest if there is interest @Dr-Irv @rhshadrach

@MartinBraquet
Copy link
Contributor

@snitish Are you aware of any algorithm doing so?

@RutujaGhodake
Copy link

@MartinBraquet After reviewing the request, I believe it requires a deeper understanding of the codebase. I will be happy to work on more beginner friendly issue. Anyone who is interested can work on this.

@simonjayhawkins
Copy link
Member

  1. Add nsorted method to Dataframe and Series, where the bulk of the change should be done in pandas.core.methods.selectn.SelectNFrame

Will the proposed pandas.DataFrame.nsorted also have all the capabilities, i.e. accept all the parameters, of the current pandas.DataFrame.sort_values? It will have an extra parameter for k which represents n? the k parameter will be the first parameter and the others will be keyword only unlike sort_values which has by as the first parameter and the others as keyword only?

Could the proposal lead to potential duplication of the API and potential confusion to users?

could the 'k' or 'n' argument be added to the sort_values method to achieve the same outcome as the OP request?

DataFrame.nsmallest (and DataFrame.nlargest) is equivalent to df.sort_values(columns, ascending=True).head(n), but more performant. These methods also put the n parameter first which perhaps feels more natural and aligns with the proposed pandas.DataFrame.nsorted method.

+1 on an enhancement to achieve the outcome the OP is requesting but not entirely convinced that the addition of another DataFrame method is the best way to achieved this.

@snitish
Copy link
Member

snitish commented Apr 3, 2025

@MartinBraquet I believe we can use a min/max heap of size k to achieve O(n logk) worst case complexity for nlargest/nsmallest

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Filters e.g. head, tail, nth Sorting e.g. sort_index, sort_values
Projects
None yet
Development

No branches or pull requests

7 participants