-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Enable nsmallest/nlargest on object dtype #61166
Comments
Thanks for the request. I see no reason Further investigations and PRs to implement are welcome! |
Thank you for the reply. As far as I understood correctly, in the end it might look something like this? df.nlargest(n=6, columns=['col1', 'col2'], ascending=[True, False]) |
I do not think we should add an |
take |
@rhshadrach I'd like to push back a little bit to make sure the issue is clearly understood, as it appears to me that the request extends beyond simply enabling dtype objects in nsmallest/nlargest. Indeed, I believe there is a second ENH request which solely concerns sorting, irrespective of the type (that is, it also concerns numeric values), which I'll try to explain below. What is currently missing in those nlargest and nsmallest methods is a way to sort values where the order is ascending for one column and descending for another column. One can see it as a combination of nlargest and smallest, depending on the column, which is what is being proposed by OP by a new nsorted method. Here is an example that, I believe, can only be achieved via sort_values, which OP would like to get more efficiently, and which none of nlargest or smallest can perform, irrespective of a post-hoc sorting. import pandas as pd
df = pd.DataFrame({
'c1': [1, 2, 2],
'c2': [1, 2, 3],
})
df.sort_values(['c1', 'c2'], ascending=[True, False]).head(2)
For instance, nmallest returns a different result, which irreversibly (regardless of subsequent operations) prevents the user from obtaining the desired result above: df.nsmallest(2, ['c1', 'c2'])
Naturally, most users have managed their way through this issue by negating the columns for which they want nsmallest and then using nlargest on that modified dataframe; but this hack only works for numeric types, which, I assume, is the reason for the OP posting this ENH request. Please let me know if I missed something. If not, perhaps we should indeed have that nsorted method, with its |
Agreed that this is not doable in the API today. However given the input
I do not think it makes sense to call the result:
neither "smallest" nor "largest". In other words, I think it'd be trying to force the ability to do an operation into a method where it does not belong. |
Right. Hence our suggestion from OP and I to make a new method called nsorted. |
Ah, indeed, thanks! It does seem rather straightforward to parameterize cc @pandas-dev/pandas-core |
Alright. So, to clarify, there are two independent upgrades needed, which can thus be addressed in two different PRs:
How should we handle two PRs required for the same issue? Shouldn't we create a new issue to close after upgrade 1 is merged, and then close the issue here when upgrade 2 is merged? |
That proposal has my support.
No - a PR need not close any issue. One can add the line |
Are you familiar with the sub-issue feature? Just asking to see if it's part of pandas' development workflow. I agree that it might not be very useful for two subtasks, but I see the value if we have an issue with 5+ subtasks, each requiring a different PR, in order to keep track of when the issue can be closed. |
@RutujaGhodake Do you have any plan on working on any of those subtasks? |
I agree with this proposal. However, I'm not sure that the current implementation of |
@Dr-Irv I believe |
OK, but the OP was asking for worse case complexity (at least, based on the formal definition of time complexity), so that was my concern - the worst case. |
Got it. I'd be happy to work on a new |
@snitish Are you aware of any algorithm doing so? |
@MartinBraquet After reviewing the request, I believe it requires a deeper understanding of the codebase. I will be happy to work on more beginner friendly issue. Anyone who is interested can work on this. |
Will the proposed Could the proposal lead to potential duplication of the API and potential confusion to users? could the 'k' or 'n' argument be added to the
+1 on an enhancement to achieve the outcome the OP is requesting but not entirely convinced that the addition of another DataFrame method is the best way to achieved this. |
@MartinBraquet I believe we can use a min/max heap of size |
Is it possible to implement method DataFrame.nsorted that equivalent to
df.sort_values(by=["col1", "col2"], ascending=[True, False]).head(k)
, but with time complexity O(n log k)? I know aboutnlargest
andnsmallest
methods but unfortunately they work only with numeric columns.Example:
The text was updated successfully, but these errors were encountered: