Add nunique #9548 #10939
Conversation
```python
# If axis empty, return unchanged data.
if not axis:
    return data
```
this should be an error
Happy to change this, but note that raising an error when axis is empty will mean aggregator calls like
```python
array_1 = np.array([[1, 2, 2], [3, 4, 4]])
array_2 = np.array([4, 5, 6])
da1 = xr.DataArray(array_1, dims=("x", "y"))
da2 = xr.DataArray(array_2, dims=("y",))
ds = xr.Dataset({"a": da1, "b": da2})
ds.nunique(dim="x")
```
```
<xarray.Dataset> Size: 48B
Dimensions:  (y: 3)
Dimensions without coordinates: y
Data variables:
    a        (y) int64 24B 2 2 2
    b        (y) int64 24B 4 5 6
```
will instead raise errors. Note that the behaviour of aggregators like `ds.mean(dim="x")` is to leave variables missing the required coordinates unchanged, analogously to the above; this is why I just return the data unchanged when `axis` is empty.
xarray/core/duck_array_ops.py
```python
new_shape = [s for i, s in enumerate(shape) if i not in axis] + [-1]
stacked = xp.reshape(xp.transpose(data, new_order), new_shape)

# Check if data has type object; if so use pd.factorize for unique integers
```
This is a good start, but we'd prefer a vectorized approach.

I'd start with the first answer here: https://stackoverflow.com/questions/46893369/count-unique-elements-along-an-axis-of-a-numpy-array

but replace the `np.diff` with `not_equal(a[..., :-1], a[..., 1:])` and sum that along the axis. You'll have to handle the case of NaNs not comparing equal, presumably by also summing `duck_array_ops.isnull(a)` along the same axis and subtracting it away.
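For reference, the suggested sort-and-compare approach might be sketched as follows in plain NumPy. This is a hypothetical illustration, not the PR's actual implementation; the function name and the float-only handling are assumptions:

```python
import numpy as np

def nunique_sketch(a, axis=-1, skipna=True):
    # Sort along the axis so equal values become adjacent; NaNs sort last.
    a = np.moveaxis(np.asarray(a, dtype=float), axis, -1)
    s = np.sort(a, axis=-1)
    # Count transitions between adjacent sorted values. Since NaN != NaN,
    # each NaN also registers as one extra transition here.
    n = np.not_equal(s[..., :-1], s[..., 1:]).sum(axis=-1) + 1
    # Subtract the spurious per-NaN counts; keep one if skipna is False.
    n_nan = np.isnan(a).sum(axis=-1)
    return n - n_nan + (0 if skipna else np.minimum(n_nan, 1))
```

For `[[1, 2, 2], [3, 4, 4]]` reduced along the first axis this gives `[2, 2, 2]`, matching the Dataset example above.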
No worries! I've implemented this vectorized approach in commit 456f953. I've also tried to extend the approach to dask arrays; please let me know if there's a better or more standard way to do this.
```python
    **kwargs,
)


def nunique(
```
The array API version seems to be `unique_counts`, though we add a `dim` parameter. So let's go with that.
It looks like there are four uniqueness-related functions in the array API: `unique_all`, `unique_counts`, `unique_inverse`, and `unique_values`.

I don't think any of these functions can be extended to xarray methods in a consistent way. They all return variable-length arrays, or tuples of variable-length arrays, so I don't think it makes sense to try to apply them along Dataset or DataArray dimensions. However, I do think it's worth including the "NaNs are distinct" feature of these array API functions, so in commit 456f953 I've added an `equalna` option. If `skipna=False`, then `equalna=True` and `equalna=False` give the pandas/numpy default NaN counting and the array API counting, respectively.
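To illustrate why these return shapes can't map onto fixed dimensions, here is a small NumPy analogue of `unique_counts`, using `np.unique` with `return_counts=True`:

```python
import numpy as np

values, counts = np.unique(np.array([1, 2, 2, 3]), return_counts=True)
print(values)  # [1 2 3]
print(counts)  # [1 2 1]
# The length of both outputs depends on the data itself, so the result
# cannot be assigned to a fixed-size Dataset dimension; nunique returns
# only the scalar count per slice, which can.
```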
Thank you so much @dcherian for your review! I've responded to your code comments above. Some additional thoughts and questions below.
Thank you again for your time. Profiling results are in nunique_profile.ipynb.
Very glad to have you contribute!
Yes, usually we are agnostic by using the array API functions. However, the standardized array API version of
We try to, but on this one the implementation is a bit involved, as you see :). Re: memory, the "standard" for such estimations is approximate algorithms (…). Our other option here is to use
The arrays are probably best; a lot of work goes into optimizing
Yes, exactly.
Thanks for taking the time to contribute!
Thanks again @dcherian for the fast response!
My thinking was that if the user tries to reduce the whole array, enforcing a single chunk along the reduction axis will try to load the whole array into memory.
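A hypothetical dask illustration of that concern (assuming `dask.array` is available): rechunking to a single chunk along the reduction axis merges every chunk on that axis, so the merged chunk must hold the entire axis in memory at once.

```python
import numpy as np
import dask.array as da

x = da.from_array(np.arange(8.0).reshape(4, 2), chunks=(1, 2))
# rechunk({0: -1}) collapses all chunks along axis 0 into one, so that
# chunk now spans the whole reduction axis.
y = x.rechunk({0: -1})
print(x.chunks)  # ((1, 1, 1, 1), (2,))
print(y.chunks)  # ((4,), (2,))
```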
OK! Is there any more work you'd like me to do on the present
- `nunique` reduction for number of unique values #9548
- `whats-new.rst`
- `api.rst`

Note
I've tried to replicate the functionality of `pandas.DataFrame.nunique` as closely as possible. Note however that the Python array API standard suggests each `nan` should be treated as a unique value, which would contradict the behaviour of `pandas.DataFrame.nunique`. One option would be to add an option `unique_na` to the xarray version of `nunique`, which would count each `nan` as a distinct value.
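For concreteness, the two NaN conventions diverge like this (pandas behaviour shown; the array API's "each NaN is distinct" rule would give 3 here):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan])
print(s.nunique())              # 1: NaNs dropped by default
print(s.nunique(dropna=False))  # 2: all NaNs counted as one value
# Under the array API convention each NaN is distinct, giving 3.
```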