
Conversation

@eshort0401

Note

I've tried to replicate the functionality of pandas.DataFrame.nunique as closely as possible. Note, however, that the Python array API standard suggests each NaN should be treated as a unique value, which contradicts the behaviour of pandas.DataFrame.nunique. One option would be to add an option unique_na to the xarray version of nunique, which would count each NaN as a distinct value.

@github-actions github-actions bot added topic-arrays related to flexible array support topic-NamedArray Lightweight version of Variable labels Nov 21, 2025

    # If axis empty, return unchanged data.
    if not axis:
        return data
Contributor

This should be an error.

Author

Happy to change this, but note that raising an error when axis is empty will mean aggregator calls like

    array_1 = np.array([[1, 2, 2], [3, 4, 4]])
    array_2 = np.array([4, 5, 6])
    da1 = xr.DataArray(array_1, dims=("x", "y"))
    da2 = xr.DataArray(array_2, dims=("y",))
    ds = xr.Dataset({"a": da1, "b": da2})
    ds.nunique(dim="x")
    <xarray.Dataset> Size: 48B
    Dimensions:  (y: 3)
    Dimensions without coordinates: y
    Data variables:
        a        (y) int64 24B 2 2 2
        b        (y) int64 24B 4 5 6

will instead raise errors. Note that aggregators like ds.mean(dim="x") leave variables lacking the required dimension unchanged, analogously to the above; this is why I return the data unchanged when axis is empty.

new_shape = [s for i, s in enumerate(shape) if i not in axis] + [-1]
stacked = xp.reshape(xp.transpose(data, new_order), new_shape)
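For context, the transpose/reshape step above moves the reduced axes to the end and flattens them into a single trailing axis. A small standalone illustration (variable names match the snippet; the example array and derivation of new_order are mine):

```python
import numpy as np

data = np.arange(24).reshape(2, 3, 4)
axis = (0, 2)                                     # axes being reduced
keep = [i for i in range(data.ndim) if i not in axis]
new_order = keep + list(axis)                     # move reduced axes last
new_shape = [data.shape[i] for i in keep] + [-1]
stacked = np.reshape(np.transpose(data, new_order), new_shape)
# stacked has shape (3, 8): one row per kept index, with the 2*4
# reduced elements flattened into the trailing axis.
```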

# Check if data has type object; if so use pd.factorize for unique integers
Contributor

This is a good start, but we'd prefer a vectorized approach.

I'd start with the first answer here: https://stackoverflow.com/questions/46893369/count-unique-elements-along-an-axis-of-a-numpy-array
but replace the np.diff with not_equal(a[..., :-1], a[..., 1:]) and sum that along the axis. You'll have to handle the case of NaNs not comparing equal, presumably by also summing duck_array_ops.isnull(a) along the same axis and subtracting it away.
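A minimal NumPy sketch of this suggestion (my own function name; assumes a float array, with np.isnan standing in for duck_array_ops.isnull):

```python
import numpy as np

def nunique_vectorized(a, axis=-1):
    """Count unique non-NaN values along `axis` (pandas-style skipna)."""
    a = np.moveaxis(np.asarray(a, dtype=float), axis, -1)
    s = np.sort(a, axis=-1)  # NaNs sort to the end
    # Each boundary between adjacent unequal sorted values starts a new run,
    # so the number of uniques is the boundary count plus one.
    changes = np.not_equal(s[..., :-1], s[..., 1:]).sum(axis=-1)
    # NaN != NaN, so every NaN creates a boundary; subtract them away.
    return changes + 1 - np.isnan(a).sum(axis=-1)
```

For example, `nunique_vectorized(np.array([[1, 2, 2], [3, 4, 4]]), axis=0)` reproduces the `ds.nunique(dim="x")` result for variable "a" above, giving `[2, 2, 2]`.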

Author

No worries! I've implemented this vectorized approach in commit 456f953. I've also tried to extend the approach to dask arrays; please let me know if there's a better or more standard way to do this.

**kwargs,
)

def nunique(
Contributor

The array API version seems to be unique_counts, though we add a dim parameter. So let's go with that.

Author

It looks like there are four uniqueness-related functions in the array API: unique_all, unique_counts, unique_inverse, and unique_values.

I don't think any of these functions can be extended to xarray methods in a consistent way. They all return variable-length arrays, or tuples of variable-length arrays, so I don't think it makes sense to try to apply them along Dataset or DataArray dimensions. However, I do think it's worth including the "NaNs are distinct" feature of these array API functions, so in commit 456f953 I've added an equalna option. If skipna=False, then equalna=True and equalna=False give the pandas/numpy default NaN counting and the array API counting, respectively.
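A minimal 1-d sketch of the semantics described above (nunique_1d and its exact signature are illustrative, not the PR's API):

```python
import numpy as np

def nunique_1d(a, skipna=True, equalna=True):
    """Count uniques of a 1-d float array under the NaN conventions above."""
    nan_count = int(np.isnan(a).sum())
    n = np.unique(a[~np.isnan(a)]).size      # uniques among non-NaN values
    if skipna or nan_count == 0:
        return n
    # equalna=True: all NaNs count as one value (pandas/numpy default);
    # equalna=False: every NaN is distinct (array API convention).
    return n + (1 if equalna else nan_count)
```

For `np.array([1.0, np.nan, np.nan])` this gives 1 by default, 2 with `skipna=False`, and 3 with `skipna=False, equalna=False`.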

@eshort0401 (Author)
Thank you so much @dcherian for your review! I've responded to your code comments above. Some additional thoughts and questions below.

  • I used xarray extensively during my academic years, but this is my first attempt at contributing to the repo, so thank you for your patience as I delve deeper!
  • I'm assuming the goal is to be agnostic about the array backend? Hence using the vectorized stack-exchange sort/diff/count method, rather than wrapping np.unique or some-such?
  • I'm also assuming you want a method that works on dask arrays? I've had a go at a dask-compatible version in commit 456f953. This was challenging. You can't count uniques on a per-chunk basis, so you need to store the uniques you find in each chunk, and if there are lots of uniques this can become very memory intensive and slow. The approach I took was:
    1. First use the vectorized stack-exchange method on each chunk of our starting array data, but instead of counting the uniques immediately, store the unique values for each relevant cell of data as an array.
    2. For relevant chunks of data, combine the arrays of uniques across the relevant cells by applying xp.concatenate, then apply a 1d version of the stack-exchange method. For the 1d case I tried a few different approaches (e.g. using set unions instead of array concatenation), as well as custom sorting algorithms, but the vanilla sort/diff method was fastest. We're probably benefiting from the component arrays already being sorted, and from the highly optimized np.sort.
      I used dask.array.reduction to combine the uniques across chunks, with the aggregate method doing the final counting. It may be possible to speed up the dask implementation with numba, but when I profiled the code the bottlenecks were just the repeated applications of xp.sort and xp.concatenate. From my testing, the best way to optimize was to choose the chunks intelligently. You want the chunks to be as large as possible, particularly in the reduction dimension; I've attached a notebook for exploring this.
  • I'm not sure if the nunique function should live where it currently does in duck_array_ops.py. I'm also unsure if there is a more intelligent way to build the dask version from the non-dask version.
  • I haven't yet tested the other array backends mentioned in the xarray code as I'm unfamiliar with these, but will have a go shortly! From duck_array_ops.py I got the sense that the priority is numpy and dask compatibility.
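The chunked strategy in the steps above can be sketched in plain NumPy (dask wiring omitted; the function names are mine, not the PR's):

```python
import numpy as np

def chunk_uniques(chunk):
    # Per-chunk step: keep the sorted unique values instead of a count.
    return np.unique(chunk)

def merge_uniques(parts):
    # Combine step: concatenate the sorted unique arrays and deduplicate
    # with the same sort/diff trick (sorting benefits from pre-sorted parts).
    merged = np.sort(np.concatenate(parts))
    return merged[np.r_[True, merged[1:] != merged[:-1]]]

def aggregate_count(parts):
    # Aggregate step: only count uniques at the very end.
    return merge_uniques(parts).size

chunks = [np.array([1, 2, 2]), np.array([2, 3]), np.array([3, 3, 4])]
n = aggregate_count([chunk_uniques(c) for c in chunks])  # → 4
```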

Thank you again for your time, and to the xarray team for an amazing package!

nunique_profile.ipynb
nunique_good_chunks_profile.html
nunique_bad_chunks_profile.html

@dcherian (Contributor) commented Nov 24, 2025

I used xarray extensively during my academic years, but this is my first attempt at contributing to the repo, so thank you for your patience as I delve deeper!

Very glad to have you contribute!

I'm assuming the goal is to be agnostic about the array backend? Hence using the vectorized stack-exchange sort/diff/count method, rather than wrapping np.unique or some-such?

Yes, usually we are agnostic by using the array API functions. However, the standardized array API version of unique_counts does not support an axis parameter; that's why I suggested the Stack Overflow version.

I'm also assuming you want a method that works on dask arrays? I've had a go at a dask compatible version in commit 456f953. This was challenging. You can't count uniques on a per-chunk basis, so you need to store the uniques you find in each chunk, and if there are lots of uniques this can become very memory intensive and slow. The approach I took was;

We try to, but on this one the implementation is a bit involved, as you can see :) . Re: memory, the standard for such estimates is an approximate algorithm (an approx_count_distinct using something like a HyperLogLog).

Our other option here is to use apply_ufunc(..., dask="parallelized"); this will enforce a single chunk along the reduction axis and apply the numpy version blockwise. This would be a fine approach too, if you'd prefer; given the unknown memory usage, perhaps it is best. Also, given the complexity of the implementation (good job!), perhaps this should be upstreamed to dask/dask.
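For reference, a minimal sketch of this apply_ufunc route (function names are mine; NaN handling omitted for brevity):

```python
import numpy as np
import xarray as xr

def _nunique_np(a):
    # Count uniques along the last axis (apply_ufunc moves the core dim there).
    s = np.sort(a, axis=-1)
    return np.not_equal(s[..., :-1], s[..., 1:]).sum(axis=-1) + 1

def nunique(obj, dim):
    return xr.apply_ufunc(
        _nunique_np,
        obj,
        input_core_dims=[[dim]],
        dask="parallelized",  # applies the numpy version blockwise over dask arrays
        output_dtypes=[np.intp],
    )
```

On the Dataset example earlier in the thread, `nunique(da1, "x")` would give `[2, 2, 2]` along y.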

For relevant chunks of data, combine the arrays of uniques across the relevant cells by applying xp.concatenate, then apply a 1d version of the stack-exchange method. For the 1d case I tried a few different approaches (e.g. using set unions instead of array concatenation), as well as custom sorting algorithms, but the vanilla sort/diff method was fastest. We're probably benefiting from the component arrays already being sorted, and from the highly optimized np.sort.

The arrays are probably best; a lot of work goes into optimizing np.sort.

You want the chunks to be as large as possible, particularly in the reduction dimension; I've attached a notebook for exploring this.

Yes, exactly.

Thank you again for your time, and to the xarray team for an amazing package!

Thanks for taking the time to contribute!

@eshort0401 (Author)
Thanks again @dcherian for the fast response!

Our other option here is to use apply_ufunc(..., dask="parallelized") this will enforce a single chunk along the reduction axis and apply the numpy version blockwise. This would be a fine approach too, if you'd prefer.

My thinking was that if the user tries to reduce the whole array, enforcing a single chunk along the reduction axis will try to load the whole array into memory.

Also given the complexity of implementation (good job!) perhaps this should be upstreamed to dask/dask.

OK! Is there any more work you'd like me to do on the present xarray PR?



Development

Successfully merging this pull request may close these issues.

Add nunique reduction for number of unique values
