
Conversation

@eshort0401

Note

I've tried to replicate the functionality of pandas.DataFrame.nunique as closely as possible. Note, however, that the Python array API standard suggests each NaN should be treated as a unique value, which contradicts the behaviour of pandas.DataFrame.nunique. One option would be to add an option unique_na to the xarray version of nunique, which would count each NaN as a distinct value.

@github-actions github-actions bot added topic-arrays related to flexible array support topic-NamedArray Lightweight version of Variable labels Nov 21, 2025

    # If axis empty, return unchanged data.
    if not axis:
        return data
Contributor

This should be an error.

Author

Happy to change this, but note that raising an error when axis is empty will mean aggregator calls like

    array_1 = np.array([[1, 2, 2], [3, 4, 4]])
    array_2 = np.array([4, 5, 6])
    da1 = xr.DataArray(array_1, dims=("x", "y"))
    da2 = xr.DataArray(array_2, dims=("y",))
    ds = xr.Dataset({"a": da1, "b": da2})
    ds.nunique(dim="x")
    <xarray.Dataset> Size: 48B
    Dimensions:  (y: 3)
    Dimensions without coordinates: y
    Data variables:
        a        (y) int64 24B 2 2 2
        b        (y) int64 24B 4 5 6

will instead raise errors. Note that aggregators like ds.mean(dim="x") leave variables lacking the required dimension unchanged, analogously to the above; this is why I return the data unchanged when axis is empty.

new_shape = [s for i, s in enumerate(shape) if i not in axis] + [-1]
stacked = xp.reshape(xp.transpose(data, new_order), new_shape)
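For context, the transpose/reshape step above moves the reduced axes to the end and flattens them into a single trailing axis. A small standalone illustration (variable names match the snippet; the example array and derivation of new_order are mine):

```python
import numpy as np

data = np.arange(24).reshape(2, 3, 4)
axis = (0, 2)                                     # axes being reduced
keep = [i for i in range(data.ndim) if i not in axis]
new_order = keep + list(axis)                     # move reduced axes last
new_shape = [data.shape[i] for i in keep] + [-1]
stacked = np.reshape(np.transpose(data, new_order), new_shape)
# stacked has shape (3, 8): one row per kept index, with the 2*4
# reduced elements flattened into the trailing axis.
```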

# Check if data has type object; if so use pd.factorize for unique integers
Contributor

This is a good start, but we'd prefer a vectorized approach.

I'd start with the first answer here: https://stackoverflow.com/questions/46893369/count-unique-elements-along-an-axis-of-a-numpy-array
but replace the np.diff with not_equal(a[..., :-1], a[..., 1:]) and sum that along the axis. You'll have to handle the case of NaNs not comparing equal, presumably by also summing duck_array_ops.isnull(a) along the same axis and subtracting it away.
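A minimal NumPy sketch of this suggestion (my own function name; assumes a float array, with np.isnan standing in for duck_array_ops.isnull):

```python
import numpy as np

def nunique_vectorized(a, axis=-1):
    """Count unique non-NaN values along `axis` (pandas-style skipna)."""
    a = np.moveaxis(np.asarray(a, dtype=float), axis, -1)
    s = np.sort(a, axis=-1)  # NaNs sort to the end
    # Each boundary between adjacent unequal sorted values starts a new run,
    # so the number of uniques is the boundary count plus one.
    changes = np.not_equal(s[..., :-1], s[..., 1:]).sum(axis=-1)
    # NaN != NaN, so every NaN creates a boundary; subtract them away.
    return changes + 1 - np.isnan(a).sum(axis=-1)
```

For example, `nunique_vectorized(np.array([[1, 2, 2], [3, 4, 4]]), axis=0)` reproduces the `ds.nunique(dim="x")` result for variable "a" above, giving `[2, 2, 2]`.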

Author

No worries! I've implemented this vectorized approach in commit 456f953. I've also tried to extend the approach to dask arrays; please let me know if there's a better or more standard way to do this.

**kwargs,
)

def nunique(
Contributor

The array API version seems to be unique_counts, though we add a dim parameter. So let's go with that.

Author

It looks like there are four uniqueness-related functions in the array API: unique_all, unique_counts, unique_inverse, and unique_values.

I don't think any of these functions can be extended to xarray methods in a consistent way. They all return variable-length arrays, or tuples of variable-length arrays, so I don't think it makes sense to try to apply them along Dataset or DataArray dimensions. However, I do think it's worth including the "NaNs are distinct" feature of these array API functions, so in commit 456f953 I've added an equalna option. If skipna=False, then equalna=True and equalna=False give the pandas/numpy default NaN counting and the array API counting, respectively.
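A minimal 1-d sketch of the semantics described above (nunique_1d and its exact signature are illustrative, not the PR's API):

```python
import numpy as np

def nunique_1d(a, skipna=True, equalna=True):
    """Count uniques of a 1-d float array under the NaN conventions above."""
    nan_count = int(np.isnan(a).sum())
    n = np.unique(a[~np.isnan(a)]).size      # uniques among non-NaN values
    if skipna or nan_count == 0:
        return n
    # equalna=True: all NaNs count as one value (pandas/numpy default);
    # equalna=False: every NaN is distinct (array API convention).
    return n + (1 if equalna else nan_count)
```

For `np.array([1.0, np.nan, np.nan])` this gives 1 by default, 2 with `skipna=False`, and 3 with `skipna=False, equalna=False`.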

@eshort0401 (Author)
Thank you so much @dcherian for your review! I've responded to your code comments above. Some additional thoughts and questions below.

  • I used xarray extensively during my academic years, but this is my first attempt at contributing to the repo, so thank you for your patience as I delve deeper!
  • I'm assuming the goal is to be agnostic about the array backend? Hence using the vectorized stack-exchange sort/diff/count method, rather than wrapping np.unique or some-such?
  • I'm also assuming you want a method that works on dask arrays? I've had a go at a dask-compatible version in commit 456f953. This was challenging. You can't count uniques on a per-chunk basis, so you need to store the uniques you find in each chunk, and if there are lots of uniques this can become very memory intensive and slow. The approach I took was:
    1. First use the vectorized stack-exchange method on each chunk of our starting array data, but instead of counting the uniques immediately, store the unique values for each relevant cell of data as an array.
    2. For relevant chunks of data, combine the arrays of uniques across the relevant cells by applying xp.concatenate, then apply a 1d version of the stack-exchange method. For the 1d case I tried a few different approaches (e.g. using set unions instead of array concatenation), as well as custom sorting algorithms, but the vanilla sort/diff method was fastest. We're probably benefiting from the component arrays already being sorted, and from the highly optimized np.sort.
      I used dask.array.reduction to combine the uniques across chunks, with the aggregate method doing the final counting. It may be possible to speed up the dask implementation with numba, but when I profiled the code the bottlenecks were just the repeated applications of xp.sort and xp.concatenate. From my testing, the best way to optimize was to choose the chunks intelligently. You want the chunks to be as large as possible, particularly in the reduction dimension; I've attached a notebook for exploring this.
  • I'm not sure if the nunique function should live where it currently does in duck_array_ops.py. I'm also unsure if there is a more intelligent way to build the dask version from the non-dask version.
  • I haven't yet tested the other array backends mentioned in the xarray code as I'm unfamiliar with these, but will have a go shortly! From duck_array_ops.py I got the sense that the priority is numpy and dask compatibility.
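The chunked strategy in the steps above can be sketched in plain NumPy (dask wiring omitted; the function names are mine, not the PR's):

```python
import numpy as np

def chunk_uniques(chunk):
    # Per-chunk step: keep the sorted unique values instead of a count.
    return np.unique(chunk)

def merge_uniques(parts):
    # Combine step: concatenate the sorted unique arrays and deduplicate
    # with the same sort/diff trick (sorting benefits from pre-sorted parts).
    merged = np.sort(np.concatenate(parts))
    return merged[np.r_[True, merged[1:] != merged[:-1]]]

def aggregate_count(parts):
    # Aggregate step: only count uniques at the very end.
    return merge_uniques(parts).size

chunks = [np.array([1, 2, 2]), np.array([2, 3]), np.array([3, 3, 4])]
n = aggregate_count([chunk_uniques(c) for c in chunks])  # → 4
```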

Thank you again for your time, and to the xarray team for an amazing package!

nunique_profile.ipynb
nunique_good_chunks_profile.html
nunique_bad_chunks_profile.html

@dcherian (Contributor) commented Nov 24, 2025

I used xarray extensively during my academic years, but this is my first attempt at contributing to the repo, so thank you for your patience as I delve deeper!

Very glad to have you contribute!

I'm assuming the goal is to be agnostic about the array backend? Hence using the vectorized stack-exchange sort/diff/count method, rather than wrapping np.unique or some-such?

Yes, usually we are agnostic by using the array API functions. However, the standardized array API version of unique_counts does not support an axis parameter; that's why I suggested the Stack Overflow version.

I'm also assuming you want a method that works on dask arrays? I've had a go at a dask compatible version in commit 456f953. This was challenging. You can't count uniques on a per-chunk basis, so you need to store the uniques you find in each chunk, and if there are lots of uniques this can become very memory intensive and slow. The approach I took was;

We try to, but on this one the implementation is a bit involved, as you can see :) . Re: memory, the standard for such estimates is an approximate algorithm (an approx_count_distinct using something like a HyperLogLog).

Our other option here is to use apply_ufunc(..., dask="parallelized"); this will enforce a single chunk along the reduction axis and apply the numpy version blockwise. This would be a fine approach too, if you'd prefer; given the unknown memory usage, perhaps it is best. Also, given the complexity of the implementation (good job!), perhaps this should be upstreamed to dask/dask.
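For reference, a minimal sketch of this apply_ufunc route (function names are mine; NaN handling omitted for brevity):

```python
import numpy as np
import xarray as xr

def _nunique_np(a):
    # Count uniques along the last axis (apply_ufunc moves the core dim there).
    s = np.sort(a, axis=-1)
    return np.not_equal(s[..., :-1], s[..., 1:]).sum(axis=-1) + 1

def nunique(obj, dim):
    return xr.apply_ufunc(
        _nunique_np,
        obj,
        input_core_dims=[[dim]],
        dask="parallelized",  # applies the numpy version blockwise over dask arrays
        output_dtypes=[np.intp],
    )
```

On the Dataset example earlier in the thread, `nunique(da1, "x")` would give `[2, 2, 2]` along y.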

For relevant chunks of data, combine the arrays of uniques across the relevant cells by applying xp.concatenate, then apply a 1d version of the stack-exchange method. For the 1d case I tried a few different approaches (e.g. using set unions instead of array concatenation), as well as custom sorting algorithms, but the vanilla sort/diff method was fastest. We're probably benefiting from the component arrays already being sorted, and from the highly optimized np.sort.

The arrays are probably best; a lot of work goes into optimizing np.sort.

You want the chunks to be as large as possible, particularly in the reduction dimension; I've attached a notebook for exploring this.

Yes, exactly.

Thank you again for your time, and to the xarray team for an amazing package!

Thanks for taking the time to contribute!

@eshort0401 (Author)
Thanks again @dcherian for the fast response!

Our other option here is to use apply_ufunc(..., dask="parallelized") this will enforce a single chunk along the reduction axis and apply the numpy version blockwise. This would be a fine approach too, if you'd prefer.

My thinking was that if the user tries to reduce the whole array, enforcing a single chunk along the reduction axis will try to load the whole array into memory.

Also given the complexity of implementation (good job!) perhaps this should be upstreamed to dask/dask.

OK! Is there any more work you'd like me to do on the present xarray PR?



Development

Successfully merging this pull request may close these issues.

Add nunique reduction for number of unique values
