Regression in DataArrays created from Pandas #10301

richard-berg · 2025-05-09T06:21:20Z

What happened?

Given:

index1 = np.array([1, 2, 3])
index2 = np.array([1, 2, 4])
srs = pd.Series(index=index1, data=1).convert_dtypes()
arr = srs.to_xarray()

Now consider:

>>> arr.reindex(index=index2)

In xarray 2023.1.0 this gave a reasonable (if weakly-typed) result.

<xarray.DataArray (index: 3)>
array([1, 1, nan], dtype=object)
Coordinates:
  * index    (index) int64 1 2 4

While upgrading to xarray 2025.3.x + pandas 2.x, my colleagues found it now raises:

TypeError: Cannot interpret 'Int64Dtype()' as a data type

What did you expect to happen?

Ideally, the result would be:

<xarray.DataArray (index: 3)> Size: 27B
PandasExtensionArray(array=<IntegerArray>
[1, 1, <NA>]
Length: 3, dtype: Int64)
Coordinates:
  * index    (index) <U1 12B '1' '2' '4'
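For comparison, the pandas-only equivalent of this reindex keeps the nullable integer dtype and fills the missing label with `<NA>`. The sketch below uses only pandas and NumPy; it illustrates the behavior the expected output above is modeled on and says nothing about how xarray implements `reindex`.

```python
import numpy as np
import pandas as pd

index1 = np.array([1, 2, 3])
index2 = np.array([1, 2, 4])

# convert_dtypes() turns the int64 Series into the nullable Int64 extension dtype
srs = pd.Series(index=index1, data=1).convert_dtypes()

# Reindexing in pandas keeps Int64 and marks the new label (4) as missing
reindexed = srs.reindex(index2)
print(reindexed.dtype)
print(reindexed.isna().tolist())
```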

Minimal Complete Verifiable Example

import numpy as np
import pandas as pd
import xarray as xr
index1 = np.array([1, 2, 3])
index2 = np.array([1, 2, 4])
srs = pd.Series(index=index1, data=1).convert_dtypes()
arr = srs.to_xarray()
arr.reindex(index=index2)

MVCE confirmation

Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
Complete example — the example is self-contained, including all data and the text of any traceback.
Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
New issue — a search of GitHub Issues suggests this is not a duplicate.
Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Anything else we need to know?

The difference is that arr.dtype is now pd.Int64Dtype() rather than np.dtype("object"), thanks to #8723. While arguably an improvement in typing, the xarray core doesn't seem ready to handle the former. In this case, core.dtypes.maybe_promote() is blindly passing a Pandas dtype to np.issubdtype, oops.
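The `np.issubdtype` failure can be reproduced without xarray. The snippet below is a minimal sketch: `is_extension_dtype` is a hypothetical helper name, not a function in xarray, and it only shows the kind of guard that would have to run before a dtype is handed to NumPy.

```python
import numpy as np
import pandas as pd

ext_dtype = pd.Int64Dtype()


def is_extension_dtype(dtype) -> bool:
    """Hypothetical guard: True for pandas extension dtypes, False for NumPy dtypes."""
    return isinstance(dtype, pd.api.extensions.ExtensionDtype)


# NumPy tries to coerce its first argument with np.dtype(...), which a pandas
# extension dtype does not support, so a TypeError is raised.
try:
    np.issubdtype(ext_dtype, np.floating)
    raised = False
except TypeError as err:
    raised = True
    print(err)

# Checking first avoids the failing call entirely.
if not is_extension_dtype(ext_dtype):
    np.issubdtype(ext_dtype, np.floating)
```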

Patching this immediate issue is more revealing: reindex then fails when duck_array_ops.where(condition, x, y) tries to coerce x & y to a common dtype. The new extension-array code in as_shared_dtype is not at all general: when y is a scalar (the fill_value from the reindex operation), it simply gives up.

Once I understood the cause of the reindex issue above, producing more -- and much more worrisome -- failures was trivial:

>>> arr + 5

TypeError: unsupported operand type(s) for +: 'PandasExtensionArray' and 'int'

>>> np.add(arr, 5)

TypeError: 'PandasExtensionArray' object is not callable

>>> arr.fillna(0)

AttributeError: 'int' object has no attribute 'dtype'

I'd venture to say that the pandas df.to_xarray() / srs.to_xarray() methods have become foot-guns, bordering on unusable, now that pandas 2.x has reimplemented all of its native datatypes on top of ExtensionArray / ExtensionDtype.

The good news is I have a fix. The bad news is it's pretty invasive, needing careful oversight from someone who actually knows what they're doing. (Before this week I'd never used xarray, nor looked at the numpy / pandas source code.)

For now I might recommend excluding ALL numeric dtypes from being promoted to duck arrays, similar to what #9042 did for datetimes. (Basically everything except Categoricals, which seem to be the one extension type with good coverage in the xarray test suite, and which don't support the vast majority of ufuncs regardless.) That would at least allow people to safely continue using to_xarray() on modern versions of pandas, though you'd lose all the speed & type safety that @ilan-gold worked to achieve in 2024.5 & onward.
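Until something like that lands, a similar effect can be approximated on the user side by converting numeric extension dtypes back to NumPy dtypes before calling `to_xarray()`. The helper below is a sketch under stated assumptions: `to_numpy_backed` is a hypothetical name, it relies on the dtype exposing `numpy_dtype` (true for pandas masked dtypes such as `Int64`), and it falls back to `float64` when missing values are present, which loses the integer type.

```python
import numpy as np
import pandas as pd


def to_numpy_backed(series: pd.Series) -> pd.Series:
    """Hypothetical helper: demote numeric extension dtypes to NumPy dtypes."""
    dtype = series.dtype
    is_ext = isinstance(dtype, pd.api.extensions.ExtensionDtype)
    if not (is_ext and pd.api.types.is_numeric_dtype(dtype)):
        # Categoricals, intervals, and plain NumPy dtypes pass through untouched
        return series
    if series.isna().any():
        # NumPy integers cannot hold missing values, so fall back to float + NaN
        return series.astype("float64")
    # Assumption: the extension dtype exposes its NumPy counterpart as numpy_dtype
    return series.astype(dtype.numpy_dtype)


srs = pd.Series(index=np.array([1, 2, 3]), data=1).convert_dtypes()
plain = to_numpy_backed(srs)  # int64 rather than Int64
```

Calling `to_numpy_backed(srs).to_xarray()` would then hand xarray an ordinary NumPy-backed Series.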

Environment

INSTALLED VERSIONS

commit: None
python: 3.12.10 | packaged by conda-forge | (main, Apr 10 2025, 22:21:13) [GCC 13.3.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-372.32.1.el8_6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.2
libnetcdf: None

xarray: 2025.3.1
pandas: 2.2.3
numpy: 1.26.4
scipy: 1.15.2
netCDF4: None
pydap: None
h5netcdf: 1.6.1
h5py: 3.9.0
zarr: 3.0.6
cftime: None
nc_time_axis: None
iris: None
bottleneck: 1.4.2
dask: 2025.3.0
distributed: 2025.3.0
matplotlib: 3.10.1
cartopy: None
seaborn: 0.13.2
numbagg: 0.9.0
fsspec: 2024.9.0
cupy: 13.4.0
pint: None
sparse: 0.16.0
flox: None
numpy_groupies: None
setuptools: 78.1.0
pip: 25.0.1
conda: None
pytest: 8.3.5
mypy: 1.15.0
IPython: 8.35.0
sphinx: None


ilan-gold · 2025-05-09T11:25:11Z

@richard-berg thanks for the issue.

import numpy as np
import pandas as pd
import xarray as xr
index1 = np.array([1, 2, 3])
index2 = np.array([1, 2, 4])
srs = pd.Series(index=index1, data=1).convert_dtypes()
arr = srs.to_xarray()
arr + 5

works for me on main, maybe because of #10278. But I'm not sure it's the same as your example, since you have x there in the addition example.

As for the test coverage, I agree we should increase it. I will add the cases you raise here, but since you appear to be aware of more, I would love more guidance. If you look through my PRs here, it's a bit of whack-a-mole because, while I think I am using relatively sound practices as I go, xarray has a lot of edge cases in its API that I am not familiar with. If you're aware of some, it would be great to handle them.

I would be opposed to somehow going around special casing because now we actually do let through datetimes as well as interval arrays (which are both tested even if it is not immediately obvious). The reason I had to do #9042 was exactly because all of the special casing that existed before was so complex that unraveling it required a massive PR.

So with special casing, we would be looking at categoricals, interval types, and datetimes passed through and only numerics excluded, until another type comes along. So I'd somewhat rather be very clear here that everything passes through.

ilan-gold · 2025-05-09T11:48:24Z

P.S. I see:

TypeError: 'PandasExtensionArray' object is not callable

And we just made it so that PandasExtensionArray is no longer part of the public API. So hopefully this condition is not something that can be run into.

dcherian · 2025-05-09T13:22:34Z

Whoops, int, float, and string should be converted to the numpy types. Looks like we lost this somehow and aren't testing it :(

ilan-gold · 2025-05-09T13:26:14Z

While I could see floats being a good idea, nullable integers do not exist in numpy:

import numpy as np


In [3]: np.array([np.nan, 1, 2], dtype="int32")

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[3], line 1
----> 1 np.array([np.nan, 1, 2], dtype="int32")

ValueError: cannot convert float NaN to integer

richard-berg · 2025-05-09T13:29:26Z

Thanks for checking, and for the quick fillna patch! FYI I do have a robust fix that handles all of the above in a general way, while preserving strong types (including nullable ints). Just may take a few days to get legal approval for (and work thru the mechanical details of) exfiltrating code off my network.

dcherian · 2025-05-09T13:30:33Z

> nullable integers do not exist in numpy:

I know but this is what we do with numpy masked arrays today.
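As an illustration of the two NumPy-only representations being discussed, the sketch below converts a pandas nullable integer array either to a masked array (integer dtype kept, missing slots masked) or to a float array with NaN. It uses only public pandas and NumPy calls and is not a description of xarray's actual conversion path.

```python
import numpy as np
import pandas as pd

nullable = pd.array([1, 1, None], dtype="Int64")

# Option 1: masked array. Values stay int64; the missing slot is masked out.
mask = nullable.isna()
values = nullable.to_numpy(dtype="int64", na_value=0)  # 0 is a placeholder under the mask
masked = np.ma.masked_array(values, mask=mask)

# Option 2: float array. Missing values become NaN and the integer dtype is lost.
as_float = nullable.to_numpy(dtype="float64", na_value=np.nan)

print(masked)
print(as_float)
```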

Ideally we'd convert to https://github.com/mdhaber/marray in the near future.

Regardless, for now we'd like to enable any "extra" dtypes (categorical, intervals) while using the numpy dtypes as much as we can.

ilan-gold · 2025-05-09T13:33:44Z

> Ideally we'd convert to https://github.com/mdhaber/marray in the near future.

Interesting!

@richard-berg I would be interested to hear why we shouldn't just limit this behavior to from_dataframe, i.e., users who convert a pandas object into an xarray object will lose the type, but going the other way would be disabled. We rely on integer arrays and boolean arrays in https://github.com/scverse/anndata, and projects like https://geopandas.org/en/stable/ rely on geometry extension types. So users constructing xarray objects should be allowed to continue to do that, no? And then they can exit xarray with their types intact?

ilan-gold · 2025-05-09T13:34:25Z

> Regardless, for now we'd like to enable any "extra" dtypes (categorical, intervals) while using the numpy dtypes as much as we can.

+1 here, given the above example of geometry, which is not in pandas core but is widely used in the geospatial community.

ilan-gold · 2025-05-10T17:48:27Z

I should say, @richard-berg, I've been thinking about this since you posted, and I really appreciate your contribution and effort here btw! Looking forward to seeing your PR :)))) I love to see "users" get involved with open source, and wish it was easier given corporate situations sometimes. Thanks a million again!

richard-berg · 2025-05-30T12:42:28Z

@ilan-gold finally got past the firewall!

I see there's been lots of related activity in the last few weeks -- while I catch up on all the recent commits & PR discussions, mind glancing at #10380 to assess which changes are (a) still relevant (b) a reasonable direction to evolve the codebase? No sense slogging thru merge conflicts that'll just need to be backed out after feedback...

ilan-gold · 2025-05-30T12:43:13Z

Looking now!

ilan-gold · 2025-05-30T13:05:48Z

On first glance, I think this is a great PR. I think it's more opinionated and/or thorough than mine #10304, but seems to do roughly the same thing (I think the co. The find_result_type from pandas is awesome, didn't know about it.

It also has some other niceties I really like as well, so in general I'm in favor of this PR (even though I'm not an official maintainer here, although I do appear to be pinged every time something happens with this feature haha). I've been splitting my PRs up to make it clear exactly which changes are needed to fix which issue and why, and that helps with reviewing.

For example, for this original issue around re-indexing, I think only a subset of the PR you opened is needed, but maybe I'm wrong. For example is https://github.com/pydata/xarray/pull/10380/files#diff-ca5c2de2fe6e9e25fbf22bd53e4976c15da74900dfb14deb7e6e87f5377230e3R7292-R7296 relevant to this issue specifically?

But the rest of it is great, at first glance the categorical stuff resolves #10247 which was opened by a contributor to our library: https://github.com/scverse/anndata and is of course super applicable across domains (categoricals are so much faster than strings for operations where they make sense). There should be a link in that PR to the code we have in our codebase for handling the different categoricals right now, so would be great to have that done as well inside xarray.

Thanks so much!

UPDATE: I see now in your description (bad habit of going right to the code) you say clearly that the PR goes beyond this issue. But still would be good to understand what is needed for this bugfix vs. others

richard-berg added the "bug" and "needs triage" labels May 9, 2025

ilan-gold linked a pull request May 9, 2025 that will close this issue

(fix): no fill_value on reindex #10304

Draft

4 tasks

mancellin mentioned this issue May 13, 2025

Cannot export dataset with categorical index in 2025.4.0 #10312

Open

5 tasks

ilan-gold linked a pull request May 19, 2025 that will close this issue

(fix): disallow NumpyExtensionArray #10334

Open

4 tasks

This was referenced May 19, 2025

[Bug] - HorizontalFlowBarrierBase.__repr__ fails with xarray=2025.4.0 Deltares/imod-python#1522

Open

nbytes property errors for variable created from geopandas GeoDataframe with version 2025.04.0 #10342

Closed

richard-berg added a commit to richard-berg/xarray that referenced this issue May 30, 2025

Improve support for pandas Extension Arrays (pydata#10301)

1db86e8

richard-berg mentioned this issue May 30, 2025

Improve support for pandas Extension Arrays (#10301) #10380

Draft

4 tasks
