Accept pandas Series/ExtensionArray for Data; lift pandas<3 cap #1469

Open
rly wants to merge 8 commits into dev from fix/pandas-3-compat

@rly rly commented May 4, 2026

Contributor

Summary

  • Accept pd.Series and pandas.api.extensions.ExtensionArray (incl. StringArray/ArrowStringArray) as data in Data and its subclasses, normalizing to numpy at the Data.__init__/Data.extend boundary so every subclass (VectorData, VectorIndex, ScratchData, ElementIdentifiers, …) picks up the fix without per-class changes.
  • Reject pd.NA/NaN-bearing pandas input with an informative TypeError (it would otherwise crash at HDF5 vlen-string write time). Inputs without missing values convert to their natural numpy dtype.
  • Lift the pandas<3 cap from pyproject.toml.

Fixes #1384.

Why

Pandas 3.0 makes PyArrow-backed strings the default for all DataFrame string columns. df['col'].values is now ArrowStringArray, so VectorData(name=..., data=df['col'].values) (and any other typical user pattern that hands HDMF a string column) now fails docval type validation. Centralizing the fix at the Data construction boundary means VectorData, add_unit, add_electrode, from_dataframe, etc. all keep working with no further changes.
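
The pattern described above can be reproduced without HDMF. The following is a minimal sketch, assuming only pandas and numpy are installed. It requests the `"string"` dtype explicitly so the column is extension-backed on older pandas versions as well; this is an illustration choice, not something the PR does.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray

# Explicit "string" dtype so the column is extension-backed on any pandas version.
df = pd.DataFrame({"col": pd.Series(["a", "b", "c"], dtype="string")})

values = df["col"].values

# An extension-backed column does not hand back a numpy array ...
is_ndarray = isinstance(values, np.ndarray)
# ... it hands back a pandas ExtensionArray (StringArray or ArrowStringArray).
is_extension_array = isinstance(values, ExtensionArray)

# Converting explicitly yields a plain numpy array with the same elements.
converted = np.asarray(values)
```

A type check that only admits `np.ndarray`, lists, and similar containers would therefore reject `values` but accept `converted`.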

Behavior

  • ArrowStringArray, StringArray, pd.Series (any backing dtype), pd.Categorical, and nullable numeric/boolean dtypes (Int64, boolean, Float64) without missing values → converted to np.ndarray at their natural numpy dtype.
  • pandas input containing pd.NA or NaN → TypeError that points at the missing-values cause and asks the user to fill with a sentinel. (This covers nullable dtypes with NAs, the only case where .to_numpy() would silently change the dtype.)
  • Non-pandas inputs are pass-through; no behavior change for existing callers.
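
The three behaviors listed above could be sketched roughly as follows. `coerce_pandas_sketch` is a hypothetical stand-in written for this illustration; it is not the PR's `coerce_pandas_data`, whose actual code is not shown here. The error message wording is also an assumption, and dtype preservation for nullable masked arrays on older pandas is deliberately left out.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray


def coerce_pandas_sketch(data):
    """Hypothetical illustration of the described behavior, not the real implementation."""
    # Non-pandas inputs pass through untouched.
    if not isinstance(data, (pd.Series, ExtensionArray)):
        return data
    # Reject missing values up front with an explanatory error.
    if np.asarray(pd.isna(data)).any():
        raise TypeError(
            "pandas input contains missing values (pd.NA/NaN); "
            "fill them with a sentinel value before passing the data in"
        )
    # Assumption: a plain np.asarray conversion is enough for this sketch.
    return np.asarray(data)


strings = coerce_pandas_sketch(pd.Series(["a", "b"], dtype="string"))
numbers = coerce_pandas_sketch(pd.Series([1.5, 2.5]))
categories = coerce_pandas_sketch(pd.Categorical(["x", "y", "x"]))
```

Passing `pd.Series(["a", None], dtype="string")` to this sketch raises the `TypeError` instead of returning an array.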

Verification

Test plan

  • Unit tests for coerce_pandas_data covering StringArray, ArrowStringArray, plain numeric Series, Categorical, no-NA nullable int/bool, and NA-bearing inputs.
  • End-to-end test through VectorData for both Series and df.values paths.
  • Manual HDF5 roundtrip with pandas 3.0.
  • CI passes on Python 3.10–3.13 with pandas 1.4 (lower bound), pandas 2.x, and pandas 3.x.

🤖 Generated with Claude Code

…pat)

Pandas 3.0 makes PyArrow-backed strings the default for DataFrame string
columns, so df['col'].values is now ArrowStringArray and constructing
VectorData(data=...) fails type validation. Add pd.Series and
pandas.api.extensions.ExtensionArray to the array_data macro and coerce
to numpy at the Data construction boundary so every Data subclass picks
up the fix without per-class changes.

Reject pd.NA/NaN with an informative TypeError (HDF5 vlen-string writes
already crash on these) and reject IntegerArray/BooleanArray/FloatingArray
to avoid silent dtype widening on .to_numpy(). Lift the pandas<3 cap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov Bot commented May 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.23%. Comparing base (af7879a) to head (849c1d6).

Additional details and impacted files
@@            Coverage Diff             @@
##              dev    #1469      +/-   ##
==========================================
+ Coverage   93.22%   93.23%   +0.01%     
==========================================
  Files          41       41              
  Lines       10224    10242      +18     
  Branches     2109     2114       +5     
==========================================
+ Hits         9531     9549      +18     
  Misses        417      417              
  Partials      276      276              


@rly rly marked this pull request as draft May 4, 2026 19:29
rly and others added 3 commits June 23, 2026 09:19
# Conflicts:
#	src/hdmf/utils.py
Drop the IntegerArray/BooleanArray/FloatingArray rejection in
coerce_pandas_data. For inputs without missing values these convert
losslessly to their natural numpy dtype, and the missing-values guard
already rejects the only case where .to_numpy() would change the dtype.
This also makes masked-nullable inputs behave like arrow-backed nullable
inputs, which already passed through.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The array_data docval macro now lists pandas ExtensionArray, so
autodoc-generated signatures reference pandas.ExtensionArray. pandas
documents this class as pandas.api.extensions.ExtensionArray, so the
short name has no intersphinx target and sphinx-build -W fails. Ignore
it the same way as the other unresolved external classes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
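
For readers unfamiliar with the mechanism in the commit above, Sphinx's `nitpick_ignore` option is one standard way to silence an unresolved cross-reference. The fragment below is a sketch of what such an entry could look like in a `conf.py`. Whether the project uses `nitpick_ignore` for this, and the exact target string, are assumptions based on the commit message.

```python
# Sketch of a Sphinx conf.py fragment; not copied from the repository.

# Assumption: nitpicky mode is enabled, so unresolved references become
# warnings, which `sphinx-build -W` then turns into errors.
nitpicky = True

nitpick_ignore = [
    # (role type, target) pairs that Sphinx should not warn about.
    # Assumption: the short name from the commit message is the target.
    ("py:class", "pandas.ExtensionArray"),
]
```
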
@rly rly marked this pull request as ready for review June 23, 2026 22:34
rly and others added 3 commits June 23, 2026 15:36
coerce_pandas_data converted nullable masked arrays (Int64, boolean,
Float64) via to_numpy()/np.asarray(), which returns an object array on
pandas < 2.2 and the native dtype only on pandas >= 2.2. Convert through
the dtype's backing numpy_dtype instead, so the result keeps its native
dtype (int64, bool, float64) on the full supported pandas range,
including the 1.4 lower bound exercised by the minimum CI env.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
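
The conversion described in the commit above can be sketched as follows, assuming inputs without missing values. `to_native_numpy` is a hypothetical helper named for this illustration; the PR's own code may be structured differently.

```python
import numpy as np
import pandas as pd


def to_native_numpy(arr):
    """Hypothetical helper: convert a no-NA nullable masked array to its backing numpy dtype."""
    # Masked dtypes (Int64, boolean, Float64) expose their backing dtype as `numpy_dtype`.
    # Passing it explicitly avoids depending on the version-specific default of to_numpy().
    return arr.to_numpy(dtype=arr.dtype.numpy_dtype)


ints = to_native_numpy(pd.array([1, 2, 3], dtype="Int64"))
bools = to_native_numpy(pd.array([True, False], dtype="boolean"))
floats = to_native_numpy(pd.array([0.5, 1.5], dtype="Float64"))
```

Requesting the dtype explicitly means the result no longer depends on the pandas version in use.
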
@rly rly requested a review from oruebel June 23, 2026 23:47

Development

Successfully merging this pull request may close these issues.

Pandas 3.0 String Type Compatibility Breaking HDMF Data Ingestion
