Accept pandas Series/ExtensionArray for Data; lift pandas<3 cap #1469
Open
rly wants to merge 8 commits into
…pat) Pandas 3.0 makes PyArrow-backed strings the default for DataFrame string columns, so df['col'].values is now ArrowStringArray and constructing VectorData(data=...) fails type validation. Add pd.Series and pandas.api.extensions.ExtensionArray to the array_data macro and coerce to numpy at the Data construction boundary so every Data subclass picks up the fix without per-class changes. Reject pd.NA/NaN with an informative TypeError (HDF5 vlen-string writes already crash on these) and reject IntegerArray/BooleanArray/FloatingArray to avoid silent dtype widening on .to_numpy(). Lift the pandas<3 cap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##              dev    #1469      +/-   ##
==========================================
+ Coverage   93.22%   93.23%   +0.01%
==========================================
  Files          41       41
  Lines       10224    10242      +18
  Branches     2109     2114       +5
==========================================
+ Hits         9531     9549      +18
  Misses        417      417
  Partials      276      276
# Conflicts: # src/hdmf/utils.py
Drop the IntegerArray/BooleanArray/FloatingArray rejection in coerce_pandas_data. For inputs without missing values these convert losslessly to their natural numpy dtype, and the missing-values guard already rejects the only case where .to_numpy() would change the dtype. This also makes masked-nullable inputs behave like arrow-backed nullable inputs, which already passed through. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
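The dtype change this commit message refers to can be seen with a small sketch. It uses only public pandas calls; `has_missing` is a hypothetical helper named here for illustration and is not the guard used in the PR.

```python
import numpy as np
import pandas as pd

# A nullable integer array that contains a missing value.
with_na = pd.array([1, None, 3], dtype="Int64")

# The default conversion cannot represent pd.NA in an int64 buffer,
# so the resulting numpy array no longer has dtype int64.
widened = with_na.to_numpy()
print(widened.dtype)


def has_missing(values):
    """Hypothetical check: True if any element is pd.NA / NaN / None."""
    return bool(pd.isna(values).any())


print(has_missing(with_na))                             # the NA-bearing case
print(has_missing(pd.array([1, 2, 3], dtype="Int64")))  # the lossless case
```

Rejecting the NA-bearing case up front is what makes it safe to accept the remaining nullable inputs.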
The array_data docval macro now lists pandas ExtensionArray, so autodoc-generated signatures reference pandas.ExtensionArray. pandas documents this class as pandas.api.extensions.ExtensionArray, so the short name has no intersphinx target and sphinx-build -W fails. Ignore it the same way as the other unresolved external classes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Conflicts: # CHANGELOG.md
coerce_pandas_data converted nullable masked arrays (Int64, boolean, Float64) via to_numpy()/np.asarray(), which returns an object array on pandas < 2.2 and the native dtype only on pandas >= 2.2. Convert through the dtype's backing numpy_dtype instead, so the result keeps its native dtype (int64, bool, float64) on the full supported pandas range, including the 1.4 lower bound exercised by the minimum CI env. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
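A minimal sketch of the conversion described in this commit, assuming the input is a masked nullable array (`Int64`, `boolean`, `Float64`). The name `masked_to_native` is hypothetical and the body is an illustration, not the code in `coerce_pandas_data`.

```python
import numpy as np
import pandas as pd


def masked_to_native(arr):
    """Hypothetical helper: convert a masked nullable array to its backing numpy dtype."""
    # Missing values cannot be stored in the native dtype, so refuse them first.
    if arr.isna().any():
        raise TypeError("input contains missing values; fill them before converting")
    # Ask for the backing dtype explicitly instead of relying on the
    # version-dependent default of to_numpy().
    return arr.to_numpy(dtype=arr.dtype.numpy_dtype)


ints = masked_to_native(pd.array([1, 2, 3], dtype="Int64"))
flags = masked_to_native(pd.array([True, False], dtype="boolean"))
floats = masked_to_native(pd.array([0.5, 1.5], dtype="Float64"))
print(ints.dtype, flags.dtype, floats.dtype)
```

Passing the dtype explicitly keeps the result independent of which pandas release is installed.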
Summary
- Accept `pd.Series` and `pandas.api.extensions.ExtensionArray` (incl. `StringArray`/`ArrowStringArray`) as `data` in `Data` and its subclasses, normalizing to numpy at the `Data.__init__`/`Data.extend` boundary so every subclass (`VectorData`, `VectorIndex`, `ScratchData`, `ElementIdentifiers`, …) picks up the fix without per-class changes.
- Reject `pd.NA`/`NaN`-bearing pandas input with an informative `TypeError` (it would otherwise crash at HDF5 vlen-string write time). Inputs without missing values convert to their natural numpy dtype.
- Lift the `pandas<3` cap from `pyproject.toml`.

Fixes #1384.
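The boundary normalization summarized above could look roughly like the following. This is a sketch under stated assumptions: `coerce_pandas_sketch` is a hypothetical stand-in, not the PR's `coerce_pandas_data`, and it skips the dtype-specific handling the PR adds for masked nullable arrays.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray


def coerce_pandas_sketch(data):
    """Hypothetical stand-in for the coercion step; not the PR's implementation."""
    # Anything that is not a pandas container passes through untouched.
    if not isinstance(data, (pd.Series, ExtensionArray)):
        return data
    # Refuse missing values early, with a message that names the cause.
    if pd.isna(data).any():
        raise TypeError(
            "pandas input contains missing values (pd.NA/NaN); "
            "fill them with a sentinel before passing the data"
        )
    # Normalize to a plain numpy array.
    return np.asarray(data)


names = coerce_pandas_sketch(pd.Series(["a", "b"]))
print(type(names).__name__, list(names))
```

Doing this once where the data enters the container is what lets every subclass benefit without its own pandas handling.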
Why
Pandas 3.0 makes PyArrow-backed strings the default for all DataFrame string columns. `df['col'].values` is now `ArrowStringArray`, so `VectorData(name=..., data=df['col'].values)` (and any other typical user pattern that hands HDMF a string column) now fails docval type validation. Centralizing the fix at the `Data` construction boundary means `VectorData`, `add_unit`, `add_electrode`, `from_dataframe`, etc. all keep working with no further changes.

Behavior
- `ArrowStringArray`, `StringArray`, `pd.Series` (any backing dtype), `pd.Categorical`, and nullable numeric/boolean dtypes (`Int64`, `boolean`, `Float64`) without missing values → converted to `np.ndarray` at their natural numpy dtype.
- Pandas input containing `pd.NA` or `NaN` → `TypeError` pointing at the missing-values cause and asking the user to fill with a sentinel. (This covers nullable dtypes with NAs, the only case where `.to_numpy()` would silently change the dtype.)

Verification
- `DynamicTable.from_dataframe(df=...)` with pandas 3.0.2 default string columns works end-to-end.

Test plan
- Tests for `coerce_pandas_data` covering `StringArray`, `ArrowStringArray`, plain numeric `Series`, `Categorical`, no-NA nullable int/bool, and NA-bearing inputs.
- Tests through `VectorData` for both `Series` and `df.values` paths.

🤖 Generated with Claude Code
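The Behavior notes ask users to fill missing values with a sentinel before handing data over. A short sketch of what that might look like on the user side; the sentinel values (`""` and `-1`) are arbitrary choices for illustration, not values the PR prescribes.

```python
import numpy as np
import pandas as pd

# Columns with missing entries, as they might come out of a DataFrame.
labels = pd.Series(["a", None, "c"], dtype="string")
counts = pd.Series([1, None, 3], dtype="Int64")

# Replace missing entries with explicit sentinels chosen by the user.
labels_filled = labels.fillna("")
counts_filled = counts.fillna(-1)

# With no missing values left, conversion to numpy is unambiguous.
print(list(np.asarray(labels_filled)))
print(list(np.asarray(counts_filled)))
```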