GH-43683: [Python] Use pandas StringDtype when enabled (pandas 3+) #44195

jorisvandenbossche · 2024-09-23T15:14:45Z

Rationale for this change

With pandas' PDEP-14 proposal, pandas is planning to introduce a default string dtype in pandas 3.0 (instead of the current object dtype).

This will become the default in pandas 3.0, and can be enabled with an option in the upcoming pandas 2.3 (pd.options.future.infer_string = True). To prepare for that, we should start using that string dtype in to_pandas() conversions when that option is enabled.

What changes are included in this PR?

If pandas >= 3.0 is used or the pandas option is enabled, ensure that to_pandas() calls use the default string dtype of pandas for string-like columns (string, large_string, string_view)

Are these changes tested?

It is tested in the pandas-nightly crossbow build.

One failure remains, caused by a bug on the pandas side (pandas-dev/pandas#59879).

Are there any user-facing changes?

This PR includes breaking changes to public APIs. Depending on the version of pandas, to_pandas() will change to use pandas' string dtype instead of object dtype. This is a breaking user-facing change, but essentially just following the equivalent change in default dtype on the pandas side.

GitHub Issue: [Python] Support pandas future default string dtype #43683


jorisvandenbossche · 2024-11-08T14:43:32Z

@github-actions crossbow submit test-conda-python-3.11-pandas-nightly-numpy-nightly

github-actions · 2024-11-08T14:45:48Z

Revision: 56b61f2

Submitted crossbow builds: ursacomputing/crossbow @ actions-c85b742ef7

Task	Status
test-conda-python-3.11-pandas-nightly-numpy-nightly

WillAyd · 2024-11-08T15:01:35Z

python/pyarrow/tests/test_pandas.py

+    e1 = pd.DataFrame(
+        {'a': a_values},
+        index=pd.RangeIndex(0, 8, step=2, name='qux'),
+        columns=pd.Index(['a'], dtype=object)


Does the column type created with the dict argument differ from this?

This test is specifically using old metadata that specifies the dtype of the columns is object dtype, and then pyarrow tries to restore it that way.

The question is whether we should do that, though. Every file written from a pandas DataFrame before pandas 3.0 will have that metadata, so maybe we should specifically ignore object dtype here when the inferred type shows that the labels are all strings, so that users consistently get a columns Index using str dtype.

Hmm that's tricky but I think going with the str data type as you suggested is better; I would expect that is a better UX in over 99% of instances

OK, changed this to ensure we actually use a str dtype columns Index, even if the pandas metadata of the pyarrow table says that the original table was using object dtype.

This ensures that all existing files will use the default str dtype for the columns (with pandas >= 3). The trade-off is that if you explicitly want object dtype with strings, it will no longer roundtrip (pandas -> pyarrow/parquet -> pandas).

jorisvandenbossche · 2024-11-13T09:05:07Z

@github-actions crossbow submit test-conda-python-3.11-pandas-nightly-numpy-nightly

github-actions · 2024-11-13T09:07:22Z

Revision: 84b8234

Submitted crossbow builds: ursacomputing/crossbow @ actions-ac3103d3ba

Task	Status
test-conda-python-3.11-pandas-nightly-numpy-nightly

jorisvandenbossche · 2024-11-13T09:43:53Z

@github-actions crossbow submit test-conda-python-3.11-pandas-nightly-numpy-nightly

github-actions · 2024-11-13T09:46:09Z

Revision: e5db09f

Submitted crossbow builds: ursacomputing/crossbow @ actions-3c389cd49e

Task	Status
test-conda-python-3.11-pandas-nightly-numpy-nightly

python/pyarrow/tests/test_compute.py


jorisvandenbossche · 2025-01-03T11:10:35Z

@github-actions crossbow submit -g python

github-actions · 2025-01-03T11:13:20Z

Revision: 940b64d

Submitted crossbow builds: ursacomputing/crossbow @ actions-c96266afd8

Task	Status
example-python-minimal-build-fedora-conda
example-python-minimal-build-ubuntu-venv
test-conda-python-3.10
test-conda-python-3.10-cython2
test-conda-python-3.10-hdfs-2.9.2
test-conda-python-3.10-hdfs-3.2.1
test-conda-python-3.10-pandas-latest-numpy-latest
test-conda-python-3.10-substrait
test-conda-python-3.11
test-conda-python-3.11-dask-latest
test-conda-python-3.11-dask-upstream_devel
test-conda-python-3.11-hypothesis
test-conda-python-3.11-pandas-latest-numpy-1.26
test-conda-python-3.11-pandas-latest-numpy-latest
test-conda-python-3.11-pandas-nightly-numpy-nightly
test-conda-python-3.11-pandas-upstream_devel-numpy-nightly
test-conda-python-3.11-spark-master
test-conda-python-3.12
test-conda-python-3.12-cpython-debug
test-conda-python-3.13
test-conda-python-3.9
test-conda-python-3.9-pandas-1.1.3-numpy-1.19.5
test-conda-python-emscripten
test-cuda-python-ubuntu-22.04-cuda-11.7.1
test-debian-12-python-3-amd64
test-debian-12-python-3-i386
test-fedora-39-python-3
test-ubuntu-22.04-python-3
test-ubuntu-22.04-python-313-freethreading
test-ubuntu-24.04-python-3

WillAyd

Sorry @raulcd for missing your ping! I am also not very familiar with this part of the code so I have a few questions, although I generally trust Joris knows what he is doing here :-)

WillAyd · 2025-01-03T14:23:42Z

python/pyarrow/src/arrow/python/arrow_to_pandas.cc

@@ -2523,7 +2523,8 @@ Status ConvertCategoricals(const PandasOptions& options, ChunkedArrayVector* arr
  }
  if (options.strings_to_categorical) {
    for (int i = 0; i < static_cast<int>(arrays->size()); i++) {
-      if (is_base_binary_like((*arrays)[i]->type()->id())) {
+      if (is_base_binary_like((*arrays)[i]->type()->id()) ||


The binary_view changes are tangential to pandas 3.x right? I wonder if they shouldn't be done in their own PR

Somewhat tangential, yes, in that it is a useful change regardless of the other changes here. I am also not entirely sure there is a specific test for this.

Happy to move out to a separate PR, although I then would like both to get merged for 19.0

Yea I would prefer to separate out so we can analyze test coverage too. I suppose that would also be helpful if these changes end up going through to different releases (although that's not the aim)

OK, it turns out this does not actually affect the tests here, so it is good to move it to a separate PR with actual test coverage: #45176

I've merged the related issue and marked as 19.0.0

WillAyd · 2025-01-03T14:27:22Z

python/pyarrow/pandas_compat.py

-    if name is not None and not isinstance(name, str):
+    if (
+        name is not None
+        and not (isinstance(name, float) and np.isnan(name))


Does this mean that np.nan is now a valid column name for the string data type?

Yes. This check essentially existed to support missing values in the column names (restricted to None), so also allowing np.nan keeps roughly the same behaviour when switching from object dtype to string dtype.
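The relaxed check discussed in this thread can be sketched in isolation. The diff above is truncated, so this is an assumption about its shape: the helper name `is_acceptable_column_name` is hypothetical, the sketch assumes the rest of the condition keeps the original `isinstance(name, str)` test, and it uses `math.isnan` from the standard library in place of `np.isnan`.

```python
import math


def is_acceptable_column_name(name):
    """Hypothetical helper mirroring the relaxed name check.

    Accepts None and NaN as "missing" names, and plain strings;
    everything else is rejected.
    """
    if name is None:
        return True
    # np.nan is an ordinary Python float, so math.isnan covers it.
    if isinstance(name, float) and math.isnan(name):
        return True
    return isinstance(name, str)


for candidate in (None, float("nan"), "a", 1, 1.5, b"a"):
    print(repr(candidate), is_acceptable_column_name(candidate))
```

The `isinstance(name, float)` guard comes first because `math.isnan` (like `np.isnan`) raises on non-numeric inputs such as strings.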


…44195)

### Rationale for this change

With pandas' [PDEP-14](https://pandas.pydata.org/pdeps/0014-string-dtype.html) proposal, pandas is planning to introduce a default string dtype in pandas 3.0 (instead of the current object dtype).

This will become the default in pandas 3.0, and can be enabled with an option in the upcoming pandas 2.3 (`pd.options.future.infer_string = True`). To prepare for that, we should start using that string dtype in `to_pandas()` conversions when that option is enabled.

### What changes are included in this PR?

- If pandas >= 3.0 is used or the pandas option is enabled, ensure that `to_pandas()` calls use the default string dtype of pandas for string-like columns (string, large_string, string_view)

### Are these changes tested?

It is tested in the pandas-nightly crossbow build.

There is still one failure that is because of a bug on the pandas side (pandas-dev/pandas#59879)

### Are there any user-facing changes?

**This PR includes breaking changes to public APIs.** Depending on the version of pandas, `to_pandas()` will change to use pandas' string dtype instead of object dtype. This is a breaking user-facing change, but essentially just following the equivalent change in default dtype on the pandas side.

* GitHub Issue: #43683

Lead-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>

conbench-apache-arrow · 2025-01-15T11:52:55Z

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 5181c24.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 13 possible false positives for unstable benchmarks that are known to sometimes produce them.

apacheGH-43683: [Python] Use pandas StringDtype when enabled (pandas 3+)

ea76574

github-actions bot added the Component: Python and awaiting committer review labels Sep 23, 2024

jorisvandenbossche added 5 commits November 6, 2024 17:53

Merge remote-tracking branch 'upstream/main' into apachegh-43683-pand…

3e17983

…as-string-dtype

test on CI

e0b2958

honor strings_to_categorical

8a6d6c3

more test fixes

11d2691

honor categories keyword

56b61f2

WillAyd reviewed Nov 8, 2024


github-actions bot added the awaiting changes label and removed the awaiting committer review label Nov 8, 2024

jorisvandenbossche added 2 commits November 9, 2024 17:52

propagate env variable in docker image

fdd6af3

ignore pandas_metadata for string dtype in case of dictionary column

84b8234

github-actions bot added the awaiting change review label and removed the awaiting changes label Nov 13, 2024

jorisvandenbossche added 2 commits November 13, 2024 10:27

keep columns Index as string dtype even if metadata says object

136b091

fix compute / feather tests

e5db09f

jorisvandenbossche marked this pull request as ready for review November 13, 2024 09:43

jorisvandenbossche requested review from assignUser, jonkeane, kou and raulcd as code owners November 13, 2024 09:43

reformat to avoid diff

ec750bd

github-actions bot removed the awaiting change review label Nov 13, 2024

github-actions bot added the awaiting merge and awaiting changes labels and removed the awaiting merge label Dec 16, 2024

amoeba reviewed Jan 3, 2025


python/pyarrow/tests/test_compute.py (resolved)

Merge remote-tracking branch 'upstream/main' into apachegh-43683-pand…

940b64d

…as-string-dtype

github-actions bot added the awaiting change review label and removed the awaiting changes label Jan 3, 2025

github-actions bot added the awaiting changes label and removed the awaiting change review label Jan 3, 2025

WillAyd reviewed Jan 3, 2025


jorisvandenbossche added 2 commits January 5, 2025 12:05

remove strings_to_categorical for string_view changes

70a2c3c

Merge remote-tracking branch 'upstream/main' into apachegh-43683-pand…

ea4cbf4

…as-string-dtype

github-actions bot added the awaiting change review and awaiting changes labels and removed the awaiting changes and awaiting change review labels Jan 5, 2025

jorisvandenbossche mentioned this pull request Jan 5, 2025

API (string dtype): comparisons between different string classes pandas-dev/pandas#60639

Closed

Merge remote-tracking branch 'upstream/main' into apachegh-43683-pand…

a59a2a2

…as-string-dtype

github-actions bot added the awaiting change review label and removed the awaiting changes label Jan 9, 2025

amoeba mentioned this pull request Jan 9, 2025

WIP: Testing-only PR to check maint-19.0.0 status #45194

Closed

jorisvandenbossche merged commit 5181c24 into apache:main Jan 9, 2025
36 checks passed

jorisvandenbossche removed the awaiting change review label Jan 9, 2025

jorisvandenbossche mentioned this pull request Jan 9, 2025

[Python] Support pandas future default string dtype #43683

Closed

jorisvandenbossche deleted the gh-43683-pandas-string-dtype branch January 9, 2025 19:22
