GH-39914: [pyarrow] Reorder to_pandas extension dtype mapping #44720

bretttully · 2024-11-14T00:01:24Z

Rationale for this change

This addresses a long-standing pandas issue that so far only has some fairly horrible workarounds: complex Arrow types (such as list or struct) do not round-trip to pandas, because the dtype string stored in the pandas metadata cannot be parsed. types_mapper already had the highest priority, but only because it was applied last and overrode whatever had been set before it.

What changes are included in this PR?

Switching the logical ordering means that _pandas_api.pandas_dtype(dtype) no longer needs to be called when the pyarrow backend is used, which resolves the issue for complex dtypes containing list or struct. Reading will likely still fail if the numpy backend is used, but this at least gives a working solution rather than an inability to load the files at all.

Are these changes tested?

Existing tests should pass unchanged, and a new test for the complex type has been added.

Are there any user-facing changes?

This PR contains a "Critical Fix".
This makes pd.read_parquet(..., dtype_backend="pyarrow") work with complex data types, where the metadata that pyarrow adds during DataFrame.to_parquet cannot be parsed back and currently raises an exception. This issue currently prevents the use of pyarrow as the default backend for pandas.

GitHub Issue: [Python][Parquet] Parquet files created from Pandas dataframes with Arrow-backed list columns cannot be read by pd.read_parquet #39914

Addresses pandas-dev/pandas#53011. `types_mapper` always had the highest priority because it overrode what was set before. Switching the logical ordering means that we don't need to call `_pandas_api.pandas_dtype(dtype)` when using the pyarrow backend, which resolves the issue of complex `dtype` with `list` or `struct`.

github-actions · 2024-11-14T00:01:47Z

❌ GitHub issue #53011 could not be retrieved.

github-actions · 2024-11-14T00:05:38Z

⚠️ GitHub issue #39914 has been automatically assigned in GitHub to PR creator.

jorisvandenbossche · 2024-11-14T20:08:26Z

By switching the logical ordering, it means that we don't need to call _pandas_api.pandas_dtype(dtype) when using the pyarrow backend,

And because you added a `name not in ext_columns` check to the subsequent methods that fill `ext_columns`, this should preserve the priority of the different methods to determine the pandas dtype? (metadata < pyarrow extension type < types_mapper)

python/pyarrow/tests/test_pandas.py

bretttully · 2024-11-14T20:27:03Z

By switching the logical ordering, it means that we don't need to call _pandas_api.pandas_dtype(dtype) when using the pyarrow backend,

And because you added a `name not in ext_columns` check to the subsequent methods that fill `ext_columns`, this should preserve the priority of the different methods to determine the pandas dtype? (metadata < pyarrow extension type < types_mapper)

Yes, exactly. Priority remains the same, but functions are skipped if the field already has a type, meaning that the code causing the error is no longer called if types_mapper is provided.
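
For readers following along, here is a rough, pure-Python sketch of that ordering. It is not the pyarrow implementation: `resolve_ext_dtypes`, `parse_metadata_dtype`, the dict-based column records, and the string stand-ins for dtypes are all hypothetical names invented for illustration. It only shows the idea that the highest-priority source fills `ext_columns` first and later sources skip names that are already present, so the failing parse step is never reached when `types_mapper` handles the column.

```python
def parse_metadata_dtype(dtype_string):
    # Hypothetical stand-in for the metadata parsing step: nested type
    # strings such as "list<item: int64>[pyarrow]" are rejected here.
    if "<" in dtype_string:
        raise TypeError(f"data type {dtype_string!r} not understood")
    return f"parsed:{dtype_string}"


def resolve_ext_dtypes(columns, types_mapper=None):
    ext_columns = {}

    # 1. types_mapper has the highest priority, so it is consulted first.
    if types_mapper is not None:
        for col in columns:
            mapped = types_mapper(col["arrow_type"])
            if mapped is not None:
                ext_columns[col["name"]] = mapped

    # 2. Extension types only fill columns that are still unresolved.
    for col in columns:
        if col["name"] not in ext_columns and col.get("ext_dtype") is not None:
            ext_columns[col["name"]] = col["ext_dtype"]

    # 3. Metadata is parsed last, and only for columns nobody claimed yet.
    for col in columns:
        meta = col.get("metadata_dtype")
        if col["name"] not in ext_columns and meta is not None:
            ext_columns[col["name"]] = parse_metadata_dtype(meta)

    return ext_columns


columns = [
    {"name": "a", "arrow_type": "int64", "metadata_dtype": "Int64"},
    {
        "name": "b",
        "arrow_type": "list<item: int64>",
        "metadata_dtype": "list<item: int64>[pyarrow]",
    },
]


def arrow_backed(arrow_type):
    # Hypothetical mapper that claims every column.
    return f"{arrow_type}[pyarrow]"


# With a mapper that claims every column, the metadata parser is never called.
with_mapper = resolve_ext_dtypes(columns, types_mapper=arrow_backed)
```

Calling `resolve_ext_dtypes(columns)` without a mapper still reaches the parser for column `b` and raises in this sketch, which mirrors the remark in the PR description that the numpy backend will likely still fail.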

jorisvandenbossche · 2024-11-14T20:37:36Z

The test_dlpack failure in the tests you can ignore (#44728)

jorisvandenbossche

Looks good!
I triggered CI again

python/pyarrow/tests/test_pandas.py

bretttully · 2024-11-20T02:32:32Z

Thanks @jorisvandenbossche -- is the process that I can merge this following approval, or is that done by a core maintainer?

raulcd · 2024-11-20T09:30:57Z

is the process that I can merge this following approval, or is that done by a core maintainer?

A committer will merge, probably @jorisvandenbossche in this specific case, once everything is running and addressed. I've triggered CI for the latest changes.

python/pyarrow/tests/test_pandas.py

jorisvandenbossche · 2024-11-20T23:34:53Z

@github-actions crossbow submit -g python

github-actions · 2024-11-20T23:37:43Z

Revision: e3b9892

Submitted crossbow builds: ursacomputing/crossbow @ actions-e01b93275b

Task	Status
example-python-minimal-build-fedora-conda
example-python-minimal-build-ubuntu-venv
test-conda-python-3.10
test-conda-python-3.10-cython2
test-conda-python-3.10-hdfs-2.9.2
test-conda-python-3.10-hdfs-3.2.1
test-conda-python-3.10-pandas-latest-numpy-latest
test-conda-python-3.10-substrait
test-conda-python-3.11
test-conda-python-3.11-dask-latest
test-conda-python-3.11-dask-upstream_devel
test-conda-python-3.11-hypothesis
test-conda-python-3.11-pandas-latest-numpy-1.26
test-conda-python-3.11-pandas-latest-numpy-latest
test-conda-python-3.11-pandas-nightly-numpy-nightly
test-conda-python-3.11-pandas-upstream_devel-numpy-nightly
test-conda-python-3.11-spark-master
test-conda-python-3.12
test-conda-python-3.12-cpython-debug
test-conda-python-3.13
test-conda-python-3.9
test-conda-python-3.9-pandas-1.1.3-numpy-1.19.5
test-conda-python-emscripten
test-cuda-python-ubuntu-22.04-cuda-11.7.1
test-debian-12-python-3-amd64
test-debian-12-python-3-i386
test-fedora-39-python-3
test-ubuntu-22.04-python-3
test-ubuntu-22.04-python-313-freethreading
test-ubuntu-24.04-python-3

jorisvandenbossche · 2024-11-21T07:59:55Z

@raulcd it seems something is going wrong with the minimal test builds (e.g. example-python-minimal-build-fedora-conda). The logs indicate "Successfully installed pyarrow-0.1.dev16896+ge3b9892", which then messes up pandas' detection of the pyarrow version (for the pyarrow integration in pandas, pandas checks if pyarrow is recent enough and otherwise errors), giving some test failures.

(but also not entirely sure how this PR causes this issue, since I don't see the nightlies fail for the minimal builds at the moment)
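
To illustrate why a version string like the one in the logs trips a minimum-version check, here is a small sketch. The helper names and the minimum version are hypothetical, and pandas uses its own version-parsing logic rather than this code; the point is only that a build without the dev tags reports a `0.1` release segment, which compares lower than any realistic minimum.

```python
import re

# Hypothetical minimum, chosen only for illustration.
HYPOTHETICAL_MINIMUM = "10.0.1"


def release_tuple(version):
    """Return the leading numeric release segment as a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"cannot parse version {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def is_recent_enough(installed, minimum=HYPOTHETICAL_MINIMUM):
    # Tuples compare element by element, so (0, 1) sorts before (10, 0, 1).
    return release_tuple(installed) >= release_tuple(minimum)


# Version string reported in the failing build's logs.
misdetected = "0.1.dev16896+ge3b9892"

# Hypothetical version string for a build that did find the dev tags.
with_tags = "19.0.0.dev100+gabcdef"

misdetected_ok = is_recent_enough(misdetected)
with_tags_ok = is_recent_enough(with_tags)
```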

jorisvandenbossche · 2024-11-21T08:06:00Z

(the other failures are the known nightly dlpack failures)

raulcd · 2024-11-21T08:24:53Z

The logs indicate "Successfully installed pyarrow-0.1.dev16896+ge3b9892"

From the git checkout I see it is pulling from the fork's remote (Syncing repository: bretttully/arrow). I recall an issue where, if the dev tags are not present, we are unable to detect the correct version. The remote doesn't seem to have other branches or tags.

raulcd · 2024-11-21T10:11:06Z

I've opened an issue because we should find a way to not fail if the dev tag is not present:

[Python] Jobs fail if Pyarrow version is not correctly generated due to missing remote dev tags #44803

jorisvandenbossche · 2024-11-21T11:18:53Z

Thanks for investigating that!

So then to resolve this here, @bretttully should fetch the upstream tags and push them to his fork? Something like:

git fetch upstream
git push origin --tags

(assuming upstream is apache/arrow and origin is bretttully/arrow)

bretttully · 2024-11-21T11:31:56Z

I have merged upstream/main and pushed tags. Let's see if this works...

raulcd · 2024-11-21T12:14:18Z

@github-actions crossbow submit example-python-minimal-build-*

github-actions · 2024-11-21T12:16:39Z

Revision: 685167f

Submitted crossbow builds: ursacomputing/crossbow @ actions-524e782c26

Task	Status
example-python-minimal-build-fedora-conda
example-python-minimal-build-ubuntu-venv

bretttully · 2024-11-24T02:31:37Z

Is there anything else for me to do here?

raulcd · 2024-11-25T09:09:09Z

Is there anything else for me to do here?

I don't think so. I am not comfortable with this area of our codebase so I'll let @jorisvandenbossche merge once he's happy about it, but as he already approved, he might do that soon.

jorisvandenbossche · 2024-11-27T16:12:50Z

Is there anything else for me to do here?

No, just me getting back to merge it!

conbench-apache-arrow · 2024-11-27T22:21:17Z

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 8548c22.

There were 132 benchmark results with an error:

Commit Run on arm64-t4g-2xlarge-linux at 2024-11-27 18:04:38Z
- tpch (R) with engine=arrow, format=parquet, language=R, memory_map=False, query_id=TPCH-07, scale_factor=1
- tpch (R) with engine=arrow, format=native, language=R, memory_map=False, query_id=TPCH-09, scale_factor=1
and 130 more (see the report linked below)

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

bretttully · 2024-11-27T23:30:03Z

🚀 Thanks for all your help here @jorisvandenbossche!

bretttully added 2 commits November 14, 2024 09:25

Added test

cb87068

github-actions bot added the "awaiting review" label Nov 14, 2024

bretttully changed the title ~~GH-53011: [pyarrow] Reorder to_pandas extension dtype mapping~~ GH-39914: [pyarrow] Reorder to_pandas extension dtype mapping Nov 14, 2024

jorisvandenbossche reviewed Nov 14, 2024

jorisvandenbossche mentioned this pull request Nov 14, 2024

[Python][CI] test_dlpack is failing on some nightly jobs allocating unexpected memory #44728

Closed

apacheGH-39914 better testing based on PR feedback

1c23076

github-actions bot added the Component: Python label Nov 15, 2024

bretttully requested a review from jorisvandenbossche November 18, 2024 22:37

github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label Nov 19, 2024

jorisvandenbossche approved these changes Nov 19, 2024

bretttully added 2 commits November 20, 2024 13:29

apacheGH-39914 fix test for old pandas

436a9eb

apacheGH-39914 mistake in last

e3b9892

jorisvandenbossche reviewed Nov 20, 2024

github-actions bot added the "awaiting changes" label and removed the "awaiting committer review" label Nov 20, 2024

Update python/pyarrow/tests/test_pandas.py

662c744

github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label Nov 21, 2024

raulcd mentioned this pull request Nov 21, 2024

[Python] Jobs fail if Pyarrow version is not correctly generated due to missing remote dev tags #44803

Open

Merge remote-tracking branch 'upstream/main' into issue-53011

685167f

jorisvandenbossche merged commit 8548c22 into apache:main Nov 27, 2024
13 checks passed

jorisvandenbossche removed the "awaiting change review" label Nov 27, 2024

bretttully deleted the issue-53011 branch November 27, 2024 23:29

mroeschke mentioned this pull request Feb 13, 2025

BUG: Unable to round-trip nested arrow extension types with pa.Table.to_pandas pandas-dev/pandas#60927

Closed

3 tasks

hiroyuki-sato mentioned this pull request May 20, 2025

GH-46496: [CI][Dev] Fix shellcheck SC2086 errors in ci/scripts directory #46497

Merged
