GH-43683: [Python] Use pandas StringDtype when enabled (pandas 3+) #44195

Open: wants to merge 15 commits into base: main

Member

@jorisvandenbossche jorisvandenbossche commented Sep 23, 2024

Rationale for this change

PDEP-14 proposes a dedicated default string dtype for pandas, replacing the current object dtype for string data.

This will become the default in pandas 3.0 and can be enabled through an option in the upcoming pandas 2.3 (pd.options.future.infer_string = True). To prepare for that, we should start using that string dtype in to_pandas() conversions when the option is enabled.

What changes are included in this PR?

  • If pandas >= 3.0 is used or the pandas option is enabled, ensure that to_pandas() calls use the default string dtype of pandas for string-like columns (string, large_string, string_view)
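As a rough illustration of the condition in the bullet above, here is a minimal sketch in plain Python. It is not the code in this PR: the function names, the set of type names, and the version parsing are assumptions made for the example, and the real implementation works on pyarrow types and the pandas option rather than on strings and booleans.

```python
import re

# Assumed names for the Arrow string-like types listed above (illustrative only).
STRING_LIKE_TYPES = {"string", "large_string", "string_view"}


def pandas_major_version(version: str) -> int:
    """Extract the major version from a string such as '3.0.0.dev0+123'."""
    match = re.match(r"(\d+)", version)
    if match is None:
        raise ValueError(f"cannot parse pandas version: {version!r}")
    return int(match.group(1))


def uses_default_string_dtype(version: str, infer_string: bool) -> bool:
    """True if pandas >= 3.0 is used or the infer_string option is enabled."""
    return pandas_major_version(version) >= 3 or infer_string


def converts_to_string_dtype(arrow_type: str, version: str, infer_string: bool) -> bool:
    """True if a column of this Arrow type should get the pandas string dtype."""
    return arrow_type in STRING_LIKE_TYPES and uses_default_string_dtype(
        version, infer_string
    )
```

In this sketch, a string_view column with pandas 2.2 and the option disabled keeps the old behaviour, while the same column with pandas 3.0, or with the option enabled on 2.3, is converted to the string dtype.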

Are these changes tested?

It is tested in the pandas-nightly crossbow build.

One failure remains, caused by a bug on the pandas side (pandas-dev/pandas#59879).

Are there any user-facing changes?

@jorisvandenbossche
Member Author

@github-actions crossbow submit test-conda-python-3.11-pandas-nightly-numpy-nightly

github-actions bot commented Nov 8, 2024

Revision: 56b61f2

Submitted crossbow builds: ursacomputing/crossbow @ actions-c85b742ef7

Task Status
test-conda-python-3.11-pandas-nightly-numpy-nightly GitHub Actions

e1 = pd.DataFrame(
{'a': a_values},
index=pd.RangeIndex(0, 8, step=2, name='qux'),
columns=pd.Index(['a'], dtype=object)
Contributor

Does the column type created with the dict argument differ from this?

Member Author

This test specifically uses old metadata that specifies object dtype for the columns, and pyarrow then tries to restore it that way.

The question is whether we should do that, though, because every file written from a pandas DataFrame before pandas 3.0 will have that metadata. Maybe we should specifically ignore object dtype here if the inferred type shows that the columns contain only strings, so users consistently get a columns Index object using str dtype.

Contributor

Hmm, that's tricky, but I think going with the str data type as you suggested is better; I would expect it to be a better UX in over 99% of cases.

Member Author

OK, I changed this to ensure we actually use a str dtype columns Index object, even if the pandas metadata of the pyarrow table says that the original table was using object dtype.

This ensures that all existing files will use the default str dtype for the columns (with pandas >= 3). The trade-off is that if you explicitly want to use object dtype with strings, this will no longer roundtrip (pandas -> pyarrow/parquet -> pandas).
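The decision described above can be sketched as follows. This is a simplified model and not the actual pyarrow code: `resolve_columns_dtype`, its parameters, and the plain-string dtype names are hypothetical, and treating an empty set of labels as "not all strings" is an assumption of the sketch.

```python
def resolve_columns_dtype(metadata_dtype, labels, string_dtype_enabled):
    """Pick the dtype for the restored columns Index (illustrative only).

    metadata_dtype: dtype name recorded in the stored pandas metadata
    labels: the column labels being restored
    string_dtype_enabled: whether the pandas default string dtype is active
    """
    # Assumption: an empty Index is not treated as "all strings".
    all_strings = len(labels) > 0 and all(isinstance(label, str) for label in labels)
    if string_dtype_enabled and metadata_dtype == "object" and all_strings:
        # Ignore the recorded object dtype so that old files also yield
        # a str dtype columns Index.
        return "str"
    return metadata_dtype
```

The trade-off mentioned above shows up in the first branch: metadata written from an explicit object-dtype Index of strings is indistinguishable from metadata written by older pandas, so both come back as str.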

@github-actions github-actions bot added the awaiting changes label and removed the awaiting committer review label Nov 8, 2024
@jorisvandenbossche
Member Author

@github-actions crossbow submit test-conda-python-3.11-pandas-nightly-numpy-nightly

@github-actions github-actions bot added the awaiting change review label and removed the awaiting changes label Nov 13, 2024
Revision: 84b8234

Submitted crossbow builds: ursacomputing/crossbow @ actions-ac3103d3ba

Task Status
test-conda-python-3.11-pandas-nightly-numpy-nightly GitHub Actions

@jorisvandenbossche
Member Author

@github-actions crossbow submit test-conda-python-3.11-pandas-nightly-numpy-nightly

Revision: e5db09f

Submitted crossbow builds: ursacomputing/crossbow @ actions-3c389cd49e

Task Status
test-conda-python-3.11-pandas-nightly-numpy-nightly GitHub Actions

@github-actions github-actions bot removed the awaiting change review label Nov 13, 2024
@github-actions github-actions bot added the awaiting changes label Nov 13, 2024
@github-actions github-actions bot added the awaiting change review label and removed the awaiting changes label Nov 13, 2024
@jorisvandenbossche
Member Author

@github-actions crossbow submit -g python

Revision: f9f960f

Submitted crossbow builds: ursacomputing/crossbow @ actions-883577486f

Task Status
example-python-minimal-build-fedora-conda GitHub Actions
example-python-minimal-build-ubuntu-venv GitHub Actions
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-cython2 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.10-substrait GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-latest-numpy-1.26 GitHub Actions
test-conda-python-3.11-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.11-pandas-nightly-numpy-nightly GitHub Actions
test-conda-python-3.11-pandas-upstream_devel-numpy-nightly GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.12-cpython-debug GitHub Actions
test-conda-python-3.13 GitHub Actions
test-conda-python-3.9 GitHub Actions
test-conda-python-3.9-pandas-1.1.3-numpy-1.19.5 GitHub Actions
test-conda-python-emscripten GitHub Actions
test-cuda-python-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-python-3-amd64 GitHub Actions
test-debian-12-python-3-i386 GitHub Actions
test-fedora-39-python-3 GitHub Actions
test-ubuntu-22.04-python-3 GitHub Actions
test-ubuntu-22.04-python-313-freethreading GitHub Actions
test-ubuntu-24.04-python-3 GitHub Actions

Member

@raulcd raulcd left a comment

Thanks for the PR, @jorisvandenbossche. I'll try to review further (and understand it better) once the upstream issue is fixed and CI is not failing :)

dev/tasks/tasks.yml (resolved)
pa.types.is_string(field.type)
or pa.types.is_large_string(field.type)
or pa.types.is_string_view(field.type)
) and field.name not in categories:
Member

I am curious how categories were interpreted before inferring the new string type. Was this just not taken into account on the Arrow side?

Member Author

If field.name in categories is true, the user asked to convert this column to a categorical dtype on the pandas side. The C++ side handles this by dictionary encoding the column, so in this case we don't have to specify any custom pandas extension dtype here: our conversion layer will convert that dictionary-encoded column to a pandas categorical.
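A small sketch of that rule, using plain tuples instead of pyarrow fields: `string_dtype_overrides`, the `(name, type_name)` representation, and the `"str"` placeholder are hypothetical and only meant to show which columns get an explicit string dtype and which are left to the categorical path.

```python
# Assumed names for the Arrow string-like types (illustrative only).
STRING_LIKE_TYPES = {"string", "large_string", "string_view"}


def string_dtype_overrides(fields, categories, string_dtype="str"):
    """Map column names to the string dtype, skipping requested categoricals.

    fields: iterable of (name, arrow_type_name) pairs
    categories: column names the user asked to convert to categorical
    """
    overrides = {}
    for name, type_name in fields:
        if type_name in STRING_LIKE_TYPES and name not in categories:
            overrides[name] = string_dtype
        # Columns listed in `categories` get no override here: they are
        # dictionary encoded and later become pandas categoricals.
    return overrides
```

For example, with fields `[("a", "string"), ("b", "string"), ("c", "int64")]` and `categories=["b"]`, only column `a` receives the string dtype override.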

@github-actions github-actions bot added the awaiting review and awaiting changes labels and removed the awaiting change review label Nov 13, 2024
@jorisvandenbossche
Member Author

once the upstream issue is fixed and CI is not failing :)

I added an xfail, so the CI should no longer be failing. Note that the nightly builds have had an unrelated failure for a while anyway.

@github-actions github-actions bot added the awaiting changes label and removed the awaiting review and awaiting changes labels Nov 13, 2024
@jorisvandenbossche
Member Author

@github-actions crossbow submit -g python

@github-actions github-actions bot added the awaiting change review label and removed the awaiting changes label Nov 13, 2024
Revision: a7e5e34

Submitted crossbow builds: ursacomputing/crossbow @ actions-dfa36b7aca

Task Status
example-python-minimal-build-fedora-conda GitHub Actions
example-python-minimal-build-ubuntu-venv GitHub Actions
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-cython2 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.10-substrait GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-latest-numpy-1.26 GitHub Actions
test-conda-python-3.11-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.11-pandas-nightly-numpy-nightly GitHub Actions
test-conda-python-3.11-pandas-upstream_devel-numpy-nightly GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.12-cpython-debug GitHub Actions
test-conda-python-3.13 GitHub Actions
test-conda-python-3.9 GitHub Actions
test-conda-python-3.9-pandas-1.1.3-numpy-1.19.5 GitHub Actions
test-conda-python-emscripten GitHub Actions
test-cuda-python-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-python-3-amd64 GitHub Actions
test-debian-12-python-3-i386 GitHub Actions
test-fedora-39-python-3 GitHub Actions
test-ubuntu-22.04-python-3 GitHub Actions
test-ubuntu-22.04-python-313-freethreading GitHub Actions
test-ubuntu-24.04-python-3 GitHub Actions

@jorisvandenbossche
Member Author

With the latest run, the only failing test is the dlpack one, which is failing on main as well.

@raulcd
Member

raulcd commented Nov 14, 2024

With the latest run, the only failing test is the dlpack one, which is failing on main as well.

I don't think there was an issue opened for the dlpack error, so I've opened:


3 participants