GH-48254: [Python][Parquet] Support extension types in read_schema by Kuinox · Pull Request #48255 · apache/arrow

Kuinox · 2025-11-25T17:05:26Z

Rationale for this change

pq.read_schema drops extension types (UUID comes back as fixed_size_binary[16]), while ParquetFile.schema_arrow and read_table preserve them. Schema inspection via metadata should match table/extension behavior.
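The mismatch described above can be checked with a small script. This is a minimal sketch, not code from the PR: the helper name `compare_uuid_schema_types` and the column name `id` are hypothetical, and it assumes a pyarrow release that provides `pa.uuid()`. The reported types depend on the installed version, so no expected output is shown.

```python
import os
import tempfile
import uuid

try:  # pyarrow is treated as optional so the sketch degrades gracefully
    import pyarrow as pa
    import pyarrow.parquet as pq
except ImportError:
    pa = pq = None


def compare_uuid_schema_types():
    """Return (read_schema type, schema_arrow type) for a UUID column as strings.

    Returns None when pyarrow, or its pa.uuid() type, is unavailable.
    """
    if pa is None or not hasattr(pa, "uuid"):
        return None
    # Build a UUID extension column on top of 16-byte fixed-size binary storage.
    storage = pa.array([uuid.uuid4().bytes for _ in range(3)], type=pa.binary(16))
    table = pa.table({"id": pa.ExtensionArray.from_storage(pa.uuid(), storage)})
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "uuid.parquet")
        pq.write_table(table, path)
        # Schema as reported by the metadata-only helper.
        via_read_schema = pq.read_schema(path).field("id").type
        # Schema as reported by the file reader.
        with pq.ParquetFile(path) as f:
            via_parquet_file = f.schema_arrow.field("id").type
    return str(via_read_schema), str(via_parquet_file)


if __name__ == "__main__":
    print(compare_uuid_schema_types())
```

If the two strings differ, the installed version shows the behavior this PR reports; if they match, the fix (or an equivalent) is presumably already present.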

What changes are included in this PR?

Plumb arrow_extensions_enabled into read_schema and return schema_arrow when enabled so extension types are preserved.
Add regression test ensuring UUID extension types are retained by read_schema and downgraded to binary(16) when extensions are disabled.
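The first change can be pictured roughly as follows. This is only one plausible shape, not the merged code: `read_schema_sketch` is a hypothetical name, the default of `True` is an assumption, and the sketch assumes a pyarrow version whose `ParquetFile` accepts the `arrow_extensions_enabled` keyword.

```python
try:  # pyarrow is treated as optional so the sketch can be imported without it
    import pyarrow.parquet as pq
except ImportError:
    pq = None


def read_schema_sketch(where, *, arrow_extensions_enabled=True):
    """Hypothetical stand-in for read_schema that forwards the extensions flag.

    The default value is an assumption made for this sketch.
    """
    if pq is None:
        raise RuntimeError("pyarrow is required to read a Parquet schema")
    with pq.ParquetFile(
        where, arrow_extensions_enabled=arrow_extensions_enabled
    ) as f:
        if arrow_extensions_enabled:
            # The reader's Arrow schema keeps extension types such as UUID.
            return f.schema_arrow
        # Converting the raw Parquet schema yields plain storage types.
        return f.schema.to_arrow_schema()
```

Calling `read_schema_sketch(path, arrow_extensions_enabled=False)` is meant to mirror the "extensions disabled" branch that the regression test covers.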

Are these changes tested?

Yes: added unit test test_read_schema_uuid_extension_type

Are there any user-facing changes?

Behavior improvement: read_schema now preserves extension types (e.g., UUID) when extensions are enabled; no API break

Notes:

I don't know whether returning the column types as extension<arrow.uuid> instead of fixed_size_binary[16] is considered a breaking change.
This PR patch was AI generated, but I personally reviewed it, the scope is small, and it looks fine to me.

GitHub Issue: [Python][Parquet] read_schema drops extension types (UUID returned as fixed_size_binary[16]) #48254

github-actions · 2025-11-25T17:05:54Z

⚠️ GitHub issue #48254 has been automatically assigned in GitHub to PR creator.

AlenkaF

Thanks for the contribution @Kuinox!
You can see my comments below.

Kuinox · 2026-03-03T14:57:05Z

I had issues running the tests on my machine (they were reported as green). I now have a non-Windows machine, so I'll try on it.

AlenkaF · 2026-05-06T12:42:42Z

@github-actions crossbow submit -g python

github-actions · 2026-05-06T12:48:23Z

Revision: 966df38

Submitted crossbow builds: ursacomputing/crossbow @ actions-e7fd264d23

Task	Status
example-python-minimal-build-fedora-conda
example-python-minimal-build-ubuntu-venv
test-conda-python-3.10
test-conda-python-3.10-hdfs-2.9.2
test-conda-python-3.10-hdfs-3.2.1
test-conda-python-3.10-pandas-1.3.4-numpy-1.21.2
test-conda-python-3.11
test-conda-python-3.11-dask-latest
test-conda-python-3.11-dask-upstream_devel
test-conda-python-3.11-hypothesis
test-conda-python-3.11-pandas-latest-numpy-latest
test-conda-python-3.11-spark-master
test-conda-python-3.12
test-conda-python-3.12-cpython-debug
test-conda-python-3.12-pandas-latest-numpy-1.26
test-conda-python-3.12-pandas-latest-numpy-latest
test-conda-python-3.13
test-conda-python-3.13-pandas-nightly-numpy-nightly
test-conda-python-3.13-pandas-upstream_devel-numpy-nightly
test-conda-python-3.14
test-conda-python-emscripten
test-cuda-python-ubuntu-22.04-cuda-11.7.1
test-debian-12-python-3-amd64
test-debian-12-python-3-i386
test-fedora-42-python-3
test-ubuntu-22.04-python-3
test-ubuntu-22.04-python-313-freethreading
test-ubuntu-24.04-python-3

Kuinox · 2026-05-06T13:03:07Z

Are the errors expected? The build errors don't seem related to my change.

AlenkaF · 2026-05-06T13:14:08Z

Some of them are, but not that many. Could you first try to rebase again please?

AlenkaF · 2026-05-07T12:49:28Z

@github-actions crossbow submit -g python

github-actions · 2026-05-07T12:52:42Z

Revision: 808df3d

Submitted crossbow builds: ursacomputing/crossbow @ actions-8aeeb0e39d

Task	Status
example-python-minimal-build-fedora-conda
example-python-minimal-build-ubuntu-venv
test-conda-python-3.10
test-conda-python-3.10-hdfs-2.9.2
test-conda-python-3.10-hdfs-3.2.1
test-conda-python-3.10-pandas-1.3.4-numpy-1.21.2
test-conda-python-3.11
test-conda-python-3.11-dask-latest
test-conda-python-3.11-dask-upstream_devel
test-conda-python-3.11-hypothesis
test-conda-python-3.11-pandas-latest-numpy-latest
test-conda-python-3.11-spark-master
test-conda-python-3.12
test-conda-python-3.12-cpython-debug
test-conda-python-3.12-pandas-latest-numpy-1.26
test-conda-python-3.12-pandas-latest-numpy-latest
test-conda-python-3.13
test-conda-python-3.13-pandas-nightly-numpy-nightly
test-conda-python-3.13-pandas-upstream_devel-numpy-nightly
test-conda-python-3.14
test-conda-python-emscripten
test-debian-13-python-3-amd64
test-debian-13-python-3-i386
test-fedora-42-python-3
test-ubuntu-22.04-python-3
test-ubuntu-22.04-python-313-freethreading
test-ubuntu-24.04-python-3

AlenkaF

LGTM, thanks!
The failures that are left are expected.

@raulcd mind giving one extra look before I merge?

raulcd

Just a minor nit and a question, but approving as it LGTM.
Thanks @Kuinox for the PR

raulcd · 2026-05-08T09:14:03Z

+
+    file_path = tmp_path / "uuid.parquet"
+    file_path_str = str(file_path)
+    pq.write_table(table, file_path_str, store_schema=False)


just curious, is store_schema=False relevant?

Kuinox · 2026-05-08

It was 6 months ago, so I'm only guessing now:
I remember that there was different behavior depending on whether Arrow loaded its stored schema or not.
I don't remember if it was needed here, but store_schema=False makes sure that a UUID logical type is detected as such, without Arrow getting the information from its own stored schema.

I can confirm it if you want.

AlenkaF · 2026-05-19

I think this makes sense. @raulcd are you OK if we keep it as is?
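The effect of store_schema discussed in this thread can be observed in the file's key-value metadata. The following is a minimal sketch with a hypothetical helper name (`arrow_schema_key_present`) and a plain integer column. It relies on pyarrow storing the serialized Arrow schema under the `ARROW:schema` key when `store_schema=True`.

```python
import os
import tempfile

try:  # pyarrow is treated as optional so the sketch degrades gracefully
    import pyarrow as pa
    import pyarrow.parquet as pq
except ImportError:
    pa = pq = None


def arrow_schema_key_present(store_schema):
    """Write a tiny file and report whether it embeds the Arrow schema.

    Returns None when pyarrow is unavailable.
    """
    if pa is None:
        return None
    table = pa.table({"x": pa.array([1, 2, 3], type=pa.int64())})
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "t.parquet")
        pq.write_table(table, path, store_schema=store_schema)
        # File-level key-value metadata; None when the file has no entries.
        key_value = pq.read_metadata(path).metadata or {}
    return b"ARROW:schema" in key_value


if __name__ == "__main__":
    print(arrow_schema_key_present(True), arrow_schema_key_present(False))
```

Without the embedded schema, a reader has only the Parquet logical type to go on, which is the situation the test in this PR appears to target.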

raulcd · 2026-05-25T12:22:12Z

Thanks @Kuinox for the PR. Sorry it took me a while to come back to it. I plan to merge once CI finishes successfully.

conbench-apache-arrow · 2026-05-25T21:56:42Z

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit cd1811b.

There was 1 benchmark result with an error:

Commit Run on amd64-m5-4xlarge-linux at 2026-05-25 16:19:39Z
- dataset-serialize (Python) with dataset=nyctaxi_multi_parquet_s3, format=parquet, selectivity=100pc

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.

…ema (apache#48255)

### Rationale for this change

pq.read_schema drops extension types (UUID comes back as fixed_size_binary[16]), while ParquetFile.schema_arrow and read_table preserve them. Schema inspection via metadata should match table/extension behavior.

### What changes are included in this PR?

- Plumb arrow_extensions_enabled into read_schema and return schema_arrow when enabled so extension types are preserved.
- Add regression test ensuring UUID extension types are retained by read_schema and downgraded to binary(16) when extensions are disabled.

### Are these changes tested?

- Yes: added unit test test_read_schema_uuid_extension_type

### Are there any user-facing changes?

- Behavior improvement: read_schema now preserves extension types (e.g., UUID) when extensions are enabled; no API break

Notes:
- I don't know if the fact the column types being returned are now extension<arrow.uuid> instead of fixed_size_binary[16], is considered a breaking change.
- This PR patch was AI generated, but I personally reviewed it, the scope is small, and it looks fine to me.

* GitHub Issue: apache#48254

Authored-by: Nicolas Vandeginste <n.vandeginste@abc-arbitrage.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

Kuinox requested review from AlenkaF, raulcd and rok as code owners November 25, 2025 17:05

github-actions Bot added the Component: Python and awaiting review labels Nov 25, 2025

Kuinox force-pushed the schema_uuid_fix branch from ae7d919 to 47fe873 Compare November 25, 2025 17:09

Kuinox force-pushed the schema_uuid_fix branch 2 times, most recently from 2fcb4b7 to 820ae83 Compare December 17, 2025 17:34

AlenkaF reviewed Dec 23, 2025


Comment thread python/pyarrow/parquet/core.py

Comment thread python/pyarrow/parquet/core.py Outdated

Comment thread python/pyarrow/tests/parquet/test_data_types.py Outdated

Kuinox force-pushed the schema_uuid_fix branch 2 times, most recently from e16f96f to a144bc4 Compare February 4, 2026 12:03

Kuinox force-pushed the schema_uuid_fix branch from a144bc4 to 966df38 Compare March 3, 2026 22:13

Kuinox force-pushed the schema_uuid_fix branch from 966df38 to 808df3d Compare May 6, 2026 16:21

AlenkaF approved these changes May 7, 2026


github-actions Bot added the awaiting committer review label and removed the awaiting review label May 7, 2026

raulcd approved these changes May 8, 2026


github-actions Bot added the awaiting merge label and removed the awaiting committer review label May 8, 2026

apacheGH-48254: [Python][Parquet] Support extension types in read_schema

d8ad044

Kuinox force-pushed the schema_uuid_fix branch from 808df3d to d8ad044 Compare May 13, 2026 15:40

raulcd approved these changes May 25, 2026


raulcd merged commit cd1811b into apache:main May 25, 2026
24 of 26 checks passed

raulcd removed the awaiting merge label May 25, 2026

raulcd mentioned this pull request May 25, 2026

[Python][Parquet] read_schema drops extension types (UUID returned as fixed_size_binary[16]) #48254

Closed
