GH-48254: [Python][Parquet] Support extension types in read_schema #48255
I had issues running the tests on my machines (although it was reported as green). I now have a non-Windows machine, so I'll try on it.

@github-actions crossbow submit -g python

Revision: 966df38 Submitted crossbow builds: ursacomputing/crossbow @ actions-e7fd264d23

Are the errors expected? The build errors don't seem related to my change.

Some of them are, but not that many. Could you first try to rebase again please?

@github-actions crossbow submit -g python

Revision: 808df3d Submitted crossbow builds: ursacomputing/crossbow @ actions-8aeeb0e39d

file_path = tmp_path / "uuid.parquet"
file_path_str = str(file_path)
pq.write_table(table, file_path_str, store_schema=False)
just curious, is store_schema=False relevant?
It was 6 months ago, so I'm only guessing now:
I remember that there were different behaviors depending on whether Arrow loaded its stored schema or not.
I don't remember if it was needed here, but store_schema=False makes sure that a UUID logical type is detected as such, without Arrow getting the information from its own stored schema.
I can confirm it if you want.
I think this makes sense. @raulcd are you OK if we keep it as is?

Thanks @Kuinox for the PR. Sorry it took me a while to come back to it. I plan to merge once CI finishes successfully.

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit cd1811b. There was 1 benchmark result with an error:
There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.
…ema (apache#48255)

### Rationale for this change

pq.read_schema drops extension types (UUID comes back as fixed_size_binary[16]), while ParquetFile.schema_arrow and read_table preserve them. Schema inspection via metadata should match table/extension behavior.

### What changes are included in this PR?

- Plumb arrow_extensions_enabled into read_schema and return schema_arrow when enabled so extension types are preserved.
- Add regression test ensuring UUID extension types are retained by read_schema and downgraded to binary(16) when extensions are disabled.

### Are these changes tested?

- Yes: added unit test test_read_schema_uuid_extension_type

### Are there any user-facing changes?

- Behavior improvement: read_schema now preserves extension types (e.g., UUID) when extensions are enabled; no API break

Notes:

- I don't know if the fact the column types being returned are now extension<arrow.uuid> instead of fixed_size_binary[16], is considered a breaking change.
- This PR patch was AI generated, but I personally reviewed it, the scope is small, and it looks fine to me.

* GitHub Issue: apache#48254

Authored-by: Nicolas Vandeginste <n.vandeginste@abc-arbitrage.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>