
Conversation

@alamb (Contributor) commented Sep 22, 2025

Which issue does this PR close?

Rationale for this change

Parquet has a Variant logical type, which is how other writers signal which columns contain Variant values.

What changes are included in this PR?

  1. Add a mapping between the Parquet LogicalType and the Arrow extension type introduced in #8392 (Add Arrow Variant Extension Type, remove Array impl for VariantArray and ShreddedVariantFieldArray); a conceptual sketch of the read-side direction follows below
  2. Documentation and tests showing reading/writing Parquet files with the Variant logical type annotation
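
For readers, here is a conceptual sketch of the read-side half of that mapping (illustrative only, not the PR's actual implementation; it assumes LogicalType::Variant is a unit variant, as in the code shown later in this PR, and that arrow-schema's Field::with_extension_type applies here):

```rust
use arrow_schema::Field;
use parquet::basic::LogicalType;
use parquet::variant::VariantType;

/// When a Parquet column is annotated with the Variant logical type, tag the
/// corresponding Arrow field with the canonical Variant extension type so that
/// downstream code (e.g. `VariantArray`) can recognize it.
fn annotate_variant_field(field: Field, logical_type: Option<&LogicalType>) -> Field {
    match logical_type {
        Some(LogicalType::Variant) => field.with_extension_type(VariantType),
        _ => field,
    }
}
```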

Are these changes tested?

Yes, new unit tests and doc examples

Are there any user-facing changes?

You can now read and write Parquet Variant columns to/from VariantArray
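
As a rough illustration of the new user-facing surface (a minimal sketch; it assumes the experimental VariantArray::try_new and VariantArray::value keep roughly their current shapes):

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
// Re-exported from parquet_variant_compute when the `variant_experimental`
// feature is enabled
use parquet::variant::{VariantArray, VariantType};

fn read_variant_column(path: &str) -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open(path)?;
    let mut reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;
    let batch = reader.next().expect("at least one batch")?;

    // Columns written with the Parquet Variant logical type come back with the
    // Arrow Variant extension type attached to the field metadata
    assert!(batch
        .schema()
        .field(0)
        .try_extension_type::<VariantType>()
        .is_ok());

    // Wrap the underlying struct column as a strongly typed VariantArray
    // (assumed signature: VariantArray::try_new(&dyn Array))
    let variants = VariantArray::try_new(batch.column(0))?;
    println!("first value: {:?}", variants.value(0));
    Ok(())
}
```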

@github-actions bot added the parquet (Changes to the parquet crate) and parquet-variant (parquet-variant* crates) labels on Sep 22, 2025
@alamb force-pushed the alamb/variant_logical_type branch from 04b2e37 to 8f16381 on September 22, 2025 19:39
@alamb marked this pull request as ready for review on September 22, 2025 19:58
.with_nullable(false),
);
let value_field = Arc::new(convert_field(map_value, &value, arrow_value));
let value_field = Arc::new(convert_field(map_value, &mut value, arrow_value));
Contributor Author

This logic is somewhat confusing to me in that the arrow type is encoded twice -- once on a ParquetField (which has an arrow Field in it) and once in this ValueField.

Any help simplifying this would be most appreciated

Contributor

Sorry, I don't think I understand this part of the code well enough to suggest anything :(

// specific language governing permissions and limitations
// under the License.

//! Arrow Extension Type Support for Parquet
Contributor Author

I plan to consolidate the rest of the extension type handling in this module, to try to improve the current situation where #[cfg(..)] is sprinkled throughout the type conversion logic


#[cfg(feature = "variant_experimental")]
pub(crate) fn logical_type_for_struct(field: &Field) -> Option<LogicalType> {
use parquet_variant_compute::VariantType;
if field.extension_type_name()? == VariantType::NAME {
Contributor Author

I was worried here about testing for the extension type using try_extension_type and then discarding any error via ok() -- creating an ArrowError requires allocating a string, so that pattern can be expensive (allocate and format a string just to throw it away)

@scovich also noticed this in the pathfinding PR here:

Contributor

Can we do both? Check the name (= quick and cheap) and only try_extension_type if the name matches? (if they asked for Variant and provided an incorrect data type, they shouldn't be surprised at the cost of allocating an error message)

Contributor Author

This is an excellent call -- done in 68ffd32

//! ```
pub use parquet_variant::*;
pub use parquet_variant_compute as compute;
pub use parquet_variant_compute::*;
Contributor Author

I also started importing everything directly into parquet::variant rather than also requiring a parquet::variant::compute module, which I think makes it easier to work with this crate
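
For example (illustrative; the exact set of re-exported names may differ), user code can now write:

```rust
// Everything from the parquet-variant* crates is re-exported at the top level
// of `parquet::variant` (behind the `variant_experimental` feature), so the
// compute kernels no longer need to be reached through a nested path:
use parquet::variant::{Variant, VariantArray, VariantType};

// the nested module is still available for those who prefer it:
// use parquet::variant::compute::VariantArray;
```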

use std::path::PathBuf;
use std::sync::Arc;

#[test]
Contributor Author

Here are end-to-end cases showing how to read/write the files to disk
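
For readers following along, here is a rough sketch of the write side of such a case (illustrative only; VariantArrayBuilder and the into_inner/with_extension_type calls are assumptions about the experimental API, and the tests in this PR are the authoritative version). Reading the file back works as sketched under "Are there any user-facing changes?" above.

```rust
use std::fs::File;
use std::sync::Arc;

use arrow_array::{Array, ArrayRef, RecordBatch};
use arrow_schema::{Field, Schema};
use parquet::arrow::ArrowWriter;
use parquet::variant::{Variant, VariantArrayBuilder, VariantType};

fn write_variant_column(path: &str) -> Result<(), Box<dyn std::error::Error>> {
    // Build a VariantArray (builder API assumed: new / append_variant / build)
    let mut builder = VariantArrayBuilder::new(2);
    builder.append_variant(Variant::from("hello"));
    builder.append_variant(Variant::from(42i64));
    let variants = builder.build();

    // VariantArray no longer implements Array (see #8392), so unwrap it to its
    // underlying StructArray to place it in a RecordBatch (method name assumed)
    let storage = variants.into_inner();

    // Annotating the field with the Arrow Variant extension type is what makes
    // the writer emit the Parquet Variant logical type for this column
    let field = Field::new("v", storage.data_type().clone(), false)
        .with_extension_type(VariantType);
    let schema = Arc::new(Schema::new(vec![field]));
    let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(storage) as ArrayRef])?;

    let mut writer = ArrowWriter::try_new(File::create(path)?, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```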

/// Ensure a file with Variant LogicalType, written by another writer in
/// parquet-testing, can be read as a VariantArray
#[test]
fn read_logical_type() {
Contributor Author

@scovich noted in a comment on the pathfinding PR that this is very similar to the doc test above.

I re-reviewed it, and while I agree it is redundant with the doc test above, I feel that having the test in both places makes it easier to see that there is test coverage for reading logical types, so I would like to keep it here.

@alamb requested a review from scovich on September 24, 2025 18:26
@alamb (Contributor Author) commented Sep 24, 2025

@scovich / @paleolimbot / @mbrobbel I wonder if you might have some time to review this PR?

It consists of the final connection code to read/write variant LogicalTypes.

I have a follow-on in

@scovich (Contributor) commented Sep 24, 2025

@alamb -- PR description seems a bit... unpopulated?

@scovich (Contributor) left a comment

Looks like a step forward

@alamb changed the title to Support reading/writing VariantArray to parquet with Variant LogicalType on Sep 24, 2025
@alamb (Contributor Author) commented Sep 24, 2025

@alamb -- PR description seems a bit... unpopulated?

Good call -- I updated it

@paleolimbot (Member) left a comment

I'm very new to this part of the code but I've added context from Arrow C++ where I thought it might fit!

Comment on lines +53 to +62
// Check the name (= quick and cheap) and only try_extension_type if the name matches
// to avoid unnecessary String allocations in ArrowError
if field.extension_type_name()? != VariantType::NAME {
return None;
}
match field.try_extension_type::<VariantType>() {
Ok(VariantType) => Some(LogicalType::Variant),
// Given check above, this should not error, but if it does ignore
Err(_e) => None,
}
Member

I know you're working to make this more generic, but as a pattern this is also how Arrow C++ does the conversion (except that in Arrow C++ we can just look at the DataType, whereas here you have to poke at the field metadata and parse it). Arrow C++ also just hard-codes a few names/storage types (but with no compile-time flag on the Parquet library... just a static extension registry that controls whether an extension type shows up as a DataType or as a field with metadata).

Contributor Author

Yeah, I think it is a fine line to decide how many of the canonical extension types end up in the core parquet/Arrow API.

Including them all by default perhaps makes the initial developer experience nicer and keeps the code cleaner, but it makes it harder to customize the binary size / feature set.

In my mind I am trying to follow the existing pattern / guidelines (set by @tustvold, I think), and I do think it makes sense, but I understand there are tradeoffs involved

Member

Ah sorry, I think I was reflecting in my comment that you're doing exactly what Arrow C++ is doing here (canonical extension type support there is also controlled by compile-time flags, they're just on by default and it just manifests as missing field metadata).

@mbrobbel (Member) left a comment

Thanks @alamb, looking forward to #8409

@alamb (Contributor Author) commented Sep 25, 2025

I ran out of time today, but I plan to address comments on this PR tomorrow and merge it. I also want to close out the initial Variant epic and do some housecleaning ticket-wise

@alamb (Contributor Author) commented Sep 26, 2025

Ok, I think I have resolved all the comments on this PR and plan to merge it once CI passes. I'll then polish up #8409 and clean up the variant epic

@alamb merged commit d6a29ec into apache:main on Sep 26, 2025
20 checks passed
@alamb (Contributor Author) commented Sep 26, 2025

woohoo! That feels pretty good -- we can read/write Parquet files with Variant logical types now!

alamb added a commit that referenced this pull request Sep 29, 2025
# Which issue does this PR close?

- Follow on to #8408
- Closes #7063


# Rationale for this change

I was trying to consolidate the parquet extension type code after
#8408, and in so doing I believe
I actually found (and fixed) the root cause of
#7063 (I will point it out
inline)

# What changes are included in this PR?

1. Consolidate parquet<-->arrow extension type metadata mapping in one
module
2. Enable tests

# Are these changes tested?

Yes

# Are there any user-facing changes?

When reading Parquet files annotated with the JSON or UUID logical types, the resulting Arrow arrays will also have the corresponding canonical extension types attached.

Successfully merging this pull request may close these issues: [Variant] Support reading/writing Parquet Variant LogicalType
