-
Notifications
You must be signed in to change notification settings - Fork 1k
Support reading/writing VariantArray to parquet with Variant LogicalType
#8408
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
04b2e37 to
8f16381
Compare
| .with_nullable(false), | ||
| ); | ||
| let value_field = Arc::new(convert_field(map_value, &value, arrow_value)); | ||
| let value_field = Arc::new(convert_field(map_value, &mut value, arrow_value)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic is somewhat confusing to me in that the arrow type is encoded twice -- once on a ParquetField (which has an arrow Field in it) and once in this ValueField.
Any help simplifing this would be most appreciated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I don't think I understand this part of the code well enough to suggest anything :(
| // specific language governing permissions and limitations | ||
| // under the License. | ||
|
|
||
| //! Arrow Extension Type Support for Parquet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I plan to consolidate the rest of the extension type handling in this module, to try and improve the current situation where #cfg(..) is sprinkled over the type conversion logic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| #[cfg(feature = "variant_experimental")] | ||
| pub(crate) fn logical_type_for_struct(field: &Field) -> Option<LogicalType> { | ||
| use parquet_variant_compute::VariantType; | ||
| if field.extension_type_name()? == VariantType::NAME { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was worried here about testing for the extension type using try_extension_type and then discarding any error via ok() -- creating an ArrowError requires allocating a string, so that pattern can be expensive (allocate and format a string just to throw it away)
@scovich also noticed this in the pathfinding PR here:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we do both? Check the name (= quick and cheap) and only try_extension_type if the name matches? (if they asked for Variant and provided an incorrect data type, they shouldn't be surprised at the cost of allocating an error message)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an excellent call -- done in 68ffd32
| //! ``` | ||
| pub use parquet_variant::*; | ||
| pub use parquet_variant_compute as compute; | ||
| pub use parquet_variant_compute::*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also started importing everything directly into parquet::variant rather than also having a parquet::variant::compute which I think makes it easier to work with this crate
| use std::path::PathBuf; | ||
| use std::sync::Arc; | ||
|
|
||
| #[test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here are end to end cases showing how to read/write the files to disk
| /// Ensure a file with Variant LogicalType, written by another writer in | ||
| /// parquet-testing, can be read as a VariantArray | ||
| #[test] | ||
| fn read_logical_type() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@scovich noted in following comment of the pathfinding PR that this is very similar to the doc test above.
I re-reviewed it and while I agree this is redundant with the doc test above, I feel that putting the test in both places makes it easier to understand that there is test coverage for reading logical types, and thus I would like to keep it here.
|
@scovich / @paleolimbot / @mbrobbel I wonder if you might have some time to review this PR? It consists of the final connection code to read/write variant LogicalTypes. I have a follow on in
|
|
@alamb -- PR description seems a bit... unpopulated? |
scovich
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a step forward
| #[cfg(feature = "variant_experimental")] | ||
| pub(crate) fn logical_type_for_struct(field: &Field) -> Option<LogicalType> { | ||
| use parquet_variant_compute::VariantType; | ||
| if field.extension_type_name()? == VariantType::NAME { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we do both? Check the name (= quick and cheap) and only try_extension_type if the name matches? (if they asked for Variant and provided an incorrect data type, they shouldn't be surprised at the cost of allocating an error message)
VariantArray to parquet with Variant LogicalType
Good call -- I updated it |
paleolimbot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm very new to this part of the code but I've added context from Arrow C++ where I thought it might fit!
| // Check the name (= quick and cheap) and only try_extension_type if the name matches | ||
| // to avoid unnecessary String allocations in ArrowError | ||
| if field.extension_type_name()? != VariantType::NAME { | ||
| return None; | ||
| } | ||
| match field.try_extension_type::<VariantType>() { | ||
| Ok(VariantType) => Some(LogicalType::Variant), | ||
| // Given check above, this should not error, but if it does ignore | ||
| Err(_e) => None, | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know you're working to make this more generic, but as a pattern this is also how Arrow C++ does the conversion (except in Arrow C++ we can just look at the DataType and here you have to poke the field metadata and parse it. Arrow C++ just hard-codes a few names/storage types too (but no compile time flag on the Parquet library...just a static extension registry that would control whether an extension type shows up as a DataType or a field with metadata).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think it is a fine line to decide how much of the canonical extension types end up in the core parquet/Arrow API
Including them all by default makes the initial developer experience nicer perhaps and keeps the code cleaner, but it makes it harder to customize the binary size / feature set
In my mind I am trying to follow the existing pattern / guidelines (set by @tustvold I think) and I do think it makes sense, but I do understand there are tradeoffs involved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sorry, I think I was reflecting in my comment that you're doing exactly what Arrow C++ is doing here (canonical extension type support there is also controlled by compile-time flags, they're just on by default and it just manifests as missing field metadata).
mbrobbel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
I ran out of time today, but I plan to address comments on this PR tomorrow and merge it. I also want to close out the initial Variant epic and do some housecleaning ticket-wise |
|
Ok, I think I have resolved all the comments on this PR and plan to merge it in when it passes. I'l then polish up #8409 and clean up the variant epic |
|
woohoo! That feels pretty good -- we can read/write Parquet files with Variant logical types now! |
# Which issue does this PR close? - Follow on to #8408 - Closes #7063 # Rationale for this change I was trying to consolidate the parquet extension type code after #8408, and in so doing I believe I actually found (and fixed) the root cause of #7063 (I will point it out inline) # What changes are included in this PR? 1. Consolidate parquet<-->arrow extension type metadata mapping in one module 2. Enable tests # Are these changes tested? Yes # Are there any user-facing changes? When reading parquet that is annotated with Json or UUID logical types, the resulting Arrow arrays will also have the canonical types attached.
Which issue does this PR close?
VariantArrayto parquet withVariantLogicalType #8365Rationale for this change
Parquet has logical types, which is how other writers signal what columns contain
Variantvalues.What changes are included in this PR?
Arrayimpl forVariantArrayandShreddedVariantFieldArray#8392Are these changes tested?
Yes, new unit tests and doc examples
Are there any user-facing changes?
You can now read/write Variant columns to
VariantArray