Add Arrow Variant Extension Type, remove Array impl for VariantArray and ShreddedVariantFieldArray
#8392
Conversation
750ac8a to bb11814
```rust
use std::any::Any;
use std::sync::Arc;

/// Arrow Variant [`ExtensionType`].
```
Here is the new canonical variant type
```rust
}

/// An array of Parquet [`Variant`] values
///
```
I tried to update these docstrings to explain what is going on here and how to use the extension types
The doc here is very helpful!
```rust
pub fn try_new(inner: &dyn Array) -> Result<Self, ArrowError> {
    // Workaround lack of support for Binary
    // https://github.com/apache/arrow-rs/issues/8387
    let inner = cast_to_binary_view_arrays(inner)?;
```
This cast is unfortunate, but I think it is the best (simplest) solution while we work on the real solution
```rust
}

/// Returns a new DataType representing this VariantArray's inner type
pub fn data_type(&self) -> &DataType {
```
these methods are just moved from the Array impl
```rust
};

// Extract value and typed_value fields (metadata is not expected in ShreddedVariantFieldArray)
let value = inner_struct
```
this code is moved to ShreddingState
```rust
/// let struct_array: StructArray = get_struct_array();
/// let shredding_state = ShreddingState::try_from(&struct_array).unwrap();
/// ```
pub fn new(value: Option<BinaryViewArray>, typed_value: Option<ArrayRef>) -> Self {
```
This is now infallible per @scovich 's suggestion #8365 (comment)
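The infallible-by-construction change mentioned above can be sketched with std-only stand-in types (the struct names mirror the PR but are not the real arrow-rs definitions): because the parameters are already well-typed `Option`s, there is nothing left for the constructor to validate, so no `Result` is needed.

```rust
// Stand-ins for the arrow-rs types; illustrative only.
struct BinaryViewArray;
struct ArrayRef;

struct ShreddingState {
    value: Option<BinaryViewArray>,
    typed_value: Option<ArrayRef>,
}

impl ShreddingState {
    // All invariants are enforced by the parameter types, so unlike the
    // old `try_new`, this constructor cannot fail.
    fn new(value: Option<BinaryViewArray>, typed_value: Option<ArrayRef>) -> Self {
        Self { value, typed_value }
    }
}

fn main() {
    let s = ShreddingState::new(Some(BinaryViewArray), None);
    assert!(s.value.is_some());
    assert!(s.typed_value.is_none());
    println!("constructed");
}
```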
```rust
    }
}

impl Array for VariantArray {
```
this was moved up in the file
```diff
 pub(crate) enum ShreddedPathStep {
     /// Path step succeeded, return the new shredding state
-    Success(&'a ShreddingState),
+    Success(ShreddingState),
```
this is the only part I am a little sad about -- that it now requires owning the ShreddingState, which requires cloning the state. While this is typically just a few Arcs, it is still unfortunate.
However, unless we change VariantArray to be reference based (e.g. VariantArray<'a>) I can't think of any way around it
I mean, if somebody is concerned about cloning `ArrayRef` like it's free... arrow-rs might be the wrong project for them to look very closely at?
Some potential ideas:
1. Have the caller of `variant_get` pass an owned `VariantArray`, and then deconstruct it using some kind of `into_parts` method? The caller anyway had to clone the various innards of the `&dyn Array` it would have started out with. But that has the distinct disadvantage of producing a logically read-only operation that consumes its input.
2. If we're ok with `VariantArray` being an exclusively local thing (i.e. never returned from functions), we could make both it and `ShreddingState` take references to their various arrays. That would produce `VariantArray<'a>`, `ShreddingState<'a>`, etc.
3. Make `ShreddingState` track `Option<Cow<...>>` internally, which might solve most of the ergonomics issues of 2 without overly polluting the public API. The only downside is a generic lifetime parameter: `VariantArray<'a>`, `ShreddingState<'a>`, etc.
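Option 1 above (an owned deconstruction) can be sketched with stand-in types; the `into_parts` name and field types here are illustrative, not the arrow-rs API:

```rust
// Stand-ins for the real array internals; illustrative only.
struct Metadata(u32);
struct State(String);

struct VariantArrayLike {
    metadata: Metadata,
    state: State,
}

impl VariantArrayLike {
    // Consumes self and hands back the innards without any cloning --
    // the cost is that a logically read-only operation eats its input.
    fn into_parts(self) -> (Metadata, State) {
        (self.metadata, self.state)
    }
}

fn main() {
    let arr = VariantArrayLike {
        metadata: Metadata(7),
        state: State("shredded".into()),
    };
    let (m, s) = arr.into_parts();
    assert_eq!(m.0, 7);
    assert_eq!(s.0, "shredded");
    println!("parts: {} {}", m.0, s.0);
}
```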
I think BinaryViewArrays also have a Vec that will get cloned. However, I think it is ok to clone for now, as the path / cloning happens once per array and so is likely amortized over a batch.
I tried option 1 briefly but I couldn't get it to work.
I'll let benchmarking be our guide and if we see problems with this clone, then perhaps we can pursue options 2 or 3.
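The cost model being weighed here can be illustrated with a std-only sketch (a simplified analogy, not the actual `ShreddingState` layout): cloning an `Arc`-backed handle is a refcount bump that shares the underlying buffer, rather than a deep copy.

```rust
use std::sync::Arc;

// Simplified stand-in: the buffer is behind an Arc, so `clone` copies a
// pointer and bumps a refcount instead of duplicating the data.
#[derive(Clone)]
struct CheapState {
    buffers: Arc<Vec<u8>>,
}

fn main() {
    let state = CheapState {
        buffers: Arc::new(vec![0u8; 1024]),
    };
    let copy = state.clone();
    // Both handles point at the same allocation.
    assert!(Arc::ptr_eq(&state.buffers, &copy.buffers));
    assert_eq!(Arc::strong_count(&state.buffers), 2);
    println!("strong_count = {}", Arc::strong_count(&state.buffers));
}
```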
```rust
        "expected a VariantArray as the input for variant_get".to_owned(),
    )
})?;
let variant_array = VariantArray::try_new(input)?;
```
I will admit the new try_new is quite a bit more ergonomic
```rust
    .column_by_name("var")
    .unwrap_or_else(|| panic!("No 'var' column found in parquet file {path:?}"));

// the values are read as
```
this code is moved to VariantArray::try_new
@scovich @klion26 and @codephage2020 this PR is ready for review
Pull Request Overview
This PR removes the Array implementation from VariantArray and ShreddedVariantFieldArray and introduces a new VariantType extension type to properly represent Variant data in the Arrow ecosystem. This change enables proper integration with Parquet I/O operations which previously failed due to implementation issues.
- Removes `Array` trait implementations that were causing Parquet integration problems
- Introduces `VariantType` extension type following Arrow conventions for canonical representation
- Updates all code to use explicit conversions between `VariantArray` and `ArrayRef`
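The explicit-conversion style summarized above can be sketched with stand-in types (illustrative names only, not the actual parquet-variant-compute API): a fallible `try_new` replaces downcasting into the typed wrapper, and an infallible `into_inner` converts back out.

```rust
use std::fmt;

// Stand-ins for arrow types, used only to illustrate the conversion shape.
#[derive(Debug, Clone, PartialEq)]
struct StructArrayLike {
    fields: Vec<&'static str>,
}

#[derive(Debug)]
struct VariantArrayLike {
    inner: StructArrayLike,
}

#[derive(Debug)]
struct ConversionError(String);

impl fmt::Display for ConversionError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

impl VariantArrayLike {
    // Fallible conversion *into* the typed wrapper, replacing downcasts.
    fn try_new(inner: StructArrayLike) -> Result<Self, ConversionError> {
        if inner.fields.contains(&"metadata") {
            Ok(Self { inner })
        } else {
            Err(ConversionError("missing metadata field".into()))
        }
    }

    // Infallible conversion back *out* to the generic representation.
    fn into_inner(self) -> StructArrayLike {
        self.inner
    }
}

fn main() {
    let generic = StructArrayLike { fields: vec!["metadata", "value"] };
    let variant = VariantArrayLike::try_new(generic.clone()).unwrap();
    assert_eq!(variant.into_inner(), generic);
    assert!(VariantArrayLike::try_new(StructArrayLike { fields: vec![] }).is_err());
    println!("conversions ok");
}
```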
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| parquet-variant-compute/src/variant_array.rs | Implements VariantType extension type, removes Array trait impls, adds conversion methods |
| parquet-variant-compute/src/variant_get.rs | Updates to use explicit VariantArray conversions instead of downcasting |
| parquet-variant-compute/src/variant_to_arrow.rs | Updates return type construction for VariantArray |
| parquet-variant-compute/src/variant_array_builder.rs | Updates VariantArray construction call |
| parquet-variant-compute/src/lib.rs | Exports new VariantType |
| parquet/tests/variant_integration.rs | Removes casting workaround code that's now handled internally |
| parquet-variant-compute/benches/variant_kernels.rs | Updates array conversion |
```diff
             inner: inner.clone(),
             metadata: metadata.clone(),
-            shredding_state: ShreddingState::try_new(value, typed_value)?,
+            shredding_state: ShreddingState::new(value, typed_value),
```
Copilot AI (Sep 19, 2025)
The ShreddingState::new method is called here but the old try_new method was removed. Consider adding error handling if ShreddingState::new can potentially fail, or document why this is safe by construction.
After merging up from #8366, a new shred_object test is now failing. I don't fully understand what this means and haven't had a chance to debug it. @scovich any ideas?
The error means that some non-nullable column has null mask bits that are not covered by any ancestor column. The fields of a shredded variant object are not nullable -- NULL is encoded by the child. It's also possible that some struct (or variant array) is being wrongly constructed with some non-nullable fields that should actually be nullable?
The offending array is below. I notice a couple weird things:

@alamb -- Somehow you've managed to trigger a latent bug in my variant shredding code. I have no idea how the test ever passed before, nor why your PR should have anything to do with anything. But here's the fix that I confirmed unbreaks your PR when I stacked them both locally:
# Which issue does this PR close?
- Fast-follow for #8366
- Related to #8392

# Rationale for this change
Somehow, #8392 exposes a latent bug in #8366, which has improper NULL handling for shredded object fields. The shredding PR originally attempted to handle this case, but somehow the test did not trigger the bug and so the (admittedly incomplete) code was removed. See #8366 (comment). To be honest, I have no idea how the original ever worked correctly, nor why the new PR is able to expose the problem.

# What changes are included in this PR?
When used as a top-level builder, `VariantToShreddedVariantRowBuilder::append_null` must append NULL to its own `NullBufferBuilder`; but when used as a shredded object field builder, it must append non-NULL. Plumb a new `top_level` parameter through the various functions and into the two sub-builders it relies on, so they can implement the correct semantics.

# Are these changes tested?
In theory, yes (I don't know how the object shredding test ever passed). And it fixes the breakage in #8392.

# Are there any user-facing changes?
No
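The `top_level` plumbing described in that fix can be sketched with a std-only stand-in builder (names and structure are illustrative, not the actual parquet-variant-compute code): a top-level builder records NULL in its own null mask, while a shredded-field builder records a non-NULL slot because its parent encodes the NULL.

```rust
// Illustrative stand-in for a row builder with a top_level flag.
struct RowBuilder {
    top_level: bool,
    nulls: Vec<bool>, // true = valid slot, false = null slot
}

impl RowBuilder {
    fn append_null(&mut self) {
        if self.top_level {
            // Top-level builder: record NULL in its own null buffer.
            self.nulls.push(false);
        } else {
            // Shredded object field: the parent encodes NULL, so this
            // builder appends a non-NULL placeholder slot instead.
            self.nulls.push(true);
        }
    }
}

fn main() {
    let mut top = RowBuilder { top_level: true, nulls: vec![] };
    let mut field = RowBuilder { top_level: false, nulls: vec![] };
    top.append_null();
    field.append_null();
    assert_eq!(top.nulls, vec![false]);
    assert_eq!(field.nulls, vec![true]);
    println!("ok");
}
```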
Thanks! I merged that in and merged up from main
scovich left a comment
LGTM. The only open question was about shredding state cloning, but that doesn't affect correctness AFAIK
```rust
// Note don't check for metadata/value fields here because they may be
// absent in shredded variants
```
This doesn't sound quite right -- AFAIK the extension type should only annotate the top-level variant. Any shredded object fields or array elements nested inside should be identified according to spec, because at that point we know we're traversing a shredded variant (by virtue of having stepped into the typed_value column of a VariantArray). If anything looks out of place when trying to step into a shredded struct field or array element, that's just plain an error. No annotations needed to figure that out.
Even if, for some reason, we decide it's important not to require metadata, we should still enforce that the only fields present are some subset of metadata, value and typed_value, where the first two are binary. TBD whether we recurse into typed_value to verify the rest of the structure -- we probably should, but that could become expensive if people are converting from ArrayRef to VariantArray to ArrayRef a lot.
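The field-subset check suggested above can be sketched as a small std-only function (hypothetical name and signature, not the arrow-rs API): reject any field that is not one of `metadata`, `value`, or `typed_value`.

```rust
// Hypothetical sketch: validate that a shredded variant struct contains
// only a subset of the allowed field names.
fn validate_shredded_fields(field_names: &[&str]) -> Result<(), String> {
    const ALLOWED: [&str; 3] = ["metadata", "value", "typed_value"];
    for name in field_names {
        if !ALLOWED.contains(name) {
            return Err(format!("unexpected field in shredded variant: {name}"));
        }
    }
    Ok(())
}

fn main() {
    // Any subset of the allowed names is fine; unknown fields are errors.
    assert!(validate_shredded_fields(&["value", "typed_value"]).is_ok());
    assert!(validate_shredded_fields(&["value", "extra"]).is_err());
    println!("ok");
}
```

A real implementation would additionally check that `metadata` and `value` are binary-typed, per the spec text quoted above.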
sounds good -- removed the comments
What about the missing validation tho?
Co-authored-by: Ryan Johnson <[email protected]>
LGTM, thanks for the improvement.
codephage2020 left a comment
Looks good to me!
```rust
}

fn supports_data_type(&self, data_type: &DataType) -> Result<(), ArrowError> {
    if matches!(data_type, DataType::Struct(_)) {
```
I assume we want to validate the rules from https://github.com/apache/arrow/blob/d803afcc43f5d132506318fd9e162d33b2c3d4cd/docs/source/format/CanonicalExtensions.rst?plain=1#L442-L465 here?
Yes, for sure; however the current implementation only supports BinaryView. Here is a ticket for supporting other types:
Co-authored-by: Matthijs Brobbel <[email protected]>
…alamb/variant_array
Co-authored-by: Ryan Johnson <[email protected]>
Thanks everyone for your comments and reviews. I think I have addressed them all, so I will merge this PR and open the next one to add support for Parquet logical type annotations
…mp_nano (#8401)

# Which issue does this PR close?
- Closes #8331

# Rationale for this change
- Add typed_access for `Timestamp(Micro, _)` and `Timestamp(Nano, _)`

# What changes are included in this PR?
- Extract some data gen logic in tests to simplify the test logic (commit 93090d5); it was based on some old code (before #8392), rebased on master in the last commit
- Add typed_access for `Timestamp(Micro, _)` and `Timestamp(Nano, _)`
- Add tests for typed_access for `Timestamp(Micro, _)` and `Timestamp(Nano, _)`

# Are these changes tested?
Covered by existing and added tests

# Are there any user-facing changes?
No
…lType (#8408)

# Which issue does this PR close?
- Closes #8370
- Closes #8365

# Rationale for this change
Parquet has logical types, which is how other writers signal what columns contain `Variant` values.

# What changes are included in this PR?
1. Add mapping from Parquet LogicalType to/from the Arrow ExtensionType added in #8392
2. Documentation and tests showing reading/writing Parquet files with the Variant logical annotation

# Are these changes tested?
Yes, new unit tests and doc examples

# Are there any user-facing changes?
You can now read/write Variant columns to `VariantArray`

Co-authored-by: Matthijs Brobbel <[email protected]>
Which issue does this PR close?
- `VariantArray::data_type` returns `StructType`, causing `Array::as_struct` to panic #8319

Rationale for this change
This is needed to read/write the Variant Parquet logical type and work with the rest of the Arrow ecosystem.
Note, this is broken out of the larger PR here:
- `VariantArray` to parquet with `VariantLogicalType` #8365

We need a way to write Variant encoded data to/from parquet, and the current way the VariantArray is implemented doesn't work (it panics when writing to parquet). See the tickets above.
Instead of an `impl Array`, it seems the better way to do this is using an Arrow Extension Type. See #8319 (comment) for more details.

What changes are included in this PR?
- Remove the `Array` impl for `VariantArray`, which forces explicit conversions back/forth when reading/writing
- Remove the `Array` impl for `ShreddedVariantFieldArray`, which forces explicit conversions back/forth when reading/writing
- Add the `VariantType` extension type

Are these changes tested?
Yes, with new code and tests

Are there any user-facing changes?
Yes, but this is not yet stable / released, so these changes have no impact on the releasability of this code