[WIP] Support Shredded Lists/Array in variant_get
#8354
base: main
Left a couple comments that are hopefully helpful.
Also, we should (eventually) support nesting -- arrays and structs inside arrays.
Let's get simple lists of primitives working first, tho!
let main_struct = crate::variant_array::StructArrayBuilder::new()
    .with_field("metadata", Arc::new(metadata_array))
    .with_field("value", Arc::new(value_array))
    .with_field("typed_value", Arc::new(typed_value_array))
Check the variant shredding spec for arrays -- the typed_value for a shredded variant array is a non-nullable group called element, with child fields typed_value and value for shredded and unshredded list elements, respectively.
And then we'll need to build an appropriate GenericListArray out of this string array you built, which gives the offsets for each sub-list.
Thanks for this too. I was under the wrong impression that the metadata encoding stores the offsets for the actual values. Reading your #8359 and rereading the Variant Encoding spec, I see that the value offsets are within the value encoding itself.
So the outermost typed_value should be a GenericListArray of element, i.e. VariantObjects with {value and typed_value fields}?
yes, exactly! And element is non-nullable (**), while the two children are nullable.
(**) As always, in arrow, it can still have null entries, but only if its parent is already NULL for the same row (so nobody can ever observe a non-null element)
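To make the layout discussed in this thread concrete, here is a minimal plain-Rust sketch (std only, not the arrow-rs types). `ShreddedListModel`, `Element`, `row`, and `example` are hypothetical names used only for illustration: `offsets` stands in for the offsets of a GenericListArray, each `Element` carries the two nullable children, and `valid` models the row-level nullability of the outer typed_value. The bytes in the unshredded `value` are placeholders, not a claim about the real encoding.

```rust
/// One list element: models the non-nullable `element` group, whose two
/// children (`typed_value` and `value`) are each nullable.
#[derive(Debug, Clone, PartialEq)]
pub struct Element {
    /// Shredded value, present when the element matched the shredding type.
    pub typed_value: Option<String>,
    /// Unshredded variant bytes, present when it did not (placeholder bytes here).
    pub value: Option<Vec<u8>>,
}

/// Hypothetical stand-in for the outer `typed_value` list column.
pub struct ShreddedListModel {
    /// `offsets[i]..offsets[i + 1]` is the element range of row `i`.
    pub offsets: Vec<i32>,
    /// All elements of all rows, stored back to back.
    pub elements: Vec<Element>,
    /// Row-level validity: `false` means the row is not a shredded list.
    pub valid: Vec<bool>,
}

impl ShreddedListModel {
    /// Returns the sub-list for row `i`, or `None` when the row is null.
    pub fn row(&self, i: usize) -> Option<&[Element]> {
        if !self.valid[i] {
            return None;
        }
        let start = self.offsets[i] as usize;
        let end = self.offsets[i + 1] as usize;
        Some(&self.elements[start..end])
    }
}

/// Three rows: ["a", "b"], NULL, ["c", <unshredded element>].
pub fn example() -> ShreddedListModel {
    let shredded = |s: &str| Element {
        typed_value: Some(s.to_string()),
        value: None,
    };
    let unshredded = Element {
        typed_value: None,
        value: Some(vec![0x0c, 42]), // placeholder variant bytes
    };
    ShreddedListModel {
        offsets: vec![0, 2, 2, 4],
        elements: vec![shredded("a"), shredded("b"), shredded("c"), unshredded],
        valid: vec![true, false, true],
    }
}
```

In this model the null row contributes no elements (its two offsets are equal), so no `element` entry is ever observable under a null parent.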
…ed_list_support
scovich
left a comment
I'm not sure I understand how these unit tests will translate to variant_get?
Could you elaborate please? I am currently trying to build just the Shredded
No worries -- the current iteration does look like it produces a correct shredded variant containing a list, so I should probably just be patient and let you finish!
My question is: does The reason I am asking is that since we use the output of The only way to work with list arrays I came up with so far, is to build new arrays with
// Build the list of indices to take
let mut take_indices = Vec::with_capacity(list_len);
for i in 0..list_len {
    let start = offsets[i] as usize;
    let end = offsets[i + 1] as usize;
    let len = end - start;

    if *index < len {
        take_indices.push(Some((start + index) as u32));
    } else {
        take_indices.push(None);
    }
}

let index_array = UInt32Array::from(take_indices);

// Use Arrow compute kernel to gather elements
let taken = take(field_array, &index_array, None)?;
You can see the basic idea here
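The same idea can be checked in isolation with a small std-only sketch. `take_indices` and `gather` are hypothetical helper names; `gather` is only a plain-Rust stand-in for what a take kernel does with optional indices, not the arrow-rs `take` function itself.

```rust
/// For each row, compute the position in the flattened child array of the
/// element at `index`, or `None` when the row has too few elements.
/// `offsets` has one more entry than there are rows.
pub fn take_indices(offsets: &[i32], index: usize) -> Vec<Option<u32>> {
    offsets
        .windows(2)
        .map(|w| {
            let start = w[0] as usize;
            let len = (w[1] - w[0]) as usize;
            // Out-of-bounds access becomes a null entry rather than an error.
            (index < len).then(|| (start + index) as u32)
        })
        .collect()
}

/// Stand-in for a take kernel: gather values by optional index,
/// producing a null wherever the index is null.
pub fn gather<T: Clone>(values: &[Option<T>], indices: &[Option<u32>]) -> Vec<Option<T>> {
    indices
        .iter()
        .map(|&i| i.and_then(|pos| values[pos as usize].clone()))
        .collect()
}
```

For offsets `[0, 2, 2, 5]` and index 1, the rows are `[10, 11]`, `[]`, and `[20, 21, 22]`, so the gathered column holds 11, a null, and 21. Note that this sketch, like the snippet above, does not consult row-level nulls of the list itself.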
…ed_list_support
…ed_list_support
Hey @scovich I made it work for one of the simple tests, but the second one doesn't go through because Variant to Arrow does not support utf8 yet. Do we have an issue tracking variant_to_arrow types support? If not, I can make one.
I'm not sure we have a tracking issue for utf8 support in variant_to_arrow, but I've also noticed that it's an annoying gap for unit testing (we all seem to reach for string values...)
Co-authored-by: Congxian Qiu <[email protected]>
Co-authored-by: Ryan Johnson <[email protected]>
Co-authored-by: Ryan Johnson <[email protected]>
…row-rs into variant_to_arrow_utf8
sdf-jkl
left a comment
Building on top of the utf8 variant_to_arrow support PR.
Changes in generic_bytes_builder.rs, generic_bytes_view_builder.rs and variant_to_arrow.rs are irrelevant to this PR.
Some changes in variant_get.rs and variant_array.rs are also from the utf8 PR, so they can be safely skipped.
Main changes are:
- Adding ShreddingStateCow enum
- Adding VariantPathElement::Index support for unnested List VariantArray
            let Some(list_array) = typed_value.as_any().downcast_ref::<GenericListArray<i32>>()
            else {
                // Downcast failure - if strict cast options are enabled, this should be an error
                if !cast_options.safe {
                    return Err(ArrowError::CastError(format!(
                        "Cannot access index '{}' on non-list type: {}",
                        index,
                        typed_value.data_type()
                    )));
                }
                // With safe cast options, return NULL (missing_path_step)
                return Ok(missing_path_step());
            };

            let offsets = list_array.offsets();
            let values = list_array.values(); // This is a StructArray

            let Some(struct_array) = values.as_any().downcast_ref::<StructArray>() else {
                return Ok(missing_path_step());
            };

            let Some(typed_array) = struct_array.column_by_name("typed_value") else {
                return Ok(missing_path_step());
            };

            // Build the list of indices to take
            let mut take_indices = Vec::with_capacity(list_array.len());
            for i in 0..list_array.len() {
                let start = offsets[i] as usize;
                let end = offsets[i + 1] as usize;
                let len = end - start;

                if *index < len {
                    take_indices.push(Some((start + index) as u32));
                } else {
                    take_indices.push(None);
                }
            }

            let index_array = UInt32Array::from(take_indices);

            // Use Arrow compute kernel to gather elements
            let taken = take(typed_array, &index_array, None)?;

            let metadata_array = BinaryViewArray::from_iter_values(std::iter::repeat_n(
                EMPTY_VARIANT_METADATA_BYTES,
                taken.len(),
            ));

            let struct_array = &StructArrayBuilder::new()
                .with_field("metadata", Arc::new(metadata_array), false)
                .with_field("typed_value", taken, true)
                .build();

            let state = ShreddingState::try_from(struct_array)?;
            Ok(ShreddedPathStep::Success(state.into()))
        }
    }
}
When we use variant_get on a struct VariantArray, it's relatively easy to extract the typed_value. For example, extracting a.b is straightforward because on the inside it's just:
VariantArray {
  StructArray {
    "typed_value":
      StructArray {
        "typed_value": PrimitiveArray,  <- We can directly borrow the value into ShreddingState::Success() because the needed values in the array are contiguous
        "value": VariantArray
But if we try to extract "typed_value" from a List VariantArray it gets more complicated. For example, extracting 0.0:
VariantArray {
  StructArray {
    "typed_value":
      ListArray {
        Offsets
        StructArray {
          "typed_value": PrimitiveArray,  <- but the values are now not contiguous, and the output array can only be extracted using offsets, no borrow available
          "value": VariantArray
Because of this, the ShreddedPathStep::Success returned by follow_shredded_path_element can end up holding either a BorrowedShreddingState or an owned ShreddingState.
To make this work, I added a ShreddingStateCow enum and made it the ShreddedPathStep::Success input.
pub enum ShreddingStateCow<'a> {
    Owned(ShreddingState),
    Borrowed(BorrowedShreddingState<'a>),
}

impl<'a> From<ShreddingState> for ShreddingStateCow<'a> {
    fn from(s: ShreddingState) -> Self {
        Self::Owned(s)
    }
}
impl<'a> From<BorrowedShreddingState<'a>> for ShreddingStateCow<'a> {
    fn from(s: BorrowedShreddingState<'a>) -> Self {
        Self::Borrowed(s)
    }
}

impl<'a> ShreddingStateCow<'a> {
    /// Always gives the caller a borrowed view, even if we own internally.
    pub fn as_view(&self) -> BorrowedShreddingState<'_> {
        match self {
            ShreddingStateCow::Borrowed(b) => b.clone(),
            ShreddingStateCow::Owned(o) => o.borrow(),
        }
    }

    /// Materialize ownership when the caller needs to keep it.
    pub fn into_owned(self) -> ShreddingState {
        match self {
            ShreddingStateCow::Borrowed(b) => b.into(),
            ShreddingStateCow::Owned(o) => o,
        }
    }
}
Here is the new ShreddingStateCow enum implementation
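For readers who want to try the pattern outside the crate, here is a std-only model of the same owned-or-borrowed enum. `Column`, `OwnedState`, `BorrowedState`, and `StateCow` are hypothetical stand-ins for the PR's types; the sketch assumes the owned state holds reference-counted columns (Arrow's `ArrayRef` is an `Arc<dyn Array>`), so converting a borrowed view into an owned state only bumps a reference count.

```rust
use std::sync::Arc;

/// Stand-in for a reference-counted Arrow array.
pub type Column = Arc<Vec<i64>>;

/// Owned state: holds its own handle to the column.
#[derive(Debug, Clone)]
pub struct OwnedState {
    pub typed_value: Option<Column>,
}

/// Borrowed state: points at a column handle owned elsewhere.
#[derive(Debug, Clone)]
pub struct BorrowedState<'a> {
    pub typed_value: Option<&'a Column>,
}

impl OwnedState {
    /// Cheap view into the owned state.
    pub fn borrow(&self) -> BorrowedState<'_> {
        BorrowedState {
            typed_value: self.typed_value.as_ref(),
        }
    }
}

impl<'a> From<BorrowedState<'a>> for OwnedState {
    fn from(b: BorrowedState<'a>) -> Self {
        // Cloning the Arc copies the pointer, not the underlying data.
        OwnedState {
            typed_value: b.typed_value.cloned(),
        }
    }
}

/// Either an owned state or a view borrowed from the caller's input.
pub enum StateCow<'a> {
    Owned(OwnedState),
    Borrowed(BorrowedState<'a>),
}

impl<'a> StateCow<'a> {
    /// Always hands out a borrowed view, whichever variant is held.
    pub fn as_view(&self) -> BorrowedState<'_> {
        match self {
            StateCow::Borrowed(b) => b.clone(),
            StateCow::Owned(o) => o.borrow(),
        }
    }

    /// Materializes ownership when the caller needs to keep the state.
    pub fn into_owned(self) -> OwnedState {
        match self {
            StateCow::Borrowed(b) => b.into(),
            StateCow::Owned(o) => o,
        }
    }
}
```

In this model, calling `into_owned` on a `Borrowed` value yields a second handle to the same allocation, which can be confirmed with `Arc::ptr_eq`.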
    );
}
- shredding_state = state;
+ shredding_state = ShreddingStateCow::Owned(state.into_owned());
Here I could not come up with a way to make the shredding_state for the next path_element be either borrowed or owned, depending on the follow_shredded_path_element output.
I made it into_owned() just to pass the borrow checker.
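The borrow-checker constraint can be reproduced in a small std-only model. All names here (`Column`, `OwnedState`, `BorrowedState`, `StateCow`, `step`, `follow`) are hypothetical and the step logic is invented purely for illustration. The point of the sketch: the result of a step may borrow from the loop variable, so it has to be detached with `into_owned()` before the variable is reassigned. Assuming the state holds reference-counted columns, that detachment is a pointer copy rather than a data copy.

```rust
use std::sync::Arc;

/// Stand-in for a reference-counted Arrow array.
pub type Column = Arc<Vec<i64>>;

#[derive(Clone)]
pub struct OwnedState {
    pub typed_value: Column,
}

pub struct BorrowedState<'a> {
    pub typed_value: &'a Column,
}

pub enum StateCow<'a> {
    Owned(OwnedState),
    Borrowed(BorrowedState<'a>),
}

impl<'a> StateCow<'a> {
    pub fn as_view(&self) -> BorrowedState<'_> {
        match self {
            StateCow::Borrowed(b) => BorrowedState { typed_value: b.typed_value },
            StateCow::Owned(o) => BorrowedState { typed_value: &o.typed_value },
        }
    }

    pub fn into_owned(self) -> OwnedState {
        match self {
            // Arc clone: bumps the reference count, no data copy.
            StateCow::Borrowed(b) => OwnedState { typed_value: Arc::clone(b.typed_value) },
            StateCow::Owned(o) => o,
        }
    }
}

/// Invented path step: even steps pass the column through (borrowed),
/// odd steps build a new column (owned), like a list lookup would.
fn step<'a>(view: BorrowedState<'a>, n: usize) -> StateCow<'a> {
    if n % 2 == 0 {
        StateCow::Borrowed(view)
    } else {
        let rebuilt: Vec<i64> = view.typed_value.iter().map(|v| v + 1).collect();
        StateCow::Owned(OwnedState { typed_value: Arc::new(rebuilt) })
    }
}

pub fn follow(input: &Column, steps: usize) -> OwnedState {
    let mut state = StateCow::Borrowed(BorrowedState { typed_value: input });
    for n in 0..steps {
        // The step result may borrow from `state`, so detach it first;
        // assigning it back directly would conflict with that borrow.
        let next = step(state.as_view(), n).into_owned();
        state = StateCow::Owned(next);
    }
    state.into_owned()
}
```

When every step only passes the column through, the final state still points at the caller's original allocation, so the extra `into_owned()` calls cost reference-count updates only.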
Hey @scovich, I'm ready for another go when you are available, thanks.
Which issue does this PR close?
variant_get #8082.
Rationale for this change
We should be able to read lists using variant_get.
What changes are included in this PR?
Are these changes tested?
I'm trying to start with some basic tests to do some TDD.
Are there any user-facing changes?