- 
                Notifications
    You must be signed in to change notification settings 
- Fork 1k
[Draft] Implemented casting for RunEnd Encoding (pt2) #8384
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Implement casting between REE arrays and other Arrow types. REE-to-REE casting validates run-end upcasts only (Int16→Int32, Int16→Int64, Int32→Int64) to prevent invalid sequences.
Implement casting between REE arrays and other Arrow types. REE-to-REE casting validates run-end upcasts only (Int16→Int32, Int16→Int64, Int32→Int64) to prevent invalid sequences. rebased changes
| @Rich-T-kid I made some comments on the other PR #7713 | 
| // Both non-null - use slice comparison as a basic approach | ||
| // This is a simplified implementation | ||
| cast_array.slice(i, 1).to_data() == cast_array.slice(i - 1, 1).to_data() | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressing @brancz's comments about not copying #7713 (comment)
| // Both non-null - use slice comparison as a basic approach | |
| // This is a simplified implementation | |
| cast_array.slice(i, 1).to_data() == cast_array.slice(i - 1, 1).to_data() | |
| // Primitive types | |
| DataType::Boolean => { | |
| cast_array.as_boolean().value(i) == cast_array.as_boolean().value(i - 1) | |
| } | |
| DataType::Int8 => { | |
| cast_array.as_primitive::<Int8Type>().value(i) | |
| == cast_array.as_primitive::<Int8Type>().value(i - 1) | |
| } | |
| DataType::Int16 => { | |
| cast_array.as_primitive::<Int16Type>().value(i) | |
| == cast_array.as_primitive::<Int16Type>().value(i - 1) | |
| } | |
| DataType::Int32 => { | |
| cast_array.as_primitive::<Int32Type>().value(i) | |
| == cast_array.as_primitive::<Int32Type>().value(i - 1) | |
| } | |
| DataType::Int64 => { | |
| cast_array.as_primitive::<Int64Type>().value(i) | |
| == cast_array.as_primitive::<Int64Type>().value(i - 1) | |
| } | |
| DataType::UInt8 => { | |
| cast_array.as_primitive::<UInt8Type>().value(i) | |
| == cast_array.as_primitive::<UInt8Type>().value(i - 1) | |
| } | |
| DataType::UInt16 => { | |
| cast_array.as_primitive::<UInt16Type>().value(i) | |
| == cast_array.as_primitive::<UInt16Type>().value(i - 1) | |
| } | |
| DataType::UInt32 => { | |
| cast_array.as_primitive::<UInt32Type>().value(i) | |
| == cast_array.as_primitive::<UInt32Type>().value(i - 1) | |
| } | |
| DataType::UInt64 => { | |
| cast_array.as_primitive::<UInt64Type>().value(i) | |
| == cast_array.as_primitive::<UInt64Type>().value(i - 1) | |
| } | |
| DataType::Float16 => { | |
| cast_array.as_primitive::<Float16Type>().value(i) | |
| == cast_array.as_primitive::<Float16Type>().value(i - 1) | |
| } | |
| DataType::Float32 => { | |
| cast_array.as_primitive::<Float32Type>().value(i) | |
| == cast_array.as_primitive::<Float32Type>().value(i - 1) | |
| } | |
| DataType::Float64 => { | |
| cast_array.as_primitive::<Float64Type>().value(i) | |
| == cast_array.as_primitive::<Float64Type>().value(i - 1) | |
| } | |
| // String types | |
| DataType::Utf8 => { | |
| cast_array.as_string::<i32>().value(i) | |
| == cast_array.as_string::<i32>().value(i - 1) | |
| } | |
| DataType::LargeUtf8 => { | |
| cast_array.as_string::<i64>().value(i) | |
| == cast_array.as_string::<i64>().value(i - 1) | |
| } | |
| DataType::Utf8View => { | |
| cast_array.as_string_view().value(i) == cast_array.as_string_view().value(i - 1) | |
| } | |
| // Binary types | |
| DataType::Binary => { | |
| cast_array.as_binary::<i32>().value(i) | |
| == cast_array.as_binary::<i32>().value(i - 1) | |
| } | |
| DataType::LargeBinary => { | |
| cast_array.as_binary::<i64>().value(i) | |
| == cast_array.as_binary::<i64>().value(i - 1) | |
| } | |
| DataType::BinaryView => { | |
| cast_array.as_binary_view().value(i) == cast_array.as_binary_view().value(i - 1) | |
| } | |
| DataType::FixedSizeBinary(_) => { | |
| cast_array.as_fixed_size_binary().value(i) | |
| == cast_array.as_fixed_size_binary().value(i - 1) | |
| } | |
| // Temporal types | |
| DataType::Date32 => { | |
| cast_array.as_primitive::<Date32Type>().value(i) | |
| == cast_array.as_primitive::<Date32Type>().value(i - 1) | |
| } | |
| DataType::Date64 => { | |
| cast_array.as_primitive::<Date64Type>().value(i) | |
| == cast_array.as_primitive::<Date64Type>().value(i - 1) | |
| } | |
| DataType::Timestamp(time_unit, _) => match time_unit { | |
| TimeUnit::Second => { | |
| cast_array.as_primitive::<TimestampSecondType>().value(i) | |
| == cast_array | |
| .as_primitive::<TimestampSecondType>() | |
| .value(i - 1) | |
| } | |
| TimeUnit::Millisecond => { | |
| cast_array | |
| .as_primitive::<TimestampMillisecondType>() | |
| .value(i) | |
| == cast_array | |
| .as_primitive::<TimestampMillisecondType>() | |
| .value(i - 1) | |
| } | |
| TimeUnit::Microsecond => { | |
| cast_array | |
| .as_primitive::<TimestampMicrosecondType>() | |
| .value(i) | |
| == cast_array | |
| .as_primitive::<TimestampMicrosecondType>() | |
| .value(i - 1) | |
| } | |
| TimeUnit::Nanosecond => { | |
| cast_array | |
| .as_primitive::<TimestampNanosecondType>() | |
| .value(i) | |
| == cast_array | |
| .as_primitive::<TimestampNanosecondType>() | |
| .value(i - 1) | |
| } | |
| }, | |
| DataType::Time32(time_unit) => match time_unit { | |
| TimeUnit::Second => { | |
| cast_array.as_primitive::<Time32SecondType>().value(i) | |
| == cast_array.as_primitive::<Time32SecondType>().value(i - 1) | |
| } | |
| TimeUnit::Millisecond => { | |
| cast_array.as_primitive::<Time32MillisecondType>().value(i) | |
| == cast_array | |
| .as_primitive::<Time32MillisecondType>() | |
| .value(i - 1) | |
| } | |
| TimeUnit::Microsecond | TimeUnit::Nanosecond => { | |
| panic!("Time32 must have a TimeUnit of either seconds or milliseconds") | |
| } | |
| }, | |
| DataType::Time64(time_unit) => match time_unit { | |
| TimeUnit::Second | TimeUnit::Millisecond => { | |
| panic!("Time64 must have a TimeUnit of either microseconds or nanoseconds") | |
| } | |
| TimeUnit::Microsecond => { | |
| cast_array.as_primitive::<Time64MicrosecondType>().value(i) | |
| == cast_array | |
| .as_primitive::<Time64MicrosecondType>() | |
| .value(i - 1) | |
| } | |
| TimeUnit::Nanosecond => { | |
| cast_array.as_primitive::<Time64NanosecondType>().value(i) | |
| == cast_array | |
| .as_primitive::<Time64NanosecondType>() | |
| .value(i - 1) | |
| } | |
| }, | |
| DataType::Duration(time_unit) => match time_unit { | |
| TimeUnit::Second => { | |
| cast_array.as_primitive::<DurationSecondType>().value(i) | |
| == cast_array.as_primitive::<DurationSecondType>().value(i - 1) | |
| } | |
| TimeUnit::Millisecond => { | |
| cast_array | |
| .as_primitive::<DurationMillisecondType>() | |
| .value(i) | |
| == cast_array | |
| .as_primitive::<DurationMillisecondType>() | |
| .value(i - 1) | |
| } | |
| TimeUnit::Microsecond => { | |
| cast_array | |
| .as_primitive::<DurationMicrosecondType>() | |
| .value(i) | |
| == cast_array | |
| .as_primitive::<DurationMicrosecondType>() | |
| .value(i - 1) | |
| } | |
| TimeUnit::Nanosecond => { | |
| cast_array.as_primitive::<DurationNanosecondType>().value(i) | |
| == cast_array | |
| .as_primitive::<DurationNanosecondType>() | |
| .value(i - 1) | |
| } | |
| }, | |
| DataType::Interval(interval_unit) => match interval_unit { | |
| IntervalUnit::YearMonth => { | |
| cast_array.as_primitive::<IntervalYearMonthType>().value(i) | |
| == cast_array | |
| .as_primitive::<IntervalYearMonthType>() | |
| .value(i - 1) | |
| } | |
| IntervalUnit::DayTime => { | |
| cast_array.as_primitive::<IntervalDayTimeType>().value(i) | |
| == cast_array | |
| .as_primitive::<IntervalDayTimeType>() | |
| .value(i - 1) | |
| } | |
| IntervalUnit::MonthDayNano => { | |
| cast_array | |
| .as_primitive::<IntervalMonthDayNanoType>() | |
| .value(i) | |
| == cast_array | |
| .as_primitive::<IntervalMonthDayNanoType>() | |
| .value(i - 1) | |
| } | |
| }, | |
| // Decimal types | |
| DataType::Decimal32(_, _) => { | |
| cast_array.as_primitive::<Decimal32Type>().value(i) | |
| == cast_array.as_primitive::<Decimal32Type>().value(i - 1) | |
| } | |
| DataType::Decimal64(_, _) => { | |
| cast_array.as_primitive::<Decimal64Type>().value(i) | |
| == cast_array.as_primitive::<Decimal64Type>().value(i - 1) | |
| } | |
| DataType::Decimal128(_, _) => { | |
| cast_array.as_primitive::<Decimal128Type>().value(i) | |
| == cast_array.as_primitive::<Decimal128Type>().value(i - 1) | |
| } | |
| DataType::Decimal256(_, _) => { | |
| cast_array.as_primitive::<Decimal256Type>().value(i) | |
| == cast_array.as_primitive::<Decimal256Type>().value(i - 1) | |
| } | |
| // TODO: How to handle REE? | |
| DataType::RunEndEncoded(_, _) => todo!(), | |
| DataType::Null | |
| | DataType::List(_) | |
| | DataType::ListView(_) | |
| | DataType::FixedSizeList(_, _) | |
| | DataType::LargeList(_) | |
| | DataType::LargeListView(_) | |
| | DataType::Struct(_) | |
| | DataType::Union(_, _) | |
| | DataType::Dictionary(_, _) | |
| | DataType::Map(_, _) => false, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two changes together address @brancz's comments here #7713 (comment)
| let cast_array = if array.data_type() == value_type { | ||
| // No casting needed, use the array as-is | ||
| make_array(array.to_data()) | ||
| } else { | ||
| // Cast to the target value type | ||
| cast_with_options(array, value_type, cast_options)? | ||
| }; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| let cast_array = if array.data_type() == value_type { | |
| // No casting needed, use the array as-is | |
| make_array(array.to_data()) | |
| } else { | |
| // Cast to the target value type | |
| cast_with_options(array, value_type, cast_options)? | |
| }; | |
| let cast_array = if array.data_type() == value_type { | |
| array | |
| } else { | |
| &cast_with_options(array, value_type, cast_options)? | |
| }; | 
| let indices = PrimitiveArray::<UInt32Type>::from_iter_values( | ||
| values_indices.iter().map(|&idx| idx as u32), | ||
| ); | ||
| let values_array = take(&cast_array, &indices, None)?; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| let values_array = take(&cast_array, &indices, None)?; | |
| let values_array = take(cast_array, &indices, None)?; | 
| @vegarsti you want to open a new PR that combines the changes (of course keeping @Rich-T-kid’s commits in tact but at this point you’ve done with that should be recognized as well :) )? | 
| 
 Yeah, definitely! | 
| Opened #8589 cc @brancz @Rich-T-kid | 
Which issue does this PR close?
We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.
Rationale for this change
context
For whatever reason, I’m running into an authentication issue when trying to push to the branch I created on my Datadog account. I wanted to push some of the refactoring changes that Brancz mentioned on my previous PRs.
I’ve been very busy with school and work, so I don’t think I’ll be able to continue contributing to the project at the same level as I did during the summer. Hopefully, I can still help out with smaller tasks around the codebase, but I don’t have large chunks of time to devote to REE right now.
Sorry about that—I hope I can still make smaller contributions.