@klion26
Member

@klion26 klion26 commented Sep 22, 2025

Which issue does this PR close?

Rationale for this change

  • Add typed_access for Timestamp(Micro, _) and Timestamp(Nano, _)

What changes are included in this PR?

Are these changes tested?

Covered by existing and added tests

Are there any user-facing changes?

No

@github-actions github-actions bot added the parquet-variant parquet-variant* crates label Sep 22, 2025
@klion26 klion26 force-pushed the 8331_support_typed_access_for_timestamp branch 2 times, most recently from 984fbcb to 0748c0e Compare September 22, 2025 02:58
@klion26 klion26 marked this pull request as draft September 22, 2025 03:28
@github-actions github-actions bot added the parquet Changes to the parquet crate label Sep 23, 2025
@klion26 klion26 force-pushed the 8331_support_typed_access_for_timestamp branch from e003775 to aa9f8a9 Compare September 23, 2025 06:20
@klion26 klion26 marked this pull request as ready for review September 23, 2025 06:30
@klion26
Member Author

klion26 commented Sep 23, 2025

@alamb @scovich Please help review this when you're free, thanks.

Contributor

@scovich scovich left a comment

LGTM, other than a panic that needs to be removed.

Your call whether to deal with timestamp precision widening in this PR or as a follow-on effort (maybe it depends on whether there's any controversy about the approach)

)
}
// Variant timestamp only support time unit with microsecond or nanosecond precision
_ => panic!(
Contributor

Can we return an error instead of panicking?

Also, AFAIK all the other timestamp precisions can be widened losslessly to microsecond; should we consider doing that instead of blowing up?

Contributor

yes, let's not panic here

Another potential approach is to return a not-yet-implemented error -- maybe by matching the supported types explicitly, like so:

        DataType::Timestamp(TimeUnit::Microsecond, Some(_)) => {
            generic_conversion_single_value!(
                TimestampMicrosecondType,
                as_primitive,
                |v| DateTime::from_timestamp_micros(v).unwrap(),
                typed_value,
                index
            )
        }

instead of

        DataType::Timestamp(timeunit, tz) => {
            match (timeunit, tz) {
                (TimeUnit::Microsecond, Some(_)) => {
                    generic_conversion_single_value!(
                        TimestampMicrosecondType,
                        as_primitive,
                        |v| DateTime::from_timestamp_micros(v).unwrap(),
                        typed_value,
                        index
                    )
                }

Contributor

I'm always a fan of less nesting, when possible...
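
A minimal sketch of the flat-match idea from this thread, assuming simplified stand-in types rather than the real arrow `DataType` and `TimeUnit`. The enums, `VariantTimestampKind`, and `timestamp_kind` below are hypothetical and exist only for illustration; the error text borrows the "Unsupported typed_value type" wording used by the integration tests. Each supported combination gets its own arm, and everything else falls through to an error instead of a panic:

```rust
// Hypothetical stand-ins for arrow's `TimeUnit` and `DataType`; only the
// shape of the match matters for this sketch.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum TimeUnit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    Timestamp(TimeUnit, Option<String>),
}

// Hypothetical label for the four timestamp flavors discussed in the thread.
#[derive(Debug, PartialEq)]
enum VariantTimestampKind {
    MicrosTz,
    MicrosNtz,
    NanosTz,
    NanosNtz,
}

// One flat arm per supported type; anything else is an error, not a panic.
#[allow(dead_code)]
fn timestamp_kind(data_type: &DataType) -> Result<VariantTimestampKind, String> {
    match data_type {
        DataType::Timestamp(TimeUnit::Microsecond, Some(_)) => Ok(VariantTimestampKind::MicrosTz),
        DataType::Timestamp(TimeUnit::Microsecond, None) => Ok(VariantTimestampKind::MicrosNtz),
        DataType::Timestamp(TimeUnit::Nanosecond, Some(_)) => Ok(VariantTimestampKind::NanosTz),
        DataType::Timestamp(TimeUnit::Nanosecond, None) => Ok(VariantTimestampKind::NanosNtz),
        other => Err(format!("Unsupported typed_value type: {:?}", other)),
    }
}
```

With this shape, an unsupported column such as `DataType::Timestamp(TimeUnit::Second, None)` yields an `Err` that the caller can surface, and adding a new supported type later means adding one more arm rather than another level of nesting.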

Member Author

@klion26 klion26 Sep 24, 2025

Thanks for pointing this out. Yes, TimeUnit::Second and TimeUnit::Millisecond can be converted to TimeUnit::Microsecond losslessly. I used panic here because this value is located in the typed_value of the Variant, and per the Variant spec, Timestamp (with or without timezone) only has Micro and Nano units, so I assumed any other unit is an error (this could not happen?). I can add the Second and Millisecond units if this can happen.

Contributor

@scovich scovich Sep 24, 2025

I used panic here because this value is located in the typed_value of the Variant, and per the Variant spec, Timestamp (with or without timezone) only has Micro and Nano units, so I assumed any other unit is an error (this could not happen?)

Good point! There are definitely unit tests (parquet/tests/variant_integration.rs) that expect non-variant types in typed_value columns to produce an error (cases 127 and 137, specifically). So we should not widen on the read path, but rather reject.

Widening would seem to make sense on the shredding write path, tho?

Contributor

See also #8435, which I noticed while trying to add struct support

Member Author

IIUC, we need to widen to TimeUnit::Microsecond when writing to a variant, as TimeUnit::Microsecond is the coarsest precision we support.

Contributor

Exactly. We have flexibility on the write path to decide whether/how we widen or convert values that don't exactly fit the variant type system, but readers expect the resulting types to always be valid variant shredding types.
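
The read/write asymmetry agreed on here could look roughly like the sketch below. It assumes a simplified stand-in `TimeUnit` enum rather than arrow's, and the function names `widen_to_micros` and `check_read_unit` as well as the error text are hypothetical. The write path widens coarser units to microseconds, while the read path rejects anything that is not already a valid variant timestamp unit. One caveat the sketch makes explicit: the multiplication can overflow `i64` for extreme values, so it uses `checked_mul` rather than assuming widening always succeeds.

```rust
// Hypothetical stand-in for arrow's `TimeUnit`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TimeUnit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

// Write path: widen a coarser timestamp value to microseconds.
// Returns None if the multiplication overflows i64, and also for
// nanosecond input, which is already a valid variant unit and would be
// written as-is rather than converted.
#[allow(dead_code)]
fn widen_to_micros(value: i64, unit: TimeUnit) -> Option<i64> {
    match unit {
        TimeUnit::Second => value.checked_mul(1_000_000),
        TimeUnit::Millisecond => value.checked_mul(1_000),
        TimeUnit::Microsecond => Some(value),
        TimeUnit::Nanosecond => None,
    }
}

// Read path: accept only the units the variant type system allows and
// reject everything else with an error (no widening, no panic).
#[allow(dead_code)]
fn check_read_unit(unit: TimeUnit) -> Result<TimeUnit, String> {
    match unit {
        TimeUnit::Microsecond | TimeUnit::Nanosecond => Ok(unit),
        other => Err(format!("unsupported timestamp unit in typed_value: {:?}", other)),
    }
}
```

For example, `widen_to_micros(1_500, TimeUnit::Millisecond)` gives `Some(1_500_000)`, while `check_read_unit(TimeUnit::Millisecond)` returns an error, matching the point that readers expect typed_value columns to already hold valid variant shredding types.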

Contributor

@alamb alamb left a comment

Thank you @klion26 and @scovich

variant_test_case!(31);
// https://github.com/apache/arrow-rs/issues/8334
variant_test_case!(32, "Unsupported typed_value type: Time64(Microsecond)");
// https://github.com/apache/arrow-rs/issues/8331
Contributor

🎉

@klion26
Member Author

klion26 commented Sep 24, 2025

I'm rebasing on main to resolve the conflicts.
Rebased on main.

@klion26 klion26 force-pushed the 8331_support_typed_access_for_timestamp branch from 958e8b0 to 3019a2e Compare September 24, 2025 14:40
Contributor

@alamb alamb left a comment

Thank you @klion26 and @scovich

])
});

partially_shredded_variant_array_gen!(
Contributor

thank you for reducing the boilerplate

@alamb
Contributor

alamb commented Sep 24, 2025

I merged up from main to resolve a conflict

@alamb
Contributor

alamb commented Sep 25, 2025

@klion26 -- this PR seems to have accumulated conflicts -- I apologize for that. Can you please resolve the conflicts so I can merge it in?

@klion26 klion26 force-pushed the 8331_support_typed_access_for_timestamp branch from 69919eb to bb4c3eb Compare September 26, 2025 01:04
@klion26
Member Author

klion26 commented Sep 26, 2025

@alamb I've rebased on main and resolved the conflicts. Please take another look, thanks.

Contributor

@alamb alamb left a comment

Thank you @klion26 🙏 🙏

@alamb alamb merged commit 56cdfa7 into apache:main Sep 26, 2025
19 checks passed
@klion26 klion26 deleted the 8331_support_typed_access_for_timestamp branch September 29, 2025 07:30
Labels

parquet Changes to the parquet crate parquet-variant parquet-variant* crates

Development

Successfully merging this pull request may close these issues.

[Variant] [Shredding] Support typed_access for Timestamp(Microsecond, _) and Timestamp(Nanosecond, _)

3 participants