Conversation
I think I'm missing some context: why are we doing this? We should keep our code as simple as possible, and having multiple feature-gated options to export to parquet goes in the opposite direction.
As it turned out, When migrating to using
With the changes in this PR, the crate offers both options for easier migration.
Isn't the fork changing only 3 lines of code and only exists because we use
In general, I'm against this kind of change unless there is a very good reason to make it.
Yes. Supporting other types (e.g.
I'm not talking about
So the blocker for using native Do we really need the struct on which we derive
Not strictly speaking a blocker, since we have a fork with
Not unless we create another layer to borrow Anyhow, since there's no appetite for replacing
Description
Refactors `parquet`-based analytics exporters to add more configuration options, as well as add a `serde`-based exporter (using `serde_arrow` crate).

There are some minor API changes in how these serializers are instantiated and configured, but the default configuration hasn't changed.

The crate now offers two features (both disabled by default):

- `parquet-native`: Serializes the data using the native `parquet` `RecordWriter<T>` implementation (often derived using the `ParquetRecordWriter` macro).
- `parquet-serde`: Generates `parquet` schema using `serde_arrow` and serializes the data using `serde`. Note that this requires both `Serialize` and `DeserializeOwned` being implemented on the exported data, which breaks a common use case of using `&'static str` in the data exports. I suggest using something like `ArcStr` to cover all cases of exporting strings - owned, shared and static.

For easier migration from native to serde serializer, and for usage in integration testing, both serializers now offer a `schema()` function that returns the schema generated for the exported type. There's also the `schema_from_str()` function that parses a string schema for verification. See this crate's `parquet_schema` integration test for an example.

Note that this crate is no longer using the forked versions of `parquet` and `parquet_derive`. In case of using the native serializer, the version of `parquet` used in the consumer should match the one used in this crate, while `parquet_derive` can be of any version, e.g.:

How Has This Been Tested?

Existing tests, with a few new ones to cover serialization and schema matching between multiple implementations of `parquet` serializers.

Due Diligence