Commit 911940f

jecsand838, scovich, mbrobbel authored
Added List and Struct Encoding to arrow-avro Writer (#8274)
# Which issue does this PR close?

- Part of #4886

# Rationale for this change

This refactor streamlines the `arrow-avro` writer by introducing a single, schema‑driven `RecordEncoder` that plans writes up front and encodes rows using consistent, explicit rules for nullability and type dispatch. It reduces duplication in nested/struct/list handling, makes the order of Avro union branches (null‑first vs null‑second) an explicit choice, and aligns header schema generation with value encoding. This should improve correctness (especially for nested optionals), make behavior easier to reason about, and pave the way for future optimizations.

# What changes are included in this PR?

**High‑level:**

* Introduces a unified, schema‑driven `RecordEncoder` with a builder that walks the Avro record in Avro order and maps each field to its Arrow column, producing a reusable write plan. The encoder covers scalars and nested types (struct, (large) lists, maps, strings/binaries).
* Applies a single model of **nullability** throughout encoding, including nested sites (list items, fixed‑size list items, map values), and uses explicit union‑branch indices according to the chosen order.

**API and implementation details:**

* **Writer / encoder refactor**
  * Replaces the previous per‑column/child encoding paths with a **`FieldPlan`** tree (variants for `Scalar`, `Struct { … }`, and `List { … }`) and per‑site `nullability` carried from the Avro schema.
  * Adds encoder variants for `LargeBinary`, `Utf8`, `Utf8Large`, `List`, `LargeList`, and `Struct`.
  * Encodes union branch indices with `write_optional_index` (writes `0x00/0x02` according to Null‑First/Null‑Second), replacing the old branch write.
* **Schema generation & metadata**
  * Moves the **`Nullability`** enum to `schema.rs` and threads it through schema generation and writer logic.
  * Adds `AvroSchema::from_arrow_with_options(schema, Option<Nullability>)` to either reuse embedded Avro JSON or build new Avro JSON that **honors the requested null‑union order at all nullable sites**.
  * Adds `extend_with_passthrough_metadata` so Arrow schema metadata is copied into Avro JSON while skipping Avro‑reserved and internal Arrow keys.
  * Introduces helpers like `wrap_nullable` and `arrow_field_to_avro_with_order` to apply ordering consistently for arrays, fixed‑size lists, maps, structs, and unions.
* **Format and glue**
  * Simplifies `writer/format.rs` by removing the `EncoderOptions` plumbing from the OCF format; `write_long` remains exported for header writing.

# Are these changes tested?

Yes.

* Adds focused unit tests in `writer/encoder.rs` that verify scalar and string/binary encodings (e.g., Binary/LargeBinary, Utf8/LargeUtf8) and validate length/branch encoding primitives used by the writer.
* Round trip integration tests that validate List and Struct decoding in `writer/mod.rs`.
* Adjusts existing schema tests (e.g., decimal metadata expectations) to align with the new schema/metadata handling.

# Are there any user-facing changes?

N/A because arrow-avro is not public yet.

---------

Co-authored-by: Ryan Johnson <[email protected]>
Co-authored-by: Matthijs Brobbel <[email protected]>
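To make the `0x00/0x02` branch bytes in the commit message concrete, here is a minimal sketch of how a union branch index could be written for an optional value. It assumes Avro's standard zig-zag varint encoding for `long`. The `write_long` and `write_branch_index` functions below are standalone illustrations with hypothetical signatures, not the crate's actual `write_long` or `write_optional_index`.

```rust
/// Position of the null branch in a two-variant Avro union.
/// Mirrors the enum named in this commit; redefined here so the sketch is self-contained.
#[derive(Debug, Copy, Clone, PartialEq)]
pub enum Nullability {
    NullFirst,
    NullSecond,
}

/// Illustrative zig-zag + varint encoding of an Avro `long`
/// (an assumption for this sketch, not the crate's implementation).
fn write_long(out: &mut Vec<u8>, value: i64) {
    // Zig-zag maps small negative and positive values to small unsigned values.
    let mut z = ((value << 1) ^ (value >> 63)) as u64;
    // Emit 7 bits per byte, setting the high bit while more bytes follow.
    while z >= 0x80 {
        out.push(((z & 0x7F) as u8) | 0x80);
        z >>= 7;
    }
    out.push(z as u8);
}

/// Hypothetical helper: writes the union branch index for one optional value.
/// Branch 0 encodes as 0x00 and branch 1 as 0x02 after zig-zag encoding.
fn write_branch_index(out: &mut Vec<u8>, order: Nullability, is_null: bool) {
    let index = match (order, is_null) {
        // The null branch comes first, or the value branch comes first.
        (Nullability::NullFirst, true) | (Nullability::NullSecond, false) => 0,
        // The remaining cases select the second branch.
        (Nullability::NullFirst, false) | (Nullability::NullSecond, true) => 1,
    };
    write_long(out, index);
}
```

With null-first ordering, a null value is a single `0x00` byte and a present value is `0x02` followed by the encoded payload. Null-second ordering swaps the two bytes.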
1 parent 0c7cb2a commit 911940f

6 files changed: +928 additions, -311 deletions

arrow-avro/src/codec.rs

Lines changed: 2 additions & 15 deletions
@@ -16,8 +16,8 @@
 // under the License.

 use crate::schema::{
-    Attributes, AvroSchema, ComplexType, Enum, PrimitiveType, Record, Schema, Type, TypeName,
-    AVRO_ENUM_SYMBOLS_METADATA_KEY,
+    Attributes, AvroSchema, ComplexType, Enum, Nullability, PrimitiveType, Record, Schema, Type,
+    TypeName, AVRO_ENUM_SYMBOLS_METADATA_KEY,
 };
 use arrow_schema::{
     ArrowError, DataType, Field, Fields, IntervalUnit, TimeUnit, DECIMAL128_MAX_PRECISION,
@@ -28,19 +28,6 @@ use std::borrow::Cow;
 use std::collections::HashMap;
 use std::sync::Arc;

-/// Avro types are not nullable, with nullability instead encoded as a union
-/// where one of the variants is the null type.
-///
-/// To accommodate this we special case two-variant unions where one of the
-/// variants is the null type, and use this to derive arrow's notion of nullability
-#[derive(Debug, Copy, Clone, PartialEq)]
-pub enum Nullability {
-    /// The nulls are encoded as the first union variant
-    NullFirst,
-    /// The nulls are encoded as the second union variant
-    NullSecond,
-}
-
 /// Contains information about how to resolve differences between a writer's and a reader's schema.
 #[derive(Debug, Clone, PartialEq)]
 pub(crate) enum ResolutionInfo {
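
The doc comment on the relocated `Nullability` enum says that two-variant unions containing `null` are special-cased to derive Arrow nullability. The following sketch illustrates that rule in both directions over plain strings. `nullability_of` and `wrap_nullable_json` are hypothetical stand-ins written for this example; they are not the crate's `wrap_nullable` or its schema types, which operate on the Avro schema model rather than raw JSON text.

```rust
/// Position of the null branch in a two-variant Avro union
/// (redefined here so the sketch is self-contained).
#[derive(Debug, Copy, Clone, PartialEq)]
pub enum Nullability {
    NullFirst,
    NullSecond,
}

/// Hypothetical: classify a union by its branch type names.
/// Only two-variant unions with exactly one `null` branch count as nullable.
fn nullability_of(branches: &[&str]) -> Option<Nullability> {
    match branches {
        ["null", other] if *other != "null" => Some(Nullability::NullFirst),
        [other, "null"] if *other != "null" => Some(Nullability::NullSecond),
        _ => None,
    }
}

/// Hypothetical: wrap an Avro type (given as a JSON fragment) in a nullable
/// union, placing the `null` branch according to the requested order.
fn wrap_nullable_json(inner: &str, order: Nullability) -> String {
    match order {
        Nullability::NullFirst => format!("[\"null\",{}]", inner),
        Nullability::NullSecond => format!("[{},\"null\"]", inner),
    }
}
```

Under these assumptions, `["null", "long"]` classifies as `NullFirst`, `["long", "null"]` as `NullSecond`, and a three-variant union such as `["null", "long", "string"]` is not treated as a nullable wrapper.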

arrow-avro/src/reader/record.rs

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@
 // specific language governing permissions and limitations
 // under the License.

-use crate::codec::{AvroDataType, Codec, Nullability, Promotion, ResolutionInfo};
+use crate::codec::{AvroDataType, Codec, Promotion, ResolutionInfo};
 use crate::reader::block::{Block, BlockDecoder};
 use crate::reader::cursor::AvroCursor;
 use crate::reader::header::Header;
