"Gentle Introduction to Arrow / Record Batches" #11336 #18051
Excerpt from the new guide:

<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# A Gentle Introduction to Arrow & RecordBatches (for DataFusion users)

```{contents}
:local:
:depth: 2
```

This guide helps DataFusion users understand Arrow and its RecordBatch format. While you may never need to work with Arrow directly, this knowledge becomes valuable when using DataFusion's extension points or debugging performance issues.

**Why Arrow is central to DataFusion**: Arrow provides the unified type system that makes DataFusion possible. When you query a CSV file, join it with a Parquet file, and aggregate results from JSON—it all works seamlessly because every data source is converted to Arrow's common representation. This unified type system, combined with Arrow's columnar format, enables DataFusion to execute efficient vectorized operations across any combination of data sources while benefiting from zero-copy data sharing between query operators.

## Why Columnar? The Arrow Advantage

Apache Arrow is an open **specification** that defines how analytical data should be organized in memory. Think of it as a blueprint that different systems agree to follow, not a database or programming language.
Suggested change:

- Current: Apache Arrow is an open **specification** that defines how analytical data should be organized in memory. Think of it as a blueprint that different systems agree to follow, not a database or programming language.
- Suggested: Apache Arrow is an open **specification** that defines a common way to organize analytical data in memory. Think of it as a set of best practices that different systems agree to follow, not a database or programming language.
I would be hesitant to mention compression here - being an in-memory format, Arrow isn't typically compressed (as compared to something like Parquet).
You're absolutely right - I was thinking further down the line and conflating storage format benefits with in-memory benefits. Arrow's columnar layout enables better compression when written to disk (like Parquet), but that's not relevant for the in-memory processing context. I'll remove this point or rephrase it to focus on the actual in-memory benefits like cache efficiency and SIMD operations.
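To make those in-memory benefits concrete, here is a minimal sketch (illustrative only; the column names and values are invented) of how each column lives in its own typed, contiguous Arrow array in Rust:

```rust
use arrow::array::{Float64Array, Int32Array, StringArray};

fn main() {
    // Each column lives in its own typed, contiguous buffer --
    // that contiguity is what makes scans cache-friendly and SIMD-friendly.
    let ids = Int32Array::from(vec![1, 2, 3]);
    let prices = Float64Array::from(vec![9.99, 14.50, 3.25]);
    let names = StringArray::from(vec!["apple", "banana", "cherry"]);

    assert_eq!(ids.value(1), 2);
    assert_eq!(prices.value(2), 3.25);
    assert_eq!(names.value(0), "apple");
}
```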
Suggested change:

- Current: A **[`RecordBatch`]** represents a horizontal slice of a table—a collection of equal-length columnar arrays sharing the same schema.
- Suggested: A **[`RecordBatch`]** represents a horizontal slice of a table—a collection of equal-length columnar arrays that form a common schema.
I'm not sure about this wording either, but it feels slightly wrong to describe the schema as being shared by the arrays 🤔
Good point about the wording.
How about:
A RecordBatch represents a horizontal slice of a table—a collection of equal-length columnar arrays that conform to a defined schema.
This makes it clearer that the schema defines the structure, and the arrays conform to it, rather than "sharing" it.
I like the wording of that 👍
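To make the "arrays that conform to a defined schema" idea concrete, a minimal sketch of building a [`RecordBatch`] in Rust (the field names and values here are invented for illustration):

```rust
use std::sync::Arc;

use arrow::array::{Int32Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The schema defines the structure; the arrays must conform to it.
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int32, false),
        Field::new("name", DataType::Utf8, true),
    ]));

    // Equal-length columnar arrays: one per field in the schema.
    let ids = Int32Array::from(vec![1, 2, 3]);
    let names = StringArray::from(vec![Some("alice"), Some("bob"), None]);

    // try_new checks that the arrays match the schema (types, nullability, length).
    let batch = RecordBatch::try_new(schema, vec![Arc::new(ids), Arc::new(names)])?;

    assert_eq!(batch.num_rows(), 3);
    assert_eq!(batch.num_columns(), 2);
    Ok(())
}
```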
This section feels a bit misplaced, as some of these downsides were mentioned right above under "Why this matters", so it feels a little inconsistent to have the points stated again right below.
You're right - I essentially repeated the same points. My intention was to show the progression from "too big" (entire table) → "too small" (single rows) → "just right" (batches), but I see it reads as repetitive.
I'll consolidate into a single "Why batches are the sweet spot" section that covers both extremes concisely without redundancy.
Do you have any suggestions I might not be seeing?
We could even simplify it to a single line like "it's more efficient to process data in batches, etc." and put more focus on how/when you interact with RecordBatches directly, rather than having details on why DataFusion uses RecordBatches (if the point of the guide is to ease users into getting familiar with the Arrow API).
I feel the last two properties are a bit mismatched here; they are properties of arrays rather than RecordBatches, but more importantly, in a guide that is meant to be a gentle introduction they seem to be placed here randomly. If someone were to read "Variable-length data (strings, lists) use offset arrays for efficient access", there isn't much to glean from that information (that is relevant to the overall theme of the guide) 🤔
Good catch!
I mixed RecordBatch properties with Array implementation details. These technical details don't help someone understand "why RecordBatches" at a conceptual level. I'll either:
- Remove these details entirely, OR
- Reframe as "What this means for users" (e.g., "Data is immutable, so operations create new batches rather than modifying existing ones")
The offset arrays detail especially doesn't belong in a gentle introduction.
Would you prefer I remove this section or refocus it on user-facing implications?
I do feel it is worth mentioning the immutable aspect of RecordBatches/Arrays, as that is an important detail if you want to get hands-on.
👍
This wording implies Arc is the key to Arrow, though it can be misleading considering that's more of an implementation detail on the Rust side 🤔
You're absolutely right - Arc is a Rust implementation detail, not core to understanding Arrow conceptually. I included it because users will see Arc/ArrayRef in code examples, but I'm giving it too much emphasis. I'll either:
- Move the Arc explanation to a small note: "Note: You'll see Arc in Rust code - it's how Rust safely shares data between threads"
- Remove it entirely and let users learn about Arc when they actually need to write code
Which approach would you prefer?
I lean toward the latter, but I don't know how user-friendly that might turn out 🤔
Maybe just add a small footnote: DataFusion is built around async and holds pointers to the arrays themselves, so Arc is used frequently to wrap these data structures.
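As one possible shape for that footnote, a minimal sketch of the Arc/ArrayRef pattern users will run into in Rust code (the column contents are made up):

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, Int32Array};

fn main() {
    // ArrayRef is an alias for Arc<dyn Array>: a cheap, thread-safe handle
    // to immutable column data.
    let column: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));

    // Cloning the Arc copies a pointer, not the underlying buffer --
    // this is how batches and operators share columns without copying data.
    let shared = Arc::clone(&column);
    assert_eq!(shared.len(), column.len());
}
```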
It might also be worth linking / adding examples of two other APIs that are useful for writing tests:
Suggested change:

- Current: Once you have a [`RecordBatch`], you can query it with DataFusion using a [`MemTable`]. This is useful for testing, processing data from external systems, or combining in-memory data with other sources. The example below creates a batch, wraps it in a [`MemTable`], registers it as a named table, and queries it using SQL—demonstrating how Arrow serves as the bridge between your data and DataFusion's query engine.
- Suggested: Once you have one or more [`RecordBatch`]es, you can query it with DataFusion using a [`MemTable`]. This is useful for testing, processing data from external systems, or combining in-memory data with other sources. The example below creates a batch, wraps it in a [`MemTable`], registers it as a named table, and queries it using SQL—demonstrating how Arrow serves as the bridge between your data and DataFusion's query engine.
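For reference while reviewing that paragraph (the PR contains its own example; this is just a minimal sketch of the RecordBatch → MemTable → SQL flow with invented table and column names):

```rust
use std::sync::Arc;

use arrow::array::Int32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use datafusion::datasource::MemTable;
use datafusion::prelude::SessionContext;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // 1. Build a RecordBatch.
    let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
    )?;

    // 2. Wrap it in a MemTable (one partition containing one batch).
    let table = MemTable::try_new(schema, vec![vec![batch]])?;

    // 3. Register it as a named table and query it with SQL.
    let ctx = SessionContext::new();
    ctx.register_table("my_table", Arc::new(table))?;
    ctx.sql("SELECT id FROM my_table WHERE id > 1").await?.show().await?;
    Ok(())
}
```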
Technically there are ways to mutate the data in place -- for example https://docs.rs/arrow/latest/arrow/array/struct.PrimitiveArray.html#method.unary_mut
However, I would say that is an advanced use case, and if it is mentioned at all it could be just a reference.
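For context, a rough sketch of what that advanced path looks like (assuming the array's buffer is not shared; the values are made up):

```rust
use arrow::array::Int32Array;

fn main() {
    let array = Int32Array::from(vec![1, 2, 3]);

    // unary_mut consumes the array and applies the closure in place when the
    // underlying buffer is uniquely owned; otherwise it hands the array back as Err.
    match array.unary_mut(|v| v * 2) {
        Ok(doubled) => assert_eq!(doubled.value(1), 4),
        Err(original) => println!("buffer was shared; got the array back: {original:?}"),
    }
}
```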
I would say this is more like "avoid row by row operations" rather than buffer management. Maybe something like
Suggested change:

- Current: - **Buffer management**: Variable-length types (UTF-8, binary, lists) use offsets + values arrays internally. Avoid manual buffer slicing unless you understand Arrow's internal invariants—use Arrow's built-in compute functions instead
- Suggested: - **Row by Row Processing**: Avoid iterating over Arrays element by element when possible, and -- use Arrow's built-in compute functions instead
Maybe also link to the kernels: https://docs.rs/arrow/latest/arrow/compute/index.html
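To show what "use the compute kernels instead of row-by-row loops" looks like in practice, a minimal sketch (the data is made up):

```rust
use arrow::array::{Array, BooleanArray, Int32Array};
use arrow::compute::{filter, sum};
use arrow::error::ArrowError;

fn main() -> Result<(), ArrowError> {
    let values = Int32Array::from(vec![10, 20, 30, 40]);

    // Vectorized aggregation over the whole column instead of a manual loop.
    assert_eq!(sum(&values), Some(100));

    // Vectorized selection: keep only the rows where the predicate is true.
    let predicate = BooleanArray::from(vec![true, false, true, false]);
    let kept = filter(&values, &predicate)?;
    assert_eq!(kept.len(), 2);
    Ok(())
}
```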
Maybe link to the cast kernel: https://docs.rs/arrow/latest/arrow/compute/fn.cast.html
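Something like the following could accompany that link -- a minimal sketch of the cast kernel (the values are made up):

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, Int32Array, StringArray};
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::ArrowError;

fn main() -> Result<(), ArrowError> {
    let ints: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));

    // The cast kernel converts a whole column to a new type in one call.
    let casted = cast(&ints, &DataType::Utf8)?;
    let strings = casted.as_any().downcast_ref::<StringArray>().unwrap();
    assert_eq!(strings.value(0), "1");
    Ok(())
}
```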
I think custom optimizer rules are typically more concerned with the Schemas than the arrays, but we can leave this as is too.
It feels weird to have a "Next steps" section about working with the DataFrame API, given this guide itself is meant to be an introduction to Arrow for DataFusion users who may not need to use Arrow directly.
I see your point - if users don't need Arrow directly, why guide them to DataFrames?
My thinking was: users reading this guide are trying to understand the foundation before using DataFusion. But you're right that it creates a circular path. Would it be better to:
- Remove "Next Steps" entirely, OR
- Reframe as "When you'll encounter Arrow" focusing on the extension points where Arrow knowledge becomes necessary?
The second option would reinforce that most users can stay at the DataFrame level. (See first comment of the dataframe.md, where I first wanted to implement the introduction to Arrow)
If going with the second option, it might be a little odd to put it at the end of an article that is introducing users to the RecordBatch/Arrow API, as usually you'd provide a justification/reason upfront for why you might need this (to highlight why users would need to read the guide in the first place) 🤔
Maybe can just have the recommended reading links, but put some descriptions for each link so users would know why they might be interested in checking out the links (e.g. "understand arrow internals", "creating your own udf efficiently")
I also think different users will need to use different APIs. There are plenty of people who will use the DataFrame API, but an important class of users will also want to go deeper.
You've hit on a key point, and I absolutely agree that different users need different APIs. My goal is to build a clear path for them to get there.
My intention with this document is to provide foundational "Arrow 101". The pedagogical progression I'm envisioning follows a dimensional model:
- 0 dimensions → Data Types (the vocabulary - currently missing? Well, there is the SQL-centric docs/source/user-guide/sql/data_types.md)
- 1 dimension → Arrays & RecordBatches (columnar data - the focus of this PR)
- 2 dimensions → DataFrame API (tabular abstraction - my next contribution)
From there, users have the foundation to tackle extension APIs:
- ExecutionPlan API (custom operators)
- TableProvider API (custom data sources)
- UDF APIs (custom functions)
- More
This ensures users understand RecordBatches before trying to write a TableProvider that produces them. Building blocks in the right order.
Does this dimensional approach make sense? I believe it provides a gentle on-ramp while creating the necessary foundation for advanced users. Happy to adjust!
(And as a side benefit, I'm learning a ton while hopefully making DataFusion more accessible! 🙂)
I feel we can trim some of these references; for example including IPC is probably unnecessary for the goal of this guide.
Agreed - thank you for helping me tighten the focus and leave the more verbose details to external links.
IPC is too deep for this guide's scope. I'll trim the references to focus on:
- Main Arrow documentation (for those wanting to go deeper)
- DataFusion-specific references (MemTable, TableProvider, DataFrame)
- The academic paper (for those interested in the theory)
I'll remove IPC, memory layout internals, and other implementation-focused references.