"Gentle Introduction to Arrow / Record Batches" #11336 #18051
@@ -0,0 +1,284 @@

<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# A Gentle Introduction to Arrow & RecordBatches (for DataFusion users)

```{contents}
:local:
:depth: 2
```

This guide helps DataFusion users understand [Arrow] and its RecordBatch format. While you may never need to work with Arrow directly, this knowledge becomes valuable when using DataFusion's extension points or debugging performance issues.
---

Suggested change:

| Before | After |
|---|---|
| This guide helps DataFusion users understand [Arrow] and its RecordBatch format. While you may never need to work with Arrow directly, this knowledge becomes valuable when using DataFusion's extension points or debugging performance issues. | This guide helps DataFusion users understand [Apache Arrow] and its RecordBatch format. While you may never need to work with Arrow directly, this knowledge becomes valuable when using DataFusion's extension points or debugging performance issues. |
---
As part of a follow-on PR we can also copy some of the introductory material from https://jorgecarleitao.github.io/arrow2/main/guide/arrow.html#what-is-apache-arrow, which I think is well written, though maybe it is too "database centric" 🤔
---
It might also be nice to mention here something like "DataFusion uses Arrow as its native internal format both for zero-copy interoperability with other libraries, as well as to leverage the highly optimized compute kernels available in arrow-rs".
---
Suggested change:

| Before | After |
|---|---|
| Apache Arrow is an open **specification** that defines how analytical data should be organized in memory. Think of it as a blueprint that different systems agree to follow, not a database or programming language. | Apache Arrow is an open **specification** that defines a common way to organize analytical data in memory. Think of it as a set of best practices that different systems agree to follow, not a database or programming language. |
---
Maybe you can also mention here that the default batch size is 8192 rows and can be controlled by the `datafusion.execution.batch_size` config setting: https://datafusion.apache.org/user-guide/configs.html
👍
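For readers who want to see this in code, a minimal sketch of adjusting the batch size (assuming the `SessionConfig::with_batch_size` API from the datafusion prelude; the value 1024 is purely illustrative):

```rust
use datafusion::prelude::*;

fn main() {
    // The default batch size is 8192 rows; it can also be changed at runtime
    // with `SET datafusion.execution.batch_size = 1024;` in SQL.
    let config = SessionConfig::new().with_batch_size(1024);
    let _ctx = SessionContext::new_with_config(config);
}
```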
---
It might also be worth linking / adding examples of two other APIs that are useful for writing tests:
---
Suggested change:

| Before | After |
|---|---|
| Once you have a [`RecordBatch`], you can query it with DataFusion using a [`MemTable`]. This is useful for testing, processing data from external systems, or combining in-memory data with other sources. The example below creates a batch, wraps it in a [`MemTable`], registers it as a named table, and queries it using SQL—demonstrating how Arrow serves as the bridge between your data and DataFusion's query engine. | Once you have one or more [`RecordBatch`]es, you can query them with DataFusion using a [`MemTable`]. This is useful for testing, processing data from external systems, or combining in-memory data with other sources. The example below creates a batch, wraps it in a [`MemTable`], registers it as a named table, and queries it using SQL—demonstrating how Arrow serves as the bridge between your data and DataFusion's query engine. |
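To make that paragraph concrete, a sketch of the full flow (assuming datafusion's `MemTable` and `SessionContext` APIs and a tokio runtime; the `people` table and its columns are invented for illustration):

```rust
use std::sync::Arc;

use datafusion::arrow::array::{Int32Array, StringArray};
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::datasource::MemTable;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // 1. Build a RecordBatch: a schema plus one equal-length array per column.
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int32, false),
        Field::new("name", DataType::Utf8, false),
    ]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(Int32Array::from(vec![1, 2, 3])),
            Arc::new(StringArray::from(vec!["a", "b", "c"])),
        ],
    )?;

    // 2. Wrap the batch(es) in a MemTable and register it as a named table.
    let table = MemTable::try_new(schema, vec![vec![batch]])?;
    let ctx = SessionContext::new();
    ctx.register_table("people", Arc::new(table))?;

    // 3. Query the in-memory data with SQL.
    ctx.sql("SELECT name FROM people WHERE id > 1").await?.show().await?;
    Ok(())
}
```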
---
Technically there are ways to mutate the data in place -- for example https://docs.rs/arrow/latest/arrow/array/struct.PrimitiveArray.html#method.unary_mut. However I would say that is an advanced use case, and if it is mentioned at all it could be just a reference.
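A tiny sketch of that advanced path, for reference only (`unary_mut` hands the array back as an `Err` when its buffers are shared, so this falls back to the copying `unary` kernel):

```rust
use datafusion::arrow::array::Int32Array;

fn main() {
    let array = Int32Array::from(vec![1, 2, 3]);
    // Mutates in place only if this array uniquely owns its buffers;
    // otherwise we get the array back and copy instead.
    let doubled = array
        .unary_mut(|x| x * 2)
        .unwrap_or_else(|shared| shared.unary(|x| x * 2));
    assert_eq!(doubled.value(1), 4);
}
```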
---
I would say this is more like "avoid row by row operations" rather than buffer management. Maybe something like:

Suggested change:

| Before | After |
|---|---|
| - **Buffer management**: Variable-length types (UTF-8, binary, lists) use offsets + values arrays internally. Avoid manual buffer slicing unless you understand Arrow's internal invariants—use Arrow's built-in compute functions instead | - **Row by Row Processing**: Avoid iterating over Arrays element by element when possible -- use Arrow's built-in compute functions instead |
---
Maybe also link to the kernels: https://docs.rs/arrow/latest/arrow/compute/index.html
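For instance, two of those ready-made kernels in action (a sketch using the `arrow` crate as re-exported by datafusion):

```rust
use datafusion::arrow::array::{BooleanArray, Int32Array};
use datafusion::arrow::compute;

fn main() {
    let values = Int32Array::from(vec![1, 2, 3, 4]);

    // Vectorized aggregation instead of a hand-written element loop.
    assert_eq!(compute::sum(&values), Some(10));

    // Vectorized selection driven by a boolean mask.
    let mask = BooleanArray::from(vec![true, false, true, false]);
    let filtered = compute::filter(&values, &mask).unwrap();
    assert_eq!(filtered.len(), 2);
}
```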
---
Maybe link to the cast kernel: https://docs.rs/arrow/latest/arrow/compute/fn.cast.html
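And a short sketch of the cast kernel converting a whole column in one call (the downcast at the end merely inspects the result):

```rust
use std::sync::Arc;

use datafusion::arrow::array::{ArrayRef, Int32Array, StringArray};
use datafusion::arrow::compute::cast;
use datafusion::arrow::datatypes::DataType;

fn main() {
    let ints: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));

    // Convert the entire Int32 column to Utf8 without a per-row loop.
    let as_strings = cast(&ints, &DataType::Utf8).unwrap();
    let as_strings = as_strings.as_any().downcast_ref::<StringArray>().unwrap();
    assert_eq!(as_strings.value(0), "1");
}
```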
---
I think typically custom optimizer rules are more concerned with Schemas than with the arrays, but I think we can leave this as is too.
---
It feels weird to have a "Next Steps" section about working with the DataFrame API, given this guide itself is meant to be an introduction to Arrow for DataFusion users who may not need to use Arrow directly.
I see your point - if users don't need Arrow directly, why guide them to DataFrames?

My thinking was: users reading this guide are trying to understand the foundation before using DataFusion. But you're right that it creates a circular path. Would it be better to:

- Remove "Next Steps" entirely, OR
- Reframe it as "When you'll encounter Arrow", focusing on the extension points where Arrow knowledge becomes necessary?

The second option would reinforce that most users can stay at the DataFrame level. (See the first comment on dataframe.md, where I originally wanted to put the introduction to Arrow.)
If going with the second option, it might be a little odd to put it at the end of an article introducing users to the RecordBatch/Arrow API, as usually you'd provide a justification upfront for why you might need this (to highlight why users would read the guide in the first place) 🤔

Maybe we can just have the recommended reading links, but with a short description for each so users know why they might be interested in checking them out (e.g. "understand Arrow internals", "create your own UDF efficiently").
I also think different users will need different APIs. There are plenty of people who will use the DataFrame API, but an important class of users will also want to go deeper.
You've hit on a key point, and I absolutely agree that different users need different APIs. My goal is to build a clear path for them to get there.

My intention with this document is to provide a foundational "Arrow 101". The pedagogical progression I'm envisioning follows a dimensional model:

- 0 dimensions → data types (the vocabulary -- currently missing? Well, there is the SQL-centric docs/source/user-guide/sql/data_types.md)
- 1 dimension → Arrays & RecordBatches (columnar data -- the focus of this PR)
- 2 dimensions → DataFrame API (tabular abstraction -- my next contribution)

From there, users have the foundation to tackle the extension APIs:

- ExecutionPlan API (custom operators)
- TableProvider API (custom data sources)
- UDF APIs (custom functions)
- More

This ensures users understand RecordBatches before trying to write a TableProvider that produces them. Building blocks in the right order.

Does this dimensional approach make sense? I believe it provides a gentle on-ramp while creating the necessary foundation for advanced users. Happy to adjust!

(And as a side benefit, I'm learning a ton while hopefully making DataFusion more accessible! 🙂)
---
I feel we can trim some of these references; for example including IPC is probably unnecessary for the goal of this guide.
Agreed -- thank you for helping me tighten the focus and leave the more verbose details to external links. IPC is too deep for this guide's scope. I'll trim the references to focus on:

- the main Arrow documentation (for those wanting to go deeper)
- DataFusion-specific references (MemTable, TableProvider, DataFrame)
- the academic paper (for those interested in the theory)

I'll remove IPC, memory layout internals, and other implementation-focused references.
---

@@ -2444,7 +2444,6 @@ date_bin(interval, expression, origin-timestamp)

- **interval**: Bin interval.
- **expression**: Time expression to operate on. Can be a constant, column, or function.
- **origin-timestamp**: Optional. Starting point used to determine bin boundaries. If not specified defaults 1970-01-01T00:00:00Z (the UNIX epoch in UTC). The following intervals are supported:
  - nanoseconds
  - microseconds
  - milliseconds

---

This file is autogenerated; if you want to change the doc, please change the user doc for …

Sorry, in my first attempt I ran Prettier on all files and it "fixed" this one... I thought I was doing good by fixing it.

Reverted in 09d6ca6.