17
17
under the License.
18
18
-->
19
19
20
- # Ballista: Distributed Compute with Apache Arrow
20
+ # Ballista: Distributed Compute with Apache Arrow and DataFusion
21
21
22
- Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow. It is built
23
- on an architecture that allows other programming languages (such as Python, C++, and Java) to be supported as
24
- first-class citizens without paying a penalty for serialization costs.
22
+ Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow and
23
+ DataFusion. It is built on an architecture that allows other programming languages (such as Python, C++, and
24
+ Java) to be supported as first-class citizens without paying a penalty for serialization costs.
25
25
26
26
The foundational technologies in Ballista are:
27
27
@@ -35,9 +35,30 @@ Ballista can be deployed as a standalone cluster and also supports [Kubernetes](
35
35
case, the scheduler can be configured to use [ etcd] ( https://etcd.io/ ) as a backing store to (eventually) provide
36
36
redundancy in the case of a scheduler failing.
37
37
38
+ # Getting Started
39
+
40
+ Fully working examples are available. Refer to the [ Ballista Examples README] ( ../ballista-examples/README.md ) for
41
+ more information.
42
+
43
+ ## Distributed Scheduler Overview
44
+
45
+ Ballista uses the DataFusion query execution framework to create a physical plan and then transforms it into a
46
+ distributed physical plan by breaking the query down into stages whenever the partitioning scheme changes.
47
+
48
+ Specifically, any ` RepartitionExec ` operator is replaced with an ` UnresolvedShuffleExec ` and the child operator
49
+ of the repartition operator is wrapped in a ` ShuffleWriterExec ` operator and scheduled for execution.
50
+
51
+ Each executor polls the scheduler for the next task to run. Tasks are currently always ` ShuffleWriterExec ` operators
52
+ and each task represents one * input* partition that will be executed. The resulting batches are repartitioned
53
+ according to the shuffle partitioning scheme and each * output* partition is streamed to disk in Arrow IPC format.
54
+
55
+ The scheduler will replace ` UnresolvedShuffleExec ` operators with ` ShuffleReaderExec ` operators once all shuffle
56
+ tasks have completed. The ` ShuffleReaderExec ` operator connects to other executors as required using the Flight
57
+ interface, and streams the shuffle IPC files.
58
+
38
59
# How does this compare to Apache Spark?
39
60
40
- Although Ballista is largely inspired by Apache Spark, there are some key differences.
61
+ Ballista implements a similar design to Apache Spark, but there are some key differences.
41
62
42
63
- The choice of Rust as the main execution language means that memory usage is deterministic and avoids the overhead of
43
64
GC pauses.
@@ -49,14 +70,3 @@ Although Ballista is largely inspired by Apache Spark, there are some key differ
49
70
distributed compute.
50
71
- The use of Apache Arrow as the memory model and network protocol means that data can be exchanged between executors
51
72
in any programming language with minimal serialization overhead.
52
-
53
- ## Status
54
-
55
- Ballista was [ donated] ( https://arrow.apache.org/blog/2021/04/12/ballista-donation/ ) to the Apache Arrow project in
56
- April 2021 and should be considered experimental.
57
-
58
- ## Getting Started
59
-
60
- The [ Ballista Developer Documentation] ( docs/README.md ) and the
61
- [ DataFusion User Guide] ( https://github.com/apache/arrow-datafusion/tree/master/docs/user-guide ) are currently the
62
- best sources of information for getting started with Ballista.
0 commit comments