### How to : deploy a local Spark cluster (standalone) w/Docker (Linux)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg)](https://www.python.org/)

> Dockerized env : [JupyterLab server => Spark (master <-> 1 worker) ]

Deploying a local Spark *cluster* (standalone) can be tricky.
Most online resources focus on a single-driver installation, with Spark in a custom environment or via [jupyter-docker-stacks](https://jupyter-docker-stacks.readthedocs.io/en/latest/using/selecting.html).
Here are my notes on working with Spark locally, through a JupyterLab interface, with one master and one worker, using Docker Compose.
All the PySpark dependencies are already configured in a container, with access to your local files (in an existing directory).
You might also want to do it the easy way -- not local though -- using the free [Databricks](https://docs.databricks.com/getting-started/community-edition.html) Community Edition.

### 1. Prerequisites
---
- Install Docker Engine, either through Docker Desktop or directly the standalone engine. Personally, I use the latter.
- Make sure Docker Compose is installed, or install it.
- Resources :
  - Medium [article](https://towardsdatascience.com/learning-docker-the-easy-way-52b7bdec5e86) on installing and basic use of Docker. Docker's official resources should be enough though.
  - [Jupyter-Docker-Stacks](https://jupyter-docker-stacks.readthedocs.io/en/latest/using/selecting.html). "A set of ready-to-run Docker images containing Jupyter applications".
  - The source [article](https://towardsdatascience.com/machine-learning-on-a-large-scale-2eef3bb749ee) I (very slightly) adapted the docker-compose file from.
  - Install Docker Engine (apt-get), [official resource](https://docs.docker.com/engine/install/ubuntu/).
  - Install Docker Compose (apt-get), [official resource](https://docs.docker.com/compose/install/linux/#install-using-the-repository).

### 2. How to
---

After installing Docker Engine/Compose on Linux, do not forget the post-installation [steps](https://docs.docker.com/engine/install/linux-postinstall/).
1. Git clone this repository or create a new one (name of your choice)
2. Open a terminal, cd into your directory, and make sure the `docker-compose.yml` file is present (copy it in if needed)
https://github.com/matthieuvion/spark-cluster/blob/b2ac2c40562200deaaa0fda5a1c6a61a3b9d5102/docker-compose.yml?plain=1
```
cd my-directory
docker compose up
# or, depending on your Docker Compose install:
docker-compose up
```
Basically, the file tells Docker Compose how to run the Spark master, the worker, and JupyterLab. You will have access to your local disk/current working directory every time you run this command (see the sketch after this list).
3. Run docker compose; it will automatically download the needed images (spark:3.3.1 for the master and the worker, pyspark-notebook for the JupyterLab interface) and run the whole thing.

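Once the containers are up, you can check from a new JupyterLab notebook that your local working directory is indeed mounted inside the container. A minimal sketch (the exact mount point depends on the volume mapping in `docker-compose.yml`, so treat the path as an assumption) :

```
import os

# The notebook's working directory should map to your local project directory
# (assumption : a volume in docker-compose.yml mounts it there)
workdir = os.getcwd()
print(workdir)
print(os.listdir(workdir))  # you should see docker-compose.yml and your notebooks
```
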
### 3. Profit : JupyterLab interface, Spark cluster (standalone) mode
JupyterLab interface : http://localhost:8888
Spark Master : http://localhost:8080
Spark Worker : http://localhost:8081

You can use the demo file `spark-cluster.ipynb` for a ready-to-run PySpark notebook, or simply create a new one and build the SparkSession with this code snippet :

```
from pyspark.sql import SparkSession

# SparkSession, pointing to the Spark master defined in docker-compose.yml
URL_SPARK = "spark://spark:7077"

spark = (
    SparkSession.builder
    .appName("spark-ml")
    .config("spark.executor.memory", "4g")
    .master(URL_SPARK)
    .getOrCreate()
)
```
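To make sure the notebook really talks to the cluster (and not to a purely local Spark), you can run a tiny job right after creating the session; a minimal sketch with toy data :

```
# Quick sanity check : the master URL should be spark://spark:7077,
# and the job should appear in the Spark Master UI (http://localhost:8080)
print(spark.sparkContext.master)

df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "letter"])
df.show()

# When you are done, release the cluster resources
# spark.stop()
```
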
### Bonus : Notebook, predict using spark.ml Pipeline()
---
If you use `spark-cluster.ipynb`, a demo example shows how to build a spark.ml Pipeline() for prediction with a random forest regressor, on a well-known dataset.
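The notebook holds the full example; the general shape of such a pipeline looks roughly like the sketch below (toy data and placeholder column names here, not the dataset used in the notebook; it reuses the `spark` session created above) :

```
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor

# Toy DataFrame with placeholder column names -- the notebook uses its own dataset
df = spark.createDataFrame(
    [(1.0, 2.0, 10.0), (2.0, 3.0, 20.0), (3.0, 5.0, 30.0), (4.0, 6.0, 40.0)],
    ["feature_1", "feature_2", "label"],
)

# spark.ml expects the features packed into a single vector column
assembler = VectorAssembler(inputCols=["feature_1", "feature_2"], outputCol="features")
rf = RandomForestRegressor(featuresCol="features", labelCol="label")

# Chain both stages; the notebook fits on a proper train split,
# here we fit and predict on the same toy DataFrame for brevity
pipeline = Pipeline(stages=[assembler, rf])
model = pipeline.fit(df)
model.transform(df).select("label", "prediction").show()
```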
