### How to: deploy a local Spark cluster (standalone) with Docker (Linux)
[MIT License](https://opensource.org/licenses/MIT)
[Python](https://www.python.org/)

> Dockerized env: [JupyterLab server => Spark (master <-> 1 worker)]

Deploying a local Spark *cluster* (standalone) can be tricky.
Most online resources focus on a single-driver installation, with Spark in a custom env or via [jupyter-docker-stacks](https://jupyter-docker-stacks.readthedocs.io/en/latest/using/selecting.html).
Here are my notes on running Spark locally, with a JupyterLab interface, one Master and one Worker, using Docker Compose.
All the PySpark dependencies are already configured in a container, with access to your local files (in an existing directory).
You might also want to do it the easy way -- not locally though -- using the free [Databricks](https://docs.databricks.com/getting-started/community-edition.html) Community Edition.

### 1. Prerequisites
---
- Install Docker Engine, either through Docker Desktop or as the standalone engine. I personally use the latter.
- Make sure Docker Compose is installed, or install it. You can verify both with the commands shown after this list.
- Resources:
Medium [article](https://towardsdatascience.com/learning-docker-the-easy-way-52b7bdec5e86) on installing and getting started with Docker. The official Docker resources should be enough, though.
[Jupyter-Docker-Stacks](https://jupyter-docker-stacks.readthedocs.io/en/latest/using/selecting.html): "a set of ready-to-run Docker images containing Jupyter applications".
The source [article](https://towardsdatascience.com/machine-learning-on-a-large-scale-2eef3bb749ee) I (very slightly) adapted the docker-compose file from.
Install Docker Engine (apt-get), [official resource](https://docs.docker.com/engine/install/ubuntu/).
Install Docker Compose (apt-get), [official resource](https://docs.docker.com/compose/install/linux/#install-using-the-repository).

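A quick way to check that both are available (which of the last two commands applies depends on whether you installed the Compose v2 plugin or the legacy standalone binary):

```
docker --version
# Compose v2 plugin:
docker compose version
# or, legacy standalone binary:
docker-compose --version
```
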
### 2. How to
---

After installing Docker Engine/Compose on Linux, do not forget the post-installation [steps](https://docs.docker.com/engine/install/linux-postinstall/).
1. Git clone this repository, or create a new one (name of your choice).
2. Open a terminal, cd into your directory and make sure the `docker-compose.yml` file is present (copy it in if needed):
https://github.com/matthieuvion/spark-cluster/blob/b2ac2c40562200deaaa0fda5a1c6a61a3b9d5102/docker-compose.yml?plain=1
```
cd my-directory
docker compose up
# or, depending on your Docker Compose install:
docker-compose up
```
Basically, the file tells Docker Compose how to run the Spark Master, the Worker and JupyterLab. Your local disk / current working directory will be accessible every time you run this command.
3. Run docker compose: it will automatically download the needed images (spark:3.3.1 for the Master and Worker, pyspark-notebook for the JupyterLab interface) and run the whole thing.

### 3. Profit: JupyterLab interface, Spark cluster (standalone) mode
JupyterLab interface: http://localhost:8888
Spark Master: http://localhost:8080
Spark Worker: http://localhost:8081
You can use the demo file `spark-cluster.ipynb` for a ready-to-run PySpark notebook, or simply create a new one and build the SparkSession with this code snippet:

```
from pyspark.sql import SparkSession

# SparkSession, pointing at the standalone master started by docker compose
URL_SPARK = "spark://spark:7077"

spark = (
    SparkSession.builder
    .appName("spark-ml")
    .config("spark.executor.memory", "4g")
    .master(URL_SPARK)
    .getOrCreate()
)
```
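
As a minimal sanity check (not part of the demo notebook), you can then build a tiny DataFrame and trigger an action in the next cell, so the work actually goes through the master/worker you just started:

```
# assumes `spark` was created with the snippet above
df = spark.createDataFrame(
    [(1, "a"), (2, "b"), (3, "c")],
    ["id", "letter"],
)
df.show()
print("row count:", df.count())
print("master:", spark.sparkContext.master)  # expected: spark://spark:7077
```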

### Bonus: Notebook, predict using spark.ml Pipeline()
---
If you use `spark-cluster.ipynb`, a demo example shows how to build a spark.ml prediction Pipeline() with a random forest regressor on a well-known dataset.
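
For reference, here is a minimal, self-contained sketch of that kind of pipeline. It is illustrative only: the tiny inline dataset and column names are made up here (the demo notebook uses a real, well-known dataset), and it reuses the `spark` session built above.

```
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor

# toy data, for illustration only
train = spark.createDataFrame(
    [(1.0, 2.0, 10.0), (2.0, 3.0, 21.0), (3.0, 5.0, 29.0), (4.0, 7.0, 41.0)],
    ["feat1", "feat2", "label"],
)

# assemble the feature columns into a single vector, then fit a random forest regressor
assembler = VectorAssembler(inputCols=["feat1", "feat2"], outputCol="features")
rf = RandomForestRegressor(featuresCol="features", labelCol="label", numTrees=10)

pipeline = Pipeline(stages=[assembler, rf])
model = pipeline.fit(train)
model.transform(train).select("label", "prediction").show()
```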