The goal of this project is to:
- Create a Docker Container that runs Spark
- Use Prometheus to get metrics from Spark applications and Node-exporter
- Use Grafana to display the metrics collected
- Stream messages from Kafka to BigQuery with Spark
- Spark version running is 3.0.2
- For the full list of available Spark monitoring metrics, see here.
- The containerized environment consists of one Master and one Worker.
- To track metrics across Spark apps, set appName explicitly; otherwise spark.metrics.namespace defaults to spark.app.id, which changes on every invocation of the app.
- The main Scala application is Kafka Streaming Project-assembly-0.2.0.jar, a streaming job that ingests Kafka messages into BigQuery.
- The Dockerfile for Spark/Hadoop is also available here, so it can be added to the docker-compose.yaml file as seen here.
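
The metrics namespace described above can be pinned through Spark's configuration. A minimal sketch follows; the property names are standard Spark 3.0 settings, but the app name and values are hypothetical placeholders:

```properties
# spark-defaults.conf -- pin the metrics namespace to a stable app name
# (values here are examples; adjust to your deployment)
spark.app.name           KafkaToBigQueryStream
spark.metrics.namespace  ${spark.app.name}

# metrics.properties -- expose metrics for Prometheus to scrape
# (the PrometheusServlet sink ships with Spark 3.0+)
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
```

With the namespace fixed, Prometheus time series keep the same labels across restarts, so Grafana dashboards do not need to be rebuilt per run.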
Assuming Docker is installed, build and run the containers with:
```shell
docker-compose -f docker-compose.spark.yaml -f docker-compose.kafka.yaml build && docker-compose -f docker-compose.spark.yaml -f docker-compose.kafka.yaml up
```
To shut down the containers, run:
```shell
docker-compose -f docker-compose.spark.yaml -f docker-compose.kafka.yaml down
```
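
For reference, the Kafka-to-BigQuery streaming job described above can be sketched in Scala roughly as follows. This is an illustrative sketch, not the actual contents of the project's jar: the broker address, topic, BigQuery table, and GCS bucket names are hypothetical placeholders, and it assumes the spark-bigquery-connector is on the classpath:

```scala
import org.apache.spark.sql.SparkSession

object KafkaToBigQuery {
  def main(args: Array[String]): Unit = {
    // A fixed appName keeps spark.metrics.namespace stable across runs
    val spark = SparkSession.builder()
      .appName("KafkaToBigQueryStream")
      .getOrCreate()

    // Read the raw Kafka stream (broker and topic names are placeholders)
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    // Stream into BigQuery via the spark-bigquery-connector
    // (dataset/table and temporary GCS bucket below are placeholders)
    stream.writeStream
      .format("bigquery")
      .option("table", "my_dataset.events")
      .option("temporaryGcsBucket", "my-temp-bucket")
      .option("checkpointLocation", "/tmp/checkpoints/events")
      .start()
      .awaitTermination()
  }
}
```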