README
In modern data-driven applications, real-time data streaming has become a critical requirement for processing, analyzing, and storing large volumes of continuously generated data.
This streaming pipeline uses Kafka, Apache Flink, Apache Pinot, PostgreSQL, and MySQL, leveraging both Python and Java, with everything provisioned using Docker Compose for seamless deployment and orchestration.
The setup delivers scalable, low-latency, fault-tolerant data streaming for real-time analytics and operational insights.
- Kafka serves as the central event broker, enabling efficient data ingestion and movement across systems.
- Apache Flink, a powerful stream processing framework, transforms and enriches incoming data in real-time before routing it to different storage systems.
- For fast analytical queries, Apache Pinot provides an OLAP engine optimized for high-speed aggregations.
- PostgreSQL and MySQL act as traditional relational stores for structured and transactional data.
- With Python and Java, we implement producers, consumers, Flink processing jobs, and seamless integration with the storage layers (see the producer sketch below).
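As a rough illustration of the ingestion path, here is a minimal Python producer sketch using kafka-python. The topic name `rides` and the event fields are assumptions for illustration only; the actual producer lives in the rides service.

```python
# Minimal sketch of a ride-event producer, assuming kafka-python is installed
# (pip install kafka-python). The topic name "rides" and the event schema are
# illustrative assumptions, not the repository's actual definitions.
import json
import random
import time
import uuid

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumes Kafka's port is mapped to the host
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    event = {
        "ride_id": str(uuid.uuid4()),
        "driver_id": random.randint(1, 100),
        "fare": round(random.uniform(5.0, 50.0), 2),
        "ts": int(time.time() * 1000),
    }
    producer.send("rides", value=event)  # hypothetical input topic
    producer.flush()
    time.sleep(1)
```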
Make sure you have the following installed:
- Docker
- Docker Compose
- Python 3.x
- Java 8/11 (for running Flink jobs)
- Git
git clone https://github.com/Nasruddin/realtime-streaming-kafka-flink-pinot-postgres.git
cd realtime-streaming-kafka-flink-pinot-postgres
Make sure to build and package the Flink Java job first:
cd flink-processing-service
mvn clean package -DskipTests
docker compose \
-f docker-compose-base.yml \
-f docker-compose-pinot.yml \
-f docker-compose-flink.yml up --build
This will start Kafka, Zookeeper, Flink, Pinot, PostgreSQL, and MySQL containers.
Run the following command to check whether the services are up:
docker ps
You should see containers for Kafka, Zookeeper, Flink, Pinot, PostgreSQL, and MySQL running.
As soon as the containers are running, seed data (1,000 ride rows) is ingested into MySQL, and each row is then published to Kafka as an event. Once all 1,000 rides have been pushed to Kafka, the Python Kafka producer starts creating new events. You can verify this in the rides-service logs:
docker logs rides-service
Confirm that your Flink job is running and producing transformed events:
docker exec -it kafka kafka-topics --bootstrap-server kafka:9092 --list
docker exec -it kafka kafka-console-consumer --topic riders_out --bootstrap-server kafka:9092
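If you prefer to check the transformed events from Python instead of the console consumer, here is a minimal sketch, again assuming kafka-python and a broker listener reachable from the host at localhost:9092:

```python
# Minimal sketch: read transformed events from the riders_out topic.
# Assumes kafka-python (pip install kafka-python) and a host-mapped broker.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "riders_out",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    print(message.value)  # one transformed ride event per message
```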
Pinot is used for OLAP (Online Analytical Processing) queries on real-time data.
- Access the Pinot UI: http://localhost:9000
- Verify that schemas & tables were generated:
curl -X GET "http://localhost:9000/schemas"
curl -X GET "http://localhost:9000/tables"
NOTE: If you don't see the schemas and tables, re-run the container below:
docker-compose restart pinot-add-table
- Once the data is processed and stored in Apache Pinot, query it using:
curl -X POST "http://localhost:9000/query/sql" -H "Content-Type: application/json" -d '{"sql":"SELECT * FROM rides LIMIT 10"}'
or run the query directly in the Pinot query console:
SELECT * FROM rides LIMIT 10;
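The same query can also be issued from Python through the controller's /query/sql endpoint shown in the curl example above; a minimal sketch using requests:

```python
# Minimal sketch: run a SQL query against Pinot through the controller's
# /query/sql REST endpoint (the same endpoint as the curl example above).
import requests

resp = requests.post(
    "http://localhost:9000/query/sql",
    json={"sql": "SELECT * FROM rides LIMIT 10"},
    timeout=10,
)
resp.raise_for_status()

# Pinot returns the result set under resultTable.rows
for row in resp.json()["resultTable"]["rows"]:
    print(row)
```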
Access the databases with:
docker exec -it postgres psql -U postgresuser -d rides_db
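For programmatic access, here is a minimal Python sketch using psycopg2. The password and the `rides` table name are placeholders/assumptions; check docker-compose-base.yml for the actual credentials and schema.

```python
# Minimal sketch: query PostgreSQL with psycopg2 (pip install psycopg2-binary).
# The password and table name below are placeholders; the real values are set
# in the compose file and the ingestion scripts.
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5432,             # assumes the container's port is mapped to the host
    dbname="rides_db",
    user="postgresuser",
    password="CHANGE_ME",  # placeholder, see docker-compose-base.yml
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT * FROM rides LIMIT 10;")  # assumed table name
    for row in cur.fetchall():
        print(row)
conn.close()
```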
Superset is used for data visualization and dashboarding.
To stop the containers without deleting data:
docker compose \
-f docker-compose-base.yml \
-f docker-compose-pinot.yml \
-f docker-compose-flink.yml down
To stop and remove all containers, volumes, and networks:
docker compose \
-f docker-compose-base.yml \
-f docker-compose-pinot.yml \
-f docker-compose-flink.yml down -v
- Check logs of a specific service:
docker logs -f <container_id>
- Restart a specific service:
docker-compose restart <service-name>
- Ensure ports are not occupied:
sudo lsof -i :<port>
Then kill the process (try SIGTERM first; use SIGKILL only if it won't stop):
sudo kill -15 <PID>
sudo kill -9 <PID>
- Kafka specific:
Stop and remove the Kafka and ZooKeeper containers:
docker stop kafka zookeeper
docker rm kafka zookeeper
ZooKeeper stores broker metadata under /tmp/zookeeper, so you need to clear it:
rm -rf /tmp/zookeeper
Kafka maintains log data under /tmp/kafka-logs, which might be causing conflicts. Delete the Kafka log directory:
rm -rf /tmp/kafka-logs
Feel free to fork this repository and submit pull requests with improvements.