
Realtime Data Streaming

Overview

In modern data-driven applications, real-time data streaming has become a critical requirement for processing, analyzing, and storing large volumes of continuously generated data.

This streaming pipeline uses Kafka, Flink, Apache Pinot, PostgreSQL, and MySQL, with Superset on top for dashboards. Producers and processing jobs are written in Python and Java, and everything is provisioned with Docker Compose for seamless deployment and orchestration. The result is a scalable, low-latency, fault-tolerant pipeline for real-time analytics and operational insights.

  1. Kafka serves as the central event broker, enabling efficient data ingestion and movement across systems.
  2. Apache Flink, a powerful stream processing framework, transforms and enriches incoming data in real time before routing it to different storage systems.
  3. Apache Pinot provides an OLAP engine optimized for high-speed aggregations and fast analytical queries.
  4. PostgreSQL and MySQL act as traditional relational stores for structured and transactional data.
  5. Python and Java are used to implement the producers, consumers, Flink processing jobs, and the integration with the storage layers (a minimal producer sketch follows this list).
  6. Superset sits on top for visualization and dashboarding.
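
To make point 5 concrete, a minimal Python producer for ride events could look like the sketch below. The topic name, broker address, and event fields are illustrative assumptions and are not taken from this repository.

import json
import random
import time

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

ride_id = 0
while True:
    ride_id += 1
    event = {
        "ride_id": ride_id,                       # hypothetical field names
        "driver_id": random.randint(1, 100),
        "fare": round(random.uniform(5.0, 50.0), 2),
        "ts": int(time.time() * 1000),
    }
    producer.send("rides", value=event)           # assumed topic name
    time.sleep(1)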

Prerequisites

Make sure you have the following installed:

  • Docker and Docker Compose
  • Java (JDK) and Maven, to build the Flink processing job
  • Git

Installation & Setup

1. Clone the Repository

git clone https://github.com/Nasruddin/realtime-streaming-kafka-flink-pinot-postgres.git

cd realtime-streaming-kafka-flink-pinot-postgres

2. Start Services using Docker Compose

Make sure to build and package the Flink Java code first:

cd flink-processing-service
mvn clean package -DskipTests
cd ..   # back to the repo root where the compose files live

docker compose \
    -f docker-compose-base.yml \
    -f docker-compose-pinot.yml \
    -f docker-compose-flink.yml up --build

This will start Kafka, Zookeeper, Flink, Pinot, PostgreSQL, and MySQL containers.

3. Verify Running Services

Run the following command to check whether the services are up:

docker ps

You should see containers for Kafka, Zookeeper, Flink, Pinot, PostgreSQL, and MySQL running.


Running the Pipeline

1. Produce Sample Data to Kafka

As soon as the containers are up, seed data (1,000 ride rows) is ingested into MySQL and each row is published to Kafka as an event. Once all 1,000 rides have been pushed to Kafka, the Python Kafka producer starts generating new events. You can verify this in the rides service logs:

docker logs rides-service

The producer logs should show ride events being published to Kafka, and the consumer logs should show the same events being consumed.
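
If you prefer to inspect the events programmatically rather than through container logs, a throwaway Python consumer along these lines works; the topic name is an assumption, since this README only names the Flink output topic.

import json

from kafka import KafkaConsumer  # pip install kafka-python

# Read the assumed raw-rides topic from the beginning and print each event.
consumer = KafkaConsumer(
    "rides",                                      # assumed topic name
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    print(message.offset, message.value)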

2. Flink Job

Confirm that your Flink job is running and producing transformed events.

List the topics to confirm that the new real-time output topic has been created:

docker exec -it kafka kafka-topics --bootstrap-server kafka:9092 --list


Verify that transformed events are being produced to it:

docker exec -it kafka kafka-console-consumer --topic riders_out --bootstrap-server kafka:9092 

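The actual transformation lives in the Java flink-processing-service; purely as an illustration of the shape such a job takes, here is a PyFlink/Flink SQL sketch. The schemas, field names, and the transformation itself are assumptions; only the riders_out topic name comes from this README, and the Kafka SQL connector JAR must be on Flink's classpath.

from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming TableEnvironment backed by Flink's Table API.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: raw ride events from Kafka (schema is a guess for illustration).
t_env.execute_sql("""
    CREATE TABLE rides (
        ride_id BIGINT,
        driver_id INT,
        fare DOUBLE,
        ts BIGINT
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'rides',
        'properties.bootstrap.servers' = 'kafka:9092',
        'properties.group.id' = 'flink-demo',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

# Sink: transformed events to the riders_out topic listed above.
t_env.execute_sql("""
    CREATE TABLE riders_out (
        ride_id BIGINT,
        fare_with_tax DOUBLE
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'riders_out',
        'properties.bootstrap.servers' = 'kafka:9092',
        'format' = 'json'
    )
""")

# A toy transformation standing in for the real enrichment logic.
t_env.execute_sql("""
    INSERT INTO riders_out
    SELECT ride_id, fare * 1.1 AS fare_with_tax FROM rides
""").wait()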

3. Query Apache Pinot

Pinot is used for OLAP (Online Analytical Processing) queries on real-time data.

  • Access Pinot UI: http://localhost:9000

  • Verify schemas & tables are generated:

curl -X GET "http://localhost:9000/schemas"


curl -X GET "http://localhost:9000/tables"


NOTE: If the schemas and tables are missing, re-run the container that registers them:

docker compose restart pinot-add-table


  • Once the data is processed and stored in Apache Pinot, query it over the REST API:

curl -X POST "http://localhost:9000/query/sql" -H "Content-Type: application/json" -d '{"sql":"SELECT * FROM rides LIMIT 10"}'

or open the Pinot UI (http://localhost:9000) and run the query there:

SELECT * FROM rides LIMIT 10;

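The same query can also be issued from Python against Pinot's REST endpoint; this sketch uses the requests library and Pinot's standard resultTable response layout.

import requests

# POST a SQL query to the Pinot controller's query endpoint.
resp = requests.post(
    "http://localhost:9000/query/sql",
    json={"sql": "SELECT * FROM rides LIMIT 10"},
    timeout=10,
)
resp.raise_for_status()

# Pinot returns columns under resultTable.dataSchema and data under rows.
table = resp.json()["resultTable"]
print(table["dataSchema"]["columnNames"])
for row in table["rows"]:
    print(row)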

4. Query PostgreSQL

Access the PostgreSQL database with:

docker exec -it postgres psql -U postgresuser -d rides_db

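The same check can be scripted with psycopg2; the password and the rides table name are assumptions here, so take them from the compose file rather than this sketch.

import psycopg2  # pip install psycopg2-binary

# Connection values mirror the psql command above; the password is a
# placeholder -- look it up in docker-compose-base.yml.
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="rides_db",
    user="postgresuser",
    password="changeme",
)

with conn, conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM rides;")    # assumed table name
    print("rides stored:", cur.fetchone()[0])

conn.close()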

5. Dashboard on Superset

Superset is used for data visualization and dashboarding.

  • Access Superset at http://localhost:8088 and verify it is working.

  • Create a Superset dataset on top of the rides data to power the dashboard.
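To register Pinot as a Superset database, the pinotdb SQLAlchemy driver is the usual route; a commonly used URI shape is pinot://<broker-host>:8099/query/sql?controller=http://<controller-host>:9000/ where 8099 and 9000 are Pinot's default broker and controller ports. The host names depend on the compose service names, so treat the exact URI as an assumption to verify against the pinotdb documentation.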

Stopping and Cleaning Up

To stop the containers without deleting data:

docker compose \
    -f docker-compose-base.yml \
    -f docker-compose-pinot.yml \
    -f docker-compose-flink.yml  down

To stop and remove all containers, volumes, and networks:

docker compose \
    -f docker-compose-base.yml \
    -f docker-compose-pinot.yml \
    -f docker-compose-flink.yml  down -v

Troubleshooting

  • Check logs of a specific service:
    docker logs -f <container_id>
  • Restart a specific service:
    docker compose restart <service-name>
  • Ensure ports are not occupied:
    sudo lsof -i :<port>
    Then kill the process (try a graceful SIGTERM first; fall back to SIGKILL):
    sudo kill -15 <PID>
    sudo kill -9 <PID>
  • Kafka specific: if Kafka or ZooKeeper fail to restart cleanly, remove the containers and their on-disk state:
    docker stop kafka zookeeper
    docker rm kafka zookeeper
    ZooKeeper stores broker metadata under /tmp/zookeeper, so clear it:
    rm -rf /tmp/zookeeper
    Kafka maintains log data that might be causing conflicts, so delete the Kafka log directory:
    rm -rf /tmp/kafka-logs

Contributing

Feel free to fork this repository and submit pull requests with improvements.