README
In modern data-driven applications, real-time data streaming has become a critical requirement for processing, analyzing, and storing large volumes of continuously generated data.
This streaming pipeline uses Kafka, Apache Flink, Apache Pinot, PostgreSQL, and MySQL, leveraging both Python and Java, with everything provisioned using Docker Compose for seamless deployment and orchestration.
The setup delivers scalable, low-latency, fault-tolerant data streaming for real-time analytics and operational insights.
- Kafka serves as the central event broker, enabling efficient data ingestion and movement across systems.
- Apache Flink, a powerful stream processing framework, transforms and enriches incoming data in real-time before routing it to different storage systems.
- For fast analytical queries, Apache Pinot provides an OLAP engine optimized for high-speed aggregations.
- PostgreSQL and MySQL act as traditional relational stores for structured and transactional data.
- With Python and Java, we implement producers, consumers, Flink processing jobs, and seamless integration with the storage layers (see the producer sketch below).
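As a rough illustration of the ingestion path, here is a minimal Python producer sketch using kafka-python. The topic name `rides` and the event fields are assumptions for illustration only; the actual producer lives in the rides service.

```python
# Minimal sketch of a ride-event producer, assuming kafka-python is installed
# (pip install kafka-python). The topic name "rides" and the event schema are
# illustrative assumptions, not the repository's actual definitions.
import json
import random
import time
import uuid

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumes Kafka's port is mapped to the host
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    event = {
        "ride_id": str(uuid.uuid4()),
        "driver_id": random.randint(1, 100),
        "fare": round(random.uniform(5.0, 50.0), 2),
        "ts": int(time.time() * 1000),
    }
    producer.send("rides", value=event)  # hypothetical input topic
    producer.flush()
    time.sleep(1)
```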
Make sure you have the following installed:
- Docker
- Docker Compose
- Python 3.x
- Java 8/11 (for running Flink jobs)
- Git
git clone https://github.com/Nasruddin/realtime-streaming-kafka-flink-pinot-postgres.git
cd realtime-streaming-kafka-flink-pinot-postgres
Make sure to build and package the Flink Java job first:
cd flink-processing-service
mvn clean package -DskipTests
docker compose \
-f docker-compose-base.yml \
-f docker-compose-pinot.yml \
-f docker-compose-flink.yml up --build
This will start Kafka, Zookeeper, Flink, Pinot, PostgreSQL, and MySQL containers.
Run the following command to check whether the services are up:
docker ps
You should see containers for Kafka, Zookeeper, Flink, Pinot, PostgreSQL, and MySQL running.
As soon as the containers are running, seed data (1,000 ride rows) is ingested into MySQL, and each row is then published to Kafka as an event. Once all 1,000 rides have been pushed to Kafka, the Python Kafka producer starts creating new events. You can verify this in the rides-service logs:
docker logs rides-service
Confirm that your Flink job is running and producing transformed events:
docker exec -it kafka kafka-topics --bootstrap-server kafka:9092 --list
docker exec -it kafka kafka-console-consumer --topic riders_out --bootstrap-server kafka:9092
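If you prefer to check the transformed events from Python instead of the console consumer, here is a minimal sketch, again assuming kafka-python and a broker listener reachable from the host at localhost:9092:

```python
# Minimal sketch: read transformed events from the riders_out topic.
# Assumes kafka-python (pip install kafka-python) and a host-mapped broker.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "riders_out",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    print(message.value)  # one transformed ride event per message
```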
Pinot is used for OLAP (Online Analytical Processing) queries on real-time data.
- Access the Pinot UI: http://localhost:9000
- Verify that schemas & tables were generated:
curl -X GET "http://localhost:9000/schemas"
curl -X GET "http://localhost:9000/tables"
NOTE: If you don't see the schemas and tables, re-run the container below:
docker-compose restart pinot-add-table
- Once the data is processed and stored in Apache Pinot, query it using:
curl -X POST "http://localhost:9000/query/sql" -H "Content-Type: application/json" -d '{"sql":"SELECT * FROM rides LIMIT 10"}'
or run the query directly in the Pinot query console:
SELECT * FROM rides LIMIT 10;
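The same query can also be issued from Python through the controller's /query/sql endpoint shown in the curl example above; a minimal sketch using requests:

```python
# Minimal sketch: run a SQL query against Pinot through the controller's
# /query/sql REST endpoint (the same endpoint as the curl example above).
import requests

resp = requests.post(
    "http://localhost:9000/query/sql",
    json={"sql": "SELECT * FROM rides LIMIT 10"},
    timeout=10,
)
resp.raise_for_status()

# Pinot returns the result set under resultTable.rows
for row in resp.json()["resultTable"]["rows"]:
    print(row)
```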
Access the databases with:
docker exec -it postgres psql -U postgresuser -d rides_db
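For programmatic access, here is a minimal Python sketch using psycopg2. The password and the `rides` table name are placeholders/assumptions; check docker-compose-base.yml for the actual credentials and schema.

```python
# Minimal sketch: query PostgreSQL with psycopg2 (pip install psycopg2-binary).
# The password and table name below are placeholders; the real values are set
# in the compose file and the ingestion scripts.
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5432,             # assumes the container's port is mapped to the host
    dbname="rides_db",
    user="postgresuser",
    password="CHANGE_ME",  # placeholder, see docker-compose-base.yml
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT * FROM rides LIMIT 10;")  # assumed table name
    for row in cur.fetchall():
        print(row)
conn.close()
```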
Superset is used for data visualization and dashboarding.
To stop the containers without deleting data:
docker compose \
-f docker-compose-base.yml \
-f docker-compose-pinot.yml \
-f docker-compose-flink.yml down
To stop and remove all containers, volumes, and networks:
docker compose \
-f docker-compose-base.yml \
-f docker-compose-pinot.yml \
-f docker-compose-flink.yml down -v
- Check logs of a specific service:
docker logs -f <container_id>
- Restart a specific service:
docker-compose restart <service-name>
- Ensure ports are not occupied:
sudo lsof -i :<port>
Then kill the process (try SIGTERM first; use SIGKILL only if it won't stop):
sudo kill -15 <PID>
sudo kill -9 <PID>
- Kafka specific:
Stop and remove the Kafka and ZooKeeper containers:
docker stop kafka zookeeper
docker rm kafka zookeeper
ZooKeeper stores broker metadata under /tmp/zookeeper, so you need to clear it:
rm -rf /tmp/zookeeper
Kafka maintains log data under /tmp/kafka-logs, which might be causing conflicts. Delete the Kafka log directory:
rm -rf /tmp/kafka-logs
Feel free to fork this repository and submit pull requests with improvements.