An AI-powered Lakehouse platform with Sales Copilot features, enabling smart retail analytics from ingestion to insight.

🛠️ Production-Ready Data Engineering Pipeline with Generative AI Integration

GCP · GCS · Python · Spark · dbt · Docker · Dremio · Nessie · Dagster · Jupyter Notebooks · Superset · Streamlit · RAG · LLM

A production-grade Data Engineering pipeline that integrates Spark for data migration, dbt for transformation, and Dagster for orchestration. The processed data powers an analytics dashboard in Apache Superset and a Generative AI interface via RAG + LLM using Streamlit. Designed to simulate an EV product data workflow in a modern lakehouse architecture.

✅ Designed as a portfolio project for Data Engineer roles, integrating batch processing, modeling, BI reporting, and AI-based interaction.

🚀 Project Overview

🚗 Use case: Help sales & operations teams explore EV product configurations, color options, and sales trends via a conversational interface.

This project delivers a production-grade data engineering pipeline for interactive analytics and AI-driven exploration of electric vehicle (EV) sales data. It integrates modern data tooling with large language models (LLMs) to create a seamless, conversational experience for business users.

🔧 Key Highlights

  • 🚀 Data Ingestion using PySpark and Google Cloud Storage (GCS)
  • 🧱 Data Transformation & Modeling with dbt, orchestrated by Dagster
  • 🔍 Query Layer powered by Dremio (SQL over data lake, no warehouse needed)
  • 📊 BI Layer built using Apache Superset for sales dashboards
  • 🧠 Retrieval-Augmented Generation (RAG) using ChromaDB + Ollama (Mistral) for smart QA
  • 💬 Interactive UI via Streamlit chatbot
  • 🐳 Fully containerized with Docker and served securely via Ngrok

❓ Example Chat Queries

  • "What EV color sold best in Q1 2024?"
  • "Show me the sales breakdown for the AeroFlow model by region"
  • "List all available models with more than 500km range"

🗺️ Architecture Diagram

  • Spark ingestion from CSV/JSON into Google Cloud Storage (GCS)
  • dbt DAGs running on Dagster
  • Dremio as semantic + acceleration layer
  • Superset for analytics
  • ChromaDB for vector storage
  • LLM with Ollama (Mistral)
  • Chatbot interface via Streamlit

📷 Screenshot: Architecture Diagram

🧰 Tech Stack

| Layer | Tools & Frameworks |
|---|---|
| Data Ingestion | Apache Spark (PySpark) + GCS (Google Cloud Storage) |
| Data Transformation | dbt (data modeling), Dagster (orchestration) |
| Data Warehouse | DuckDB (local OLAP engine; pluggable with Redshift / Snowflake) |
| Data Visualization | Apache Superset (interactive dashboards & reporting) |
| Notebook Interface | Jupyter (development & debugging support) |
| LLM Embedding Store | Chroma (local vector database) |
| LLM Backend | Ollama + Mistral + nomic-embed-text |
| Frontend / UI | Streamlit chatbot interface (user Q&A with RAG pipeline) |
| Dev & Ops Tools | Docker, Ngrok, Python, dotenv, logging |

💡 All components are modular and containerized — easily portable for deployment or extension.

📊 Data Pipeline Walkthrough

1. 🔁 Data Migration with Spark

  • Used PySpark to extract, clean, and load raw sales data into a lakehouse.
  • Integrated with GCS as the cloud data lake (see the ingestion sketch below).
  • Built modular dbt models following bronze → silver → gold architecture, enabling clean staging, enriched intermediate layers, and curated marts for analytics.
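
A minimal, hypothetical sketch of this ingestion step — the bucket, paths, and column names are illustrative, and not the project's actual configuration:

```python
# Hypothetical sketch of the Spark ingestion job: read raw CSV sales exports,
# apply light cleaning, and land them as Parquet in the GCS bucket.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("ev-sales-ingestion")
    # Assumes the GCS connector and service-account credentials are already
    # configured (e.g. via spark.jars and GOOGLE_APPLICATION_CREDENTIALS).
    .getOrCreate()
)

raw = (
    spark.read
    .option("header", True)
    .csv("gs://<your-bucket>/raw/ev_sales/*.csv")  # illustrative bucket/path
)

clean = (
    raw.dropDuplicates()
    .withColumn("sale_date", F.to_date("sale_date"))  # illustrative column
    .filter(F.col("model").isNotNull())
)

(
    clean.write
    .mode("overwrite")
    .partitionBy("sale_date")
    .parquet("gs://<your-bucket>/bronze/ev_sales/")  # bronze layer of the lake
)
```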

📷 Screenshot: dbt DAG and dbt run logs

2. 🧱 Data Modeling with dbt

  • Created models for products, sales_summary, and color_distribution.
  • Scheduled and tested transformations locally.
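
A hedged sketch of how the dbt project could be wired into Dagster with the dagster-dbt integration; the project path is an assumption based on the repository layout, not the actual orchestration code:

```python
# Run the dbt project as Dagster assets via the dagster-dbt integration.
from dagster import AssetExecutionContext, Definitions
from dagster_dbt import DbtCliResource, dbt_assets

DBT_PROJECT_DIR = "dbt"  # folder containing dbt_project.yml (assumed)

@dbt_assets(manifest=f"{DBT_PROJECT_DIR}/target/manifest.json")
def ev_dbt_assets(context: AssetExecutionContext, dbt: DbtCliResource):
    # `dbt build` runs models and tests; Dagster streams each event as an
    # asset materialization, so models like products, sales_summary and
    # color_distribution appear in the asset graph.
    yield from dbt.cli(["build"], context=context).stream()

defs = Definitions(
    assets=[ev_dbt_assets],
    resources={"dbt": DbtCliResource(project_dir=DBT_PROJECT_DIR)},
)
```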

📷 Screenshot: dbt DAG and dbt run logs

3. 📈 Interactive Customer Insights Dashboard with Apache Superset

Developed an interactive dashboard to analyze key customer segmentation metrics, including:

  • VIP customer distribution by state to identify high-value regions
  • Customer model preferences by location across six product lines
  • Overall VIP ratio analysis, highlighting customer loyalty and engagement patterns

This dashboard provides actionable insights for regional marketing strategies and product optimization, with visualizations powered by Apache Superset.
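
As a rough illustration of the kind of query behind the VIP metrics, the gold marts can be spot-checked with DuckDB before they reach Superset; the database file, table, and column names below are hypothetical:

```python
# Spot-check the VIP ratio per state against a gold-layer customer mart.
import duckdb

con = duckdb.connect("warehouse.duckdb")  # hypothetical local warehouse file

vip_by_state = con.execute(
    """
    SELECT
        state,
        COUNT(*) FILTER (WHERE is_vip) AS vip_customers,
        COUNT(*)                       AS total_customers,
        ROUND(COUNT(*) FILTER (WHERE is_vip) * 1.0 / COUNT(*), 3) AS vip_ratio
    FROM customers_gold
    GROUP BY state
    ORDER BY vip_ratio DESC
    """
).df()

print(vip_by_state.head())
```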

📷 Screenshot: Superset Dashboard

4. 🧠 EV Product Chatbot Interface — Powered by Generative AI

Built an interactive chatbot application backed by a locally hosted LLM (served as a Streamlit app at localhost:8501) to simulate customer service for an electric vehicle (EV) company. The chatbot is capable of:

  • Answering context-aware product queries, such as best EV models for city driving or feature comparisons across models.
  • Providing dynamic responses about color availability, performance specs, and design features for multiple EV models (e.g., AeroFlow, UrbanGlide, EcoSprint).
  • Delivering natural language explanations drawn from embedded documentation (PDF/Markdown) using RAG (Retrieval-Augmented Generation).
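
A minimal sketch (assuming the langchain-community integrations) of how the product documentation could be chunked, embedded with nomic-embed-text via Ollama, and persisted to the Chroma store used for retrieval; the docs path and chunk sizes are illustrative:

```python
# Build the persisted Chroma index from the product documentation.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Load the Markdown product docs (PDFs would use a PDF loader instead).
docs = DirectoryLoader("docs/", glob="**/*.md", loader_cls=TextLoader).load()

# Split into overlapping chunks so retrieval returns focused passages.
chunks = RecursiveCharacterTextSplitter(
    chunk_size=800, chunk_overlap=100
).split_documents(docs)

# Embed with nomic-embed-text served by Ollama and persist to ./chroma.
Chroma.from_documents(
    chunks,
    embedding=OllamaEmbeddings(model="nomic-embed-text"),
    persist_directory="chroma",
)
```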

📸 Screenshot shows real-time chat interactions where users inquire about model suitability, color availability, and detailed specifications.

📷 Screenshot: Generative AI Chat interface
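
And a hedged sketch of the Streamlit chat loop itself, retrieving context from Chroma and answering with Mistral via Ollama — an approximation built from standard LangChain components, not the project's exact code:

```python
# Streamlit chat UI: retrieve relevant chunks from Chroma, answer with Mistral.
import streamlit as st
from langchain.chains import RetrievalQA
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

st.title("EV Sales Copilot")  # illustrative title

vectordb = Chroma(
    persist_directory="chroma",
    embedding_function=OllamaEmbeddings(model="nomic-embed-text"),
)
qa = RetrievalQA.from_chain_type(
    llm=ChatOllama(model="mistral"),
    retriever=vectordb.as_retriever(search_kwargs={"k": 4}),
)

if "history" not in st.session_state:
    st.session_state.history = []

# Replay the conversation so far, then handle the next question.
for role, text in st.session_state.history:
    st.chat_message(role).write(text)

if question := st.chat_input("Ask about EV models, colors, or sales..."):
    st.chat_message("user").write(question)
    answer = qa.invoke({"query": question})["result"]
    st.chat_message("assistant").write(answer)
    st.session_state.history += [("user", question), ("assistant", answer)]
```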

🎯 Skills Demonstrated

  • ✅ GCS as a cloud data lake for large-scale storage
  • ✅ PySpark for large-scale transformation
  • ✅ dbt for data modeling & CI-friendly pipelines
  • ✅ Superset for rapid BI development
  • ✅ LLM apps using LangChain + Ollama + Chroma
  • ✅ Streamlit for frontend chatbot UI
  • ✅ Environment & deployment management with .env, pyenv, and Ngrok
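
For example, secrets and runtime settings can be kept out of the codebase with python-dotenv; the variable names below are illustrative, not the project's actual .env keys:

```python
# Load configuration and secrets from .env instead of hard-coding them.
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the project root

GCS_BUCKET = os.getenv("GCS_BUCKET", "ev-sales-lakehouse")   # hypothetical key
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "mistral")
CHROMA_DIR = os.getenv("CHROMA_PERSIST_DIR", "chroma")
GOOGLE_CREDS = os.environ["GOOGLE_APPLICATION_CREDENTIALS"]  # required for GCS
```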

📁 Repository Structure

```
Sales-Copilot-Lakehouse/
├── rag/                    # LLM, RAG, Chatbot code
│   ├── manage_chroma_db.py
│   ├── streamlit_chat.py
├── dbt/                    # dbt models and config
│   ├── models/
│   ├── dbt_project.yml
├── spark_jobs/            # PySpark scripts for ingestion
├── superset/              # Superset dashboard exports
├── chroma/                # Persisted vector DB
├── images/                # 📷 Screenshot placeholders
├── .env                   # Environment secrets
└── README.md              # You're here
```
