An advanced Text-to-SQL agent that uses a Knowledge Graph (FalkorDB) and a Heterogeneous Graph Attention Network (HAN) to accurately identify relevant tables before generating SQL with a Large Language Model (LLM).
This project implements a two-stage Retrieval-Augmented Generation (RAG) architecture for a more accurate and efficient Text-to-SQL conversion.
- YAML-based Schema: Easily define and manage your database schema, descriptions, and relationships using simple YAML files.
- Heterogeneous Knowledge Graph: Models your SQL schema as a rich graph in FalkorDB with different node (
Table,Column) and edge (HAS_COLUMN,LINKS_TO) types. - State-of-the-Art Embeddings: Uses a Heterogeneous Graph Attention Network (HAN) to create powerful, context-aware vector embeddings for each table. These embeddings understand both the table's content and its relationships within the schema.
- Two-Stage RAG Architecture:
- Retrieval: A fast vector search in FalkorDB finds the most relevant candidate tables for a user query.
- Generation: A precisely structured prompt containing only the candidate table schemas is sent to an LLM to generate the final SQL query.
The system is split into two main phases: an offline Indexing Pipeline and an online Query Pipeline.
This is a one-time process run by graph_builder.py to prepare the knowledge graph.
- Parse Schemas: Reads all
.yamlfiles from the/schemasdirectory. - Build Graph: Populates a FalkorDB database with
TableandColumnnodes and their relationships. - Generate Embeddings: Runs the HAN model on the graph to compute a
han_embeddingvector for eachTablenode and stores it back in FalkorDB.
This is the real-time process run by agent.py to answer a user's question.
- User Query: The user provides a question in natural language (e.g., "who were our top 5 customers last month?").
- Retrieve Candidates: The
retriever.pymodule embeds the user query and performs a cosine similarity search in FalkorDB to find the most relevantTablenodes based on theirhan_embedding. - Generate Prompt: The
prompt_engine.pymodule takes the list of candidate tables, loads their full schemas from the YAML files, and constructs a detailed, structured prompt for the LLM. - Generate SQL: The prompt is sent to an LLM (e.g., GPT-4, Claude 3), which returns the final, executable SQL query.
.
├── docs -- folder: contains documentation on some of the design desciions
├── schemas -- folder: contains SQL table definitions + meta data
├── src -- folder: the actual Python package code
│ ├── graphdb_agent
│ │ ├── agent.py
│ │ ├── config.py
│ │ ├── generate_starfish_questions.py
│ │ ├── graph_builder.py
│ │ ├── __init__.py
│ │ ├── lemmatization.py
│ │ ├── prompt_engine.py
│ │ ├── retriever.py
│ │ ├── semantic.py
│ │ └── z_snippets.py
├── docker-compose.yml -- stars GraphDB server, e.g. FalkorDB
├── LICENSE
├── pyproject.toml -- defines project dependencies
├── README.md
└── uv.lock -- for `uv sync`
Installation Steps:
-
Clone the repository:
git clone https://github.com/your-username/graphdb-agent.git cd graphdb-agent -
Create and activate a virtual environment via uv (assumes Linux or Mac):
uv sync # creates folder .venv with everything you need source .venv/bin/activate
-
Configure the application:
Opengraphdb-agent/config.pyand update your settings, in particular:- Choice of
SQL_DIALECT - LiteLLM
API_KEY - LiteLLM
LLM_MODEL
- Choice of
-
Start the FalkorDB server:
docker compose up -d
Its UI is accessible via http://localhost:3000/graph
Add or edit the .yaml files in the /schemas directory to match your database schema. Ensure you provide clear
descriptions and define _global_join_paths.yaml for foreign key relationships, too. Follow the layout of the
existing examples.
Execute the graph builder script to populate FalkorDB and generate the HAN embeddings. This only needs to be done once, or whenever your schema changes.
uv run src/graphdb_agent/graph_builder.py Execute the main agent script to process a natural language query.
# The agent.py script can be modified to take a command-line argument
uv run src/graphdb_agent/agent.py "Show me the names of customers who placed large orders"This will output the final, ready-to-use SQL query.