Graph-Powered Text-to-SQL Agent

An advanced Text-to-SQL agent that uses a Knowledge Graph (FalkorDB) and a Heterogeneous Graph Attention Network (HAN) to accurately identify relevant tables before generating SQL with a Large Language Model (LLM).

This project implements a two-stage Retrieval-Augmented Generation (RAG) architecture for a more accurate and efficient Text-to-SQL conversion.

Key Features

YAML-based Schema: Easily define and manage your database schema, descriptions, and relationships using simple YAML files.
Heterogeneous Knowledge Graph: Models your SQL schema as a rich graph in FalkorDB with different node (Table, Column) and edge (HAS_COLUMN, LINKS_TO) types.
State-of-the-Art Embeddings: Uses a Heterogeneous Graph Attention Network (HAN) to create powerful, context-aware vector embeddings for each table. These embeddings understand both the table's content and its relationships within the schema.
Two-Stage RAG Architecture:
1. Retrieval: A fast vector search in FalkorDB finds the most relevant candidate tables for a user query.
2. Generation: A precisely structured prompt containing only the candidate table schemas is sent to an LLM to generate the final SQL query.

Architecture Overview

The system is split into two main phases: an offline Indexing Pipeline and an online Query Pipeline.

1. Indexing Pipeline (Offline)

This is a one-time process run by graph_builder.py to prepare the knowledge graph.

Parse Schemas: Reads all .yaml files from the /schemas directory.
Build Graph: Populates a FalkorDB database with Table and Column nodes and their relationships.
Generate Embeddings: Runs the HAN model on the graph to compute a han_embedding vector for each Table node and stores it back in FalkorDB.

2. Query Pipeline (Online)

This is the real-time process run by agent.py to answer a user's question.

User Query: The user provides a question in natural language (e.g., "who were our top 5 customers last month?").
Retrieve Candidates: The retriever.py module embeds the user query and performs a cosine similarity search in FalkorDB to find the most relevant Table nodes based on their han_embedding.
Generate Prompt: The prompt_engine.py module takes the list of candidate tables, loads their full schemas from the YAML files, and constructs a detailed, structured prompt for the LLM.
Generate SQL: The prompt is sent to an LLM (e.g., GPT-4, Claude 3), which returns the final, executable SQL query.

Project Structure

.
├── docs                                                    -- folder: contains documentation on some of the design desciions
├── schemas                                                 -- folder: contains SQL table definitions + meta data
├── src                                                     -- folder: the actual Python package code
│   ├── graphdb_agent
│   │   ├── agent.py
│   │   ├── config.py
│   │   ├── generate_starfish_questions.py
│   │   ├── graph_builder.py
│   │   ├── __init__.py
│   │   ├── lemmatization.py
│   │   ├── prompt_engine.py
│   │   ├── retriever.py
│   │   ├── semantic.py
│   │   └── z_snippets.py
├── docker-compose.yml                                      -- stars GraphDB server, e.g. FalkorDB
├── LICENSE
├── pyproject.toml                                          -- defines project dependencies
├── README.md 
└── uv.lock                                                 -- for `uv sync`

Setup and Installation

Installation Steps:

Clone the repository:

git clone https://github.com/your-username/graphdb-agent.git
cd graphdb-agent

Create and activate a virtual environment via uv (assumes Linux or Mac):

uv sync                     # creates folder .venv with everything you need
source .venv/bin/activate

Configure the application:
Open graphdb-agent/config.py and update your settings, in particular:
- Choice of SQL_DIALECT
- LiteLLM API_KEY
- LiteLLM LLM_MODEL
Start the FalkorDB server:
```
docker compose up -d
```
Its UI is accessible via http://localhost:3000/graph

Usage

1. Define Your Schema

Add or edit the .yaml files in the /schemas directory to match your database schema. Ensure you provide clear descriptions and define _global_join_paths.yaml for foreign key relationships, too. Follow the layout of the existing examples.

2. Run the Indexing Pipeline

Execute the graph builder script to populate FalkorDB and generate the HAN embeddings. This only needs to be done once, or whenever your schema changes.

uv run src/graphdb_agent/graph_builder.py

3. Run a Query

Execute the main agent script to process a natural language query.

# The agent.py script can be modified to take a command-line argument
uv run src/graphdb_agent/agent.py "Show me the names of customers who placed large orders"

This will output the final, ready-to-use SQL query.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Graph-Powered Text-to-SQL Agent

Key Features

Architecture Overview

1. Indexing Pipeline (Offline)

2. Query Pipeline (Online)

Project Structure

Setup and Installation

Usage

1. Define Your Schema

2. Run the Indexing Pipeline

3. Run a Query

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
docs		docs
schemas		schemas
src/graphdb_agent		src/graphdb_agent
.gitignore		.gitignore
.python-version		.python-version
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml
pyproject.toml		pyproject.toml
uv.lock		uv.lock

License

kzk2000/graphdb-agent

Folders and files

Latest commit

History

Repository files navigation

Graph-Powered Text-to-SQL Agent

Key Features

Architecture Overview

1. Indexing Pipeline (Offline)

2. Query Pipeline (Online)

Project Structure

Setup and Installation

Usage

1. Define Your Schema

2. Run the Indexing Pipeline

3. Run a Query

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages