TokenSmith is a local-first database system that lets students query textbooks, lecture slides, and notes and get fast, cited answers on their own machines using local LLMs. It is built on retrieval-augmented generation (RAG) and applies database-inspired principles such as indexing, latency-focused querying, caching, and incremental builds to optimize the ingestion → retrieval → generation pipeline.
- Parse and index PDF documents
- Semantic retrieval with FAISS (see the sketch after this list)
- Local inference via `llama.cpp` (GGUF models)
- Acceleration: Metal (Apple Silicon), CUDA (NVIDIA), or CPU
- Configurable chunking (tokens or characters)
- Optional indexing progress visualization
- Table preservation during indexing (flag-based)
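For intuition, the retrieval feature boils down to embedding chunks once at index time and running a nearest-neighbor search per query. Below is a minimal, illustrative sketch (not TokenSmith's actual code), assuming the `faiss-cpu` and `sentence-transformers` packages:

```python
# Illustrative semantic-retrieval sketch (not TokenSmith's actual pipeline).
# Assumes: pip install faiss-cpu sentence-transformers
import faiss
from sentence_transformers import SentenceTransformer

# Stand-ins for chunks produced by PDF parsing/indexing.
chunks = [
    "A B-tree keeps keys sorted and supports O(log n) lookups.",
    "FAISS performs nearest-neighbor search over dense vectors.",
    "Write-ahead logging makes database writes durable.",
]

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Index time: embed every chunk and add it to a FAISS index.
vectors = embedder.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(vectors.shape[1])  # inner product = cosine on unit vectors
index.add(vectors)

# Query time: embed the question, fetch top_k chunks, assemble prompt context.
query_vec = embedder.encode(["How do B-tree lookups scale?"], normalize_embeddings=True)
scores, ids = index.search(query_vec, 2)  # top_k = 2 here
context = "\n\n".join(chunks[i] for i in ids[0])
print(context)  # this context (plus citations) is what the local LLM answers from
```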
Prerequisites:

- Python 3.9+
- Conda/Miniconda
- System prerequisites:
  - macOS: Xcode Command Line Tools
  - Linux: GCC, make, CMake
  - Windows: Visual Studio Build Tools
```bash
git clone https://github.com/georgia-tech-db/TokenSmith.git
cd TokenSmith
```

Create the model directory and put the appropriate models in it.
```bash
mkdir models
cd models
```

Now, suppose `config.yaml` has the following settings:

```yaml
embed_model: "models/Qwen3-Embedding-4B-Q5_K_M.gguf"
model_path: "models/qwen2.5-1.5b-instruct-q5_k_m.gguf"
```

For the above config, download the appropriate files from the links below (or script the download, as sketched after the links) and place them in the `models/` folder with the expected file names.
- https://huggingface.co/Qwen/Qwen3-Embedding-4B-GGUF/tree/main
- https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF/tree/main
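If you'd rather script the download, the `huggingface_hub` package can fetch the exact files; here is a sketch (assumes `pip install huggingface_hub`; the file names are taken from the config above, and downloading via the browser links works just as well):

```python
# Sketch: fetch the GGUF files referenced in config.yaml via huggingface_hub.
from huggingface_hub import hf_hub_download

# Embedding model (file name must match embed_model in config.yaml).
hf_hub_download(
    repo_id="Qwen/Qwen3-Embedding-4B-GGUF",
    filename="Qwen3-Embedding-4B-Q5_K_M.gguf",
    local_dir="models",
)

# Instruct model (file name must match model_path in config.yaml).
hf_hub_download(
    repo_id="Qwen/Qwen2.5-1.5B-Instruct-GGUF",
    filename="qwen2.5-1.5b-instruct-q5_k_m.gguf",
    local_dir="models",
)
```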
```bash
make build
```

This creates a Conda env `tokensmith`, installs the Python dependencies, and builds or detects llama.cpp.
```bash
conda activate tokensmith
mkdir -p data/chapters
cp your-documents.pdf data/chapters/
make run-index
```

With custom parameters:

```bash
make run-index ARGS="--pdf_range 1-10 --chunk_mode chars --visualize"
```

Start a chat session:

```bash
python -m src.main chat
```

If you see a missing-model error, download `qwen2.5-0.5b-instruct-q5_k_m.gguf` into `llama.cpp/models`.
When you're done:

```bash
conda deactivate
```

Config priority (highest → lowest), resolved first-match-wins (see the sketch after the list):

1. `--config` CLI argument
2. `~/.config/tokensmith/config.yaml`
3. `config/config.yaml`
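In code, first-match-wins resolution of that list might look like this (an illustrative sketch, not TokenSmith's actual loader; `resolve_config` is a hypothetical helper):

```python
# Sketch of the first-match-wins lookup described above (hypothetical helper).
from pathlib import Path
from typing import Optional

def resolve_config(cli_config: Optional[str] = None) -> Path:
    """Return the first config file that exists, in priority order."""
    candidates = [
        Path(cli_config) if cli_config else None,        # 1. --config CLI argument
        Path.home() / ".config/tokensmith/config.yaml",  # 2. per-user config
        Path("config/config.yaml"),                      # 3. repo default
    ]
    for path in candidates:
        if path is not None and path.exists():
            return path
    raise FileNotFoundError("No config.yaml found in any of the usual locations")
```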
embed_model: "sentence-transformers/all-MiniLM-L6-v2"
top_k: 5
max_gen_tokens: 400
halo_mode: "none"
seg_filter: null
# Model settings
model_path: "models/qwen2.5-0.5b-instruct-q5_k_m.gguf"
# Indexing settings
chunk_mode: "tokens" # or "chars"
chunk_tokens: 500
chunk_size_char: 20000make run-indexmake run-index ARGS="--pdf_range <start>-<end> --chunk_mode <tokens|chars>"make run-index ARGS="--keep_tables --visualize --chunk_tokens <num_tokens>"make run-index ARGS="--pdf_dir <path_to_pdf> --index_prefix book_index --config <path_to_yaml>"python -m src.main chat --config <path_to_yaml> --model_path <path_to_gguf>export LLAMA_CPP_BINARY=/usr/local/bin/llama-cli
make buildmake update-env
make export-env
make show-depsmode:indexorchat--config: path to YAML config--pdf_dir: directory with PDFs--index_prefix: prefix for index files--model_path: path to GGUF model
--pdf_range: e.g.,1-10--chunk_mode:tokensorchars--chunk_tokens: default 500--chunk_size_char: default 20000--keep_tables--visualize
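To make `--chunk_mode` concrete: token mode splits text on a per-chunk token budget, while char mode splits on a character budget. Here is a rough illustrative sketch (a whitespace split stands in for the model's real tokenizer; this is not the project's actual splitter):

```python
# Rough sketch of the two chunking modes (illustrative only).
from typing import List

def chunk_by_chars(text: str, chunk_size_char: int = 20000) -> List[str]:
    # Fixed-width slicing on raw characters.
    return [text[i:i + chunk_size_char] for i in range(0, len(text), chunk_size_char)]

def chunk_by_tokens(text: str, chunk_tokens: int = 500) -> List[str]:
    # Whitespace split as a stand-in; a real build would use the embed model's tokenizer.
    tokens = text.split()
    return [" ".join(tokens[i:i + chunk_tokens])
            for i in range(0, len(tokens), chunk_tokens)]

# chunk_mode: "tokens" -> chunk_by_tokens(doc_text, 500)
# chunk_mode: "chars"  -> chunk_by_chars(doc_text, 20000)
```

Character mode is cheap but can cut mid-sentence; token mode keeps chunk sizes aligned with the embedder's context budget.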
Makefile targets:

```bash
make help
make env
make build-llama
make install
make build
make test
make clean
make show-deps
make update-env
make export-env
```

Run the tests:

```bash
pytest tests/
pytest tests/ -s
pytest tests/ --benchmark-ids="test" -s
```

- Tests call the same `get_answer()` pipeline used by chat
- Metrics: semantic similarity, BLEU, keyword matching, text similarity (see the sketch after this list)
- Outputs: terminal logs and HTML report
- System prompts: baseline, tutor, concise, detailed
- Component isolation: run with/without chunks or with golden chunks
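For intuition, two of the listed metrics could be computed roughly like this (an illustrative sketch; the real harness lives under `tests/` and may differ):

```python
# Sketch of two of the listed metrics (illustrative; see tests/ for the real harness).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sentence_transformers import SentenceTransformer, util

answer = "A B-tree keeps keys sorted for logarithmic lookups."
reference = "B-trees store sorted keys, giving O(log n) lookups."

# BLEU: n-gram overlap between the generated answer and a reference answer.
bleu = sentence_bleu(
    [reference.split()], answer.split(),
    smoothing_function=SmoothingFunction().method1,
)

# Semantic similarity: cosine similarity between sentence embeddings.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
emb = embedder.encode([answer, reference], normalize_embeddings=True)
semantic = util.cos_sim(emb[0], emb[1]).item()

print(f"BLEU={bleu:.3f}  semantic={semantic:.3f}")
```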
Artifacts:

- `tests/results/benchmark_results.json`
- `tests/results/benchmark_summary.html`
- `tests/results/failed_tests.log`
Documentation: see `tests/README.md`.