TokenSmith is a local-first database system that lets students query textbooks, lecture slides, and notes and get fast, cited answers on their own machines using local LLMs. It is built on retrieval-augmented generation (RAG) and applies database-inspired principles such as indexing, latency-focused querying, caching, and incremental builds to optimize the ingestion → retrieval → generation pipeline.
- Parse and index PDF documents
- Semantic retrieval with FAISS (see the sketch after this list)
- Local inference via `llama.cpp` (GGUF models)
- Acceleration: Metal (Apple Silicon), CUDA (NVIDIA), or CPU
- Configurable chunking (tokens or characters)
- Optional indexing progress visualization
- Table preservation during indexing (flag-based)
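For intuition, the retrieval feature boils down to embedding chunks once at index time and running a nearest-neighbor search per query. Below is a minimal, illustrative sketch (not TokenSmith's actual code), assuming the `faiss-cpu` and `sentence-transformers` packages:

```python
# Illustrative semantic-retrieval sketch (not TokenSmith's actual pipeline).
# Assumes: pip install faiss-cpu sentence-transformers
import faiss
from sentence_transformers import SentenceTransformer

# Stand-ins for chunks produced by PDF parsing/indexing.
chunks = [
    "A B-tree keeps keys sorted and supports O(log n) lookups.",
    "FAISS performs nearest-neighbor search over dense vectors.",
    "Write-ahead logging makes database writes durable.",
]

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Index time: embed every chunk and add it to a FAISS index.
vectors = embedder.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(vectors.shape[1])  # inner product = cosine on unit vectors
index.add(vectors)

# Query time: embed the question, fetch top_k chunks, assemble prompt context.
query_vec = embedder.encode(["How do B-tree lookups scale?"], normalize_embeddings=True)
scores, ids = index.search(query_vec, 2)  # top_k = 2 here
context = "\n\n".join(chunks[i] for i in ids[0])
print(context)  # this context (plus citations) is what the local LLM answers from
```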
Prerequisites:

- Python 3.9+
- Conda/Miniconda
- System prerequisites:
  - macOS: Xcode Command Line Tools
  - Linux: GCC, make, CMake
  - Windows: Visual Studio Build Tools
```bash
git clone https://github.com/georgia-tech-db/TokenSmith.git
cd TokenSmith
```

Create the model directory and put the appropriate models in it.
```bash
mkdir models
cd models
```

Now, suppose `config.yaml` has the following settings:

```yaml
embed_model: "models/Qwen3-Embedding-4B-Q5_K_M.gguf"
model_path: "models/qwen2.5-1.5b-instruct-q5_k_m.gguf"
```

For the above config, download the appropriate files from the links below (or script the download, as sketched after the links) and place them in the `models/` folder with the expected file names.
- https://huggingface.co/Qwen/Qwen3-Embedding-4B-GGUF/tree/main
- https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF/tree/main
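If you'd rather script the download, the `huggingface_hub` package can fetch the exact files; here is a sketch (assumes `pip install huggingface_hub`; the file names are taken from the config above, and downloading via the browser links works just as well):

```python
# Sketch: fetch the GGUF files referenced in config.yaml via huggingface_hub.
from huggingface_hub import hf_hub_download

# Embedding model (file name must match embed_model in config.yaml).
hf_hub_download(
    repo_id="Qwen/Qwen3-Embedding-4B-GGUF",
    filename="Qwen3-Embedding-4B-Q5_K_M.gguf",
    local_dir="models",
)

# Instruct model (file name must match model_path in config.yaml).
hf_hub_download(
    repo_id="Qwen/Qwen2.5-1.5B-Instruct-GGUF",
    filename="qwen2.5-1.5b-instruct-q5_k_m.gguf",
    local_dir="models",
)
```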
```bash
make build
```

This creates a Conda env `tokensmith`, installs the Python dependencies, and builds or detects llama.cpp.
```bash
conda activate tokensmith
mkdir -p data/chapters
cp your-documents.pdf data/chapters/
make run-index
```

With custom parameters:

```bash
make run-index ARGS="--pdf_range 1-10 --chunk_mode chars --visualize"
```

Start a chat session:

```bash
python -m src.main chat
```

If you see a missing-model error, download `qwen2.5-0.5b-instruct-q5_k_m.gguf` into `llama.cpp/models`.
When you're done:

```bash
conda deactivate
```

Config priority (highest → lowest), resolved first-match-wins (see the sketch after the list):

1. `--config` CLI argument
2. `~/.config/tokensmith/config.yaml`
3. `config/config.yaml`
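In code, first-match-wins resolution of that list might look like this (an illustrative sketch, not TokenSmith's actual loader; `resolve_config` is a hypothetical helper):

```python
# Sketch of the first-match-wins lookup described above (hypothetical helper).
from pathlib import Path
from typing import Optional

def resolve_config(cli_config: Optional[str] = None) -> Path:
    """Return the first config file that exists, in priority order."""
    candidates = [
        Path(cli_config) if cli_config else None,        # 1. --config CLI argument
        Path.home() / ".config/tokensmith/config.yaml",  # 2. per-user config
        Path("config/config.yaml"),                      # 3. repo default
    ]
    for path in candidates:
        if path is not None and path.exists():
            return path
    raise FileNotFoundError("No config.yaml found in any of the usual locations")
```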
embed_model: "sentence-transformers/all-MiniLM-L6-v2"
top_k: 5
max_gen_tokens: 400
halo_mode: "none"
seg_filter: null
# Model settings
model_path: "models/qwen2.5-0.5b-instruct-q5_k_m.gguf"
# Indexing settings
chunk_mode: "tokens" # or "chars"
chunk_tokens: 500
chunk_size_char: 20000make run-indexmake run-index ARGS="--pdf_range <start>-<end> --chunk_mode <tokens|chars>"make run-index ARGS="--keep_tables --visualize --chunk_tokens <num_tokens>"make run-index ARGS="--pdf_dir <path_to_pdf> --index_prefix book_index --config <path_to_yaml>"python -m src.main chat --config <path_to_yaml> --model_path <path_to_gguf>export LLAMA_CPP_BINARY=/usr/local/bin/llama-cli
make buildmake update-env
make export-env
make show-depsmode:indexorchat--config: path to YAML config--pdf_dir: directory with PDFs--index_prefix: prefix for index files--model_path: path to GGUF model
--pdf_range: e.g.,1-10--chunk_mode:tokensorchars--chunk_tokens: default 500--chunk_size_char: default 20000--keep_tables--visualize
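To make `--chunk_mode` concrete: token mode splits text on a per-chunk token budget, while char mode splits on a character budget. Here is a rough illustrative sketch (a whitespace split stands in for the model's real tokenizer; this is not the project's actual splitter):

```python
# Rough sketch of the two chunking modes (illustrative only).
from typing import List

def chunk_by_chars(text: str, chunk_size_char: int = 20000) -> List[str]:
    # Fixed-width slicing on raw characters.
    return [text[i:i + chunk_size_char] for i in range(0, len(text), chunk_size_char)]

def chunk_by_tokens(text: str, chunk_tokens: int = 500) -> List[str]:
    # Whitespace split as a stand-in; a real build would use the embed model's tokenizer.
    tokens = text.split()
    return [" ".join(tokens[i:i + chunk_tokens])
            for i in range(0, len(tokens), chunk_tokens)]

# chunk_mode: "tokens" -> chunk_by_tokens(doc_text, 500)
# chunk_mode: "chars"  -> chunk_by_chars(doc_text, 20000)
```

Character mode is cheap but can cut mid-sentence; token mode keeps chunk sizes aligned with the embedder's context budget.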
Makefile targets:

```bash
make help
make env
make build-llama
make install
make build
make test
make clean
make show-deps
make update-env
make export-env
```

Run the tests:

```bash
pytest tests/
pytest tests/ -s
pytest tests/ --benchmark-ids="test" -s
```

- Tests call the same `get_answer()` pipeline used by chat
- Metrics: semantic similarity, BLEU, keyword matching, text similarity (see the sketch after this list)
- Outputs: terminal logs and HTML report
- System prompts: baseline, tutor, concise, detailed
- Component isolation: run with/without chunks or with golden chunks
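For intuition, two of the listed metrics could be computed roughly like this (an illustrative sketch; the real harness lives under `tests/` and may differ):

```python
# Sketch of two of the listed metrics (illustrative; see tests/ for the real harness).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sentence_transformers import SentenceTransformer, util

answer = "A B-tree keeps keys sorted for logarithmic lookups."
reference = "B-trees store sorted keys, giving O(log n) lookups."

# BLEU: n-gram overlap between the generated answer and a reference answer.
bleu = sentence_bleu(
    [reference.split()], answer.split(),
    smoothing_function=SmoothingFunction().method1,
)

# Semantic similarity: cosine similarity between sentence embeddings.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
emb = embedder.encode([answer, reference], normalize_embeddings=True)
semantic = util.cos_sim(emb[0], emb[1]).item()

print(f"BLEU={bleu:.3f}  semantic={semantic:.3f}")
```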
Artifacts:

- `tests/results/benchmark_results.json`
- `tests/results/benchmark_summary.html`
- `tests/results/failed_tests.log`
Documentation: see `tests/README.md`.