A local, CPU- and GPU-optimized LLM code assistant built in Go, with advanced RAG and plugin capabilities.
Vibrant is a command-line tool that brings AI-powered coding assistance directly to your terminal, running entirely on your local machine. It supports both CPU and GPU acceleration (Metal on Apple Silicon, CUDA on NVIDIA/Linux) for fast inference. No internet connection or API keys required.
- 🚀 GPU Accelerated: Metal GPU on Apple Silicon (6.4x speedup), CUDA on Linux (10-15x speedup)
- 🖥️ CPU-optimized: Runs efficiently on CPU using quantized models (GGUF format)
- 🧠 Context-aware: Understands your codebase structure with semantic search (RAG)
- 🎯 Auto-tuned: Automatically selects the best model based on your system RAM
- 💬 Interactive: Rich terminal UI with syntax highlighting for 30+ languages
- 🔒 Private: All processing happens locally; your code never leaves your machine
- 🔌 Extensible: Plugin system for custom functionality
- 📝 Git Integration: Smart commit message generation with conventional commits
- ⌨️ Tab-Completion: Advanced shell completion for zsh, bash, and fish
- 🔍 Semantic Search: TF-IDF based vector store for code retrieval
- 🤖 Agentic: 15+ tools for multi-step workflows with self-correction
- 🧪 Test & Build: Integrated testing, building, and linting support
- ✏️ Code Editing: Diff-based file editing with automatic backups
- 🔬 AST Analysis: Deep code understanding with symbol extraction
✅ Feature Complete - Agentic code assistant with GPU acceleration!
Current Phase: Phase 10.11 - Chat Templates, Cache Warming & Fused Dequant ✅ COMPLETE
GPU Backend:
- ✅ Phase 11.1: Metal GPU support for Apple Silicon (complete, 6.4x speedup)
- ✅ Phase 11.3: NVIDIA CUDA support for Linux (Phase 1 & 2 COMPLETE!)
  - ✅ Phase 1: CUDA infrastructure and model loading (COMPLETE)
  - ✅ Phase 2: Device-aware tensor operations (COMPLETE)
  - 🚧 Phase 3: Performance optimization and profiling (NEXT)
- ✅ Metal GPU backend for Apple Silicon
- ✅ CUDA GPU backend for NVIDIA GPUs on Linux (RTX 4090 validated)
- ✅ Device abstraction layer (CPU/GPU/Metal/CUDA)
- ✅ 12 GPU kernels implemented: MatMul, Softmax, RMSNorm, element-wise ops, RoPE
- ✅ Tensor device migration (CPU ↔ GPU)
- ✅ Memory management with buffer pooling (19GB pool on RTX 4090)
- ✅ CLI integration with `--device` flag (auto, cpu, gpu, metal, cuda)
- ✅ Quantized model support with automatic GPU dequantization
- ✅ Device-aware tensor creation infrastructure
- ✅ All core tensor operations GPU-accelerated (Add, Mul, SiLU, Softmax, RMSNorm, RoPE)
CUDA Performance Status:
- Model loading: ✅ Working (13.3GB VRAM for 3B model)
- GPU operations: ✅ All core ops implemented
- Infrastructure: ✅ Complete and stable
- Next: Performance testing and optimization (Phase 3)
Metal Performance Results (Apple Silicon):
- Single-row (decode): CPU faster (low overhead)
- Medium ops (128×128): 1.37x GPU speedup
- Large ops (512×512): 6.4x GPU speedup
- Validated: Zero error on critical decode path
Recent Improvements (Phase 10.11):
- ✅ Chat template support (ChatML, Llama 3, plain text with auto-detection)
- ✅ Special token encoding for `<|...|>` tokens in chat prompts
- ✅ Weight cache warming (eliminates cold-start penalty on first inference)
- ✅ Fused dequant-transpose (50% memory reduction per weight)
- ✅ Debug output cleanup (no more spam on every forward pass)
- ✅ All 230+ tests passing, no regressions
Note on LLM Inference: Currently uses a mock engine by default due to go-llama.cpp requiring manual setup (git submodules). This is perfect for development and testing. For real inference, we recommend using Ollama or manual vendor setup. See docs/llama-setup.md for details.
- ✅ Phase 1: Project Setup & Foundation
- ✅ Phase 2: System Detection & Model Management
- ✅ Phase 3: LLM Integration (with mock engine)
- ✅ Phase 4: Code Context System
- ✅ Phase 5: Assistant Core Features
- ✅ Phase 6: CLI User Experience
- ✅ Phase 7: Advanced Features (RAG, Plugins, Git Integration)
- ✅ Phase 8: Testing & Optimization
- ✅ Phase 9: Agentic Behavior (Claude Code-inspired)
- ✅ Phase 10: Pure Go Inference Engine (no dependencies!)
- ✅ Phase 11.1: GPU Backend Foundation (Metal on Apple Silicon)
- ✅ Phase 11.3: CUDA GPU Support (NVIDIA GPUs on Linux)
- ✅ Phase 10.11: Chat Templates, Cache Warming & Fused Dequant-Transpose
Metal GPU Backend for Apple Silicon:
- Device abstraction layer supporting CPU and GPU
- GPU kernels: MatMul, Softmax, RMSNorm, element-wise ops
- Automatic CPU ↔ GPU tensor migration
- Unified memory optimization (~40 GB/s transfer speeds)
- Production-ready with comprehensive validation
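The device abstraction layer can be pictured as a small interface that each backend implements. A hypothetical Go sketch (not Vibrant's actual internal/gpu API), with a naive CPU implementation as the reference backend:

```go
package main

import "fmt"

// Device abstracts a compute backend behind a common interface, so the
// transformer code can run the same ops on CPU, Metal, or CUDA.
// (Hypothetical sketch of the abstraction described above.)
type Device interface {
	Name() string
	MatMul(a, b [][]float32) [][]float32
}

// CPUDevice is a naive reference implementation of the interface.
type CPUDevice struct{}

func (CPUDevice) Name() string { return "cpu" }

// MatMul computes a (rows x inner) * (inner x cols) product.
func (CPUDevice) MatMul(a, b [][]float32) [][]float32 {
	rows, inner, cols := len(a), len(b), len(b[0])
	out := make([][]float32, rows)
	for i := range out {
		out[i] = make([]float32, cols)
		for k := 0; k < inner; k++ {
			for j := 0; j < cols; j++ {
				out[i][j] += a[i][k] * b[k][j]
			}
		}
	}
	return out
}

func main() {
	var d Device = CPUDevice{}
	out := d.MatMul([][]float32{{1, 2}}, [][]float32{{3}, {4}})
	fmt.Println(d.Name(), out[0][0]) // 1*3 + 2*4 = 11
}
```

A Metal or CUDA backend would satisfy the same interface while dispatching to GPU kernels, which is what makes the `--device` flag a one-line swap.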
CUDA GPU Backend for NVIDIA GPUs (Linux):
- Full feature parity with Metal (11 kernels)
- Automatic quantized model support (Q4_K, Q5_K, Q6_K → Float32 dequantization)
- Direct CGO bindings to CUDA Runtime API
- Pre-compiled kernels with nvcc
- Buffer pooling for efficient memory reuse
- CLI device selection: `vibrant chat --device cuda`
- Expected 10-15x speedup on large operations (RTX 4090)
- Works with all existing quantized models (automatic dequantization)
- Setup guide: docs/setup/cuda-setup.md
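Buffer pooling avoids repeated device allocations by recycling buffers of the same size. A minimal CPU-side sketch of the idea (hypothetical; plain byte slices stand in for device memory):

```go
package main

import "fmt"

// BufferPool recycles scratch buffers keyed by byte size, so repeated
// forward passes reuse memory instead of allocating and freeing it.
// (Illustrative sketch of the pooling idea; real GPU pools track device
// allocations rather than Go slices.)
type BufferPool struct {
	free map[int][][]byte
}

func NewBufferPool() *BufferPool { return &BufferPool{free: map[int][][]byte{}} }

// Get returns a recycled buffer of exactly size bytes, or allocates one.
func (p *BufferPool) Get(size int) []byte {
	if bufs := p.free[size]; len(bufs) > 0 {
		b := bufs[len(bufs)-1]
		p.free[size] = bufs[:len(bufs)-1]
		return b
	}
	return make([]byte, size)
}

// Put returns a buffer to the pool for later reuse instead of freeing it.
func (p *BufferPool) Put(b []byte) {
	p.free[len(b)] = append(p.free[len(b)], b)
}

func main() {
	pool := NewBufferPool()
	a := pool.Get(1024)
	pool.Put(a)
	b := pool.Get(1024) // reuses a's backing storage
	fmt.Println(len(b))
}
```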
Performance Benchmarks:
| Operation | CPU Time | GPU Time | Speedup |
|---|---|---|---|
| Decode (1×512) | 120 µs | 430 µs | CPU 3.6x faster* |
| Medium (128×128) | 435 µs | 318 µs | GPU 1.37x faster |
| Large (512×512) | 12.5 ms | 2.0 ms | GPU 6.4x faster |
*GPU overhead dominates for small operations; CPU used for decode step.
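The size-based CPU/GPU split in the table can be captured by a simple dispatch heuristic. An illustrative sketch (the 128 threshold comes from the break-even row above; the function itself is hypothetical):

```go
package main

import "fmt"

// pickDevice routes an operation to a backend based on matrix shape.
// GPU launch overhead dominates small ops (the 1x512 decode row), so
// those stay on CPU; at 128x128 and above the GPU wins.
// (Illustrative heuristic, not Vibrant's actual dispatch code.)
func pickDevice(rows, cols int) string {
	if rows >= 128 && cols >= 128 {
		return "gpu"
	}
	return "cpu"
}

func main() {
	fmt.Println(pickDevice(1, 512))   // single-token decode -> cpu
	fmt.Println(pickDevice(512, 512)) // large prefill matmul -> gpu
}
```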
CLI Integration:
# Use Metal GPU acceleration (Apple Silicon)
vibrant ask --device metal "your question"
# Use CUDA GPU acceleration (Linux with NVIDIA)
# Works with quantized models (Q4_K, Q5_K, Q6_K) - automatically dequantizes to Float32
vibrant ask --device cuda "your question"
# Use CPU (default, compatible everywhere)
vibrant ask --device cpu "your question"
# Auto-detect and use best device
vibrant ask --device auto "your question"

Note on Quantized Models: When using GPU acceleration with quantized models (Q4_K_M, Q5_K_M, Q6_K), the models are automatically dequantized to Float32 during GPU transfer. This provides full GPU acceleration while working with all existing quantized models, at the cost of ~4x more VRAM usage than the quantized size.
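Because every quantized weight expands to a 4-byte float32 on the GPU, the VRAM needed is roughly parameter count × 4 bytes, regardless of the on-disk quantized size. A tiny illustrative estimator (hypothetical helper; ignores KV cache and activation memory):

```go
package main

import "fmt"

// float32VRAMGB estimates GPU memory needed after dequantization:
// every weight becomes a 4-byte float32, so VRAM ~= params * 4 bytes.
// (Illustrative; real usage adds KV cache and activation overhead.)
func float32VRAMGB(params float64) float64 {
	return params * 4 / 1e9 // bytes -> GB (decimal)
}

func main() {
	// A 3B-parameter model needs roughly 12 GB as float32, consistent
	// with the ~13.3 GB observed once overheads are included.
	fmt.Printf("%.1f GB\n", float32VRAMGB(3e9))
}
```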
🤖 Agentic Behavior (Phase 9.1)
- Tool calling system with 15+ built-in tools
- Multi-step planning with dependency tracking
- Self-correction with automatic retry strategies
- Task decomposition for complex workflows
- Progress tracking and result summarization
🧠 Code Intelligence (Phase 9.2)
- AST parsing for Go with symbol extraction
- Symbol resolution (functions, methods, types, structs, interfaces)
- Dependency tracking and import analysis
- Cross-package references and code navigation
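Go's standard library already provides the AST machinery this builds on. A simplified sketch that extracts top-level function names with go/parser (real symbol extraction would also walk types, methods, and imports):

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// extractFuncs lists top-level function names in a Go source string
// using the standard library's parser and AST types — the same
// machinery an AST-based symbol extractor builds on.
func extractFuncs(src string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "src.go", src, 0)
	if err != nil {
		return nil, err
	}
	var names []string
	for _, decl := range file.Decls {
		if fn, ok := decl.(*ast.FuncDecl); ok {
			names = append(names, fn.Name.Name)
		}
	}
	return names, nil
}

func main() {
	names, _ := extractFuncs("package p\nfunc Hello() {}\nfunc world() {}")
	fmt.Println(names) // [Hello world]
}
```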
✏️ Interactive Editing (Phase 9.3)
- Diff-based editing with unified diff format
- Find/replace operations across files
- Automatic backups before modifications
- Patch application with validation
🧪 Testing & Building (Phase 9.4-9.5)
- Test execution for Go, Python, Node.js
- Auto-detect test frameworks and build tools
- Build integration (go build, make, npm, pip)
- Error parsing and diagnostic reporting
🔍 Quality Assurance (Phase 9.6)
- Linting integration (golangci-lint, pylint, eslint)
- Security scanning support
- Best practices suggestions
- Issue reporting with context
🚀 RAG & Performance
- Semantic code search using TF-IDF vector embeddings
- Smart commit messages with conventional commits
- Plugin system for extensibility (93.2% test coverage)
- 164+ unit tests with comprehensive coverage
- Performance: <1 µs for most operations
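The TF-IDF retrieval above can be sketched in a few lines: weight term counts by inverse document frequency, then rank documents by cosine similarity. A toy version (hypothetical code, not Vibrant's actual vector store):

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// tfidfVectors builds sparse term-frequency vectors weighted by inverse
// document frequency, so terms shared by every document score zero and
// rare, distinctive terms score highest.
func tfidfVectors(docs []string) []map[string]float64 {
	df := map[string]int{}
	tokenized := make([][]string, len(docs))
	for i, d := range docs {
		tokenized[i] = strings.Fields(strings.ToLower(d))
		seen := map[string]bool{}
		for _, t := range tokenized[i] {
			if !seen[t] {
				seen[t] = true
				df[t]++
			}
		}
	}
	vecs := make([]map[string]float64, len(docs))
	for i, toks := range tokenized {
		vec := map[string]float64{}
		for _, t := range toks {
			vec[t]++
		}
		for t := range vec {
			vec[t] *= math.Log(float64(len(docs)) / float64(df[t]))
		}
		vecs[i] = vec
	}
	return vecs
}

// cosine measures similarity between two sparse vectors.
func cosine(a, b map[string]float64) float64 {
	var dot, na, nb float64
	for t, v := range a {
		dot += v * b[t]
		na += v * v
	}
	for _, v := range b {
		nb += v * v
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	vecs := tfidfVectors([]string{
		"func main handles auth tokens",
		"auth tokens are validated here",
		"tensor ops run on the gpu",
	})
	fmt.Printf("%.2f vs %.2f\n", cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
}
```

The first pair of documents shares the distinctive terms "auth" and "tokens", so it scores higher than the unrelated third document, which is exactly the ranking behavior a retrieval step needs.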
- tokenizer: 100%
- plugin: 93.2%
- agent: 89.1%
- gguf: 87.2%
- codeintel: 81.8%
- system: 82.1%
- gpu: 80.1% (new!)
- diff: 78.3%
- tensor: 78.2%
- transformer: 62.1%
- assistant: 59.6%
- context: 49.7%
Total: 220+ tests passing across 20 packages
# Clone and build
git clone https://github.com/xupit3r/vibrant.git
cd vibrant
make build
# Install to system path
make install

After installing, enable tab-completion for your shell:
# Zsh
vibrant completion zsh > ~/.zsh/completion/_vibrant
echo 'fpath=(~/.zsh/completion $fpath)' >> ~/.zshrc
echo 'autoload -Uz compinit && compinit' >> ~/.zshrc
# Bash
vibrant completion bash > ~/.local/share/bash-completion/completions/vibrant
source ~/.local/share/bash-completion/completions/vibrant
# Fish
vibrant completion fish > ~/.config/fish/completions/vibrant.fish

Restart your shell or source your config to activate completions.
# List available models
vibrant model list
# Download a specific model (with tab-completion!)
vibrant model download <TAB>
vibrant model download qwen2.5-coder-7b-q5
# Show model information
vibrant model info qwen2.5-coder-3b-q4
# Ask a question (downloads model if needed)
vibrant ask "What is a goroutine?"
# Ask with GPU acceleration (Metal on macOS, CUDA on Linux)
vibrant ask --device gpu "Explain Go interfaces"
# With specific model (use tab to see available models)
vibrant ask --model <TAB>
vibrant ask --model qwen2.5-coder-7b-q5 "Explain Go interfaces"
# Interactive chat mode
vibrant chat
# Ask with context from specific files/directories
vibrant ask --context ./src "How does authentication work?"

GPU acceleration is available on multiple platforms:
- Apple Silicon (M-series): Metal GPU backend (6.4x speedup)
- Linux with NVIDIA: CUDA GPU backend (validated on RTX 4090, 24 GB VRAM)
# Automatic device selection (tries GPU, falls back to CPU)
vibrant ask --device auto "your question"
# Force GPU/Metal (macOS with Apple Silicon)
vibrant ask --device gpu "your question"
vibrant ask --device metal "your question"
# Force CUDA (Linux with NVIDIA GPU)
vibrant ask --device cuda "your question"
# Force CPU (default, works everywhere)
vibrant ask --device cpu "your question"

When to use GPU:
- ✅ Large batch operations (prefill phase)
- ✅ Matrix operations > 128×128
- ❌ Single-token decode (CPU is faster due to overhead)
GPU provides 6.4x speedup for large matrix operations!
Requirements:
- Go 1.21 or later
- Make
- (Optional) C compiler for GPU support:
- macOS: Xcode command-line tools (for Metal)
- Linux: CUDA Toolkit 12.0+ (for NVIDIA GPU support)
# Clone repository
git clone https://github.com/xupit3r/vibrant.git
cd vibrant
# Check and install dependencies automatically
make install-deps
# Or just check what's missing
make check-deps
# Build (tries llama.cpp, falls back to mock if unavailable)
make build
# Run tests
make test

By default, `make build` uses the pure Go inference engine (no CGO required). This works on all platforms and includes CPU-only support.
# Default: Pure Go engine (no dependencies, CPU only)
make build
# With Metal GPU support (macOS with Apple Silicon)
make build-gpu
# With CUDA GPU support (Linux with NVIDIA GPU + CUDA Toolkit)
make build-cuda
# With llama.cpp (requires manual setup)
make build-llama
# Mock engine (for development/testing)
make build-mock

GPU Support:
- Metal (macOS): Use `make build-gpu` on Apple Silicon for 6.4x speedup on large ops
- CUDA (Linux): Use `make build-cuda` with an NVIDIA GPU (RTX 30/40 series, validated on RTX 4090)
  - Requires CUDA Toolkit 12.0+ and NVIDIA Driver 525.60.13+
  - See docs/setup/cuda-setup.md for setup
See docs/llama-setup.md for llama.cpp setup instructions.
vibrant/
├── cmd/vibrant/       # CLI entry point
├── internal/          # Private application code
│   ├── agent/         # Agentic behavior: planning, self-correction
│   ├── assistant/     # Conversation & prompt handling
│   ├── codeintel/     # AST parsing, symbol extraction
│   ├── context/       # Code indexing, RAG, vector store
│   ├── diff/          # Diff generation & git integration
│   ├── gpu/           # GPU device abstraction & Metal backend
│   ├── inference/     # Inference engine with GPU support
│   ├── model/         # Model management & caching
│   ├── llm/           # LLM inference engine
│   ├── plugin/        # Plugin system
│   ├── tensor/        # Tensor operations (CPU & GPU)
│   ├── tokenizer/     # Tokenization (BPE)
│   ├── tools/         # Tool registry (15+ tools)
│   ├── transformer/   # Transformer model architecture
│   ├── config/        # Configuration management
│   ├── system/        # System detection utilities
│   └── tui/           # Terminal UI components
├── test/              # Integration and benchmark tests
├── specs/             # Technical specifications
└── docs/              # Additional documentation
Vibrant includes 15+ built-in tools for agentic workflows:
File Operations
- `read_file` - Read file contents
- `write_file` - Write content to files
- `list_directory` - List directory contents
- `backup_file` - Create file backups
- `replace_in_file` - Find and replace in files
Code Analysis
- `analyze_code` - AST-based code analysis
- `find_files` - Search files by pattern
- `grep` - Search patterns in files
- `get_file_info` - Get file metadata
Editing & Diffs
- `generate_diff` - Create unified diffs
- `apply_diff` - Apply patches to files
Build & Test
- `run_tests` - Execute tests (Go, Python, Node.js)
- `build` - Build projects (make, go, npm, pip)
- `lint` - Run linters (golangci-lint, pylint, eslint)
Shell
- `shell` - Execute shell commands with timeout
Vibrant currently supports the following models:
| Model | Parameters | RAM Required | Recommended For |
|---|---|---|---|
| Qwen 2.5 Coder 3B (Q4_K_M) | 3B | 4 GB | Systems with 6-10 GB RAM |
| Qwen 2.5 Coder 7B (Q4_K_M) | 7B | 8 GB | Systems with 10-14 GB RAM |
| Qwen 2.5 Coder 7B (Q5_K_M) | 7B | 10 GB | Systems with 10-16 GB RAM |
| Qwen 2.5 Coder 14B (Q5_K_M) | 14B | 18 GB | Systems with 16+ GB RAM |
Models are automatically downloaded from HuggingFace on first use.
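Auto-tuning by RAM amounts to picking the largest model whose requirement fits the system. A hypothetical sketch following the table above (the 14B and 7B-Q4 model IDs are extrapolated from the naming used in the examples, e.g. `qwen2.5-coder-7b-q5`):

```go
package main

import "fmt"

// pickModel selects the largest model from the table above that fits
// the system's RAM. Thresholds follow the "RAM Required" column;
// model IDs beyond those shown in the docs are assumed names.
func pickModel(ramGB int) string {
	switch {
	case ramGB >= 18:
		return "qwen2.5-coder-14b-q5"
	case ramGB >= 10:
		return "qwen2.5-coder-7b-q5"
	case ramGB >= 8:
		return "qwen2.5-coder-7b-q4"
	default:
		return "qwen2.5-coder-3b-q4"
	}
}

func main() {
	fmt.Println(pickModel(16)) // qwen2.5-coder-7b-q5
	fmt.Println(pickModel(6))  // qwen2.5-coder-3b-q4
}
```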
MatMul 1×512 (decode): CPU: 120 µs | GPU: 430 µs | CPU 3.6x faster
MatMul 128×128: CPU: 435 µs | GPU: 318 µs | GPU 1.37x faster
MatMul 512×512: CPU: 12.5 ms | GPU: 2.0 ms | GPU 6.4x faster
GPU shines for large operations; CPU is better for small decode steps.
BenchmarkConversationAdd 20365 64.3 µs/op 201 KB/op
BenchmarkVectorStoreAdd 1254909 878 ns/op 1 KB/op
BenchmarkVectorStoreSearch 46449 25.4 µs/op 10 KB/op
BenchmarkDiffGenerate 1640283 727 ns/op 2 KB/op
BenchmarkSmartCommitMsg 1959770 617 ns/op 344 B/op
See PLAN.md for the complete implementation plan and specs/ for detailed technical specifications.
# All tests
go test ./...
# With coverage
go test ./... -cover
# Integration tests
go test ./test/integration/...
# Benchmarks
go test ./test/bench/... -bench=. -benchmem

TBD