Skip to content

xupit3r/vibrant

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

135 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Vibrant

A local, CPU and GPU-optimized LLM code assistant built in Go with advanced RAG and plugin capabilities.

Overview

Vibrant is a command-line tool that brings AI-powered coding assistance directly to your terminal, running entirely on your local machine. Features both CPU and GPU (Metal on Apple Silicon) acceleration for fast inference. No internet connection or API keys required.

Features

  • πŸš€ GPU Accelerated: Metal GPU on Apple Silicon (6.4x speedup), CUDA on Linux (10-15x speedup)
  • πŸ–₯️ CPU-optimized: Runs efficiently on CPU using quantized models (GGUF format)
  • 🧠 Context-aware: Understands your codebase structure with semantic search (RAG)
  • 🎯 Auto-tuned: Automatically selects the best model based on your system RAM
  • πŸ’¬ Interactive: Rich terminal UI with syntax highlighting for 30+ languages
  • πŸ”’ Private: All processing happens locally - your code never leaves your machine
  • πŸ”Œ Extensible: Plugin system for custom functionality
  • πŸ“ Git Integration: Smart commit message generation with conventional commits
  • ⌨️ Tab-Completion: Advanced shell completion for zsh, bash, and fish
  • πŸ” Semantic Search: TF-IDF based vector store for code retrieval
  • πŸ€– Agentic: 15+ tools for multi-step workflows with self-correction
  • πŸ§ͺ Test & Build: Integrated testing, building, and linting support
  • ✏️ Code Editing: Diff-based file editing with automatic backups
  • πŸ”¬ AST Analysis: Deep code understanding with symbol extraction

Status

βœ… Feature Complete - Agentic code assistant with GPU acceleration!

Current Phase: Phase 10.11 - Chat Templates, Cache Warming & Fused Dequant βœ… COMPLETE

GPU Backend:

  • βœ… Phase 11.1: Metal GPU support for Apple Silicon (complete, 6.4x speedup)
  • βœ… Phase 11.3: NVIDIA CUDA support for Linux (Phase 1 & 2 COMPLETE!)
    • βœ… Phase 1: CUDA infrastructure and model loading (COMPLETE)
    • βœ… Phase 2: Device-aware tensor operations (COMPLETE)
    • 🚧 Phase 3: Performance optimization and profiling (NEXT)
  • βœ… Metal GPU backend for Apple Silicon
  • βœ… CUDA GPU backend for NVIDIA GPUs on Linux (RTX 4090 validated)
  • βœ… Device abstraction layer (CPU/GPU/Metal/CUDA)
  • βœ… 12 GPU kernels implemented: MatMul, Softmax, RMSNorm, element-wise ops, RoPE
  • βœ… Tensor device migration (CPU ↔ GPU)
  • βœ… Memory management with buffer pooling (19GB pool on RTX 4090)
  • βœ… CLI integration with --device flag (auto, cpu, gpu, metal, cuda)
  • βœ… Quantized model support with automatic GPU dequantization
  • βœ… Device-aware tensor creation infrastructure
  • βœ… All core tensor operations GPU-accelerated (Add, Mul, SiLU, Softmax, RMSNorm, RoPE)

CUDA Performance Status:

  • Model loading: βœ… Working (13.3GB VRAM for 3B model)
  • GPU operations: βœ… All core ops implemented
  • Infrastructure: βœ… Complete and stable
  • Next: Performance testing and optimization (Phase 3)

Metal Performance Results (Apple Silicon):

  • Single-row (decode): CPU faster (low overhead)
  • Medium ops (128Γ—128): 1.37x GPU speedup
  • Large ops (512Γ—512): 6.4x GPU speedup
  • Validated: Zero error on critical decode path

Recent Improvements (Phase 10.11):

  • βœ… Chat template support (ChatML, Llama 3, plain text with auto-detection)
  • βœ… Special token encoding for <|...|> tokens in chat prompts
  • βœ… Weight cache warming (eliminates cold-start penalty on first inference)
  • βœ… Fused dequant-transpose (50% memory reduction per weight)
  • βœ… Debug output cleanup (no more spam on every forward pass)
  • βœ… All 230+ tests passing, no regressions

Note on LLM Inference: Currently uses a mock engine by default due to go-llama.cpp requiring manual setup (git submodules). This is perfect for development and testing. For real inference, we recommend using Ollama or manual vendor setup. See docs/llama-setup.md for details.

Completed Phases

  • βœ… Phase 1: Project Setup & Foundation
  • βœ… Phase 2: System Detection & Model Management
  • βœ… Phase 3: LLM Integration (with mock engine)
  • βœ… Phase 4: Code Context System
  • βœ… Phase 5: Assistant Core Features
  • βœ… Phase 6: CLI User Experience
  • βœ… Phase 7: Advanced Features (RAG, Plugins, Git Integration)
  • βœ… Phase 8: Testing & Optimization
  • βœ… Phase 9: Agentic Behavior (Claude Code-inspired)
  • βœ… Phase 10: Pure Go Inference Engine (no dependencies!)
  • βœ… Phase 11.1: GPU Backend Foundation (Metal on Apple Silicon)
  • βœ… Phase 11.3: CUDA GPU Support (NVIDIA GPUs on Linux)
  • βœ… Phase 10.11: Chat Templates, Cache Warming & Fused Dequant-Transpose

GPU Acceleration (Phase 11.1 & 11.3)

Metal GPU Backend for Apple Silicon:

  • Device abstraction layer supporting CPU and GPU
  • GPU kernels: MatMul, Softmax, RMSNorm, element-wise ops
  • Automatic CPU ↔ GPU tensor migration
  • Unified memory optimization (~40 GB/s transfer speeds)
  • Production-ready with comprehensive validation

CUDA GPU Backend for NVIDIA GPUs (Linux):

  • Full feature parity with Metal (11 kernels)
  • Automatic quantized model support (Q4_K, Q5_K, Q6_K β†’ Float32 dequantization)
  • Direct CGO bindings to CUDA Runtime API
  • Pre-compiled kernels with nvcc
  • Buffer pooling for efficient memory reuse
  • CLI device selection: vibrant chat --device cuda
  • Expected 10-15x speedup on large operations (RTX 4090)
  • Works with all existing quantized models (automatic dequantization)
  • Setup guide: docs/setup/cuda-setup.md

Performance Benchmarks:

Operation CPU Time GPU Time Speedup
Decode (1Γ—512) 120 ΞΌs 430 ΞΌs CPU 3.6x faster*
Medium (128Γ—128) 435 ΞΌs 318 ΞΌs GPU 1.37x faster
Large (512Γ—512) 12.5 ms 2.0 ms GPU 6.4x faster

*GPU overhead dominates for small operations; CPU used for decode step.

CLI Integration:

# Use Metal GPU acceleration (Apple Silicon)
vibrant ask --device metal "your question"

# Use CUDA GPU acceleration (Linux with NVIDIA)
# Works with quantized models (Q4_K, Q5_K, Q6_K) - automatically dequantizes to Float32
vibrant ask --device cuda "your question"

# Use CPU (default, compatible everywhere)
vibrant ask --device cpu "your question"

# Auto-detect and use best device
vibrant ask --device auto "your question"

Note on Quantized Models: When using GPU acceleration with quantized models (Q4_K_M, Q5_K_M, Q6_K), the models are automatically dequantized to Float32 during GPU transfer. This provides full GPU acceleration while working with all existing quantized models, at the cost of ~4x more VRAM usage than the quantized size.

Key Capabilities

πŸ€– Agentic Behavior (Phase 9.1)

  • Tool calling system with 15+ built-in tools
  • Multi-step planning with dependency tracking
  • Self-correction with automatic retry strategies
  • Task decomposition for complex workflows
  • Progress tracking and result summarization

🧠 Code Intelligence (Phase 9.2)

  • AST parsing for Go with symbol extraction
  • Symbol resolution (functions, methods, types, structs, interfaces)
  • Dependency tracking and import analysis
  • Cross-package references and code navigation

✏️ Interactive Editing (Phase 9.3)

  • Diff-based editing with unified diff format
  • Find/replace operations across files
  • Automatic backups before modifications
  • Patch application with validation

πŸ§ͺ Testing & Building (Phase 9.4-9.5)

  • Test execution for Go, Python, Node.js
  • Auto-detect test frameworks and build tools
  • Build integration (go build, make, npm, pip)
  • Error parsing and diagnostic reporting

πŸ” Quality Assurance (Phase 9.6)

  • Linting integration (golangci-lint, pylint, eslint)
  • Security scanning support
  • Best practices suggestions
  • Issue reporting with context

πŸ“Š RAG & Performance

  • Semantic code search using TF-IDF vector embeddings
  • Smart commit messages with conventional commits
  • Plugin system for extensibility (93.2% test coverage)
  • 164+ unit tests with comprehensive coverage
  • Performance: <1Β΅s for most operations

Test Coverage

  • tokenizer: 100%
  • plugin: 93.2%
  • agent: 89.1%
  • gguf: 87.2%
  • codeintel: 81.8%
  • system: 82.1%
  • gpu: 80.1% (new!)
  • diff: 78.3%
  • tensor: 78.2%
  • transformer: 62.1%
  • assistant: 59.6%
  • context: 49.7%

Total: 220+ tests passing across 20 packages

Installation

Install from Source

# Clone and build
git clone https://github.com/xupit3r/vibrant.git
cd vibrant
make build

# Install to system path
make install

Shell Completion (Recommended)

After installing, enable tab-completion for your shell:

# Zsh
vibrant completion zsh > ~/.zsh/completion/_vibrant
echo 'fpath=(~/.zsh/completion $fpath)' >> ~/.zshrc
echo 'autoload -Uz compinit && compinit' >> ~/.zshrc

# Bash
vibrant completion bash > ~/.local/share/bash-completion/completions/vibrant
source ~/.local/share/bash-completion/completions/vibrant

# Fish
vibrant completion fish > ~/.config/fish/completions/vibrant.fish

Restart your shell or source your config to activate completions.

Usage

# List available models
vibrant model list

# Download a specific model (with tab-completion!)
vibrant model download <TAB>
vibrant model download qwen2.5-coder-7b-q5

# Show model information
vibrant model info qwen2.5-coder-3b-q4

# Ask a question (downloads model if needed)
vibrant ask "What is a goroutine?"

# Ask with GPU acceleration (Metal on macOS, CUDA on Linux)
vibrant ask --device gpu "Explain Go interfaces"

# With specific model (use tab to see available models)
vibrant ask --model <TAB>
vibrant ask --model qwen2.5-coder-7b-q5 "Explain Go interfaces"

# Interactive chat mode
vibrant chat

# Ask with context from specific files/directories
vibrant ask --context ./src "How does authentication work?"

GPU Support

GPU acceleration is available on multiple platforms:

  • Apple Silicon (M-series): Metal GPU backend (6.4x speedup)
  • Linux with NVIDIA: CUDA GPU backend (validated on RTX 4090, 24 GB VRAM)
# Automatic device selection (tries GPU, falls back to CPU)
vibrant ask --device auto "your question"

# Force GPU/Metal (macOS with Apple Silicon)
vibrant ask --device gpu "your question"
vibrant ask --device metal "your question"

# Force CUDA (Linux with NVIDIA GPU)
vibrant ask --device cuda "your question"

# Force CPU (default, works everywhere)
vibrant ask --device cpu "your question"

When to use GPU:

  • βœ… Large batch operations (prefill phase)
  • βœ… Matrix operations > 128Γ—128
  • ❌ Single-token decode (CPU is faster due to overhead)

GPU provides 6.4x speedup for large matrix operations!

Building from Source

Requirements:

  • Go 1.21 or later
  • Make
  • (Optional) C compiler for GPU support:
    • macOS: Xcode command-line tools (for Metal)
    • Linux: CUDA Toolkit 12.0+ (for NVIDIA GPU support)

Quick Setup

# Clone repository
git clone https://github.com/xupit3r/vibrant.git
cd vibrant

# Check and install dependencies automatically
make install-deps

# Or just check what's missing
make check-deps

# Build (tries llama.cpp, falls back to mock if unavailable)
make build

# Run tests
make test

Build Options

By default, make build uses the pure Go inference engine (no CGO required). This works on all platforms and includes CPU-only support.

# Default: Pure Go engine (no dependencies, CPU only)
make build

# With Metal GPU support (macOS with Apple Silicon)
make build-gpu

# With CUDA GPU support (Linux with NVIDIA GPU + CUDA Toolkit)
make build-cuda

# With llama.cpp (requires manual setup)
make build-llama

# Mock engine (for development/testing)
make build-mock

GPU Support:

  • Metal (macOS): Use make build-gpu on Apple Silicon for 6.4x speedup on large ops
  • CUDA (Linux): Use make build-cuda with NVIDIA GPU (RTX 30/40 series, validated on RTX 4090)

See docs/llama-setup.md for llama.cpp setup instructions.

Architecture

vibrant/
β”œβ”€β”€ cmd/vibrant/       # CLI entry point
β”œβ”€β”€ internal/          # Private application code
β”‚   β”œβ”€β”€ agent/        # Agentic behavior: planning, self-correction
β”‚   β”œβ”€β”€ assistant/    # Conversation & prompt handling
β”‚   β”œβ”€β”€ codeintel/    # AST parsing, symbol extraction
β”‚   β”œβ”€β”€ context/      # Code indexing, RAG, vector store
β”‚   β”œβ”€β”€ diff/         # Diff generation & git integration
β”‚   β”œβ”€β”€ gpu/          # GPU device abstraction & Metal backend
β”‚   β”œβ”€β”€ inference/    # Inference engine with GPU support
β”‚   β”œβ”€β”€ model/        # Model management & caching
β”‚   β”œβ”€β”€ llm/          # LLM inference engine
β”‚   β”œβ”€β”€ plugin/       # Plugin system
β”‚   β”œβ”€β”€ tensor/       # Tensor operations (CPU & GPU)
β”‚   β”œβ”€β”€ tokenizer/    # Tokenization (BPE)
β”‚   β”œβ”€β”€ tools/        # Tool registry (15+ tools)
β”‚   β”œβ”€β”€ transformer/  # Transformer model architecture
β”‚   β”œβ”€β”€ config/       # Configuration management
β”‚   β”œβ”€β”€ system/       # System detection utilities
β”‚   └── tui/          # Terminal UI components
β”œβ”€β”€ test/             # Integration and benchmark tests
β”œβ”€β”€ specs/            # Technical specifications
└── docs/             # Additional documentation

Available Tools

Vibrant includes 15+ built-in tools for agentic workflows:

File Operations

  • read_file - Read file contents
  • write_file - Write content to files
  • list_directory - List directory contents
  • backup_file - Create file backups
  • replace_in_file - Find and replace in files

Code Analysis

  • analyze_code - AST-based code analysis
  • find_files - Search files by pattern
  • grep - Search patterns in files
  • get_file_info - Get file metadata

Editing & Diffs

  • generate_diff - Create unified diffs
  • apply_diff - Apply patches to files

Build & Test

  • run_tests - Execute tests (Go, Python, Node.js)
  • build - Build projects (make, go, npm, pip)
  • lint - Run linters (golangci-lint, pylint, eslint)

Shell

  • shell - Execute shell commands with timeout

Model Support

Vibrant currently supports the following models:

Model Parameters RAM Required Recommended For
Qwen 2.5 Coder 3B (Q4_K_M) 3B 4 GB Systems with 6-10 GB RAM
Qwen 2.5 Coder 7B (Q4_K_M) 7B 8 GB Systems with 10-14 GB RAM
Qwen 2.5 Coder 7B (Q5_K_M) 7B 10 GB Systems with 10-16 GB RAM
Qwen 2.5 Coder 14B (Q5_K_M) 14B 18 GB Systems with 16+ GB RAM

Models are automatically downloaded from HuggingFace on first use.

Performance

GPU Benchmarks (Apple M1)

MatMul 1Γ—512 (decode):   CPU:  120 ΞΌs  |  GPU:  430 ΞΌs  |  CPU 3.6x faster
MatMul 128Γ—128:          CPU:  435 ΞΌs  |  GPU:  318 ΞΌs  |  GPU 1.37x faster
MatMul 512Γ—512:          CPU: 12.5 ms  |  GPU:  2.0 ms  |  GPU 6.4x faster

GPU shines for large operations, CPU is better for small decode steps.

CPU Benchmarks (Intel Core i5-1240P)

BenchmarkConversationAdd       20365    64.3 Β΅s/op    201 KB/op
BenchmarkVectorStoreAdd      1254909     878 ns/op      1 KB/op
BenchmarkVectorStoreSearch     46449    25.4 Β΅s/op     10 KB/op
BenchmarkDiffGenerate        1640283     727 ns/op      2 KB/op
BenchmarkSmartCommitMsg      1959770     617 ns/op    344 B/op

Development

See PLAN.md for the complete implementation plan and specs/ for detailed technical specifications.

Running Tests

# All tests
go test ./...

# With coverage
go test ./... -cover

# Integration tests
go test ./test/integration/...

# Benchmarks
go test ./test/bench/... -bench=. -benchmem

License

TBD

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors