
NeuroGrid Engine

GPU-Accelerated Distributed LLM Inference Engine

Quick Start · Models · API · Distributed · Benchmarks


NeuroGrid is a high-performance inference engine for Large Language Models (LLMs), built from scratch in Go + CUDA. Designed for both single-GPU and distributed inference across multiple machines.

Key Features

  • 279 tok/s on RTX 4090 (LFM2.5-1.2B, BF16) with CUDA Graph replay
  • OpenAI-compatible API with reasoning_content field for thinking models
  • BF16-native compute pipeline — zero FP16 conversions in decode path
  • Paged KV Cache (vLLM-style) with <4% memory waste
  • FlashAttention-2 for prefill via dlopen of vLLM's compiled kernel
  • Continuous batching with per-sequence isolation
  • Distributed inference via libp2p P2P with pipeline parallelism
  • Configurable quantization (--quantization bf16/int8)

Quick Start

# 1. Download a model
make download-tinyllama          # Small model for testing (~2.2GB)

# 2. Build and run
make run

# 3. Test the API
curl -X POST http://localhost:8090/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "tinyllama", "messages": [{"role": "user", "content": "Hello!"}]}'

That's it! The server auto-detects the model and starts on port 8090.

Requirements

Requirement    Version
Go             1.21+
CUDA Toolkit   11.x or 12.x
GPU            NVIDIA with Compute Capability 7.0+ (RTX 20/30/40/50 series)
OS             Linux (Ubuntu 22.04/24.04)

Supported Models

Model                  Size      VRAM (BF16)  GPU        Status
TinyLlama 1.1B         ~2.2GB    ~3GB         Any        Tested
Mistral 7B Instruct    ~15GB     ~14GB        RTX 4090   Tested
Llama 2 7B/13B         13-26GB   14-26GB      RTX 4090   Tested
LFM2.5-1.2B-Thinking   ~2.5GB    ~4GB         Any        Validated (279 tok/s)
Qwen2.5-7B-Instruct    ~15GB     ~15GB        RTX 4090   Validated
Qwen3-8B               ~16GB     ~79GB        GH200      Validated (thinking)
Qwen3-32B              ~62GB     ~63GB        GH200      Validated
Qwen2.5-72B-Instruct   ~144GB    ~144GB       GH200 ¹    Runs via unified memory

¹ Requires --managed-memory flag on GH200 (uses 480GB unified pool: 96GB HBM3 + 384GB LPDDR5x via NVLink-C2C).

Download Any HuggingFace Model

# Generic download - works with any public model
make download REPO=mistralai/Mistral-Nemo-Instruct-2407
make download REPO=Qwen/Qwen2.5-7B-Instruct
make download REPO=google/gemma-2-9b-it

# For gated models (Llama, etc.)
export HF_TOKEN=your_token
make download REPO=meta-llama/Llama-3.3-70B-Instruct

Running the Server

Single Node (Recommended for most users)

# Auto-detect model and run
make run

# Or run with specific model
make run-mistral      # Mistral 7B Instruct
make run-tinyllama    # TinyLlama 1.1B
make run-llama7b      # Llama 2 7B

# Custom configuration
make run HTTP_PORT=8080 GPU_ID=1 LOG_LEVEL=debug

Configuration Options

Flag             Default  Description
--http-port      8090     API server port
--gpu            0        CUDA device ID
--log-level      info     Log verbosity (debug, info, warn, error)
--quantization   bf16     Weight precision: bf16 (full) or int8 (W8A16)
--max-seq-len    4096     Maximum sequence length for KV cache
--min-peers      0        Minimum workers for distributed mode (0 = local only)

API

OpenAI-Compatible Endpoint

POST /v1/chat/completions

Request

curl -X POST http://localhost:8090/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-7b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'

Response

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "lfm2-1.2b-thinking",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "The answer is 105.",
      "reasoning_content": "Okay, let me figure out 15 times 7..."
    },
    "finish_reason": "eos"
  }]
}

Note: For thinking models (LFM2), reasoning_content contains the chain-of-thought reasoning, and content contains the final answer. For non-thinking models, only content is populated.

Health Check

curl http://localhost:8090/health
# {"status":"healthy","model":"mistral-7b","timestamp":1769027418,"version":"1.0.0"}

Distributed Mode

Distributed mode runs multi-GPU inference across machines using pipeline parallelism.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                       Coordinator (GH200)                        │
│  - HTTP API endpoint                                             │
│  - Orchestrates inference                                        │
│  - Layers 0-13 (local)                                           │
└───────────────────────────┬─────────────────────────────────────┘
                            │ P2P (libp2p)
          ┌─────────────────┴─────────────────┐
          │                                   │
┌─────────┴───────────┐          ┌────────────┴──────────┐
│   Worker (RTX 4090) │          │   Worker (RTX 2080)   │
│   Layers 14-26      │          │   Layers 27-39        │
└─────────────────────┘          └───────────────────────┘

Mode 1: Workers with Local Models (Recommended)

When workers have the model downloaded locally, skip weight transfer for faster startup.

Important: Use -wait-for-assignment on workers so they only load layers assigned by the coordinator. This is critical for heterogeneous GPU clusters to prevent OOM errors.

# Coordinator (GH200 - start first)
LD_LIBRARY_PATH=./build ./build/neurogrid \
  -model models/mistral-nemo-instruct-2407 \
  -http-port 8090 \
  -p2p-port 9000 \
  -gpu 0 \
  -min-peers 2 \
  -max-seq-len 4096 \
  -skip-weight-transfer \
  -disable-mdns

# Worker 1 (RTX 4090 - connect via bootstrap)
LD_LIBRARY_PATH=./build ./build/worker \
  -bootstrap /ip4/<COORDINATOR_IP>/tcp/9000/p2p/<COORDINATOR_PEER_ID> \
  -model models/mistral-nemo-instruct-2407 \
  -gpu 0 \
  -port 9001 \
  -wait-for-assignment

# Worker 2 (RTX 2080 - connect via bootstrap)
LD_LIBRARY_PATH=./build ./build/worker \
  -bootstrap /ip4/<COORDINATOR_IP>/tcp/9000/p2p/<COORDINATOR_PEER_ID> \
  -model /path/to/models/mistral-nemo-instruct-2407 \
  -gpu 0 \
  -port 9002 \
  -wait-for-assignment

What happens:

  1. Workers connect and report their GPU info (VRAM, name) to coordinator
  2. Coordinator computes layer assignments based on actual VRAM of each GPU
  3. Coordinator sends layer requests to workers
  4. Workers load only their assigned layers from local storage

Mode 2: Auto-Discovery (LAN only)

For simple LAN setups, workers can be discovered automatically via mDNS:

# Machine 1: Worker with GPU 0
make run-worker GPU_ID=0 P2P_PORT=9001

# Machine 2: Worker with GPU 0
make run-worker GPU_ID=0 P2P_PORT=9002

# Machine 3: Coordinator (connects to workers automatically via mDNS)
make run-coordinator MIN_PEERS=2

Distributed Mode Flags

Coordinator Flags

Flag                   Default  Description
-min-peers             0        Minimum workers to wait for (0 = local only)
-skip-weight-transfer  false    Skip P2P weight distribution (workers have local models)
-disable-mdns          false    Disable mDNS discovery (use explicit bootstrap)
-max-seq-len           4096     Max sequence length (caps KV cache size)
-peer-vram-gb          0        Override worker GPU VRAM in GB (workaround for GPU info timeout)

Worker Flags

Flag                  Default  Description
-bootstrap            ""       Coordinator address for explicit connection
-model                ""       Path to local model weights
-wait-for-assignment  false    Critical for heterogeneous clusters: wait for coordinator to assign layers before loading
-max-seq-len          4096     Max sequence length (caps KV cache size)
-port                 9000     P2P listen port
-gpu                  0        GPU device ID

Heterogeneous GPU Support

NeuroGrid supports clusters with different GPU types (e.g., GH200 + RTX 4090 + RTX 2080). The system:

  1. GPU Info Protocol: Workers report actual VRAM to coordinator on connect
  2. VRAM-Aware Scheduling: Scheduler assigns layers based on each GPU's available memory
  3. On-Demand Loading: With -wait-for-assignment, workers load only their assigned layers

This prevents OOM errors on smaller GPUs that would occur if all GPUs tried to load the same layers.

Troubleshooting Distributed Mode

Ghost Peer Issue: If you see connections to unknown peers, use -disable-mdns and connect workers via explicit -bootstrap addresses.

Out of Memory on Workers:

  • Use -wait-for-assignment on workers (most common fix)
  • Reduce -max-seq-len to use less KV cache memory
  • The scheduler automatically assigns fewer layers to GPUs with less VRAM

Workers not receiving layer assignments: Ensure coordinator has -min-peers set to the number of expected workers.

Building from Source

Prerequisites

# Verify CUDA installation
nvcc --version
nvidia-smi

# Verify Go installation
go version

Build Commands

# Build everything (CUDA library + binaries)
make build-all

# Build CUDA library only
make cuda

# Build specific binary
make build-coordinator
make build-worker

# Clean build artifacts
make clean

Run Tests

make test           # CUDA tests
make test-e2e       # End-to-end tests (no CUDA required)
make test-all       # All tests

Benchmarks

Single-Request Throughput (LFM2.5-1.2B-Thinking, BF16)

GPU          Throughput  CUDA Graph Nodes   KV Cache
RTX 4090     279 tok/s   266 (persistent)   Paged (27K blocks)
GH200 480GB  275 tok/s   302                Paged

Inference Quality (vs vLLM v0.17.1)

Metric                   Score
Semantic correctness     90% coherent thinking, 80% correct answers
Token match (first 3)    100% identical (same FA2 kernel for prefill)
Golden set (10 prompts)  All factual answers correct

Run Benchmarks

# Single-request throughput
bash scripts/bench_single_rtx4090.sh

# Concurrent requests
python3 scripts/bench_concurrent.py

# Golden set validation (requires vLLM installed)
python3 scripts/golden_set_test.py

Project Structure

neurogrid/
├── cmd/
│   ├── neurogrid/       # Main server (coordinator)
│   ├── worker/          # Distributed worker node
│   └── download/        # Model download utility
├── gpu/
│   ├── cuda/            # CUDA kernels (attention, matmul, paged_attention, flash_attn_v2)
│   ├── engine/          # Layer forward passes (decode_all, layer, conv_layer)
│   └── bindings/        # Go ↔ CUDA bindings (CGO)
├── pkg/
│   ├── inference/       # Engine, batch scheduler, paged KV cache, sampler
│   ├── model/           # Weight loading, tokenizer, chat templates
│   ├── scheduler/       # Layer distribution & pipeline parallelism
│   └── huggingface/     # HuggingFace model downloader
├── api/                 # OpenAI-compatible HTTP API (streaming SSE, reasoning_content)
├── p2p/                 # libp2p peer-to-peer networking
├── scripts/             # Benchmarks, golden set tests
├── tests/               # Unit, integration, E2E, benchmark tests
├── third_party/cutlass/ # NVIDIA CUTLASS (header-only, for future INT8 fused GEMMs)
└── Makefile

Troubleshooting

CUDA Library Not Found

If you see libgpu_engine.so: cannot open shared object file:

# Option 1: Use make (handles LD_LIBRARY_PATH automatically)
make run

# Option 2: Set LD_LIBRARY_PATH manually
export LD_LIBRARY_PATH=$(pwd)/build:/usr/local/cuda/lib64:$LD_LIBRARY_PATH
./build/neurogrid --model ./models/tinyllama --model-name tinyllama

No Model Found

# Check available models
ls ./models/

# Download a model
make download-tinyllama

GPU Out of Memory

Try a smaller model:

make download-tinyllama
make run-tinyllama

Help

make help   # Show all available commands

License

Source Available License with Academic & Educational Use Grant

  • ✅ Free for students, researchers, and academic use
  • ✅ Free for personal learning and non-commercial projects
  • ❌ Commercial use requires a license

Contact: leandrobar93@gmail.com

NeuroGrid Engine v0.19.0
Built with Go + CUDA
