
NeuroGrid Engine

GPU-Accelerated Distributed LLM Inference Engine

Quick Start · Models · API · Distributed · Benchmarks


NeuroGrid is a high-performance inference engine for Large Language Models (LLMs), built from scratch in Go + CUDA. Designed for both single-GPU and distributed inference across multiple machines.

Key Features

  • 279 tok/s on RTX 4090 (LFM2.5-1.2B, BF16) with CUDA Graph replay
  • OpenAI-compatible API with reasoning_content field for thinking models
  • BF16-native compute pipeline — zero FP16 conversions in decode path
  • Paged KV Cache (vLLM-style) with <4% memory waste
  • FlashAttention-2 for prefill via dlopen of vLLM's compiled kernel
  • Continuous batching with per-sequence isolation
  • Distributed inference via libp2p P2P with pipeline parallelism
  • Configurable quantization (--quantization bf16/int8)

Quick Start

# 1. Download a model
make download-tinyllama          # Small model for testing (~2.2GB)

# 2. Build and run
make run

# 3. Test the API
curl -X POST http://localhost:8090/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "tinyllama", "messages": [{"role": "user", "content": "Hello!"}]}'

That's it! The server auto-detects the model and starts on port 8090.

Requirements

Requirement    Version
Go             1.21+
CUDA Toolkit   11.x or 12.x
GPU            NVIDIA with Compute Capability 7.0+ (RTX 20/30/40/50 series)
OS             Linux (Ubuntu 22.04/24.04)

Supported Models

Model                  Size      VRAM (BF16)  GPU        Status
TinyLlama 1.1B         ~2.2GB    ~3GB         Any        Tested
Mistral 7B Instruct    ~15GB     ~14GB        RTX 4090   Tested
Llama 2 7B/13B         13-26GB   14-26GB      RTX 4090   Tested
LFM2.5-1.2B-Thinking   ~2.5GB    ~4GB         Any        Validated (279 tok/s)
Qwen2.5-7B-Instruct    ~15GB     ~15GB        RTX 4090   Validated
Qwen3-8B               ~16GB     ~79GB        GH200      Validated (thinking)
Qwen3-32B              ~62GB     ~63GB        GH200      Validated
Qwen2.5-72B-Instruct   ~144GB    ~144GB       GH200 ¹    Runs via unified memory

¹ Requires --managed-memory flag on GH200 (uses 480GB unified pool: 96GB HBM3 + 384GB LPDDR5x via NVLink-C2C).

Download Any HuggingFace Model

# Generic download - works with any public model
make download REPO=mistralai/Mistral-Nemo-Instruct-2407
make download REPO=Qwen/Qwen2.5-7B-Instruct
make download REPO=google/gemma-2-9b-it

# For gated models (Llama, etc.)
export HF_TOKEN=your_token
make download REPO=meta-llama/Llama-3.3-70B-Instruct

Running the Server

Single Node (Recommended for most users)

# Auto-detect model and run
make run

# Or run with specific model
make run-mistral      # Mistral 7B Instruct
make run-tinyllama    # TinyLlama 1.1B
make run-llama7b      # Llama 2 7B

# Custom configuration
make run HTTP_PORT=8080 GPU_ID=1 LOG_LEVEL=debug

Configuration Options

Flag             Default  Description
--http-port      8090     API server port
--gpu            0        CUDA device ID
--log-level      info     Log verbosity (debug, info, warn, error)
--quantization   bf16     Weight precision: bf16 (full) or int8 (W8A16)
--max-seq-len    4096     Maximum sequence length for KV cache
--min-peers      0        Minimum workers for distributed mode (0 = local only)

API

OpenAI-Compatible Endpoint

POST /v1/chat/completions

Request

curl -X POST http://localhost:8090/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-7b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'

Response

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "lfm2-1.2b-thinking",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "The answer is 105.",
      "reasoning_content": "Okay, let me figure out 15 times 7..."
    },
    "finish_reason": "eos"
  }]
}

Note: For thinking models (LFM2), reasoning_content contains the chain-of-thought reasoning, and content contains the final answer. For non-thinking models, only content is populated.

Health Check

curl http://localhost:8090/health
# {"status":"healthy","model":"mistral-7b","timestamp":1769027418,"version":"1.0.0"}

Distributed Mode

Distributed mode runs multi-GPU inference across machines using pipeline parallelism.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                       Coordinator (GH200)                        │
│  - HTTP API endpoint                                             │
│  - Orchestrates inference                                        │
│  - Layers 0-13 (local)                                           │
└───────────────────────────┬─────────────────────────────────────┘
                            │ P2P (libp2p)
          ┌─────────────────┴─────────────────┐
          │                                   │
┌─────────┴───────────┐          ┌────────────┴──────────┐
│   Worker (RTX 4090) │          │   Worker (RTX 2080)   │
│   Layers 14-26      │          │   Layers 27-39        │
└─────────────────────┘          └───────────────────────┘

Mode 1: Workers with Local Models (Recommended)

When workers have the model downloaded locally, skip weight transfer for faster startup.

Important: Use -wait-for-assignment on workers so they only load layers assigned by the coordinator. This is critical for heterogeneous GPU clusters to prevent OOM errors.

# Coordinator (GH200 - start first)
LD_LIBRARY_PATH=./build ./build/neurogrid \
  -model models/mistral-nemo-instruct-2407 \
  -http-port 8090 \
  -p2p-port 9000 \
  -gpu 0 \
  -min-peers 2 \
  -max-seq-len 4096 \
  -skip-weight-transfer \
  -disable-mdns

# Worker 1 (RTX 4090 - connect via bootstrap)
LD_LIBRARY_PATH=./build ./build/worker \
  -bootstrap /ip4/<COORDINATOR_IP>/tcp/9000/p2p/<COORDINATOR_PEER_ID> \
  -model models/mistral-nemo-instruct-2407 \
  -gpu 0 \
  -port 9001 \
  -wait-for-assignment

# Worker 2 (RTX 2080 - connect via bootstrap)
LD_LIBRARY_PATH=./build ./build/worker \
  -bootstrap /ip4/<COORDINATOR_IP>/tcp/9000/p2p/<COORDINATOR_PEER_ID> \
  -model /path/to/models/mistral-nemo-instruct-2407 \
  -gpu 0 \
  -port 9002 \
  -wait-for-assignment

What happens:

  1. Workers connect and report their GPU info (VRAM, name) to coordinator
  2. Coordinator computes layer assignments based on actual VRAM of each GPU
  3. Coordinator sends layer requests to workers
  4. Workers load only their assigned layers from local storage

Mode 2: Auto-Discovery (LAN only)

For simple LAN setups, workers can be discovered automatically via mDNS:

# Machine 1: Worker with GPU 0
make run-worker GPU_ID=0 P2P_PORT=9001

# Machine 2: Worker with GPU 0
make run-worker GPU_ID=0 P2P_PORT=9002

# Machine 3: Coordinator (connects to workers automatically via mDNS)
make run-coordinator MIN_PEERS=2

Distributed Mode Flags

Coordinator Flags

Flag                   Default  Description
-min-peers             0        Minimum workers to wait for (0 = local only)
-skip-weight-transfer  false    Skip P2P weight distribution (workers have local models)
-disable-mdns          false    Disable mDNS discovery (use explicit bootstrap)
-max-seq-len           4096     Max sequence length (caps KV cache size)
-peer-vram-gb          0        Override worker GPU VRAM in GB (workaround for GPU info timeout)

Worker Flags

Flag                  Default  Description
-bootstrap            ""       Coordinator address for explicit connection
-model                ""       Path to local model weights
-wait-for-assignment  false    Critical for heterogeneous clusters: wait for coordinator to assign layers before loading
-max-seq-len          4096     Max sequence length (caps KV cache size)
-port                 9000     P2P listen port
-gpu                  0        GPU device ID

Heterogeneous GPU Support

NeuroGrid supports clusters with different GPU types (e.g., GH200 + RTX 4090 + RTX 2080). The system:

  1. GPU Info Protocol: Workers report actual VRAM to coordinator on connect
  2. VRAM-Aware Scheduling: Scheduler assigns layers based on each GPU's available memory
  3. On-Demand Loading: With -wait-for-assignment, workers load only their assigned layers

This prevents OOM errors on smaller GPUs that would occur if all GPUs tried to load the same layers.

Troubleshooting Distributed Mode

Ghost Peer Issue: If you see connections to unknown peers, use -disable-mdns and connect workers via explicit -bootstrap addresses.

Out of Memory on Workers:

  • Use -wait-for-assignment on workers (most common fix)
  • Reduce -max-seq-len to use less KV cache memory
  • The scheduler automatically assigns fewer layers to GPUs with less VRAM

Workers not receiving layer assignments: Ensure coordinator has -min-peers set to the number of expected workers.

Building from Source

Prerequisites

# Verify CUDA installation
nvcc --version
nvidia-smi

# Verify Go installation
go version

Build Commands

# Build everything (CUDA library + binaries)
make build-all

# Build CUDA library only
make cuda

# Build specific binary
make build-coordinator
make build-worker

# Clean build artifacts
make clean

Run Tests

make test           # CUDA tests
make test-e2e       # End-to-end tests (no CUDA required)
make test-all       # All tests

Benchmarks

Single-Request Throughput (LFM2.5-1.2B-Thinking, BF16)

GPU          Throughput  CUDA Graph Nodes   KV Cache
RTX 4090     279 tok/s   266 (persistent)   Paged (27K blocks)
GH200 480GB  275 tok/s   302                Paged

Inference Quality (vs vLLM v0.17.1)

Metric                   Score
Semantic correctness     90% coherent thinking, 80% correct answers
Token match (first 3)    100% identical (same FA2 kernel for prefill)
Golden set (10 prompts)  All factual answers correct

Run Benchmarks

# Single-request throughput
bash scripts/bench_single_rtx4090.sh

# Concurrent requests
python3 scripts/bench_concurrent.py

# Golden set validation (requires vLLM installed)
python3 scripts/golden_set_test.py

Project Structure

neurogrid/
├── cmd/
│   ├── neurogrid/       # Main server (coordinator)
│   ├── worker/          # Distributed worker node
│   └── download/        # Model download utility
├── gpu/
│   ├── cuda/            # CUDA kernels (attention, matmul, paged_attention, flash_attn_v2)
│   ├── engine/          # Layer forward passes (decode_all, layer, conv_layer)
│   └── bindings/        # Go ↔ CUDA bindings (CGO)
├── pkg/
│   ├── inference/       # Engine, batch scheduler, paged KV cache, sampler
│   ├── model/           # Weight loading, tokenizer, chat templates
│   ├── scheduler/       # Layer distribution & pipeline parallelism
│   └── huggingface/     # HuggingFace model downloader
├── api/                 # OpenAI-compatible HTTP API (streaming SSE, reasoning_content)
├── p2p/                 # libp2p peer-to-peer networking
├── scripts/             # Benchmarks, golden set tests
├── tests/               # Unit, integration, E2E, benchmark tests
├── third_party/cutlass/ # NVIDIA CUTLASS (header-only, for future INT8 fused GEMMs)
└── Makefile

Troubleshooting

CUDA Library Not Found

If you see libgpu_engine.so: cannot open shared object file:

# Option 1: Use make (handles LD_LIBRARY_PATH automatically)
make run

# Option 2: Set LD_LIBRARY_PATH manually
export LD_LIBRARY_PATH=$(pwd)/build:/usr/local/cuda/lib64:$LD_LIBRARY_PATH
./build/neurogrid --model ./models/tinyllama --model-name tinyllama

No Model Found

# Check available models
ls ./models/

# Download a model
make download-tinyllama

GPU Out of Memory

Try a smaller model:

make download-tinyllama
make run-tinyllama

Help

make help   # Show all available commands

License

Source Available License with Academic & Educational Use Grant

  • ✅ Free for students, researchers, and academic use
  • ✅ Free for personal learning and non-commercial projects
  • ❌ Commercial use requires a license

Contact: leandrobar93@gmail.com

NeuroGrid Engine v0.19.0
Built with Go + CUDA
