Releases: Leeaandrob/neurogrid
v0.8.0 — CUDA Graphs (298 nodes, 225 tok/s)
CUDA Graphs — Full Decode Replay
All 298 kernel launches captured as a single CUDA graph. After 2 warmup steps, every subsequent token replays the graph instead of launching kernels individually.
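For context, a minimal sketch of the capture-and-replay pattern using the stock CUDA runtime API; `decode_one_token` and `build_and_replay` are hypothetical stand-ins, not NeuroGrid's actual functions:

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in: in the real engine this enqueues all ~298 kernel
// launches for one token onto `stream`.
static void decode_one_token(cudaStream_t stream) { (void)stream; }

void build_and_replay(cudaStream_t stream, int num_tokens) {
    cudaGraph_t graph;
    cudaGraphExec_t graph_exec;

    // Warmup: launches made while capturing are recorded into the graph,
    // not executed. Buffers must be pre-allocated to stay graph-safe.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    decode_one_token(stream);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&graph_exec, graph, 0);  // CUDA 12 signature

    // Steady state: one launch per token replays the entire captured graph.
    for (int t = 0; t < num_tokens; ++t) {
        cudaGraphLaunch(graph_exec, stream);
        cudaStreamSynchronize(stream);
    }
}
```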
Key Changes
- Global stream routing: 71 kernels across 11 files use `ng_get_stream()`
- Zero D2H copies: `position`, `seq_len`, and `kv_len` are read from GPU buffers
- Conv workspace: pre-allocated BF16 buffers (graph-safe)
- `add_one_kernel`: computes `kv_len = position + 1` on GPU (see the sketch after this list)
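The last item is small enough to sketch in full. Assuming a single-thread launch (the repo's actual signature may differ), the point is to keep the scalar update on-device so no D2H copy ever interrupts graph capture:

```cuda
// Sketch of the idea behind add_one_kernel (actual repo signature may
// differ): derive kv_len on the GPU so the host never reads `position`
// back, which would force a D2H copy and break graph replay.
__global__ void add_one_kernel(const int* position, int* kv_len) {
    *kv_len = *position + 1;  // one scalar update per decode step
}
// Enqueued once per step:
// add_one_kernel<<<1, 1, 0, stream>>>(d_position, d_kv_len);
```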
Benchmark (LFM2.5-1.2B, RTX 4090, 128 tokens)
| Engine | tok/s | Latency | vs vLLM |
|---|---|---|---|
| HuggingFace 5.3 | 184 | 705ms | 53% |
| NeuroGrid v0.7 | 216 | 592ms | 62% |
| NeuroGrid v0.8 | 225 | 568ms | 64% |
| vLLM 0.17.1 | 350 | 366ms | 100% |
- CUDA Graph: 298 nodes captured and replaying
- Golden Test: PASS — first token `<think>` (token 64400)
- 22% faster than HuggingFace Transformers
Full Changelog: v0.7.0...v0.8.0
v0.7.0 — Paged KV Cache (PagedAttention)
Paged KV Cache
Block-based KV cache inspired by vLLM's PagedAttention. KV cache is allocated in fixed-size blocks (16 tokens) on demand, eliminating memory fragmentation.
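A minimal sketch of the indirection this buys; the pool layout, dtype, and names are illustrative, not the repo's actual structs:

```cuda
#include <cuda_fp16.h>

#define BLOCK_SIZE 16  // tokens per KV block, per the release notes

// Illustrative paged lookup: a logical token position maps through the
// sequence's block table to a physical block, so blocks can live anywhere
// in the pool and are handed out on demand.
__device__ const __half* paged_key_ptr(const __half* kv_pool,     // contiguous pool of all blocks
                                       const int*    block_table, // per-sequence logical->physical map
                                       int token_pos, int head_dim) {
    int physical_block  = block_table[token_pos / BLOCK_SIZE];
    int offset_in_block = token_pos % BLOCK_SIZE;
    return kv_pool + ((size_t)physical_block * BLOCK_SIZE + offset_in_block) * head_dim;
}
```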
What's New
- Block Allocator: O(1) alloc/free, per-sequence block tables
- Paged Attention Kernel: reads K/V via block_table indirection, GQA support
- Per-layer caches: one PagedKVCache per attention layer
- Auto-sizing: calculates blocks from available VRAM (25% budget)
Benchmark (LFM2.5-1.2B, RTX 4090, 128 tokens)
| Config | tok/s | vs vLLM |
|---|---|---|
| v0.6 Contiguous KV | 204 | 58% |
| v0.7 Paged KV | 216 | 62% |
| vLLM 0.17.1 | 350 | 100% |
Correctness: first token `<think>` (token 64400) — 100% golden match.
Full Changelog: v0.6.0...v0.7.0
v0.6.0 — Throughput Optimizations & Flash Decode
Performance Optimizations
Flash Decode Attention Kernel
Custom CUDA kernel with online softmax — processes KV pairs in tiles without materializing the full attention score matrix.
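The update rule is compact enough to show as a single-thread reference (launchable as `<<<1, 1>>>`). A real flash-decode kernel tiles this across warps, but the rescaling math is the same; this is a sketch, not the repo's kernel:

```cuda
#include <math.h>

// Online-softmax attention for one query vector (one decode step).
// One pass over K/V; no full score row is ever stored.
__global__ void online_softmax_attn_ref(const float* q,   // [head_dim]
                                        const float* k,   // [kv_len, head_dim]
                                        const float* v,   // [kv_len, head_dim]
                                        float* out,       // [head_dim]
                                        int kv_len, int head_dim, float scale) {
    float m = -INFINITY;  // running max of scores seen so far
    float l = 0.0f;       // running sum of exp(score - m)
    for (int d = 0; d < head_dim; ++d) out[d] = 0.0f;

    for (int t = 0; t < kv_len; ++t) {
        float s = 0.0f;
        for (int d = 0; d < head_dim; ++d) s += q[d] * k[t * head_dim + d];
        s *= scale;

        float m_new = fmaxf(m, s);
        float corr  = __expf(m - m_new);   // rescales previously accumulated state
        float p     = __expf(s - m_new);
        l = l * corr + p;
        for (int d = 0; d < head_dim; ++d)
            out[d] = out[d] * corr + p * v[t * head_dim + d];
        m = m_new;
    }
    for (int d = 0; d < head_dim; ++d) out[d] /= l;  // normalize once at the end
}
```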
Single CUDA Call Decode
All 16 layers execute in one C function call, eliminating ~90 Go↔CUDA round-trips per token. Best measured: 230 tok/s (+9%).
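A sketch of the shape of that entry point (`ng_decode_step` and `run_layer` are hypothetical names): Go crosses the cgo boundary once per token, and the per-layer loop, operating on a GPU-resident hidden state, lives entirely on the C side.

```cuda
#include <cuda_runtime.h>

// Hypothetical per-layer launcher: enqueues the layer's kernels on `stream`
// without synchronizing (stubbed out for this sketch).
static void run_layer(float* hidden_dev, int layer, cudaStream_t stream) {
    (void)hidden_dev; (void)layer; (void)stream;
}

// One cgo call per token instead of several per layer; `hidden_dev` is a
// device pointer that never leaves the GPU between tokens.
extern "C" void ng_decode_step(float* hidden_dev, int num_layers,
                               cudaStream_t stream) {
    for (int layer = 0; layer < num_layers; ++layer)
        run_layer(hidden_dev, layer, stream);  // enqueue only; no host sync in the loop
    // Caller synchronizes (or samples the next token) after the full step.
}
```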
GPU-Resident Decode
Hidden state stays on GPU between tokens. Embedding lookup and LM head operate directly on GPU pointers. Automatic fallback for distributed mode.
Fused Kernels
- `cuda_silu_mul`: SwiGLU in a single pass (see the sketch after this list)
- `cuda_add_rmsnorm`: residual add + RMSNorm in a single pass
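A sketch of the first fusion, assuming BF16 activations (the repo's actual signature may differ): SiLU and the elementwise multiply happen in one kernel, so the data makes one global-memory round trip instead of two.

```cuda
#include <cuda_bf16.h>

// Fused SwiGLU activation in the spirit of cuda_silu_mul (illustrative
// signature): out[i] = SiLU(gate[i]) * up[i] in a single pass.
__global__ void silu_mul_sketch(const __nv_bfloat16* gate,
                                const __nv_bfloat16* up,
                                __nv_bfloat16* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float g    = __bfloat162float(gate[i]);
        float silu = g / (1.0f + __expf(-g));  // SiLU(g) = g * sigmoid(g)
        out[i] = __float2bfloat16(silu * __bfloat162float(up[i]));
    }
}
```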
Benchmark (LFM2.5-1.2B, RTX 4090, 128 tokens)
| Engine | tok/s | Latency | vs vLLM |
|---|---|---|---|
| HuggingFace 5.3 | 201 | 636ms | 57% |
| NeuroGrid v0.6 | 204 | 623ms | 58% |
| vLLM 0.17.1 | 350 | 365ms | 100% |
Correctness: first token = `<think>` (token 64400) — 100% golden match.
Full Changelog: v0.5.0...v0.6.0
v0.5.0 — LFM2 Hybrid Architecture Support
LFM2.5-1.2B-Thinking — First Non-Transformer Model
Full support for LiquidAI's hybrid conv+attention architecture, validated at 100% accuracy against HuggingFace reference.
Architecture
- 16 layers: 10 conv + 6 GQA attention (interleaved)
- BF16 CUDA kernels with FP16 boundary conversion
- Depthwise causal conv1d (prefill + decode)
- QK LayerNorm: per-head RMSNorm on Q/K before RoPE (see the sketch after this list)
- ChatML template with `<think>` reasoning tokens
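A sketch of the QK-norm step mentioned above, assuming one block per (token, head) slice and a single reducing thread for clarity; the repo's kernel will be parallel, and all names here are illustrative:

```cuda
#include <cuda_bf16.h>

// Per-head RMSNorm on Q (applied likewise to K) before RoPE.
// Grid: one block per head slice of the current token.
__global__ void qk_rmsnorm_sketch(__nv_bfloat16* q,            // [num_heads * head_dim]
                                  const __nv_bfloat16* weight, // [head_dim]
                                  int head_dim, float eps) {
    __nv_bfloat16* head = q + blockIdx.x * head_dim;  // this block's head slice

    if (threadIdx.x == 0) {  // single-thread reduction, for clarity only
        float ss = 0.0f;
        for (int d = 0; d < head_dim; ++d) {
            float x = __bfloat162float(head[d]);
            ss += x * x;
        }
        float inv_rms = rsqrtf(ss / head_dim + eps);
        for (int d = 0; d < head_dim; ++d) {
            float x = __bfloat162float(head[d]);
            head[d] = __float2bfloat16(x * inv_rms * __bfloat162float(weight[d]));
        }
    }
}
```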
Performance
| Engine | Tokens/sec | Latency (128 tok) |
|---|---|---|
| NeuroGrid | ~210 tok/s | ~610ms |
| HuggingFace | ~207 tok/s | ~619ms |
Golden Test
First generated token matches HuggingFace reference exactly:
- Expected: `<think>` (token 64400)
- Got: `<think>` (token 64400)
Quick Start
```bash
make download-lfm2-thinking
make run-lfm2-thinking
curl http://localhost:8090/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"lfm2-1.2b-thinking","messages":[{"role":"user","content":"What is 2+2?"}]}'
```
Full Changelog: v0.4.0...v0.5.0
v0.4.0 — Distributed Demo & LFM2 Support
Highlights
Distributed Inference Demo
- One-command setup: `./scripts/demo_distributed.sh`
- Validated on RTX 2080 Ti + RTX 4090 running Mistral 7B across both GPUs
- New `--peer-vram-gb` flag for heterogeneous GPU VRAM override
- Makefile targets: `make demo`, `make demo-stream`, `make demo-stop`
LFM2 Architecture Support (branch feat/lfm2-support)
First non-Llama model: LiquidAI LFM2.5-1.2B-Thinking
- Hybrid conv+attention architecture (10 conv + 6 attention layers)
- BF16 CUDA kernels: RMSNorm, SiLU, GEMM via cublasGemmEx
- Depthwise causal conv1d with FP32 state (see the sketch after this list)
- FP16-pure attention layer (no INT8 quantization)
- ChatML template with thinking token support
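A sketch of the decode step of that conv, assuming a kernel width of 3 and one thread per channel (both assumptions; LFM2's actual width and layout may differ). The rolling window is kept in FP32, matching the "FP32 state" note above:

```cuda
#include <cuda_bf16.h>

#define KERNEL_SIZE 3  // assumed for this sketch

// One decode step of a depthwise causal conv1d: each channel convolves its
// own FP32 history window with its own taps, then shifts the window.
__global__ void conv1d_decode_sketch(const __nv_bfloat16* x,  // [channels] new token input
                                     const float* w,          // [channels, KERNEL_SIZE]
                                     float* state,            // [channels, KERNEL_SIZE-1] FP32 history
                                     __nv_bfloat16* out, int channels) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= channels) return;

    float x_new     = __bfloat162float(x[c]);
    const float* wc = w + c * KERNEL_SIZE;
    float* sc       = state + c * (KERNEL_SIZE - 1);

    // Causal: oldest sample hits the first tap, the new input the last.
    float y = wc[0] * sc[0] + wc[1] * sc[1] + wc[2] * x_new;

    // Shift the FP32 state window forward by one token.
    sc[0] = sc[1];
    sc[1] = x_new;

    out[c] = __float2bfloat16(y);
}
```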
Validated Configurations
| Model | GPUs | Distribution | Result |
|---|---|---|---|
| Mistral 7B Instruct | 2080 Ti + 4090 | 5 + 29 layers | Correct |
| TinyLlama 1.1B | 2080 Ti + 4090 | Distributed | Correct |
| LFM2.5-1.2B-Thinking | 4090 (BF16) | Single GPU | Coherent |
Bug Fixes
- Tokenizer: support array-of-arrays merges format
- Config: LFM2 SwiGLU intermediate_size adjustment
- Empty responses on some configurations
- Chat template detection for ChatML models
Full Changelog: v0.3.0...v0.4.0