Releases: Leeaandrob/neurogrid

v0.8.0 — CUDA Graphs (298 nodes, 225 tok/s)

19 Mar 15:00

CUDA Graphs — Full Decode Replay

All 298 kernel launches in a decode step are captured as a single CUDA graph. After two warmup steps, every subsequent token replays the graph instead of launching kernels individually.
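
A minimal sketch of the capture-and-replay pattern (decode_step, ng_stream, and num_tokens are illustrative names, not the actual NeuroGrid API):

cudaGraph_t graph;
cudaGraphExec_t graph_exec;

// Warmup/capture: run one decode step under stream capture so every kernel
// launch on the stream is recorded as a graph node instead of executing.
cudaStreamBeginCapture(ng_stream, cudaStreamCaptureModeGlobal);
decode_step(ng_stream);                       // enqueues all decode kernels
cudaStreamEndCapture(ng_stream, &graph);
cudaGraphInstantiate(&graph_exec, graph, 0);  // CUDA 12.x signature

// Steady state: each new token replays the whole graph with a single launch.
for (int t = 0; t < num_tokens; ++t) {
    cudaGraphLaunch(graph_exec, ng_stream);
    cudaStreamSynchronize(ng_stream);
}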

Key Changes

  • Global stream routing: 71 kernels across 11 files use ng_get_stream()
  • Zero D2H copies: position, seq_len, and kv_len are read from GPU buffers
  • Conv workspace: pre-allocated BF16 buffers (graph-safe)
  • add_one_kernel: computes kv_len = position + 1 on the GPU (sketch below)
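
Reading the position counter back on the host would force a device-to-host sync and break graph capture, so the kv_len update runs on the device. A minimal sketch of such a kernel (the real add_one_kernel may differ):

// Illustrative single-thread kernel: derives kv_len from the GPU-resident
// position counter, so no value ever round-trips through the host.
__global__ void add_one_kernel(const int* __restrict__ position,
                               int* __restrict__ kv_len) {
    *kv_len = *position + 1;
}

// Launched with one thread on the captured stream; becomes one graph node:
//   add_one_kernel<<<1, 1, 0, ng_get_stream()>>>(d_position, d_kv_len);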

Benchmark (LFM2.5-1.2B, RTX 4090, 128 tokens)

Engine            tok/s   Latency   vs vLLM
HuggingFace 5.3    184    705ms      53%
NeuroGrid v0.7     216    592ms      62%
NeuroGrid v0.8     225    568ms      64%
vLLM 0.17.1        350    366ms     100%
  • CUDA Graph: 298 nodes captured + replaying
  • Golden Test: PASS — <think> (token 64400)
  • 22% faster than HuggingFace transformers

Full Changelog: v0.7.0...v0.8.0

v0.7.0 — Paged KV Cache (PagedAttention)

19 Mar 11:58

Paged KV Cache

A block-based KV cache inspired by vLLM's PagedAttention. The cache is allocated on demand in fixed-size blocks of 16 tokens, eliminating memory fragmentation.
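
With 16-token blocks, a logical token position reaches its physical cache slot through one level of indirection via the sequence's block table. A sketch of that address computation (pool layout and all names here are assumptions, not the actual NeuroGrid code):

#include <cuda_bf16.h>

#define BLOCK_TOKENS 16

// Illustrative paged K/V lookup. The pool is assumed to be laid out as
// [num_physical_blocks][BLOCK_TOKENS][head_dim]; block_table maps a
// sequence's logical block index to a physical block id.
__device__ const __nv_bfloat16* paged_kv_ptr(const __nv_bfloat16* kv_pool,
                                             const int* block_table,
                                             int token_pos, int head_dim) {
    int logical_block  = token_pos / BLOCK_TOKENS;    // which block of the sequence
    int block_offset   = token_pos % BLOCK_TOKENS;    // slot inside that block
    int physical_block = block_table[logical_block];  // indirection
    size_t slot = (size_t)physical_block * BLOCK_TOKENS + block_offset;
    return kv_pool + slot * head_dim;                 // start of the K or V vector
}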

What's New

  • Block Allocator: O(1) alloc/free, per-sequence block tables (sketch after this list)
  • Paged Attention Kernel: reads K/V via block_table indirection, GQA support
  • Per-layer caches: one PagedKVCache per attention layer
  • Auto-sizing: calculates the number of blocks from available VRAM (25% budget)
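
The O(1) alloc/free typically falls out of a simple free-list of physical block ids; a minimal host-side sketch (not the actual allocator):

// Minimal free-list sketch: free physical block ids sit on a stack, so both
// allocating a block for a growing sequence and releasing a finished
// sequence's blocks are single pointer moves.
typedef struct {
    int* free_ids;   // stack of free physical block ids
    int  top;        // number of free blocks remaining
} BlockAllocator;

static int alloc_block(BlockAllocator* a) {
    return (a->top > 0) ? a->free_ids[--a->top] : -1;   // -1: out of blocks
}

static void free_block(BlockAllocator* a, int block_id) {
    a->free_ids[a->top++] = block_id;
}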

Benchmark (LFM2.5-1.2B, RTX 4090, 128 tokens)

Config               tok/s   vs vLLM
v0.6 Contiguous KV    204     58%
v0.7 Paged KV         216     62%
vLLM 0.17.1           350    100%

Correctness: <think> (token 64400) — 100% golden match.

Full Changelog: v0.6.0...v0.7.0

v0.6.0 — Throughput Optimizations & Flash Decode

19 Mar 11:11

Performance Optimizations

Flash Decode Attention Kernel

A custom CUDA kernel with online softmax processes K/V pairs in tiles without materializing the full attention score matrix.
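
Online softmax keeps a running maximum and running denominator, so each new score can be folded into the partial output immediately; a simplified scalar sketch for one query vector and one head (sizes and names are illustrative, not the actual kernel, which parallelizes this across threads and tiles):

#include <math.h>

#define HEAD_DIM 64   // illustrative head size

// Online-softmax attention for one query over kv_len key/value rows.
// The running (max, denominator, accumulator) triple is rescaled whenever a
// larger score appears, so the full score vector is never materialized.
static void online_softmax_attention(const float* q,
                                     const float (*K)[HEAD_DIM],
                                     const float (*V)[HEAD_DIM],
                                     int kv_len, float scale, float* out) {
    float m = -INFINITY;          // running max of scores
    float l = 0.0f;               // running softmax denominator
    float acc[HEAD_DIM] = {0};    // running weighted sum of V rows

    for (int t = 0; t < kv_len; ++t) {
        float s = 0.0f;
        for (int d = 0; d < HEAD_DIM; ++d) s += q[d] * K[t][d];
        s *= scale;

        float m_new = fmaxf(m, s);
        float correction = expf(m - m_new);   // rescale previous partial sums
        float p = expf(s - m_new);
        l = l * correction + p;
        for (int d = 0; d < HEAD_DIM; ++d)
            acc[d] = acc[d] * correction + p * V[t][d];
        m = m_new;
    }
    for (int d = 0; d < HEAD_DIM; ++d)
        out[d] = acc[d] / l;                  // normalize once at the end
}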

Single CUDA Call Decode

All 16 layers execute in one C function call, eliminating ~90 Go↔CUDA round-trips per token. Best measured: 230 tok/s (+9%).
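
Conceptually the boundary moves from one Go-to-C call per layer (or per op) to one per token; a hypothetical shape of that single entry point (ng_decode_step and run_layer are illustrative names, not the actual exported symbols):

#include <cuda_runtime.h>

// Hypothetical per-layer helper: enqueues one layer's kernels on `stream`
// and returns without synchronizing.
void run_layer(float* d_hidden, void* layer_weights, cudaStream_t stream);

// Hypothetical single decode entry point: Go makes one call per token, and
// the layer loop runs entirely on the C side. All pointers are device
// pointers; nothing inside the loop copies back to the host.
extern "C" void ng_decode_step(float* d_hidden, void** layer_weights,
                               int num_layers, cudaStream_t stream) {
    for (int l = 0; l < num_layers; ++l) {
        run_layer(d_hidden, layer_weights[l], stream);
    }
}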

GPU-Resident Decode

The hidden state stays on the GPU between tokens. Embedding lookup and the LM head operate directly on GPU pointers, with an automatic fallback for distributed mode.
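
Combined with the single-call decode above, the per-token loop can leave everything except the sampled token id on the device; a hypothetical host-side sketch (every helper here is illustrative):

#include <cuda_runtime.h>

// Hypothetical helpers: each enqueues GPU work on `stream`; only
// sample_on_host brings a small result back to the CPU.
void embed_lookup(float* d_hidden, const float* d_table, int token, cudaStream_t s);
void ng_decode_step(float* d_hidden, void** layer_weights, int num_layers, cudaStream_t s);
void lm_head(float* d_logits, const float* d_hidden, const float* d_w, cudaStream_t s);
int  sample_on_host(const float* d_logits, int vocab_size, cudaStream_t s);

// The hidden-state buffer d_hidden never leaves the GPU between tokens.
int generate(float* d_hidden, float* d_logits, const float* d_embed_table,
             const float* d_lm_head_w, void** layer_weights, int num_layers,
             int vocab_size, int first_token, int max_new_tokens,
             cudaStream_t stream) {
    int token = first_token;
    for (int step = 0; step < max_new_tokens; ++step) {
        embed_lookup(d_hidden, d_embed_table, token, stream);        // GPU gather
        ng_decode_step(d_hidden, layer_weights, num_layers, stream); // all layers
        lm_head(d_logits, d_hidden, d_lm_head_w, stream);            // GPU GEMM
        token = sample_on_host(d_logits, vocab_size, stream);        // tiny D2H
    }
    return token;
}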

Fused Kernels

  • cuda_silu_mul: SwiGLU activation in a single pass (sketch after this list)
  • cuda_add_rmsnorm: residual add + RMSNorm in a single pass
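
Fusing the two elementwise ops saves a full round trip of the intermediate activation through global memory; a minimal sketch of the SiLU-mul fusion (the real cuda_silu_mul may differ in layout and vectorization):

#include <cuda_bf16.h>

// Illustrative fused SwiGLU activation: out[i] = silu(gate[i]) * up[i].
// One kernel reads gate and up once and writes out once, instead of a
// separate SiLU pass followed by a separate multiply pass.
__global__ void silu_mul_kernel(const __nv_bfloat16* __restrict__ gate,
                                const __nv_bfloat16* __restrict__ up,
                                __nv_bfloat16* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float g = __bfloat162float(gate[i]);
        float u = __bfloat162float(up[i]);
        float silu = g / (1.0f + expf(-g));      // SiLU(x) = x * sigmoid(x)
        out[i] = __float2bfloat16(silu * u);
    }
}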

Benchmark (LFM2.5-1.2B, RTX 4090, 128 tokens)

Engine            tok/s   Latency   vs vLLM
HuggingFace 5.3    201    636ms      57%
NeuroGrid v0.6     204    623ms      58%
vLLM 0.17.1        350    365ms     100%

Correctness: first token = <think> (64400) — 100% golden match.

Full Changelog: v0.5.0...v0.6.0

v0.5.0 — LFM2 Hybrid Architecture Support

18 Mar 17:40

LFM2.5-1.2B-Thinking — First Non-Transformer Model

Full support for LiquidAI's hybrid conv+attention architecture, validated at 100% accuracy against the HuggingFace reference.

Architecture

  • 16 layers: 10 conv + 6 GQA attention (interleaved)
  • BF16 CUDA kernels with FP16 boundary conversion
  • Depthwise causal conv1d (prefill + decode); a decode-step sketch follows this list
  • QK LayerNorm: per-head RMSNorm on Q/K before RoPE
  • ChatML template with <think> reasoning tokens
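
During decode, the depthwise causal conv1d reduces to a per-channel dot product over a short window of recent inputs kept in a small state buffer; a simplified sketch (kernel width, dtype, and layout are assumptions):

// Illustrative depthwise causal conv1d decode step: each channel keeps its
// last KERNEL_W inputs as state and emits one output sample per token.
#define KERNEL_W 4   // assumed width; the real filter length may differ

__global__ void conv1d_decode_kernel(const float* __restrict__ x,       // [channels] new input
                                     float* __restrict__ state,         // [channels][KERNEL_W]
                                     const float* __restrict__ weight,  // [channels][KERNEL_W]
                                     float* __restrict__ out,           // [channels]
                                     int channels) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= channels) return;

    // Shift this channel's causal window left and append the newest sample.
    float* s = state + c * KERNEL_W;
    for (int k = 0; k < KERNEL_W - 1; ++k) s[k] = s[k + 1];
    s[KERNEL_W - 1] = x[c];

    // Depthwise: each channel is convolved only with its own filter taps.
    float acc = 0.0f;
    for (int k = 0; k < KERNEL_W; ++k) acc += s[k] * weight[c * KERNEL_W + k];
    out[c] = acc;
}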

Performance

Engine        Tokens/sec   Latency (128 tok)
NeuroGrid      ~210 tok/s   ~610ms
HuggingFace    ~207 tok/s   ~619ms

Golden Test

The first generated token matches the HuggingFace reference exactly:

  • Expected: <think> (token 64400)
  • Got: <think> (token 64400)

Quick Start

make download-lfm2-thinking
make run-lfm2-thinking
curl http://localhost:8090/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"lfm2-1.2b-thinking","messages":[{"role":"user","content":"What is 2+2?"}]}'

Full Changelog: v0.4.0...v0.5.0

v0.4.0 — Distributed Demo & LFM2 Support

18 Mar 15:04

Highlights

Distributed Inference Demo

  • One-command setup: ./scripts/demo_distributed.sh
  • Validated on RTX 2080 Ti + RTX 4090 running Mistral 7B across both GPUs
  • New --peer-vram-gb flag for heterogeneous GPU VRAM override
  • Makefile targets: make demo, make demo-stream, make demo-stop

LFM2 Architecture Support (branch feat/lfm2-support)

First non-Llama model: LiquidAI LFM2.5-1.2B-Thinking

  • Hybrid conv+attention architecture (10 conv + 6 attention layers)
  • BF16 CUDA kernels: RMSNorm, SiLU, GEMM via cublasGemmEx
  • Depthwise causal conv1d with FP32 state
  • FP16-pure attention layer (no INT8 quantization)
  • ChatML template with thinking token support

Validated Configurations

Model                  GPUs             Distribution     Result
Mistral 7B Instruct    2080 Ti + 4090   5 + 29 layers    Correct
TinyLlama 1.1B         2080 Ti + 4090   Distributed      Correct
LFM2.5-1.2B-Thinking   4090 (BF16)      Single GPU       Coherent

Bug Fixes

  • Tokenizer: support for the array-of-arrays merges format
  • Config: LFM2 SwiGLU intermediate_size adjustment
  • Fixed empty responses on some configurations
  • Fixed chat template detection for ChatML models

Full Changelog: v0.3.0...v0.4.0