Releases: Leeaandrob/neurogrid

v0.8.0 — CUDA Graphs (298 nodes, 225 tok/s)

19 Mar 15:00

CUDA Graphs — Full Decode Replay

All 298 kernel launches in a decode step are captured as a single CUDA graph. After two warmup steps, every subsequent token replays the graph instead of launching kernels individually.
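
A minimal sketch of the capture-and-replay pattern (decode_step, ng_stream, and num_tokens are illustrative names, not the actual NeuroGrid API):

cudaGraph_t graph;
cudaGraphExec_t graph_exec;

// Warmup/capture: run one decode step under stream capture so every kernel
// launch on the stream is recorded as a graph node instead of executing.
cudaStreamBeginCapture(ng_stream, cudaStreamCaptureModeGlobal);
decode_step(ng_stream);                       // enqueues all decode kernels
cudaStreamEndCapture(ng_stream, &graph);
cudaGraphInstantiate(&graph_exec, graph, 0);  // CUDA 12.x signature

// Steady state: each new token replays the whole graph with a single launch.
for (int t = 0; t < num_tokens; ++t) {
    cudaGraphLaunch(graph_exec, ng_stream);
    cudaStreamSynchronize(ng_stream);
}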

Key Changes

  • Global stream routing: 71 kernels across 11 files use ng_get_stream()
  • Zero D2H copies: position, seq_len, and kv_len are read from GPU buffers
  • Conv workspace: pre-allocated BF16 buffers (graph-safe)
  • add_one_kernel: computes kv_len = position + 1 on the GPU (sketch below)
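
Reading the position counter back on the host would force a device-to-host sync and break graph capture, so the kv_len update runs on the device. A minimal sketch of such a kernel (the real add_one_kernel may differ):

// Illustrative single-thread kernel: derives kv_len from the GPU-resident
// position counter, so no value ever round-trips through the host.
__global__ void add_one_kernel(const int* __restrict__ position,
                               int* __restrict__ kv_len) {
    *kv_len = *position + 1;
}

// Launched with one thread on the captured stream; becomes one graph node:
//   add_one_kernel<<<1, 1, 0, ng_get_stream()>>>(d_position, d_kv_len);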

Benchmark (LFM2.5-1.2B, RTX 4090, 128 tokens)

Engine            tok/s   Latency   vs vLLM
HuggingFace 5.3    184    705ms      53%
NeuroGrid v0.7     216    592ms      62%
NeuroGrid v0.8     225    568ms      64%
vLLM 0.17.1        350    366ms     100%
  • CUDA Graph: 298 nodes captured + replaying
  • Golden Test: PASS — <think> (token 64400)
  • 22% faster than HuggingFace transformers

Full Changelog: v0.7.0...v0.8.0

v0.7.0 — Paged KV Cache (PagedAttention)

19 Mar 11:58

Paged KV Cache

A block-based KV cache inspired by vLLM's PagedAttention. The cache is allocated on demand in fixed-size blocks of 16 tokens, eliminating memory fragmentation.
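
With 16-token blocks, a logical token position reaches its physical cache slot through one level of indirection via the sequence's block table. A sketch of that address computation (pool layout and all names here are assumptions, not the actual NeuroGrid code):

#include <cuda_bf16.h>

#define BLOCK_TOKENS 16

// Illustrative paged K/V lookup. The pool is assumed to be laid out as
// [num_physical_blocks][BLOCK_TOKENS][head_dim]; block_table maps a
// sequence's logical block index to a physical block id.
__device__ const __nv_bfloat16* paged_kv_ptr(const __nv_bfloat16* kv_pool,
                                             const int* block_table,
                                             int token_pos, int head_dim) {
    int logical_block  = token_pos / BLOCK_TOKENS;    // which block of the sequence
    int block_offset   = token_pos % BLOCK_TOKENS;    // slot inside that block
    int physical_block = block_table[logical_block];  // indirection
    size_t slot = (size_t)physical_block * BLOCK_TOKENS + block_offset;
    return kv_pool + slot * head_dim;                 // start of the K or V vector
}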

What's New

  • Block Allocator: O(1) alloc/free, per-sequence block tables (sketch after this list)
  • Paged Attention Kernel: reads K/V via block_table indirection, GQA support
  • Per-layer caches: one PagedKVCache per attention layer
  • Auto-sizing: calculates the number of blocks from available VRAM (25% budget)
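
The O(1) alloc/free typically falls out of a simple free-list of physical block ids; a minimal host-side sketch (not the actual allocator):

// Minimal free-list sketch: free physical block ids sit on a stack, so both
// allocating a block for a growing sequence and releasing a finished
// sequence's blocks are single pointer moves.
typedef struct {
    int* free_ids;   // stack of free physical block ids
    int  top;        // number of free blocks remaining
} BlockAllocator;

static int alloc_block(BlockAllocator* a) {
    return (a->top > 0) ? a->free_ids[--a->top] : -1;   // -1: out of blocks
}

static void free_block(BlockAllocator* a, int block_id) {
    a->free_ids[a->top++] = block_id;
}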

Benchmark (LFM2.5-1.2B, RTX 4090, 128 tokens)

Config               tok/s   vs vLLM
v0.6 Contiguous KV    204     58%
v0.7 Paged KV         216     62%
vLLM 0.17.1           350    100%

Correctness: <think> (token 64400) — 100% golden match.

Full Changelog: v0.6.0...v0.7.0

v0.6.0 — Throughput Optimizations & Flash Decode

19 Mar 11:11

Performance Optimizations

Flash Decode Attention Kernel

A custom CUDA kernel with online softmax processes K/V pairs in tiles without materializing the full attention score matrix.
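
Online softmax keeps a running maximum and running denominator, so each new score can be folded into the partial output immediately; a simplified scalar sketch for one query vector and one head (sizes and names are illustrative, not the actual kernel, which parallelizes this across threads and tiles):

#include <math.h>

#define HEAD_DIM 64   // illustrative head size

// Online-softmax attention for one query over kv_len key/value rows.
// The running (max, denominator, accumulator) triple is rescaled whenever a
// larger score appears, so the full score vector is never materialized.
static void online_softmax_attention(const float* q,
                                     const float (*K)[HEAD_DIM],
                                     const float (*V)[HEAD_DIM],
                                     int kv_len, float scale, float* out) {
    float m = -INFINITY;          // running max of scores
    float l = 0.0f;               // running softmax denominator
    float acc[HEAD_DIM] = {0};    // running weighted sum of V rows

    for (int t = 0; t < kv_len; ++t) {
        float s = 0.0f;
        for (int d = 0; d < HEAD_DIM; ++d) s += q[d] * K[t][d];
        s *= scale;

        float m_new = fmaxf(m, s);
        float correction = expf(m - m_new);   // rescale previous partial sums
        float p = expf(s - m_new);
        l = l * correction + p;
        for (int d = 0; d < HEAD_DIM; ++d)
            acc[d] = acc[d] * correction + p * V[t][d];
        m = m_new;
    }
    for (int d = 0; d < HEAD_DIM; ++d)
        out[d] = acc[d] / l;                  // normalize once at the end
}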

Single CUDA Call Decode

All 16 layers execute in one C function call, eliminating ~90 Go↔CUDA round-trips per token. Best measured: 230 tok/s (+9%).
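
Conceptually the boundary moves from one Go-to-C call per layer (or per op) to one per token; a hypothetical shape of that single entry point (ng_decode_step and run_layer are illustrative names, not the actual exported symbols):

#include <cuda_runtime.h>

// Hypothetical per-layer helper: enqueues one layer's kernels on `stream`
// and returns without synchronizing.
void run_layer(float* d_hidden, void* layer_weights, cudaStream_t stream);

// Hypothetical single decode entry point: Go makes one call per token, and
// the layer loop runs entirely on the C side. All pointers are device
// pointers; nothing inside the loop copies back to the host.
extern "C" void ng_decode_step(float* d_hidden, void** layer_weights,
                               int num_layers, cudaStream_t stream) {
    for (int l = 0; l < num_layers; ++l) {
        run_layer(d_hidden, layer_weights[l], stream);
    }
}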

GPU-Resident Decode

The hidden state stays on the GPU between tokens. Embedding lookup and the LM head operate directly on GPU pointers, with an automatic fallback for distributed mode.
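
Combined with the single-call decode above, the per-token loop can leave everything except the sampled token id on the device; a hypothetical host-side sketch (every helper here is illustrative):

#include <cuda_runtime.h>

// Hypothetical helpers: each enqueues GPU work on `stream`; only
// sample_on_host brings a small result back to the CPU.
void embed_lookup(float* d_hidden, const float* d_table, int token, cudaStream_t s);
void ng_decode_step(float* d_hidden, void** layer_weights, int num_layers, cudaStream_t s);
void lm_head(float* d_logits, const float* d_hidden, const float* d_w, cudaStream_t s);
int  sample_on_host(const float* d_logits, int vocab_size, cudaStream_t s);

// The hidden-state buffer d_hidden never leaves the GPU between tokens.
int generate(float* d_hidden, float* d_logits, const float* d_embed_table,
             const float* d_lm_head_w, void** layer_weights, int num_layers,
             int vocab_size, int first_token, int max_new_tokens,
             cudaStream_t stream) {
    int token = first_token;
    for (int step = 0; step < max_new_tokens; ++step) {
        embed_lookup(d_hidden, d_embed_table, token, stream);        // GPU gather
        ng_decode_step(d_hidden, layer_weights, num_layers, stream); // all layers
        lm_head(d_logits, d_hidden, d_lm_head_w, stream);            // GPU GEMM
        token = sample_on_host(d_logits, vocab_size, stream);        // tiny D2H
    }
    return token;
}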

Fused Kernels

  • cuda_silu_mul: SwiGLU activation in a single pass (sketch after this list)
  • cuda_add_rmsnorm: residual add + RMSNorm in a single pass
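
Fusing the two elementwise ops saves a full round trip of the intermediate activation through global memory; a minimal sketch of the SiLU-mul fusion (the real cuda_silu_mul may differ in layout and vectorization):

#include <cuda_bf16.h>

// Illustrative fused SwiGLU activation: out[i] = silu(gate[i]) * up[i].
// One kernel reads gate and up once and writes out once, instead of a
// separate SiLU pass followed by a separate multiply pass.
__global__ void silu_mul_kernel(const __nv_bfloat16* __restrict__ gate,
                                const __nv_bfloat16* __restrict__ up,
                                __nv_bfloat16* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float g = __bfloat162float(gate[i]);
        float u = __bfloat162float(up[i]);
        float silu = g / (1.0f + expf(-g));      // SiLU(x) = x * sigmoid(x)
        out[i] = __float2bfloat16(silu * u);
    }
}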

Benchmark (LFM2.5-1.2B, RTX 4090, 128 tokens)

Engine            tok/s   Latency   vs vLLM
HuggingFace 5.3    201    636ms      57%
NeuroGrid v0.6     204    623ms      58%
vLLM 0.17.1        350    365ms     100%

Correctness: first token = <think> (64400) — 100% golden match.

Full Changelog: v0.5.0...v0.6.0

v0.5.0 — LFM2 Hybrid Architecture Support

18 Mar 17:40

LFM2.5-1.2B-Thinking — First Non-Transformer Model

Full support for LiquidAI's hybrid conv+attention architecture, validated at 100% accuracy against the HuggingFace reference.

Architecture

  • 16 layers: 10 conv + 6 GQA attention (interleaved)
  • BF16 CUDA kernels with FP16 boundary conversion
  • Depthwise causal conv1d (prefill + decode); a decode-step sketch follows this list
  • QK LayerNorm: per-head RMSNorm on Q/K before RoPE
  • ChatML template with <think> reasoning tokens
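
During decode, the depthwise causal conv1d reduces to a per-channel dot product over a short window of recent inputs kept in a small state buffer; a simplified sketch (kernel width, dtype, and layout are assumptions):

// Illustrative depthwise causal conv1d decode step: each channel keeps its
// last KERNEL_W inputs as state and emits one output sample per token.
#define KERNEL_W 4   // assumed width; the real filter length may differ

__global__ void conv1d_decode_kernel(const float* __restrict__ x,       // [channels] new input
                                     float* __restrict__ state,         // [channels][KERNEL_W]
                                     const float* __restrict__ weight,  // [channels][KERNEL_W]
                                     float* __restrict__ out,           // [channels]
                                     int channels) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= channels) return;

    // Shift this channel's causal window left and append the newest sample.
    float* s = state + c * KERNEL_W;
    for (int k = 0; k < KERNEL_W - 1; ++k) s[k] = s[k + 1];
    s[KERNEL_W - 1] = x[c];

    // Depthwise: each channel is convolved only with its own filter taps.
    float acc = 0.0f;
    for (int k = 0; k < KERNEL_W; ++k) acc += s[k] * weight[c * KERNEL_W + k];
    out[c] = acc;
}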

Performance

Engine        Tokens/sec   Latency (128 tok)
NeuroGrid      ~210 tok/s   ~610ms
HuggingFace    ~207 tok/s   ~619ms

Golden Test

The first generated token matches the HuggingFace reference exactly:

  • Expected: <think> (token 64400)
  • Got: <think> (token 64400)

Quick Start

make download-lfm2-thinking
make run-lfm2-thinking
curl http://localhost:8090/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"lfm2-1.2b-thinking","messages":[{"role":"user","content":"What is 2+2?"}]}'

Full Changelog: v0.4.0...v0.5.0

v0.4.0 — Distributed Demo & LFM2 Support

18 Mar 15:04

Highlights

Distributed Inference Demo

  • One-command setup: ./scripts/demo_distributed.sh
  • Validated on RTX 2080 Ti + RTX 4090 running Mistral 7B across both GPUs
  • New --peer-vram-gb flag for heterogeneous GPU VRAM override
  • Makefile targets: make demo, make demo-stream, make demo-stop

LFM2 Architecture Support (branch feat/lfm2-support)

First non-Llama model: LiquidAI LFM2.5-1.2B-Thinking

  • Hybrid conv+attention architecture (10 conv + 6 attention layers)
  • BF16 CUDA kernels: RMSNorm, SiLU, GEMM via cublasGemmEx
  • Depthwise causal conv1d with FP32 state
  • FP16-pure attention layer (no INT8 quantization)
  • ChatML template with thinking token support

Validated Configurations

Model                  GPUs             Distribution     Result
Mistral 7B Instruct    2080 Ti + 4090   5 + 29 layers    Correct
TinyLlama 1.1B         2080 Ti + 4090   Distributed      Correct
LFM2.5-1.2B-Thinking   4090 (BF16)      Single GPU       Coherent

Bug Fixes

  • Tokenizer: support for the array-of-arrays merges format
  • Config: LFM2 SwiGLU intermediate_size adjustment
  • Fixed empty responses on some configurations
  • Fixed chat template detection for ChatML models

Full Changelog: v0.3.0...v0.4.0