feat: add ClassifierStrategy to gate ANE vs CPU-tiled classifier for large-vocab models #16
christopherkarani wants to merge 163 commits into
Feat/ane multitoken
Rewrite README to lead with numbers and architecture instead of hedged academic prose. Round benchmarks to 2 decimal places, merge repeated disclaimers into one caveats paragraph, cut 126→93 lines. All factual claims and links preserved. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…s-1773267371105 Add Claude Code GitHub Workflow
runCommittedSingleToken (used during prefill) was calling prepareActivationPair which computed output heads for both pair0 and pair1, but pair1's result was always discarded. For the identityZeroTrunk backend, fast-path directly to a single selectTokenFromActivation call. Also adds TTFT measurement (ttft_ms, ttft_cold_ms) to benchmark JSON output, benchmark script summary, and autoresearch log. Measured: two_step_ttft_ms 2.133 → ~1.0 ms (-53%), parity=match. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…esent When hasFutureProposer is false, futureClassifier/futureRMS/futureNorm/ futureLogits are never read (guarded by hasFutureProposer check in selectProposedFutureToken). Allocating count: 0 instead of count: vocabSize*dim removes ~49MB of zeroed memory from init. Two-step cold TTFT improved from ~19.4ms to ~11.7ms (-40%). Two-step init is now faster than the control model init. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the scalar argmax loop in selectToken with Accelerate's vDSP_maxvi for SIMD-vectorized max-with-index over 32k logits. Also remove the redundant stepLogits.zero() before sgemm(beta=0) in the two-step CPU output head path. Warm TTFT improved from ~1.0ms to ~0.88ms (-12%) for both control and two-step models (shared selectToken function). Parity=match. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
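The vectorized argmax described in this commit can be sketched roughly as follows. This is a minimal illustration under assumptions, not the repository's `selectToken`: the helper name `argmaxIndex` is hypothetical, and the scalar branch is there only so the snippet also compiles where Accelerate is unavailable.

```swift
#if canImport(Accelerate)
import Accelerate
#endif

/// Returns the index of the largest logit.
/// Hypothetical helper for illustration; not the project's selectToken.
func argmaxIndex(_ logits: [Float]) -> Int {
    precondition(!logits.isEmpty, "logits must not be empty")
    #if canImport(Accelerate)
    var maxValue: Float = 0
    var maxIndex: vDSP_Length = 0
    // One vectorized pass that writes both the maximum and its index.
    vDSP_maxvi(logits, 1, &maxValue, &maxIndex, vDSP_Length(logits.count))
    return Int(maxIndex)
    #else
    // Scalar fallback for platforms without Accelerate.
    var best = 0
    for i in 1..<logits.count where logits[i] > logits[best] { best = i }
    return best
    #endif
}
```

With a 32k-entry logits array the call shape is the same; only the array length changes.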
… aggregate throughput Pack N concurrent streams into N spatial lanes of shared ANE kernels. ANE processes all 32 lanes per eval regardless, so N streams cost the same as 1 eval. This converts 3N dispatches per token step to just 3, yielding 2498 TPS@6streams vs 1206 TPS concurrent baseline (+107%). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… 5381 TPS Scale the spatial-lane batched benchmark from 6 to 32 lanes. ANE eval cost is constant regardless of active lanes; only per-stream CPU work (embedding write + argmax read) grows linearly. At 32 streams: 5381 aggregate TPS (4.4x over concurrent baseline, 2.15x over 6-stream batched), per-stream 168 TPS, round latency 48ms. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…t 32 streams
Add ane_interop_io_write_embedding_batch_fp16() and ane_interop_io_argmax_batch_fp16_spatial(): bulk I/O functions that process all N spatial lanes under a single IOSurface lock/unlock pair. The fused embedding write also eliminates the intermediate FP32 buffer by reading directly from the embedding table into the surface.
Before: 5381 TPS@32 streams (512 lock/unlock ops per round)
After: 9572 TPS@32 streams (4 lock/unlock ops per round, +78%)
Round latency at 32 streams dropped from 47.6ms to 26.7ms.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…old TTFT) Add TensorBuffer(nonOwningViewOf:) init that wraps existing memory without allocation. Model inits gain shareReadOnlyWeights parameter; probe CLI gains --share-weights flag. Cold TTFT drops from 11.1ms to 1.98ms (control) and 9.9ms to 1.85ms (two-step). Warm TTFT unaffected; parity match confirmed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
No measurable TTFT impact at 3KB per copy, but cleaner and more idiomatic. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…t 32 streams
Replace scalar 8-way unrolled per-stream argmax with ARM NEON branchless channel-major scan using float16x8_t + vcgtq_f16 + vbslq conditional select. Processes all 32 spatial lanes in 4 NEON vectors per channel row (1 cache line sequential read).
Release-mode benchmark at 32 streams:
Before: 11,537 agg TPS (argmax ~580 µs)
After: 18,110 agg TPS (argmax ~46 µs)
Includes per-component timing diagnostic test for future optimization instrumentation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
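A scalar reference for the channel-major scan may help when reading the NEON version. The sketch below is an assumption-laden illustration: it uses `Float` instead of FP16, assumes the surface is laid out as `[channel][lane]`, and the name `argmaxPerLane` is hypothetical. The real kernel does the same comparison and select on `float16x8_t` vectors.

```swift
/// Scalar reference for a channel-major, per-lane argmax.
/// Layout assumption: surface[channel * lanes + lane].
func argmaxPerLane(_ surface: [Float], channels: Int, lanes: Int) -> [Int] {
    precondition(channels > 0 && lanes > 0, "dimensions must be positive")
    precondition(surface.count == channels * lanes, "surface size mismatch")

    // Running best value and best channel index for every lane.
    var best = Array(surface[0..<lanes])
    var bestIndex = [Int](repeating: 0, count: lanes)

    // Walk channel rows sequentially so memory is read in order.
    for channel in 1..<channels {
        let rowStart = channel * lanes
        for lane in 0..<lanes {
            let value = surface[rowStart + lane]
            // Compare, then select: the scalar analogue of vcgtq_f16 + vbslq.
            let isGreater = value > best[lane]
            best[lane] = isGreater ? value : best[lane]
            bestIndex[lane] = isGreater ? channel : bestIndex[lane]
        }
    }
    return bestIndex
}
```

Because each row is contiguous across lanes, one pass over the rows updates every stream's running maximum at once, which is what lets the vector version handle 32 lanes with a handful of registers.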
Replace scalar element-by-element scan in tensorBufferIsAllZero with vDSP_maxmgv which uses SIMD to find max absolute value in one pass. Scanning 9MB of layer weights drops from ~0.65ms to ~0.05ms. Cold TTFT 1.98ms → 1.16ms (control), 1.85ms → 1.06ms (two-step). Parity match. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
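The all-zero check via maximum magnitude can be sketched like this. It is an illustration only: `isAllZero` is a hypothetical stand-in for `tensorBufferIsAllZero`, it takes a plain `[Float]` rather than the project's buffer type, and the scalar branch exists for platforms without Accelerate.

```swift
#if canImport(Accelerate)
import Accelerate
#endif

/// Returns true when every element is zero.
/// Hypothetical stand-in for the project's tensorBufferIsAllZero.
func isAllZero(_ values: [Float]) -> Bool {
    guard !values.isEmpty else { return true }
    #if canImport(Accelerate)
    var maxMagnitude: Float = 0
    // Largest absolute value in one vectorized pass; zero means all zero.
    vDSP_maxmgv(values, 1, &maxMagnitude, vDSP_Length(values.count))
    return maxMagnitude == 0
    #else
    // Scalar fallback for platforms without Accelerate.
    return values.allSatisfy { $0 == 0 }
    #endif
}
```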
All attempted warm-path accelerations are slower than baseline: vDSP_mmul (1.05x), INT8 tiled (2.5x), streaming dot+argmax (1.7x), tiled FP16 conversion (8.6x). The 98MB FP32 classifier at ~100 GB/s yields a ~0.88ms hard minimum. Measured warm TTFT ~0.93ms is within 6% of theoretical floor. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move recurrentWeightsUseIdentityZeroTrunk check to #if DEBUG — the caller already declares .identityZeroTrunk intent; the vDSP_maxmgv scan over 6×4×2.25MB was the last measurable init overhead. Init: 0.17ms → 0.01ms. Cold TTFT ≈ warm TTFT now. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Final 5-repeat median (warmup=3, iter=10):
Control warm: 0.941ms (baseline 0.993ms, -5%)
Two-step warm: 0.930ms (baseline 2.133ms, -56%)
Control cold: 0.972ms (baseline 18.784ms, -95%)
Two-step cold: 0.944ms (baseline 46.812ms, -98%)
Warm TTFT within 6% of 0.88ms theoretical floor (98MB FP32 classifier at ~100 GB/s DRAM bandwidth). Cold ≈ warm.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…r 53K TPS
Generalize NEON argmax from hardcoded spatial=32 to any multiple of 8 up to 128. Extend multistream benchmark to sweep [1..128] streams with power-of-2 laneSpatial (32, 64, 128). ANE amortizes weight loading across all spatial lanes: doubling lanes adds only 8-21% per-step overhead.
Release-mode benchmark results:
32 streams (lane=32): 16,275 agg TPS
64 streams (lane=64): 31,337 agg TPS (+93%)
128 streams (lane=128): 53,496 agg TPS (+228%)
Per-step latency: 1.97ms@32 → 2.04ms@64 → 2.39ms@128
Matched 6-stream benchmark: PASS (2769 TPS batched ANE)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…untime-cache Improve GPT-2 speculative decode throughput and ANE cache reuse
- Add llama.cpp comparison row to benchmark table with Metal/CPU baselines
- Add platform compatibility matrix covering M1–M4 SoCs
- Add SPM integration section with 5-line first-inference example
- Add release badge
- Clarify iOS/tvOS entitlement situation in platform matrix
- Tighten quick-start: git clone + 3-line TUI launch first
Co-Authored-By: Paperclip <noreply@paperclip.ing>
…plates
- Add .github/workflows/ci.yml: build + test matrix on Xcode 16.2 and 16.3 (macos-15, SPM cache, unit tests only, no ANE runner required)
- Add CONTRIBUTING.md: dev setup, project structure, coding standards, TDD guide
- Add .github/ISSUE_TEMPLATE/bug_report.md and feature_request.md
- Add .github/PULL_REQUEST_TEMPLATE.md with benchmark impact section
- Update README badges: CI badge alongside existing ANE matrix badge
Co-Authored-By: Paperclip <noreply@paperclip.ing>
- Target list with 10 Swift/ML community leaders (Panaro, Hollance, Maderix, HF team, MLX, Paul Hudson, Sean Allen)
- 5 personalized ready-to-send outreach messages
- Partnership research for Apple Silicon benchmark projects (ANEMLL, more-ane-transformers, neural-engine)
- Conference talk proposals for Deep Dish Swift, try! Swift Tokyo, WWDC Labs, SwiftConf
- Three talk formats: 40-min technical, 20-min intro, 10-min lightning demo
Related: ESP-10, ESP-11, ESP-12
Co-Authored-By: Paperclip <noreply@paperclip.ing>
…rainingLoop
- Examples/SimpleInference: ~20-line GPT-2 generation using RealModelInferenceEngine.build()
- Examples/BenchmarkSuite: Espresso vs CoreML comparison via espresso bench
- Examples/TrainingLoop: fine-tuning wrapper over espresso-train CLI
- Examples/README.md: setup guide with env vars and local-path override
Each example is a standalone Swift package (macOS 15+, Swift 6.2).
Co-Authored-By: Paperclip <noreply@paperclip.ing>
- ModelRegistry: add llama3_2_1b (16L/32H/8KVH/2048d/8192h) and llama3_2_3b (28L/24H/8KVH/3072d/8192h). Both use .llama architecture (SwiGLU, RMSNorm, GQA). Offline converter handles GQA head expansion and RoPE rotation baking into Wq/Wk weights.
- Tests: add llama3_2_1bConfigIsCorrect, llama3_2_3bConfigIsCorrect, update registryContainsAllSixModels. All 17 ModelSupportTests pass.
- Benchmark dashboard: benchmarks/results/latest.json (3.41x over CoreML, 519 tok/s on M3 Max); scripts/generate-benchmark-dashboard.sh regenerates docs/benchmarks.md from JSON; .github/workflows/benchmark-dashboard.yml auto-triggers on JSON changes.
- .gitignore: allow benchmarks/results/latest.json, docs/benchmarks.md, scripts/generate-benchmark-dashboard.sh.
Co-Authored-By: Paperclip <noreply@paperclip.ing>
- coreml-vs-espresso-benchmarks.html: benchmark comparison with data tables, visual bar chart, architecture explanation, and M1/M2/M3/M4 projections
- gpt2-926-tokens-per-second.html: step-by-step guide covering 4.76x win: direct ANE access, 3-layer fusion, recurrent arch, zero-copy argmax
- reverse-engineering-apple-neural-engine.html: internals deep-dive covering dlopen bridge, MIL ops, IOSurface memory model, and confirmed dead ends
- blog.html: blog index listing all posts with summaries and tags
- docs/index.html: add Blog nav link + "From the Blog" section for SEO
- .gitignore: allow docs HTML files for GitHub Pages
Targets keywords: "CoreML alternative", "apple neural engine framework", "swift ml inference", "GPT-2 apple silicon", "ANE reverse engineering"
Co-Authored-By: Paperclip <noreply@paperclip.ing>
…NE inference
Adds EspressoGGUF target that bridges EdgeRunner's GGUF loader into Espresso's weight format. GGUFModelLoader.prepare() loads a GGUF file, dequantizes via Metal, transposes per architecture convention, wraps in BLOBFILE format, and writes to a temp directory compatible with RealModelInferenceEngine.build().
- Bumps platform to macOS 26 (required by EdgeRunner/Metal 4)
- Bumps swift-tools-version to 6.2
- EdgeRunner added as local package dependency
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Introduces ClassifierStrategy with a 16M-element SRAM threshold (32MB fp16). Models with vocab*dModel <= 16M use the ANE lane-packed classifier head; larger models (Stories110M, TinyLlama, Qwen3 0.6B) fall back to FP16TiledClassifier on CPU. Adds 5 Swift Testing tests covering strategy selection for small/large/huge vocabs and CPU-tiled argmax correctness. Also adds Espresso dependency to RealModelInferenceTests for FP16TiledClassifier access. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
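The selection rule described in this commit can be sketched as below. This is a guess at the shape, not the actual source: only `ClassifierStrategy`, `.ane`, and `select` are named in the commit messages. The case name `cpuTiled`, the parameter labels, and reading "16M" as 16 × 1024 × 1024 elements (which matches 32MB at 2 bytes per FP16 element) are assumptions.

```swift
/// Sketch of a strategy gate keyed on classifier weight size.
enum ClassifierStrategy: Equatable {
    case ane
    case cpuTiled  // hypothetical case name for the CPU-tiled fallback

    /// Assumption: "16M elements" means 16 * 1024 * 1024 (32 MiB of FP16).
    static let sramElementThreshold = 16 * 1024 * 1024

    /// Picks the ANE head when vocab * dModel fits the threshold, else CPU.
    static func select(vocabSize: Int, dModel: Int) -> ClassifierStrategy {
        vocabSize * dModel <= sramElementThreshold ? .ane : .cpuTiled
    }
}
```

Under these assumptions a 32,000 × 768 head has 24,576,000 elements, which exceeds the threshold and takes the CPU-tiled path, while a head with exactly 16,777,216 elements still selects the ANE path.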
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove unused `import Espresso` from ClassifierStrategy.swift
- Add exactThresholdSelectsANE and oneOverThresholdSelectsCPU tests
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…PU classifier gating
Integrate ClassifierStrategy into the Llama decode path to gate between ANE lane-packed classifier and CPU-tiled FP16 classifier based on vocab*dim SRAM fit.
Changes:
- Add lmHeadFP16 field to LlamaTopLevelAssets (pre-converted FP16 weights)
- Add classifierStrategy stored property, initialized via ClassifierStrategy.select()
- Gate ANE greedy norm+classifier compile behind classifierStrategy == .ane
- Add CPU-tiled greedy head branch in generateIncrementalHybridLlama decode loop (surface read -> CPU RMSNorm -> FP16TiledClassifier.tiledMatvecArgmax)
- Skip xCur readback when using either ANE or CPU-tiled greedy head
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
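The CPU-tiled greedy head (RMSNorm followed by a tiled matrix-vector product with a running argmax) can be sketched in plain Swift. This is not the project's `FP16TiledClassifier`: it uses `Float` weights instead of FP16 for portability, and the function names, the `tileRows` parameter, and the `eps` default are assumptions made for illustration.

```swift
/// RMSNorm: x[i] * weight[i] / sqrt(mean(x^2) + eps).
func rmsNorm(_ x: [Float], weight: [Float], eps: Float = 1e-5) -> [Float] {
    precondition(!x.isEmpty && x.count == weight.count, "shape mismatch")
    let meanSquare = x.reduce(0) { $0 + $1 * $1 } / Float(x.count)
    let invRms = 1 / (meanSquare + eps).squareRoot()
    return zip(x, weight).map { $0 * invRms * $1 }
}

/// Row-major [vocabSize x dim] classifier: returns the row with the largest
/// dot product against `hidden`, visiting rows one tile at a time so the
/// full logits vector is never materialized.
func tiledMatvecArgmax(weights: [Float], vocabSize: Int, dim: Int,
                       hidden: [Float], tileRows: Int = 1024) -> Int {
    precondition(vocabSize > 0 && tileRows > 0, "sizes must be positive")
    precondition(weights.count == vocabSize * dim && hidden.count == dim,
                 "shape mismatch")
    var bestIndex = 0
    var bestLogit = -Float.infinity
    var tileStart = 0
    while tileStart < vocabSize {
        let tileEnd = min(tileStart + tileRows, vocabSize)
        for row in tileStart..<tileEnd {
            let base = row * dim
            var logit: Float = 0
            for k in 0..<dim { logit += weights[base + k] * hidden[k] }
            // Keep only the running maximum across tiles.
            if logit > bestLogit {
                bestLogit = logit
                bestIndex = row
            }
        }
        tileStart = tileEnd
    }
    return bestIndex
}
```

A decode step would call `rmsNorm` on the hidden state read back from the surface and pass the result to `tiledMatvecArgmax`. Tiling bounds the working set per pass, which is the point of the fallback for heads too large for the ANE path.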
Migrate all token-semantic UInt16 to TokenID (UInt32) in CPURecurrentGenerationModel and FutureTokenProposingLanguageModel protocol. Includes OfflineExactAcceptanceEvaluator trace types. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…guard Migrate all ~40 token-semantic UInt16 occurrences to TokenID (UInt32) in RealModelInferenceEngine: GenerationResult, GenerationStep, encodePrompt, sampleToken, selectGreedyToken, all hybrid/speculative generation methods, and testing helpers. Remove the UInt16 vocab capacity guard that was the original motivation for this migration. Preserved: lmHeadFP16: [UInt16] (fp16 weight data, not token IDs). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nversion Change writeEmbeddingBatchFP16 tokenIDs parameter from UnsafePointer<UInt16> to UnsafePointer<TokenID>. At the C interop boundary, narrow each TokenID to UInt16 with an exact check, throwing argumentOutOfRange if the token exceeds the ANE embedding surface capacity. Preserved: fp16 channel capacity guards (UInt16.max) and all MemoryLayout<UInt16> references remain unchanged. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
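The exact-narrowing step at the interop boundary can be illustrated with `UInt16(exactly:)`. This is a sketch under assumptions: the `TokenID` alias mirrors the `UInt32` definition stated in the commits, while `NarrowingError` and `narrowTokens` are hypothetical names standing in for the project's error type and call site.

```swift
typealias TokenID = UInt32

/// Hypothetical error type mirroring the argumentOutOfRange case in the commit.
enum NarrowingError: Error, Equatable {
    case argumentOutOfRange(TokenID)
}

/// Narrows each token to UInt16, throwing instead of silently truncating.
func narrowTokens(_ tokens: [TokenID]) throws -> [UInt16] {
    try tokens.map { token in
        // UInt16(exactly:) returns nil when the value does not fit.
        guard let narrow = UInt16(exactly: token) else {
            throw NarrowingError.argumentOutOfRange(token)
        }
        return narrow
    }
}
```

Using the failable initializer rather than `UInt16(truncatingIfNeeded:)` means a token such as 128001 surfaces as an error at the boundary instead of wrapping around to a different embedding row.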
- Pre-allocate hidden buffer before decode loop (avoids per-token alloc)
- Use &invRms instead of [invRms] for vDSP_vsmul (matches codebase pattern)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Migrate EspressoTrain, EspressoMultiTokenProbe, and EspressoGenerate (CLI.swift + GPT2DemoSupport.swift) from UInt16 to TokenID for all token-semantic variables. Rename validateUInt16Token -> validateToken. Add ANETypes dependency to EspressoGenerate target. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Migrate LocalTextTokenDatasetBuilder, LocalBigramArtifactBuilder, LocalRealArtifactPipeline, and MultitokenProbeSupport from UInt16 to TokenID for all token-semantic variables. writeUInt16Dataset still narrows to UInt16 for on-disk format compatibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… + pre/postRoPE graphs
TokenID migration (UInt16 → UInt32 / TokenID):
- Tests/CPUOpsTests/CPUOpsTests.swift: fix token/target arrays and crossEntropyReference sig
- Tests/EspressoTests/GenerationHarnessHardwareTests.swift: replace all [UInt16] token arrays and scalar UInt16(argmax...) with UInt32; preserve FP16 surface UInt16 uses
- Tests/EspressoTests/GenerationStagedHeadHardwareTests.swift: [UInt16] → [UInt32]
- Tests/RealModelInferenceTests/RealModelInferenceTests.swift: tokens: [UInt32]
- Tests/EspressoGenerateTests/EspressoGenerateTests.swift: add ANETypes import for TokenID
- Sources/Espresso/LocalBigramArtifactBuilder.swift: buildRecurrentWeights/buildFutureSidecar take [TokenID]; build() still takes [UInt16]; cast TokenID→UInt16 at bridge
New features (pre-existing work, now compiling and tested):
- Sources/CPUOps/RoPE.swift: add applyDecodeStep(position:theta:nKVHeads:) for single-token decode with GQA support and configurable theta
- Sources/ModelSupport/MultiModelConfig.swift: add ropeTheta field (default 10000.0)
- Sources/ModelSupport/ModelRegistry.swift: set ropeTheta=500000 on llama3_2_1b/3b
- Sources/ModelSupport/TransformerLayerGraphBuilder.swift: add preRoPEForwardLayer and postRoPEForwardLayer graph builders for hybrid CPU-RoPE + ANE attention path
- Tests/CPUOpsTests/RoPEDecodeStepTests.swift: 4 new tests (decode step parity, position offset, custom theta, GQA)
- Tests/ModelSupportTests/TransformerLayerGraphBuilderLlamaTests.swift: 5 new tests for pre/postRoPE graph structure, output names, and MIL codegen
- Tests/RealModelInferenceTests/HybridLlamaDecodeStepTests.swift: resolveLlamaTopLevelWeightPaths tests (struct roundtrip, real paths, missing-file error, ropeTheta values)
- Sources/EspressoGGUF/GGUFBenchmark.swift, RunGGUF.swift, Sources/EspressoGGUFRunner/main.swift: GGUF benchmark runner target
All 171 non-hardware tests pass. Build clean.
Co-Authored-By: Paperclip <noreply@paperclip.ing>
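For readers unfamiliar with the single-token RoPE step with a configurable theta, a minimal sketch follows. It is not the project's `applyDecodeStep`: the function name `applyRoPE`, the single-head `[Float]` input, and the interleaved pairing of elements (2i, 2i+1) are assumptions, and some implementations pair the two halves of the head instead.

```swift
import Foundation

/// Rotates one attention head in place for a single decode position.
/// Assumption: interleaved pairs (2i, 2i+1) share one rotation angle.
func applyRoPE(_ head: inout [Float], position: Int, theta: Float = 10_000) {
    let headDim = head.count
    precondition(headDim % 2 == 0, "head dimension must be even")
    for i in 0..<(headDim / 2) {
        // Frequency for pair i: theta^(-2i / headDim).
        let frequency = pow(theta, -Float(2 * i) / Float(headDim))
        let angle = Float(position) * frequency
        let cosine = cos(angle)
        let sine = sin(angle)
        let a = head[2 * i]
        let b = head[2 * i + 1]
        // Standard 2D rotation of the pair (a, b) by `angle`.
        head[2 * i] = a * cosine - b * sine
        head[2 * i + 1] = a * sine + b * cosine
    }
}
```

A larger theta, such as the 500000 the registry sets for the Llama 3.2 entries, lowers the frequency of every pair after the first, so the same position produces smaller rotation angles than with the 10000 default. With GQA the same rotation would be applied to each of the `nKVHeads` key heads.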
Finish the remaining TokenID migration across all affected files:
- EspressoTrain/main.swift: add [TokenID] conversion buffers for UInt16 mmap tokens; use them in Embedding.lookup, CrossEntropy, and Embedding.backward
- LocalRealArtifactPipeline: convert [UInt16] dataset to [TokenID] before LocalBigramArtifactBuilder; update promptToken field to TokenID
- LocalBigramArtifactBuilder: migrate build/mostLikelyNextToken/mostLikelyFutureToken and fill helpers to [TokenID: TokenID]
- Tests: update all fake model stubs and test token arrays to TokenID in GenerationHarnessTests, RealArtifactPipelineTests, StreamingTwoTokenTests, EspressoTests, GenerationHarnessHardwareTests, GenerationStagedHeadHardwareTests, CPUOpsTests, EspressoGenerateTests, and RealModelInferenceTests
swift build --build-tests: Build complete (0 errors)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace remaining UInt32 with TokenID in RealModelInferenceTests and EspressoGenerateTests. Add ANETypes dependency to CPUOpsTests, RealModelInferenceTests, and EspressoGenerateTests targets. Sort imports alphabetically in RealModelInferenceTests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Code review
Found 4 issues:
The
The Llama path correctly gates ANE classifier compilation on
Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎.
… on classifierStrategy
Three code-review fixes:
1. applyRoPEHook: replace boolean-flag error swallowing with direct
error propagation — SurfaceIO failures now include the original
error description in the ANEError.invalidArguments message.
2. Llama generation EOS: add optional `eosToken: TokenID?` to
MultiModelConfig, wire it into the Llama decode loop so generation
stops on the model's EOS token. Llama 3.2 1B/3B registry entries
set eosToken=128001.
3. GPT-2 ensureHybridCompiled: wrap greedy norm+classifier compilation
in `if classifierStrategy == .ane { }` (matching the Llama path),
and add the same guard to `useANEGreedyHead` so the CPU-tiled
strategy is never bypassed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
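The EOS handling from fix 2 above can be sketched as a greedy loop with an optional stop token. Everything here other than the `eosToken: TokenID?` shape is hypothetical: the function name, the `nextToken` closure standing in for the model step, and the choice to exclude the EOS token from the returned sequence are assumptions.

```swift
typealias TokenID = UInt32

/// Greedy generation that stops early when the optional EOS token appears.
/// `nextToken` is a stand-in for one model decode step.
func generateGreedy(prompt: [TokenID],
                    maxNewTokens: Int,
                    eosToken: TokenID?,
                    nextToken: ([TokenID]) -> TokenID) -> [TokenID] {
    precondition(maxNewTokens >= 0, "maxNewTokens must be non-negative")
    var tokens = prompt
    for _ in 0..<maxNewTokens {
        let token = nextToken(tokens)
        // A nil eosToken disables early stopping entirely.
        if let eos = eosToken, token == eos { break }
        tokens.append(token)
    }
    return tokens
}
```

Keeping the field optional lets registry entries without a configured EOS token run to `maxNewTokens`, while the Llama 3.2 entries would pass 128001.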
Reads `{arch}.rope.freq_base` from GGUF metadata and passes it to
MultiModelConfig. Falls back to 10,000.0 if not present. Fixes silent
wrong positional encoding for Llama 3.2 models loaded via GGUF.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
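The metadata lookup with a fallback, as described in the commit above, can be sketched as follows. The key pattern `{arch}.rope.freq_base` and the 10,000.0 default come from the commit message. The function name and the flat `[String: Float]` dictionary are simplifying assumptions, since real GGUF metadata carries typed values.

```swift
/// Looks up the RoPE base frequency for an architecture, with a default.
/// `metadata` is a simplified stand-in for parsed GGUF key/value pairs.
func ropeTheta(architecture: String,
               metadata: [String: Float],
               defaultTheta: Float = 10_000.0) -> Float {
    // Key follows the "{arch}.rope.freq_base" pattern.
    metadata["\(architecture).rope.freq_base"] ?? defaultTheta
}
```

Reading the value per architecture matters because a missing or ignored key silently falls back to 10,000.0, which produces wrong positional encoding for models that use a different base.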
Summary
- `ClassifierStrategy` enum that selects between ANE and CPU-tiled classifier paths based on model vocab/dModel size
- `ensureHybridCompiledLlama`: skips compile for large-vocab models (saves 1 ANE compile)
- `FP16TiledClassifier.tiledMatvecArgmax` with CPU RMSNorm
Test plan
- `ClassifierStrategyTests`: 7 tests (selection logic + boundary + FP16TiledClassifier correctness)
🤖 Generated with Claude Code