feat: add ClassifierStrategy to gate ANE vs CPU-tiled classifier for large-vocab models by christopherkarani · Pull Request #16 · christopherkarani/Espresso

christopherkarani · 2026-03-17T09:24:13Z

Summary

- Add a ClassifierStrategy enum that selects between the ANE and CPU-tiled classifier paths based on the model's vocab × dModel size
- Gate ANE classifier compilation in ensureHybridCompiledLlama: the compile is skipped for large-vocab models (saves 1 ANE compile)
- Add a CPU-tiled greedy decode path using FP16TiledClassifier.tiledMatvecArgmax with CPU RMSNorm
- Pre-convert lmHead weights to FP16 at build time (no per-token conversion overhead)
- Complete the UInt16 → TokenID migration across the codebase

Test plan

- ClassifierStrategyTests: 7 tests (selection logic + boundary + FP16TiledClassifier correctness)
- Full test suite: 208/210 passed (2 pre-existing MigrationParityTests failures)
- Hardware verification: TinyLlama (32K vocab) no longer crashes with statusType=0x9

🤖 Generated with Claude Code

Feat/ane multitoken

Rewrite README to lead with numbers and architecture instead of hedged academic prose. Round benchmarks to 2 decimal places, merge repeated disclaimers into one caveats paragraph, cut 126→93 lines. All factual claims and links preserved. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…s-1773267371105 Add Claude Code GitHub Workflow

runCommittedSingleToken (used during prefill) was calling prepareActivationPair which computed output heads for both pair0 and pair1, but pair1's result was always discarded. For the identityZeroTrunk backend, fast-path directly to a single selectTokenFromActivation call. Also adds TTFT measurement (ttft_ms, ttft_cold_ms) to benchmark JSON output, benchmark script summary, and autoresearch log. Measured: two_step_ttft_ms 2.133 → ~1.0 ms (-53%), parity=match. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…esent

When hasFutureProposer is false, futureClassifier/futureRMS/futureNorm/futureLogits are never read (they are guarded by the hasFutureProposer check in selectProposedFutureToken). Allocating count: 0 instead of count: vocabSize*dim removes ~49MB of zeroed memory from init. Two-step cold TTFT improved from ~19.4ms to ~11.7ms (-40%). Two-step init is now faster than the control model init.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace the scalar argmax loop in selectToken with Accelerate's vDSP_maxvi for SIMD-vectorized max-with-index over 32k logits. Also remove the redundant stepLogits.zero() before sgemm(beta=0) in the two-step CPU output head path. Warm TTFT improved from ~1.0ms to ~0.88ms (-12%) for both control and two-step models (shared selectToken function). Parity=match. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
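As a rough illustration of the change described above, the following sketch wraps Accelerate's `vDSP_maxvi` in a small argmax helper and falls back to a scalar loop where Accelerate is unavailable. The function name `argmax` and the fallback are assumptions made for this example; the project's actual selectToken is not shown in this PR page.

```swift
#if canImport(Accelerate)
import Accelerate
#endif

/// Returns the index of the largest logit.
/// Hypothetical helper: only the use of vDSP_maxvi comes from the commit message.
func argmax(_ logits: [Float]) -> Int {
    precondition(!logits.isEmpty, "argmax needs at least one logit")
    #if canImport(Accelerate)
    var maxValue: Float = 0
    var maxIndex: vDSP_Length = 0
    // With a stride of 1 the returned index is the element index.
    vDSP_maxvi(logits, 1, &maxValue, &maxIndex, vDSP_Length(logits.count))
    return Int(maxIndex)
    #else
    // Scalar reference path, equivalent to the loop the commit replaces.
    var best = 0
    for i in 1..<logits.count where logits[i] > logits[best] {
        best = i
    }
    return best
    #endif
}
```

Both branches pick a maximum element; keeping the scalar path around is also a convenient way to check parity after a change like this.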

… aggregate throughput Pack N concurrent streams into N spatial lanes of shared ANE kernels. ANE processes all 32 lanes per eval regardless, so N streams cost the same as 1 eval. This converts 3N dispatches per token step to just 3, yielding 2498 TPS@6streams vs 1206 TPS concurrent baseline (+107%). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… 5381 TPS Scale the spatial-lane batched benchmark from 6 to 32 lanes. ANE eval cost is constant regardless of active lanes; only per-stream CPU work (embedding write + argmax read) grows linearly. At 32 streams: 5381 aggregate TPS (4.4x over concurrent baseline, 2.15x over 6-stream batched), per-stream 168 TPS, round latency 48ms. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…t 32 streams

Add ane_interop_io_write_embedding_batch_fp16() and ane_interop_io_argmax_batch_fp16_spatial(): bulk I/O functions that process all N spatial lanes under a single IOSurface lock/unlock pair. The fused embedding write also eliminates the intermediate FP32 buffer by reading directly from the embedding table into the surface.

Before: 5381 TPS@32 streams (512 lock/unlock ops per round)
After: 9572 TPS@32 streams (4 lock/unlock ops per round, +78%)

Round latency at 32 streams dropped from 47.6ms to 26.7ms.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…old TTFT) Add TensorBuffer(nonOwningViewOf:) init that wraps existing memory without allocation. Model inits gain shareReadOnlyWeights parameter; probe CLI gains --share-weights flag. Cold TTFT drops from 11.1ms to 1.98ms (control) and 9.9ms to 1.85ms (two-step). Warm TTFT unaffected; parity match confirmed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

No measurable TTFT impact at 3KB per copy, but cleaner and more idiomatic. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…t 32 streams

Replace scalar 8-way unrolled per-stream argmax with ARM NEON branchless channel-major scan using float16x8_t + vcgtq_f16 + vbslq conditional select. Processes all 32 spatial lanes in 4 NEON vectors per channel row (1 cache line sequential read).

Release-mode benchmark at 32 streams:
Before: 11,537 agg TPS (argmax ~580 µs)
After: 18,110 agg TPS (argmax ~46 µs)

Includes per-component timing diagnostic test for future optimization instrumentation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
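The branchless channel-major scan can be sketched with Swift's portable SIMD types. This is not the NEON code from the commit: it uses `SIMD8<Float>` instead of `float16x8_t`, handles a single group of 8 lanes, and the function name and data layout (`logits[channel * 8 + lane]`) are assumptions for illustration. The comparison mask plays the role of `vcgtq_f16` and `replace(with:where:)` the role of `vbslq`.

```swift
/// Per-lane argmax over a channel-major buffer with 8 spatial lanes.
/// Hypothetical sketch: layout is logits[channel * 8 + lane].
func channelMajorArgmax(_ logits: [Float], channels: Int) -> [Int32] {
    let lanes = 8
    precondition(logits.count == channels * lanes, "unexpected buffer size")
    var best = SIMD8<Float>(repeating: -.infinity)
    var bestIndex = SIMD8<Int32>(repeating: 0)
    for channel in 0..<channels {
        // One sequential read per channel row covers all 8 lanes.
        let row = SIMD8<Float>(logits[(channel * lanes)..<((channel + 1) * lanes)])
        // Lane-wise compare, then conditional select; no per-lane branch.
        let isGreater = row .> best
        best.replace(with: row, where: isGreater)
        bestIndex.replace(with: SIMD8<Int32>(repeating: Int32(channel)), where: isGreater)
    }
    return (0..<lanes).map { bestIndex[$0] }
}
```

Because the comparison is strict, ties resolve to the lowest channel index in each lane.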

Replace scalar element-by-element scan in tensorBufferIsAllZero with vDSP_maxmgv which uses SIMD to find max absolute value in one pass. Scanning 9MB of layer weights drops from ~0.65ms to ~0.05ms. Cold TTFT 1.98ms → 1.16ms (control), 1.85ms → 1.06ms (two-step). Parity match. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

All attempted warm-path accelerations are slower than baseline: vDSP_mmul (1.05x), INT8 tiled (2.5x), streaming dot+argmax (1.7x), tiled FP16 conversion (8.6x). The 98MB FP32 classifier at ~100 GB/s yields a ~0.88ms hard minimum. Measured warm TTFT ~0.93ms is within 6% of theoretical floor. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Move recurrentWeightsUseIdentityZeroTrunk check to #if DEBUG — the caller already declares .identityZeroTrunk intent; the vDSP_maxmgv scan over 6×4×2.25MB was the last measurable init overhead. Init: 0.17ms → 0.01ms. Cold TTFT ≈ warm TTFT now. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Final 5-repeat median (warmup=3, iter=10):
Control warm: 0.941ms (baseline 0.993ms, -5%)
Two-step warm: 0.930ms (baseline 2.133ms, -56%)
Control cold: 0.972ms (baseline 18.784ms, -95%)
Two-step cold: 0.944ms (baseline 46.812ms, -98%)

Warm TTFT within 6% of 0.88ms theoretical floor (98MB FP32 classifier at ~100 GB/s DRAM bandwidth). Cold ≈ warm.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…r 53K TPS

Generalize NEON argmax from hardcoded spatial=32 to any multiple of 8 up to 128. Extend multistream benchmark to sweep [1..128] streams with power-of-2 laneSpatial (32, 64, 128). ANE amortizes weight loading across all spatial lanes: doubling lanes adds only 8-21% per-step overhead.

Release-mode benchmark results:
32 streams (lane=32): 16,275 agg TPS
64 streams (lane=64): 31,337 agg TPS (+93%)
128 streams (lane=128): 53,496 agg TPS (+228%)

Per-step latency: 1.97ms@32 → 2.04ms@64 → 2.39ms@128

Matched 6-stream benchmark: PASS (2769 TPS batched ANE)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…untime-cache Improve GPT-2 speculative decode throughput and ANE cache reuse

- Add llama.cpp comparison row to benchmark table with Metal/CPU baselines
- Add platform compatibility matrix covering M1–M4 SoCs
- Add SPM integration section with 5-line first-inference example
- Add release badge
- Clarify iOS/tvOS entitlement situation in platform matrix
- Tighten quick-start: git clone + 3-line TUI launch first

Co-Authored-By: Paperclip <noreply@paperclip.ing>

…plates

- Add .github/workflows/ci.yml: build + test matrix on Xcode 16.2 and 16.3 (macos-15, SPM cache, unit tests only; no ANE runner required)
- Add CONTRIBUTING.md: dev setup, project structure, coding standards, TDD guide
- Add .github/ISSUE_TEMPLATE/bug_report.md and feature_request.md
- Add .github/PULL_REQUEST_TEMPLATE.md with benchmark impact section
- Update README badges: CI badge alongside existing ANE matrix badge

Co-Authored-By: Paperclip <noreply@paperclip.ing>

- Target list with 10 Swift/ML community leaders (Panaro, Hollance, Maderix, HF team, MLX, Paul Hudson, Sean Allen)
- 5 personalized ready-to-send outreach messages
- Partnership research for Apple Silicon benchmark projects (ANEMLL, more-ane-transformers, neural-engine)
- Conference talk proposals for Deep Dish Swift, try! Swift Tokyo, WWDC Labs, SwiftConf
- Three talk formats: 40-min technical, 20-min intro, 10-min lightning demo

Related: ESP-10, ESP-11, ESP-12

Co-Authored-By: Paperclip <noreply@paperclip.ing>

…rainingLoop

- Examples/SimpleInference: ~20-line GPT-2 generation using RealModelInferenceEngine.build()
- Examples/BenchmarkSuite: Espresso vs CoreML comparison via espresso bench
- Examples/TrainingLoop: fine-tuning wrapper over espresso-train CLI
- Examples/README.md: setup guide with env vars and local-path override

Each example is a standalone Swift package (macOS 15+, Swift 6.2).

Co-Authored-By: Paperclip <noreply@paperclip.ing>

- ModelRegistry: add llama3_2_1b (16L/32H/8KVH/2048d/8192h) and llama3_2_3b (28L/24H/8KVH/3072d/8192h). Both use .llama architecture (SwiGLU, RMSNorm, GQA). Offline converter handles GQA head expansion and RoPE rotation baking into Wq/Wk weights.
- Tests: add llama3_2_1bConfigIsCorrect, llama3_2_3bConfigIsCorrect, update registryContainsAllSixModels. All 17 ModelSupportTests pass.
- Benchmark dashboard: benchmarks/results/latest.json (3.41x over CoreML, 519 tok/s on M3 Max); scripts/generate-benchmark-dashboard.sh regenerates docs/benchmarks.md from JSON; .github/workflows/benchmark-dashboard.yml auto-triggers on JSON changes.
- .gitignore: allow benchmarks/results/latest.json, docs/benchmarks.md, scripts/generate-benchmark-dashboard.sh.

Co-Authored-By: Paperclip <noreply@paperclip.ing>

- coreml-vs-espresso-benchmarks.html: benchmark comparison with data tables, visual bar chart, architecture explanation, and M1/M2/M3/M4 projections
- gpt2-926-tokens-per-second.html: step-by-step guide covering the 4.76x win: direct ANE access, 3-layer fusion, recurrent arch, zero-copy argmax
- reverse-engineering-apple-neural-engine.html: internals deep-dive covering dlopen bridge, MIL ops, IOSurface memory model, and confirmed dead ends
- blog.html: blog index listing all posts with summaries and tags
- docs/index.html: add Blog nav link + "From the Blog" section for SEO
- .gitignore: allow docs HTML files for GitHub Pages

Targets keywords: "CoreML alternative", "apple neural engine framework", "swift ml inference", "GPT-2 apple silicon", "ANE reverse engineering"

Co-Authored-By: Paperclip <noreply@paperclip.ing>

…NE inference

Adds EspressoGGUF target that bridges EdgeRunner's GGUF loader into Espresso's weight format. GGUFModelLoader.prepare() loads a GGUF file, dequantizes via Metal, transposes per architecture convention, wraps in BLOBFILE format, and writes to a temp directory compatible with RealModelInferenceEngine.build().

- Bumps platform to macOS 26 (required by EdgeRunner/Metal 4)
- Bumps swift-tools-version to 6.2
- EdgeRunner added as local package dependency

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Introduces ClassifierStrategy with a 16M-element SRAM threshold (32MB fp16). Models with vocab*dModel <= 16M use the ANE lane-packed classifier head; larger models (Stories110M, TinyLlama, Qwen3 0.6B) fall back to FP16TiledClassifier on CPU. Adds 5 Swift Testing tests covering strategy selection for small/large/huge vocabs and CPU-tiled argmax correctness. Also adds Espresso dependency to RealModelInferenceTests for FP16TiledClassifier access. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
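The selection rule in this commit can be sketched as below. The case names `.ane` and `.cpuTiled` and the `select` entry point appear elsewhere in this PR, but the parameter labels and the reading of "16M elements" as 16 × 1024 × 1024 are assumptions made for illustration.

```swift
/// Hypothetical sketch of the vocab*dModel gate described in the commit.
enum ClassifierStrategy: Equatable {
    case ane       // ANE lane-packed classifier head
    case cpuTiled  // FP16TiledClassifier on the CPU

    /// Assumed interpretation of the 16M-element limit (32 MiB of fp16).
    static let sramElementLimit = 16 * 1024 * 1024

    /// Parameter labels are illustrative, not taken from the source.
    static func select(vocabSize: Int, dModel: Int) -> ClassifierStrategy {
        let elements = vocabSize * dModel
        // At or below the limit the classifier is assumed to fit on the ANE.
        return elements <= sramElementLimit ? .ane : .cpuTiled
    }
}
```

Under these assumptions a 32,000 × 768 head (about 24.6M elements) and the GPT-2 124M head (50,257 × 768, about 38.6M elements) both land on the CPU-tiled path, which matches the models named in the commit and the later code review.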

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Remove unused `import Espresso` from ClassifierStrategy.swift
- Add exactThresholdSelectsANE and oneOverThresholdSelectsCPU tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…PU classifier gating

Integrate ClassifierStrategy into the Llama decode path to gate between ANE lane-packed classifier and CPU-tiled FP16 classifier based on vocab*dim SRAM fit.

Changes:
- Add lmHeadFP16 field to LlamaTopLevelAssets (pre-converted FP16 weights)
- Add classifierStrategy stored property, initialized via ClassifierStrategy.select()
- Gate ANE greedy norm+classifier compile behind classifierStrategy == .ane
- Add CPU-tiled greedy head branch in generateIncrementalHybridLlama decode loop (surface read -> CPU RMSNorm -> FP16TiledClassifier.tiledMatvecArgmax)
- Skip xCur readback when using either ANE or CPU-tiled greedy head

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
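The CPU-tiled greedy head (RMSNorm followed by a fused matvec and argmax) can be sketched in plain Swift. This is a simplified stand-in, not the project's FP16TiledClassifier: it uses `Float` rather than FP16 weights, scalar loops rather than vectorized kernels, and the signatures, tile size, and epsilon are assumptions.

```swift
/// CPU RMSNorm: x * weight / sqrt(mean(x^2) + eps).
func rmsNorm(_ x: [Float], weight: [Float], eps: Float = 1e-5) -> [Float] {
    precondition(x.count == weight.count && !x.isEmpty)
    let meanSquare = x.reduce(Float(0)) { $0 + $1 * $1 } / Float(x.count)
    let invRms = 1 / (meanSquare + eps).squareRoot()
    return zip(x, weight).map { $0 * invRms * $1 }
}

/// Row-major [vocab x dim] matvec fused with argmax, one tile of rows at a
/// time, so the full logits vector is never stored. Hypothetical signature.
func tiledMatvecArgmax(weights: [Float], hidden: [Float], vocab: Int, tileRows: Int = 256) -> Int {
    let dim = hidden.count
    precondition(weights.count == vocab * dim && tileRows > 0)
    var bestIndex = 0
    var bestLogit = -Float.infinity
    var tileStart = 0
    while tileStart < vocab {
        let tileEnd = min(tileStart + tileRows, vocab)
        for row in tileStart..<tileEnd {
            var dot: Float = 0
            let base = row * dim
            for d in 0..<dim {
                dot += weights[base + d] * hidden[d]
            }
            // Keep only the running maximum across tiles.
            if dot > bestLogit {
                bestLogit = dot
                bestIndex = row
            }
        }
        tileStart = tileEnd
    }
    return bestIndex
}
```

In a decode loop of this shape, the hidden state read back from the surface would pass through `rmsNorm` and then `tiledMatvecArgmax` to yield the next token index.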

Migrate all token-semantic UInt16 to TokenID (UInt32) in CPURecurrentGenerationModel and FutureTokenProposingLanguageModel protocol. Includes OfflineExactAcceptanceEvaluator trace types. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…guard Migrate all ~40 token-semantic UInt16 occurrences to TokenID (UInt32) in RealModelInferenceEngine: GenerationResult, GenerationStep, encodePrompt, sampleToken, selectGreedyToken, all hybrid/speculative generation methods, and testing helpers. Remove the UInt16 vocab capacity guard that was the original motivation for this migration. Preserved: lmHeadFP16: [UInt16] (fp16 weight data, not token IDs). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…nversion Change writeEmbeddingBatchFP16 tokenIDs parameter from UnsafePointer<UInt16> to UnsafePointer<TokenID>. At the C interop boundary, narrow each TokenID to UInt16 with an exact check, throwing argumentOutOfRange if the token exceeds the ANE embedding surface capacity. Preserved: fp16 channel capacity guards (UInt16.max) and all MemoryLayout<UInt16> references remain unchanged. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
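The exact-check narrowing at the interop boundary might look like the sketch below. `TokenID` as `UInt32` follows the migration commits in this PR; the error type and function name are hypothetical stand-ins, since the project's real error enum is not shown here.

```swift
typealias TokenID = UInt32

/// Hypothetical stand-in for the project's argumentOutOfRange error.
enum TokenNarrowingError: Error, Equatable {
    case argumentOutOfRange(TokenID)
}

/// Narrows each token to UInt16, throwing instead of silently truncating.
func narrowForEmbeddingSurface(_ tokens: [TokenID]) throws -> [UInt16] {
    try tokens.map { token -> UInt16 in
        // UInt16(exactly:) returns nil when the value does not fit.
        guard let narrow = UInt16(exactly: token) else {
            throw TokenNarrowingError.argumentOutOfRange(token)
        }
        return narrow
    }
}
```

Using `UInt16(exactly:)` rather than `UInt16(truncatingIfNeeded:)` means an out-of-range token surfaces as an error rather than as a wrong embedding row.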

- Pre-allocate hidden buffer before decode loop (avoids per-token alloc)
- Use &invRms instead of [invRms] for vDSP_vsmul (matches codebase pattern)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Migrate EspressoTrain, EspressoMultiTokenProbe, and EspressoGenerate (CLI.swift + GPT2DemoSupport.swift) from UInt16 to TokenID for all token-semantic variables. Rename validateUInt16Token -> validateToken. Add ANETypes dependency to EspressoGenerate target. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Migrate LocalTextTokenDatasetBuilder, LocalBigramArtifactBuilder, LocalRealArtifactPipeline, and MultitokenProbeSupport from UInt16 to TokenID for all token-semantic variables. writeUInt16Dataset still narrows to UInt16 for on-disk format compatibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

… + pre/postRoPE graphs

TokenID migration (UInt16 → UInt32 / TokenID):
- Tests/CPUOpsTests/CPUOpsTests.swift: fix token/target arrays and crossEntropyReference sig
- Tests/EspressoTests/GenerationHarnessHardwareTests.swift: replace all [UInt16] token arrays and scalar UInt16(argmax...) with UInt32; preserve FP16 surface UInt16 uses
- Tests/EspressoTests/GenerationStagedHeadHardwareTests.swift: [UInt16] → [UInt32]
- Tests/RealModelInferenceTests/RealModelInferenceTests.swift: tokens: [UInt32]
- Tests/EspressoGenerateTests/EspressoGenerateTests.swift: add ANETypes import for TokenID
- Sources/Espresso/LocalBigramArtifactBuilder.swift: buildRecurrentWeights/buildFutureSidecar take [TokenID]; build() still takes [UInt16]; cast TokenID→UInt16 at bridge

New features (pre-existing work, now compiling and tested):
- Sources/CPUOps/RoPE.swift: add applyDecodeStep(position:theta:nKVHeads:) for single-token decode with GQA support and configurable theta
- Sources/ModelSupport/MultiModelConfig.swift: add ropeTheta field (default 10000.0)
- Sources/ModelSupport/ModelRegistry.swift: set ropeTheta=500000 on llama3_2_1b/3b
- Sources/ModelSupport/TransformerLayerGraphBuilder.swift: add preRoPEForwardLayer and postRoPEForwardLayer graph builders for hybrid CPU-RoPE + ANE attention path
- Tests/CPUOpsTests/RoPEDecodeStepTests.swift: 4 new tests (decode step parity, position offset, custom theta, GQA)
- Tests/ModelSupportTests/TransformerLayerGraphBuilderLlamaTests.swift: 5 new tests for pre/postRoPE graph structure, output names, and MIL codegen
- Tests/RealModelInferenceTests/HybridLlamaDecodeStepTests.swift: resolveLlamaTopLevelWeightPaths tests (struct roundtrip, real paths, missing-file error, ropeTheta values)
- Sources/EspressoGGUF/GGUFBenchmark.swift, RunGGUF.swift, Sources/EspressoGGUFRunner/main.swift: GGUF benchmark runner target

All 171 non-hardware tests pass. Build clean.

Co-Authored-By: Paperclip <noreply@paperclip.ing>

Finish the remaining TokenID migration across all affected files:
- EspressoTrain/main.swift: add [TokenID] conversion buffers for UInt16 mmap tokens; use them in Embedding.lookup, CrossEntropy, and Embedding.backward
- LocalRealArtifactPipeline: convert [UInt16] dataset to [TokenID] before LocalBigramArtifactBuilder; update promptToken field to TokenID
- LocalBigramArtifactBuilder: migrate build/mostLikelyNextToken/mostLikelyFutureToken and fill helpers to [TokenID: TokenID]
- Tests: update all fake model stubs and test token arrays to TokenID in GenerationHarnessTests, RealArtifactPipelineTests, StreamingTwoTokenTests, EspressoTests, GenerationHarnessHardwareTests, GenerationStagedHeadHardwareTests, CPUOpsTests, EspressoGenerateTests, and RealModelInferenceTests

swift build --build-tests: Build complete (0 errors)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace remaining UInt32 with TokenID in RealModelInferenceTests and EspressoGenerateTests. Add ANETypes dependency to CPUOpsTests, RealModelInferenceTests, and EspressoGenerateTests targets. Sort imports alphabetically in RealModelInferenceTests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

christopherkarani · 2026-03-17T09:42:15Z

Code review

Found 4 issues:

RoPE hook silently drops error context (CLAUDE.md says "ALWAYS handle errors comprehensively... Never silently swallow errors")

The applyRoPEHook closure catches SurfaceIO errors, discards them into a readOK = false boolean, then throws a generic invalidArguments("RoPE hook surface read failed") with no underlying error details. Every other error handler in this file preserves context via "\(error)" interpolation. This is the only call site that swallows it, making hardware debugging significantly harder.

https://github.com/christopherkarani/Espresso/blob/28142453e779b7813d29eb2cfb25b395f2f7bdfe/Sources/RealModelInference/RealModelInferenceEngine.swift#L2338-L2345

Llama generation never stops on EOS token (comment at line 2527 acknowledges this but does not implement it)

generateIncrementalHybridLlama only stops when effectiveMaxTokens or maxSeq is reached. The comment says "Llama EOS varies by model -- use vocab-1 as a safe sentinel or check config" but no check is implemented. The GPT-2 path has explicit if nextToken == Self.gpt2EOSToken { break } guards. Generation will produce garbage output past the model's natural stop point.

https://github.com/christopherkarani/Espresso/blob/28142453e779b7813d29eb2cfb25b395f2f7bdfe/Sources/RealModelInference/RealModelInferenceEngine.swift#L2525-L2530

GPT-2 greedy path does not check classifierStrategy, will attempt ANE classifier for large-vocab GPT-2

The Llama path correctly gates ANE classifier compilation on classifierStrategy == .ane (line 1659) and adds it to the useANEGreedyHead check (line 2299). The GPT-2 path in ensureHybridCompiled (line 1544) unconditionally compiles the ANE classifier, and generateIncrementalHybrid (line 1739) does not check classifierStrategy. GPT-2-124M has vocab=50257, dModel=768 (38.6M elements > 16M limit), so classifierStrategy would be .cpuTiled, yet the GPT-2 path ignores this and attempts to use the ANE classifier anyway.

https://github.com/christopherkarani/Espresso/blob/28142453e779b7813d29eb2cfb25b395f2f7bdfe/Sources/RealModelInference/RealModelInferenceEngine.swift#L1737-L1743

GGUFModelLoader does not read ropeTheta from GGUF metadata, silently defaults to 10,000

GGUFModelLoader.prepare() constructs MultiModelConfig without setting ropeTheta, so it defaults to 10_000.0. The same PR registers Llama 3.2 models with ropeTheta: 500_000.0 in ModelRegistry. A GGUF-loaded Llama 3.2 model will silently use the wrong RoPE base frequency, producing incorrect positional encodings.

https://github.com/christopherkarani/Espresso/blob/28142453e779b7813d29eb2cfb25b395f2f7bdfe/Sources/EspressoGGUF/GGUFModelLoader.swift#L52-L66
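For the first issue in this review (the RoPE hook dropping error context), a context-preserving pattern could look like the sketch below. The `ANEError.invalidArguments` case name comes from the review text; its payload shape, the `SurfaceReadFailure` type, and `readSurface` are hypothetical stand-ins for the real SurfaceIO call.

```swift
/// Shape assumed from the review text: a case carrying a message string.
enum ANEError: Error {
    case invalidArguments(String)
}

/// Hypothetical low-level failure standing in for a SurfaceIO error.
struct SurfaceReadFailure: Error {
    let code: Int
}

/// Hypothetical surface read that can fail.
func readSurface(shouldFail: Bool) throws -> [Float] {
    if shouldFail { throw SurfaceReadFailure(code: 9) }
    return [1, 2, 3]
}

/// Wraps the read and keeps the underlying error in the message,
/// instead of collapsing it into a boolean flag.
func applyRoPEHook(shouldFail: Bool) throws -> [Float] {
    do {
        return try readSurface(shouldFail: shouldFail)
    } catch {
        throw ANEError.invalidArguments("RoPE hook surface read failed: \(error)")
    }
}
```

The interpolated `\(error)` mirrors the convention the review says the rest of the file already follows.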

Generated with Claude Code


… on classifierStrategy

Three code-review fixes:

1. applyRoPEHook: replace boolean-flag error swallowing with direct error propagation; SurfaceIO failures now include the original error description in the ANEError.invalidArguments message.
2. Llama generation EOS: add optional `eosToken: TokenID?` to MultiModelConfig, wire it into the Llama decode loop so generation stops on the model's EOS token. Llama 3.2 1B/3B registry entries set eosToken=128001.
3. GPT-2 ensureHybridCompiled: wrap greedy norm+classifier compilation in `if classifierStrategy == .ane { }` (matching the Llama path), and add the same guard to `useANEGreedyHead` so the CPU-tiled strategy is never bypassed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
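The EOS fix (item 2 above) amounts to an optional stop condition in the decode loop. The sketch below is a minimal illustration under assumptions: `greedyDecode` and its closure parameter are hypothetical, and whether the real loop records the EOS token before stopping is not stated in the commit, so this version simply omits it.

```swift
typealias TokenID = UInt32

/// Hypothetical decode loop: stops at maxTokens or at the optional EOS token.
func greedyDecode(
    maxTokens: Int,
    eosToken: TokenID?,
    nextToken: (Int) -> TokenID
) -> [TokenID] {
    var output: [TokenID] = []
    for step in 0..<maxTokens {
        let token = nextToken(step)
        // A nil eosToken keeps the old behaviour: run until maxTokens.
        if let eos = eosToken, token == eos {
            break
        }
        output.append(token)
    }
    return output
}
```

Keeping `eosToken` optional means configurations that do not set it are unaffected, while the Llama 3.2 entries can pass 128001.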

Reads `{arch}.rope.freq_base` from GGUF metadata and passes it to MultiModelConfig. Falls back to 10,000.0 if not present. Fixes silent wrong positional encoding for Llama 3.2 models loaded via GGUF. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
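The metadata lookup with fallback can be sketched as follows. The key pattern `{arch}.rope.freq_base` and the 10,000.0 default are from the commit message; the dictionary-based metadata representation and the function name are assumptions, since the real loader reads typed GGUF metadata through EdgeRunner.

```swift
/// Hypothetical helper: metadata is modelled as a flat [String: Float] map.
func ropeTheta(architecture: String, metadata: [String: Float]) -> Float {
    // Fall back to the conventional base of 10,000 when the key is absent.
    metadata["\(architecture).rope.freq_base"] ?? 10_000.0
}
```

With this shape, a Llama 3.2 file that carries `llama.rope.freq_base = 500000` resolves to 500,000, while files without the key keep the previous default.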

christopherkarani and others added 30 commits March 11, 2026 07:08

Merge pull request #1 from christopherkarani/feat/ane-multitoken

420c754

Narrow launch copy and fix public license

db0953b

"Claude PR Assistant workflow"

91393d6

"Claude Code Review workflow"

0930b5e

Merge pull request #2 from christopherkarani/add-claude-github-action…

dc4bfef

feat(bench): align espresso-bench target dependencies

cc98f10

feat(bench): add benchmark statistics and flops helpers

557829f

feat(bench): add locale-stable reporting and thermal sampling

8aee06b

feat(bench): wire direct ane benchmark runner

03cbdb4

feat(bench): refine coreml baseline loading

f845319

feat(bench): simplify espresso-bench cli flow

455da7e

feat(bench): add coreml generation and power scripts

ea96759

feat(bench): finalize benchmark task tracking

6f732d8

Review HEAD vs main changes

eb70e78

Replace scalar embedding copy loops with memcpy (code cleanup)

cdd224f

Add autoresearch artifacts to .gitignore

8d6370c

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

christopherkarani and others added 26 commits March 16, 2026 20:13

Reuse speculative runtimes and seed ANE cache donors

56cf8b9

Fix speculative runtime review findings

891ae4d

Merge pull request #15 from christopherkarani/perf/gpt2-speculative-r…

7b124dd

refactor: migrate GenerationHarness protocols from UInt16 to TokenID

ff06eb6

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

refactor: Embedding and CrossEntropy use TokenID (UInt32)

ba1ce9d

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: remove unused import, add boundary tests for ClassifierStrategy

d0c4fdf

refactor: Embedding lookup/backward use TokenID (UInt32)

14063c2

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

refactor: argmax and output head use TokenID

dbe6fec

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

christopherkarani and others added 2 commits March 17, 2026 13:47

christopherkarani force-pushed the main branch from 7b124dd to e1f09ad Compare March 26, 2026 01:36

christopherkarani wants to merge 163 commits into main from feat/classifier-strategy
