Commit 74ca1f1

jgmelber and claude committed
Multi-column benchmark: 14.7s (no speedup — Python overhead dominates)
Multi-column correctly compiles and runs (5 detections) but doesn't reduce forward time because the bottleneck is NOT kernel compute (already ~1ms per layer with vectorization). The bottleneck is Python-side dequant/SiLU/requant between layers + buffer management.

To reach 60 FPS, ALL inter-layer operations must move into the NPU:

1. Integrate vectorized SiLU kernel (already written, not yet wired)
2. Use xrt::runlist for all-at-once execution (no Python between)
3. Pre-stage all buffers, let DMA sequence handle data flow
4. Dataflow chaining for on-chip activation passing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
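For illustration, the host-side dequant/SiLU/requant step named above might look roughly like the sketch below. This is not the repository's code: the affine int8 scheme (a scale plus a zero point), the function names, and the parameter names are all assumptions made for the example. It only shows the kind of per-element work that runs in Python between layers and that the four steps aim to move onto the NPU.

```python
import math

# Hypothetical host-side inter-layer step, assuming affine int8 quantization.
# None of these names come from the repository.

def dequantize(q_values, scale, zero_point):
    """Map int8 values to floats: x = (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in q_values]

def silu(x_values):
    """SiLU activation: x * sigmoid(x) = x / (1 + exp(-x)).

    Inputs are assumed to stay in a modest range (as dequantized int8
    activations would), so math.exp does not overflow.
    """
    return [x / (1.0 + math.exp(-x)) for x in x_values]

def requantize(x_values, scale, zero_point):
    """Map floats back to int8, rounding and clamping to [-128, 127]."""
    out = []
    for x in x_values:
        q = round(x / scale) + zero_point
        out.append(max(-128, min(127, q)))
    return out

def host_inter_layer_step(q_in, in_scale, in_zp, out_scale, out_zp):
    """One dequant -> SiLU -> requant pass, done entirely on the host."""
    return requantize(silu(dequantize(q_in, in_scale, in_zp)), out_scale, out_zp)
```

Each call walks the whole activation tensor on the host several times, so this cost repeats at every layer boundary regardless of how fast the kernel itself runs.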
1 parent d55c26b commit 74ca1f1

0 file changed
