Commit 74ca1f1
Multi-column benchmark: 14.7s (no speedup — Python overhead dominates)
The multi-column variant compiles and runs correctly (5 detections)
but does not reduce forward time, because the bottleneck is NOT kernel
compute (already ~1ms per layer with vectorization). The bottleneck is
Python-side dequant/SiLU/requant between layers plus buffer management.
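For reference, the Python-side work between two layers looks roughly like the sketch below. This is a minimal NumPy illustration, not the code from this repository: it assumes int8 activations with per-tensor affine quantization, and the function names, scales, and zero points are all hypothetical.

```python
import numpy as np

# Hypothetical helpers; the int8 per-tensor affine scheme is an assumption.
def dequantize(q, scale, zero_point):
    """int8 -> float32: x = (q - zero_point) * scale."""
    return (q.astype(np.float32) - zero_point) * scale

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def requantize(x, scale, zero_point):
    """float32 -> int8 with rounding and saturation."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def inter_layer_step(q_in, in_scale, in_zp, out_scale, out_zp):
    """One host-side round trip between two NPU layers."""
    x = dequantize(q_in, in_scale, in_zp)
    y = silu(x)
    return requantize(y, out_scale, out_zp)
```

Each call allocates several temporary arrays and touches every activation on the host, so the cost scales with tensor size and is paid once per layer regardless of how fast the kernel itself runs.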
To reach 60 FPS, ALL inter-layer operations must move onto the NPU:
1. Integrate vectorized SiLU kernel (already written, not yet wired)
2. Use xrt::runlist for all-at-once execution (no Python between)
3. Pre-stage all buffers, let DMA sequence handle data flow
4. Dataflow chaining for on-chip activation passing
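To put the 60 FPS target in perspective, the arithmetic below compares the per-frame budget with the measured figure. It assumes the 14.7s benchmark is the time for a single forward pass, which the message implies but does not state outright.

```python
# Assumption: 14.7 s is the wall time of one forward pass (one frame).
measured_s = 14.7
target_fps = 60

frame_budget_ms = 1000.0 / target_fps            # about 16.7 ms per frame
slowdown = (measured_s * 1000.0) / frame_budget_ms

print(f"budget per frame: {frame_budget_ms:.1f} ms")
print(f"required speedup: about {slowdown:.0f}x")
```

Under that assumption the gap is close to three orders of magnitude, which is why trimming kernel time alone cannot close it.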
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent d55c26b commit 74ca1f1
0 file changed