Commit 74ca1f1


Multi-column benchmark: 14.7s (no speedup — Python overhead dominates)

Multi-column correctly compiles and runs (5 detections) but doesn't reduce forward time, because the bottleneck is NOT kernel compute (already ~1ms per layer with vectorization). The bottleneck is Python-side dequant/SiLU/requant between layers, plus buffer management.

To reach 60 FPS, ALL inter-layer operations must move into the NPU:

1. Integrate vectorized SiLU kernel (already written, not yet wired)
2. Use xrt::runlist for all-at-once execution (no Python between)
3. Pre-stage all buffers, let DMA sequence handle data flow
4. Dataflow chaining for on-chip activation passing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
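For context, the Python-side inter-layer step named in the message (dequant, SiLU, requant) might look roughly like the sketch below. This is not the repository's code: the function name `inter_layer_step`, the symmetric per-tensor int8 scheme, and the `in_scale`/`out_scale` parameters are all assumptions made for illustration. It only shows the kind of per-element host work that runs between NPU kernel launches.

```python
import math


def silu(x: float) -> float:
    """SiLU activation, x * sigmoid(x), written to avoid overflow in exp()."""
    if x >= 0.0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)  # x < 0, so e is in (0, 1) and cannot overflow
    return x * e / (1.0 + e)


def inter_layer_step(q_in, in_scale, out_scale):
    """Hypothetical host-side step between two quantized layers.

    Assumes symmetric per-tensor int8 quantization (an assumption; the
    commit does not state the scheme):
      1. dequantize: int8 -> float using the producer's scale
      2. activate:   SiLU in floating point
      3. requantize: float -> int8 using the consumer's scale, with clamping
    """
    q_out = []
    for q in q_in:
        x = q * in_scale                     # 1. dequantize
        y = silu(x)                          # 2. activation
        r = round(y / out_scale)             # 3. requantize
        q_out.append(max(-128, min(127, r))) # clamp to the int8 range
    return q_out


if __name__ == "__main__":
    # Illustrative values only; real scales come from model calibration.
    print(inter_layer_step([0, 10, 127, -128], 0.1, 0.1))
```

Every element passes through the interpreter once per layer in a loop like this, which is consistent with the message's point that host-side work, rather than kernel compute, sets the forward time. A NumPy version would shrink the per-element cost but would still leave a host round trip between kernels.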

1 parent d55c26b, commit 74ca1f1

0 files changed
