iGPU Accelerate #8

Draft
bouk wants to merge 17 commits into faster2 from nuc15-hardware

Conversation

@bouk bouk commented Jun 11, 2026

Member

No description provided.

bouk and others added 17 commits June 11, 2026 12:17
ctasks was a fixed 16-slot stack array, but the chunk count scales with
4*td->nthreads; any thread count above 4 overflowed the array and
corrupted the stack. Size it from the actual group count instead.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
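
A minimal sketch of the fix described in the commit above: allocate the task array from the computed chunk count instead of using a fixed 16-slot stack array. The names `chunk_task` and `build_chunk_tasks` and the exact chunking rule are assumptions for illustration, not the detector's actual code; only the `4 * nthreads` scaling comes from the commit message.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-chunk task record. */
struct chunk_task {
    int first;  /* index of the first item in this chunk */
    int count;  /* number of items in this chunk */
};

/* Split nitems into at most 4*nthreads chunks and heap-allocate one task
 * per chunk, so the array can never be smaller than the chunk count.
 * Returns the number of tasks, or -1 on allocation failure.
 * The caller owns *out and must free() it. */
static int build_chunk_tasks(int nitems, int nthreads, struct chunk_task **out)
{
    assert(nitems >= 0 && nthreads > 0);
    int max_chunks = 4 * nthreads;
    int chunk = (nitems + max_chunks - 1) / max_chunks;  /* ceiling division */
    if (chunk < 1)
        chunk = 1;
    int ntasks = (nitems + chunk - 1) / chunk;

    struct chunk_task *tasks = calloc(ntasks > 0 ? (size_t)ntasks : 1, sizeof *tasks);
    if (!tasks) {
        *out = NULL;
        return -1;
    }
    for (int i = 0; i < ntasks; i++) {
        int remaining = nitems - i * chunk;
        tasks[i].first = i * chunk;
        tasks[i].count = remaining < chunk ? remaining : chunk;
    }
    *out = tasks;
    return ntasks;
}
```

With 12 threads this yields up to 48 tasks, three times what the old 16-slot array could hold.
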
…rness

bench.sh runs hyperfine over the vide_images2 corpus and appends a
per-stage row to results.tsv per commit. Thread sweep on the 225H
(4P+8E+2LP-E, no SMT): 12 workers on cpus 0-11 is optimal; the LP-E
cores hurt. Detector drops 39.9 -> 22.6 ms/image with no code change.
check.sh defaults updated to the NUC config and vide_images2 corpus.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
New apriltag_opencl module: OpenCL context on the Intel iGPU with
Intel USM buffers (host USM for CPU-consumed outputs, device USM for
GPU-internal data), kernels for tile min/max, 3x3 tile blur, fused
per-pixel threshold + window-parallel run-length encoding (movemask
bit tricks, mirroring the AVX2 CPU code), per-row and per-frame run
offset scans, and run emission. Output is bit-identical to the CPU
path (4583/4583 detections, max coord delta 0.0 px); any GPU failure
falls back to the CPU implementation transparently.

Threshold alone is slower than the CPU stage (3.6 vs 2.2 ms): the
6.4 MB frame handoff costs ~0.9 ms in coherency traffic and each
dispatch carries ~1 ms of fixed overhead. This commit is the
foundation: the next stages join the same GPU batch, amortizing both.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
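
The bit trick behind the run-length encoding in the commit above can be sketched in scalar C. This is an illustration under assumptions, not the OpenCL kernel or the AVX2 code: it takes thresholded pixels already packed 64 per word (where the real code would get a mask from a movemask-style operation), and `struct run` and `encode_row` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Half-open run [start, end) of equal-valued pixels. */
struct run { int start; int end; int white; };

/* Count trailing zeros; x must be nonzero. */
static int ctz64(uint64_t x)
{
    int i = 0;
    while (!(x & 1)) { x >>= 1; i++; }
    return i;
}

/* Encode one row packed 64 pixels per word (bit i of word w is pixel
 * 64*w + i, 1 = white). runs must have room for width entries.
 * Returns the number of runs written. */
static int encode_row(const uint64_t *bits, int width, struct run *runs)
{
    assert(width > 0);
    int n = 0;
    int prev = -1;  /* value of the last pixel of the previous window */
    for (int base = 0; base < width; base += 64) {
        uint64_t m = bits[base / 64];
        int valid = width - base < 64 ? width - base : 64;
        /* Each bit of prev_bits holds the value of the preceding pixel. */
        uint64_t prev_bits = (m << 1) | (uint64_t)(prev == 1);
        uint64_t edges = m ^ prev_bits;  /* 1 where a pixel differs from its predecessor */
        if (prev < 0)
            edges |= 1;                  /* the first pixel always starts a run */
        if (valid < 64)
            edges &= (UINT64_C(1) << valid) - 1;
        while (edges) {
            int i = ctz64(edges);
            edges &= edges - 1;          /* clear the lowest set bit */
            if (n > 0)
                runs[n - 1].end = base + i;
            runs[n].start = base + i;
            runs[n].white = (int)((m >> i) & 1);
            n++;
        }
        prev = (int)((m >> (valid - 1)) & 1);
    }
    runs[n - 1].end = width;
    return n;
}
```

Each 64-pixel window costs one XOR plus one loop iteration per run boundary, rather than one comparison per pixel.
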
The union-by-size representative is an artifact of union order and
threading layout. GPU connected-components algorithms converge to
min-label representatives natively, so to make the CPU and GPU paths
bit-comparable, rewrite every component's label to its smallest member
id (and flatten all parent chains) after the union pass. Verified
output-identical on the corpus: 4583/4583 detections, 0.0 px delta.

Costs ~2 ms net on the CPU path (the flattened parents speed up the
cluster stage); this pass disappears once the GPU CCL lands, which
produces canonical labels for free.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
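
The canonicalization pass in the commit above could look roughly like the following. It is a sketch assuming a plain array-based union-find; the function names and the two scratch arrays are illustrative choices, not the detector's implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Find with path halving. */
static uint32_t uf_find(uint32_t *parent, uint32_t x)
{
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    return x;
}

/* Union by size: the representative depends on union order. */
static void uf_union_by_size(uint32_t *parent, uint32_t *size, uint32_t a, uint32_t b)
{
    a = uf_find(parent, a);
    b = uf_find(parent, b);
    if (a == b)
        return;
    if (size[a] < size[b]) { uint32_t t = a; a = b; b = t; }
    parent[b] = a;
    size[a] += size[b];
}

/* Rewrite every node's parent to the smallest id in its component and move
 * each component's size to that id. Returns 0 on success, -1 on allocation
 * failure. */
static int uf_canonicalize(uint32_t *parent, uint32_t *size, uint32_t n)
{
    if (n == 0)
        return 0;
    uint32_t *min_of_root = malloc((size_t)n * sizeof *min_of_root);
    uint32_t *label = malloc((size_t)n * sizeof *label);
    if (!min_of_root || !label) {
        free(min_of_root);
        free(label);
        return -1;
    }
    for (uint32_t i = 0; i < n; i++)
        min_of_root[i] = UINT32_MAX;
    /* Ascending scan: the first node to reach a root is the smallest member. */
    for (uint32_t i = 0; i < n; i++) {
        uint32_t r = uf_find(parent, i);
        if (min_of_root[r] == UINT32_MAX)
            min_of_root[r] = i;
        label[i] = min_of_root[r];
    }
    /* Move sizes before parents are overwritten; only old roots carry sizes. */
    for (uint32_t i = 0; i < n; i++)
        if (parent[i] == i)
            size[min_of_root[i]] = size[i];
    /* Flatten: every node now points directly at its canonical label. */
    for (uint32_t i = 0; i < n; i++)
        parent[i] = label[i];
    free(min_of_root);
    free(label);
    return 0;
}
```

After this pass the labels no longer depend on union order, so they can be compared directly against any algorithm that converges to minimum labels.
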
Run-based CCL as OpenCL kernels: edge extraction reproduces
connect_runs_to_prev exactly (vertical contact, white 8-connectivity,
the last-column virtual nodes), then a single-pass ECL-CC-style
lock-free union (CAS hooks only roots, so links are never lost) with
path-halving finds, one compression sweep, atomic size accumulation,
and a publish kernel into shared memory. Labels converge to each
component's minimum node id - bit-identical to the canonicalized CPU
union-find (verified node-for-node over ~1M nodes/frame via
APRILTAG_CCL_VERIFY, and 4583/4583 detections at 0.0 px delta).

The whole CCL runs as one GPU batch with a single sync; the iterative
atomic-min version needed 5-6 full sweeps and was 3x slower.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
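
The root-only CAS hook described in the commit above can be illustrated on the CPU with C11 atomics. This is a sketch of the general technique, not the OpenCL kernel: `find_root` and `unite` are hypothetical names, and relaxed memory ordering is an assumption that suffices for this illustration only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Find with path halving. A racing writer only ever replaces a parent with
 * an ancestor, so a stale read still leads to the same root. */
static uint32_t find_root(_Atomic uint32_t *parent, uint32_t x)
{
    for (;;) {
        uint32_t p = atomic_load_explicit(&parent[x], memory_order_relaxed);
        if (p == x)
            return x;
        uint32_t gp = atomic_load_explicit(&parent[p], memory_order_relaxed);
        /* Best-effort shortcut x -> grandparent; losing the race is harmless. */
        atomic_compare_exchange_weak_explicit(&parent[x], &p, gp,
                                              memory_order_relaxed,
                                              memory_order_relaxed);
        x = gp;
    }
}

/* Merge the components of a and b. The larger root is hooked under the
 * smaller one, and the CAS succeeds only if the larger node is still a
 * root, so a concurrent hook is never overwritten. Labels therefore
 * converge to the minimum node id of each component. */
static void unite(_Atomic uint32_t *parent, uint32_t a, uint32_t b)
{
    for (;;) {
        a = find_root(parent, a);
        b = find_root(parent, b);
        if (a == b)
            return;
        uint32_t hi = a > b ? a : b;
        uint32_t lo = a > b ? b : a;
        uint32_t expected = hi;
        if (atomic_compare_exchange_strong_explicit(&parent[hi], &expected, lo,
                                                    memory_order_relaxed,
                                                    memory_order_relaxed))
            return;
        /* hi stopped being a root in the meantime: retry from the new roots. */
    }
}
```

Because a hook only ever targets a current root, every edge can be processed once, in any order, without the repeated full sweeps an iterative atomic-min scheme needs.
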
One work-item per run replays the CPU run-driven boundary sweep
exactly: same gates (lazy component-size checks against the canonical
labels), same appends, same within-run emission order. The previous
run's carry-in (connected_last) is derived locally - it reduces to
'previous run usable and an opposite usable pixel at (end+1, y+1)'.
Records (cluster slot + point) land at per-run offsets from a count
pass + hierarchical scan, so the flat stream equals the CPU emission
order; cluster identity goes through a 64-bit-atomic open-addressing
hash table. The host then groups records with an order-stable
parallel two-pass partition and sorts clusters by (bucket hash, id) -
reproducing the CPU task/merge pipeline bit for bit (4583/4583
detections, 0.0 px).

The whole stage is one GPU batch (count, scan, emit with capacity
guard + retry, directory compaction). A run->row lookup table
replaces per-run binary searches here and in the CCL edge kernel.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
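
The count pass, scan, and emit structure in the commit above is what keeps a parallel emission in serial order. A minimal sketch follows, with a serial scan standing in for the hierarchical GPU scan and integer payloads standing in for the (cluster slot + point) records; all names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Exclusive prefix sum: offsets[r] is where run r's first record goes.
 * Returns the total record count. */
static size_t exclusive_scan(const unsigned *counts, size_t n, size_t *offsets)
{
    assert(counts != NULL && offsets != NULL);
    size_t total = 0;
    for (size_t r = 0; r < n; r++) {
        offsets[r] = total;
        total += counts[r];
    }
    return total;
}

/* Emit pass: run r writes its k-th record at out[offsets[r] + k]. Runs are
 * deliberately visited in reverse to show that the output order depends
 * only on the offsets, not on the order in which work-items execute. */
static void emit_records(const unsigned *counts, const size_t *offsets,
                         size_t n, int *out)
{
    for (size_t r = n; r-- > 0; )
        for (unsigned k = 0; k < counts[r]; k++)
            out[offsets[r] + k] = (int)(r * 100 + k);
}
```

Since every run owns a disjoint slice of the output, no atomics are needed during emission and the flat stream matches what a serial loop over the runs would produce.
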
The fit_quads_k kernel runs the entire fit_quad pipeline per cluster:
angle keys (same exact-rounded float ops; built with
-cl-fp32-correctly-rounded-divide-sqrt and FP_CONTRACT OFF), the CPU
ping-pong merge sort as an explicit-stack iteration (same splits, leaf
networks, and tie rules), the double-precision line-fit prefix sums in
serial order, windowed fit errors, the Gaussian-filtered maxima search
(taps computed with host libm and float-rounded like the CPU), the
memoized candidate-quad search, and the final corner intersections.
Verified bit-identical end to end (4583/4583 detections, 0.0 px).

On this iGPU (Arc 130T, weak fp64, per-cluster serial structure) the
kernel is ~17x slower than the hand-vectorized CPU stage, so it stays
opt-in via APRILTAG_GPU_QUADS=1. The default path keeps the report's
hybrid split: GPU front-end, CPU quad fit + decode -- but the CPU
fitter now reads clusters directly from the GPU-grouped shared slab
(fit_quad takes raw point arrays), dropping the per-cluster pt_list
materialization. The union-find publish to shared memory is deferred
until something on the CPU actually reads it.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
New apriltag_detector_detect_prepare(td, im): enqueues the whole GPU
front end (threshold, CCL, cluster sweep) and returns immediately; the
next detect of the same image consumes the in-flight results, paying
only the residual wait. The demo keeps one image of lookahead so each
frame's GPU front-end overlaps the next image's JPEG decode.

Rationale (measured with perf): with the iGPU active the P-cores
average 2.68 GHz vs 3.65 GHz without (powersave governor,
balance_performance EPP, shared package budget), so running GPU and
CPU stages serially let the frequency penalty eat the GPU's savings.
Overlapped, the GPU work disappears from the critical path: at 4
threads the demo runs 1.15x faster than CPU-only end to end (6.94 s
vs 7.95 s corpus wall) while using 26% less CPU time (13.7 vs 18.4
user-seconds) - the front-end no longer costs CPU at all.

bench.sh/check.sh defaults return to 4 threads on the P-cores.
Output remains bit-identical (4583/4583 detections, 0.0 px).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
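
The one-image lookahead described in the commit above can be sketched as a call-ordering pattern. The three stubs below are hypothetical stand-ins that only log their calls (the real entry point named in the commit is apriltag_detector_detect_prepare(td, im)), and the exact placement of the prepare call in the demo loop is an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Call log so the ordering can be inspected. */
static char event_log[256];

static void log_event(const char *what, int frame)
{
    char buf[32];
    snprintf(buf, sizeof buf, "%s%d ", what, frame);
    strncat(event_log, buf, sizeof event_log - strlen(event_log) - 1);
}

/* Hypothetical stand-ins: JPEG decode, asynchronous GPU front-end enqueue,
 * and the detect call that consumes the in-flight GPU results. */
static void decode_stub(int frame)  { log_event("decode", frame); }
static void prepare_stub(int frame) { log_event("prepare", frame); }
static void detect_stub(int frame)  { log_event("detect", frame); }

/* Process nframes images with one image of lookahead. */
static void run_with_lookahead(int nframes)
{
    if (nframes <= 0)
        return;
    decode_stub(0);
    prepare_stub(0);              /* GPU starts on frame 0, returns at once */
    for (int i = 0; i < nframes; i++) {
        if (i + 1 < nframes)
            decode_stub(i + 1);   /* CPU decodes while the GPU works on frame i */
        detect_stub(i);           /* pays only the residual wait for frame i */
        if (i + 1 < nframes)
            prepare_stub(i + 1);  /* enqueue the next frame's front end */
    }
}
```

The point of the ordering is that the decode of frame i+1 always sits between the enqueue for frame i and the detect that waits on it, which is where the GPU work is hidden.
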
…write

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@bouk bouk marked this pull request as draft June 11, 2026 12:21