iGPU Accelerate #8

Draft

bouk wants to merge 17 commits into

bouk commented Jun 11, 2026

Member

No description provided.

bouk and others added 17 commits

June 11, 2026 12:17


          Fix stack overflow in gradient_clusters with more than 4 threads

92275dd

ctasks was a fixed 16-slot stack array, but the chunk count scales with
4*td->nthreads; any thread count above 4 overflowed the array and
corrupted the stack. Size it from the actual group count instead.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
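
To illustrate the fix described in this commit, here is a minimal sketch of sizing the task array from the actual group count rather than a fixed 16-slot array. `struct cluster_task` and `make_tasks` are hypothetical names, and the chunking arithmetic is an assumption; the real code in gradient_clusters may split the work differently.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-chunk task record. */
struct cluster_task { int y0, y1; };

/* Split `height` rows into roughly 4*nthreads chunks and allocate exactly
 * one task per chunk, so the array can never be smaller than the count. */
struct cluster_task *make_tasks(int height, int nthreads, int *ntasks_out)
{
    assert(height > 0 && nthreads > 0);
    int target = 4 * nthreads;                    /* desired chunk count */
    int chunk = (height + target - 1) / target;   /* rows per chunk, >= 1 */
    int ntasks = (height + chunk - 1) / chunk;    /* actual group count */

    struct cluster_task *tasks = calloc((size_t)ntasks, sizeof *tasks);
    if (!tasks)
        return NULL;
    for (int i = 0; i < ntasks; i++) {
        tasks[i].y0 = i * chunk;
        tasks[i].y1 = (i + 1) * chunk < height ? (i + 1) * chunk : height;
    }
    *ntasks_out = ntasks;
    return tasks;
}
```

With 12 threads and a 1080-row image this yields 47 tasks, well past the 16 slots the old stack array provided.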


          Phase 0: re-baseline on NUC15 (Ultra 5 225H), add per-commit bench harness

ccb12cb

bench.sh runs hyperfine over the vide_images2 corpus and appends a
per-stage row to results.tsv per commit. Thread sweep on the 225H
(4P+8E+2LP-E, no SMT): 12 workers on cpus 0-11 is optimal; the LP-E
cores hurt. Detector drops 39.9 -> 22.6 ms/image with no code change.
check.sh defaults updated to the NUC config and vide_images2 corpus.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>


          GPU stage 1: threshold + RLE on the Arc iGPU via OpenCL

318e482

New apriltag_opencl module: OpenCL context on the Intel iGPU with
Intel USM buffers (host USM for CPU-consumed outputs, device USM for
GPU-internal data), kernels for tile min/max, 3x3 tile blur, fused
per-pixel threshold + window-parallel run-length encoding (movemask
bit tricks, mirroring the AVX2 CPU code), per-row and per-frame run
offset scans, and run emission. Output is bit-identical to the CPU
path (4583/4583 detections, max coord delta 0.0 px); any GPU failure
falls back to the CPU implementation transparently.

Threshold alone is slower than the CPU stage (3.6 vs 2.2 ms): the
6.4 MB frame handoff costs ~0.9 ms in coherency traffic and each
dispatch carries ~1 ms of fixed overhead. This commit is the
foundation: the next stages join the same GPU batch, amortizing both.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
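
The bit-trick run-length encoding mentioned in this commit can be sketched in plain C. This is a simplified illustration, not the kernel itself: it uses a single binary threshold (the real stage is tile-adaptive), a 32-pixel window, and the hypothetical names `struct run`, `ctz32`, and `rle_window`.

```c
#include <assert.h>
#include <stdint.h>

struct run { int start; int len; int value; };  /* hypothetical run record */

/* Index of the lowest set bit; x must be non-zero. */
static int ctz32(uint32_t x)
{
    int n = 0;
    while (!(x & 1u)) { x >>= 1; n++; }
    return n;
}

/* Encode a window of 1..32 pixels into runs; `out` needs room for n runs.
 * Returns the number of runs written. */
int rle_window(const uint8_t *px, int n, uint8_t thresh, struct run *out)
{
    assert(n >= 1 && n <= 32);

    /* "Movemask": one bit per pixel, set when the pixel is above threshold. */
    uint32_t mask = 0;
    for (int i = 0; i < n; i++)
        mask |= (uint32_t)(px[i] > thresh) << i;

    /* Bit i is set when pixel i differs from pixel i-1 (a run boundary). */
    uint32_t edges = (mask ^ (mask << 1)) & ~1u;
    if (n < 32)
        edges &= (1u << n) - 1u;

    int nruns = 0, start = 0;
    while (edges) {
        int pos = ctz32(edges);
        out[nruns++] = (struct run){ start, pos - start,
                                     (int)((mask >> start) & 1u) };
        start = pos;
        edges &= edges - 1u;          /* clear the lowest set bit */
    }
    out[nruns++] = (struct run){ start, n - start,
                                 (int)((mask >> start) & 1u) };
    return nruns;
}
```

Each window can be encoded independently, which is what makes the approach window-parallel; stitching runs across window boundaries is left out of this sketch.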


          results.tsv: record GPU stage 1 benchmark

b5f1e95

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>


          Canonicalize union-find representatives to minimum member id

ad9ecd4

The union-by-size representative is an artifact of union order and
threading layout. GPU connected-components algorithms converge to
min-label representatives natively, so to make the CPU and GPU paths
bit-comparable, rewrite every component's label to its smallest member
id (and flatten all parent chains) after the union pass. Verified
output-identical on the corpus: 4583/4583 detections, 0.0 px delta.

Costs ~2 ms net on the CPU path (the flattened parents speed up the
cluster stage); this pass disappears once the GPU CCL lands, which
produces canonical labels for free.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
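
A minimal sketch of the canonicalization pass described in this commit, assuming the union-find is a plain `parent[]` array in which roots point at themselves. `find_root` and `canonicalize_min` are hypothetical names, and the two scratch arrays are a simplification; the actual pass may work in place.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Follow parent links to the root without modifying anything. */
static uint32_t find_root(const uint32_t *parent, uint32_t i)
{
    while (parent[i] != i)
        i = parent[i];
    return i;
}

/* Rewrite parent[] so every node points directly at the smallest id in
 * its component. Returns 0 on success, -1 on allocation failure. */
int canonicalize_min(uint32_t *parent, uint32_t n)
{
    assert(parent != NULL || n == 0);
    if (n == 0)
        return 0;

    uint32_t *root_of = malloc((size_t)n * sizeof *root_of);
    uint32_t *min_of = malloc((size_t)n * sizeof *min_of);
    if (!root_of || !min_of) {
        free(root_of);
        free(min_of);
        return -1;
    }

    for (uint32_t i = 0; i < n; i++)
        min_of[i] = UINT32_MAX;

    /* Ascending scan: the first node that reaches a root is the minimum
     * member of that component. */
    for (uint32_t i = 0; i < n; i++) {
        uint32_t r = find_root(parent, i);
        root_of[i] = r;
        if (min_of[r] == UINT32_MAX)
            min_of[r] = i;
    }

    /* Relabel and flatten: every chain now has length at most one. */
    for (uint32_t i = 0; i < n; i++)
        parent[i] = min_of[root_of[i]];

    free(root_of);
    free(min_of);
    return 0;
}
```

After the pass, the label of a component no longer depends on the order in which unions happened, which is the property the CPU/GPU comparison relies on.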


          results.tsv: canonicalization benchmark

39e913d

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>


          GPU stage 2: connected components on the Arc iGPU

dd0cd60

Run-based CCL as OpenCL kernels: edge extraction reproduces
connect_runs_to_prev exactly (vertical contact, white 8-connectivity,
the last-column virtual nodes), then a single-pass ECL-CC-style
lock-free union (CAS hooks only roots, so links are never lost) with
path-halving finds, one compression sweep, atomic size accumulation,
and a publish kernel into shared memory. Labels converge to each
component's minimum node id - bit-identical to the canonicalized CPU
union-find (verified node-for-node over ~1M nodes/frame via
APRILTAG_CCL_VERIFY, and 4583/4583 detections at 0.0 px delta).

The whole CCL runs as one GPU batch with a single sync; the iterative
atomic-min version needed 5-6 full sweeps and was 3x slower.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
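
The lock-free union described in this commit can be sketched on the host with C11 atomics. This is an illustration of the general technique (CAS that hooks only roots, larger root under smaller, path-halving finds), not the OpenCL kernel; `find_halving` and `union_min` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Find with path halving: re-point each visited node at its grandparent.
 * Because links only ever go from a larger id to a smaller one, a failed
 * or stale CAS is harmless. */
static uint32_t find_halving(_Atomic uint32_t *parent, uint32_t i)
{
    for (;;) {
        uint32_t p = atomic_load(&parent[i]);
        if (p == i)
            return i;
        uint32_t gp = atomic_load(&parent[p]);
        if (gp == p)
            return p;
        uint32_t expected = p;
        atomic_compare_exchange_weak(&parent[i], &expected, gp);
        i = gp;
    }
}

/* Merge the components of a and b. The CAS succeeds only while the larger
 * root is still a root, so a concurrent link is never overwritten. */
void union_min(_Atomic uint32_t *parent, uint32_t a, uint32_t b)
{
    for (;;) {
        a = find_halving(parent, a);
        b = find_halving(parent, b);
        if (a == b)
            return;
        if (a > b) { uint32_t t = a; a = b; b = t; }  /* a is the smaller root */
        uint32_t expected = b;
        if (atomic_compare_exchange_strong(&parent[b], &expected, a))
            return;
        /* b gained a parent in the meantime: retry from the new roots. */
    }
}
```

Since every node starts as its own root and the smaller root always wins, the surviving root of a component is its minimum node id, which matches the convergence property stated above.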


          results.tsv: GPU CCL benchmark

7babbae

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>


          GPU stage 3: gradient-cluster sweep on the Arc iGPU

239cb99

One work-item per run replays the CPU run-driven boundary sweep
exactly: same gates (lazy component-size checks against the canonical
labels), same appends, same within-run emission order. The previous
run's carry-in (connected_last) is derived locally - it reduces to
'previous run usable and an opposite usable pixel at (end+1, y+1)'.
Records (cluster slot + point) land at per-run offsets from a count
pass + hierarchical scan, so the flat stream equals the CPU emission
order; cluster identity goes through a 64-bit-atomic open-addressing
hash table. The host then groups records with an order-stable
parallel two-pass partition and sorts clusters by (bucket hash, id) -
reproducing the CPU task/merge pipeline bit for bit (4583/4583
detections, 0.0 px).

The whole stage is one GPU batch (count, scan, emit with capacity
guard + retry, directory compaction). A run->row lookup table
replaces per-run binary searches here and in the CCL edge kernel.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
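
The count, scan, emit pattern described in this commit can be sketched serially as follows. It is a simplified illustration: the scan here is a flat loop rather than a hierarchical GPU scan, each record is just the index of the run that produced it, and `exclusive_scan` and `emit_records` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Turn per-run record counts into start offsets (exclusive prefix sum).
 * Returns the total number of records. */
size_t exclusive_scan(const uint32_t *counts, size_t n, uint32_t *offsets)
{
    assert(counts != NULL && offsets != NULL);
    size_t total = 0;
    for (size_t i = 0; i < n; i++) {
        offsets[i] = (uint32_t)total;
        total += counts[i];
    }
    return total;
}

/* Each run writes its records into its own slice of `out`. Runs are
 * visited in reverse on purpose: the slices are disjoint, so the result
 * does not depend on the order in which runs execute. */
void emit_records(const uint32_t *counts, const uint32_t *offsets,
                  size_t n, uint32_t *out)
{
    for (size_t i = n; i-- > 0; )
        for (uint32_t k = 0; k < counts[i]; k++)
            out[offsets[i] + k] = (uint32_t)i;
}
```

Because every run owns a fixed slice, the flat output stream comes out in run order regardless of scheduling, which is how a parallel emit can reproduce a serial emission order.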


          results.tsv: GPU cluster sweep benchmark

bdbe11c

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>


          GPU stage 4: bit-exact quad fitting on the iGPU + slab-based CPU fitter

596ca50

The fit_quads_k kernel runs the entire fit_quad pipeline per cluster:
angle keys (same exact-rounded float ops; built with
-cl-fp32-correctly-rounded-divide-sqrt and FP_CONTRACT OFF), the CPU
ping-pong merge sort as an explicit-stack iteration (same splits, leaf
networks, and tie rules), the double-precision line-fit prefix sums in
serial order, windowed fit errors, the Gaussian-filtered maxima search
(taps computed with host libm and float-rounded like the CPU), the
memoized candidate-quad search, and the final corner intersections.
Verified bit-identical end to end (4583/4583 detections, 0.0 px).

On this iGPU (Arc 130T, weak fp64, per-cluster serial structure) the
kernel is ~17x slower than the hand-vectorized CPU stage, so it stays
opt-in via APRILTAG_GPU_QUADS=1. The default path keeps the report's
hybrid split: GPU front-end, CPU quad fit + decode -- but the CPU
fitter now reads clusters directly from the GPU-grouped shared slab
(fit_quad takes raw point arrays), dropping the per-cluster pt_list
materialization. The union-find publish to shared memory is deferred
until something on the CPU actually reads it.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
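
To show what line-fit prefix sums buy, here is a minimal sketch: moments are accumulated once in serial order, after which the moments of any window of the contour come from a couple of subtractions. It is unweighted and uses hypothetical names (`struct moments`, `build_prefix`, `window_moments`); treating the contour as closed, with windows that may wrap past the end, is an assumption.

```c
#include <assert.h>

struct moments { double n, mx, my, mxx, mxy, myy; };

static struct moments m_add(struct moments a, struct moments b)
{
    struct moments r = { a.n + b.n, a.mx + b.mx, a.my + b.my,
                         a.mxx + b.mxx, a.mxy + b.mxy, a.myy + b.myy };
    return r;
}

static struct moments m_sub(struct moments a, struct moments b)
{
    struct moments r = { a.n - b.n, a.mx - b.mx, a.my - b.my,
                         a.mxx - b.mxx, a.mxy - b.mxy, a.myy - b.myy };
    return r;
}

/* ps[i] holds the moments of points 0..i inclusive, accumulated serially
 * so the floating-point rounding order is fixed. */
void build_prefix(const double *x, const double *y, int sz,
                  struct moments *ps)
{
    struct moments acc = {0};
    for (int i = 0; i < sz; i++) {
        acc.n += 1.0;
        acc.mx += x[i];
        acc.my += y[i];
        acc.mxx += x[i] * x[i];
        acc.mxy += x[i] * y[i];
        acc.myy += y[i] * y[i];
        ps[i] = acc;
    }
}

/* Moments of points i0..i1 inclusive; if i0 > i1 the window wraps around
 * the end of the contour. */
struct moments window_moments(const struct moments *ps, int sz,
                              int i0, int i1)
{
    assert(sz > 0 && i0 >= 0 && i0 < sz && i1 >= 0 && i1 < sz);
    if (i0 <= i1)
        return i0 > 0 ? m_sub(ps[i1], ps[i0 - 1]) : ps[i1];
    return m_add(m_sub(ps[sz - 1], ps[i0 - 1]), ps[i1]);
}
```

From a window's moments, the covariance terms follow as `mxx/n - (mx/n)^2` and so on, and a fit error can be derived from them. Fixing the accumulation order matters for bit-exactness because floating-point addition is not associative.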


          results.tsv: stage 4 benchmark

20038fd

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>


          Document the GPU acceleration campaign; record the all-GPU benchmark

c6304a4

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>


          Pipeline the GPU front-end ahead of the CPU back half

9b35e44

New apriltag_detector_detect_prepare(td, im): enqueues the whole GPU
front end (threshold, CCL, cluster sweep) and returns immediately; the
next detect of the same image consumes the in-flight results, paying
only the residual wait. The demo keeps one image of lookahead so each
frame's GPU front-end overlaps the next image's JPEG decode.

Rationale (measured with perf): with the iGPU active the P-cores
average 2.68 GHz vs 3.65 GHz without (powersave governor,
balance_performance EPP, shared package budget), so running GPU and
CPU stages serially let the frequency penalty eat the GPU's savings.
Overlapped, the GPU work disappears from the critical path: at 4
threads the demo runs 1.15x faster than CPU-only end to end (6.94 s
vs 7.95 s corpus wall) while using 26% less CPU time (13.7 vs 18.4
user-seconds) - the front-end no longer costs CPU at all.

bench.sh/check.sh defaults return to 4 threads on the P-cores.
Output remains bit-identical (4583/4583 detections, 0.0 px).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
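
The one-image lookahead described in this commit can be sketched as a loop over stubbed stages. Everything here is hypothetical scaffolding: `decode_image`, `prepare_frontend`, and `detect` only log a letter, standing in for the JPEG decode, the non-blocking GPU enqueue, and the detect call that consumes the in-flight results. The exact call order in the demo is an assumption.

```c
#include <assert.h>

enum { MAX_LOG = 64 };

/* Records the order of stage calls: D = decode, P = prepare, X = detect. */
struct trace { char ev[MAX_LOG]; int n; };

static void log_ev(struct trace *t, char c)
{
    assert(t->n < MAX_LOG - 1);
    t->ev[t->n++] = c;
    t->ev[t->n] = '\0';
}

/* Stubs: the real stages would do work on image i. */
static void decode_image(struct trace *t, int i)     { (void)i; log_ev(t, 'D'); }
static void prepare_frontend(struct trace *t, int i) { (void)i; log_ev(t, 'P'); }
static void detect(struct trace *t, int i)           { (void)i; log_ev(t, 'X'); }

void run_pipeline(struct trace *t, int nimages)
{
    if (nimages <= 0)
        return;

    decode_image(t, 0);
    prepare_frontend(t, 0);              /* GPU starts on image 0 */

    for (int i = 0; i < nimages; i++) {
        if (i + 1 < nimages)
            decode_image(t, i + 1);      /* CPU decodes while the GPU runs */
        detect(t, i);                    /* pays only the residual wait */
        if (i + 1 < nimages)
            prepare_frontend(t, i + 1);  /* enqueue the next front end */
    }
}
```

For three images the trace reads `DPDXPDXPX`: each prepare is followed by the next image's decode before the matching detect, which is where the overlap comes from.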


          results.tsv: pipelined front-end benchmarks at 4 threads

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>


          GPU_NOTES: document the power-coupling finding and pipelining results

a454f51

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>


          Ignore all build directories; refresh results.tsv commit ids after rewrite

089968d

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

bouk marked this pull request as draft

June 11, 2026 12:21
