iGPU Accelerate #8
Draft
bouk wants to merge 17 commits into
ctasks was a fixed 16-slot stack array, but the chunk count scales with 4*td->nthreads; any thread count above 4 overflowed the array and corrupted the stack. Size it from the actual group count instead. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
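For readers following the fix, here is a minimal sketch of sizing the task array from the computed chunk count instead of a fixed 16 slots. The struct, the function names, and the even row split are hypothetical illustrations; only the 4*nthreads scaling comes from the commit message. With 4 threads the count is exactly 16, which is why only larger thread counts overflowed.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical task record; the real ctasks element type is not shown here. */
struct chunk_task {
    int first_row;
    int last_row; /* exclusive */
};

/* Chunk count per the 4*nthreads scaling named in the commit message. */
static int chunk_count(int nthreads) {
    return 4 * nthreads;
}

/* Allocate one task per chunk on the heap and split `rows` evenly.
 * Returns NULL on allocation failure; the caller frees the result. */
static struct chunk_task *make_tasks(int rows, int nthreads, int *ntasks_out) {
    int n = chunk_count(nthreads);
    struct chunk_task *tasks = calloc((size_t)n, sizeof *tasks);
    if (!tasks) return NULL;
    for (int i = 0; i < n; i++) {
        tasks[i].first_row = (int)((long long)rows * i / n);
        tasks[i].last_row  = (int)((long long)rows * (i + 1) / n);
    }
    *ntasks_out = n;
    return tasks;
}
```

Because the array length and the loop bound come from the same value, the two cannot drift apart the way a hard-coded 16 did.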
…rness bench.sh runs hyperfine over the vide_images2 corpus and appends a per-stage row to results.tsv per commit. Thread sweep on the 225H (4P+8E+2LP-E, no SMT): 12 workers on cpus 0-11 is optimal; the LP-E cores hurt. Detector drops 39.9 -> 22.6 ms/image with no code change. check.sh defaults updated to the NUC config and vide_images2 corpus. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
New apriltag_opencl module: OpenCL context on the Intel iGPU with Intel USM buffers (host USM for CPU-consumed outputs, device USM for GPU-internal data), kernels for tile min/max, 3x3 tile blur, fused per-pixel threshold + window-parallel run-length encoding (movemask bit tricks, mirroring the AVX2 CPU code), per-row and per-frame run offset scans, and run emission. Output is bit-identical to the CPU path (4583/4583 detections, max coord delta 0.0 px); any GPU failure falls back to the CPU implementation transparently. Threshold alone is slower than the CPU stage (3.6 vs 2.2 ms): the 6.4 MB frame handoff costs ~0.9 ms in coherency traffic and each dispatch carries ~1 ms of fixed overhead. This commit is the foundation: the next stages join the same GPU batch, amortizing both. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
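The run offset scans and run emission described above follow a count, scan, emit pattern. A serial CPU sketch of that pattern is below, assuming an already thresholded 0/1 image. The struct layout and function names are hypothetical, and the sketch models neither the movemask bit tricks nor the window-parallel layout of the actual kernels.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct run { int y, x0, x1; uint8_t white; }; /* x1 is exclusive */

/* Pass 1: count maximal same-color runs in one binary row. */
static int count_runs(const uint8_t *row, int w) {
    if (w <= 0) return 0;
    int n = 1;
    for (int x = 1; x < w; x++)
        n += (row[x] != row[x - 1]);
    return n;
}

/* Pass 2: write the runs of one row starting at out[0]; returns the count. */
static int emit_runs(const uint8_t *row, int w, int y, struct run *out) {
    int n = 0, start = 0;
    for (int x = 1; x <= w; x++) {
        if (x == w || row[x] != row[start]) {
            out[n++] = (struct run){ y, start, x, row[start] };
            start = x;
        }
    }
    return n;
}

/* Count per row, exclusive-scan the counts into offsets, then emit each row
 * at its own offset. offsets needs h+1 entries. Returns the total number of
 * runs, or -1 if the output buffer is too small. */
static int encode_frame(const uint8_t *bin, int w, int h,
                        int *offsets, struct run *out, int cap) {
    int total = 0;
    for (int y = 0; y < h; y++) {
        offsets[y] = total;
        total += count_runs(bin + (size_t)y * w, w);
    }
    offsets[h] = total;
    if (total > cap) return -1;
    for (int y = 0; y < h; y++)
        emit_runs(bin + (size_t)y * w, w, y, out + offsets[y]);
    return total;
}
```

Since every row writes to a disjoint range fixed by the scan, the emit pass can run rows in any order, or in parallel, and still produce the same output.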
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The union-by-size representative is an artifact of union order and threading layout. GPU connected-components algorithms converge to min-label representatives natively, so to make the CPU and GPU paths bit-comparable, rewrite every component's label to its smallest member id (and flatten all parent chains) after the union pass. Verified output-identical on the corpus: 4583/4583 detections, 0.0 px delta. Costs ~2 ms net on the CPU path (the flattened parents speed up the cluster stage); this pass disappears once the GPU CCL lands, which produces canonical labels for free. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
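The canonicalization pass can be sketched as follows, assuming a plain parent array with no further unions in flight. The function names are hypothetical and this is an illustration of the idea, not the code from the commit.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Find with path halving on a plain (non-atomic) parent array. */
static uint32_t uf_find(uint32_t *parent, uint32_t i) {
    while (parent[i] != i) {
        parent[i] = parent[parent[i]];
        i = parent[i];
    }
    return i;
}

/* Rewrite parent[] so every node points directly at the smallest id in its
 * component. Returns 0 on success, -1 on allocation failure. */
static int uf_canonicalize(uint32_t *parent, uint32_t n) {
    uint32_t *min_of_root = malloc((n ? n : 1) * sizeof *min_of_root);
    if (!min_of_root) return -1;

    /* Step 1: flatten, so parent[i] is the current root of i. */
    for (uint32_t i = 0; i < n; i++) {
        parent[i] = uf_find(parent, i);
        min_of_root[i] = UINT32_MAX;
    }
    /* Step 2: scanning in ascending order, the first member seen for a root
     * is the minimum id of that component. */
    for (uint32_t i = 0; i < n; i++) {
        uint32_t r = parent[i];
        if (min_of_root[r] == UINT32_MAX) min_of_root[r] = i;
    }
    /* Step 3: relabel. Unprocessed entries still hold the old root, and
     * min_of_root is indexed by the old root, so in-place rewriting is safe. */
    for (uint32_t i = 0; i < n; i++)
        parent[i] = min_of_root[parent[i]];

    free(min_of_root);
    return 0;
}
```

After the pass every chain has length at most one, which is consistent with the commit's note that the flattened parents speed up the later cluster stage.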
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Run-based CCL as OpenCL kernels: edge extraction reproduces connect_runs_to_prev exactly (vertical contact, white 8-connectivity, the last-column virtual nodes), then a single-pass ECL-CC-style lock-free union (CAS hooks only roots, so links are never lost) with path-halving finds, one compression sweep, atomic size accumulation, and a publish kernel into shared memory. Labels converge to each component's minimum node id - bit-identical to the canonicalized CPU union-find (verified node-for-node over ~1M nodes/frame via APRILTAG_CCL_VERIFY, and 4583/4583 detections at 0.0 px delta). The whole CCL runs as one GPU batch with a single sync; the iterative atomic-min version needed 5-6 full sweeps and was 3x slower. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
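A CPU-side sketch of the lock-free union idea, written with C11 atomics instead of OpenCL. It assumes that min-label convergence comes from always hooking the larger root under the smaller one, as in ECL-CC; the commit message does not spell that detail out, and the function names here are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Every node starts as its own root. */
static void ccl_init(_Atomic uint32_t *parent, uint32_t n) {
    for (uint32_t i = 0; i < n; i++) atomic_init(&parent[i], i);
}

/* Find with path halving. The shortcut write only ever replaces a parent
 * with an ancestor, so a lost or stale CAS is harmless. */
static uint32_t ccl_find(_Atomic uint32_t *parent, uint32_t i) {
    for (;;) {
        uint32_t p = atomic_load(&parent[i]);
        if (p == i) return i;
        uint32_t g = atomic_load(&parent[p]);
        if (g != p) {
            uint32_t expected = p;
            atomic_compare_exchange_strong(&parent[i], &expected, g);
        }
        i = g;
    }
}

/* Union by hooking the larger root under the smaller one. The CAS succeeds
 * only if the target is still a root, so a concurrent hook is never
 * overwritten; on failure the loop retries from the new roots. */
static void ccl_union(_Atomic uint32_t *parent, uint32_t a, uint32_t b) {
    for (;;) {
        a = ccl_find(parent, a);
        b = ccl_find(parent, b);
        if (a == b) return;
        if (a < b) { uint32_t t = a; a = b; b = t; } /* a is now the larger root */
        uint32_t expected = a;
        if (atomic_compare_exchange_strong(&parent[a], &expected, b)) return;
    }
}
```

Under this hooking rule parent[i] <= i holds at all times, so the root that survives in each component is its minimum node id, matching the canonical labels of the CPU path.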
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
One work-item per run replays the CPU run-driven boundary sweep exactly: same gates (lazy component-size checks against the canonical labels), same appends, same within-run emission order. The previous run's carry-in (connected_last) is derived locally - it reduces to 'previous run usable and an opposite usable pixel at (end+1, y+1)'. Records (cluster slot + point) land at per-run offsets from a count pass + hierarchical scan, so the flat stream equals the CPU emission order; cluster identity goes through a 64-bit-atomic open-addressing hash table. The host then groups records with an order-stable parallel two-pass partition and sorts clusters by (bucket hash, id) - reproducing the CPU task/merge pipeline bit for bit (4583/4583 detections, 0.0 px). The whole stage is one GPU batch (count, scan, emit with capacity guard + retry, directory compaction). A run->row lookup table replaces per-run binary searches here and in the CCL edge kernel. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
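The host-side grouping step is described as an order-stable two-pass partition. A serial sketch of that idea is below: count per cluster slot, exclusive scan, then scatter in input order. The record layout and names are hypothetical, and the parallel decomposition used on the host is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record: a cluster slot plus a point. */
struct record { uint32_t slot; uint16_t x, y; };

/* Stable group-by on slot. start needs nslots+1 entries; on return the
 * records of slot s occupy out[start[s] .. start[s+1]) in input order.
 * Returns 0 on success, -1 on allocation failure. */
static int group_records(const struct record *in, size_t n, uint32_t nslots,
                         uint32_t *start, struct record *out) {
    uint32_t *cursor = calloc(nslots ? nslots : 1, sizeof *cursor);
    if (!cursor) return -1;

    /* Pass 1: count the records in each slot. */
    for (size_t i = 0; i < n; i++) cursor[in[i].slot]++;

    /* Exclusive scan: counts become start offsets and write cursors. */
    uint32_t sum = 0;
    for (uint32_t s = 0; s < nslots; s++) {
        uint32_t c = cursor[s];
        start[s] = sum;
        cursor[s] = sum;
        sum += c;
    }
    start[nslots] = sum;

    /* Pass 2: scatter in input order, which keeps each group stable. */
    for (size_t i = 0; i < n; i++)
        out[cursor[in[i].slot]++] = in[i];

    free(cursor);
    return 0;
}
```

Stability is what lets the grouped output reproduce the original emission order within each cluster, which the bit-for-bit comparison depends on.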
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The fit_quads_k kernel runs the entire fit_quad pipeline per cluster: angle keys (same exact-rounded float ops; built with -cl-fp32-correctly-rounded-divide-sqrt and FP_CONTRACT OFF), the CPU ping-pong merge sort as an explicit-stack iteration (same splits, leaf networks, and tie rules), the double-precision line-fit prefix sums in serial order, windowed fit errors, the Gaussian-filtered maxima search (taps computed with host libm and float-rounded like the CPU), the memoized candidate-quad search, and the final corner intersections. Verified bit-identical end to end (4583/4583 detections, 0.0 px). On this iGPU (Arc 130T, weak fp64, per-cluster serial structure) the kernel is ~17x slower than the hand-vectorized CPU stage, so it stays opt-in via APRILTAG_GPU_QUADS=1. The default path keeps the report's hybrid split: GPU front-end, CPU quad fit + decode -- but the CPU fitter now reads clusters directly from the GPU-grouped shared slab (fit_quad takes raw point arrays), dropping the per-cluster pt_list materialization. The union-find publish to shared memory is deferred until something on the CPU actually reads it. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
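To illustrate why line-fit prefix sums make windowed fit errors cheap, here is a simplified sketch: prefix sums of the point moments are accumulated in index order, and the covariance of any contiguous window then falls out of two lookups. It is unweighted and non-wrapping, and all names are hypothetical; the actual fit_quad code is not shown on this page and may differ in both respects.

```c
#include <assert.h>
#include <stddef.h>

struct moments { double n, sx, sy, sxx, sxy, syy; };
struct cov2 { double cxx, cxy, cyy; };

/* pre[i] holds the sums over points 0..i-1, accumulated serially in index
 * order. pre needs n+1 entries. */
static void build_prefix(const double *x, const double *y, size_t n,
                         struct moments *pre) {
    pre[0] = (struct moments){0};
    for (size_t i = 0; i < n; i++) {
        pre[i + 1] = pre[i];
        pre[i + 1].n   += 1.0;
        pre[i + 1].sx  += x[i];
        pre[i + 1].sy  += y[i];
        pre[i + 1].sxx += x[i] * x[i];
        pre[i + 1].sxy += x[i] * y[i];
        pre[i + 1].syy += y[i] * y[i];
    }
}

/* Covariance of the points in the window [i0, i1), computed in O(1) from
 * two prefix entries. Requires i1 > i0. */
static struct cov2 window_cov(const struct moments *pre, size_t i0, size_t i1) {
    double n  = pre[i1].n - pre[i0].n;
    double ex = (pre[i1].sx - pre[i0].sx) / n;
    double ey = (pre[i1].sy - pre[i0].sy) / n;
    struct cov2 c;
    c.cxx = (pre[i1].sxx - pre[i0].sxx) / n - ex * ex;
    c.cxy = (pre[i1].sxy - pre[i0].sxy) / n - ex * ey;
    c.cyy = (pre[i1].syy - pre[i0].syy) / n - ey * ey;
    return c;
}
```

The smaller eigenvalue of this 2x2 matrix is the mean squared perpendicular distance to the best-fit line, so it is zero for collinear points. Because floating-point addition is not associative, accumulating in a fixed serial order matters when two implementations must agree bit for bit.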
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
New apriltag_detector_detect_prepare(td, im): enqueues the whole GPU front end (threshold, CCL, cluster sweep) and returns immediately; the next detect of the same image consumes the in-flight results, paying only the residual wait. The demo keeps one image of lookahead so each frame's GPU front-end overlaps the next image's JPEG decode. Rationale (measured with perf): with the iGPU active the P-cores average 2.68 GHz vs 3.65 GHz without (powersave governor, balance_performance EPP, shared package budget), so running GPU and CPU stages serially let the frequency penalty eat the GPU's savings. Overlapped, the GPU work disappears from the critical path: at 4 threads the demo runs 1.15x faster than CPU-only end to end (6.94 s vs 7.95 s corpus wall) while using 26% less CPU time (13.7 vs 18.4 user-seconds) - the front-end no longer costs CPU at all. bench.sh/check.sh defaults return to 4 threads on the P-cores. Output remains bit-identical (4583/4583 detections, 0.0 px). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
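The one-image lookahead can be sketched as the call ordering below. The three stage functions are hypothetical stand-ins that only log their invocation so the sketch compiles on its own; detect_prepare stands in for apriltag_detector_detect_prepare(td, im). The exact ordering in the demo is not shown on this page, so treat this as one plausible arrangement.

```c
#include <assert.h>
#include <stddef.h>

enum { EV_DECODE, EV_PREPARE, EV_DETECT };
struct event { int kind; int frame; };

static struct event trace[64];
static size_t trace_len;

static void log_event(int kind, int frame) {
    if (trace_len < 64) trace[trace_len++] = (struct event){ kind, frame };
}

/* Hypothetical stand-ins for the real stages. */
static void decode_jpeg(int frame)    { log_event(EV_DECODE, frame); }
static void detect_prepare(int frame) { log_event(EV_PREPARE, frame); } /* enqueue GPU front end, return at once */
static void detect(int frame)         { log_event(EV_DETECT, frame); }  /* consume in-flight results, CPU quad fit + decode */

/* While the GPU front end of frame i is in flight, the CPU decodes frame
 * i+1; detect(i) then pays only the residual wait. */
static void run_with_lookahead(int nframes) {
    if (nframes <= 0) return;
    decode_jpeg(0);
    detect_prepare(0);
    for (int i = 0; i < nframes; i++) {
        if (i + 1 < nframes) decode_jpeg(i + 1); /* overlaps GPU work on frame i */
        detect(i);
        if (i + 1 < nframes) detect_prepare(i + 1);
    }
}
```

The figures quoted in the commit message are self-consistent: 7.95 s / 6.94 s is about 1.15, and 1 - 13.7 / 18.4 is about 0.26.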
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…write Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
No description provided.