
GPU frontend + faster3 CPU fit: per-platform AprilTag acceleration #11

Draft
jameshmcvay wants to merge 106 commits into master from james-gpu-x-bouke-cpu

@jameshmcvay jameshmcvay commented Jun 16, 2026


This PR combines the OpenCL GPU AprilTag detector (#9) with Bouke's faster3 CPU + arm64 campaign, so the fit stage uses the best implementation per platform.

Based on master, so the full stack is visible together: the GPU frontend/fit work (#9, authored on james/gpu) and the faster3 integration. For the GPU detector on its own (implementation, correctness contract, fleet/soak validation), see #9 — this PR adds the per-platform fit routing and the bridge between the two.

Per-platform fit

The GPU frontend (#9) is used whenever a GPU is available; the fit is chosen by capability:

| Platform | Frontend | Fit |
|---|---|---|
| GPU + fp64 (NUC14/15) | GPU | GPU fp64 fit (#9); leftovers → Bouke's CPU fit |
| GPU, no fp64 (NUC13) | GPU | Bouke's faster CPU fit (bridged) |
| no GPU | Bouke's CPU frontend | Bouke's CPU fit |

oclClustersToPtList bridges the GPU frontend's clusters into Bouke's packed pt_list (materializing only the leftovers the GPU fit hands back); struct pt / pt_list moved to apriltag_pt.h so the bridge can build them.
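
The routing in the table above can be sketched as a small capability switch. This is only an illustration of the decision, not the branch's code: `fit_route_t`, `choose_fit`, and the two capability flags are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names: none of these identifiers come from the branch. */
typedef enum {
    FIT_GPU_FP64_WITH_CPU_TAIL, /* GPU fp64 fit; leftovers go to the CPU fit */
    FIT_CPU_BRIDGED,            /* GPU frontend, clusters bridged to the CPU fit */
    FIT_CPU_ONLY                /* CPU frontend and CPU fit */
} fit_route_t;

/* Pick the fit path from the two capabilities the table distinguishes. */
fit_route_t choose_fit(bool has_gpu, bool gpu_has_fp64)
{
    if (!has_gpu)
        return FIT_CPU_ONLY;    /* fp64 is irrelevant without a GPU */
    return gpu_has_fp64 ? FIT_GPU_FP64_WITH_CPU_TAIL : FIT_CPU_BRIDGED;
}
```

With this shape, a NUC13-class machine would be `choose_fit(true, false)` and a NUC14/15-class machine `choose_fit(true, true)`.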

ctasks[16] stack overflow: fix carried by faster3

cluster_concat_task ctasks[16] was a fixed 16-slot stack array, but the task count is ~4 × nthreads — so it overflowed and segfaulted at nthreads ≥ 5 (≤4 is safe, which is why production nthreads=4 never tripped it). faster3 sizes it dynamically. We originally carried an identical malloc fix on the faster2 base; on faster3 it's Bouke's commit, so this branch just inherits it.
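
A minimal sketch of the dynamic sizing, assuming the task count is exactly 4 × nthreads (the description only says "~4 ×"). `concat_task_t` is a stand-in for the real task struct, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the real task struct; its fields are not shown in this PR. */
typedef struct { int first; int last; } concat_task_t;

/* Assumed from the description: roughly four tasks per worker thread. */
size_t concat_task_count(int nthreads)
{
    assert(nthreads > 0);
    return 4u * (size_t)nthreads;
}

/* Allocate exactly as many slots as there are tasks instead of a fixed 16.
 * Returns NULL on allocation failure; the caller frees the result. */
concat_task_t *alloc_concat_tasks(int nthreads, size_t *count_out)
{
    size_t n = concat_task_count(nthreads);
    concat_task_t *tasks = calloc(n, sizeof *tasks);
    if (tasks != NULL)
        *count_out = n;
    return tasks;
}
```

Under that assumption nthreads=4 needs 16 slots, which fits the old array, while nthreads=5 needs 20 and overruns it.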


Benchmarks

All numbers measured fresh on the f0ba6e4 builds, both robots: vide stopped, 2-min cooldown before each scenario. Two loads — a synthetic single 32-tag fixture (nthreads=8, isolates per-frame engine cost) and the vide corpus (84 real frames, nthreads=4, production config). cores = average busy cores (CPU-seconds / wall-seconds); lower is more headroom for the rest of vide. pkg °C = median CPU package temperature (x86_pkg_temp) sampled across the run; the mean tracks it within ~1 °C.
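
For reference, the two derived columns can be computed as below. The `cores` formula is the one stated above; treating core-ms per frame as CPU time divided by frames processed is an assumption, and both function names are hypothetical.

```c
#include <assert.h>

/* Average busy cores: CPU time consumed divided by elapsed wall time. */
double avg_busy_cores(double cpu_seconds, double wall_seconds)
{
    assert(wall_seconds > 0.0);
    return cpu_seconds / wall_seconds;
}

/* CPU cost per frame (or per detect call) in core-milliseconds. */
double core_ms_per_frame(double cpu_seconds, long frames)
{
    assert(frames > 0);
    return cpu_seconds * 1000.0 / (double)frames;
}
```

Under these definitions core-ms per frame equals cores / FPS × 1000, which is roughly how the table columns relate to each other.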

Synthetic — single 32-tag fixture (nthreads=8)

32/32 tags every run.

W2 / NUC13 (Raptor Lake i5-1340P, no fp64)

| Config | FPS | core-ms/detect | cores | pkg °C |
|---|---|---|---|---|
| Master (stock CPU) | 12.6 | 452 | 5.71 | 85 |
| GPU branch (#9): GPU frontend + original fit | 18.1 | 239 | 4.34 | 89 |
| This branch: GPU frontend + Bouke's CPU fit | 17.8 | 190 | 3.39 | 80 |
| faster3 pure CPU | 24.8 | 259 | 6.41 | 88 |

W3 / NUC14 (Core Ultra 5 125H, fp64 GPU fit engaged)

| Config | FPS | core-ms/detect | cores | pkg °C |
|---|---|---|---|---|
| Master (stock CPU) | 15.7 | 326 | 5.12 | 101 |
| GPU branch (#9): GPU + fp64 fit | 23.9 | 47 | 1.13 | 83 |
| This branch: GPU + fp64 fit + Bouke's tail | 23.6 | 41 | 0.96 | 79 |
| faster3 pure CPU | 30.1 | 200 | 6.02 | 100 |

Vide corpus — 84 real frames (nthreads=4)

The whole corpus looped for 30 s per scenario; core-ms/frame and cores are per real frame.

W3 / NUC14 (fp64 GPU fit engaged)

| Config | FPS | core-ms/frame | cores | pkg °C |
|---|---|---|---|---|
| Master (stock CPU) | 8.6 | 382 | 3.30 | 102 |
| GPU branch (#9): GPU + fp64 fit | 16.9 | 53 | 0.90 | 97 |
| This branch: GPU + fp64 fit + Bouke's tail | 17.5 | 42 | 0.73 | 86 |
| faster3 pure CPU | 15.7 | 218 | 3.43 | 101 |

W2 / NUC13 (no fp64 → CPU fit)

| Config | FPS | core-ms/frame | cores | pkg °C |
|---|---|---|---|---|
| Master (stock CPU) | 8.2 | 415 | 3.41 | 98 |
| GPU branch (#9): GPU frontend + original fit | 11.4 | 261 | 2.97 | 100 |
| This branch: GPU frontend + Bouke's CPU fit | 14.3 | 166 | 2.36 | 97 |
| faster3 pure CPU | 14.6 | 242 | 3.55 | 96 |

On the corpus, this branch has the best FPS-per-core on both machines (W3: 17.5 FPS at 0.73 cores, −78% CPU vs stock; W2: 14.3 FPS at 2.36 cores), and on W3 it beats #9 outright (17.5 vs 16.9 FPS, 0.73 vs 0.90 cores). On the single-fixture synthetic, #9 and this branch are closer (FPS within ~2%); the corpus separates them because real frames carry more leftover clusters, where Bouke's faster fit + AVX2 decode pay off. faster3 pure CPU posts the highest raw FPS on the synthetic fixture and edges ahead on the W2 corpus (14.6 vs 14.3 FPS), but it saturates the CPU (3.4–6.4 cores).

On W3 the fp64 GPU fit drops the CPU below 1 core, and that offload also runs the package ~15–20 °C cooler than the CPU-bound configs (this branch 79–86 °C vs master / faster3 ~100–102 °C). On W2 (no fp64, fit on the CPU) the temperatures are closer.

Exactness

Bit-exact vs stock master, GPU path vs CPU path, across the full matrix — every case max_coord_diff=0, max_margin_diff=0:

  • Both machines (W2 no-fp64; W3 with the fp64 GPU fit engaged) × both loads (vide corpus, 69 images / 1578 detections; the 32-tag fixture) × nthreads 1 and production (4 for the corpus, 8 for the fixture) × all three configs (#9, this branch, faster3). 24 bit-exact checks per machine.
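
A check like max_coord_diff=0 can be computed with a reduction along these lines. This is a sketch rather than the harness actually used: `max_abs_diff` is a hypothetical name, and it assumes both runs yield the same detections in the same order.

```c
#include <assert.h>
#include <stddef.h>

/* Largest absolute difference between two equally long coordinate arrays.
 * A result of 0.0 means every pair compared equal. NaNs fail both
 * comparisons below, so they are ignored and need a separate check. */
double max_abs_diff(const double *a, const double *b, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = a[i] - b[i];
        if (d < 0.0) d = -d;          /* absolute value without libm */
        if (d > worst) worst = d;
    }
    return worst;
}
```

The same function would serve for decision margins by passing the two margin arrays instead of the coordinates.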

bouk and others added 30 commits June 10, 2026 18:56
--save-detections/--save-timing write per-detection and per-stage TSVs
so builds can be compared for identical output.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
benchmark.sh wraps hyperfine over vide_images with git-rev-labeled
artifacts; .gitignore covers the TSV and benchmark outputs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
quad_segment_maxima's 4-deep maxima loop re-fit identical segments for
every outer combination; memoize fit_line results per maxima pair.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The fixed sigma=1/cutoff=0.05 kernel (size 7) was recomputed with exp()
for every cluster; compute it once per thread.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The error, low-pass, and maxima loops indexed errs[] with % sz; a single
branch adjustment wraps without the integer divide.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
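
The wrap described in the commit above can be sketched as follows. `wrap_index` is a hypothetical helper and handles both directions, so it is not necessarily the exact form used in the loops.

```c
#include <assert.h>

/* Wrap an index that is at most one period out of range into [0, sz).
 * Valid only for -sz <= i < 2*sz, which holds when the offset added to an
 * in-range index is smaller than sz. */
int wrap_index(int i, int sz)
{
    assert(sz > 0 && i >= -sz && i < 2 * sz);
    if (i >= sz)
        i -= sz;        /* ran past the end: step back one period */
    else if (i < 0)
        i += sz;        /* ran before the start: step forward one period */
    return i;
}
```

For in-range-plus-small-offset indices this gives the same result as a non-negative `% sz` without the integer divide.
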
fit_quad malloc'd five buffers and compute_lfps calloc'd a sixth for
every cluster; thread one growable per-task scratch struct through
quad_segment_maxima/compute_lfps/do_quad_task instead.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The old ptsort copied the whole range into temp at every recursion
level. Alternate between two buffers so data is copied only at the
<=5-element leaves. Same splits, leaf networks, and merge tie behavior,
so the sorted order is bit-identical.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
check.sh diffs --save-detections TSVs against a baseline run with
per-field epsilon gates (id/hamming exact, coords and margin tolerant).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Lets sqrt() and friends compile to bare (vectorizable) instructions
instead of errno-guarded libm calls.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Free each cluster in its quad task (parallel, cache-warm) instead of a
serial ~11k-free loop after fit_quads.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Each merge-tree level's pairwise merges fan out to the worker pool
instead of running serially on the calling thread.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Bucket selection becomes a mask instead of an integer divide, and the
table is sized to the slab (well under one entry per 64 slab pixels) so
it stays cache resident instead of spread across 0.2*w*h callocs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
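
A minimal sketch of mask-based bucket selection, assuming the table size is rounded up to a power of two (the commit above does not state how the size is chosen). Both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Round up to a power of two so the table size can serve as a mask. */
uint32_t next_pow2(uint32_t n)
{
    assert(n > 0 && n <= 0x80000000u);
    uint32_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* Equivalent to hash % table_size when table_size is a power of two. */
uint32_t bucket_index(uint32_t hash, uint32_t table_size)
{
    assert(table_size != 0 && (table_size & (table_size - 1)) == 0);
    return hash & (table_size - 1);
}
```
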
With the hash table sized independently of the slab, the pool-chunk
array is seeded small and realloc-doubled as chunks fill.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The collection loop already iterates buckets in order, so the secondary
key is just the bucket index instead of a re-hash.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Consecutive boundary points almost always extend the same cluster, so
remember the last entry and skip the hash lookup when the id matches.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Most pixels are interior to a region and never reach a black/white
boundary neighbor; compute rep0 (and its size gate) lazily on the first
one instead of eagerly for every boundary-candidate pixel.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Read uf->size[rep] + 1 directly instead of calling
unionfind_get_set_size in the hot boundary loop.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
key64 = monotone(slope) << 32 | ~index. Full-u64 merge comparisons
reproduce the historical merge's take-right-on-tie rule (left-run
elements always carry smaller original indices -> larger complements);
leaf networks compare the slope word only, matching no-swap-on-tie.
Bit-identical ordering, but the sort moves 8-byte keys instead of
12-byte structs and compares without function calls; points are
gathered once at the end.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
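
One way to build such a key is sketched below. The commit above does not show how `monotone()` is implemented, so the float-to-integer mapping here is the common sign-flip trick and is an assumption; all three function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Map a float to a uint32_t whose unsigned order follows the float order
 * (NaNs excluded): flip all bits of negatives, set the sign bit of others.
 * Note that -0.0f maps just below +0.0f. */
uint32_t monotone_u32(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* well-defined type pun */
    return (bits & 0x80000000u) ? ~bits : (bits | 0x80000000u);
}

/* High word orders by slope; low word holds the complemented index, so a
 * smaller original index yields a larger key among equal slopes. */
uint64_t make_key64(float slope, uint32_t index)
{
    return ((uint64_t)monotone_u32(slope) << 32) | (uint32_t)~index;
}

/* Recover the original point index from a key. */
uint32_t key64_index(uint64_t key)
{
    return ~(uint32_t)key;
}
```

Because the index is embedded, the sort can move 8-byte keys alone and gather the points once at the end through `key64_index`.
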
Drive cluster construction from row runs: component representatives
resolve once per run (cached) instead of per boundary pixel, 127 spans
skip wholesale, and the same (y, x, neighbor) emission order is kept so
output is bit-identical.

Cluster points accumulate in fixed-size chunks bump-allocated from a
per-task pool instead of doubling zarrays (a frame can produce ~3M
points; growth reallocs were copying tens of MB per frame). Clusters
materialize into exact-size zarrays at collection.

Output verified byte-identical over the 133-image corpus.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
threshim and the four tile min/max arrays are kept on the detector and
reused instead of allocated and freed every frame; every entry is
rewritten before use.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Only run heads (plus the lazily-initialized last column) enter the
union-find: the run-driven cluster pass resolves representatives through
run heads, so the per-pixel parent fill (~25MB of writes/frame) was
waste. Unions operate on head ids -- same union graph, same nodes, so
roots, sizes, and output are identical. The serial stitch becomes a
run-pair pass and the per-pixel line code is gone.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
While a run and the run below keep overlapping, each pixel emits
exactly its (0,1) and (1,1) points into the same cluster entry, so emit
them in a tight loop without per-pixel pointer or gate checks.

Output verified byte-identical over the 133-image corpus.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The per-point window error loop computes the small eigenvalue inline
(identical arithmetic) for windows that don't wrap the cluster ends;
only the wrapping windows still call the general fit_line.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Merge comparisons are data-dependent coin flips; select with a
conditional move instead of a ~50%-mispredicted branch.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
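
A sketch of a merge step written so the selection is data flow rather than control flow. `merge_keys` is a hypothetical name, and whether the compiler actually emits conditional moves for it depends on the compiler and flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Merge two sorted runs of keys into out (capacity nl + nr).
 * Equal keys take the right element first. */
void merge_keys(const uint64_t *left, size_t nl,
                const uint64_t *right, size_t nr, uint64_t *out)
{
    assert(out != NULL);
    size_t i = 0, j = 0, k = 0;
    while (i < nl && j < nr) {
        uint64_t a = left[i], b = right[j];
        int take_left = a < b;            /* 0 or 1 */
        out[k++] = take_left ? a : b;     /* select, not a jump target */
        i += (size_t)take_left;           /* advance exactly one cursor */
        j += (size_t)!take_left;
    }
    while (i < nl) out[k++] = left[i++];  /* copy whichever run remains */
    while (j < nr) out[k++] = right[j++];
}
```
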
Same output: clusters ascend by (hash, id) and split clusters
concatenate their points in task order, exactly as the pairwise tree
produced. Saves a workerpool barrier per level and all intermediate
lists. Detections verified within 1e-4 px of baseline (the build now
uses -march=native, whose FMA contraction shifts coordinates by ~1e-4).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The slope was only ever used to order points, and the ordering now runs
on packed u64 keys built directly in fit_quad's angle loop, so the
stored float is dead weight. Dropping it cuts a third off all point
traffic (chunk storage, merges, collection copies, sort gathers).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The sorted point array was only consumed by compute_lfps. Sort just the
key array and have compute_lfps gather points through the key indices,
skipping the per-cluster point permutation and its scratch buffer.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
__AVX2__-guarded vectorization (scalar fallback kept) for the threshold
stage's 4x4 tile min/max and the 3x3 tile blur. Integer min/max only, so
results are exact.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Rows were being run-length encoded up to four times per frame in scalar
byte loops (union-find tasks, stitch, and gradient clustering each
re-scanned threshim). A single two-pass parallel build (SIMD count via
change masks, prefix-sum, SIMD fill) now produces packed row-run tables
cached on the detector, and all consumers read those instead of
re-scanning the 6.4MB image.

Output verified within 1e-4 px over the 133-image corpus.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
bouk and others added 25 commits June 11, 2026 21:04
Same projection expressions; the matd element loads happen once per
quad instead of per bit.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Four steps per iteration: both bilinear samples, the bounds checks, and
the gradient test run as 4-wide doubles; rejected lanes contribute an
exact 0.0 to the weight accumulators, so only the final reduction order
differs (within 1e-4 px on the corpus). Indices are clamped so masked
lanes still load safely. Scalar tail handles max_steps % 4.

All 8 ctest pass.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Baseline on Apple M3 Pro, 133-image vide_images2 corpus, 4 threads,
tagStandard52h13: 29.0 ms/image detector total, 6.28 s wall.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
4 tiles (16 columns) per iteration vs AVX2's 8; pairwise min/max
collapses the 4x4 tile reduction, vqtbl1q expands tile min/max to
columns, and the same saturating-subtract compare tricks carry over.
Output byte-identical to the scalar baseline (4583/4583 detections).

threshold 3.13 -> 2.07 ms/image on M3 Pro (vide_images2, 4 threads).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
16 bytes per iteration; vshrn narrows the byte-compare mask to a
nibble mask (4 bits per byte), giving ctz/popcount-driven iteration
over change positions like the AVX2 movemask version.
Output byte-identical (4583/4583).

threshold (which includes the fused RLE passes) 2.07 -> 1.68 ms/image.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Two uint64x2 registers stride the packed (x,y,gx,gy) pair pattern
forward 4 points per iteration, same as the AVX2 version's single
256-bit register. Output byte-identical (4583/4583).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
vld4 deinterleaves x/y/gx/gy directly, so the key loop is 4 points
per iteration with none of the AVX2 blend/permute shuffling. Keys are
bit-identical to scalar; the 4-lane dot accumulator only decides the
border-orientation sign. Output byte-identical (4583/4583).

Also gate bench_neon.sh on a quiet-load window; this machine is
shared and numbers taken under load are garbage.

fit quads 12.9 -> 11.6 ms/image.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Window-error eigenvalues 2-wide (with the exact double->float->
sqrtf->double sequence via vcvt_f32_f64/vsqrt_f32), the 7-tap
low-pass filter 4-wide with two accumulator chains, the maxima scan
4-wide with a lane-weighted-add movemask, and the lfps gradient
weights through vsqrtq_f64. Output byte-identical (4583/4583).

fit quads 11.6 -> 10.8 ms/image.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Two steps per iteration in f64x2 with truncating vcvtzs casts,
s64-compare bounds masks, scalar-clamped gathers, and bit-masked
weights so rejected lanes contribute exactly 0.0, mirroring the
4-wide AVX2 version. Output byte-identical (4583/4583).

decode+refinement 1.70 -> 1.59 ms/image.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The serial image_u8_decimate had exactly one caller, the detector,
which always has a worker pool. Replace it with
image_u8_decimate_parallel in image_u8_parallel: output rows are
chunked across the pool, in whole 2-row blocks for the 1.5 path.
Output is identical for every factor.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Decimating by 2 now averages each 2x2 input block instead of point
sampling, with AVX2 and ARM NEON paths (identical rounding tree:
vertical rounding average per column, then a rounding average across
the column pair). Point sampling aliases thin tag borders — whether a
1-2 px border survives decimation depends on its phase relative to the
sample grid — while the box average makes segmentation deterministic
and slightly improves corner accuracy. Sequential instead of strided
reads also make the stage ~7x faster (1.59 -> 0.21 ms/frame on the
benchmark corpus).

On the 133-frame corpus at quad_decimate=2 vs the point sampler:
detection retention statistically unchanged (901 vs 883 missing of
4583), spurious detections down 17 -> 5, mean corner error 0.056 ->
0.052 px. quad_decimate=1 output is unaffected. In the pixel-corner
convention a decimated coordinate c maps to 2*c on the input, so the
existing corner scale-up holds; the residual half-pixel sample-center
shift is absorbed by refine_edges.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
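
A scalar sketch of the rounding tree described in the commit above. The function names and signature are hypothetical, and this version simply drops an odd trailing row or column.

```c
#include <assert.h>
#include <stdint.h>

/* Rounding average of two bytes: (a + b + 1) / 2. */
uint8_t avg_round(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + (unsigned)b + 1u) >> 1);
}

/* Halve an image by averaging each 2x2 block: vertical rounding average per
 * column, then a rounding average across the column pair. */
void decimate2_box(const uint8_t *src, int w, int h, int src_stride,
                   uint8_t *dst, int dst_stride)
{
    assert(w >= 2 && h >= 2 && src_stride >= w && dst_stride >= w / 2);
    for (int y = 0; y < h / 2; y++) {
        const uint8_t *r0 = src + (2 * y) * src_stride;  /* upper input row */
        const uint8_t *r1 = r0 + src_stride;             /* lower input row */
        for (int x = 0; x < w / 2; x++) {
            uint8_t c0 = avg_round(r0[2 * x], r1[2 * x]);
            uint8_t c1 = avg_round(r0[2 * x + 1], r1[2 * x + 1]);
            dst[y * dst_stride + x] = avg_round(c0, c1);
        }
    }
}
```

Two chained rounding averages are not the same as (a + b + c + d + 2) / 4 in every case, so keeping one fixed tree is what allows different code paths to agree on rounding.
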
Python harnesses that ran on the robots via the vision closure's
dt_apriltags: tier_probe (stock/frontend/fit A/B on one frame),
tier_burn (extended reversed-order suite with thermal/clock
trajectory), power_probe (synchronized phase frequency/RAPL sampling
that settled the GPU power-coupling question), clock_spinner (pinned
work-rate clock proxy), galaxy_burst (raw camera capture while vide is
stopped), frame_density/frame_structure (threshold-level scene
analysis), and mock_vide (launcher attempting to shim the broken
upstream --mock camera path).
Re-base of the GPU integration onto faster3, replacing the faster2 base of
the prior james-gpu-x-bouke-cpu (kept as -old). Same per-platform fit:

- GPU frontend available: oclFrontend produces the clusters; where fp64
  exists (W3) the GPU fit decides most and the leftovers are bridged to
  his pt_list for the CPU fit; without fp64 (W2) all clusters go to his
  CPU fit.
- No GPU: his CPU frontend + fit, unchanged.

struct pt / pt_list moved to apriltag_pt.h so the GPU bridge
(oclClustersToPtList) can build his packed clusters.

faster3 already carries the cluster_concat_task ctasks[16] stack-overflow
fix (segfault at nthreads >= 5), so we no longer apply our own. It also
adds the arm64/NEON port and parallel decimation, both inert on the x86
NUCs: NEON is #ifdef(__ARM_NEON)-guarded, and the decimate path is unused
at quad_decimate=1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jameshmcvay jameshmcvay force-pushed the james-gpu-x-bouke-cpu branch from f03ace1 to f0ba6e4 June 16, 2026 17:20
@jameshmcvay jameshmcvay changed the base branch from james/gpu to master June 16, 2026 17:20
@jameshmcvay jameshmcvay changed the title from "GPU frontend + faster2 CPU fit: per-platform AprilTag acceleration" to "GPU frontend + faster3 CPU fit: per-platform AprilTag acceleration" Jun 16, 2026
find_package(OpenCL) succeeds on macOS via Apple's OpenCL.framework, so the
OpenCL front-end (ocl_threshold.c) compiled there and failed on its
`#include <CL/cl.h>` (the Khronos layout Apple doesn't ship). Gate both OpenCL
blocks on NOT APPLE (matching line 90); the GPU detector targets Linux x86 NUCs.
macOS then builds the CPU path cleanly; x86-Linux is unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jameshmcvay jameshmcvay marked this pull request as draft June 18, 2026 08:03