GPU frontend + faster3 CPU fit: per-platform AprilTag acceleration #11
Draft
jameshmcvay wants to merge 106 commits into
--save-detections/--save-timing write per-detection and per-stage TSVs so builds can be compared for identical output. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
benchmark.sh wraps hyperfine over vide_images with git-rev-labeled artifacts; .gitignore covers the TSV and benchmark outputs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
quad_segment_maxima's 4-deep maxima loop re-fit identical segments for every outer combination; memoize fit_line results per maxima pair. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The fixed sigma=1/cutoff=0.05 kernel (size 7) was recomputed with exp() for every cluster; compute it once per thread. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The error, low-pass, and maxima loops indexed errs[] with % sz; a single branch adjustment wraps without the integer divide. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
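The single-branch wrap described in this commit might look like the sketch below. The helper names are hypothetical, and the sketch assumes the step never exceeds `sz`, which is what makes one adjustment sufficient in place of `% sz`.

```c
#include <assert.h>

/* Advance index i by step inside [0, sz).
 * Assumes 0 <= i < sz and 0 <= step <= sz, so a single conditional
 * subtraction replaces the integer divide hidden in (i + step) % sz. */
static inline int wrap_fwd(int i, int step, int sz)
{
    assert(sz > 0 && i >= 0 && i < sz && step >= 0 && step <= sz);
    int r = i + step;
    if (r >= sz)
        r -= sz;
    return r;
}

/* Step backwards under the same assumptions. */
static inline int wrap_back(int i, int step, int sz)
{
    assert(sz > 0 && i >= 0 && i < sz && step >= 0 && step <= sz);
    int r = i - step;
    if (r < 0)
        r += sz;
    return r;
}
```

For example, `wrap_fwd(5, 3, 7)` yields 1 and `wrap_back(1, 3, 7)` yields 5, matching the modulo results.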
fit_quad malloc'd five buffers and compute_lfps calloc'd a sixth for every cluster; thread one growable per-task scratch struct through quad_segment_maxima/compute_lfps/do_quad_task instead. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
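A minimal sketch of a growable scratch buffer in the spirit of this commit. The struct and function names are hypothetical, and it shows a single buffer where the real scratch struct presumably covers all six; the point is only that capacity persists across clusters, so most calls allocate nothing.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-task scratch: owned by one worker and reused across
 * clusters so the hot path stops calling malloc/calloc per cluster. */
struct scratch {
    void  *buf;
    size_t cap;   /* bytes currently allocated */
};

/* Ensure at least `need` bytes are available. Capacity grows by doubling
 * and is never shrunk. Returns NULL on allocation failure, leaving the
 * old buffer intact. Contents are not preserved in any meaningful way,
 * so callers must rewrite what they read. */
static void *scratch_reserve(struct scratch *s, size_t need)
{
    if (need > s->cap) {
        size_t cap = s->cap ? s->cap : 64;
        while (cap < need)
            cap *= 2;
        void *p = realloc(s->buf, cap);
        if (!p)
            return NULL;
        s->buf = p;
        s->cap = cap;
    }
    return s->buf;
}

static void scratch_free(struct scratch *s)
{
    free(s->buf);
    s->buf = NULL;
    s->cap = 0;
}
```

A zero-initialized `struct scratch` is a valid empty scratch; the first `scratch_reserve` call allocates, and later calls with smaller sizes return the same pointer.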
The old ptsort copied the whole range into temp at every recursion level. Alternate between two buffers so data is copied only at the <=5-element leaves. Same splits, leaf networks, and merge tie behavior, so the sorted order is bit-identical. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
check.sh diffs --save-detections TSVs against a baseline run with per-field epsilon gates (id/hamming exact, coords and margin tolerant). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
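The per-field gate could be expressed roughly as below. This is a C sketch of the comparison rule only, not the shell script itself; the struct layout, field names, and tolerance parameters are assumptions, since the TSV columns are not listed here.

```c
#include <assert.h>

/* Hypothetical in-memory form of one --save-detections row. */
struct det_row {
    int    id;
    int    hamming;
    double margin;
    double corners[4][2];
};

static int within(double a, double b, double eps)
{
    double d = a - b;
    if (d < 0)
        d = -d;
    return d <= eps;
}

/* Returns 1 when `b` matches baseline `a`: id and hamming must be equal
 * exactly, while margin and corner coordinates may differ by at most the
 * given tolerances. */
static int det_matches(const struct det_row *a, const struct det_row *b,
                       double coord_eps, double margin_eps)
{
    if (a->id != b->id || a->hamming != b->hamming)
        return 0;
    if (!within(a->margin, b->margin, margin_eps))
        return 0;
    for (int i = 0; i < 4; i++)
        for (int k = 0; k < 2; k++)
            if (!within(a->corners[i][k], b->corners[i][k], coord_eps))
                return 0;
    return 1;
}
```

Passing zero for both tolerances turns the same check into a bit-exact comparison of the numeric fields.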
Lets sqrt() and friends compile to bare (vectorizable) instructions instead of errno-guarded libm calls. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Free each cluster in its quad task (parallel, cache-warm) instead of a serial ~11k-free loop after fit_quads. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Each merge-tree level's pairwise merges fan out to the worker pool instead of running serially on the calling thread. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Bucket selection becomes a mask instead of an integer divide, and the table is sized to the slab (well under one entry per 64 slab pixels) so it stays cache resident instead of spread across 0.2*w*h callocs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
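A sketch of mask-based bucket selection, assuming the table size is rounded up to a power of two (the commit does not spell out how the size is chosen). The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest power of two >= n, for 1 <= n <= 2^31. */
static uint32_t round_up_pow2(uint32_t n)
{
    assert(n >= 1 && n <= 0x80000000u);
    uint32_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* With a power-of-two table, `hash % size` reduces to a single AND,
 * which avoids the integer divide in the hot loop. */
static inline uint32_t bucket_of(uint64_t hash, uint32_t table_size)
{
    assert(table_size != 0 && (table_size & (table_size - 1)) == 0);
    return (uint32_t)(hash & (table_size - 1));
}
```

The mask gives the same bucket as the modulo for any power-of-two size, so switching does not by itself change which bucket a given hash lands in.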
With the hash table sized independently of the slab, the pool-chunk array is seeded small and realloc-doubled as chunks fill. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The collection loop already iterates buckets in order, so the secondary key is just the bucket index instead of a re-hash. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Consecutive boundary points almost always extend the same cluster, so remember the last entry and skip the hash lookup when the id matches. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Most pixels are interior to a region and never reach a black/white boundary neighbor; compute rep0 (and its size gate) lazily on the first one instead of eagerly for every boundary-candidate pixel. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Read uf->size[rep] + 1 directly instead of calling unionfind_get_set_size in the hot boundary loop. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
key64 = monotone(slope) << 32 | ~index. Full-u64 merge comparisons reproduce the historical merge's take-right-on-tie rule (left-run elements always carry smaller original indices -> larger complements); leaf networks compare the slope word only, matching no-swap-on-tie. Bit-identical ordering, but the sort moves 8-byte keys instead of 12-byte structs and compares without function calls; points are gathered once at the end. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
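One way to build such a key is sketched below. The `monotone_u32` mapping is the usual sign-flip trick for ordering IEEE-754 floats as unsigned integers and is an assumption about what `monotone(slope)` does; the function names are hypothetical. It does not treat NaN specially, and it orders -0.0 just below +0.0.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Map a float to a uint32_t whose unsigned order matches the float order:
 * negative values have all bits flipped, non-negative values get the
 * sign bit set. */
static uint32_t monotone_u32(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (bits & 0x80000000u) ? ~bits : (bits | 0x80000000u);
}

/* High word: ordered slope. Low word: complemented original index, so
 * among equal slopes the smaller index yields the larger key. */
static uint64_t make_key(float slope, uint32_t index)
{
    return ((uint64_t)monotone_u32(slope) << 32) | (uint32_t)~index;
}

/* Recover the original point index so points can be gathered after the sort. */
static uint32_t key_index(uint64_t key)
{
    return ~(uint32_t)key;
}
```

Comparing only `key >> 32` gives the slope-only comparison the leaf networks use, while comparing the full 64-bit value also orders by the complemented index.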
Drive cluster construction from row runs: component representatives resolve once per run (cached) instead of per boundary pixel, 127 spans skip wholesale, and the same (y, x, neighbor) emission order is kept so output is bit-identical. Cluster points accumulate in fixed-size chunks bump-allocated from a per-task pool instead of doubling zarrays (a frame can produce ~3M points; growth reallocs were copying tens of MB per frame). Clusters materialize into exact-size zarrays at collection. Output verified byte-identical over the 133-image corpus. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
threshim and the four tile min/max arrays are kept on the detector and reused instead of allocated and freed every frame; every entry is rewritten before use. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Only run heads (plus the lazily-initialized last column) enter the union-find: the run-driven cluster pass resolves representatives through run heads, so the per-pixel parent fill (~25MB of writes/frame) was waste. Unions operate on head ids -- same union graph, same nodes, so roots, sizes, and output are identical. The serial stitch becomes a run-pair pass and the per-pixel line code is gone. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
While a run and the run below keep overlapping, each pixel emits exactly its (0,1) and (1,1) points into the same cluster entry, so emit them in a tight loop without per-pixel pointer or gate checks. Output verified byte-identical over the 133-image corpus. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The per-point window error loop computes the small eigenvalue inline (identical arithmetic) for windows that don't wrap the cluster ends; only the wrapping windows still call the general fit_line. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Merge comparisons are data-dependent coin flips; select with a conditional move instead of a ~50%-mispredicted branch. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
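A sketch of a merge step written so the compiler can use a conditional move. Whether a `cmov`/`csel` is actually emitted depends on the compiler and flags, so this only illustrates the branch-free shape; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Merge sorted runs a[0..na) and b[0..nb) into out. The element choice is
 * a value select and the cursor updates are arithmetic, so the loop body
 * has no data-dependent branch. Ties take the right-hand element. */
static void merge_u64(const uint64_t *a, size_t na,
                      const uint64_t *b, size_t nb, uint64_t *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb) {
        uint64_t x = a[i], y = b[j];
        size_t take_left = (size_t)(x < y);
        out[k++] = take_left ? x : y;
        i += take_left;
        j += 1 - take_left;
    }
    /* One run is exhausted; copy the remainder of the other. */
    while (i < na)
        out[k++] = a[i++];
    while (j < nb)
        out[k++] = b[j++];
}
```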
Same output: clusters ascend by (hash, id) and split clusters concatenate their points in task order, exactly as the pairwise tree produced. Saves a workerpool barrier per level and all intermediate lists. Detections verified within 1e-4 px of baseline (the build now uses -march=native, whose FMA contraction shifts coordinates by ~1e-4). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The slope was only ever used to order points, and the ordering now runs on packed u64 keys built directly in fit_quad's angle loop, so the stored float is dead weight. Dropping it cuts a third off all point traffic (chunk storage, merges, collection copies, sort gathers). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The sorted point array was only consumed by compute_lfps. Sort just the key array and have compute_lfps gather points through the key indices, skipping the per-cluster point permutation and its scratch buffer. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
__AVX2__-guarded vectorization (scalar fallback kept) for the threshold stage's 4x4 tile min/max and the 3x3 tile blur. Integer min/max only, so results are exact. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Rows were being run-length encoded up to four times per frame in scalar byte loops (union-find tasks, stitch, and gradient clustering each re-scanned threshim). A single two-pass parallel build (SIMD count via change masks, prefix-sum, SIMD fill) now produces packed row-run tables cached on the detector, and all consumers read those instead of re-scanning the 6.4MB image. Output verified within 1e-4 px over the 133-image corpus. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
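The count, prefix-sum, fill structure can be sketched in scalar C as below. This is a serial stand-in for the parallel SIMD build, and the struct layout and names are hypothetical. Counting first means every row knows its output offset before any run is written, which is what would let the fill pass run per row in parallel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct run { uint16_t x0; uint16_t len; uint8_t value; };

struct run_table {
    uint32_t   *row_start;  /* h + 1 entries; runs of row y are
                               runs[row_start[y] .. row_start[y+1]) */
    struct run *runs;
};

/* Pass 1 helper: a row has one run plus one per value change. */
static uint32_t count_row_runs(const uint8_t *row, int w)
{
    uint32_t n = (w > 0);
    for (int x = 1; x < w; x++)
        n += (row[x] != row[x - 1]);
    return n;
}

static int build_run_table(const uint8_t *im, int w, int h, int stride,
                           struct run_table *t)
{
    t->row_start = malloc((size_t)(h + 1) * sizeof *t->row_start);
    if (!t->row_start)
        return -1;
    /* Pass 1: per-row counts (independent per row). */
    t->row_start[0] = 0;
    for (int y = 0; y < h; y++)
        t->row_start[y + 1] = count_row_runs(im + (size_t)y * stride, w);
    /* Prefix sum turns counts into offsets. */
    for (int y = 0; y < h; y++)
        t->row_start[y + 1] += t->row_start[y];
    uint32_t total = t->row_start[h];
    t->runs = malloc((total ? total : 1) * sizeof *t->runs);
    if (!t->runs) {
        free(t->row_start);
        return -1;
    }
    /* Pass 2: each row writes into its own slice (independent per row). */
    for (int y = 0; y < h; y++) {
        const uint8_t *row = im + (size_t)y * stride;
        struct run *out = t->runs + t->row_start[y];
        int start = 0;
        for (int x = 1; x <= w; x++) {
            if (x == w || row[x] != row[start]) {
                out->x0 = (uint16_t)start;
                out->len = (uint16_t)(x - start);
                out->value = row[start];
                out++;
                start = x;
            }
        }
    }
    return 0;
}

static void run_table_free(struct run_table *t)
{
    free(t->runs);
    free(t->row_start);
}
```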
Same projection expressions; the matd element loads happen once per quad instead of per bit. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Four steps per iteration: both bilinear samples, the bounds checks, and the gradient test run as 4-wide doubles; rejected lanes contribute an exact 0.0 to the weight accumulators, so only the final reduction order differs (within 1e-4 px on the corpus). Indices are clamped so masked lanes still load safely. Scalar tail handles max_steps % 4. All 8 ctest pass. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Baseline on Apple M3 Pro, 133-image vide_images2 corpus, 4 threads, tagStandard52h13: 29.0 ms/image detector total, 6.28 s wall. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
4 tiles (16 columns) per iteration vs AVX2's 8; pairwise min/max collapses the 4x4 tile reduction, vqtbl1q expands tile min/max to columns, and the same saturating-subtract compare tricks carry over. Output byte-identical to the scalar baseline (4583/4583 detections). threshold 3.13 -> 2.07 ms/image on M3 Pro (vide_images2, 4 threads). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
16 bytes per iteration; vshrn narrows the byte-compare mask to a nibble mask (4 bits per byte), giving ctz/popcount-driven iteration over change positions like the AVX2 movemask version. Output byte-identical (4583/4583). threshold (which includes the fused RLE passes) 2.07 -> 1.68 ms/image. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Two uint64x2 registers stride the packed (x,y,gx,gy) pair pattern forward 4 points per iteration, same as the AVX2 version's single 256-bit register. Output byte-identical (4583/4583). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
vld4 deinterleaves x/y/gx/gy directly, so the key loop is 4 points per iteration with none of the AVX2 blend/permute shuffling. Keys are bit-identical to scalar; the 4-lane dot accumulator only decides the border-orientation sign. Output byte-identical (4583/4583). Also gate bench_neon.sh on a quiet-load window; this machine is shared and numbers taken under load are garbage. fit quads 12.9 -> 11.6 ms/image. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Window-error eigenvalues 2-wide (with the exact double->float->sqrtf->double sequence via vcvt_f32_f64/vsqrt_f32), the 7-tap low-pass filter 4-wide with two accumulator chains, the maxima scan 4-wide with a lane-weighted-add movemask, and the lfps gradient weights through vsqrtq_f64. Output byte-identical (4583/4583). fit quads 11.6 -> 10.8 ms/image. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Two steps per iteration in f64x2 with truncating vcvtzs casts, s64-compare bounds masks, scalar-clamped gathers, and bit-masked weights so rejected lanes contribute exactly 0.0, mirroring the 4-wide AVX2 version. Output byte-identical (4583/4583). decode+refinement 1.70 -> 1.59 ms/image. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
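The bit-masking idea can be shown in scalar form. This sketch is not the NEON code: it uses one value per "lane" and hypothetical names, and only demonstrates why ANDing a weight with an all-zeros mask makes a rejected sample add exactly 0.0 to the accumulator (given finite sample values).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* AND the weight's bit pattern with an all-ones or all-zeros mask.
 * A rejected weight becomes +0.0 exactly, with no branch. */
static double mask_weight(double w, int accept)
{
    uint64_t bits;
    uint64_t mask = accept ? ~(uint64_t)0 : 0;
    memcpy(&bits, &w, sizeof bits);
    bits &= mask;
    memcpy(&w, &bits, sizeof w);
    return w;
}

/* Accumulate weighted samples; rejected ones still flow through the
 * arithmetic but contribute nothing. */
static double weighted_sum(const double *vals, const double *weights,
                           const int *accept, int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc += vals[i] * mask_weight(weights[i], accept[i]);
    return acc;
}
```

Because the masked terms are exact zeros, the vector and scalar versions can only differ in the order in which the surviving terms are summed.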
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The serial image_u8_decimate had exactly one caller, the detector, which always has a worker pool. Replace it with image_u8_decimate_parallel in image_u8_parallel: output rows are chunked across the pool, in whole 2-row blocks for the 1.5 path. Output is identical for every factor. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Decimating by 2 now averages each 2x2 input block instead of point sampling, with AVX2 and ARM NEON paths (identical rounding tree: vertical rounding average per column, then a rounding average across the column pair). Point sampling aliases thin tag borders — whether a 1-2 px border survives decimation depends on its phase relative to the sample grid — while the box average makes segmentation deterministic and slightly improves corner accuracy. Sequential instead of strided reads also make the stage ~7x faster (1.59 -> 0.21 ms/frame on the benchmark corpus). On the 133-frame corpus at quad_decimate=2 vs the point sampler: detection retention statistically unchanged (901 vs 883 missing of 4583), spurious detections down 17 -> 5, mean corner error 0.056 -> 0.052 px. quad_decimate=1 output is unaffected. In the pixel-corner convention a decimated coordinate c maps to 2*c on the input, so the existing corner scale-up holds; the residual half-pixel sample-center shift is absorbed by refine_edges. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
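A scalar sketch of the rounding tree described above: a vertical rounding average per column, then a rounding average across the column pair. The function names are hypothetical. `avg_round` is the `(a + b + 1) >> 1` operation that byte rounding-average SIMD instructions compute, which is assumed to be what the AVX2 and NEON paths rely on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static inline uint8_t avg_round(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* Halve an 8-bit image by averaging each 2x2 block.
 * Output is (w/2) x (h/2); odd trailing rows and columns are dropped. */
static void decimate2_box(const uint8_t *in, int w, int h, int in_stride,
                          uint8_t *out, int out_stride)
{
    int ow = w / 2, oh = h / 2;
    for (int y = 0; y < oh; y++) {
        const uint8_t *r0 = in + (size_t)(2 * y) * in_stride;
        const uint8_t *r1 = r0 + in_stride;
        uint8_t *o = out + (size_t)y * out_stride;
        for (int x = 0; x < ow; x++) {
            uint8_t c0 = avg_round(r0[2 * x], r1[2 * x]);         /* left column  */
            uint8_t c1 = avg_round(r0[2 * x + 1], r1[2 * x + 1]); /* right column */
            o[x] = avg_round(c0, c1);
        }
    }
}
```

Two-stage rounding is not always equal to `(sum + 2) >> 2`: the block 0, 0, 0, 1 gives 1 here but 0 with the single-step formula. That is why scalar and SIMD paths need the same tree to stay bit-identical with each other.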
Python harnesses that ran on the robots via the vision closure's dt_apriltags: tier_probe (stock/frontend/fit A/B on one frame), tier_burn (extended reversed-order suite with thermal/clock trajectory), power_probe (synchronized phase frequency/RAPL sampling that settled the GPU power-coupling question), clock_spinner (pinned work-rate clock proxy), galaxy_burst (raw camera capture while vide is stopped), frame_density/frame_structure (threshold-level scene analysis), and mock_vide (launcher attempting to shim the broken upstream --mock camera path).
Re-base of the GPU integration onto faster3, replacing the faster2 base of the prior james-gpu-x-bouke-cpu (kept as -old). Same per-platform fit:
- GPU frontend available: oclFrontend produces the clusters; where fp64 exists (W3) the GPU fit decides most and the leftovers are bridged to his pt_list for the CPU fit; without fp64 (W2) all clusters go to his CPU fit.
- No GPU: his CPU frontend + fit, unchanged.

struct pt / pt_list moved to apriltag_pt.h so the GPU bridge (oclClustersToPtList) can build his packed clusters.

faster3 already carries the cluster_concat_task ctasks[16] stack-overflow fix (segfault at nthreads >= 5), so we no longer apply our own. It also adds the arm64/NEON port and parallel decimation, both inert on the x86 NUCs: NEON is #ifdef(__ARM_NEON)-guarded, and the decimate path is unused at quad_decimate=1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
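The routing rule in this commit reduces to two capability checks. The sketch below uses hypothetical enum and function names and leaves out how the capabilities are probed; it only encodes the decision table.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the three configurations described above. */
enum fit_route {
    ROUTE_CPU_ONLY,              /* no GPU: CPU frontend + CPU fit            */
    ROUTE_GPU_FRONTEND_CPU_FIT,  /* GPU, no fp64 (W2): all clusters to CPU fit */
    ROUTE_GPU_FRONTEND_GPU_FIT   /* GPU + fp64 (W3): GPU fit, leftovers to CPU */
};

static enum fit_route choose_fit_route(bool has_gpu, bool has_fp64)
{
    if (!has_gpu)
        return ROUTE_CPU_ONLY;
    return has_fp64 ? ROUTE_GPU_FRONTEND_GPU_FIT
                    : ROUTE_GPU_FRONTEND_CPU_FIT;
}
```

Note that fp64 support is irrelevant without a GPU frontend, so `choose_fit_route(false, true)` still selects the CPU-only route.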
f03ace1 to f0ba6e4
find_package(OpenCL) succeeds on macOS via Apple's OpenCL.framework, so the OpenCL front-end (ocl_threshold.c) compiled there and failed on its `#include <CL/cl.h>` (the Khronos layout Apple doesn't ship). Gate both OpenCL blocks on NOT APPLE (matching line 90); the GPU detector targets Linux x86 NUCs. macOS then builds the CPU path cleanly; x86-Linux is unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The OpenCL GPU AprilTag detector (#9) combined with Bouke's `faster3` CPU + arm64 campaign, so the fit stage uses the best implementation per platform.

Based on `master`, so the full stack is visible together: the GPU frontend/fit work (#9, authored on `james/gpu`) and the `faster3` integration. For the GPU detector on its own (implementation, correctness contract, fleet/soak validation), see #9; this PR adds the per-platform fit routing and the bridge between the two.

Per-platform fit

The GPU frontend (#9) is used whenever a GPU is available; the fit is chosen by capability:
- GPU with fp64 (W3): the GPU fit decides most clusters, and the leftovers are bridged to Bouke's CPU fit.
- GPU without fp64 (W2): all clusters go to Bouke's CPU fit.
- No GPU: Bouke's CPU frontend + fit, unchanged.

`oclClustersToPtList` bridges the GPU frontend's clusters into Bouke's packed `pt_list` (materializing only the leftovers the GPU fit hands back); `struct pt` / `pt_list` moved to `apriltag_pt.h` so the bridge can build them.

`ctasks[16]` stack overflow, carried by faster3

`cluster_concat_task ctasks[16]` was a fixed 16-slot stack array, but the task count is ~4 × `nthreads`, so it overflowed and segfaulted at nthreads ≥ 5 (≤ 4 is safe, which is why production `nthreads=4` never tripped it). `faster3` sizes it dynamically. We originally carried an identical `malloc` fix on the faster2 base; on faster3 it's Bouke's commit, so this branch just inherits it.

Benchmarks
All numbers measured fresh on the `f0ba6e4` builds, both robots: vide stopped, 2-min cooldown before each scenario. Two loads: a synthetic single 32-tag fixture (`nthreads=8`, isolates per-frame engine cost) and the vide corpus (84 real frames, `nthreads=4`, production config).

`cores` = average busy cores (CPU-seconds / wall-seconds); lower is more headroom for the rest of vide. `pkg °C` = median CPU package temperature (x86_pkg_temp) sampled across the run; the mean tracks it within ~1 °C.

Synthetic: single 32-tag fixture (`nthreads=8`)

32/32 tags every run.
W2 / NUC13 (Raptor Lake i5-1340P, no fp64)

`faster3` pure CPU

W3 / NUC14 (Core Ultra 5 125H, fp64 GPU fit engaged)

`faster3` pure CPU

Vide corpus: 84 real frames (`nthreads=4`)

The whole corpus looped for 30 s per scenario; `core-ms/frame` and `cores` are per real frame.

W3 / NUC14 (fp64 GPU fit engaged)

`faster3` pure CPU

W2 / NUC13 (no fp64 → CPU fit)

`faster3` pure CPU

On the corpus, this branch is the best FPS-per-core on both machines (W3: 17.5 FPS at 0.73 cores, −78% CPU vs stock; W2: 14.3 FPS at 2.36 cores), and on W3 it beats #9 outright (17.5 vs 16.9 FPS, 0.73 vs 0.90 cores). The single-fixture synthetic is closer between #9 and this branch (within ~1%); the corpus separates them because real frames carry more leftover clusters, where Bouke's faster fit + AVX2 decode pay off. `faster3` pure CPU posts the highest raw FPS but saturates the CPU (3.4–6.4 cores).

On W3 the fp64 GPU fit drops the CPU below 1 core, and that offload also runs the package ~15–20 °C cooler than the CPU-bound configs (this branch 79–86 °C vs master / `faster3` ~100–102 °C). On W2 (no fp64, fit on the CPU) the temperatures are closer.

Exactness
Bit-exact vs stock master, GPU path vs CPU path, across the full matrix; every case `max_coord_diff=0`, `max_margin_diff=0`:

`faster3`). 24 bit-exact checks per machine.