GPU frontend + faster3 CPU fit: per-platform AprilTag acceleration by jameshmcvay · Pull Request #11 · BuildMonumental/apriltag

jameshmcvay · 2026-06-16T14:38:38Z

This PR combines the OpenCL GPU AprilTag detector (#9) with Bouke's faster3 CPU + arm64 campaign, so the fit stage uses the best implementation per platform.

Based on master, so the full stack is visible together: the GPU frontend/fit work (#9, authored on james/gpu) and the faster3 integration. For the GPU detector on its own (implementation, correctness contract, fleet/soak validation), see #9 — this PR adds the per-platform fit routing and the bridge between the two.

Per-platform fit

The GPU frontend (#9) is used whenever a GPU is available; the fit is chosen by capability:

Platform	Frontend	Fit
GPU + fp64 (NUC14/15)	GPU	GPU fp64 fit (#9); leftovers → Bouke's CPU fit
GPU, no fp64 (NUC13)	GPU	Bouke's faster CPU fit (bridged)
no GPU	Bouke's CPU frontend	Bouke's CPU fit

oclClustersToPtList bridges the GPU frontend's clusters into Bouke's packed pt_list (materializing only the leftovers the GPU fit hands back); struct pt / pt_list moved to apriltag_pt.h so the bridge can build them.
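The routing in the table above can be sketched as a small selection function. This is only an illustration of the decision: the enum, struct, and function names are hypothetical and are not the identifiers used in this branch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration; not the identifiers used in the branch. */
typedef enum { FRONTEND_CPU, FRONTEND_GPU } frontend_kind;
typedef enum {
    FIT_CPU,               /* no GPU: CPU frontend + CPU fit */
    FIT_GPU_FP64_THEN_CPU, /* GPU fp64 fit, leftovers handed to the CPU fit */
    FIT_CPU_BRIDGED        /* GPU frontend, clusters bridged to the CPU fit */
} fit_kind;

typedef struct {
    frontend_kind frontend;
    fit_kind fit;
} detect_route;

/* Pick the frontend and fit from two capability flags, mirroring the table. */
static detect_route choose_route(bool has_gpu, bool has_fp64)
{
    detect_route r;
    if (!has_gpu) {
        r.frontend = FRONTEND_CPU;
        r.fit = FIT_CPU;
    } else if (has_fp64) {
        r.frontend = FRONTEND_GPU;
        r.fit = FIT_GPU_FP64_THEN_CPU;
    } else {
        r.frontend = FRONTEND_GPU;
        r.fit = FIT_CPU_BRIDGED;
    }
    return r;
}
```

With these flags, a NUC14/15 would map to `choose_route(true, true)`, a NUC13 to `choose_route(true, false)`, and a machine without a GPU to `choose_route(false, false)`.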

`ctasks[16]` stack overflow — carried by faster3

cluster_concat_task ctasks[16] was a fixed 16-slot stack array, but the task count is ~4 × nthreads — so it overflowed and segfaulted at nthreads ≥ 5 (≤4 is safe, which is why production nthreads=4 never tripped it). faster3 sizes it dynamically. We originally carried an identical malloc fix on the faster2 base; on faster3 it's Bouke's commit, so this branch just inherits it.
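A minimal sketch of the kind of fix described, assuming the task count is exactly 4 × nthreads (the text only says roughly). The `concat_task` struct and both helper names are hypothetical stand-ins, not the code in faster3.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for cluster_concat_task; the real struct differs. */
typedef struct {
    int first;
    int count;
} concat_task;

/* Assumed task count: 4 per worker thread, as a simplification of "~4 x nthreads". */
static size_t concat_task_count(int nthreads)
{
    assert(nthreads > 0);
    return 4u * (size_t)nthreads;
}

/* Allocate one zeroed slot per task instead of a fixed 16-slot stack array.
 * Returns NULL (and sets *ntasks_out to 0) on allocation failure; caller frees. */
static concat_task *concat_tasks_alloc(int nthreads, size_t *ntasks_out)
{
    size_t n = concat_task_count(nthreads);
    concat_task *tasks = calloc(n, sizeof *tasks);
    *ntasks_out = (tasks != NULL) ? n : 0;
    return tasks;
}
```

Under that assumption, nthreads=4 needs 16 slots, which just fits the old array, while nthreads=5 needs 20 and writes past it.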

Benchmarks

All numbers measured fresh on the f0ba6e4 builds, both robots: vide stopped, 2-min cooldown before each scenario. Two loads — a synthetic single 32-tag fixture (nthreads=8, isolates per-frame engine cost) and the vide corpus (84 real frames, nthreads=4, production config). cores = average busy cores (CPU-seconds / wall-seconds); lower is more headroom for the rest of vide. pkg °C = median CPU package temperature (x86_pkg_temp) sampled across the run; the mean tracks it within ~1 °C.
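The cores column follows from the definition above, and it can also be cross-checked against the other two columns, since CPU time per detect times detects per second is CPU time per wall second. A small sketch of both forms (the function names are illustrative, not part of the benchmark scripts):

```c
#include <assert.h>

/* Average busy cores over a run: total CPU time divided by elapsed wall time. */
static double busy_cores(double cpu_seconds, double wall_seconds)
{
    assert(wall_seconds > 0.0);
    return cpu_seconds / wall_seconds;
}

/* Same quantity from per-detect cost:
 * (core-ms per detect) * (detects per second) / 1000 ms per second. */
static double busy_cores_from_rate(double fps, double core_ms_per_detect)
{
    return fps * core_ms_per_detect / 1000.0;
}
```

For example, the W2 stock row (12.6 FPS, 452 core-ms/detect) gives about 5.70 by this formula, which agrees with the reported 5.71 to within the rounding of the published columns.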

Synthetic — single 32-tag fixture (`nthreads=8`)

32/32 tags every run.

W2 / NUC13 (Raptor Lake i5-1340P, no fp64)

Config	FPS	core-ms/detect	cores	pkg °C
Master (stock CPU)	12.6	452	5.71	85
GPU branch (#9): GPU frontend + original fit	18.1	239	4.34	89
This branch: GPU frontend + Bouke's CPU fit	17.8	190	3.39	80
`faster3` pure CPU	24.8	259	6.41	88

W3 / NUC14 (Core Ultra 5 125H, fp64 GPU fit engaged)

Config	FPS	core-ms/detect	cores	pkg °C
Master (stock CPU)	15.7	326	5.12	101
GPU branch (#9): GPU + fp64 fit	23.9	47	1.13	83
This branch: GPU + fp64 fit + Bouke's tail	23.6	41	0.96	79
`faster3` pure CPU	30.1	200	6.02	100

Vide corpus — 84 real frames (`nthreads=4`)

The whole corpus looped for 30 s per scenario; core-ms/frame and cores are per real frame.

W3 / NUC14 (fp64 GPU fit engaged)

Config	FPS	core-ms/frame	cores	pkg °C
Master (stock CPU)	8.6	382	3.30	102
GPU branch (#9): GPU + fp64 fit	16.9	53	0.90	97
This branch: GPU + fp64 fit + Bouke's tail	17.5	42	0.73	86
`faster3` pure CPU	15.7	218	3.43	101

W2 / NUC13 (no fp64 → CPU fit)

Config	FPS	core-ms/frame	cores	pkg °C
Master (stock CPU)	8.2	415	3.41	98
GPU branch (#9): GPU frontend + original fit	11.4	261	2.97	100
This branch: GPU frontend + Bouke's CPU fit	14.3	166	2.36	97
`faster3` pure CPU	14.6	242	3.55	96

On the corpus, this branch has the best FPS-per-core on both machines (W3: 17.5 FPS at 0.73 cores, −78% CPU vs stock; W2: 14.3 FPS at 2.36 cores), and on W3 it beats #9 outright (17.5 vs 16.9 FPS, 0.73 vs 0.90 cores). On the single-fixture synthetic, #9 and this branch are closer (FPS within ~2%); the corpus separates them because real frames carry more leftover clusters, where Bouke's faster fit + AVX2 decode pay off. faster3 pure CPU posts the highest raw FPS on the synthetic fixture and on the W2 corpus, but saturates the CPU (3.4–6.4 cores).

On W3 the fp64 GPU fit drops the CPU below 1 core, and that offload also runs the package ~15–20 °C cooler than the CPU-bound configs (this branch 79–86 °C vs master / faster3 ~100–102 °C). On W2 (no fp64, fit on the CPU) the temperatures are closer.

Exactness

Bit-exact vs stock master, GPU path vs CPU path, across the full matrix — every case max_coord_diff=0, max_margin_diff=0:

Both machines (W2 no-fp64; W3 with the fp64 GPU fit engaged) × both loads (vide corpus, 69 images / 1578 detections; the 32-tag fixture) × nthreads 1 and production (4 for the corpus, 8 for the fixture) × all three configs (#9, this branch, faster3). 24 bit-exact checks per machine.

fit_quad malloc'd five buffers and compute_lfps calloc'd a sixth for every cluster; thread one growable per-task scratch struct through quad_segment_maxima/compute_lfps/do_quad_task instead. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The old ptsort copied the whole range into temp at every recursion level. Alternate between two buffers so data is copied only at the <=5-element leaves. Same splits, leaf networks, and merge tie behavior, so the sorted order is bit-identical. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Bucket selection becomes a mask instead of an integer divide, and the table is sized to the slab (well under one entry per 64 slab pixels) so it stays cache resident instead of spread across 0.2*w*h callocs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
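Selecting a bucket with a mask only works when the table size is a power of two. The sketch below shows that equivalence in isolation; the helper names are hypothetical and the sizing policy of the real table is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Round n up to the next power of two so that "hash & (size - 1)" is a valid index.
 * Requires 1 <= n <= 2^31 so the shift cannot overflow. */
static uint32_t next_pow2(uint32_t n)
{
    assert(n >= 1u && n <= 0x80000000u);
    uint32_t p = 1u;
    while (p < n)
        p <<= 1;
    return p;
}

/* Original style: an integer divide on every lookup. */
static uint32_t bucket_by_mod(uint32_t hash, uint32_t nbuckets)
{
    return hash % nbuckets;
}

/* Mask style: a single AND; nbuckets_pow2 must be a power of two. */
static uint32_t bucket_by_mask(uint32_t hash, uint32_t nbuckets_pow2)
{
    return hash & (nbuckets_pow2 - 1u);
}
```

For a power-of-two size both functions return the same bucket, so switching to the mask changes the cost of the lookup but not which bucket a given hash lands in.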

Most pixels are interior to a region and never reach a black/white boundary neighbor; compute rep0 (and its size gate) lazily on the first one instead of eagerly for every boundary-candidate pixel. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

key64 = monotone(slope) << 32 | ~index. Full-u64 merge comparisons reproduce the historical merge's take-right-on-tie rule (left-run elements always carry smaller original indices -> larger complements); leaf networks compare the slope word only, matching no-swap-on-tie. Bit-identical ordering, but the sort moves 8-byte keys instead of 12-byte structs and compares without function calls; points are gathered once at the end. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
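One way to build such a key is sketched below. The `monotone_u32` mapping is the common sign-flip trick for ordering IEEE-754 floats as unsigned integers and is an assumption about what monotone() does; the function names are illustrative. NaN slopes are not handled, and -0.0 orders just below +0.0 rather than equal to it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Map a float to a uint32 whose unsigned order matches the float order
 * (assumed technique: flip all bits of negatives, set the sign bit of non-negatives). */
static uint32_t monotone_u32(float f)
{
    uint32_t b;
    memcpy(&b, &f, sizeof b); /* bit copy without aliasing problems */
    return (b & 0x80000000u) ? ~b : (b | 0x80000000u);
}

/* High word: ordered slope. Low word: complemented original index, so that
 * among equal slopes the element with the smaller index has the larger key. */
static uint64_t make_key64(float slope, uint32_t index)
{
    return ((uint64_t)monotone_u32(slope) << 32) | (uint32_t)~index;
}

/* Recover the original point index from a key. */
static uint32_t key64_index(uint64_t key)
{
    return ~(uint32_t)key;
}
```

Because the index is embedded, two keys are never equal, so a full 64-bit comparison always resolves ties between equal slopes in a fixed direction.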

Drive cluster construction from row runs: component representatives resolve once per run (cached) instead of per boundary pixel, 127 spans skip wholesale, and the same (y, x, neighbor) emission order is kept so output is bit-identical. Cluster points accumulate in fixed-size chunks bump-allocated from a per-task pool instead of doubling zarrays (a frame can produce ~3M points; growth reallocs were copying tens of MB per frame). Clusters materialize into exact-size zarrays at collection. Output verified byte-identical over the 133-image corpus. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Only run heads (plus the lazily-initialized last column) enter the union-find: the run-driven cluster pass resolves representatives through run heads, so the per-pixel parent fill (~25MB of writes/frame) was waste. Unions operate on head ids -- same union graph, same nodes, so roots, sizes, and output are identical. The serial stitch becomes a run-pair pass and the per-pixel line code is gone. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

While a run and the run below keep overlapping, each pixel emits exactly its (0,1) and (1,1) points into the same cluster entry, so emit them in a tight loop without per-pixel pointer or gate checks. Output verified byte-identical over the 133-image corpus. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The per-point window error loop computes the small eigenvalue inline (identical arithmetic) for windows that don't wrap the cluster ends; only the wrapping windows still call the general fit_line. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Same output: clusters ascend by (hash, id) and split clusters concatenate their points in task order, exactly as the pairwise tree produced. Saves a workerpool barrier per level and all intermediate lists. Detections verified within 1e-4 px of baseline (the build now uses -march=native, whose FMA contraction shifts coordinates by ~1e-4). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The slope was only ever used to order points, and the ordering now runs on packed u64 keys built directly in fit_quad's angle loop, so the stored float is dead weight. Dropping it cuts a third off all point traffic (chunk storage, merges, collection copies, sort gathers). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The sorted point array was only consumed by compute_lfps. Sort just the key array and have compute_lfps gather points through the key indices, skipping the per-cluster point permutation and its scratch buffer. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Rows were being run-length encoded up to four times per frame in scalar byte loops (union-find tasks, stitch, and gradient clustering each re-scanned threshim). A single two-pass parallel build (SIMD count via change masks, prefix-sum, SIMD fill) now produces packed row-run tables cached on the detector, and all consumers read those instead of re-scanning the 6.4MB image. Output verified within 1e-4 px over the 133-image corpus. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Four steps per iteration: both bilinear samples, the bounds checks, and the gradient test run as 4-wide doubles; rejected lanes contribute an exact 0.0 to the weight accumulators, so only the final reduction order differs (within 1e-4 px on the corpus). Indices are clamped so masked lanes still load safely. Scalar tail handles max_steps % 4. All 8 ctest pass. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

4 tiles (16 columns) per iteration vs AVX2's 8; pairwise min/max collapses the 4x4 tile reduction, vqtbl1q expands tile min/max to columns, and the same saturating-subtract compare tricks carry over. Output byte-identical to the scalar baseline (4583/4583 detections). threshold 3.13 -> 2.07 ms/image on M3 Pro (vide_images2, 4 threads). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

16 bytes per iteration; vshrn narrows the byte-compare mask to a nibble mask (4 bits per byte), giving ctz/popcount-driven iteration over change positions like the AVX2 movemask version. Output byte-identical (4583/4583). threshold (which includes the fused RLE passes) 2.07 -> 1.68 ms/image. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Two uint64x2 registers stride the packed (x,y,gx,gy) pair pattern forward 4 points per iteration, same as the AVX2 version's single 256-bit register. Output byte-identical (4583/4583). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

vld4 deinterleaves x/y/gx/gy directly, so the key loop is 4 points per iteration with none of the AVX2 blend/permute shuffling. Keys are bit-identical to scalar; the 4-lane dot accumulator only decides the border-orientation sign. Output byte-identical (4583/4583). Also gate bench_neon.sh on a quiet-load window; this machine is shared and numbers taken under load are garbage. fit quads 12.9 -> 11.6 ms/image. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Window-error eigenvalues 2-wide (with the exact double->float-> sqrtf->double sequence via vcvt_f32_f64/vsqrt_f32), the 7-tap low-pass filter 4-wide with two accumulator chains, the maxima scan 4-wide with a lane-weighted-add movemask, and the lfps gradient weights through vsqrtq_f64. Output byte-identical (4583/4583). fit quads 11.6 -> 10.8 ms/image. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Two steps per iteration in f64x2 with truncating vcvtzs casts, s64-compare bounds masks, scalar-clamped gathers, and bit-masked weights so rejected lanes contribute exactly 0.0, mirroring the 4-wide AVX2 version. Output byte-identical (4583/4583). decode+refinement 1.70 -> 1.59 ms/image. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The serial image_u8_decimate had exactly one caller, the detector, which always has a worker pool. Replace it with image_u8_decimate_parallel in image_u8_parallel: output rows are chunked across the pool, in whole 2-row blocks for the 1.5 path. Output is identical for every factor. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Decimating by 2 now averages each 2x2 input block instead of point sampling, with AVX2 and ARM NEON paths (identical rounding tree: vertical rounding average per column, then a rounding average across the column pair). Point sampling aliases thin tag borders — whether a 1-2 px border survives decimation depends on its phase relative to the sample grid — while the box average makes segmentation deterministic and slightly improves corner accuracy. Sequential instead of strided reads also make the stage ~7x faster (1.59 -> 0.21 ms/frame on the benchmark corpus). On the 133-frame corpus at quad_decimate=2 vs the point sampler: detection retention statistically unchanged (901 vs 883 missing of 4583), spurious detections down 17 -> 5, mean corner error 0.056 -> 0.052 px. quad_decimate=1 output is unaffected. In the pixel-corner convention a decimated coordinate c maps to 2*c on the input, so the existing corner scale-up holds; the residual half-pixel sample-center shift is absorbed by refine_edges. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
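The rounding tree described above can be written out in scalar form. This is a reference sketch only, with a hypothetical function name and signature; it assumes the rounding average is (a + b + 1) >> 1, which is what the AVX2 and NEON byte-average instructions compute, and it ignores an odd trailing row or column.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rounding average of two bytes: (a + b + 1) >> 1. */
static uint8_t avg_round(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* Decimate a w x h image by 2 using a 2x2 box average.
 * Each output pixel is: vertical rounding average per column,
 * then a rounding average across the column pair. */
static void decimate2_box(const uint8_t *src, int w, int h, int stride,
                          uint8_t *dst, int dst_stride)
{
    for (int y = 0; y + 1 < h; y += 2) {
        const uint8_t *r0 = src + (size_t)y * (size_t)stride;
        const uint8_t *r1 = r0 + stride;
        uint8_t *out = dst + (size_t)(y / 2) * (size_t)dst_stride;
        for (int x = 0; x + 1 < w; x += 2) {
            uint8_t c0 = avg_round(r0[x], r1[x]);         /* left column */
            uint8_t c1 = avg_round(r0[x + 1], r1[x + 1]); /* right column */
            out[x / 2] = avg_round(c0, c1);
        }
    }
}
```

Note that two stacked rounding averages are not the same as a single rounded mean of four pixels; what matters for bit-identical output across the scalar and SIMD paths is that every path uses the same tree.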

Python harnesses that ran on the robots via the vision closure's dt_apriltags: tier_probe (stock/frontend/fit A/B on one frame), tier_burn (extended reversed-order suite with thermal/clock trajectory), power_probe (synchronized phase frequency/RAPL sampling that settled the GPU power-coupling question), clock_spinner (pinned work-rate clock proxy), galaxy_burst (raw camera capture while vide is stopped), frame_density/frame_structure (threshold-level scene analysis), and mock_vide (launcher attempting to shim the broken upstream --mock camera path).

Re-base of the GPU integration onto faster3, replacing the faster2 base of the prior james-gpu-x-bouke-cpu (kept as -old). Same per-platform fit:

- GPU frontend available: oclFrontend produces the clusters; where fp64 exists (W3) the GPU fit decides most clusters and the leftovers are bridged to Bouke's pt_list for the CPU fit; without fp64 (W2) all clusters go to Bouke's CPU fit.
- No GPU: Bouke's CPU frontend + fit, unchanged.

struct pt / pt_list moved to apriltag_pt.h so the GPU bridge (oclClustersToPtList) can build Bouke's packed clusters. faster3 already carries the cluster_concat_task ctasks[16] stack-overflow fix (segfault at nthreads >= 5), so we no longer apply our own. It also adds the arm64/NEON port and parallel decimation, both inert on the x86 NUCs: NEON is #ifdef(__ARM_NEON)-guarded, and the decimate path is unused at quad_decimate=1. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

find_package(OpenCL) succeeds on macOS via Apple's OpenCL.framework, so the OpenCL front-end (ocl_threshold.c) compiled there and failed on its `#include <CL/cl.h>` (the Khronos layout Apple doesn't ship). Gate both OpenCL blocks on NOT APPLE (matching line 90); the GPU detector targets Linux x86 NUCs. macOS then builds the CPU path cleanly; x86-Linux is unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

bouk and others added 30 commits June 10, 2026 18:56

Add TSV detection/timing output flags to apriltag_demo

3257434

--save-detections/--save-timing write per-detection and per-stage TSVs so builds can be compared for identical output. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Add hyperfine benchmark script

164eaff

benchmark.sh wraps hyperfine over vide_images with git-rev-labeled artifacts; .gitignore covers the TSV and benchmark outputs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Memoize segment fits across the candidate-quad search

dc04beb

quad_segment_maxima's 4-deep maxima loop re-fit identical segments for every outer combination; memoize fit_line results per maxima pair. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Hoist the Gaussian low-pass kernel into a per-thread table

98d072c

The fixed sigma=1/cutoff=0.05 kernel (size 7) was recomputed with exp() for every cluster; compute it once per thread. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Replace modulo indexing in quad_segment_maxima with wrap adjustments

9c55305

The error, low-pass, and maxima loops indexed errs[] with % sz; a single branch adjustment wraps without the integer divide. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
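A minimal sketch of the wrap adjustment, assuming the index never strays more than one period outside the array (which holds when a window offset smaller than sz is added to or subtracted from an in-range index). The helper name is hypothetical.

```c
#include <assert.h>

/* Wrap an index known to lie in [-sz, 2*sz) back into [0, sz) without a divide. */
static int wrap_index(int i, int sz)
{
    assert(sz > 0 && i >= -sz && i < 2 * sz);
    if (i >= sz)
        i -= sz;
    else if (i < 0)
        i += sz;
    return i;
}
```

Within that range the result equals the non-negative remainder of i modulo sz, so the array is read at the same positions as before.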

Add output-equivalence check script

fe15384

check.sh diffs --save-detections TSVs against a baseline run with per-field epsilon gates (id/hamming exact, coords and margin tolerant). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Build with -fno-math-errno

cf1ac9c

Lets sqrt() and friends compile to bare (vectorizable) instructions instead of errno-guarded libm calls. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Destroy clusters inside the quad tasks

c88e79c

Free each cluster in its quad task (parallel, cache-warm) instead of a serial ~11k-free loop after fit_quads. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Run the cluster merge tree on the worker pool

4fd699c

Each merge-tree level's pairwise merges fan out to the worker pool instead of running serially on the calling thread. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Grow the mem_pools array instead of sizing it from nclustermap

20f4bd8

With the hash table sized independently of the slab, the pool-chunk array is seeded small and realloc-doubled as chunks fill. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Reuse the bucket index for cluster_hash->hash

62511ad

The collection loop already iterates buckets in order, so the secondary key is just the bucket index instead of a re-hash. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Cache the last-hit cluster entry

5a296bd

Consecutive boundary points almost always extend the same cluster, so remember the last entry and skip the hash lookup when the id matches. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Inline the union-find set-size lookups in the cluster pass

06e4fb8

Read uf->size[rep] + 1 directly instead of calling unionfind_get_set_size in the hot boundary loop. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Cache threshold buffers across detect calls

716a229

threshim and the four tile min/max arrays are kept on the detector and reused instead of allocated and freed every frame; every entry is rewritten before use. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Make pt_compare_angle static inline

8fb183e

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Branchless key_merge select

b21cf0c

Merge comparisons are data-dependent coin flips; select with a conditional move instead of a ~50%-mispredicted branch. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
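The idea can be sketched as a merge loop whose only data-dependent decision is a select plus index arithmetic. This is an illustration with a hypothetical function name, not the branch's key_merge; whether the compiler actually emits a conditional move depends on the target and optimization level.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Merge two sorted runs of u64 keys into out (capacity na + nb).
 * On equal keys the right run (b) is taken. */
static void merge_keys_branchless(const uint64_t *a, size_t na,
                                  const uint64_t *b, size_t nb,
                                  uint64_t *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb) {
        uint64_t x = a[i], y = b[j];
        size_t take_a = (size_t)(x < y); /* 0 or 1, no branch on the key values */
        out[k++] = take_a ? x : y;       /* typically compiled to a select */
        i += take_a;
        j += 1 - take_a;
    }
    /* Copy whichever run still has elements. */
    while (i < na)
        out[k++] = a[i++];
    while (j < nb)
        out[k++] = b[j++];
}
```

The loop-exit test is still a branch, but it is well predicted; the unpredictable comparison between keys no longer steers control flow.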

AVX2 paths for the threshold tile min/max and blur

2100301

__AVX2__-guarded vectorization (scalar fallback kept) for the threshold stage's 4x4 tile min/max and the 3x3 tile blur. Integer min/max only, so results are exact. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

bouk and others added 25 commits June 11, 2026 21:04

Hoist homography loads out of the bit-sampling loop

bc91241

Same projection expressions; the matd element loads happen once per quad instead of per bit. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Hoist homography loads in the pattern-sampling loop too

98732ae

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Keep graymodel coefficients in registers in the bit loop

5935896

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Document the optimization campaign: methodology, invariants, dead ends

08e3bb6

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Add macOS/arm64 benchmark+check scripts and scalar baseline results

af19458

Baseline on Apple M3 Pro, 133-image vide_images2 corpus, 4 threads, tagStandard52h13: 29.0 ms/image detector total, 6.28 s wall. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

results: neon-threshold

7dc4d98

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

results: neon-rle

b3e7b0a

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

results: neon-cluster-stores

33fd000

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

results: neon-bbox-keys

702f187

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

results: neon-segment-maxima

d80633b

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

results: neon-refine-edges

034672f

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Document the NEON port: methodology, tricks, dead ends

016ab60

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Record the fleet campaign results and findings

4c669ae

Add the definitive controlled cross-generation comparison

ae7b335

jameshmcvay force-pushed the james-gpu-x-bouke-cpu branch from f03ace1 to f0ba6e4 Compare June 16, 2026 17:20

jameshmcvay changed the base branch from james/gpu to master June 16, 2026 17:20

jameshmcvay changed the title ~~GPU frontend + faster2 CPU fit: per-platform AprilTag acceleration~~ GPU frontend + faster3 CPU fit: per-platform AprilTag acceleration Jun 16, 2026

jameshmcvay marked this pull request as draft June 18, 2026 08:03

jameshmcvay wants to merge 106 commits into master from james-gpu-x-bouke-cpu

2 participants
