M13 is now active instead of aspirational.
The repo now has a reproducible host-side benchmark lane that rebuilds the staged userland pieces, stages the bhyve guest, runs a compact benchmark profile, and extracts a structured JSON baseline.
Files added for this step:
- scripts/libthr/prepare-headers.sh
- scripts/benchmarks/run-m13-baseline.sh
- scripts/benchmarks/extract-m13-baseline.py
- benchmarks/baselines/m13-initial.json
Host-side artifacts from the first run:
- /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260410T120024Z.serial.log
- /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260410T120024Z.json
Focused repeat-lane artifacts with round-level counter telemetry:
- /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260410T123532Z.serial.log
- /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260410T123532Z.json
- /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260410T125519Z.json
- /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260410T125820Z.serial.log
- /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260410T125820Z.json
- /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260410T130005Z.json
The first compact M13 lane intentionally favors stable structured signals over exhaustive coverage.
Dispatch modes:
- basic
- pressure
- burst-reuse
- timeout-gap
- sustained
- main-executor-resume-repeat
Swift modes:
- dispatch-control
- mainqueue-resume
- dispatchmain-taskhandles-after-repeat
All selected modes completed with ok status in the first recorded baseline.
A second verification run kept the same 9/9 success result and the same
qualitative hotspots, but the heavy repeat lanes drifted modestly:
- dispatch.main-executor-resume-repeat moved from reqthreads +564 / enter +189 / return +186 to reqthreads +522 / enter +175 / return +172
- swift.dispatchmain-taskhandles-after-repeat moved from reqthreads +2799 / enter +934 / return +931 to reqthreads +2640 / enter +881 / return +878
That is good enough to confirm the direction of the next optimization work, but not yet good enough for a hard regression gate.
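The drift between the two baseline runs can be made concrete with a few lines of arithmetic. The following is a minimal sketch, not part of the repo's tooling: the counter triples are the ones quoted above, while the `RUNS` table and the `relative_drift` helper are hypothetical names introduced only for illustration.

```python
# Counter deltas (reqthreads, enter, return) quoted above for the two baseline runs.
RUNS = {
    "dispatch.main-executor-resume-repeat": ((564, 189, 186), (522, 175, 172)),
    "swift.dispatchmain-taskhandles-after-repeat": ((2799, 934, 931), (2640, 881, 878)),
}


def relative_drift(first, second):
    """Per-counter change from the first run to the second, in percent."""
    return tuple(round(100.0 * (b - a) / a, 1) for a, b in zip(first, second))


if __name__ == "__main__":
    for lane, (first, second) in RUNS.items():
        print(lane, relative_drift(first, second))
```

On the quoted numbers, both lanes move downward by roughly 5.7 to 7.5 percent between runs, which is the kind of run-to-run noise that makes a hard regression gate premature.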
The next two runs changed the M13 story in an important way:
- the first apparent post-fix benchmark win turned out to be partly masked by a staging bug: scripts/libthr/prepare-stage.sh was refreshing from a stale libthr objdir and not the newest build products;
- after fixing that staging path, the first real post-fix repeat-only run at /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260410T125519Z.json dropped the C repeat lane to reqthreads +379 / enter +172 / return +169 and the Swift repeat lane to reqthreads +1630 / enter +780 / return +777;
- a second clean repeat-only confirmation run at /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260410T130005Z.json kept the same direction on the C lane at reqthreads +320 / enter +150 / return +147 and kept the Swift lane materially below the pre-fix request level at reqthreads +1863 / enter +884 / return +881;
- the traced proof run at /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260410T125820Z.serial.log is noisier because tracing changes timing, but it proves the new worker-handoff-fastpath path is live in the guest.
The first baseline confirms the current phase-1 warm-pool behavior rather than just the raw correctness path:
- dispatch.burst-reuse created 4 new workers in round 1 and 0 in the remaining rounds;
- dispatch.timeout-gap also stayed at 4 settled idle workers after the long gap;
- both lanes settled at 4 idle workers with 0 active workers.
This is a good baseline for later lifecycle tuning because it shows reuse is already happening, even if retirement policy is still conservative.
The first M13 baseline still shows useful pressure shaping:
- dispatch.pressure held default_max_inflight to 3 while the higher priority work ran;
- the same run produced 9 block and 9 unblock observations;
- dispatch.sustained drove 641 block and 641 unblock observations while settling back to the 4-worker warm floor.
So the new baseline does not just confirm success/failure. It also preserves the backpressure signal we care about.
The largest remaining inefficiency in the compact benchmark set is repeated worker request/enter churn on continuation-heavy lanes.
The two clearest hotspots in the first baseline are:
- dispatch.main-executor-resume-repeat: reqthreads_count +564, thread_enter_count +189, thread_return_count +186
- swift.dispatchmain-taskhandles-after-repeat: reqthreads_count +2799, thread_enter_count +934, thread_return_count +931
Those numbers are much larger than the simpler control lanes:
- dispatch.basic: reqthreads_count +15, thread_enter_count +5
- swift.dispatch-control: reqthreads_count +15, thread_enter_count +5
- swift.mainqueue-resume: reqthreads_count +13, thread_enter_count +5, thread_return_count +3
That makes the next M13 direction concrete: keep correctness fixed, then reduce redrive churn on repeated delayed-resume workloads.
The repeat benchmarks no longer rely only on whole-run before/after counters.
The C repeat lane, dispatch.main-executor-resume-repeat, now emits and
extracts round-start-counters and round-ok-counters for every round. The
same is true for the Swift repeat lane,
swift.dispatchmain-taskhandles-after-repeat.
That focused run changes what can be said honestly about the hotspot:
- the C repeat lane is not just paying a startup penalty and then flattening; its reqthreads deltas stay active across all 64 rounds with a first-half mean of 7.66 and a second-half mean of 7.44;
- the same C lane keeps bucket_total pinned at 5 through the run, which means the warm pool is already established while the requests continue;
- the Swift repeat lane is also not startup-only: reqthreads deltas average 44.91 in the first half and 38.56 in the second half, still far above the C lane late in the run;
- this shifts the next tuning target more clearly toward request generation in staged libdispatch, not toward kernel admission or warm-pool formation.
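The first-half versus second-half comparison used above is simple to reproduce from per-round deltas. This is a minimal sketch under stated assumptions: `half_means` is a hypothetical helper, and the two sample series are invented to show the contrast between a startup-only cost and a steady-state cost. They are not data from the recorded runs, and the actual JSON layout of the round counters is not assumed here.

```python
from statistics import fmean


def half_means(deltas):
    """Split per-round deltas in half and return (first_mean, second_mean)."""
    if len(deltas) < 2:
        raise ValueError("need at least two rounds")
    mid = len(deltas) // 2
    return fmean(deltas[:mid]), fmean(deltas[mid:])


# Hypothetical per-round reqthreads deltas, for illustration only.
startup_only = [40, 12, 2, 1, 0, 0, 0, 0]   # cost paid early, then flat
steady_state = [8, 7, 8, 7, 8, 7, 7, 8]     # cost paid every round

if __name__ == "__main__":
    print("startup-only:", half_means(startup_only))
    print("steady-state:", half_means(steady_state))
```

A startup-only lane would show a second-half mean near zero. The recorded C and Swift repeat lanes instead keep second-half means close to their first-half means, which is why the text treats the churn as steady-state policy behavior.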
The first live M13 optimization is no longer hypothetical.
The change had two parts:
- scripts/libthr/prepare-stage.sh now auto-selects the freshest staged libthr objdir instead of assuming the old amd64.amd64 path;
- /usr/src/lib/libthr/thread/thr_workq.c now has a same-lane handoff fast path that skips a redundant THREAD_RETURN -> THREAD_ENTER cycle when a worker immediately claims another item in the same kernel bucket.
Compared with the pre-fix repeat-only mean:
- dispatch.main-executor-resume-repeat moved from reqthreads +546 / enter +183 / return +180 to +379 / +172 / +169 in the first clean post-fix run and +320 / +150 / +147 in the second;
- swift.dispatchmain-taskhandles-after-repeat moved from reqthreads +2659.5 / enter +887.5 / return +884.5 to +1630 / +780 / +777 in the first clean post-fix run and +1863 / +884 / +881 in the second.
The traced proof run shows why the results are mixed:
- in the C repeat section, worker-handoff-fastpath fired 30 times and matched all 30 same-lane handoff claims;
- in the Swift repeat section, worker-handoff-fastpath fired 63 times, but there were 216 handoff claims total and 153 of them still crossed lanes and required a real re-enter path;
- that explains why reqthreads improves clearly on Swift while thread_enter / thread_return remain much noisier than the C lane.
The next M13 step no longer relies on inference from same-lane recycling.
The kernel and libthr now have a real cross-lane handoff op,
TWQ_OP_THREAD_TRANSFER, so a worker that claims work from a different kernel
lane can move there directly instead of always returning to the kernel and
re-entering.
The first current-branch transfer runs were misleading for two separate reasons:
- the guest kernel initially had not been rebuilt, so the new syscall op was not actually present in the running image;
- after that, the staged libthr still came from stale /tmp/twqlibobj/.../*.pico objects, so new source edits in /usr/src/lib/libthr/thread/thr_workq.c had still not reached the guest.
Once both were corrected, the traced proof run at
/Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260413T112356Z.serial.log
showed the new path directly:
- worker-handoff-transfer: 183
- worker-handoff-enter: 0
- worker-handoff-fastpath: 85
- worker-handoff-claim: 268
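The handoff counts lend themselves to a quick accounting check. The sketch below assumes that every handoff claim resolves through exactly one of the fast path, the transfer op, or a real re-enter. That relation is an assumption, not something the trace format is documented to guarantee, although it holds for the counts quoted in this document. `handoff_shares` is a hypothetical helper name.

```python
def handoff_shares(claim, fastpath, transfer=0, enter=0):
    """Return each handoff path's share of total claims, in percent.

    Raises ValueError when the path counts do not add up to the claim count,
    which would suggest a truncated trace or a misread counter.
    """
    if claim <= 0:
        raise ValueError("claim count must be positive")
    if fastpath + transfer + enter != claim:
        raise ValueError("path counts do not add up to claim count")
    return {
        "fastpath": round(100.0 * fastpath / claim, 1),
        "transfer": round(100.0 * transfer / claim, 1),
        "enter": round(100.0 * enter / claim, 1),
    }


if __name__ == "__main__":
    # Earlier Swift trace: 216 claims, 63 fast-path hits, 153 real re-enters.
    print(handoff_shares(claim=216, fastpath=63, enter=153))
    # Transfer proof run: 268 claims, 85 fast-path hits, 183 transfers, 0 re-enters.
    print(handoff_shares(claim=268, fastpath=85, transfer=183, enter=0))
```

Read this way, the share of claims that previously paid a full re-enter (about 71 percent in the earlier Swift trace) is now carried entirely by the transfer op in the proof run.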
That run is timing-perturbed by tracing, so the clean repeat-only follow-up runs are the more honest performance signal:
- /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260413T112557Z.json moved dispatch.main-executor-resume-repeat to reqthreads +380 / enter +169 / return +166 and swift.dispatchmain-taskhandles-after-repeat to +1371 / +460 / +457;
- /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260413T112757Z.json moved the same lanes to +354 / +163 / +160 and +1500 / +506 / +503.
The result is deliberately narrower than a blanket “all repeat churn is fixed”:
- the C repeat lane stays roughly in the same band as the earlier same-lane improvement;
- the Swift repeat lane improves materially, especially on thread_enter / thread_return;
- that is strong evidence that cross-lane recycling was a real missing piece for Swift-heavy continuation paths;
- it is also strong evidence that the next honest target is no longer worker recycling, but request generation and wake policy further up the stack.
The next libdispatch-side experiment was deliberately small:
use the otherwise-idle dgq_thread_pool_size field on the FreeBSD
pthread_workqueue path as a transient active-worker count, then suppress
drain-side repokes once a root queue already had enough active drainers.
That idea did not survive contact with the trace:
- the focused clean run at /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260413T113935Z.json stayed correct and produced dispatch.main-executor-resume-repeat +321 / +152 / +149 and swift.dispatchmain-taskhandles-after-repeat +1407 / +476 / +473;
- but the proof trace at /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260413T114107Z.serial.log showed drain-one-skip-poke only 2 times while root-queue-poke-slow still fired 988 times;
- that means the apparent clean-run movement was not materially caused by the cap logic;
- the patch was reverted immediately rather than leaving a weak heuristic in the staged dispatch tree.
This is still useful progress because it narrows the next honest seam:
- the remaining churn is not going to be solved by a coarse “active workers already at target” guard;
- the real hotspot is still the root-queue request policy itself, especially the repeated root-queue-poke-slow traffic on com.apple.root.default-qos and com.apple.root.user-initiated-qos.
The next libthr experiment targeted a more concrete redundancy:
skip a fresh kernel REQTHREADS call when a lane already had enough
tbr_ready workers to cover its current tbr_pending count.
That looked promising in static trace samples, but the live repeat-only runs did not support keeping it:
- the first clean run at /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260413T115402Z.json moved dispatch.main-executor-resume-repeat to +345 / +157 / +154, but moved swift.dispatchmain-taskhandles-after-repeat to +1533 / +532 / +529;
- the traced run at /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260413T115538Z.json landed at +384 / +179 / +176 and +1263 / +521 / +517, but the trace showed addthreads-covered only 4 times against addthreads-begin: 952 and root-queue-poke-slow: 952;
- the second clean run at /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260413T115743Z.json moved the same lanes to +316 / +149 / +146 and +1424 / +491 / +488.
That is not a stable win:
- the C repeat lane improved on average versus the immediate pre-patch band;
- the Swift repeat lane did not improve on average and remained noisier than the post-transfer baseline;
- the traced proof run shows the new path barely fires, so it is not the dominant source of repeat-lane churn.
The patch was reverted and the staged libthr was refreshed back to the
reverted state. The result is useful because it closes another tempting but
weak branch:
- the main hotspot is not “already-ready work on the same lane”;
- the remaining cost is still dominated by repeated root-queue request generation and cross-queue wake behavior above this point in the stack.
The next useful step was not a blind behavior tweak. It was a traceability fix.
The staged libdispatch root-drain trace originally only matched
com.apple.root.user-initiated-qos, which meant the dominant
com.apple.root.default-qos repeat-lane traffic was invisible at the
root-drain level even though the higher-level root-poke traces already showed
it.
The trace surface was widened in ../nx/swift-corelibs-libdispatch/src/queue.c
so that:
- root-drain traces now include both default and user-initiated roots;
- root-drain events now record the popped item kind and queue label when the item is itself a queue object;
- explicit callsite markers now exist for
drain-one, contended-wait, and worker-timeout repokes.
That immediately produced one concrete new finding in the first focused trace
run at
/Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260415T093244Z.serial.log:
- the early repeat-lane redrive is not only about twq.swift.executor on the default root;
- the com.apple.main-thread item on com.apple.root.default-qos.overcommit also performs an immediate drain-one-repoke when another root item is visible;
- that overcommit-main-queue repoke is therefore part of the same churn picture and deserved a direct falsification attempt.
The resulting bounded behavior branch was:
- skip the preemptive drain-one repoke only when the current root item is com.apple.main-thread and the current root is overcommit.
That branch did not survive repeated clean runs.
Artifacts:
- trace seed run: /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260415T093244Z.serial.log
- clean trial 1: /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260415T093604Z.json
- clean trial 2: /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260415T093749Z.json
- clean trial 3: /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260415T093913Z.json
The three clean trials landed at:
- dispatch repeat: +336 / +153 / +150, +348 / +160 / +157, +340 / +153 / +150
- Swift repeat: +1344 / +429 / +426, +1414 / +468 / +465, +1616 / +542 / +539
That is not stable enough to keep:
- the C lane stayed roughly in-band;
- the Swift lane had one promising result, one neutral result, and one clear regression;
- the behavior change was reverted immediately;
- only the improved root-trace instrumentation remains.
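The stability judgment above can be expressed as a spread measure over the three clean trials. This is a minimal sketch using the reqthreads values quoted above. `relative_spread` and the choice of (max - min) / mean as the metric are illustrative assumptions, not a rule the repo enforces.

```python
from statistics import fmean

# reqthreads deltas from the three clean trials quoted above.
TRIALS = {
    "dispatch repeat": (336, 348, 340),
    "Swift repeat": (1344, 1414, 1616),
}


def relative_spread(samples):
    """Range of the samples as a percentage of their mean."""
    return round(100.0 * (max(samples) - min(samples)) / fmean(samples), 1)


if __name__ == "__main__":
    for lane, samples in TRIALS.items():
        print(lane, relative_spread(samples))
```

By this measure the C lane spans about 3.5 percent of its mean while the Swift lane spans about 18.7 percent, which matches the reading that the C lane stayed in-band and the Swift lane did not.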
This is still real progress because it narrows the next honest target again:
- a coarse “skip main-queue overcommit repoke” heuristic is too timing-sensitive;
- the useful retained result is the wider root-drain visibility on the default roots;
- the next libdispatch-side change needs to distinguish more carefully between queue-object redrive that is actually productive and queue-object redrive that only manufactures extra worker requests.
The next useful narrowing step was to stop tracing the whole executor path and trace only the root-queue enqueue/drain path.
Two small infrastructure changes made that possible:
- ../nx/swift-corelibs-libdispatch/src/queue.c now has a dedicated LIBDISPATCH_TWQ_TRACE_ROOT control, instead of forcing root traces to ride on the broader LIBDISPATCH_TWQ_TRACE_MAINQUEUE switch;
- scripts/bhyve/stage-guest.sh now stages and forwards a matching TWQ_LIBDISPATCH_ROOT_TRACE guest-side control so the benchmark lane can request root-only tracing without also enabling the noisier lane and main-queue traces.
That narrower trace produced a better boundary in
/Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260415T095603Z.serial.log:
- the first root activity in the repeat lane is expected: twq.swift.executor is pushed onto com.apple.root.default-qos as an empty->poke item;
- after the executor callback returns to the main-queue side, the next overcommit request is not yet delayed child work;
- instead, com.apple.main-thread itself is pushed onto com.apple.root.default-qos.overcommit as another empty->poke root item;
- that pushed main queue is already no longer thread-bound at this point (head_thread_bound=0), is already marked enqueued, and already contains one queued item (head_head=head_tail=0xdb287e1a040 in the traced run);
- the traced repeat lane still crashes with rc=139 immediately after that push, so this root-only trace remains diagnostic-only, not a stable regression workload.
That changes the next honest seam again:
- the earliest repeat-lane overcommit request is now tied directly to _dispatch_queue_cleanup2() turning the main queue into an ordinary queue and handing it off to the overcommit default root;
- the next libdispatch-side investigation should look at the cleanup2 -> barrier_complete -> root push transition itself, not only at later “next visible item” redrive;
- the retained root-only trace control should stay, because it is a more targeted diagnostic lane than the earlier broad executor trace.
The next interpretation step was to compare that seam against Apple’s own
libdispatch structure, not just our local trace.
The useful donor-side facts are now explicit:
- in ../nx/apple-opensource-libdispatch/src/queue.c, _dispatch_main_q is initialized with .do_targetq = _dispatch_get_default_queue(true), which points the main queue at the default overcommit root rather than the plain default root;
- in ../nx/apple-opensource-libdispatch/src/inline_internal.h, _dispatch_get_default_queue(true) resolves to the overcommit variant of the default root queue;
- in ../nx/apple-opensource-libdispatch/src/queue.c, _dispatch_queue_cleanup2() clears the thread-bound state and immediately hands off through _dispatch_lane_barrier_complete(dq, 0, 0).
That does not prove our current repeat lane is efficient, but it changes the burden of proof:
- the mere existence of a cleanup2 -> com.apple.main-thread -> com.apple.root.default-qos.overcommit transition is now likely native behavior, not an immediate porting mistake;
- the real question is rate and coalescing: are we generating materially more cleanup-triggered overcommit pushes/pokes per logical delayed-resume cycle than native macOS would;
- the next honest target is therefore no longer “remove the cleanup handoff,” but “measure and reduce excess overcommit redrive after the first cleanup-triggered handoff.”
The next real M13 movement did not come from another staged-libdispatch
requeue heuristic. It came from making libthr stop treating every admitted
worker as a fresh spawn.
The useful structural issue was in /usr/src/lib/libthr/thread/thr_workq.c:
- TWQ_OP_REQTHREADS returns newly scheduled workers for a lane, not a pure “spawn this many brand-new threads” command;
- the old userland planning path still treated admitted as spawn_needed, then only woke idle workers for the remainder;
- that meant the runtime had no lane-aware way to prefer already-counted same-lane idle workers or transferable idle workers from other lanes before creating more workers.
The fix is now more explicit:
- each lane runtime now tracks its own idle worker count via tbr_idle;
- a new ready-planning step first wakes same-lane idle workers for already-counted pending work, then wakes transferable idle workers for the admitted remainder, and only then spawns the rest;
- the same wake-first planning is used both in the direct addthreads path and in reaper-driven redrive, so the staged runtime no longer has one wake/spawn policy on the hot path and another in the idle-redrive path.
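The wake-first ordering described above can be modeled in a few lines. This is a simplified sketch of one possible reading of that policy, not the code in thr_workq.c: `ReadyPlan`, `plan_ready`, and all four parameter names are hypothetical, and the exact way pending and admitted counts interact in the real planner is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ReadyPlan:
    wake_same_lane: int   # idle workers already parked on this lane
    wake_transfer: int    # idle workers moved over from other lanes
    spawn: int            # brand-new workers, used only as a last resort


def plan_ready(pending, admitted, same_lane_idle, transferable_idle):
    """Model of wake-first planning under the assumptions stated above.

    pending:           already-counted pending work on this lane
    admitted:          workers newly admitted by the kernel for this request
    same_lane_idle:    idle workers on this lane (the role tbr_idle plays)
    transferable_idle: idle workers on other lanes that could be transferred
    """
    wake_same = min(pending, same_lane_idle)
    wake_transfer = min(admitted, transferable_idle)
    spawn = admitted - wake_transfer
    return ReadyPlan(wake_same, wake_transfer, spawn)


if __name__ == "__main__":
    # admitted=0 but idle same-lane workers exist: a wake-only decision.
    print(plan_ready(pending=3, admitted=0, same_lane_idle=4, transferable_idle=0))
    # Some admitted work is covered by a transfer, the rest is spawned.
    print(plan_ready(pending=2, admitted=3, same_lane_idle=1, transferable_idle=1))
```

The design point the model tries to capture is that spawning is computed last, from whatever the two wake steps could not cover, instead of being taken directly from the admitted count.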
The first two clean repeat-only runs after rebuilding /tmp/twqlibobj and
refreshing the staged guest artifacts are:
- /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260415T110916Z.json
- /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260415T111107Z.json
Compared with the earlier clean post-transfer band from
/Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260413T112557Z.json
and
/Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260413T112757Z.json,
the result is narrower but real:
- dispatch.main-executor-resume-repeat stayed stable and slightly improved, moving from +380 / +169 / +166 and +354 / +163 / +160 to +361 / +164 / +161 and +343 / +158 / +155;
- swift.dispatchmain-taskhandles-after-repeat improved materially, moving from +1371 / +460 / +457 and +1500 / +506 / +503 to +1350 / +429 / +426 and +1279 / +394 / +391;
- the Swift round-level reqthreads_delta mean also moved in the right direction, from 21.297 and 23.312 to 20.984 and 19.891.
That changes the next honest interpretation again:
- the donor-shaped cleanup2 -> com.apple.main-thread -> com.apple.root.default-qos.overcommit seam may still exist exactly as before, but that seam was not the whole story;
- userland worker planning inside libthr still had real churn to remove, because it was too eager to spawn instead of waking workers that were already counted or already idle;
- this is the first current-branch result that improves the Swift repeat lane again without trying to suppress the cleanup-triggered overcommit handoff itself.
The next useful question after that wake-first improvement was simple: did the benchmark win come from real wake-first behavior, or did the run just land in a better timing band?
The original trace surface was not good enough to answer that honestly,
because enabling TWQ_SWIFT_RUNTIME_TRACE also turned on the broader
libdispatch lane and main-queue traces, and the repeat-only traced runs
under that full bundle were still crashing with rc=139.
That is now fixed at the harness layer:
- scripts/bhyve/stage-guest.sh now stages and forwards split guest trace controls for TWQ_LIBPTHREAD_TRACE, TWQ_LIBDISPATCH_MAINQUEUE_TRACE, and TWQ_LIBDISPATCH_ROOT_TRACE;
- the old compatibility path still exists, but LIBPTHREAD_TWQ_TRACE no longer requires the noisier libdispatch traces to be enabled at the same time.
The first repeat-only libthr-trace run is:
- /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260415T111903Z.serial.log
- /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260415T111903Z.json
That run is timing-perturbed by tracing, so its absolute counters are not the right baseline to compare against the clean band.
What it proves is narrower and more useful:
- the C repeat lane still completed under trace with reqthreads +230 / enter +110 / return +107 and round-level reqthreads_delta mean 3.484;
- the Swift repeat lane also completed under trace with reqthreads +657 / enter +189 / return +186 and round-level reqthreads_delta mean 10.156;
- more importantly, the addthreads-ready mix in the traced serial log is overwhelmingly wake-dominant: dispatch showed 118 wake-only events versus 5 spawn-only events, while Swift showed 456 wake-only events versus 7 spawn-only events;
- many of those wake-only decisions now happen with admitted=0, which is the exact signal we wanted: repeated upstream requests are being serviced by already-counted idle workers instead of being translated into more worker creation.
That changes the next honest target again:
- the new libthr wake-first planning path is now directly proven in the guest, not just inferred from benchmark deltas;
- the remaining repeat-lane cost is therefore less about “still spawning too many workers” and more about “still generating too many worker requests upstream”;
- the next behavioral pass should go back to staged-libdispatch request generation and coalescing, while keeping the new low-noise libthr trace lane available as a regression guard.
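One way the low-noise trace lane could act as a regression guard is a floor on the wake-only share of addthreads-ready decisions. The sketch below uses the wake-only and spawn-only counts quoted from the traced run. The helper names and the 0.90 floor are hypothetical, and events that both wake and spawn are ignored because no counts for them are quoted.

```python
def wake_share(wake_only, spawn_only):
    """Fraction of the counted decisions that were wake-only."""
    total = wake_only + spawn_only
    if total == 0:
        raise ValueError("no addthreads-ready events counted")
    return wake_only / total


def wake_mix_ok(wake_only, spawn_only, floor=0.90):
    """Guard: True while wake-only decisions stay at or above the floor.

    The floor value is a placeholder, not a threshold the repo defines.
    """
    return wake_share(wake_only, spawn_only) >= floor


if __name__ == "__main__":
    print("dispatch:", round(wake_share(118, 5), 3), wake_mix_ok(118, 5))
    print("swift:   ", round(wake_share(456, 7), 3), wake_mix_ok(456, 7))
```

On the quoted counts the wake-only share is about 0.96 for dispatch and about 0.98 for Swift, so a later libdispatch change that pushed either lane well below that would be worth investigating before comparing benchmark deltas.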
The next staged-libdispatch experiment after that trace result was a much
narrower version of the earlier rejected same-root poke suppression:
defer the root poke only when an empty root queue receives a single queue
object back onto the same root the current worker is already draining.
The hypothesis was specific: this should coalesce the repeated timer-worker to executor-queue handoff without suppressing unrelated continuation or override pushes.
The results were not stable enough to keep:
- the first clean repeat-only run at /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260415T124521Z.json looked promising: dispatch.main-executor-resume-repeat landed at +343 / +155 / +152 and swift.dispatchmain-taskhandles-after-repeat dropped to +1184 / +362 / +359;
- the required confirmation run at /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260415T124736Z.json did not hold that win: dispatch regressed to +432 / +187 / +184 and Swift regressed to +1532 / +506 / +503;
- the round-level means told the same story: the first run moved Swift reqthreads_delta mean down to 18.328, but the second run climbed back to 23.750, which is materially worse than the earlier clean post-wake-first band.
That is enough to treat this as another rejected M13 line:
- the queue-only same-root root-poke deferral was reverted;
- the first improved run is now treated as timing luck, not as a valid new baseline;
- the next honest target remains upstream request generation in staged libdispatch, but not through same-root root-poke suppression.
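The "required confirmation run" discipline can be written down as an explicit acceptance rule. The sketch below assumes a deliberately strict rule: a change counts as a win only if every trial run beats the best run of the reference band. That rule and the name `win_confirmed` are assumptions made for illustration; the repo may apply a looser or more statistical criterion.

```python
def win_confirmed(reference_band, trial_runs):
    """True only if every trial is strictly below the best reference run.

    Lower is better here, since the values are reqthreads deltas.
    """
    if not reference_band or not trial_runs:
        raise ValueError("need at least one reference run and one trial run")
    return max(trial_runs) < min(reference_band)


if __name__ == "__main__":
    # Swift reqthreads: post-wake-first band versus the two deferral trials.
    print(win_confirmed(reference_band=(1350, 1279), trial_runs=(1184, 1532)))
    # Dispatch reqthreads for the same pairing.
    print(win_confirmed(reference_band=(361, 343), trial_runs=(343, 432)))
```

Under that rule neither lane confirms: the first trial alone would have passed on Swift, and the second trial is what rejects it.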
The next implementation pass should focus on why repeated continuation-heavy lanes still provoke so many worker requests and enter/return cycles even though they now complete correctly.
That tuning work should stay disciplined:
- measure against benchmarks/baselines/m13-initial.json;
- optimize one layer at a time;
- use the new round-level telemetry to distinguish startup effects from steady-state policy behavior;
- verify that lower churn does not regress the M12 correctness floor;
- treat the first cleanup2 -> main queue -> overcommit root handoff as likely legitimate and measure what happens after it;
- treat the new libthr wake-first planning path as proven enough for the current phase: the low-noise trace now shows that most remaining repeat-lane requests are wakes, not spawns;
- move the next behavioral reduction back to staged libdispatch request generation and coalescing above that wake path;
- keep the split libthr-only trace lane as a guardrail so later libdispatch changes do not quietly regress the wake/spawn mix.
The next diagnostic pass replaced the noisy dprintf trace path with
low-overhead per-process counters inside staged libdispatch.
That changed the M13 picture again, but this time more decisively:
- the first queue-focused counter pass, at /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260415T130059Z.serial.log, completed cleanly and showed that the suspected concurrent-lane redirect seam was inactive for these repeat workloads: concurrent_push_redirect=0, concurrent_push_fallback=0, async_redirect_invoke_entry=0, async_redirect_invoke_exit=0, lane_push_wakeup=0, and lane_push_no_wake=0;
- that ruled out _dispatch_lane_concurrent_push() and _dispatch_async_redirect_invoke() as the live source of the remaining repeat-lane churn;
- the next counter pass moved down to the root queue path and the cleanest fully instrumented run is now /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260415T131214Z.serial.log with structured output at /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260415T131214Z.json.
What that run proves for the C repeat lane:
- dispatch.main-executor-resume-repeat completed at reqthreads +385 / enter +170 / return +167;
- its staged-libdispatch counter dump ended with root_push_append_default=973, root_repoke_default=973, and root_repoke_drain_one_default=973;
- the two other repoke sources stayed at zero: root_repoke_contended_wait_default=0 and root_repoke_worker_timeout_default=0;
- default-overcommit participation in that C lane stayed negligible: root_push_empty_default_overcommit=1, root_poke_default_overcommit=2, and root_repoke_default_overcommit=1.
What that run proves for the Swift repeat lane:
- swift.dispatchmain-taskhandles-after-repeat completed at reqthreads +1401 / enter +463 / return +460;
- the repeated redrive signature is still the same default-root pattern: root_push_append_default=381, root_repoke_default=381, and root_repoke_drain_one_default=381;
- again, the repoke came entirely from the next-visible drain path: root_repoke_contended_wait_default=0, root_repoke_worker_timeout_default=0, root_repoke_contended_wait_default_overcommit=0, and root_repoke_worker_timeout_default_overcommit=0;
- Swift adds one extra ingredient that the C repeat lane barely touches: a material one-shot default-overcommit ingress, root_push_empty_default_overcommit=208 and root_poke_slow_default_overcommit=208, but not an overcommit repoke loop.
That is enough to freeze the next M13 target more narrowly:
- the active repeat bottleneck is no longer “staged libdispatch request generation in general”;
- it is specifically the root queue next-visible redrive path, _dispatch_root_queue_drain_one() -> _dispatch_root_queue_poke(dq, 1, 0);
- the C repeat lane is almost a perfect proof: root_push_append_default and root_repoke_drain_one_default match exactly;
- the Swift repeat lane keeps the same default-root repoke signature and then adds a separate one-shot default-overcommit ingress, which should be treated as a secondary follow-up seam rather than the first optimization target;
- therefore the next behavioral pass should target root repoke coalescing or better next-visible handoff at the default root, not concurrent-lane redirect tuning and not another attempt to suppress the initial cleanup2 -> overcommit root handoff.
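The "default-root repoke signature" can be checked mechanically from a counter dump. The sketch below uses the counter names and values quoted above, held in plain dicts. The dict layout is hypothetical, and treating root_repoke_default as the sum of the three per-source counters is an assumption that happens to hold for the quoted values.

```python
# Default-root counters quoted above for the two repeat lanes.
C_LANE = {
    "root_push_append_default": 973,
    "root_repoke_default": 973,
    "root_repoke_drain_one_default": 973,
    "root_repoke_contended_wait_default": 0,
    "root_repoke_worker_timeout_default": 0,
}
SWIFT_LANE = {
    "root_push_append_default": 381,
    "root_repoke_default": 381,
    "root_repoke_drain_one_default": 381,
    "root_repoke_contended_wait_default": 0,
    "root_repoke_worker_timeout_default": 0,
}


def has_drain_one_signature(c):
    """True when all default-root repokes are drain-one and match appends 1:1."""
    sources = (
        c["root_repoke_drain_one_default"]
        + c["root_repoke_contended_wait_default"]
        + c["root_repoke_worker_timeout_default"]
    )
    return (
        c["root_repoke_default"] == sources
        and c["root_repoke_drain_one_default"] == c["root_repoke_default"]
        and c["root_push_append_default"] == c["root_repoke_drain_one_default"]
    )


if __name__ == "__main__":
    print("C lane:    ", has_drain_one_signature(C_LANE))
    print("Swift lane:", has_drain_one_signature(SWIFT_LANE))
```

A check like this is mainly useful after a behavioral change: if the signature stops holding, the repoke traffic has moved to a different source and the target needs to be re-derived.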
The next M13 pass stopped guessing about generic root policy and targeted the
dominant measured seam directly:
non-overcommit default-root drain-one-repoke was suppressed only when the
current head item was a one-shot dispatch_after timer source.
The first clean result is now:
- /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260415T134322Z.serial.log
- /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260415T134322Z.json
What that run proves:
- dispatch.main-executor-resume-repeat stayed correct and improved from the prior kind-classification baseline (+402 / +177 / +174 at ...133950Z.json) to +324 / +153 / +150;
- swift.dispatchmain-taskhandles-after-repeat also stayed correct and moved from +1323 / +422 / +419 to +1234 / +408 / +405;
- the Swift counter dump shows the intended effect directly: root_repoke_default=0, root_repoke_drain_one_default=0, and root_repoke_suppressed_after_source_default=363;
- the remaining Swift default-root traffic in that run is therefore no longer a repoke loop at all; it is reduced to ordinary root pushes and the same one-shot default-overcommit main-queue handoff we already treat as a secondary seam.
That is enough to keep this behavioral change:
- the live repeat-lane bottleneck was not “all root repokes”;
- it was specifically over-eager next-visible repoke on one-shot dispatch_after source items;
- suppressing only that source class materially reduces churn without breaking the staged guest correctness floor.
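The shape of the retained suppression can be sketched as a small decision function. This is a model of the policy as described in the text, not the code in queue.c: the function name, its parameters, and the kind label "after-source" are hypothetical, and the real implementation decides from internal object state that is not shown in this document.

```python
DEFAULT_ROOT = "com.apple.root.default-qos"


def should_drain_one_repoke(root_label, head_kind, next_item_visible):
    """Decide whether a drainer repokes the root after popping one item.

    root_label:        label of the root queue being drained
    head_kind:         hypothetical label for the popped item's kind
    next_item_visible: whether another root item is visible to the drainer
    """
    if not next_item_visible:
        return False  # nothing left to redrive
    if root_label == DEFAULT_ROOT and head_kind == "after-source":
        return False  # the one suppressed case: one-shot dispatch_after source
    return True       # every other case keeps the existing repoke


if __name__ == "__main__":
    print(should_drain_one_repoke(DEFAULT_ROOT, "after-source", True))
    print(should_drain_one_repoke(DEFAULT_ROOT, "continuation", True))
    print(should_drain_one_repoke(DEFAULT_ROOT + ".overcommit", "after-source", True))
```

The point of keeping the suppressed case this narrow is that overcommit roots and other item kinds keep their existing repoke behavior, which is what distinguishes this change from the broader heuristics that were reverted.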
Once the dispatch_after source repoke was suppressed, the next useful
question was what still remained on the C repeat lane.
The follow-up measurement run is now:
- /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260415T134625Z.serial.log
- /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260415T134625Z.json
What that run proves:
- the source suppression still holds: dispatch.main-executor-resume-repeat stayed in the same improved band at +329 / +153 / +150, while Swift improved again to +1137 / +386 / +383;
- Swift still shows the same root-level outcome: root_repoke_default=0 and root_repoke_suppressed_after_source_default=372;
- the C repeat lane no longer repokes on source items either: root_repoke_suppressed_after_source_default=512, root_repoke_drain_one_kind_default_source=0;
- the remaining C default-root repokes are now almost entirely ASYNC_REDIRECT continuations plus a small lane tail: root_repoke_drain_one_kind_default_continuation=443, root_repoke_drain_one_kind_default_continuation_async_redirect=443, and root_repoke_drain_one_kind_default_lane=55.
That freezes the next honest M13 target again:
- keep the new dispatch_after source suppression in staged libdispatch;
- stop treating generic continuation traffic as the next fix target;
- move the next pass to the default-root ASYNC_REDIRECT continuation path, because that is now the dominant remaining C repeat seam after source repokes were removed.
The next diagnostic attempt tried to classify objects pushed to
com.apple.root.default-qos.overcommit from inside staged libdispatch.
That was the correct question but the wrong implementation site.
The failed path:
- root-push kind counters dereferenced the pushed head object after _dispatch_root_queue_push_inline() had already called os_mpsc_push_list();
- on the append path, that publish can race with an already-running drainer that pops, invokes, and recycles the continuation;
- dereferencing dx_metatype(head) after that publish boundary produced immediate Swift repeat rc=139 failures.
That patch was reverted. Two M13 behavior changes were retained:
- non-overcommit default-root drain-one-repoke suppression for one-shot dispatch_after timer sources;
- same-target ASYNC_REDIRECT suppression when the target lane already has used_width >= 3.
The focused stability run after the revert is:
- /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260416T024756Z.serial.log
- /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-baseline-20260416T024756Z.json
What that run proves:
- swift.dispatchmain-taskhandles-after-repeat completed all 64 rounds;
- the TWQ delta was reqthreads +969 / enter +317 / return +314;
- the remaining overcommit seam is still push volume, not repoke: root_push_empty_default_overcommit=164, root_poke_slow_default_overcommit=165, and root_repoke_default_overcommit=1.
The new instrumentation rule is now explicit:
- classify push objects before publish, not after publish;
- classify drain objects only while the drain loop owns them;
- use DTrace for push-path classification until the object population is known well enough to justify a permanent in-process counter.
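
The first two rules can be illustrated with a toy model. This is a Python sketch of the ownership discipline only, not libdispatch code: Item, push, and drainer are hypothetical names, and queue.Queue stands in for the MPSC list. The producer copies the classification out of the object before it publishes, and the drainer reads the object only while it still owns it, before "recycling" it.

```python
import queue
import threading
from collections import Counter


class Item:
    """Stand-in for a pushed object; `kind` is only valid while owned."""

    def __init__(self, kind):
        self.kind = kind


def drainer(q, drained):
    while True:
        item = q.get()
        if item is None:          # sentinel: no more work
            return
        drained[item.kind] += 1   # safe: the drain loop owns the item here
        item.kind = None          # "recycle": the field is now meaningless


def push(q, item, pushed):
    kind = item.kind              # classify BEFORE publish
    q.put(item)                   # publish boundary: ownership is handed off
    pushed[kind] += 1             # uses the saved copy, never item.kind


pushed, drained = Counter(), Counter()
q = queue.Queue()
worker = threading.Thread(target=drainer, args=(q, drained))
worker.start()
for kind in ["source", "continuation", "source"]:
    push(q, Item(kind), pushed)
q.put(None)
worker.join()
print(pushed, drained)
```

Reading item.kind after q.put(item) in push would be the analogue of the reverted patch: the drainer may already have recycled the object by then.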
The repo now stages three FreeBSD DTrace helpers into bhyve guests under
/root/twq-dtrace:
- m13-push-poke-drain.d for pointer-only event ordering;
- m13-push-vtable.d for vtable-pointer classification at _dispatch_root_queue_push:entry;
- m13-root-summary.d for low-volume root queue aggregate counts.
The next pass fixed two diagnostic problems rather than changing dispatch policy.
First, the DTrace runner now traces the real target process directly. The
initial no-event DTrace attempts used dtrace -c "env ... probe", which made
DTrace bind to /usr/bin/env before the final probe binary was executed. The
guest script now runs env ... dtrace ... -c /root/probe, so the pid
provider sees the staged Swift or C probe process itself.
Second, the only permanent push-path classifier retained in staged
libdispatch now uses pointer identity against _dispatch_main_q. It no
longer calls dx_metatype() or dx_type() on pushed objects. That keeps the
classification before the MPSC publish boundary and avoids decoding arbitrary
continuations.
Fresh current-binary DTrace evidence:
- /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-dtrace-push-vtable-20260416T042209Z.serial.log completed a 2 x 8 Swift repeat run under push-vtable;
- scripts/dtrace/analyze-m13-vtable.py maps the pushed objects as: default root gets 16 __OS_dispatch_source_vtable pushes, default.overcommit gets 7 __OS_dispatch_queue_main_vtable pushes, and user-initiated gets 21 _dispatch_continuation_vtables+0x38 pushes;
- that matches the manual trace reading: timer sources land on the default root, Swift/global continuations land on user-initiated, and the default-overcommit root is primarily main-queue handoff traffic.
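
The per-root grouping described above can be sketched as follows. This is not the code of analyze-m13-vtable.py: the event shape, a list of (root label, vtable symbol) pairs that have already been symbolized, is an assumption, and pushes_by_root is a hypothetical helper. The event multiplicities simply reproduce the counts quoted for the 2 x 8 run.

```python
from collections import Counter

# Hypothetical, already-symbolized push events: (root label, vtable symbol).
# The multiplicities mirror the counts quoted for the 2 x 8 push-vtable run.
events = (
    [("default", "__OS_dispatch_source_vtable")] * 16
    + [("default.overcommit", "__OS_dispatch_queue_main_vtable")] * 7
    + [("user-initiated", "_dispatch_continuation_vtables+0x38")] * 21
)


def pushes_by_root(evts):
    """Group push counts by root queue, then by vtable symbol."""
    table = {}
    for root, vtable in evts:
        table.setdefault(root, Counter())[vtable] += 1
    return table


table = pushes_by_root(events)
for root, counts in sorted(table.items()):
    vtable, n = counts.most_common(1)[0]
    print(f"{root}: {n} x {vtable}")
```

Keeping the grouping keyed on the vtable symbol rather than on decoded object fields matches the pointer-only approach: nothing in the analysis needs to dereference the pushed object.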
Fresh full-repeat counter evidence:
- /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-swift-repeat-counters-20260416T041819Z.serial.log completed the full 64 x 8 Swift repeat run with counters enabled;
- the extracted JSON at /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-swift-repeat-counters-20260416T041819Z.json reports reqthreads +1058 / enter +343 / return +340;
- the round-level mean is 16.41 reqthreads per round, with a range of 9 to 40;
- the libdispatch counter dump shows root_push_empty_default_overcommit=186, root_push_mainq_default_overcommit=186, root_poke_slow_default_overcommit=187, and root_repoke_default_overcommit=1.
That closes the object-population question for this seam:
- the default-overcommit pressure is not random continuation traffic;
- it is almost exactly main-queue handoff traffic;
- this agrees with the macOS-source expectation that com.apple.main-thread can target the default overcommit root;
- the next question is therefore rate and coalescing compared with macOS, not whether the handoff exists.
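
Since the open question is now rate, a rough per-round figure is useful. The sketch below uses the default-overcommit counters quoted for the two 64-round Swift repeat runs. It assumes the counters cover only those 64 rounds and that the main-queue pushes are a subset of the empty-root pushes; neither is established by the dumps themselves. The shortened key names and overcommit_rates are hypothetical.

```python
ROUNDS = 64  # both Swift repeat runs completed 64 rounds

# Default-overcommit counters quoted for the two runs (keys shortened).
runs = {
    "20260416T024756Z": {"push_empty": 164, "poke_slow": 165, "repoke": 1},
    "20260416T041819Z": {
        "push_empty": 186,
        "push_mainq": 186,
        "poke_slow": 187,
        "repoke": 1,
    },
}


def overcommit_rates(c, rounds=ROUNDS):
    """Derive per-round push rate and repoke fraction from one counter dump."""
    rates = {
        "pushes_per_round": c["push_empty"] / rounds,
        "repoke_fraction": c["repoke"] / c["poke_slow"],
    }
    if "push_mainq" in c:
        # Assumption: main-queue pushes are a subset of empty-root pushes.
        rates["mainq_share"] = c["push_mainq"] / c["push_empty"]
    return rates


for run_id, c in runs.items():
    print(run_id, overcommit_rates(c))
```

Under those assumptions the default-overcommit root sees roughly 2.6 to 2.9 empty-root pushes per round, which is the number to compare against macOS when looking at coalescing.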
The benchmark extractor now preserves [libdispatch-twq-counters] dumps in
the structured JSON, and scripts/benchmarks/summarize-m13-baseline.py gives
a compact CLI view of both kern.twq.* deltas and libdispatch root counters.
scripts/benchmarks/compare-m13-baselines.py is also available as the first
coarse regression gate. It compares common benchmark modes across two JSON
files, checks status regressions, and applies a drift-tolerant threshold to
reqthreads_count, thread_enter_count, and thread_return_count.
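
A minimal sketch of that kind of gate is shown below. It is not the code of compare-m13-baselines.py: the JSON shape (mode name mapped to a status plus the three counters) and the 10% tolerance are assumptions chosen for illustration, and compare_mode and compare are hypothetical names.

```python
COUNTERS = ("reqthreads_count", "thread_enter_count", "thread_return_count")


def compare_mode(base, cand, tolerance=0.10):
    """Return failure strings for one mode; an empty list means it passes."""
    failures = []
    if base["status"] == "ok" and cand["status"] != "ok":
        failures.append(f"status regressed: {cand['status']}")
    for name in COUNTERS:
        # Drift-tolerant: only growth beyond the tolerance band fails.
        limit = base[name] * (1.0 + tolerance)
        if cand[name] > limit:
            failures.append(f"{name}: {cand[name]} > {limit:.1f}")
    return failures


def compare(baseline, candidate, tolerance=0.10):
    """Compare only the modes present in both files."""
    common = sorted(set(baseline) & set(candidate))
    return {m: compare_mode(baseline[m], candidate[m], tolerance) for m in common}


MODE = "swift.dispatchmain-taskhandles-after-repeat"
baseline = {MODE: {"status": "ok", "reqthreads_count": 2799,
                   "thread_enter_count": 934, "thread_return_count": 931}}
candidate = {MODE: {"status": "ok", "reqthreads_count": 1058,
                    "thread_enter_count": 343, "thread_return_count": 340}}
print(compare(baseline, candidate))
```

Restricting the comparison to common modes lets a focused single-lane run be checked against the full initial baseline without treating the missing modes as failures.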
The current focused comparison against the checked-in initial baseline passes:
scripts/benchmarks/compare-m13-baselines.py \
  benchmarks/baselines/m13-initial.json \
  /Users/me/wip-gcd-tbb-fx/artifacts/benchmarks/m13-swift-repeat-counters-20260416T041819Z.json \
  --mode swift.dispatchmain-taskhandles-after-repeat

That reports the Swift repeat lane moving from 2799 / 934 / 931 to 1058 / 343 / 340 for reqthreads / enter / return.