You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
HandleDictionary currently serializes all object registration, lookup, promotion, and disposal through a single process-wide lock (instancesLock). Every SKObject creation and disposal takes this lock. Under highly parallel workloads that create/dispose many distinct wrappers concurrently, this lock is a scalability bottleneck.
This issue proposes investigating lock striping (sharding the dictionary into N independently-locked partitions keyed by a hash of the native handle) as a future, non-blocking improvement. It is split out from #4080 (which hardens the correctness of the current single-lock design) so that #4080 can merge without taking on this larger, riskier change.
Status: exploratory. A standalone microbenchmark (below) shows real but conditional value. There is also a genuine new hazard (cross-shard lock ordering). This needs an in-tree benchmark against real SKObjects and careful design review before anyone commits to it.
binding/SkiaSharp/HandleDictionary.cs: one Dictionary<IntPtr, WeakReference> instances guarded by one IPlatformLock instancesLock (PlatformLock.Create()).
GetInstance → read lock. GetOrAddObject → upgradeable-read lock held for its entire duration, including the factory/ctor invocation (line ~112), which re-enters the write lock via RegisterHandle.
RegisterHandle / DeregisterHandle → write lock.
SKObject.Dispose() takes HandleDictionary.instancesLock.EnterWriteLock()directly around its IgnorePublicDispose check + the isDisposed CAS (SKObject.cs ~332). This coupling is what makes promote-vs-public-dispose mutually exclusive.
Invariants any striped design MUST preserve
Promote/dispose mutual exclusion — GetOrAddObject(disposeProtected:true) setting IgnorePublicDispose must be mutually exclusive with SKObject.Dispose() reading it + claiming disposal. Today both use the same global lock. A striped version is safe only because every operation is keyed on a single handle → an object's handle always maps to one shard, so GetOrAddObject and that object's Dispose() contend on the same shard lock. The accessor SKObject.Dispose() uses must resolve the shard for its handle.
Replacement teardown — RegisterHandle's "ownership handed off" branch disposes the replaced wrapper outside the lock; DeregisterHandle must still no-op when weak.Target is now a different object.
Reentrancy — GetOrAddObject holds upgradeable-read, then the factory→ctor→RegisterHandle takes the write lock on the same thread. ReaderWriterLockSlim(NoRecursion) permits upgradeable→write on one thread; a striped version must route both to the same shard for that handle to keep this legal.
⚠️ Key new hazard: cross-shard lock ordering (deadlock)
With a single lock, nested registration is reentrant on one lock object. With striping, the factory-under-lock pattern means: while holding shard(parent), constructing a child wrapper with a different handle acquires shard(child). If two threads do this in opposite handle order, they can deadlock — a class of bug that cannot occur today. Any implementation must either (a) never hold two shard locks simultaneously (e.g. run the factory outside the shard lock, which itself reworks invariant #1 and is the dangerous part), or (b) impose a strict global lock-ordering. This hazard is the main reason striping is non-trivial and is the first thing to resolve.
N=1 must reduce exactly to today's behavior (baseline + safe fallback).
Shard index = bit-mixed hash of the handle, masked with & (N-1) (so N is a power of two — no modulo).
Add HandleDictionary.GetShardLock(handle) for SKObject.Dispose() to route to the per-handle shard lock.
Shard count (default) + customization
Default: next power-of-two ≥ Environment.ProcessorCount, clamped to [1, 64] (cap bounds memory — each shard is a Dictionary + lock). Both an independent GPT‑5.5 and an Opus‑4.8 design pass converged on this (one suggested ProcessorCount, the other 2×; pow2 + clamp 64 is the shared core).
Customization (additive only — IPlatformLock is public/ABI-locked): resolve once at first use, in order: AppContext.GetData("SkiaSharp.HandleDictionary.ShardCount") → env var SKIASHARP_HANDLE_SHARDS → auto-default. Same "set before first lock is created" constraint the PlatformLock.Factory doc already states. No new public API.
Pointer-hash mixing is mandatory
Native handles are aligned pointers — the low 3–4 bits are always zero — so handle & (N-1) on the raw value dumps everything into shard 0. Must bit-mix first, e.g. multiplicative/Fibonacci: shard = (int)(((ulong)handle * 0x9E3779B97F4A7C15) >> (64 - log2N)) (mask to N).
The benchmark confirmed this distributes aligned pointers across 64 shards with CoV ≈ 0%.
Benchmark results (standalone microbenchmark)
12-core machine, Server GC, 300k ops/thread, 7-rep median + warmup. Op = Register(write) + TryGet(read) + Deregister(write) (the create/lookup/dispose cycle). spin=50 adds ~tens of ns of work outside the lock to approximate real interop and not overstate lock weight.
Churn of distinct handles (spin=50, realistic): speedup vs global lock (N=1)
threads
N=1
N=4
N=16
N=64
1
1.0×
1.0×
1.0×
1.0× (flat — zero overhead)
2
1.0×
1.0×
1.0×
1.0× (no benefit)
8
1.0×
2.1×
2.6×
2.8×
12
1.0×
2.5×
3.3×
3.7×
16
1.0×
2.5×
3.2×
3.5×
Knee ≈ ProcessorCount (N≈16). Single-thread is flat across all N (no regression).
Counter-signal — shared-singleton workload (all threads hit ~8 shared handles): striping hurts (0.5–0.9×). When contention is concentrated on a few handles they can't spread across shards, and the extra locks just bounce cache lines. Many SkiaSharp singletons (sRGB color space, default typeface/font-manager) are exactly this shape.
Verdict
Striping delivers a real 2.5–3.7× contention reduction for high-thread parallel churn of distinct objects, with zero single-thread cost — but no benefit (slight harm) for shared-singleton hot paths, and the microbench overstates lock weight versus real Skia ctor/dtor cost. Conclusion: worth a guarded experiment, not a slam dunk. Gate any real implementation on an in-tree benchmark with real SKObjects confirming apps actually hit the registry bottleneck, and on resolving the cross-shard deadlock hazard.
Kill criteria (don't ship if)
<10–15% improvement at realistic thread counts with real objects, or
any measurable single-thread regression, or
real Skia construction/destruction cost dominates so registry contention is invisible, or
the cross-shard lock-ordering hazard can't be eliminated cleanly.
Suggested next steps
Add a committed BenchmarkDotNet harness under benchmarks/ exercising N-thread create/register/dispose against the realHandleDictionary (baseline), [Params] over thread counts.
Prototype the striped dictionary behind N resolution above; verify shard occupancy + the full test suite under THROW_OBJECT_EXCEPTIONS.
Resolve the cross-shard deadlock hazard before un-drafting any PR.
Summary
HandleDictionarycurrently serializes all object registration, lookup, promotion, and disposal through a single process-wide lock (instancesLock). EverySKObjectcreation and disposal takes this lock. Under highly parallel workloads that create/dispose many distinct wrappers concurrently, this lock is a scalability bottleneck.This issue proposes investigating lock striping (sharding the dictionary into
Nindependently-locked partitions keyed by a hash of the native handle) as a future, non-blocking improvement. It is split out from #4080 (which hardens the correctness of the current single-lock design) so that #4080 can merge without taking on this larger, riskier change.Cross-link: #4080 · fixes-context #3817
Current design (baseline)
binding/SkiaSharp/HandleDictionary.cs: oneDictionary<IntPtr, WeakReference> instancesguarded by oneIPlatformLock instancesLock(PlatformLock.Create()).GetInstance→ read lock.GetOrAddObject→ upgradeable-read lock held for its entire duration, including the factory/ctor invocation (line ~112), which re-enters the write lock viaRegisterHandle.RegisterHandle/DeregisterHandle→ write lock.SKObject.Dispose()takesHandleDictionary.instancesLock.EnterWriteLock()directly around itsIgnorePublicDisposecheck + theisDisposedCAS (SKObject.cs ~332). This coupling is what makes promote-vs-public-dispose mutually exclusive.Invariants any striped design MUST preserve
GetOrAddObject(disposeProtected:true)settingIgnorePublicDisposemust be mutually exclusive withSKObject.Dispose()reading it + claiming disposal. Today both use the same global lock. A striped version is safe only because every operation is keyed on a single handle → an object's handle always maps to one shard, soGetOrAddObjectand that object'sDispose()contend on the same shard lock. The accessorSKObject.Dispose()uses must resolve the shard for its handle.RegisterHandle's "ownership handed off" branch disposes the replaced wrapper outside the lock;DeregisterHandlemust still no-op whenweak.Targetis now a different object.GetOrAddObjectholds upgradeable-read, then the factory→ctor→RegisterHandletakes the write lock on the same thread.ReaderWriterLockSlim(NoRecursion)permits upgradeable→write on one thread; a striped version must route both to the same shard for that handle to keep this legal.With a single lock, nested registration is reentrant on one lock object. With striping, the factory-under-lock pattern means: while holding shard(parent), constructing a child wrapper with a different handle acquires shard(child). If two threads do this in opposite handle order, they can deadlock — a class of bug that cannot occur today. Any implementation must either (a) never hold two shard locks simultaneously (e.g. run the factory outside the shard lock, which itself reworks invariant #1 and is the dangerous part), or (b) impose a strict global lock-ordering. This hazard is the main reason striping is non-trivial and is the first thing to resolve.
Proposed design sketch
Nshards, each ={ Dictionary<IntPtr,WeakReference>, IPlatformLock }. Build the lock array by calling the existingPlatformLock.Factory()N times — preserves the plug-in contract ([BUG] SkiaSharp can call WM_PAINT handler and random COM callbacks when creating or obtaning objects #1383 Win32CRITICAL_SECTIONfix) unchanged, one independent lock per shard.N=1must reduce exactly to today's behavior (baseline + safe fallback).& (N-1)(soNis a power of two — no modulo).HandleDictionary.GetShardLock(handle)forSKObject.Dispose()to route to the per-handle shard lock.Shard count (default) + customization
Environment.ProcessorCount, clamped to[1, 64](cap bounds memory — each shard is a Dictionary + lock). Both an independent GPT‑5.5 and an Opus‑4.8 design pass converged on this (one suggestedProcessorCount, the other2×; pow2 + clamp 64 is the shared core).IPlatformLockis public/ABI-locked): resolve once at first use, in order:AppContext.GetData("SkiaSharp.HandleDictionary.ShardCount")→ env varSKIASHARP_HANDLE_SHARDS→ auto-default. Same "set before first lock is created" constraint thePlatformLock.Factorydoc already states. No new public API.Pointer-hash mixing is mandatory
Native handles are aligned pointers — the low 3–4 bits are always zero — so
handle & (N-1)on the raw value dumps everything into shard 0. Must bit-mix first, e.g. multiplicative/Fibonacci:shard = (int)(((ulong)handle * 0x9E3779B97F4A7C15) >> (64 - log2N))(mask toN).The benchmark confirmed this distributes aligned pointers across 64 shards with CoV ≈ 0%.
Benchmark results (standalone microbenchmark)
12-core machine, Server GC, 300k ops/thread, 7-rep median + warmup. Op =
Register(write) + TryGet(read) + Deregister(write)(the create/lookup/dispose cycle).spin=50adds ~tens of ns of work outside the lock to approximate real interop and not overstate lock weight.Churn of distinct handles (spin=50, realistic): speedup vs global lock (N=1)
Knee ≈
ProcessorCount(N≈16). Single-thread is flat across all N (no regression).Counter-signal — shared-singleton workload (all threads hit ~8 shared handles): striping hurts (0.5–0.9×). When contention is concentrated on a few handles they can't spread across shards, and the extra locks just bounce cache lines. Many SkiaSharp singletons (sRGB color space, default typeface/font-manager) are exactly this shape.
Verdict
Striping delivers a real 2.5–3.7× contention reduction for high-thread parallel churn of distinct objects, with zero single-thread cost — but no benefit (slight harm) for shared-singleton hot paths, and the microbench overstates lock weight versus real Skia ctor/dtor cost. Conclusion: worth a guarded experiment, not a slam dunk. Gate any real implementation on an in-tree benchmark with real
SKObjects confirming apps actually hit the registry bottleneck, and on resolving the cross-shard deadlock hazard.Kill criteria (don't ship if)
Suggested next steps
benchmarks/exercising N-thread create/register/dispose against the realHandleDictionary(baseline),[Params]over thread counts.Nresolution above; verify shard occupancy + the full test suite underTHROW_OBJECT_EXCEPTIONS.Non-blocking for #4080.