Skip to content

Reduce HandleDictionary lock contention via lock striping/sharding #4101

@mattleibow

Description

@mattleibow

Summary

HandleDictionary currently serializes all object registration, lookup, promotion, and disposal through a single process-wide lock (instancesLock). Every SKObject creation and disposal takes this lock. Under highly parallel workloads that create/dispose many distinct wrappers concurrently, this lock is a scalability bottleneck.

This issue proposes investigating lock striping (sharding the dictionary into N independently-locked partitions keyed by a hash of the native handle) as a future, non-blocking improvement. It is split out from #4080 (which hardens the correctness of the current single-lock design) so that #4080 can merge without taking on this larger, riskier change.

Status: exploratory. A standalone microbenchmark (below) shows real but conditional value. There is also a genuine new hazard (cross-shard lock ordering). This needs an in-tree benchmark against real SKObjects and careful design review before anyone commits to it.

Cross-link: #4080 · fixes-context #3817

Current design (baseline)

  • binding/SkiaSharp/HandleDictionary.cs: one Dictionary<IntPtr, WeakReference> instances guarded by one IPlatformLock instancesLock (PlatformLock.Create()).
  • GetInstance → read lock. GetOrAddObject → upgradeable-read lock held for its entire duration, including the factory/ctor invocation (line ~112), which re-enters the write lock via RegisterHandle.
  • RegisterHandle / DeregisterHandle → write lock.
  • SKObject.Dispose() takes HandleDictionary.instancesLock.EnterWriteLock() directly around its IgnorePublicDispose check + the isDisposed CAS (SKObject.cs ~332). This coupling is what makes promote-vs-public-dispose mutually exclusive.

Invariants any striped design MUST preserve

  1. Promote/dispose mutual exclusionGetOrAddObject(disposeProtected:true) setting IgnorePublicDispose must be mutually exclusive with SKObject.Dispose() reading it + claiming disposal. Today both use the same global lock. A striped version is safe only because every operation is keyed on a single handle → an object's handle always maps to one shard, so GetOrAddObject and that object's Dispose() contend on the same shard lock. The accessor SKObject.Dispose() uses must resolve the shard for its handle.
  2. Replacement teardownRegisterHandle's "ownership handed off" branch disposes the replaced wrapper outside the lock; DeregisterHandle must still no-op when weak.Target is now a different object.
  3. ReentrancyGetOrAddObject holds upgradeable-read, then the factory→ctor→RegisterHandle takes the write lock on the same thread. ReaderWriterLockSlim(NoRecursion) permits upgradeable→write on one thread; a striped version must route both to the same shard for that handle to keep this legal.

⚠️ Key new hazard: cross-shard lock ordering (deadlock)

With a single lock, nested registration is reentrant on one lock object. With striping, the factory-under-lock pattern means: while holding shard(parent), constructing a child wrapper with a different handle acquires shard(child). If two threads do this in opposite handle order, they can deadlock — a class of bug that cannot occur today. Any implementation must either (a) never hold two shard locks simultaneously (e.g. run the factory outside the shard lock, which itself reworks invariant #1 and is the dangerous part), or (b) impose a strict global lock-ordering. This hazard is the main reason striping is non-trivial and is the first thing to resolve.

Proposed design sketch

  • N shards, each = { Dictionary<IntPtr,WeakReference>, IPlatformLock }. Build the lock array by calling the existing PlatformLock.Factory() N times — preserves the plug-in contract ([BUG] SkiaSharp can call WM_PAINT handler and random COM callbacks when creating or obtaning objects #1383 Win32 CRITICAL_SECTION fix) unchanged, one independent lock per shard.
  • N=1 must reduce exactly to today's behavior (baseline + safe fallback).
  • Shard index = bit-mixed hash of the handle, masked with & (N-1) (so N is a power of two — no modulo).
  • Add HandleDictionary.GetShardLock(handle) for SKObject.Dispose() to route to the per-handle shard lock.

Shard count (default) + customization

  • Default: next power-of-two ≥ Environment.ProcessorCount, clamped to [1, 64] (cap bounds memory — each shard is a Dictionary + lock). Both an independent GPT‑5.5 and an Opus‑4.8 design pass converged on this (one suggested ProcessorCount, the other ; pow2 + clamp 64 is the shared core).
  • Customization (additive only — IPlatformLock is public/ABI-locked): resolve once at first use, in order: AppContext.GetData("SkiaSharp.HandleDictionary.ShardCount") → env var SKIASHARP_HANDLE_SHARDS → auto-default. Same "set before first lock is created" constraint the PlatformLock.Factory doc already states. No new public API.

Pointer-hash mixing is mandatory

Native handles are aligned pointers — the low 3–4 bits are always zero — so handle & (N-1) on the raw value dumps everything into shard 0. Must bit-mix first, e.g. multiplicative/Fibonacci:
shard = (int)(((ulong)handle * 0x9E3779B97F4A7C15) >> (64 - log2N)) (mask to N).
The benchmark confirmed this distributes aligned pointers across 64 shards with CoV ≈ 0%.

Benchmark results (standalone microbenchmark)

12-core machine, Server GC, 300k ops/thread, 7-rep median + warmup. Op = Register(write) + TryGet(read) + Deregister(write) (the create/lookup/dispose cycle). spin=50 adds ~tens of ns of work outside the lock to approximate real interop and not overstate lock weight.

Churn of distinct handles (spin=50, realistic): speedup vs global lock (N=1)

threads N=1 N=4 N=16 N=64
1 1.0× 1.0× 1.0× 1.0× (flat — zero overhead)
2 1.0× 1.0× 1.0× 1.0× (no benefit)
8 1.0× 2.1× 2.6× 2.8×
12 1.0× 2.5× 3.3× 3.7×
16 1.0× 2.5× 3.2× 3.5×

Knee ≈ ProcessorCount (N≈16). Single-thread is flat across all N (no regression).

Counter-signal — shared-singleton workload (all threads hit ~8 shared handles): striping hurts (0.5–0.9×). When contention is concentrated on a few handles they can't spread across shards, and the extra locks just bounce cache lines. Many SkiaSharp singletons (sRGB color space, default typeface/font-manager) are exactly this shape.

Verdict

Striping delivers a real 2.5–3.7× contention reduction for high-thread parallel churn of distinct objects, with zero single-thread cost — but no benefit (slight harm) for shared-singleton hot paths, and the microbench overstates lock weight versus real Skia ctor/dtor cost. Conclusion: worth a guarded experiment, not a slam dunk. Gate any real implementation on an in-tree benchmark with real SKObjects confirming apps actually hit the registry bottleneck, and on resolving the cross-shard deadlock hazard.

Kill criteria (don't ship if)

  • <10–15% improvement at realistic thread counts with real objects, or
  • any measurable single-thread regression, or
  • real Skia construction/destruction cost dominates so registry contention is invisible, or
  • the cross-shard lock-ordering hazard can't be eliminated cleanly.

Suggested next steps

  1. Add a committed BenchmarkDotNet harness under benchmarks/ exercising N-thread create/register/dispose against the real HandleDictionary (baseline), [Params] over thread counts.
  2. Prototype the striped dictionary behind N resolution above; verify shard occupancy + the full test suite under THROW_OBJECT_EXCEPTIONS.
  3. Resolve the cross-shard deadlock hazard before un-drafting any PR.

Non-blocking for #4080.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    Status

    New

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions