Reduce HandleDictionary lock contention via lock striping/sharding

## Summary

`HandleDictionary` currently serializes **all** object registration, lookup, promotion, and disposal through a **single process-wide lock** (`instancesLock`). Every `SKObject` creation and disposal takes this lock. Under highly parallel workloads that create/dispose many distinct wrappers concurrently, this lock is a scalability bottleneck.

This issue proposes investigating **lock striping** (sharding the dictionary into `N` independently-locked partitions keyed by a hash of the native handle) as a **future, non-blocking** improvement. It is split out from #4080 (which hardens the *correctness* of the current single-lock design) so that #4080 can merge without taking on this larger, riskier change.

> **Status: exploratory.** A standalone microbenchmark (below) shows real but **conditional** value. There is also a genuine new hazard (cross-shard lock ordering). This needs an in-tree benchmark against real `SKObject`s and careful design review before anyone commits to it.

Cross-link: #4080 · fixes-context #3817

## Current design (baseline)

- `binding/SkiaSharp/HandleDictionary.cs`: one `Dictionary<IntPtr, WeakReference> instances` guarded by one `IPlatformLock instancesLock` (`PlatformLock.Create()`).
- `GetInstance` → read lock. `GetOrAddObject` → upgradeable-read lock held for its **entire** duration, **including the factory/ctor invocation** (line ~112), which re-enters the **write** lock via `RegisterHandle`.
- `RegisterHandle` / `DeregisterHandle` → write lock.
- `SKObject.Dispose()` takes `HandleDictionary.instancesLock.EnterWriteLock()` **directly** around its `IgnorePublicDispose` check + the `isDisposed` CAS (SKObject.cs ~332). This coupling is what makes promote-vs-public-dispose mutually exclusive.

## Invariants any striped design MUST preserve

1. **Promote/dispose mutual exclusion** — `GetOrAddObject(disposeProtected:true)` setting `IgnorePublicDispose` must be mutually exclusive with `SKObject.Dispose()` reading it + claiming disposal. Today both use the *same* global lock. A striped version is safe **only because every operation is keyed on a single handle** → an object's handle always maps to **one** shard, so `GetOrAddObject` and that object's `Dispose()` contend on the **same** shard lock. The accessor `SKObject.Dispose()` uses must resolve the shard for *its* handle.
2. **Replacement teardown** — `RegisterHandle`'s "ownership handed off" branch disposes the replaced wrapper **outside** the lock; `DeregisterHandle` must still no-op when `weak.Target` is now a different object.
3. **Reentrancy** — `GetOrAddObject` holds upgradeable-read, then the factory→ctor→`RegisterHandle` takes the write lock **on the same thread**. `ReaderWriterLockSlim(NoRecursion)` permits upgradeable→write on one thread; a striped version must route both to the **same shard** for that handle to keep this legal.

## ⚠️ Key new hazard: cross-shard lock ordering (deadlock)

With a single lock, nested registration is reentrant on one lock object. With striping, the factory-under-lock pattern means: while holding **shard(parent)**, constructing a child wrapper with a *different* handle acquires **shard(child)**. If two threads do this in opposite handle order, they can deadlock — a class of bug that **cannot occur today**. Any implementation must either (a) never hold two shard locks simultaneously (e.g. run the factory outside the shard lock, which itself reworks invariant #1 and is the dangerous part), or (b) impose a strict global lock-ordering. **This hazard is the main reason striping is non-trivial and is the first thing to resolve.**

## Proposed design sketch

- `N` shards, each = `{ Dictionary<IntPtr,WeakReference>, IPlatformLock }`. Build the lock array by calling the existing **`PlatformLock.Factory()` N times** — preserves the plug-in contract (#1383 Win32 `CRITICAL_SECTION` fix) unchanged, one independent lock per shard.
- `N=1` must reduce **exactly** to today's behavior (baseline + safe fallback).
- Shard index = **bit-mixed** hash of the handle, masked with `& (N-1)` (so `N` is a power of two — no modulo).
- Add `HandleDictionary.GetShardLock(handle)` for `SKObject.Dispose()` to route to the per-handle shard lock.

### Shard count (default) + customization

- **Default:** next power-of-two ≥ `Environment.ProcessorCount`, clamped to `[1, 64]` (cap bounds memory — each shard is a Dictionary + lock). Both an independent GPT‑5.5 and an Opus‑4.8 design pass converged on this (one suggested `ProcessorCount`, the other `2×`; pow2 + clamp 64 is the shared core).
- **Customization (additive only — `IPlatformLock` is public/ABI-locked):** resolve once at first use, in order: `AppContext.GetData("SkiaSharp.HandleDictionary.ShardCount")` → env var `SKIASHARP_HANDLE_SHARDS` → auto-default. Same "set before first lock is created" constraint the `PlatformLock.Factory` doc already states. No new public API.

### Pointer-hash mixing is mandatory

Native handles are **aligned pointers** — the low 3–4 bits are always zero — so `handle & (N-1)` on the raw value dumps everything into shard 0. Must bit-mix first, e.g. multiplicative/Fibonacci:
`shard = (int)(((ulong)handle * 0x9E3779B97F4A7C15) >> (64 - log2N))` (mask to `N`).
The benchmark confirmed this distributes aligned pointers across 64 shards with **CoV ≈ 0%**.

## Benchmark results (standalone microbenchmark)

12-core machine, Server GC, 300k ops/thread, 7-rep median + warmup. Op = `Register(write) + TryGet(read) + Deregister(write)` (the create/lookup/dispose cycle). `spin=50` adds ~tens of ns of work **outside** the lock to approximate real interop and not overstate lock weight.

**Churn of distinct handles (spin=50, realistic):** speedup vs global lock (N=1)

| threads | N=1 | N=4 | N=16 | N=64 |
|--:|--:|--:|--:|--:|
| 1 | 1.0× | 1.0× | 1.0× | 1.0× (flat — **zero overhead**) |
| 2 | 1.0× | 1.0× | 1.0× | 1.0× (no benefit) |
| 8 | 1.0× | 2.1× | 2.6× | 2.8× |
| 12 | 1.0× | 2.5× | **3.3×** | 3.7× |
| 16 | 1.0× | 2.5× | 3.2× | 3.5× |

Knee ≈ `ProcessorCount` (N≈16). Single-thread is flat across all N (no regression).

**Counter-signal — shared-singleton workload (all threads hit ~8 shared handles):** striping **hurts** (0.5–0.9×). When contention is concentrated on a few handles they can't spread across shards, and the extra locks just bounce cache lines. Many SkiaSharp singletons (sRGB color space, default typeface/font-manager) are exactly this shape.

### Verdict

Striping delivers a real **2.5–3.7× contention reduction** for high-thread parallel churn of **distinct** objects, with **zero single-thread cost** — but **no benefit (slight harm)** for shared-singleton hot paths, and the microbench overstates lock weight versus real Skia ctor/dtor cost. **Conclusion: worth a guarded experiment, not a slam dunk.** Gate any real implementation on an **in-tree benchmark with real `SKObject`s** confirming apps actually hit the registry bottleneck, and on resolving the cross-shard deadlock hazard.

### Kill criteria (don't ship if)

- <10–15% improvement at realistic thread counts with **real** objects, or
- any measurable single-thread regression, or
- real Skia construction/destruction cost dominates so registry contention is invisible, or
- the cross-shard lock-ordering hazard can't be eliminated cleanly.

## Suggested next steps

1. Add a committed BenchmarkDotNet harness under `benchmarks/` exercising N-thread create/register/dispose against the **real** `HandleDictionary` (baseline), `[Params]` over thread counts.
2. Prototype the striped dictionary behind `N` resolution above; verify shard occupancy + the full test suite under `THROW_OBJECT_EXCEPTIONS`.
3. Resolve the cross-shard deadlock hazard before un-drafting any PR.

Non-blocking for #4080.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce HandleDictionary lock contention via lock striping/sharding #4101

Summary

Current design (baseline)

Invariants any striped design MUST preserve

⚠️ Key new hazard: cross-shard lock ordering (deadlock)

Proposed design sketch

Shard count (default) + customization

Pointer-hash mixing is mandatory

Benchmark results (standalone microbenchmark)

Verdict

Kill criteria (don't ship if)

Suggested next steps

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

threads	N=1	N=4	N=16	N=64
1	1.0×	1.0×	1.0×	1.0× (flat — zero overhead)
2	1.0×	1.0×	1.0×	1.0× (no benefit)
8	1.0×	2.1×	2.6×	2.8×
12	1.0×	2.5×	3.3×	3.7×
16	1.0×	2.5×	3.2×	3.5×

Reduce HandleDictionary lock contention via lock striping/sharding #4101

Description

Summary

Current design (baseline)

Invariants any striped design MUST preserve

⚠️ Key new hazard: cross-shard lock ordering (deadlock)

Proposed design sketch

Shard count (default) + customization

Pointer-hash mixing is mandatory

Benchmark results (standalone microbenchmark)

Verdict

Kill criteria (don't ship if)

Suggested next steps

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions