stats: remove storm of refcount bumping during metrics snapshot. #43958
Mobile/CC (success)
Check has finished
Details
Check run finished (success ✔️)
The check run can be viewed here:
Mobile/CC (pr/43958/main@967543a)
Check started by
Request (pr/43958/main@967543a)
@jmarantz
967543a #43958
merge main@eea4eab
stats: remove storm of refcount bumping during metrics snapshot.
Commit Message:
Changes Envoy’s stats snapshotting/lifetime strategy to avoid per-metric refcount churn during a metrics flush by retaining scope references instead of taking a refcounted handle to every individual stat.
Changes:
Update MetricSnapshotImpl to keep metrics alive by capturing ConstScopeSharedPtrs for all scopes, rather than incrementing refcounts on every stat during snapshot creation.
Restrict Allocator/AllocatorImpl stat-construction APIs (makeCounter/makeGauge/makeTextReadout) to protected, pushing call sites toward Scope APIs and enabling the snapshot optimization assumption (“stats are scope-owned”).
Update related tests/benchmarks to use scope-based stat construction and adjust benchmark ranges.This change makes the existing bmFlushToSinks benchmark in test/server/server_stats_flush_benchmark_test.cc run 5x faster, especially visible when extended to 10M stats, which is done in this PR, highlighting the problematic metrics flush duration >5s when the interval is 5s. I will note also that the problem with the excessive ref-count churn is likely much worse on ARM though I did not measure that.
Before:
Benchmark Time CPU Iterations ----------------------------------------------------------------------------------- bmFlushToSinks/10 0.002 ms 0.002 ms 299435 bmFlushToSinks/100 0.011 ms 0.011 ms 61284 bmFlushToSinks/1000 0.109 ms 0.109 ms 6353 bmFlushToSinks/10000 1.12 ms 1.12 ms 616 bmFlushToSinks/100000 27.0 ms 27.0 ms 23 bmFlushToSinks/1000000 394 ms 394 ms 2 bmFlushToSinks/10000000 4841 ms 4840 ms 1After:
Benchmark Time CPU Iterations ----------------------------------------------------------------------------------- bmFlushToSinks/10 0.002 ms 0.002 ms 436476 bmFlushToSinks/100 0.004 ms 0.004 ms 202299 bmFlushToSinks/1000 0.030 ms 0.030 ms 23043 bmFlushToSinks/10000 0.376 ms 0.375 ms 1879 bmFlushToSinks/100000 6.73 ms 6.73 ms 103 bmFlushToSinks/1000000 100 ms 100 ms 7 bmFlushToSinks/10000000 1065 ms 1065 ms 1Additional Description:
Risk Level: medium -- changing mechanism of keeping stats alive during metrics sink
Testing: //test/... but, man, Envoy test has gotten super flaky since I last did dev. Things eventually pass when I re-run 5 or so integration tests one at a time, on a clean client. IDK what's up with CI.
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a
Fixes #43836
Environment
Request variables
| Key | Value |
|---|---|
| ref | 09e6a49 |
| sha | 967543a |
| pr | 43958 |
| base-sha | eea4eab |
| actor | |
| message | stats: remove storm of refcount bumping during metrics snapshot.... |
| started | 1773779840.942395 |
| target-branch | main |
| trusted | false |
Build image
Container image/s (as used in this CI run)
| Key | Value |
|---|---|
| default | docker.io/envoyproxy/envoy-build:86873047235e9b8232df989a5999b9bebf9db69c |
| mobile | docker.io/envoyproxy/envoy-build:mobile-86873047235e9b8232df989a5999b9bebf9db69c |
Version
Envoy version (as used in this CI run)
| Key | Value |
|---|---|
| major | 1 |
| minor | 38 |
| patch | 0 |
| dev | true |