Skip to content

format

967543a
Select commit
Loading
Failed to load commit list.
Closed

stats: remove storm of refcount bumping during metrics snapshot. #43958

format
967543a
Select commit
Loading
Failed to load commit list.
CI (Envoy) / Mobile/CC succeeded Mar 17, 2026 in 11m 20s

Mobile/CC (success)

Check has finished

Details

Check run finished (success ✔️)

The check run can be viewed here:

Mobile/CC (pr/43958/main@967543a)

Check started by

Request (pr/43958/main@967543a)

jmarantz @jmarantz 967543a #43958 merge main@eea4eab

stats: remove storm of refcount bumping during metrics snapshot.

Commit Message:

Changes Envoy’s stats snapshotting/lifetime strategy to avoid per-metric refcount churn during a metrics flush by retaining scope references instead of taking a refcounted handle to every individual stat.

Changes:

Update MetricSnapshotImpl to keep metrics alive by capturing ConstScopeSharedPtrs for all scopes, rather than incrementing refcounts on every stat during snapshot creation.
Restrict Allocator/AllocatorImpl stat-construction APIs (makeCounter/makeGauge/makeTextReadout) to protected, pushing call sites toward Scope APIs and enabling the snapshot optimization assumption (“stats are scope-owned”).
Update related tests/benchmarks to use scope-based stat construction and adjust benchmark ranges.

This change makes the existing bmFlushToSinks benchmark in test/server/server_stats_flush_benchmark_test.cc run 5x faster, especially visible when extended to 10M stats, which is done in this PR, highlighting the problematic metrics flush duration >5s when the interval is 5s. I will note also that the problem with the excessive ref-count churn is likely much worse on ARM though I did not measure that.

Before:

Benchmark                                         Time             CPU   Iterations
-----------------------------------------------------------------------------------
bmFlushToSinks/10                             0.002 ms        0.002 ms       299435
bmFlushToSinks/100                            0.011 ms        0.011 ms        61284
bmFlushToSinks/1000                           0.109 ms        0.109 ms         6353
bmFlushToSinks/10000                           1.12 ms         1.12 ms          616
bmFlushToSinks/100000                          27.0 ms         27.0 ms           23
bmFlushToSinks/1000000                          394 ms          394 ms            2
bmFlushToSinks/10000000                        4841 ms         4840 ms            1

After:

Benchmark                                         Time             CPU   Iterations
-----------------------------------------------------------------------------------
bmFlushToSinks/10                             0.002 ms        0.002 ms       436476
bmFlushToSinks/100                            0.004 ms        0.004 ms       202299
bmFlushToSinks/1000                           0.030 ms        0.030 ms        23043
bmFlushToSinks/10000                          0.376 ms        0.375 ms         1879
bmFlushToSinks/100000                          6.73 ms         6.73 ms          103
bmFlushToSinks/1000000                          100 ms          100 ms            7
bmFlushToSinks/10000000                        1065 ms         1065 ms            1

Additional Description:
Risk Level: medium -- changing mechanism of keeping stats alive during metrics sink
Testing: //test/... but, man, Envoy test has gotten super flaky since I last did dev. Things eventually pass when I re-run 5 or so integration tests one at a time, on a clean client. IDK what's up with CI.
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a
Fixes #43836

Environment

Request variables

Key Value
ref 09e6a49
sha 967543a
pr 43958
base-sha eea4eab
actor jmarantz @jmarantz
message stats: remove storm of refcount bumping during metrics snapshot....
started 1773779840.942395
target-branch main
trusted false
Build image

Container image/s (as used in this CI run)

Key Value
default docker.io/envoyproxy/envoy-build:86873047235e9b8232df989a5999b9bebf9db69c
mobile docker.io/envoyproxy/envoy-build:mobile-86873047235e9b8232df989a5999b9bebf9db69c
Version

Envoy version (as used in this CI run)

Key Value
major 1
minor 38
patch 0
dev true