Expected Behavior
The cache_size{cache_type="mutablestate"} Prometheus metric should reflect the configured
capacity of the mutable state cache (the workflow execution LRU cache in the history service).
When cacheSizeBasedLimit: true and hostLevelCacheMaxSizeBytes: 629145600 are set in dynamic
config, the metric should report 629145600 (the byte-mode capacity).
Actual Behavior
cache_size{cache_type="mutablestate"} always reports 128000 - the default value of
ReplicationProgressCacheMaxSize - regardless of the mutable state cache's actual configuration.
This happens because the replication progress cache (service/history/replication/progress_cache.go:61)
reuses MutableStateCacheTypeTagValue as its metrics tag:
// progress_cache.go:61
return &progressCacheImpl{
	cache: cache.NewWithMetrics(maxSize, opts, handler.WithTags(
		metrics.CacheTypeTag(metrics.MutableStateCacheTypeTagValue), // <-- should be its own tag
	)),
}
Both caches call cache.NewWithMetrics(), which records a cache_size gauge at construction time. Since Prometheus gauges use last-write-wins semantics, whichever cache is constructed last determines the reported value. In practice, the replication progress cache is constructed after the mutable state cache (via fx dependency ordering), so it overwrites the gauge with its own maxSize of 128000.
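The collision can be sketched in isolation. The following is a toy Go model, not the actual Temporal or Prometheus code; the gaugeRegistry type and set helper are hypothetical stand-ins for a gauge keyed by its label set:

```go
package main

import "fmt"

// Toy model of the collision (hypothetical names): a Prometheus gauge is
// one time series per label set, and Set() has last-write-wins semantics.
type gaugeRegistry map[string]float64

func (g gaugeRegistry) set(name, cacheType string, v float64) {
	g[name+`{cache_type="`+cacheType+`"}`] = v // same label set -> same series
}

func main() {
	g := gaugeRegistry{}

	// The mutable state cache is constructed first and reports its byte capacity.
	g.set("cache_size", "mutablestate", 629145600)

	// The replication progress cache is constructed later but reuses the
	// SAME cache_type value, so it silently overwrites the gauge.
	g.set("cache_size", "mutablestate", 128000)

	fmt.Printf("%.0f\n", g[`cache_size{cache_type="mutablestate"}`]) // 128000
}
```

Because both writes land on the identical series key, the scrape only ever sees whichever construction ran last.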
This makes it impossible to monitor the actual mutable state cache capacity via Prometheus.
Impact
This bug is particularly misleading when investigating cacheSizeBasedLimit. Users who enable
byte-based cache limiting (cacheSizeBasedLimit: true) and check the cache_size{mutablestate}
metric will see 128000 instead of their configured byte limit - leading them to incorrectly
conclude that byte mode did not activate.
This has already caused confusion for at least two independent users:
- Our team spent significant time investigating a phantom "bug" in
cacheSizeBasedLimit, including
full source code tracing, unit tests, and Docker-level debugging before discovering the gauge
collision. We were about to switch to count-based mode as a workaround for a problem that didn't
exist.
- @andropler in community thread #18787 and issue #8902 reported the same
cache_size = 128000 observation and switched to count-based mode. Their observation is consistent with this gauge collision - byte mode may have been working for them too.
Steps to Reproduce the Problem
- Deploy Temporal v1.29.1 (or latest main) with the history service and this dynamic config:

  history.cacheSizeBasedLimit:
  - value: true
  history.hostLevelCacheMaxSizeBytes:
  - value: 629145600 # 600 MiB

- Wait for the history service to start.
- Scrape the Prometheus metrics endpoint (default :9090/metrics).
- Observe: cache_size{cache_type="mutablestate"} 128000
- Expected: cache_size{cache_type="mutablestate"} 629145600
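To check the scraped value programmatically, a small filter over the Prometheus text payload can be used. This is an illustrative sketch with a hardcoded sample payload; in practice the payload would come from an HTTP GET of the metrics endpoint:

```go
package main

import (
	"fmt"
	"strings"
)

// findCacheSize returns the mutable state cache_size line from a
// Prometheus text-format payload, or "" if absent. Sketch only: it
// prefix-matches the label set shown in this report and would need
// adjusting if the series carries additional labels.
func findCacheSize(payload string) string {
	for _, line := range strings.Split(payload, "\n") {
		if strings.HasPrefix(line, `cache_size{cache_type="mutablestate"}`) {
			return line
		}
	}
	return ""
}

func main() {
	sample := `# TYPE cache_size gauge
cache_size{cache_type="mutablestate"} 128000`
	fmt.Println(findCacheSize(sample)) // cache_size{cache_type="mutablestate"} 128000
}
```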
Verification via debug logging
We built a patched binary from v1.29.1 source with fmt.Printf in both NewHostLevelCache
(cache.go) and NewProgressCache (progress_cache.go). Output confirms the initialization
order and gauge overwrite:
DEBUG: HistoryCacheSizeBasedLimit = true
DEBUG NewHostLevelCache: HistoryCacheLimitSizeBased=true maxSize(count)=128000
DEBUG NewHostLevelCache: maxSize(bytes)=629145600
DEBUG NewProgressCache: maxSize=128000, using tag=MutableStateCacheTypeTagValue
The mutable state cache correctly enters byte mode with maxSize=629145600. Then the replication
progress cache overwrites the gauge with 128000.
Suggested Fix
Give the replication progress cache its own metric tag value. For example:
// common/metrics/metric_defs.go - add new constant
ReplicationProgressCacheTypeTagValue = "replication_progress"
// service/history/replication/progress_cache.go:61 - use new tag
cache: cache.NewWithMetrics(maxSize, opts, handler.WithTags(
metrics.CacheTypeTag(metrics.ReplicationProgressCacheTypeTagValue),
)),
This is a one-line behavioral change (plus the new constant definition). It would allow both caches to report their cache_size independently via distinct cache_type label values.
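A minimal sketch of why distinct label values resolve the collision, using a toy map standing in for the Prometheus gauge registry (hypothetical names, not the Temporal or Prometheus API):

```go
package main

import "fmt"

// Each distinct label set owns its own time series, so two caches with
// different cache_type values can no longer overwrite each other.
func main() {
	gauges := map[string]float64{}
	set := func(cacheType string, v float64) {
		gauges[`cache_size{cache_type="`+cacheType+`"}`] = v
	}

	set("mutablestate", 629145600)      // mutable state cache, byte mode
	set("replication_progress", 128000) // replication progress cache

	// Both capacities are now observable independently.
	fmt.Printf("%.0f\n", gauges[`cache_size{cache_type="mutablestate"}`])        // 629145600
	fmt.Printf("%.0f\n", gauges[`cache_size{cache_type="replication_progress"}`]) // 128000
}
```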
Specifications
- Version: v1.29.1 (also confirmed on latest main - the code is unchanged)
- Platform: Linux/arm64 (Docker), also observed on Kubernetes (EKS)
- File: service/history/replication/progress_cache.go:61 (at commit a4e6f11)
- Introduced in: 4d6dc3614
Related Issues
- #8902 - "History service memory usage upward trend"
- Community thread #18787 - "Memory OOM issues with history pod and size-based cache configuration"