Description
I'm trying to squeeze as much performance as I can from my poor laptop running Omes against Temporal with ScyllaDB / Cassandra, and I'm unsure where the bottleneck is. Among other things, I noticed this issue (well, AI and I noticed it). Here's the AI description, which I think is reasonable:
Setting history.persistenceMaxQPS: 0 in dynamic config (intended to mean "unlimited") causes all queue reader host-level rate limiters to be created with rate=0, burst=0. This results in every loadAndSubmitTasks call failing its Wait() and logging an unthrottled error with a full stacktrace.
Root cause:
NewHostRateLimiterRateFn in service/history/queue_factory_base.go:224-233 falls back to persistenceMaxRPS() * ratio when MaxPollHostRPS=0. If persistenceMaxQPS is also 0, the effective rate becomes 0 * 0.3 = 0, creating a rate limiter with burst=0. Go's rate.Limiter.ReserveN(now, 1) returns OK()=false when burst < tokens, triggering the error path at service/history/queues/reader.go:433.
Impact:
- 317K error log lines in a 5-minute run (99.2% of all server log output)
- Each log line includes a full JSON-serialized stacktrace
- Affects all queue processors: transfer (106K), timer (105K), archival (104K), visibility (22K), outbound (2K)
- All 128 shards affected (~2500 errors per shard)
- Significant CPU and I/O overhead from log serialization
Secondary issue:
The error log at reader.go:433 has no rate limiting despite being in a hot loop. Even when triggered legitimately, it should use a throttled logger.
Suggested fixes:
- In NewHostRateLimiterRateFn, handle persistenceMaxRPS() <= 0 by using the default value (9000) or returning math.MaxFloat64
- Add rate limiting to the error log at reader.go:433
Reproduction: Set history.persistenceMaxQPS: 0 in dynamic config, start server with 128 shards, observe log output.
I can, of course, work on fixing these.