repro steps:

- deploy an EdgeDB server
- configure a system like Consul to query the `/server/status/alive` and `/server/status/ready` healthcheck endpoints (https://developer.hashicorp.com/consul/docs/services/usage/checks); see the example check definition after this list
- configure a timeout on the healthcheck endpoints (possibly fairly short, such as 1-2 seconds, on the assumption that the Consul agent performing the healthcheck runs on the same machine as the EdgeDB server, so host-to-host latency or a network partition should not be a factor)
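For reference, a Consul HTTP check along these lines might look like the sketch below. The check id and name are made up, and the URL assumes EdgeDB's default port of 5656; adjust both to match the actual deployment.

```json
{
  "check": {
    "id": "edgedb-ready",
    "name": "EdgeDB readiness",
    "http": "http://localhost:5656/server/status/ready",
    "method": "GET",
    "interval": "10s",
    "timeout": "2s"
  }
}
```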
current behavior:

- if the server is under heavy load, the healthcheck endpoints can time out (see #8355)
- the Consul healthchecks only give us a binary signal: timed out, yes or no. They don't expose any more granular timing metrics, which limits our observability into performance problems of this kind. For example, if our timeout is 2 seconds, we can't tell whether the average successful check takes 1 second or 1 millisecond, and we can't see occasions where it spiked to "only" 1.8 seconds.
desired behavior:

- expose Prometheus metrics on how long these operations take
- since operations can sit in a queue, I think we want to record separately how long a request sat in the queue and how long it took to execute once it was pulled off the queue (see the sketch after this list)
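A minimal sketch of recording the two durations separately, using the prometheus_client Python library. The metric names, the queue representation, and handle() are hypothetical illustrations, not EdgeDB's actual internals:

```python
import time

from prometheus_client import Histogram

# Hypothetical metric names -- the server would pick its own.
QUEUE_WAIT = Histogram(
    "healthcheck_queue_wait_seconds",
    "Time a healthcheck request waited in the queue before executing",
)
EXEC_TIME = Histogram(
    "healthcheck_execution_seconds",
    "Time a healthcheck request took to execute once dequeued",
)


def handle(request):
    ...  # placeholder for the actual aliveness/readiness work


def enqueue(queue, request):
    # Stamp the request with a monotonic timestamp as it enters the queue.
    queue.append((time.monotonic(), request))


def process_one(queue):
    enqueued_at, request = queue.pop(0)
    # Record how long the request sat in the queue...
    QUEUE_WAIT.observe(time.monotonic() - enqueued_at)
    # ...and, separately, how long it takes to execute once dequeued.
    with EXEC_TIME.time():
        handle(request)
```

Histograms (rather than plain counters) give us the bucketed timing data needed to spot checks that spiked to "only" 1.8s without ever timing out.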
nice-to-have:

- current queue depth as another metric: it is expected to be almost always zero, and when non-zero, almost always small, so graphing it and looking for spikes is a very useful way to find "something got slow here" events (see the gauge sketch after this list)
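Queue depth fits a Prometheus Gauge sampled at scrape time; again a hypothetical sketch (metric name made up, `queue` standing in for whatever structure the server actually uses):

```python
from prometheus_client import Gauge

queue = []  # stand-in for the server's real queue

QUEUE_DEPTH = Gauge(
    "healthcheck_queue_depth",
    "Number of healthcheck requests currently waiting in the queue",
)
# set_function() samples the live queue length on every scrape, so no
# manual inc()/dec() bookkeeping is needed at enqueue/dequeue sites.
QUEUE_DEPTH.set_function(lambda: len(queue))
```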