repro steps:

- deploy an EdgeDB server
- configure a system like Consul to query the `/server/status/alive` and `/server/status/ready` healthcheck endpoints (https://developer.hashicorp.com/consul/docs/services/usage/checks); see the example check definition after this list
- configure a timeout on the healthcheck endpoints (possibly fairly short, such as 1-2 seconds, on the assumption that the Consul agent performing the healthcheck runs on the same machine as the EdgeDB server, so host-to-host latency or a network partition should not be a factor)
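For reference, a Consul HTTP check along these lines might look like the sketch below. The check id and name are made up, and the URL assumes EdgeDB's default port of 5656; adjust both to match the actual deployment.

```json
{
  "check": {
    "id": "edgedb-ready",
    "name": "EdgeDB readiness",
    "http": "http://localhost:5656/server/status/ready",
    "method": "GET",
    "interval": "10s",
    "timeout": "2s"
  }
}
```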
current behavior:

- if the server is under heavy load, the healthcheck endpoints can time out (see #8355)
- the Consul healthchecks only give us a binary signal: timed out, yes or no. They don't expose any more granular timing metrics, which limits our observability into performance problems of this kind. For example, if our timeout is 2 seconds, we can't tell whether the average successful check takes 1 second or 1 millisecond, and we can't see occasions where it spiked to "only" 1.8 seconds.
desired behavior:

- expose Prometheus metrics on how long these operations take
- since operations can sit in a queue, I think we want to record separately how long a request sat in the queue and how long it took to execute once it was pulled off the queue (see the sketch after this list)
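A minimal sketch of recording the two durations separately, using the prometheus_client Python library. The metric names, the queue representation, and handle() are hypothetical illustrations, not EdgeDB's actual internals:

```python
import time

from prometheus_client import Histogram

# Hypothetical metric names -- the server would pick its own.
QUEUE_WAIT = Histogram(
    "healthcheck_queue_wait_seconds",
    "Time a healthcheck request waited in the queue before executing",
)
EXEC_TIME = Histogram(
    "healthcheck_execution_seconds",
    "Time a healthcheck request took to execute once dequeued",
)


def handle(request):
    ...  # placeholder for the actual aliveness/readiness work


def enqueue(queue, request):
    # Stamp the request with a monotonic timestamp as it enters the queue.
    queue.append((time.monotonic(), request))


def process_one(queue):
    enqueued_at, request = queue.pop(0)
    # Record how long the request sat in the queue...
    QUEUE_WAIT.observe(time.monotonic() - enqueued_at)
    # ...and, separately, how long it takes to execute once dequeued.
    with EXEC_TIME.time():
        handle(request)
```

Histograms (rather than plain counters) give us the bucketed timing data needed to spot checks that spiked to "only" 1.8s without ever timing out.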
nice-to-have:

- current queue depth as another metric: it is expected to be almost always zero, and when non-zero, almost always small, so graphing it and looking for spikes is a very useful way to find "something got slow here" events (see the gauge sketch after this list)
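Queue depth fits a Prometheus Gauge sampled at scrape time; again a hypothetical sketch (metric name made up, `queue` standing in for whatever structure the server actually uses):

```python
from prometheus_client import Gauge

queue = []  # stand-in for the server's real queue

QUEUE_DEPTH = Gauge(
    "healthcheck_queue_depth",
    "Number of healthcheck requests currently waiting in the queue",
)
# set_function() samples the live queue length on every scrape, so no
# manual inc()/dec() bookkeeping is needed at enqueue/dequeue sites.
QUEUE_DEPTH.set_function(lambda: len(queue))
```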