expose metrics on timing of server -> compiler pool RPC operations #8356

Open
zackelan opened this issue Feb 20, 2025 · 0 comments

repro steps:

  • deploy an EdgeDB server

  • configure a system like Consul to query the /server/status/alive and /server/status/ready healthcheck endpoints (https://developer.hashicorp.com/consul/docs/services/usage/checks)

  • configure a timeout on the healthcheck endpoints (possibly fairly short, such as 1-2 seconds, under the assumption that the Consul agent performing the healthcheck is on the same machine as the EdgeDB server, so host-to-host latency or network partition should not be a factor)

current behavior:

if the server is under heavy load, the healthcheck endpoints can time out (see #8355)

the Consul healthchecks only give us a binary signal: timed out, yes or no. they don't expose any more granular timing metrics than that, which limits our observability into performance problems of this kind. for example, if our timeout is 2 seconds, we can't see whether the average successful check takes 1 sec or 1 msec, and we can't see the times when it spiked up to "only" 1.8 sec.
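
To illustrate the gap, here is a minimal Python sketch comparing what a pass/fail check reports with what a bucketed latency metric would retain. The sample durations, the bucket boundaries, and the `bucket_counts` helper are made up for illustration; only the 2-second timeout comes from the example above.

```python
TIMEOUT_S = 2.0  # timeout from the example above

# Hypothetical healthcheck durations in seconds (made up for illustration).
durations = [0.001, 0.001, 1.8, 0.002, 2.5]

# What a pass/fail healthcheck reports: only whether each check timed out.
binary_signal = [d > TIMEOUT_S for d in durations]

# Assumed bucket upper bounds in seconds; any real choice would need tuning.
BUCKETS = (0.01, 0.1, 0.5, 1.0, 2.0, float("inf"))


def bucket_counts(samples, buckets=BUCKETS):
    """Count each sample in the first bucket whose upper bound it fits under."""
    counts = {b: 0 for b in buckets}
    for s in samples:
        for b in buckets:
            if s <= b:
                counts[b] += 1
                break
    return counts
```

With these samples the binary signal shows a single failure, while the bucket counts also show that one successful check landed in the 1-2 second bucket, which is the near-miss the pass/fail view hides.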

desired behavior:

expose Prometheus metrics on how long server -> compiler pool RPC operations take

since operations can sit in a queue, I think we want to record separately how long each operation sat in the queue and how long it took to execute once it was pulled off the queue.
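
A rough sketch of the queue-time vs. execution-time split, in plain Python with asyncio. Everything here is hypothetical: `TimedPool` and `TimingStats` are stand-ins, not the server's actual compiler pool or metrics code, and a semaphore is only assumed as a simple model of "waiting for a free worker". A real implementation would presumably feed these observations into Prometheus histograms instead of a count and a sum.

```python
import asyncio
import time
from dataclasses import dataclass


@dataclass
class TimingStats:
    """Accumulates a count and a running sum of observed durations (seconds)."""
    count: int = 0
    total: float = 0.0

    def observe(self, seconds: float) -> None:
        self.count += 1
        self.total += seconds


class TimedPool:
    """Hypothetical stand-in for a pool with a fixed number of workers."""

    def __init__(self, workers: int) -> None:
        self._slots = asyncio.Semaphore(workers)
        self.queue_time = TimingStats()  # time spent waiting for a worker
        self.exec_time = TimingStats()   # time spent running on a worker

    async def call(self, op):
        enqueued = time.monotonic()
        async with self._slots:
            started = time.monotonic()
            self.queue_time.observe(started - enqueued)
            try:
                return await op()
            finally:
                # Record execution time even if the operation raises.
                self.exec_time.observe(time.monotonic() - started)


async def demo():
    # One worker, two operations: the second has to wait for the first.
    pool = TimedPool(workers=1)

    async def op():
        await asyncio.sleep(0.05)
        return "ok"

    results = await asyncio.gather(pool.call(op), pool.call(op))
    return pool, results
```

In the demo the second call's wait shows up in `queue_time` rather than inflating `exec_time`, which is the distinction the two metrics are meant to capture.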

nice-to-have:

current queue depth as another metric. it is expected to almost always be zero and, when non-zero, almost always small, so graphing it and looking for spikes is a very useful way to find "something got slow here" events.
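
A minimal sketch of tracking queue depth, again in plain Python with asyncio. `QueueDepthGauge` and `run_with_depth` are hypothetical names, and the sketch assumes "depth" means operations waiting for a worker, not ones already executing. A Prometheus gauge would only expose the current value; `peak` is kept here just to make the demo observable.

```python
import asyncio


class QueueDepthGauge:
    """Hypothetical gauge: number of operations currently waiting for a worker."""

    def __init__(self) -> None:
        self.current = 0
        self.peak = 0  # demo-only; a scraped gauge would report `current`

    def inc(self) -> None:
        self.current += 1
        self.peak = max(self.peak, self.current)

    def dec(self) -> None:
        self.current -= 1


async def run_with_depth(gauge, slots, op):
    gauge.inc()
    try:
        await slots.acquire()
    finally:
        # Leave the queue whether we got a slot or were cancelled while waiting.
        gauge.dec()
    try:
        return await op()
    finally:
        slots.release()


async def demo_depth():
    gauge = QueueDepthGauge()
    slots = asyncio.Semaphore(1)  # a single worker, assumed for the demo

    async def op():
        await asyncio.sleep(0.01)

    # Three operations and one worker: two of them end up waiting at once.
    await asyncio.gather(*(run_with_depth(gauge, slots, op) for _ in range(3)))
    return gauge
```

Since the gauge returns to zero once the backlog drains, a scrape interval longer than a spike can miss it, so pairing this with the queue-time metric is probably still worthwhile.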
