
scertecd: restructure metrics to match Prometheus guidelines #11

Open: wants to merge 1 commit into main
Conversation


@knyar knyar commented Feb 21, 2024

I left this as a comment on already-merged PR #8, which I suspect was not the best way to raise awareness. I'm opening a PR just to start a discussion, but I won't be too sad if you decide to address this in some other way.

I don't particularly care what Prometheus metric library we use, but if we are open-sourcing this tool I think we should structure metrics in a more idiomatic way, aligning with Prometheus instrumentation and metric naming guidelines. Specifically:

  • Instead of a catch-all metric for errors, make sure each error counter can easily be associated with a counter of total attempts (or successes), so users can calculate an error ratio for dashboards or alerts;
  • Instead of exporting time-since or time-until values, export the unix timestamp and let users use the time() query function. In many time series storage engines this is also cheaper to store (most samples will have the same value);
  • Similarly, instead of deciding ourselves what "good" and "bad" mean, export the raw metrics and let users configure their own expressions in dashboards and alerts;
  • Metric names should have suffixes indicating units, with an additional _total suffix for counters.
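To illustrate the naming conventions, here is a minimal stdlib-only sketch (the actual PR uses a standard Prometheus client library; the metric names mirror the queries in this description, and the rendering code is just for illustration). Counters carry a _total suffix, and times are exported as unix timestamps in seconds rather than as time-until/time-since durations:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical metric values; in the real code these would be managed by a
// Prometheus client library rather than plain package variables.
var (
	renewalAttempts int64 // scertecd_renewal_attempts_total
	renewalErrors   int64 // scertecd_renewal_errors_total
	renewalTime     int64 // scertecd_renewal_time_seconds (unix timestamp, not time-until)
)

// renderMetrics emits the metrics in the Prometheus text exposition format.
func renderMetrics() string {
	return fmt.Sprintf(
		"scertecd_renewal_attempts_total %d\n"+
			"scertecd_renewal_errors_total %d\n"+
			"scertecd_renewal_time_seconds %d\n",
		renewalAttempts, renewalErrors, renewalTime)
}

func main() {
	renewalAttempts++
	renewalTime = time.Now().Unix()
	fmt.Print(renderMetrics())
}
```

Because the timestamp gauge only changes when a renewal actually happens, most scrapes record the same value, which is what makes it cheap to store.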

This PR is an example of what it could look like using a standard Prometheus library. It allows users to monitor things like:

Error ratio of renewals:

sum(rate(scertecd_renewal_errors_total[1h])) / sum(rate(scertecd_renewal_attempts_total[1h]))

Error ratio of setec operations:

sum(rate(scertecd_setec_requests_total{status!="ok"}[1h])) / sum(rate(scertecd_setec_requests_total[1h]))

Domains that should have been renewed more than 15 minutes ago:

time() - scertecd_renewal_time_seconds > 15*60

Instances with broken AWS credentials:

time() - scertecd_last_dns_change_seconds > 60*60

I dropped the various OCSP error codes since many of them seem very unlikely to occur, and any of them can now be detected via a metric label (scertecd_ocsp_checks_total{result="error"}); the logs have the details.
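The label-based approach can be sketched as follows, again stdlib-only for illustration (the labeledCounter type is hypothetical; a Prometheus client library's CounterVec plays this role in practice). One metric name with a "result" label replaces a separate metric per error code:

```go
package main

import "fmt"

// labeledCounter is a hypothetical stand-in for a Prometheus counter with a
// single "result" label, e.g. scertecd_ocsp_checks_total{result="error"}.
type labeledCounter struct {
	name   string
	counts map[string]int64
}

func (c *labeledCounter) inc(result string) { c.counts[result]++ }

// render emits the counter in the Prometheus text exposition format, one
// series per label value.
func (c *labeledCounter) render() string {
	out := ""
	for _, result := range []string{"ok", "error"} {
		out += fmt.Sprintf("%s{result=%q} %d\n", c.name, result, c.counts[result])
	}
	return out
}

func main() {
	c := &labeledCounter{name: "scertecd_ocsp_checks_total", counts: map[string]int64{}}
	c.inc("ok")
	c.inc("ok")
	c.inc("error")
	fmt.Print(c.render())
}
```

An alert can then fire on rate(scertecd_ocsp_checks_total{result="error"}[1h]) > 0 without caring which specific OCSP error occurred.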

Again, I don't feel too strongly about the library used, just that we make these metrics easy to understand and use.

@knyar knyar changed the title WIP prometheus metrics Align exported metrics with Prometheus recommendations Feb 21, 2024
@knyar knyar changed the title Align exported metrics with Prometheus recommendations scertecd: restructure metrics to match Prometheus guidelines Feb 21, 2024