
scertecd: restructure metrics to match Prometheus guidelines #11

Open: wants to merge 1 commit into main
Conversation


@knyar knyar commented Feb 21, 2024

I left this as a comment on already-merged PR #8, which I suspect was not the best way to raise awareness. I'm opening a PR just to start a discussion, but I won't be too sad if you decide to address this in some other way.

I don't particularly care what Prometheus metric library we use, but if we are open-sourcing this tool I think we should structure metrics in a more idiomatic way, aligning with Prometheus instrumentation and metric naming guidelines. Specifically:

  • Instead of a catch-all metric for errors, make sure each error counter can easily be associated with a counter of total attempts (or successes), so users can calculate an error ratio for dashboards or alerts;
  • Instead of exporting time-since or time-until values, export the unix timestamp and let users use the time() query function. In many time series storage engines this is also cheaper to store (most samples will have the same value);
  • Similarly, instead of deciding ourselves what "good" and "bad" mean, export the raw metrics and let users configure their own expressions in dashboards and alerts;
  • Metric names should have suffixes indicating units, with an additional _total suffix for counters.
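To illustrate the naming conventions, here is a minimal stdlib-only sketch (the actual PR uses a standard Prometheus client library; the metric names mirror the queries in this description, and the rendering code is just for illustration). Counters carry a _total suffix, and times are exported as unix timestamps in seconds rather than as time-until/time-since durations:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical metric values; in the real code these would be managed by a
// Prometheus client library rather than plain package variables.
var (
	renewalAttempts int64 // scertecd_renewal_attempts_total
	renewalErrors   int64 // scertecd_renewal_errors_total
	renewalTime     int64 // scertecd_renewal_time_seconds (unix timestamp, not time-until)
)

// renderMetrics emits the metrics in the Prometheus text exposition format.
func renderMetrics() string {
	return fmt.Sprintf(
		"scertecd_renewal_attempts_total %d\n"+
			"scertecd_renewal_errors_total %d\n"+
			"scertecd_renewal_time_seconds %d\n",
		renewalAttempts, renewalErrors, renewalTime)
}

func main() {
	renewalAttempts++
	renewalTime = time.Now().Unix()
	fmt.Print(renderMetrics())
}
```

Because the timestamp gauge only changes when a renewal actually happens, most scrapes record the same value, which is what makes it cheap to store.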

This PR is an example of what it could look like using a standard Prometheus library. It allows users to monitor things like:

Error ratio of renewals:

sum(rate(scertecd_renewal_errors_total[1h])) / sum(rate(scertecd_renewal_attempts_total[1h]))

Error ratio of setec operations:

sum(rate(scertecd_setec_requests_total{status!="ok"}[1h])) / sum(rate(scertecd_setec_requests_total[1h]))

Domains that should have been renewed more than 15 minutes ago:

time() - scertecd_renewal_time_seconds > 15*60

Instances with broken AWS credentials:

time() - scertecd_last_dns_change_seconds > 60*60

I dropped the various OCSP error codes since many of them seem very unlikely to occur, and any of them can now be detected via a metric label (scertecd_ocsp_checks_total{result="error"}); the logs have the details.
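The label-based approach can be sketched as follows, again stdlib-only for illustration (the labeledCounter type is hypothetical; a Prometheus client library's CounterVec plays this role in practice). One metric name with a "result" label replaces a separate metric per error code:

```go
package main

import "fmt"

// labeledCounter is a hypothetical stand-in for a Prometheus counter with a
// single "result" label, e.g. scertecd_ocsp_checks_total{result="error"}.
type labeledCounter struct {
	name   string
	counts map[string]int64
}

func (c *labeledCounter) inc(result string) { c.counts[result]++ }

// render emits the counter in the Prometheus text exposition format, one
// series per label value.
func (c *labeledCounter) render() string {
	out := ""
	for _, result := range []string{"ok", "error"} {
		out += fmt.Sprintf("%s{result=%q} %d\n", c.name, result, c.counts[result])
	}
	return out
}

func main() {
	c := &labeledCounter{name: "scertecd_ocsp_checks_total", counts: map[string]int64{}}
	c.inc("ok")
	c.inc("ok")
	c.inc("error")
	fmt.Print(c.render())
}
```

An alert can then fire on rate(scertecd_ocsp_checks_total{result="error"}[1h]) > 0 without caring which specific OCSP error occurred.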

Again, I don't feel too strongly about the library used, just that we make these metrics easy to understand and use.

@knyar knyar changed the title WIP prometheus metrics Align exported metrics with Prometheus recommendations Feb 21, 2024
@knyar knyar changed the title Align exported metrics with Prometheus recommendations scertecd: restructure metrics to match Prometheus guidelines Feb 21, 2024