diff --git a/README.md b/README.md
index 6788a90..bf629cc 100644
--- a/README.md
+++ b/README.md
@@ -10,52 +10,54 @@ It is meant to:
 - Be flexible enough to support unmanaged configuration outside the boilerplate that it manages
 
 Currently, the two kinds of boilerplate that are supported:
-- Node exporter rules and alerts for vms (number of hosts detected, cpu, ram, disks)
+- Node exporter rules and alerts for VMs (number of hosts detected, CPU, RAM, disks)
 - Terracd jobs metrics and alerts (to get the interval since the last plan/apply and a threshold value that will trigger an alert)
 
 # Inputs
 
 - **config**: This should be the value of the entrypoint **prometheus.yml** configuration file which will be generated from this value. The module will add some **rule_files** entries for the rule files it generates and otherwise will leave the content as is.
-- **fs_path**: Path where the prometheus configuration will be generated prior to synchronizting it with etcd. Beyond generating the **prometheus.yml** file there, boilerplate rule files will be generated in the **rules** subdirectory.
+- **fs_path**: Path where the prometheus configuration will be generated prior to synchronizing it with etcd. Beyond generating the **prometheus.yml** file there, boilerplate rule files will be generated in the **rules** subdirectory.
 - **etcd_key_prefix**: Etcd prefix where the processed prometheus configuration will be synchronized.
 - **node_exporter_jobs**: List of node exporter jobs to generate boilerplate for. Each entry should take the following keys:
-  - **tag**: Tag for the node exporter job. Is should consist of words separated by dashes. The job is expected to be called `-node-exporter`
+  - **tag**: Tag for the node exporter job. It should consist of words separated by dashes. The job is expected to be called `-node-exporter`
   - **expected_count**: Expected number of instances associated with the job
-  - **memory_usage_threshold**: Maximum memory usage as a percentage (ex: 90). An alert will be triggered if this threshold is crossed for 15 minutes of more.
-  - **cpu_usage_threshold**: Maximum cpu usage as a percentage (ex: 90). An alert will be triggered if this threshold is crossed for 15 minutes of more.
-  - **expected_disks_count**: Expected number of disks (ex: 2). An alert will be triggered if the number of disks doesn't match. Can be set to -1 to disable the alert.
-  - **disk_space_usage_threshold**: Maximum disk space usage as a percentage (ex: 90). An alert will be triggered if this threshold is crossed for 15 minutes of more.
-  - **disk_io_usage_threshold**: Maximum disk io usage as a percentage (ex: 90). An alert will be triggered if this threshold is crossed for 15 minutes of more.
-  - **alert_labels**: Map of string keys and values corresponding to labels to add to all the jobs' alerts.
-- **blackbox_exporter_jobs**: List of blackbox tcp/http exporter jobs to generate boilerplate for. Each entry should take the following keys:
-  - **tag**: Tag for the blackbox exporter job. Is should consist of words separated by dashes. The job is expected to be called `-blackbox-exporter`
-  - **unavailability_tolerance**: Duration the service can be unavailable before an alert triggers. The format of the duration is a string formated as prometheus expects in the **for** field of alert rules.
+  - **memory_usage_threshold**: Maximum memory usage as a percentage (e.g., 90). An alert will be triggered if this threshold is crossed for 15 minutes or more.
+  - **cpu_usage_threshold**: Maximum CPU usage as a percentage (e.g., 90). An alert will be triggered if this threshold is crossed for 15 minutes or more.
+  - **expected_disks_count**: Expected number of disks (e.g., 7). If set, an alert will be triggered if the number of disks does not match. Can be set to `-1` to disable this alert.
+  - **min_disks_count**: Minimum expected number of disks (e.g., 5). If both `min_disks_count` and `max_disks_count` are set, an alert will be triggered if the disk count falls outside the range.
+  - **max_disks_count**: Maximum expected number of disks (e.g., 7). If both `min_disks_count` and `max_disks_count` are set, an alert will be triggered if the disk count falls outside the range.
+  - **disk_space_usage_threshold**: Maximum disk space usage as a percentage (e.g., 90). An alert will be triggered if this threshold is crossed for 15 minutes or more.
+  - **disk_io_usage_threshold**: Maximum disk IO usage as a percentage (e.g., 95). An alert will be triggered if this threshold is crossed for 15 minutes or more.
+  - **alert_labels**: Map of string keys and values corresponding to labels to add to all the job's alerts.
+- **blackbox_exporter_jobs**: List of blackbox TCP/HTTP exporter jobs to generate boilerplate for. Each entry should take the following keys:
+  - **tag**: Tag for the blackbox exporter job. It should consist of words separated by dashes. The job is expected to be called `-blackbox-exporter`
+  - **unavailability_tolerance**: Duration the service can be unavailable before an alert triggers. The format of the duration is a string formatted as Prometheus expects in the **for** field of alert rules.
   - **max_acceptable_latency**: Duration in seconds indicating the maximum acceptable response time for the service. If the service continuously takes longer than this to respond for an interval of time longer than **unavailability_tolerance**, a slow service alert will be triggered.
-  - **cert_renewal_window**: Delay in days indicating the expected renewal window for the tls certificate provided by the service. If the certificate the service provides expires within a delay shorter than this window, an alert will be triggered to indicate the certificate wasn't renewed properly.
-  - **has_tls**: Boolean indicating whether the service expects a tls connection. If false, alerts for the cert renewal window and tls version will not be set.
-  - **expect_recent_tls**: Boolean indicating whether the service is expected to use tls version 1.3. If set to true and the service uses a version of tls older than 1.3, an alert will be triggered.
-  - **alert_labels**: Map of string keys and values corresponding to labels to add to all the jobs' alerts.
+  - **cert_renewal_window**: Delay in days indicating the expected renewal window for the TLS certificate provided by the service. If the certificate the service provides expires within a delay shorter than this window, an alert will be triggered to indicate the certificate wasn't renewed properly.
+  - **has_tls**: Boolean indicating whether the service expects a TLS connection. If false, alerts for the cert renewal window and TLS version will not be set.
+  - **expect_recent_tls**: Boolean indicating whether the service is expected to use TLS version 1.3. If set to true and the service uses a version of TLS older than 1.3, an alert will be triggered.
+  - **alert_labels**: Map of string keys and values corresponding to labels to add to all the job's alerts.
 - **terracd_jobs**: List of terracd jobs to generate boilerplate for. Each entry should take the following keys:
   - **tag**: Tag for the terracd job. It should correspond to the job name.
   - **plan_interval_threshold**: Interval threshold after which an alert will be triggered if a **plan** or **apply** command did not run successfully. Used to diagnose a broken or non-running pipeline.
   - **apply_interval_threshold**: Interval threshold after which an alert will be triggered if an **apply** command did not run successfully. Used to detect a pipeline that was left in **plan** and never put back on **apply**.
-  - **unit**: Base time unit to use (**minute** or **hour**) that will affect how the thresholds are interepreted and how the rules are processed (to be either in minutes or hours)
-  - **alert_labels**: Map of string keys and values corresponding to labels to add to all the jobs' alerts.
-- **kubernetes_cluster_jobs**: List of kubernetes cluster jobs to generate boilerplate for. Each entry should take the following key:
-  - **tag**: Tag for the kubernetes cluster job. It should correspond to the cluster name.
-  - **expected_services**: List of expected deployments that should have a certain number of long running instances. Each entry should have the following keys:
-    - **namespace**: Namespace where the service is expected to run
-    - **name**: Name of the service. It should match the k8 deployment name.
+  - **unit**: Base time unit to use (**minute** or **hour**) that will affect how the thresholds are interpreted and how the rules are processed (to be either in minutes or hours).
+  - **alert_labels**: Map of string keys and values corresponding to labels to add to all the job's alerts.
+- **kubernetes_cluster_jobs**: List of Kubernetes cluster jobs to generate boilerplate for. Each entry should take the following key:
+  - **tag**: Tag for the Kubernetes cluster job. It should correspond to the cluster name.
+  - **expected_services**: List of expected deployments that should have a certain number of long-running instances. Each entry should have the following keys:
+    - **namespace**: Namespace where the service is expected to run.
+    - **name**: Name of the service. It should match the Kubernetes deployment name.
     - **expected_min_count**: Minimum expected number of instances that should be running.
     - **expected_start_delay**: Expected delay before an instance is started. Running instances that have been around for less than that delay won't be considered running.
     - **alert_labels**: Extra labels to add to alerts triggered for the service.
-- **minio_cluster_jobs**: List of minio cluster jobs to generate boilerplate for. Each entry should take the following key:
-  - **tag**: Tag for the minio cluster job. It should correspond to the cluster name.
+- **minio_cluster_jobs**: List of MinIO cluster jobs to generate boilerplate for. Each entry should take the following key:
+  - **tag**: Tag for the MinIO cluster job. It should correspond to the cluster name.
 - **etcd_exporter_jobs**: List of etcd exporter jobs to generate boilerplate for. Each entry should take the following keys:
-  - **tag**: Tag for the etcd exporter job. Is should consist of words separated by dashes. The job is expected to be called `-etcd-exporter`
-  - **expected_count**: Expected number of etcd members associated with the job
-  - **max_learn_time**: Max expected time for an etcd learner to catchup.
-  - **max_db_size**: Maximum expected data size (note that etcd has its own limit if 8GiB)
+  - **tag**: Tag for the etcd exporter job. It should consist of words separated by dashes. The job is expected to be called `-etcd-exporter`
+  - **expected_count**: Expected number of etcd members associated with the job.
+  - **max_learn_time**: Maximum expected time for an etcd learner to catch up.
+  - **max_db_size**: Maximum expected data size (note that etcd has its own limit of 8GiB).
   - **alert_labels**: Map of string keys and values corresponding to labels to add to all the jobs' alerts.
 
 # Example
diff --git a/templates/node-exporter.yml.tpl b/templates/node-exporter.yml.tpl
index 3113fc0..3554210 100644
--- a/templates/node-exporter.yml.tpl
+++ b/templates/node-exporter.yml.tpl
@@ -67,6 +67,23 @@ groups:
         annotations:
           summary: "${title(replace(job.tag, "-", " "))} Number of Disks Unexpected"
           description: "Instance *{{ $labels.instance }}* of job *{{ $labels.job }}* has *{{ $value }}* disks. Expected *${job.expected_disks_count}*."
+%{ else ~}
+%{ if job.min_disks_count >= 0 ~}
+  %{ if job.max_disks_count >= 0 }
+      - alert: ${replace(title(replace(job.tag, "-", " ")), " ", "")}DiskCountRangeMismatch
+        expr: (${replace(job.tag, "-", "_")}:disks:count < ${job.min_disks_count} or ${replace(job.tag, "-", "_")}:disks:count > ${job.max_disks_count})
+        for: 15m
+%{ if length(job.alert_labels) > 0 ~}
+        labels:
+%{ for key, val in job.alert_labels ~}
+          ${key}: "${val}"
+%{ endfor ~}
+%{ endif ~}
+        annotations:
+          summary: "${title(replace(job.tag, "-", " "))} Disk Count Out of Range"
+          description: "Instance *{{ $labels.instance }}* of job *{{ $labels.job }}* has *{{ $value }}* disks. Expected between *${job.min_disks_count}* and *${job.max_disks_count}*."
+%{ endif }
+%{ endif ~}
 %{ endif ~}
       - record: ${replace(job.tag, "-", "_")}:filesystem_size:gigabytes
         expr: node_filesystem_size_bytes{job="${job.tag}-node-exporter", fstype="ext4"} / 1024 / 1024 / 1024
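
For illustration, a `node_exporter_jobs` entry that opts into the new range check might look like the sketch below. The tag and every value are hypothetical; only the key names come from the README hunk. The sketch assumes the new `%{ else ~}` branch pairs with an `expected_disks_count >= 0` condition that lies outside this hunk, so the exact-count alert is disabled with `-1` to let the range alert render. It also assumes the module expects every listed key to be present on each entry.

```hcl
# Hypothetical entry: the tag and all values are illustrative only.
node_exporter_jobs = [
  {
    tag                        = "storage-nodes"
    expected_count             = 3
    memory_usage_threshold     = 90
    cpu_usage_threshold        = 90
    # Assumption: -1 disables the exact-count alert, which lets the
    # template fall through to the min/max range branch.
    expected_disks_count       = -1
    min_disks_count            = 5
    max_disks_count            = 7
    disk_space_usage_threshold = 90
    disk_io_usage_threshold    = 95
    alert_labels = {
      team = "infra"
    }
  }
]
```

Under those assumptions, the template would render an alert named `StorageNodesDiskCountRangeMismatch` with the expression `(storage_nodes:disks:count < 5 or storage_nodes:disks:count > 7)`. Because both nested conditions test `>= 0`, leaving either bound negative would presumably skip the range alert entirely.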