A reference for tools, configurations, and documentation used to monitor CircleCI server.
🚧 Under Development
This repository is currently under active development and is not yet a supported resource. Please refer to it at your own discretion until further notice.
A reference Helm chart for setting up a monitoring stack for CircleCI server
Repository | Name | Version |
---|---|---|
https://grafana.github.io/helm-charts | grafanaoperator(grafana-operator) | 5.17.* |
https://prometheus-community.github.io/helm-charts | prometheusOperator(prometheus-operator-crds) | 19.0.* |
To set up monitoring for a CircleCI server instance, configure Telegraf to expose a Prometheus metrics endpoint by enabling its prometheus_client output. Add the following configuration to the CircleCI server Helm chart values:
```yaml
telegraf:
  config:
    outputs:
      - file:
          files: ["stdout"]
      - prometheus_client:
          listen: ":9273"
```
First, add the CircleCI Server Monitoring Stack Helm repository:
```shell
$ helm repo add server-monitoring-stack https://packagecloud.io/circleci/server-monitoring-stack/helm
$ helm repo update
```
Before installing the full chart, install its dependency subcharts, including the Prometheus Custom Resource Definitions (CRDs) and the Grafana Operator chart. The command below assumes you are installing into the same namespace as your CircleCI server installation:
```shell
$ helm install server-monitoring-stack server-monitoring-stack/server-monitoring-stack --set global.enabled=false --set prometheusOperator.installCRDs=true --version 0.1.0-alpha.3 -n <your-server-namespace>
```
NOTE: It's possible to install the monitoring stack in a different namespace than the CircleCI server installation. If you do so, set the `prometheus.serviceMonitor.selectorNamespaces` value to the target namespace.
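For example, if the monitoring stack runs in its own namespace, the override might look like the following sketch (the namespace name `circleci-server` is a placeholder for wherever your CircleCI server installation lives):

```yaml
# Sketch: point the monitoring stack at the namespace running CircleCI server (and Telegraf).
# "circleci-server" is a placeholder namespace name.
prometheus:
  serviceMonitor:
    selectorNamespaces:
      - circleci-server
```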
Next, install the Helm chart using the following command:
```shell
$ helm upgrade --install server-monitoring-stack server-monitoring-stack/server-monitoring-stack --reset-values --version 0.1.0-alpha.3 -n <your-server-namespace>
```
To verify that Prometheus is working correctly and targeting Telegraf, use the following command to port-forward Prometheus:
```shell
$ kubectl port-forward svc/server-monitoring-prometheus 9090:9090 -n <your-namespace-here>
```
Then visit http://localhost:9090/targets in your browser. Verify that Telegraf appears as a target and that its state is "up".
To verify that Grafana is working correctly and connected to Prometheus, use the following command to port-forward Grafana:
```shell
$ kubectl port-forward svc/server-monitoring-grafana-service 3000:3000 -n <your-namespace-here>
```
Then visit http://localhost:3000 in your browser. Once logged in with the default credentials, navigate to http://localhost:3000/dashboards and verify that the default dashboards are present and populating with data.
After ensuring both Prometheus and Grafana are operational, consider these enhancements:
Secure Grafana by configuring credentials:
```yaml
grafana:
  credentials:
    # Directly set these for quick setups
    adminUser: "admin"
    adminPassword: "<your-secure-password-here>"
    # For production, use a Kubernetes secret to manage credentials securely
    existingSecretName: "<your-secret-here>"
```
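If you use `existingSecretName`, the referenced secret must already exist in the namespace. The sketch below shows a plausible shape for such a secret; the data key names (`admin-user`, `admin-password`) are assumptions, so verify the keys the chart actually expects against its templates before relying on this.

```yaml
# Hypothetical secret holding Grafana admin credentials.
# The key names are assumptions; confirm them against the chart templates.
apiVersion: v1
kind: Secret
metadata:
  name: grafana-admin-credentials
type: Opaque
stringData:
  admin-user: admin
  admin-password: <your-secure-password-here>
```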
For external access, modify the service or ingress values. For example:
```yaml
grafana:
  service:
    type: LoadBalancer
```
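Alternatively, Grafana can be exposed through an Ingress using the `grafana.ingress.*` values documented in the table below. A minimal sketch, assuming an NGINX ingress controller and placeholder hostname and TLS secret:

```yaml
# Sketch: expose Grafana via an Ingress instead of a LoadBalancer service.
# The className, host, and TLS secret name are placeholders for your environment.
grafana:
  ingress:
    enabled: true
    className: nginx
    host: grafana.example.com
    tls:
      enabled: true
      secretName: grafana-tls
```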
Persist data by enabling storage for Prometheus and Grafana:
```yaml
prometheus:
  persistence:
    enabled: true
    storageClass: <your-custom-storage-class>

grafana:
  persistence:
    enabled: true
    storageClass: <your-custom-storage-class>
```
NOTE: Use a custom storage class with a 'Retain' policy to allow for data retention even after uninstalling the chart.
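The chart does not create such a storage class for you. A minimal sketch is shown below, assuming the AWS EBS CSI driver; substitute the provisioner and parameters for your cluster, then reference the class name as `<your-custom-storage-class>` in the values above.

```yaml
# Sketch of a custom StorageClass with a Retain reclaim policy.
# The provisioner assumes the AWS EBS CSI driver; adjust it for your environment.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: monitoring-retain
provisioner: ebs.csi.aws.com
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
```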
The default dashboards are located in the `dashboards` directory of the reference chart. To add new dashboards or modify existing ones, follow these steps.
Dashboards are provisioned directly from CRDs, which means any manual edits will be lost upon a refresh. As such, the workflow outlined below is recommended for making changes (a sketch of the underlying resource format follows the steps):
- Create a Copy:
  - Select Edit in the upper right corner.
  - Choose Save dashboard -> Save as copy.
  - After saving, navigate to the copy.
- Make Edits:
  - Modify the copy as needed and exit edit mode.
- Export as JSON:
  - Select Export in the upper right corner and then Export as JSON.
  - Ensure that "Export the dashboard to use in another instance" is toggled on.
- Update the JSON File:
  - Download the file and replace the `./dashboards/server-slis.json` file with the updated copy.
  - Run `./do validate-dashboards` to automatically validate the JSON and apply necessary updates.
- Commit and Open a PR:
  - Review and commit the changes.
  - Open a pull request for the On-Prem team to review.
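For reference, the JSON files in the `dashboards` directory are delivered to Grafana as grafana-operator custom resources, which is why UI-only edits do not persist. The sketch below shows roughly what such a resource looks like with the grafana-operator v5 API; the metadata, instance selector labels, and dashboard JSON are illustrative assumptions, and the chart's templates generate the real objects.

```yaml
# Rough sketch of a grafana-operator v5 GrafanaDashboard resource wrapping dashboard JSON.
# The chart generates these automatically; the name, labels, and JSON below are illustrative.
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: server-slis
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana
  json: |
    {
      "title": "Server SLIs",
      "panels": []
    }
```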
Key | Type | Default | Description |
---|---|---|---|
global.enabled | bool | true |  |
global.fullnameOverride | string | "server-monitoring" | Override the full name for resources |
global.imagePullSecrets | list | [] | List of image pull secrets to be used across the deployment |
global.nameOverride | string | "" | Override the release name |
grafana.credentials.adminPassword | string | "admin" | Grafana admin password. Change from default for production environments. |
grafana.credentials.adminUser | string | "admin" | Grafana admin username. |
grafana.credentials.existingSecretName | string | "" | Name of an existing secret for Grafana credentials. Leave empty to create a new secret. |
grafana.customConfig | string | "" | Add any custom Grafana configurations you require here. This should be a YAML-formatted string of additional settings for Grafana. |
grafana.dashboards.jsonDirectory | string | "dashboards" | The directory containing JSON files for Grafana dashboards. |
grafana.datasource.jsonData.timeInterval | string | "5s" | The time interval for Grafana to poll Prometheus. Specifies the frequency of data requests. |
grafana.enabled | string | "-" |  |
grafana.image.repository | string | "grafana/grafana" | Image repository for Grafana. |
grafana.image.tag | string | "11.6.0" | Tag for the Grafana image. |
grafana.ingress.className | string | "" | Specifies the class of the Ingress controller. Required if the Kubernetes cluster includes multiple Ingress controllers. |
grafana.ingress.enabled | bool | false | Enable to create an Ingress resource for Grafana. Disabled by default. |
grafana.ingress.host | string | "" | Hostname to use for the Ingress. Must be set if Ingress is enabled. |
grafana.ingress.tls.enabled | bool | false | Enable TLS for Ingress. Requires a TLS secret to be specified. |
grafana.ingress.tls.secretName | string | "" | Name of the TLS secret used for securing the Ingress. Must be provided if TLS is enabled. |
grafana.persistence.accessModes | list | ["ReadWriteOnce"] | Access modes for the persistent volume. |
grafana.persistence.enabled | bool | false | Enable persistent storage for Grafana. |
grafana.persistence.size | string | "10Gi" | Size of the persistent volume claim. |
grafana.persistence.storageClass | string | "" | Storage class for persistent volume provisioner. You can create a custom storage class with a "retain" policy to ensure the persistent volume remains even after the chart is uninstalled. |
grafana.replicas | int | 1 | Number of Grafana replicas to deploy. |
grafana.service.annotations | object | {} | Metadata annotations for the service. |
grafana.service.port | int | 3000 | Port on which the Grafana service will be exposed. |
grafana.service.type | string | "ClusterIP" | Specifies the type of service for Grafana. Options include ClusterIP, NodePort, or LoadBalancer. Use NodePort or LoadBalancer to expose Grafana externally. Ensure that grafana.credentials are set for security purposes. |
grafanaoperator | object | {"fullnameOverride":"server-monitoring-grafana-operator","image":{"repository":"quay.io/grafana-operator/grafana-operator","tag":"v5.17.0"}} | Full values for the Grafana Operator chart can be obtained at: https://github.com/grafana/grafana-operator/blob/master/deploy/helm/grafana-operator/values.yaml |
grafanaoperator.fullnameOverride | string | "server-monitoring-grafana-operator" | Overrides the fully qualified app name. |
grafanaoperator.image.repository | string | "quay.io/grafana-operator/grafana-operator" | Image repository for the Grafana Operator. |
grafanaoperator.image.tag | string | "v5.17.0" | Tag for the Grafana Operator image. |
prometheus.enabled | string | "-" |  |
prometheus.image.repository | string | "quay.io/prometheus/prometheus" | Image repository for Prometheus. |
prometheus.image.tag | string | "v3.2.1" | Tag for the Prometheus image. |
prometheus.persistence.accessModes | list | ["ReadWriteOnce"] | Access modes for the persistent volume. |
prometheus.persistence.enabled | bool | false | Enable persistent storage for Prometheus. |
prometheus.persistence.size | string | "10Gi" | Size of the persistent volume claim. |
prometheus.persistence.storageClass | string | "" | Storage class for persistent volume provisioner. You can create a custom storage class with a "retain" policy to ensure the persistent volume remains even after the chart is uninstalled. |
prometheus.replicas | int | 2 | Number of Prometheus replicas to deploy. |
prometheus.serviceMonitor.endpoints[0].metricRelabelings[0].action | string | "labeldrop" |  |
prometheus.serviceMonitor.endpoints[0].metricRelabelings[0].regex | string | "instance" |  |
prometheus.serviceMonitor.endpoints[0].port | string | "prometheus-client" | Port name for the Prometheus client service. |
prometheus.serviceMonitor.endpoints[0].relabelings[0].action | string | "labeldrop" |  |
prometheus.serviceMonitor.endpoints[0].relabelings[0].regex | string | `"(container \| endpoint \|` |  |
prometheus.serviceMonitor.selectorLabels | object | {"app.kubernetes.io/instance":"circleci-server","app.kubernetes.io/name":"telegraf"} | Labels to select ServiceMonitors for scraping metrics. By default, it's configured to scrape the existing Telegraf deployment in CircleCI server. |
prometheus.serviceMonitor.selectorNamespaces | list | [] | Namespaces to look for ServiceMonitor objects. Set this if the CircleCI server monitoring stack is deployed in a different namespace than the CircleCI server installation itself. |
prometheusOperator.crds.annotations."helm.sh/resource-policy" | string | "keep" |  |
prometheusOperator.enabled | string | "-" |  |
prometheusOperator.image.repository | string | "quay.io/prometheus-operator/prometheus-operator" | Image repository for Prometheus Operator. |
prometheusOperator.image.tag | string | "v0.81.0" | Tag for the Prometheus Operator image. |
prometheusOperator.installCRDs | bool | false |  |
prometheusOperator.prometheusConfigReloader.image.repository | string | "quay.io/prometheus-operator/prometheus-config-reloader" | Image repository for Prometheus Config Reloader. |
prometheusOperator.prometheusConfigReloader.image.tag | string | "v0.81.0" | Tag for the Prometheus Config Reloader image. |
prometheusOperator.replicas | int | 1 | Number of Prometheus Operator replicas to deploy. |
Releases are managed by the CI/CD pipeline on the main branch, with an approval job gate called `approve-deploy-chart`. Before releasing, increment the Helm chart version in `Chart.yaml` and regenerate the documentation using `./do helm-docs`. Once approved, the release will be available in the package repository.
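For illustration, the version bump only touches the `version` field in `Chart.yaml`. The excerpt below is hypothetical; the chart name and next version number are assumptions based on the chart referenced earlier in this document.

```yaml
# Hypothetical Chart.yaml excerpt: only the version field needs to change before a release.
apiVersion: v2
name: server-monitoring-stack   # assumed chart name
version: 0.1.0-alpha.4          # illustrative next version after 0.1.0-alpha.3
```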