
Managed alertmanager no longer sending slack alerts #1550

Open
gravelg opened this issue Mar 25, 2025 · 15 comments · May be fixed by #1562
@gravelg

gravelg commented Mar 25, 2025

Hello,

We have noticed that since March 18th, our managed alertmanager is no longer sending alerts. We have not made any configuration changes to the alertmanager config contained in the alertmanager secret, and we can see that the configuration is loaded correctly.

Alertmanager logs are full of errors now however:

{"caller":"main.go:192","level":"info","msg":"Starting Alertmanager","ts":"2025-03-22T08:25:10.555Z","version":"(version=0.27.0-gmp.1, branch=, revision=0ddd406d04a5076cb73567d5a11972bfefc7e833)"}
{"build_context":"(go=go1.22.7 X:boringcrypto, platform=linux/amd64, user=, date=2024-09-20T12:17:58+00:00, tags=boring)","caller":"main.go:193","level":"info","ts":"2025-03-22T08:25:10.555Z"}
{"caller":"cluster.go:683","component":"cluster","interval":"2s","level":"info","msg":"Waiting for gossip to settle...","ts":"2025-03-22T08:25:10.556Z"}
{"caller":"coordinator.go:113","component":"configuration","file":"/alertmanager/config_out/config.yaml","level":"info","msg":"Loading configuration file","ts":"2025-03-22T08:25:10.593Z"}
{"caller":"coordinator.go:126","component":"configuration","file":"/alertmanager/config_out/config.yaml","level":"info","msg":"Completed loading of configuration file","ts":"2025-03-22T08:25:10.594Z"}
{"address":"[::]:9093","caller":"tls_config.go:313","level":"info","msg":"Listening on","ts":"2025-03-22T08:25:10.597Z"}
{"address":"[::]:9093","caller":"tls_config.go:316","http2":false,"level":"info","msg":"TLS is disabled.","ts":"2025-03-22T08:25:10.597Z"}
{"caller":"coordinator.go:113","component":"configuration","file":"/alertmanager/config_out/config.yaml","level":"info","msg":"Loading configuration file","ts":"2025-03-22T08:25:11.659Z"}
{"caller":"coordinator.go:126","component":"configuration","file":"/alertmanager/config_out/config.yaml","level":"info","msg":"Completed loading of configuration file","ts":"2025-03-22T08:25:11.661Z"}
{"before":0,"caller":"cluster.go:708","component":"cluster","elapsed":"2.000733581s","level":"info","msg":"gossip not settled","now":1,"polls":0,"ts":"2025-03-22T08:25:12.557Z"}
{"caller":"cluster.go:700","component":"cluster","elapsed":"10.004375193s","level":"info","msg":"gossip settled; proceeding","ts":"2025-03-22T08:25:20.561Z"}
{"aggrGroup":"{}/{airflow=\"true\"}:{alertname=\"Airflow DAG failures\"}","attempts":1,"caller":"notify.go:848","component":"dispatcher","err":"Post \"<redacted>\": unsupported protocol scheme \"\"","integration":"slack[0]","level":"warn","msg":"Notify attempt failed, will retry later","receiver":"airflow-slack-notifications","ts":"2025-03-22T14:40:35.220Z"}
{"caller":"dispatch.go:353","component":"dispatcher","err":"airflow-slack-notifications/slack[0]: notify retry canceled after 15 attempts: Post \"<redacted>\": unsupported protocol scheme \"\"","level":"error","msg":"Notify for alerts failed","num_alerts":1,"ts":"2025-03-22T14:45:35.216Z"}
{"aggrGroup":"{}/{airflow=\"true\"}:{alertname=\"Airflow DAG failures\"}","attempts":1,"caller":"notify.go:848","component":"dispatcher","err":"Post \"<redacted>\": unsupported protocol scheme \"\"","integration":"slack[0]","level":"warn","msg":"Notify attempt failed, will retry later","receiver":"airflow-slack-notifications","ts":"2025-03-22T14:45:35.217Z"}
{"caller":"dispatch.go:353","component":"dispatcher","err":"airflow-slack-notifications/slack[0]: notify retry canceled after 17 attempts: Post \"<redacted>\": unsupported protocol scheme \"\"","level":"error","msg":"Notify for alerts failed","num_alerts":1,"ts":"2025-03-22T14:50:35.217Z"}

I imagine this is somehow related to the 0.27.0 update but I am unable to tell where the problem is in our config since it gets loaded correctly. I also cannot find any breaking change announcements in the docs about this changing in some way. Can you please provide more guidance?

Thanks!

@b-n

b-n commented Mar 28, 2025

We're also experiencing something similar whilst using GKE. Below are some details from a ticket we raised with Google support:

Setup

Generate a "valid" alertmanager.yaml:

cat <<EOF > alertmanager.yaml
route:
  receiver: 'slack'
receivers:
- name: 'slack'
  slack_configs:
  - channel: '#some_channel'
    api_url: https://slack.com/api/chat.postMessage
    http_config:
      authorization:
        type: 'Bearer'
        credentials: 'redacted'
EOF

(Note: 'redacted' was the actual Slack token.)

Apply to the cluster (per the docs):

kubectl create secret generic alertmanager -n gmp-public --from-file=alertmanager.yaml

Get the resulting secret from the gmp-system namespace to show the configuration error. This shell command prints the rendered alertmanager config:

kubectl get -n gmp-system secrets/alertmanager --template='{{index .data "config.yaml"}}' | base64 -d

Result on a v1.30.x GKE cluster, which runs v0.13:

route:
  receiver: 'slack'
receivers:
- name: 'slack'
  slack_configs:
  - channel: '#some_channel'
    api_url: https://slack.com/api/chat.postMessage
    http_config:
      authorization:
        type: 'Bearer'
        credentials: 'redacted'

This worked for us previously.

Result on a v1.31.x GKE cluster, which runs v0.14:

global:
   resolve_timeout: 5m
   http_config:
       follow_redirects: true
       enable_http2: true
   smtp_hello: localhost
   smtp_require_tls: true
   pagerduty_url: https://events.pagerduty.com/v2/enqueue
   opsgenie_api_url: https://api.opsgenie.com/
   wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
   victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
   telegram_api_url: https://api.telegram.org
   webex_api_url: https://webexapis.com/v1/messages
route:
   receiver: slack
   continue: false
receivers:
   - name: slack
     slack_configs:
       - send_resolved: false
         http_config:
           authorization:
               type: Bearer
               credentials: <secret>
           follow_redirects: true
           enable_http2: true
         api_url: <secret>
         channel: '#ops-alerts-dev'
         username: '{{ template "slack.default.username" . }}'
         color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
         title: '{{ template "slack.default.title" . }}'
         title_link: '{{ template "slack.default.titlelink" . }}'
         pretext: '{{ template "slack.default.pretext" . }}'
         text: '{{ template "slack.default.text" . }}'
         short_fields: false
         footer: '{{ template "slack.default.footer" . }}'
         fallback: '{{ template "slack.default.fallback" . }}'
         callback_id: '{{ template "slack.default.callbackid" . }}'
         icon_emoji: '{{ template "slack.default.iconemoji" . }}'
         icon_url: '{{ template "slack.default.iconurl" . }}'
         link_names: false
templates: []

Notes:

  • Many more default values are filled in, which is generally fine.
  • receivers[0].slack_configs[0].http_config.authorization.credentials is now the fixed string <secret> 👈 I assume this is where it is breaking.

One more note: if global.slack_api_url is provided in the alertmanager config secret, it also gets replaced with <secret>.

@b-n

b-n commented Mar 28, 2025

And to be thorough: I mounted a debug container on the running alertmanager-0 pod with the same volume mounts, and I can confirm that /alertmanager/config/config.yaml has the same contents as the secret (and if the secret changes, the values on disk change too).

Another find: if I add a value to global.smtp_auth_password, which is also a default <secret> field per the [AM docs](https://prometheus.io/docs/alerting/latest/configuration/#file-layout-and-global-settings), then that field also shows up as <secret>. The assumption is that any field in the AM config that is documented as <secret> is not being rendered correctly.
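The redaction behavior above can be spot-checked locally. This sketch (the sample file and field names are assumptions based on this thread, not a real rendered secret) lists which keys in a rendered config carry the placeholder:

```shell
# Hypothetical sample of an operator-rendered config, mimicking the
# redacted output shown earlier in this thread.
cat > rendered.yaml <<'EOF'
global:
  slack_api_url: <secret>
receivers:
  - name: slack
    slack_configs:
      - api_url: <secret>
        http_config:
          authorization:
            credentials: <secret>
EOF
# Print the name of every key whose value was replaced by the placeholder.
sed -n 's/^[[:space:]-]*\([A-Za-z_]*\):[[:space:]]*<secret>$/\1/p' rendered.yaml
```

Run against a real decoded secret (e.g. the output of the kubectl/base64 command above), this would show exactly which fields got obfuscated.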

@b-n

b-n commented Mar 28, 2025

And for more investigation, I went on a git diff hunt: git diff v0.13.1..v0.14.0.

I suspect #1074 is the problem. I'm not that familiar with the code base, but my guess:

TL;DR: Previously the alertmanager config was taken from the secret in gmp-public and used directly. #1074 uses the prometheus engine config unmarshaller to parse the bytes from the provided config, and that has security measures to ensure that secrets aren't leaked. Unfortunately, that means any "secret" field that was in the secret is being obfuscated.

Note: I'm not a Go expert; I'm just stepping through the logic to see what changed and what might be the cause, so take the above with a large grain of salt.

Either way, I don't know if there are any workarounds. I suspect we can't use global.slack_api_url_file because the OperatorConfig doesn't give us the ability to write any other files, sadly.

@bwplotka
Collaborator

Thanks for helping! This just came to our attention; it looks like the obfuscation code is the problem. We changed the configuration propagation flow for security reasons (so the GMP operator has fewer permissions), but something went wrong.

We are on it, will give an update today!

@bwplotka
Collaborator

bwplotka commented Mar 28, 2025

We are double-checking the details and will release a bugfix, but there is a quick mitigation anyone can do:

  • In the user-provided alertmanager secret in the gmp-public namespace (the file referenced by OperatorConfig.managedAlertmanager.configSecret), add a custom google_cloud.external_url field with exactly the same value as your OperatorConfig.managedAlertmanager.externalURL field. This is only relevant if you have a custom OperatorConfig.managedAlertmanager.externalURL setting.

For example:

google_cloud:
  # Must be exactly the same value as in OperatorConfig.managedAlertmanager.externalURL,
  # so the buggy re-encoding is skipped until the 0.14.3 bugfix is rolled out.
  external_url: "https://alertmanager.mycompany.com/"

# Rest of your AM config with or without inlined secrets.
receivers:
  - name: "foobar"
route:
  receiver: "foobar"

Let us know if this mitigates this issue!
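For anyone scripting that change, a minimal sketch follows. The receiver names and external URL are placeholders from the example above, and the kubectl step (secret name per the docs earlier in this thread) is shown only as a comment:

```shell
# Start from an existing local alertmanager.yaml (placeholder content here).
cat > alertmanager.yaml <<'EOF'
route:
  receiver: "foobar"
receivers:
  - name: "foobar"
EOF
# Prepend the mitigation block. The external_url value must exactly match
# your OperatorConfig's managedAlertmanager.externalURL.
{ printf 'google_cloud:\n  external_url: "https://alertmanager.mycompany.com/"\n'
  cat alertmanager.yaml; } > alertmanager.yaml.tmp
mv alertmanager.yaml.tmp alertmanager.yaml
head -n 2 alertmanager.yaml
# Then re-create the secret, e.g.:
#   kubectl create secret generic alertmanager -n gmp-public \
#     --from-file=alertmanager.yaml --dry-run=client -o yaml | kubectl apply -f -
```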

bernot-dev pushed a commit that referenced this issue Mar 28, 2025
bwplotka added a commit that referenced this issue Mar 28, 2025
bwplotka added a commit that referenced this issue Mar 28, 2025
@gravelg
Author

gravelg commented Mar 28, 2025

I can confirm that adding the google_cloud.external_url section in our alertmanager secret has resolved the situation for now and we are once again getting alerts.

@bwplotka
Collaborator

Repro: #1558

@awasilyev

awasilyev commented Mar 31, 2025

#1550 (comment)

It does not work for me; I still see <secret> values in the alertmanager secret.

@bwplotka
Collaborator

Are you sure? Can you share your config and OperatorConfig in the cluster?

@awasilyev

awasilyev commented Mar 31, 2025

Secret gmp-public/alertmanager:

google_cloud:
  externalUrl: "https://gmp-alertmanager.xxxx"

global:
  resolve_timeout: 5m
  slack_api_url: https://slack.com/api/chat.postMessage

inhibit_rules:
- equal:
  - namespace
  - alertname
  source_matchers:
  - severity = critical
  target_matchers:
  - severity =~ warning|error|info
- equal:
  - namespace
  - alertname
  source_matchers:
  - severity = error
  target_matchers:
  - severity = warning|info  
- equal:
  - namespace
  - alertname
  source_matchers:
  - severity = warning
  target_matchers:
  - severity = info
- equal:
  - namespace
  source_matchers:
  - alertname = InfoInhibitor
  target_matchers:
  - severity = info
- target_matchers:
  - alertname = InfoInhibitor

receivers:
- name: "null"
- name: incident.io
  webhook_configs:
  - url: 'https://api.incident.io/v2/alert_events/alertmanager/xxxx'
    send_resolved: true
    http_config:
     authorization:
      credentials: xxxx

- name: slack
  slack_configs:
  - channel: monitoring-non_prod
    http_config:
      authorization:
        credentials: xoxb-xxxx
    send_resolved: true
    title: >
      {{- if eq .Status "firing" -}}:exclamation: {{ .CommonLabels.project_id }} {{ .CommonLabels.severity | toUpper }} {{ .CommonLabels.alertname }}
      {{- else -}}:heavy_check_mark: {{ .CommonLabels.project_id }} {{ .CommonLabels.severity | toUpper }} {{ .CommonLabels.alertname }}
      {{- end -}}
    text: >
      {{- range .Alerts -}}
        {{"\n"}}
        {{- if .Annotations.summary }}{{ .Annotations.summary }}{{"\n"}}{{ end -}}
        {{- if .Annotations.message }}{{ .Annotations.message }}{{"\n"}}{{ end -}}
        {{- if .Annotations.description }}{{ .Annotations.description }}{{"\n"}}{{ end -}}
      {{- end -}}
      {{"\n"}}
      {{- if .CommonLabels.team }}*team*: {{ .CommonLabels.team }}{{"\n"}}{{ end -}}
      {{- if .CommonLabels.namespace }}*ns*: {{ .CommonLabels.namespace }}{{"\n"}}{{ end -}}
      {{- if .CommonLabels.pod }}*pod*: {{ .CommonLabels.pod }}{{"\n"}}{{ end -}}
      {{- if .CommonLabels.app }}*app*: {{ .CommonLabels.app }}{{"\n"}}{{ end -}}
      {{- if .CommonLabels.name }}*name*: {{ .CommonLabels.name }}{{"\n"}}{{ end -}}
      {{- if .CommonLabels.domain }}*domain*: {{ .CommonLabels.domain }}{{"\n"}}{{ end -}}

- name: slack_argo
  slack_configs:
  - channel: monitoring-non_prod
    http_config:
      authorization:
        credentials: xoxb-xxxx
    send_resolved: true
    title: >
      {{- if eq .Status "firing" -}}:exclamation: {{ .CommonLabels.project_id }} {{ .CommonLabels.severity | toUpper }} {{ .CommonLabels.alertname }}
      {{- else -}}:heavy_check_mark: {{ .CommonLabels.project_id }} {{ .CommonLabels.severity | toUpper }} {{ .CommonLabels.alertname }}
      {{- end -}}
    text: >
      {{- range .Alerts -}}
        {{"\n"}}
        {{- if .Annotations.summary }}{{ .Annotations.summary }}{{"\n"}}{{ end -}}
        {{- if .Annotations.message }}{{ .Annotations.message }}{{"\n"}}{{ end -}}
        {{- if .Annotations.description }}{{ .Annotations.description }}{{"\n"}}{{ end -}}
      {{- end -}}
      {{"\n"}}

- name: slack_airflow
  slack_configs:
  - channel: monitoring-airflow-xxxx
    http_config:
      authorization:
        credentials: xoxb-xxxx
    send_resolved: true
    title: >
      {{- if eq .Status "firing" -}}:exclamation: {{ .CommonLabels.project_id }} {{ .CommonLabels.severity | toUpper }} {{ .CommonLabels.alertname }}
      {{- else -}}:heavy_check_mark: {{ .CommonLabels.project_id }} {{ .CommonLabels.severity | toUpper }} {{ .CommonLabels.alertname }}
      {{- end -}}
    text: >
      {{- range .Alerts -}}
        {{"\n"}}
        {{- if .Annotations.summary }}{{ .Annotations.summary }}{{"\n"}}{{ end -}}
        {{- if .Annotations.message }}{{ .Annotations.message }}{{"\n"}}{{ end -}}
        {{- if .Annotations.description }}{{ .Annotations.description }}{{"\n"}}{{ end -}}
      {{- end -}}
      {{"\n"}}

route:
  group_by:
  - alertname
  - cluster
  - service
  group_interval: 10m
  group_wait: 60s
  receiver: "null"
  repeat_interval: 3h
  routes:
  - receiver: incident.io
    match_re:
     severity: warning|critical|error
    continue: true
  - match:
      namespace: airflow
    receiver: slack_airflow
  - match:
      service: data-xxxx-pipelines-web
    receiver: slack_airflow
  - match:
      job: argocd-application-controller-metrics
    receiver: slack_argo
  - match_re:
      severity: warning|error|critical
    receiver: slack
google_cloud:
  externalUrl: "https://gmp-alertmanager.xxxx"

global:
  resolve_timeout: 5m
  slack_api_url: https://slack.com/api/chat.postMessage

inhibit_rules:
- equal:
  - namespace
  - alertname
  source_matchers:
  - severity = critical
  target_matchers:
  - severity =~ warning|error|info
- equal:
  - namespace
  - alertname
  source_matchers:
  - severity = error
  target_matchers:
  - severity = warning|info  
- equal:
  - namespace
  - alertname
  source_matchers:
  - severity = warning
  target_matchers:
  - severity = info
- equal:
  - namespace
  source_matchers:
  - alertname = InfoInhibitor
  target_matchers:
  - severity = info
- target_matchers:
  - alertname = InfoInhibitor

receivers:
- name: "null"
- name: incident.io
  webhook_configs:
  - url: 'https://api.incident.io/v2/alert_events/alertmanager/xxxx'
    send_resolved: true
    http_config:
     authorization:
      credentials: xxxx

- name: slack
  slack_configs:
  - channel: monitoring-non_prod
    http_config:
      authorization:
        credentials: xoxb-xxxx
    send_resolved: true
    title: >
      {{- if eq .Status "firing" -}}:exclamation: {{ .CommonLabels.project_id }} {{ .CommonLabels.severity | toUpper }} {{ .CommonLabels.alertname }}
      {{- else -}}:heavy_check_mark: {{ .CommonLabels.project_id }} {{ .CommonLabels.severity | toUpper }} {{ .CommonLabels.alertname }}
      {{- end -}}
    text: >
      {{- range .Alerts -}}
        {{"\n"}}
        {{- if .Annotations.summary }}{{ .Annotations.summary }}{{"\n"}}{{ end -}}
        {{- if .Annotations.message }}{{ .Annotations.message }}{{"\n"}}{{ end -}}
        {{- if .Annotations.description }}{{ .Annotations.description }}{{"\n"}}{{ end -}}
      {{- end -}}
      {{"\n"}}
      {{- if .CommonLabels.team }}*team*: {{ .CommonLabels.team }}{{"\n"}}{{ end -}}
      {{- if .CommonLabels.namespace }}*ns*: {{ .CommonLabels.namespace }}{{"\n"}}{{ end -}}
      {{- if .CommonLabels.pod }}*pod*: {{ .CommonLabels.pod }}{{"\n"}}{{ end -}}
      {{- if .CommonLabels.app }}*app*: {{ .CommonLabels.app }}{{"\n"}}{{ end -}}
      {{- if .CommonLabels.name }}*name*: {{ .CommonLabels.name }}{{"\n"}}{{ end -}}
      {{- if .CommonLabels.domain }}*domain*: {{ .CommonLabels.domain }}{{"\n"}}{{ end -}}

- name: slack_argo
  slack_configs:
  - channel: monitoring-non_prod
    http_config:
      authorization:
        credentials: xoxb-xxxx
    send_resolved: true
    title: >
      {{- if eq .Status "firing" -}}:exclamation: {{ .CommonLabels.project_id }} {{ .CommonLabels.severity | toUpper }} {{ .CommonLabels.alertname }}
      {{- else -}}:heavy_check_mark: {{ .CommonLabels.project_id }} {{ .CommonLabels.severity | toUpper }} {{ .CommonLabels.alertname }}
      {{- end -}}
    text: >
      {{- range .Alerts -}}
        {{"\n"}}
        {{- if .Annotations.summary }}{{ .Annotations.summary }}{{"\n"}}{{ end -}}
        {{- if .Annotations.message }}{{ .Annotations.message }}{{"\n"}}{{ end -}}
        {{- if .Annotations.description }}{{ .Annotations.description }}{{"\n"}}{{ end -}}
      {{- end -}}
      {{"\n"}}

- name: slack_airflow
  slack_configs:
  - channel: monitoring-airflow-xxxx
    http_config:
      authorization:
        credentials: xoxb-xxxx
    send_resolved: true
    title: >
      {{- if eq .Status "firing" -}}:exclamation: {{ .CommonLabels.project_id }} {{ .CommonLabels.severity | toUpper }} {{ .CommonLabels.alertname }}
      {{- else -}}:heavy_check_mark: {{ .CommonLabels.project_id }} {{ .CommonLabels.severity | toUpper }} {{ .CommonLabels.alertname }}
      {{- end -}}
    text: >
      {{- range .Alerts -}}
        {{"\n"}}
        {{- if .Annotations.summary }}{{ .Annotations.summary }}{{"\n"}}{{ end -}}
        {{- if .Annotations.message }}{{ .Annotations.message }}{{"\n"}}{{ end -}}
        {{- if .Annotations.description }}{{ .Annotations.description }}{{"\n"}}{{ end -}}
      {{- end -}}
      {{"\n"}}

route:
  group_by:
  - alertname
  - cluster
  - service
  group_interval: 10m
  group_wait: 60s
  receiver: "null"
  repeat_interval: 3h
  routes:
  - receiver: incident.io
    match_re:
     severity: warning|critical|error
    continue: true
  - match:
      namespace: airflow
    receiver: slack_airflow
  - match:
      service: data-xxxx-pipelines-web
    receiver: slack_airflow
  - match:
      job: argocd-application-controller-metrics
    receiver: slack_argo
  - match_re:
      severity: warning|error|critical
    receiver: slack

OperatorConfig:

apiVersion: monitoring.googleapis.com/v1
collection:
  externalLabels:
    cluster: xxxx
    location: us-east1
    project_id: xxxx
  filter:
    matchOneOf:
    - '{__name__="argocd_app_info"}'
    - '{__name__="kube_pod_container_status_restarts_total"}'
    - '{__name__="kube_pod_status_phase"}'
    - '{__name__="kubelet_volume_stats_available_bytes"}'
    - '{__name__="kubelet_volume_stats_capacity_bytes"}'
    - '{__name__="kube_pod_container_status_last_terminated_reason"}'
    - '{__name__="kube_pod_status_ready"}'
    - '{__name__="up"}'
    - '{__name__="ray_cluster_active_nodes"}'
    - '{__name__="ray_cluster_pending_nodes"}'
    - '{__name__="ray_cluster_recently_failed_nodes"}'
    - '{__name__="ray_actors"}'
    - '{__name__="ray_component_cpu_percentage"}'
    - '{__name__="ray_component_mem_shared_bytes"}'
    - '{__name__="ray_component_rss_mb"}'
    - '{__name__="ray_memory_manager_worker_eviction_total"}'
    - '{__name__="ray_node_cpu_count"}'
    - '{__name__="ray_node_cpu_utilization"}'
    - '{__name__="ray_node_mem_shared_bytes"}'
    - '{__name__="ray_node_mem_total"}'
    - '{__name__="ray_node_mem_used"}'
    - '{__name__="ray_node_network_receive_speed"}'
    - '{__name__="ray_node_network_send_speed"}'
    - '{__name__="ray_object_store_memory"}'
    - '{__name__="ray_resources"}'
    - '{__name__="ray_tasks"}'
    - '{__name__="airflow_dag_last_status"}'
    - '{__name__="chi_clickhouse_event_FailedQuery"}'
    - '{__name__="chi_clickhouse_event_FailedSelectQuery"}'
    - '{__name__="chi_clickhouse_metric_DiskSpaceReservedForMerge"}'
    - '{__name__="chi_clickhouse_metric_DiskTotalBytes"}'
    - '{__name__="chi_clickhouse_metric_DiskTotal_default"}'
    - '{__name__="chi_clickhouse_metric_DiskUnreserved_default"}'
    - '{__name__="chi_clickhouse_metric_DiskUsed_default"}'
    - '{__name__="chi_clickhouse_metric_GlobalThread"}'
    - '{__name__="chi_clickhouse_metric_GlobalThreadActive"}'
    - '{__name__="chi_clickhouse_metric_GlobalThreadScheduled"}'
    - '{__name__="chi_clickhouse_metric_HTTPConnection"}'
    - '{__name__="chi_clickhouse_metric_HTTPConnectionsStored"}'
    - '{__name__="chi_clickhouse_metric_HTTPConnectionsTotal"}'
    - '{__name__="chi_clickhouse_metric_HTTPRejectedConnections"}'
    - '{__name__="chi_clickhouse_metric_HTTPThreads"}'
    - '{__name__="chi_clickhouse_metric_IOThreads"}'
    - '{__name__="chi_clickhouse_metric_IOThreadsActive"}'
    - '{__name__="chi_clickhouse_metric_IOThreadsScheduled"}'
    - '{__name__="chi_clickhouse_metric_LoadAverage1"}'
    - '{__name__="chi_clickhouse_metric_LoadAverage15"}'
    - '{__name__="chi_clickhouse_metric_LoadAverage5"}'
    - '{__name__="chi_clickhouse_metric_LocalThread"}'
    - '{__name__="chi_clickhouse_metric_LocalThreadActive"}'
    - '{__name__="chi_clickhouse_metric_LocalThreadScheduled"}'
    - '{__name__="chi_clickhouse_metric_LongestRunningQuery"}'
    - '{__name__="chi_clickhouse_metric_MergeTreeBackgroundExecutorThreads"}'
    - '{__name__="chi_clickhouse_metric_MergeTreeBackgroundExecutorThreadsActive"}'
    - '{__name__="chi_clickhouse_metric_MergeTreeBackgroundExecutorThreadsScheduled"}'
    - '{__name__="chi_clickhouse_metric_MergeTreeDataSelectExecutorThreads"}'
    - '{__name__="chi_clickhouse_metric_MergeTreeDataSelectExecutorThreadsActive"}'
    - '{__name__="chi_clickhouse_metric_MergeTreeDataSelectExecutorThreadsScheduled"}'
    - '{__name__="chi_clickhouse_metric_NetworkReceive"}'
    - '{__name__="chi_clickhouse_metric_NetworkSend"}'
    - '{__name__="chi_clickhouse_metric_OSNiceTime"}'
    - '{__name__="chi_clickhouse_metric_OSNiceTimeNormalized"}'
    - '{__name__="chi_clickhouse_metric_Query"}'
    - '{__name__="chi_clickhouse_metric_MemoryTracking"}'
    - '{__name__="chi_clickhouse_event_InsertedRows"}'
    - '{__name__="chi_clickhouse_metric_S3Requests"}'
    - '{__name__="chi_clickhouse_metric_StorageS3Threads"}'
    - '{__name__="chi_clickhouse_metric_StorageS3ThreadsActive"}'
    - '{__name__="chi_clickhouse_metric_StorageS3ThreadsScheduled"}'
  kubeletScraping:
    interval: 60s
features:
  config: {}
  targetStatus:
    enabled: true
kind: OperatorConfig
managedAlertmanager:
  configSecret:
    key: alertmanager.yaml
    name: alertmanager
  externalURL: https://gmp-alertmanager.xxxx
metadata:
  annotations:
    components.gke.io/component-name: managed-prometheus
    components.gke.io/component-version: 0.13.1-gke.0
    components.gke.io/layer: addon
  creationTimestamp: "2025-02-21T13:54:35Z"
  generation: 23
  labels:
    addonmanager.kubernetes.io/mode: EnsureExists
    argocd.argoproj.io/instance: devops
    k8slens-edit-resource-version: v1
  name: config
  namespace: gmp-public
  resourceVersion: "1729751778"
  uid: c8127927-c879-4952-bdc3-a24fd8ef55b8
rules:
  alerting: {}
  externalLabels:
    cluster: xxxx
    location: us-east1
    project_id: xxxx
scaling:
  vpa: {}

@bwplotka
Collaborator

bwplotka commented Mar 31, 2025

Thanks! You have a typo:

google_cloud:
  externalUrl: "https://gmp-alertmanager.xxxx"

should be:

google_cloud:
  external_url: "https://gmp-alertmanager.xxxx"

Sorry for the mess; we are releasing a bugfix today.

@awasilyev

Updated it to:

google_cloud:
  external_url: "https://gmp-alertmanager.xxxx"

Nothing changed.

@bwplotka
Collaborator

I also just noticed that in the alertmanager config you pasted (#1550 (comment)) there are two google_cloud entries; do you mind removing all but one?

@awasilyev

Yes, only one is left now.

@awasilyev

Here is what I currently have. The value in external_url is 100% equal to the OperatorConfig.managedAlertmanager.externalURL field. I still see <secret> in the gmp-system/alertmanager secret; I tried removing it, but it is the same after recreation. GKE 1.32.2-gke.1182001, gmp-operator version v0.15.1-gke.

google_cloud:
  external_url: "https://gmp-alertmanager.xxxx"
  
global:
  resolve_timeout: 5m
  slack_api_url: https://slack.com/api/chat.postMessage

inhibit_rules:
- equal:
  - namespace
  - alertname
  source_matchers:
  - severity = critical
  target_matchers:
  - severity =~ warning|error|info
- equal:
  - namespace
  - alertname
  source_matchers:
  - severity = error
  target_matchers:
  - severity = warning|info  
- equal:
  - namespace
  - alertname
  source_matchers:
  - severity = warning
  target_matchers:
  - severity = info
- equal:
  - namespace
  source_matchers:
  - alertname = InfoInhibitor
  target_matchers:
  - severity = info
- target_matchers:
  - alertname = InfoInhibitor

receivers:
- name: "null"
- name: incident.io
  webhook_configs:
  - url: 'https://api.incident.io/v2/alert_events/alertmanager/xxxx'
    send_resolved: true
    http_config:
     authorization:
      credentials: xxxx

- name: slack
  slack_configs:
  - channel: monitoring-non_prod
    http_config:
      authorization:
        credentials: xoxb-xxxx
    send_resolved: true
    title: >
      {{- if eq .Status "firing" -}}:exclamation: {{ .CommonLabels.project_id }} {{ .CommonLabels.severity | toUpper }} {{ .CommonLabels.alertname }}
      {{- else -}}:heavy_check_mark: {{ .CommonLabels.project_id }} {{ .CommonLabels.severity | toUpper }} {{ .CommonLabels.alertname }}
      {{- end -}}
    text: >
      {{- range .Alerts -}}
        {{"\n"}}
        {{- if .Annotations.summary }}{{ .Annotations.summary }}{{"\n"}}{{ end -}}
        {{- if .Annotations.message }}{{ .Annotations.message }}{{"\n"}}{{ end -}}
        {{- if .Annotations.description }}{{ .Annotations.description }}{{"\n"}}{{ end -}}
      {{- end -}}
      {{"\n"}}
      {{- if .CommonLabels.team }}*team*: {{ .CommonLabels.team }}{{"\n"}}{{ end -}}
      {{- if .CommonLabels.namespace }}*ns*: {{ .CommonLabels.namespace }}{{"\n"}}{{ end -}}
      {{- if .CommonLabels.pod }}*pod*: {{ .CommonLabels.pod }}{{"\n"}}{{ end -}}
      {{- if .CommonLabels.app }}*app*: {{ .CommonLabels.app }}{{"\n"}}{{ end -}}
      {{- if .CommonLabels.name }}*name*: {{ .CommonLabels.name }}{{"\n"}}{{ end -}}
      {{- if .CommonLabels.domain }}*domain*: {{ .CommonLabels.domain }}{{"\n"}}{{ end -}}

- name: slack_argo
  slack_configs:
  - channel: monitoring-non_prod
    http_config:
      authorization:
        credentials: xoxb-xxxx
    send_resolved: true
    title: >
      {{- if eq .Status "firing" -}}:exclamation: {{ .CommonLabels.project_id }} {{ .CommonLabels.severity | toUpper }} {{ .CommonLabels.alertname }}
      {{- else -}}:heavy_check_mark: {{ .CommonLabels.project_id }} {{ .CommonLabels.severity | toUpper }} {{ .CommonLabels.alertname }}
      {{- end -}}
    text: >
      {{- range .Alerts -}}
        {{"\n"}}
        {{- if .Annotations.summary }}{{ .Annotations.summary }}{{"\n"}}{{ end -}}
        {{- if .Annotations.message }}{{ .Annotations.message }}{{"\n"}}{{ end -}}
        {{- if .Annotations.description }}{{ .Annotations.description }}{{"\n"}}{{ end -}}
      {{- end -}}
      {{"\n"}}

- name: slack_airflow
  slack_configs:
  - channel: monitoring-airflow-xxxx-sandbox
    http_config:
      authorization:
        credentials: xoxb-xxxx
    send_resolved: true
    title: >
      {{- if eq .Status "firing" -}}:exclamation: {{ .CommonLabels.project_id }} {{ .CommonLabels.severity | toUpper }} {{ .CommonLabels.alertname }}
      {{- else -}}:heavy_check_mark: {{ .CommonLabels.project_id }} {{ .CommonLabels.severity | toUpper }} {{ .CommonLabels.alertname }}
      {{- end -}}
    text: >
      {{- range .Alerts -}}
        {{"\n"}}
        {{- if .Annotations.summary }}{{ .Annotations.summary }}{{"\n"}}{{ end -}}
        {{- if .Annotations.message }}{{ .Annotations.message }}{{"\n"}}{{ end -}}
        {{- if .Annotations.description }}{{ .Annotations.description }}{{"\n"}}{{ end -}}
      {{- end -}}
      {{"\n"}}

route:
  group_by:
  - alertname
  - cluster
  - service
  group_interval: 10m
  group_wait: 60s
  receiver: "null"
  repeat_interval: 3h
  routes:
  - receiver: incident.io
    match_re:
     severity: warning|critical|error
    continue: true
  - match:
      namespace: airflow
    receiver: slack_airflow
  - match:
      service: xxxx
    receiver: slack_airflow
  - match:
      job: argocd-application-controller-metrics
    receiver: slack_argo
  - match_re:
      severity: warning|error|critical
    receiver: slack
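As a last sanity check when the two values look identical by eye, a minimal local sketch for comparing them byte-for-byte (both URLs below are placeholders; in a live cluster you would pull them with the kubectl template/base64 command and a jsonpath query against the OperatorConfig):

```shell
# Placeholder copy of the secret's config fragment.
cat > am-secret.yaml <<'EOF'
google_cloud:
  external_url: "https://gmp-alertmanager.example.com"
EOF
# Placeholder for OperatorConfig.managedAlertmanager.externalURL.
operator_url='https://gmp-alertmanager.example.com'
# Extract the secret's value, stripping optional quotes.
secret_url=$(sed -n 's/^[[:space:]]*external_url:[[:space:]]*//p' am-secret.yaml | tr -d '"')
if [ "$secret_url" = "$operator_url" ]; then echo match; else echo MISMATCH; fi
```

A trailing slash or stray quote is enough to make the comparison fail, which is exactly the kind of difference that is hard to spot by eye.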
