Merged
4 changes: 2 additions & 2 deletions api/v1alpha1/workerresourcetemplate_types.go
Original file line number Diff line number Diff line change
@@ -34,10 +34,10 @@ type WorkerResourceTemplateSpec struct {
// PodDisruptionBudgets and other resources that select pods.
//
// spec.metrics[*].external.metric.selector.matchLabels: {} (or with user labels)
-// The controller appends temporal_temporal_worker_deployment_name, temporal_worker_build_id, and
+// The controller appends temporal_worker_deployment_name, temporal_worker_build_id, and
// temporal_namespace to any External metric selector where matchLabels is present.
// User labels (e.g. task_type: "Activity") coexist alongside the injected keys.
-// Do not set temporal_temporal_worker_deployment_name, temporal_worker_build_id, or
+// Do not set temporal_worker_deployment_name, temporal_worker_build_id, or
// temporal_namespace manually — the webhook will reject them.
// +kubebuilder:validation:Required
// +kubebuilder:pruning:PreserveUnknownFields
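The injection behavior documented in this comment can be sketched as a hypothetical template fragment. Only the field path `spec.metrics[*].external.metric.selector.matchLabels` comes from the doc comment above; the metric name and target are illustrative placeholders:

```yaml
# Hypothetical WorkerResourceTemplate fragment; only the matchLabels field
# path is taken from the doc comment, the rest is illustrative.
spec:
  metrics:
    - type: External
      external:
        metric:
          name: temporal_backlog_count_by_version
          selector:
            matchLabels:
              task_type: "Activity"  # user label, kept alongside injected keys
        target:
          type: AverageValue
          averageValue: "10"
# The controller then appends to matchLabels:
#   temporal_worker_deployment_name, temporal_worker_build_id, temporal_namespace
```

Per the comment above, setting any of the three injected keys by hand is rejected by the webhook.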
11 changes: 6 additions & 5 deletions internal/demo/README.md
@@ -263,11 +263,12 @@ Stop the load generator (`Ctrl-C`) and watch the HPA scale back down as in-fligh
`approximate_backlog_count` measures tasks queued in Temporal but not yet started on a worker. Adding it as a second HPA metric means the HPA scales up on *arriving* work even before slots are full — important for bursty traffic.

> **Note:** Temporal Cloud emits `temporal_approximate_backlog_count` with a combined
-> `version="namespace/twd-name:build-id"` label that contains characters invalid in
-> Kubernetes label values (`/` and `:`). The recording rule in
-> `prometheus-stack-values.yaml` uses `label_replace` to extract `twd_name` and
-> `build_id` as separate k8s-compatible labels, producing `temporal_backlog_count_by_version`.
-> The HPA then selects on those labels — the same pair used by Phase 1.
+> `worker_version="<worker-deployment-name>_<build-id>"` label that easily exceeds the Kubernetes
+> maximum label-value length of 63 characters. The recording rule in `prometheus-stack-values.yaml`
+> uses `label_replace` to extract `temporal_worker_deployment_name` and `temporal_worker_build_id`
+> as separate k8s-compatible labels, producing `temporal_backlog_count_by_version`. The HPA then
+> selects on those labels — the same pair used by Phase 1. Temporal Cloud is rolling out these
+> separate labels natively, so this workaround is required only until then.
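Concretely, an HPA entry selecting the recorded series on that label pair might look like the following sketch (the deployment name, build ID, and scaling target are illustrative placeholders, not values from the demo):

```yaml
# Hypothetical HPA external-metric entry; label values and the target
# threshold are placeholders.
metrics:
  - type: External
    external:
      metric:
        name: temporal_backlog_count_by_version
        selector:
          matchLabels:
            task_type: "Workflow"
            temporal_worker_deployment_name: default_helloworld
            temporal_worker_build_id: "1.0"
      target:
        type: AverageValue
        averageValue: "10"  # scale up when backlog per pod exceeds 10
```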

**Step 1 — Create the Temporal Cloud credentials secret.**

8 changes: 4 additions & 4 deletions internal/demo/k8s/grafana-dashboard.json
@@ -95,8 +95,8 @@
},
"targets": [
{
"expr": "temporal_slot_utilization",
"legendFormat": "{{temporal_worker_deployment_name}} / {{temporal_worker_build_id}}",
"expr": "temporal_slot_utilization{temporal_worker_deployment_name=\"default_helloworld\"}",
"legendFormat": "{{worker_type}} - {{temporal_worker_deployment_name}} / {{temporal_worker_build_id}}",
"refId": "A"
}
],
@@ -127,7 +127,7 @@
},
"targets": [
{
"expr": "temporal_backlog_count_by_version{task_type=\"Workflow\"}",
"expr": "temporal_backlog_count_by_version{task_type=\"Workflow\", temporal_worker_deployment_name=\"default_helloworld\"}",
"legendFormat": "{{temporal_worker_deployment_name}} / {{temporal_worker_build_id}}",
"refId": "A"
}
@@ -152,7 +152,7 @@
},
"targets": [
{
"expr": "temporal_backlog_count_by_version{task_type=\"Activity\"}",
"expr": "temporal_backlog_count_by_version{task_type=\"Activity\", temporal_worker_deployment_name=\"default_helloworld\"}",
"legendFormat": "{{temporal_worker_deployment_name}} / {{temporal_worker_build_id}}",
"refId": "A"
}
13 changes: 7 additions & 6 deletions internal/demo/k8s/prometheus-adapter-values.yaml
@@ -21,21 +21,22 @@ prometheus:
rules:
external:
# Phase 1: slot utilization per worker version.
-# HPA selector: metric.name=temporal_slot_utilization, matchLabels: twd_name + build_id.
-# The worker emits twd_name and build_id as separate Prometheus labels (both valid
-# Kubernetes label values), and the recording rule in prometheus-stack-values.yaml
-# aggregates them into temporal_slot_utilization.
+# HPA selector: metric.name=temporal_slot_utilization,
+# matchLabels: temporal_worker_deployment_name + temporal_worker_build_id + temporal_namespace.
+# The worker emits those as separate Prometheus labels (all valid Kubernetes label values), and
+# the recording rule in prometheus-stack-values.yaml aggregates them into temporal_slot_utilization.
- seriesQuery: 'temporal_slot_utilization{}'
metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>})'
name:
as: "temporal_slot_utilization"
resources:
-namespaced: false # cluster-scoped: HPAs in any namespace can consume this metric
+namespaced: false # cluster-scoped: HPAs in any k8s namespace can consume this metric

# Phase 2: approximate backlog count per worker version (from Temporal Cloud).
# Uses the temporal_backlog_count_by_version recording rule.
# cluster-scoped so HPAs in any namespace can consume it; temporal_worker_deployment_name
-# + build_id matchLabels in the HPA are sufficient to select the right series.
+# + temporal_worker_build_id + temporal_namespace matchLabels in the HPA are sufficient to
+# select the right series.
- seriesQuery: 'temporal_backlog_count_by_version{}'
metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>})'
name:
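An HPA consuming the Phase 1 metric through this adapter rule might look roughly like the sketch below (the label values and the 0.8 threshold are illustrative placeholders):

```yaml
# Hypothetical HPA external-metric entry for slot utilization; all values
# are placeholders.
metrics:
  - type: External
    external:
      metric:
        name: temporal_slot_utilization
        selector:
          matchLabels:
            temporal_worker_deployment_name: default_helloworld
            temporal_worker_build_id: "1.0"
            temporal_namespace: my-namespace
      target:
        type: Value
        value: "800m"  # scale out above 0.8 average slot utilization
```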
7 changes: 4 additions & 3 deletions internal/demo/k8s/prometheus-stack-values.yaml
@@ -61,8 +61,8 @@ additionalPrometheusRulesMap:
# Slot utilization ratio per worker version, filtered to activity workers only.
# Range: 0.0 (idle) to 1.0 (fully saturated).
#
-# The worker emits twd_name and build_id as separate labels (both valid
-# Kubernetes label values), so no label_replace is needed here.
+# The worker emits temporal_worker_deployment_name, temporal_worker_build_id, temporal_namespace, and worker_type
+# as separate labels (all valid Kubernetes label values), so no label_replace is needed here.
- record: temporal_slot_utilization
expr: |
sum by (temporal_worker_deployment_name, temporal_worker_build_id, temporal_namespace, worker_type) (
@@ -86,7 +86,7 @@ additionalPrometheusRulesMap:
# Backlog count per worker version, shaped to match the label format that
# Temporal Cloud will emit natively in a future release. This recording rule
# is a temporary shim: once Temporal Cloud emits temporal_worker_deployment_name and
-# build_id as separate labels, this rule can be deleted with no other changes.
+# temporal_worker_build_id as separate labels, this rule can be deleted with no
+# other changes. Note: this rule only works with Build IDs that don't have underscores.
#
# Current Temporal Cloud label:
# worker_version="{k8s-namespace}_{twd-name}_{build-id}"
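The shim might be sketched roughly as the nested `label_replace` below. This is an illustrative reconstruction, not the shipped rule; it assumes the three-part `worker_version` format quoted above, with no underscores inside any component (hence the Build ID caveat):

```yaml
# Illustrative sketch only; the real rule lives in this file's
# additionalPrometheusRulesMap. label_replace regexes are fully anchored,
# so the replacement is skipped if any component contains an underscore.
- record: temporal_backlog_count_by_version
  expr: |
    label_replace(
      label_replace(
        temporal_approximate_backlog_count,
        "temporal_worker_deployment_name", "$2", "worker_version", "([^_]+)_([^_]+)_([^_]+)"
      ),
      "temporal_worker_build_id", "$3", "worker_version", "([^_]+)_([^_]+)_([^_]+)"
    )
```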
2 changes: 1 addition & 1 deletion internal/demo/util/observability.go
@@ -37,7 +37,7 @@ func configureObservability(deploymentName, buildID, temporalNamespace string, m
m = opentelemetry.NewMetricsHandler(opentelemetry.MetricsHandlerOptions{
Meter: metric.NewMeterProvider(metric.WithReader(exporter)).Meter("worker"),
InitialAttributes: attribute.NewSet(
attribute.String("temporal_temporal_worker_deployment_name", deploymentNameCleanForLabel),
attribute.String("temporal_worker_deployment_name", deploymentNameCleanForLabel),
attribute.String("temporal_worker_build_id", buildID),
attribute.String("temporal_namespace", temporalNamespace),
),