The Telemetry module helps you collect all relevant metrics of a workload in a Kyma cluster and ship them to a backend for further analysis. Kyma modules like the Istio module or Serverless contribute metrics instantly, and the Telemetry module enriches the data. You can choose among multiple vendors for OTLP-based backends.
Observability is all about exposing the internals of the components belonging to a distributed application and making that data analysable at a central place. While application logs and traces usually provide request-oriented data, metrics are aggregated statistics exposed by a component to reflect the internal state. Typical statistics, like the number of processed requests or the number of registered users, can be very useful to monitor the current state and also the health of a component. Also, you can define proactive and reactive alerts if metrics are about to reach thresholds, or if they have already passed thresholds.
The Telemetry module provides a metric gateway and, optionally, an agent for the collection and shipment of metrics of any container running in the Kyma runtime.
You can configure the metric gateway with external systems using runtime configuration with a dedicated Kubernetes API (CRD) named MetricPipeline. The Metric feature is optional. If you don't want to use it, simply don't set up a MetricPipeline.
-
Before you can collect metrics data from a component, it must expose (or instrument) the metrics. Typically, it instruments specific metrics for the used language runtime (like Node.js) and custom metrics specific to the business logic. Also, the exposure can be in different formats, like the pull-based Prometheus format or the push-based OTLP format.
-
If you want to use Prometheus-based metrics, you must have instrumented your application using a library like the Prometheus client library, with a port in your workload exposed serving as a Prometheus metrics endpoint.
-
For the instrumentation, you typically use an SDK, namely the Prometheus client libraries or the OpenTelemetry SDKs. Both provide extensions to activate language-specific auto-instrumentation, for example for Node.js, and an API to implement custom instrumentation.
In the Telemetry module, a central in-cluster Deployment of an OTel Collector acts as a gateway. The gateway exposes endpoints for the OpenTelemetry Protocol (OTLP) for gRPC and HTTP-based communication using the dedicated telemetry-otlp-metrics service, to which all Kyma modules and users’ applications send the metrics data.
Optionally, the Telemetry module provides a DaemonSet of an OTel Collector acting as an agent. This agent can pull metrics of a workload and the Istio sidecar in the Prometheus pull-based format and can provide runtime-specific metrics for the workload.
-
An application (exposing metrics in OTLP) sends metrics to the central metric gateway service.
-
An application (exposing metrics in Prometheus protocol) activates the agent to scrape the metrics with an annotation-based configuration.
-
Additionally, you can activate the agent to pull metrics of each Istio sidecar.
-
The agent supports collecting metrics from the Kubelet and Kubernetes APIServer.
-
The agent converts and sends all collected metric data to the gateway in OTLP.
-
The gateway discovers the metadata and enriches all received data with typical metadata of the source by communicating with the Kubernetes APIServer. Furthermore, it filters data according to the pipeline configuration.
-
Telemetry Manager configures the agent and gateway according to the MetricPipeline resource specification, including the target backend for the metric gateway. Also, it observes the metrics flow to the backend and reports problems in the MetricPipeline status.
-
The metric gateway sends the data to the observability system that’s specified in your MetricPipeline resource: either within the Kyma cluster, or, if authentication is set up, to an external observability backend.
-
You can analyze the metric data with your preferred backend system.
The MetricPipeline resource is watched by Telemetry Manager, which is responsible for generating the custom parts of the OTel Collector configuration.
-
Telemetry Manager watches all MetricPipeline resources and related Secrets.
-
Furthermore, Telemetry Manager takes care of the full lifecycle of the gateway Deployment and the agent DaemonSet. The gateway and agent are deployed only if you have defined a MetricPipeline.
-
Whenever the user configuration changes, Telemetry Manager validates it and generates a single configuration for the gateway and agent.
-
Referenced Secrets are copied into one Secret that is mounted to the gateway as well.
In a Kyma cluster, the metric gateway is the central component to which all components can send their individual metrics. The gateway collects, enriches, and dispatches the data to the configured backend. For more information, see Telemetry Gateways.
If a MetricPipeline configures a feature in the input section, an additional DaemonSet is deployed acting as an agent. The agent is also based on an OTel Collector and handles the collection and conversion of Prometheus-based metrics. For this, the workload puts a prometheus.io/scrape annotation on the specification of the Pod or Service, and the agent scrapes the annotated endpoint. The agent sends all data in OTLP to the central gateway.
In the following steps, you can see how to construct and deploy a typical MetricPipeline. Learn more about the available parameters and attributes.
To ship metrics to a new OTLP output, create a resource of the kind MetricPipeline and save the file (named, for example, metricpipeline.yaml).
This configures the underlying OTel Collector with a pipeline for metrics and opens a push endpoint that is accessible with the telemetry-otlp-metrics service. For details, see Usage.
The following push URLs are set up:
- gRPC: http://telemetry-otlp-metrics.kyma-system:4317
- HTTP: http://telemetry-otlp-metrics.kyma-system:4318
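As a minimal sketch of how a workload could use these push URLs, the following Deployment fragment points an OpenTelemetry SDK at the gRPC endpoint through the standard OTEL_EXPORTER_OTLP_METRICS_ENDPOINT and OTEL_EXPORTER_OTLP_METRICS_PROTOCOL environment variables. The Deployment name, labels, and image are placeholders that do not come from this documentation, and the sketch assumes that the SDK in your application reads these environment variables.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app                  # hypothetical workload name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
    spec:
      containers:
        - name: app
          image: example.com/sample-app:latest   # placeholder image
          env:
            # Standard OpenTelemetry SDK variables for the metrics signal
            - name: OTEL_EXPORTER_OTLP_METRICS_ENDPOINT
              value: http://telemetry-otlp-metrics.kyma-system:4317
            - name: OTEL_EXPORTER_OTLP_METRICS_PROTOCOL
              value: grpc
```

For OTLP over HTTP, the SDK would target port 4318 instead. With the signal-specific variable, the OpenTelemetry specification expects the full URL, including the /v1/metrics path.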
The default protocol for shipping the data to a backend is gRPC, but you can choose HTTP instead. Depending on the configured protocol, an otlp or an otlphttp exporter is used. Ensure that the correct port is configured as part of the endpoint.
-
For gRPC, use:
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  output:
    otlp:
      endpoint:
        value: https://backend.example.com:4317
-
For HTTP, use the protocol attribute:
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  output:
    otlp:
      protocol: http
      endpoint:
        value: https://backend.example.com:4318
To integrate with external systems, you must configure authentication details. You can use mutual TLS (mTLS), Basic Authentication, or custom headers:
-
mTLS:
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  output:
    otlp:
      endpoint:
        value: https://backend.example.com:4317
      tls:
        cert:
          value: |
            -----BEGIN CERTIFICATE-----
            ...
        key:
          value: |
            -----BEGIN RSA PRIVATE KEY-----
            ...
-
Basic Authentication:
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  output:
    otlp:
      endpoint:
        value: https://backend.example.com:4317
      authentication:
        basic:
          user:
            value: myUser
          password:
            value: myPwd
-
Token-based authentication with custom headers:
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  output:
    otlp:
      endpoint:
        value: https://backend.example.com:4317
      headers:
        - name: Authorization
          prefix: Bearer
          value: "myToken"
Integrations into external systems usually need authentication details dealing with sensitive data. To handle that data properly in Secrets, MetricPipeline supports the reference of Secrets.
Using the valueFrom attribute, you can map Secret keys for mutual TLS (mTLS), Basic Authentication, or custom headers.
You can store the value of the token in the referenced Secret without any prefix or scheme, and you can configure it in the headers section of the MetricPipeline. In this example, the token has the prefix “Bearer”.
-
mTLS:
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  output:
    otlp:
      endpoint:
        value: https://backend.example.com:4317
      tls:
        cert:
          valueFrom:
            secretKeyRef:
              name: backend
              namespace: default
              key: cert
        key:
          valueFrom:
            secretKeyRef:
              name: backend
              namespace: default
              key: key
-
Basic Authentication:
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  output:
    otlp:
      endpoint:
        valueFrom:
          secretKeyRef:
            name: backend
            namespace: default
            key: endpoint
      authentication:
        basic:
          user:
            valueFrom:
              secretKeyRef:
                name: backend
                namespace: default
                key: user
          password:
            valueFrom:
              secretKeyRef:
                name: backend
                namespace: default
                key: password
-
Token-based authentication with custom headers:
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  output:
    otlp:
      endpoint:
        value: https://backend.example.com:4317
      headers:
        - name: Authorization
          prefix: Bearer
          valueFrom:
            secretKeyRef:
              name: backend
              namespace: default
              key: token
The related Secret must have the referenced name, be located in the referenced namespace, and contain the mapped key. See the following example:
kind: Secret
apiVersion: v1
metadata:
  name: backend
  namespace: default
stringData:
  endpoint: https://backend.example.com:4317
  user: myUser
  password: XXX
  token: YYY
Telemetry Manager continuously watches the Secret referenced with the secretKeyRef construct. You can update the Secret’s values, and Telemetry Manager detects the changes and applies the new Secret to the setup.
If you use a Secret owned by the SAP BTP Service Operator, you can configure an automated rotation using a credentialsRotationPolicy with a specific rotationFrequency and don’t have to intervene manually.
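A minimal sketch of such a rotation setup, assuming the ServiceBinding resource of the SAP BTP Service Operator. The binding and instance names are hypothetical, and the frequency and TTL values are arbitrary examples. Verify the field names against the operator version installed in your cluster, and note that the keys in the generated Secret depend on the bound service.

```yaml
apiVersion: services.cloud.sap.com/v1
kind: ServiceBinding
metadata:
  name: backend-binding                  # hypothetical binding name
  namespace: default
spec:
  serviceInstanceName: backend-instance  # hypothetical ServiceInstance
  secretName: backend                    # Secret that the MetricPipeline references
  credentialsRotationPolicy:
    enabled: true
    rotationFrequency: 168h              # example value: rotate weekly
    rotatedBindingTTL: 24h               # example value: keep the old binding for one day
```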
For the following approach, you must have instrumented your application using a library like the Prometheus client library, with a port in your workload exposed serving as a Prometheus metrics endpoint.
To enable collection of Prometheus-based metrics, define a MetricPipeline that has the prometheus section enabled as input:
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  input:
    prometheus:
      enabled: true
  output:
    otlp:
      endpoint:
        value: https://backend.example.com:4317
The Metric agent is configured with a generic scrape configuration, which uses annotations to specify the endpoints to scrape in the cluster.
For metrics ingestion to start automatically, use the annotations of the following table. If an Istio sidecar is present, apply them to a Service that resolves your metrics port. By annotating the Service, all endpoints targeted by the Service are resolved and scraped by the Metric agent bypassing the Service itself. Alternatively, you can apply the annotations directly to the Pod, but only if no Istio sidecar is present.
Prometheus Metrics Annotations
Annotation Key | Example Values | Default Value | Description |
---|---|---|---|
prometheus.io/scrape | true, false | none | Controls whether Prometheus Receiver automatically scrapes metrics from this target. |
prometheus.io/port | 8080, 9100 | none | Specifies the port where the metrics are exposed. |
prometheus.io/path | /metrics, /custom_metrics | /metrics | Defines the HTTP path where Prometheus Receiver can find metrics data. |
prometheus.io/scheme | http, https | If Istio is active, https is supported; otherwise, only http is available. The default scheme is http unless an Istio sidecar is present. | Determines the protocol used for scraping metrics: either HTTPS with mTLS or plain HTTP. |
prometheus.io/param_<name> | prometheus.io/param_format: prometheus | none | Instructs Prometheus Receiver to pass name-value pairs as URL parameters when calling the metrics endpoint. |
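For a workload without an Istio sidecar, the annotations can go directly on the Pod. The following sketch shows what that could look like; the Pod name, image, port, and path are hypothetical, and prometheus.io/path is assumed to be the annotation key that carries the metrics path. In practice, you would usually set these annotations in the Pod template of a Deployment.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sample                          # hypothetical Pod name
  labels:
    app: sample
  annotations:
    prometheus.io/scrape: "true"        # opt in to scraping
    prometheus.io/port: "8080"          # port serving the metrics endpoint
    prometheus.io/path: "/custom_metrics"   # only needed if the path is not /metrics
spec:
  containers:
    - name: app
      image: example.com/sample-app:latest  # placeholder image
      ports:
        - name: http-metrics
          containerPort: 8080
```

Annotation values must be strings, so the port and the boolean are quoted.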
If you're running the Pod targeted by a Service with Istio, Istio must be able to derive the appProtocol from the Service port definition; otherwise the communication for scraping the metric endpoint cannot be established. You must either prefix the port name with the protocol like in http-metrics, or explicitly define the appProtocol attribute. For example, see the following Service configuration:
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/port: "8080"
    prometheus.io/scrape: "true"
  name: sample
spec:
  ports:
  - name: http-metrics
    appProtocol: http
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app: sample
  type: ClusterIP
The Metric agent can scrape endpoints even if the workload is a part of the Istio service mesh and accepts mTLS communication. However, there’s a constraint: For scraping through HTTPS, Istio must configure the workload using “STRICT” mTLS mode. Without “STRICT” mTLS mode, you can set up scraping through HTTP by applying the annotation prometheus.io/scheme=http. For related troubleshooting, see Log Entry: Failed to Scrape Prometheus Endpoint.
By default, a MetricPipeline emits metrics about the health of all pipelines managed by the Telemetry module. Based on these metrics, you can track the status of every individual pipeline and set up alerting for it.
Metrics for Pipelines and the Telemetry Module
Metric | Description | Availability |
---|---|---|
kyma_resource_status_conditions | Value represents status of different conditions reported by the resource. Possible values are 1 (“True”), 0 (“False”), and -1 (other status values) | Available for both the pipelines and the Telemetry resource |
 | Value represents the state of the resource (if present) | Available for the Telemetry resource |
Metric Attributes for Monitoring
Name | Description |
---|---|
 | Type of the condition |
 | Status of the condition |
 | Contains a programmatic identifier indicating the reason for the condition's last transition |
To set up alerting, use an alert rule. In the following example, the alert is triggered if metrics are not delivered to the backend:
min by (k8s_resource_name) ((kyma_resource_status_conditions{type="TelemetryFlowHealthy",k8s_resource_kind="metricpipelines"})) == 0
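If your backend evaluates Prometheus-style rule files, the expression above could be wrapped into an alerting rule as in the following sketch. The group name, alert name, for duration, severity label, and summary text are illustrative assumptions and are not prescribed by the Telemetry module.

```yaml
groups:
  - name: telemetry-pipeline-health          # hypothetical group name
    rules:
      - alert: MetricPipelineFlowUnhealthy   # hypothetical alert name
        # Same expression as above, folded across lines for readability
        expr: >-
          min by (k8s_resource_name)
          ((kyma_resource_status_conditions{type="TelemetryFlowHealthy",k8s_resource_kind="metricpipelines"}))
          == 0
        for: 5m                              # assumed grace period before firing
        labels:
          severity: warning                  # assumed routing label
        annotations:
          summary: "MetricPipeline {{ $labels.k8s_resource_name }} is not delivering metrics"
```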
To enable collection of runtime metrics, define a MetricPipeline that has the runtime section enabled as input:
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  input:
    runtime:
      enabled: true
  output:
    otlp:
      endpoint:
        value: https://backend.example.com:4317
By default, metrics for all resources (Pod, container, Node, Volume, DaemonSet, Deployment, StatefulSet, and Job) are collected.
To enable or disable the collection of metrics for a specific resource, use the resources section in the runtime input.
The following example collects only DaemonSet, Deployment, StatefulSet, and Job metrics:
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  input:
    runtime:
      enabled: true
      resources:
        pod:
          enabled: false
        container:
          enabled: false
        node:
          enabled: false
        volume:
          enabled: false
        daemonset:
          enabled: true
        deployment:
          enabled: true
        statefulset:
          enabled: true
        job:
          enabled: true
  output:
    otlp:
      endpoint:
        value: https://backend.example.com:4317
Collected Metrics per Resource
Resource | From the kubeletstatsreceiver | From the k8sclusterreceiver |
---|---|---|
Pod | | |
Container | | |
Node | | - |
Volume | | - |
Deployment | - | |
DaemonSet | - | |
StatefulSet | - | |
Job | - | |
To enable collection of Istio metrics, define a MetricPipeline that has the istio section enabled as input:
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  input:
    istio:
      enabled: true
  output:
    otlp:
      endpoint:
        value: https://backend.example.com:4317
With this, the agent starts collecting all Istio metrics from Istio sidecars.
If you are using the istio input, you can also collect Envoy metrics. Envoy metrics provide insights into the performance and behavior of the Envoy proxy, such as request rates, latencies, and error counts. These metrics are useful for observability and troubleshooting service mesh traffic.
For details, see the list of available Envoy metrics and server metrics.
Envoy metrics are only available for the istio input. Ensure that Istio sidecars are correctly injected into your workloads for Envoy metrics to be available.
By default, Envoy metrics collection is disabled.
To activate Envoy metrics, enable the envoyMetrics section in the MetricPipeline specification under the istio input:
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: envoy-metrics
spec:
  input:
    istio:
      enabled: true
      envoyMetrics:
        enabled: true
  output:
    otlp:
      endpoint:
        value: https://backend.example.com:4317
By default, otlp input is enabled.
To drop the push-based OTLP metrics that are received by the Metric gateway, define a MetricPipeline that has the otlp section disabled as an input:
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  input:
    istio:
      enabled: true
    otlp:
      disabled: true
  output:
    otlp:
      endpoint:
        value: https://backend.example.com:4317
With this, the agent starts collecting all Istio metrics from Istio sidecars, and the push-based OTLP metrics are dropped.
To filter metrics by namespaces, define a MetricPipeline that has the namespaces section defined in one of the inputs. For example, you can specify the namespaces from which metrics are collected or the namespaces from which metrics are dropped. Learn more about the available parameters and attributes.
-
The following example collects runtime metrics only from the foo and bar namespaces:
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  input:
    runtime:
      enabled: true
      namespaces:
        include:
          - foo
          - bar
  output:
    otlp:
      endpoint:
        value: https://backend.example.com:4317
-
The following example collects runtime metrics from all namespaces except the foo and bar namespaces:
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  input:
    runtime:
      enabled: true
      namespaces:
        exclude:
          - foo
          - bar
  output:
    otlp:
      endpoint:
        value: https://backend.example.com:4317
The default settings depend on the input:
If no namespace selector is defined for the prometheus or runtime input, then metrics from system namespaces are excluded by default.
However, if the namespace selector is not defined for the istio and otlp input, then metrics from system namespaces are included by default.
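Because the defaults differ per input, it can help to make the selection explicit. The following sketch combines two inputs in one MetricPipeline, each with its own namespaces section. It assumes that the namespaces section works for the prometheus and istio inputs the same way as shown for runtime, and that kyma-system and kube-system count as system namespaces. The foo namespace is a placeholder.

```yaml
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  input:
    prometheus:
      enabled: true
      namespaces:
        include:
          - foo              # placeholder: scrape annotated workloads only in this namespace
    istio:
      enabled: true
      namespaces:
        exclude:             # assumption: these are treated as system namespaces
          - kyma-system
          - kube-system
  output:
    otlp:
      endpoint:
        value: https://backend.example.com:4317
```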
If you use the prometheus or istio input, for every metric source typical scrape metrics are produced, such as up, scrape_duration_seconds, scrape_samples_scraped, scrape_samples_post_metric_relabeling, and scrape_series_added.
By default, they are disabled.
If you want to use them for debugging and diagnostic purposes, you can activate them. To activate diagnostic metrics, define a MetricPipeline that has the diagnosticMetrics section defined.
-
The following example collects diagnostic metrics only for input istio:
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  input:
    istio:
      enabled: true
      diagnosticMetrics:
        enabled: true
  output:
    otlp:
      endpoint:
        value: https://backend.example.com:4317
-
The following example collects diagnostic metrics only for input prometheus:
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  input:
    prometheus:
      enabled: true
      diagnosticMetrics:
        enabled: true
  output:
    otlp:
      endpoint:
        value: https://backend.example.com:4317
Diagnostic metrics are only available for inputs prometheus and istio. Learn more about the available parameters and attributes.
To activate the MetricPipeline, apply the metricpipeline.yaml resource file in your cluster:
kubectl apply -f metricpipeline.yaml
You activated a MetricPipeline and metrics start streaming to your backend.
To check that the pipeline is running, wait until the status conditions of the MetricPipeline in your cluster have status True:
kubectl get metricpipeline
NAME      CONFIGURATION GENERATED   GATEWAY HEALTHY   AGENT HEALTHY   FLOW HEALTHY
backend   True                      True              True            True
A MetricPipeline runs several OTel Collector instances in your cluster. This Deployment serves OTLP endpoints and ships received data to the configured backend.
The Telemetry module ensures that the OTel Collector instances are operational and healthy at any time, for example, with buffering and retries. However, there may be situations when the instances drop metrics, or cannot handle the metric load.
To detect and fix such situations, check the pipeline status and check out Troubleshooting.
If you have set up pipeline health monitoring, check the alerts and reports in an integrated backend like SAP Cloud Logging. For details, see Setting up a MetricPipeline, step Monitor Pipeline Health, as well as Integrate with SAP Cloud Logging.
It's not recommended to access the metrics endpoint of the used OTel Collector instances directly, because the exposed metrics are not an official API of the Kyma Telemetry module. Breaking changes can happen if the underlying OTel Collector version introduces them.
Instead, use the pipeline status.
-
Throughput: Assuming an average metric with 20 metric data points and 10 labels, the default metric gateway setup has a maximum throughput of 34K metric data points/sec. If more data is sent to the gateway, it is refused. To increase the maximum throughput, manually scale out the gateway by increasing the number of replicas for the Metric gateway.
The metric agent setup has a maximum throughput of 14K metric data points/sec per instance. If more data must be ingested, it is refused. If a metric data endpoint emits more than 50,000 metric data points per scrape loop, the metric agent refuses all the data.
-
Load Balancing With Istio: To ensure availability, the metric gateway runs with multiple instances. If you want to increase the maximum throughput, use manual scaling and enter a higher number of instances.
By design, the connections to the gateway are long-living connections (because OTLP is based on gRPC and HTTP/2). For optimal scaling of the gateway, the clients or applications must balance the connections across the available instances, which is automatically achieved if you use an Istio sidecar. If your application has no Istio sidecar, the data is always sent to one instance of the gateway.
-
Unavailability of Output: For up to 5 minutes, a retry for data is attempted when the destination is unavailable. After that, data is dropped.
-
No Guaranteed Delivery: The used buffers are volatile. If the gateway or agent instances crash, metric data can be lost.
-
Multiple MetricPipeline Support: The maximum number of MetricPipeline resources is 3.
Symptom:
-
No metrics arrive at the backend.
-
In the MetricPipeline status, the TelemetryFlowHealthy condition has status AllDataDropped.
Cause: Incorrect backend endpoint configuration (such as using the wrong authentication credentials) or the backend is unreachable.
Solution:
-
Check the telemetry-metric-gateway Pods for error logs by calling kubectl logs -n kyma-system {POD_NAME}.
-
Check if the backend is up and reachable.
-
Fix the errors.
Symptom:
-
The backend is reachable and the connection is properly configured, but some metrics are refused.
-
In the MetricPipeline status, the TelemetryFlowHealthy condition has status SomeDataDropped.
Cause: It can happen due to a variety of reasons - for example, the backend is limiting the ingestion rate.
Solution:
-
Check the telemetry-metric-gateway Pods for error logs by calling kubectl logs -n kyma-system {POD_NAME}. Also, check your observability backend to investigate potential causes.
-
If the backend is limiting the rate by refusing metrics, try the options described in Gateway Buffer Filling Up.
-
Otherwise, take the actions appropriate to the cause indicated in the logs.
Symptom: Custom metrics don’t arrive at the backend, but Istio metrics do.
Cause: Your SDK version is incompatible with the OTel Collector version.
Solution:
-
Check which SDK version you are using for instrumentation.
-
Investigate whether it is compatible with the OTel Collector version.
-
If required, upgrade to a supported SDK version.
Symptom: Custom metrics don’t arrive at the destination. The OTel Collector produces log entries saying “Failed to scrape Prometheus endpoint”, such as the following example:
2023-08-29T09:53:07.123Z warn internal/transaction.go:111 Failed to scrape Prometheus endpoint {"kind": "receiver", "name": "prometheus/app-pods", "data_type": "metrics", "scrape_timestamp": 1693302787120, "target_labels": "{__name__=\"up\", instance=\"10.42.0.18:8080\", job=\"app-pods\"}"}
Cause 1: The workload is not configured to use “STRICT” mTLS mode. For details, see Activate Prometheus-Based Metrics.
Solution 1: You can either set up “STRICT” mTLS mode or HTTP scraping:
-
Configure the workload using “STRICT” mTLS mode (for example, by applying a corresponding PeerAuthentication).
-
Set up scraping through HTTP by applying the prometheus.io/scheme=http annotation.
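For the first option, a PeerAuthentication along the following lines could enforce “STRICT” mTLS mode for a single workload. This is a sketch under assumptions: the resource name, namespace, and workload label are hypothetical, and the apiVersion depends on the Istio version in your cluster.

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: sample-strict-mtls     # hypothetical name
  namespace: default           # assumption: namespace of the scraped workload
spec:
  selector:
    matchLabels:
      app: sample              # hypothetical label of the scraped workload
  mtls:
    mode: STRICT               # only mTLS traffic is accepted by the workload
```

Omitting the selector would apply the policy to all workloads in the namespace.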
Cause 2: The Service definition enabling the scrape with Prometheus annotations does not reveal the application protocol to use in the port definition. For details, see Activate Prometheus-Based Metrics.
Solution 2: Define the application protocol in the Service port definition by either prefixing the port name with the protocol, like in http-metrics, or defining the appProtocol attribute.
Cause 3: A deny-all NetworkPolicy was created in the workload namespace, which prevents the agent from scraping metrics from annotated workloads.
Solution 3: Create a separate NetworkPolicy to explicitly let the agent scrape your workload using the telemetry.kyma-project.io/metric-scrape label. For example, see the following NetworkPolicy configuration:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-traffic-from-agent
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: "annotated-workload" # <your workload here>
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kyma-system
      podSelector:
        matchLabels:
          telemetry.kyma-project.io/metric-scrape: "true"
  policyTypes:
  - Ingress
Symptom: In the MetricPipeline status, the TelemetryFlowHealthy condition has status BufferFillingUp.
Cause: The backend export rate is too low compared to the gateway ingestion rate.
Solution:
-
Option 1: Increase the maximum backend ingestion rate, for example, by scaling out the SAP Cloud Logging instances.
-
Option 2: Reduce emitted metrics by re-configuring the MetricPipeline (for example, by disabling certain inputs or applying namespace filters).
-
Option 3: Reduce emitted metrics in your applications.
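As a sketch of reducing emitted metrics through the pipeline configuration, the following MetricPipeline limits the runtime input to one namespace and switches off some resource types. It uses only the namespaces and resources settings described earlier; the foo namespace and the choice of disabled resources are placeholders to adapt to your workload.

```yaml
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  input:
    runtime:
      enabled: true
      namespaces:
        include:
          - foo              # placeholder: collect only where you need the data
      resources:
        volume:
          enabled: false     # example: drop Volume metrics
        job:
          enabled: false     # example: drop Job metrics
  output:
    otlp:
      endpoint:
        value: https://backend.example.com:4317
```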
Symptom: In the MetricPipeline status, the TelemetryFlowHealthy condition has status GatewayThrottling.
Cause: The gateway cannot receive metrics at the given rate.
Solution: Manually scale out the gateway by increasing the number of replicas for the Metric gateway. See Module Configuration and Status.