Commit cf9e5b2

Authored by tiffany76, theletterf, chalin, mx-psi, and svrnm

Unify internal observability documentation - 2 of 3 (#4322)

Co-authored-by: Fabrizio Ferri-Benedetti <[email protected]>
Co-authored-by: Patrice Chalin <[email protected]>
Co-authored-by: Pablo Baeyens <[email protected]>
Co-authored-by: Severin Neumann <[email protected]>

1 parent 157a5e8 commit cf9e5b2

1 file changed: content/en/docs/collector/internal-telemetry.md (+163, -15 lines)
@@ -1,12 +1,14 @@
 ---
 title: Internal telemetry
 weight: 25
-cSpell:ignore: journalctl kube otecol pprof tracez zpages
+# prettier-ignore
+cSpell:ignore: alloc journalctl kube otecol pprof tracez underperforming zpages
 ---

 You can monitor the health of any OpenTelemetry Collector instance by checking
-its own internal telemetry. Read on to learn how to configure this telemetry to
-help you [troubleshoot](/docs/collector/troubleshooting/) Collector issues.
+its own internal telemetry. Read on to learn about this telemetry and how to
+configure it to help you [troubleshoot](/docs/collector/troubleshooting/)
+Collector issues.

 ## Activate internal telemetry in the Collector

@@ -31,26 +33,29 @@ Set the address in the config `service::telemetry::metrics`:
 service:
   telemetry:
     metrics:
-      address: '0.0.0.0:8888'
+      address: 0.0.0.0:8888
 ```

-You can enhance the metrics telemetry level using the `level` field. The
-following is a list of all possible values and their explanations.
+You can adjust the verbosity of the Collector metrics output by setting the
+`level` field to one of the following values:

-- `none` indicates that no telemetry data should be collected.
-- `basic` is the recommended value and covers the basics of the service
-  telemetry.
-- `normal` adds other indicators on top of basic.
-- `detailed` adds dimensions and views to the previous levels.
+- `none`: no telemetry is collected.
+- `basic`: essential service telemetry.
+- `normal`: the default level, adds standard indicators on top of basic.
+- `detailed`: the most verbose level, includes dimensions and views.

-For example:
+Each verbosity level represents a threshold at which certain metrics are
+emitted. For the complete list of metrics, with a breakdown by level, see
+[Lists of internal metrics](#lists-of-internal-metrics).
+
+The default level for metrics output is `normal`. To use another level, set
+`service::telemetry::metrics::level`:

 ```yaml
 service:
   telemetry:
     metrics:
       level: detailed
-      address: ':8888'
 ```

 The Collector can also be configured to scrape its own metrics and send them
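
For reference, a minimal self-scraping setup could look like the following sketch. It assumes the `prometheus` receiver (shipped in the Collector contrib distribution) and a hypothetical OTLP backend at `my-backend:4317`; both are illustrative choices, not part of this change.

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: otelcol
          scrape_interval: 10s
          static_configs:
            # The Collector's own internal metrics endpoint, as configured above.
            - targets: ['0.0.0.0:8888']

exporters:
  otlp:
    # Hypothetical backend; replace with your own destination.
    endpoint: my-backend:4317

service:
  telemetry:
    metrics:
      address: 0.0.0.0:8888
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlp]
```

With a configuration like this, the Collector both exposes its internal metrics on port 8888 and forwards them through a regular metrics pipeline.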
@@ -80,8 +85,11 @@ service:

 {{% alert title="Caution" color="warning" %}}

-Self-monitoring is a risky practice. If an issue arises, the source of the
-problem is unclear and the telemetry is unreliable.
+When self-monitoring, the Collector collects its own telemetry and sends it to
+the desired backend for analysis. This can be a risky practice. If the Collector
+is underperforming, its self-monitoring capability could be impacted. As a
+result, the self-monitored telemetry might not reach the backend in time for
+critical analysis.

 {{% /alert %}}

@@ -113,3 +121,143 @@ journalctl | grep otelcol | grep Error
 ```

 {{% /tab %}} {{< /tabpane >}}
+
+## Types of internal observability
+
+The OpenTelemetry Collector aims to be a model of an observable service by
+clearly exposing its own operational metrics. Additionally, it collects host
+resource metrics that can help you understand if problems are caused by a
+different process on the same host. Specific components of the Collector can
+also emit their own custom telemetry. In this section, you will learn about the
+different types of observability emitted by the Collector itself.
+
+### Values observable with internal metrics
+
+The Collector emits internal metrics for the following **current values**:
+
+- Resource consumption, including CPU, memory, and I/O.
+- Data reception rate, broken down by receiver.
+- Data export rate, broken down by exporters.
+- Data drop rate due to throttling, broken down by data type.
+- Data drop rate due to invalid data received, broken down by data type.
+- Throttling state, including Not Throttled, Throttled by Downstream, and
+  Internally Saturated.
+- Incoming connection count, broken down by receiver.
+- Incoming connection rate showing new connections per second, broken down by
+  receiver.
+- In-memory queue size in bytes and in units.
+- Persistent queue size.
+- End-to-end latency from receiver input to exporter output.
+- Latency broken down by pipeline elements, including exporter network roundtrip
+  latency for request/response protocols.
+
+Rate values are averages over 10 second periods, measured in bytes/sec or
+units/sec (for example, spans/sec).
+
+{{% alert title="Caution" color="warning" %}}
+
+Byte measurements can be expensive to compute.
+
+{{% /alert %}}
+
+The Collector also emits internal metrics for these **cumulative values**:
+
+- Total received data, broken down by receivers.
+- Total exported data, broken down by exporters.
+- Total dropped data due to throttling, broken down by data type.
+- Total dropped data due to invalid data received, broken down by data type.
+- Total incoming connection count, broken down by receiver.
+- Uptime since start.
+
+### Lists of internal metrics
+
+The following tables group each internal metric by level of verbosity: `basic`,
+`normal`, and `detailed`. Each metric is identified by name and description and
+categorized by instrumentation type.
+
+<!---To compile this list, configure a Collector instance to emit its own metrics to the localhost:8888/metrics endpoint. Select a metric and grep for it in the Collector core repository. For example, the `otelcol_process_memory_rss` can be found using: `grep -Hrn "memory_rss" .` Make sure to eliminate from your search string any words that might be prefixes. Look through the results until you find the .go file that contains the list of metrics. In the case of `otelcol_process_memory_rss`, it and other process metrics can be found in https://github.com/open-telemetry/opentelemetry-collector/blob/31528ce81d44e9265e1a3bbbd27dc86d09ba1354/service/internal/proctelemetry/process_telemetry.go#L92. Note that the Collector's internal metrics are defined in several different files in the repository.--->
+
+#### `basic`-level metrics
+
+| Metric name | Description | Type |
+| --- | --- | --- |
+| `otelcol_exporter_enqueue_failed_`<br>`log_records` | Number of log records that exporter(s) failed to enqueue. | Counter |
+| `otelcol_exporter_enqueue_failed_`<br>`metric_points` | Number of metric points that exporter(s) failed to enqueue. | Counter |
+| `otelcol_exporter_enqueue_failed_`<br>`spans` | Number of spans that exporter(s) failed to enqueue. | Counter |
+| `otelcol_exporter_queue_capacity` | Fixed capacity of the retry queue, in batches. | Gauge |
+| `otelcol_exporter_queue_size` | Current size of the retry queue, in batches. | Gauge |
+| `otelcol_exporter_send_failed_`<br>`log_records` | Number of logs that exporter(s) failed to send to destination. | Counter |
+| `otelcol_exporter_send_failed_`<br>`metric_points` | Number of metric points that exporter(s) failed to send to destination. | Counter |
+| `otelcol_exporter_send_failed_`<br>`spans` | Number of spans that exporter(s) failed to send to destination. | Counter |
+| `otelcol_exporter_sent_log_records` | Number of logs successfully sent to destination. | Counter |
+| `otelcol_exporter_sent_metric_points` | Number of metric points successfully sent to destination. | Counter |
+| `otelcol_exporter_sent_spans` | Number of spans successfully sent to destination. | Counter |
+| `otelcol_process_cpu_seconds` | Total CPU user and system time in seconds. | Counter |
+| `otelcol_process_memory_rss` | Total physical memory (resident set size). | Gauge |
+| `otelcol_process_runtime_heap_`<br>`alloc_bytes` | Bytes of allocated heap objects (see 'go doc runtime.MemStats.HeapAlloc'). | Gauge |
+| `otelcol_process_runtime_total_`<br>`alloc_bytes` | Cumulative bytes allocated for heap objects (see 'go doc runtime.MemStats.TotalAlloc'). | Counter |
+| `otelcol_process_runtime_total_`<br>`sys_memory_bytes` | Total bytes of memory obtained from the OS (see 'go doc runtime.MemStats.Sys'). | Gauge |
+| `otelcol_process_uptime` | Uptime of the process. | Counter |
+| `otelcol_processor_accepted_`<br>`log_records` | Number of logs successfully pushed into the next component in the pipeline. | Counter |
+| `otelcol_processor_accepted_`<br>`metric_points` | Number of metric points successfully pushed into the next component in the pipeline. | Counter |
+| `otelcol_processor_accepted_spans` | Number of spans successfully pushed into the next component in the pipeline. | Counter |
+| `otelcol_processor_batch_batch_`<br>`send_size_bytes` | Number of bytes in the batch that was sent. | Histogram |
+| `otelcol_processor_dropped_`<br>`log_records` | Number of logs dropped by the processor. | Counter |
+| `otelcol_processor_dropped_`<br>`metric_points` | Number of metric points dropped by the processor. | Counter |
+| `otelcol_processor_dropped_spans` | Number of spans dropped by the processor. | Counter |
+| `otelcol_receiver_accepted_`<br>`log_records` | Number of logs successfully ingested and pushed into the pipeline. | Counter |
+| `otelcol_receiver_accepted_`<br>`metric_points` | Number of metric points successfully ingested and pushed into the pipeline. | Counter |
+| `otelcol_receiver_accepted_spans` | Number of spans successfully ingested and pushed into the pipeline. | Counter |
+| `otelcol_receiver_refused_`<br>`log_records` | Number of logs that could not be pushed into the pipeline. | Counter |
+| `otelcol_receiver_refused_`<br>`metric_points` | Number of metric points that could not be pushed into the pipeline. | Counter |
+| `otelcol_receiver_refused_spans` | Number of spans that could not be pushed into the pipeline. | Counter |
+| `otelcol_scraper_errored_`<br>`metric_points` | Number of metric points the Collector failed to scrape. | Counter |
+| `otelcol_scraper_scraped_`<br>`metric_points` | Number of metric points scraped by the Collector. | Counter |
+
+#### Additional `normal`-level metrics
+
+| Metric name | Description | Type |
+| --- | --- | --- |
+| `otelcol_processor_batch_batch_`<br>`send_size` | Number of units in the batch. | Histogram |
+| `otelcol_processor_batch_batch_`<br>`size_trigger_send` | Number of times the batch was sent due to a size trigger. | Counter |
+| `otelcol_processor_batch_metadata_`<br>`cardinality` | Number of distinct metadata value combinations being processed. | Counter |
+| `otelcol_processor_batch_timeout_`<br>`trigger_send` | Number of times the batch was sent due to a timeout trigger. | Counter |
+
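The metrics in this table are emitted by the `batch` processor. For reference, a minimal pipeline sketch that would produce them follows; the receiver, exporter, and option values shown are illustrative assumptions, not part of this change.

```yaml
receivers:
  otlp:
    protocols:
      grpc: # accept OTLP/gRPC with default settings

processors:
  batch:
    # Batches are flushed on timeout (counted by otelcol_processor_batch_timeout_trigger_send)
    timeout: 10s
    # or on size (counted by otelcol_processor_batch_batch_size_trigger_send).
    send_batch_size: 8192

exporters:
  otlp:
    # Hypothetical backend; replace with your own destination.
    endpoint: my-backend:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```
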
+#### Additional `detailed`-level metrics
+
+| Metric name | Description | Type |
+| --- | --- | --- |
+| `http_client_active_requests` | Number of active HTTP client requests. | Counter |
+| `http_client_connection_duration` | Measures the duration of the successfully established outbound HTTP connections. | Histogram |
+| `http_client_open_connections` | Number of outbound HTTP connections that are active or idle on the client. | Counter |
+| `http_client_request_body_size` | Measures the size of HTTP client request bodies. | Histogram |
+| `http_client_request_duration` | Measures the duration of HTTP client requests. | Histogram |
+| `http_client_response_body_size` | Measures the size of HTTP client response bodies. | Histogram |
+| `http_server_active_requests` | Number of active HTTP server requests. | Counter |
+| `http_server_request_body_size` | Measures the size of HTTP server request bodies. | Histogram |
+| `http_server_request_duration` | Measures the duration of HTTP server requests. | Histogram |
+| `http_server_response_body_size` | Measures the size of HTTP server response bodies. | Histogram |
+| `rpc_client_duration` | Measures the duration of outbound RPC. | Histogram |
+| `rpc_client_request_size` | Measures the size of RPC request messages (uncompressed). | Histogram |
+| `rpc_client_requests_per_rpc` | Measures the number of messages received per RPC. Should be 1 for all non-streaming RPCs. | Histogram |
+| `rpc_client_response_size` | Measures the size of RPC response messages (uncompressed). | Histogram |
+| `rpc_client_responses_per_rpc` | Measures the number of messages sent per RPC. Should be 1 for all non-streaming RPCs. | Histogram |
+| `rpc_server_duration` | Measures the duration of inbound RPC. | Histogram |
+| `rpc_server_request_size` | Measures the size of RPC request messages (uncompressed). | Histogram |
+| `rpc_server_requests_per_rpc` | Measures the number of messages received per RPC. Should be 1 for all non-streaming RPCs. | Histogram |
+| `rpc_server_response_size` | Measures the size of RPC response messages (uncompressed). | Histogram |
+| `rpc_server_responses_per_rpc` | Measures the number of messages sent per RPC. Should be 1 for all non-streaming RPCs. | Histogram |
+
+### Events observable with internal logs
+
+The Collector logs the following internal events:
+
+- A Collector instance starts or stops.
+- Data dropping begins due to throttling for a specified reason, such as local
+  saturation, downstream saturation, downstream unavailable, etc.
+- Data dropping due to throttling stops.
+- Data dropping begins due to invalid data. A sample of the invalid data is
+  included.
+- Data dropping due to invalid data stops.
+- A crash is detected, differentiated from a clean stop. Crash data is included
+  if available.
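
The verbosity and format of these internal logs can be adjusted under `service::telemetry::logs`. A minimal sketch with illustrative, non-default values:

```yaml
service:
  telemetry:
    logs:
      level: debug           # default is info; debug is more verbose
      encoding: json         # default is console
      output_paths: [stderr] # where to write internal logs
```

Here `level: debug` raises verbosity above the default `info`, and `encoding: json` switches from the default console format; both values are examples rather than recommendations.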
