From 5df52e172c87f918fe998b2e65ae1501c87d20bd Mon Sep 17 00:00:00 2001
From: Dan Jaglowski
Date: Wed, 9 Oct 2024 15:56:50 -0400
Subject: [PATCH 01/13] RFC - Auto-instrumentation of pipeline components

---
 docs/rfcs/component-universal-telemetry.md | 119 +++++++++++++++++++++
 1 file changed, 119 insertions(+)
 create mode 100644 docs/rfcs/component-universal-telemetry.md

diff --git a/docs/rfcs/component-universal-telemetry.md b/docs/rfcs/component-universal-telemetry.md
new file mode 100644
index 00000000000..4e349281312
--- /dev/null
+++ b/docs/rfcs/component-universal-telemetry.md
@@ -0,0 +1,119 @@
+# Auto-Instrumented Component Telemetry

## Motivation

The collector should be observable and this must naturally include observability of its pipeline components. It is understood that each _type_ (`filelog`, `batch`, etc) of component may emit telemetry describing its internal workings, and that these internally derived signals may vary greatly based on the concerns and maturity of each component. Naturally though, the collector should also describe the behavior of components using broadly normalized telemetry. A major challenge in this pursuit is that there must be a clear mechanism by which such telemetry can be automatically captured. Therefore, this RFC is first and foremost a proposal for a _mechanism_. Then, based on what _can_ be captured by this mechanism, the RFC describes specific metrics, spans, and logs which can be broadly normalized.

## Goals

1. Articulate a mechanism which enables us to _automatically_ capture telemetry from _all pipeline components_.
2. Define attributes that are (A) specific enough to describe individual component [_instances_](https://github.com/open-telemetry/opentelemetry-collector/issues/10534) and (B) consistent enough for correlation across signals.
3. Define specific metrics for each kind of pipeline component.
4. Define specific spans for processors and connectors.
5. 
Define specific logs for all kinds of pipeline component.

### Mechanism

The mechanism of telemetry capture should be _external_ to components. Specifically, we should observe telemetry at each point where a component passes data to another component, and, at each point where a component consumes data from another component. In terms of the component graph, this means that every _edge_ in the graph will have two layers of instrumentation - one for the producing component and one for the consuming component. Importantly, each layer generates telemetry which is ascribed to a single component instance, so by having two layers per edge we can describe both sides of each handoff independently. In the case of processors and connectors, the appropriate layers can act in concert (e.g. record the start and end of a span).

### Attributes

All signals should use the following attributes:

#### Receivers

- `otel.component.kind`: `receiver`
- `otel.component.id`: The component ID
- `otel.signal`: `logs`, `metrics`, `traces`, **OR `ALL`**

#### Processors

- `otel.component.kind`: `processor`
- `otel.component.id`: The component ID
- `otel.pipeline.id`: The pipeline ID, **OR `ALL`**
- `otel.signal`: `logs`, `metrics`, `traces`, **OR `ALL`**

#### Exporters

- `otel.component.kind`: `exporter`
- `otel.component.id`: The component ID
- `otel.signal`: `logs`, `metrics`, `traces`, **OR `ALL`**

#### Connectors

- `otel.component.kind`: `connector`
- `otel.component.id`: The component ID
- `otel.signal`: `logs->logs`, `logs->metrics`, `logs->traces`, `metrics->logs`, `metrics->metrics`, etc, **OR `ALL`**

Notes: The use of `ALL` is based on the assumption that components are instanced either in the default way, or, as a single instance per configuration (e.g. otlp receiver).

### Metrics

There are two straightforward measurements that can be made on any pdata:

1. A count of "items" (spans, data points, or log records). 
These are low cost but broadly useful, so they should be enabled by default. +2. A measure of size, based on [ProtoMarshaler.Sizer()](https://github.com/open-telemetry/opentelemetry-collector/blob/9907ba50df0d5853c34d2962cf21da42e15a560d/pdata/ptrace/pb.go#L11). These are high cost to compute, so by default they should be disabled (and not calculated). + +The location of these measurements can be described in terms of whether the data is "incoming" or "outgoing", from the perspective of the component to which the telemetry is ascribed. + +1. Incoming measurements are attributed to the component which is _consuming_ the data. +2. Outgoing measurements are attributed to the component which is _producing_ the data. + +For both metrics, an `outcome` attribute with possible values `success` and `failure` should be automatically recorded, corresponding to whether or not the function call returned an error. Outgoing measurements will be recorded with `outcome` as `failure` when the next consumer returns an error, and `success` otherwise. Likewise, incoming measurements will be recorded with `outcome` as `failure` when the component itself returns an error, and `success` otherwise. + +```yaml + otelcol_component_incoming_items: + enabled: true + description: Number of items passed to the component. + unit: "{items}" + sum: + value_type: int + monotonic: true + otelcol_component_outgoing_items: + enabled: true + description: Number of items emitted from the component. + unit: "{items}" + sum: + value_type: int + monotonic: true + + otelcol_component_incoming_size: + enabled: false + description: Size of items passed to the component. + unit: "By" + sum: + value_type: int + monotonic: true + otelcol_component_outgoing_size: + enabled: false + description: Size of items emitted from the component. + unit: "By" + sum: + value_type: int + monotonic: true +``` + +### Spans + +A span should be recorded for each execution of a processor or connector. 
The instrumentation layers adjacent to these components can start and end the span as appropriate. + +### Logs + +Metrics and spans provide most of the observability we need but there are some gaps which logs can fill. For example, we can record spans for processors and connectors but logs are useful for capturing precise timing as it relates to data produced by receivers and consumed by exporters. Additionally, although metrics would describe the overall item counts, it is helpful in some cases to record more granular events. e.g. If an outgoing batch of 10,000 spans results in an error, but 100 batches of 100 spans succeed, this may be a matter of batch size that can be detected by analyzing logs, while the corresponding metric reports only that a 50% success rate is observed. + +For security and performance reasons, it would not be appropriate to log the contents of telemetry. + +It's very easy for logs to become too noisy. Even if errors are occurring frequently in the data pipeline, they may only be of interest to many users if they are not handled automatically. + +With the above considerations, this proposal includes only that we add a DEBUG log for each individual outcome. This should be sufficient for detailed troubleshooting but does not impact users otherwise. + +In the future, it may be helpful to define triggers for reporting repeated failures at a higher severity level. e.g. N number of failures in a row, or a moving average success %. For now, the criteria and necessary configurability is unclear so this is mentioned only as an example of future possibilities. 
+ +### Additional context + +This proposal pulls from a number of issues and PRs: + +- [Demonstrate graph-based metrics](https://github.com/open-telemetry/opentelemetry-collector/pull/11311) +- [Attributes for component instancing](https://github.com/open-telemetry/opentelemetry-collector/issues/11179) +- [Simple processor metrics](https://github.com/open-telemetry/opentelemetry-collector/issues/10708) +- [Component instancing is complicated](https://github.com/open-telemetry/opentelemetry-collector/issues/10534) From d0f1637c8eeee5a300e029ff638be66a95f6bb21 Mon Sep 17 00:00:00 2001 From: Daniel Jaglowski Date: Thu, 10 Oct 2024 11:45:32 -0500 Subject: [PATCH 02/13] Update docs/rfcs/component-universal-telemetry.md Co-authored-by: Alex Boten <223565+codeboten@users.noreply.github.com> --- docs/rfcs/component-universal-telemetry.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/rfcs/component-universal-telemetry.md b/docs/rfcs/component-universal-telemetry.md index 4e349281312..aeacdb646d7 100644 --- a/docs/rfcs/component-universal-telemetry.md +++ b/docs/rfcs/component-universal-telemetry.md @@ -14,7 +14,7 @@ The collector should be observable and this must naturally include observability ### Mechanism -The mechanism of telemetry capture should be _external_ to components. Specifically, we should observe telemetry at each point where a component passes data to another component, and, at each point where a component consumes data from another component. In terms of the component graph, this means that every _edge_ in the graph will have two layers of instrumentation - one for the producing component and one for the consuming component. Importantly, each layer generates telemetry which is ascribed to a single component instance, so by having two layers per edge we can describe both sides of each handoff independently. In the case of processors and connectors, the appropriate layers can act in concert (e.g. record the start and end of a span). 
+The mechanism of telemetry capture should be _external_ to components. Specifically, we should observe telemetry at each point where a component passes data to another component, and, at each point where a component consumes data from another component. In terms of the component graph, every _edge_ in the graph will have two layers of instrumentation - one for the producing component and one for the consuming component. Importantly, each layer generates telemetry ascribed to a single component instance, so by having two layers per edge we can describe both sides of each handoff independently. In the case of processors and connectors, the appropriate layers can act in concert (e.g. record the start and end of a span). ### Attributes From bea0e2fc03c0192cb928495b74763c730b124770 Mon Sep 17 00:00:00 2001 From: Dan Jaglowski Date: Thu, 10 Oct 2024 13:16:50 -0400 Subject: [PATCH 03/13] Feedback --- docs/rfcs/component-universal-telemetry.md | 29 +++++++++++----------- 1 file changed, 14 insertions(+), 15 deletions(-) diff --git a/docs/rfcs/component-universal-telemetry.md b/docs/rfcs/component-universal-telemetry.md index aeacdb646d7..5bc2329106c 100644 --- a/docs/rfcs/component-universal-telemetry.md +++ b/docs/rfcs/component-universal-telemetry.md @@ -2,19 +2,18 @@ ## Motivation -The collector should be observable and this must naturally include observability of its pipeline components. It is understood that each _type_ (`filelog`, `batch`, etc) of component may emit telemetry describing its internal workings, and that these internally derived signals may vary greatly based on the concerns and maturity of each component. Naturally though, the collector should also describe the behavior of components using broadly normalized telemetry. A major challenge in pursuit is that there must be a clear mechanism by which such telemetry can be automatically captured. Therefore, this RFC is first and foremost a proposal for a _mechanism_. 
Then, based on what _can_ be captured by this mechanism, the RFC describes specific metrics, spans, and logs which can be broadly normalized. +The collector should be observable and this must naturally include observability of its pipeline components. It is understood that each _type_ (`filelog`, `batch`, etc) of component may emit telemetry describing its internal workings, and that these internally derived signals may vary greatly based on the concerns and maturity of each component. Naturally though, the collector should also describe the behavior of components using broadly normalized telemetry. A major challenge in pursuit is that there must be a clear mechanism by which such telemetry can be automatically captured. Therefore, this RFC is first and foremost a proposal for a _mechanism_. Then, based on what _can_ be captured by this mechanism, the RFC describes specific metrics and logs which can be broadly normalized. ## Goals 1. Articulate a mechanism which enables us to _automatically_ capture telemetry from _all pipeline components_. 2. Define attributes that are (A) specific enough to describe individual component [_instances_](https://github.com/open-telemetry/opentelemetry-collector/issues/10534) and (B) consistent enough for correlation across signals. 3. Define specific metrics for each kind of pipeline component. -4. Define specific spans for processors and connectors. -5. Define specific logs for all kinds of pipeline component. +4. Define specific logs for all kinds of pipeline component. ### Mechanism -The mechanism of telemetry capture should be _external_ to components. Specifically, we should observe telemetry at each point where a component passes data to another component, and, at each point where a component consumes data from another component. In terms of the component graph, every _edge_ in the graph will have two layers of instrumentation - one for the producing component and one for the consuming component. 
Importantly, each layer generates telemetry ascribed to a single component instance, so by having two layers per edge we can describe both sides of each handoff independently. In the case of processors and connectors, the appropriate layers can act in concert (e.g. record the start and end of a span). +The mechanism of telemetry capture should be _external_ to components. Specifically, we should observe telemetry at each point where a component passes data to another component, and, at each point where a component consumes data from another component. In terms of the component graph, every _edge_ in the graph will have two layers of instrumentation - one for the producing component and one for the consuming component. Importantly, each layer generates telemetry ascribed to a single component instance, so by having two layers per edge we can describe both sides of each handoff independently. ### Attributes @@ -24,28 +23,28 @@ All signals should use the following attributes: - `otel.component.kind`: `receiver` - `otel.component.id`: The component ID -- `otel.signal`: `logs`, `metrics`, `traces`, **OR `ALL`** +- `otel.signal`: `logs`, `metrics`, `traces`, **OR `ANY`** #### Processors - `otel.component.kind`: `processor` - `otel.component.id`: The component ID -- `otel.pipeline.id`: The pipeline ID, **OR `ALL`** -- `otel.signal`: `logs`, `metrics`, `traces`, **OR `ALL`** +- `otel.pipeline.id`: The pipeline ID, **OR `ANY`** +- `otel.signal`: `logs`, `metrics`, `traces`, **OR `ANY`** #### Exporters - `otel.component.kind`: `exporter` - `otel.component.id`: The component ID -- `otel.signal`: `logs`, `metrics` `traces`, **OR `ALL`** +- `otel.signal`: `logs`, `metrics` `traces`, **OR `ANY`** #### Connectors - `otel.component.kind`: `connector` - `otel.component.id`: The component ID -- `otel.signal`: `logs->logs`, `logs->metrics`, `logs->traces`, `metrics->logs`, `metrics->metrics`, etc, **OR `ALL`** +- `otel.signal`: `logs->logs`, `logs->metrics`, `logs->traces`, 
`metrics->logs`, `metrics->metrics`, etc, **OR `ANY`**

-Notes: The use of `ALL` is based on the assumption that components are instanced either in the default way, or, as a single instance per configuration (e.g. otlp receiver).
+Notes: The use of `ANY` indicates that values are not associated with a particular signal or pipeline. This is used when a component enforces non-standard instancing patterns. For example, the `otlp` receiver is a singleton, so the values are aggregated across signals. Similarly, the `memory_limiter` processor is a singleton, so the values are aggregated across pipelines.

### Metrics

@@ -93,13 +92,9 @@ For both metrics, an `outcome` attribute with possible values `success` and `fai
 monotonic: true
 ```

-### Spans
-
-A span should be recorded for each execution of a processor or connector. The instrumentation layers adjacent to these components can start and end the span as appropriate.
-
### Logs

-Metrics and spans provide most of the observability we need but there are some gaps which logs can fill. For example, we can record spans for processors and connectors but logs are useful for capturing precise timing as it relates to data produced by receivers and consumed by exporters. Additionally, although metrics would describe the overall item counts, it is helpful in some cases to record more granular events. e.g. If an outgoing batch of 10,000 spans results in an error, but 100 batches of 100 spans succeed, this may be a matter of batch size that can be detected by analyzing logs, while the corresponding metric reports only that a 50% success rate is observed.
+Metrics provide most of the observability we need but there are some gaps which logs can fill. Although metrics would describe the overall item counts, it is helpful in some cases to record more granular events. e.g. 
If an outgoing batch of 10,000 spans results in an error, but 100 batches of 100 spans succeed, this may be a matter of batch size that can be detected by analyzing logs, while the corresponding metric reports only that a 50% success rate is observed. For security and performance reasons, it would not be appropriate to log the contents of telemetry. @@ -109,6 +104,10 @@ With the above considerations, this proposal includes only that we add a DEBUG l In the future, it may be helpful to define triggers for reporting repeated failures at a higher severity level. e.g. N number of failures in a row, or a moving average success %. For now, the criteria and necessary configurability is unclear so this is mentioned only as an example of future possibilities. +### Spans + +It is not clear that any spans can be captured automatically with the proposed mechanism. We have the ability to insert instrumentation both before and after processors and connectors. However, we generally cannot assume a 1:1 relationship between incoming and outgoing data. 
+ ### Additional context This proposal pulls from a number of issues and PRs: From 35c82a47dac830f190f23d923a0b35b9fd56aaf7 Mon Sep 17 00:00:00 2001 From: Dan Jaglowski Date: Fri, 11 Oct 2024 15:09:05 -0400 Subject: [PATCH 04/13] Feedback --- docs/rfcs/component-universal-telemetry.md | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/docs/rfcs/component-universal-telemetry.md b/docs/rfcs/component-universal-telemetry.md index 5bc2329106c..94f34b79a14 100644 --- a/docs/rfcs/component-universal-telemetry.md +++ b/docs/rfcs/component-universal-telemetry.md @@ -23,28 +23,29 @@ All signals should use the following attributes: - `otel.component.kind`: `receiver` - `otel.component.id`: The component ID -- `otel.signal`: `logs`, `metrics`, `traces`, **OR `ANY`** +- `otel.signal`: `logs`, `metrics`, `traces` #### Processors - `otel.component.kind`: `processor` - `otel.component.id`: The component ID -- `otel.pipeline.id`: The pipeline ID, **OR `ANY`** -- `otel.signal`: `logs`, `metrics`, `traces`, **OR `ANY`** +- `otel.pipeline.id`: The pipeline ID +- `otel.signal`: `logs`, `metrics`, `traces` #### Exporters - `otel.component.kind`: `exporter` - `otel.component.id`: The component ID -- `otel.signal`: `logs`, `metrics` `traces`, **OR `ANY`** +- `otel.signal`: `logs`, `metrics` `traces` #### Connectors - `otel.component.kind`: `connector` - `otel.component.id`: The component ID -- `otel.signal`: `logs->logs`, `logs->metrics`, `logs->traces`, `metrics->logs`, `metrics->metrics`, etc, **OR `ANY`** +- `otel.signal`: `logs`, `metrics` `traces` +- `otel.output.signal`: `logs`, `metrics` `traces` -Notes: The use of `ANY` indicates that values are not associated with a particular signal or pipeline. This is used when a component enforces non-standard instancing patterns. For example, the `otlp` receiver isa singleton, so the values are aggregated across signals. 
Similarly, the `memory_limiter` processor is a singleton, so the values are aggregated across pipelines.
+Note: The `otel.signal`, `otel.output.signal`, or `otel.pipeline.id` attributes may be omitted if the corresponding component instances are unified by the component implementation. For example, the `otlp` receiver is a singleton, so its telemetry is not specific to a signal. Similarly, the `memory_limiter` processor is a singleton, so its telemetry is not specific to a pipeline.

### Metrics

@@ -58,7 +59,7 @@ The location of these measurements can be described in terms of whether the data
 1. Incoming measurements are attributed to the component which is _consuming_ the data.
 2. Outgoing measurements are attributed to the component which is _producing_ the data.

-For both metrics, an `outcome` attribute with possible values `success` and `failure` should be automatically recorded, corresponding to whether or not the function call returned an error. Outgoing measurements will be recorded with `outcome` as `failure` when the next consumer returns an error, and `success` otherwise. Likewise, incoming measurements will be recorded with `outcome` as `failure` when the component itself returns an error, and `success` otherwise.
+For both metrics, an `outcome` attribute with possible values `success` and `failure` should be automatically recorded, corresponding to whether or not the corresponding function call returned an error. Specifically, incoming measurements will be recorded with `outcome` as `failure` when a call from the previous component to the `ConsumeX` function returns an error, and `success` otherwise. Likewise, outgoing measurements will be recorded with `outcome` as `failure` when a call to the next consumer's `ConsumeX` function returns an error, and `success` otherwise. 
```yaml otelcol_component_incoming_items: From 26008368deebdb0621b387908a53f107165a2bba Mon Sep 17 00:00:00 2001 From: Dan Jaglowski Date: Wed, 16 Oct 2024 15:24:17 -0400 Subject: [PATCH 05/13] Broaden scope and convert to evolving consensus --- docs/rfcs/component-universal-telemetry.md | 93 +++++++++++++++------- 1 file changed, 66 insertions(+), 27 deletions(-) diff --git a/docs/rfcs/component-universal-telemetry.md b/docs/rfcs/component-universal-telemetry.md index 94f34b79a14..8fed5e062bc 100644 --- a/docs/rfcs/component-universal-telemetry.md +++ b/docs/rfcs/component-universal-telemetry.md @@ -1,65 +1,96 @@ -# Auto-Instrumented Component Telemetry +# Pipeline Component Telemetry -## Motivation +## Motivation and Scope -The collector should be observable and this must naturally include observability of its pipeline components. It is understood that each _type_ (`filelog`, `batch`, etc) of component may emit telemetry describing its internal workings, and that these internally derived signals may vary greatly based on the concerns and maturity of each component. Naturally though, the collector should also describe the behavior of components using broadly normalized telemetry. A major challenge in pursuit is that there must be a clear mechanism by which such telemetry can be automatically captured. Therefore, this RFC is first and foremost a proposal for a _mechanism_. Then, based on what _can_ be captured by this mechanism, the RFC describes specific metrics and logs which can be broadly normalized. +The collector should be observable and this must naturally include observability of its pipeline components. Pipeline components +are those components of the collector which directly interact with data, specifically receivers, processors, exporters, and connectors. 
+
+It is understood that each _type_ (`filelog`, `batch`, etc) of component may emit telemetry describing its internal workings,
+and that these internally derived signals may vary greatly based on the concerns and maturity of each component. Naturally
+though, there is much we can do to normalize the telemetry emitted from and about pipeline components.
+
+Two major challenges in pursuit of broadly normalized telemetry are (1) consistent attributes, and (2) automatic capture.
+
+This RFC represents an evolving consensus about the desired end state of component telemetry. It does _not_ claim
+to describe the final state of all component telemetry, but rather seeks to document some specific aspects. It proposes a set of
+attributes which are both necessary and sufficient to identify components and their instances. It also articulates one specific
+mechanism by which some telemetry can be automatically captured. Finally, it describes some specific metrics and logs which should
+be automatically captured for each kind of pipeline component.

## Goals

1. Define attributes that are (A) specific enough to describe individual component [_instances_](https://github.com/open-telemetry/opentelemetry-collector/issues/10534)
   and (B) consistent enough for correlation across signals.
2. Articulate a mechanism which enables us to _automatically_ capture telemetry from _all pipeline components_.
3. Define specific metrics for each kind of pipeline component.
4. Define specific logs for all kinds of pipeline component.

-### Mechanism
-
-The mechanism of telemetry capture should be _external_ to components. 
Specifically, we should observe telemetry at each point where a component passes data to another component, and, at each point where a component consumes data from another component. In terms of the component graph, every _edge_ in the graph will have two layers of instrumentation - one for the producing component and one for the consuming component. Importantly, each layer generates telemetry ascribed to a single component instance, so by having two layers per edge we can describe both sides of each handoff independently. - -### Attributes +## Attributes All signals should use the following attributes: -#### Receivers +### Receivers - `otel.component.kind`: `receiver` - `otel.component.id`: The component ID - `otel.signal`: `logs`, `metrics`, `traces` -#### Processors +### Processors - `otel.component.kind`: `processor` - `otel.component.id`: The component ID - `otel.pipeline.id`: The pipeline ID - `otel.signal`: `logs`, `metrics`, `traces` -#### Exporters +### Exporters - `otel.component.kind`: `exporter` - `otel.component.id`: The component ID - `otel.signal`: `logs`, `metrics` `traces` -#### Connectors +### Connectors - `otel.component.kind`: `connector` - `otel.component.id`: The component ID - `otel.signal`: `logs`, `metrics` `traces` - `otel.output.signal`: `logs`, `metrics` `traces` -Note: The `otel.signal`, `otel.output.signal`, or `otel.pipeline.id` attributes may be omitted if the corresponding component instances are unified by the component implementation. For example, the `otlp` receiver is a singleton, so its telemetry is not specific to a signal. Similarly, the `memory_limiter` processor is a singleton, so its telemetry is not specific to a pipeline. +Note: The `otel.signal`, `otel.output.signal`, or `otel.pipeline.id` attributes may be omitted if the corresponding component instances +are unified by the component implementation. For example, the `otlp` receiver is a singleton, so its telemetry is not specific to a signal. 
+Similarly, the `memory_limiter` processor is a singleton, so its telemetry is not specific to a pipeline. + +## Auto-Instrumentation Mechanism + +The mechanism of telemetry capture should be _external_ to components. Specifically, we should observe telemetry at each point where a +component passes data to another component, and, at each point where a component consumes data from another component. In terms of the +component graph, every _edge_ in the graph will have two layers of instrumentation - one for the producing component and one for the +consuming component. Importantly, each layer generates telemetry ascribed to a single component instance, so by having two layers per +edge we can describe both sides of each handoff independently. + +Telemetry captured by this mechanism should be associated with an instrumentation scope corresponding to the package which implements +the mechanism. Currently, that package is `service/internal/graph`, but this may change in the future. Notably, this telemetry is not +ascribed to individual component packages, both because the instrumentation scope is intended to describe the origin of the telemetry, +and because no mechanism is presently identified which would allow us to determine the characteristics of a component-specific scope. -### Metrics +### Auto-Instrumented Metrics There are two straightforward measurements that can be made on any pdata: 1. A count of "items" (spans, data points, or log records). These are low cost but broadly useful, so they should be enabled by default. -2. A measure of size, based on [ProtoMarshaler.Sizer()](https://github.com/open-telemetry/opentelemetry-collector/blob/9907ba50df0d5853c34d2962cf21da42e15a560d/pdata/ptrace/pb.go#L11). These are high cost to compute, so by default they should be disabled (and not calculated). +2. 
A measure of size, based on [ProtoMarshaler.Sizer()](https://github.com/open-telemetry/opentelemetry-collector/blob/9907ba50df0d5853c34d2962cf21da42e15a560d/pdata/ptrace/pb.go#L11).
+   These are high cost to compute, so by default they should be disabled (and not calculated).

The location of these measurements can be described in terms of whether the data is "incoming" or "outgoing", from the perspective of the
+component to which the telemetry is ascribed.

1. Incoming measurements are attributed to the component which is _consuming_ the data.
2. Outgoing measurements are attributed to the component which is _producing_ the data.

For both metrics, an `outcome` attribute with possible values `success` and `failure` should be automatically recorded, corresponding to
+whether or not the corresponding function call returned an error. Specifically, incoming measurements will be recorded with `outcome` as
+`failure` when a call from the previous component to the `ConsumeX` function returns an error, and `success` otherwise. Likewise, outgoing
+measurements will be recorded with `outcome` as `failure` when a call to the next consumer's `ConsumeX` function returns an error, and
+`success` otherwise. 
```yaml otelcol_component_incoming_items: @@ -93,23 +124,31 @@ For both metrics, an `outcome` attribute with possible values `success` and `fai monotonic: true ``` -### Logs +### Auto-Instrumented Logs -Metrics provide most of the observability we need but there are some gaps which logs can fill. Although metrics would describe the overall item counts, it is helpful in some cases to record more granular events. e.g. If an outgoing batch of 10,000 spans results in an error, but 100 batches of 100 spans succeed, this may be a matter of batch size that can be detected by analyzing logs, while the corresponding metric reports only that a 50% success rate is observed. +Metrics provide most of the observability we need but there are some gaps which logs can fill. Although metrics would describe the overall +item counts, it is helpful in some cases to record more granular events. e.g. If an outgoing batch of 10,000 spans results in an error, but +100 batches of 100 spans succeed, this may be a matter of batch size that can be detected by analyzing logs, while the corresponding metric +reports only that a 50% success rate is observed. For security and performance reasons, it would not be appropriate to log the contents of telemetry. -It's very easy for logs to become too noisy. Even if errors are occurring frequently in the data pipeline, they may only be of interest to many users if they are not handled automatically. +It's very easy for logs to become too noisy. Even if errors are occurring frequently in the data pipeline, they may only be of interest to +many users if they are not handled automatically. -With the above considerations, this proposal includes only that we add a DEBUG log for each individual outcome. This should be sufficient for detailed troubleshooting but does not impact users otherwise. +With the above considerations, this proposal includes only that we add a DEBUG log for each individual outcome. 
This should be sufficient for +detailed troubleshooting but does not impact users otherwise. -In the future, it may be helpful to define triggers for reporting repeated failures at a higher severity level. e.g. N number of failures in a row, or a moving average success %. For now, the criteria and necessary configurability is unclear so this is mentioned only as an example of future possibilities. +In the future, it may be helpful to define triggers for reporting repeated failures at a higher severity level. e.g. N number of failures in +a row, or a moving average success %. For now, the criteria and necessary configurability is unclear so this is mentioned only as an example +of future possibilities. -### Spans +### Auto-Instrumented Spans -It is not clear that any spans can be captured automatically with the proposed mechanism. We have the ability to insert instrumentation both before and after processors and connectors. However, we generally cannot assume a 1:1 relationship between incoming and outgoing data. +It is not clear that any spans can be captured automatically with the proposed mechanism. We have the ability to insert instrumentation both +before and after processors and connectors. However, we generally cannot assume a 1:1 relationship between incoming and outgoing data. 
-### Additional context +## Additional Context This proposal pulls from a number of issues and PRs: From b1fd90cd811ec588953b0aabdc3a2fc8c44b2774 Mon Sep 17 00:00:00 2001 From: Dan Jaglowski Date: Mon, 21 Oct 2024 10:41:28 -0400 Subject: [PATCH 06/13] Update names to consumed and produced --- docs/rfcs/component-universal-telemetry.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/docs/rfcs/component-universal-telemetry.md b/docs/rfcs/component-universal-telemetry.md index 8fed5e062bc..702b58dde6c 100644 --- a/docs/rfcs/component-universal-telemetry.md +++ b/docs/rfcs/component-universal-telemetry.md @@ -80,27 +80,27 @@ There are two straightforward measurements that can be made on any pdata: 2. A measure of size, based on [ProtoMarshaler.Sizer()](https://github.com/open-telemetry/opentelemetry-collector/blob/9907ba50df0d5853c34d2962cf21da42e15a560d/pdata/ptrace/pb.go#L11). These are high cost to compute, so by default they should be disabled (and not calculated). -The location of these measurements can be described in terms of whether the data is "incoming" or "outgoing", from the perspective of the +The location of these measurements can be described in terms of whether the data is "consumed" or "produced", from the perspective of the component to which the telemetry is ascribed. 1. Incoming measurements are attributed to the component which is _consuming_ the data. 2. Outgoing measurements are attributed to the component which is _producing_ the data. For both metrics, an `outcome` attribute with possible values `success` and `failure` should be automatically recorded, corresponding to -whether or not the corresponding function call returned an error. Specifically, incoming measurements will be recorded with `outcome` as -`failure` when a call from the previous component the `ConsumeX` function returns an error, and `success` otherwise. Likewise, outgoing +whether or not the corresponding function call returned an error. 
Specifically, consumed measurements will be recorded with `outcome` as
+`failure` when a call from the previous component to the `ConsumeX` function returns an error, and `success` otherwise. Likewise, produced
measurements will be recorded with `outcome` as `failure` when a call to the next consumer's `ConsumeX` function returns an error, and
`success` otherwise.

```yaml
- otelcol_component_incoming_items:
+ otelcol_component_consumed_items:
enabled: true
description: Number of items passed to the component.
unit: "{items}"
sum:
value_type: int
monotonic: true
- otelcol_component_outgoing_items:
+ otelcol_component_produced_items:
enabled: true
description: Number of items emitted from the component.
unit: "{items}"
@@ -108,14 +108,14 @@ measurements will be recorded with `outcome` as `failure` when a call to the nex
value_type: int
monotonic: true
- otelcol_component_incoming_size:
+ otelcol_component_consumed_size:
enabled: false
description: Size of items passed to the component.
unit: "By"
sum:
value_type: int
monotonic: true
- otelcol_component_outgoing_size:
+ otelcol_component_produced_size:
enabled: false
description: Size of items emitted from the component.
unit: "By"
@@ -127,7 +127,7 @@ measurements will be recorded with `outcome` as `failure` when a call to the nex
### Auto-Instrumented Logs

Metrics provide most of the observability we need but there are some gaps which logs can fill. Although metrics would describe the overall
-item counts, it is helpful in some cases to record more granular events. e.g. If an outgoing batch of 10,000 spans results in an error, but
+item counts, it is helpful in some cases to record more granular events. e.g. If a produced batch of 10,000 spans results in an error, but
100 batches of 100 spans succeed, this may be a matter of batch size that can be detected by analyzing logs, while the corresponding metric
reports only that a 50% success rate is observed.

@@ -146,7 +146,7 @@ of future possibilities.
### Auto-Instrumented Spans It is not clear that any spans can be captured automatically with the proposed mechanism. We have the ability to insert instrumentation both -before and after processors and connectors. However, we generally cannot assume a 1:1 relationship between incoming and outgoing data. +before and after processors and connectors. However, we generally cannot assume a 1:1 relationship between consumed and produced data. ## Additional Context From 90d19ab9b16b7e03f180ceadcafa2417c1ea7d79 Mon Sep 17 00:00:00 2001 From: Dan Jaglowski Date: Wed, 23 Oct 2024 14:08:58 -0400 Subject: [PATCH 07/13] Change proposed metric names to use '.' instead of '_' --- docs/rfcs/component-universal-telemetry.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/rfcs/component-universal-telemetry.md b/docs/rfcs/component-universal-telemetry.md index 702b58dde6c..94372c015bb 100644 --- a/docs/rfcs/component-universal-telemetry.md +++ b/docs/rfcs/component-universal-telemetry.md @@ -93,14 +93,14 @@ measurements will be recorded with `outcome` as `failure` when a call to the nex `success` otherwise. ```yaml - otelcol_component_consumed_items: + otelcol.component.consumed.items: enabled: true description: Number of items passed to the component. unit: "{items}" sum: value_type: int monotonic: true - otelcol_component_produced_items: + otelcol.component.produced.items: enabled: true description: Number of items emitted from the component. unit: "{items}" @@ -108,14 +108,14 @@ measurements will be recorded with `outcome` as `failure` when a call to the nex value_type: int monotonic: true - otelcol_component_consumed_size: + otelcol.component.consumed.size: enabled: false description: Size of items passed to the component. unit: "By" sum: value_type: int monotonic: true - otelcol_component_produced_size: + otelcol.component.produced.size: enabled: false description: Size of items emitted from the component. 
unit: "By" From 497587c5a37dd100b6ac2fd9a0bd8a2247436965 Mon Sep 17 00:00:00 2001 From: Dan Jaglowski Date: Wed, 23 Oct 2024 14:41:46 -0400 Subject: [PATCH 08/13] Separate metrics by component kind --- docs/rfcs/component-universal-telemetry.md | 78 ++++++++++++++++++---- 1 file changed, 66 insertions(+), 12 deletions(-) diff --git a/docs/rfcs/component-universal-telemetry.md b/docs/rfcs/component-universal-telemetry.md index 94372c015bb..9ef5347ffd5 100644 --- a/docs/rfcs/component-universal-telemetry.md +++ b/docs/rfcs/component-universal-telemetry.md @@ -81,10 +81,8 @@ There are two straightforward measurements that can be made on any pdata: These are high cost to compute, so by default they should be disabled (and not calculated). The location of these measurements can be described in terms of whether the data is "consumed" or "produced", from the perspective of the -component to which the telemetry is ascribed. - -1. Incoming measurements are attributed to the component which is _consuming_ the data. -2. Outgoing measurements are attributed to the component which is _producing_ the data. +component to which the telemetry is attributed. Metrics which contain the term "procuded" describe data which is emitted from the component, +while metrics which contain the term "consumed" describe data which is received by the component. For both metrics, an `outcome` attribute with possible values `success` and `failure` should be automatically recorded, corresponding to whether or not the corresponding function call returned an error. Specifically, consumed measurements will be recorded with `outcome` as @@ -93,31 +91,87 @@ measurements will be recorded with `outcome` as `failure` when a call to the nex `success` otherwise. ```yaml - otelcol.component.consumed.items: + otelcol.receiver.produced.items: + enabled: true + description: Number of items emitted from the receiver. 
+ unit: "{items}" + sum: + value_type: int + monotonic: true + otelcol.processor.consumed.items: + enabled: true + description: Number of items passed to the processor. + unit: "{items}" + sum: + value_type: int + monotonic: true + otelcol.processor.produced.items: + enabled: true + description: Number of items emitted from the processor. + unit: "{items}" + sum: + value_type: int + monotonic: true + otelcol.connector.consumed.items: + enabled: true + description: Number of items passed to the connector. + unit: "{items}" + sum: + value_type: int + monotonic: true + otelcol.connector.produced.items: enabled: true - description: Number of items passed to the component. + description: Number of items emitted from the connector. unit: "{items}" sum: value_type: int monotonic: true - otelcol.component.produced.items: + otelcol.exporter.consumed.items: enabled: true - description: Number of items emitted from the component. + description: Number of items passed to the exporter. unit: "{items}" sum: value_type: int monotonic: true - otelcol.component.consumed.size: + otelcol.receiver.produced.size: + enabled: false + description: Size of items emitted from the receiver. + unit: "By" + sum: + value_type: int + monotonic: true + otelcol.processor.consumed.size: + enabled: false + description: Size of items passed to the processor. + unit: "By" + sum: + value_type: int + monotonic: true + otelcol.processor.produced.size: + enabled: false + description: Size of items emitted from the processor. + unit: "By" + sum: + value_type: int + monotonic: true + otelcol.connector.consumed.size: + enabled: false + description: Size of items passed to the connector. + unit: "By" + sum: + value_type: int + monotonic: true + otelcol.connector.produced.size: enabled: false - description: Size of items passed to the component. + description: Size of items emitted from the connector. 
unit: "By" sum: value_type: int monotonic: true - otelcol.component.produced.size: + otelcol.exporter.consumed.size: enabled: false - description: Size of items emitted from the component. + description: Size of items passed to the exporter. unit: "By" sum: value_type: int From 64637594ccc389d7283f76940452fd659e32eb39 Mon Sep 17 00:00:00 2001 From: Daniel Jaglowski Date: Thu, 24 Oct 2024 09:52:28 -0500 Subject: [PATCH 09/13] Add profiles as attribute value Co-authored-by: Damien Mathieu <42@dmathieu.com> --- docs/rfcs/component-universal-telemetry.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/rfcs/component-universal-telemetry.md b/docs/rfcs/component-universal-telemetry.md index 9ef5347ffd5..38d9f5ec396 100644 --- a/docs/rfcs/component-universal-telemetry.md +++ b/docs/rfcs/component-universal-telemetry.md @@ -33,27 +33,27 @@ All signals should use the following attributes: - `otel.component.kind`: `receiver` - `otel.component.id`: The component ID -- `otel.signal`: `logs`, `metrics`, `traces` +- `otel.signal`: `logs`, `metrics`, `traces`, `profiles` ### Processors - `otel.component.kind`: `processor` - `otel.component.id`: The component ID - `otel.pipeline.id`: The pipeline ID -- `otel.signal`: `logs`, `metrics`, `traces` +- `otel.signal`: `logs`, `metrics`, `traces`, `profiles` ### Exporters - `otel.component.kind`: `exporter` - `otel.component.id`: The component ID -- `otel.signal`: `logs`, `metrics` `traces` +- `otel.signal`: `logs`, `metrics` `traces`, `profiles` ### Connectors - `otel.component.kind`: `connector` - `otel.component.id`: The component ID - `otel.signal`: `logs`, `metrics` `traces` -- `otel.output.signal`: `logs`, `metrics` `traces` +- `otel.output.signal`: `logs`, `metrics` `traces`, `profiles` Note: The `otel.signal`, `otel.output.signal`, or `otel.pipeline.id` attributes may be omitted if the corresponding component instances are unified by the component implementation. 
For example, the `otlp` receiver is a singleton, so its telemetry is not specific to a signal. From 1b26ae2e0cdfc24983be9f7c25936bf260ea86b0 Mon Sep 17 00:00:00 2001 From: Daniel Jaglowski Date: Wed, 30 Oct 2024 08:04:26 -0500 Subject: [PATCH 10/13] Update docs/rfcs/component-universal-telemetry.md Co-authored-by: William Dumont --- docs/rfcs/component-universal-telemetry.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/rfcs/component-universal-telemetry.md b/docs/rfcs/component-universal-telemetry.md index 38d9f5ec396..a69a6d652dd 100644 --- a/docs/rfcs/component-universal-telemetry.md +++ b/docs/rfcs/component-universal-telemetry.md @@ -81,7 +81,7 @@ There are two straightforward measurements that can be made on any pdata: These are high cost to compute, so by default they should be disabled (and not calculated). The location of these measurements can be described in terms of whether the data is "consumed" or "produced", from the perspective of the -component to which the telemetry is attributed. Metrics which contain the term "procuded" describe data which is emitted from the component, +component to which the telemetry is attributed. Metrics which contain the term "produced" describe data which is emitted from the component, while metrics which contain the term "consumed" describe data which is received by the component. 
For both metrics, an `outcome` attribute with possible values `success` and `failure` should be automatically recorded, corresponding to From bba26c9bc31ff09cf09df0ed0b37b2271481f65b Mon Sep 17 00:00:00 2001 From: Daniel Jaglowski Date: Mon, 4 Nov 2024 11:15:52 -0600 Subject: [PATCH 11/13] Update docs/rfcs/component-universal-telemetry.md Co-authored-by: Evan Bradley <11745660+evan-bradley@users.noreply.github.com> --- docs/rfcs/component-universal-telemetry.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/rfcs/component-universal-telemetry.md b/docs/rfcs/component-universal-telemetry.md index a69a6d652dd..e5f9269f420 100644 --- a/docs/rfcs/component-universal-telemetry.md +++ b/docs/rfcs/component-universal-telemetry.md @@ -187,8 +187,8 @@ reports only that a 50% success rate is observed. For security and performance reasons, it would not be appropriate to log the contents of telemetry. -It's very easy for logs to become too noisy. Even if errors are occurring frequently in the data pipeline, they may only be of interest to -many users if they are not handled automatically. +It's very easy for logs to become too noisy. Even if errors are occurring frequently in the data pipeline, only the errors that are not +handled automatically will be of interest to most users. With the above considerations, this proposal includes only that we add a DEBUG log for each individual outcome. This should be sufficient for detailed troubleshooting but does not impact users otherwise. 
From 792501296fefcc1022500353bac88cb41d19cc1e Mon Sep 17 00:00:00 2001 From: Dan Jaglowski Date: Mon, 4 Nov 2024 16:39:44 -0500 Subject: [PATCH 12/13] Change 'otel.output.signal' to 'otel.signal.output' --- docs/rfcs/component-universal-telemetry.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/rfcs/component-universal-telemetry.md b/docs/rfcs/component-universal-telemetry.md index e5f9269f420..bc61d5a3cee 100644 --- a/docs/rfcs/component-universal-telemetry.md +++ b/docs/rfcs/component-universal-telemetry.md @@ -53,9 +53,9 @@ All signals should use the following attributes: - `otel.component.kind`: `connector` - `otel.component.id`: The component ID - `otel.signal`: `logs`, `metrics` `traces` -- `otel.output.signal`: `logs`, `metrics` `traces`, `profiles` +- `otel.signal.output`: `logs`, `metrics` `traces`, `profiles` -Note: The `otel.signal`, `otel.output.signal`, or `otel.pipeline.id` attributes may be omitted if the corresponding component instances +Note: The `otel.signal`, `otel.signal.output`, or `otel.pipeline.id` attributes may be omitted if the corresponding component instances are unified by the component implementation. For example, the `otlp` receiver is a singleton, so its telemetry is not specific to a signal. Similarly, the `memory_limiter` processor is a singleton, so its telemetry is not specific to a pipeline. 
From 3211be70a8ee08c2dede207a65845556d55a9360 Mon Sep 17 00:00:00 2001
From: Dan Jaglowski
Date: Thu, 7 Nov 2024 09:24:22 -0500
Subject: [PATCH 13/13] Change 'otel.*' to 'otelcol.*'

---
docs/rfcs/component-universal-telemetry.md | 30 +++++++++++-----------
1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/docs/rfcs/component-universal-telemetry.md b/docs/rfcs/component-universal-telemetry.md
index bc61d5a3cee..fa573d56aef 100644
--- a/docs/rfcs/component-universal-telemetry.md
+++ b/docs/rfcs/component-universal-telemetry.md
@@ -31,31 +31,31 @@ All signals should use the following attributes:

### Receivers

-- `otel.component.kind`: `receiver`
-- `otel.component.id`: The component ID
-- `otel.signal`: `logs`, `metrics`, `traces`, `profiles`
+- `otelcol.component.kind`: `receiver`
+- `otelcol.component.id`: The component ID
+- `otelcol.signal`: `logs`, `metrics`, `traces`, `profiles`

### Processors

-- `otel.component.kind`: `processor`
-- `otel.component.id`: The component ID
-- `otel.pipeline.id`: The pipeline ID
-- `otel.signal`: `logs`, `metrics`, `traces`, `profiles`
+- `otelcol.component.kind`: `processor`
+- `otelcol.component.id`: The component ID
+- `otelcol.pipeline.id`: The pipeline ID
+- `otelcol.signal`: `logs`, `metrics`, `traces`, `profiles`

### Exporters

-- `otel.component.kind`: `exporter`
-- `otel.component.id`: The component ID
-- `otel.signal`: `logs`, `metrics` `traces`, `profiles`
+- `otelcol.component.kind`: `exporter`
+- `otelcol.component.id`: The component ID
+- `otelcol.signal`: `logs`, `metrics`, `traces`, `profiles`

### Connectors

-- `otel.component.kind`: `connector`
-- `otel.component.id`: The component ID
-- `otel.signal`: `logs`, `metrics` `traces`
-- `otel.signal.output`: `logs`, `metrics` `traces`, `profiles`
+- `otelcol.component.kind`: `connector`
+- `otelcol.component.id`: The component ID
+- `otelcol.signal`: `logs`, `metrics`, `traces`
+- `otelcol.signal.output`: `logs`, `metrics`, `traces`, `profiles`

-Note: The
`otel.signal`, `otel.signal.output`, or `otel.pipeline.id` attributes may be omitted if the corresponding component instances +Note: The `otelcol.signal`, `otelcol.signal.output`, or `otelcol.pipeline.id` attributes may be omitted if the corresponding component instances are unified by the component implementation. For example, the `otlp` receiver is a singleton, so its telemetry is not specific to a signal. Similarly, the `memory_limiter` processor is a singleton, so its telemetry is not specific to a pipeline.