# Auto-Instrumented Component Telemetry

## Motivation

The collector should be observable, and this must naturally include observability of its pipeline components. It is understood that each _type_ of component (`filelog`, `batch`, etc.) may emit telemetry describing its internal workings, and that these internally derived signals may vary greatly based on the concerns and maturity of each component. Naturally though, the collector should also describe the behavior of components using broadly normalized telemetry. A major challenge in pursuing this is that there must be a clear mechanism by which such telemetry can be automatically captured. Therefore, this RFC is first and foremost a proposal for a _mechanism_. Then, based on what _can_ be captured by this mechanism, the RFC describes specific metrics, spans, and logs which can be broadly normalized.

## Goals

1. Articulate a mechanism which enables us to _automatically_ capture telemetry from _all pipeline components_.
2. Define attributes that are (A) specific enough to describe individual component [_instances_](https://github.com/open-telemetry/opentelemetry-collector/issues/10534) and (B) consistent enough for correlation across signals.
3. Define specific metrics for each kind of pipeline component.
4. Define specific spans for processors and connectors.
5. Define specific logs for all kinds of pipeline component.

### Mechanism

The mechanism of telemetry capture should be _external_ to components. Specifically, we should observe telemetry at each point where a component passes data to another component, and at each point where a component consumes data from another component. In terms of the component graph, this means that every _edge_ in the graph will have two layers of instrumentation: one for the producing component and one for the consuming component. Importantly, each layer generates telemetry which is ascribed to a single component instance, so by having two layers per edge we can describe both sides of each handoff independently. In the case of processors and connectors, the appropriate layers can act in concert (e.g. record the start and end of a span).
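
To make the mechanism concrete, below is a minimal sketch of what such an instrumentation layer could look like for logs, assuming it is created by the service graph when wiring edges. The type name `obsConsumer` and the `recordFn` callback are hypothetical and shown only for illustration; the actual implementation may differ.

```go
package graph

import (
	"context"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/plog"
)

// obsConsumer (hypothetical) wraps the consumer.Logs that joins two components
// on an edge of the graph. One instance per edge is attributed to the
// producing component ("outgoing") and another to the consuming component
// ("incoming"), so each handoff is described from both sides.
type obsConsumer struct {
	next     consumer.Logs
	recordFn func(ctx context.Context, numItems int, err error) // records telemetry for one handoff
}

func (c *obsConsumer) Capabilities() consumer.Capabilities {
	return c.next.Capabilities()
}

func (c *obsConsumer) ConsumeLogs(ctx context.Context, ld plog.Logs) error {
	err := c.next.ConsumeLogs(ctx, ld)
	// The measurement is ascribed to a single component instance; the outcome
	// is derived from whether the call returned an error.
	c.recordFn(ctx, ld.LogRecordCount(), err)
	return err
}
```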

### Attributes

All signals should use the following attributes:

#### Receivers

- `otel.component.kind`: `receiver`

- `otel.component.id`: The component ID
- `otel.signal`: `logs`, `metrics`, `traces`, **OR `ALL`**

#### Processors

- `otel.component.kind`: `processor`
- `otel.component.id`: The component ID
- `otel.pipeline.id`: The pipeline ID, **OR `ALL`**
- `otel.signal`: `logs`, `metrics`, `traces`, **OR `ALL`**

#### Exporters

- `otel.component.kind`: `exporter`
- `otel.component.id`: The component ID
- `otel.signal`: `logs`, `metrics`, `traces`, **OR `ALL`**

#### Connectors

- `otel.component.kind`: `connector`
- `otel.component.id`: The component ID
- `otel.signal`: `logs->logs`, `logs->metrics`, `logs->traces`, `metrics->logs`, `metrics->metrics`, etc, **OR `ALL`**

Note: The use of `ALL` is based on the assumption that components are instanced either in the default way or as a single instance per configuration (e.g. the otlp receiver).
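
As a purely illustrative example of how these attributes identify a component instance, a `batch` processor in the `traces` pipeline could be described by the following attribute set, built here with the standard `attribute` package (the variable name is hypothetical):

```go
package graph

import "go.opentelemetry.io/otel/attribute"

// Hypothetical attribute set for one processor instance. Attaching the same
// set to the instance's metrics, spans, and logs allows the signals to be
// correlated.
var batchTracesAttrs = attribute.NewSet(
	attribute.String("otel.component.kind", "processor"),
	attribute.String("otel.component.id", "batch"),
	attribute.String("otel.pipeline.id", "traces"),
	attribute.String("otel.signal", "traces"),
)
```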

### Metrics

There are two straightforward measurements that can be made on any pdata:

1. A count of "items" (spans, data points, or log records). These are low cost but broadly useful, so they should be enabled by default.
2. A measure of size, based on [ProtoMarshaler.Sizer()](https://github.com/open-telemetry/opentelemetry-collector/blob/9907ba50df0d5853c34d2962cf21da42e15a560d/pdata/ptrace/pb.go#L11). These are high cost to compute, so by default they should be disabled (and not calculated).

The location of these measurements can be described in terms of whether the data is "incoming" or "outgoing", from the perspective of the component to which the telemetry is ascribed.

1. Incoming measurements are attributed to the component which is _consuming_ the data.
2. Outgoing measurements are attributed to the component which is _producing_ the data.

For both metrics, an `outcome` attribute with possible values `success` and `failure` should be automatically recorded, corresponding to whether or not the function call returned an error. Outgoing measurements will be recorded with `outcome` as `failure` when the next consumer returns an error, and `success` otherwise. Likewise, incoming measurements will be recorded with `outcome` as `failure` when the component itself returns an error, and `success` otherwise.

```yaml
otelcol_component_incoming_items:
  enabled: true
  description: Number of items passed to the component.
  unit: "{items}"
  sum:
    value_type: int
    monotonic: true
otelcol_component_outgoing_items:
  enabled: true
  description: Number of items emitted from the component.
  unit: "{items}"
  sum:
    value_type: int
    monotonic: true

otelcol_component_incoming_size:
  enabled: false
  description: Size of items passed to the component.
  unit: "By"
  sum:
    value_type: int
    monotonic: true
otelcol_component_outgoing_size:
  enabled: false
  description: Size of items emitted from the component.
  unit: "By"
  sum:
    value_type: int
    monotonic: true
```
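
The following sketch shows how an instrumentation layer might record one of these counters together with the derived `outcome` attribute. The helper functions are hypothetical; only the instrument name, unit, and description come from the definitions above.

```go
package graph

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// newIncomingItemsCounter creates the incoming-items instrument; the name,
// description, and unit mirror the metric definition above.
func newIncomingItemsCounter(meter metric.Meter) (metric.Int64Counter, error) {
	return meter.Int64Counter(
		"otelcol_component_incoming_items",
		metric.WithDescription("Number of items passed to the component."),
		metric.WithUnit("{items}"),
	)
}

// recordItems records an item count with the component's identifying
// attributes plus the automatically derived outcome attribute.
func recordItems(ctx context.Context, counter metric.Int64Counter, componentAttrs attribute.Set, numItems int, err error) {
	outcome := "success"
	if err != nil {
		outcome = "failure"
	}
	counter.Add(ctx, int64(numItems),
		metric.WithAttributeSet(componentAttrs),
		metric.WithAttributes(attribute.String("outcome", outcome)),
	)
}
```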

### Spans

A span should be recorded for each execution of a processor or connector. The instrumentation layers adjacent to these components can start and end the span as appropriate.
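
A simplified sketch of how the layer in front of a synchronous processor could produce such a span follows; in the two-layer design described above, the start and end could instead be split across the incoming and outgoing layers. The type and span name are illustrative only.

```go
package graph

import (
	"context"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/ptrace"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
)

// tracedConsumer (hypothetical) sits in front of a processor or connector.
// For a synchronous component, the call into it also drives its call to the
// next consumer, so a span around this call covers the component's execution.
type tracedConsumer struct {
	tracer   trace.Tracer
	spanName string // e.g. "processor/batch/traces" (illustrative)
	next     consumer.Traces
}

func (c *tracedConsumer) Capabilities() consumer.Capabilities {
	return c.next.Capabilities()
}

func (c *tracedConsumer) ConsumeTraces(ctx context.Context, td ptrace.Traces) error {
	ctx, span := c.tracer.Start(ctx, c.spanName)
	defer span.End()
	err := c.next.ConsumeTraces(ctx, td)
	if err != nil {
		span.SetStatus(codes.Error, err.Error())
	}
	return err
}
```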

### Logs

Metrics and spans provide most of the observability we need, but there are some gaps which logs can fill. For example, we can record spans for processors and connectors, but logs are useful for capturing precise timing as it relates to data produced by receivers and consumed by exporters. Additionally, although metrics describe the overall item counts, it is helpful in some cases to record more granular events. For example, if an outgoing batch of 10,000 spans results in an error, but 100 batches of 100 spans succeed, this may be a matter of batch size that can be detected by analyzing logs, while the corresponding metric reports only a 50% success rate.

For security and performance reasons, it would not be appropriate to log the contents of telemetry.

It's very easy for logs to become too noisy. Even if errors are occurring frequently in the data pipeline, many users will only be interested in them if they are not handled automatically.

With the above considerations in mind, this proposal adds only a DEBUG log for each individual outcome. This should be sufficient for detailed troubleshooting but does not impact users otherwise.
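
A sketch of what such a per-outcome DEBUG log could look like, using the `zap` logger already available to collector components; the function and message strings are illustrative, not a proposed API.

```go
package graph

import "go.uber.org/zap"

// logOutcome emits one DEBUG log per handoff, mirroring the outcome attribute
// recorded on the metrics.
func logOutcome(logger *zap.Logger, numItems int, err error) {
	if err != nil {
		logger.Debug("consume failed", zap.Int("items", numItems), zap.Error(err))
		return
	}
	logger.Debug("consume succeeded", zap.Int("items", numItems))
}
```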

In the future, it may be helpful to define triggers for reporting repeated failures at a higher severity level, e.g. N failures in a row, or a moving average of the success rate. For now, the criteria and the necessary configurability are unclear, so this is mentioned only as an example of future possibilities.

### Additional context

This proposal pulls from a number of issues and PRs:

- [Demonstrate graph-based metrics](https://github.com/open-telemetry/opentelemetry-collector/pull/11311)
- [Attributes for component instancing](https://github.com/open-telemetry/opentelemetry-collector/issues/11179)
- [Simple processor metrics](https://github.com/open-telemetry/opentelemetry-collector/issues/10708)
- [Component instancing is complicated](https://github.com/open-telemetry/opentelemetry-collector/issues/10534)