
RFC - Pipeline Component Telemetry #11406

Open

djaglowski wants to merge 14 commits into main

Conversation

@djaglowski (Member) commented Oct 9, 2024:

This PR adds an RFC for normalized telemetry across all pipeline components. See #11343.

edit by @mx-psi:


codecov bot commented Oct 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.53%. Comparing base (68f0264) to head (3211be7).
Report is 112 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #11406      +/-   ##
==========================================
- Coverage   92.15%   91.53%   -0.63%     
==========================================
  Files         432      441       +9     
  Lines       20291    23922    +3631     
==========================================
+ Hits        18700    21896    +3196     
- Misses       1228     1650     +422     
- Partials      363      376      +13     


@djaglowski djaglowski marked this pull request as ready for review October 10, 2024 13:36
@djaglowski djaglowski requested a review from a team as a code owner October 10, 2024 13:36
@djaglowski djaglowski added the Skip Changelog and Skip Contrib Tests labels Oct 10, 2024
@codeboten (Contributor) left a comment:

Thanks for opening this as an RFC, @djaglowski!

@djaglowski djaglowski changed the title RFC - Auto-instrumentation of pipeline components RFC - Pipeline Component Telemetry Oct 16, 2024
@djaglowski (Member, Author) commented:

Based on some offline feedback, I've broadened the scope of the RFC, while simultaneously clarifying that it is intended to evolve as we identify additional standards.

@jaronoff97 (Contributor) left a comment:

A few questions; I really like this proposal overall :)

- `otel.output.signal`: `logs`, `metrics`, `traces`, `profiles`

Note: The `otel.signal`, `otel.output.signal`, or `otel.pipeline.id` attributes may be omitted if the corresponding component instances
are unified by the component implementation. For example, the `otlp` receiver is a singleton, so its telemetry is not specific to a signal.
Member commented:

I think this needs normative language. Setting `otel.signal` for the OTLP receiver's bootstrap operations is certainly misleading to people operating the collector who are unaware of the inner workings of this specific component (i.e., that it's a singleton).

So:

attributes MUST be omitted if the corresponding component instances

@djaglowski (Member, Author) commented:

This is a tricky topic and I'm not sure we can be so strict. It certainly makes sense to me that e.g. logs generated while initializing the singleton should not be attributed to one signal or pipeline. However, we can still attribute metrics to a particular signal (e.g. if the otlp receiver emits 10 logs and 20 metrics, do you want a count of "30 items" or of "10 logs" and "20 metrics"?). Maybe this is a good argument for splitting the proposed metrics by signal type, e.g. produced_metrics, produced_logs, etc. This would allow those metrics to share the same set of attributes with other signals produced by the instance.
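A rough sketch of the two shapes described above, per-signal counters versus a single items counter carrying a signal attribute, using the OpenTelemetry Go metric API. The metric names and attribute keys are placeholders, not the proposed convention:

```go
package telemetry

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// recordSplit uses one counter per signal, so a signal-agnostic singleton
// needs no signal attribute at all (hypothetical metric names).
func recordSplit(ctx context.Context, meter metric.Meter, logs, metrics int64) error {
	producedLogs, err := meter.Int64Counter("otelcol.receiver.produced_logs")
	if err != nil {
		return err
	}
	producedMetrics, err := meter.Int64Counter("otelcol.receiver.produced_metrics")
	if err != nil {
		return err
	}
	producedLogs.Add(ctx, logs)
	producedMetrics.Add(ctx, metrics)
	return nil
}

// recordUnified uses a single "items" counter and distinguishes signals with
// an attribute, which a singleton would have to omit or leave ambiguous.
func recordUnified(ctx context.Context, meter metric.Meter, logs, metrics int64) error {
	produced, err := meter.Int64Counter("otelcol.receiver.produced_items")
	if err != nil {
		return err
	}
	produced.Add(ctx, logs, metric.WithAttributes(attribute.String("otel.signal", "logs")))
	produced.Add(ctx, metrics, metric.WithAttributes(attribute.String("otel.signal", "metrics")))
	return nil
}
```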


All signals should use the following attributes:

### Receivers
Member commented:

I know this states "pipeline component telemetry", and that extensions aren't technically part of a pipeline, but it feels wrong to leave them out: otel.component.kind and .id could definitely apply to them as well.

@djaglowski (Member, Author) commented:

The scope of this effort has been increased a lot already. Can we leave extensions for another proposal? Personally I don't feel I have enough expertise with extensions to author such details.


1. A count of "items" (spans, data points, or log records). These are low cost but broadly useful, so they should be enabled by default.
2. A measure of size, based on [ProtoMarshaler.Sizer()](https://github.com/open-telemetry/opentelemetry-collector/blob/9907ba50df0d5853c34d2962cf21da42e15a560d/pdata/ptrace/pb.go#L11).
These are high cost to compute, so by default they should be disabled (and not calculated).
Member commented:

How costly? I remember talking to someone about this in the past, and they mentioned that it's not that expensive, given that it just delegates to what already exists in protobuf:

It would be nice to have benchmarks to have data backing this (or the other) claim. I would definitely see as very useful to have a histogram of item/batch sizes and having it as optional means that people might only find out about it when they'd benefit from having historical data in the first place.

@djaglowski (Member, Author) commented:

Suggested change:
- These are high cost to compute, so by default they should be disabled (and not calculated).
+ These may be high cost to compute, so by default they should be disabled (and not calculated). This default setting may change in the future if it is demonstrated that the cost is generally acceptable.

How's this wording?
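For context on where the cost lands, a minimal sketch of the two measurements, assuming pdata's `ptrace.ProtoMarshaler` sizer and placeholder counter instruments. The item count is a cheap structural walk, while the size requires summing proto-encoded lengths over the whole payload, which is why it would sit behind a config gate:

```go
package telemetry

import (
	"context"

	"go.opentelemetry.io/collector/pdata/ptrace"
	"go.opentelemetry.io/otel/metric"
)

// sizer reuses pdata's proto marshaler, which can report the encoded size of a
// payload without actually serializing it.
var sizer ptrace.ProtoMarshaler

// recordBatch records the cheap item count unconditionally and the size only
// when the (hypothetical) size metrics are enabled in config.
func recordBatch(ctx context.Context, items, size metric.Int64Counter, td ptrace.Traces, sizeEnabled bool) {
	items.Add(ctx, int64(td.SpanCount()))

	if sizeEnabled {
		size.Add(ctx, int64(sizer.TracesSize(td)))
	}
}
```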


Metrics provide most of the observability we need but there are some gaps which logs can fill. Although metrics would describe the overall
item counts, it is helpful in some cases to record more granular events. e.g. If a produced batch of 10,000 spans results in an error, but
100 batches of 100 spans succeed, this may be a matter of batch size that can be detected by analyzing logs, while the corresponding metric
Member commented:

This is clearly a tracing case for me :-) The rule of thumb to me is: is the information related to a particular transaction? Then it should go into a span.

@djaglowski (Member, Author) commented:

Makes sense if we agree that we should capture a span for the consume function.


### Auto-Instrumented Spans

It is not clear that any spans can be captured automatically with the proposed mechanism. We have the ability to insert instrumentation both
Member commented:

Context is passed down, isn't it? We can definitely instrument the ingress part of the component, and ask components to add span links if they are messing with the context. This way, the trace for a pipeline with a batch processor would end at the batching processor, but a new trace with a span link would be created pointing to the originating batch request.
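As a sketch of that span-link idea (the helper and span name are illustrative, not an existing collector API): a component that drops the incoming context could start a new root trace for the re-emitted batch and link it back to the originating span:

```go
package telemetry

import (
	"context"

	"go.opentelemetry.io/otel/trace"
)

// startLinkedSpan begins a new root trace for work emitted after batching and
// links it back to the span of the originating request.
func startLinkedSpan(ctx context.Context, tracer trace.Tracer) (context.Context, trace.Span) {
	incoming := trace.SpanContextFromContext(ctx)
	return tracer.Start(context.Background(), "processor/export_batch",
		trace.WithLinks(trace.Link{SpanContext: incoming}))
}
```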

@djaglowski (Member, Author) commented:

I see, so instead of closing the span when data comes out, we close it when the consume func returns?

I think the duration of the span would be meaningful only for some synchronous processors, and could be meaningful for synchronous connectors (e.g. if they create and link spans to represent the work associated with the incoming data). But what about asynchronous components? Do we accept that the span is just measuring a quick handoff to the internal state of the component? Is this going to be misleading to users?
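For concreteness, a sketch of the "close the span when the consume func returns" reading, as a wrapper around the next consumer. The type, tracer wiring, and span name are illustrative, not the RFC's design; for an asynchronous component such as a batching processor, this span would only measure the handoff into the component's internal state, which is the concern above:

```go
package telemetry

import (
	"context"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/ptrace"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
)

// tracedTraces wraps the next consumer in the pipeline and records a span for
// each ConsumeTraces call, ending when the call returns.
type tracedTraces struct {
	next   consumer.Traces
	tracer trace.Tracer
}

func (t *tracedTraces) ConsumeTraces(ctx context.Context, td ptrace.Traces) error {
	ctx, span := t.tracer.Start(ctx, "receiver/ConsumeTraces")
	defer span.End()

	err := t.next.ConsumeTraces(ctx, td)
	if err != nil {
		// Mark the span as failed so errors surface in traces as well as metrics.
		span.SetStatus(codes.Error, err.Error())
	}
	return err
}

func (t *tracedTraces) Capabilities() consumer.Capabilities {
	return t.next.Capabilities()
}
```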

@jpkrohling (Member) commented:

Some of my comments might have been discussed before, in which case, feel free to ignore me and just mark the items as resolved.


### Receivers

- `otel.component.kind`: `receiver`
@dmitryax (Member) commented Nov 6, 2024:

OTel SemConv guidelines require metric attributes to be namespaced so they can be placed in a global registry along with resource attributes. I don't think we can use `otel` for the collector. It should probably be `otelcol`, or `otel.collector` instead. However, `otel.` seems like kind of a restricted namespace; I'm not sure whether we can use it or not. I think we need to run this by the OTel semantic conventions SIG before merging the RFC.

Member commented:

@dmitryax if you think contacting the semantic convention SIG is a requirement for this RFC to be merged, please request changes on the PR so that we follow the process

@djaglowski (Member, Author) commented:

I'm fine with changing the attribute names, but delaying this functionality further for an entirely separate naming process seems unnecessary. Can we really not choose our own attribute names?

Member commented:

This sentence from the doc linked by @dmitryax makes me hesitant about keeping the current approach:

Any additions to the otel.* namespace MUST be approved as part of OpenTelemetry specification.

@djaglowski (Member, Author) commented:

This sentence from the doc linked by @dmitryax makes me hesitant about keeping the current approach:

Any additions to the otel.* namespace MUST be approved as part of OpenTelemetry specification.

I agree we can't just use the otel namespace. Can't we choose another on our own though? I like the otelcol suggestion.

Member commented:

`otelcol` works for me; it's also the one we use for metrics, so it makes sense to be consistent here.

@tigrannajaryan (Member) commented:

@mx-psi asked me to chime in to help interpret the spec.

You are reading it correctly. If you want to use the `otel.` prefix, it needs to go through the semconv group.

What can you use instead? If you read the preceding paragraph, the closest applicable rule seems to be this:

The name is specific to your company and may be possibly used outside the company as well. To avoid clashes with names introduced by other companies (in a distributed system that uses applications from multiple vendors) it is recommended to prefix the new name by your company's reverse domain name, e.g. com.acme.shopname.

If we consider "OpenTelemetry" the "company" then we should use "io.opentelemetry." as the org-specific prefix, e.g. "io.opentelemetry.collector." for the Collector.

However, I am not sure I like such a long prefix. "otelcol" or "otel.collector" looks much nicer to me.

I cannot find spec recommendations about when a new top-level namespace (like "otelcol") is allowed to be used. I would argue the spec should have a rule like this and the rule should be that it needs to go through semconv, just like for anything under "otel." namespace.

My recommendation is this:

  • Decide which prefix you want to use: "otelcol." or "otel.collector." appear to be your top choices.
  • Submit a request to get this added to semconv. We should be able to convince semconv group to accept it and I do not anticipate difficulties here.

You don't need to be blocked or wait until acceptance by the semconv group; there is a ton of work that needs to happen anyway, and replacing metric names should be trivial anytime before you declare them stable.

One additional thing that I think is worth thinking about: is this Collector-only telemetry, or can it be applicable to SDK processors/exporters? Collector's telemetry is going to be much richer and needs to be more expressive, so I am not sure if it is beneficial to model Collector and SDK the same way. However, if we end up deciding that indeed we want to model them the same way, with SDK's use cases being virtually a subset of Collector's use cases, then the "otel." prefix seems to be the most natural choice.

@djaglowski (Member, Author) commented:

Thanks @tigrannajaryan. I'll open an issue to add otelcol as a top level namespace in the semantic conventions.

One additional thing that I think is worth thinking about: is this Collector-only telemetry, or can it be applicable to SDK processors/exporters? Collector's telemetry is going to be much richer and needs to be more expressive, so I am not sure if it is beneficial to model Collector and SDK the same way. However, if we end up deciding that indeed we want to model them the same way, with SDK's use cases being virtually a subset of Collector's use cases, then the "otel." prefix seems to be the most natural choice.

This is intended to be collector-only telemetry. It is attributed to the collector's instances of components, which are governed by a model that is not intended to be aligned with other parts of the project.
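Purely for illustration of how the rename might read (the final names depend on the semconv outcome), the per-instance identifying attributes with the `otelcol.` prefix discussed above; the keys are hypothetical otelcol-prefixed versions of the RFC's `otel.*` proposals:

```go
package telemetry

import "go.opentelemetry.io/otel/attribute"

// componentAttrs builds the identifying attributes for one component instance.
func componentAttrs(kind, id, signal, pipelineID string) []attribute.KeyValue {
	return []attribute.KeyValue{
		attribute.String("otelcol.component.kind", kind),
		attribute.String("otelcol.component.id", id),
		attribute.String("otelcol.signal", signal),
		attribute.String("otelcol.pipeline.id", pipelineID),
	}
}
```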

@djaglowski (Member, Author) commented:

I've opened open-telemetry/semantic-conventions#1555 to propose otelcol as a new top-level namespace.
