Add http request duration to SDK metrics #2007

lzchen · 2025-03-18T19:12:20Z

Related to #1631 and partially addresses #1906

Changes

Adds a metric to track the duration of http requests made within an exporter.

Following the discussion in this issue, I wasn't exactly sure how we want to provide more detail regarding partial success. The error.type can indicate that it was a partial success but we can possibly also include information about which items succeeded/failed somewhere (perhaps in another attribute).

I also made the distinction of making this metric specific to http but am open to defining a generic one regardless of transport protocol (we can probably have protocol + response code as additional attributes if so).

Looking forward to hearing your thoughts!

Merge requirement checklist

CONTRIBUTING.md guidelines followed.
Change log entry added, according to the guidelines in When to add a changelog entry.
- If your PR does not need a change log, start the PR title with [chore]
schema-next.yaml updated with changes to existing conventions.

model/otel/metrics.yaml

docs/otel/sdk-metrics.md

jsuereth · 2025-03-19T13:46:09Z

docs/otel/sdk-metrics.md

+
+| Name     | Instrument Type | Unit (UCUM) | Description    | Stability |
+| -------- | --------------- | ----------- | -------------- | --------- |
+| `otel.sdk.exporter.http.request.duration` | Histogram | `s` | The duration of any http request(s) made from the exporter during export. [1] | ![Development](https://img.shields.io/badge/-development-blue) |


At a high level, just want to understand why this wouldn't use the general HTTP semantic convention duration with a required label denoting it's the SDK exporter.

separate monitoring pipeline metrics could make it easier for ingestion services to handle them differently (e.g. maybe not consider them "user data", surface pipeline issues, etc)

I hope we can have a export call duration metrics that covers all tries and has a final export status - the logical rather than physical. So I'd also prefer to remove http and would suggest to call it otel.sdk.export.duration (otel.sdk.exporter.operation.duration)

@jsuereth

Due to the change after this comment, the metric is now being used for all network based exporters instead of just http.

@lmolkova

I have removed the http.

covers all tries and has a final export status

This metric is meant to capture requests sent as a batch. I believe covering all retries constitutes knowing how each individual item in the batch has behaved until final export status. I don't know how that would translate to duration and also I don't think that's within the scope of a batch level metric. Perhaps that would be better suited for a separate metric, as our use case requires knowing the duration of requests on a batch level.

I hope we can have a export call duration metrics that covers all tries and has a final export status - the logical rather than physical

IMO we should stay with otel.sdk.exporter.request.duration and keep this about the physical request.
The problem with tracking the total duration including retries is that I don't think most implementations would treat retried batches and new batches separetly and might combine them onto a single, physical request. This makes it impossible to track the duration correctly with partial successes. I've given a detailed explanation and example in this comment.

Another benefit of having a duration per physical request that you also get the error.type which caused the retry: e.g. was it a network error or did the receiver return 503 Temporarily Unavailable?

I would suggest to add a separate metric if we want to track the "logical" request (= a batch of data to be exported). Something like otel.sdk.exporter.batch.duration. If an exporter does not have partial success semantics OR retries batches in isolation, it can produce this metric. Otherwise we will still have otel.sdk.exporter.request.duration, which also remains useful in combination with otel.sdk.exporter.batch.duration due to giving insights on the physical requests.

patrial success is a terminal status code, it won't be retried - the exporter would need to create a new batch if they want to re-send it

The client MUST NOT retry the request when it receives a partial success response where the partial_success is populated.

from https://opentelemetry.io/docs/specs/otlp/#partial-success

We do have protocol-level metrics for HTTP and gRPC, we just need to let them do the work (not suppress metrics from exporters), the logical duration is a separate metrics, it can't be replaced with a physical ones.

otel.sdk.exporter.batch.duration

👍

or otel.sdk.export.duration

patrial success is a terminal status code, it won't be retried - the exporter would need to create a new batch if they want to re-send it

The problem I'm thinking of is not partial-successes being retried. It is about requests with partial success containing a mix of retried and new data:

A batch of spans is submitted to the exporter

The request fails for a network reason

In the meantime, a second batch of spans is submitted for export

The second request is issued, containing the spans from both batches

The request finishes with partial success, returning that 5 spans were rejected

In this example, it is not possible to identify if the rejected spans belong to the first or second batch, therefore it is not possible to correctly populate the error.type for the batches.

I don't know if any exporter does this or whether this is forbidden, I just wanted to highlight that it might not be possible to correctly collect this metric in certain scenarios.

We do have protocol-level metrics for HTTP and gRPC, we just need to let them do the work (not suppress metrics from exporters), the logical duration is a separate metrics, it can't be replaced with a physical ones.

For that I'm with @trask :

separate monitoring pipeline metrics could make it easier for ingestion services to handle them differently (e.g. maybe not consider them "user data", surface pipeline issues, etc)

I think that telemetry about opentelemetry itself should be clearly separated from "user data" too. Some examples where not doing this would lead to problems:

A user has a HTTP-statistics dashboard for a mission-critical service. He now needs to somehow filter out the Opentelemetry-Exporter requests to not distort the data he is actually interested in. To do so, he needs deep deployment knowledge about the deployed OTLP-endpoints to filter them out correctly. Even worse, he might have to change his dashboard if the monitoring setup changes and e.g. collectors are deployed under different URLs.

It makes sense for mission critical environments to have a separate "user-data" telemetry backend and a health-telemetry backend. This way if your "user-data" backend is overloaded, you'll still have your health-telemetry to investigate the problem. If you have both in a single monitoring backend, you won't be able to access your health telemetry in case the backend is overloaded, which ironically is when you need those metrics most. Having the health metrics in well-defined, separated namespaces (e.g. otel.sdk.*) makes it easy to set this up: Just setup two metric exporters in your SDKs, one exporting just otel.sdk.* to your health-telemetry backend and one exporting everything except otel.sdk.*. If you have to seaprate metrics based on attribute values things get a lot more complicated.

I'm not asking to necessarily add physical request metrics in this PR, I just would like to reserve the otel.sdk.exporter.request.* for those and use otel.sdk.exporter.batch.duration or otel.sdk.export.duration for the logical duration instead.

@JonasKunz

the export duration would measure this call:

https://github.com/open-telemetry/opentelemetry-go/blob/796596abbad67e221a89e225174715a74888101c/exporters/otlp/otlptrace/exporter.go#L36

err := e.client.UploadTraces(ctx, protoSpans)

there is no logic that re-batches things on retry - the retry happens within the scope of e.client.UploadTraces and attempts to re-send the same batch.

Do I miss something? Is there an SDK exporter that has some dynamic batching mechanism?

I wasn't talking about a concrete implementation and also not only thinking about OTLP. But it is possible to provide an implementation where my example scenario can occur.

Anyway, this shouldn't be problematic as long as the metric is not a MUST. So I'd propose to just move ahead with the metric for the logic duration, but reserve otel.sdk.exporter.request.* for physical request metrics?

…ventions into requests-duration

lzchen added 2 commits March 18, 2025 11:54

http

606aab9

rename

acadb93

lzchen requested review from a team as code owners March 18, 2025 19:12

github-actions bot added the enhancement New feature or request label Mar 18, 2025

lzchen added 2 commits March 18, 2025 12:21

lint

2f33a66

fix

08a55d2

JonasKunz reviewed Mar 19, 2025

View reviewed changes

model/otel/metrics.yaml Outdated Show resolved Hide resolved

docs/otel/sdk-metrics.md Outdated Show resolved Hide resolved

jsuereth reviewed Mar 19, 2025

View reviewed changes

lzchen added 2 commits March 20, 2025 10:32

remove http

684fd82

Merge branch 'main' of https://github.com/open-telemetry/semantic-con…

8ed5b70

…ventions into requests-duration

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add http request duration to SDK metrics #2007

Add http request duration to SDK metrics #2007

lzchen commented Mar 18, 2025 •

edited

Loading

jsuereth Mar 19, 2025

trask Mar 19, 2025

lmolkova Mar 20, 2025

lzchen Mar 20, 2025

JonasKunz Mar 21, 2025 •

edited

Loading

lmolkova Mar 25, 2025 •

edited

Loading

JonasKunz Mar 25, 2025

lmolkova Mar 25, 2025 •

edited

Loading

JonasKunz Mar 26, 2025

Add http request duration to SDK metrics #2007

Are you sure you want to change the base?

Add http request duration to SDK metrics #2007

Conversation

lzchen commented Mar 18, 2025 • edited Loading

Changes

Merge requirement checklist

jsuereth Mar 19, 2025

Choose a reason for hiding this comment

trask Mar 19, 2025

Choose a reason for hiding this comment

lmolkova Mar 20, 2025

Choose a reason for hiding this comment

lzchen Mar 20, 2025

Choose a reason for hiding this comment

JonasKunz Mar 21, 2025 • edited Loading

Choose a reason for hiding this comment

lmolkova Mar 25, 2025 • edited Loading

Choose a reason for hiding this comment

JonasKunz Mar 25, 2025

Choose a reason for hiding this comment

lmolkova Mar 25, 2025 • edited Loading

Choose a reason for hiding this comment

JonasKunz Mar 26, 2025

Choose a reason for hiding this comment

lzchen commented Mar 18, 2025 •

edited

Loading

JonasKunz Mar 21, 2025 •

edited

Loading

lmolkova Mar 25, 2025 •

edited

Loading

lmolkova Mar 25, 2025 •

edited

Loading