Add http request duration to SDK metrics #2007

Open · wants to merge 6 commits into main

Conversation

@lzchen commented Mar 18, 2025

Related to #1631 and partially addresses #1906

Changes

Adds a metric to track the duration of HTTP requests made within an exporter.

Following the discussion in this issue, I wasn't exactly sure how we want to provide more detail regarding partial success. The `error.type` attribute can indicate that it was a partial success, but we could possibly also include information about which items succeeded/failed somewhere (perhaps in another attribute).

I also chose to make this metric specific to HTTP, but I'm open to defining a generic one regardless of transport protocol (if so, we could probably add the protocol and response code as additional attributes).

Looking forward to hearing your thoughts!

Merge requirement checklist

@lzchen requested review from a team as code owners March 18, 2025 19:12
@github-actions bot added the enhancement label Mar 18, 2025

| Name | Instrument Type | Unit (UCUM) | Description | Stability |
| -------- | --------------- | ----------- | -------------- | --------- |
| `otel.sdk.exporter.http.request.duration` | Histogram | `s` | The duration of any http request(s) made from the exporter during export. [1] | ![Development](https://img.shields.io/badge/-development-blue) |
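
As a rough illustration (not part of this PR), here is a minimal Go sketch of how an HTTP-based exporter could record the proposed histogram around each request; the meter name, the `exportBatch` helper, and the `error.type` values are assumptions for this example, not part of any SDK or convention.

```go
package exporter

import (
	"context"
	"net/http"
	"strconv"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// Histogram named after the instrument proposed in this PR.
var requestDuration, _ = otel.Meter("otel-sdk-self-telemetry").Float64Histogram(
	"otel.sdk.exporter.http.request.duration",
	metric.WithUnit("s"),
	metric.WithDescription("The duration of any http request(s) made from the exporter during export."),
)

// exportBatch sends one serialized batch and records how long the HTTP
// request took, tagging failures with an error.type attribute (the exact
// attribute values here are illustrative, not a defined convention).
func exportBatch(ctx context.Context, client *http.Client, req *http.Request) error {
	start := time.Now()
	resp, err := client.Do(req)
	elapsed := time.Since(start)

	var attrs []attribute.KeyValue
	switch {
	case err != nil:
		attrs = append(attrs, attribute.String("error.type", "connection_error"))
	case resp.StatusCode >= 400:
		attrs = append(attrs, attribute.String("error.type", strconv.Itoa(resp.StatusCode)))
	}
	if resp != nil {
		defer resp.Body.Close()
	}
	requestDuration.Record(ctx, elapsed.Seconds(), metric.WithAttributes(attrs...))
	return err
}
```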
Contributor commented

At a high level, just want to understand why this wouldn't use the general HTTP semantic convention duration with a required label denoting it's the SDK exporter.

@trask (Member) commented

separate monitoring pipeline metrics could make it easier for ingestion services to handle them differently (e.g. maybe not consider them "user data", surface pipeline issues, etc)

@lmolkova (Contributor) commented

I hope we can have an export-call duration metric that covers all tries and has a final export status - the logical rather than the physical duration. So I'd also prefer to remove http, and would suggest calling it otel.sdk.export.duration (or otel.sdk.exporter.operation.duration).

@lzchen (Author) commented

@jsuereth

Due to the change after this comment, the metric is now being used for all network-based exporters instead of just HTTP.

@lmolkova

I have removed the http.

> covers all tries and has a final export status

This metric is meant to capture requests sent as a batch. Covering all retries would mean knowing how each individual item in the batch behaved up to its final export status; I don't know how that would translate into a duration, and I don't think it's within the scope of a batch-level metric. Perhaps that would be better suited to a separate metric, since our use case requires knowing the duration of requests at the batch level.

@JonasKunz (Contributor) commented Mar 21, 2025

> I hope we can have an export-call duration metric that covers all tries and has a final export status - the logical rather than the physical duration

IMO we should stay with otel.sdk.exporter.request.duration and keep this about the physical request.
The problem with tracking the total duration including retries is that I don't think most implementations would treat retried batches and new batches separately; they might combine them into a single physical request. That makes it impossible to track the duration correctly with partial successes. I've given a detailed explanation and example in this comment.

Another benefit of having a duration per physical request is that you also get the error.type which caused the retry: e.g. was it a network error, or did the receiver return 503 Service Unavailable?

I would suggest adding a separate metric if we want to track the "logical" request (= a batch of data to be exported), something like otel.sdk.exporter.batch.duration. If an exporter does not have partial-success semantics OR retries batches in isolation, it can produce this metric. Otherwise we will still have otel.sdk.exporter.request.duration, which also remains useful in combination with otel.sdk.exporter.batch.duration because it gives insight into the physical requests.
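
To make the physical/logical split concrete, here is a rough Go sketch using the names proposed in this thread (otel.sdk.exporter.request.duration for each physical attempt, otel.sdk.exporter.batch.duration for the whole logical export); the `sendOnce`, `errorTypeAttr`, and `maxAttempts` pieces are hypothetical stand-ins for an exporter's internals, not real SDK APIs.

```go
package exporter

import (
	"context"
	"errors"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

const maxAttempts = 3 // hypothetical retry limit

var meter = otel.Meter("otel-sdk-self-telemetry")

// Instrument names follow the proposals in this thread; neither is part of a
// released semantic convention.
var requestDuration, _ = meter.Float64Histogram(
	"otel.sdk.exporter.request.duration", metric.WithUnit("s"))
var batchDuration, _ = meter.Float64Histogram(
	"otel.sdk.exporter.batch.duration", metric.WithUnit("s"))

var errRetryable = errors.New("retryable export failure")

// sendOnce stands in for one physical request (HTTP POST, gRPC call, ...).
func sendOnce(ctx context.Context, batch []byte) error { return nil }

// errorTypeAttr maps an error to an error.type attribute; the values used
// here are illustrative only.
func errorTypeAttr(err error) []attribute.KeyValue {
	if err == nil {
		return nil
	}
	return []attribute.KeyValue{attribute.String("error.type", err.Error())}
}

// export performs the logical export of one batch, retrying physical requests.
func export(ctx context.Context, batch []byte) error {
	exportStart := time.Now()
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		reqStart := time.Now()
		err = sendOnce(ctx, batch)
		// Physical request duration: one data point per attempt, with the
		// per-attempt error.type (network error, 503, ...).
		requestDuration.Record(ctx, time.Since(reqStart).Seconds(),
			metric.WithAttributes(errorTypeAttr(err)...))
		if err == nil || !errors.Is(err, errRetryable) {
			break
		}
	}
	// Logical export duration: one data point per batch, covering all tries,
	// tagged with the final outcome.
	batchDuration.Record(ctx, time.Since(exportStart).Seconds(),
		metric.WithAttributes(errorTypeAttr(err)...))
	return err
}
```

With this split, retried attempts show up as multiple points in the request histogram, while the batch histogram still records exactly one point per logical export.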

@lmolkova (Contributor) commented Mar 25, 2025

Partial success is a terminal status code; it won't be retried. The exporter would need to create a new batch if it wants to re-send it.

> The client MUST NOT retry the request when it receives a partial success response where the partial_success is populated.

from https://opentelemetry.io/docs/specs/otlp/#partial-success

We do have protocol-level metrics for HTTP and gRPC; we just need to let them do the work (i.e. not suppress metrics from exporters). The logical duration is a separate metric; it can't be replaced with the physical ones.

> otel.sdk.exporter.batch.duration

👍

or otel.sdk.export.duration
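
As a reminder of what partial success looks like on the client side, here is a small sketch against the OTLP trace response proto types; the error.type value "partial_success" and the rejected-spans attribute key are assumptions for illustration, not defined conventions.

```go
package exporter

import (
	"go.opentelemetry.io/otel/attribute"
	coltracepb "go.opentelemetry.io/proto/otlp/collector/trace/v1"
)

// partialSuccessAttrs inspects an OTLP export response. A populated
// partial_success is terminal: the request is not retried, so the exporter
// can only surface it via attributes on the duration metric (or elsewhere).
func partialSuccessAttrs(resp *coltracepb.ExportTraceServiceResponse) []attribute.KeyValue {
	ps := resp.GetPartialSuccess()
	if ps == nil || (ps.GetRejectedSpans() == 0 && ps.GetErrorMessage() == "") {
		return nil // full success
	}
	return []attribute.KeyValue{
		// Hypothetical values; the thread discusses where and how to surface this.
		attribute.String("error.type", "partial_success"),
		attribute.Int64("otel.sdk.exporter.rejected_spans", ps.GetRejectedSpans()),
	}
}
```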

@JonasKunz (Contributor) commented

> Partial success is a terminal status code; it won't be retried. The exporter would need to create a new batch if it wants to re-send it.

The problem I'm thinking of is not partial successes being retried. It is about requests with partial success containing a mix of retried and new data:

  1. A batch of spans is submitted to the exporter
  2. The request fails for a network reason
  3. In the meantime, a second batch of spans is submitted for export
  4. The second request is issued, containing the spans from both batches
  5. The request finishes with partial success, returning that 5 spans were rejected

In this example, it is not possible to identify whether the rejected spans belong to the first or the second batch, and therefore it is not possible to correctly populate the error.type for the batches.

I don't know if any exporter does this or whether this is forbidden; I just wanted to highlight that it might not be possible to correctly collect this metric in certain scenarios.

> We do have protocol-level metrics for HTTP and gRPC; we just need to let them do the work (i.e. not suppress metrics from exporters). The logical duration is a separate metric; it can't be replaced with the physical ones.

For that I'm with @trask:

> separate monitoring pipeline metrics could make it easier for ingestion services to handle them differently (e.g. maybe not consider them "user data", surface pipeline issues, etc)

I think that telemetry about OpenTelemetry itself should be clearly separated from "user data" too. Some examples where not doing this would lead to problems:

  • A user has an HTTP-statistics dashboard for a mission-critical service. He now needs to somehow filter out the OpenTelemetry exporter requests so they don't distort the data he is actually interested in. To do so, he needs deep deployment knowledge about the deployed OTLP endpoints to filter them out correctly. Even worse, he might have to change his dashboard if the monitoring setup changes and e.g. collectors are deployed under different URLs.

  • It makes sense for mission-critical environments to have a separate "user-data" telemetry backend and a health-telemetry backend. This way, if your "user-data" backend is overloaded, you'll still have your health telemetry to investigate the problem. If you have both in a single monitoring backend, you won't be able to access your health telemetry when the backend is overloaded, which ironically is when you need those metrics most. Having the health metrics in well-defined, separate namespaces (e.g. otel.sdk.*) makes it easy to set this up: just set up two metric exporters in your SDKs, one exporting just otel.sdk.* to your health-telemetry backend and one exporting everything except otel.sdk.* (a rough sketch of such a filter follows this list). If you have to separate metrics based on attribute values, things get a lot more complicated.
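
As a rough illustration of that separation (again, not part of this PR): a Go sketch of a hypothetical name-prefix filter wrapped around the SDK metric Exporter interface. The prefixFilterExporter type does not exist in any SDK; it only demonstrates that routing by a well-known otel.sdk.* prefix is mechanically simple.

```go
package selftelemetry

import (
	"context"
	"strings"

	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/metric/metricdata"
)

// prefixFilterExporter is a hypothetical wrapper that forwards only metrics
// whose names match (or do not match) the "otel.sdk." prefix to the wrapped
// exporter. Temporality, Aggregation, ForceFlush and Shutdown are delegated
// to the embedded exporter.
type prefixFilterExporter struct {
	sdkmetric.Exporter
	keepHealthMetrics bool // true: keep only otel.sdk.*; false: drop otel.sdk.*
}

func (e prefixFilterExporter) Export(ctx context.Context, rm *metricdata.ResourceMetrics) error {
	filtered := metricdata.ResourceMetrics{Resource: rm.Resource}
	for _, sm := range rm.ScopeMetrics {
		kept := metricdata.ScopeMetrics{Scope: sm.Scope}
		for _, m := range sm.Metrics {
			if strings.HasPrefix(m.Name, "otel.sdk.") == e.keepHealthMetrics {
				kept.Metrics = append(kept.Metrics, m)
			}
		}
		if len(kept.Metrics) > 0 {
			filtered.ScopeMetrics = append(filtered.ScopeMetrics, kept)
		}
	}
	return e.Exporter.Export(ctx, &filtered)
}
```

A MeterProvider could then be configured with two periodic readers, one wrapping the health-telemetry exporter with keepHealthMetrics set to true and one wrapping the user-data exporter with it set to false.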

I'm not necessarily asking to add physical request metrics in this PR; I just would like to reserve otel.sdk.exporter.request.* for those, and use otel.sdk.exporter.batch.duration or otel.sdk.export.duration for the logical duration instead.

@lmolkova (Contributor) commented Mar 25, 2025

@JonasKunz

the export duration would measure this call:

https://github.com/open-telemetry/opentelemetry-go/blob/796596abbad67e221a89e225174715a74888101c/exporters/otlp/otlptrace/exporter.go#L36

`err := e.client.UploadTraces(ctx, protoSpans)`

There is no logic that re-batches things on retry - the retry happens within the scope of e.client.UploadTraces and attempts to re-send the same batch.

Am I missing something? Is there an SDK exporter that has some dynamic batching mechanism?

@JonasKunz (Contributor) commented

I wasn't talking about a concrete implementation, and I was also not thinking only about OTLP. But it is possible to provide an implementation where my example scenario can occur.

Anyway, this shouldn't be problematic as long as the metric is not a MUST. So I'd propose to just move ahead with the metric for the logical duration, but reserve otel.sdk.exporter.request.* for physical request metrics?

Labels: enhancement (New feature or request)

5 participants