-
Notifications
You must be signed in to change notification settings - Fork 214
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add http request duration to SDK metrics #2007
base: main
Are you sure you want to change the base?
Conversation
docs/otel/sdk-metrics.md
Outdated
|
||
| Name | Instrument Type | Unit (UCUM) | Description | Stability | | ||
| -------- | --------------- | ----------- | -------------- | --------- | | ||
| `otel.sdk.exporter.http.request.duration` | Histogram | `s` | The duration of any http request(s) made from the exporter during export. [1] |  | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At a high level, just want to understand why this wouldn't use the general HTTP semantic convention duration with a required label denoting it's the SDK exporter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
separate monitoring pipeline metrics could make it easier for ingestion services to handle them differently (e.g. maybe not consider them "user data", surface pipeline issues, etc)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope we can have a export call duration metrics that covers all tries and has a final export status - the logical rather than physical. So I'd also prefer to remove http
and would suggest to call it otel.sdk.export.duration
(otel.sdk.exporter.operation.duration
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Due to the change after this comment, the metric is now being used for all network based exporters instead of just http.
I have removed the http
.
covers all tries and has a final export status
This metric is meant to capture requests sent as a batch. I believe covering all retries constitutes knowing how each individual item in the batch has behaved until final export status. I don't know how that would translate to duration and also I don't think that's within the scope of a batch level metric. Perhaps that would be better suited for a separate metric, as our use case requires knowing the duration of requests on a batch level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope we can have a export call duration metrics that covers all tries and has a final export status - the logical rather than physical
IMO we should stay with otel.sdk.exporter.request.duration
and keep this about the physical request.
The problem with tracking the total duration including retries is that I don't think most implementations would treat retried batches and new batches separetly and might combine them onto a single, physical request. This makes it impossible to track the duration correctly with partial successes. I've given a detailed explanation and example in this comment.
Another benefit of having a duration per physical request that you also get the error.type
which caused the retry: e.g. was it a network error or did the receiver return 503 Temporarily Unavailable
?
I would suggest to add a separate metric if we want to track the "logical" request (= a batch of data to be exported). Something like otel.sdk.exporter.batch.duration
. If an exporter does not have partial success semantics OR retries batches in isolation, it can produce this metric. Otherwise we will still have otel.sdk.exporter.request.duration
, which also remains useful in combination with otel.sdk.exporter.batch.duration
due to giving insights on the physical requests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
patrial success is a terminal status code, it won't be retried - the exporter would need to create a new batch if they want to re-send it
The client MUST NOT retry the request when it receives a partial success response where the partial_success is populated.
from https://opentelemetry.io/docs/specs/otlp/#partial-success
We do have protocol-level metrics for HTTP and gRPC, we just need to let them do the work (not suppress metrics from exporters), the logical duration is a separate metrics, it can't be replaced with a physical ones.
otel.sdk.exporter.batch.duration
👍
or otel.sdk.export.duration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
patrial success is a terminal status code, it won't be retried - the exporter would need to create a new batch if they want to re-send it
The problem I'm thinking of is not partial-successes being retried. It is about requests with partial success containing a mix of retried and new data:
- A batch of spans is submitted to the exporter
- The request fails for a network reason
- In the meantime, a second batch of spans is submitted for export
- The second request is issued, containing the spans from both batches
- The request finishes with partial success, returning that 5 spans were rejected
In this example, it is not possible to identify if the rejected spans belong to the first or second batch, therefore it is not possible to correctly populate the error.type
for the batches.
I don't know if any exporter does this or whether this is forbidden, I just wanted to highlight that it might not be possible to correctly collect this metric in certain scenarios.
We do have protocol-level metrics for HTTP and gRPC, we just need to let them do the work (not suppress metrics from exporters), the logical duration is a separate metrics, it can't be replaced with a physical ones.
For that I'm with @trask :
separate monitoring pipeline metrics could make it easier for ingestion services to handle them differently (e.g. maybe not consider them "user data", surface pipeline issues, etc)
I think that telemetry about opentelemetry itself should be clearly separated from "user data" too. Some examples where not doing this would lead to problems:
-
A user has a HTTP-statistics dashboard for a mission-critical service. He now needs to somehow filter out the Opentelemetry-Exporter requests to not distort the data he is actually interested in. To do so, he needs deep deployment knowledge about the deployed OTLP-endpoints to filter them out correctly. Even worse, he might have to change his dashboard if the monitoring setup changes and e.g. collectors are deployed under different URLs.
-
It makes sense for mission critical environments to have a separate "user-data" telemetry backend and a health-telemetry backend. This way if your "user-data" backend is overloaded, you'll still have your health-telemetry to investigate the problem. If you have both in a single monitoring backend, you won't be able to access your health telemetry in case the backend is overloaded, which ironically is when you need those metrics most. Having the health metrics in well-defined, separated namespaces (e.g.
otel.sdk.*
) makes it easy to set this up: Just setup two metric exporters in your SDKs, one exporting justotel.sdk.*
to your health-telemetry backend and one exporting everything exceptotel.sdk.*
. If you have to seaprate metrics based on attribute values things get a lot more complicated.
I'm not asking to necessarily add physical request metrics in this PR, I just would like to reserve the otel.sdk.exporter.request.*
for those and use otel.sdk.exporter.batch.duration
or otel.sdk.export.duration
for the logical duration instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the export duration would measure this call:
err := e.client.UploadTraces(ctx, protoSpans)
there is no logic that re-batches things on retry - the retry happens within the scope of e.client.UploadTraces
and attempts to re-send the same batch.
Do I miss something? Is there an SDK exporter that has some dynamic batching mechanism?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't talking about a concrete implementation and also not only thinking about OTLP. But it is possible to provide an implementation where my example scenario can occur.
Anyway, this shouldn't be problematic as long as the metric is not a MUST. So I'd propose to just move ahead with the metric for the logic duration, but reserve otel.sdk.exporter.request.*
for physical request metrics?
…ventions into requests-duration
Related to #1631 and partially addresses #1906
Changes
Adds a metric to track the duration of http requests made within an exporter.
Following the discussion in this issue, I wasn't exactly sure how we want to provide more detail regarding partial success. The
error.type
can indicate that it was a partial success but we can possibly also include information about which items succeeded/failed somewhere (perhaps in another attribute).I also made the distinction of making this metric specific to
http
but am open to defining a generic one regardless of transport protocol (we can probably have protocol + response code as additional attributes if so).Looking forward to hearing your thoughts!
Merge requirement checklist
[chore]