
[BUG] Opensearch Dashboards. Otel + Jaeger vs DataPrepper. No errors statistics from DataPrepper perspective for review. #2320

Open · berezinsn opened this issue Jan 13, 2025 · 5 comments
Labels: bug, traces

Setup:
Otel agents -> Otel collector -> Jaeger / DataPrepper -> OpenSearch -> OpenSearch Dashboards

Versions:
Opensearch Helm Chart version: 2.27.1, appVersion: 2.18.0
Opensearch-Dashboards Helm Chart version: 2.25.0, appVersion: 2.18.0
Jaeger Helm Chart version: 3.3.3, appVersion: 1.53.0
DataPrepper Helm Chart version: 0.1.0, appVersion: 2.8.0

Describe the issue:
I have a setup with instrumented applications using OpenTelemetry (Otel) agents, which push traces to an Otel collector. The Otel collector sends data to both Jaeger and DataPrepper. However, I am noticing a difference in the behavior of the same traces when viewed in OpenSearch Dashboards depending on the data source selected (Jaeger vs. DataPrepper).

Specifically, when I select DataPrepper as the data source, I do not see the entire trace being marked as a trace with errors, and the errors are not displayed on the dashboard. In contrast, when using Jaeger as the data source, the errors are correctly visualized, and the entire trace is marked as an "error trace" if any span within the trace contains an error.
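For reference, the fan-out from the collector to both backends corresponds to an OpenTelemetry Collector exporter configuration along the lines of the sketch below. This is only a minimal illustration: the exporter names and endpoints are assumptions based on the setup described above, not my actual values (Data Prepper's otel_trace_source listens on port 21890 by default).

    exporters:
      otlp/jaeger:
        endpoint: jaeger-collector.jaeger.svc.cluster.local:4317      # assumed service address
        tls:
          insecure: true
      otlp/data-prepper:
        endpoint: data-prepper.data-prepper.svc.cluster.local:21890   # assumed service address
        tls:
          insecure: true

    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [otlp/jaeger, otlp/data-prepper]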

Configuration:
Jaeger:

jaeger:
  agent:
    enabled: false
  provisionDataStore:
    cassandra: false
    elasticsearch: false
  collector:
    enabled: true
    annotations: {}
    image:
      registry: ""
      repository: jaegertracing/jaeger-collector
      tag: ""
      digest: ""
    envFrom: []
    cmdlineParams: {}
    basePath: /
    replicaCount: 1
    service:
      otlp:
        grpc:
          name: "otlp-grpc"
          port: 4317
        http:
          name: "otlp-http"
          port: 4318
    serviceAccount:
      create: true
  storage:
    type: elasticsearch
    elasticsearch:
      scheme: http
      host: opensearch-cluster-master.opensearch-otel.svc.cluster.local
      port: 9200
      anonymous: true
      usePassword: false
      extraEnv:  # parent key missing in the original paste; "extraEnv" is assumed here to keep the YAML valid
        - name: SPAN_STORAGE_TYPE
          value: "opensearch"
        - name: ES_TAGS_AS_FIELDS_ALL
          value: "true"
      tls:
        enabled: false

DataPrepper:

    config:
      otel-trace-pipeline:
        delay: "1000"
        source:
          otel_trace_source:
            ssl: false
        buffer:
          bounded_blocking:
            buffer_size: 10240
            batch_size: 160
        sink:
          - pipeline:
              name: "raw-traces-pipeline"
          - pipeline:
              name: "otel-service-map-pipeline"
      raw-traces-pipeline:
        source:
          pipeline:
            name: "otel-trace-pipeline"
        buffer:
          bounded_blocking:
            buffer_size: 10240
            batch_size: 160
        processor:
          - otel_trace_raw:
          - otel_trace_group:
              hosts: [ "http://opensearch-cluster-master:9200" ]
              insecure: true
        sink:
          - opensearch:
              hosts: [ "http://opensearch-cluster-master:9200" ]
              insecure: true
              index_type: trace-analytics-raw
      otel-service-map-pipeline:
        delay: "1000"
        source:
          pipeline:
            name: "otel-trace-pipeline"
        buffer:
          bounded_blocking:
            buffer_size: 10240
            batch_size: 160
        processor:
          - service_map_stateful:
              window_duration: 300
        sink:
          - opensearch:
              hosts: [ "http://opensearch-cluster-master:9200" ]
              insecure: true
              index_type: trace-analytics-service-map
              index: otel-v1-apm-span-%{yyyy.MM.dd}
              #max_retries: 20
              bulk_size: 4

Relevant Logs or Screenshots:
DataPrepper source: an error is present in a span, but the trace as a whole is not marked as an error trace, and no error statistics are shown.
Screenshot 2024-12-24 at 16 31 59
Screenshot 2024-12-24 at 16 30 49

Jaeger source: the error is visible in the span and the whole trace is marked as an error trace (in the top-right corner of the next capture).
Screenshot 2024-12-24 at 16 31 43
Screenshot 2024-12-24 at 16 31 07

Please share your suggestions on how to fix it. TraceID is the same for both cases.
Thanks

berezinsn added the bug and untriaged labels on Jan 13, 2025
ps48 transferred this issue from opensearch-project/observability on Jan 28, 2025
ps48 added the traces label and removed the untriaged label on Jan 28, 2025
ps48 (Member) commented Jan 28, 2025

@berezinsn Thanks for reporting this issue. I'll take a stab at replicating it this week and look into the root cause.

ps48 (Member) commented Jan 28, 2025

I was able to replicate the issue with our sanity test data:

[image attached]

ps48 (Member) commented Jan 28, 2025

@berezinsn I was able to look into this more deeply. The discrepancy comes from how the traces table defines an "error trace" in each experience:

  • The Jaeger UI experience marks a trace as an error trace if any span in the trace has an error.
  • The Data Prepper UI experience marks a trace as an error trace only if the error bubbles up to the parent (trace group) span, i.e. the root span of the trace itself has an error status.

Sample span for reference:

  {
    "_index": "otel-v1-apm-span-000001",
    "_id": "01c87a25f18fb004",
    "_score": 2.8322158,
    "_source": {
      "traceId": "4fa04f117be100f476b175e41096e736",
      "spanId": "01c87a25f18fb004",
      "traceState": "",
      "parentSpanId": "a178d4084436e2ba",
      "name": "update_inventory",
      "kind": "SPAN_KIND_SERVER",
      "startTime": "2021-03-25T17:23:30.095432704Z",
      "endTime": "2021-03-25T17:23:30.125712640Z",
      "durationInNanos": 30279936,
      "serviceName": "inventory",
      "events": [],
      "links": [],
      "droppedAttributesCount": 0,
      "droppedEventsCount": 0,
      "droppedLinksCount": 0,
      "traceGroup": "client_checkout",
      "traceGroupFields.endTime": "2021-03-25T17:23:30.481628416Z",
      "traceGroupFields.statusCode": 0,
      "traceGroupFields.durationInNanos": 393149952,
      "span.attributes.net@peer@ip": "127.0.0.1",
      "instrumentationLibrary.version": "0.14b0",
      "resource.attributes.telemetry@sdk@language": "python",
      "span.attributes.host@port": 8082,
      "span.attributes.http@status_text": "SERVICE UNAVAILABLE",
      "resource.attributes.telemetry@sdk@version": "0.14b0",
      "resource.attributes.service@instance@id": "140016202633168",
      "resource.attributes.service@name": "inventory",
      "span.attributes.component": "http",
      "status.code": 2,
      "instrumentationLibrary.name": "opentelemetry.instrumentation.flask",
      "span.attributes.http@method": "POST",
      "span.attributes.http@user_agent": "python-requests/2.25.1",
      "span.attributes.net@peer@port": 45720,
      "resource.attributes.telemetry@sdk@name": "opentelemetry",
      "span.attributes.http@server_name": "0.0.0.0",
      "span.attributes.http@route": "/update_inventory",
      "span.attributes.http@host": "localhost:8082",
      "span.attributes.http@target": "/update_inventory",
      "span.attributes.http@scheme": "http",
      "resource.attributes.host@hostname": "ip-172-31-10-8.us-west-2.compute.internal",
      "span.attributes.http@flavor": "1.1",
      "span.attributes.http@status_code": 503
    }
  },

In the example above, the Data Prepper experience uses the traceGroupFields.statusCode field to decide whether a trace has an error, whereas the Jaeger experience uses the per-span status.code field. More details on these fields are available here: https://github.com/opensearch-project/data-prepper/blob/main/docs/schemas/trace-analytics/otel-v1-apm-span-index-template.md
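To make the distinction concrete, the two definitions roughly correspond to the following queries against the otel-v1-apm-span-* indices (a minimal sketch; the field names come from the sample document above, and status code 2 is the OTLP STATUS_CODE_ERROR value):

    # Jaeger-style definition: any span in the trace has an error status
    GET otel-v1-apm-span-*/_search
    {
      "query": { "term": { "status.code": 2 } }
    }

    # Data Prepper-style definition: the error has bubbled up to the trace group (root span)
    GET otel-v1-apm-span-*/_search
    {
      "query": { "term": { "traceGroupFields.statusCode": 2 } }
    }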

ps48 self-assigned this on Jan 28, 2025
berezinsn (Author) commented

Thank you for your response.

Just to clarify: would it be possible to mark the entire trace as erroneous if any of its child spans contains an error, as shown in the example? Or is the current behavior intended by design?

This impacts error statistics: I see numerous traces with errors in child spans, yet the error statistics still appear empty.

ps48 (Member) commented Feb 3, 2025

This is by design for now. But as you mentioned, there is a contradictory experience between what you see with Jaeger vs. Data Prepper as the source. We'll work with the team to see how we can make these experiences consistent, or at least make the product choices clearer.
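In the meantime, one possible workaround sketch for getting trace-level error counts directly from the raw span data is to aggregate the distinct traceIds of spans whose own status is an error. This is not an officially supported query; it assumes the default otel-v1-apm-span-* indices and the integer status.code field shown in the sample above:

    # Counts how many distinct traces contain at least one error span
    GET otel-v1-apm-span-*/_search
    {
      "size": 0,
      "query": { "term": { "status.code": 2 } },
      "aggs": {
        "traces_with_error_spans": {
          "cardinality": { "field": "traceId" }
        }
      }
    }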
