
[BUG] Opensearch Dashboards. Otel + Jaeger vs DataPrepper. No errors statistics from DataPrepper perspective for review. #2320

Open · berezinsn opened this issue Jan 13, 2025 · 5 comments
Labels: bug, traces

Setup:
Otel agents -> Otel collector -> Jaeger / DataPrepper -> OpenSearch -> OpenSearch Dashboards

Versions:
Opensearch Helm Chart version: 2.27.1, appVersion: 2.18.0
Opensearch-Dashboards Helm Chart version: 2.25.0, appVersion: 2.18.0
Jaeger Helm Chart version: 3.3.3, appVersion: 1.53.0
DataPrepper Helm Chart version: 0.1.0, appVersion: 2.8.0

Describe the issue:
I have a setup with instrumented applications using OpenTelemetry (Otel) agents, which push traces to an Otel collector. The Otel collector sends data to both Jaeger and DataPrepper. However, I am noticing a difference in the behavior of the same traces when viewed in OpenSearch Dashboards depending on the data source selected (Jaeger vs. DataPrepper).

Specifically, when I select DataPrepper as the data source, I do not see the entire trace being marked as a trace with errors, and the errors are not displayed on the dashboard. In contrast, when using Jaeger as the data source, the errors are correctly visualized, and the entire trace is marked as an "error trace" if any span within the trace contains an error.
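For reference, the fan-out from the collector to both backends corresponds to an OpenTelemetry Collector exporter configuration along the lines of the sketch below. This is only a minimal illustration: the exporter names and endpoints are assumptions based on the setup described above, not my actual values (Data Prepper's otel_trace_source listens on port 21890 by default).

    exporters:
      otlp/jaeger:
        endpoint: jaeger-collector.jaeger.svc.cluster.local:4317      # assumed service address
        tls:
          insecure: true
      otlp/data-prepper:
        endpoint: data-prepper.data-prepper.svc.cluster.local:21890   # assumed service address
        tls:
          insecure: true

    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [otlp/jaeger, otlp/data-prepper]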

Configuration:
Jaeger:

jaeger:
  agent:
    enabled: false
  provisionDataStore:
    cassandra: false
    elasticsearch: false
  collector:
    enabled: true
    annotations: {}
    image:
      registry: ""
      repository: jaegertracing/jaeger-collector
      tag: ""
      digest: ""
    envFrom: []
    cmdlineParams: {}
    basePath: /
    replicaCount: 1
    service:
      otlp:
        grpc:
          name: "otlp-grpc"
          port: 4317
        http:
          name: "otlp-http"
          port: 4318
    serviceAccount:
      create: true
  storage:
    type: elasticsearch
    elasticsearch:
      scheme: http
      host: opensearch-cluster-master.opensearch-otel.svc.cluster.local
      port: 9200
      anonymous: true
      usePassword: false
      extraEnv:  # parent key missing in the original paste; "extraEnv" is assumed here to keep the YAML valid
        - name: SPAN_STORAGE_TYPE
          value: "opensearch"
        - name: ES_TAGS_AS_FIELDS_ALL
          value: "true"
      tls:
        enabled: false

DataPrepper:

    config:
      otel-trace-pipeline:
        delay: "1000"
        source:
          otel_trace_source:
            ssl: false
        buffer:
          bounded_blocking:
            buffer_size: 10240
            batch_size: 160
        sink:
          - pipeline:
              name: "raw-traces-pipeline"
          - pipeline:
              name: "otel-service-map-pipeline"
      raw-traces-pipeline:
        source:
          pipeline:
            name: "otel-trace-pipeline"
        buffer:
          bounded_blocking:
            buffer_size: 10240
            batch_size: 160
        processor:
          - otel_trace_raw:
          - otel_trace_group:
              hosts: [ "http://opensearch-cluster-master:9200" ]
              insecure: true
        sink:
          - opensearch:
              hosts: [ "http://opensearch-cluster-master:9200" ]
              insecure: true
              index_type: trace-analytics-raw
      otel-service-map-pipeline:
        delay: "1000"
        source:
          pipeline:
            name: "otel-trace-pipeline"
        buffer:
          bounded_blocking:
            buffer_size: 10240
            batch_size: 160
        processor:
          - service_map_stateful:
              window_duration: 300
        sink:
          - opensearch:
              hosts: [ "http://opensearch-cluster-master:9200" ]
              insecure: true
              index_type: trace-analytics-service-map
              index: otel-v1-apm-span-%{yyyy.MM.dd}
              #max_retries: 20
              bulk_size: 4

Relevant Logs or Screenshots:
DataPrepper source: an error is present in a span, but the trace as a whole is not marked as an error trace, and no error statistics are shown.
Screenshot 2024-12-24 at 16 31 59
Screenshot 2024-12-24 at 16 30 49

Jaeger source: the error is visible in the span and the whole trace is marked as an error trace (in the top-right corner of the next capture).
Screenshot 2024-12-24 at 16 31 43
Screenshot 2024-12-24 at 16 31 07

Please share your suggestions on how to fix it. TraceID is the same for both cases.
Thanks

berezinsn added the bug and untriaged labels on Jan 13, 2025
ps48 transferred this issue from opensearch-project/observability on Jan 28, 2025
ps48 added the traces label and removed the untriaged label on Jan 28, 2025
ps48 (Member) commented Jan 28, 2025

@berezinsn Thanks for reporting this issue. I'll take a stab at replicating it this week and look into the root cause.

ps48 (Member) commented Jan 28, 2025

I was able to replicate the issue with our sanity test data:

[image attached]

ps48 (Member) commented Jan 28, 2025

@berezinsn I was able to look into this more deeply. The discrepancy comes from how the traces table defines an "error trace" in each experience:

  • The Jaeger UI experience marks a trace as an error trace if any span in the trace has an error.
  • The Data Prepper UI experience marks a trace as an error trace only if the error bubbles up to the parent (trace group) span, i.e. the root span of the trace itself has an error status.

Sample span for reference:

  {
    "_index": "otel-v1-apm-span-000001",
    "_id": "01c87a25f18fb004",
    "_score": 2.8322158,
    "_source": {
      "traceId": "4fa04f117be100f476b175e41096e736",
      "spanId": "01c87a25f18fb004",
      "traceState": "",
      "parentSpanId": "a178d4084436e2ba",
      "name": "update_inventory",
      "kind": "SPAN_KIND_SERVER",
      "startTime": "2021-03-25T17:23:30.095432704Z",
      "endTime": "2021-03-25T17:23:30.125712640Z",
      "durationInNanos": 30279936,
      "serviceName": "inventory",
      "events": [],
      "links": [],
      "droppedAttributesCount": 0,
      "droppedEventsCount": 0,
      "droppedLinksCount": 0,
      "traceGroup": "client_checkout",
      "traceGroupFields.endTime": "2021-03-25T17:23:30.481628416Z",
      "traceGroupFields.statusCode": 0,
      "traceGroupFields.durationInNanos": 393149952,
      "span.attributes.net@peer@ip": "127.0.0.1",
      "instrumentationLibrary.version": "0.14b0",
      "resource.attributes.telemetry@sdk@language": "python",
      "span.attributes.host@port": 8082,
      "span.attributes.http@status_text": "SERVICE UNAVAILABLE",
      "resource.attributes.telemetry@sdk@version": "0.14b0",
      "resource.attributes.service@instance@id": "140016202633168",
      "resource.attributes.service@name": "inventory",
      "span.attributes.component": "http",
      "status.code": 2,
      "instrumentationLibrary.name": "opentelemetry.instrumentation.flask",
      "span.attributes.http@method": "POST",
      "span.attributes.http@user_agent": "python-requests/2.25.1",
      "span.attributes.net@peer@port": 45720,
      "resource.attributes.telemetry@sdk@name": "opentelemetry",
      "span.attributes.http@server_name": "0.0.0.0",
      "span.attributes.http@route": "/update_inventory",
      "span.attributes.http@host": "localhost:8082",
      "span.attributes.http@target": "/update_inventory",
      "span.attributes.http@scheme": "http",
      "resource.attributes.host@hostname": "ip-172-31-10-8.us-west-2.compute.internal",
      "span.attributes.http@flavor": "1.1",
      "span.attributes.http@status_code": 503
    }
  },

In the example above, the Data Prepper experience uses the traceGroupFields.statusCode field to decide whether a trace has an error, whereas the Jaeger experience uses the per-span status.code field. More details on these fields are available here: https://github.com/opensearch-project/data-prepper/blob/main/docs/schemas/trace-analytics/otel-v1-apm-span-index-template.md
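To make the distinction concrete, the two definitions roughly correspond to the following queries against the otel-v1-apm-span-* indices (a minimal sketch; the field names come from the sample document above, and status code 2 is the OTLP STATUS_CODE_ERROR value):

    # Jaeger-style definition: any span in the trace has an error status
    GET otel-v1-apm-span-*/_search
    {
      "query": { "term": { "status.code": 2 } }
    }

    # Data Prepper-style definition: the error has bubbled up to the trace group (root span)
    GET otel-v1-apm-span-*/_search
    {
      "query": { "term": { "traceGroupFields.statusCode": 2 } }
    }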

ps48 self-assigned this on Jan 28, 2025
berezinsn (Author) commented

Thank you for your response.

Just to clarify: would it be possible to mark the entire trace as erroneous if any of its child spans contains an error, as shown in the example? Or is the current behavior intended by design?

This impacts error statistics: I see numerous traces with errors in child spans, yet the error statistics still appear empty.

ps48 (Member) commented Feb 3, 2025

This is by design for now. But as you mentioned, there is a contradictory experience between what you see with Jaeger vs. Data Prepper as the source. We'll work with the team to see how we can make these experiences consistent, or at least make the product choices clearer.
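In the meantime, one possible workaround sketch for getting trace-level error counts directly from the raw span data is to aggregate the distinct traceIds of spans whose own status is an error. This is not an officially supported query; it assumes the default otel-v1-apm-span-* indices and the integer status.code field shown in the sample above:

    # Counts how many distinct traces contain at least one error span
    GET otel-v1-apm-span-*/_search
    {
      "size": 0,
      "query": { "term": { "status.code": 2 } },
      "aggs": {
        "traces_with_error_spans": {
          "cardinality": { "field": "traceId" }
        }
      }
    }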
