feat: add design doc for otel tracing #913

Open
tanishagoyal2 wants to merge 2 commits into NVIDIA:main from tanishagoyal2:otel-traces-design-doc

@tanishagoyal2 tanishagoyal2 commented Feb 26, 2026

Summary

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Summary by CodeRabbit

  • Documentation
    • Added comprehensive OpenTelemetry tracing design: end-to-end cross-service trace continuity, span naming and attributes per module, guidance for propagating trace context, exporter/gateway configuration examples (including auth), and instructions for monitoring and viewing traces in dashboards.

Signed-off-by: Tanisha goyal <tanishag@nvidia.com>
coderabbitai bot commented Feb 26, 2026

Walkthrough

Adds a new OpenTelemetry tracing design document for NVSentinel that defines trace/span concepts, root trace creation and cross-module trace_id propagation (stored in MongoDB), per-module span schemas, OTLP export via an Alloy gateway to Panoptes, and configuration and implementation guidance for tracing and visualization.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Design doc: `docs/designs/028-OTEL-traces.md` | New design document (~423 lines) describing OpenTelemetry tracing for NVSentinel: trace/span concepts, root-trace creation at Platform-Connector, `trace_id` stored as a MongoDB top-level field, per-module and per-operation span naming, span creation/propagation guidelines, example Go struct additions (`HealthEventWithStatus.TraceID`), OTLP exporter/batch/retry path via Alloy → Panoptes (OAuth2 config examples), and Grafana Tempo visualization notes. |

Sequence Diagram(s)

sequenceDiagram
    participant Module as NVSentinel Module
    participant Alloy as Alloy Gateway
    participant Panoptes as Panoptes Collector
    participant Tempo as Grafana Tempo

    Module->>Module: create root trace / spans (attach trace_id)
    Module->>Alloy: export OTLP (batch/retry, OAuth2) with trace data
    Alloy->>Panoptes: forward OTLP
    Panoptes->>Tempo: store/ingest traces
    Tempo->>User: visualize trace

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Poem

🐰 I nibble on traces, hop and write,

Root to span in silver light.
Trace_id tucked in Mongo's nest,
Across the modules, stitched and dressed.
Hop along — the warren's traced just right. 🥕

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Title check | ✅ Passed | The title 'feat: add design doc for otel tracing' clearly and concisely summarizes the main change, adding a new design document for OpenTelemetry tracing, which matches the changeset that adds `docs/designs/028-OTEL-traces.md`. |



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/designs/028-OTEL-traces.md`:
- Around line 209-257: The section title and language imply multiple propagation
options but the doc only defines a single approach; rename the heading "Trace
Context Propagation Options" to "Trace Context Propagation Approach" (or similar
singular form) and update any phrases that suggest choice like "depends on which
trace context propagation option you choose" to singular wording such as
"depends on this approach" or remove the conditional wording; ensure references
to the implemented design (HealthEventWithStatus.TraceID / `trace_id` top-level
field and the platform_connector.receive_event flow) remain accurate and
consistent with the singular framing.
- Around line 280-293: The flow diagram in the "Platform Connector" section
duplicates the trace creation: remove the redundant Step 5 ("Start new trace
(trace_id: abc123)") or replace it with a clarified different action if intended
(e.g., "Continue existing trace" or "Start root span"), so that only one
explicit trace-creation step remains—keep Step 2 ("Create NEW trace (trace_id:
abc123)") as the single trace creation point and ensure the root span step
("Create root span: 'platform_connector.receive_event'") follows it.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 26a91df and 190d1cf.

📒 Files selected for processing (1)
  • docs/designs/028-OTEL-traces.md


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/designs/028-OTEL-traces.md`:
- Line 209: Rename the duplicate H2 heading "Trace Context Propagation" (the
second occurrence that contains implementation details) to a distinct label such
as "Trace Context Propagation — Implementation Details" or "Trace Context
Propagation (Implementation)", by updating the Markdown heading text so readers
can distinguish the conceptual section from the implementation/mongo field/code
changes section; search for the heading string "Trace Context Propagation" and
update the second occurrence accordingly.
- Line 24: Fix the typo in the "Circuit breaker activity" line: replace "triped"
with "triggered" (or "tripped" if preferred) in the documentation sentence
"Circuit breaker activity: Monitor when circuit breaker is triped" so it reads
"Circuit breaker activity: Monitor when circuit breaker is triggered" to correct
the spelling and intent.

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 190d1cf and 458b785.

📒 Files selected for processing (1)
  • docs/designs/028-OTEL-traces.md

Signed-off-by: Tanisha goyal <tanishag@nvidia.com>
@tanishagoyal2 tanishagoyal2 changed the title from "chore: add design doc for otel tracing" to "feat: add design doc for otel tracing" Feb 26, 2026

**Trace flow summary:**

1. **Ingestion**: Each NVSentinel module exports spans via OTLP over gRPC to `dgxc-alloy-gateway.dgxc-alloy.svc.cluster.local:4317`. No authentication is required from NVSentinel to the gateway (in-cluster, `OTEL_EXPORTER_OTLP_INSECURE=true`).

Is Alloy deployed as a DaemonSet or a Deployment in the cluster? Will each node export traces to a single Alloy Deployment pod in the cluster, or to its local DaemonSet pod?

@lalitadithya
Collaborator

I'm not quite sure why we want to couple this to Alloy; not every customer will be using Alloy. IIRC there is an open standard for tracing that we should be using so that any consumer can consume traces.
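
One way to picture the decoupling discussed in this thread is the rough sketch below, which reads the exporter target from environment variables instead of hard-coding a gateway. `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_EXPORTER_OTLP_INSECURE` are standard OpenTelemetry variable names (the latter is quoted in the ingestion step above). The `ExporterConfig` type, the `LoadExporterConfig` function, and the `localhost:4317` fallback are assumptions made for illustration, not NVSentinel code.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// ExporterConfig holds the two settings the ingestion step mentions.
// Hypothetical type for illustration only.
type ExporterConfig struct {
	Endpoint string // host:port of any OTLP/gRPC receiver (Alloy or otherwise)
	Insecure bool   // true means no TLS between the module and the receiver
}

// LoadExporterConfig resolves the config through the supplied lookup function
// so that it can be exercised without touching the process environment.
func LoadExporterConfig(getenv func(string) string) ExporterConfig {
	cfg := ExporterConfig{Endpoint: "localhost:4317"} // assumed fallback
	if v := getenv("OTEL_EXPORTER_OTLP_ENDPOINT"); v != "" {
		cfg.Endpoint = v
	}
	if v := getenv("OTEL_EXPORTER_OTLP_INSECURE"); v != "" {
		// Unparseable values are ignored and leave Insecure at false.
		if b, err := strconv.ParseBool(v); err == nil {
			cfg.Insecure = b
		}
	}
	return cfg
}

func main() {
	cfg := LoadExporterConfig(os.Getenv)
	fmt.Printf("endpoint=%s insecure=%t\n", cfg.Endpoint, cfg.Insecure)
}
```

With this shape, pointing the modules at `dgxc-alloy-gateway.dgxc-alloy.svc.cluster.local:4317` becomes a deployment setting, and a cluster without Alloy could set the same variable to any other OTLP-compatible collector.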
