feat: add design doc for otel tracing #913
Conversation
Signed-off-by: Tanisha goyal <tanishag@nvidia.com>
📝 Walkthrough

Adds a new OpenTelemetry tracing design document for NVSentinel that defines trace/span concepts, root trace creation and cross-module trace_id propagation (stored in MongoDB), per-module span schemas, OTLP export via an Alloy gateway to Panoptes, and configuration and implementation guidance for tracing and visualization.
**Sequence Diagram(s)**

```mermaid
sequenceDiagram
    participant Module as NVSentinel Module
    participant Alloy as Alloy Gateway
    participant Panoptes as Panoptes Collector
    participant Tempo as Grafana Tempo
    Module->>Module: create root trace / spans (attach trace_id)
    Module->>Alloy: export OTLP (batch/retry, OAuth2) with trace data
    Alloy->>Panoptes: forward OTLP
    Panoptes->>Tempo: store/ingest traces
    Tempo->>User: visualize trace
```
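The export path in the sequence diagram above can be sketched as a plain-Python simulation. This is a hedged illustration only: the function names (`module_emit`, `alloy_forward`, `tempo_ingest`) are hypothetical stand-ins for the Module → Alloy → Panoptes/Tempo stages, not NVSentinel APIs, and batching, retry, and OAuth2 are deliberately omitted.

```python
import uuid

def new_trace_id() -> str:
    # 128-bit trace id rendered as 32 hex chars (OTel convention)
    return uuid.uuid4().hex

def module_emit(trace_id: str, name: str) -> dict:
    # A module creates a span carrying the shared trace_id
    return {"trace_id": trace_id, "name": name, "span_id": uuid.uuid4().hex[:16]}

def alloy_forward(spans: list) -> list:
    # Alloy acts as a pass-through gateway here (batch/retry/OAuth2 omitted)
    return list(spans)

def tempo_ingest(store: dict, spans: list) -> None:
    # Tempo groups spans by trace_id so a whole trace can be visualized
    for s in spans:
        store.setdefault(s["trace_id"], []).append(s)

trace_id = new_trace_id()
spans = [module_emit(trace_id, "platform_connector.receive_event")]
store = {}
tempo_ingest(store, alloy_forward(spans))
print(len(store[trace_id]))  # → 1
```

The key property the diagram relies on is that every hop preserves the same `trace_id`, so the final store can reassemble the full trace.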
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~2 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/designs/028-OTEL-traces.md`:
- Around line 209-257: The section title and language imply multiple propagation
options but the doc only defines a single approach; rename the heading "Trace
Context Propagation Options" to "Trace Context Propagation Approach" (or similar
singular form) and update any phrases that suggest choice like "depends on which
trace context propagation option you choose" to singular wording such as
"depends on this approach" or remove the conditional wording; ensure references
to the implemented design (HealthEventWithStatus.TraceID / `trace_id` top-level
field and the platform_connector.receive_event flow) remain accurate and
consistent with the singular framing.
- Around line 280-293: The flow diagram in the "Platform Connector" section
duplicates the trace creation: remove the redundant Step 5 ("Start new trace
(trace_id: abc123)") or replace it with a clarified different action if intended
(e.g., "Continue existing trace" or "Start root span"), so that only one
explicit trace-creation step remains—keep Step 2 ("Create NEW trace (trace_id:
abc123)") as the single trace creation point and ensure the root span step
("Create root span: 'platform_connector.receive_event'") follows it.
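The fix the review asks for — a single trace-creation point in `platform_connector.receive_event`, with downstream modules continuing the existing trace — can be sketched as follows. This is an assumption-laden illustration: the event dict stands in for the MongoDB document carrying the `trace_id` field, and `downstream_module` is a hypothetical consumer, not actual NVSentinel code.

```python
import uuid

def receive_event(event: dict) -> dict:
    # Single trace-creation point: mint a trace_id only if one is absent
    if "trace_id" not in event:
        event["trace_id"] = uuid.uuid4().hex
    # Root span for this module (span parent/child linkage omitted)
    event["root_span"] = "platform_connector.receive_event"
    return event

def downstream_module(event: dict) -> str:
    # Continue the existing trace rather than starting a new one
    return event["trace_id"]

evt = receive_event({"node": "gpu-node-1"})
print(evt["root_span"])  # → platform_connector.receive_event
```

Guarding trace creation behind the `if "trace_id" not in event` check is what removes the duplicated "start new trace" step the comment flags: re-entering the flow continues the trace instead of minting a second id.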
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/designs/028-OTEL-traces.md`:
- Line 209: Rename the duplicate H2 heading "Trace Context Propagation" (the
second occurrence that contains implementation details) to a distinct label such
as "Trace Context Propagation — Implementation Details" or "Trace Context
Propagation (Implementation)", by updating the Markdown heading text so readers
can distinguish the conceptual section from the implementation/mongo field/code
changes section; search for the heading string "Trace Context Propagation" and
update the second occurrence accordingly.
- Line 24: Fix the typo in the "Circuit breaker activity" line: replace "triped"
with "triggered" (or "tripped" if preferred) in the documentation sentence
"Circuit breaker activity: Monitor when circuit breaker is triped" so it reads
"Circuit breaker activity: Monitor when circuit breaker is triggered" to correct
the spelling and intent.
Signed-off-by: Tanisha goyal <tanishag@nvidia.com>
Force-pushed from 458b785 to 27b9d95 (compare)
> **Trace flow summary:**
>
> 1. **Ingestion**: Each NVSentinel module exports spans via OTLP over gRPC to `dgxc-alloy-gateway.dgxc-alloy.svc.cluster.local:4317`. No authentication is required from NVSentinel to the gateway (in-cluster, `OTEL_EXPORTER_OTLP_INSECURE=true`).
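The quoted ingestion step configures exporters through the standard `OTEL_EXPORTER_OTLP_*` environment variables. A minimal, stdlib-only sketch of resolving that configuration is shown below; the `otlp_exporter_config` helper and the dict it returns are hypothetical (a real deployment would hand these values to an OTel SDK exporter), while the endpoint default mirrors the address quoted in the design text.

```python
# Default endpoint taken from the quoted design text (in-cluster Alloy gateway)
DEFAULT_ENDPOINT = "dgxc-alloy-gateway.dgxc-alloy.svc.cluster.local:4317"

def otlp_exporter_config(env: dict) -> dict:
    # OTEL_EXPORTER_OTLP_ENDPOINT / OTEL_EXPORTER_OTLP_INSECURE are the
    # standard OTel environment variables; everything else here is illustrative.
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", DEFAULT_ENDPOINT)
    insecure = env.get("OTEL_EXPORTER_OTLP_INSECURE", "false").lower() == "true"
    return {"endpoint": endpoint, "insecure": insecure, "protocol": "grpc"}

cfg = otlp_exporter_config({"OTEL_EXPORTER_OTLP_INSECURE": "true"})
print(cfg["insecure"], cfg["endpoint"])
```

Setting `OTEL_EXPORTER_OTLP_INSECURE=true` is what the doc relies on for the in-cluster hop, since no authentication is required between NVSentinel and the gateway.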
Is Alloy deployed as a daemonset or a deployment in the cluster? Will each node export traces to a single Alloy deployment pod in the cluster, or to its local daemonset pod?
I'm not quite sure why we want to couple this to Alloy; not every customer will be using Alloy. IIRC there is an open standard for tracing that we should be using, so that any consumer can consume traces.
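The open standards the comment is likely referring to are OTLP (the vendor-neutral export protocol already used in the design) and W3C Trace Context, whose `traceparent` header carries trace identity across process boundaries regardless of which backend consumes it. A small stdlib-only sketch of forming and parsing that header, under the assumption that these hand-rolled helpers stand in for what an OTel SDK propagator would do:

```python
import re
import secrets

def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    # W3C Trace Context format: version "00", 32-hex trace id,
    # 16-hex parent span id, 2-hex flags (01 = sampled)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def parse_traceparent(header: str) -> dict:
    m = re.fullmatch(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})", header)
    if not m:
        raise ValueError("malformed traceparent")
    return {"trace_id": m.group(1), "span_id": m.group(2), "flags": m.group(3)}

tid, sid = secrets.token_hex(16), secrets.token_hex(8)
hdr = make_traceparent(tid, sid)
print(hdr.count("-"))  # → 3
```

Because any OTLP-speaking collector understands this context, exporting OTLP with W3C propagation keeps the pipeline consumer-agnostic even if Alloy is swapped out.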