Skip to content

feat: add spec.traces and spec.logs for OTel forwarding#15

Merged
sallyom merged 30 commits into
mainfrom
feat/diagnostics-otel
Jun 26, 2026
Merged

feat: add spec.traces and spec.logs for OTel forwarding#15
sallyom merged 30 commits into
mainfrom
feat/diagnostics-otel

Conversation

@cooktheryan

@cooktheryan cooktheryan commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Adds spec.traces and spec.logs as top-level CRD fields alongside the existing spec.metrics, enabling OpenTelemetry trace and log forwarding through the OTel Collector sidecar to external OTLP endpoints
  • Refactors the static OTel Collector config into a dynamic builder that generates pipelines per signal (metrics/traces/logs), supporting split endpoints for traces vs logs
  • Auto-installs @openclaw/diagnostics-otel and @openclaw/diagnostics-prometheus plugins when the corresponding signals are enabled
  • Injects OTel resource attributes (OTEL_SERVICE_NAME, service.instance.id, k8s.namespace.name) via Downward API for multi-instance discrimination
  • Configures parentbased_traceidratio sampler with user-configurable sampling ratio

Example CR

spec:
  metrics:
    enabled: true
  traces:
    enabled: true
    endpoint: "http://otel-collector.observability.svc:4318"
    samplingRatio: "0.1"
  logs:
    enabled: true
    # endpoint defaults to traces endpoint when omitted

Test plan

  • make generate — deepcopy generated for new TracesSpec and LogsSpec types
  • make manifests — CRD schema includes traces and logs fields
  • make lint — 0 issues (extracted injectObservabilityResources to stay under cyclomatic complexity limit)
  • make test — all tests pass including new diagnostics tests
  • Deploy to cluster with traces endpoint pointed at an OTel Collector, verify spans arrive
  • Deploy with both metrics and traces, verify sidecar serves Prometheus on :9464 and forwards OTLP

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Added CRD support for configuring traces and logs (with logs defaulting to the traces OTLP endpoint).
    • Added sampling ratio configuration for traces.
    • When signals are enabled, the Deployment now injects the required OpenTelemetry env vars and configures telemetry egress access.
  • Bug Fixes
    • Removed the previous metrics sidecar approach and standardized on a shared OTLP endpoint for traces/logs/metrics.
    • Preserves any user-provided OpenTelemetry endpoint settings.
  • Documentation
    • Expanded the observability runbook with end-to-end setup and verification, plus optional persistent log storage.
  • Tests
    • Updated and expanded coverage for CRD behavior, OTEL injection, network policy egress, and plugin selection.

Add traces and logs as top-level spec fields alongside metrics,
enabling OpenTelemetry trace and log forwarding through the existing
OTel Collector sidecar to external OTLP endpoints. Auto-installs
diagnostics plugins and injects OTel resource attributes for
multi-instance discrimination.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Claw now models traces and logs as observability inputs, updates the CRD and deepcopy generation, and rewires reconciliation to inject diagnostics config, OTEL env vars, plugin requirements, and egress rules. Metrics logic no longer configures a collector sidecar, and the observability guide and tests were updated.

Changes

Observability signals

Layer / File(s) Summary
Schema and deepcopy
api/v1alpha1/claw_types.go, api/v1alpha1/zz_generated.deepcopy.go, config/crd/bases/claw.sandbox.redhat.com_claws.yaml
ClawSpec gains traces and logs, deepcopy handling is added, and the CRD documents shared OTLP endpoint and sampling validation.
Diagnostics and egress injection
internal/controller/claw_metrics.go, internal/controller/claw_diagnostics.go
Metrics config stops injecting collector defaults, and new helpers resolve endpoints, mutate diagnostics config, inject OTEL env vars, and add telemetry egress rules.
Reconciler and plugin wiring
internal/controller/claw_resource_controller.go, internal/controller/claw_plugins.go, cmd/main.go
The reconciler switches to diagnostics injection paths, deployment OTEL env injection runs for traces/logs/metrics, diagnostics plugins are added, and OTelCollectorImage wiring is removed.
Controller tests
internal/controller/claw_diagnostics_test.go, internal/controller/claw_metrics_test.go, internal/controller/claw_plugins_test.go
Tests cover traces/logs helpers, diagnostics config mutation, OTEL env vars, telemetry egress rules, metrics injection behavior, and plugin selection.
Observability guide
docs/observability.md
The guide adds setup, configuration, validation, UI access, and troubleshooting content for traces, logs, and metrics.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • IsaiahStapleton
  • sallyom
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 41.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title clearly summarizes the main change: adding top-level spec.traces and spec.logs for OTLP forwarding.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/diagnostics-otel

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@cooktheryan

Copy link
Copy Markdown
Collaborator Author

/hold until i can run through code rabbit

@cooktheryan

Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Rename the deprecated `otlphttp` exporter alias to `otlp_http` in the
OTel Collector config to suppress warnings on collector 0.152.1+. Add a
dedicated GitHub Actions workflow that validates the full OTel
diagnostics stack (traces, logs, metrics) on a Kind cluster.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread .github/workflows/test-diagnostics.yml Fixed

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/test-diagnostics.yml:
- Around line 31-36: The Kind installation step downloads from the `/dl/latest/`
endpoint in the curl command, which is not reproducible as the latest version
can change over time. Replace the `/dl/latest/` path with a specific pinned
version number (e.g., `/dl/v0.20.0/`) in the curl command that downloads Kind
from kind.sigs.k8s.io to ensure consistent and reproducible test runs.
- Around line 11-22: Add a top-level permissions block to the workflow that
explicitly defines the minimum required GITHUB_TOKEN scopes. Currently, the
workflow lacks a permissions declaration which results in using default
permissions that are broader than necessary, violating the principle of least
privilege. Place the permissions block at the same level as the jobs key to
restrict token access to only the specific permissions needed for this OTel
Diagnostics workflow (typically read or none for workflows that only run tests
without deploying or modifying repository content).
- Line 24: The test-diagnostics.yml workflow uses semantic version tags (like
`@v4`) for GitHub Actions references instead of commit hashes, which is
inconsistent with the pinning approach used in publish-images.yml. Update all
uses statements in the test-diagnostics.yml workflow to pin actions to their
full commit hashes instead of version tags, following the same pattern already
established in publish-images.yml for supply-chain hardening and consistency
across the repository.

In `@api/v1alpha1/claw_types.go`:
- Around line 699-707: The TracesSpec and LogsSpec fields in the claw_types.go
file need CRD-level validation to prevent silent misconfiguration. Add
kubebuilder validation markers (such as +kubebuilder:validation:XValidationRule)
to enforce that when Traces or Logs specs have enabled set to true, the
corresponding endpoint fields must be explicitly configured. This ensures the
validation happens at admission time rather than silently failing during
collector configuration, preventing users from creating invalid resource
combinations where endpoints would not be wired for export.

In `@internal/controller/claw_diagnostics.go`:
- Around line 276-282: The isExternalEndpoint function suppresses errors from
classifyServiceURL by returning false on failure, which hides configuration
issues and silently skips egress rule injection. Modify isExternalEndpoint to
return both a boolean and an error instead of only a boolean, then propagate any
error returned by classifyServiceURL up through the call chain via
collectTelemetryEndpoints and finally through injectTelemetryEgressRules so that
reconciliation and status updates reflect misconfiguration issues immediately.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 2ab0c029-6158-4696-ad50-d544ea85e162

📥 Commits

Reviewing files that changed from the base of the PR and between f8c18ad and ad5b275.

📒 Files selected for processing (9)
  • .github/workflows/test-diagnostics.yml
  • api/v1alpha1/claw_types.go
  • api/v1alpha1/zz_generated.deepcopy.go
  • config/crd/bases/claw.sandbox.redhat.com_claws.yaml
  • internal/controller/claw_diagnostics.go
  • internal/controller/claw_diagnostics_test.go
  • internal/controller/claw_metrics.go
  • internal/controller/claw_plugins.go
  • internal/controller/claw_resource_controller.go

Comment thread .github/workflows/test-kind.yml Outdated
Comment thread .github/workflows/test-kind.yml Outdated
Comment thread .github/workflows/test-kind.yml Outdated
Comment thread api/v1alpha1/claw_types.go Outdated
Comment thread internal/controller/claw_diagnostics.go Outdated
cooktheryan and others added 2 commits June 17, 2026 10:35
- Fix injectTelemetryEgressRules to handle in-cluster endpoints using
  classifyServiceURL/buildInClusterEgressRule (was silently blocking)
- Rename workflow to "kind test"
- Use quay.io/redhat-et/ image names for Kind compatibility
- Add IMAGE_PULL_POLICY=IfNotPresent for Kind deploys
- Enhance debug logging: dump container logs, configmap, kubectl get all

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread .github/workflows/test-kind.yml Fixed
cooktheryan and others added 7 commits June 17, 2026 11:03
…ntain permissions'

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
- Add permissions: contents: read to limit GITHUB_TOKEN scope
- Add persist-credentials: false on checkout
- Pin Kind to v0.32.0 for reproducible builds

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Send test OTLP data to the sidecar and verify it flows through:
- Traces: sidecar receives span, backend logs it via debug exporter
- Logs: sidecar receives log record, backend logs it
- Metrics: sidecar receives OTLP metric, Prometheus endpoint serves it

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
These tested the OTel Collector generically, not OpenClaw's actual
telemetry output. Keep the operator artifact validations (sidecar,
config, env vars, NP rules) which are what this test is for.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Merge publish-images.yml into test-kind.yml so images are only pushed
to quay.io after successful Kind testing and only on merge to main.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Align test job action references with publish job for supply-chain
hardening consistency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@frzifus

frzifus commented Jun 23, 2026

Copy link
Copy Markdown

Openclaw in that case is just one pod?

I wonder if we deploy a collector anyway, we may want to use the instrumentaiton CR to configure them.

In that case the workload itself would just need an annoation like:

instrumentation.opentelemetry.io/inject-sdk: "true"

Or if we want to already provide a default, the operator could place one on its namespace. When all pods come with:

instrumentation.opentelemetry.io/inject-sdk: "openclaw/default"

cooktheryan and others added 8 commits June 24, 2026 15:50
…ollector

Replace the bundled OTel Collector sidecar with direct OTLP export:
- Remove configureMetricsSidecar, injectOTelCollectorConfig, otelSidecarNeeded
- Remove OTelCollectorImage reconciler field and OTEL_COLLECTOR_IMAGE env var
- Set OTEL_EXPORTER_OTLP_ENDPOINT to spec.traces.endpoint in gateway env vars
- injectDiagnosticsConfig now sets diagnostics.otel.endpoint to spec.traces.endpoint
  (not localhost:4318); injectMetricsConfig only sets diagnostics.otel.metrics=true
- Update tests to reflect no sidecar, no otel-collector.yaml in ConfigMap
- ServiceMonitor, metrics Service port, and NP ingress rule are preserved

The OTel Collector is expected to be deployed externally (e.g., via the Red Hat
build of OpenTelemetry Operator) in the same namespace. See the TempoMonolithic
+ OpenTelemetryCollector CRs deployed in the rcook namespace for an example.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Covers operator prerequisites (cert-manager, OTel Operator, Tempo Operator),
per-namespace stack deployment (TempoMonolithic + OpenTelemetryCollector),
Claw CR configuration for traces/logs/metrics, and a step-by-step
verification procedure including a synthetic trace injection test.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
feat: remove OTel sidecar, send OTLP directly to external collector
- Bubble classifyTelemetryEndpoints errors instead of silently
  continuing; invalid telemetry endpoint URLs now surface through the
  reconcile error path rather than silently skipping egress rules
- Add CEL XValidation rules to ClawSpec rejecting traces.enabled=true
  without traces.endpoint, and logs.enabled=true without a resolvable
  endpoint (logs.endpoint or traces.endpoint)
- Regenerate CRD YAML with new validation rules

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
controller-gen marshals kubebuilder marker strings into YAML where
double-quoted empty string literals ('""') break the YAML parser.
Use single-quoted CEL string literals ('') instead — CEL accepts both.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Replace empty-string comparisons in XValidation rules with size()
  checks to avoid YAML double-quote escaping issues with controller-gen
- Remove makeTestDeploymentForMetrics and makeTestConfigMapForMetrics
  which became unused after the sidecar removal
- Regenerate CRD YAML

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
CEL XValidation markers on ClawSpec generated multi-line YAML
double-quoted scalars that Kubernetes envtest (and older API servers)
reject with a parse error. Replace the admission-time guards with
equivalent runtime checks in injectObservabilityResources:
- traces.enabled without traces.endpoint returns an error immediately
- logs.enabled without a resolvable endpoint returns an error immediately

The reconciler error surfaces through the standard status condition path
and is equally visible to operators.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@cooktheryan cooktheryan changed the title feat: add spec.traces and spec.logs for OTel forwarding HOLD: add spec.traces and spec.logs for OTel forwarding Jun 24, 2026
- Delete buildCollectorConfig and its tests — the function was only
  used by injectOTelCollectorConfig which was removed with the sidecar
- Remove unused strings import from claw_diagnostics.go
- Update MetricsSpec, TracesSpec, LogsSpec, and ClawSpec field comments
  to reflect direct OTLP export to an external collector (no sidecar)

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
cooktheryan and others added 2 commits June 24, 2026 17:18
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The Jaeger UI bundled in TempoMonolithic is deprecated and will be
removed in a future Tempo Operator release. Replace with the OpenShift-
recommended approach:

- Remove jaegerui: enabled: true from the TempoMonolithic spec
- Replace "Accessing the Tempo UI" section with COO installation and
  UIPlugin deployment for the native Observe → Traces console view
- Document TraceQL query examples for filtering openclaw spans

The distributed tracing UI plugin (observability.openshift.io/v1alpha1
UIPlugin) surfaces the Tempo-backed trace view natively in the
OpenShift Console without exposing a deprecated Jaeger Route.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/observability.md`:
- Line 5: The telemetry overview in the observability guide describes a single
OTLP-to-collector path for all signals, but the metrics flow is documented
elsewhere as Prometheus/ServiceMonitor-based. Update the opening description in
the observability guide to split the path by signal, keeping the OTel SDK/OTLP
collector wording for logs and traces while describing metrics as scraped via
Prometheus/ServiceMonitor. Use the existing observability section text as the
target for this clarification so readers are not led to expect one collector
path for every signal.

In `@internal/controller/claw_diagnostics.go`:
- Around line 113-116: The runtime config injection for OTEL endpoints only uses
tracesEndpoint(instance) when setting otel["endpoint"] and
OTEL_EXPORTER_OTLP_ENDPOINT, so spec.logs.endpoint never reaches the
environment. Update the endpoint resolution in the claw diagnostics config path
to consider logsEndpoint(instance) as a fallback or primary source alongside
tracesEndpoint, using the existing otel map/env injection logic so logs-only or
distinct log endpoint setups are propagated correctly.

In `@internal/controller/claw_resource_controller.go`:
- Around line 668-669: The enrichConfigAndNetworkPolicy validation path is
returning early without updating status, which can leave Ready=True stale when
observability endpoint validation fails. Update the reconcile flow around
enrichConfigAndNetworkPolicy and injectObservabilityResources to set Ready=False
before returning on user-fixable validation errors, matching the nearby
validation handling. If possible, use a typed validation error so internal
manifest failures are not reported as user-facing observability validation
issues.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 5dd6f656-1453-46cd-aa1e-636226fb6fb1

📥 Commits

Reviewing files that changed from the base of the PR and between f2e8a72 and 6d2a3e0.

📒 Files selected for processing (12)
  • api/v1alpha1/claw_types.go
  • api/v1alpha1/zz_generated.deepcopy.go
  • cmd/main.go
  • config/crd/bases/claw.sandbox.redhat.com_claws.yaml
  • docs/observability.md
  • internal/controller/claw_diagnostics.go
  • internal/controller/claw_diagnostics_test.go
  • internal/controller/claw_metrics.go
  • internal/controller/claw_metrics_test.go
  • internal/controller/claw_plugins.go
  • internal/controller/claw_plugins_test.go
  • internal/controller/claw_resource_controller.go
✅ Files skipped from review due to trivial changes (1)
  • api/v1alpha1/zz_generated.deepcopy.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • internal/controller/claw_plugins.go
  • api/v1alpha1/claw_types.go

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/observability.md`:
- Line 5: The telemetry overview in the observability guide describes a single
OTLP-to-collector path for all signals, but the metrics flow is documented
elsewhere as Prometheus/ServiceMonitor-based. Update the opening description in
the observability guide to split the path by signal, keeping the OTel SDK/OTLP
collector wording for logs and traces while describing metrics as scraped via
Prometheus/ServiceMonitor. Use the existing observability section text as the
target for this clarification so readers are not led to expect one collector
path for every signal.

In `@internal/controller/claw_diagnostics.go`:
- Around line 113-116: The runtime config injection for OTEL endpoints only uses
tracesEndpoint(instance) when setting otel["endpoint"] and
OTEL_EXPORTER_OTLP_ENDPOINT, so spec.logs.endpoint never reaches the
environment. Update the endpoint resolution in the claw diagnostics config path
to consider logsEndpoint(instance) as a fallback or primary source alongside
tracesEndpoint, using the existing otel map/env injection logic so logs-only or
distinct log endpoint setups are propagated correctly.

In `@internal/controller/claw_resource_controller.go`:
- Around line 668-669: The enrichConfigAndNetworkPolicy validation path is
returning early without updating status, which can leave Ready=True stale when
observability endpoint validation fails. Update the reconcile flow around
enrichConfigAndNetworkPolicy and injectObservabilityResources to set Ready=False
before returning on user-fixable validation errors, matching the nearby
validation handling. If possible, use a typed validation error so internal
manifest failures are not reported as user-facing observability validation
issues.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 5dd6f656-1453-46cd-aa1e-636226fb6fb1

📥 Commits

Reviewing files that changed from the base of the PR and between f2e8a72 and 6d2a3e0.

📒 Files selected for processing (12)
  • api/v1alpha1/claw_types.go
  • api/v1alpha1/zz_generated.deepcopy.go
  • cmd/main.go
  • config/crd/bases/claw.sandbox.redhat.com_claws.yaml
  • docs/observability.md
  • internal/controller/claw_diagnostics.go
  • internal/controller/claw_diagnostics_test.go
  • internal/controller/claw_metrics.go
  • internal/controller/claw_metrics_test.go
  • internal/controller/claw_plugins.go
  • internal/controller/claw_plugins_test.go
  • internal/controller/claw_resource_controller.go
✅ Files skipped from review due to trivial changes (1)
  • api/v1alpha1/zz_generated.deepcopy.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • internal/controller/claw_plugins.go
  • api/v1alpha1/claw_types.go
🛑 Comments failed to post (3)
docs/observability.md (1)

5-5: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Split the telemetry path by signal.

This sentence is inaccurate for metrics: the rest of the guide documents metrics as Prometheus/ServiceMonitor-based, not OTLP sent to the collector. As written, it will mislead operators into expecting one collector path for every signal.

✏️ Suggested wording
- Telemetry is sent directly from the gateway to an externally-deployed OTel Collector, which routes signals to your chosen backends.
+ Traces and logs are sent directly from the gateway to an externally-deployed OTel Collector, while metrics are scraped separately via Prometheus/ServiceMonitor.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Traces and logs are sent directly from the gateway to an externally-deployed OTel Collector, while metrics are scraped separately via Prometheus/ServiceMonitor.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/observability.md` at line 5, The telemetry overview in the observability
guide describes a single OTLP-to-collector path for all signals, but the metrics
flow is documented elsewhere as Prometheus/ServiceMonitor-based. Update the
opening description in the observability guide to split the path by signal,
keeping the OTel SDK/OTLP collector wording for logs and traces while describing
metrics as scraped via Prometheus/ServiceMonitor. Use the existing observability
section text as the target for this clarification so readers are not led to
expect one collector path for every signal.
internal/controller/claw_diagnostics.go (1)

113-116: 🎯 Functional Correctness | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify whether spec.logs.endpoint is propagated beyond validation/egress.
# Expected: logs endpoint should appear in the runtime config/env path, not only in validation or NetworkPolicy code.
rg -n 'logsEndpoint\(|OTEL_EXPORTER_OTLP|otel\["endpoint"\]|logs[_A-Za-z]*endpoint' --iglob '*.go' .

Repository: redhat-et/claw-operator

Length of output: 1930


Propagate the logs endpoint into runtime config/env.

The code at lines 113–116 and 163 exclusively resolves the otel["endpoint"] map key and OTEL_EXPORTER_OTLP_ENDPOINT environment variable using tracesEndpoint(instance), ignoring the configured logsEndpoint(instance). While logsEndpoint is utilized for egress validation (line 260), the runtime configuration fails to propagate the logs endpoint. This causes logs-only deployments or distinct endpoint configurations to either miss the target URL entirely or incorrectly route logs to the traces endpoint.

Update the injection logic to include logsEndpoint(instance) as a fallback or primary resolver when appropriate, ensuring spec.logs.endpoint reaches the environment.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/controller/claw_diagnostics.go` around lines 113 - 116, The runtime
config injection for OTEL endpoints only uses tracesEndpoint(instance) when
setting otel["endpoint"] and OTEL_EXPORTER_OTLP_ENDPOINT, so spec.logs.endpoint
never reaches the environment. Update the endpoint resolution in the claw
diagnostics config path to consider logsEndpoint(instance) as a fallback or
primary source alongside tracesEndpoint, using the existing otel map/env
injection logic so logs-only or distinct log endpoint setups are propagated
correctly.
internal/controller/claw_resource_controller.go (1)

668-669: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Set Ready=False for observability validation failures.

injectObservabilityResources now returns user-fixable endpoint validation errors, but this call returns before updating status. If a previously Ready Claw loses a required telemetry endpoint, Ready=True can remain stale while reconcile fails. Mirror the nearby validation paths, ideally with a typed validation error so internal manifest errors are not mislabeled.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/controller/claw_resource_controller.go` around lines 668 - 669, The
enrichConfigAndNetworkPolicy validation path is returning early without updating
status, which can leave Ready=True stale when observability endpoint validation
fails. Update the reconcile flow around enrichConfigAndNetworkPolicy and
injectObservabilityResources to set Ready=False before returning on user-fixable
validation errors, matching the nearby validation handling. If possible, use a
typed validation error so internal manifest failures are not reported as
user-facing observability validation issues.

cooktheryan and others added 4 commits June 25, 2026 11:44
The OpenShift Console distributed tracing UI plugin requires multi-tenancy.
Update the full observability setup:

TempoMonolithic:
- Enable multitenancy with mode: openshift and a named tenant
- Gateway sidecar is deployed automatically alongside Tempo

RBAC:
- Add ClusterRole granting create on tempo.grafana.com/<tenantName>
  resource (the SAR pattern the Tempo gateway's OPA policy checks)
- Both k8sattributes and tenant-write bindings documented

OTel Collector:
- Use bearertokenauth extension with service account token
- Export to tempo-tempo-gateway.<ns>.svc:4317 (not the direct Tempo svc)
- Mount openshift-service-ca.crt (service-serving-signer CA) for TLS
- Add X-Scope-OrgID: <tenant> header

Verification:
- Query traces via tempo-tempo-gateway on port 8080 with HTTPS + bearer auth
- Direct tempo-tempo service no longer exists in multi-tenant mode

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The OTel Collector logs pipeline works out of the box with only the
debug exporter — gateway logs are received and confirmed via collector
pod logs without any additional storage backend.

LokiStack (for Observe → Logs in the OpenShift Console) is now an
explicit optional section that:
- States the capacity requirement upfront (extra-small needs ~4-5 vCPU
  and ~8 GiB beyond base workload)
- Lists OpenShift Logging 6.5 as the prerequisite (OTLP ingestion GA)
- Provides the full LokiStack + ClusterLogForwarder + UIPlugin setup
- Shows the exact collector config change needed (otlphttp/loki exporter)

Update the troubleshooting note to reference the optional section rather
than implying Loki is required.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
1. Logs-only endpoint gap: add otlpEndpoint() helper that prefers
   spec.traces.endpoint but falls back to spec.logs.endpoint when traces
   is not configured. Both injectDiagnosticsConfig and injectOTelEnvVars
   now use otlpEndpoint() so a logs-only deployment correctly receives
   OTEL_EXPORTER_OTLP_ENDPOINT and diagnostics.otel.endpoint.

2. Metrics-only no endpoint: add explicit validation in
   injectObservabilityResources that rejects metrics.enabled=true when
   neither traces nor logs provides an OTLP endpoint. Previously this
   produced a silent misconfiguration where OpenClaw had metrics enabled
   but no OTLP target to export to.

3. Missing status condition: injectObservabilityResources errors now call
   setCondition(ConditionTypeReady, False, ConfigFailed) before returning,
   matching the established pattern for every other validation path in the
   reconciler. Previously the error was only visible in operator logs, not
   in kubectl describe claw.

Consolidate: injectDiagnosticsConfig now owns all three signal flags
(traces, logs, metrics) so injectMetricsConfig call is dropped — both
functions ran sequentially and set the same otel["metrics"]=true, which
was redundant after the consolidation.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
1. Add openshift-operators-redhat OperatorGroup to the optional LokiStack
   prerequisites — without it subscriptions are created but CSVs never
   install (silently broken).

2. Add storageClassName as a required field in the LokiStack spec — the
   CRD rejects the resource without it (validation error hit during deploy).

3. Add infra node tolerations to the LokiStack template spec — on OpenShift
   clusters with dedicated infra nodes (NoSchedule taint), all LokiStack
   components must tolerate the taint or they stay Pending indefinitely.

4. Replace "Point the collector at LokiStack" with a ClusterLogForwarder
   approach. The OTel Collector OTLP→Loki path had a wrong URL (/otlp
   returns 404; the real path is /api/logs/v1/application/otlp/v1/logs).
   The CLF (vector DaemonSet) is the correct tool for log aggregation.

5. Add tls.ca field to the CLF output — the LokiStack gateway uses a cert
   signed by the OpenShift service-serving CA, not the Kubernetes API CA.
   Without openshift-service-ca.crt, vector gets SSL errors on every push.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/observability.md (1)

317-358: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Document the logs-endpoint fallback in this section.

Line 355 says OTEL_EXPORTER_OTLP_ENDPOINT is set from spec.traces.endpoint, but the controller logic and diagnostics tests fall back to spec.logs.endpoint when traces are disabled. As written, this section makes traces.endpoint look mandatory for logs-only setups when it is not.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/observability.md` around lines 317 - 358, The observability docs
currently imply OTEL_EXPORTER_OTLP_ENDPOINT always comes from
spec.traces.endpoint, which is incorrect for logs-only setups. Update the
section around the Claw spec and “What the operator injects” to explicitly
describe the fallback in the controller logic: when traces are disabled and logs
are enabled, the endpoint should come from spec.logs.endpoint, otherwise it uses
spec.traces.endpoint. Make sure the explanation references the Claw spec fields
traces.endpoint and logs.endpoint, and keep the operator.json diagnostics note
aligned with that fallback behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@docs/observability.md`:
- Around line 317-358: The observability docs currently imply
OTEL_EXPORTER_OTLP_ENDPOINT always comes from spec.traces.endpoint, which is
incorrect for logs-only setups. Update the section around the Claw spec and
“What the operator injects” to explicitly describe the fallback in the
controller logic: when traces are disabled and logs are enabled, the endpoint
should come from spec.logs.endpoint, otherwise it uses spec.traces.endpoint.
Make sure the explanation references the Claw spec fields traces.endpoint and
logs.endpoint, and keep the operator.json diagnostics note aligned with that
fallback behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: ffed5987-bc3a-4487-b790-bdb8a48e7488

📥 Commits

Reviewing files that changed from the base of the PR and between 6d2a3e0 and 3af762e.

📒 Files selected for processing (6)
  • docs/observability.md
  • internal/controller/claw_diagnostics.go
  • internal/controller/claw_diagnostics_test.go
  • internal/controller/claw_metrics_test.go
  • internal/controller/claw_plugins_test.go
  • internal/controller/claw_resource_controller.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • internal/controller/claw_plugins_test.go
  • internal/controller/claw_diagnostics_test.go
  • internal/controller/claw_diagnostics.go
  • internal/controller/claw_metrics_test.go

cooktheryan and others added 4 commits June 25, 2026 15:47
When diagnostics signals are enabled, the required plugin IDs
(@openclaw/diagnostics-otel, @openclaw/diagnostics-prometheus) are now
added to plugins.allow in operator.json. Without this, OpenClaw treats
non-bundled plugins as auto-discovered and runs them in limited trust
mode, producing sparse or suppressed instrumentation data.

Also ensures idempotent merge: existing user-configured allow entries
are preserved and plugin IDs are not duplicated.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The gateway runs behind a claw-proxy MITM sidecar (HTTP_PROXY set) that
intercepts all outbound HTTP traffic for credential injection. OTLP traffic
to an external collector goes through this proxy, which returns 403 because
no credential rule exists for the collector host.

Fix: parse the OTLP endpoint URL in injectOTelEnvVars and append its
hostname to NO_PROXY and no_proxy, causing the Node.js OTel SDK
(which respects NODE_USE_ENV_PROXY) to connect directly.

The appendToNoProxy helper is idempotent — it checks for existing entries
before appending and preserves the proxy-injected values for all other hosts.

Also document this in the Troubleshooting section of docs/observability.md
so operators know to check NO_PROXY when seeing 403 from their collector.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Setting plugins.allow explicitly prevents OpenClaw from loading bundled
plugins (codex agent runtime, browser tools, etc.) that are not in the
list, reducing the gateway from 11 plugins to 3. The diagnostics plugins
auto-load correctly without this — the NO_PROXY fix is the real fix for
trace delivery. The plugins.allow warning is informational only.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The diagnostics-otel plugin loads silently without this key but does not
initialize any exporters. Adding enabled=true causes the plugin to start
the OTLP/Protobuf exporter on startup (confirmed by log message:
'diagnostics-otel: logs exporter enabled (OTLP/Protobuf)').

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@cooktheryan cooktheryan changed the title HOLD: add spec.traces and spec.logs for OTel forwarding feat: add spec.traces and spec.logs for OTel forwarding Jun 26, 2026

@sallyom sallyom left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@sallyom sallyom merged commit 9129eee into main Jun 26, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants