feat: add spec.traces and spec.logs for OTel forwarding by cooktheryan · Pull Request #15 · redhat-et/claw-operator

cooktheryan · 2026-06-16T19:17:21Z

Summary

Adds spec.traces and spec.logs as top-level CRD fields alongside the existing spec.metrics, enabling OpenTelemetry trace and log forwarding through the OTel Collector sidecar to external OTLP endpoints
Refactors the static OTel Collector config into a dynamic builder that generates pipelines per signal (metrics/traces/logs), supporting split endpoints for traces vs logs
Auto-installs @openclaw/diagnostics-otel and @openclaw/diagnostics-prometheus plugins when the corresponding signals are enabled
Injects OTel resource attributes (OTEL_SERVICE_NAME, service.instance.id, k8s.namespace.name) via Downward API for multi-instance discrimination
Configures parentbased_traceidratio sampler with user-configurable sampling ratio

Example CR

spec:
  metrics:
    enabled: true
  traces:
    enabled: true
    endpoint: "http://otel-collector.observability.svc:4318"
    samplingRatio: "0.1"
  logs:
    enabled: true
    # endpoint defaults to traces endpoint when omitted

Test plan

make generate — deepcopy generated for new TracesSpec and LogsSpec types
make manifests — CRD schema includes traces and logs fields
make lint — 0 issues (extracted injectObservabilityResources to stay under cyclomatic complexity limit)
make test — all tests pass including new diagnostics tests
Deploy to cluster with traces endpoint pointed at an OTel Collector, verify spans arrive
Deploy with both metrics and traces, verify sidecar serves Prometheus on :9464 and forwards OTLP

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Added CRD support for configuring traces and logs (with logs defaulting to the traces OTLP endpoint).
- Added sampling ratio configuration for traces.
- When signals are enabled, the Deployment now injects the required OpenTelemetry env vars and configures telemetry egress access.
Bug Fixes
- Removed the previous metrics sidecar approach and standardized on a shared OTLP endpoint for traces/logs/metrics.
- Preserves any user-provided OpenTelemetry endpoint settings.
Documentation
- Expanded the observability runbook with end-to-end setup and verification, plus optional persistent log storage.
Tests
- Updated and expanded coverage for CRD behavior, OTEL injection, network policy egress, and plugin selection.

Add traces and logs as top-level spec fields alongside metrics, enabling OpenTelemetry trace and log forwarding through the existing OTel Collector sidecar to external OTLP endpoints. Auto-installs diagnostics plugins and injects OTel resource attributes for multi-instance discrimination. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai · 2026-06-16T19:17:38Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Claw now models traces and logs as observability inputs, updates the CRD and deepcopy generation, and rewires reconciliation to inject diagnostics config, OTEL env vars, plugin requirements, and egress rules. Metrics logic no longer configures a collector sidecar, and the observability guide and tests were updated.

Changes

Observability signals

Layer / File(s)	Summary
Schema and deepcopy `api/v1alpha1/claw_types.go`, `api/v1alpha1/zz_generated.deepcopy.go`, `config/crd/bases/claw.sandbox.redhat.com_claws.yaml`	`ClawSpec` gains `traces` and `logs`, deepcopy handling is added, and the CRD documents shared OTLP endpoint and sampling validation.
Diagnostics and egress injection `internal/controller/claw_metrics.go`, `internal/controller/claw_diagnostics.go`	Metrics config stops injecting collector defaults, and new helpers resolve endpoints, mutate diagnostics config, inject OTEL env vars, and add telemetry egress rules.
Reconciler and plugin wiring `internal/controller/claw_resource_controller.go`, `internal/controller/claw_plugins.go`, `cmd/main.go`	The reconciler switches to diagnostics injection paths, deployment OTEL env injection runs for traces/logs/metrics, diagnostics plugins are added, and `OTelCollectorImage` wiring is removed.
Controller tests `internal/controller/claw_diagnostics_test.go`, `internal/controller/claw_metrics_test.go`, `internal/controller/claw_plugins_test.go`	Tests cover traces/logs helpers, diagnostics config mutation, OTEL env vars, telemetry egress rules, metrics injection behavior, and plugin selection.
Observability guide `docs/observability.md`	The guide adds setup, configuration, validation, UI access, and troubleshooting content for traces, logs, and metrics.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

IsaiahStapleton
sallyom

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 41.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title clearly summarizes the main change: adding top-level spec.traces and spec.logs for OTLP forwarding.

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/diagnostics-otel

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

cooktheryan · 2026-06-16T19:26:09Z

/hold until i can run through code rabbit

cooktheryan · 2026-06-17T12:53:43Z

@coderabbitai review

coderabbitai · 2026-06-17T12:53:50Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Rename the deprecated `otlphttp` exporter alias to `otlp_http` in the OTel Collector config to suppress warnings on collector 0.152.1+. Add a dedicated GitHub Actions workflow that validates the full OTel diagnostics stack (traces, logs, metrics) on a Kind cluster. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/test-diagnostics.yml:
- Around line 31-36: The Kind installation step downloads from the `/dl/latest/`
endpoint in the curl command, which is not reproducible as the latest version
can change over time. Replace the `/dl/latest/` path with a specific pinned
version number (e.g., `/dl/v0.20.0/`) in the curl command that downloads Kind
from kind.sigs.k8s.io to ensure consistent and reproducible test runs.
- Around line 11-22: Add a top-level permissions block to the workflow that
explicitly defines the minimum required GITHUB_TOKEN scopes. Currently, the
workflow lacks a permissions declaration which results in using default
permissions that are broader than necessary, violating the principle of least
privilege. Place the permissions block at the same level as the jobs key to
restrict token access to only the specific permissions needed for this OTel
Diagnostics workflow (typically read or none for workflows that only run tests
without deploying or modifying repository content).
- Line 24: The test-diagnostics.yml workflow uses semantic version tags (like
`@v4`) for GitHub Actions references instead of commit hashes, which is
inconsistent with the pinning approach used in publish-images.yml. Update all
uses statements in the test-diagnostics.yml workflow to pin actions to their
full commit hashes instead of version tags, following the same pattern already
established in publish-images.yml for supply-chain hardening and consistency
across the repository.

In `@api/v1alpha1/claw_types.go`:
- Around line 699-707: The TracesSpec and LogsSpec fields in the claw_types.go
file need CRD-level validation to prevent silent misconfiguration. Add
kubebuilder validation markers (such as +kubebuilder:validation:XValidationRule)
to enforce that when Traces or Logs specs have enabled set to true, the
corresponding endpoint fields must be explicitly configured. This ensures the
validation happens at admission time rather than silently failing during
collector configuration, preventing users from creating invalid resource
combinations where endpoints would not be wired for export.

In `@internal/controller/claw_diagnostics.go`:
- Around line 276-282: The isExternalEndpoint function suppresses errors from
classifyServiceURL by returning false on failure, which hides configuration
issues and silently skips egress rule injection. Modify isExternalEndpoint to
return both a boolean and an error instead of only a boolean, then propagate any
error returned by classifyServiceURL up through the call chain via
collectTelemetryEndpoints and finally through injectTelemetryEgressRules so that
reconciliation and status updates reflect misconfiguration issues immediately.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 2ab0c029-6158-4696-ad50-d544ea85e162

📥 Commits

Reviewing files that changed from the base of the PR and between f8c18ad and ad5b275.

📒 Files selected for processing (9)

.github/workflows/test-diagnostics.yml
api/v1alpha1/claw_types.go
api/v1alpha1/zz_generated.deepcopy.go
config/crd/bases/claw.sandbox.redhat.com_claws.yaml
internal/controller/claw_diagnostics.go
internal/controller/claw_diagnostics_test.go
internal/controller/claw_metrics.go
internal/controller/claw_plugins.go
internal/controller/claw_resource_controller.go

- Fix injectTelemetryEgressRules to handle in-cluster endpoints using classifyServiceURL/buildInClusterEgressRule (was silently blocking) - Rename workflow to "kind test" - Use quay.io/redhat-et/ image names for Kind compatibility - Add IMAGE_PULL_POLICY=IfNotPresent for Kind deploys - Enhance debug logging: dump container logs, configmap, kubectl get all Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ntain permissions' Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

- Add permissions: contents: read to limit GITHUB_TOKEN scope - Add persist-credentials: false on checkout - Pin Kind to v0.32.0 for reproducible builds Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Send test OTLP data to the sidecar and verify it flows through: - Traces: sidecar receives span, backend logs it via debug exporter - Logs: sidecar receives log record, backend logs it - Metrics: sidecar receives OTLP metric, Prometheus endpoint serves it Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

These tested the OTel Collector generically, not OpenClaw's actual telemetry output. Keep the operator artifact validations (sidecar, config, env vars, NP rules) which are what this test is for. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge publish-images.yml into test-kind.yml so images are only pushed to quay.io after successful Kind testing and only on merge to main. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Align test job action references with publish job for supply-chain hardening consistency. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

frzifus · 2026-06-23T20:54:45Z

Openclaw in that case is just one pod?

I wonder if we deploy a collector anyway, we may want to use the instrumentaiton CR to configure them.

In that case the workload itself would just need an annoation like:

instrumentation.opentelemetry.io/inject-sdk: "true"

Or if we want to already provide a default, the operator could place one on its namespace. When all pods come with:

instrumentation.opentelemetry.io/inject-sdk: "openclaw/default"

…ollector Replace the bundled OTel Collector sidecar with direct OTLP export: - Remove configureMetricsSidecar, injectOTelCollectorConfig, otelSidecarNeeded - Remove OTelCollectorImage reconciler field and OTEL_COLLECTOR_IMAGE env var - Set OTEL_EXPORTER_OTLP_ENDPOINT to spec.traces.endpoint in gateway env vars - injectDiagnosticsConfig now sets diagnostics.otel.endpoint to spec.traces.endpoint (not localhost:4318); injectMetricsConfig only sets diagnostics.otel.metrics=true - Update tests to reflect no sidecar, no otel-collector.yaml in ConfigMap - ServiceMonitor, metrics Service port, and NP ingress rule are preserved The OTel Collector is expected to be deployed externally (e.g., via the Red Hat build of OpenTelemetry Operator) in the same namespace. See the TempoMonolithic + OpenTelemetryCollector CRs deployed in the rcook namespace for an example. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Covers operator prerequisites (cert-manager, OTel Operator, Tempo Operator), per-namespace stack deployment (TempoMonolithic + OpenTelemetryCollector), Claw CR configuration for traces/logs/metrics, and a step-by-step verification procedure including a synthetic trace injection test. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

feat: remove OTel sidecar, send OTLP directly to external collector

- Bubble classifyTelemetryEndpoints errors instead of silently continuing; invalid telemetry endpoint URLs now surface through the reconcile error path rather than silently skipping egress rules - Add CEL XValidation rules to ClawSpec rejecting traces.enabled=true without traces.endpoint, and logs.enabled=true without a resolvable endpoint (logs.endpoint or traces.endpoint) - Regenerate CRD YAML with new validation rules Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

controller-gen marshals kubebuilder marker strings into YAML where double-quoted empty string literals ('""') break the YAML parser. Use single-quoted CEL string literals ('') instead — CEL accepts both. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

- Replace empty-string comparisons in XValidation rules with size() checks to avoid YAML double-quote escaping issues with controller-gen - Remove makeTestDeploymentForMetrics and makeTestConfigMapForMetrics which became unused after the sidecar removal - Regenerate CRD YAML Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

CEL XValidation markers on ClawSpec generated multi-line YAML double-quoted scalars that Kubernetes envtest (and older API servers) reject with a parse error. Replace the admission-time guards with equivalent runtime checks in injectObservabilityResources: - traces.enabled without traces.endpoint returns an error immediately - logs.enabled without a resolvable endpoint returns an error immediately The reconciler error surfaces through the standard status condition path and is equally visible to operators. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

- Delete buildCollectorConfig and its tests — the function was only used by injectOTelCollectorConfig which was removed with the sidecar - Remove unused strings import from claw_diagnostics.go - Update MetricsSpec, TracesSpec, LogsSpec, and ClawSpec field comments to reflect direct OTLP export to an external collector (no sidecar) Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

The Jaeger UI bundled in TempoMonolithic is deprecated and will be removed in a future Tempo Operator release. Replace with the OpenShift- recommended approach: - Remove jaegerui: enabled: true from the TempoMonolithic spec - Replace "Accessing the Tempo UI" section with COO installation and UIPlugin deployment for the native Observe → Traces console view - Document TraceQL query examples for filtering openclaw spans The distributed tracing UI plugin (observability.openshift.io/v1alpha1 UIPlugin) surfaces the Tempo-backed trace view natively in the OpenShift Console without exposing a deprecated Jaeger Route. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/observability.md`:
- Line 5: The telemetry overview in the observability guide describes a single
OTLP-to-collector path for all signals, but the metrics flow is documented
elsewhere as Prometheus/ServiceMonitor-based. Update the opening description in
the observability guide to split the path by signal, keeping the OTel SDK/OTLP
collector wording for logs and traces while describing metrics as scraped via
Prometheus/ServiceMonitor. Use the existing observability section text as the
target for this clarification so readers are not led to expect one collector
path for every signal.

In `@internal/controller/claw_diagnostics.go`:
- Around line 113-116: The runtime config injection for OTEL endpoints only uses
tracesEndpoint(instance) when setting otel["endpoint"] and
OTEL_EXPORTER_OTLP_ENDPOINT, so spec.logs.endpoint never reaches the
environment. Update the endpoint resolution in the claw diagnostics config path
to consider logsEndpoint(instance) as a fallback or primary source alongside
tracesEndpoint, using the existing otel map/env injection logic so logs-only or
distinct log endpoint setups are propagated correctly.

In `@internal/controller/claw_resource_controller.go`:
- Around line 668-669: The enrichConfigAndNetworkPolicy validation path is
returning early without updating status, which can leave Ready=True stale when
observability endpoint validation fails. Update the reconcile flow around
enrichConfigAndNetworkPolicy and injectObservabilityResources to set Ready=False
before returning on user-fixable validation errors, matching the nearby
validation handling. If possible, use a typed validation error so internal
manifest failures are not reported as user-facing observability validation
issues.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 5dd6f656-1453-46cd-aa1e-636226fb6fb1

📥 Commits

Reviewing files that changed from the base of the PR and between f2e8a72 and 6d2a3e0.

📒 Files selected for processing (12)

api/v1alpha1/claw_types.go
api/v1alpha1/zz_generated.deepcopy.go
cmd/main.go
config/crd/bases/claw.sandbox.redhat.com_claws.yaml
docs/observability.md
internal/controller/claw_diagnostics.go
internal/controller/claw_diagnostics_test.go
internal/controller/claw_metrics.go
internal/controller/claw_metrics_test.go
internal/controller/claw_plugins.go
internal/controller/claw_plugins_test.go
internal/controller/claw_resource_controller.go

✅ Files skipped from review due to trivial changes (1)

api/v1alpha1/zz_generated.deepcopy.go

🚧 Files skipped from review as they are similar to previous changes (2)

internal/controller/claw_plugins.go
api/v1alpha1/claw_types.go

coderabbitai

Caution

Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/observability.md`:
- Line 5: The telemetry overview in the observability guide describes a single
OTLP-to-collector path for all signals, but the metrics flow is documented
elsewhere as Prometheus/ServiceMonitor-based. Update the opening description in
the observability guide to split the path by signal, keeping the OTel SDK/OTLP
collector wording for logs and traces while describing metrics as scraped via
Prometheus/ServiceMonitor. Use the existing observability section text as the
target for this clarification so readers are not led to expect one collector
path for every signal.

In `@internal/controller/claw_diagnostics.go`:
- Around line 113-116: The runtime config injection for OTEL endpoints only uses
tracesEndpoint(instance) when setting otel["endpoint"] and
OTEL_EXPORTER_OTLP_ENDPOINT, so spec.logs.endpoint never reaches the
environment. Update the endpoint resolution in the claw diagnostics config path
to consider logsEndpoint(instance) as a fallback or primary source alongside
tracesEndpoint, using the existing otel map/env injection logic so logs-only or
distinct log endpoint setups are propagated correctly.

In `@internal/controller/claw_resource_controller.go`:
- Around line 668-669: The enrichConfigAndNetworkPolicy validation path is
returning early without updating status, which can leave Ready=True stale when
observability endpoint validation fails. Update the reconcile flow around
enrichConfigAndNetworkPolicy and injectObservabilityResources to set Ready=False
before returning on user-fixable validation errors, matching the nearby
validation handling. If possible, use a typed validation error so internal
manifest failures are not reported as user-facing observability validation
issues.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 5dd6f656-1453-46cd-aa1e-636226fb6fb1

📥 Commits

Reviewing files that changed from the base of the PR and between f2e8a72 and 6d2a3e0.

📒 Files selected for processing (12)

api/v1alpha1/claw_types.go
api/v1alpha1/zz_generated.deepcopy.go
cmd/main.go
config/crd/bases/claw.sandbox.redhat.com_claws.yaml
docs/observability.md
internal/controller/claw_diagnostics.go
internal/controller/claw_diagnostics_test.go
internal/controller/claw_metrics.go
internal/controller/claw_metrics_test.go
internal/controller/claw_plugins.go
internal/controller/claw_plugins_test.go
internal/controller/claw_resource_controller.go

✅ Files skipped from review due to trivial changes (1)

api/v1alpha1/zz_generated.deepcopy.go

🚧 Files skipped from review as they are similar to previous changes (2)

internal/controller/claw_plugins.go
api/v1alpha1/claw_types.go

🛑 Comments failed to post (3)

docs/observability.md (1)
5-5: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Split the telemetry path by signal.

This sentence is inaccurate for metrics: the rest of the guide documents metrics as Prometheus/ServiceMonitor-based, not OTLP sent to the collector. As written, it will mislead operators into expecting one collector path for every signal.
✏️ Suggested wording
- Telemetry is sent directly from the gateway to an externally-deployed OTel Collector, which routes signals to your chosen backends.
+ Traces and logs are sent directly from the gateway to an externally-deployed OTel Collector, while metrics are scraped separately via Prometheus/ServiceMonitor.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Traces and logs are sent directly from the gateway to an externally-deployed OTel Collector, while metrics are scraped separately via Prometheus/ServiceMonitor.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/observability.md` at line 5, The telemetry overview in the observability
guide describes a single OTLP-to-collector path for all signals, but the metrics
flow is documented elsewhere as Prometheus/ServiceMonitor-based. Update the
opening description in the observability guide to split the path by signal,
keeping the OTel SDK/OTLP collector wording for logs and traces while describing
metrics as scraped via Prometheus/ServiceMonitor. Use the existing observability
section text as the target for this clarification so readers are not led to
expect one collector path for every signal.
internal/controller/claw_diagnostics.go (1)
113-116: 🎯 Functional Correctness | 🟠 Major
🧩 Analysis chain

🏁 Script executed:
#!/bin/bash
# Verify whether spec.logs.endpoint is propagated beyond validation/egress.
# Expected: logs endpoint should appear in the runtime config/env path, not only in validation or NetworkPolicy code.
rg -n 'logsEndpoint\(|OTEL_EXPORTER_OTLP|otel\["endpoint"\]|logs[_A-Za-z]*endpoint' --iglob '*.go' .
Repository: redhat-et/claw-operator

Length of output: 1930
Propagate the logs endpoint into runtime config/env.

The code at lines 113–116 and 163 exclusively resolves the otel["endpoint"] map key and OTEL_EXPORTER_OTLP_ENDPOINT environment variable using tracesEndpoint(instance), ignoring the configured logsEndpoint(instance). While logsEndpoint is utilized for egress validation (line 260), the runtime configuration fails to propagate the logs endpoint. This causes logs-only deployments or distinct endpoint configurations to either miss the target URL entirely or incorrectly route logs to the traces endpoint.

Update the injection logic to include logsEndpoint(instance) as a fallback or primary resolver when appropriate, ensuring spec.logs.endpoint reaches the environment.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/controller/claw_diagnostics.go` around lines 113 - 116, The runtime
config injection for OTEL endpoints only uses tracesEndpoint(instance) when
setting otel["endpoint"] and OTEL_EXPORTER_OTLP_ENDPOINT, so spec.logs.endpoint
never reaches the environment. Update the endpoint resolution in the claw
diagnostics config path to consider logsEndpoint(instance) as a fallback or
primary source alongside tracesEndpoint, using the existing otel map/env
injection logic so logs-only or distinct log endpoint setups are propagated
correctly.
internal/controller/claw_resource_controller.go (1)
668-669: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Set Ready=False for observability validation failures.

injectObservabilityResources now returns user-fixable endpoint validation errors, but this call returns before updating status. If a previously Ready Claw loses a required telemetry endpoint, Ready=True can remain stale while reconcile fails. Mirror the nearby validation paths, ideally with a typed validation error so internal manifest errors are not mislabeled.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/controller/claw_resource_controller.go` around lines 668 - 669, The
enrichConfigAndNetworkPolicy validation path is returning early without updating
status, which can leave Ready=True stale when observability endpoint validation
fails. Update the reconcile flow around enrichConfigAndNetworkPolicy and
injectObservabilityResources to set Ready=False before returning on user-fixable
validation errors, matching the nearby validation handling. If possible, use a
typed validation error so internal manifest failures are not reported as
user-facing observability validation issues.

The OpenShift Console distributed tracing UI plugin requires multi-tenancy. Update the full observability setup: TempoMonolithic: - Enable multitenancy with mode: openshift and a named tenant - Gateway sidecar is deployed automatically alongside Tempo RBAC: - Add ClusterRole granting create on tempo.grafana.com/<tenantName> resource (the SAR pattern the Tempo gateway's OPA policy checks) - Both k8sattributes and tenant-write bindings documented OTel Collector: - Use bearertokenauth extension with service account token - Export to tempo-tempo-gateway.<ns>.svc:4317 (not the direct Tempo svc) - Mount openshift-service-ca.crt (service-serving-signer CA) for TLS - Add X-Scope-OrgID: <tenant> header Verification: - Query traces via tempo-tempo-gateway on port 8080 with HTTPS + bearer auth - Direct tempo-tempo service no longer exists in multi-tenant mode Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

The OTel Collector logs pipeline works out of the box with only the debug exporter — gateway logs are received and confirmed via collector pod logs without any additional storage backend. LokiStack (for Observe → Logs in the OpenShift Console) is now an explicit optional section that: - States the capacity requirement upfront (extra-small needs ~4-5 vCPU and ~8 GiB beyond base workload) - Lists OpenShift Logging 6.5 as the prerequisite (OTLP ingestion GA) - Provides the full LokiStack + ClusterLogForwarder + UIPlugin setup - Shows the exact collector config change needed (otlphttp/loki exporter) Update the troubleshooting note to reference the optional section rather than implying Loki is required. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

1. Logs-only endpoint gap: add otlpEndpoint() helper that prefers spec.traces.endpoint but falls back to spec.logs.endpoint when traces is not configured. Both injectDiagnosticsConfig and injectOTelEnvVars now use otlpEndpoint() so a logs-only deployment correctly receives OTEL_EXPORTER_OTLP_ENDPOINT and diagnostics.otel.endpoint. 2. Metrics-only no endpoint: add explicit validation in injectObservabilityResources that rejects metrics.enabled=true when neither traces nor logs provides an OTLP endpoint. Previously this produced a silent misconfiguration where OpenClaw had metrics enabled but no OTLP target to export to. 3. Missing status condition: injectObservabilityResources errors now call setCondition(ConditionTypeReady, False, ConfigFailed) before returning, matching the established pattern for every other validation path in the reconciler. Previously the error was only visible in operator logs, not in kubectl describe claw. Consolidate: injectDiagnosticsConfig now owns all three signal flags (traces, logs, metrics) so injectMetricsConfig call is dropped — both functions ran sequentially and set the same otel["metrics"]=true, which was redundant after the consolidation. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

1. Add openshift-operators-redhat OperatorGroup to the optional LokiStack prerequisites — without it subscriptions are created but CSVs never install (silently broken). 2. Add storageClassName as a required field in the LokiStack spec — the CRD rejects the resource without it (validation error hit during deploy). 3. Add infra node tolerations to the LokiStack template spec — on OpenShift clusters with dedicated infra nodes (NoSchedule taint), all LokiStack components must tolerate the taint or they stay Pending indefinitely. 4. Replace "Point the collector at LokiStack" with a ClusterLogForwarder approach. The OTel Collector OTLP→Loki path had a wrong URL (/otlp returns 404; the real path is /api/logs/v1/application/otlp/v1/logs). The CLF (vector DaemonSet) is the correct tool for log aggregation. 5. Add tls.ca field to the CLF output — the LokiStack gateway uses a cert signed by the OpenShift service-serving CA, not the Kubernetes API CA. Without openshift-service-ca.crt, vector gets SSL errors on every push. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

docs/observability.md (1)
317-358: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Document the logs-endpoint fallback in this section.

Line 355 says OTEL_EXPORTER_OTLP_ENDPOINT is set from spec.traces.endpoint, but the controller logic and diagnostics tests fall back to spec.logs.endpoint when traces are disabled. As written, this section makes traces.endpoint look mandatory for logs-only setups when it is not.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/observability.md` around lines 317 - 358, The observability docs
currently imply OTEL_EXPORTER_OTLP_ENDPOINT always comes from
spec.traces.endpoint, which is incorrect for logs-only setups. Update the
section around the Claw spec and “What the operator injects” to explicitly
describe the fallback in the controller logic: when traces are disabled and logs
are enabled, the endpoint should come from spec.logs.endpoint, otherwise it uses
spec.traces.endpoint. Make sure the explanation references the Claw spec fields
traces.endpoint and logs.endpoint, and keep the operator.json diagnostics note
aligned with that fallback behavior.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@docs/observability.md`:
- Around line 317-358: The observability docs currently imply
OTEL_EXPORTER_OTLP_ENDPOINT always comes from spec.traces.endpoint, which is
incorrect for logs-only setups. Update the section around the Claw spec and
“What the operator injects” to explicitly describe the fallback in the
controller logic: when traces are disabled and logs are enabled, the endpoint
should come from spec.logs.endpoint, otherwise it uses spec.traces.endpoint.
Make sure the explanation references the Claw spec fields traces.endpoint and
logs.endpoint, and keep the operator.json diagnostics note aligned with that
fallback behavior.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: ffed5987-bc3a-4487-b790-bdb8a48e7488

📥 Commits

Reviewing files that changed from the base of the PR and between 6d2a3e0 and 3af762e.

📒 Files selected for processing (6)

docs/observability.md
internal/controller/claw_diagnostics.go
internal/controller/claw_diagnostics_test.go
internal/controller/claw_metrics_test.go
internal/controller/claw_plugins_test.go
internal/controller/claw_resource_controller.go

🚧 Files skipped from review as they are similar to previous changes (4)

internal/controller/claw_plugins_test.go
internal/controller/claw_diagnostics_test.go
internal/controller/claw_diagnostics.go
internal/controller/claw_metrics_test.go

When diagnostics signals are enabled, the required plugin IDs (@openclaw/diagnostics-otel, @openclaw/diagnostics-prometheus) are now added to plugins.allow in operator.json. Without this, OpenClaw treats non-bundled plugins as auto-discovered and runs them in limited trust mode, producing sparse or suppressed instrumentation data. Also ensures idempotent merge: existing user-configured allow entries are preserved and plugin IDs are not duplicated. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

The gateway runs behind a claw-proxy MITM sidecar (HTTP_PROXY set) that intercepts all outbound HTTP traffic for credential injection. OTLP traffic to an external collector goes through this proxy, which returns 403 because no credential rule exists for the collector host. Fix: parse the OTLP endpoint URL in injectOTelEnvVars and append its hostname to NO_PROXY and no_proxy, causing the Node.js OTel SDK (which respects NODE_USE_ENV_PROXY) to connect directly. The appendToNoProxy helper is idempotent — it checks for existing entries before appending and preserves the proxy-injected values for all other hosts. Also document this in the Troubleshooting section of docs/observability.md so operators know to check NO_PROXY when seeing 403 from their collector. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Setting plugins.allow explicitly prevents OpenClaw from loading bundled plugins (codex agent runtime, browser tools, etc.) that are not in the list, reducing the gateway from 11 plugins to 3. The diagnostics plugins auto-load correctly without this — the NO_PROXY fix is the real fix for trace delivery. The plugins.allow warning is informational only. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

The diagnostics-otel plugin loads silently without this key but does not initialize any exporters. Adding enabled=true causes the plugin to start the OTLP/Protobuf exporter on startup (confirmed by log message: 'diagnostics-otel: logs exporter enabled (OTLP/Protobuf)'). Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

sallyom

LGTM!

github-advanced-security AI found potential problems Jun 17, 2026

View reviewed changes

Comment thread .github/workflows/test-diagnostics.yml Fixed

coderabbitai Bot reviewed Jun 17, 2026

View reviewed changes

Comment thread .github/workflows/test-kind.yml Outdated

Comment thread .github/workflows/test-kind.yml Outdated

Comment thread .github/workflows/test-kind.yml Outdated

Comment thread api/v1alpha1/claw_types.go Outdated

Comment thread internal/controller/claw_diagnostics.go Outdated

cooktheryan and others added 2 commits June 17, 2026 10:35

Merge remote-tracking branch 'origin/main' into feat/diagnostics-otel

295f7b2

github-advanced-security AI found potential problems Jun 17, 2026

View reviewed changes

Comment thread .github/workflows/test-kind.yml Fixed

cooktheryan and others added 7 commits June 17, 2026 11:03

Potential fix for pull request finding 'CodeQL / Workflow does not co…

a82fac2

…ntain permissions' Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

chore: address PR feedback on workflow security

d3e6ac6

- Add permissions: contents: read to limit GITHUB_TOKEN scope - Add persist-credentials: false on checkout - Pin Kind to v0.32.0 for reproducible builds Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat: combine kind test and image publish into single CI workflow

d79b35a

Merge publish-images.yml into test-kind.yml so images are only pushed to quay.io after successful Kind testing and only on merge to main. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chore: pin actions to commit hashes in test job

c634562

Align test job action references with publish job for supply-chain hardening consistency. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chore: remove workflow changes (moved to separate PR)

bdfe03f

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

cooktheryan mentioned this pull request Jun 17, 2026

ci: combine kind test and image publish into single workflow #18

Merged

3 tasks

cooktheryan and others added 8 commits June 24, 2026 15:50

Merge origin/feat/diagnostics-otel into worktree-otel-simplify

f10e555

Merge pull request #31 from redhat-et/worktree-otel-simplify

2bd4786

feat: remove OTel sidecar, send OTLP directly to external collector

cooktheryan changed the title ~~feat: add spec.traces and spec.logs for OTel forwarding~~ HOLD: add spec.traces and spec.logs for OTel forwarding Jun 24, 2026

cooktheryan and others added 2 commits June 24, 2026 17:18

chore: regenerate CRD after doc comment updates

56a7c28

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

coderabbitai Bot reviewed Jun 25, 2026

View reviewed changes

cooktheryan and others added 4 commits June 25, 2026 11:44

coderabbitai Bot reviewed Jun 25, 2026

View reviewed changes

cooktheryan and others added 4 commits June 25, 2026 15:47

cooktheryan changed the title ~~HOLD: add spec.traces and spec.logs for OTel forwarding~~ feat: add spec.traces and spec.logs for OTel forwarding Jun 26, 2026

sallyom approved these changes Jun 26, 2026

View reviewed changes

sallyom merged commit 9129eee into main Jun 26, 2026
10 checks passed

Uh oh!

Conversation

cooktheryan commented Jun 16, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Example CR

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

cooktheryan commented Jun 16, 2026

Uh oh!

cooktheryan commented Jun 17, 2026

Uh oh!

coderabbitai Bot commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

frzifus commented Jun 23, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

sallyom left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

cooktheryan commented Jun 16, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading

coderabbitai Bot commented Jun 17, 2026 •

edited

Loading