Agentic Operator

A Kubernetes operator that watches AgenticSession Custom Resources and manages the lifecycle of the Jobs that run them.

Features

  • Controller-runtime based - Uses work queues with rate limiting for scalable processing
  • Concurrent reconciliation - Processes multiple sessions in parallel (configurable)
  • Event deduplication - Multiple rapid events are coalesced into single reconciles
  • Automatic retries - Failed reconciles are requeued with exponential backoff
  • Watches AgenticSession CRs and spawns Jobs with runner pods
  • Updates CR status based on Job completion
  • Handles timeout and cleanup
  • Idempotent reconciliation

Configuration

Command Line Flags

| Flag | Default | Description |
|------|---------|-------------|
| --max-concurrent-reconciles | 10 | Maximum parallel session reconciliations |
| --metrics-bind-address | :8080 | Prometheus metrics endpoint |
| --health-probe-bind-address | :8081 | Health/readiness probe endpoint |
| --leader-elect | false | Enable leader election for HA |
| --legacy-watch | false | Use old watch-based implementation (debugging) |

Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| MAX_CONCURRENT_RECONCILES | 10 | Override max concurrent reconciles |
| DEV_MODE | false | Enable development logging |
| NAMESPACE | default | Operator namespace |
| BACKEND_NAMESPACE | (same as NAMESPACE) | Backend API namespace |
| AMBIENT_CODE_RUNNER_IMAGE | quay.io/ambient_code/vteam_claude_runner:latest | Runner image |
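
The following sketch shows one way the flags and environment variables above could be combined into a single configuration. It is not the operator's actual loading code: the `Config` struct and `loadConfig` function are hypothetical, only a subset of settings is covered, and the precedence (the environment variable overriding the flag value) is an assumption based on the word "Override" in the table.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"strconv"
)

// Config is a hypothetical container for a subset of the documented settings.
type Config struct {
	MaxConcurrentReconciles int
	LeaderElect             bool
	Namespace               string
	BackendNamespace        string
}

// loadConfig parses flags from args and then applies environment variables.
// getenv is injected so the function can be exercised without touching the real environment.
func loadConfig(args []string, getenv func(string) string) (Config, error) {
	var cfg Config
	fs := flag.NewFlagSet("operator", flag.ContinueOnError)
	fs.IntVar(&cfg.MaxConcurrentReconciles, "max-concurrent-reconciles", 10, "maximum parallel session reconciliations")
	fs.BoolVar(&cfg.LeaderElect, "leader-elect", false, "enable leader election for HA")
	if err := fs.Parse(args); err != nil {
		return cfg, err
	}

	// Assumption: the environment variable wins over the flag when both are set.
	if v := getenv("MAX_CONCURRENT_RECONCILES"); v != "" {
		n, err := strconv.Atoi(v)
		if err != nil || n < 1 {
			return cfg, fmt.Errorf("invalid MAX_CONCURRENT_RECONCILES %q", v)
		}
		cfg.MaxConcurrentReconciles = n
	}

	cfg.Namespace = getenv("NAMESPACE")
	if cfg.Namespace == "" {
		cfg.Namespace = "default"
	}
	// BACKEND_NAMESPACE falls back to NAMESPACE, as documented in the table.
	cfg.BackendNamespace = getenv("BACKEND_NAMESPACE")
	if cfg.BackendNamespace == "" {
		cfg.BackendNamespace = cfg.Namespace
	}
	return cfg, nil
}

func main() {
	cfg, err := loadConfig(os.Args[1:], os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", cfg)
}
```

Passing `getenv` as a parameter keeps the defaulting rules easy to test in isolation.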

Performance Tuning

For high-throughput environments:

args:
  - --max-concurrent-reconciles=20  # Increase parallelism
  - --leader-elect=false

For HA deployments:

spec:
  replicas: 2
  template:
    spec:
      containers:
      - args:
        - --leader-elect=true  # Only one active controller

Development

Prerequisites

  • Go 1.24+
  • kubectl
  • Kubernetes cluster access
  • CRDs installed in cluster

Quick Start

cd components/operator

# Build
go build -o operator .

# Run locally (requires k8s access and CRDs installed)
go run .

# Run with legacy watch mode (for debugging)
go run . --legacy-watch

Build

# Build binary
go build -o operator .

# Build container image
docker build -t operator .
# or
podman build -t operator .

Testing

# Run tests
go test ./... -v

# Run tests with coverage
go test ./... -v -cover

Linting

# List files that need formatting (use gofmt -w . to rewrite them)
gofmt -l .

# Run go vet
go vet ./...

# Run golangci-lint
golangci-lint run

Pre-commit checklist:

# Run all linting checks
gofmt -l .             # Should output nothing
go vet ./...
golangci-lint run

# Auto-format code
gofmt -w .

Architecture

Package Structure

operator/
├── internal/
│   ├── config/        # K8s client init, config loading
│   ├── controller/    # Controller-runtime reconcilers (NEW)
│   │   ├── agenticsession_controller.go  # Main reconciler with work queue
│   │   └── reconcile_phases.go           # Phase-specific reconciliation logic
│   ├── types/         # GVR definitions, resource helpers
│   ├── handlers/      # Handler logic called from controllers
│   │   ├── sessions.go      # Session management logic
│   │   ├── reconciler.go    # Exported functions for controller
│   │   ├── namespaces.go    # Namespace watcher
│   │   └── projectsettings.go  # ProjectSettings watcher
│   └── services/      # Reusable services (PVC provisioning, etc.)
└── main.go            # Manager setup and controller registration

Controller-Runtime Benefits

The operator uses controller-runtime which provides:

  1. Work Queue - Events are added to a queue and processed asynchronously
  2. Rate Limiting - Exponential backoff prevents API server overload
  3. Deduplication - Multiple rapid events = single reconcile
  4. Concurrency - Multiple reconcilers process sessions in parallel
  5. Predicates - Filter events to reduce unnecessary reconciles
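
To make the deduplication point concrete, here is a simplified, single-goroutine sketch of a queue that coalesces repeated events for the same key. It is loosely modelled on the idea behind the work queue that controller-runtime uses, but the `dedupQueue` type and its methods are hypothetical and omit locking, rate limiting, and shutdown handling.

```go
package main

import "fmt"

// dedupQueue is a hypothetical, non-thread-safe queue that keeps at most one
// pending entry per key, even if many events arrive for that key.
type dedupQueue struct {
	order      []string        // FIFO order of keys waiting to be processed
	dirty      map[string]bool // keys that need (re)processing
	processing map[string]bool // keys currently being reconciled
}

func newDedupQueue() *dedupQueue {
	return &dedupQueue{dirty: map[string]bool{}, processing: map[string]bool{}}
}

// Add records that key needs a reconcile. Repeated calls are coalesced.
func (q *dedupQueue) Add(key string) {
	if q.dirty[key] {
		return // already waiting, nothing to do
	}
	q.dirty[key] = true
	if q.processing[key] {
		return // Done will re-queue it once the current reconcile finishes
	}
	q.order = append(q.order, key)
}

// Get removes and returns the next key, marking it as being processed.
func (q *dedupQueue) Get() (string, bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key := q.order[0]
	q.order = q.order[1:]
	q.processing[key] = true
	delete(q.dirty, key)
	return key, true
}

// Done marks a reconcile as finished and re-queues the key if events
// arrived while it was being processed.
func (q *dedupQueue) Done(key string) {
	delete(q.processing, key)
	if q.dirty[key] {
		q.order = append(q.order, key)
	}
}

// Len reports how many keys are waiting.
func (q *dedupQueue) Len() int { return len(q.order) }

func main() {
	q := newDedupQueue()
	q.Add("ns/session-a")
	q.Add("ns/session-a") // coalesced with the previous event
	q.Add("ns/session-b")
	fmt.Println("pending keys:", q.Len())
}
```

Because the queue stores keys rather than events, a reconciler built on it always reads the latest state of the object, which is why the reconcile logic has to be idempotent.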

Key Patterns

See CLAUDE.md in project root for:

  • Reconciliation pattern with Result and error handling
  • Status updates (UpdateStatus subresource)
  • Error handling and requeuing
  • Phase-based state machine
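
For readers without CLAUDE.md at hand, a phase-based state machine can be sketched roughly as follows. The phase names Pending, Running, Stopped, Failed, and Completed appear in the metrics section of this README; the `Phase` type, the `transition` function, and the set of allowed transitions are assumptions made for illustration and may not match the operator's real phases or rules.

```go
package main

import "fmt"

// Phase is a hypothetical type for the session phase stored in the CR status.
type Phase string

const (
	PhasePending   Phase = "Pending"
	PhaseRunning   Phase = "Running"
	PhaseCompleted Phase = "Completed"
	PhaseFailed    Phase = "Failed"
	PhaseStopped   Phase = "Stopped"
)

// Terminal reports whether no further transitions are expected.
func (p Phase) Terminal() bool {
	return p == PhaseCompleted || p == PhaseFailed || p == PhaseStopped
}

// allowed lists the transitions assumed for this sketch.
var allowed = map[Phase][]Phase{
	PhasePending: {PhaseRunning, PhaseFailed, PhaseStopped},
	PhaseRunning: {PhaseCompleted, PhaseFailed, PhaseStopped},
}

// transition validates a phase change. Re-applying the current phase is a
// no-op, which keeps repeated reconciles idempotent.
func transition(from, to Phase) (Phase, error) {
	if from == to {
		return from, nil
	}
	for _, p := range allowed[from] {
		if p == to {
			return to, nil
		}
	}
	return from, fmt.Errorf("invalid transition %s -> %s", from, to)
}

func main() {
	next, err := transition(PhasePending, PhaseRunning)
	fmt.Println(next, err)

	_, err = transition(PhaseCompleted, PhaseRunning)
	fmt.Println(err)
}
```

Keeping the transition table in one place makes it easy to reject updates that would move a session out of a terminal phase.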

Prometheus Metrics

The operator exposes metrics at :8080/metrics for monitoring session lifecycle and performance.

Available Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| ambient_session_startup_duration_seconds | Histogram | namespace | Time from Pending to Running |
| ambient_sessions_total | Counter | namespace | Total sessions created |
| ambient_sessions_completed_total | Counter | namespace, final_phase | Sessions reaching terminal states (Stopped, Failed, Completed) |
| ambient_session_phase_transitions_total | Counter | namespace, from_phase, to_phase | Phase transition counts |
| ambient_reconcile_duration_seconds | Histogram | phase, success | Reconcile loop timing |
| ambient_pod_creation_duration_seconds | Histogram | namespace | Pod creation timing |
| ambient_token_provision_duration_seconds | Histogram | namespace | Runner token provisioning time |
| ambient_session_errors_total | Counter | namespace, phase, error_type | Error tracking |

Example PromQL Queries

95th percentile startup time:

histogram_quantile(0.95, sum(rate(ambient_session_startup_duration_seconds_bucket[5m])) by (le, namespace))

Average startup time:

sum(rate(ambient_session_startup_duration_seconds_sum[5m])) / sum(rate(ambient_session_startup_duration_seconds_count[5m]))

Sessions started per hour:

sum(increase(ambient_sessions_total[1h])) by (namespace)

Phase transitions per minute:

sum(rate(ambient_session_phase_transitions_total[1m])) by (from_phase, to_phase)

Error rate:

sum(rate(ambient_session_errors_total[5m])) by (phase, error_type)

Reconcile success rate:

sum(rate(ambient_reconcile_duration_seconds_count{success="true"}[5m])) / sum(rate(ambient_reconcile_duration_seconds_count[5m]))

OpenShift User Workload Monitoring

To enable metrics scraping in OpenShift:

  1. Enable user workload monitoring (done once per cluster):

     oc -n openshift-monitoring edit configmap cluster-monitoring-config
     # Add: enableUserWorkload: true

  2. Apply the ServiceMonitor (included in manifests):

     oc apply -f components/manifests/base/operator-metrics-service.yaml

  3. Access metrics in OpenShift Console → Observe → Metrics

Reference Files

  • internal/controller/agenticsession_controller.go - Main reconciler
  • internal/controller/reconcile_phases.go - Phase handlers
  • internal/controller/metrics.go - Prometheus metric definitions
  • internal/handlers/reconciler.go - Exported handler functions
  • internal/handlers/sessions.go - Core session management logic
  • internal/config/config.go - K8s client initialization
  • internal/types/resources.go - GVR definitions