telemetryflow
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 31 additions & 3 deletions
diff --git a/configs/tfo-agent.default.yaml b/configs/tfo-agent.default.yaml
Lines changed: 5 additions & 0 deletions
diff --git a/configs/tfo-agent.yaml b/configs/tfo-agent.yaml
Lines changed: 5 additions & 0 deletions
diff --git a/deploy/helm/tfo-agent/values.yaml b/deploy/helm/tfo-agent/values.yaml
Lines changed: 5 additions & 0 deletions
diff --git a/docs/collectors/CADVISOR.md b/docs/collectors/CADVISOR.md
Lines changed: 101 additions & 0 deletions
diff --git a/docs/collectors/DOCKER.md b/docs/collectors/DOCKER.md
Lines changed: 119 additions & 0 deletions
@@ -24,7 +24,7 @@ All notable changes to TelemetryFlow Agent will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
-## [1.1.8] - 2026-03-08
+## [1.1.8] - 2026-03-09
 
 ### Added
 
@@ -43,11 +43,39 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   - Host filesystem paths checked both directly and under `TELEMETRYFLOW_HOST_ROOT` prefix — detection works correctly inside DaemonSet containers
   - Returns `(false, "")` when not running in a Kubernetes environment at all
   - New `IsKubernetes bool` and `K8sProvider string` fields added to `collector.SystemInfo` struct
+- **HPA Sub-collector (`collectors.kubernetes.hpa: true`)**: HorizontalPodAutoscaler monitoring
+  - 5 new metrics: `k8s.hpa.min_replicas`, `k8s.hpa.max_replicas`, `k8s.hpa.current_replicas`, `k8s.hpa.desired_replicas`, `k8s.hpa.condition`
+  - Condition types: `AbleToScale`, `ScalingActive`, `ScalingLimited` — emitted as `1` (True) / `0` (False/Unknown)
+  - Labels: `namespace`, `hpa`, `target_kind`, `target_name`
+- **PDB Sub-collector (`collectors.kubernetes.pdb: true`)**: PodDisruptionBudget health monitoring
+  - 4 new metrics: `k8s.pdb.pods_healthy`, `k8s.pdb.pods_desired`, `k8s.pdb.disruptions_allowed`, `k8s.pdb.expected_pods`
+  - Labels: `namespace`, `pdb`
+  - RBAC: `policy` apiGroup with `poddisruptionbudgets` resource added to ClusterRole
+- **Pod Log Collection (`collectors.kubernetes.pod_logs: true`)**: Tail-based container log collection from the Kubernetes API
+  - Collects last N lines (`pod_logs_tail_lines`, default 100) per running container per cycle
+  - Optional namespace allowlist via `pod_logs_namespaces` (empty = same as `namespace_filter`)
+  - Emits `PodLogEntry` records: `timestamp`, `namespace`, `pod`, `container`, `log_line`
+  - Respects the existing `namespace_filter` / `exclude_namespaces` config
+- **Kubelet `/stats/summary` Expansion**: Extended Kubelet summary scraping with container-level and node-level data
+  - **Container ephemeral storage**: `k8s.pod.container.ephemeral_storage.usage` and `k8s.pod.container.ephemeral_storage.limit` (bytes)
+  - **Container memory working set**: `k8s.pod.container.memory.working_set` (bytes, matches `kubectl top`)
+  - **Node-level network I/O** via Kubelet summary: namespace-level `k8s.network.rx_bytes` / `k8s.network.tx_bytes` (existing `network: true` flag)
+- **Collector Documentation (`docs/collectors/`)**: 17 new reference documents covering every collector and sub-collector
+  - Kubernetes: NODES, PODS, DEPLOYMENTS, WORKLOADS, STORAGE, NETWORK, HPA, PDB, EVENTS, RESOURCE-COUNTS, POD-LOGS
+  - Host: NODE-EXPORTER (50+ metrics), SYSTEM (14 metrics + SystemInfo heartbeat fields)
+  - Container: DOCKER (24 metrics), CADVISOR (Prometheus scraper)
+  - Kernel: EBPF (20 metrics, 7 sub-collectors)
+  - `README.md` index with data source table and metric naming conventions
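
The HPA entry above says each condition is emitted as `1` for True and `0` for False or Unknown. A minimal sketch of that encoding follows; `conditionValue` and `hpaConditionTypes` are hypothetical names chosen for illustration, not identifiers taken from the agent's source.

```go
package main

// hpaConditionTypes lists the condition types named in the changelog entry.
// The variable name is hypothetical.
var hpaConditionTypes = []string{"AbleToScale", "ScalingActive", "ScalingLimited"}

// conditionValue converts a Kubernetes condition status string into the
// numeric value of k8s.hpa.condition: 1 for "True", 0 for anything else
// ("False", "Unknown", or an empty status).
func conditionValue(status string) float64 {
	if status == "True" {
		return 1
	}
	return 0
}
```

With this encoding an alert on `k8s.hpa.condition == 0` for `ScalingActive` fires both when scaling is disabled and when its state is unknown, so the two cases cannot be told apart from the metric alone.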
 
 ### Changed
 
 - **`internal/agent/agent.go`**: Replaced inline `uuid.New()` call with `ResolveAgentID(cfg.Agent.ID, cfg.Agent.Hostname, logger)` from the new identity module
 - **`deploy/kubernetes/daemonset.yaml`**: Added `HOST_PROC`, `HOST_ETC`, `HOST_SYS`, `HOST_VAR`, `HOST_RUN` environment variables so that `gopsutil` reads `/etc/machine-id` and other identity files from the **host node** rather than the container image — required for a stable `HostID` in the fingerprint
+- **Config files updated** — all YAML configs now include `hpa`, `pdb`, `pod_logs`, `pod_logs_tail_lines`, `pod_logs_namespaces` under `collectors.kubernetes`:
+  - `configs/tfo-agent.yaml`
+  - `configs/tfo-agent.default.yaml`
+  - `deploy/helm/tfo-agent/values.yaml` (both `config` and `kubernetes.config` sections)
+- **RBAC ClusterRole** (`deploy/helm/tfo-agent/templates/clusterrole.yaml`): Added `policy` apiGroup rule for `poddisruptionbudgets` (required by PDB sub-collector)
 
 ### Fixed
 
@@ -411,8 +439,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 | Version | Date       | OTEL SDK | Description                                                                                                       |
 | ------- | ---------- | -------- | ----------------------------------------------------------------------------------------------------------------- |
-| 1.1.8   | 2026-03-08 | v1.40.0  | Go 1.26.0 upgrade, OTEL SDK v1.40.0, Helm deployment for Kubernetes Cluster                                       |
-| 1.1.7   | 2026-03-04 | v1.40.0  | Stable agent identity via UUIDv5 host fingerprint; K8s provider detection (15 providers); fix SyncKubernetesState |
+| 1.1.8   | 2026-03-09 | v1.40.0  | HPA/PDB/pod-logs sub-collectors; Kubelet summary ephemeral + working set; Go 1.26 + security fixes; 17 collector docs |
+| 1.1.7   | 2026-03-08 | v1.40.0  | Stable agent identity via UUIDv5 host fingerprint; K8s provider detection (15 providers); fix SyncKubernetesState |
 | 1.1.6   | 2026-02-21 | v1.40.0  | Go 1.25.7, OTEL SDK v1.40.0, build-tag lint fixes, errcheck/staticcheck cleanup                                   |
 | 1.1.5   | 2026-02-19 | v1.39.0  | Docker container collector, cAdvisor scraper, CPU fix macOS, tags/labels propagation                              |
 | 1.1.4   | 2026-02-11 | v1.39.0  | eBPF collector (28 metrics), Cilium Hubble integration, 6 BPF programs, kernel-level observability                |
 
@@ -235,6 +235,11 @@ collectors:
     network: true         # Network bytes per namespace (Kubelet Summary API)
     # Fetch actual CPU/Memory usage from metrics-server
     metrics_api: true
+    hpa: true             # HorizontalPodAutoscaler (current/desired replicas + conditions)
+    pdb: true             # PodDisruptionBudget (healthy/desired/disruptions allowed)
+    pod_logs: true        # Collect recent log lines from each running container
+    pod_logs_tail_lines: 100     # Log lines per container per collection cycle
+    pod_logs_namespaces: []      # Restrict pod log collection to these namespaces (empty = same as namespace filter)
     # Sync resource state to TFO backend (PostgreSQL entities)
     sync_to_backend: true
     sync_interval: 60s
 
@@ -236,6 +236,11 @@ collectors:
     resource_counts: true # Secrets, ConfigMaps, Ingresses count
     network: true         # Network bytes per namespace (Kubelet Summary API)
     metrics_api: true     # Fetch actual CPU/Memory usage from metrics-server
+    hpa: true             # HorizontalPodAutoscaler (current/desired replicas + conditions)
+    pdb: true             # PodDisruptionBudget (healthy/desired/disruptions allowed)
+    pod_logs: true        # Collect recent log lines from each running container
+    pod_logs_tail_lines: 100     # Log lines per container per collection cycle
+    pod_logs_namespaces: []      # Restrict pod log collection to these namespaces (empty = same as namespace filter)
     # Sync resource state to TFO backend (populates PostgreSQL K8s entities)
     sync_to_backend: true
     sync_interval: 60s
 
@@ -263,6 +263,11 @@ kubernetes:
         resource_counts: true
         network: true
         metrics_api: true
+        hpa: true
+        pdb: true
+        pod_logs: true
+        pod_logs_tail_lines: 100
+        pod_logs_namespaces: []
         exclude_namespaces:
           - kube-system
 
 
@@ -0,0 +1,101 @@
+# cAdvisor Collector
+
+Scrapes container and machine metrics from a running cAdvisor instance's Prometheus `/metrics` endpoint.
+
+## Data Source
+
+HTTP GET to cAdvisor Prometheus endpoint (default: `http://localhost:8080/metrics`).
+
+Parser: Prometheus text format (`expfmt.TextParser`).
+
+## Metric Filtering
+
+By default, only metrics with the following prefixes are collected:
+
+| Prefix       | Description           |
+| ------------ | --------------------- |
+| `container_` | Per-container metrics |
+| `machine_`   | Host machine metrics  |
+
+If `metric_names` is configured, only the exact metric names listed are collected.
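
The filtering rule above can be sketched as a single predicate. This is an illustration only: `keepMetric` and `defaultPrefixes` are hypothetical names, and the sketch assumes that a non-empty `metric_names` list replaces the prefix check entirely, as the sentence above suggests.

```go
package main

import "strings"

// defaultPrefixes mirrors the two prefixes listed in the table above.
var defaultPrefixes = []string{"container_", "machine_"}

// keepMetric reports whether a scraped metric should be collected.
// A non-empty metricNames acts as an exact-match allowlist; otherwise
// the default prefixes decide.
func keepMetric(name string, metricNames []string) bool {
	if len(metricNames) > 0 {
		for _, n := range metricNames {
			if n == name {
				return true
			}
		}
		return false
	}
	for _, p := range defaultPrefixes {
		if strings.HasPrefix(name, p) {
			return true
		}
	}
	return false
}
```

Under this reading, listing `container_memory_rss` in `metric_names` drops `machine_cpu_cores` even though it carries a default prefix.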
+
+## Key cAdvisor Metrics Collected (defaults)
+
+### Container CPU
+
+| Metric                                      | Type    | Description                  |
+| ------------------------------------------- | ------- | ---------------------------- |
+| `container_cpu_usage_seconds_total`         | Counter | Cumulative CPU time consumed |
+| `container_cpu_system_seconds_total`        | Counter | CPU time in system mode      |
+| `container_cpu_user_seconds_total`          | Counter | CPU time in user mode        |
+| `container_cpu_cfs_throttled_seconds_total` | Counter | CPU throttled time           |
+| `container_cpu_cfs_periods_total`           | Counter | Total CFS periods            |
+| `container_cpu_cfs_throttled_periods_total` | Counter | Throttled CFS periods        |
+
+### Container Memory
+
+| Metric                               | Type    | Description                |
+| ------------------------------------ | ------- | -------------------------- |
+| `container_memory_usage_bytes`       | Gauge   | Current memory usage       |
+| `container_memory_working_set_bytes` | Gauge   | Working set memory         |
+| `container_memory_rss`               | Gauge   | Resident set size          |
+| `container_memory_cache`             | Gauge   | Page cache                 |
+| `container_memory_swap`              | Gauge   | Swap usage                 |
+| `container_memory_failures_total`    | Counter | Memory allocation failures |
+
+### Container Network (per-interface)
+
+| Metric                                             | Type    | Description       |
+| -------------------------------------------------- | ------- | ----------------- |
+| `container_network_receive_bytes_total`            | Counter | Bytes received    |
+| `container_network_transmit_bytes_total`           | Counter | Bytes transmitted |
+| `container_network_receive_errors_total`           | Counter | Receive errors    |
+| `container_network_transmit_errors_total`          | Counter | Transmit errors   |
+| `container_network_receive_packets_dropped_total`  | Counter | Received drops    |
+| `container_network_transmit_packets_dropped_total` | Counter | Transmitted drops |
+
+### Container Filesystem
+
+| Metric                            | Type    | Description                   |
+| --------------------------------- | ------- | ----------------------------- |
+| `container_fs_usage_bytes`        | Gauge   | Bytes used by the container   |
+| `container_fs_limit_bytes`        | Gauge   | Bytes limit for the container |
+| `container_fs_reads_bytes_total`  | Counter | Bytes read from filesystem    |
+| `container_fs_writes_bytes_total` | Counter | Bytes written to filesystem   |
+
+### Machine Metrics
+
+| Metric                                 | Type  | Description           |
+| -------------------------------------- | ----- | --------------------- |
+| `machine_cpu_cores`                    | Gauge | Number of CPU cores   |
+| `machine_memory_bytes`                 | Gauge | Total memory in bytes |
+| `machine_cpu_cache_capacity_kilobytes` | Gauge | CPU cache capacity    |
+
+## Metric Type Handling
+
+| Prometheus type | Converted to                                            |
+| --------------- | ------------------------------------------------------- |
+| GAUGE           | `MetricTypeGauge`                                       |
+| COUNTER         | `MetricTypeCounter`                                     |
+| UNTYPED         | `MetricTypeGauge`                                       |
+| SUMMARY         | `{name}_sum` (Counter) + `{name}_count` (Counter)       |
+| HISTOGRAM       | `{name}_sum` + `{name}_count` + `{name}_bucket{le=...}` |
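
The conversion table above can be read as two small lookups: one for scalar types and one for the series that a summary or histogram expands into. The sketch below uses plain strings as stand-ins for the agent's `MetricTypeGauge` and `MetricTypeCounter` constants, whose real definitions are not shown in this diff; the function names are hypothetical.

```go
package main

// Stand-ins for the agent's metric type constants (actual definitions
// are not part of this diff).
const (
	MetricTypeGauge   = "gauge"
	MetricTypeCounter = "counter"
)

// scalarType maps a scalar Prometheus type to the agent-side type.
// ok is false for SUMMARY and HISTOGRAM, which expand into several series.
func scalarType(promType string) (metricType string, ok bool) {
	switch promType {
	case "GAUGE", "UNTYPED":
		return MetricTypeGauge, true
	case "COUNTER":
		return MetricTypeCounter, true
	}
	return "", false
}

// seriesNames lists the series names emitted for one metric family.
// Histogram buckets additionally carry an le label, omitted here.
func seriesNames(name, promType string) []string {
	switch promType {
	case "SUMMARY":
		return []string{name + "_sum", name + "_count"}
	case "HISTOGRAM":
		return []string{name + "_sum", name + "_count", name + "_bucket"}
	default:
		return []string{name}
	}
}
```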
+
+## Configuration
+
+```yaml
+cadvisor:
+  enabled: true
+  interval: 15s
+  endpoint: "http://localhost:8080"
+  metrics_path: "/metrics"
+  timeout: 10s
+  metric_names: [] # empty = collect all container_* and machine_* metrics
+  labels: {} # extra labels added to all metrics
+```
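
The configuration splits the scrape target into `endpoint` and `metrics_path`. A minimal sketch of how the two could be combined, assuming the scrape URL is simply their concatenation with exactly one slash between them; `scrapeURL` is a hypothetical helper, not a function from the agent.

```go
package main

import "strings"

// scrapeURL joins the configured endpoint and metrics path, tolerating a
// trailing slash on the endpoint and a missing or leading slash on the path.
func scrapeURL(endpoint, metricsPath string) string {
	return strings.TrimRight(endpoint, "/") + "/" + strings.TrimLeft(metricsPath, "/")
}
```

With the defaults shown above, the result is `http://localhost:8080/metrics`, which matches the default stated in the Data Source section.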
+
+## Notes
+
+- cAdvisor must be running separately (e.g. as a DaemonSet in Kubernetes, or `docker run google/cadvisor`).
+- All cAdvisor labels from the Prometheus exposition are preserved (container, pod, namespace, image, etc.).
+- For Kubernetes environments, cAdvisor is typically embedded in the Kubelet — accessible at `/api/v1/nodes/{name}/proxy/metrics/cadvisor`. Use the Kubernetes API server proxy URL as the endpoint in that case.
@@ -0,0 +1,119 @@
+# Docker Collector
+
+Collects per-container CPU, memory, network, disk I/O, and PID metrics from the Docker Engine API.
+
+## Data Source
+
+Docker Engine API via Unix socket (`/var/run/docker.sock`). Uses a single-shot `ContainerStats` call per container (no streaming).
+
+## Metrics
+
+### Container State Summary (all containers)
+
+| Metric                       | Type  | Description                                                     |
+| ---------------------------- | ----- | --------------------------------------------------------------- |
+| `container.state.running`    | Gauge | Number of running containers                                    |
+| `container.state.stopped`    | Gauge | Number of stopped/exited/dead containers                        |
+| `container.state.paused`     | Gauge | Number of paused containers                                     |
+| `container.state.restarting` | Gauge | Number of restarting containers                                 |
+| `container.state.total`      | Gauge | Total containers (including stopped if `include_stopped: true`) |
+
+### Per-Container Labels
+
+All per-container metrics carry these labels:
+
+| Label          | Description                           |
+| -------------- | ------------------------------------- |
+| `container_id` | Short container ID (12 characters)    |
+| `name`         | Container name (leading `/` stripped) |
+| `image`        | Container image name                  |
+| `state`        | Container state (`running`)           |
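
The label rules in the table above (12-character short ID, leading `/` stripped from the name) can be sketched as follows. `containerLabels` is a hypothetical helper written for illustration; the real collector may build its labels differently.

```go
package main

import "strings"

// containerLabels builds the per-container label set described in the table.
// The ID is truncated to 12 characters and the leading "/" that Docker
// prepends to container names is removed.
func containerLabels(id, name, image, state string) map[string]string {
	shortID := id
	if len(shortID) > 12 {
		shortID = shortID[:12]
	}
	return map[string]string{
		"container_id": shortID,
		"name":         strings.TrimPrefix(name, "/"),
		"image":        image,
		"state":        state,
	}
}
```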
+
+---
+
+### CPU (flag: `collect_cpu: true`)
+
+| Metric                            | Type    | Unit        | Description                                                     |
+| --------------------------------- | ------- | ----------- | --------------------------------------------------------------- |
+| `container.cpu.usage_percent`     | Gauge   | percent     | CPU usage % (delta-based: cpuDelta/systemDelta × numCPUs × 100) |
+| `container.cpu.usage_total`       | Counter | nanoseconds | Total cumulative CPU time                                       |
+| `container.cpu.user`              | Counter | nanoseconds | CPU time in user mode                                           |
+| `container.cpu.kernel`            | Counter | nanoseconds | CPU time in kernel mode                                         |
+| `container.cpu.online_cpus`       | Gauge   | —           | Number of online CPUs                                           |
+| `container.cpu.throttled_periods` | Counter | —           | Number of throttled periods                                     |
+| `container.cpu.throttled_time`    | Counter | nanoseconds | Total throttled CPU time                                        |
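
The delta-based formula for `container.cpu.usage_percent` can be sketched directly from the description in the table. The function and parameter names are hypothetical; the inputs correspond to the container and system CPU totals from the previous and current snapshots, and the guard against non-increasing counters is an assumption about how edge cases are handled.

```go
package main

// cpuPercent computes cpuDelta/systemDelta * numCPUs * 100 from two
// snapshots of the cumulative container and system CPU counters.
// It returns 0 when either counter has not advanced, which avoids a
// division by zero and unsigned underflow.
func cpuPercent(prevTotal, curTotal, prevSystem, curSystem uint64, onlineCPUs uint32) float64 {
	if curTotal < prevTotal || curSystem <= prevSystem {
		return 0
	}
	cpuDelta := float64(curTotal - prevTotal)
	systemDelta := float64(curSystem - prevSystem)
	return cpuDelta / systemDelta * float64(onlineCPUs) * 100
}
```

Because the ratio is multiplied by the number of online CPUs, a container that saturates two cores on a four-core host reports 200, not 50.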
+
+---
+
+### Memory (flag: `collect_memory: true`)
+
+| Metric                           | Type  | Unit    | Description                                                |
+| -------------------------------- | ----- | ------- | ---------------------------------------------------------- |
+| `container.memory.usage`         | Gauge | bytes   | Memory usage including cache                               |
+| `container.memory.working_set`   | Gauge | bytes   | Working set (usage − inactive_file, matches `kubectl top`) |
+| `container.memory.limit`         | Gauge | bytes   | Memory limit                                               |
+| `container.memory.max_usage`     | Gauge | bytes   | Peak memory usage recorded                                 |
+| `container.memory.rss`           | Gauge | bytes   | Resident set size (if available)                           |
+| `container.memory.cache`         | Gauge | bytes   | Page cache (if available)                                  |
+| `container.memory.usage_percent` | Gauge | percent | Working set as % of limit                                  |
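
The two derived memory metrics in the table above follow from simple arithmetic: working set is usage minus `inactive_file`, and the percentage is working set over limit. A minimal sketch, with hypothetical function names and with the zero guards added as assumptions:

```go
package main

// workingSet returns usage minus inactive file cache, clamped at 0 so an
// inactive_file value larger than usage cannot underflow.
func workingSet(usage, inactiveFile uint64) uint64 {
	if inactiveFile > usage {
		return 0
	}
	return usage - inactiveFile
}

// usagePercent returns the working set as a percentage of the memory limit.
// A limit of 0 yields 0 rather than dividing by zero.
func usagePercent(ws, limit uint64) float64 {
	if limit == 0 {
		return 0
	}
	return float64(ws) / float64(limit) * 100
}
```

For example, a container with 1000 bytes of usage, 200 bytes of inactive file cache, and a 1600-byte limit has a working set of 800 bytes and a usage percentage of 50.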
+
+---
+
+### Network (flag: `collect_network: true`)
+
+Labels: per-container labels + `interface`
+
+| Metric                         | Type    | Unit  | Description                 |
+| ------------------------------ | ------- | ----- | --------------------------- |
+| `container.network.rx_bytes`   | Counter | bytes | Bytes received              |
+| `container.network.tx_bytes`   | Counter | bytes | Bytes transmitted           |
+| `container.network.rx_packets` | Counter | —     | Packets received            |
+| `container.network.tx_packets` | Counter | —     | Packets transmitted         |
+| `container.network.rx_errors`  | Counter | —     | Receive errors              |
+| `container.network.tx_errors`  | Counter | —     | Transmit errors             |
+| `container.network.rx_dropped` | Counter | —     | Received packets dropped    |
+| `container.network.tx_dropped` | Counter | —     | Transmitted packets dropped |
+
+---
+
+### Disk I/O (flag: `collect_disk_io: true`)
+
+| Metric                         | Type    | Unit  | Description                          |
+| ------------------------------ | ------- | ----- | ------------------------------------ |
+| `container.diskio.read_bytes`  | Counter | bytes | Total bytes read from block devices  |
+| `container.diskio.write_bytes` | Counter | bytes | Total bytes written to block devices |
+| `container.diskio.read_ops`    | Counter | —     | Total read operations                |
+| `container.diskio.write_ops`   | Counter | —     | Total write operations               |
+
+---
+
+### PIDs (flag: `collect_pids: true`)
+
+| Metric                   | Type  | Description                             |
+| ------------------------ | ----- | --------------------------------------- |
+| `container.pids.current` | Gauge | Current number of PIDs in the container |
+
+---
+
+## Configuration
+
+```yaml
+docker:
+  enabled: true
+  interval: 15s
+  socket_path: "/var/run/docker.sock"
+  include_stopped: false
+  collect_cpu: true
+  collect_memory: true
+  collect_network: true
+  collect_disk_io: true
+  collect_pids: true
+  include_containers: [] # allowlist by name (empty = all)
+  exclude_containers: [] # denylist by name
+```
+
+## Notes
+
+- The collector pings the Docker daemon on startup. If unreachable, initialization fails.
+- Only containers in `running` state get per-container stats (stopped containers only count toward state summary if `include_stopped: true`).
+- CPU `usage_percent` requires two consecutive stat snapshots (via `PreCPUStats`). Docker returns both in a single stats call.