Commit 586c622

feat: HPA/PDB/pod-logs sub-collectors; Kubelet summary ephemeral + working set; Go 1.26 + security fixes; 17 collector docs
1 parent 9feabf4 commit 586c622

34 files changed

Lines changed: 2633 additions & 60 deletions

CHANGELOG.md

Lines changed: 31 additions & 3 deletions
@@ -24,7 +24,7 @@ All notable changes to TelemetryFlow Agent will be documented in this file.
2424
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
2525
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
2626

27-
## [1.1.8] - 2026-03-08
27+
## [1.1.8] - 2026-03-09
2828

2929
### Added
3030

@@ -43,11 +43,39 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
4343
- Host filesystem paths checked both directly and under `TELEMETRYFLOW_HOST_ROOT` prefix — detection works correctly inside DaemonSet containers
4444
- Returns `(false, "")` when not running in a Kubernetes environment at all
4545
- New `IsKubernetes bool` and `K8sProvider string` fields added to `collector.SystemInfo` struct
46+
- **HPA Sub-collector (`collectors.kubernetes.hpa: true`)**: HorizontalPodAutoscaler monitoring
47+
- 5 new metrics: `k8s.hpa.min_replicas`, `k8s.hpa.max_replicas`, `k8s.hpa.current_replicas`, `k8s.hpa.desired_replicas`, `k8s.hpa.condition`
48+
- Condition types: `AbleToScale`, `ScalingActive`, `ScalingLimited` — emitted as `1` (True) / `0` (False/Unknown)
49+
- Labels: `namespace`, `hpa`, `target_kind`, `target_name`
50+
- **PDB Sub-collector (`collectors.kubernetes.pdb: true`)**: PodDisruptionBudget health monitoring
51+
- 4 new metrics: `k8s.pdb.pods_healthy`, `k8s.pdb.pods_desired`, `k8s.pdb.disruptions_allowed`, `k8s.pdb.expected_pods`
52+
- Labels: `namespace`, `pdb`
53+
- RBAC: `policy` apiGroup with `poddisruptionbudgets` resource added to ClusterRole
54+
- **Pod Log Collection (`collectors.kubernetes.pod_logs: true`)**: Tail-based container log collection from the Kubernetes API
55+
- Collects last N lines (`pod_logs_tail_lines`, default 100) per running container per cycle
56+
- Optional namespace allowlist via `pod_logs_namespaces` (empty = same as `namespace_filter`)
57+
- Emits `PodLogEntry` records: `timestamp`, `namespace`, `pod`, `container`, `log_line`
58+
- Respects the existing `namespace_filter` / `exclude_namespaces` config
59+
- **Kubelet `/stats/summary` Expansion**: Extended Kubelet summary scraping with container-level and node-level data
60+
- **Container ephemeral storage**: `k8s.pod.container.ephemeral_storage.usage` and `k8s.pod.container.ephemeral_storage.limit` (bytes)
61+
- **Container memory working set**: `k8s.pod.container.memory.working_set` (bytes, matches `kubectl top`)
62+
- **Node-level network I/O** via Kubelet summary: namespace-level `k8s.network.rx_bytes` / `k8s.network.tx_bytes` (existing `network: true` flag)
63+
- **Collector Documentation (`docs/collectors/`)**: 17 new reference documents covering every collector and sub-collector
64+
- Kubernetes: NODES, PODS, DEPLOYMENTS, WORKLOADS, STORAGE, NETWORK, HPA, PDB, EVENTS, RESOURCE-COUNTS, POD-LOGS
65+
- Host: NODE-EXPORTER (50+ metrics), SYSTEM (14 metrics + SystemInfo heartbeat fields)
66+
- Container: DOCKER (24 metrics), CADVISOR (Prometheus scraper)
67+
- Kernel: EBPF (20 metrics, 7 sub-collectors)
68+
- `README.md` index with data source table and metric naming conventions
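The HPA condition encoding listed above (`1` for True, `0` for False or Unknown) can be sketched in a few lines. This is an illustrative Python sketch, not the agent's Go code: the function name `hpa_condition_samples`, the dict-shaped input, and the extra `condition` label are assumptions made for the example.

```python
# Illustrative sketch only: encode HPA status conditions as 0/1 gauge samples.
HPA_CONDITION_TYPES = ("AbleToScale", "ScalingActive", "ScalingLimited")


def hpa_condition_samples(conditions, labels):
    """Return one (metric, labels, value) tuple per known condition type.

    `conditions` is assumed to be a list of dicts with "type" and "status"
    keys, mirroring the HorizontalPodAutoscaler status conditions.
    """
    status_by_type = {c["type"]: c.get("status", "Unknown") for c in conditions}
    samples = []
    for cond_type in HPA_CONDITION_TYPES:
        # A condition that is absent is treated like Unknown (assumption).
        status = status_by_type.get(cond_type, "Unknown")
        value = 1 if status == "True" else 0
        # The "condition" label name is hypothetical.
        sample_labels = {**labels, "condition": cond_type}
        samples.append(("k8s.hpa.condition", sample_labels, value))
    return samples
```

Calling it with one True and one False condition yields three samples, with the missing `ScalingLimited` condition reported as `0`.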
4669

4770
### Changed
4871

4972
- **`internal/agent/agent.go`**: Replaced inline `uuid.New()` call with `ResolveAgentID(cfg.Agent.ID, cfg.Agent.Hostname, logger)` from the new identity module
5073
- **`deploy/kubernetes/daemonset.yaml`**: Added `HOST_PROC`, `HOST_ETC`, `HOST_SYS`, `HOST_VAR`, `HOST_RUN` environment variables so that `gopsutil` reads `/etc/machine-id` and other identity files from the **host node** rather than the container image — required for a stable `HostID` in the fingerprint
74+
- **Config files updated** — all YAML configs now include `hpa`, `pdb`, `pod_logs`, `pod_logs_tail_lines`, `pod_logs_namespaces` under `collectors.kubernetes`:
75+
- `configs/tfo-agent.yaml`
76+
- `configs/tfo-agent.default.yaml`
77+
- `deploy/helm/tfo-agent/values.yaml` (both `config` and `kubernetes.config` sections)
78+
- **RBAC ClusterRole** (`deploy/helm/tfo-agent/templates/clusterrole.yaml`): Added `policy` apiGroup rule for `poddisruptionbudgets` (required by PDB sub-collector)
5179

5280
### Fixed
5381

@@ -411,8 +439,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
411439

412440
| Version | Date | OTEL SDK | Description |
413441
| ------- | ---------- | -------- | ----------------------------------------------------------------------------------------------------------------- |
414-
| 1.1.8 | 2026-03-08 | v1.40.0 | Go 1.26.0 upgrade, OTEL SDK v1.40.0, Helm deployment for Kubernetes Cluster |
415-
| 1.1.7 | 2026-03-04 | v1.40.0 | Stable agent identity via UUIDv5 host fingerprint; K8s provider detection (15 providers); fix SyncKubernetesState |
442+
| 1.1.8 | 2026-03-09 | v1.40.0 | HPA/PDB/pod-logs sub-collectors; Kubelet summary ephemeral + working set; Go 1.26 + security fixes; 17 collector docs |
443+
| 1.1.7 | 2026-03-04 | v1.40.0 | Stable agent identity via UUIDv5 host fingerprint; K8s provider detection (15 providers); fix SyncKubernetesState |
416444
| 1.1.6 | 2026-02-21 | v1.40.0 | Go 1.25.7, OTEL SDK v1.40.0, build-tag lint fixes, errcheck/staticcheck cleanup |
417445
| 1.1.5 | 2026-02-19 | v1.39.0 | Docker container collector, cAdvisor scraper, CPU fix macOS, tags/labels propagation |
418446
| 1.1.4 | 2026-02-11 | v1.39.0 | eBPF collector (28 metrics), Cilium Hubble integration, 6 BPF programs, kernel-level observability |

configs/tfo-agent.default.yaml

Lines changed: 5 additions & 0 deletions
@@ -235,6 +235,11 @@ collectors:
235235
network: true # Network bytes per namespace (Kubelet Summary API)
236236
# Fetch actual CPU/Memory usage from metrics-server
237237
metrics_api: true
238+
hpa: true # HorizontalPodAutoscaler (current/desired replicas + conditions)
239+
pdb: true # PodDisruptionBudget (healthy/desired/disruptions allowed)
240+
pod_logs: true # Collect recent log lines from each running container
241+
pod_logs_tail_lines: 100 # Log lines per container per collection cycle
242+
pod_logs_namespaces: [] # Restrict pod log collection to these namespaces (empty = same as namespace filter)
238243
# Sync resource state to TFO backend (PostgreSQL entities)
239244
sync_to_backend: true
240245
sync_interval: 60s
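One way to read the `pod_logs_namespaces` comment above (empty means "same as the namespace filter") is sketched below in Python. It is not taken from the agent: the function name `pod_log_namespaces`, the assumption that an empty `namespace_filter` means "all namespaces", and the choice to apply `pod_logs_namespaces` as a further restriction on top of the existing filters are all illustrative.

```python
def pod_log_namespaces(all_namespaces, namespace_filter, exclude_namespaces,
                       pod_logs_namespaces):
    """Return the namespaces whose pods would have logs collected."""
    excluded = set(exclude_namespaces)
    selected = []
    for ns in all_namespaces:
        # Existing collector-wide filters are applied first.
        if namespace_filter and ns not in namespace_filter:
            continue
        if ns in excluded:
            continue
        # An empty pod_logs_namespaces adds no further restriction.
        if pod_logs_namespaces and ns not in pod_logs_namespaces:
            continue
        selected.append(ns)
    return selected
```

Under these assumptions, a namespace listed in both `exclude_namespaces` and `pod_logs_namespaces` is still skipped, because exclusion is applied before the log allowlist.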

configs/tfo-agent.yaml

Lines changed: 5 additions & 0 deletions
@@ -236,6 +236,11 @@ collectors:
236236
resource_counts: true # Secrets, ConfigMaps, Ingresses count
237237
network: true # Network bytes per namespace (Kubelet Summary API)
238238
metrics_api: true # Fetch actual CPU/Memory usage from metrics-server
239+
hpa: true # HorizontalPodAutoscaler (current/desired replicas + conditions)
240+
pdb: true # PodDisruptionBudget (healthy/desired/disruptions allowed)
241+
pod_logs: true # Collect recent log lines from each running container
242+
pod_logs_tail_lines: 100 # Log lines per container per collection cycle
243+
pod_logs_namespaces: [] # Restrict pod log collection to these namespaces (empty = same as namespace filter)
239244
# Sync resource state to TFO backend (populates PostgreSQL K8s entities)
240245
sync_to_backend: true
241246
sync_interval: 60s
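The `pod_logs_tail_lines` setting above keeps only the most recent lines per container per cycle. The Python sketch below illustrates that idea and the record fields named in the changelog (`timestamp`, `namespace`, `pod`, `container`, `log_line`). It is not the agent's implementation: the `tail_entries` helper is hypothetical, and it assumes each raw line starts with a timestamp followed by a space, as the Kubernetes log endpoint produces when timestamps are requested.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class PodLogEntry:
    timestamp: str
    namespace: str
    pod: str
    container: str
    log_line: str


def tail_entries(raw_log, namespace, pod, container, tail_lines=100):
    """Keep the last `tail_lines` lines and turn each into a PodLogEntry."""
    # A bounded deque discards older lines as newer ones are appended.
    last_lines = deque(raw_log.splitlines(), maxlen=tail_lines)
    entries = []
    for line in last_lines:
        # Assumed layout: "<timestamp> <message>".
        timestamp, _, text = line.partition(" ")
        entries.append(PodLogEntry(timestamp, namespace, pod, container, text))
    return entries
```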

deploy/helm/tfo-agent/values.yaml

Lines changed: 5 additions & 0 deletions
@@ -263,6 +263,11 @@ kubernetes:
263263
resource_counts: true
264264
network: true
265265
metrics_api: true
266+
hpa: true
267+
pdb: true
268+
pod_logs: true
269+
pod_logs_tail_lines: 100
270+
pod_logs_namespaces: []
266271
exclude_namespaces:
267272
- kube-system
268273

docs/collectors/CADVISOR.md

Lines changed: 101 additions & 0 deletions
@@ -0,0 +1,101 @@
1+
# cAdvisor Collector
2+
3+
Scrapes container and machine metrics from a running cAdvisor instance's Prometheus `/metrics` endpoint.
4+
5+
## Data Source
6+
7+
HTTP GET to cAdvisor Prometheus endpoint (default: `http://localhost:8080/metrics`).
8+
9+
Parser: Prometheus text format (`expfmt.TextParser`).
10+
11+
## Metric Filtering
12+
13+
By default, only metrics with the following prefixes are collected:
14+
15+
| Prefix | Description |
16+
| ------------ | --------------------- |
17+
| `container_` | Per-container metrics |
18+
| `machine_` | Host machine metrics |
19+
20+
If `metric_names` is configured, only the exact metric names listed are collected.
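The filtering rule described above can be illustrated with a small Python sketch. It is not the collector's Go code; the function name `should_collect` is hypothetical, and the logic simply follows the two rules stated here: default prefixes when `metric_names` is empty, exact names otherwise.

```python
DEFAULT_PREFIXES = ("container_", "machine_")


def should_collect(metric_name, metric_names=()):
    """Decide whether a scraped metric family is kept."""
    if metric_names:
        # An explicit list replaces the prefix rule with exact matching.
        return metric_name in metric_names
    # str.startswith accepts a tuple and matches any of its prefixes.
    return metric_name.startswith(DEFAULT_PREFIXES)
```

With an empty `metric_names`, `container_cpu_usage_seconds_total` passes and `go_goroutines` does not; listing `go_goroutines` explicitly reverses both outcomes.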
21+
22+
## Key cAdvisor Metrics Collected (defaults)
23+
24+
### Container CPU
25+
26+
| Metric | Type | Description |
27+
| ------------------------------------------- | ------- | ---------------------------- |
28+
| `container_cpu_usage_seconds_total` | Counter | Cumulative CPU time consumed |
29+
| `container_cpu_system_seconds_total` | Counter | CPU time in system mode |
30+
| `container_cpu_user_seconds_total` | Counter | CPU time in user mode |
31+
| `container_cpu_cfs_throttled_seconds_total` | Counter | CPU throttled time |
32+
| `container_cpu_cfs_periods_total` | Counter | Total CFS periods |
33+
| `container_cpu_cfs_throttled_periods_total` | Counter | Throttled CFS periods |
34+
35+
### Container Memory
36+
37+
| Metric | Type | Description |
38+
| ------------------------------------ | ------- | -------------------------- |
39+
| `container_memory_usage_bytes` | Gauge | Current memory usage |
40+
| `container_memory_working_set_bytes` | Gauge | Working set memory |
41+
| `container_memory_rss` | Gauge | Resident set size |
42+
| `container_memory_cache` | Gauge | Page cache |
43+
| `container_memory_swap` | Gauge | Swap usage |
44+
| `container_memory_failures_total` | Counter | Memory allocation failures |
45+
46+
### Container Network (per-interface)
47+
48+
| Metric | Type | Description |
49+
| -------------------------------------------------- | ------- | ----------------- |
50+
| `container_network_receive_bytes_total` | Counter | Bytes received |
51+
| `container_network_transmit_bytes_total` | Counter | Bytes transmitted |
52+
| `container_network_receive_errors_total` | Counter | Receive errors |
53+
| `container_network_transmit_errors_total` | Counter | Transmit errors |
54+
| `container_network_receive_packets_dropped_total` | Counter | Received drops |
55+
| `container_network_transmit_packets_dropped_total` | Counter | Transmitted drops |
56+
57+
### Container Filesystem
58+
59+
| Metric | Type | Description |
60+
| --------------------------------- | ------- | ----------------------------- |
61+
| `container_fs_usage_bytes` | Gauge | Bytes used by the container |
62+
| `container_fs_limit_bytes` | Gauge | Bytes limit for the container |
63+
| `container_fs_reads_bytes_total` | Counter | Bytes read from filesystem |
64+
| `container_fs_writes_bytes_total` | Counter | Bytes written to filesystem |
65+
66+
### Machine Metrics
67+
68+
| Metric | Type | Description |
69+
| -------------------------------------- | ----- | --------------------- |
70+
| `machine_cpu_cores` | Gauge | Number of CPU cores |
71+
| `machine_memory_bytes` | Gauge | Total memory in bytes |
72+
| `machine_cpu_cache_capacity_kilobytes` | Gauge | CPU cache capacity |
73+
74+
## Metric Type Handling
75+
76+
| Prometheus type | Converted to |
77+
| --------------- | ------------------------------------------------------- |
78+
| GAUGE | `MetricTypeGauge` |
79+
| COUNTER | `MetricTypeCounter` |
80+
| UNTYPED | `MetricTypeGauge` |
81+
| SUMMARY | `{name}_sum` (Counter) + `{name}_count` (Counter) |
82+
| HISTOGRAM | `{name}_sum` + `{name}_count` + `{name}_bucket{le=...}` |
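The conversion table above can be read as a small mapping function. The Python sketch below is illustrative only: `converted_series` is a hypothetical name, and because the table does not state the converted type of the histogram series, the sketch leaves that type as `None` rather than guessing.

```python
def converted_series(name, prom_type):
    """Return (series_name, converted_type) pairs for one metric family."""
    kind = prom_type.upper()
    if kind in ("GAUGE", "UNTYPED"):
        return [(name, "MetricTypeGauge")]
    if kind == "COUNTER":
        return [(name, "MetricTypeCounter")]
    if kind == "SUMMARY":
        return [
            (name + "_sum", "MetricTypeCounter"),
            (name + "_count", "MetricTypeCounter"),
        ]
    if kind == "HISTOGRAM":
        # One "_bucket" series exists per `le` boundary; the converted
        # type is not given in the table, so it is left unspecified.
        return [
            (name + "_sum", None),
            (name + "_count", None),
            (name + "_bucket", None),
        ]
    raise ValueError("unsupported Prometheus type: " + prom_type)
```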
83+
84+
## Configuration
85+
86+
```yaml
87+
cadvisor:
88+
enabled: true
89+
interval: 15s
90+
endpoint: "http://localhost:8080"
91+
metrics_path: "/metrics"
92+
timeout: 10s
93+
metric_names: [] # empty = collect all container_* and machine_* metrics
94+
labels: {} # extra labels added to all metrics
95+
```
96+
97+
## Notes
98+
99+
- cAdvisor must be running separately (e.g. as a DaemonSet in Kubernetes, or `docker run google/cadvisor`).
100+
- All cAdvisor labels from the Prometheus exposition are preserved (container, pod, namespace, image, etc.).
101+
- For Kubernetes environments, cAdvisor is typically embedded in the Kubelet — accessible at `/api/v1/nodes/{name}/proxy/metrics/cadvisor`. Use the Kubernetes API server proxy URL as the endpoint in that case.

docs/collectors/DOCKER.md

Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
1+
# Docker Collector
2+
3+
Collects per-container CPU, memory, network, disk I/O, and PID metrics from the Docker Engine API.
4+
5+
## Data Source
6+
7+
Docker Engine API via Unix socket (`/var/run/docker.sock`). Uses a single-shot `ContainerStats` call per container (no streaming).
8+
9+
## Metrics
10+
11+
### Container State Summary (all containers)
12+
13+
| Metric | Type | Description |
14+
| ---------------------------- | ----- | --------------------------------------------------------------- |
15+
| `container.state.running` | Gauge | Number of running containers |
16+
| `container.state.stopped` | Gauge | Number of stopped/exited/dead containers |
17+
| `container.state.paused` | Gauge | Number of paused containers |
18+
| `container.state.restarting` | Gauge | Number of restarting containers |
19+
| `container.state.total` | Gauge | Total containers (including stopped if `include_stopped: true`) |
20+
21+
### Per-Container Labels
22+
23+
All per-container metrics carry these labels:
24+
25+
| Label | Description |
26+
| -------------- | ------------------------------------- |
27+
| `container_id` | Short container ID (12 characters) |
28+
| `name` | Container name (leading `/` stripped) |
29+
| `image` | Container image name |
30+
| `state` | Container state (`running`) |
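The label rules in the table above (12-character short ID, leading `/` stripped from the name) can be sketched as follows. This is an illustrative Python sketch rather than the collector's code; `container_labels` is a hypothetical helper, and it assumes the container names arrive as a list whose first entry is used, as in the Docker Engine API container listing.

```python
def container_labels(container_id, names, image, state):
    """Build the per-container label set from Docker listing fields."""
    name = names[0] if names else ""
    # Docker reports names with a leading slash, e.g. "/web".
    if name.startswith("/"):
        name = name[1:]
    return {
        "container_id": container_id[:12],  # short ID
        "name": name,
        "image": image,
        "state": state,
    }
```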
31+
32+
---
33+
34+
### CPU (flag: `collect_cpu: true`)
35+
36+
| Metric | Type | Unit | Description |
37+
| --------------------------------- | ------- | ----------- | --------------------------------------------------------------- |
38+
| `container.cpu.usage_percent` | Gauge | percent | CPU usage % (delta-based: cpuDelta/systemDelta × numCPUs × 100) |
39+
| `container.cpu.usage_total` | Counter | nanoseconds | Total cumulative CPU time |
40+
| `container.cpu.user` | Counter | nanoseconds | CPU time in user mode |
41+
| `container.cpu.kernel` | Counter | nanoseconds | CPU time in kernel mode |
42+
| `container.cpu.online_cpus` | Gauge | - | Number of online CPUs |
43+
| `container.cpu.throttled_periods` | Counter | - | Number of throttled periods |
44+
| `container.cpu.throttled_time` | Counter | nanoseconds | Total throttled CPU time |
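The delta-based formula quoted for `container.cpu.usage_percent` (cpuDelta/systemDelta × numCPUs × 100) can be worked through with a short Python sketch. The function name and argument names are hypothetical, and returning `0.0` when either delta is not positive is an assumption about how a first or stale sample would be handled.

```python
def cpu_usage_percent(cpu_total, precpu_total, system_total, presystem_total,
                      online_cpus):
    """Compute CPU usage percent from two consecutive stat snapshots."""
    cpu_delta = cpu_total - precpu_total
    system_delta = system_total - presystem_total
    if cpu_delta <= 0 or system_delta <= 0:
        # No usable interval between the snapshots (assumed behavior).
        return 0.0
    return cpu_delta / system_delta * online_cpus * 100.0
```

For example, a container that used 250 units of CPU time while the system advanced by 1000 units on a 2-CPU host comes out at 50 percent.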
45+
46+
---
47+
48+
### Memory (flag: `collect_memory: true`)
49+
50+
| Metric | Type | Unit | Description |
51+
| -------------------------------- | ----- | ------- | ---------------------------------------------------------- |
52+
| `container.memory.usage` | Gauge | bytes | Memory usage including cache |
53+
| `container.memory.working_set` | Gauge | bytes | Working set (usage − inactive_file, matches `kubectl top`) |
54+
| `container.memory.limit` | Gauge | bytes | Memory limit |
55+
| `container.memory.max_usage` | Gauge | bytes | Peak memory usage recorded |
56+
| `container.memory.rss` | Gauge | bytes | Resident set size (if available) |
57+
| `container.memory.cache` | Gauge | bytes | Page cache (if available) |
58+
| `container.memory.usage_percent` | Gauge | percent | Working set as % of limit |
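The relationship between the memory metrics above (working set is usage minus `inactive_file`, and `usage_percent` is the working set as a share of the limit) can be sketched in Python. The helper name `memory_metrics`, the clamp at zero, and the `0.0` result for a missing limit are assumptions for illustration.

```python
def memory_metrics(usage, inactive_file, limit):
    """Derive working set and usage percent from raw byte counters."""
    # Clamp at zero in case inactive_file momentarily exceeds usage.
    working_set = max(usage - inactive_file, 0)
    # Avoid dividing by zero when no limit is reported.
    percent = working_set / limit * 100.0 if limit > 0 else 0.0
    return {
        "container.memory.usage": usage,
        "container.memory.working_set": working_set,
        "container.memory.limit": limit,
        "container.memory.usage_percent": percent,
    }
```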
59+
60+
---
61+
62+
### Network (flag: `collect_network: true`)
63+
64+
Labels: per-container labels + `interface`
65+
66+
| Metric | Type | Unit | Description |
67+
| ------------------------------ | ------- | ----- | --------------------------- |
68+
| `container.network.rx_bytes` | Counter | bytes | Bytes received |
69+
| `container.network.tx_bytes` | Counter | bytes | Bytes transmitted |
70+
| `container.network.rx_packets` | Counter | - | Packets received |
71+
| `container.network.tx_packets` | Counter | - | Packets transmitted |
72+
| `container.network.rx_errors` | Counter | - | Receive errors |
73+
| `container.network.tx_errors` | Counter | - | Transmit errors |
74+
| `container.network.rx_dropped` | Counter | - | Received packets dropped |
75+
| `container.network.tx_dropped` | Counter | - | Transmitted packets dropped |
76+
77+
---
78+
79+
### Disk I/O (flag: `collect_disk_io: true`)
80+
81+
| Metric | Type | Unit | Description |
82+
| ------------------------------ | ------- | ----- | ------------------------------------ |
83+
| `container.diskio.read_bytes` | Counter | bytes | Total bytes read from block devices |
84+
| `container.diskio.write_bytes` | Counter | bytes | Total bytes written to block devices |
85+
| `container.diskio.read_ops` | Counter | - | Total read operations |
86+
| `container.diskio.write_ops` | Counter | - | Total write operations |
87+
88+
---
89+
90+
### PIDs (flag: `collect_pids: true`)
91+
92+
| Metric | Type | Description |
93+
| ------------------------ | ----- | --------------------------------------- |
94+
| `container.pids.current` | Gauge | Current number of PIDs in the container |
95+
96+
---
97+
98+
## Configuration
99+
100+
```yaml
101+
docker:
102+
enabled: true
103+
interval: 15s
104+
socket_path: "/var/run/docker.sock"
105+
include_stopped: false
106+
collect_cpu: true
107+
collect_memory: true
108+
collect_network: true
109+
collect_disk_io: true
110+
collect_pids: true
111+
include_containers: [] # allowlist by name (empty = all)
112+
exclude_containers: [] # denylist by name
113+
```
114+
115+
## Notes
116+
117+
- The collector pings the Docker daemon on startup. If unreachable, initialization fails.
118+
- Only containers in `running` state get per-container stats (stopped containers only count toward state summary if `include_stopped: true`).
119+
- CPU `usage_percent` requires two consecutive stat snapshots (via `PreCPUStats`). Docker returns both in a single stats call.
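The `include_containers` and `exclude_containers` options in the configuration above can be pictured as a simple name filter. The Python sketch below is illustrative only: `is_selected` is a hypothetical helper, and both exact-name matching and letting the denylist win over the allowlist are assumptions, since the document does not describe the matching rules.

```python
def is_selected(name, include_containers=(), exclude_containers=()):
    """Return True when a container name passes both lists."""
    if name in exclude_containers:
        # Assumed precedence: an excluded name is never collected.
        return False
    # An empty allowlist means every remaining container is collected.
    return not include_containers or name in include_containers
```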
