Commit 3ec4545

ADR about metrics-responder

Signed-off-by: manuelbuil <[email protected]>

1 parent badcdaa


docs/adrs/010-metrics-responder.md

Lines changed: 107 additions & 0 deletions
# metrics-responder client

Date: 2025-10-03

## Status

Proposed

## Context

### Background

RKE2 currently lacks a mechanism to voluntarily share version and cluster metadata. This telemetry data would be very valuable for understanding adoption and planning future development priorities.

Existing CNCF projects such as Longhorn have already adopted (or are in the process of adopting) the upgrade-responder pattern (see https://github.com/longhorn/upgrade-responder).

That service provides endpoints that accept version and metadata information, allowing maintainers to understand their user base better while respecting privacy.

The core client-side implementation is a straightforward periodic REST API call.
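
As a rough sketch of what such a call could look like in Go (the endpoint URL, type names, and version strings below are illustrative placeholders, not part of this proposal; the payload mirrors the example structure shown later in this ADR):

```go
// Minimal sketch of the periodic report call. Endpoint and values are
// placeholders; only the payload shape follows the example in this ADR.
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"time"
)

type checkUpgradeRequest struct {
	AppVersion     string                 `json:"appVersion"`
	ExtraTagInfo   map[string]string      `json:"extraTagInfo"`
	ExtraFieldInfo map[string]interface{} `json:"extraFieldInfo"`
}

func reportMetrics(endpoint string, req checkUpgradeRequest) error {
	body, err := json.Marshal(req)
	if err != nil {
		return err
	}
	client := &http.Client{Timeout: 30 * time.Second}
	resp, err := client.Post(endpoint, "application/json", bytes.NewReader(body))
	if err != nil {
		// Fail gracefully: the data is non-critical, a lost data point is harmless.
		return err
	}
	defer resp.Body.Close()
	return nil
}

func main() {
	req := checkUpgradeRequest{
		AppVersion: "v1.33.0+rke2r1", // hypothetical version string
		ExtraTagInfo: map[string]string{
			"kubernetesVersion": "v1.33.0",
		},
		ExtraFieldInfo: map[string]interface{}{
			"nodeCount": 3,
		},
	}
	// Placeholder URL, standing in for the "$URL" endpoint in the config example below.
	if err := reportMetrics("https://metrics.example.com/v1/checkupgrade", req); err != nil {
		log.Printf("metrics report skipped: %v", err)
	}
}
```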

### Current State

- No telemetry collection exists in rke2
- The team lacks insights into deployment patterns, version adoption, or selected configurations

### Requirements

- Collect only non-personally identifiable cluster metadata
- Opt-out mechanism with clear documentation
- Minimal resource overhead
- Fail gracefully in disconnected environments
- No need for retry mechanisms or a persistent daemon; the data is non-critical and the loss of a few data points is harmless. Resource savings on the nodes are more important.
- Work well in rke2

## Decision

Implement a `metrics-responder` client at `github.com/rancher/rke2-metrics-responder` (similar to existing components): a separate, optional component that is deployed via the rke2 manifest system and triggered periodically.

### Architecture

- **Deployment Method**: `CronJob` in `kube-system` namespace
- **Location**: `/var/lib/rancher/rke2/server/manifests/upgrade-responder.yaml`
- **Scheduling**: CronJob running thrice daily (`0 */8 * * *`)
- **Configuration**: ConfigMap-based with environment variable override
- **Default State**: Enabled by default (opt-out well documented)

### Data Collection

The collected data will include the following information:
- Kubernetes version
- clusteruuid
- nodeCount
- serverNodeCount
- agentNodeCount
- cni-plugin
- os

Example payload structure:
```json
{
  "appVersion": "v1.31.6+k3s1",
  "extraTagInfo": {
    "kubernetesVersion": "v1.31.6",
    "clusteruuid": "53741f60-f208-48fc-ae81-8a969510a598"
  },
  "extraFieldInfo": {
    "nodeCount": 5,
    "serverNodeCount": 3,
    "agentNodeCount": 2,
    "cni-plugin": "calico",
    "os": "ubuntu"
  }
}
```

The `clusteruuid` is needed to differentiate between deployments; it is the UID of the `kube-system` namespace. It is completely random and does not raise privacy concerns.
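
As a sketch of how the client could obtain this value (assuming in-cluster execution and client-go; neither is mandated by this ADR), the UID of the `kube-system` namespace can be read like this:

```go
// Sketch: derive the clusteruuid from the kube-system namespace UID.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// clusterUUID returns the UID of the kube-system namespace, which is
// random and stable for the lifetime of the cluster.
func clusterUUID() (string, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return "", err
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return "", err
	}
	ns, err := clientset.CoreV1().Namespaces().Get(context.TODO(), "kube-system", metav1.GetOptions{})
	if err != nil {
		return "", err
	}
	return string(ns.UID), nil
}

func main() {
	id, err := clusterUUID()
	if err != nil {
		panic(err)
	}
	fmt.Println(id) // e.g. 53741f60-f208-48fc-ae81-8a969510a598
}
```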

### Configuration Interface Example

```yaml
# /etc/rancher/rke2/config.yaml
metrics-responder-enabled: true # default
metrics-responder-config:
  endpoint: "$URL"
  schedule: "0 */8 * * *"
```

(The `endpoint` and `schedule` values shown above are the defaults that apply when metrics-responder is enabled but they are not explicitly specified.)

## Alternatives Considered

### Agent-based Implementation

This would require agents on all nodes. A periodic CronJob is more efficient for cluster-level metadata collection.

### Instrumenting/leveraging update.rke2.io

No easy access to CDN logs, no insights into deployed versions, not as privacy-preserving.

## Consequences

Basic telemetry coverage and analytics to improve project decisions and visibility.

## Future options

This can also form the basis for proactively informing users about relevant available updates based on their deployed version. That is explicitly excluded from this ADR, as it will require additional considerations.
