Commit edaf054

rfc: Opinionated OpenTelemetry Operator Sampling CR

Signed-off-by: Benedikt Bongartz <[email protected]>

1 file changed: docs/rfcs/opinionated_sampling_cr.md (+134 −0)
# Opinionated OpenTelemetry Operator Sampling CR
**Status:** [*Draft* | *Accepted*]
**Author:** Benedikt Bongartz, [email protected]
**Date:** 20.9.2024
## Objective
Today there is an ongoing discussion about whether and how the collector CR could be split into several components. Such a split would hopefully not only reduce the complexity of some CRs, but also help to manage access to certain collector parts for certain users. Since this turned out to be more complicated than expected, and since there are some non-trivial setups, it might make sense to experiment with opinionated CRs for specific use cases ([#1477](https://github.com/open-telemetry/opentelemetry-operator/pull/1477), [#1906](https://github.com/open-telemetry/opentelemetry-operator/pull/1906)). Based on feedback from a FOSDEM talk on sampling and from KubeCon tutorials, we noticed demand for a sampling CR.
The introduction of this CRD is intended to significantly simplify trace sampling in a Kubernetes environment.
## Summary
Provide a sampler CR in v1alpha1 that can be used to simplify the sampling of traces in Kubernetes.
## Goals and non-goals
**Goals**
- Provide an opinionated CR to simplify sampling configuration in distributed environments
- Allow managing access to different parts of the collector configuration using RBAC (see the sketch after this list)
- Adapt the collector setup based on the sampling strategy
- Secure the communication between collector components by default
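
To illustrate the RBAC goal, a namespaced Role could grant a team edit rights on Sampler resources only, while the collector deployments themselves stay under the platform team's control. A minimal sketch, assuming the CRD is registered under the `opentelemetry.io` group with the plural `samplers`:

```yaml
# Hypothetical Role: lets its subjects edit Sampler CRs only, without
# granting access to the full OpenTelemetryCollector resources.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: sampler-editor
  namespace: observability
rules:
  - apiGroups: ["opentelemetry.io"]
    resources: ["samplers"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
```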
**Non-Goals**
- Solving the generic OpenTelemetryCollector CR split
- Allowing extra processing steps within the CR
- Auto-scaling the setup (might be a future goal)
## Use cases for the proposal
### CASE 1
As a cluster administrator, I want to reduce the amount of traffic caused by generating telemetry data.
### CASE 2
As a cluster administrator, I want to be in control of the collector resources while allowing a user to change sampling policies.
### CASE 3
As a user, I want to be able to filter relevant data without much specific OpenTelemetry knowledge.
## Struct Design
This custom resource creates an environment that allows us to apply, for example, tail-based sampling in a distributed environment. The operator takes care of creating an optional OTel LB service and the sampler instances.
LB instances will be pre-configured to distribute traces to the sampler instances based on a given routing key, such as the trace ID.
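
For illustration, the operator might render the LB tier into a collector configuration along the following lines. This is a sketch, not the operator's actual output: it assumes the contrib `loadbalancing` exporter with its Kubernetes resolver, and the headless service name is hypothetical. Note that the exporter spells the routing key `traceID`:

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}
exporters:
  loadbalancing:
    # Mapped from the CR's 'routingKey: traceid'
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true # no TLS initially, see Limitations
    resolver:
      # Watches the endpoints of a headless service in front of the samplers.
      k8s:
        service: example-sampler-headless.observability
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
```

The user-facing CR driving such a setup could look as follows: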
```yaml
---
apiVersion: opentelemetry.io/v1alpha1
kind: Sampler
metadata:
  name: example-sampler
spec:
  # Policies taken into account when making a sampling decision.
  policies:
    - name: "retain-error-policy"
      type: status_code
      status_codes: [ERROR, UNSET]
  # RoutingKey describes how traffic to be sampled is distributed.
  # It can be 'traceid' or 'service'. Default is 'traceid'.
  routingKey: traceid
  # DecisionWait defines the time to wait before making a sampling decision.
  # Default is 30s, specified in nanoseconds (30s = 30000000000ns).
  decision_wait: 30000000000
  # NumTraces defines the number of traces kept in memory.
  # Default is 5000.
  num_traces: 5000
  # ExpectedNewTracesPerSec defines the expected number of new traces per second.
  # Helps allocate memory structures. Default is 5000.
  expected_new_traces_per_sec: 5000
  # DecisionCache defines the settings for the decision cache.
  # This allows sampling decisions to be cached to avoid re-processing traces.
  decision_cache:
    # SampledCacheSize configures the amount of trace IDs to be kept in an LRU
    # cache, persisting the "keep" decisions for traces that may have already
    # been released from memory.
    # By default, the size is 0 and the cache is inactive.
    # If using, configure this as much higher than num_traces so decisions for
    # trace IDs are kept longer than the span data for the trace.
    sampled_cache_size: 10000
  # Components defines the template of all requirements to configure scheduling
  # of all components to be deployed.
  components:
    loadbalancer:
      # Defines if the component is managed by the operator
      managementState: managed
      resources: # Resource requests and limits
        limits:
          cpu: "500m"
          memory: "512Mi"
        requests:
          cpu: "200m"
          memory: "256Mi"
      # Node selection for component placement
      nodeSelector:
        environment: "production"
      # Number of load balancer replicas (optional)
      replicas: 2

    sampler:
      # Component managed by the operator
      managementState: managed
      # Number of sampler replicas (optional)
      replicas: 3

  # Telemetry settings for the sampler system (currently empty).
  telemetry:
    # telemetry configuration goes here (e.g., serviceMonitor or spanMetrics)
    # ...

  # Exporter configuration. Only the OTLP exporter is supported.
  exporter:
    # OTLP exporter endpoint
    endpoint: "jaeger:4317"
```
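
On the sampler tier, the `spec` fields would map onto the contrib `tail_sampling` processor. A rough sketch of the corresponding configuration, assuming the nanosecond `decision_wait` is normalized to a duration string and the OTLP exporter points at the configured endpoint:

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}
processors:
  tail_sampling:
    decision_wait: 30s # 30000000000ns in the CR
    num_traces: 5000
    expected_new_traces_per_sec: 5000
    decision_cache:
      sampled_cache_size: 10000
    policies:
      - name: retain-error-policy
        type: status_code
        # The processor nests the type-specific settings, unlike the
        # flattened policy block in the CR above.
        status_code:
          status_codes: [ERROR, UNSET]
exporters:
  otlp:
    endpoint: jaeger:4317
    tls:
      insecure: true # no TLS initially, see Limitations
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp]
```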
## Rollout Plan
1. Introduction of the CRD in v1alpha1.
2. First controller implementation.
3. Implementation of e2e tests.
4. CRD becomes part of the operator bundle.
## Limitations
1. Initially, there is no TLS support for incoming, internal, and outgoing connections.
2. The input and output format is exclusively OTLP.
3. Policies are initially part of the sampling CR and cannot be configured independently of the sampler setup.
