Commit eb5c3ab

Provide a full example collector configuration that uses the load-balancing exporter, tail-sampling processor, and span metrics connector together when scaled to multiple collector instances.
Also remove language and configuration suggesting that load balancing should be a separate deployment of collectors from the collectors doing the tail sampling and span metric generation. It's easier to maintain a single deployment responsible for both load balancing and processing of the load-balanced data, but the pattern for doing this may not be obvious at first.
1 parent 99f0ae5 commit eb5c3ab

1 file changed: +101 -29 lines

content/en/docs/collector/scaling.md

@@ -325,60 +325,132 @@ providing a custom `http_sd_config` per collector instance (pod).
 ### Scaling Stateful Collectors

 Certain components might hold data in memory, yielding different results when
-scaled up. It is the case for the tail-sampling processor, which holds spans in
-memory for a given period, evaluating the sampling decision only when the trace
-is considered complete. Scaling a Collector cluster by adding more replicas
-means that different collectors will receive spans for a given trace, causing
-each collector to evaluate whether that trace should be sampled, potentially
-coming to different answers. This behavior results in traces missing spans,
-misrepresenting what happened in that transaction.
-
-A similar situation happens when using the span-to-metrics processor to generate
-service metrics. When different collectors receive data related to the same
-service, aggregations based on the service name will be inaccurate.
-
-To overcome this, you can deploy a layer of Collectors containing the
-load-balancing exporter in front of your Collectors doing the tail-sampling or
-the span-to-metrics processing. The load-balancing exporter will hash the trace
-ID or the service name consistently and determine which collector backend should
-receive spans for that trace. You can configure the load-balancing exporter to
-use the list of hosts behind a given DNS A entry, such as a Kubernetes headless
-service. When the deployment backing that service is scaled up or down, the
-load-balancing exporter will eventually see the updated list of hosts.
-Alternatively, you can specify a list of static hosts to be used by the
-load-balancing exporter. You can scale up the layer of Collectors configured
-with the load-balancing exporter by increasing the number of replicas. Note that
-each Collector will potentially run the DNS query at different times, causing a
+scaled up. This is the case for the
+[tail-sampling](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/tailsamplingprocessor/README.md)
+processor, which holds spans in memory for a given period, evaluating the
+sampling decision only when the trace is considered complete. Scaling a
+Collector cluster by adding more replicas means that different collectors will
+receive spans for a given trace, causing each collector to evaluate whether that
+trace should be sampled, potentially coming to different answers. This behavior
+results in traces missing spans, misrepresenting what happened in that
+transaction.
+
+A similar situation happens when using the
+[span-to-metrics connector](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/connector/spanmetricsconnector/README.md)
+to generate service metrics. When different collectors receive data related to
+the same service, aggregations based on the service name will be inaccurate due
+to violating the
+[single-writer assumption](https://opentelemetry.io/docs/specs/otel/metrics/data-model/#single-writer).
+
+To overcome this, the
+[load-balancing exporter](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/loadbalancingexporter/README.md)
+can divide load among your collectors while ensuring the telemetry needed for
+accurate tail-sampling and span-metrics is processed by a single collector. The
+load-balancing exporter will hash the trace ID or the service name consistently
+and determine which collector backend should receive spans for that trace. You
+can configure the load-balancing exporter to use the list of hosts behind a
+given DNS A entry, such as a Kubernetes headless service. When the deployment
+backing that service is scaled up or down, the load-balancing exporter will
+eventually see the updated list of hosts. Alternatively, you can specify a list
+of static hosts to be used by the load-balancing exporter. Note that each
+Collector will potentially run the DNS query at different times, causing a
 difference in the cluster view for a few moments. We recommend lowering the
 interval value so that the cluster view is different only for a short period in
 highly-elastic environments.

 Here’s an example configuration using a DNS A record (Kubernetes service otelcol
-on the observability namespace) as the input for the backend information:
+in the observability namespace) as the input for load balancing:

 ```yaml
 receivers:
-  otlp:
+  otlp/before_load_balancing:
     protocols:
       grpc:
         endpoint: 0.0.0.0:4317

+  otlp/for_tail_sampling:
+    protocols:
+      grpc:
+        endpoint: 0.0.0.0:4417
+
+  otlp/for_span_metrics:
+    protocols:
+      grpc:
+        endpoint: 0.0.0.0:4517
+
 processors:
+  tail_sampling:
+    decision_wait: 10s
+    policies:
+      [
+        {
+          name: keep-all-traces-with-errors,
+          type: status_code,
+          status_code: { status_codes: [ERROR] },
+        },
+        {
+          name: keep-10-percent-of-traces,
+          type: probabilistic,
+          probabilistic: { sampling_percentage: 10 },
+        },
+      ]
+
+connectors:
+  spanmetrics:
+    aggregation_temporality: 'AGGREGATION_TEMPORALITY_CUMULATIVE'
+    resource_metrics_key_attributes:
+      - service.name

 exporters:
-  loadbalancing:
+  loadbalancing/tail_sampling:
+    routing_key: "traceID"
     protocol:
       otlp:
     resolver:
       dns:
         hostname: otelcol.observability.svc.cluster.local
+        port: 4417
+  loadbalancing/span_metrics:
+    routing_key: "service"
+    protocol:
+      otlp:
+    resolver:
+      dns:
+        hostname: otelcol.observability.svc.cluster.local
+        port: 4517
+
+  otlp/vendor:
+    endpoint: https://some-vendor.com:4317

 service:
   pipelines:
     traces:
       receivers:
-        - otlp
+        - otlp/before_load_balancing
+      processors: []
+      exporters:
+        - loadbalancing/tail_sampling
+        - loadbalancing/span_metrics
+
+    traces/tail_sampling:
+      receivers:
+        - otlp/for_tail_sampling
+      processors:
+        - tail_sampling
+      exporters:
+        - otlp/vendor
+
+    traces/span_metrics:
+      receivers:
+        - otlp/for_span_metrics
+      processors: []
+      exporters:
+        - spanmetrics
+
+    metrics/spanmetrics:
+      receivers:
+        - spanmetrics
       processors: []
       exporters:
-        - loadbalancing
+        - otlp/vendor
 ```
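
The `dns` resolver in the example expects a DNS A record that resolves to one address per collector pod. In Kubernetes this comes from a headless Service (`clusterIP: None`); a regular ClusterIP Service would resolve to a single virtual IP and defeat the consistent routing. Below is a minimal sketch of such a Service, assuming the collector pods carry a hypothetical `app: otelcol` label; the name, namespace, and ports mirror the example configuration above:

```yaml
# Sketch only: headless Service so that
# otelcol.observability.svc.cluster.local returns the individual pod IPs.
apiVersion: v1
kind: Service
metadata:
  name: otelcol
  namespace: observability
spec:
  clusterIP: None # headless: the DNS A record lists pod IPs directly
  selector:
    app: otelcol # assumed label on the collector pods
  ports:
    - name: otlp-4417 # traffic from loadbalancing/tail_sampling
      port: 4417
      targetPort: 4417
    - name: otlp-4517 # traffic from loadbalancing/span_metrics
      port: 4517
      targetPort: 4517
```

Because the same deployment serves both the public OTLP endpoint (4317) and the internal load-balanced endpoints (4417 and 4517), no separate load-balancing tier is needed, which is the point of the change above.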

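The prose also mentions two options that are easy to miss: the static resolver as an alternative to DNS discovery, and the resolver's refresh interval, which it recommends lowering in highly elastic environments. A sketch of both variants follows, assuming the same backend port as above and placeholder hostnames (see the load-balancing exporter README for the full resolver options):

```yaml
exporters:
  # Alternative: a fixed list of backends instead of DNS discovery.
  loadbalancing/static_hosts:
    routing_key: "traceID"
    protocol:
      otlp:
    resolver:
      static:
        hostnames: # placeholder backends
          - collector-1.example.com:4417
          - collector-2.example.com:4417

  # DNS discovery with a short re-resolution interval, so scale events
  # are reflected in the cluster view quickly.
  loadbalancing/fast_refresh:
    routing_key: "traceID"
    protocol:
      otlp:
    resolver:
      dns:
        hostname: otelcol.observability.svc.cluster.local
        port: 4417
        interval: 1s # how often the DNS A record is re-resolved
```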