@@ -325,60 +325,132 @@ providing a custom `http_sd_config` per collector instance (pod).
### Scaling Stateful Collectors

Certain components might hold data in memory, yielding different results when
- scaled up. It is the case for the tail-sampling processor, which holds spans in
- memory for a given period, evaluating the sampling decision only when the trace
- is considered complete. Scaling a Collector cluster by adding more replicas
- means that different collectors will receive spans for a given trace, causing
- each collector to evaluate whether that trace should be sampled, potentially
- coming to different answers. This behavior results in traces missing spans,
- misrepresenting what happened in that transaction.
-
- A similar situation happens when using the span-to-metrics processor to generate
- service metrics. When different collectors receive data related to the same
- service, aggregations based on the service name will be inaccurate.
-
- To overcome this, you can deploy a layer of Collectors containing the
- load-balancing exporter in front of your Collectors doing the tail-sampling or
- the span-to-metrics processing. The load-balancing exporter will hash the trace
- ID or the service name consistently and determine which collector backend should
- receive spans for that trace. You can configure the load-balancing exporter to
- use the list of hosts behind a given DNS A entry, such as a Kubernetes headless
- service. When the deployment backing that service is scaled up or down, the
- load-balancing exporter will eventually see the updated list of hosts.
- Alternatively, you can specify a list of static hosts to be used by the
- load-balancing exporter. You can scale up the layer of Collectors configured
- with the load-balancing exporter by increasing the number of replicas. Note that
- each Collector will potentially run the DNS query at different times, causing a
+ scaled up. This is the case for the
+ [tail-sampling](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/tailsamplingprocessor/README.md)
+ processor, which holds spans in memory for a given period, evaluating the
+ sampling decision only when the trace is considered complete. Scaling a
+ Collector cluster by adding more replicas means that different collectors will
+ receive spans for a given trace, causing each collector to evaluate whether that
+ trace should be sampled, potentially coming to different answers. This behavior
+ results in traces missing spans, misrepresenting what happened in that
+ transaction.
+
+ A similar situation happens when using the
+ [span-to-metrics connector](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/connector/spanmetricsconnector/README.md)
+ to generate service metrics. When different collectors receive data related to
+ the same service, aggregations based on the service name will be inaccurate,
+ because this violates the
+ [single-writer assumption](https://opentelemetry.io/docs/specs/otel/metrics/data-model/#single-writer).
+
+ To overcome this, the
+ [load-balancing exporter](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/loadbalancingexporter/README.md)
+ can divide the load among your collectors while ensuring that the telemetry
+ needed for accurate tail sampling and span metrics is processed by a single
+ collector. The load-balancing exporter will hash the trace ID or the service
+ name consistently and determine which collector backend should receive spans
+ for that trace. You can configure the load-balancing exporter to use the list
+ of hosts behind a given DNS A entry, such as a Kubernetes headless service.
+ When the deployment backing that service is scaled up or down, the
+ load-balancing exporter will eventually see the updated list of hosts.
+ Alternatively, you can specify a list of static hosts to be used by the
+ load-balancing exporter. Note that each Collector will potentially run the DNS
+ query at different times, causing a
difference in the cluster view for a few moments. We recommend lowering the
interval value so that the cluster view is different only for a short period in
highly-elastic environments.

Here’s an example configuration using a DNS A record (Kubernetes service otelcol
- on the observability namespace) as the input for the backend information :
+ in the observability namespace) as the input for load balancing:

```yaml
 receivers:
-  otlp:
+  otlp/before_load_balancing:
     protocols:
       grpc:
         endpoint: 0.0.0.0:4317

+  otlp/for_tail_sampling:
+    protocols:
+      grpc:
+        endpoint: 0.0.0.0:4417
+
+  otlp/for_span_metrics:
+    protocols:
+      grpc:
+        endpoint: 0.0.0.0:4517
+
 processors:
+  tail_sampling:
+    decision_wait: 10s
+    policies:
+      [
+        {
+          name: keep-all-traces-with-errors,
+          type: status_code,
+          status_code: { status_codes: [ERROR] },
+        },
+        {
+          name: keep-10-percent-of-traces,
+          type: probabilistic,
+          probabilistic: { sampling_percentage: 10 },
+        },
+      ]
+
+connectors:
+  spanmetrics:
+    aggregation_temporality: 'AGGREGATION_TEMPORALITY_CUMULATIVE'
+    resource_metrics_key_attributes:
+      - service.name

 exporters:
-  loadbalancing:
+  loadbalancing/tail_sampling:
+    routing_key: "traceID"
     protocol:
       otlp:
     resolver:
       dns:
         hostname: otelcol.observability.svc.cluster.local
+        port: 4417
+  loadbalancing/span_metrics:
+    routing_key: "service"
+    protocol:
+      otlp:
+    resolver:
+      dns:
+        hostname: otelcol.observability.svc.cluster.local
+        port: 4517
+
+  otlp/vendor:
+    endpoint: https://some-vendor.com:4317

 service:
   pipelines:
     traces:
       receivers:
-        - otlp
+        - otlp/before_load_balancing
+      processors: []
+      exporters:
+        - loadbalancing/tail_sampling
+        - loadbalancing/span_metrics
+
+    traces/tail_sampling:
+      receivers:
+        - otlp/for_tail_sampling
+      processors:
+        - tail_sampling
+      exporters:
+        - otlp/vendor
+
+    traces/span_metrics:
+      receivers:
+        - otlp/for_span_metrics
+      processors: []
+      exporters:
+        - spanmetrics
+
+    metrics/spanmetrics:
+      receivers:
+        - spanmetrics
       processors: []
       exporters:
-        - loadbalancing
+        - otlp/vendor
```
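For the `dns` resolver in the configuration above to work in Kubernetes,
`otelcol.observability.svc.cluster.local` should be backed by a headless
Service, so that the DNS A record resolves to one address per backend collector
pod rather than to a single cluster IP. Below is a minimal sketch of such a
Service; the `app: otelcol-backend` selector label and the port names are
assumptions and need to match however the backend collectors are actually
deployed:

```yaml
# Sketch of a headless Service backing the DNS A record
# otelcol.observability.svc.cluster.local used by the dns resolvers above.
apiVersion: v1
kind: Service
metadata:
  name: otelcol
  namespace: observability
spec:
  clusterIP: None # headless: DNS returns one A record per ready pod
  selector:
    app: otelcol-backend # assumption: label of the backend collector pods
  ports:
    - name: otlp-sampling # backend receiver for tail sampling
      port: 4417
      targetPort: 4417
    - name: otlp-metrics # backend receiver for span metrics
      port: 4517
      targetPort: 4517
```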
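The prose above also mentions two tuning points: a static host list as an
alternative to DNS-based discovery, and lowering the resolver refresh interval
in highly elastic environments so the cluster view converges quickly. A minimal
sketch of both variants follows; the exporter names, hostnames, and the `5s`
interval are placeholders rather than values taken from the configuration
above:

```yaml
exporters:
  # Variant 1: a fixed list of backends instead of DNS discovery.
  loadbalancing/static:
    routing_key: "traceID"
    protocol:
      otlp:
    resolver:
      static:
        hostnames: # placeholder hostnames
          - collector-1.example.com:4417
          - collector-2.example.com:4417

  # Variant 2: DNS discovery with an explicit, short refresh interval.
  loadbalancing/fast_refresh:
    routing_key: "traceID"
    protocol:
      otlp:
    resolver:
      dns:
        hostname: otelcol.observability.svc.cluster.local
        port: 4417
        interval: 5s # re-resolution period; placeholder value
```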