Skip to content

Improving the efficiency of percentile aggregation #19622

@jainankitk

Description

@jainankitk

Is your feature request related to a problem? Please describe

While working on recent customer investigation, I noticed they were executing following aggregation pattern - top level date histogram with 1s buckets (I know its a lot!) having 100 percentile as sub-aggregation:

"aggregations":{"2":{"date_histogram":{"field":"log_time","format":"epoch_millis","interval":"1s","offset":0,"order":{"_key":"asc"},"keyed":false,"min_doc_count":0,"extended_bounds":{"min":1747560702304,"max":1750152702304}},"aggregations":{"1":{"percentiles":{"field":"time_taken_ms","percents":[100.0],"keyed":true,"tdigest":{"compression":100.0}}}}}}}

This query execution was pretty slow and many times result in node going down. The captured histogram showed lot of objects related to aggregation dominating the heap.

  17:      10475440      335214080  com.tdunning.math.stats.IntAVLTree$NodeAllocator
  16:      10475440      335214080  com.tdunning.math.stats.IntAVLTree$IntStack
   8:      10475440      754231680  com.tdunning.math.stats.AVLGroupTree$1
   6:      10475440      838035200  com.tdunning.math.stats.AVLGroupTree
   9:      10475410      754229520  org.opensearch.search.aggregations.metrics.InternalTDigestPercentiles
   7:      10475410      838032800  org.opensearch.search.aggregations.metrics.TDigestState
  10:      10475255      586614280  org.opensearch.search.aggregations.bucket.histogram.InternalDateHistogram$Bucket

While we have been able to address the resiliency issue mostly by hardening the cancellation framework for aggregation, thanks to @kaushalmahi12 (#18426, #19066, #19231, #18673), I am still concerned about the performance for this aggregation execution. Since we are looking at 100th percentile only, isn't that equivalent of max aggregation?

Describe the solution you'd like

  1. We should execute the percentile aggregation as min/max aggregation when we are interested in just 0th or 100th percentile. That combined with the recent skiplist based improvements for date histogram at top level - [META] Skip List Based Optimization for Aggregations in OpenSearch #19384 , should help improve the performance significantly and also reduce the memory required for aggregation execution.
  2. For other types of percentile aggregation, we should evaluate MergingDigest instead of AVLTreeDigest. Based of a quick search on google, the primary advantage of AVLTreeDigest is More consistent insertion times. But that should not be concern the aggregation scenario, as we only care about the amortized cost. Rest everything should be advantage in favor of MergingDigest:
Feature AVLTreeDigest MergingDigest
Performance More consistent insertion times. Amortized speed is faster, but has occasional slower compactions.
Memory Requires more memory due to dynamic allocation. Uses less memory, requiring no dynamic allocation after construction.
Accuracy Nearly as good as MergingDigest. Generally more accurate, especially after recent improvements to interpolation and two-level merging.
Bugs Had issues with duplicate values in older versions that were fixed in later releases. Fewer known bugs reported.
Recommendation A good alternative if consistent insertion times are critical, but MergingDigest is generally preferred. The recommended implementation for most use cases due to superior accuracy and memory efficiency.

Related component

Search:Aggregations

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    Status

    🆕 New

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions