[Discussion] [Star Tree] Cardinality Aggregation Using HyperLogLog Algorithm. #17630

Open
Shailesh-Kumar-Singh opened this issue Mar 19, 2025 · 0 comments
Labels: enhancement, Search:Aggregations, untriaged

Is your feature request related to a problem? Please describe

We need to support Cardinality Aggregation in Star Tree, so I have been exploring the HyperLogLog algorithm for it.

Describe the solution you'd like

HyperLogLog is a probabilistic algorithm for estimating the cardinality (the number of distinct elements) of a dataset. It is typically used when the dataset is so large that exact counting becomes infeasible due to memory constraints.
The algorithm hashes each input element, uses the first few bits of the hash to choose one of several registers (or buckets), and has each register track the maximum number of leading zeros observed in the remaining hash bits. The per-register maxima are then combined (via a harmonic mean in the standard formulation) to estimate the cardinality.

The number of registers is controlled by log2m, the base-2 logarithm of the register (bucket) count, so m = 2^log2m. For example:

 If log2m = 14, then m = 2^14 = 16384 registers.
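
To make the register mechanics concrete, here is a minimal, self-contained Java sketch. The class, its fields, and the raw estimation formula (without HyperLogLog++-style bias and small-range corrections) are illustrative assumptions, not the OpenSearch implementation:

```java
public class SimpleHyperLogLog {
    private final int log2m;        // number of hash bits used to pick a register
    private final int m;            // number of registers = 2^log2m
    private final byte[] registers; // each register keeps a max leading-zero rank

    public SimpleHyperLogLog(int log2m) {
        this.log2m = log2m;
        this.m = 1 << log2m;
        this.registers = new byte[m];
    }

    /** Adds one element, represented by a 64-bit hash of its value. */
    public void addHash(long hash) {
        // First log2m bits select the register.
        int index = (int) (hash >>> (Long.SIZE - log2m));
        // Remaining bits: rank = leading zeros + 1; keep the maximum seen per register.
        byte rank = (byte) (Long.numberOfLeadingZeros(hash << log2m) + 1);
        if (rank > registers[index]) {
            registers[index] = rank;
        }
    }

    /** Raw HyperLogLog estimate (harmonic mean of registers, no bias correction). */
    public double estimate() {
        double harmonicSum = 0.0;
        for (byte r : registers) {
            harmonicSum += Math.pow(2.0, -r);
        }
        double alphaM = 0.7213 / (1 + 1.079 / m); // standard alpha_m approximation for large m
        return alphaM * m * m / harmonicSum;
    }
}
```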

Every document currently stores a HyperLogLog object to estimate cardinality, with a default log2m value of 14. Each HyperLogLog object takes about 11 KB of space. This adds up quickly: for 1 million documents, it alone would require around 11 GB (11 KB × 1M) of storage, which is very large.
The log2m value controls how many registers the HyperLogLog has: every +1 to log2m doubles the register count (and hence the storage per object) while improving accuracy, and lowering it saves space at the cost of accuracy.
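
As a back-of-the-envelope check on those numbers, the sketch below estimates total storage from log2m and the document count. The 6-bits-per-register figure assumes a dense register encoding (enough to hold ranks up to 64) and is an assumption rather than the exact on-disk layout; it lands in the same ballpark as the ~11 KB per object cited above.

```java
public class HllStorageEstimate {
    public static void main(String[] args) {
        int log2m = 14;                    // default precision discussed above
        long numStarTreeDocs = 1_000_000L; // assumed count of reduced star-tree documents

        long registers = 1L << log2m;                           // 2^14 = 16,384 registers
        long bitsPerRegister = 6;                               // assumption: dense encoding
        long bytesPerSketch = registers * bitsPerRegister / 8;  // 12,288 bytes (~12 KB)
        long totalBytes = bytesPerSketch * numStarTreeDocs;     // ~11.4 GiB for 1M documents

        System.out.printf("per-sketch: %d bytes (~%d KB)%n", bytesPerSketch, bytesPerSketch / 1024);
        System.out.printf("total for %d docs: ~%.1f GiB%n",
                numStarTreeDocs, totalBytes / (1024.0 * 1024 * 1024));
    }
}
```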

One important point is that the size of a HyperLogLog object does not depend on how many values are added to it. Whether you insert 1 value or 1,000 values, the size remains the same, so the total storage is driven solely by the number of reduced star-tree documents.
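
Because star-tree construction folds several original documents into one reduced document, the HyperLogLog objects for those documents would be combined with a lossless union, which is simply a register-wise max. A hedged sketch of that merge, operating on raw register arrays like the hypothetical SimpleHyperLogLog above, is below; note the merged sketch stays the same fixed size as its inputs, which is exactly why only the number of reduced star-tree documents drives storage.

```java
public final class HllMerge {
    /**
     * Merges two HyperLogLog register arrays of the same precision (same log2m) in place.
     * A register-wise max is the standard lossless HLL union; illustrative only.
     */
    static void mergeInto(byte[] target, byte[] source) {
        if (target.length != source.length) {
            throw new IllegalArgumentException("log2m (register count) must match");
        }
        for (int i = 0; i < target.length; i++) {
            if (source[i] > target[i]) {
                target[i] = source[i];
            }
        }
    }
}
```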

Related component

Search:Aggregations

Describe alternatives you've considered

No response

Additional context

Due to the high storage cost of maintaining HyperLogLog objects in Star Tree, we have decided not to proceed with supporting Cardinality Aggregation in Star Tree at this time.
We will work on supporting other star-tree queries first and revisit cardinality aggregation if OpenSearch users have use cases for it despite the storage cost. Community members are welcome to update this issue with such use cases or with suggestions for improving the feature.

Note: Percentile Aggregation support in Star Tree using TDigest Algorithm - Issue
