[Discussion] [Star Tree] Cardinality Aggregation Using HyperLogLog Algorithm. #17630

Open
Shailesh-Kumar-Singh opened this issue Mar 19, 2025 · 0 comments
Labels: enhancement, Search:Aggregations, untriaged

Is your feature request related to a problem? Please describe

We need to support Cardinality Aggregation in Star Tree, so I have been exploring the HyperLogLog algorithm for it.

Describe the solution you'd like

HyperLogLog is a probabilistic algorithm for estimating the cardinality (the number of distinct elements) of a dataset. It is typically used when the dataset is so large that exact counting becomes infeasible due to memory constraints.
The algorithm hashes each input element, uses the first few bits of the hash to choose one of several registers (or buckets), and has each register track the maximum number of leading zeros observed in the remaining hash bits. The per-register maxima are then combined (via a harmonic mean in the standard formulation) to estimate the cardinality.

The number of registers is controlled by log2m, the base-2 logarithm of the register (bucket) count, so m = 2^log2m. For example:

 If log2m = 14, then m = 2^14 = 16384 registers.
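
To make the register mechanics concrete, here is a minimal, self-contained Java sketch. The class, its fields, and the raw estimation formula (without HyperLogLog++-style bias and small-range corrections) are illustrative assumptions, not the OpenSearch implementation:

```java
public class SimpleHyperLogLog {
    private final int log2m;        // number of hash bits used to pick a register
    private final int m;            // number of registers = 2^log2m
    private final byte[] registers; // each register keeps a max leading-zero rank

    public SimpleHyperLogLog(int log2m) {
        this.log2m = log2m;
        this.m = 1 << log2m;
        this.registers = new byte[m];
    }

    /** Adds one element, represented by a 64-bit hash of its value. */
    public void addHash(long hash) {
        // First log2m bits select the register.
        int index = (int) (hash >>> (Long.SIZE - log2m));
        // Remaining bits: rank = leading zeros + 1; keep the maximum seen per register.
        byte rank = (byte) (Long.numberOfLeadingZeros(hash << log2m) + 1);
        if (rank > registers[index]) {
            registers[index] = rank;
        }
    }

    /** Raw HyperLogLog estimate (harmonic mean of registers, no bias correction). */
    public double estimate() {
        double harmonicSum = 0.0;
        for (byte r : registers) {
            harmonicSum += Math.pow(2.0, -r);
        }
        double alphaM = 0.7213 / (1 + 1.079 / m); // standard alpha_m approximation for large m
        return alphaM * m * m / harmonicSum;
    }
}
```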

Every document currently stores a HyperLogLog object to estimate cardinality, with a default log2m value of 14. Each HyperLogLog object takes about 11 KB of space. This adds up quickly: for 1 million documents, it alone would require around 11 GB (11 KB × 1M) of storage, which is very large.
The log2m value controls how many registers the HyperLogLog has: every +1 to log2m doubles the register count (and hence the storage per object) while improving accuracy, and lowering it saves space at the cost of accuracy.
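
As a back-of-the-envelope check on those numbers, the sketch below estimates total storage from log2m and the document count. The 6-bits-per-register figure assumes a dense register encoding (enough to hold ranks up to 64) and is an assumption rather than the exact on-disk layout; it lands in the same ballpark as the ~11 KB per object cited above.

```java
public class HllStorageEstimate {
    public static void main(String[] args) {
        int log2m = 14;                    // default precision discussed above
        long numStarTreeDocs = 1_000_000L; // assumed count of reduced star-tree documents

        long registers = 1L << log2m;                           // 2^14 = 16,384 registers
        long bitsPerRegister = 6;                               // assumption: dense encoding
        long bytesPerSketch = registers * bitsPerRegister / 8;  // 12,288 bytes (~12 KB)
        long totalBytes = bytesPerSketch * numStarTreeDocs;     // ~11.4 GiB for 1M documents

        System.out.printf("per-sketch: %d bytes (~%d KB)%n", bytesPerSketch, bytesPerSketch / 1024);
        System.out.printf("total for %d docs: ~%.1f GiB%n",
                numStarTreeDocs, totalBytes / (1024.0 * 1024 * 1024));
    }
}
```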

One important point is that the size of a HyperLogLog object does not depend on how many values are added to it. Whether you insert 1 value or 1,000 values, the size remains the same, so the total storage is driven solely by the number of reduced star-tree documents.
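
Because star-tree construction folds several original documents into one reduced document, the HyperLogLog objects for those documents would be combined with a lossless union, which is simply a register-wise max. A hedged sketch of that merge, operating on raw register arrays like the hypothetical SimpleHyperLogLog above, is below; note the merged sketch stays the same fixed size as its inputs, which is exactly why only the number of reduced star-tree documents drives storage.

```java
public final class HllMerge {
    /**
     * Merges two HyperLogLog register arrays of the same precision (same log2m) in place.
     * A register-wise max is the standard lossless HLL union; illustrative only.
     */
    static void mergeInto(byte[] target, byte[] source) {
        if (target.length != source.length) {
            throw new IllegalArgumentException("log2m (register count) must match");
        }
        for (int i = 0; i < target.length; i++) {
            if (source[i] > target[i]) {
                target[i] = source[i];
            }
        }
    }
}
```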

Related component

Search:Aggregations

Describe alternatives you've considered

No response

Additional context

Due to the high storage cost of maintaining HyperLogLog objects in Star Tree, we have decided not to proceed with supporting Cardinality Aggregation in Star Tree at this time.
We will work on supporting other star-tree queries first and revisit cardinality aggregation if OpenSearch users have use cases for it despite the storage cost. Community members are welcome to update this issue with such use cases or with suggestions for improving the feature.

Note: Percentile Aggregation support in Star Tree using TDigest Algorithm - Issue
