Is your feature request related to a problem? Please describe
We need to support Cardinality Aggregation in Star Tree, so I explored the HyperLogLog algorithm for it.
Describe the solution you'd like
HyperLogLog is a probabilistic algorithm used for estimating the cardinality (the number of distinct elements) in a large dataset. It’s often used in cases where dealing with very large datasets makes exact counting infeasible due to memory constraints.
The algorithm works by dividing the input data stream into multiple registers (or buckets). Each register tracks the maximum number of leading zeros observed among the hashes routed to it. Because a long run of leading zeros is exponentially unlikely, high register values signal many distinct inputs; the cardinality is estimated by combining all register values (in practice via a bias-corrected harmonic mean).
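The mechanics described above can be made concrete with a minimal sketch. This is an illustrative implementation, not OpenSearch's actual HyperLogLog code; the hash function and class shape are assumptions for the example:

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog sketch: m = 2**log2m registers, each holding the
    maximum rank (leading zeros + 1) seen for hashes routed to it."""

    def __init__(self, log2m=14):
        self.log2m = log2m
        self.m = 1 << log2m
        self.registers = [0] * self.m
        # Standard bias-correction constant for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, value):
        # Derive a 64-bit hash of the value (SHA-1 truncated, for illustration).
        h = int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")
        idx = h & (self.m - 1)           # low log2m bits pick the register
        rest = h >> self.log2m           # remaining bits feed the rank
        rank = (64 - self.log2m) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        # Bias-corrected harmonic mean of 2**(-register) over all registers.
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        # Small-range correction: fall back to linear counting while many
        # registers are still empty.
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)
        return raw
```

Adding 10,000 distinct values and calling estimate() should land within a few percent of 10,000, reflecting the ~0.8% standard error at log2m = 14.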
The number of registers m is controlled by the parameter log2m, the base-2 logarithm of the register count: m = 2^log2m. For example:
If log2m = 14, then m = 2^14 = 16384 registers.
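The ~11 KB per-object figure quoted below is consistent with this register count at a few bits per register. The 5-6 bit register width here is an assumption (typical for dense HLL encodings; the exact size depends on the implementation's register width and object overhead):

```python
log2m = 14
m = 2 ** log2m                       # number of registers
# 16384 registers at an assumed fixed register width:
size_kb_5bit = m * 5 / 8 / 1024      # 5 bits/register -> 10.0 KB
size_kb_6bit = m * 6 / 8 / 1024      # 6 bits/register -> 12.0 KB
```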
Every document currently stores a HyperLogLog object to estimate cardinality, with a default log2m value of 14. Each HyperLogLog object uses about 11 KB of space. This quickly adds up: for 1 million documents, this alone would require around 11 GB (11 KB × 1 million) of storage, which is very large.
The log2m value controls how many registers the HyperLogLog has, and increasing it improves accuracy but also doubles the storage for every 1 point increase. Lowering it saves space but reduces accuracy.
One important point is that the size of a HyperLogLog object does not depend on how many values are added to it: whether you insert 1 value or 1,000 values, the size remains the same. The total storage cost is therefore driven entirely by the number of reduced star-tree documents.
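Both properties above — storage doubles with each 1-point increase in log2m, and storage is independent of how many values were inserted — follow directly from the fixed register array. A quick sketch (register-array bytes only, ignoring object overhead; the 6-bit register width is an assumption):

```python
def register_bytes(log2m, bits_per_register=6):
    # Dense HLL storage: 2**log2m registers at a fixed (assumed) bit width.
    # Note this depends only on log2m, never on how many values were added.
    return (2 ** log2m) * bits_per_register // 8

sizes = {p: register_bytes(p) for p in (13, 14, 15)}
# Each +1 in log2m doubles the register array.
```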
Related component
Search:Aggregations
Describe alternatives you've considered
No response
Additional context
Due to the high storage cost associated with maintaining HyperLogLog objects in Star Tree, we have decided not to proceed with supporting Cardinality Aggregation in Star Tree at this time.
We will work on supporting other star-tree queries first and revisit cardinality aggregation if OpenSearch users have use cases for it despite the storage cost. The community is welcome to update this issue with such use cases, or with suggestions for reducing the storage overhead.
Note: Percentile Aggregation support in Star Tree using TDigest Algorithm - Issue