[Performance Optimization] Shard level search - change score collection approach from random access by same doc ID to collecting scores for all doc IDs in sub-queries
#1234
Is your feature request related to a problem?
In 2.19 and all earlier versions, the hybrid query iterates over documents in a sub-optimal way: it sequentially advances the iterators of all sub-queries to the same doc ID. In many cases this means:
random access on iterators that are optimized for stream-oriented access
redundant and expensive operations, such as sorting sub-query iterators by highest score for the same document. This is not needed, because scores are sorted as part of the hybrid query logic and all sub-query results for the same document are merged anyway
As a result, performance often degrades for broader queries. We noticed this while running benchmarks after the improvements in the 2.15 release: https://opensearch.org/blog/performance-improvment-hybrid-query-215/. Based on those results, latency today increases sharply when a sub-query has millions of matching documents (~3x for 3 sub-queries with 10M matching docs vs. 1 sub-query with 1K matching docs).
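To make the doc-at-a-time pattern described above concrete, here is a toy Java model. It is only a sketch, not the plugin's actual code: `DocAtATimeSketch`, the parallel `docIds`/`scores` arrays (one ascending doc ID list per sub-query), and `maxDoc` are hypothetical stand-ins for the real Lucene iterators, and the iterator sorting step is left out.

```java
import java.util.Arrays;

/** Toy model: every sub-query cursor is aligned on one doc ID before moving on. */
final class DocAtATimeSketch {

    /** Returns result[doc][q] = score of sub-query q for doc (0 when q does not match). */
    static float[][] collect(int[][] docIds, float[][] scores, int maxDoc) {
        float[][] result = new float[maxDoc][docIds.length];
        int[] cursor = new int[docIds.length]; // current position of each sub-query
        while (true) {
            // Pick the smallest doc ID that any sub-query is currently positioned on.
            int target = Integer.MAX_VALUE;
            for (int q = 0; q < docIds.length; q++) {
                if (cursor[q] < docIds[q].length) {
                    target = Math.min(target, docIds[q][cursor[q]]);
                }
            }
            if (target == Integer.MAX_VALUE) {
                break; // all sub-queries are exhausted
            }
            // Visit every sub-query for this single document.
            for (int q = 0; q < docIds.length; q++) {
                if (cursor[q] < docIds[q].length && docIds[q][cursor[q]] == target) {
                    result[target][q] = scores[q][cursor[q]];
                    cursor[q]++;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] docIds = {{0, 2, 5}, {2, 3}};
        float[][] scores = {{0.5f, 0.9f, 0.1f}, {0.3f, 0.7f}};
        System.out.println(Arrays.deepToString(collect(docIds, scores, 6)));
    }
}
```

Note that the outer loop runs once per matching document and touches every sub-query each time, which is the interleaved access the issue wants to avoid.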
What solution would you like?
A new component that orchestrates execution of all sub-queries at the shard level. Each sub-query performs an exhaustive search, and the resulting scores are sorted and stored in a single collection. Compared to the existing approach, we can skip sorting the doc iterators and avoid random-access IO operations.
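One way to picture the "by sub-query" collection is the minimal Java sketch below. It assumes each sub-query can be represented as parallel `docIds`/`scores` arrays; the `SubQueryAtATimeSketch` and `Hit` names are hypothetical and do not come from the plugin or from Lucene.

```java
import java.util.Arrays;

/** Toy model: each sub-query is drained completely, one after another. */
final class SubQueryAtATimeSketch {

    /** Hypothetical hit holder: a doc ID and the score one sub-query gave it. */
    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /** Returns collected[q] = all hits of sub-query q, sorted by score, highest first. */
    static Hit[][] collect(int[][] docIds, float[][] scores) {
        Hit[][] collected = new Hit[docIds.length][];
        for (int q = 0; q < docIds.length; q++) {
            // Stream through this sub-query only; no other iterator is touched.
            Hit[] hits = new Hit[docIds[q].length];
            for (int i = 0; i < hits.length; i++) {
                hits[i] = new Hit(docIds[q][i], scores[q][i]);
            }
            // Stable sort, so equal scores keep their ascending doc ID order.
            Arrays.sort(hits, (a, b) -> Float.compare(b.score, a.score));
            collected[q] = hits;
        }
        return collected;
    }

    public static void main(String[] args) {
        int[][] docIds = {{0, 2, 5}, {2, 3}};
        float[][] scores = {{0.5f, 0.9f, 0.1f}, {0.3f, 0.7f}};
        Hit[][] collected = collect(docIds, scores);
        for (int q = 0; q < collected.length; q++) {
            for (Hit hit : collected[q]) {
                System.out.println("sub-query " + q + " doc " + hit.docId + " score " + hit.score);
            }
        }
    }
}
```

In this model each iterator is read strictly front to back, and no ordering of iterators against each other is needed; merging the per-document results is left to the existing hybrid query logic.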
Two big paths are feasible here:
A new custom bulk scorer. Today the hybrid query relies on the default implementation in Lucene, which uses a general approach that is not optimal for the multiple sub-queries scenario.
Parallelization of sub-query execution. Currently other steps, such as rewrite or reduce of results, run in parallel, but score collection is sequential. This can be a next step after we change the design to the "by sub-query" mode.
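The second path could look roughly like the following sketch, which submits one collection task per sub-query to a standard `java.util.concurrent` thread pool. The class name, the array-based sub-query representation, and the `threads` parameter are assumptions made for illustration; a real implementation would presumably reuse the search thread pools OpenSearch already manages.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Toy model: once collection is "by sub-query", each sub-query can run as its own task. */
final class ParallelSubQuerySketch {

    /** Returns result[q][doc] = score of sub-query q for doc (0 when q does not match). */
    static float[][] collectInParallel(int[][] docIds, float[][] scores, int maxDoc, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<float[]>> tasks = new ArrayList<>();
            for (int q = 0; q < docIds.length; q++) {
                final int[] ids = docIds[q];
                final float[] subScores = scores[q];
                // Each task writes only to its own array, so no locking is needed.
                tasks.add(() -> {
                    float[] out = new float[maxDoc];
                    for (int i = 0; i < ids.length; i++) {
                        out[ids[i]] = subScores[i];
                    }
                    return out;
                });
            }
            float[][] result = new float[docIds.length][];
            // invokeAll returns futures in the same order as the submitted tasks.
            List<Future<float[]>> futures = pool.invokeAll(tasks);
            for (int q = 0; q < result.length; q++) {
                result[q] = futures.get(q).get();
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while collecting scores", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("sub-query collection failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[][] docIds = {{0, 2, 5}, {2, 3}};
        float[][] scores = {{0.5f, 0.9f, 0.1f}, {0.3f, 0.7f}};
        System.out.println(Arrays.deepToString(collectInParallel(docIds, scores, 6, 2)));
    }
}
```

The tasks are independent only because no sub-query needs to be positioned on the same document as another, which is why this step depends on the "by sub-query" design being in place first.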