[Performance Optimization] Shard level search - change score collection approach from random access by same doc ID to collecting scores for all doc IDs in sub-queries #1234

martin-gaievski · 2025-03-19T19:05:55Z

Is your feature request related to a problem?

In 2.19 and all earlier versions, the hybrid query iterates over documents in a sub-optimal way: it sequentially advances the pointers of all sub-queries to the same doc ID. In many cases this means:

random access on iterators that are optimized for stream-oriented access
redundant and expensive operations, such as sorting sub-query iterators by highest score for the same document. This sorting is not needed, because scores are sorted as part of the hybrid query logic and all sub-query results for the same document are merged anyway

This often degrades performance for broader queries, which we noticed while running benchmarks after the improvements in the 2.15 release: https://opensearch.org/blog/performance-improvment-hybrid-query-215/. Based on those results, latency today increases sharply when a sub-query has millions of matching documents (~3x for 3 sub-queries with 10M matching docs vs. 1 sub-query with 1K matching docs).

What solution would you like?

A new component that orchestrates execution of all sub-queries at the shard level. Each sub-query does an exhaustive search, and the resulting scores are sorted and stored in a single collection. Compared to the existing approach, we can skip sorting the doc iterators and avoid random-access IO operations.
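To make the "by sub-query" collection idea concrete, here is a minimal sketch in Python. It is not the plugin's implementation and does not use Lucene APIs. The names `Posting` and `collect_by_sub_query`, the `(doc_id, score)` tuple shape, and the use of `0.0` for a sub-query that did not match a document are all assumptions made for illustration. Final sorting of the collected scores is left to the existing hybrid query logic and is not shown.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical shape of one hit produced by a sub-query: (doc_id, score).
Posting = Tuple[int, float]


def collect_by_sub_query(sub_queries: List[Iterable[Posting]]) -> Dict[int, List[float]]:
    """Drain each sub-query once, front to back, into one shared collection.

    Every sub-query is read as a forward-only stream, so no iterator is ever
    asked to jump to the doc ID another iterator is positioned on.
    """
    n = len(sub_queries)
    scores: Dict[int, List[float]] = {}
    for idx, postings in enumerate(sub_queries):
        for doc_id, score in postings:
            row = scores.get(doc_id)
            if row is None:
                # Assumption: 0.0 marks "this sub-query did not match the doc".
                row = [0.0] * n
                scores[doc_id] = row
            row[idx] = score
    return scores


if __name__ == "__main__":
    hits = collect_by_sub_query([
        [(1, 0.5), (4, 2.0)],   # sub-query 0
        [(2, 1.0), (4, 0.25)],  # sub-query 1
    ])
    print(hits[4])  # [2.0, 0.25]
```

The point of the sketch is the loop order: the outer loop is over sub-queries and the inner loop over documents, which is the reverse of advancing all sub-queries to the same doc ID.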

Two big paths are feasible here:

New custom bulk scorer. Today the hybrid query relies on the default implementation in Lucene, which uses a general approach that is not optimal for the multiple sub-queries scenario.
Parallelization of sub-query execution. Currently other steps, such as rewrite or reduce of results, run in parallel, but score collection is sequential. This can be a next step after we change the design to "by sub-query" mode.
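The second path could look roughly like the following sketch, assuming score collection has already been restructured so that each sub-query can be drained independently. This is illustrative Python, not plugin or Lucene code. `drain`, `collect_in_parallel`, and the `max_workers` default are hypothetical names and values, and a thread pool is used only to show the structure, not as a claim about how the search thread pool would be wired in.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List, Tuple

# Hypothetical shape of one hit produced by a sub-query: (doc_id, score).
Posting = Tuple[int, float]


def drain(postings: Iterable[Posting]) -> Dict[int, float]:
    """Read one sub-query to the end, independently of the others."""
    return {doc_id: score for doc_id, score in postings}


def collect_in_parallel(
    sub_queries: List[Iterable[Posting]], max_workers: int = 4
) -> Dict[int, List[float]]:
    """Drain sub-queries concurrently, then merge the partial results."""
    n = len(sub_queries)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order, so the position in the list
        # still identifies which sub-query produced each partial result.
        per_query = list(pool.map(drain, sub_queries))

    merged: Dict[int, List[float]] = {}
    for idx, partial in enumerate(per_query):
        for doc_id, score in partial.items():
            # Assumption: 0.0 marks "this sub-query did not match the doc".
            merged.setdefault(doc_id, [0.0] * n)[idx] = score
    return merged
```

Because each worker touches only its own sub-query and the merge happens after all workers finish, no locking is needed in this sketch. That independence is what the "by sub-query" design would make possible.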

martin-gaievski added enhancement untriaged hybrid search hybrid query performance optimization and removed untriaged enhancement labels Mar 19, 2025

martin-gaievski mentioned this issue Mar 19, 2025

[META] Advanced Optimization Techniques for Hybrid query #783

Open
