[Performance Optimization] Shard level search - change score collection approach from random access by same doc ID to collecting scores for all doc IDs in sub-queries #1234

Open
martin-gaievski opened this issue Mar 19, 2025 · 0 comments

Is your feature request related to a problem?

In 2.19 and all earlier versions, the hybrid query iterates over documents in a suboptimal way: it sequentially advances the pointers of all sub-queries to the same doc ID. In many cases this means:

  • random access on iterators that are optimized for stream-oriented access
  • redundant and expensive operations, such as sorting sub-query iterators by the highest score for the same document. This sorting is not needed, because scores are already sorted as part of the hybrid query logic and all sub-query results for the same document are merged anyway
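
To make the access pattern concrete, here is a minimal sketch of document-at-a-time collection. It is a simplified model and not the actual Lucene or plugin code: `DocAtATimeSketch`, `SubQueryHits`, and `DocScores` are hypothetical names, and each sub-query is assumed to be a pair of in-memory arrays with doc IDs in ascending order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model: not Lucene or plugin code.
final class DocAtATimeSketch {

    /** Matches of one sub-query: doc IDs in ascending order with parallel scores. */
    record SubQueryHits(int[] docIds, float[] scores) {}

    /** One collected row: a doc ID and one score slot per sub-query (0 if no match). */
    record DocScores(int docId, float[] scores) {}

    static List<DocScores> collect(List<SubQueryHits> subQueries) {
        int[] cursor = new int[subQueries.size()]; // one pointer per sub-query
        List<DocScores> out = new ArrayList<>();
        while (true) {
            // Find the smallest doc ID any sub-query is currently positioned on.
            int minDoc = Integer.MAX_VALUE;
            for (int i = 0; i < cursor.length; i++) {
                int[] ids = subQueries.get(i).docIds();
                if (cursor[i] < ids.length) {
                    minDoc = Math.min(minDoc, ids[cursor[i]]);
                }
            }
            if (minDoc == Integer.MAX_VALUE) {
                return out; // every sub-query is exhausted
            }
            // Visit every sub-query again to gather scores for that one document.
            float[] scores = new float[cursor.length];
            for (int i = 0; i < cursor.length; i++) {
                SubQueryHits hits = subQueries.get(i);
                if (cursor[i] < hits.docIds().length && hits.docIds()[cursor[i]] == minDoc) {
                    scores[i] = hits.scores()[cursor[i]];
                    cursor[i]++;
                }
            }
            out.add(new DocScores(minDoc, scores));
        }
    }
}
```

Every collected document touches every sub-query twice in this model, which is the per-document coordination cost the issue describes. The sketch leaves out the iterator sorting mentioned above.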

As a result, performance often degrades for broader queries. We noticed this while running benchmarks after the improvements in the 2.15 release: https://opensearch.org/blog/performance-improvment-hybrid-query-215/. Based on those results, latency today increases sharply when a sub-query has millions of matching documents (~3x for 3 sub-queries with 10M matching docs vs. 1 sub-query with 1K matching docs).

What solution would you like?

A new component that orchestrates the execution of all sub-queries at the shard level. Each sub-query runs an exhaustive search, and its results are sorted by score and stored in a single collection. Compared to the existing approach, we can skip sorting the doc iterators and avoid random-access I/O operations.
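
A rough sketch of what "by sub-query" collection could look like, under stated assumptions: `BySubQuerySketch`, `Hit`, and `SubQueryHits` are hypothetical names, sub-queries are modeled as in-memory arrays, and "a single collection" is interpreted here as one list with a slot per sub-query, each holding that sub-query's top `k` hits. The real component may be organized differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical, simplified model: not Lucene or plugin code.
final class BySubQuerySketch {

    record Hit(int docId, float score) {}

    /** Matches of one sub-query: doc IDs in ascending order with parallel scores. */
    record SubQueryHits(int[] docIds, float[] scores) {}

    private static final Comparator<Hit> BY_SCORE_ASC =
            (a, b) -> Float.compare(a.score(), b.score());

    /** Streams one sub-query front to back and keeps its k best hits, best first. */
    static List<Hit> topK(SubQueryHits hits, int k) {
        // Min-heap on score: the weakest retained hit sits at the head.
        PriorityQueue<Hit> heap = new PriorityQueue<>(BY_SCORE_ASC);
        for (int i = 0; i < hits.docIds().length; i++) { // purely sequential access
            heap.add(new Hit(hits.docIds()[i], hits.scores()[i]));
            if (heap.size() > k) {
                heap.poll(); // evict the current lowest score
            }
        }
        List<Hit> sorted = new ArrayList<>(heap);
        sorted.sort(BY_SCORE_ASC.reversed());
        return sorted;
    }

    /** Runs sub-queries one after another; slot i of the result belongs to sub-query i. */
    static List<List<Hit>> collect(List<SubQueryHits> subQueries, int k) {
        List<List<Hit>> perSubQuery = new ArrayList<>();
        for (SubQueryHits hits : subQueries) {
            perSubQuery.add(topK(hits, k));
        }
        return perSubQuery;
    }
}
```

Each sub-query is read exactly once, in order, and no step has to align the cursors of different sub-queries on the same doc ID. Merging results for the same document is left to the later hybrid query logic.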

Two big paths are feasible here:

  • A new custom bulk scorer. Today the hybrid query relies on the default implementation in Lucene, which takes a general approach that is not optimal for the multiple sub-queries scenario.
  • Parallelization of sub-query execution. Currently other steps, such as rewrite or reduce of results, run in parallel, but score collection is sequential. This can be a next step after we change the design to "by sub-query" mode.
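
As an illustration of the second path, the sketch below submits each sub-query to a thread pool and gathers the per-sub-query results in their original order. It is only a model under assumptions: `ParallelSubQuerySketch`, `Hit`, and `SubQueryHits` are hypothetical names, sub-queries are in-memory arrays, and a plain `ExecutorService` stands in for whatever executor the search infrastructure would actually provide.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical, simplified model: not Lucene or plugin code.
final class ParallelSubQuerySketch {

    record Hit(int docId, float score) {}

    record SubQueryHits(int[] docIds, float[] scores) {}

    private static final Comparator<Hit> BY_SCORE_DESC =
            (a, b) -> Float.compare(b.score(), a.score());

    /** Scores one sub-query independently of all others; highest score first. */
    static List<Hit> scoreSubQuery(SubQueryHits hits) {
        List<Hit> out = new ArrayList<>();
        for (int i = 0; i < hits.docIds().length; i++) {
            out.add(new Hit(hits.docIds()[i], hits.scores()[i]));
        }
        out.sort(BY_SCORE_DESC);
        return out;
    }

    /** Runs every sub-query as its own task; slot i of the result belongs to sub-query i. */
    static List<List<Hit>> collectInParallel(List<SubQueryHits> subQueries, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<Hit>>> futures = new ArrayList<>();
            for (SubQueryHits hits : subQueries) {
                futures.add(pool.submit(() -> scoreSubQuery(hits)));
            }
            List<List<Hit>> results = new ArrayList<>();
            for (Future<List<Hit>> future : futures) {
                results.add(future.get()); // waits for that sub-query to finish
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while collecting scores", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("sub-query task failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

This only works because, in "by sub-query" mode, no task needs to look at another sub-query's cursor. With document-at-a-time collection the sub-queries are coupled on every document, which is why the parallel step is proposed as a follow-up.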