[FEATURE] Improve Hybrid Search for multiple Indices based searches #1216
Comments
Yes, indeed. When using Reciprocal Rank Fusion, given the following pipeline:
PUT /_search/pipeline/rrf_pipeline
{
"description": "Post processor for hybrid RRF search",
"phase_results_processors": [
{
"score-ranker-processor": {
"combination": {
"technique": "rrf"
}
}
}
]
}
And the following query:
GET movie_plots,movie_vectors/_search?search_pipeline=rrf_pipeline
{
"query": {
"hybrid": {
"queries": [
{
"match": {
"plot": {
"query": "Criminal dream"
}
}
},
{
"knn": {
"vector": {
"vector": [
0.21,
0.45,
-0.78
],
"k": 1
}
}
}
]
}
}
}
Scores are correctly calculated, but not "merged":
[
{
"_index": "movie_plots",
"_id": "Inception",
"_score": 0.016393442,
"_source": {
"id": "Inception",
"plot": "A skilled thief, who steals secrets through the use of dream-sharing technology, is given a chance to have his criminal record erased in exchange for implanting an idea into a target's subconscious."
}
},
{
"_index": "movie_vectors",
"_id": "Inception",
"_score": 0.016393442,
"_source": {
"id": "Inception",
"vector": [
0.21,
0.45,
-0.78
]
}
}
]
That is, the scores are not merged according to the RRF formula, since the documents are in different indices. |
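To make the observed number concrete, here is a minimal sketch of the standard RRF score, assuming a rank constant of 60 (an assumption on my side, but it is consistent with the `_score` of about 0.0164 = 1/61 shown above). The helper name `rrf_score` is hypothetical and is not part of the plugin.

```python
# Hypothetical helper: reciprocal rank fusion score for one document.
# rank_constant=60 is an assumption; it matches the observed score of 1/61.
def rrf_score(ranks, rank_constant=60):
    """Sum 1 / (rank_constant + rank) over the sub-queries that returned the document."""
    return sum(1.0 / (rank_constant + rank) for rank in ranks)

# Today: each index-specific copy of "Inception" is rank 1 in exactly one sub-query.
separate = rrf_score([1])

# If both copies were treated as one document, it would be rank 1 in both sub-queries.
merged = rrf_score([1, 1])

print(separate, merged)
```

Under these assumptions each copy scores 1/61, while a merged document would score 2/61, roughly 0.0328.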
When using score normalization (with l2 or min-max) and combination:
PUT /_search/pipeline/min_max_pipeline
{
"description": "Post processor for hybrid search",
"phase_results_processors": [
{
"normalization-processor": {
"normalization": {
"technique": "min_max"
},
"combination": {
"technique": "arithmetic_mean",
"parameters": {
"weights": [
0.3,
0.7
]
}
}
}
}
]
}
Given the following query:
GET movie_plots,movie_vectors/_search?search_pipeline=min_max_pipeline
{
"query": {
"hybrid": {
"queries": [
{
"match": {
"plot": {
"query": "Criminal dream"
}
}
},
{
"knn": {
"vector": {
"vector": [
0.21,
0.45,
-0.78
],
"k": 1
}
}
}
]
}
}
}
Results are correctly normalized and the weights from the combination step are applied:
[
{
"_index": "movie_vectors",
"_id": "Inception",
"_score": 0.7,
"_source": {
"id": "Inception",
"vector": [
0.21,
0.45,
-0.78
]
}
},
{
"_index": "movie_plots",
"_id": "Inception",
"_score": 0.3,
"_source": {
"id": "Inception",
"plot": "A skilled thief, who steals secrets through the use of dream-sharing technology, is given a chance to have his criminal record erased in exchange for implanting an idea into a target's subconscious."
}
}
]
But the documents, despite having the same _id, are still treated as different documents because they belong to different indices. |
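The 0.3 and 0.7 scores above can be reproduced with a small sketch. It assumes two behaviours that are consistent with the output but are not confirmed here: a lone hit in a sub-query is min-max normalized to 1.0, and a sub-query that did not return a document contributes 0 to the weighted mean. The function names and the raw scores are placeholders.

```python
def min_max(scores):
    """Min-max normalize one sub-query's scores; a lone hit maps to 1.0 (assumption)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def weighted_mean(per_query, weights):
    """Weighted arithmetic mean per document; a missing sub-query counts as 0 (assumption)."""
    keys = {k for q in per_query for k in q}
    total = sum(weights)
    return {
        k: sum(w * q.get(k, 0.0) for q, w in zip(per_query, weights)) / total
        for k in keys
    }

weights = [0.3, 0.7]
# Raw scores are placeholders; with one hit per sub-query they do not affect the result.
match_hits = {("movie_plots", "Inception"): 2.5}
knn_hits = {("movie_vectors", "Inception"): 0.9}

# Documents are keyed by (index, _id), so the two copies never meet.
combined = weighted_mean([min_max(match_hits), min_max(knn_hits)], weights)
print(combined)
```

Each copy only ever receives the weight of the one sub-query that matched it, which is why the two hits come back as 0.3 and 0.7 instead of a single combined score.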
Possibly, another solution would be to use a terms aggregation to combine the scores, as follows:
GET movie_plots,movie_vectors/_search?search_pipeline=rrf_pipeline
{
"query": {
"hybrid": {
"queries": [
{
"match": {
"plot": {
"query": "Criminal dream"
}
}
},
{
"knn": {
"vector": {
"vector": [
0.21,
0.45,
-0.78
],
"k": 1
}
}
}
]
}
},
"aggs": {
"aggregate_score": {
"terms": {
"field": "id"
},
"aggs": {
"score": {
"sum": {
"script": "_score"
}
}
}
}
}
}
But the aggregation is not retrieving the score computed by the search pipeline:
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 0.016393442,
"hits": [
{
"_index": "movie_plots",
"_id": "Inception",
"_score": 0.016393442,
"_source": {
"id": "Inception",
"plot": "A skilled thief, who steals secrets through the use of dream-sharing technology, is given a chance to have his criminal record erased in exchange for implanting an idea into a target's subconscious."
}
},
{
"_index": "movie_vectors",
"_id": "Inception",
"_score": 0.016393442,
"_source": {
"id": "Inception",
"vector": [
0.21,
0.45,
-0.78
]
}
}
]
},
"aggregations": {
"aggregate_score": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Inception",
"doc_count": 2,
"score": {
"value": 3.403820753097534
}
}
]
}
}
@navneet1v Any suggestions on how to make this method work? 🤔 |
@PieroM97 Aggregation is not going to work here, because aggregations run in the query phase while normalization runs between the query and fetch phases. The way to solve the problem is we use |
I think Approach 2, adding this functionality to the existing processor, makes more sense. Compared to a new response processor, it uses all the existing pipeline elements, and we currently don't have a way of adding that response processor dynamically. We can add a setting to the normalization processor that enables this global/domain-level normalization. Index information is already available to the SearchPhaseResult processor as part of the QuerySearchResult object, since that class extends the SearchPhaseResult class, where the search shard target is present. One thing that I haven't seen stated anywhere: for the hybrid query to be executed against all the indexes, the fields that are part of the query should be present in all of the indexes. Otherwise that sub-query will fail for any index where the field is missing. In setup example here |
This is true. All indices should have the same mappings/schema. |
Is your feature request related to a problem?
Currently hybrid search works very well when the query targets a single index and both the text and vector data are part of that index. But sometimes a user may have data split across multiple indices, say text data in one index (named text_index) and vector data for different embedding models in different indices (say vec_index_1, vec_index_2). The reason for splitting the vector data into different indices is
What solution would you like?
Currently, normalization uses a SearchPhaseResultsProcessor, which is good, but with multiple indices it will not work perfectly, because SearchPhaseResultsProcessors work on the shard_id and lucene docId combination rather than the global_id. So when the search URL has multiple indices, this shard_id and lucene docId combination will be different and the scores of the same document will not be merged.
Approach 1: To solve this problem, we can think of building a response processor that does the normalization and combination at the end.
Approach 2: Another approach is to enhance the SearchPhaseResultsProcessor for normalization to get the _id for all results, so that _id can be used to combine the documents.
There can be other solutions too. But ideally, what I am suggesting is that we should have a way to combine the docs based on _id rather than shard_id and lucene docId when multiple indices are queried with hybrid search.
What alternatives have you considered?
The only alternative is to put all the data in one index and query that index.
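The difference between the current grouping key and the proposed one can be sketched as follows. This is only an illustration of the idea behind both approaches, not the processor's actual code: the `combine` helper, the hit layout, and the shard and Lucene doc id values are hypothetical, and the scores are the weighted contributions from the min-max example in this thread. Summing is only reasonable here because each hit carries the contribution of a single sub-query.

```python
from collections import defaultdict

def combine(hits, key):
    """Sum the score contributions of all hits that share the same key."""
    totals = defaultdict(float)
    for hit in hits:
        totals[key(hit)] += hit["score"]
    return dict(totals)

# Hypothetical hits; shard_id and doc_id values are made up for illustration.
hits = [
    {"index": "movie_plots", "shard_id": 0, "doc_id": 0, "_id": "Inception", "score": 0.3},
    {"index": "movie_vectors", "shard_id": 0, "doc_id": 0, "_id": "Inception", "score": 0.7},
]

# Current behaviour: the key is tied to the shard (and therefore the index).
per_shard_doc = combine(hits, key=lambda h: (h["index"], h["shard_id"], h["doc_id"]))

# Proposed behaviour: the key is the document _id, independent of the index.
per_id = combine(hits, key=lambda h: h["_id"])

print(per_shard_doc)
print(per_id)
```

With the shard-based key the two hits stay separate, while keying on _id yields a single entry for "Inception" whose score is the sum of both contributions.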
Do you have any additional context?
Slack thread: https://opensearch.slack.com/archives/C05RCMNQY8N/p1741088631648179