[FEATURE] Improve Hybrid Search for multiple Indices based searches #1216

navneet1v · 2025-03-07T00:14:58Z

Is your feature request related to a problem?

Currently, hybrid search works very well when the query targets a single index and both the text and vector data are part of that index. But sometimes a user may have data split across multiple indices, say text data in one index (named text_index) and vector data for different embedding models in separate indices (say vec_index_1 and vec_index_2). The reasons for splitting the vector data into different indices are:

You can update documents without regenerating the whole document.
Since OpenSearch doesn't delete the old docs right away, an update to one embedding duplicates the data in a new document. This puts more pressure on cluster resources.
Whenever a user has to change the embedding model or wants to add a new model, they can just drop the whole index or create a new one rather than re-indexing the data, which is expensive.
Scaling requirements of different indices can differ if the dimensions are different.

What solution would you like?

Currently, normalization uses a SearchPhaseResultsProcessor, which is good, but with multiple indices it will not work perfectly, because SearchPhaseResultsProcessors work on the combination of shard_id and Lucene docId rather than the global _id. So when the search URL names multiple indices, this shard_id and Lucene docId combination will differ between them, and the scores of the same document will not be merged.

Approach 1: To solve this problem, we could build a response processor that does the normalization and combination at the end.

Approach 2: Another approach is to enhance the SearchPhaseResultsProcessors for normalization to get the _id for all results, so that _id can be used to combine the documents.

There can be other solutions too. But ideally, what I am suggesting is that we should have a way to combine the docs based on _id rather than shard_id and Lucene docId when multiple indices are queried with hybrid search.
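
To illustrate the difference between the two merge keys, here is a minimal Python sketch. It is not the plugin's code: the `Hit` type, its field names, and `combine_by` are hypothetical, the scores are made up, and plain summation stands in for whatever combination technique is configured.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List


@dataclass(frozen=True)
class Hit:
    """Hypothetical per-sub-query hit; not an OpenSearch class."""
    index: str       # index the hit came from
    shard_id: int    # shard within that index
    doc_id: int      # Lucene docId, only meaningful inside one shard
    global_id: str   # the document's _id
    score: float     # already-normalized score from one sub-query


def combine_by(hits: List[Hit], key: Callable[[Hit], Hashable]) -> Dict[Hashable, float]:
    """Sum the scores of all hits that share the same key."""
    combined: Dict[Hashable, float] = defaultdict(float)
    for hit in hits:
        combined[key(hit)] += hit.score
    return dict(combined)


# The same logical document, stored once per index (illustrative values).
hits = [
    Hit("text_index", 0, 7, "doc-42", 0.3),
    Hit("vec_index_1", 0, 3, "doc-42", 0.7),
]

# Keyed on shard + Lucene docId: the two hits never meet.
by_shard_doc = combine_by(hits, lambda h: (h.index, h.shard_id, h.doc_id))

# Keyed on _id: the two hits collapse into one combined score.
by_global_id = combine_by(hits, lambda h: h.global_id)
```

With the shard-based key the result has two entries; with the `_id` key it has one entry whose score is the sum of both.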

What alternatives have you considered?

The only alternative is to put all the data in one index and query that.

Do you have any additional context?

Slack thread: https://opensearch.slack.com/archives/C05RCMNQY8N/p1741088631648179

vibrantvarun · 2025-03-07T00:56:42Z

PUT /movie_plots
{
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "plot": {
        "type": "text"
      }
    }
  }
}


PUT /movie_vectors
{
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "vector": {
        "type": "knn_vector",
        "dimension": 3  
      }
    }
  }
}

POST /_bulk
{ "index": { "_index": "movie_plots", "_id": "The Godfather" } }
{ "id": "The Godfather", "plot": "The aging patriarch of an organized crime dynasty transfers control of his clandestine empire to his reluctant son." }
{ "index": { "_index": "movie_plots", "_id": "Inception" } }
{ "id": "Inception", "plot": "A skilled thief, who steals secrets through the use of dream-sharing technology, is given a chance to have his criminal record erased in exchange for implanting an idea into a target's subconscious." }
{ "index": { "_index": "movie_plots", "_id": "The Matrix" } }
{ "id": "The Matrix", "plot": "A computer hacker learns from mysterious rebels about the true nature of his reality and his role in the war against its controllers." }
{ "index": { "_index": "movie_plots", "_id": "The Dark Knight" } }
{ "id": "The Dark Knight", "plot": "When the menace known as the Joker emerges from his mysterious past, he wreaks havoc and chaos on the people of Gotham, forcing Batman to come out of retirement." }
{ "index": { "_index": "movie_plots", "_id": "Interstellar" } }
{ "id": "Interstellar", "plot": "A team of explorers travel through a wormhole in space in an attempt to ensure humanity's survival." }


POST /_bulk
{ "index": { "_index": "movie_vectors", "_id": "The Godfather" } }
{ "id": "The Godfather", "vector": [0.34, -0.56, 0.12] }
{ "index": { "_index": "movie_vectors", "_id": "Inception" } }
{ "id": "Inception", "vector": [0.21, 0.45, -0.78] }
{ "index": { "_index": "movie_vectors", "_id": "The Matrix" } }
{ "id": "The Matrix", "vector": [0.99, -0.23, 0.65] }
{ "index": { "_index": "movie_vectors", "_id": "The Dark Knight" } }
{ "id": "The Dark Knight", "vector": [-0.34, 0.78, 0.56] }
{ "index": { "_index": "movie_vectors", "_id": "Interstellar" } }
{ "id": "Interstellar", "vector": [0.12, -0.67, 0.89] }



PUT /_search/pipeline/rrf_pipeline
{
  "description": "Post processor for hybrid RRF search",
  "phase_results_processors": [
    {
      "score-ranker-processor": {
        "combination": {
          "technique": "rrf"
        }
      }
    }
  ]
}


GET /movie_plots,movie_vectors/_search
{
  "query": {
    "hybrid": {
      "queries": [
        {
          "match": {
            "plot": {
              "query": "Criminal dream"
            }
          }
        },
        {
          "knn": {
            "vector": {
              "vector": [
                0.2,
                0.43,
                -0.68
              ],
              "k": 5
            }
          }
        }
      ]
    }
  }
}

PieroM97 · 2025-03-07T15:12:18Z

Currently how normalization works is it uses a SearchPhaseResultsProcessors which is good, but in case of multiple indices it will not work perfectly because SearchPhaseResultsProcessors work on shard_id and lucene docId combination rather than the global _id. So in case the search URL has multiple indices this combination shard_id and lucene docId will be different and same document scores will not be merged

Yes, indeed:

When using Reciprocal Rank Fusion, given the following pipeline:

PUT /_search/pipeline/rrf_pipeline
{
  "description": "Post processor for hybrid RRF search",
  "phase_results_processors": [
    {
      "score-ranker-processor": {
        "combination": {
          "technique": "rrf"
        }
      }
    }
  ]
}

And the following query:

GET movie_plots,movie_vectors/_search?search_pipeline=rrf_pipeline
{
  "query": {
    "hybrid": {
      "queries": [
        {
          "match": {
            "plot": {
              "query": "Criminal dream"
            }
          }
        },
        {
          "knn": {
            "vector": {
              "vector": [
                0.21,
                0.45,
                -0.78
              ],
              "k": 1
            }
          }
        }
      ]
    }
  }
}

Scores are calculated correctly, but not merged:

[
      {
        "_index": "movie_plots",
        "_id": "Inception",
        "_score": 0.016393442,
        "_source": {
          "id": "Inception",
          "plot": "A skilled thief, who steals secrets through the use of dream-sharing technology, is given a chance to have his criminal record erased in exchange for implanting an idea into a target's subconscious."
        }
      },
      {
        "_index": "movie_vectors",
        "_id": "Inception",
        "_score": 0.016393442,
        "_source": {
          "id": "Inception",
          "vector": [
            0.21,
            0.45,
            -0.78
          ]
        }
      }
    ]

Each hit is scored according to the formula:

RRF(d) = Σ(r ∈ R) 1 / (k + r(d))

but the two scores are not summed, since the documents are in different indices.
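
As a worked example of the formula, the Python sketch below computes RRF scores for the two keying schemes. It assumes a rank constant of k = 60, which is consistent with the 0.016393442 ≈ 1/61 scores in the response above. `rrf_scores` and the key strings are hypothetical names for illustration, not part of any OpenSearch API.

```python
from typing import Dict, List

RANK_CONSTANT = 60  # assumption: matches 1 / (60 + 1) ≈ 0.016393442 seen above


def rrf_scores(rankings: List[List[str]], k: int = RANK_CONSTANT) -> Dict[str, float]:
    """RRF(d) = sum over rankings r of 1 / (k + rank of d in r), ranks start at 1."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, key in enumerate(ranking, start=1):
            scores[key] = scores.get(key, 0.0) + 1.0 / (k + rank)
    return scores


# Keyed per index: each sub-query ranks a "different" document first.
separate = rrf_scores([["movie_plots/Inception"], ["movie_vectors/Inception"]])

# Keyed on _id: both sub-queries rank the same document first.
merged = rrf_scores([["Inception"], ["Inception"]])
```

Under these assumptions each separate hit scores 1/61, while the merged document would score 2/61.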

PieroM97 · 2025-03-07T15:27:14Z

When using score normalization (with l2 or min-max) and combination:

PUT /_search/pipeline/min_max_pipeline
{
  "description": "Post processor for hybrid search",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {
          "technique": "min_max"
        },
        "combination": {
          "technique": "arithmetic_mean",
          "parameters": {
            "weights": [
              0.3,
              0.7
            ]
          }
        }
      }
    }
  ]
}

Given the following query:

GET movie_plots,movie_vectors/_search?search_pipeline=min_max_pipeline
{
  "query": {
    "hybrid": {
      "queries": [
        {
          "match": {
            "plot": {
              "query": "Criminal dream"
            }
          }
        },
        {
          "knn": {
            "vector": {
              "vector": [
                0.21,
                0.45,
                -0.78
              ],
              "k": 1
            }
          }
        }
      ]
    }
  }
}

Results are normalized correctly, and the weights from the combination step are applied:

[
      {
        "_index": "movie_vectors",
        "_id": "Inception",
        "_score": 0.7,
        "_source": {
          "id": "Inception",
          "vector": [
            0.21,
            0.45,
            -0.78
          ]
        }
      },
      {
        "_index": "movie_plots",
        "_id": "Inception",
        "_score": 0.3,
        "_source": {
          "id": "Inception",
          "plot": "A skilled thief, who steals secrets through the use of dream-sharing technology, is given a chance to have his criminal record erased in exchange for implanting an idea into a target's subconscious."
        }
      }
    ]

But the documents, despite having the same _id, are still treated as different documents because they belong to different indices.
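
The 0.7 and 0.3 scores can be reproduced with a small Python sketch of min-max normalization followed by a weighted arithmetic mean. This is a simplification under stated assumptions, not the processor's implementation: a sub-query with a single hit is assumed to normalize to 1.0, a missing sub-query score is assumed to count as 0.0 with its weight still in the denominator, and the raw scores are made-up values. `min_max` and `weighted_mean` are hypothetical helper names.

```python
from typing import List, Optional


def min_max(scores: List[float]) -> List[float]:
    """Min-max normalize one sub-query's scores into [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumption: a lone score (or all-equal scores) normalizes to 1.0.
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def weighted_mean(scores: List[Optional[float]], weights: List[float]) -> float:
    """Weighted arithmetic mean; a missing score (None) is treated as 0.0."""
    total = sum(w * (s if s is not None else 0.0) for s, w in zip(scores, weights))
    return total / sum(weights)


weights = [0.3, 0.7]               # from min_max_pipeline: [text, vector]
text_norm = min_max([2.4])[0]      # made-up raw match score, single hit
vec_norm = min_max([1.0])[0]       # made-up raw knn score, single hit

# Current behavior: each index's hit only carries one sub-query score.
plots_hit = weighted_mean([text_norm, None], weights)
vectors_hit = weighted_mean([None, vec_norm], weights)

# If the hits were merged on _id, both sub-query scores would contribute.
merged_hit = weighted_mean([text_norm, vec_norm], weights)
```

Under these assumptions `plots_hit` is 0.3 and `vectors_hit` is 0.7, matching the response above, while a hit merged on _id would score 1.0.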

PieroM97 · 2025-03-07T15:34:49Z

Possibly, another solution would be to use a terms aggregation to combine the scores, as follows:

GET movie_plots,movie_vectors/_search?search_pipeline=rrf_pipeline
{
  "query": {
    "hybrid": {
      "queries": [
        {
          "match": {
            "plot": {
              "query": "Criminal dream"
            }
          }
        },
        {
          "knn": {
            "vector": {
              "vector": [
                0.21,
                0.45,
                -0.78
              ],
              "k": 1
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "aggregate_score": {
      "terms": {
        "field": "id"
      },
      "aggs": {
        "score": {
          "sum": {
            "script": "_score"
          }
        }
      }
    }
  }
}

But the aggregation is not retrieving the score computed by the search pipeline:

  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.016393442,
    "hits": [
      {
        "_index": "movie_plots",
        "_id": "Inception",
        "_score": 0.016393442,
        "_source": {
          "id": "Inception",
          "plot": "A skilled thief, who steals secrets through the use of dream-sharing technology, is given a chance to have his criminal record erased in exchange for implanting an idea into a target's subconscious."
        }
      },
      {
        "_index": "movie_vectors",
        "_id": "Inception",
        "_score": 0.016393442,
        "_source": {
          "id": "Inception",
          "vector": [
            0.21,
            0.45,
            -0.78
          ]
        }
      }
    ]
  },
  "aggregations": {
    "aggregate_score": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "Inception",
          "doc_count": 2,
          "score": {
            "value": 3.403820753097534
          }
        }
      ]
    }
  }

@navneet1v Any suggestions on how to make this method work? 🤔

navneet1v · 2025-03-07T16:31:58Z

@PieroM97 aggregation is not going to work here, as aggregation runs in the query phase and normalization runs between the query and fetch phases. The way to solve the problem is to use _id as the key to combine the results rather than the Lucene docId + shard_id combination. @vibrantvarun even RRF is not going to work. The basic problem is the Lucene docId + shard_id key, which the normalization processor uses to decide that two documents are the same and combine their scores.

martin-gaievski · 2025-03-13T23:07:49Z

I think Approach 2, adding this functionality to the existing processor, makes more sense: compared to a new response processor, it uses all the existing pipeline elements, and we currently don't have a way of adding that response processor dynamically. We can add a setting to the normalization processor that enables this global/domain-level normalization. Index information is already available to the SearchPhaseResults processor as part of the QuerySearchResult object, as that class extends the SearchPhaseResult class, where the search shard target is present.

One thing that I haven't seen stated anywhere: in order for the hybrid query to be executed on all the indexes, the fields that are part of the query should be present in all of the indexes. Otherwise, the sub-query will fail for any index where the field is missing. In the setup example here, the knn query will fail for movie_plots because that index is not even knn-enabled, and the vector field is missing.
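
A quick way to spot this mismatch before querying is to compare the queried fields against each index's mapping. The Python sketch below does that on plain dictionaries shaped like the mappings from the setup example; `missing_fields` is a hypothetical helper, and it only checks top-level properties.

```python
from typing import Dict, List


def missing_fields(mappings: Dict[str, dict], fields: List[str]) -> Dict[str, List[str]]:
    """Return, per index, the queried top-level fields absent from its mapping."""
    report: Dict[str, List[str]] = {}
    for index, mapping in mappings.items():
        props = mapping.get("mappings", {}).get("properties", {})
        absent = [f for f in fields if f not in props]
        if absent:
            report[index] = absent
    return report


# Mappings copied from the setup example in this thread.
mappings = {
    "movie_plots": {
        "mappings": {"properties": {"id": {"type": "keyword"}, "plot": {"type": "text"}}}
    },
    "movie_vectors": {
        "mappings": {
            "properties": {
                "id": {"type": "keyword"},
                "vector": {"type": "knn_vector", "dimension": 3},
            }
        }
    },
}

# Fields used by the hybrid query: "plot" (match) and "vector" (knn).
report = missing_fields(mappings, ["plot", "vector"])
```

For the setup example, the report flags `vector` as missing from movie_plots and `plot` as missing from movie_vectors.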

navneet1v · 2025-03-14T00:46:50Z

One thing that I haven't seen stated anywhere: in order for hybrid query be executed for all the indexes, fields that are part of the query should be present in all of the indexes. Otherwise that sub-query will fail for index where the field is missing. In setup example here knn query will fail for movie_plots because that index is not even a knn enabled, and the vector field is missing.

This is true. All indices should have the same mappings/schema.

navneet1v added enhancement untriaged labels Mar 7, 2025

owaiskazi19 added the hybrid search label Mar 7, 2025

navneet1v changed the title ~~[FEATURE] Improve For Hybrid Search for multiple Indices~~ [FEATURE] Improve Hybrid Search for multiple Indices Mar 7, 2025

navneet1v changed the title ~~[FEATURE] Improve Hybrid Search for multiple Indices~~ [FEATURE] Improve Hybrid Search for multiple Indices based searches Mar 7, 2025

ryanbogan assigned minalsha Mar 19, 2025

ryanbogan removed the untriaged label Mar 19, 2025
