[FEATURE] Improve Hybrid Search for multiple Indices based searches #1216

Open
navneet1v opened this issue Mar 7, 2025 · 7 comments

@navneet1v
Collaborator

Is your feature request related to a problem?

Currently, hybrid search works very well when the query targets a single index and both the text and vector data are part of that index. But sometimes a user may have data split across multiple indices: say, text data in one index (named text_index) and vector data for different embedding models in separate indices (say vec_index_1, vec_index_2). The reasons for splitting the vector data into different indices are:

  1. You can update one part of a document (for example, a single embedding) without regenerating the whole document.
  2. Since OpenSearch does not delete documents in place on update, updating one embedding rewrites the whole document as a new one, duplicating the unchanged data. This puts more pressure on cluster resources.
  3. Whenever a user has to change the embedding model or wants to add a new one, they can just drop the old index or create a new one rather than re-indexing all the data, which is expensive.
  4. Scaling requirements can differ between indices if the vector dimensions differ.

What solution would you like?

Currently, normalization is implemented as a SearchPhaseResultsProcessor, which is good, but it does not work well across multiple indices because a SearchPhaseResultsProcessor identifies documents by the combination of shard_id and Lucene docId rather than by the global _id. So when the search URL names multiple indices, the same logical document has a different shard_id and Lucene docId combination in each index, and its scores are not merged.

Approach 1: To solve this problem, we could build a response processor that does the normalization and combination at the end.

Approach 2: Alternatively, we could enhance the normalization SearchPhaseResultsProcessor to get the _id for all results, so that _id can be used to combine the documents.

There may be other solutions too. But what I am suggesting is that we should have a way to combine documents based on _id rather than shard_id and Lucene docId when multiple indices are queried with hybrid search.
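
To make the difference between the two keys concrete, here is a minimal Python sketch. It is not the plugin's actual code: the hit tuples, the scores, and the `combine` helper are all made up for illustration. It only shows that summing by a per-shard key leaves the two hits apart, while summing by _id merges them.

```python
from collections import defaultdict

# Hypothetical hits, one per sub-query: (index, shard_id, lucene_doc_id, _id, score).
# All values are invented for illustration.
hits = [
    ("text_index", 0, 7, "doc-1", 0.5),    # text match
    ("vec_index_1", 0, 3, "doc-1", 0.25),  # vector match for the same logical document
]

def combine(hits, key):
    """Sum the scores of all hits that share the same key."""
    combined = defaultdict(float)
    for hit in hits:
        combined[key(hit)] += hit[4]
    return dict(combined)

# Keyed on shard + Lucene docId (the index is included because a shard
# belongs to one index): the two hits stay separate.
by_shard_doc = combine(hits, key=lambda h: (h[0], h[1], h[2]))

# Keyed on the global _id: the two hits collapse into one entry.
by_id = combine(hits, key=lambda h: h[3])

print(len(by_shard_doc), by_id)
```

Both approaches above come down to switching the key in this way; they differ only in where in the search pipeline the merge would happen.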

What alternatives have you considered?

The only alternative is to put all the data in one index and query that.

Do you have any additional context?

Slack thread: https://opensearch.slack.com/archives/C05RCMNQY8N/p1741088631648179

@vibrantvarun
Member

PUT /movie_plots
{
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "plot": {
        "type": "text"
      }
    }
  }
}


PUT /movie_vectors
{
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "vector": {
        "type": "knn_vector",
        "dimension": 3  
      }
    }
  }
}

POST /_bulk
{ "index": { "_index": "movie_plots", "_id": "The Godfather" } }
{ "id": "The Godfather", "plot": "The aging patriarch of an organized crime dynasty transfers control of his clandestine empire to his reluctant son." }
{ "index": { "_index": "movie_plots", "_id": "Inception" } }
{ "id": "Inception", "plot": "A skilled thief, who steals secrets through the use of dream-sharing technology, is given a chance to have his criminal record erased in exchange for implanting an idea into a target's subconscious." }
{ "index": { "_index": "movie_plots", "_id": "The Matrix" } }
{ "id": "The Matrix", "plot": "A computer hacker learns from mysterious rebels about the true nature of his reality and his role in the war against its controllers." }
{ "index": { "_index": "movie_plots", "_id": "The Dark Knight" } }
{ "id": "The Dark Knight", "plot": "When the menace known as the Joker emerges from his mysterious past, he wreaks havoc and chaos on the people of Gotham, forcing Batman to come out of retirement." }
{ "index": { "_index": "movie_plots", "_id": "Interstellar" } }
{ "id": "Interstellar", "plot": "A team of explorers travel through a wormhole in space in an attempt to ensure humanity's survival." }


POST /_bulk
{ "index": { "_index": "movie_vectors", "_id": "The Godfather" } }
{ "id": "The Godfather", "vector": [0.34, -0.56, 0.12] }
{ "index": { "_index": "movie_vectors", "_id": "Inception" } }
{ "id": "Inception", "vector": [0.21, 0.45, -0.78] }
{ "index": { "_index": "movie_vectors", "_id": "The Matrix" } }
{ "id": "The Matrix", "vector": [0.99, -0.23, 0.65] }
{ "index": { "_index": "movie_vectors", "_id": "The Dark Knight" } }
{ "id": "The Dark Knight", "vector": [-0.34, 0.78, 0.56] }
{ "index": { "_index": "movie_vectors", "_id": "Interstellar" } }
{ "id": "Interstellar", "vector": [0.12, -0.67, 0.89] }



PUT /_search/pipeline/rrf_pipeline
{
  "description": "Post processor for hybrid RRF search",
  "phase_results_processors": [
    {
      "score-ranker-processor": {
        "combination": {
          "technique": "rrf"
        }
      }
    }
  ]
}


GET /movie_plots,movie_vectors/_search?search_pipeline=rrf_pipeline
{
  "query": {
    "hybrid": {
      "queries": [
        {
          "match": {
            "plot": {
              "query": "Criminal dream"
            }
          }
        },
        {
          "knn": {
            "vector": {
              "vector": [
                0.2,
                0.43,
                -0.68
              ],
              "k": 5
            }
          }
        }
      ]
    }
  }
}

@navneet1v navneet1v changed the title [FEATURE] Improve For Hybrid Search for multiple Indices [FEATURE] Improve Hybrid Search for multiple Indices Mar 7, 2025
@navneet1v navneet1v changed the title [FEATURE] Improve Hybrid Search for multiple Indices [FEATURE] Improve Hybrid Search for multiple Indices based searches Mar 7, 2025
@PieroM97

PieroM97 commented Mar 7, 2025

Currently, normalization is implemented as a SearchPhaseResultsProcessor, which is good, but it does not work well across multiple indices because a SearchPhaseResultsProcessor identifies documents by the combination of shard_id and Lucene docId rather than by the global _id. So when the search URL names multiple indices, the same logical document has a different shard_id and Lucene docId combination in each index, and its scores are not merged.

Yes, indeed:

When using Reciprocal Rank Fusion, given the following pipeline:

PUT /_search/pipeline/rrf_pipeline
{
  "description": "Post processor for hybrid RRF search",
  "phase_results_processors": [
    {
      "score-ranker-processor": {
        "combination": {
          "technique": "rrf"
        }
      }
    }
  ]
}

And the following query:

GET movie_plots,movie_vectors/_search?search_pipeline=rrf_pipeline
{
  "query": {
    "hybrid": {
      "queries": [
        {
          "match": {
            "plot": {
              "query": "Criminal dream"
            }
          }
        },
        {
          "knn": {
            "vector": {
              "vector": [
                0.21,
                0.45,
                -0.78
              ],
              "k": 1
            }
          }
        }
      ]
    }
  }
}

Scores are calculated correctly, but not "merged":

[
      {
        "_index": "movie_plots",
        "_id": "Inception",
        "_score": 0.016393442,
        "_source": {
          "id": "Inception",
          "plot": "A skilled thief, who steals secrets through the use of dream-sharing technology, is given a chance to have his criminal record erased in exchange for implanting an idea into a target's subconscious."
        }
      },
      {
        "_index": "movie_vectors",
        "_id": "Inception",
        "_score": 0.016393442,
        "_source": {
          "id": "Inception",
          "vector": [
            0.21,
            0.45,
            -0.78
          ]
        }
      }
    ]

Each score is computed according to the formula:

RRF(d) = Σ(r ∈ R) 1 / (k + r(d))

but the two contributions are never summed, since the documents are in different indices.
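
As a quick check of the numbers, here is a small Python sketch of the arithmetic. It assumes a rank constant k = 60 and that "Inception" is ranked first in both sub-queries; both assumptions are consistent with the reported score but are not stated in the output. The function name is made up for illustration.

```python
def rrf_contribution(rank, k=60):
    """One term of the RRF sum: 1 / (k + rank). k = 60 is an assumption."""
    return 1.0 / (k + rank)

# "Inception" is assumed to be rank 1 in the match sub-query and rank 1 in the knn sub-query.
text_part = rrf_contribution(1)
vector_part = rrf_contribution(1)

# What the response shows today: two separate hits, each carrying 1/61.
# What a merge on _id would give: a single hit carrying 2/61.
merged = text_part + vector_part

print(text_part, merged)
```

Under those assumptions each hit gets 1/61, which matches the 0.016393442 in the response, and a merged hit would score 2/61, roughly 0.0328.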

@PieroM97

PieroM97 commented Mar 7, 2025

When using score normalization (with l2 or min-max) and combination:

PUT /_search/pipeline/min_max_pipeline
{
  "description": "Post processor for hybrid search",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {
          "technique": "min_max"
        },
        "combination": {
          "technique": "arithmetic_mean",
          "parameters": {
            "weights": [
              0.3,
              0.7
            ]
          }
        }
      }
    }
  ]
}

Given the following query:

GET movie_plots,movie_vectors/_search?search_pipeline=min_max_pipeline
{
  "query": {
    "hybrid": {
      "queries": [
        {
          "match": {
            "plot": {
              "query": "Criminal dream"
            }
          }
        },
        {
          "knn": {
            "vector": {
              "vector": [
                0.21,
                0.45,
                -0.78
              ],
              "k": 1
            }
          }
        }
      ]
    }
  }
}

Results are normalized correctly and the weights from the combination step are applied:

[
      {
        "_index": "movie_vectors",
        "_id": "Inception",
        "_score": 0.7,
        "_source": {
          "id": "Inception",
          "vector": [
            0.21,
            0.45,
            -0.78
          ]
        }
      },
      {
        "_index": "movie_plots",
        "_id": "Inception",
        "_score": 0.3,
        "_source": {
          "id": "Inception",
          "plot": "A skilled thief, who steals secrets through the use of dream-sharing technology, is given a chance to have his criminal record erased in exchange for implanting an idea into a target's subconscious."
        }
      }
    ]

But the documents, despite having the same _id, are still treated as different documents since they belong to different indices.
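
The 0.7 and 0.3 can be reproduced with a small Python sketch. It rests on two assumptions that fit the output above but are not confirmed here: each sub-query returned a single hit, so min-max maps it to 1.0, and a hit that a sub-query did not return counts as 0.0 in the weighted mean. The helper name is hypothetical.

```python
def weighted_mean(scores, weights):
    """Weighted arithmetic mean of per-sub-query normalized scores."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

weights = [0.3, 0.7]  # [match sub-query, knn sub-query], as in min_max_pipeline

# Normalized scores per sub-query for each hit (assumed: single hit -> 1.0, no hit -> 0.0).
plots_hit = [1.0, 0.0]    # movie_plots/Inception matched only the text sub-query
vectors_hit = [0.0, 1.0]  # movie_vectors/Inception matched only the knn sub-query

# Today: two separate hits.
separate = [weighted_mean(plots_hit, weights), weighted_mean(vectors_hit, weights)]

# If the hits were merged on _id, the document would carry both normalized scores.
merged = weighted_mean([1.0, 1.0], weights)

print(separate, merged)
```

Under those assumptions the separate hits score 0.3 and 0.7, as in the response, while a document merged on _id would score 1.0.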

@PieroM97

PieroM97 commented Mar 7, 2025

Possibly, another solution would be to use a terms aggregation to combine the scores, as follows:

GET movie_plots,movie_vectors/_search?search_pipeline=rrf_pipeline
{
  "query": {
    "hybrid": {
      "queries": [
        {
          "match": {
            "plot": {
              "query": "Criminal dream"
            }
          }
        },
        {
          "knn": {
            "vector": {
              "vector": [
                0.21,
                0.45,
                -0.78
              ],
              "k": 1
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "aggregate_score": {
      "terms": {
        "field": "id"
      },
      "aggs": {
        "score": {
          "sum": {
            "script": "_score"
          }
        }
      }
    }
  }
}

But the aggregation does not pick up the score computed by the search pipeline:

  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.016393442,
    "hits": [
      {
        "_index": "movie_plots",
        "_id": "Inception",
        "_score": 0.016393442,
        "_source": {
          "id": "Inception",
          "plot": "A skilled thief, who steals secrets through the use of dream-sharing technology, is given a chance to have his criminal record erased in exchange for implanting an idea into a target's subconscious."
        }
      },
      {
        "_index": "movie_vectors",
        "_id": "Inception",
        "_score": 0.016393442,
        "_source": {
          "id": "Inception",
          "vector": [
            0.21,
            0.45,
            -0.78
          ]
        }
      }
    ]
  },
  "aggregations": {
    "aggregate_score": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "Inception",
          "doc_count": 2,
          "score": {
            "value": 3.403820753097534
          }
        }
      ]
    }
  }

@navneet1v Any suggestions on how to make this method work? 🤔

@navneet1v
Collaborator Author

@PieroM97 aggregation is not going to work here, because aggregations run in the query phase while normalization runs between the query and fetch phases. The way to solve the problem is to use _id as the key for combining results rather than the Lucene docId + shard_id combination. @vibrantvarun even RRF is not going to work. The basic problem is the key, Lucene docId + shard_id, that the normalization processor uses to decide that two documents are the same and combine their scores.
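
Until something like that exists in the processor, one possible stopgap is to merge the returned hits on the client. The Python sketch below is only an illustration: `merge_hits_by_id` is a hypothetical helper, not part of any OpenSearch client, and plain summation is only meaningful for additive combinations such as RRF or the weighted arithmetic mean. It also cannot recover hits that fell outside the requested result size.

```python
def merge_hits_by_id(hits):
    """Group response hits by _id, sum their scores, and keep each index's _source."""
    merged = {}
    for hit in hits:
        entry = merged.setdefault(
            hit["_id"], {"_id": hit["_id"], "_score": 0.0, "_sources": {}}
        )
        entry["_score"] += hit["_score"]
        entry["_sources"][hit["_index"]] = hit["_source"]
    # Highest combined score first, like a normal hits list.
    return sorted(merged.values(), key=lambda e: e["_score"], reverse=True)

# The two hits from the RRF response earlier in this thread (plot text shortened).
hits = [
    {"_index": "movie_plots", "_id": "Inception", "_score": 0.016393442,
     "_source": {"id": "Inception", "plot": "A skilled thief ..."}},
    {"_index": "movie_vectors", "_id": "Inception", "_score": 0.016393442,
     "_source": {"id": "Inception", "vector": [0.21, 0.45, -0.78]}},
]

result = merge_hits_by_id(hits)
print(result[0]["_id"], result[0]["_score"])
```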

@martin-gaievski
Member

I think Approach 2, adding this functionality to the existing processor, makes more sense: compared to a new response processor, it reuses all the existing pipeline elements, and we currently don't have a way of adding that response processor dynamically. We can add a setting to the normalization processor that enables this global/domain-level normalization. Index information is already available to the SearchPhaseResults processor as part of the QuerySearchResult object, since that class extends the SearchPhaseResult class, where the search shard target is present.

One thing that I haven't seen stated anywhere: for the hybrid query to be executed against all the indexes, the fields that are part of the query must be present in all of the indexes. Otherwise the sub-query will fail for any index where the field is missing. In the setup example here, the knn query will fail for movie_plots because that index is not even knn-enabled and the vector field is missing.

@navneet1v
Collaborator Author

One thing that I haven't seen stated anywhere: for the hybrid query to be executed against all the indexes, the fields that are part of the query must be present in all of the indexes. Otherwise the sub-query will fail for any index where the field is missing. In the setup example here, the knn query will fail for movie_plots because that index is not even knn-enabled and the vector field is missing.

This is true. All indices should have the same mappings/schema.
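
A rough way to spot this before running the query is to compare the queried fields against each index's mapping. The Python sketch below assumes the mappings have already been fetched and have the `{index: {"mappings": {"properties": ...}}}` shape; `missing_fields` is a hypothetical helper that only looks at top-level fields.

```python
def missing_fields(mappings, query_fields):
    """Return {index: [queried fields absent from that index's top-level mapping]}."""
    missing = {}
    for index, body in mappings.items():
        properties = body.get("mappings", {}).get("properties", {})
        absent = [field for field in query_fields if field not in properties]
        if absent:
            missing[index] = absent
    return missing

# Mappings of the two example indices from this thread.
mappings = {
    "movie_plots": {"mappings": {"properties": {
        "id": {"type": "keyword"}, "plot": {"type": "text"}}}},
    "movie_vectors": {"mappings": {"properties": {
        "id": {"type": "keyword"}, "vector": {"type": "knn_vector", "dimension": 3}}}},
}

# Fields used by the hybrid query: "plot" (match) and "vector" (knn).
gaps = missing_fields(mappings, ["plot", "vector"])
print(gaps)
```

For the example setup this flags "vector" as missing from movie_plots and "plot" as missing from movie_vectors.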
