Merged
1 change: 1 addition & 0 deletions _search-plugins/search-pipelines/index.md
@@ -20,6 +20,7 @@ The following is a list of search pipeline terminology:
* [_Search response processor_]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/search-processors#search-response-processors): A component that intercepts a search response and search request (the query, results, and metadata passed in the request), performs an operation with or on the search response, and returns the search response.
* [_Search phase results processor_]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/search-processors#search-phase-results-processors): A component that runs between search phases at the coordinating node level. A search phase results processor intercepts the results retrieved from one search phase and transforms them before passing them to the next search phase.
* [_Processor_]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/search-processors/): Either a search request processor or a search response processor.
* [_System-generated search processor_]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/system-generated-search-processors/): A search processor that OpenSearch generates automatically based on the search request.
* _Search pipeline_: An ordered list of processors that is integrated into OpenSearch. The pipeline intercepts a query, performs processing on the query, sends it to OpenSearch, intercepts the results, performs processing on the results, and returns them to the calling application, as shown in the following diagram.

![Search processor diagram]({{site.url}}{{site.baseurl}}/images/search-pipelines.png)
@@ -1,7 +1,7 @@
---
layout: default
title: Search pipeline metrics
-nav_order: 50
+nav_order: 60
has_children: false
parent: Search pipelines
---
@@ -0,0 +1,60 @@
---
layout: default
title: System-generated search processors
nav_order: 50
has_children: false
parent: Search pipelines
---

# System-generated search processors

System-generated search processors are search processors that OpenSearch generates automatically based on the search request.

To enable system-generated search processors, set the `cluster.search.enabled_system_generated_factories` setting to either `*` or a list of the required processor factories.

The following example enables the `mmr_over_sample_factory` and `mmr_rerank_factory` factories:
```json
PUT _cluster/settings
{
  "persistent": {
    "cluster.search.enabled_system_generated_factories": [
      "mmr_over_sample_factory",
      "mmr_rerank_factory"
    ]
  }
}
```
{% include copy-curl.html %}



System-generated search processors can be of the following types:

- [System-generated search request processors](#system-generated-search-request-processors)
- [System-generated search response processors](#system-generated-search-response-processors)
- [System-generated search phase results processors](#system-generated-search-phase-results-processors)

# System-generated search request processors

| Processor name | Processor factory name | Execution stage | Trigger condition | Description |
|-------------------|---------------------------| ------------------- |-------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `mmr_over_sample` | `mmr_over_sample_factory` | `POST_USER_DEFINED` | Triggered when a search request includes the `mmr` extension. | Modifies the query size and the `k` value of the `knn` or `neural` query to oversample candidates for MMR re-ranking. This processor runs after any user-defined search request processors. |
Collaborator

Should this be knn and neural queries because this processor supports both?

Contributor

Yes. That is correct. Thanks for catching it.


The execution stage determines whether a system-generated processor runs before or after user-defined processors of the same type.

# System-generated search response processors

| Processor name | Processor factory name | Execution stage | Trigger condition | Description |
|----------------|------------------------|---------------------|------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `mmr_rerank` | `mmr_rerank_factory` | `PRE_USER_DEFINED` | Triggered when a search request includes the `mmr` extension. | Re-ranks the oversampled results using MMR and reduces them to the original query size. This processor runs before any user-defined search response processors. |

Contributor

We might want to add explanation for execution stage here as well?

The execution stage determines whether a system-generated processor runs before or after user-defined processors of the same type.

Contributor Author

ok

Collaborator

@bzhangam @heemin32 So a given system-generated processor can run either before or after user-defined processors? Or does the system-generated processor type determine when the processor runs? So, for example, does mmr_rerank always run before the user-defined processors?

Contributor

The execution stage is fixed by the implementation. For example, `mmr_rerank` always runs before user-defined processors.

The execution stage determines whether a system-generated processor runs before or after user-defined processors of the same type.
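
The ordering rule above can be pictured with a small Python sketch. It is only an illustration of the documented behavior, not OpenSearch code: the `execution_order` function and the user-defined processor name `my_response_processor` are hypothetical.

```python
from typing import List, Tuple


def execution_order(
    system_generated: List[Tuple[str, str]], user_defined: List[str]
) -> List[str]:
    """Order processors of a single type (for example, response processors).

    `system_generated` holds (processor name, execution stage) pairs.
    """
    # System-generated processors marked PRE_USER_DEFINED run first.
    pre = [name for name, stage in system_generated if stage == "PRE_USER_DEFINED"]
    # System-generated processors marked POST_USER_DEFINED run last.
    post = [name for name, stage in system_generated if stage == "POST_USER_DEFINED"]
    return pre + list(user_defined) + post


# `mmr_rerank` is documented as PRE_USER_DEFINED, so it runs before
# the (hypothetical) user-defined response processor.
print(execution_order([("mmr_rerank", "PRE_USER_DEFINED")], ["my_response_processor"]))
```

In this sketch, a `POST_USER_DEFINED` processor such as `mmr_over_sample` would instead be placed after the user-defined request processors.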

# System-generated search phase results processors

OpenSearch does not currently provide any system-generated search phase results processors.

# Limitation

## One processor per type and execution stage
Collaborator

@bzhangam @heemin32 Given that there are 2 processor types (request and response) and 2 execution stages (PRE_USER_DEFINED and POST_USER_DEFINED), does this mean that a single search request can have up to 4 system-generated processors (request processor at PRE_USER_DEFINED stage, request processor at POST_USER_DEFINED stage, response processor at PRE_USER_DEFINED stage, and response processor at POST_USER_DEFINED stage)? Or is the limit actually 2 total processors (one request + one response) regardless of execution stage?

Contributor

A total of six system-generated processors can run for a search request.
Three types: SearchRequestProcessor, SearchPhaseResultProcessor, SearchResponseProcessor
Two stages: PRE_USER_DEFINED, POST_USER_DEFINED

opensearch-project/OpenSearch#19062 (comment)

For each processor type and execution stage, OpenSearch currently supports only one system-generated processor for a search request. For example, only one search request processor can run at the `POST_USER_DEFINED` stage, and only one search response processor can run at the `PRE_USER_DEFINED` stage.
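
To make the limit concrete, the following Python sketch enumerates the available slots, using the three processor types and two execution stages named in the review discussion above. The `assign_slots` helper and the conflicting processor name are hypothetical and only model the "one processor per type and stage" rule; they are not part of OpenSearch.

```python
from enum import Enum
from itertools import product
from typing import Dict, Iterable, Tuple


class ProcessorType(Enum):
    SEARCH_REQUEST = "SearchRequestProcessor"
    SEARCH_PHASE_RESULTS = "SearchPhaseResultProcessor"
    SEARCH_RESPONSE = "SearchResponseProcessor"


class ExecutionStage(Enum):
    PRE_USER_DEFINED = "PRE_USER_DEFINED"
    POST_USER_DEFINED = "POST_USER_DEFINED"


# Every (type, stage) combination is one slot: 3 types x 2 stages = 6 slots.
SLOTS = list(product(ProcessorType, ExecutionStage))

Slot = Tuple[ProcessorType, ExecutionStage]


def assign_slots(
    processors: Iterable[Tuple[str, ProcessorType, ExecutionStage]]
) -> Dict[Slot, str]:
    """Map each slot to a processor name, rejecting two processors in one slot."""
    assigned: Dict[Slot, str] = {}
    for name, ptype, stage in processors:
        slot = (ptype, stage)
        if slot in assigned:
            raise ValueError(f"{name} conflicts with {assigned[slot]} in slot {slot}")
        assigned[slot] = name
    return assigned


# The two MMR processors occupy different slots, so they can run together.
mmr = assign_slots([
    ("mmr_over_sample", ProcessorType.SEARCH_REQUEST, ExecutionStage.POST_USER_DEFINED),
    ("mmr_rerank", ProcessorType.SEARCH_RESPONSE, ExecutionStage.PRE_USER_DEFINED),
])
print(len(SLOTS), len(mmr))
```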
3 changes: 3 additions & 0 deletions _vector-search/specialized-operations/index.md
@@ -13,6 +13,9 @@ cards:
  - heading: "Radial search"
    description: "Search all points in a vector space that reside within a specified maximum distance or minimum score threshold from a query point"
    link: "/vector-search/specialized-operations/radial-search-knn/"
  - heading: "Vector search with MMR"
    description: "Use maximal marginal relevance (MMR) re-ranking to diversify vector search results"
    link: "/vector-search/specialized-operations/vector-search-mmr/"
---

# Specialized vector search
126 changes: 126 additions & 0 deletions _vector-search/specialized-operations/vector-search-mmr.md
@@ -0,0 +1,126 @@
---
layout: default
title: Vector search with MMR
nav_order: 60
parent: Specialized vector search
has_children: false
has_math: true
---

# Vector search with MMR

Maximal marginal relevance (MMR) search balances relevance and diversity in search results. Instead of returning only the most similar documents, MMR selects results that are both relevant to the query and different from each other. This improves the coverage of the result set and reduces redundancy, which is especially useful in vector search scenarios.

MMR re-ranking balances two competing objectives:

- Relevance: How well a document matches the query.

- Diversity: How different a document is from the documents already selected.

The algorithm computes a score for each candidate document using the following formula:

```
MMR = (1 − λ) * relevance_score − λ * max(similarity_with_selected_docs)
```

Where:

- `λ` is the diversity parameter, from `0` to `1` (values closer to `1` produce more diverse results).

- `relevance_score` is the similarity between the query vector and the candidate document vector.

- `similarity_with_selected_docs` is the similarity between the candidate document and each already selected document. The formula uses the maximum of these values.

By adjusting the diversity parameter, you can control the trade-off between highly relevant results and more diverse coverage in the result set.
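
The following Python sketch shows one way the formula above can drive a greedy re-ranking loop. It is a minimal illustration under stated assumptions, not the OpenSearch implementation: cosine similarity stands in for whichever similarity function the vector field's space type selects, and the function names `cosine` and `mmr_rerank` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_rerank(
    query: Sequence[float],
    candidates: Sequence[Sequence[float]],
    size: int,
    diversity: float = 0.5,
) -> List[int]:
    """Greedily select up to `size` candidate positions by MMR score."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))

    def mmr_score(i: int) -> float:
        relevance = cosine(query, candidates[i])
        # Highest similarity to any already selected document (0 if none yet).
        redundancy = max(
            (cosine(candidates[i], candidates[j]) for j in selected),
            default=0.0,
        )
        return (1 - diversity) * relevance - diversity * redundancy

    while remaining and len(selected) < size:
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


docs = [[1.0, 0.0], [0.99, 0.14], [0.6, 0.8]]
print(mmr_rerank([1.0, 0.0], docs, size=2, diversity=0.0))  # [0, 1]
print(mmr_rerank([1.0, 0.0], docs, size=2, diversity=0.7))  # [0, 2]
```

With `diversity` set to `0`, the two documents closest to the query are returned. With `diversity` set to `0.7`, the near-duplicate second document is penalized for its similarity to the first pick, so the third document is selected instead.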

# Prerequisites

To use MMR, you must enable [system-generated search processor factories]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/system-generated-search-processors/). Set the `cluster.search.enabled_system_generated_factories` setting (an empty list by default) to either `*` or a list of the required factories:

```json
PUT _cluster/settings
{
  "persistent": {
    "cluster.search.enabled_system_generated_factories": [
      "mmr_over_sample_factory",
      "mmr_rerank_factory"
    ]
  }
}
```
{% include copy-curl.html %}

# Parameters

The `mmr` extension in the Search API supports the following parameters:

| Parameter | Data type | Required | Description |
| --- | --- | --- | --- |
| `diversity` | Float | No | Controls the weight of diversity in the re-ranking process. Valid values range from `0` to `1`. A value of `1` prioritizes maximum diversity, and `0` disables diversity. Default is `0.5`. |
| `candidates` | Integer | No | Specifies how many candidate documents to oversample before re-ranking. Default is `3 * query size`. |
| `vector_field_path` | String | Optional, but required for remote indexes | The path to the vector field used for MMR re-ranking. If not provided, OpenSearch resolves it automatically from the search request. |
| `vector_field_data_type` | String | Optional, but required for remote indexes | The data type of the vector field, used to parse the field and calculate similarity. If not provided, OpenSearch resolves it from the index mapping. |
| `vector_field_space_type` | String | Optional, but required for remote indexes | Determines the similarity function for the vector field, such as cosine similarity or Euclidean distance. If not provided, OpenSearch resolves it from the index mapping. |

Contributor Author

For data type and space type should we add a link to the https://docs.opensearch.org/latest/field-types/supported-field-types/knn-vector/#parameters to show what are the valid values? It should be valid values of the data type and space type of the knn vector.
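
As a rough client-side illustration of the `diversity` and `candidates` parameters, the following Python sketch builds the `ext` object while applying the documented valid range and defaults. The `build_mmr_ext` helper is hypothetical, and it assumes that "query size" means the number of results requested by the search.

```python
from typing import Any, Dict, Optional


def build_mmr_ext(
    query_size: int,
    diversity: float = 0.5,
    candidates: Optional[int] = None,
) -> Dict[str, Any]:
    """Build the `ext` section of a search request for MMR re-ranking."""
    # Documented valid range for `diversity` is 0 to 1.
    if not 0.0 <= diversity <= 1.0:
        raise ValueError("diversity must be between 0 and 1")
    if candidates is None:
        # Mirrors the documented default of 3 * query size (assumption:
        # query size is the number of results the request asks for).
        candidates = 3 * query_size
    return {"mmr": {"diversity": diversity, "candidates": candidates}}


print(build_mmr_ext(10))
```

Because both parameters are optional, you can also omit them from the request and rely on the server-side defaults.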


# Example request

The following example shows how to use the `mmr` extension with a k-NN query:

```json
POST /my-index/_search
{
  "query": {
    "knn": {
      "my_vector_field": {
        "vector": [0.12, 0.54, 0.91],
        "k": 10
      }
    }
  },
  "ext": {
    "mmr": {
      "diversity": 0.7
    }
  }
}
```
{% include copy-curl.html %}

The following example shows how to use the `mmr` extension with a neural query:

```json
POST /my-index/_search
{
  "query": {
    "neural": {
      "my_vector_field": {
        "query_text": "query text",
        "model_id": "<your model id>"
      }
    }
  },
  "ext": {
    "mmr": {
      "diversity": 0.6,
      "candidates": 50,
      "vector_field_path": "my_vector_field",
      "vector_field_data_type": "float",
      "vector_field_space_type": "l2"
    }
  }
}
```

When querying across multiple indexes, ensure that the vector field data type and space type are the same in every index. These values determine the similarity function used to calculate the similarity between documents.
{: .note}

# Limitations

MMR re-ranking has the following limitations.

## Query type restriction

MMR currently supports only `knn` or `neural` queries as the top-level query in a search request. If a `knn` or `neural` query is nested inside another query type (such as a `bool` or `hybrid` query), MMR is not supported.
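
A client could check this restriction before sending a request. The following Python sketch is a hypothetical pre-flight check (the `mmr_supported` function is not part of any OpenSearch client) that inspects only the top-level query type of a search request body:

```python
from typing import Any, Dict

# Query types documented as supported at the top level for MMR.
SUPPORTED_TOP_LEVEL_QUERIES = {"knn", "neural"}


def mmr_supported(search_body: Dict[str, Any]) -> bool:
    """Return True if the top-level query is a `knn` or `neural` query."""
    query = search_body.get("query", {})
    # A query object has exactly one key: the query type.
    return len(query) == 1 and next(iter(query)) in SUPPORTED_TOP_LEVEL_QUERIES


top_level_knn = {
    "query": {"knn": {"my_vector_field": {"vector": [0.12, 0.54, 0.91], "k": 10}}}
}
nested_knn = {
    "query": {"bool": {"must": [top_level_knn["query"]]}}
}

print(mmr_supported(top_level_knn))  # True
print(mmr_supported(nested_knn))     # False
```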

## Required explicit vector field details

When querying remote indexes, you must explicitly provide the vector field details: `vector_field_path`, `vector_field_data_type`, and `vector_field_space_type`.

For a local index, OpenSearch can resolve this metadata automatically from the index mapping. For a remote cluster, OpenSearch cannot reliably fetch this information, so providing these details ensures that the vector data is parsed correctly and that similarity is calculated accurately.
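
The following Python sketch assembles a request body that includes all three vector field details. It is an illustration only: the `remote_mmr_request` helper is hypothetical, and the target `my_remote_cluster:my-index` assumes a cross-cluster connection alias named `my_remote_cluster`, which is not defined anywhere in this documentation.

```python
import json
from typing import Any, Dict, List


def remote_mmr_request(
    vector: List[float],
    k: int,
    field: str,
    data_type: str,
    space_type: str,
    diversity: float = 0.5,
) -> Dict[str, Any]:
    """Build a k-NN search body with explicit vector field details for MMR."""
    return {
        "query": {"knn": {field: {"vector": vector, "k": k}}},
        "ext": {
            "mmr": {
                "diversity": diversity,
                # Provided explicitly because the remote mapping is not resolved.
                "vector_field_path": field,
                "vector_field_data_type": data_type,
                "vector_field_space_type": space_type,
            }
        },
    }


body = remote_mmr_request([0.12, 0.54, 0.91], 10, "my_vector_field", "float", "l2")
# Hypothetical remote target: <cluster alias>:<index name>.
print("POST /my_remote_cluster:my-index/_search")
print(json.dumps(body, indent=2))
```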