
[RFC] Support Semantic Field Type to Simplify Neural Search Set Up HLD #1211

bzhangam commented Mar 5, 2025

Background

Neural search transforms text into vectors and facilitates vector search both at ingestion time and at search time. During ingestion, neural search transforms document text into vector embeddings and indexes both the text and its vector embeddings in a vector index. When you use a neural query during search, neural search converts the query text into vector embeddings, uses vector search to compare the query and document embeddings, and returns the closest results.

Before you ingest documents into an index, documents are passed through a machine learning (ML) model, which generates vector embeddings for the document fields. When you send a search request, the query text or image is also passed through the ML model, which generates the corresponding vector embeddings. Then neural search performs a vector search on the embeddings and returns matching documents.

To set up neural search today we need to (doc):

  • Register and deploy the ML model.
  • Set up the ingest pipeline with the ML model.
  • Set up the index with the ingest pipeline.
  • Add data to the index.
  • Use neural query to query data with the ML model.

In this process we need to configure the ingest pipeline, the index, and the neural query based on the ML model, which can be complicated and annoying, especially for people who are not very familiar with the ML model but want to give neural search a try. So we propose building a feature to simplify the neural search setup process. In this doc we will clarify the proposal and discuss the solutions.

Existing System

Please refer to this doc for the details of what we need to do to set up neural search today.
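For context, the current setup requires wiring an ingest pipeline to the model and to the index by hand. A minimal sketch using the existing text_embedding processor (the model ID is a placeholder):

// 1. Create an ingest pipeline bound to the ML model
PUT /_ingest/pipeline/nlp-ingest-pipeline
{
  "description": "A text embedding pipeline",
  "processors": [
    {
      "text_embedding": {
        "model_id": "aVeif4oB5Vm0Tdw8zYO2",
        "field_map": {
          "text": "passage_embedding"
        }
      }
    }
  ]
}

// 2. Create the index with a matching knn_vector field and the pipeline as default
PUT /my-nlp-index
{
  "settings": {
    "index.knn": true,
    "default_pipeline": "nlp-ingest-pipeline"
  },
  "mappings": {
    "properties": {
      "text": { "type": "text" },
      "passage_embedding": {
        "type": "knn_vector",
        "dimension": 768
      }
    }
  }
}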

Future State

In OpenSearch we should support a new field type to simplify the neural search setup process. Below are the changes we need at each step to support this new feature. We propose to name the new field type semantic.

Compared to the existing process, the future UX will be:

[Image: future UX overview]

Register and deploy the ML model. (no change)

Set up the ingest pipeline with the ML model. (not needed anymore)

The ingest pipeline can be skipped. We will know how to ingest the data based on the ML model defined in the index.

Index Creation

Index setup can be simplified. We just need to define a semantic field (the new field type we plan to support) with the ID of the ML model we want to use for inference. We can also define raw_field_type to control how we handle the raw data (the original value): it can be handled like a string (text, keyword, match_only_text, token_count, wildcard) or binary field, and by default we will handle it like a text field. OpenSearch will then automatically add the semantic info fields to the index mapping.

Check this example for more details.

Add doc to the index.

No need to set up an ingest pipeline for text chunking and embedding generation (see the ingestion example after this list). OpenSearch will automatically:

  1. Generate embeddings according to the ML model defined in the semantic field.
  2. Chunk the text before converting it to embeddings if it is too long.
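For example, ingesting into the index from the appendix example would only require the raw text; the chunks, embeddings, and model info are generated behind the scenes:

PUT /my-nlp-index/_doc/1
{
  "id": "1",
  "text": "A man who is riding a wild horse in the rodeo is very near to falling off."
}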

Use neural query to query data with the ML model.

The neural query can be simplified. We can simply query against the semantic field rather than specifying the embedding field name and the model ID. OpenSearch will automatically:

  1. Apply the proper query (dense vs. sparse) against the embeddings based on the ML model set up in the index.
// future neural query example
GET /my-nlp-index/_search
{
 "_source": {
    "excludes": [
      "text_semantic_info"
    ]
  },
  "query": {
    "neural": {
      // No need to specify the path to embedding field
      // No need to specify the search model id
      "text": {
        "query_text": "wild west"
      }
    }
  }
}

Requirements

Note: All the default configurations mentioned in the section below will be proposed and reviewed in a later design phase.

Phase 0 (P0)

In Phase 0, the goal is to implement the most critical requirements necessary to enable the feature in its simplest, functional form, providing a basic yet usable foundation for future enhancements.

R1 Support New Field Type

A new field type, semantic, should be supported, which can be used to define fields in an index. This field should support the parameters below; an example mapping follows the table.

| name | type | description | updatable | defaultValue | required | note |
| --- | --- | --- | --- | --- | --- | --- |
| model_id | string | The ID of the ML model used to generate embeddings for the raw data. | TRUE | N/A | TRUE | |
| search_model_id | string | The ID of the ML model used to generate the embedding for the query text during search. If it is not defined, we simply use model_id. | TRUE | N/A | FALSE | It is a common use case for the sparse embedding doc-only mode to use different models for indexing and search. |
| raw_field_type | string | The type of the raw data, which controls how we store and query the raw data. It should be a string (text, keyword, match_only_text, token_count, wildcard) or binary field. | FALSE | text | FALSE | |
| semantic_info_field_name | string | Custom root field name for the auto-generated semantic info fields. This helps avoid naming conflicts between auto-generated and user-defined names. | FALSE | {semantic_field_name}_semantic_info | FALSE | |
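For illustration, a mapping that sets all four parameters could look like this (model IDs are placeholders):

PUT /my-nlp-index
{
  "mappings": {
    "properties": {
      "text": {
        "type": "semantic",
        "model_id": "<ingest_model_id>",
        "search_model_id": "<search_model_id>",
        "raw_field_type": "text",
        "semantic_info_field_name": "my_semantic_info"
      }
    }
  }
}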
R1.1 Embeddings auto-generated from ML model on doc addition/update

When a doc with semantic fields is added to the index or updated, we will automatically calculate the embeddings for the semantic fields. The configuration for calculating the embeddings will be derived automatically from the ML model, based on the model ID defined in the semantic field. If the original data is too long, we should chunk it before calculating the embeddings.

R1.2 Embeddings should be viewable

When we get a doc with a semantic field, we should be able to view the auto-generated embedding fields, in case users are interested in them.

R1.3 Query semantic field using auto-derived embeddings

We should be able to query against the semantic field; OpenSearch should then automatically generate the embedding for the query text, based on the ML model defined in the semantic field, and query against the embedding.

R1.4 Configurability when querying a semantic field

Currently, for neural and neural_sparse queries, customers can configure query parameters to customize their search behavior. For semantic fields, we should continue to support it.

R1.5 Allow indexing raw embeddings

It's possible that we want to directly provide the embedding for the semantic field, e.g. for ad hoc analysis, unit testing, or custom/unsupported models. In that case we should store the raw embedding in the index directly rather than making an inference call to the ML model to generate it. Currently both the KNN index and the neural sparse index support this.
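As a reference point, this is what direct embedding ingestion looks like today when the embedding field is defined explicitly as a knn_vector field (field names are illustrative):

PUT /my-knn-index/_doc/1
{
  "text": "wild west",
  "passage_embedding": [0.0123, -0.4567, ...]
}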

R2 Machine Learning Model Support

Since we plan to rely mainly on the ML model to decide how to ingest and query the semantic field, we should ensure we can support this properly.

  • Should be able to decide the embedding type (dense vs. sparse) based on the model metadata.
  • Should be able to decide the default configuration for ingesting and querying the semantic field.
R2.1 Support OpenSearch-supported pre-trained ML models

We should ensure we can handle the OpenSearch-supported pre-trained ML models properly.

R2.2 Support Remote Model

We should ensure we can handle the remote models properly.

R2.3 Allow Simple Model ID Update

To improve performance, customers may need to update the ML model used by the semantic field. Since registering and deploying a new ML model in OpenSearch generates a new model ID, we must allow customers to update the model ID for the semantic field without requiring re-indexing of the data.

To ensure compatibility and avoid indexing issues, the following validations will be enforced when updating the model ID:

  • Should not update to an invalid model.
  • Should not change the model type (e.g. from dense to sparse).
  • Should not update to a dense model with a different dimension.

Updating the model ID will not automatically regenerate the embeddings for existing documents. This means embeddings generated with an older model may still exist in the index, potentially leading to inconsistent results when queried with the updated model. Addressing this limitation (e.g., by allowing regeneration of embeddings) can be considered for future phases.
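For illustration, under this requirement a model ID update might go through the regular mapping update API, subject to the validations above (a sketch of the proposed UX, not a finalized API):

PUT /my-nlp-index/_mapping
{
  "properties": {
    "text": {
      "type": "semantic",
      // must be a valid model of the same type (and dimension, if dense)
      "model_id": "<new_model_id>"
    }
  }
}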

R3 Backward Compatible

Using the new semantic field type should not impact the existing OpenSearch functions.

R3.1 Ensure compatibility with existing embedding configuration process

The existing process for defining new embedding fields in the index, including those with semantic fields, and using an ingest pipeline to configure embedding calculations should remain functional. In P0, we may not offer advanced configurability. Therefore, we should allow experienced customers to continue using the existing solution rather than limiting their options or creating a dead end.

Note: P1 and P2 requirements can be found in the Appendix.

Limitation

  • semantic fields can't be defined as multi-fields of another field.

Out of Scope

A single semantic field is limited to using one ML model

While supporting multiple ML models for the same semantic field might seem beneficial, it would introduce significant complexity to the implementation. Additionally, the current OpenSearch framework does not support this capability, and there has been no customer demand for it. As a workaround, users can define multiple semantic fields, each configured with a different ML model; however, this approach requires duplicating the original data across those fields (see the example below). If a need arises in the future to support multiple ML models for a single semantic field, we can revisit the design and implement a solution to address this requirement.
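For example, the workaround would look like the following, with the raw data duplicated into both fields at ingestion time (model IDs are placeholders):

PUT /my-nlp-index
{
  "mappings": {
    "properties": {
      "text_dense": {
        "type": "semantic",
        "model_id": "<dense_model_id>"
      },
      "text_sparse": {
        "type": "semantic",
        "model_id": "<sparse_model_id>"
      }
    }
  }
}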

High Level Design

The high level design should support the P0 features and be extensible to support the P1 and P2 features.

Overall Solution Comparison

| Option | Pros | Cons |
| --- | --- | --- |
| Option 1: Auto-Adding Embedding Fields to Index Mapping (Recommended) | 1. Best leverages the existing code and aligns with the existing framework. | |
| Option 2: Not Auto-Adding Embedding Fields to Index Mapping | 1. Cleaner index mapping. | 1. Too much complexity; error prone. |
| Option 3: Enhance the Existing Field Types | 1. Best leverages the existing code and aligns with the existing framework. | 1. Tight coupling with core. |

Option 1 Auto-Adding Embedding Fields to Index Mapping (Recommended)

In this approach, when an index is created with a semantic field, the system will automatically add corresponding embedding fields to the index mapping based on the ML model configuration.

During document ingestion, if chunking is needed, the semantic field will be processed accordingly, and embeddings will be generated using the specified ML model.

At query time, a neural query targeting the semantic field will be rewritten to search against the associated embedding fields, ensuring seamless integration with neural search capabilities.

Below we discuss how we should handle index creation, document indexing, and querying.

Index Creation

For index creation, we will transform the index mapping to create the semantic info fields and add them to the index mapping. The transformation should happen in the actions that can modify the index mapping (see Actions Can Modify Index Mapping in the Appendix).

Additionally, we will create a new SemanticFieldMapper within the Neural Search plugin to handle parsing and processing of the semantic field. Beyond these changes, the rest of the index creation workflow will remain unchanged.

e.g. the diagram below shows how we will handle CreateIndexAction:

[Image: CreateIndexAction flow]

Discussion Points

Why do we want to auto-create the embedding fields in the index mapping?

The primary reason for automatically adding embedding fields to the index mapping is to align with OpenSearch's existing mapper service, which handles field configurations. In OpenSearch, the mapper service maintains a mapping of {field full name → MappedFieldType}, where each type holds the field's configuration in the index. When we convert a query to a Lucene query, we pull the configuration from the mapper service (e.g. KNNQueryBuilder, NestedQueryBuilder).

If the embedding fields are not included in the mapping:

  • Field Configuration Won't Be Recognized. The mapper service will recognize the semantic field but not the generated embedding fields. This means we won't be able to properly execute queries against the embeddings by default.
  • Manual Query Conversion Is Needed. Without embedding fields in the mapping, we would need to write custom logic to transform queries into Lucene queries for the embedding fields.
  • Increased Complexity and Redundancy. The neural search plugin already uses KNN queries for query transformation, and introducing custom logic would duplicate this process, leading to inefficiency and increased maintenance overhead. If we want to leverage other existing query types (e.g. a nested query to handle the chunks), we would also need to build that custom logic in the neural search plugin.

By automatically adding embedding fields to the index mapping:

  • We ensure compatibility with OpenSearch’s mapper service, which allows the system to recognize and process the fields for queries.
  • We avoid manual query transformation logic and maintain alignment with OpenSearch's built-in query capabilities.
  • We simplify the integration of nested queries for text chunking and embedding searches, ensuring the neural search plugin works as expected.

Indexing Doc

Today, when we add docs to an index, an ingest process executes ingest pipelines to transform the documents before indexing. Following this, the field mappers process the documents' fields, mapping them into indexable fields for storage.

We will store the original text and its embeddings in chunks. We will also store the model info so that we know which model was used to generate the inference for the original text. By default the model info will simply be stored without being indexed, since it is not data users commonly want to search.

Here we propose making text chunking the default behavior to simplify the setup, although there can be a performance cost when chunking is not actually needed. When we store the chunks, we store them as child docs to keep the data of each chunk together and preserve search quality, and we need to wrap the neural query in a nested query to query them; this is how text chunking is supported with neural search today (see the pipeline sketch below). With our proposal we can simplify the configuration of text chunking and of the query, but internally we should store and query the data in a similar way. This means we will always store the chunks as child docs, even when that is not needed for short text. In the future we can address this by providing an immutable parameter that lets customers control whether text chunking is enabled.
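For reference, this is roughly what the manual chunking setup looks like today, with the existing text_chunking and text_embedding processors cascaded in an ingest pipeline (a minimal sketch; the model ID is a placeholder):

PUT /_ingest/pipeline/chunking-ingest-pipeline
{
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 384
          }
        },
        "field_map": {
          "text": "text_chunks"
        }
      }
    },
    {
      "text_embedding": {
        "model_id": "aVeif4oB5Vm0Tdw8zYO2",
        "field_map": {
          "text_chunks": "text_chunks_embedding"
        }
      }
    }
  ]
}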

When we add a doc with semantic fields, we just need to provide the original text/binary; the embeddings and the model info will be added automatically. Below are a couple of options for doing that:

Inject A Processor to the Existing Ingest Process

To support the semantic field, we need to refine the ingest process as follows:

  1. Enhanced Ingest Process: Introduce a mechanism to handle semantic fields during ingestion without requiring manual pipeline configuration. This includes programmatically generating embeddings using the ML model associated with the semantic field and, for large text fields, incorporating automatic text chunking into the process to ensure proper embedding generation.
  2. Field Mapping and Indexing: Extend the field mapping step with a SemanticFieldMapper to parse the semantic field. Since the other fields are defined in the index mapping with existing types, the existing field mappers will process them properly.

Components that need to be created or modified are highlighted in green.

[Image: ingest process changes (new/modified components in green)]

Enhanced Ingest Process:

We propose to add a new interface to IngestPlugin in OpenSearch core to get processors based on the index configuration and then inject them into the ingest process. This processor will handle text chunking and embedding creation seamlessly, based on the ML model associated with the semantic field, eliminating the need for manual pipeline configuration.

Key Changes:

  • Automatic In-Memory Pipeline Creation: If no final ingest pipeline is specified, OpenSearch will dynamically create a temporary in-memory pipeline that includes the SemanticFieldProcessor to handle embedding generation.
  • Integration with Existing Pipelines: If a final ingest pipeline is already defined, the SemanticFieldProcessor will be appended to it, ensuring embeddings are generated without disrupting the existing pipeline configuration.

Here we plan to leverage the final pipeline because the target index can be changed during the ingest process, based on this code, and we do not allow the index to change in the final pipeline. Since we rely on the index to decide how to process the semantic fields, we should only do so once the index is finalized. We want to append the processor to the end of the final pipeline because modifying the auto-generated semantic info fields is not a common case.

Note: Since the ingest process is orchestrated by OpenSearch core, the key changes will happen there. Currently we only pull ingest processors from the ingest pipeline; we are essentially proposing a new way to configure processors, through the index configuration. This pattern can potentially be extended to other fields and other phases (e.g. the search pipeline).

SemanticFieldProcessor Responsibilities:

  • Embedding Generation: Automatically generate embeddings based on the ML model (model_id) defined in the index mapping.
  • Text Chunking: Handle long text by chunking it and generating embeddings for each chunk. For P0 we will use a default config, which will be clarified in the LLD; in the future we can consider making it configurable.
  • Add Model Info: The ML model info will be added to the doc.

New Field Mapper

After the ingest process, the document will include the generated embeddings and model information. The semantic info fields are defined as existing field types in the index mapping, so they will be handled by the existing field mappers. But for the original value we need a new field mapper, SemanticFieldMapper, to handle it; it will delegate the work to an existing field mapper based on raw_field_type.

Query Semantic Field

To support querying the semantic field, we propose extending the existing neural query type. Currently, the neural query retrieves the model ID either from the search query itself or through a neural_query_enricher processor. We now introduce a third approach: retrieving the model ID from the index mapping.

Check Index Mapping: When querying a field with a neural query, we first check the index mapping. If it is a semantic field, the model ID is sourced from the mapping, triggering the new query rewrite logic. Otherwise, the existing neural query behavior applies.

In the new logic we will pull the model ID from the index mapping and decide whether, under the hood, we should use a KNN query (for a dense model) or a neural sparse query (for a sparse model).
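Conceptually, for a dense model the neural query from the Future State section would be rewritten into something like the following (an illustrative sketch; the vector comes from the inference call, and the field names follow the appendix mapping):

GET /my-nlp-index/_search
{
  "query": {
    "nested": {
      "path": "text_semantic_info.chunks",
      "query": {
        "knn": {
          "text_semantic_info.chunks.embedding": {
            "vector": [0.23, 0.76, ...],
            "k": 10
          }
        }
      }
    }
  }
}

For a sparse model, the inner query would instead be a neural_sparse query against the same embedding path.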

Components that need to be created or modified are highlighted in green.

[Image: query flow changes (new/modified components in green)]

Note: Currently we don't have index-related info in the NeuralQueryBuilder, so we need to modify OpenSearch core to pass this info to the NeuralQueryBuilder when we perform the rewrite and fetch operations.

Discussion Points

Why not check the index when we parse the query in the fromXContent function?

The fromXContent function is responsible for parsing a query from its JSON representation, and its existing interface only accepts a single parameter: XContentParser. This design does not provide an easy way to pass index-related information to the function.

Modifying the fromXContent interface to include index information would require changes to all existing query builders, which would be a significant and intrusive update to the OpenSearch codebase. Instead, the index information is primarily needed during query rewriting, where we can validate the query and determine how to transform it.

The doRewrite function already accepts a QueryRewriteContext, which provides a natural place to include index-related details. Since QueryRewriteContext can be extended to hold index information, we recommend handling index-based validation and transformation within the doRewrite function rather than modifying fromXContent.

This approach keeps query parsing lightweight, avoids unnecessary modifications to existing query builders, and ensures that index-related decisions are made at the appropriate stage in the query processing workflow.
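A rough sketch of what that rewrite hook could look like (SemanticFieldType, getFieldType, getSearchModelId, and rewriteToEmbeddingQuery are hypothetical names for illustration; only doRewrite and QueryRewriteContext exist today):

@Override
protected QueryBuilder doRewrite(QueryRewriteContext context) {
    // Hypothetical accessor: QueryRewriteContext would be extended to
    // expose the mapped field type of the queried field.
    MappedFieldType fieldType = context.getFieldType(this.fieldName);
    if (fieldType instanceof SemanticFieldType) {
        SemanticFieldType semanticType = (SemanticFieldType) fieldType;
        // Pull the model ID from the index mapping instead of the query body.
        String modelId = semanticType.getSearchModelId();
        // Rewrite to a nested KNN query (dense model) or a neural_sparse
        // query (sparse model) against the auto-generated embedding field.
        return rewriteToEmbeddingQuery(semanticType, modelId);
    }
    // Not a semantic field: keep the existing neural query behavior.
    return this;
}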

Option 2: Not Auto-Adding Embedding Fields to Index Mapping

In this approach, we propose not automatically adding embedding fields to the index mapping. This avoids potential confusion for customers who may expect to see only the fields they explicitly define in the index mapping.

By not auto-adding embedding fields, we eliminate the need for special handling when creating, updating, or retrieving index mappings. However, this decision introduces new challenges during indexing and querying:

  1. Indexing Phase: Since embedding fields are not explicitly defined in the index mapping, there will be no existing field mappers to process them. To handle this, we need to implement custom logic within SemanticFieldMapper to dynamically map the embeddings to indexable fields at runtime.
  2. Query Phase: Without embedding fields in the index mapping, we cannot rely on existing query types to search against them. The system determines query behavior based on field configurations stored in the index mapping. Since embedding fields will be missing from the mapping, the system will treat them as unmapped fields, resulting in empty query results by default.

To make this approach viable, we need a way to bypass the standard field lookup mechanism and ensure embedding fields are properly interpreted during query execution. This might require custom query transformation logic to handle embedding fields explicitly without relying on the standard field resolution mechanism in OpenSearch.

Index Creation

In this approach we not only need to parse the semantic field but also need to maintain the configuration for the embedding fields, so that later, during the query phase, we can access them through the SemanticFieldMapper.

Components that need to be created or modified are highlighted in green/red. Red marks the parts that differ from the first solution.

[Image: index creation changes (new/modified components in green/red)]

Pros:

Simplicity for Users. By defining just the semantic field in the index mapping, users avoid dealing with additional fields like embeddings or chunks. This reduces the cognitive load and makes the feature more approachable, especially for users unfamiliar with the underlying details of neural search.

Reduced Risk of Naming Conflicts and Minimized Mapping Bloat. Automatically adding fields for embeddings and chunks could cause naming conflicts, especially in scenarios with multiple semantic fields. It can also lead to a cluttered and verbose index mapping when multiple semantic fields are defined. This approach keeps the mapping clean and manageable.

Cons:

More Complicated SemanticFieldMapper. Since we cannot rely on the index mapping to handle the semantic info fields, additional work is needed to manage them in the SemanticFieldMapper.

Document Indexing Process

In this option we will ingest the semantic field as below:

{
    "id":"string",
    "production_description":{
        "text":"string", // original text
        "chunks":[
            {
                "text":"string", // Chunked text
                // If it's a dense model we map it to knn_vector.
                // If it's a sparse model we can map it to a rank_features field
                "embedding":"knn_vector"/"rank_features", 
            },
            ...
        ],
        "model":{ // Info of the ML model we use to generate the embedding
            "id":"string",
            "type":"string",
            "name":"string"
        }
    }
}

To achieve that we have the options below.

Option 1: Inject A Processor to the Existing Ingest Process

Similar to the first solution, we can transform the doc during the ingest phase. But we need to add additional custom logic to the SemanticFieldMapper to not only process the original value but also handle the other auto-generated fields. We need this because those fields are not defined in the index mapping, so we cannot map them to existing field mappers for processing.

Components that need to be created or modified are highlighted in green/red. Red marks the parts that differ from the first solution.

[Image: ingest process changes (new/modified components in green/red)]

Option 2: Process the Semantic Field in the SemanticFieldMapper

In this option we propose to generate the semantic info fields in the field mapper's parseCreateField function. We will need to build an EmbeddingGenerationTaskManager to manage the embedding generation work. We need this because a field mapper can only process one field at a time, and we need to invoke the ml-commons predict API to generate embeddings. When we are using a remote ML model, this is expensive, so we should batch the work and process each doc in parallel to improve performance.

Components that need to be created or modified are highlighted in green/red. Red marks the parts that differ from the first solution.

[Image: field mapper changes (new/modified components in green/red)]

Pros:

  1. No core change. We only need to make the change in the field mapper in the neural search plugin; no need to modify OpenSearch core.
  2. No need to locate the semantic field. We don't need to go through the doc to locate the semantic field; the SemanticFieldMapper will only be invoked to handle semantic fields.

Cons:

  1. Too Complicated. We would need to modify core to allow async tasks in the parseCreateField function, plus a dedicated service to manage the embedding generation work, especially when we bulk index docs. It's too much effort.

Since option 2 is too complicated, we recommend option 1, and the solution for the query phase will be designed based on that.

Query Semantic Field

We need to address the issue that the system cannot find the embedding fields from the index mapping.

Option 1. Wrap an ObjectFieldMapper

Since the semantic field becomes an object after ingestion, we can delegate the parsing work to an object mapper. During index creation we will create the object mapping with the configuration of the semantic info fields in the type parser. In this way we can make it work as if those fields were defined in the index mapping, and then simply delegate the indexing and query work to the existing field mappers.

But in this way we would define the semantic field as a ParameterizedFieldMapper and an ObjectMapper at the same time, which is not compatible with the existing system.

Option 2. Introduce New doToQuery Interface

An alternative approach is to introduce a new function in the query builders that allows invoking doToQuery with a MappedFieldType. This would enable us to determine the MappedFieldType for embedding fields based on the SemanticFieldMapper, which is explicitly defined in the index mapping. Since the SemanticFieldMapper exists in the index mapping, we can reliably retrieve its configuration.

With this approach, the doToQuery function in NeuralQueryBuilder can simply delegate the query conversion work to other query types (e.g., KNN query) using the correct MappedFieldType for the embedding field.
However, this solution comes with some considerations:

  1. Query Type Compatibility: Any query type that needs to delegate query conversion must support this new function. This means modifications may be required across multiple query implementations to accommodate this behavior.
  2. Decoupling from Index Mapping: We must ensure that query conversion logic relies solely on the provided MappedFieldType rather than directly referencing the index mapping to retrieve field definitions. This guarantees that queries can be properly executed even when embedding fields are not explicitly defined in the index mapping.
// Existing function
protected Query doToQuery(QueryShardContext context) {
    // look up the field configuration from the mapper service
    MappedFieldType mappedFieldType = context.fieldMapper(this.fieldName);
    // do query conversion
}

// New function to add
public Query doToQuery(QueryShardContext context, MappedFieldType mappedFieldType) {
    // use the provided mappedFieldType directly instead of looking it up
    // do query conversion
}
Option 3. Fully Custom Query Conversion Logic

In this approach, we handle the entire query conversion process within the neural search plugin. By doing so, we only need to identify the semantic field, while the rest of the conversion logic is managed internally by the plugin.

Pros

  • Simplified Field Handling: We no longer need to fetch the MappedFieldType for embedding fields from the mapper service. Instead, we can directly access their configuration from the SemanticFieldMapper.
  • Greater Control: Customizing the query logic gives us fine-grained control over how neural queries are processed, allowing for tailored optimizations.

Cons

  • Code Duplication: This approach requires duplicating existing query conversion logic within the Neural Search Plugin.
  • Too Much Effort: We would need to replicate the logic for nested queries, KNN queries, and any other query types we want to support.
  • Maintenance Overhead: Any updates or changes to query types in OpenSearch (e.g., nested or KNN queries) would require corresponding changes in the Neural Query Plugin, increasing maintenance effort and the risk of inconsistencies.

Option 3: Enhance the Existing Field Types

There is some concern that introducing a new field type called semantic may create unnecessary learning effort for users compared to enhancing an existing field type, e.g. the text field.

Instead of a new semantic field type, we can enhance existing field types—such as text—by introducing an inference_config parameter. This configuration will indicate that embeddings should be generated for the field during indexing.

To achieve this, we will modify the corresponding field mapper to support the new parameter. Additionally, we can introduce an abstract field mapper that handles inference-related configurations, allowing us to extend this functionality to multiple field types in the future.

{
    "mappings": {
        "properties": {
            "product_description": {
                "type": "text",
                "inference_config": {
                    "model_id": "abc",
                    "other_config": ...
                }
            }
        }
    }
}

The remaining parts are very similar to option 1. Basically, we just change the check from whether the field is a semantic field to whether the field is a text field with an inference_config.

Pros:

  1. More Intuitive for Users – Avoids introducing a new field type, reducing complexity and making it easier for users to adopt.
  2. Leverages Existing Field Types – Aligns with OpenSearch’s existing architecture, ensuring better compatibility and reducing the need for extensive changes.

Cons:

  1. Tight Coupling with Core – Since TextFieldMapper is part of OpenSearch core, modifying it to support a plugin feature creates strong dependencies, making it harder to maintain and evolve independently. Ideally we would implement the logic in the plugin and inject it into the TextFieldMapper in core, but with our current architecture there is no easy way to do that.
  2. Increased Complexity of TextFieldMapper – The text field already supports numerous functions, and adding more capabilities risks overloading it, making it harder to manage and maintain. We should avoid bloating it with additional responsibilities.


Low Level Design

The low level design covers only the recommended high level design, option 1.

Since the HLD is already very long, we created a separate RFC for the LLD.

Appendix

Index Creation Future State Example

Setup example with a single semantic field:

PUT /my-nlp-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "text"
      },
      "text": {
        "type": "semantic",
        // This can be omitted since by default the raw field type will be text
        "raw_field_type": "text",
        "model_id": "aVeif4oB5Vm0Tdw8zYO2"
      }
    }
  }
}

Then the index will be created with mapping like:

"mappings": {
            "properties": {
                "id": {
                    "type": "text"
                },
                "text": {
                    "type": "semantic",
                    "raw_field_type": "text",
                    "model_id": "aVeif4oB5Vm0Tdw8zYO2"
                },
                // Auto add semantic_info fields
                "text_semantic_info": {
                    "properties": {
                        // Use nested field to handle text chunking
                        "chunks": {
                            "type": "nested",
                            "properties": {
                                // Use knn_vector for TEXT_EMBEDDING model
                                "embedding": {
                                    "type": "knn_vector",
                                    "dimension": 768,
                                    "method": {
                                        "engine": "faiss",
                                        "space_type": "l2",
                                        "name": "hnsw",
                                        "parameters": {}
                                    }
                                },
                                "text": {
                                    "type": "text"
                                }
                            }
                        },
                        // metadata of the model we use to generate the embedding
                        "model": {
                            "properties": {
                                "id": {
                                    "type": "text",
                                    "index": false
                                },
                                "name": {
                                    "type": "text",
                                    "index": false
                                },
                                "type": {
                                    "type": "text",
                                    "index": false
                                }
                            }
                        }
                    }
                }
            }
        }

P1 Requirements

Features in P1 are ones we think are valuable and should be a fast follow-up after P0.

R1 ML Model Support Enhancement

R1.1 Restrict Model Removal

Models associated with a semantic field should not be removable. Removing such models would cause failures during indexing and querying operations for the semantic field, as the model is essential for embedding generation and query processing. To ensure system stability, we will enforce a validation check to prevent the deletion of any model currently in use by a semantic field.
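For example, the existing ml-commons delete API would be expected to reject such a request (the error message is illustrative):

DELETE /_plugins/_ml/models/aVeif4oB5Vm0Tdw8zYO2
// Proposed: fails validation if the model is referenced by any semantic field,
// e.g. "Model aVeif4oB5Vm0Tdw8zYO2 is used by semantic fields and cannot be deleted."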

R2 Query Enhancement

R2.1 Enhance Neural Sparse Query

With the existing functionality, we can define a search pipeline with a neural_sparse_two_phase_processor or a neural_query_enricher processor to enhance the neural sparse query. If the ML model of the semantic field is a sparse model, we should also support this query enhancement. We can support it either by configuring it in the semantic field query itself or through a search pipeline.
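Today that enhancement is configured through a search pipeline, for example (a minimal sketch of the existing neural_sparse_two_phase_processor):

PUT /_search/pipeline/two_phase_search_pipeline
{
  "request_processors": [
    {
      "neural_sparse_two_phase_processor": {
        "tag": "neural-sparse",
        "enabled": true
      }
    }
  ]
}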

R3 Configurability for Neural Field Inference

In P0, inference for the semantic field is based on default configurations. Moving forward, we aim to give customers the ability to configure how embeddings are generated for the semantic field during inference. Currently, OpenSearch allows customers to configure the knn_vector field type at index creation. A similar approach can be adopted for the semantic field, enabling users to specify configurations such as chunking thresholds or other model-specific parameters directly in the index mapping. This flexibility will let customers tailor the inference process to their needs while keeping default configurations for ease of use.

P2 Requirements

Features in P2 are ones we are not sure are really needed. We may drop them if we don't see a real need.

R1 Advanced model id update

When we update the model, the best practice is to update the existing embeddings using the latest model. So this feature should support automatically updating the embeddings when we update the model of the semantic field. If the new model is not compatible with the old one, we should either throw an error or automatically update the index to make it work, somewhat like an auto re-index based on the new model.

R2 Default model id support

If a model ID is not explicitly defined for a semantic field, a default model will be used to generate embeddings and handle neural queries. This default model will be an OpenSearch-supported, pre-trained dense model. By providing a default model, we simplify the setup process, making it easier for new users to explore and leverage the neural search feature without needing to configure or register a custom model. This approach ensures a user-friendly experience while offering a quick starting point for neural search.

Actions Can Modify Index Mapping
