[RFC] Introduce Mapping Transformer #17500

Open
bzhangam opened this issue Mar 3, 2025 · 4 comments · May be fixed by #17635
Labels
Cluster Manager enhancement Enhancement or improvement to existing feature or request Plugins untriaged

Comments

@bzhangam

bzhangam commented Mar 3, 2025

Is your feature request related to a problem? Please describe

I'm working on a proposal in the neural search plugin to simplify neural search setup. We propose introducing a new field type in which the user can define a model ID. Based on that model ID, we would automatically generate the neural-search-related fields, such as a knn_vector field, and add them to the index mapping. We now need to figure out how to auto-generate those fields and add them to the index mapping.

Describe the solution you'd like

There are a couple of APIs that can modify the index mapping or an index template.

When those APIs are invoked, we want to inject logic that transforms the mapping before it is stored.

There are two possible solutions to do that:
Option 1:
We leverage ActionFilter to modify the requests that create or modify the index mapping or index template. But ActionFilter is overly powerful: it can be used to modify any action, and we don't think it is a good idea to keep using it for something it was not meant for.

Pros: No need to modify core.
Cons: It's not clear that we rely on ActionFilter to manage the mapping transformation.
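
To make the concern with Option 1 concrete, here is a minimal sketch in Python rather than OpenSearch's Java. It is not the real ActionFilter API: the class, the action names, and the `_transformed` marker are all hypothetical. It only shows that a generic filter sees every action and has to pick out the mapping-related ones itself.

```python
from typing import Any, Callable, Dict, List

# Hypothetical action names; the thread does not list the real ones.
MAPPING_ACTIONS = {"create_index", "put_mapping", "put_index_template"}


class GenericActionFilter:
    """Hypothetical stand-in for a filter that intercepts every action."""

    def __init__(self, transform: Callable[[Dict[str, Any]], Dict[str, Any]]):
        self.transform = transform
        self.seen: List[str] = []  # every action passes through, relevant or not

    def apply(self, action: str, request: Dict[str, Any]) -> Dict[str, Any]:
        self.seen.append(action)
        # The mapping logic is hidden behind a name check inside a generic hook.
        if action in MAPPING_ACTIONS and "mappings" in request:
            request = {**request, "mappings": self.transform(request["mappings"])}
        return request


def tag_mapping(mappings: Dict[str, Any]) -> Dict[str, Any]:
    # Toy transform: adds a marker so the effect is visible.
    return {**mappings, "_transformed": True}


flt = GenericActionFilter(tag_mapping)
search = flt.apply("search", {"query": {}})
created = flt.apply("create_index", {"mappings": {"properties": {}}})
```

In this sketch the search request is passed through unchanged, but the filter was still invoked for it, which is the "not clear" coupling listed under Cons.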

Option 2:
We introduce a new interface in MapperPlugin that allows plugins to implement a MappingTransformer. Then, in the actions that modify the index mapping, we invoke the MappingTransformers implemented by plugins to transform the mapping before we store it.

Pros: Clear responsibility.
Cons: Need to modify core to introduce this new interface.
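
As a rough illustration of Option 2, the sketch below models the proposed hook in Python rather than OpenSearch's Java. Only the names MapperPlugin and MappingTransformer come from this RFC. The method names `transform` and `get_mapping_transformers`, the `apply_transformers` helper, and the demo plugin are assumptions made for the example, not the final interface.

```python
from typing import Any, Dict, List, Protocol


class MappingTransformer(Protocol):
    """Hypothetical interface a plugin would implement."""

    def transform(self, mappings: Dict[str, Any]) -> Dict[str, Any]: ...


class MapperPluginSketch:
    """Hypothetical plugin hook; by default a plugin contributes no transformers."""

    def get_mapping_transformers(self) -> List[MappingTransformer]:
        return []


def apply_transformers(mappings: Dict[str, Any],
                       plugins: List[MapperPluginSketch]) -> Dict[str, Any]:
    """What core would call in each mapping-modifying action before storing."""
    for plugin in plugins:
        for transformer in plugin.get_mapping_transformers():
            mappings = transformer.transform(mappings)
    return mappings


class AddMarker:
    """Toy transformer: adds a 'marker' keyword field if it is absent."""

    def transform(self, mappings: Dict[str, Any]) -> Dict[str, Any]:
        props = dict(mappings.get("properties", {}))
        props.setdefault("marker", {"type": "keyword"})
        return {**mappings, "properties": props}


class DemoPlugin(MapperPluginSketch):
    def get_mapping_transformers(self) -> List[MappingTransformer]:
        return [AddMarker()]


result = apply_transformers(
    {"properties": {"id": {"type": "text"}}},
    [MapperPluginSketch(), DemoPlugin()],
)
```

The point of the design is that the hook is named for what it does and is only called from mapping-related actions, so a plugin cannot use it to alter unrelated requests.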

We would prefer Option 2 because it is clearer, but we would also like to get feedback from the community.

Thanks.

Related component

Plugins

Describe alternatives you've considered

No response

Additional context

[RFC] Support Semantic Field Type to Simplify Neural Search Set Up HLD
[RFC] Support Semantic Field Type to Simplify Neural Search Set Up LLD

@bzhangam bzhangam added enhancement Enhancement or improvement to existing feature or request untriaged labels Mar 3, 2025
@bzhangam bzhangam changed the title [Feature Request] Introduce Mapping Transformer [RFC] Introduce Mapping Transformer Mar 3, 2025
@bugmakerrrrrr
Collaborator

Hi @bzhangam , could you please provide an example demonstrating how the index mapping will be transformed? Is it possible to update the mapping while ingesting data, similar to what we did in the bulk action?

@bzhangam
Author

bzhangam commented Mar 4, 2025

Sure. Basically, we are proposing that when a user creates an index like this:

{
   "settings":{
      "index.knn":true
   },
   "mappings":{
      "properties":{
         "id":{
            "type":"text"
         },
         "products":{
            "type":"nested",
            "properties":{
               "product_description":{
                  "type":"semantic",
                  "model_id":"oC31TZUBuSxkFaMuZlMo"
               }
            }
         }
      }
   }
}

We will automatically transform the mapping to:

"mappings":{
         "properties":{
            "id":{
               "type":"text"
            },
            "products":{
               "type":"nested",
               "properties":{
                  "product_description":{
                     "type":"semantic",
                     "model_id":"oC31TZUBuSxkFaMuZlMo",
                     "raw_field_type":"text"
                  },
                  // This is the default name, but we also allow users to define a custom name
                  // to avoid naming conflicts.
                  "product_description_semantic_info":{
                     "properties":{
                        "chunks":{
                           "type":"nested",
                           "properties":{
                              "embedding":{
                                 "type":"knn_vector", // comes from the ML model config
                                 "dimension":768, // comes from the ML model config
                                 "method":{
                                    "engine":"faiss", // default config
                                    "space_type":"l2", // comes from the ML model config
                                    "name":"hnsw", // default config
                                    "parameters":{
                                       
                                    }
                                 }
                              },
                              "text":{
                                 "type":"text"
                              }
                           }
                        },
                        // Model metadata. Only store it but not index it.
                        "model":{
                           "properties":{
                              "id":{
                                 "type":"text",
                                 "index":false
                              },
                              "name":{
                                 "type":"text",
                                 "index":false
                              },
                              "type":{
                                 "type":"text",
                                 "index":false
                              }
                           }
                        }
                     }
                  }
               }
            }
         }
      }
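
The transformation above can be sketched as a recursive walk over `properties`. This is a Python illustration under stated assumptions, not the plugin's implementation: `get_model_config` is a hypothetical lookup that returns the model's `dimension` and `space_type`, the default `_semantic_info` suffix is used, and the custom-name and naming-conflict handling mentioned in the comments is left out.

```python
import copy
from typing import Any, Callable, Dict


def semantic_info_mapping(model_config: Dict[str, Any]) -> Dict[str, Any]:
    """Build the generated sibling field shown in the example above."""
    return {
        "properties": {
            "chunks": {
                "type": "nested",
                "properties": {
                    "embedding": {
                        "type": "knn_vector",
                        "dimension": model_config["dimension"],  # from model config
                        "method": {
                            "engine": "faiss",  # default
                            "space_type": model_config["space_type"],  # from model config
                            "name": "hnsw",  # default
                            "parameters": {},
                        },
                    },
                    "text": {"type": "text"},
                },
            },
            # Model metadata: stored but not indexed.
            "model": {
                "properties": {
                    key: {"type": "text", "index": False}
                    for key in ("id", "name", "type")
                }
            },
        }
    }


def transform_properties(
    properties: Dict[str, Any],
    get_model_config: Callable[[str], Dict[str, Any]],  # hypothetical lookup
) -> Dict[str, Any]:
    """Return a new properties dict with a *_semantic_info sibling per semantic field."""
    result: Dict[str, Any] = {}
    for name, field in properties.items():
        field = copy.deepcopy(field)  # leave the caller's mapping untouched
        if field.get("type") == "semantic":
            field.setdefault("raw_field_type", "text")
            config = get_model_config(field["model_id"])
            result[name] = field
            result[f"{name}_semantic_info"] = semantic_info_mapping(config)
        else:
            if "properties" in field:  # recurse into object and nested fields
                field["properties"] = transform_properties(
                    field["properties"], get_model_config
                )
            result[name] = field
    return result


original = {
    "id": {"type": "text"},
    "products": {
        "type": "nested",
        "properties": {
            "product_description": {
                "type": "semantic",
                "model_id": "oC31TZUBuSxkFaMuZlMo",
            }
        },
    },
}
transformed = transform_properties(
    original, lambda model_id: {"dimension": 768, "space_type": "l2"}
)
```

Because the walk recurses into nested `properties`, the generated field ends up next to the semantic field inside `products`, as in the example mapping.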

We cannot do this while we ingest data, because we expect those fields to be added to the index mapping before we parse it to create the mapper service. The corresponding field mappers need to exist so that we can handle indexing and queries properly.

@bugmakerrrrrr
Collaborator

@bzhangam I think it is possible to update the mapping while ingesting data. If we find that the fields derived from the semantic field are not included in the index mapping during ingestion, we can submit a mapping update request and continue processing the data after the request is completed. Please correct me if I misunderstood.

@bzhangam
Author

bzhangam commented Mar 6, 2025

@bugmakerrrrrr I think you are right, that is doable. But I think it is probably better to create those fields when we add the semantic field to the index or index template, for two reasons:

  1. It's cleaner to add the semantic info fields when we add the semantic field, since at that point we already know we should create them.
  2. We can fail fast when we cannot generate the semantic info fields, for example when the ML model is not valid.
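
The fail-fast point can be illustrated with a small Python sketch. Everything here is hypothetical (the `registry` and `store` dicts, `InvalidModelError`, and `create_index` are stand-ins, not OpenSearch or ML Commons APIs). It only shows the ordering: every model is resolved before anything is persisted, so an invalid model rejects the request up front instead of surfacing later during ingestion.

```python
from typing import Any, Dict


class InvalidModelError(ValueError):
    """Raised when a semantic field refers to a model we cannot use."""


def resolve_model_config(model_id: str, registry: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    # `registry` is a hypothetical stand-in for looking up the ML model config.
    config = registry.get(model_id)
    if config is None:
        raise InvalidModelError(f"model '{model_id}' not found")
    missing = [key for key in ("dimension", "space_type") if key not in config]
    if missing:
        raise InvalidModelError(f"model '{model_id}' is missing {missing}")
    return config


def create_index(name: str, semantic_fields: Dict[str, str],
                 registry: Dict[str, Dict[str, Any]],
                 store: Dict[str, Any]) -> Dict[str, Any]:
    # Resolve every model first; nothing is written if any lookup fails.
    configs = {
        field: resolve_model_config(model_id, registry)
        for field, model_id in semantic_fields.items()
    }
    store[name] = configs
    return configs


registry = {"good-model": {"dimension": 768, "space_type": "l2"}}
store: Dict[str, Any] = {}
try:
    create_index("my-index", {"product_description": "missing-model"}, registry, store)
    failed = False
except InvalidModelError:
    failed = True
```

After the failed call, `store` is still empty, so the user sees the error at index creation time and no partially configured index exists.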

I think the main reason we update the mapping while ingesting data, as in the bulk action, is to support dynamic mapping. In that case we don't know which fields to add to the mapping until we ingest the document and identify the unmapped fields, which is different from our use case.

Projects
Status: 🆕 New
3 participants