
Implement Optimized embedding generation in text and image embedding processor #1249

Conversation

@will-hwang (Contributor) commented Mar 28, 2025

Description

This PR implements optimized embedding generation for the text/image embedding processor.

Notable differences from the previous optimizations in the text embedding and sparse encoding processors:

  1. For an embedding to be copied, both the text and image fields must be compared. If either differs between the new and existing document, the embedding cannot be copied over (see the sketch after this list).
  2. The field map in the text/image embedding processor does not support nested structures.
  3. The text/image embedding processor does not support the batch_size option for _bulk updates; batch operations process documents one by one.
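A minimal sketch of the copy decision (names such as canCopyEmbedding, newDocument, and fieldMap are illustrative assumptions, not the PR's actual code): the processor compares every mapped source field of the incoming document against the previously ingested document, and reuses the stored embedding only when both the text and the image values are unchanged.

import java.util.Map;
import java.util.Objects;

// Sketch only: illustrates the skip_existing comparison described above.
final class SkipExistingSketch {

    // Returns true when the stored embedding can be reused, i.e. every mapped
    // source field (both text and image) is identical in the new and existing document.
    static boolean canCopyEmbedding(
        Map<String, Object> newDocument,
        Map<String, Object> existingDocument,
        Map<String, String> fieldMap // e.g. {"text": "image_description", "image": "image_binary"}
    ) {
        if (existingDocument == null) {
            return false; // nothing ingested yet, so there is nothing to copy
        }
        for (String sourceField : fieldMap.values()) {
            // If either the text or the image value differs, the embedding must be regenerated.
            if (Objects.equals(newDocument.get(sourceField), existingDocument.get(sourceField)) == false) {
                return false;
            }
        }
        return true;
    }
}

When the check passes, the processor can copy vector_embedding from the existing document instead of calling the model again; otherwise it falls back to regular embedding generation.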

Related PRs:

Text Embedding Processor Optimization PR: link
Sparse Encoding Processor Optimization PR: link
Related RFC: #1138

Benchmark Results

Inference Model: amazon.titan-embed-image-v1
Dataset: flickr image dataset (https://www.kaggle.com/datasets/hsankesara/flickr-image-dataset?resource=download)

Pipeline (the skip_existing_on benchmark runs below use the same pipeline with "skip_existing" set to true):

{
  "description": "A text/image embedding pipeline",
  "processors": [
    {
      "text_image_embedding": {
        "model_id": "w7dR6pUBlqd61Tw827EG",
        "embedding": "vector_embedding",
        "field_map": {
          "text": "image_description",
          "image": "image_binary"
        },
        "skip_existing": false
      }
    }
  ]
}
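As a usage sketch (not part of this PR), the pipeline above can be registered under the name referenced by the index's default_pipeline setting by calling the ingest pipeline API. The example below uses Java's built-in HttpClient, assumes a local unsecured cluster at http://localhost:9200, and sets "skip_existing" to true, which is the configuration the skip_existing_on benchmark runs use:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: registers the text/image embedding pipeline as "nlp-ingest-pipeline"
// (the name referenced by the index's default_pipeline setting).
public class RegisterPipeline {
    public static void main(String[] args) throws Exception {
        String pipelineBody = """
            {
              "description": "A text/image embedding pipeline",
              "processors": [
                {
                  "text_image_embedding": {
                    "model_id": "w7dR6pUBlqd61Tw827EG",
                    "embedding": "vector_embedding",
                    "field_map": {
                      "text": "image_description",
                      "image": "image_binary"
                    },
                    "skip_existing": true
                  }
                }
              ]
            }""";

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9200/_ingest/pipeline/nlp-ingest-pipeline"))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(pipelineBody))
            .build();

        // Expect a 200 response with {"acknowledged": true} on success.
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}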

Index:

{
  "settings": {
    "index.knn": true,
    "default_pipeline": "nlp-ingest-pipeline",
    "number_of_shards": 3
  },
  "mappings": {
    "properties": {
      "vector_embedding": {
        "type": "knn_vector",
        "dimension": 1024,
        "method": {
          "name": "hnsw",
          "engine": "lucene",
          "parameters": {}
        }
      },
      "image_description": {
        "type": "text"
      },
      "image_binary": {
        "type": "binary"
      }
    }
  }
}

Ingest Latency

The following table presents latency measurements of the initial ingest operation (in milliseconds) with the skip_existing feature enabled and disabled. The Percent Difference column shows the relative impact, computed as (skip_existing_on - skip_existing_off) / skip_existing_off * 100 (e.g., for the 1000-document run: 25016.11 / 350052.94 ≈ 7.15%).

| Operation | Doc Size | Batch Size | skip_existing_off latency (ms) | skip_existing_on latency (ms) | Difference (ms) | Percent Difference |
|---|---|---|---|---|---|---|
| Single Ingest | 1000 | 1 | 350052.94 | 375069.05 | 25016.11 | 7.15% |
| Single Ingest | 2000 | 1 | 674941.13 | 620254.93 | -54686.2 | -8.10% |
| Single Ingest | 3000 | 1 | 1060338.51 | 1060785 | 446.49 | 0.04% |
| Batch Ingest | 31783 | 200 | 1809298.92 | 1662389.55 | -146909.37 | -8.12% |

Update Latency

The following table presents latency measurements of an update operation performed after an identical ingest operation (in milliseconds), with the skip_existing feature enabled and disabled. The Percent Difference column shows the relative impact, computed the same way as above.

| Operation | Doc Size | Batch Size | skip_existing_off latency (ms) | skip_existing_on latency (ms) | Difference (ms) | Percent Difference |
|---|---|---|---|---|---|---|
| Single Update | 1000 | 1 | 350052.94 | 180572.66 | -169480.28 | -48.42% |
| Single Update | 2000 | 1 | 674941.13 | 296953.31 | -377987.82 | -56.00% |
| Single Update | 3000 | 1 | 1060338.51 | 465770.91 | -594567.6 | -56.07% |
| Batch Update | 31783 | 200 | 1809298.92 | 1571011.76 | -238287.16 | -13.17% |

Related Issues

Resolves #1138

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on the Developer Certificate of Origin and signing off your commits, please check here.

// Compare each mapped field (text and image) of the new document against the
// existing document; if either is missing or changed, the embedding is regenerated.
for (Map.Entry<String, String> entry : knnMap.entrySet()) {
    String key = entry.getKey();
    String value = entry.getValue();
    if (existingDocument.containsKey(key) == false || existingDocument.get(key).equals(value) == false) {
Contributor:
I think we need to compare both the text and image here; can this be done by just checking one key?

Contributor Author (@will-hwang):
knnMap contains two keys: one for the text field and one for the image field. Each entry is compared with the corresponding text or image value of the existing document.
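For the pipeline in the description, knnMap would therefore look roughly like this (values are illustrative, not real data):

import java.util.Map;

// Illustrative contents of knnMap for the pipeline above: one entry per mapped
// source field, keyed by the document field name with the new document's value.
Map<String, String> knnMap = Map.of(
    "image_description", "A dog catching a frisbee in a park",  // text field value
    "image_binary", "iVBORw0KGgoAAAANSUhEUg..."                  // base64-encoded image (truncated)
);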

Contributor:
Synced offline: this code works because we currently only allow users to define one text field and one image field, so knnMap contains only those two fields, and both must be unchanged to reuse the existing embedding. We should add a comment to call this out.

We may also want to allow users to define multiple text and image fields in the processor; we could create an RFC to see if there is a user need.

Collaborator (@heemin32) commented Mar 31, 2025:
> We may also want to allow users to define multiple text and image fields in the processor; we could create an RFC to see if there is a user need.

See #476.

@will-hwang force-pushed the text_image_embedding_processor_optimization branch from b9b21d0 to 109954a on March 31, 2025 at 22:58.
@heemin32 merged commit 1b47f0e into opensearch-project:main on Apr 1, 2025.
48 of 50 checks passed.

codecov bot commented Apr 1, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 0.00%. Comparing base (8deb63c) to head (109954a).
Report is 1 commit behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #1249       +/-   ##
============================================
- Coverage     82.23%       0   -82.24%     
============================================
  Files           106       0      -106     
  Lines          5078       0     -5078     
  Branches        864       0      -864     
============================================
- Hits           4176       0     -4176     
+ Misses          569       0      -569     
+ Partials        333       0      -333     

☔ View full report in Codecov by Sentry.

Labels: v3.0.0

Successfully merging this pull request may close these issues:

[RFC] Optimizing Text Embedding Processor (#1138)