
Implement Optimized embedding generation in text and image embedding processor #1249

Conversation

@will-hwang (Contributor) commented Mar 28, 2025

Description

This PR implements optimized embedding generation for the text/image embedding processor.

Notable differences from the previous optimizations in the text embedding and sparse encoding processors:

  1. For an embedding to be copied, both the text and image fields must be compared. If either differs between the new and existing document, the embedding cannot be copied over (see the sketch after this list).
  2. The field map in the text/image embedding processor does not support nested structures.
  3. The text/image embedding processor does not support the batch_size option for _bulk updates; batch operations process documents one by one.
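A minimal sketch of the copy decision (names such as canCopyEmbedding, newDocument, and fieldMap are illustrative assumptions, not the PR's actual code): the processor compares every mapped source field of the incoming document against the previously ingested document, and reuses the stored embedding only when both the text and the image values are unchanged.

import java.util.Map;
import java.util.Objects;

// Sketch only: illustrates the skip_existing comparison described above.
final class SkipExistingSketch {

    // Returns true when the stored embedding can be reused, i.e. every mapped
    // source field (both text and image) is identical in the new and existing document.
    static boolean canCopyEmbedding(
        Map<String, Object> newDocument,
        Map<String, Object> existingDocument,
        Map<String, String> fieldMap // e.g. {"text": "image_description", "image": "image_binary"}
    ) {
        if (existingDocument == null) {
            return false; // nothing ingested yet, so there is nothing to copy
        }
        for (String sourceField : fieldMap.values()) {
            // If either the text or the image value differs, the embedding must be regenerated.
            if (Objects.equals(newDocument.get(sourceField), existingDocument.get(sourceField)) == false) {
                return false;
            }
        }
        return true;
    }
}

When the check passes, the processor can copy vector_embedding from the existing document instead of calling the model again; otherwise it falls back to regular embedding generation.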

Related PRs:

Text Embedding Processor Optimization PR: link
Sparse Encoding Processor Optimization PR: link
Related RFC: #1138

Benchmark Results

Inference Model: amazon.titan-embed-image-v1
Dataset: flickr image dataset (https://www.kaggle.com/datasets/hsankesara/flickr-image-dataset?resource=download)

Pipeline (the skip_existing_on benchmark runs below use the same pipeline with "skip_existing" set to true):

{
  "description": "A text/image embedding pipeline",
  "processors": [
    {
      "text_image_embedding": {
        "model_id": "w7dR6pUBlqd61Tw827EG",
        "embedding": "vector_embedding",
        "field_map": {
          "text": "image_description",
          "image": "image_binary"
        },
        "skip_existing": false
      }
    }
  ]
}
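As a usage sketch (not part of this PR), the pipeline above can be registered under the name referenced by the index's default_pipeline setting by calling the ingest pipeline API. The example below uses Java's built-in HttpClient, assumes a local unsecured cluster at http://localhost:9200, and sets "skip_existing" to true, which is the configuration the skip_existing_on benchmark runs use:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: registers the text/image embedding pipeline as "nlp-ingest-pipeline"
// (the name referenced by the index's default_pipeline setting).
public class RegisterPipeline {
    public static void main(String[] args) throws Exception {
        String pipelineBody = """
            {
              "description": "A text/image embedding pipeline",
              "processors": [
                {
                  "text_image_embedding": {
                    "model_id": "w7dR6pUBlqd61Tw827EG",
                    "embedding": "vector_embedding",
                    "field_map": {
                      "text": "image_description",
                      "image": "image_binary"
                    },
                    "skip_existing": true
                  }
                }
              ]
            }""";

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9200/_ingest/pipeline/nlp-ingest-pipeline"))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(pipelineBody))
            .build();

        // Expect a 200 response with {"acknowledged": true} on success.
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}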

Index:

{
  "settings": {
    "index.knn": true,
    "default_pipeline": "nlp-ingest-pipeline",
    "number_of_shards": 3
  },
  "mappings": {
    "properties": {
      "vector_embedding": {
        "type": "knn_vector",
        "dimension": 1024,
        "method": {
          "name": "hnsw",
          "engine": "lucene",
          "parameters": {}
        }
      },
      "image_description": {
        "type": "text"
      },
      "image_binary": {
        "type": "binary"
      }
    }
  }
}

Ingest Latency

The following table presents latency measurements of the initial ingest operation (in milliseconds) with the skip_existing feature enabled and disabled. The Percent Difference column shows the relative impact, computed as (skip_existing_on - skip_existing_off) / skip_existing_off * 100 (e.g., for the 1000-document run: 25016.11 / 350052.94 ≈ 7.15%).

| Operation | Doc Size | Batch Size | skip_existing_off latency (ms) | skip_existing_on latency (ms) | Difference (ms) | Percent Difference |
|---|---|---|---|---|---|---|
| Single Ingest | 1000 | 1 | 350052.94 | 375069.05 | 25016.11 | 7.15% |
| Single Ingest | 2000 | 1 | 674941.13 | 620254.93 | -54686.2 | -8.10% |
| Single Ingest | 3000 | 1 | 1060338.51 | 1060785 | 446.49 | 0.04% |
| Batch Ingest | 31783 | 200 | 1809298.92 | 1662389.55 | -146909.37 | -8.12% |

Update Latency

The following table presents latency measurements of an update operation performed after an identical ingest operation (in milliseconds), with the skip_existing feature enabled and disabled. The Percent Difference column shows the relative impact, computed the same way as above.

| Operation | Doc Size | Batch Size | skip_existing_off latency (ms) | skip_existing_on latency (ms) | Difference (ms) | Percent Difference |
|---|---|---|---|---|---|---|
| Single Update | 1000 | 1 | 350052.94 | 180572.66 | -169480.28 | -48.42% |
| Single Update | 2000 | 1 | 674941.13 | 296953.31 | -377987.82 | -56.00% |
| Single Update | 3000 | 1 | 1060338.51 | 465770.91 | -594567.6 | -56.07% |
| Batch Update | 31783 | 200 | 1809298.92 | 1571011.76 | -238287.16 | -13.17% |

Related Issues

Resolves #1138

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on the Developer Certificate of Origin and signing off your commits, please check here.

// Compare each mapped field (text and image) of the new document against the
// existing document; if either is missing or changed, the embedding is regenerated.
for (Map.Entry<String, String> entry : knnMap.entrySet()) {
    String key = entry.getKey();
    String value = entry.getValue();
    if (existingDocument.containsKey(key) == false || existingDocument.get(key).equals(value) == false) {
Contributor:
I think we need to compare both the text and image here; can this be done by just checking one key?

Contributor Author (@will-hwang):
knnMap contains two keys: one for the text field and one for the image field. Each entry is compared with the corresponding text or image value of the existing document.
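For the pipeline in the description, knnMap would therefore look roughly like this (values are illustrative, not real data):

import java.util.Map;

// Illustrative contents of knnMap for the pipeline above: one entry per mapped
// source field, keyed by the document field name with the new document's value.
Map<String, String> knnMap = Map.of(
    "image_description", "A dog catching a frisbee in a park",  // text field value
    "image_binary", "iVBORw0KGgoAAAANSUhEUg..."                  // base64-encoded image (truncated)
);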

Contributor:
Synced offline: this code works because we currently only allow users to define one text field and one image field, so knnMap contains only those two fields, and both must be unchanged to reuse the existing embedding. We should add a comment to call this out.

We may also want to allow users to define multiple text and image fields in the processor; we could create an RFC to see if there is a user need.

Collaborator (@heemin32) commented Mar 31, 2025:
> We may also want to allow users to define multiple text and image fields in the processor; we could create an RFC to see if there is a user need.

See #476.

@will-hwang force-pushed the text_image_embedding_processor_optimization branch from b9b21d0 to 109954a on March 31, 2025 at 22:58.
@heemin32 merged commit 1b47f0e into opensearch-project:main on Apr 1, 2025.
48 of 50 checks passed.

codecov bot commented Apr 1, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 0.00%. Comparing base (8deb63c) to head (109954a).
Report is 1 commit behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #1249       +/-   ##
============================================
- Coverage     82.23%       0   -82.24%     
============================================
  Files           106       0      -106     
  Lines          5078       0     -5078     
  Branches        864       0      -864     
============================================
- Hits           4176       0     -4176     
+ Misses          569       0      -569     
+ Partials        333       0      -333     

☔ View full report in Codecov by Sentry.

Labels: v3.0.0

Successfully merging this pull request may close these issues:

[RFC] Optimizing Text Embedding Processor (#1138)