Remove validation on text and image field for text_image_embedding processor #1230

weijia-aws · 2025-03-13T23:51:37Z

Description

This PR removes the text and image field validation for text_image_processor

Related Issues

Resolves #1221

Code Bug

Request:

POST nlp-index/_doc/1?pipeline=nlp-ingest-pipeline
{
  "text": ""
}

Response:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "map type field [text] has empty string value, cannot process it"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "map type field [text] has empty string value, cannot process it"
  },
  "status": 400
}

Code Bug Deep Dive

When create a text_image_processor, we need to pass a field_map that at least contains text or image field, see doc here. During document ingestion, the processor will read the actual value from input_text_field and input_image_field, NOT from text or image.

However it's possible that some document contain the text or image field, as well as input_text_field and input_image_field. Since the processor reads the actual values from input_text_field and input_image_field and create embeddings against them, these text or image fields should be treated as unmapped fields. We should allow users to ingest whatever value text or image field has.

Currently there's a validation for creating the processor to ensure the field_map is valid.

During ingestion, there's another validation to validate the embedding field values (from input_text_field and input_image_field). However this method is wrong, it checks the field values with text or image field, we should remove this method entirely. As for validating the actual embedding field values, it is already covered here

Check List

[N] New functionality includes testing.
[N/A] New functionality has been documented.
[N] API changes companion pull request created.
[Y] Commits are signed per the DCO using --signoff.
[N/A] Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…ocessor Signed-off-by: Weijia Zhao <[email protected]>

codecov · 2025-03-14T00:03:11Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.70%. Comparing base (f5377c0) to head (8042725).
Report is 1 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #1230      +/-   ##
============================================
- Coverage     81.85%   81.70%   -0.16%     
+ Complexity     2682     1339    -1343     
============================================
  Files           192       96      -96     
  Lines          9158     4574    -4584     
  Branches       1568      784     -784     
============================================
- Hits           7496     3737    -3759     
+ Misses         1060      538     -522     
+ Partials        602      299     -303

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

will-hwang · 2025-03-14T00:11:53Z

src/main/java/org/opensearch/neuralsearch/processor/TextImageEmbeddingProcessor.java

@@ -173,20 +170,6 @@ Map<String, Object> buildTextEmbeddingResult(final String knnKey, List<Number> m
        return result;
    }

-    private void validateEmbeddingFieldsValue(final IngestDocument ingestDocument) {


we should not remove this method entirely, but pass in true for allowEmpty field here

This method simply validates the values in field_map, however this field_map contains:

{ text: "input_text_field", image: "input_image_field", }

It's correct to validate it during processor creation, but not document ingestion. We can accept whatever value text or image field has in a document, so there's no point to keep the validation method

we have the same validation for text chunking processor here

I would propose to enable allowEmpty to true and avoid removing this method

The text chunking processor is different, in the processor, the field map is:

"field_map": { "<input_field>": "<output_field>" }

where input_field is The name of the field from which to obtain text for generating chunked passages. source, so it's correct to have that validation method (this is also the only one validation for this processor).

While in the text_image_embedding processor, the field map is

{ text: "input_text_field", image: "input_image_field", }

where input_text_field and input_image_field is the field to obtain values for embedding

we should validate the input_text_field and input_image_field instead of text and image (which is what that validation method does). Since we accept all values for text and image field, there's no validations needed.

I think we are using this function wrongly here for the TextImageEmbeddingProcessor. But I think we still want to validate:

The depth of the map should not exceed the limits.

The input_text_field and input_image_field should have string values.

The actual field map we should validate is
{
"input_text_field": "vector_field",
"input_image_field": "vector_field"
}

And we should unflatten it before the validation to handle the "." properly.

My point is to validate if there is a value for the text or image it should be a string value rather than an object or array.

For unmapped field text and image, why do we want to restrict the data type? the value can be anything including string and any other object.

If you are talking about mapped fields input_text_field and input_image_field, we do need to validating the data type. But it's already handled here in buildMapWithKnnKeyAndOriginalValue method

@weijia-aws for 1, you're correct it's already defined to Map<String, String>, we don't need to add any additional validation for this.
for 2, we do allow ingest document with mapped fields with empty strings, in fact the allowEmpty is introduced for TextChunkingProcessor because the chunking field might be empty during ingestion, but this value is defaulted to false in InferenceProcessor which is been extended by TextEmbeddingProcessor, SparseEmbeddingProcessor. The reason why we don't allow the field to be empty is because passing empty string to generate dense/sparse embedding doesn't make any sense.

For now TextImageEmbeddingProcessor's validation is not working because it's validating the reserved keywords instead of actual keys, so I'm open to either merge this PR(as this indeed can fix the bug) or change it to add actual validation.

For now, I will merge as is. We can add the validation in future

@weijia-aws Could you create a github issue for those two cases above?

Create one Github issue for the second case #1239, there's no issue with the first one

heemin32 · 2025-03-14T20:21:43Z

Please update changelog.

Signed-off-by: Weijia Zhao <[email protected]>

martin-gaievski · 2025-03-18T20:33:28Z

Can you address these two items:

add test that will trigger that failure scenario and incorrectly fail with the code we do have today, without this PR
add example of request response for this scenario to PR description, this will ease the understanding of your intension

weijia-aws · 2025-03-19T22:51:50Z

add test that will trigger that failure scenario and incorrectly fail with the code we do have today, without this PR

I don't quick follow. Why do we want to add the test, we already removed the validation in this PR. If you meant to validate the scenario through test, then it's already done, see the unit test here https://github.com/opensearch-project/neural-search/pull/1230/files#diff-c429eb1a1fb2495d60ee8f799c59893de5e77ae1e9273204113c57ab828f7a1eR188, if I keep the validation and run this test, the test will fail with the exact error message.

add example of request response for this scenario to PR description, this will ease the understanding of your intension

Done

Signed-off-by: Weijia Zhao <[email protected]>

opensearch-trigger-bot · 2025-03-20T18:51:15Z

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-1230-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 8506daa6b62fcc2085f701685779828a22a4bd66
# Push it to GitHub
git push --set-upstream origin backport/backport-1230-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-1230-to-2.x.

opensearch-trigger-bot · 2025-03-20T18:51:16Z

The backport to 2.19 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.19 2.19
# Navigate to the new working tree
cd .worktrees/backport-2.19
# Create a new branch
git switch --create backport/backport-1230-to-2.19
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 8506daa6b62fcc2085f701685779828a22a4bd66
# Push it to GitHub
git push --set-upstream origin backport/backport-1230-to-2.19
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.19

Then, create a pull request where the base branch is 2.19 and the compare/head branch is backport/backport-1230-to-2.19.

…ocessor (opensearch-project#1230) * Remove validation on text and image field for text_image_embedding processor Signed-off-by: Weijia Zhao <[email protected]> * Add Changelog Signed-off-by: Weijia Zhao <[email protected]> --------- Signed-off-by: Weijia Zhao <[email protected]> (cherry picked from commit 8506daa)

…ocessor (#1230) (#1240) * Remove validation on text and image field for text_image_embedding processor Signed-off-by: Weijia Zhao <[email protected]> * Add Changelog Signed-off-by: Weijia Zhao <[email protected]> --------- Signed-off-by: Weijia Zhao <[email protected]> (cherry picked from commit 8506daa)

…ocessor (#1230) (#1241) * Remove validation on text and image field for text_image_embedding processor Signed-off-by: Weijia Zhao <[email protected]> * Add Changelog Signed-off-by: Weijia Zhao <[email protected]> --------- Signed-off-by: Weijia Zhao <[email protected]> (cherry picked from commit 8506daa)

Remove validation on text and image field for text_image_embedding pr…

fb7c7c5

…ocessor Signed-off-by: Weijia Zhao <[email protected]>

weijia-aws requested review from heemin32, navneet1v, VijayanB, vamshin, jmazanec15, naveentatikonda, junqiu-lei, martin-gaievski, sean-zheng-amazon, model-collapse, zane-neo, vibrantvarun, zhichao-aws, yuye-aws and minalsha as code owners March 13, 2025 23:51

github-actions bot added the bug Something isn't working label Mar 13, 2025

will-hwang reviewed Mar 14, 2025

View reviewed changes

Add Changelog

49d74ab

Signed-off-by: Weijia Zhao <[email protected]>

martin-gaievski changed the title ~~Remove validation on text and image field for text_image_embedding pr…~~ Remove validation on text and image field for text_image_embedding processor Mar 18, 2025

Merge branch 'main' into bug-fix

8042725

Signed-off-by: Weijia Zhao <[email protected]>

heemin32 approved these changes Mar 20, 2025

View reviewed changes

heemin32 added backport 2.x Label will add auto workflow to backport PR to 2.x branch backport 2.19 labels Mar 20, 2025

weijia-aws mentioned this pull request Mar 20, 2025

Add validation for values in mapped fields in TextImageEmbeddingProcessor #1239

Open

junqiu-lei approved these changes Mar 20, 2025

View reviewed changes

junqiu-lei merged commit 8506daa into opensearch-project:main Mar 20, 2025
52 of 56 checks passed

weijia-aws mentioned this pull request Mar 20, 2025

Remove validation on text and image field for text_image_embedding processor #1240

Merged

5 tasks

weijia-aws mentioned this pull request Mar 20, 2025

Remove validation on text and image field for text_image_embedding processor #1241

Merged

5 tasks

weijia-aws deleted the bug-fix branch March 20, 2025 21:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove validation on text and image field for text_image_embedding processor #1230

Remove validation on text and image field for text_image_embedding processor #1230

weijia-aws commented Mar 13, 2025 •

edited

Loading

codecov bot commented Mar 14, 2025 •

edited

Loading

will-hwang Mar 14, 2025

weijia-aws Mar 14, 2025

will-hwang Mar 14, 2025

weijia-aws Mar 14, 2025

bzhangam Mar 14, 2025

weijia-aws Mar 17, 2025

zane-neo Mar 18, 2025

weijia-aws Mar 19, 2025

heemin32 Mar 20, 2025

weijia-aws Mar 20, 2025

heemin32 commented Mar 14, 2025

martin-gaievski commented Mar 18, 2025

weijia-aws commented Mar 19, 2025

opensearch-trigger-bot bot commented Mar 20, 2025

opensearch-trigger-bot bot commented Mar 20, 2025

Remove validation on text and image field for text_image_embedding processor #1230

Remove validation on text and image field for text_image_embedding processor #1230

Conversation

weijia-aws commented Mar 13, 2025 • edited Loading

Description

Related Issues

Code Bug

Code Bug Deep Dive

Check List

codecov bot commented Mar 14, 2025 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

heemin32 commented Mar 14, 2025

martin-gaievski commented Mar 18, 2025

weijia-aws commented Mar 19, 2025

opensearch-trigger-bot bot commented Mar 20, 2025

opensearch-trigger-bot bot commented Mar 20, 2025

weijia-aws commented Mar 13, 2025 •

edited

Loading

codecov bot commented Mar 14, 2025 •

edited

Loading