Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove validation on text and image field for text_image_embedding processor #1230

Merged
merged 3 commits into from
Mar 20, 2025

Conversation

weijia-aws
Copy link
Contributor

@weijia-aws weijia-aws commented Mar 13, 2025

Description

This PR removes the text and image field validation for text_image_processor

Related Issues

Resolves #1221

Code Bug

Request:

POST nlp-index/_doc/1?pipeline=nlp-ingest-pipeline
{
  "text": ""
}

Response:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "map type field [text] has empty string value, cannot process it"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "map type field [text] has empty string value, cannot process it"
  },
  "status": 400
}

Code Bug Deep Dive

When create a text_image_processor, we need to pass a field_map that at least contains text or image field, see doc here. During document ingestion, the processor will read the actual value from input_text_field and input_image_field, NOT from text or image.

However it's possible that some document contain the text or image field, as well as input_text_field and input_image_field. Since the processor reads the actual values from input_text_field and input_image_field and create embeddings against them, these text or image fields should be treated as unmapped fields. We should allow users to ingest whatever value text or image field has.

Currently there's a validation for creating the processor to ensure the field_map is valid.

During ingestion, there's another validation to validate the embedding field values (from input_text_field and input_image_field). However this method is wrong, it checks the field values with text or image field, we should remove this method entirely. As for validating the actual embedding field values, it is already covered here

Check List

  • [N] New functionality includes testing.
  • [N/A] New functionality has been documented.
  • [N] API changes companion pull request created.
  • [Y] Commits are signed per the DCO using --signoff.
  • [N/A] Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Copy link

codecov bot commented Mar 14, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.70%. Comparing base (f5377c0) to head (8042725).
Report is 1 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1230      +/-   ##
============================================
- Coverage     81.85%   81.70%   -0.16%     
+ Complexity     2682     1339    -1343     
============================================
  Files           192       96      -96     
  Lines          9158     4574    -4584     
  Branches       1568      784     -784     
============================================
- Hits           7496     3737    -3759     
+ Misses         1060      538     -522     
+ Partials        602      299     -303     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@@ -173,20 +170,6 @@ Map<String, Object> buildTextEmbeddingResult(final String knnKey, List<Number> m
return result;
}

private void validateEmbeddingFieldsValue(final IngestDocument ingestDocument) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should not remove this method entirely, but pass in true for allowEmpty field here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method simply validates the values in field_map, however this field_map contains:

{
  text: "input_text_field",
  image: "input_image_field",
}

It's correct to validate it during processor creation, but not document ingestion. We can accept whatever value text or image field has in a document, so there's no point to keep the validation method

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have the same validation for text chunking processor here

I would propose to enable allowEmpty to true and avoid removing this method

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The text chunking processor is different, in the processor, the field map is:

"field_map": {
      "<input_field>": "<output_field>"
    }

where input_field is The name of the field from which to obtain text for generating chunked passages. source, so it's correct to have that validation method (this is also the only one validation for this processor).

While in the text_image_embedding processor, the field map is

{
  text: "input_text_field",
  image: "input_image_field",
}

where input_text_field and input_image_field is the field to obtain values for embedding

we should validate the input_text_field and input_image_field instead of text and image (which is what that validation method does). Since we accept all values for text and image field, there's no validations needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are using this function wrongly here for the TextImageEmbeddingProcessor. But I think we still want to validate:

  1. The depth of the map should not exceed the limits.
  2. The input_text_field and input_image_field should have string values.

The actual field map we should validate is
{
"input_text_field": "vector_field",
"input_image_field": "vector_field"
}

And we should unflatten it before the validation to handle the "." properly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point is to validate if there is a value for the text or image it should be a string value rather than an object or array.

For unmapped field text and image, why do we want to restrict the data type? the value can be anything including string and any other object.

If you are talking about mapped fields input_text_field and input_image_field, we do need to validating the data type. But it's already handled here in buildMapWithKnnKeyAndOriginalValue method

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@weijia-aws for 1, you're correct it's already defined to Map<String, String>, we don't need to add any additional validation for this.
for 2, we do allow ingest document with mapped fields with empty strings, in fact the allowEmpty is introduced for TextChunkingProcessor because the chunking field might be empty during ingestion, but this value is defaulted to false in InferenceProcessor which is been extended by TextEmbeddingProcessor, SparseEmbeddingProcessor. The reason why we don't allow the field to be empty is because passing empty string to generate dense/sparse embedding doesn't make any sense.

For now TextImageEmbeddingProcessor's validation is not working because it's validating the reserved keywords instead of actual keys, so I'm open to either merge this PR(as this indeed can fix the bug) or change it to add actual validation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, I will merge as is. We can add the validation in future

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@weijia-aws Could you create a github issue for those two cases above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create one Github issue for the second case #1239, there's no issue with the first one

@heemin32
Copy link
Collaborator

Please update changelog.

Signed-off-by: Weijia Zhao <[email protected]>
@martin-gaievski martin-gaievski changed the title Remove validation on text and image field for text_image_embedding pr… Remove validation on text and image field for text_image_embedding processor Mar 18, 2025
@martin-gaievski
Copy link
Member

Can you address these two items:

  • add test that will trigger that failure scenario and incorrectly fail with the code we do have today, without this PR
  • add example of request response for this scenario to PR description, this will ease the understanding of your intension

@weijia-aws
Copy link
Contributor Author

add test that will trigger that failure scenario and incorrectly fail with the code we do have today, without this PR

I don't quick follow. Why do we want to add the test, we already removed the validation in this PR. If you meant to validate the scenario through test, then it's already done, see the unit test here https://github.com/opensearch-project/neural-search/pull/1230/files#diff-c429eb1a1fb2495d60ee8f799c59893de5e77ae1e9273204113c57ab828f7a1eR188, if I keep the validation and run this test, the test will fail with the exact error message.

add example of request response for this scenario to PR description, this will ease the understanding of your intension

Done

@heemin32 heemin32 added backport 2.x Label will add auto workflow to backport PR to 2.x branch backport 2.19 labels Mar 20, 2025
@junqiu-lei junqiu-lei merged commit 8506daa into opensearch-project:main Mar 20, 2025
52 of 56 checks passed
@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-1230-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 8506daa6b62fcc2085f701685779828a22a4bd66
# Push it to GitHub
git push --set-upstream origin backport/backport-1230-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-1230-to-2.x.

@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.19 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.19 2.19
# Navigate to the new working tree
cd .worktrees/backport-2.19
# Create a new branch
git switch --create backport/backport-1230-to-2.19
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 8506daa6b62fcc2085f701685779828a22a4bd66
# Push it to GitHub
git push --set-upstream origin backport/backport-1230-to-2.19
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.19

Then, create a pull request where the base branch is 2.19 and the compare/head branch is backport/backport-1230-to-2.19.

weijia-aws added a commit to weijia-aws/neural-search that referenced this pull request Mar 20, 2025
…ocessor (opensearch-project#1230)

* Remove validation on text and image field for text_image_embedding processor

Signed-off-by: Weijia Zhao <[email protected]>

* Add Changelog

Signed-off-by: Weijia Zhao <[email protected]>

---------

Signed-off-by: Weijia Zhao <[email protected]>
(cherry picked from commit 8506daa)
weijia-aws added a commit to weijia-aws/neural-search that referenced this pull request Mar 20, 2025
…ocessor (opensearch-project#1230)

* Remove validation on text and image field for text_image_embedding processor

Signed-off-by: Weijia Zhao <[email protected]>

* Add Changelog

Signed-off-by: Weijia Zhao <[email protected]>

---------

Signed-off-by: Weijia Zhao <[email protected]>
(cherry picked from commit 8506daa)
junqiu-lei pushed a commit that referenced this pull request Mar 20, 2025
…ocessor (#1230) (#1240)

* Remove validation on text and image field for text_image_embedding processor

Signed-off-by: Weijia Zhao <[email protected]>

* Add Changelog

Signed-off-by: Weijia Zhao <[email protected]>

---------

Signed-off-by: Weijia Zhao <[email protected]>
(cherry picked from commit 8506daa)
junqiu-lei pushed a commit that referenced this pull request Mar 20, 2025
…ocessor (#1230) (#1241)

* Remove validation on text and image field for text_image_embedding processor

Signed-off-by: Weijia Zhao <[email protected]>

* Add Changelog

Signed-off-by: Weijia Zhao <[email protected]>

---------

Signed-off-by: Weijia Zhao <[email protected]>
(cherry picked from commit 8506daa)
@weijia-aws weijia-aws deleted the bug-fix branch March 20, 2025 21:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Label will add auto workflow to backport PR to 2.x branch backport 2.19 bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Remove validation on text/image field for text_image_embedding processor
7 participants