Remove validation on text and image field for text_image_embedding processor #1230

Merged: 3 commits merged on Mar 20, 2025 (changes shown from 2 commits).
CHANGELOG.md (1 addition, 0 deletions)
@@ -8,6 +8,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
- Lower bound for min-max normalization technique in hybrid query ([#1195](https://github.com/opensearch-project/neural-search/pull/1195))
### Enhancements
### Bug Fixes
- Remove validations for unmapped fields (text and image) in TextImageEmbeddingProcessor ([#1230](https://github.com/opensearch-project/neural-search/pull/1230))
### Infrastructure
### Documentation
### Maintenance
TextImageEmbeddingProcessor.java
@@ -17,15 +17,13 @@
import org.opensearch.cluster.service.ClusterService;
import org.opensearch.core.action.ActionListener;
import org.opensearch.env.Environment;
import org.opensearch.index.mapper.IndexFieldMapper;
import org.opensearch.ingest.AbstractProcessor;
import org.opensearch.ingest.IngestDocument;
import org.opensearch.neuralsearch.ml.MLCommonsClientAccessor;

import com.google.common.annotations.VisibleForTesting;

import lombok.extern.log4j.Log4j2;
import org.opensearch.neuralsearch.util.ProcessorDocumentUtils;

/**
* This processor is used for user input data text and image embedding processing, model_id can be used to indicate which model user use,
@@ -107,7 +105,6 @@ public IngestDocument execute(IngestDocument ingestDocument) {
@Override
public void execute(final IngestDocument ingestDocument, final BiConsumer<IngestDocument, Exception> handler) {
try {
validateEmbeddingFieldsValue(ingestDocument);
Map<String, String> knnMap = buildMapWithKnnKeyAndOriginalValue(ingestDocument);
Map<String, String> inferenceMap = createInferences(knnMap);
if (inferenceMap.isEmpty()) {
@@ -173,20 +170,6 @@ Map<String, Object> buildTextEmbeddingResult(final String knnKey, List<Number> m
return result;
}

private void validateEmbeddingFieldsValue(final IngestDocument ingestDocument) {
Contributor: We should not remove this method entirely, but pass in true for the allowEmpty field here.

Contributor Author: This method simply validates the values in field_map; however, this field_map contains:

{
  text: "input_text_field",
  image: "input_image_field",
}

It's correct to validate it during processor creation, but not during document ingestion. We can accept whatever value the text or image field has in a document, so there's no point in keeping the validation method.
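The mismatch being discussed can be sketched in plain Java (hypothetical helper names, not code from the repository): the removed validation effectively looked up the field_map keys ("text", "image"), which never appear in documents, while the field_map values are the names documents actually contain.

```java
import java.util.Map;

public class FieldMapLookupSketch {
    // True if every field_map KEY (the reserved keywords "text"/"image") exists in the doc.
    static boolean reservedKeysPresent(Map<String, String> fieldMap, Map<String, Object> doc) {
        return fieldMap.keySet().stream().allMatch(doc::containsKey);
    }

    // True if every field_map VALUE (the actual source fields) exists in the doc.
    static boolean mappedFieldsPresent(Map<String, String> fieldMap, Map<String, Object> doc) {
        return fieldMap.values().stream().allMatch(doc::containsKey);
    }
}
```

With field_map {text: input_text_field, image: input_image_field} and a typical document, reservedKeysPresent is false while mappedFieldsPresent is true, which is why validating against the keys rejected perfectly valid documents.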

Contributor: We have the same validation for the text chunking processor here.

I would propose setting allowEmpty to true and not removing this method.

Contributor Author: The text chunking processor is different; in that processor, the field map is:

"field_map": {
      "<input_field>": "<output_field>"
    }

where input_field is "the name of the field from which to obtain text for generating chunked passages" (quoting the docs), so it's correct to have that validation method (it is also the only validation for that processor).

In the text_image_embedding processor, by contrast, the field map is:

{
  text: "input_text_field",
  image: "input_image_field",
}

where input_text_field and input_image_field are the fields from which to obtain values for embedding.

We should validate input_text_field and input_image_field instead of text and image (which is what that validation method checks). Since we accept any value for the text and image fields, no validation is needed.

Contributor: I think we are using this function incorrectly for the TextImageEmbeddingProcessor. But I think we still want to validate:

  1. The depth of the map should not exceed the limits.
  2. The input_text_field and input_image_field should have string values.

The actual field map we should validate is:
{
"input_text_field": "vector_field",
"input_image_field": "vector_field"
}

And we should unflatten it before the validation to handle the "." properly.
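The unflattening step proposed here can be sketched as follows (a hypothetical standalone helper, assuming dotted keys should expand into nested maps before depth and type validation):

```java
import java.util.HashMap;
import java.util.Map;

public class UnflattenSketch {
    // Expand dotted keys, e.g. {"a.b": v} -> {"a": {"b": v}}, so that depth limits
    // and per-leaf type checks see the real nested structure.
    @SuppressWarnings("unchecked")
    static Map<String, Object> unflatten(Map<String, Object> flat) {
        Map<String, Object> root = new HashMap<>();
        for (Map.Entry<String, Object> entry : flat.entrySet()) {
            String[] parts = entry.getKey().split("\\.");
            Map<String, Object> cursor = root;
            // Walk/create intermediate maps for every segment except the last.
            for (int i = 0; i < parts.length - 1; i++) {
                cursor = (Map<String, Object>) cursor.computeIfAbsent(parts[i], k -> new HashMap<String, Object>());
            }
            cursor.put(parts[parts.length - 1], entry.getValue());
        }
        return root;
    }
}
```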

Contributor Author: On the point that, if there is a value for text or image, it should be a string rather than an object or array:

For the unmapped fields text and image, why would we want to restrict the data type? The value can be anything, including a string or any other object.

If you are talking about the mapped fields input_text_field and input_image_field, we do need to validate the data type, but that's already handled in the buildMapWithKnnKeyAndOriginalValue method.
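The data-type handling referred to for the mapped fields can be sketched like this (hypothetical helper name; in the actual processor this is handled inside buildMapWithKnnKeyAndOriginalValue):

```java
import java.util.Map;

public class MappedFieldTypeSketch {
    // Reject non-string values for a mapped source field; null is tolerated here
    // and left to later null/empty handling.
    static String requireStringValue(Map<String, Object> doc, String field) {
        Object value = doc.get(field);
        if (value != null && !(value instanceof String)) {
            throw new IllegalArgumentException("field [" + field + "] must be a string");
        }
        return (String) value;
    }
}
```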

Collaborator: @weijia-aws For 1, you're correct: it's already defined as Map<String, String>, so we don't need any additional validation for this.
For 2, we do allow ingesting documents whose mapped fields contain empty strings. In fact, allowEmpty was introduced for TextChunkingProcessor because the chunking field might be empty during ingestion, but the value defaults to false in InferenceProcessor, which is extended by TextEmbeddingProcessor and SparseEmbeddingProcessor. The reason we don't allow the field to be empty there is that passing an empty string to generate a dense/sparse embedding doesn't make sense.

For now, TextImageEmbeddingProcessor's validation is not working because it validates the reserved keywords instead of the actual keys, so I'm open to either merging this PR (as it does fix the bug) or changing it to add proper validation.
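The allowEmpty behavior described above can be sketched as follows (hypothetical names, not the actual InferenceProcessor code):

```java
public class AllowEmptySketch {
    // Chunking-style processors pass allowEmpty = true; embedding processors keep
    // the default false, since embedding an empty string is meaningless.
    static void validateFieldValue(String fieldName, Object value, boolean allowEmpty) {
        if (value instanceof String s && s.isEmpty() && !allowEmpty) {
            throw new IllegalArgumentException("field [" + fieldName + "] has empty string value, cannot process it");
        }
    }
}
```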

Contributor Author: For now, I will merge as is. We can add the validation in the future.

Collaborator: @weijia-aws Could you create a GitHub issue for those two cases above?

Contributor Author: Created one GitHub issue for the second case (#1239); there's no issue with the first one.

        Map<String, Object> sourceAndMetadataMap = ingestDocument.getSourceAndMetadata();
        String indexName = sourceAndMetadataMap.get(IndexFieldMapper.NAME).toString();
        ProcessorDocumentUtils.validateMapTypeValue(
            FIELD_MAP_FIELD,
            sourceAndMetadataMap,
            fieldMap,
            indexName,
            clusterService,
            environment,
            false
        );
    }

@Override
public String getType() {
return TYPE;
TextImageEmbeddingProcessorIT.java
@@ -8,6 +8,7 @@
import java.nio.file.Path;

import org.junit.Before;
import org.opensearch.client.ResponseException;
import org.opensearch.neuralsearch.BaseNeuralSearchIT;

/**
@@ -21,19 +22,15 @@ public class TextImageEmbeddingProcessorIT extends BaseNeuralSearchIT {
private static final String INGEST_DOCUMENT = "{\n"
+ " \"title\": \"This is a good day\",\n"
+ " \"description\": \"daily logging\",\n"
+ " \"passage_text\": \"A very nice day today\",\n"
+ " \"passage_text\": \"passage_text_value\",\n"
+ " \"text\": \"\",\n"
+ " \"image\": null,\n"
+ " \"favorites\": {\n"
+ " \"game\": \"overwatch\",\n"
+ " \"movie\": null\n"
+ " }\n"
+ "}\n";

private static final String INGEST_DOCUMENT_UNMAPPED_FIELDS = "{\n"
+ " \"title\": \"This is a good day\",\n"
+ " \"description\": \"daily logging\",\n"
+ " \"some_random_field\": \"Today is a sunny weather\"\n"
+ "}\n";

@Before
public void setUp() throws Exception {
super.setUp();
@@ -49,10 +46,21 @@ public void testEmbeddingProcessor_whenIngestingDocumentWithOrWithoutSourceMatch
ingestDocument(INDEX_NAME, INGEST_DOCUMENT);
assertEquals(1, getDocCount(INDEX_NAME));
// verify doc without mapping
ingestDocument(INDEX_NAME, INGEST_DOCUMENT_UNMAPPED_FIELDS);
String documentWithUnmappedFields;
documentWithUnmappedFields = INGEST_DOCUMENT.replace("passage_text", "random_field_1");
ingestDocument(INDEX_NAME, documentWithUnmappedFields);
assertEquals(2, getDocCount(INDEX_NAME));
}

public void testEmbeddingProcessor_whenIngestingDocumentWithNullMappingValue_thenThrowException() throws Exception {
String modelId = uploadModel();
loadModel(modelId);
createPipelineProcessor(modelId, PIPELINE_NAME, ProcessorType.TEXT_IMAGE_EMBEDDING);
createIndexWithPipeline(INDEX_NAME, "IndexMappings.json", PIPELINE_NAME);

expectThrows(ResponseException.class, () -> ingestDocument(INDEX_NAME, INGEST_DOCUMENT.replace("\"passage_text_value\"", "null")));
}

private String uploadModel() throws Exception {
String requestBody = Files.readString(Path.of(classLoader.getResource("processor/UploadModelRequestBody.json").toURI()));
return registerModelGroupAndUploadModel(requestBody);
TextImageEmbeddingProcessorTests.java
@@ -185,7 +185,9 @@ public void testExecute_successful() {
sourceAndMetadata.put(IndexFieldMapper.NAME, "my_index");
sourceAndMetadata.put("key1", "value1");
sourceAndMetadata.put("my_text_field", "value2");
sourceAndMetadata.put("key3", "value3");
sourceAndMetadata.put("text", "");
sourceAndMetadata.put("image", null);
sourceAndMetadata.put("key5", Map.of("inner_field", "innerValue1"));
sourceAndMetadata.put("image_field", "base64_of_image_1234567890");
IngestDocument ingestDocument = new IngestDocument(sourceAndMetadata, new HashMap<>());
TextImageEmbeddingProcessor processor = createInstance();