Add semantic field mapper. #1225

Open: wants to merge 1 commit into base: main
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -5,6 +5,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),

## [Unreleased 3.x](https://github.com/opensearch-project/neural-search/compare/main...HEAD)
### Features
- Support semantic field type to simplify neural search setup ([#1225](https://github.com/opensearch-project/neural-search/pull/1225))
- Lower bound for min-max normalization technique in hybrid query ([#1195](https://github.com/opensearch-project/neural-search/pull/1195))
- Support filter function for HybridQueryBuilder and NeuralQueryBuilder ([#1206](https://github.com/opensearch-project/neural-search/pull/1206))
### Enhancements
24 changes: 12 additions & 12 deletions DEVELOPER_GUIDE.md
@@ -351,9 +351,9 @@ through the same build issue.

### Class and package names

Class names should use `CamelCase`.

Try to put new classes into existing packages if package name abstracts the purpose of the class.

Example of good class file name and package utilization:

@@ -371,7 +371,7 @@ methods rather than a long single one and does everything.
### Documentation

Document your code. That includes the purpose of new classes, every public method, and code sections that have critical or non-trivial
logic (check this example https://github.com/opensearch-project/neural-search/blob/main/src/main/java/org/opensearch/neuralsearch/query/NeuralQueryBuilder.java#L238).

When you submit a feature PR, please submit a new
[documentation issue](https://github.com/opensearch-project/documentation-website/issues/new/choose). This is a path for the documentation to be published as part of https://opensearch.org/docs/latest/ documentation site.
@@ -384,17 +384,17 @@ For the most part, we're using common conventions for Java projects. Here are a

1. Use descriptive names for classes, methods, fields, and variables.
2. Avoid abbreviations unless they are widely accepted
3. Use `final` on all method arguments unless it's absolutely necessary
4. Wildcard imports are not allowed.
5. Static imports are preferred over qualified imports when using static methods
6. Prefer creating non-static public methods whenever possible. Avoid static methods in general, as they can often serve as shortcuts.
Static methods are acceptable if they are private and do not access class state.
7. Use functional programming style inside methods unless it's a performance critical section.
8. For parameters of lambda expressions, please use meaningful names instead of shortened, cryptic ones.
9. Use Optional for return values if the value may not be present. This should be preferred to returning null.
10. Do not create checked exceptions, and do not throw checked exceptions from public methods whenever possible. In general, if you call a method with a checked exception, you should wrap that exception into an unchecked exception.
11. Throwing checked exceptions from private methods is acceptable.
12. Use String.format when a string includes parameters, and prefer this over direct string concatenation. Always specify a Locale with String.format;
as a rule of thumb, use Locale.ROOT.
13. Prefer Lombok annotations to the manually written boilerplate code
14. When throwing an exception, avoid including user-provided content in the exception message. For secure coding practices,
@@ -440,17 +440,17 @@ Fix any new warnings before submitting your PR to ensure proper code documentati

### Tests

Write unit and integration tests for your new functionality.

Unit tests are preferred as they are cheap and fast, try to use them to cover all possible
combinations of parameters. Utilize mocks to mimic dependencies.

Integration tests should be used sparingly, focusing primarily on the main (happy path) scenario or cases where extensive
mocking is impractical. Include one or two unhappy paths to confirm that correct response codes are returned to the user.
Whenever possible, favor scenarios that do not require model deployment. If model deployment is necessary, use an existing
model, as tests involving new model deployments are the most resource-intensive.

If your changes could affect backward compatibility, please include relevant backward compatibility tests along with your
PR. For guidance on adding these tests, refer to the [Backwards Compatibility Testing](#backwards-compatibility-testing) section in this guide.

### Outdated or irrelevant code
1 change: 1 addition & 0 deletions build.gradle
@@ -250,6 +250,7 @@ def knnJarDirectory = "$buildDir/dependencies/opensearch-knn"

dependencies {
api "org.opensearch:opensearch:${opensearch_version}"
implementation group: 'org.opensearch.plugin', name:'mapper-extras-client', version: "${opensearch_version}"
zipArchive group: 'org.opensearch.plugin', name:'opensearch-job-scheduler', version: "${opensearch_build}"
zipArchive group: 'org.opensearch.plugin', name:'opensearch-knn', version: "${opensearch_build}"
zipArchive group: 'org.opensearch.plugin', name:'opensearch-ml-plugin', version: "${opensearch_build}"
@@ -0,0 +1,15 @@
/*
 * Copyright OpenSearch Contributors
 * SPDX-License-Identifier: Apache-2.0
 */
package org.opensearch.neuralsearch.constants;

/**
 * Constants related to the index mapping.
 */
public class MappingConstants {
    /**
     * Name for the field type. In index mapping we use this key to define the field type.
     */
    public static final String TYPE = "type";
}
@@ -0,0 +1,33 @@
/*
 * Copyright OpenSearch Contributors
 * SPDX-License-Identifier: Apache-2.0
 */
package org.opensearch.neuralsearch.constants;

/**
 * Constants for the semantic field.
 */
public class SemanticFieldConstants {
    /**
     * Name of the model id parameter. We use this key to define the id of the ML model that we will use for the
     * semantic field.
     */
    public static final String MODEL_ID = "model_id";

    /**
     * Name of the search model id parameter. We use this key to define the id of the ML model that we will use to
     * run inference on the query text during search. If this parameter is not defined, we will use the model_id instead.
     */
    public static final String SEARCH_MODEL_ID = "search_model_id";

    /**
     * Name of the raw field type parameter. We use this key to define the field type for the raw data. It will control
     * how to store and query the raw data.
     */
    public static final String RAW_FIELD_TYPE = "raw_field_type";

    /**
     * Name of the semantic info field name parameter. We use this key to define a custom field name for the semantic info.
     */
    public static final String SEMANTIC_INFO_FIELD_NAME = "semantic_info_field_name";
}
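
For reference, here is a minimal sketch of how these keys might sit together in the config of a single `semantic` field. Only the key names and the `semantic` type name come from this PR. The class `SemanticMappingSketch`, the field name `passage`, and the model id value are hypothetical, and the map only mirrors the JSON a user would put under `properties`; it is not an API of the plugin.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SemanticMappingSketch {
    // Key names copied from MappingConstants and SemanticFieldConstants in this PR.
    static final String TYPE = "type";
    static final String MODEL_ID = "model_id";
    static final String RAW_FIELD_TYPE = "raw_field_type";

    /** Builds the config of one semantic field, as it would appear under "properties". */
    public static Map<String, Object> semanticFieldConfig(final String modelId, final String rawFieldType) {
        final Map<String, Object> config = new LinkedHashMap<>();
        config.put(TYPE, "semantic");
        config.put(MODEL_ID, modelId);
        config.put(RAW_FIELD_TYPE, rawFieldType);
        return config;
    }

    public static void main(String[] args) {
        // "passage" and "my-model-id" are placeholder values for illustration only.
        final Map<String, Object> properties = Map.of("passage", semanticFieldConfig("my-model-id", "text"));
        System.out.println(Map.of("properties", properties));
    }
}
```

Per the Javadoc above, `search_model_id` and `semantic_info_field_name` are optional additions to the same map.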
@@ -0,0 +1,267 @@
/*
 * Copyright OpenSearch Contributors
 * SPDX-License-Identifier: Apache-2.0
 */
package org.opensearch.neuralsearch.mapper;

import lombok.Getter;
import lombok.Setter;
import org.opensearch.core.xcontent.XContentBuilder;
import org.opensearch.index.mapper.BinaryFieldMapper;
import org.opensearch.index.mapper.KeywordFieldMapper;
import org.opensearch.index.mapper.MappedFieldType;
import org.opensearch.index.mapper.Mapper;
import org.opensearch.index.mapper.MapperParsingException;
import org.opensearch.index.mapper.MatchOnlyTextFieldMapper;
import org.opensearch.index.mapper.ParametrizedFieldMapper;
import org.opensearch.index.mapper.ParseContext;
import org.opensearch.index.mapper.TextFieldMapper;
import org.opensearch.index.mapper.TokenCountFieldMapper;
import org.opensearch.index.mapper.WildcardFieldMapper;
import org.opensearch.neuralsearch.constants.MappingConstants;
import org.opensearch.neuralsearch.mapper.semanticfieldtypes.SemanticFieldTypeFactory;
import org.opensearch.neuralsearch.mapper.semanticfieldtypes.SemanticParameters;

import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

import static org.opensearch.neuralsearch.constants.SemanticFieldConstants.MODEL_ID;
import static org.opensearch.neuralsearch.constants.SemanticFieldConstants.RAW_FIELD_TYPE;
import static org.opensearch.neuralsearch.constants.SemanticFieldConstants.SEARCH_MODEL_ID;
import static org.opensearch.neuralsearch.constants.SemanticFieldConstants.SEMANTIC_INFO_FIELD_NAME;

/**
 * FieldMapper for the semantic field. It will hold a delegate field mapper to delegate the data parsing and query work
 * based on the raw_field_type.
 */
public class SemanticFieldMapper extends ParametrizedFieldMapper {
    public static final String CONTENT_TYPE = "semantic";
    private final SemanticParameters semanticParameters;

    @Setter
    @Getter
    private ParametrizedFieldMapper delegateFieldMapper;

    protected SemanticFieldMapper(
        String simpleName,
        MappedFieldType mappedFieldType,
        MultiFields multiFields,
        CopyTo copyTo,
        ParametrizedFieldMapper delegateFieldMapper,
        SemanticParameters semanticParameters
    ) {
        super(simpleName, mappedFieldType, multiFields, copyTo);
        this.delegateFieldMapper = delegateFieldMapper;
        this.semanticParameters = semanticParameters;
    }

    @Override
    public Builder getMergeBuilder() {
        Builder semanticFieldMapperBuilder = (Builder) new Builder(simpleName(), SemanticFieldTypeFactory.getInstance()).init(this);
        ParametrizedFieldMapper.Builder delegateBuilder = delegateFieldMapper.getMergeBuilder();
        semanticFieldMapperBuilder.setDelegateBuilder(delegateBuilder);
        return semanticFieldMapperBuilder;
    }

    @Override
    public final ParametrizedFieldMapper merge(Mapper mergeWith) {
        if (mergeWith instanceof SemanticFieldMapper) {
            try {
                delegateFieldMapper = delegateFieldMapper.merge(((SemanticFieldMapper) mergeWith).delegateFieldMapper);
            } catch (IllegalArgumentException e) {
                String err = "Failed to update the mapper ["
                    + this.name()
                    + "] because failed to update the delegate "
                    + "mapper for the raw_field_type "
                    + this.semanticParameters.getRawFieldType()
                    + ". "
                    + e.getMessage();
                throw new IllegalArgumentException(err, e);
            }
        }
        return super.merge(mergeWith);
    }

    @Override
    protected void parseCreateField(ParseContext context) throws IOException {
        delegateFieldMapper.parse(context);
    }

    @Override
    protected String contentType() {
        return CONTENT_TYPE;
    }

    public static class Builder extends ParametrizedFieldMapper.Builder {
        @Getter
        protected final Parameter<String> modelId = Parameter.stringParam(
            MODEL_ID,
            true,
            m -> ((SemanticFieldMapper) m).semanticParameters.getModelId(),
            null
        );
        @Getter
        protected final Parameter<String> searchModelId = Parameter.stringParam(
            SEARCH_MODEL_ID,
            true,
            m -> ((SemanticFieldMapper) m).semanticParameters.getSearchModelId(),
            null
        );
        @Getter
        protected final Parameter<String> rawFieldType = Parameter.stringParam(
            RAW_FIELD_TYPE,
            false,
            m -> ((SemanticFieldMapper) m).semanticParameters.getRawFieldType(),
            TextFieldMapper.CONTENT_TYPE
        );
        @Getter
        protected final Parameter<String> semanticInfoFieldName = Parameter.stringParam(
            SEMANTIC_INFO_FIELD_NAME,
            false,
            m -> ((SemanticFieldMapper) m).semanticParameters.getSemanticInfoFieldName(),
            null
        );

        @Setter
        protected ParametrizedFieldMapper.Builder delegateBuilder;
        private final SemanticFieldTypeFactory semanticFieldTypeFactory;

        protected Builder(String name, SemanticFieldTypeFactory semanticFieldTypeFactory) {
            super(name);
            this.semanticFieldTypeFactory = semanticFieldTypeFactory;
        }

        @Override
        protected List<Parameter<?>> getParameters() {
            return List.of(modelId, searchModelId, rawFieldType, semanticInfoFieldName);
        }

        @Override
        public SemanticFieldMapper build(BuilderContext context) {
            final ParametrizedFieldMapper delegateMapper = delegateBuilder.build(context);

            final SemanticParameters semanticParameters = this.getSemanticParameters();
            final MappedFieldType semanticFieldType = semanticFieldTypeFactory.createSemanticFieldType(
                delegateMapper,
                rawFieldType.getValue(),
                semanticParameters
            );

            return new SemanticFieldMapper(
                name,
                semanticFieldType,
                multiFieldsBuilder.build(this, context),
                copyTo.build(),
                delegateMapper,
                semanticParameters
            );
        }

        public SemanticParameters getSemanticParameters() {
            return new SemanticParameters(
                modelId.getValue(),
                searchModelId.getValue(),
                rawFieldType.getValue(),
                semanticInfoFieldName.getValue()
            );
        }
    }

    public static class TypeParser implements Mapper.TypeParser {

        private static final Set<String> SUPPORTED_RAW_FIELD_TYPE = Set.of(
            TextFieldMapper.CONTENT_TYPE,
            KeywordFieldMapper.CONTENT_TYPE,
            MatchOnlyTextFieldMapper.CONTENT_TYPE,
            WildcardFieldMapper.CONTENT_TYPE,
            TokenCountFieldMapper.CONTENT_TYPE,
            BinaryFieldMapper.CONTENT_TYPE
        );

        @Override
        public Builder parse(String name, Map<String, Object> node, ParserContext parserContext) throws MapperParsingException {
            final String rawFieldType = (String) node.getOrDefault(RAW_FIELD_TYPE, TextFieldMapper.CONTENT_TYPE);

            validateRawFieldType(rawFieldType);

            final ParametrizedFieldMapper.TypeParser typeParser = (ParametrizedFieldMapper.TypeParser) parserContext.typeParser(
                rawFieldType
            );
            final Builder semanticFieldMapperBuilder = new Builder(name, SemanticFieldTypeFactory.getInstance());

            // The semantic field mapper builder parses the semantic parameters.
            Map<String, Object> semanticConfig = extractSemanticConfig(node, semanticFieldMapperBuilder.getParameters(), rawFieldType);
            semanticFieldMapperBuilder.parse(name, parserContext, semanticConfig);

            // The delegate field mapper builder parses the remaining parameters.
            ParametrizedFieldMapper.Builder delegateBuilder = typeParser.parse(name, node, parserContext);
            semanticFieldMapperBuilder.setDelegateBuilder(delegateBuilder);

            return semanticFieldMapperBuilder;
        }

        private void validateRawFieldType(final String rawFieldType) {
            if (rawFieldType == null || !SUPPORTED_RAW_FIELD_TYPE.contains(rawFieldType)) {
                throw new IllegalArgumentException(
                    RAW_FIELD_TYPE
                        + ": ["
                        + rawFieldType
                        + "] is not supported. It "
                        + "should be one of ["
                        + String.join(", ", SUPPORTED_RAW_FIELD_TYPE)
                        + "]"
                );
            }
        }

        /**
         * Extracts all the parameters defined in the semantic field mapper builder so that the builder can parse
         * them later. The remaining parameters will be processed by the type parser of the raw field type. We cannot
         * pass the parameters defined by the semantic field to the delegate type parser of the raw field type because
         * it does not recognize them. Note that this method mutates the given node: the semantic parameters are
         * removed from it and its type is replaced with the raw field type.
         * @param node field config
         * @param parameters parameters for semantic field
         * @param rawFieldType field type of the raw data
         * @return semantic field config
         */
        private Map<String, Object> extractSemanticConfig(Map<String, Object> node, List<Parameter<?>> parameters, String rawFieldType) {
            final Map<String, Object> semanticConfig = new HashMap<>();
            for (Parameter<?> parameter : parameters) {
                Object config = node.get(parameter.name);
                if (config != null) {
                    semanticConfig.put(parameter.name, config);
                    node.remove(parameter.name);
                }
            }
            semanticConfig.put(MappingConstants.TYPE, SemanticFieldMapper.CONTENT_TYPE);
            node.put(MappingConstants.TYPE, rawFieldType);
            return semanticConfig;
        }
    }

    @Override
    protected void doXContentBody(XContentBuilder builder, boolean includeDefaults, Params params) throws IOException {
        builder.field(MappingConstants.TYPE, contentType());

        // semantic parameters
        final List<Parameter<?>> parameters = getMergeBuilder().getParameters();
        for (Parameter<?> parameter : parameters) {
            // By default, we do not return default values. But raw_field_type is useful info that lets users know how
            // we will handle the raw data, so we explicitly return it even if it is using the default value.
            if (RAW_FIELD_TYPE.equals(parameter.name)) {
                parameter.toXContent(builder, true);
            } else {
                parameter.toXContent(builder, includeDefaults);
            }
        }

        // non-semantic parameters
        // The semantic field mapper itself does not handle multi fields or copy to. The delegate field mapper handles them.
        delegateFieldMapper.multiFields().toXContent(builder, params);
        delegateFieldMapper.copyTo().toXContent(builder, params);
        delegateFieldMapper.getMergeBuilder().toXContent(builder, includeDefaults);
    }
}
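
To make the config split in `TypeParser.parse` easier to follow, here is a standalone sketch of the same idea using plain maps, with no OpenSearch classes involved. The class `SemanticConfigSplitSketch` and the sample values are hypothetical. The raw field type strings are my reading of the `CONTENT_TYPE` constants referenced in this PR and should be treated as assumptions. It is an illustration of the technique, not the mapper's actual code path.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class SemanticConfigSplitSketch {
    // Parameter keys from SemanticFieldConstants in this PR.
    static final List<String> SEMANTIC_KEYS = List.of("model_id", "search_model_id", "raw_field_type", "semantic_info_field_name");
    // Assumed string values of the supported CONTENT_TYPE constants.
    static final Set<String> SUPPORTED_RAW_FIELD_TYPES = Set.of("text", "keyword", "match_only_text", "wildcard", "token_count", "binary");

    /**
     * Moves the semantic parameters out of node into a new map. The node is mutated so that
     * what is left describes a plain field of the raw field type.
     */
    public static Map<String, Object> extractSemanticConfig(final Map<String, Object> node) {
        final String rawFieldType = (String) node.getOrDefault("raw_field_type", "text");
        // Null check first: an immutable Set.of(...) throws on contains(null).
        if (rawFieldType == null || !SUPPORTED_RAW_FIELD_TYPES.contains(rawFieldType)) {
            throw new IllegalArgumentException(
                String.format(Locale.ROOT, "raw_field_type: [%s] is not supported.", rawFieldType)
            );
        }
        final Map<String, Object> semanticConfig = new HashMap<>();
        for (final String key : SEMANTIC_KEYS) {
            final Object value = node.remove(key);
            if (value != null) {
                semanticConfig.put(key, value);
            }
        }
        semanticConfig.put("type", "semantic");
        node.put("type", rawFieldType);
        return semanticConfig;
    }

    public static void main(String[] args) {
        // "my-model-id" and the analyzer setting are placeholder values.
        final Map<String, Object> node = new HashMap<>(
            Map.of("type", "semantic", "model_id", "my-model-id", "analyzer", "standard")
        );
        final Map<String, Object> semanticConfig = extractSemanticConfig(node);
        System.out.println("semantic builder sees: " + semanticConfig);
        System.out.println("delegate parser sees:  " + node);
    }
}
```

After the call, the semantic map carries `type: semantic` plus the semantic parameters, while the original node keeps only the settings the raw field type understands (here `analyzer`) with its `type` rewritten to `text`.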