Skip to content

[Pluggable Dataformat] Adding support for FieldTypeCapabilities#21733

Open
darjisagar7 wants to merge 2 commits into
opensearch-project:mainfrom
darjisagar7:FTC
Open

[Pluggable Dataformat] Adding support for FieldTypeCapabilities#21733
darjisagar7 wants to merge 2 commits into
opensearch-project:mainfrom
darjisagar7:FTC

Conversation

@darjisagar7
Copy link
Copy Markdown
Contributor

Description

[Describe what this change achieves]

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Sagar Darji <darsaga@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 19, 2026

PR Reviewer Guide 🔍

(Review updated until commit 022ce52)

Here are some key observations to aid the review process:

🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Missing License Header

The file is missing the Apache 2.0 license header that is present in all other files in this PR. This violates the project's licensing requirements.

package org.opensearch.parquet.fields.core.metadata;
Logic Inversion Bug

The condition checks if type equals NESTED_CONTENT_TYPE AND pluggable data format is enabled, then sets nested = true. However, the original code set nested = true when type equals NESTED_CONTENT_TYPE unconditionally. The new logic inverts this: it only sets nested = true when the feature is enabled, but the subsequent code expects nested to be set whenever the type is NESTED_CONTENT_TYPE. This breaks nested object handling when pluggable data format is disabled.

} else if (type.equals(NESTED_CONTENT_TYPE) && isPluggableDataFormatEnabled(parserContext.getSettings())) {
    nested = true;
Null Dereference Risk

If fieldType.getCapabilityMap() returns null (though the API suggests it should not), the code will throw NullPointerException at line 100 when calling capMap.isEmpty(). While the current implementation may guarantee non-null, defensive coding would check for null before calling methods on capMap.

Map<DataFormat, Set<FieldTypeCapabilities.Capability>> capMap = fieldType.getCapabilityMap();

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 19, 2026

PR Code Suggestions ✨

Latest suggestions up to 022ce52
Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Restore nested type parsing compatibility

The nested type parsing is now gated behind isPluggableDataFormatEnabled, but the
original logic allowed nested types unconditionally. This change breaks backward
compatibility for indices created before the pluggable data format feature. Remove
the feature flag check to preserve existing behavior.

server/src/main/java/org/opensearch/index/mapper/ObjectMapper.java [420-422]

 if (type.equals(CONTENT_TYPE)) {
     builder.nested = Nested.NO;
-} else if (type.equals(NESTED_CONTENT_TYPE) && isPluggableDataFormatEnabled(parserContext.getSettings())) {
+} else if (type.equals(NESTED_CONTENT_TYPE)) {
     nested = true;
 }
Suggestion importance[1-10]: 9

__

Why: The change gates NESTED_CONTENT_TYPE parsing behind isPluggableDataFormatEnabled, which breaks backward compatibility for existing indices. This is a critical issue that could prevent nested documents from being parsed correctly when the feature flag is disabled, affecting production workloads.

High
Prevent indexing failure for metadata field

The IndexParquetField throws an exception when addToGroup is called, but this method
is invoked during document indexing. If the _index field is registered and reaches
this code path, indexing will fail. Consider either implementing proper handling or
filtering out this field type earlier in the pipeline to prevent runtime failures.

sandbox/plugins/parquet-data-format/src/main/java/org/opensearch/parquet/fields/core/metadata/IndexParquetField.java [14-17]

 @Override
 protected void addToGroup(MappedFieldType fieldType, ManagedVSR managedVSR, Object parseValue) {
-    throw new IllegalStateException("Index field is not supposed to be added to group");
+    // _index field is metadata-only and should not be stored in Parquet
+    // Silently skip to avoid indexing failures
 }
Suggestion importance[1-10]: 7

__

Why: The addToGroup method throws an IllegalStateException, which could cause indexing failures if the _index field reaches this code path. The suggestion to silently skip is valid, though the PR may intentionally prevent this field from being indexed. The score reflects that this is a legitimate concern about potential runtime failures.

Medium
General
Validate format configuration before assignment

When configuredFormats is empty but the field requests capabilities, the method
silently sets an empty capability map. This causes fields to be skipped during
indexing without clear indication of misconfiguration. Throw an exception or log a
warning when required capabilities cannot be satisfied due to missing format
configuration.

server/src/main/java/org/opensearch/index/engine/dataformat/FieldCapabilityAssigner.java [44-55]

 public void assign(MappedFieldType fieldType) {
     Set<FieldTypeCapabilities.Capability> requested = fieldType.requestedCapabilities();
 
     if (requested.isEmpty()) {
         fieldType.setCapabilityMap(Map.of());
         return;
     }
 
     if (configuredFormats.isEmpty()) {
-        fieldType.setCapabilityMap(Map.of());
-        return;
+        throw new MapperParsingException(
+            "Field [" + fieldType.name() + "] requires capabilities " + requested +
+            " but no data formats are configured for this index"
+        );
     }
Suggestion importance[1-10]: 8

__

Why: When configuredFormats is empty but the field requests capabilities, silently setting an empty map causes fields to be skipped during indexing without clear error messages. Throwing an exception would provide better feedback about misconfiguration. This is a significant improvement for error handling and user experience.

Medium
Add debug logging for skipped fields

When the capability map is empty, the field is silently skipped. However, this could
hide configuration errors where a field type should be indexed but isn't due to
missing capability declarations. Consider logging a warning when skipping fields
with empty capability maps to aid debugging.

sandbox/plugins/analytics-backend-lucene/src/main/java/org/opensearch/be/lucene/index/LuceneDocumentInput.java [100-104]

 Map<DataFormat, Set<FieldTypeCapabilities.Capability>> capMap = fieldType.getCapabilityMap();
 if (capMap.isEmpty()) {
     // No capability map set — no format declared support for this field type, skip it
+    logger.debug("Skipping field [{}] with type [{}]: no format declared support", fieldType.name(), fieldType.typeName());
     return;
 }
Suggestion importance[1-10]: 5

__

Why: Adding debug logging for skipped fields would improve observability and help with troubleshooting configuration issues. However, this is a minor enhancement that doesn't address a functional bug, and the silent skip behavior may be intentional for the per-format filtering design.

Low

Previous suggestions

Suggestions up to commit 3bde29b
CategorySuggestion                                                                                                                                    Impact
General
Document capability assignment order

The capability assignment for metadata fields happens after the recursive walk of
the root mapper. If metadata mappers contain nested structures or if the order of
assignment matters for validation, this could lead to inconsistent state. Consider
documenting why metadata fields are assigned separately or verify that the order is
intentional.

server/src/main/java/org/opensearch/index/mapper/DocumentMapper.java [126-136]

 if (dataFormatRegistry != null) {
+    // Assign capabilities to all non-metadata fields first
     assignCapabilitiesRecursive(rootObjectMapper, builderContext);
+    // Then assign to metadata fields - order matters for validation consistency
     for (MetadataFieldMapper metadataMapper : metadataMappers.values()) {
         builderContext.assignCapabilities(metadataMapper.fieldType());
     }
 }
Suggestion importance[1-10]: 5

__

Why: Adding a comment to clarify why metadata fields are assigned separately improves code maintainability and helps future developers understand the intentional ordering, though the code itself is functionally correct.

Low
Remove redundant empty set check

The condition ownedCaps.isEmpty() == false is redundant because
capMap.get(owningFormat) returns null when the format doesn't own the field. The
check should verify ownedCaps != null only, as an empty set would indicate the
format owns zero capabilities (which shouldn't occur in practice but would still be
a valid ownership state).

sandbox/plugins/parquet-data-format/src/main/java/org/opensearch/parquet/writer/ParquetDocumentInput.java [56-65]

 Map<DataFormat, Set<FieldTypeCapabilities.Capability>> capMap = fieldType.getCapabilityMap();
 if (capMap.isEmpty()) {
     // No capability map set — no format declared support for this field type, skip it
     return;
 }
 
 Set<FieldTypeCapabilities.Capability> ownedCaps = capMap.get(owningFormat);
-if (ownedCaps != null && ownedCaps.isEmpty() == false) {
+if (ownedCaps != null) {
     collectedFields.add(new FieldValuePair(fieldType, value));
 }
Suggestion importance[1-10]: 4

__

Why: The suggestion correctly identifies that checking ownedCaps.isEmpty() == false is redundant when ownedCaps != null already filters out non-ownership. However, the impact is minimal as the current code is functionally correct, just slightly verbose.

Low
Use immutable list collector

The error message construction uses Collectors.toList() which creates a mutable list
unnecessarily. For error messages, use Collectors.toUnmodifiableList() or collect to
a string directly to avoid potential modification issues and improve clarity.

server/src/main/java/org/opensearch/index/engine/dataformat/FieldCapabilityAssigner.java [72-85]

 if (remaining.isEmpty() == false) {
     throw new MapperParsingException(
         "Field ["
             + fieldType.name()
             + "] of type ["
             + typeName
             + "] requires capabilities "
             + requested
             + " but configured data formats cannot collectively cover them. Unsupported capabilities: "
             + remaining
             + ". Configured formats: "
-            + configuredFormats.stream().map(DataFormat::name).collect(Collectors.toList())
+            + configuredFormats.stream().map(DataFormat::name).collect(Collectors.toUnmodifiableList())
     );
 }
Suggestion importance[1-10]: 3

__

Why: Using Collectors.toUnmodifiableList() in error message construction is a minor improvement for immutability, but the impact is negligible since the list is only used for display and immediately discarded.

Low
Possible issue
Fix empty capability set handling

The logic incorrectly treats an empty capability set as "another format owns this
field". An empty set returned from capMap.get(owningFormat) means this format owns
the field but with zero capabilities, which is semantically different from null
(format doesn't own the field at all). The check should only verify ownedCaps ==
null.

sandbox/plugins/analytics-backend-lucene/src/main/java/org/opensearch/be/lucene/index/LuceneDocumentInput.java [100-109]

 Map<DataFormat, Set<FieldTypeCapabilities.Capability>> capMap = fieldType.getCapabilityMap();
 if (capMap.isEmpty()) {
     // No capability map set — no format declared support for this field type, skip it
     return;
 }
 Set<FieldTypeCapabilities.Capability> ownedCaps = capMap.get(owningFormat);
-if (ownedCaps == null || ownedCaps.isEmpty()) {
+if (ownedCaps == null) {
     // Another format owns this field — silently skip
     return;
 }
Suggestion importance[1-10]: 4

__

Why: The suggestion raises a valid semantic point about distinguishing null (format doesn't own field) from empty set (format owns field with zero capabilities). However, in practice, an empty capability set for an owned field is unlikely and the current defensive check is reasonable.

Low

* @param fieldFactoryRegistry the registry to use for field creation
*/
public LuceneDocumentInput(LuceneFieldFactoryRegistry fieldFactoryRegistry) {
public LuceneDocumentInput(DataFormat owningFormat, LuceneFieldFactoryRegistry fieldFactoryRegistry) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can hardcode to Lucene format here

if (ownedCaps == null || ownedCaps.isEmpty()) {
// Another format owns this field — silently skip
return;
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets add tests for this

Comment on lines +56 to +77
new FieldTypeCapabilities(IgnoredFieldMapper.CONTENT_TYPE, Set.of(FieldTypeCapabilities.Capability.STORED_FIELDS,
FieldTypeCapabilities.Capability.FULL_TEXT_SEARCH)),
new FieldTypeCapabilities(IdFieldMapper.CONTENT_TYPE, Set.of(FieldTypeCapabilities.Capability.STORED_FIELDS,
FieldTypeCapabilities.Capability.FULL_TEXT_SEARCH)),
new FieldTypeCapabilities(RoutingFieldMapper.CONTENT_TYPE, Set.of(FieldTypeCapabilities.Capability.STORED_FIELDS,
FieldTypeCapabilities.Capability.FULL_TEXT_SEARCH)),
new FieldTypeCapabilities(SourceFieldMapper.CONTENT_TYPE, Set.of(FieldTypeCapabilities.Capability.STORED_FIELDS,
FieldTypeCapabilities.Capability.FULL_TEXT_SEARCH)),
new FieldTypeCapabilities(SeqNoFieldMapper.CONTENT_TYPE, Set.of(FieldTypeCapabilities.Capability.COLUMNAR_STORAGE,
FieldTypeCapabilities.Capability.POINT_RANGE)),
new FieldTypeCapabilities(IndexFieldMapper.CONTENT_TYPE, Set.of(FieldTypeCapabilities.Capability.COLUMNAR_STORAGE,
FieldTypeCapabilities.Capability.FULL_TEXT_SEARCH)),
new FieldTypeCapabilities(NestedPathFieldMapper.NAME, Set.of(FieldTypeCapabilities.Capability.STORED_FIELDS,
FieldTypeCapabilities.Capability.FULL_TEXT_SEARCH)),
new FieldTypeCapabilities(VersionFieldMapper.CONTENT_TYPE, Set.of(FieldTypeCapabilities.Capability.STORED_FIELDS,
FieldTypeCapabilities.Capability.FULL_TEXT_SEARCH)),
new FieldTypeCapabilities(DocCountFieldMapper.CONTENT_TYPE, Set.of(FieldTypeCapabilities.Capability.STORED_FIELDS,
FieldTypeCapabilities.Capability.FULL_TEXT_SEARCH)),
new FieldTypeCapabilities("_feature", Set.of(FieldTypeCapabilities.Capability.STORED_FIELDS,
FieldTypeCapabilities.Capability.FULL_TEXT_SEARCH)),
new FieldTypeCapabilities(FieldNamesFieldMapper.CONTENT_TYPE, Set.of(FieldTypeCapabilities.Capability.STORED_FIELDS,
FieldTypeCapabilities.Capability.FULL_TEXT_SEARCH))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is this determined?

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for 3bde29b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


@Override
protected FieldTypeCapabilities.Capability searchCapability() {
return FieldTypeCapabilities.Capability.POINT_RANGE;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this even supported?


@Override
protected FieldTypeCapabilities.Capability searchCapability() {
return FieldTypeCapabilities.Capability.POINT_RANGE;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add for PrimaryTermFieldType as well?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same POINT_RANGE right?

@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 022ce52

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for 022ce52: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Sagar Darji <darsaga@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 496920e.

PathLineSeverityDescription
server/src/main/java/org/opensearch/index/mapper/ObjectMapper.java421mediumNested type parsing is now gated behind `isPluggableDataFormatEnabled`. Any OpenSearch index that uses nested field mappings but does NOT have the experimental pluggable data format feature enabled will receive a MapperParsingException when creating or updating mappings. Nested types are a core, widely-used feature; silently breaking them for the majority of users unless an opt-in experimental flag is set is a significant and suspicious regression that is unrelated to the stated purpose of the per-format capability routing feature.
server/src/main/java/org/opensearch/index/engine/exec/PrimaryTermFieldType.java41lowThe `PrimaryTermFieldType` constructor changes `hasDocValues` from `false` to `true`. This modifies the behaviour of an internal metadata field in a way that will cause it to request `COLUMNAR_STORAGE` via `requestedCapabilities()`, potentially routing it into the capability-assignment pipeline and altering how it is stored. The change is unexplained in the diff and could cause unintended side effects in capability assignment for this critical internal field.
sandbox/plugins/parquet-data-format/src/main/java/org/opensearch/parquet/fields/core/metadata/IndexParquetField.java16lowNew `IndexParquetField` class throws `IllegalStateException` from `addToGroup`, which is the core ingestion path. While the per-format capability filter should prevent it from being invoked at runtime, the class is registered as the handler for `IndexFieldMapper.CONTENT_TYPE` in `MetadataFieldPlugin`. If capability map filtering is ever bypassed or misconfigured, this would cause an unhandled exception during document indexing. The absence of any graceful handling or override (getArrowType/getFieldType both return null) makes this pattern unusual and worth review.

The table above displays the top 10 most important findings.

Total: 3 | Critical: 0 | High: 0 | Medium: 1 | Low: 2


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants