Skip to content

Switch derived source from field attributes to segment attribute #2606

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

jmazanec15
Copy link
Member

Description

Follow up on #2583, I am pivoting approach for supporting derived source for vector fields from filter approach to field masking approach. See PoC here: https://github.com/jmazanec15/k-NN-1/tree/mask-derived-poc. The main motivation is to simplify the injection logic. As a first step, this PR switches storing derived source information from the field attributes to the segment attributes.

The main motivation for switching is that: (1) it allows us to simplify mapper logic. We no longer need to add another field attribute to an already bloated set of field attributes. We keep adding info into these field attributes and I dont think we really need to. Its making it somewhat hard to navigate mappers. (2) At the time of setting field attributes, we do not have access to info about nesting scope for the field. If we switch to segment info, we can get the nested scope by using the mapper service and avoid having to perform complex lucene low level operations to get this info ourselves.

The majority of the functional change in this PR can be found in KNN10010DerivedSourceStoredFieldsFormat.

Along with the change, I moved certain components to the bwc package to preserve backwards compatibility. Additionally, I added several bwc tests. This will help setup for moving towards masking approach to confirm bwc.

Next Steps

  1. Migration to mask from filter. I did not want to do this as a part of this PR because it will be fairly complex to handle bwc.
  2. Additional optimizations around the lucene utils we have left.
  3. Testing to determine defaults for 3.0.

Check List

  • New functionality includes testing.
  • Commits are signed per the DCO using --signoff.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@jmazanec15 jmazanec15 added Refactoring Improve the design, structure, and implementation while preserving its functionality skip-changelog labels Mar 14, 2025
Switches derived source from using field attributes to using segment
attributes to track removed vector fields.The main motivation for this
change is that: (1) it allows us to simplify mapper logic. We no longer
need to add another field attribute to an already bloated set of field
attributes. (2) At the time of setting field attributes, we do not have
access to info about nesting scope for the field. If we switch to
segment info, we can get the nested scope and avoid having to perform
complex lucene low level operations to get this info ourselves.

Along with the change, I moved certain components to the bwc package to
preserve backwards compatibility. Additionally, I added several bwc
tests.

Signed-off-by: John Mazanec <[email protected]>
@jmazanec15 jmazanec15 force-pushed the derived-source-segment-attribute branch from 194118a to e251f01 Compare March 14, 2025 20:02
*/
public class DerivedSourceSegmentAttributeHelper {

public static final String DERIVED_SOURCE_FIELD = "derived_vector_fields";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can make it package private.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense - will update.

return Collections.emptyList();
}
String derivedVectorFields = segmentInfo.getAttribute(DERIVED_SOURCE_FIELD);
if (derivedVectorFields == null || derivedVectorFields.isEmpty()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use StringUtils for apache commons here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes - will update

if (derivedVectorFields == null || derivedVectorFields.isEmpty()) {
return Collections.emptyList();
}
return Arrays.stream(derivedVectorFields.split(",")).collect(Collectors.toList());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we handle the case of corruption here. So if the string comes out to be ,,,, what will happen? will there be failure of query?

also move , to a constant.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we handle the case of corruption here. So if the string comes out to be ,,,, what will happen? will there be failure of query?

Right - this shouldnt happen because the inputs will always be field names. But, I will update split to preserve empty strings just in case.

also move , to a constant.

Sure, will update

Comment on lines 46 to 48
public static void addDerivedVectorFieldsSegmentInfoAttribute(SegmentInfo segmentInfo, List<String> vectorFieldTypes) {
segmentInfo.putAttribute(DERIVED_SOURCE_FIELD, String.join(",", vectorFieldTypes));
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given its a public function lets ensure that vectorFieldTypes is not null and empty.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense

@jmazanec15 jmazanec15 force-pushed the derived-source-segment-attribute branch 2 times, most recently from 438bcbd to c618dec Compare March 17, 2025 15:22
Signed-off-by: John Mazanec <[email protected]>
@jmazanec15 jmazanec15 force-pushed the derived-source-segment-attribute branch from c618dec to 100ff0b Compare March 17, 2025 16:32
@jmazanec15 jmazanec15 requested a review from navneet1v March 17, 2025 17:54
@jmazanec15 jmazanec15 merged commit 762b7cb into opensearch-project:main Mar 17, 2025
36 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Refactoring Improve the design, structure, and implementation while preserving its functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants