-
Notifications
You must be signed in to change notification settings - Fork 146
Switch derived source from field attributes to segment attribute #2606
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switch derived source from field attributes to segment attribute #2606
Conversation
Switches derived source from using field attributes to using segment attributes to track removed vector fields.The main motivation for this change is that: (1) it allows us to simplify mapper logic. We no longer need to add another field attribute to an already bloated set of field attributes. (2) At the time of setting field attributes, we do not have access to info about nesting scope for the field. If we switch to segment info, we can get the nested scope and avoid having to perform complex lucene low level operations to get this info ourselves. Along with the change, I moved certain components to the bwc package to preserve backwards compatibility. Additionally, I added several bwc tests. Signed-off-by: John Mazanec <[email protected]>
194118a
to
e251f01
Compare
qa/restart-upgrade/src/test/java/org/opensearch/knn/bwc/DerivedSourceBWCRestartIT.java
Show resolved
Hide resolved
qa/restart-upgrade/src/test/java/org/opensearch/knn/bwc/DerivedSourceBWCRestartIT.java
Show resolved
Hide resolved
...rch/knn/index/codec/backward_codecs/KNN9120Codec/KNN9120DerivedSourceStoredFieldsFormat.java
Show resolved
Hide resolved
*/ | ||
public class DerivedSourceSegmentAttributeHelper { | ||
|
||
public static final String DERIVED_SOURCE_FIELD = "derived_vector_fields"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can make it package private.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense - will update.
return Collections.emptyList(); | ||
} | ||
String derivedVectorFields = segmentInfo.getAttribute(DERIVED_SOURCE_FIELD); | ||
if (derivedVectorFields == null || derivedVectorFields.isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use StringUtils for apache commons here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes - will update
if (derivedVectorFields == null || derivedVectorFields.isEmpty()) { | ||
return Collections.emptyList(); | ||
} | ||
return Arrays.stream(derivedVectorFields.split(",")).collect(Collectors.toList()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we handle the case of corruption here. So if the string comes out to be ,,,,
what will happen? will there be failure of query?
also move ,
to a constant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we handle the case of corruption here. So if the string comes out to be ,,,, what will happen? will there be failure of query?
Right - this shouldnt happen because the inputs will always be field names. But, I will update split to preserve empty strings just in case.
also move , to a constant.
Sure, will update
public static void addDerivedVectorFieldsSegmentInfoAttribute(SegmentInfo segmentInfo, List<String> vectorFieldTypes) { | ||
segmentInfo.putAttribute(DERIVED_SOURCE_FIELD, String.join(",", vectorFieldTypes)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
given its a public function lets ensure that vectorFieldTypes is not null and empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense
438bcbd
to
c618dec
Compare
...n/java/org/opensearch/knn/index/codec/derivedsource/DerivedSourceSegmentAttributeParser.java
Outdated
Show resolved
Hide resolved
Signed-off-by: John Mazanec <[email protected]>
c618dec
to
100ff0b
Compare
Description
Follow up on #2583, I am pivoting approach for supporting derived source for vector fields from filter approach to field masking approach. See PoC here: https://github.com/jmazanec15/k-NN-1/tree/mask-derived-poc. The main motivation is to simplify the injection logic. As a first step, this PR switches storing derived source information from the field attributes to the segment attributes.
The main motivation for switching is that: (1) it allows us to simplify mapper logic. We no longer need to add another field attribute to an already bloated set of field attributes. We keep adding info into these field attributes and I dont think we really need to. Its making it somewhat hard to navigate mappers. (2) At the time of setting field attributes, we do not have access to info about nesting scope for the field. If we switch to segment info, we can get the nested scope by using the mapper service and avoid having to perform complex lucene low level operations to get this info ourselves.
The majority of the functional change in this PR can be found in KNN10010DerivedSourceStoredFieldsFormat.
Along with the change, I moved certain components to the bwc package to preserve backwards compatibility. Additionally, I added several bwc tests. This will help setup for moving towards masking approach to confirm bwc.
Next Steps
Check List
--signoff
.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.