Migrate derived source from filter to mask #2612

Merged
merged 1 commit into opensearch-project:main from derived-mask-integ on Mar 24, 2025

Conversation


@jmazanec15 jmazanec15 commented Mar 18, 2025

Description

Migrates the derived source functionality from a filter-based to a mask-based approach in a backwards-compatible way. The change can be summed up as: instead of removing vector fields from the source, it replaces them with a smaller representation (a mask). When we need to add them back, we transform the source map to replace the masks with the actual vectors. This makes handling nested docs much easier.
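
For illustration only (the field names and mask token below are placeholders, not necessarily what the plugin stores on disk), the stored _source keeps a small marker in place of each vector, and the read path swaps the real vector back in:

# Stored _source on disk: the vector is replaced by a small mask (illustrative)
{ "text_field": "hello", "vector_field": "$MASK" }

# _source returned on read: the mask is replaced by the actual vector
{ "text_field": "hello", "vector_field": [0.1, 0.2, 0.3] }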

For backwards compatibility, this PR moves the old read functionality and related classes to the backwardscodecs/KNN9120Codec package and removes the old write functionality, which is no longer necessary. On merge, we add custom functionality in the stored fields writer merge logic to fall back to the base, non-optimized merge if it detects older readers in the merge state. For this, we reconstruct the source document with the reader and then apply the mask on top of it to remove the vectors. This ensures that segments are migrated to the mask approach. To verify this, I added several backwards compatibility tests.
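
As a rough sketch of the fallback check described above (this is not the PR's actual code, and KNN9120DerivedSourceStoredFieldsReader is a hypothetical name for the legacy reader moved to the backwards codecs package):

import org.apache.lucene.codecs.StoredFieldsReader;
import org.apache.lucene.index.MergeState;

// Sketch only: decide whether the optimized stored-fields merge can be used, or whether
// we must fall back to the base merge, which reconstructs each source document through
// the reader so the mask can be re-applied on write.
final class DerivedSourceMergeHelper {

    /** Returns true if any incoming segment still uses the old filter-based reader. */
    static boolean hasLegacyReader(MergeState mergeState) {
        for (StoredFieldsReader reader : mergeState.storedFieldsReaders) {
            // Hypothetical class name for the reader now under backwardscodecs/KNN9120Codec.
            if (reader instanceof KNN9120DerivedSourceStoredFieldsReader) {
                return true;
            }
        }
        return false;
    }
}

In the writer's merge logic, a check like this would gate the fall back to the base, non-optimized merge, migrating old segments to the mask format as described above.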

There are 40+ file changes, but most of them are just moving files to backwards codec.

Adding a few more test cases today.

Related Issues

Resolves #2377

Check List

  • Commits are signed per the DCO using --signoff.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@jmazanec15 jmazanec15 added the Refactoring (Improve the design, structure, and implementation while preserving its functionality) and skip-changelog labels on Mar 18, 2025
@jmazanec15 jmazanec15 force-pushed the derived-mask-integ branch 2 times, most recently from 05a55c1 to 0c0536f on March 18, 2025 21:11


0ctopus13prime commented Mar 22, 2025

Hi @jmazanec15, just throwing out an idea on how to get the first child doc id for the nested case.
If the parent id (e.g. the top-level doc id) and the _source Map are given, then I think we can backtrack to compute the child doc ids.
Running a single DFS is sufficient, but we still need the mutable map creation step.

Lemma 1. The DFS visit order of the last doc, which is the top-level doc holding the _source field, equals the total number of Lucene documents populated.

Example

# _source JSON: a total of 11 documents will be populated

{
  "nested_field": [
    {
      "key": "key7",
      "nested_field": [
        {
          "nested_field": [
            {
              "key": "key1",
              "vector_field": "$MASK"
            },
            {
              "key": "key2",
              "vector_field": "$MASK"
            }
          ],
          "key": "key3"
        },
        {
          "nested_field": [
            {
              "key": "key4",
              "vector_field": "$MASK"
            },
            {
              "key": "key5",
              "vector_field": "$MASK"
            }
          ],
          "key": "key6"
        }
      ]
    },
    {
      "key": "key10",
      "nested_field": [
        {
          "nested_field": [
            {
              "key": "key8",
              "vector_field": "$MASK"
            },
            {
              "key": "key9",
              "vector_field": "$MASK"
            }
          ]
        }
      ]
    }
  ],
  "key": "key11",
  "vector_field": "$MASK"
}

Lemma 2. A child doc's offset from the top-level document id is calculated as the target field's visit order minus the total number of docs populated. In turn, the child doc id can be calculated by adding this offset to the top-level doc id.

For example, assuming the top-level doc id is 255 and the target field is nested_field.nested_field.nested_field.vector_field, the visit orders would be as follows:

  1. visit order = 1, key = key1, offset = 1 - 11 = -10, child doc = -10 + 255 = 245
  2. visit order = 2, key = key2, offset = 2 - 11 = -9, child doc = -9 + 255 = 246
  3. visit order = 4, key = key4, offset = 4 - 11 = -7, child doc = -7 + 255 = 248
  4. visit order = 5, key = key5, offset = 5 - 11 = -6, child doc = -6 + 255 = 249
  5. visit order = 8, key = key8, offset = 8 - 11 = -3, child doc = -3 + 255 = 252
  6. visit order = 9, key = key9, offset = 9 - 11 = -2, child doc = -2 + 255 = 253

Idea: first, run a DFS to collect the target maps to update and the visit order of the target field in each.

  1. During the DFS, we collect the map and the visit order for the target field, and obtain the total number of Lucene docs (Lemma 1).
  2. After running the DFS, we can calculate the total number of Lucene documents, and with this info we can get the first child document id (Lemma 2). In the above example, it would be 245.
  3. Iterate over the collected target maps, advance the iterator to the target child document, and inject the vector value.

Complexity to transform _source for all fields

O(N^2) where N is the total number of Lucene documents populated.

Pseudocode

void inject(int parentDocId, ...) {
    // Run DFS to collect visit numbers + target maps
    List<Pair> subMaps = new ArrayList<>();
    int numLuceneDocsPopulated = collectVisitNumberAndSubMaps(sourceMap, 1, subMaps) - 1;

    // Do the actual replacement.
    KnnVectorValues values = ...;
    for (Pair pair : subMaps) {
        int childDoc = parentDocId + pair.visitNo - numLuceneDocsPopulated;
        // Child docs must have been collected in increasing order.
        values.advance(childDoc);
        String vectorString = toString(values.getVector());  // Convert vector to string
        pair.targetMap.put(targetField, vectorString);  // Replace the mask with the actual vector string representation.
    }
}

int collectVisitNumberAndSubMaps(Map sourceMap, int visitNo, List<Pair> maps) {
    // Visit sub-maps first, i.e. post-order DFS
    for (String key : sourceMap.keySet()) {
        Object value = sourceMap.get(key);
        if (value instanceof Map) {
            visitNo = collectVisitNumberAndSubMaps((Map) value, visitNo, maps);
        } else if (value instanceof List) {
            for (Object element : (List) value) {
                if (element instanceof Map) {
                    visitNo = collectVisitNumberAndSubMaps((Map) element, visitNo, maps);
                }
            }
        }
    }

    // Now we visit the current map
    for (String key : sourceMap.keySet()) {
        if (key.equals(targetField)) {
            // Found it! Collect it along with its visit number.
            maps.add(new Pair(visitNo, sourceMap));
            break;  // As far as I know, one doc cannot have duplicate fields.
        }
    }

    // Return the next visit number
    return visitNo + 1;
}

Pros / Cons

This solution works for both the nested and non-nested cases.
For the non-nested case, the visit order will always be 1 and the total number of populated Lucene docs is also 1, hence the offset is 0. This makes the iterator always advance to the top-level document id.

Pros:

  1. As a general solution, we will not need to manage the nested and non-nested cases separately.
  2. Relatively simple to implement; we don't need to consider edge cases.
    For the sparse nested case, where not all parent documents have a KNN field, the DFS results for a parent doc without the KNN field will be empty. Because there is no visit order for the target KNN field, it is ignored naturally, so the suggested solution spares us headaches with such edge cases.

Cons:

  1. I'm not sure how difficult it would be to add this to the existing derived source code base, but if it does not take much effort, I can't think of any cons to this approach.

Optimization

  1. Reusing the iterator:
    I know the document ids will not necessarily be given in increasing order, but I think it would be good if we could reuse the previously instantiated iterator rather than creating a new one every time.
Iterator getOrCreate(int firstChildDocId) {
    if (iterator == null || iterator.docId() > firstChildDocId) {
        // No iterator yet, or it has already moved past the target: create a new one.
        iterator = createIterator();  // placeholder for however a fresh iterator is obtained
    }

    // Otherwise we can reuse it! Either way, advance to the first child doc.
    iterator.advance(firstChildDocId);
    return iterator;
}

  2. Skipping the proposed algorithm for non-nested KNN fields:
    Even though it is a general solution that works for both cases, we could skip it and jump directly to the given doc id when the target field is non-nested (see the sketch below).
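
A minimal sketch of that short-circuit, reusing the placeholder names from the pseudocode above (targetField, values, toString), so it is illustrative rather than runnable as-is:

void injectNonNested(int docId, Map<String, Object> sourceMap) {
    // For a non-nested field, the vector lives on the top-level document itself,
    // so no DFS or offset math is needed: advance straight to the given doc id.
    values.advance(docId);
    sourceMap.put(targetField, toString(values.getVector()));
}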

@jmazanec15
Member Author

@0ctopus13prime - I think that's a pretty interesting idea. I worry that it might run into issues when there are arrays that are not necessarily related to nested documents. However, I think it makes sense to follow up in #2626

@jmazanec15
Member Author

Discovered #2625 - will address it in a follow-up PR. @shatejas and @0ctopus13prime, could I get a follow-up review?

@jmazanec15
Member Author

Tests are failing because common-utils needs to be upgraded to beta1 - see opensearch-project/common-utils#808

Migrates derived source functionality from filter to mask based
approach. Moves old read functionality and related classes to backwards
codecs KNN9120... Removes old write as no longer necessary.

In order to support bwc, we add custom functionality in the stored
fields writer merge logic to fall back to base, non-optimized merge if
it detects older readers in the merge state. This is needed because for
these segments, we need to rebuild the source and then apply filter to
migrate to new write format.

Signed-off-by: John Mazanec <[email protected]>

@0ctopus13prime 0ctopus13prime left a comment


LGTM! Once it passes CI, will approve it.
Thank you.


@shatejas shatejas left a comment


LGTM

Can approve once CI passes

@jmazanec15 jmazanec15 merged commit 7f1af5a into opensearch-project:main Mar 24, 2025
37 checks passed
Labels
Refactoring (Improve the design, structure, and implementation while preserving its functionality), v3.0.0
Development

Successfully merging this pull request may close these issues.

[RFC] Derived Source for Vectors