Migrate derived source from filter to mask #2612

Merged
merged 1 commit into opensearch-project:main from derived-mask-integ on Mar 24, 2025

Conversation


@jmazanec15 jmazanec15 commented Mar 18, 2025

Description

Migrates the derived source functionality from a filter-based to a mask-based approach in a backwards-compatible way. The change can be summed up as: instead of removing vector fields from the source, it replaces them with a smaller representation (a mask). When we need to add them back, we transform the source map to replace the masks with the actual vectors. This makes handling nested docs much easier.
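
For illustration only (the field names and mask token below are placeholders, not necessarily what the plugin stores on disk), the stored _source keeps a small marker in place of each vector, and the read path swaps the real vector back in:

# Stored _source on disk: the vector is replaced by a small mask (illustrative)
{ "text_field": "hello", "vector_field": "$MASK" }

# _source returned on read: the mask is replaced by the actual vector
{ "text_field": "hello", "vector_field": [0.1, 0.2, 0.3] }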

For backwards compatibility, this PR moves the old read functionality and related classes to the backwardscodecs/KNN9120Codec package and removes the old write functionality, which is no longer necessary. On merge, we add custom functionality in the stored fields writer merge logic to fall back to the base, non-optimized merge if it detects older readers in the merge state. For this, we reconstruct the source document with the reader and then apply the mask on top of it to remove the vectors. This ensures that segments are migrated to the mask approach. To verify this, I added several backwards compatibility tests.
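
As a rough sketch of the fallback check described above (this is not the PR's actual code, and KNN9120DerivedSourceStoredFieldsReader is a hypothetical name for the legacy reader moved to the backwards codecs package):

import org.apache.lucene.codecs.StoredFieldsReader;
import org.apache.lucene.index.MergeState;

// Sketch only: decide whether the optimized stored-fields merge can be used, or whether
// we must fall back to the base merge, which reconstructs each source document through
// the reader so the mask can be re-applied on write.
final class DerivedSourceMergeHelper {

    /** Returns true if any incoming segment still uses the old filter-based reader. */
    static boolean hasLegacyReader(MergeState mergeState) {
        for (StoredFieldsReader reader : mergeState.storedFieldsReaders) {
            // Hypothetical class name for the reader now under backwardscodecs/KNN9120Codec.
            if (reader instanceof KNN9120DerivedSourceStoredFieldsReader) {
                return true;
            }
        }
        return false;
    }
}

In the writer's merge logic, a check like this would gate the fall back to the base, non-optimized merge, migrating old segments to the mask format as described above.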

There are 40+ file changes, but most of them are just moving files to backwards codec.

Adding a few more test cases today.

Related Issues

Resolves #2377

Check List

  • Commits are signed per the DCO using --signoff.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@jmazanec15 jmazanec15 added the Refactoring (Improve the design, structure, and implementation while preserving its functionality) and skip-changelog labels on Mar 18, 2025
@jmazanec15 jmazanec15 force-pushed the derived-mask-integ branch 2 times, most recently from 05a55c1 to 0c0536f on March 18, 2025 21:11


0ctopus13prime commented Mar 22, 2025

Hi @jmazanec15, just throwing out an idea on how to get the first child doc id for the nested case.
If the parent id (e.g. the top-level doc id) and the _source Map are given, then I think we can backtrack to compute the child doc ids.
Running a single DFS is sufficient, but we still need the mutable map creation step.

Lemma 1. The DFS visit order of the last doc, which is the top-level doc holding the _source field, equals the total number of Lucene documents populated.

Example

# _source JSON: a total of 11 documents will be populated

{
  "nested_field": [
    {
      "key": "key7",
      "nested_field": [
        {
          "nested_field": [
            {
              "key": "key1",
              "vector_field": "$MASK"
            },
            {
              "key": "key2",
              "vector_field": "$MASK"
            }
          ],
          "key": "key3"
        },
        {
          "nested_field": [
            {
              "key": "key4",
              "vector_field": "$MASK"
            },
            {
              "key": "key5",
              "vector_field": "$MASK"
            }
          ],
          "key": "key6"
        }
      ]
    },
    {
      "key": "key10",
      "nested_field": [
        {
          "nested_field": [
            {
              "key": "key8",
              "vector_field": "$MASK"
            },
            {
              "key": "key9",
              "vector_field": "$MASK"
            }
          ]
        }
      ]
    }
  ],
  "key": "key11",
  "vector_field": "$MASK"
}

Lemma 2. A child doc's offset from the top-level document id is calculated as the target field's visit order minus the total number of docs populated. In turn, the child doc id can be calculated by adding this offset to the top-level doc id.

For example, assuming the top-level doc id is 255 and the target field is nested_field.nested_field.nested_field.vector_field, the visit orders would be as follows:

  1. visit order = 1, key = key1, offset = 1 - 11 = -10, child doc = -10 + 255 = 245
  2. visit order = 2, key = key2, offset = 2 - 11 = -9, child doc = -9 + 255 = 246
  3. visit order = 4, key = key4, offset = 4 - 11 = -7, child doc = -7 + 255 = 248
  4. visit order = 5, key = key5, offset = 5 - 11 = -6, child doc = -6 + 255 = 249
  5. visit order = 8, key = key8, offset = 8 - 11 = -3, child doc = -3 + 255 = 252
  6. visit order = 9, key = key9, offset = 9 - 11 = -2, child doc = -2 + 255 = 253

Idea: first, run a DFS to collect the target maps to update and the visit order of the target field in each.

  1. During the DFS, we collect the map and the visit order for the target field, and obtain the total number of Lucene docs (Lemma 1).
  2. After running the DFS, we can calculate the total number of Lucene documents, and with this info we can get the first child document id (Lemma 2). In the above example, it would be 245.
  3. Iterate over the collected target maps, advance the iterator to the target child document, and inject the vector value.

Complexity to transform _source for all fields

O(N^2) where N is the total number of Lucene documents populated.

Pseudocode

void inject(int parentDocId, ...) {
    // Run DFS to collect visit numbers + target maps
    List<Pair> subMaps = new ArrayList<>();
    int numLuceneDocsPopulated = collectVisitNumberAndSubMaps(sourceMap, 1, subMaps) - 1;

    // Do the actual replacement.
    KnnVectorValues values = ...;
    for (Pair pair : subMaps) {
        int childDoc = parentDocId + pair.visitNo - numLuceneDocsPopulated;
        // Child docs must have been collected in increasing order.
        values.advance(childDoc);
        String vectorString = toString(values.getVector());  // Convert vector to string
        pair.targetMap.put(targetField, vectorString);  // Replace the mask with the actual vector string representation.
    }
}

int collectVisitNumberAndSubMaps(Map sourceMap, int visitNo, List<Pair> maps) {
    // Visit sub-maps first, i.e. post-order DFS
    for (String key : sourceMap.keySet()) {
        Object value = sourceMap.get(key);
        if (value instanceof Map) {
            visitNo = collectVisitNumberAndSubMaps((Map) value, visitNo, maps);
        } else if (value instanceof List) {
            for (Object element : (List) value) {
                if (element instanceof Map) {
                    visitNo = collectVisitNumberAndSubMaps((Map) element, visitNo, maps);
                }
            }
        }
    }

    // Now we visit the current map
    for (String key : sourceMap.keySet()) {
        if (key.equals(targetField)) {
            // Found it! Collect it along with its visit number.
            maps.add(new Pair(visitNo, sourceMap));
            break;  // As far as I know, one doc cannot have duplicate fields.
        }
    }

    // Return the next visit number
    return visitNo + 1;
}

Pros / Cons

This solution works for both the nested and non-nested cases.
For the non-nested case, the visit order will always be 1 and the total number of populated Lucene docs is also 1, hence the offset is 0. This makes the iterator always advance to the top-level document id.

Pros:

  1. As a general solution, we will not need to manage the nested and non-nested cases separately.
  2. Relatively simple to implement; we don't need to consider edge cases.
    For the sparse nested case, where not all parent documents have a KNN field, the DFS results for a parent doc without the KNN field will be empty. Because there is no visit order for the target KNN field, it is ignored naturally, so the suggested solution spares us headaches with such edge cases.

Cons:

  1. I'm not sure how difficult it would be to add this to the existing derived source code base, but if it does not take much effort, I can't think of any cons to this approach.

Optimization

  1. Reusing the iterator:
    I know the document ids will not necessarily be given in increasing order, but I think it would be good if we could reuse the previously instantiated iterator rather than creating a new one every time.
Iterator getOrCreate(int firstChildDocId) {
    if (iterator == null || iterator.docId() > firstChildDocId) {
        // No iterator yet, or it has already moved past the target: create a new one.
        iterator = createIterator();  // placeholder for however a fresh iterator is obtained
    }

    // Otherwise we can reuse it! Either way, advance to the first child doc.
    iterator.advance(firstChildDocId);
    return iterator;
}

  2. Skipping the proposed algorithm for non-nested KNN fields:
    Even though it is a general solution that works for both cases, we could skip it and jump directly to the given doc id when the target field is non-nested (see the sketch below).
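
A minimal sketch of that short-circuit, reusing the placeholder names from the pseudocode above (targetField, values, toString), so it is illustrative rather than runnable as-is:

void injectNonNested(int docId, Map<String, Object> sourceMap) {
    // For a non-nested field, the vector lives on the top-level document itself,
    // so no DFS or offset math is needed: advance straight to the given doc id.
    values.advance(docId);
    sourceMap.put(targetField, toString(values.getVector()));
}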

@jmazanec15
Member Author

@0ctopus13prime - I think that's a pretty interesting idea. I worry that it might run into issues when there are arrays that are not necessarily related to nested documents. However, I think it makes sense to follow up in #2626

@jmazanec15
Member Author

Discovered #2625 - will address it in a follow-up PR. @shatejas and @0ctopus13prime, could I get a follow-up review?

@jmazanec15
Member Author

Tests are failing because common-utils needs to be upgraded to beta1 - see opensearch-project/common-utils#808

Migrates derived source functionality from filter to mask based
approach. Moves old read functionality and related classes to backwards
codecs KNN9120... Removes old write as no longer necessary.

In order to support bwc, we add custom functionality in the stored
fields writer merge logic to fall back to base, non-optimized merge if
it detects older readers in the merge state. This is needed because for
these segments, we need to rebuild the source and then apply filter to
migrate to new write format.

Signed-off-by: John Mazanec <[email protected]>

@0ctopus13prime 0ctopus13prime left a comment


LGTM! Once it passes CI, will approve it.
Thank you.


@shatejas shatejas left a comment


LGTM

Can approve once CI passes

@jmazanec15 jmazanec15 merged commit 7f1af5a into opensearch-project:main Mar 24, 2025
37 checks passed
Labels
Refactoring (Improve the design, structure, and implementation while preserving its functionality), v3.0.0
Development

Successfully merging this pull request may close these issues.

[RFC] Derived Source for Vectors