(EAI-152) check for change to chunkAlgoHash when updating embeddings #580

Open
wants to merge 19 commits into
base: main

Conversation

yakubova92
Collaborator

@yakubova92 yakubova92 commented Dec 16, 2024

Jira: https://jira.mongodb.org/browse/EAI-152

Changes

  • Previously, updateEmbeddedContent re-chunked a page if the page content or the chunkAlgoHash had changed since the date provided
  • Now, it also updates embedded content whenever the chunking algorithm has changed, even if the page has no other changes, which effectively ignores the since date in that case
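
The selection rule described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: `PageSummary`, `StoredChunkSummary`, and `pagesToReembed` are hypothetical names, and the sketch works on in-memory arrays rather than the actual page and embedded content stores.

```typescript
// Hypothetical, simplified shapes for illustration only.
// The real page and embedded content records carry more fields.
interface PageSummary {
  url: string;
  updated: Date;
}

interface StoredChunkSummary {
  url: string;
  chunkAlgoHash?: string;
}

/**
 * Returns the URLs of pages whose embeddings should be regenerated:
 * pages updated on or after `since`, plus pages that have at least one
 * stored chunk whose chunkAlgoHash differs from the current hash,
 * regardless of `since`.
 */
function pagesToReembed(
  pages: PageSummary[],
  storedChunks: StoredChunkSummary[],
  since: Date,
  currentChunkAlgoHash: string
): string[] {
  // URLs with at least one chunk produced by a different chunking algorithm.
  const staleUrls = new Set(
    storedChunks
      .filter((chunk) => chunk.chunkAlgoHash !== currentChunkAlgoHash)
      .map((chunk) => chunk.url)
  );
  return pages
    .filter(
      (page) =>
        page.updated.getTime() >= since.getTime() || staleUrls.has(page.url)
    )
    .map((page) => page.url);
}
```

In this sketch, a page that has not been touched since the given date is still selected when its stored chunks carry an outdated hash, which is the behavior the second bullet describes.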

@yakubova92 yakubova92 changed the title from "check for change in chunkAlgoHash" to "(EAI-152) check for change to chunkAlgoHash when updating embeddings" Dec 16, 2024
@yakubova92 yakubova92 requested a review from mongodben December 16, 2024 16:09
Collaborator

@mongodben mongodben left a comment

see comment about caching.

and beyond that, sorry if i'm dense, but what changes are you making?

@yakubova92 yakubova92 marked this pull request as ready for review January 10, 2025 21:41
@yakubova92 yakubova92 requested a review from mongodben January 10, 2025 21:41
Collaborator

@mongodben mongodben left a comment

see comment about the $lookup implementation

@yakubova92 yakubova92 requested a review from mongodben January 15, 2025 22:41
Collaborator

@mongodben mongodben left a comment

see comment about typing

@yakubova92 yakubova92 requested a review from mongodben January 23, 2025 04:56
@yakubova92 yakubova92 requested a review from nlarew March 11, 2025 20:18
Collaborator

@nlarew nlarew left a comment

A few small suggested tweaks + one question about the caching behavior

@@ -276,3 +295,231 @@ describe("updateEmbeddedContent", () => {
});
});
});

// These tests use "mongodb-memory-server", not mockEmbeddedContentStore
describe("updateEmbeddedContent", () => {
Collaborator

We shouldn't have two describe blocks with the same title. Can you either combine into a single block or disambiguate them with more descriptive titles?

Collaborator Author

I created a more specific describe block for this for now, but this test file should be refactored. These test cases were placed in their own describe block because the original describe block used mocks for the page and embedding stores, and we are moving to mongodb-memory-server. I created a ticket to capture that work: https://jira.mongodb.org/browse/EAI-935

Comment on lines +70 to +75
const chunkAlgoHashes = new Map<string, string>();
const chunkAlgoHash = getHashForFunc(
  chunkAlgoHashes,
  chunkPage,
  chunkOptions
);
Collaborator

I'm not sure I understand the purpose of the chunkAlgoHashes caching here.

  • We define an empty cache and only use it for this one call right after.
  • That call only gets the current chunkPage function (that we import into this file)

It seems like the cache will never actually be used? Am I missing something?
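
To make the concern concrete, here is a minimal sketch of a cache-backed function hash. It assumes a helper that keys the cache on the function source plus its serialized options; `hashFuncWithCache` is a hypothetical stand-in written for this illustration, the `chunkPage` below is a placeholder, and the real getHashForFunc may work differently.

```typescript
import { createHash } from "node:crypto";

// Hypothetical stand-in for getHashForFunc; the real helper may differ.
// Returns the hash and whether it was served from the cache.
function hashFuncWithCache(
  cache: Map<string, string>,
  fn: (...args: any[]) => unknown,
  options?: unknown
): { hash: string; cacheHit: boolean } {
  // Key on the function source text plus the serialized options.
  const key = fn.toString() + JSON.stringify(options ?? {});
  const cached = cache.get(key);
  if (cached !== undefined) {
    return { hash: cached, cacheHit: true };
  }
  const hash = createHash("sha256").update(key).digest("hex");
  cache.set(key, hash);
  return { hash, cacheHit: false };
}

// Placeholder chunker, not the chunkPage imported in the PR.
function chunkPage(text: string): string[] {
  return text.split("\n\n");
}

// A fresh cache per call can never produce a hit.
const first = hashFuncWithCache(new Map<string, string>(), chunkPage);
const second = hashFuncWithCache(new Map<string, string>(), chunkPage);

// A cache shared across calls hits on the second lookup.
const sharedCache = new Map<string, string>();
const third = hashFuncWithCache(sharedCache, chunkPage);
const fourth = hashFuncWithCache(sharedCache, chunkPage);
```

Under these assumptions, a cache only pays off if it outlives a single call, for example by living at module scope or being passed in by the caller; a map created immediately before its only lookup adds no benefit.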


3 participants