[LuceneOnFaiss] Apply monotonic increasing integer encoding to FAISS HNSW and IdMapIndex. #2609

Open
wants to merge 1 commit into
base: lucene-on-faiss

0ctopus13prime
Collaborator

Description

This PR introduces a monotonic integer sequence encoding scheme for the doc IDs in IdMapIndex and the offsets in FAISS HNSW.
Previously, both were stored as long[], consuming O(N) memory, where N is the total number of vectors. For example, with a dataset of 10M vectors they occupied 228MB (76MB + 152MB). With monotonic encoding, this can be reduced to a few hundred KB. Lucene already uses this encoding scheme (Link).

The key difference is that Lucene applies the encoding during segment creation, whereas we apply it at load time. However, encoding is extremely fast: on my local setup it takes about 3.2 seconds for 100M integers.
A warm-up API could further reduce the impact by preloading the data before search. This has not been raised yet, but it is planned for a future PR.
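
For readers unfamiliar with the scheme, here is a minimal sketch of block-wise monotonic encoding as described above. It is an illustration only, not the code in this PR and not Lucene's implementation: the class name `MonotonicSketch`, the block size of 1024, and the simple bit-packing layout are all assumptions made for the example. Each block stores a minimum, an average slope, and only the small bit-packed residuals from the straight line those two define.

```java
/** Hypothetical sketch: block-wise monotonic encoding of a non-decreasing long sequence. */
final class MonotonicSketch {
    static final int BLOCK_SIZE = 1024; // assumed block size, chosen for illustration

    private final long[] mins;     // per-block offset of the fitted line
    private final float[] slopes;  // per-block average increment
    private final int[] bits;      // per-block bits needed per residual
    private final long[][] packed; // per-block bit-packed residuals

    MonotonicSketch(long[] values) {
        int blocks = (values.length + BLOCK_SIZE - 1) / BLOCK_SIZE;
        mins = new long[blocks];
        slopes = new float[blocks];
        bits = new int[blocks];
        packed = new long[blocks][];
        for (int b = 0; b < blocks; b++) {
            int start = b * BLOCK_SIZE;
            int n = Math.min(BLOCK_SIZE, values.length - start);
            float slope = n == 1 ? 0f : (float) (values[start + n - 1] - values[start]) / (n - 1);
            long[] residuals = new long[n];
            long min = Long.MAX_VALUE;
            for (int i = 0; i < n; i++) {
                // Distance of each value from the line "slope * i".
                residuals[i] = values[start + i] - (long) (slope * i);
                min = Math.min(min, residuals[i]);
            }
            long max = 0;
            for (int i = 0; i < n; i++) {
                residuals[i] -= min; // now non-negative
                max = Math.max(max, residuals[i]);
            }
            mins[b] = min;
            slopes[b] = slope;
            bits[b] = 64 - Long.numberOfLeadingZeros(max); // 0 when the block is perfectly linear
            packed[b] = pack(residuals, bits[b]);
        }
    }

    /** Decodes the value at the given index using the same float arithmetic as the encoder. */
    long get(int index) {
        int b = index / BLOCK_SIZE;
        int i = index % BLOCK_SIZE;
        return mins[b] + (long) (slopes[b] * i) + unpack(packed[b], i, bits[b]);
    }

    /** Bytes used by the packed residuals (excludes the small per-block metadata). */
    long packedBytes() {
        long total = 0;
        for (long[] block : packed) {
            total += 8L * block.length;
        }
        return total;
    }

    private static long[] pack(long[] residuals, int bitsPerValue) {
        if (bitsPerValue == 0) {
            return new long[0];
        }
        long[] out = new long[(int) (((long) residuals.length * bitsPerValue + 63) >>> 6)];
        for (int i = 0; i < residuals.length; i++) {
            long bitPos = (long) i * bitsPerValue;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            out[word] |= residuals[i] << shift;
            if (shift + bitsPerValue > 64) { // value straddles two words
                out[word + 1] |= residuals[i] >>> (64 - shift);
            }
        }
        return out;
    }

    private static long unpack(long[] words, int i, int bitsPerValue) {
        if (bitsPerValue == 0) {
            return 0;
        }
        long bitPos = (long) i * bitsPerValue;
        int word = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long value = words[word] >>> shift;
        if (shift + bitsPerValue > 64) {
            value |= words[word + 1] << (64 - shift);
        }
        long mask = bitsPerValue == 64 ? -1L : (1L << bitsPerValue) - 1;
        return value & mask;
    }
}
```

When the values are perfectly sequential, as doc IDs often are, every residual is zero and the packed payload is empty, which is why the footprint can drop so far below that of a plain long[]. Lucene ships a production version of this idea as `DirectMonotonicWriter` and `DirectMonotonicReader`; the sketch above is not taken from that code.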

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

RFC : #2401

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@0ctopus13prime
Collaborator Author

@shatejas @jmazanec15 Could you approve this PR if it looks good to you? :)
Thank you
