-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Add AcceptDocs abstraction for accepted KNN docs #15011
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for looking into it, it looks like this refactoring is working well. I left a first round of suggestions.
/** | ||
* Random access to the accepted documents. | ||
* | ||
* @return Bits instance for random access, or null if not available |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should apply the usual contract that a null
Bits instance means that all docs are accepted?
* @return Bits instance for random access, or null if not available | |
* @return Bits instance for random access, or null if all documents are accepted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this resolved? I do think that "or null if all documents are accepted" is more clear.
fieldEntry.similarityFunction, | ||
getGraphValues(fieldEntry), | ||
getAcceptOrds(acceptDocs, fieldEntry), | ||
getAcceptOrds(acceptDocs != null ? acceptDocs.getBits() : null, fieldEntry), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we set the contract of this method so that acceptDocs
is not allowed to be null
? This would help save all these (annoying) null checks, and it wouldn't be a great burden on the caller side?
@shubhamvishu I hope you don't mind, I addressed my feedback directly on your branch. |
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR. |
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR. |
We will also need to move the decision between approximate and exact to the codec, but it's probably best done as a follow-up PR? I think that this change is good as-is. |
@jpountz Not at all, thank you so much for taking care of it! Sorry I couldn’t get to it sooner. I really appreciate the assist.
I agree completely , that is unrelated to this specific change and better to keep as separate PR. |
Co-authored-by: Shubham Chaudhary <[email protected]>
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR. |
@jpountz @benwtrent The test failures seems is unrelated to this change and a side effect of bulk scoring change. The test fails on main also when HNSW is disabled to force exhaustive search and we score bulk score 64 docs now. I wonder if we still want to respect the |
@shubhamvishu could you open a test failure issue? @ChrisHegarty what do you think? |
@benwtrent I opened an issue #15057 with the details of the test failure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is generally very nice. I left a few small comments.
/** | ||
* Random access to the accepted documents. | ||
* | ||
* @return Bits instance for random access, or null if not available |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this resolved? I do think that "or null if all documents are accepted" is more clear.
* @return Bits instance for random access, or null if not available | ||
* @throws IOException if an I/O error occurs | ||
*/ | ||
public abstract Bits bits() throws IOException; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any requirement for either this bits
or iterator
to throw IOException. It can be removed unless there is some other usage not currently in this PR that needs it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I was thinking of lazily creating the bitset in this method at first, so it would need the IOException
. But actually we always need the cardinality of the bitset to decide between random access and sequential access now, so maybe we could simplify things and make AcceptDocs
a record around Bits
and a DocIdSetIterator
. So no exception, no lazy computation of cost()
(to you other comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I spoke too quickly. A flat scorer should never compute the cost()
or use the Bits
. So we need to both compute the cost and the Bits
lazily.
Co-authored-by: Chris Hegarty <[email protected]>
One of Chris' comment helped me realize that we would still load filter matches into a |
/** | ||
* Random access to the accepted documents. | ||
* | ||
* <p><b>NOTE</b>: This must not be called if the {@link #iterator()} has already been used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the contract if bits()
is called first, and then iterator()
? Is it possible the iterator will have been consumed in the process of creating the Bits
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. The original iterator will be consumed to create the bit set and replaced with an iterator over the bit set.
So we should disallow calling bits() or cardinality() after iterator()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also if we first call the iterator()
then we directly pass the input iterator as-is in DocIdSetIteratorAcceptDocs
unless the bits()
is called which then updates it to a BitSetIterator
. How will we do that if we disallow bits() after iterator()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like forbidding #cost()
or #bits()
to be called after #iterator()
works with current impls, so I pushed this change.
If you call #iterator()
first, you will get the original iterator, if you call it after one of the two other methods, the original iterator will be loaded into a bit set and #iterator()
will return an iterator over the bit set.
@shubhamvishu Sorry, I don't get your point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand from your above main comment that we don't want to load the matches in the bitset incase of flat scorer but it seems to me that allowing would make the api contract simple(but at the cost of doing unneccessary work)....Is that correct? If yes, could we somehow avoid only undesirable loading at the same time keeping the contract simple(with no restrictions)?
Basically I was considering that if user doesn't call bits()
or cost()
before calling iterator()
, we return the same iterator(without loading or creating the BitSet), and the iterator wouldn't filter out the deletes i.e. the iterator wouldn't be a conjunction over live docs, as we previously decided. So was thinking if we could allow the caller to access both bits
and iterator
as needed? maybe its not that simple or the extra cost to pay if not worth?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems to me that allowing would make the api contract simple(but at the cost of doing unneccessary work)....Is that correct?
This is correct! I agree with your sentiment, I got there because I wanted to avoid some performance traps, but it has gone too far, the API contract is too complex. I pushed a change so that it's legal to call the methods in any order, it just happens to be a bit more efficient to call bits() before iterator() than iterator() before bits().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me!
// Usage of AcceptDocs should be confined to a single thread, so this doesn't need | ||
// synchronization. | ||
if (bitSet == null) { | ||
if (iterator.docID() != -1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we check this even if bitSet is not null, for consistency with the API contract?? As it is we are exposing some internal details -- it's OK to call bits(), then iterator(), consume some of the iterator, and then call bits() again ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not checked all the time.
@jpountz Thanks for the changes. It looks overall good to me. I just made 2 simple refactoring changes on top :
Let me know if looks good or feel free to revert any changes. I think this looks in good shape to be merged soon (if nobody has concerns)?. Also ran the luceneutil benchmarks to confirm there is no regression or any sort. Looks all good. Candidate
Baseline
|
|
||
@Override | ||
public DocIdSetIterator iterator() { | ||
return iterator; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With my most recent commit, this is now supposed to return a new iterator, so we need to move the iterator instantiation from the ctor to here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, we also need to create a new iterator in the case when it's backed by DocIdSetIterator#all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kept it as-is as we were not creating in the case of BitsAcceptDocs in the earlier commit. But makes sense to return a new iterator irrespective.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I updated it. We always return a fresh iterator now with both APIs..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks good! I added some small tests, I'll merge soon.
Co-authored-by: Adrien Grand <[email protected]> Co-authored-by: Chris Hegarty <[email protected]>
@jpountz do you mean moving the If so, I think it could be useful in optimizations like #12820 too, where the main overhead is passing / re-creating information from ordinal-space to docid-space (i.e. the "visited" |
@kaivalnp Yes, we want to move it down to the Codec. I'll soon post a PR for this change. |
FYI, for exact search to be fast, it would also be good to have AcceptDocs be able to provide you a "size estimate" that doesn't require evaluating the entire iterator. Right now I think it does that and pretty much defeats the purpose of lazy evaluation of the iterator. |
Description
Addresses this comment on the PR #14963 to allow both sequential and random access consumption for accepted docs in KNN search.