Reciprocal Rank Fusion (RRF) in TopDocs #13470
Conversation
Thanks for looking into this! This looks like what I'd expect for RRF in Lucene. I left some comments, could you also add some tests?
```java
}
}

/** Reciprocal Rank Fusion method. */
```
Users could use more details in these javadocs, e.g. what are `k` and `topN`?
Do we have to check if the k value is greater than or equal to 1 in the code? And maybe mention it in the javadocs?
+1 to validating parameters
```java
/** Reciprocal Rank Fusion method. */
public static TopDocs rrf(int TopN, int k, TopDocs[] hits) {
  Map<Integer, Float> rrfScore = new HashMap<>();
```
Note that we should identify documents not only by their doc ID, but by the combination of `ScoreDoc.shardIndex` and `ScoreDoc.doc`, as there could be multiple documents that have the same doc ID but come from different shards.
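One way to sketch this suggestion (using a hypothetical stand-in record for the two `ScoreDoc` fields, not Lucene's actual API): key the score map by a (shardIndex, doc) pair instead of the bare doc ID.

```java
import java.util.HashMap;
import java.util.Map;

public class ShardDocKeyDemo {
  // Hypothetical stand-in for the (ScoreDoc.shardIndex, ScoreDoc.doc) pair.
  record ShardDoc(int shardIndex, int doc) {}

  static Map<ShardDoc, Float> demoScores() {
    Map<ShardDoc, Float> rrfScore = new HashMap<>();
    // Doc ID 7 on two different shards stays distinct under this key.
    rrfScore.merge(new ShardDoc(0, 7), 0.5f, Float::sum);
    rrfScore.merge(new ShardDoc(1, 7), 0.25f, Float::sum);
    // The same (shard, doc) pair accumulates into a single entry.
    rrfScore.merge(new ShardDoc(0, 7), 0.25f, Float::sum);
    return rrfScore;
  }

  public static void main(String[] args) {
    System.out.println(demoScores().size()); // prints 2
  }
}
```

With a plain `Integer` key, the two shard-0/shard-1 entries for doc 7 would have collapsed into one.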
```java
  Map<Integer, Float> rrfScore = new HashMap<>();
  long minHits = Long.MAX_VALUE;
  for (TopDocs topDoc : hits) {
    minHits = Math.min(minHits, topDoc.totalHits.value);
```
I wonder if it should be a `max` rather than a `min`? Presumably, hits from either top hits are considered as hits globally, and we are just using RRF to boost hits that are ranked high in multiple top hits?
The totalHits was a tricky part that we didn't know what value to assign. IIUC, totalHits means all the documents matched by a query, and we can't really compute the union of the totalHits across all the TopDocs. So I just assigned the min totalHits temporarily, to keep the "greater than or equal to" totalHits relation valid. I'd like to ask for your opinion on this.
I agree with using GREATER_THAN_OR_EQUAL_TO all the time, but I would still take the max, because a document is a match if it matches either query. For instance, imagine combining top hits from two queries where one query didn't match any docs: the total hit count should be the hit count of the other query, not zero?
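The point above can be sketched with plain longs standing in for the per-query lower bounds on hit counts (a simplification, not Lucene's `TotalHits` API):

```java
public class TotalHitsMaxDemo {
  // Stand-in: each entry is a per-query lower bound on the hit count
  // (GREATER_THAN_OR_EQUAL_TO semantics).
  static long fusedTotalHits(long[] perQueryHits) {
    long max = 0;
    for (long hits : perQueryHits) {
      // A doc matching either query is a match globally, so the fused
      // lower bound is at least the largest per-query lower bound.
      max = Math.max(max, hits);
    }
    return max;
  }

  public static void main(String[] args) {
    // One query matched nothing; the fused bound is the other query's count,
    // which is what max gives and min would not.
    System.out.println(fusedTotalHits(new long[] {0, 42})); // prints 42
  }
}
```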
Makes sense to me, thank you for the detailed explanation!
```java
}

List<Map.Entry<Integer, Float>> scoreList = new ArrayList<>(scoreMap.entrySet());
scoreList.sort(Map.Entry.comparingByValue());
```
We don't seem to be using these `scoreMap` and `scoreList` anywhere?
Oops! My bad. I think we got something wrong right here. The for loop traversal `for (ScoreDoc scoreDoc : topDoc.scoreDocs)` is wrong; we should actually traverse the sorted results, i.e. scoreList, to add the ranking results to rrfScore.

```java
int rank = 1;
for (Map.Entry<Integer, Float> entry : scoreList) {
  rrfScore.put(entry.getKey(),
      rrfScore.getOrDefault(entry.getKey(), 0.0f) + 1.0f / (rank + k));
  rank++;
}
```
P.S. For this part, however, I think we should first settle the implementation that combines `ScoreDoc.doc` and `ScoreDoc.shardIndex` together.
```java
}

/** Reciprocal Rank Fusion method. */
public static TopDocs rrf(int TopN, int k, TopDocs[] hits) {
```
nit: function arguments should use lower camel case

```diff
- public static TopDocs rrf(int TopN, int k, TopDocs[] hits) {
+ public static TopDocs rrf(int topN, int k, TopDocs[] hits) {
```
Oops, my bad.
```java
int rank = 1;
for (ScoreDoc scoreDoc : topDoc.scoreDocs) {
  rrfScore.put(scoreDoc.doc, rrfScore.getOrDefault(scoreDoc.doc, 0.0f) + 1.0f / (rank + k));
```
Use `Map#compute` instead of `getOrDefault` + `put`?
Sounds good! I didn't know about `Map#compute` before though. It should be like below:

```java
int rank = 1;
for (Map.Entry<Integer, Float> entry : scoreList) {
  final int currentRank = rank; // effectively final copy for use in the lambda
  rrfScore.compute(entry.getKey(), (key, value) ->
      (value == null ? 0.0f : value) + 1.0f / (currentRank + k));
  rank++;
}
```
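Putting the pieces of this thread together, here is a minimal self-contained sketch of the fusion loop. The `Hit` record is a hypothetical stand-in for Lucene's `ScoreDoc` (doc ID plus shard index), and the parameter checks, merge-based accumulation, and (shardIndex, doc) tie-break reflect the suggestions above; this is an illustration, not the PR's actual code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RrfSketch {
  // Hypothetical stand-in for Lucene's ScoreDoc: doc id plus shard index.
  record Hit(int shardIndex, int doc) {}

  /**
   * Fuse ranked lists with Reciprocal Rank Fusion: score(d) = sum of 1 / (k + rank(d))
   * over every list that contains d. k >= 1 dampens the weight of top ranks;
   * topN caps the output size.
   */
  static List<Map.Entry<Hit, Float>> rrf(int topN, int k, List<List<Hit>> rankings) {
    if (topN < 1 || k < 1) {
      throw new IllegalArgumentException("topN and k must be >= 1");
    }
    Map<Hit, Float> rrfScore = new HashMap<>();
    for (List<Hit> ranking : rankings) {
      int rank = 1;
      for (Hit hit : ranking) {
        // merge accumulates like the compute/getOrDefault variants discussed above.
        rrfScore.merge(hit, 1.0f / (k + rank), Float::sum);
        rank++;
      }
    }
    List<Map.Entry<Hit, Float>> fused = new ArrayList<>(rrfScore.entrySet());
    // Highest fused score first; tie-break on (shardIndex, doc).
    fused.sort(
        Comparator.<Map.Entry<Hit, Float>>comparingDouble(e -> -e.getValue())
            .thenComparingInt(e -> e.getKey().shardIndex())
            .thenComparingInt(e -> e.getKey().doc()));
    return fused.subList(0, Math.min(topN, fused.size()));
  }

  public static void main(String[] args) {
    Hit a = new Hit(0, 1), b = new Hit(0, 2), c = new Hit(0, 3);
    // List 1 ranks a, b, c; list 2 ranks b, c, a.
    List<List<Hit>> rankings = List.of(List.of(a, b, c), List.of(b, c, a));
    // b collects 1/3 + 1/2, the highest fused score, so it comes first.
    System.out.println(rrf(3, 1, rankings).get(0).getKey().doc()); // prints 2
  }
}
```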
```java
ScoreDoc[] rrfScoreDocs = new ScoreDoc[Math.min(TopN, rrfScoreRank.size())];
for (int i = 0; i < rrfScoreDocs.length; i++) {
  rrfScoreDocs[i] = new ScoreDoc(rrfScoreRank.get(i).getKey(), rrfScoreRank.get(i).getValue());
```
Nit: we should preserve the original `shardIndex` that is configured on the `ScoreDoc` object, which identifies the shard it is coming from.
So this was also a tricky part for us. To my understanding, RRF combines search results based on the different ranks of a document in different result lists. We are supposed to combine the ranks of all individual documents, but should a document coming from different shards be treated as different documents?
This is correct. When working with shards, hits should first be combined on a per-query basis using TopDocs#merge, which will set the shardIndex field. And then global top hits can be merged across queries with RRF.
If we do the RRF by setting the unique key as (docid, shardIndex), what would be the difference between TopDocs#rrf and TopDocs#merge? I think an example expresses this better. Suppose we have two shards, we retrieve the top 3 results from each shard, and we run RRF on top of them. There are three documents A, B and C. In Shard1 the top 3 is A -> B -> C; in Shard2 it's B -> C -> A. The original rrf method would compute the rank by aggregating on docid; assume the constant k is 1. The top 3 results would be B (1/(k+2) + 1/(k+1)), then A (1/(k+1) + 1/(k+3)), then C (1/(k+3) + 1/(k+2)). If we are going to consider the shardIndex as part of the unique key as well, how should the RRF ranking be presented?
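The arithmetic in the example above can be checked with plain doubles (k = 1; a sanity check only, not Lucene code):

```java
public class RrfExampleCheck {
  public static void main(String[] args) {
    int k = 1;
    // Shard1 ranks: A=1, B=2, C=3.  Shard2 ranks: B=1, C=2, A=3.
    double scoreA = 1.0 / (k + 1) + 1.0 / (k + 3); // 0.5   + 0.25  = 0.75
    double scoreB = 1.0 / (k + 2) + 1.0 / (k + 1); // 0.333 + 0.5   = 0.833...
    double scoreC = 1.0 / (k + 3) + 1.0 / (k + 2); // 0.25  + 0.333 = 0.583...
    // Fused order is B, A, C, matching the ranking given in the comment.
    System.out.println(scoreB > scoreA && scoreA > scoreC); // prints true
  }
}
```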
When working with shards, shards manage non-overlapping subsets of the data, so you could not have documents A, B and C in both shards.
I'm not sure rrf should be a direct method in TopDocs. N.B. I am generally in favour of the "You Aren't Gonna Need It" approach, but in Lucene's case we have many contributors and future contributors who may get involved, and doing this abstraction work when and if a second strategy gets implemented may not happen.
I'm not worried about this. If we feel like we should expose it differently in the future, we'll do it, deprecate this function, and remove it in Lucene 11.
- Parameters are now validated.
- The shardIndex is now taken into account to identify hits.
- The total hit count is computed as the max total hit count.
- Unit tests.
- Tie-break on doc and shardIndex, consistently with TopDocs#merge.
@harenlin I took the liberty of applying my feedback and pushing it to your branch. Would you like to take a look and check if it makes sense?
I plan on merging this PR soon if there are no objections.
This looks good to me. Perhaps we could mark the new static method experimental, especially if we think we are going to want to support more ways of combining topdocs soon enough. I don't have a strong opinion though, it would also be ok to introduce a more flexible way to do rrf while keeping this one around until the next major.
Thanks for taking a look. I have a bias for the latter, as I was planning on improving the docs of the oal.search package as a follow-up to provide guidance wrt how to do hybrid search by linking to this RRF helper.
```java
for (TopDocs topDocs : hits) {
  for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    shardIndexSet = scoreDoc.shardIndex != -1;
    break outer;
```
Is the purpose here to only check the first scoreDoc of every TopDocs instance provided in the array? Should we try and rewrite this to be more readable and not use goto?
Does it look better now?
yes, thank you!
Co-authored-by: SuperSonicVox <[email protected]> Co-authored-by: Adrien Grand <[email protected]>
I opened #14310.
Description
Hello community,
Hank and I followed the discussion thread to implement an RRF function that can be used. By the way, we know that the RRF issue is under debate in Solr (FYR); however, we think this new feature could still be a good one.