
Conversation

@harenlin harenlin commented Jun 7, 2024

Description

Hello to the community,

Hank and I followed the discussion thread to implement an RRF function that can be used directly. By the way, we know that RRF is still under debate in Solr (FYR); however, we think this could still be a good feature to add.
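
For readers new to the technique, Reciprocal Rank Fusion scores each document by summing the reciprocal of its rank across the individual result lists, offset by a constant k (this is the standard formulation from the literature, not code from this PR):

    score(d) = \sum_{r \in R} \frac{1}{k + \operatorname{rank}_r(d)}

Documents that rank highly in several lists accumulate a larger fused score and therefore end up near the top of the combined results.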

@jpountz jpountz left a comment

Thanks for looking into this! This looks like what I'd expect for RRF in Lucene. I left some comments, could you also add some tests?

}
}

/** Reciprocal Rank Fusion method. */

Contributor

Users could use more details in these javadocs, e.g. what are k and topN?
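
A minimal sketch of what such javadocs could say (the wording below is illustrative, not the text that was eventually committed):

/**
 * Combines the provided {@link TopDocs} instances using Reciprocal Rank Fusion.
 *
 * @param topN the maximum number of fused hits to return
 * @param k the RRF constant added to every rank before taking its reciprocal; larger
 *     values flatten the contribution of top-ranked hits (must be at least 1)
 * @param hits the per-query top hits to combine
 */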

Contributor

Do we have to check if the k value is greater than or equal to 1 in the code? And maybe mention it in the javadocs?

Contributor

+1 to validating parameters
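
A minimal sketch of such validation, assuming the rrf(topN, k, hits) signature from this PR (the exact messages are illustrative):

if (topN < 1) {
  throw new IllegalArgumentException("topN must be >= 1, got " + topN);
}
if (k < 1) {
  throw new IllegalArgumentException("k must be >= 1, got " + k);
}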


/** Reciprocal Rank Fusion method. */
public static TopDocs rrf(int TopN, int k, TopDocs[] hits) {
Map<Integer, Float> rrfScore = new HashMap<>();

Contributor

Note that we should identify documents not only by their doc ID, but by the combination of ScoreDoc.shardIndex and ScoreDoc.doc, as there could be multiple documents that have the same doc ID but come from different shards.
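
One way to key hits on both values, sketched here with a hypothetical record rather than the representation the PR finally used:

// Hypothetical composite key; the merged code may represent this differently.
record ShardDoc(int shardIndex, int doc) {}

Map<ShardDoc, Float> rrfScore = new HashMap<>();
// inside the ranking loop:
rrfScore.merge(new ShardDoc(scoreDoc.shardIndex, scoreDoc.doc), 1.0f / (rank + k), Float::sum);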

Map<Integer, Float> rrfScore = new HashMap<>();
long minHits = Long.MAX_VALUE;
for (TopDocs topDoc : hits) {
minHits = Math.min(minHits, topDoc.totalHits.value);

Contributor

I wonder if it should be a max rather than a min? Presumably, hits from any of the top hits are considered hits globally, and we are just using RRF to boost hits that are ranked high in multiple top hits?

Contributor

@hack4chang hack4chang Jun 8, 2024

The totalHits was a tricky part: we didn't know what value to assign to it. IIUC, totalHits means all of the documents matched by a query, and we couldn't really compute the union of the totalHits across all the TopDocs. So I just assigned the min totalHits temporarily, to satisfy the totalHits relation "greater than or equal to". I'd like to ask for your opinion on this.

Contributor

I agree with using GREATER_THAN_OR_EQUAL_TO all the time, but I would still take the max, because a document is a match if it is a match to either query. For instance, imagine combining top hits from two queries where one query didn't match any docs: the total hit count should be the hit count of the other query, not zero?
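
A sketch of computing the combined count that way, using the TotalHits class that TopDocs already exposes (variable names are illustrative):

long maxTotalHits = 0;
for (TopDocs topDocs : hits) {
  maxTotalHits = Math.max(maxTotalHits, topDocs.totalHits.value);
}
TotalHits combined = new TotalHits(maxTotalHits, TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO);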

Contributor

Makes sense to me, thank you for the detailed explanation!!

}

List<Map.Entry<Integer, Float>> scoreList = new ArrayList<>(scoreMap.entrySet());
scoreList.sort(Map.Entry.comparingByValue());

Contributor

We don't seem to be using these scoreMap and scoreList anywhere?

Contributor Author

@harenlin harenlin Jun 9, 2024

Oops! My bad. I think we got something wrong here. The loop for (ScoreDoc scoreDoc : topDoc.scoreDocs) is wrong; we should actually traverse the sorted results, i.e. scoreList, to add the ranking result to rrfScore.

int rank = 1;
for (Map.Entry<Integer, Float> entry : scoreList) {
    rrfScore.put(entry.getKey(), 
                 rrfScore.getOrDefault(entry.getKey(), 0.0f) + 1.0f / (rank + k));
    rank++;
}

P.S. For this part, however, I think we should first settle on how ScoreDoc.doc and ScoreDoc.shardIndex get combined into a single key.

}

/** Reciprocal Rank Fusion method. */
public static TopDocs rrf(int TopN, int k, TopDocs[] hits) {

Contributor

nit: function arguments should use lower camel case

Suggested change
public static TopDocs rrf(int TopN, int k, TopDocs[] hits) {
public static TopDocs rrf(int topN, int k, TopDocs[] hits) {

Contributor

Oops, my bad.


int rank = 1;
for (ScoreDoc scoreDoc : topDoc.scoreDocs) {
rrfScore.put(scoreDoc.doc, rrfScore.getOrDefault(scoreDoc.doc, 0.0f) + 1.0f / (rank + k));

Contributor

Use Map#compute instead of getOrDefault + put?

Contributor Author

Sounds good! I didn't know about Map#compute before though. It should look like the code below:

int rank = 1;
for (Map.Entry<Integer, Float> entry : scoreList) {
    final int currentRank = rank; // the lambda can only capture effectively final locals
    rrfScore.compute(entry.getKey(), (key, value) -> 
        (value == null ? 0.0f : value) + 1.0f / (currentRank + k)
    );
    rank++;
}


ScoreDoc[] rrfScoreDocs = new ScoreDoc[Math.min(TopN, rrfScoreRank.size())];
for (int i = 0; i < rrfScoreDocs.length; i++) {
rrfScoreDocs[i] = new ScoreDoc(rrfScoreRank.get(i).getKey(), rrfScoreRank.get(i).getValue());

Contributor

Nit: we should preserve the original shardIndex that is configured on the ScoreDoc object that identifies the shard that it is coming from.
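
For reference, ScoreDoc has a constructor that also takes the shard index, so the fused hit could carry it through (variable names here are illustrative):

rrfScoreDocs[i] = new ScoreDoc(doc, fusedScore, shardIndex); // preserves the originating shard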

Contributor

So this was also a tricky part for us. To my understanding, RRF combines search results based on the different ranks of a document in the different result sets. We are supposed to combine the ranks for all individual documents, but a document coming from different shards should be treated as different documents?

Contributor

This is correct. When working with shards, hits should first be combined on a per-query basis using TopDocs#merge, which will set the shardIndex field. And then global top hits can be merged across queries with RRF.
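
A sketch of that two-step flow for two queries, using the existing TopDocs.merge API and the rrf method added in this PR (variable names are illustrative):

// 1. Per query: merge the per-shard top hits; this is where shardIndex gets set.
TopDocs queryATop = TopDocs.merge(topN, perShardHitsForQueryA);
TopDocs queryBTop = TopDocs.merge(topN, perShardHitsForQueryB);

// 2. Across queries: fuse the two global top-hit lists with RRF.
TopDocs fused = TopDocs.rrf(topN, k, new TopDocs[] {queryATop, queryBTop});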

Contributor

If we do the rrf by using the docid and shardIndex together as the unique key, what would be the difference between TopDocs#rrf and TopDocs#merge? I think giving an example could express this better. Suppose that we have two shards, and we want to retrieve the top 3 results from each shard and do rrf on top of them. There are three documents A, B and C. In Shard1, the top 3 is A -> B -> C. In Shard2, it's B -> C -> A. The original rrf method would calculate the score by aggregating on the docid; assume the constant k is 1. The top 3 results would be B (1/(k+2) + 1/(k+1)) - A (1/(k+1) + 1/(k+3)) - C (1/(k+3) + 1/(k+2)).
If we are going to consider the shardIndex as part of the unique key as well, how should the rrf ranking be presented?
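
Plugging k = 1 into that example, the fused scores work out to:

B: 1/(1+2) + 1/(1+1) = 1/3 + 1/2 = 5/6 ≈ 0.83
A: 1/(1+1) + 1/(1+3) = 1/2 + 1/4 = 3/4 = 0.75
C: 1/(1+3) + 1/(1+2) = 1/4 + 1/3 = 7/12 ≈ 0.58

so the fused order is B, A, C as stated above.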

Contributor

When working with shards, shards manage non-overlapping subsets of the data, so you could not have documents A, B and C in both shards.

@alessandrobenedetti
Contributor

I'm not sure 'rrf' should be a direct method in TopDocs:
Reciprocal Rank Fusion is just one way of combining result sets; if in the future we want to add other algorithms, having 'rrf' there may encourage people to just keep adding methods to TopDocs.
What about having a "combine" method there, potentially taking the combining strategy as input?
Then abstract the combining strategy as an interface/abstract class and implement Reciprocal Rank Fusion as the first available strategy? That should ease the process of adding more strategies and prevent TopDocs from becoming too cluttered in the future.

N.B. I am generally in favour of the "You Aren't Gonna Need It" approach, but in Lucene's case we have many contributors and future contributors that may get involved, and doing this abstraction work when and if a second strategy gets implemented may never happen.
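
A rough sketch of what that abstraction could look like (hypothetical names; nothing like this exists in the PR):

import org.apache.lucene.search.TopDocs;

/** Hypothetical strategy interface for combining per-query top hits. */
interface TopDocsCombiner {
  TopDocs combine(int topN, TopDocs[] hits);
}

/** Reciprocal Rank Fusion as the first concrete strategy. */
final class ReciprocalRankFusionCombiner implements TopDocsCombiner {
  private final int k;

  ReciprocalRankFusionCombiner(int k) {
    this.k = k;
  }

  @Override
  public TopDocs combine(int topN, TopDocs[] hits) {
    return TopDocs.rrf(topN, k, hits); // delegates to the method added in this PR
  }
}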

@jpountz
Contributor

jpountz commented Jun 9, 2024

I'm not worried about this. If we feel like we should expose it differently in the future, we'll do it, deprecate this function, and remove it in Lucene 11.

hack4chang and others added 3 commits June 12, 2024 09:40
 - Parameters are now validated.
 - The shardIndex is now taken into account to identify hits.
 - The total hit count is computed as the max total hit count.
 - Unit tests.
 - Tie break on doc and shardIndex, consistently with TopDocs#merge.
@jpountz
Contributor

jpountz commented Feb 14, 2025

@harenlin I took the liberty of applying my feedback and pushing it to your branch. Would you like to take a look and check if it makes sense?

@jpountz jpountz marked this pull request as ready for review February 21, 2025 13:31
@jpountz
Contributor

jpountz commented Feb 21, 2025

I plan on merging this PR soon if there are no objections.

@javanna
Contributor

javanna commented Feb 24, 2025

This looks good to me. Perhaps we could mark the new static method experimental, especially if we think we are going to want to support more ways of combining topdocs soon enough. I don't have a strong opinion though, it would also be ok to introduce a more flexible way to do rrf while keeping this one around until the next major.

@jpountz
Contributor

jpountz commented Feb 24, 2025

Thanks for taking a look. I have a bias for the latter, as I was planning on improving the docs of the oal.search package as a follow-up to provide guidance wrt how to do hybrid search by linking to this RRF helper.

for (TopDocs topDocs : hits) {
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
shardIndexSet = scoreDoc.shardIndex != -1;
break outer;

Contributor

Is the purpose here to only check the first scoreDoc of every TopDocs instance provided in the array? Should we try and rewrite this to be more readable and not use goto?
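
One possible rewrite without the labeled break, shown only as a sketch and not necessarily what was pushed to the branch (it looks at the first scoreDoc of the first non-empty TopDocs, as the labeled break above does):

boolean shardIndexSet = false;
for (TopDocs topDocs : hits) {
  if (topDocs.scoreDocs.length > 0) {
    shardIndexSet = topDocs.scoreDocs[0].shardIndex != -1;
    break;
  }
}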

Contributor

Does it look better now?

Contributor

yes, thank you!

@jpountz jpountz added this to the 10.2.0 milestone Feb 27, 2025
@jpountz jpountz merged commit 1ae2655 into apache:main Feb 27, 2025
6 checks passed
jpountz added a commit that referenced this pull request Feb 27, 2025
Co-authored-by: SuperSonicVox <[email protected]>
Co-authored-by: Adrien Grand <[email protected]>
@jpountz
Contributor

jpountz commented Feb 27, 2025

I have a bias for the latter, as I was planning on improving the docs of the oal.search package as a follow-up to provide guidance wrt how to do hybrid search by linking to this RRF helper.

I opened #14310.

hanbj pushed a commit to hanbj/lucene that referenced this pull request Mar 12, 2025
Co-authored-by: SuperSonicVox <[email protected]>
Co-authored-by: Adrien Grand <[email protected]>