
BM25 values as additional information to predictabilities #28

Closed
mpadge opened this issue Jan 27, 2025 · 4 comments

Comments

@mpadge

mpadge commented Jan 27, 2025

Hi @bnicenboim, I'm part of rOpenSci and have been following the review of your package there. I just wanted to let you know that my pkgmatch package includes efficient C++ routines for calculating BM25 inverse term frequencies, which often provide a useful benchmark for LLM prediction values in the context of a given input corpus. Happy to discuss further if you think these values might provide a useful enhancement here; otherwise, no worries at all if you just want to close this issue. Great work with the package!

@bnicenboim
Owner

Hi!
Thanks for the suggestion. I don't know anything about BM25 inverse term frequencies. I've been going over the links, but something I don't understand is what their corpus is. Is it language-based? Or just the "internet"?

@mpadge
Author

mpadge commented Jan 28, 2025

Yeah, that's the tricky part: the corpus has to be pre-defined. How to do this would depend on the envisioned applications of your package. If, for example, an application were a particular academic research field, then a corpus of published papers could be assembled and used. But the need to define and prepare a corpus may put BM25 out of scope for your package?

It is nevertheless important to note that everything needs a pre-defined corpus; we just happen to have cultivated a culture where that's not openly discussed. BM25 values are very commonly used anyway, and most "hybrid search"-type systems modify search rankings by combining them with BM25. Open and general corpora are often described and linked to in model cards on sites like huggingface.co.
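For concreteness, here is a minimal, textbook Okapi BM25 sketch in R, just to show where the pre-defined corpus enters the calculation (through the inverse document frequencies and the average document length). This is only an illustration, not the pkgmatch implementation, and k1 = 1.2 and b = 0.75 are just conventional defaults:

```r
# Minimal, textbook Okapi BM25 sketch (not the pkgmatch implementation).
# 'corpus' is a list of tokenised documents; 'query' and 'doc' are token vectors.
bm25_score <- function (query, doc, corpus, k1 = 1.2, b = 0.75) {
    n_docs <- length (corpus)
    avg_len <- mean (lengths (corpus))
    score <- 0
    for (term in unique (query)) {
        # Inverse document frequency, defined over the pre-specified corpus:
        n_term <- sum (vapply (corpus, function (d) term %in% d, logical (1L)))
        idf <- log ((n_docs - n_term + 0.5) / (n_term + 0.5) + 1)
        # Length-normalised term frequency within the document being scored:
        tf <- sum (doc == term)
        score <- score + idf * tf * (k1 + 1) /
            (tf + k1 * (1 - b + b * length (doc) / avg_len))
    }
    score
}

corpus <- list (
    c ("the", "cat", "sat", "on", "the", "mat"),
    c ("the", "dog", "chased", "the", "cat"),
    c ("reading", "times", "reflect", "word", "predictability")
)
bm25_score (query = c ("cat", "mat"), doc = corpus [[1]], corpus = corpus)
```

Changing the corpus changes both the IDF values and the average document length, which is why the scores are only meaningful relative to a specified corpus.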

@bnicenboim
Owner

The envisioned use of pangoling is to get the predictability of the words of psycholinguistic stimuli. The stimuli usually represent a language (or a written language). LLMs such as gpt-* models are very useful because they have been trained on a relatively broad corpus of a specific language. There are hundreds of papers that use gpt-* models with reading times, for example:
https://scholar.google.com/scholar?hl=en&as_sdt=7%2C39&q=gpt+%22reading+times%22+psycholinguistics&btnG=

And if I do the same search with BM25, I get two papers, which are not really about reading:
https://scholar.google.com/scholar?hl=en&as_sdt=7%2C39&q=bm25+%22reading+times%22+psycholinguistics&btnG=

So including BM25 would be out of scope for the package right now. But in any case, I think this is interesting. Checking the Wikipedia page, it seems that it's more likely to replace word frequency (which my package also doesn't provide yet) than predictability. Are there other important papers about it?
I can just use the function pkgmatch_bm25 to try it out, right?

@mpadge
Author

mpadge commented Jan 30, 2025

Interestingly, the thing that convinced me to incorporate BM25 in pkgmatch was not an academic paper at all, but this blog post from Anthropic. That was my starting point. Snooping around in codebases (mostly in Python) revealed that BM25 values are used a lot, even where that's not prominently described. It also revealed that they are entirely and exclusively coded in an everything-at-once style, so you have to calculate weightings over a whole corpus each time you want weightings for any set of inputs. My routines just separate those two steps, allowing efficient pre-calculation of corpus weightings, and very fast calculation for inputs from that point on.
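Schematically, that separation looks something like the following (an illustrative R sketch only; the function names here are hypothetical, and the actual pkgmatch routines are implemented in C++):

```r
# Stage 1: pre-compute corpus-level statistics once (the expensive step).
# 'corpus' is a list of tokenised documents.
bm25_corpus_weights <- function (corpus) {
    n_docs <- length (corpus)
    doc_freq <- table (unlist (lapply (corpus, unique)))
    list (
        idf = log ((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1),
        avg_len = mean (lengths (corpus))
    )
}

# Stage 2: score any new input against the pre-computed weights (the fast step).
# Terms absent from the corpus are simply ignored here, for simplicity.
bm25_score_input <- function (query, doc, weights, k1 = 1.2, b = 0.75) {
    query <- unique (query)
    tf <- as.numeric (table (doc) [query])
    tf [is.na (tf)] <- 0
    idf <- as.numeric (weights$idf [query])
    idf [is.na (idf)] <- 0
    sum (idf * tf * (k1 + 1) /
        (tf + k1 * (1 - b + b * length (doc) / weights$avg_len)))
}

corpus <- list (
    c ("the", "cat", "sat", "on", "the", "mat"),
    c ("the", "dog", "chased", "the", "cat")
)
w <- bm25_corpus_weights (corpus) # done once per corpus
bm25_score_input (c ("cat", "mat"), corpus [[1]], w) # cheap for every new input
```

The corpus statistics only need to be computed once, and can then be re-used for arbitrarily many inputs.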

That two-stage process means that you can't just run one function to get values, which creates difficulties like those described in this issue. Until that issue has been addressed, the advice is as written there: look at the script included with the package and adapt the lines there.

Finally, I entirely understand that this is out of scope, so feel free to close. That said, please feel equally free to return to the idea at any later stage. Thanks for considering!
