
BM25 values as additional information to predictabilities #28

Closed
mpadge opened this issue Jan 27, 2025 · 4 comments

Comments

@mpadge

mpadge commented Jan 27, 2025

Hi @bnicenboim, I'm part of rOpenSci and have been following the review of your package there. I just wanted to let you know that my pkgmatch package includes efficient C++ routines for calculating BM25 inverse term frequencies, which often provide a useful benchmark for LLM prediction values in the context of a given input corpus. Happy to discuss further if you think these values might provide a useful enhancement here; otherwise, no worries at all if you just want to close this issue. Great work with the package!

@bnicenboim
Owner

Hi!
Thanks for the suggestion. I don't know anything about BM25 inverse term frequencies. I've been going over the links, but something I don't understand is what their corpus is. Is it language-based? Or just the "internet"?

@mpadge
Author

mpadge commented Jan 28, 2025

Yeah, that's the tricky part: the corpus has to be pre-defined. How to do this would depend on the envisioned applications of your package. If, for example, an application were a particular academic research field, then a corpus of published papers could be assembled and used. But the need to define and prepare a corpus may put BM25 out of scope for your package?

It is nevertheless important to note that everything needs a pre-defined corpus; we just happen to have cultivated a culture where that's not openly discussed. BM25 values are very commonly used anyway, and most "hybrid search"-type systems modify search rankings by combining them with BM25. Open and general corpora are often described and linked to in model cards on sites like huggingface.co.
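For concreteness, here is a minimal, textbook Okapi BM25 sketch in R, just to show where the pre-defined corpus enters the calculation (through the inverse document frequencies and the average document length). This is only an illustration, not the pkgmatch implementation, and k1 = 1.2 and b = 0.75 are just conventional defaults:

```r
# Minimal, textbook Okapi BM25 sketch (not the pkgmatch implementation).
# 'corpus' is a list of tokenised documents; 'query' and 'doc' are token vectors.
bm25_score <- function (query, doc, corpus, k1 = 1.2, b = 0.75) {
    n_docs <- length (corpus)
    avg_len <- mean (lengths (corpus))
    score <- 0
    for (term in unique (query)) {
        # Inverse document frequency, defined over the pre-specified corpus:
        n_term <- sum (vapply (corpus, function (d) term %in% d, logical (1L)))
        idf <- log ((n_docs - n_term + 0.5) / (n_term + 0.5) + 1)
        # Length-normalised term frequency within the document being scored:
        tf <- sum (doc == term)
        score <- score + idf * tf * (k1 + 1) /
            (tf + k1 * (1 - b + b * length (doc) / avg_len))
    }
    score
}

corpus <- list (
    c ("the", "cat", "sat", "on", "the", "mat"),
    c ("the", "dog", "chased", "the", "cat"),
    c ("reading", "times", "reflect", "word", "predictability")
)
bm25_score (query = c ("cat", "mat"), doc = corpus [[1]], corpus = corpus)
```

Changing the corpus changes both the IDF values and the average document length, which is why the scores are only meaningful relative to a specified corpus.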

@bnicenboim
Owner

The envisioned use of pangoling is to get the predictability of the words of psycholinguistic stimuli. The stimuli usually represent a language (or a written language). LLMs such as gpt-* models are very useful because they have been trained on a relatively broad corpus of a specific language. There are hundreds of papers that use gpt-* models with reading times, for example:
https://scholar.google.com/scholar?hl=en&as_sdt=7%2C39&q=gpt+%22reading+times%22+psycholinguistics&btnG=

And if I do the same search with BM25, I get two papers, which are not really about reading:
https://scholar.google.com/scholar?hl=en&as_sdt=7%2C39&q=bm25+%22reading+times%22+psycholinguistics&btnG=

So including BM25 would be out of scope for the package right now. But in any case, I think this is interesting. Checking the Wikipedia page, it seems that it's more likely to replace word frequency (which my package also doesn't provide yet) than predictability. Are there other important papers about it?
I can just use the function pkgmatch_bm25 to try it out, right?

@mpadge
Author

mpadge commented Jan 30, 2025

Interestingly, the thing that convinced me to incorporate BM25 in pkgmatch was not an academic paper at all, but this blog post from Anthropic. That was my starting point. Snooping around in codebases (mostly in Python) revealed that BM25 values are used a lot, even where that's not prominently described. It also revealed that they are entirely and exclusively coded in an everything-at-once style, so you have to calculate weightings over a whole corpus each time you want weightings for any set of inputs. My routines just separate those two steps, allowing efficient pre-calculation of corpus weightings, and very fast calculation for inputs from that point on.
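Schematically, that separation looks something like the following (an illustrative R sketch only; the function names here are hypothetical, and the actual pkgmatch routines are implemented in C++):

```r
# Stage 1: pre-compute corpus-level statistics once (the expensive step).
# 'corpus' is a list of tokenised documents.
bm25_corpus_weights <- function (corpus) {
    n_docs <- length (corpus)
    doc_freq <- table (unlist (lapply (corpus, unique)))
    list (
        idf = log ((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1),
        avg_len = mean (lengths (corpus))
    )
}

# Stage 2: score any new input against the pre-computed weights (the fast step).
# Terms absent from the corpus are simply ignored here, for simplicity.
bm25_score_input <- function (query, doc, weights, k1 = 1.2, b = 0.75) {
    query <- unique (query)
    tf <- as.numeric (table (doc) [query])
    tf [is.na (tf)] <- 0
    idf <- as.numeric (weights$idf [query])
    idf [is.na (idf)] <- 0
    sum (idf * tf * (k1 + 1) /
        (tf + k1 * (1 - b + b * length (doc) / weights$avg_len)))
}

corpus <- list (
    c ("the", "cat", "sat", "on", "the", "mat"),
    c ("the", "dog", "chased", "the", "cat")
)
w <- bm25_corpus_weights (corpus) # done once per corpus
bm25_score_input (c ("cat", "mat"), corpus [[1]], w) # cheap for every new input
```

The corpus statistics only need to be computed once, and can then be re-used for arbitrarily many inputs.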

That two-stage process means that you can't just run one function to get values, which creates difficulties like those described in this issue. Until that issue has been addressed, the advice is as written there: look at the script included with the package and adapt the lines there.

Finally, I entirely understand that this is out of scope, so feel free to close. That said, please feel equally free to return to the idea at any later stage. Thanks for considering!
