English stemming problems #13535
Comments
@ojwb I'm having difficulty replicating this part of the report; I find that the EN-language JS stemmer and Python equivalent both stem the word |
I looked into this, and it was caused by me failing to unescape the jsporter code fully. I had fixed up two long regexp lines which had been wrapped and The outputs still don't match for other cases though. Repeating the test, there are 68 differences compared to Snowball's "porter" algorithm, which is not nearly as bad but still not ideal. (Compared to the "english" algorithm, there are actually 2176 differences, so jsporter is definitely closer to "porter".) The main differences are (porter vs jsporter):
I'd actually say jsporter considered by itself does better in all these cases because it conflates e.g. Snowball's "english" stemmer has these improvements too, but also various others, so I'd still suggest standardising on that for both Python and Javascript. It's likely to be easier than adjusting jsporter to exactly match Snowball's "porter" stemmer, while giving the benefit of better English stemming. Here's the patched You can test it on a single word like so:
That's OK - thank you for the report, and for checking that! I'm on-board with migrating from the I'll try to get around to starting the update within the next week or so. |
Thanks for letting us know. I've tried to work out why I didn't use English when I changed to use
I can't work out the provenance of this parser, it's rather strange. It appears to date back to the first revision of proto-Sphinx (then called
This makes sense, I'll propose a PR. A Snowball documentation troubles: I tried to find the documentation, but couldn't easily find the link on snowball's PyPI page; I eventually found https://snowballstem.org/algorithms/. On the list of English stemmers, the first three are Porter/Lovins, with the main English stemmer 'buried' in the list. The brackets are unclear; it turns out they mean 'not recommended', but perhaps it might be an improvement to say 'academic interest only: blah', or list English first, or put it in boldface? The page for the english stemmer (https://snowballstem.org/algorithms/english/stemmer.html) also contains no usage examples, so it's somewhat unclear if I should use |
Haha, sorry, I have no idea anymore :) It may even have been @mitsuhiko, he did a lot of the early HTML/JS part. |
Describe the bug
Hi, snowballstem upstream here.
The recent issue with our 3.0.0 release caused me to notice that you're using our "porter" stemmer, which is really still provided only for academic interest. It aims to be a faithful implementation of Martin Porter's English stemmer as described in his 1980 paper, and may be useful to people trying to reproduce past results which used it. This means that the implementation of "porter" is effectively frozen (we'd only fix deviations from the original paper). In the 45 years since the paper, numerous shortcomings in the algorithm it describes have come to light, and Martin himself has since devised an improved version of the stemmer, which he nicknamed "porter2". You can find the latest version of this as our "english" stemmer.
Looking at https://github.com/sphinx-doc/sphinx/blob/master/sphinx/search/en.py I see there's also a "JS porter" implementation which looks like a hand-written implementation, and my initial thought was "oh, they're using the older and less good porter stemmer because they need to be compatible with that", but looking at the "JS porter" javascript code, it isn't actually an implementation of Porter's 1980 algorithm. For example, line 48 has
logi: 'log'
which is one of the additional rules added in "porter2". My best guess is it's a hand-written implementation of an early version of "porter2".
If I follow how this is being used, you index with the Python "porter" stemmer and search with this Javascript "JS porter" stemmer. If that's correct, searches for some words will fail to match the same word in documentation (e.g. a search for "tautology" won't match "tautology" in the documentation because "porter" will stem it to "tautologi" while "JS porter" will stem it to "tautolog").
I extracted this "JS porter" code to actually verify this, and ran it against Snowball's test suite, which reveals more problems. For example, its undoubling rule is buggy and it stems "wrapped" to "wrapp", while both "porter" and "english" stem it to "wrap", so it's a buggy implementation compared to either. That means a query for "wrapped" will fail to match "wrapped" in a document.
In total 1148 words from our English test vocabulary of 42603 words are stemmed differently by "porter" and "JS porter" - that's 2.7% (counting each word equally rather than trying to weight by frequency, but maybe 1 word in 40 will not match as it should).
However, 5324 out of 42621 words are stemmed differently by "english" (from Snowball 3.0.0) and "JS porter" so that's worse (at least if we assume all words are equally important).
(The 42621 vs 42603 word list size difference is just because the word list for "english" has had a few extra words added over that for "porter" to provide better test coverage for some rule changes.)
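The comparison behind these figures can be sketched roughly as follows. This is a minimal illustration, not the script actually used: `count_differences`, `percent` and the two lookup-table stemmers are hypothetical stand-ins, seeded only with the stems quoted in this report. The percentages are recomputed from the counts given above.

```python
from typing import Callable, Iterable, List


def count_differences(words: Iterable[str],
                      stem_a: Callable[[str], str],
                      stem_b: Callable[[str], str]) -> List[str]:
    """Return the words for which the two stemmers produce different stems."""
    return [w for w in words if stem_a(w) != stem_b(w)]


# Hypothetical stand-ins: lookup tables holding only the stems quoted in
# this report. Any other word is returned unchanged by both.
PORTER_STEMS = {"tautology": "tautologi", "wrapped": "wrap"}
JS_PORTER_STEMS = {"tautology": "tautolog", "wrapped": "wrapp"}


def porter(word: str) -> str:
    return PORTER_STEMS.get(word, word)


def js_porter(word: str) -> str:
    return JS_PORTER_STEMS.get(word, word)


def percent(differing_count: int, total: int) -> float:
    """Share of differing words, counting each word equally."""
    return round(100 * differing_count / total, 1)


vocabulary = ["tautology", "wrapped", "stem"]
differing = count_differences(vocabulary, porter, js_porter)
print(differing)              # ['tautology', 'wrapped']

print(percent(1148, 42603))   # 2.7  ("porter" vs "JS porter")
print(percent(5324, 42621))   # 12.5 ("english" vs "JS porter")
```

Every word in `differing` is one where an index built with the first stemmer and a query processed with the second would fail to meet.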
I'd suggest the best way to resolve this would be to switch from "porter" to "english" (because the latter has improvements from 45 years of experience using the original so is a significantly better stemmer) and replace this "JS porter" implementation with a Javascript version of the same stemmer generated by Snowball. Snowball's upstream testsuite should ensure these produce the same stems (at least if you take them from the same upstream Snowball release, but evolution is slow at this point so even version skew is not going to give you different stems for 2.7% of words).
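As a rough illustration of what the switch looks like on the Python side, the sketch below compares the two algorithms using the `snowballstemmer` package from PyPI. It assumes the package's `stemmer(name)` factory and the `stemWord` method on the object it returns. `compare_algorithms` is a hypothetical helper, not Sphinx code, and it returns `None` when the package or a requested algorithm is unavailable.

```python
def compare_algorithms(words, algorithms=("porter", "english")):
    """Map each word to its stem under each named Snowball algorithm.

    Returns None if snowballstemmer is not installed, or if the installed
    release does not provide one of the requested algorithm names.
    """
    try:
        import snowballstemmer
        stemmers = {name: snowballstemmer.stemmer(name) for name in algorithms}
    except (ImportError, KeyError):
        return None
    return {
        word: {name: s.stemWord(word) for name, s in stemmers.items()}
        for word in words
    }


result = compare_algorithms(["wrapped", "tautology"])
if result is None:
    print("snowballstemmer (or one of the algorithms) is not available")
else:
    for word, stems in result.items():
        print(word, stems)
```

Running the same word list through the Snowball-generated Javascript stemmer from the same release and diffing the two outputs would be one way to confirm that indexing and searching agree.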
It looks like you even already have Snowball-generated Javascript versions for many languages, though they're rather out of date (Snowball 2.1.0 was released 2021-01-21):
https://github.com/sphinx-doc/sphinx/blob/master/sphinx/search/non-minified-js/danish-stemmer.js
How to Reproduce
Compare stems for e.g. "wrapped" from the Python and Javascript code.
Environment Information
Sphinx extensions
Additional context
No response