English stemming problems #13535

Open
ojwb opened this issue May 8, 2025 · 5 comments · May be fixed by #13561

Comments

@ojwb

ojwb commented May 8, 2025

Describe the bug

Hi, snowballstem upstream here.

The recent issue with our 3.0.0 release caused me to notice that you're using our "porter" stemmer, which is really still provided only for academic interest. It aims to be a faithful implementation of Martin Porter's English stemmer as described in his 1980 paper, and may be useful to people trying to reproduce past results which used it. This means that the implementation of "porter" is effectively frozen (we'd only fix deviations from the original paper). In the 45 years since the paper, numerous shortcomings in the algorithm it describes have come to light, and Martin himself has since devised an improved version of the stemmer, which he nicknamed "porter2". You can find the latest version of this as our "english" stemmer.

Looking at https://github.com/sphinx-doc/sphinx/blob/master/sphinx/search/en.py I see there's also a "JS porter" implementation which looks hand-written, and my initial thought was "oh, they're using the older and less good porter stemmer because they need to be compatible with that", but looking at the "JS porter" JavaScript code, it isn't actually an implementation of Porter's 1980 algorithm. For example, line 48 has logi: 'log' which is one of the additional rules added in "porter2". My best guess is it's a hand-written implementation of an early version of "porter2".

If I follow how this is being used, you index with the Python "porter" stemmer and search with this Javascript "JS porter" stemmer. If that's correct, searches for some words will fail to match the same word in documentation (e.g. a search for tautology won't match tautology in the documentation because "porter" will stem it to tautologi while "JS porter" will stem it to tautolog).
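To make that failure mode concrete, here's a minimal sketch in Python. The two stemmer functions are hypothetical stand-ins that just hard-code the stems quoted above (they are not the real "porter" or "JS porter" code), and the tiny inverted index is only an assumption about the general shape of the problem, not how Sphinx actually builds its search index.

```python
from collections import defaultdict

# Hypothetical stand-ins: each just hard-codes the stem quoted above.
def index_stem(word):
    """Plays the role of the index-time stemmer ("porter")."""
    return {"tautology": "tautologi"}.get(word, word)

def query_stem(word):
    """Plays the role of the search-time stemmer ("JS porter")."""
    return {"tautology": "tautolog"}.get(word, word)

def build_index(docs, stem):
    """Map each stemmed term to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in text.lower().split():
            index[stem(word)].add(name)
    return index

def search(index, query, stem):
    """Look up a single-word query after stemming it."""
    return index.get(stem(query.lower()), set())

docs = {"logic.rst": "a tautology is always true"}
index = build_index(docs, index_stem)

# Different stemmers at index and search time: the word misses itself.
mismatched = search(index, "tautology", query_stem)
# Same stemmer on both sides: the word is found.
matched = search(index, "tautology", index_stem)
print(mismatched, matched)
```

The point is that the absolute quality of either stemmer matters less here than the two sides agreeing with each other.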

I extracted this "JS porter" code to actually verify this, and ran it against Snowball's test suite, which reveals more problems. For example, its undoubling rule is buggy and it stems wrapped to wrapp, while both "porter" and "english" stem it to "wrap", so it's a buggy implementation compared to either. That means a query for wrapped will fail to match wrapped in a document.

In total 1148 words from our English test vocabulary of 42603 words are stemmed differently by "porter" and "JS porter" - that's 2.7% (counting each word equally rather than trying to weight by frequency, but maybe 1 word in 40 will not match as it should).

However, 5324 out of 42621 words (about 12.5%) are stemmed differently by "english" (from Snowball 3.0.0) and "JS porter", so that's worse (at least if we assume all words are equally important).

(The 42621 vs 42603 word list size difference is just because the word list for "english" has had a few extra words added over that for "porter" to provide better test coverage for some rule changes.)
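For anyone wanting to check the arithmetic, here's a short Python sketch using only the counts quoted in this comment. The function name and the dictionary labels are made up for illustration, and each word is weighted equally, as above.

```python
# Counts quoted above: (differing stems, vocabulary size).
comparisons = {
    '"porter" vs "JS porter"': (1148, 42603),
    '"english" vs "JS porter"': (5324, 42621),
}

def mismatch_rate(differing, total):
    """Fraction of vocabulary words whose stems differ, each word weighted equally."""
    return differing / total

rates = {label: mismatch_rate(*counts) for label, counts in comparisons.items()}
for label, rate in rates.items():
    print(f"{label}: {rate:.1%} (about 1 word in {round(1 / rate)})")
```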

I'd suggest the best way to resolve this would be to switch from "porter" to "english" (because the latter has improvements from 45 years of experience using the original so is a significantly better stemmer) and replace this "JS porter" implementation with a Javascript version of the same stemmer generated by Snowball. Snowball's upstream testsuite should ensure these produce the same stems (at least if you take them from the same upstream Snowball release, but evolution is slow at this point so even version skew is not going to give you different stems for 2.7% of words).

It looks like you even already have Snowball-generated Javascript versions for many languages, though they're rather out of date (Snowball 2.1.0 was released 2021-01-21):

https://github.com/sphinx-doc/sphinx/blob/master/sphinx/search/non-minified-js/danish-stemmer.js

How to Reproduce

Compare stems for e.g. wrapped from the Python and Javascript code.

Environment Information

Report based on inspecting code in git.

Sphinx extensions

Additional context

No response

@ojwb ojwb added the type:bug label May 8, 2025
@jayaddison
Contributor

I extracted this "JS porter" code to actually verify this, and ran it against Snowball's test suite, which reveals more problems. For example, its undoubling rule is buggy and it stems wrapped to wrapp, while both "porter" and "english" stem it to "wrap", so it's a buggy implementation compared to either. That means a query for wrapped will fail to match wrapped in a document.

@ojwb I'm having difficulty replicating this part of the report; I find that the EN-language JS stemmer and Python equivalent both stem the word wrapped to wrap, meaning that wrapped remains valid as a findable query token.

@ojwb
Author

ojwb commented May 9, 2025

I'm having difficulty replicating this part of the report

I looked into this, and it was caused by me failing to unescape the jsporter code fully. I had fixed up two long regexp lines which had been wrapped and \ inserted, but I'd failed to spot where \\ was escaped to \\\\ in another regexp. Fixing that I do indeed get wrap for wrapped - sorry for the misleading information there.

The outputs still don't match for other cases though. Repeating the test there are 68 differences compared to Snowball's "porter" algorithm which is not nearly as bad, but still not ideal. (Compared to the "english" algorithm, there are actually 2176 differences so jsporter is definitely closer to "porter".)

The main differences are (porter vs jsporter):

  • -ibly (e.g. audibly -> audibli vs audibl)
  • -ology (e.g. tautology -> tautologi vs tautolog)
  • -s on very short inputs (e.g. ms -> m vs ms)

I'd actually say jsporter considered by itself does better in all these cases because it conflates e.g. audibly/audible and tautology/tautological, and doesn't conflate ms/m, but it's problematic to use a different stemmer at index and search time because searches for the affected words won't match themselves in the text that was indexed.
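The comparison itself is easy to sketch in Python. Note that this is not the actual test harness used here: `differing_stems` is a hypothetical helper, and the two lookup tables are stand-ins that hard-code only the stems quoted in this thread, not real stemmer implementations.

```python
def differing_stems(words, stem_a, stem_b):
    """Return {word: (stem_a(word), stem_b(word))} for words whose stems differ."""
    diffs = {}
    for word in words:
        a, b = stem_a(word), stem_b(word)
        if a != b:
            diffs[word] = (a, b)
    return diffs

# Hypothetical stand-ins that hard-code only the stems quoted in this thread.
PORTER = {"audibly": "audibli", "tautology": "tautologi", "ms": "m", "wrapped": "wrap"}
JSPORTER = {"audibly": "audibl", "tautology": "tautolog", "ms": "ms", "wrapped": "wrap"}

# Iterating over PORTER yields its keys, which serve as the test vocabulary.
diffs = differing_stems(PORTER, PORTER.get, JSPORTER.get)
for word, (a, b) in diffs.items():
    print(f"{word}: porter={a} jsporter={b}")
```

Any word that shows up in `diffs` is one that a search would fail to match against itself in the indexed text.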

Snowball's "english" stemmer has these improvements too, but also various others, so I'd still suggest standardising on that for both Python and Javascript. It's likely to be easier than adjusting jsporter to exactly match Snowball's "porter" stemmer, while giving the benefit of better English stemming.


Here's the patched stemwords.js with the correctly unescaped jsporter code in case it is useful to you for testing (I had to add .txt to the filename for github to allow me to attach it):

jsporter-stemwords.js.txt

You can test it on a single word like so:

$ echo wrapped|node jsporter-stemwords.js.txt -i /dev/stdin -o /dev/stdout
wrap

@jayaddison
Contributor

I looked into this, and it was caused by me failing to unescape the jsporter code fully. I had fixed up two long regexp lines which had been wrapped and \ inserted, but I'd failed to spot where \\ was escaped to \\\\ in another regexp. Fixing that I do indeed get wrap for wrapped - sorry for the misleading information there.

That's OK - thank you for the report, and for checking that! I'm on board with migrating from the porter stemmer to the more-developed English stemmer and also updating all of the bundled stemmers (both Python and JS) - especially after reading through some of the recent changelog entries.

I'll try to get around to starting the update within the next week or so.

@AA-Turner
Member

The recent issue with our 3.0.0 release caused me to notice that you're using our "porter" stemmer, which is really still provided only for academic interest.

Thanks for letting us know. I've tried to work out why I didn't use English when I changed to use snowballstemmer in 2022, and encountered a few challenges with the snowball documentation, which I've detailed at the end.

Looking at master/sphinx/search/en.py I see there's also a "JS porter" implementation which looks hand-written, and my initial thought was "oh, they're using the older and less good porter stemmer because they need to be compatible with that", but looking at the "JS porter" JavaScript code, it isn't actually an implementation of Porter's 1980 algorithm. For example, line 48 has logi: 'log' which is one of the additional rules added in "porter2". My best guess is it's a hand-written implementation of an early version of "porter2".

I can't work out the provenance of this stemmer; it's rather strange. It appears to date back to the first revision of proto-Sphinx (then called py-rest-doc). @birkenfeld may know, but it was nearly 20 years ago!

I'd suggest the best way to resolve this would be to switch from "porter" to "english" (because the latter has improvements from 45 years of experience using the original so is a significantly better stemmer) and replace this "JS porter" implementation with a Javascript version of the same stemmer generated by Snowball. Snowball's upstream testsuite should ensure these produce the same stems (at least if you take them from the same upstream Snowball release, but evolution is slow at this point so even version skew is not going to give you different stems for 2.7% of words).

This makes sense, I'll propose a PR.

A


Snowball documentation troubles:

I tried to find the documentation but couldn't easily find a link on snowball's PyPI page; I eventually found https://snowballstem.org/algorithms/. In the list of English stemmers, the first three are Porter/Lovins, with the main English stemmer 'buried' in the list. The brackets are unclear: it turns out they mean 'not recommended', but it might be an improvement to say 'academic interest only: blah', list English first, or put it in boldface.

The page for the english stemmer (https://snowballstem.org/algorithms/english/stemmer.html) also contains no usage examples, so it's somewhat unclear whether I should use snowballstemmer.stemmer('porter2') or 'english' etc.
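For reference, a minimal usage sketch with the snowballstemmer package from PyPI might look like the following. It assumes the package is installed, and it uses 'english' as the algorithm name, matching the name used upstream in this thread; `snowballstemmer.algorithms()` should list the names that are accepted. The `stem_words` wrapper is a hypothetical helper, not part of the package.

```python
# Assumes the third-party snowballstemmer package is installed
# (pip install snowballstemmer); falls back gracefully if it is not.
try:
    import snowballstemmer
except ImportError:
    snowballstemmer = None

def stem_words(words, algorithm="english"):
    """Stem words with the named Snowball algorithm, or return None if unavailable."""
    if snowballstemmer is None:
        return None
    stemmer = snowballstemmer.stemmer(algorithm)
    return stemmer.stemWords(words)

words = ["wrapped", "tautology", "audibly"]
for name in ("porter", "english"):
    print(name, stem_words(words, name))
```

Running both algorithms side by side like this is a quick way to see where "porter" and "english" disagree on a given word list.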

@birkenfeld
Member

Haha, sorry, I have no idea anymore :) It may even have been @mitsuhiko, he did a lot of the early HTML/JS part.

@AA-Turner AA-Turner linked a pull request May 16, 2025 that will close this issue

4 participants