
Update stemming and Snowball #13561

Open · wants to merge 14 commits into master

Conversation

@AA-Turner (Member) commented May 16, 2025

Purpose

Follows on from Olly Betts' #13535.

This PR changes the stemmer for English to use the more modern 'english' instead of 'porter'. We also automate the creation of stopword sets from the data files on the snowball website, and update the JavaScript files to v3.0.1.
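
For illustration, here is a minimal sketch of that stopword automation, assuming the stop.txt layout used on the Snowball website (one word per line, with | starting a comment that runs to the end of the line); the URL shown is the English list:

```python
import requests

STOP_TXT_URL = 'https://snowballstem.org/algorithms/english/stop.txt'


def parse_stopwords(text: str) -> frozenset[str]:
    """Collect the words before any '|' comment on each line."""
    words: set[str] = set()
    for line in text.splitlines():
        content = line.split('|', 1)[0]  # drop the comment
        words.update(content.split())
    return frozenset(words)


response = requests.get(STOP_TXT_URL, timeout=30)
response.raise_for_status()
stopwords = parse_stopwords(response.text)
print(sorted(stopwords)[:10])
```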

Open questions:

  • The files that Snowball generates aren't ready for use on the web, as far as I understand, because they use Node.js-specific require() functions and module.exports. Is there a way to avoid this? I just ran make dist_libstemmer_js.
  • Is there a better way to get stopwords than by parsing the stop.txt files on the Snowball website?
  • Do we need to keep the JS and Python versions in lockstep, meaning updating "snowballstemmer>=2.2" to "snowballstemmer==3.0.1" in pyproject.toml?

We still use the English stemmer in zh.py, but I think that's because there's no Mandarin/Cantonese stemmer. I won't claim to understand the rationale or background here, though.

A

cc @ojwb

References

FYI/cc @mitya57 as the author of the previous Snowball upgrade.

@ojwb commented May 16, 2025

  • The files that Snowball generates aren't ready for use on the web, as far as I understand, because they use Node.js-specific require() functions and module.exports. Is there a way to avoid this? I just ran make dist_libstemmer_js.

There's work in progress to improve this, but it's rather hampered by my lack of familiarity with modern JavaScript and by people who are familiar not being very responsive to questions in issues/PRs. (I understand we're all busy, but that is why this isn't resolved yet.)

I think we're probably going to move to producing ES6 JavaScript with ESM modules, which apparently should work in all modern web browsers, and also server-side with both Node and Deno.

The current immediate blocker to progress is how to produce a minimised version for web use that is of similar size to what we can currently get with closure-compiler, as that's what we use to make the website demo. Unfortunately closure-compiler doesn't like the type annotations in the JS code produced by the open Snowball PR for ES6 generation, and the other options for minifying I've tried don't do nearly as good a job of reducing the code size.

  • Is there a better way to get stopwords than by parsing the stop.txt files on the Snowball website?

Maybe we should offer versions with the comments stripped, but it's pretty trivial to do that yourself, and stopword lists are somewhat domain-specific, so they're really just meant as starting points - it's expected people will want to consider removing or adding entries, for which the comments are useful. Also, if you're using the list to avoid even indexing stopwords, you need to be more conservative than if you're applying it by default at search time.

  • Do we need to keep the JS and Python versions in lockstep, meaning updating "snowballstemmer>=2.2" to "snowballstemmer==3.0.1" in pyproject.toml?

Ideally you want to use identical versions since different versions may stem some words differently in some languages, but minor changes are usually not problematic. For example, Greek in 3.0.1 vs 2.2 stems ισα to ισ instead of an empty string, but a search for ισα in 2.2 wouldn't match ισα in a document anyway. Italian in 3.0.1 vs 2.2 only differs by stemming divano to divan instead of div (which unfortunately collides with diva); a search for divano will not match divano in a document indexed using 2.2, but it will now match divan in a document (indexed with either version) which it wouldn't before.
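
Such differences are easy to reproduce with the snowballstemmer package (stemWord is its single-word API); a minimal check of the Italian example above:

```python
import snowballstemmer  # e.g. pip install snowballstemmer==3.0.1

stemmer = snowballstemmer.stemmer('italian')
# Prints 'divan' under 3.0.1; 2.2 produced 'div' for the same word.
print(stemmer.stemWord('divano'))
```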

It's pretty common to see something like the situation with Italian here where a stemming change has positive and negative effects for existing indexed data so is close to neutral overall (but is an improvement after reindexing).

For some languages the changes in 3.0.0 were substantial - we've switched Dutch to a completely different stemming algorithm and many words now have different stems (e.g. the new algorithm doubles vowels rather than undoubling, so maan and manen both stem to maan now instead of man before); German now handles text with ä, ö, ü written as ae, oe, ue, which would make mixed-version use problematic with text containing these transliterated forms. Mixing versions across sweeping changes like these is likely to be problematic, but they are very rare - 3.0.0 is the first major bump in 5.5 years (and 2.0.0 really just signified the first version released from snowballstem.org after Martin Porter retired from development).

Generally a major version bump signifies you might need to reindex, but that may only be necessary for some languages. E.g. the Danish and Portuguese stemmers haven't changed functionally at all since v2.0.0.

Something else to be aware of is that the Python snowballstemmer module will actually use the stemmers from PyStemmer instead if that's installed. (PyStemmer is a Python wrapper around the C versions of the Snowball stemmers, which is much faster than the pure Python, so this provides an easy way for users to accelerate stemming just by installing PyStemmer.) However, this means snowballstemmer==3.0.1 might not get you the 3.0.1 stemmers if PyStemmer is installed.
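
Since snowballstemmer falls back to pure Python only when PyStemmer's extension module is absent, a rough way to check which implementation will be used is to test for that module (Stemmer is the module name PyStemmer installs):

```python
# Rough check: snowballstemmer delegates to PyStemmer's C stemmers
# when the 'Stemmer' extension module (provided by PyStemmer) imports.
try:
    import Stemmer  # noqa: F401
except ImportError:
    print('pure-Python stemmers from snowballstemmer')
else:
    print('C stemmers via PyStemmer')
```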

I'm not sure there's a great solution to keeping the different language implementations in step. Maybe the best way would be a PyPI package of snowballstemmer which also provided matching JavaScript code?

We still use the English stemmer in zh.py, but I think that's because there's no Mandarin/Cantonese stemmer. I won't claim to understand the rationale or background here, though.

Chinese doesn't have inflected forms, so doesn't need a stemmer. It is usually written without explicit word breaks, so needs word boundary identification (or an alternative approach like indexing n-grams instead of words) but that's a different problem to stemming.
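
As an illustration of the n-gram alternative mentioned above - a sketch, not what zh.py does - character bigrams sidestep word-boundary identification entirely:

```python
def char_bigrams(text: str) -> list[str]:
    """Index overlapping two-character units instead of words."""
    chars = [c for c in text if not c.isspace()]
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


print(char_bigrams('搜索引擎'))  # ['搜索', '索引', '引擎']
```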

Looks like zh.py assumes that spans of Latin-alphabet text amongst Chinese text are English, and that they are useful to stem as such.

@AA-Turner (Member, Author):

The current immediate blocker to progress is how to produce a minimised version for web use that is of similar size to what we can currently get with closure-compiler, as that's what we use to make the website demo. Unfortunately closure-compiler doesn't like the type annotations in the JS code produced by the open Snowball PR for ES6 generation, and the other options for minifying I've tried don't do nearly as good a job of reducing the code size.

We use uglifyjs, which appears competitive: npx uglifyjs sphinx/search/non-minified-js/*.js --compress --mangle -o tmp.js gives a 281KB file, and https://snowballstem.org/js/stemmers.js is 292KB.

This project appears to be a reasonable benchmark of different minifiers: https://github.com/privatenumber/minification-benchmarks#-results

import requests

SNOWBALL_VERSION = '3.0.1'
SNOWBALL_URL = f'https://github.com/snowballstem/snowball/archive/refs/tags/v{SNOWBALL_VERSION}.tar.gz'
Contributor:

Although I recognize that this is a utility script, and therefore any changes introduced by it are likely to go through code review etc., I'm generally fairly strongly in favour of pinning checksums alongside downloads of static content.

In other words: because we know that we're downloading v3.0.1 of snowball here, I think we could/should assert that the SHA256sum of the resulting download matches an expected value.

There is a small chance that GitHub GZ compression might change in future, as they have once before -- but such events should be rare, so I don't think it would be worth being clever and trying to checksum the .tar or otherwise determine the inner contents of the archive.

Suggested change:
  SNOWBALL_URL = f'https://github.com/snowballstem/snowball/archive/refs/tags/v{SNOWBALL_VERSION}.tar.gz'
+ SNOWBALL_SHA256 = '80ac10ce40dc4fcfbfed8d085c457b5613da0e86a73611a3d5527d044a142d60'
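
A minimal sketch of that verification step, assuming the script downloads the tarball with requests as elsewhere in this file:

```python
import hashlib

import requests

SNOWBALL_VERSION = '3.0.1'
SNOWBALL_URL = f'https://github.com/snowballstem/snowball/archive/refs/tags/v{SNOWBALL_VERSION}.tar.gz'
SNOWBALL_SHA256 = '80ac10ce40dc4fcfbfed8d085c457b5613da0e86a73611a3d5527d044a142d60'

archive = requests.get(SNOWBALL_URL, timeout=60).content
digest = hashlib.sha256(archive).hexdigest()
if digest != SNOWBALL_SHA256:
    msg = f'checksum mismatch for {SNOWBALL_URL}: got {digest}'
    raise RuntimeError(msg)
```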

STOPWORDS_DIR = SEARCH_DIR / '_stopwords'
NON_MINIFIED_JS_DIR = SEARCH_DIR / 'non-minified-js'

STOPWORD_URLS = (
Contributor:

@ojwb are stopwords for multiple languages available as a combined download, or do we need to collect each file individually (as here)?

Contributor:

(Apologies; I've only just noticed that this somewhat duplicates previous discussion. Even so, it would be convenient.)

@ojwb:

That seems a good idea - I'd just thought about the comment stripping aspect in the other thread, but being able to grab all the lists in one download would be handy.

Contributor:

Thank you - and FWIW: the specific use-case I have in mind for this is to allow adding checksums for the stopwords files (similar to another comment I left about the snowball tarball download).

Assuming that updates to those files would occur similarly to localization/translation files - e.g. they may occur piecemeal and in somewhat unpredictable order, but can generally be bundled and approved for a release version - including a snapshot of them in a versioned file could be convenient.
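
A sketch of that idea: compute and snapshot a digest per stopwords file into a versioned manifest (the directory matches this PR's layout; the manifest filename is hypothetical):

```python
import hashlib
import json
from pathlib import Path

STOPWORDS_DIR = Path('sphinx/search/_stopwords')
MANIFEST = STOPWORDS_DIR / 'checksums.json'  # hypothetical filename

manifest = {
    path.name: hashlib.sha256(path.read_bytes()).hexdigest()
    for path in sorted(STOPWORDS_DIR.glob('*.txt'))
}
MANIFEST.write_text(json.dumps(manifest, indent=2) + '\n')
```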

(Several comments here between @jayaddison and @ojwb were marked as outdated or resolved.)

@AA-Turner (Member, Author):

There's work in progress to improve this, but it's rather hampered by my lack of familiarity with modern JavaScript and by people who are familiar not being very responsive to questions in issues/PRs. (I understand we're all busy, but that is why this isn't resolved yet.)

I've had a go in snowballstem/snowball#234.

I think we're probably going to move to producing ES6 JavaScript with ESM modules, which apparently should work in all modern web browsers, and also server-side with both Node and Deno.

+1. I think Sphinx would be fine migrating to modules with no delay, as we have always bundled the JavaScript.

Something else to be aware of is that the Python snowballstemmer module will actually use the stemmers from PyStemmer instead if that's installed ... However, this means snowballstemmer==3.0.1 might not get you the 3.0.1 stemmers if PyStemmer is installed.

We mention PyStemmer in the documentation (and I've added it to the docs.python.org build system), but I hadn't realised that the stemmer data itself would change with the package, thanks for pointing it out.

I'm not sure there's a great solution to keeping the different language implementations in step. Maybe the best way would be a PyPI package of snowballstemmer which also provided matching JavaScript code?

I don't want to ask you to do more packaging work; I'm aware of the pain involved. I think we should be fine with just this PR. It's probably somewhat rare to use stemmers in two different programming languages.

Your comparison may be of very different things though - at least if you ran it on sphinx git master, that has 15 stemmers + base-stemmer.js whereas the Snowball website has 33 + base-stemmer.js and the JS code for the demo itself. The stemmers do vary wildly in code size (from under 7K for hindi to 128K for serbian), but e.g. serbian and greek are the largest two JS files and not in sphinx git master.

I ran the comparison on this branch, with all stemmers generated.

Unhelpfully, the benchmark just seems to show an X (a failed run) for what we're currently using on snowballstem.org.

I see, rather unhelpful. They note: "google-closure-compiler: A heavy misstep even at the starting gate, failing on "react" due to a critical configuration issue. It’s still a solid minifier if configured correctly, but good luck setting it up!" It seems a shame they didn't try harder with the configuration, though.

A

AA-Turner added 4 commits May 18, 2025 (including merges resolving conflicts in CHANGES.rst, sphinx/search/_stopwords/*, sphinx/search/en.py, sphinx/search/zh.py, and sphinx/search/minified-js/README.rst)

@ojwb commented May 19, 2025

Something else to be aware of is that the Python snowballstemmer module will actually use the stemmers from PyStemmer instead if that's installed ... However, this means snowballstemmer==3.0.1 might not get you the 3.0.1 stemmers if PyStemmer is installed.

We mention PyStemmer in the documentation (and I've added it to the docs.python.org build system), but I hadn't realised that the stemmer data itself would change with the package, thanks for pointing it out.

It's really stemmer code rather than stemmer data - snowballstemmer uses pure Python code for the stemmers (generated from the Snowball code) whereas PyStemmer is a Python C extension using C code for the stemmers (also generated from the Snowball code).

Possibly snowballstemmer should only forward to PyStemmer if the major versions are the same, or something like that, though that's unhelpful if you're only using a stemmer which hasn't changed between those versions.
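
A hypothetical sketch of such a guard, comparing installed package metadata (this is not how snowballstemmer currently behaves):

```python
from importlib.metadata import PackageNotFoundError, version


def pystemmer_is_compatible() -> bool:
    """Only delegate to PyStemmer when its major version matches
    snowballstemmer's (a rough proxy; hypothetical logic)."""
    try:
        pystemmer_major = version('PyStemmer').split('.')[0]
    except PackageNotFoundError:
        return False
    return pystemmer_major == version('snowballstemmer').split('.')[0]
```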

Your comparison may be of very different things though - at least if you ran it on sphinx git master, that has 15 stemmers + base-stemmer.js whereas the Snowball website has 33 + base-stemmer.js and the JS code for the demo itself. The stemmers do vary wildly in code size (from under 7K for hindi to 128K for serbian), but e.g. serbian and greek are the largest two JS files and not in sphinx git master.

I ran the comparison on this branch, with all stemmers generated.

OK, so at least they're working on approximately the same inputs (probably just demo.js extra for closure-compiler which is only 6416 bytes).

Looking at the code generated by each, one obvious extra thing uglifyjs does that helps reduce the size is change Unicode escapes in string literals to UTF-8 encoded source code (so e.g. \u0640 becomes a two-byte UTF-8 sequence, saving 4 bytes). If I add --charset UTF-8 for closure-compiler that brings its output down to 263504 bytes (and if UTF-8 encoded JavaScript source is OK then the Snowball compiler could easily produce it directly - since v3.0.0 it actually does for target languages which clearly document that the default source encoding is UTF-8, or a way to specify that it is, but I failed to find that info for JavaScript - e.g. https://tc39.es/ecma262/multipage/ecmascript-language-source-code.html#sec-ecmascript-language-source-code says "The actual encodings used to store and interchange ECMAScript source text is not relevant to this specification").
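
The saving is easy to verify; Python shown here, but the byte counts apply equally to JS source stored as UTF-8:

```python
escaped = r'\u0640'             # six ASCII bytes in the source file
encoded = '\u0640'.encode()     # UTF-8: b'\xd9\x80', two bytes
print(len(escaped), len(encoded))  # 6 2 -> saves 4 bytes per occurrence
```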

Comparing on the snowball-website repo (so including demo.js for both) and allowing UTF-8 JS source, I get 288406 bytes with uglifyjs vs 263504 with closure-compiler, so uglifyjs output is about 9.5% larger. That seems tolerable, and both are smaller than closure-compiler without specifying UTF-8 output, though the number of stemming languages (and hence total code size) will continue to grow over time, so I'd be happier to achieve a more similar size reduction with uglifyjs, or to find how to make closure-compiler work with the modernised JS output.

Anyway, I realise this is getting increasingly off-topic for sphinx-doc, but it seemed worth summarising my findings as you're also compressing the JS code.

Successfully merging this pull request may close these issues:

  • English stemming problems