
Update stemming and Snowball #13561

Open · wants to merge 14 commits into master

Conversation

@AA-Turner (Member) commented May 16, 2025

Purpose

Follows on from Olly Betts' #13535.

This PR changes the stemmer for English to use the more modern 'english' instead of 'porter'. We also automate the creation of stopword sets from the data files on the snowball website, and update the JavaScript files to v3.0.1.
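
For illustration, here is a minimal sketch of that stopword automation, assuming the stop.txt layout used on the Snowball website (one word per line, with | starting a comment that runs to the end of the line); the URL shown is the English list:

```python
import requests

STOP_TXT_URL = 'https://snowballstem.org/algorithms/english/stop.txt'


def parse_stopwords(text: str) -> frozenset[str]:
    """Collect the words before any '|' comment on each line."""
    words: set[str] = set()
    for line in text.splitlines():
        content = line.split('|', 1)[0]  # drop the comment
        words.update(content.split())
    return frozenset(words)


response = requests.get(STOP_TXT_URL, timeout=30)
response.raise_for_status()
stopwords = parse_stopwords(response.text)
print(sorted(stopwords)[:10])
```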

Open questions:

  • The files that Snowball generates aren't ready for use on the web, as far as I understand, because they use Node.js-specific require() functions and module.exports. Is there a way to avoid this? I just ran make dist_libstemmer_js.
  • Is there a better way to get stopwords than by parsing the stop.txt files on the Snowball website?
  • Do we need to keep the JS and Python versions in lockstep, meaning updating "snowballstemmer>=2.2" to "snowballstemmer==3.0.1" in pyproject.toml?

We still use the English stemmer in zh.py, but I think that's because there's no Mandarin/Cantonese stemmer. I won't claim to understand the rationale or background here, though.

A

cc @ojwb

References

FYI/cc @mitya57 as the author of the previous Snowball upgrade.

@ojwb commented May 16, 2025

  • The files that Snowball generates aren't ready for use on the web, as far as I understand, because they use Node.js-specific require() functions and module.exports. Is there a way to avoid this? I just ran make dist_libstemmer_js.

There's work in progress to improve this, but it's rather hampered by my lack of familiarity with modern JavaScript and by people who are familiar not being very responsive to questions in issues/PRs. (I understand we're all busy, but that is why this isn't resolved yet.)

I think we're probably going to move to producing ES6 JavaScript with ESM modules, which apparently should work in all modern web browsers, and also server-side with both Node and Deno.

The current immediate blocker to progress is how to produce a minimised version for web use that is of similar size to what we can currently get with closure-compiler, as that's what we use to make the website demo. Unfortunately closure-compiler doesn't like the type annotations in the JS code produced by the open Snowball PR for ES6 generation, and the other options for minifying I've tried don't do nearly as good a job of reducing the code size.

  • Is there a better way to get stopwords than by parsing the stop.txt files on the Snowball website?

Maybe we should offer versions with the comments stripped, but it's pretty trivial to do that yourself, and stopword lists are somewhat domain-specific, so they're really just meant as starting points - it's expected people will want to consider removing or adding entries, for which the comments are useful. Also, if you're using the list to avoid even indexing stopwords, you need to be more conservative than if you're applying it by default at search time.

  • Do we need to keep the JS and Python versions in lockstep, meaning updating "snowballstemmer>=2.2" to "snowballstemmer==3.0.1" in pyproject.toml?

Ideally you want to use identical versions since different versions may stem some words differently in some languages, but minor changes are usually not problematic. For example, Greek in 3.0.1 vs 2.2 stems ισα to ισ instead of an empty string, but a search for ισα in 2.2 wouldn't match ισα in a document anyway. Italian in 3.0.1 vs 2.2 only differs by stemming divano to divan instead of div (which unfortunately collides with diva); a search for divano will not match divano in a document indexed using 2.2, but it will now match divan in a document (indexed with either version) which it wouldn't before.
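
Such differences are easy to reproduce with the snowballstemmer package (stemWord is its single-word API); a minimal check of the Italian example above:

```python
import snowballstemmer  # e.g. pip install snowballstemmer==3.0.1

stemmer = snowballstemmer.stemmer('italian')
# Prints 'divan' under 3.0.1; 2.2 produced 'div' for the same word.
print(stemmer.stemWord('divano'))
```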

It's pretty common to see something like the situation with Italian here where a stemming change has positive and negative effects for existing indexed data so is close to neutral overall (but is an improvement after reindexing).

For some languages the changes in 3.0.0 were substantial - we've switched Dutch to a completely different stemming algorithm and many words now have different stems (e.g. the new algorithm doubles vowels rather than undoubling, so maan and manen both stem to maan now instead of man before); German now handles text with ä, ö, ü written as ae, oe, ue, which would make mixed-version use problematic with text containing these transliterated forms. Mixing versions across sweeping changes like these is likely to be problematic, but they are very rare - 3.0.0 is the first major bump in 5.5 years (and 2.0.0 really just signified the first version released from snowballstem.org after Martin Porter retired from development).

Generally a major version bump signifies you might need to reindex, but that may only be necessary for some languages. E.g. the Danish and Portuguese stemmers haven't changed functionally at all since v2.0.0.

Something else to be aware of is that the Python snowballstemmer module will actually use the stemmers from PyStemmer instead if that's installed. (PyStemmer is a Python wrapper around the C versions of the Snowball stemmers, which is much faster than the pure Python, so this provides an easy way for users to accelerate stemming just by installing PyStemmer.) However, this means snowballstemmer==3.0.1 might not get you the 3.0.1 stemmers if PyStemmer is installed.
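
Since snowballstemmer falls back to pure Python only when PyStemmer's extension module is absent, a rough way to check which implementation will be used is to test for that module (Stemmer is the module name PyStemmer installs):

```python
# Rough check: snowballstemmer delegates to PyStemmer's C stemmers
# when the 'Stemmer' extension module (provided by PyStemmer) imports.
try:
    import Stemmer  # noqa: F401
except ImportError:
    print('pure-Python stemmers from snowballstemmer')
else:
    print('C stemmers via PyStemmer')
```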

I'm not sure there's a great solution to keeping the different language implementations in step. Maybe the best way would be a PyPI package of snowballstemmer which also provided matching JavaScript code?

We still use the English stemmer in zh.py, but I think that's because there's no Mandarin/Cantonese stemmer. I won't claim to understand the rationale or background here, though.

Chinese doesn't have inflected forms, so doesn't need a stemmer. It is usually written without explicit word breaks, so needs word boundary identification (or an alternative approach like indexing n-grams instead of words) but that's a different problem to stemming.
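
As an illustration of the n-gram alternative mentioned above - a sketch, not what zh.py does - character bigrams sidestep word-boundary identification entirely:

```python
def char_bigrams(text: str) -> list[str]:
    """Index overlapping two-character units instead of words."""
    chars = [c for c in text if not c.isspace()]
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


print(char_bigrams('搜索引擎'))  # ['搜索', '索引', '引擎']
```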

Looks like zh.py assumes that spans of Latin-alphabet text amongst Chinese text are English, and that they are useful to stem as such.

@AA-Turner (Member, Author):

The current immediate blocker to progress is how to produce a minimised version for web use that is of similar size to what we can currently get with closure-compiler, as that's what we use to make the website demo. Unfortunately closure-compiler doesn't like the type annotations in the JS code produced by the open Snowball PR for ES6 generation, and the other options for minifying I've tried don't do nearly as good a job of reducing the code size.

We use uglifyjs, which appears competitive: npx uglifyjs sphinx/search/non-minified-js/*.js --compress --mangle -o tmp.js gives a 281KB file, and https://snowballstem.org/js/stemmers.js is 292KB.

This project appears to be a reasonable benchmark of different minifiers: https://github.com/privatenumber/minification-benchmarks#-results

import requests

SNOWBALL_VERSION = '3.0.1'
SNOWBALL_URL = f'https://github.com/snowballstem/snowball/archive/refs/tags/v{SNOWBALL_VERSION}.tar.gz'
Contributor:

Although I recognize that this is a utility script, and therefore any changes introduced by it are likely to go through code review etc., I'm generally fairly strongly in favour of pinning checksums alongside downloads of static content.

In other words: because we know that we're downloading v3.0.1 of snowball here, I think we could/should assert that the SHA256sum of the resulting download matches an expected value.

There is a small chance that GitHub GZ compression might change in future, as they have once before -- but such events should be rare, so I don't think it would be worth being clever and trying to checksum the .tar or otherwise determine the inner contents of the archive.

Suggested change:
  SNOWBALL_URL = f'https://github.com/snowballstem/snowball/archive/refs/tags/v{SNOWBALL_VERSION}.tar.gz'
+ SNOWBALL_SHA256 = '80ac10ce40dc4fcfbfed8d085c457b5613da0e86a73611a3d5527d044a142d60'
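
A minimal sketch of that verification step, assuming the script downloads the tarball with requests as elsewhere in this file:

```python
import hashlib

import requests

SNOWBALL_VERSION = '3.0.1'
SNOWBALL_URL = f'https://github.com/snowballstem/snowball/archive/refs/tags/v{SNOWBALL_VERSION}.tar.gz'
SNOWBALL_SHA256 = '80ac10ce40dc4fcfbfed8d085c457b5613da0e86a73611a3d5527d044a142d60'

archive = requests.get(SNOWBALL_URL, timeout=60).content
digest = hashlib.sha256(archive).hexdigest()
if digest != SNOWBALL_SHA256:
    msg = f'checksum mismatch for {SNOWBALL_URL}: got {digest}'
    raise RuntimeError(msg)
```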

STOPWORDS_DIR = SEARCH_DIR / '_stopwords'
NON_MINIFIED_JS_DIR = SEARCH_DIR / 'non-minified-js'

STOPWORD_URLS = (
Contributor:

@ojwb are stopwords for multiple languages available as a combined download, or do we need to collect each file individually (as here)?

Contributor:

(Apologies; I've only just noticed that this somewhat duplicates previous discussion. Even so, it would be convenient.)

@ojwb:

That seems a good idea - I'd just thought about the comment stripping aspect in the other thread, but being able to grab all the lists in one download would be handy.

Contributor:

Thank you - and FWIW: the specific use-case I have in mind for this is to allow adding checksums for the stopwords files (similar to another comment I left about the snowball tarball download).

Assuming that updates to those files would occur similarly to localization/translation files - e.g. they may occur piecemeal and in somewhat unpredictable order, but can generally be bundled and approved for a release version - including a snapshot of them in a versioned file could be convenient.
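
A sketch of that idea: compute and snapshot a digest per stopwords file into a versioned manifest (the directory matches this PR's layout; the manifest filename is hypothetical):

```python
import hashlib
import json
from pathlib import Path

STOPWORDS_DIR = Path('sphinx/search/_stopwords')
MANIFEST = STOPWORDS_DIR / 'checksums.json'  # hypothetical filename

manifest = {
    path.name: hashlib.sha256(path.read_bytes()).hexdigest()
    for path in sorted(STOPWORDS_DIR.glob('*.txt'))
}
MANIFEST.write_text(json.dumps(manifest, indent=2) + '\n')
```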

(Several comments here between @jayaddison and @ojwb were marked as outdated or resolved.)

@AA-Turner (Member, Author):

There's work in progress to improve this, but it's rather hampered by my lack of familiarity with modern JavaScript and by people who are familiar not being very responsive to questions in issues/PRs. (I understand we're all busy, but that is why this isn't resolved yet.)

I've had a go in snowballstem/snowball#234.

I think we're probably going to move to producing ES6 JavaScript with ESM modules, which apparently should work in all modern web browsers, and also server-side with both Node and Deno.

+1. I think Sphinx would be fine migrating to modules with no delay, as we have always bundled the JavaScript.

Something else to be aware of is that the Python snowballstemmer module will actually use the stemmers from PyStemmer instead if that's installed ... However, this means snowballstemmer==3.0.1 might not get you the 3.0.1 stemmers if PyStemmer is installed.

We mention PyStemmer in the documentation (and I've added it to the docs.python.org build system), but I hadn't realised that the stemmer data itself would change with the package, thanks for pointing it out.

I'm not sure there's a great solution to keeping the different language implementations in step. Maybe the best way would be a PyPI package of snowballstemmer which also provided matching JavaScript code?

I don't want to ask you to do more packaging work; I'm aware of the pain involved. I think we should be fine with just this PR. It's probably somewhat rare to use stemmers in two different programming languages.

Your comparison may be of very different things though - at least if you ran it on sphinx git master, that has 15 stemmers + base-stemmer.js whereas the Snowball website has 33 + base-stemmer.js and the JS code for the demo itself. The stemmers do vary wildly in code size (from under 7K for hindi to 128K for serbian), but e.g. serbian and greek are the largest two JS files and not in sphinx git master.

I ran the comparison on this branch, with all stemmers generated.

Unhelpfully, the benchmark just seems to show an X (a failed run) for what we're currently using on snowballstem.org.

I see, rather unhelpful. They note: "google-closure-compiler: A heavy misstep even at the starting gate, failing on "react" due to a critical configuration issue. It’s still a solid minifier if configured correctly, but good luck setting it up!" It seems a shame they didn't try harder with the configuration, though.

A

AA-Turner added 4 commits May 18, 2025 (including merges resolving conflicts in CHANGES.rst, sphinx/search/_stopwords/*, sphinx/search/en.py, sphinx/search/zh.py, and sphinx/search/minified-js/README.rst)

@ojwb commented May 19, 2025

Something else to be aware of is that the Python snowballstemmer module will actually use the stemmers from PyStemmer instead if that's installed ... However, this means snowballstemmer==3.0.1 might not get you the 3.0.1 stemmers if PyStemmer is installed.

We mention PyStemmer in the documentation (and I've added it to the docs.python.org build system), but I hadn't realised that the stemmer data itself would change with the package, thanks for pointing it out.

It's really stemmer code rather than stemmer data - snowballstemmer uses pure Python code for the stemmers (generated from the Snowball code) whereas PyStemmer is a Python C extension using C code for the stemmers (also generated from the Snowball code).

Possibly snowballstemmer should only forward to PyStemmer if the major versions are the same, or something like that, though that's unhelpful if you're only using a stemmer which hasn't changed between those versions.
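
A hypothetical sketch of such a guard, comparing installed package metadata (this is not how snowballstemmer currently behaves):

```python
from importlib.metadata import PackageNotFoundError, version


def pystemmer_is_compatible() -> bool:
    """Only delegate to PyStemmer when its major version matches
    snowballstemmer's (a rough proxy; hypothetical logic)."""
    try:
        pystemmer_major = version('PyStemmer').split('.')[0]
    except PackageNotFoundError:
        return False
    return pystemmer_major == version('snowballstemmer').split('.')[0]
```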

Your comparison may be of very different things though - at least if you ran it on sphinx git master, that has 15 stemmers + base-stemmer.js whereas the Snowball website has 33 + base-stemmer.js and the JS code for the demo itself. The stemmers do vary wildly in code size (from under 7K for hindi to 128K for serbian), but e.g. serbian and greek are the largest two JS files and not in sphinx git master.

I ran the comparison on this branch, with all stemmers generated.

OK, so at least they're working on approximately the same inputs (probably just demo.js extra for closure-compiler which is only 6416 bytes).

Looking at the code generated by each, one obvious extra thing uglifyjs does that helps reduce the size is change Unicode escapes in string literals to UTF-8 encoded source code (so e.g. \u0640 becomes a two-byte UTF-8 sequence, saving 4 bytes). If I add --charset UTF-8 for closure-compiler that brings its output down to 263504 bytes (and if UTF-8 encoded JavaScript source is OK then the Snowball compiler could easily produce it directly - since v3.0.0 it actually does for target languages which clearly document that the default source encoding is UTF-8, or a way to specify that it is, but I failed to find that info for JavaScript - e.g. https://tc39.es/ecma262/multipage/ecmascript-language-source-code.html#sec-ecmascript-language-source-code says "The actual encodings used to store and interchange ECMAScript source text is not relevant to this specification").
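
The saving is easy to verify; Python shown here, but the byte counts apply equally to JS source stored as UTF-8:

```python
escaped = r'\u0640'             # six ASCII bytes in the source file
encoded = '\u0640'.encode()     # UTF-8: b'\xd9\x80', two bytes
print(len(escaped), len(encoded))  # 6 2 -> saves 4 bytes per occurrence
```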

Comparing on the snowball-website repo (so including demo.js for both) and allowing UTF-8 JS source, I get 288406 bytes with uglifyjs vs 263504 with closure-compiler, so uglifyjs output is about 9.5% larger. That seems tolerable, and both are smaller than closure-compiler without specifying UTF-8 output, though the number of stemming languages (and hence total code size) will continue to grow over time, so I'd be happier to achieve a more similar size reduction with uglifyjs, or to find how to make closure-compiler work with the modernised JS output.

Anyway, I realise this is getting increasingly off-topic for sphinx-doc, but it seemed worth summarising my findings as you're also compressing the JS code.

Successfully merging this pull request may close these issues:

  • English stemming problems