Update stemming and Snowball #13561
Conversation
There's work in progress to improve this, but it's rather hampered by my lack of familiarity with modern JavaScript, and by people who are familiar not being very responsive to questions in issues/PRs. (I understand we're all busy, but that's why this isn't resolved yet.) I think we're probably going to move to producing ES6 JavaScript with ESM modules, which should work in all modern web browsers, and also server-side with both Node and Deno. The current immediate blocker is how to produce a minimised version for web use that is a similar size to what we can currently get with closure-compiler, as that's what we use to make the website demo. Unfortunately closure-compiler doesn't like the type annotations in the JS code produced by the open Snowball PR for ES6 generation, and the other minifiers I've tried don't do nearly as good a job of reducing the code size.
Maybe we should offer versions with the comments stripped, but it's pretty trivial to do that yourself, and stopword lists are somewhat domain specific so they're really just meant as starting points - it's expected people will want to consider removing or adding entries, for which the comments are useful. Also if you're using the list to avoid even indexing stopwords you need to be more conservative than if you're applying it by default at search time.
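Stripping the comments yourself really is trivial: the Snowball stopword lists use `|` to introduce comments. A minimal sketch (this parser is my illustration, not code from the PR):

```python
# Parse a Snowball-style stopword list, where '|' starts a comment that runs
# to the end of the line, and each remaining token is one stopword.
def parse_stopwords(text: str) -> set[str]:
    words: set[str] = set()
    for line in text.splitlines():
        # Drop everything after the first '|', then split the rest into words.
        words.update(line.split('|', 1)[0].split())
    return words

sample = """the   | definite article
an    | indefinite article
      | a comment-only line
of by | two entries on one line
"""
print(sorted(parse_stopwords(sample)))  # → ['an', 'by', 'of', 'the']
```

Returning a set also makes it easy to add or remove entries for your own domain, as suggested above.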
Ideally you want to use identical versions, since different versions may stem some words differently in some languages, but minor changes are usually not problematic. For example, Greek in 3.0.1 vs 2.2 stems ισα to ισ instead of an empty string, but a search for ισα in 2.2 wouldn't match ισα in a document anyway. Italian in 3.0.1 vs 2.2 only differs by stemming divano to divan instead of div (which unfortunately collides with diva); a search for divano will not match divano in a document indexed using 2.2, but it will now match divan in a document (indexed with either version), which it wouldn't before. It's pretty common to see something like the situation with Italian here, where a stemming change has positive and negative effects for existing indexed data and so is close to neutral overall (but is an improvement after reindexing).

For some languages the changes in 3.0.0 were substantial. We've switched Dutch to a completely different stemming algorithm, and many words now have different stems (e.g. the new algorithm doubles vowels rather than undoubling, so maan and manen both stem to maan now instead of man before). German now handles text with ä, ö, ü written as ae, oe, ue, which would make mixed-version use problematic with text containing these transliterated forms. Mixing versions across sweeping changes like these is likely to be problematic, but they are very rare - 3.0.0 is the first major bump in 5.5 years (and 2.0.0 really just signified the first version released as snowballstem.org after Martin Porter retired from development). Generally a major version bump signifies you might need to reindex, but that may only be necessary for some languages; e.g. the Danish and Portuguese stemmers haven't changed functionally at all since v2.0.0.

Something else to be aware of is that the Python …

I'm not sure there's a great solution to trying to keep different languages in step.
Maybe the best way would be pypi package of snowballstem which also provided matching Javascript code?
Chinese doesn't have inflected forms, so doesn't need a stemmer. It is usually written without explicit word breaks, so it needs word-boundary identification (or an alternative approach like indexing n-grams instead of words), but that's a different problem to stemming.
This project appears to be a reasonable benchmark of different minifiers: https://github.com/privatenumber/minification-benchmarks#-results
```python
import requests

SNOWBALL_VERSION = '3.0.1'
SNOWBALL_URL = f'https://github.com/snowballstem/snowball/archive/refs/tags/v{SNOWBALL_VERSION}.tar.gz'
```
Although I recognize that this is a utility script and therefore any changes introduced by it are likely to go through code review, I'm generally fairly strongly in favour of pinning checksums along with downloads of static content. In other words: because we know that we're downloading v3.0.1 of Snowball here, I think we could/should assert that the SHA-256 sum of the resulting download matches an expected value.

There is a small chance that GitHub's gzip compression might change in future, as it has once before -- but such events should be rare, so I don't think it would be worth being clever and trying to checksum the .tar or otherwise determine the inner contents of the archive.
Suggested change:

```python
SNOWBALL_URL = f'https://github.com/snowballstem/snowball/archive/refs/tags/v{SNOWBALL_VERSION}.tar.gz'
SNOWBALL_SHA256 = '80ac10ce40dc4fcfbfed8d085c457b5613da0e86a73611a3d5527d044a142d60'
```
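The assertion itself is a few lines of stdlib code. A sketch of the check (the `verify_sha256` helper is my illustration, not part of the PR):

```python
import hashlib

def verify_sha256(data: bytes, expected: str) -> None:
    """Raise if the payload's SHA-256 digest doesn't match the pinned value."""
    digest = hashlib.sha256(data).hexdigest()
    if digest != expected:
        raise RuntimeError(f'checksum mismatch: expected {expected}, got {digest}')

# Intended use after downloading the tarball, e.g.:
#   data = requests.get(SNOWBALL_URL).content
#   verify_sha256(data, SNOWBALL_SHA256)
```

Failing loudly before unpacking means a changed (or tampered-with) archive never reaches the rest of the script.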
```python
STOPWORDS_DIR = SEARCH_DIR / '_stopwords'
NON_MINIFIED_JS_DIR = SEARCH_DIR / 'non-minified-js'

STOPWORD_URLS = (
```
@ojwb are stopwords for multiple languages available as a combined download, or do we need to collect each file individually (as here)?
(Apologies -- I've only just noticed that this somewhat duplicates previous discussion. Even so, it would be convenient.)
That seems a good idea - I'd just thought about the comment stripping aspect in the other thread, but being able to grab all the lists in one download would be handy.
Thank you - and FWIW, the specific use-case I have in mind for this is to allow adding checksums for the stopwords files (similar to another comment I left about the snowball tarball download).

Assuming that updates to those files would occur similarly to localization/translation files - i.e. they may occur piecemeal and in somewhat unpredictable order, but can generally be bundled and approved for a release version - then including a snapshot of those in a versioned file could be convenient.
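Such a versioned snapshot could be as simple as a mapping from filename to pinned digest, checked in one pass. A sketch under those assumptions (the names `stale_files` and the example filenames are hypothetical, not from the PR):

```python
import hashlib

def stale_files(files: dict[str, bytes], snapshot: dict[str, str]) -> list[str]:
    """Return names of files whose contents no longer match the pinned digests."""
    return [
        name
        for name, expected in snapshot.items()
        if hashlib.sha256(files[name]).hexdigest() != expected
    ]

# The snapshot itself would live in a versioned file, e.g.:
# STOPWORD_SHA256 = {'da.txt': '<pinned digest>', 'de.txt': '<pinned digest>', ...}
```

A non-empty return value would flag exactly which stopword lists changed upstream since the snapshot was approved.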
I've had a go in snowballstem/snowball#234.
+1. I think Sphinx would be fine migrating to modules with no delay, as we have always bundled the JavaScript.
We mention
I don't want to ask you to do more packaging work; I'm aware of the pain involved. I think we should be fine with just this PR. It's probably somewhat rare to use stemmers in two different programming languages.
I ran the comparison on this branch, with all stemmers generated.
I see - rather unhelpful. They note: "google-closure-compiler: A heavy misstep even at the starting gate, failing on 'react' due to a critical configuration issue. It's still a solid minifier if configured correctly, but good luck setting it up!" It seems a shame they didn't try harder with configuration.
```
# Conflicts:
#	CHANGES.rst
#	sphinx/search/_stopwords/da.py
#	sphinx/search/_stopwords/da.txt
#	sphinx/search/_stopwords/de.py
#	sphinx/search/_stopwords/de.txt
#	sphinx/search/_stopwords/en.py
#	sphinx/search/_stopwords/es.py
#	sphinx/search/_stopwords/es.txt
#	sphinx/search/_stopwords/fi.py
#	sphinx/search/_stopwords/fi.txt
#	sphinx/search/_stopwords/fr.py
#	sphinx/search/_stopwords/fr.txt
#	sphinx/search/_stopwords/hu.py
#	sphinx/search/_stopwords/hu.txt
#	sphinx/search/_stopwords/it.py
#	sphinx/search/_stopwords/it.txt
#	sphinx/search/_stopwords/nl.py
#	sphinx/search/_stopwords/nl.txt
#	sphinx/search/_stopwords/no.py
#	sphinx/search/_stopwords/no.txt
#	sphinx/search/_stopwords/pt.py
#	sphinx/search/_stopwords/pt.txt
#	sphinx/search/_stopwords/ru.py
#	sphinx/search/_stopwords/ru.txt
#	sphinx/search/_stopwords/sv.py
#	sphinx/search/_stopwords/sv.txt
#	sphinx/search/en.py
#	sphinx/search/zh.py
```

```
# Conflicts:
#	sphinx/search/minified-js/README.rst
```
It's really stemmer code rather than stemmer data - snowballstemmer uses pure Python code for the stemmers (generated from the Snowball code) whereas PyStemmer is a Python C extension using C code for the stemmers (also generated from the Snowball code). Possibly snowballstemmer should only forward to PyStemmer if the major versions are the same or something like that, though that's unhelpful if you're only using a stemmer which hasn't changed between those versions.
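The "forward only when major versions match" idea could be as simple as comparing the first component of the dotted version strings. A sketch (snowballstemmer has no such check today, per the comment above; `same_major` is a hypothetical helper):

```python
def same_major(v1: str, v2: str) -> bool:
    # Compare only the major component of dotted version strings,
    # e.g. '3.0.1' and '3.1.0' share major version 3.
    return v1.split('.', 1)[0] == v2.split('.', 1)[0]

# e.g. snowballstemmer could fall back to its pure-Python stemmers unless
# same_major(snowballstemmer_version, pystemmer_version) holds.
```

As noted, this is still unhelpful when the one stemmer you actually use hasn't changed between the two majors.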
OK, so at least they're working on approximately the same inputs (probably just …). Looking at the code generated by each, one obvious extra thing uglifyjs does that helps reduce the size is changing Unicode escapes in string literals to UTF-8 encoded source code (so e.g. …). Comparing on the snowball-website repo (so including …). Anyway, I realise this is getting increasingly off-topic for sphinx-doc, but it seemed worth summarising my findings as you're also compressing the JS code.
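The byte arithmetic behind that escape-to-UTF-8 saving is easy to illustrate (in Python here rather than JS, purely for demonstration):

```python
# A character written as a '\uXXXX' escape occupies six bytes of ASCII source,
# while the same character emitted literally occupies only its UTF-8 encoding
# (two bytes for Greek letters, for example).
escaped_source = r'\u03b1'   # the six source characters: \ u 0 3 b 1
literal = '\u03b1'           # the single character 'α'

print(len(escaped_source))               # → 6 bytes of source as an escape
print(len(literal.encode('utf-8')))      # → 2 bytes of source as raw UTF-8
```

So every non-ASCII character in a string literal shrinks from six source bytes to (typically) two or three when the minifier emits UTF-8 directly.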
Purpose
Follows on from Olly Betts' #13535.

This PR changes the stemmer for English to use the more modern `'english'` instead of `'porter'`. We also automate the creation of stopword sets from the data files on the Snowball website, and update the JavaScript files to v3.0.1.

Open questions:

- The generated JavaScript files contain `require()` functions and `module.exports`. Is there a way to avoid this? I just ran `make dist_libstemmer_js`.
- … the `stop.txt` files on the Snowball website?
- … `"snowballstemmer>=2.2"` to `"snowballstemmer==3.0.1"` in `pyproject.toml`?

We still use the English stemmer in `zh.py`, but I think that's as there's no Mandarin/Cantonese stemmer. I won't claim to understand the rationale or background here, though.
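If the pinning question above were answered in the affirmative, the change itself would be a one-line edit. A sketch (the surrounding dependency table in Sphinx's actual pyproject.toml may be laid out differently):

```toml
[project]
dependencies = [
    # Pinned exactly, rather than the previous ">=2.2" lower bound:
    "snowballstemmer==3.0.1",
]
```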
cc @ojwb
References
- Fix #1784: Provide non-minified JS code in sphinx/search/*.py
- Merged in shibu/sphinx/add_stemmer (pull request #214)

fyi/cc @mitya57 as the author of the previous Snowball upgrade.