stdtext builds a hybrid Danish grammar/style model that blends a probabilistic dependency grammar (PDG) trained with spaCy and a KenLM n-gram language model. The package exposes both a FastAPI service and a small CLI for scoring sentences with grammar- and style-aware metrics.
- PDG model generated from dependency parses of a Danish corpus (data/lm/lm_corpus.txt by default).
- KenLM language model stored as data/lm/my_corpus.bin, with the source ARPA file in data/lm/my_corpus.arpa.
- Hybrid scorer that mixes PDG and LM scores (configurable alpha weight) and powers both the API and CLI helpers.
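
The exact mixing formula is not spelled out here. As a minimal sketch, assuming a linear interpolation in which alpha weights the PDG score and 1 - alpha weights the LM score, the idea could look like this. The function name `hybrid_score` and the assumption that both scores share a comparable scale are illustrative, not the package's actual API:

```python
def hybrid_score(pdg_score: float, lm_score: float, alpha: float = 0.5) -> float:
    """Linearly interpolate a grammar score and a language-model score.

    Assumption: both scores are on a comparable scale (for example,
    per-token log-probabilities). alpha=1.0 uses only the PDG score,
    alpha=0.0 uses only the LM score.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * pdg_score + (1.0 - alpha) * lm_score


if __name__ == "__main__":
    # Hypothetical per-token log-probabilities for one sentence.
    print(hybrid_score(pdg_score=-2.0, lm_score=-4.0, alpha=0.25))  # -3.5
```

A higher alpha makes the result lean on grammatical structure; a lower alpha makes it lean on n-gram fluency.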
- Create and activate a virtual environment (Python 3.10+).
- Install dependencies: pip install -r requirements.txt
- Download the Danish spaCy model: python -m spacy download da_core_news_sm (or run scripts/setup_env.cmd).
Run the training script to parse the corpus and write PDG statistics:
python src/training/train_pdg.py
# outputs data/pdg/grammar_stats.json (create data/pdg first if it is missing)
Update the corpus path inside train_pdg.py if you want to use a different dataset.
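
The layout of grammar_stats.json is not described here. One plausible way to derive PDG statistics, shown only as a sketch, is to count (head POS, dependency label, child POS) triples and normalize them per head POS. The `estimate_pdg` name and the output structure are assumptions; with spaCy, the triples could be collected as `(token.head.pos_, token.dep_, token.pos_)` for each token in a parsed `Doc`:

```python
import json
from collections import Counter, defaultdict
from typing import Iterable

Triple = tuple[str, str, str]  # (head POS, dependency label, child POS)


def estimate_pdg(triples: Iterable[Triple]) -> dict[str, dict[str, float]]:
    """Estimate P(dep, child POS | head POS) by relative frequency."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for head_pos, dep, child_pos in triples:
        counts[head_pos][f"{dep}:{child_pos}"] += 1

    stats: dict[str, dict[str, float]] = {}
    for head_pos, counter in counts.items():
        total = sum(counter.values())
        stats[head_pos] = {key: n / total for key, n in counter.items()}
    return stats


if __name__ == "__main__":
    # Hand-written triples standing in for real parser output.
    sample = [
        ("VERB", "nsubj", "PRON"),
        ("VERB", "obj", "NOUN"),
        ("NOUN", "det", "DET"),
        ("NOUN", "amod", "ADJ"),
    ]
    print(json.dumps(estimate_pdg(sample), indent=2))
```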
Rebuild the language model from the corpus with KenLM CLI tools:
lmplz -o 5 < data/lm/lm_corpus.txt > data/lm/my_corpus.arpa
build_binary data/lm/my_corpus.arpa data/lm/my_corpus.bin
Note: Build and binarize the KenLM model on a Linux environment. The binary format is platform-specific, so generating it on Linux avoids compatibility issues when the service loads data/lm/my_corpus.bin.
Start the API (expects data/pdg/grammar_stats.json and data/lm/my_corpus.bin to exist):
uvicorn src.api.main:app --reload --host 0.0.0.0 --port 8000
Score text via HTTP:
curl -X POST "http://localhost:8000/score" \
-H "Content-Type: application/json" \
-d '{"text": "Jeg har en stor hund som elsker at lege.", "autocorrect": true}'
The API attempts to spellcheck the input against the training corpus vocabulary (data/lm/lm_corpus.txt) before scoring. The response returns the original sentence, the corrected sentence, and per-token corrections so you can see what changed.
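
The same request can be sent from Python using only the standard library. This is a sketch that mirrors the curl call: the helper names `build_score_request` and `score` are illustrative, and because the response field names are not listed here, the result is printed as-is rather than unpacked:

```python
import json
import urllib.error
import urllib.request


def build_score_request(
    text: str,
    autocorrect: bool = True,
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build a POST request for the /score endpoint."""
    payload = json.dumps({"text": text, "autocorrect": autocorrect}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/score",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def score(text: str, autocorrect: bool = True) -> dict:
    """Send the request and decode the JSON response."""
    request = build_score_request(text, autocorrect)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires the API to be running locally (see the uvicorn command above).
    try:
        print(score("Jeg har en stor hund som elsker at lege."))
    except urllib.error.URLError as exc:
        print("Could not reach the API:", exc.reason)
```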
The CLI helper in src/cli/scor_sentence.py demonstrates hybrid scoring for a single sentence. Ensure grammar_stats.json and my_corpus.bin are available in your working directory (or edit the paths), then run:
python -m src.cli.scor_sentence