stdtext — Danish Grammar & Style Modeling System

stdtext builds a hybrid Danish grammar/style model that blends a probabilistic dependency grammar (PDG) trained with spaCy and a KenLM n-gram language model. The package exposes both a FastAPI service and a small CLI for scoring sentences with grammar- and style-aware metrics.

Components

PDG model generated from dependency parses of a Danish corpus (data/lm/lm_corpus.txt by default).
KenLM language model stored as data/lm/my_corpus.bin with the source ARPA file in data/lm/my_corpus.arpa.
Hybrid scorer that mixes PDG and LM scores (configurable alpha weight) and powers both the API and CLI helpers.

Setup

Create and activate a virtual environment (Python 3.10+).
Install dependencies: pip install -r requirements.txt.
Download the Danish spaCy model: python -m spacy download da_core_news_sm (or run scripts/setup_env.cmd).

Training

Probabilistic Dependency Grammar

Run the training script to parse the corpus and write PDG statistics:

python src/training/train_pdg.py
# outputs data/pdg/grammar_stats.json (create data/pdg first if it is missing)

Update the corpus path inside train_pdg.py if you want to use a different dataset.

KenLM n-gram model

Rebuild the language model from the corpus with KenLM CLI tools:

lmplz -o 5 < data/lm/lm_corpus.txt > data/lm/my_corpus.arpa
build_binary data/lm/my_corpus.arpa data/lm/my_corpus.bin

Note: Build and binarize the KenLM model on a Linux environment. The binary format is platform-specific, so generating it on Linux avoids compatibility issues when the service loads data/lm/my_corpus.bin.

Using the models

FastAPI service

Start the API (expects data/pdg/grammar_stats.json and data/lm/my_corpus.bin to exist):

uvicorn src.api.main:app --reload --host 0.0.0.0 --port 8000

Score text via HTTP:

curl -X POST "http://localhost:8000/score" \
  -H "Content-Type: application/json" \
  -d '{"text": "Jeg har en stor hund som elsker at lege.", "autocorrect": true}'

The API will attempt to spellcheck the input against the training corpus vocabulary (data/lm/lm_corpus.txt) before scoring. The response returns the original sentence, the corrected sentence, and per-token corrections so you can see what changed.

CLI quick check

The CLI helper in src/cli/scor_sentence.py demonstrates hybrid scoring for a single sentence. Ensure grammar_stats.json and my_corpus.bin are available in your working directory (or edit the paths), then run:

python -m src.cli.scor_sentence

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
src		src
.gitignore		.gitignore
README.md		README.md
dacy_inst.py		dacy_inst.py
install.md		install.md
requirements.txt		requirements.txt
start_api.cmd		start_api.cmd

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

stdtext — Danish Grammar & Style Modeling System

Components

Setup

Training

Probabilistic Dependency Grammar

KenLM n-gram model

Using the models

FastAPI service

CLI quick check

About

Uh oh!

Releases

Packages

Languages

janm-comcon/dacy-text

Folders and files

Latest commit

History

Repository files navigation

stdtext — Danish Grammar & Style Modeling System

Components

Setup

Training

Probabilistic Dependency Grammar

KenLM n-gram model

Using the models

FastAPI service

CLI quick check

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages