A minimal set of scripts for training language and acoustic models for speech recognition. The training pipeline consists of the following stages:
- Character-based RNN language model
- CNN-RNN acoustic model with CTC loss
- Character-based RNN language model and CNN-RNN acoustic model with RNN-T loss
- Fine-tuning with Reinforcement Learning and RNN-T loss
The following table shows the results on the Russian Open Speech To Text (STT/ASR) Dataset.
Stage | Model | Loss | Updates | CER | WER |
---|---|---|---|---|---|
1 | LM | CE | 2407000 | | |
2 | AM | CTC | 216850 | 19.9 | 57.0 |
3 | LM+AM | RNN-T | 108425 | 21.7 | 45.6 |
4 | LM+AM | RL | 300 | 19.2 | 43.9 |
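
For reference, the CER and WER columns can be read as edit-distance-based error rates. The following is a minimal pure-Python sketch of that computation, not the repository's evaluation code (which lists torch-edit-distance as a dependency). The helper names `edit_distance` and `error_rate` are hypothetical, and it is an assumption here that spaces count as characters for CER and that rates are aggregated over the whole corpus.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion of a reference token
                curr[j - 1] + 1,          # insertion of a hypothesis token
                prev[j - 1] + (r != h),   # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="char"):
    """Error rate in percent: total edits divided by total reference length.

    unit="char" gives CER (spaces included, an assumption of this sketch);
    unit="word" gives WER over whitespace-separated tokens.
    """
    errors = total = 0
    for ref, hyp in zip(refs, hyps):
        r = list(ref) if unit == "char" else ref.split()
        h = list(hyp) if unit == "char" else hyp.split()
        errors += edit_distance(r, h)
        total += len(r)
    return 100.0 * errors / total


refs = ["hello world"]
hyps = ["hello word"]
cer = error_rate(refs, hyps, unit="char")  # 1 edit over 11 characters
wer = error_rate(refs, hyps, unit="word")  # 1 edit over 2 words
```

The two metrics are computed over different units, so they need not move in the same direction between stages.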
- PyTorch >= 1.3 (with bug fix #27460)
- torch-edit-distance
- warp-rnnt
The acoustic models use log mel filterbank features: 40 filters computed over 25 ms windows with a 10 ms stride.
- features.py - extract features of the utterances listed in a manifest file
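
To make the window and stride concrete, here is a small sketch of the framing arithmetic. It is not taken from features.py: the 16 kHz sample rate, the absence of padding, and the helper names `frame_params` and `num_frames` are all assumptions for illustration.

```python
def frame_params(sample_rate, window_ms=25, stride_ms=10):
    """Convert window and stride durations to sample counts."""
    win_length = sample_rate * window_ms // 1000
    hop_length = sample_rate * stride_ms // 1000
    return win_length, hop_length


def num_frames(num_samples, win_length, hop_length):
    """Number of full windows that fit in the signal (no padding assumed)."""
    if num_samples < win_length:
        return 0
    return 1 + (num_samples - win_length) // hop_length


SAMPLE_RATE = 16000  # assumption: the source does not state the sample rate
NUM_FILTERS = 40     # from the description above

win, hop = frame_params(SAMPLE_RATE)        # 400 and 160 samples
frames = num_frames(SAMPLE_RATE, win, hop)  # frames in one second of audio
feature_shape = (frames, NUM_FILTERS)       # (98, 40) under these assumptions
```

Under these assumptions, one second of audio yields 98 feature vectors of 40 values each; a different sample rate or padding scheme changes the frame count.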
The language model is character-based and case-insensitive.
- utterances.py - extract transcriptions of precomputed utterances
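
A character-based, case-insensitive model implies that transcriptions are lowercased and mapped to per-character indices before training. The sketch below illustrates one way to do that; it is not the logic of utterances.py. The function names are hypothetical, whitespace collapsing is an added assumption, and no special symbols (such as a blank or padding index) are reserved.

```python
def normalize(text):
    """Lowercase the text and collapse runs of whitespace (an assumption)."""
    return " ".join(text.lower().split())


def build_vocab(texts):
    """Map each distinct character in the normalized texts to an index."""
    chars = sorted({ch for text in texts for ch in normalize(text)})
    return {ch: i for i, ch in enumerate(chars)}


def encode(text, vocab):
    """Turn a transcription into a list of character indices."""
    return [vocab[ch] for ch in normalize(text)]


vocab = build_vocab(["Привет мир", "ПРИВЕТ"])
lower_ids = encode("привет", vocab)
upper_ids = encode("ПРИВЕТ", vocab)  # same indices: case is ignored
```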
Pre-trained models:
There are outdated kernels with small training subsets: