# Whisper Speech-to-Text

Whisper STT Service uses [whisper.cpp](https://github.com/ggerganov/whisper.cpp) to perform offline speech-to-text in openHAB.
It also uses [libfvad](https://github.com/dpirch/libfvad) for voice activity detection, which isolates the single command to transcribe and speeds up execution.

[Whisper.cpp](https://github.com/ggerganov/whisper.cpp) is a highly optimized, lightweight C++ implementation of [whisper](https://github.com/openai/whisper) that makes it easy to integrate into different platforms and applications.

Whisper enables speech recognition for multiple languages and dialects:

English, Chinese, German, Spanish, Russian, Korean, French, Japanese, Portuguese, Turkish, Polish, Catalan, Dutch, Arabic, Swedish,
Italian, Indonesian, Hindi, Finnish, Vietnamese, Hebrew, Ukrainian, Greek, Malay, Czech, Romanian, Danish, Hungarian, Tamil, Norwegian,
Thai, Urdu, Croatian, Bulgarian, Lithuanian, Latin, Maori, Malayalam, Welsh, Slovak, Telugu, Persian, Latvian, Bengali, Serbian, Azerbaijani,
Slovenian, Kannada, Estonian, Macedonian, Breton, Basque, Icelandic, Armenian, Nepali, Mongolian, Bosnian, Kazakh, Albanian, Swahili, Galician,
Marathi, Punjabi, Sinhala, Khmer, Shona, Yoruba, Somali, Afrikaans, Occitan, Georgian, Belarusian, Tajik, Sindhi, Gujarati, Amharic, Yiddish, Lao,
Uzbek, Faroese, Haitian, Pashto, Turkmen, Nynorsk, Maltese, Sanskrit, Luxembourgish, Myanmar, Tibetan, Tagalog, Malagasy, Assamese, Tatar, Lingala,
Hausa, Bashkir, Javanese and Sundanese.

## Supported platforms

This add-on uses native binaries to work.
It relies on the [whisper.cpp Java wrapper](https://github.com/GiviMAD/whisper-jni) and the [libfvad Java wrapper](https://github.com/GiviMAD/libfvad-jni).

The following platforms are supported:

* Windows 10 x86_64
* Debian GLIBC x86_64/arm64 (min GLIBC version 2.31 / min Debian version Focal)
* macOS x86_64/arm64 (min version v11.0)

The native binaries for these platforms are included in the add-on provided with the openHAB distribution.

## CPU compatibility

To use this binding, a device at least as powerful as a Raspberry Pi 5 with a modern CPU is recommended.
Execution times on the Raspberry Pi 4 are about twice as long, so only the tiny model can run there in under 5 seconds.

If you are going to use the binding on an `x86_64` host, the CPU should support the flags `avx2`, `fma`, `f16c` and `avx`.
On Linux, you can check these flags in a terminal with `lscpu`.
On Windows, you can check them with a program like `CPU-Z`.

If you are going to use the binding on an `arm64` host, the CPU should support the `fphp` flag.
On Linux, you can check this flag in a terminal with `lscpu`.

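As a sketch of the check described above, you can also grep the flags from a Linux shell (the `check_flags` helper is hypothetical; on a real host you would feed it the `flags` line from `/proc/cpuinfo`):

```shell
# check_flags FLAGS_LINE: report which of the required x86_64 flags are listed.
check_flags() {
  required="avx2 fma f16c avx"
  for f in $required; do
    case " $1 " in
      *" $f "*) echo "$f: present" ;;
      *)        echo "$f: MISSING" ;;
    esac
  done
}

# On a real Linux host: check_flags "$(grep -m1 '^flags' /proc/cpuinfo)"
# Example with a fixed flags string (all four flags present here):
check_flags "fpu avx avx2 fma f16c sse2"
```
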
## Transcription time

On a Raspberry Pi 5, the approximate transcription times are:

| model      | exec time |
| ---------- | --------: |
| tiny.bin   |      1.5s |
| base.bin   |        3s |
| small.bin  |      8.5s |
| medium.bin |       17s |

## Configuring the model

Before you can use this service, you should configure your model.

You can download models from the sources provided by the [whisper.cpp](https://github.com/ggerganov/whisper.cpp) author:

* https://huggingface.co/ggerganov/whisper.cpp
* https://ggml.ggerganov.com

You should place the downloaded .bin model in '\<openHAB userdata\>/whisper/' so the add-on can find it.

Remember to check that you have enough RAM to load the model; the estimated RAM consumption is listed on the Hugging Face link.

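As an illustration, downloading a model could look like the sketch below (the `model_url` helper is hypothetical, and the userdata path `/var/lib/openhab` is an assumption that depends on your installation; the URL pattern follows the Hugging Face repository linked above):

```shell
# model_url NAME: build the Hugging Face download URL for a ggml model file.
model_url() {
  printf 'https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-%s.bin' "$1"
}

# Download the tiny model into the openHAB userdata folder, e.g.:
# curl -L -o /var/lib/openhab/whisper/ggml-tiny.bin "$(model_url tiny)"
model_url tiny
```
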
## Using an alternative whisper.cpp library

It's possible to use your own build of the whisper.cpp shared library with this add-on.

On `Linux/macOS` you need to place the `libwhisper.so`/`libwhisper.dylib` file at `/usr/local/lib/`.

On `Windows` the `whisper.dll` file needs to be placed in any directory listed in the `$env:PATH` variable, for example `X:\\Windows\System32\`.

In the [whisper.cpp](https://github.com/ggerganov/whisper.cpp) README you can find information about the flags required to enable different acceleration methods in the cmake build, along with other relevant information.

Note: You need to restart openHAB to reload the library.

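A minimal sketch of the platform-dependent destination paths described above (the `lib_dest` helper and the `build/` source path are assumptions, not part of the add-on):

```shell
# lib_dest OS: the expected shared library path for a given platform.
lib_dest() {
  case "$1" in
    Linux)  echo /usr/local/lib/libwhisper.so ;;
    Darwin) echo /usr/local/lib/libwhisper.dylib ;;
    *)      echo "unsupported platform: $1" >&2; return 1 ;;
  esac
}

# Install a self-built library, e.g. (source path is an assumption):
# sudo cp build/libwhisper.so "$(lib_dest "$(uname -s)")"
lib_dest Linux
```
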
## Grammar

The whisper.cpp library allows you to define a grammar to alter the transcription results without fine-tuning the model.

Internally, whisper works by inferring a matrix of possible tokens from the audio and then resolving the final transcription from it using either the Greedy or Beam Search algorithm.
The grammar feature allows you to modify the probabilities of the inferred tokens by adding a penalty to the tokens outside the grammar, so that the transcription gets resolved in a different way.

It's a way to get the smallest models to perform better over a limited grammar.

The grammar should be defined using [BNF](https://en.wikipedia.org/wiki/Backus–Naur_form), and the root variable should resolve the full grammar.
It allows using regex and optional parts to make it more dynamic.

This is a basic grammar example:

```BNF
root ::= (light_switch | light_state | tv_channel) "."
light_switch ::= "turn the light " ("on" | "off")
light_state ::= "set light to " ("high" | "low")
tv_channel ::= ("set ")? "tv channel to " [0-9]+
```

You can provide the grammar and enable its usage using the binding configuration.

## Configuration

Use your favorite configuration UI to edit the Whisper settings:

### Speech to Text Configuration

General options.

* **Model Name** - Model name. The 'ggml-' prefix and '.bin' extension are optional here but required on the filename. (ex: tiny.en -> ggml-tiny.en.bin)
* **Preload Model** - Keep the whisper model loaded.
* **Single Utterance Mode** - When enabled, recognition stops listening after a single utterance.
* **Min Transcription Seconds** - Minimum audio duration passed to whisper, in seconds.
* **Max Transcription Seconds** - Maximum number of seconds after which the transcription is triggered, without waiting for silence to be detected.
* **Initial Silence Seconds** - Maximum number of seconds without any voice activity before the transcription is aborted.
* **Max Silence Seconds** - Maximum number of consecutive silence seconds before the transcription is triggered.
* **Remove Silence** - Remove silence from the start and end of the audio to transcribe.

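The Model Name convention above (the 'ggml-' prefix and '.bin' extension are optional in the option but required on the file) can be sketched as a small helper (the `model_filename` function name is hypothetical):

```shell
# model_filename NAME: add the 'ggml-' prefix and '.bin' extension if missing.
model_filename() {
  name=${1#ggml-}    # strip the optional 'ggml-' prefix
  name=${name%.bin}  # strip the optional '.bin' extension
  printf 'ggml-%s.bin' "$name"
}

model_filename tiny.en           # -> ggml-tiny.en.bin
model_filename ggml-tiny.en.bin  # -> ggml-tiny.en.bin (already complete)
```
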
### Voice Activity Detection Configuration

Configure VAD options.

* **Audio Step** - Audio processing step in seconds for the voice activity detection.
* **Voice Activity Detection Mode** - Selected VAD mode.
* **Voice Activity Detection Sensitivity** - Percentage in the range 0-1 of voice activity within one second required to consider it voice.
* **Voice Activity Detection Step** - VAD detector internal step in ms (only 10, 20 or 30 are allowed). (Audio Step / Voice Activity Detection Step = number of VAD executions per audio step.)

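As an illustration of the relation above (the values are example choices, not defaults): with an Audio Step of 0.3 s and a VAD step of 30 ms, the detector runs 10 times per audio step.

```shell
# VAD executions per audio step = audio step (ms) / VAD step (ms).
audio_step_ms=300   # Audio Step of 0.3 s, converted to milliseconds
vad_step_ms=30      # Voice Activity Detection Step: one of 10, 20 or 30
echo $((audio_step_ms / vad_step_ms))   # -> 10
```
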
### Whisper Configuration

Configure whisper options.

* **Threads** - Number of threads used by whisper. (0 to use the host's max threads)
* **Sampling Strategy** - Sampling strategy used.
* **Beam Size** - Beam Size configuration for the Beam Search sampling strategy.
* **Greedy Best Of** - Best Of configuration for the Greedy sampling strategy.
* **Speed Up** - Speed up the audio by x2. (Reduces accuracy)
* **Audio Context** - Overwrite the audio context size. (0 to use whisper's default context size)
* **Temperature** - Temperature threshold.
* **Initial Prompt** - Initial prompt for whisper.
* **OpenVINO Device** - Initialize the OpenVINO encoder. (The built-in binaries do not support OpenVINO; this has no effect with them.)
* **Use GPU** - Enables GPU usage. (The built-in binaries do not support GPU usage; this has no effect with them.)

### Grammar Configuration

Configure the grammar options.

* **Grammar** - Grammar to use, in GBNF format (the whisper.cpp BNF variant).
* **Use Grammar** - Enable grammar usage.
* **Grammar Penalty** - Penalty for non-grammar tokens.

#### Grammar Example:

```gbnf
# Grammar should define a root expression that should end with a dot.
root ::= " " command "."
# Alternative command expression to expand into the root.
command ::= "Turn " onoff " " (connector)? thing |
            put " " thing " to " state |
            watch " " show " at bedroom" |
            "Start " timer " minutes timer"

# You can use as many expressions as you need.

thing ::= "light" | "bedroom light" | "living room light" | "tv"

put ::= "set" | "put"

onoff ::= "on" | "off"

watch ::= "watch" | "play"

connector ::= "the"

state ::= "low" | "high" | "normal"

show ::= [a-zA-Z]+

timer ::= [0-9]+
```

### Messages Configuration

* **No Results Message** - Message to be told when there are no results.
* **Error Message** - Message to be told when an exception occurs.

### Developer Configuration

* **Create WAV Record** - Create a wav audio file on each whisper execution; also creates a '.prop' file containing the transcription.
* **Record Sample Format** - Change the record sample format. (allows i16 or f32)
* **Enable Whisper Log** - Emit whisper.cpp library logs as add-on debug logs.

You can find information [here](https://github.com/givimad/whisper-finetune-oh) on how to fine-tune a model using the generated records.

### Configuration via a text file

In case you would like to set up the service via a text file, create a new file named `whisperstt.cfg` in `$OPENHAB_ROOT/conf/services`.

Its contents should look similar to:

```
org.openhab.voice.whisperstt:modelName=tiny
org.openhab.voice.whisperstt:initSilenceSeconds=0.3
org.openhab.voice.whisperstt:removeSilence=true
org.openhab.voice.whisperstt:stepSeconds=0.3
org.openhab.voice.whisperstt:vadStep=0.5
org.openhab.voice.whisperstt:singleUtteranceMode=true
org.openhab.voice.whisperstt:preloadModel=false
org.openhab.voice.whisperstt:vadMode=LOW_BITRATE
org.openhab.voice.whisperstt:vadSensitivity=0.1
org.openhab.voice.whisperstt:maxSilenceSeconds=2
org.openhab.voice.whisperstt:minSeconds=2
org.openhab.voice.whisperstt:maxSeconds=10
org.openhab.voice.whisperstt:threads=0
org.openhab.voice.whisperstt:audioContext=0
org.openhab.voice.whisperstt:samplingStrategy=GREEDY
org.openhab.voice.whisperstt:temperature=0
org.openhab.voice.whisperstt:noResultsMessage="Sorry, I didn't understand you"
org.openhab.voice.whisperstt:errorMessage="Sorry, something went wrong"
org.openhab.voice.whisperstt:createWAVRecord=false
org.openhab.voice.whisperstt:recordSampleFormat=i16
org.openhab.voice.whisperstt:speedUp=false
org.openhab.voice.whisperstt:beamSize=4
org.openhab.voice.whisperstt:enableWhisperLog=false
org.openhab.voice.whisperstt:greedyBestOf=4
org.openhab.voice.whisperstt:initialPrompt=
org.openhab.voice.whisperstt:openvinoDevice=""
org.openhab.voice.whisperstt:useGPU=false
org.openhab.voice.whisperstt:useGrammar=false
org.openhab.voice.whisperstt:grammarPenalty=80.0
org.openhab.voice.whisperstt:grammarLines=
```

### Default Speech-to-Text Configuration

You can select your preferred default Speech-to-Text in the UI:

* Go to **Settings**.
* Edit **System Services - Voice**.
* Set **Whisper** as **Speech-to-Text**.

In case you would like to set up these settings via a text file, you can edit the file `runtime.cfg` in `$OPENHAB_ROOT/conf/services` and set the following entries:

```
org.openhab.voice:defaultSTT=whisperstt
```