Commit bf82221

[WhisperSTT] Initial contribution (openhab#15166)
Signed-off-by: Miguel Álvarez <[email protected]>
Signed-off-by: GiviMAD <[email protected]>
1 parent 56db6f8 commit bf82221

15 files changed: +1688 -0 lines changed

CODEOWNERS

+1
@@ -451,6 +451,7 @@
 /bundles/org.openhab.voice.voicerss/ @lolodomo
 /bundles/org.openhab.voice.voskstt/ @GiviMAD
 /bundles/org.openhab.voice.watsonstt/ @GiviMAD
+/bundles/org.openhab.voice.whisperstt/ @GiviMAD
 /itests/org.openhab.automation.groovyscripting.tests/ @wborn
 /itests/org.openhab.automation.jsscriptingnashorn.tests/ @wborn
 /itests/org.openhab.binding.astro.tests/ @gerrieg

bom/openhab-addons/pom.xml

+5
@@ -2251,6 +2251,11 @@
       <artifactId>org.openhab.voice.watsonstt</artifactId>
       <version>${project.version}</version>
     </dependency>
+    <dependency>
+      <groupId>org.openhab.addons.bundles</groupId>
+      <artifactId>org.openhab.voice.whisperstt</artifactId>
+      <version>${project.version}</version>
+    </dependency>
   </dependencies>
 
 </project>
@@ -0,0 +1,35 @@
This content is produced and maintained by the openHAB project.

* Project home: https://www.openhab.org

== Declared Project Licenses

This program and the accompanying materials are made available under the terms
of the Eclipse Public License 2.0 which is available at
https://www.eclipse.org/legal/epl-2.0/.

== Source Code

https://github.com/openhab/openhab-addons

== Third-party Content

io.github.givimad: whisper-jni
* License: Apache 2.0 License
* Project: https://github.com/GiviMAD/whisper-jni
* Source: https://github.com/GiviMAD/whisper-jni/tree/main/src/

native dependency: whisper.cpp
* License: MIT License https://github.com/ggerganov/whisper.cpp/blob/master/LICENSE
* Project: https://github.com/ggerganov/whisper.cpp
* Source: https://github.com/ggerganov/whisper.cpp

io.github.givimad: libfvad-jni
* License: Apache 2.0 License https://github.com/GiviMAD/libfvad-jni/blob/main/LICENSE
* Project: https://github.com/GiviMAD/libfvad-jni
* Source: https://github.com/GiviMAD/libfvad-jni/tree/main/src/

native dependency: libfvad
* License: BSD License https://github.com/dpirch/libfvad/blob/master/LICENSE
* Project: https://github.com/dpirch/libfvad
* Source: https://github.com/dpirch/libfvad
@@ -0,0 +1,248 @@
# Whisper Speech-to-Text

Whisper STT Service uses [whisper.cpp](https://github.com/ggerganov/whisper.cpp) to perform offline speech-to-text in openHAB.
It also uses [libfvad](https://github.com/dpirch/libfvad) for voice activity detection to isolate a single command to transcribe, which speeds up execution.

[Whisper.cpp](https://github.com/ggerganov/whisper.cpp) is a highly optimized, lightweight C++ implementation of [whisper](https://github.com/openai/whisper) that is easy to integrate into different platforms and applications.

Whisper enables speech recognition for multiple languages and dialects:

english, chinese, german, spanish, russian, korean, french, japanese, portuguese, turkish, polish, catalan, dutch, arabic, swedish,
italian, indonesian, hindi, finnish, vietnamese, hebrew, ukrainian, greek, malay, czech, romanian, danish, hungarian, tamil, norwegian,
thai, urdu, croatian, bulgarian, lithuanian, latin, maori, malayalam, welsh, slovak, telugu, persian, latvian, bengali, serbian, azerbaijani,
slovenian, kannada, estonian, macedonian, breton, basque, icelandic, armenian, nepali, mongolian, bosnian, kazakh, albanian, swahili, galician,
marathi, punjabi, sinhala, khmer, shona, yoruba, somali, afrikaans, occitan, georgian, belarusian, tajik, sindhi, gujarati, amharic, yiddish, lao,
uzbek, faroese, haitian, pashto, turkmen, nynorsk, maltese, sanskrit, luxembourgish, myanmar, tibetan, tagalog, malagasy, assamese, tatar, lingala,
hausa, bashkir, javanese and sundanese.
18+
## Supported platforms
19+
20+
This add-on uses some native binaries to work.
21+
You can find here the used [whisper.cpp Java wrapper](https://github.com/GiviMAD/whisper-jni) and [libfvad Java wrapper](https://github.com/GiviMAD/libfvad-jni).
22+
23+
The following platforms are supported:
24+
25+
* Windows10 x86_64
26+
* Debian GLIBC x86_64/arm64 (min GLIBC version 2.31 / min Debian version Focal)
27+
* macOS x86_64/arm64 (min version v11.0)
28+
29+
The native binaries for those platforms are included in this add-on provided with the openHAB distribution.
30+
31+
## CPU compatibility
32+
33+
To use this binding it's recommended to use a device at least as powerful as the RaspberryPI 5 with a modern CPU.
34+
The execution times on Raspberry PI 4 are x2, so just the tiny model can be run on under 5 seconds.
35+
36+
If you are going to use the binding in a `x86_64` host the CPU should support the flags: `avx2`, `fma`, `f16c`, `avx`.
37+
You can check those flags on linux using the terminal with `lscpu`.
38+
You can check those flags on Windows using a program like `CPU-Z`.
39+
40+
If you are going to use the binding in a `arm64` host the CPU should support the flags: `fphp`.
41+
You can check those flags on linux using the terminal with `lscpu`.
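
As a rough illustration of the flag check described above, the following Python sketch compares the flags reported in `/proc/cpuinfo` on Linux against the lists from this section. It is not part of the add-on; the helper names `parse_cpu_flags` and `missing_flags` are hypothetical, and it assumes the usual `flags` (x86_64) and `Features` (arm64) lines in `/proc/cpuinfo`.

```python
# Required CPU flags as listed in this README; everything else here is illustrative.
REQUIRED_FLAGS = {
    "x86_64": {"avx", "avx2", "fma", "f16c"},
    "arm64": {"fphp"},
}


def parse_cpu_flags(cpuinfo_text):
    """Collect the flags from the 'flags' (x86_64) or 'Features' (arm64) lines."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() in ("flags", "features"):
            flags.update(value.split())
    return flags


def missing_flags(cpuinfo_text, arch):
    """Return the required flags for `arch` that the CPU does not report, sorted."""
    return sorted(REQUIRED_FLAGS[arch] - parse_cpu_flags(cpuinfo_text))
```

On a Linux host it could be called as `missing_flags(open("/proc/cpuinfo").read(), "x86_64")`; an empty list means all flags listed above were found.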

## Transcription time

On a Raspberry Pi 5, the approximate transcription times are:

| model      | exec time |
| ---------- | --------: |
| tiny.bin   |      1.5s |
| base.bin   |        3s |
| small.bin  |      8.5s |
| medium.bin |       17s |
54+
55+
## Configuring the model
56+
57+
Before you can use this service you should configure your model.
58+
59+
You can download them from the sources provided by the [whisper.cpp](https://github.com/ggerganov/whisper.cpp) author:
60+
61+
* https://huggingface.co/ggerganov/whisper.cpp
62+
* https://ggml.ggerganov.com
63+
64+
You should place the downloaded .bin model in '\<openHAB userdata\>/whisper/' so the add-ons can find them.
65+
66+
Remember to check that you have enough RAM to load the model, estimated RAM consumption can be checked on the huggingface link.

## Using alternative whisper.cpp library

It's possible to use your own build of the whisper.cpp shared library with this add-on.

On `Linux/macOS` you need to place the `libwhisper.so/libwhisper.dylib` at `/usr/local/lib/`.

On `Windows` the `whisper.dll` file needs to be placed in any directory listed in the `$env:PATH` variable, for example `X:\Windows\System32\`.

In the [Whisper.cpp](https://github.com/ggerganov/whisper.cpp) README you can find information about the cmake build flags required to enable the different acceleration methods, along with other relevant information.

Note: You need to restart openHAB to reload the library.

## Grammar

The whisper.cpp library allows you to define a grammar that alters the transcription results without fine-tuning the model.

Internally whisper works by inferring a matrix of possible tokens from the audio and then resolving the final transcription from it using either the Greedy or the Beam Search algorithm.
The grammar feature lets you modify the probabilities of the inferred tokens by adding a penalty to the tokens outside the grammar, so that the transcription gets resolved in a different way.

It's a way to get the smallest models to perform better over a limited grammar.
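
The penalty idea described above can be pictured with a toy Python sketch. This is a deliberate simplification and not the whisper.cpp implementation: the token scores are invented, `pick_token` is a hypothetical helper, and real decoding works on full token sequences rather than a single step.

```python
def pick_token(logits, allowed, penalty):
    """Greedy pick after subtracting `penalty` from every token outside the grammar."""
    adjusted = {
        token: score if token in allowed else score - penalty
        for token, score in logits.items()
    }
    return max(adjusted, key=adjusted.get)


# Invented scores for the token following "turn the light ".
logits = {"on": 2.1, "off": 1.9, "own": 2.4}
# Tokens the grammar accepts at this position.
allowed = {"on", "off"}

print(pick_token(logits, allowed, penalty=0.0))   # own (highest raw score wins)
print(pick_token(logits, allowed, penalty=80.0))  # on (best token inside the grammar)
```

With no penalty the acoustically similar but unwanted token wins; with a large penalty the best token inside the grammar is chosen instead.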

The grammar should be defined using [BNF](https://en.wikipedia.org/wiki/Backus–Naur_form), and the root variable should resolve the full grammar.
It allows using regex and optional parts to make it more dynamic.

This is a basic grammar example:

```BNF
root ::= (light_switch | light_state | tv_channel) "."
light_switch ::= "turn the light " ("on" | "off")
light_state ::= "set light to " ("high" | "low")
tv_channel ::= ("set ")? "tv channel to " [0-9]+
```

You can provide the grammar and enable its usage using the add-on configuration.

## Configuration

Use your favorite configuration UI to edit the Whisper settings:

### Speech to Text Configuration

General options.

* **Model Name** - Model name. The 'ggml-' prefix and '.bin' extension are optional here but required on the filename. (e.g. tiny.en -> ggml-tiny.en.bin)
* **Preload Model** - Keep the whisper model loaded.
* **Single Utterance Mode** - When enabled, recognition stops listening after a single utterance.
* **Min Transcription Seconds** - Minimum audio duration passed to whisper, in seconds.
* **Max Transcription Seconds** - Maximum number of seconds after which the transcription is forced, without waiting for silence to be detected.
* **Initial Silence Seconds** - Maximum number of seconds without any voice activity before the transcription is aborted.
* **Max Silence Seconds** - Maximum number of consecutive seconds of silence that triggers the transcription.
* **Remove Silence** - Remove leading and trailing silence from the audio to transcribe.
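
The **Model Name** rule above (optional `ggml-` prefix and `.bin` extension in the setting, both required on the file) can be sketched as follows. This is only an illustration of the naming rule; `model_file_name` is a hypothetical helper and not necessarily how the add-on resolves the file.

```python
def model_file_name(model_name):
    """Expand a configured model name to the file name expected on disk."""
    name = model_name.strip()
    if not name.startswith("ggml-"):
        name = "ggml-" + name
    if not name.endswith(".bin"):
        name += ".bin"
    return name


print(model_file_name("tiny.en"))           # ggml-tiny.en.bin
print(model_file_name("ggml-tiny.en.bin"))  # ggml-tiny.en.bin
```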

### Voice Activity Detection Configuration

Configure VAD options.

* **Audio Step** - Audio processing step in seconds for the voice activity detection.
* **Voice Activity Detection Mode** - Selected VAD Mode.
* **Voice Activity Detection Sensitivity** - Fraction, in the range 0-1, of voice activity within one second required to consider it as voice.
* **Voice Activity Detection Step** - VAD detector internal step in ms (only 10, 20 or 30 are allowed). (Audio Step / Voice Activity Detection Step = number of VAD executions per audio step).
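
To make the relation between **Audio Step** and **Voice Activity Detection Step** concrete, here is a small worked example in Python. The function name is hypothetical and the values are only examples; it simply applies the division stated above after converting the audio step to milliseconds.

```python
def vad_executions_per_step(audio_step_seconds, vad_step_ms):
    """Number of VAD executions per audio step (Audio Step / VAD Step)."""
    if vad_step_ms not in (10, 20, 30):
        raise ValueError("VAD step must be 10, 20 or 30 ms")
    audio_step_ms = int(round(audio_step_seconds * 1000))
    return audio_step_ms // vad_step_ms


print(vad_executions_per_step(0.3, 20))  # 15
print(vad_executions_per_step(0.3, 30))  # 10
```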

### Whisper Configuration

Configure whisper options.

* **Threads** - Number of threads used by whisper. (0 to use host max threads)
* **Sampling Strategy** - Sampling strategy used.
* **Beam Size** - Beam Size configuration for sampling strategy Beam Search.
* **Greedy Best Of** - Best Of configuration for sampling strategy Greedy.
* **Speed Up** - Speed up audio by x2. (Reduced accuracy)
* **Audio Context** - Overwrite the audio context size. (0 to use whisper default context size)
* **Temperature** - Temperature threshold.
* **Initial Prompt** - Initial prompt for whisper.
* **OpenVINO Device** - Initialize OpenVINO encoder. (built-in binaries do not support OpenVINO, this has no effect)
* **Use GPU** - Enables GPU usage. (built-in binaries do not support GPU usage, this has no effect)

### Grammar Configuration

Configure the grammar options.

* **Grammar** - Grammar to use in GBNF format (whisper.cpp BNF variant).
* **Use Grammar** - Enable grammar usage.
* **Grammar penalty** - Penalty for non-grammar tokens.

#### Grammar Example:

```gbnf
# Grammar should define a root expression that should end with a dot.
root ::= " " command "."
# Alternative command expression to expand into the root.
# The optional connector carries its own trailing space so it stays separated from the thing.
command ::= "Turn " onoff " " (connector " ")? thing |
            put " " thing " to " state |
            watch " " show " at bedroom" |
            "Start " timer " minutes timer"

# You can use as many expressions as you need.

thing ::= "light" | "bedroom light" | "living room light" | "tv"

put ::= "set" | "put"

onoff ::= "on" | "off"

watch ::= "watch" | "play"

connector ::= "the"

state ::= "low" | "high" | "normal"

show ::= [a-zA-Z]+

timer ::= [0-9]+
```

### Messages Configuration

* **No Results Message** - Message spoken when there are no results.
* **Error Message** - Message spoken when an exception occurs.

### Developer Configuration

* **Create WAV Record** - Create a WAV audio file on each whisper execution, along with a '.prop' file containing the transcription.
* **Record Sample Format** - Change the record sample format. (allows i16 or f32)
* **Enable Whisper Log** - Emit whisper.cpp library logs as add-on debug logs.

You can find information on how to fine-tune a model using the generated records [here](https://github.com/givimad/whisper-finetune-oh).

### Configuration via a text file

In case you would like to set up the service via a text file, create a new file in `$OPENHAB_ROOT/conf/services` named `whisperstt.cfg`.

Its contents should look similar to:

```
org.openhab.voice.whisperstt:modelName=tiny
org.openhab.voice.whisperstt:initSilenceSeconds=0.3
org.openhab.voice.whisperstt:removeSilence=true
org.openhab.voice.whisperstt:stepSeconds=0.3
org.openhab.voice.whisperstt:vadStep=0.5
org.openhab.voice.whisperstt:singleUtteranceMode=true
org.openhab.voice.whisperstt:preloadModel=false
org.openhab.voice.whisperstt:vadMode=LOW_BITRATE
org.openhab.voice.whisperstt:vadSensitivity=0.1
org.openhab.voice.whisperstt:maxSilenceSeconds=2
org.openhab.voice.whisperstt:minSeconds=2
org.openhab.voice.whisperstt:maxSeconds=10
org.openhab.voice.whisperstt:threads=0
org.openhab.voice.whisperstt:audioContext=0
org.openhab.voice.whisperstt:samplingStrategy=GREEDY
org.openhab.voice.whisperstt:temperature=0
org.openhab.voice.whisperstt:noResultsMessage="Sorry, I didn't understand you"
org.openhab.voice.whisperstt:errorMessage="Sorry, something went wrong"
org.openhab.voice.whisperstt:createWAVRecord=false
org.openhab.voice.whisperstt:recordSampleFormat=i16
org.openhab.voice.whisperstt:speedUp=false
org.openhab.voice.whisperstt:beamSize=4
org.openhab.voice.whisperstt:enableWhisperLog=false
org.openhab.voice.whisperstt:greedyBestOf=4
org.openhab.voice.whisperstt:initialPrompt=
org.openhab.voice.whisperstt:openvinoDevice=""
org.openhab.voice.whisperstt:useGPU=false
org.openhab.voice.whisperstt:useGrammar=false
org.openhab.voice.whisperstt:grammarPenalty=80.0
org.openhab.voice.whisperstt:grammarLines=
```
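
For a quick sanity check of such a file outside openHAB, the entries shown above can be read into a dictionary with a few lines of Python. This is a hedged sketch only: openHAB parses the file itself, `parse_service_cfg` is a hypothetical helper, and stripping the surrounding double quotes from values is an assumption made for readability.

```python
PREFIX = "org.openhab.voice.whisperstt:"


def parse_service_cfg(text):
    """Map option names (without the service prefix) to their raw string values."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if not sep or not key.startswith(PREFIX):
            continue  # skip lines that are not whisperstt options
        settings[key[len(PREFIX):]] = value.strip().strip('"')
    return settings


example = """\
org.openhab.voice.whisperstt:modelName=tiny
org.openhab.voice.whisperstt:grammarPenalty=80.0
org.openhab.voice.whisperstt:initialPrompt=
"""
print(parse_service_cfg(example))
```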

### Default Speech-to-Text Configuration

You can select your preferred default Speech-to-Text in the UI:

* Go to **Settings**.
* Edit **System Services - Voice**.
* Set **Whisper** as **Speech-to-Text**.

In case you would like to set up these settings via a text file, you can edit the file `runtime.cfg` in `$OPENHAB_ROOT/conf/services` and set the following entries:

```
org.openhab.voice:defaultSTT=whisperstt
```
@@ -0,0 +1,29 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://maven.apache.org/POM/4.0.0"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">

  <modelVersion>4.0.0</modelVersion>

  <parent>
    <groupId>org.openhab.addons.bundles</groupId>
    <artifactId>org.openhab.addons.reactor.bundles</artifactId>
    <version>4.2.0-SNAPSHOT</version>
  </parent>

  <artifactId>org.openhab.voice.whisperstt</artifactId>

  <name>openHAB Add-ons :: Bundles :: Voice :: Whisper Speech-to-Text</name>
  <dependencies>
    <!--Deps -->
    <dependency>
      <groupId>io.github.givimad</groupId>
      <artifactId>whisper-jni</artifactId>
      <version>1.6.1</version>
    </dependency>
    <dependency>
      <groupId>io.github.givimad</groupId>
      <artifactId>libfvad-jni</artifactId>
      <version>1.0.0-0</version>
    </dependency>
  </dependencies>
</project>
@@ -0,0 +1,9 @@
<?xml version="1.0" encoding="UTF-8"?>
<features name="org.openhab.voice.whisperstt-${project.version}" xmlns="http://karaf.apache.org/xmlns/features/v1.4.0">
  <repository>mvn:org.openhab.core.features.karaf/org.openhab.core.features.karaf.openhab-core/${ohc.version}/xml/features</repository>

  <feature name="openhab-voice-whisperstt" description="Whisper Speech-to-Text" version="${project.version}">
    <feature>openhab-runtime-base</feature>
    <bundle start-level="80">mvn:org.openhab.addons.bundles/org.openhab.voice.whisperstt/${project.version}</bundle>
  </feature>
</features>
