
Commit 83fd31f

chmjkb, jakmro, and Mateusz Kopciński authored
docs: Speech to Text (#111)
## Description

### Type of change

- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [x] Documentation update (improves or adds clarity to existing documentation)

### Tested on

- [ ] iOS
- [ ] Android

### Checklist

- [ ] I have performed a self-review of my code
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have updated the documentation accordingly
- [ ] My changes generate no new warnings

---------

Co-authored-by: jakmro <[email protected]>
Co-authored-by: Jakub Mroz <[email protected]>
Co-authored-by: Mateusz Kopciński <[email protected]>
1 parent 341d7a1 commit 83fd31f

File tree: 11 files changed, +207 −6 lines changed

docs/docs/benchmarks/_category_.json (+1 −1)

```diff
@@ -1,6 +1,6 @@
 {
   "label": "Benchmarks",
-  "position": 7,
+  "position": 8,
   "link": {
     "type": "generated-index"
   }
```

docs/docs/benchmarks/memory-usage.md (+7)

```diff
@@ -47,3 +47,10 @@ sidebar_position: 2
 | LLAMA3_2_3B           | 7.1 | 7.3 |
 | LLAMA3_2_3B_SPINQUANT | 3.7 | 3.8 |
 | LLAMA3_2_3B_QLORA     | 4   | 4.1 |
+
+## Speech to text
+
+| Model          | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] |
+| -------------- | ---------------------- | ------------------ |
+| WHISPER_TINY   | 900                    | 600                |
+| MOONSHINE_TINY | 650                    | 560                |
```

docs/docs/benchmarks/model-size.md (+7)

```diff
@@ -52,3 +52,10 @@ sidebar_position: 1
 | LLAMA3_2_3B           | 6.43 |
 | LLAMA3_2_3B_SPINQUANT | 2.55 |
 | LLAMA3_2_3B_QLORA     | 2.65 |
+
+## Speech to text
+
+| Model          | XNNPACK [MB] |
+| -------------- | ------------ |
+| WHISPER_TINY   | 231.0        |
+| MOONSHINE_TINY | 148.9        |
```

docs/docs/computer-vision/_category_.json (+1 −1)

```diff
@@ -1,6 +1,6 @@
 {
   "label": "Computer Vision",
-  "position": 3,
+  "position": 4,
   "link": {
     "type": "generated-index"
   }
```

docs/docs/hookless-api/LLMModule.md (+1 −1)

```diff
@@ -3,7 +3,7 @@ title: LLMModule
 sidebar_position: 3
 ---
 
-Hookless implementation of the [useLLM](../llms/running-llms.md) hook.
+Hookless implementation of the [useLLM](../llms/useLLM.md) hook.
 
 ## Reference
 
```
docs/docs/hookless-api/SpeechToTextModule.md (new file, +55)

---
title: SpeechToTextModule
sidebar_position: 6
---

Hookless implementation of the [useSpeechToText](../speech-to-text/) hook.

## Reference

```typescript
import { SpeechToTextModule } from 'react-native-executorch';

const audioUrl = 'https://www.your-url.com/cool-audio.mp3';

// Loading the model
const onSequenceUpdate = (sequence) => {
  console.log(sequence);
};
await SpeechToTextModule.load('moonshine', onSequenceUpdate);

// Loading the audio and running the model
await SpeechToTextModule.loadAudio(audioUrl);
const transcribedText = await SpeechToTextModule.transcribe();
```
### Methods

| Method       | Type | Description |
| ------------ | ---- | ----------- |
| `load`       | <code>(modelName: 'whisper' &#124; 'moonshine', transcribeCallback?: (sequence: string) => void, modelDownloadProgressCallback?: (downloadProgress: number) => void, encoderSource?: ResourceSource, decoderSource?: ResourceSource, tokenizerSource?: ResourceSource) => Promise&lt;void&gt;</code> | Loads the model specified by `modelName`, where `encoderSource`, `decoderSource`, and `tokenizerSource` are strings specifying the location of the binaries for the model. `modelDownloadProgressCallback` allows you to monitor the current progress of the model download, while `transcribeCallback` is invoked with each generated token. |
| `transcribe` | `(waveform?: number[]) => Promise<string>` | Starts a transcription process for a given input array, which should be a waveform at 16kHz. When no input is provided, it uses an internal state which is set by calling `loadAudio`. Resolves a promise with the output transcription when the model is finished. |
| `loadAudio`  | `(url: string) => void` | Loads an audio file from the given URL. It sets an internal state which serves as an input to `transcribe()`. |
| `encode`     | `(waveform: number[]) => Promise<number[]>` | Runs the encoder part of the model. Returns a float array representing the output of the encoder. |
| `decode`     | `(tokens: number[], encodings: number[]) => Promise<number[]>` | Runs the decoder of the model. Returns a single token representing the next token in the output sequence. |
<details>
<summary>Type definitions</summary>

```typescript
type ResourceSource = string | number;
```

</details>
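For intuition only, here is a hypothetical sketch of how `encode` and `decode` could compose into a greedy decoding loop. `START_TOKEN` and `EOS_TOKEN` are model-specific assumptions (not values from this library), detokenization is omitted, and this is not the library's own transcription loop:

```typescript
// Hypothetical values — the real special tokens depend on the model/tokenizer.
const START_TOKEN = 0;
const EOS_TOKEN = 1;

const waveform: number[] = []; // your 16kHz audio samples go here

// Run the encoder once over the whole waveform.
const encodings = await SpeechToTextModule.encode(waveform);

// Repeatedly ask the decoder for the next token until the end-of-sequence
// token appears.
const tokens: number[] = [START_TOKEN];
for (;;) {
  const [next] = await SpeechToTextModule.decode(tokens, encodings);
  if (next === EOS_TOKEN) break;
  tokens.push(next);
}
```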
## Loading the model

To load the model, use the `load` method. The required argument is `modelName`, which serves as an identifier for which model to use. It also accepts optional arguments such as `encoderSource`, `decoderSource`, and `tokenizerSource`, which are strings that specify the location of the binaries for the model. For more information, take a look at the [loading models](../fundamentals/loading-models.md) page. This method returns a promise, which resolves once the model has loaded or rejects with an error.
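For instance, a `load` call that overrides the default binaries might look like the following minimal sketch. The bundled asset paths are hypothetical and depend on where you place the files; every argument after `modelName` is optional:

```typescript
import { SpeechToTextModule } from 'react-native-executorch';

await SpeechToTextModule.load(
  'whisper', // modelName: 'whisper' | 'moonshine'
  (sequence: string) => console.log('partial:', sequence), // transcribeCallback
  (progress: number) => console.log('download:', progress), // modelDownloadProgressCallback
  require('../assets/whisper/encoder.pte'), // encoderSource (hypothetical path)
  require('../assets/whisper/decoder.pte'), // decoderSource (hypothetical path)
  require('../assets/whisper/tokenizer.json') // tokenizerSource (hypothetical path)
);
```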
## Running the model

To run the model, you can use the `transcribe` method. It accepts one argument, which is an array of numbers representing a waveform at a 16kHz sampling rate. The method returns a promise, which resolves to a string containing the output text or rejects with an error.
## Obtaining the input

To get the input, you can use the `loadAudio` method, which sets the internal input state of the model. Then you can simply call `transcribe` without passing any arguments. It is also possible to pass input from other sources, as long as it is a float array containing the aforementioned 16kHz waveform, as shown below.
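For example, a waveform obtained elsewhere can be handed straight to `transcribe`, bypassing `loadAudio`. This is a minimal sketch; `getWaveformSomehow` is a hypothetical stand-in for your own audio pipeline:

```typescript
// Hypothetical stand-in for your own audio pipeline; it must resolve to a
// float array sampled at 16kHz.
declare function getWaveformSomehow(): Promise<number[]>;

const waveform: number[] = await getWaveformSomehow();

// Bypass loadAudio and pass the waveform directly.
const text = await SpeechToTextModule.transcribe(waveform);
console.log(text);
```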

docs/docs/hookless-api/_category_.json (+1 −1)

```diff
@@ -1,6 +1,6 @@
 {
   "label": "Hookless API",
-  "position": 4,
+  "position": 5,
   "link": {
     "type": "generated-index"
   }
```

docs/docs/module-api/_category_.json (+1 −1)

```diff
@@ -1,6 +1,6 @@
 {
   "label": "Module API",
-  "position": 5,
+  "position": 6,
   "link": {
     "type": "generated-index"
   }
```
docs/docs/speech-to-text/_category_.json (new file, +7)

```json
{
  "label": "Speech To Text",
  "position": 3,
  "link": {
    "type": "generated-index"
  }
}
```
docs/docs/speech-to-text/useSpeechToText.md (new file, +125)

---
title: useSpeechToText
sidebar_position: 1
---
With the latest `v0.3.0` release we introduce a new hook - `useSpeechToText`. Speech to text is a task that allows you to transform spoken language into written text. It is commonly used to implement features such as transcription or voice assistants. As of now, [all supported STT models](#supported-models) run on the XNNPACK backend.
:::info
Currently, we do not support direct microphone input streaming to the model. Instead, in v0.3.0, we provide a way to transcribe an audio file.
:::

:::caution
It is recommended to use the models provided by us, which are available at our [Hugging Face repository](https://huggingface.co/software-mansion/react-native-executorch-moonshine-tiny). You can also use the [constants](https://github.com/software-mansion/react-native-executorch/tree/main/src/constants/modelUrls.ts) shipped with our library.
:::
## Reference

```typescript
import { useSpeechToText } from 'react-native-executorch';

const { transcribe, error, loadAudio } = useSpeechToText({
  modelName: 'moonshine',
});

const audioUrl = ...; // URL with audio to transcribe

await loadAudio(audioUrl);
const transcription = await transcribe();
if (error) {
  console.log(error);
} else {
  console.log(transcription);
}
```
### Streaming

Given that STT models can process audio no longer than 30 seconds at a time, the input audio needs to be chunked. Naive chunking may cut speech mid-sentence, which can be hard for the model to understand. To make it work, we employed an algorithm (adapted for mobile devices from [whisper-streaming](https://aclanthology.org/2023.ijcnlp-demo.3.pdf)) that uses overlapping audio chunks. This might introduce some overhead, but it allows for processing audio inputs of arbitrary length.
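The sketch below is purely illustrative of the overlapping-window idea; it is not the library's internal implementation. The `windowSize` and `overlapSeconds` parameters here mirror the hook arguments of the same names:

```typescript
const SAMPLE_RATE = 16000; // samples per second at 16kHz

// Split a waveform into windows that overlap by `overlapSeconds`, so speech
// cut off at one chunk boundary reappears intact at the start of the next
// chunk, letting the transcripts be stitched back together.
function chunkWaveform(
  waveform: number[],
  windowSize: number, // chunk length in seconds
  overlapSeconds: number // overlap between consecutive chunks in seconds
): number[][] {
  if (overlapSeconds >= windowSize) {
    throw new Error('overlapSeconds must be smaller than windowSize');
  }
  const window = windowSize * SAMPLE_RATE;
  const step = (windowSize - overlapSeconds) * SAMPLE_RATE;
  const chunks: number[][] = [];
  for (let start = 0; start < waveform.length; start += step) {
    chunks.push(waveform.slice(start, start + window));
    if (start + window >= waveform.length) break; // last chunk reached
  }
  return chunks;
}
```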
### Arguments

**`modelName`**
A literal of `"moonshine" | "whisper"` which serves as an identifier for which model should be used.

**`encoderSource?`**
A string that specifies the location of a .pte file for the encoder. For further information on passing model sources, check out [Loading Models](https://docs.swmansion.com/react-native-executorch/docs/fundamentals/loading-models). Defaults to the [constants](https://github.com/software-mansion/react-native-executorch/blob/main/src/constants/modelUrls.ts) for the given model.

**`decoderSource?`**
Analogous to `encoderSource`, this takes in a string which is a source for the decoder part of the model. Defaults to the [constants](https://github.com/software-mansion/react-native-executorch/blob/main/src/constants/modelUrls.ts) for the given model.

**`tokenizerSource?`**
A string that specifies the location of the tokenizer for the model. This works just as the encoder and decoder sources do. Defaults to the [constants](https://github.com/software-mansion/react-native-executorch/blob/main/src/constants/modelUrls.ts) for the given model.

**`overlapSeconds?`**
Specifies the length of the overlap between consecutive audio chunks (expressed in seconds).

**`windowSize?`**
Specifies the size of each audio chunk (expressed in seconds).
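For example, the chunking behaviour can be tuned when initializing the hook. The values below are purely illustrative, not recommended defaults:

```typescript
const { transcribe, loadAudio } = useSpeechToText({
  modelName: 'whisper',
  windowSize: 10, // 10-second chunks (illustrative value)
  overlapSeconds: 1, // 1 second shared between consecutive chunks (illustrative value)
});
```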
### Returns

| Field              | Type                                     | Description |
| ------------------ | ---------------------------------------- | ----------- |
| `transcribe`       | `(input?: number[]) => Promise<string>`  | Starts a transcription process for a given input array, which should be a waveform at 16kHz. When no input is provided, it uses an internal state which is set by calling `loadAudio`. Resolves a promise with the output transcription when the model is finished. |
| `loadAudio`        | `(url: string) => void`                  | Loads an audio file from the given URL. It sets an internal state which serves as an input to `transcribe()`. |
| `error`            | <code>Error &#124; undefined</code>      | Contains the error message if the model failed to load. |
| `sequence`         | `string`                                 | This property is updated with each generated token. If you're looking to obtain tokens as they're generated, you should use this property. |
| `isGenerating`     | `boolean`                                | Indicates whether the model is currently processing an inference. |
| `isReady`          | `boolean`                                | Indicates whether the model has successfully loaded and is ready for inference. |
| `downloadProgress` | `number`                                 | Tracks the progress of the model download process. |
## Running the model

Before running the model's `transcribe` method, be sure to obtain the waveform of the audio you wish to transcribe. You can either use the `loadAudio` method to load the audio from a URL and save it in the model's internal state, or obtain the waveform on your own (remember to use a 16kHz sampling rate!). In the latter case, just pass the obtained waveform as an argument to the `transcribe` method, which returns a promise resolving to the transcription when successful. If the model fails during inference, the `error` property contains the details of the error. If you want to obtain tokens in a streaming fashion, you can also use the `sequence` property, which is updated with each generated token, similarly to the [useLLM](../llms/useLLM.md) hook.
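For instance, if you already have a waveform of your own, this minimal sketch shows the second path. Here `waveform` is assumed to be a float array you produced yourself at a 16kHz sampling rate:

```typescript
// Assumption: `waveform` was obtained from your own audio pipeline and is
// sampled at 16kHz; `transcribe` comes from the useSpeechToText hook.
const transcription = await transcribe(waveform);
console.log(transcription);
```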
## Example

```typescript
import { Button, Text, View } from 'react-native';
import { useSpeechToText } from 'react-native-executorch';

function App() {
  const { loadAudio, transcribe, sequence, error } = useSpeechToText({
    modelName: 'whisper',
  });

  const audioUrl = ...; // URL with audio to transcribe

  return (
    <View>
      <Button
        onPress={async () => {
          await loadAudio(audioUrl);
          await transcribe();
        }}
        title="Transcribe"
      />
      <Text>{error ? error.message : sequence}</Text>
    </View>
  );
}
```
## Supported models

| Model                                                                  | Language |
| ---------------------------------------------------------------------- | -------- |
| [Whisper tiny.en](https://huggingface.co/openai/whisper-tiny.en)       | English  |
| [Moonshine tiny](https://huggingface.co/UsefulSensors/moonshine-tiny)  | English  |
## Benchmarks

### Model size

| Model          | XNNPACK [MB] |
| -------------- | ------------ |
| WHISPER_TINY   | 231.0        |
| MOONSHINE_TINY | 148.9        |

### Memory usage

| Model          | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] |
| -------------- | ---------------------- | ------------------ |
| WHISPER_TINY   | 900                    | 600                |
| MOONSHINE_TINY | 650                    | 560                |

docs/docs/utils/_category_.json (+1 −1)

```diff
@@ -1,6 +1,6 @@
 {
   "label": "Utils",
-  "position": 6,
+  "position": 7,
   "link": {
     "type": "generated-index"
   }
```
