docs: Speech to Text #111

Merged · 35 commits · Mar 6, 2025

Changes from 32 commits

Commits
7c1a45f
wip
chmjkb Feb 27, 2025
12038d6
wip
chmjkb Feb 28, 2025
c660039
wip
chmjkb Feb 28, 2025
36ef6e4
wip
chmjkb Mar 3, 2025
1358370
wip
chmjkb Mar 3, 2025
b64d90e
delete android benchmarks as they have to be redone
chmjkb Mar 3, 2025
2f076a7
typo fix
chmjkb Mar 3, 2025
7ad17d6
fix lies
chmjkb Mar 3, 2025
c8c5573
cosmetics
chmjkb Mar 3, 2025
389f6e1
wip
chmjkb Mar 5, 2025
19c7e9d
change tokenizer urls
chmjkb Mar 5, 2025
80b6fd7
Add info about constants, improve styling
chmjkb Mar 5, 2025
3ecc002
add missing param, rename response to sequence
chmjkb Mar 5, 2025
0e6193c
fix link
chmjkb Mar 5, 2025
95bae02
fix example
chmjkb Mar 5, 2025
631c330
add benchmarks to benchmarks section of the docs
chmjkb Mar 5, 2025
dda1f25
finished stt docs
chmjkb Mar 5, 2025
ddbd3ce
Shift sidebar by one
jakmro Mar 5, 2025
fadcd1f
Fix formatting
jakmro Mar 5, 2025
1cfa667
Change naming
jakmro Mar 5, 2025
a8ca420
Styling fixes
jakmro Mar 5, 2025
21964f6
Add missing coma
chmjkb Mar 5, 2025
99769c0
Fix syntax issues in example code
chmjkb Mar 5, 2025
1a3ee9b
add docs for hookless API
chmjkb Mar 5, 2025
014eb12
add encode and decode to hookless api
chmjkb Mar 6, 2025
fb3bdc3
remove Rick Astley :(
chmjkb Mar 6, 2025
cab80b5
rephrase overlapSeconds param
chmjkb Mar 6, 2025
4130127
remove inference time
chmjkb Mar 6, 2025
decd2a6
Add type definitions to SpeechToTextModule docs
chmjkb Mar 6, 2025
f2c60e8
Add missing load() info
chmjkb Mar 6, 2025
eb0f63c
Add and improve memory usage benchmarks
jakmro Mar 6, 2025
3e7f804
Fix links
jakmro Mar 6, 2025
aecc0c0
documentation changes
Mar 6, 2025
67a960d
cosmetic stuf
chmjkb Mar 6, 2025
61e1df4
docs changes
Mar 6, 2025
2 changes: 1 addition & 1 deletion docs/docs/benchmarks/_category_.json
@@ -1,6 +1,6 @@
{
"label": "Benchmarks",
"position": 7,
"position": 8,
"link": {
"type": "generated-index"
}
7 changes: 7 additions & 0 deletions docs/docs/benchmarks/memory-usage.md
@@ -34,3 +34,10 @@ sidebar_position: 2
| LLAMA3_2_3B | 7.1 | 7.3 |
| LLAMA3_2_3B_SPINQUANT | 3.7 | 3.8 |
| LLAMA3_2_3B_QLORA | 4 | 4.1 |

## Speech to text

| Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] |
| -------------- | ---------------------- | ------------------ |
| WHISPER_TINY | 900 | 600 |
| MOONSHINE_TINY | 650 | 560 |
7 changes: 7 additions & 0 deletions docs/docs/benchmarks/model-size.md
@@ -34,3 +34,10 @@ sidebar_position: 1
| LLAMA3_2_3B | 6.43 |
| LLAMA3_2_3B_SPINQUANT | 2.55 |
| LLAMA3_2_3B_QLORA | 2.65 |

## Speech to text

| Model | XNNPACK [MB] |
| -------------- | ------------ |
| WHISPER_TINY | 231.0 |
| MOONSHINE_TINY | 148.9 |
2 changes: 1 addition & 1 deletion docs/docs/computer-vision/_category_.json
@@ -1,6 +1,6 @@
{
"label": "Computer Vision",
"position": 3,
"position": 4,
"link": {
"type": "generated-index"
}
2 changes: 1 addition & 1 deletion docs/docs/hookless-api/ClassificationModule.md
@@ -3,7 +3,7 @@ title: ClassificationModule
sidebar_position: 1
---

-Hookless implementation of the [useClassification](../computer-vision/useClassification.mdx) hook.
+Hookless implementation of the [useClassification](../computer-vision/useClassification.md) hook.

## Reference

2 changes: 1 addition & 1 deletion docs/docs/hookless-api/LLMModule.md
@@ -3,7 +3,7 @@ title: LLMModule
sidebar_position: 3
---

-Hookless implementation of the [useLLM](../llms/running-llms.md) hook.
+Hookless implementation of the [useLLM](../llms/useLLM.md) hook.

## Reference

2 changes: 1 addition & 1 deletion docs/docs/hookless-api/ObjectDetectionModule.md
@@ -3,7 +3,7 @@ title: ObjectDetectionModule
sidebar_position: 5
---

-Hookless implementation of the [useObjectDetection](../computer-vision/useObjectDetection.mdx) hook.
+Hookless implementation of the [useObjectDetection](../computer-vision/useObjectDetection.md) hook.

## Reference

55 changes: 55 additions & 0 deletions docs/docs/hookless-api/SpeechToTextModule.md
@@ -0,0 +1,55 @@
---
title: SpeechToTextModule
sidebar_position: 6
---

Hookless implementation of the [useSpeechToText](../speech-to-text/useSpeechToText.md) hook.

## Reference

```typescript
import { SpeechToTextModule } from 'react-native-executorch';

const audioUrl = 'https://www.your-url.com/cool-audio.mp3';

// Loading the model
const onSequenceUpdate = (sequence: string) => {
  console.log(sequence);
};
await SpeechToTextModule.load('moonshine', onSequenceUpdate);

// Loading the audio and running the model
await SpeechToTextModule.loadAudio(audioUrl);
const transcribedText = await SpeechToTextModule.transcribe();
```

### Methods

| Method | Type | Description |
| ------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `load` | <code>(modelName: 'whisper' &#124; 'moonshine', transcribeCallback?: (sequence: string) => void, modelDownloadProgressCallback?: (downloadProgress: number) => void, encoderSource?: ResourceSource, decoderSource?: ResourceSource, tokenizerSource?: ResourceSource) => Promise&lt;void&gt;</code> | Loads the model specified with `modelName`, where `encoderSource`, `decoderSource`, and `tokenizerSource` specify the location of the binaries for the model. `modelDownloadProgressCallback` allows you to monitor the current progress of the model download, while `transcribeCallback` is invoked with each generated token. |
| `transcribe` | `(waveform?: number[]) => Promise<string>` | Starts a transcription process for a given input array, which should be a waveform at 16kHz. When no input is provided, it uses an internal state which is set by calling `loadAudio`. Resolves a promise with the output transcription when the model is finished. |
| `loadAudio` | `(url: string) => Promise<void>` | Loads an audio file from the given URL and sets an internal state which serves as the input to `transcribe()`. |
| `encode` | `(waveform: number[]) => Promise<number[]>` | Runs the encoding part of the model. Returns a float array representing the output of the encoder. |
| `decode` | `(tokens: number[], encodings: number[]) => Promise<number[]>` | Runs the decoder of the model and resolves with the next token in the output sequence. |

<details>
<summary>Type definitions</summary>

```typescript
type ResourceSource = string | number;
```

</details>

## Loading the model

To load the model, use the `load` method. The required argument is `modelName`, which serves as an identifier for which model to use. It also accepts optional arguments such as `encoderSource`, `decoderSource`, and `tokenizerSource`, which specify the location of the binaries for the model. For more information, take a look at the [loading models](../fundamentals/loading-models.md) page. This method returns a promise, which resolves once the model has loaded, or rejects with an error.
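
A minimal sketch of loading Whisper with both callbacks attached is shown below. The callback bodies are illustrative, and the scale of the reported download progress is an assumption rather than a documented guarantee:

```typescript
import { SpeechToTextModule } from 'react-native-executorch';

// Invoked with the sequence generated so far, token by token.
const onSequenceUpdate = (sequence: string) => {
  console.log('Sequence:', sequence);
};

// Invoked during the model download; assumed here to report a 0-1 fraction.
const onDownloadProgress = (downloadProgress: number) => {
  console.log(`Download progress: ${downloadProgress}`);
};

try {
  await SpeechToTextModule.load('whisper', onSequenceUpdate, onDownloadProgress);
} catch (error) {
  console.error('Failed to load the model:', error);
}
```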

## Running the model

To run the model, you can use the `transcribe` method. It accepts one optional argument: an array of numbers representing a waveform at a 16kHz sampling rate. The method returns a promise, which resolves to a string containing the transcription, or rejects with an error.
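
If you already have raw samples in memory, you can pass them straight to `transcribe` instead of calling `loadAudio` first. In the sketch below, `decodeToMono16kHz` is a hypothetical helper standing in for whatever audio-decoding library your app uses:

```typescript
// Hypothetical helper: decodes an audio file into mono samples at 16kHz.
declare function decodeToMono16kHz(path: string): Promise<number[]>;

const waveform = await decodeToMono16kHz('/path/to/audio.wav');

// Pass the waveform directly instead of relying on loadAudio().
const transcription = await SpeechToTextModule.transcribe(waveform);
console.log(transcription);
```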

## Obtaining the input

To get the input, you can use the `loadAudio` method, which sets the internal input state of the model. Then you can call `transcribe` without any arguments. It is also possible to pass input from other sources, as long as it is a float array containing a waveform at a 16kHz sampling rate.
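
For lower-level control, `encode` and `decode` can be combined into a manual generation loop. The sketch below rests on two assumptions: the start and end token ids are placeholders that depend on the loaded model's tokenizer, and `decode` is assumed to resolve with a single-element array holding the next token.

```typescript
// Placeholder token ids -- the real values depend on the model's tokenizer.
const START_TOKEN = 1;
const EOS_TOKEN = 2;
const MAX_TOKENS = 512; // safety bound so the loop always terminates

async function generate(waveform: number[]): Promise<number[]> {
  // Run the encoder once over the whole input waveform.
  const encodings = await SpeechToTextModule.encode(waveform);

  const tokens: number[] = [START_TOKEN];
  for (let i = 0; i < MAX_TOKENS; i++) {
    // Assumed to resolve with [nextToken].
    const [next] = await SpeechToTextModule.decode(tokens, encodings);
    if (next === EOS_TOKEN) break;
    tokens.push(next);
  }
  return tokens;
}
```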
2 changes: 1 addition & 1 deletion docs/docs/hookless-api/StyleTransferModule.md
@@ -3,7 +3,7 @@ title: StyleTransferModule
sidebar_position: 4
---

-Hookless implementation of the [useStyleTransfer](../computer-vision/useStyleTransfer.mdx) hook.
+Hookless implementation of the [useStyleTransfer](../computer-vision/useStyleTransfer.md) hook.

## Reference

2 changes: 1 addition & 1 deletion docs/docs/hookless-api/_category_.json
@@ -1,6 +1,6 @@
{
"label": "Hookless API",
"position": 4,
"position": 5,
"link": {
"type": "generated-index"
}
2 changes: 1 addition & 1 deletion docs/docs/module-api/_category_.json
@@ -1,6 +1,6 @@
{
"label": "Module API",
"position": 5,
"position": 6,
"link": {
"type": "generated-index"
}
7 changes: 7 additions & 0 deletions docs/docs/speech-to-text/_category_.json
@@ -0,0 +1,7 @@
{
"label": "Speech To Text",
"position": 3,
"link": {
"type": "generated-index"
}
}
142 changes: 142 additions & 0 deletions docs/docs/speech-to-text/useSpeechToText.md
@@ -0,0 +1,142 @@
---
title: useSpeechToText
sidebar_position: 1
---

With the `v0.3.0` release we introduce a new hook: `useSpeechToText`. Speech to text is a task that transforms spoken language into written text. It is commonly used to implement features such as transcription or voice assistants. As of now, [all supported STT models](#supported-models) run on the XNNPACK backend.

:::info
Currently, we do not support direct microphone input streaming to the model. Instead, in v0.3.0, we provide a way to transcribe an audio file.
:::

:::caution
It is recommended to use models provided by us, which are available at our [Hugging Face repository](https://huggingface.co/software-mansion/react-native-executorch-moonshine-tiny). You can also use the [constants](https://github.com/software-mansion/react-native-executorch/tree/main/src/constants/modelUrls.ts) shipped with our library.
:::

## Reference

```typescript
import {
  useSpeechToText,
  MOONSHINE_TOKENIZER_URL,
  MOONSHINE_TINY_ENCODER_URL,
  MOONSHINE_TINY_DECODER_URL,
} from 'react-native-executorch';

const model = useSpeechToText({
  encoderSource: MOONSHINE_TINY_ENCODER_URL,
  decoderSource: MOONSHINE_TINY_DECODER_URL,
  tokenizerSource: MOONSHINE_TOKENIZER_URL,
  modelName: 'moonshine',
});

const audioUrl = 'https://your-url.com/your-audio.mp3';

try {
  await model.loadAudio(audioUrl);
  const transcription = await model.transcribe();
  console.log(transcription);
} catch (error) {
  console.error(error);
}
```

### Streaming

Given that STT models take in a fixed-length sequence, the input audio needs to be chunked. Chunking may cut speech mid-sentence, which can be hard for the model to understand. To make this work, we employ an algorithm that uses overlapping audio chunks; this introduces some overhead, but yields much better transcription results for longer audio.
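
To illustrate the idea (a simplified sketch, not the library's actual implementation), overlapping chunks of a 16kHz waveform could be computed like this:

```typescript
const SAMPLING_RATE = 16000; // samples per second

// Splits a waveform into windows that share `overlapSeconds` of audio,
// so speech cut off at one chunk boundary is heard whole in the next chunk.
function chunkWaveform(
  waveform: number[],
  windowSeconds: number,
  overlapSeconds: number
): number[][] {
  const windowSize = windowSeconds * SAMPLING_RATE;
  const step = windowSize - overlapSeconds * SAMPLING_RATE;
  const chunks: number[][] = [];
  for (let start = 0; start < waveform.length; start += step) {
    chunks.push(waveform.slice(start, start + windowSize));
  }
  return chunks;
}
```

A larger overlap trades extra computation for more context at chunk boundaries.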

### Arguments

**`modelName`**
A literal of `"moonshine" | "whisper"` that specifies which model to use.

**`encoderSource?`**
A string that specifies the location of a `.pte` file for the encoder. For further information on passing model sources, check out [Loading Models](https://docs.swmansion.com/react-native-executorch/docs/fundamentals/loading-models). Defaults to [constants](https://github.com/software-mansion/react-native-executorch/blob/main/src/constants/modelUrls.ts) for the given model.

**`decoderSource?`**
Analogous to `encoderSource`, this takes in a string which is the source for the decoder part of the model. Defaults to [constants](https://github.com/software-mansion/react-native-executorch/blob/main/src/constants/modelUrls.ts) for the given model.

**`tokenizerSource?`**
A string that specifies the location of the tokenizer for the model. This works just as the encoder and decoder sources do. Defaults to [constants](https://github.com/software-mansion/react-native-executorch/blob/main/src/constants/modelUrls.ts) for the given model.

**`overlapSeconds?`**
Specifies the length (in seconds) of the overlap between consecutive audio chunks.

**`windowSize?`**
Specifies the size of each audio chunk.
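
Putting the optional arguments together, a configuration that tunes the chunking behavior might look like the sketch below (assuming the imports from the Reference section above). The numeric values are illustrative rather than recommended defaults, and the unit of `windowSize` is an assumption (`overlapSeconds` is in seconds, as the name indicates):

```typescript
const model = useSpeechToText({
  modelName: 'moonshine',
  // Explicit sources; these fall back to the library's constants when omitted.
  encoderSource: MOONSHINE_TINY_ENCODER_URL,
  decoderSource: MOONSHINE_TINY_DECODER_URL,
  tokenizerSource: MOONSHINE_TOKENIZER_URL,
  // Chunking parameters -- illustrative values, assumed to be in seconds.
  windowSize: 10,
  overlapSeconds: 2,
});
```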

### Returns

| Field | Type | Description |
| ------------------ | --------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `transcribe` | `(input?: number[]) => Promise<string>` | Starts a transcription process for a given input array, which should be a waveform at 16kHz. When no input is provided, it uses an internal state which is set by calling `loadAudio`. Resolves a promise with the output transcription when the model is finished. |
| `loadAudio` | `(url: string) => Promise<void>` | Loads an audio file from the given URL and sets an internal state which serves as the input to `transcribe()`. |
| `error` | <code>string &#124; null</code> | Contains the error message if the model failed to load. |
| `sequence` | <code>string &#124; null</code> | This property is updated with each generated token. If you're looking to obtain tokens as they're generated, you should use this property. |
| `isGenerating` | `boolean` | Indicates whether the model is currently processing an inference. |
| `isReady` | `boolean` | Indicates whether the model has successfully loaded and is ready for inference. |
| `downloadProgress` | `number` | Tracks the progress of the model download process. |

## Running the model

To run the model, you can use the `transcribe` method. It accepts one optional argument: the waveform representation of the audio. If you called `loadAudio` beforehand, you don't need to pass anything to `transcribe`. However, you can still pass this argument if you want to use your own audio.
This function returns a promise, which resolves to the transcribed text when successful. If the model fails during inference, the promise rejects with an error. If you want to obtain tokens in a streaming fashion, you can also use the `sequence` property, which is updated with each generated token, similar to the [useLLM](../llms/useLLM.md) hook.

## Example

```typescript
import { Button, Text, View } from 'react-native';
import {
  useSpeechToText,
  WHISPER_TOKENIZER_URL,
  WHISPER_TINY_ENCODER_URL,
  WHISPER_TINY_DECODER_URL,
} from 'react-native-executorch';

function App() {
  const model = useSpeechToText({
    encoderSource: WHISPER_TINY_ENCODER_URL,
    decoderSource: WHISPER_TINY_DECODER_URL,
    tokenizerSource: WHISPER_TOKENIZER_URL,
    modelName: 'whisper',
  });

  const audioUrl = 'https://your-url.com/your-audio.mp3';

  return (
    <View>
      <Button
        onPress={async () => {
          try {
            await model.loadAudio(audioUrl);
            await model.transcribe();
          } catch (error) {
            console.error("Error transcribing audio:", error);
          }
        }}
        title="Transcribe"
      />
      <Text>{model.sequence}</Text>
    </View>
  );
}
```

## Supported models

| Model | Language |
| --------------------------------------------------------------------- | -------- |
| [Whisper tiny.en](https://huggingface.co/openai/whisper-tiny.en) | English |
| [Moonshine tiny](https://huggingface.co/UsefulSensors/moonshine-tiny) | English |

## Benchmarks

### Model size

| Model | XNNPACK [MB] |
| -------------- | ------------ |
| WHISPER_TINY | 231.0 |
| MOONSHINE_TINY | 148.9 |

### Memory usage

| Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] |
| -------------- | ---------------------- | ------------------ |
| WHISPER_TINY | 900 | 600 |
| MOONSHINE_TINY | 650 | 560 |

### Inference time

:::warning
Given that Whisper accepts 30-second audio chunks, we employ a streaming algorithm to maintain consistency across long audio files. Because of this, inference time benchmarks are not available yet.
:::
2 changes: 1 addition & 1 deletion docs/docs/utils/_category_.json
@@ -1,6 +1,6 @@
{
"label": "Utils",
"position": 6,
"position": 7,
"link": {
"type": "generated-index"
}