Delayed Streams Modeling: Kyutai STT & TTS

This repo contains instructions and examples of how to run Kyutai Speech-To-Text and Kyutai Text-To-Speech models. These models are powered by delayed streams modeling (DSM), a flexible formulation for streaming, multimodal sequence-to-sequence learning. See also Unmute, a voice AI system built using Kyutai STT and Kyutai TTS.

But wait, what is "Delayed Streams Modeling"? It is a technique for solving many streaming X-to-Y tasks (with X, Y in {speech, text}) that formalizes the approach we took with Moshi and Hibiki. A pre-print paper is coming soon!

Kyutai Speech-To-Text


More details can be found on the project page.

Kyutai STT models are optimized for real-time usage, can be batched for efficiency, and return word-level timestamps. We provide two models:

  • kyutai/stt-1b-en_fr, an English and French model with ~1B parameters, a 0.5 second delay, and a semantic VAD.
  • kyutai/stt-2.6b-en, an English-only model with ~2.6B parameters and a 2.5 second delay.

These speech-to-text models have several advantages:

  • Streaming inference: the models can process audio in chunks, which allows for real-time transcription, and is great for interactive applications.
  • Easy batching for maximum efficiency: an H100 can process 400 streams in real time.
  • They return word-level timestamps.
  • The 1B model has a semantic Voice Activity Detection (VAD) component that can be used to detect when the user is speaking. This is especially useful for building voice agents.

Implementations overview

We provide different implementations of Kyutai STT for different use cases. Here is how to choose which one to use:

  • PyTorch: for research and tinkering. If you want to call the model from Python for research or experimentation, use our PyTorch implementation.
  • Rust: for production. If you want to serve Kyutai STT in a production setting, use our Rust server. Our robust Rust server provides streaming access to the model over websockets. We use this server to run Unmute; on an L40S GPU, we can serve 64 simultaneous connections at a real-time factor of 3x.
  • MLX: for on-device inference on iPhone and Mac. MLX is Apple's ML framework that allows you to use hardware acceleration on Apple silicon. If you want to run the model on a Mac or an iPhone, choose the MLX implementation.

PyTorch implementation

For an example of how to use the model in a way where you can directly stream in PyTorch tensors, see our Colab notebook.

This requires the moshi package with version 0.2.6 or later, which can be installed via pip.
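
If you want to call the model from your own Python code and stream audio chunks through it yourself, the sketch below shows the general shape of that loop: encode a PCM frame with the audio tokenizer, step the language model, and decode any text token it emits. It follows scripts/stt_from_file_pytorch.py, but the exact class and method names (CheckpointInfo, LMGen, the streaming contexts) should be treated as assumptions to check against that script and your installed moshi version.

# Rough sketch of streaming STT inference with the moshi package; the API names
# below mirror scripts/stt_from_file_pytorch.py and may differ between versions.
import torch
from moshi.models import loaders, LMGen

device = "cuda" if torch.cuda.is_available() else "cpu"
info = loaders.CheckpointInfo.from_hf_repo("kyutai/stt-2.6b-en")
mimi = info.get_mimi(device=device)         # streaming audio tokenizer
text_tokenizer = info.get_text_tokenizer()  # sentencepiece text tokenizer
lm = info.get_moshi(device=device)          # the STT language model
lm_gen = LMGen(lm, temp=0.0, temp_text=0.0)

# One frame of silence as a stand-in for real audio; in practice, loop over
# successive frames coming from a file or a microphone.
pcm_frame = torch.zeros(1, 1, mimi.frame_size, device=device)

with mimi.streaming(1), lm_gen.streaming(1):
    audio_tokens = mimi.encode(pcm_frame)
    out = lm_gen.step(audio_tokens)
    if out is not None:  # None while the model is still within its delay
        text_token = out[0, 0, 0].item()
        print(text_tokenizer.id_to_piece(text_token))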

If you just want to run the model on a file, you can use moshi.run_inference.

python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en audio/bria.mp3

If you have uv installed, you can skip the installation step and just prefix the command above with uvx --with moshi.

Additionally, we provide two scripts that highlight different usage scenarios. The first script illustrates how to extract word-level timestamps from the model's outputs:

uv run \
  scripts/stt_from_file_pytorch.py \
  --hf-repo kyutai/stt-2.6b-en \
  audio/bria.mp3

The second script can be used to run a model on an existing Hugging Face dataset and calculate its performance metrics:

uv run scripts/evaluate_on_dataset.py  \
  --dataset meanwhile  \
  --hf-repo kyutai/stt-2.6b-en
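
The metric typically reported for this kind of evaluation is the word error rate (WER). If you want to score transcripts yourself, outside of the script above, a minimal computation could look like the following; the third-party jiwer package is used purely for illustration and is not a dependency of this repo.

# Minimal word-error-rate computation, independent of the evaluation script above.
import jiwer

reference = "in the heart of an ancient forest there lived a peculiar rabbit"
hypothesis = "in the heart of an ancient forest there lived a peculiar rabbits"
print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")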

Another example shows how one can provide a text-, audio-, or text-audio prompt to our STT model:

uv run scripts/stt_from_file_pytorch_with_prompt.py \
  --hf-repo kyutai/stt-2.6b-en \
  --file bria.mp3 \
  --prompt_file ./audio/loonah.mp3 \
  --prompt_text "Loonah" \
  --cut-prompt-transcript

This produces the transcript of bria.mp3 using the "Loonah" spelling for the name, instead of the "Luna" spelling used without any prompt:

In the heart of an ancient forest, where the trees whispered secrets of the past, there lived a peculiar rabbit named Loonah (...)

Apart from nudging the model towards a specific spelling of a word, other potential use cases include speaker adaptation and steering the model towards a specific formatting style or even a language. However, please bear in mind that this is an experimental feature and that its behavior is very sensitive to the prompt provided.

Rust server

The Rust implementation provides a server that can process multiple streaming queries in parallel. Depending on the amount of memory on your GPU, you may have to adjust the batch size in the config file. For an L40S GPU, a batch size of 64 works well, and requests can be processed at 3x real-time speed.

In order to run the server, install the moshi-server crate via the following command. The server code can be found in the kyutai-labs/moshi repository.

cargo install --features cuda moshi-server

Then the server can be started via the following command, using the config files from this repository. For kyutai/stt-1b-en_fr, use configs/config-stt-en_fr-hf.toml, and for kyutai/stt-2.6b-en, use configs/config-stt-en-hf.toml:

moshi-server worker --config configs/config-stt-en_fr-hf.toml

Once the server has started you can transcribe audio from your microphone with the following script.

uv run scripts/stt_from_mic_rust_server.py

We also provide a script for transcribing from an audio file.

uv run scripts/stt_from_file_rust_server.py audio/bria.mp3

The script limits the decoding speed to simulate real-time processing of the audio. Faster processing can be triggered by setting the real-time factor, e.g. --rtf 1000 will process the data as fast as possible.
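
The pacing itself is simple: each audio chunk is held back for its duration divided by the real-time factor before being sent. A self-contained sketch of that idea (the sample rate and chunk size below are placeholder values, not the ones used by the script):

# Sketch of real-time pacing; not code from the repo.
import time

SAMPLE_RATE = 24000   # samples per second (placeholder value)
CHUNK_SAMPLES = 1920  # samples per chunk, i.e. 80 ms at 24 kHz (placeholder value)

def paced(chunks, rtf=1.0):
    """Yield audio chunks no faster than `rtf` times real time."""
    chunk_duration = CHUNK_SAMPLES / SAMPLE_RATE
    for chunk in chunks:
        yield chunk
        time.sleep(chunk_duration / rtf)  # with rtf=1000 the sleep is negligible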

Rust standalone

A standalone Rust example script is provided in the stt-rs directory in this repo. This can be used as follows:

cd stt-rs
cargo run --features cuda -r -- ../audio/bria.mp3

You can get the timestamps by adding the --timestamps flag, and see the output of the semantic VAD by adding the --vad flag.

MLX implementation

MLX is Apple's ML framework that allows you to use hardware acceleration on Apple silicon.

This requires the moshi-mlx package with version 0.2.6 or later, which can be installed via pip.

If you just want to run the model on a file, you can use moshi_mlx.run_inference:

python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx audio/bria.mp3 --temp 0

If you have uv installed, you can skip the installation step and just prefix the command above with uvx --with moshi-mlx.

If you want to transcribe audio from your microphone, use:

python scripts/stt_from_mic_mlx.py

The MLX models can also be used in Swift using the moshi-swift codebase; the 1B model has been tested to work fine on an iPhone 16 Pro.

Kyutai Text-to-Speech


More details can be found on the project page.

We provide different implementations of Kyutai TTS for different use cases. Here is how to choose which one to use:

  • PyTorch: for research and tinkering. If you want to call the model from Python for research or experimentation, use our PyTorch implementation.
  • Rust: for production. If you want to serve Kyutai TTS in a production setting, use our Rust server. Our robust Rust server provides streaming access to the model over websockets. We use this server to run Unmute.
  • MLX: for on-device inference on iPhone and Mac. MLX is Apple's ML framework that allows you to use hardware acceleration on Apple silicon. If you want to run the model on a Mac or an iPhone, choose the MLX implementation.

PyTorch implementation

Check out our Colab notebook or use the script:

# From stdin, plays audio immediately
echo "Hey, how are you?" | python scripts/tts_pytorch.py - -

# From text file to audio file
python scripts/tts_pytorch.py text_to_say.txt audio_output.wav

The tts_pytorch.py script waits for all the text to be available before starting the audio generation. A fully streaming implementation is available in the tts_pytorch_streaming.py script, which can be used as follows:

echo "Hey, how are you?" | python scripts/tts_pytorch_streaming.py audio_output.wav

This requires the moshi package, which can be installed via pip. If you have uv installed, you can skip the installation step and just prefix the command above with uvx --with moshi.
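
Since the streaming script reads its text from stdin, you can also drive it from another Python program and hand it text as it becomes available, for example sentence by sentence from an LLM. A minimal sketch, assuming the script consumes stdin incrementally as its name suggests:

# Usage sketch, not part of the repo: feed text to the streaming TTS script
# through its stdin as soon as each sentence is available.
import subprocess

sentences = [
    "Hey, how are you?",
    "This second sentence is sent while the first one is already being synthesized.",
]

proc = subprocess.Popen(
    ["python", "scripts/tts_pytorch_streaming.py", "audio_output.wav"],
    stdin=subprocess.PIPE,
    text=True,
)
for sentence in sentences:
    proc.stdin.write(sentence + "\n")
    proc.stdin.flush()  # push each sentence immediately rather than buffering
proc.stdin.close()
proc.wait()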

Rust server

The Rust implementation provides a server that can process multiple streaming queries in parallel.

Installing the Rust server is a bit tricky because it uses our Python implementation under the hood, which also requires installing the Python dependencies. Use the start_tts.sh script to properly install the Rust server. If you already installed the moshi-server crate before and it's not working, you might need to force a reinstall by running cargo uninstall moshi-server first. Feel free to open an issue if the installation is still broken.

Once installed, the server can be started via the following command using the config file from this repository.

moshi-server worker --config configs/config-tts.toml

Once the server has started you can connect to it using our script as follows:

# From stdin, plays audio immediately
echo "Hey, how are you?" | python scripts/tts_rust_server.py - -

# From text file to audio file
python scripts/tts_rust_server.py text_to_say.txt audio_output.wav

MLX implementation

MLX is Apple's ML framework that allows you to use hardware acceleration on Apple silicon.

Use our example script to run Kyutai TTS on MLX. The script takes text from stdin or a file and can output to a file or stream the resulting audio. When streaming the output, if the model is not fast enough to keep up with real time, you can use the --quantize 8 or --quantize 4 flags to quantize the model, resulting in faster inference.

# From stdin, plays audio immediately
echo "Hey, how are you?" | python scripts/tts_mlx.py - - --quantize 8

# From text file to audio file
python scripts/tts_mlx.py text_to_say.txt audio_output.wav

This requires the moshi-mlx package, which can be installed via pip. If you have uv installed, you can skip the installation step and just prefix the command above with uvx --with moshi-mlx.

FAQ

Check out the Frequently Asked Questions section before opening an issue.

License

The present code is provided under the MIT license for the Python parts, and under the Apache license for the Rust backend. The web client code is provided under the MIT license. Note that parts of this code are based on AudioCraft, released under the MIT license.

The weights for the speech-to-text models are released under the CC-BY 4.0 license.

Developing

Install the pre-commit hooks by running:

pip install pre-commit
pre-commit install

If you're using uv, you can replace the two commands with uvx pre-commit install.
