
OZSpeech: One-step Zero-shot Speech Synthesis with Learned-Prior-Conditioned Flow Matching

[Figure: Overall Architecture]

TL;DR

OZSpeech is a zero-shot TTS model that uses optimal transport flow matching for one-step sampling, significantly reducing inference time while delivering high-quality synthesized speech.

Installation requirements

Prepare your environment by creating a conda environment, preferably on Linux. Then install the required packages with pip:

pip install -r requirements.txt

A GPU is recommended if you plan to train the model yourself; however, you can still generate samples with our pretrained models on a CPU.
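You can quickly check whether a GPU is visible to PyTorch before deciding how to run inference. A minimal sketch, assuming PyTorch is among the installed requirements:

# Pick cuda if a GPU is visible, otherwise fall back to cpu.
# Assumes PyTorch is installed via requirements.txt.
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")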

Inference

Download pretrained weights

To perform inference with pretrained weights, you must download the pretrained weights for both FaCodec and OZSpeech.

  • For FaCodec, you can download the FaCodec Encoder and FaCodec Decoder directly from Hugging Face (FaCodec Encoder, FaCodec Decoder), or access them via Google Drive using this link; a download sketch follows this list.
  • For OZSpeech, please refer to this link. You need to download both the pretrained weights and the config file to initialize the model.
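For convenience, the Hugging Face checkpoints can also be fetched programmatically with huggingface_hub. A minimal sketch; the repository ID and filenames below are placeholders, so substitute the actual values from the links above:

# Sketch: fetch the FaCodec encoder/decoder checkpoints from Hugging Face.
# NOTE: repo_id and filename values are placeholders -- replace them with
# the actual identifiers from the Hugging Face links above.
from huggingface_hub import hf_hub_download

encoder_ckpt = hf_hub_download(
    repo_id="<facodec-repo-id>",        # placeholder
    filename="<facodec_encoder.bin>",   # placeholder
)
decoder_ckpt = hf_hub_download(
    repo_id="<facodec-repo-id>",        # placeholder
    filename="<facodec_decoder.bin>",   # placeholder
)
print(encoder_ckpt, decoder_ckpt)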

Inference using the Python script

The synthesize.py script provides an end-to-end inference pipeline. Run it as follows:

python synthesize.py \
	--text_file path/to/manifest.txt \
	--input_dir path/to/dir/of/prompt/audio/files \
	--output_dir path/to/dir/for/output/audio/files \
	--ckpt_path path/to/ckpt.pt \
	--cfg_path path/to/config.yaml \
	--device cuda:0  # defaults to cpu

The format of the manifest.txt file is as follows:

<groundtruth_filename>|<prompt_filename>|<groundtruth_transcription>|<prompt_transcription>|<prompt_transcription_clipped>|<groundtruth_duration>

We used the LibriSpeech test-clean set to synthesize and evaluate our model. For simplicity, <prompt_transcription>, <prompt_transcription_clipped>, and <groundtruth_duration> can be disregarded. Here, <groundtruth_transcription> is the target content of the synthesized speech, while the audio file identified by <prompt_filename> serves as the input prompt. We also provide manifest files and prompt samples, which can be accessed via this link.
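If you build manifest files yourself, each line can be split on | into the six fields above. A minimal parsing sketch; the ManifestEntry helper below is an illustrative name, not part of the repo:

# Sketch: parse manifest.txt lines into named fields.
# Only the first three fields are strictly needed for basic inference;
# the remaining ones may be left empty.
from dataclasses import dataclass

@dataclass
class ManifestEntry:  # illustrative helper, not part of the repo
    groundtruth_filename: str
    prompt_filename: str
    groundtruth_transcription: str
    prompt_transcription: str = ""
    prompt_transcription_clipped: str = ""
    groundtruth_duration: str = ""

def parse_manifest_line(line: str) -> ManifestEntry:
    return ManifestEntry(*line.rstrip("\n").split("|"))

with open("manifest.txt", encoding="utf-8") as f:
    entries = [parse_manifest_line(line) for line in f if line.strip()]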

Inference in an IPython notebook

The synthesize.ipynb notebook offers a user-friendly interface for inference. You can directly provide a single pair consisting of the prompt audio file path and the target text for synthesis. Give it a try!
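If you would rather stay in plain Python than open the notebook, one option is to wrap a single prompt/text pair in a one-line manifest and drive synthesize.py programmatically. A minimal sketch; all paths and filenames below are illustrative, not shipped with the repo:

# Sketch: synthesize a single prompt/text pair by writing a one-line
# manifest and invoking the synthesize.py CLI. Paths are illustrative.
import subprocess
import tempfile

target_text = "Hello from OZSpeech."   # text to synthesize
line = f"output.wav|prompt.wav|{target_text}|||"  # six pipe-separated fields

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(line + "\n")
    manifest_path = f.name

subprocess.run([
    "python", "synthesize.py",
    "--text_file", manifest_path,
    "--input_dir", "prompts",        # directory holding prompt.wav (assumed)
    "--output_dir", "outputs",
    "--ckpt_path", "ckpt.pt",        # placeholder checkpoint path
    "--cfg_path", "config.yaml",     # placeholder config path
    "--device", "cpu",
], check=True)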

Training OZSpeech from scratch

TBD.

Disclaimer

Any organization or individual is prohibited from using any technology mentioned in this paper to generate or edit someone's speech without their consent, including but not limited to government leaders, political figures, and celebrities. Failure to comply with this provision may constitute a violation of copyright law.
