OZSpeech is a cutting-edge Zero-Shot TTS model that utilizes Optimal Transport Flow Matching for one-step sampling, significantly reducing inference time while delivering high-quality synthesized speech.
Set up your environment by creating a conda environment, preferably on Linux. Then install the required packages with pip:
pip install -r requirements.txt
If you plan to train the model yourself, a GPU is recommended. You can still generate samples with our pretrained models without one.
To run inference, download the pretrained weights for both FaCodec and OZSpeech:
- For FaCodec, download the FaCodec Encoder and FaCodec Decoder directly from Hugging Face: FaCodec Encoder, FaCodec Decoder. Alternatively, you can get them from Google Drive via this link.
- For OZSpeech, please refer to this link. You need to download both the pretrained weights and the config file to initialize the model.
The synthesize.py script provides an end-to-end pipeline for inference. Run it as follows:
python synthesize.py \
--text_file path/to/manifest.txt \
--input_dir path/to/dir/of/prompt/audio/files \
--output_dir path/to/dir/for/output/audio/files \
--ckpt_path path/to/ckpt.pt \
--cfg_path path/to/config.yaml \
--device cuda:0 # cpu as default
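If you want to launch the command above from Python (for example, to loop over several manifests), the argument list can be assembled as in the following minimal sketch. The helper name `build_synthesize_command` is hypothetical and not part of this repository; only the flags shown above are used, and the `cpu` default mirrors the comment on `--device`.

```python
import sys
from typing import List


def build_synthesize_command(
    text_file: str,
    input_dir: str,
    output_dir: str,
    ckpt_path: str,
    cfg_path: str,
    device: str = "cpu",  # the command above notes cpu as the default
) -> List[str]:
    """Build the argv list for synthesize.py (hypothetical helper)."""
    return [
        sys.executable, "synthesize.py",
        "--text_file", text_file,
        "--input_dir", input_dir,
        "--output_dir", output_dir,
        "--ckpt_path", ckpt_path,
        "--cfg_path", cfg_path,
        "--device", device,
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root, which avoids shell quoting issues with paths that contain spaces.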
The format of the manifest.txt file is as follows:
<groundtruth_filename>|<prompt_filename>|<groundtruth_transcription>|<prompt_transcription>|<prompt_transcription_clipped>|<groundtruth_duration>
The LibriSpeech-test-clean dataset was utilized to synthesize and evaluate our model. For simplicity, <prompt_transcription>, <prompt_transcription_clipped>, and <groundtruth_duration> can be disregarded. In this context, <groundtruth_transcription> represents the target content of the synthesized speech, while the audio file identified by <prompt_filename> serves as the input prompt. Additionally, we provide manifest files and prompt samples, which can be accessed via this link.
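To make the manifest layout concrete, here is a minimal sketch of a parser for the six pipe-separated fields. It is an illustration only, not the loader used by synthesize.py: the names `ManifestEntry`, `parse_manifest_line`, and `parse_manifest` are hypothetical, and the sketch assumes one entry per line, no `|` characters inside transcriptions, and that the fields which can be disregarded may be left empty.

```python
from dataclasses import dataclass
from typing import List

FIELD_COUNT = 6  # number of pipe-separated fields in the manifest format


@dataclass
class ManifestEntry:
    groundtruth_filename: str
    prompt_filename: str
    groundtruth_transcription: str  # target text to synthesize
    prompt_transcription: str
    prompt_transcription_clipped: str
    groundtruth_duration: str  # kept as raw text; the unit is not stated here


def parse_manifest_line(line: str) -> ManifestEntry:
    """Split one manifest line into its six fields."""
    fields = [field.strip() for field in line.rstrip("\n").split("|")]
    if len(fields) != FIELD_COUNT:
        raise ValueError(
            f"expected {FIELD_COUNT} fields, got {len(fields)}: {line!r}"
        )
    return ManifestEntry(*fields)


def parse_manifest(text: str) -> List[ManifestEntry]:
    """Parse a whole manifest, skipping blank lines."""
    return [parse_manifest_line(ln) for ln in text.splitlines() if ln.strip()]
```

For example, `parse_manifest_line("target.wav|prompt.wav|hello world|||")` yields an entry whose `prompt_filename` is `prompt.wav` and whose last three fields are empty strings.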
The synthesize.ipynb notebook offers a user-friendly interface for inference. You can directly provide a single pair consisting of the prompt audio file path and the target text for synthesis. Give it a try!
TBD.
Any organization or individual is prohibited from using any technology mentioned in this paper to generate or edit someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.