This repository contains Fatcord's Alternative WaveRNN (Faster training): a fast-training implementation of the WaveRNN vocoder with a small GPU memory footprint.
See geneing's awesome fork that has model pruning, export to C++ and real time inference on CPU: https://github.com/geneing/WaveRNN-Pytorch.
- support raw audio wav modelling (via a single Beta Distribution)
- relatively fast synthesis speed without much optimization yet (around 2000 samples/sec on a GTX 1060 Ti, 16 GB RAM, i5 processor)
- support Fatcord's original quantized (9-bit) wav modelling
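As a rough illustration of the two input types listed above, the sketch below shows a linear 9-bit quantization of audio samples in [-1, 1] and the negative log-likelihood of one sample under a single Beta distribution. The function names, the linear (rather than mu-law) quantization, and the rescaling of audio to (0, 1) are assumptions made for illustration; the repository's own code may differ.

```python
import math

def quantize(samples, bits=9):
    """Map float samples in [-1, 1] to integer class labels in [0, 2**bits - 1]."""
    levels = 2 ** bits
    labels = []
    for s in samples:
        s = min(max(s, -1.0), 1.0)  # clip out-of-range values
        labels.append(int(math.floor((s + 1.0) * (levels - 1) / 2.0 + 0.5)))
    return labels

def dequantize(labels, bits=9):
    """Map integer class labels back to float samples in [-1, 1]."""
    levels = 2 ** bits
    return [2.0 * label / (levels - 1) - 1.0 for label in labels]

def beta_nll(sample, alpha, beta, eps=1e-6):
    """Negative log-likelihood of one sample in [-1, 1] under Beta(alpha, beta).

    The sample is rescaled to (0, 1) first, since that is the support of the
    Beta distribution (an assumption about how raw audio would be modelled).
    """
    x = min(max((sample + 1.0) / 2.0, eps), 1.0 - eps)
    log_norm = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    log_pdf = (alpha - 1.0) * math.log(x) + (beta - 1.0) * math.log(1.0 - x) - log_norm
    return -log_pdf
```

With 9 bits the model predicts one of 512 classes per sample, whereas the Beta variant predicts two positive parameters per sample and is trained on the continuous value directly.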
- Obama & Bernie Sanders: see this repo in action!
- 10-bit audio on held-out test data from LJSpeech. This model sounds and trains about as well as the 9-bit model. The higher the bit depth, the better.
- 9-bit audio on held-out test data from LJSpeech. This model trains the fastest (this sample is from around 130 epochs).
- Single Beta distribution on held-out test data from LJSpeech.
- Single Beta Distribution trained for 112k. Make sure to change hparams.input_type to raw.
- 9-bit quantized audio trained for 11k, or around 130 epochs; it can be trained further. Make sure to change hparams.input_type to bits.
- 10-bit quantized audio.

To ensure your model is built properly, download the hparams.py here, then either replace your local hparams.py file with it or note the differences and update your local file accordingly.
- Python 3
- CUDA >=8.0
- PyTorch >= v0.4.1
Ensure the above requirements are met.
git clone https://github.com/G-Wang/WaveRNN-Pytorch.git
cd WaveRNN-Pytorch
pip install -r requirements.txt
Before running scripts, one can adjust hyperparameters in hparams.py.
Some hyperparameters that you might want to adjust:
- fix_learning_rate: the model is robust enough to learn well with a fixed learning rate of 1e-4. I suggest you try this setting for fastest training; you can decrease it to 5e-6 for final-step refinement. Set this to None to train with a learning rate schedule instead.
- input_type (best performing ones are currently bits and raw, see hparams.py for more details)
- batch_size
- save_every_step (checkpoint saving frequency)
- evaluate_every_step (evaluation frequency)
- seq_len_factor (sequence length of training audio; the longer the sequence, the more GPU memory it takes)
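The choice between a fixed learning rate and a schedule can be sketched as follows. Only the name fix_learning_rate and the value 1e-4 come from the description above; initial_learning_rate, decay_steps, and the halving schedule are hypothetical placeholders, not the schedule this repository actually uses.

```python
# Illustrative stand-ins for values that would live in hparams.py.
fix_learning_rate = 1e-4       # from the text above; set to None to use a schedule
initial_learning_rate = 1e-3   # hypothetical starting rate for the schedule
decay_steps = 10000            # hypothetical: halve the rate every 10000 steps

def learning_rate_at(step, fixed=fix_learning_rate,
                     initial=initial_learning_rate, decay=decay_steps):
    """Return the learning rate to use at a given training step."""
    if fixed is not None:
        # A fixed rate overrides any schedule.
        return fixed
    # Assumed schedule: exponential decay that halves every `decay` steps.
    return initial * 0.5 ** (step / decay)
```

For example, learning_rate_at(5000) returns the fixed rate, while learning_rate_at(5000, fixed=None) falls back to the assumed schedule.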
The preprocess.py script processes raw wav files into corresponding mel-spectrogram and wav files according to the audio processing hyperparameters.
Example usage:
python preprocess.py /path/to/my/wav/files
This will process all the .wav files in the folder /path/to/my/wav/files and save them in the default local directory called data_dir.
Include --output_dir to specify a different directory in which to store the processed outputs.
The train.py script starts the training process. Checkpoints are stored by default in the local directory checkpoints.
The script will automatically save a checkpoint when terminated with Ctrl + C.
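Saving a checkpoint on interruption is commonly done by catching KeyboardInterrupt around the training loop. The following is a minimal sketch of that pattern, not the actual code in train.py; train_step and save_checkpoint are hypothetical callables supplied by the caller.

```python
def train(num_steps, train_step, save_checkpoint):
    """Run train_step for up to num_steps steps, checkpointing if interrupted.

    train_step(step) and save_checkpoint(step) are hypothetical callbacks.
    Returns the number of steps that completed.
    """
    step = 0
    try:
        while step < num_steps:
            train_step(step)
            step += 1
    except KeyboardInterrupt:
        # Ctrl + C raises KeyboardInterrupt in the main thread; save the
        # progress made so far before returning.
        save_checkpoint(step)
    return step
```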
Example 1: starting a new model for training
python train.py data_dir
data_dir is the directory containing the processed files.
Example 2: Restoring training from checkpoint
python train.py data_dir --checkpoint=checkpoints/checkpoint0010000.pth
Evaluation .wav files and plots are saved in checkpoints/eval.
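To resume from the most recent checkpoint without typing its name, a small helper can pick the file with the highest step number. This sketch assumes checkpoint files are named like the one in Example 2 (checkpoint followed by a zero-padded step and .pth); latest_checkpoint is a hypothetical helper, not part of this repository.

```python
import os
import re

# Assumed naming pattern, based on checkpoints/checkpoint0010000.pth above.
CHECKPOINT_PATTERN = re.compile(r"^checkpoint(\d+)\.pth$")

def latest_checkpoint(checkpoint_dir="checkpoints"):
    """Return the path of the highest-step checkpoint, or None if there is none."""
    if not os.path.isdir(checkpoint_dir):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(checkpoint_dir):
        match = CHECKPOINT_PATTERN.match(name)
        if match and int(match.group(1)) > best_step:
            best_step = int(match.group(1))
            best_path = os.path.join(checkpoint_dir, name)
    return best_path
```

The returned path can then be passed to train.py through the --checkpoint option shown in Example 2.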
- optimize learning rate schedule
- optimize training hyperparameters (seq_len and batch_size)
- batch generation for synthesis speedup
- model pruning