FAQ
Eren Gölge edited this page Mar 20, 2020 · 9 revisions
- Make sure you use the right commit version of TTS. Each pre-trained model must be used with its corresponding version, which is listed in the model table.
- If it is still problematic, post your problem on https://discourse.mozilla.org/c/tts/285. Please give as many details as possible (error message, your TTS version, your TTS model, config.json, etc.).
- If you think it is a bug that needs fixing, open a GitHub issue instead, with the same level of detail.
- Train Tacotron first. It is smaller and faster to experiment with. If it performs poorly, try Tacotron2.
- Check your dataset with notebooks under dataset_analysis. Use this notebook to find the right audio processing parameters. The best parameters are the ones that give the best Griffin-Lim (GL) synthesis.
- Write your own dataset formatter in datasets/preprocess.py or format your dataset as one of the supported datasets like LJSpeech.
- The preprocessor parses the metadata file and converts it into a list of training samples.
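A custom formatter can be quite small. The sketch below assumes a hypothetical pipe-separated metadata file with one `wav_id|transcript` pair per line and a `wavs/` folder next to it. The function name `my_dataset`, the file layout and the `[text, wav_path]` item shape are assumptions for illustration, so compare with the existing formatters in datasets/preprocess.py for the exact item format your TTS version expects.

```python
import os


def my_dataset(root_path, meta_file):
    """Hypothetical formatter: turn `wav_id|transcript` lines into training samples."""
    items = []
    with open(os.path.join(root_path, meta_file), encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip empty lines
            # Split only on the first separator so the transcript may contain "|".
            wav_id, text = line.split("|", 1)
            wav_path = os.path.join(root_path, "wavs", wav_id + ".wav")
            items.append([text, wav_path])
    return items
```

Calling `my_dataset("/data/my_corpus", "metadata.csv")` would return one `[text, wav_path]` pair per non-empty metadata line.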
 
- If you have a dataset with a different alphabet than English Latin, you need to add your alphabet in utils.text.symbols.
- If you use phonemes for training and your language is supported here, you don't need to do that.
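Before editing the symbol list, it helps to know which characters in your transcripts are not covered yet. This is a minimal sketch, assuming a symbol inventory built from a pad symbol, an end-of-sentence symbol and a character string. The variable names and the example Turkish letters are illustrative and not copied from the repository.

```python
# Hypothetical symbol inventory; adapt it to the layout of utils.text.symbols.
_pad = "_"
_eos = "~"
_characters = "abcdefghijklmnopqrstuvwxyz!'(),-.:;? "
_extra = "çğıöşü"  # letters added for the new alphabet (example: Turkish)

symbols = [_pad, _eos] + list(_characters) + list(_extra)
symbol_to_id = {s: i for i, s in enumerate(symbols)}


def uncovered_characters(transcripts, known_symbols):
    """Return the sorted characters that occur in the transcripts but not in known_symbols."""
    known = set(known_symbols)
    return sorted({ch for text in transcripts for ch in text.lower() if ch not in known})
```

Running `uncovered_characters` over all transcripts of a dataset gives the list of characters that still need to be added, or cleaned away by a text cleaner.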
 
- Write your own text cleaner in utils.text.cleaners. This is not always necessary, except when you have a different alphabet or language-specific requirements.
- This step is used to expand numbers and abbreviations and to normalize the text.
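As a rough idea of what such a cleaner does, here is a minimal sketch. The name `my_cleaners` and the abbreviation table are hypothetical, and number expansion is left out for brevity; a real cleaner for your language will likely need more rules.

```python
import re

# Illustrative abbreviation table; extend it for your language.
_ABBREVIATIONS = [
    (re.compile(r"\bmr\."), "mister"),
    (re.compile(r"\bdr\."), "doctor"),
]
_WHITESPACE = re.compile(r"\s+")


def my_cleaners(text):
    """Lowercase, expand a few abbreviations and collapse whitespace."""
    text = text.lower()
    for pattern, replacement in _ABBREVIATIONS:
        text = pattern.sub(replacement, text)
    return _WHITESPACE.sub(" ", text).strip()
```

For example, `my_cleaners("Dr.  Smith met Mr. Jones")` normalizes the abbreviations and the double space before the text is converted to symbol IDs.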
 
- Set up config.json for your dataset. Go over each parameter one by one and consider it in light of the commented explanation.
- 'sample_rate', 'phoneme_language' (if phonemes are enabled), 'output_path', 'datasets' and 'text_cleaner' are the fields you need to edit in most cases.
 
- Write your test sentences in a txt file, one sentence per line, and set its path as test_sentences_file in config.json.
- Train your model.
- SingleGPU training: python train.py --config_path config.json
- MultiGPU training: CUDA_VISIBLE_DEVICES="0,1,2" python distribute.py --config_path config.json
- This command uses all the GPUs given in CUDA_VISIBLE_DEVICES. If you don't specify it, it uses all the GPUs available.
 
- Check steps 2, 3, 4, 5 above.
- Check step 5 above.
How can I check model performance?
- You can inspect model training and performance using tensorboard. It will show you loss values, attention alignments and model outputs. Follow the order below to verify the model.
- Check ground truth spectrograms. If they do not look as they are supposed to, check the audio processing parameters set in config.json.
- Check train and eval loss values and make sure that they all decrease smoothly over time.
- Check model spectrograms. In particular, training outputs should converge to the ground truth after 10K iterations.
- Your model will not work at test time until the attention has a near diagonal alignment. This is the sublime art of TTS training.
- Attention should converge diagonally after 50K iterations.
- If attention does not converge, the likely causes are:
- Your dataset is too noisy or small.
- Samples are too long.
- Batch size is too small (with batch_size < 32 the model has a hard time converging).
 
- You can also try other attention algorithms like 'graves', 'bidirectional_decoder', 'forward_attn'.
- 'bidirectional_decoder' is your ultimate savior but it trains 2x slower and demands 1.5x more GPU memory.
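"Near diagonal" is usually judged by eye on the tensorboard plots, but it can also be approximated with a number. The sketch below is an assumption-laden illustration and not part of TTS: it takes an alignment as a list of rows (one per decoder step, each holding the attention weights over input positions) and measures how much attention mass lies close to the diagonal. The function name and the `band` width are hypothetical.

```python
def diagonal_mass(alignment, band=0.1):
    """Fraction of attention mass within `band` (relative width) of the diagonal.

    alignment: list of rows, one per decoder step; each row holds the
    attention weights over encoder positions.
    """
    n_dec = len(alignment)
    n_enc = len(alignment[0])
    inside = 0.0
    total = 0.0
    for t, row in enumerate(alignment):
        # Relative position of this decoder step along the utterance.
        t_rel = t / max(n_dec - 1, 1)
        for n, weight in enumerate(row):
            # Relative position of this encoder step along the input text.
            n_rel = n / max(n_enc - 1, 1)
            total += weight
            if abs(t_rel - n_rel) <= band:
                inside += weight
    return inside / total if total > 0 else 0.0
```

A value close to 1 means most of the attention follows the diagonal; a value that stays low over many checkpoints is a hint to look at the causes listed above. Any threshold you pick is a judgment call for your dataset.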
 
 
The stop condition is a combination of multiple criteria, since there is no single final performance indicator, due to both teacher forcing and the subjectivity of voice evaluation. In my training, I consider the following steps:
- Check the test audio outputs to see whether they have stopped improving.
- Check the test attention maps to see whether they look diagonal.
- Check the validation loss to see whether it has converged or started to overfit.
- If the answer is YES for all of the above, open the benchmark notebook and test the checkpoints from recent steps to see which works best.
- You can also experiment with inference-related parameters in the notebooks to see if something improves the inference performance (forward attention, using dropout on the prenet at inference, etc.).
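The validation-loss part of this checklist can be partly automated. This is a minimal sketch under stated assumptions: it compares the mean of the most recent eval losses with the mean of the window before it. The function name, the window size and the 1% threshold are hypothetical, and a rising loss is only a hint of overfitting. It does not replace listening to the test audio.

```python
def validation_trend(losses, window=5, min_rel_improvement=0.01):
    """Classify the recent validation-loss trend.

    Returns "improving", "converged" or "overfitting", or "unknown" when
    there is not enough data to compare two windows.
    """
    if len(losses) < 2 * window:
        return "unknown"
    previous = sum(losses[-2 * window:-window]) / window
    recent = sum(losses[-window:]) / window
    if previous <= 0:
        return "unknown"
    change = (previous - recent) / previous  # positive means the loss went down
    if change >= min_rel_improvement:
        return "improving"
    if change <= -min_rel_improvement:
        return "overfitting"
    return "converged"
```

Feeding it the eval losses logged at each evaluation step gives a quick signal for when to start comparing recent checkpoints in the notebook.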
- Go over the steps under "How can I check model performance?"
- Check the 4th step under "How can I check model performance?"
- The best way is to use Benchmark notebooks.
- You can try synthesize.py.
- You can try our demo server. It is limited to demo purposes only.
- Check this issue to see an example of TB output.
- In general, all of the above relates to the stopnet, which is the part that tells the network to stop inference.
- In general, a poorly performing stopnet points to something else that is broken in your model, especially the attention module. So it is better to debug your model using the notebooks and the tensorboard outputs.
- It might be about the silences at the beginning and the end of the audio recordings. Check the trim_db value in the config file. You can find a better value for your dataset by using the CheckSpectrogram notebook. If this value is too small, too much of the audio might be trimmed. If it is too big, too much silence might remain, and that degrades stopnet performance.
- You can also adjust the loss weight set in train.py, which compensates for the imbalance of stopnet labels. The default value should generally work fine.
- If nothing fixes the problem, another option is to use attention weights to decide where to stop. If your attention weights look good, you can check whether the attention reaches the end of the sentence and stop the inference there.
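The last option can be sketched as follows. This is an illustration and not code from TTS: it assumes you have the attention weights of each decoder step as a list of rows and stops once the most attended input position has been the final one for `patience` consecutive steps. The function name and the `patience` parameter are hypothetical.

```python
def find_stop_step(alignment, patience=3):
    """Return the decoder step at which to stop, or None if no stop is found.

    alignment: list of rows, one per decoder step; each row holds the
    attention weights over the input positions.
    """
    if not alignment:
        return None
    last_position = len(alignment[0]) - 1
    steps_at_end = 0
    for step, row in enumerate(alignment):
        focus = max(range(len(row)), key=row.__getitem__)  # argmax of the row
        if focus == last_position:
            steps_at_end += 1
            if steps_at_end >= patience:
                return step
        else:
            steps_at_end = 0  # attention moved away from the end, start over
    return None
```

A larger `patience` lets the final sound decay before cutting, at the cost of a few extra decoder steps. This only helps when the attention itself is well aligned.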