Issue training with DeepVoice3 model with LJSpeech Data #43
Comments
Thank you for the detailed report. If everything goes well, you should get reasonable results at 200k steps, so there must be something wrong in your case. I would like to investigate the problem, but I am currently on a business trip and won't have time for a week. I believe I didn't change anything performance-critical, but I might have missed something important... In my experience, masked_loss_weight was not a sensitive parameter.
@timbrucks: Same problem here. Maybe it is related to #38?
Could anyone try to revert 421e8b7 and see if it works?
I have reverted to 421e8b7 and started a training run. I will let you know what happens.
Hi,
Looks like a dup of #37. Also, please do not comment on an unrelated issue. Create a new one if necessary.
Oops, I'm new to GitHub. That issue is the same as mine. Thanks for the support :)
I allowed my training process for 421e8b7 to run up to 400k iterations. The sample WAV files produced in the checkpoints folder are pretty good (a little more reverb / metallic sound than expected) and the alignment is pretty good as well. Here is the alignment (from the "alignment_ave" dir) at 300k iterations for the utterance "that an anti-Castro organization had maintained offices there for a period ending early in 1962". Here is a zip of the predicted WAV file at 300k iterations: step000300000_predicted.wav.zip

However, if I use the stored checkpoint to synthesize that same utterance, the alignment looks like this and the WAV file produced is unintelligible. So at this point I am wondering if the issue is something in the synthesis process ...
As an additional experiment, I reset my git checkout to 4357976 and attempted to train the LJSpeech model. I stopped the training at ~370K iterations. I see similar behavior as mentioned above: the predicted WAV files sound good, but the synthesized result is much lower quality. This is the command I used, which, based on looking at hparams.py for that specific commit, seems like the correct way to get the preset parameters for LJSpeech.

I again suspected something was amiss with the synthesis process. But with the same environment / setup, when I synthesize using the pre-trained LJSpeech model, the output sounds great - just like the samples provided. At this point, all I can think of is that somehow I hosed up the pre-processing of the LJSpeech data ... maybe I ran pre-processing while my checkout pointed at a later commit? So I reran the preprocess step and have just started another training run using commit 4357976. Any other ideas?
Hi All, I believe the quality drop is really tough to figure out. I guess two things - the random seed and silence in the audio - could cause such degradation. For the random seed: if I checked correctly, there is no random seed initialization, so the seed is set differently on each run and will yield different training outputs. Silence at the beginning and end of the audio could also affect model training.

Here is what I have experienced so far. I only tried JSUT and NIKL, using two commits: aeed2 and ed38d, and I only checked the prediction WAV files produced during training. These WAV files sounded gradually better as the iterations went by. For JSUT, most of these WAV files sounded good at high iteration counts, around 370K. But for NIKL, only a few WAV files sounded good at around 400K iterations. I thought this was a matter of parameter tuning and didn't investigate further. Also, audio left untrimmed at the beginning and end, especially in NIKL, sounded really bad even at high iteration counts. So I trimmed the silence and left a margin of about 100 msec at the beginning and end. Details are here.
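For reference, a minimal sketch of that kind of trimming using librosa; the top_db threshold, margin handling, and file path are my assumptions, not the exact code from the linked details:

```python
import librosa

def trim_silence(wav, sr, top_db=30, margin_ms=100):
    """Trim leading/trailing silence but keep ~100 ms of margin on both ends."""
    margin = int(sr * margin_ms / 1000)
    # librosa.effects.trim returns the trimmed signal and the [start, end] sample indices
    _, (start, end) = librosa.effects.trim(wav, top_db=top_db)
    start = max(0, start - margin)
    end = min(len(wav), end + margin)
    return wav[start:end]

# Example: trim one clip before feature extraction (path is hypothetical)
wav, sr = librosa.load("wavs/some_utterance.wav", sr=22050)
trimmed = trim_silence(wav, sr)
```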
I was able to successfully train a model using 421e8b7. My first attempt (described above) did not work, but after rerunning the pre-processing the results are much better, very similar to the sample results. Somehow I must have hosed up the pre-processing step (not sure how?). I will now try to train using the recent commit 48d1014 (dated Feb 7).
Finally, I'm back. I will do some experiments locally soon. At the moment I am guessing r9y9/wavenet_vocoder#22 is the bug I have to fix, which was introduced in 421e8b7.
The results of training with 48d1014 (now that I have good pre-processed data) were better. One of the synthesized utterances is bad, but several were good. I will give this latest commit a try.
OK, now I can reproduce. Looking into it...
Can confirm the bug with 2987b76: sound quality after 210k iterations on LJSpeech is far from the examples at r9y9.github.io/deepvoice3_pytorch. Will 18bd61d fix it? Were the VCTK and Nyanko models affected too?
No, the bug persists. I'm looking into it.
A little progress:

```diff
diff --git a/audio.py b/audio.py
index 0decdbc..53fa56c 100644
--- a/audio.py
+++ b/audio.py
@@ -45,7 +45,7 @@ def inv_spectrogram(spectrogram):
 def melspectrogram(y):
     D = _lws_processor().stft(preemphasis(y)).T
-    S = _amp_to_db(_linear_to_mel(np.abs(D))) - hparams.ref_level_db
+    S = _amp_to_db(_linear_to_mel(np.abs(D)))  # - hparams.ref_level_db
     if not hparams.allow_clipping_in_normalization:
         assert S.max() <= 0 and S.min() - hparams.min_level_db >= 0
     return _normalize(S)
@@ -69,18 +69,15 @@ def _linear_to_mel(spectrogram):
 def _build_mel_basis():
-    assert hparams.fmax <= hparams.sample_rate // 2
-    return librosa.filters.mel(hparams.sample_rate, hparams.fft_size,
-                               fmin=hparams.fmin, fmax=hparams.fmax,
-                               n_mels=hparams.num_mels)
+    return librosa.filters.mel(hparams.sample_rate, hparams.fft_size, n_mels=hparams.num_mels)
 
 def _amp_to_db(x):
-    return 20 * np.log10(x + 0.01)
+    return 20 * np.log10(np.maximum(1e-5, x))
 
 def _db_to_amp(x):
-    return np.maximum(np.power(10.0, x * 0.05) - 0.01, 0.0)
+    return np.power(10.0, x * 0.05)
```

With this I can get reasonable quality after 100k steps. Will look into it further.
I had misunderstood how log compression is done in Tacotron 2. The old code already did clipping the same way as mentioned in Tacotron 2. Although the threshold differs between ours (1e-5) and Tacotron 2's (0.01), I think it's safe to keep the value as is because it has been tested for months. We can try 0.01 later. This should fix #43.
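For anyone following along, here is a tiny sketch (mine, not from the repo) of what those two thresholds mean in dB terms: clipping at 1e-5 floors the spectrogram at -100 dB, while Tacotron 2's 0.01 floors it at -40 dB.

```python
import numpy as np

def amp_to_db(x, clip_val):
    # Log compression with a hard floor, matching the fixed _amp_to_db above
    return 20 * np.log10(np.maximum(clip_val, x))

silence = np.zeros(4)            # magnitudes of a silent frame
print(amp_to_db(silence, 1e-5))  # [-100. -100. -100. -100.]  -> this repo's floor
print(amp_to_db(silence, 0.01))  # [-40. -40. -40. -40.]      -> Tacotron 2's floor
```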
I think #46 should fix this. I'm training a model from scratch to confirm that it actually fixes the problem.
@r9y9 With Tacotron 2 I've been able to get reasonable quality in 20k iterations using the log of the clipped magnitudes instead of a representation normalized to [0, 1].
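To make the comparison concrete, a rough sketch of the two feature choices; the helper names are mine, and the min/ref dB values are what I believe the repo's hparams default to, not verified against any particular commit:

```python
import numpy as np

MIN_LEVEL_DB = -100  # assumed hparams.min_level_db
REF_LEVEL_DB = 20    # assumed hparams.ref_level_db

def log_clipped_magnitudes(mag):
    # Option 1: train directly on log-compressed (clipped) magnitudes
    return 20 * np.log10(np.maximum(1e-5, mag))

def normalized_features(mag):
    # Option 2: subtract a reference level and squash into [0, 1]
    # (the _normalize path that audio.py's melspectrogram uses)
    db = 20 * np.log10(np.maximum(1e-5, mag)) - REF_LEVEL_DB
    return np.clip((db - MIN_LEVEL_DB) / -MIN_LEVEL_DB, 0, 1)
```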
Sorry for the bug. I think I fixed the problem. Feel free to reopen if you still see the bug.
Based on my tests, I think you fixed the problem. Thanks!
Thanks for your excellent implementation of Deep Voice 3. I am attempting to retrain a DeepVoice3 model using the LJSpeech data. I want to train a new model because I plan to make some small model-parameter changes in order to enable fine-tuning on some Spanish data that I have.
As a first step I tried to retrain the baseline model and I have run into some issues.
With my installation, I have been able to successfully synthesize using the pre-trained DeepVoice3 model with git commit 4357976 as your instructions indicate. That synthesized audio sounds very much like the samples linked from the instructions page.
However, I am trying to train now with the latest git commit (commit 48d1014, dated Feb 7). I am using the LJSpeech data set downloaded from the link you provided. I have run the pre-processing and training steps as indicated in your instructions. I am using the default preset parameters for deepvoice3_ljspeech.
I have let the training process run for a while. When I synthesize using the checkpoint saved at 210K iterations, the alignment is bad and the audio is very robotic and mostly unintelligible.
When I synthesize using the checkpoint saved at 700K iterations, the alignment is better (but not great); the audio is improved but still robotic and choppy.
I can post the synthesized wav files via dropbox if you are interested. I expected to have good alignment and audio at 210K iterations as that is what the pretrained model used.
Any ideas what has changed between git commits 4357976 and 48d1014 that could have caused this issue? When I diff the two commits, I see some changes in audio.py, some places where support for multi-voice has been added, and some other changes I do not yet understand. There are some additions to hparams.py, but I only noticed one difference: in the current commit, masked_loss_weight defaults to 0.5, but in the prior commit the default was 0.0.
I have just started a new training run with masked_loss_weight set to 0.0. In the meantime, do you have thoughts on anything else that might be causing the issues I am seeing?
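For reference, a minimal sketch of how such an hparams override is typically applied in this codebase, assuming hparams.py exposes a tf.contrib.training.HParams instance (as it did around these commits) and that train.py feeds its --hparams string into it; the module layout is assumed rather than verified:

```python
# Equivalent to passing --hparams="masked_loss_weight=0.0" to train.py
from hparams import hparams  # assumed: hparams.py exposes an HParams instance

hparams.parse("masked_loss_weight=0.0")
print(hparams.masked_loss_weight)  # 0.0 instead of the newer 0.5 default
```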