Issue training with DeepVoice3 model with LJSpeech Data #43
Comments
Thank you for the detailed report. If everything goes well, you should get reasonable results at 200k steps, so there must be something wrong in your case. I would like to investigate the problem, but I am currently on a business trip and won't have time for a week. I believe I didn't change anything performance-critical, but I might have missed something important... In my experience, masked_loss_weight was not a sensitive parameter.
@timbrucks: Same problem here. Maybe it is related to #38?
Could anyone try to revert 421e8b7 and see if it works?
I have reverted to 421e8b7 and started a training run. I will let you know what happens.
Hi,
Looks like a dup of #37. Also, please do not comment on an unrelated issue. Create a new one if necessary.
Oops, I'm new to GitHub. That issue is the same as mine. Thanks for the support :)
I allowed my training process for 421e8b7 to run up to 400k iterations. The sample WAV files produced in the checkpoints folder are pretty good (a little more reverb / metallic sound than expected) and the alignment is pretty good as well. Here is the alignment (from the "alignment_ave" dir) at 300k iterations for the utterance "that an anti-Castro organization had maintained offices there for a period ending early in 1962". Here is a zip of the predicted WAV file at 300k iterations: step000300000_predicted.wav.zip

However, if I use the stored checkpoint to synthesize that same utterance, the alignment looks like this and the WAV file produced is unintelligible. So at this point I am wondering if the issue is something in the synthesis process ...
As an additional experiment, I reset my git checkout to 4357976 and attempted to train the LJSpeech model. I stopped the training at ~370K iterations. I see similar behavior as mentioned above: the predicted WAV files sound good, but the synthesized result is much lower quality. This is the command I used, which, based on looking at hparams.py for that specific commit, seems like the correct way to get the preset parameters for LJSpeech.

I again suspected something was amiss with the synthesis process. But with the same environment / setup, when I synthesize using the pre-trained LJSpeech model, the output sounds great - just like the samples provided. At this point, all I can think of is that somehow I hosed up the pre-processing of the LJSpeech data ... maybe I ran pre-processing while my checkout pointed at a later commit? So I reran the preprocess step and have just started another training run using commit 4357976. Any other ideas?
Hi All, I believe the quality drop is really tough to figure out. I guess two things - the random seed and silence in the audio - could cause such degradation. For the random seed: if I checked correctly, there is no random seed initialization, so the seed is set differently on each run and will yield different training outputs. Silence at the beginning and end of the audio could also affect model training.

Here is what I have experienced so far. I only tried JSUT and NIKL, using two commits: aeed2 and ed38d, and I only checked the prediction WAV files produced during training. These WAV files sounded gradually better as the iterations went by. For JSUT, most of these WAV files sounded good at high iteration counts, around 370K. But for NIKL, only a few WAV files sounded good at around 400K iterations. I thought this was a matter of parameter tuning and didn't investigate further. Also, audio left untrimmed at the beginning and end, especially in NIKL, sounded really bad even at high iteration counts. So I trimmed the silence and left a margin of about 100 msec at the beginning and end. Details are here.
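For reference, a minimal sketch of that kind of trimming using librosa; the top_db threshold, margin handling, and file path are my assumptions, not the exact code from the linked details:

```python
import librosa

def trim_silence(wav, sr, top_db=30, margin_ms=100):
    """Trim leading/trailing silence but keep ~100 ms of margin on both ends."""
    margin = int(sr * margin_ms / 1000)
    # librosa.effects.trim returns the trimmed signal and the [start, end] sample indices
    _, (start, end) = librosa.effects.trim(wav, top_db=top_db)
    start = max(0, start - margin)
    end = min(len(wav), end + margin)
    return wav[start:end]

# Example: trim one clip before feature extraction (path is hypothetical)
wav, sr = librosa.load("wavs/some_utterance.wav", sr=22050)
trimmed = trim_silence(wav, sr)
```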
I was able to successfully train a model using 421e8b7. My first attempt (described above) did not work, but after rerunning the pre-processing the results are much better, very similar to the sample results. Somehow I must have hosed up the pre-processing step (not sure how?). I will now try to train using the recent commit 48d1014 (dated Feb 7).
Finally, I'm back. I will do some experiments locally soon. At the moment I am guessing r9y9/wavenet_vocoder#22 is the bug I have to fix, which was introduced in 421e8b7.
The results of training with 48d1014 (now that I have good pre-processed data) were better. One of the synthesized utterances is bad, but several were good. I will give this latest commit a try.
OK, now I can reproduce. Looking into it...
Can confirm the bug with 2987b76: sound quality after 210k iterations on LJSpeech is far from the examples at r9y9.github.io/deepvoice3_pytorch. Will 18bd61d fix it? Were the VCTK and Nyanko models affected too?
No, the bug persists. I'm looking into it.
A little progress:

```diff
diff --git a/audio.py b/audio.py
index 0decdbc..53fa56c 100644
--- a/audio.py
+++ b/audio.py
@@ -45,7 +45,7 @@ def inv_spectrogram(spectrogram):
 def melspectrogram(y):
     D = _lws_processor().stft(preemphasis(y)).T
-    S = _amp_to_db(_linear_to_mel(np.abs(D))) - hparams.ref_level_db
+    S = _amp_to_db(_linear_to_mel(np.abs(D)))  # - hparams.ref_level_db
     if not hparams.allow_clipping_in_normalization:
         assert S.max() <= 0 and S.min() - hparams.min_level_db >= 0
     return _normalize(S)
@@ -69,18 +69,15 @@ def _linear_to_mel(spectrogram):
 def _build_mel_basis():
-    assert hparams.fmax <= hparams.sample_rate // 2
-    return librosa.filters.mel(hparams.sample_rate, hparams.fft_size,
-                               fmin=hparams.fmin, fmax=hparams.fmax,
-                               n_mels=hparams.num_mels)
+    return librosa.filters.mel(hparams.sample_rate, hparams.fft_size, n_mels=hparams.num_mels)
 
 def _amp_to_db(x):
-    return 20 * np.log10(x + 0.01)
+    return 20 * np.log10(np.maximum(1e-5, x))
 
 def _db_to_amp(x):
-    return np.maximum(np.power(10.0, x * 0.05) - 0.01, 0.0)
+    return np.power(10.0, x * 0.05)
```

With this I can get reasonable quality after 100k steps. Will look into it further.
I had misunderstood how log compression is done in Tacotron 2. The old code already did clipping the same way as mentioned in Tacotron 2. Although the threshold differs between ours (1e-5) and Tacotron 2's (0.01), I think it's safe to keep the value as is because it has been tested for months. We can try 0.01 later. This should fix #43.
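For anyone following along, here is a tiny sketch (mine, not from the repo) of what those two thresholds mean in dB terms: clipping at 1e-5 floors the spectrogram at -100 dB, while Tacotron 2's 0.01 floors it at -40 dB.

```python
import numpy as np

def amp_to_db(x, clip_val):
    # Log compression with a hard floor, matching the fixed _amp_to_db above
    return 20 * np.log10(np.maximum(clip_val, x))

silence = np.zeros(4)            # magnitudes of a silent frame
print(amp_to_db(silence, 1e-5))  # [-100. -100. -100. -100.]  -> this repo's floor
print(amp_to_db(silence, 0.01))  # [-40. -40. -40. -40.]      -> Tacotron 2's floor
```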
I think #46 should fix this. I'm training a model from scratch to confirm that it actually fixes the problem.
@r9y9 With Tacotron 2 I've been able to get reasonable quality in 20k iterations using the log of the clipped magnitudes instead of a representation normalized to [0, 1].
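To make the comparison concrete, a rough sketch of the two feature choices; the helper names are mine, and the min/ref dB values are what I believe the repo's hparams default to, not verified against any particular commit:

```python
import numpy as np

MIN_LEVEL_DB = -100  # assumed hparams.min_level_db
REF_LEVEL_DB = 20    # assumed hparams.ref_level_db

def log_clipped_magnitudes(mag):
    # Option 1: train directly on log-compressed (clipped) magnitudes
    return 20 * np.log10(np.maximum(1e-5, mag))

def normalized_features(mag):
    # Option 2: subtract a reference level and squash into [0, 1]
    # (the _normalize path that audio.py's melspectrogram uses)
    db = 20 * np.log10(np.maximum(1e-5, mag)) - REF_LEVEL_DB
    return np.clip((db - MIN_LEVEL_DB) / -MIN_LEVEL_DB, 0, 1)
```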
Sorry for the bug. I think I fixed the problem. Feel free to reopen if you still see the bug.
Based on my tests, I think you fixed the problem. Thanks!
Thanks for your excellent implementation of Deep Voice 3. I am attempting to retrain a DeepVoice3 model using the LJSpeech data. I want to train a new model because I plan to make some small model-parameter changes in order to enable fine-tuning on some Spanish data that I have.
As a first step I tried to retrain the baseline model and I have run into some issues.
With my installation, I have been able to successfully synthesize using the pre-trained DeepVoice3 model with git commit 4357976 as your instructions indicate. That synthesized audio sounds very much like the samples linked from the instructions page.
However, I am trying to train now with the latest git commit (commit 48d1014, dated Feb 7). I am using the LJSpeech data set downloaded from the link you provided. I have run the pre-processing and training steps as indicated in your instructions. I am using the default preset parameters for deepvoice3_ljspeech.
I have let the training process run for a while. When I synthesize using the checkpoint saved at 210K iterations, the alignment is bad and the audio is very robotic and mostly unintelligible.
When I synthesize using the checkpoint saved at 700K iterations, the alignment is better (but not great); the audio is improved but still robotic and choppy.
I can post the synthesized wav files via dropbox if you are interested. I expected to have good alignment and audio at 210K iterations as that is what the pretrained model used.
Any ideas what has changed between git commits 4357976 and 48d1014 that could have caused this issue? When I diff the two commits, I see some changes in audio.py, some places where support for multi-voice has been added, and some other changes I do not yet understand. There are some additions to hparams.py, but I only noticed one difference: in the current commit, masked_loss_weight defaults to 0.5, but in the prior commit the default was 0.0.
I have just started a new training run with masked_loss_weight set to 0.0. In the meantime, do you have thoughts on anything else that might be causing the issues I am seeing?
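For reference, a minimal sketch of how such an hparams override is typically applied in this codebase, assuming hparams.py exposes a tf.contrib.training.HParams instance (as it did around these commits) and that train.py feeds its --hparams string into it; the module layout is assumed rather than verified:

```python
# Equivalent to passing --hparams="masked_loss_weight=0.0" to train.py
from hparams import hparams  # assumed: hparams.py exposes an HParams instance

hparams.parse("masked_loss_weight=0.0")
print(hparams.masked_loss_weight)  # 0.0 instead of the newer 0.5 default
```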