docs/en/finetune.md

Obviously, when you opened this page, you were not satisfied with the performance of the few-shot pre-trained model. You want to fine-tune a model to improve its performance on your dataset.

`Fish Speech` consists of three modules: `VQGAN`, `LLAMA`, and `VITS`.

!!! info
    You should first conduct the following test to determine if you need to fine-tune the `VITS Decoder`:

    ```bash
    python tools/vits_decoder/inference.py \
        -ckpt checkpoints/vits_decoder_v1.1.ckpt \
        -i fake.npy -r test.wav \
        --text "The text you want to generate"
    ```

    This test will generate a `fake.wav` file. If the timbre of this file differs from the speaker's original voice, or if the quality is not high, you need to fine-tune the `VITS Decoder`.
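
If you do not yet have a `fake.npy`, note that in the previous (v1.0) version of this test it was produced by running VQGAN inference on a reference clip; a minimal sketch, assuming that script still writes the extracted codebook indices to `fake.npy` in the working directory:

```bash
# Extract VQGAN codes from a reference recording; the fake.npy it saves
# is then fed to the VITS decoder test above (an assumption based on the
# v1.0 workflow this command replaces).
python tools/vqgan/inference.py -i test.wav
```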

Similarly, you can refer to [Inference](inference.md) to run `generate.py` and evaluate if the prosody meets your expectations. If it does not, then you need to fine-tune `LLAMA`.

It is recommended to fine-tune `LLAMA` first, then fine-tune the `VITS Decoder` according to your needs.

## Fine-tuning LLAMA

### 1. Prepare the dataset

After training is complete, you can refer to the [inference](inference.md) section.

By default, the model will only learn the speaker's speech patterns and not the timbre. You still need to use prompts to ensure timbre stability.

If you want to learn the timbre, you can increase the number of training steps, but this may lead to overfitting.

#### Fine-tuning with LoRA

!!! note
    LoRA can reduce the risk of overfitting in models, but it may also lead to underfitting on large datasets.

If you want to use LoRA, please add the following parameter: `+lora@model.lora_config=r_8_alpha_16`.
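
Where exactly this flag is appended is elided from this excerpt; a minimal sketch, assuming the fine-tuning run is launched through a Hydra-based `fish_speech/train.py` entry point with a `text2semantic_finetune` config (both names are assumptions here):

```bash
# The leading `+` adds the lora config group on top of the base config
# rather than overriding an existing key (Hydra append syntax).
python fish_speech/train.py --config-name text2semantic_finetune \
    +lora@model.lora_config=r_8_alpha_16
```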

After training, you need to convert the LoRA weights to regular weights before performing inference.
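
The conversion tool itself is not shown in this excerpt; the sketch below assumes a merge script at `tools/llama/merge_lora.py` with hypothetical flag names, so check the repository for the actual interface:

```bash
# Hypothetical interface: fold the trained LoRA deltas back into the base
# LLAMA weights so the regular inference tooling can load a single checkpoint.
python tools/llama/merge_lora.py \
    --lora-weight path/to/lora-checkpoint.ckpt \
    --output checkpoints/merged.ckpt
```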

## Fine-tuning VITS Decoder

### 1. Prepare the dataset

```
.
├── SPK1
│   ├── 21.15-26.44.lab
│   ├── 21.15-26.44.mp3
│   ├── 27.51-29.98.lab
│   ├── 27.51-29.98.mp3
│   ├── 30.1-32.71.lab
│   └── 30.1-32.71.mp3
└── SPK2
    ├── 38.79-40.85.lab
    └── 38.79-40.85.mp3
```

!!! note
    VITS fine-tuning currently only supports `.lab` as the label file and does not support the `filelist` format.
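
A `.lab` file is assumed here to hold the plain-text transcript of the audio clip that shares its name (the exact transcription conventions are not specified in this excerpt); for example, `38.79-40.85.lab` might contain a single line such as:

```
Hello, this is a sample transcript of the paired audio clip.
```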

You need to format your dataset as shown above and place it under `data`. Audio files can have `.mp3`, `.wav`, or `.flac` extensions, and the annotation files should have the `.lab` extension.
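
### 2. Split Training and Validation Sets

As in the VQGAN section below, split the dataset into training and validation sets:

```bash
python tools/vqgan/create_train_split.py data
```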

This command will create `data/vq_train_filelist.txt` and `data/vq_val_filelist.txt` in the `data` directory, to be used for training and validation respectively.

!!! info
    For the VITS format, you can specify a file list using `--filelist xxx.list`.
    Please note that the audio files in `filelist` must also be located in the `data` folder.
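
The `.list` layout itself is not specified in this excerpt; a common VITS-style file list pairs each audio path with its transcript, one clip per line, so a hypothetical `xxx.list` might look like:

```
data/SPK1/21.15-26.44.mp3|Transcript of the first clip.
data/SPK2/38.79-40.85.mp3|Transcript of another clip.
```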

You can review `fake.wav` to assess the fine-tuning results.

## Fine-tuning VQGAN (Not Recommended)

In version 1.1, we no longer recommend fine-tuning VQGAN. Using the `VITS Decoder` will yield better results, but if you still want to fine-tune VQGAN, you can refer to the following steps.

### 1. Prepare the Dataset

```
.
├── SPK1
│   ├── 21.15-26.44.mp3
│   ├── 27.51-29.98.mp3
│   └── 30.1-32.71.mp3
└── SPK2
    └── 38.79-40.85.mp3
```

You need to format your dataset as shown above and place it under `data`. Audio files can have `.mp3`, `.wav`, or `.flac` extensions.
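
### 2. Split Training and Validation Sets

```bash
python tools/vqgan/create_train_split.py data
```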

This command will create `data/vq_train_filelist.txt` and `data/vq_val_filelist.txt` in the `data` directory, to be used for training and validation respectively.

!!! info
    For the VITS format, you can specify a file list using `--filelist xxx.list`.
    Please note that the audio files in `filelist` must also be located in the `data` folder.

You can review `fake.wav` to assess the fine-tuning results.

!!! note
    You may also try other checkpoints. We suggest using the earliest checkpoint that meets your requirements, as they often perform better on out-of-distribution (OOD) data.