
Commit b96ebdb

Update docs and samples
1 parent 56962cc commit b96ebdb

9 files changed (+320, -179 lines)

.gitignore (+1)

@@ -23,3 +23,4 @@ asr-label*
 /.cache
 /fishenv
 /.locale
+/demo-audios

docs/en/finetune.md (+103 -67)

@@ -2,65 +2,22 @@
 
 Obviously, when you opened this page, you were not satisfied with the performance of the few-shot pre-trained model. You want to fine-tune a model to improve its performance on your dataset.
 
-`Fish Speech` consists of three modules: `VQGAN`, `LLAMA`and `VITS`.
+`Fish Speech` consists of three modules: `VQGAN`, `LLAMA`, and `VITS`.
 
 !!! info
-    You should first conduct the following test to determine if you need to fine-tune `VQGAN`:
+    You should first conduct the following test to determine if you need to fine-tune `VITS Decoder`:
     ```bash
     python tools/vqgan/inference.py -i test.wav
+    python tools/vits_decoder/inference.py \
+        -ckpt checkpoints/vits_decoder_v1.1.ckpt \
+        -i fake.npy -r test.wav \
+        --text "The text you want to generate"
     ```
-    This test will generate a `fake.wav` file. If the timbre of this file differs from the speaker's original voice, or if the quality is not high, you need to fine-tune `VQGAN`.
+    This test will generate a `fake.wav` file. If the timbre of this file differs from the speaker's original voice, or if the quality is not high, you need to fine-tune `VITS Decoder`.
 
 Similarly, you can refer to [Inference](inference.md) to run `generate.py` and evaluate if the prosody meets your expectations. If it does not, then you need to fine-tune `LLAMA`.
 
-It is recommended to fine-tune the LLAMA and VITS model first, then fine-tune the `VQGAN` according to your needs.
-
-## Fine-tuning VQGAN
-### 1. Prepare the Dataset
-
-```
-.
-├── SPK1
-│   ├── 21.15-26.44.mp3
-│   ├── 27.51-29.98.mp3
-│   └── 30.1-32.71.mp3
-└── SPK2
-    └── 38.79-40.85.mp3
-```
-
-You need to format your dataset as shown above and place it under `data`. Audio files can have `.mp3`, `.wav`, or `.flac` extensions.
-
-### 2. Split Training and Validation Sets
-
-```bash
-python tools/vqgan/create_train_split.py data
-```
-
-This command will create `data/vq_train_filelist.txt` and `data/vq_val_filelist.txt` in the `data/demo` directory, to be used for training and validation respectively.
-
-!!!info
-    For the VITS format, you can specify a file list using `--filelist xxx.list`.
-    Please note that the audio files in `filelist` must also be located in the `data` folder.
-
-### 3. Start Training
-
-```bash
-python fish_speech/train.py --config-name vqgan_finetune
-```
-
-!!! note
-    You can modify training parameters by editing `fish_speech/configs/vqgan_finetune.yaml`, but in most cases, this won't be necessary.
-
-### 4. Test the Audio
-
-```bash
-python tools/vqgan/inference.py -i test.wav --checkpoint-path results/vqgan_finetune/checkpoints/step_000010000.ckpt
-```
-
-You can review `fake.wav` to assess the fine-tuning results.
-
-!!! note
-    You may also try other checkpoints. We suggest using the earliest checkpoint that meets your requirements, as they often perform better on out-of-distribution (OOD) data.
+It is recommended to fine-tune the LLAMA first, then fine-tune the `VITS Decoder` according to your needs.
 
 ## Fine-tuning LLAMA
 ### 1. Prepare the dataset
@@ -168,8 +125,27 @@ After training is complete, you can refer to the [inference](inference.md) secti
 By default, the model will only learn the speaker's speech patterns and not the timbre. You still need to use prompts to ensure timbre stability.
 If you want to learn the timbre, you can increase the number of training steps, but this may lead to overfitting.
 
-## Fine-tuning VITS
-### 1. Prepare the dataset
+#### Fine-tuning with LoRA
+
+!!! note
+    LoRA can reduce the risk of overfitting in models, but it may also lead to underfitting on large datasets.
+
+If you want to use LoRA, please add the following parameter: `[email protected]_config=r_8_alpha_16`.
+
+After training, you need to convert the LoRA weights to regular weights before performing inference.
+
+```bash
+python tools/llama/merge_lora.py \
+    --llama-config dual_ar_2_codebook_medium \
+    --lora-config r_8_alpha_16 \
+    --llama-weight checkpoints/text2semantic-sft-medium-v1.1-4k.pth \
+    --lora-weight results/text2semantic-finetune-medium-lora/checkpoints/step_000000200.ckpt \
+    --output checkpoints/merged.ckpt
+```
+
+
+## Fine-tuning VITS Decoder
+### 1. Prepare the Dataset
 
 ```
 .
@@ -184,32 +160,92 @@ After training is complete, you can refer to the [inference](inference.md) secti
     ├── 38.79-40.85.lab
     └── 38.79-40.85.mp3
 ```
+
 !!! note
-    The fine-tuning for VITS only support the .lab format files, please don't use .list file!
+    VITS fine-tuning currently only supports `.lab` as the label file and does not support the `filelist` format.
 
-You need to convert the dataset to the format above, and move them to the `data` , the suffix of the files can be `.mp3`, `.wav` `.flac`, the label files' suffix are recommended to be `.lab`.
+You need to format your dataset as shown above and place it under `data`. Audio files can have `.mp3`, `.wav`, or `.flac` extensions, and the annotation files should have the `.lab` extension.
 
-### 2.Start Training
+### 2. Split Training and Validation Sets
 
 ```bash
-python fish_speech/train.py --config-name vits_decoder_finetune
+python tools/vqgan/create_train_split.py data
 ```
 
+This command will create `data/vq_train_filelist.txt` and `data/vq_val_filelist.txt` in the `data` directory, to be used for training and validation respectively.
 
-#### Fine-tuning with LoRA
+!!! info
+    For the VITS format, you can specify a file list using `--filelist xxx.list`.
+    Please note that the audio files in `filelist` must also be located in the `data` folder.
+
+### 3. Start Training
+
+```bash
+python fish_speech/train.py --config-name vits_decoder_finetune
+```
 
 !!! note
-    LoRA can reduce the risk of overfitting in models, but it may also lead to underfitting on large datasets.
+    You can modify training parameters by editing `fish_speech/configs/vits_decoder_finetune.yaml`, but in most cases, this won't be necessary.
 
-If you want to use LoRA, please add the following parameter: `[email protected]_config=r_8_alpha_16`.
+### 4. Test the Audio
+
+```bash
+python tools/vits_decoder/inference.py \
+    --checkpoint-path results/vits_decoder_finetune/checkpoints/step_000010000.ckpt \
+    -i test.npy -r test.wav \
+    --text "The text you want to generate"
+```
 
-After training, you need to convert the LoRA weights to regular weights before performing inference.
+You can review `fake.wav` to assess the fine-tuning results.
+
+
+## Fine-tuning VQGAN (Not Recommended)
+
+
+We no longer recommend using VQGAN for fine-tuning in version 1.1. Using the VITS Decoder will yield better results, but if you still want to fine-tune VQGAN, you can refer to the following steps.
+
+### 1. Prepare the Dataset
+
+```
+.
+├── SPK1
+│   ├── 21.15-26.44.mp3
+│   ├── 27.51-29.98.mp3
+│   └── 30.1-32.71.mp3
+└── SPK2
+    └── 38.79-40.85.mp3
+```
+
+You need to format your dataset as shown above and place it under `data`. Audio files can have `.mp3`, `.wav`, or `.flac` extensions.
+
+### 2. Split Training and Validation Sets
 
 ```bash
-python tools/llama/merge_lora.py \
-    --llama-config dual_ar_2_codebook_medium \
-    --lora-config r_8_alpha_16 \
-    --llama-weight checkpoints/text2semantic-sft-medium-v1.1-4k.pth \
-    --lora-weight results/text2semantic-finetune-medium-lora/checkpoints/step_000000200.ckpt \
-    --output checkpoints/merged.ckpt
+python tools/vqgan/create_train_split.py data
 ```
+
+This command will create `data/vq_train_filelist.txt` and `data/vq_val_filelist.txt` in the `data` directory, to be used for training and validation respectively.
+
+!!! info
+    For the VITS format, you can specify a file list using `--filelist xxx.list`.
+    Please note that the audio files in `filelist` must also be located in the `data` folder.
+
+### 3. Start Training
+
+```bash
+python fish_speech/train.py --config-name vqgan_finetune
+```
+
+!!! note
+    You can modify training parameters by editing `fish_speech/configs/vqgan_finetune.yaml`, but in most cases, this won't be necessary.
+
+### 4. Test the Audio
+
+```bash
+python tools/vqgan/inference.py -i test.wav --checkpoint-path results/vqgan_finetune/checkpoints/step_000010000.ckpt
+```
+
+You can review `fake.wav` to assess the fine-tuning results.
+
+!!! note
+    You may also try other checkpoints. We suggest using the earliest checkpoint that meets your requirements, as they often perform better on out-of-distribution (OOD) data.
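
The finetune.md changes above require a `.lab` annotation file next to each audio clip but never show what it contains. As a minimal, hypothetical sketch (an assumption on our part, not something this commit specifies), each `.lab` file is treated as holding the plain-text transcript of its paired clip:

```bash
# Assumption: a .lab file contains only the plain-text transcript of the clip it is named after.
ls data/SPK1
# 21.15-26.44.lab  21.15-26.44.mp3
cat data/SPK1/21.15-26.44.lab
# An example transcript sentence matching the audio in 21.15-26.44.mp3.
```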

docs/en/index.md (+2 -2)

@@ -39,13 +39,13 @@ pip3 install torch torchvision torchaudio
 # Install fish-speech
 pip3 install -e .
 
-#install sox
+# (Ubuntu / Debian User) Install sox
 apt install libsox-dev
 ```
 
 ## Changelog
 
-- 2024/05/10: Updated Fish-Speech to 1.1 version, importing VITS as the Decoder part.
+- 2024/05/10: Updated Fish-Speech to version 1.1, implementing a VITS decoder to reduce WER and improve timbre similarity.
 - 2024/04/22: Finished Fish-Speech 1.0 version, significantly modified VQGAN and LLAMA models.
 - 2023/12/28: Added `lora` fine-tuning support.
 - 2023/12/27: Add `gradient checkpointing`, `causual sampling`, and `flash-attn` support.

docs/en/inference.md (+38 -4)

@@ -5,10 +5,12 @@ Inference support command line, HTTP API and web UI.
 !!! note
     Overall, reasoning consists of several parts:
 
-    1. Encode a given 5-10 seconds of voice using VQGAN.
+    1. Encode a given ~10 seconds of voice using VQGAN.
     2. Input the encoded semantic tokens and the corresponding text into the language model as an example.
     3. Given a new piece of text, let the model generate the corresponding semantic tokens.
-    4. Input the generated semantic tokens into VQGAN to decode and generate the corresponding voice.
+    4. Input the generated semantic tokens into VITS / VQGAN to decode and generate the corresponding voice.
+
+    In version 1.1, we recommend using VITS for decoding, as it performs better than VQGAN in both timbre and pronunciation.
 
 ## Command Line Inference
 
@@ -17,6 +19,7 @@ Download the required `vqgan` and `text2semantic` models from our Hugging Face r
 ```bash
 huggingface-cli download fishaudio/fish-speech-1 vq-gan-group-fsq-2x1024.pth --local-dir checkpoints
 huggingface-cli download fishaudio/fish-speech-1 text2semantic-sft-medium-v1.1-4k.pth --local-dir checkpoints
+huggingface-cli download fishaudio/fish-speech-1 vits_decoder_v1.1.ckpt --local-dir checkpoints
 ```
 
 ### 1. Generate prompt from voice:
@@ -56,6 +59,16 @@ This command will create a `codes_N` file in the working directory, where N is a
 If you are using your own fine-tuned model, please be sure to carry the `--speaker` parameter to ensure the stability of pronunciation.
 
 ### 3. Generate vocals from semantic tokens:
+
+#### VITS Decoder
+```bash
+python tools/vits_decoder/inference.py \
+    --checkpoint-path checkpoints/vits_decoder_v1.1.ckpt \
+    -i codes_0.npy -r ref.wav \
+    --text "The text you want to generate"
+```
+
+#### VQGAN Decoder (not recommended)
 ```bash
 python tools/vqgan/inference.py \
     -i "codes_0.npy" \
@@ -71,11 +84,20 @@ python -m tools.api \
     --listen 0.0.0.0:8000 \
     --llama-checkpoint-path "checkpoints/text2semantic-sft-medium-v1.1-4k.pth" \
     --llama-config-name dual_ar_2_codebook_medium \
-    --vqgan-checkpoint-path "checkpoints/vq-gan-group-fsq-2x1024.pth"
+    --decoder-checkpoint-path "checkpoints/vq-gan-group-fsq-2x1024.pth" \
+    --decoder-config-name vqgan_pretrain
 ```
 
 After that, you can view and test the API at http://127.0.0.1:8000/.
 
+!!! info
+    You should use the following parameters to start the VITS decoder:
+
+    ```bash
+    --decoder-config-name vits_decoder_finetune \
+    --decoder-checkpoint-path "checkpoints/vits_decoder_v1.1.ckpt" # or your own model
+    ```
+
 ## WebUI Inference
 
 You can start the WebUI using the following command:
@@ -84,7 +106,19 @@ You can start the WebUI using the following command:
 python -m tools.webui \
     --llama-checkpoint-path "checkpoints/text2semantic-sft-medium-v1.1-4k.pth" \
     --llama-config-name dual_ar_2_codebook_medium \
-    --vqgan-checkpoint-path "checkpoints/vq-gan-group-fsq-2x1024.pth"
+    --vqgan-checkpoint-path "checkpoints/vq-gan-group-fsq-2x1024.pth" \
+    --vits-checkpoint-path "checkpoints/vits_decoder_v1.1.ckpt"
 ```
 
+!!! info
+    You should use the following parameters to start the VITS decoder:
+
+    ```bash
+    --decoder-config-name vits_decoder_finetune \
+    --decoder-checkpoint-path "checkpoints/vits_decoder_v1.1.ckpt" # or your own model
+    ```
+
+!!! note
+    You can use Gradio environment variables, such as `GRADIO_SHARE`, `GRADIO_SERVER_PORT`, `GRADIO_SERVER_NAME` to configure WebUI.
+
 Enjoy!
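
The WebUI note above names Gradio environment variables without showing how they combine with the launch command. A minimal sketch, assuming the stock checkpoints from this commit and an illustrative host/port (not prescribed by the docs):

```bash
# Illustrative only: bind the WebUI to all interfaces on port 7860 via Gradio environment variables.
GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7860 \
python -m tools.webui \
    --llama-checkpoint-path "checkpoints/text2semantic-sft-medium-v1.1-4k.pth" \
    --llama-config-name dual_ar_2_codebook_medium \
    --vqgan-checkpoint-path "checkpoints/vq-gan-group-fsq-2x1024.pth" \
    --vits-checkpoint-path "checkpoints/vits_decoder_v1.1.ckpt"
```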
