Libraries:
https://github.com/nii-yamagishilab/multi-speaker-tacotron
https://github.com/espnet/espnet
https://github.com/Tomiinek/Multilingual_Text_to_Speech
https://github.com/coqui-ai/TTS
https://github.com/erikbern/lang-pitch/blob/master/forvo_spider.py
https://github.com/as-ideas/DeepPhonemizer
Miscellaneous:
Ukrainian ASR: https://huggingface.co/nvidia/stt_uk_citrinet_1024_gamma_0_25
Coqui demo: https://huggingface.co/spaces/coqui/CoquiTTS
YouTube metadata API: https://developers.google.com/youtube/v3/quickstart/python (metadata sketch after this list)
Stream audio from YouTube: https://stackoverflow.com/questions/68522350/python-stream-audio-from-youtube-livestream (pytube sketch after this list)
Segment away the background music: https://huggingface.co/tasks/audio-to-audio (separation sketch after this list)
History of TTS: https://programmer.group/teach-you-how-to-build-a-chinese-speech-synthesis-model-training-and-prediction.html
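
A minimal sketch of the metadata quickstart linked above, using google-api-python-client; the API key and video ID are placeholders, not values from these notes:

    # Fetch a video's title and publish date via the YouTube Data API v3.
    # Requires `pip install google-api-python-client` and a real API key.
    from googleapiclient.discovery import build

    youtube = build("youtube", "v3", developerKey="YOUR_API_KEY")  # placeholder key
    response = youtube.videos().list(part="snippet", id="VIDEO_ID").execute()
    for item in response.get("items", []):
        print(item["snippet"]["title"])
        print(item["snippet"]["publishedAt"])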
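
A hedged sketch of pulling just the audio track with pytube (the pytube-vs-youtube_dl question in the notes below is assumed resolved in pytube's favor here); livestreams may need a different approach than this plain-video example:

    # Download, or get a direct URL for, the audio-only stream of a video.
    # Requires `pip install pytube`; the video URL is a placeholder.
    from pytube import YouTube

    yt = YouTube("https://www.youtube.com/watch?v=VIDEO_ID")
    audio = yt.streams.filter(only_audio=True).order_by("abr").desc().first()
    audio.download(filename="audio.mp4")  # save locally, or...
    print(audio.url)                      # ...stream from this direct URL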
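
For stripping background music, one concrete audio-to-audio option in sketch form (Spleeter; it is not named in these notes, so treat the choice as an assumption):

    # Split a recording into vocals and accompaniment with Spleeter.
    # Requires `pip install spleeter`; the input path is a placeholder.
    from spleeter.separator import Separator

    separator = Separator("spleeter:2stems")  # 2 stems: vocals + accompaniment
    separator.separate_to_file("episode.mp3", "separated/")
    # Writes separated/episode/vocals.wav and separated/episode/accompaniment.wav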
Underlying technology:
VAEs: https://lilianweng.github.io/posts/2018-08-12-vae/ (reparameterization sketch after this list)
Flow models: https://lilianweng.github.io/posts/2018-10-13-flow-models/
Diffusion models: https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
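
Since VITS (under Papers) is a conditional VAE, here is a minimal PyTorch sketch of the reparameterization trick those posts cover; names and shapes are illustrative only:

    # z = mu + sigma * eps keeps the sample differentiable w.r.t. mu and sigma.
    import torch

    def reparameterize(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
        std = torch.exp(0.5 * log_var)  # sigma recovered from log-variance
        eps = torch.randn_like(std)     # eps ~ N(0, I)
        return mu + eps * std           # differentiable latent sample z

    def kl_to_standard_normal(mu, log_var):
        # KL(q(z|x) || N(0, I)) term of the VAE objective, summed over dims.
        return -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())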
Papers:
Tacotron: Towards End-to-End Speech Synthesis [https://google.github.io/tacotron/]: https://arxiv.org/abs/1703.10135
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis [https://google.github.io/tacotron/]: https://arxiv.org/pdf/1806.04558.pdf
Multilingual Speech Synthesis and Cross-Language Voice Cloning [https://google.github.io/tacotron/]: https://arxiv.org/pdf/1907.04448.pdf
(VITS) Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech [https://tts.readthedocs.io/en/latest/models/vits.html]: https://arxiv.org/pdf/2106.06103.pdf
Speaker Generation [https://google.github.io/tacotron/]: https://arxiv.org/pdf/2111.05095.pdf
YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone [https://github.com/Edresson/YourTTS & https://coqui.ai/blog/tts/yourtts-zero-shot-text-synthesis-low-resource-languages]: https://arxiv.org/abs/2112.02418
One Model, Many Languages: Meta-Learning for Multilingual Text-to-Speech [https://github.com/Tomiinek/Multilingual_Text_to_Speech]: https://arxiv.org/pdf/2008.00768.pdf
Towards cross-lingual voice cloning in higher education: https://www.sciencedirect.com/science/article/pii/S095219762100261X
Datasets:
VCTK dataset
NST speech corpus (Norwegian Language Bank): https://www.nb.no/sprakbanken/ressurskatalog/?_search=NST&_type=tale
Ukrainian TTS datasets: https://github.com/egorsmkv/ukrainian-tts-datasets
Videos:
HuggingFace TTS deepdive: https://www.youtube.com/watch?v=aLBedWj-5CQ
Coqui voice cloning tutorial: https://www.youtube.com/watch?v=6QAGk_rHipE
Coqui tutorial playlist: https://www.youtube.com/playlist?list=PL19C7uchWZerUT0qIiEv7m2zXBs5kYl1L
Notes:
Pytube instead of youtube_dl? (a pytube sketch is under Miscellaneous above)
pyannote.audio for speaker diarization (see the sketch at the end of these notes)
@vastal_aggarwal on Twitter said he is open to helping people with TTS
Model saved at: C:\Users\lundb\AppData\Local\tts\tts_models--multilingual--multi-dataset--your_tts
Compute speaker embeddings: python Coqui_TTS/TTS/bin/compute_embeddings.py tts_models/tts_models--multilingual--multi-dataset--your_tts--experimental/model_se.pth tts_models/tts_models--multilingual--multi-dataset--your_tts--experimental/config_se.json tts_models/tts_models--multilingual--multi-dataset--your_tts--experimental/config_se.json (note: config_se.json appears twice; the third argument is probably meant to be the dataset config, not config_se.json again)
Train model: CUDA_VISIBLE_DEVICES=0 python Coqui_TTS/TTS/bin/train_tts.py --config_path tts_models/tts_models--multilingual--multi-dataset--your_tts--experimental/config.json --restore_path tts_models/tts_models--multilingual--multi-dataset--your_tts--experimental/model_file.pth
Run inference on a new model: tts --text "Und überzeugen dank feingefühl für den ganz großen leinwand-stoff." --model_path "C:/Users/lundb/Documents/Other/Voice/checkpoints/vits_tts-german-August-30-2022_12+56PM-0000000/checkpoint_20000.pth" --config_path "C:/Users/lundb/Documents/Other/Voice/checkpoints/vits_tts-german-August-30-2022_12+56PM-0000000/config.json" --out_path "C:/Users/lundb/Documents/Other/Voice/test_german.wav" --speaker_idx "thorsten" --language_idx "de-de" (a Python-API equivalent is sketched at the end of these notes)
Example of someone else training multi-speaker: https://github.com/wannaphong/KhanomTan-TTS-v1.0
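
A hedged pyannote.audio sketch for the diarization note above; the pipeline name and token handling vary between library versions:

    # Who spoke when: speaker diarization over a single audio file.
    # Requires `pip install pyannote.audio` and a Hugging Face access token.
    from pyannote.audio import Pipeline

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization", use_auth_token="HF_TOKEN"  # placeholder token
    )
    diarization = pipeline("episode.wav")
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"{speaker}: {turn.start:.1f}s - {turn.end:.1f}s")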
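
The CLI inference call above also has a Python-API equivalent in recent Coqui TTS releases; a sketch assuming the stock YourTTS checkpoint, with the reference clip and output paths as placeholders:

    # Zero-shot voice cloning with YourTTS via Coqui's Python API.
    # Requires `pip install TTS`; argument names may differ across versions.
    from TTS.api import TTS

    tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")
    tts.tts_to_file(
        text="Hello, this is a cloned voice.",
        speaker_wav="reference_speaker.wav",  # a few seconds of the target voice
        language="en",
        file_path="cloned_output.wav",
    )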