Add Hindi (hi-IN) support for TTS #15248
Could you pls add unit tests to verify that they work as expected? Examples are,
Pull request overview
This PR adds Hindi (hi-IN) language support for TTS functionality by introducing character-based tokenization and IPA (International Phonetic Alphabet) grapheme-to-phoneme (G2P) capabilities for the Devanagari script.
Key Changes:
- Adds Hindi locale ("hi-IN") to supported TTS locales
- Introduces HindiCharsTokenizer class for character-level tokenization of Devanagari script
- Extends Unicode range support to include Devanagari characters (\u0900-\u097F) across the TTS pipeline
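The Devanagari range named in the list above can be illustrated with a short sketch. This is not the code from the PR: the names `DEVANAGARI_RUN` and `devanagari_spans` are hypothetical, and only the \u0900-\u097F range itself comes from the overview.

```python
import re

# Hypothetical pattern for illustration: one or more characters from the
# Devanagari Unicode block (U+0900 to U+097F), the range named in the overview.
DEVANAGARI_RUN = re.compile(r"[\u0900-\u097F]+")


def devanagari_spans(text: str) -> list:
    """Return every maximal run of Devanagari characters found in text."""
    return DEVANAGARI_RUN.findall(text)


# "namaste" written with escapes so the code points are explicit.
sample = "\u0928\u092e\u0938\u094d\u0924\u0947 world"
print(devanagari_spans(sample))
```

Vowel signs and the virama sit in the same block as the base consonants, so a single character class is enough to keep a Devanagari word together while Latin text is left out.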
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| nemo/collections/common/tokenizers/text_to_speech/ipa_lexicon.py | Adds Hindi to supported locales; defines Devanagari grapheme character set and Hindi IPA phoneme set |
| nemo/collections/common/tokenizers/text_to_speech/tokenizer_utils.py | Extends regex patterns to include Devanagari Unicode range for text processing |
| nemo/collections/common/tokenizers/text_to_speech/tts_tokenizers.py | Implements new HindiCharsTokenizer class with custom encoding logic for Devanagari combining marks |
| nemo/collections/tts/g2p/models/i18n_ipa.py | Extends character and punctuation regex patterns to support Devanagari script; adds experimental decorator to IpaG2p class |
"hi-IN": (
    '.', 'a', 'b', 'c', 'd', 'e', 'f', 'h', 'i', 'j',
    'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't',
    'u', 'w', 'x', 'z', 'ŋ', 'ɔ', 'ɖ', 'ə', 'ɛ', 'ɟ',
    'ɡ', 'ɣ', 'ɪ', 'ɭ', 'ɲ', 'ɳ', 'ɾ', 'ʂ', 'ʃ', 'ʈ',
    'ʊ', 'ʋ', 'ʌ', 'ʰ', 'ː', '̃', '̩', 'χ',
),
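One way to sanity-check a pronunciation against the symbol set quoted above is sketched below. The helper `unknown_symbols` and the constant `HI_IN_IPA_SYMBOLS` are hypothetical names, not part of NeMo, and the two combining marks are assumed to be U+0303 and U+0329.

```python
from typing import FrozenSet, List

# Copy of the "hi-IN" symbols quoted in the review comment. The IPA script g
# and the two combining marks are written as escapes to make them visible;
# reading the combining marks as U+0303 and U+0329 is an assumption.
HI_IN_IPA_SYMBOLS: FrozenSet[str] = frozenset(
    [
        '.', 'a', 'b', 'c', 'd', 'e', 'f', 'h', 'i', 'j',
        'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't',
        'u', 'w', 'x', 'z', 'ŋ', 'ɔ', 'ɖ', 'ə', 'ɛ', 'ɟ',
        '\u0261', 'ɣ', 'ɪ', 'ɭ', 'ɲ', 'ɳ', 'ɾ', 'ʂ', 'ʃ', 'ʈ',
        'ʊ', 'ʋ', 'ʌ', 'ʰ', 'ː', '\u0303', '\u0329', 'χ',
    ]
)


def unknown_symbols(phonemes: str, allowed: FrozenSet[str] = HI_IN_IPA_SYMBOLS) -> List[str]:
    """Return the non-space characters of phonemes that are not in allowed."""
    return [ch for ch in phonemes if not ch.isspace() and ch not in allowed]


# ASCII "g" is reported because the set only contains the IPA script g (U+0261).
print(unknown_symbols("g\u0259"))
```

A check like this could back a unit test for a Hindi IPA dictionary, since it catches look-alike characters before they reach a tokenizer.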
"hi-IN" is defined in IPA_CHARACTER_SETS, so I guess you may have implemented phoneme tokenizer as well. Could you pls also add below items if any?
- Hindi's IPA dictionary;
- if no phoneme tokenizer is defined, pls ensure IpaG2p and IPATokenizer correctly support Hindi.
- add unit test following test_ipa_tokenizer_es_es.
Yes, it will have a dictionary. Currently, it works in our tests, so I will add this as well.
nice. pls add those changes.
Signed-off-by: quanpham <youngkwan199@gmail.com>
This PR is just for the char tokenizer and NOT the phoneme tokenizer, correct? @quapham You have the phoneme dict, but it is not used anywhere in this PR. Is that the correct understanding?
Thanks, @XuesongYang. I will create a new PR for IPA G2P phoneme support.
Signed-off-by: Jason <jasoli@nvidia.com>
The test failure is likely a temporary network issue. It's a 404 from Hugging Face on a test that has worked before. Will go ahead and merge this.
* add Hindi char tokenizer, IPA G2P, and Unicode Hindi support
Signed-off-by: quanpham <youngkwan199@gmail.com>
* Add Hindi chars tokenizer
Signed-off-by: quanpham <youngkwan199@gmail.com>
* hindi grapheme and ipa sets
Signed-off-by: quanpham <youngkwan199@gmail.com>
* remove ipa hindi
Signed-off-by: quanpham <youngkwan199@gmail.com>
* remove hindi ipa
Signed-off-by: quanpham <youngkwan199@gmail.com>
* Restore file to base version
Signed-off-by: quanpham <youngkwan199@gmail.com>
* hindi chartokenizer unit test
Signed-off-by: quanpham <youngkwan199@gmail.com>
* Restore tokenizer_utils.py to base version
Signed-off-by: quanpham <youngkwan199@gmail.com>
* Apply suggestion from @Copilot
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
* Apply isort and black reformatting
Signed-off-by: XuesongYang <XuesongYang@users.noreply.github.com>
* Update nemo/collections/common/tokenizers/text_to_speech/ipa_lexicon.py
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
* Add simple docstrings for helper funcs
Signed-off-by: Jason <jasoli@nvidia.com>
---------
Signed-off-by: quanpham <youngkwan199@gmail.com>
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Signed-off-by: XuesongYang <XuesongYang@users.noreply.github.com>
Signed-off-by: Jason <jasoli@nvidia.com>
Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: XuesongYang <XuesongYang@users.noreply.github.com>
Co-authored-by: Jason <jasoli@nvidia.com>
Important
The Update branch button must only be pressed on very rare occasions. An outdated branch never blocks the merge of a PR.
Please reach out to the automation team before pressing that button.
What does this PR do?
Add Hindi (hi-IN) TTS support with char tokenizer
Collection: [Note which collection this PR will affect]
TTS, common
Changelog
nemo/collections/common/tokenizers/text_to_speech/ipa_lexicon.py
Added Hindi grapheme character set (hi-IN) with Devanagari script support
nemo/collections/common/tokenizers/text_to_speech/tts_tokenizers.py
"HindiCharsTokenizer" class for Hindi grapheme tokenization
Handles Unicode decomposition correctly (e.g., ड़ = ड + ़ nukta)
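The decomposition mentioned in the changelog can be reproduced with Python's standard `unicodedata` module. The sketch below is not the tokenizer's implementation, and the helper names `canonical` and `describe` are hypothetical. It only shows that the precomposed letter and the base-plus-nukta spelling normalize to the same code points, which is the property a character-level tokenizer would presumably rely on.

```python
import unicodedata

DDDHA = "\u095c"            # precomposed DEVANAGARI LETTER DDDHA
DDA_NUKTA = "\u0921\u093c"  # DEVANAGARI LETTER DDA followed by SIGN NUKTA


def canonical(text: str) -> str:
    """Normalize to NFD so both spellings map to one code point sequence."""
    return unicodedata.normalize("NFD", text)


def describe(text: str) -> list:
    """Return the Unicode name of each code point in text."""
    return [unicodedata.name(ch) for ch in text]


print(describe(canonical(DDDHA)))
print(canonical(DDDHA) == canonical(DDA_NUKTA))
```

Normalizing before lookup means a vocabulary needs entries only for the base consonant and the nukta sign, not for every precomposed variant.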
Usage
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.
Additional Information