Add Hindi (hi-IN) support for TTS #15248

Merged
chtruong814 merged 13 commits into NVIDIA-NeMo:main from quapham:hindi_char on Jan 26, 2026

@quapham
Contributor

@quapham quapham commented Jan 2, 2026

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do?

Add Hindi (hi-IN) TTS support with char tokenizer

Collection: [Note which collection this PR will affect]
TTS, common

Changelog

  • nemo/collections/common/tokenizers/text_to_speech/ipa_lexicon.py
    Added Hindi grapheme character set (hi-IN) with Devanagari script support

  • nemo/collections/common/tokenizers/text_to_speech/tts_tokenizers.py
    "HindiCharsTokenizer" class for Hindi grapheme tokenization
    Handles Unicode decomposition correctly (e.g., ड़ = ड + ़ nukta)
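
The nukta example above can be checked with Python's standard `unicodedata` module. The sketch below is illustrative only and does not use NeMo. It shows that the precomposed letter U+095C canonically decomposes into the base letter U+0921 followed by the nukta sign U+093C, which is why a character-level tokenizer may see two symbols where a reader sees one. The helper name `code_points` is hypothetical.

```python
import unicodedata

PRECOMPOSED_DDDHA = "\u095c"  # DEVANAGARI LETTER DDDHA, a single code point
BASE_DDA = "\u0921"           # DEVANAGARI LETTER DDA
NUKTA = "\u093c"              # DEVANAGARI SIGN NUKTA (combining mark)


def code_points(text: str) -> list[str]:
    """Return the code points of `text` formatted as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


# NFD applies the canonical decomposition defined by Unicode.
decomposed = unicodedata.normalize("NFD", PRECOMPOSED_DDDHA)

print(code_points(PRECOMPOSED_DDDHA))  # ['U+095C']
print(code_points(decomposed))         # ['U+0921', 'U+093C']
print(decomposed == BASE_DDA + NUKTA)  # True
```

Whether the NeMo tokenizer normalizes its input or simply lists the nukta as its own grapheme is not stated in this description, so treat the normalization step here as an assumption.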

Usage

  • You can potentially add a usage example below
# Code tests for Hindi character tokenizer
from nemo.collections.common.tokenizers.text_to_speech.tts_tokenizers import HindiCharsTokenizer
tokenizer = HindiCharsTokenizer(
    punct=True,
    apostrophe=True,
    pad_with_space=False
)
text="अंगड़ाई"
ids = tokenizer.encode(text)
decoded = tokenizer.decode(ids)

print(f"IDs:   {ids}")
print(f"Decoded: {decoded}")

# IDs:   [74, 138, 90, 100, 141, 124, 77]
# Decoded: अंगड़ाई
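
The seven IDs reported above are consistent with one ID per Unicode code point, with the nukta counted separately. The toy class below is a hypothetical stand-in, not `HindiCharsTokenizer`: it builds its vocabulary from the input itself, so its IDs differ from the NeMo ones. It assumes the example word is stored in decomposed form (base letter plus nukta).

```python
# Toy code-point tokenizer for illustration only; NOT the NeMo implementation.
# The example word written with escapes so the nukta stays decomposed:
# U+0905 U+0902 U+0917 U+0921 U+093C U+093E U+0908
WORD = "\u0905\u0902\u0917\u0921\u093c\u093e\u0908"


class ToyCharTokenizer:
    """Maps each distinct code point to an integer ID (sorted order)."""

    def __init__(self, symbols: str) -> None:
        self.id_to_symbol = sorted(set(symbols))
        self.symbol_to_id = {s: i for i, s in enumerate(self.id_to_symbol)}

    def encode(self, text: str) -> list[int]:
        # One ID per code point; combining marks are tokens of their own.
        return [self.symbol_to_id[ch] for ch in text]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.id_to_symbol[i] for i in ids)


toy = ToyCharTokenizer(WORD)
toy_ids = toy.encode(WORD)

print(len(toy_ids))                 # 7, one per code point
print(toy.decode(toy_ids) == WORD)  # True, the round trip is lossless
```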

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

@XuesongYang
Collaborator

XuesongYang commented Jan 5, 2026

Contributor

Copilot AI left a comment

Pull request overview

This PR adds Hindi (hi-IN) language support for TTS functionality by introducing character-based tokenization and IPA (International Phonetic Alphabet) grapheme-to-phoneme (G2P) capabilities for the Devanagari script.

Key Changes:

  • Adds Hindi locale ("hi-IN") to supported TTS locales
  • Introduces HindiCharsTokenizer class for character-level tokenization of Devanagari script
  • Extends Unicode range support to include Devanagari characters (\u0900-\u097F) across the TTS pipeline
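
As a rough illustration of the Unicode range mentioned in the last bullet, the sketch below matches runs of Devanagari code points with Python's standard `re` module. The pattern and the helper name `devanagari_runs` are assumptions made for this example; they are not the regexes used in NeMo, and the commit list for this PR indicates that `tokenizer_utils.py` was later restored to its base version.

```python
import re

# Devanagari block as quoted in the review summary: U+0900 through U+097F.
DEVANAGARI_RUN = re.compile(r"[\u0900-\u097F]+")


def devanagari_runs(text: str) -> list[str]:
    """Return every maximal run of Devanagari code points found in `text`."""
    return DEVANAGARI_RUN.findall(text)


# Mixed-script sample; the Devanagari word is written with escapes.
sample = "TTS says \u0928\u092e\u0938\u094d\u0924\u0947 to NeMo"
print(devanagari_runs(sample))
```

A character class over the whole block also covers combining signs such as the virama (U+094D) and the nukta (U+093C), so a word with conjuncts stays in one run.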

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 5 comments.

File and description:

  • nemo/collections/common/tokenizers/text_to_speech/ipa_lexicon.py
    Adds Hindi to supported locales; defines Devanagari grapheme character set and Hindi IPA phoneme set

  • nemo/collections/common/tokenizers/text_to_speech/tokenizer_utils.py
    Extends regex patterns to include Devanagari Unicode range for text processing

  • nemo/collections/common/tokenizers/text_to_speech/tts_tokenizers.py
    Implements new HindiCharsTokenizer class with custom encoding logic for Devanagari combining marks

  • nemo/collections/tts/g2p/models/i18n_ipa.py
    Extends character and punctuation regex patterns to support Devanagari script; adds experimental decorator to IpaG2p class

Comment thread nemo/collections/common/tokenizers/text_to_speech/ipa_lexicon.py Outdated
Comment on lines +176 to 182
"hi-IN": (
'.', 'a', 'b', 'c', 'd', 'e', 'f', 'h', 'i', 'j',
'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't',
'u', 'w', 'x', 'z', 'ŋ', 'ɔ', 'ɖ', 'ə', 'ɛ', 'ɟ',
'ɡ', 'ɣ', 'ɪ', 'ɭ', 'ɲ', 'ɳ', 'ɾ', 'ʂ', 'ʃ', 'ʈ',
'ʊ', 'ʋ', 'ʌ', 'ʰ', 'ː', '̃', '̩', 'χ',
),
Collaborator

@XuesongYang XuesongYang Jan 6, 2026

"hi-IN" is defined in IPA_CHARACTER_SETS, so I guess you may have implemented phoneme tokenizer as well. Could you pls also add below items if any?

  1. Hindi's IPA dictionary;
  2. if no phoneme tokenizer is defined, pls ensure IpaG2p and IPATokenizer correctly support Hindi.
  3. add unit test following test_ipa_tokenizer_es_es.

Contributor Author

Yes, it will have a dictionary. Currently, it works in our tests, so I will add this as well.

Collaborator

nice. pls add those changes.

Signed-off-by: quanpham <youngkwan199@gmail.com>
Signed-off-by: quanpham <youngkwan199@gmail.com>
Signed-off-by: quanpham <youngkwan199@gmail.com>
Signed-off-by: quanpham <youngkwan199@gmail.com>
@XuesongYang XuesongYang marked this pull request as ready for review January 22, 2026 19:21
@XuesongYang XuesongYang self-requested a review January 22, 2026 19:21
XuesongYang previously approved these changes Jan 22, 2026
Collaborator

@XuesongYang XuesongYang left a comment

Let’s merge this grapheme-based tokenizer. Would you pls open a new PR adding the phoneme-based ones?

@subhankar-ghosh
Collaborator

This PR is just for the char tokenizer and NOT the phoneme tokenizer, correct? @quapham You have the phoneme dict, but it is not being used anywhere in this PR. Is that the correct understanding?

@quapham
Contributor Author

quapham commented Jan 23, 2026

Thanks, @XuesongYang. I will create a new PR for the IPA G2P phoneme tokenizer.
@subhankar-ghosh, you are correct, this PR currently only supports the Hindi char tokenizer.

@XuesongYang XuesongYang enabled auto-merge (squash) January 24, 2026 00:26
@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Jan 24, 2026
Signed-off-by: Jason <jasoli@nvidia.com>
@chtruong814
Collaborator

The test failure is likely a temporary network issue. It's a 404 from Hugging Face on a test that has worked before. Will go ahead and merge this.

@chtruong814 chtruong814 disabled auto-merge January 26, 2026 23:05
@chtruong814 chtruong814 merged commit 119593e into NVIDIA-NeMo:main Jan 26, 2026
514 of 531 checks passed
@chtruong814 chtruong814 removed the needs-follow-up Issue needs follow-up label Jan 26, 2026
nune-tadevosyan pushed a commit to nune-tadevosyan/NeMo that referenced this pull request Mar 13, 2026
* add Hindi char tokenizer, IPA G2P, and Unicode Hindi support

Signed-off-by: quanpham <youngkwan199@gmail.com>

* Add Hindi chars tokenizer

Signed-off-by: quanpham <youngkwan199@gmail.com>

* hindi grapheme and ipa sets

Signed-off-by: quanpham <youngkwan199@gmail.com>

* remove ipa hindi

Signed-off-by: quanpham <youngkwan199@gmail.com>

* remove hindi ipa

Signed-off-by: quanpham <youngkwan199@gmail.com>

* Restore file to base version

Signed-off-by: quanpham <youngkwan199@gmail.com>

* hindi chartokenizer unit test

Signed-off-by: quanpham <youngkwan199@gmail.com>

* Restore tokenizer_utils.py to base version

Signed-off-by: quanpham <youngkwan199@gmail.com>

* Apply suggestion from @Copilot

    Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>

* Apply isort and black reformatting

Signed-off-by: XuesongYang <XuesongYang@users.noreply.github.com>

* Update nemo/collections/common/tokenizers/text_to_speech/ipa_lexicon.py

Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>

* Add simple docstrings for helper funcs

Signed-off-by: Jason <jasoli@nvidia.com>

---------

Signed-off-by: quanpham <youngkwan199@gmail.com>
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Signed-off-by: XuesongYang <XuesongYang@users.noreply.github.com>
Signed-off-by: Jason <jasoli@nvidia.com>
Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: XuesongYang <XuesongYang@users.noreply.github.com>
Co-authored-by: Jason <jasoli@nvidia.com>