Add Hindi (hi-IN) support for TTS by quapham · Pull Request #15248 · NVIDIA-NeMo/NeMo

quapham · 2026-01-02T06:48:30Z

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do ?

Add Hindi (hi-IN) TTS support with char tokenizer

Collection: [Note which collection this PR will affect]
TTS, common

Changelog

- nemo/collections/common/tokenizers/text_to_speech/ipa_lexicon.py
  - Added Hindi grapheme character set (hi-IN) with Devanagari script support
- nemo/collections/common/tokenizers/text_to_speech/tts_tokenizers.py
  - Added "HindiCharsTokenizer" class for Hindi grapheme tokenization
  - Handles Unicode decomposition correctly (e.g., ड़ = ड + ़ nukta)
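
The decomposition mentioned in the changelog can be checked with Python's standard `unicodedata` module. The sketch below is not taken from the PR; it only illustrates that the precomposed letter ड़ (U+095C) is canonically equivalent to ड (U+0921) followed by the nukta sign (U+093C), which is presumably why a character-level tokenizer has to cope with both code points.

```python
import unicodedata

# Precomposed DEVANAGARI LETTER DDDHA, written as an escape to avoid ambiguity.
precomposed = "\u095c"

# NFD splits the letter into its base consonant and the combining nukta.
decomposed = unicodedata.normalize("NFD", precomposed)

for ch in decomposed:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# U+095C is a composition exclusion, so NFC also yields the two-code-point form.
nfc = unicodedata.normalize("NFC", precomposed)
print(nfc == decomposed)
```

Because both normalization forms produce the decomposed sequence, text that has passed through any Unicode normalization step will present the nukta as a separate character.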

Usage

You can potentially add a usage example below

# Code test for the Hindi character tokenizer
from nemo.collections.common.tokenizers.text_to_speech.tts_tokenizers import HindiCharsTokenizer

tokenizer = HindiCharsTokenizer(
    punct=True,
    apostrophe=True,
    pad_with_space=False,
)
text = "अंगड़ाई"
ids = tokenizer.encode(text)  # map the text to integer token IDs
decoded = tokenizer.decode(ids)  # map the IDs back to text

print(f"IDs:   {ids}")
print(f"Decoded: {decoded}")

# IDs:   [74, 138, 90, 100, 141, 124, 77]
# Decoded: अंगड़ाई
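
For readers without NeMo installed, the round trip above can be approximated with a toy character tokenizer. This is a minimal sketch under stated assumptions, not the `HindiCharsTokenizer` implementation: the vocabulary (a space plus the whole Devanagari block), the NFD normalization step, and the names `VOCAB`, `CHAR_TO_ID`, `encode`, and `decode` are all hypothetical, so the IDs will not match the ones printed in the PR.

```python
import unicodedata

# Hypothetical vocabulary: a space plus every code point in the Devanagari
# block (U+0900-U+097F). NeMo's real Hindi character set and IDs differ.
VOCAB = [" "] + [chr(cp) for cp in range(0x0900, 0x0980)]
CHAR_TO_ID = {ch: i for i, ch in enumerate(VOCAB)}


def encode(text: str) -> list[int]:
    """Map each code point of the NFD-normalized text to an integer ID."""
    normalized = unicodedata.normalize("NFD", text)
    # Characters outside the toy vocabulary are silently dropped.
    return [CHAR_TO_ID[ch] for ch in normalized if ch in CHAR_TO_ID]


def decode(ids: list[int]) -> str:
    """Invert encode() by concatenating the character behind each ID."""
    return "".join(VOCAB[i] for i in ids)


# The word from the usage example, with the nukta letter precomposed (U+095C).
text = "\u0905\u0902\u0917\u095c\u093e\u0908"
ids = encode(text)

print(len(text), len(ids))
print(decode(ids) == unicodedata.normalize("NFD", text))
```

The precomposed word has six code points but encodes to seven IDs in this sketch, which is consistent with the seven IDs shown in the PR's output. Note that decoding returns the decomposed form, which renders identically but is not byte-for-byte equal to a precomposed input.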

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Related to # (issue)

XuesongYang · 2026-01-05T19:40:57Z

Could you pls add unit tests to verify that they work as expected? Examples are,

Copilot

Pull request overview

This PR adds Hindi (hi-IN) language support for TTS functionality by introducing character-based tokenization and IPA (International Phonetic Alphabet) grapheme-to-phoneme (G2P) capabilities for the Devanagari script.

Key Changes:

- Adds Hindi locale ("hi-IN") to supported TTS locales
- Introduces HindiCharsTokenizer class for character-level tokenization of Devanagari script
- Extends Unicode range support to include Devanagari characters (\u0900-\u097F) across the TTS pipeline

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 5 comments.

File	Description
nemo/collections/common/tokenizers/text_to_speech/ipa_lexicon.py	Adds Hindi to supported locales; defines Devanagari grapheme character set and Hindi IPA phoneme set
nemo/collections/common/tokenizers/text_to_speech/tokenizer_utils.py	Extends regex patterns to include Devanagari Unicode range for text processing
nemo/collections/common/tokenizers/text_to_speech/tts_tokenizers.py	Implements new HindiCharsTokenizer class with custom encoding logic for Devanagari combining marks
nemo/collections/tts/g2p/models/i18n_ipa.py	Extends character and punctuation regex patterns to support Devanagari script; adds experimental decorator to IpaG2p class
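
To make the Unicode range in the overview concrete, here is a small illustrative pattern that matches runs of Devanagari code points. It is an assumption for demonstration only: the name `DEVANAGARI_RUN` is hypothetical and this is not the regex used in NeMo's `tokenizer_utils.py` or `i18n_ipa.py`.

```python
import re

# Illustrative pattern only: one or more code points from the Devanagari block.
DEVANAGARI_RUN = re.compile(r"[\u0900-\u097F]+")

# Mixed Latin and Devanagari input; the escapes spell the Hindi word "namaste".
sample = "Hello \u0928\u092e\u0938\u094d\u0924\u0947 world"

runs = DEVANAGARI_RUN.findall(sample)
print(runs)
```

A range like this covers vowel signs, the virama, and the nukta as well as base letters, so a whole word is matched as one run rather than being split at each combining mark.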


XuesongYang · 2026-01-06T19:20:06Z

+    "hi-IN": (
+        '.', 'a', 'b', 'c', 'd', 'e', 'f', 'h', 'i', 'j',
+        'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't',
+        'u', 'w', 'x', 'z', 'ŋ', 'ɔ', 'ɖ', 'ə', 'ɛ', 'ɟ',
+        'ɡ', 'ɣ', 'ɪ', 'ɭ', 'ɲ', 'ɳ', 'ɾ', 'ʂ', 'ʃ', 'ʈ',
+        'ʊ', 'ʋ', 'ʌ', 'ʰ', 'ː', '̃', '̩', 'χ',
    ),


"hi-IN" is defined in IPA_CHARACTER_SETS, so I guess you may have implemented a phoneme tokenizer as well. Could you pls also add the items below, if any?

- Hindi's IPA dictionary;
- if no phoneme tokenizer is defined, pls ensure IpaG2p and IPATokenizer correctly support Hindi;
- add a unit test following test_ipa_tokenizer_es_es.

quapham · Jan 8, 2026

Yes, it will have a dictionary. Currently, it works in our tests, so I will add this as well.

XuesongYang · Jan 9, 2026

Nice. pls add those changes.

XuesongYang

Let’s merge this grapheme-based tokenizer. Would you pls open a new PR adding phoneme-based ones?

subhankar-ghosh · 2026-01-22T19:30:36Z

This PR is just for the char tokenizer, correct, and NOT the phoneme tokenizer? @quapham, you have the phoneme dict, but it is not being used anywhere in this PR. Is that the correct understanding?

quapham · 2026-01-23T02:50:19Z

Thanks, @XuesongYang. I will create a new PR for the IPA G2P phoneme tokenizer.
@subhankar-ghosh, you are correct: this PR currently only supports the Hindi char tokenizer.

chtruong814 · 2026-01-26T23:05:44Z

The test failure is likely a temporary network issue. It's a 404 from Hugging Face on a test that has worked before. Will go ahead and merge this.

* add Hindi char tokenizer, IPA G2P, and Unicode Hindi support
  Signed-off-by: quanpham <youngkwan199@gmail.com>
* Add Hindi chars tokenizer
  Signed-off-by: quanpham <youngkwan199@gmail.com>
* hindi grapheme and ipa sets
  Signed-off-by: quanpham <youngkwan199@gmail.com>
* remove ipa hindi
  Signed-off-by: quanpham <youngkwan199@gmail.com>
* remove hindi ipa
  Signed-off-by: quanpham <youngkwan199@gmail.com>
* Restore file to base version
  Signed-off-by: quanpham <youngkwan199@gmail.com>
* hindi chartokenizer unit test
  Signed-off-by: quanpham <youngkwan199@gmail.com>
* Restore tokenizer_utils.py to base version
  Signed-off-by: quanpham <youngkwan199@gmail.com>
* Apply suggestion from @Copilot
  Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
  Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
* Apply isort and black reformatting
  Signed-off-by: XuesongYang <XuesongYang@users.noreply.github.com>
* Update nemo/collections/common/tokenizers/text_to_speech/ipa_lexicon.py
  Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
  Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
* Add simple docstrings for helper funcs
  Signed-off-by: Jason <jasoli@nvidia.com>

---------

Signed-off-by: quanpham <youngkwan199@gmail.com>
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Signed-off-by: XuesongYang <XuesongYang@users.noreply.github.com>
Signed-off-by: Jason <jasoli@nvidia.com>
Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: XuesongYang <XuesongYang@users.noreply.github.com>
Co-authored-by: Jason <jasoli@nvidia.com>

github-actions Bot added TTS common labels Jan 2, 2026

XuesongYang requested review from XuesongYang and Copilot January 5, 2026 19:37

Copilot started reviewing on behalf of XuesongYang January 5, 2026 19:38

XuesongYang force-pushed the hindi_char branch from b1dedfe to 588c156 Compare January 5, 2026 19:40

github-actions Bot added the community-request label Jan 5, 2026

XuesongYang added the Run CICD label Jan 5, 2026

Copilot AI reviewed Jan 5, 2026

quapham force-pushed the hindi_char branch from 588c156 to 28b2004 Compare January 6, 2026 10:51

chtruong814 added Run CICD and removed Run CICD labels Jan 6, 2026

XuesongYang reviewed Jan 6, 2026

XuesongYang removed Run CICD community-request labels Jan 6, 2026

quapham added 4 commits January 6, 2026 11:28

add Hindi char tokenizer, IPA G2P, and Unicode Hindi support

a9b6943

Signed-off-by: quanpham <youngkwan199@gmail.com>

Add Hindi chars tokenizer

aa2ab12

Signed-off-by: quanpham <youngkwan199@gmail.com>

hindi grapheme and ipa sets

09dcdc3

Signed-off-by: quanpham <youngkwan199@gmail.com>

remove ipa hindi

38164c8

Signed-off-by: quanpham <youngkwan199@gmail.com>

github-actions Bot added the community-request label Jan 22, 2026

XuesongYang marked this pull request as ready for review January 22, 2026 19:21

XuesongYang self-requested a review January 22, 2026 19:21

XuesongYang previously approved these changes Jan 22, 2026

XuesongYang enabled auto-merge (squash) January 24, 2026 00:26

chtruong814 added the needs-follow-up label Jan 24, 2026

blisc removed the skip-linting, community-request, and needs-follow-up labels Jan 26, 2026

Merge branch 'main' into hindi_char

b8aade3

chtruong814 added Run CICD and removed Run CICD labels Jan 26, 2026

Add simple docstrings for helper funcs

dbab8d3

Signed-off-by: Jason <jasoli@nvidia.com>

blisc dismissed XuesongYang’s stale review via dbab8d3 January 26, 2026 15:52

chtruong814 added Run CICD and removed Run CICD labels Jan 26, 2026

blisc approved these changes Jan 26, 2026

chtruong814 temporarily deployed to test January 26, 2026 15:54 — with GitHub Actions Inactive

chtruong814 added the needs-follow-up label Jan 26, 2026

blisc added Run CICD and removed Run CICD labels Jan 26, 2026

blisc temporarily deployed to test January 26, 2026 17:54 — with GitHub Actions Inactive

chtruong814 disabled auto-merge January 26, 2026 23:05

chtruong814 merged commit 119593e into NVIDIA-NeMo:main Jan 26, 2026
514 of 531 checks passed

github-actions Bot added the community-request label Jan 26, 2026

chtruong814 removed the needs-follow-up label Jan 26, 2026

chtruong814 merged 13 commits into NVIDIA-NeMo:main from quapham:hindi_char
6 participants