
Urdu TTS G2P support to NeMo #15446

Open
mwzkhalil wants to merge 3 commits into NVIDIA-NeMo:main from mwzkhalil:feature/urdu-ipa-g2p

Conversation

@mwzkhalil

[Feature/PR] Add Urdu (ur-PK) IPA G2P support to NeMo TTS

#15445

Module: nemo/collections/tts/g2p/
Type: New language support


Summary

This PR adds UrduIpaG2p — a Grapheme-to-Phoneme module for Urdu (ur-PK)
that converts Urdu text written in Nastaliq/Naskh script into IPA phoneme
sequences. It follows the exact same design pattern as the existing
EnglishG2p (en_us_arpabet.py) and ChineseG2p (zh_cn_pinyin.py)
modules, subclassing BaseG2p directly.


Motivation

Urdu is spoken by approximately 230 million people worldwide and is the
national language of Pakistan. Despite this, NeMo currently has no G2P support
for Urdu, making it impossible to train TTS models for Urdu using the standard
NeMo pipeline.

This contribution provides:

  • A complete, tested UrduIpaG2p class (nemo/collections/tts/g2p/models/ur_pk_ipa.py)
  • A pronunciation dictionary of ~470,000 Urdu word/phrase → IPA entries in JSON format
  • Full feature parity with existing NeMo G2P modules

Urdu Script Notes

Property | Detail
--- | ---
Script | Nastaliq / Naskh (Arabic-based, RTL)
Unicode range | U+0600–U+06FF (core, including Urdu-specific letters ڈ ڑ ھ ے ں), U+0750–U+077F (Arabic Supplement)
Letter case | None (no uppercase/lowercase distinction)
Phoneme set | IPA (Urdu-specific phones: ɦ, ʔ, ɖ, ɽ, t̪, d̪, etc.)
Word boundary | Whitespace-delimited (after NFC normalisation)
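
The script properties above can be illustrated with a small sketch (the helper name and regex are illustrative, not the PR's actual code): tokens are NFC-normalised and filtered to runs of characters in the two Arabic script blocks listed.

```python
import re
import unicodedata

# Hypothetical helper (not part of the PR): matches runs of characters in
# the core Arabic block (U+0600-U+06FF) plus the Arabic Supplement
# block (U+0750-U+077F).
URDU_RUN = re.compile(r"[\u0600-\u06FF\u0750-\u077F]+")

def urdu_tokens(text):
    """Whitespace-tokenise NFC-normalised text, keeping only Urdu-script tokens."""
    text = unicodedata.normalize("NFC", text)
    return [tok for tok in text.split() if URDU_RUN.fullmatch(tok)]
```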

Files Changed

nemo/collections/tts/g2p/models/ur_pk_ipa.py     ← new file
nemo/collections/tts/g2p/modules.py               ← add UrduIpaG2p import
scripts/tts_dataset_files/urdu_ipa_dict.json      ← ~470k entries
tests/collections/tts/g2p/test_ur_pk_ipa.py       ← new tests

Implementation Details

Dictionary format (JSON):

{
  "غیر حاضری":    "ɣɛːr hɑːzriː",
  "شوکت خانم ليب": "ʃoːˈkət̪ xɑːˈnəm leːb",
  "مختار احمد":   "mʊxˈt̪aːr ˈæhməd"
}

Keys are single words or multi-word phrases; values are space-separated IPA.
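A loading step for this format might look like the following sketch (the function name is illustrative, not the PR's actual code): keys and values are NFC-normalised at load time, and each IPA value is split on whitespace into a phone sequence.

```python
import json
import unicodedata

def load_urdu_phoneme_dict(path):
    # Illustrative loader (hypothetical name): NFC-normalise keys so that
    # lookups succeed regardless of how the input text was composed, and
    # split each space-separated IPA value into a list of phones.
    with open(path, "r", encoding="utf-8") as f:
        raw = json.load(f)
    return {
        unicodedata.normalize("NFC", key): unicodedata.normalize("NFC", value).split()
        for key, value in raw.items()
    }
```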

Key design decisions:

  1. Subclasses BaseG2p directly (not IpaG2p) — IpaG2p.__init__
    unconditionally calls set_grapheme_case(), which is meaningless for
    Urdu script (no letter case) and raises ValueError with case=None.
    This mirrors the approach taken by EnglishG2p and ChineseG2p.

  2. Longest-phrase-first matching — __call__ tries up to max_phrase_len
    (default 4) consecutive tokens as a phrase key before falling back to
    single-word lookup, enabling correct handling of named entities and
    compound words that span multiple tokens.

  3. NFC normalisation — both input text and dictionary keys are
    NFC-normalised at load and inference time, ensuring consistent lookup
    regardless of how the Urdu text was composed.

  4. Full feature parity with existing G2P modules:

    • heteronyms support
    • apply_to_oov_word fallback
    • use_stresses toggle
    • phoneme_probability for mixed grapheme/phoneme training
    • Hyphenated OOV word splitting
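
Decisions 2 and 3 can be sketched together as follows. This is illustrative only — the PR's actual __call__ additionally handles heteronyms, stress stripping, and phoneme_probability:

```python
import unicodedata

def phrase_first_lookup(text, phoneme_dict, max_phrase_len=4):
    """Sketch of longest-phrase-first matching over an NFC-normalised input.

    Tries up to max_phrase_len consecutive tokens as a single dictionary
    key, shrinking the window until a match is found, then falls back to
    keeping the word as graphemes (a stand-in for the OOV handling).
    """
    words = unicodedata.normalize("NFC", text).split()
    prons, i = [], 0
    while i < len(words):
        for span in range(min(max_phrase_len, len(words) - i), 0, -1):
            phrase = " ".join(words[i : i + span])
            if phrase in phoneme_dict:
                prons.extend(phoneme_dict[phrase])
                i += span
                break
        else:
            # No dictionary hit at any span length: keep the raw token.
            prons.append(words[i])
            i += 1
    return prons
```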

Usage:

from nemo.collections.tts.g2p.models.ur_pk_ipa import UrduIpaG2p

g2p = UrduIpaG2p(phoneme_dict="scripts/tts_dataset_files/urdu_ipa_dict.json")

g2p("غیر حاضری")
# -> ['ɣɛːr', 'hɑːzriː']

g2p("شوکت خانم ليب")
# -> ['ʃoːˈkət̪', 'xɑːˈnəm', 'leːb']

Pronunciation Dictionary

  • Size: ~470,000 entries
  • Coverage: single words, named entities, multi-word phrases, abbreviations
  • Source: Collected and IPA-transcribed for Urdu TTS research
  • Format: UTF-8 JSON, NFC-normalised

Testing

python3 -m pytest tests/collections/tts/g2p/test_ur_pk_ipa.py -v

Checklist

  • Follows existing NeMo G2P module patterns (EnglishG2p, ChineseG2p)
  • Subclasses BaseG2p directly
  • Full docstrings (module, class, all methods)
  • Registered in modules.py
  • ~470k entry pronunciation dictionary included
  • Unit tests
  • Config YAML example (happy to add if requested)
  • Integration with IPATokenizer for FastPitch/VITS training

Related Issues / References


I am happy to address review feedback, add a YAML config example, or extend
the dictionary coverage. Thank you for considering this contribution!

@github-actions github-actions bot added the TTS label Feb 26, 2026
@mwzkhalil mwzkhalil force-pushed the feature/urdu-ipa-g2p branch from e922e0e to cecd3e2 on February 26, 2026 10:15
Signed-off-by: Mahwiz Khalil <khalilmahwiz@gmail.com>
@mwzkhalil mwzkhalil force-pushed the feature/urdu-ipa-g2p branch from cecd3e2 to 428fad6 on February 26, 2026 10:16
mwzkhalil and others added 2 commits February 26, 2026 10:16
Signed-off-by: mwzkhalil <mwzkhalil@users.noreply.github.com>
Contributor

Copilot AI left a comment


Pull request overview

Adds Urdu (ur-PK) IPA grapheme-to-phoneme support to NeMo TTS by introducing a new UrduIpaG2p model, wiring it into the legacy modules.py import path, and adding unit tests to validate tokenization, dictionary loading, and inference behavior.

Changes:

  • Add UrduIpaG2p (dictionary-backed Urdu→IPA G2P with NFC normalization and phrase matching).
  • Export UrduIpaG2p via nemo/collections/tts/g2p/modules.py for backward-compatible imports.
  • Add unit tests covering tokenizer behavior, init/load paths, single-word lookup, phrase lookup, and NFC normalization.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 7 comments.

File | Description
--- | ---
nemo/collections/tts/g2p/models/ur_pk_ipa.py | New Urdu IPA G2P implementation (tokenizer, dict parsing, phrase matching, inference).
nemo/collections/tts/g2p/modules.py | Adds import/export for UrduIpaG2p in the backward-compat module.
tests/collections/tts/g2p/test_ur_pk_ipa.py | New unit tests for Urdu tokenizer + G2P behavior.




class TestUrduWordTokenize:
def test_pure_urdu_tokens_are_not_unchanged(self):

Copilot AI Apr 20, 2026


Test name reads as a double-negative (are_not_unchanged) even though the assertion checks that Urdu tokens are marked as changeable (False). Consider renaming the test to something clearer (e.g., test_pure_urdu_tokens_are_changeable).

Suggested change
def test_pure_urdu_tokens_are_not_unchanged(self):
def test_pure_urdu_tokens_are_changeable(self):

Comment on lines +18 to +19
import os
import tempfile

Copilot AI Apr 20, 2026


os and tempfile are imported but never used, which will trigger flake8 F401 on this test file. Please remove the unused imports (or use them if intended).

Suggested change
import os
import tempfile

import re
import unicodedata
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple, Union

Copilot AI Apr 20, 2026


Several names imported from typing are unused (e.g., Callable, Dict, List, Optional, Tuple, Union), which will trigger flake8 F401. Either remove the unused imports or add type annotations that use them.

Suggested change
from typing import Callable, Dict, List, Optional, Tuple, Union

Comment on lines +1 to +5
import json
import pathlib
import random
import re
import unicodedata

Copilot AI Apr 20, 2026


This new module is missing the standard NeMo/NVIDIA Apache-2.0 copyright + license header that other G2P model modules include (e.g., en_us_arpabet.py, zh_cn_pinyin.py, i18n_ipa.py). Please add the header at the top of the file for consistency and licensing compliance.

Comment on lines +55 to +57
self.heteronyms = set(line.rstrip() for line in f)
elif isinstance(heteronyms, list):
self.heteronyms = set(heteronyms)

Copilot AI Apr 20, 2026


heteronyms are loaded verbatim but not NFC-normalized, while text and dictionary keys are normalized with NFC. This can cause heteronym matching to fail for canonically-equivalent Unicode spellings. Consider normalizing heteronyms with unicodedata.normalize("NFC", ...) when loading (and skipping empty lines).

Suggested change
self.heteronyms = set(line.rstrip() for line in f)
elif isinstance(heteronyms, list):
self.heteronyms = set(heteronyms)
self.heteronyms = {
unicodedata.normalize("NFC", line.strip()) for line in f if line.strip()
}
elif isinstance(heteronyms, list):
self.heteronyms = {
unicodedata.normalize("NFC", heteronym.strip()) for heteronym in heteronyms if heteronym.strip()
}

Comment on lines +120 to +124
phrase = " ".join(words[i : i + phrase_len])
if phrase in self.phoneme_dict and (
not self.ignore_ambiguous_words or self.is_unique_in_phoneme_dict(phrase)
):
prons.extend(self.phoneme_dict[phrase][0])

Copilot AI Apr 20, 2026


phoneme_probability is only applied in parse_one_word(), but multi-word phrase matches bypass parse_one_word() and are always converted to phonemes. If phoneme_probability is intended to control grapheme/phoneme mixing for all dictionary-backed conversions, you should apply the probability check before accepting a phrase match (e.g., treat the whole phrase as graphemes when the random draw chooses graphemes).

Suggested change
phrase = " ".join(words[i : i + phrase_len])
if phrase in self.phoneme_dict and (
not self.ignore_ambiguous_words or self.is_unique_in_phoneme_dict(phrase)
):
prons.extend(self.phoneme_dict[phrase][0])
phrase_words = words[i : i + phrase_len]
phrase = " ".join(phrase_words)
if phrase in self.phoneme_dict and (
not self.ignore_ambiguous_words or self.is_unique_in_phoneme_dict(phrase)
):
if self.phoneme_probability is not None and random.random() > self.phoneme_probability:
prons.extend(phrase_words)
else:
prons.extend(self.phoneme_dict[phrase][0])

Comment on lines +69 to +75
def _parse_urdu_json_dict(phoneme_dict, use_stresses, stress_symbols):
if isinstance(phoneme_dict, (str, pathlib.Path)):
with open(phoneme_dict, "r", encoding="utf-8") as f:
raw = json.load(f)
else:
raw = phoneme_dict
result = defaultdict(list)

Copilot AI Apr 20, 2026


If phoneme_dict is None (or not a mapping / path), _parse_urdu_json_dict() will fail with an AttributeError when calling .items(). Please add an explicit validation with a clear error message (similar to other G2P modules that assert/raise when phoneme_dict is missing).

