
Urdu TTS G2P support to NeMo #15446

Open
mwzkhalil wants to merge 3 commits into NVIDIA-NeMo:main from mwzkhalil:feature/urdu-ipa-g2p

Conversation

@mwzkhalil

[Feature/PR] Add Urdu (ur-PK) IPA G2P support to NeMo TTS

#15445

Module: nemo/collections/tts/g2p/
Type: New language support


Summary

This PR adds UrduIpaG2p — a Grapheme-to-Phoneme module for Urdu (ur-PK)
that converts Urdu text written in Nastaliq/Naskh script into IPA phoneme
sequences. It follows the exact same design pattern as the existing
EnglishG2p (en_us_arpabet.py) and ChineseG2p (zh_cn_pinyin.py)
modules, subclassing BaseG2p directly.


Motivation

Urdu is spoken by approximately 230 million people worldwide and is the
national language of Pakistan. Despite this, NeMo currently has no G2P support
for Urdu, making it impossible to train TTS models for Urdu using the standard
NeMo pipeline.

This contribution provides:

  • A complete, tested UrduIpaG2p class (nemo/collections/tts/g2p/models/ur_pk_ipa.py)
  • A pronunciation dictionary of ~470,000 Urdu word/phrase → IPA entries in JSON format
  • Full feature parity with existing NeMo G2P modules

Urdu Script Notes

Property | Detail
--- | ---
Script | Nastaliq / Naskh (Arabic-based, RTL)
Unicode range | U+0600–U+06FF (core, including Urdu-specific letters ڈ ڑ ھ ے ں), U+0750–U+077F (Arabic Supplement)
Letter case | None (no uppercase/lowercase distinction)
Phoneme set | IPA (Urdu-specific phones: ɦ, ʔ, ɖ, ɽ, t̪, d̪, etc.)
Word boundary | Whitespace-delimited (after NFC normalisation)
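
The script properties above can be illustrated with a small sketch (the helper name and regex are illustrative, not the PR's actual code): tokens are NFC-normalised and filtered to runs of characters in the two Arabic script blocks listed.

```python
import re
import unicodedata

# Hypothetical helper (not part of the PR): matches runs of characters in
# the core Arabic block (U+0600-U+06FF) plus the Arabic Supplement
# block (U+0750-U+077F).
URDU_RUN = re.compile(r"[\u0600-\u06FF\u0750-\u077F]+")

def urdu_tokens(text):
    """Whitespace-tokenise NFC-normalised text, keeping only Urdu-script tokens."""
    text = unicodedata.normalize("NFC", text)
    return [tok for tok in text.split() if URDU_RUN.fullmatch(tok)]
```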

Files Changed

nemo/collections/tts/g2p/models/ur_pk_ipa.py     ← new file
nemo/collections/tts/g2p/modules.py               ← add UrduIpaG2p import
scripts/tts_dataset_files/urdu_ipa_dict.json      ← ~470k entries
tests/collections/tts/g2p/test_ur_pk_ipa.py       ← new tests

Implementation Details

Dictionary format (JSON):

{
  "غیر حاضری":    "ɣɛːr hɑːzriː",
  "شوکت خانم ليب": "ʃoːˈkət̪ xɑːˈnəm leːb",
  "مختار احمد":   "mʊxˈt̪aːr ˈæhməd"
}

Keys are single words or multi-word phrases; values are space-separated IPA.
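A loading step for this format might look like the following sketch (the function name is illustrative, not the PR's actual code): keys and values are NFC-normalised at load time, and each IPA value is split on whitespace into a phone sequence.

```python
import json
import unicodedata

def load_urdu_phoneme_dict(path):
    # Illustrative loader (hypothetical name): NFC-normalise keys so that
    # lookups succeed regardless of how the input text was composed, and
    # split each space-separated IPA value into a list of phones.
    with open(path, "r", encoding="utf-8") as f:
        raw = json.load(f)
    return {
        unicodedata.normalize("NFC", key): unicodedata.normalize("NFC", value).split()
        for key, value in raw.items()
    }
```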

Key design decisions:

  1. Subclasses BaseG2p directly (not IpaG2p) — IpaG2p.__init__
    unconditionally calls set_grapheme_case(), which is meaningless for
    Urdu script (no letter case) and raises ValueError with case=None.
    This mirrors the approach taken by EnglishG2p and ChineseG2p.

  2. Longest-phrase-first matching — __call__ tries up to max_phrase_len
    (default 4) consecutive tokens as a phrase key before falling back to
    single-word lookup, enabling correct handling of named entities and
    compound words that span multiple tokens.

  3. NFC normalisation — both input text and dictionary keys are
    NFC-normalised at load and inference time, ensuring consistent lookup
    regardless of how the Urdu text was composed.

  4. Full feature parity with existing G2P modules:

    • heteronyms support
    • apply_to_oov_word fallback
    • use_stresses toggle
    • phoneme_probability for mixed grapheme/phoneme training
    • Hyphenated OOV word splitting
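
Decisions 2 and 3 can be sketched together as follows. This is illustrative only — the PR's actual __call__ additionally handles heteronyms, stress stripping, and phoneme_probability:

```python
import unicodedata

def phrase_first_lookup(text, phoneme_dict, max_phrase_len=4):
    """Sketch of longest-phrase-first matching over an NFC-normalised input.

    Tries up to max_phrase_len consecutive tokens as a single dictionary
    key, shrinking the window until a match is found, then falls back to
    keeping the word as graphemes (a stand-in for the OOV handling).
    """
    words = unicodedata.normalize("NFC", text).split()
    prons, i = [], 0
    while i < len(words):
        for span in range(min(max_phrase_len, len(words) - i), 0, -1):
            phrase = " ".join(words[i : i + span])
            if phrase in phoneme_dict:
                prons.extend(phoneme_dict[phrase])
                i += span
                break
        else:
            # No dictionary hit at any span length: keep the raw token.
            prons.append(words[i])
            i += 1
    return prons
```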

Usage:

from nemo.collections.tts.g2p.models.ur_pk_ipa import UrduIpaG2p

g2p = UrduIpaG2p(phoneme_dict="scripts/tts_dataset_files/urdu_ipa_dict.json")

g2p("غیر حاضری")
# -> ['ɣɛːr', 'hɑːzriː']

g2p("شوکت خانم ليب")
# -> ['ʃoːˈkət̪', 'xɑːˈnəm', 'leːb']

Pronunciation Dictionary

  • Size: ~470,000 entries
  • Coverage: single words, named entities, multi-word phrases, abbreviations
  • Source: Collected and IPA-transcribed for Urdu TTS research
  • Format: UTF-8 JSON, NFC-normalised

Testing

python3 -m pytest tests/collections/tts/g2p/test_ur_pk_ipa.py -v

Checklist

  • Follows existing NeMo G2P module patterns (EnglishG2p, ChineseG2p)
  • Subclasses BaseG2p directly
  • Full docstrings (module, class, all methods)
  • Registered in modules.py
  • ~470k entry pronunciation dictionary included
  • Unit tests
  • Config YAML example (happy to add if requested)
  • Integration with IPATokenizer for FastPitch/VITS training

Related Issues / References


I am happy to address review feedback, add a YAML config example, or extend
the dictionary coverage. Thank you for considering this contribution!

@github-actions github-actions bot added the TTS label Feb 26, 2026
@mwzkhalil mwzkhalil force-pushed the feature/urdu-ipa-g2p branch from e922e0e to cecd3e2 on February 26, 2026 10:15
Signed-off-by: Mahwiz Khalil <khalilmahwiz@gmail.com>
@mwzkhalil mwzkhalil force-pushed the feature/urdu-ipa-g2p branch from cecd3e2 to 428fad6 on February 26, 2026 10:16
mwzkhalil and others added 2 commits February 26, 2026 10:16
Signed-off-by: mwzkhalil <mwzkhalil@users.noreply.github.com>
Contributor

Copilot AI left a comment


Pull request overview

Adds Urdu (ur-PK) IPA grapheme-to-phoneme support to NeMo TTS by introducing a new UrduIpaG2p model, wiring it into the legacy modules.py import path, and adding unit tests to validate tokenization, dictionary loading, and inference behavior.

Changes:

  • Add UrduIpaG2p (dictionary-backed Urdu→IPA G2P with NFC normalization and phrase matching).
  • Export UrduIpaG2p via nemo/collections/tts/g2p/modules.py for backward-compatible imports.
  • Add unit tests covering tokenizer behavior, init/load paths, single-word lookup, phrase lookup, and NFC normalization.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 7 comments.

File | Description
--- | ---
nemo/collections/tts/g2p/models/ur_pk_ipa.py | New Urdu IPA G2P implementation (tokenizer, dict parsing, phrase matching, inference).
nemo/collections/tts/g2p/modules.py | Adds import/export for UrduIpaG2p in the backward-compat module.
tests/collections/tts/g2p/test_ur_pk_ipa.py | New unit tests for Urdu tokenizer + G2P behavior.




class TestUrduWordTokenize:
def test_pure_urdu_tokens_are_not_unchanged(self):

Copilot AI Apr 20, 2026


Test name reads as a double-negative (are_not_unchanged) even though the assertion checks that Urdu tokens are marked as changeable (False). Consider renaming the test to something clearer (e.g., test_pure_urdu_tokens_are_changeable).

Suggested change
def test_pure_urdu_tokens_are_not_unchanged(self):
def test_pure_urdu_tokens_are_changeable(self):

Comment on lines +18 to +19
import os
import tempfile

Copilot AI Apr 20, 2026


os and tempfile are imported but never used, which will trigger flake8 F401 on this test file. Please remove the unused imports (or use them if intended).

Suggested change
import os
import tempfile

import re
import unicodedata
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple, Union

Copilot AI Apr 20, 2026


Several names imported from typing are unused (e.g., Callable, Dict, List, Optional, Tuple, Union), which will trigger flake8 F401. Either remove the unused imports or add type annotations that use them.

Suggested change
from typing import Callable, Dict, List, Optional, Tuple, Union

Comment on lines +1 to +5
import json
import pathlib
import random
import re
import unicodedata

Copilot AI Apr 20, 2026


This new module is missing the standard NeMo/NVIDIA Apache-2.0 copyright + license header that other G2P model modules include (e.g., en_us_arpabet.py, zh_cn_pinyin.py, i18n_ipa.py). Please add the header at the top of the file for consistency and licensing compliance.

Comment on lines +55 to +57
self.heteronyms = set(line.rstrip() for line in f)
elif isinstance(heteronyms, list):
self.heteronyms = set(heteronyms)

Copilot AI Apr 20, 2026


heteronyms are loaded verbatim but not NFC-normalized, while text and dictionary keys are normalized with NFC. This can cause heteronym matching to fail for canonically-equivalent Unicode spellings. Consider normalizing heteronyms with unicodedata.normalize("NFC", ...) when loading (and skipping empty lines).

Suggested change
self.heteronyms = set(line.rstrip() for line in f)
elif isinstance(heteronyms, list):
self.heteronyms = set(heteronyms)
self.heteronyms = {
unicodedata.normalize("NFC", line.strip()) for line in f if line.strip()
}
elif isinstance(heteronyms, list):
self.heteronyms = {
unicodedata.normalize("NFC", heteronym.strip()) for heteronym in heteronyms if heteronym.strip()
}

Comment on lines +120 to +124
phrase = " ".join(words[i : i + phrase_len])
if phrase in self.phoneme_dict and (
not self.ignore_ambiguous_words or self.is_unique_in_phoneme_dict(phrase)
):
prons.extend(self.phoneme_dict[phrase][0])

Copilot AI Apr 20, 2026


phoneme_probability is only applied in parse_one_word(), but multi-word phrase matches bypass parse_one_word() and are always converted to phonemes. If phoneme_probability is intended to control grapheme/phoneme mixing for all dictionary-backed conversions, you should apply the probability check before accepting a phrase match (e.g., treat the whole phrase as graphemes when the random draw chooses graphemes).

Suggested change
phrase = " ".join(words[i : i + phrase_len])
if phrase in self.phoneme_dict and (
not self.ignore_ambiguous_words or self.is_unique_in_phoneme_dict(phrase)
):
prons.extend(self.phoneme_dict[phrase][0])
phrase_words = words[i : i + phrase_len]
phrase = " ".join(phrase_words)
if phrase in self.phoneme_dict and (
not self.ignore_ambiguous_words or self.is_unique_in_phoneme_dict(phrase)
):
if self.phoneme_probability is not None and random.random() > self.phoneme_probability:
prons.extend(phrase_words)
else:
prons.extend(self.phoneme_dict[phrase][0])

Comment on lines +69 to +75
def _parse_urdu_json_dict(phoneme_dict, use_stresses, stress_symbols):
if isinstance(phoneme_dict, (str, pathlib.Path)):
with open(phoneme_dict, "r", encoding="utf-8") as f:
raw = json.load(f)
else:
raw = phoneme_dict
result = defaultdict(list)

Copilot AI Apr 20, 2026


If phoneme_dict is None (or not a mapping / path), _parse_urdu_json_dict() will fail with an AttributeError when calling .items(). Please add an explicit validation with a clear error message (similar to other G2P modules that assert/raise when phoneme_dict is missing).

