Signed-off-by: Mahwiz Khalil <khalilmahwiz@gmail.com>
Pull request overview
Adds Urdu (ur-PK) IPA grapheme-to-phoneme support to NeMo TTS by introducing a new UrduIpaG2p model, wiring it into the legacy modules.py import path, and adding unit tests to validate tokenization, dictionary loading, and inference behavior.
Changes:
- Add `UrduIpaG2p` (dictionary-backed Urdu→IPA G2P with NFC normalization and phrase matching).
- Export `UrduIpaG2p` via `nemo/collections/tts/g2p/modules.py` for backward-compatible imports.
- Add unit tests covering tokenizer behavior, init/load paths, single-word lookup, phrase lookup, and NFC normalization.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `nemo/collections/tts/g2p/models/ur_pk_ipa.py` | New Urdu IPA G2P implementation (tokenizer, dict parsing, phrase matching, inference). |
| `nemo/collections/tts/g2p/modules.py` | Adds import/export for `UrduIpaG2p` in the backward-compat module. |
| `tests/collections/tts/g2p/test_ur_pk_ipa.py` | New unit tests for Urdu tokenizer + G2P behavior. |
```python
class TestUrduWordTokenize:
    def test_pure_urdu_tokens_are_not_unchanged(self):
```
Test name reads as a double-negative (are_not_unchanged) even though the assertion checks that Urdu tokens are marked as changeable (False). Consider renaming the test to something clearer (e.g., test_pure_urdu_tokens_are_changeable).
```diff
-    def test_pure_urdu_tokens_are_not_unchanged(self):
+    def test_pure_urdu_tokens_are_changeable(self):
```
```python
import os
import tempfile
```
os and tempfile are imported but never used, which will trigger flake8 F401 on this test file. Please remove the unused imports (or use them if intended).
```diff
-import os
-import tempfile
```
```python
import re
import unicodedata
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple, Union
```
Several names imported from typing are unused (e.g., Callable, Dict, List, Optional, Tuple, Union), which will trigger flake8 F401. Either remove the unused imports or add type annotations that use them.
```diff
-from typing import Callable, Dict, List, Optional, Tuple, Union
```
```python
import json
import pathlib
import random
import re
import unicodedata
```
This new module is missing the standard NeMo/NVIDIA Apache-2.0 copyright + license header that other G2P model modules include (e.g., en_us_arpabet.py, zh_cn_pinyin.py, i18n_ipa.py). Please add the header at the top of the file for consistency and licensing compliance.
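For reference, the standard Apache-2.0 header used across NeMo source files looks like this (the year is a placeholder and should match the contribution date):

```python
# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
```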
```python
        self.heteronyms = set(line.rstrip() for line in f)
elif isinstance(heteronyms, list):
    self.heteronyms = set(heteronyms)
```
heteronyms are loaded verbatim but not NFC-normalized, while text and dictionary keys are normalized with NFC. This can cause heteronym matching to fail for canonically-equivalent Unicode spellings. Consider normalizing heteronyms with unicodedata.normalize("NFC", ...) when loading (and skipping empty lines).
```diff
-        self.heteronyms = set(line.rstrip() for line in f)
-elif isinstance(heteronyms, list):
-    self.heteronyms = set(heteronyms)
+        self.heteronyms = {
+            unicodedata.normalize("NFC", line.strip()) for line in f if line.strip()
+        }
+elif isinstance(heteronyms, list):
+    self.heteronyms = {
+        unicodedata.normalize("NFC", heteronym.strip()) for heteronym in heteronyms if heteronym.strip()
+    }
```
```python
phrase = " ".join(words[i : i + phrase_len])
if phrase in self.phoneme_dict and (
    not self.ignore_ambiguous_words or self.is_unique_in_phoneme_dict(phrase)
):
    prons.extend(self.phoneme_dict[phrase][0])
```
phoneme_probability is only applied in parse_one_word(), but multi-word phrase matches bypass parse_one_word() and are always converted to phonemes. If phoneme_probability is intended to control grapheme/phoneme mixing for all dictionary-backed conversions, you should apply the probability check before accepting a phrase match (e.g., treat the whole phrase as graphemes when the random draw chooses graphemes).
```diff
-phrase = " ".join(words[i : i + phrase_len])
-if phrase in self.phoneme_dict and (
-    not self.ignore_ambiguous_words or self.is_unique_in_phoneme_dict(phrase)
-):
-    prons.extend(self.phoneme_dict[phrase][0])
+phrase_words = words[i : i + phrase_len]
+phrase = " ".join(phrase_words)
+if phrase in self.phoneme_dict and (
+    not self.ignore_ambiguous_words or self.is_unique_in_phoneme_dict(phrase)
+):
+    if self.phoneme_probability is not None and random.random() > self.phoneme_probability:
+        prons.extend(phrase_words)
+    else:
+        prons.extend(self.phoneme_dict[phrase][0])
```
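To make the intended behavior concrete in isolation, here is a minimal sketch of the grapheme/phoneme coin flip. The helper name and the injectable `rng` parameter are illustrative, not part of the PR (which calls `random.random()` directly):

```python
import random

def maybe_phonemes(word, phonemes, phoneme_probability, rng=random.random):
    """With probability phoneme_probability, emit the dictionary phonemes;
    otherwise keep the word as graphemes (for mixed-representation training)."""
    if phoneme_probability is not None and rng() > phoneme_probability:
        return list(word)      # grapheme branch: keep the word as characters
    return list(phonemes)      # phoneme branch: use the dictionary pronunciation
```

The same draw should gate both single-word and multi-word phrase matches, which is the point of the suggested change above.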
```python
def _parse_urdu_json_dict(phoneme_dict, use_stresses, stress_symbols):
    if isinstance(phoneme_dict, (str, pathlib.Path)):
        with open(phoneme_dict, "r", encoding="utf-8") as f:
            raw = json.load(f)
    else:
        raw = phoneme_dict
    result = defaultdict(list)
```
If phoneme_dict is None (or not a mapping / path), _parse_urdu_json_dict() will fail with an AttributeError when calling .items(). Please add an explicit validation with a clear error message (similar to other G2P modules that assert/raise when phoneme_dict is missing).
[Feature/PR] Add Urdu (ur-PK) IPA G2P support to NeMo TTS
#15445
Module: `nemo/collections/tts/g2p/`
Type: New language support
Summary
This PR adds `UrduIpaG2p`, a grapheme-to-phoneme module for Urdu (ur-PK) that converts Urdu text written in Nastaliq/Naskh script into IPA phoneme sequences. It follows the same design pattern as the existing `EnglishG2p` (`en_us_arpabet.py`) and `ChineseG2p` (`zh_cn_pinyin.py`) modules, subclassing `BaseG2p` directly.

Motivation
Urdu is spoken by approximately 230 million people worldwide and is the
national language of Pakistan. Despite this, NeMo currently has no G2P support
for Urdu, making it impossible to train TTS models for Urdu using the standard
NeMo pipeline.
This contribution provides:
- `UrduIpaG2p` class (`nemo/collections/tts/g2p/models/ur_pk_ipa.py`)

Urdu Script Notes
Files Changed
Implementation Details
Dictionary format (JSON):
```json
{
  "غیر حاضری": "ɣɛːr hɑːzriː",
  "شوکت خانم ليب": "ʃoːˈkət̪ xɑːˈnəm leːb",
  "مختار احمد": "mʊxˈt̪aːr ˈæhməd"
}
```

Keys are single words or multi-word phrases; values are space-separated IPA.
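A minimal sketch of how such a file can be parsed into a `{key: [pronunciation_list]}` mapping (the function name and exact output shape are assumptions, not the PR's code):

```python
import unicodedata
from collections import defaultdict

def load_urdu_phoneme_dict(raw):
    """Parse {word_or_phrase: "space-separated IPA"} into {key: [[phonemes]]}.

    Keys are NFC-normalized so later lookups do not depend on how the Urdu
    text was composed; each value becomes a list of pronunciations.
    """
    result = defaultdict(list)
    for key, ipa in raw.items():
        key = unicodedata.normalize("NFC", key.strip())
        result[key].append(ipa.split())
    return dict(result)
```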
Key design decisions:
**Subclasses `BaseG2p` directly (not `IpaG2p`):** `IpaG2p.__init__` unconditionally calls `set_grapheme_case()`, which is meaningless for Urdu script (no letter case) and raises `ValueError` with `case=None`. This mirrors the approach taken by `EnglishG2p` and `ChineseG2p`.

**Longest-phrase-first matching:** `__call__` tries up to `max_phrase_len` (default 4) consecutive tokens as a phrase key before falling back to single-word lookup, enabling correct handling of named entities and compound words that span multiple tokens.
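The longest-phrase-first loop can be sketched as follows (simplified: no ambiguity filtering or `phoneme_probability`, and plain lists as pronunciations):

```python
def phrases_to_prons(words, phoneme_dict, max_phrase_len=4):
    """Greedy longest-match lookup over a token list (illustrative sketch)."""
    prons = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, shrinking down to one word.
        for phrase_len in range(min(max_phrase_len, len(words) - i), 0, -1):
            phrase = " ".join(words[i : i + phrase_len])
            if phrase in phoneme_dict:
                prons.extend(phoneme_dict[phrase])
                i += phrase_len
                break
        else:
            prons.append(words[i])  # OOV: keep the word as graphemes
            i += 1
    return prons
```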
**NFC normalisation:** both input text and dictionary keys are NFC-normalised at load and inference time, ensuring consistent lookup regardless of how the Urdu text was composed.
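For example, an alef madda typed as a base alef plus a combining madda is canonically equivalent to the precomposed character; NFC makes the two spellings compare equal:

```python
import unicodedata

decomposed = "\u0627\u0653"  # ARABIC LETTER ALEF + COMBINING MADDAH ABOVE
precomposed = "\u0622"       # ARABIC LETTER ALEF WITH MADDA ABOVE (آ)

assert decomposed != precomposed                                  # raw code points differ
assert unicodedata.normalize("NFC", decomposed) == precomposed    # equal after NFC
```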
**Full feature parity with existing G2P modules:**
- `heteronyms` support
- `apply_to_oov_word` fallback
- `use_stresses` toggle
- `phoneme_probability` for mixed grapheme/phoneme training

Usage:
Pronunciation Dictionary
Testing
Checklist
- Follows the same design pattern as existing G2P modules (`EnglishG2p`, `ChineseG2p`)
- Subclasses `BaseG2p` directly
- Exported via `modules.py`
- Compatible with `IPATokenizer` for FastPitch/VITS training

Related Issues / References
I am happy to address review feedback, add a YAML config example, or extend
the dictionary coverage. Thank you for considering this contribution!