Narrowing type hints, update docstring #1344
bact commented on Mar 19, 2026
- Passed code styles and structures
- Passed code linting checks and unit test
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Pull request overview
This PR focuses on tightening type hints and docstrings across PyThaiNLP modules, largely by replacing # type: ignore with explicit typing constructs (cast, NDArray[...], narrowed generics) and adjusting a few global model singletons to cache-based patterns.
Changes:
- Replace many `# type: ignore` return annotations with `cast(...)` and more precise types (including `numpy.typing.NDArray`).
- Refactor some global model singletons into keyed caches (e.g., by device/model) to avoid cross-call state issues.
- Update docstrings and tooling configuration (notably `pyproject.toml` mypy/tox formatting changes).
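As a rough illustration of the first item, an untyped third-party result can be narrowed with `typing.cast` instead of silencing the checker. The function names below are hypothetical and only stand in for the kind of untyped call the PR deals with.

```python
from typing import Any, cast


def third_party_tokenize(text: str) -> Any:
    """Hypothetical untyped library call; the checker sees only Any."""
    return text.split()


def tokenize(text: str) -> list[str]:
    # Before: `return third_party_tokenize(text)  # type: ignore`
    # cast() records the expected type for the checker. It performs no
    # runtime conversion or validation, so the claim must actually hold.
    return cast(list[str], third_party_tokenize(text))
```

Unlike a blanket `# type: ignore`, the cast states which type is assumed, so other errors on the same line are still reported.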
Reviewed changes
Copilot reviewed 57 out of 57 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| pythainlp/wsd/core.py | Refactor WSD model handling to a device-keyed cache; tighten typing and return conversions. |
| pythainlp/word_vector/core.py | Replace ignores with casts; clarify sentence_vectorizer return type and dtype. |
| pythainlp/wangchanberta/core.py | Replace ignore with cast for tokenizer output typing. |
| pythainlp/util/thai.py | Reformat constant map; narrow return types for Thai character analysis helpers. |
| pythainlp/util/pronounce.py | Formatting-only readability change in comprehension. |
| pythainlp/util/numtoword.py | Broaden bahttext parameter type and update docstring typing. |
| pythainlp/util/keywords.py | Narrow rank() return type and clarify docstring contract. |
| pythainlp/util/collate.py | Type collate() input as Iterable[str] and align docstring types. |
| pythainlp/util/abbreviation.py | Replace return ignore with cast for optional-score tuples. |
| pythainlp/ulmfit/core.py | Tighten rule/token types, NDArray returns, and float32 conversion behavior. |
| pythainlp/transliterate/wunsen.py | Replace ignore with cast for transliteration output. |
| pythainlp/transliterate/w2p.py | Narrow internal NDArray typing; add docstrings for numeric helpers. |
| pythainlp/transliterate/umt5_thaig2p.py | Replace ignore with cast and structured pipeline output typing. |
| pythainlp/transliterate/tltk.py | Replace ignores with casts for third-party outputs; minor refactor. |
| pythainlp/transliterate/thaig2p_v2.py | Replace ignore with cast and structured pipeline output typing. |
| pythainlp/transliterate/thaig2p.py | Add type: ignore[misc] on torch modules and narrow NDArray generics. |
| pythainlp/transliterate/thai2rom_onnx.py | Improve IO encoding, NDArray typing, and fix array end-token comparison. |
| pythainlp/transliterate/thai2rom.py | Add type: ignore[misc] on torch module classes. |
| pythainlp/transliterate/lookup.py | Replace return ignores with casts for typed optionals/strings. |
| pythainlp/transliterate/ipa.py | Replace ignores with casts for epitran outputs. |
| pythainlp/translate/word2word_translate.py | Replace ignore with cast for optional list return. |
| pythainlp/translate/tokenization_small100.py | Replace ignores with casts; narrow state/vocab typing. |
| pythainlp/tokenize/newmm.py | Add type parameters to defaultdict graph used in BFS. |
| pythainlp/tokenize/han_solo.py | Narrow featurizer return types to list[Any] payloads. |
| pythainlp/tokenize/core.py | Replace return ignore with cast for paragraph tokenization typing. |
| pythainlp/tag/wangchanberta_onnx.py | Tighten NDArray typing, providers defaulting, and SentencePiece API usage. |
| pythainlp/tag/tltk.py | Replace ignore with cast for POS tagging output typing. |
| pythainlp/tag/thainer.py | Add explicit feature dict typing for NER feature extraction. |
| pythainlp/tag/thai_nner.py | Narrow entity dict typing and expand docstrings/exception chaining. |
| pythainlp/tag/crfchunk.py | Add a blank line for module formatting consistency. |
| pythainlp/tag/chunk.py | Add a blank line for module formatting consistency. |
| pythainlp/tag/_tag_perceptron.py | Simplify saved JSON payload typing to Any. |
| pythainlp/summarize/keybert.py | Add NDArray typing, float32 conversions, and exception chaining. |
| pythainlp/summarize/freq.py | Narrow ranking/frequency typing and adjust ranking implementation. |
| pythainlp/spell/words_spelling_correction.py | Replace import checks with import_module, add docstrings, and add cache. |
| pythainlp/spell/wanchanberta_thai_grammarly.py | Add type: ignore[misc] on torch module class. |
| pythainlp/spell/tltk.py | Replace ignore with cast for spell candidate outputs. |
| pythainlp/spell/phunspell.py | Replace ignore with cast for correction output typing. |
| pythainlp/soundex/sound.py | Replace ignore with cast for panphon output typing. |
| pythainlp/soundex/complete_soundex.py | Narrow internal helper signatures and return tuple typing. |
| pythainlp/phayathaibert/core.py | Tighten callable typing and replace ignore with cast for tokenizer output. |
| pythainlp/parse/ud_goeswith.py | Rename intermediate variables and reformat vectorized operations for clarity. |
| pythainlp/generate/wangchanglm.py | Narrow regex pattern typing and replace ignore with cast for decode. |
| pythainlp/el/core.py | Tighten return types for entity linking results and fix docstring param name. |
| pythainlp/el/_multiel.py | Tighten EL output typing and replace ignore with cast. |
| pythainlp/corpus/wordnet.py | Replace ignores with casts and align custom_lemmas return behavior. |
| pythainlp/corpus/core.py | Replace ignores with casts and adjust minor formatting. |
| pythainlp/corpus/common.py | Reformat complex condition and dict construction for readability. |
| pythainlp/coref/core.py | Replace singleton with cache keyed by (model_name, device); tighten return typing. |
| pythainlp/chunk/crfchunk.py | Replace ignore with cast for tagger outputs. |
| `pythainlp/chunk/__init__.py` | Add a blank line for module formatting consistency. |
| pythainlp/benchmarks/word_tokenization.py | Tighten stats typing, improve exception chaining, and type NDArray helpers. |
| pythainlp/augment/wordnet.py | Narrow WordNet augmentation internal list typings. |
| pythainlp/augment/word2vec/core.py | Minor formatting for long call readability. |
| pythainlp/augment/word2vec/bpemb_wv.py | Replace ignore with cast for BPEmb tokenizer output. |
| pythainlp/ancient/currency.py | Narrow return type and fix minor comment formatting. |
| pyproject.toml | Reformat extras/tox entries; enable mypy strictness; expand ignore-missing-imports modules. |

      def bahttext(number: Optional[float]) -> str:
          """Converts a number to Thai text and adds
          a suffix "บาท" (Baht).
          The precision will be fixed at two decimal places (0.00)
          to fit "สตางค์" (Satang) unit.
          This function works similarly to the `BAHTTEXT` function in Microsoft Excel.

    -     :param float number: number to be converted into Thai Baht currency format
    +     :param Optional[float] number: number to be converted into Thai Baht
    +         currency format
          :return: text representing the amount of money in the format

The bahttext() signature now accepts Optional[float], but when number is None the function returns an empty string. That’s a behavioral change from the previous contract and isn’t documented in the docstring; consider either rejecting None (raise) or explicitly documenting the None -> "" behavior to avoid surprising callers.
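The two options in this comment might be sketched as below. This is an illustration only: `_convert` is a hypothetical placeholder for the real conversion logic, and the function names are invented to show the two contracts side by side.

```python
from typing import Optional


def _convert(number: float) -> str:
    """Hypothetical placeholder for the actual Thai Baht conversion."""
    return f"{number:.2f}"


def bahttext_reject_none(number: float) -> str:
    """Option 1: keep the narrow contract and fail loudly on None."""
    if number is None:
        raise TypeError("number must be a float, not None")
    return _convert(number)


def bahttext_allow_none(number: Optional[float]) -> str:
    """Option 2: accept None and document the result.

    :param Optional[float] number: amount to convert, or None
    :return: converted text, or an empty string if number is None
    """
    if number is None:
        return ""
    return _convert(number)
```

Either choice makes the `None` case explicit to callers. The second keeps the current behavior but records it in the docstring.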

    strict_optional = true

    warn_no_return = true

Enabling global mypy strict = true can introduce many new type-checking failures across the project and CI (the repo runs mypy pythainlp in PRs). If this PR isn’t also updating the codebase to be strict-clean, consider scoping strictness via per-module overrides or enabling the stricter flags incrementally.
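One way to scope strictness per module is mypy's `overrides` tables in `pyproject.toml`. The fragment below is a sketch under assumptions: the module patterns are picked only for illustration and the flag selection is not taken from this PR. Note that `strict` itself is a global-only option, so per-module sections list individual flags instead.

```toml
[tool.mypy]
# Keep the global baseline moderate.
strict_optional = true
warn_no_return = true

# Opt selected packages into stricter checks first (illustrative patterns).
[[tool.mypy.overrides]]
module = ["pythainlp.util.*", "pythainlp.summarize.*"]
disallow_untyped_defs = true
disallow_any_generics = true
warn_return_any = true
```

More module patterns can be added to the override list as each package becomes clean, until the stricter flags can move to the global section.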

      def __compute_frequencies(
          self, word_tokenized_sents: list[list[str]]
    - ) -> defaultdict:
    + ) -> defaultdict[str, float]:
          word_freqs: defaultdict[str, float] = defaultdict(int)
          for sent in word_tokenized_sents:

__compute_frequencies() declares defaultdict[str, float] but initializes it with defaultdict(int), which produces int defaults and can break type-checking (especially with mypy strict). Consider using a float default factory (e.g., defaultdict(float)) or otherwise ensuring the declared value type matches the factory.
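A minimal sketch of the suggested fix, written as a standalone function rather than the private method in the PR (the loop body is assumed, since the diff excerpt stops at the `for` line):

```python
from collections import defaultdict


def compute_frequencies(
    word_tokenized_sents: list[list[str]],
) -> defaultdict[str, float]:
    # float() returns 0.0, so missing keys default to a float and the
    # factory matches the declared value type.
    word_freqs: defaultdict[str, float] = defaultdict(float)
    for sent in word_tokenized_sents:
        for word in sent:
            word_freqs[word] += 1
    return word_freqs
```

With `defaultdict(float)`, every stored value is a `float`, including the default created on first access to a key.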
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>