Add Thai-NNER integration with nested entity filtering utilities #1221
- Enhanced thai_nner.py with detailed documentation
- Added get_top_level_entities() helper function to filter nested entities
- Updated NNER class to support top_level_only parameter
- Added comprehensive tests for entity filtering
- Exported get_top_level_entities in tag module

Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>

- Improved time complexity from O(n²) to O(n×m) where m is top-level entities
- Added sorting to process larger spans first for better efficiency
- Fixed package name in error message (thai-nner instead of thai_nner)
- Added early return for empty input

Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>

- Extracted _is_contained_in() helper function for better readability
- Added note explaining empty strings in tokenization output
- Clarified documentation for top_level_only parameter

Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>

Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
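
The filtering described in these commits (sort spans largest first, then drop any entity contained in an already-kept span) could be sketched roughly as follows. This is not the code from the PR: the names `top_level_entities` and `is_contained_in` are illustrative, and the entity shape (a dict with a half-open token `span` and an `entity_type`) is an assumption about Thai-NNER output.

```python
from typing import Any

# Assumed entity shape: {"span": [start, end), "entity_type": str, ...}
Entity = dict[str, Any]


def is_contained_in(inner: list[int], outer: list[int]) -> bool:
    """Return True if span `inner` lies fully inside span `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def top_level_entities(entities: list[Entity]) -> list[Entity]:
    """Keep only entities not nested inside another entity (hypothetical helper)."""
    if not entities:
        return []  # early return for empty input
    # Process larger spans first so a container is always seen before its children.
    ordered = sorted(
        entities,
        key=lambda e: (-(e["span"][1] - e["span"][0]), e["span"][0]),
    )
    top: list[Entity] = []
    for ent in ordered:
        # Compare only against kept entities: O(n x m), m = top-level count.
        # An exact duplicate span counts as contained and is dropped.
        if not any(is_contained_in(ent["span"], kept["span"]) for kept in top):
            top.append(ent)
    # Restore reading order.
    return sorted(top, key=lambda e: e["span"][0])


sample = [
    {"span": [7, 9], "entity_type": "cardinal"},
    {"span": [7, 11], "entity_type": "time"},
    {"span": [9, 11], "entity_type": "unit"},
]
print([e["entity_type"] for e in top_level_entities(sample)])
```

Because each candidate is checked only against entities already kept, the cost depends on how many top-level entities there are, not on the total number of pairs.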
@copilot Add docs to docs/api/tag.rst
- Added autofunction directive for get_top_level_entities in docs/api/tag.rst
- Documentation now includes the helper function for filtering nested entities

Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
@copilot Add to NER class (as an engine) in pythainlp/tag/named_entity.py
- Added "thai-nner" engine to NER class for nested NER support
- Implemented get_ner() method in Thai_NNER class for IOB/HTML format
- Added _entities_to_iob() helper to convert nested entities to IOB format
- Added _entities_to_html() helper to convert nested entities to HTML tags
- Updated NER class documentation with thai-nner engine option
- Added comprehensive tests for IOB and HTML conversion functions
- Updated extra tests to include thai-nner engine in NER test cases

Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
Added in commit 4032cb1. Thai-NNER is now available as an engine option in the NER class:

```python
from pythainlp.tag import NER

ner = NER(engine="thai-nner")
result = ner.tag("วันที่ 5 เมษายน 2565")  # IOB format
result = ner.tag("วันที่ 5 เมษายน 2565", tag=True)  # HTML tags
```

The implementation converts Thai-NNER's nested entity format to standard IOB format, using only top-level entities to avoid overlapping tags.
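
As a rough illustration of the conversion described in this comment, the sketch below turns non-overlapping, top-level entities into IOB tags. It is not the PR's `_entities_to_iob()`: the function name `entities_to_iob`, the half-open token `span`, and the upper-cased label are all assumptions made for the example.

```python
from typing import Any


def entities_to_iob(
    tokens: list[str], entities: list[dict[str, Any]]
) -> list[tuple[str, str]]:
    """Tag tokens with B-/I-/O labels (hypothetical sketch).

    Assumes entities do not overlap; if they did, a later entity
    would overwrite the tags written by an earlier one.
    """
    tags = ["O"] * len(tokens)
    for ent in entities:
        start, end = ent["span"]  # assumed half-open: [start, end)
        label = ent["entity_type"].upper()
        for i in range(start, end):
            prefix = "B-" if i == start else "I-"
            tags[i] = prefix + label
    return list(zip(tokens, tags))


tokens = ["วันที่", " ", "5", " ", "เมษายน"]
entities = [{"span": [2, 5], "entity_type": "date"}]
for token, tag in entities_to_iob(tokens, entities):
    print(repr(token), tag)
```

Restricting the input to top-level entities is what makes a single flat tag per token possible; nested entities would need more than one label for the same token.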
Rebased on dev branch (commit b536a8b). The conflict in tests/extra/testx_tag.py has been resolved by following the new test suite conventions from PR #1248:
The rebase is complete locally, but I need assistance with force-pushing the rebased branch since I don't have direct push access.
@copilot review again
- Removed get_top_level_entities from ThaiNNER module's __all__ export
- Function remains accessible via pythainlp.tag.thai_nner.get_top_level_entities
- Not exported to pythainlp.tag level as it's meant for Thai-NNER context
- Addresses feedback to keep function scoped to thai_nner module

Co-authored-by: bact <128572+bact@users.noreply.github.com>
The overlap assumptions have been documented in both functions:
Both were fixed in commit 51a335a.
@copilot dev branch is updated, with a lot of type hints improvements. Resync with the dev branch.
Synced with dev branch in commit c221d94. Merged 55 commits including comprehensive type hints improvements. The auto-merge in pythainlp/tag/named_entity.py was successful, updating type: ignore comments to more specific directives (type: ignore[no-any-return]). All Thai-NNER integration tests pass.
@copilot dev branch is updated, with a lot of type hints improvements. Resync with the dev branch.
…ilot/add-ner-thai-nner-dataset
# Conflicts:
#	pythainlp/tag/thai_nner.py
Synced with dev branch in commit 03116b3. Merged 12 commits including comprehensive type hints and mypy improvements. Resolved conflict in pythainlp/tag/thai_nner.py by keeping our full implementation with top_level_only parameter, get_ner method, and helper functions. All Thai-NNER core tests pass successfully.
# Conflicts:
#	pythainlp/tag/named_entity.py
#	pythainlp/tag/thai_nner.py
Completed in commit 350d1af:
All imports verified successfully.

What do these changes do
Integrates Thai Nested Named Entity Recognition (Thai-NNER) into PyThaiNLP, adding utilities for filtering nested entity hierarchies and providing Thai-NNER as a standard NER engine option alongside existing engines like thainer and wangchanberta.
What was wrong
The existing Thai-NNER wrapper lacked practical utilities for working with nested entities:
How this fixes it
Core utilities (`pythainlp/tag/thai_nner.py`):
- `ThaiNNER` class with PascalCase naming for consistency
- `get_top_level_entities()` function using O(n×m) sorted span comparison (scoped to thai_nner module)
- `top_level_only` parameter in `ThaiNNER.tag()` for direct filtering
- … `__init__` for safety)
- `__all__` export list for explicit public API declaration

NER engine integration (`pythainlp/tag/named_entity.py`):
- `get_ner()` method provides IOB/HTML format compatibility
- `NER.load_engine()` to check corpus-independent engines first
- `_entities_to_iob()` and `_entities_to_html()` converters with documented non-overlapping entity assumptions

Code quality improvements:
- … `Union` instead of `|`)
- … `Optional[str]` and `dict[str, Any]`

Tests:
- … `tests/core/test_tag.py`): Helper function tests without external dependencies
- … `tests/noauto/testn_tag.py`): Integration tests requiring torch/transformers

Usage examples:
Model information:
Your checklist for this pull request