Add Thai-NNER integration with nested entity filtering utilities #1221

Merged
bact merged 25 commits into dev from copilot/add-ner-thai-nner-dataset
Feb 3, 2026

Contributor

Copilot AI commented Jan 21, 2026

What do these changes do

Integrates Thai Nested Named Entity Recognition (Thai-NNER) into PyThaiNLP, adding utilities for filtering nested entity hierarchies and providing Thai-NNER as a standard NER engine option alongside existing engines like thainer and wangchanberta.

What was wrong

The existing Thai-NNER wrapper lacked practical utilities for working with nested entities:

  • No mechanism to extract only top-level entities from nested hierarchies
  • Only accessible via NNER class with custom output format, inconsistent with other NER engines
  • Missing standard IOB format compatibility used by other engines
  • Class naming used underscores instead of PascalCase convention
  • Missing explicit public API declaration

How this fixes it

Core utilities (pythainlp/tag/thai_nner.py):

  • ThaiNNER class with PascalCase naming for consistency
  • get_top_level_entities() function using an O(n×m) sorted span comparison, where n is the number of entities and m the number of top-level entities (scoped to the thai_nner module)
  • top_level_only parameter in ThaiNNER.tag() for direct filtering
  • Helper functions work without the thai-nner library installed (the library is imported inside __init__ for safety)
  • __all__ export list for explicit public API declaration
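
The top-level filtering described in the list above can be sketched as follows. This is a minimal illustration, not the code merged in this PR: the entity dict shape (a `span` of `[start, end)` token indices plus an `entity_type`) and the names `is_contained_in` and `filter_top_level` are assumptions made for the example.

```python
from typing import Any


def is_contained_in(inner: tuple[int, int], outer: tuple[int, int]) -> bool:
    """Return True if span `inner` lies entirely inside span `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def filter_top_level(entities: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only entities whose span is not nested inside a kept span.

    Hypothetical sketch: assumes each entity has a "span" of [start, end).
    """
    if not entities:
        return []
    # Visit the widest spans first so every possible parent is kept
    # before any of its children is examined.
    ordered = sorted(
        entities,
        key=lambda e: (-(e["span"][1] - e["span"][0]), e["span"][0]),
    )
    kept: list[dict[str, Any]] = []
    for ent in ordered:
        span = (ent["span"][0], ent["span"][1])
        # Compare only against the m entities kept so far: O(n×m) overall.
        if not any(
            is_contained_in(span, (k["span"][0], k["span"][1])) for k in kept
        ):
            kept.append(ent)
    # Restore reading order for the caller.
    return sorted(kept, key=lambda e: e["span"][0])
```

In this sketch, two entities with an identical span count as nested, so only the first one encountered is kept; the merged implementation may treat that case differently.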

NER engine integration (pythainlp/tag/named_entity.py):

  • Added "thai-nner" as engine option in NER class
  • get_ner() method provides IOB/HTML format compatibility
  • Restructured NER.load_engine() to check corpus-independent engines first
  • _entities_to_iob() and _entities_to_html() converters with documented non-overlapping entity assumptions
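
As an illustration of the IOB conversion mentioned in the list above, here is a minimal sketch. It is not the merged `_entities_to_iob()`; it assumes non-overlapping entities whose `span` holds `[start, end)` token indices, and the function name `entities_to_iob_sketch` is hypothetical.

```python
from typing import Any


def entities_to_iob_sketch(
    tokens: list[str], entities: list[dict[str, Any]]
) -> list[tuple[str, str]]:
    """Tag each token with B-/I-/O labels (assumes non-overlapping spans)."""
    tags = ["O"] * len(tokens)
    for ent in entities:
        start, end = ent["span"]
        label = ent["entity_type"]
        # First token of the entity gets B-, the remaining tokens get I-.
        for i in range(start, min(end, len(tokens))):
            tags[i] = ("B-" if i == start else "I-") + label
    return list(zip(tokens, tags))
```

If nested entities were passed in, a later entity would silently overwrite the tags of an earlier one, which is why filtering to top-level entities first matters.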

Code quality improvements:

  • Python 3.9+ compatible type hints (Union instead of |)
  • Proper type annotations including Optional[str] and dict[str, Any]
  • Consistent docstring formatting (lowercase type names matching type hints)
  • Clear error messages for missing dependencies
  • Default parameter resolution at runtime instead of module import time
  • Documented import pattern rationale
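
Two of the points above (Python 3.9 compatible hints and runtime default resolution) can be illustrated with a small sketch. The names `DEFAULT_ENGINE`, `resolve_engine`, and `join_tokens` are hypothetical and do not come from the PR.

```python
from typing import Optional, Union

# Hypothetical module-level setting, used only for this illustration.
DEFAULT_ENGINE = "thai-nner"


def resolve_engine(engine: Optional[str] = None) -> str:
    """Pick the engine name at call time, not at module import time."""
    # Writing `engine: str = DEFAULT_ENGINE` would freeze the value when the
    # function is defined; the None sentinel defers the lookup to each call.
    return DEFAULT_ENGINE if engine is None else engine


def join_tokens(text: Union[str, list[str]]) -> str:
    """Accept raw text or a token list and return a single string."""
    # `str | list[str]` in this annotation would raise TypeError on Python 3.9
    # (without `from __future__ import annotations`); Union works on 3.9.
    return text if isinstance(text, str) else "".join(text)
```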

Tests:

  • Core tests (tests/core/test_tag.py): Helper function tests without external dependencies
  • Noauto tests (tests/noauto/testn_tag.py): Integration tests requiring torch/transformers

Usage examples:

from pythainlp.tag import NNER, NER

# NNER class with top-level filtering
nner = NNER()
tokens, entities = nner.tag("วันที่ 5 เมษายน 2565", top_level_only=True)

# NER class with thai-nner engine (IOB format)
ner = NER(engine="thai-nner")
result = ner.tag("วันที่ 5 เมษายน 2565")  # IOB format
result = ner.tag("วันที่ 5 เมษายน 2565", tag=True)  # HTML tags

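# Advanced: Direct utility access
from pythainlp.tag.thai_nner import ThaiNNER, get_top_level_entities
nner = ThaiNNER()
# Nested entities (assuming top_level_only defaults to False)
tokens, entities = nner.tag("วันที่ 5 เมษายน 2565")
top_entities = get_top_level_entities(entities)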

Model information:

  • PyThaiNLP corpus database (version 1.0), CC-BY-SA 3.0 license
  • Auto-downloads on first use (431 MB)
  • Supports 104 entity types across 8 nesting layers

Your checklist for this pull request

  • Passed code styles and structures
  • Passed code linting checks and unit test
Original prompt

This section details the original issue you should resolve

<issue_title>[TODO]: Add NER from Thai-NNER dataset</issue_title>
<issue_description>From Thai-NNER. The dataset has 4,894 documents.

The dataset is licensed under CC-BY-SA 3.0.

dev.txt
train.txt</issue_description>

<agent_instructions>You can use Thai-NNER and get just the top tag of nner. https://github.com/vistec-AI/Thai-NNER/blob/main/thai_nner/__init__.py#L31</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@wannaphong @pavaris-pm @MpolaarbearM Are you interested in training the model? @wannaphong > I completed the training on the NER model and performed some fine-tuning, but the outcomes were not meeting expectations. I plan to make adjustments to the model in an attempt to enhance its performance. https://huggingface.co/lunarlist/NER_thai_phayathai

The dataset is very hard. I think you can tune some parameters.</comment_new>
<comment_new>@wannaphong
I have rethought this issue. You can actually use Thai-NNER and get just the top tag of NNER. https://github.com/vistec-AI/Thai-NNER/blob/main/thai_nner/__init__.py#L31

There is no need to train any model!</comment_new>



Copilot AI and others added 4 commits January 21, 2026 19:14
- Enhanced thai_nner.py with detailed documentation
- Added get_top_level_entities() helper function to filter nested entities
- Updated NNER class to support top_level_only parameter
- Added comprehensive tests for entity filtering
- Exported get_top_level_entities in tag module

Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
- Improved time complexity from O(n²) to O(n×m), where m is the number of top-level entities
- Added sorting to process larger spans first for better efficiency
- Fixed package name in error message (thai-nner instead of thai_nner)
- Added early return for empty input

Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
- Extracted _is_contained_in() helper function for better readability
- Added note explaining empty strings in tokenization output
- Clarified documentation for top_level_only parameter

Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add NER from Thai-NNER dataset" to "Add Thai-NNER integration with top-level entity filtering" Jan 21, 2026
Copilot AI requested a review from wannaphong January 21, 2026 19:22
@coveralls

coveralls commented Jan 21, 2026

Coverage Status

coverage: 65.726% (+0.2%) from 65.571%
when pulling 350d1af on copilot/add-ner-thai-nner-dataset
into e9b8f3b on dev.

@wannaphong
Member

@copilot Add docs to docs/api/tag.rst

- Added autofunction directive for get_top_level_entities in docs/api/tag.rst
- Documentation now includes the helper function for filtering nested entities

Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
Contributor Author

Copilot AI commented Jan 22, 2026

@copilot Add docs to docs/api/tag.rst

Added in commit 8f7a9ae. The get_top_level_entities function is now documented in docs/api/tag.rst with the autofunction directive.

@wannaphong
Member

wannaphong commented Jan 22, 2026

@copilot Add to NER class (as an engine) in pythainlp/tag/named_entity.py

- Added "thai-nner" engine to NER class for nested NER support
- Implemented get_ner() method in Thai_NNER class for IOB/HTML format
- Added _entities_to_iob() helper to convert nested entities to IOB format
- Added _entities_to_html() helper to convert nested entities to HTML tags
- Updated NER class documentation with thai-nner engine option
- Added comprehensive tests for IOB and HTML conversion functions
- Updated extra tests to include thai-nner engine in NER test cases

Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
Contributor Author

Copilot AI commented Jan 22, 2026

@copilot Add to NER class (as an engine) in pythainlp/tag/named_entity.py

Added in commit 4032cb1. Thai-NNER is now available as an engine option in the NER class:

from pythainlp.tag import NER

ner = NER(engine="thai-nner")
result = ner.tag("วันที่ 5 เมษายน 2565")  # IOB format
result = ner.tag("วันที่ 5 เมษายน 2565", tag=True)  # HTML tags

The implementation converts Thai-NNER's nested entity format to standard IOB format, using only top-level entities to avoid overlapping tags.

Copilot AI changed the title from "Add Thai-NNER integration with top-level entity filtering" to "Add Thai-NNER integration with top-level entity filtering and NER engine support" Jan 22, 2026
bact marked this pull request as ready for review January 23, 2026 02:47
@bact
Member

bact commented Feb 1, 2026

@copilot The dev branch is updated. Please resync with it.

The conflicts come from the "extra" test suite, which got refactored.
Some test cases were relocated to other test suites.

See the test suite refactoring and naming conventions in this merged PR: #1248

Contributor Author

Copilot AI commented Feb 1, 2026

@copilot The dev branch is updated. Please resync with it.

The conflicts come from the "extra" test suite, which got refactored.
...

Rebased on dev branch (commit b536a8b). The conflict in tests/extra/testx_tag.py has been resolved by following the new test suite conventions from PR #1248:

  • Removed tests requiring extra dependencies (NER class with transformers/tltk engines, NNER class, pos_tag_transformers) from testx_tag.py as they were moved to the noautotest suite
  • Kept tests in tests/core/test_tag.py that test helper functions (_entities_to_iob, _entities_to_html, get_top_level_entities) which don't require the actual thai-nner library
  • All our Thai-NNER integration changes remain intact and are now based on the latest dev branch

The rebase is complete locally but I need assistance with force-pushing the rebased branch since I don't have direct push access.

Copilot AI requested a review from bact February 2, 2026 20:43
Copilot stopped work on behalf of bact due to an error February 2, 2026 21:15
Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@bact
Member

bact commented Feb 3, 2026

@copilot review again

- Removed get_top_level_entities from ThaiNNER module's __all__ export
- Function remains accessible via pythainlp.tag.thai_nner.get_top_level_entities
- Not exported to pythainlp.tag level as it's meant for Thai-NNER context
- Addresses feedback to keep function scoped to thai_nner module

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Contributor Author

Copilot AI commented Feb 3, 2026

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

The overlap assumptions have been documented in both functions:

  1. _entities_to_iob() (lines 205-208): Documents that the function assumes non-overlapping entities and recommends using top-level entities
  2. _entities_to_html() (lines 239-241): Documents the same assumption with guidance to use get_top_level_entities() for filtering

Both were fixed in commit 51a335a.
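
To show why the non-overlap assumption matters for HTML output, here is a minimal sketch. It is not the merged `_entities_to_html()`; the `[start, end)` token-index spans, the tag format, and the name `entities_to_html_sketch` are assumptions for illustration.

```python
from typing import Any


def entities_to_html_sketch(
    tokens: list[str], entities: list[dict[str, Any]]
) -> str:
    """Wrap each entity's tokens in <type>...</type> tags.

    Hypothetical sketch: rejects overlapping spans instead of guessing
    how to nest the tags.
    """
    parts: list[str] = []
    cursor = 0  # index of the first token not yet emitted
    for ent in sorted(entities, key=lambda e: e["span"][0]):
        start, end = ent["span"]
        if start < cursor:
            raise ValueError(
                "overlapping entities; filter to top-level entities first"
            )
        label = ent["entity_type"]
        parts.append("".join(tokens[cursor:start]))  # plain text before entity
        parts.append(f"<{label}>{''.join(tokens[start:end])}</{label}>")
        cursor = end
    parts.append("".join(tokens[cursor:]))  # plain text after the last entity
    return "".join(parts)
```

Raising on overlap is a design choice of this sketch; per the comment above, the merged functions document the assumption rather than necessarily enforcing it.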

@bact
Member

bact commented Feb 3, 2026

@copilot dev branch is updated, with a lot of type hints improvements. Resync with the dev branch.

Contributor Author

Copilot AI commented Feb 3, 2026

@copilot dev branch is updated, with a lot of type hints improvements. Resync with the dev branch.

Synced with dev branch in commit c221d94. Merged 55 commits including comprehensive type hints improvements. The auto-merge in pythainlp/tag/named_entity.py was successful, updating type: ignore comments to more specific directives (type: ignore[no-any-return]). All Thai-NNER integration tests pass.

@bact
Member

bact commented Feb 3, 2026

@copilot dev branch is updated, with a lot of type hints improvements. Resync with the dev branch.

…ilot/add-ner-thai-nner-dataset

# Conflicts:
#	pythainlp/tag/thai_nner.py
Contributor Author

Copilot AI commented Feb 3, 2026

@copilot dev branch is updated, with a lot of type hints improvements. Resync with the dev branch.

Synced with dev branch in commit 03116b3. Merged 12 commits including comprehensive type hints and mypy improvements. Resolved conflict in pythainlp/tag/thai_nner.py by keeping our full implementation with top_level_only parameter, get_ner method, and helper functions. All Thai-NNER core tests pass successfully.

@bact
Member

bact commented Feb 3, 2026

@copilot

# Conflicts:
#	pythainlp/tag/named_entity.py
#	pythainlp/tag/thai_nner.py
Contributor Author

Copilot AI commented Feb 3, 2026

@copilot

Completed in commit 350d1af:

  1. Reread updated Copilot instructions from .github/copilot-instructions.md - reviewed all guidelines including type hints, imports, testing conventions, and code quality standards
  2. Synced with dev branch - merged 14 commits including comprehensive type hints improvements across 30+ files
  3. Resolved conflicts - kept our full Thai-NNER implementation while updating type annotations to use dict[str, Any] for better type safety (matching dev branch improvements)
  4. Updated return type - Changed NNER.tag() return type from tuple[list[str], list[dict]] to tuple[list[str], list[dict[str, Any]]] for improved type checking

All imports verified successfully.

@sonarqubecloud

sonarqubecloud Bot commented Feb 3, 2026

bact merged commit 71275c2 into dev Feb 3, 2026
23 of 33 checks passed
bact deleted the copilot/add-ner-thai-nner-dataset branch February 3, 2026 18:39

Labels

enhancement enhance functionalities

Development

Successfully merging this pull request may close these issues.

[TODO]: Add NER from Thai-NNER dataset

5 participants