refactoring to work with the annotated plain text #18
Open
tpeng wants to merge 248 commits into scrapinghub:master
Conversation
… various industries.
…es_from_files` to `load_trees`.
webstruct/feature_extraction.py
Outdated
Member
the data is not necessarily annotated: HtmlLoader is used to load raw data
Member
My main concern is the Token class and the TextTokenizer thing. Creating Token instances looks like total overkill: why would anyone need to wrap a text token in a Token instance and keep a reference to all the other tokens in the text there? Also, there is already a text_tokenizers module, so this adds to the confusion.
HtmlFeatureExtractor to FeatureExtractor
Sometimes the training data may be plain text. Instead of using python-crfsuite or any other CRF package, I still prefer to use webstruct because it has an sklearn pipeline and some evaluation tools out of the box.
The input annotated text is similar to GATE, e.g. `this is a <NER>test</NER>.` The entities are surrounded by <> tags. The rest of the change just moves the generic code to a more proper place.
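To make the input format concrete, here is a minimal sketch of how annotated plain text like the example above could be turned into tokens and labels. It is not the code from this PR and not webstruct's API: the function name `annotated_text_to_iob`, the regex tokenizer, and the choice of IOB labels are all assumptions made for illustration.

```python
import re

# Hypothetical helpers for illustration only; not part of webstruct.
TAG_RE = re.compile(r"<(/?)(\w+)>")        # matches <NER> and </NER>
TOKEN_RE = re.compile(r"\w+|[^\w\s]")      # words, or single punctuation marks


def annotated_text_to_iob(text):
    """Split GATE-like annotated text into (tokens, labels) with IOB labels."""
    tokens, labels = [], []
    entity = None   # entity type currently open, or None outside any tag
    first = False   # True until the first token of the open entity is emitted
    pos = 0

    # Collect every tag, then add a sentinel so the text after the last tag
    # is processed by the same loop.
    spans = [(m.start(), m.end(), m.group(1), m.group(2))
             for m in TAG_RE.finditer(text)]
    spans.append((len(text), len(text), None, None))

    for start, end, closing, name in spans:
        # Label the plain text that sits between the previous tag and this one.
        for tok in TOKEN_RE.findall(text[pos:start]):
            if entity is None:
                labels.append("O")
            else:
                labels.append(("B-" if first else "I-") + entity)
                first = False
            tokens.append(tok)
        # Update the state according to the tag itself.
        if name is not None:
            if closing:
                entity = None
            else:
                entity, first = name, True
        pos = end
    return tokens, labels


tokens, labels = annotated_text_to_iob("this is a <NER>test</NER>.")
print(list(zip(tokens, labels)))
```

The resulting token and label sequences have the same length, which is the shape a sequence labeller such as a CRF expects. Nested or unbalanced tags are not handled in this sketch.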