refactoring to work with the annotated plain text by tpeng · Pull Request #18 · scrapinghub/webstruct

tpeng · 2014-05-26T13:56:12Z

Sometimes the training data may be plain text. Instead of using python-crfsuite or any other CRF package, I still prefer to use webstruct because it has an sklearn pipeline and some evaluation tools out of the box.

The annotated input text is similar to GATE, e.g. `this is a <NER>test</NER>`: the entities are surrounded by <> tags. The rest of the change just moves the generic code to a more proper place.
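To make the input format concrete, here is a minimal sketch of how inline-tagged plain text could be turned into tokens with IOB labels. The function name `parse_annotated_text`, the whitespace tokenization, and the regular expression are assumptions made for illustration; they are not the tokenizer added in this PR or part of webstruct's API.

```python
import re

# Matches either an annotated span such as <NER>test</NER> or a run of
# untagged text. Nested tags are not handled in this sketch.
_SPAN_RE = re.compile(
    r"<(?P<tag>\w+)>(?P<inner>.*?)</(?P=tag)>|(?P<plain>[^<]+)", re.S
)


def parse_annotated_text(text):
    """Return (tokens, labels) with IOB labels for inline-tagged text.

    Hypothetical helper: tokens are split on whitespace only.
    """
    tokens, labels = [], []
    for match in _SPAN_RE.finditer(text):
        tag = match.group("tag")
        if tag:
            # Words inside an entity get B-<tag> for the first word, I-<tag> after.
            for i, word in enumerate(match.group("inner").split()):
                tokens.append(word)
                labels.append(("B-" if i == 0 else "I-") + tag)
        else:
            # Words outside any entity are labelled "O".
            for word in match.group("plain").split():
                tokens.append(word)
                labels.append("O")
    return tokens, labels


tokens, labels = parse_annotated_text("this is a <NER>test</NER>.")
```

The resulting parallel lists have the shape a sequence-labelling pipeline typically consumes; a real tokenizer would likely need smarter word splitting than `str.split`.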

kmike · 2014-05-26T14:59:54Z

webstruct/feature_extraction.py

the data is not necessarily annotated: HtmlLoader is used to load raw data

kmike · 2014-05-26T15:08:29Z

My main concern is the Token class and the TextTokenizer thing. Creating Token instances looks like total overkill: why would anyone need to wrap a text token in a Token instance and keep a reference to all other tokens in the text there? Also, there is already a text_tokenizers module, so this adds to the confusion.
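One way to read this concern is that plain string tokens in a list are enough, because a feature function can reach neighbouring tokens through the list and an index. The sketch below illustrates that alternative; `token_features` and its feature names are hypothetical and are not taken from webstruct or from this PR.

```python
def token_features(tokens, index):
    """Features for tokens[index]; context is reached via the list itself.

    Hypothetical feature function operating on plain strings, so no
    wrapper object holding references to the other tokens is needed.
    """
    token = tokens[index]
    return {
        "lower": token.lower(),
        "is_title": token.istitle(),
        "prev": tokens[index - 1].lower() if index > 0 else "<BOS>",
        "next": tokens[index + 1].lower() if index + 1 < len(tokens) else "<EOS>",
    }


tokens = ["Contact", "John", "Smith", "today"]
features = [token_features(tokens, i) for i in range(len(tokens))]
```

With this shape the token sequence stays a list of strings, and any context a feature needs is passed in explicitly rather than stored on each token.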

kmike and others added 30 commits July 25, 2013 05:13

partially annotated corpus with contact pages of US websites

2943c63

annotation fix

03a851e

finished US contact pages annotation

1bd1e6e

some code for token-based model

36e1553

nose is better at doctests

5710aa5

various small improvements (tokenization, argument names, etc)

f163c28

wapiti helper

2e773cc

annotation fix

d527cd6

more compact feature representation

02b589f

some feature functions

d22ab77

merge address IOE tags

b90a4d2

annotation fixes

7c4abec

split feature functions into files

c254de0

better comment handling for wapiti templates

04a840a

more annotated data

77cda05

corpus readme fixes

4889d29

"caps" shape

4c4c388

change tokenization rules again: don't split : and don't handle contractions

8392112

always clean html before feature extraction

c9cbedc

let +34 and -8 be numbers

a59c4d1

more features

4a44842

more ideas for tags

113a5c9

more annotated data

feeb392

finished annotating US contacts pages corpus

9792d0e

utility for grouping IOB-encoded entities

58de2dd

discourage usage of preprocess.to_features_and_labels

1cd4ed6

utility for substrings extraction

cf1877b

small cleanup

cd40b8d

WapitiTagger class

4badffc

add IDEA files to gitignore

4f26c07

kmike added 24 commits May 16, 2014 04:30

fix HtmlTokenizer pickling

383f8b7

WapitiCRF.fit returns self

0adaaf2

train_test_split_noshuffle

92553b7

TST runcoverage script

55598e0

python-crfsuite support; tests for NER and crfsuite pipeline

a2111d4

expose CRFsuiteCRF and CCRFsuiteFeatureEncoder

01b0ee6

rename wapiti_kwargs to crf_kwargs for consistency

0f248b6

move tostr to wapiti module because it is wapiti-specific

441ebf4

NER.annotate and NER.annotate_url methods

7d12376

Abstract temporary model files handling; add this feature to wapiti. See scrapinghubGH-8.

85e9407

A corpus (not annotated yet) with 450 pages from business websites in various industries.

9525c46

add EMAIL to dtd in order to load annotated files properly

38730d8

annotation fixes

4619e8f

Fix html produced by WebAnnotator. See xtannier/WebAnnotator#20

be9a91c

(backwards incompatible) drop existing load_trees; rename `load_trees_from_files` to `load_trees`.

591051d

make it possible to use existing WebAnnotator colors

5bb3768

+100 annotated pages

6cd6265

annotation fixes

2e746c4

annotation fixes

223d8f1

more annotation fixes

8875d3c

+100 pages

146ad5e

annotation fixes

448048e

BUG fix an issue with WebAnnotatorLoader: it shouldn't add extra "None" tokens

87279df

fix a test after annotation fix

2150bda

refactoring to allow learn from annotated plain text and rename HtmlFeatureExtractor to FeatureExtractor

42d644f

Gallaecio force-pushed the master branch from 9e46156 to 17c8254 Compare December 19, 2019 13:21

refactoring to work with the annotated plain text #18
tpeng wants to merge 248 commits into scrapinghub:master from tpeng:plain-text-tokenizer

5 participants
