Tokenizer fixes and span_tokenize method by chekunkov · Pull Request #20 · scrapinghub/webstruct

chekunkov · 2014-06-07T12:30:06Z

The tokenizer from #15 had issues, such as not splitting off the dot at the end of a sentence as a separate token:

40006,40007c40017
< community
< .
---
> community.
41148,41149c41158
< Reserved
< .
---
> Reserved.

Now this issue should be fixed.
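To illustrate the kind of behaviour described above, here is a minimal sketch of a regex tokenizer that splits a sentence-final dot into its own token while keeping dots inside a word (such as "v1.2") attached. The pattern `TOKEN_RE` and the function `tokenize` are hypothetical and written only for this example; they are not webstruct's actual tokenization rules.

```python
import re

# Hypothetical pattern (an assumption, not webstruct's rules):
#   \w+(?:\.\w+)*  a word, optionally with internal dots ("v1.2", "e.g")
#   \S             otherwise, any single non-space character (e.g. a trailing ".")
TOKEN_RE = re.compile(r"\w+(?:\.\w+)*|\S")


def tokenize(text):
    """Return the list of token strings found in ``text``."""
    return TOKEN_RE.findall(text)


print(tokenize("All Rights Reserved."))
# ['All', 'Rights', 'Reserved', '.']
print(tokenize("v1.2 released."))
# ['v1.2', 'released', '.']
```

A trailing dot is not followed by a word character, so the first alternative stops before it and the dot is picked up by the second alternative as a separate token.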

I've also refactored the code and added a span_tokenize method (@kmike, I remember you said it would be nice to have this method).
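As a rough idea of what a span_tokenize-style method usually provides, the sketch below returns (start, end) character offsets instead of token strings, and builds `tokenize` on top of it so both share one code path. The class name `SketchTokenizer` and its regex are assumptions made for illustration; the actual implementation in this PR may differ.

```python
import re


class SketchTokenizer:
    """Illustrative only: the class name and pattern are assumptions."""

    # A word with optional internal dots, or any single non-space character.
    _token_re = re.compile(r"\w+(?:\.\w+)*|\S")

    def span_tokenize(self, text):
        """Return (start, end) offsets of each token in ``text``."""
        return [m.span() for m in self._token_re.finditer(text)]

    def tokenize(self, text):
        """Return token strings, reusing span_tokenize for the matching."""
        return [text[start:end] for start, end in self.span_tokenize(text)]


tok = SketchTokenizer()
print(tok.span_tokenize("Reserved."))
# [(0, 8), (8, 9)]
print(tok.tokenize("Reserved."))
# ['Reserved', '.']
```

Offsets make it possible to map each token back to its position in the original text, which plain token strings cannot do when the same token occurs more than once.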

Performance wasn't hurt:

X, y = webstruct.HtmlTokenizer().tokenize(trees)

CPU times: user 3.42 s, sys: 32 ms, total: 3.46 s
Wall time: 3.45 s

kmike · 2016-11-25T17:54:44Z

@chekunkov do you by chance recall why this PR wasn't merged?

chekunkov · 2016-11-25T19:26:30Z

@kmike nope, have no idea why.

kmike and others added 30 commits July 31, 2013 07:27

change tokenization rules again: don't split : and don't handle contractions

8392112

always clean html before feature extraction

c9cbedc

let +34 and -8 be numbers

a59c4d1

more features

4a44842

more ideas for tags

113a5c9

more annotated data

feeb392

finished annotating US contacts pages corpus

9792d0e

utility for grouping IOB-encoded entities

58de2dd

discourage usage of preprocess.to_features_and_labels

1cd4ed6

utility for substrings extraction

cf1877b

small cleanup

cd40b8d

WapitiTagger class

4badffc

add IDEA files to gitignore

4f26c07

handle an edge case for feature extraction

43314ba

WapitiChunker is a better name

1ea5656

more training data

caa97de

more training data

2c37a10

remove nltk dependency

66c13d9

clarify requirements

9cac9ed

fix default value in cleaning script

e42cfd0

reannotated corpus (2/3 so far)

5b2e22f

annotation guidelines

3be853f

finish reannotation

71ba386

new tags

d43da01

simple docs

5ac1800

add IPython temp files to gitignore

6e13b51

prepare html pages for NL

0e494e4

annotate NL pages

5a03eab

fix encoding

91cef1e

add notebook to train NL open hours parser

acf4519

kmike and others added 26 commits May 21, 2014 14:59

annotation fixes

223d8f1

more annotation fixes

8875d3c

+100 pages

146ad5e

annotation fixes

448048e

BUG fix an issue with WebAnnotatorLoader: it shouldn't add extra "None" tokens

87279df

fix a test after annotation fix

2150bda

easier Trainer customization for CRFsuiteCRF

79d81c5

X_dev and y_dev support for webstruct.crfsuite

a98431e

+100 pages

1c47f9e

doctests (failing) for some tokenization gotchas

e9ebeaa

expose LongestMatchGlobalFeature

f80c382

annotations fix

1c17e7c

one more failing tokenization example

17a5d4e

webstruct.gazetteers.geonames.read_geonames_zipped: try to handle geonames_filename automatically

9d8fcdc

DAWG gazetteers support (they are much faster than MARISA-based, but less memory efficient)

ce775e6

more annotated data

6ee718f

CRFsuiteFeatureEncoder is not needed with python-crfsuite==0.6

ed40e3e

Dropping it gives a nice speedup because computations are now in Cython.

Undocumented HtmlFeatureExtractor post-processing step is removed to simplify code and make it faster. If needed, it can be implemented as a global feature.

b2cb0e7

bias feature

649c814

tiny speedup for BestMatch._find_matches

12be72e

NER.extract_groups_from_url

727f61b

export webstruct.smart_join

cd1860d

annotation fixes (more locations for about 70 pages)

56cd57e

tokenizer - dot regex fix. WordTokenizer refactoring to be able to reuse code. span_tokenize method.

33e638d

Merge branch 'master' into tokenizer_additional_fixes_and_span_method

960bc7b

Conflicts: webstruct/text_tokenizers.py

fixed broken doctests

a010b00

kmike mentioned this pull request Dec 23, 2016

Tokenizer additional fixes and span method #36

Open

Gallaecio force-pushed the master branch from 9e46156 to 17c8254 Compare December 19, 2019 13:21

chekunkov wants to merge 267 commits into scrapinghub:master from chekunkov:tokenizer_additional_fixes_and_span_method

5 participants