Fix CJK tokenization #1
The second commit is for the T342805 problem. I have identified and fixed this problem -- the word-phase diff in The root cause was three nested rescans after calling
The fix: Differ.compare() emits items in alignment order --
The diff is also consumed as a generator now. I tested against revision 1296988276 of the "Google Play" article (~22189 current tokens, ~22444 previous tokens), the worst-case revision that triggered the timeout class of failure: it takes ~0.43 s after the fix, compared to ~36 s before the fix. Correctness is also verified.
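The patched code itself is not shown in this comment. As a rough illustration only, assuming `Differ` here is Python's standard `difflib.Differ`, the sketch below consumes `compare()` lazily in a single pass instead of building a list and rescanning it. The helper name `changed_tokens` and the `'removed'`/`'added'` labels are hypothetical and not taken from the PR.

```python
from difflib import Differ


def changed_tokens(previous, current):
    """Yield (op, token) pairs in one pass over Differ.compare().

    Hypothetical helper for illustration; not the PR's actual code.
    """
    # compare() returns a generator, so items are processed as they are
    # produced and no intermediate list has to be rescanned afterwards.
    for item in Differ().compare(previous, current):
        code, token = item[:2], item[2:]
        if code == '- ':
            yield ('removed', token)
        elif code == '+ ':
            yield ('added', token)
        # '  ' (unchanged) and '? ' (intraline hint) items are skipped.
```

Because the helper is itself a generator, a caller can stop early or stream the result without holding the whole diff in memory.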
MusikAnimal
left a comment
I cannot really give a good review of the Python code itself, but I can verify that Chinese language pages appear to have each character tokenized and not groups of characters. I'll trust you that it works as it should :)
I checked a few articles in English, and was able to verify there's no discernible difference in the tokenization. So I think we're good as far as CJK support goes!
However, I was not able to verify the fix for https://phabricator.wikimedia.org/T342805. It never finished processing on my local machine, even after about an hour of waiting. How long did it take you?
If you can move 52b1642 to a separate PR, I can merge the CJK fix and we can get things going again for Chinese.
MusikAnimal
left a comment
Thank you!! I will get this deployed and re-do the XML processing for zhwiki
Fix Chinese and Japanese tokenization by introducing a Unicode block range regex, `regex_cjk`.
Note: To make sure this addition doesn't slow down processing for other languages, I added two short-circuit guards, `not text.isascii()` and `regex_cjk.search(text)`, before `text = regex_cjk.sub(r'||\1||', text)`.
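A minimal sketch of how the two guards fit together is shown below. The character ranges are an assumption for illustration (CJK Unified Ideographs, Hiragana, Katakana); the actual `regex_cjk` pattern in the PR may cover different blocks. The wrapper name `split_cjk` is hypothetical.

```python
import re

# Assumption: illustrative ranges only; the PR's real pattern is not shown here.
regex_cjk = re.compile(r'([\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff])')


def split_cjk(text):
    """Wrap each CJK character in '||' so it becomes its own token."""
    # Guard 1: pure-ASCII text cannot contain CJK, and isascii() is cheap.
    # Guard 2: non-ASCII text without CJK (e.g. accented Latin) skips sub().
    if not text.isascii() and regex_cjk.search(text):
        text = regex_cjk.sub(r'||\1||', text)
    return text
```

With these guards, English text returns after the `isascii()` check alone, and text such as "café" returns after one failed `search()` without running the substitution.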