Fix CJK tokenization by QZGao · Pull Request #1 · wikimedia/WikiWho

QZGao · 2026-04-24T07:39:57Z

Fix Chinese and Japanese tokenization by introducing a Unicode block range regex regex_cjk.

Note: To make sure this addition doesn't drain calculation speed for other languages, I added two short-circuit guards not text.isascii() and regex_cjk.search(text) before text = regex_cjk.sub(r'||\1||', text).

QZGao · 2026-04-24T09:10:59Z

The second commit is for the T342805 problem.

I have identified and fixed this problem -- the word-phase diff in analyse_words_in_sentences() had quadratic time complexity in the number of tokens. For a revision with ~22000 current words and ~22000 previous words, the word phase alone took ~36 seconds.

The root cause was three nested rescans after calling Differ.compare():

The full diff list was materialized with list(d.compare(...)).
For each current token (outer loop over sentences × words), the entire diff list was rescanned from position 0 to find a matching entry.
On a ' ' or '-' hit, unmatched_words_prev was rescanned from the start again to find a previous word object with the same value.
Entries were consumed by zeroing them out (diff[pos] = ''), but the scan still started from index 0 every time, so the worst-case work per token was O(n_diff + n_prev). Over all current tokens that is O(n^2).

The fix: Differ.compare() emits items in alignment order -- ' ' items correspond to the next unmatched prev word and the next current slot, in sequence. This means the entire post-diff assignment can be done in a single ordered pass with two integer cursors:

diff tag	action
`' '`	advance prev_index: attach that prev word to `curr_slots[curr_index]`; advance `curr_index`
`'-'`	advance prev_index: mark that prev word deleted
`'+'`	advance curr_index: create a new `Word` for `curr_slots[curr_index]`
`'?'`	skip (`Differ` hint line)

curr_slots is a flat [(sentence_curr, word_value), ...] list built during the same pass that already populates text_curr and sentence_curr.splitted, so there is no extra work.

The diff is also consumed as a generator now. list(d.compare(...)) is gone, so the full diff sequence is never held in memory.

Tested against revision 1296988276 of the "Google Play" article (~22189 current tokens, ~22444 previous tokens), which was the worst-case revision that triggered the timeout class of failure, and got ~0.43 s after the fix, compared to ~36 s before the fix.

Correctness is also verified. matched_prev count, vandalism flag, token_id_delta, tokens_delta, and per-sentence word totals are identical between the old and new implementations on this revision.

MusikAnimal

I cannot really give a good review of the Python code itself, but I can verify that Chinese language pages appear to have each character tokenized and not groups of characters. I'll trust you that it works as it should :)

I checked a few articles in English, and was able to verify there's no discernible difference in the tokenization. So I think we're good as far as CJK support goes!

However, the fix for https://phabricator.wikimedia.org/T342805 I was not able to verify. It never finished processing on my local, after about an hour of waiting. How long did it take you?

MusikAnimal · 2026-05-06T23:07:07Z

However, the fix for https://phabricator.wikimedia.org/T342805 I was not able to verify. It never finished processing on my local, after about an hour of waiting. How long did it take you?

If you can move 52b1642 to a separate PR, I can merge the CJK fix and we can get things going again for Chinese.

QZGao · 2026-05-07T00:13:19Z

I have moved 52b1642 to #2. Let's continue the troubleshooting in that PR.

MusikAnimal

Thank you!! I will get this deployed and re-do the XML processing for zhwiki

fix cjk tokenization

fb48be7

QZGao changed the title ~~Fix CJK tokenization~~ Fix CJK tokenization; optimize diff algorithm's time complexity Apr 24, 2026

MusikAnimal reviewed May 6, 2026

View reviewed changes

QZGao force-pushed the master branch from 52b1642 to fb48be7 Compare May 7, 2026 00:10

QZGao changed the title ~~Fix CJK tokenization; optimize diff algorithm's time complexity~~ Fix CJK tokenization May 7, 2026

MusikAnimal approved these changes May 7, 2026

View reviewed changes

MusikAnimal merged commit 61d2174 into wikimedia:master May 7, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix CJK tokenization#1

Fix CJK tokenization#1
MusikAnimal merged 1 commit into
wikimedia:masterfrom
QZGao:master

QZGao commented Apr 24, 2026

Uh oh!

QZGao commented Apr 24, 2026

Uh oh!

MusikAnimal left a comment

Uh oh!

MusikAnimal commented May 6, 2026

Uh oh!

QZGao commented May 7, 2026

Uh oh!

MusikAnimal left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

QZGao commented Apr 24, 2026

Uh oh!

QZGao commented Apr 24, 2026

Uh oh!

MusikAnimal left a comment

Choose a reason for hiding this comment

Uh oh!

MusikAnimal commented May 6, 2026

Uh oh!

QZGao commented May 7, 2026

Uh oh!

MusikAnimal left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants