Skip to content

Fix CJK tokenization#1

Merged
MusikAnimal merged 1 commit into
wikimedia:masterfrom
QZGao:master
May 7, 2026
Merged

Fix CJK tokenization#1
MusikAnimal merged 1 commit into
wikimedia:masterfrom
QZGao:master

Conversation

@QZGao

@QZGao QZGao commented Apr 24, 2026

Copy link
Copy Markdown

Fix Chinese and Japanese tokenization by introducing a Unicode block range regex regex_cjk.

Note: To make sure this addition doesn't drain calculation speed for other languages, I added two short-circuit guards not text.isascii() and regex_cjk.search(text) before text = regex_cjk.sub(r'||\1||', text).

@QZGao

QZGao commented Apr 24, 2026

Copy link
Copy Markdown
Author

The second commit is for the T342805 problem.

I have identified and fixed this problem -- the word-phase diff in analyse_words_in_sentences() had quadratic time complexity in the number of tokens. For a revision with ~22000 current words and ~22000 previous words, the word phase alone took ~36 seconds.

The root cause was three nested rescans after calling Differ.compare():

  • The full diff list was materialized with list(d.compare(...)).
  • For each current token (outer loop over sentences × words), the entire diff list was rescanned from position 0 to find a matching entry.
  • On a ' ' or '-' hit, unmatched_words_prev was rescanned from the start again to find a previous word object with the same value.
    Entries were consumed by zeroing them out (diff[pos] = ''), but the scan still started from index 0 every time, so the worst-case work per token was O(n_diff + n_prev). Over all current tokens that is O(n^2).

The fix: Differ.compare() emits items in alignment order -- ' ' items correspond to the next unmatched prev word and the next current slot, in sequence. This means the entire post-diff assignment can be done in a single ordered pass with two integer cursors:

diff tag action
' ' advance prev_index: attach that prev word to curr_slots[curr_index]; advance curr_index
'-' advance prev_index: mark that prev word deleted
'+' advance curr_index: create a new Word for curr_slots[curr_index]
'?' skip (Differ hint line)

curr_slots is a flat [(sentence_curr, word_value), ...] list built during the same pass that already populates text_curr and sentence_curr.splitted, so there is no extra work.

The diff is also consumed as a generator now. list(d.compare(...)) is gone, so the full diff sequence is never held in memory.


Tested against revision 1296988276 of the "Google Play" article (~22189 current tokens, ~22444 previous tokens), which was the worst-case revision that triggered the timeout class of failure, and got ~0.43 s after the fix, compared to ~36 s before the fix.

Correctness is also verified. matched_prev count, vandalism flag, token_id_delta, tokens_delta, and per-sentence word totals are identical between the old and new implementations on this revision.

@QZGao QZGao changed the title Fix CJK tokenization Fix CJK tokenization; optimize diff algorithm's time complexity Apr 24, 2026

@MusikAnimal MusikAnimal left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot really give a good review of the Python code itself, but I can verify that Chinese language pages appear to have each character tokenized and not groups of characters. I'll trust you that it works as it should :)

I checked a few articles in English, and was able to verify there's no discernible difference in the tokenization. So I think we're good as far as CJK support goes!

However, the fix for https://phabricator.wikimedia.org/T342805 I was not able to verify. It never finished processing on my local, after about an hour of waiting. How long did it take you?

@MusikAnimal

Copy link
Copy Markdown
Member

However, the fix for https://phabricator.wikimedia.org/T342805 I was not able to verify. It never finished processing on my local, after about an hour of waiting. How long did it take you?

If you can move 52b1642 to a separate PR, I can merge the CJK fix and we can get things going again for Chinese.

@QZGao QZGao changed the title Fix CJK tokenization; optimize diff algorithm's time complexity Fix CJK tokenization May 7, 2026
@QZGao

QZGao commented May 7, 2026

Copy link
Copy Markdown
Author

I have moved 52b1642 to #2. Let's continue the troubleshooting in that PR.

@MusikAnimal MusikAnimal left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!! I will get this deployed and re-do the XML processing for zhwiki

@MusikAnimal MusikAnimal merged commit 61d2174 into wikimedia:master May 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants