Optimize diff algorithm's time complexity, part 1 by QZGao · Pull Request #4 · wikimedia/WikiWho

QZGao · 2026-06-06T00:48:52Z

This PR partially addresses T342805 by removing several avoidable slow paths in WikiWho's revision analysis while keeping the existing difflib.Differ-based word matcher.

The main fix is in analyse_words_in_sentences(). The old word phase had quadratic behavior for large revisions:

- The full Differ.compare(...) output was materialized with list(...).
- For each current token, the code rescanned the diff list from the beginning.
- On matched/deleted tokens, it then rescanned unmatched_words_prev to find the first unused previous Word object with the same value.
- Diff entries were consumed by replacing them with '', but every token still started scanning from position 0.

For a worst-case Google Play revision with about 22k current tokens and 22k previous tokens, this word phase alone took about 36 seconds.

This PR changes that phase to consume Differ().compare(text_prev, text_curr) once, in order, with prev_index and curr_index cursors. Differ already emits alignment-ordered rows, so the post-diff assignment can be handled directly:

| diff tag | action |
| --- | --- |
| `' '` | reuse the next unmatched previous `Word` for the next current token slot |
| `'-'` | mark the next previous `Word` as deleted/outbound |
| `'+'` | create a new `Word` for the next current token slot |
| `'?'` | skip the `Differ` hint line |
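The table above can be sketched as a single forward pass. This is a minimal illustration, not the code from the PR: `assign_words` and the `(action, prev_index, curr_index)` tuples are hypothetical names, and the real implementation operates on `Word` objects rather than indexes. Only `difflib.Differ().compare(...)` and its one-character row tags come from the standard library.

```python
from difflib import Differ


def assign_words(text_prev, text_curr):
    """Walk the Differ output once, advancing one cursor per token list.

    Returns (action, prev_index, curr_index) tuples (hypothetical shape);
    an index is None when that side does not take part in the row.
    """
    prev_index = 0
    curr_index = 0
    actions = []
    for row in Differ().compare(text_prev, text_curr):
        tag = row[0]  # rows look like '  tok', '- tok', '+ tok' or '? ...'
        if tag == ' ':
            # Token exists on both sides: reuse the previous word.
            actions.append(('reuse', prev_index, curr_index))
            prev_index += 1
            curr_index += 1
        elif tag == '-':
            # Token only in the previous revision: mark it deleted.
            actions.append(('delete', prev_index, None))
            prev_index += 1
        elif tag == '+':
            # Token only in the current revision: create a new word.
            actions.append(('create', None, curr_index))
            curr_index += 1
        # '?' rows are intraline hints and carry no token, so they are skipped.
    return actions
```

Because each row advances at least one cursor and nothing is rescanned, the work after the diff itself is linear in the number of rows.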

To make that possible, the code now builds curr_slots, a flat ordered list of (sentence_curr, word_value) pairs, while building text_curr. That preserves the destination sentence for each current token without later rediscovering it by looping back through unmatched_sentences_curr.

The pure-addition case also uses curr_slots, removing a duplicated traversal over current sentences and split tokens.
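A rough sketch of how `curr_slots` could be built alongside `text_curr` follows. The `Sentence` dataclass, `build_curr_slots`, `add_all_as_new`, and the whitespace tokenizer are stand-ins invented for this example; the actual code uses WikiWho's own sentence objects and `split_into_tokens()`.

```python
from dataclasses import dataclass, field


@dataclass
class Sentence:
    """Hypothetical stand-in for WikiWho's sentence object."""
    value: str
    words: list = field(default_factory=list)


def build_curr_slots(unmatched_sentences_curr, tokenize=str.split):
    """Build text_curr and curr_slots in one traversal.

    curr_slots[i] remembers which sentence token i belongs to, so the
    diff pass can place a word without searching the sentences again.
    """
    text_curr = []
    curr_slots = []
    for sentence in unmatched_sentences_curr:
        for word_value in tokenize(sentence.value):
            text_curr.append(word_value)
            curr_slots.append((sentence, word_value))
    return text_curr, curr_slots


def add_all_as_new(curr_slots):
    """Pure-addition case: every slot becomes a new word in its sentence."""
    for sentence, word_value in curr_slots:
        sentence.words.append(word_value)
```

The two lists stay index-aligned, so a `curr_index` cursor into the diff output is also a valid index into `curr_slots`.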

Additional performance fixes:

- Add spam_hashes_set for O(1) spam hash membership checks while preserving the existing spam_hashes list.
- Centralize spam revision recording in _add_spam_revision(...).
- Make JSON revision hash calculation lazy: calculate_hash(text) now only runs when the API response does not include sha1.
- Replace self.temp.append(...); self.temp.count(...) duplicate tracking with local counters for repeated paragraph and sentence hashes.
- Move TOKEN_SYMBOLS and TOKEN_SYMBOL_REPLACEMENTS to module scope so the long symbol list and formatted replacements are built once instead of on every tokenization call.
- Combine empty-token filtering and pipe placeholder restoration into one list comprehension in split_into_tokens().
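Three of the items above follow common patterns that can be illustrated in isolation. The names `SpamIndex`, `revision_hash`, and `occurrence_numbers` are hypothetical and do not appear in the PR; in particular, the signature of `_add_spam_revision(...)` is not shown in the description, and `calculate_hash` is passed in as a parameter here so that no hashing algorithm is assumed.

```python
from collections import Counter


class SpamIndex:
    """Hypothetical holder; the PR keeps these attributes on its own object."""

    def __init__(self):
        self.spam_hashes = []         # existing list, kept for compatibility
        self.spam_hashes_set = set()  # mirror used only for membership tests

    def add(self, rev_hash):
        # One place updates both containers so they cannot drift apart.
        self.spam_hashes.append(rev_hash)
        self.spam_hashes_set.add(rev_hash)

    def __contains__(self, rev_hash):
        # Average-case constant time, unlike scanning the list.
        return rev_hash in self.spam_hashes_set


def revision_hash(revision, text, calculate_hash):
    """Use the API-provided sha1 when present; hash the text only otherwise."""
    sha1 = revision.get('sha1')
    return sha1 if sha1 else calculate_hash(text)


def occurrence_numbers(hashes):
    """Return the running count of each hash, replacing append() + count()."""
    seen = Counter()
    result = []
    for h in hashes:
        seen[h] += 1
        result.append(seen[h])
    return result
```

The counter version does constant work per hash on average, whereas calling `list.count(...)` after every append rescans the whole list each time.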

Testing:

- Tested against revision 1296988276 of the Google Play article, with about 22,189 current tokens and 22,444 previous tokens.
- The optimized word phase took about 0.43 seconds, compared to about 36 seconds before.
- Correctness checks on that revision matched between the old and new implementations for:
  - matched_prev count
  - vandalism flag
  - token_id_delta
  - tokens_delta
  - per-sentence word totals

I also tested the full Google Play page locally against already-fetched revision history. The WikiWho package processing itself still takes roughly 60 seconds for that page, which reflects the cost of the current one-request-calculates-all design. Further improvement for very large pages such as Google Play or Barack Obama would require larger architectural changes in #2.

Copilot

Pull request overview

This PR improves WikiWho’s revision analysis performance by removing avoidable quadratic behavior in the word-diff phase while still using difflib.Differ. It also applies several smaller optimizations to reduce repeated work during parsing and spam detection.

Changes:

- Reworks analyse_words_in_sentences() to consume Differ.compare() in a single forward pass using cursors and precomputed current-token “slots”.
- Adds an internal spam-hash set and a helper to record spam revisions, making spam-hash membership checks O(1).
- Reduces repeated allocations/work in hashing and tokenization (module-scope token symbol tables; combined filtering/restoration in split_into_tokens; lazy hash computation when sha1 is present).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| `WikiWho/wikiwho.py` | Removes slow rescans in word diffing; adds O(1) spam hash membership tracking; reduces duplicate-count overhead in paragraph/sentence processing. |
| `WikiWho/utils.py` | Moves token symbol tables to module scope and simplifies token list construction to avoid repeated per-call setup work. |

MusikAnimal

I've tested this and I see no changes to the algorithm, so I'm going to merge.

Thank you for putting so much work into this, and for your patience with my slow reviews :)

QZGao added 4 commits April 24, 2026 01:56

Optimize WikiWho diff algorithm

52b1642

Merge remote-tracking branch 'origin' into optimization

1ad7ee0

Multiple small fixes

359337c

Move tokenizer to global

89ca645

QZGao mentioned this pull request Jun 6, 2026

Optimize diff algorithm's time complexity, part 2 #2

Open

MusikAnimal requested a review from Copilot June 7, 2026 18:11

Copilot started reviewing on behalf of MusikAnimal June 7, 2026 18:11

Copilot AI reviewed Jun 7, 2026

MusikAnimal approved these changes Jun 8, 2026

MusikAnimal merged commit c3fa64d into wikimedia:master Jun 8, 2026

QZGao deleted the optimization-without-differ branch June 8, 2026 23:28
