Optimize diff algorithm's time complexity, part 1 #4
Merged
Pull request overview
This PR improves WikiWho’s revision analysis performance by removing avoidable quadratic behavior in the word-diff phase while still using difflib.Differ. It also applies several smaller optimizations to reduce repeated work during parsing and spam detection.
Changes:
- Reworks `analyse_words_in_sentences()` to consume `Differ.compare()` in a single forward pass using cursors and precomputed current-token “slots”.
- Adds an internal spam-hash set and a helper to record spam revisions, making spam-hash membership checks O(1).
- Reduces repeated allocations/work in hashing and tokenization (module-scope token symbol tables; combined filtering/restoration in `split_into_tokens`; lazy hash computation when `sha1` is present).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `WikiWho/wikiwho.py` | Removes slow rescans in word diffing; adds O(1) spam hash membership tracking; reduces duplicate-count overhead in paragraph/sentence processing. |
| `WikiWho/utils.py` | Moves token symbol tables to module scope and simplifies token list construction to avoid repeated per-call setup work. |
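
For readers who want the `wikiwho.py` rows made concrete, here is a minimal Python sketch of two of the ideas: a set kept alongside the spam-hash list, and running counters for repeated hashes. It is an illustration under assumptions, not the PR's code. `SpamTracker`, `add_spam_revision`, `is_spam`, and `occurrence_numbers` are hypothetical names; only `spam_hashes` and `spam_hashes_set` appear in the PR.

```python
from collections import Counter


class SpamTracker:
    """Hypothetical stand-in for the object that records spam revisions."""

    def __init__(self):
        self.spam_hashes = []         # ordered list, kept for existing callers
        self.spam_hashes_set = set()  # mirror of the list for O(1) lookups

    def add_spam_revision(self, rev_hash):
        # Update both containers in one place so they cannot drift apart.
        self.spam_hashes.append(rev_hash)
        self.spam_hashes_set.add(rev_hash)

    def is_spam(self, rev_hash):
        # Set membership is O(1) on average; `in` on the list would be O(n).
        return rev_hash in self.spam_hashes_set


def occurrence_numbers(hashes):
    """Return, for each hash, how many times it has been seen so far.

    Gives the same numbers as appending to a list and calling list.count()
    after every append, without rescanning the list each time.
    """
    seen = Counter()
    numbers = []
    for h in hashes:
        seen[h] += 1
        numbers.append(seen[h])
    return numbers
```

Routing every write through one helper is what keeps the list and the set consistent, which is presumably why the PR adds a dedicated helper for recording spam revisions.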
MusikAnimal (Member) approved these changes on Jun 8, 2026 and left a comment:
I've tested this and I see no changes to the algorithm, so I'm going to merge.
Thank you for putting so much work into this, and for your patience with my slow reviews :)
This PR partially addresses T342805 by removing several avoidable slow paths in WikiWho's revision analysis while keeping the existing `difflib.Differ`-based word matcher.

The main fix is in `analyse_words_in_sentences()`. The old word phase had quadratic behavior for large revisions:

- `Differ.compare(...)` output was materialized with `list(...)`.
- […] `unmatched_words_prev` to find the first unused previous `Word` object with the same value.
- […] `''`, but every token still started scanning from position 0.

For a worst-case Google Play revision with about 22k current tokens and 22k previous tokens, this word phase alone took about 36 seconds.
This PR changes that phase to consume `Differ().compare(text_prev, text_curr)` once, in order, with `prev_index` and `curr_index` cursors. `Differ` already emits alignment-ordered rows, so the post-diff assignment can be handled directly:

- `' '`: […] `Word` for the next current token slot
- `'-'`: […] `Word` as deleted/outbound
- `'+'`: […] `Word` for the next current token slot
- `'?'`: […] `Differ` hint line

To make that possible, the code now builds `curr_slots`, a flat ordered list of `(sentence_curr, word_value)` pairs, while building `text_curr`. That preserves the destination sentence for each current token without later rediscovering it by looping back through `unmatched_sentences_curr`.

The pure-addition case also uses `curr_slots`, removing a duplicated traversal over current sentences and split tokens.

Additional performance fixes:

- Added `spam_hashes_set` for O(1) spam hash membership checks while preserving the existing `spam_hashes` list.
- […] `_add_spam_revision(...)`.
- `calculate_hash(text)` now only runs when the API response does not include `sha1`.
- Replaced `self.temp.append(...); self.temp.count(...)` duplicate tracking with local counters for repeated paragraph and sentence hashes.
- Moved `TOKEN_SYMBOLS` and `TOKEN_SYMBOL_REPLACEMENTS` to module scope so the long symbol list and formatted replacements are built once instead of on every tokenization call.
- […] `split_into_tokens()`.

Testing: […]

- `matched_prev` count
- `token_id_delta`
- `tokens_delta`

I also tested the full Google Play page locally against already-fetched revision history. The WikiWho package processing itself still takes roughly 60 seconds for that page, which reflects the cost of the current one-request-calculates-all design. Further improvement for very large pages such as Google Play or Barack Obama would require larger architectural changes in #2.
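
To illustrate the single-pass word phase described in this PR, here is a minimal, self-contained sketch of walking `Differ().compare(...)` rows once with two cursors. It is not the code in `analyse_words_in_sentences()`. The function name `assign_words`, the tuple shapes, and the `"matched"`/`"added"` labels are assumptions made for the example, and sentence objects are replaced by plain integer ids.

```python
from difflib import Differ


def assign_words(text_prev, curr_slots):
    """Walk Differ rows once, advancing one cursor per side.

    text_prev:  list of previous token values.
    curr_slots: list of (sentence_id, token_value) pairs in document order.
    Returns (assignments, deleted). `assignments` has one entry per current
    slot; `deleted` lists indexes into text_prev that have no counterpart.
    """
    text_curr = [value for _, value in curr_slots]
    prev_index = 0  # next unconsumed previous token
    curr_index = 0  # next unfilled current slot
    assignments = []
    deleted = []

    # The generator is consumed lazily; no list(...) copy of the rows is made.
    for row in Differ().compare(text_prev, text_curr):
        tag = row[0]
        if tag == "?":
            # Intraline hint row: it corresponds to no token on either side.
            continue
        if tag == " ":
            # Token present on both sides: pair the two cursors.
            sentence, value = curr_slots[curr_index]
            assignments.append((sentence, value, "matched", prev_index))
            prev_index += 1
            curr_index += 1
        elif tag == "-":
            # Token only in the previous text.
            deleted.append(prev_index)
            prev_index += 1
        elif tag == "+":
            # Token only in the current text.
            sentence, value = curr_slots[curr_index]
            assignments.append((sentence, value, "added", None))
            curr_index += 1
    return assignments, deleted
```

Because each row advances at most one cursor per side, the post-diff bookkeeping is linear in the number of rows. The cost of `Differ` itself is unchanged, which matches the PR's note that the matcher is kept as-is.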