Optimize diff algorithm's time complexity, part 2 by QZGao · Pull Request #2 · wikimedia/WikiWho

QZGao · 2026-05-07T00:11:29Z

This PR fully addresses T342805 by replacing the slow difflib.Differ word matcher with a SequenceMatcher-based matcher. This is a continuation of #4's part 1 changes.

Differ is too slow on some large repetitive revisions because WikiWho only needs equal/delete/insert token matching, but Differ also computes expensive hint/fancy-replace output that WikiWho does not use. In particular, it can stall in difflib._fancy_replace on large repeated-token edits.

The new matcher uses SequenceMatcher.get_opcodes() directly, with extra WikiWho-specific handling to keep authorship attribution intuitive:

Common prefix/suffix tokens are matched first.
Generic wikitext tokens such as {{, }}, [[, ]], |, and = are matched with local syntax context instead of by token value alone.
Prefix/suffix matches are rolled back when they would cut through templates, links, or comments.
Repetitive replace regions use bounded nearest-token matching.
Moved text is recovered by finding unique informative n-gram runs in unmatched regions, extending them left/right, and verifying that the content core is unique in both the previous and current article text.
Edited internal-link boundaries are recovered when an unpiped link target is extended, for example [[Category:Video games]] becoming [[Category:Video games developed in Japan]].
Stale suffix matches for short glue words are demoted after large pure deletions, so old words like to, in, or of do not incorrectly survive into unrelated later prose.

This change can produce different token attribution from the old Differ matcher. That is expected: the goal is not byte-for-byte parity with Differ, but faster completion on large revisions and better attribution for moved or reformatted text.

Tested cases include these previously analyzed situations:

Google Play, where the Differ version could not finish reliably within practical runtime.
URL/reference expansion around support.google.com, where the new matcher preserves the stable domain tokens.
Availability-table cleanup, where the new matcher preserves the same qualitative behavior as Differ.
Large repetitive availability-table edits, where Differ stalls but the new matcher completes.
Adam Himebauch, Splatoon 3, American goldfinch, Japan Cup, 2026 Canvas security incident, Nintendo Switch, and Through the Looking-Glass regression cases covering prose, references, links, templates, duplicate words, punctuation, and stale suffix tokens.

Tested pytest and passed 24 out of 24 test cases. Should merge #4 and #3 before this.

QZGao · 2026-05-07T01:13:47Z

I ran the whole "Google Play" page locally. Without counting the request time, the local WikiWho processing itself took 62.46 s.

QZGao · 2026-05-07T01:52:44Z

Another run took 66.70 s total local processing:

determine_authorship: 62.62 s
- reset_update_matched_paragraphs: 25.61 s
- analyse_paragraphs: 21.01 s
- analyse_sentences: 12.44 s
- analyse_words: 2.85 s
- reset_update_matched_paragraphs_words: 106,439,777
- reset_update_matched_paragraphs_sentences: 6,706,455
- reset_update_matched_paragraphs_paragraphs: 845,938
lower_text: 3.02 s
eager_hash_lookup: 0.91 s
JSON payload build + encode: 0.13 s

QZGao · 2026-05-07T02:02:36Z

Multiple small fixes:

Added spam_hashes_set for O(1) spam hash membership while preserving the existing spam_hashes list.
Fixed eager hash calculation in JSON revision processing.
Replaced self.temp.append(...); self.temp.count(...) duplicate tracking with local counters for paragraph and sentence duplicates.

QZGao · 2026-05-07T06:49:25Z

Another thing that severely drags the program down is the final all-token scan. I am investigating whether we can optimize that.

QZGao

I have done some investigation on the code but sadly could not find any more optimization. The one-request-that-calculates-all approach clearly has a lower bound in efficiency. If we want to get "Google Play" and "Barack Obama" to be parsed within time and memory constraints, we would need to fundamentally change some behaviors.

For example, if we want to speed up, we would want to cache previous runs on previous revisions, and only do incremental analysis. If we want to optimize memory usage, we would want to give up on storing all revisions -- maybe only keep the latest 500 revisions and delete old revisions, if we assume we already do incremental analysis.

Still, I believe the current changes already made are worthwhile. I've documented the changes in these comments.

QZGao · 2026-05-07T08:15:58Z

+        # Build flat (sentence, token) slots so we can assign words during the
+        # diff pass without re-scanning sentences or the diff list.
+        curr_slots = []  # list of (sentence_curr, word_value)
        text_curr = []
        for sentence_curr in unmatched_sentences_curr:
            # split_into_tokens is already done in analyse_sentences_in_paragraphs
            words = sentence_curr.value.split(' ')
            text_curr.extend(words)
            sentence_curr.splitted.extend(words)
+            for word in words:
+                curr_slots.append((sentence_curr, word))


In analyse_words_in_sentences, the algorithm needs to decide, for every token in the current revision, which Sentence object should receive either a reused old Word object or a newly created Word object. Originally, after computing the diff, the code loops back through unmatched_sentences_curr and each sentence’s split tokens to rediscover that sentence/token position while also scanning the diff list. This patch builds curr_slots, a flat ordered list of (sentence_curr, word) pairs, at the same time it builds text_curr. This preserves the same token order as text_curr, and also remembers the destination sentence for each token.

QZGao · 2026-05-07T08:17:57Z

        # Edit consists of adding new content, not changing/removing content
        if not text_prev:
-            for sentence_curr in unmatched_sentences_curr:
-                for word in sentence_curr.splitted:
-                    word_curr = Word()
-                    word_curr.value = word
-                    word_curr.token_id = self.token_id
-                    word_curr.origin_rev_id = self.revision_curr.id
-                    word_curr.last_rev_id = self.revision_curr.id
-
-                    sentence_curr.words.append(word_curr)
-                    self.token_id += 1
-                    self.revision_curr.original_adds += 1
-                    self.tokens.append(word_curr)
+            for sentence_curr, word in curr_slots:
+                word_curr = Word()
+                word_curr.value = word
+                word_curr.token_id = self.token_id
+                word_curr.origin_rev_id = self.revision_curr.id
+                word_curr.last_rev_id = self.revision_curr.id
+                sentence_curr.words.append(word_curr)
+                self.token_id += 1
+                self.revision_curr.original_adds += 1
+                self.tokens.append(word_curr)
            return matched_words_prev, possible_vandalism


Pure-addition edits are the case where there are current tokens but no previous unmatched tokens to compare against. Previously, the code loops through unmatched_sentences_curr and then through sentence_curr.splitted to create new Word objects. Since the branch already built curr_slots, it can create the same new Word objects from that flat ordered list instead. This removes duplicated traversal logic.

QZGao · 2026-05-07T08:27:38Z

+        # Single pass diff
+        prev_index = 0
+        curr_index = 0
        d = Differ()
-        diff = list(d.compare(text_prev, text_curr))
-        for sentence_curr in unmatched_sentences_curr:
-            for word in sentence_curr.splitted:
-                curr_matched = False
-                pos = 0
-                diff_len = len(diff)
-                while pos < diff_len:
-                    word_diff = diff[pos]
-                    if word == word_diff[2:]:
-                        if word_diff[0] == ' ':
-                            # match
-                            for word_prev in unmatched_words_prev:
-                                if not word_prev.matched and word_prev.value == word:
-
-                                    word_prev.matched = True
-                                    curr_matched = True
-                                    sentence_curr.words.append(word_prev)
-                                    matched_words_prev.append(word_prev)
-                                    diff[pos] = ''
-                                    pos = diff_len + 1
-                                    break
-                        elif word_diff[0] == '-':
-                            # deleted
-                            for word_prev in unmatched_words_prev:
-                                if not word_prev.matched and word_prev.value == word:
-                                    word_prev.matched = True
-                                    word_prev.outbound.append(self.revision_curr.id)
-                                    matched_words_prev.append(word_prev)
-                                    diff[pos] = ''
-                                    break
-                        elif word_diff[0] == '+':
-                            # a new added word
-                            curr_matched = True
-                            word_curr = Word()
-                            word_curr.value = word
-                            word_curr.token_id = self.token_id
-                            word_curr.origin_rev_id = self.revision_curr.id
-                            word_curr.last_rev_id = self.revision_curr.id
-
-                            sentence_curr.words.append(word_curr)
-                            self.token_id += 1
-                            self.revision_curr.original_adds += 1
-                            self.tokens.append(word_curr)
-                            diff[pos] = ''
-                            pos = diff_len + 1
-                    pos += 1
-
-                if not curr_matched:
-                    word_curr = Word()
-                    word_curr.value = word
-                    word_curr.token_id = self.token_id
-                    word_curr.origin_rev_id = self.revision_curr.id
-                    word_curr.last_rev_id = self.revision_curr.id
-                    sentence_curr.words.append(word_curr)
-
-                    self.token_id += 1
-                    self.revision_curr.original_adds += 1
-                    self.tokens.append(word_curr)
+        for diff_item in d.compare(text_prev, text_curr):
+            op = diff_item[0]
+            if op == '?': # Differ hint line; skip
+                continue
+            elif op == ' ': # next prev word is matched to next curr slot
+                word_prev = unmatched_words_prev[prev_index]
+                sentence_curr, _ = curr_slots[curr_index]
+                word_prev.matched = True
+                sentence_curr.words.append(word_prev)
+                matched_words_prev.append(word_prev)
+                prev_index += 1
+                curr_index += 1
+            elif op == '-': # next prev word is deleted (not present in curr)
+                word_prev = unmatched_words_prev[prev_index]
+                word_prev.matched = True
+                word_prev.outbound.append(self.revision_curr.id)
+                matched_words_prev.append(word_prev)
+                prev_index += 1
+            elif op == '+': # next curr slot gets a brand-new Word
+                sentence_curr, word = curr_slots[curr_index]
+                word_curr = Word()
+                word_curr.value = word
+                word_curr.token_id = self.token_id
+                word_curr.origin_rev_id = self.revision_curr.id
+                word_curr.last_rev_id = self.revision_curr.id
+                sentence_curr.words.append(word_curr)
+                self.token_id += 1
+                self.revision_curr.original_adds += 1
+                self.tokens.append(word_curr)
+                curr_index += 1



The main word-diff loop now reads Differ().compare(text_prev, text_curr) once, using prev_index and curr_index to keep the previous Word stream and current token stream aligned. This is needed because the old code matched diff rows by token text. Token text is often repeated, especially for common words, punctuation, markup characters, and template syntax. For every current token, the old loop could scan much of the diff output looking for a same-text row, then scan previous words again looking for a same-text Word object that was not already marked matched. The new loop does not search by text. It follows the order already produced by Differ: unchanged tokens consume one previous word and one current slot, deleted tokens consume one previous word, and inserted tokens consume one current slot. That removes repeated scans and makes the matching rule positional instead of “first unused word with this text”.

QZGao · 2026-05-07T08:29:39Z

            text = revision.get('*', '')
            rev_id = int(revision['revid'])
-            rev_hash = revision.get('sha1', calculate_hash(text))
-            if rev_hash in self.spam_hashes:
+            rev_hash = revision.get('sha1')
+            if not rev_hash:
+                rev_hash = calculate_hash(text)
+            if rev_hash in self.spam_hashes_set:
                vandalism = True


The JSON revision path no longer computes calculate_hash(text) unless the API response lacks sha1. In Python, function arguments are evaluated before the function call, so previously revision.get('sha1', calculate_hash(text)) still hashes the full revision text even when sha1 exists. This patch makes the fallback lazy, which avoids unnecessary full-text hashing for normal API responses.

QZGao · 2026-05-07T08:35:17Z

+        # Keep the public list shape while using a set for membership checks.
+        self.spam_hashes_set = set()
+
+    def _add_spam_revision(self, rev_id, rev_hash):
+        self.spam_ids.append(rev_id)
+        self.spam_hashes.append(rev_hash)
+        self.spam_hashes_set.add(rev_hash)


Spam hash lookup now uses spam_hashes_set. Spam revision recording is centralized in _add_spam_revision(...). The existing spam_hashes list is still kept (because I am not sure if any other scripts would query that), but is no longer used in the main logic. This helps when many vandalism/spam revisions have already been recorded.

QZGao · 2026-05-07T08:37:21Z

        # Identify unmatched paragraphs in previous revision for further analysis.
+        paragraph_duplicate_counts = {}
        for paragraph_prev_hash in self.revision_prev.ordered_paragraphs:
            if len(self.revision_prev.paragraphs[paragraph_prev_hash]) > 1:
-                s = 'p-{}-{}'.format(self.revision_prev, paragraph_prev_hash)
-                self.temp.append(s)
-                count = self.temp.count(s)
+                count = paragraph_duplicate_counts.get(paragraph_prev_hash, 0) + 1
+                paragraph_duplicate_counts[paragraph_prev_hash] = count
                paragraph_prev = self.revision_prev.paragraphs[paragraph_prev_hash][count - 1]
            else:
                paragraph_prev = self.revision_prev.paragraphs[paragraph_prev_hash][0]
            if not paragraph_prev.matched:
                unmatched_paragraphs_prev.append(paragraph_prev)


Duplicate paragraph tracking now uses paragraph_duplicate_counts. The algorithm needs the nth paragraph object for repeated paragraph hashes in a revision. Previously, it builds a formatted string key, appends it to self.temp, then calls self.temp.count(...) to find the occurrence number. Now it uses a local dictionary counter keyed by paragraph hash, so finding the occurrence number is direct and does not scan self.temp.

QZGao · 2026-05-07T08:39:12Z

        # Identify the unmatched sentences in the previous paragraph revision.
+        sentence_duplicate_counts = {}
        for paragraph_prev in unmatched_paragraphs_prev:
            for sentence_prev_hash in paragraph_prev.ordered_sentences:
                if len(paragraph_prev.sentences[sentence_prev_hash]) > 1:
-                    s = 's-{}-{}'.format(paragraph_prev, sentence_prev_hash)
-                    self.temp.append(s)
-                    count = self.temp.count(s)
+                    key = (id(paragraph_prev), sentence_prev_hash)
+                    count = sentence_duplicate_counts.get(key, 0) + 1
+                    sentence_duplicate_counts[key] = count
                    sentence_prev = paragraph_prev.sentences[sentence_prev_hash][count - 1]
                else:
                    sentence_prev = paragraph_prev.sentences[sentence_prev_hash][0]
                if not sentence_prev.matched:
                    unmatched_sentences_prev.append(sentence_prev)
                    # to reset 'matched words in analyse_words_in_sentences' of unmatched paragraphs and sentences
                    sentence_prev.matched = True
                    matched_sentences_prev.append(sentence_prev)

        return unmatched_sentences_curr, unmatched_sentences_prev, matched_sentences_prev, total_sentences

    def analyse_words_in_sentences(self, unmatched_sentences_curr, unmatched_sentences_prev, possible_vandalism):
        matched_words_prev = []
        unmatched_words_prev = []

        # Split sentences into words.
        text_prev = []
        for sentence_prev in unmatched_sentences_prev:
            for word_prev in sentence_prev.words:
                if not word_prev.matched:
                    text_prev.append(word_prev.value)
                    unmatched_words_prev.append(word_prev)


Duplicate sentence tracking now uses sentence_duplicate_counts. This is the same idea as paragraph_duplicate_counts explained above.

QZGao · 2026-05-07T08:42:12Z

+TOKEN_SYMBOLS = (
+    '.', ',', ';', ':', '?', '!', '-', '_', '/', '\\', '(', ')', '[', ']', '{', '}', '*', '#', '@',
+    '&', '=', '+', '%', '~', '$', '^', '<', '>', '"', '\'', '´', '`', '¸', '˛', '’',
+    '¤', '₳', '฿', '₵', '¢', '₡', '₢', '₫', '₯', '֏', '₠', '€', 'ƒ', '₣', '₲', '₴', '₭', '₺',
+    '₾', 'ℳ', '₥', '₦', '₧', '₱', '₰', '£', '៛', '₽', '₹', '₨', '₪', '৳', '₸', '₮', '₩', '¥',
+    '§', '‖', '¦', '⟨', '⟩', '–', '—', '¯', '»', '«', '”', '÷', '×', '′', '″', '‴', '¡',
+    '¿', '©', '℗', '®', '℠', '™'
+)
+TOKEN_SYMBOL_REPLACEMENTS = tuple((c, '||{}||'.format(c)) for c in TOKEN_SYMBOLS)


Moved TOKEN_SYMBOLS and TOKEN_SYMBOL_REPLACEMENTS to module-level. Previously, the long symbol list is constructed every time split_into_tokens runs, and every tokenization call formats '||{}||'.format(c) for every symbol. Since the list is constant, we now builds TOKEN_SYMBOLS once when the module is loaded, and computes those (symbol, replacement) pairs once and reuses them.

QZGao · 2026-05-07T08:43:57Z


-    tokens = filter(lambda a: a != '', text.split('||'))  # filter empty strings
-    tokens = ['|' if w == 'ææææ' else w for w in tokens]  # insert back the |s
+    tokens = ['|' if w == 'ææææ' else w for w in text.split('||') if w != '']  # insert back the |s


Empty-token filtering and pipe restoration are combined into one list comprehension. Previously, tokenization first filters empty strings and then does a second pass to turn the internal pipe placeholder back into |. The patch does both in one pass over text.split('||').

MusikAnimal · 2026-05-09T16:38:19Z

I'm still trying to get [[Google Play]] to wok locally.

I went by https://github.com/wikimedia/wikiwho_api/blob/main/SETUP.md for setting up the API. Then to test this PR, I first removed the wikiwho package with pip uninstall wikiwho. I then changed requirements.txt to point to the local clone of WikiWho with this PR checked out, ala ../WikiWho, and re-ran pip install -r requirements.txt. Then I started the server with python manage.py runserver.

Then finally, I tried to load Google Play with http://localhost:8000/en/api/v1.0.0-beta/latest_rev_content/Google_Play/?o_rev_id=true&editor=true&token_id=true&out=true&in=true

Then it just hangs indefinitely. I don't even see output in the logs. Things work okay though for smaller articles. I am guessing the slow part is it hitting the API. I recall in the past such requests were logged in the server output, but I don't see it now. Maybe my issue is the volume of API requests is too great, possibly also made worse by API limitations recently deployed at Wikimedia. (Such limits don't apply to Wikimedia Cloud Services where production WikiWho is hosted.)

How did you test it? Something similar to the above, or are you somehow testing directly against the WikiWho package?

QZGao · 2026-05-10T10:22:36Z

The API is just not very testable on "Google Play" page due to the abysmal amount of 429 Too Many Requests. I directly ran a test script on the WikiWho package in previous tests that cache everything before timer starts. Around 60 seconds on "Google Play" page is the time it takes for the WikiWho package to run on an already fetched revision history.

QZGao · 2026-05-10T12:58:44Z

I have just recognized one more place that could potentially optimize the speed of the diff algorithm. Let me test a bit.

QZGao · 2026-05-10T13:39:24Z

I found that Differ is way too slow on some revisions. Because WikiWho only needs token equal/delete/insert operations and does not use Differ's expensive hint/fancy-replace output, it is cheaper to run SequenceMatcher instead.

QZGao · 2026-05-10T14:07:30Z

I modified the WikiWho API a bit so that it can respect Wikimedia's Retry-After policy, then ran accordingly ( https://github.com/wikimedia/wikiwho_api/blob/main/SETUP.md ). One ran on [[Google Play]] took 10 minutes in total, while retry waits took around 4 minute, therefore the actually time cost on [[Google Play]] would be 6 minute on the Wikimedia Cloud Services server.

The modified fork: https://github.com/QZGao/wikiwho_api. If you'd like it to be reused, we can make it a new Pull Request, though it's only going to have an impact when running locally.

MusikAnimal · 2026-05-10T23:32:58Z

Thanks for the thorough explanations, and commitment to find ways to speed up the algorithm! I'll finish the review hopefully tonight and get zh started shortly thereafter.

The modified fork: https://github.com/QZGao/wikiwho_api. If you'd like it to be reused, we can make it a new Pull Request, though it's only going to have an impact when running locally.

That would be great :) I'll use your fork for testing this PR for the time being.

QZGao · 2026-05-11T15:19:43Z

That would be great :)

The PR for the API fork is here: wikimedia/wikiwho_api#28.

MusikAnimal · 2026-05-12T21:37:29Z

After using wikimedia/wikiwho_api#28, I was able to get [[Google Play]] to load! :)

I then started comparing and contrasting the WikiWho output for various other articles before/after this patch. As I understood it, tokenization shouldn't change, rather just performance improvements, right?

Take Adam Himebauch for example, which is a small article that I started. At minimum the token_ids are different, but it seems there's a few places where the rev IDs are different, for example:

Currently in production:

{"str":"'","o_rev_id":678457635,"editor":"14882394","token_id":2,"in":[747629772],"out":[747627155]}

and

{"str":"adam","o_rev_id":1162559322,"editor":"36240525","token_id":5552,"in":[],"out":[]}

Compared to my local with this PR checked out:

{"str":"'","o_rev_id":1162559322,"editor":"36240525","token_id":5551,"in":[],"out":[]}

and

{"str":"adam","o_rev_id":1315073789,"editor":"26944994","token_id":6197,"in":[],"out":[]}

It seems the markup is attributed differently now (it may be easier to just look for the second token instance of the string adam), and the preceding ' token has an in and out before, but doesn't now.

QZGao · 2026-05-13T01:45:10Z

I then started comparing and contrasting the WikiWho output for various other articles before/after this patch. As I understood it, tokenization shouldn't change, rather just performance improvements, right?

3c734ab replaced Differ with SequenceMatcher. This is the reason why tokenization result is different. I did mention that they were identical, but that is no longer the case with this change.

Here I asked Codex to compare SequenceMatcher and Differ with several sampled revisions and concluded that the results did not change much:

Case 1: 480556009
URL/reference expansion around support.google.com, changing the old support URL into a longer support.google.com/googleplay/... URL.

Original Differ: reuses support and the first ., but marks google, ., com, and googleplay as newly added in 480556009.

SequenceMatcher: reuses the unchanged domain tokens support . google . com /; only the genuinely new path pieces like googleplay, bin, answer, etc. are new.

Comparison: SequenceMatcher is more intuitive. The domain survived the edit; original Differ breaks authorship inside the stable domain.

Case 2: 690409934
Availability-table cleanup, e.g. {{ yes | yes }} becoming {{ yes }} and {{ no | no }} becoming {{ no }}.

Original Differ: reuses the core table/template tokens around {{ yes }} and {{ no }} and deletes the redundant parameter-style pieces.

SequenceMatcher: does the same qualitative thing: keeps the core {{ yes }} / {{ no }} tokens from prior history instead of treating them as new.

Comparison: correctness is roughly equivalent here. SequenceMatcher preserves the intuitive Differ behavior while being faster in the measured run.

Case 3: 896108303
Large repetitive availability-table update dominated by |, {{, }}, yes, no.

Original Differ: stalls in difflib._fancy_replace; no completed token mapping should be claimed.

SequenceMatcher: completes and uses the nearest-token fallback for the repeated-token region.

Comparison: SequenceMatcher wins operationally because it produces a result. There is no original-Differ correctness output to compare against here, because original Differ does not finish in the local test scenario.

This replacement from Differ to SequenceMatcher made [[Google Play]] able to work within time, so I believe it to be worth some discrepancy.

QZGao · 2026-05-13T02:07:19Z

After 5205df6 consistency patch, this is the result against Adam Himebauch:

{"str":"'","o_rev_id":678457635,"editor":"14882394","token_id":2,"in":[747629772],"out":[747627155]}

and

{"str":"adam","o_rev_id":1315073789,"editor":"26944994","token_id":6824,"in":[],"out":[]}

Now identical with the production.

xenacode-art · 2026-05-14T09:03:26Z

The single-pass cursor approach in analyse_words_in_sentences() is correct and well-reasoned — since Differ.compare() emits items in alignment order, the two-index scan is a clean O(n) replacement. The utils.py changes (module-level constants, combined filter/restore) are straightforward improvements too.

The main thing I'd want to be confident about is the Differ → SequenceMatcher switch. The performance case is clear — the O(n²) behavior was a real blocker for large articles. But since this is an authorship attribution tool, the behavioral difference matters. From the case-by-case analysis already shared, it looks like SequenceMatcher generally produces more intuitive results, but it would be good to know: is there a test suite that captures authorship output for known articles we can diff before/after? If not, it might be worth adding a few regression fixtures so future changes can be validated against this baseline.
Overall the direction looks right to me.

MusikAnimal · 2026-05-14T16:47:55Z

There are no such tests that I'm aware of, but that does sound worthwhile if @QZGao is up for it. No pressure though; I will revisit this PR today and will manually test a bunch more articles in various languages. If that all looks okay, then I'm happy to merge.

I do want to stress that the algorithm is sacred, and no changes to how it tokenizes should be made unless we are certain it is improving accuracy.

xenacode-art · 2026-05-14T17:41:23Z

I can take a shot at adding regression fixtures for this. The idea would be: pick a handful of small articles with stable authorship history, capture their WikiWho token output with the current production behavior as golden files, then write tests that verify the new code matches. Would that be a useful format @MusikAnimal, or is there an existing test structure in the repo I should follow?

MusikAnimal · 2026-05-14T19:37:27Z

I can take a shot at adding regression fixtures for this. The idea would be: pick a handful of small articles with stable authorship history, capture their WikiWho token output with the current production behavior as golden files, then write tests that verify the new code matches. Would that be a useful format @MusikAnimal, or is there an existing test structure in the repo I should follow?

Sounds useful to me! There are no tests for the repository as far as I can tell. Given my Python inexperience, I defer to the two of you with respect to best practices. I will be doing some manual comparisons, regardless, so no pressure if it seems like it's too much work.

I will also run the Copilot on this PR. Again, we all know AI can be wrong, so don't feel like you need to address everything, or even anything at all. In fact, in my limited experience so far, Copilot will always find something to complain about :-P

Thanks so much for the hard work! I'm excited to get these performance improvements deployed.

Copilot

Pull request overview

This PR aims to fix the quadratic time complexity in analyse_words_in_sentences() (Phabricator T342805), which dominated runtime for revisions with tens of thousands of tokens. According to the PR description, the intent is a single ordered pass over difflib.Differ.compare() with two integer cursors. In practice, the PR replaces Differ entirely with a different algorithm (prefix/suffix trim + SequenceMatcher.get_opcodes() gated by a drift bound + a greedy nearest-neighbor matcher with several heuristic thresholds). Smaller refactors include replacing self.temp list-based duplicate counting with local dict counters, introducing a spam_hashes_set mirror for O(1) membership, and tidying split_into_tokens.

Changes:

Replaces the Differ-based word matching in analyse_words_in_sentences with a SequenceMatcher + drift-bounded nearest-neighbor heuristic, plus a flat curr_slots precomputation.
Adds spam_hashes_set/_add_spam_revision to give O(1) spam-hash lookups, and converts paragraph/sentence duplicate counting from self.temp list counts to per-call dicts.
Hoists the token-symbol list out of split_into_tokens into module-level TOKEN_SYMBOL_REPLACEMENTS and simplifies the empty-string filter.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
WikiWho/wikiwho.py	Rewrites word-level diff matching with a new heuristic algorithm and accompanying helpers; refactors spam-hash and duplicate counting bookkeeping.
WikiWho/utils.py	Pulls token symbols and their replacement pairs to module scope; folds the empty-token filter into a single comprehension.

Comments suppressed due to low confidence (1)

WikiWho/wikiwho.py:153

The drift cap on 'equal' opcodes — abs((prev_mid_start + i1) - (curr_mid_start + j1)) <= max_drift — silently discards a perfectly valid equal block reported by SequenceMatcher whenever the block has shifted by more than max(50, 10% * max(len_prev, len_curr)). Those tokens then fall through to the replace/equal branch on the next line only if the same opcode is 'replace' (it isn't — it's 'equal'), so the entire equal block is effectively unmatched and every token in it will be re-created as a brand new Word with a new token_id. For an article where a section is moved, this means every token in the moved section loses its authorship lineage.

The original Differ-based code had no such positional cap and would match these tokens. If a drift cap is genuinely necessary for performance, the discarded equal blocks should at least be re-fed through _nearest_word_matches (or simply matched directly, since SequenceMatcher already certified them as a 1-to-1 equal run) rather than dropped.

            for tag, i1, i2, j1, j2 in matcher.get_opcodes():
                if tag == 'equal' and abs((prev_mid_start + i1) - (curr_mid_start + j1)) <= max_drift:
                    for prev_index, curr_index in zip(range(i1, i2), range(j1, j2)):
                        prev_for_curr[curr_mid_start + curr_index] = prev_mid_start + prev_index
                elif tag in ('replace', 'equal') and (i2 - i1) * (j2 - j1) <= WORD_MATCH_MAX_LOCAL_PAIRS:
                    local_matches = _nearest_word_matches(prev_mid[i1:i2],
                                                          curr_mid[j1:j2],
                                                          prev_mid_start + i1,
                                                          curr_mid_start + j1,
                                                          max_drift)
                    for curr_index, prev_index in local_matches.items():
                        prev_for_curr[curr_mid_start + j1 + curr_index] = prev_mid_start + i1 + prev_index

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

QZGao · 2026-05-17T23:54:10Z

This is only a fraction of the article, but as we can see, the differences are rather dramatic. Some examples show the new algorithm is better, others show it is worse.

Thanks for more test cases. I'll experiment with those test cases to see if they are patch-able. Differ is currently the major obstacle for the huge articles like [[Google Play]] to run successfully, so I'll still try my best to improve the algorithm based on SequenceMatcher.

The thing is, despite its flaws, the older algorithm has been widely battle-tested and trusted for over a decade now. In an earlier version of this PR (before 3c734ab I think), you said you were able to offer performance improvements without changing the algorithm. Would it be better to start with just that, perhaps?

Indeed. You can checkout 89ca645 as that's the last commit without differ changes.

QZGao · 2026-05-17T23:57:06Z

In the meantime, should I go ahead and process zhwiki? I was waiting for this to get the performance improvements since CJK processes much slower now that each character is tokenized, but given the risks, maybe it's better to just move forward with the older algorithm. You decide :)

We could wait a day or two more, as I'm still motivated to improve the algorithm at the moment. Let's see if those worse case situations are solvable or not.

QZGao · 2026-05-18T12:21:20Z

With 7884b1e, the word matcher now tracks match confidence while assigning tokens. The new helper is _assign_word_match(). Every match records three things together: which previous token a current token maps to, how confident that match is, and which current token is already using a previous token. This matters because the matcher now has several passes that can overlap. A local nearest-token match should not block a better moved-run match, and a moved-run match should not override a hard prefix/suffix edge match. The confidence values encode that ordering.

The main new behavior is _recover_moved_word_runs(). After prefix/suffix matching and SequenceMatcher, the code records the non-equal opcode ranges into move_prev_spans and move_curr_spans. Those are the only regions searched for moved text. The recovery pass builds n-gram indexes for those spans using sizes 10, 8, 6, 4, 3. It only indexes windows with enough informative tokens, so punctuation-heavy windows do not seed a moved match.

When the same n-gram appears exactly once on the previous side and exactly once on the current side, it becomes a candidate. The code extends that candidate left and right while adjacent tokens still match. Then it checks whether the run is safe to treat as a moved copy. That check takes the longest informative content core from the run and confirms that the core appears exactly once in the full previous article text and exactly once in the full current article text.

That full-article uniqueness check is the expensive part. It is why full-history runs on large pages are now about 25-30% slower. The pair-index cache in _count_subsequence_cached() reduces the cost by indexing token pairs and only counting up to 2, but the matcher is still doing real extra work that the old code did not do.

The link fix is _recover_edited_link_boundaries(). The existing keying makes [[ and ]] depend on the full link target. That is good for avoiding random delimiter matches, but it breaks when a link target is extended. For example, if [[Category:Video games]] becomes [[Category:Video games developed in Japan]], the words Category:Video games match, but the delimiters do not because the full target changed. The new pass scans link spans after normal token matching. If the old unpiped target is still a contiguous, substantial part of the new target, and those target tokens are already matched, it reattaches the old [[ and ]].

The suffix fix is _demote_stale_suffix_edge_matches(). Prefix/suffix matching is usually correct, but it can keep very old short words alive after a large deletion right before an unchanged suffix. The code now handles only pure deletions before a suffix. It looks at the first small window of that suffix and unassigns old alphabetic glue tokens like to, in, or of when the previous Word already has old lineage. It does not touch content words, punctuation, or replacement edits.

The tradeoff: moved-run recovery gives better attribution for relocated copied text, but it costs time. The link-boundary and stale-suffix fixes are cheap. The 25-30% slowdown mostly comes from the moved-run uniqueness checks.

QZGao · 2026-05-18T12:35:04Z

I made all previous analyzed situations test cases, and all failures are cleared now.

#	Article	Case Description	Pass	Values	Origin Revs	Last Revs	Inbound	Outbound
1	Adam Himebauch	Phrase `Beginning in 2022`	Yes	`beginning`, `in`, `2022`	`1082167926`, `1082167926`, `1082167926`	`1316823741`, `1316823741`, `1316823741`	`[]`, `[]`, `[]`	`[]`, `[]`, `[]`
2	Adam Himebauch	First `Hanksy` mention	Yes	`hanksy`	`1315073789`	`1316823741`	`[]`	`[]`
3	Adam Himebauch	First NYT reference title token `A`	Yes	`a`	`1316823741`	`1316823741`	`[]`	`[]`
4	Adam Himebauch	`https` token in first NYT URL	Yes	`https`	`767656880`	`1316823741`	`[]`	`[]`
5	Adam Himebauch	Rest of first NYT URL after `https`	Yes	`:`, `/`, `/`, `www`, `.`, `nytimes`, `.`, `com`, `/`, `2014`, `/`, `02`, `/`, `16`, `/`, `nyregion`, `/`, `a`, `-`, `parodist`, `-`, `who`, `-`, `calls`, `-`, `himself`, `-`, `hanksy`, `.`, `html`	all `678458561`	all `1316823741`	all `[]`	all `[]`
6	Adam Himebauch	`The` in `The image went viral`	Yes	`the`	`754869525`	`1316823741`	`[]`	`[]`
7	Adam Himebauch	Reference title `Reverend Jen exclusive on Hanksy`	Yes	`reverend`, `jen`, `exclusive`, `on`, `hanksy`	all `754869525`	all `1316823741`	all `[]`	all `[]`
8	Adam Himebauch	Date range `Between 2011 and 2017`	Yes	`between`, `2011`, `and`, `2017`	`1162525604`, `754869525`, `1162525604`, `1162559322`	all `1316823741`	all `[]`	all `[]`
9	Adam Himebauch	Words `created` and `images` in the street-art sentence	Yes	`created`, `images`	`754869525`, `754869525`	`1316823741`, `1316823741`	`[]`, `[]`	`[]`, `[]`
10	Adam Himebauch	Phrase `exhibited across the country`	Yes	`exhibited`, `across`, `the`, `country`	all `754869525`	all `1316823741`	all `[]`	all `[]`
11	Adam Himebauch	Second NYT title `A Parodist Who Calls Himself Hanksy`	Yes	`a`, `parodist`, `who`, `calls`, `himself`, `hanksy`	all `1078016622`	all `1316823741`	all `[]`	all `[]`
12	Japan Cup	Token `over` in rewritten distance wording	Yes	`over`	`24011298`	`112007899`	`[]`	`[112010328]`
13	Splatoon 3	Lead title `Splatoon 3`, not the infobox title field	Yes	`splatoon`, `3`	`1007397055`, `1007397055`	`1007397695`, `1007397695`	`[]`, `[]`	`[]`, `[]`
14	Splatoon 3	Category `[[Category:Video games developed in Japan]]` split between reused broad category and new words	Yes	`[[`, `category`, `:`, `video`, `games`, `developed`, `in`, `japan`, `]]`	`1007397440`, `1007397440`, `1007397440`, `1007397440`, `1007397440`, `1007397695`, `1007397695`, `1007397695`, `1007397440`	all `1007397695`	all `[]`	all `[]`
15	2026 Canvas security incident	Short-description closing token `}}` stays continuous	Yes	`}}`	`1353079526`	`1354204849`	`[]`	`[]`
16	2026 Canvas security incident	Adjacent duplicate `said said`	Yes	`said`, `said`	`1353079526`, `1353079526`	`1353090711`, `1353090491`	`[]`, `[]`	`[]`, `[1353090711]`
17	American goldfinch	Original period removed when the length phrase is inserted	Yes	`.`	`24921`	`122039732`	`[]`	`[122044470]`
18	American goldfinch	Image link opener `[[` preserved while image target stays same	Yes	`[[`	`859274`	`15936180`	`[]`	`[18645783]`
19	American goldfinch	Image link opener `[[` resets after image target changes	Yes	`[[`	`18645783`	`18692594`	`[]`	`[20553313]`
20	American goldfinch	Taxobox opener `{{` resets on template conversion	Yes	`{{`	`37287458`	`37287458`	`[]`	`[]`
21	Nintendo Switch	Apostrophe in `Don't` from Article Wizard comment, not later title markup	Yes	`'`	`651807915`	`651992792`	`[651992792]`	`[651930777, 651993021]`
22	Through the Looking-Glass	Stub token `to` in `Sequel to ...` should not survive final mature lead	Yes	`to`	`291417`	`1268074738`	`[]`	`[1273903188]`

QZGao · 2026-05-18T12:36:42Z

@MusikAnimal Please, compare more revisions and we can see if there are any more improvements that should be made.

QZGao · 2026-05-18T12:40:53Z

Let me also ask Copilot to review it.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

xenacode-art · 2026-05-21T06:22:56Z

Ran regression tests against the latest commit (7884b1e). Results vs master:

Article	Master tokens	This PR	Δ
Adam Himebauch	8,572	8,341	−231
Splatoon 3	37,819	37,817	−2

Splatoon 3 has nearly converged with master (−2). The larger Adam Himebauch gap (−231 vs −72 in the previous round) reflects the moved-run recovery in 7884b1e successfully matching more token occurrences as reuses of existing tokens rather than creating new entries — which is what improved attribution looks like in the stored token count.
Tests are running in 17s on these two articles. Happy to add more articles to the fixture set if useful.

MusikAnimal · 2026-05-26T08:57:25Z

Sorry for the delay! I've been out on holiday, then my team at WMF was disbanded :( Rest assured though, I will see this through, or at least the (re-)adding CJK languages to WikiWho. I hope to get to this review this week but I am very busy trying to delegate CommTech projects before the team is officially no more. WikiWho is safe, for the record. I will continue to maintain it with Sage Ross.

ragesoss

Note on this feedback: Parts of this review were produced with Claude Code (parity experiment, drafting). It's offered as input to read, understand, and use or set aside on your own judgment — not as authoritative conclusions. The central numbers below are reproducible; the interpretation is provisional.

The performance fix is real and worth having. The old analyse_words_in_sentences was genuinely O(n²) — the post-Differ triple rescan plus self.temp.count() in the duplicate-paragraph/sentence paths — and the worst-case revisions (the ~36s Google Play case) were a real problem. The spam_hashes_set, the duplicate-count dicts, the module-level token-symbol list, and the single-pass slot assignment are all sound and uncontroversial improvements.

The concern is that this PR also replaces the matching algorithm, and that change is consumer-visible. The diff drops difflib.Differ for bare SequenceMatcher.get_opcodes() plus a new stack of heuristics (a confidence ladder, wikitext-aware match keys, bounded nearest-neighbor recovery, n-gram moved-run recovery, edited-link-boundary recovery, stale-suffix demotion). Differ is not SequenceMatcher — Differ adds a character-level _fancy_replace layer on top of it — so opcode-level matching produces different token-to-token assignments even before the new heuristics are considered.

To measure the impact, each article's full revision history was replayed through both the base (Differ) and PR versions of the library, and the per-token output of the target revision was diffed. Structural state (revisions processed, spam IDs, paragraph/sentence hash tables) is byte-for-byte identical between the two. Per-token attribution is not:

Article	revs	`o_rev_id` differs (all tokens)	`o_rev_id` differs (prose words only)	inbound/outbound differs
Photosynthesis	5.4K	19.3%	14.5%	14.5%
Wikipedia (simplewiki)	3.5K	4.9%	3.7%	37.5%
Jesse Owens	6.4K	6.2%	4.8%	2.5%

The "prose words only" column excludes wikitext punctuation/markup, so this is not just template-syntax reshuffling — content words change attributed origin revision (and therefore editor) for 4–15% of the article. Examples from Photosynthesis: schematic of photosynthesis, vegetation; from Jesse Owens: athletics at the summer olympics. Three of WikiWho's four downstream consumers (the Wiki Education Dashboard's ArticleViewer, XTools Authorship/Blame, the WhoWroteThat gadget) render attribution per token, so this surfaces as different coloring / different "added by" results. The inbound/outbound deltas feed WhoColor conflict scores.

Important: these numbers measure divergence from the current algorithm, not correctness. The PR's stated aim ("Improve word-level authorship matching for moved text") is to attribute better than Differ — and some fraction of these differences may well be improvements (moved text credited to its original author rather than the mover is a plausible win). But which differences are improvements and which are regressions has not been established here, and that question is not cheap to answer — it requires sampling specific divergences and judging editing intent case by case.

Why the PR's own verification doesn't settle this. The description verifies matched_prev count, the vandalism flag, token_id_delta, tokens_delta, and per-sentence word totals on one revision (Google Play 1296988276). Those are aggregate counts on a single diff step. Two matchers can produce the same number of matched tokens while assigning them to different origin revisions — which is exactly what the table above shows (the structural counts match; the per-token o_rev_ids don't). Aggregate-count parity on one revision can't detect attribution drift across a history.

Suggestions, in rough priority:

Consider splitting the PR. The performance fixes (set-based spam lookup, duplicate-count dicts, single-pass slot assignment, module-level constants) are separable from the matcher replacement and could merge on their own with little risk. That isolates the part that needs scrutiny.
If the algorithm change is intended, characterize it before merging. A per-token replay over a corpus of full histories (the method used above) quantifies the drift; sampling N divergent tokens and judging better/worse gives a quality signal. Because the change is consumer-visible, the maintainers of the per-token consumers above probably want a say before this lands.
Tests. This adds ~860 lines to the core matcher in a repo with no test suite or CI. Even a handful of unit tests on the new helpers, plus one full-history regression check, would make the behavior reviewable and pin it against future drift.
The description is stale. It describes a single-pass cursor optimization; the PR is a matcher rewrite. The note at the bottom acknowledges this, but a reader still has to reverse-engineer the actual change. (Copilot's review raised the same point, plus the _nearest_word_matches nearest-position concern and the unreferenced WORD_MATCH_* constants — worth a look.)

Drafted in a Claude Code session directed by Sage. The attribution-drift figures were produced by replaying each article's full history through both the base and PR builds of the library and diffing per-token output; that comparison is reproducible and was run during the session, not taken on faith. Sage directed the review and the experiment and read this comment before posting. The reading of the 860-line diff and the per-token analysis were done by Claude Code and were not independently re-derived by hand; the divergence-vs-correctness question is left open, not resolved.

ragesoss · 2026-06-05T22:17:48Z

Follow-up to the review above. Same caveats: AI-assisted, provisional, offered as input to weigh rather than a verdict.

The earlier comment measured how much attribution changes but left open whether the changes are improvements or regressions. A closer look at where the divergence actually lands, on the same three article histories:

Most of the divergence is on tokens where per-token authorship is arbitrary, or isn't "prose authorship" at all. Across the three articles, 90–98% of the tokens whose o_rev_id changed have a value that recurs elsewhere in the article — the ambiguous-match case, where which prior the / | / of a current token descends from is genuinely undefined. 55–68% sit inside templates/citations or are markup delimiters (|, =, {{, [[). Differ's assignments there were never well-specified either, so this is churn between two equally-arbitrary answers, not a regression.

On the smaller slice that is genuine prose, the PR's attribution looks more correct, not less. 38–50% of divergent tokens fall in contiguous runs of 5+ positions — the fingerprint of moved or reformatted passages. Tracing specific content words back through the actual revision text:

Phrase	First authored in	`Differ` (current) credits	PR credits
"Charles Riley" (Jesse Owens)	95595 (2002)	18254337 (2005, "tweaks")	95595 ✓
"Adi Dassler" (Jesse Owens)	234610129 (2008)	567245817 (2013, "Filling in 3 references using Reflinks")	234610129 ✓
"growing [less quickly]" (simplewiki)	5022409 (2015)	8311103 (2022, "fixed … 5x extra whitespace")	5022409 ✓

The pattern is consistent: Differ mis-credits maintenance edits — whitespace fixes, reference-filling, tweaks — that reshaped text without authoring it, and the PR's moved-run recovery traces the prose back to where it was actually written. That is a real, longstanding WikiWho weakness, and it is the dimension this PR targets.

This is not a clean sweep. The PR can also over-reach in the other direction — a couple of words adjacent to the simplewiki passage were pulled to a 2009 revision that predates the phrase, via a spurious duplicate match. And this is four hand-checked cases, not a measured rate; a larger better/worse sample would firm it up (and is a reasonable thing to ask for before relying on the new behavior).

Net: the earlier "changes attribution for X% of tokens" figure overstates the consumer-meaningful impact, and the meaningful part appears to be a quality improvement. The suggestions from the prior comment still stand — separating the uncontroversial performance fixes from the matcher change, adding tests, and getting sign-off from the per-token consumers — but the matcher change now looks worth pursuing on its merits, not merely tolerating as a risk.

Drafted in a Claude Code session directed by Sage. The "first authored in" revisions were verified by searching each phrase against the actual stored revision text of the article's history; that check is reproducible. The where-divergence-lands breakdown was computed by replaying both library builds over full histories and classifying every changed token. Sage directed the analysis and reviewed this comment before posting; the classification and per-token analysis were done by Claude Code and not independently re-derived by hand. A quantified better/worse/wash rate over a large sample has not yet been computed — the "looks more correct" claim rests on the four verified cases plus the structural and directional evidence.

ragesoss · 2026-06-05T22:30:41Z

Quantifying the better/worse question (delivering on "a larger sample would firm it up" from the comment above; same caveats — AI-assisted, provisional).

Adjudicating a single token's "true author" automatically is circular (it is the attribution problem), so the test used here sidesteps it: for each prose divergence, take a short context window and ask whether each credited revision actually introduced that text (window present in the revision, absent in its parent) or merely inherited it (window already in the parent — a later edit that reshaped nearby text without writing this).

Among the cases where exactly one of the two credited revisions actually introduced the prose:

Article	decisive cases	PR introduced / `Differ` only inherited	`Differ` introduced / PR inherited
Photosynthesis	50	49 (98%)	1
Wikipedia (simplewiki)	4	4 (100%)	0
Jesse Owens	51	48 (94%)	3

So where it's determinable, the PR credits the true author ~94–100% of the time and Differ is the one inheriting — consistent with the reformatting-steals-authorship pattern across the board, not just the hand-picked cases.

Caveats, to keep this honest:

The test only adjudicates ~20–30% of prose divergences; the rest are indeterminate because reformatting changed the exact wording, so the window isn't stably present in one credited revision. Those aren't PR losses — just not cleanly decidable this way.
This covers the prose slice (~9–12% of total divergence); the other ~90% is the arbitrary tier (duplicated tokens, markup, citation internals) where neither matcher's choice was ever well-defined.
The PR isn't perfect — it loses a few decisive cases per article (the over-reach-via-duplicate-match direction noted earlier), so this is ~94–100%, not 100%.

Net of all three comments: the performance fix is clearly worth having; the matcher change is a real attribution-quality improvement on moved/reformatted text, measurable and in the right direction. The earlier suggestions stand — split the uncontroversial perf work from the matcher change, add tests (the introduced-vs-inherited check above is itself a usable regression-test shape), and loop in the per-token consumers — but the matcher work looks worth landing on its merits.

Drafted in a Claude Code session directed by Sage. The introduced-vs-inherited adjudication was run over sampled prose divergences from full-history replays of three articles, checking each credited revision's text against its parent's; reproducible. Sage directed the analysis and reviewed this before posting. The method and code were produced by Claude Code; the ~94–100% figure is the machine-computed rate on the decisive subset, not independently re-derived by hand beyond the four cases verified in the previous comment.

…t/WikiWho into optimization

QZGao · 2026-06-06T01:37:03Z

I'll merge #3 into this branch (to be consistent with existing test suite), then add dedicated test cases I previously mentioned.

QZGao · 2026-06-06T02:50:18Z

Now the test cases are also implemented into the branch.

MusikAnimal · 2026-06-08T17:49:12Z

I'll merge #3 into this branch (to be consistent with existing test suite), then add dedicated test cases I previously mentioned.

Now the test cases are also implemented into the branch.

Sorry I missed these comments. I merged #3 already.

Copilot

Pull request overview

Copilot reviewed 7 out of 17 changed files in this pull request and generated no new comments.

MusikAnimal · 2026-06-08T23:28:22Z

This PR does change algorithm, correct?

I'm wondering what folks think of just moving forward without this. #4 by itself improves performance. I just worry changing the way things are attributed is going to cause confusion, for better or for worse.

I am also out of time. I am happy to run the script to import the CJK languages and do one more final release of downstream projects, but beyond that, I don't really have much energy left to invest in testing this PR. Perhaps Sage is able to, however.

Let me know what you all think :)

…zation

QZGao · 2026-06-08T23:36:06Z

The algorithm is already outdated -- and as people have been noticing, more pages will gradually start to be unable to render (especially on enwiki) just like [[Google Play]]. Therefore, I would highly recommend we optimize the algorithm now, while we are at it.

ragesoss · 2026-06-08T23:49:22Z

Yeah, I'll keep up with this PR @MusikAnimal. From what I've tested so far, I do think the algorithm changes are going to improve attribution.

QZGao mentioned this pull request May 7, 2026

Fix CJK tokenization #1

Merged

QZGao marked this pull request as draft May 7, 2026 02:28

QZGao marked this pull request as ready for review May 7, 2026 08:02

QZGao commented May 7, 2026

View reviewed changes

Replace Differ with SequenceMatcher

3c734ab

Patch SequenceMatcher for consistency

5205df6

MusikAnimal requested a review from Copilot May 14, 2026 19:37

Copilot started reviewing on behalf of MusikAnimal May 14, 2026 19:38 View session

Copilot AI reviewed May 14, 2026

View reviewed changes

Comment thread WikiWho/wikiwho.py Outdated

Comment thread WikiWho/wikiwho.py

Comment thread WikiWho/wikiwho.py

Comment thread WikiWho/wikiwho.py

QZGao added 3 commits May 14, 2026 15:03

Add several documentations

65ad75f

Keep scanning and let both sides stop independently

c7900e9

Prevent SequenceMatcher equal from rejection by max_drift

3a87d77

Improve word-level authorship matching for moved text

7884b1e

QZGao requested a review from Copilot May 18, 2026 12:41

Copilot started reviewing on behalf of QZGao May 18, 2026 12:41 View session

Copilot AI reviewed May 18, 2026

View reviewed changes

ragesoss reviewed Jun 5, 2026

View reviewed changes

QZGao mentioned this pull request Jun 6, 2026

Optimize diff algorithm's time complexity, part 1 #4

Merged

QZGao changed the title ~~Optimize diff algorithm's time complexity~~ Optimize diff algorithm's time complexity, part 2 Jun 6, 2026

Merge branch 'add-regression-tests' of https://github.com/xenacode-ar…

3eba492

…t/WikiWho into optimization

QZGao added 2 commits June 5, 2026 19:06

Add test cases for expected authorship behaviors

51e9ed6

Regenerate golden fixtures for the new algorithm

15ccfe6

MusikAnimal requested a review from Copilot June 8, 2026 17:49

Copilot started reviewing on behalf of MusikAnimal June 8, 2026 17:49 View session

Copilot AI reviewed Jun 8, 2026

View reviewed changes

Merge branch 'master' of https://github.com/QZGao/WikiWho into optimi…

fdfb60f

…zation

Conversation

QZGao commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

QZGao commented May 7, 2026

Uh oh!

QZGao commented May 7, 2026

Uh oh!

QZGao commented May 7, 2026

Uh oh!

QZGao commented May 7, 2026

Uh oh!

QZGao left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

MusikAnimal commented May 9, 2026

Uh oh!

QZGao commented May 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

QZGao commented May 10, 2026

Uh oh!

QZGao commented May 10, 2026

Uh oh!

QZGao commented May 10, 2026

Uh oh!

MusikAnimal commented May 10, 2026

Uh oh!

QZGao commented May 11, 2026

Uh oh!

MusikAnimal commented May 12, 2026

Uh oh!

QZGao commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

QZGao commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

xenacode-art commented May 14, 2026

Uh oh!

MusikAnimal commented May 14, 2026

Uh oh!

xenacode-art commented May 14, 2026

Uh oh!

MusikAnimal commented May 14, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

QZGao commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

QZGao commented May 17, 2026

Uh oh!

QZGao commented May 18, 2026

Uh oh!

QZGao commented May 7, 2026 •

edited

Loading

QZGao commented May 10, 2026 •

edited

Loading

QZGao commented May 13, 2026 •

edited

Loading

QZGao commented May 13, 2026 •

edited

Loading

QZGao commented May 17, 2026 •

edited

Loading