Optimize diff algorithm's time complexity, part 2#2
Conversation
|
I ran the whole "Google Play" page locally. Without counting the request time, the local WikiWho processing itself took 62.46 s. |
|
Another run took 66.70 s total local processing:
|
|
Multiple small fixes:
|
|
Another thing that severely drags the program down is the final all-token scan. I am investigating whether we can optimize that. |
QZGao
left a comment
There was a problem hiding this comment.
I have done some investigation on the code but sadly could not find any more optimization. The one-request-that-calculates-all approach clearly has a lower bound in efficiency. If we want to get "Google Play" and "Barack Obama" to be parsed within time and memory constraints, we would need to fundamentally change some behaviors.
For example, if we want to speed up, we would want to cache previous runs on previous revisions, and only do incremental analysis. If we want to optimize memory usage, we would want to give up on storing all revisions -- maybe only keep the latest 500 revisions and delete old revisions, if we assume we already do incremental analysis.
Still, I believe the current changes already made are worthwhile. I've documented the changes in these comments.
| # Build flat (sentence, token) slots so we can assign words during the | ||
| # diff pass without re-scanning sentences or the diff list. | ||
| curr_slots = [] # list of (sentence_curr, word_value) | ||
| text_curr = [] | ||
| for sentence_curr in unmatched_sentences_curr: | ||
| # split_into_tokens is already done in analyse_sentences_in_paragraphs | ||
| words = sentence_curr.value.split(' ') | ||
| text_curr.extend(words) | ||
| sentence_curr.splitted.extend(words) | ||
| for word in words: | ||
| curr_slots.append((sentence_curr, word)) |
There was a problem hiding this comment.
In analyse_words_in_sentences, the algorithm needs to decide, for every token in the current revision, which Sentence object should receive either a reused old Word object or a newly created Word object. Originally, after computing the diff, the code loops back through unmatched_sentences_curr and each sentence’s split tokens to rediscover that sentence/token position while also scanning the diff list. This patch builds curr_slots, a flat ordered list of (sentence_curr, word) pairs, at the same time it builds text_curr. This preserves the same token order as text_curr, and also remembers the destination sentence for each token.
| # Edit consists of adding new content, not changing/removing content | ||
| if not text_prev: | ||
| for sentence_curr in unmatched_sentences_curr: | ||
| for word in sentence_curr.splitted: | ||
| word_curr = Word() | ||
| word_curr.value = word | ||
| word_curr.token_id = self.token_id | ||
| word_curr.origin_rev_id = self.revision_curr.id | ||
| word_curr.last_rev_id = self.revision_curr.id | ||
|
|
||
| sentence_curr.words.append(word_curr) | ||
| self.token_id += 1 | ||
| self.revision_curr.original_adds += 1 | ||
| self.tokens.append(word_curr) | ||
| for sentence_curr, word in curr_slots: | ||
| word_curr = Word() | ||
| word_curr.value = word | ||
| word_curr.token_id = self.token_id | ||
| word_curr.origin_rev_id = self.revision_curr.id | ||
| word_curr.last_rev_id = self.revision_curr.id | ||
| sentence_curr.words.append(word_curr) | ||
| self.token_id += 1 | ||
| self.revision_curr.original_adds += 1 | ||
| self.tokens.append(word_curr) | ||
| return matched_words_prev, possible_vandalism |
There was a problem hiding this comment.
Pure-addition edits are the case where there are current tokens but no previous unmatched tokens to compare against. Previously, the code loops through unmatched_sentences_curr and then through sentence_curr.splitted to create new Word objects. Since the branch already built curr_slots, it can create the same new Word objects from that flat ordered list instead. This removes duplicated traversal logic.
| # Single pass diff | ||
| prev_index = 0 | ||
| curr_index = 0 | ||
| d = Differ() | ||
| diff = list(d.compare(text_prev, text_curr)) | ||
| for sentence_curr in unmatched_sentences_curr: | ||
| for word in sentence_curr.splitted: | ||
| curr_matched = False | ||
| pos = 0 | ||
| diff_len = len(diff) | ||
| while pos < diff_len: | ||
| word_diff = diff[pos] | ||
| if word == word_diff[2:]: | ||
| if word_diff[0] == ' ': | ||
| # match | ||
| for word_prev in unmatched_words_prev: | ||
| if not word_prev.matched and word_prev.value == word: | ||
|
|
||
| word_prev.matched = True | ||
| curr_matched = True | ||
| sentence_curr.words.append(word_prev) | ||
| matched_words_prev.append(word_prev) | ||
| diff[pos] = '' | ||
| pos = diff_len + 1 | ||
| break | ||
| elif word_diff[0] == '-': | ||
| # deleted | ||
| for word_prev in unmatched_words_prev: | ||
| if not word_prev.matched and word_prev.value == word: | ||
| word_prev.matched = True | ||
| word_prev.outbound.append(self.revision_curr.id) | ||
| matched_words_prev.append(word_prev) | ||
| diff[pos] = '' | ||
| break | ||
| elif word_diff[0] == '+': | ||
| # a new added word | ||
| curr_matched = True | ||
| word_curr = Word() | ||
| word_curr.value = word | ||
| word_curr.token_id = self.token_id | ||
| word_curr.origin_rev_id = self.revision_curr.id | ||
| word_curr.last_rev_id = self.revision_curr.id | ||
|
|
||
| sentence_curr.words.append(word_curr) | ||
| self.token_id += 1 | ||
| self.revision_curr.original_adds += 1 | ||
| self.tokens.append(word_curr) | ||
| diff[pos] = '' | ||
| pos = diff_len + 1 | ||
| pos += 1 | ||
|
|
||
| if not curr_matched: | ||
| word_curr = Word() | ||
| word_curr.value = word | ||
| word_curr.token_id = self.token_id | ||
| word_curr.origin_rev_id = self.revision_curr.id | ||
| word_curr.last_rev_id = self.revision_curr.id | ||
| sentence_curr.words.append(word_curr) | ||
|
|
||
| self.token_id += 1 | ||
| self.revision_curr.original_adds += 1 | ||
| self.tokens.append(word_curr) | ||
| for diff_item in d.compare(text_prev, text_curr): | ||
| op = diff_item[0] | ||
| if op == '?': # Differ hint line; skip | ||
| continue | ||
| elif op == ' ': # next prev word is matched to next curr slot | ||
| word_prev = unmatched_words_prev[prev_index] | ||
| sentence_curr, _ = curr_slots[curr_index] | ||
| word_prev.matched = True | ||
| sentence_curr.words.append(word_prev) | ||
| matched_words_prev.append(word_prev) | ||
| prev_index += 1 | ||
| curr_index += 1 | ||
| elif op == '-': # next prev word is deleted (not present in curr) | ||
| word_prev = unmatched_words_prev[prev_index] | ||
| word_prev.matched = True | ||
| word_prev.outbound.append(self.revision_curr.id) | ||
| matched_words_prev.append(word_prev) | ||
| prev_index += 1 | ||
| elif op == '+': # next curr slot gets a brand-new Word | ||
| sentence_curr, word = curr_slots[curr_index] | ||
| word_curr = Word() | ||
| word_curr.value = word | ||
| word_curr.token_id = self.token_id | ||
| word_curr.origin_rev_id = self.revision_curr.id | ||
| word_curr.last_rev_id = self.revision_curr.id | ||
| sentence_curr.words.append(word_curr) | ||
| self.token_id += 1 | ||
| self.revision_curr.original_adds += 1 | ||
| self.tokens.append(word_curr) | ||
| curr_index += 1 | ||
|
|
There was a problem hiding this comment.
The main word-diff loop now reads Differ().compare(text_prev, text_curr) once, using prev_index and curr_index to keep the previous Word stream and current token stream aligned. This is needed because the old code matched diff rows by token text. Token text is often repeated, especially for common words, punctuation, markup characters, and template syntax. For every current token, the old loop could scan much of the diff output looking for a same-text row, then scan previous words again looking for a same-text Word object that was not already marked matched. The new loop does not search by text. It follows the order already produced by Differ: unchanged tokens consume one previous word and one current slot, deleted tokens consume one previous word, and inserted tokens consume one current slot. That removes repeated scans and makes the matching rule positional instead of “first unused word with this text”.
| text = revision.get('*', '') | ||
| rev_id = int(revision['revid']) | ||
| rev_hash = revision.get('sha1', calculate_hash(text)) | ||
| if rev_hash in self.spam_hashes: | ||
| rev_hash = revision.get('sha1') | ||
| if not rev_hash: | ||
| rev_hash = calculate_hash(text) | ||
| if rev_hash in self.spam_hashes_set: | ||
| vandalism = True |
There was a problem hiding this comment.
The JSON revision path no longer computes calculate_hash(text) unless the API response lacks sha1. In Python, function arguments are evaluated before the function call, so previously revision.get('sha1', calculate_hash(text)) still hashes the full revision text even when sha1 exists. This patch makes the fallback lazy, which avoids unnecessary full-text hashing for normal API responses.
| # Keep the public list shape while using a set for membership checks. | ||
| self.spam_hashes_set = set() | ||
|
|
||
| def _add_spam_revision(self, rev_id, rev_hash): | ||
| self.spam_ids.append(rev_id) | ||
| self.spam_hashes.append(rev_hash) | ||
| self.spam_hashes_set.add(rev_hash) |
There was a problem hiding this comment.
Spam hash lookup now uses spam_hashes_set. Spam revision recording is centralized in _add_spam_revision(...). The existing spam_hashes list is still kept (because I am not sure if any other scripts would query that), but is no longer used in the main logic. This helps when many vandalism/spam revisions have already been recorded.
| # Identify unmatched paragraphs in previous revision for further analysis. | ||
| paragraph_duplicate_counts = {} | ||
| for paragraph_prev_hash in self.revision_prev.ordered_paragraphs: | ||
| if len(self.revision_prev.paragraphs[paragraph_prev_hash]) > 1: | ||
| s = 'p-{}-{}'.format(self.revision_prev, paragraph_prev_hash) | ||
| self.temp.append(s) | ||
| count = self.temp.count(s) | ||
| count = paragraph_duplicate_counts.get(paragraph_prev_hash, 0) + 1 | ||
| paragraph_duplicate_counts[paragraph_prev_hash] = count | ||
| paragraph_prev = self.revision_prev.paragraphs[paragraph_prev_hash][count - 1] | ||
| else: | ||
| paragraph_prev = self.revision_prev.paragraphs[paragraph_prev_hash][0] | ||
| if not paragraph_prev.matched: | ||
| unmatched_paragraphs_prev.append(paragraph_prev) |
There was a problem hiding this comment.
Duplicate paragraph tracking now uses paragraph_duplicate_counts. The algorithm needs the nth paragraph object for repeated paragraph hashes in a revision. Previously, it builds a formatted string key, appends it to self.temp, then calls self.temp.count(...) to find the occurrence number. Now it uses a local dictionary counter keyed by paragraph hash, so finding the occurrence number is direct and does not scan self.temp.
| # Identify the unmatched sentences in the previous paragraph revision. | ||
| sentence_duplicate_counts = {} | ||
| for paragraph_prev in unmatched_paragraphs_prev: | ||
| for sentence_prev_hash in paragraph_prev.ordered_sentences: | ||
| if len(paragraph_prev.sentences[sentence_prev_hash]) > 1: | ||
| s = 's-{}-{}'.format(paragraph_prev, sentence_prev_hash) | ||
| self.temp.append(s) | ||
| count = self.temp.count(s) | ||
| key = (id(paragraph_prev), sentence_prev_hash) | ||
| count = sentence_duplicate_counts.get(key, 0) + 1 | ||
| sentence_duplicate_counts[key] = count | ||
| sentence_prev = paragraph_prev.sentences[sentence_prev_hash][count - 1] | ||
| else: | ||
| sentence_prev = paragraph_prev.sentences[sentence_prev_hash][0] | ||
| if not sentence_prev.matched: | ||
| unmatched_sentences_prev.append(sentence_prev) | ||
| # to reset 'matched words in analyse_words_in_sentences' of unmatched paragraphs and sentences | ||
| sentence_prev.matched = True | ||
| matched_sentences_prev.append(sentence_prev) | ||
|
|
||
| return unmatched_sentences_curr, unmatched_sentences_prev, matched_sentences_prev, total_sentences | ||
|
|
||
| def analyse_words_in_sentences(self, unmatched_sentences_curr, unmatched_sentences_prev, possible_vandalism): | ||
| matched_words_prev = [] | ||
| unmatched_words_prev = [] | ||
|
|
||
| # Split sentences into words. | ||
| text_prev = [] | ||
| for sentence_prev in unmatched_sentences_prev: | ||
| for word_prev in sentence_prev.words: | ||
| if not word_prev.matched: | ||
| text_prev.append(word_prev.value) | ||
| unmatched_words_prev.append(word_prev) |
There was a problem hiding this comment.
Duplicate sentence tracking now uses sentence_duplicate_counts. This is the same idea as paragraph_duplicate_counts explained above.
| TOKEN_SYMBOLS = ( | ||
| '.', ',', ';', ':', '?', '!', '-', '_', '/', '\\', '(', ')', '[', ']', '{', '}', '*', '#', '@', | ||
| '&', '=', '+', '%', '~', '$', '^', '<', '>', '"', '\'', '´', '`', '¸', '˛', '’', | ||
| '¤', '₳', '฿', '₵', '¢', '₡', '₢', '₫', '₯', '֏', '₠', '€', 'ƒ', '₣', '₲', '₴', '₭', '₺', | ||
| '₾', 'ℳ', '₥', '₦', '₧', '₱', '₰', '£', '៛', '₽', '₹', '₨', '₪', '৳', '₸', '₮', '₩', '¥', | ||
| '§', '‖', '¦', '⟨', '⟩', '–', '—', '¯', '»', '«', '”', '÷', '×', '′', '″', '‴', '¡', | ||
| '¿', '©', '℗', '®', '℠', '™' | ||
| ) | ||
| TOKEN_SYMBOL_REPLACEMENTS = tuple((c, '||{}||'.format(c)) for c in TOKEN_SYMBOLS) |
There was a problem hiding this comment.
Moved TOKEN_SYMBOLS and TOKEN_SYMBOL_REPLACEMENTS to module-level. Previously, the long symbol list is constructed every time split_into_tokens runs, and every tokenization call formats '||{}||'.format(c) for every symbol. Since the list is constant, we now builds TOKEN_SYMBOLS once when the module is loaded, and computes those (symbol, replacement) pairs once and reuses them.
|
|
||
| tokens = filter(lambda a: a != '', text.split('||')) # filter empty strings | ||
| tokens = ['|' if w == 'ææææ' else w for w in tokens] # insert back the |s | ||
| tokens = ['|' if w == 'ææææ' else w for w in text.split('||') if w != ''] # insert back the |s |
There was a problem hiding this comment.
Empty-token filtering and pipe restoration are combined into one list comprehension. Previously, tokenization first filters empty strings and then does a second pass to turn the internal pipe placeholder back into |. The patch does both in one pass over text.split('||').
|
I'm still trying to get [[Google Play]] to wok locally. I went by https://github.com/wikimedia/wikiwho_api/blob/main/SETUP.md for setting up the API. Then to test this PR, I first removed the Then finally, I tried to load Google Play with http://localhost:8000/en/api/v1.0.0-beta/latest_rev_content/Google_Play/?o_rev_id=true&editor=true&token_id=true&out=true&in=true Then it just hangs indefinitely. I don't even see output in the logs. Things work okay though for smaller articles. I am guessing the slow part is it hitting the API. I recall in the past such requests were logged in the server output, but I don't see it now. Maybe my issue is the volume of API requests is too great, possibly also made worse by API limitations recently deployed at Wikimedia. (Such limits don't apply to Wikimedia Cloud Services where production WikiWho is hosted.) How did you test it? Something similar to the above, or are you somehow testing directly against the WikiWho package? |
|
The API is just not very testable on "Google Play" page due to the abysmal amount of 429 Too Many Requests. I directly ran a test script on the WikiWho package in previous tests that cache everything before timer starts. Around 60 seconds on "Google Play" page is the time it takes for the WikiWho package to run on an already fetched revision history. |
|
I have just recognized one more place that could potentially optimize the speed of the diff algorithm. Let me test a bit. |
|
I found that Differ is way too slow on some revisions. Because WikiWho only needs token equal/delete/insert operations and does not use Differ's expensive hint/fancy-replace output, it is cheaper to run SequenceMatcher instead. |
|
I modified the WikiWho API a bit so that it can respect Wikimedia's Retry-After policy, then ran accordingly ( https://github.com/wikimedia/wikiwho_api/blob/main/SETUP.md ). One ran on [[Google Play]] took 10 minutes in total, while retry waits took around 4 minute, therefore the actually time cost on [[Google Play]] would be 6 minute on the Wikimedia Cloud Services server. The modified fork: https://github.com/QZGao/wikiwho_api. If you'd like it to be reused, we can make it a new Pull Request, though it's only going to have an impact when running locally. |
|
Thanks for the thorough explanations, and commitment to find ways to speed up the algorithm! I'll finish the review hopefully tonight and get
That would be great :) I'll use your fork for testing this PR for the time being. |
The PR for the API fork is here: wikimedia/wikiwho_api#28. |
|
After using wikimedia/wikiwho_api#28, I was able to get [[Google Play]] to load! :) I then started comparing and contrasting the WikiWho output for various other articles before/after this patch. As I understood it, tokenization shouldn't change, rather just performance improvements, right? Take Adam Himebauch for example, which is a small article that I started. At minimum the Currently in production: and Compared to my local with this PR checked out: and It seems the markup is attributed differently now (it may be easier to just look for the second token instance of the string |
3c734ab replaced Here I asked Codex to compare
This replacement from |
|
After 5205df6 consistency patch, this is the result against Adam Himebauch: and Now identical with the production. |
|
The single-pass cursor approach in analyse_words_in_sentences() is correct and well-reasoned — since Differ.compare() emits items in alignment order, the two-index scan is a clean O(n) replacement. The utils.py changes (module-level constants, combined filter/restore) are straightforward improvements too. The main thing I'd want to be confident about is the Differ → SequenceMatcher switch. The performance case is clear — the O(n²) behavior was a real blocker for large articles. But since this is an authorship attribution tool, the behavioral difference matters. From the case-by-case analysis already shared, it looks like SequenceMatcher generally produces more intuitive results, but it would be good to know: is there a test suite that captures authorship output for known articles we can diff before/after? If not, it might be worth adding a few regression fixtures so future changes can be validated against this baseline. |
|
There are no such tests that I'm aware of, but that does sound worthwhile if @QZGao is up for it. No pressure though; I will revisit this PR today and will manually test a bunch more articles in various languages. If that all looks okay, then I'm happy to merge. I do want to stress that the algorithm is sacred, and no changes to how it tokenizes should be made unless we are certain it is improving accuracy. |
|
I can take a shot at adding regression fixtures for this. The idea would be: pick a handful of small articles with stable authorship history, capture their WikiWho token output with the current production behavior as golden files, then write tests that verify the new code matches. Would that be a useful format @MusikAnimal, or is there an existing test structure in the repo I should follow? |
Sounds useful to me! There are no tests for the repository as far as I can tell. Given my Python inexperience, I defer to the two of you with respect to best practices. I will be doing some manual comparisons, regardless, so no pressure if it seems like it's too much work. I will also run the Copilot on this PR. Again, we all know AI can be wrong, so don't feel like you need to address everything, or even anything at all. In fact, in my limited experience so far, Copilot will always find something to complain about :-P Thanks so much for the hard work! I'm excited to get these performance improvements deployed. |
There was a problem hiding this comment.
Pull request overview
This PR aims to fix the quadratic time complexity in analyse_words_in_sentences() (Phabricator T342805), which dominated runtime for revisions with tens of thousands of tokens. According to the PR description, the intent is a single ordered pass over difflib.Differ.compare() with two integer cursors. In practice, the PR replaces Differ entirely with a different algorithm (prefix/suffix trim + SequenceMatcher.get_opcodes() gated by a drift bound + a greedy nearest-neighbor matcher with several heuristic thresholds). Smaller refactors include replacing self.temp list-based duplicate counting with local dict counters, introducing a spam_hashes_set mirror for O(1) membership, and tidying split_into_tokens.
Changes:
- Replaces the
Differ-based word matching inanalyse_words_in_sentenceswith a SequenceMatcher + drift-bounded nearest-neighbor heuristic, plus a flatcurr_slotsprecomputation. - Adds
spam_hashes_set/_add_spam_revisionto give O(1) spam-hash lookups, and converts paragraph/sentence duplicate counting fromself.templist counts to per-call dicts. - Hoists the token-symbol list out of
split_into_tokensinto module-levelTOKEN_SYMBOL_REPLACEMENTSand simplifies the empty-string filter.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| WikiWho/wikiwho.py | Rewrites word-level diff matching with a new heuristic algorithm and accompanying helpers; refactors spam-hash and duplicate counting bookkeeping. |
| WikiWho/utils.py | Pulls token symbols and their replacement pairs to module scope; folds the empty-token filter into a single comprehension. |
Comments suppressed due to low confidence (1)
WikiWho/wikiwho.py:153
- The drift cap on
'equal'opcodes —abs((prev_mid_start + i1) - (curr_mid_start + j1)) <= max_drift— silently discards a perfectly validequalblock reported bySequenceMatcherwhenever the block has shifted by more thanmax(50, 10% * max(len_prev, len_curr)). Those tokens then fall through to thereplace/equalbranch on the next line only if the same opcode is'replace'(it isn't — it's'equal'), so the entire equal block is effectively unmatched and every token in it will be re-created as a brand newWordwith a newtoken_id. For an article where a section is moved, this means every token in the moved section loses its authorship lineage.
The original Differ-based code had no such positional cap and would match these tokens. If a drift cap is genuinely necessary for performance, the discarded equal blocks should at least be re-fed through _nearest_word_matches (or simply matched directly, since SequenceMatcher already certified them as a 1-to-1 equal run) rather than dropped.
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
if tag == 'equal' and abs((prev_mid_start + i1) - (curr_mid_start + j1)) <= max_drift:
for prev_index, curr_index in zip(range(i1, i2), range(j1, j2)):
prev_for_curr[curr_mid_start + curr_index] = prev_mid_start + prev_index
elif tag in ('replace', 'equal') and (i2 - i1) * (j2 - j1) <= WORD_MATCH_MAX_LOCAL_PAIRS:
local_matches = _nearest_word_matches(prev_mid[i1:i2],
curr_mid[j1:j2],
prev_mid_start + i1,
curr_mid_start + j1,
max_drift)
for curr_index, prev_index in local_matches.items():
prev_for_curr[curr_mid_start + j1 + curr_index] = prev_mid_start + i1 + prev_index
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Thanks for more test cases. I'll experiment with those test cases to see if they are patch-able. Differ is currently the major obstacle for the huge articles like [[Google Play]] to run successfully, so I'll still try my best to improve the algorithm based on SequenceMatcher.
Indeed. You can checkout 89ca645 as that's the last commit without differ changes. |
We could wait a day or two more, as I'm still motivated to improve the algorithm at the moment. Let's see if those worse case situations are solvable or not. |
|
With 7884b1e, the word matcher now tracks match confidence while assigning tokens. The new helper is The main new behavior is When the same n-gram appears exactly once on the previous side and exactly once on the current side, it becomes a candidate. The code extends that candidate left and right while adjacent tokens still match. Then it checks whether the run is safe to treat as a moved copy. That check takes the longest informative content core from the run and confirms that the core appears exactly once in the full previous article text and exactly once in the full current article text. That full-article uniqueness check is the expensive part. It is why full-history runs on large pages are now about 25-30% slower. The pair-index cache in The link fix is The suffix fix is The tradeoff: moved-run recovery gives better attribution for relocated copied text, but it costs time. The link-boundary and stale-suffix fixes are cheap. The 25-30% slowdown mostly comes from the moved-run uniqueness checks. |
|
I made all previous analyzed situations test cases, and all failures are cleared now.
|
|
@MusikAnimal Please, compare more revisions and we can see if there are any more improvements that should be made. |
|
Let me also ask Copilot to review it. |
|
Ran regression tests against the latest commit (7884b1e). Results vs master:
Splatoon 3 has nearly converged with master (−2). The larger Adam Himebauch gap (−231 vs −72 in the previous round) reflects the moved-run recovery in 7884b1e successfully matching more token occurrences as reuses of existing tokens rather than creating new entries — which is what improved attribution looks like in the stored token count. |
|
Sorry for the delay! I've been out on holiday, then my team at WMF was disbanded :( Rest assured though, I will see this through, or at least the (re-)adding CJK languages to WikiWho. I hope to get to this review this week but I am very busy trying to delegate CommTech projects before the team is officially no more. WikiWho is safe, for the record. I will continue to maintain it with Sage Ross. |
ragesoss
left a comment
There was a problem hiding this comment.
Note on this feedback: Parts of this review were produced with Claude Code (parity experiment, drafting). It's offered as input to read, understand, and use or set aside on your own judgment — not as authoritative conclusions. The central numbers below are reproducible; the interpretation is provisional.
The performance fix is real and worth having. The old analyse_words_in_sentences was genuinely O(n²) — the post-Differ triple rescan plus self.temp.count() in the duplicate-paragraph/sentence paths — and the worst-case revisions (the ~36s Google Play case) were a real problem. The spam_hashes_set, the duplicate-count dicts, the module-level token-symbol list, and the single-pass slot assignment are all sound and uncontroversial improvements.
The concern is that this PR also replaces the matching algorithm, and that change is consumer-visible. The diff drops difflib.Differ for bare SequenceMatcher.get_opcodes() plus a new stack of heuristics (a confidence ladder, wikitext-aware match keys, bounded nearest-neighbor recovery, n-gram moved-run recovery, edited-link-boundary recovery, stale-suffix demotion). Differ is not SequenceMatcher — Differ adds a character-level _fancy_replace layer on top of it — so opcode-level matching produces different token-to-token assignments even before the new heuristics are considered.
To measure the impact, each article's full revision history was replayed through both the base (Differ) and PR versions of the library, and the per-token output of the target revision was diffed. Structural state (revisions processed, spam IDs, paragraph/sentence hash tables) is byte-for-byte identical between the two. Per-token attribution is not:
| Article | revs | o_rev_id differs (all tokens) |
o_rev_id differs (prose words only) |
inbound/outbound differs |
|---|---|---|---|---|
| Photosynthesis | 5.4K | 19.3% | 14.5% | 14.5% |
| Wikipedia (simplewiki) | 3.5K | 4.9% | 3.7% | 37.5% |
| Jesse Owens | 6.4K | 6.2% | 4.8% | 2.5% |
The "prose words only" column excludes wikitext punctuation/markup, so this is not just template-syntax reshuffling — content words change attributed origin revision (and therefore editor) for 4–15% of the article. Examples from Photosynthesis: schematic of photosynthesis, vegetation; from Jesse Owens: athletics at the summer olympics. Three of WikiWho's four downstream consumers (the Wiki Education Dashboard's ArticleViewer, XTools Authorship/Blame, the WhoWroteThat gadget) render attribution per token, so this surfaces as different coloring / different "added by" results. The inbound/outbound deltas feed WhoColor conflict scores.
Important: these numbers measure divergence from the current algorithm, not correctness. The PR's stated aim ("Improve word-level authorship matching for moved text") is to attribute better than Differ — and some fraction of these differences may well be improvements (moved text credited to its original author rather than the mover is a plausible win). But which differences are improvements and which are regressions has not been established here, and that question is not cheap to answer — it requires sampling specific divergences and judging editing intent case by case.
Why the PR's own verification doesn't settle this. The description verifies matched_prev count, the vandalism flag, token_id_delta, tokens_delta, and per-sentence word totals on one revision (Google Play 1296988276). Those are aggregate counts on a single diff step. Two matchers can produce the same number of matched tokens while assigning them to different origin revisions — which is exactly what the table above shows (the structural counts match; the per-token o_rev_ids don't). Aggregate-count parity on one revision can't detect attribution drift across a history.
Suggestions, in rough priority:
- Consider splitting the PR. The performance fixes (set-based spam lookup, duplicate-count dicts, single-pass slot assignment, module-level constants) are separable from the matcher replacement and could merge on their own with little risk. That isolates the part that needs scrutiny.
- If the algorithm change is intended, characterize it before merging. A per-token replay over a corpus of full histories (the method used above) quantifies the drift; sampling N divergent tokens and judging better/worse gives a quality signal. Because the change is consumer-visible, the maintainers of the per-token consumers above probably want a say before this lands.
- Tests. This adds ~860 lines to the core matcher in a repo with no test suite or CI. Even a handful of unit tests on the new helpers, plus one full-history regression check, would make the behavior reviewable and pin it against future drift.
- The description is stale. It describes a single-pass cursor optimization; the PR is a matcher rewrite. The note at the bottom acknowledges this, but a reader still has to reverse-engineer the actual change. (Copilot's review raised the same point, plus the
_nearest_word_matchesnearest-position concern and the unreferencedWORD_MATCH_*constants — worth a look.)
Drafted in a Claude Code session directed by Sage. The attribution-drift figures were produced by replaying each article's full history through both the base and PR builds of the library and diffing per-token output; that comparison is reproducible and was run during the session, not taken on faith. Sage directed the review and the experiment and read this comment before posting. The reading of the 860-line diff and the per-token analysis were done by Claude Code and were not independently re-derived by hand; the divergence-vs-correctness question is left open, not resolved.
|
Follow-up to the review above. Same caveats: AI-assisted, provisional, offered as input to weigh rather than a verdict. The earlier comment measured how much attribution changes but left open whether the changes are improvements or regressions. A closer look at where the divergence actually lands, on the same three article histories: Most of the divergence is on tokens where per-token authorship is arbitrary, or isn't "prose authorship" at all. Across the three articles, 90–98% of the tokens whose On the smaller slice that is genuine prose, the PR's attribution looks more correct, not less. 38–50% of divergent tokens fall in contiguous runs of 5+ positions — the fingerprint of moved or reformatted passages. Tracing specific content words back through the actual revision text:
The pattern is consistent: This is not a clean sweep. The PR can also over-reach in the other direction — a couple of words adjacent to the simplewiki passage were pulled to a 2009 revision that predates the phrase, via a spurious duplicate match. And this is four hand-checked cases, not a measured rate; a larger better/worse sample would firm it up (and is a reasonable thing to ask for before relying on the new behavior). Net: the earlier "changes attribution for X% of tokens" figure overstates the consumer-meaningful impact, and the meaningful part appears to be a quality improvement. The suggestions from the prior comment still stand — separating the uncontroversial performance fixes from the matcher change, adding tests, and getting sign-off from the per-token consumers — but the matcher change now looks worth pursuing on its merits, not merely tolerating as a risk. Drafted in a Claude Code session directed by Sage. The "first authored in" revisions were verified by searching each phrase against the actual stored revision text of the article's history; that check is reproducible. The where-divergence-lands breakdown was computed by replaying both library builds over full histories and classifying every changed token. Sage directed the analysis and reviewed this comment before posting; the classification and per-token analysis were done by Claude Code and not independently re-derived by hand. A quantified better/worse/wash rate over a large sample has not yet been computed — the "looks more correct" claim rests on the four verified cases plus the structural and directional evidence. |
|
Quantifying the better/worse question (delivering on "a larger sample would firm it up" from the comment above; same caveats — AI-assisted, provisional). Adjudicating a single token's "true author" automatically is circular (it is the attribution problem), so the test used here sidesteps it: for each prose divergence, take a short context window and ask whether each credited revision actually introduced that text (window present in the revision, absent in its parent) or merely inherited it (window already in the parent — a later edit that reshaped nearby text without writing this). Among the cases where exactly one of the two credited revisions actually introduced the prose:
So where it's determinable, the PR credits the true author ~94–100% of the time and Caveats, to keep this honest:
Net of all three comments: the performance fix is clearly worth having; the matcher change is a real attribution-quality improvement on moved/reformatted text, measurable and in the right direction. The earlier suggestions stand — split the uncontroversial perf work from the matcher change, add tests (the introduced-vs-inherited check above is itself a usable regression-test shape), and loop in the per-token consumers — but the matcher work looks worth landing on its merits. Drafted in a Claude Code session directed by Sage. The introduced-vs-inherited adjudication was run over sampled prose divergences from full-history replays of three articles, checking each credited revision's text against its parent's; reproducible. Sage directed the analysis and reviewed this before posting. The method and code were produced by Claude Code; the ~94–100% figure is the machine-computed rate on the decisive subset, not independently re-derived by hand beyond the four cases verified in the previous comment. |
|
I'll merge #3 into this branch (to be consistent with existing test suite), then add dedicated test cases I previously mentioned. |
|
Now the test cases are also implemented into the branch. |
|
This PR does change algorithm, correct? I'm wondering what folks think of just moving forward without this. #4 by itself improves performance. I just worry changing the way things are attributed is going to cause confusion, for better or for worse. I am also out of time. I am happy to run the script to import the CJK languages and do one more final release of downstream projects, but beyond that, I don't really have much energy left to invest in testing this PR. Perhaps Sage is able to, however. Let me know what you all think :) |
|
The algorithm is already outdated -- and as people have been noticing, more pages will gradually start to be unable to render (especially on enwiki) just like [[Google Play]]. Therefore, I would highly recommend we optimize the algorithm now, while we are at it. |
|
Yeah, I'll keep up with this PR @MusikAnimal. From what I've tested so far, I do think the algorithm changes are going to improve attribution. |
This PR fully addresses T342805 by replacing the slow
difflib.Differword matcher with aSequenceMatcher-based matcher. This is a continuation of #4's part 1 changes.Differis too slow on some large repetitive revisions because WikiWho only needs equal/delete/insert token matching, butDifferalso computes expensive hint/fancy-replace output that WikiWho does not use. In particular, it can stall indifflib._fancy_replaceon large repeated-token edits.The new matcher uses
SequenceMatcher.get_opcodes()directly, with extra WikiWho-specific handling to keep authorship attribution intuitive:{{,}},[[,]],|, and=are matched with local syntax context instead of by token value alone.[[Category:Video games]]becoming[[Category:Video games developed in Japan]].to,in, orofdo not incorrectly survive into unrelated later prose.This change can produce different token attribution from the old
Differmatcher. That is expected: the goal is not byte-for-byte parity withDiffer, but faster completion on large revisions and better attribution for moved or reformatted text.Tested cases include these previously analyzed situations:
Differversion could not finish reliably within practical runtime.support.google.com, where the new matcher preserves the stable domain tokens.Differ.Differstalls but the new matcher completes.Tested
pytestand passed 24 out of 24 test cases. Should merge #4 and #3 before this.