Skip to content

Optimize diff algorithm's time complexity, part 2#2

Open
QZGao wants to merge 12 commits into
wikimedia:masterfrom
QZGao:optimization
Open

Optimize diff algorithm's time complexity, part 2#2
QZGao wants to merge 12 commits into
wikimedia:masterfrom
QZGao:optimization

Conversation

@QZGao

@QZGao QZGao commented May 7, 2026

Copy link
Copy Markdown

This PR fully addresses T342805 by replacing the slow difflib.Differ word matcher with a SequenceMatcher-based matcher. This is a continuation of #4's part 1 changes.

Differ is too slow on some large repetitive revisions because WikiWho only needs equal/delete/insert token matching, but Differ also computes expensive hint/fancy-replace output that WikiWho does not use. In particular, it can stall in difflib._fancy_replace on large repeated-token edits.

The new matcher uses SequenceMatcher.get_opcodes() directly, with extra WikiWho-specific handling to keep authorship attribution intuitive:

  • Common prefix/suffix tokens are matched first.
  • Generic wikitext tokens such as {{, }}, [[, ]], |, and = are matched with local syntax context instead of by token value alone.
  • Prefix/suffix matches are rolled back when they would cut through templates, links, or comments.
  • Repetitive replace regions use bounded nearest-token matching.
  • Moved text is recovered by finding unique informative n-gram runs in unmatched regions, extending them left/right, and verifying that the content core is unique in both the previous and current article text.
  • Edited internal-link boundaries are recovered when an unpiped link target is extended, for example [[Category:Video games]] becoming [[Category:Video games developed in Japan]].
  • Stale suffix matches for short glue words are demoted after large pure deletions, so old words like to, in, or of do not incorrectly survive into unrelated later prose.

This change can produce different token attribution from the old Differ matcher. That is expected: the goal is not byte-for-byte parity with Differ, but faster completion on large revisions and better attribution for moved or reformatted text.

Tested cases include these previously analyzed situations:

  • Google Play, where the Differ version could not finish reliably within practical runtime.
  • URL/reference expansion around support.google.com, where the new matcher preserves the stable domain tokens.
  • Availability-table cleanup, where the new matcher preserves the same qualitative behavior as Differ.
  • Large repetitive availability-table edits, where Differ stalls but the new matcher completes.
  • Adam Himebauch, Splatoon 3, American goldfinch, Japan Cup, 2026 Canvas security incident, Nintendo Switch, and Through the Looking-Glass regression cases covering prose, references, links, templates, duplicate words, punctuation, and stale suffix tokens.

Tested pytest and passed 24 out of 24 test cases. Should merge #4 and #3 before this.

@QZGao QZGao mentioned this pull request May 7, 2026
@QZGao

QZGao commented May 7, 2026

Copy link
Copy Markdown
Author

I ran the whole "Google Play" page locally. Without counting the request time, the local WikiWho processing itself took 62.46 s.

@QZGao

QZGao commented May 7, 2026

Copy link
Copy Markdown
Author

Another run took 66.70 s total local processing:

  • determine_authorship: 62.62 s
    • reset_update_matched_paragraphs: 25.61 s
    • analyse_paragraphs: 21.01 s
    • analyse_sentences: 12.44 s
    • analyse_words: 2.85 s
    • reset_update_matched_paragraphs_words: 106,439,777
    • reset_update_matched_paragraphs_sentences: 6,706,455
    • reset_update_matched_paragraphs_paragraphs: 845,938
  • lower_text: 3.02 s
  • eager_hash_lookup: 0.91 s
  • JSON payload build + encode: 0.13 s

@QZGao

QZGao commented May 7, 2026

Copy link
Copy Markdown
Author

Multiple small fixes:

  • Added spam_hashes_set for O(1) spam hash membership while preserving the existing spam_hashes list.
  • Fixed eager hash calculation in JSON revision processing.
  • Replaced self.temp.append(...); self.temp.count(...) duplicate tracking with local counters for paragraph and sentence duplicates.

@QZGao QZGao marked this pull request as draft May 7, 2026 02:28
@QZGao

QZGao commented May 7, 2026

Copy link
Copy Markdown
Author

Another thing that severely drags the program down is the final all-token scan. I am investigating whether we can optimize that.

@QZGao QZGao marked this pull request as ready for review May 7, 2026 08:02

@QZGao QZGao left a comment

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have done some investigation on the code but sadly could not find any more optimization. The one-request-that-calculates-all approach clearly has a lower bound in efficiency. If we want to get "Google Play" and "Barack Obama" to be parsed within time and memory constraints, we would need to fundamentally change some behaviors.

For example, if we want to speed up, we would want to cache previous runs on previous revisions, and only do incremental analysis. If we want to optimize memory usage, we would want to give up on storing all revisions -- maybe only keep the latest 500 revisions and delete old revisions, if we assume we already do incremental analysis.

Still, I believe the current changes already made are worthwhile. I've documented the changes in these comments.

Comment thread WikiWho/wikiwho.py
Comment on lines +603 to +613
# Build flat (sentence, token) slots so we can assign words during the
# diff pass without re-scanning sentences or the diff list.
curr_slots = [] # list of (sentence_curr, word_value)
text_curr = []
for sentence_curr in unmatched_sentences_curr:
# split_into_tokens is already done in analyse_sentences_in_paragraphs
words = sentence_curr.value.split(' ')
text_curr.extend(words)
sentence_curr.splitted.extend(words)
for word in words:
curr_slots.append((sentence_curr, word))

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In analyse_words_in_sentences, the algorithm needs to decide, for every token in the current revision, which Sentence object should receive either a reused old Word object or a newly created Word object. Originally, after computing the diff, the code loops back through unmatched_sentences_curr and each sentence’s split tokens to rediscover that sentence/token position while also scanning the diff list. This patch builds curr_slots, a flat ordered list of (sentence_curr, word) pairs, at the same time it builds text_curr. This preserves the same token order as text_curr, and also remembers the destination sentence for each token.

Comment thread WikiWho/wikiwho.py
Comment on lines 627 to 639
# Edit consists of adding new content, not changing/removing content
if not text_prev:
for sentence_curr in unmatched_sentences_curr:
for word in sentence_curr.splitted:
word_curr = Word()
word_curr.value = word
word_curr.token_id = self.token_id
word_curr.origin_rev_id = self.revision_curr.id
word_curr.last_rev_id = self.revision_curr.id

sentence_curr.words.append(word_curr)
self.token_id += 1
self.revision_curr.original_adds += 1
self.tokens.append(word_curr)
for sentence_curr, word in curr_slots:
word_curr = Word()
word_curr.value = word
word_curr.token_id = self.token_id
word_curr.origin_rev_id = self.revision_curr.id
word_curr.last_rev_id = self.revision_curr.id
sentence_curr.words.append(word_curr)
self.token_id += 1
self.revision_curr.original_adds += 1
self.tokens.append(word_curr)
return matched_words_prev, possible_vandalism

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pure-addition edits are the case where there are current tokens but no previous unmatched tokens to compare against. Previously, the code loops through unmatched_sentences_curr and then through sentence_curr.splitted to create new Word objects. Since the branch already built curr_slots, it can create the same new Word objects from that flat ordered list instead. This removes duplicated traversal logic.

Comment thread WikiWho/wikiwho.py Outdated
Comment on lines 641 to 675
# Single pass diff
prev_index = 0
curr_index = 0
d = Differ()
diff = list(d.compare(text_prev, text_curr))
for sentence_curr in unmatched_sentences_curr:
for word in sentence_curr.splitted:
curr_matched = False
pos = 0
diff_len = len(diff)
while pos < diff_len:
word_diff = diff[pos]
if word == word_diff[2:]:
if word_diff[0] == ' ':
# match
for word_prev in unmatched_words_prev:
if not word_prev.matched and word_prev.value == word:

word_prev.matched = True
curr_matched = True
sentence_curr.words.append(word_prev)
matched_words_prev.append(word_prev)
diff[pos] = ''
pos = diff_len + 1
break
elif word_diff[0] == '-':
# deleted
for word_prev in unmatched_words_prev:
if not word_prev.matched and word_prev.value == word:
word_prev.matched = True
word_prev.outbound.append(self.revision_curr.id)
matched_words_prev.append(word_prev)
diff[pos] = ''
break
elif word_diff[0] == '+':
# a new added word
curr_matched = True
word_curr = Word()
word_curr.value = word
word_curr.token_id = self.token_id
word_curr.origin_rev_id = self.revision_curr.id
word_curr.last_rev_id = self.revision_curr.id

sentence_curr.words.append(word_curr)
self.token_id += 1
self.revision_curr.original_adds += 1
self.tokens.append(word_curr)
diff[pos] = ''
pos = diff_len + 1
pos += 1

if not curr_matched:
word_curr = Word()
word_curr.value = word
word_curr.token_id = self.token_id
word_curr.origin_rev_id = self.revision_curr.id
word_curr.last_rev_id = self.revision_curr.id
sentence_curr.words.append(word_curr)

self.token_id += 1
self.revision_curr.original_adds += 1
self.tokens.append(word_curr)
for diff_item in d.compare(text_prev, text_curr):
op = diff_item[0]
if op == '?': # Differ hint line; skip
continue
elif op == ' ': # next prev word is matched to next curr slot
word_prev = unmatched_words_prev[prev_index]
sentence_curr, _ = curr_slots[curr_index]
word_prev.matched = True
sentence_curr.words.append(word_prev)
matched_words_prev.append(word_prev)
prev_index += 1
curr_index += 1
elif op == '-': # next prev word is deleted (not present in curr)
word_prev = unmatched_words_prev[prev_index]
word_prev.matched = True
word_prev.outbound.append(self.revision_curr.id)
matched_words_prev.append(word_prev)
prev_index += 1
elif op == '+': # next curr slot gets a brand-new Word
sentence_curr, word = curr_slots[curr_index]
word_curr = Word()
word_curr.value = word
word_curr.token_id = self.token_id
word_curr.origin_rev_id = self.revision_curr.id
word_curr.last_rev_id = self.revision_curr.id
sentence_curr.words.append(word_curr)
self.token_id += 1
self.revision_curr.original_adds += 1
self.tokens.append(word_curr)
curr_index += 1

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main word-diff loop now reads Differ().compare(text_prev, text_curr) once, using prev_index and curr_index to keep the previous Word stream and current token stream aligned. This is needed because the old code matched diff rows by token text. Token text is often repeated, especially for common words, punctuation, markup characters, and template syntax. For every current token, the old loop could scan much of the diff output looking for a same-text row, then scan previous words again looking for a same-text Word object that was not already marked matched. The new loop does not search by text. It follows the order already produced by Differ: unchanged tokens consume one previous word and one current slot, deleted tokens consume one previous word, and inserted tokens consume one current slot. That removes repeated scans and makes the matching rule positional instead of “first unused word with this text”.

Comment thread WikiWho/wikiwho.py
Comment on lines 159 to 165
text = revision.get('*', '')
rev_id = int(revision['revid'])
rev_hash = revision.get('sha1', calculate_hash(text))
if rev_hash in self.spam_hashes:
rev_hash = revision.get('sha1')
if not rev_hash:
rev_hash = calculate_hash(text)
if rev_hash in self.spam_hashes_set:
vandalism = True

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The JSON revision path no longer computes calculate_hash(text) unless the API response lacks sha1. In Python, function arguments are evaluated before the function call, so previously revision.get('sha1', calculate_hash(text)) still hashes the full revision text even when sha1 exists. This patch makes the fallback lazy, which avoids unnecessary full-text hashing for normal API responses.

Comment thread WikiWho/wikiwho.py
Comment on lines +54 to +60
# Keep the public list shape while using a set for membership checks.
self.spam_hashes_set = set()

def _add_spam_revision(self, rev_id, rev_hash):
self.spam_ids.append(rev_id)
self.spam_hashes.append(rev_hash)
self.spam_hashes_set.add(rev_hash)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spam hash lookup now uses spam_hashes_set. Spam revision recording is centralized in _add_spam_revision(...). The existing spam_hashes list is still kept (because I am not sure if any other scripts would query that), but is no longer used in the main logic. This helps when many vandalism/spam revisions have already been recorded.

Comment thread WikiWho/wikiwho.py
Comment on lines 453 to 463
# Identify unmatched paragraphs in previous revision for further analysis.
paragraph_duplicate_counts = {}
for paragraph_prev_hash in self.revision_prev.ordered_paragraphs:
if len(self.revision_prev.paragraphs[paragraph_prev_hash]) > 1:
s = 'p-{}-{}'.format(self.revision_prev, paragraph_prev_hash)
self.temp.append(s)
count = self.temp.count(s)
count = paragraph_duplicate_counts.get(paragraph_prev_hash, 0) + 1
paragraph_duplicate_counts[paragraph_prev_hash] = count
paragraph_prev = self.revision_prev.paragraphs[paragraph_prev_hash][count - 1]
else:
paragraph_prev = self.revision_prev.paragraphs[paragraph_prev_hash][0]
if not paragraph_prev.matched:
unmatched_paragraphs_prev.append(paragraph_prev)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate paragraph tracking now uses paragraph_duplicate_counts. The algorithm needs the nth paragraph object for repeated paragraph hashes in a revision. Previously, it builds a formatted string key, appends it to self.temp, then calls self.temp.count(...) to find the occurrence number. Now it uses a local dictionary counter keyed by paragraph hash, so finding the occurrence number is direct and does not scan self.temp.

Comment thread WikiWho/wikiwho.py
Comment on lines 572 to 601
# Identify the unmatched sentences in the previous paragraph revision.
sentence_duplicate_counts = {}
for paragraph_prev in unmatched_paragraphs_prev:
for sentence_prev_hash in paragraph_prev.ordered_sentences:
if len(paragraph_prev.sentences[sentence_prev_hash]) > 1:
s = 's-{}-{}'.format(paragraph_prev, sentence_prev_hash)
self.temp.append(s)
count = self.temp.count(s)
key = (id(paragraph_prev), sentence_prev_hash)
count = sentence_duplicate_counts.get(key, 0) + 1
sentence_duplicate_counts[key] = count
sentence_prev = paragraph_prev.sentences[sentence_prev_hash][count - 1]
else:
sentence_prev = paragraph_prev.sentences[sentence_prev_hash][0]
if not sentence_prev.matched:
unmatched_sentences_prev.append(sentence_prev)
# to reset 'matched words in analyse_words_in_sentences' of unmatched paragraphs and sentences
sentence_prev.matched = True
matched_sentences_prev.append(sentence_prev)

return unmatched_sentences_curr, unmatched_sentences_prev, matched_sentences_prev, total_sentences

def analyse_words_in_sentences(self, unmatched_sentences_curr, unmatched_sentences_prev, possible_vandalism):
matched_words_prev = []
unmatched_words_prev = []

# Split sentences into words.
text_prev = []
for sentence_prev in unmatched_sentences_prev:
for word_prev in sentence_prev.words:
if not word_prev.matched:
text_prev.append(word_prev.value)
unmatched_words_prev.append(word_prev)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate sentence tracking now uses sentence_duplicate_counts. This is the same idea as paragraph_duplicate_counts explained above.

Comment thread WikiWho/utils.py
Comment on lines +24 to +32
TOKEN_SYMBOLS = (
'.', ',', ';', ':', '?', '!', '-', '_', '/', '\\', '(', ')', '[', ']', '{', '}', '*', '#', '@',
'&', '=', '+', '%', '~', '$', '^', '<', '>', '"', '\'', '´', '`', '¸', '˛', '’',
'¤', '₳', '฿', '₵', '¢', '₡', '₢', '₫', '₯', '֏', '₠', '€', 'ƒ', '₣', '₲', '₴', '₭', '₺',
'₾', 'ℳ', '₥', '₦', '₧', '₱', '₰', '£', '៛', '₽', '₹', '₨', '₪', '৳', '₸', '₮', '₩', '¥',
'§', '‖', '¦', '⟨', '⟩', '–', '—', '¯', '»', '«', '”', '÷', '×', '′', '″', '‴', '¡',
'¿', '©', '℗', '®', '℠', '™'
)
TOKEN_SYMBOL_REPLACEMENTS = tuple((c, '||{}||'.format(c)) for c in TOKEN_SYMBOLS)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved TOKEN_SYMBOLS and TOKEN_SYMBOL_REPLACEMENTS to module-level. Previously, the long symbol list is constructed every time split_into_tokens runs, and every tokenization call formats '||{}||'.format(c) for every symbol. Since the list is constant, we now builds TOKEN_SYMBOLS once when the module is loaded, and computes those (symbol, replacement) pairs once and reuses them.

Comment thread WikiWho/utils.py

tokens = filter(lambda a: a != '', text.split('||')) # filter empty strings
tokens = ['|' if w == 'ææææ' else w for w in tokens] # insert back the |s
tokens = ['|' if w == 'ææææ' else w for w in text.split('||') if w != ''] # insert back the |s

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empty-token filtering and pipe restoration are combined into one list comprehension. Previously, tokenization first filters empty strings and then does a second pass to turn the internal pipe placeholder back into |. The patch does both in one pass over text.split('||').

@MusikAnimal

Copy link
Copy Markdown
Member

I'm still trying to get [[Google Play]] to wok locally.

I went by https://github.com/wikimedia/wikiwho_api/blob/main/SETUP.md for setting up the API. Then to test this PR, I first removed the wikiwho package with pip uninstall wikiwho. I then changed requirements.txt to point to the local clone of WikiWho with this PR checked out, ala ../WikiWho, and re-ran pip install -r requirements.txt. Then I started the server with python manage.py runserver.

Then finally, I tried to load Google Play with http://localhost:8000/en/api/v1.0.0-beta/latest_rev_content/Google_Play/?o_rev_id=true&editor=true&token_id=true&out=true&in=true

Then it just hangs indefinitely. I don't even see output in the logs. Things work okay though for smaller articles. I am guessing the slow part is it hitting the API. I recall in the past such requests were logged in the server output, but I don't see it now. Maybe my issue is the volume of API requests is too great, possibly also made worse by API limitations recently deployed at Wikimedia. (Such limits don't apply to Wikimedia Cloud Services where production WikiWho is hosted.)

How did you test it? Something similar to the above, or are you somehow testing directly against the WikiWho package?

@QZGao

QZGao commented May 10, 2026

Copy link
Copy Markdown
Author

The API is just not very testable on "Google Play" page due to the abysmal amount of 429 Too Many Requests. I directly ran a test script on the WikiWho package in previous tests that cache everything before timer starts. Around 60 seconds on "Google Play" page is the time it takes for the WikiWho package to run on an already fetched revision history.

@QZGao

QZGao commented May 10, 2026

Copy link
Copy Markdown
Author

I have just recognized one more place that could potentially optimize the speed of the diff algorithm. Let me test a bit.

@QZGao

QZGao commented May 10, 2026

Copy link
Copy Markdown
Author

I found that Differ is way too slow on some revisions. Because WikiWho only needs token equal/delete/insert operations and does not use Differ's expensive hint/fancy-replace output, it is cheaper to run SequenceMatcher instead.

@QZGao

QZGao commented May 10, 2026

Copy link
Copy Markdown
Author

I modified the WikiWho API a bit so that it can respect Wikimedia's Retry-After policy, then ran accordingly ( https://github.com/wikimedia/wikiwho_api/blob/main/SETUP.md ). One ran on [[Google Play]] took 10 minutes in total, while retry waits took around 4 minute, therefore the actually time cost on [[Google Play]] would be 6 minute on the Wikimedia Cloud Services server.

The modified fork: https://github.com/QZGao/wikiwho_api. If you'd like it to be reused, we can make it a new Pull Request, though it's only going to have an impact when running locally.

@MusikAnimal

Copy link
Copy Markdown
Member

Thanks for the thorough explanations, and commitment to find ways to speed up the algorithm! I'll finish the review hopefully tonight and get zh started shortly thereafter.

The modified fork: https://github.com/QZGao/wikiwho_api. If you'd like it to be reused, we can make it a new Pull Request, though it's only going to have an impact when running locally.

That would be great :) I'll use your fork for testing this PR for the time being.

@QZGao

QZGao commented May 11, 2026

Copy link
Copy Markdown
Author

That would be great :)

The PR for the API fork is here: wikimedia/wikiwho_api#28.

@MusikAnimal

Copy link
Copy Markdown
Member

After using wikimedia/wikiwho_api#28, I was able to get [[Google Play]] to load! :)

I then started comparing and contrasting the WikiWho output for various other articles before/after this patch. As I understood it, tokenization shouldn't change, rather just performance improvements, right?

Take Adam Himebauch for example, which is a small article that I started. At minimum the token_ids are different, but it seems there's a few places where the rev IDs are different, for example:

Currently in production:

{"str":"'","o_rev_id":678457635,"editor":"14882394","token_id":2,"in":[747629772],"out":[747627155]}

and

{"str":"adam","o_rev_id":1162559322,"editor":"36240525","token_id":5552,"in":[],"out":[]}

Compared to my local with this PR checked out:

{"str":"'","o_rev_id":1162559322,"editor":"36240525","token_id":5551,"in":[],"out":[]}

and

{"str":"adam","o_rev_id":1315073789,"editor":"26944994","token_id":6197,"in":[],"out":[]}

It seems the markup is attributed differently now (it may be easier to just look for the second token instance of the string adam), and the preceding ' token has an in and out before, but doesn't now.

@QZGao

QZGao commented May 13, 2026

Copy link
Copy Markdown
Author

I then started comparing and contrasting the WikiWho output for various other articles before/after this patch. As I understood it, tokenization shouldn't change, rather just performance improvements, right?

3c734ab replaced Differ with SequenceMatcher. This is the reason why tokenization result is different. I did mention that they were identical, but that is no longer the case with this change.

Here I asked Codex to compare SequenceMatcher and Differ with several sampled revisions and concluded that the results did not change much:

Case 1: 480556009
URL/reference expansion around support.google.com, changing the old support URL into a longer support.google.com/googleplay/... URL.

  • Original Differ: reuses support and the first ., but marks google, ., com, and googleplay as newly added in 480556009.
  • SequenceMatcher: reuses the unchanged domain tokens support . google . com /; only the genuinely new path pieces like googleplay, bin, answer, etc. are new.

Comparison: SequenceMatcher is more intuitive. The domain survived the edit; original Differ breaks authorship inside the stable domain.

Case 2: 690409934
Availability-table cleanup, e.g. {{ yes | yes }} becoming {{ yes }} and {{ no | no }} becoming {{ no }}.

  • Original Differ: reuses the core table/template tokens around {{ yes }} and {{ no }} and deletes the redundant parameter-style pieces.
  • SequenceMatcher: does the same qualitative thing: keeps the core {{ yes }} / {{ no }} tokens from prior history instead of treating them as new.

Comparison: correctness is roughly equivalent here. SequenceMatcher preserves the intuitive Differ behavior while being faster in the measured run.

Case 3: 896108303
Large repetitive availability-table update dominated by |, {{, }}, yes, no.

  • Original Differ: stalls in difflib._fancy_replace; no completed token mapping should be claimed.
  • SequenceMatcher: completes and uses the nearest-token fallback for the repeated-token region.

Comparison: SequenceMatcher wins operationally because it produces a result. There is no original-Differ correctness output to compare against here, because original Differ does not finish in the local test scenario.

This replacement from Differ to SequenceMatcher made [[Google Play]] able to work within time, so I believe it to be worth some discrepancy.

@QZGao

QZGao commented May 13, 2026

Copy link
Copy Markdown
Author

After 5205df6 consistency patch, this is the result against Adam Himebauch:

{"str":"'","o_rev_id":678457635,"editor":"14882394","token_id":2,"in":[747629772],"out":[747627155]}

and

{"str":"adam","o_rev_id":1315073789,"editor":"26944994","token_id":6824,"in":[],"out":[]}

Now identical with the production.

@xenacode-art

Copy link
Copy Markdown

The single-pass cursor approach in analyse_words_in_sentences() is correct and well-reasoned — since Differ.compare() emits items in alignment order, the two-index scan is a clean O(n) replacement. The utils.py changes (module-level constants, combined filter/restore) are straightforward improvements too.

The main thing I'd want to be confident about is the Differ → SequenceMatcher switch. The performance case is clear — the O(n²) behavior was a real blocker for large articles. But since this is an authorship attribution tool, the behavioral difference matters. From the case-by-case analysis already shared, it looks like SequenceMatcher generally produces more intuitive results, but it would be good to know: is there a test suite that captures authorship output for known articles we can diff before/after? If not, it might be worth adding a few regression fixtures so future changes can be validated against this baseline.
Overall the direction looks right to me.

@MusikAnimal

Copy link
Copy Markdown
Member

There are no such tests that I'm aware of, but that does sound worthwhile if @QZGao is up for it. No pressure though; I will revisit this PR today and will manually test a bunch more articles in various languages. If that all looks okay, then I'm happy to merge.

I do want to stress that the algorithm is sacred, and no changes to how it tokenizes should be made unless we are certain it is improving accuracy.

@xenacode-art

Copy link
Copy Markdown

I can take a shot at adding regression fixtures for this. The idea would be: pick a handful of small articles with stable authorship history, capture their WikiWho token output with the current production behavior as golden files, then write tests that verify the new code matches. Would that be a useful format @MusikAnimal, or is there an existing test structure in the repo I should follow?

@MusikAnimal

Copy link
Copy Markdown
Member

I can take a shot at adding regression fixtures for this. The idea would be: pick a handful of small articles with stable authorship history, capture their WikiWho token output with the current production behavior as golden files, then write tests that verify the new code matches. Would that be a useful format @MusikAnimal, or is there an existing test structure in the repo I should follow?

Sounds useful to me! There are no tests for the repository as far as I can tell. Given my Python inexperience, I defer to the two of you with respect to best practices. I will be doing some manual comparisons, regardless, so no pressure if it seems like it's too much work.

I will also run the Copilot on this PR. Again, we all know AI can be wrong, so don't feel like you need to address everything, or even anything at all. In fact, in my limited experience so far, Copilot will always find something to complain about :-P

Thanks so much for the hard work! I'm excited to get these performance improvements deployed.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to fix the quadratic time complexity in analyse_words_in_sentences() (Phabricator T342805), which dominated runtime for revisions with tens of thousands of tokens. According to the PR description, the intent is a single ordered pass over difflib.Differ.compare() with two integer cursors. In practice, the PR replaces Differ entirely with a different algorithm (prefix/suffix trim + SequenceMatcher.get_opcodes() gated by a drift bound + a greedy nearest-neighbor matcher with several heuristic thresholds). Smaller refactors include replacing self.temp list-based duplicate counting with local dict counters, introducing a spam_hashes_set mirror for O(1) membership, and tidying split_into_tokens.

Changes:

  • Replaces the Differ-based word matching in analyse_words_in_sentences with a SequenceMatcher + drift-bounded nearest-neighbor heuristic, plus a flat curr_slots precomputation.
  • Adds spam_hashes_set/_add_spam_revision to give O(1) spam-hash lookups, and converts paragraph/sentence duplicate counting from self.temp list counts to per-call dicts.
  • Hoists the token-symbol list out of split_into_tokens into module-level TOKEN_SYMBOL_REPLACEMENTS and simplifies the empty-string filter.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
WikiWho/wikiwho.py Rewrites word-level diff matching with a new heuristic algorithm and accompanying helpers; refactors spam-hash and duplicate counting bookkeeping.
WikiWho/utils.py Pulls token symbols and their replacement pairs to module scope; folds the empty-token filter into a single comprehension.
Comments suppressed due to low confidence (1)

WikiWho/wikiwho.py:153

  • The drift cap on 'equal' opcodes — abs((prev_mid_start + i1) - (curr_mid_start + j1)) <= max_drift — silently discards a perfectly valid equal block reported by SequenceMatcher whenever the block has shifted by more than max(50, 10% * max(len_prev, len_curr)). Those tokens then fall through to the replace/equal branch on the next line only if the same opcode is 'replace' (it isn't — it's 'equal'), so the entire equal block is effectively unmatched and every token in it will be re-created as a brand new Word with a new token_id. For an article where a section is moved, this means every token in the moved section loses its authorship lineage.

The original Differ-based code had no such positional cap and would match these tokens. If a drift cap is genuinely necessary for performance, the discarded equal blocks should at least be re-fed through _nearest_word_matches (or simply matched directly, since SequenceMatcher already certified them as a 1-to-1 equal run) rather than dropped.

            for tag, i1, i2, j1, j2 in matcher.get_opcodes():
                if tag == 'equal' and abs((prev_mid_start + i1) - (curr_mid_start + j1)) <= max_drift:
                    for prev_index, curr_index in zip(range(i1, i2), range(j1, j2)):
                        prev_for_curr[curr_mid_start + curr_index] = prev_mid_start + prev_index
                elif tag in ('replace', 'equal') and (i2 - i1) * (j2 - j1) <= WORD_MATCH_MAX_LOCAL_PAIRS:
                    local_matches = _nearest_word_matches(prev_mid[i1:i2],
                                                          curr_mid[j1:j2],
                                                          prev_mid_start + i1,
                                                          curr_mid_start + j1,
                                                          max_drift)
                    for curr_index, prev_index in local_matches.items():
                        prev_for_curr[curr_mid_start + j1 + curr_index] = prev_mid_start + i1 + prev_index

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread WikiWho/wikiwho.py Outdated
Comment thread WikiWho/wikiwho.py
Comment thread WikiWho/wikiwho.py
Comment thread WikiWho/wikiwho.py
@QZGao

QZGao commented May 17, 2026

Copy link
Copy Markdown
Author

This is only a fraction of the article, but as we can see, the differences are rather dramatic. Some examples show the new algorithm is better, others show it is worse.

Thanks for more test cases. I'll experiment with those test cases to see if they are patch-able. Differ is currently the major obstacle for the huge articles like [[Google Play]] to run successfully, so I'll still try my best to improve the algorithm based on SequenceMatcher.

The thing is, despite its flaws, the older algorithm has been widely battle-tested and trusted for over a decade now. In an earlier version of this PR (before 3c734ab I think), you said you were able to offer performance improvements without changing the algorithm. Would it be better to start with just that, perhaps?

Indeed. You can checkout 89ca645 as that's the last commit without differ changes.

@QZGao

QZGao commented May 17, 2026

Copy link
Copy Markdown
Author

In the meantime, should I go ahead and process zhwiki? I was waiting for this to get the performance improvements since CJK processes much slower now that each character is tokenized, but given the risks, maybe it's better to just move forward with the older algorithm. You decide :)

We could wait a day or two more, as I'm still motivated to improve the algorithm at the moment. Let's see if those worse case situations are solvable or not.

@QZGao

QZGao commented May 18, 2026

Copy link
Copy Markdown
Author

With 7884b1e, the word matcher now tracks match confidence while assigning tokens. The new helper is _assign_word_match(). Every match records three things together: which previous token a current token maps to, how confident that match is, and which current token is already using a previous token. This matters because the matcher now has several passes that can overlap. A local nearest-token match should not block a better moved-run match, and a moved-run match should not override a hard prefix/suffix edge match. The confidence values encode that ordering.

The main new behavior is _recover_moved_word_runs(). After prefix/suffix matching and SequenceMatcher, the code records the non-equal opcode ranges into move_prev_spans and move_curr_spans. Those are the only regions searched for moved text. The recovery pass builds n-gram indexes for those spans using sizes 10, 8, 6, 4, 3. It only indexes windows with enough informative tokens, so punctuation-heavy windows do not seed a moved match.

When the same n-gram appears exactly once on the previous side and exactly once on the current side, it becomes a candidate. The code extends that candidate left and right while adjacent tokens still match. Then it checks whether the run is safe to treat as a moved copy. That check takes the longest informative content core from the run and confirms that the core appears exactly once in the full previous article text and exactly once in the full current article text.

That full-article uniqueness check is the expensive part. It is why full-history runs on large pages are now about 25-30% slower. The pair-index cache in _count_subsequence_cached() reduces the cost by indexing token pairs and only counting up to 2, but the matcher is still doing real extra work that the old code did not do.

The link fix is _recover_edited_link_boundaries(). The existing keying makes [[ and ]] depend on the full link target. That is good for avoiding random delimiter matches, but it breaks when a link target is extended. For example, if [[Category:Video games]] becomes [[Category:Video games developed in Japan]], the words Category:Video games match, but the delimiters do not because the full target changed. The new pass scans link spans after normal token matching. If the old unpiped target is still a contiguous, substantial part of the new target, and those target tokens are already matched, it reattaches the old [[ and ]].

The suffix fix is _demote_stale_suffix_edge_matches(). Prefix/suffix matching is usually correct, but it can keep very old short words alive after a large deletion right before an unchanged suffix. The code now handles only pure deletions before a suffix. It looks at the first small window of that suffix and unassigns old alphabetic glue tokens like to, in, or of when the previous Word already has old lineage. It does not touch content words, punctuation, or replacement edits.

The tradeoff: moved-run recovery gives better attribution for relocated copied text, but it costs time. The link-boundary and stale-suffix fixes are cheap. The 25-30% slowdown mostly comes from the moved-run uniqueness checks.

@QZGao

QZGao commented May 18, 2026

Copy link
Copy Markdown
Author

I made all previous analyzed situations test cases, and all failures are cleared now.

# Article Case Description Pass Values Origin Revs Last Revs Inbound Outbound
1 Adam Himebauch Phrase Beginning in 2022 Yes beginning, in, 2022 1082167926, 1082167926, 1082167926 1316823741, 1316823741, 1316823741 [], [], [] [], [], []
2 Adam Himebauch First Hanksy mention Yes hanksy 1315073789 1316823741 [] []
3 Adam Himebauch First NYT reference title token A Yes a 1316823741 1316823741 [] []
4 Adam Himebauch https token in first NYT URL Yes https 767656880 1316823741 [] []
5 Adam Himebauch Rest of first NYT URL after https Yes :, /, /, www, ., nytimes, ., com, /, 2014, /, 02, /, 16, /, nyregion, /, a, -, parodist, -, who, -, calls, -, himself, -, hanksy, ., html all 678458561 all 1316823741 all [] all []
6 Adam Himebauch The in The image went viral Yes the 754869525 1316823741 [] []
7 Adam Himebauch Reference title Reverend Jen exclusive on Hanksy Yes reverend, jen, exclusive, on, hanksy all 754869525 all 1316823741 all [] all []
8 Adam Himebauch Date range Between 2011 and 2017 Yes between, 2011, and, 2017 1162525604, 754869525, 1162525604, 1162559322 all 1316823741 all [] all []
9 Adam Himebauch Words created and images in the street-art sentence Yes created, images 754869525, 754869525 1316823741, 1316823741 [], [] [], []
10 Adam Himebauch Phrase exhibited across the country Yes exhibited, across, the, country all 754869525 all 1316823741 all [] all []
11 Adam Himebauch Second NYT title A Parodist Who Calls Himself Hanksy Yes a, parodist, who, calls, himself, hanksy all 1078016622 all 1316823741 all [] all []
12 Japan Cup Token over in rewritten distance wording Yes over 24011298 112007899 [] [112010328]
13 Splatoon 3 Lead title Splatoon 3, not the infobox title field Yes splatoon, 3 1007397055, 1007397055 1007397695, 1007397695 [], [] [], []
14 Splatoon 3 Category [[Category:Video games developed in Japan]] split between reused broad category and new words Yes [[, category, :, video, games, developed, in, japan, ]] 1007397440, 1007397440, 1007397440, 1007397440, 1007397440, 1007397695, 1007397695, 1007397695, 1007397440 all 1007397695 all [] all []
15 2026 Canvas security incident Short-description closing token }} stays continuous Yes }} 1353079526 1354204849 [] []
16 2026 Canvas security incident Adjacent duplicate said said Yes said, said 1353079526, 1353079526 1353090711, 1353090491 [], [] [], [1353090711]
17 American goldfinch Original period removed when the length phrase is inserted Yes . 24921 122039732 [] [122044470]
18 American goldfinch Image link opener [[ preserved while image target stays same Yes [[ 859274 15936180 [] [18645783]
19 American goldfinch Image link opener [[ resets after image target changes Yes [[ 18645783 18692594 [] [20553313]
20 American goldfinch Taxobox opener {{ resets on template conversion Yes {{ 37287458 37287458 [] []
21 Nintendo Switch Apostrophe in Don't from Article Wizard comment, not later title markup Yes ' 651807915 651992792 [651992792] [651930777, 651993021]
22 Through the Looking-Glass Stub token to in Sequel to ... should not survive final mature lead Yes to 291417 1268074738 [] [1273903188]

@QZGao

QZGao commented May 18, 2026

Copy link
Copy Markdown
Author

@MusikAnimal Please, compare more revisions and we can see if there are any more improvements that should be made.

@QZGao

QZGao commented May 18, 2026

Copy link
Copy Markdown
Author

Let me also ask Copilot to review it.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

@xenacode-art

Copy link
Copy Markdown

Ran regression tests against the latest commit (7884b1e). Results vs master:

Article Master tokens This PR Δ
Adam Himebauch 8,572 8,341 −231
Splatoon 3 37,819 37,817 −2

Splatoon 3 has nearly converged with master (−2). The larger Adam Himebauch gap (−231 vs −72 in the previous round) reflects the moved-run recovery in 7884b1e successfully matching more token occurrences as reuses of existing tokens rather than creating new entries — which is what improved attribution looks like in the stored token count.
Tests are running in 17s on these two articles. Happy to add more articles to the fixture set if useful.

@MusikAnimal

Copy link
Copy Markdown
Member

Sorry for the delay! I've been out on holiday, then my team at WMF was disbanded :( Rest assured though, I will see this through, or at least the (re-)adding CJK languages to WikiWho. I hope to get to this review this week but I am very busy trying to delegate CommTech projects before the team is officially no more. WikiWho is safe, for the record. I will continue to maintain it with Sage Ross.

@ragesoss ragesoss left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note on this feedback: Parts of this review were produced with Claude Code (parity experiment, drafting). It's offered as input to read, understand, and use or set aside on your own judgment — not as authoritative conclusions. The central numbers below are reproducible; the interpretation is provisional.


The performance fix is real and worth having. The old analyse_words_in_sentences was genuinely O(n²) — the post-Differ triple rescan plus self.temp.count() in the duplicate-paragraph/sentence paths — and the worst-case revisions (the ~36s Google Play case) were a real problem. The spam_hashes_set, the duplicate-count dicts, the module-level token-symbol list, and the single-pass slot assignment are all sound and uncontroversial improvements.

The concern is that this PR also replaces the matching algorithm, and that change is consumer-visible. The diff drops difflib.Differ for bare SequenceMatcher.get_opcodes() plus a new stack of heuristics (a confidence ladder, wikitext-aware match keys, bounded nearest-neighbor recovery, n-gram moved-run recovery, edited-link-boundary recovery, stale-suffix demotion). Differ is not SequenceMatcherDiffer adds a character-level _fancy_replace layer on top of it — so opcode-level matching produces different token-to-token assignments even before the new heuristics are considered.

To measure the impact, each article's full revision history was replayed through both the base (Differ) and PR versions of the library, and the per-token output of the target revision was diffed. Structural state (revisions processed, spam IDs, paragraph/sentence hash tables) is byte-for-byte identical between the two. Per-token attribution is not:

Article revs o_rev_id differs (all tokens) o_rev_id differs (prose words only) inbound/outbound differs
Photosynthesis 5.4K 19.3% 14.5% 14.5%
Wikipedia (simplewiki) 3.5K 4.9% 3.7% 37.5%
Jesse Owens 6.4K 6.2% 4.8% 2.5%

The "prose words only" column excludes wikitext punctuation/markup, so this is not just template-syntax reshuffling — content words change attributed origin revision (and therefore editor) for 4–15% of the article. Examples from Photosynthesis: schematic of photosynthesis, vegetation; from Jesse Owens: athletics at the summer olympics. Three of WikiWho's four downstream consumers (the Wiki Education Dashboard's ArticleViewer, XTools Authorship/Blame, the WhoWroteThat gadget) render attribution per token, so this surfaces as different coloring / different "added by" results. The inbound/outbound deltas feed WhoColor conflict scores.

Important: these numbers measure divergence from the current algorithm, not correctness. The PR's stated aim ("Improve word-level authorship matching for moved text") is to attribute better than Differ — and some fraction of these differences may well be improvements (moved text credited to its original author rather than the mover is a plausible win). But which differences are improvements and which are regressions has not been established here, and that question is not cheap to answer — it requires sampling specific divergences and judging editing intent case by case.

Why the PR's own verification doesn't settle this. The description verifies matched_prev count, the vandalism flag, token_id_delta, tokens_delta, and per-sentence word totals on one revision (Google Play 1296988276). Those are aggregate counts on a single diff step. Two matchers can produce the same number of matched tokens while assigning them to different origin revisions — which is exactly what the table above shows (the structural counts match; the per-token o_rev_ids don't). Aggregate-count parity on one revision can't detect attribution drift across a history.

Suggestions, in rough priority:

  1. Consider splitting the PR. The performance fixes (set-based spam lookup, duplicate-count dicts, single-pass slot assignment, module-level constants) are separable from the matcher replacement and could merge on their own with little risk. That isolates the part that needs scrutiny.
  2. If the algorithm change is intended, characterize it before merging. A per-token replay over a corpus of full histories (the method used above) quantifies the drift; sampling N divergent tokens and judging better/worse gives a quality signal. Because the change is consumer-visible, the maintainers of the per-token consumers above probably want a say before this lands.
  3. Tests. This adds ~860 lines to the core matcher in a repo with no test suite or CI. Even a handful of unit tests on the new helpers, plus one full-history regression check, would make the behavior reviewable and pin it against future drift.
  4. The description is stale. It describes a single-pass cursor optimization; the PR is a matcher rewrite. The note at the bottom acknowledges this, but a reader still has to reverse-engineer the actual change. (Copilot's review raised the same point, plus the _nearest_word_matches nearest-position concern and the unreferenced WORD_MATCH_* constants — worth a look.)

Drafted in a Claude Code session directed by Sage. The attribution-drift figures were produced by replaying each article's full history through both the base and PR builds of the library and diffing per-token output; that comparison is reproducible and was run during the session, not taken on faith. Sage directed the review and the experiment and read this comment before posting. The reading of the 860-line diff and the per-token analysis were done by Claude Code and were not independently re-derived by hand; the divergence-vs-correctness question is left open, not resolved.

@ragesoss

ragesoss commented Jun 5, 2026

Copy link
Copy Markdown

Follow-up to the review above. Same caveats: AI-assisted, provisional, offered as input to weigh rather than a verdict.

The earlier comment measured how much attribution changes but left open whether the changes are improvements or regressions. A closer look at where the divergence actually lands, on the same three article histories:

Most of the divergence is on tokens where per-token authorship is arbitrary, or isn't "prose authorship" at all. Across the three articles, 90–98% of the tokens whose o_rev_id changed have a value that recurs elsewhere in the article — the ambiguous-match case, where which prior the / | / of a current token descends from is genuinely undefined. 55–68% sit inside templates/citations or are markup delimiters (|, =, {{, [[). Differ's assignments there were never well-specified either, so this is churn between two equally-arbitrary answers, not a regression.

On the smaller slice that is genuine prose, the PR's attribution looks more correct, not less. 38–50% of divergent tokens fall in contiguous runs of 5+ positions — the fingerprint of moved or reformatted passages. Tracing specific content words back through the actual revision text:

Phrase First authored in Differ (current) credits PR credits
"Charles Riley" (Jesse Owens) 95595 (2002) 18254337 (2005, "tweaks") 95595
"Adi Dassler" (Jesse Owens) 234610129 (2008) 567245817 (2013, "Filling in 3 references using Reflinks") 234610129
"growing [less quickly]" (simplewiki) 5022409 (2015) 8311103 (2022, "fixed … 5x extra whitespace") 5022409

The pattern is consistent: Differ mis-credits maintenance edits — whitespace fixes, reference-filling, tweaks — that reshaped text without authoring it, and the PR's moved-run recovery traces the prose back to where it was actually written. That is a real, longstanding WikiWho weakness, and it is the dimension this PR targets.

This is not a clean sweep. The PR can also over-reach in the other direction — a couple of words adjacent to the simplewiki passage were pulled to a 2009 revision that predates the phrase, via a spurious duplicate match. And this is four hand-checked cases, not a measured rate; a larger better/worse sample would firm it up (and is a reasonable thing to ask for before relying on the new behavior).

Net: the earlier "changes attribution for X% of tokens" figure overstates the consumer-meaningful impact, and the meaningful part appears to be a quality improvement. The suggestions from the prior comment still stand — separating the uncontroversial performance fixes from the matcher change, adding tests, and getting sign-off from the per-token consumers — but the matcher change now looks worth pursuing on its merits, not merely tolerating as a risk.

Drafted in a Claude Code session directed by Sage. The "first authored in" revisions were verified by searching each phrase against the actual stored revision text of the article's history; that check is reproducible. The where-divergence-lands breakdown was computed by replaying both library builds over full histories and classifying every changed token. Sage directed the analysis and reviewed this comment before posting; the classification and per-token analysis were done by Claude Code and not independently re-derived by hand. A quantified better/worse/wash rate over a large sample has not yet been computed — the "looks more correct" claim rests on the four verified cases plus the structural and directional evidence.

@ragesoss

ragesoss commented Jun 5, 2026

Copy link
Copy Markdown

Quantifying the better/worse question (delivering on "a larger sample would firm it up" from the comment above; same caveats — AI-assisted, provisional).

Adjudicating a single token's "true author" automatically is circular (it is the attribution problem), so the test used here sidesteps it: for each prose divergence, take a short context window and ask whether each credited revision actually introduced that text (window present in the revision, absent in its parent) or merely inherited it (window already in the parent — a later edit that reshaped nearby text without writing this).

Among the cases where exactly one of the two credited revisions actually introduced the prose:

Article decisive cases PR introduced / Differ only inherited Differ introduced / PR inherited
Photosynthesis 50 49 (98%) 1
Wikipedia (simplewiki) 4 4 (100%) 0
Jesse Owens 51 48 (94%) 3

So where it's determinable, the PR credits the true author ~94–100% of the time and Differ is the one inheriting — consistent with the reformatting-steals-authorship pattern across the board, not just the hand-picked cases.

Caveats, to keep this honest:

  • The test only adjudicates ~20–30% of prose divergences; the rest are indeterminate because reformatting changed the exact wording, so the window isn't stably present in one credited revision. Those aren't PR losses — just not cleanly decidable this way.
  • This covers the prose slice (~9–12% of total divergence); the other ~90% is the arbitrary tier (duplicated tokens, markup, citation internals) where neither matcher's choice was ever well-defined.
  • The PR isn't perfect — it loses a few decisive cases per article (the over-reach-via-duplicate-match direction noted earlier), so this is ~94–100%, not 100%.

Net of all three comments: the performance fix is clearly worth having; the matcher change is a real attribution-quality improvement on moved/reformatted text, measurable and in the right direction. The earlier suggestions stand — split the uncontroversial perf work from the matcher change, add tests (the introduced-vs-inherited check above is itself a usable regression-test shape), and loop in the per-token consumers — but the matcher work looks worth landing on its merits.

Drafted in a Claude Code session directed by Sage. The introduced-vs-inherited adjudication was run over sampled prose divergences from full-history replays of three articles, checking each credited revision's text against its parent's; reproducible. Sage directed the analysis and reviewed this before posting. The method and code were produced by Claude Code; the ~94–100% figure is the machine-computed rate on the decisive subset, not independently re-derived by hand beyond the four cases verified in the previous comment.

@QZGao QZGao changed the title Optimize diff algorithm's time complexity Optimize diff algorithm's time complexity, part 2 Jun 6, 2026
@QZGao

QZGao commented Jun 6, 2026

Copy link
Copy Markdown
Author

I'll merge #3 into this branch (to be consistent with existing test suite), then add dedicated test cases I previously mentioned.

@QZGao

QZGao commented Jun 6, 2026

Copy link
Copy Markdown
Author

Now the test cases are also implemented into the branch.

@MusikAnimal

Copy link
Copy Markdown
Member

I'll merge #3 into this branch (to be consistent with existing test suite), then add dedicated test cases I previously mentioned.

Now the test cases are also implemented into the branch.

Sorry I missed these comments. I merged #3 already.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 17 changed files in this pull request and generated no new comments.

@MusikAnimal

Copy link
Copy Markdown
Member

This PR does change algorithm, correct?

I'm wondering what folks think of just moving forward without this. #4 by itself improves performance. I just worry changing the way things are attributed is going to cause confusion, for better or for worse.

I am also out of time. I am happy to run the script to import the CJK languages and do one more final release of downstream projects, but beyond that, I don't really have much energy left to invest in testing this PR. Perhaps Sage is able to, however.

Let me know what you all think :)

@QZGao

QZGao commented Jun 8, 2026

Copy link
Copy Markdown
Author

The algorithm is already outdated -- and as people have been noticing, more pages will gradually start to be unable to render (especially on enwiki) just like [[Google Play]]. Therefore, I would highly recommend we optimize the algorithm now, while we are at it.

@ragesoss

ragesoss commented Jun 8, 2026

Copy link
Copy Markdown

Yeah, I'll keep up with this PR @MusikAnimal. From what I've tested so far, I do think the algorithm changes are going to improve attribution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants