fix: rune-safe byte slicing in Patch{Apply,SplitMax,AddContext} (fixes #132)#155
Open
SnowWarri0r wants to merge 1 commit into
Open
fix: rune-safe byte slicing in Patch{Apply,SplitMax,AddContext} (fixes #132)#155SnowWarri0r wants to merge 1 commit into
SnowWarri0r wants to merge 1 commit into
Conversation
Fixes sergi#132 (PatchApply slice bounds out of range on multi-byte text). PatchApply / PatchSplitMax / PatchAddContext slice text by UTF-8 byte index throughout. When MatchMain's Bitap returns a byte position that lands in the middle of a multi-byte rune (CJK 3 bytes, em dash 3 bytes, emoji 4 bytes), the resulting text[a:b] produces an invalid-UTF-8 string. Downstream calls (MatchMain on the slice, DiffMain, further text[c:d]) then either silently miss anchors or panic with "slice bounds out of range [:N] with length M". The reference JS diff-match-patch hits the same code paths but slices by UTF-16 code unit, where surrogate-pair-middle splits are well-defined — so the original algorithm tolerates mid-codepoint cuts in JS but not in Go. Fix: introduce a small `alignRuneStart(s, idx)` helper and call it before every byte-level slice that uses an index derived from MatchMain / DiffXIndex / arithmetic on multi-byte text. The alignment costs at most 1-3 bytes of shrinkage per cut point but guarantees the result is always valid UTF-8, so subsequent MatchMain / DiffMain operations behave predictably. Touched call sites: - PatchAddContext: prefix start, suffix end - PatchApply: startLoc after MatchMain, text2 end bound, replace end (perfect-match path), insert-at index, delete start/end (imperfect match path) - PatchSplitMax: diffText cut point (+ loop-progress guard so cutAt that aligns to 0 always advances by one full rune), remainder slice on bigpatch.Text, precontext start, postcontext end Tests added: - TestAlignRuneStart: byte-index → rune-start mapping (ASCII + multi-byte cases). - TestPatchApply_Issue132_NoPanic: the minimal repro from the upstream issue no longer panics. - TestPatchSplitMax_NoInvalidUTF8: pure-CJK PatchMake + PatchSplitMax produces only valid-UTF-8 diff text. Full existing test suite still passes; no behavior change on ASCII-only input (all alignment calls are no-ops when the index already sits at a rune boundary or at len(s)).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #132 (PatchApply slice bounds out of range on multi-byte text).
Root cause
PatchApply/PatchSplitMax/PatchAddContextslice text by UTF-8 byte index throughout. WhenMatchMain's Bitap returns a byte position that lands in the middle of a multi-byte rune (CJK 3 bytes, em dash 3 bytes, emoji 4 bytes), the resultingtext[a:b]produces an invalid-UTF-8 string. Downstream calls (MatchMainon the slice,DiffMain, furthertext[c:d]) then either silently miss anchors or panic with `slice bounds out of range [:N] with length M`.The reference JS diff-match-patch hits the same code paths but slices by UTF-16 code unit, where surrogate-pair-middle splits are well-defined — so the original algorithm tolerates mid-codepoint cuts in JS but not in Go.
Fix
Introduce a small `alignRuneStart(s, idx)` helper and call it before every byte-level slice that uses an index derived from `MatchMain` / `DiffXIndex` / arithmetic on multi-byte text. The alignment costs at most 1–3 bytes of shrinkage per cut point but guarantees the result is always valid UTF-8, so subsequent `MatchMain` / `DiffMain` operations behave predictably.
Touched call sites:
Tests
Full existing test suite still passes; no behavior change on ASCII-only input (all alignment calls are no-ops when the index already sits at a rune boundary or at `len(s)`).
Context
This fix has been running on production service (Chinese-heavy HTML document patch apply, ~40KB documents, 200+ patch hunks per request). The original panic was reproducible at ~15% of requests; after this fix, panic rate is 0 and the suite has been stable for ~24h.
A follow-up PR will add an opt-in rune-based parallel API (`PatchApplyRunes` etc.) that goes further and resolves cross-implementation drift with the JS reference's UTF-16 unit semantics. This PR focuses purely on the panic fix and stays byte-mode.