fix: rune-safe byte slicing in Patch{Apply,SplitMax,AddContext} (fixes #132) by SnowWarri0r · Pull Request #155 · sergi/go-diff

SnowWarri0r · 2026-05-21T11:56:24Z

Fixes #132 (PatchApply slice bounds out of range on multi-byte text).

Root cause

PatchApply / PatchSplitMax / PatchAddContext slice text by UTF-8 byte index throughout. When MatchMain's Bitap returns a byte position that lands in the middle of a multi-byte rune (CJK 3 bytes, em dash 3 bytes, emoji 4 bytes), the resulting text[a:b] produces an invalid-UTF-8 string. Downstream calls (MatchMain on the slice, DiffMain, further text[c:d]) then either silently miss anchors or panic with `slice bounds out of range [:N] with length M`.

The reference JS diff-match-patch hits the same code paths but slices by UTF-16 code unit, where surrogate-pair-middle splits are well-defined — so the original algorithm tolerates mid-codepoint cuts in JS but not in Go.

Fix

Introduce a small `alignRuneStart(s, idx)` helper and call it before every byte-level slice that uses an index derived from `MatchMain` / `DiffXIndex` / arithmetic on multi-byte text. The alignment costs at most 1–3 bytes of shrinkage per cut point but guarantees the result is always valid UTF-8, so subsequent `MatchMain` / `DiffMain` operations behave predictably.

Touched call sites:

`PatchAddContext`: prefix start, suffix end
`PatchApply`: `startLoc` after `MatchMain`, `text2` end bound, replace end (perfect-match path), insert-at index, delete start/end (imperfect match path)
`PatchSplitMax`: `diffText` cut point (+ loop-progress guard so a `cutAt` that aligns to 0 always advances by one full rune), remainder slice on `bigpatch.Text`, precontext start, postcontext end

Tests

`TestAlignRuneStart`: byte-index → rune-start mapping (ASCII + multi-byte cases).
`TestPatchApply_Issue132_NoPanic`: the minimal repro from PatchApply panics with slice bounds out of range #132 no longer panics.
`TestPatchSplitMax_NoInvalidUTF8`: pure-CJK `PatchMake` + `PatchSplitMax` produces only valid-UTF-8 diff text.

Full existing test suite still passes; no behavior change on ASCII-only input (all alignment calls are no-ops when the index already sits at a rune boundary or at `len(s)`).

Context

This fix has been running on production service (Chinese-heavy HTML document patch apply, ~40KB documents, 200+ patch hunks per request). The original panic was reproducible at ~15% of requests; after this fix, panic rate is 0 and the suite has been stable for ~24h.

A follow-up PR will add an opt-in rune-based parallel API (`PatchApplyRunes` etc.) that goes further and resolves cross-implementation drift with the JS reference's UTF-16 unit semantics. This PR focuses purely on the panic fix and stays byte-mode.

Fixes sergi#132 (PatchApply slice bounds out of range on multi-byte text). PatchApply / PatchSplitMax / PatchAddContext slice text by UTF-8 byte index throughout. When MatchMain's Bitap returns a byte position that lands in the middle of a multi-byte rune (CJK 3 bytes, em dash 3 bytes, emoji 4 bytes), the resulting text[a:b] produces an invalid-UTF-8 string. Downstream calls (MatchMain on the slice, DiffMain, further text[c:d]) then either silently miss anchors or panic with "slice bounds out of range [:N] with length M". The reference JS diff-match-patch hits the same code paths but slices by UTF-16 code unit, where surrogate-pair-middle splits are well-defined — so the original algorithm tolerates mid-codepoint cuts in JS but not in Go. Fix: introduce a small `alignRuneStart(s, idx)` helper and call it before every byte-level slice that uses an index derived from MatchMain / DiffXIndex / arithmetic on multi-byte text. The alignment costs at most 1-3 bytes of shrinkage per cut point but guarantees the result is always valid UTF-8, so subsequent MatchMain / DiffMain operations behave predictably. Touched call sites: - PatchAddContext: prefix start, suffix end - PatchApply: startLoc after MatchMain, text2 end bound, replace end (perfect-match path), insert-at index, delete start/end (imperfect match path) - PatchSplitMax: diffText cut point (+ loop-progress guard so cutAt that aligns to 0 always advances by one full rune), remainder slice on bigpatch.Text, precontext start, postcontext end Tests added: - TestAlignRuneStart: byte-index → rune-start mapping (ASCII + multi-byte cases). - TestPatchApply_Issue132_NoPanic: the minimal repro from the upstream issue no longer panics. - TestPatchSplitMax_NoInvalidUTF8: pure-CJK PatchMake + PatchSplitMax produces only valid-UTF-8 diff text. Full existing test suite still passes; no behavior change on ASCII-only input (all alignment calls are no-ops when the index already sits at a rune boundary or at len(s)).

SnowWarri0r mentioned this pull request May 21, 2026

feat: add rune-based Patch / Match parallel API for JS interop #156

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: rune-safe byte slicing in Patch{Apply,SplitMax,AddContext} (fixes #132)#155

fix: rune-safe byte slicing in Patch{Apply,SplitMax,AddContext} (fixes #132)#155
SnowWarri0r wants to merge 1 commit into
sergi:masterfrom
SnowWarri0r:pr1/fix-issue-132-rune-safe-slicing

SnowWarri0r commented May 21, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

SnowWarri0r commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Root cause

Fix

Tests

Context

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

SnowWarri0r commented May 21, 2026 •

edited

Loading