Skip to content

fix: rune-safe byte slicing in Patch{Apply,SplitMax,AddContext} (fixes #132)#155

Open
SnowWarri0r wants to merge 1 commit into
sergi:masterfrom
SnowWarri0r:pr1/fix-issue-132-rune-safe-slicing
Open

fix: rune-safe byte slicing in Patch{Apply,SplitMax,AddContext} (fixes #132)#155
SnowWarri0r wants to merge 1 commit into
sergi:masterfrom
SnowWarri0r:pr1/fix-issue-132-rune-safe-slicing

Conversation

@SnowWarri0r
Copy link
Copy Markdown

@SnowWarri0r SnowWarri0r commented May 21, 2026

Fixes #132 (PatchApply slice bounds out of range on multi-byte text).

Root cause

PatchApply / PatchSplitMax / PatchAddContext slice text by UTF-8 byte index throughout. When MatchMain's Bitap returns a byte position that lands in the middle of a multi-byte rune (CJK 3 bytes, em dash 3 bytes, emoji 4 bytes), the resulting text[a:b] produces an invalid-UTF-8 string. Downstream calls (MatchMain on the slice, DiffMain, further text[c:d]) then either silently miss anchors or panic with `slice bounds out of range [:N] with length M`.

The reference JS diff-match-patch hits the same code paths but slices by UTF-16 code unit, where surrogate-pair-middle splits are well-defined — so the original algorithm tolerates mid-codepoint cuts in JS but not in Go.

Fix

Introduce a small `alignRuneStart(s, idx)` helper and call it before every byte-level slice that uses an index derived from `MatchMain` / `DiffXIndex` / arithmetic on multi-byte text. The alignment costs at most 1–3 bytes of shrinkage per cut point but guarantees the result is always valid UTF-8, so subsequent `MatchMain` / `DiffMain` operations behave predictably.

Touched call sites:

  • `PatchAddContext`: prefix start, suffix end
  • `PatchApply`: `startLoc` after `MatchMain`, `text2` end bound, replace end (perfect-match path), insert-at index, delete start/end (imperfect match path)
  • `PatchSplitMax`: `diffText` cut point (+ loop-progress guard so a `cutAt` that aligns to 0 always advances by one full rune), remainder slice on `bigpatch.Text`, precontext start, postcontext end

Tests

  • `TestAlignRuneStart`: byte-index → rune-start mapping (ASCII + multi-byte cases).
  • `TestPatchApply_Issue132_NoPanic`: the minimal repro from PatchApply panics with slice bounds out of range #132 no longer panics.
  • `TestPatchSplitMax_NoInvalidUTF8`: pure-CJK `PatchMake` + `PatchSplitMax` produces only valid-UTF-8 diff text.

Full existing test suite still passes; no behavior change on ASCII-only input (all alignment calls are no-ops when the index already sits at a rune boundary or at `len(s)`).

Context

This fix has been running on production service (Chinese-heavy HTML document patch apply, ~40KB documents, 200+ patch hunks per request). The original panic was reproducible at ~15% of requests; after this fix, panic rate is 0 and the suite has been stable for ~24h.

A follow-up PR will add an opt-in rune-based parallel API (`PatchApplyRunes` etc.) that goes further and resolves cross-implementation drift with the JS reference's UTF-16 unit semantics. This PR focuses purely on the panic fix and stays byte-mode.

Fixes sergi#132 (PatchApply slice bounds out of range on multi-byte text).

PatchApply / PatchSplitMax / PatchAddContext slice text by UTF-8 byte
index throughout. When MatchMain's Bitap returns a byte position that
lands in the middle of a multi-byte rune (CJK 3 bytes, em dash 3 bytes,
emoji 4 bytes), the resulting text[a:b] produces an invalid-UTF-8
string. Downstream calls (MatchMain on the slice, DiffMain, further
text[c:d]) then either silently miss anchors or panic with
"slice bounds out of range [:N] with length M".

The reference JS diff-match-patch hits the same code paths but slices
by UTF-16 code unit, where surrogate-pair-middle splits are
well-defined — so the original algorithm tolerates mid-codepoint cuts
in JS but not in Go.

Fix: introduce a small `alignRuneStart(s, idx)` helper and call it
before every byte-level slice that uses an index derived from
MatchMain / DiffXIndex / arithmetic on multi-byte text. The alignment
costs at most 1-3 bytes of shrinkage per cut point but guarantees the
result is always valid UTF-8, so subsequent MatchMain / DiffMain
operations behave predictably.

Touched call sites:
  - PatchAddContext: prefix start, suffix end
  - PatchApply: startLoc after MatchMain, text2 end bound, replace end
    (perfect-match path), insert-at index, delete start/end (imperfect
    match path)
  - PatchSplitMax: diffText cut point (+ loop-progress guard so cutAt
    that aligns to 0 always advances by one full rune), remainder
    slice on bigpatch.Text, precontext start, postcontext end

Tests added:
  - TestAlignRuneStart: byte-index → rune-start mapping (ASCII +
    multi-byte cases).
  - TestPatchApply_Issue132_NoPanic: the minimal repro from the
    upstream issue no longer panics.
  - TestPatchSplitMax_NoInvalidUTF8: pure-CJK PatchMake + PatchSplitMax
    produces only valid-UTF-8 diff text.

Full existing test suite still passes; no behavior change on
ASCII-only input (all alignment calls are no-ops when the index already
sits at a rune boundary or at len(s)).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PatchApply panics with slice bounds out of range

1 participant