feat: add rune-based Patch / Match parallel API for JS interop by SnowWarri0r · Pull Request #156 · sergi/go-diff

SnowWarri0r · 2026-05-21T11:57:02Z

Note: this PR is stacked on top of #155 (rune-safe byte slicing). Suggest reviewing #155 first.

Motivation

The reference Google JS diff-match-patch implementation slices and counts length by UTF-16 code unit. This Go port slices and counts by UTF-8 byte. For ASCII the two are identical, but for CJK / em dash / emoji-heavy text the divergence breaks interoperability when a Go service consumes patch text produced by a JS counterpart:

`patch_splitMax` compares `Length1` to `MatchMaxBits` (32). A 30-CJK segment is 60 UTF-16 units (within limit in JS, kept whole) but 90 bytes here (split into ~9-char fragments). Short fragments produce short Bitap patterns and weak fuzzy matches.
`expected_loc = Start2 + delta` is treated as a byte offset here, but the wire numbers were emitted as UTF-16 units. N multi-byte characters preceding the patch shift the real anchor by N bytes, often beyond `MatchDistance`'s default 1000-byte window.
`matchAlphabet` maps `map[byte]int`. Each CJK rune contributes its three continuation bytes as separate alphabet entries, scrambling the Bitap masks on dense CJK text.

The result is that a patch JS produces and applies byte-perfect against some base text fails ~15% of segments when the same `(base, patch)` is fed to `PatchApply` here. Tested on a 44 KB Chinese HTML document with 262 hunks: byte-mode produces `failed=39/274`, rune-mode produces `failed=0/271` byte-perfect.

Change

Add a parallel rune-based path. Algorithm structure is preserved 1:1; only length and indexing model differs:

`match_runes.go`

`MatchMainRunes` / `MatchBitapRunes` / `MatchAlphabetRunes`
Alphabet keyed by rune (one entry per Unicode code point)
`runesLastIndexBefore` helper completing the rune analogues of `runesIndex` / `runesIndexOf` already present in `stringutil.go`

`patch_runes.go`

`DiffText1Runes` / `DiffText2Runes` / `DiffXIndexRunes`
`PatchAddPaddingRunes`
`PatchSplitMaxRunes`
`PatchApplyRunes(patches []Patch, text []rune) ([]rune, []bool)` — all length comparisons against `MatchMaxBits` / `PatchMargin` are rune-count; all text and `diffText` slices are rune-index

Callers pass `[]rune(s)` and `string(out)` the result.

Coverage

1 rune == 1 UTF-16 unit for all BMP characters (U+0000..U+FFFF), which covers CJK ideographs, em dash, most Latin extended sets, and the vast majority of Unicode in practice. Supplementary-plane characters (emoji, U+10000+) are 1 rune vs 2 UTF-16 units, so the rune path still diverges from JS for emoji-only slices. A future improvement could add a true UTF-16 code-unit path, but rune-based already covers the common JS-interop cases.

Compatibility

Existing byte-based `PatchApply` is untouched — this is opt-in.
Existing test suite passes unchanged.
New API surface is additive (`*Runes` suffix).

Tests

`TestPatchApplyRunes_RoundTripCJK`: `PatchMake` → `PatchToText` → `PatchFromText` → `PatchApplyRunes` round trip on CJK base, byte-perfect target reproduction.
`TestPatchApplyRunes_ASCII_EquivToByteApply`: ASCII input produces identical output via either path.
`TestMatchMainRunes_BasicCJK`: rune-index Bitap hit verification.
`TestDiffXIndexRunes_CountsRunes`: index returned in rune units.

Real-world impact

Running on production service (a Go BFF accepting patches from a JS-side editor). After switching from `PatchApply` to `PatchApplyRunes`, the per-segment failure rate dropped from ~15% to 0 on a representative corpus of Chinese HTML documents. The 422 error rate on the corresponding HTTP endpoint dropped to noise.

Fixes sergi#132 (PatchApply slice bounds out of range on multi-byte text). PatchApply / PatchSplitMax / PatchAddContext slice text by UTF-8 byte index throughout. When MatchMain's Bitap returns a byte position that lands in the middle of a multi-byte rune (CJK 3 bytes, em dash 3 bytes, emoji 4 bytes), the resulting text[a:b] produces an invalid-UTF-8 string. Downstream calls (MatchMain on the slice, DiffMain, further text[c:d]) then either silently miss anchors or panic with "slice bounds out of range [:N] with length M". The reference JS diff-match-patch hits the same code paths but slices by UTF-16 code unit, where surrogate-pair-middle splits are well-defined — so the original algorithm tolerates mid-codepoint cuts in JS but not in Go. Fix: introduce a small `alignRuneStart(s, idx)` helper and call it before every byte-level slice that uses an index derived from MatchMain / DiffXIndex / arithmetic on multi-byte text. The alignment costs at most 1-3 bytes of shrinkage per cut point but guarantees the result is always valid UTF-8, so subsequent MatchMain / DiffMain operations behave predictably. Touched call sites: - PatchAddContext: prefix start, suffix end - PatchApply: startLoc after MatchMain, text2 end bound, replace end (perfect-match path), insert-at index, delete start/end (imperfect match path) - PatchSplitMax: diffText cut point (+ loop-progress guard so cutAt that aligns to 0 always advances by one full rune), remainder slice on bigpatch.Text, precontext start, postcontext end Tests added: - TestAlignRuneStart: byte-index → rune-start mapping (ASCII + multi-byte cases). - TestPatchApply_Issue132_NoPanic: the minimal repro from the upstream issue no longer panics. - TestPatchSplitMax_NoInvalidUTF8: pure-CJK PatchMake + PatchSplitMax produces only valid-UTF-8 diff text. Full existing test suite still passes; no behavior change on ASCII-only input (all alignment calls are no-ops when the index already sits at a rune boundary or at len(s)).

The reference Google JS diff-match-patch implementation slices and counts length by UTF-16 code unit. This Go port slices and counts by UTF-8 byte. For ASCII the two are identical, but for CJK / em dash / emoji-heavy text the divergence breaks interoperability when a Go service consumes patch text produced by a JS counterpart: 1. patch_splitMax compares Length1 to MatchMaxBits (32). A 30-CJK segment is 60 UTF-16 units (within limit in JS, kept whole) but 90 bytes here (split into ~9-char fragments). Short fragments produce short Bitap patterns and weak fuzzy matches. 2. expected_loc = Start2 + delta is treated as a byte offset here, but the wire numbers were emitted as UTF-16 units. N multi-byte characters preceding the patch shift the real anchor by N bytes, often beyond MatchDistance's default 1000-byte window. 3. matchAlphabet maps map[byte]int. Each CJK rune contributes its three continuation bytes as separate alphabet entries, scrambling the Bitap masks on dense CJK text. The result is that a patch JS produces and applies byte-perfect against some base text fails ~15% of segments when the same (base, patch) is fed to PatchApply here. This change adds a parallel rune-based path. Algorithm structure is preserved 1:1; only length and indexing model differs: match_runes.go: - MatchMainRunes / MatchBitapRunes / MatchAlphabetRunes Alphabet keyed by rune (one entry per Unicode code point) and a runesLastIndexBefore helper completing the rune analogues of runesIndex / runesIndexOf already present in stringutil.go. patch_runes.go: - DiffText1Runes / DiffText2Runes / DiffXIndexRunes - PatchAddPaddingRunes - PatchSplitMaxRunes - PatchApplyRunes(patches []Patch, text []rune) ([]rune, []bool) All length comparisons against MatchMaxBits / PatchMargin are rune-count; all text and diffText slices are rune-index. Callers pass []rune(s) and string(out) the result. Note on coverage: 1 rune == 1 UTF-16 unit for all BMP characters (U+0000..U+FFFF), which covers CJK ideographs, em dash, most Latin extended sets, and the vast majority of Unicode in practice. Supplementary-plane characters (emoji, U+10000+) are 1 rune vs 2 UTF-16 units, so the rune path still diverges from JS for emoji-only slices. A future improvement could add a true UTF-16 code-unit path, but rune-based already covers the common JS-interop cases. Existing byte-based PatchApply is untouched (this is opt-in). Tests added: - TestPatchApplyRunes_RoundTripCJK: PatchMake → PatchToText → PatchFromText → PatchApplyRunes round trip on CJK base, byte-perfect target reproduction. - TestPatchApplyRunes_ASCII_EquivToByteApply: ASCII input produces identical output via either path. - TestMatchMainRunes_BasicCJK: rune-index Bitap hit verification. - TestDiffXIndexRunes_CountsRunes: index returned in rune units. Full existing test suite still passes.

SnowWarri0r added 2 commits May 21, 2026 19:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add rune-based Patch / Match parallel API for JS interop#156

feat: add rune-based Patch / Match parallel API for JS interop#156
SnowWarri0r wants to merge 2 commits into
sergi:masterfrom
SnowWarri0r:pr2/add-rune-based-patch-apply

SnowWarri0r commented May 21, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

SnowWarri0r commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Change

Coverage

Compatibility

Tests

Real-world impact

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

SnowWarri0r commented May 21, 2026 •

edited

Loading