feat: add rune-based Patch / Match parallel API for JS interop#156
Open
SnowWarri0r wants to merge 2 commits into
Open
feat: add rune-based Patch / Match parallel API for JS interop#156SnowWarri0r wants to merge 2 commits into
SnowWarri0r wants to merge 2 commits into
Conversation
Fixes sergi#132 (PatchApply slice bounds out of range on multi-byte text). PatchApply / PatchSplitMax / PatchAddContext slice text by UTF-8 byte index throughout. When MatchMain's Bitap returns a byte position that lands in the middle of a multi-byte rune (CJK 3 bytes, em dash 3 bytes, emoji 4 bytes), the resulting text[a:b] produces an invalid-UTF-8 string. Downstream calls (MatchMain on the slice, DiffMain, further text[c:d]) then either silently miss anchors or panic with "slice bounds out of range [:N] with length M". The reference JS diff-match-patch hits the same code paths but slices by UTF-16 code unit, where surrogate-pair-middle splits are well-defined — so the original algorithm tolerates mid-codepoint cuts in JS but not in Go. Fix: introduce a small `alignRuneStart(s, idx)` helper and call it before every byte-level slice that uses an index derived from MatchMain / DiffXIndex / arithmetic on multi-byte text. The alignment costs at most 1-3 bytes of shrinkage per cut point but guarantees the result is always valid UTF-8, so subsequent MatchMain / DiffMain operations behave predictably. Touched call sites: - PatchAddContext: prefix start, suffix end - PatchApply: startLoc after MatchMain, text2 end bound, replace end (perfect-match path), insert-at index, delete start/end (imperfect match path) - PatchSplitMax: diffText cut point (+ loop-progress guard so cutAt that aligns to 0 always advances by one full rune), remainder slice on bigpatch.Text, precontext start, postcontext end Tests added: - TestAlignRuneStart: byte-index → rune-start mapping (ASCII + multi-byte cases). - TestPatchApply_Issue132_NoPanic: the minimal repro from the upstream issue no longer panics. - TestPatchSplitMax_NoInvalidUTF8: pure-CJK PatchMake + PatchSplitMax produces only valid-UTF-8 diff text. Full existing test suite still passes; no behavior change on ASCII-only input (all alignment calls are no-ops when the index already sits at a rune boundary or at len(s)).
The reference Google JS diff-match-patch implementation slices and counts
length by UTF-16 code unit. This Go port slices and counts by UTF-8 byte.
For ASCII the two are identical, but for CJK / em dash / emoji-heavy
text the divergence breaks interoperability when a Go service consumes
patch text produced by a JS counterpart:
1. patch_splitMax compares Length1 to MatchMaxBits (32). A 30-CJK
segment is 60 UTF-16 units (within limit in JS, kept whole) but
90 bytes here (split into ~9-char fragments). Short fragments
produce short Bitap patterns and weak fuzzy matches.
2. expected_loc = Start2 + delta is treated as a byte offset here,
but the wire numbers were emitted as UTF-16 units. N multi-byte
characters preceding the patch shift the real anchor by N bytes,
often beyond MatchDistance's default 1000-byte window.
3. matchAlphabet maps map[byte]int. Each CJK rune contributes its
three continuation bytes as separate alphabet entries, scrambling
the Bitap masks on dense CJK text.
The result is that a patch JS produces and applies byte-perfect against
some base text fails ~15% of segments when the same (base, patch) is
fed to PatchApply here.
This change adds a parallel rune-based path. Algorithm structure is
preserved 1:1; only length and indexing model differs:
match_runes.go:
- MatchMainRunes / MatchBitapRunes / MatchAlphabetRunes
Alphabet keyed by rune (one entry per Unicode code point) and a
runesLastIndexBefore helper completing the rune analogues of
runesIndex / runesIndexOf already present in stringutil.go.
patch_runes.go:
- DiffText1Runes / DiffText2Runes / DiffXIndexRunes
- PatchAddPaddingRunes
- PatchSplitMaxRunes
- PatchApplyRunes(patches []Patch, text []rune) ([]rune, []bool)
All length comparisons against MatchMaxBits / PatchMargin are
rune-count; all text and diffText slices are rune-index. Callers
pass []rune(s) and string(out) the result.
Note on coverage: 1 rune == 1 UTF-16 unit for all BMP characters
(U+0000..U+FFFF), which covers CJK ideographs, em dash, most Latin
extended sets, and the vast majority of Unicode in practice.
Supplementary-plane characters (emoji, U+10000+) are 1 rune vs 2
UTF-16 units, so the rune path still diverges from JS for emoji-only
slices. A future improvement could add a true UTF-16 code-unit path,
but rune-based already covers the common JS-interop cases.
Existing byte-based PatchApply is untouched (this is opt-in).
Tests added:
- TestPatchApplyRunes_RoundTripCJK: PatchMake → PatchToText →
PatchFromText → PatchApplyRunes round trip on CJK base, byte-perfect
target reproduction.
- TestPatchApplyRunes_ASCII_EquivToByteApply: ASCII input produces
identical output via either path.
- TestMatchMainRunes_BasicCJK: rune-index Bitap hit verification.
- TestDiffXIndexRunes_CountsRunes: index returned in rune units.
Full existing test suite still passes.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
The reference Google JS diff-match-patch implementation slices and counts length by UTF-16 code unit. This Go port slices and counts by UTF-8 byte. For ASCII the two are identical, but for CJK / em dash / emoji-heavy text the divergence breaks interoperability when a Go service consumes patch text produced by a JS counterpart:
The result is that a patch JS produces and applies byte-perfect against some base text fails ~15% of segments when the same `(base, patch)` is fed to `PatchApply` here. Tested on a 44 KB Chinese HTML document with 262 hunks: byte-mode produces `failed=39/274`, rune-mode produces `failed=0/271` byte-perfect.
Change
Add a parallel rune-based path. Algorithm structure is preserved 1:1; only length and indexing model differs:
`match_runes.go`
`patch_runes.go`
Callers pass `[]rune(s)` and `string(out)` the result.
Coverage
1 rune == 1 UTF-16 unit for all BMP characters (U+0000..U+FFFF), which covers CJK ideographs, em dash, most Latin extended sets, and the vast majority of Unicode in practice. Supplementary-plane characters (emoji, U+10000+) are 1 rune vs 2 UTF-16 units, so the rune path still diverges from JS for emoji-only slices. A future improvement could add a true UTF-16 code-unit path, but rune-based already covers the common JS-interop cases.
Compatibility
Tests
Real-world impact
Running on production service (a Go BFF accepting patches from a JS-side editor). After switching from `PatchApply` to `PatchApplyRunes`, the per-segment failure rate dropped from ~15% to 0 on a representative corpus of Chinese HTML documents. The 422 error rate on the corresponding HTTP endpoint dropped to noise.