Skip to content

feat: add rune-based Patch / Match parallel API for JS interop#156

Open
SnowWarri0r wants to merge 2 commits into
sergi:masterfrom
SnowWarri0r:pr2/add-rune-based-patch-apply
Open

feat: add rune-based Patch / Match parallel API for JS interop#156
SnowWarri0r wants to merge 2 commits into
sergi:masterfrom
SnowWarri0r:pr2/add-rune-based-patch-apply

Conversation

@SnowWarri0r
Copy link
Copy Markdown

@SnowWarri0r SnowWarri0r commented May 21, 2026

Note: this PR is stacked on top of #155 (rune-safe byte slicing). Suggest reviewing #155 first.

Motivation

The reference Google JS diff-match-patch implementation slices and counts length by UTF-16 code unit. This Go port slices and counts by UTF-8 byte. For ASCII the two are identical, but for CJK / em dash / emoji-heavy text the divergence breaks interoperability when a Go service consumes patch text produced by a JS counterpart:

  1. `patch_splitMax` compares `Length1` to `MatchMaxBits` (32). A 30-CJK segment is 60 UTF-16 units (within limit in JS, kept whole) but 90 bytes here (split into ~9-char fragments). Short fragments produce short Bitap patterns and weak fuzzy matches.
  2. `expected_loc = Start2 + delta` is treated as a byte offset here, but the wire numbers were emitted as UTF-16 units. N multi-byte characters preceding the patch shift the real anchor by N bytes, often beyond `MatchDistance`'s default 1000-byte window.
  3. `matchAlphabet` maps `map[byte]int`. Each CJK rune contributes its three continuation bytes as separate alphabet entries, scrambling the Bitap masks on dense CJK text.

The result is that a patch JS produces and applies byte-perfect against some base text fails ~15% of segments when the same `(base, patch)` is fed to `PatchApply` here. Tested on a 44 KB Chinese HTML document with 262 hunks: byte-mode produces `failed=39/274`, rune-mode produces `failed=0/271` byte-perfect.

Change

Add a parallel rune-based path. Algorithm structure is preserved 1:1; only length and indexing model differs:

`match_runes.go`

  • `MatchMainRunes` / `MatchBitapRunes` / `MatchAlphabetRunes`
  • Alphabet keyed by rune (one entry per Unicode code point)
  • `runesLastIndexBefore` helper completing the rune analogues of `runesIndex` / `runesIndexOf` already present in `stringutil.go`

`patch_runes.go`

  • `DiffText1Runes` / `DiffText2Runes` / `DiffXIndexRunes`
  • `PatchAddPaddingRunes`
  • `PatchSplitMaxRunes`
  • `PatchApplyRunes(patches []Patch, text []rune) ([]rune, []bool)` — all length comparisons against `MatchMaxBits` / `PatchMargin` are rune-count; all text and `diffText` slices are rune-index

Callers pass `[]rune(s)` and `string(out)` the result.

Coverage

1 rune == 1 UTF-16 unit for all BMP characters (U+0000..U+FFFF), which covers CJK ideographs, em dash, most Latin extended sets, and the vast majority of Unicode in practice. Supplementary-plane characters (emoji, U+10000+) are 1 rune vs 2 UTF-16 units, so the rune path still diverges from JS for emoji-only slices. A future improvement could add a true UTF-16 code-unit path, but rune-based already covers the common JS-interop cases.

Compatibility

  • Existing byte-based `PatchApply` is untouched — this is opt-in.
  • Existing test suite passes unchanged.
  • New API surface is additive (`*Runes` suffix).

Tests

  • `TestPatchApplyRunes_RoundTripCJK`: `PatchMake` → `PatchToText` → `PatchFromText` → `PatchApplyRunes` round trip on CJK base, byte-perfect target reproduction.
  • `TestPatchApplyRunes_ASCII_EquivToByteApply`: ASCII input produces identical output via either path.
  • `TestMatchMainRunes_BasicCJK`: rune-index Bitap hit verification.
  • `TestDiffXIndexRunes_CountsRunes`: index returned in rune units.

Real-world impact

Running on production service (a Go BFF accepting patches from a JS-side editor). After switching from `PatchApply` to `PatchApplyRunes`, the per-segment failure rate dropped from ~15% to 0 on a representative corpus of Chinese HTML documents. The 422 error rate on the corresponding HTTP endpoint dropped to noise.

Fixes sergi#132 (PatchApply slice bounds out of range on multi-byte text).

PatchApply / PatchSplitMax / PatchAddContext slice text by UTF-8 byte
index throughout. When MatchMain's Bitap returns a byte position that
lands in the middle of a multi-byte rune (CJK 3 bytes, em dash 3 bytes,
emoji 4 bytes), the resulting text[a:b] produces an invalid-UTF-8
string. Downstream calls (MatchMain on the slice, DiffMain, further
text[c:d]) then either silently miss anchors or panic with
"slice bounds out of range [:N] with length M".

The reference JS diff-match-patch hits the same code paths but slices
by UTF-16 code unit, where surrogate-pair-middle splits are
well-defined — so the original algorithm tolerates mid-codepoint cuts
in JS but not in Go.

Fix: introduce a small `alignRuneStart(s, idx)` helper and call it
before every byte-level slice that uses an index derived from
MatchMain / DiffXIndex / arithmetic on multi-byte text. The alignment
costs at most 1-3 bytes of shrinkage per cut point but guarantees the
result is always valid UTF-8, so subsequent MatchMain / DiffMain
operations behave predictably.

Touched call sites:
  - PatchAddContext: prefix start, suffix end
  - PatchApply: startLoc after MatchMain, text2 end bound, replace end
    (perfect-match path), insert-at index, delete start/end (imperfect
    match path)
  - PatchSplitMax: diffText cut point (+ loop-progress guard so cutAt
    that aligns to 0 always advances by one full rune), remainder
    slice on bigpatch.Text, precontext start, postcontext end

Tests added:
  - TestAlignRuneStart: byte-index → rune-start mapping (ASCII +
    multi-byte cases).
  - TestPatchApply_Issue132_NoPanic: the minimal repro from the
    upstream issue no longer panics.
  - TestPatchSplitMax_NoInvalidUTF8: pure-CJK PatchMake + PatchSplitMax
    produces only valid-UTF-8 diff text.

Full existing test suite still passes; no behavior change on
ASCII-only input (all alignment calls are no-ops when the index already
sits at a rune boundary or at len(s)).
The reference Google JS diff-match-patch implementation slices and counts
length by UTF-16 code unit. This Go port slices and counts by UTF-8 byte.
For ASCII the two are identical, but for CJK / em dash / emoji-heavy
text the divergence breaks interoperability when a Go service consumes
patch text produced by a JS counterpart:

  1. patch_splitMax compares Length1 to MatchMaxBits (32). A 30-CJK
     segment is 60 UTF-16 units (within limit in JS, kept whole) but
     90 bytes here (split into ~9-char fragments). Short fragments
     produce short Bitap patterns and weak fuzzy matches.
  2. expected_loc = Start2 + delta is treated as a byte offset here,
     but the wire numbers were emitted as UTF-16 units. N multi-byte
     characters preceding the patch shift the real anchor by N bytes,
     often beyond MatchDistance's default 1000-byte window.
  3. matchAlphabet maps map[byte]int. Each CJK rune contributes its
     three continuation bytes as separate alphabet entries, scrambling
     the Bitap masks on dense CJK text.

The result is that a patch JS produces and applies byte-perfect against
some base text fails ~15% of segments when the same (base, patch) is
fed to PatchApply here.

This change adds a parallel rune-based path. Algorithm structure is
preserved 1:1; only length and indexing model differs:

  match_runes.go:
    - MatchMainRunes / MatchBitapRunes / MatchAlphabetRunes
      Alphabet keyed by rune (one entry per Unicode code point) and a
      runesLastIndexBefore helper completing the rune analogues of
      runesIndex / runesIndexOf already present in stringutil.go.

  patch_runes.go:
    - DiffText1Runes / DiffText2Runes / DiffXIndexRunes
    - PatchAddPaddingRunes
    - PatchSplitMaxRunes
    - PatchApplyRunes(patches []Patch, text []rune) ([]rune, []bool)
      All length comparisons against MatchMaxBits / PatchMargin are
      rune-count; all text and diffText slices are rune-index. Callers
      pass []rune(s) and string(out) the result.

Note on coverage: 1 rune == 1 UTF-16 unit for all BMP characters
(U+0000..U+FFFF), which covers CJK ideographs, em dash, most Latin
extended sets, and the vast majority of Unicode in practice.
Supplementary-plane characters (emoji, U+10000+) are 1 rune vs 2
UTF-16 units, so the rune path still diverges from JS for emoji-only
slices. A future improvement could add a true UTF-16 code-unit path,
but rune-based already covers the common JS-interop cases.

Existing byte-based PatchApply is untouched (this is opt-in).

Tests added:
  - TestPatchApplyRunes_RoundTripCJK: PatchMake → PatchToText →
    PatchFromText → PatchApplyRunes round trip on CJK base, byte-perfect
    target reproduction.
  - TestPatchApplyRunes_ASCII_EquivToByteApply: ASCII input produces
    identical output via either path.
  - TestMatchMainRunes_BasicCJK: rune-index Bitap hit verification.
  - TestDiffXIndexRunes_CountsRunes: index returned in rune units.

Full existing test suite still passes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant