Commit 205456e

blhsing and claude committed
Close ort correctness gaps vs C Git: recursive merge, dir-rename relevance, submodules, attributes
Bring the pure-Python ort engine to byte-for-byte parity with C Git (v2.44) across the result-affecting behaviors that the single-base implementation previously diverged on.

Recursive merge (virtual merge base):
- Port merge_ort_internal/merge_incore_recursive: compute all merge bases and recursively merge them into a virtual ancestor (mergeort.merge_recursive), exposed via ort.merge_commits and used by `git merge` (porcelain_merge). Matches `git merge-tree <a> <b>` (no --merge-base) on criss-cross histories.
- Implement every call_depth>0 behavior: modify/delete uses the base stage, distinct-types/symlink use the merge-base version, binary merge takes the ancestor (virtual_ancestor), and content-merge marker size grows with depth.
- Rewrite merge.merge_bases as a faithful paint_down_to_common (date-ordered priority walk with insertion-order tie-breaking) plus remove_redundant, so merge bases come back in C Git's exact order, which the virtual-ancestor construction (and its stage-1 blob) depends on.

Directory rename detection:
- Track dir_rename_mask through collect_merge_info with the per-directory 0x07 flip, computing rename-source relevance (content_relevant || location_relevant) and removed-dir relevance (RELEVANT_FOR_SELF/ANCESTOR/NOT_RELEVANT).
- Cull non-relevant sources from inexact/basename rename matching in diffcore (matching diffcore_rename_extended) so rename *pairings* match git, and only count dir renames from relevant sources.

Fixes nested, split, transitive, and "rename whose source the other side left untouched" cases.

Submodules: port merge_submodule fast-forward (descendant wins when the submodule object store is available; conflict otherwise, like git).

Config/attributes: honor merge.conflictStyle (merge/diff3/zdiff3) and the working-tree .gitattributes merge / conflict-marker-size attributes (incl. merge=union), matching how git's attr stack reads them for merge-tree.

Validated byte-for-byte vs git 2.44 over thousands of randomized cases (blob merges, criss-cross/recursive merges, mixed and nested directory renames) plus new pytest parity tests; full suite 127 passing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent fc3740f commit 205456e
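
The commit message says the virtual ancestor depends on the order of the merge bases and that the conflict-marker size grows with recursion depth. The following is a minimal toy sketch of both points, not the code from `mergeort.py`. All names (`marker_size`, `toy_merge`, `virtual_ancestor`) are hypothetical, the "merge" is a two-way stand-in with no inner merge base, and the growth of two marker characters per level is an assumption taken from C Git's `merge-ort.c` rather than from this commit.

```python
DEFAULT_MARKER_SIZE = 7  # width of "<<<<<<<" in an ordinary (depth 0) conflict


def marker_size(call_depth: int) -> int:
    # Assumption: two extra marker characters per recursion level.
    return DEFAULT_MARKER_SIZE + 2 * call_depth


def toy_merge(ours: str, theirs: str, call_depth: int) -> str:
    """Stand-in for a content merge: identical text merges cleanly,
    anything else becomes a single conflict block."""
    if ours == theirs:
        return ours
    n = marker_size(call_depth)
    return "\n".join(["<" * n, ours, "=" * n, theirs, ">" * n])


def virtual_ancestor(base_blobs: list[str]) -> str:
    """Fold the merge bases left to right into one virtual ancestor blob."""
    merged = base_blobs[0]
    for nxt in base_blobs[1:]:
        # Merging two bases happens one level below the outer merge.
        merged = toy_merge(merged, nxt, call_depth=1)
    return merged


if __name__ == "__main__":
    print(virtual_ancestor(["a", "b"]))
    print(virtual_ancestor(["b", "a"]))  # same bases, different blob
```

Because the conflicted text itself becomes the ancestor blob, swapping the two bases swaps the two halves of the conflict block. That is why the order in which merge bases are returned can change the final result.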

9 files changed

Lines changed: 1003 additions & 190 deletions

README.md

Lines changed: 30 additions & 21 deletions
@@ -266,31 +266,38 @@ mechanism.
 
 ### Merge
 
-`merge.merge_bases` mirrors `commit-reach.c`'s `paint_down_to_common`: BFS
-from both tips with PARENT1/PARENT2 flags, marking double-flagged commits as
-results and pushing STALE to their ancestors.
+`merge.merge_bases` is a faithful port of `commit-reach.c`'s
+`paint_down_to_common`: a date-ordered priority walk with PARENT1/PARENT2/STALE
+flags and insertion-order tie-breaking, followed by `remove_redundant`, so the
+merge bases come back in the **same order** C Git returns them (which the
+recursive merge below depends on).
 
 High-level three-way merges run a pure-Python port of Git's own `ort` engine —
-no `git` binary and no fallback engine. The port lives in three modules and
+no `git` binary and no fallback engine. The port lives in four modules and
 reproduces `git merge-tree --write-tree` byte-for-byte (result tree oid,
 conflicted blobs with markers, and conflicted index stages):
 
 * `xdiff.py` — Git's xdiff library: record classification, the **histogram**
   diff that `ort` hardcodes for content merges (with the classic Myers
   algorithm as its documented fallback), change compaction, and the zealous
-  three-way `xdl_merge` that emits `<<<<<<<` / `=======` / `>>>>>>>` markers.
+  three-way `xdl_merge` that emits `<<<<<<<` / `=======` / `>>>>>>>` markers
+  (merge / diff3 / zdiff3 styles, configurable marker size).
 * `diffcore.py` — rename detection: the `diffcore-delta` spanhash similarity
   estimator plus exact, basename-driven, and inexact NxM matrix matching from
-  `diffcore-rename.c`.
+  `diffcore-rename.c`, with `relevant_sources` source-culling.
 * `mergeort.py` — the `merge-ort.c` tree engine: the recursive three-way tree
-  walk (`collect_merge_info`), file and **directory** rename detection and
-  resolution (`process_renames`), per-path resolution (`process_entry`), and
-  streamed result-tree assembly with conflicted index stages.
-
-`ort.py` is a thin adapter exposing `merge_tree(repo, merge_base, ours,
-theirs)`; the `merge_base`/`ours`/`theirs` arguments double as the
-conflict-marker labels, exactly as the corresponding `git merge-tree
---merge-base` arguments do.
+  walk (`collect_merge_info`, tracking `dir_rename_mask` and rename-source
+  relevance), file and **directory** rename detection/resolution
+  (`process_renames`, dir-rename counting with RELEVANT_FOR_SELF/ANCESTOR
+  gating), per-path resolution (`process_entry`, including the `call_depth`
+  virtual-ancestor behaviors), submodule fast-forward, `.gitattributes`
+  `merge`/`conflict-marker-size` handling, and streamed result-tree assembly.
+* `ort.py` — adapter exposing `merge_tree(repo, merge_base, ours, theirs)`
+  (explicit base, like `git merge-tree --merge-base`) and `merge_commits(repo,
+  ours, theirs)` (computes all merge bases and **recursively** merges them into
+  a virtual ancestor, like `git merge-tree <a> <b>`). The tree-ish arguments
+  double as conflict-marker labels, exactly as the matching `git merge-tree`
+  arguments do; `merge.conflictStyle` is honored.
 
 ### Rerere
 
@@ -405,13 +412,15 @@ randomized cases.
   remaining scale-sensitive cases are commands whose output inherently requires
   inspecting every path or blob.
 * The `ort` merge engine is a pure-Python reimplementation (no `git` binary,
-  no fallback) and is validated for byte-for-byte parity against
-  `git merge-tree --write-tree` across content merges, rename detection
-  (file and directory), and conflict presentation. It targets a single merge
-  base (as `git merge-tree --merge-base` provides); recursive merge of multiple
-  merge bases (a virtual ancestor) and full submodule fast-forward resolution
-  are not modelled, and `merge.conflictStyle`/whitespace merge drivers default
-  to Git's standard behavior.
+  no fallback), validated for byte-for-byte parity against
+  `git merge-tree --write-tree` across content merges, file and directory
+  renames, recursive merges (criss-cross histories with a virtual ancestor),
+  submodule fast-forwards, conflict styles (merge/diff3/zdiff3), `merge=union`
+  attributes, and conflict presentation. Two areas are not fully modelled:
+  custom external `.gitattributes` merge drivers (treated as the built-in text
+  driver), and a small number of pathological deeply-nested simultaneous
+  directory-rename cases where Git's `merge-ort` deferred two-pass traversal
+  computes rename-source relevance slightly differently.
 * `fsmonitor-daemon run` uses native filesystem notifications on Windows and
   Linux (`ReadDirectoryChangesW` / inotify). One-shot `fsmonitor` calls and
   unsupported platforms fall back to configurable polling.

pythongit/diffcore.py

Lines changed: 10 additions & 1 deletion
@@ -212,12 +212,17 @@ def detect_renames(
     rename_limit: int = 7000,
     minimum_score: int = 0,
     rename_empty: bool = False,
+    relevant_sources: Optional[set] = None,
 ) -> list[RenamePair]:
     """Detect file renames between two trees represented as path->(mode, oid).
 
     Sources are paths present in base but absent in side (deletions);
     destinations are paths present in side but absent in base (additions).
     Returns the list of detected rename pairs.
+
+    ``relevant_sources`` (if given) limits inexact/basename rename detection to
+    those source paths (exact renames still consider all sources), mirroring
+    merge-ort's relevant_sources culling in diffcore_rename_extended.
     """
     if minimum_score == 0:
         minimum_score = DEFAULT_RENAME_SCORE
@@ -304,6 +309,8 @@ def remaining_srcs() -> list[int]:
     for i in range(len(srcs)):
         if srcs[i].rename_used:
             continue
+        if relevant_sources is not None and srcs[i].path not in relevant_sources:
+            continue
         base = _basename(srcs[i].path)
         src_index = src_base.get(base, -1)
         if base in dst_base:
@@ -321,7 +328,9 @@ def remaining_srcs() -> list[int]:
                 record(dst_index, src_index, score)
 
     # --- inexact matrix (NxM similarity) ---
-    src_idx = remaining_srcs()
+    # cull sources not in relevant_sources (remove_unneeded_paths_from_src)
+    src_idx = [i for i in remaining_srcs()
+               if relevant_sources is None or srcs[i].path in relevant_sources]
     num_sources = len(src_idx)
     num_destinations = sum(1 for i in range(len(dsts)) if not dst_is_rename[i])
     if not num_sources or not num_destinations:
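
To illustrate why culling sources can change which rename *pairing* is chosen (and not only how much work is done), here is a small self-contained sketch. It is not the `detect_renames` implementation: `best_pairs` and the score table are hypothetical, and the matching is simplified to "each destination takes its highest-scoring candidate source" instead of the greedy score-matrix pass in `diffcore-rename.c`.

```python
from typing import Optional


def best_pairs(
    sources: list[str],
    destinations: list[str],
    score: dict[tuple[str, str], int],
    relevant_sources: Optional[set[str]] = None,
    minimum: int = 50,
) -> dict[str, str]:
    """Map each destination to its best-scoring source (toy matcher)."""
    # The culling step: None means "consider every source".
    candidates = [s for s in sources
                  if relevant_sources is None or s in relevant_sources]
    pairs: dict[str, str] = {}
    for dst in destinations:
        scored = [(score[(src, dst)], src)
                  for src in candidates if (src, dst) in score]
        if scored:
            best_score, best_src = max(scored)
            if best_score >= minimum:
                pairs[dst] = best_src
    return pairs


# Hypothetical similarity scores (percent) for one added file.
SCORES = {("old/a.c", "new/a.c"): 90, ("lib/b.c", "new/a.c"): 95}
SOURCES = ["old/a.c", "lib/b.c"]

if __name__ == "__main__":
    print(best_pairs(SOURCES, ["new/a.c"], SCORES))
    print(best_pairs(SOURCES, ["new/a.c"], SCORES, relevant_sources={"old/a.c"}))
```

Without culling, the slightly more similar but irrelevant `lib/b.c` wins the pairing. With `relevant_sources={"old/a.c"}`, the destination pairs with `old/a.c` instead, which is the kind of difference the commit describes as making rename pairings match git.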

pythongit/merge.py

Lines changed: 78 additions & 29 deletions
@@ -40,43 +40,92 @@ def _parents(repo: Repository, sha: str) -> list[str]:
     return objs.parse_commit(data).parents
 
 
-def merge_bases(repo: Repository, a: str, b: str) -> list[str]:
-    if a == b:
-        return [a]
-    flags: dict[str, int] = {a: PARENT1, b: PARENT2}
-    # max-heap by commit time via negative
-    pq: list[tuple[int, str]] = []
-    heapq.heappush(pq, (-_commit_time(repo, a), a))
-    heapq.heappush(pq, (-_commit_time(repo, b), b))
-    result: list[str] = []
-    while pq:
-        # check if any non-stale remain with both flags possible
-        if all((flags[s] & STALE) for _, s in pq):
-            break
-        _, sha = heapq.heappop(pq)
-        f = flags.get(sha, 0)
-        if (f & (PARENT1 | PARENT2)) == (PARENT1 | PARENT2):
-            if not (f & RESULT):
-                f |= RESULT
-                result.append(sha)
+def _insert_by_date(lst: list, item: str, date: int) -> None:
+    """Insert into a date-descending list; equal dates keep insertion (FIFO),
+    mirroring commit_list_insert_by_date."""
+    i = 0
+    while i < len(lst) and lst[i][1] >= date:
+        i += 1
+    lst.insert(i, (item, date))
+
+
+def _paint_down_to_common(repo: Repository, one: str, twos: list[str]):
+    """Faithful port of commit-reach.c:paint_down_to_common (no commit-graph
+    generation numbers, so ordering is by commit date). Returns
+    (result_shas_in_date_order, flags)."""
+    flags: dict[str, int] = {one: PARENT1}
+    # min-heap on (-date, insertion_ctr) → pops newest first, FIFO for ties
+    heap: list[tuple[int, int, str]] = []
+    ctr = 0
+    heapq.heappush(heap, (-_commit_time(repo, one), ctr, one))
+    ctr += 1
+    for t in twos:
+        flags[t] = flags.get(t, 0) | PARENT2
+        heapq.heappush(heap, (-_commit_time(repo, t), ctr, t))
+        ctr += 1
+
+    result: list[tuple[str, int]] = []
+    while any(not (flags.get(s, 0) & STALE) for _, _, s in heap):
+        negd, _c, commit = heapq.heappop(heap)
+        f = flags.get(commit, 0) & (PARENT1 | PARENT2 | STALE)
+        if f == (PARENT1 | PARENT2):
+            if not (flags.get(commit, 0) & RESULT):
+                flags[commit] = flags.get(commit, 0) | RESULT
+                _insert_by_date(result, commit, -negd)
             f |= STALE
-        flags[sha] = f
-        carry = f & (PARENT1 | PARENT2 | STALE)
-        for p in _parents(repo, sha):
+        for p in _parents(repo, commit):
             pf = flags.get(p, 0)
-            if (pf & carry) == carry:
+            if (pf & f) == f:
                 continue
-            flags[p] = pf | carry
-            heapq.heappush(pq, (-_commit_time(repo, p), p))
-    # filter out stale results
-    return [s for s in result if not (flags.get(s, 0) & STALE) or (flags.get(s, 0) & RESULT)]
+            flags[p] = pf | f
+            heapq.heappush(heap, (-_commit_time(repo, p), ctr, p))
+            ctr += 1
+    return [s for s, _ in result], flags
+
+
+def _remove_redundant(repo: Repository, array: list[str]) -> list[str]:
+    """Port of remove_redundant_no_gen: drop merge bases that are ancestors of
+    other merge bases, preserving order."""
+    cnt = len(array)
+    redundant = [False] * cnt
+    for i in range(cnt):
+        if redundant[i]:
+            continue
+        work = []
+        filled_index = []
+        for j in range(cnt):
+            if i == j or redundant[j]:
+                continue
+            filled_index.append(j)
+            work.append(array[j])
+        if not work:
+            continue
+        _res, flags = _paint_down_to_common(repo, array[i], work)
+        if flags.get(array[i], 0) & PARENT2:
+            redundant[i] = True
+        for k, wj in enumerate(work):
+            if flags.get(wj, 0) & PARENT1:
+                redundant[filled_index[k]] = True
+    return [array[i] for i in range(cnt) if not redundant[i]]
+
+
+def merge_bases(repo: Repository, a: str, b: str) -> list[str]:
+    """Return the merge bases of two commits, in git's order
+    (repo_get_merge_bases): date-descending with FIFO tie-breaking, redundant
+    bases removed."""
+    if a == b:
+        return [a]
+    res_shas, flags = _paint_down_to_common(repo, a, [b])
+    result = [s for s in res_shas if not (flags.get(s, 0) & STALE)]
+    if len(result) <= 1:
+        return result
+    return _remove_redundant(repo, result)
 
 
 def is_ancestor(repo: Repository, ancestor: str, descendant: str) -> bool:
     if ancestor == descendant:
         return True
-    bases = merge_bases(repo, ancestor, descendant)
-    return ancestor in bases
+    return ancestor in merge_bases(repo, ancestor, descendant)
 
 
 # ---------------------------------------------------------------------------
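
The effect of the `(-date, insertion_ctr)` heap key in the hunk above can be seen on a toy criss-cross history. The sketch below mirrors the structure of the new `_paint_down_to_common`, but runs against an in-memory dictionary instead of a repository. `HISTORY`, `toy_merge_bases`, and the commit names are hypothetical, and the `remove_redundant` pass is left out because neither base here is an ancestor of the other.

```python
import heapq

PARENT1, PARENT2, STALE, RESULT = 1, 2, 4, 8

# Hypothetical toy history: name -> (commit date, parents).
# A and B share a date; X and Y each merge both of them (criss-cross).
HISTORY = {
    "R": (1, []),
    "A": (2, ["R"]),
    "B": (2, ["R"]),
    "X": (3, ["A", "B"]),
    "Y": (3, ["B", "A"]),
}


def insert_by_date(lst: list[str], commit: str) -> None:
    """Date-descending insert; equal dates keep insertion order (FIFO)."""
    date = HISTORY[commit][0]
    i = 0
    while i < len(lst) and HISTORY[lst[i]][0] >= date:
        i += 1
    lst.insert(i, commit)


def toy_merge_bases(one: str, two: str) -> list[str]:
    flags = {one: PARENT1}
    flags[two] = flags.get(two, 0) | PARENT2
    heap: list[tuple[int, int, str]] = []
    ctr = 0  # insertion counter: breaks date ties in push order
    for tip in (one, two):
        heapq.heappush(heap, (-HISTORY[tip][0], ctr, tip))
        ctr += 1
    result: list[str] = []
    # Walk until only STALE commits remain queued.
    while any(not (flags[c] & STALE) for _, _, c in heap):
        _, _, commit = heapq.heappop(heap)
        f = flags[commit] & (PARENT1 | PARENT2 | STALE)
        if f == (PARENT1 | PARENT2):
            if not flags[commit] & RESULT:
                flags[commit] |= RESULT
                insert_by_date(result, commit)
            f |= STALE  # ancestors of a found base are painted stale
        for parent in HISTORY[commit][1]:
            pf = flags.get(parent, 0)
            if (pf & f) == f:
                continue
            flags[parent] = pf | f
            heapq.heappush(heap, (-HISTORY[parent][0], ctr, parent))
            ctr += 1
    return [c for c in result if not (flags[c] & STALE)]


if __name__ == "__main__":
    print(toy_merge_bases("X", "Y"))  # ['A', 'B']
    print(toy_merge_bases("Y", "X"))  # ['B', 'A']
```

Both calls find the same two bases, but since `A` and `B` have equal dates, the order is decided purely by which one was pushed first. A heap keyed on `(-date, sha)`, as in the removed code, would instead break the tie by comparing hash strings.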
