Commit ea4f7d9

Scale large repository pack and tree operations
1 parent 80c5945 commit ea4f7d9

11 files changed

Lines changed: 1476 additions & 419 deletions

README.md

Lines changed: 45 additions & 27 deletions
@@ -213,14 +213,20 @@ The reverse also holds: pythongit reads packs and indexes produced by real
 ### Object storage
 
 Loose objects under `.git/objects/<oid[:2]>/<oid[2:]>`, zlib-compressed. SHA-1
-and SHA-256 repositories are selected by `extensions.objectformat`. Pack objects
-live in `.git/objects/pack/pack-*.{pack,idx}`. The pack reader mmaps pack files,
-binary-searches `.idx` tables, and handles both `REF_DELTA` (delta against a
-hex object-id base) and `OFS_DELTA` (delta against an earlier offset in the same
-pack). `pack.build_pack` also writes deltas: candidate bases come from a
-windowed search over recent same-type objects, accepted when the delta is at
-most half the raw size. Large CLI repacks stream non-delta pack records straight
-to disk instead of materializing the entire pack in Python memory.
+and SHA-256 repositories are selected by `extensions.objectformat`. Loose-object
+enumeration uses a persistent `.git/objects/info/pygit-loose-cache-v1` cache
+validated by fanout directory mtimes/sizes, so repeated `count-objects`,
+abbreviated-OID resolution, and pruning commands do not rewalk every loose
+object directory when nothing changed.
+
+Pack objects live in `.git/objects/pack/pack-*.{pack,idx}`. The pack reader
+mmaps pack files, binary-searches `.idx` tables, and handles both `REF_DELTA`
+(delta against a hex object-id base) and `OFS_DELTA` (delta against an earlier
+offset in the same pack). Pack creation has two paths: `pack.build_pack` is the
+small in-memory builder used by tests and helper code, while CLI repacks,
+bundles, `pack-objects --stdout`, push requests, and upload-pack responses use
+a bounded-memory streaming writer that still emits OFS deltas against recent
+same-type bases.
 
 `pack-objects --all` and `repack` write pack `.bitmap` indexes for full
 reachable packs. `multi-pack-index write --bitmap` writes `RIDX`/`BTMP` chunks
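
Note on the loose-object cache added in this hunk: the README says the cache is "validated by fanout directory mtimes/sizes". A minimal sketch of that validation idea follows. It is not the commit's implementation: the JSON layout and the names `fanout_fingerprint`, `scan_loose_oids`, and `list_loose_oids` are hypothetical, and the real `pygit-loose-cache-v1` format is not shown in this diff.

```python
import json
import os
import re

FANOUT_RE = re.compile(r"[0-9a-f]{2}")
HEX_RE = re.compile(r"[0-9a-f]+")


def fanout_fingerprint(objects_dir):
    """Map each two-hex-digit fanout directory to [mtime_ns, size]."""
    fingerprint = {}
    with os.scandir(objects_dir) as entries:
        for entry in entries:
            if entry.is_dir() and FANOUT_RE.fullmatch(entry.name):
                st = entry.stat()
                fingerprint[entry.name] = [st.st_mtime_ns, st.st_size]
    return fingerprint


def scan_loose_oids(objects_dir, fanouts):
    """Full walk: the slow path that the cache is meant to avoid."""
    oids = []
    for fanout in sorted(fanouts):
        for name in sorted(os.listdir(os.path.join(objects_dir, fanout))):
            if HEX_RE.fullmatch(name):  # skip temporary or foreign files
                oids.append(fanout + name)
    return oids


def list_loose_oids(objects_dir, cache_path):
    """Return (oids, from_cache). Hypothetical cache layout, stored as JSON."""
    fingerprint = fanout_fingerprint(objects_dir)
    try:
        with open(cache_path, encoding="utf-8") as fh:
            cached = json.load(fh)
        if (isinstance(cached, dict) and "oids" in cached
                and cached.get("fingerprint") == fingerprint):
            return cached["oids"], True
    except (OSError, ValueError):
        pass  # missing or corrupt cache: fall through to a rescan
    oids = scan_loose_oids(objects_dir, fingerprint)
    tmp_path = cache_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump({"fingerprint": fingerprint, "oids": oids}, fh)
    os.replace(tmp_path, cache_path)  # atomic swap so readers never see half a file
    return oids, False
```

One caveat of any stat-based check: on filesystems with coarse timestamps, two writes inside the same tick can leave a directory's mtime unchanged, which is presumably why the size is part of the fingerprint as well.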
@@ -274,16 +280,24 @@ post-image automatically.
 ### Bisect
 
 `bisect_step` follows git's `best_bisection`: for each candidate commit,
-compute `min(reachable_from_it, n - reachable_from_it)` and pick the maximum
-— i.e. the commit that splits the candidate DAG as evenly as possible.
+compute `min(reachable_from_it, n - reachable_from_it)` and pick the maximum;
+that is, the commit that splits the candidate DAG as evenly as possible. Parent
+lookups use the commit-graph when present. For very large candidate sets,
+`pygit` switches from exact transitive bitsets to a bounded-memory
+generation/topological median so bisect remains usable on large histories.
 
 ### Pack writer (delta compression)
 
 `pack._compute_delta` builds a hash table of every 16-byte block in the base,
 then sweeps the target looking for matches >= 4 bytes long. Matches become
 `COPY` ops; misses are accumulated into `INSERT` ops capped at 127 bytes each.
 The encoder is conservative: it accepts a delta only when it's at most 50% of
-raw size, keeping the chain length sensible.
+raw size, keeping the chain length sensible. The streaming writer processes
+bounded batches sorted by type/size and keeps only a small recent-base window,
+so large pack creation no longer requires all object contents or the final pack
+bytes in memory. Incoming fetch/receive packs are streamed to a temporary file,
+mmap-indexed from disk, and installed as pack/idx pairs; thin packs are fixed by
+appending missing bases before the final index is written.
 
 ### Binary commit-graph
 

@@ -312,14 +326,18 @@ work for commits that definitely did not touch the requested path.
 ### Smart HTTPS
 
 `protocol.discover_refs` calls `GET /info/refs?service=git-upload-pack`,
-strips the pkt-line framing, and returns the ref map. `protocol.fetch_pack`
-posts `want <sha>` lines + capability list and parses the side-band-encoded
-pack response. `protocol.push` does the receive-pack flow including building
-a non-thin pack of only-new objects and parsing `ok/ng` lines.
+strips the pkt-line framing, and returns the ref map. Fetch/clone stream the
+side-band-encoded pack response directly into the pack indexer instead of
+building one large response buffer. `protocol.push` does the receive-pack flow
+including streaming a non-thin pack of only-new objects from a temporary pack
+file and parsing `ok/ng` lines.
 
 The `daemon` command serves the same flow over a raw TCP socket (git:// at
-port 9418), implemented with `socketserver.ThreadingTCPServer`. `http-backend`
-is an in-process variant used by `instaweb`.
+port 9418), implemented with `socketserver.ThreadingTCPServer`. Upload-pack
+responses stream side-band pack chunks instead of assembling the full response
+body. `http-backend` is an in-process variant used by `instaweb`; the web server
+uses the streaming backend for upload-pack responses and receive-pack request
+bodies.
 
 ## Testing
 
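
Note on the Smart HTTPS section in this hunk: streaming a side-band response means demultiplexing pkt-lines one at a time and handing pack bytes to the consumer as they arrive. A minimal sketch, assuming a blocking buffered stream and assuming any leading `NAK`/`ACK` lines were already consumed; the names `read_pkt_line`, `iter_sideband_pack`, and `pkt` are hypothetical and are not pythongit's API.

```python
import io


def read_pkt_line(stream):
    """Read one pkt-line; return its payload, or None for a flush-pkt (0000)."""
    header = stream.read(4)
    if len(header) < 4:
        raise EOFError("truncated pkt-line header")
    length = int(header, 16)  # the length counts the 4 header bytes too
    if length == 0:
        return None
    if length < 4:
        raise ValueError("special pkt-lines are not handled in this sketch")
    payload = stream.read(length - 4)
    if len(payload) != length - 4:
        raise EOFError("truncated pkt-line payload")
    return payload


def iter_sideband_pack(stream, on_progress=None):
    """Yield pack-data chunks (band 1) without buffering the whole response."""
    while True:
        payload = read_pkt_line(stream)
        if payload is None:
            return
        if not payload:
            raise ValueError("empty side-band packet")
        band, data = payload[0], payload[1:]
        if band == 1:      # pack data
            yield data
        elif band == 2:    # progress messages
            if on_progress is not None:
                on_progress(data)
        elif band == 3:    # fatal error reported by the remote
            raise RuntimeError(data.decode("utf-8", errors="replace"))
        else:
            raise ValueError(f"unknown side-band channel {band}")


def pkt(band, data):
    """Frame one side-band packet (test helper)."""
    return b"%04x" % (len(data) + 5) + bytes([band]) + data


wire = pkt(2, b"Counting objects\n") + pkt(1, b"PACK") + pkt(1, b"\x00\x00\x00\x02") + b"0000"
progress = []
chunks = list(iter_sideband_pack(io.BytesIO(wire), progress.append))
```

A caller that wants bounded memory would write each yielded chunk to a temporary file or feed it to an incremental indexer rather than collecting the chunks in a list as the demo does.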

@@ -328,16 +346,16 @@ pip install pythongit[test]
 pytest
 ```
 
-93 tests pass:
+99 tests pass:
 
 | File | Coverage |
 |-------------------------|----------|
 | `unit_objects.py` | hash, encode/decode, signatures, gitlinks |
 | `unit_refs.py` | symbolic refs, reflog, packed-refs, abbrev SHA |
 | `unit_index.py` | DIRC v2 roundtrip, conflict stages, long paths |
-| `unit_pack.py` | delta apply, idx v2, build_pack, pack/MIDX bitmaps, binary MIDX, SHA-256 interop |
+| `unit_pack.py` | delta apply, idx v2, build_pack, inbound pack indexing, pack/MIDX bitmaps, binary MIDX, SHA-256 interop |
 | `unit_modules.py` | diff/merge/patch/ignore/rerere/SMTP unit-level |
-| `unit_integration.py` | end-to-end CLI flows incl. conflicts, rerere replay, SHA-256 translation |
+| `unit_integration.py` | end-to-end CLI flows incl. conflicts, rerere replay, SHA-256 translation, loose cache, streaming upload-pack, recursive tree diff |
 | `unit_phase_scripts.py` | wraps the script-style phase tests |
 
 Tests that require the real `git` binary are silently skipped when it's not on
@@ -354,13 +372,13 @@ PATH, so the suite runs cleanly in containers without one.
 
 * Big repos: packed repositories now use mmap-backed pack reads, binary MIDX
 lookup, pack/MIDX bitmaps, commit-graph parent/tree lookup, changed-path
-Bloom filters, and streaming on-disk repacks. The remaining scale-sensitive
-cases are loose-object-heavy repositories without maintenance data, commands
-that inherently inspect every path or blob, and smart-protocol paths that
-still assemble response packs in memory.
-* The `bisect` heuristic computes exact candidate weights with iterative
-bitsets. It avoids Python recursion, but very large candidate DAGs can still
-spend noticeable CPU and memory on transitive reachability.
+Bloom filters, cached loose-object enumeration, and bounded-memory streaming
+pack generation/indexing. Tree-diff commands skip identical subtrees. The
+remaining scale-sensitive cases are commands whose output inherently requires
+inspecting every path or blob.
+* `bisect` computes exact candidate weights for ordinary ranges. Above a large
+threshold it uses a bounded-memory generation/topological median, so the pick
+may differ from C Git's exact best bisection on unusually large DAGs.
 * `fsmonitor` uses polling, not OS-level inotify/fsevent. Configurable
 interval; not free.
 * `send-email` uses `smtplib` with plain SMTP, STARTTLS/TLS, and SMTP-over-SSL.
