@@ -213,14 +213,20 @@ The reverse also holds: pythongit reads packs and indexes produced by real
### Object storage

Loose objects under `.git/objects/<oid[:2]>/<oid[2:]>`, zlib-compressed. SHA-1
- and SHA-256 repositories are selected by `extensions.objectformat`. Pack objects
- live in `.git/objects/pack/pack-*.{pack,idx}`. The pack reader mmaps pack files,
- binary-searches `.idx` tables, and handles both `REF_DELTA` (delta against a
- hex object-id base) and `OFS_DELTA` (delta against an earlier offset in the same
- pack). `pack.build_pack` also writes deltas: candidate bases come from a
- windowed search over recent same-type objects, accepted when the delta is at
- most half the raw size. Large CLI repacks stream non-delta pack records straight
- to disk instead of materializing the entire pack in Python memory.
+ and SHA-256 repositories are selected by `extensions.objectformat`. Loose-object
+ enumeration uses a persistent `.git/objects/info/pygit-loose-cache-v1` cache
+ validated by fanout directory mtimes/sizes, so repeated `count-objects`,
+ abbreviated-OID resolution, and pruning commands do not rewalk every loose
+ object directory when nothing changed.
+
+ Pack objects live in `.git/objects/pack/pack-*.{pack,idx}`. The pack reader
+ mmaps pack files, binary-searches `.idx` tables, and handles both `REF_DELTA`
+ (delta against a hex object-id base) and `OFS_DELTA` (delta against an earlier
+ offset in the same pack). Pack creation has two paths: `pack.build_pack` is the
+ small in-memory builder used by tests and helper code, while CLI repacks,
+ bundles, `pack-objects --stdout`, push requests, and upload-pack responses use
+ a bounded-memory streaming writer that still emits OFS deltas against recent
+ same-type bases.

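To make the loose-object layout concrete, here is a minimal sketch of writing and reading one object under the `<oid[:2]>/<oid[2:]>` fanout. It is not pythongit's code: the function names `loose_object_path`, `write_loose_object`, and `read_loose_object` are hypothetical, and the `algo` parameter simply stands in for whatever `extensions.objectformat` selects.

```python
import hashlib
import os
import zlib
from typing import Tuple


def loose_object_path(git_dir: str, oid: str) -> str:
    """Return <git_dir>/objects/<oid[:2]>/<oid[2:]> for a hex object id."""
    return os.path.join(git_dir, "objects", oid[:2], oid[2:])


def write_loose_object(git_dir: str, obj_type: str, payload: bytes,
                       algo: str = "sha1") -> str:
    """Store payload as a zlib-compressed loose object; return its hex id."""
    # Git hashes "<type> <size>\0" followed by the raw payload.
    store = f"{obj_type} {len(payload)}".encode() + b"\x00" + payload
    oid = hashlib.new(algo, store).hexdigest()
    path = loose_object_path(git_dir, oid)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as fh:
        fh.write(zlib.compress(store))
    return oid


def read_loose_object(git_dir: str, oid: str) -> Tuple[str, bytes]:
    """Inflate a loose object and split it into (type, payload)."""
    with open(loose_object_path(git_dir, oid), "rb") as fh:
        store = zlib.decompress(fh.read())
    # The header ends at the first NUL; the payload may contain more NULs.
    header, _, payload = store.partition(b"\x00")
    obj_type, size = header.decode().split(" ")
    if int(size) != len(payload):
        raise ValueError("size header does not match payload length")
    return obj_type, payload
```

With `algo="sha1"`, an empty blob hashes to the well-known id `e69de29bb2d1d6434b8b29ae775ad8c2e48c5391`, so it lands in the `e6/` fanout directory.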
`pack-objects --all` and `repack` write pack `.bitmap` indexes for full
reachable packs. `multi-pack-index write --bitmap` writes `RIDX`/`BTMP` chunks
@@ -274,16 +280,24 @@ post-image automatically.
### Bisect

`bisect_step` follows git's `best_bisection`: for each candidate commit,
- compute `min(reachable_from_it, n - reachable_from_it)` and pick the maximum
278- — i.e. the commit that splits the candidate DAG as evenly as possible.
+ compute `min(reachable_from_it, n - reachable_from_it)` and pick the maximum;
+ that is, the commit that splits the candidate DAG as evenly as possible. Parent
+ lookups use the commit-graph when present. For very large candidate sets,
+ `pygit` switches from exact transitive bitsets to a bounded-memory
+ generation/topological median so bisect remains usable on large histories.

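The exact scoring rule can be sketched in a few lines. This is an illustration only, assuming the candidate set is given as a hypothetical `parents` mapping from each candidate commit to its parent ids; it uses plain Python sets rather than the bitsets or commit-graph lookups mentioned above, and it does not model the large-history fallback.

```python
from typing import Dict, List, Optional


def best_bisection(parents: Dict[str, List[str]]) -> Optional[str]:
    """Pick the candidate maximizing min(w, n - w).

    `parents` maps each candidate commit to its parents. Parents that are
    not themselves candidates (e.g. known-good commits) are ignored.
    `w` counts the candidates reachable from a commit, itself included.
    """
    n = len(parents)
    best, best_score = None, -1
    for commit in parents:
        # Iterative DFS, so deep histories do not hit the recursion limit.
        seen = {commit}
        stack = [commit]
        while stack:
            for parent in parents[stack.pop()]:
                if parent in parents and parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        weight = len(seen)
        score = min(weight, n - weight)
        if score > best_score:  # ties keep the first candidate seen
            best, best_score = commit, score
    return best
```

On a linear chain of six candidates `c1 <- c2 <- ... <- c6`, the weights are 1 through 6, so `c3` (weight 3, score 3) is the unique best pick. The sketch is quadratic in the number of candidates, which is the cost the bounded-memory fallback is meant to avoid.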
### Pack writer (delta compression)

`pack._compute_delta` builds a hash table of every 16-byte block in the base,
then sweeps the target looking for matches >= 4 bytes long. Matches become
`COPY` ops; misses are accumulated into `INSERT` ops capped at 127 bytes each.
The encoder is conservative: it accepts a delta only when it's at most 50% of
- raw size, keeping the chain length sensible.
+ raw size, keeping the chain length sensible. The streaming writer processes
+ bounded batches sorted by type/size and keeps only a small recent-base window,
+ so large pack creation no longer requires all object contents or the final pack
+ bytes in memory. Incoming fetch/receive packs are streamed to a temporary file,
+ mmap-indexed from disk, and installed as pack/idx pairs; thin packs are fixed by
+ appending missing bases before the final index is written.

### Binary commit-graph

@@ -312,14 +326,18 @@ work for commits that definitely did not touch the requested path.
### Smart HTTPS

`protocol.discover_refs` calls `GET /info/refs?service=git-upload-pack`,
- strips the pkt-line framing, and returns the ref map. `protocol.fetch_pack`
- posts `want <sha>` lines + capability list and parses the side-band-encoded
- pack response. `protocol.push` does the receive-pack flow including building
- a non-thin pack of only-new objects and parsing `ok/ng` lines.
+ strips the pkt-line framing, and returns the ref map. Fetch/clone stream the
+ side-band-encoded pack response directly into the pack indexer instead of
+ building one large response buffer. `protocol.push` does the receive-pack flow
+ including streaming a non-thin pack of only-new objects from a temporary pack
+ file and parsing `ok/ng` lines.

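The pkt-line framing mentioned above is simple enough to show in full. The sketch below is not `protocol.discover_refs`; `pkt_line`, `iter_pkt_lines`, and `parse_ref_advertisement` are hypothetical names, and it only handles the protocol v0/v1 ref advertisement (the special `0001`/`0002` packets of protocol v2 are rejected).

```python
from typing import Dict, Iterator, Optional

FLUSH = b"0000"  # flush-pkt: marks the end of a section


def pkt_line(payload: bytes) -> bytes:
    """Frame payload with a 4-hex-digit length that includes the prefix."""
    if len(payload) > 65516:
        raise ValueError("pkt-line payload too long")
    return b"%04x" % (len(payload) + 4) + payload


def iter_pkt_lines(data: bytes) -> Iterator[Optional[bytes]]:
    """Yield each payload; a flush-pkt is reported as None."""
    pos = 0
    while pos < len(data):
        length = int(data[pos:pos + 4], 16)
        if length == 0:
            yield None
            pos += 4
        elif length < 4:
            raise ValueError("protocol v2 special packets are not handled")
        else:
            yield data[pos + 4:pos + length]
            pos += length


def parse_ref_advertisement(data: bytes) -> Dict[str, str]:
    """Turn a smart-HTTP info/refs body into {ref name: hex object id}."""
    refs = {}
    for pkt in iter_pkt_lines(data):
        if pkt is None or pkt.startswith(b"#"):
            continue  # flush-pkt or the "# service=..." banner
        line = pkt.rstrip(b"\n").split(b"\x00", 1)[0]  # drop capabilities
        oid, name = line.split(b" ", 1)
        refs[name.decode()] = oid.decode()
    return refs
```

For example, `pkt_line(b"# service=git-upload-pack\n")` yields `b"001e# service=git-upload-pack\n"`, the banner that opens a smart-HTTP ref advertisement.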
The `daemon` command serves the same flow over a raw TCP socket (git:// at
- port 9418), implemented with `socketserver.ThreadingTCPServer`. `http-backend`
- is an in-process variant used by `instaweb`.
+ port 9418), implemented with `socketserver.ThreadingTCPServer`. Upload-pack
+ responses stream side-band pack chunks instead of assembling the full response
+ body. `http-backend` is an in-process variant used by `instaweb`; the web server
+ uses the streaming backend for upload-pack responses and receive-pack request
+ bodies.

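As a rough picture of what streaming side-band chunks means, the generator below wraps a pack file object into band-1 pkt-lines one chunk at a time, so only a single chunk is held in memory. It is a sketch under assumptions, not pythongit's server code: `sideband_stream` and `demux_sideband` are hypothetical names, and the 65520-byte packet limit assumes the `side-band-64k` capability was negotiated.

```python
import io
from typing import BinaryIO, Iterable, Iterator

MAX_PKT = 65520   # maximum pkt-line length with side-band-64k
BAND_DATA = 1     # pack data
BAND_PROGRESS = 2 # progress messages
BAND_ERROR = 3    # fatal error message


def sideband_stream(pack: BinaryIO,
                    chunk_size: int = MAX_PKT - 5) -> Iterator[bytes]:
    """Yield band-1 pkt-lines for the pack, then a flush-pkt."""
    while True:
        chunk = pack.read(chunk_size)
        if not chunk:
            break
        # 4 bytes of length + 1 band byte + payload
        yield b"%04x" % (len(chunk) + 5) + bytes([BAND_DATA]) + chunk
    yield b"0000"


def demux_sideband(packets: Iterable[bytes]) -> bytes:
    """Collect band-1 payloads; raise on band 3; ignore progress."""
    out = bytearray()
    for pkt in packets:
        if pkt == b"0000":
            break
        length = int(pkt[:4], 16)
        band, payload = pkt[4], pkt[5:length]
        if band == BAND_DATA:
            out += payload
        elif band == BAND_ERROR:
            raise RuntimeError(payload.decode(errors="replace"))
    return bytes(out)
```

A server handler can write each yielded packet to the socket or HTTP response as it is produced, and a client can feed band-1 payloads to its indexer in the same incremental way instead of joining them as `demux_sideband` does here.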
## Testing

@@ -328,16 +346,16 @@ pip install pythongit[test]
328346pytest
329347```
330348
- 93 tests pass:
+ 99 tests pass:

| File | Coverage |
| -------------------------| ----------|
| `unit_objects.py` | hash, encode/decode, signatures, gitlinks |
| `unit_refs.py` | symbolic refs, reflog, packed-refs, abbrev SHA |
| `unit_index.py` | DIRC v2 roundtrip, conflict stages, long paths |
- | `unit_pack.py` | delta apply, idx v2, build_pack, pack/MIDX bitmaps, binary MIDX, SHA-256 interop |
+ | `unit_pack.py` | delta apply, idx v2, build_pack, inbound pack indexing, pack/MIDX bitmaps, binary MIDX, SHA-256 interop |
| `unit_modules.py` | diff/merge/patch/ignore/rerere/SMTP unit-level |
- | `unit_integration.py` | end-to-end CLI flows incl. conflicts, rerere replay, SHA-256 translation |
+ | `unit_integration.py` | end-to-end CLI flows incl. conflicts, rerere replay, SHA-256 translation, loose cache, streaming upload-pack, recursive tree diff |
| `unit_phase_scripts.py` | wraps the script-style phase tests |

Tests that require the real `git` binary are silently skipped when it's not on
@@ -354,13 +372,13 @@ PATH, so the suite runs cleanly in containers without one.

* Big repos: packed repositories now use mmap-backed pack reads, binary MIDX
 lookup, pack/MIDX bitmaps, commit-graph parent/tree lookup, changed-path
- Bloom filters, and streaming on-disk repacks. The remaining scale-sensitive
- cases are loose-object-heavy repositories without maintenance data, commands
- that inherently inspect every path or blob, and smart-protocol paths that
- still assemble response packs in memory.
- * The `bisect` heuristic computes exact candidate weights with iterative
- bitsets. It avoids Python recursion, but very large candidate DAGs can still
- spend noticeable CPU and memory on transitive reachability.
+ Bloom filters, cached loose-object enumeration, and bounded-memory streaming
+ pack generation/indexing. Tree-diff commands skip identical subtrees. The
+ remaining scale-sensitive cases are commands whose output inherently requires
+ inspecting every path or blob.
+ * `bisect` computes exact candidate weights for ordinary ranges. Above a large
+ threshold it uses a bounded-memory generation/topological median, so the pick
+ may differ from C Git's exact best bisection on unusually large DAGs.
* `fsmonitor` uses polling, not OS-level inotify/fsevent. Configurable
 interval; not free.
* `send-email` uses `smtplib` with plain SMTP, STARTTLS/TLS, and SMTP-over-SSL.