| 1 | +# pythongit |
| 2 | + |
| 3 | +A pure-Python reimplementation of `git`. No external runtime dependencies — just |
| 4 | +the Python standard library. All 141 of git's built-in subcommands are |
| 5 | +implemented, the on-disk format is byte-for-byte compatible with real `git`, |
| 6 | +and the package installs both `pygit` and a drop-in `git` console script. |
| 7 | + |
| 8 | +```text |
| 9 | +pythongit/ (repo root) |
| 10 | +├── pyproject.toml |
| 11 | +├── README.md this file |
| 12 | +├── pythongit/ importable package — at repo root |
| 13 | +│ ├── __init__.py |
| 14 | +│ ├── __main__.py `python -m pythongit ...` |
| 15 | +│ ├── cli.py command dispatch (158 commands) |
| 16 | +│ ├── repo.py Repository discovery + config |
| 17 | +│ ├── objects.py blob / tree / commit / tag encode/decode |
| 18 | +│ ├── refs.py ref resolution, update, reflog hook |
| 19 | +│ ├── reflog.py append-only ref log |
| 20 | +│ ├── index.py DIRC v2 with conflict stages |
| 21 | +│ ├── workdir.py add/rm/status/checkout, tree↔workdir |
| 22 | +│ ├── diff.py Myers diff + unified-diff renderer |
| 23 | +│ ├── merge.py merge-base + three-way blob merge |
| 24 | +│ ├── sequencer.py cherry-pick / revert / rebase |
| 25 | +│ ├── porcelain_merge.py ff + 3-way merge entry point |
| 26 | +│ ├── patch.py unified-diff parser + applier |
| 27 | +│ ├── pack.py pack v2 + idx v2, REF_DELTA + OFS_DELTA, encoder |
| 28 | +│ ├── protocol.py smart HTTPS clone / fetch / push |
| 29 | +│ ├── stash.py refs/stash + reflog-backed stash |
| 30 | +│ ├── ignore.py .gitignore engine |
| 31 | +│ ├── rerere.py reuse recorded resolution |
| 32 | +│ └── bridges.py daemon / http-backend / SMTP / Tk / shell-out |
| 33 | +└── tests/ pytest + script-style integration tests |
| 34 | +``` |
| 35 | + |
| 36 | +## Why does this exist? |
| 37 | + |
| 38 | +Sometimes you need `git` on a machine where you can't install a real `git` |
| 39 | +binary — locked-down CI workers, restricted containers, environments where the |
| 40 | +only thing you can `pip install` is a wheel. `pythongit` ships as a single |
| 41 | +pure-Python wheel and exposes a `git` command. Most everyday workflows just |
| 42 | +work. |
| 43 | + |
| 44 | +This is also a reasonable reference implementation if you want to understand |
| 45 | +git's on-disk formats and protocols. The code in this repo cross-references |
| 46 | +git's own `Documentation/gitformat-*.adoc` specs for the wire formats it |
| 47 | +implements. |
| 48 | + |
| 49 | +## Install |
| 50 | + |
| 51 | +```bash |
| 52 | +pip install pythongit |
| 53 | +``` |
| 54 | + |
| 55 | +This installs **two console scripts**: |
| 56 | + |
| 57 | +| Script | Purpose | |
| 58 | +|---------|-----------------------------------------------------------| |
| 59 | +| `pygit` | Unambiguous name; always invokes pythongit | |
| 60 | | `git` | Drop-in name; shadows real `git` only if the venv's script directory comes first on PATH |
| 61 | + |
| 62 | +If a real `git` binary is already on PATH and earlier than the venv's `Scripts/` |
| 63 | +or `bin/` directory, your shell will resolve `git` to the real one. To force |
| 64 | +the pythongit version, either use `pygit`, put the venv earlier on PATH, or run |
| 65 | +`python -m pythongit ...`. |
| 66 | + |
| 67 | +You can also run from a checkout without installing: |
| 68 | + |
| 69 | +```bash |
| 70 | +python -m pythongit <command> [args...] |
| 71 | +``` |
| 72 | + |
| 73 | +## Tutorial |
| 74 | + |
| 75 | +```bash |
| 76 | +mkdir demo && cd demo |
| 77 | +pygit init . |
| 78 | +pygit config user.name "You" |
| 79 | +pygit config user.email "you@example.com" |
| 80 | + |
| 81 | +echo "hello" > a.txt |
| 82 | +pygit add a.txt |
| 83 | +pygit commit -m "first commit" |
| 84 | + |
| 85 | +echo "world" >> a.txt |
| 86 | +pygit diff |
| 87 | +pygit add a.txt |
| 88 | +pygit commit -m "append world" |
| 89 | + |
| 90 | +pygit log --oneline |
| 91 | +pygit tag v1 |
| 92 | +pygit branch feature |
| 93 | +pygit checkout feature |
| 94 | +echo "feature work" > f.txt |
| 95 | +pygit add f.txt |
| 96 | +pygit commit -m "feature commit" |
| 97 | + |
| 98 | +pygit checkout main |
| 99 | +pygit merge feature |
| 100 | +``` |
| 101 | + |
| 102 | +Cloning over HTTPS: |
| 103 | + |
| 104 | +```bash |
| 105 | +pygit clone https://github.com/some/repo.git |
| 106 | +``` |
| 107 | + |
| 108 | +## Supported commands |
| 109 | + |
| 110 | +All 141 git built-in subcommands plus aliases (158 entries in total). Selected |
| 111 | +highlights: |
| 112 | + |
| 113 | +**Plumbing.** `hash-object`, `cat-file`, `ls-tree`, `write-tree`, `read-tree`, |
| 114 | +`commit-tree`, `mktree`, `mktag`, `update-ref`, `symbolic-ref`, `rev-parse`, |
| 115 | +`rev-list`, `ls-files`, `diff-tree`, `diff-index`, `diff-files`, `diff-pairs`, |
| 116 | +`pack-objects`, `unpack-objects`, `index-pack`, `verify-pack`, `show-index`, |
| 117 | +`unpack-file`, `merge-index`, `merge-file`, `update-index`, `update-server-info`, |
| 118 | +`check-ref-format`, `check-attr`, `check-mailmap`, `check-ignore`, `for-each-ref`, |
| 119 | +`show-ref`, `pack-refs`, `prune-packed`, `pack-redundant`, `multi-pack-index`, |
| 120 | +`fetch-pack`, `send-pack`, `upload-pack`, `receive-pack`, `upload-archive`, |
| 121 | +`http-fetch`, `http-backend`, `fmt-merge-msg`, `mailinfo`, `mailsplit`, |
| 122 | +`patch-id`, `commit-graph`, `var`, `stripspace`. |
| 123 | + |
| 124 | +**Porcelain.** `init`, `clone`, `add`, `rm`, `mv`, `status`, `commit`, `log`, |
| 125 | +`show`, `diff`, `branch`, `tag`, `checkout`, `switch`, `restore`, `reset`, |
| 126 | +`merge`, `merge-tree`, `cherry-pick`, `revert`, `rebase`, `replay`, `cherry`, |
| 127 | +`range-diff`, `stash`, `reflog`, `notes`, `bisect`, `blame`, `annotate`, |
| 128 | +`describe`, `name-rev`, `shortlog`, `whatchanged`, `clean`, `archive`, |
| 129 | +`bundle`, `format-patch`, `am`, `apply`, `grep`, `show-branch`, `worktree`, |
| 130 | +`submodule`, `sparse-checkout`, `request-pull`, `interpret-trailers`, |
| 131 | +`verify-commit`, `verify-tag`, `rerere`, `replace`, `gc`, `repack`, `prune`, |
| 132 | +`count-objects`, `fsck`, `pull`, `fetch`, `push`, `remote`, `ls-remote`, |
| 133 | +`config`, `refs`, `repo`, `diagnose`, `bugreport`, `last-modified`, `history`, |
| 134 | +`url-parse`, `maintenance`. |
| 135 | + |
| 136 | +**Bridges (orchestrate other binaries / protocols).** `send-email` (via |
| 137 | +`smtplib`), `daemon` (TCP git:// server), `instaweb`/`gitweb` (`http.server`-based |
| 138 | +browser), `gitk`/`gui` (Tk log viewer), `cvsimport`/`cvsexportcommit`/`cvsserver` |
| 139 | +(shell out to `cvs`), `svn` (shell out to `svn`), `difftool`/`mergetool` |
| 140 | +(invoke configured external tool), `credential`/`credential-store`/ |
| 141 | +`credential-cache`/`credential-cache-daemon`, `remote-helper`/`remote-ext`/ |
| 142 | +`remote-fd`, `fsmonitor`/`fsmonitor-daemon`, `shell` (restricted ssh |
| 143 | +dispatcher), `init-db`, `submodule-helper`, `checkout-worker`, `backfill`. |
| 144 | + |
| 145 | +To see the full list: |
| 146 | + |
| 147 | +```bash |
| 148 | +pygit help |
| 149 | +``` |
| 150 | + |
| 151 | +## Interop with real git |
| 152 | + |
| 153 | +The on-disk format is byte-for-byte compatible with the git C implementation. |
| 154 | +The test suite verifies this against the real `git` binary: |
| 155 | + |
| 156 | +| pythongit writes... | ...real `git` validates | |
| 157 | +|---|---| |
| 158 | +| loose objects | `git fsck` | |
| 159 | +| tree / commit objects | `git cat-file -p` | |
| 160 | +| index v2 with stages | `git ls-files --stage` | |
| 161 | +| pack v2 + idx v2 (with deltas) | `git verify-pack -v` | |
| 162 | +| binary commit-graph file | `git commit-graph verify` | |
| 163 | +| refs / packed-refs / reflog | `git log --all` | |
| 164 | +| smart HTTPS push payload | `git receive-pack` | |
| 165 | + |
| 166 | +The reverse also holds: pythongit reads packs and indexes produced by real |
| 167 | +`git` clones. |
| 168 | + |
| 169 | +## Architecture |
| 170 | + |
| 171 | +### Object storage |
| 172 | + |
| 173 | +Loose objects under `.git/objects/<sha[:2]>/<sha[2:]>`, zlib-compressed. Pack |
| 174 | +objects in `.git/objects/pack/pack-*.{pack,idx}`. The pack reader handles both |
| 175 | +`REF_DELTA` (delta against a base named by its raw 20-byte SHA-1) and `OFS_DELTA` |
| 176 | +(delta against an earlier offset in the same pack). `pack.build_pack` also writes deltas: |
| 177 | +candidate bases come from a windowed search over recent same-type objects, |
| 178 | +accepted when the delta is at most half the raw size. |
| 179 | + |
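As a rough illustration of the loose-object layout described above, the sketch below hashes and compresses a blob the way git's object format specifies. The function name `encode_loose_object` is made up for this example and is not claimed to match anything in `objects.py`.

```python
import hashlib
import zlib


def encode_loose_object(obj_type: str, payload: bytes):
    """Return (hex object ID, path under .git/objects, compressed bytes).

    Hypothetical helper for illustration only.
    """
    # Git hashes "<type> <size>\0" followed by the raw payload.
    store = f"{obj_type} {len(payload)}".encode("ascii") + b"\x00" + payload
    oid = hashlib.sha1(store).hexdigest()
    # The first two hex digits name the directory, the rest name the file.
    return oid, f"{oid[:2]}/{oid[2:]}", zlib.compress(store)


oid, rel_path, compressed = encode_loose_object("blob", b"hello\n")
print(oid)       # ce013625030ba8dba906f756967f9e9ca394464a
print(rel_path)  # ce/013625030ba8dba906f756967f9e9ca394464a
```

The printed ID is the same one `git hash-object` reports for a file containing `hello` plus a newline, which is a quick way to sanity-check an encoder.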
| 180 | +### Index |
| 181 | + |
| 182 | +DIRC v2 with full stage support (bits 13-12 of the 16-bit flags field). When a merge |
| 183 | +or cherry-pick conflicts, stages 1 (base), 2 (ours), 3 (theirs) are written to |
| 184 | +the index alongside a stage-0 entry pointing at the merged-with-markers blob. |
| 185 | +`pygit commit` refuses to commit while any stage > 0 exists; `pygit add` |
| 186 | +clears the conflict stages on resolution. `pygit merge-index -o <tool>` walks |
| 187 | +conflicted entries and invokes the driver with `(path, base-tmp, ours-tmp, |
| 188 | +theirs-tmp)`. |
| 189 | + |
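To make the stage handling concrete, here is a small sketch of packing and reading the stage in an index entry's 16-bit flags field. It assumes the layout from git's index format documentation, where mask `0x3000` selects the two stage bits and the low 12 bits hold the path length. The helper names are illustrative and not taken from `index.py`.

```python
STAGE_MASK = 0x3000      # two stage bits in the 16-bit flags field
STAGE_SHIFT = 12
NAME_LEN_MASK = 0x0FFF   # low 12 bits: path length, saturating at 0xFFF


def pack_flags(stage: int, path: bytes) -> int:
    """Build the flags value for an entry (assume-valid and extended bits left clear)."""
    if not 0 <= stage <= 3:
        raise ValueError("stage must be 0..3")
    return (stage << STAGE_SHIFT) | min(len(path), NAME_LEN_MASK)


def stage_of(flags: int) -> int:
    """Extract the stage: 0 = merged, 1 = base, 2 = ours, 3 = theirs."""
    return (flags & STAGE_MASK) >> STAGE_SHIFT


def has_conflicts(entries) -> bool:
    """entries: iterable of (path, flags) pairs; True if any stage > 0 exists."""
    return any(stage_of(flags) > 0 for _path, flags in entries)


entries = [(b"a.txt", pack_flags(s, b"a.txt")) for s in (1, 2, 3)]
print(has_conflicts(entries))
```

A check like `has_conflicts` is one plausible way to implement the "refuse to commit while any stage > 0 exists" rule.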
| 190 | +### Refs & reflog |
| 191 | + |
| 192 | +`refs.update_ref` is the single chokepoint for all ref updates; it |
| 193 | +automatically appends to `.git/logs/<ref>` and (when the updated ref is what |
| 194 | +HEAD points at symbolically) to `.git/logs/HEAD`. This means `reflog`, `stash` |
| 195 | +(via `refs/stash`), and `notes` (via `refs/notes/commits`) all share one |
| 196 | +mechanism. |
| 197 | + |
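The single-chokepoint idea can be sketched as two small pure functions: one that decides which log files an update touches, and one that formats a reflog line in git's standard layout. Both function names and signatures are assumptions made for this example; they are not the actual `refs.update_ref` internals.

```python
from typing import List, Optional

ZERO_OID = "0" * 40  # conventional "no previous value" object ID


def reflog_paths(ref: str, head_target: Optional[str]) -> List[str]:
    """Log files (relative to .git) to append to when `ref` is updated.

    head_target is the ref HEAD points at symbolically, or None if detached.
    """
    paths = [f"logs/{ref}"]
    if head_target == ref:
        paths.append("logs/HEAD")
    return paths


def format_reflog_entry(old: str, new: str, name: str, email: str,
                        timestamp: int, tz: str, message: str) -> str:
    """One reflog line: old and new IDs, identity, time, then a tab and the message."""
    return f"{old} {new} {name} <{email}> {timestamp} {tz}\t{message}\n"


print(reflog_paths("refs/heads/main", "refs/heads/main"))
print(format_reflog_entry(ZERO_OID, "a" * 40, "You", "you@example.com",
                          1700000000, "+0000", "commit (initial): first commit"),
      end="")
```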
| 198 | +### Merge |
| 199 | + |
| 200 | +`merge.merge_bases` mirrors `commit-reach.c`'s `paint_down_to_common`: BFS |
| 201 | +from both tips with PARENT1/PARENT2 flags, marking double-flagged commits as |
| 202 | +results and pushing STALE to their ancestors. `merge.merge_blob` is a |
| 203 | +line-based three-way merge that consults the rerere cache before falling back |
| 204 | +to emitting conflict markers. |
| 205 | + |
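For readers who want the intent of merge-base computation without the flag bookkeeping, the sketch below uses a deliberately simpler technique: intersect the two ancestor sets, then drop any common ancestor that is itself an ancestor of another common ancestor. It is not the `paint_down_to_common` algorithm and is far less efficient; the `parents` mapping is a toy stand-in for the commit graph.

```python
from collections import deque


def ancestors(parents, tip):
    """All commits reachable from tip, including tip itself (BFS)."""
    seen = {tip}
    queue = deque([tip])
    while queue:
        commit = queue.popleft()
        for parent in parents.get(commit, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def merge_bases(parents, a, b):
    """Common ancestors of a and b that are not ancestors of another common ancestor."""
    common = ancestors(parents, a) & ancestors(parents, b)
    stale = set()
    for commit in common:
        stale |= ancestors(parents, commit) - {commit}
    return common - stale


# Simple fork: root <- base <- {a1, b1}
fork = {"a1": ["base"], "b1": ["base"], "base": ["root"], "root": []}
print(merge_bases(fork, "a1", "b1"))
```

A criss-cross history yields two merge bases, which is the case the STALE propagation in the real algorithm exists to handle efficiently.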
| 206 | +### Rerere |
| 207 | + |
| 208 | +When a conflict is produced, the file (with markers) is hashed after |
| 209 | +normalization (branch labels stripped) and stored under |
| 210 | +`.git/rr-cache/<hash>/preimage` plus a line in `_pending.txt`. When the user |
| 211 | +resolves the conflict and runs `commit`, the post-image is recorded next to |
| 212 | +it. The next time the *same* logical conflict appears, the merge replays the |
| 213 | +post-image automatically. |
| 214 | + |
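The normalization step can be illustrated with a simplified sketch: strip the branch labels from the conflict markers, then hash what is left. This is an assumption-laden approximation (real git's rerere normalizes more aggressively, and `rerere.py` may differ); `conflict_id` is a hypothetical name.

```python
import hashlib


def conflict_id(text_with_markers: str) -> str:
    """Hash a conflicted file after removing branch labels from its markers."""
    normalized = []
    for line in text_with_markers.splitlines():
        if line.startswith("<<<<<<<"):
            line = "<<<<<<<"      # drop e.g. " HEAD"
        elif line.startswith(">>>>>>>"):
            line = ">>>>>>>"      # drop e.g. " feature"
        normalized.append(line)
    return hashlib.sha1("\n".join(normalized).encode("utf-8")).hexdigest()


first = "<<<<<<< HEAD\nours\n=======\ntheirs\n>>>>>>> feature\n"
second = "<<<<<<< main\nours\n=======\ntheirs\n>>>>>>> topic-2\n"
print(conflict_id(first) == conflict_id(second))
```

Because the labels are removed before hashing, the same logical conflict maps to the same cache directory regardless of which branches produced it.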
| 215 | +### Bisect |
| 216 | + |
| 217 | +`bisect_step` follows git's `best_bisection`: for each candidate commit, |
| 218 | +compute `min(reachable_from_it, n - reachable_from_it)` and pick the maximum |
| 219 | +— i.e. the commit that splits the candidate DAG as evenly as possible. |
| 220 | + |
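The scoring rule above can be sketched directly. This is a minimal, unoptimized version that recounts reachability for every candidate with an iterative walk; the function names and the toy `parents` mapping are assumptions for the example, not the real `bisect_step` code.

```python
def reachable_count(parents, candidates, start):
    """Number of candidate commits reachable from start, including start."""
    seen = set()
    stack = [start]
    while stack:
        commit = stack.pop()
        if commit in seen or commit not in candidates:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, ()))
    return len(seen)


def best_bisection(parents, candidates):
    """Pick the candidate maximizing min(reachable, n - reachable)."""
    candidates = set(candidates)
    n = len(candidates)

    def score(commit):
        reach = reachable_count(parents, candidates, commit)
        return min(reach, n - reach)

    # sorted() only makes tie-breaking deterministic in this sketch.
    return max(sorted(candidates), key=score)


# Linear history: good <- c1 <- c2 <- c3 <- c4 <- c5 <- c6 (c6 is bad)
chain = {"c1": ["good"], "c2": ["c1"], "c3": ["c2"],
         "c4": ["c3"], "c5": ["c4"], "c6": ["c5"]}
print(best_bisection(chain, ["c1", "c2", "c3", "c4", "c5", "c6"]))  # c3
```

On a linear history this reduces to picking the midpoint, as in an ordinary binary search.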
| 221 | +### Pack writer (delta compression) |
| 222 | + |
| 223 | +`pack._compute_delta` builds a hash table of every 16-byte block in the base, |
| 224 | +then sweeps the target looking for matches >= 4 bytes long. Matches become |
| 225 | +`COPY` ops; misses are accumulated into `INSERT` ops capped at 127 bytes each. |
| 226 | +The encoder is conservative: it accepts a delta only when it's at most 50% of |
| 227 | +raw size, keeping the chain length sensible. |
| 228 | + |
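The `COPY` and `INSERT` ops follow git's pack delta encoding, which the sketch below writes and reads back. It covers only instruction encoding, not the block-hash matching, and the function names are illustrative rather than the ones in `pack.py`. It assumes copy sizes between 1 and 0xFFFFFF and offsets below 2**32.

```python
def encode_varint(n: int) -> bytes:
    """Little-endian base-128 size used in the delta header."""
    out = bytearray()
    while True:
        byte, n = n & 0x7F, n >> 7
        out.append(byte | 0x80 if n else byte)
        if not n:
            return bytes(out)


def encode_copy(offset: int, size: int) -> bytes:
    """COPY op: high bit set; low 7 bits flag which offset/size bytes follow."""
    cmd, tail = 0x80, bytearray()
    for i in range(4):                      # up to 4 offset bytes
        b = (offset >> (8 * i)) & 0xFF
        if b:
            cmd |= 1 << i
            tail.append(b)
    for i in range(3):                      # up to 3 size bytes
        b = (size >> (8 * i)) & 0xFF
        if b:
            cmd |= 0x10 << i
            tail.append(b)
    return bytes([cmd]) + bytes(tail)


def encode_insert(data: bytes) -> bytes:
    """INSERT ops: a length byte (1..127) followed by that many literal bytes."""
    out = bytearray()
    for i in range(0, len(data), 127):
        chunk = data[i:i + 127]
        out.append(len(chunk))
        out += chunk
    return bytes(out)


def apply_delta(base: bytes, delta: bytes) -> bytes:
    pos = 0

    def read_varint() -> int:
        nonlocal pos
        value = shift = 0
        while True:
            byte = delta[pos]
            pos += 1
            value |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                return value

    base_size, result_size = read_varint(), read_varint()
    if base_size != len(base):
        raise ValueError("delta does not match base")
    out = bytearray()
    while pos < len(delta):
        cmd = delta[pos]
        pos += 1
        if cmd & 0x80:
            offset = size = 0
            for i in range(4):
                if cmd & (1 << i):
                    offset |= delta[pos] << (8 * i)
                    pos += 1
            for i in range(3):
                if cmd & (0x10 << i):
                    size |= delta[pos] << (8 * i)
                    pos += 1
            size = size or 0x10000          # encoded size 0 means 0x10000
            out += base[offset:offset + size]
        elif cmd:
            out += delta[pos:pos + cmd]
            pos += cmd
        else:
            raise ValueError("reserved delta opcode 0")
    if len(out) != result_size:
        raise ValueError("result size mismatch")
    return bytes(out)


base = b"the quick brown fox jumps over the lazy dog"
target = base[:20] + b"X" * 200 + base[20:]
delta = (encode_varint(len(base)) + encode_varint(len(target))
         + encode_copy(0, 20) + encode_insert(b"X" * 200)
         + encode_copy(20, len(base) - 20))
print(apply_delta(base, delta) == target)
```

The 200-byte literal run is split into two `INSERT` ops (127 + 73 bytes), which is where the 127-byte cap comes from: the opcode's high bit is reserved for `COPY`.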
| 229 | +### Binary commit-graph |
| 230 | + |
| 231 | +Implements the format from `gitformat-commit-graph.adoc`: |
| 232 | + |
| 233 | +```text |
| 234 | +HEADER (8 bytes) CGPH + ver(1) + hashver(1) + chunk_count + base_count |
| 235 | +TOC ((C+1)*12) per-chunk (id, offset_uint64) + terminator |
| 236 | +OIDF (256*4) fanout: cumulative counts indexed by first byte of OID |
| 237 | +OIDL (N*20) sorted SHA-1s |
| 238 | +CDAT (N*36) tree(20) + parent1_pos(4) + parent2_pos(4) + gen+time(8) |
| 239 | +EDGE (optional) octopus extra parents |
| 240 | +TRAILER (20) SHA-1 of all preceding bytes |
| 241 | +``` |
| 242 | + |
| 243 | +Generation numbers count topological level (1 for roots). The on-disk file is |
| 244 | +verifiable by real `git commit-graph verify`. |
| 245 | + |
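A reader for the header and table of contents in the layout above might look like the following sketch. It uses only `struct` and assumes big-endian fields as in git's format documentation; `parse_header_and_toc` is a hypothetical name, and the test data is a synthetic header for an empty graph rather than a real file.

```python
import struct


def parse_header_and_toc(data: bytes):
    """Return (version, hash_version, base_count, [(chunk_id, offset), ...])."""
    signature, version, hash_version, chunk_count, base_count = struct.unpack_from(
        ">4sBBBB", data, 0)
    if signature != b"CGPH":
        raise ValueError("not a commit-graph file")
    toc = []
    # chunk_count entries plus one terminator entry, 12 bytes each.
    for i in range(chunk_count + 1):
        chunk_id, offset = struct.unpack_from(">4sQ", data, 8 + 12 * i)
        toc.append((chunk_id, offset))
    return version, hash_version, base_count, toc


# Synthetic example: 3 chunks, zero commits, so OIDL and CDAT are empty.
first_chunk = 8 + 4 * 12                       # header + (3 + 1) TOC entries
sample = struct.pack(">4sBBBB", b"CGPH", 1, 1, 3, 0)
sample += struct.pack(">4sQ", b"OIDF", first_chunk)
sample += struct.pack(">4sQ", b"OIDL", first_chunk + 256 * 4)
sample += struct.pack(">4sQ", b"CDAT", first_chunk + 256 * 4)
sample += struct.pack(">4sQ", b"\x00\x00\x00\x00", first_chunk + 256 * 4)

print(parse_header_and_toc(sample))
```

Each chunk's length is implied by the next entry's offset, which is why the TOC carries one more entry than there are chunks.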
| 246 | +### Smart HTTPS |
| 247 | + |
| 248 | +`protocol.discover_refs` calls `GET /info/refs?service=git-upload-pack`, |
| 249 | +strips the pkt-line framing, and returns the ref map. `protocol.fetch_pack` |
| 250 | +posts `want <sha>` lines + capability list and parses the side-band-encoded |
| 251 | +pack response. `protocol.push` does the receive-pack flow including building |
| 252 | +a non-thin pack of only-new objects and parsing `ok/ng` lines. |
| 253 | + |
| 254 | +The `daemon` command serves the same flow over a raw TCP socket (git:// at |
| 255 | +port 9418), implemented with `socketserver.ThreadingTCPServer`. `http-backend` |
| 256 | +is an in-process variant used by `instaweb`. |
| 257 | + |
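The pkt-line framing that `protocol.discover_refs` strips is simple enough to sketch in a few lines: each packet starts with a four-digit hex length that includes the four length bytes themselves, and `0000` is a flush packet. The helper names here are assumptions for illustration, and the parser ignores protocol v2's special `0001`/`0002` packets.

```python
FLUSH = b"0000"


def pkt_line(payload: bytes) -> bytes:
    """Frame one payload: 4 hex digits of total length, then the payload."""
    return b"%04x" % (len(payload) + 4) + payload


def parse_pkt_lines(data: bytes):
    """Yield payloads in order; a flush packet is yielded as None."""
    pos = 0
    while pos < len(data):
        length = int(data[pos:pos + 4], 16)
        if length == 0:
            yield None
            pos += 4
        else:
            yield data[pos + 4:pos + length]
            pos += length


stream = pkt_line(b"# service=git-upload-pack\n") + FLUSH + pkt_line(b"want abc\n")
print(list(parse_pkt_lines(stream)))
```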
| 258 | +## Testing |
| 259 | + |
| 260 | +```bash |
| 261 | +pip install "pythongit[test]" |
| 262 | +pytest |
| 263 | +``` |
| 264 | + |
| 265 | +74 tests pass: |
| 266 | + |
| 267 | +| File | Coverage | |
| 268 | +|-------------------------|----------| |
| 269 | +| `unit_objects.py` | hash, encode/decode, signatures, gitlinks | |
| 270 | +| `unit_refs.py` | symbolic refs, reflog, packed-refs, abbrev SHA | |
| 271 | +| `unit_index.py` | DIRC v2 roundtrip, conflict stages, long paths | |
| 272 | +| `unit_pack.py` | delta apply, idx v2, build_pack, real-git interop | |
| 273 | +| `unit_modules.py` | diff/merge/patch/ignore/rerere unit-level | |
| 274 | +| `unit_integration.py` | end-to-end CLI flows incl. conflicts + rerere replay | |
| 275 | +| `unit_phase_scripts.py` | wraps the script-style phase tests | |
| 276 | + |
| 277 | +Tests that require the real `git` binary are silently skipped when it's not on |
| 278 | +PATH, so the suite runs cleanly in containers without one. |
| 279 | + |
| 280 | +## What's intentionally NOT implemented |
| 281 | + |
| 282 | +* SHA-256 object IDs. The format module is wired for SHA-1; SHA-256 would |
| 283 | + need a few format changes (hash length = H byte, idx v3, longer OIDs). |
| 284 | +* Bitmap indexes, multi-pack-index in binary form, and bloom filters on the |
| 285 | + commit-graph. The hot paths use linear scans instead — fine up to a few |
| 286 | + thousand commits / a few hundred MB of packs. |
| 287 | +* `git filter-repo` (it's a separate Python tool anyway, not a git built-in). |
| 288 | +* The fancier merge strategies (`recursive`'s rename detection, `ort`'s |
| 289 | + three-way for trees). `pygit merge-recursive` aliases to the default |
| 290 | + three-way merge. |
| 291 | + |
| 292 | +## Limitations to know about |
| 293 | + |
| 294 | +* Big repos: scans walk every loose object on disk and every pack |
| 295 | + sequentially. Fine for typical project sizes; not designed for the |
| 296 | + linux-kernel-or-larger end of the spectrum. |
| 297 | +* The `bisect` heuristic computes weights with a Python recursion — for |
| 298 | + multi-thousand-commit candidate sets this is slow. |
| 299 | +* `fsmonitor` polls instead of using OS-level notifications (inotify, |
| 300 | + FSEvents). The interval is configurable, but polling still has a cost. |
| 301 | +* `send-email` only supports vanilla SMTP via `smtplib`. No SSL/TLS-only |
| 302 | + authentication helpers (it does use `starttls()` when given a `--smtp-user`). |
| 303 | +* `gitk` / `gui` need a working Tk install (`tkinter`). |
| 304 | + |
| 305 | +## Contributing |
| 306 | + |
| 307 | +The project tries to follow git's published wire and on-disk format specs |
| 308 | +(`Documentation/gitformat-*.adoc`, `Documentation/technical/*.adoc`). When |
| 309 | +adding a feature: |
| 310 | + |
| 311 | +1. Find the matching `builtin/<name>.c` and read its argument parser to figure |
| 312 | + out the flag set people actually use. |
| 313 | +2. Implement the behavior, but only the common flags first. Less-common flags |
| 314 | + should fail via `ArgumentParser.error()` rather than silently misbehave. |
| 315 | +3. Add a unit test in `tests/unit_*.py`. If real `git` can verify the output, |
| 316 | + also add an interop check. |
| 317 | +4. Run `pytest` — must remain green. |
| 318 | + |
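Step 2 could look roughly like the sketch below: the parser still accepts a less-common flag so that it can reject it with a clear message through `ArgumentParser.error()`, which prints usage and exits with status 2. The `tag` subcommand and its `--sign` flag are used purely as a hypothetical example; this is not a statement about what `cli.py` currently supports.

```python
import argparse


def build_tag_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="pygit tag")
    parser.add_argument("name")
    # Declared so it can be rejected explicitly instead of being misparsed.
    parser.add_argument("-s", "--sign", action="store_true")
    return parser


def parse_tag_args(argv):
    parser = build_tag_parser()
    args = parser.parse_args(argv)
    if args.sign:
        parser.error("--sign is not implemented yet")  # exits with status 2
    return args


print(parse_tag_args(["v1"]).name)
```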
| 319 | +## License |
| 320 | + |
| 321 | +MIT. |