Commit faf3dbc

blhsing and claude committed
Initial commit: pure-Python git reimplementation
All 141 git built-in subcommands implemented (158 commands with aliases).
On-disk format is byte-compatible with real git: real git can read
pythongit's loose objects, packs (with deltas), index, refs, reflog, and
commit-graph.

Highlights:
- DIRC v2 index with conflict stages (1/2/3)
- Pack v2 + idx v2 with OFS_DELTA / REF_DELTA encoder and decoder
- Binary commit-graph file passing git commit-graph verify
- Smart HTTPS clone/fetch/push, plus git:// daemon and http-backend
- Three-way merge with rerere replay
- 74 pytest tests covering objects, refs, index, pack, diff, merge, patch,
  ignore, rerere, and full integration flows

Installs both pygit and a drop-in git console script.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
0 parents  commit faf3dbc

40 files changed

Lines changed: 11385 additions & 0 deletions

.gitignore

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
# Build artifacts
build/
dist/
*.egg-info/
*.egg

# Python
__pycache__/
*.py[cod]
.pytest_cache/
.mypy_cache/
.ruff_cache/

# Local venvs
venv/
.venv/
env/

# Editor / OS
.idea/
.vscode/
.DS_Store
Thumbs.db

README.md

Lines changed: 321 additions & 0 deletions
@@ -0,0 +1,321 @@
# pythongit

A pure-Python reimplementation of `git`. No external runtime dependencies: just
the Python standard library. All 141 of git's built-in subcommands are
implemented, the on-disk format is byte-for-byte compatible with real `git`,
and the package installs both `pygit` and a drop-in `git` console script.

```text
pythongit/                      (repo root)
├── pyproject.toml
├── README.md                   this file
├── pythongit/                  importable package, at repo root
│   ├── __init__.py
│   ├── __main__.py             `python -m pythongit ...`
│   ├── cli.py                  command dispatch (158 commands)
│   ├── repo.py                 Repository discovery + config
│   ├── objects.py              blob / tree / commit / tag encode/decode
│   ├── refs.py                 ref resolution, update, reflog hook
│   ├── reflog.py               append-only ref log
│   ├── index.py                DIRC v2 with conflict stages
│   ├── workdir.py              add/rm/status/checkout, tree↔workdir
│   ├── diff.py                 Myers diff + unified-diff renderer
│   ├── merge.py                merge-base + three-way blob merge
│   ├── sequencer.py            cherry-pick / revert / rebase
│   ├── porcelain_merge.py      ff + 3-way merge entry point
│   ├── patch.py                unified-diff parser + applier
│   ├── pack.py                 pack v2 + idx v2, REF_DELTA + OFS_DELTA, encoder
│   ├── protocol.py             smart HTTPS clone / fetch / push
│   ├── stash.py                refs/stash + reflog-backed stash
│   ├── ignore.py               .gitignore engine
│   ├── rerere.py               reuse recorded resolution
│   └── bridges.py              daemon / http-backend / SMTP / Tk / shell-out
└── tests/                      pytest + script-style integration tests
```

## Why does this exist?

Sometimes you need `git` on a machine where you can't install a real `git`
binary: locked-down CI workers, restricted containers, environments where the
only thing you can `pip install` is wheels. `pythongit` ships as a single
pure-Python wheel and exposes a `git` command. Most everyday workflows just
work.

This is also a reasonable reference implementation if you want to understand
git's on-disk formats and protocols. The code in this repo cross-references
git's own `Documentation/gitformat-*.adoc` specs for the wire formats it
implements.

## Install

```bash
pip install pythongit
```

This installs **two console scripts**:

| Script  | Purpose                                                           |
|---------|-------------------------------------------------------------------|
| `pygit` | Unambiguous name; always invokes pythongit                        |
| `git`   | Drop-in name; shadows real `git` only if it comes earlier on PATH |

If a real `git` binary is already on PATH and earlier than the venv's `Scripts/`
or `bin/` directory, your shell will resolve `git` to the real one. To force
the pythongit version, either use `pygit`, put the venv earlier on PATH, or run
`python -m pythongit ...`.

You can also run from a checkout without installing:

```bash
python -m pythongit <command> [args...]
```

## Tutorial

```bash
mkdir demo && cd demo
pygit init .
pygit config user.name "You"
pygit config user.email "you@example.com"

echo "hello" > a.txt
pygit add a.txt
pygit commit -m "first commit"

echo "world" >> a.txt
pygit diff
pygit add a.txt
pygit commit -m "append world"

pygit log --oneline
pygit tag v1
pygit branch feature
pygit checkout feature
echo "feature work" > f.txt
pygit add f.txt
pygit commit -m "feature commit"

pygit checkout main
pygit merge feature
```

Cloning over HTTPS:

```bash
pygit clone https://github.com/some/repo.git
```

## Supported commands

All 141 git built-in subcommands plus aliases (158 entries in total). Selected
highlights:

**Plumbing.** `hash-object`, `cat-file`, `ls-tree`, `write-tree`, `read-tree`,
`commit-tree`, `mktree`, `mktag`, `update-ref`, `symbolic-ref`, `rev-parse`,
`rev-list`, `ls-files`, `diff-tree`, `diff-index`, `diff-files`, `diff-pairs`,
`pack-objects`, `unpack-objects`, `index-pack`, `verify-pack`, `show-index`,
`unpack-file`, `merge-index`, `merge-file`, `update-index`, `update-server-info`,
`check-ref-format`, `check-attr`, `check-mailmap`, `check-ignore`, `for-each-ref`,
`show-ref`, `pack-refs`, `prune-packed`, `pack-redundant`, `multi-pack-index`,
`fetch-pack`, `send-pack`, `upload-pack`, `receive-pack`, `upload-archive`,
`http-fetch`, `http-backend`, `fmt-merge-msg`, `mailinfo`, `mailsplit`,
`patch-id`, `commit-graph`, `var`, `stripspace`.

**Porcelain.** `init`, `clone`, `add`, `rm`, `mv`, `status`, `commit`, `log`,
`show`, `diff`, `branch`, `tag`, `checkout`, `switch`, `restore`, `reset`,
`merge`, `merge-tree`, `cherry-pick`, `revert`, `rebase`, `replay`, `cherry`,
`range-diff`, `stash`, `reflog`, `notes`, `bisect`, `blame`, `annotate`,
`describe`, `name-rev`, `shortlog`, `whatchanged`, `clean`, `archive`,
`bundle`, `format-patch`, `am`, `apply`, `grep`, `show-branch`, `worktree`,
`submodule`, `sparse-checkout`, `request-pull`, `interpret-trailers`,
`verify-commit`, `verify-tag`, `rerere`, `replace`, `gc`, `repack`, `prune`,
`count-objects`, `fsck`, `pull`, `fetch`, `push`, `remote`, `ls-remote`,
`config`, `refs`, `repo`, `diagnose`, `bugreport`, `last-modified`, `history`,
`url-parse`, `maintenance`.

**Bridges (orchestrate other binaries / protocols).** `send-email` (via
`smtplib`), `daemon` (TCP git:// server), `instaweb`/`gitweb` (`http.server`-based
browser), `gitk`/`gui` (Tk log viewer), `cvsimport`/`cvsexportcommit`/`cvsserver`
(shell out to `cvs`), `svn` (shell out to `svn`), `difftool`/`mergetool`
(invoke configured external tool), `credential`/`credential-store`/
`credential-cache`/`credential-cache-daemon`, `remote-helper`/`remote-ext`/
`remote-fd`, `fsmonitor`/`fsmonitor-daemon`, `shell` (restricted ssh
dispatcher), `init-db`, `submodule-helper`, `checkout-worker`, `backfill`.

To see the full list:

```bash
pygit help
```

## Interop with real git

The on-disk format is byte-for-byte compatible with the git C implementation.
The test suite verifies this against the real `git` binary:

| pythongit writes...            | ...real `git` validates    |
|--------------------------------|----------------------------|
| loose objects                  | `git fsck`                 |
| tree / commit objects          | `git cat-file -p`          |
| index v2 with stages           | `git ls-files --stage`     |
| pack v2 + idx v2 (with deltas) | `git verify-pack -v`       |
| binary commit-graph file       | `git commit-graph verify`  |
| refs / packed-refs / reflog    | `git log --all`            |
| smart HTTPS push payload       | `git receive-pack`         |

The reverse also holds: pythongit reads packs and indexes produced by real
`git` clones.

## Architecture

### Object storage

Loose objects live under `.git/objects/<sha[:2]>/<sha[2:]>`, zlib-compressed.
Packed objects live in `.git/objects/pack/pack-*.{pack,idx}`. The pack reader
handles both `REF_DELTA` (delta against a base named by its 20-byte binary
object ID) and `OFS_DELTA` (delta against a base at an earlier offset in the
same pack). `pack.build_pack` also writes deltas: candidate bases come from a
windowed search over recent same-type objects, and a delta is accepted when it
is at most half the raw size.

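The loose-object layout can be sketched with the standard library alone. This is a minimal illustration, not pythongit's actual `objects.py`: the function names are hypothetical, and SHA-1 object IDs are assumed.

```python
import hashlib
import zlib


def encode_loose_object(obj_type: str, payload: bytes) -> tuple[str, bytes]:
    """Return (hex object ID, zlib-compressed bytes) for a loose object."""
    # The hashed and stored content is "<type> <size>\0" followed by the payload.
    store = f"{obj_type} {len(payload)}".encode("ascii") + b"\x00" + payload
    return hashlib.sha1(store).hexdigest(), zlib.compress(store)


def loose_object_path(oid: str) -> str:
    """Map an object ID to its path relative to the .git directory."""
    return f"objects/{oid[:2]}/{oid[2:]}"


def decode_loose_object(compressed: bytes) -> tuple[str, bytes]:
    """Inverse of encode_loose_object: return (type, payload)."""
    store = zlib.decompress(compressed)
    header, _, payload = store.partition(b"\x00")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(payload):
        raise ValueError("size in header does not match payload length")
    return obj_type, payload
```

Because the ID is a hash of the uncompressed header plus payload, the zlib compression level does not affect the object ID.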
### Index

DIRC v2 with full stage support (the two stage bits of the 16-bit flags field:
bits 13-12 counting from bit 0, mask `0x3000`). When a merge or cherry-pick
conflicts, stages 1 (base), 2 (ours), 3 (theirs) are written to the index
alongside a stage-0 entry pointing at the merged-with-markers blob.
`pygit commit` refuses to commit while any stage > 0 exists; `pygit add`
clears the conflict stages on resolution. `pygit merge-index -o <tool>` walks
conflicted entries and invokes the driver with `(path, base-tmp, ours-tmp,
theirs-tmp)`.

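As a rough sketch of how the stage lives inside the flags field, the helpers below pack and unpack the 16-bit value following git's published index format. The names `pack_flags` and `unpack_flags` are hypothetical and are not taken from pythongit's `index.py`.

```python
ASSUME_VALID = 0x8000  # bit 15
EXTENDED = 0x4000      # bit 14, must be zero in index version 2
STAGE_MASK = 0x3000    # bits 13-12 hold the merge stage (0..3)
STAGE_SHIFT = 12
NAME_MASK = 0x0FFF     # bits 11-0 hold the path length, saturating at 0xFFF


def pack_flags(stage: int, path: bytes, assume_valid: bool = False) -> int:
    """Build the 16-bit flags value for one index entry."""
    if not 0 <= stage <= 3:
        raise ValueError("stage must be 0, 1, 2 or 3")
    flags = (stage << STAGE_SHIFT) | min(len(path), NAME_MASK)
    if assume_valid:
        flags |= ASSUME_VALID
    return flags


def unpack_flags(flags: int) -> tuple[int, int, bool]:
    """Return (stage, stored name length, assume_valid)."""
    stage = (flags & STAGE_MASK) >> STAGE_SHIFT
    return stage, flags & NAME_MASK, bool(flags & ASSUME_VALID)
```

For example, `pack_flags(2, b"a.txt")` yields `0x2005`: stage 2 ("ours") and a five-byte path.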
### Refs & reflog

`refs.update_ref` is the single chokepoint for all ref updates; it
automatically appends to `.git/logs/<ref>` and (when the updated ref is what
HEAD points at symbolically) to `.git/logs/HEAD`. This means `reflog`, `stash`
(via `refs/stash`), and `notes` (via `refs/notes/commits`) all share one
mechanism.

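A minimal sketch of what such a chokepoint has to produce, assuming git's standard reflog line layout (old ID, new ID, identity, timestamp, timezone, tab, message). The helper names are hypothetical and do not mirror the real `refs.update_ref` signature.

```python
from typing import Optional

ZERO_OID = "0" * 40  # used as the old ID when a ref is created


def format_reflog_entry(old_oid: str, new_oid: str, ident: str,
                        timestamp: int, tz: str, message: str) -> str:
    """Render one reflog line; `ident` is expected as 'Name <email>'."""
    summary = " ".join(message.split())  # reflog messages stay on one line
    return f"{old_oid} {new_oid} {ident} {timestamp} {tz}\t{summary}\n"


def reflog_files(ref: str, head_target: Optional[str]) -> list[str]:
    """Log files (relative to .git) that an update of `ref` should append to."""
    files = [f"logs/{ref}"]
    if head_target == ref:
        # HEAD symbolically points at this ref, so HEAD's log gets a copy.
        files.append("logs/HEAD")
    return files
```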
### Merge

`merge.merge_bases` mirrors `commit-reach.c`'s `paint_down_to_common`: BFS
from both tips with PARENT1/PARENT2 flags, marking double-flagged commits as
results and pushing STALE to their ancestors. `merge.merge_blob` is a
line-based three-way merge that consults the rerere cache before falling back
to emitting conflict markers.

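The painting idea can be sketched on a plain dictionary that maps each commit to its parents. This is an illustrative simplification under stated assumptions: it uses a FIFO queue and runs until the queue is empty, whereas git orders its queue by commit date and stops early. The function signature is hypothetical.

```python
from collections import deque

PARENT1, PARENT2, STALE, RESULT = 1, 2, 4, 8


def merge_bases(parents: dict[str, list[str]], one: str, two: str) -> list[str]:
    """Return the best common ancestors of `one` and `two`."""
    flags: dict[str, int] = {one: PARENT1}
    flags[two] = flags.get(two, 0) | PARENT2
    queue = deque([one, two])
    results: list[str] = []

    while queue:
        commit = queue.popleft()
        paint = flags[commit] & (PARENT1 | PARENT2 | STALE)
        if paint == (PARENT1 | PARENT2):
            # Reachable from both tips and not yet known to be redundant.
            if not (flags[commit] & RESULT):
                flags[commit] |= RESULT
                results.append(commit)
            paint |= STALE  # ancestors of a result cannot be best bases
        for parent in parents.get(commit, []):
            if (flags.get(parent, 0) & paint) == paint:
                continue  # nothing new to propagate
            flags[parent] = flags.get(parent, 0) | paint
            queue.append(parent)

    # A commit may be marked RESULT before STALE reaches it, so filter last.
    return [c for c in results if not (flags[c] & STALE)]
```

The final filter matters for the FIFO variant: an older common ancestor can be visited before a newer one has pushed `STALE` down to it.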
### Rerere

When a conflict is produced, the file (with markers) is hashed after
normalization (branch labels stripped) and stored under
`.git/rr-cache/<hash>/preimage` plus a line in `_pending.txt`. When the user
resolves the conflict and runs `commit`, the post-image is recorded next to
it. The next time the *same* logical conflict appears, the merge replays the
post-image automatically.

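The normalization step can be illustrated as follows. This is only a sketch of the idea: the exact normalization rules and the hash function are assumptions here (SHA-1 over the label-stripped text), and `normalize_conflict` and `conflict_id` are hypothetical names.

```python
import hashlib

# Marker lines that may carry a branch label after the marker itself.
MARKERS = ("<<<<<<<", "|||||||", ">>>>>>>")


def normalize_conflict(text: str) -> str:
    """Strip branch labels from conflict marker lines."""
    lines = []
    for line in text.splitlines():
        for marker in MARKERS:
            if line.startswith(marker):
                line = marker  # drop the trailing " <label>"
                break
        lines.append(line)
    return "\n".join(lines) + "\n"


def conflict_id(text: str) -> str:
    """Hash the normalized conflict so equal conflicts share one cache key."""
    return hashlib.sha1(normalize_conflict(text).encode("utf-8")).hexdigest()
```

With labels removed, the same textual conflict produced while merging differently named branches maps to the same cache directory.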
### Bisect

`bisect_step` follows git's `best_bisection`: for each candidate commit,
compute `min(reachable_from_it, n - reachable_from_it)` and pick the maximum,
i.e. the commit that splits the candidate DAG as evenly as possible.

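The selection rule can be sketched directly from that formula. The function below is a hypothetical illustration, not pythongit's `bisect_step`; it recomputes reachability for every candidate with an explicit stack, which is simple but quadratic.

```python
def best_bisection(parents: dict[str, list[str]], candidates: set[str]) -> str:
    """Pick the candidate that splits the candidate set most evenly.

    Raises ValueError when `candidates` is empty.
    """
    total = len(candidates)

    def weight(commit: str) -> int:
        # Number of candidates reachable from `commit`, including itself.
        seen: set[str] = set()
        stack = [commit]
        while stack:
            current = stack.pop()
            if current in seen or current not in candidates:
                continue
            seen.add(current)
            stack.extend(parents.get(current, []))
        return len(seen)

    def score(commit: str) -> int:
        w = weight(commit)
        return min(w, total - w)

    # Sorting makes tie-breaking deterministic.
    return max(sorted(candidates), key=score)
```

On a linear history of seven candidates, the chosen commit is one of the two in the middle, so either test outcome discards roughly half of the remaining range.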
### Pack writer (delta compression)

`pack._compute_delta` builds a hash table of every 16-byte block in the base,
then sweeps the target looking for matches >= 4 bytes long. Matches become
`COPY` ops; misses are accumulated into `INSERT` ops capped at 127 bytes each.
The encoder is conservative: it accepts a delta only when it's at most 50% of
raw size, keeping the chain length sensible.

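To make the `COPY` / `INSERT` opcodes concrete, here is a small sketch of the delta byte format as documented for git packs, with an applier to check the result. The helper names are hypothetical, and the block-matching search performed by `pack._compute_delta` is not reproduced.

```python
def encode_size(n: int) -> bytes:
    """Little-endian base-128 varint used for the two size headers."""
    out = bytearray()
    while True:
        byte, n = n & 0x7F, n >> 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_copy(offset: int, size: int) -> bytes:
    """COPY op: high bit set, low bits flag which offset/size bytes follow."""
    if not (0 < size <= 0xFFFFFF and 0 <= offset <= 0xFFFFFFFF):
        raise ValueError("offset or size out of range")
    cmd, body = 0x80, bytearray()
    for i in range(4):  # up to four offset bytes, zero bytes are omitted
        b = (offset >> (8 * i)) & 0xFF
        if b:
            cmd |= 1 << i
            body.append(b)
    for i in range(3):  # up to three size bytes
        b = (size >> (8 * i)) & 0xFF
        if b:
            cmd |= 0x10 << i
            body.append(b)
    return bytes([cmd]) + bytes(body)


def encode_insert(data: bytes) -> bytes:
    """INSERT ops: a length byte (1..127) followed by that many literal bytes."""
    out = bytearray()
    for i in range(0, len(data), 127):
        chunk = data[i:i + 127]
        out.append(len(chunk))
        out += chunk
    return bytes(out)


def apply_delta(base: bytes, delta: bytes) -> bytes:
    pos = 0

    def read_size() -> int:
        nonlocal pos
        result = shift = 0
        while True:
            b = delta[pos]
            pos += 1
            result |= (b & 0x7F) << shift
            shift += 7
            if not b & 0x80:
                return result

    if read_size() != len(base):
        raise ValueError("delta was built against a different base")
    target_size = read_size()
    out = bytearray()
    while pos < len(delta):
        cmd = delta[pos]
        pos += 1
        if cmd & 0x80:
            offset = size = 0
            for i in range(4):
                if cmd & (1 << i):
                    offset |= delta[pos] << (8 * i)
                    pos += 1
            for i in range(3):
                if cmd & (0x10 << i):
                    size |= delta[pos] << (8 * i)
                    pos += 1
            out += base[offset:offset + (size or 0x10000)]  # size 0 means 64 KiB
        elif cmd:
            out += delta[pos:pos + cmd]
            pos += cmd
        else:
            raise ValueError("opcode 0 is reserved")
    if len(out) != target_size:
        raise ValueError("target size mismatch")
    return bytes(out)
```

The 127-byte cap on `INSERT` follows from the opcode layout: the high bit of the command byte distinguishes `COPY` from `INSERT`, leaving seven bits for the literal length.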
### Binary commit-graph

Implements the format from `gitformat-commit-graph.adoc`:

```text
HEADER  (8 bytes)     CGPH + ver(1) + hashver(1) + chunk_count + base_count
TOC     ((C+1)*12)    per-chunk (id, offset_uint64) + terminator
OIDF    (256*4)       fanout: cumulative counts indexed by first byte of OID
OIDL    (N*20)        sorted SHA-1s
CDAT    (N*36)        tree(20) + parent1_pos(4) + parent2_pos(4) + gen+time(8)
EDGE    (optional)    octopus extra parents
TRAILER (20)          SHA-1 of all preceding bytes
```

Generation numbers count topological level (1 for roots). The on-disk file is
verifiable by real `git commit-graph verify`.

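The fixed-size pieces of this layout map naturally onto `struct`. The sketch below packs and parses the 8-byte header and one 36-byte `CDAT` record, following the bit layout in git's commit-graph format documentation (generation in the top 30 bits, commit time in the remaining 34). Function names are hypothetical and SHA-1 (hash version 1) is assumed.

```python
import struct

GRAPH_PARENT_NONE = 0x70000000  # parent slot value meaning "no parent"


def pack_header(num_chunks: int, num_base_graphs: int = 0) -> bytes:
    # signature, format version 1, hash version 1 (SHA-1), chunk count, base count
    return struct.pack(">4sBBBB", b"CGPH", 1, 1, num_chunks, num_base_graphs)


def parse_header(data: bytes) -> tuple[int, int, int, int]:
    signature, version, hash_version, num_chunks, num_base = struct.unpack_from(
        ">4sBBBB", data
    )
    if signature != b"CGPH":
        raise ValueError("not a commit-graph file")
    return version, hash_version, num_chunks, num_base


def pack_cdat(tree_oid: bytes, parent1: int, parent2: int,
              generation: int, commit_time: int) -> bytes:
    if len(tree_oid) != 20 or generation >= 1 << 30 or commit_time >= 1 << 34:
        raise ValueError("field out of range")
    # The first word carries the generation plus the top two bits of the time.
    high = (generation << 2) | (commit_time >> 32)
    low = commit_time & 0xFFFFFFFF
    return tree_oid + struct.pack(">IIII", parent1, parent2, high, low)


def unpack_cdat(entry: bytes) -> tuple[bytes, int, int, int, int]:
    parent1, parent2, high, low = struct.unpack(">IIII", entry[20:36])
    return entry[:20], parent1, parent2, high >> 2, ((high & 0x3) << 32) | low
```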
### Smart HTTPS

`protocol.discover_refs` calls `GET /info/refs?service=git-upload-pack`,
strips the pkt-line framing, and returns the ref map. `protocol.fetch_pack`
posts `want <sha>` lines + capability list and parses the side-band-encoded
pack response. `protocol.push` does the receive-pack flow including building
a non-thin pack of only-new objects and parsing `ok/ng` lines.

The `daemon` command serves the same flow over a raw TCP socket (git:// at
port 9418), implemented with `socketserver.ThreadingTCPServer`. `http-backend`
is an in-process variant used by `instaweb`.

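The pkt-line framing that these calls strip and produce is small enough to sketch in full. This is a minimal illustration with hypothetical helper names; it handles data packets and the flush packet only, not the protocol v2 delimiter packets.

```python
from typing import Optional

FLUSH = b"0000"
MAX_PKT_LEN = 65520  # total length including the 4-byte hex prefix


def pkt_line(payload: bytes) -> bytes:
    """Frame a payload: 4 hex digits of total length, then the payload."""
    length = len(payload) + 4
    if length > MAX_PKT_LEN:
        raise ValueError("payload too large for one pkt-line")
    return f"{length:04x}".encode("ascii") + payload


def parse_pkt_lines(data: bytes) -> list[Optional[bytes]]:
    """Split a byte stream into payloads; a flush packet becomes None."""
    packets: list[Optional[bytes]] = []
    pos = 0
    while pos < len(data):
        length = int(data[pos:pos + 4], 16)
        if length == 0:
            packets.append(None)
            pos += 4
        elif length < 4:
            raise ValueError("special packets 0001-0003 are not handled here")
        else:
            if pos + length > len(data):
                raise ValueError("truncated pkt-line")
            packets.append(data[pos + 4:pos + length])
            pos += length
    return packets
```

The first line of a smart HTTP ref advertisement, `# service=git-upload-pack\n`, is 26 bytes long, so it is framed with the prefix `001e`.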
## Testing

```bash
pip install "pythongit[test]"
pytest
```

74 tests pass:

| File                    | Coverage                                             |
|-------------------------|------------------------------------------------------|
| `unit_objects.py`       | hash, encode/decode, signatures, gitlinks            |
| `unit_refs.py`          | symbolic refs, reflog, packed-refs, abbrev SHA       |
| `unit_index.py`         | DIRC v2 roundtrip, conflict stages, long paths       |
| `unit_pack.py`          | delta apply, idx v2, build_pack, real-git interop    |
| `unit_modules.py`       | diff/merge/patch/ignore/rerere unit-level            |
| `unit_integration.py`   | end-to-end CLI flows incl. conflicts + rerere replay |
| `unit_phase_scripts.py` | wraps the script-style phase tests                   |

Tests that require the real `git` binary are silently skipped when it's not on
PATH, so the suite runs cleanly in containers without one.

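One possible shape for that skip condition is sketched below. The helper name is hypothetical, and the check is deliberately naive: because this package also installs a `git` console script, a PATH lookup alone cannot tell real git from pythongit, so a real suite would need a stronger test.

```python
import shutil
from typing import Callable, Optional


def real_git_available(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> bool:
    """Return True when some `git` executable can be found on PATH."""
    # `which` is injectable so the decision can be exercised without a real PATH.
    return which("git") is not None
```

A pytest module could then guard interop tests with `pytest.mark.skipif(not real_git_available(), reason="real git not on PATH")`.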
## What's intentionally NOT implemented

* SHA-256 object IDs. The format module is wired for SHA-1; SHA-256 would
  need a few format changes (hash length = H byte, idx v3, longer OIDs).
* Bitmap indexes, multi-pack-index in binary form, and bloom filters on the
  commit-graph. The hot paths use linear scans instead, which is fine up to a
  few thousand commits / a few hundred MB of packs.
* `git filter-repo` (it's a separate Python tool anyway, not a git built-in).
* The fancier merge strategies (`recursive`'s rename detection, `ort`'s
  three-way for trees). `pygit merge-recursive` aliases to the default
  three-way merge.

## Limitations to know about

* Big repos: scans walk every loose object on disk and every pack
  sequentially. Fine for typical project sizes; not designed for the
  linux-kernel-or-larger end of the spectrum.
* The `bisect` heuristic computes weights with a Python recursion; for
  multi-thousand-commit candidate sets this is slow.
* `fsmonitor` uses polling, not OS-level inotify/FSEvents. Configurable
  interval; not free.
* `send-email` only supports vanilla SMTP via `smtplib`. No SSL/TLS-only
  authentication helpers (it does use `starttls()` when given a `--smtp-user`).
* `gitk` / `gui` need a working Tk install (`tkinter`).

## Contributing

The project tries to follow git's published wire and on-disk format specs
(`Documentation/gitformat-*.adoc`, `Documentation/technical/*.adoc`). When
adding a feature:

1. Find the matching `builtin/<name>.c` and read its argument parser to figure
   out the flag set people actually use.
2. Implement the behavior, but only the common flags first. Less-common flags
   should fail loudly through argparse's `parser.error(...)` rather than
   silently misbehave.
3. Add a unit test in `tests/unit_*.py`. If real `git` can verify the output,
   also add an interop check.
4. Run `pytest`; it must remain green.

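The "fail loudly" rule from step 2 might look like the sketch below. The subcommand and flag names (`frob`, `--supported`, `--rare-flag`) are invented for illustration and do not correspond to any real pythongit command.

```python
import argparse
from typing import Sequence


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="pygit frob")  # hypothetical command
    parser.add_argument("--supported", action="store_true",
                        help="a flag whose behavior is implemented")
    # Accepted by the parser so it can be rejected with a clear message.
    parser.add_argument("--rare-flag", action="store_true",
                        help=argparse.SUPPRESS)
    return parser


def parse(argv: Sequence[str]) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.rare_flag:
        # Prints usage plus the message to stderr and exits with status 2.
        parser.error("--rare-flag is not implemented yet")
    return args
```

Rejecting the flag explicitly gives the user a usage message and a non-zero exit status instead of a command that quietly ignores part of its input.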
## License

MIT.

pyproject.toml

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

[project]
name = "pythongit"
version = "0.1.0"
description = "Pure-Python reimplementation of git. Drop-in replacement that exposes both `pygit` and `git` console scripts."
readme = "README.md"
requires-python = ">=3.9"
authors = [{ name = "pythongit" }]
license = { text = "MIT" }
keywords = ["git", "vcs", "version-control", "scm"]
classifiers = [
    "Development Status :: 4 - Beta",
    "Environment :: Console",
    "Intended Audience :: Developers",
    "License :: OSI Approved :: MIT License",
    "Operating System :: OS Independent",
    "Programming Language :: Python :: 3",
    "Programming Language :: Python :: 3 :: Only",
    "Programming Language :: Python :: 3.9",
    "Programming Language :: Python :: 3.10",
    "Programming Language :: Python :: 3.11",
    "Programming Language :: Python :: 3.12",
    "Programming Language :: Python :: 3.13",
    "Programming Language :: Python :: 3.14",
    "Topic :: Software Development :: Version Control :: Git",
]

[project.urls]
Homepage = "https://github.com/example/pythongit"
Repository = "https://github.com/example/pythongit"

[project.scripts]
# Primary CLI name.
pygit = "pythongit.cli:main"
# Drop-in name. When this package is installed in an environment without a
# real `git` binary on PATH, calling `git ...` invokes pythongit instead.
git = "pythongit.cli:main"

[project.optional-dependencies]
test = ["pytest>=7"]

[tool.setuptools.packages.find]
include = ["pythongit*"]

[tool.setuptools.package-data]
pythongit = ["py.typed"]

[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = ["test_*.py", "unit_*.py"]
