devstats

A zero-dependency Python CLI that scans your local git repositories and produces daily engineering activity statistics as CSV or JSON.

Point it at a folder full of repos, give it an author email, and get a clean breakdown of what happened each day: lines added, deleted, files touched, directories changed, and a complexity score. No GitHub API, no tokens, no browser automation. Just git log and git show under the hood.


Why

You have 20+ repos cloned locally. You want to know:

  • How many lines did a developer ship per day this quarter?
  • Which days were inactive (excluding weekends)?
  • How spread out were the changes across the codebase?

Existing tools either require GitHub API tokens, scrape browser UIs, or need a database. This one reads your local .git history directly and writes a CSV.


Quick start

git clone https://github.com/thestuntcoder/git-developer-activity.git
cd git-developer-activity

# Scan all repos under ~/Sites for one author, last 30 days
python3 -m devstats scan ~/Sites \
    --author-email you@company.com \
    --last 30d \
    --format csv \
    --output stats.csv

Requirements: Python 3.9+ and git on your PATH. No pip dependencies. Nothing to install.


What you get

$ python3 -m devstats scan ~/Sites --author-email dev@example.com --last 30d

date,number_of_commits,lines_added,lines_deleted,net_lines,total_churn,files_changed,directories_touched,complexity_score
2026-02-08,3,411,28,383,439,41,5,21.58
2026-02-09,1,4,2,2,6,2,2,1.98
2026-02-14,5,166,82,84,248,8,5,6.53
2026-02-24,8,260,42,218,302,10,3,7.1
2026-03-05,18,1787,1822,-35,3609,52,7,27.67
2026-03-06,17,3655,604,3051,4259,84,4,41.52
...

Summary: 117 commits across 18 active day(s) (2026-02-08 → 2026-03-07)
  Lines added:       14,232
  Lines deleted:      3,213
  Total churn:       17,445
  Net lines:         11,019

The CSV goes to stdout (or a file with -o). The summary goes to stderr. Pipe-friendly.

Unless --by-repo is set, days with no activity are emitted as normal rows with zeros, so the output is a continuous date series.
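Because the output is plain CSV with a fixed header, it can be consumed with the standard library alone. The following is a minimal sketch, not part of devstats: `load_rows` and `active_days` are hypothetical helper names, and the sample text reuses rows from the output above. It assumes output without --by-repo, since a repo_name column would not parse as an integer.

```python
import csv
import io

# Sample rows copied from the output above; in practice this text would come
# from stats.csv or from the tool's stdout.
SAMPLE = """\
date,number_of_commits,lines_added,lines_deleted,net_lines,total_churn,files_changed,directories_touched,complexity_score
2026-02-08,3,411,28,383,439,41,5,21.58
2026-02-09,1,4,2,2,6,2,2,1.98
2026-02-14,5,166,82,84,248,8,5,6.53
"""


def load_rows(text):
    """Parse devstats CSV text into dicts with numeric fields (hypothetical helper)."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = {"date": raw["date"], "complexity_score": float(raw["complexity_score"])}
        for key, value in raw.items():
            if key not in row:
                row[key] = int(value)  # every remaining column is an integer count
        rows.append(row)
    return rows


def active_days(rows):
    """Return the dates with at least one commit, dropping zero-filled days."""
    return [r["date"] for r in rows if r["number_of_commits"] > 0]


rows = load_rows(SAMPLE)
print(sum(r["lines_added"] for r in rows))  # 581
print(active_days(rows))
```

Filtering on number_of_commits is one way to recover the sparse view when the zero-filled rows are not wanted.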


Usage

python3 -m devstats {scan,repos} [options]

Two modes

| Command | What it does |
| --- | --- |
| scan <directory> | Auto-discovers all git repos in the immediate subdirectories of <directory> |
| repos <path> [path ...] | Uses one or more explicit repo paths |

Examples

# All repos under ~/code, last year, as CSV
python3 -m devstats scan ~/code --author-email dev@company.com --last 1y -o year.csv

# Specific repos, JSON output
python3 -m devstats repos ~/code/api ~/code/frontend \
    --author-email dev@company.com \
    --since 2026-01-01 --until 2026-04-01 \
    --format json

# Filter by author name instead of email
python3 -m devstats scan ~/code --author-name "Jane Doe" --last 6m

# Multiple email identities (same person, different emails)
python3 -m devstats scan ~/code \
    --author-email dev@company.com \
    --author-email dev@personal.com \
    --last 6m

# Combine name and email (OR logic — matches either)
python3 -m devstats scan ~/code \
    --author-name "Jane Doe" \
    --author-email jane@other-company.com \
    --last 6m

# Regex match on author name or email
python3 -m devstats scan ~/code --author-regex "Jane|jane@" --last 3m

# Per-repo breakdown
python3 -m devstats scan ~/code --author-email dev@company.com --last 30d --by-repo

# Only Python and JS files
python3 -m devstats scan ~/code --author-email dev@company.com --last 30d --extensions py,js

# Exclude test fixtures
python3 -m devstats scan ~/code --author-email dev@company.com --exclude-path "fixtures/*"

# Skip merge commits
python3 -m devstats scan ~/code --author-email dev@company.com --last 30d --exclude-merges

# Different timezone
python3 -m devstats scan ~/code --author-email dev@company.com --timezone America/New_York

# Include lock files in stats
python3 -m devstats scan ~/code --author-email dev@company.com --include-generated

# Verbose mode — see which repos are found
python3 -m devstats scan ~/code --author-email dev@company.com --last 7d -v

All options

Author filters (at least one required)

| Flag | Description |
| --- | --- |
| --author-email EMAIL | Exact email match. Repeat for multiple identities. |
| --author-name NAME | Exact author name match. Repeat for multiple identities. |
| --author-regex PATTERN | Regex matched against Author Name <email> |

When multiple --author-email and/or --author-name flags are given, git treats them as OR — a commit matches if any one of them hits.

Date range

| Flag | Description |
| --- | --- |
| --since DATE | Start date (inclusive), e.g. 2026-01-01 |
| --until DATE | End date (exclusive), e.g. 2026-04-01 |
| --last SPAN | Shorthand: 7d, 4w, 6m, 1y. Overrides --since. |
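To illustrate how a span such as 7d or 4w could be turned into a start timestamp, here is a rough sketch. It is not the code in devstats/cli.py: `parse_span` and `since_from_last` are hypothetical names, and the month and year lengths (30 and 365 days) are assumptions, since this README does not say how devstats maps them.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed day counts per unit. The 30-day month and 365-day year are
# illustrative guesses, not values taken from devstats.
UNIT_DAYS = {"d": 1, "w": 7, "m": 30, "y": 365}


def parse_span(span):
    """Turn a span such as '7d' or '4w' into a timedelta (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)([dwmy])", span.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised span: {span!r}")
    count, unit = int(match.group(1)), match.group(2)
    return timedelta(days=count * UNIT_DAYS[unit])


def since_from_last(span, now=None):
    """Return a full ISO timestamp counted back from now, suitable for --since."""
    now = now or datetime.now(timezone.utc)
    return (now - parse_span(span)).isoformat()


fixed_now = datetime(2026, 3, 7, 12, 0, tzinfo=timezone.utc)
print(since_from_last("30d", now=fixed_now))  # 2026-02-05T12:00:00+00:00
```

Passing a full timestamp rather than a bare date sidesteps git's habit of applying the current time of day to date-only --since values.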

Output

| Flag | Description |
| --- | --- |
| --format {csv,json} | Output format. Default: csv |
| -o FILE / --output FILE | Write to file instead of stdout |
| --by-repo | Add repo_name column, one row per day per repo |

Behaviour

| Flag | Description |
| --- | --- |
| --timezone TZ | Timezone for day boundaries. Default: UTC. Accepts IANA names (America/New_York), offsets (+05:30, UTC+9). |
| --exclude-merges | Skip merge commits |
| --include-generated | Include lock files and minified assets |
| --extensions py,js,ts | Only count files with these extensions |
| --exclude-path GLOB | Exclude matching paths (repeatable) |
| -v / --verbose | Show which repos are processed (to stderr) |
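The two kinds of --timezone value (IANA names and fixed offsets) can be resolved with the standard library. This is a sketch under stated assumptions, not the tool's actual parser: `parse_timezone` and `OFFSET_RE` are hypothetical names, and the accepted offset spellings are inferred from the examples in the table.

```python
import re
from datetime import timedelta, timezone
from zoneinfo import ZoneInfo

# Matches +05:30, -0800, UTC+9 and similar; an assumed grammar for offsets.
OFFSET_RE = re.compile(r"(?:UTC)?([+-])(\d{1,2})(?::?(\d{2}))?")


def parse_timezone(value):
    """Resolve a --timezone value to a tzinfo object (hypothetical helper)."""
    if value.upper() == "UTC":
        return timezone.utc
    match = OFFSET_RE.fullmatch(value)
    if match:
        sign = 1 if match.group(1) == "+" else -1
        hours = int(match.group(2))
        minutes = int(match.group(3) or 0)
        return timezone(sign * timedelta(hours=hours, minutes=minutes))
    # Anything else is treated as an IANA name; ZoneInfo raises if it is unknown.
    return ZoneInfo(value)


print(parse_timezone("+05:30").utcoffset(None))  # 5:30:00
print(parse_timezone("UTC+9").utcoffset(None))   # 9:00:00
```

A name such as America/New_York goes through ZoneInfo, which needs time zone data on the system, while fixed offsets need nothing beyond datetime.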

Output columns

| Column | Description |
| --- | --- |
| date | YYYY-MM-DD in the configured timezone |
| number_of_commits | Commits on that day |
| lines_added | Lines added (excluding binary and filtered files) |
| lines_deleted | Lines deleted |
| net_lines | lines_added − lines_deleted |
| total_churn | lines_added + lines_deleted |
| files_changed | Unique file paths changed that day |
| directories_touched | Unique top-level directories changed (see below) |
| complexity_score | Heuristic score (see below) |

With --by-repo, a repo_name column is inserted after date.


How it works

  1. Discovery: scan walks the given directory one level deep, running git rev-parse --is-inside-work-tree on each subdirectory. It warns if a repo is a shallow clone.

  2. Enumeration: Runs git log --all --author=<filter> --since=<date> --until=<date> to get matching commit SHAs.

  3. Detail extraction: For each commit, runs git show --numstat --format=... -M <sha> to get per-file added/deleted counts. The -M flag detects renames.

  4. Filtering: Removes binary files, files in skipped directories, generated/lock files, and anything excluded by --exclude-path or --extensions.

  5. Aggregation: Groups by calendar date using the author date (not committer date) converted to the configured timezone.

  6. Export: Writes CSV or JSON with stable column ordering.
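The detail-extraction step depends on parsing numstat lines, which are tab-separated as added, deleted, path. The sketch below shows one way to do that, including the rename notation that -M produces. It is an illustration rather than the contents of devstats/numstat.py; `parse_numstat` and `resolve_rename` are hypothetical names, and a file whose real name contains " => " would be misread.

```python
import re

# Matches the compact rename form, e.g. src/{old => new}/util.py
BRACE_RENAME = re.compile(r"\{(.*?) => (.*?)\}")


def resolve_rename(path):
    """Return the post-rename path for git's 'old => new' numstat notation."""
    if "{" in path:
        resolved = BRACE_RENAME.sub(lambda m: m.group(2), path)
        # An empty side, as in src/{old => }/f.py, leaves a doubled slash.
        return resolved.replace("//", "/")
    if " => " in path:
        return path.split(" => ", 1)[1]
    return path


def parse_numstat(output):
    """Parse numstat text into (added, deleted, path) tuples, skipping binaries."""
    entries = []
    for line in output.splitlines():
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # blank lines or commit header lines
        added, deleted, path = parts
        if added == "-" and deleted == "-":
            continue  # git marks binary files this way
        if not (added.isdigit() and deleted.isdigit()):
            continue  # not a numstat row
        entries.append((int(added), int(deleted), resolve_rename(path)))
    return entries


sample = (
    "12\t3\tsrc/a.py\n"
    "-\t-\tlogo.png\n"
    "0\t0\tsrc/{old => new}/util.py\n"
    "4\t1\tREADME.md => docs/README.md\n"
)
for entry in parse_numstat(sample):
    print(entry)
```

Counting a renamed file under its new path is a choice made for this sketch; the README does not state which path devstats attributes renames to.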

All git interaction happens via subprocess calling the git CLI. No GitPython, no pygit2, no API calls.


Directories touched

Counts unique top-level directories — the first path segment of each changed file. Files in the repo root count as ".".

Example: changes to src/a.py, src/b.py, lib/c.py, and README.md → 3 directories: src, lib, .
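The rule above is small enough to express directly. A minimal sketch, with `top_level_dir` and `directories_touched` as hypothetical function names and paths assumed to use forward slashes as git reports them:

```python
def top_level_dir(path):
    """Return the first path segment, or '.' for files in the repo root."""
    head, sep, _ = path.partition("/")
    return head if sep else "."


def directories_touched(paths):
    """Return the set of unique top-level directories for the given paths."""
    return {top_level_dir(p) for p in paths}


paths = ["src/a.py", "src/b.py", "lib/c.py", "README.md"]
print(sorted(directories_touched(paths)))  # ['.', 'lib', 'src']
```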


Complexity score

complexity_score = 0.35 × log(1 + total_churn)
                 + 0.45 × files_changed
                 + 0.20 × directories_touched

Here log is the natural logarithm, and the result is rounded to 2 decimal places. The idea:

  • log(churn) — rewards volume of work but dampens massive refactors or generated code
  • files_changed — breadth of changes, correlates with review difficulty
  • directories_touched — cross-cutting scope, context-switching cost

The weights live in devstats/complexity.py:WEIGHTS — change them if you want.
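As a worked check, the formula can be reproduced in a few lines and compared with the sample output under "What you get". This is a sketch: the dictionary layout of `WEIGHTS` here is an assumption and may not match devstats/complexity.py, and the use of the natural logarithm is inferred from the fact that it reproduces the sample rows.

```python
import math

# Weights as stated in the formula above; the key names are assumed.
WEIGHTS = {"churn": 0.35, "files": 0.45, "dirs": 0.20}


def complexity_score(total_churn, files_changed, directories_touched, weights=WEIGHTS):
    """Compute the heuristic score for one day, rounded to 2 decimal places."""
    score = (
        weights["churn"] * math.log1p(total_churn)  # log(1 + churn), natural log
        + weights["files"] * files_changed
        + weights["dirs"] * directories_touched
    )
    return round(score, 2)


# Rows taken from the sample output: (total_churn, files_changed, directories_touched)
print(complexity_score(439, 41, 5))  # 21.58
print(complexity_score(6, 2, 2))     # 1.98
```

For the 2026-02-08 row, the three terms are roughly 2.13, 18.45, and 1.00, so files_changed dominates the score on days that touch many files.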


Continuous daily output

When --by-repo is not set, the tool emits a dense day-by-day timeline. If a date has no commits, it is still included with zero values (number_of_commits=0, lines_added=0, etc.).

Range used for filling missing days:

  • --since / --until if provided (--until is exclusive)
  • if only --since is provided, fill through today
  • if no date flags are provided, fill between first and last active day
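The zero-filling described above could look roughly like this. It is a sketch, not the code in devstats/export.py: `zero_row` and `fill_missing_days` are hypothetical names, and the end of the range is treated as exclusive to match --until.

```python
from datetime import date, timedelta

COUNT_FIELDS = (
    "number_of_commits", "lines_added", "lines_deleted", "net_lines",
    "total_churn", "files_changed", "directories_touched",
)


def zero_row(day):
    """Build an all-zero row for a day with no commits."""
    row = {"date": day.isoformat()}
    row.update({field: 0 for field in COUNT_FIELDS})
    row["complexity_score"] = 0.0
    return row


def fill_missing_days(rows_by_date, start, end_exclusive):
    """Return one row per day in [start, end_exclusive), zero-filling gaps."""
    filled = []
    day = start
    while day < end_exclusive:
        filled.append(rows_by_date.get(day.isoformat()) or zero_row(day))
        day += timedelta(days=1)
    return filled


# Two active days with a gap between them (values are illustrative).
active = {
    "2026-02-08": dict(zero_row(date(2026, 2, 8)), number_of_commits=3),
    "2026-02-10": dict(zero_row(date(2026, 2, 10)), number_of_commits=1),
}
timeline = fill_missing_days(active, date(2026, 2, 8), date(2026, 2, 11))
print([(r["date"], r["number_of_commits"]) for r in timeline])
```

When no date flags are given, start and end would come from the first and last active day, with one day added to the end so the last active day is included.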

What gets filtered out

Skipped directories (always)

node_modules, vendor, dist, build, coverage, tmp, .next, __pycache__, .cache, .tox, .mypy_cache, .pytest_cache, venv, .venv, env

Generated / lock files (by default, override with --include-generated)

  • Exact filenames: package-lock.json, yarn.lock, pnpm-lock.yaml, Gemfile.lock, Pipfile.lock, poetry.lock, composer.lock, Cargo.lock, go.sum, flake.lock
  • Glob patterns: *.min.js, *.min.css, *.bundle.js, *.chunk.js, *.map, *.compiled.*, *.generated.*, *.pb.go, *_pb2.py, *.swagger.json
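The directory and generated-file rules above can be combined into a single predicate. The sketch below uses abbreviated copies of the lists for brevity, and it assumes that a skipped directory name matches at any depth of the path, which this README does not confirm. `is_counted` is a hypothetical name, not a function from devstats/filters.py.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Abbreviated copies of the lists above, for illustration only.
SKIP_DIRS = {"node_modules", "vendor", "dist", "build", "__pycache__", ".venv"}
LOCK_FILES = {"package-lock.json", "yarn.lock", "Cargo.lock", "go.sum"}
GENERATED_GLOBS = ("*.min.js", "*.min.css", "*.map", "*.pb.go", "*_pb2.py")


def is_counted(path, include_generated=False):
    """Return True if a changed file should contribute to the statistics."""
    parts = PurePosixPath(path).parts
    # Skipped directories apply regardless of --include-generated.
    if any(part in SKIP_DIRS for part in parts[:-1]):
        return False
    if include_generated:
        return True
    name = parts[-1]
    if name in LOCK_FILES:
        return False
    return not any(fnmatchcase(name, pattern) for pattern in GENERATED_GLOBS)


for candidate in ("src/app.py", "node_modules/x/index.js", "web/app.min.js", "yarn.lock"):
    print(candidate, is_counted(candidate))
```

fnmatchcase is used instead of fnmatch so that matching stays case-sensitive on every platform.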

Binary files (always)

Git reports binary files as -\t- in numstat. They are excluded from all line counts, file counts, and directory counts.


Author date vs committer date

Commits are grouped by author date — when the code was actually written — not committer date (when it was applied/merged). This means:

  • A commit authored at 23:30 UTC will land on the next calendar day if you use --timezone UTC+2
  • Rebased or cherry-picked commits retain the original author date
  • This is usually what you want for measuring "when did the person write this code"
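The 23:30 UTC case above can be reproduced with the standard library. This sketch assumes the author date arrives as a strict ISO 8601 string with an offset (the form git's %aI placeholder produces); whether devstats requests that exact format is not stated here, and `author_day` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def author_day(author_date_iso, tz):
    """Return the calendar day (YYYY-MM-DD) of an author date as seen in tz."""
    moment = datetime.fromisoformat(author_date_iso)
    return moment.astimezone(tz).date().isoformat()


stamp = "2026-03-05T23:30:00+00:00"
print(author_day(stamp, timezone.utc))                  # 2026-03-05
print(author_day(stamp, timezone(timedelta(hours=2))))  # 2026-03-06
```

The same commit lands on different rows depending on --timezone, which is why the date column is described as being in the configured timezone.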

Error handling

  • If a repo fails, processing continues. Failures are listed at the end.
  • Shallow clones trigger a warning (history may be incomplete).
  • The exit code is 0 if at least one repo was processed successfully.

Project structure

devstats/
├── cli.py           # Argument parsing, orchestration
├── discovery.py     # Find git repos, validate paths, detect shallow clones
├── commits.py       # git log / git show wrappers
├── numstat.py       # Parse --numstat output (additions, deletions, renames)
├── filters.py       # Skip dirs, generated files, extension whitelist
├── aggregation.py   # Group by day, compute stats
├── complexity.py    # Complexity score formula + weights
├── export.py        # CSV / JSON output, summary with inactive days
├── constants.py     # All default values in one place
tests/
├── test_numstat.py      # Numstat parsing, renames, binary detection
├── test_filters.py      # Skip dirs, generated files, classification
├── test_complexity.py   # Score formula, determinism, custom weights
├── test_aggregation.py  # Timezone handling, day grouping, merging
├── test_export.py       # CSV/JSON format, dense timeline, summaries
├── test_cli.py          # --last shorthand parsing
├── test_discovery.py    # Repo detection
├── test_integration.py  # End-to-end with a real temp git repo

Running tests

pip install pytest
python3 -m pytest -v

126 tests covering numstat parsing, filtering, complexity scoring, timezone-aware aggregation, inactive weekday calculation, CSV/JSON output, and full end-to-end integration with a real temporary git repo.


Caveats

  • Local repos only. No GitHub API, no tokens, no network calls.
  • git must be installed. The tool shells out to git log and git show.
  • Shallow clones may report incomplete history. The tool warns but continues.
  • Timezone handling uses Python's zoneinfo for IANA names (Python 3.9+). For fixed offsets like +05:30, no extra modules are needed. Historical DST transitions are not modelled when using fixed-offset notation.
  • Large repos are processed sequentially. For millions of commits, narrow the date range with --since/--until/--last.
  • --since with date-only values — git interprets --since=2026-03-01 using the current time-of-day as the cutoff (a git quirk). The --last flag avoids this by emitting full ISO timestamps internally.

License

MIT