fix: stagger estate push to stop CI thundering-herd #147

Merged
hyperpolymath merged 1 commit into main from fix/throttle-estate-push-herd
May 16, 2026

@hyperpolymath
Owner

Root cause

scripts/sync-all-parallel.exs Phase 1 (phase1_parallel) fetches, pulls, and pushes
every owned repo through a single Task.async_stream at --concurrency (default 32),
with no pacing. When the whole estate (~355 repos) is synced in one tight window,
each git push immediately triggers that repo's GitHub Actions workflows. The
resulting near-simultaneous burst of thousands of workflow runs saturates the
account-wide hosted-runner concurrency cap, so a large fraction of estate CI
(~34% in the root-cause analysis) sits transiently queued: a classic CI
thundering herd.

Nothing about what is pushed is wrong; the problem is purely the dispatch
timing (all pushes effectively at once).

Change — staggered batch pacing

Least-invasive fix that fits the existing BEAM design: keep Task.async_stream
and its intra-batch --concurrency parallelism unchanged, but process repos in
batches with a paced pause between batches, so CI trigger waves are
spread over time instead of arriving all at once.

  • phase1_parallel now splits repos into batches with chunk_every and runs each
    batch via the unchanged run_batch/4 (same concurrency, same per-repo error
    handling).
  • A pause is inserted after every batch except the last: batch_pause_sec
    plus a small random jitter (0–5s) to de-correlate repeated runs.
  • Correctness is preserved: every repo is still processed exactly as before;
    a crash or timeout in one repo still maps to an error result and never aborts
    the batch or the run; and the run stays idempotent (re-running pushes
    nothing new).
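
The pacing described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the module name PacedSync, the option names, the injectable sleep_fun, and the two-argument-plus-concurrency run_batch helper are all hypothetical (the real script's run_batch/4 has a different signature that is not shown here). It only demonstrates the shape of the change: chunk the list, keep Task.async_stream per batch, and pause between batches but not after the last.

```elixir
defmodule PacedSync do
  @moduledoc "Hypothetical sketch of staggered batch pacing; not the PR's code."

  # Runs `fun` over `items` in batches, pausing between batches only.
  # All option names and defaults here are illustrative assumptions.
  def run(items, fun, opts \\ []) do
    batch_size = Keyword.get(opts, :batch_size, 25)
    concurrency = Keyword.get(opts, :concurrency, 32)
    pause_ms = Keyword.get(opts, :pause_ms, 45_000)
    jitter_ms = Keyword.get(opts, :jitter_ms, 5_000)
    sleep_fun = Keyword.get(opts, :sleep_fun, &Process.sleep/1)

    batches = Enum.chunk_every(items, batch_size)
    last_index = length(batches) - 1

    batches
    |> Enum.with_index()
    |> Enum.flat_map(fn {batch, index} ->
      results = run_batch(batch, fun, concurrency)
      # No pause after the final batch: there is nothing left to stagger.
      if index < last_index, do: sleep_fun.(pause_ms + jitter(jitter_ms))
      results
    end)
  end

  # Intra-batch parallelism is untouched: one async_stream per batch.
  defp run_batch(batch, fun, concurrency) do
    batch
    |> Task.async_stream(&safe_call(fun, &1),
      max_concurrency: concurrency,
      timeout: 60_000,
      on_timeout: :kill_task
    )
    |> Enum.zip(batch)
    |> Enum.map(fn
      {{:ok, result}, _item} -> result
      # A timed-out task becomes an error result instead of aborting the run.
      {{:exit, reason}, item} -> {:error, item, reason}
    end)
  end

  # A raise, throw, or exit in one item maps to an error tuple for that item.
  defp safe_call(fun, item) do
    {:ok, item, fun.(item)}
  rescue
    error -> {:error, item, error}
  catch
    kind, reason -> {:error, item, {kind, reason}}
  end

  # Uniform random jitter in 0..max_ms (inclusive).
  defp jitter(0), do: 0
  defp jitter(max_ms), do: :rand.uniform(max_ms + 1) - 1
end
```

Passing a recording function as sleep_fun makes the pause schedule observable without actually sleeping, which is convenient when checking that exactly one pause is issued per batch boundary.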

Defaults & tuning

| Setting | Default | Flag | Env |
|---|---|---|---|
| Batch size | 25 | --batch-size N | SYNC_BATCH_SIZE |
| Inter-batch pause | 45s (+0–5s jitter) | --batch-pause SEC | SYNC_BATCH_PAUSE_SEC |
| Disable pacing | on | --no-throttle | |

25 repos / ~45s caps the workflow-trigger rate well under the account runner
cap while still syncing the full estate in a bounded time (~355 repos ≈ 15
batches, so 14 pauses ≈ 10.5–11.7 min of added pacing). Raise --batch-pause or
lower --batch-size if the queue still backs up; raise the batch size once the
runner cap is increased. CLI flags override env vars.
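
The added-time figure can be checked with a few lines of arithmetic. The sketch below is illustrative only (the module and function names are hypothetical and not part of the script); it assumes pauses occur between batches only and that jitter is at most 5s per pause, as described above.

```elixir
defmodule PacingEstimate do
  # Estimates the extra wall-clock time introduced by inter-batch pauses.
  # Batch count is a ceiling division; there is one pause per batch boundary.
  def added_seconds(repo_count, batch_size, pause_sec, max_jitter_sec) do
    batches = div(repo_count + batch_size - 1, batch_size)
    pauses = max(batches - 1, 0)

    %{
      batches: batches,
      pauses: pauses,
      min: pauses * pause_sec,
      max: pauses * (pause_sec + max_jitter_sec)
    }
  end
end

IO.inspect(PacingEstimate.added_seconds(355, 25, 45, 5))
```

For exactly 355 repos at the defaults this yields 15 batches and 14 pauses, i.e. between 630 and 700 seconds (about 10.5 to 11.7 minutes) of added pacing. Halving the batch size roughly doubles the number of pauses, which is the trade-off to weigh when tuning.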

Throttling is automatically skipped under --dry-run (no pushes occur, so
there is nothing to pace) and can be disabled explicitly with --no-throttle
(legacy single-stream behaviour).
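
One way the settings could be resolved is sketched below, using the standard OptionParser module. How the real script parses its arguments is not shown in this PR, so the module name PacingConfig, the result map keys, and the fallback behaviour for malformed env values are assumptions; only the flag names, env var names, and defaults come from the table above.

```elixir
defmodule PacingConfig do
  # Switch names mirror the documented flags; OptionParser maps
  # --batch-size to :batch_size and --no-throttle to throttle: false.
  @switches [
    batch_size: :integer,
    batch_pause: :integer,
    throttle: :boolean,
    dry_run: :boolean
  ]

  # `env` is injectable so the precedence rules can be exercised in isolation.
  def resolve(argv, env \\ System.get_env()) do
    {flags, _rest, _invalid} = OptionParser.parse(argv, strict: @switches)

    %{
      # CLI flag wins, then env var, then the built-in default.
      batch_size: flags[:batch_size] || env_int(env, "SYNC_BATCH_SIZE", 25),
      batch_pause_sec: flags[:batch_pause] || env_int(env, "SYNC_BATCH_PAUSE_SEC", 45),
      # Pacing is skipped on --dry-run and on --no-throttle.
      throttle?:
        Keyword.get(flags, :throttle, true) and not Keyword.get(flags, :dry_run, false)
    }
  end

  # Assumption: a missing or non-integer env value falls back to the default.
  defp env_int(env, name, default) do
    case Integer.parse(Map.get(env, name, "")) do
      {value, ""} -> value
      _ -> default
    end
  end
end
```

For example, `PacingConfig.resolve(["--batch-size", "10"], %{"SYNC_BATCH_SIZE" => "50"})` resolves the batch size to 10, because the flag takes precedence over the env var.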

Pairing

This addresses the producer side (spreading the push/trigger burst). It pairs
with the template concurrency: PR, which addresses the consumer side
(per-repo workflow auto-cancellation / serialization). Both are needed to fully
eliminate the queued-CI saturation.

🤖 Generated with Claude Code

Phase 1 of sync-all-parallel.exs pushed every owned repo via a single
Task.async_stream with no pacing, so a full-estate sync fired thousands
of GitHub Actions runs near-simultaneously and saturated the account
hosted-runner concurrency cap (~34% of estate CI left transiently queued).

Process repos in batches (default 25, --batch-size / SYNC_BATCH_SIZE)
with a paced pause between batches (default 45s + 0-5s jitter,
--batch-pause / SYNC_BATCH_PAUSE_SEC). Intra-batch concurrency and
per-repo error handling are unchanged; every repo is still processed.
Pacing is skipped on --dry-run and --no-throttle.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@hyperpolymath merged commit 4ea09a7 into main on May 16, 2026
21 of 26 checks passed
@hyperpolymath deleted the fix/throttle-estate-push-herd branch on May 16, 2026 14:51