fix: stagger estate push to stop CI thundering-herd #147
Merged
Phase 1 of sync-all-parallel.exs pushed every owned repo via a single Task.async_stream with no pacing, so a full-estate sync fired thousands of GitHub Actions runs near-simultaneously and saturated the account hosted-runner concurrency cap (~34% of estate CI left transiently queued).

Process repos in batches (default 25, --batch-size / SYNC_BATCH_SIZE) with a paced pause between batches (default 45s + 0-5s jitter, --batch-pause / SYNC_BATCH_PAUSE_SEC). Intra-batch concurrency and per-repo error handling are unchanged; every repo is still processed. Pacing is skipped on --dry-run and --no-throttle.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Root cause
scripts/sync-all-parallel.exs Phase 1 (phase1_parallel) fetches/pulls/pushes every owned repo through a single Task.async_stream at --concurrency (default 32), with no pacing. When the whole estate (~355 repos) is synced in one tight window, each git push immediately triggers that repo's GitHub Actions workflows. The resulting near-simultaneous burst of thousands of workflow runs saturates the account-wide hosted-runner concurrency cap, so a large fraction of estate CI (root-caused at ~34%) sits transiently queued: a classic CI thundering herd.

Nothing about what is pushed is wrong; the problem is purely the dispatch timing (all pushes effectively at once).
Change — staggered batch pacing
Least-invasive fix that fits the existing BEAM design: keep Task.async_stream and its intra-batch --concurrency parallelism unchanged, but process repos in batches with a paced pause between batches, so CI trigger waves are spread over time instead of arriving all at once.
- phase1_parallel now splits repos into batches with chunk_every and runs each batch via the unchanged run_batch/4 (same concurrency, same per-repo error handling).
- Between batches it pauses for batch_pause_sec plus a small random jitter (0–5s) to de-correlate repeated runs.
- Every repo is still processed: a crash/timeout in one repo still maps to an error result and never aborts the batch or the run. The run is idempotent (re-running pushes nothing new).
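
The batching described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the module name PacedSync, the option keys, the 60-second per-repo timeout, and the three-argument run_batch helper are all assumptions made for the example (the real run_batch/4 signature is not shown in this PR description).

```elixir
defmodule PacedSync do
  # Hypothetical sketch: runs sync_fun over repos in paced batches and
  # returns a list of {repo, result} tuples, one per repo.
  def run(repos, sync_fun, opts \\ []) do
    batch_size = Keyword.get(opts, :batch_size, 25)
    pause_ms = Keyword.get(opts, :batch_pause_sec, 45) * 1_000
    max_jitter_ms = Keyword.get(opts, :max_jitter_ms, 5_000)
    concurrency = Keyword.get(opts, :concurrency, 32)
    throttle? = Keyword.get(opts, :throttle, true)

    batches = Enum.chunk_every(repos, batch_size)
    last = length(batches) - 1

    batches
    |> Enum.with_index()
    |> Enum.flat_map(fn {batch, index} ->
      results = run_batch(batch, sync_fun, concurrency)

      # Pause only between batches (assumed: no pause after the last one).
      if throttle? and index < last do
        Process.sleep(pause_ms + jitter(max_jitter_ms))
      end

      results
    end)
  end

  # Uniform random jitter in 0..max_ms milliseconds.
  defp jitter(max_ms), do: :rand.uniform(max_ms + 1) - 1

  defp run_batch(batch, sync_fun, concurrency) do
    # Catch failures inside the task so one repo cannot take down the run.
    safe = fn repo ->
      try do
        sync_fun.(repo)
      rescue
        e -> {:error, Exception.message(e)}
      catch
        kind, reason -> {:error, {kind, reason}}
      end
    end

    stream =
      Task.async_stream(batch, safe,
        max_concurrency: concurrency,
        timeout: 60_000,
        on_timeout: :kill_task
      )

    # Results arrive in input order, so they can be zipped back to repos.
    batch
    |> Enum.zip(stream)
    |> Enum.map(fn
      {repo, {:ok, result}} -> {repo, result}
      {repo, {:exit, reason}} -> {repo, {:error, reason}}
    end)
  end
end
```

Wrapping the per-repo function in try/rescue/catch is one way to turn a crash into an error result; with on_timeout: :kill_task, a timed-out task surfaces as {:exit, :timeout} instead of exiting the caller.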
Defaults & tuning
--batch-size N (env SYNC_BATCH_SIZE, default 25)
--batch-pause SEC (env SYNC_BATCH_PAUSE_SEC, default 45)
--no-throttle

25 repos / ~45s caps the workflow-trigger rate well under the account runner cap while still syncing the full estate in a bounded time (~355 repos ≈ 14 batches ≈ 10–11 min of added pacing). Raise --batch-pause or lower --batch-size if the queue still backs up; raise the batch size once the runner cap is increased. CLI flags override env vars.
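
The added-pacing figure can be checked with a small calculation. The sketch below is illustrative only: the PacingEstimate module is hypothetical, and it assumes one pause per gap between batches (none after the final batch) with jitter of 0 to 5 seconds per pause.

```elixir
defmodule PacingEstimate do
  # Number of batches needed for `repos` items at `batch_size` per batch
  # (integer ceiling division).
  def batches(repos, batch_size), do: div(repos + batch_size - 1, batch_size)

  # {min, max} added seconds, assuming one pause per gap between batches,
  # each pause being pause_sec plus 0..max_jitter_sec of jitter.
  def added_seconds(repos, batch_size, pause_sec, max_jitter_sec \\ 5) do
    gaps = max(batches(repos, batch_size) - 1, 0)
    {gaps * pause_sec, gaps * (pause_sec + max_jitter_sec)}
  end
end

IO.inspect(PacingEstimate.batches(355, 25))            # => 15
IO.inspect(PacingEstimate.added_seconds(355, 25, 45))  # => {630, 700}
```

Under these assumptions, exactly 355 repos gives 15 batches separated by 14 pauses, or 630 to 700 seconds (about 10.5 to 11.7 minutes) of added pacing, which is in line with the estimate above.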
Throttling is automatically skipped under --dry-run (no pushes occur, so there is nothing to pace) and via explicit --no-throttle (legacy single-stream behaviour).
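
One way the flag/env precedence and the throttle-skip rule could be resolved is sketched below. The script's actual option parsing is not shown in this PR, so the ThrottleOpts module, its use of OptionParser, and the modelling of --no-throttle as a negated boolean throttle switch are assumptions for illustration.

```elixir
defmodule ThrottleOpts do
  # Hypothetical switch set; `--no-throttle` is parsed by OptionParser as
  # the negation of the boolean `throttle` switch.
  @switches [
    batch_size: :integer,
    batch_pause: :integer,
    throttle: :boolean,
    dry_run: :boolean
  ]

  def resolve(argv, env \\ System.get_env()) do
    {cli, _rest, _invalid} = OptionParser.parse(argv, strict: @switches)

    %{
      # CLI flag wins, then the env var, then the built-in default.
      batch_size: cli[:batch_size] || env_int(env, "SYNC_BATCH_SIZE") || 25,
      batch_pause_sec: cli[:batch_pause] || env_int(env, "SYNC_BATCH_PAUSE_SEC") || 45,
      # Pacing is off for dry runs and when --no-throttle is given.
      throttle?: Keyword.get(cli, :throttle, true) and not Keyword.get(cli, :dry_run, false)
    }
  end

  # Returns the integer value of an env var, or nil if unset or malformed.
  defp env_int(env, name) do
    case Integer.parse(Map.get(env, name, "")) do
      {n, ""} -> n
      _ -> nil
    end
  end
end
```

Passing the environment as a plain map keeps the resolution logic easy to exercise without touching real environment variables.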
Pairing
This addresses the producer side (spreading the push/trigger burst). It pairs with the template concurrency: PR, which addresses the consumer side (per-repo workflow auto-cancellation / serialization). Both are needed to fully eliminate the queued-CI saturation.
🤖 Generated with Claude Code