fix: enable Octokit throttle retry and SSM adaptive retry under burst #5136

Draft
vegardx wants to merge 1 commit into github-aws-runners:main from vegardx:fix/throttle-retry-and-ssm-adaptive

@vegardx vegardx commented May 27, 2026

Problem

Two resilience gaps that compound under burst load:

  1. Octokit throttle plugin never retries: the onRateLimit and onSecondaryRateLimit callbacks log a warning but don't return true, so @octokit/plugin-throttling throws immediately instead of retrying after the retryAfter delay.

  2. SSM client uses default retry (standard, 3 attempts, ~3s budget) — under burst with multiple concurrent Lambdas writing JIT configs via PutParameter, SSM's ~40 TPS limit is easily exceeded and ThrottlingException propagates up, failing the entire batch.

Fix

Octokit throttle callbacks return true with caps

onRateLimit: (retryAfter, options) => {
  logger.warn(`...retrying after ${retryAfter}s`);
  return options.request.retryCount < 2; // retry up to 2x on primary rate limit
},
onSecondaryRateLimit: (retryAfter, options) => {
  logger.warn(`...retrying after ${retryAfter}s`);
  return options.request.retryCount < 1; // retry once on secondary/abuse limit
},
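
To make the caps easier to review and unit test, the decision logic can be pulled out of the Octokit setup. The following is a minimal sketch, not the code in this PR: `makeThrottleHandlers`, `RetryCaps` and `ThrottleOptions` are hypothetical names, and the options shape only models the `options.request.retryCount` field the callbacks above read.

```typescript
// Hypothetical helper for illustration; not part of this PR or of Octokit.
interface ThrottleOptions {
  method: string;
  url: string;
  request: { retryCount: number };
}

type ThrottleCallback = (retryAfter: number, options: ThrottleOptions) => boolean;

interface RetryCaps {
  primary: number; // max retries on the primary rate limit
  secondary: number; // max retries on the secondary/abuse rate limit
}

function makeThrottleHandlers(
  caps: RetryCaps,
  warn: (msg: string) => void = console.warn,
): { onRateLimit: ThrottleCallback; onSecondaryRateLimit: ThrottleCallback } {
  // Builds one callback: returning true asks for a retry, false gives up.
  const build = (kind: string, cap: number): ThrottleCallback => (retryAfter, options) => {
    const retry = options.request.retryCount < cap;
    const action = retry ? `retrying after ${retryAfter}s` : 'giving up';
    warn(`${kind} rate limit on ${options.method} ${options.url}: ${action}`);
    return retry;
  };
  return {
    onRateLimit: build('primary', caps.primary),
    onSecondaryRateLimit: build('secondary', caps.secondary),
  };
}

// Same caps as in this PR: 2 retries on primary, 1 on secondary.
const handlers = makeThrottleHandlers({ primary: 2, secondary: 1 }, () => {});
```

With these caps, a request that hits the primary limit is attempted at most three times in total, and one that hits the secondary limit at most twice.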

SSM adaptive retry with maxAttempts=10

const SSM_CLIENT_CONFIG = {
  region: process.env.AWS_REGION,
  maxAttempts: 10,
  retryMode: 'adaptive' as const,
};

adaptive mode adds client-side rate-sensing — when the SDK sees ThrottlingException it slows further calls to match the observed budget. 10 attempts gives ~30s of retry budget, which is safe because runners take ~30-50s to boot before reading their JIT config.
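
As a rough intuition for what "client-side rate-sensing" means, here is a toy limiter that halves its send rate on a throttle and recovers slowly on success (additive increase, multiplicative decrease). This is an illustrative sketch only and is not the AWS SDK's adaptive retry algorithm; `AdaptiveRateLimiter` and its methods are hypothetical names, and the starting rate of 40 is just the ~40 TPS figure quoted above.

```typescript
// Illustrative AIMD limiter; NOT the AWS SDK's actual adaptive algorithm.
class AdaptiveRateLimiter {
  private rate: number; // permitted calls per second
  private readonly minRate: number;
  private readonly maxRate: number;

  constructor(initialRate: number, minRate = 1) {
    this.rate = initialRate;
    this.minRate = minRate;
    this.maxRate = initialRate;
  }

  // A throttling response halves the send rate, but never below minRate.
  onThrottle(): void {
    this.rate = Math.max(this.minRate, this.rate / 2);
  }

  // A success raises the rate by one call per second, up to the initial rate.
  onSuccess(): void {
    this.rate = Math.min(this.maxRate, this.rate + 1);
  }

  // Minimum spacing between calls at the current rate, in milliseconds.
  delayBetweenCallsMs(): number {
    return 1000 / this.rate;
  }

  get currentRate(): number {
    return this.rate;
  }
}

const limiter = new AdaptiveRateLimiter(40);
limiter.onThrottle(); // 40 -> 20 calls/s
limiter.onThrottle(); // 20 -> 10 calls/s
limiter.onSuccess(); // 10 -> 11 calls/s
```

The point of the sketch is that throttling feedback changes the pacing of later calls, not just the delay before the next retry of the failed one.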

Changes

  • lambdas/functions/control-plane/src/github/auth.ts — throttle callbacks return true with retry caps
  • lambdas/libs/aws-ssm-util/src/index.ts — shared SSM client config with adaptive retry

Impact

| Scenario | Before | After |
| --- | --- | --- |
| GitHub rate limit during scale-up | Request fails immediately | Retries 1-2x with backoff |
| SSM throttle during JIT config write | ThrottlingException after ~3s | Adaptive backoff for ~30s |
| Burst of 100 jobs, batch_size=10 | SSM throttle → orphaned instances | Retry absorbs throttle |
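
The effect of raising the attempt count can be shown with a toy simulation. This is a sketch under stated assumptions: no real SSM call is made, backoff sleeps are left out so the code stays synchronous, and `makeThrottledWriter`, `callWithRetry` and `outcome` are hypothetical names.

```typescript
// Toy simulation only; no AWS SDK involved and all names are hypothetical.
class ThrottlingError extends Error {
  constructor() {
    super('Rate exceeded');
    this.name = 'ThrottlingException';
  }
}

// Returns a fake writer that is throttled for its first `failures` calls.
function makeThrottledWriter(failures: number): () => string {
  let calls = 0;
  return () => {
    calls += 1;
    if (calls <= failures) throw new ThrottlingError();
    return 'written';
  };
}

// Calls `op` up to `maxAttempts` times, retrying only on throttling errors.
// A real client would sleep with backoff between attempts.
function callWithRetry<T>(op: () => T, maxAttempts: number): { result: T; attempts: number } {
  for (let attempt = 1; ; attempt += 1) {
    try {
      return { result: op(), attempts: attempt };
    } catch (err) {
      const throttled = err instanceof Error && err.name === 'ThrottlingException';
      if (!throttled || attempt >= maxAttempts) throw err;
    }
  }
}

// Summarizes one simulated write as a human-readable string.
function outcome(maxAttempts: number, failures: number): string {
  try {
    const { attempts } = callWithRetry(makeThrottledWriter(failures), maxAttempts);
    return `succeeded after ${attempts} attempts`;
  } catch {
    return 'failed with ThrottlingException';
  }
}

console.log(outcome(3, 5)); // prints "failed with ThrottlingException"
console.log(outcome(10, 5)); // prints "succeeded after 6 attempts"
```

A write that is throttled five times in a row fails with 3 attempts but completes with 10, which is the difference the last table row describes.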

Fixes #5135
Refs: #5024, #5037

Two resilience improvements for burst load:

1. Octokit plugin-throttling callbacks now return true (with caps) so
   rate-limited GitHub API requests are retried instead of immediately
   failing. Primary: up to 2 retries. Secondary: 1 retry.

2. SSM client switched to adaptive retry mode with maxAttempts=10,
   giving ~30s of retry budget for PutParameter under throttling.
   Standard mode (3 attempts, ~3s) is insufficient when multiple
   concurrent Lambdas write JIT configs during burst.

Fixes github-aws-runners#5135

Development

Successfully merging this pull request may close these issues.

fix: Octokit throttle callbacks should retry, and SSM client should use adaptive retry

1 participant