
feat(lambda): cross-Lambda installation token cache via DynamoDB #5132

Open
vegardx wants to merge 2 commits into github-aws-runners:main from vegardx:feat/installation-token-cache-dynamodb

@vegardx vegardx commented May 26, 2026

Problem

Every Lambda invocation (scale-up, pool) mints a fresh GitHub App installation access token via POST /app/installations/{id}/access_tokens. Tokens are valid for 60 minutes, but the module discards them after each invocation — there is no cross-invocation caching.

Under burst load this produces thousands of redundant token-mint calls per minute. Users report hitting rate limits with as few as 10-50 concurrent runners (#3199), and the problem becomes severe at scale (#5037). The token-mint endpoint is subject to both primary rate limits and secondary (abuse) rate limits, which manifest as 403s or delayed 404s.

At 10 runner configs × batch_size 10, a burst of 100 workflow jobs produces ~100 token mints in seconds — all for the same token.

Relevant GitHub API rate limits

| Endpoint | Primary limit | Secondary limit | Notes |
| --- | --- | --- | --- |
| POST /app/installations/{id}/access_tokens | 5,000 req/hour (shared JWT budget) | 900 points/min, 100 concurrent | This is what the cache targets |
| POST /orgs/{org}/actions/runners/registration-token | 10,000 req/hour (actions_runner_registration bucket) | 900 points/min, 100 concurrent | JIT config calls; unaffected by this PR |

The installation access token endpoint shares the App's 5,000 req/hour JWT-authenticated budget with all other App-level calls (listing installations, getting repo info, etc.). Under burst load, 100+ concurrent token mints can also trigger the secondary rate limit (100 concurrent requests max), resulting in 403s or 502s before the hourly budget is even exhausted.

With the cache: ~1 mint/hour regardless of concurrency. The entire hourly budget is preserved for actual API work.

Solution

A DynamoDB table that caches installation tokens across all Lambda invocations. One token mint per ~50 minutes (refresh-ahead), shared by all concurrent Lambdas.

Why this should be default-on (no feature flag)

  1. Zero risk of behavioral change — the cached token has identical scope to a freshly-minted one (full installation scope, no repositoryIds narrowing)
  2. Graceful degradation — if DDB is unreachable, the code falls through to direct mint (same as today)
  3. Effectively free — PAY_PER_REQUEST DynamoDB at ~1 write/hour + a few reads/minute costs < $0.01/month
  4. The alternative, multiple GitHub Apps ("Support for multiple GitHub Apps to overcome API rate limits at scale", #5037), is operationally complex: it requires splitting orgs, managing multiple app installations, and adding routing logic
  5. Every deployment benefits — even small deployments avoid unnecessary API calls; large deployments avoid rate limit failures
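
The graceful-degradation point (item 2) can be pictured as a thin wrapper around the cache read. This is a minimal sketch, not the code in token-cache.ts: the names `getTokenWithFallback`, `readFromCache`, and `mintDirect` are hypothetical, and it assumes any cache error is treated as a reason to mint directly.

```typescript
// Hypothetical shape of a token; field names are illustrative only.
type Token = { token: string; expiresAtMs: number };

// Try the shared cache first; if the cache layer throws (for example
// because DynamoDB is unreachable), fall through to a direct mint,
// which is the pre-cache behaviour.
async function getTokenWithFallback(
  readFromCache: () => Promise<Token>,
  mintDirect: () => Promise<Token>,
): Promise<Token> {
  try {
    return await readFromCache();
  } catch {
    // A real implementation would likely log the error before falling back.
    return mintDirect();
  }
}
```

With this shape, a cache outage costs one extra GitHub call per invocation, which matches the behavior before the cache existed.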

How it works

sequenceDiagram
    participant A as Lambda A (scale-up)
    participant DDB as DynamoDB
    participant GH as GitHub API

    Note over A,GH: Case A: Fresh cache hit
    A->>DDB: GetItem(installation_id)
    DDB-->>A: token (expires in 30min)
    Note right of A: Return cached token

    Note over A,GH: Case B: Refresh-ahead (token expiring soon)
    participant B as Lambda B (scale-up)
    participant C as Lambda C (concurrent)

    B->>DDB: GetItem(installation_id)
    DDB-->>B: token (expires in 5min)
    B->>DDB: UpdateItem (acquire lock)
    DDB-->>B: lock acquired ✓
    B->>GH: POST /access_tokens
    GH-->>B: new token + expiresAt
    B->>DDB: PutItem (store token, clear lock)

    C->>DDB: GetItem(installation_id)
    DDB-->>C: token (still valid, 5min left)
    Note right of C: Return cached token (no mint needed)

    Note over A,GH: Case C: Cold miss
    A->>DDB: GetItem(installation_id)
    DDB-->>A: ∅ (no item)
    A->>DDB: UpdateItem (acquire lock)
    DDB-->>A: lock acquired ✓
    A->>GH: POST /access_tokens
    GH-->>A: token + expiresAt
    A->>DDB: PutItem (store token)

Three cases:

  • A. Fresh hit (>10min to expiry): return cached, no GitHub call
  • B. Refresh-ahead (<10min to expiry): one Lambda wins lock, mints, others return still-valid cached token
  • C. Cold miss: one Lambda wins lock, mints; others wait briefly then read from cache

On mint failure the lock expires naturally after 60s, which caps retry storms.
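
The three cases can be expressed as a small pure function over the cached item and the current time. This is a sketch under stated assumptions: `REFRESH_AHEAD_MS` mirrors the 10-minute window described above, while `classify`, `CachedToken`, and `CacheDecision` are hypothetical names, and treating an already-expired token as a cold miss is an assumption rather than something the PR states.

```typescript
// Assumed to match the 10-minute refresh-ahead window described in the PR.
const REFRESH_AHEAD_MS = 10 * 60 * 1000;

type CacheDecision = 'fresh-hit' | 'refresh-ahead' | 'cold-miss';

interface CachedToken {
  token: string;
  expiresAtMs: number; // token expiry, epoch milliseconds
}

// Classify a cache read into case A, B, or C.
function classify(item: CachedToken | undefined, nowMs: number): CacheDecision {
  // Case C: nothing usable in the cache (assumption: expired counts as missing).
  if (item === undefined || item.expiresAtMs <= nowMs) return 'cold-miss';
  // Case B: still valid, but inside the refresh-ahead window.
  if (item.expiresAtMs - nowMs < REFRESH_AHEAD_MS) return 'refresh-ahead';
  // Case A: comfortably valid, no GitHub call needed.
  return 'fresh-hit';
}
```

In cases B and C only the Lambda that wins the lock mints; the classification itself needs no coordination.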

Changes

Lambda (TypeScript)

  • lambdas/functions/control-plane/src/github/token-cache.ts — cache module with locking
  • lambdas/functions/control-plane/src/github/token-cache.test.ts — 8 tests covering all paths
  • lambdas/functions/control-plane/src/github/auth.ts — integration: route through cache when INSTALLATION_TOKEN_TABLE_NAME is set
  • lambdas/functions/control-plane/package.json — add @aws-sdk/client-dynamodb

Terraform

  • token-cache.tf (root module) — DynamoDB table for single-runner deployments
  • modules/multi-runner/token-cache.tf — shared table for multi-runner deployments
  • modules/runners/variables.tf — new installation_token_table_name / _arn variables
  • modules/runners/scale-up.tf — env var + IAM policy for scale-up Lambda
  • modules/runners/pool.tf + modules/runners/pool/main.tf — same for pool Lambda

DynamoDB schema

| Attribute | Type | Purpose |
| --- | --- | --- |
| installation_id | N (hash key) | GitHub App installation ID |
| token | S | Cached access token |
| expires_at_ms | N | Token expiry (epoch ms) |
| lock_until_ms | N | Mint-in-progress lock expiry |
| ttl | N | DynamoDB TTL (epoch seconds) |
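
To make the schema concrete, the sketch below maps a token record onto DynamoDB's low-level attribute format (`{ N: "..." }` for numbers, `{ S: "..." }` for strings). The helper `toItem` and the `TokenRecord` type are hypothetical, and deriving `ttl` by rounding the token expiry up to whole seconds is an assumption; the PR only states that `ttl` is in epoch seconds.

```typescript
// Minimal stand-in for the DynamoDB attribute-value shapes used here.
type AttributeValue = { N: string } | { S: string };

interface TokenRecord {
  installationId: number;
  token: string;
  expiresAtMs: number; // epoch milliseconds
}

// Build the item that would be stored for a freshly minted token.
function toItem(record: TokenRecord): Record<string, AttributeValue> {
  return {
    installation_id: { N: String(record.installationId) },
    token: { S: record.token },
    expires_at_ms: { N: String(record.expiresAtMs) },
    // Assumption: TTL is the token expiry converted from ms to whole seconds.
    ttl: { N: String(Math.ceil(record.expiresAtMs / 1000)) },
  };
}
```

Note the unit difference: `expires_at_ms` and `lock_until_ms` are milliseconds, while DynamoDB TTL expects seconds.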

Impact

| Metric | Before | After |
| --- | --- | --- |
| Token mints per hour | N (= total Lambda invocations) | ~1 per installation |
| GitHub API calls during burst | 100s-1000s of redundant mints | 1 mint + reads from DDB |
| Cost of cache infrastructure | N/A | ~$0/month (PAY_PER_REQUEST) |
| Failure mode if DDB is down | N/A | Falls through to direct mint |

Refs: #5037, #3199, #4710

Add a DynamoDB-backed cache for GitHub App installation access tokens.
Previously every Lambda invocation minted a fresh token via POST
/app/installations/{id}/access_tokens — under burst load this produces
thousands of redundant token-mint calls per minute, triggering rate
limits and secondary rate limit responses from GitHub.

The cache provides:
- Shared token across all concurrent Lambda invocations (scale-up, pool)
- Refresh-ahead at T-10min with conditional-write locking (single-flight)
- Graceful degradation: DDB read failures fall through to direct mint
- Lock TTL backoff: on mint failure the lock expires naturally (60s),
  capping retry storms against a struggling upstream

DynamoDB table:
- PAY_PER_REQUEST billing (~$0 at this access pattern)
- TTL-enabled for automatic cleanup
- One table shared across all runner configs (multi-runner)

The table is always created (no feature flag). The env var
INSTALLATION_TOKEN_TABLE_NAME is always set. The cache is transparent:
same token scope, same semantics, just fewer API calls.

Refs: github-aws-runners#5037, github-aws-runners#3199, github-aws-runners#4710
Copilot AI review requested due to automatic review settings May 26, 2026 20:43
@vegardx vegardx requested review from a team as code owners May 26, 2026 20:43
Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a DynamoDB-backed cache for GitHub App installation tokens to reduce repeated token minting across Lambda invocations, including Terraform resources/IAM wiring and Node dependencies/tests.

Changes:

  • Introduces a DynamoDB table for caching installation tokens (with TTL + SSE).
  • Wires table name/ARN into runner Lambdas via env vars and adds IAM permissions.
  • Adds a control-plane token cache implementation + Vitest coverage and DynamoDB SDK dependency.

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| token-cache.tf | Creates a DynamoDB table for installation token caching in the root stack. |
| main.tf | Passes installation token table outputs into the runners module. |
| modules/multi-runner/token-cache.tf | Creates a DynamoDB table for token caching inside the multi-runner module. |
| modules/multi-runner/runners.tf | Passes token table name/ARN into the runners submodule. |
| modules/runners/variables.tf | Adds required inputs for the token cache table name/ARN. |
| modules/runners/scale-up.tf | Exposes table name to the scale-up Lambda and grants DynamoDB access. |
| modules/runners/pool.tf | Propagates token cache table name/ARN into the pool submodule config. |
| modules/runners/pool/main.tf | Exposes table name to the pool Lambda and grants DynamoDB access. |
| lambdas/functions/control-plane/package.json | Adds @aws-sdk/client-dynamodb dependency. |
| lambdas/yarn.lock | Locks new DynamoDB client and transitive AWS SDK dependencies. |
| lambdas/functions/control-plane/src/github/token-cache.ts | Implements DynamoDB-backed cache + locking/single-flight for token minting. |
| lambdas/functions/control-plane/src/github/token-cache.test.ts | Adds unit tests for cache hit/refresh-ahead/cold-miss flows. |
| lambdas/functions/control-plane/src/github/auth.ts | Integrates token cache into installation token auth creation. |

Comment on lines +817 to +823
  description = "Name of the DynamoDB table used to cache GitHub App installation tokens across Lambda invocations."
  type        = string
}

variable "installation_token_table_arn" {
  description = "ARN of the DynamoDB table used to cache GitHub App installation tokens."
  type        = string
Author

Intentionally required. The table is always created by the parent module (root token-cache.tf or modules/multi-runner/token-cache.tf) and always passed down — there is no opt-out path by design.

The argument for always-on:

  • Zero behavioral risk (same token scope, graceful fallback to direct mint on DDB failure)
  • Effectively free (~$0/month PAY_PER_REQUEST)
  • Reduces GitHub API calls from N-per-invocation to ~1/hour for every deployment

Making it optional adds configuration surface that users could accidentally leave disabled, and it gains nothing, since the cost is effectively zero and the behavior is transparent.

Comment on lines +141 to +157
resource "aws_iam_role_policy" "scale_up_token_cache" {
  name   = "token-cache-policy"
  role   = aws_iam_role.scale_up.name
  policy = data.aws_iam_policy_document.scale_up_token_cache.json
}

data "aws_iam_policy_document" "scale_up_token_cache" {
  statement {
    effect = "Allow"
    actions = [
      "dynamodb:GetItem",
      "dynamodb:PutItem",
      "dynamodb:UpdateItem",
    ]
    resources = [var.installation_token_table_arn]
  }
}
Author

Same reasoning as above — intentionally always-on. The table is always created by the parent module; this policy is always needed.

Comment on lines +110 to +126
resource "aws_iam_role_policy" "pool_token_cache" {
  name   = "token-cache-policy"
  role   = aws_iam_role.pool.name
  policy = data.aws_iam_policy_document.pool_token_cache.json
}

data "aws_iam_policy_document" "pool_token_cache" {
  statement {
    effect = "Allow"
    actions = [
      "dynamodb:GetItem",
      "dynamodb:PutItem",
      "dynamodb:UpdateItem",
    ]
    resources = [var.config.installation_token_table_arn]
  }
}
Author

Same as scale-up — always-on by design. See the discussion on the variables.tf comment.

Comment on lines +150 to +166
new UpdateItemCommand({
  TableName: tableName,
  Key: { installation_id: { N: String(installationId) } },
  UpdateExpression: 'SET lock_until_ms = :lockUntil',
  // Acquire if:
  //   1. No item exists, OR no lock, OR current lock expired
  //   AND
  //   2. No valid token exists (or token is within the refresh-ahead window)
  ConditionExpression:
    '(attribute_not_exists(installation_id) OR attribute_not_exists(lock_until_ms) OR lock_until_ms < :now)' +
    ' AND ' +
    '(attribute_not_exists(expires_at_ms) OR expires_at_ms < :refreshAt)',
  ExpressionAttributeValues: {
    ':lockUntil': { N: String(lockUntil) },
    ':now': { N: String(nowMs) },
    ':refreshAt': { N: String(nowMs + REFRESH_AHEAD_MS) },
  },
Author

Good catch. Fixed in 0fc3ddf — the UpdateItem now also sets ttl (epoch seconds, lock expiry + 10min buffer) so lock-only records are automatically cleaned up by DynamoDB TTL.
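
For readers following the thread, the condition expression and the TTL fix can be modelled in memory without DynamoDB. This is an illustrative sketch only: `canAcquireLock`, `lockAttributes`, `LOCK_MS`, and `LOCK_TTL_BUFFER_MS` are hypothetical names, the 60s lock and 10-minute buffer are taken from the PR text, and rounding the TTL up to whole seconds is an assumption.

```typescript
const REFRESH_AHEAD_MS = 10 * 60 * 1000; // refresh-ahead window from the PR
const LOCK_MS = 60 * 1000; // lock lifetime stated in the PR (60s)
const LOCK_TTL_BUFFER_MS = 10 * 60 * 1000; // "lock expiry + 10min buffer"

interface StoredItem {
  lockUntilMs?: number; // lock_until_ms
  expiresAtMs?: number; // expires_at_ms
}

// Mirrors the ConditionExpression: the lock must be free or expired,
// AND there must be no token that is valid beyond the refresh-ahead window.
function canAcquireLock(item: StoredItem | undefined, nowMs: number): boolean {
  const lockFree =
    item === undefined || item.lockUntilMs === undefined || item.lockUntilMs < nowMs;
  const needsMint =
    item === undefined ||
    item.expiresAtMs === undefined ||
    item.expiresAtMs < nowMs + REFRESH_AHEAD_MS;
  return lockFree && needsMint;
}

// Values the lock write would set, including the TTL (epoch seconds) that
// lets DynamoDB clean up a lock-only record if the holder crashes.
function lockAttributes(nowMs: number): { lockUntilMs: number; ttlSeconds: number } {
  const lockUntilMs = nowMs + LOCK_MS;
  const ttlSeconds = Math.ceil((lockUntilMs + LOCK_TTL_BUFFER_MS) / 1000);
  return { lockUntilMs, ttlSeconds };
}
```

In the real table the check and the write happen atomically inside a single conditional UpdateItem; the in-memory version only illustrates the predicate.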

Comment on lines +119 to +137
let getCalls = 0;
mockSend.mockImplementation(async (cmd: unknown) => {
  if (cmd instanceof GetItemCommand) {
    getCalls++;
    if (getCalls === 1) return { Item: undefined };
    return freshTokenItem(Date.now() + 60 * 60 * 1000);
  }
  if (cmd instanceof UpdateItemCommand) {
    throw new ConditionalCheckFailedException({ $metadata: {}, message: 'lock taken' });
  }
  throw new Error('unexpected: ' + (cmd as { constructor: { name: string } }).constructor.name);
});
const mint = vi.fn();

const result = await getCachedInstallationToken(installationId, mint);

expect(result.token).toBe('cached-token-abc');
expect(mint).not.toHaveBeenCalled();
expect(getCalls).toBe(2);
Author

Acknowledged. The max jitter is 1s and it only fires in the cold-miss-lost-lock test path. Fake timers in vitest interact poorly with async DynamoDB mock timing and would make the test significantly harder to read for marginal speed gain. Keeping as-is.

Comment thread token-cache.tf
Comment on lines +1 to +27
resource "aws_dynamodb_table" "installation_tokens" {
  name         = "${var.prefix}-installation-tokens"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "installation_id"

  attribute {
    name = "installation_id"
    type = "N"
  }

  ttl {
    attribute_name = "ttl"
    enabled        = true
  }

  point_in_time_recovery {
    enabled = false
  }

  server_side_encryption {
    enabled = true
  }

  tags = merge(local.tags, {
    Name = "${var.prefix}-installation-tokens"
  })
}
Author

The two files serve different entry points: the root module (single-runner) and modules/multi-runner/. They can't reference each other. Extracting a shared sub-module for 8 lines of HCL adds a directory, a variables.tf, an outputs.tf, and a source reference — more ceremony than the duplication it eliminates. If the table schema changes, both files are in this repo and covered by the same PR.

Comment on lines +119 to +123
new GetItemCommand({
  TableName: tableName,
  Key: { installation_id: { N: String(installationId) } },
  ConsistentRead: true,
}),
Author

ConsistentRead is intentional and required for correctness. The lock protocol depends on reading the current lock state — an eventually-consistent read could return a stale "no lock" and grant two concurrent mints, defeating the single-flight guarantee.

The cost difference at this access pattern (single-digit reads/minute) is fractions of a cent per month. Not worth the correctness risk.

When UpdateItem creates a lock-only record (no token stored yet), it now
also sets the ttl attribute so DynamoDB auto-deletes it if the holder
crashes and never writes a full token entry.