feat(lambda): cross-Lambda installation token cache via DynamoDB by vegardx · Pull Request #5132 · github-aws-runners/terraform-aws-github-runner

vegardx · 2026-05-26T20:43:06Z

Problem

Every Lambda invocation (scale-up, pool) mints a fresh GitHub App installation access token via POST /app/installations/{id}/access_tokens. Tokens are valid for 60 minutes, but the module discards them after each invocation — there is no cross-invocation caching.

Under burst load this produces thousands of redundant token-mint calls per minute. Users report hitting rate limits as low as 10-50 concurrent runners (#3199), with the problem becoming severe at scale (#5037). The token-mint endpoint is subject to both primary rate limits and secondary (abuse) rate limits, which manifest as 403s or delayed 404s.

At 10 runner configs × batch_size 10, a burst of 100 workflow jobs produces ~100 token mints in seconds — all for the same token.

Relevant GitHub API rate limits

| Endpoint | Primary limit | Secondary limit | Notes |
| --- | --- | --- | --- |
| `POST /app/installations/{id}/access_tokens` | 5,000 req/hour (shared JWT budget) | 900 points/min, 100 concurrent | This is what the cache targets |
| `POST /orgs/{org}/actions/runners/registration-token` | 10,000 req/hour (`actions_runner_registration` bucket) | 900 points/min, 100 concurrent | JIT config calls; unaffected by this PR |

The installation access token endpoint shares the App's 5,000 req/hour JWT-authenticated budget with all other App-level calls (listing installations, getting repo info, etc.). Under burst load, 100+ concurrent token mints can also trigger the secondary rate limit (100 concurrent requests max), resulting in 403s or 502s before the hourly budget is even exhausted.

With the cache: ~1 mint/hour regardless of concurrency. The entire hourly budget is preserved for actual API work.

Solution

A DynamoDB table that caches installation tokens across all Lambda invocations. One token mint per ~50 minutes (refresh-ahead), shared by all concurrent Lambdas.
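As a rough check on the "one mint per ~50 minutes" figure, the arithmetic can be written out. This is an illustrative sketch only: the function names are hypothetical, the constants are the 60-minute lifetime and 10-minute refresh-ahead window quoted in this description, and it assumes steady traffic so that some invocation always arrives inside the refresh window.

```typescript
// Illustrative arithmetic; names are hypothetical, constants come from the PR text.
const TOKEN_LIFETIME_MIN = 60; // installation tokens are valid for 60 minutes
const REFRESH_AHEAD_MIN = 10; // refresh starts once under 10 minutes remain

/** Minutes between mints when one shared token is refreshed ahead of expiry. */
function mintIntervalMinutes(lifetimeMin: number, refreshAheadMin: number): number {
  return lifetimeMin - refreshAheadMin;
}

/** Approximate mints per hour per installation with the shared cache. */
function cachedMintsPerHour(lifetimeMin: number, refreshAheadMin: number): number {
  return 60 / mintIntervalMinutes(lifetimeMin, refreshAheadMin);
}

/** Without a cache, every Lambda invocation mints its own token. */
function uncachedMints(invocations: number): number {
  return invocations;
}

console.log(mintIntervalMinutes(TOKEN_LIFETIME_MIN, REFRESH_AHEAD_MIN)); // 50
console.log(cachedMintsPerHour(TOKEN_LIFETIME_MIN, REFRESH_AHEAD_MIN)); // 1.2
console.log(uncachedMints(100)); // 100
```

Under these assumptions the cached rate stays near one mint per hour per installation, independent of how many invocations share the token.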

Why this should be default-on (no feature flag)

- Zero risk of behavioral change: the cached token has identical scope to a freshly minted one (full installation scope, no `repositoryIds` narrowing).
- Graceful degradation: if DynamoDB is unreachable, the code falls through to a direct mint (same as today).
- Effectively free: PAY_PER_REQUEST DynamoDB at ~1 write/hour plus a few reads/minute costs < $0.01/month.
- The alternative (multiple GitHub Apps, #5037) is operationally complex: it requires splitting orgs, managing multiple app installations, and adding routing logic.
- Every deployment benefits: even small deployments avoid unnecessary API calls; large deployments avoid rate limit failures.
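The graceful-degradation point can be sketched as a small wrapper. This is a minimal sketch, not the code in `token-cache.ts`: `InstallationToken`, `getTokenWithFallback`, and the injected `readCache` and `mint` callbacks are hypothetical names, and locking is deliberately left out.

```typescript
// Hypothetical shapes for illustration; the PR's real module may differ.
interface InstallationToken {
  token: string;
  expiresAtMs: number;
}

type Mint = () => Promise<InstallationToken>;
type ReadCache = () => Promise<InstallationToken | undefined>;

/**
 * Read the cache first. If the read throws (for example because DynamoDB is
 * unreachable), degrade to minting directly, which is the pre-cache behavior.
 * A successful read that finds nothing also results in a mint.
 */
async function getTokenWithFallback(readCache: ReadCache, mint: Mint): Promise<InstallationToken> {
  let cached: InstallationToken | undefined;
  try {
    cached = await readCache();
  } catch (err) {
    console.warn('token cache read failed, minting directly:', err);
    return mint();
  }
  return cached ?? mint();
}
```

Only the cache read sits inside the `try` block, so a failing mint is not retried by the fallback path.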

How it works

```mermaid
sequenceDiagram
    participant A as Lambda A (scale-up)
    participant DDB as DynamoDB
    participant GH as GitHub API

    Note over A,GH: Case A: Fresh cache hit
    A->>DDB: GetItem(installation_id)
    DDB-->>A: token (expires in 30min)
    Note right of A: Return cached token

    Note over A,GH: Case B: Refresh-ahead (token expiring soon)
    participant B as Lambda B (scale-up)
    participant C as Lambda C (concurrent)

    B->>DDB: GetItem(installation_id)
    DDB-->>B: token (expires in 5min)
    B->>DDB: UpdateItem (acquire lock)
    DDB-->>B: lock acquired ✓
    B->>GH: POST /access_tokens
    GH-->>B: new token + expiresAt
    B->>DDB: PutItem (store token, clear lock)

    C->>DDB: GetItem(installation_id)
    DDB-->>C: token (still valid, 5min left)
    Note right of C: Return cached token (no mint needed)

    Note over A,GH: Case C: Cold miss
    A->>DDB: GetItem(installation_id)
    DDB-->>A: ∅ (no item)
    A->>DDB: UpdateItem (acquire lock)
    DDB-->>A: lock acquired ✓
    A->>GH: POST /access_tokens
    GH-->>A: token + expiresAt
    A->>DDB: PutItem (store token)
```

Three cases:

A. Fresh hit (>10min to expiry): return cached, no GitHub call
B. Refresh-ahead (<10min to expiry): one Lambda wins lock, mints, others return still-valid cached token
C. Cold miss: one Lambda wins lock, mints; others wait briefly then read from cache

On mint failure the lock expires naturally after 60s — caps retry storms.
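The three cases can be expressed as a pure classification over the cached item. The sketch below is an assumption-laden illustration rather than the PR's implementation: `CachedItem`, `CacheState`, and `classify` are hypothetical names, only `REFRESH_AHEAD_MS` appears in the diff, and treating an already expired token as a cold miss is this example's choice.

```typescript
const REFRESH_AHEAD_MS = 10 * 60 * 1000; // refresh once under 10 minutes remain

// Hypothetical in-memory view of a cached row.
interface CachedItem {
  token?: string;
  expiresAtMs?: number;
}

type CacheState = 'fresh-hit' | 'refresh-ahead' | 'cold-miss';

/** Decide which of the three cases applies at time nowMs (epoch milliseconds). */
function classify(item: CachedItem | undefined, nowMs: number): CacheState {
  // No row, a lock-only row, or an expired token: nothing usable is cached.
  if (!item?.token || item.expiresAtMs === undefined || item.expiresAtMs <= nowMs) {
    return 'cold-miss';
  }
  const remainingMs = item.expiresAtMs - nowMs;
  // Mirrors the lock condition expires_at_ms < now + REFRESH_AHEAD_MS.
  return remainingMs >= REFRESH_AHEAD_MS ? 'fresh-hit' : 'refresh-ahead';
}

const MIN = 60 * 1000;
console.log(classify({ token: 't', expiresAtMs: 30 * MIN }, 0)); // fresh-hit
console.log(classify({ token: 't', expiresAtMs: 5 * MIN }, 0)); // refresh-ahead
console.log(classify(undefined, 0)); // cold-miss
```

In the refresh-ahead case a caller that loses the lock can still return the cached token, since it remains valid until `expiresAtMs`.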

Changes

Lambda (TypeScript)

- `lambdas/functions/control-plane/src/github/token-cache.ts`: cache module with locking
- `lambdas/functions/control-plane/src/github/token-cache.test.ts`: 8 tests covering all paths
- `lambdas/functions/control-plane/src/github/auth.ts`: integration; routes through the cache when `INSTALLATION_TOKEN_TABLE_NAME` is set
- `lambdas/functions/control-plane/package.json`: adds `@aws-sdk/client-dynamodb`

Terraform

- `token-cache.tf` (root module): DynamoDB table for single-runner deployments
- `modules/multi-runner/token-cache.tf`: shared table for multi-runner deployments
- `modules/runners/variables.tf`: new `installation_token_table_name` / `installation_token_table_arn` variables
- `modules/runners/scale-up.tf`: env var + IAM policy for the scale-up Lambda
- `modules/runners/pool.tf` + `modules/runners/pool/main.tf`: same for the pool Lambda

DynamoDB schema

| Attribute | Type | Purpose |
| --- | --- | --- |
| `installation_id` | N (hash key) | GitHub App installation ID |
| `token` | S | Cached access token |
| `expires_at_ms` | N | Token expiry (epoch ms) |
| `lock_until_ms` | N | Mint-in-progress lock expiry |
| `ttl` | N | DynamoDB TTL (epoch seconds) |
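To make the schema concrete, here is a sketch of how a row might be typed and converted to DynamoDB's low-level attribute format, where numbers are sent as strings under `N`. `TokenCacheItem` and `toDynamoItem` are hypothetical names, and the local `AttributeValue` type is a simplified stand-in for the one exported by `@aws-sdk/client-dynamodb`.

```typescript
// Hypothetical typed view of one row; field names map to the attributes above.
interface TokenCacheItem {
  installationId: number;
  token: string;
  expiresAtMs: number; // epoch milliseconds
  lockUntilMs?: number; // present only while a mint is in progress
  ttl: number; // epoch seconds, consumed by DynamoDB TTL
}

// Simplified stand-in for the SDK's AttributeValue (only N and S are needed here).
type AttributeValue = { N: string } | { S: string };

/** Convert a typed item into the attribute map expected by the low-level API. */
function toDynamoItem(item: TokenCacheItem): Record<string, AttributeValue> {
  const out: Record<string, AttributeValue> = {
    installation_id: { N: String(item.installationId) },
    token: { S: item.token },
    expires_at_ms: { N: String(item.expiresAtMs) },
    ttl: { N: String(item.ttl) },
  };
  if (item.lockUntilMs !== undefined) {
    out.lock_until_ms = { N: String(item.lockUntilMs) };
  }
  return out;
}

console.log(toDynamoItem({ installationId: 42, token: 'example-token', expiresAtMs: 1700003600000, ttl: 1700004200 }));
```

Note the unit difference: `expires_at_ms` and `lock_until_ms` are in milliseconds, while `ttl` is in seconds because DynamoDB TTL expects epoch seconds.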

Impact

| Metric | Before | After |
| --- | --- | --- |
| Token mints per hour | N (= total Lambda invocations) | ~1 per installation |
| GitHub API calls during burst | 100s-1000s of redundant mints | 1 mint + reads from DDB |
| Cost of cache infrastructure | N/A | ~$0/month (PAY_PER_REQUEST) |
| Failure mode if DDB is down | N/A | Falls through to direct mint |

Refs: #5037, #3199, #4710

Add a DynamoDB-backed cache for GitHub App installation access tokens. Previously every Lambda invocation minted a fresh token via POST /app/installations/{id}/access_tokens; under burst load this produces thousands of redundant token-mint calls per minute, triggering rate limits and secondary rate limit responses from GitHub.

The cache provides:

- Shared token across all concurrent Lambda invocations (scale-up, pool)
- Refresh-ahead at T-10min with conditional-write locking (single-flight)
- Graceful degradation: DDB read failures fall through to direct mint
- Lock TTL backoff: on mint failure the lock expires naturally (60s), capping retry storms against a struggling upstream

DynamoDB table:

- PAY_PER_REQUEST billing (~$0 at this access pattern)
- TTL-enabled for automatic cleanup
- One table shared across all runner configs (multi-runner)

The table is always created (no feature flag). The env var INSTALLATION_TOKEN_TABLE_NAME is always set. The cache is transparent: same token scope, same semantics, just fewer API calls.

Refs: github-aws-runners#5037, github-aws-runners#3199, github-aws-runners#4710

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a DynamoDB-backed cache for GitHub App installation tokens to reduce repeated token minting across Lambda invocations, including Terraform resources/IAM wiring and Node dependencies/tests.

Changes:

- Introduces a DynamoDB table for caching installation tokens (with TTL + SSE).
- Wires table name/ARN into runner Lambdas via env vars and adds IAM permissions.
- Adds a control-plane token cache implementation + Vitest coverage and DynamoDB SDK dependency.

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| token-cache.tf | Creates a DynamoDB table for installation token caching in the root stack. |
| main.tf | Passes installation token table outputs into the runners module. |
| modules/multi-runner/token-cache.tf | Creates a DynamoDB table for token caching inside the multi-runner module. |
| modules/multi-runner/runners.tf | Passes token table name/ARN into the runners submodule. |
| modules/runners/variables.tf | Adds required inputs for the token cache table name/ARN. |
| modules/runners/scale-up.tf | Exposes table name to the scale-up Lambda and grants DynamoDB access. |
| modules/runners/pool.tf | Propagates token cache table name/ARN into the pool submodule config. |
| modules/runners/pool/main.tf | Exposes table name to the pool Lambda and grants DynamoDB access. |
| lambdas/functions/control-plane/package.json | Adds `@aws-sdk/client-dynamodb` dependency. |
| lambdas/yarn.lock | Locks new DynamoDB client and transitive AWS SDK dependencies. |
| lambdas/functions/control-plane/src/github/token-cache.ts | Implements DynamoDB-backed cache + locking/single-flight for token minting. |
| lambdas/functions/control-plane/src/github/token-cache.test.ts | Adds unit tests for cache hit/refresh-ahead/cold-miss flows. |
| lambdas/functions/control-plane/src/github/auth.ts | Integrates token cache into installation token auth creation. |

vegardx · 2026-05-26T21:01:10Z

+  description = "Name of the DynamoDB table used to cache GitHub App installation tokens across Lambda invocations."
+  type        = string
+}
+
+variable "installation_token_table_arn" {
+  description = "ARN of the DynamoDB table used to cache GitHub App installation tokens."
+  type        = string


Intentionally required. The table is always created by the parent module (root token-cache.tf or modules/multi-runner/token-cache.tf) and always passed down — there is no opt-out path by design.

The argument for always-on:

- Zero behavioral risk (same token scope, graceful fallback to direct mint on DDB failure)
- Effectively free (~$0/month PAY_PER_REQUEST)
- Reduces GitHub API calls from N-per-invocation to ~1/hour for every deployment

Making it optional adds configuration surface for users to accidentally leave off, gaining nothing since the cost is zero and the behavior is transparent.

vegardx · 2026-05-26T21:01:16Z

+resource "aws_iam_role_policy" "scale_up_token_cache" {
+  name   = "token-cache-policy"
+  role   = aws_iam_role.scale_up.name
+  policy = data.aws_iam_policy_document.scale_up_token_cache.json
+}
+
+data "aws_iam_policy_document" "scale_up_token_cache" {
+  statement {
+    effect = "Allow"
+    actions = [
+      "dynamodb:GetItem",
+      "dynamodb:PutItem",
+      "dynamodb:UpdateItem",
+    ]
+    resources = [var.installation_token_table_arn]
+  }
+}


Same reasoning as above — intentionally always-on. The table is always created by the parent module; this policy is always needed.

vegardx · 2026-05-26T21:01:21Z

+resource "aws_iam_role_policy" "pool_token_cache" {
+  name   = "token-cache-policy"
+  role   = aws_iam_role.pool.name
+  policy = data.aws_iam_policy_document.pool_token_cache.json
+}
+
+data "aws_iam_policy_document" "pool_token_cache" {
+  statement {
+    effect = "Allow"
+    actions = [
+      "dynamodb:GetItem",
+      "dynamodb:PutItem",
+      "dynamodb:UpdateItem",
+    ]
+    resources = [var.config.installation_token_table_arn]
+  }
+}


Same as scale-up — always-on by design. See the discussion on the variables.tf comment.

vegardx · 2026-05-26T21:01:00Z

+      new UpdateItemCommand({
+        TableName: tableName,
+        Key: { installation_id: { N: String(installationId) } },
+        UpdateExpression: 'SET lock_until_ms = :lockUntil',
+        // Acquire if:
+        //   1. No item exists, OR no lock, OR current lock expired
+        //   AND
+        //   2. No valid token exists (or token is within the refresh-ahead window)
+        ConditionExpression:
+          '(attribute_not_exists(installation_id) OR attribute_not_exists(lock_until_ms) OR lock_until_ms < :now)' +
+          ' AND ' +
+          '(attribute_not_exists(expires_at_ms) OR expires_at_ms < :refreshAt)',
+        ExpressionAttributeValues: {
+          ':lockUntil': { N: String(lockUntil) },
+          ':now': { N: String(nowMs) },
+          ':refreshAt': { N: String(nowMs + REFRESH_AHEAD_MS) },
+        },


Good catch. Fixed in 0fc3ddf — the UpdateItem now also sets ttl (epoch seconds, lock expiry + 10min buffer) so lock-only records are automatically cleaned up by DynamoDB TTL.
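For illustration, the lock request with the added `ttl` might be assembled as below. This is a sketch under stated assumptions, not the contents of 0fc3ddf: `buildAcquireLockInput`, `LOCK_TTL_MS`, and `LOCK_TTL_BUFFER_S` are hypothetical names, the 60s lock and 10-minute buffer come from this thread, and `ttl` is aliased as `#ttl` as a precaution against reserved-word conflicts. The returned object has the shape accepted by `UpdateItemCommand`.

```typescript
const LOCK_TTL_MS = 60 * 1000; // lock expires naturally after 60s
const REFRESH_AHEAD_MS = 10 * 60 * 1000; // refresh-ahead window
const LOCK_TTL_BUFFER_S = 10 * 60; // hypothetical name for the 10-minute TTL buffer

/** Build the input for the conditional UpdateItem that acquires the mint lock. */
function buildAcquireLockInput(tableName: string, installationId: number, nowMs: number) {
  const lockUntil = nowMs + LOCK_TTL_MS;
  // DynamoDB TTL expects epoch seconds: lock expiry plus the buffer.
  const ttlSeconds = Math.floor(lockUntil / 1000) + LOCK_TTL_BUFFER_S;
  return {
    TableName: tableName,
    Key: { installation_id: { N: String(installationId) } },
    UpdateExpression: 'SET lock_until_ms = :lockUntil, #ttl = :ttl',
    // Same condition as in the diff: the lock is free AND a token is needed.
    ConditionExpression:
      '(attribute_not_exists(installation_id) OR attribute_not_exists(lock_until_ms) OR lock_until_ms < :now)' +
      ' AND ' +
      '(attribute_not_exists(expires_at_ms) OR expires_at_ms < :refreshAt)',
    ExpressionAttributeNames: { '#ttl': 'ttl' },
    ExpressionAttributeValues: {
      ':lockUntil': { N: String(lockUntil) },
      ':now': { N: String(nowMs) },
      ':refreshAt': { N: String(nowMs + REFRESH_AHEAD_MS) },
      ':ttl': { N: String(ttlSeconds) },
    },
  };
}

console.log(buildAcquireLockInput('example-installation-tokens', 42, 1700000000000));
```

With this shape, a lock-only row left behind by a crashed holder becomes eligible for TTL deletion roughly 11 minutes after the lock was taken; DynamoDB performs TTL deletes asynchronously, so the row may linger somewhat longer.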

vegardx · 2026-05-26T21:01:28Z

+    let getCalls = 0;
+    mockSend.mockImplementation(async (cmd: unknown) => {
+      if (cmd instanceof GetItemCommand) {
+        getCalls++;
+        if (getCalls === 1) return { Item: undefined };
+        return freshTokenItem(Date.now() + 60 * 60 * 1000);
+      }
+      if (cmd instanceof UpdateItemCommand) {
+        throw new ConditionalCheckFailedException({ $metadata: {}, message: 'lock taken' });
+      }
+      throw new Error('unexpected: ' + (cmd as { constructor: { name: string } }).constructor.name);
+    });
+    const mint = vi.fn();
+
+    const result = await getCachedInstallationToken(installationId, mint);
+
+    expect(result.token).toBe('cached-token-abc');
+    expect(mint).not.toHaveBeenCalled();
+    expect(getCalls).toBe(2);


Acknowledged. The max jitter is 1s and it only fires in the cold-miss-lost-lock test path. Fake timers in vitest interact poorly with async DynamoDB mock timing and would make the test significantly harder to read for marginal speed gain. Keeping as-is.

vegardx · 2026-05-26T21:01:35Z

+resource "aws_dynamodb_table" "installation_tokens" {
+  name         = "${var.prefix}-installation-tokens"
+  billing_mode = "PAY_PER_REQUEST"
+  hash_key     = "installation_id"
+
+  attribute {
+    name = "installation_id"
+    type = "N"
+  }
+
+  ttl {
+    attribute_name = "ttl"
+    enabled        = true
+  }
+
+  point_in_time_recovery {
+    enabled = false
+  }
+
+  server_side_encryption {
+    enabled = true
+  }
+
+  tags = merge(local.tags, {
+    Name = "${var.prefix}-installation-tokens"
+  })
+}


The two files serve different entry points: the root module (single-runner) and modules/multi-runner/. They can't reference each other. Extracting a shared sub-module for 8 lines of HCL adds a directory, a variables.tf, an outputs.tf, and a source reference — more ceremony than the duplication it eliminates. If the table schema changes, both files are in this repo and covered by the same PR.

vegardx · 2026-05-26T21:01:43Z

+      new GetItemCommand({
+        TableName: tableName,
+        Key: { installation_id: { N: String(installationId) } },
+        ConsistentRead: true,
+      }),


ConsistentRead is intentional and required for correctness. The lock protocol depends on reading the current lock state — an eventually-consistent read could return a stale "no lock" and grant two concurrent mints, defeating the single-flight guarantee.

The cost difference at this access pattern (single-digit reads/minute) is fractions of a cent per month. Not worth the correctness risk.
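The single-flight behavior discussed here can be modelled in memory to see which contender wins. The sketch below is a single-threaded approximation with hypothetical names (`Row`, `canAcquireLock`, `tryAcquire`); in DynamoDB the condition is evaluated atomically as part of the conditional write, and a failed condition surfaces as `ConditionalCheckFailedException`.

```typescript
const LOCK_TTL_MS = 60 * 1000;
const REFRESH_AHEAD_MS = 10 * 60 * 1000;

// Hypothetical in-memory row using the table's attribute names.
interface Row {
  expires_at_ms?: number;
  lock_until_ms?: number;
}

/** In-memory equivalent of the ConditionExpression used to acquire the lock. */
function canAcquireLock(row: Row | undefined, nowMs: number): boolean {
  const lockFree = row === undefined || row.lock_until_ms === undefined || row.lock_until_ms < nowMs;
  const needsToken =
    row === undefined || row.expires_at_ms === undefined || row.expires_at_ms < nowMs + REFRESH_AHEAD_MS;
  return lockFree && needsToken;
}

/** Attempt the conditional update; false models a failed condition check. */
function tryAcquire(store: Map<number, Row>, installationId: number, nowMs: number): boolean {
  const row = store.get(installationId);
  if (!canAcquireLock(row, nowMs)) return false;
  store.set(installationId, { ...row, lock_until_ms: nowMs + LOCK_TTL_MS });
  return true;
}

const store = new Map<number, Row>();
console.log(tryAcquire(store, 1, 0)); // true: cold miss, first contender wins
console.log(tryAcquire(store, 1, 1000)); // false: lock is held until t = 60000
console.log(tryAcquire(store, 1, 61000)); // true: lock expired and no token was stored
```

The third call shows the 60s lock expiry acting as a backoff: if the first holder never stores a token, the next mint attempt is admitted only after the lock lapses.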

Copilot AI review requested due to automatic review settings May 26, 2026 20:43

vegardx requested review from a team as code owners May 26, 2026 20:43

Copilot AI reviewed May 26, 2026

fix: set TTL on lock-only DynamoDB items to prevent accumulation

0fc3ddf

When UpdateItem creates a lock-only record (no token stored yet), it now also sets the ttl attribute so DynamoDB auto-deletes it if the holder crashes and never writes a full token entry.

vegardx mentioned this pull request May 26, 2026

feat: KMS-backed JWT signing for GitHub App authentication (private key never leaves HSM) #5134

Open

vegardx wants to merge 2 commits into github-aws-runners:main from vegardx:feat/installation-token-cache-dynamodb
