feat(lambda): cross-Lambda installation token cache via DynamoDB #5132
vegardx wants to merge 2 commits
Add a DynamoDB-backed cache for GitHub App installation access tokens.
Previously every Lambda invocation minted a fresh token via POST
/app/installations/{id}/access_tokens — under burst load this produces
thousands of redundant token-mint calls per minute, triggering rate
limits and secondary rate limit responses from GitHub.
The cache provides:
- Shared token across all concurrent Lambda invocations (scale-up, pool)
- Refresh-ahead at T-10min with conditional-write locking (single-flight)
- Graceful degradation: DDB read failures fall through to direct mint
- Lock TTL backoff: on mint failure the lock expires naturally (60s),
capping retry storms against a struggling upstream
DynamoDB table:
- PAY_PER_REQUEST billing (~$0 at this access pattern)
- TTL-enabled for automatic cleanup
- One table shared across all runner configs (multi-runner)
The table is always created (no feature flag). The env var
INSTALLATION_TOKEN_TABLE_NAME is always set. The cache is transparent:
same token scope, same semantics, just fewer API calls.
Refs: github-aws-runners#5037, github-aws-runners#3199, github-aws-runners#4710
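The cache behaviour described in the commit message can be sketched as a pure decision function plus a fallback wrapper. This is a minimal illustration, not the code in token-cache.ts: `CachedToken`, `classify`, `getTokenWithFallback`, and the constant are hypothetical names and values taken from the description (refresh-ahead at T-10min, read failures fall through to a direct mint). Locking is left out here.

```typescript
// Hypothetical shape of a cached entry; expiresAtMs mirrors the expires_at_ms attribute.
interface CachedToken {
  token: string;
  expiresAtMs: number;
}

type CacheDecision = 'hit' | 'refresh-ahead' | 'miss';

// Assumed constant: refresh once fewer than 10 minutes of validity remain.
const REFRESH_AHEAD_MS = 10 * 60 * 1000;

// Classify a cached entry as fresh, inside the refresh-ahead window, or unusable.
function classify(item: CachedToken | undefined, nowMs: number): CacheDecision {
  if (!item || item.expiresAtMs <= nowMs) return 'miss';
  if (item.expiresAtMs - nowMs < REFRESH_AHEAD_MS) return 'refresh-ahead';
  return 'hit';
}

// Graceful degradation: a failed cache read behaves like the pre-cache code path.
// Locking is omitted, so both 'refresh-ahead' and 'miss' simply mint in this sketch.
async function getTokenWithFallback(
  readCache: () => Promise<CachedToken | undefined>,
  mint: () => Promise<CachedToken>,
  nowMs: number,
): Promise<CachedToken> {
  let item: CachedToken | undefined;
  try {
    item = await readCache();
  } catch {
    // Cache unavailable: mint directly rather than failing the invocation.
    return mint();
  }
  if (item && classify(item, nowMs) === 'hit') return item;
  return mint();
}
```

Keeping the classification separate from the I/O makes the three outcomes testable without mocking DynamoDB.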
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a DynamoDB-backed cache for GitHub App installation tokens to reduce repeated token minting across Lambda invocations, including Terraform resources/IAM wiring and Node dependencies/tests.
Changes:
- Introduces a DynamoDB table for caching installation tokens (with TTL + SSE).
- Wires table name/ARN into runner Lambdas via env vars and adds IAM permissions.
- Adds a control-plane token cache implementation + Vitest coverage and DynamoDB SDK dependency.
Reviewed changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| token-cache.tf | Creates a DynamoDB table for installation token caching in the root stack. |
| main.tf | Passes installation token table outputs into the runners module. |
| modules/multi-runner/token-cache.tf | Creates a DynamoDB table for token caching inside the multi-runner module. |
| modules/multi-runner/runners.tf | Passes token table name/ARN into the runners submodule. |
| modules/runners/variables.tf | Adds required inputs for the token cache table name/ARN. |
| modules/runners/scale-up.tf | Exposes table name to the scale-up Lambda and grants DynamoDB access. |
| modules/runners/pool.tf | Propagates token cache table name/ARN into the pool submodule config. |
| modules/runners/pool/main.tf | Exposes table name to the pool Lambda and grants DynamoDB access. |
| lambdas/functions/control-plane/package.json | Adds @aws-sdk/client-dynamodb dependency. |
| lambdas/yarn.lock | Locks new DynamoDB client and transitive AWS SDK dependencies. |
| lambdas/functions/control-plane/src/github/token-cache.ts | Implements DynamoDB-backed cache + locking/single-flight for token minting. |
| lambdas/functions/control-plane/src/github/token-cache.test.ts | Adds unit tests for cache hit/refresh-ahead/cold-miss flows. |
| lambdas/functions/control-plane/src/github/auth.ts | Integrates token cache into installation token auth creation. |
      description = "Name of the DynamoDB table used to cache GitHub App installation tokens across Lambda invocations."
      type        = string
    }

    variable "installation_token_table_arn" {
      description = "ARN of the DynamoDB table used to cache GitHub App installation tokens."
      type        = string
Intentionally required. The table is always created by the parent module (root token-cache.tf or modules/multi-runner/token-cache.tf) and always passed down — there is no opt-out path by design.
The argument for always-on:
- Zero behavioral risk (same token scope, graceful fallback to direct mint on DDB failure)
- Effectively free (~$0/month PAY_PER_REQUEST)
- Reduces GitHub API calls from N-per-invocation to ~1/hour for every deployment
Making it optional adds configuration surface for users to accidentally leave off, gaining nothing since the cost is zero and the behavior is transparent.
    resource "aws_iam_role_policy" "scale_up_token_cache" {
      name   = "token-cache-policy"
      role   = aws_iam_role.scale_up.name
      policy = data.aws_iam_policy_document.scale_up_token_cache.json
    }

    data "aws_iam_policy_document" "scale_up_token_cache" {
      statement {
        effect = "Allow"
        actions = [
          "dynamodb:GetItem",
          "dynamodb:PutItem",
          "dynamodb:UpdateItem",
        ]
        resources = [var.installation_token_table_arn]
      }
    }
Same reasoning as above — intentionally always-on. The table is always created by the parent module; this policy is always needed.
    resource "aws_iam_role_policy" "pool_token_cache" {
      name   = "token-cache-policy"
      role   = aws_iam_role.pool.name
      policy = data.aws_iam_policy_document.pool_token_cache.json
    }

    data "aws_iam_policy_document" "pool_token_cache" {
      statement {
        effect = "Allow"
        actions = [
          "dynamodb:GetItem",
          "dynamodb:PutItem",
          "dynamodb:UpdateItem",
        ]
        resources = [var.config.installation_token_table_arn]
      }
    }
Same as scale-up — always-on by design. See the discussion on the variables.tf comment.
    new UpdateItemCommand({
      TableName: tableName,
      Key: { installation_id: { N: String(installationId) } },
      UpdateExpression: 'SET lock_until_ms = :lockUntil',
      // Acquire if:
      //   1. No item exists, OR no lock, OR current lock expired
      //   AND
      //   2. No valid token exists (or token is within the refresh-ahead window)
      ConditionExpression:
        '(attribute_not_exists(installation_id) OR attribute_not_exists(lock_until_ms) OR lock_until_ms < :now)' +
        ' AND ' +
        '(attribute_not_exists(expires_at_ms) OR expires_at_ms < :refreshAt)',
      ExpressionAttributeValues: {
        ':lockUntil': { N: String(lockUntil) },
        ':now': { N: String(nowMs) },
        ':refreshAt': { N: String(nowMs + REFRESH_AHEAD_MS) },
      },
Good catch. Fixed in 0fc3ddf — the UpdateItem now also sets ttl (epoch seconds, lock expiry + 10min buffer) so lock-only records are automatically cleaned up by DynamoDB TTL.
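For readers following the thread, the lock condition and the TTL arithmetic can be restated in plain TypeScript. This is a sketch under stated assumptions, not the code from 0fc3ddf: `LockState`, `canAcquireLock`, and `lockTtlSeconds` are hypothetical names, and the 10-minute values are taken from the PR text (refresh-ahead window, TTL buffer).

```typescript
// Hypothetical in-memory view of a table item; fields mirror lock_until_ms and expires_at_ms.
interface LockState {
  lockUntilMs?: number;
  expiresAtMs?: number;
}

const REFRESH_AHEAD_MS = 10 * 60 * 1000; // assumed refresh-ahead window (T-10min)
const LOCK_TTL_BUFFER_MS = 10 * 60 * 1000; // 10-minute buffer mentioned in the reply

// Mirrors the ConditionExpression: the lock is absent or expired,
// AND the token is absent or inside the refresh-ahead window.
function canAcquireLock(item: LockState | undefined, nowMs: number): boolean {
  if (!item) return true; // attribute_not_exists(installation_id)
  const lockFree = item.lockUntilMs === undefined || item.lockUntilMs < nowMs;
  const needsToken =
    item.expiresAtMs === undefined || item.expiresAtMs < nowMs + REFRESH_AHEAD_MS;
  return lockFree && needsToken;
}

// DynamoDB TTL attributes hold epoch seconds, so convert from milliseconds.
function lockTtlSeconds(lockUntilMs: number): number {
  return Math.floor((lockUntilMs + LOCK_TTL_BUFFER_MS) / 1000);
}
```

With a 60-second lock taken at epoch 0, `lockTtlSeconds(60_000)` evaluates to 660, i.e. the lock-only record becomes eligible for TTL deletion 10 minutes after the lock itself expires.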
    let getCalls = 0;
    mockSend.mockImplementation(async (cmd: unknown) => {
      if (cmd instanceof GetItemCommand) {
        getCalls++;
        if (getCalls === 1) return { Item: undefined };
        return freshTokenItem(Date.now() + 60 * 60 * 1000);
      }
      if (cmd instanceof UpdateItemCommand) {
        throw new ConditionalCheckFailedException({ $metadata: {}, message: 'lock taken' });
      }
      throw new Error('unexpected: ' + (cmd as { constructor: { name: string } }).constructor.name);
    });
    const mint = vi.fn();

    const result = await getCachedInstallationToken(installationId, mint);

    expect(result.token).toBe('cached-token-abc');
    expect(mint).not.toHaveBeenCalled();
    expect(getCalls).toBe(2);
Acknowledged. The max jitter is 1s and it only fires in the cold-miss-lost-lock test path. Fake timers in vitest interact poorly with async DynamoDB mock timing and would make the test significantly harder to read for marginal speed gain. Keeping as-is.
    resource "aws_dynamodb_table" "installation_tokens" {
      name         = "${var.prefix}-installation-tokens"
      billing_mode = "PAY_PER_REQUEST"
      hash_key     = "installation_id"

      attribute {
        name = "installation_id"
        type = "N"
      }

      ttl {
        attribute_name = "ttl"
        enabled        = true
      }

      point_in_time_recovery {
        enabled = false
      }

      server_side_encryption {
        enabled = true
      }

      tags = merge(local.tags, {
        Name = "${var.prefix}-installation-tokens"
      })
    }
The two files serve different entry points: the root module (single-runner) and modules/multi-runner/. They can't reference each other. Extracting a shared sub-module for 8 lines of HCL adds a directory, a variables.tf, an outputs.tf, and a source reference — more ceremony than the duplication it eliminates. If the table schema changes, both files are in this repo and covered by the same PR.
    new GetItemCommand({
      TableName: tableName,
      Key: { installation_id: { N: String(installationId) } },
      ConsistentRead: true,
    }),
ConsistentRead is intentional and required for correctness. The lock protocol depends on reading the current lock state — an eventually-consistent read could return a stale "no lock" and grant two concurrent mints, defeating the single-flight guarantee.
The cost difference at this access pattern (single-digit reads/minute) is fractions of a cent per month. Not worth the correctness risk.
When UpdateItem creates a lock-only record (no token stored yet), it now also sets the ttl attribute so DynamoDB auto-deletes it if the holder crashes and never writes a full token entry.
Problem
Every Lambda invocation (scale-up, pool) mints a fresh GitHub App installation access token via
POST /app/installations/{id}/access_tokens. Tokens are valid for 60 minutes, but the module discards them after each invocation, so there is no cross-invocation caching.
Under burst load this produces thousands of redundant token-mint calls per minute. Users report hitting rate limits with as few as 10-50 concurrent runners (#3199), and the problem becomes severe at scale (#5037). The token-mint endpoint is subject to both primary rate limits and secondary (abuse) rate limits, which manifest as 403s or delayed 404s.
At 10 runner configs × batch_size 10, a burst of 100 workflow jobs produces ~100 token mints in seconds — all for the same token.
Relevant GitHub API rate limits
- POST /app/installations/{id}/access_tokens
- POST /orgs/{org}/actions/runners/registration-token (actions_runner_registration bucket)
The installation access token endpoint shares the App's 5,000 req/hour JWT-authenticated budget with all other App-level calls (listing installations, getting repo info, etc.). Under burst load, 100+ concurrent token mints can also trigger the secondary rate limit (100 concurrent requests max), resulting in 403s or 502s before the hourly budget is even exhausted.
With the cache: ~1 mint/hour regardless of concurrency. The entire hourly budget is preserved for actual API work.
Solution
A DynamoDB table that caches installation tokens across all Lambda invocations. One token mint per ~50 minutes (refresh-ahead), shared by all concurrent Lambdas.
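The numbers quoted in this description follow from simple arithmetic, sketched here for reference. The function names are hypothetical; the inputs (60-minute token lifetime, 10-minute refresh-ahead, 10 runner configs with batch size 10) come from the text above.

```typescript
const TOKEN_LIFETIME_MIN = 60; // installation tokens are valid for 60 minutes
const REFRESH_AHEAD_MIN = 10; // refresh-ahead window (T-10min)

// Shortest interval between two mints: a refresh can start as soon as the window opens.
function refreshIntervalMinutes(lifetimeMin: number, refreshAheadMin: number): number {
  return lifetimeMin - refreshAheadMin;
}

// Without a cache every invocation in a burst mints its own token.
function mintsWithoutCache(runnerConfigs: number, batchSize: number): number {
  return runnerConfigs * batchSize;
}

const intervalMin = refreshIntervalMinutes(TOKEN_LIFETIME_MIN, REFRESH_AHEAD_MIN); // 50
const burstMints = mintsWithoutCache(10, 10); // 100
```

The 50-minute figure is a lower bound on the interval: a refresh only happens when some invocation reads the item inside the window, so idle periods stretch it.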
Why this should be default-on (no feature flag)
repositoryIds narrowing)
How it works
    sequenceDiagram
        participant A as Lambda A (scale-up)
        participant DDB as DynamoDB
        participant GH as GitHub API
        Note over A,GH: Case A: Fresh cache hit
        A->>DDB: GetItem(installation_id)
        DDB-->>A: token (expires in 30min)
        Note right of A: Return cached token
        Note over A,GH: Case B: Refresh-ahead (token expiring soon)
        participant B as Lambda B (scale-up)
        participant C as Lambda C (concurrent)
        B->>DDB: GetItem(installation_id)
        DDB-->>B: token (expires in 5min)
        B->>DDB: UpdateItem (acquire lock)
        DDB-->>B: lock acquired ✓
        B->>GH: POST /access_tokens
        GH-->>B: new token + expiresAt
        B->>DDB: PutItem (store token, clear lock)
        C->>DDB: GetItem(installation_id)
        DDB-->>C: token (still valid, 5min left)
        Note right of C: Return cached token (no mint needed)
        Note over A,GH: Case C: Cold miss
        A->>DDB: GetItem(installation_id)
        DDB-->>A: ∅ (no item)
        A->>DDB: UpdateItem (acquire lock)
        DDB-->>A: lock acquired ✓
        A->>GH: POST /access_tokens
        GH-->>A: token + expiresAt
        A->>DDB: PutItem (store token)
Three cases: fresh cache hit (A), refresh-ahead when the token is expiring soon (B), and cold miss (C).
On mint failure the lock expires naturally after 60s — caps retry storms.
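The flow can be approximated with an in-memory stand-in for the table. This is an illustrative sketch only: `InMemoryTokenStore`, `getToken`, and the synchronous `Mint` callback are hypothetical and much simpler than the real token-cache.ts (no AWS SDK, no jitter or retries). The 60s lock and 10-minute refresh-ahead values come from the description.

```typescript
// In-memory stand-in for the DynamoDB item; a sketch, not the PR's implementation.
interface Entry {
  token?: string;
  expiresAtMs?: number;
  lockUntilMs?: number;
}

const REFRESH_AHEAD_MS = 10 * 60 * 1000; // refresh-ahead window
const LOCK_MS = 60 * 1000; // lock expires naturally after 60s

class InMemoryTokenStore {
  private entry: Entry | undefined;

  get(): Entry | undefined {
    return this.entry ? { ...this.entry } : undefined;
  }

  // Stand-in for the conditional UpdateItem: returns false where DynamoDB
  // would throw ConditionalCheckFailedException.
  tryLock(nowMs: number): boolean {
    const e = this.entry;
    const lockFree = !e || e.lockUntilMs === undefined || e.lockUntilMs < nowMs;
    const needsToken =
      !e || e.expiresAtMs === undefined || e.expiresAtMs < nowMs + REFRESH_AHEAD_MS;
    if (!(lockFree && needsToken)) return false;
    this.entry = { ...e, lockUntilMs: nowMs + LOCK_MS };
    return true;
  }

  // Stand-in for PutItem: writing the token replaces the item, which clears the lock.
  put(token: string, expiresAtMs: number): void {
    this.entry = { token, expiresAtMs };
  }
}

type Mint = () => { token: string; expiresAtMs: number };

// Returns a token, or undefined when there is no usable token and another caller
// holds the lock (the caller would then wait and re-read).
function getToken(store: InMemoryTokenStore, mint: Mint, nowMs: number): string | undefined {
  const item = store.get();
  let validToken: string | undefined;
  let expiresAtMs = 0;
  if (item && item.token !== undefined && item.expiresAtMs !== undefined && item.expiresAtMs > nowMs) {
    validToken = item.token;
    expiresAtMs = item.expiresAtMs;
  }
  // Case A: token is valid and outside the refresh-ahead window.
  if (validToken !== undefined && expiresAtMs - nowMs >= REFRESH_AHEAD_MS) return validToken;
  // Cases B and C: only the lock winner mints.
  if (store.tryLock(nowMs)) {
    const minted = mint();
    store.put(minted.token, minted.expiresAtMs);
    return minted.token;
  }
  // Lock lost: a still-valid token is good enough; otherwise nothing to return yet.
  return validToken;
}
```

If `mint` throws, `put` is never reached and the lock simply stays in place until `lockUntilMs` passes, which is the 60s backoff described above.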
Changes
Lambda (TypeScript)
- lambdas/functions/control-plane/src/github/token-cache.ts: cache module with locking
- lambdas/functions/control-plane/src/github/token-cache.test.ts: 8 tests covering all paths
- lambdas/functions/control-plane/src/github/auth.ts: integration, route through cache when INSTALLATION_TOKEN_TABLE_NAME is set
- lambdas/functions/control-plane/package.json: add @aws-sdk/client-dynamodb
Terraform
- token-cache.tf (root module): DynamoDB table for single-runner deployments
- modules/multi-runner/token-cache.tf: shared table for multi-runner deployments
- modules/runners/variables.tf: new installation_token_table_name / _arn variables
- modules/runners/scale-up.tf: env var + IAM policy for scale-up Lambda
- modules/runners/pool.tf + modules/runners/pool/main.tf: same for pool Lambda
DynamoDB schema
- installation_id
- token
- expires_at_ms
- lock_until_ms
- ttl
Impact
Refs: #5037, #3199, #4710