Concurrent Lambda invocations generate byte-identical JWTs causing transient 404 on installation token endpoint #5025

@vegardx

Description

When multiple scale-up Lambda invocations run concurrently (common during burst workloads with 100+ workflow_job events), the GitHub App JWT generation produces byte-identical tokens. GitHub rejects these duplicates (likely replay protection), causing POST /app/installations/{id}/access_tokens to return HTTP 404. This triggers silent batch dropping (see related issue: silent batch drop) and permanently loses the affected jobs.

Root Cause

The @octokit/auth-app library (via universal-github-app-jwt) generates JWTs with only the { iat, exp, iss } claims and no jti (JWT ID). The iat claim has one-second precision (Math.floor(Date.now() / 1000)), and RS256 signing (RSASSA-PKCS1-v1_5) is deterministic: the same payload signed with the same key always yields the same signature. As a result, Lambda invocations that generate a JWT within the same second using the same App ID and private key produce byte-identical tokens.
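The collision can be reproduced locally without calling GitHub. The following is a minimal sketch, not the library's actual code: signAppJwt is a hypothetical helper, the key pair is a throwaway generated for the demo, and the iat/exp offsets are illustrative.

```typescript
import { createSign, generateKeyPairSync } from 'node:crypto';

// Throwaway key pair for the demo; a real GitHub App uses its registered private key.
const { privateKey } = generateKeyPairSync('rsa', { modulusLength: 2048 });

const b64url = (value: object): string =>
  Buffer.from(JSON.stringify(value)).toString('base64url');

// Hypothetical helper: signs a payload with only { iat, exp, iss } using RS256.
function signAppJwt(appId: number, nowSeconds: number): string {
  const header = b64url({ alg: 'RS256', typ: 'JWT' });
  const payload = b64url({ iat: nowSeconds - 60, exp: nowSeconds + 600, iss: appId });
  const signingInput = `${header}.${payload}`;
  const signature = createSign('RSA-SHA256')
    .update(signingInput)
    .sign(privateKey)
    .toString('base64url');
  return `${signingInput}.${signature}`;
}

const now = Math.floor(Date.now() / 1000);
const first = signAppJwt(12345, now); // "invocation A"
const second = signAppJwt(12345, now); // "invocation B", same second
const later = signAppJwt(12345, now + 1); // one second later

console.log(first === second); // true: identical claims and key give identical bytes
console.log(first === later); // false: a different iat changes the token
```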

GitHub appears to reject duplicate JWTs (likely as replay protection), so the POST /app/installations/{id}/access_tokens request is treated as unauthenticated. An unauthenticated request to this endpoint returns HTTP 404, because GitHub does not confirm resource existence to unauthorized callers.

The 404 is transient — the same installation ID succeeds on subsequent requests seconds later.

Observed Error

{
  "level": "ERROR",
  "message": "Error processing batch (size: 4): Not Found, ignoring batch",
  "error": {
    "name": "HttpError",
    "status": 404,
    "request": {
      "method": "POST",
      "url": "https://api.<redacted>.ghe.com/app/installations/<redacted>/access_tokens"
    },
    "response": { "status": 404, "data": { "message": "Not Found" } }
  }
}

This repeats for multiple concurrent invocations in rapid succession while other invocations at the same time succeed with the same installation ID.

Impact

  • Combined with the silent batch drop issue, this permanently loses SQS messages and their corresponding jobs.
  • The failure rate correlates with burst size and number of runner type configurations. More configurations = more concurrent Lambdas = higher probability of same-second JWT generation.
  • Small workloads work fine; large matrix workflows trigger this consistently.
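To get a rough feel for how quickly the risk grows with concurrency, the same-second collision probability can be estimated with a birthday-problem calculation. This is a back-of-the-envelope sketch under a strong assumption: JWT generation times are independent and uniformly distributed over a burst window. Real bursts are likely more clustered, so the numbers are at best a lower bound. The function name and the 60-second window are illustrative.

```typescript
// Probability that at least two of `invocations` JWTs are generated in the
// same wall-clock second, assuming independent, uniformly distributed
// generation times across `windowSeconds` (a simplification, not a measurement).
function sameSecondCollisionProbability(invocations: number, windowSeconds: number): number {
  if (invocations > windowSeconds) return 1; // pigeonhole: more tokens than seconds
  let allDistinct = 1;
  for (let i = 0; i < invocations; i++) {
    allDistinct *= 1 - i / windowSeconds;
  }
  return 1 - allDistinct;
}

// Example: invocations spread over a hypothetical 60-second burst window.
for (const n of [2, 5, 12, 30]) {
  console.log(n, sameSecondCollisionProbability(n, 60).toFixed(3));
}
```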

Environment

  • Module version: ~> 7.3
  • GitHub: Enterprise Cloud with Data Residency (ghe.com)
  • Multi-runner module with 12+ runner type configurations

Suggested Fixes

Fix 1: Add jti claim to JWT generation

Add a unique jti (JWT ID) claim to the JWT payload to prevent byte-identical tokens. This can be done via @octokit/auth-app's createJwt callback:

import { randomUUID } from 'node:crypto';
import { createAppAuth } from '@octokit/auth-app';

const auth = createAppAuth({
  appId,
  createJwt: async ({ appId, privateKey }) => {
    const now = Math.floor(Date.now() / 1000);
    const payload = { iat: now - 60, exp: now + 600, iss: appId, jti: randomUUID() };
    // ... sign with privateKey
  },
});

This eliminates the root cause of the transient 404s.

Alternatively, this could be fixed upstream in universal-github-app-jwt by always including a jti claim.
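The signing step elided in the snippet above could look roughly like the following. This is a sketch using only node:crypto: signAppJwtWithJti is a hypothetical helper, not part of @octokit/auth-app, and the exact callback signature and return value that createJwt expects should be taken from the library's documentation. The key pair is a throwaway generated for the demo.

```typescript
import { createSign, createVerify, generateKeyPairSync, randomUUID } from 'node:crypto';

const b64url = (value: object): string =>
  Buffer.from(JSON.stringify(value)).toString('base64url');

// Hypothetical helper: builds and signs an RS256 JWT whose payload carries a
// random jti, so two tokens created in the same second still differ.
function signAppJwtWithJti(appId: number, privateKeyPem: string): string {
  const now = Math.floor(Date.now() / 1000);
  // Same iat/exp offsets as the issue's snippet, plus the jti claim.
  const payload = { iat: now - 60, exp: now + 600, iss: appId, jti: randomUUID() };
  const signingInput = `${b64url({ alg: 'RS256', typ: 'JWT' })}.${b64url(payload)}`;
  const signature = createSign('RSA-SHA256')
    .update(signingInput)
    .sign(privateKeyPem)
    .toString('base64url');
  return `${signingInput}.${signature}`;
}

// Demo with a throwaway key pair (assumption: the real key is a PEM string).
const { privateKey, publicKey } = generateKeyPairSync('rsa', {
  modulusLength: 2048,
  publicKeyEncoding: { type: 'spki', format: 'pem' },
  privateKeyEncoding: { type: 'pkcs8', format: 'pem' },
});

const tokenA = signAppJwtWithJti(12345, privateKey);
const tokenB = signAppJwtWithJti(12345, privateKey);
console.log(tokenA === tokenB); // false: each token has its own jti

// Sanity check: the signature still verifies against the public key.
const [header, body, sig] = tokenA.split('.');
const signatureValid = createVerify('RSA-SHA256')
  .update(`${header}.${body}`)
  .verify(publicKey, Buffer.from(sig, 'base64url'));
console.log(signatureValid); // true
```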

Fix 2: Retry installation token API call with backoff

As a defense-in-depth measure, treat HTTP 404 on POST /app/installations/{id}/access_tokens as a transient error and retry within createGithubInstallationAuth():

const installationId = await getInstallationId(githubAppClient, enableOrgLevel, payload);
let ghAuth, retries = 0;
while (retries < 3) {
  try {
    ghAuth = await createGithubInstallationAuth(installationId, ghesApiUrl);
    break;
  } catch (e) {
    if (e.status === 404 && retries < 2) {
      retries++;
      await new Promise(r => setTimeout(r, 1000 * retries));
      continue;
    }
    throw e;
  }
}
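The loop above can also be factored into a reusable helper that retries only the specific failure seen in the logs. This is a sketch with hypothetical names (isTransientTokenError, retryOnTransient). It swaps the fixed linear delay for exponential backoff with random jitter, so that concurrent invocations are less likely to retry in lockstep; that is a design choice of this sketch, not something the module does today.

```typescript
// Subset of the error fields used below; mirrors the logged HttpError shape.
interface HttpLikeError {
  status?: number;
  request?: { method?: string; url?: string };
}

// Hypothetical predicate: true only for a 404 on the installation token endpoint.
function isTransientTokenError(e: unknown): boolean {
  const err = e as HttpLikeError | null | undefined;
  return (
    err?.status === 404 &&
    err.request?.method === 'POST' &&
    /\/app\/installations\/[^/]+\/access_tokens$/.test(err.request?.url ?? '')
  );
}

// Retries `fn` up to `maxAttempts` times on the transient error above and
// rethrows any other error immediately. `sleep` is injectable for tests.
async function retryOnTransient<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 1000,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms)),
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (e) {
      if (attempt >= maxAttempts || !isTransientTokenError(e)) throw e;
      // Random delay in [0, baseDelayMs * 2^(attempt - 1)).
      await sleep(Math.random() * baseDelayMs * 2 ** (attempt - 1));
    }
  }
}
```

With this helper, the call site from the snippet above would reduce to something like `const ghAuth = await retryOnTransient(() => createGithubInstallationAuth(installationId, ghesApiUrl));`.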

Our Workaround

We have applied both fixes locally:

  1. Custom createJwt callback with jti claim via crypto.randomUUID()
  2. Retry with backoff on 404 in createGithubInstallationAuth()
