App Hosting: auto-rollout silently not created for successful build (two silent skips in one day) #10320

### Summary

Firebase App Hosting sometimes does not create a rollout resource for a successful build. The build reaches `state: READY`, a Cloud Run revision is created and becomes `ContainerReady`, but no `rollout` resource appears under `/v1/.../backends/<id>/rollouts`. Traffic stays on the previous revision until a manual `POST .../rollouts` call is made. I saw this happen twice in one day on the same backend, an incidence rate of roughly 18% that day, and the affected builds are field-for-field identical to the builds that auto-rolled out fine.

This looks similar to some older reports (for example #8866) but the symptom here is that **the rollout resource itself is never created**, not that a rollout is created and fails.

### Environment

- **Project**: `magic-bracket-simulator`
- **Backend**: `api` (Next.js 15.5.7, `output: "standalone"`)
- **Region / location**: `us-central1`
- **Billing**: Blaze
- **Backend UID**: `53b19782-1be2-4edf-9d66-8466a6d089b0`
- **rolloutPolicy**: exactly `{ "codebaseBranch": "main" }` (no `disabled`, no `cooldownDuration`, no custom traffic config)
- **firebase-tools**: whatever `npx firebase-tools@latest` resolved to on 2026-04-10

### Observed timeline (both silent skips on 2026-04-10 UTC)

Between 17:50 and 18:46 UTC I merged six PRs to `main` in rapid succession. All six auto-rolled out normally: each `rollout-*` resource was created within ~60 ms of its corresponding `build-*` resource. I verified this by diffing `createTime` fields across `/v1/.../builds` and `/v1/.../rollouts`.
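The `createTime` comparison described above could be scripted roughly as follows. This is a minimal sketch, not the exact commands used for this report: it assumes GNU `date` (the `-d` and `%3N` options are not available in BSD/macOS `date`), and the helper names `ts_to_ms` and `create_delta_ms` are made up for illustration.

```bash
#!/usr/bin/env bash
# Sketch: how many milliseconds apart were a build's and a rollout's createTime?
# Assumption: GNU date (coreutils). Helper names are hypothetical.

# Convert an RFC 3339 timestamp (as returned in createTime) to epoch milliseconds.
ts_to_ms() {
  date -u -d "$1" +%s%3N
}

# Print (rollout createTime - build createTime) in milliseconds.
# $1 = build createTime, $2 = rollout createTime
create_delta_ms() {
  local build_ms rollout_ms
  build_ms="$(ts_to_ms "$1")"
  rollout_ms="$(ts_to_ms "$2")"
  # 10# forces base 10 so a leading zero is never read as octal.
  echo $(( 10#$rollout_ms - 10#$build_ms ))
}
```

Calling `create_delta_ms "$build_create_time" "$rollout_create_time"` with the two `createTime` strings from the API responses prints the gap in milliseconds; a healthy auto-rollout in this report would show a value of about 60.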

Then:

**First silent skip: build `build-2026-04-10-009`** (commit `a09a9e0017d796675f33f53f6540e17a71ed73df`, PR #152)
- Reached `state: READY` normally
- No rollout resource was created
- Ambiguous because build `build-2026-04-10-010` was created ~3 minutes later and may have superseded it

**Second silent skip: build `build-2026-04-10-011`**, build UID `c6128a89-baae-4b59-96bb-4e8e7414b584` (commit `f6467654a5236cf30f62a39d08afaea7bfcc075d`, PR #154)
- `createTime: 2026-04-10T21:30:05.888598146Z`
- Reached `state: READY` successfully
- Cloud Run revision `api-build-2026-04-10-011` created at 21:34:29Z, became `Ready/Active/ContainerHealthy/ContainerReady` by 21:37:02Z
- Cloud Run log: `Starting new instance. Reason: DEPLOYMENT_ROLLOUT - Instance started due to traffic shifting between revisions due to deployment, traffic split adjustment, or deployment health check.`
- `GET /v1/.../backends/api/rollouts?pageSize=1000` (with `nextPageToken` null) returned no rollout referencing this build. Latest rollout was still `rollout-2026-04-10-009` from 18:46:46Z
- **No other build was created after this one**, so supersession does not explain it
- Traffic sat at 100% on `api-build-2026-04-10-010` for more than 10 minutes

**Manual workaround** that fixed it:

```bash
curl -sS -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{"build":"projects/magic-bracket-simulator/locations/us-central1/backends/api/builds/build-2026-04-10-011"}' \
  "https://firebaseapphosting.googleapis.com/v1/projects/magic-bracket-simulator/locations/us-central1/backends/api/rollouts?rolloutId=rollout-manual-cloudtasks"
```

Traffic shifted to `-011` within ~30 seconds of that call. No config changes, no code changes: the runtime path worked fine, only the rollout resource was missing.

### What I ruled out

I did a read-only forensics pass before filing:

- **Rollout policy paused / disabled / rate-limited**: no such fields exist on the backend, only `codebaseBranch: main`
- **Cooldown after N rapid rollouts**: rollouts are created within ~60 ms of build creation, not queued, and there was an idle gap of nearly 3 hours before `build-011` anyway
- **Build failure**: `state: READY`, Cloud Run revision healthy, Cloud Build tags identical to healthy builds (`fah`, `p-fah`, `r-nodejs`, `b-nodejs_20260405_RC00`, `bt-LIFECYCLE`)
- **GitHub connection / repo link drift**: both silent-skip builds have `source.codebase` populated correctly with `branch: main`, `hash: <sha>`, `uri: https://github.com/.../commit/<sha>`, and the right author
- **Quota denied**: none in logs, billing is Blaze, Cloud Run revision was created successfully
- **maxInstances / traffic deadlock**: traffic shifted in ~30 s once the manual `POST` was made
- **Firebase App Hosting API pagination issue**: fetched with `pageSize=1000` and `nextPageToken` was null
- **Hanging long-running operation**: `GET /v1/.../operations` showed no queued App Hosting ops for this backend beyond the one I created manually
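For anyone repeating the pagination check, a loop that follows `nextPageToken` until it comes back empty might look like the sketch below. It assumes `curl`, `jq`, and `gcloud` are installed, that the list response uses the standard `rollouts` / `nextPageToken` field names, and that `pageToken` is the matching request parameter; the function names are hypothetical and this is not the exact script used for the report.

```bash
#!/usr/bin/env bash
# Sketch: list every rollout across all pages, so a missing rollout cannot be
# a pagination artifact. Requires curl, jq, gcloud. Function names are hypothetical.

API="https://firebaseapphosting.googleapis.com/v1"
PARENT="projects/magic-bracket-simulator/locations/us-central1/backends/api"

# Fetch one page of rollouts as JSON; $1 is the page token (empty for the first page).
fetch_rollouts_page() {
  curl -sS -G \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    --data-urlencode "pageSize=1000" \
    --data-urlencode "pageToken=$1" \
    "$API/$PARENT/rollouts"
}

# Print "<rollout name> <build name>" for every rollout on every page.
list_all_rollouts() {
  local token="" page
  while :; do
    page="$(fetch_rollouts_page "$token")"
    jq -r '.rollouts[]? | "\(.name) \(.build)"' <<<"$page"
    # An absent or null nextPageToken means this was the last page.
    token="$(jq -r '.nextPageToken // empty' <<<"$page")"
    [ -n "$token" ] || break
  done
}
```

Piping the output through `grep build-2026-04-10-011` would then show whether any rollout on any page references that build.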

The one thing I couldn't observe is the App Hosting control plane's internal `CreateRollout` decision. `AppHosting.CreateRollout` audit entries are only present in Cloud Audit Logs for the manual rollout I made, not for any of the six successful auto-rollouts earlier that day. So I can't distinguish "webhook never fired" from "webhook fired but `CreateRollout` returned an error" from "write dropped" without Google-side logs.

### Fingerprint (for Google SRE)

Backend UID: `53b19782-1be2-4edf-9d66-8466a6d089b0`
Affected builds:
- `build-2026-04-10-011` → UID `c6128a89-baae-4b59-96bb-4e8e7414b584` (high-confidence glitch)
- `build-2026-04-10-009` → commit `a09a9e0017d796675f33f53f6540e17a71ed73df` (plausible glitch, can't rule out supersession)

Healthy auto-rollouts from the same backend on the same day, for reference: `rollout-2026-04-10-004` through `rollout-2026-04-10-009`, all created within ~60 ms of their paired builds.

### Expected vs actual

**Expected:** a rollout resource is created automatically within ~60 ms of the build resource being created, as happened for the other deploys that day.

**Actual:** no rollout resource was created at all. The build sits `READY` and the revision sits healthy, but traffic does not shift. There is no user-visible error: `firebase-tools` is not involved in this path at all (it's purely the App Hosting control plane reacting to a GitHub push), so there's nowhere for an error to surface to the user except by staring at the rollouts list.

### Workaround

Poll the `rollouts` endpoint after every push to `main` and create the rollout manually if it's missing. I implemented this as a GitHub Actions workflow with WIF-based auth and a dedicated `roles/firebaseapphosting.admin` service account. Happy to share the workflow if it's useful.
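The core of that check can be sketched in a few shell functions. This is an illustration under stated assumptions rather than the actual workflow: it assumes `curl`, `jq`, and an authenticated `gcloud`, that each listed rollout carries the full build resource name in its `build` field (as in the manual `POST` body), and that a rollout ID of the form `rollout-manual-<build id>` is acceptable. All function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch of the polling workaround. Requires curl, jq, gcloud.
# Function names and the rollout ID scheme are hypothetical.

API="https://firebaseapphosting.googleapis.com/v1"
PARENT="projects/magic-bracket-simulator/locations/us-central1/backends/api"

# Succeeds (exit 0) when no existing rollout references the build.
# $1 = full build resource name; stdin = one referenced build name per line.
rollout_missing() {
  ! grep -Fxq -- "$1"
}

# Print the build name referenced by each existing rollout.
# Reads a single page only; assumes fewer than 1000 rollouts exist.
referenced_builds() {
  curl -sS \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "$API/$PARENT/rollouts?pageSize=1000" | jq -r '.rollouts[]?.build'
}

# Create a rollout for a build, mirroring the manual POST shown earlier.
# $1 = full build resource name, $2 = rollout ID to assign
create_rollout() {
  curl -sS -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d "{\"build\":\"$1\"}" \
    "$API/$PARENT/rollouts?rolloutId=$2"
}

# Create a rollout for build ID $1 only if none references it yet.
ensure_rollout() {
  local build_name="$PARENT/builds/$1"
  if referenced_builds | rollout_missing "$build_name"; then
    create_rollout "$build_name" "rollout-manual-$1"
  fi
}
```

A workflow step would call `ensure_rollout build-2026-04-10-011` once the build has reached `READY`, ideally after a short delay so that a normal auto-rollout has had time to appear.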

### What would help

- Any visibility the App Hosting control plane has into why `CreateRollout` did or did not fire for `build-2026-04-10-011` (backend UID above, build UID above)
- Knowing whether this correlates with a recent App Hosting control-plane deploy
- A way to surface "build is READY but no rollout exists" as a user-visible warning in the Firebase console, since today there's no indication anything is wrong
- Confirmation this is a known class of bug so I know whether to invest more in the workaround or remove it

Thanks — happy to provide more data or try reproductions if useful.
