Skip to content

fix(studio): save retries, mutation queue circuit breaker, save_failure diagnostics#1366

Merged
miguel-heygen merged 2 commits into
mainfrom
fix/studio-save-reliability
Jun 13, 2026
Merged

fix(studio): save retries, mutation queue circuit breaker, save_failure diagnostics#1366
miguel-heygen merged 2 commits into
mainfrom
fix/studio-save-reliability

Conversation

@miguel-heygen

Copy link
Copy Markdown
Collaborator

Problem

Studio save failures can silently drop user work:

  • Code-editor saves fire a single PUT with no retry — any transient network error or oversized payload loses the edit.
  • The DOM-edit mutation queue has no circuit breaker, so a failing server burns through every queued mutation, producing bursts of failed saves.
  • Several failure paths (style/attribute DOM edits) only log to the console, so failures are invisible and undiagnosable.
  • Many commitMutation call sites (GSAP drag, property scrubbing, undo/redo, text fields) are unawaited, surfacing as unhandled promise rejections instead of reported errors.

Fix

  • Exponential-backoff retries on code-editor saves (useFileManager).
  • Circuit breaker on the DOM-edit save queue (utils/domEditSaveQueue.ts): repeated failures pause persistence and show a dismissible alert instead of draining the queue.
  • save_failure telemetry now carries error_message, status_code, and source on every emission path (utils/studioSaveDiagnostics.ts); style/attribute failures emit telemetry instead of only console-warning.
  • Safe wrapper (useSafeGsapCommitMutation) routes unawaited commit failures through telemetry + a user-facing toast rather than unhandledrejection.

Follow-ups (deferred to keep this reviewable)

  • Version/ETag conflict guard on file PUTs (concurrent-tab protection).
  • Offline save queue.

Testing

  • New unit tests for the save queue circuit breaker and save diagnostics.
  • Studio suite fully green (803 tests), typecheck clean, bun run build green, oxlint/oxfmt clean.

Comment thread packages/studio/src/hooks/useFileManager.ts Fixed

@james-russo-rames-d-jusso james-russo-rames-d-jusso left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hyperframes#1366fix(studio): save retries, mutation queue circuit breaker, save_failure diagnostics

Solid load-bearing PR for the studio save-reliability stack. Retry helper is well-designed (allowlist on transient classes, jitter, abort signal, attempt-decorated errors), the circuit breaker is small and testable, telemetry is consolidated through a single builder. Studio test suite + typecheck + lint green; only Fallow audit failing (pre-existing CRAP/duplication scores on hot files like useAppHotkeys/useFileManager — not introduced here). One real bug, a handful of design concerns that affect UX + the sibling-PR contract, and some smaller items.

Blockers
packages/studio/src/utils/domEditSaveQueue.ts:55-63 — the breaker auto-closes on the first in-flight success after it opens, which silently negates the user-dismiss model. Sequence: 5 consecutive failures trip the breaker → onOpen fires → toast + banner show "Auto-save is paused. Check your connection." → a save that was already queued (sitting in tail.then(...)) resolves → run() calls reset()onReset clears the banner without any user action. The test at domEditSaveQueue.test.ts:54-86 actually asserts this auto-recovery behavior is intentional, but combined with the "reject NEW work while open" branch at line 68 you've got two inconsistent rules: new enqueues see a hard wall, in-flight enqueues silently revive the queue. If auto-recovery is the goal, the "Dismiss" button is misleading; if hard pause is the goal, reset() from run() should be gated on !breakerOpen. Pick one — current shape is the worst of both.

Concerns
packages/studio/src/hooks/usePreviewPersistence.ts:67-84 — queue is constructed during render, not in lazy useState init. Pattern if (!ref.current) ref.current = createDomEditSaveQueue({ onOpen: setX, ... }) captures setDomEditSaveQueuePaused from first render's closure. That's a stable identity (React guarantees setState identity), so it works in practice, but showToastRef is what's actually defending against the stale-closure case for showToast itself. The pattern is fragile under StrictMode double-invoke + Fast Refresh in dev (you can construct two queues if the ref-bailout happens to dance with the mount cycle, then useMountEffect cleanup only destroys the second). Prefer useState(() => createDomEditSaveQueue(...)) with the callbacks reading refs.
packages/studio/src/hooks/useDomEditPositionPatchCommit.ts:32-44 — when the breaker is OPEN, every subsequent queueDomEditSave rejects synchronously with DomEditSaveQueueOpenError. That rejection flows through the same .catch that fires showToast(error.message) ("Auto-save is paused. Dismiss the warning to retry DOM edits.") AND trackStudioSaveFailure({ source: "dom_edit", ... }). If the user keeps editing while paused, you get a toast storm + a save_failure event per attempt that isn't a real save failure — it's a breaker rejection. Suggested: short-circuit on DomEditSaveQueueOpenError in the catch (and in useDomEditTextCommits.ts:75 trackDomEditSaveFailure paths) so it's neither toasted again nor logged as save_failure. Same applies anywhere queueDomEditSave(...) rejection hits trackStudioSaveFailure.
packages/studio/src/utils/studioSaveDiagnostics.ts:92-97 — retry allowlist treats statusCode == null as retryable. That covers network/CORS/abort, which is right for transient cases, but also catches your own thrown JS errors (e.g. Error("Mutation failed: ...") from useGsapScriptCommits.ts:108, or Error("Missing file contents for ...") from useFileManager.ts:113 — both rethrown via attachStudioSaveAttempt so they hit this path). Those aren't transient — they shouldn't burn the retry budget. Tighten to "retry only if statusCode != null && transient OR error is TypeError (fetch-network) OR name === "AbortError" (no — that's already filtered)". Otherwise typo-shaped failures retry 3× and look like flakes in save_failure telemetry.
packages/studio/src/hooks/useGsapScriptCommits.ts:106-110mutateGsapScript no longer returns null; it throws. The if (!result) branch in commitMutation is dead code post-change. Either delete it or convert to try/catch that re-throws / handles skipReload semantics explicitly. The behavioral change is good (errors no longer silently swallowed), but leaving dead branches obscures the new contract.
• Sibling PR coherence with #1388 (journal source writebacks): the backup journal on the server snapshots before every write and is capped at 10 per file. A 4-attempt retry storm against a flaky 503 burns 4 journal slots per save, accelerating churn on the recovery window. Not strictly a blocker on this PR, but the two PRs ship coupled — worth a heads-up to Miguel that the journal pruning policy may need re-tuning once retries are live, or have the server treat repeated identical-content writes within N seconds as a single snapshot.
• Sibling PR coherence with #1389 (keyframe value validation): the 400 from the new server-side NaN/Infinity guard correctly does NOT retry (per isRetryableStudioSaveError 408/425/429/5xx allowlist). Good. But the 400 message will now surface as a save_failure event with status_code=400 — make sure the PostHog dashboards distinguish "user-rejected (4xx)" from "infrastructure (5xx)" cohorts, otherwise the new validation will look like a regression in save reliability.
packages/studio/src/utils/studioSaveDiagnostics.ts:132-135 — error detail is response.text().slice(0, 300) and ends up in error_message on PostHog. Server error bodies can contain file paths and (depending on linkedom failure shape from #1388/#1389) snippets of the patched HTML. Confirm with the backend team that DOM-patch error responses don't echo back attribute values that could include customer content (style strings, text content from inputs). If they do, scrub on the client before telemetry.
packages/studio/src/utils/domEditSaveQueue.ts:22DEFAULT_FAILURE_THRESHOLD = 5 is hardcoded. Five consecutive failures during a long render export or laptop sleep + wake could mean 30s+ of dropped work before the user sees the banner. Worth either making it configurable via a Studio flag (so we can tune in prod via PostHog feature flag) or shortening for the first rollout. Pair this with a save_failure_recovered event so we can dashboard "how many threshold crossings auto-recovered vs needed dismiss" — important data for tuning.
packages/studio/src/utils/studioSaveDiagnostics.ts:139-191retryStudioSave has no maximum total-time budget. With retries=3, baseDelayMs=500, maxDelayMs=8000 and a server that 503s every attempt: ~500+1000+2000+~8s jitter = ~12s of requestAnimationFrame-scheduled save blocking. During that window the AbortController in useFileManager.ts:120 lets unmount cancel, but if the user navigates between compositions without unmounting the file manager, the next save sits behind a queued retry. Acceptable for now; flag for a future deadline-budget pass.

Nits
packages/studio/src/utils/studioSaveDiagnostics.ts:191throw new Error("Save retry loop exited unexpectedly") is unreachable (the loop ends at attempt >= maxAttempts with a throw). Either annotate // istanbul ignore next or convert to an assertion. (nit)
packages/studio/src/components/SaveQueuePausedBanner.tsx:10z-[92] is a magic number relative to other studio overlays. Confirm it's above the LintModal but below the export progress overlay; consider a named token. (nit)
packages/studio/src/utils/domEditSaveQueue.ts:24-29DomEditSaveQueueOpenError is exported (Fallow flags it as unused-export). If you're keeping it for instanceof checks (recommended for the breaker-rejection short-circuit suggested above), add a brief docstring; otherwise drop the export. (nit)
packages/studio/src/utils/studioSaveDiagnostics.ts:78-86 — same: getStudioSaveAttempt exported but not consumed externally (Fallow flag). Decide on the public surface. (nit)
packages/studio/src/hooks/useSafeGsapCommitMutation.ts:58void commitMutation(...).catch(...) swallows the rejection chain. Callers that previously did void commitMutation(...) lose the ability to await; that's the design, but the wrapper name implies safety not erasure. Consider safeCommitGsapMutation returning Promise<void> so call sites that do want to await error state can. (nit)

Questions
packages/studio/src/hooks/usePreviewPersistence.ts:81-83 — when onReset clears the banner state, should we also emit a save_queue_resumed telemetry counterpart to save_queue_paused? Without it we can't measure "trip → recovery" funnel in PostHog. Probably out of scope but worth confirming with Miguel.
• Is there a follow-up planned for save_failure rate-limiting? The PR's trackStudioSaveFailure emits one event per failure; under a sustained outage (the exact case the breaker exists for) that's ~5 events per breaker-open cycle plus any post-open enqueue rejections (see concern above). Easy DoS surface on the PostHog ingest.
useDomEditSession.ts makeFetchFallback is now created per-render and passed as a callback dep — confirmed useGsapAnimationFetchFallback memoizes? (I'm flagging this to Vai for the runtime-interop review, but calling it out so it doesn't slip.)

What I didn't verify
• Runtime / SDK-shape impacts of the new throw-on-error semantics in mutateGsapScript — that's Vai's lane.
• PostHog dashboard wiring for save_queue_paused vs save_failure — assumed event names are already provisioned; confirm with whoever owns studio telemetry.
• Whether useDomEditPositionPatchCommit extraction is byte-for-byte equivalent to the inline version it replaces. The shape matches; I didn't diff every closure capture.

CI: studio suite + typecheck + lint + build all green. Fallow audit is failing on pre-existing CRAP/duplication scores in unrelated hot files (useAppHotkeys, useFileManager, usePreviewPersistence arrow at line 178 — not from this PR). CodeQL flagged a "client-side request forgery" on the new writeProjectFile PUT — same pattern as the existing GETs, encodeURIComponent-wrapped, low priority for a local-project-scoped studio.

Review by Rames D Jusso

@james-russo-rames-d-jusso james-russo-rames-d-jusso left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-classifying per author's request to evaluate long-term-vs-bandaid framing.

Retry + backoff + circuit-breaker is the right long-term architecture — that part stays. Two implementation decisions are bandaid-shaped and need resolving before merge:

  1. Breaker design contradicts itself. domEditSaveQueue.ts:55-63 auto-closes on first in-flight success after opening, while :68 rejects new work while open. That's two inconsistent rules — the "Dismiss to retry" button becomes UX theater because the breaker may have already auto-reset behind the user's back. Pick one model:

    • Hard-pause: reset() from run() gated on !breakerOpen, dismiss is the only path back to running state
    • Auto-recovery: drop the banner + dismiss button, surface a transient toast instead
  2. Retry allowlist treats statusCode == null as retryable (studioSaveDiagnostics.ts:92-97). That's a shotgun — it catches your own thrown JS errors (Error("Mutation failed: ...") from useGsapScriptCommits.ts, Error("Missing file contents for ...") from useFileManager.ts) and burns 3 retries on them. Long-term shape: typed error classes (e.g. RetryableNetworkError, NonRetryableServerError, LocalAssertionError) with instanceof checks. Current implementation makes thrown JS bugs look like flakes in save_failure telemetry, masking real issues.

Smaller foot-gun (fix now or it'll bite): usePreviewPersistence.ts:67-84 constructs the queue during render (if (!ref.current) ref.current = createDomEditSaveQueue(...)) instead of lazy useState(() => createDomEditSaveQueue(...)). Fragile under StrictMode double-invoke + Fast Refresh.

See my earlier comment review for the full file:line breakdown of other concerns.

Review by Rames D Jusso

@miguel-heygen miguel-heygen force-pushed the fix/studio-save-reliability branch from 5292c7c to 39da716 Compare June 12, 2026 20:16
Comment thread packages/studio/src/hooks/useFileManager.ts Fixed

@james-russo-rames-d-jusso james-russo-rames-d-jusso left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hyperframes#1366 — R2 verification — fix(studio): save retries, mutation queue circuit breaker, save_failure diagnostics

Round-2 verification against HEAD f4768cf5 (merge from main on top of R1-reviewed 5292c7ca). Verbatim finding-by-finding pass on the R1 request-changes review. All three R1 must-fix items are resolved. Recommending stamp.

R1 finding-by-finding

R1-Blocker-1 (breaker design contradiction) — ✅ Resolved. domEditSaveQueue.ts:55-63 now reads if (!breakerOpen) consecutiveFailures = 0; on success — the old reset() call from run() is gone, so in-flight successes no longer silently re-close the breaker. Hard-pause model. The dismiss button is now the only path out of paused state. New test domEditSaveQueue.test.ts:54-90 ("keeps an open breaker paused even when already queued work succeeds") explicitly asserts the chosen semantics: first throws → breaker opens → second-already-queued resolves → onReset NOT called → next enqueue still rejects with DomEditSaveQueueOpenError. Design + test now agree. Note: R1 read the original reset()-on-success as an auto-recovery call; what we got is a clean rewrite that addresses the underlying ambiguity either way.

R1-Blocker-2 (retry allowlist shotgun) — ✅ Resolved. studioSaveDiagnostics.ts:99-105 now reads:

if (isStudioSaveAbortError(error)) return false;
if (error instanceof StudioSaveNetworkError) return true;
const statusCode = getStudioSaveStatusCode(error);
if (statusCode == null) return false;
return statusCode === 408 || statusCode === 425 || statusCode === 429 || statusCode >= 500;

Two new typed error classes (StudioSaveHttpError, StudioSaveNetworkError) — exactly the long-term refactor R1 asked for. Plain Error("Mutation failed: ...") / Error("Missing file contents for ...") thrown from app code now fall through to statusCode == null → false → NOT retried. Real JS bugs no longer burn the retry budget or contaminate save_failure telemetry as flakes.

R1-Smaller-Foot-Gun (constructed-during-render queue) — ✅ Resolved. usePreviewPersistence.ts:66-83 now uses lazy useState(() => createDomEditSaveQueue({...})) — the React idiom for one-time-per-component-instance construction. The fragile useRef + if (!ref.current) ref.current = ... pattern is gone. StrictMode double-invoke + Fast Refresh are no longer footguns here.

R1 comment-review items

Toast storm + false save_failure per attempt while breaker open — ✅ Resolved. useDomEditPositionPatchCommit.ts:35-36 short-circuits via if (isDomEditSaveQueueOpenError(error)) return; — neither toasts nor emits save_failure for breaker-rejection errors. (Note: this is the only queueDomEditSave call site in the diff; text/style/attribute commits in useDomEditTextCommits.ts go through persistDomEditOperations directly, not the queue, so they never see DomEditSaveQueueOpenError. Routing text commits through the breaker is a separate design question, not an R1 followup.)

mutateGsapScript dead if (!result) branch — ✅ Resolved. useGsapScriptCommits.ts commitMutation now checks if (result.changed === false) — the dead !result branch is gone and the new throw-on-error contract is explicit.

Items deferred (acceptable — not R1 blockers)

DEFAULT_FAILURE_THRESHOLD = 5 still hardcoded — R1 question, deferred. Follow-up shaped.
retryStudioSave no max total-time budget — R1 future-pass flag.
save_failure no rate-limiting — R1 question, deferred.
save_queue_resumed telemetry counterpart — R1 question, deferred.
• Sibling-PR coherence with #1388 (journal churn under retries) and #1389 (4xx vs 5xx dashboards) — R1 cross-PR heads-up, not blockers on this PR.
• Nits on SaveQueuePausedBanner.tsx z-[92], unreachable retryStudioSave throw, exported DomEditSaveQueueOpenError/getStudioSaveAttempt — all leftover nits, none load-bearing.

CI

All required checks green at f4768cf5: Test, Typecheck, Lint, Build, CLI smoke, Studio load smoke, SDK contract/smoke, runtime contract, Render/Tests on windows-latest, Preview parity, Format, CodeQL, Semantic PR title. Only Fallow audit failing — pre-existing CRAP/duplication scores on hot files (useAppHotkeys, useFileManager, useDomEditTextCommits, usePreviewPersistence's legacy arrow) — not introduced by this PR.

Verdict

Every R1 must-fix is resolved with a substantive implementation change (not a docstring/intent fix), and the deferred items were explicitly classified at R1 as non-blocking. Recommending approve + stamp.

Review by Rames D Jusso

@vanceingalls vanceingalls left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jrusso1020 jrusso1020 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved.

@miguel-heygen

Copy link
Copy Markdown
Collaborator Author

Merge prep update: conflicts are resolved and the branch is git-mergeable, but GitHub still reports reviewDecision=CHANGES_REQUESTED from the earlier james-russo-rames-d-jusso review. The later approval from jrusso1020 does not clear that requester in branch policy. Please approve/dismiss from the original requesting account so this can merge once checks finish green.

Comment on lines +108 to +112
response = await fetch(`/api/projects/${pid}/files/${encodeURIComponent(path)}`, {
method: "PUT",
headers: { "Content-Type": "text/plain" },
body: content,
});
…re diagnostics

Save failures could silently drop user work: code-editor saves fired a single
PUT with no retry, DOM-edit failures drained the whole queue against a failing
server, and several failure paths only logged to the console.

- Retry code-editor saves with exponential backoff instead of dropping the
  edit on the first failed PUT.
- Circuit breaker on the DOM-edit save queue: a failing server pauses the
  queue with a user-visible error state instead of burning every queued
  mutation against it.
- save_failure events now carry error_message, status_code, and source on
  every emission path; style/attribute DOM-edit failures that previously only
  logged to the console now emit telemetry too.
- Route unawaited commitMutation call sites (GSAP drag, property scrubbing,
  undo/redo, text fields) through a safe wrapper that reports failures via
  telemetry instead of unhandledrejection.

Follow-ups (deferred): version/ETag conflict guard on file PUTs, offline
save queue.
@miguel-heygen miguel-heygen force-pushed the fix/studio-save-reliability branch from d83bdec to d5dd61b Compare June 13, 2026 00:57
@miguel-heygen miguel-heygen force-pushed the fix/studio-save-reliability branch from d5dd61b to 1dc55cd Compare June 13, 2026 01:03
@miguel-heygen miguel-heygen merged commit 5f12e69 into main Jun 13, 2026
47 checks passed
@miguel-heygen miguel-heygen deleted the fix/studio-save-reliability branch June 13, 2026 02:02
jrusso1020 added a commit that referenced this pull request Jun 13, 2026
main went red again at e2cc134: my #1399 fix branched off the pre-#1366
state (where `.hyperframes` was wholesale-hidden via IGNORE_DIRS) and, when
it merged on top of #1366, overwrote #1366's corrected test with an
assertion that `.hyperframes/examples` is hidden.

#1366 is the authoritative behavior: walkDir now hides only
`.hyperframes/backup` (shouldIgnoreDir), so `.hyperframes/examples` — like
any other vendored dot-dir — stays visible in the file tree and is gated out
of composition discovery by isInHiddenOrVendorDir. That is the original #1384
intent.

Correct the file-tree test to match:
- `.cache/examples/preset.html` and `.hyperframes/examples/preset.html` are
  both visible in `files` (kept the `.cache` case from #1399 — it exercises
  isInHiddenOrVendorDir gating for a non-special dot-dir).
- `.hyperframes/backup/snapshot.html` is the only thing hidden from the tree.
- Compositions still exclude every dot-dir example.

Full non-producer suite green; walkDir "hides backups" test untouched.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants