fix(studio): save retries, mutation queue circuit breaker, save_failure diagnostics#1366
Conversation
james-russo-rames-d-jusso
left a comment
There was a problem hiding this comment.
hyperframes#1366 — fix(studio): save retries, mutation queue circuit breaker, save_failure diagnostics
Solid load-bearing PR for the studio save-reliability stack. Retry helper is well-designed (allowlist on transient classes, jitter, abort signal, attempt-decorated errors), the circuit breaker is small and testable, telemetry is consolidated through a single builder. Studio test suite + typecheck + lint green; only Fallow audit failing (pre-existing CRAP/duplication scores on hot files like useAppHotkeys/useFileManager — not introduced here). One real bug, a handful of design concerns that affect UX + the sibling-PR contract, and some smaller items.
Blockers
• packages/studio/src/utils/domEditSaveQueue.ts:55-63 — the breaker auto-closes on the first in-flight success after it opens, which silently negates the user-dismiss model. Sequence: 5 consecutive failures trip the breaker → onOpen fires → toast + banner show "Auto-save is paused. Check your connection." → a save that was already queued (sitting in tail.then(...)) resolves → run() calls reset() → onReset clears the banner without any user action. The test at domEditSaveQueue.test.ts:54-86 actually asserts this auto-recovery behavior is intentional, but combined with the "reject NEW work while open" branch at line 68 you've got two inconsistent rules: new enqueues see a hard wall, in-flight enqueues silently revive the queue. If auto-recovery is the goal, the "Dismiss" button is misleading; if hard pause is the goal, reset() from run() should be gated on !breakerOpen. Pick one — current shape is the worst of both.
Concerns
• packages/studio/src/hooks/usePreviewPersistence.ts:67-84 — queue is constructed during render, not in lazy useState init. Pattern if (!ref.current) ref.current = createDomEditSaveQueue({ onOpen: setX, ... }) captures setDomEditSaveQueuePaused from first render's closure. That's a stable identity (React guarantees setState identity), so it works in practice, but showToastRef is what's actually defending against the stale-closure case for showToast itself. The pattern is fragile under StrictMode double-invoke + Fast Refresh in dev (you can construct two queues if the ref-bailout happens to dance with the mount cycle, then useMountEffect cleanup only destroys the second). Prefer useState(() => createDomEditSaveQueue(...)) with the callbacks reading refs.
• packages/studio/src/hooks/useDomEditPositionPatchCommit.ts:32-44 — when the breaker is OPEN, every subsequent queueDomEditSave rejects synchronously with DomEditSaveQueueOpenError. That rejection flows through the same .catch that fires showToast(error.message) ("Auto-save is paused. Dismiss the warning to retry DOM edits.") AND trackStudioSaveFailure({ source: "dom_edit", ... }). If the user keeps editing while paused, you get a toast storm + a save_failure event per attempt that isn't a real save failure — it's a breaker rejection. Suggested: short-circuit on DomEditSaveQueueOpenError in the catch (and in useDomEditTextCommits.ts:75 trackDomEditSaveFailure paths) so it's neither toasted again nor logged as save_failure. Same applies anywhere queueDomEditSave(...) rejection hits trackStudioSaveFailure.
• packages/studio/src/utils/studioSaveDiagnostics.ts:92-97 — retry allowlist treats statusCode == null as retryable. That covers network/CORS/abort, which is right for transient cases, but also catches your own thrown JS errors (e.g. Error("Mutation failed: ...") from useGsapScriptCommits.ts:108, or Error("Missing file contents for ...") from useFileManager.ts:113 — both rethrown via attachStudioSaveAttempt so they hit this path). Those aren't transient — they shouldn't burn the retry budget. Tighten to "retry only if statusCode != null && transient OR error is TypeError (fetch-network) OR name === "AbortError" (no — that's already filtered)". Otherwise typo-shaped failures retry 3× and look like flakes in save_failure telemetry.
• packages/studio/src/hooks/useGsapScriptCommits.ts:106-110 — mutateGsapScript no longer returns null; it throws. The if (!result) branch in commitMutation is dead code post-change. Either delete it or convert to try/catch that re-throws / handles skipReload semantics explicitly. The behavioral change is good (errors no longer silently swallowed), but leaving dead branches obscures the new contract.
• Sibling PR coherence with #1388 (journal source writebacks): the backup journal on the server snapshots before every write and is capped at 10 per file. A 4-attempt retry storm against a flaky 503 burns 4 journal slots per save, accelerating churn on the recovery window. Not strictly a blocker on this PR, but the two PRs ship coupled — worth a heads-up to Miguel that the journal pruning policy may need re-tuning once retries are live, or have the server treat repeated identical-content writes within N seconds as a single snapshot.
• Sibling PR coherence with #1389 (keyframe value validation): the 400 from the new server-side NaN/Infinity guard correctly does NOT retry (per isRetryableStudioSaveError 408/425/429/5xx allowlist). Good. But the 400 message will now surface as a save_failure event with status_code=400 — make sure the PostHog dashboards distinguish "user-rejected (4xx)" from "infrastructure (5xx)" cohorts, otherwise the new validation will look like a regression in save reliability.
• packages/studio/src/utils/studioSaveDiagnostics.ts:132-135 — error detail is response.text().slice(0, 300) and ends up in error_message on PostHog. Server error bodies can contain file paths and (depending on linkedom failure shape from #1388/#1389) snippets of the patched HTML. Confirm with the backend team that DOM-patch error responses don't echo back attribute values that could include customer content (style strings, text content from inputs). If they do, scrub on the client before telemetry.
• packages/studio/src/utils/domEditSaveQueue.ts:22 — DEFAULT_FAILURE_THRESHOLD = 5 is hardcoded. Five consecutive failures during a long render export or laptop sleep + wake could mean 30s+ of dropped work before the user sees the banner. Worth either making it configurable via a Studio flag (so we can tune in prod via PostHog feature flag) or shortening for the first rollout. Pair this with a save_failure_recovered event so we can dashboard "how many threshold crossings auto-recovered vs needed dismiss" — important data for tuning.
• packages/studio/src/utils/studioSaveDiagnostics.ts:139-191 — retryStudioSave has no maximum total-time budget. With retries=3, baseDelayMs=500, maxDelayMs=8000 and a server that 503s every attempt: ~500+1000+2000+~8s jitter = ~12s of requestAnimationFrame-scheduled save blocking. During that window the AbortController in useFileManager.ts:120 lets unmount cancel, but if the user navigates between compositions without unmounting the file manager, the next save sits behind a queued retry. Acceptable for now; flag for a future deadline-budget pass.
Nits
• packages/studio/src/utils/studioSaveDiagnostics.ts:191 — throw new Error("Save retry loop exited unexpectedly") is unreachable (the loop ends at attempt >= maxAttempts with a throw). Either annotate // istanbul ignore next or convert to an assertion. (nit)
• packages/studio/src/components/SaveQueuePausedBanner.tsx:10 — z-[92] is a magic number relative to other studio overlays. Confirm it's above the LintModal but below the export progress overlay; consider a named token. (nit)
• packages/studio/src/utils/domEditSaveQueue.ts:24-29 — DomEditSaveQueueOpenError is exported (Fallow flags it as unused-export). If you're keeping it for instanceof checks (recommended for the breaker-rejection short-circuit suggested above), add a brief docstring; otherwise drop the export. (nit)
• packages/studio/src/utils/studioSaveDiagnostics.ts:78-86 — same: getStudioSaveAttempt exported but not consumed externally (Fallow flag). Decide on the public surface. (nit)
• packages/studio/src/hooks/useSafeGsapCommitMutation.ts:58 — void commitMutation(...).catch(...) swallows the rejection chain. Callers that previously did void commitMutation(...) lose the ability to await; that's the design, but the wrapper name implies safety not erasure. Consider safeCommitGsapMutation returning Promise<void> so call sites that do want to await error state can. (nit)
Questions
• packages/studio/src/hooks/usePreviewPersistence.ts:81-83 — when onReset clears the banner state, should we also emit a save_queue_resumed telemetry counterpart to save_queue_paused? Without it we can't measure "trip → recovery" funnel in PostHog. Probably out of scope but worth confirming with Miguel.
• Is there a follow-up planned for save_failure rate-limiting? The PR's trackStudioSaveFailure emits one event per failure; under a sustained outage (the exact case the breaker exists for) that's ~5 events per breaker-open cycle plus any post-open enqueue rejections (see concern above). Easy DoS surface on the PostHog ingest.
• useDomEditSession.ts makeFetchFallback is now created per-render and passed as a callback dep — confirmed useGsapAnimationFetchFallback memoizes? (I'm flagging this to Vai for the runtime-interop review, but calling it out so it doesn't slip.)
What I didn't verify
• Runtime / SDK-shape impacts of the new throw-on-error semantics in mutateGsapScript — that's Vai's lane.
• PostHog dashboard wiring for save_queue_paused vs save_failure — assumed event names are already provisioned; confirm with whoever owns studio telemetry.
• Whether useDomEditPositionPatchCommit extraction is byte-for-byte equivalent to the inline version it replaces. The shape matches; I didn't diff every closure capture.
CI: studio suite + typecheck + lint + build all green. Fallow audit is failing on pre-existing CRAP/duplication scores in unrelated hot files (useAppHotkeys, useFileManager, usePreviewPersistence arrow at line 178 — not from this PR). CodeQL flagged a "client-side request forgery" on the new writeProjectFile PUT — same pattern as the existing GETs, encodeURIComponent-wrapped, low priority for a local-project-scoped studio.
Review by Rames D Jusso
james-russo-rames-d-jusso
left a comment
There was a problem hiding this comment.
Re-classifying per author's request to evaluate long-term-vs-bandaid framing.
Retry + backoff + circuit-breaker is the right long-term architecture — that part stays. Two implementation decisions are bandaid-shaped and need resolving before merge:
-
Breaker design contradicts itself.
domEditSaveQueue.ts:55-63auto-closes on first in-flight success after opening, while:68rejects new work while open. That's two inconsistent rules — the "Dismiss to retry" button becomes UX theater because the breaker may have already auto-reset behind the user's back. Pick one model:- Hard-pause:
reset()fromrun()gated on!breakerOpen, dismiss is the only path back to running state - Auto-recovery: drop the banner + dismiss button, surface a transient toast instead
- Hard-pause:
-
Retry allowlist treats
statusCode == nullas retryable (studioSaveDiagnostics.ts:92-97). That's a shotgun — it catches your own thrown JS errors (Error("Mutation failed: ...")fromuseGsapScriptCommits.ts,Error("Missing file contents for ...")fromuseFileManager.ts) and burns 3 retries on them. Long-term shape: typed error classes (e.g.RetryableNetworkError,NonRetryableServerError,LocalAssertionError) withinstanceofchecks. Current implementation makes thrown JS bugs look like flakes insave_failuretelemetry, masking real issues.
Smaller foot-gun (fix now or it'll bite): usePreviewPersistence.ts:67-84 constructs the queue during render (if (!ref.current) ref.current = createDomEditSaveQueue(...)) instead of lazy useState(() => createDomEditSaveQueue(...)). Fragile under StrictMode double-invoke + Fast Refresh.
See my earlier comment review for the full file:line breakdown of other concerns.
Review by Rames D Jusso
5292c7c to
39da716
Compare
39da716 to
f4768cf
Compare
james-russo-rames-d-jusso
left a comment
There was a problem hiding this comment.
hyperframes#1366 — R2 verification — fix(studio): save retries, mutation queue circuit breaker, save_failure diagnostics
Round-2 verification against HEAD f4768cf5 (merge from main on top of R1-reviewed 5292c7ca). Verbatim finding-by-finding pass on the R1 request-changes review. All three R1 must-fix items are resolved. Recommending stamp.
R1 finding-by-finding
• R1-Blocker-1 (breaker design contradiction) — ✅ Resolved. domEditSaveQueue.ts:55-63 now reads if (!breakerOpen) consecutiveFailures = 0; on success — the old reset() call from run() is gone, so in-flight successes no longer silently re-close the breaker. Hard-pause model. The dismiss button is now the only path out of paused state. New test domEditSaveQueue.test.ts:54-90 ("keeps an open breaker paused even when already queued work succeeds") explicitly asserts the chosen semantics: first throws → breaker opens → second-already-queued resolves → onReset NOT called → next enqueue still rejects with DomEditSaveQueueOpenError. Design + test now agree. Note: R1 read the original reset()-on-success as an auto-recovery call; what we got is a clean rewrite that addresses the underlying ambiguity either way.
• R1-Blocker-2 (retry allowlist shotgun) — ✅ Resolved. studioSaveDiagnostics.ts:99-105 now reads:
if (isStudioSaveAbortError(error)) return false;
if (error instanceof StudioSaveNetworkError) return true;
const statusCode = getStudioSaveStatusCode(error);
if (statusCode == null) return false;
return statusCode === 408 || statusCode === 425 || statusCode === 429 || statusCode >= 500;
Two new typed error classes (StudioSaveHttpError, StudioSaveNetworkError) — exactly the long-term refactor R1 asked for. Plain Error("Mutation failed: ...") / Error("Missing file contents for ...") thrown from app code now fall through to statusCode == null → false → NOT retried. Real JS bugs no longer burn the retry budget or contaminate save_failure telemetry as flakes.
• R1-Smaller-Foot-Gun (constructed-during-render queue) — ✅ Resolved. usePreviewPersistence.ts:66-83 now uses lazy useState(() => createDomEditSaveQueue({...})) — the React idiom for one-time-per-component-instance construction. The fragile useRef + if (!ref.current) ref.current = ... pattern is gone. StrictMode double-invoke + Fast Refresh are no longer footguns here.
R1 comment-review items
• Toast storm + false save_failure per attempt while breaker open — ✅ Resolved. useDomEditPositionPatchCommit.ts:35-36 short-circuits via if (isDomEditSaveQueueOpenError(error)) return; — neither toasts nor emits save_failure for breaker-rejection errors. (Note: this is the only queueDomEditSave call site in the diff; text/style/attribute commits in useDomEditTextCommits.ts go through persistDomEditOperations directly, not the queue, so they never see DomEditSaveQueueOpenError. Routing text commits through the breaker is a separate design question, not an R1 followup.)
• mutateGsapScript dead if (!result) branch — ✅ Resolved. useGsapScriptCommits.ts commitMutation now checks if (result.changed === false) — the dead !result branch is gone and the new throw-on-error contract is explicit.
Items deferred (acceptable — not R1 blockers)
• DEFAULT_FAILURE_THRESHOLD = 5 still hardcoded — R1 question, deferred. Follow-up shaped.
• retryStudioSave no max total-time budget — R1 future-pass flag.
• save_failure no rate-limiting — R1 question, deferred.
• save_queue_resumed telemetry counterpart — R1 question, deferred.
• Sibling-PR coherence with #1388 (journal churn under retries) and #1389 (4xx vs 5xx dashboards) — R1 cross-PR heads-up, not blockers on this PR.
• Nits on SaveQueuePausedBanner.tsx z-[92], unreachable retryStudioSave throw, exported DomEditSaveQueueOpenError/getStudioSaveAttempt — all leftover nits, none load-bearing.
CI
All required checks green at f4768cf5: Test, Typecheck, Lint, Build, CLI smoke, Studio load smoke, SDK contract/smoke, runtime contract, Render/Tests on windows-latest, Preview parity, Format, CodeQL, Semantic PR title. Only Fallow audit failing — pre-existing CRAP/duplication scores on hot files (useAppHotkeys, useFileManager, useDomEditTextCommits, usePreviewPersistence's legacy arrow) — not introduced by this PR.
Verdict
Every R1 must-fix is resolved with a substantive implementation change (not a docstring/intent fix), and the deferred items were explicitly classified at R1 as non-blocking. Recommending approve + stamp.
Review by Rames D Jusso
|
Merge prep update: conflicts are resolved and the branch is git-mergeable, but GitHub still reports |
| response = await fetch(`/api/projects/${pid}/files/${encodeURIComponent(path)}`, { | ||
| method: "PUT", | ||
| headers: { "Content-Type": "text/plain" }, | ||
| body: content, | ||
| }); |
…re diagnostics Save failures could silently drop user work: code-editor saves fired a single PUT with no retry, DOM-edit failures drained the whole queue against a failing server, and several failure paths only logged to the console. - Retry code-editor saves with exponential backoff instead of dropping the edit on the first failed PUT. - Circuit breaker on the DOM-edit save queue: a failing server pauses the queue with a user-visible error state instead of burning every queued mutation against it. - save_failure events now carry error_message, status_code, and source on every emission path; style/attribute DOM-edit failures that previously only logged to the console now emit telemetry too. - Route unawaited commitMutation call sites (GSAP drag, property scrubbing, undo/redo, text fields) through a safe wrapper that reports failures via telemetry instead of unhandledrejection. Follow-ups (deferred): version/ETag conflict guard on file PUTs, offline save queue.
d83bdec to
d5dd61b
Compare
d5dd61b to
1dc55cd
Compare
main went red again at e2cc134: my #1399 fix branched off the pre-#1366 state (where `.hyperframes` was wholesale-hidden via IGNORE_DIRS) and, when it merged on top of #1366, overwrote #1366's corrected test with an assertion that `.hyperframes/examples` is hidden. #1366 is the authoritative behavior: walkDir now hides only `.hyperframes/backup` (shouldIgnoreDir), so `.hyperframes/examples` — like any other vendored dot-dir — stays visible in the file tree and is gated out of composition discovery by isInHiddenOrVendorDir. That is the original #1384 intent. Correct the file-tree test to match: - `.cache/examples/preset.html` and `.hyperframes/examples/preset.html` are both visible in `files` (kept the `.cache` case from #1399 — it exercises isInHiddenOrVendorDir gating for a non-special dot-dir). - `.hyperframes/backup/snapshot.html` is the only thing hidden from the tree. - Compositions still exclude every dot-dir example. Full non-producer suite green; walkDir "hides backups" test untouched. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
Studio save failures can silently drop user work:
commitMutationcall sites (GSAP drag, property scrubbing, undo/redo, text fields) are unawaited, surfacing as unhandled promise rejections instead of reported errors.Fix
useFileManager).utils/domEditSaveQueue.ts): repeated failures pause persistence and show a dismissible alert instead of draining the queue.save_failuretelemetry now carrieserror_message,status_code, andsourceon every emission path (utils/studioSaveDiagnostics.ts); style/attribute failures emit telemetry instead of only console-warning.useSafeGsapCommitMutation) routes unawaited commit failures through telemetry + a user-facing toast rather thanunhandledrejection.Follow-ups (deferred to keep this reviewable)
Testing
bun run buildgreen, oxlint/oxfmt clean.