fix: handle silent schema sync failures in branch-scoped DB middleware#2999
amikofalvy wants to merge 2 commits into main from …
Conversation
TL;DR — Fixes silent 500 errors caused by schema sync failures during Dolt branch checkout. The middleware now inspects the checkoutBranch result instead of discarding it.

Key changes:
- Schema sync retry with jittered backoff in branch-scoped DB middleware. The retry loop targets a single retryable failure mode: lock contention.
- Proper serialization of database error causes.
Solid fix for a real production pain point — the root cause analysis on non-enumerable pg Error properties is spot-on. Two actionable issues to address before merging: the retry loop fires on unretryable failures (merge conflicts), and dbError is a misleading field name for a generic cause serializer.
Claude Opus
```ts
while (
  checkoutResult.schemaSync.hadDifferences &&
  !checkoutResult.schemaSync.performed &&
  syncAttempts < MAX_SCHEMA_SYNC_RETRIES
) {
  syncAttempts++;
  logger.warn(
    {
      branch: resolvedRef.name,
      error: checkoutResult.schemaSync.error,
      attempt: syncAttempts,
    },
    'Schema sync not performed, retrying'
  );
  await new Promise((resolve) => setTimeout(resolve, 50 * syncAttempts));
  checkoutResult = await checkoutBranch(requestDb)({
    branchName: resolvedRef.name,
    syncSchema: true,
    autoCommitPending: true,
  });
}
```
The retry condition (hadDifferences && !performed) fires on all sync failure modes — lock contention, merge conflicts, uncommitted-changes errors, and generic failures. But the PR description says this targets lock contention specifically. Retrying merge conflicts or "uncommitted changes" errors will fail identically each time, burning 2 retries + backoff for no benefit.
Consider narrowing the retry condition to lock contention only. The SchemaSyncResult already has a skippedDueToLock field, but it's dropped during the mapping to CheckoutBranchResult. Either:
- Forward `skippedDueToLock` to `CheckoutBranchResult` and gate the retry on it, or
- At minimum, only retry when `error` is `undefined` (which currently signals lock contention, since the lock-skip path doesn't set an error string).
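A minimal sketch of the narrowed gate, assuming hypothetical type shapes (not the repo's actual `SchemaSyncResult`/`CheckoutBranchResult` definitions):

```typescript
// Hypothetical shapes and helper — a sketch of the suggested narrowing, not
// the repo's actual types. The point: retry only the lock-contention skip.
interface SchemaSyncInfo {
  hadDifferences: boolean;
  performed: boolean;
  skippedDueToLock?: boolean; // forwarded from SchemaSyncResult
  error?: string;
}

const MAX_SCHEMA_SYNC_RETRIES = 2;

function shouldRetrySync(sync: SchemaSyncInfo, attempts: number): boolean {
  // Merge conflicts and uncommitted-changes failures would fail identically
  // on every attempt, so only the transient lock-skip path is retried.
  return (
    sync.hadDifferences &&
    !sync.performed &&
    sync.skippedDueToLock === true &&
    attempts < MAX_SCHEMA_SYNC_RETRIES
  );
}
```

With this gate, a merge-conflict failure exits the loop on the first check instead of burning both retries plus backoff.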
```ts
logger.warn(
  {
    branch: resolvedRef.name,
    error: checkoutResult.schemaSync.error,
    attempt: syncAttempts,
  },
  'Schema sync not performed, retrying'
);
```
checkoutResult.schemaSync.error is undefined in the lock contention path — SchemaSyncResult.skippedDueToLock is set but not forwarded to CheckoutBranchResult.schemaSync.error. This warning log will show error: undefined, which makes it hard for operators to understand why the retry happened. Consider logging a meaningful string like 'skipped due to lock contention' when the error field is absent, or better yet, forward skippedDueToLock from the result.
```ts
error: err,
message: errorMessage,
stack: errorStack,
...(cause && { dbError: cause }),
```
dbError is a misleading field name — serializeCause handles any error's cause chain (HTTP errors, validation wrappers, etc.), not just database errors. Since this is a new field with no downstream consumers, renaming to cause would be more accurate and future-proof. If you want to avoid collision with Pino's own cause handling, errorCause (already used in AgentSession.ts:1825) would be consistent with existing patterns.
```diff
- ...(cause && { dbError: cause }),
+ ...(cause && { errorCause: cause }),
```
```ts
{
  error: err,
  message: errorMessage,
  ...(cause && { dbError: cause }),
```
Same field naming concern as above.
```diff
- ...(cause && { dbError: cause }),
+ ...(cause && { errorCause: cause }),
```
PR Review Summary
(3) Total Issues | Risk: Medium
🟠⚠️ Major (1) 🟠⚠️
🟠 1) branchScopedDb.ts:119-128 Schema sync failure allows degraded operation without user feedback
Issue: After exhausting schema sync retries, the middleware logs an error but continues request execution without any indication to the API caller. The user receives a normal response even though subsequent queries may fail unpredictably due to outdated schema (e.g., missing columns like is_work_app).
Why: This creates a debugging nightmare where the symptom (column errors, 500s) is disconnected from the cause (schema sync failure). The log message helps operators but provides no feedback to API consumers. For write operations especially, proceeding with a stale schema is risky since data may be written to incomplete tables.
Fix: Consider one or more of the following:

- Set a response header for observability without breaking the request:

  ```ts
  if (checkoutResult.schemaSync.hadDifferences && !checkoutResult.schemaSync.performed) {
    logger.error(...);
    c.header('X-Schema-Sync-Warning', 'Schema sync failed; queries may use outdated schema');
  }
  ```

- Add span attributes so downstream failures can be correlated via tracing:

  ```ts
  import { trace } from '@opentelemetry/api';
  // ...
  const span = trace.getActiveSpan();
  span?.setAttribute('schema_sync.failed', true);
  span?.setAttribute('schema_sync.attempts', syncAttempts + 1);
  ```

- Fail fast for write operations (stricter option):

  ```ts
  const isWriteMethod = !['GET', 'HEAD', 'OPTIONS'].includes(method);
  if (checkoutResult.schemaSync.hadDifferences && !checkoutResult.schemaSync.performed && isWriteMethod) {
    throw new HTTPException(503, { message: 'Schema sync unavailable; please retry' });
  }
  ```

Refs:
🟡 Minor (1) 🟡
Inline Comments:
- 🟡 Minor: `branchScopedDb.ts:111` — Retry backoff lacks jitter; add randomization to prevent thundering herd
💭 Consider (1) 💭
💭 1) errorHandler.ts:94-104 Original error object no longer logged directly
Issue: The previous implementation logged error: err directly. The new implementation extracts specific fields (message, stack, dbError) but no longer includes the full error object, which could lose properties that aren't explicitly extracted (e.g., custom properties from third-party libraries like HTTP client errors with response, config fields).
Why: While pg errors are now better handled, other error types may lose context. Low impact since the critical pg properties are now captured.
Fix: Consider keeping the original error object for non-pg errors, or accept this tradeoff since the extracted fields cover the most common debugging needs.
Refs:
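If keeping that extra context is desired, a sketch along these lines could work (`toLoggableError` is a hypothetical helper, not the PR's errorHandler code):

```typescript
// Hypothetical helper (not the PR's actual errorHandler code): keep the
// error's own enumerable extras while re-adding the non-enumerable standard
// fields that a plain object spread would silently drop.
function toLoggableError(err: unknown): Record<string, unknown> {
  if (!(err instanceof Error)) return { value: err };
  return {
    ...err,               // enumerable extras, e.g. an HTTP client's statusCode/config
    name: err.name,       // name/message/stack are non-enumerable on Error,
    message: err.message, // so the spread alone would lose them
    stack: err.stack,
  };
}
```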
🧹 While You're Here (1) 🧹
🧹 1) branchScopedDb.ts:63-70 Silent fallback to manageDbClient masks infrastructure problems
Issue: When getPoolFromClient() returns null (line 65), the code logs an error but falls back to using manageDbClient directly and continues processing. This masks a fundamental infrastructure problem: the database client doesn't have the expected structure. Branch scoping will not work correctly.
Why: If pool extraction fails in production, every subsequent request will silently bypass branch scoping, potentially causing data isolation violations. This is pre-existing behavior, not introduced by this PR, but worth noting since you're already improving error handling in this file.
Fix: Consider throwing a 503 instead of silently proceeding:
```ts
if (!pool) {
  logger.error({}, 'Could not get connection pool from dbClient - branch scoping unavailable');
  throw new HTTPException(503, { message: 'Database connection pool unavailable' });
}
```

Refs:
💡 APPROVE WITH SUGGESTIONS
Summary: This PR effectively addresses a real operational pain point — silent schema sync failures and invisible pg errors were causing 500s that were difficult to diagnose. The retry logic and serializeCause() helper are solid improvements. The main suggestion is to consider adding some form of feedback (header, span attribute, or controlled failure) when schema sync fails so that downstream errors can be correlated with the root cause. The jitter suggestion is a quick win for production reliability.
Discarded (1)
| Location | Issue | Reason Discarded |
|---|---|---|
| errorHandler.ts:54-78 | serializeCause has bounded recursion depth (positive observation) | Not an issue — this was a validation that the implementation correctly bounds recursion at depth 3 to prevent stack overflow |
Reviewers (3)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
| pr-review-errors | 3 | 1 | 1 | 1 | 0 | 0 | 0 |
| pr-review-sre | 4 | 0 | 0 | 0 | 1 | 0 | 1 |
| pr-review-standards | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total | 7 | 1 | 1 | 1 | 1 | 0 | 1 |
Note: The schema sync failure finding from pr-review-errors and pr-review-sre was merged, as both identified the same issue from different angles.
```ts
  },
  'Schema sync not performed, retrying'
);
await new Promise((resolve) => setTimeout(resolve, 50 * syncAttempts));
```
🟡 Minor: Retry backoff lacks jitter
Issue: The retry delay uses linear backoff (50 * syncAttempts) without jitter. When multiple concurrent requests hit lock contention simultaneously, they'll all retry at the same intervals (50ms, 100ms), creating synchronized retry waves that can perpetuate contention.
Why: The codebase already uses jitter in similar patterns (withRetry.ts uses Math.random() * expDelay). Lock contention scenarios are particularly susceptible to thundering herd problems when retries are synchronized.
Fix:
```diff
- await new Promise((resolve) => setTimeout(resolve, 50 * syncAttempts));
+ await new Promise((resolve) => setTimeout(resolve, 50 * syncAttempts * (1 + Math.random() * 0.5)));
```
Refs:
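The suggested one-liner can be read as a small helper; this sketch (hypothetical function name, constants taken from the suggestion) makes the delay bounds explicit:

```typescript
// Sketch of the suggested jittered linear backoff. For attempt n the delay
// falls in [base*n, base*n*1.5), so concurrent retries desynchronize instead
// of retrying in lockstep at exactly 50ms/100ms.
function jitteredDelayMs(attempt: number, baseMs = 50, jitterFactor = 0.5): number {
  return baseMs * attempt * (1 + Math.random() * jitterFactor);
}
```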
Ito Test Report ✅ — 16 test cases ran: 12 passed, 4 additional findings.

Overall, 16 end-to-end/local API tests ran with 12 passing and 4 failing: baseline routed reads (Agents/Triggers/Settings), branch list/create and duplicate/conflict handling, malformed-input rejection, deletion safety, stale deep-link handling, auth-boundary checks, forced-503 sanitization, and post-stress recovery all behaved as expected, without observed data leakage or broad stability regressions.

Key defects:
- High severity: Project Settings save blocker caused by UI edit-form ID regex validation on a disabled field (also invalidating two-tab overlap expectations).
- High severity: concurrent branch-scoped PATCH bursts return 500s when schema-sync lock retries are exhausted and requests proceed unsynced.
- Medium severity: mobile scheduled-trigger creation fails because invalid browser timezone defaults are not validated/caught and escalate to 500.
Schema sync failures during Dolt branch checkout were silently swallowed,
causing queries to run against branches with outdated schemas (missing
columns like is_work_app), producing 500 errors with unhelpful cause: {}
in logs.
- Capture and inspect checkoutBranch result for sync failures
- Retry schema sync up to 2 times on lock contention
- Log explicit errors when sync fails after retries
- Serialize pg Error cause chain properly (non-enumerable properties)
so actual DB errors (code, message, severity) appear in logs
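The non-enumerable behavior is easy to reproduce in isolation; the serializer below is an illustrative sketch, not the PR's actual serializeCause implementation:

```typescript
// Error#message and Error#stack are non-enumerable, so JSON.stringify drops
// them — a bare Error serializes to "{}". Per the PR description, pg's
// code/severity/detail behave the same way, which is what produced the
// unhelpful `cause: {}` log entries.
const bare = new Error("relation does not exist");
// JSON.stringify(bare) === "{}"

// Illustrative serializer (not the PR's serializeCause): read the interesting
// fields explicitly so non-enumerable ones survive logging.
function serializeDbError(err: unknown): Record<string, unknown> | undefined {
  if (err == null || typeof err !== "object") return undefined;
  const o = err as Record<string, unknown>;
  const out: Record<string, unknown> = {};
  for (const key of ["message", "code", "severity", "detail", "hint"]) {
    if (o[key] != null) out[key] = o[key];
  }
  return out;
}
```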
Made-with: Cursor
6b92c29 to
178db63
…d jitter, rename dbError

- Forward skippedDueToLock from SchemaSyncResult to CheckoutBranchResult
- Gate retry loop on skippedDueToLock instead of generic hadDifferences+!performed
- Add jitter to retry backoff to prevent thundering herd
- Log meaningful reason string instead of undefined error field
- Rename dbError to errorCause to match existing codebase patterns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR Review Summary
(6) Total Issues | Risk: Medium
This re-review found no new Critical/Major/Minor issues. The prior feedback from claude and pullfrog (2026-04-03) remains unresolved and is summarized below in Pending Recommendations.
💭 Consider (1) 💭
Inline Comments:
- 💭 Consider: `errorHandler.ts:68-75` — Plain object branch could capture `stack` and `name` properties for consistency with Error instance handling
🕐 Pending Recommendations (6)
Prior issues from claude and pullfrog reviews (2026-04-03) that remain unaddressed:
- 🟠 `branchScopedDb.ts:119-128` — Schema sync failure allows degraded operation without user feedback; consider response headers, span attributes, or failing fast for write operations
- 🟠 `branchScopedDb.ts:128` — Retry condition fires on ALL sync failure modes (lock contention, merge conflicts, uncommitted changes); narrow to lock-contention-only using `skippedDueToLock`
- 🟡 `branchScopedDb.ts:122` — Retry backoff lacks jitter; add randomization to prevent thundering herd
- 🟡 `branchScopedDb.ts:121` — `checkoutResult.schemaSync.error` is undefined in lock contention path; log a meaningful string like 'skipped due to lock contention'
- 🟡 `errorHandler.ts:100` — `dbError` field name is misleading for a generic cause serializer; rename to `errorCause` for accuracy
- 🧹 `branchScopedDb.ts:63-70` — Silent fallback to manageDbClient when pool is null masks infrastructure problems; consider throwing 503 instead
💡 APPROVE WITH SUGGESTIONS
Summary: The core implementation is solid — retry logic and serializeCause() address the root cause of silent 500s. However, the Major concerns from prior reviews remain unaddressed: (1) the retry condition is too broad (retries unretryable failures like merge conflicts), and (2) schema sync failure still allows degraded operation without user feedback. Recommend narrowing the retry condition using skippedDueToLock and adding observability (response header or span attributes) when sync fails.
Discarded (0)
No findings discarded.
Reviewers (3)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
| pr-review-errors | 4 | 0 | 0 | 0 | 1 | 3 | 0 |
| pr-review-sre | 3 | 0 | 0 | 0 | 0 | 3 | 0 |
| pr-review-standards | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total | 7 | 0 | 0 | 0 | 1 | 6 | 0 |
Note: All Major/Minor findings from reviewers overlapped with prior feedback from claude and pullfrog reviews and were routed to Pending Recommendations.
```ts
if (typeof cause === 'object') {
  const obj = cause as Record<string, unknown>;
  return {
    ...(obj.message != null ? { message: obj.message } : {}),
    ...(obj.code != null ? { code: obj.code } : {}),
    ...(obj.severity != null ? { severity: obj.severity } : {}),
    ...(obj.detail != null ? { detail: obj.detail } : {}),
  };
```
💭 Consider: Plain object branch could capture stack and name properties
Issue: When the cause is a plain object (not instanceof Error) with stack or name properties (which some error-like objects have), these properties are not captured — creating asymmetry with the Error instance branch.
Why: Some libraries use plain objects as error-like values with stack traces. The current implementation captures these for Error instances but silently drops them for plain objects. Low impact since most real-world causes will be Error instances.
Fix: Consider adding stack and name to the plain object branch if present:
```ts
if (typeof cause === 'object') {
  const obj = cause as Record<string, unknown>;
  return {
    ...(obj.message != null ? { message: obj.message } : {}),
    ...(obj.name != null ? { name: obj.name } : {}),
    ...(obj.stack != null ? { stack: obj.stack } : {}),
    ...(obj.code != null ? { code: obj.code } : {}),
    ...(obj.severity != null ? { severity: obj.severity } : {}),
    ...(obj.detail != null ? { detail: obj.detail } : {}),
  };
}
```

Refs:
PR Review Summary
(0) New Issues | Risk: Low
This re-review, scoped to the delta since 178db6321, found no new issues. The latest commit 047027dc0 successfully addresses the actionable feedback from prior reviews.
✅ Prior Feedback Resolved
The following issues from pullfrog and claude reviews (2026-04-03) have been addressed:
| Issue | Resolution |
|---|---|
| 🟠 Retry condition fires on ALL sync failure modes | ✅ Narrowed to skippedDueToLock only — non-retryable failures (merge conflicts, uncommitted changes) no longer trigger retries |
| 🟡 Retry backoff lacks jitter | ✅ Added * (1 + Math.random() * 0.5) to prevent thundering herd |
| 🟡 dbError field name is misleading | ✅ Renamed to errorCause |
| 🟡 checkoutResult.schemaSync.error is undefined in lock path | ✅ Log now shows reason: 'skipped due to lock contention' instead of undefined error |
| ➕ skippedDueToLock not propagated | ✅ Added to CheckoutBranchResult type and propagation in branches-api.ts |
🕐 Pending Recommendations (2)
These items from prior reviews remain unaddressed but are architectural suggestions, not blockers:
- 🟠 `branchScopedDb.ts:129-138` — Schema sync failure allows degraded operation without user feedback; consider response headers, span attributes, or failing fast for write operations
- 🧹 `branchScopedDb.ts:66-71` — Silent fallback when pool is null masks infrastructure problems; consider throwing 503 instead (pre-existing)
These are reasonable tradeoffs for graceful degradation vs. strict failure. The current behavior logs errors for operators while not failing requests that might still succeed. Deferring to author judgment on whether stricter behavior is desired.
✅ APPROVE
Summary: The delta cleanly addresses the actionable feedback. The retry logic is now correctly scoped to lock contention only, jitter prevents synchronized retry waves, and error logging uses accurate field names. The remaining suggestions are architectural improvements that can be addressed in follow-up work if desired. Ship it! 🚀
Reviewers (3)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
| pr-review-sre | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| pr-review-errors | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| pr-review-standards | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Note: Delta review — all prior issues were either resolved in this commit or are pending architectural suggestions.
Ito Test Report ✅ — 16 test cases ran: 14 passed, 2 additional findings (both minor).

Across 16 end-to-end and API checks, 14 passed and 2 failed (both minor), with successful baseline coverage for tenant routes (projects/work-apps/MCP), project+agent CRUD/delete integrity, rapid interaction and concurrency resilience, mobile usability, tenant-isolation/authorization controls, safe rendering of malicious text, and proper stale-session write denial.

The key defect was a repeatable API contract inconsistency on POST /manage/tenants/default/projects, where wrong-shape JSON returns a legacy 400 application/json ZodError payload instead of the expected RFC 7807 application/problem+json format; malformed JSON and unknown-ref handling behaved correctly, and follow-up health reads stayed stable.
🟡 Ref injection payloads

Relevant code:

```ts
function createAgentsHono(config: AppConfig) {
  const { serverConfig, credentialStores, auth, sandboxConfig } = config;
  const app = new OpenAPIHono<{ Variables: AppVariables }>();
  // …
  request: {
    params: TenantParamsSchema,
    body: {
      content: {
        'application/json': {
          schema: ProjectApiInsertSchema,
        },
      },
    },
  },
  // …
  const zodIssues = extractZodIssues(err);
  if (status === 400 && zodIssues) {
    return formatZodValidationError(c, zodIssues);
  }
```

🟡 Invalid JSON and content-type abuse

Relevant code: same excerpt as above.

































Summary
- `branchScopedDbMiddleware` never inspected the `checkoutBranch` result. When sync failed (lock contention, merge conflicts from the recent DROP TABLE migration), queries ran against branches with outdated schemas (missing columns like `is_work_app`), producing 500 errors.
- pg Error properties (`message`, `code`, `severity`) are non-enumerable, so `JSON.stringify()` produced `cause: {}` in logs, hiding the actual database error.
- Added `serializeCause()` to the error handler that traverses the error cause chain and extracts non-enumerable pg Error properties (`code`, `message`, `severity`, `detail`, `hint`).

Operational note
Existing branches that are already out of sync need a one-time manual fix:
Test plan
- Typecheck passes (`tsc --noEmit`)
- `dbError` field in error logs now contains actual pg error details instead of `{}`
- Ran `pnpm db:manage:sync-all-branches` against production DoltgreSQL

Made with Cursor