
fix: handle silent schema sync failures in branch-scoped DB middleware#2999

Open
amikofalvy wants to merge 2 commits into main from fix/schema-sync-silent-failure

Conversation

@amikofalvy
Collaborator

Summary

  • Root cause: Schema sync failures during Dolt branch checkout were silently swallowed — branchScopedDbMiddleware never inspected the checkoutBranch result. When sync failed (lock contention, merge conflicts from recent DROP TABLE migration), queries ran against branches with outdated schemas (missing columns like is_work_app), producing 500 errors.
  • Invisible errors: The pg driver's Error properties (message, code, severity) are non-enumerable, so JSON.stringify() produced cause: {} in logs, hiding the actual database error.
  • Adds retry logic (up to 2 attempts with backoff) for schema sync failures caused by advisory lock contention
  • Logs explicit errors with full sync failure details when retries are exhausted
  • Adds serializeCause() to the error handler that traverses the error cause chain and extracts non-enumerable pg Error properties (code, message, severity, detail, hint)

Operational note

Existing branches that are already out of sync need a one-time manual fix:

pnpm db:manage:sync-all-branches

Test plan

  • TypeScript compiles cleanly (tsc --noEmit)
  • Existing middleware tests pass
  • Existing error utility tests pass
  • Verify on staging: schema sync retries appear in logs when lock is contended
  • Verify on staging: dbError field in error logs now contains actual pg error details instead of {}
  • Run pnpm db:manage:sync-all-branches against production DoltgreSQL

Made with Cursor

@vercel

vercel bot commented Apr 3, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| agents-api | Ready | Preview, Comment | Apr 8, 2026 8:34am |
| agents-docs | Ready | Preview, Comment | Apr 8, 2026 8:34am |
| agents-manage-ui | Ready | Preview, Comment | Apr 8, 2026 8:34am |


@changeset-bot

changeset-bot bot commented Apr 3, 2026

⚠️ No Changeset found

Latest commit: 047027d

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@pullfrog
Contributor

pullfrog bot commented Apr 3, 2026

TL;DR — Fixes silent 500 errors caused by schema sync failures during Dolt branch checkout. The middleware now inspects the checkoutBranch result, retries on lock contention with jittered backoff, and the error handler properly serializes non-enumerable pg Error properties that were previously logged as {}.

Key changes

  • Retry schema sync on lock contention in branchScopedDbMiddleware — inspects the new skippedDueToLock flag on CheckoutBranchResult and retries up to 2 times with jittered linear backoff, logging an explicit error if retries are exhausted.
  • Add serializeCause() to error handler — traverses the error cause chain extracting non-enumerable pg Error properties (code, message, severity, detail, hint) so database errors appear in logs instead of empty {}.
  • Surface skippedDueToLock in CheckoutBranchResult — propagates the advisory lock skip flag from the schema sync result so callers can distinguish lock contention from other failure modes.

Summary | 3 files | 2 commits | base: main | head: fix/schema-sync-silent-failure


Schema sync retry with jittered backoff in branch-scoped DB middleware

Before: branchScopedDbMiddleware fire-and-forgot checkoutBranch — if schema sync failed (lock contention, merge conflict), queries silently ran against a stale branch missing columns like is_work_app, producing 500s.
After: The middleware inspects CheckoutBranchResult.schemaSync, retries up to MAX_SCHEMA_SYNC_RETRIES (2) with jittered linear backoff (50ms × attempt × (1 + random * 0.5)), and logs an explicit error with the sync failure details when retries are exhausted.

The retry loop targets a single retryable failure mode: skippedDueToLock: true, meaning another request held the advisory lock. Other failures (merge conflicts, uncommitted changes) are not retried. When retries are exhausted and hadDifferences is true but performed is false, an error is logged so the failure is visible rather than silent.
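A simplified sketch of that gated retry (hypothetical types; checkoutWithRetry and the checkout callback are stand-ins for the PR's actual checkoutBranch(requestDb)(...) signature):

```typescript
// Simplified, hypothetical sketch of the middleware retry loop.
type SchemaSync = {
  performed: boolean;
  hadDifferences: boolean;
  skippedDueToLock?: boolean;
  error?: string;
};
type CheckoutBranchResult = { schemaSync: SchemaSync };

const MAX_SCHEMA_SYNC_RETRIES = 2;

async function checkoutWithRetry(
  checkout: () => Promise<CheckoutBranchResult>,
): Promise<{ result: CheckoutBranchResult; attempts: number }> {
  let attempts = 0;
  let result = await checkout();
  // Only lock contention is retryable; merge conflicts or uncommitted-changes
  // errors would fail identically on every attempt.
  while (result.schemaSync.skippedDueToLock && attempts < MAX_SCHEMA_SYNC_RETRIES) {
    attempts++;
    // Jittered linear backoff: 50ms * attempt, stretched by up to 50%.
    const delayMs = 50 * attempts * (1 + Math.random() * 0.5);
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    result = await checkout();
  }
  return { result, attempts };
}
```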

branchScopedDb.ts


Proper serialization of database error causes

Before: logServerError passed the raw err object to the logger. Because pg driver Error properties (message, code, severity) are non-enumerable, JSON.stringify produced cause: {} — hiding the actual database error.
After: A new serializeCause() function walks the cause chain (up to depth 3), explicitly extracts non-enumerable properties, and attaches the result as errorCause in the log payload. Both the expected and unexpected error branches now include structured message, errorCause, and stack fields.

Why not just fix the logger serializer? The pg driver sets properties as non-enumerable by design. Rather than patching a global serializer (which could have unintended side effects), `serializeCause` is a targeted extraction at the point of logging, keeping the fix scoped to error handling.
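The failure mode is easy to reproduce in isolation (the error message and code value below are simulated, not taken from the PR):

```typescript
// Demonstrates why the logged cause appeared as {}: Error's own properties
// (message, stack) and pg's fields (code, severity, ...) are non-enumerable,
// so JSON serialization drops them all.
const dbErr = new Error('column "is_work_app" does not exist');
// pg attaches its fields non-enumerably; simulated here with defineProperty.
Object.defineProperty(dbErr, 'code', { value: '42703', enumerable: false });

console.log(JSON.stringify({ cause: dbErr })); // {"cause":{}}
console.log((dbErr as Error & { code?: string }).code); // 42703, visible only via direct access
```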

errorHandler.ts


skippedDueToLock flag in CheckoutBranchResult

Before: Callers had no way to distinguish "sync skipped due to advisory lock contention" from other sync failures.
After: CheckoutBranchResult.schemaSync.skippedDueToLock is now surfaced from the underlying sync result, enabling targeted retry logic in the middleware.

branches-api.ts

Pullfrog | View workflow run | Triggered by Pullfrog | Using Claude Opus

Contributor

@pullfrog pullfrog bot left a comment


Solid fix for a real production pain point — the root cause analysis on non-enumerable pg Error properties is spot-on. Two actionable issues to address before merging: the retry loop fires on unretryable failures (merge conflicts), and dbError is a misleading field name for a generic cause serializer.


Comment on lines +97 to +117
while (
  checkoutResult.schemaSync.hadDifferences &&
  !checkoutResult.schemaSync.performed &&
  syncAttempts < MAX_SCHEMA_SYNC_RETRIES
) {
  syncAttempts++;
  logger.warn(
    {
      branch: resolvedRef.name,
      error: checkoutResult.schemaSync.error,
      attempt: syncAttempts,
    },
    'Schema sync not performed, retrying'
  );
  await new Promise((resolve) => setTimeout(resolve, 50 * syncAttempts));
  checkoutResult = await checkoutBranch(requestDb)({
    branchName: resolvedRef.name,
    syncSchema: true,
    autoCommitPending: true,
  });
}
Contributor


The retry condition (hadDifferences && !performed) fires on all sync failure modes — lock contention, merge conflicts, uncommitted-changes errors, and generic failures. But the PR description says this targets lock contention specifically. Retrying merge conflicts or "uncommitted changes" errors will fail identically each time, burning 2 retries + backoff for no benefit.

Consider narrowing the retry condition to lock contention only. The SchemaSyncResult already has a skippedDueToLock field, but it's dropped during the mapping to CheckoutBranchResult. Either:

  1. Forward skippedDueToLock to CheckoutBranchResult and gate the retry on it, or
  2. At minimum, only retry when error is undefined (which currently signals lock contention since the lock-skip path doesn't set an error string).

Comment on lines +103 to +110
logger.warn(
  {
    branch: resolvedRef.name,
    error: checkoutResult.schemaSync.error,
    attempt: syncAttempts,
  },
  'Schema sync not performed, retrying'
);
Contributor


checkoutResult.schemaSync.error is undefined in the lock contention path — SchemaSyncResult.skippedDueToLock is set but not forwarded to CheckoutBranchResult.schemaSync.error. This warning log will show error: undefined, which makes it hard for operators to understand why the retry happened. Consider logging a meaningful string like 'skipped due to lock contention' when the error field is absent, or better yet, forward skippedDueToLock from the result.

error: err,
message: errorMessage,
stack: errorStack,
...(cause && { dbError: cause }),
Contributor


dbError is a misleading field name — serializeCause handles any error's cause chain (HTTP errors, validation wrappers, etc.), not just database errors. Since this is a new field with no downstream consumers, renaming to cause would be more accurate and future-proof. If you want to avoid collision with Pino's own cause handling, errorCause (already used in AgentSession.ts:1825) would be consistent with existing patterns.

Suggested change
...(cause && { dbError: cause }),
...(cause && { errorCause: cause }),

{
error: err,
message: errorMessage,
...(cause && { dbError: cause }),
Contributor


Same field naming concern as above.

Suggested change
...(cause && { dbError: cause }),
...(cause && { errorCause: cause }),

Contributor

@claude claude bot left a comment


PR Review Summary

(3) Total Issues | Risk: Medium

🟠 Major (1)

🟠 1) branchScopedDb.ts:119-128 Schema sync failure allows degraded operation without user feedback

Issue: After exhausting schema sync retries, the middleware logs an error but continues request execution without any indication to the API caller. The user receives a normal response even though subsequent queries may fail unpredictably due to outdated schema (e.g., missing columns like is_work_app).

Why: This creates a debugging nightmare where the symptom (column errors, 500s) is disconnected from the cause (schema sync failure). The log message helps operators but provides no feedback to API consumers. For write operations especially, proceeding with a stale schema is risky since data may be written to incomplete tables.

Fix: Consider one or more of the following:

  1. Set a response header for observability without breaking the request:
if (checkoutResult.schemaSync.hadDifferences && !checkoutResult.schemaSync.performed) {
  logger.error(...);
  c.header('X-Schema-Sync-Warning', 'Schema sync failed; queries may use outdated schema');
}
  2. Add span attributes so downstream failures can be correlated via tracing:
import { trace } from '@opentelemetry/api';
// ...
const span = trace.getActiveSpan();
span?.setAttribute('schema_sync.failed', true);
span?.setAttribute('schema_sync.attempts', syncAttempts + 1);
  3. Fail fast for write operations (stricter option):
const isWriteMethod = !['GET', 'HEAD', 'OPTIONS'].includes(method);
if (checkoutResult.schemaSync.hadDifferences && !checkoutResult.schemaSync.performed && isWriteMethod) {
  throw new HTTPException(503, { message: 'Schema sync unavailable; please retry' });
}


🟡 Minor (1)

Inline Comments:

  • 🟡 Minor: branchScopedDb.ts:111 Retry backoff lacks jitter — add randomization to prevent thundering herd

💭 Consider (1)

💭 1) errorHandler.ts:94-104 Original error object no longer logged directly

Issue: The previous implementation logged error: err directly. The new implementation extracts specific fields (message, stack, dbError) but no longer includes the full error object, which could lose properties that aren't explicitly extracted (e.g., custom properties from third-party libraries like HTTP client errors with response, config fields).

Why: While pg errors are now better handled, other error types may lose context. Low impact since the critical pg properties are now captured.

Fix: Consider keeping the original error object for non-pg errors, or accept this tradeoff since the extracted fields cover the most common debugging needs.


🧹 While You're Here (1)

🧹 1) branchScopedDb.ts:63-70 Silent fallback to manageDbClient masks infrastructure problems

Issue: When getPoolFromClient() returns null (line 65), the code logs an error but falls back to using manageDbClient directly and continues processing. This masks a fundamental infrastructure problem: the database client doesn't have the expected structure. Branch scoping will not work correctly.

Why: If pool extraction fails in production, every subsequent request will silently bypass branch scoping, potentially causing data isolation violations. This is pre-existing behavior, not introduced by this PR, but worth noting since you're already improving error handling in this file.

Fix: Consider throwing a 503 instead of silently proceeding:

if (!pool) {
  logger.error({}, 'Could not get connection pool from dbClient - branch scoping unavailable');
  throw new HTTPException(503, { message: 'Database connection pool unavailable' });
}



💡 APPROVE WITH SUGGESTIONS

Summary: This PR effectively addresses a real operational pain point — silent schema sync failures and invisible pg errors were causing 500s that were difficult to diagnose. The retry logic and serializeCause() helper are solid improvements. The main suggestion is to consider adding some form of feedback (header, span attribute, or controlled failure) when schema sync fails so that downstream errors can be correlated with the root cause. The jitter suggestion is a quick win for production reliability.

Discarded (1)

| Location | Issue | Reason Discarded |
| --- | --- | --- |
| errorHandler.ts:54-78 | serializeCause has bounded recursion depth (positive observation) | Not an issue — this was a validation that the implementation correctly bounds recursion at depth 3 to prevent stack overflow |

Reviewers (3)

| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
| --- | --- | --- | --- | --- | --- | --- | --- |
| pr-review-errors | 3 | 1 | 1 | 1 | 0 | 0 | 0 |
| pr-review-sre | 4 | 0 | 0 | 0 | 1 | 0 | 1 |
| pr-review-standards | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total | 7 | 1 | 1 | 1 | 1 | 0 | 1 |

Note: The schema sync failure finding from pr-review-errors and pr-review-sre were merged as they identified the same issue from different angles.

    },
    'Schema sync not performed, retrying'
  );
  await new Promise((resolve) => setTimeout(resolve, 50 * syncAttempts));
Contributor


🟡 Minor: Retry backoff lacks jitter

Issue: The retry delay uses linear backoff (50 * syncAttempts) without jitter. When multiple concurrent requests hit lock contention simultaneously, they'll all retry at the same intervals (50ms, 100ms), creating synchronized retry waves that can perpetuate contention.

Why: The codebase already uses jitter in similar patterns (withRetry.ts uses Math.random() * expDelay). Lock contention scenarios are particularly susceptible to thundering herd problems when retries are synchronized.

Fix:

Suggested change
await new Promise((resolve) => setTimeout(resolve, 50 * syncAttempts));
await new Promise((resolve) => setTimeout(resolve, 50 * syncAttempts * (1 + Math.random() * 0.5)));


@github-actions github-actions bot deleted a comment from claude bot Apr 3, 2026
@github-actions
Contributor

github-actions bot commented Apr 3, 2026

Preview URLs

Use these stable preview aliases for testing this PR:

These point to the same Vercel preview deployment as the bot comment, but they stay stable and easier to find.

Raw Vercel deployment URLs

@itoqa

itoqa bot commented Apr 3, 2026

Ito Test Report ✅

16 test cases ran. 4 additional findings, 12 passed.

Overall, 16 end-to-end/local API tests ran, with 12 passing and 4 failing. Baseline routed reads (Agents/Triggers/Settings), branch list/create and duplicate/conflict handling, malformed-input rejection, deletion safety, stale deep-link handling, auth-boundary checks, forced-503 sanitization, and post-stress recovery all behaved as expected, with no observed data leakage or broad stability regressions. The key defects were: a high-severity Project Settings save blocker caused by UI edit-form ID regex validation on a disabled field (which also invalidated the two-tab overlap expectations); a high-severity concurrent branch-scoped PATCH burst path that returns 500s when schema-sync lock retries are exhausted and requests proceed unsynced; and a medium-severity mobile scheduled-trigger creation failure where invalid browser timezone defaults are not validated or caught and escalate to a 500.

✅ Passed (12)

| Category | Summary | Screenshot |
| --- | --- | --- |
| Adversarial | Concurrent triple submit produced one successful create and conflict responses for duplicates, and branch relist confirmed exactly one qa-dupe-test branch. | ADV-1 |
| Adversarial | Malformed ref inputs were rejected with controlled 4xx problem responses and no stack/SQL leakage; valid ref=main still returned 200 afterward. | ADV-2 |
| Adversarial | Script-injection and SQL-like branch names were rejected with 400 validation responses, and branch relist confirmed these payloads were not created or reflected as branches. | ADV-3 |
| Adversarial | Unauthorized requests to the real dev-session endpoint are denied (401), valid bypass secret creates a session (200), and authenticated projects access succeeds. | ADV-4 |
| Adversarial | Direct stale deep-link after deletion consistently produced controlled not-found handling, and repeated back/forward navigation remained safe without data leakage. | ADV-5 |
| Adversarial | Forced 503 handling stayed sanitized (no sensitive internals exposed) and the app recovered after service restoration. | ADV-6 |
| Edge | Created two disposable projects, deleted project A from main context (204), and confirmed it disappeared from API/UI listings while project B remained. | EDGE-2 |
| Edge | Both non-main delete attempts were safely blocked with explicit 404 problem responses and did not remove the project. | EDGE-3 |
| Edge | During reload and back-navigation around save attempts, the Settings page recovered to a consistent state (non-saved baseline), and controls remained functional with no deadlock spinner/toast. | EDGE-4 |
| Logic | Post-stress recovery sequence succeeded: projects read, settings update, and branch read all completed with stable UI behavior. | LOGIC-1 |
| Happy-path | Baseline project routes (Agents, Triggers, Settings) loaded without generic fallback and without 5xx responses. | ROUTE-1 |
| Happy-path | Branch list returned 200, creating qa-retry-check-1775193521 returned 201, and relist returned 200 with the new branch present exactly once. | ROUTE-2 |
ℹ️ Additional Findings (4)

These findings are unrelated to the current changes but were observed during testing.

| Category | Summary | Screenshot |
| --- | --- | --- |
| Edge | ⚠️ Concurrent PATCH burst produced repeated 500 responses instead of deterministic success/conflict handling. | EDGE-1 |
| Edge | 🟠 In two-tab overlapping edits, neither save persisted and both tabs converged to baseline because the same edit-form validation defect prevents updates before conflict policy can apply. | EDGE-5 |
| Edge | 🟠 Mobile scheduled-trigger creation can fail with HTTP 500 when an invalid browser timezone is used as default cron timezone. | EDGE-6 |
| Happy-path | ⚠️ Settings updates are blocked for existing IDs like proj_qa because edit-mode form validation still enforces a stricter ID regex on the disabled ID field, so changes do not persist. | ROUTE-3 |
⚠️ Concurrent write burst does not produce user-facing schema-sync failure cascade
  • What failed: Most concurrent PATCH requests returned HTTP 500 instead of deterministic success or explicit conflict handling, even though a follow-up read remained coherent.
  • Impact: Concurrent edits can fail unpredictably with server errors in normal multi-user or automation bursts. This reduces reliability of project updates and can force manual retries.
  • Steps to reproduce:
    1. Authenticate to the local Manage API with a valid dev session.
    2. Send 5-10 concurrent PATCH requests to /manage/tenants/default/projects/edge-project on the same branch-scoped context.
    3. Observe burst responses and then run a follow-up GET projects request to confirm the list still loads.
  • Stub / mock context: A development bypass token was used to authenticate API requests and the test targeted a disposable project (edge-project) for safety. No response stubs, route interception, or mock payload injection were applied to the PATCH burst itself.
  • Code analysis: I inspected the branch-scoped middleware and schema-sync lock behavior in production code. The middleware retries schema sync briefly, but after retries are exhausted it only logs and still executes the route handler, while the schema-sync routine explicitly reports lock-skipped sync attempts as unsynced; this combination makes write queries run against stale schema under contention.
  • Why this is likely a bug: The code explicitly allows request execution to continue after confirmed unsynced schema state, which is a direct production-code path to write-query 500s under concurrent branch traffic.

Relevant code:

agents-api/src/middleware/branchScopedDb.ts (lines 97-127)

while (
  checkoutResult.schemaSync.hadDifferences &&
  !checkoutResult.schemaSync.performed &&
  syncAttempts < MAX_SCHEMA_SYNC_RETRIES
) {
  syncAttempts++;
  await new Promise((resolve) => setTimeout(resolve, 50 * syncAttempts));
  checkoutResult = await checkoutBranch(requestDb)({
    branchName: resolvedRef.name,
    syncSchema: true,
    autoCommitPending: true,
  });
}

if (checkoutResult.schemaSync.hadDifferences && !checkoutResult.schemaSync.performed) {
  logger.error(
    { branch: resolvedRef.name, syncError: checkoutResult.schemaSync.error, attempts: syncAttempts + 1 },
    'Schema sync failed after retries — queries may fail due to outdated schema'
  );
}

packages/agents-core/src/dolt/schema-sync.ts (lines 217-224)

if (!lockAcquired) {
  // Another request is currently syncing this branch
  // Return early - that request will handle the sync
  return {
    synced: false,
    hadDifferences: true,
    skippedDueToLock: true,
  };
}
🟠 Two-tab overlapping edits remain deterministic
  • What failed: Both tabs converge to baseline with neither edit persisted; expected behavior is deterministic last-write-wins or explicit conflict handling after at least one successful write.
  • Impact: Concurrent-edit behavior cannot be validated because writes are blocked at the form layer, so users can lose intended updates and receive no meaningful conflict outcome. This undermines reliability of collaborative editing workflows.
  • Steps to reproduce:
    1. Open /default/projects/proj_qa/settings in two browser tabs.
    2. In tab A, edit Description and click Update project.
    3. Without refreshing tab B, edit Description to a different value and click Update project.
    4. Reload both tabs and verify both revert to baseline with no persisted overlap outcome.
  • Stub / mock context: Real sign-in was bypassed with a local dev-session shortcut so the run could establish an admin session deterministically, and local service host wiring was adjusted to match the containerized environment. No route-level API response mocks were used for the project settings save flow.
  • Code analysis: I reviewed the same settings update pipeline used by this test and found no separate two-tab conflict branch in the failing path; both tabs submit through the same projectSchema gate. Because edit mode still validates the disabled ID against a hyphen-only regex, saves can be blocked in both tabs regardless of overlap sequencing.
  • Why this is likely a bug: Both-tab non-persistence follows directly from the same client validation mismatch, so overlap logic never gets a valid persisted write to resolve.

Relevant code:

agents-manage-ui/src/components/projects/form/project-form.tsx (lines 196-203)

<GenericInput
  control={form.control}
  name="id"
  label="Id"
  placeholder="my-project"
  description="Choose a unique identifier for this project. This cannot be changed later."
  disabled={!!projectId || readOnly}
  isRequired
/>

agents-manage-ui/src/components/projects/form/validation.ts (lines 35-38)

.regex(/^[a-z0-9]([a-z0-9-]*[a-z0-9])?$/, {
  message:
    'ID must start and end with lowercase alphanumeric characters, and may contain hyphens',
}),

packages/agents-core/src/validation/schemas/shared.ts (lines 5-14)

const URL_SAFE_ID_PATTERN = /^[a-zA-Z0-9\-_.]+$/;

const ResourceIdSchema = z
  .string()
  .trim()
  .nonempty('Id is required')
  .max(MAX_ID_LENGTH)
  .regex(URL_SAFE_ID_PATTERN, {
    message: 'ID must contain only letters, numbers, hyphens, underscores, and dots',
  })
🟠 Mobile viewport branch-dependent workflow remains functional
  • What failed: Submit returns a server error (500) and no trigger is created; expected behavior is successful creation or a handled validation error for invalid timezone input.
  • Impact: Users on environments that resolve to a non-IANA browser timezone can be blocked from creating scheduled triggers. This breaks core automation setup in the UI and surfaces as a server failure instead of actionable validation.
  • Steps to reproduce:
    1. Open the Manage UI on iPhone 13 viewport and go to a project's Triggers area.
    2. Click New scheduled trigger, select an agent, and continue to the scheduled-trigger form.
    3. Keep cron schedule mode with default timezone populated from the browser and submit Create Scheduled Trigger.
    4. Observe a server error response and confirm the trigger is not created.
  • Stub / mock context: Authentication used a local dev-session bypass for test access, but no route-level response mocking or scheduled-trigger API stubbing was applied to this flow; the observed 500 came from the real create path under local services.
  • Code analysis: I inspected the mobile form defaults, scheduled trigger create route, and cron next-run helper. The UI defaults cronTimezone directly from browser timezone, and the create API path computes next run with cron parser without validating timezone format or catching parser errors on create.
  • Why this is likely a bug: Production create flow accepts timezone strings from client defaults but does not guard cron parsing errors at create time, so invalid timezones can escalate to 500 instead of a controlled 4xx validation path.

Relevant code:

agents-manage-ui/src/components/scheduled-triggers/scheduled-trigger-form.tsx (lines 130-143)

const getDefaultValues = (): ScheduledTriggerFormData => {
  // Get browser's timezone for new triggers
  const browserTimezone = Intl.DateTimeFormat().resolvedOptions().timeZone || 'UTC';

  if (!trigger) {
    const p = defaultsFromParams;
    return {
      enabled: true,
      name: '',
      description: '',
      scheduleType: (p?.scheduleType as 'cron' | 'one-time') || 'cron',
      cronExpression: p?.cronExpression || '',
      cronTimezone: p?.cronTimezone || browserTimezone,

packages/agents-core/src/utils/compute-next-run-at.ts (lines 15-22)

if (cronExpression) {
  const baseDate = lastScheduledFor ? new Date(lastScheduledFor) : new Date();
  const interval = CronExpressionParser.parse(cronExpression, {
    currentDate: baseDate,
    tz: cronTimezone || 'UTC',
  });
  return interval.next().toISOString();
}

agents-api/src/domains/manage/routes/scheduledTriggers.ts (lines 370-377)

const enabled = body.enabled ?? true;
const nextRunAt = enabled
  ? computeNextRunAt({
      cronExpression: body.cronExpression,
      cronTimezone: body.cronTimezone,
      runAt: body.runAt,
    })
  : null;
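One possible guard (hypothetical, not part of this PR) would validate the timezone before it reaches the cron parser, turning a bad browser default into a controlled 4xx instead of an unhandled 500:

```typescript
// Hypothetical validation helper: Intl.DateTimeFormat throws a RangeError
// for timezone identifiers it does not recognize, so a try/catch makes a
// cheap IANA-timezone check with no extra dependencies.
function isValidTimezone(tz: string): boolean {
  try {
    new Intl.DateTimeFormat('en-US', { timeZone: tz });
    return true;
  } catch {
    return false;
  }
}

// Sketch of where it could be applied in the create route (names assumed):
// if (body.cronTimezone && !isValidTimezone(body.cronTimezone)) {
//   throw new HTTPException(400, { message: `Invalid timezone: ${body.cronTimezone}` });
// }
```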
⚠️ Project update mutation remains stable through connection-scoped checkout
  • What failed: The save does not persist and the form surfaces an ID regex error even though the ID field is disabled during edit; expected behavior is that editable fields save successfully for existing valid project IDs.
  • Impact: Project settings cannot be reliably updated for existing projects whose IDs are valid in backend schemas but rejected by the edit form regex. This blocks routine configuration changes and creates false "successful save" expectations.
  • Steps to reproduce:
    1. Open /default/projects/proj_qa/settings for an existing project whose ID includes an underscore.
    2. Edit Name or Description and click Update project.
    3. Reload the settings page.
    4. Observe values revert and an ID validation error blocks a successful update path.
  • Stub / mock context: Real sign-in was bypassed with a local dev-session shortcut so the run could establish an admin session deterministically, and local service host wiring was adjusted to match the containerized environment. No route-level API response mocks were used for the project settings save flow.
  • Code analysis: I inspected the settings form submission path in agents-manage-ui and the API update schema in packages/agents-core. The form always resolves against projectSchema (which validates id with a strict hyphen-only regex) while edit mode disables the ID input; backend update schema explicitly omits id, so client-side validation is stricter than the production update contract and can block valid updates.
  • Why this is likely a bug: The UI edit flow validates and blocks on an immutable id field that the backend update contract does not even accept, so legitimate edits are rejected by client logic before persistence.

Relevant code:

agents-manage-ui/src/components/projects/form/validation.ts (lines 31-38)

export const projectSchema = z.object({
  id: z
    .string()
    .min(1, 'Project ID is required')
    .regex(/^[a-z0-9]([a-z0-9-]*[a-z0-9])?$/, {
      message:
        'ID must start and end with lowercase alphanumeric characters, and may contain hyphens',
    }),

agents-manage-ui/src/components/projects/form/project-form.tsx (lines 122-124)

const form = useForm<ProjectFormData>({
  resolver: zodResolver(projectSchema),
  defaultValues: initialData ? createDefaultValues(initialData) : { ...defaultValues },
});

packages/agents-core/src/validation/schemas.ts (lines 2511-2514)

export const ProjectUpdateSchema = ProjectInsertSchema.partial().omit({
  id: true,
  tenantId: true,
});
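The mismatch can be reproduced directly from the two patterns quoted above (the ID value is illustrative):

```typescript
// The UI form's hyphen-only rule vs. the backend's URL-safe rule that also
// allows underscores and dots; an existing id like proj_qa passes the second
// but fails the first, which is exactly the reported save blocker.
const uiIdPattern = /^[a-z0-9]([a-z0-9-]*[a-z0-9])?$/; // validation.ts
const urlSafeIdPattern = /^[a-zA-Z0-9\-_.]+$/; // shared.ts URL_SAFE_ID_PATTERN

const existingId = 'proj_qa';

console.log(urlSafeIdPattern.test(existingId)); // true: backend accepts the id
console.log(uiIdPattern.test(existingId)); // false: edit form blocks the save
```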

Commit: 6b92c29

View Full Run



Schema sync failures during Dolt branch checkout were silently swallowed,
causing queries to run against branches with outdated schemas (missing
columns like is_work_app), producing 500 errors with unhelpful cause: {}
in logs.

- Capture and inspect checkoutBranch result for sync failures
- Retry schema sync up to 2 times on lock contention
- Log explicit errors when sync fails after retries
- Serialize pg Error cause chain properly (non-enumerable properties)
  so actual DB errors (code, message, severity) appear in logs

Made-with: Cursor
…d jitter, rename dbError

- Forward skippedDueToLock from SchemaSyncResult to CheckoutBranchResult
- Gate retry loop on skippedDueToLock instead of generic hadDifferences+!performed
- Add jitter to retry backoff to prevent thundering herd
- Log meaningful reason string instead of undefined error field
- Rename dbError to errorCause to match existing codebase patterns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
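The non-enumerable-properties problem the commit describes can be reproduced in isolation. The sketch below is illustrative, not the PR's actual serializeCause() implementation, and the pg-style error is simulated with Object.defineProperty:

```typescript
// Error's own `message` and `stack` are non-enumerable, so JSON.stringify
// sees an empty object. Simulate a pg-style error whose `code` is also
// non-enumerable (assumption for illustration).
const pgLikeError = new Error('relation "projects" does not exist');
Object.defineProperty(pgLikeError, 'code', { value: '42P01', enumerable: false });

console.log(JSON.stringify({ cause: pgLikeError })); // → {"cause":{}}

// Sketch of a cause serializer: read known fields explicitly instead of
// relying on enumeration. (The field list here is illustrative.)
function serializeCause(cause: unknown): unknown {
  if (cause instanceof Error) {
    const e = cause as Error & { code?: string; severity?: string };
    return {
      message: e.message,
      ...(e.code != null ? { code: e.code } : {}),
      ...(e.severity != null ? { severity: e.severity } : {}),
    };
  }
  return cause;
}

console.log(JSON.stringify({ cause: serializeCause(pgLikeError) }));
// → {"cause":{"message":"relation \"projects\" does not exist","code":"42P01"}}
```

Reading the properties by name works regardless of enumerability, which is why an explicit serializer surfaces the real database error in logs.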

@claude claude bot left a comment


PR Review Summary

(6) Total Issues | Risk: Medium

This re-review found no new Critical/Major/Minor issues. The prior feedback from claude and pullfrog (2026-04-03) remains unresolved and is summarized below in Pending Recommendations.

💭 Consider (1) 💭

Inline Comments:

  • 💭 Consider: errorHandler.ts:68-75 Plain object branch could capture stack and name properties for consistency with Error instance handling

🕐 Pending Recommendations (6)

Prior issues from claude and pullfrog reviews (2026-04-03) that remain unaddressed:

  • 🟠 branchScopedDb.ts:119-128 Schema sync failure allows degraded operation without user feedback — consider response headers, span attributes, or failing fast for write operations
  • 🟠 branchScopedDb.ts:128 Retry condition fires on ALL sync failure modes (lock contention, merge conflicts, uncommitted changes) — narrow to lock-contention-only using skippedDueToLock
  • 🟡 branchScopedDb.ts:122 Retry backoff lacks jitter — add randomization to prevent thundering herd
  • 🟡 branchScopedDb.ts:121 checkoutResult.schemaSync.error is undefined in lock contention path — log a meaningful string like 'skipped due to lock contention'
  • 🟡 errorHandler.ts:100 dbError field name is misleading for generic cause serializer — rename to errorCause for accuracy
  • 🧹 branchScopedDb.ts:63-70 Silent fallback to manageDbClient when pool is null masks infrastructure problems — consider throwing 503 instead

💡 APPROVE WITH SUGGESTIONS

Summary: The core implementation is solid — retry logic and serializeCause() address the root cause of silent 500s. However, the Major concerns from prior reviews remain unaddressed: (1) the retry condition is too broad (retries unretryable failures like merge conflicts), and (2) schema sync failure still allows degraded operation without user feedback. Recommend narrowing the retry condition using skippedDueToLock and adding observability (response header or span attributes) when sync fails.

Discarded (0)

No findings discarded.

Reviewers (3)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
| --- | --- | --- | --- | --- | --- | --- | --- |
| pr-review-errors | 4 | 0 | 0 | 0 | 1 | 3 | 0 |
| pr-review-sre | 3 | 0 | 0 | 0 | 0 | 3 | 0 |
| pr-review-standards | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total | 7 | 0 | 0 | 0 | 1 | 6 | 0 |

Note: All Major/Minor findings from reviewers overlapped with prior feedback from claude and pullfrog reviews and were routed to Pending Recommendations.

Comment on lines +68 to +75
if (typeof cause === 'object') {
  const obj = cause as Record<string, unknown>;
  return {
    ...(obj.message != null ? { message: obj.message } : {}),
    ...(obj.code != null ? { code: obj.code } : {}),
    ...(obj.severity != null ? { severity: obj.severity } : {}),
    ...(obj.detail != null ? { detail: obj.detail } : {}),
  };

💭 Consider: Plain object branch could capture stack and name properties

Issue: When the cause is a plain object (not instanceof Error) with stack or name properties (which some error-like objects have), these properties are not captured — creating asymmetry with the Error instance branch.

Why: Some libraries use plain objects as error-like values with stack traces. The current implementation captures these for Error instances but silently drops them for plain objects. Low impact since most real-world causes will be Error instances.

Fix: Consider adding stack and name to the plain object branch if present:

if (typeof cause === 'object') {
  const obj = cause as Record<string, unknown>;
  return {
    ...(obj.message != null ? { message: obj.message } : {}),
    ...(obj.name != null ? { name: obj.name } : {}),
    ...(obj.stack != null ? { stack: obj.stack } : {}),
    ...(obj.code != null ? { code: obj.code } : {}),
    ...(obj.severity != null ? { severity: obj.severity } : {}),
    ...(obj.detail != null ? { detail: obj.detail } : {}),
  };
}


@claude claude bot left a comment


PR Review Summary

(0) New Issues | Risk: Low

This re-review, scoped to the delta since 178db6321, found no new issues. The latest commit 047027dc0 successfully addresses the actionable feedback from prior reviews.

✅ Prior Feedback Resolved

The following issues from pullfrog and claude reviews (2026-04-03) have been addressed:

| Issue | Resolution |
| --- | --- |
| 🟠 Retry condition fires on ALL sync failure modes | ✅ Narrowed to skippedDueToLock only — non-retryable failures (merge conflicts, uncommitted changes) no longer trigger retries |
| 🟡 Retry backoff lacks jitter | ✅ Added * (1 + Math.random() * 0.5) to prevent thundering herd |
| 🟡 dbError field name is misleading | ✅ Renamed to errorCause |
| 🟡 checkoutResult.schemaSync.error is undefined in lock path | ✅ Log now shows reason: 'skipped due to lock contention' instead of undefined error |
| skippedDueToLock not propagated | ✅ Added to CheckoutBranchResult type and propagation in branches-api.ts |

🕐 Pending Recommendations (2)

These items from prior reviews remain unaddressed but are architectural suggestions, not blockers:

  • 🟠 branchScopedDb.ts:129-138 Schema sync failure allows degraded operation without user feedback — consider response headers, span attributes, or failing fast for write operations
  • 🧹 branchScopedDb.ts:66-71 Silent fallback when pool is null masks infrastructure problems — consider throwing 503 instead (pre-existing)

These are reasonable tradeoffs for graceful degradation vs. strict failure. The current behavior logs errors for operators while not failing requests that might still succeed. Deferring to author judgment on whether stricter behavior is desired.


✅ APPROVE

Summary: The delta cleanly addresses the actionable feedback. The retry logic is now correctly scoped to lock contention only, jitter prevents synchronized retry waves, and error logging uses accurate field names. The remaining suggestions are architectural improvements that can be addressed in follow-up work if desired. Ship it! 🚀
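The retry shape the reviews converged on — retry only on lock contention, with jittered exponential backoff — can be sketched as follows. Function names and the result shape are assumptions for illustration, not the PR's actual API:

```typescript
interface CheckoutResult {
  schemaSync?: { skippedDueToLock?: boolean; performed?: boolean };
}

// Retry only when the sync was skipped due to lock contention;
// non-retryable failures (e.g. merge conflicts) return immediately.
async function checkoutWithRetry(
  checkout: () => Promise<CheckoutResult>,
  maxAttempts = 3,
  baseDelayMs = 100,
): Promise<CheckoutResult> {
  for (let attempt = 1; ; attempt++) {
    const result = await checkout();
    const contended = result.schemaSync?.skippedDueToLock === true;
    if (!contended || attempt >= maxAttempts) return result;
    // Exponential backoff with 1x–1.5x jitter to avoid a thundering herd
    // of synchronized retriers hitting the advisory lock together.
    const delayMs = baseDelayMs * 2 ** (attempt - 1) * (1 + Math.random() * 0.5);
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
}
```

With maxAttempts = 3, this gives the "up to 2 retries" behavior described in the PR while leaving unretryable failure modes untouched.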

Reviewers (3)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
| --- | --- | --- | --- | --- | --- | --- | --- |
| pr-review-sre | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| pr-review-errors | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| pr-review-standards | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

Note: Delta review — all prior issues were either resolved in this commit or are pending architectural suggestions.

@github-actions github-actions bot deleted a comment from claude bot Apr 8, 2026

itoqa bot commented Apr 8, 2026

Ito Test Report ✅

16 test cases ran. 2 additional findings, 14 passed.

Across 16 end-to-end and API checks, 14 passed and 2 failed (both minor). Baseline coverage succeeded for tenant routes (projects/work-apps/MCP), project+agent CRUD/delete integrity, rapid-interaction and concurrency resilience, mobile usability, tenant-isolation/authorization controls, safe rendering of malicious text, and proper stale-session write denial. The key defect was a repeatable API contract inconsistency on POST /manage/tenants/default/projects: wrong-shape JSON returns a legacy 400 application/json ZodError payload instead of the expected RFC 7807 application/problem+json format, while malformed JSON and unknown-ref handling behaved correctly and follow-up health reads stayed stable.

✅ Passed (14)
| Category | Summary | Screenshot |
| --- | --- | --- |
| Adversarial | Cross-tenant list and nested route tampering were denied with structured 403 forbidden responses and no foreign tenant payload exposure. | ADV-2 |
| Adversarial | A malicious script-tag name with oversized description was accepted for creation, and subsequent rendering showed literal escaped text with no script execution in the UI. | ADV-3 |
| Adversarial | After logout in tab A, tab B could still show stale UI but mutation submission failed with an error and no unauthorized write was committed. | ADV-4 |
| Adversarial | Alternating submit/cancel modal stress behaved correctly; only submitted writes persisted. | ADV-6 |
| Edge | Rapid double-submit created one project with no 500/internal error behavior. | EDGE-1 |
| Edge | Refresh interruption and rapid interactions converged cleanly with responsive UI and controlled persistence. | EDGE-2 |
| Edge | Back/forward navigation after deleting an agent returned a controlled 404 state without instability. | EDGE-3 |
| Edge | Unknown ref returns controlled 404 problem+json, and a follow-up request without ref succeeds (200) after creating a valid project. | EDGE-4 |
| Edge | Re-execution showed 10 parallel project-create requests returned controlled 201 responses, and the immediate follow-up projects GET remained healthy. | EDGE-5 |
| Edge | Mobile viewport flow remained usable at 390x844, including navigation, create/delete actions, and readable inline validation errors. | EDGE-6 |
| Happy-path | Projects route loaded successfully and the projects API check returned 200. | ROUTE-1 |
| Happy-path | Not a real application bug; re-execution completed project+agent create, persistence check, and cleanup successfully. | ROUTE-2 |
| Happy-path | Delete workflow integrity passed: disposable project and agent were created, persisted, and deleted successfully. | ROUTE-4 |
| Happy-path | Work-app and MCP routes loaded successfully with stable repeated reads after local setup prerequisites were applied. | ROUTE-5 |
ℹ️ Additional Findings (2)

These findings are unrelated to the current changes but were observed during testing.

| Category | Summary | Screenshot |
| --- | --- | --- |
| Adversarial | 🟡 Wrong-shape JSON on project create returns legacy application/json ZodError instead of application/problem+json; ref injection handling itself is safe. | ADV-1 |
| Adversarial | 🟡 Malformed JSON is handled correctly, but wrong-shape JSON still returns legacy application/json ZodError instead of application/problem+json. | ADV-5 |
🟡 Ref injection payloads
  • What failed: The wrong-shape JSON payload returns 400 application/json with legacy success/error: ZodError structure, but the expected contract is application/problem+json with problem details.
  • Impact: API clients expecting a consistent RFC 7807 error contract can mis-handle validation failures on this endpoint. This creates inconsistent error parsing behavior across failure modes.
  • Steps to reproduce:
    1. Send malformed JSON and wrong-shape JSON payloads to POST /manage/tenants/default/projects with authenticated API access.
    2. Compare the response headers and body schema between malformed JSON and wrong-shape JSON cases.
    3. Verify baseline endpoint health with a follow-up GET /manage/tenants/default/projects request.
  • Stub / mock context: Requests were executed with a local bypass bearer token for deterministic API authentication, while adversarial ref payloads were exercised directly against the real manage endpoint behavior.
  • Code analysis: I traced project creation validation through manage routes and global app setup. OpenAPIHono routes validate request bodies via c.req.valid('json'), but app/route construction does not set a validation default hook to normalize those schema-validation failures into problem+json, while the global error handler only formats errors that flow through its Zod issue extraction path.
  • Why this is likely a bug: The endpoint declares structured validation and the app has a problem+json error strategy, yet this invalid-input path emits a different legacy format, which is an inconsistent production behavior rather than test setup noise.

Relevant code:

agents-api/src/createApp.ts (lines 49-53)

function createAgentsHono(config: AppConfig) {
  const { serverConfig, credentialStores, auth, sandboxConfig } = config;

  const app = new OpenAPIHono<{ Variables: AppVariables }>();

agents-api/src/domains/manage/routes/projects.ts (lines 174-183)

request: {
  params: TenantParamsSchema,
  body: {
    content: {
      'application/json': {
        schema: ProjectApiInsertSchema,
      },
    },
  },
},

agents-api/src/middleware/errorHandler.ts (lines 132-135)

const zodIssues = extractZodIssues(err);
if (status === 400 && zodIssues) {
  return formatZodValidationError(c, zodIssues);
}
🟡 Invalid JSON and content-type abuse
  • What failed: Malformed JSON correctly returns application/problem+json, but wrong-shape JSON still returns a legacy application/json ZodError response, violating the expected error-content contract.
  • Impact: Consumers cannot rely on one deterministic error format for invalid project-create requests. This increases integration fragility for clients that parse RFC 7807 responses.
  • Steps to reproduce:
    1. Send malformed JSON and wrong-shape JSON payloads to POST /manage/tenants/default/projects with authenticated API access.
    2. Confirm malformed JSON returns problem+json while wrong-shape JSON returns legacy application/json ZodError.
    3. Run a follow-up baseline GET /manage/tenants/default/projects to confirm service health after adversarial calls.
  • Stub / mock context: API auth used a deterministic local bypass bearer token, and missing seeded project data was handled by targeting an existing disposable project for unknown-ref and injection checks.
  • Code analysis: The same production code path is exercised here: schema validation is attached at route definition, but no default validation hook is configured at app/route construction to enforce problem+json formatting for request-schema failures. The global error handler can emit problem+json, but this wrong-shape validation path continues to surface the legacy format.
  • Why this is likely a bug: The observed mismatch is repeatable across runs and directly aligns with route/error-handling code structure that allows legacy validation responses in production.

Relevant code:

agents-api/src/createApp.ts (lines 49-53)

function createAgentsHono(config: AppConfig) {
  const { serverConfig, credentialStores, auth, sandboxConfig } = config;

  const app = new OpenAPIHono<{ Variables: AppVariables }>();

agents-api/src/domains/manage/routes/projects.ts (lines 174-183)

request: {
  params: TenantParamsSchema,
  body: {
    content: {
      'application/json': {
        schema: ProjectApiInsertSchema,
      },
    },
  },
},

agents-api/src/middleware/errorHandler.ts (lines 132-135)

const zodIssues = extractZodIssues(err);
if (status === 400 && zodIssues) {
  return formatZodValidationError(c, zodIssues);
}
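The RFC 7807 body the contract expects can be sketched independently of the framework. This is illustrative, not the codebase's formatter; the issue shape and field choices are assumptions, and in @hono/zod-openapi this kind of normalization would typically be wired via the defaultHook option on the OpenAPIHono constructor:

```typescript
// Zod-style validation issue (shape is illustrative).
interface ValidationIssue {
  path: (string | number)[];
  message: string;
}

// Build an RFC 7807 problem-details body for a 400 validation failure.
// Such a response would be served with Content-Type: application/problem+json.
function toProblemDetails(issues: ValidationIssue[]) {
  return {
    type: 'about:blank', // or a registry URI identifying the problem type
    title: 'Validation failed',
    status: 400,
    detail: `Request body failed validation (${issues.length} issue(s))`,
    errors: issues.map((i) => ({
      pointer: '/' + i.path.join('/'),
      message: i.message,
    })),
  };
}

const body = toProblemDetails([{ path: ['name'], message: 'Required' }]);
console.log(body.status, body.errors[0].pointer); // → 400 /name
```

Routing wrong-shape JSON through a formatter like this would make it consistent with the malformed-JSON path that already emits problem+json.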

Commit: 047027d

