scheduler: add force flag to resumeJob for failed-job revival (#128) #144

Open
truffle-dev wants to merge 1 commit into ghostwright:main from truffle-dev:fix/scheduler-resume-failed

@truffle-dev
Contributor

Closes #128.

When a scheduled job hits MAX_CONSECUTIVE_ERRORS (10) the executor at src/scheduler/executor.ts flips it to status='failed', sets next_run_at=NULL, and stops touching it. The public API has no path back: resumeJob at src/scheduler/service.ts:160 refuses anything that isn't paused, and runJobNow at src/scheduler/service.ts:226 refuses anything that isn't active. Until now the only recovery was a raw SQLite UPDATE.
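To make the dead-end concrete, here is a minimal sketch of the failure transition described above. It is not the code in src/scheduler/executor.ts; the `SketchJobRow` type and `recordFailure` helper are hypothetical names, and only `MAX_CONSECUTIVE_ERRORS`, `status`, `next_run_at`, and `consecutive_errors` come from the description.

```typescript
// Threshold stated in the PR description.
const MAX_CONSECUTIVE_ERRORS = 10;

// Hypothetical, trimmed-down view of a scheduler row (illustration only).
interface SketchJobRow {
  status: "active" | "paused" | "failed" | "completed";
  next_run_at: number | null;
  consecutive_errors: number;
}

// Hypothetical helper: count one more failed run and, once the threshold
// is reached, park the job the way the description says the executor does.
function recordFailure(job: SketchJobRow): void {
  job.consecutive_errors += 1;
  if (job.consecutive_errors >= MAX_CONSECUTIVE_ERRORS) {
    job.status = "failed";
    job.next_run_at = null; // nothing will schedule this job again
  }
}
```

Once a row is in that state, neither a paused-only resume nor an active-only run-now can touch it, which is the gap this PR closes.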

I picked shape (1) from the issue body: resumeJob gains an optional { force?: boolean }. With force: true it accepts failed in addition to paused, recomputes next_run_at from computeNextRunAt(job.schedule), and clears consecutive_errors. Default behaviour is unchanged (paused-only). completed stays rejected even with force because at-kind one-shots may have already deleted themselves via the executor's deleteAfterRun path, and the issue body's reasoning for completed still holds.
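A rough sketch of the intended semantics, for reviewers who want the state rules at a glance. This is a simplified in-memory model, not the diff itself: the `SketchJob` type, the stand-in `computeNextRunAt` (which treats the schedule as a millisecond interval), the extra `now` parameter, and the boolean return value are all assumptions made for illustration. Whether the plain paused path also resets `consecutive_errors` is not stated here, so the sketch only clears it on the forced path.

```typescript
type SketchStatus = "active" | "paused" | "failed" | "completed";

// Hypothetical in-memory job shape; field names follow the PR text.
interface SketchJob {
  status: SketchStatus;
  schedule: string;
  next_run_at: number | null;
  consecutive_errors: number;
}

interface ResumeOptions {
  force?: boolean;
}

// Stand-in for the real computeNextRunAt. Assumption: the schedule is an
// interval in milliseconds, encoded as a string.
function computeNextRunAt(schedule: string, now: number): number {
  return now + Number(schedule);
}

// Returns true when the job was resumed, false when the call is a no-op.
function resumeJob(
  job: SketchJob,
  opts: ResumeOptions = {},
  now: number = Date.now(),
): boolean {
  const revivingFailed = opts.force === true && job.status === "failed";
  // `completed` and `active` are never resumable, with or without force.
  if (job.status !== "paused" && !revivingFailed) return false;

  if (revivingFailed) {
    job.consecutive_errors = 0; // give the job a fresh error budget
  }
  job.status = "active";
  job.next_run_at = computeNextRunAt(job.schedule, now);
  return true;
}
```

With this shape, `resumeJob(job)` on a failed job stays a no-op, and only `resumeJob(job, { force: true })` revives it.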

UI surface

POST /ui/api/scheduler/:id/resume accepts an optional { "force": true } JSON body. Empty body keeps the old semantics. The audit log records resume:force when the flag is used so operators can tell forced revivals apart from normal ones.
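The body handling can be pictured roughly as follows. This is a framework-free sketch, not the handler in this PR: `parseForceFlag` and `auditAction` are hypothetical helpers, the plain `resume` action name for the non-forced case is an assumption, and treating a malformed body as "no force" is one possible choice that the real handler may not share.

```typescript
// Hypothetical helper: extract the optional force flag from the raw body
// of POST /ui/api/scheduler/:id/resume. Only a literal `true` opts in;
// an empty body keeps the old paused-only semantics.
function parseForceFlag(rawBody: string): boolean {
  if (rawBody.trim() === "") return false;
  try {
    const parsed: unknown = JSON.parse(rawBody);
    return (
      typeof parsed === "object" &&
      parsed !== null &&
      (parsed as { force?: unknown }).force === true
    );
  } catch {
    // Assumption: malformed JSON is treated like a missing flag.
    return false;
  }
}

// Hypothetical helper: pick the audit-log action. `resume:force` comes from
// the PR text; the plain `resume` name is assumed.
function auditAction(force: boolean): string {
  return force ? "resume:force" : "resume";
}
```

Accepting only a strict boolean `true` keeps values such as `"yes"` or `1` from silently reviving a failed job.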

Tests

Service-level (src/scheduler/__tests__/service.test.ts):

  • existing paused→active path unchanged
  • existing no-force no-op on failed/completed unchanged (renamed for clarity)
  • new: force=true revives a failed job, clears consecutive_errors, recomputes next_run_at
  • new: force=true still refuses completed
  • new: force=true still resumes paused (default path regression check)

UI API (src/ui/api/__tests__/scheduler.test.ts):

  • new: POST without body is a no-op on a failed job
  • new: POST with { force: true } revives a failed job and writes resume:force to the audit log

All 32 service tests and 35 UI scheduler tests pass. tsc --noEmit and biome check src/ are clean.

What this doesn't do

armTimer already fires from inside resumeJob, so the in-memory wake-up problem the issue body alludes to is handled the same way a normal resume handles it. Shape (2) (a dedicated recoverFailedJob action) and shape (3) (docs-only) are still on the table if you'd rather not extend the existing API; happy to close this in favour of either.

…right#128)

When a scheduled job hits MAX_CONSECUTIVE_ERRORS the executor flips it
to status='failed' and stops touching it. Until now the only path back
was a raw SQLite UPDATE; resumeJob refused anything that wasn't paused.

Shape (1) from the issue: resumeJob gains an optional `{ force?: boolean }`
that lets paused-or-failed jobs flip back to active, recomputes
next_run_at from the stored schedule, and clears consecutive_errors.
Default behaviour is unchanged (paused-only). `completed` stays
rejected even with force, because at-kind one-shots may have already
deleted themselves via the deleteAfterRun path.

UI surface: POST /ui/api/scheduler/:id/resume accepts an optional
`{ force: true }` JSON body. Empty body keeps the old semantics. Audit
log records `resume:force` when the flag is used so operators can tell
forced revivals apart from normal ones.

Tests cover paused-default, failed-without-force (no-op), failed-with-force
(revive + clear errors + recompute next_run_at), completed-with-force
(still refused), and the UI body-plumbing both ways.
Development

Successfully merging this pull request may close these issues.

scheduler: no public API recovers failed jobs; resumeJob handles only paused