fix(workflow-executor): add per-invocation AI timeout to surface hanging provider errors [PRD-409] by matthv · Pull Request #1609 · ForestAdmin/agent-nodejs

matthv · 2026-05-28T14:52:22Z

Summary

When the AI provider hangs (no response, internal retries, or a connection held open), the previous code relied on the global STEP_TIMEOUT_MS (default 5 min) to fail the step. From the user's perspective, this looks like an infinite spinner.

This PR adds a dedicated timeout on each AI invocation (default 30s, configurable via AI_INVOKE_TIMEOUT_MS) by passing LangChain's native timeout call option to model.invoke. LangChain converts it into an AbortSignal.timeout(ms) and forwards it to the underlying HTTP request, so a hanging provider is actually cancelled — not merely raced.

On timeout, the abort surfaces as a TimeoutError/AbortError, which invokeWithTools maps to the new AiInvokeTimeoutError. BaseStepExecutor.execute() then converts it to an error outcome with a user-friendly message — the orchestrator sets context.error on the step and the frontend exits its isLoading state immediately.
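
The flow described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the `InvokableModel` interface, the `mapInvokeError` and `isAbortLikeError` helpers, the error message text, and the choice to pass no options when the timeout is disabled are all assumptions made for the example. Only the names `invokeWithTools`, `AiInvokeTimeoutError`, and the `{ timeout }` call option come from the description.

```typescript
// Hypothetical stand-in for the PR's error class (the real one extends
// WorkflowExecutorError and carries its own user message).
class AiInvokeTimeoutError extends Error {
  readonly timeoutMs: number;

  constructor(timeoutMs: number) {
    super(`The AI provider did not respond within ${timeoutMs} ms. Please retry.`);
    this.name = "AiInvokeTimeoutError";
    this.timeoutMs = timeoutMs;
  }
}

// Assumed minimal model shape: invoke(input, options) where options may
// carry a timeout in milliseconds.
interface InvokableModel<I, O> {
  invoke(input: I, options?: { timeout?: number }): Promise<O>;
}

// Abort-style failures are recognised by their error name.
function isAbortLikeError(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const name = (err as { name?: unknown }).name;
  return name === "TimeoutError" || name === "AbortError";
}

// Map abort-style errors to AiInvokeTimeoutError only when the timeout is
// enabled; every other error is returned unchanged.
function mapInvokeError(err: unknown, timeoutMs: number | undefined): unknown {
  if (timeoutMs !== undefined && timeoutMs > 0 && isAbortLikeError(err)) {
    return new AiInvokeTimeoutError(timeoutMs);
  }
  return err;
}

async function invokeWithTools<I, O>(
  model: InvokableModel<I, O>,
  input: I,
  aiInvokeTimeoutMs?: number,
): Promise<O> {
  const enabled = aiInvokeTimeoutMs !== undefined && aiInvokeTimeoutMs > 0;
  try {
    // The timeout travels as a call option in the second argument.
    return await model.invoke(input, enabled ? { timeout: aiInvokeTimeoutMs } : undefined);
  } catch (err) {
    throw mapInvokeError(err, aiInvokeTimeoutMs);
  }
}
```

Keeping the mapping in a small pure helper means the timeout-disabled and non-abort cases can be unit tested without a model at all.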

Why delegate to LangChain instead of a manual AbortController

An earlier version wired up AbortController + setTimeout by hand. LangChain already does exactly this internally when given a timeout call option (verified in @langchain/core ensureConfig → AbortSignal.timeout → forwarded as signal to the request). Delegating removes the manual timer plumbing and lowers invokeWithTools complexity, while still producing a real request cancellation. The timeout call option is in milliseconds.
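
The difference between cancelling a request and merely racing it can be illustrated with a small sketch. It does not use LangChain; `hangingProviderCall`, `callWithTimeoutSignal`, and `callWithRace` are hypothetical names, and the example assumes a runtime that provides `AbortSignal.timeout` (recent Node.js versions and modern browsers).

```typescript
// A stand-in for a provider HTTP call that never answers on its own.
// It settles only when the signal aborts, similar to fetch() with `signal`.
function hangingProviderCall(signal: AbortSignal, onCancelled: () => void): Promise<never> {
  return new Promise<never>((_resolve, reject) => {
    const abort = () => {
      onCancelled(); // where a real client would tear down the connection
      reject(signal.reason);
    };
    if (signal.aborted) {
      abort();
      return;
    }
    signal.addEventListener("abort", abort, { once: true });
  });
}

// Cancellation: the timeout signal is handed to the call itself, so the
// call learns about the timeout and can release its resources.
function callWithTimeoutSignal(ms: number, onCancelled: () => void): Promise<never> {
  return hangingProviderCall(AbortSignal.timeout(ms), onCancelled);
}

// Race: the caller stops waiting after `ms`, but the call is never told,
// so whatever it holds open stays open.
function callWithRace(ms: number, onCancelled: () => void): Promise<never> {
  const neverAborted = new AbortController().signal;
  const timer = new Promise<never>((_resolve, reject) => {
    setTimeout(() => reject(new Error("raced timeout")), ms);
  });
  return Promise.race([hangingProviderCall(neverAborted, onCancelled), timer]);
}
```

Both helpers reject after roughly `ms` milliseconds, but only `callWithTimeoutSignal` ever runs `onCancelled`; in `callWithRace` the pending call is simply abandoned.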

Why not just lower STEP_TIMEOUT_MS globally

STEP_TIMEOUT_MS covers more than the AI call (it also covers slow agent fetches, DB lookups, etc.). Lowering it globally would kill legitimately slow non-AI work. A dedicated AI timeout is more surgical.

Changes

defaults.ts: new DEFAULT_AI_INVOKE_TIMEOUT_MS = 30_000
errors.ts: new AiInvokeTimeoutError extends WorkflowExecutorError with provider-specific user message
base-step-executor.ts: invokeWithTools passes { timeout: aiInvokeTimeoutMs } to model.invoke, and maps the resulting TimeoutError/AbortError to AiInvokeTimeoutError
Config plumbing through RunnerConfig → StepContextConfig → ExecutionContext
cli-core.ts: parse AI_INVOKE_TIMEOUT_MS env var
6 unit tests: TimeoutError/AbortError mapped to AiInvokeTimeoutError, { timeout } passed as the 2nd arg, disabled when unset/<=0 (abort not mapped), non-abort errors rethrown as-is
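
As a rough illustration of the config plumbing listed above, the sketch below parses the environment variable and decides whether the timeout is active. The function names and the fallback rule for unset or non-numeric values are assumptions for this example; the PR description only states the default (30_000 ms) and that the timeout is disabled when unset or <= 0 at the executor level.

```typescript
const DEFAULT_AI_INVOKE_TIMEOUT_MS = 30_000;

// Hypothetical parser for AI_INVOKE_TIMEOUT_MS.
// Assumption: an unset, empty, or non-numeric value falls back to the default,
// while an explicit 0 or negative number is kept so it can disable the timeout.
function parseAiInvokeTimeoutMs(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return DEFAULT_AI_INVOKE_TIMEOUT_MS;
  const parsed = Number(raw);
  return isFinite(parsed) ? parsed : DEFAULT_AI_INVOKE_TIMEOUT_MS;
}

// The timeout applies only to positive values; undefined, 0, and negative
// values leave the invocation without a dedicated timeout.
function isAiInvokeTimeoutEnabled(aiInvokeTimeoutMs: number | undefined): boolean {
  return aiInvokeTimeoutMs !== undefined && aiInvokeTimeoutMs > 0;
}
```

With these assumptions, `AI_INVOKE_TIMEOUT_MS=10000` yields a 10-second timeout, and `AI_INVOKE_TIMEOUT_MS=0` turns the dedicated timeout off so that only STEP_TIMEOUT_MS applies.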

fixes PRD-409

Test plan

workflow-executor test suite passes (base-step-executor.test.ts: 45 tests, incl. the 6 above)
Lint clean on changed files; tsc --noEmit clean
Live test: with SIMULATE_AI_HANG=1 AI_INVOKE_TIMEOUT_MS=10000, the frontend shows the new user message after 10s instead of spinning for 5min
Default set to 30s

🤖 Generated with Claude Code

Note

Add per-invocation AI timeout to surface hanging provider errors in workflow executor

Adds aiInvokeTimeoutMs (default 60,000ms) to the workflow executor's RunnerConfig, ExecutionContext, and ExecutorOptions, configurable via the AI_INVOKE_TIMEOUT_MS environment variable.
In BaseStepExecutor.invokeWithTools, wraps AI provider calls with an AbortController timer; if the provider hangs past the timeout, the invocation is aborted and throws AiInvokeTimeoutError.
Introduces AiInvokeTimeoutError with a user-facing retry message to distinguish timeout failures from other AI errors.
Setting aiInvokeTimeoutMs to 0 or leaving it unset disables the timeout, preserving existing behavior.
Risk: AI invocations that previously hung indefinitely will now fail after 60s by default, which may surface as new errors in workflows that relied on slow providers.

Macroscope summarized 1718cb4.

…ing provider errors [PRD-409]

When the AI provider hangs (no response, internal retries, or holds the connection open), the previous code relied on the global STEP_TIMEOUT_MS (default 5 min) to fail the step. From the user's perspective this looks like an infinite spinner.

Add a dedicated timeout on each AI invocation (default 60s, configurable via AI_INVOKE_TIMEOUT_MS) using AbortController + signal so the underlying HTTP request is actually cancelled. On timeout, throws the new AiInvokeTimeoutError, which BaseStepExecutor.execute() converts to an error outcome with a user-friendly message; the orchestrator then sets context.error on the step and the frontend exits its isLoading state immediately.

fixes PRD-409

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

linear · 2026-05-28T14:52:27Z

PRD-409

qltysh · 2026-05-28T14:53:38Z

1 new issue

Tool	Category	Rule	Count
qlty	Structure	Function with high complexity (count = 11): invokeWithTools	1

qltysh · 2026-05-28T14:58:23Z

Coverage Impact

⬆️ Merging this pull request will increase total coverage on feat/prd-214-server-step-mapper by 0.02%.

Modified Files with Diff Coverage (3)

File	% Diff	Uncovered Line #s
packages/workflow-executor/src/executors/base-step-executor.ts	100.0%
packages/workflow-executor/src/errors.ts	100.0%
packages/workflow-executor/src/defaults.ts	100.0%
Total	100.0%


Replace the manual AbortController + setTimeout in invokeWithTools with LangChain's native `timeout` call option, which it converts to an AbortSignal.timeout(ms) and forwards to the underlying HTTP request (real cancellation, not just a race). Lowers invokeWithTools complexity.

Map the resulting TimeoutError/AbortError to AiInvokeTimeoutError to keep the user-facing message. Lower the default to 30s.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

matthv wants to merge 2 commits into feat/prd-214-server-step-mapper from fix/prd-409-ai-invoke-timeout
