fix(workflow-executor): add per-invocation AI timeout to surface hanging provider errors [PRD-409] #1609

Open
matthv wants to merge 2 commits into feat/prd-214-server-step-mapper from fix/prd-409-ai-invoke-timeout

Conversation

Member

@matthv matthv commented May 28, 2026

Summary

When the AI provider hangs (no response, internal retries, or a connection held open), the previous code relied on the global STEP_TIMEOUT_MS (default 5 min) to fail the step. From the user's perspective this looks like an infinite spinner.

This PR adds a dedicated timeout on each AI invocation (default 30s, configurable via AI_INVOKE_TIMEOUT_MS) by passing LangChain's native timeout call option to model.invoke. LangChain converts it into an AbortSignal.timeout(ms) and forwards it to the underlying HTTP request, so a hanging provider is actually cancelled — not merely raced.

On timeout, the abort surfaces as a TimeoutError/AbortError, which invokeWithTools maps to the new AiInvokeTimeoutError. BaseStepExecutor.execute() then converts it to an error outcome with a user-friendly message — the orchestrator sets context.error on the step and the frontend exits its isLoading state immediately.
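
A minimal sketch of the invoke-and-map flow described above, not the code from this PR. The `InvokableModel` interface, the `invokeWithTimeout` and `isAbortLike` helpers, the stand-in `WorkflowExecutorError` base class, and the error message text are all hypothetical. It also assumes that no call options are passed when the timeout is disabled.

```typescript
// Hypothetical stand-in for the base class defined in errors.ts.
class WorkflowExecutorError extends Error {}

// Hypothetical shape; the real class and its user message live in errors.ts.
class AiInvokeTimeoutError extends WorkflowExecutorError {
  constructor(timeoutMs: number) {
    super(`The AI provider did not respond within ${timeoutMs} ms. Please retry.`);
    this.name = "AiInvokeTimeoutError";
    // Keeps instanceof working even when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Only the part of the model we rely on: invoke(input, { timeout }) with timeout in ms.
interface InvokableModel<I, O> {
  invoke(input: I, options?: { timeout?: number }): Promise<O>;
}

// True for the two error names the PR description says an aborted call surfaces as.
function isAbortLike(err: unknown): boolean {
  return err instanceof Error && (err.name === "TimeoutError" || err.name === "AbortError");
}

async function invokeWithTimeout<I, O>(
  model: InvokableModel<I, O>,
  input: I,
  aiInvokeTimeoutMs?: number,
): Promise<O> {
  // Unset or <= 0 disables the timeout, so aborts are then not remapped.
  const timeoutMs =
    aiInvokeTimeoutMs !== undefined && aiInvokeTimeoutMs > 0 ? aiInvokeTimeoutMs : undefined;
  try {
    // Assumption: with the timeout disabled, no call options are passed at all.
    return await model.invoke(input, timeoutMs !== undefined ? { timeout: timeoutMs } : undefined);
  } catch (err) {
    if (timeoutMs !== undefined && isAbortLike(err)) {
      throw new AiInvokeTimeoutError(timeoutMs);
    }
    throw err; // non-abort errors are rethrown as-is
  }
}
```

Keeping the mapping in one place means callers such as `execute()` only need to recognise a single error type to produce the user-facing outcome.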

Why delegate to LangChain instead of a manual AbortController

An earlier version wired up AbortController + setTimeout by hand. LangChain already does exactly this internally when given a timeout call option (verified in @langchain/core: ensureConfig → AbortSignal.timeout → forwarded as signal to the request). Delegating removes the manual timer plumbing and lowers invokeWithTools complexity, while still producing a real request cancellation. The timeout call option is in milliseconds.

Why not just lower STEP_TIMEOUT_MS globally

STEP_TIMEOUT_MS covers more than the AI call (it also covers slow agent fetches, DB lookups, etc.). Lowering it globally would kill legitimately slow non-AI work. A dedicated AI timeout is more surgical.

Changes

  • defaults.ts: new DEFAULT_AI_INVOKE_TIMEOUT_MS = 30_000
  • errors.ts: new AiInvokeTimeoutError extends WorkflowExecutorError with provider-specific user message
  • base-step-executor.ts: invokeWithTools passes { timeout: aiInvokeTimeoutMs } to model.invoke, and maps the resulting TimeoutError/AbortError to AiInvokeTimeoutError
  • Config plumbing through RunnerConfig → StepContextConfig → ExecutionContext
  • cli-core.ts: parse AI_INVOKE_TIMEOUT_MS env var
  • 6 unit tests: TimeoutError/AbortError mapped to AiInvokeTimeoutError, { timeout } passed as the 2nd arg, disabled when unset/<=0 (abort not mapped), non-abort errors rethrown as-is
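
As an illustration of the env-var plumbing listed above, the sketch below shows one way AI_INVOKE_TIMEOUT_MS could be resolved to a number. The helper name `resolveAiInvokeTimeoutMs` is hypothetical. Its handling of non-numeric and negative input is an assumption, as is its fallback to the default when the variable is unset. The actual parsing in cli-core.ts may differ.

```typescript
// Default from the PR description (defaults.ts).
const DEFAULT_AI_INVOKE_TIMEOUT_MS = 30_000;

// Hypothetical helper: resolves the timeout in milliseconds from an env-like object.
// A result of 0 means "disabled" for downstream code.
function resolveAiInvokeTimeoutMs(env: Record<string, string | undefined>): number {
  const raw = env["AI_INVOKE_TIMEOUT_MS"];
  if (raw === undefined || raw.trim() === "") {
    return DEFAULT_AI_INVOKE_TIMEOUT_MS;
  }
  const parsed = Number(raw);
  // Assumption: non-numeric input falls back to the default instead of failing startup.
  if (!Number.isFinite(parsed)) {
    return DEFAULT_AI_INVOKE_TIMEOUT_MS;
  }
  // Assumption: negative values are clamped to 0, which disables the timeout.
  return Math.max(0, Math.floor(parsed));
}
```

For example, `resolveAiInvokeTimeoutMs({ AI_INVOKE_TIMEOUT_MS: "10000" })` yields the 10 s value used in the live test, and an empty environment yields the 30 s default.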

fixes PRD-409

Test plan

  • workflow-executor test suite passes (base-step-executor.test.ts: 45 tests, incl. the 6 above)
  • Lint clean on changed files; tsc --noEmit clean
  • Live test: with SIMULATE_AI_HANG=1 AI_INVOKE_TIMEOUT_MS=10000, the frontend shows the new user message after 10s instead of spinning for 5min
  • Default set to 30s

🤖 Generated with Claude Code

Note

Add per-invocation AI timeout to surface hanging provider errors in workflow executor

  • Adds aiInvokeTimeoutMs (default 60,000ms) to the workflow executor's RunnerConfig, ExecutionContext, and ExecutorOptions, configurable via the AI_INVOKE_TIMEOUT_MS environment variable.
  • In BaseStepExecutor.invokeWithTools, wraps AI provider calls with an AbortController timer; if the provider hangs past the timeout, the invocation is aborted and throws AiInvokeTimeoutError.
  • Introduces AiInvokeTimeoutError with a user-facing retry message to distinguish timeout failures from other AI errors.
  • Setting aiInvokeTimeoutMs to 0 or leaving it unset disables the timeout, preserving existing behavior.
  • Risk: AI invocations that previously hung indefinitely will now fail after 60s by default, which may surface as new errors in workflows that relied on slow providers.

Macroscope summarized 1718cb4.

…ing provider errors [PRD-409]

When the AI provider hangs (no response, internal retries, or holds the
connection open), the previous code relied on the global STEP_TIMEOUT_MS
(default 5 min) to fail the step. From the user's perspective this looks
like an infinite spinner.

Add a dedicated timeout on each AI invocation (default 60s, configurable
via AI_INVOKE_TIMEOUT_MS) using AbortController + signal so the underlying
HTTP request is actually cancelled. On timeout, throws the new
AiInvokeTimeoutError, which BaseStepExecutor.execute() converts to an
error outcome with a user-friendly message — the orchestrator then sets
context.error on the step and the frontend exits its isLoading state
immediately.

fixes PRD-409

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

linear Bot commented May 28, 2026

PRD-409


qltysh Bot commented May 28, 2026

1 new issue

Tool | Category | Rule | Count
qlty | Structure | Function with high complexity (count = 11): invokeWithTools | 1


qltysh Bot commented May 28, 2026

Qlty


Coverage Impact

⬆️ Merging this pull request will increase total coverage on feat/prd-214-server-step-mapper by 0.02%.

Modified Files with Diff Coverage (3)

Rating | File | % Diff | Uncovered Line #s
A | packages/workflow-executor/src/executors/base-step-executor.ts | 100.0% |
A | packages/workflow-executor/src/errors.ts | 100.0% |
A | packages/workflow-executor/src/defaults.ts | 100.0% |
Total | | 100.0% |

Replace the manual AbortController + setTimeout in invokeWithTools with
LangChain's native `timeout` call option, which it converts to an
AbortSignal.timeout(ms) and forwards to the underlying HTTP request (real
cancellation, not just a race). Lowers invokeWithTools complexity.

Map the resulting TimeoutError/AbortError to AiInvokeTimeoutError to keep the
user-facing message. Lower the default to 30s.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>