## Summary
When using the `web_search_20250209` tool with the Messages API, the model frequently enters an internal loop of `code_execution` calls during dynamic filtering. The server-side sampling loop hits its 10-iteration limit and returns `stop_reason: "pause_turn"` with ~110 content blocks — without producing any usable output. This happens on 50-60% of requests in production batch runs, making the tool unreliable for automated workflows.
## Environment

- SDK: `anthropic` Python SDK (latest)
- Model: `claude-sonnet-4-6`
- Tool: `web_search_20250209` with `max_uses: 5`
- Streaming: Yes (required due to long response times)
- Prompt caching: Yes (`cache_control: ephemeral` on system prompt)
## Reproduction

```python
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=8192,
    tools=[{
        "type": "web_search_20250209",
        "name": "web_search",
        "max_uses": 5,
    }],
    system=[{"type": "text", "text": system_prompt}],
    messages=[{"role": "user", "content": "Research {TICKER} and return a structured JSON profile"}],
)
# response.stop_reason == "pause_turn" ~55% of the time
```

The system prompt asks the model to research a topic via web search and return a structured JSON object. This is a straightforward single-turn request — no multi-turn conversation, no user-provided tools.
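When `pause_turn` does occur, the documented recovery is to feed the paused assistant content back and call the API again. A minimal sketch of that loop, assuming a synchronous client — the function and parameter names (`continue_until_done`, `max_continuations`) are ours, not SDK API:

```python
def continue_until_done(client, request_kwargs, max_continuations=3):
    """Re-issue a request while the server returns pause_turn, appending
    the paused assistant content so the server can resume the turn."""
    kwargs = dict(request_kwargs)          # don't mutate the caller's dict
    messages = list(kwargs.pop("messages"))
    response = client.messages.create(messages=messages, **kwargs)
    for _ in range(max_continuations):
        if response.stop_reason != "pause_turn":
            break
        # Append the paused turn verbatim as an assistant message.
        messages.append({"role": "assistant", "content": response.content})
        response = client.messages.create(messages=messages, **kwargs)
    return response
```

As described below, this continuation often just pauses again, so in our case it was not a sufficient fix.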
## Observed behavior

- Model starts web search, makes 1-3 search requests
- Internal `code_execution` tool fires repeatedly for dynamic filtering
- Content blocks accumulate: alternating `server_tool_use` + `code_execution_tool_result` (or `bash_code_execution_tool_result`)
- At ~110 content blocks, `stop_reason: "pause_turn"` is returned
- No usable text output — the model never reached the point of writing its response
- Retrying (sending the response back as an assistant message per docs) usually results in another `pause_turn`
Typical content block pattern at failure:

```
['text', 'server_tool_use', 'web_search_tool_result', 'text',
 'server_tool_use', 'code_execution_tool_result', 'server_tool_use',
 'code_execution_tool_result', 'server_tool_use', 'code_execution_tool_result',
 ... (repeating ~40-50 times) ...]
```
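A client-side heuristic can at least flag this runaway pattern cheaply. A sketch, where the threshold is an assumption tuned to the ~110-block failures above, not a documented limit:

```python
from collections import Counter

def looks_like_filtering_loop(block_types, threshold=20):
    """Heuristic: a response dominated by code_execution result blocks
    is almost certainly stuck in the dynamic filtering loop."""
    counts = Counter(block_types)
    exec_results = (counts["code_execution_tool_result"]
                    + counts["bash_code_execution_tool_result"])
    return exec_results >= threshold
```

We use this to decide whether a retry is worth attempting or the request should go straight to the fallback path.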
## Impact: batch processing statistics

Over multiple production batch runs (50+ requests):

| Metric | Value |
|---|---|
| Clean first-attempt success | ~40-45% |
| `pause_turn` on first attempt | ~55-60% |
| Success after 1 retry | ~30% |
| Total failure (all 3 attempts) | ~5-10% |
| Average time per request (including retries) | ~32 minutes |
| Average time per clean request | ~11 minutes |
| Cost per failed attempt | ~$0.15-0.25 (tokens consumed but no output) |
A batch of 35 requests takes 9+ hours and typically completes only 50-60% of items. Failed attempts still incur full token costs for all internal tool calls.
## Additional issue: connection drops
During long-running streaming requests (10+ minutes), we also see intermittent connection drops:

```
peer closed connection without sending complete message body (incomplete chunked read)
```
This appears to be a separate issue but compounds the reliability problem.
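The drops themselves can be retried client-side. A hedged sketch of a generic wrapper: `run_stream` is any zero-arg callable that performs the full streaming request and returns the final message, and the exception types to retry on should be adjusted to what you actually observe (with the `anthropic` SDK these surface as connection errors):

```python
import time

def stream_with_retry(run_stream, attempts=3, delay=60.0,
                      retry_on=(ConnectionError,)):
    """Re-run a streaming request when the connection drops mid-stream.
    Pass the SDK's connection-error class via retry_on in real use."""
    last_exc = None
    for attempt in range(attempts):
        try:
            return run_stream()
        except retry_on as exc:
            last_exc = exc
            if attempt < attempts - 1:
                time.sleep(delay)  # back off before re-issuing the stream
    raise last_exc
```

Note this re-runs the whole request from scratch (and re-pays its tokens); it does not resume a partially received stream.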
## What we've tried

- Reducing `max_uses` (2 vs 5): No effect on `pause_turn` rate. Time difference is negligible (~10 seconds) because each internal turn reprocesses the full accumulated context.
- Retry with delay (60s between attempts, up to 3 retries): Helps ~30% of the time, but the same request often gets stuck repeatedly.
- Different tickers/topics: The issue is not content-specific. It occurs across diverse topics with no pattern — the same request may succeed on retry without any changes.
## Workaround: `allowed_callers: ["direct"]`

Setting `allowed_callers: ["direct"]` on the tool definition disables dynamic filtering (internal `code_execution`). We tested this with identical prompts, model, and workload:
```python
tools=[{
    "type": "web_search_20250209",
    "name": "web_search",
    "max_uses": 5,
    "allowed_callers": ["direct"],  # disables dynamic filtering
}]
```

Before vs. after (same 25 unique tickers, same prompts):
| Metric | Default (dynamic filtering) | `allowed_callers: ["direct"]` |
|---|---|---|
| Success rate | ~45% first attempt | 100% (50/50) |
| `pause_turn` rate | 55-60% | 0% |
| Avg time/request | ~32 min (incl. retries) | ~46 seconds |
| Avg cost/request | ~$0.32 | ~$0.20 |
| Total batch time | 9+ hours (17/35 completed) | 38 min (50/50 completed) |
| Retries needed | Constant | Zero |
| Connection drops | Intermittent | Zero |
The 3 tickers that failed all retry attempts with dynamic filtering (SNDK, HUN, GEV) all succeeded on first attempt with `allowed_callers: ["direct"]` in under 50 seconds.
Output quality appears equivalent — the model still performs 4-5 web searches and produces structured JSON profiles with the same level of detail. The only difference is the absence of the internal `code_execution` filtering step.
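In our batch code we isolate the toggle in a small helper so runs can flip between the two modes for comparison. The helper name is ours, not part of the SDK:

```python
def web_search_tool(direct_only=True, max_uses=5):
    """Build the web_search tool definition; direct_only=True applies
    the allowed_callers workaround described above."""
    tool = {
        "type": "web_search_20250209",
        "name": "web_search",
        "max_uses": max_uses,
    }
    if direct_only:
        tool["allowed_callers"] = ["direct"]  # disables dynamic filtering
    return tool
```

Passing `tools=[web_search_tool()]` then reproduces the 100%-success configuration from the table above.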
## Questions for the team

- Is `allowed_callers: ["direct"]` a supported, stable parameter? We found it in the docs but it's not prominently documented. We're now dependent on it for production use. Is it expected to remain available?
- What is the intended quality benefit of dynamic filtering? In our testing (50 requests, structured JSON output from web search), we see no quality difference. Is there a use case where dynamic filtering meaningfully improves results?
- Is this a known regression? The `pause_turn` rate increased over time (from ~33% in early March 2026 to ~60% by March 9-10). Was there a change to the dynamic filtering behavior?
- Can the server-side iteration limit be increased or made configurable? For users who want dynamic filtering, 10 iterations appears too low for search + structured output tasks.
## Related issues in other frameworks

This problem is causing issues across the ecosystem:

- Pydantic AI #2600: `pause_turn` not handled correctly with built-in tools
- LiteLLM #17737: `server_tool_use` incorrectly converted, breaking multi-turn
- LiteLLM #18839: Missing `code_execution` tool results
- LangChain #33920: `code_execution_tool_result` blocks missing from streaming
- Vercel AI SDK #11855: `tool-result` in assistant message throws API error

All of these trace back to the same root cause: the `web_search_20250209` dynamic filtering loop hitting the iteration limit.
## Suggested improvements

- Document `allowed_callers: ["direct"]` prominently as the recommended approach for batch/automated workflows where `pause_turn` reliability matters
- Increase the server-side iteration limit for dynamic filtering — 10 iterations is insufficient for search + structured output
- Add a configurable parameter (e.g., `max_server_iterations`) so callers can trade latency for reliability
- Improve the dynamic filtering loop to detect when it's not making progress and break early with usable partial output
- Return partial text output on `pause_turn` — currently the model's intermediate reasoning is lost, making `pause_turn` equivalent to a total failure
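On the last point, callers can at least salvage whatever text blocks a `pause_turn` response does contain instead of discarding the attempt. A sketch, assuming content blocks shaped like the Messages API format (either dicts or SDK objects exposing `.type`/`.text`):

```python
def salvage_partial_text(content_blocks):
    """Collect the text blocks from a paused response so a failed
    attempt is not a total loss."""
    parts = []
    for block in content_blocks:
        btype = (block.get("type") if isinstance(block, dict)
                 else getattr(block, "type", None))
        if btype == "text":
            parts.append(block.get("text") if isinstance(block, dict)
                         else block.text)
    return "\n".join(parts)
```

In the failure pattern shown earlier this recovers only the model's early interleaved commentary, not the final JSON — which is why a server-side fix is still needed.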