
web_search_20250209 dynamic filtering causes excessive pause_turn at ~110 content blocks — batch processing impractical #1237

@som3k

Description


Summary

When using the web_search_20250209 tool with the Messages API, the model frequently enters an internal loop of code_execution calls during dynamic filtering. The server-side sampling loop hits its 10-iteration limit and returns stop_reason: "pause_turn" with ~110 content blocks and no usable output. This happens on 50-60% of requests in production batch runs, making the tool unreliable for automated workflows.

Environment

  • SDK: anthropic Python SDK (latest)
  • Model: claude-sonnet-4-6
  • Tool: web_search_20250209 with max_uses: 5
  • Streaming: Yes (required due to long response times)
  • Prompt caching: Yes (cache_control: ephemeral on system prompt)

Reproduction

import anthropic

client = anthropic.Anthropic()
system_prompt = "..."  # asks the model to research via web search and return structured JSON

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=8192,
    tools=[{
        "type": "web_search_20250209",
        "name": "web_search",
        "max_uses": 5,
    }],
    system=[{"type": "text", "text": system_prompt}],
    messages=[{"role": "user", "content": "Research {TICKER} and return a structured JSON profile"}],  # {TICKER} substituted per request
)
# response.stop_reason == "pause_turn" ~55% of the time

The system prompt asks the model to research a topic via web search and return a structured JSON object. This is a straightforward single-turn request — no multi-turn conversation, no user-provided tools.

Observed behavior

  1. Model starts web search, makes 1-3 search requests
  2. Internal code_execution tool fires repeatedly for dynamic filtering
  3. Content blocks accumulate: alternating server_tool_use + code_execution_tool_result (or bash_code_execution_tool_result)
  4. At ~110 content blocks, stop_reason: "pause_turn" is returned
  5. No usable text output — the model never reached the point of writing its response
  6. Retrying (sending the response back as assistant message per docs) usually results in another pause_turn
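The retry flow in step 6 can be wrapped in a small helper. This is a sketch of our retry logic, not SDK code: it assumes any client object exposing `messages.create(**params)` and follows the documented pattern of appending the paused response's content as an assistant message before re-sending. The function name and `max_continuations` parameter are ours.

```python
def continue_until_done(client, params, max_continuations=3):
    """Resume a turn that stopped with stop_reason == "pause_turn".

    Per the docs, a paused turn is continued by appending the paused
    response's content as an assistant message and re-sending. In our
    experience this often just pauses again, so cap the continuations.
    """
    messages = list(params["messages"])
    response = client.messages.create(**{**params, "messages": messages})
    continuations = 0
    while response.stop_reason == "pause_turn" and continuations < max_continuations:
        # Feed the partial turn back so the server can resume where it left off.
        messages.append({"role": "assistant", "content": response.content})
        response = client.messages.create(**{**params, "messages": messages})
        continuations += 1
    return response, continuations
```

As the numbers below show, this loop converges on only a fraction of paused requests, which is why we moved to the allowed_callers workaround instead.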

Typical content block pattern at failure:

['text', 'server_tool_use', 'web_search_tool_result', 'text',
 'server_tool_use', 'code_execution_tool_result', 'server_tool_use',
 'code_execution_tool_result', 'server_tool_use', 'code_execution_tool_result',
 ... (repeating ~40-50 times) ...]
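A quick way to confirm this pattern in your own failures is to tally the block types on a paused response. A minimal sketch (the helper is ours; it assumes each block is an SDK object with a `.type` attribute, a dict with a `"type"` key, or a plain type-name string as in the log excerpt above):

```python
from collections import Counter

def block_histogram(content):
    """Tally content-block types from response.content."""
    def block_type(block):
        if isinstance(block, str):
            return block          # logged type names, as in the pattern above
        if isinstance(block, dict):
            return block["type"]  # raw JSON blocks
        return block.type         # SDK block objects
    return Counter(block_type(b) for b in content)
```

On our paused responses, `server_tool_use` and `code_execution_tool_result` dominate the histogram, with only a handful of `text` blocks and no final answer.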

Impact — batch processing statistics

Over multiple production batch runs (50+ requests):

| Metric | Value |
| --- | --- |
| Clean first-attempt success | ~40-45% |
| pause_turn on first attempt | ~55-60% |
| Success after 1 retry | ~30% |
| Total failure (all 3 attempts) | ~5-10% |
| Average time per request (including retries) | ~32 minutes |
| Average time per clean request | ~11 minutes |
| Cost per failed attempt | ~$0.15-0.25 (tokens consumed but no output) |

A batch of 35 requests takes 9+ hours and typically completes only 50-60% of items. Failed attempts still incur full token costs for all internal tool calls.

Additional issue: connection drops

During long-running streaming requests (10+ minutes), we also see intermittent connection drops:

peer closed connection without sending complete message body (incomplete chunked read)

This appears to be a separate issue but compounds the reliability problem.
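Until the root cause is fixed, we paper over the drops with a generic retry wrapper around the streaming call. A sketch under our assumptions: the drop surfaces as a catchable exception (for the Python SDK this is httpx's RemoteProtocolError in our logs), and the exception tuple, attempt count, and backoff schedule here are ours, not the SDK's.

```python
import time

def with_retries(fn, retriable=(ConnectionError,), attempts=3,
                 base_delay=2.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff on connection drops.

    Pass the real exception type via `retriable` (for us,
    httpx.RemoteProtocolError). `sleep` is injectable for testing.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retriable:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the drop to the caller
            sleep(base_delay * (2 ** attempt))
```

Note that retrying a dropped stream re-runs the whole request from scratch, so each drop adds the full 10+ minute request time again.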

What we've tried

  • Reducing max_uses (2 vs 5): No effect on pause_turn rate. Time difference is negligible (~10 seconds) because each internal turn reprocesses the full accumulated context.
  • Retry with delay (60s between attempts, up to 3 retries): Helps ~30% of the time, but the same request often gets stuck repeatedly.
  • Different tickers/topics: The issue is not content-specific. It occurs across diverse topics with no pattern — the same request may succeed on retry without any changes.

Workaround: allowed_callers: ["direct"]

Setting allowed_callers: ["direct"] on the tool definition disables dynamic filtering (internal code_execution). We tested this with identical prompts, model, and workload:

tools=[{
    "type": "web_search_20250209",
    "name": "web_search",
    "max_uses": 5,
    "allowed_callers": ["direct"],  # disables dynamic filtering
}]

Before vs. after (same 25 unique tickers, same prompts):

| Metric | Default (dynamic filtering) | allowed_callers: ["direct"] |
| --- | --- | --- |
| Success rate | ~45% first attempt | 100% (50/50) |
| pause_turn rate | 55-60% | 0% |
| Avg time/request | ~32 min (incl. retries) | ~46 seconds |
| Avg cost/request | ~$0.32 | ~$0.20 |
| Total batch time | 9+ hours (17/35 completed) | 38 min (50/50 completed) |
| Retries needed | Constant | Zero |
| Connection drops | Intermittent | Zero |

The 3 tickers that failed all retry attempts with dynamic filtering (SNDK, HUN, GEV) all succeeded on first attempt with allowed_callers: ["direct"] in under 50 seconds.

Output quality appears equivalent — the model still performs 4-5 web searches and produces structured JSON profiles with the same level of detail. The only difference is the absence of the internal code_execution filtering step.

Questions for the team

  1. Is allowed_callers: ["direct"] a supported, stable parameter? We found it in the docs but it's not prominently documented. We're now dependent on it for production use. Is it expected to remain available?

  2. What is the intended quality benefit of dynamic filtering? In our testing (50 requests, structured JSON output from web search), we see no quality difference. Is there a use case where dynamic filtering meaningfully improves results?

  3. Is this a known regression? The pause_turn rate increased over time (from ~33% in early March 2026 to ~60% by March 9-10). Was there a change to the dynamic filtering behavior?

  4. Can the server-side iteration limit be increased or made configurable? For users who want dynamic filtering, 10 iterations appears too low for search + structured output tasks.

Related issues in other frameworks

This problem is causing issues across the ecosystem:

  • Pydantic AI #2600: pause_turn not handled correctly with built-in tools
  • LiteLLM #17737: server_tool_use incorrectly converted, breaking multi-turn
  • LiteLLM #18839: Missing code_execution tool results
  • LangChain #33920: code_execution_tool_result blocks missing from streaming
  • Vercel AI SDK #11855: tool-result in assistant message throws API error

All of these trace back to the same root cause: the web_search_20250209 dynamic filtering loop hitting the iteration limit.

Suggested improvements

  1. Document allowed_callers: ["direct"] prominently as the recommended approach for batch/automated workflows where pause_turn reliability matters
  2. Increase the server-side iteration limit for dynamic filtering — 10 iterations is insufficient for search + structured output
  3. Add a configurable parameter (e.g., max_server_iterations) so callers can trade latency for reliability
  4. Improve the dynamic filtering loop to detect when it's not making progress and break early with usable partial output
  5. Return partial text output on pause_turn — currently the model's intermediate reasoning is lost, making pause_turn equivalent to a total failure
