## Summary
When using the `web_search_20250209` tool with the Messages API, the model frequently enters an internal loop of `code_execution` calls during dynamic filtering. The server-side sampling loop hits its 10-iteration limit and returns `stop_reason: "pause_turn"` with ~110 content blocks — without producing any usable output. This happens on 50-60% of requests in production batch runs, making the tool unreliable for automated workflows.
## Environment

- SDK: `anthropic` Python SDK (latest)
- Model: `claude-sonnet-4-6`
- Tool: `web_search_20250209` with `max_uses: 5`
- Streaming: Yes (required due to long response times)
- Prompt caching: Yes (`cache_control: ephemeral` on system prompt)
## Reproduction

```python
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=8192,
    tools=[{
        "type": "web_search_20250209",
        "name": "web_search",
        "max_uses": 5,
    }],
    system=[{"type": "text", "text": system_prompt}],
    messages=[{"role": "user", "content": "Research {TICKER} and return a structured JSON profile"}],
)
# response.stop_reason == "pause_turn" ~55% of the time
```

The system prompt asks the model to research a topic via web search and return a structured JSON object. This is a straightforward single-turn request — no multi-turn conversation, no user-provided tools.
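When `pause_turn` does occur, the documented recovery is to feed the paused assistant content back and call the API again. A minimal sketch of that loop, assuming a synchronous client — the function and parameter names (`continue_until_done`, `max_continuations`) are ours, not SDK API:

```python
def continue_until_done(client, request_kwargs, max_continuations=3):
    """Re-issue a request while the server returns pause_turn, appending
    the paused assistant content so the server can resume the turn."""
    kwargs = dict(request_kwargs)          # don't mutate the caller's dict
    messages = list(kwargs.pop("messages"))
    response = client.messages.create(messages=messages, **kwargs)
    for _ in range(max_continuations):
        if response.stop_reason != "pause_turn":
            break
        # Append the paused turn verbatim as an assistant message.
        messages.append({"role": "assistant", "content": response.content})
        response = client.messages.create(messages=messages, **kwargs)
    return response
```

As described below, this continuation often just pauses again, so in our case it was not a sufficient fix.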
## Observed behavior

- Model starts web search, makes 1-3 search requests
- Internal `code_execution` tool fires repeatedly for dynamic filtering
- Content blocks accumulate: alternating `server_tool_use` + `code_execution_tool_result` (or `bash_code_execution_tool_result`)
- At ~110 content blocks, `stop_reason: "pause_turn"` is returned
- No usable text output — the model never reached the point of writing its response
- Retrying (sending the response back as an assistant message per docs) usually results in another `pause_turn`
Typical content block pattern at failure:

```
['text', 'server_tool_use', 'web_search_tool_result', 'text',
 'server_tool_use', 'code_execution_tool_result', 'server_tool_use',
 'code_execution_tool_result', 'server_tool_use', 'code_execution_tool_result',
 ... (repeating ~40-50 times) ...]
```
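A client-side heuristic can at least flag this runaway pattern cheaply. A sketch, where the threshold is an assumption tuned to the ~110-block failures above, not a documented limit:

```python
from collections import Counter

def looks_like_filtering_loop(block_types, threshold=20):
    """Heuristic: a response dominated by code_execution result blocks
    is almost certainly stuck in the dynamic filtering loop."""
    counts = Counter(block_types)
    exec_results = (counts["code_execution_tool_result"]
                    + counts["bash_code_execution_tool_result"])
    return exec_results >= threshold
```

We use this to decide whether a retry is worth attempting or the request should go straight to the fallback path.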
## Impact: batch processing statistics

Over multiple production batch runs (50+ requests):

| Metric | Value |
|---|---|
| Clean first-attempt success | ~40-45% |
| `pause_turn` on first attempt | ~55-60% |
| Success after 1 retry | ~30% |
| Total failure (all 3 attempts) | ~5-10% |
| Average time per request (including retries) | ~32 minutes |
| Average time per clean request | ~11 minutes |
| Cost per failed attempt | ~$0.15-0.25 (tokens consumed but no output) |
A batch of 35 requests takes 9+ hours and typically completes only 50-60% of items. Failed attempts still incur full token costs for all internal tool calls.
## Additional issue: connection drops
During long-running streaming requests (10+ minutes), we also see intermittent connection drops:

```
peer closed connection without sending complete message body (incomplete chunked read)
```
This appears to be a separate issue but compounds the reliability problem.
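The drops themselves can be retried client-side. A hedged sketch of a generic wrapper: `run_stream` is any zero-arg callable that performs the full streaming request and returns the final message, and the exception types to retry on should be adjusted to what you actually observe (with the `anthropic` SDK these surface as connection errors):

```python
import time

def stream_with_retry(run_stream, attempts=3, delay=60.0,
                      retry_on=(ConnectionError,)):
    """Re-run a streaming request when the connection drops mid-stream.
    Pass the SDK's connection-error class via retry_on in real use."""
    last_exc = None
    for attempt in range(attempts):
        try:
            return run_stream()
        except retry_on as exc:
            last_exc = exc
            if attempt < attempts - 1:
                time.sleep(delay)  # back off before re-issuing the stream
    raise last_exc
```

Note this re-runs the whole request from scratch (and re-pays its tokens); it does not resume a partially received stream.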
## What we've tried

- Reducing `max_uses` (2 vs 5): No effect on `pause_turn` rate. Time difference is negligible (~10 seconds) because each internal turn reprocesses the full accumulated context.
- Retry with delay (60s between attempts, up to 3 retries): Helps ~30% of the time, but the same request often gets stuck repeatedly.
- Different tickers/topics: The issue is not content-specific. It occurs across diverse topics with no pattern — the same request may succeed on retry without any changes.
## Workaround: `allowed_callers: ["direct"]`

Setting `allowed_callers: ["direct"]` on the tool definition disables dynamic filtering (internal `code_execution`). We tested this with identical prompts, model, and workload:
```python
tools=[{
    "type": "web_search_20250209",
    "name": "web_search",
    "max_uses": 5,
    "allowed_callers": ["direct"],  # disables dynamic filtering
}]
```

Before vs. after (same 25 unique tickers, same prompts):
| Metric | Default (dynamic filtering) | `allowed_callers: ["direct"]` |
|---|---|---|
| Success rate | ~45% first attempt | 100% (50/50) |
| `pause_turn` rate | 55-60% | 0% |
| Avg time/request | ~32 min (incl. retries) | ~46 seconds |
| Avg cost/request | ~$0.32 | ~$0.20 |
| Total batch time | 9+ hours (17/35 completed) | 38 min (50/50 completed) |
| Retries needed | Constant | Zero |
| Connection drops | Intermittent | Zero |
The 3 tickers that failed all retry attempts with dynamic filtering (SNDK, HUN, GEV) all succeeded on first attempt with `allowed_callers: ["direct"]` in under 50 seconds.
Output quality appears equivalent — the model still performs 4-5 web searches and produces structured JSON profiles with the same level of detail. The only difference is the absence of the internal `code_execution` filtering step.
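In our batch code we isolate the toggle in a small helper so runs can flip between the two modes for comparison. The helper name is ours, not part of the SDK:

```python
def web_search_tool(direct_only=True, max_uses=5):
    """Build the web_search tool definition; direct_only=True applies
    the allowed_callers workaround described above."""
    tool = {
        "type": "web_search_20250209",
        "name": "web_search",
        "max_uses": max_uses,
    }
    if direct_only:
        tool["allowed_callers"] = ["direct"]  # disables dynamic filtering
    return tool
```

Passing `tools=[web_search_tool()]` then reproduces the 100%-success configuration from the table above.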
## Questions for the team

- Is `allowed_callers: ["direct"]` a supported, stable parameter? We found it in the docs but it's not prominently documented. We're now dependent on it for production use. Is it expected to remain available?
- What is the intended quality benefit of dynamic filtering? In our testing (50 requests, structured JSON output from web search), we see no quality difference. Is there a use case where dynamic filtering meaningfully improves results?
- Is this a known regression? The `pause_turn` rate increased over time (from ~33% in early March 2026 to ~60% by March 9-10). Was there a change to the dynamic filtering behavior?
- Can the server-side iteration limit be increased or made configurable? For users who want dynamic filtering, 10 iterations appears too low for search + structured output tasks.
## Related issues in other frameworks

This problem is causing issues across the ecosystem:

- Pydantic AI #2600: `pause_turn` not handled correctly with built-in tools
- LiteLLM #17737: `server_tool_use` incorrectly converted, breaking multi-turn
- LiteLLM #18839: Missing `code_execution` tool results
- LangChain #33920: `code_execution_tool_result` blocks missing from streaming
- Vercel AI SDK #11855: `tool-result` in assistant message throws API error

All of these trace back to the same root cause: the `web_search_20250209` dynamic filtering loop hitting the iteration limit.
## Suggested improvements

- Document `allowed_callers: ["direct"]` prominently as the recommended approach for batch/automated workflows where `pause_turn` reliability matters
- Increase the server-side iteration limit for dynamic filtering — 10 iterations is insufficient for search + structured output
- Add a configurable parameter (e.g., `max_server_iterations`) so callers can trade latency for reliability
- Improve the dynamic filtering loop to detect when it's not making progress and break early with usable partial output
- Return partial text output on `pause_turn` — currently the model's intermediate reasoning is lost, making `pause_turn` equivalent to a total failure
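On the last point, callers can at least salvage whatever text blocks a `pause_turn` response does contain instead of discarding the attempt. A sketch, assuming content blocks shaped like the Messages API format (either dicts or SDK objects exposing `.type`/`.text`):

```python
def salvage_partial_text(content_blocks):
    """Collect the text blocks from a paused response so a failed
    attempt is not a total loss."""
    parts = []
    for block in content_blocks:
        btype = (block.get("type") if isinstance(block, dict)
                 else getattr(block, "type", None))
        if btype == "text":
            parts.append(block.get("text") if isinstance(block, dict)
                         else block.text)
    return "\n".join(parts)
```

In the failure pattern shown earlier this recovers only the model's early interleaved commentary, not the final JSON — which is why a server-side fix is still needed.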