Commit 78adb72

backport: copy tests/interaction/ verbatim from main (phase 0)
Excludes tests/interaction from pyright until the backport restores type correctness phase by phase.
1 parent 1abcca2 commit 78adb72

55 files changed

Lines changed: 14234 additions & 0 deletions

pyproject.toml

Lines changed: 2 additions & 0 deletions
@@ -95,6 +95,8 @@ packages = ["src/mcp"]
 [tool.pyright]
 typeCheckingMode = "strict"
 include = ["src/mcp", "tests", "examples/servers", "examples/snippets"]
+# tests/interaction is mid-backport from main; type-checking is restored phase by phase.
+exclude = ["tests/interaction"]
 venvPath = "."
 venv = ".venv"
 # The FastAPI style of using decorators in tests gives a `reportUnusedFunction` error.

tests/interaction/README.md

Lines changed: 228 additions & 0 deletions
@@ -0,0 +1,228 @@
# Interaction-model test suite

This suite enumerates the MCP interaction model as end-to-end tests: one test per piece of
functionality, asserting the full client↔server round trip through the public API. It exists to
pin the SDK's observable behaviour — every request type, every notification direction, every
error plane — so that internal rewrites of the send/receive path can be proven equivalent by
running the suite before and after.

```bash
uv run --frozen pytest tests/interaction/
```

The whole suite is in-process and event-driven — including the streamable HTTP, SSE, and OAuth
flows — with a single subprocess test for stdio.

## Ground rules

- **Public API only.** Tests drive a `Client` connected to a `Server` or `MCPServer`. Nothing
  reaches into session internals, so the suite keeps working when those internals change.
  `ClientSession` is used directly only for behaviours `Client` cannot express (skipping
  initialization, requesting a non-default protocol version).
- **Pin current behaviour.** Every test passes against the current `main`, including behaviours
  that diverge from the specification. A failing or xfailed test proves nothing about whether a
  rewrite preserved behaviour; a passing test that pins the wrong output exactly does. Known
  divergences are recorded as data on the requirement (see below), not worked around in the test.
- **Spec-mandated assertions, not implementation quirks.** Error *codes* are asserted against
  the constants in `mcp.types`; error *message strings* are pinned only where they are the
  SDK's own deliberate output.
- **No sleeps, no real I/O.** Concurrency is coordinated with `anyio.Event`; every wait that
  could hang is bounded by `anyio.fail_after(5)`. The HTTP and OAuth tests drive the Starlette
  app in-process through the suite's streaming ASGI bridge (`transports/_bridge.py`), which
  delivers each response chunk as the server produces it — full duplex, but still no sockets,
  threads, or subprocesses anywhere outside the one stdio test.
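
The "no sleeps" rule can be illustrated with the standard library alone. The suite itself uses `anyio.Event` and `anyio.fail_after(5)`; this minimal sketch swaps in `asyncio.Event` and `asyncio.wait_for` so that it runs without extra dependencies. The names `emit_later` and `wait_for_signal` are hypothetical and do not come from the suite.

```python
import asyncio


async def emit_later(received: list[str], seen: asyncio.Event) -> None:
    """Hypothetical stand-in for a handler that receives a notification later on."""
    await asyncio.sleep(0)  # yield to the event loop once; this is not a timed sleep
    received.append("notifications/message")
    seen.set()  # signal the waiting test body


async def wait_for_signal(timeout: float = 5.0) -> list[str]:
    received: list[str] = []
    seen = asyncio.Event()
    task = asyncio.create_task(emit_later(received, seen))
    # Bounded wait: a missing notification fails after `timeout` seconds instead of hanging.
    await asyncio.wait_for(seen.wait(), timeout)
    await task
    return received


if __name__ == "__main__":
    print(asyncio.run(wait_for_signal()))
```

The test body never guesses how long the other side needs: it blocks on the event, and the timeout only decides how quickly a missing signal turns into a failure.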

## Layout

```text
tests/interaction/
  _requirements.py    the requirements manifest (see below)
  _helpers.py         shared type aliases + the wire-recording transport
  _connect.py         the transport-parametrized connection factories
  conftest.py         the connect fixture (the transport matrix)
  test_coverage.py    enforces the manifest ↔ test contract
  lowlevel/           one file per feature area, against the low-level Server
  mcpserver/          the same feature areas in MCPServer's natural idiom
  transports/         behaviour specific to one transport (sessions, resumability, framing)
  auth/               OAuth flows against an in-process authorization server
```

The two server APIs produce genuinely different wire output for the same conceptual feature
(`MCPServer` generates schemas, converts exceptions to `isError` results, attaches structured
content), so they get parallel directories with mirrored file names rather than one parametrized
test body — each directory pins its flavour's true output exactly.

### The transport matrix

Transport-agnostic tests take the `connect` fixture instead of constructing `Client(server)`
directly, and therefore run once per transport: over the in-memory transport, over the server's
real streamable HTTP app driven in-process through the streaming bridge, and over the legacy SSE
transport the same way. A test connects with `async with connect(server, ...) as client:` and
asserts the same output on every leg, because the transport is not supposed to change observable
behaviour. Tests that are tied to one transport do not use the fixture: the wire-recording tests
(their seam is the in-memory stream pair), the bare-`ClientSession` lifecycle tests, the
real-clock timeout tests (the timeout machinery is transport-independent and must not race
transport latency), and everything under `transports/`, which pins behaviour only observable on
that transport.

A transport conformance test in `transports/` speaks raw `httpx` against the mounted ASGI app
**only** when its assertion is about HTTP semantics that `Client` cannot observe — status codes,
response headers, SSE event fields, which stream a message travels on. Any other behaviour is
asserted through a `Client`, connected to the mounted app via `client_via_http(http)` so several
clients can share one session manager.
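
The idea behind the `connect` fixture, one test body asserted identically on every transport leg, can be sketched without pytest. Everything here is an assumption made for illustration: `FakeClient`, `make_connect`, `LEGS`, and the leg names are hypothetical, and the real fixture in `conftest.py` is a pytest fixture that builds actual transports rather than a loop over stubs.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator


class FakeClient:
    """Hypothetical client stub; a real leg would wrap an actual transport."""

    def __init__(self, leg: str) -> None:
        self.leg = leg

    async def call_tool(self, name: str, arguments: dict[str, int]) -> str:
        return str(arguments["a"] + arguments["b"])


def make_connect(leg: str):
    """Build a connection factory for one leg of the (hypothetical) matrix."""

    @asynccontextmanager
    async def connect() -> AsyncIterator[FakeClient]:
        yield FakeClient(leg)

    return connect


# Hypothetical leg names standing in for in-memory, streamable HTTP, and SSE.
LEGS = {leg: make_connect(leg) for leg in ("memory", "streamable-http", "sse")}


async def run_on_every_leg() -> dict[str, str]:
    """Run the same body once per leg and collect what each leg observed."""
    results: dict[str, str] = {}
    for leg, connect in LEGS.items():
        async with connect() as client:
            results[leg] = await client.call_tool("add", {"a": 2, "b": 3})
    return results


if __name__ == "__main__":
    print(asyncio.run(run_on_every_leg()))
```

Because the body only sees the object yielded by `connect()`, any difference between legs shows up as a differing result rather than as a differing test.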

## The requirements manifest

`_requirements.py` maps every behaviour the suite covers to the reason it must hold:

```python
"tools:call:content:text": Requirement(
    source=f"{SPEC_BASE_URL}/server/tools#text-content",
    behavior="tools/call delivers arguments to the tool handler and returns its text content.",
),
```

- **`source`** is a deep link into the MCP specification for externally mandated behaviour,
  the literal string `"sdk"` for behaviour the SDK chose where the spec is silent, or
  `"issue:#n"` for a regression lock.
- **`behavior`** describes the *required* behaviour — what the specification (or the SDK's own
  contract) says should happen. Tests always pin the SDK's current behaviour; where that falls
  short of `behavior`, the gap is recorded as data rather than hidden in the test.
- **`divergence`** records that gap for entries whose tests pin the divergent current behaviour.
- **`deferred`** marks a behaviour that is tracked but has no test in this suite, with a precise
  reason: the SDK does not implement it, the negative cannot be observed, the assertion is
  schema-level rather than interaction-level, the feature is experimental (tasks), or the test
  would require real-time waits the suite refuses.
- **`transports`** names the transports a behaviour applies to; omitted means transport-independent.
- **`issue`** carries the tracking link for a recorded gap once one is filed.

Tests link themselves to the manifest with a decorator:

```python
@requirement("tools:call:content:text")
async def test_call_tool_returns_text_content() -> None: ...
```

`test_coverage.py` enforces the contract in both directions: every non-deferred requirement must
be exercised by at least one test, every deferred requirement by none, and an unknown ID fails at
import time. A behaviour without a manifest entry cannot be silently half-tested, and a manifest
entry without a test cannot be silently aspirational.
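
The two-way contract can be sketched in a few lines. This is not the suite's implementation: the pared-down `Requirement`, the `REQUIREMENTS` and `COVERED` dictionaries, `coverage_errors`, and both `example:*` IDs are hypothetical, chosen only to show how a decorator can reject unknown IDs when it runs and how a check can walk the manifest in both directions.

```python
from dataclasses import dataclass
from typing import Callable, Optional, TypeVar

F = TypeVar("F", bound=Callable[..., object])


@dataclass(frozen=True)
class Requirement:
    """Hypothetical, pared-down manifest entry."""

    source: str
    behavior: str
    deferred: Optional[str] = None  # reason there is no test, if any


# Hypothetical two-entry manifest, for illustration only.
REQUIREMENTS: dict[str, Requirement] = {
    "example:tested": Requirement(source="sdk", behavior="Has a test."),
    "example:deferred": Requirement(
        source="sdk",
        behavior="Tracked but untested.",
        deferred="cannot be observed through the public API",
    ),
}

COVERED: dict[str, list[str]] = {}  # requirement ID -> names of tests that claim it


def requirement(req_id: str) -> Callable[[F], F]:
    """Link a test to a manifest entry; an unknown ID fails as soon as the decorator runs."""
    if req_id not in REQUIREMENTS:
        raise KeyError(f"unknown requirement ID: {req_id}")

    def decorate(test: F) -> F:
        COVERED.setdefault(req_id, []).append(test.__name__)
        return test

    return decorate


def coverage_errors() -> list[str]:
    """Check the contract in both directions and return every violation found."""
    errors: list[str] = []
    for req_id, req in REQUIREMENTS.items():
        tested = bool(COVERED.get(req_id))
        if req.deferred is None and not tested:
            errors.append(f"{req_id}: no test")
        if req.deferred is not None and tested:
            errors.append(f"{req_id}: deferred but tested")
    return errors


@requirement("example:tested")
def test_example_behaviour() -> None: ...
```

Since decorators execute while the test module is being imported, a misspelled ID surfaces during collection instead of producing a silently unlinked test.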

### The divergence lifecycle

1. A test reveals that the SDK does not do what the spec says. The test pins what the SDK
   *actually does* and a `Divergence(note=..., issue=...)` goes on the requirement.
2. When the behaviour is eventually fixed, the pinned test fails. Whoever makes the change finds
   the divergence note explaining that the old behaviour was a known gap, re-pins the test to the
   spec-correct output, and deletes the `Divergence`.
3. An empty divergence list means the SDK is spec-conformant on every behaviour the suite covers.

A requirement may carry both `divergence` and `deferred`: the divergence records that the SDK falls
short of the spec, and the deferral records why no test pins it (typically because the divergent
behaviour cannot be driven through the public API). Divergence alone implies a test pins the
divergent behaviour; divergence plus deferred means the gap is known but unpinned.

This is also the triage key for any rewrite: a test that fails on the new code path either has a
divergence note (the rewrite accidentally fixed a known gap — decide whether to keep the fix) or
it does not (the rewrite broke something that was correct — fix the rewrite).
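
The triage rule reduces to a single lookup per failing test. The sketch below assumes a hypothetical `Entry` type that keeps only a divergence note, and a hypothetical mapping from failing test names to requirement IDs; neither is part of the suite.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    """Hypothetical manifest entry reduced to the one field triage needs."""

    divergence: Optional[str] = None


def triage(failing: dict[str, str], manifest: dict[str, Entry]) -> dict[str, str]:
    """Map each failing test name to a next step, keyed on its requirement's divergence."""
    verdicts: dict[str, str] = {}
    for test_name, req_id in failing.items():
        if manifest[req_id].divergence is not None:
            # The pinned behaviour was a known gap, so the rewrite changed it.
            verdicts[test_name] = "known gap changed: decide whether to keep the fix"
        else:
            # The pinned behaviour was correct, so the rewrite regressed it.
            verdicts[test_name] = "regression: fix the rewrite"
    return verdicts
```

For example, with a manifest in which only `"example:gap"` carries a note, a failing test linked to it is flagged for a keep-or-revert decision, while every other failure is treated as a regression.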

### When a new spec revision is released

1. Update `SPEC_REVISION` and walk the new revision's changelog.
2. For each changed interaction, find its requirements (the IDs use the wire method strings the
   changelog speaks in), re-audit the tests against the new text, and update `source` links and
   assertions where behaviour legitimately changed.
3. New interactions get new requirements and new tests; removed interactions get their
   requirements deleted along with their tests.
4. A behaviour that is correct under both revisions needs no change beyond the `source` link.

## Writing a test

The shortest complete example of the conventions:

```python
@requirement("tools:call:content:text")
async def test_call_tool_returns_text_content() -> None:
    """Arguments reach the tool handler; its content comes back as the call result."""

    async def call_tool(ctx: ServerRequestContext, params: types.CallToolRequestParams) -> CallToolResult:
        assert params.name == "add"
        assert params.arguments is not None
        return CallToolResult(content=[TextContent(text=str(params.arguments["a"] + params.arguments["b"]))])

    server = Server("adder", on_call_tool=call_tool)

    async with Client(server) as client:
        result = await client.call_tool("add", {"a": 2, "b": 3})

    assert result == snapshot(CallToolResult(content=[TextContent(text="5")]))
```

- **The server is defined inside the test** (or in a small fixture at the top of the file when
  several tests genuinely share it). The whole observable behaviour fits on one screen.
- **Test names are behaviour sentences** — they state the observable outcome, not the feature
  being poked. Docstrings add the one or two sentences of context a reviewer needs, including
  whether the assertion is spec-mandated, SDK-defined, or a known divergence.
- **Handlers assert their dispatch identity first** (`assert params.name == "add"`), proving the
  request that arrived is the request the test sent.
- **The result proves the round trip.** Server-side observations travel back to the test through
  the protocol itself (a tool returns what it saw) or through a closure-captured list; the test
  asserts after the call returns.
- **Order within a test**: server handlers → server construction → client callbacks → connect →
  act → assert. The test reads in the order the conversation happens.
- A registered handler or tool that a test never invokes gets a `raise NotImplementedError` body
  so it cannot silently become load-bearing.
- A test that needs a peer no real `Server` or `Client` can play (a server that answers initialize
  with an unsupported version, a client that sends malformed params) plays that side of the wire by
  hand over `create_client_server_memory_streams()`. This scripted-peer pattern is the suite's only
  way to drive behaviour the typed API cannot produce, and the docstring of every such test says so.

Stack a second `@requirement` decorator only when a test's natural assertions incidentally prove
another behaviour — one capabilities snapshot proving four `*:capability:declared` entries, one
input-schema identity check proving each preserved keyword. Do not build a test around covering
many requirements at once; if the assertions would be separate, write separate tests.

### Choosing an assertion

| The property under test is… | Assert with |
|---|---|
| the result of a transformation (arguments → output, exception → error result) | `result == snapshot(...)` of the full object, so any field the implementation adds or drops fails the test |
| pass-through of an opaque value (`_meta`, cursors) | identity against the same variable that was sent — a snapshot of a pass-through value only matches the input because a human checked two literals correspond |
| an error | `pytest.raises(MCPError)` and a snapshot of `exc.value.error` when the message is the SDK's own; a plain `==` on `.code` against the `mcp.types` constant when it is not |
| third-party output embedded in a result (validation messages) | the stable prefix only — never pin text that changes with a dependency upgrade |
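
Two rows of the table are easy to get wrong, so here is a rough sketch of each in plain Python. `round_trip`, `failing_tool_text`, the `progressToken` value, and the error text are all hypothetical stand-ins and are not SDK output.

```python
def round_trip(meta: dict[str, str]) -> dict[str, str]:
    """Hypothetical stand-in for a request whose result must echo `_meta` unchanged."""
    return dict(meta)  # a copy, as a value would be after crossing the wire


def failing_tool_text() -> str:
    """Hypothetical error text: a stable prefix followed by third-party detail."""
    return "Error executing tool add: 1 validation error (detail varies by library version)"


# Pass-through: compare against the variable that was sent, not a second literal.
sent_meta = {"progressToken": "tok-1"}
assert round_trip(sent_meta) == sent_meta

# Third-party output: pin only the stable prefix, never the dependency's wording.
assert failing_tool_text().startswith("Error executing tool add:")
```

Comparing against `sent_meta` keeps the assertion true by construction: if the test input changes, the expectation changes with it, which a re-typed literal would not do.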

### Notifications and concurrency

The client's receive loop dispatches each incoming message to completion before reading the next,
and the in-memory transport delivers everything on one ordered stream. Together these guarantee
that every notification a server handler emits before its response reaches the client callback
before the originating request returns — so tests collect notifications into a plain list and
assert after the call, with no synchronisation. The exceptions:

- a notification not triggered by a request the test is awaiting needs an `anyio.Event` set in
  the receiving handler and awaited under `anyio.fail_after(5)`;
- the ordering guarantee does not survive transports that split messages across streams (the
  streamable HTTP standalone GET stream) — see `transports/test_streamable_http.py`.
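
The ordering argument can be modelled with a single FIFO queue. This is a toy model under stated assumptions, not the SDK's receive loop: `asyncio.Queue` stands in for the one ordered stream, and `server_side` and `call` are hypothetical names.

```python
import asyncio


async def server_side(wire: "asyncio.Queue[tuple[str, str]]") -> None:
    """Stand-in for a handler that emits two notifications and then its response."""
    await wire.put(("notification", "progress 1/2"))
    await wire.put(("notification", "progress 2/2"))
    await wire.put(("response", "done"))


async def call(wire: "asyncio.Queue[tuple[str, str]]") -> tuple[str, list[str]]:
    """Stand-in receive loop: handle each message to completion before reading the next."""
    seen: list[str] = []  # the plain list a test would collect notifications into
    server = asyncio.create_task(server_side(wire))
    while True:
        kind, payload = await wire.get()
        if kind == "response":
            await server
            return payload, seen  # every earlier notification is already in `seen`
        seen.append(payload)  # plays the role of the client callback


async def main() -> tuple[str, list[str]]:
    return await call(asyncio.Queue())


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Because the response is queued behind the notifications on the same stream, the list is complete by the time the call returns, which is why no event or lock is needed in this case.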

### Coverage

CI requires 100% line and branch coverage, including `tests/`, and `strict-no-cover` fails the
build if a line marked `# pragma: no cover` is ever executed. When a new test starts covering a
pragma'd line in `src/`, delete the pragma in the same change. Do not add new `# type: ignore` or
`# noqa` comments; restructure instead. Two pragmas are sanctioned in this suite's test code, both
for known-upstream tracer bugs and only after restructuring has been tried: `# pragma: no branch`
on a `with`/`async with` line whose only fault is coverage.py mis-tracing the exit arc of a nested
async context (reserve it for shapes that cannot collapse — a sync `with` adjacent to an
`async with`); and `# pragma: lax no cover` on a single statement that 3.11's tracer drops because
the preceding `async with` unwinds via `coro.throw()` (python/cpython#106749, wontfix on 3.11) —
this hits any test that must run statements after a `ClientSession`/`streamable_http_client` exits
but still inside an outer `async with`, and no restructure can avoid it.

A handful of `# pragma: lax no cover` markers in `src/` cover teardown exception handlers whose
execution is timing-dependent under the in-process HTTP bridge — the POST-stream and
stateless-session `except Exception` handlers in `server/streamable_http*.py`, the `_terminated`
check in `message_router`, and the response-stream double-close guard in
`BaseSession._receive_loop`. `strict-no-cover` does not check `lax` lines; do not promote them to
strict `no cover` without first making the teardown ordering deterministic. The suite also relies
on a one-line `src/mcp/server/sse.py` fix (`sse_stream_reader.aclose()`) that closes a stream the
SSE leg would otherwise leak.

tests/interaction/__init__.py

Whitespace-only changes.