| 1 | +# Interaction-model test suite |
| 2 | + |
| 3 | +This suite enumerates the MCP interaction model as end-to-end tests: one test per piece of |
| 4 | +functionality, asserting the full client↔server round trip through the public API. It exists to |
| 5 | +pin the SDK's observable behaviour — every request type, every notification direction, every |
| 6 | +error plane — so that internal rewrites of the send/receive path can be proven equivalent by |
| 7 | +running the suite before and after. |
| 8 | + |
| 9 | +```bash |
| 10 | +uv run --frozen pytest tests/interaction/ |
| 11 | +``` |
| 12 | + |
| 13 | +The suite is in-process and event-driven, including the streamable HTTP, SSE, and OAuth flows; |
| 14 | +the one exception is a single subprocess test for stdio. |
| 15 | + |
| 16 | +## Ground rules |
| 17 | + |
| 18 | +- **Public API only.** Tests drive a `Client` connected to a `Server` or `MCPServer`. Nothing |
| 19 | + reaches into session internals, so the suite keeps working when those internals change. |
| 20 | + `ClientSession` is used directly only for behaviours `Client` cannot express (skipping |
| 21 | + initialization, requesting a non-default protocol version). |
| 22 | +- **Pin current behaviour.** Every test passes against the current `main`, including behaviours |
| 23 | + that diverge from the specification. A failing or xfailed test proves nothing about whether a |
| 24 | + rewrite preserved behaviour; a passing test that pins the exact output, even a wrong one, does. Known |
| 25 | + divergences are recorded as data on the requirement (see below), not worked around in the test. |
| 26 | +- **Spec-mandated assertions, not implementation quirks.** Error *codes* are asserted against |
| 27 | + the constants in `mcp.types`; error *message strings* are pinned only where they are the |
| 28 | + SDK's own deliberate output. |
| 29 | +- **No sleeps, no real I/O.** Concurrency is coordinated with `anyio.Event`; every wait that |
| 30 | + could hang is bounded by `anyio.fail_after(5)`. The HTTP and OAuth tests drive the Starlette |
| 31 | + app in-process through the suite's streaming ASGI bridge (`transports/_bridge.py`), which |
| 32 | + delivers each response chunk as the server produces it — full duplex, but still no sockets, |
| 33 | + threads, or subprocesses anywhere outside the one stdio test. |
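
The "no sleeps" rule can be illustrated with a minimal sketch. It uses `asyncio` from the standard library as a stand-in so that it runs without extra dependencies; the suite itself coordinates with `anyio.Event` and bounds waits with `anyio.fail_after(5)`. The callback, the `server_push` helper, and the way the notification is delivered are all hypothetical.

```python
import asyncio


async def main() -> str:
    received = asyncio.Event()
    seen: list[str] = []

    async def on_notification(method: str) -> None:
        # Receiving-side callback: record the message, then release the waiter.
        seen.append(method)
        received.set()

    async def server_push() -> None:
        # Hypothetical unsolicited notification arriving at some later point.
        await on_notification("notifications/resources/list_changed")

    push = asyncio.create_task(server_push())
    # Bounded wait: raises a timeout error after 5 seconds instead of hanging.
    await asyncio.wait_for(received.wait(), timeout=5)
    await push
    return seen[0]


first_seen = asyncio.run(main())
```

The test never guesses how long the peer needs: it resumes as soon as the event is set, and the timeout only matters when something is broken.
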
| 34 | + |
| 35 | +## Layout |
| 36 | + |
| 37 | +```text |
| 38 | +tests/interaction/ |
| 39 | + _requirements.py the requirements manifest (see below) |
| 40 | + _helpers.py shared type aliases + the wire-recording transport |
| 41 | + _connect.py the transport-parametrized connection factories |
| 42 | + conftest.py the connect fixture (the transport matrix) |
| 43 | + test_coverage.py enforces the manifest ↔ test contract |
| 44 | + lowlevel/ one file per feature area, against the low-level Server |
| 45 | + mcpserver/ the same feature areas in MCPServer's natural idiom |
| 46 | + transports/ behaviour specific to one transport (sessions, resumability, framing) |
| 47 | + auth/ OAuth flows against an in-process authorization server |
| 48 | +``` |
| 49 | + |
| 50 | +The two server APIs produce genuinely different wire output for the same conceptual feature |
| 51 | +(`MCPServer` generates schemas, converts exceptions to `isError` results, attaches structured |
| 52 | +content), so they get parallel directories with mirrored file names rather than one parametrized |
| 53 | +test body — each directory pins its flavour's true output exactly. |
| 54 | + |
| 55 | +### The transport matrix |
| 56 | + |
| 57 | +Transport-agnostic tests take the `connect` fixture instead of constructing `Client(server)` |
| 58 | +directly, and therefore run once per transport: over the in-memory transport, over the server's |
| 59 | +real streamable HTTP app driven in-process through the streaming bridge, and over the legacy SSE |
| 60 | +transport the same way. A test connects with `async with connect(server, ...) as client:` and |
| 61 | +asserts the same output on every leg, because the transport is not supposed to change observable |
| 62 | +behaviour. Tests that are tied to one transport do not use the fixture: the wire-recording tests |
| 63 | +(their seam is the in-memory stream pair), the bare-`ClientSession` lifecycle tests, the |
| 64 | +real-clock timeout tests (the timeout machinery is transport-independent and must not race |
| 65 | +transport latency), and everything under `transports/`, which pins behaviour only observable on |
| 66 | +that transport. |
| 67 | + |
| 68 | +A transport conformance test in `transports/` speaks raw `httpx` against the mounted ASGI app |
| 69 | +**only** when its assertion is about HTTP semantics that `Client` cannot observe — status codes, |
| 70 | +response headers, SSE event fields, which stream a message travels on. Any other behaviour is |
| 71 | +asserted through a `Client`, connected to the mounted app via `client_via_http(http)` so several |
| 72 | +clients can share one session manager. |
| 73 | + |
| 74 | +## The requirements manifest |
| 75 | + |
| 76 | +`_requirements.py` maps every behaviour the suite covers to the reason it must hold: |
| 77 | + |
| 78 | +```python |
| 79 | +"tools:call:content:text": Requirement( |
| 80 | + source=f"{SPEC_BASE_URL}/server/tools#text-content", |
| 81 | + behavior="tools/call delivers arguments to the tool handler and returns its text content.", |
| 82 | +), |
| 83 | +``` |
| 84 | + |
| 85 | +- **`source`** is a deep link into the MCP specification for externally mandated behaviour, |
| 86 | + the literal string `"sdk"` for behaviour the SDK chose where the spec is silent, or |
| 87 | + `"issue:#n"` for a regression lock. |
| 88 | +- **`behavior`** describes the *required* behaviour — what the specification (or the SDK's own |
| 89 | + contract) says should happen. Tests always pin the SDK's current behaviour; where that falls |
| 90 | + short of `behavior`, the gap is recorded as data rather than hidden in the test. |
| 91 | +- **`divergence`** records that gap for entries whose tests pin the divergent current behaviour. |
| 92 | +- **`deferred`** marks a behaviour that is tracked but has no test in this suite, with a precise |
| 93 | + reason: the SDK does not implement it, the negative cannot be observed, the assertion is |
| 94 | + schema-level rather than interaction-level, the feature is experimental (tasks), or the test |
| 95 | + would require real-time waits the suite refuses. |
| 96 | +- **`transports`** names the transports a behaviour applies to; omitted means transport-independent. |
| 97 | +- **`issue`** carries the tracking link for a recorded gap once one is filed. |
| 98 | + |
| 99 | +Tests link themselves to the manifest with a decorator: |
| 100 | + |
| 101 | +```python |
| 102 | +@requirement("tools:call:content:text") |
| 103 | +async def test_call_tool_returns_text_content() -> None: ... |
| 104 | +``` |
| 105 | + |
| 106 | +`test_coverage.py` enforces the contract in both directions: every non-deferred requirement must |
| 107 | +be exercised by at least one test, every deferred requirement by none, and an unknown ID fails at |
| 108 | +import time. A behaviour without a manifest entry cannot be silently half-tested, and a manifest |
| 109 | +entry without a test cannot be silently aspirational. |
| 110 | + |
| 111 | +### The divergence lifecycle |
| 112 | + |
| 113 | +1. A test reveals that the SDK does not do what the spec says. The test pins what the SDK |
| 114 | + *actually does* and a `Divergence(note=..., issue=...)` goes on the requirement. |
| 115 | +2. When the behaviour is eventually fixed, the pinned test fails. Whoever makes the change finds |
| 116 | + the divergence note explaining that the old behaviour was a known gap, re-pins the test to the |
| 117 | + spec-correct output, and deletes the `Divergence`. |
| 118 | +3. An empty divergence list means the SDK is spec-conformant on every behaviour the suite covers. |
| 119 | + |
| 120 | +A requirement may carry both `divergence` and `deferred`: the divergence records that the SDK falls |
| 121 | +short of the spec, and the deferral records why no test pins it (typically because the divergent |
| 122 | +behaviour cannot be driven through the public API). Divergence alone implies a test pins the |
| 123 | +divergent behaviour; divergence plus deferred means the gap is known but unpinned. |
| 124 | + |
| 125 | +This is also the triage key for any rewrite: a test that fails on the new code path either has a |
| 126 | +divergence note (the rewrite accidentally fixed a known gap — decide whether to keep the fix) or |
| 127 | +it does not (the rewrite broke something that was correct — fix the rewrite). |
| 128 | + |
| 129 | +### When a new spec revision is released |
| 130 | + |
| 131 | +1. Update `SPEC_REVISION` and walk the new revision's changelog. |
| 132 | +2. For each changed interaction, find its requirements (the IDs use the wire method strings the |
| 133 | + changelog speaks in), re-audit the tests against the new text, and update `source` links and |
| 134 | + assertions where behaviour legitimately changed. |
| 135 | +3. New interactions get new requirements and new tests; removed interactions get their |
| 136 | + requirements deleted along with their tests. |
| 137 | +4. A behaviour that is correct under both revisions needs no change beyond the `source` link. |
| 138 | + |
| 139 | +## Writing a test |
| 140 | + |
| 141 | +The shortest complete example of the conventions: |
| 142 | + |
| 143 | +```python |
| 144 | +@requirement("tools:call:content:text") |
| 145 | +async def test_call_tool_returns_text_content() -> None: |
| 146 | + """Arguments reach the tool handler; its content comes back as the call result.""" |
| 147 | + |
| 148 | + async def call_tool(ctx: ServerRequestContext, params: types.CallToolRequestParams) -> CallToolResult: |
| 149 | + assert params.name == "add" |
| 150 | + assert params.arguments is not None |
| 151 | + return CallToolResult(content=[TextContent(text=str(params.arguments["a"] + params.arguments["b"]))]) |
| 152 | + |
| 153 | + server = Server("adder", on_call_tool=call_tool) |
| 154 | + |
| 155 | + async with Client(server) as client: |
| 156 | + result = await client.call_tool("add", {"a": 2, "b": 3}) |
| 157 | + |
| 158 | + assert result == snapshot(CallToolResult(content=[TextContent(text="5")])) |
| 159 | +``` |
| 160 | + |
| 161 | +- **The server is defined inside the test** (or in a small fixture at the top of the file when |
| 162 | + several tests genuinely share it). The whole observable behaviour fits on one screen. |
| 163 | +- **Test names are behaviour sentences** — they state the observable outcome, not the feature |
| 164 | + being poked. Docstrings add the one or two sentences of context a reviewer needs, including |
| 165 | + whether the assertion is spec-mandated, SDK-defined, or a known divergence. |
| 166 | +- **Handlers assert their dispatch identity first** (`assert params.name == "add"`), proving the |
| 167 | + request that arrived is the request the test sent. |
| 168 | +- **The result proves the round trip.** Server-side observations travel back to the test through |
| 169 | + the protocol itself (a tool returns what it saw) or through a closure-captured list; the test |
| 170 | + asserts after the call returns. |
| 171 | +- **Order within a test**: server handlers → server construction → client callbacks → connect → |
| 172 | + act → assert. The test reads in the order the conversation happens. |
| 173 | +- A registered handler or tool that a test never invokes gets a `raise NotImplementedError` body |
| 174 | + so it cannot silently become load-bearing. |
| 175 | +- A test that needs a peer no real `Server` or `Client` can play (a server that answers initialize |
| 176 | + with an unsupported version, a client that sends malformed params) plays that side of the wire by |
| 177 | + hand over `create_client_server_memory_streams()`. This scripted-peer pattern is the suite's only |
| 178 | + way to drive behaviour the typed API cannot produce, and the docstring of every such test says so. |
| 179 | + |
| 180 | +Stack a second `@requirement` decorator only when a test's natural assertions incidentally prove |
| 181 | +another behaviour — one capabilities snapshot proving four `*:capability:declared` entries, one |
| 182 | +input-schema identity check proving each preserved keyword. Do not build a test around covering |
| 183 | +many requirements at once; if the assertions would be separate, write separate tests. |
| 184 | + |
| 185 | +### Choosing an assertion |
| 186 | + |
| 187 | +| The property under test is… | Assert with | |
| 188 | +|---|---| |
| 189 | +| the result of a transformation (arguments → output, exception → error result) | `result == snapshot(...)` of the full object, so any field the implementation adds or drops fails the test | |
| 190 | +| pass-through of an opaque value (`_meta`, cursors) | identity against the same variable that was sent — a snapshot of a pass-through value only matches the input because a human checked two literals correspond | |
| 191 | +| an error | `pytest.raises(MCPError)` and a snapshot of `exc.value.error` when the message is the SDK's own; a plain `==` on `.code` against the `mcp.types` constant when it is not | |
| 192 | +| third-party output embedded in a result (validation messages) | the stable prefix only — never pin text that changes with a dependency upgrade | |
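
Two of the rows above can be illustrated with a small standard-library sketch. The `round_trip` helper is a hypothetical stand-in for a client/server exchange, and the error text is made up; neither is SDK output.

```python
import json


def round_trip(meta: dict) -> dict:
    """Hypothetical stand-in for the wire: serialize, then parse again."""
    return json.loads(json.dumps(meta))


sent_meta = {"traceId": "abc123", "attempt": 1}
received_meta = round_trip(sent_meta)

# Pass-through: compare against the variable that was sent, not a second literal,
# so the assertion cannot drift out of step with the input.
assert received_meta == sent_meta

# Third-party text: pin only the stable prefix. The message below is invented.
error_text = "Input validation error: 'b' is a required property"
assert error_text.startswith("Input validation error:")
```
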
| 193 | + |
| 194 | +### Notifications and concurrency |
| 195 | + |
| 196 | +The client's receive loop dispatches each incoming message to completion before reading the next, |
| 197 | +and the in-memory transport delivers everything on one ordered stream. Together these guarantee |
| 198 | +that every notification a server handler emits before its response reaches the client callback |
| 199 | +before the originating request returns — so tests collect notifications into a plain list and |
| 200 | +assert after the call, with no synchronisation. The exceptions: |
| 201 | + |
| 202 | +- a notification not triggered by a request the test is awaiting needs an `anyio.Event` set in |
| 203 | + the receiving handler and awaited under `anyio.fail_after(5)`; |
| 204 | +- the ordering guarantee does not survive transports that split messages across streams (the |
| 205 | + streamable HTTP standalone GET stream) — see `transports/test_streamable_http.py`. |
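
The ordering argument can be modelled with a toy receive loop. This is a sketch, not the SDK's implementation: it uses `asyncio` from the standard library, a single `asyncio.Queue` plays the one ordered stream, and the message shapes are hypothetical.

```python
import asyncio


async def main() -> list:
    stream = asyncio.Queue()  # the single ordered stream
    seen = []  # notifications collected by the client callback
    response = asyncio.get_running_loop().create_future()

    async def on_notification(text: str) -> None:
        seen.append(text)

    async def receive_loop() -> None:
        while True:
            message = await stream.get()
            if message["kind"] == "notification":
                # Dispatched to completion before the next message is read.
                await on_notification(message["text"])
            else:
                response.set_result(message["text"])
                return

    async def server_handler() -> None:
        # Notifications are emitted before the response, on the same stream.
        await stream.put({"kind": "notification", "text": "step 1"})
        await stream.put({"kind": "notification", "text": "step 2"})
        await stream.put({"kind": "response", "text": "done"})

    loop_task = asyncio.create_task(receive_loop())
    await server_handler()
    result = await asyncio.wait_for(response, timeout=5)
    await loop_task
    assert result == "done"
    # By the time the request has returned, every notification was delivered.
    return seen


collected = asyncio.run(main())
```

The response can only be resolved after the loop has finished handling everything queued ahead of it, which is why a plain list is enough in this model. Once messages travel on more than one stream, that argument no longer holds.
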
| 206 | + |
| 207 | +### Coverage |
| 208 | + |
| 209 | +CI requires 100% line and branch coverage, including `tests/`, and `strict-no-cover` fails the |
| 210 | +build if a line marked `# pragma: no cover` is ever executed. When a new test starts covering a |
| 211 | +pragma'd line in `src/`, delete the pragma in the same change. Do not add new `# type: ignore` or |
| 212 | +`# noqa` comments; restructure instead. Two pragmas are sanctioned in this suite's test code, both |
| 213 | +for known-upstream tracer bugs and only after restructuring has been tried: `# pragma: no branch` |
| 214 | +on a `with`/`async with` line whose only fault is coverage.py mis-tracing the exit arc of a nested |
| 215 | +async context (reserve it for shapes that cannot collapse — a sync `with` adjacent to an |
| 216 | +`async with`); and `# pragma: lax no cover` on a single statement that Python 3.11's tracer drops because |
| 217 | +the preceding `async with` unwinds via `coro.throw()` (python/cpython#106749, wontfix on 3.11) — |
| 218 | +this hits any test that must run statements after a `ClientSession`/`streamable_http_client` exits |
| 219 | +but still inside an outer `async with`, and no restructure can avoid it. |
| 220 | + |
| 221 | +A handful of `# pragma: lax no cover` markers in `src/` cover teardown exception handlers whose |
| 222 | +execution is timing-dependent under the in-process HTTP bridge — the POST-stream and |
| 223 | +stateless-session `except Exception` handlers in `server/streamable_http*.py`, the `_terminated` |
| 224 | +check in `message_router`, and the response-stream double-close guard in |
| 225 | +`BaseSession._receive_loop`. `strict-no-cover` does not check `lax` lines; do not promote them to |
| 226 | +strict `no cover` without first making the teardown ordering deterministic. The suite also relies |
| 227 | +on a one-line `src/mcp/server/sse.py` fix (`sse_stream_reader.aclose()`) that closes a stream the |
| 228 | +SSE leg would otherwise leak. |