tests: variable-length parametrize for 2KB-budget regression test #42
Closed
tony wants to merge 4 commits into pane-discoverability from …
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@                  Coverage Diff                  @@
##           pane-discoverability     #42   +/-   ##
=====================================================
  Coverage                 84.91%   84.91%
=====================================================
  Files                        40       40
  Lines                      2294     2294
  Branches                    294      294
=====================================================
  Hits                       1948     1948
  Misses                      261      261
  Partials                     85       85
…B transmitted
why: _BASE_INSTRUCTIONS was already 2162 bytes; the dynamic safety-tier
and $TMUX_PANE agent-context blocks added another ~1340 bytes, putting
the transmitted instructions ~1500 bytes over Claude Code's documented
2KB truncation budget and silently severing the agent-context block
that is the only server-side fix for "current window" anaphora. The
new SCOPE segment documents activation triggers and anti-triggers
(browser/editor/GUI WM/Jupyter) so the LLM has explicit boundaries
when bare pane/window/session appears. Coupled with compressed safety
and agent-context blocks, the full transmitted instructions now fit
under 2048 across all three safety tiers and both TMUX_PANE
configurations.
what:
- Trim 6 _INSTR_* segments; preserve every existing test substring at
tests/test_server.py:132-247
- Add _INSTR_SCOPE with TRIGGERS / ANTI-TRIGGERS labels; place second
in the join (after hierarchy as topic sentence)
- _build_instructions: compress safety-tier block (~672 -> ~165 bytes)
and agent-context block (~671 -> ~225 bytes); add readonly-tier
discoverability hint inside the function (only emitted on
TAG_READONLY)
- Tests: parametrized 2KB budget assertion across (tier x tmux_pane);
scope substrings; tier-conditional hint visibility
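The budget assertion described in this commit could be sketched roughly as follows. This is a minimal illustration, not the project's test: `build_instructions`, `TIERS`, and the placeholder text are hypothetical stand-ins for the real `_build_instructions` and `_INSTR_*` segments, and measuring the UTF-8 encoding is an assumption about how the 2048-byte ceiling is counted.

```python
from __future__ import annotations

import itertools

BUDGET_BYTES = 2048  # the 2KB ceiling named in the commit message

# Hypothetical stand-ins; the real segments and builder live in the server module.
TIERS = ("readonly", "mutating", "destructive")
BASE = "tmux hierarchy: server > session > window > pane. "  # placeholder text


def build_instructions(tier: str, tmux_pane: str | None) -> str:
    """Toy builder: static base plus the two dynamic blocks."""
    parts = [BASE, f"Safety level: {tier}. Set LIBTMUX_SAFETY."]
    if tier == "readonly":
        # Tier-conditional hint, emitted only on the readonly tier.
        parts.append(" Hint shown only on the readonly tier.")
    if tmux_pane is not None:
        parts.append(f" Agent pane: {tmux_pane}.")
    return "".join(parts)


def transmitted_size(text: str) -> int:
    """The budget is in bytes, so measure the encoding rather than len(text)."""
    return len(text.encode("utf-8"))


def check_budget() -> dict[tuple[str, str | None], int]:
    """Assert every (tier x tmux_pane) combination fits and return the sizes."""
    sizes: dict[tuple[str, str | None], int] = {}
    for tier, pane in itertools.product(TIERS, (None, "%42")):
        size = transmitted_size(build_instructions(tier, pane))
        assert size <= BUDGET_BYTES, f"{tier}/{pane}: {size} bytes"
        sizes[(tier, pane)] = size
    return sizes
```

Returning the sizes rather than only asserting makes it easy to print the remaining margin per combination when a text change is being considered.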
…erverInfo.name
why: The pre-activation discovery surface in Claude Code is BM25 over
tool name + description + parameter names + parameter descriptions
(per Anthropic ToolSearch docs; cross-verified vs. fastmcp
_extract_searchable_text at server/transforms/search/base.py:41-57).
Re-writing the leading paragraph of six discovery-anchor docstrings
to carry "tmux" plus a buried synonym (terminal, shell, scrollback,
multiplexer, workspace) widens the indexed lexicon. Per-tool
anthropic/alwaysLoad on three read-only anchors is a best-effort
hint — opaque pass-through in FastMCP, with honoring delegated to
Claude Code (documented at code.claude.com/docs/en/mcp v2.1.121+);
ship as forward-compatible metadata. FastMCP(name="tmux") aligns
serverInfo.name with the README registration slug; cosmetic but
removes a cross-client papercut.
what:
- Six discovery-anchor docstring rewrites (list_panes, list_windows,
list_sessions, snapshot_pane, search_panes, capture_pane); first
paragraph carries "tmux" plus a buried user-vocabulary synonym +
an inline anti-trigger ("not editor splits or browser panes").
Both sentences land in BM25's corpus via FastMCP's griffe-based
parse_docstring (utilities/docstring_parsing.py:35-65).
- DISCOVERY_META = {"anthropic/alwaysLoad": True} in _utils.py;
applied to list_panes, list_windows, snapshot_pane (Snapshot Pane
title unchanged — verb-of-art carve-out preserved)
- FastMCP(name="libtmux") -> name="tmux"; add website_url
- docs/conf.py: monkey-patch sphinx_autodoc_fastmcp.ToolCollector to
accept and ignore meta= kwarg. Upstream mock signature lacks
**kwargs, so per-tool meta= raises TypeError inside the docs-build
collector and silently drops the entire enclosing module's tools
from the docs catalog (caught by a generic except Exception). The
shim is the minimum-viable workaround; upstream fix is a **kwargs
on ToolCollector.tool().
- Tests: server-name, anchor-description coverage, alwaysLoad presence
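The docs-build shim described above can be illustrated with a small self-contained sketch. The `ToolCollector` class here is a hypothetical stand-in written for this example (it is not the real `sphinx_autodoc_fastmcp` class, whose internals are not shown in this PR); only the wrapping pattern, accepting `meta=` and discarding it before delegating, is the point.

```python
from __future__ import annotations

import functools


class ToolCollector:
    """Hypothetical stand-in for a collector whose tool() lacks **kwargs."""

    def __init__(self) -> None:
        self.tools: list[str] = []

    def tool(self, title: str | None = None):
        def decorator(fn):
            self.tools.append(title or fn.__name__)
            return fn

        return decorator


def accept_and_ignore_meta(cls) -> None:
    """Wrap cls.tool so a meta= keyword is dropped before the original runs."""
    original = cls.tool

    @functools.wraps(original)
    def tool(self, *args, meta=None, **kwargs):
        # meta is runtime-only metadata; the docs catalog does not need it.
        return original(self, *args, **kwargs)

    cls.tool = tool


accept_and_ignore_meta(ToolCollector)
collector = ToolCollector()


# Without the shim this call would raise TypeError on the unexpected keyword.
@collector.tool(title="List Panes", meta={"anthropic/alwaysLoad": True})
def list_panes() -> list[str]:
    return []
```

Patching the class rather than each call site keeps the tool modules unchanged, which matters if the shim is meant to be deleted once the upstream signature gains `**kwargs`.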
…ragraph
why: the 2KB-budget compression in 4164758 dropped the load-bearing
rationale phrase "survive process death" from _INSTR_HOOKS_GAP. Without
it, agents read "Write-hooks belong in your tmux config file" as soft
preference rather than a correctness boundary tied to a concrete tmux
fact. Restoring the rationale costs 26 bytes; tightening the safety-
tier paragraph (preserving the read / read+send / read+send+kill
verb-pairings inline) banks 27 bytes back. Net -1 byte;
readonly+TMUX_PANE worst case 2045 -> 2044 (margin 4 of 2048).
what:
- _INSTR_HOOKS_GAP now reads "Write-hooks survive process death;
keep them in your tmux config file, not a transient MCP session."
Substring "tmux config file" preserved verbatim (asserted at
tests/test_server.py:173).
- _build_instructions safety-tier paragraph rewrites to:
"Safety level: <tier> (readonly: read; mutating: read+send;
destructive: read+send+kill). Set LIBTMUX_SAFETY; off-tier tools
are hidden." Substrings preserved: "Safety level:",
f"Safety level: {tier}", "LIBTMUX_SAFETY".
- New test_hooks_gap_keeps_process_death_rationale defensively pins
both "survive process death" and "tmux config file" to the gap
segment so a future refactor that moves the substring still
fails the pin (line-173's existing test passes either way).
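A segment-level pin of the kind described in the last bullet might look like the sketch below. The two strings are quoted from this commit message; the module layout and the `build_safety_paragraph` helper name are assumptions made for the example.

```python
# Segment text as quoted in the commit message; where it lives is assumed.
_INSTR_HOOKS_GAP = (
    "Write-hooks survive process death; "
    "keep them in your tmux config file, not a transient MCP session."
)


def build_safety_paragraph(tier: str) -> str:
    """Hypothetical helper returning the rewritten safety-tier paragraph."""
    return (
        f"Safety level: {tier} (readonly: read; mutating: read+send; "
        "destructive: read+send+kill). Set LIBTMUX_SAFETY; off-tier tools "
        "are hidden."
    )


def test_hooks_gap_keeps_process_death_rationale() -> None:
    # Pin against the segment itself, not the joined instructions, so that
    # moving the phrase into another segment still fails this test.
    assert "survive process death" in _INSTR_HOOKS_GAP
    assert "tmux config file" in _INSTR_HOOKS_GAP
```

Asserting on the segment constant is what distinguishes this pin from a check on the full joined text, which would keep passing if the phrase migrated elsewhere.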
…tress case
why: the existing 2KB-budget regression test parametrized (3 tiers x 2
tmux_pane states) but every case used "%42" (3 chars) and "default"
(7-char socket name). A user with a slightly longer custom socket name
+ a multi-digit pane id from a long session pushes the readonly worst
case very close to 2048. The static cases never exercised this, so
future text additions could silently put realistic runtime injections
over the budget.
what:
- Collapse the two cross-product @pytest.mark.parametrize decorators
into one explicit list so the variable-length stress case is a peer
entry instead of expanding the cross-product space.
- Add (TAG_READONLY, "%99", "/tmp/tmux-1000/dev-prod,12345,0") as the
7th case. Exercises BOTH axes: multi-digit pane id and a longer-
than-default socket name. Margin ~2 bytes from the 2048 ceiling.
- Inline comment names the fallback path (tighter compression form) if
a future text addition trips this case.
note: this branch builds on pane-discoverability; merge after that one
lands. Standalone off main, the test would not pass since the 2KB
compression in pane-discoverability is what makes the budget fit in
the first place.
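The shape of the collapsed case list can be sketched as below. Only the 7th tuple and the "%42" / "default" values come from this commit message; the tier string values, the `None` entries for the no-pane cases, and the `dynamic_bytes` helper are assumptions for illustration. In the real test such a list would presumably be passed to a single `@pytest.mark.parametrize`.

```python
from __future__ import annotations

# Assumed values; the real TAG_* constants are defined in the project.
TAG_READONLY, TAG_MUTATING, TAG_DESTRUCTIVE = "readonly", "mutating", "destructive"

# One explicit list: six static cases plus the variable-length stress case
# as a peer entry, instead of a 3 x 2 cross-product of decorators.
BUDGET_CASES: list[tuple[str, str | None, str | None]] = [
    (TAG_READONLY, None, None),
    (TAG_READONLY, "%42", "default"),
    (TAG_MUTATING, None, None),
    (TAG_MUTATING, "%42", "default"),
    (TAG_DESTRUCTIVE, None, None),
    (TAG_DESTRUCTIVE, "%42", "default"),
    # Stress case quoted from the commit message.
    (TAG_READONLY, "%99", "/tmp/tmux-1000/dev-prod,12345,0"),
]


def dynamic_bytes(case: tuple[str, str | None, str | None]) -> int:
    """Bytes contributed by the runtime-injected values of one case."""
    _tier, pane, socket = case
    return sum(len(value.encode("utf-8")) for value in (pane, socket) if value)
```

Keeping the stress case in the same list means it is reported as one more parametrized case rather than multiplying into extra tier combinations that add no coverage.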
d1de08f to 52c849e
Folded into #37 per the multi-model weave-ask synthesis (2-of-3 votes for fold; the variable-length parametrize is intrinsic to #37's 2KB-budget contract, not an independent feature). The variable-length stress case (…

Closing without merge.
Summary
The existing 2KB-budget regression test parametrized (3 tiers × 2 tmux_pane states), but every case used %42 (3 chars) and default (7-char socket name). A user with a slightly longer custom socket name + a multi-digit pane id from a long session pushes the readonly worst case very close to 2048 bytes, and the static cases never exercised this, so future text additions could silently put realistic runtime injections over the budget.

This PR:
- Collapses the two cross-product @pytest.mark.parametrize decorators into one explicit list so the variable-length stress case is a peer entry instead of expanding the cross-product space
- Adds (TAG_READONLY, "%99", "/tmp/tmux-1000/dev-prod,12345,0") as the 7th case. Exercises BOTH axes: multi-digit pane id and a longer-than-default socket name. Margin ~2 bytes from the 2048 ceiling.

Note on base branch
This PR targets pane-discoverability (PR #37), not main. The variable-length stress test cannot pass off main alone; the 2KB compression in pane-discoverability is what makes the budget fit in the first place. Merge after #37 lands.

Test plan
- uv run ruff check . && uv run mypy && uv run py.test -q (443 passed; +1 stress case)
- just build-docs