tests: variable-length parametrize for 2KB-budget regression test #42

Closed
tony wants to merge 4 commits into pane-discoverability from feature/variable-length-budget-test

Conversation

@tony tony commented May 8, 2026

Summary

The existing 2KB-budget regression test was parametrized over 3 tiers × 2 tmux_pane states, but every case used the pane id %42 (3 chars) and the socket name default (7 chars). A user with a slightly longer custom socket name plus a multi-digit pane id from a long session pushes the readonly worst case very close to 2048 bytes. The static cases never exercised this, so future text additions could silently push realistic runtime injections over the budget.

This PR:

  • Collapses the two cross-product @pytest.mark.parametrize decorators into one explicit list so the variable-length stress case is a peer entry instead of expanding the cross-product space
  • Adds (TAG_READONLY, "%99", "/tmp/tmux-1000/dev-prod,12345,0") as the 7th case. Exercises BOTH axes: multi-digit pane id and a longer-than-default socket name. Margin ~2 bytes from the 2048 ceiling.
  • Adds an inline comment naming the fallback path (a tighter compression form) in case a future text addition trips this case.
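
A minimal sketch of the collapsed parametrization described in the list above. The tuple layout (tier, pane id, socket) follows the 7th case quoted in the summary; the `TAG_*` string values, the `None` placeholder for an unset TMUX_PANE, and the names `STATIC_CASES` and `BUDGET_CASES` are assumptions for illustration, not the repository's actual code.

```python
from itertools import product

# Assumed stand-ins for the project's safety-tier constants.
TAG_READONLY = "readonly"
TAG_MUTATING = "mutating"
TAG_DESTRUCTIVE = "destructive"
TIERS = (TAG_READONLY, TAG_MUTATING, TAG_DESTRUCTIVE)

# The six static cases: 3 tiers x 2 TMUX_PANE states (unset / "%42").
# Using None for "unset" is an assumption of this sketch.
STATIC_CASES = [
    (tier, pane, "default" if pane else None)
    for tier, pane in product(TIERS, (None, "%42"))
]

# The stress case is appended as a peer entry rather than as a new axis,
# so the case count grows by one instead of multiplying.
STRESS_CASE = (TAG_READONLY, "%99", "/tmp/tmux-1000/dev-prod,12345,0")
BUDGET_CASES = [*STATIC_CASES, STRESS_CASE]

# With pytest, the list would feed a single decorator, for example:
# @pytest.mark.parametrize(("tier", "tmux_pane", "socket"), BUDGET_CASES)
print(len(BUDGET_CASES))  # 7
```

Keeping the list explicit means a reader can see every combination the budget assertion runs against, at the cost of spelling the static cases out (or generating them, as here).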

Note on base branch

This PR targets pane-discoverability (PR #37), not main. The variable-length stress test cannot pass against main alone, because the 2KB compression in pane-discoverability is what makes the budget fit in the first place. Merge after #37 lands.

Test plan

  • uv run ruff check . && uv run mypy && uv run py.test -q (443 passed; +1 stress case)
  • just build-docs

codecov-commenter commented May 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.91%. Comparing base (a3ff362) to head (52c849e).

Additional details and impacted files
@@                  Coverage Diff                  @@
##           pane-discoverability      #42   +/-   ##
=====================================================
  Coverage                 84.91%   84.91%           
=====================================================
  Files                        40       40           
  Lines                      2294     2294           
  Branches                    294      294           
=====================================================
  Hits                       1948     1948           
  Misses                      261      261           
  Partials                     85       85           

@tony tony force-pushed the pane-discoverability branch from 1a4eda5 to 11f9b76 Compare May 9, 2026 09:53
tony added 3 commits May 9, 2026 04:58
…B transmitted

why: _BASE_INSTRUCTIONS was already 2162 bytes; the dynamic safety-tier
and $TMUX_PANE agent-context blocks added another ~1340 bytes, putting
the transmitted instructions ~1500 bytes over Claude Code's documented
2KB truncation budget — silently severing the agent-context block that
is the only server-side fix for "current window" anaphora. The new
SCOPE segment documents activation triggers and anti-triggers
(browser/editor/GUI WM/Jupyter) so the LLM has explicit boundaries
when bare pane/window/session appears. Coupled with compressed safety
and agent-context blocks, the full transmitted instructions now fit
under 2048 across all three safety tiers and both TMUX_PANE
configurations.

what:
- Trim 6 _INSTR_* segments; preserve every existing test substring
  at tests/test_server.py:132-247
- Add _INSTR_SCOPE with TRIGGERS / ANTI-TRIGGERS labels; place
  second in the join (after hierarchy as topic sentence)
- _build_instructions: compress safety-tier block (~672 -> ~165
  bytes) and agent-context block (~671 -> ~225 bytes); add
  readonly-tier discoverability hint inside the function (only
  emitted on TAG_READONLY)
- Tests: parametrized 2KB budget assertion across (tier x tmux_pane);
  scope substrings; tier-conditional hint visibility
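
The budget in this commit is counted in bytes, so an assertion on it needs the encoded length rather than the character count. A rough sketch, assuming UTF-8 encoding and an inclusive limit; `transmitted_size`, `fits_budget`, and the sample strings are hypothetical and are not output of `_build_instructions`.

```python
BUDGET_BYTES = 2048  # the 2KB truncation budget named in the commit


def transmitted_size(text: str) -> int:
    """Return the size of text in bytes, assuming UTF-8 encoding."""
    return len(text.encode("utf-8"))


def fits_budget(text: str, budget: int = BUDGET_BYTES) -> bool:
    """True when the encoded text is within the byte budget (inclusive)."""
    return transmitted_size(text) <= budget


ascii_text = "x" * 2048
non_ascii_text = "x" * 2047 + "→"  # 2048 characters, but the arrow is 3 bytes

print(fits_budget(ascii_text))      # True: exactly 2048 bytes
print(fits_budget(non_ascii_text))  # False: 2047 + 3 = 2050 bytes
```

The second sample shows why a character-count check can pass while the transmitted text is already over the limit.
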
…erverInfo.name

why: The pre-activation discovery surface in Claude Code is BM25 over
tool name + description + parameter names + parameter descriptions
(per Anthropic ToolSearch docs; cross-verified vs. fastmcp
_extract_searchable_text at server/transforms/search/base.py:41-57).
Re-writing the leading paragraph of six discovery-anchor docstrings
to carry "tmux" plus a buried synonym (terminal, shell, scrollback,
multiplexer, workspace) widens the indexed lexicon. Per-tool
anthropic/alwaysLoad on three read-only anchors is a best-effort
hint — opaque pass-through in FastMCP, with honoring delegated to
Claude Code (documented at code.claude.com/docs/en/mcp v2.1.121+);
ship as forward-compatible metadata. FastMCP(name="tmux") aligns
serverInfo.name with the README registration slug; cosmetic but
removes a cross-client papercut.

what:
- Six discovery-anchor docstring rewrites (list_panes, list_windows,
  list_sessions, snapshot_pane, search_panes, capture_pane); first
  paragraph carries "tmux" plus a buried user-vocabulary synonym +
  an inline anti-trigger ("not editor splits or browser panes").
  Both sentences land in BM25's corpus via FastMCP's griffe-based
  parse_docstring (utilities/docstring_parsing.py:35-65).
- DISCOVERY_META = {"anthropic/alwaysLoad": True} in _utils.py;
  applied to list_panes, list_windows, snapshot_pane (Snapshot Pane
  title unchanged — verb-of-art carve-out preserved)
- FastMCP(name="libtmux") -> name="tmux"; add website_url
- docs/conf.py: monkey-patch sphinx_autodoc_fastmcp.ToolCollector to
  accept and ignore meta= kwarg. Upstream mock signature lacks
  **kwargs, so per-tool meta= raises TypeError inside the docs-build
  collector and silently drops the entire enclosing module's tools
  from the docs catalog (caught by a generic except Exception). The
  shim is the minimum-viable workaround; upstream fix is a **kwargs
  on ToolCollector.tool().
- Tests: server-name, anchor-description coverage, alwaysLoad presence
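
The docs-build shim described above amounts to wrapping a method so that an unexpected `meta=` keyword is dropped before the original is called. A generic sketch of that pattern follows; `FakeToolCollector` and `accept_and_ignore_meta` are hypothetical stand-ins and do not reproduce the real `sphinx_autodoc_fastmcp.ToolCollector` signature.

```python
import functools


class FakeToolCollector:
    """Hypothetical stand-in for a collector whose tool() lacks **kwargs."""

    def __init__(self) -> None:
        self.names: list[str] = []

    def tool(self, name: str):
        def decorator(func):
            self.names.append(name)
            return func

        return decorator


def accept_and_ignore_meta(cls) -> None:
    """Patch cls.tool so a meta= keyword is accepted and discarded."""
    original = cls.tool

    @functools.wraps(original)
    def patched(self, *args, meta=None, **kwargs):
        # meta is intentionally unused: the docs build does not need it.
        return original(self, *args, **kwargs)

    cls.tool = patched


accept_and_ignore_meta(FakeToolCollector)

DISCOVERY_META = {"anthropic/alwaysLoad": True}  # value quoted in the commit
collector = FakeToolCollector()


@collector.tool("list_panes", meta=DISCOVERY_META)
def list_panes() -> list[str]:
    return []


print(collector.names)  # ['list_panes']
```

Without the patch, the decorated registration would raise `TypeError` for the unexpected keyword, which is the failure mode the commit describes.
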
…ragraph

why: the 2KB-budget compression in 4164758 dropped the load-bearing
rationale phrase "survive process death" from _INSTR_HOOKS_GAP. Without
it, agents read "Write-hooks belong in your tmux config file" as soft
preference rather than a correctness boundary tied to a concrete tmux
fact. Restoring the rationale costs 26 bytes; tightening the safety-
tier paragraph (preserving the read / read+send / read+send+kill
verb-pairings inline) banks 27 bytes back. Net -1 byte;
readonly+TMUX_PANE worst case 2045 -> 2044 (margin 4 of 2048).

what:
- _INSTR_HOOKS_GAP now reads "Write-hooks survive process death;
  keep them in your tmux config file, not a transient MCP session."
  Substring "tmux config file" preserved verbatim (asserted at
  tests/test_server.py:173).
- _build_instructions safety-tier paragraph rewrites to:
  "Safety level: <tier> (readonly: read; mutating: read+send;
  destructive: read+send+kill). Set LIBTMUX_SAFETY; off-tier tools
  are hidden." Substrings preserved: "Safety level:",
  f"Safety level: {tier}", "LIBTMUX_SAFETY".
- New test_hooks_gap_keeps_process_death_rationale defensively pins
  both "survive process death" and "tmux config file" to the gap
  segment so a future refactor that moves the substring still
  fails the pin (line-173's existing test passes either way).
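
A small sketch of the pinning idea and the byte bookkeeping in this commit. The two strings are quoted from the commit message; the names `INSTR_HOOKS_GAP`, `safety_paragraph`, and `check_hooks_gap_pins` are assumptions of this sketch, not the repository's identifiers.

```python
# Segment text quoted from the commit message; the variable name is assumed.
INSTR_HOOKS_GAP = (
    "Write-hooks survive process death; "
    "keep them in your tmux config file, not a transient MCP session."
)


def safety_paragraph(tier: str) -> str:
    """Build the safety-tier paragraph in the form quoted in the commit."""
    return (
        f"Safety level: {tier} (readonly: read; mutating: read+send; "
        "destructive: read+send+kill). Set LIBTMUX_SAFETY; off-tier tools "
        "are hidden."
    )


def check_hooks_gap_pins(segment: str) -> None:
    """Pin both substrings to the segment itself, not the joined text."""
    assert "survive process death" in segment
    assert "tmux config file" in segment


check_hooks_gap_pins(INSTR_HOOKS_GAP)
paragraph = safety_paragraph("readonly")
assert paragraph.startswith("Safety level: readonly")
assert "LIBTMUX_SAFETY" in paragraph

# Byte bookkeeping from the commit: 26 bytes restored, 27 bytes banked back.
net = 26 - 27
worst_case = 2045 + net
print(net, worst_case, 2048 - worst_case)  # -1 2044 4
```

Asserting against the segment rather than the full joined instructions is what makes the pin fail if a refactor moves the phrase elsewhere.
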
@tony tony force-pushed the pane-discoverability branch from 11f9b76 to a3ff362 Compare May 9, 2026 09:58
…tress case

why: the existing 2KB-budget regression test parametrized (3 tiers x
2 tmux_pane states) but every case used "%42" (3 chars) and "default"
(7-char socket name). A user with a slightly longer custom socket
name + a multi-digit pane id from a long session pushes the readonly
worst case very close to 2048. The static cases never exercised this,
so future text additions could silently put realistic runtime
injections over the budget.

what:
- Collapse the two cross-product @pytest.mark.parametrize decorators
  into one explicit list so the variable-length stress case is a peer
  entry instead of expanding the cross-product space.
- Add (TAG_READONLY, "%99", "/tmp/tmux-1000/dev-prod,12345,0") as the
  7th case. Exercises BOTH axes: multi-digit pane id and a longer-
  than-default socket name. Margin ~2 bytes from the 2048 ceiling.
- Inline comment names the fallback path (tighter compression form)
  if a future text addition trips this case.

note: this branch builds on pane-discoverability — merge after that
one lands. Standalone off main, the test would not pass since the
2KB compression in pane-discoverability is what makes the budget
fit in the first place.
@tony tony force-pushed the feature/variable-length-budget-test branch from d1de08f to 52c849e Compare May 9, 2026 10:37
@tony tony force-pushed the pane-discoverability branch from ca489e2 to 872248c Compare May 9, 2026 12:10

tony commented May 9, 2026

Folded into #37 per the multi-model weave-ask synthesis (2-of-3 votes for fold; the variable-length parametrize is intrinsic to #37's 2KB-budget contract, not an independent feature).

The variable-length stress case (%99 + dev-prod socket, margin ~2 bytes from 2048) now lives in the same commit that introduces _INSTR_SCOPE and the budget regression test on pane-discoverability — see 4d5744e mcp(refactor[server]): Compress instructions, add SCOPE, fit under 2KB transmitted.

Closing without merge.

@tony tony closed this May 9, 2026