stacklok · yrobla · Apr 9, 2026 · Apr 7, 2026 · Apr 8, 2026 · Apr 8, 2026
@@ -13,6 +13,37 @@ Applies to all Go files in the project.
 - Separate business logic from transport/protocol concerns
 - Keep packages focused on single responsibilities
 
+## Interface Design
+
+Check these whenever adding a method to an interface or defining a new type:
+
+- **Minimal surface**: Don't add interface methods that duplicate the semantics of existing ones. If an existing method already answers the question (possibly with a side effect), don't add a separate method for the same check.
+- **No silent no-ops**: A no-op that silently breaks callers who depend on the method working is a sign the interface is too broad. Narrow the interface or use a separate capability interface. Benign no-ops (e.g., `Close()` on an in-memory store) are fine.
+- **Option pattern must be compile-time safe**: Never define a local anonymous interface inside an option and type-assert against it to check capability — a silent no-op results if the target doesn't implement it. (Returning an explicit error from an option for input validation is fine.) Two typesafe approaches:
+  - *Config struct field*: put the setting on the config struct (e.g., `types.Config.SessionStorage`) so all consumers see it at compile time.
+  - *Typed functional option*: use `func(*ConcreteType)` so the option only compiles against the correct receiver.
+  If you need to cast inside an option to check whether the target supports it, the option is on the wrong abstraction. See #4638.
+- **Avoid parallel types that drift**: Don't define a separate config/data type that mirrors an existing one. Embed or reuse the original — two parallel structs require a conversion step and will diverge over time.
+
+## Resource Leaks
+
+Always pair resource acquisition with explicit release. Common patterns that leak:
+
+- Goroutines with no exit condition or cancellation path
+- Caches and maps that grow without a capacity limit or eviction policy
+- Connections, files, or handles opened without a corresponding `Close()` (use `defer`)
+- Tickers and timers whose `Stop()` is never called
+
+When reviewing code that acquires a resource, ask: where does this get released, and what happens if the normal release path is never reached?
+
+## Linting
+
+All lint rules must be followed. Run `task lint-fix` before submitting. Do not suppress linter warnings with `//nolint` directives unless the violation is a confirmed false positive — fix the root cause instead.
+
+## Validate Parsed Results
+
+A successful parse (`err == nil`) only means the input was syntactically acceptable to the parser — not that it meets your requirements. Always validate the parsed result against what you actually need. Standard library parsers routinely accept more inputs than a given call site should allow.
+
 ## SPDX License Headers
 
 All Go files require SPDX headers at the top:
@@ -69,10 +100,37 @@ if someCondition {
   state, ask whether it can be folded into a constructor or belongs in the
   caller's package
 
+## Document Architectural Constraints on Exported Functions
+
+When an exported function or constructor changes behavior based on injected infrastructure (storage backend, transport mode, external client), its doc comment must state what the injection does and does not solve. Callers cannot be expected to infer distributed-system constraints from the implementation.
+
+Include at minimum:
+- What the injected component enables (e.g., cross-replica metadata sharing).
+- What it does *not* solve (e.g., cross-replica message delivery, fan-out).
+- Any caller responsibility that follows (e.g., session affinity at the load balancer).
+
+## Concurrency Comments
+
+Keep comments about mutexes, locks, and concurrency accurate — they are easy to get wrong and mislead future readers:
+
+- Only say a lock "must be held" or "is already held" if you have verified it at that call site.
+- Do not claim an operation would deadlock without confirming that the lock in question would actually be re-acquired.
+- When a comment describes a concurrency invariant (e.g., "called with mu held"), add it to the function's doc comment so it travels with the signature, not inline at the call site.
+
 ## Logging
 
 - **Silent success** — no output at INFO or above for successful operations
 - **DEBUG** for diagnostics (runtime detection, state transitions, config values)
 - **INFO** sparingly — only for long-running operations like image pulls
 - **WARN** for non-fatal issues (deprecations, fallback behavior, cleanup failures)
 - **Never log** credentials, tokens, API keys, or passwords
+
+## Prefer Existing Code and Packages Over From-Scratch Implementations
+
+Before implementing any non-trivial functionality from scratch:
+
+1. **Search the toolhive repo first** — check if an existing method, utility, or package already provides the functionality or something close enough to extend.
+2. **Check the Go standard library** — the stdlib covers a wide surface area; prefer it over third-party packages when it fits.
+3. **Look for existing Go packages** — search for well-maintained OSS libraries that solve the problem before writing custom implementations.
+
+Implementing from scratch should be a last resort, justified by a specific gap no existing solution fills.
@@ -41,6 +41,10 @@ See `cmd/thv-operator/DESIGN.md` for detailed decision guidelines.
 - Chainsaw tests require a real Kubernetes cluster
 - Status updates require a separate client patch (`r.Status().Update()`)
 
+## Status Condition Parity
+
+When adding a status condition to one CRD type, check all parallel types (e.g., `MCPServer` and `VirtualMCPServer`) for the same condition. Conditions that warn about misconfiguration or unsupported states should be consistent across types that share the same feature set — a gap means one type silently accepts invalid config that the other rejects.
+
 ## Key Operator Commands
 
 ```bash

@@ -16,6 +16,10 @@ Do NOT leave optional sections empty or with only placeholder/template text. Eit
 - **Does this introduce a user-facing change?**: Describe the change from the user's perspective. Write "No" if not applicable.
 - **Special notes for reviewers**: Non-obvious design decisions, known limitations, areas wanting extra scrutiny, or planned follow-up work.
 
+## PR Scope
+
+Each PR must contain only related changes. If a bug fix, refactor, or unrelated cleanup is discovered while working on a feature, open a separate PR for it. Mixed-scope PRs are harder to review and harder to revert cleanly.
+
 ## Style guidelines
 
 - Keep the PR title under 70 characters, imperative mood, no trailing period.

@@ -0,0 +1,52 @@
+---
+paths:
+  - "**/*.go"
+---
+
+# Security Rules
+
+Applies to all Go files in the project.
+
+## Don't Store Internal Addressing in Shared State
+
+Never persist internal infrastructure addresses (hostnames, IPs, service URLs, pod names) into shared or external state stores (databases, caches, config passed to clients).
+
+Internal addresses stored externally:
+- Leak topology to anyone who can read the store
+- May allow callers to bypass security middleware by using the stored address directly
+- Couple your routing logic to volatile infrastructure state that changes independently
+
+**Instead**: derive routing from stable, non-sensitive inputs (e.g. a session ID, a content hash, a logical name). If you must store a target, store a logical identifier and resolve it at use time through a path that enforces security controls.
+
+## Route Through Security-Enforcing Components
+
+Always route traffic through the component responsible for auth, rate limiting, or policy enforcement — never optimize past it.
+
+A direct path that skips middleware is a vulnerability, not a performance improvement. If you find yourself type-asserting, casting, or reaching into an internal field to get a "more direct" address, stop and ask whether the shortcut bypasses any security boundary.
+
+When multiple routing options exist (e.g. a proxy vs. a raw address), choose the one where security controls are guaranteed to be in the critical path.
+
+## Prefer Stateless Routing Over Stored Routing
+
+When routing can be derived deterministically from stable request properties, compute it on every request rather than storing it.
+
+Storing routing decisions:
+- Creates state that must be recovered correctly after restarts
+- Introduces a window where stored state is stale or wrong
+- Expands the attack surface of the state store
+
+If the same input always maps to the same destination (consistent hashing, modular arithmetic, content addressing), there is no need to store the mapping. Remove the stored state and eliminate the recovery problem entirely.
+
+## All Requests Must Pass Through the Proxy Runner
+
+Every request to a managed container (MCP server or tool) must flow through the proxy runner (`pkg/runner/proxy`). Bypassing it is a vulnerability, not an optimization.
+
+The proxy runner is the single enforcement point for:
+- Authentication and authorization checks
+- Secret injection and credential management
+- Network policy and egress controls
+- Audit logging
+
+Any code that constructs a direct connection to a container — by using a raw host:port, reaching past the proxy interface, or type-asserting to an underlying transport — skips these controls entirely.
+
+**If you find a code path that contacts a container without going through the proxy runner, treat it as a security bug and fix it.**
@@ -32,9 +32,13 @@ Applies to test files and test directories.
 2. **Redundancy**: Avoid overlapping test cases exercising the same code path
 3. **Value**: Every test must add meaningful coverage — remove tests that don't
 4. **Consolidation**: Consolidate small test functions into a single table-driven test when they test the same function
-5. **Naming**: Use clear, descriptive test names that convey the scenario
+5. **Naming**: Test names must match what they actually assert — if the assertion changes, update the name too.
 6. **Boilerplate**: Minimize setup code; extract shared setup into helpers with `t.Helper()`
 
+## E2E Test Coverage
+
+E2E tests must verify functional behavior, not just infrastructure state. Confirming that pods are ready or that counts are correct is not sufficient — the test must also exercise the actual code path (send traffic, trigger the feature) to prove it works end-to-end.
+
 ## Test Scope
 
 Tests must only test code in the package under test. Do NOT test behavior of dependencies, external packages, or transitive functionality.
@@ -55,3 +59,31 @@ Write tests isolated from other tests that may set the same env vars. Use `t.Set
 ## Port Numbers
 
 Use random ports (e.g., `net.Listen("tcp", ":0")`) to let the OS assign a free port. Do not use hardcoded port numbers — even large ones can clash with running services.
+
+## Test Hooks in Production Structs
+
+Avoid adding test-only hook fields (nil-checked `func()` fields) to production structs. A field documented as "nil in production" signals the concern belongs outside the production type. Preferred alternatives:
+
+- **Interface seam**: Replace the internal component with an interface; tests inject a wrapper that adds the needed synchronization or observation.
+- **Functional constructor options**: Expose hook injection only through a constructor option so the production call site stays clean.
+- **Test at the observable boundary**: Control timing through the mock/stub's own behavior rather than hooking into production internals.
+
+Existing instances in the codebase are legacy — do not expand them. When touching a struct that already has hook fields, consider extracting them as part of the change.
+
+## Concurrent Tests: Always Add Timeouts to Blocking Barriers
+
+Blocking operations in tests (`WaitGroup.Wait()`, channel receives, `sync.Cond.Wait()`) must have a timeout/fail-fast path. Without one, a panicking goroutine or regression in synchronization logic causes the test to hang until the global `go test` timeout.
+
+```go
+// Good: fail fast with a clear message
+done := make(chan struct{})
+go func() { wg.Wait(); close(done) }()
+select {
+case <-done:
+case <-time.After(5 * time.Second):
+    t.Fatal("timeout waiting for goroutines to synchronize")
+}
+
+// Avoid: hangs indefinitely on deadlock
+wg.Wait()
+```
@@ -79,3 +79,11 @@ Adding caches, connection pools, or other performance optimizations without evid
 **Detect**: Caches introduced without profiling data or load estimates showing the uncached path is too slow; connection pools or object pools where the allocation cost hasn't been measured; complexity added to avoid overhead (e.g., TLS handshakes, serialization) at request rates where the overhead is negligible.
 
 **Instead**: Start with the straightforward implementation. Measure under realistic load. Add optimization only when measurements show it's needed, and document the evidence in the commit or PR description.
+
+## 10. Mutable Domain Objects with Mutex Protection
+
+Adding a mutex to a domain object and mutating it in place when state changes. This grows in complexity with every new mutation and makes objects harder to reason about under concurrency.
+
+**Detect**: Mutex fields on domain structs; mutation methods on types that were previously read-only; in-place writes guarded by an object-level lock; multiple layers each holding their own mutex.
+
+**Instead**: Ask whether the object can be reconstructed rather than mutated — rebuild from the source of truth and replace the reference. If mutation is truly necessary, centralize synchronization at one layer rather than distributing mutexes across multiple layers; everything below that layer is then single-threaded and much easier to reason about. Sharded locks for performance should only be introduced after profiling shows contention (see anti-pattern #9).