Status: Draft v2 Date: 2026-04-24 Related: PRD-001 (Capabilities and adapters), ADR-006 (Immutable runtime bootstrap), DDD-001 (Immutable bootstrap domain)
| Version | Date | Summary |
|---|---|---|
| Draft v1 | 2026-04-24 | Initial draft — immutable bootstrap contract, four-phase rollout |
| Draft v2 | 2026-04-24 | R1 packaging audit integrated: comfyui/mcp-server added to Phase 1 with native-gyp note; effort totals corrected to ~13d+2d CI; skills/host-webserver-debug/mcp-server and skills/web-summary/mcp-server identified as unscoped candidates (noted as follow-up); Renovate hash-management step made explicit; cross-references to ADR-006 packaging table and DDD-001 invariants tightened |
Skip if you already know the immutable-bootstrap product goals.
This PRD specifies the product requirements for making agentbox boot without any runtime dependency resolution. The pain point is that today's startup still runs `npm install`, global CLI installs, and Playwright browser downloads with `|| true` on failure, so the boot outcome depends on network, registries, and timing, even though the image is advertised as Nix-built and reproducible. The answer is a phased migration to packaged artifacts: every dependency that used to be resolved at startup must be baked into the image, with Nix-packaged Node modules, pinned browsers, and a validation probe that proves capabilities before readiness. You will get the product goals, scope boundaries, phase plan, effort estimates, and acceptance criteria.
If you remember only one thing: boot must never install software; anything the runtime needs is already in the image or the boot fails loudly.
For the deep version, keep reading.
Scope. This document specifies the product requirements for item 1: removing mutable dependency installation from container startup so agentbox boots as an immutable, pre-packaged runtime rather than a best-effort VM-like environment.
Agentbox presents itself as a Nix-built, reproducible container, but startup still performs mutable runtime work:
- installs Node dependencies into copied source trees
- installs global CLIs with `npm install -g`
- downloads Playwright browsers on first boot
- tolerates failures with `|| true`
That makes the boot result depend on network access, upstream registries, startup timing, and partial failure. The product contract is currently stronger than the runtime behavior.
- Boot must be deterministic. Two runs of the same image and manifest must start with the same binaries and dependency graph.
- Boot must not require network access. Network should be optional for application work, not required for the container to become ready.
- Boot must fail fast on packaging errors. Missing runtime artifacts must surface as explicit startup failures, not degraded best-effort warnings.
- Runtime mutation must be narrow and intentional. Startup may create instance state, but it must not resolve software dependencies.
- Feature gates must remain truthful. If a manifest flag enables a service or toolchain, the image must already contain the required executable and libraries.
Non-goals:
- Replacing `supervisord` in this phase.
- Redesigning the skills catalog or feature taxonomy.
- Eliminating writable runtime state such as generated identities, profile directories, or persisted local data.
- Splitting all optional capabilities into separate containers.
As an operator, I want `docker compose up` to succeed offline if I already have the image, so container readiness does not depend on npm or browser downloads.
As a maintainer, I want packaging failures to appear at build time or immediately at process start, so regressions are caught before a user reaches a half-configured runtime.
As an integrator, I want manifest flags to correspond to artifacts already present in the image, so I can reason about capabilities from the manifest alone.
Startup must not:
- download packages from npm, PyPI, GitHub, or similar registries
- run package-manager resolution (`npm install`, `pnpm install`, `pip install`, `playwright install`)
- mutate the read-only application tree under `/opt/agentbox`
Startup may:
- create or validate writable directories
- generate instance-local secrets and identity material
- provision workspace defaults and profile scaffolding
- start supervised processes
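The permitted side of this contract can be sketched in shell. This is a minimal illustration, not the agentbox entrypoint: the function name `init_instance_state`, the `secret.key` file, and the state directory argument are hypothetical. Only `/opt/agentbox` comes from this document.

```shell
# Hypothetical helper: create instance-local state under a writable directory.
# APP_ROOT mirrors the read-only application tree named in this document.
APP_ROOT="${APP_ROOT:-/opt/agentbox}"

init_instance_state() {
    state_dir="$1"

    # Refuse to place mutable state inside the immutable application tree.
    case "$state_dir/" in
        "$APP_ROOT"/*)
            echo "FATAL: state dir $state_dir is inside $APP_ROOT" >&2
            return 1
            ;;
    esac

    mkdir -p "$state_dir" || return 1

    # Generate the secret only once so restarts keep the same identity.
    if [ ! -s "$state_dir/secret.key" ]; then
        (umask 077 && od -An -N32 -tx1 /dev/urandom | tr -d ' \n' > "$state_dir/secret.key") || return 1
    fi
}
```

Calling `init_instance_state /var/lib/agentbox/state` (an assumed path) creates state on first boot and leaves it untouched afterwards. Nothing in the function resolves or installs software.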
For every enabled manifest capability that starts a service or exposes a CLI, the built image must already include:
- the executable entrypoint
- its runtime dependency closure
- any required static assets or browser/runtime bundle
This includes service-local `node_modules` or an equivalent packaged closure.
The build pipeline must validate that each enabled service block has a resolvable runtime artifact. If an enabled feature cannot be packaged, the build must fail or the manifest validator must reject the configuration.
Missing artifacts at startup must be fatal for required features. Silent `|| true` fallback is not acceptable for bootstrap-critical work.
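One way to make this rule mechanical is a small wrapper that bootstrap-critical steps must pass through. The sketch below assumes a hypothetical helper named `required_step`; it is not taken from the existing entrypoint.

```shell
# Hypothetical wrapper for bootstrap-critical steps: the opposite of `cmd || true`.
# Usage: required_step <label> <command> [args...]
required_step() {
    label="$1"
    shift
    if "$@"; then
        return 0
    else
        # In the else branch, $? still holds the failed command's exit status.
        status=$?
        echo "FATAL: bootstrap step '$label' failed (exit $status)" >&2
        return "$status"
    fi
}
```

A caller would write something like `required_step "state dir" mkdir -p "$STATE_DIR" || exit 1`, so a failure stops the boot with a labelled message rather than being swallowed.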
Given a locally available image and volumes, startup must succeed with outbound network blocked, unless the selected application mode explicitly requires an external endpoint after readiness.
Bootstrap must emit structured status for:
- bootstrap started
- bootstrap phase completed
- bootstrap failed
- missing artifact detected
Those events must be visible through logs and readiness state.
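A structured event can be as simple as one JSON object per log line. The sketch below is an assumption about shape only: the function name `emit_bootstrap_event` and the field names `ts`, `event`, and `detail` are hypothetical, and it covers the log side, not the readiness state.

```shell
# Hypothetical emitter: writes one JSON object per line to stderr.
# Event names would mirror the four statuses listed above; field names are assumptions.
emit_bootstrap_event() {
    event="$1"    # e.g. bootstrap_started, phase_completed, bootstrap_failed, missing_artifact
    detail="${2:-}"
    # Escape backslashes and double quotes so a single-line detail stays valid JSON.
    detail=$(printf '%s' "$detail" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
    printf '{"ts":"%s","event":"%s","detail":"%s"}\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$event" "$detail" >&2
}
```

For example, `emit_bootstrap_event missing_artifact "/opt/agentbox/mcp/node_modules"` produces a line that a log collector can filter on the `event` field. Details containing newlines or control characters are not handled by this sketch.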
- `config/entrypoint-unified.sh` contains no package-manager install or dependency download steps. Test: `RC-002-03` (entrypoint linter): grep scan of `config/entrypoint-unified.sh` for `npm install`, `pnpm`, `pip install`, `playwright install`, `npm install -g`, and `npx.*install`; zero matches required.
- A cold boot with egress blocked reaches readiness for a valid standalone manifest. Test: `RC-002-01` (no-network boot): container started with `--network none`; asserts `GET /ready` returns HTTP 200 within 60 s.
- If an enabled service artifact is missing, startup fails deterministically with a specific error. Test: `RC-002-05` (missing-artifact fatal): entrypoint wrapper unlinks one required binary before supervisor init; asserts supervisord exits non-zero and stderr matches `FATAL:.*missing` within 30 s.
- The application tree under `/opt/agentbox` remains immutable at runtime. Test: `RC-002-04` (read-only app tree): container started with `/opt/agentbox` bind-mounted read-only; asserts boot reaches readiness and no write attempt inside `/opt/agentbox` is logged.
- Feature-gated services can be started without creating `node_modules` inside the container filesystem at boot. Test: `RC-002-02` (artifact probes): for each binary in the feature matrix (playwright, codex, gemini-cli, claude-code, ruflo, agentic-qe, code-server, nostr-bridge), assert binary present in PATH and `--version` or `--help` exits 0; no `node_modules` directory created under `/opt/agentbox` during probe phase.
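The entrypoint linter described for `RC-002-03` could look roughly like the following. This is a sketch under assumptions: the function name `lint_entrypoint` is hypothetical, and the pattern list is copied from the criterion above (`npm install -g` is already covered by `npm install`).

```shell
# Sketch of an RC-002-03 style linter: succeeds only when the script
# contains none of the forbidden install commands.
lint_entrypoint() {
    script="$1"
    pattern='npm install|pnpm|pip install|playwright install|npx.*install'

    # A missing file must not count as a clean scan.
    if [ ! -r "$script" ]; then
        echo "FATAL: cannot read $script" >&2
        return 2
    fi

    # grep -n prints offending lines with their line numbers.
    if grep -nE "$pattern" "$script" >&2; then
        echo "FATAL: $script still contains install steps" >&2
        return 1
    fi
    return 0
}
```

In CI this would run as `lint_entrypoint config/entrypoint-unified.sh`. Because the scan is a plain grep, a comment that mentions one of the forbidden commands also counts as a match.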
| Metric | Current problem | Target |
|---|---|---|
| Boot requires registry/network for dependency resolution | Yes | No |
| Best-effort bootstrap steps with silent failure | Present | Zero bootstrap-critical silent failures |
| Cold-start time variance caused by downloads | High | Low and bounded |
| Drift between built image and started runtime | Possible | Eliminated for packaged features |
| Risk | Why it matters | Mitigation |
|---|---|---|
| Larger images | Pre-packaging dependencies increases image size | Use feature gates, layer boundaries, and separate variants where needed |
| More build-time work | Some errors move earlier in the lifecycle | Acceptable trade: fail early, not at customer boot |
| Packaging complexity for JS tools | Some tools assume install-time side effects | Package wrappers and assets explicitly; add artifact probes per capability |
Phase 1 — Local service packages (6 services, ~5d)
- Add `buildNpmPackage` derivations in `flake.nix` for the six local services: `management-api`, `mcp` (nostr-bridge), `skills/openai-codex/mcp-server`, `skills/lazy-fetch/mcp-server`, `skills/playwright/mcp-server`, and `skills/comfyui/mcp-server`. Each derivation sets `src = ./<service>`, `npmDepsHash = "<prefetched-hash>"`, and `postInstall` copies the built tree into `$out/opt/agentbox/<service>`. The `lazy-fetch` derivation additionally runs `tsc` in `buildPhase` before packaging `dist/`. The `comfyui/mcp-server` derivation adds `nativeBuildInputs = [pkgs.python3 pkgs.nodeGyp]` because the `sharp` dependency has native bindings that require a gyp rebuild; gate behind `skills.media.comfyui_builtin`. Note: `skills/host-webserver-debug/mcp-server` and `skills/web-summary/mcp-server` have package-lock files and are packaging candidates; scoping to Phase 1 or later is a follow-up decision.
- Remove the five `_install_node_deps` calls from Phase 6 of `config/entrypoint-unified.sh`. Replace with a probe loop: for each expected `node_modules` path, `test -d "$path" || { echo "FATAL: missing closure $path"; exit 1; }`.
- Add `checkPhase` to each derivation: `node <entrypoint> --version` or `node -e "require('./<main>')"`, which confirms the closure loads without import errors.
- Update `appRoot` in `flake.nix` to copy each derivation's output rather than the raw source directory, so the Nix-built `node_modules` are baked into the image layer.
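The probe loop that replaces the `_install_node_deps` calls might be written as a function. The sketch below is an assumption, with a hypothetical name `probe_closures`. Unlike the one-liner in the list above, it reports every missing closure before failing instead of stopping at the first.

```shell
# Sketch of the entrypoint Phase 6 replacement: verify each packaged closure exists.
# The caller passes the expected node_modules paths; nothing is installed here.
probe_closures() {
    missing=0
    for path in "$@"; do
        if [ ! -d "$path" ]; then
            echo "FATAL: missing closure $path" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A call such as `probe_closures /opt/agentbox/management-api/node_modules /opt/agentbox/mcp/node_modules || exit 1` would then gate supervisor start. The exact `node_modules` locations inside each packaged service are assumed here and depend on how the derivations lay out `$out`.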
Phase 2 — Global CLI toolchains (9 packages, ~9d)
- Add `buildNpmPackage` derivations for the nine global CLIs currently installed in Phase 7: `ruvector`, `@claude-flow/cli`, `ruflo`, `agentic-qe`, `nagual-qe`, `codebase-memory-mcp`, `agent-browser`, `playwright` (CLI wrapper only; browsers handled separately), `@mermaid-js/mermaid-cli`. Pin each to the same version range currently used at runtime. Add each derivation to `allPackages` behind its existing `lib.optionals` feature gate, mirroring the `codexPackages` pattern already used for the Codex binary.
- Replace all nine `npm install -g` lines in Phase 7 of `config/entrypoint-unified.sh` with a single comment block explaining that these CLIs are now pre-packaged. Phase 7 becomes a no-op; remove it or reduce it to the artifact probe for each enabled CLI (`command -v ruflo || { echo "FATAL: ruflo not in PATH"; exit 1; }`).
- For `playwright` browsers: the `flake.nix` already includes `pkgs.playwright-driver` in `browserPackages`. Set `PLAYWRIGHT_BROWSERS_PATH=${pkgs.playwright-driver.browsers}` in `imageEnv` when `browserCfg.playwright` is true. Remove the `npx playwright install chromium` call from Phase 7.
- For `@mermaid-js/mermaid-cli` (depends on Chromium via Puppeteer): set `PUPPETEER_SKIP_DOWNLOAD=1` and `PUPPETEER_EXECUTABLE_PATH=${pkgs.chromium}/bin/chromium` in the derivation's build environment; add `pkgs.chromium` to `browserPackages` whenever mermaid is enabled.
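A reduced Phase 7 could consist of two kinds of probe: one for CLIs on `PATH` and one for the browser paths baked in through environment variables. The helpers below are hypothetical (`probe_cli`, `probe_env_path`). Their messages deliberately contain the word "missing" so they would match the `FATAL:.*missing` pattern asserted by `RC-002-05`, which is a wording choice of this sketch.

```shell
# Sketch of a reduced entrypoint Phase 7: probe only, install nothing.
probe_cli() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "FATAL: missing CLI $1 in PATH" >&2
        return 1
    fi
}

# Check that an environment variable names an existing path, for example
# PLAYWRIGHT_BROWSERS_PATH or PUPPETEER_EXECUTABLE_PATH set at image build time.
probe_env_path() {
    var="$1"
    # Indirect lookup; the variable name comes from the script itself, not user input.
    eval "value=\${$var:-}"
    if [ -z "$value" ] || [ ! -e "$value" ]; then
        echo "FATAL: missing path for $var (value: '${value}')" >&2
        return 1
    fi
}
```

With feature gating applied by the caller, usage might read `probe_cli ruflo || exit 1` followed by `probe_env_path PLAYWRIGHT_BROWSERS_PATH || exit 1` when Playwright is enabled.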
Phase 3 — Hash management and CI (1d)
- Add a Renovate `customManagers` entry in `.github/renovate.json` (or create it) with `fileMatch = ["flake\\.nix"]`, `matchStrings` targeting each `npmDepsHash = "..."` line, and `datasourceTemplate = "npm"`. This keeps each hash in sync with `package-lock.json` bumps without manual `nix hash` runs.
- Add a `nix flake check` step to the CI pipeline that builds all derivations and runs their `checkPhase`. Gate image build on this step passing.
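Before relying on a `matchStrings` regex, it may help to confirm locally that every hash line in `flake.nix` has the shape the regex expects. The helper below is a hypothetical sanity check (`list_npm_deps_hashes` is not part of Renovate or Nix). It only extracts values and does not compute or verify hashes.

```shell
# Hypothetical helper: print every npmDepsHash value found in a flake file,
# one per line, matching lines of the form: npmDepsHash = "<hash>"
list_npm_deps_hashes() {
    sed -n 's/.*npmDepsHash[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}
```

Running `list_npm_deps_hashes flake.nix | wc -l` and comparing the count with the number of `buildNpmPackage` derivations gives a quick signal that no hash line is written in a form the pattern would miss.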
Phase 4 — Cleanup and acceptance (1d)
- Delete `STAGE_B_MODE` and the Phase 6/7 function bodies from `config/entrypoint-unified.sh` once all probe replacements are in place. Stage B becomes purely: validate probes, publish `profile.d` env hints, emit `BootstrapCompleted` event.
- Update `docs/guides/quick-start.md` and `README.md` to remove the note about first-boot network requirement. Add a build-time note: "Run `nix build .#runtime` to produce a fully self-contained image; no network access is required at container start."