
feat(stack): pull-through registry cache on by default #500

Open
bussyjd wants to merge 1 commit into main from feat/stack-default-pull-through-cache


@bussyjd
Collaborator

@bussyjd bussyjd commented May 17, 2026

Summary

  • Three pull-through k3d registry caches (docker.io, ghcr.io, quay.io) are now started for all users on every obol stack up, not just when OBOL_DEVELOPMENT=true.
  • The local push target (localhost:54103) stays gated behind OBOL_DEVELOPMENT=true — it exists only for just dev-frontend hot-swap.
  • New --no-registry-cache flag on obol stack up (also readable via OBOL_DISABLE_REGISTRY_CACHE=true) for hosts behind a corporate proxy or with tight disk constraints.
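A minimal sketch of how the opt-out could be resolved, assuming the flag takes precedence and the environment variable is parsed as a boolean. The helper name registryCacheDisabled and the injected getenv function are hypothetical and only for illustration; the actual wiring in cmd/obol/main.go may differ, and whether values other than "true" are accepted is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
)

// registryCacheDisabled reports whether the pull-through caches should be
// skipped. Hypothetical helper: the flag wins when set, otherwise
// OBOL_DISABLE_REGISTRY_CACHE is parsed as a boolean. Unset or unparsable
// values leave the caches enabled, matching the "on by default" behaviour.
func registryCacheDisabled(noCacheFlag bool, getenv func(string) string) bool {
	if noCacheFlag {
		return true
	}
	disabled, err := strconv.ParseBool(getenv("OBOL_DISABLE_REGISTRY_CACHE"))
	return err == nil && disabled
}

func main() {
	// A fake environment keeps the example deterministic; real code would
	// pass os.Getenv instead.
	env := map[string]string{"OBOL_DISABLE_REGISTRY_CACHE": "true"}
	withEnv := func(key string) string { return env[key] }
	emptyEnv := func(string) string { return "" }

	fmt.Println(registryCacheDisabled(false, withEnv))  // true
	fmt.Println(registryCacheDisabled(false, emptyEnv)) // false
	fmt.Println(registryCacheDisabled(true, emptyEnv))  // true
}
```

Passing the lookup function in, rather than reading the process environment directly, keeps the early-exit path easy to cover in a unit test.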

Why

On a clean install every obol stack up makes the k3d node pull every image fresh from ghcr.io / docker.io. On the v1337 demo on spark1 this cost ~10 min waiting for LiteLLM alone, and obol stack down && obol stack up re-pays the same cost because the next k3d node also pulls fresh.

The pull-through cache containers were already built and working for dev mode. Promoting them to the default means the second obol stack up on the same host completes the LiteLLM rollout in <2 min vs ~10 min today. Disk footprint: ~0–2 GB per cache container, only what has been pulled.

What's changed

  1. internal/stack/dev_registry.go: Split devRegistryMirrors into pullThroughMirrors (3 caches, always on) and localPushMirror (dev-only). New ensureRegistryCaches(cfg, u, devMode) function; legacy ensureDevRegistries kept as a thin wrapper. devRegistrySetup renamed to registrySetup (type alias for back-compat). renderRegistriesConfig now takes a mirror slice instead of hardcoding the full dev set.
  2. internal/stack/backend_k3d.go: ensureRegistryCaches called unconditionally (unless OBOL_DISABLE_REGISTRY_CACHE=true); devMode flag controls whether localhost:54103 is included.
  3. internal/stack/stack.go: reclaimLeakedDevK3dNetworks now runs for all users (not just dev mode) since the mirror containers are created for everyone and hold Docker networks open after cluster delete.
  4. cmd/obol/main.go: New --no-registry-cache flag on obol stack up.
  5. CLAUDE.md: "Dev Registry Cache" section renamed to "Registry Cache", split into pull-through (all users) and local push target (dev-only) subsections, with opt-out callout.

Test plan

  • go build ./... — clean
  • go test ./internal/stack/ ./cmd/obol/ -count=1 — all pass (both packages)
  • New tests: golden snapshots for pull-through-only and dev-mode registries.yaml; OBOL_DISABLE_REGISTRY_CACHE early-exit test; mirror invariant tests (count, remoteURL presence/absence)
  • Manual smoke: obol stack down && obol stack up, then docker ps | grep k3d-obol shows 3 cache containers; the second up is faster (layers cached)

Risks

  • Extra ~0–2 GB disk per cache container (only what has been pulled). Mitigated by --no-registry-cache opt-out.
  • The Docker network CIDR pool exhaustion risk that reclaimLeakedDevK3dNetworks guards against now applies to all users, so the function now runs for everyone (not just dev mode) on obol stack purge.

Out of scope

  • The localhost:54103 local push target stays gated behind OBOL_DEVELOPMENT=true.
  • No change to obol stack down behaviour — cache containers intentionally persist across down/up cycles.

@OisinKyne
Contributor

> re-pays the same cost because the next k3d node also pulls fresh.

Can't we just say 'if not present' instead of 'Always'?

On a clean install every `obol stack up` made the k3d node pull every
image directly from ghcr.io / docker.io. On the v1337 demo on spark1
this cost ~10 min waiting for LiteLLM alone, and `obol stack down &&
obol stack up` re-paid the same cost because the next k3d node also
pulled fresh.

The pull-through cache containers (docker.io, ghcr.io, quay.io) were
already implemented for OBOL_DEVELOPMENT=true. This commit promotes them
to the default for all users so the second `obol stack up` on the same
host completes the LiteLLM rollout in <2 min vs ~10 min on a cold host.

Changes:
- Three pull-through caches (ports 54100-54102) are now started for all
  users on every `obol stack up`, regardless of OBOL_DEVELOPMENT.
- The local push target (localhost:54103) stays gated behind
  OBOL_DEVELOPMENT=true — it is only needed for `just dev-frontend`
  hot-swap and adds no value for regular installs.
- New `--no-registry-cache` flag on `obol stack up` (env:
  OBOL_DISABLE_REGISTRY_CACHE=true) for hosts behind a corporate proxy
  with their own caching, or with tight disk constraints.
- `reclaimLeakedDevK3dNetworks` (called on `obol stack purge`) now runs
  for all users, not just dev mode, since the mirror containers are
  created for everyone and hold Docker networks open after cluster delete.
- CLAUDE.md "Dev Registry Cache" section renamed to "Registry Cache" and
  split into "Pull-through caches (default for all installs)" and "Local
  push target (OBOL_DEVELOPMENT only)" sub-sections.
- Tests: golden snapshots for pull-through-only and dev-mode
  registries.yaml; OBOL_DISABLE_REGISTRY_CACHE early-exit test;
  mirror invariant tests (count, remoteURL presence/absence).
@OisinKyne OisinKyne force-pushed the feat/stack-default-pull-through-cache branch from 5b39484 to 3fc09ac on May 18, 2026 10:44