8 changes: 8 additions & 0 deletions Cargo.toml
@@ -131,3 +131,11 @@ strip = true
[profile.dev]
# Faster compile times for dev builds
debug = 1

[profile.local-fast]
# Local-only profile for faster dockerized inner-loop builds.
inherits = "dev"
opt-level = 1
debug = 0
codegen-units = 256
incremental = true
31 changes: 29 additions & 2 deletions architecture/build-containers.md
@@ -37,6 +37,15 @@ This pulls `ghcr.io/nvidia/openshell-community/sandboxes/<name>:latest`.

`mise run cluster` is the primary development command. It bootstraps a cluster if one doesn't exist, then performs incremental deploys for subsequent runs.

For local (non-CI) Docker builds, OpenShell defaults to the Cargo profile
`local-fast` to reduce rebuild latency. CI keeps `release` builds by default.
Set `OPENSHELL_CARGO_PROFILE=release` locally when you need release-equivalent binaries.
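
As an illustrative sketch of that resolution (the exact wiring lives in the build scripts; only the `OPENSHELL_CARGO_PROFILE` variable name comes from this repo):

```bash
# Local builds default to local-fast; an explicit OPENSHELL_CARGO_PROFILE
# wins. This mirrors the documented behavior, not the scripts' literal code.
profile="${OPENSHELL_CARGO_PROFILE:-local-fast}"
echo "building with cargo profile: ${profile}"

# Release-equivalent local binaries:
#   OPENSHELL_CARGO_PROFILE=release mise run cluster
```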

The Dockerfiles still `touch` the build script and proto files to invalidate
stale generated code, but they no longer touch the gateway/supervisor main
sources. This preserves incremental reuse for unrelated rebuilds while still
forcing protobuf regeneration when needed.

The incremental deploy (`cluster-deploy-fast.sh`) fingerprints local Git changes and only rebuilds components whose files have changed:

| Changed files | Rebuild triggered |
@@ -45,16 +54,34 @@ The incremental deploy (`cluster-deploy-fast.sh`) fingerprints local Git changes
| `crates/openshell-server/*`, `Dockerfile.gateway` | Gateway |
| `crates/openshell-sandbox/*`, `crates/openshell-policy/*` | Supervisor |
| `deploy/helm/openshell/*` | Helm upgrade |
| `Dockerfile.cluster`, cluster entrypoint/healthcheck, kube manifests, bootstrap scripts | Full cluster bootstrap |

When no local changes are detected, the command is a no-op.
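
The fingerprinting idea can be sketched as a hash over the set of changed paths. The real scheme in `cluster-deploy-fast.sh` may differ (it likely hashes file contents too), and `fingerprint_paths` is a hypothetical helper, not part of the repo:

```bash
# Hash the sorted set of changed paths; any added, removed, or renamed path
# flips the fingerprint, so the matching component gets rebuilt.
fingerprint_paths() {
  printf '%s\n' "$@" | sort -u | sha256sum | cut -d' ' -f1
}

a="$(fingerprint_paths crates/openshell-server/src/main.rs)"
b="$(fingerprint_paths crates/openshell-server/src/main.rs deploy/helm/openshell/Chart.yaml)"
[ "$a" = "$b" ] || echo "fingerprints differ: rebuild triggered"
```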

**Gateway updates** are pushed to a local registry and the StatefulSet is restarted. **Supervisor updates** are copied directly into the running cluster container via `docker cp` — new sandbox pods pick up the updated binary immediately through the hostPath mount, with no image rebuild or cluster restart required.
**Gateway updates** are pushed to a local registry and normally restart the StatefulSet. If the pushed digest already matches the running gateway image digest, fast deploy now skips Helm+rollout to avoid unnecessary restarts.

**Supervisor updates** are copied directly into the running cluster container via `docker cp`. By default (`DEPLOY_FAST_SUPERVISOR_RECONCILE=rolling-delete`), fast deploy restarts running sandbox pods one-by-one with bounded waits so they deterministically pick up the new supervisor binary. Set `DEPLOY_FAST_SUPERVISOR_RECONCILE=none` to keep current pods untouched until they naturally restart.
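
The reconcile behavior amounts to a small mode switch. A sketch, with the defaults as documented; the commented `kubectl` loop is illustrative, not the script's exact calls:

```bash
# Dispatch on DEPLOY_FAST_SUPERVISOR_RECONCILE (default: rolling-delete).
mode="${DEPLOY_FAST_SUPERVISOR_RECONCILE:-rolling-delete}"
case "$mode" in
  rolling-delete)
    # for pod in $(kubectl get pods -l app=openshell-sandbox -o name); do
    #   kubectl delete "$pod" --wait --timeout=60s   # bounded per-pod wait
    # done
    action="restart sandbox pods one-by-one"
    ;;
  none)
    action="leave running pods untouched"
    ;;
  *)
    echo "unknown reconcile mode: $mode" >&2
    exit 1
    ;;
esac
echo "supervisor reconcile: $action"
```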

Fingerprints are stored in `.cache/cluster-deploy-fast.state`. You can also target specific components explicitly:
All fast deploy paths finish with a bounded readiness gate (`DEPLOY_FAST_READINESS_TIMEOUT_SECONDS`, default `90`) that validates Kubernetes `readyz`, gateway workload readiness, and supervisor binary presence before writing state.
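
The bounded gate is essentially a deadline loop. In this sketch, `check_ready` is a placeholder standing in for the real readyz/workload/binary checks:

```bash
# Poll until ready or until DEPLOY_FAST_READINESS_TIMEOUT_SECONDS elapses.
timeout="${DEPLOY_FAST_READINESS_TIMEOUT_SECONDS:-90}"
deadline=$(( $(date +%s) + timeout ))

check_ready() {
  # Placeholder for: Kubernetes readyz, gateway workload readiness, and
  # supervisor binary presence. Always succeeds so the sketch terminates.
  true
}

ready=no
while [ "$(date +%s)" -le "$deadline" ]; do
  if check_ready; then ready=yes; break; fi
  sleep 2
done
[ "$ready" = yes ] || { echo "readiness gate timed out after ${timeout}s" >&2; exit 1; }
echo "readiness gate passed"
```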

Fingerprints are stored in `.cache/cluster-deploy-fast.state`. Explicit target deploys update only the reconciled component fingerprints so subsequent auto deploys stay deterministic. You can also target specific components explicitly:

```bash
mise run cluster -- gateway # rebuild gateway only
mise run cluster -- supervisor # rebuild supervisor only
mise run cluster -- chart # helm upgrade only
mise run cluster -- all # rebuild everything
```

To baseline local compile and image build latency before optimization work:

```bash
mise run cluster:baseline # cold + warm build timings
mise run cluster:baseline:full # same plus `mise run cluster` deploy timing
mise run cluster:baseline:warm # warm-only build timings
mise run cluster:baseline:warm:full # warm-only + deploy
```

Reports are written to `.cache/perf/` as both CSV and markdown.

Each `mise run cluster` invocation also emits a deploy transaction report to `.cache/deploy-reports/<tx-id>.md`, including selected actions (gateway rebuild, supervisor update, helm upgrade), fingerprints, and per-step durations.
2 changes: 1 addition & 1 deletion architecture/gateway-single-node.md
@@ -143,7 +143,7 @@ flowchart LR

The `deploy_gateway_with_logs` variant accepts an `FnMut(String)` callback for progress reporting. The CLI wraps this in a `GatewayDeployLogPanel` for interactive terminals.

**Pre-deploy check** (CLI layer in `gateway_start`): In interactive terminals, `check_existing_deployment` inspects whether a container or volume already exists. If found, the user is prompted to destroy and recreate or reuse the existing gateway.
**Pre-deploy check** (CLI layer in `gateway_start`): `check_existing_deployment` inspects whether a container or volume already exists. In interactive terminals, the user is prompted to destroy and recreate or reuse the existing gateway. In non-interactive mode, the command fails unless `--recreate` or `--reuse-ok` is provided explicitly.
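
The resulting decision table can be sketched as a small function. The flag names come from this change; `decide` itself and its argument shape are hypothetical:

```bash
# decide <gateway-exists> <recreate-flag> <reuse-ok-flag> → action or error
decide() {
  exists="$1"; recreate="$2"; reuse_ok="$3"
  if [ "$exists" != yes ]; then echo deploy; return 0; fi
  if [ "$recreate" = yes ]; then echo recreate; return 0; fi
  if [ "$reuse_ok" = yes ]; then echo reuse; return 0; fi
  echo "existing gateway: pass --reuse-ok or --recreate" >&2
  return 1
}

decide no  no  no    # → deploy (fresh install)
decide yes yes no    # → recreate
decide yes no  yes   # → reuse
```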

### 2) Image readiness

13 changes: 11 additions & 2 deletions crates/openshell-cli/src/main.rs
@@ -742,11 +742,18 @@ enum GatewayCommands {

/// Destroy and recreate the gateway from scratch if one already exists.
///
/// Without this flag, an interactive prompt asks whether to recreate;
/// in non-interactive mode the existing gateway is reused silently.
/// Without this flag, an interactive prompt asks whether to recreate.
/// In non-interactive mode, the command fails unless `--reuse-ok` is set.
#[arg(long)]
recreate: bool,

/// Reuse an existing gateway in non-interactive mode.
///
/// Use this in automation when you intentionally want idempotent reuse.
/// Conflicts with `--recreate`.
#[arg(long, conflicts_with = "recreate")]
reuse_ok: bool,

/// Listen on plaintext HTTP instead of mTLS.
///
/// Use when the gateway sits behind a reverse proxy (e.g., Cloudflare
@@ -1445,6 +1452,7 @@ async fn main() -> Result<()> {
port,
gateway_host,
recreate,
reuse_ok,
plaintext,
disable_gateway_auth,
registry_username,
@@ -1458,6 +1466,7 @@
port,
gateway_host.as_deref(),
recreate,
reuse_ok,
plaintext,
disable_gateway_auth,
registry_username.as_deref(),
31 changes: 20 additions & 11 deletions crates/openshell-cli/src/run.rs
@@ -1315,6 +1315,7 @@ pub async fn gateway_admin_deploy(
port: u16,
gateway_host: Option<&str>,
recreate: bool,
reuse_ok: bool,
disable_tls: bool,
disable_gateway_auth: bool,
registry_username: Option<&str>,
@@ -1334,21 +1335,22 @@
});

// Check whether a gateway already exists. If so, prompt the user (unless
// --recreate was passed or we're in non-interactive mode).
// --recreate was passed). Non-interactive mode now fails by default unless
// --reuse-ok is explicitly set.
let mut should_recreate = recreate;
if let Some(existing) =
openshell_bootstrap::check_existing_deployment(name, remote_opts.as_ref()).await?
{
let status = if existing.container_running {
"running"
} else if existing.container_exists {
"stopped"
} else {
"volume only"
};
if !should_recreate {
let interactive = std::io::stdin().is_terminal() && std::io::stderr().is_terminal();
if interactive {
let status = if existing.container_running {
"running"
} else if existing.container_exists {
"stopped"
} else {
"volume only"
};
eprintln!();
eprintln!(
"{} Gateway '{name}' already exists ({status}).",
@@ -1371,10 +1373,17 @@
eprintln!("Keeping existing gateway.");
return Ok(());
}
} else {
// Non-interactive mode: reuse existing gateway silently.
eprintln!("Gateway '{name}' already exists, reusing.");
} else if reuse_ok {
eprintln!("Gateway '{name}' already exists ({status}), reusing (--reuse-ok).");
return Ok(());
} else {
return Err(miette::miette!(
"Gateway '{name}' already exists ({status}).\n\
Non-interactive mode requires explicit intent.\n\
Re-run with one of:\n\
--reuse-ok # keep existing gateway\n\
--recreate # destroy and redeploy"
));
}
}
}
14 changes: 12 additions & 2 deletions crates/openshell-core/build.rs
@@ -2,15 +2,25 @@
// SPDX-License-Identifier: Apache-2.0

use std::env;
use std::path::Path;

fn main() -> Result<(), Box<dyn std::error::Error>> {
// --- Git-derived version ---
// Compute a version from `git describe` for local builds. In Docker/CI
// builds where .git is absent, this silently does nothing and the binary
// falls back to CARGO_PKG_VERSION (which is already sed-patched by the
// build pipeline).
println!("cargo:rerun-if-changed=../../.git/HEAD");
println!("cargo:rerun-if-changed=../../.git/refs/tags");
// In Docker builds we do not copy .git into the context, so registering
// missing rerun paths can force unnecessary build script churn.
if Path::new("../../.git/HEAD").exists() {
println!("cargo:rerun-if-changed=../../.git/HEAD");
}
if Path::new("../../.git/refs/tags").exists() {
println!("cargo:rerun-if-changed=../../.git/refs/tags");
}
if Path::new("../../.git/packed-refs").exists() {
println!("cargo:rerun-if-changed=../../.git/packed-refs");
}

if let Some(version) = git_version() {
println!("cargo:rustc-env=OPENSHELL_GIT_VERSION={version}");
2 changes: 1 addition & 1 deletion crates/openshell-sandbox/src/proxy.rs
@@ -123,7 +123,7 @@ impl ProxyHandle {
/// The proxy uses OPA for network decisions with process-identity binding
/// via `/proc/net/tcp`. All connections are evaluated through OPA policy.
#[allow(clippy::too_many_arguments)]
pub async fn start_with_bind_addr(
pub(crate) async fn start_with_bind_addr(
policy: &ProxyPolicy,
bind_addr: Option<SocketAddr>,
opa_engine: Arc<OpaEngine>,
43 changes: 36 additions & 7 deletions deploy/docker/Dockerfile.cluster
@@ -77,6 +77,7 @@ FROM --platform=$BUILDPLATFORM rust:1.88-slim AS supervisor-builder
ARG TARGETARCH
ARG BUILDARCH
ARG OPENSHELL_CARGO_VERSION
ARG OPENSHELL_CARGO_PROFILE=release
ARG CARGO_TARGET_CACHE_SCOPE=default
ARG SCCACHE_MEMCACHED_ENDPOINT

@@ -121,14 +122,23 @@ COPY proto/ proto/
RUN --mount=type=cache,id=cargo-registry-supervisor-${TARGETARCH},sharing=locked,target=/usr/local/cargo/registry \
--mount=type=cache,id=cargo-target-supervisor-${TARGETARCH}-${CARGO_TARGET_CACHE_SCOPE},sharing=locked,target=/build/target \
--mount=type=cache,id=sccache-supervisor-${TARGETARCH},sharing=locked,target=/tmp/sccache \
. cross-build.sh && cargo_cross_build -p openshell-sandbox 2>/dev/null || true
. cross-build.sh && \
if [ "${OPENSHELL_CARGO_PROFILE}" = "release" ]; then \
cargo_cross_build --release -p openshell-sandbox 2>/dev/null || true; \
elif [ "${OPENSHELL_CARGO_PROFILE}" = "dev" ]; then \
cargo_cross_build -p openshell-sandbox 2>/dev/null || true; \
else \
cargo_cross_build --profile "${OPENSHELL_CARGO_PROFILE}" -p openshell-sandbox 2>/dev/null || true; \
fi

# Copy actual source code
COPY crates/ crates/

# Touch source files to ensure they're rebuilt (not the cached dummy)
RUN touch crates/openshell-sandbox/src/main.rs \
crates/openshell-core/build.rs \
# Touch build.rs and proto files to force proto code regeneration when the
# cargo target cache mount retains stale OUT_DIR artifacts from prior builds.
# Do not touch supervisor sources here; that defeats incremental reuse for
# unrelated changes and makes inner-loop builds slower.
RUN touch crates/openshell-core/build.rs \
proto/*.proto

# Build the supervisor binary
@@ -139,9 +149,28 @@ RUN --mount=type=cache,id=cargo-registry-supervisor-${TARGETARCH},sharing=locked
if [ -n "${OPENSHELL_CARGO_VERSION:-}" ]; then \
sed -i -E '/^\[workspace\.package\]/,/^\[/{s/^version[[:space:]]*=[[:space:]]*".*"/version = "'"${OPENSHELL_CARGO_VERSION}"'"/}' Cargo.toml; \
fi && \
cargo_cross_build --release -p openshell-sandbox && \
mkdir -p /build/out && \
cp "$(cross_output_dir release)/openshell-sandbox" /build/out/
if [ "${OPENSHELL_CARGO_PROFILE}" = "release" ]; then \
cargo_cross_build --release -p openshell-sandbox && \
mkdir -p /build/out && \
cp "$(cross_output_dir release)/openshell-sandbox" /build/out/; \
elif [ "${OPENSHELL_CARGO_PROFILE}" = "dev" ]; then \
cargo_cross_build -p openshell-sandbox && \
mkdir -p /build/out && \
cp "$(cross_output_dir debug)/openshell-sandbox" /build/out/; \
else \
cargo_cross_build --profile "${OPENSHELL_CARGO_PROFILE}" -p openshell-sandbox && \
mkdir -p /build/out && \
cp "$(cross_output_dir "${OPENSHELL_CARGO_PROFILE}")/openshell-sandbox" /build/out/; \
fi

# ---------------------------------------------------------------------------
# Stage 1e: Minimal export stage for local supervisor extraction
# ---------------------------------------------------------------------------
# Exporting directly from supervisor-builder with --output type=local copies the
# full build filesystem (including target cache) and is very slow on macOS.
# This scratch stage contains only the final binary.
FROM scratch AS supervisor-export
COPY --from=supervisor-builder /build/out/openshell-sandbox /openshell-sandbox

# ---------------------------------------------------------------------------
# Stage 2: Install NVIDIA container toolkit on Ubuntu
28 changes: 22 additions & 6 deletions deploy/docker/Dockerfile.gateway
@@ -11,6 +11,7 @@ FROM --platform=$BUILDPLATFORM rust:1.88-slim AS builder
ARG TARGETARCH
ARG BUILDARCH
ARG OPENSHELL_CARGO_VERSION
ARG OPENSHELL_CARGO_PROFILE=release
ARG CARGO_TARGET_CACHE_SCOPE=default

# Install build dependencies
@@ -55,16 +56,23 @@ COPY proto/ proto/
RUN --mount=type=cache,id=cargo-registry-gateway-${TARGETARCH},sharing=locked,target=/usr/local/cargo/registry \
--mount=type=cache,id=cargo-target-gateway-${TARGETARCH}-${CARGO_TARGET_CACHE_SCOPE},sharing=locked,target=/build/target \
--mount=type=cache,id=sccache-gateway-${TARGETARCH},sharing=locked,target=/tmp/sccache \
. cross-build.sh && cargo_cross_build --release -p openshell-server 2>/dev/null || true
. cross-build.sh && \
if [ "${OPENSHELL_CARGO_PROFILE}" = "release" ]; then \
cargo_cross_build --release -p openshell-server 2>/dev/null || true; \
elif [ "${OPENSHELL_CARGO_PROFILE}" = "dev" ]; then \
cargo_cross_build -p openshell-server 2>/dev/null || true; \
else \
cargo_cross_build --profile "${OPENSHELL_CARGO_PROFILE}" -p openshell-server 2>/dev/null || true; \
fi

# Copy actual source code
COPY crates/ crates/

# Touch source files to ensure they're rebuilt (not the cached dummy).
# Touch build.rs and proto files to force proto code regeneration when the
# cargo target cache mount retains stale OUT_DIR artifacts from prior builds.
RUN touch crates/openshell-server/src/main.rs \
crates/openshell-core/build.rs \
# Do not touch service sources here; that defeats incremental reuse for
# unrelated changes and makes inner-loop builds slower.
RUN touch crates/openshell-core/build.rs \
proto/*.proto

# Build the actual application
@@ -75,8 +83,16 @@ RUN --mount=type=cache,id=cargo-registry-gateway-${TARGETARCH},sharing=locked,ta
if [ -n "${OPENSHELL_CARGO_VERSION:-}" ]; then \
sed -i -E '/^\[workspace\.package\]/,/^\[/{s/^version[[:space:]]*=[[:space:]]*".*"/version = "'"${OPENSHELL_CARGO_VERSION}"'"/}' Cargo.toml; \
fi && \
cargo_cross_build --release -p openshell-server && \
cp "$(cross_output_dir release)/openshell-server" /build/openshell-server
if [ "${OPENSHELL_CARGO_PROFILE}" = "release" ]; then \
cargo_cross_build --release -p openshell-server && \
cp "$(cross_output_dir release)/openshell-server" /build/openshell-server; \
elif [ "${OPENSHELL_CARGO_PROFILE}" = "dev" ]; then \
cargo_cross_build -p openshell-server && \
cp "$(cross_output_dir debug)/openshell-server" /build/openshell-server; \
else \
cargo_cross_build --profile "${OPENSHELL_CARGO_PROFILE}" -p openshell-server && \
cp "$(cross_output_dir "${OPENSHELL_CARGO_PROFILE}")/openshell-server" /build/openshell-server; \
fi

# Stage 2: Runtime (uses target platform)
# NVIDIA hardened Ubuntu base for supply chain consistency.
13 changes: 10 additions & 3 deletions scripts/bin/openshell
@@ -25,8 +25,12 @@ else
# Current HEAD commit (detects branch switches, pulls, rebases)
current_head=$(git rev-parse HEAD 2>/dev/null || echo "unknown")

# Collect dirty (modified, staged, untracked) files
mapfile -t changed_files < <(
# Collect dirty (modified, staged, untracked) files.
# Use a bash-3-compatible read loop (macOS default bash has no mapfile).
changed_files=()
while IFS= read -r path; do
changed_files+=("$path")
done < <(
{
git diff --name-only 2>/dev/null
git diff --name-only --cached 2>/dev/null
@@ -95,7 +99,10 @@ if [[ "$needs_build" == "1" ]]; then
cd "$PROJECT_ROOT"
new_head=$(git rev-parse HEAD 2>/dev/null || echo "unknown")
# Recompute fingerprint of remaining dirty files (build may not change them)
mapfile -t post_files < <(
post_files=()
while IFS= read -r path; do
post_files+=("$path")
done < <(
{
git diff --name-only 2>/dev/null
git diff --name-only --cached 2>/dev/null
16 changes: 16 additions & 0 deletions tasks/cluster.toml
@@ -34,3 +34,19 @@ hide = true
description = "Tag and push gateway image to pull registry"
run = "tasks/scripts/cluster-push-component.sh gateway"
hide = true

["cluster:baseline"]
description = "Capture cold/warm baseline timings for local builds"
run = "tasks/scripts/cluster-baseline.sh --mode both"

["cluster:baseline:full"]
description = "Capture baseline timings including cluster deploy"
run = "tasks/scripts/cluster-baseline.sh --mode both --with-deploy"

["cluster:baseline:warm"]
description = "Capture warm-path baseline timings for local builds"
run = "tasks/scripts/cluster-baseline.sh --mode warm"

["cluster:baseline:warm:full"]
description = "Capture warm-path baseline timings including cluster deploy"
run = "tasks/scripts/cluster-baseline.sh --mode warm --with-deploy"