diff --git a/architecture/compute-runtimes.md b/architecture/compute-runtimes.md index 9ab512f98..565639d6a 100644 --- a/architecture/compute-runtimes.md +++ b/architecture/compute-runtimes.md @@ -23,7 +23,7 @@ Each runtime receives a sandbox spec from the gateway and is responsible for: | Docker | Local development with Docker available. | Container plus nested sandbox namespace. | Uses host networking so loopback gateway endpoints work from the supervisor. | | Podman | Rootless or single-machine deployments. | Container plus nested sandbox namespace. | Uses the Podman REST API, OCI image volumes, and CDI GPU devices when available. | | Kubernetes | Cluster deployment through Helm. | Pod plus nested sandbox namespace. | Uses Kubernetes API objects, service accounts, secrets, PVC-backed workspace storage, and GPU resources. | -| VM | Experimental microVM isolation. | Per-sandbox libkrun VM. | Gateway spawns `openshell-driver-vm` as a subprocess over a private, state-local Unix socket. | +| VM | Experimental microVM isolation. | Per-sandbox libkrun VM. | Gateway spawns `openshell-driver-vm` as a subprocess over a private, state-local Unix socket. The VM driver caches a prepared `rootfs.ext4` per source image, boots it read-only, and gives each sandbox a writable `overlay.ext4` for merged-root changes and runtime material. | VM runtime state paths are derived only from driver-validated sandbox IDs matching `[A-Za-z0-9._-]{1,128}`. The gateway-owned VM driver socket uses a diff --git a/crates/openshell-driver-vm/README.md b/crates/openshell-driver-vm/README.md index 0a11ceb0a..1dd232105 100644 --- a/crates/openshell-driver-vm/README.md +++ b/crates/openshell-driver-vm/README.md @@ -2,7 +2,7 @@ > Status: Experimental. The VM compute driver is under active development and the interface still has VM-specific plumbing that will be generalized. -Standalone libkrun-backed [`ComputeDriver`](../../proto/compute_driver.proto) for OpenShell. 
The gateway spawns this binary as a subprocess, talks to it over a Unix domain socket with the `openshell.compute.v1.ComputeDriver` gRPC surface, and lets it manage per-sandbox microVMs. The runtime (libkrun + libkrunfw + gvproxy) and the sandbox supervisor are embedded directly in the binary; each sandbox guest rootfs is derived from a configured container image at create time. +Standalone libkrun-backed [`ComputeDriver`](../../proto/compute_driver.proto) for OpenShell. The gateway spawns this binary as a subprocess, talks to it over a Unix domain socket with the `openshell.compute.v1.ComputeDriver` gRPC surface, and lets it manage per-sandbox microVMs. The runtime (libkrun + libkrunfw + gvproxy) and the sandbox supervisor are embedded directly in the binary; each sandbox boots from a cached immutable ext4 root disk derived from the configured container image plus a per-sandbox writable overlay disk. ## How it fits together @@ -42,7 +42,7 @@ By default `mise run gateway:vm`: - Listens on plaintext HTTP at `127.0.0.1:18081`. - Registers the CLI gateway `vm-dev` by writing `~/.config/openshell/gateways/vm-dev/metadata.json`. It does not modify the workspace `.env`. - Persists the gateway SQLite DB under `.cache/gateway-vm/gateway.db`. -- Places the VM driver state (per-sandbox rootfs plus `run/compute-driver.sock`) under `/tmp/openshell-vm-driver-$USER-vm-dev/` so the AF_UNIX socket path stays under macOS `SUN_LEN`. +- Places the VM driver state (per-sandbox `overlay.ext4`, image cache, and `run/compute-driver.sock`) under `/tmp/openshell-vm-driver-$USER-vm-dev/` so the AF_UNIX socket path stays under macOS `SUN_LEN`. - Passes `--driver-dir $PWD/target/debug` so the freshly built `openshell-driver-vm` is used instead of an older installed copy from `~/.local/libexec/openshell`, `/usr/libexec/openshell`, or `/usr/local/libexec`. 
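The per-sandbox `overlay.ext4` files in that state directory are sparse: their apparent size is fixed up front, but blocks are only allocated as the guest writes. A standalone sketch of the same technique (illustrative only; the file name and 4096 MiB size are assumptions taken from the defaults above, and the driver itself does this in Rust rather than shell):

```shell
# Sketch: create a sparse 4096 MiB disk image, the way a per-sandbox
# overlay disk is sized. File name and size here are illustrative.
size_mib=4096
truncate -s $((size_mib * 1024 * 1024)) overlay.ext4

# Apparent size is the full 4 GiB, but almost no blocks are allocated:
ls -l overlay.ext4     # reports the apparent size
du -k overlay.ext4     # reports blocks actually allocated (near zero)

# Formatting (e.g. `mkfs.ext4 -F overlay.ext4`) writes only filesystem
# metadata, so the image stays mostly sparse until the guest writes data.
```

Because the image is sparse, many sandboxes can coexist on a small host disk; real usage grows only with what the guest actually writes into its overlay.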
For GPU passthrough (VFIO), pass `-- --gpu` and run with root privileges: @@ -124,10 +124,11 @@ The gateway resolves `openshell-driver-vm` in this order: `--driver-dir`, conven |---|---|---|---| | `--drivers vm` | `OPENSHELL_DRIVERS` | `kubernetes` | Select the VM compute driver. | | `--grpc-endpoint URL` | `OPENSHELL_GRPC_ENDPOINT` | — | Required. URL the sandbox guest dials to reach the gateway. Use `http://host.containers.internal:` (or `host.docker.internal` / `host.openshell.internal`) so traffic flows through gvproxy's host-loopback NAT (HostIP `192.168.127.254` → host `127.0.0.1`). Loopback URLs like `http://127.0.0.1:` are rewritten automatically by the driver. The bare gateway IP (`192.168.127.1`) only carries gvproxy's own services and will not reach host-bound ports. | -| `--vm-driver-state-dir DIR` | `OPENSHELL_VM_DRIVER_STATE_DIR` | `target/openshell-vm-driver` | Per-sandbox rootfs, console logs, image cache, and private `run/compute-driver.sock` UDS. | +| `--vm-driver-state-dir DIR` | `OPENSHELL_VM_DRIVER_STATE_DIR` | `target/openshell-vm-driver` | Per-sandbox overlay disks, console logs, image cache, and private `run/compute-driver.sock` UDS. | | `--driver-dir DIR` | `OPENSHELL_DRIVER_DIR` | unset | Override the directory searched for `openshell-driver-vm`. | | `--vm-driver-vcpus N` | `OPENSHELL_VM_DRIVER_VCPUS` | `2` | vCPUs per sandbox. | | `--vm-driver-mem-mib N` | `OPENSHELL_VM_DRIVER_MEM_MIB` | `2048` | Memory per sandbox, in MiB. | +| `--vm-overlay-disk-mib N` | `OPENSHELL_VM_OVERLAY_DISK_MIB` | `4096` | Sparse writable overlay disk size per sandbox, in MiB. | | `--vm-krun-log-level N` | `OPENSHELL_VM_KRUN_LOG_LEVEL` | `1` | libkrun verbosity (0–5). | | `--vm-tls-ca PATH` | `OPENSHELL_VM_TLS_CA` | — | CA cert for the guest's mTLS client bundle. Required when `--grpc-endpoint` uses `https://`. | | `--vm-tls-cert PATH` | `OPENSHELL_VM_TLS_CERT` | — | Guest client certificate. 
| @@ -145,7 +146,15 @@ The gateway is auto-registered by `mise run gateway:vm`. In another terminal: ./scripts/bin/openshell sandbox connect demo ``` -First sandbox takes 10–30 seconds to boot (image fetch/prepare/cache + libkrun + guest init). If `--from` is omitted, the VM driver uses the gateway's configured default sandbox image. Without either `--from` or `--sandbox-image`, VM sandbox creation fails. Subsequent creates reuse the prepared sandbox rootfs. +The first sandbox takes 10–30 seconds to boot (image fetch/prepare/cache + libkrun + guest init). If `--from` is omitted, the VM driver uses the gateway's configured default sandbox image. Without either `--from` or `--sandbox-image`, VM sandbox creation fails. Subsequent creates reuse the prepared image cache and create only a sparse per-sandbox `overlay.ext4` before boot. + +During rootfs preparation, the VM driver exports or pulls the selected OCI image, +applies the OpenShell guest mutations, formats a sparse `rootfs.ext4`, and +caches it under `/images//rootfs.ext4`. Each sandbox boots +that cached base disk read-only and gets its own sparse writable +`/sandboxes//overlay.ext4`. Guest init mounts overlayfs as `/`, +so writes to `/sandbox` and other mutable paths land in the overlay while the +cached root image remains unchanged. ## Logs and debugging @@ -162,6 +171,7 @@ The VM guest's serial console is appended to `//console.l - macOS on Apple Silicon, or Linux on aarch64/x86_64 with KVM - Rust toolchain +- e2fsprogs (`mke2fs` or `mkfs.ext4`, plus `debugfs`) for creating the root and overlay disk images and for QEMU environment injection - Guest-supervisor cross-compile toolchain (needed on macOS, and on Linux when host arch ≠ guest arch): - Matching rustup target: `rustup target add aarch64-unknown-linux-gnu` (or `x86_64-unknown-linux-gnu` for an amd64 guest) - `cargo install --locked cargo-zigbuild` and `brew install zig` (or distro equivalent). 
`vm:supervisor` uses `cargo zigbuild` to cross-compile the in-VM `openshell-sandbox` supervisor binary. diff --git a/crates/openshell-driver-vm/runtime/kernel/openshell.kconfig b/crates/openshell-driver-vm/runtime/kernel/openshell.kconfig index b5f0330af..1248773d9 100644 --- a/crates/openshell-driver-vm/runtime/kernel/openshell.kconfig +++ b/crates/openshell-driver-vm/runtime/kernel/openshell.kconfig @@ -8,6 +8,13 @@ # # See also: check-vm-capabilities.sh for runtime verification. +# ── Root disk transport and filesystem ───────────────────────────────── +CONFIG_BLOCK=y +CONFIG_BLK_DEV=y +CONFIG_VIRTIO_BLK=y +CONFIG_EXT4_FS=y +CONFIG_EXT4_USE_FOR_EXT2=y + # ── Network Namespaces (required for pod isolation) ───────────────────── CONFIG_NET_NS=y CONFIG_NAMESPACES=y diff --git a/crates/openshell-driver-vm/scripts/openshell-vm-sandbox-init.sh b/crates/openshell-driver-vm/scripts/openshell-vm-sandbox-init.sh index b61fd4900..5471c8617 100644 --- a/crates/openshell-driver-vm/scripts/openshell-vm-sandbox-init.sh +++ b/crates/openshell-driver-vm/scripts/openshell-vm-sandbox-init.sh @@ -9,12 +9,6 @@ set -euo pipefail -# Source QEMU-injected environment variables if present. -if [ -f /srv/openshell-env.sh ]; then - # shellcheck source=/dev/null - source /srv/openshell-env.sh -fi - BOOT_START=$(date +%s%3N 2>/dev/null || date +%s) # gvisor-tap-vsock subnet layout: # 192.168.127.1 — gateway: gvproxy's DNS / DHCP / HTTP API. 
Does NOT @@ -44,6 +38,73 @@ ts() { printf "[%d.%03ds] %s\n" $((elapsed / 1000)) $((elapsed % 1000)) "$*" } +mount_initial_fs() { + mount -t proc proc /proc 2>/dev/null || true + mount -t sysfs sysfs /sys 2>/dev/null || true + mount -t tmpfs tmpfs /run 2>/dev/null || true + mount -t devtmpfs devtmpfs /dev 2>/dev/null || true +} + +move_mount_if_possible() { + local source="$1" + local target="/newroot${source}" + + mkdir -p "$target" 2>/dev/null || true + mount --move "$source" "$target" 2>/dev/null || true +} + +exec_chroot_overlay_root() { + local chroot_bin + for chroot_bin in /usr/sbin/chroot /usr/bin/chroot /sbin/chroot /bin/chroot; do + if [ -x "$chroot_bin" ]; then + exec "$chroot_bin" /newroot /srv/openshell-vm-sandbox-init.sh --post-overlay + fi + done + + ts "FATAL: chroot not found in guest rootfs" + exit 1 +} + +setup_overlay_root() { + ts "setting up writable overlay root" + mount_initial_fs + + if [ ! -b /dev/vdb ]; then + ts "FATAL: writable overlay disk /dev/vdb not found" + exit 1 + fi + + mount -o remount,ro / 2>/dev/null || true + mount --bind / /lower + mount -o remount,bind,ro /lower 2>/dev/null || true + + mount -t ext4 -o rw /dev/vdb /overlay + mkdir -p /overlay/upper /overlay/work + mount -t overlay overlay \ + -o lowerdir=/lower,upperdir=/overlay/upper,workdir=/overlay/work \ + /newroot + + move_mount_if_possible /proc + move_mount_if_possible /sys + move_mount_if_possible /dev + move_mount_if_possible /run + + exec_chroot_overlay_root +} + +if [ "${1:-}" != "--post-overlay" ]; then + setup_overlay_root +fi + +shift || true + +# Source QEMU-injected environment variables if present. The file lives in the +# overlay upperdir so the cached base rootfs remains immutable. 
+if [ -f /srv/openshell-env.sh ]; then + # shellcheck source=/dev/null + source /srv/openshell-env.sh +fi + parse_endpoint() { local endpoint="$1" local scheme rest authority path host port @@ -239,7 +300,8 @@ setup_gpu() { return 1 fi - # Stage GSP firmware from virtiofs to tmpfs to avoid slow FUSE reads + # Stage GSP firmware to tmpfs so module loading reads it from a stable + # early-boot path. if [ -d /lib/firmware/nvidia ]; then ts "staging GPU firmware to tmpfs" mkdir -p /run/firmware/nvidia @@ -273,6 +335,15 @@ setup_gpu() { fi } +setup_sandbox_workdir() { + mkdir -p /sandbox + if ! chown -R sandbox:sandbox /sandbox 2>/dev/null; then + chown -R 10001:10001 /sandbox + fi + chmod 0755 /sandbox + ts "prepared /sandbox ownership" +} + mount -t proc proc /proc 2>/dev/null & mount -t sysfs sysfs /sys 2>/dev/null & mount -t tmpfs tmpfs /tmp 2>/dev/null & @@ -286,6 +357,8 @@ mount -t tmpfs tmpfs /dev/shm 2>/dev/null & mount -t cgroup2 cgroup2 /sys/fs/cgroup 2>/dev/null & wait +setup_sandbox_workdir + hostname openshell-sandbox-vm 2>/dev/null || true ip link set lo up 2>/dev/null || true diff --git a/crates/openshell-driver-vm/src/driver.rs b/crates/openshell-driver-vm/src/driver.rs index b797f4835..0e3aa03ba 100644 --- a/crates/openshell-driver-vm/src/driver.rs +++ b/crates/openshell-driver-vm/src/driver.rs @@ -5,7 +5,7 @@ use crate::gpu::{ GpuInventory, SubnetAllocator, allocate_vsock_cid, mac_from_sandbox_id, tap_device_name, }; use crate::rootfs::{ - create_rootfs_archive_from_dir, extract_rootfs_archive_to, + create_ext4_image_from_dir_with_size, create_rootfs_image_from_dir, extract_rootfs_archive_to, prepare_sandbox_rootfs_from_image_root, sandbox_guest_init_path, }; use bollard::Docker; @@ -37,7 +37,6 @@ use std::collections::{HashMap, HashSet}; use std::fs; use std::io::Read; use std::net::Ipv4Addr; -use std::os::unix::fs::PermissionsExt; use std::path::{Component, Path, PathBuf}; use std::pin::Pin; use std::process::Stdio; @@ -56,6 +55,7 @@ const 
DRIVER_NAME: &str = "openshell-driver-vm"; const WATCH_BUFFER: usize = 256; const DEFAULT_VCPUS: u8 = 2; const DEFAULT_MEM_MIB: u32 = 2048; +const DEFAULT_OVERLAY_DISK_MIB: u64 = 4096; /// gvproxy host-loopback IP — gvproxy's TCP/UDP/ICMP forwarder NAT-rewrites /// this destination to the host's `127.0.0.1` and dials out from the host /// process. This is the only address that transparently reaches host-bound @@ -84,13 +84,14 @@ const OPENSHELL_HOST_GATEWAY_ALIAS: &str = "host.openshell.internal"; /// `GVPROXY_HOST_LOOPBACK_IP` — they do **not** go through the gateway IP. const GVPROXY_HOST_LOOPBACK_ALIAS: &str = "host.containers.internal"; const GUEST_SSH_SOCKET_PATH: &str = "/run/openshell/ssh.sock"; -const GUEST_TLS_DIR: &str = "/opt/openshell/tls"; const GUEST_TLS_CA_PATH: &str = "/opt/openshell/tls/ca.crt"; const GUEST_TLS_CERT_PATH: &str = "/opt/openshell/tls/tls.crt"; const GUEST_TLS_KEY_PATH: &str = "/opt/openshell/tls/tls.key"; const IMAGE_CACHE_ROOT_DIR: &str = "images"; -const IMAGE_CACHE_ROOTFS_ARCHIVE: &str = "rootfs.tar"; +const IMAGE_CACHE_ROOTFS_IMAGE: &str = "rootfs.ext4"; +const SANDBOX_OVERLAY_IMAGE: &str = "overlay.ext4"; const IMAGE_EXPORT_ROOTFS_ARCHIVE: &str = "source-rootfs.tar"; +const IMAGE_CACHE_LAYOUT_VERSION: &str = "sandbox-rootfs-ext4-overlay-v1"; const IMAGE_IDENTITY_FILE: &str = "image-identity"; const IMAGE_REFERENCE_FILE: &str = "image-reference"; static IMAGE_CACHE_BUILD_COUNTER: AtomicU64 = AtomicU64::new(0); @@ -114,6 +115,7 @@ pub struct VmDriverConfig { pub krun_log_level: u32, pub vcpus: u8, pub mem_mib: u32, + pub overlay_disk_mib: u64, pub guest_tls_ca: Option, pub guest_tls_cert: Option, pub guest_tls_key: Option, @@ -135,6 +137,7 @@ impl Default for VmDriverConfig { krun_log_level: 1, vcpus: DEFAULT_VCPUS, mem_mib: DEFAULT_MEM_MIB, + overlay_disk_mib: DEFAULT_OVERLAY_DISK_MIB, guest_tls_ca: None, guest_tls_cert: None, guest_tls_key: None, @@ -363,7 +366,6 @@ impl VmDriver { let gpu_device = spec.map_or("", |s| 
s.gpu_device.as_str()); let state_dir = sandbox_state_dir(&self.config.state_dir, &sandbox.id)?; - let rootfs = state_dir.join("rootfs"); let image_ref = self.resolved_sandbox_image(sandbox).ok_or_else(|| { Status::failed_precondition( "vm sandboxes require template.image or a configured default sandbox image", @@ -373,7 +375,7 @@ impl VmDriver { sandbox_id = %sandbox.id, image_ref = %image_ref, state_dir = %state_dir.display(), - "vm driver: resolved image ref, preparing rootfs" + "vm driver: resolved image ref, preparing disks" ); tokio::fs::create_dir_all(&state_dir) @@ -397,15 +399,12 @@ impl VmDriver { ), ); - let image_identity = match self - .prepare_runtime_rootfs(&sandbox.id, &image_ref, &rootfs) - .await - { + let image_identity = match self.prepare_runtime_rootfs(&sandbox.id, &image_ref).await { Ok(image_identity) => { info!( sandbox_id = %sandbox.id, image_identity = %image_identity, - "vm driver: rootfs prepared" + "vm driver: cached root disk resolved" ); image_identity } @@ -413,18 +412,24 @@ impl VmDriver { warn!( sandbox_id = %sandbox.id, error = %err.message(), - "vm driver: rootfs preparation failed" + "vm driver: root disk preparation failed" ); let _ = tokio::fs::remove_dir_all(&state_dir).await; return Err(err); } }; - if let Some(tls_paths) = tls_paths.as_ref() - && let Err(err) = prepare_guest_tls_materials(&rootfs, tls_paths).await + let disk_paths = + sandbox_runtime_disk_paths(&self.config.state_dir, &state_dir, &image_identity); + let root_disk = disk_paths.root_disk; + let overlay_disk = disk_paths.overlay_disk; + + if let Err(err) = self + .prepare_runtime_overlay(&overlay_disk, tls_paths.as_ref()) + .await { let _ = tokio::fs::remove_dir_all(&state_dir).await; return Err(Status::internal(format!( - "prepare guest TLS materials failed: {err}" + "prepare guest overlay disk failed: {err}" ))); } @@ -474,7 +479,8 @@ impl VmDriver { command.stdout(Stdio::inherit()); command.stderr(Stdio::inherit()); command.arg("--internal-run-vm"); - 
command.arg("--vm-rootfs").arg(&rootfs); + command.arg("--vm-root-disk").arg(&root_disk); + command.arg("--vm-overlay-disk").arg(&overlay_disk); command.arg("--vm-exec").arg(sandbox_guest_init_path()); command.arg("--vm-workdir").arg("/"); command.arg("--vm-console-output").arg(&console_output); @@ -733,19 +739,37 @@ impl VmDriver { &self, sandbox_id: &str, image_ref: &str, - rootfs: &Path, ) -> Result { - let image_identity = self - .ensure_cached_image_rootfs_archive(sandbox_id, image_ref) - .await?; - let archive_path = image_cache_rootfs_archive(&self.config.state_dir, &image_identity); - let rootfs_dest = rootfs.to_path_buf(); - tokio::task::spawn_blocking(move || extract_rootfs_archive_to(&archive_path, &rootfs_dest)) + self.ensure_cached_image_rootfs_image(sandbox_id, image_ref) .await - .map_err(|err| Status::internal(format!("sandbox rootfs extraction panicked: {err}")))? - .map_err(|err| Status::internal(format!("extract sandbox rootfs failed: {err}")))?; + } - Ok(image_identity) + async fn prepare_runtime_overlay( + &self, + overlay_disk: &Path, + tls_paths: Option<&VmDriverTlsPaths>, + ) -> Result<(), String> { + let tls_materials = match tls_paths { + Some(paths) => Some(read_guest_tls_materials(paths).await?), + None => None, + }; + let overlay_disk = overlay_disk.to_path_buf(); + let overlay_size_bytes = self + .config + .overlay_disk_mib + .checked_mul(1024 * 1024) + .ok_or_else(|| { + format!( + "overlay disk size {} MiB is too large", + self.config.overlay_disk_mib + ) + })?; + + tokio::task::spawn_blocking(move || { + create_sandbox_overlay_image(&overlay_disk, overlay_size_bytes, tls_materials.as_ref()) + }) + .await + .map_err(|err| format!("overlay image preparation panicked: {err}"))? 
} fn resolved_sandbox_image(&self, sandbox: &Sandbox) -> Option { @@ -757,14 +781,14 @@ impl VmDriver { }) } - async fn ensure_cached_image_rootfs_archive( + async fn ensure_cached_image_rootfs_image( &self, sandbox_id: &str, image_ref: &str, ) -> Result { if let Some((docker, image_identity)) = self.resolve_local_docker_image(image_ref).await? { return self - .ensure_cached_local_image_rootfs_archive( + .ensure_cached_local_image_rootfs_image( sandbox_id, image_ref, &docker, @@ -773,7 +797,7 @@ impl VmDriver { .await; } - info!(image_ref = %image_ref, "vm driver: ensuring cached image rootfs archive (registry)"); + info!(image_ref = %image_ref, "vm driver: ensuring cached root disk image (registry)"); let reference = parse_registry_reference(image_ref)?; let client = registry_client(); let auth = registry_auth(image_ref)?; @@ -787,7 +811,7 @@ impl VmDriver { )) })?; info!(image_ref = %image_ref, "vm driver: fetching manifest digest"); - let image_identity = client + let source_image_identity = client .fetch_manifest_digest(&reference, &auth) .await .map_err(|err| { @@ -797,10 +821,11 @@ impl VmDriver { })?; info!( image_ref = %image_ref, - image_identity = %image_identity, + image_identity = %source_image_identity, "vm driver: manifest digest resolved" ); - let archive_path = image_cache_rootfs_archive(&self.config.state_dir, &image_identity); + let image_identity = prepared_image_cache_identity(&source_image_identity); + let image_path = image_cache_rootfs_image(&self.config.state_dir, &image_identity); // Mirror the K8s `Pulling` event so the CLI flips to the // image-pull spinner with the image name as detail. 
We emit it @@ -816,37 +841,37 @@ impl VmDriver { ), ); - if tokio::fs::metadata(&archive_path).await.is_ok() { + if tokio::fs::metadata(&image_path).await.is_ok() { info!( image_identity = %image_identity, - archive_path = %archive_path.display(), - "vm driver: image rootfs archive cache hit (no build needed)" + image_path = %image_path.display(), + "vm driver: root disk image cache hit (no build needed)" ); - self.publish_pulled_event(sandbox_id, image_ref, &archive_path) + self.publish_pulled_event(sandbox_id, image_ref, &image_path) .await; return Ok(image_identity); } info!( image_identity = %image_identity, - "vm driver: image rootfs archive cache miss, acquiring build lock" + "vm driver: root disk image cache miss, acquiring build lock" ); let _cache_guard = self.image_cache_lock.lock().await; info!( image_identity = %image_identity, "vm driver: build lock acquired" ); - if tokio::fs::metadata(&archive_path).await.is_ok() { + if tokio::fs::metadata(&image_path).await.is_ok() { info!( image_identity = %image_identity, - "vm driver: image rootfs archive cache hit after lock (built by another task)" + "vm driver: root disk image cache hit after lock (built by another task)" ); - self.publish_pulled_event(sandbox_id, image_ref, &archive_path) + self.publish_pulled_event(sandbox_id, image_ref, &image_path) .await; return Ok(image_identity); } - self.build_cached_registry_image_rootfs_archive( + self.build_cached_registry_image_rootfs_image( sandbox_id, &client, &reference, @@ -855,7 +880,7 @@ impl VmDriver { &image_identity, ) .await?; - self.publish_pulled_event(sandbox_id, image_ref, &archive_path) + self.publish_pulled_event(sandbox_id, image_ref, &image_path) .await; Ok(image_identity) } @@ -936,14 +961,15 @@ impl VmDriver { } } - async fn ensure_cached_local_image_rootfs_archive( + async fn ensure_cached_local_image_rootfs_image( &self, sandbox_id: &str, image_ref: &str, docker: &Docker, image_identity: &str, ) -> Result { - let archive_path = 
image_cache_rootfs_archive(&self.config.state_dir, image_identity); + let cache_identity = prepared_image_cache_identity(image_identity); + let image_path = image_cache_rootfs_image(&self.config.state_dir, &cache_identity); self.publish_platform_event( sandbox_id.to_string(), @@ -955,38 +981,38 @@ impl VmDriver { ), ); - if tokio::fs::metadata(&archive_path).await.is_ok() { - self.publish_pulled_event(sandbox_id, image_ref, &archive_path) + if tokio::fs::metadata(&image_path).await.is_ok() { + self.publish_pulled_event(sandbox_id, image_ref, &image_path) .await; - return Ok(image_identity.to_string()); + return Ok(cache_identity); } let _cache_guard = self.image_cache_lock.lock().await; - if tokio::fs::metadata(&archive_path).await.is_ok() { - self.publish_pulled_event(sandbox_id, image_ref, &archive_path) + if tokio::fs::metadata(&image_path).await.is_ok() { + self.publish_pulled_event(sandbox_id, image_ref, &image_path) .await; - return Ok(image_identity.to_string()); + return Ok(cache_identity); } - self.build_cached_local_image_rootfs_archive(docker, image_ref, image_identity) + self.build_cached_local_image_rootfs_image(docker, image_ref, &cache_identity) .await?; - self.publish_pulled_event(sandbox_id, image_ref, &archive_path) + self.publish_pulled_event(sandbox_id, image_ref, &image_path) .await; - Ok(image_identity.to_string()) + Ok(cache_identity) } - async fn build_cached_local_image_rootfs_archive( + async fn build_cached_local_image_rootfs_image( &self, docker: &Docker, image_ref: &str, image_identity: &str, ) -> Result<(), Status> { let cache_dir = image_cache_dir(&self.config.state_dir, image_identity); - let archive_path = image_cache_rootfs_archive(&self.config.state_dir, image_identity); + let image_path = image_cache_rootfs_image(&self.config.state_dir, image_identity); let staging_dir = image_cache_staging_dir(&self.config.state_dir, image_identity); let exported_rootfs = staging_dir.join(IMAGE_EXPORT_ROOTFS_ARCHIVE); let prepared_rootfs = 
staging_dir.join("rootfs"); - let prepared_archive = staging_dir.join(IMAGE_CACHE_ROOTFS_ARCHIVE); + let prepared_image = staging_dir.join(IMAGE_CACHE_ROOTFS_IMAGE); tokio::fs::create_dir_all(image_cache_root_dir(&self.config.state_dir)) .await @@ -1021,14 +1047,14 @@ impl VmDriver { let image_identity_owned = image_identity.to_string(); let exported_rootfs_for_build = exported_rootfs.clone(); let prepared_rootfs_for_build = prepared_rootfs.clone(); - let prepared_archive_for_build = prepared_archive.clone(); + let prepared_image_for_build = prepared_image.clone(); let build_result = tokio::task::spawn_blocking(move || { - prepare_exported_rootfs_archive( + prepare_exported_rootfs_image( &image_ref_owned, &image_identity_owned, &exported_rootfs_for_build, &prepared_rootfs_for_build, - &prepared_archive_for_build, + &prepared_image_for_build, ) }) .await @@ -1039,19 +1065,19 @@ impl VmDriver { return Err(Status::failed_precondition(err)); } - if tokio::fs::metadata(&archive_path).await.is_ok() { + if tokio::fs::metadata(&image_path).await.is_ok() { let _ = tokio::fs::remove_dir_all(&staging_dir).await; return Ok(()); } - tokio::fs::rename(&prepared_archive, &archive_path) + tokio::fs::rename(&prepared_image, &image_path) .await - .map_err(|err| Status::internal(format!("store cached image rootfs failed: {err}")))?; + .map_err(|err| Status::internal(format!("store cached rootfs image failed: {err}")))?; let _ = tokio::fs::remove_dir_all(&staging_dir).await; Ok(()) } - async fn build_cached_registry_image_rootfs_archive( + async fn build_cached_registry_image_rootfs_image( &self, sandbox_id: &str, client: &OciClient, @@ -1061,10 +1087,10 @@ impl VmDriver { image_identity: &str, ) -> Result<(), Status> { let cache_dir = image_cache_dir(&self.config.state_dir, image_identity); - let archive_path = image_cache_rootfs_archive(&self.config.state_dir, image_identity); + let image_path = image_cache_rootfs_image(&self.config.state_dir, image_identity); let staging_dir = 
image_cache_staging_dir(&self.config.state_dir, image_identity); let prepared_rootfs = staging_dir.join("rootfs"); - let prepared_archive = staging_dir.join(IMAGE_CACHE_ROOTFS_ARCHIVE); + let prepared_image = staging_dir.join(IMAGE_CACHE_ROOTFS_IMAGE); tokio::fs::create_dir_all(image_cache_root_dir(&self.config.state_dir)) .await @@ -1115,13 +1141,13 @@ impl VmDriver { } info!( image_ref = %image_ref, - "vm driver: image layers pulled, preparing rootfs archive" + "vm driver: image layers pulled, preparing rootfs image" ); let image_ref_owned = image_ref.to_string(); let image_identity_owned = image_identity.to_string(); let prepared_rootfs_for_build = prepared_rootfs.clone(); - let prepared_archive_for_build = prepared_archive.clone(); + let prepared_image_for_build = prepared_image.clone(); let build_result = tokio::task::spawn_blocking(move || { prepare_sandbox_rootfs_from_image_root( &prepared_rootfs_for_build, @@ -1130,7 +1156,7 @@ impl VmDriver { .map_err(|err| { format!("vm sandbox image '{image_ref_owned}' is not base-compatible: {err}") })?; - create_rootfs_archive_from_dir(&prepared_rootfs_for_build, &prepared_archive_for_build) + create_rootfs_image_from_dir(&prepared_rootfs_for_build, &prepared_image_for_build) }) .await .map_err(|err| Status::internal(format!("image rootfs preparation panicked: {err}")))?; @@ -1139,28 +1165,28 @@ impl VmDriver { warn!( image_ref = %image_ref, error = %err, - "vm driver: rootfs archive build failed" + "vm driver: rootfs image build failed" ); let _ = tokio::fs::remove_dir_all(&staging_dir).await; return Err(Status::failed_precondition(err)); } - if tokio::fs::metadata(&archive_path).await.is_ok() { + if tokio::fs::metadata(&image_path).await.is_ok() { info!( image_identity = %image_identity, - "vm driver: another task wrote archive while we were building, discarding ours" + "vm driver: another task wrote image while we were building, discarding ours" ); let _ = tokio::fs::remove_dir_all(&staging_dir).await; return 
Ok(()); } - tokio::fs::rename(&prepared_archive, &archive_path) + tokio::fs::rename(&prepared_image, &image_path) .await - .map_err(|err| Status::internal(format!("store cached image rootfs failed: {err}")))?; + .map_err(|err| Status::internal(format!("store cached rootfs image failed: {err}")))?; info!( image_identity = %image_identity, - archive_path = %archive_path.display(), - "vm driver: image rootfs archive committed to cache" + image_path = %image_path.display(), + "vm driver: root disk image committed to cache" ); let _ = tokio::fs::remove_dir_all(&staging_dir).await; Ok(()) @@ -1633,17 +1659,17 @@ async fn export_local_image_rootfs_to_path( } } -fn prepare_exported_rootfs_archive( +fn prepare_exported_rootfs_image( image_ref: &str, image_identity: &str, exported_rootfs: &Path, prepared_rootfs: &Path, - prepared_archive: &Path, + prepared_image: &Path, ) -> Result<(), String> { extract_rootfs_archive_to(exported_rootfs, prepared_rootfs)?; prepare_sandbox_rootfs_from_image_root(prepared_rootfs, image_identity) .map_err(|err| format!("vm sandbox image '{image_ref}' is not base-compatible: {err}"))?; - create_rootfs_archive_from_dir(prepared_rootfs, prepared_archive) + create_rootfs_image_from_dir(prepared_rootfs, prepared_image) } fn registry_client() -> OciClient { @@ -1807,8 +1833,8 @@ impl VmDriver { /// Emit a `Pulled` platform event with a message that mirrors the /// kubelet's `Successfully pulled image ... Image size: N bytes.` /// format so the CLI's `extract_image_size` parser works unchanged. 
- async fn publish_pulled_event(&self, sandbox_id: &str, image_ref: &str, archive_path: &Path) { - let size_suffix = tokio::fs::metadata(archive_path).await.map_or_else( + async fn publish_pulled_event(&self, sandbox_id: &str, image_ref: &str, image_path: &Path) { + let size_suffix = tokio::fs::metadata(image_path).await.map_or_else( |_| String::new(), |meta| format!(" Image size: {} bytes.", meta.len()), ); @@ -2272,6 +2298,27 @@ fn sandbox_state_dir(root: &Path, sandbox_id: &str) -> Result { Ok(sandboxes_root_dir(root).join(sandbox_id)) } +fn sandbox_overlay_image(state_dir: &Path) -> PathBuf { + state_dir.join(SANDBOX_OVERLAY_IMAGE) +} + +#[derive(Debug, Clone, PartialEq, Eq)] +struct SandboxRuntimeDiskPaths { + root_disk: PathBuf, + overlay_disk: PathBuf, +} + +fn sandbox_runtime_disk_paths( + driver_state_root: &Path, + state_dir: &Path, + image_identity: &str, +) -> SandboxRuntimeDiskPaths { + SandboxRuntimeDiskPaths { + root_disk: image_cache_rootfs_image(driver_state_root, image_identity), + overlay_disk: sandbox_overlay_image(state_dir), + } +} + #[allow(clippy::result_large_err)] fn validate_sandbox_state_dir(root: &Path, state_dir: &Path) -> Result<(), Status> { let sandboxes_root = sandboxes_root_dir(root); @@ -2341,8 +2388,8 @@ fn image_cache_dir(root: &Path, image_identity: &str) -> PathBuf { image_cache_root_dir(root).join(sanitize_image_identity(image_identity)) } -fn image_cache_rootfs_archive(root: &Path, image_identity: &str) -> PathBuf { - image_cache_dir(root, image_identity).join(IMAGE_CACHE_ROOTFS_ARCHIVE) +fn image_cache_rootfs_image(root: &Path, image_identity: &str) -> PathBuf { + image_cache_dir(root, image_identity).join(IMAGE_CACHE_ROOTFS_IMAGE) } fn image_cache_staging_dir(root: &Path, image_identity: &str) -> PathBuf { @@ -2353,6 +2400,10 @@ fn image_cache_staging_dir(root: &Path, image_identity: &str) -> PathBuf { )) } +fn prepared_image_cache_identity(image_identity: &str) -> String { + 
format!("{IMAGE_CACHE_LAYOUT_VERSION}:{image_identity}") +} + fn sanitize_image_identity(image_identity: &str) -> String { image_identity .chars() @@ -2390,29 +2441,105 @@ async fn write_sandbox_image_metadata( Ok(()) } -async fn prepare_guest_tls_materials( - rootfs: &Path, - paths: &VmDriverTlsPaths, -) -> Result<(), std::io::Error> { - let guest_tls_dir = rootfs.join(GUEST_TLS_DIR.trim_start_matches('/')); - tokio::fs::create_dir_all(&guest_tls_dir).await?; +#[derive(Debug, Clone)] +struct GuestTlsMaterials { + ca: Vec, + cert: Vec, + key: Vec, +} - copy_guest_tls_material(&paths.ca, &guest_tls_dir.join("ca.crt"), 0o644).await?; - copy_guest_tls_material(&paths.cert, &guest_tls_dir.join("tls.crt"), 0o644).await?; - copy_guest_tls_material(&paths.key, &guest_tls_dir.join("tls.key"), 0o600).await?; - Ok(()) +async fn read_guest_tls_materials(paths: &VmDriverTlsPaths) -> Result { + let ca = tokio::fs::read(&paths.ca) + .await + .map_err(|err| format!("read {}: {err}", paths.ca.display()))?; + let cert = tokio::fs::read(&paths.cert) + .await + .map_err(|err| format!("read {}: {err}", paths.cert.display()))?; + let key = tokio::fs::read(&paths.key) + .await + .map_err(|err| format!("read {}: {err}", paths.key.display()))?; + Ok(GuestTlsMaterials { ca, cert, key }) } -async fn copy_guest_tls_material( - source: &Path, - dest: &Path, - mode: u32, -) -> Result<(), std::io::Error> { - tokio::fs::copy(source, dest).await?; - tokio::fs::set_permissions(dest, fs::Permissions::from_mode(mode)).await?; +fn create_sandbox_overlay_image( + overlay_disk: &Path, + size_bytes: u64, + tls_materials: Option<&GuestTlsMaterials>, +) -> Result<(), String> { + let staging_dir = overlay_staging_dir(overlay_disk); + if staging_dir.exists() { + fs::remove_dir_all(&staging_dir) + .map_err(|err| format!("remove stale overlay staging dir: {err}"))?; + } + + let result = (|| { + fs::create_dir_all(staging_dir.join("upper")) + .map_err(|err| format!("create overlay upper dir: {err}"))?; + 
fs::create_dir_all(staging_dir.join("work")) + .map_err(|err| format!("create overlay work dir: {err}"))?; + fs::create_dir_all(staging_dir.join("config")) + .map_err(|err| format!("create overlay config dir: {err}"))?; + + if let Some(tls) = tls_materials { + stage_guest_tls_materials(&staging_dir, tls)?; + } + + create_ext4_image_from_dir_with_size(&staging_dir, overlay_disk, size_bytes) + })(); + + let _ = fs::remove_dir_all(&staging_dir); + result +} + +fn stage_guest_tls_materials( + staging_dir: &Path, + materials: &GuestTlsMaterials, +) -> Result<(), String> { + let tls_dir = staging_dir + .join("upper") + .join(GUEST_TLS_CA_PATH.trim_start_matches('/')) + .parent() + .ok_or_else(|| "guest TLS CA path has no parent".to_string())? + .to_path_buf(); + fs::create_dir_all(&tls_dir) + .map_err(|err| format!("create guest TLS dir {}: {err}", tls_dir.display()))?; + + let ca_path = staging_dir + .join("upper") + .join(GUEST_TLS_CA_PATH.trim_start_matches('/')); + let cert_path = staging_dir + .join("upper") + .join(GUEST_TLS_CERT_PATH.trim_start_matches('/')); + let key_path = staging_dir + .join("upper") + .join(GUEST_TLS_KEY_PATH.trim_start_matches('/')); + fs::write(&ca_path, &materials.ca) + .map_err(|err| format!("write guest TLS CA {}: {err}", ca_path.display()))?; + fs::write(&cert_path, &materials.cert) + .map_err(|err| format!("write guest TLS cert {}: {err}", cert_path.display()))?; + fs::write(&key_path, &materials.key) + .map_err(|err| format!("write guest TLS key {}: {err}", key_path.display()))?; + + #[cfg(unix)] + { + use std::os::unix::fs::PermissionsExt as _; + + fs::set_permissions(&key_path, fs::Permissions::from_mode(0o600)) + .map_err(|err| format!("chmod guest TLS key {}: {err}", key_path.display()))?; + } + Ok(()) } +fn overlay_staging_dir(overlay_disk: &Path) -> PathBuf { + let parent = overlay_disk.parent().unwrap_or_else(|| Path::new(".")); + parent.join(format!( + ".openshell-overlay-staging-{}-{}", + std::process::id(), + 
current_time_ms() + )) +} + async fn terminate_vm_process(child: &mut Child) -> Result<(), std::io::Error> { if let Some(pid) = child.id() && let Err(err) = kill(Pid::from_raw(pid.cast_signed()), Signal::SIGTERM) @@ -2662,6 +2789,22 @@ mod tests { assert_eq!(err.code(), Code::InvalidArgument); } + #[test] + fn sandbox_runtime_disk_paths_use_cached_root_and_per_sandbox_overlay() { + let driver_state = Path::new("/tmp/openshell-vm"); + let state_dir = driver_state.join("sandboxes").join("sandbox-123"); + let image_identity = prepared_image_cache_identity("sha256:abc"); + + let disks = sandbox_runtime_disk_paths(driver_state, &state_dir, &image_identity); + + assert_eq!( + disks.root_disk, + image_cache_rootfs_image(driver_state, &image_identity) + ); + assert_eq!(disks.overlay_disk, state_dir.join(SANDBOX_OVERLAY_IMAGE)); + assert_ne!(disks.root_disk.parent(), Some(state_dir.as_path())); + } + #[test] fn capabilities_report_configured_default_image() { let driver = VmDriver { @@ -3188,54 +3331,11 @@ mod tests { } #[test] - fn prepare_exported_rootfs_archive_rewrites_docker_exported_rootfs() { - let base = unique_temp_dir(); - let source_rootfs = base.join("source-rootfs"); - let exported_rootfs = base.join("exported-rootfs.tar"); - let prepared_rootfs = base.join("prepared-rootfs"); - let prepared_archive = base.join("prepared-rootfs.tar"); - let extracted = base.join("extracted"); - - for path in [ - "bin/bash", - "bin/mount", - "bin/sed", - "sbin/ip", - "opt/openshell/bin/openshell-sandbox", - ] { - let path = source_rootfs.join(path); - fs::create_dir_all(path.parent().unwrap()).unwrap(); - fs::write(path, "").unwrap(); - } - - create_rootfs_archive_from_dir(&source_rootfs, &exported_rootfs).unwrap(); - prepare_exported_rootfs_archive( - "openshell/sandbox-from:123", - "sha256:local-image", - &exported_rootfs, - &prepared_rootfs, - &prepared_archive, - ) - .unwrap(); - extract_rootfs_archive_to(&prepared_archive, &extracted).unwrap(); - - 
assert!(extracted.join("srv/openshell-vm-sandbox-init.sh").is_file()); - assert!( - extracted - .join("opt/openshell/bin/openshell-sandbox") - .is_file() - ); + fn prepared_image_cache_identity_includes_rootfs_layout_version() { assert_eq!( - fs::read_to_string(extracted.join("opt/openshell/.rootfs-type")).unwrap(), - "sandbox\n" + prepared_image_cache_identity("sha256:local-image"), + format!("{IMAGE_CACHE_LAYOUT_VERSION}:sha256:local-image") ); - assert!( - fs::read_to_string(extracted.join(".openshell-rootfs-variant")) - .unwrap() - .contains("sha256:local-image") - ); - - let _ = fs::remove_dir_all(base); } #[test] @@ -3247,50 +3347,61 @@ mod tests { } #[tokio::test] - async fn prepare_guest_tls_materials_copies_bundle_into_rootfs() { + async fn read_guest_tls_materials_reports_missing_input() { let base = unique_temp_dir(); - let source_dir = base.join("source"); - let rootfs = base.join("rootfs"); - std::fs::create_dir_all(&source_dir).unwrap(); - std::fs::create_dir_all(&rootfs).unwrap(); - - let ca = source_dir.join("ca.crt"); - let cert = source_dir.join("tls.crt"); - let key = source_dir.join("tls.key"); - std::fs::write(&ca, "ca").unwrap(); - std::fs::write(&cert, "cert").unwrap(); - std::fs::write(&key, "key").unwrap(); - - prepare_guest_tls_materials( - &rootfs, - &VmDriverTlsPaths { - ca: ca.clone(), - cert: cert.clone(), - key: key.clone(), - }, - ) + let source_dir = base.join("missing-source"); + + let err = read_guest_tls_materials(&VmDriverTlsPaths { + ca: source_dir.join("ca.crt"), + cert: source_dir.join("tls.crt"), + key: source_dir.join("tls.key"), + }) .await - .unwrap(); + .expect_err("missing TLS materials should fail before image injection"); + + assert!(err.contains("ca.crt")); + + let _ = std::fs::remove_dir_all(base); + } + + #[cfg(unix)] + #[test] + fn stage_guest_tls_materials_places_files_in_overlay_upper_with_private_key_mode() { + use std::os::unix::fs::PermissionsExt as _; + + let base = unique_temp_dir(); + let materials = 
GuestTlsMaterials { + ca: b"ca".to_vec(), + cert: b"cert".to_vec(), + key: b"key".to_vec(), + }; + + stage_guest_tls_materials(&base, &materials).expect("stage TLS materials"); - let guest_dir = rootfs.join(GUEST_TLS_DIR.trim_start_matches('/')); assert_eq!( - std::fs::read_to_string(guest_dir.join("ca.crt")).unwrap(), - "ca" + fs::read( + base.join("upper") + .join(GUEST_TLS_CA_PATH.trim_start_matches('/')) + ) + .unwrap(), + b"ca" ); assert_eq!( - std::fs::read_to_string(guest_dir.join("tls.crt")).unwrap(), - "cert" + fs::read( + base.join("upper") + .join(GUEST_TLS_CERT_PATH.trim_start_matches('/')) + ) + .unwrap(), + b"cert" ); + let key_path = base + .join("upper") + .join(GUEST_TLS_KEY_PATH.trim_start_matches('/')); + assert_eq!(fs::read(&key_path).unwrap(), b"key"); assert_eq!( - std::fs::read_to_string(guest_dir.join("tls.key")).unwrap(), - "key" + fs::metadata(&key_path).unwrap().permissions().mode() & 0o777, + 0o600 ); - let key_mode = std::fs::metadata(guest_dir.join("tls.key")) - .unwrap() - .permissions() - .mode() - & 0o777; - assert_eq!(key_mode, 0o600); let _ = std::fs::remove_dir_all(base); } diff --git a/crates/openshell-driver-vm/src/ffi.rs b/crates/openshell-driver-vm/src/ffi.rs index db5d3ec10..423ad6f05 100644 --- a/crates/openshell-driver-vm/src/ffi.rs +++ b/crates/openshell-driver-vm/src/ffi.rs @@ -29,7 +29,18 @@ type KrunInitLog = type KrunCreateCtx = unsafe extern "C" fn() -> i32; type KrunFreeCtx = unsafe extern "C" fn(ctx_id: u32) -> i32; type KrunSetVmConfig = unsafe extern "C" fn(ctx_id: u32, num_vcpus: u8, ram_mib: u32) -> i32; -type KrunSetRoot = unsafe extern "C" fn(ctx_id: u32, root_path: *const c_char) -> i32; +type KrunAddDisk = unsafe extern "C" fn( + ctx_id: u32, + block_id: *const c_char, + disk_path: *const c_char, + read_only: bool, +) -> i32; +type KrunSetRootDiskRemount = unsafe extern "C" fn( + ctx_id: u32, + device: *const c_char, + fstype: *const c_char, + options: *const c_char, +) -> i32; type KrunSetWorkdir = unsafe 
extern "C" fn(ctx_id: u32, workdir_path: *const c_char) -> i32;
 type KrunSetExec = unsafe extern "C" fn(
     ctx_id: u32,
@@ -67,7 +78,8 @@ pub struct LibKrun {
     pub krun_create_ctx: KrunCreateCtx,
     pub krun_free_ctx: KrunFreeCtx,
     pub krun_set_vm_config: KrunSetVmConfig,
-    pub krun_set_root: KrunSetRoot,
+    pub krun_add_disk: KrunAddDisk,
+    pub krun_set_root_disk_remount: KrunSetRootDiskRemount,
     pub krun_set_workdir: KrunSetWorkdir,
     pub krun_set_exec: KrunSetExec,
     pub krun_set_console_output: KrunSetConsoleOutput,
@@ -119,7 +131,12 @@ impl LibKrun {
             krun_create_ctx: load_symbol(library, b"krun_create_ctx\0", &libkrun_path)?,
             krun_free_ctx: load_symbol(library, b"krun_free_ctx\0", &libkrun_path)?,
             krun_set_vm_config: load_symbol(library, b"krun_set_vm_config\0", &libkrun_path)?,
-            krun_set_root: load_symbol(library, b"krun_set_root\0", &libkrun_path)?,
+            krun_add_disk: load_symbol(library, b"krun_add_disk\0", &libkrun_path)?,
+            krun_set_root_disk_remount: load_symbol(
+                library,
+                b"krun_set_root_disk_remount\0",
+                &libkrun_path,
+            )?,
             krun_set_workdir: load_symbol(library, b"krun_set_workdir\0", &libkrun_path)?,
             krun_set_exec: load_symbol(library, b"krun_set_exec\0", &libkrun_path)?,
             krun_set_console_output: load_symbol(
diff --git a/crates/openshell-driver-vm/src/main.rs b/crates/openshell-driver-vm/src/main.rs
index ed9967f4a..124f9af9d 100644
--- a/crates/openshell-driver-vm/src/main.rs
+++ b/crates/openshell-driver-vm/src/main.rs
@@ -27,8 +27,11 @@ struct Args {
     #[arg(long, hide = true, default_value_t = false)]
     internal_run_vm: bool,
 
-    #[arg(long, hide = true)]
-    vm_rootfs: Option<PathBuf>,
+    #[arg(long = "vm-root-disk", hide = true, alias = "vm-rootfs")]
+    vm_root_disk: Option<PathBuf>,
+
+    #[arg(long = "vm-overlay-disk", hide = true)]
+    vm_overlay_disk: Option<PathBuf>,
 
     #[arg(long, hide = true)]
     vm_exec: Option<String>,
 
@@ -114,6 +117,9 @@ struct Args {
     #[arg(long, env = "OPENSHELL_VM_DRIVER_MEM_MIB", default_value_t = 2048)]
     mem_mib: u32,
 
+    #[arg(long, env = "OPENSHELL_VM_OVERLAY_DISK_MIB", 
default_value_t = 4096)]
+    overlay_disk_mib: u64,
+
     #[arg(long, env = "OPENSHELL_VM_GPU")]
     gpu: bool,
 
@@ -199,6 +205,7 @@ async fn main() -> Result<()> {
         krun_log_level: args.krun_log_level,
         vcpus: args.vcpus,
         mem_mib: args.mem_mib,
+        overlay_disk_mib: args.overlay_disk_mib,
         guest_tls_ca: args.guest_tls_ca.clone(),
         guest_tls_cert: args.guest_tls_cert.clone(),
         guest_tls_key: args.guest_tls_key.clone(),
@@ -448,10 +455,14 @@ impl Stream for AuthenticatedUnixIncoming {
 }
 
 fn build_vm_launch_config(args: &Args) -> std::result::Result<VmLaunchConfig, String> {
-    let rootfs = args
-        .vm_rootfs
+    let root_disk = args
+        .vm_root_disk
+        .clone()
+        .ok_or_else(|| "--vm-root-disk is required in internal VM mode".to_string())?;
+    let overlay_disk = args
+        .vm_overlay_disk
         .clone()
-        .ok_or_else(|| "--vm-rootfs is required in internal VM mode".to_string())?;
+        .ok_or_else(|| "--vm-overlay-disk is required in internal VM mode".to_string())?;
     let exec_path = args
         .vm_exec
         .clone()
@@ -468,7 +479,8 @@ fn build_vm_launch_config(args: &Args) -> std::result::Result<VmLaunchConfig, String> {
     Ok(VmLaunchConfig {
-        rootfs,
+        root_disk,
+        overlay_disk,
diff --git a/crates/openshell-driver-vm/src/rootfs.rs b/crates/openshell-driver-vm/src/rootfs.rs
 pub fn sandbox_init_script_path() -> &'static str {
     SANDBOX_GUEST_INIT_PATH
@@ -44,6 +52,7 @@ pub fn extract_rootfs_archive_to(archive_path: &Path, dest: &Path) -> Result<(),
         .map_err(|e| format!("extract rootfs tarball into {}: {e}", dest.display()))
 }
 
+#[cfg(test)]
 pub fn create_rootfs_archive_from_dir(source: &Path, archive_path: &Path) -> Result<(), String> {
     if let Some(parent) = archive_path.parent() {
         fs::create_dir_all(parent).map_err(|e| format!("create {}: {e}", parent.display()))?;
@@ -65,6 +74,68 @@ pub fn create_rootfs_archive_from_dir(source: &Path, archive_path: &Path) -> Res
         .map_err(|e| format!("finalize {}: {e}", archive_path.display()))
 }
 
+pub fn create_rootfs_image_from_dir(source: &Path, image_path: &Path) -> Result<(), String> {
+    let image_size = rootfs_image_size_bytes(source)?;
+    create_ext4_image_from_dir_with_size(source, image_path, image_size)
+}
+
+pub fn create_ext4_image_from_dir_with_size(
+    source: &Path,
+    image_path: &Path,
+    image_size: u64,
+) -> 
Result<(), String> { + if let Some(parent) = image_path.parent() { + fs::create_dir_all(parent).map_err(|e| format!("create {}: {e}", parent.display()))?; + } + if image_path.exists() { + fs::remove_file(image_path) + .map_err(|e| format!("remove old rootfs image {}: {e}", image_path.display()))?; + } + + let required_size = ext4_image_min_size_bytes(source)?; + if image_size < required_size { + return Err(format!( + "ext4 image size {} bytes is too small for {} (requires at least {} bytes)", + image_size, + source.display(), + required_size + )); + } + + let image = File::create(image_path) + .map_err(|e| format!("create rootfs image {}: {e}", image_path.display()))?; + image + .set_len(image_size) + .map_err(|e| format!("size rootfs image {}: {e}", image_path.display()))?; + drop(image); + + if let Err(err) = format_ext4_image_from_dir(source, image_path) { + let _ = fs::remove_file(image_path); + return Err(err); + } + + Ok(()) +} + +pub fn write_rootfs_image_file( + image_path: &Path, + guest_path: &str, + contents: &[u8], +) -> Result<(), String> { + ensure_rootfs_image_parent_dirs(image_path, guest_path); + + let tmp_path = temporary_injection_path(image_path); + fs::write(&tmp_path, contents).map_err(|e| format!("write {}: {e}", tmp_path.display()))?; + let _ = run_debugfs(image_path, &format!("rm {guest_path}")); + let result = run_debugfs( + image_path, + &format!("write {} {}", tmp_path.display(), guest_path), + ); + let _ = fs::remove_file(&tmp_path); + result +} + +#[cfg(test)] fn append_rootfs_tree_to_archive( builder: &mut tar::Builder>, source: &Path, @@ -119,6 +190,7 @@ fn append_rootfs_tree_to_archive( Ok(()) } +#[cfg(test)] fn append_symlink_to_archive( builder: &mut tar::Builder>, source_path: &Path, @@ -165,6 +237,10 @@ fn prepare_sandbox_rootfs(rootfs: &Path) -> Result<(), String> { fs::write(opt_dir.join(".rootfs-type"), "sandbox\n") .map_err(|e| format!("write sandbox rootfs marker: {e}"))?; ensure_sandbox_guest_user(rootfs)?; + 
create_sandbox_mountpoint(&rootfs.join("sandbox"))?;
+    create_sandbox_mountpoint(&rootfs.join("lower"))?;
+    create_sandbox_mountpoint(&rootfs.join("overlay"))?;
+    create_sandbox_mountpoint(&rootfs.join("newroot"))?;
     Ok(())
 }
 
@@ -174,6 +250,15 @@ pub fn validate_sandbox_rootfs(rootfs: &Path) -> Result<(), String> {
     require_rootfs_path(rootfs, "/opt/openshell/bin/openshell-sandbox")?;
     require_any_rootfs_path(rootfs, &["/bin/bash"])?;
     require_any_rootfs_path(rootfs, &["/bin/mount", "/usr/bin/mount"])?;
+    require_any_rootfs_path(
+        rootfs,
+        &[
+            "/usr/sbin/chroot",
+            "/usr/bin/chroot",
+            "/sbin/chroot",
+            "/bin/chroot",
+        ],
+    )?;
     require_any_rootfs_path(
         rootfs,
         &["/sbin/ip", "/usr/sbin/ip", "/bin/ip", "/usr/bin/ip"],
@@ -182,6 +267,164 @@
     Ok(())
 }
 
+fn create_sandbox_mountpoint(path: &Path) -> Result<(), String> {
+    fs::create_dir_all(path).map_err(|e| format!("create {}: {e}", path.display()))?;
+    #[cfg(unix)]
+    {
+        use std::os::unix::fs::PermissionsExt as _;
+
+        fs::set_permissions(path, fs::Permissions::from_mode(0o755))
+            .map_err(|e| format!("chmod {}: {e}", path.display()))?;
+    }
+    Ok(())
+}
+
+fn rootfs_image_size_bytes(source: &Path) -> Result<u64, String> {
+    let used = directory_size_bytes(source)?;
+    let headroom = (used / 4).max(ROOTFS_IMAGE_MIN_HEADROOM_BYTES);
+    let size = (used + headroom).max(ROOTFS_IMAGE_MIN_SIZE_BYTES);
+    Ok(round_up_to_mib(size))
+}
+
+fn ext4_image_min_size_bytes(source: &Path) -> Result<u64, String> {
+    let used = directory_size_bytes(source)?;
+    Ok(round_up_to_mib(used + EXT4_IMAGE_MIN_HEADROOM_BYTES))
+}
+
+fn directory_size_bytes(path: &Path) -> Result<u64, String> {
+    let metadata =
+        fs::symlink_metadata(path).map_err(|e| format!("stat {}: {e}", path.display()))?;
+    if metadata.file_type().is_file() || metadata.file_type().is_symlink() {
+        return Ok(metadata.len());
+    }
+    if !metadata.file_type().is_dir() {
+        return Ok(0);
+    }
+
+    let mut size = 4096;
+    for entry in 
fs::read_dir(path).map_err(|e| format!("read {}: {e}", path.display()))? { + let entry = entry.map_err(|e| format!("read {}: {e}", path.display()))?; + size += directory_size_bytes(&entry.path())?; + } + Ok(size) +} + +fn round_up_to_mib(bytes: u64) -> u64 { + const MIB: u64 = 1024 * 1024; + bytes.div_ceil(MIB) * MIB +} + +fn format_ext4_image_from_dir(source: &Path, image_path: &Path) -> Result<(), String> { + let mut last_error = None; + for tool in ["mke2fs", "mkfs.ext4"] { + for candidate in e2fs_tool_candidates(tool) { + let label = candidate.display().to_string(); + let output = Command::new(&candidate) + .arg("-q") + .arg("-F") + .arg("-t") + .arg("ext4") + .arg("-E") + .arg("root_owner=0:0") + .arg("-d") + .arg(source) + .arg(image_path) + .output(); + match output { + Ok(output) if output.status.success() => return Ok(()), + Ok(output) => { + last_error = Some(format!( + "{label} failed with status {}\nstdout: {}\nstderr: {}", + output.status, + String::from_utf8_lossy(&output.stdout), + String::from_utf8_lossy(&output.stderr) + )); + } + Err(err) if err.kind() == std::io::ErrorKind::NotFound => { + last_error = Some(format!("{label} not found")); + } + Err(err) => { + last_error = Some(format!("run {label}: {err}")); + } + } + } + } + Err(format!( + "failed to create ext4 rootfs image from {}: {}. 
Install e2fsprogs (mke2fs/mkfs.ext4) and retry", + source.display(), + last_error.unwrap_or_else(|| "no ext4 formatter found".to_string()) + )) +} + +fn ensure_rootfs_image_parent_dirs(image_path: &Path, guest_path: &str) { + let Some(parent) = Path::new(guest_path).parent() else { + return; + }; + let mut current = String::new(); + for component in parent.components() { + let part = component.as_os_str().to_string_lossy(); + if part == "/" || part.is_empty() { + continue; + } + current.push('/'); + current.push_str(&part); + let _ = run_debugfs(image_path, &format!("mkdir {current}")); + } +} + +fn run_debugfs(image_path: &Path, command: &str) -> Result<(), String> { + let mut last_error = None; + for candidate in e2fs_tool_candidates("debugfs") { + let label = candidate.display().to_string(); + let output = Command::new(&candidate) + .arg("-w") + .arg("-R") + .arg(command) + .arg(image_path) + .output(); + match output { + Ok(output) if output.status.success() => return Ok(()), + Ok(output) => { + last_error = Some(format!( + "{label} failed with status {}\nstdout: {}\nstderr: {}", + output.status, + String::from_utf8_lossy(&output.stdout), + String::from_utf8_lossy(&output.stderr) + )); + } + Err(err) if err.kind() == std::io::ErrorKind::NotFound => { + last_error = Some(format!("{label} not found")); + } + Err(err) => { + last_error = Some(format!("run {label}: {err}")); + } + } + } + Err(format!( + "debugfs command '{command}' failed for {}: {}. 
Install e2fsprogs (debugfs) and retry",
+        image_path.display(),
+        last_error.unwrap_or_else(|| "debugfs not found".to_string())
+    ))
+}
+
+fn e2fs_tool_candidates(tool: &str) -> Vec<PathBuf> {
+    let mut candidates = vec![PathBuf::from(tool)];
+    for root in ["/opt/homebrew/opt/e2fsprogs", "/usr/local/opt/e2fsprogs"] {
+        candidates.push(Path::new(root).join("sbin").join(tool));
+        candidates.push(Path::new(root).join("bin").join(tool));
+    }
+    candidates
+}
+
+fn temporary_injection_path(image_path: &Path) -> PathBuf {
+    let n = INJECTION_COUNTER.fetch_add(1, Ordering::Relaxed);
+    let parent = image_path.parent().unwrap_or_else(|| Path::new("."));
+    parent.join(format!(
+        ".openshell-rootfs-inject-{}-{n}",
+        std::process::id()
+    ))
+}
+
 fn ensure_sandbox_guest_user(rootfs: &Path) -> Result<(), String> {
     const SANDBOX_UID: u32 = 10001;
     const SANDBOX_GID: u32 = 10001;
@@ -336,6 +579,7 @@ mod tests {
         fs::create_dir_all(rootfs.join("sbin")).expect("create sbin");
         fs::write(rootfs.join("bin/bash"), b"bash").expect("write bash");
         fs::write(rootfs.join("bin/mount"), b"mount").expect("write mount");
+        fs::write(rootfs.join("bin/chroot"), b"chroot").expect("write chroot");
         fs::write(rootfs.join("bin/sed"), b"sed").expect("write sed");
         fs::write(rootfs.join("sbin/ip"), b"ip").expect("write ip");
 
@@ -343,7 +587,16 @@
         validate_sandbox_rootfs(&rootfs).expect("validate sandbox rootfs");
 
         assert!(rootfs.join("srv/openshell-vm-sandbox-init.sh").is_file());
-        assert!(!rootfs.join("sandbox").exists());
+        assert!(rootfs.join("sandbox").is_dir());
+        assert!(rootfs.join("lower").is_dir());
+        assert!(rootfs.join("overlay").is_dir());
+        assert!(rootfs.join("newroot").is_dir());
+        assert!(
+            fs::read_dir(rootfs.join("sandbox"))
+                .expect("read sandbox")
+                .next()
+                .is_none()
+        );
         assert!(
             fs::read_to_string(rootfs.join("etc/passwd"))
                 .expect("read passwd")
@@ -363,7 +616,7 @@
     }
 
     #[test]
-    fn prepare_sandbox_rootfs_preserves_image_workdir_contents() {
+    fn 
prepare_sandbox_rootfs_preserves_image_workdir_contents_in_rootfs() { let dir = unique_temp_dir(); let rootfs = dir.join("rootfs"); @@ -378,6 +631,7 @@ mod tests { prepare_sandbox_rootfs(&rootfs).expect("prepare sandbox rootfs"); + assert!(rootfs.join("sandbox").is_dir()); assert_eq!( fs::read_to_string(rootfs.join("sandbox/app.py")).expect("read app"), "print('hello')\n" diff --git a/crates/openshell-driver-vm/src/runtime.rs b/crates/openshell-driver-vm/src/runtime.rs index 758808c8e..6640db33c 100644 --- a/crates/openshell-driver-vm/src/runtime.rs +++ b/crates/openshell-driver-vm/src/runtime.rs @@ -10,7 +10,7 @@ use std::ptr; use std::sync::atomic::{AtomicI32, Ordering}; use std::time::{Duration, Instant}; -use crate::{embedded_runtime, ffi, procguard}; +use crate::{embedded_runtime, ffi, procguard, rootfs}; pub const VM_RUNTIME_DIR_ENV: &str = "OPENSHELL_VM_RUNTIME_DIR"; @@ -18,7 +18,7 @@ pub const VM_RUNTIME_DIR_ENV: &str = "OPENSHELL_VM_RUNTIME_DIR"; /// Used by the SIGTERM/SIGINT handler to forward signals to the VM. static CHILD_PID: AtomicI32 = AtomicI32::new(0); -/// PID of the helper process (gvproxy for libkrun, virtiofsd for QEMU). +/// PID of the helper process (gvproxy for libkrun; zero for QEMU). /// Zero when not running. Used by the SIGTERM/SIGINT handler and /// procguard cleanup callback to ensure the helper doesn't outlive the /// launcher (especially on macOS where `PR_SET_PDEATHSIG` is absent). 
@@ -45,7 +45,8 @@ const COMPAT_NET_FEATURES: u32 = NET_FEATURE_CSUM | NET_FEATURE_HOST_UFO; pub struct VmLaunchConfig { - pub rootfs: PathBuf, + pub root_disk: PathBuf, + pub overlay_disk: PathBuf, pub vcpus: u8, pub mem_mib: u32, pub exec_path: String, @@ -96,10 +97,16 @@ fn run_qemu_vm(config: &VmLaunchConfig) -> Result<(), String> { .as_deref() .ok_or("host_ip is required for QEMU backend")?; - if !config.rootfs.is_dir() { + if !config.root_disk.is_file() { return Err(format!( - "rootfs directory not found: {}", - config.rootfs.display() + "root disk image not found: {}", + config.root_disk.display() + )); + } + if !config.overlay_disk.is_file() { + return Err(format!( + "overlay disk image not found: {}", + config.overlay_disk.display() )); } @@ -111,70 +118,13 @@ fn run_qemu_vm(config: &VmLaunchConfig) -> Result<(), String> { check_kvm_access()?; let guest_env = qemu_guest_env_vars(config, host_dns_server()); - write_guest_env_file(&config.rootfs, &guest_env)?; - - let rootfs_str = config.rootfs.to_str().ok_or("rootfs path not UTF-8")?; - let sandbox_dir = config.rootfs.parent().unwrap_or(&config.rootfs); - let sock_prefix = tap_device.trim_start_matches("vmtap-"); - let virtiofsd_sock_dir = PathBuf::from(format!("/tmp/ovm-qemu-{sock_prefix}")); - std::fs::create_dir_all(&virtiofsd_sock_dir) - .map_err(|e| format!("create virtiofsd sock dir: {e}"))?; - let virtiofsd_sock = virtiofsd_sock_dir.join("virtiofsd.sock"); - let shm_path = format!("/dev/shm/ovm-qemu-{sock_prefix}"); - - std::fs::create_dir_all(&shm_path).map_err(|e| format!("create shm dir: {e}"))?; + write_guest_env_file(&config.overlay_disk, &guest_env)?; let runtime_dir = qemu_runtime_dir()?; - let gw_port = config.gateway_port.unwrap_or(0); setup_tap_networking(tap_device, host_ip, gw_port)?; let mut tap_guard = TapGuard::new(tap_device.to_string(), host_ip.to_string(), gw_port); - let virtiofsd_log = sandbox_dir.join("virtiofsd.log"); - let virtiofsd_log_file = - 
std::fs::File::create(&virtiofsd_log).map_err(|e| format!("create virtiofsd log: {e}"))?; - - let virtiofsd_bin = { - let runtime_virtiofsd = runtime_dir.join("virtiofsd"); - if runtime_virtiofsd.is_file() { - runtime_virtiofsd - } else { - PathBuf::from("virtiofsd") - } - }; - - let mut virtiofsd_cmd = StdCommand::new(&virtiofsd_bin); - virtiofsd_cmd - .arg("--socket-path") - .arg(&virtiofsd_sock) - .arg("--shared-dir") - .arg(rootfs_str) - .arg("--cache=auto") - .stdin(Stdio::null()) - .stdout(Stdio::null()) - .stderr(virtiofsd_log_file); - - #[cfg(target_os = "linux")] - { - use nix::sys::signal::Signal; - use std::os::unix::process::CommandExt as _; - unsafe { - virtiofsd_cmd.pre_exec(|| { - nix::sys::prctl::set_pdeathsig(Signal::SIGKILL) - .map_err(|err| std::io::Error::other(format!("pdeathsig: {err}"))) - }); - } - } - - let virtiofsd_child = virtiofsd_cmd - .spawn() - .map_err(|e| format!("failed to start virtiofsd: {e}"))?; - let virtiofsd_pid = virtiofsd_child.id().cast_signed(); - GVPROXY_PID.store(virtiofsd_pid, Ordering::Relaxed); - let mut virtiofsd_guard = GvproxyGuard::new(virtiofsd_child); - - wait_for_path(&virtiofsd_sock, Duration::from_secs(5), "virtiofsd socket")?; - let vmlinux = runtime_dir.join("vmlinux"); if !vmlinux.is_file() { return Err(format!("VM kernel not found: {}", vmlinux.display())); @@ -198,20 +148,7 @@ fn run_qemu_vm(config: &VmLaunchConfig) -> Result<(), String> { .arg(&vmlinux) .arg("-append") .arg(&kernel_cmdline) - .arg("-chardev") - .arg(format!( - "socket,id=virtiofs,path={}", - virtiofsd_sock.display() - )) - .arg("-device") - .arg("vhost-user-fs-pci,chardev=virtiofs,tag=rootfs") - .arg("-object") - .arg(format!( - "memory-backend-memfd,id=mem,size={}M,share=on", - config.mem_mib - )) - .arg("-numa") - .arg("node,memdev=mem") + .args(qemu_disk_args(config)) .arg("-netdev") .arg(format!( "tap,id=net0,ifname={tap_device},script=no,downscript=no" @@ -263,15 +200,8 @@ fn run_qemu_vm(config: &VmLaunchConfig) -> Result<(), 
String> {
         .map_err(|e| format!("failed to wait for QEMU: {e}"))?;
     CHILD_PID.store(0, Ordering::Relaxed);
 
-    unsafe {
-        libc::kill(virtiofsd_pid, libc::SIGTERM);
-    }
-    virtiofsd_guard.disarm();
-    GVPROXY_PID.store(0, Ordering::Relaxed);
     teardown_tap_networking(tap_device, host_ip, gw_port);
     tap_guard.disarm();
-    let _ = std::fs::remove_dir_all(&shm_path);
-    let _ = std::fs::remove_dir_all(&virtiofsd_sock_dir);
 
     if status.success() {
         Ok(())
@@ -280,12 +210,30 @@ fn run_qemu_vm(config: &VmLaunchConfig) -> Result<(), String> {
     }
 }
 
-/// Write environment variables into the rootfs so the guest init script
-/// can source them. virtiofs shares the host rootfs directory into the guest.
-fn write_guest_env_file(rootfs: &Path, env_vars: &[String]) -> Result<(), String> {
-    let srv_dir = rootfs.join("srv");
-    std::fs::create_dir_all(&srv_dir).map_err(|e| format!("create /srv in rootfs: {e}"))?;
-    let env_file = srv_dir.join("openshell-env.sh");
+fn qemu_disk_args(config: &VmLaunchConfig) -> Vec<String> {
+    vec![
+        "-drive".to_string(),
+        format!(
+            "file={},if=none,format=raw,id=rootfs,readonly=on",
+            config.root_disk.display()
+        ),
+        "-device".to_string(),
+        "virtio-blk-pci,drive=rootfs".to_string(),
+        "-drive".to_string(),
+        format!(
+            "file={},if=none,format=raw,id=overlay",
+            config.overlay_disk.display()
+        ),
+        "-device".to_string(),
+        "virtio-blk-pci,drive=overlay".to_string(),
+    ]
+}
+
+/// Write environment variables into the overlay disk so the guest init script
+/// can source them after the overlay root is mounted. QEMU does not provide a
+/// `krun_set_exec` equivalent, so the launcher injects this small per-sandbox
+/// file into the overlay upperdir before boot.
+fn write_guest_env_file(overlay_disk: &Path, env_vars: &[String]) -> Result<(), String> {
     let mut content = String::new();
     for var in env_vars {
         if let Some((key, value)) = var.split_once('=') {
@@ -293,8 +241,11 @@ fn write_guest_env_file(rootfs: &Path, env_vars: &[String]) -> Result<(), String
             let _ = writeln!(content, "export {key}=\"{}\"", shell_escape(value));
         }
     }
-    std::fs::write(&env_file, &content).map_err(|e| format!("write guest env file: {e}"))?;
-    Ok(())
+    rootfs::write_rootfs_image_file(
+        overlay_disk,
+        "/upper/srv/openshell-env.sh",
+        content.as_bytes(),
+    )
 }
 
 fn qemu_guest_env_vars(config: &VmLaunchConfig, dns_server: Option<IpAddr>) -> Vec<String> {
@@ -331,9 +282,9 @@ fn shell_escape(s: &str) -> String {
 fn build_kernel_cmdline(config: &VmLaunchConfig) -> String {
     let mut parts = vec![
         "console=ttyS0".to_string(),
-        "root=rootfs".to_string(),
-        "rootfstype=virtiofs".to_string(),
-        "rw".to_string(),
+        "root=/dev/vda".to_string(),
+        "rootfstype=ext4".to_string(),
+        "ro".to_string(),
         "panic=-1".to_string(),
        format!("init={}", config.exec_path),
     ];
@@ -674,10 +625,16 @@ fn procguard_kill_children() {
 }
 
 fn run_libkrun_vm(config: &VmLaunchConfig) -> Result<(), String> {
-    if !config.rootfs.is_dir() {
+    if !config.root_disk.is_file() {
+        return Err(format!(
+            "root disk image not found: {}",
+            config.root_disk.display()
+        ));
+    }
+    if !config.overlay_disk.is_file() {
         return Err(format!(
-            "rootfs directory not found: {}",
-            config.rootfs.display()
+            "overlay disk image not found: {}",
+            config.overlay_disk.display()
         ));
     }
 
@@ -702,7 +659,7 @@ fn run_libkrun_vm(config: &VmLaunchConfig) -> Result<(), String> {
     let vm = VmContext::create(&runtime_dir, config.log_level)?;
     vm.set_vm_config(config.vcpus, config.mem_mib)?;
-    vm.set_root(&config.rootfs)?;
+    vm.set_disks(&config.root_disk, &config.overlay_disk)?;
     vm.set_workdir(&config.workdir)?;
 
     // Run gvproxy strictly as the guest's virtual NIC / DHCP / router.
@@ -749,12 +706,12 @@ fn run_libkrun_vm(config: &VmLaunchConfig) -> Result<(), String> { )); } - let sock_base = gvproxy_socket_base(&config.rootfs)?; + let sock_base = gvproxy_socket_base(&config.root_disk)?; let net_sock = sock_base.with_extension("v"); let _ = std::fs::remove_file(&net_sock); let _ = std::fs::remove_file(sock_base.with_extension("v-krun.sock")); - let run_dir = config.rootfs.parent().unwrap_or(&config.rootfs); + let run_dir = config.root_disk.parent().unwrap_or(&config.root_disk); let gvproxy_log = run_dir.join("gvproxy.log"); let gvproxy_log_file = std::fs::File::create(&gvproxy_log) .map_err(|e| format!("create gvproxy log {}: {e}", gvproxy_log.display()))?; @@ -1013,11 +970,52 @@ impl VmContext { ) } - fn set_root(&self, rootfs: &Path) -> Result<(), String> { - let rootfs_c = path_to_cstring(rootfs)?; + fn set_disks(&self, root_disk: &Path, overlay_disk: &Path) -> Result<(), String> { + let root_disk_c = path_to_cstring(root_disk)?; + let block_id_c = CString::new("root").map_err(|e| format!("invalid block id: {e}"))?; + check( + unsafe { + (self.krun.krun_add_disk)( + self.ctx_id, + block_id_c.as_ptr(), + root_disk_c.as_ptr(), + true, + ) + }, + "krun_add_disk", + )?; + + let overlay_disk_c = path_to_cstring(overlay_disk)?; + let overlay_block_id_c = + CString::new("overlay").map_err(|e| format!("invalid block id: {e}"))?; check( - unsafe { (self.krun.krun_set_root)(self.ctx_id, rootfs_c.as_ptr()) }, - "krun_set_root", + unsafe { + (self.krun.krun_add_disk)( + self.ctx_id, + overlay_block_id_c.as_ptr(), + overlay_disk_c.as_ptr(), + false, + ) + }, + "krun_add_disk", + )?; + + let device_c = + CString::new("/dev/vda").map_err(|e| format!("invalid root disk device: {e}"))?; + let fstype_c = + CString::new("ext4").map_err(|e| format!("invalid root disk fstype: {e}"))?; + let options_c = + CString::new("ro").map_err(|e| format!("invalid root disk options: {e}"))?; + check( + unsafe { + (self.krun.krun_set_root_disk_remount)( + self.ctx_id, + 
device_c.as_ptr(),
+                    fstype_c.as_ptr(),
+                    options_c.as_ptr(),
+                )
+            },
+            "krun_set_root_disk_remount",
         )
     }
 
@@ -1234,8 +1232,8 @@ fn secure_socket_base(subdir: &str) -> Result<PathBuf, String> {
     Ok(dir)
 }
 
-fn gvproxy_socket_base(rootfs: &Path) -> Result<PathBuf, String> {
-    Ok(secure_socket_base("osd-gv")?.join(hash_path_id(rootfs)))
+fn gvproxy_socket_base(root_disk: &Path) -> Result<PathBuf, String> {
+    Ok(secure_socket_base("osd-gv")?.join(hash_path_id(root_disk)))
 }
 
 fn install_signal_forwarding(pid: i32) {
@@ -1342,7 +1340,8 @@ mod tests {
     fn qemu_config() -> VmLaunchConfig {
         VmLaunchConfig {
-            rootfs: PathBuf::from("/rootfs"),
+            root_disk: PathBuf::from("/rootfs.ext4"),
+            overlay_disk: PathBuf::from("/overlay.ext4"),
             vcpus: 2,
             mem_mib: 2048,
             exec_path: "/srv/openshell-vm-sandbox-init.sh".to_string(),
@@ -1377,6 +1376,9 @@
     fn kernel_cmdline_keeps_guest_init_metadata_out_of_proc_cmdline() {
         let cmdline = build_kernel_cmdline(&qemu_config());
 
+        assert!(cmdline.contains("root=/dev/vda"));
+        assert!(cmdline.contains("rootfstype=ext4"));
+        assert!(cmdline.contains(" ro"));
         assert!(cmdline.contains("ip=10.0.128.2::10.0.128.1:255.255.255.252:sandbox::off"));
         assert!(cmdline.contains("firmware_class.path=/lib/firmware"));
         assert!(!cmdline.contains("VM_NET_IP="));
@@ -1384,4 +1386,24 @@
         assert!(!cmdline.contains("VM_NET_DNS="));
         assert!(!cmdline.contains("GPU_ENABLED="));
     }
+
+    #[test]
+    fn qemu_disk_args_attach_base_readonly_and_overlay_readwrite() {
+        let args = qemu_disk_args(&qemu_config());
+
+        assert!(args.contains(&"-drive".to_string()));
+        assert!(
+            args.contains(
+                &"file=/rootfs.ext4,if=none,format=raw,id=rootfs,readonly=on".to_string()
+            )
+        );
+        assert!(args.contains(&"virtio-blk-pci,drive=rootfs".to_string()));
+        assert!(args.contains(&"file=/overlay.ext4,if=none,format=raw,id=overlay".to_string()));
+        assert!(
+            !args
+                .iter()
+                .any(|arg| arg.contains("id=overlay,readonly=on"))
+        );
+        assert!(args.contains(&"virtio-blk-pci,drive=overlay".to_string()));
+    }
 }
diff --git 
a/crates/openshell-server/src/cli.rs b/crates/openshell-server/src/cli.rs index ccc08cf2b..a99311f8f 100644 --- a/crates/openshell-server/src/cli.rs +++ b/crates/openshell-server/src/cli.rs @@ -173,6 +173,14 @@ struct Args { )] vm_mem_mib: u32, + /// Writable overlay disk size for each VM sandbox, in MiB. + #[arg( + long, + env = "OPENSHELL_VM_OVERLAY_DISK_MIB", + default_value_t = VmComputeConfig::default_overlay_disk_mib() + )] + vm_overlay_disk_mib: u64, + /// CA certificate installed into VM sandboxes for gateway mTLS. #[arg(long, env = "OPENSHELL_VM_TLS_CA")] vm_tls_ca: Option<PathBuf>, @@ -411,6 +419,7 @@ async fn run_from_args(args: Args) -> Result<()> { krun_log_level: args.vm_krun_log_level, vcpus: args.vm_vcpus, mem_mib: args.vm_mem_mib, + overlay_disk_mib: args.vm_overlay_disk_mib, guest_tls_ca: args.vm_tls_ca, guest_tls_cert: args.vm_tls_cert, guest_tls_key: args.vm_tls_key, diff --git a/crates/openshell-server/src/compute/vm.rs b/crates/openshell-server/src/compute/vm.rs index 1e62d4942..043b222c2 100644 --- a/crates/openshell-server/src/compute/vm.rs +++ b/crates/openshell-server/src/compute/vm.rs @@ -81,6 +81,9 @@ pub struct VmComputeConfig { /// Default memory allocation for VM sandboxes, in MiB. pub mem_mib: u32, + /// Writable overlay disk size for each VM sandbox, in MiB. + pub overlay_disk_mib: u64, + /// Host-side CA certificate for the guest's mTLS client bundle. pub guest_tls_ca: Option<PathBuf>, @@ -116,6 +119,12 @@ impl VmComputeConfig { 2048 } + /// Default writable overlay disk size, in MiB. 
+ #[must_use] + pub const fn default_overlay_disk_mib() -> u64 { + 4096 + } + #[must_use] fn default_driver_search_dirs(home: Option<PathBuf>) -> Vec<PathBuf> { let mut dirs = Vec::new(); @@ -138,6 +147,7 @@ impl Default for VmComputeConfig { krun_log_level: Self::default_krun_log_level(), vcpus: Self::default_vcpus(), mem_mib: Self::default_mem_mib(), + overlay_disk_mib: Self::default_overlay_disk_mib(), guest_tls_ca: None, guest_tls_cert: None, guest_tls_key: None, @@ -457,6 +467,9 @@ pub async fn spawn( .arg(vm_config.krun_log_level.to_string()); command.arg("--vcpus").arg(vm_config.vcpus.to_string()); command.arg("--mem-mib").arg(vm_config.mem_mib.to_string()); + command + .arg("--overlay-disk-mib") + .arg(vm_config.overlay_disk_mib.to_string()); if let Some(tls) = guest_tls_paths { command.arg("--guest-tls-ca").arg(tls.ca); command.arg("--guest-tls-cert").arg(tls.cert); diff --git a/docs/reference/sandbox-compute-drivers.mdx b/docs/reference/sandbox-compute-drivers.mdx index 5fea1b6ae..a0e63c918 100644 --- a/docs/reference/sandbox-compute-drivers.mdx +++ b/docs/reference/sandbox-compute-drivers.mdx @@ -73,17 +73,20 @@ MicroVM-backed sandboxes run inside VM-backed isolation instead of a container b The gateway uses the VM compute driver to create VM-backed sandboxes. MicroVM requires host virtualization support. It uses [libkrun](https://github.com/containers/libkrun) with Apple's [Hypervisor framework](https://developer.apple.com/documentation/hypervisor) on macOS, KVM on Linux, and [QEMU](https://www.qemu.org/) for GPU-backed sandboxes on Linux. +The VM driver prepares an immutable ext4 root disk image from the selected sandbox image and caches it by image identity. Each sandbox boots that cached base disk read-only and receives its own writable `overlay.ext4` disk for `/`, including `/sandbox` writes and runtime TLS material. The overlay persists for the sandbox lifetime and is deleted with the sandbox state directory. 
+ For maintainer-level implementation details, refer to the [VM driver README](https://github.com/NVIDIA/OpenShell/blob/main/crates/openshell-driver-vm/README.md). | Option | Environment variable | Description | |---|---|---| | `--drivers vm` | `OPENSHELL_DRIVERS=vm` | Select the VM compute driver. VM is never auto-detected. | | `--driver-dir <dir>` | `OPENSHELL_DRIVER_DIR` | Search a custom directory for `openshell-driver-vm`. | -| `--vm-driver-state-dir <dir>` | `OPENSHELL_VM_DRIVER_STATE_DIR` | Store VM rootfs, console logs, runtime state, image-rootfs cache, and the private `run/compute-driver.sock` socket under this directory. | +| `--vm-driver-state-dir <dir>` | `OPENSHELL_VM_DRIVER_STATE_DIR` | Store VM overlay disks, console logs, runtime state, image-rootfs cache, and the private `run/compute-driver.sock` socket under this directory. | | `--vm-driver-vcpus <n>` | `OPENSHELL_VM_DRIVER_VCPUS` | Set the default vCPU count for VM sandboxes. | | `--vm-driver-mem-mib <mib>` | `OPENSHELL_VM_DRIVER_MEM_MIB` | Set the default memory allocation for VM sandboxes in MiB. | +| `--vm-overlay-disk-mib <mib>` | `OPENSHELL_VM_OVERLAY_DISK_MIB` | Set the sparse writable overlay disk size for each VM sandbox in MiB. | | `--vm-krun-log-level <level>` | `OPENSHELL_VM_KRUN_LOG_LEVEL` | Set the libkrun log level for VM helper processes. | -| `--vm-tls-ca`, `--vm-tls-cert`, `--vm-tls-key` | `OPENSHELL_VM_TLS_CA`, `OPENSHELL_VM_TLS_CERT`, `OPENSHELL_VM_TLS_KEY` | Copy sandbox client TLS materials into VM guests for mTLS callback to the gateway. | +| `--vm-tls-ca`, `--vm-tls-cert`, `--vm-tls-key` | `OPENSHELL_VM_TLS_CA`, `OPENSHELL_VM_TLS_CERT`, `OPENSHELL_VM_TLS_KEY` | Store sandbox client TLS materials in the VM overlay for mTLS callback to the gateway. | The gateway starts `openshell-driver-vm` over a private Unix socket and passes its process ID so the driver can reject unexpected local clients. The driver's standalone TCP listener is disabled unless `--allow-unauthenticated-tcp` is set for local development. 
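The `--vm-overlay-disk-mib` option above sizes a *sparse* disk: the overlay's logical size can be large while consuming almost no host storage until the guest actually writes. A minimal sketch of that allocation technique using only the Rust standard library (the helper name and demo path are illustrative, not the driver's actual code):

```rust
use std::fs::File;
use std::path::Path;

/// Illustrative sketch: allocate a sparse backing file of `size_mib` MiB.
/// On filesystems with sparse-file support (ext4, XFS, APFS), `set_len`
/// extends the logical size without allocating data blocks, so a 4096 MiB
/// overlay costs next to nothing on disk until it is written to.
fn create_sparse_overlay(path: &Path, size_mib: u64) -> std::io::Result<File> {
    let file = File::create(path)?; // truncates any previous overlay
    file.set_len(size_mib * 1024 * 1024)?;
    Ok(file)
}

fn main() -> std::io::Result<()> {
    // Use a tiny 4 MiB demo size rather than the 4096 MiB default.
    let path = std::env::temp_dir().join("overlay-demo.img");
    create_sparse_overlay(&path, 4)?;
    let logical = std::fs::metadata(&path)?.len();
    println!("logical size: {logical} bytes"); // prints "logical size: 4194304 bytes"
    std::fs::remove_file(&path)?;
    Ok(())
}
```

Note the gap between logical and physical size: `du` on the demo file would report roughly zero blocks until data is written, which is why a generous default such as 4096 MiB is cheap.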
diff --git a/e2e/rust/e2e-vm.sh b/e2e/rust/e2e-vm.sh index a4afb16d5..0330579d5 100755 --- a/e2e/rust/e2e-vm.sh +++ b/e2e/rust/e2e-vm.sh @@ -56,9 +56,9 @@ DRIVER_BIN="${ROOT}/target/debug/openshell-driver-vm" STATE_DIR_ROOT="/tmp" # Smoke test timeouts. First boot extracts the embedded libkrun runtime -# (~60-90MB of zstd per architecture) and prepares a sandbox rootfs from the -# configured image. The guest then starts the sandbox supervisor directly; a -# cold microVM is typically ready within ~15s after image preparation. +# (~60-90MB of zstd per architecture) and prepares an ext4 root disk from the +# configured image. The guest then starts the sandbox supervisor directly; a cold +# microVM is typically ready within ~15s after image preparation. GATEWAY_READY_TIMEOUT=60 SANDBOX_PROVISION_TIMEOUT=180 @@ -104,7 +104,7 @@ s.close()')" # Per-run state dir so concurrent e2e runs don't collide on the UDS or # sandbox state. The VM driver creates `<state-dir>/compute-driver.sock` -# and `<state-dir>/sandboxes/<id>/rootfs/` under here. Keep the +# and `<state-dir>/sandboxes/<id>/overlay.ext4` under here. Keep the # basename short — see the SUN_LEN comment above. RUN_STATE_DIR="${STATE_DIR_ROOT}/os-vm-e2e-${HOST_PORT}-$$" mkdir -p "${RUN_STATE_DIR}" @@ -147,7 +147,7 @@ cleanup() { rm -f "${GATEWAY_LOG}" 2>/dev/null || true # Only wipe the per-run state dir on success. On failure, leave it for - # post-mortem (serial console logs, gvproxy logs, rootfs dumps). + # post-mortem (serial console logs, gvproxy logs, root disk images). if [ "${exit_code}" -eq 0 ]; then rm -rf "${RUN_STATE_DIR}" 2>/dev/null || true else @@ -219,11 +219,12 @@ echo "==> Gateway ready after ${elapsed}s" # metadata lookup needed when TLS is disabled. 
export OPENSHELL_GATEWAY_ENDPOINT="http://127.0.0.1:${HOST_PORT}" +export OPENSHELL_E2E_EXPECT_VM_OVERLAY=1 -# The VM driver creates each sandbox VM from scratch — the embedded -# rootfs is extracted per sandbox, and the guest's sandbox supervisor -# then initializes policy, netns, Landlock, and sshd. On a cold host -# this is ~15s; allow 180s for slower CI runners. +# The VM driver creates each sandbox VM from a cached read-only ext4 root disk +# plus a writable overlay disk. The guest's sandbox supervisor then initializes +# policy, netns, Landlock, and sshd. On a cold host this is ~15s after image +# preparation; allow 180s for slower CI runners. export OPENSHELL_PROVISION_TIMEOUT="${SANDBOX_PROVISION_TIMEOUT}" echo "==> Running e2e smoke test (endpoint: ${OPENSHELL_GATEWAY_ENDPOINT})" diff --git a/e2e/rust/tests/smoke.rs b/e2e/rust/tests/smoke.rs index 172afa22b..c27255e5e 100644 --- a/e2e/rust/tests/smoke.rs +++ b/e2e/rust/tests/smoke.rs @@ -68,6 +68,10 @@ async fn gateway_smoke() { sb.create_output, ); + if std::env::var_os("OPENSHELL_E2E_EXPECT_VM_OVERLAY").is_some() { + assert_vm_overlay_root(&sb.name).await; + } + // ── 3. Verify the sandbox appeared in the list ─────────────────── let mut list_cmd = openshell_cmd(); list_cmd @@ -95,3 +99,41 @@ async fn gateway_smoke() { // ── 4. 
Cleanup ─────────────────────────────────────────────────── sb.cleanup().await; } + +async fn assert_vm_overlay_root(sandbox_name: &str) { + let script = concat!( + "set -eu; ", + "test \"$(stat -f -c %T /)\" = \"overlayfs\"; ", + "printf \"overlay-write\\n\" > /sandbox/overlay-check; ", + "test \"$(cat /sandbox/overlay-check)\" = \"overlay-write\"; ", + "if [ -e /opt/openshell/tls/tls.key ]; then ", + "test \"$(stat -c %a /opt/openshell/tls/tls.key)\" = \"600\"; ", + "fi; ", + "echo vm-overlay-ok", + ); + + let mut exec_cmd = openshell_cmd(); + exec_cmd + .args(["sandbox", "exec", "--name", sandbox_name, "--no-tty", "--"]) + .arg("sh") + .arg("-lc") + .arg(script) + .stdout(Stdio::piped()) + .stderr(Stdio::piped()); + + let output = exec_cmd + .output() + .await + .expect("failed to run VM overlay assertion"); + let combined = strip_ansi(&format!( + "{}{}", + String::from_utf8_lossy(&output.stdout), + String::from_utf8_lossy(&output.stderr), + )); + + assert!( + output.status.success() && combined.contains("vm-overlay-ok"), + "VM overlay assertion failed (status {:?}):\n{combined}", + output.status.code(), + ); +}
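The `gvproxy_socket_base(root_disk)` change earlier in this diff keys the gvproxy socket path on a hash of the root disk path via `hash_path_id`, so the AF_UNIX path stays short regardless of how deep the state directory is (the SUN_LEN concern the e2e script mentions). A sketch of that idea with std only; the helper body is hypothetical (the driver's real `hash_path_id` is not shown in this patch), and std's `DefaultHasher` is not guaranteed stable across Rust releases, so a production driver would pin a concrete hash function:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::{Path, PathBuf};

// Hypothetical stand-in for the driver's `hash_path_id`: map an arbitrarily
// long disk path to a short, filename-safe id. NOTE: DefaultHasher (SipHash)
// is not stable across Rust releases; this is a sketch, not the real code.
fn hash_path_id(path: &Path) -> String {
    let mut hasher = DefaultHasher::new();
    path.hash(&mut hasher);
    format!("{:016x}", hasher.finish()) // always 16 hex chars
}

fn gvproxy_socket_base(base_dir: &Path, root_disk: &Path) -> PathBuf {
    base_dir.join(hash_path_id(root_disk))
}

fn main() {
    let base = Path::new("/tmp/osd-gv");
    let a = gvproxy_socket_base(base, Path::new("/deep/state/dir/sandboxes/a/rootfs.ext4"));
    let b = gvproxy_socket_base(base, Path::new("/deep/state/dir/sandboxes/b/rootfs.ext4"));
    // Fixed-width ids: the socket path length no longer depends on the
    // state-dir depth, and distinct disks get distinct socket bases.
    assert_eq!(a.file_name().unwrap().len(), 16);
    assert_ne!(a, b);
    println!("ok: {} vs {}", a.display(), b.display());
}
```

The design point is that the socket lives under a short fixed prefix (`osd-gv` in the driver) while still being deterministic per disk, so concurrent sandboxes never collide on a socket name.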