Onboarding audit: cut 'time to first dstack app' from ~22 steps to a single script #699

## The problem

If you have a TDX host and want to run your first dstack app, the README says "deploy your own" and links to the deployment guide. Following that guide end-to-end takes around **22 ordered steps across two repos** before a `docker-compose.yaml` is reachable in a browser.

The first time I walked through it I had:

- Three terminals open (vmm, auth-simple, log tail)
- One browser tab on the KMS bootstrap page
- One text editor on `auth-config.json` that I edited four separate times
- A Cloudflare API token I had to provision because the gateway needs DNS-01 ACME
- A domain I had to own because the gateway URL pattern is `<id>-<port>.gateway.<your-domain>`

We can do better. This issue is about figuring out how.

## What's painful, concretely

A few moments where I sat there and thought "this should not be a step":

**The chicken-and-egg dance with the KMS allowlist.** On a fresh `auth-config.json`, `kms.mrAggregated` is empty, and `auth-simple` denies all KMS boots against an empty allowlist — but you can't know the value until the KMS CVM has booted. So the documented procedure is: deploy the KMS, watch bootstrap fail, call `Onboard.GetAttestationInfo` to read the hash, paste it into `auth-config.json`, retry. For a single-node KMS this gate is near-circular — it's the operator's KMS proving its own measurement to an allowlist the same operator just wrote — and removing it is a config default, not a hard problem (see the KMS section below).

**The "exit 1, edit the .env, run me again" pattern.** Three different scripts do this: `kms/dstack-app/deploy-simple.sh`, `gateway/dstack-app/deploy-to-vmm.sh`, `gateway/dstack-app/bootstrap-cluster.sh`. On the happy path that's three re-invocations just to supply values that could have been CLI flags or interactive prompts.

**Copy-paste a hash, then type `y`.** When you deploy the gateway, the script prints a compose hash, asks `Continue? [y/N]`, and the right answer is *not yet* — you're supposed to put the hash in `auth-config.json` first, then come back and type `y`. Same dance again when you deploy your first app. The script knows the hash. The auth file is on the same disk. It could just write it.

**Three terminals, no systemd.** vmm, auth-simple, and the log tail all run as foreground processes. There's no installed service, no `systemctl status`, no `journalctl`. Restarts after config edits are manual. Lifecycle is the operator's problem.

**Domain required everywhere.** The KMS bootstrap UI has a required "domain" text input. The gateway needs a real DNS-resolvable wildcard and a Cloudflare token. The app URL pattern bakes the gateway domain in. The only escape hatch today is `--port tcp:0.0.0.0:HOST:CTR`, which works but skips the gateway entirely, gives you plain HTTP, and isn't called out as a real path.

**Restart vmm to pick up new URLs.** `vmm.toml` has `kms_urls` and `gateway_urls`, both read at startup. So the order is: start vmm → deploy KMS → edit vmm.toml → restart vmm → deploy gateway → edit vmm.toml → restart vmm.

**Two repos.** `meta-dstack` owns the host-side build config and the OS image. `dstack` owns the services. For a hoster running a single TDX box, both repos are needed and the cross-references (`build.sh hostcfg`, `digest.txt`, `build-config.sh`) are easy to lose track of.

## What "good" could look like

Two tiers. **Tier 1 gets you to a running app with no domain and no gateway.** Tier 2 adds managed HTTPS + routing as a deliberate, separate step for people who want it.

### Tier 1 — first app (default)

A first-time user on a TDX+SGX host with Docker and nothing else:

```
$ sudo apt install dstack            # package installs binaries + systemd units

$ sudo dstack init
  ✓ checked hardware: Intel TDX + SGX present
  ✓ downloaded guest image (dstack-0.5.7, digest matches)
  ✓ generated vmm.toml, kms.toml, auth-config.json
  ✓ configured single-node KMS auth (no measurement pre-registration needed)
  ✓ started dstack-vmm, dstack-auth-simple, gramine-sealing-key-provider
  ✓ deployed KMS CVM in Local-Key-Provider mode, auto-bootstrapped
  ✓ wired kms_urls, restarted dstack-vmm
  → VMM dashboard: https://<host>:9080 (self-signed — click through the warning)

$ dstack run my-docker-compose.yaml
  ✓ computed compose hash → registered with auth-simple
  ✓ deployed app, waiting for boot
  → App URL: http://<host>:9300/    (direct port mapping — no gateway, no domain)
```

The user owns no domain, holds no Cloudflare token, never opens a browser to bootstrap KMS, never edits a JSON file, never restarts a process by hand, and never sets up a gateway. The full security model still applies — real TDX attestation, KMS in a CVM, real `auth-simple` — we've just stopped asking the user to be the integration glue, and access is via direct host:port mapping instead of gateway routing.

### Tier 2 — managed HTTPS + routing (optional, separate)

When you want pretty `https://<id>-<port>.gateway.<domain>` URLs, automatic Let's Encrypt certs, and load-balanced routing, you set up the gateway as a second step:

```
$ sudo dstack gateway init      # opt-in; this is where a domain + ACME provider come in
```

Keeping the gateway out of the default flow means the headline quickstart never blocks on DNS or a Cloudflare token, and the gateway's genuine complexity (wildcard DNS, ACME, WireGuard) is opt-in rather than mandatory. It also removes the contradiction of "deploy without a domain" — Tier 1 simply has no gateway, so there's no wildcard-hostname-to-resolve problem; it's direct host:port.

## Design decisions

These pin down the rest of the work. The first three are decided; flagging them so reviewers can object.

### 1. Two-tier onboarding: KMS-first, gateway optional

KMS is **not** optional — without it you don't get the real experience (persistent, upgradeable per-app keys derived in a TEE). So Tier 1 is "vmm + single-node KMS + your app via direct port", and the gateway is a separate Tier 2 step. This is the single decision that unblocks "deploy without a domain": the domain/DNS/ACME complexity all lives in the gateway, so making the gateway opt-in makes the domain opt-in for free.

### 2. Hardware: SGX required, fail fast

`dstack init` **refuses to run on a host without SGX** and exits with a clear message. KMS-in-CVM attestation depends on the Gramine SGX sealing key provider; silently degrading to host-mode KMS (no real attestation) would undermine the "full experience" promise. Evaluating on cloud TDX without SGX is explicitly not supported by `dstack init` — that's a conscious trade for not shipping a footgun.
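A minimal sketch of what the fail-fast check could look like, assuming a Linux host where the kernel lists `sgx` among the CPU `flags` in `/proc/cpuinfo`. The function names and the error text are hypothetical, and a real preflight would likely also need to confirm BIOS enablement and the TDX side, which this does not cover.

```rust
/// Returns true if any `flags` line of /proc/cpuinfo-style text lists `flag`
/// as a whole word (so `sgx_lc` does not count as `sgx`).
pub fn has_cpu_flag(cpuinfo: &str, flag: &str) -> bool {
    cpuinfo
        .lines()
        .filter_map(|line| line.split_once(':'))
        .filter(|(key, _)| key.trim() == "flags")
        .any(|(_, value)| value.split_whitespace().any(|f| f == flag))
}

/// Hypothetical `dstack init` preflight: refuse early instead of silently
/// degrading to a host-mode KMS. Only the SGX flag is checked here.
pub fn preflight_sgx(cpuinfo: &str) -> Result<(), String> {
    if has_cpu_flag(cpuinfo, "sgx") {
        Ok(())
    } else {
        Err("SGX not detected; dstack init requires SGX for the Gramine sealing key provider"
            .to_string())
    }
}
```

A caller would pass in the contents of `/proc/cpuinfo` (for example via `std::fs::read_to_string`) and exit non-zero on `Err`, printing the message as-is.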

### 3. TLS: self-signed by default; real domain only in the gateway tier

Self-signed certs are the only honest "no-domain" answer for a server-deployed product. mkcert-style "install the CA in your trust store" is great UX *when the issuer and the browser are on the same machine* — they're not, when dstack runs on a server and you browse from a laptop. We can't paper over that with a script.

- **Tier 1 surfaces** (VMM dashboard, KMS bootstrap, direct-port app access): self-signed cert with the host's IP, hostname, and `localhost` in the SAN. Browser warning; user clicks through. KMS already does exactly this today (`kms/src/onboard_service.rs`). Optional: download the dstack CA and install it on your laptop to silence the warning — documented per-OS, not scripted (the server/laptop gap makes it un-scriptable).
- **Tier 2 (gateway):** real domain + Cloudflare/Route53 token + Let's Encrypt DNS-01 (today's production path, kept as-is).
- **sslip.io** is the no-domain wildcard option for Tier 2 routing: `<id>-<port>.<host-ip>.sslip.io` resolves cleanly without owning DNS. The catch is that it isn't on the Public Suffix List (verified against [publicsuffix.org's PSL](https://publicsuffix.org/list/public_suffix_list.dat)). Let's Encrypt counts certificates per registered domain, so every `*.sslip.io` cert, from every sslip.io user, draws on one shared rate limit. That makes it a documented "works, with a caveat" option, not the recommended default.
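To make the two Tier 2 hostname shapes concrete, here is a small sketch that builds them from the patterns quoted in this issue. The function names are hypothetical, and it assumes an IPv4 host address written in dotted form.

```rust
use std::net::Ipv4Addr;

/// Tier 2 hostname with an operator-owned domain: `<id>-<port>.gateway.<domain>`.
pub fn gateway_host(app_id: &str, port: u16, domain: &str) -> String {
    format!("{app_id}-{port}.gateway.{domain}")
}

/// No-domain variant: `<id>-<port>.<host-ip>.sslip.io`.
/// Assumes an IPv4 host address in dotted form.
pub fn sslip_host(app_id: &str, port: u16, host_ip: Ipv4Addr) -> String {
    format!("{app_id}-{port}.{host_ip}.sslip.io")
}
```

Every name produced by `sslip_host` ends in the same registered domain, which is why the shared rate limit applies no matter which host IP is embedded.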

### 4. Process & packaging: systemd-native, installed as an OS package

The software ships as an OS package (apt/deb, dnf/rpm). The **package** owns installing and removing the binaries and the systemd units — `apt remove dstack` is the uninstall, not a bespoke subcommand:

- `dstack-vmm.service` — main VMM, `Restart=always`, logs via `journalctl -u dstack-vmm`.
- `dstack-auth-simple.service` — auth-simple webhook. Hot-reload of `auth-config.json` already exists.
- `gramine-sealing-key-provider.service` — SGX key provider. (Gateway, in Tier 2, brings its own unit.)

`dstack init` is then *deployment* bootstrap, not software install: generate configs, start services, bring up KMS, wire URLs. Its inverse is a deployment teardown (`dstack destroy` / `reset` — name TBD) that removes CVMs, generated configs, and keys but leaves the software installed. This separation is cleaner than an `init`/`uninstall` pair, which would conflate "install software" with "stand up a deployment".

Second-order wins: the `vmm.toml.kms_urls` restart pain becomes `systemctl restart dstack-vmm` driven by `dstack init` (no human in the loop); unit ordering (`After=`, `Requires=`) replaces the "edit → restart → deploy → edit → restart" dance; "three terminals" → one `journalctl -fu dstack-vmm`. We implicitly punt on non-systemd platforms (musl/Alpine, FreeBSD), which is fine: TDX hosts are almost universally Ubuntu/Debian/Fedora.

## KMS modes — what counts as "the full experience"

Three independent axes get conflated in the docs today, so it's worth naming them.

**Boot mode** (per `kms/README.md`):
- **Non-KMS Mode** — ephemeral per-boot keys, no persistence, no upgrades. `app-id == compose-hash`. Useful as a "show me a CVM run" demo target with zero infra. Not the quickstart bar.
- **Local-Key-Provider Mode** — SGX-sealed keys via Gramine, persistent, but `app-id == compose-hash` so upgrades are awkward. This is how the KMS itself runs.
- **KMS Mode** — full deterministic per-app keys derived from a KMS root, persistent + upgradeable, `app-id == app contract address`. This is what apps should run in.

**Auth backend:**
- **auth-mock** — always allow, demo/testing only.
- **auth-simple** — JSON allowlist, single-operator. Good fit for self-hosters.
- **auth-eth** — on-chain via smart contracts. The decentralized-governance path.

**Where the KMS runs:**
- **On the host** — current "dev deployment". No Gramine, no SGX, no real attestation of the KMS itself. Marked as "no security guarantees" in the docs.
- **In a CVM with Gramine** — current "production deployment". Real TDX attestation of the KMS. Requires SGX BIOS + the Gramine sealing key provider.

**Quickstart target:** KMS Mode (for apps) + KMS-in-CVM with Gramine (for the KMS itself) + auth-simple (for governance). That gives the full security story minus the blockchain — real attestation, real per-app key derivation, real upgrade path, single-operator authorization via JSON.

`auth-eth` is documented as the upgrade path: same `dstack init` flow, swap the auth backend. Non-KMS mode and Local-Key-Provider-for-apps stay as advanced examples in the docs, not as quickstart options.

**One config default makes the single-node path clean: `enforce_self_authorization = false`.** This is the fix for the chicken-and-egg above, and it's worth understanding precisely. With the default (`true`), the KMS self-attests to its own auth API before it will bootstrap: it builds its own boot info (`local_kms_boot_info`, which includes `mr_aggregated`) and POSTs it to `bootAuth/kms`, which `auth-simple` rejects until `kms.mrAggregated` is populated (`kms/src/main_service/upgrade_authority.rs:218`, `kms/auth-simple/index.ts:120`). For a single operator who owns the auth config, this self-gate is near-circular and buys nothing — turning it off lets the KMS bootstrap immediately with no measurement pre-registration. Crucially, this changes **only** the KMS proving *itself*:

- App authorization is untouched — apps still go through `bootAuth/app`, which checks the compose hash, not `mrAggregated`.
- App attestation is untouched — `GetAppKey`/`SignCert` still verify each requesting app's own TDX quote.
- The `mrAggregated` allowlist only does real work when a second KMS node replicates from the first (`ensure_kms_allowed`), which a single-node quickstart never does.
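The chicken-and-egg can be stated as a tiny decision function. This is a simplified model for discussion, not the code in `upgrade_authority.rs` or `auth-simple`: the function name is hypothetical, and `allowlist` stands in for `kms.mrAggregated` in `auth-config.json`.

```rust
/// Simplified model of the KMS self-check that runs before bootstrap.
/// `allowlist` stands in for `kms.mrAggregated` in `auth-config.json`.
pub fn kms_self_check_passes(
    enforce_self_authorization: bool,
    allowlist: &[&str],
    own_mr_aggregated: &str,
) -> bool {
    if !enforce_self_authorization {
        // Gate disabled: the KMS does not ask the auth API about itself.
        return true;
    }
    // Gate enabled: an empty allowlist rejects every measurement,
    // including the one the operator has not been able to read yet.
    allowlist.iter().any(|allowed| *allowed == own_mr_aggregated)
}
```

With the gate enabled and a fresh, empty allowlist the function can only return `false`, which is the documented "watch bootstrap fail" step. Nothing in this model touches app authorization, matching the bullets above.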

(For the record: `mr_aggregated` is `SHA256(mr_td ‖ rtmr0 ‖ rtmr1 ‖ rtmr2 ‖ rtmr3)` — `dstack-attest/src/attestation.rs:766` — so it can't simply be precomputed from `dstack-mr`, which only emits MRTD + RTMR0–2. RTMR3 folds in runtime measurements including the Gramine key-provider's MRENCLAVE. That's another reason to drop the gate for single-node rather than try to predict the value.)

The honest cost of picking "KMS-in-CVM + Gramine" as the default is that Gramine setup is itself painful today (see `docs/tutorials/gramine-key-provider.md`). If `dstack init` doesn't automate it, we've moved the friction rather than fixed it. So Gramine bring-up — pulling and starting `gramine-sealing-key-provider` as a systemd unit, pointing KMS at it, verifying — has to be part of `dstack init`.

## How to get there

Roughly three layers, in priority order. Layers 1–2 are Tier 1 (the headline goal); layer 3 is the optional gateway tier.

**1. The `dstack` CLI + `dstack init`.** A new Rust crate produces a single user-facing `dstack` binary (`init`, `run`, `ls`, `logs`, `destroy`, …) that talks to the VMM over the existing prpc interface (reusing `http-client` and the `*-rpc` proto crates, so not a rewrite) and supersedes `vmm-cli.py`. **Rust, not Python:** the repo is already almost entirely Rust, a static binary drops straight into the OS package next to `dstack-vmm`, and it removes the Python-on-the-host friction (venv, system-Python pinning) that a self-hoster hits today.

`dstack init` is the bring-up:

- render `vmm.toml`/`kms.toml`/`auth-config.json` from a few inputs (host IP, image version, mode flags);
- refuse on non-SGX hosts;
- start the systemd units;
- generate the single-node `kms.toml` with `enforce_self_authorization = false` so bootstrap doesn't gate on a measurement the operator hasn't seen yet;
- auto-derive the KMS bootstrap domain (host IP or `localhost`) so the browser step disappears (`kms/src/onboard_service.rs:367`).

Re-runs are idempotent.
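One way to think about idempotent re-runs of `dstack init` is as a fixed plan where each run only executes the steps whose effect is not yet in place. The sketch below is purely illustrative: the step names are made up from the transcript earlier in this issue, and how "done" is detected (files on disk, unit status, a VMM query) is left open.

```rust
/// Bring-up steps, in order. The names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Step {
    RenderConfigs,
    StartServices,
    DeployKms,
    WireKmsUrls,
}

/// The full plan for a fresh host.
pub const PLAN: [Step; 4] = [
    Step::RenderConfigs,
    Step::StartServices,
    Step::DeployKms,
    Step::WireKmsUrls,
];

/// Returns the steps still to run, preserving plan order.
/// `done` lists steps whose effect was detected as already in place.
pub fn remaining(done: &[Step]) -> Vec<Step> {
    PLAN.iter().copied().filter(|s| !done.contains(s)).collect()
}
```

A run interrupted after the services came up would resume at `DeployKms`, and a run on a fully initialized host would have nothing left to do.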

**2. `dstack run` — eliminate the manual hash dance.** `dstack run <compose>` wraps compose + register + deploy into one step: compute the compose hash, write it into `auth-config.json` for you, deploy, and expose the app via direct port mapping. The "exit 1, edit .env, re-run" pattern in the deploy scripts goes away — they fold into `dstack` subcommands.
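The "write it into `auth-config.json` for you" step mostly comes down to an idempotent insert into an allowlist. A rough sketch, assuming the hash is already computed and the list has been loaded from the config; the function name is hypothetical, JSON reading and writing is omitted, and treating hex hashes as case-insensitive is an assumption rather than something `auth-simple` is known to do.

```rust
/// Adds `compose_hash` to the app allowlist unless it is already present.
/// Returns true when the list changed, so the caller knows whether
/// `auth-config.json` needs to be rewritten.
/// Assumption: hashes are hex strings compared case-insensitively.
pub fn register_compose_hash(allowlist: &mut Vec<String>, compose_hash: &str) -> bool {
    let normalized = compose_hash.trim().to_ascii_lowercase();
    if allowlist.iter().any(|h| h.eq_ignore_ascii_case(&normalized)) {
        return false;
    }
    allowlist.push(normalized);
    true
}
```

Returning whether the list changed lets a repeated `dstack run` on the same compose file skip the write entirely, which keeps re-deploys from touching the auth config.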

**3. Tier-2 gateway as an opt-in step.** `dstack gateway init` brings up the gateway CVM for people who want managed HTTPS + routing. Gateway grows a self-signed no-domain mode (mirroring KMS) for local use; the real-domain → Let's Encrypt DNS-01 path stays for production; sslip.io is the documented no-domain wildcard option (with the shared-rate-limit caveat). All the domain/DNS/ACME complexity is confined here.

## Open questions

- **meta-dstack consolidation.** For a self-host quickstart, the host-side artifacts (`vmm.toml` template, systemd units, OS image tarball) could ship inside the `dstack` OS package rather than requiring a second repo clone + build. Worth doing in this issue's scope, or separate?
- **Teardown semantics.** What does `dstack destroy`/`reset` remove by default — CVMs + configs but keep KMS keys (so you can re-init against the same identity), or wipe everything? Probably a safe default + a `--purge` flag.
- **`vmm-cli.py` transition.** Hard-replace it with the `dstack` binary, or wrap it during a deprecation window? It has real users and scripts depending on it today.

Happy to break this into separate issues once we agree on direction. Filing it as one umbrella so the conversation about priorities can happen in one place.

