demo: lakebox #5223

Draft
pietern wants to merge 38 commits into main from demo-lakebox

Conversation


@pietern pietern commented May 8, 2026

Continuation of #4930.

shuochen0311 and others added 16 commits April 10, 2026 18:20
Lakebox provides SSH-accessible development environments backed by
microVM isolation. This adds CLI commands for lifecycle management:

- `lakebox auth login` — authenticate to a Databricks workspace
- `lakebox create` — create a new lakebox (with optional SSH public key)
- `lakebox list` — list your lakeboxes (shows status, key hash, default)
- `lakebox ssh` — SSH to your default lakebox (or create one on first use)
- `lakebox status <id>` — show lakebox details
- `lakebox delete <id>` — delete a lakebox
- `lakebox set-default <id>` — change the default lakebox

Features:
- Default lakebox management stored at ~/.databricks/lakebox.json per profile
- Automatic SSH config management (~/.ssh/config)
- Public key auth only (password/keyboard-interactive disabled in SSH config)
- Creates and sets default on first `lakebox ssh` if none exists

- Remove PubkeyHashPrefix field from lakeboxEntry (no longer returned by API)
- Remove KEY column from list output
- Remove Key line from status output
- Add register-key subcommand for SSH public key registration

Co-authored-by: Isaac
…rites

- Add 'register' command: generates ~/.ssh/lakebox_rsa and registers with API
- Remove 'register-key' command (replaced by 'register')
- Remove 'login' command (use 'auth login' + 'register' separately)
- SSH command passes options directly as args instead of writing ~/.ssh/config
- Check for ssh-keygen availability with helpful install instructions

Co-authored-by: Isaac
- Hook into auth login PostRun to auto-generate ~/.ssh/lakebox_rsa and
  register it after OAuth completes
- Fix hook: match on sub.Name() not sub.Use (Use includes args)
- Export EnsureAndReadKey and RegisterKey for use by auth hook
- Update help text

Co-authored-by: Isaac
Everything after -- is passed directly to the ssh process, enabling:
  lakebox ssh -- echo hello          # run command and return
  lakebox ssh <id> -- cat /etc/os-release
  lakebox ssh -- -L 8080:localhost:8080  # port forwarding

Co-authored-by: Isaac
After 'lakebox auth login --host <url>', the post-login hook now
constructs the workspace client directly from the --host/--profile
flags instead of using MustWorkspaceClient (which started with an
empty config and fell back to the DEFAULT profile).

All lakebox commands now use a mustWorkspaceClient wrapper that reads
the last-login profile from ~/.databricks/lakebox.json, so 'lakebox ssh'
uses the correct profile without requiring --profile on every invocation.

Also adds install.sh and upload.sh scripts.
Fix workspace client init after login, persist last profile
Merge kelvich's workspace client fix. Add -- passthrough support so
extra args (remote commands, port forwarding, ssh flags) are passed
directly to the ssh process.

Co-authored-by: Isaac
Single cyan accent color throughout. Bold for IDs, dim for metadata.
Braille spinner with elapsed time during async operations.

- create: animated spinner during provisioning
- list: aligned columns with colored status, cyan bold for running
- status: clean field layout
- delete: spinner during removal
- ssh: spinner during connection
- register: spinner during key registration
- Shared ui.go with all primitives

Co-authored-by: Isaac
The lakebox manager moved its REST surface to a proto-defined service with
JSON transcoding (databricks-eng/universe#1839855 + follow-ups). That
changed three things this CLI was depending on:

1. JSON field name: each Lakebox message now serializes as `lakeboxId`
   (proto3 lowerCamelCase default), not `name`. List/status/create were
   parsing into a `Name string` field tagged `json:"name"` and silently
   getting the empty string for every entry — the visible symptom was
   `lakebox list` showing rows with blank ID columns.

2. Status codes: proto-transcoded handlers return 200 OK uniformly. The
   CLI was checking 201 Created on POST /api/2.0/lakebox and 204
   NoContent on DELETE, both of which now look like errors.

3. Key registration moved to its own top-level collection at
   /api/2.0/lakebox-keys (was /api/2.0/lakebox/register-key), to avoid a
   path collision with /api/2.0/lakebox/{lakebox_id}.

Drop the now-unused `extractLakeboxID` helper — the wire field is the
customer-facing ID directly.

Verified against dev-aws-us-west-2: list, status, create, delete all
work end-to-end. register hits a separate manager-side issue (stale
UserKey records in TiDB that the new schema can't deserialize) — not
fixed here.

Co-authored-by: Isaac
Reynold's restructure (databricks-eng/universe#1874214) nested the two
lakebox resources under the service namespace — moving sandboxes from
/api/2.0/lakebox to /api/2.0/lakebox/sandboxes and SSH keys from
/api/2.0/lakebox-keys to /api/2.0/lakebox/ssh-keys — and renamed the
resource type from Lakebox to Sandbox, which surfaces on the wire as
sandboxId / sandboxes (was lakeboxId / lakeboxes).

CLI still pointed at the old paths and decoded the old field names, so
list / status / create returned empty IDs and 404s. Fix both endpoint
constants, rename the request/response types and fields to match the
proto, and update the four call sites in create / list / ssh / status.
User-facing copy ("Lakebox …") is unchanged — the product is still
Lakebox; only the resource type was renamed.

Verified end-to-end against dev-aws-us-west-2: create / list / status
/ delete all work; ssh passthrough works.

Co-authored-by: Isaac
Surfaces the new per-sandbox auto-stop knobs the manager added
(databricks-eng/universe#1875183) so users can see at a glance how long
their sandbox will live before the watchdog reaps it.

- `sandboxEntry` gains pointer fields `IdleTimeoutSecs` and `Persist` so
  we keep the proto3 explicit-presence semantics ("not in response" vs
  "explicitly set to 0 / false").
- `autoStopLabel()` collapses the policy to one short token:
  - `persist == true` → `never`
  - `idle_timeout_secs > 0` → compact duration (`90s`, `15m`, `2h`,
    `1h30m`)
  - otherwise → the manager's global default (10m), rendered
    explicitly so the column never says `default`
- `lakebox list` adds an AUTOSTOP column between STATUS and DEFAULT.
- `lakebox status` adds an `autostop` field after `fqdn`.

Verified end-to-end against dev-aws-us-west-2 — list and status both
render `10m` for sandboxes with no per-record override.

Co-authored-by: Isaac
Surfaces the per-sandbox auto-stop knobs the manager added in
databricks-eng/universe#1875183 so users can flip them from the CLI
instead of curl + JSON.

  lakebox config <id> --idle-timeout 15m       # 15-minute timeout
  lakebox config <id> --idle-timeout 1h30m     # any Go duration
  lakebox config <id> --idle-timeout 0         # clear → manager default
  lakebox config <id> --persist                # never auto-stop
  lakebox config <id> --persist=false          # back to timeout path
  lakebox config <id> --idle-timeout 30m --persist=false   # combined

Implementation notes:

- `updateBody` is the inner Sandbox sent in the PATCH body. The proto's
  `(google.api.http)` declares `body: "sandbox"`, so the HTTP body is
  the inner `Sandbox` message, NOT a `{"sandbox": {...}}` envelope.
  First wired-up version got this wrong and the manager rejected with
  "unknown field `sandbox`" — kept the type comment to flag the gotcha
  for the next reader.
- `IdleTimeoutSecs` carries `,string` JSON tag because proto3 JSON
  canonical form serializes int64 as a quoted string. The manager
  accepts both bare-number and quoted-string on input but always
  emits quoted on output, so without the tag we hit "cannot unmarshal
  string into Go struct field … int64" on the response read-back.
- Pointer fields (`*int64`, `*bool`) carry proto3 explicit-presence
  through to the wire — only the flags the user actually passed get
  emitted, so a `--persist`-only invocation does not clobber an
  existing idle_timeout (and vice-versa).
- Client-side range pre-flight (`[60s, 86400s]` plus the 0 clear
  sentinel) mirrors the manager's `MIN_IDLE_TIMEOUT_SECS` /
  `MAX_IDLE_TIMEOUT_SECS` constants so users get a clearer error
  than the server's `INVALID_ARGUMENT`.

Verified end-to-end against dev-aws-us-west-2:
  config --idle-timeout 15m  → status shows `15m`
  config --persist           → status shows `never`
  config --idle-timeout 0 --persist=false → status shows `10m`

Co-authored-by: Isaac
Tracks the matching rename in the lakebox manager
(databricks-eng/universe#1875183 follow-up). The manager-side flag
moved from `persist` to `no_autostop` because the original name
conflicted with the storage-persistence concept already in this
codebase.

CLI changes:
  --persist          → --no-autostop
  --persist=false    → --no-autostop=false

Plus a help-text note on the manager's new auto-clear behavior:
setting `--idle-timeout` to a non-zero value in a follow-up call
clears `--no-autostop` automatically, on the assumption that the
caller wants timeout-based stopping back. The CLI itself does not
need any extra logic for this — the manager handles it server-side
based on field presence in the PATCH body, and the CLI's existing
"omit unset flags from the wire payload" semantics (proto3
explicit-presence via *bool / *int64) feed straight into that.

Verified the marshal output matches what the new manager expects:
  --no-autostop      → {"sandbox_id":"x","no_autostop":true}
  --idle-timeout 15m → {"sandbox_id":"x","idle_timeout_secs":"900"}
  no flags           → {"sandbox_id":"x"}                      (rejected)

End-to-end against staging blocked until the manager PR rolls out.

Co-authored-by: Isaac
Tracks the matching change in the lakebox manager
(databricks-eng/universe#1875183) which moved the per-sandbox idle
timeout off `optional int64 idle_timeout_secs = 7` and onto
`optional google.protobuf.Duration idle_timeout = 7`. Drops the
sentinel-overloaded int64 in favor of a duration-typed field.

Wire shape:
- Response field is now `idleTimeout` carrying a proto3-canonical
  Duration string (e.g. `"900s"`); parsed into seconds via
  `time.ParseDuration` for the autostop column.
- Request body sends `idle_timeout` as the same string format.

The CLI flag stays `--idle-timeout` (Go duration string in / Go
duration string out); only the wire encoding changes. `list` and
`status` show the manager's global default for any sandbox whose
per-record value isn't yet visible under the new field name — that's
deliberate forward-compat behavior so an older manager + newer CLI
combination just degrades to showing the default rather than crashing.

Co-authored-by: Isaac
- ssh: auto-pick uw2.s.dbrx.dev when the workspace host has `.staging.` in it,
  otherwise keep using prod uw2.dbrx.dev. `--gateway` still overrides.
- api: when the workspace host carries a `?o=<id>` selector or the SDK config
  has a workspace_id, send `X-Databricks-Org-Id` so multi-workspace gateways
  (dogfood.staging.databricks.com) route the request to the right workspace.
  Without it the gateway rejects PATs with "Credential was not sent or was
  of an unsupported type for this API".

Co-authored-by: Isaac
pietern changed the title from "demo: lakebox subcommand" to "demo: lakebox" on May 8, 2026
pietern added 4 commits May 8, 2026 17:01
…onments

Brings in the original cmd/lakebox/* sources from
#4930 with full commit-history
attribution. Subsequent commits adapt the standalone CLI into a
'databricks lakebox' subcommand, replace hand-rolled HTTP/spinner/color
plumbing with libs primitives, and add unit tests.
Wire the cmd/lakebox tree from #4930 into the main CLI:

 - cmd/cmd.go registers lakebox.New() under the 'development' command
   group alongside bundle and sync.
 - cmd/fuzz_panic_test.go adds 'lakebox' to manualRoots so TestCountFuzz
   doesn't fuzz hand-written commands as if they were auto-generated.
 - cmd/lakebox tree: the original PR's standalone-CLI scaffolding is
   adapted for subcommand use — drop the auth-login hijacking and its
   helper exports, drop the 'last_profile' state field that only mattered
   when lakebox owned the whole CLI, switch PreRunE to root.MustWorkspaceClient
   directly, and update help text from 'lakebox foo' to
   'databricks lakebox foo' throughout.

Also conforms cmd/lakebox to project lint rules: env.UserHomeDir(ctx)
in place of os.UserHomeDir, errors.Is(err, fs.ErrNotExist) instead of
os.IsNotExist, atomic.Bool over sync.Once in the spinner gate, errors.New
for static error strings.

Co-authored-by: Isaac
Replace the hand-rolled braille spinner, TTY detection, and stderr
plumbing with the existing cmdio facilities:

- spin(ctx, msg) wraps cmdio.NewSpinner — capability-aware, runs through
  the same Bubble Tea program slot as other CLI spinners. ok/fail markers
  are logged via cmdio.LogString after Close.
- ok(ctx, ...) and warn(ctx, ...) are now ctx-based and route to stderr
  through cmdio rather than taking a writer. Call sites drop their
  cmd.ErrOrStderr() locals where they were only used for these helpers.
- field/blank still take an io.Writer because callers need to target
  stdout for structured output (list, status, config).

Drops the local isTTY, atomic.Bool spinner gate, and ticker goroutine.

Co-authored-by: Isaac
Drops the cyan/bold/dim/reset constants and the local accent/bold/dim
wrappers in favor of cmdio.Cyan and cmdio.HiBlack, which respect the
SupportsStdoutColor capability check. Bold-for-emphasis is folded into
Cyan since cmdio does not expose a Go-level Bold helper today; visually
this means lakebox IDs and emphasized command names render in cyan
rather than uncolored bold, consistent with the rest of the CLI.

field/status now take a context so they can call cmdio.HiBlack /
cmdio.Cyan; their writer parameter stays for callers that target stdout.

Co-authored-by: Isaac
pietern added 18 commits May 8, 2026 17:02
cmdio.Cyan/HiBlack covered most of lakebox's needs but conflated two
distinct visual roles: bold-for-emphasis (uncolored) on IDs and command
names, and dim (\x1b[2m, faint) on secondary metadata. The previous
commit collapsed both into Cyan/HiBlack and changed the rendering.

Add Bold and Dim helpers alongside the existing color set — ansiBold
already lived in color.go; ansiDim is new. Both gate on the same
SupportsStdoutColor capability check as Red/Green/etc.

Switch lakebox call sites back to Bold for IDs and command emphasis,
and to Dim for secondary text (autostop labels, FQDNs, "(default
cleared)", table headers, the "No lakeboxes found" notice). Running
lakebox IDs in `list` go back to bold-cyan via composition.

Co-authored-by: Isaac
Two parity gaps from the previous commit, found by diffing byte-level
output against the original ui.go:

- status('creating') was bold cyan in the original; the cmdio rewrite
  dropped the bold. Restore via Bold(Cyan(...)) composition. Bytes
  differ from the original ('\x1b[1m\x1b[36m...\x1b[0m\x1b[0m' vs
  '\x1b[36m\x1b[1m...\x1b[0m') but render identically — SGR codes
  are additive, the extra trailing reset is a no-op.

- field() applied %-10s padding to the already-Dim-wrapped label, so
  the SGR escapes inflated the byte count and column alignment broke
  whenever color was enabled. Pad first, then wrap.

Co-authored-by: Isaac
The Unix path used syscall.Exec to replace the Go process with ssh
directly, saving one fork. The Windows path already used
exec.Command(...).Run(), and that works on all platforms — terminal
signals are delivered to ssh via the foreground process group either
way. Collapse to one cross-platform path; drop the build-tagged file
and the runtime.GOOS check.

Co-authored-by: Isaac
libs/execv already wraps the syscall.Exec / Windows-emulation pattern
the previous version reimplemented inline. Switch to it so ssh truly
replaces the CLI process on Unix instead of running as a child — fewer
moving parts when the user hits Ctrl-C, and one fewer Go process in the
ps tree for the lifetime of the session.

Co-authored-by: Isaac
Replace the hand-rolled HTTP plumbing with client.DatabricksClient.Do,
following the pattern in cmd/api/api.go and bundle/deploy/filer.go. Each
method becomes a single Do() call; the SDK handles auth, JSON marshal,
JSON unmarshal, error parsing, and retries.

Removed:
 - doRequest (manual http.NewRequestWithContext + Config.Authenticate)
 - parseAPIError + the local apiError type (SDK returns apierr.APIError)
 - manual json.Marshal / json.NewDecoder.Decode in every method
 - net/http response status-code branching

Preserved:
 - X-Databricks-Org-Id is still injected on every call. The SDK's
   Config.WorkspaceID is the source of truth; we fall back to parsing
   `?o=<id>` off the host because some staging gateways are configured
   that way and the SDK doesn't lift the query into Config.WorkspaceID.

newLakeboxAPI now returns (*lakeboxAPI, error) since client.New can fail
on bad config; callers updated.

Co-authored-by: Isaac
Today if a code path between spin(...) and s.ok/s.fail returns early,
the cmdio Bubble Tea program keeps running and we leak a goroutine
plus garble the terminal. The wrapper kept its own `finished` gate but
exposed no way to close without printing a marker.

Add Close() that stops the spinner with no marker (wired through the
same `finished` gate, so calling Close() after ok/fail is a no-op),
and `defer s.Close()` at every spin site. ok/fail still print the ✓/✗
line on the success/failure paths; Close is just the cleanup safety
net.

Co-authored-by: Isaac
…gate

Define a local cmdioSpinner interface (just Close()) that the unexported
cmdio.spinner type satisfies structurally. Embed it on our wrapper so
spinner.Close comes for free, and drop the func() workaround.

The 'finished' bool was only preventing double-printing the ✓/✗ marker
if a caller called ok/fail twice — caller pilot error rather than a real
hazard, and cmdio's own Close is already idempotent (sync.OnceFunc on
sendQuit), so the gate isn't needed for resource safety. Net effect:
shorter, the embedded Close() is still safe to defer, and double-calls
to ok/fail print twice (which they always should have).

Co-authored-by: Isaac
Add Update(msg string) to the cmdioSpinner interface so callers can
re-suffix the spinner mid-spin without reaching past our wrapper. No
current call site uses it, but it's a free pass-through via embedding
and matches the underlying cmdio API.

Co-authored-by: Isaac
If the default lakebox stored at ~/.databricks/lakebox.json gets removed
out-of-band (auto-stop expiry, admin reap, deletion from another
machine), 'lakebox ssh' would happily try to ssh to it via the gateway
and the user would get a confusing 'Permission denied (publickey)' from
ssh. There was no signal that the default was stale.

api.get the saved default first; if it 404s (or any other error), warn,
clearDefault, and fall through to the existing 'no default → provision a
fresh one' branch. Mirrors the validation already in 'lakebox create'.

Co-authored-by: Isaac
Cover load/save/clear round-trips, missing-file and corrupt-JSON paths,
multi-profile independence, and the legacy 'last_profile' field that
older CLI versions wrote — loadState must accept it (silently dropping
the unknown key) and saveState must rewrite the file without it so it
naturally falls off on the next mutation.

All tests use env.WithUserHomeDir(t.Context(), t.TempDir()) so they
operate on an isolated state file.

Co-authored-by: Isaac
The /api/2.0/lakebox/ssh-keys endpoint identifies registered keys by
hash. Live exploration confirmed the algorithm: sha256 of
'<type> <base64-blob>' (comment stripped) truncated to 16 bytes, hex
encoded — looks like MD5 (32 hex chars) but isn't.

Encode this client-side so we can answer 'is this local key registered?'
without a list call. Tests use the exact hashes captured from the live
API as ground truth, plus an edge case for empty input.

Co-authored-by: Isaac
Two small fixes to the keyHash helper:

- Replace the nested IndexByte calls with a range over strings.SplitSeq
  that breaks after the second token. Tracks a running byte offset so we
  still slice the original string instead of allocating a joined copy.
- Drop the misleading 'without a list call' phrasing. You still need to
  call GET /ssh-keys; the helper just means you can match a local key
  against the listing by hash, without re-uploading the key contents.

Co-authored-by: Isaac
The SplitSeq approach needed a running offset and a 'first iteration?'
guard inside the loop. Walking bytes directly until we see the second
space is shorter and reads more directly: count spaces, slice when the
counter hits 2. Single pass, no allocation.

Co-authored-by: Isaac
The previous wording implied the expected hashes were pulled from real
registered keys returned by the API. They aren't — they're sha256[:16]
of synthetic strings I posted during exploration. The algorithm was
verified live; the test pins the algorithm rather than any specific
captured registration.

Co-authored-by: Isaac
Every case in TestKeyHash already pins an exact 32-char hex string, so a
separate length-only test buys nothing.

Co-authored-by: Isaac
Drop the bespoke resolveWorkspaceID helper and the cached wsID field on
lakeboxAPI. Match the minimal pattern that libs/telemetry, libs/filer,
and SDK-generated workspace services already use: read cfg.WorkspaceID
directly, send the X-Databricks-Org-Id header if set.

Removes the '?o=<id>' fallback that parsed the host's query string.
That behavior was unique to lakebox and inconsistent with how every
other CLI surface handles SPOG hosts; the SDK's host-metadata discovery
populates cfg.WorkspaceID for hosts that need it, and users who run
into edge cases set workspace_id explicitly the same way they would
for `bundle deploy` or `databricks api`.

Adds the auth.WorkspaceIDNone ("none") sentinel strip so a profile
created via `databricks auth login` for SPOG account-level access
doesn't send the literal string "none" as the routing identifier.
This fix matches what cmd/api/api.go (#5137) and libs/auth do; the
four other orgIDHeaders helpers in the codebase still have the latent
bug, which is a separate cleanup.

Co-authored-by: Isaac
setDefault and clearDefault unconditionally rewrote ~/.databricks/
lakebox.json even when the in-memory state was identical to what was
already on disk: clearing a profile that wasn't in the map, or
re-setting the same value. That created or touched the file for
no-op operations.

Add change-detection guards to both: setDefault is a no-op when the
profile already maps to the requested ID; clearDefault is a no-op
when the profile isn't in the map. Result: a CLI invocation that
doesn't change state can no longer cause a file to spring into
existence on a fresh machine.

Tests:
 - clearDefault on a missing profile must leave the file absent
 - setDefault with an unchanged value must not bump the mtime
 - getDefault on a fresh state must not create the file (regression
   test for the read-only path)

Co-authored-by: Isaac
Mark the lakebox parent command as Hidden so it doesn't appear in
'databricks --help' under Developer Tools. The subcommands themselves
are still reachable — 'databricks lakebox --help' lists them — but the
feature stays out of the discoverable surface while it remains
internal.

This also reverts the acceptance/help/output.txt regen from the
previous push, since hiding the command means the golden file already
matches the actual help output.

Co-authored-by: Isaac