add VM-internal failure telemetry (OOM + service crashes) #250
Two new telemetry event types surface VM-level failures alongside the existing browser events:

- `system_oom_kill`: emitted by an in-process /dev/kmsg reader in the api server whenever the kernel OOM-killer terminates a process, including unsupervised Chrome renderer subprocesses.
- `service_crashed`: emitted by a tiny supervisord eventlistener binary that POSTs to the local /telemetry/events endpoint whenever a supervised service unexpectedly exits (PROCESS_STATE_EXITED with expected=0, or PROCESS_STATE_FATAL).

Both events flow through the existing EventStream and inherit the SSE and S2 sinks for free. Categorized as `system` so they're always-on.

The shim is shipped in both the chromium-headful and chromium-headless images and registered as `[eventlistener:supervisord-shim]`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The default supervisord eventlistener buffer is 10. When several supervised services flap in close succession (which is exactly what happens during a real failure cascade), supervisord drops events before the shim has a chance to drain them.
Firetiger deploy monitoring skipped This PR didn't match the auto-monitor filter configured on your GitHub connection:
Reason: PR adds telemetry event handling to kernel-images-api but does not modify API endpoints (packages/api/cmd/api/) or Temporal workflows (packages/api/lib/temporal) as specified in the filter. To monitor this PR anyway, reply with |
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 4ecdc0d. Configure here.
		}
	}

	if _, err := out.WriteString("RESULT 2\nOK\n"); err != nil {
Trailing newline in RESULT breaks supervisord protocol
High Severity
The shim writes "RESULT 2\nOK\n" but the supervisord eventlistener protocol requires "RESULT 2\nOK" — no trailing newline after the data. The official childutils.py send() produces RESULT 2\nOK exactly. The extra \n is left in supervisord's read buffer after it consumes the declared 2 bytes of result data. When the shim next sends READY\n, supervisord's buffer contains \nREADY\n; the first 6 bytes (\nREADY) don't match the expected READY\n token, so supervisord transitions the listener to UNKNOWN state and clears its buffer — discarding the READY\n. Both sides then block waiting for input from each other, permanently deadlocking the listener after the first event.
Additional Locations (1)
	ServiceName string `json:"service_name"`
	FromState   string `json:"from_state"`
	Pid         *int   `json:"pid,omitempty"`
}
Shim uses ad-hoc structs instead of oapi types
Low Severity
The shim defines ad-hoc struct types (telemetryEventBody, telemetryEventSource, serviceCrashedPayload) instead of using generated oapi types like oapi.BrowserServiceCrashedEventData. The rule requires all event producers to use generated oapi types when building payloads to prevent drift between the documented API contract and actual event shapes. If the OpenAPI schema for service_crashed events changes, these duplicated types won't be updated automatically.
Triggered by learned rule: Producer repo must own typed event schemas in OpenAPI and use generated oapi types


Summary
Adds two new telemetry event types so customers see when the VM kills processes or supervised services crash:
- `system_oom_kill`: fires whenever the Linux OOM-killer terminates a process inside the VM. Sourced from `/dev/kmsg` by an in-process goroutine in `kernel-images-api`. Catches Chrome renderer subprocesses that aren't supervised.
- `service_crashed`: fires whenever supervisord reports a `PROCESS_STATE_EXITED` with `expected=0` (or `PROCESS_STATE_FATAL`). Sourced from a tiny shim binary (`kernel-images-supervisord-shim`) that supervisord launches as an eventlistener; the shim POSTs to the existing local `/telemetry/events` endpoint.

Both event categories are `system`, which is always-on regardless of caller telemetry config. Events flow through the existing `EventStream`, so the SSE stream (`GET /telemetry/stream`) and the S2 storage sink inherit them for free.

Why two sources
Coverage matrix:
No de-dup — if both fire for the same kill, that's the signal (RAM exhaustion, not a bug).
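To make the kmsg-based source concrete, here is a rough sketch of how an OOM-kill record might be recognized. It is not the PR's parser: `oomKill`, `parseOOMKill`, and both regular expressions are hypothetical, and the sample line assumes the common kernel wording (`Killed process <pid> (<comm>) total-vm:...`, with `oom_score_adj:` present only on newer kernels). The sample uses a `comm` containing a space, one of the cases the test plan calls out.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// oomKill is a hypothetical result type, not the PR's generated oapi type.
type oomKill struct {
	Pid         int
	Comm        string
	OomScoreAdj *int // nil when the kernel line omits the field
}

var (
	// Greedy (.+) lets comm contain spaces; the match must still be
	// followed by ") total-vm:".
	killedRe = regexp.MustCompile(`Killed process (\d+) \((.+)\) total-vm:`)
	adjRe    = regexp.MustCompile(`oom_score_adj:(-?\d+)`)
)

// parseOOMKill returns ok=false for any record that is not an OOM kill.
func parseOOMKill(record string) (oomKill, bool) {
	// A /dev/kmsg record is "<prefix fields>;<message>"; keep the message.
	if _, msg, found := strings.Cut(record, ";"); found {
		record = msg
	}
	m := killedRe.FindStringSubmatch(record)
	if m == nil {
		return oomKill{}, false
	}
	pid, err := strconv.Atoi(m[1])
	if err != nil {
		return oomKill{}, false
	}
	ev := oomKill{Pid: pid, Comm: m[2]}
	if a := adjRe.FindStringSubmatch(record); a != nil {
		if adj, err := strconv.Atoi(a[1]); err == nil {
			ev.OomScoreAdj = &adj
		}
	}
	return ev, true
}

func main() {
	line := "3,812,5012345,-;Out of memory: Killed process 4242 (Web Content) " +
		"total-vm:1024kB, anon-rss:512kB, file-rss:0kB, shmem-rss:0kB, " +
		"UID:1000 pgtables:64kB oom_score_adj:300"
	if ev, ok := parseOOMKill(line); ok {
		fmt.Println(ev.Pid, ev.Comm, *ev.OomScoreAdj) // prints: 4242 Web Content 300
	}
}
```

Treating `oom_score_adj` as optional means a line from an older kernel still yields an event instead of being dropped as malformed.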
Design notes
- `lib/sysmon` only owns the kmsg goroutine and publishes directly to `EventStream`. The supervisord shim handles its own schema mapping and uses the existing telemetry HTTP endpoint, so this PR doesn't introduce a new unix socket or a central event-dispatcher abstraction (vs. the earlier sketch).
- The shim replies `OK` even if the HTTP POST fails, so supervisord doesn't quarantine the listener when kernel-images-api is briefly down.
- `supervisord-shim.conf` eventlistener entry.

Out of scope (per the original spec)
- `oom_score_adj`

Test plan
- `oom_score_adj`, comm with internal space, non-OOM lines (preamble), malformed lines
- `parseFields`, `readEvent`, `mapEvent` for unexpected exits, expected exits (skipped), FATAL transitions, unrelated event types (skipped)
- `lib/sysmon`: pipe an OOM-kill line through a FIFO and verify the event lands in `EventStream` with the right schema
- `go vet ./...` clean
- `go test -race ./...` (excluding e2e) passes
- `echo f > /proc/sysrq-trigger` produces `system_oom_kill` on `/telemetry/stream`
- `supervisorctl stop mutter` does NOT fire `service_crashed` (`expected=1`)
- `kill -SEGV $(pgrep mutter)` fires `service_crashed`

Note
Medium Risk
Adds new always-on system telemetry sources (reading `/dev/kmsg` and a supervisord eventlistener that POSTs into the API), which could increase event volume and affect runtime behavior if parsing/IO misbehaves or the local telemetry endpoint is unavailable.

Overview
Adds VM-internal failure telemetry by introducing two new event types: `system_oom_kill` (emitted by a new `sysmon` monitor that tails `/dev/kmsg` inside `kernel-images-api`) and `service_crashed` (emitted by a new lightweight `supervisord-shim` eventlistener that converts supervisord PROCESS_STATE_* crash transitions into POSTs to `/telemetry/events`).

Updates the OpenAPI spec and generated `oapi` union types to include these events, wires `sysmon` startup into `server/cmd/api/main.go`, and updates both headful/headless Chromium images to build/install the shim plus enable it via new `supervisord-shim.conf` (with a larger event buffer). Includes unit/round-trip tests for the kmsg parser and shim event mapping.
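The crash-transition mapping described in the overview can be illustrated with a short sketch. The function names `parseFields` and `mapEvent` appear in the test plan, but their signatures here are guesses, and `crash` is a hypothetical struct that only mirrors the three fields visible in the review snippet. The sample header and payload follow supervisord's documented space-separated `key:value` token format.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseFields splits a supervisord "key:value key:value" line into a map.
// (Signature is an assumption; the PR's version may differ.)
func parseFields(line string) map[string]string {
	fields := make(map[string]string)
	for _, tok := range strings.Fields(line) {
		if k, v, ok := strings.Cut(tok, ":"); ok {
			fields[k] = v
		}
	}
	return fields
}

// crash is a hypothetical payload shape for a service_crashed event.
type crash struct {
	ServiceName string
	FromState   string
	Pid         *int // FATAL payloads carry no pid, so this stays nil
}

// mapEvent reports ok=true only for unexpected exits and FATAL transitions.
func mapEvent(eventName string, payload map[string]string) (crash, bool) {
	switch eventName {
	case "PROCESS_STATE_EXITED":
		if payload["expected"] != "0" {
			return crash{}, false // expected exit, e.g. supervisorctl stop
		}
	case "PROCESS_STATE_FATAL":
		// always reported
	default:
		return crash{}, false // unrelated event type
	}
	c := crash{ServiceName: payload["processname"], FromState: payload["from_state"]}
	if pid, err := strconv.Atoi(payload["pid"]); err == nil {
		c.Pid = &pid
	}
	return c, true
}

func main() {
	header := parseFields("ver:3.0 server:supervisor serial:21 " +
		"pool:supervisord-shim poolserial:10 eventname:PROCESS_STATE_EXITED len:74")
	payload := parseFields("processname:mutter groupname:mutter " +
		"from_state:RUNNING expected:0 pid:2766")
	if c, ok := mapEvent(header["eventname"], payload); ok {
		fmt.Println(c.ServiceName, c.FromState, *c.Pid) // prints: mutter RUNNING 2766
	}
}
```

Returning a boolean alongside the payload lets the caller acknowledge every event to supervisord while POSTing only the ones that represent a crash.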