CDP telemetry fixes#251
Conversation
|
Firetiger deploy monitoring skipped This PR didn't match the auto-monitor filter configured on your GitHub connection:
Reason: PR title 'CDP telemetry fixes' does not indicate changes to kernel API endpoints or Temporal workflows; please add a comment to opt in to deploy monitoring if this affects those areas. To monitor this PR anyway, reply with |
| } else if r.Context().Err() != nil { | ||
| reason = oapi.ContextCancelled | ||
| } | ||
| reason := resolveDisconnectReason(cause, r.Context(), mgr, urlCh, upstreamURL, restartConfirmWait, logger) |
There was a problem hiding this comment.
this seems to be a behavioral change, not just telemetry. Previously the URL watcher canceled the pump as soon as UpstreamManager published a different DevTools URL, forcing clients off stale upstream sessions. With this change, urlCh is only checked during dial/reason resolution after the pump exits, so a session can stay attached to the old upstream until the websocket itself errors or the request context is canceled. Could we preserve the watcher for stale-session cancellation while keeping this new cause-based reason resolution?
There was a problem hiding this comment.
maybe I am wrong, but just double-checking here
Overview
Bug
When chromium restarted mid-session (
PATCH /chromium/flags,/configure, etc., or a crash + supervisord autorestart), the CDP proxy emittedcdp_disconnect.reason = "client_close"instead of"upstream_changed". Live repro confirmed: ~1.6s after the WS opened, reason wasclient_closeeven though chromium clearly went away.Root cause
A goroutine watched
mgr.Subscribe()for the new chromium URL and stampedreasonOverride = upstream_changedbefore triggering pump teardown. But the upstream's TCP EOF (chromium dying) reaches the pump before the new "DevTools listening on" log line is emitted (~5–8 s gap). Socleanupalways ran first, sawreasonOverride == nil, and fell through toclient_close.Fix
wsproxy.Pump'sonClosecallback now receives aPumpExitCause(client/upstream/context).reasonOverrideindevtoolsproxy. Reason resolution moved into cleanup as a small helper,resolveDisconnectReason, which:client_close/context_cancelledimmediately from the cause,mgr.Current()and otherwise waits up torestartConfirmWait(10 s) onurlChfor a new URL — new URL ⇒upstream_changed, timeout ⇒upstream_error.Tests
TestResolveDisconnectReason— 6 table cases covering all four reasons (sub-100 ms).TestWebSocketProxyHandler_EmitsUpstreamChangedOnMidStreamRestart— integration through the handler withrestartConfirmWaitshortened.upstream_error+ client_close tests continue to pass.Verification
Repro'd live against the headful image. Before fix:
reason=client_close, duration_ms=1620. After fix:reason=upstream_changed, duration_ms=10388.Note
Medium Risk
Touches the WebSocket proxy pump/cleanup path and changes how
cdp_disconnectreasons are derived, which could affect client reconnect behavior and telemetry classification under failure/restart conditions.Overview
Improves CDP disconnect telemetry by teaching
wsproxy.Pumpto report which side caused the proxy loop to exit, and using that signal indevtoolsproxyto classify disconnects asclient_close,context_cancelled,upstream_error, orupstream_changed.When the upstream side dies, the proxy now briefly waits for an updated DevTools URL from
UpstreamManager(with a configurablerestartConfirmWait) to distinguish Chromium restarts from generic upstream failures; adds targeted unit tests for the new resolution logic and a mid-stream restart scenario. Most other diffs are import ordering/whitespace cleanup.Reviewed by Cursor Bugbot for commit d6ef60c. Bugbot is set up for automated code reviews on this repo. Configure here.