test(ci): OS-level sidecar watcher for the Windows silent ELIFECYCLE #7846
In-process diagnostics (diagnostics.ts heartbeat at 5 Hz + node-report snapshots on every beforeEach and heartbeat tick) merged in #7838 and #7842 reach a hard ceiling: during every captured death window the V8 main isolate is event-loop-starved for 200-400 ms before the process is externally terminated, so any timer-driven probe (heartbeat, setTimeout, --report-on-signal handler) never gets serviced and we have zero JS-visible state from the actual moment of death.

To capture state during the starvation window we need a probe whose own scheduling does not depend on the dying process's libuv event loop.

This commit adds a tiny bash background loop to the Windows backend-test steps (both with- and without-plugins). Every 500 ms it appends:

- netstat.log: localhost TCP socket state. This surfaces TIME_WAIT / CLOSE_WAIT accumulation or ephemeral-port exhaustion that the in-process libuv handle list can't see (libuv only shows handles Node currently knows about; the kernel may hold many more sockets in disposal states).
- tasklist.log: node.exe process state from the Windows OS view (handle count, working set, CPU time), independent of whether V8 is responsive.

Both files land in $GITHUB_WORKSPACE/node-report/ which is already the artifact-upload target on failure, so they ride for free on existing infrastructure. The watcher is killed cleanly after `pnpm test` returns so it never holds the runner open.

On the next captured silent ELIFECYCLE we'll have, for the first time, a 500 ms-resolution external observation of TCP and process state across the death window.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review Summary by Qodo
Add OS-level sidecar watcher for Windows silent ELIFECYCLE flake
Walkthroughs
Description
• Add OS-level sidecar watcher to Windows backend test jobs
• Poll TCP socket states and process metrics every 500ms
• Capture external diagnostics during event-loop starvation window
• Write netstat and tasklist logs to artifact directory
Diagram
flowchart LR
A["Test Job Start"] --> B["Launch Bash Watcher Loop"]
B --> C["Poll Every 500ms"]
C --> D["netstat.log<br/>TCP States"]
C --> E["tasklist.log<br/>Process Metrics"]
F["pnpm test --exit"] --> G["Test Execution"]
G --> H["Kill Watcher"]
H --> I["Upload Artifacts"]
D --> I
E --> I
File Changes
1. .github/workflows/backend-tests.yml
Code Review by Qodo
1. Broken netstat grep filter
    ts=$(date '+%H:%M:%S.%3N')
    {
      echo "=== $ts ==="
      netstat -an 2>/dev/null | grep -E "TCP\s+(127\.0\.0\.1|\[::1\])" || true
1. Broken netstat grep filter 🐞 Bug ≡ Correctness
The watcher uses grep -E "TCP\s+...", but \s is not a whitespace escape in POSIX extended regular expressions; it works only where grep implements it as an extension (GNU grep does). On a grep without that extension the filter matches nothing and netstat.log won't capture the intended socket-state timeline, which would break the primary diagnostic signal the watcher is meant to provide in both Windows test jobs.
Agent Prompt
### Issue description
The workflow's background watcher pipes `netstat` output into `grep -E "TCP\s+..."`. POSIX extended regular expressions (`grep -E`) do not define `\s` as a whitespace token; GNU grep accepts it as an extension, but a grep without that extension will not match typical `netstat` lines (which separate columns with spaces).
### Issue Context
This watcher is intended to capture localhost TCP socket states during Windows silent-ELIFECYCLE failures. If the filter matches nothing, `netstat.log` becomes mostly timestamps with no actionable data.
### Fix
Replace `\s+` with a POSIX character class, or switch to PCRE mode.
Examples:
- `grep -E "TCP[[:space:]]+(127\.0\.0\.1|\[::1\])"`
- or `grep -P "TCP\s+(127\.0\.0\.1|\[::1\])"` (only if `grep -P` is guaranteed on the runner)
Apply the same fix in both duplicated watcher blocks.
### Fix Focus Areas
- .github/workflows/backend-tests.yml[246-246]
- .github/workflows/backend-tests.yml[380-380]
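
The `[[:space:]]` variant suggested above can be checked locally before touching the workflow. The following is a minimal sketch: the sample lines only imitate the column layout of Windows `netstat -an` output, and the helper names `sample_netstat` and `filter_localhost_tcp` are hypothetical, not part of the PR.

```shell
#!/usr/bin/env bash
# Hypothetical sample data shaped like Windows `netstat -an` output.
sample_netstat() {
  cat <<'EOF'
  Proto  Local Address          Foreign Address        State
  TCP    0.0.0.0:135            0.0.0.0:0              LISTENING
  TCP    127.0.0.1:9001         127.0.0.1:52344        ESTABLISHED
  TCP    127.0.0.1:52344        127.0.0.1:9001         TIME_WAIT
  TCP    [::1]:9001             [::1]:52390            CLOSE_WAIT
  UDP    127.0.0.1:1900         *:*
EOF
}

# Filter using a POSIX character class instead of `\s`, so the pattern does
# not rely on a grep-specific extension. `|| true` keeps an empty result from
# failing the caller, mirroring the watcher line under review.
filter_localhost_tcp() {
  grep -E 'TCP[[:space:]]+(127\.0\.0\.1|\[::1\])' || true
}

matched=$(sample_netstat | filter_localhost_tcp)
match_count=$(printf '%s\n' "$matched" | grep -c 'TCP')
echo "matched $match_count localhost TCP lines"
```

Only the loopback TCP rows should survive the filter; the wildcard listener and the UDP row are dropped.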
CI Feedback 🧐
A test triggered by this PR failed. Here is an AI-generated analysis of the failure:
Summary
Adds a bash background loop to both Windows backend-test steps (with- and without-plugins) that polls Windows OS state every 500 ms and writes to the existing `node-report/` artifact directory:
- `netstat.log`: localhost TCP socket states (ESTABLISHED / TIME_WAIT / CLOSE_WAIT / FIN_WAIT) per tick.
- `tasklist.log`: `node.exe` process handle count, working set, CPU time per tick.
Why
Eleven captured silent-ELIFECYCLE deaths from the merged in-process diagnostics (#7838, #7842) all share the same shape: the V8 main isolate goes event-loop-starved for 200–400 ms (5 Hz heartbeat falls silent), then the process is externally terminated, bypassing all JS handlers, `--report-on-fatalerror`, and `--report-uncaught-exception`. The libuv handle list in our pre-kill `node-report` snapshots is nominal, but libuv only knows about handles Node currently owns. The kernel can hold many more sockets in disposal states (TIME_WAIT etc.) that we never see.
To capture state during the starvation window we need a probe whose scheduling doesn't depend on the dying process's event loop. A bash background loop is the simplest such probe: it is scheduled by the runner's bash, not by the Node isolate.
What this changes
Workflow-only, no code changes. Both Windows steps now start the watcher loop in the background before `pnpm test -- --exit` and kill it once the test command returns.
Linux jobs are untouched (the flake is Windows-only).
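
The wrapper snippet itself is not reproduced on this page. As a rough illustration of the pattern described (background poll every 500 ms, kill after the tests return), here is a minimal sketch. The function names `sample_once` and `run_with_watcher`, the plain `tasklist | grep` probe, and the use of the `[[:space:]]` filter are assumptions for illustration, not the PR's actual workflow code.

```shell
#!/usr/bin/env bash
# Sketch only: function names and probe details are assumptions.

REPORT_DIR="${GITHUB_WORKSPACE:-$PWD}/node-report"

# Append one timestamped sample to each log. Probe failures are ignored so a
# missing or failing tool never stops the loop.
sample_once() {
  local ts
  ts=$(date '+%H:%M:%S.%3N')
  {
    echo "=== $ts ==="
    netstat -an 2>/dev/null | grep -E 'TCP[[:space:]]+(127\.0\.0\.1|\[::1\])' || true
  } >> "$REPORT_DIR/netstat.log"
  {
    echo "=== $ts ==="
    tasklist 2>/dev/null | grep -i 'node\.exe' || true
  } >> "$REPORT_DIR/tasklist.log"
}

# Run the given command while a background subshell polls every 500 ms, then
# stop the watcher and return the command's own exit status.
run_with_watcher() {
  mkdir -p "$REPORT_DIR"
  ( while true; do sample_once; sleep 0.5; done ) &
  local watcher_pid=$!
  "$@"
  local rc=$?
  kill "$watcher_pid" 2>/dev/null || true
  wait "$watcher_pid" 2>/dev/null || true
  return "$rc"
}

# Intended use in the workflow step (not executed here):
#   run_with_watcher pnpm test -- --exit
```

Returning the wrapped command's exit status keeps the step's pass/fail result tied to the tests rather than to the watcher, and the explicit `kill` plus `wait` means the background loop cannot hold the runner open.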
Expected next capture
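On the next silent ELIFECYCLE Win+plugins failure we'll get, for the first time, a 500 ms-resolution external observation across the death window of:
- localhost TCP socket state (`netstat.log`)
- `node.exe` process state (`tasklist.log`)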
Either signal can be the smoking gun (or rule out a class of hypotheses, e.g. TIME_WAIT exhaustion).
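
To test a hypothesis such as TIME_WAIT exhaustion against a captured `netstat.log`, the per-tick state counts can be tabulated. The sketch below assumes each tick starts with a `=== HH:MM:SS.mmm ===` header and that the TCP state is the last column of each line, as in Windows `netstat -an` output; the function name `summarize_states` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print one line per watcher tick with counts of selected TCP states.
summarize_states() {
  awk '
    # Emit the counts collected for the current tick, then reset them.
    function emit() {
      if (ts != "")
        printf "%s TIME_WAIT=%d CLOSE_WAIT=%d ESTABLISHED=%d\n",
               ts, n["TIME_WAIT"], n["CLOSE_WAIT"], n["ESTABLISHED"]
      split("", n)   # portable way to clear the array
    }
    { sub(/\r$/, "") }               # tolerate CRLF line endings
    /^=== / { emit(); ts = $2; next }
    $1 == "TCP" { n[$NF]++ }
    END { emit() }
  ' "$1"
}

# Example invocation against a downloaded artifact:
#   summarize_states node-report/netstat.log
```

A TIME_WAIT count that climbs steadily in the ticks before the death window would support the exhaustion hypothesis; a flat count would help rule it out.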
Risk
`netstat -an` and `tasklist` are cheap. The sleep loop runs in its own bash process so it doesn't compete with Node for the event loop. 500 ms cadence keeps CPU overhead negligible.
🤖 Generated with Claude Code