test(ci): probe G — install ProcDump as JIT debugger to capture __fastfail across the silent kill#7862
JohnMcLear wants to merge 5 commits
…tasklist sidecar

Three orthogonal probes against the Windows silent-ELIFECYCLE flake, landed in one PR because they're all workflow-only and complementary.

PROBE A: Defender real-time monitoring OFF for the test phase. The kill fingerprint (silent external termination, no JS-handler trace, no native abort report, sub-1s death window) matches Microsoft Defender's behavioural-monitoring TerminateProcess signature. GHA Windows runners have Defender RT enabled by default, and rapid loopback TCP fanout is on Defender's "suspect process behaviour" list. If kills disappear with RT off → causal, this PR is the fix-as-mitigation; if not → Defender ruled out.

PROBE H: pre-test wevtutil clear + post-test event log dump. We've never looked at the Windows event log around the kill. `Application`, `System`, `Microsoft-Windows-Windows Defender/Operational`, and the `Application Error`/`Application Hang`/`Windows Error Reporting` providers between them will surface who killed the process: Defender, Service Control Manager, Werfault, kernel guard, etc. Clear the logs pre-test so signal-to-noise is high; dump post-test regardless of pass/fail.

PROBE I: tasklist sidecar fix (latent bug from PR #7846). The bash `tasklist /v /fi "imagename eq node.exe" /fo csv` produced empty output on the runner because git-bash mangles tasklist's UTF-16-LE-with-BOM output. Switch to PowerShell's Get-CimInstance Win32_Process with explicit columns. This gives us the OS-side equivalent of the libuv handle table (HandleCount, ThreadCount, WorkingSetSize, PageFileUsage, KernelModeTime, UserModeTime) sampled every 500 ms. When Node's `_getActiveHandles` goes silent during the V8 starvation window, the OS still sees the process; this captures that view.

All three additions land in node-report/, which the existing artifact upload picks up on failure. No test-code changes. No new dependencies.

Expected outcomes:
- Defender root cause: Win-with-plugins flake rate drops materially over 5+ runs. event-defender.txt shows pre-kill threat-detection entries on the kills that DO still happen.
- Defender not the root cause: event-application.txt / event-system.txt names the actual terminator (Service Control Manager, kernel, Werfault). Probe G (procdump) is the next step.
- Neither: kernel-level kill bypassing all event logging. Escalates to ETW tracing or a procdump on kill-detect trigger.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first artifact upload step has `if: failure()`, so we only see node-report data on failure. For the Defender hypothesis (PR #7855) we need to compare event-defender.txt between a passing run (baseline) and a future failing run (kill signature); otherwise N=1 captures can't be evaluated.

Add a second upload step gated on `always()` that uploads only the small text files (event-*.txt, defender-*.txt) on every run regardless of outcome. The unique `-${{ github.run_attempt }}` suffix lets reruns accumulate separate artifacts for comparison. Each artifact is only a few KB, so this doesn't materially impact storage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
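The baseline-versus-failure comparison described above can be done by eye, or with a small script along the following lines. This is a minimal sketch in Python and is not part of the workflow: the function name `new_entries` is hypothetical, and it assumes each event is rendered on a single line with any per-run noise (such as timestamps) already stripped by the caller.

```python
def new_entries(baseline_text: str, failing_text: str) -> list[str]:
    """Return lines present in the failing run's log but absent from the baseline.

    Order follows the failing log; blank lines and duplicates are dropped.
    """
    baseline = {line.strip() for line in baseline_text.splitlines() if line.strip()}
    seen: set[str] = set()
    result: list[str] = []
    for line in failing_text.splitlines():
        entry = line.strip()
        # Keep only entries that the passing run never produced.
        if entry and entry not in baseline and entry not in seen:
            seen.add(entry)
            result.append(entry)
    return result
```

Feeding it the contents of event-defender.txt from a passing run and from a failing run would leave only the entries unique to the failure, which is the "kill signature" the second upload step is meant to expose.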
…capture

Probe A (PR #7855) ruled out Defender as the killer: with DisableRealtimeMonitoring + DisableBehaviorMonitoring + DisableIOAVProtection all = True, the silent ELIFECYCLE still fired (run 26470378618, Win without plugins, `pad.ts > Tests > creates a new Pad with empty text`, kill +470 ms post-test-start, exit 255).

The captured event logs showed:
- Application log: empty (zero entries during test phase)
- System log: only pre-test service stops; no SCM TerminateProcess
- Defender Operational: only stale 2026-05-18 runner provisioning
- Application Error / Hang / WER: zero entries

The fixed tasklist sidecar showed the dying Node process (PID 7036) was completely healthy 1 second before death: HandleCount=323 stable, ThreadCount=17 stable, WorkingSetSize ~321 MB stable, KernelModeTime and UserModeTime growing linearly. No anomaly in OS-side process state. Then dead 1 second later with zero entry in any Windows event log.

That fingerprint (silent external termination with no event-log trace and no anomaly in OS-side state) matches `__fastfail` (the `int 29h` fast-fail intrinsic). libuv on Windows uses `__fastfail` for certain internal assertion failures in its TCP and pipe paths (uv_win.c, tcp-win.c, pipe-win.c). When triggered, it immediately terminates the process, bypassing all user-mode notification including WER. The only standard tool that catches state across __fastfail is a JIT-installed debugger.

Install Sysinternals ProcDump as the system JIT debugger:
- downloads procdump.zip from sysinternals.com
- extracts to C:\procdump
- `-i -ma` registers as the AeDebug handler, configured for full memory dumps
- dumps land in node-report/dumps/, which the existing failure artifact picks up

On the next silent ELIFECYCLE this captures a .dmp file with full call stack across the kill, loadable in WinDbg with "!analyze -v" to see the libuv assertion (or whatever else) that fired the fast-fail. That should be the final word on what's killing the process.

Built on top of probe-flake-defender-eventlog-sidecar (#7855) because the event-log capture + sidecar fix are useful baselines even after Defender's ruled out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
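The "healthy one second before death" reading of the sidecar data can be made mechanical. The sketch below is a hypothetical Python helper, not part of the workflow: the `Sample` fields mirror the Win32_Process columns named above, while the function name `looks_healthy`, the 5% tolerance, and the way the samples are loaded are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    handle_count: int
    thread_count: int
    working_set: int       # bytes
    kernel_mode_time: int  # cumulative CPU time, 100 ns units
    user_mode_time: int    # cumulative CPU time, 100 ns units


def looks_healthy(samples: list[Sample], tolerance: float = 0.05) -> bool:
    """True if the gauges stay within `tolerance` of their first value
    and the cumulative CPU counters never decrease."""
    if len(samples) < 2:
        return False  # not enough data to judge
    first = samples[0]
    for prev, cur in zip(samples, samples[1:]):
        # Gauges: compare every later sample against the first one.
        for name in ("handle_count", "thread_count", "working_set"):
            base = getattr(first, name)
            if abs(getattr(cur, name) - base) > tolerance * base:
                return False
        # Counters: CPU time must be non-decreasing between samples.
        if cur.kernel_mode_time < prev.kernel_mode_time:
            return False
        if cur.user_mode_time < prev.user_mode_time:
            return False
    return True
```

A run of samples with HandleCount=323 and ThreadCount=17 throughout and steadily growing CPU times would pass this check, which is the pattern the commit message reports for PID 7036.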
Review Summary by Qodo
Add ProcDump JIT debugger and Windows diagnostics probes

Walkthroughs
Description
• Install ProcDump as JIT debugger to capture __fastfail crashes
• Disable Windows Defender real-time monitoring during tests (probe A)
• Clear and dump Windows Event Logs pre/post-test (probe H)
• Fix tasklist sidecar to use PowerShell instead of bash (probe I)
• Upload Defender and Event Log diagnostics as artifacts

Diagram
flowchart LR
A["Test Setup"] --> B["Install ProcDump<br/>as JIT Debugger"]
A --> C["Disable Defender<br/>Real-time Monitoring"]
A --> D["Clear Event Logs"]
B --> E["Run Backend Tests"]
C --> E
D --> E
E --> F["Dump Event Logs<br/>Application/System/Defender"]
E --> G["Verify Defender State"]
F --> H["Upload Diagnostics<br/>Artifacts"]
G --> H
File Changes
1. .github/workflows/backend-tests.yml
Code Review by Qodo
1. Unverified ProcDump download
…m workspace root

Rev 2 (PR #7862, run 26511524556) confirmed ProcDump was successfully registered as JIT debugger via -i. The Win+plugins job then failed with the silent ELIFECYCLE fingerprint, but NO .dmp file was captured in the artifact.

Two problems:
1. The registered AeDebug command used -j with cwd (workspace root) as the dump path, not the dumps subdirectory I'd intended. So if a dump WAS written, it went to D:\a\etherpad\etherpad\<pid>.dmp, outside my upload path.
2. More importantly: AeDebug only fires for unhandled SEH / __fastfail / WER-classified crashes. The fact that NOTHING fired tells us the kill class bypasses all of those.

Rev 3 attacks both problems:
(a) Push-Location to node-report/dumps before procdump -i so the cwd at install time is the dumps subdirectory. Future AeDebug-triggered dumps land where the artifact upload picks them up.
(b) Adds an attached procdump per node.exe pid. A bash background loop polls Get-Process node every 500 ms and spawns `procdump -ma -t -n 3 <pid> dumps/` for each new PID. The -t flag dumps on process TERMINATION, including external TerminateProcess, which AeDebug never sees.
(c) After pnpm test exits, the test step now walks the workspace root for any stray .dmp files and copies them into the upload directory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
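The poll-and-attach loop in (b) follows a simple pattern: remember which PIDs have been seen and spawn one procdump per new PID. The following is a rough Python sketch of that pattern, not the bash loop from the workflow. The function names and the injected `list_pids` / `spawn` callables are hypothetical; only the procdump argument list is taken from the commit message.

```python
import time
from typing import Callable, Iterable


def procdump_argv(pid: int, dump_dir: str) -> list[str]:
    # Mirrors the command quoted above: procdump -ma -t -n 3 <pid> dumps/
    return ["procdump", "-ma", "-t", "-n", "3", str(pid), dump_dir]


def attach_new_pids(
    list_pids: Callable[[], Iterable[int]],
    spawn: Callable[[list[str]], object],
    dump_dir: str,
    rounds: int,
    interval_s: float = 0.5,
) -> set[int]:
    """Poll `list_pids` `rounds` times; call `spawn` once per PID not seen before."""
    attached: set[int] = set()
    for i in range(rounds):
        for pid in list_pids():
            if pid not in attached:
                attached.add(pid)
                spawn(procdump_argv(pid, dump_dir))
        if i + 1 < rounds:
            time.sleep(interval_s)  # 500 ms by default, as in the commit message
    return attached
```

On a real runner, `spawn` would be something like `subprocess.Popen` and `list_pids` would wrap whatever enumerates node.exe processes; both are injected here so the loop can be exercised without Windows. The sketch ignores PID reuse, which a long-running loop would need to consider.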
Summary
Install Sysinternals ProcDump as the system Just-In-Time debugger on both Windows backend matrices. When ANY process on the runner crashes or fast-fails (`int 29h`), the OS hands it to the JIT debugger, which writes a full memory dump to the artifact directory.

Why
Defender ruled out by #7855. First failure on that PR (run 26470378618) with `DisableRealtimeMonitoring + DisableBehaviorMonitoring + DisableIOAVProtection` all confirmed `True` produced this evidence:

Then dead 1 second later with zero trace anywhere.

That fingerprint (silent external termination with no event-log entry and a healthy OS-side process) is the signature of `__fastfail` (`int 29h`). libuv on Windows uses it for internal assertion failures in `uv_win.c`, `tcp-win.c`, `pipe-win.c` (TCP/pipe state-corruption checks). `__fastfail` terminates the process, bypassing all user-mode notification and WER.

The only standard tool that captures state across `__fastfail` is a JIT-installed debugger.

What this changes
New step on both Windows backend matrices, BEFORE "Run the backend tests":
- `-i` registers procdump64.exe as the AeDebug handler (system JIT debugger)
- `-ma` writes full memory dumps
- `node-report/dumps/`: picked up by the existing failure-artifact upload

Built on PR #7855
Branch is on top of `probe-flake-defender-eventlog-sidecar`, so it carries the Defender-off + event-log + tasklist-fix probes too. Even though Defender's ruled out, those baselines stay useful for cross-comparison.

Expected outcome
On the next silent ELIFECYCLE failure: a `.dmp` file in the artifact. Loadable in WinDbg with `!analyze -v`, which should name the function calling `__fastfail` and the assertion that fired.

If no `.dmp` is produced even on failure → the kill isn't going through user-mode exception handling at all → it's coming from the kernel (HVCI, CET violation, page-fault chain). Escalates to ETW tracing.

🤖 Generated with Claude Code