test(ci): probe G — install ProcDump as JIT debugger to capture __fastfail across the silent kill by JohnMcLear · Pull Request #7862 · ether/etherpad

JohnMcLear · 2026-05-27T09:09:51Z

Summary

Install Sysinternals ProcDump as the system Just-In-Time debugger on both Windows backend matrices. When ANY process on the runner crashes or fast-fails (int 29h), the OS hands it to the JIT debugger, which writes a full memory dump to the artifact directory.

Why

Defender ruled out by #7855. First failure on that PR (run 26470378618) with DisableRealtimeMonitoring + DisableBehaviorMonitoring + DisableIOAVProtection all confirmed True produced this evidence:

| Source | Contents at kill time |
| --- | --- |
| Application log | empty (cleared pre-test, zero entries during run) |
| System log | only pre-test service-stop events |
| Defender Operational | only stale 2026-05-18 provisioning events |
| Application Error/Hang/WER | zero entries |
| OS-side tasklist (node.exe) 1s before kill | HandleCount=323 stable, ThreadCount=17 stable, WS=321 MB stable, KernelModeTime + UserModeTime growing linearly; completely healthy |

Then dead 1 second later with zero trace anywhere.

That fingerprint — silent external termination with no event-log entry and a healthy OS-side process — is the signature of __fastfail (int 29h). libuv on Windows uses it for internal assertion failures in uv_win.c, tcp-win.c, pipe-win.c (TCP/pipe state-corruption checks). __fastfail terminates the process bypassing all user-mode notification and WER.

The only standard tool that captures state across __fastfail is a JIT-installed debugger.

What this changes

New step on both Windows backend matrices, BEFORE "Run the backend tests":

- name: Install ProcDump as JIT debugger (probe G)
  shell: powershell
  run: |
    Invoke-WebRequest "https://download.sysinternals.com/files/Procdump.zip" -OutFile "$env:TEMP\Procdump.zip"
    Expand-Archive -Path "$env:TEMP\Procdump.zip" -DestinationPath "C:\procdump" -Force
    New-Item -ItemType Directory -Force -Path "${{ github.workspace }}\node-report\dumps" | Out-Null
    & "C:\procdump\procdump64.exe" -accepteula -i -ma "${{ github.workspace }}\node-report\dumps"

- `-i` registers procdump64.exe as the AeDebug handler (system JIT debugger)
- `-ma` writes full memory dumps
- Dumps land in node-report/dumps/, which the existing failure-artifact upload picks up

Built on PR #7855

Branch is on top of probe-flake-defender-eventlog-sidecar, so it carries the Defender-off + event-log + tasklist-fix probes too. Even though Defender's ruled out, those baselines stay useful for cross-comparison.

Expected outcome

On the next silent ELIFECYCLE failure: a .dmp file in the artifact. Loadable in WinDbg with !analyze -v — that should name the function calling __fastfail and the assertion that fired.

If no .dmp is produced even on failure → the kill isn't going through user-mode exception handling at all → it's coming from the kernel (HVCI, CET violation, page-fault chain). Escalates to ETW tracing.

🤖 Generated with Claude Code

…tasklist sidecar

Three orthogonal probes against the Windows silent-ELIFECYCLE flake, landed in one PR because they're all workflow-only and complementary.

PROBE A: Defender real-time monitoring OFF for the test phase. The kill fingerprint (silent external termination, no JS-handler trace, no native abort report, sub-1s death window) matches Microsoft Defender's behavioural-monitoring TerminateProcess signature. GHA Windows runners have Defender RT enabled by default, and rapid loopback TCP fanout is on Defender's "suspect process behaviour" list. If kills disappear with RT off → causal, this PR is the fix-as-mitigation; if not → Defender ruled out.

PROBE H: pre-test wevtutil clear + post-test event log dump. We've never looked at the Windows event log around the kill. `Application`, `System`, `Microsoft-Windows-Windows Defender/Operational`, and the `Application Error`/`Application Hang`/`Windows Error Reporting` providers between them will surface who killed the process: Defender, Service Control Manager, Werfault, kernel guard, etc. Clear the logs pre-test so signal-to-noise is high; dump post-test regardless of pass/fail.

PROBE I: tasklist sidecar fix (latent bug from PR #7846). The bash `tasklist /v /fi "imagename eq node.exe" /fo csv` produced empty output on the runner because git-bash mangles tasklist's UTF-16-LE-with-BOM output. Switch to PowerShell's Get-CimInstance Win32_Process with explicit columns. This gives us the OS-side equivalent of the libuv handle table (HandleCount, ThreadCount, WorkingSetSize, PageFileUsage, KernelModeTime, UserModeTime) sampled every 500 ms. When Node's `_getActiveHandles` goes silent during the V8 starvation window, the OS still sees the process; this captures that view.

All three additions land in node-report/ which the existing artifact upload picks up on failure. No test-code changes. No new dependencies.

Expected outcomes:

- Defender root cause: Win-with-plugins flake rate drops materially over 5+ runs. event-defender.txt shows pre-kill threat-detection entries on the kills that DO still happen.
- Defender not the root cause: event-application.txt / event-system.txt names the actual terminator (Service Control Manager, kernel, Werfault). Probe G (procdump) is the next step.
- Neither: kernel-level kill bypassing all event logging; escalates to ETW tracing or a procdump on kill-detect trigger.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The first artifact upload step has `if: failure()`, so we only see node-report data on failure. For the Defender hypothesis (PR #7855) we need to compare event-defender.txt between a passing run (baseline) and a future failing run (kill signature); otherwise N=1 captures can't be evaluated.

Add a second upload step gated on `always()` that uploads only the small text files (event-*.txt, defender-*.txt) on every run regardless of outcome. The unique `-${{ github.run_attempt }}` suffix lets reruns accumulate separate artifacts for comparison. Each artifact is only a few KB, so this doesn't materially impact storage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
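A minimal sketch of what such an always-on upload step might look like. The step name and artifact name are hypothetical, and the `node-report/` location of the text files is assumed from the rest of this PR; only the `always()` gate, the file globs, and the `run_attempt` suffix come from the commit message above.

```yaml
# Sketch only: step name and artifact name are assumptions, not the PR's actual values.
- name: Upload event-log diagnostics (every run)
  if: always()                       # run on pass and on failure
  uses: actions/upload-artifact@v4
  with:
    # run_attempt suffix keeps reruns from colliding on the same artifact name
    name: windows-diagnostics-${{ github.run_attempt }}
    path: |
      node-report/event-*.txt
      node-report/defender-*.txt
    if-no-files-found: ignore        # a run that produced no logs should not fail here
```

In a matrix job the artifact name would likely also need a per-job component, since artifact names must be unique within a run.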

…capture

Probe A (PR #7855) ruled out Defender as the killer: with DisableRealtimeMonitoring + DisableBehaviorMonitoring + DisableIOAVProtection all = True, the silent ELIFECYCLE still fired (run 26470378618, Win without plugins, `pad.ts > Tests > creates a new Pad with empty text`, kill +470 ms post-test-start, exit 255).

The captured event logs showed:

- Application log: empty (zero entries during test phase)
- System log: only pre-test service stops; no SCM TerminateProcess
- Defender Operational: only stale 2026-05-18 runner provisioning
- Application Error / Hang / WER: zero entries

The fixed tasklist sidecar showed the dying Node process (PID 7036) was completely healthy 1 second before death: HandleCount=323 stable, ThreadCount=17 stable, WorkingSetSize ~321 MB stable, KernelModeTime and UserModeTime growing linearly. No anomaly in OS-side process state. Then dead 1 second later with zero entry in any Windows event log.

That fingerprint (silent external termination with no event-log trace and no anomaly in OS-side state) matches `__fastfail` (the `int 29h` fast-fail intrinsic). libuv on Windows uses `__fastfail` for certain internal assertion failures in its TCP and pipe paths (uv_win.c, tcp-win.c, pipe-win.c). When triggered, it immediately terminates the process bypassing all user-mode notification including WER.

The only standard tool that catches state across __fastfail is a JIT-installed debugger. Install Sysinternals ProcDump as the system JIT debugger:

- downloads procdump.zip from sysinternals.com
- extracts to C:\procdump
- `-i -ma` registers as the AeDebug handler, configured for full memory dumps
- dumps land in node-report/dumps/ which the existing failure artifact picks up

On the next silent ELIFECYCLE this captures a .dmp file with full call stack across the kill, loadable in WinDbg with "!analyze -v" to see the libuv assertion (or whatever else) that fired the fast-fail. That should be the final word on what's killing the process.

Built on top of probe-flake-defender-eventlog-sidecar (#7855) because the event-log capture + sidecar fix are useful baselines even after Defender's ruled out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

qodo-code-review · 2026-05-27T09:09:56Z

Qodo reviews are paused for this user.


qodo-free-for-open-source-projects · 2026-05-27T09:10:11Z

Review Summary by Qodo

Add ProcDump JIT debugger and Windows diagnostics probes

🧪 Tests ✨ Enhancement

Walkthroughs

Description

• Install ProcDump as JIT debugger to capture __fastfail crashes
• Disable Windows Defender real-time monitoring during tests (probe A)
• Clear and dump Windows Event Logs pre/post-test (probe H)
• Fix tasklist sidecar to use PowerShell instead of bash (probe I)
• Upload Defender and Event Log diagnostics as artifacts

Diagram

flowchart LR
  A["Test Setup"] --> B["Install ProcDump<br/>as JIT Debugger"]
  A --> C["Disable Defender<br/>Real-time Monitoring"]
  A --> D["Clear Event Logs"]
  B --> E["Run Backend Tests"]
  C --> E
  D --> E
  E --> F["Dump Event Logs<br/>Application/System/Defender"]
  E --> G["Verify Defender State"]
  F --> H["Upload Diagnostics<br/>Artifacts"]
  G --> H

File Changes

1. .github/workflows/backend-tests.yml 🧪 Tests +184/-32

Add Windows diagnostics and JIT debugger probes

• Added ProcDump JIT debugger installation step (probe G) to both Windows backend test jobs,
 downloading and registering procdump64.exe with -i -ma flags to capture full memory dumps on
 process crashes
• Implemented Defender real-time monitoring disable step (probe A) before tests with state
 verification before and after
• Added event log clearing pre-test and comprehensive post-test event log dumping (probe H) for
 Application, System, Defender Operational, and WER logs
• Fixed tasklist sidecar command (probe I) replacing bash tasklist with PowerShell
 Get-CimInstance to properly capture process metrics including HandleCount and WorkingSetSize
• Added new artifact upload step to always capture Defender and Event Log diagnostics regardless of
 test pass/fail status


qodo-free-for-open-source-projects · 2026-05-27T09:10:12Z

Code Review by Qodo

🐞 Bugs (4) 📘 Rule violations (0) 📎 Requirement gaps (0)

1. Unverified ProcDump download 🐞 Bug ⛨ Security

Description

The workflow downloads ProcDump from the internet and executes it with system-wide effects (-i
installs a JIT debugger) without any integrity or signature verification, creating a CI supply-chain
execution risk.

Code

.github/workflows/backend-tests.yml[R233-240]

Evidence
The workflow downloads ProcDump from an external URL, extracts it, and executes it to install as the
system JIT debugger, with no intervening hash/signature checks.
.github/workflows/backend-tests.yml[217-243]
.github/workflows/backend-tests.yml[427-453]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The workflow downloads `Procdump.zip` and immediately executes `procdump64.exe` to install it as the system JIT debugger. Without validating the downloaded content (hash/signature), a compromised upstream or interception could result in executing attacker-controlled code in CI.
## Issue Context
This happens in both Windows jobs (`withoutpluginsWindows` and `withpluginsWindows`).
## Fix Focus Areas
- .github/workflows/backend-tests.yml[233-243]
- .github/workflows/backend-tests.yml[443-453]
## Suggested fix
- After download+extract, validate the binary before running it:
- Prefer verifying the Authenticode signature: `Get-AuthenticodeSignature C:\procdump\procdump64.exe` and require `Status -eq 'Valid'`.
- Optionally also pin a known SHA256 for the zip or exe via `Get-FileHash` and compare to a constant.
- If validation fails, `Write-Error` and `exit 1` to prevent executing untrusted code.
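The suggested fix could be sketched roughly as follows. This is an illustration under stated assumptions, not the change the PR actually made: the step name is hypothetical, and the commented-out hash pin uses a placeholder because no known-good SHA256 appears anywhere in this thread.

```yaml
# Sketch only: validates the Authenticode signature before the binary is ever executed.
- name: Install ProcDump as JIT debugger (with signature check)
  shell: powershell
  run: |
    Invoke-WebRequest "https://download.sysinternals.com/files/Procdump.zip" -OutFile "$env:TEMP\Procdump.zip"
    Expand-Archive -Path "$env:TEMP\Procdump.zip" -DestinationPath "C:\procdump" -Force

    # Refuse to run the binary unless Windows reports a valid signature.
    $sig = Get-AuthenticodeSignature "C:\procdump\procdump64.exe"
    if ($sig.Status -ne 'Valid') {
      Write-Host "::error::ProcDump signature check failed: $($sig.Status)"
      exit 1
    }

    # Optional hash pin. The expected value is a placeholder (assumption):
    # $expected = '<pinned SHA256 of Procdump.zip>'
    # $actual = (Get-FileHash "$env:TEMP\Procdump.zip" -Algorithm SHA256).Hash
    # if ($actual -ne $expected) { Write-Host "::error::Procdump.zip hash mismatch"; exit 1 }

    New-Item -ItemType Directory -Force -Path "${{ github.workspace }}\node-report\dumps" | Out-Null
    & "C:\procdump\procdump64.exe" -accepteula -i -ma "${{ github.workspace }}\node-report\dumps"
```

A hash pin is stricter than a signature check but breaks whenever Sysinternals publishes a new ProcDump build under the same URL, which is why it is left optional here.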


2. Full dump artifact exposure 🐞 Bug ⛨ Security

Description

ProcDump is configured to write full memory dumps (-ma) into node-report/dumps, and the existing
failure artifact upload publishes the entire node-report/ directory, which can expose sensitive
in-memory data in CI artifacts.

Code

.github/workflows/backend-tests.yml[R236-240]

Evidence
The ProcDump command explicitly requests full dumps and writes them under node-report/dumps, and
the workflow uploads node-report/ on failure, which will include those dumps.
.github/workflows/backend-tests.yml[217-243]
.github/workflows/backend-tests.yml[332-340]
.github/workflows/backend-tests.yml[427-453]
.github/workflows/backend-tests.yml[542-550]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The workflow creates full memory dumps (`-ma`) in a directory that is subsequently uploaded as an artifact on failure. Full dumps can include credentials/tokens/env values and other sensitive runtime data.
## Issue Context
Dumps are written to `${{ github.workspace }}\node-report\dumps`, and failures upload `node-report/` wholesale.
## Fix Focus Areas
- .github/workflows/backend-tests.yml[236-240]
- .github/workflows/backend-tests.yml[332-340]
- .github/workflows/backend-tests.yml[446-450]
- .github/workflows/backend-tests.yml[542-550]
## Suggested fix
One (or combine multiple):
- Prefer a smaller dump type unless full dumps are strictly required (e.g., use a minidump option instead of `-ma`).
- Keep dumps out of the default `node-report/` artifact path (upload them under a separate artifact name with stricter conditions/retention).
- Gate dump upload behind an explicit opt-in (e.g., a workflow input, a repo variable, or a branch/owner condition) so routine pushes do not publish dumps.
- If keeping dumps, consider reducing `retention-days` for dump artifacts compared to other logs.
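One way the separate-artifact, opt-in, and short-retention options might be combined is sketched below. The repository variable `UPLOAD_CRASH_DUMPS`, the step name, the artifact name, and the 3-day retention are all hypothetical choices for illustration.

```yaml
# Sketch only: UPLOAD_CRASH_DUMPS is a hypothetical repository variable used as the opt-in switch.
- name: Upload crash dumps (opt-in)
  if: failure() && vars.UPLOAD_CRASH_DUMPS == 'true'
  uses: actions/upload-artifact@v4
  with:
    name: crash-dumps-${{ github.run_attempt }}
    path: node-report/dumps/*.dmp
    retention-days: 3                # shorter than the retention used for ordinary logs
    if-no-files-found: ignore
```

For this to help, the existing `node-report/` upload would also need to stop including the dumps, for example by adding a `!node-report/dumps/**` exclusion line to its `path`. If a smaller dump is acceptable, ProcDump's `-mp` (MiniPlus) or default minidump mode could replace `-ma`, at the cost of less heap state for analysis.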


3. ProcDump install not checked 🐞 Bug ☼ Reliability

Description

The ProcDump installation step does not check the external process exit code, so procdump64.exe -i
can fail (leaving no JIT debugger installed) while the step still appears successful.

Code

.github/workflows/backend-tests.yml[R233-243]

Evidence
The step invokes procdump64.exe and then only prints AeDebug registry keys; it never checks
$LASTEXITCODE nor asserts that AeDebug changed to ProcDump.
.github/workflows/backend-tests.yml[217-243]
.github/workflows/backend-tests.yml[427-453]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The workflow invokes `procdump64.exe` but does not validate `$LASTEXITCODE` (or otherwise assert registry changes). PowerShell will not automatically fail the step on a non-zero exit code from an external executable.
## Issue Context
The step’s purpose is diagnostic; silent failure defeats the probe and wastes CI cycles.
## Fix Focus Areas
- .github/workflows/backend-tests.yml[233-243]
- .github/workflows/backend-tests.yml[443-453]
## Suggested fix
- Immediately after invoking ProcDump, check `$LASTEXITCODE` and either:
- `if ($LASTEXITCODE -ne 0) { Write-Error "ProcDump JIT install failed ($LASTEXITCODE)"; exit $LASTEXITCODE }`, or
- emit a loud warning to the log/step summary and continue if you intentionally don’t want to fail the job.
- Optionally parse/validate the `AeDebug` registry values and fail/warn if they don’t reference ProcDump.
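A rough sketch of both checks together. It assumes ProcDump returns 0 on a successful `-i` install, which the suggestion above implies but which is not confirmed anywhere in this thread, so the registry assertion is treated as the more reliable signal. The step name is hypothetical.

```yaml
# Sketch only: assumes exit code 0 means the install succeeded (unconfirmed).
- name: Register ProcDump and verify AeDebug
  shell: powershell
  run: |
    & "C:\procdump\procdump64.exe" -accepteula -i -ma "${{ github.workspace }}\node-report\dumps"
    if ($LASTEXITCODE -ne 0) {
      Write-Host "::error::ProcDump JIT install failed ($LASTEXITCODE)"
      exit $LASTEXITCODE
    }

    # Independent check: the AeDebug 'Debugger' value should now point at ProcDump.
    $key = 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\AeDebug'
    $debugger = (Get-ItemProperty -Path $key -Name Debugger -ErrorAction SilentlyContinue).Debugger
    Write-Host "AeDebug Debugger = $debugger"
    if ($debugger -notmatch 'procdump') {
      Write-Host "::error::AeDebug does not reference ProcDump"
      exit 1
    }
```

If the probe should never block the test run, the two `exit` calls could be replaced with `::warning::` annotations so the failure is loud but not fatal.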


4. Potential 32-bit JIT gap 🐞 Bug ☼ Reliability

Description

Only procdump64.exe is installed as the JIT debugger, so crashes from 32-bit processes (which can
consult the WOW6432Node AeDebug key) may not be captured by ProcDump.

Code

.github/workflows/backend-tests.yml[R240-243]

Evidence
The step runs procdump64.exe -i but separately reads both the 64-bit AeDebug registry key and the
WOW6432Node key, indicating awareness of two paths while only installing one executable.
.github/workflows/backend-tests.yml[217-243]
.github/workflows/backend-tests.yml[427-453]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The workflow installs only `procdump64.exe` as the JIT debugger but also queries the WOW6432Node AeDebug key. If any relevant process is 32-bit, dumps might not be produced.
## Issue Context
This is likely low-impact if everything under test is 64-bit, but it’s easy to harden.
## Fix Focus Areas
- .github/workflows/backend-tests.yml[240-243]
- .github/workflows/backend-tests.yml[450-453]
## Suggested fix
- Also install the 32-bit handler (e.g., run `procdump.exe -i ...` if present in the zip) or otherwise ensure both 64-bit and 32-bit AeDebug registrations point to ProcDump.
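Before adding a second install, it may be enough to check what `procdump64.exe -i` already wrote to each registry view. The sketch below only reads and reports; it makes no assumption about whether the 64-bit installer populates the WOW6432Node key, and the step name is hypothetical.

```yaml
# Sketch only: read-only check of both AeDebug registry views, warning rather than failing.
- name: Check AeDebug registration in both registry views
  shell: powershell
  run: |
    $keys = @(
      'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\AeDebug',
      'HKLM:\SOFTWARE\WOW6432Node\Microsoft\Windows NT\CurrentVersion\AeDebug'
    )
    foreach ($key in $keys) {
      $debugger = (Get-ItemProperty -Path $key -Name Debugger -ErrorAction SilentlyContinue).Debugger
      if ($debugger -match 'procdump') {
        Write-Host "OK: $key -> $debugger"
      } else {
        Write-Host "::warning::$key does not reference ProcDump (value: '$debugger')"
      }
    }
```

A warning on the WOW6432Node key would be the cue to try the 32-bit `procdump.exe -i`, as the review suggests.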


…m workspace root

Rev 2 (PR #7862, run 26511524556) confirmed ProcDump was successfully registered as JIT debugger via -i. The Win+plugins job then failed with the silent ELIFECYCLE fingerprint, but NO .dmp file was captured in the artifact. Two problems:

1. The registered AeDebug command used -j with cwd (workspace root) as the dump path, not the dumps subdirectory I'd intended. So if a dump WAS written, it went to D:\a\etherpad\etherpad\<pid>.dmp, outside my upload path.
2. More importantly: AeDebug only fires for unhandled SEH / __fastfail / WER-classified crashes. The fact that NOTHING fired tells us the kill class bypasses all of those.

Rev 3 attacks both problems:

(a) Push-Location to node-report/dumps before procdump -i so the cwd at install time is the dumps subdirectory. Future AeDebug-triggered dumps land where the artifact upload picks them up.

(b) Adds an attached procdump per node.exe pid. A bash background loop polls Get-Process node every 500 ms and spawns `procdump -ma -t -n 3 <pid> dumps/` for each new PID. The -t flag dumps on process TERMINATION, including external TerminateProcess, which AeDebug never sees.

(c) After pnpm test exits, the test step now walks the workspace root for any stray .dmp files and copies them into the upload directory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
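Steps (b) and (c) might look roughly like the sketch below. It is a PowerShell-only variant written for illustration, not the bash loop the commit actually adds; the step name, the use of `Start-Job`, and the `C:\procdump` install path are assumptions carried over from the earlier install step.

```yaml
# Sketch only: PowerShell variant of the per-PID attach loop and stray-dump sweep.
- name: Run the backend tests with per-PID ProcDump
  shell: powershell
  run: |
    $dumps = "${{ github.workspace }}\node-report\dumps"
    New-Item -ItemType Directory -Force -Path $dumps | Out-Null

    # (b) Background watcher: every 500 ms, attach ProcDump to each node.exe PID not seen before.
    $watcher = Start-Job -ArgumentList $dumps -ScriptBlock {
      param($dumps)
      $seen = @{}
      while ($true) {
        Get-Process -Name node -ErrorAction SilentlyContinue | ForEach-Object {
          if (-not $seen.ContainsKey($_.Id)) {
            $seen[$_.Id] = $true
            # -t writes a dump when the target process terminates
            Start-Process -FilePath "C:\procdump\procdump64.exe" `
              -ArgumentList @('-accepteula', '-ma', '-t', '-n', '3', "$($_.Id)", "`"$dumps`"") `
              -WindowStyle Hidden
          }
        }
        Start-Sleep -Milliseconds 500
      }
    }

    pnpm test
    $code = $LASTEXITCODE            # remember the test result before cleanup

    Stop-Job $watcher
    Remove-Job $watcher

    # (c) Sweep any .dmp files that landed in the workspace root into the upload directory.
    Get-ChildItem "${{ github.workspace }}" -Filter *.dmp -File |
      Copy-Item -Destination $dumps -Force

    exit $code
```

The watcher and the tests share one step on the assumption that processes started in a step do not survive into the next one. A node.exe process that lives for less than one polling interval can still slip past the watcher.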

JohnMcLear and others added 3 commits May 26, 2026 19:58

JohnMcLear mentioned this pull request May 27, 2026

test(ci): probe K — Node 22 added to Windows-no-plugins matrix as comparator #7863

Closed

JohnMcLear and others added 2 commits May 27, 2026 13:36

test(ci): comment bump on probe G to retrigger Windows runs

fb5434c

JohnMcLear wants to merge 5 commits into develop from probe-flake-procdump
