test(ci): probe G — install ProcDump as JIT debugger to capture __fastfail across the silent kill#7862

Open
JohnMcLear wants to merge 5 commits into develop from probe-flake-procdump

@JohnMcLear
Member

Summary

Install Sysinternals ProcDump as the system Just-In-Time debugger on both Windows backend matrices. When ANY process on the runner crashes or fast-fails (int 29h), the OS hands it to the JIT debugger, which writes a full memory dump to the artifact directory.

Why

Defender ruled out by #7855. First failure on that PR (run 26470378618) with DisableRealtimeMonitoring + DisableBehaviorMonitoring + DisableIOAVProtection all confirmed True produced this evidence:

| Source | Contents at kill time |
| --- | --- |
| Application log | empty (cleared pre-test, zero entries during run) |
| System log | only pre-test service-stop events |
| Defender Operational | only stale 2026-05-18 provisioning events |
| Application Error/Hang/WER | zero entries |
| OS-side tasklist (node.exe) 1s before kill | HandleCount=323 stable, ThreadCount=17 stable, WS=321 MB stable, KernelModeTime + UserModeTime growing linearly; completely healthy |

Then dead 1 second later with zero trace anywhere.

That fingerprint (silent termination with no event-log entry and a healthy OS-side process) is consistent with __fastfail (int 29h). libuv on Windows uses it for certain internal assertion failures in uv_win.c, tcp-win.c, pipe-win.c (TCP/pipe state-corruption checks). __fastfail terminates the process immediately, bypassing all user-mode notification and WER.

The only standard tool that captures state across __fastfail is a JIT-installed debugger.

What this changes

New step on both Windows backend matrices, BEFORE "Run the backend tests":

- name: Install ProcDump as JIT debugger (probe G)
  shell: powershell
  run: |
    Invoke-WebRequest "https://download.sysinternals.com/files/Procdump.zip" -OutFile "$env:TEMP\Procdump.zip"
    Expand-Archive -Path "$env:TEMP\Procdump.zip" -DestinationPath "C:\procdump" -Force
    New-Item -ItemType Directory -Force -Path "${{ github.workspace }}\node-report\dumps" | Out-Null
    & "C:\procdump\procdump64.exe" -accepteula -i -ma "${{ github.workspace }}\node-report\dumps"
  • -i registers procdump64.exe as the AeDebug handler (system JIT debugger)
  • -ma writes full memory dumps
  • Dumps land in node-report/dumps/ — picked up by the existing failure-artifact upload

Built on PR #7855

Branch is on top of probe-flake-defender-eventlog-sidecar, so it carries the Defender-off + event-log + tasklist-fix probes too. Even though Defender's ruled out, those baselines stay useful for cross-comparison.

Expected outcome

On the next silent ELIFECYCLE failure: a .dmp file in the artifact. Loadable in WinDbg with !analyze -v — that should name the function calling __fastfail and the assertion that fired.

If no .dmp is produced even on failure → the kill isn't going through user-mode exception handling at all → it's coming from the kernel (HVCI, CET violation, page-fault chain). Escalates to ETW tracing.

🤖 Generated with Claude Code

JohnMcLear and others added 3 commits May 26, 2026 19:58
…tasklist sidecar

Three orthogonal probes against the Windows silent-ELIFECYCLE flake,
landed in one PR because they're all workflow-only and complementary.

PROBE A — Defender real-time monitoring OFF for the test phase.
The kill fingerprint (silent external termination, no JS-handler
trace, no native abort report, sub-1s death window) matches
Microsoft Defender's behavioural-monitoring TerminateProcess
signature. GHA Windows runners have Defender RT enabled by default,
and rapid loopback TCP fanout is on Defender's "suspect process
behaviour" list. If kills disappear with RT off → causal, this PR
is the fix-as-mitigation; if not → Defender ruled out.
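
A minimal sketch of what a probe A step could look like, assuming the standard Defender cmdlets `Set-MpPreference` and `Get-MpPreference` are available on the runner. The step name and the `defender-before.txt` / `defender-after.txt` file names are hypothetical; the three preference names are the ones quoted elsewhere in this PR.

```yaml
# Sketch only: step name and output file names are assumptions, not the PR's actual step.
- name: Disable Defender monitoring for the test phase (sketch)
  shell: powershell
  run: |
    New-Item -ItemType Directory -Force -Path "node-report" | Out-Null
    $props = 'DisableRealtimeMonitoring', 'DisableBehaviorMonitoring', 'DisableIOAVProtection'

    # Record the state before the change so a later run can be compared against it.
    Get-MpPreference | Select-Object $props | Format-List | Out-File "node-report\defender-before.txt"

    # Turn off the three monitoring features for the duration of the job.
    Set-MpPreference -DisableRealtimeMonitoring $true -DisableBehaviorMonitoring $true -DisableIOAVProtection $true

    # Record the state after the change; all three values are expected to read True.
    Get-MpPreference | Select-Object $props | Format-List | Out-File "node-report\defender-after.txt"
```

Writing both snapshots to `node-report/` keeps the verification next to the other probe output, so a failing run can show whether the settings actually took effect.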

PROBE H — pre-test wevtutil clear + post-test event log dump.
We've never looked at the Windows event log around the kill.
`Application`, `System`, `Microsoft-Windows-Windows Defender/
Operational`, and the `Application Error`/`Application Hang`/
`Windows Error Reporting` providers between them will surface
who killed the process: Defender, Service Control Manager,
Werfault, kernel guard, etc. Clear the logs pre-test so
signal-to-noise is high; dump post-test regardless of pass/fail.
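
The clear-then-dump pattern described above could be sketched as two steps around the test run. This assumes `wevtutil cl` and `wevtutil qe` behave as documented on the runner; the step names and the 200-entry cap (`/c:200`) are assumptions, while the three output file names follow the `event-*.txt` names used in this commit message.

```yaml
# Sketch only: step names and the /c:200 cap are assumptions.
- name: Clear event logs before tests (sketch)
  shell: powershell
  run: |
    foreach ($log in 'Application', 'System', 'Microsoft-Windows-Windows Defender/Operational') {
      # cl = clear-log; anything present afterwards was written during the test phase.
      wevtutil cl "$log"
    }

- name: Dump event logs after tests (sketch)
  if: always()
  shell: powershell
  run: |
    New-Item -ItemType Directory -Force -Path "node-report" | Out-Null
    # qe = query-events; /rd:true returns newest entries first, /f:text gives readable output.
    wevtutil qe Application /f:text /rd:true /c:200 | Out-File "node-report\event-application.txt"
    wevtutil qe System /f:text /rd:true /c:200 | Out-File "node-report\event-system.txt"
    wevtutil qe "Microsoft-Windows-Windows Defender/Operational" /f:text /rd:true /c:200 | Out-File "node-report\event-defender.txt"
```

Gating the dump step on `always()` is what makes the logs available regardless of pass or fail.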

PROBE I — tasklist sidecar fix (latent bug from PR #7846).
The bash `tasklist /v /fi "imagename eq node.exe" /fo csv`
produced empty output on the runner — git-bash mangles tasklist's
UTF-16-LE-with-BOM output. Switch to PowerShell's
Get-CimInstance Win32_Process with explicit columns. This gives
us the OS-side equivalent of the libuv handle table (HandleCount,
ThreadCount, WorkingSetSize, PageFileUsage, KernelModeTime,
UserModeTime) sampled every 500 ms. When Node's `_getActiveHandles`
goes silent during the V8 starvation window, the OS still
sees the process; this captures that view.
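
A rough sketch of the sidecar idea, assuming the sampler runs as a PowerShell background job inside the same step as the tests (a background job would not survive into a later step). The step name, the `tasklist-sidecar.csv` file name, and the bare `pnpm test` invocation are assumptions; the `Win32_Process` columns are the ones listed above.

```yaml
# Sketch only: step name, CSV file name and test command are assumptions.
- name: Run tests with a process-metrics sidecar (sketch)
  shell: powershell
  run: |
    New-Item -ItemType Directory -Force -Path "node-report" | Out-Null
    $out = Join-Path (Get-Location) "node-report\tasklist-sidecar.csv"

    # Background job: sample every node.exe process twice a second.
    $sampler = Start-Job -ArgumentList $out -ScriptBlock {
      param($out)
      while ($true) {
        Get-CimInstance Win32_Process -Filter "Name = 'node.exe'" |
          Select-Object @{n='Time';e={(Get-Date).ToString('o')}}, ProcessId, HandleCount, ThreadCount, WorkingSetSize, PageFileUsage, KernelModeTime, UserModeTime |
          Export-Csv -Path $out -Append -NoTypeInformation
        Start-Sleep -Milliseconds 500
      }
    }

    pnpm test
    $code = $LASTEXITCODE

    # Stop sampling and propagate the test exit code.
    Stop-Job $sampler
    Remove-Job $sampler
    exit $code
```

Because the rows are timestamped, the last samples before a kill can be lined up against the test log to see whether the process looked healthy right up to its death.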

All three additions land in node-report/ which the existing
artifact upload picks up on failure. No test-code changes.
No new dependencies.

Expected outcomes:
  - Defender root cause: Win-with-plugins flake rate drops materially
    over 5+ runs. event-defender.txt shows pre-kill threat-detection
    entries on the kills that DO still happen.
  - Defender not the root cause: event-application.txt /
    event-system.txt names the actual terminator (Service Control
    Manager, kernel, Werfault). Probe G (procdump) is the next step.
  - Neither: kernel-level kill bypassing all event logging — escalates
    to ETW tracing or a procdump on kill-detect trigger.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The first artifact upload step has `if: failure()` so we only see
node-report data on failure. For the Defender hypothesis (PR #7855)
we need to compare event-defender.txt between a passing run (baseline)
and a future failing run (kill signature) — otherwise N=1 captures
can't be evaluated. Add a second upload step gated on `always()`
that uploads only the small text files (event-*.txt, defender-*.txt)
on every run regardless of outcome. The unique `-${{ github.run_attempt }}`
suffix lets reruns accumulate separate artifacts for comparison.

Each artifact is only a few KB, so this doesn't materially impact storage.
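
The second upload step described above might look roughly like this. It is a sketch under stated assumptions: the artifact name prefix `win-diagnostics` and the `actions/upload-artifact@v4` version are not taken from the PR, and in a matrix job the name would also need a per-job component to stay unique.

```yaml
# Sketch only: artifact name prefix and action version are assumptions.
- name: Upload event-log and Defender diagnostics (sketch)
  if: always()
  uses: actions/upload-artifact@v4
  with:
    # run_attempt keeps reruns from colliding, so each attempt leaves its own artifact.
    name: win-diagnostics-${{ github.run_attempt }}
    path: |
      node-report/event-*.txt
      node-report/defender-*.txt
    if-no-files-found: ignore
```

Restricting `path` to the two text-file patterns is what keeps the always-on artifact small while the full `node-report/` upload stays gated on failure.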

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…capture

Probe A (PR #7855) ruled out Defender as the killer: with
DisableRealtimeMonitoring + DisableBehaviorMonitoring + DisableIOAVProtection
all = True, the silent ELIFECYCLE still fired (run 26470378618, Win
without plugins, `pad.ts > Tests > creates a new Pad with empty text`,
kill +470 ms post-test-start, exit 255). The captured event logs
showed:

  - Application log: empty (zero entries during test phase)
  - System log: only pre-test service stops; no SCM TerminateProcess
  - Defender Operational: only stale 2026-05-18 runner provisioning
  - Application Error / Hang / WER: zero entries

The fixed tasklist sidecar showed the dying Node process (PID 7036)
was completely healthy 1 second before death: HandleCount=323 stable,
ThreadCount=17 stable, WorkingSetSize ~321 MB stable, KernelModeTime
and UserModeTime growing linearly. No anomaly in OS-side process
state. Then dead 1 second later with zero entry in any Windows
event log.

That fingerprint — silent external termination with no event-log
trace and no anomaly in OS-side state — matches `__fastfail` (the
`int 29h` fast-fail intrinsic). libuv on Windows uses `__fastfail`
for certain internal assertion failures in its TCP and pipe paths
(uv_win.c, tcp-win.c, pipe-win.c). When triggered, it immediately
terminates the process bypassing all user-mode notification
including WER. The only standard tool that catches state across
__fastfail is a JIT-installed debugger.

Install Sysinternals ProcDump as the system JIT debugger:
  - downloads procdump.zip from sysinternals.com
  - extracts to C:\procdump
  - `-i -ma` registers as the AeDebug handler, configured for full
    memory dumps
  - dumps land in node-report/dumps/ which the existing failure
    artifact picks up

On the next silent ELIFECYCLE this captures a .dmp file with full
call stack across the kill — loadable in WinDbg with
"!analyze -v" to see the libuv assertion (or whatever else) that
fired the fast-fail. That should be the final word on what's killing
the process.

Built on top of probe-flake-defender-eventlog-sidecar (#7855)
because the event-log capture + sidecar fix are useful baselines
even after Defender's ruled out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@qodo-free-for-open-source-projects

Review Summary by Qodo

Add ProcDump JIT debugger and Windows diagnostics probes

🧪 Tests ✨ Enhancement

Walkthroughs

Description
• Install ProcDump as JIT debugger to capture __fastfail crashes
• Disable Windows Defender real-time monitoring during tests (probe A)
• Clear and dump Windows Event Logs pre/post-test (probe H)
• Fix tasklist sidecar to use PowerShell instead of bash (probe I)
• Upload Defender and Event Log diagnostics as artifacts
Diagram
flowchart LR
  A["Test Setup"] --> B["Install ProcDump<br/>as JIT Debugger"]
  A --> C["Disable Defender<br/>Real-time Monitoring"]
  A --> D["Clear Event Logs"]
  B --> E["Run Backend Tests"]
  C --> E
  D --> E
  E --> F["Dump Event Logs<br/>Application/System/Defender"]
  E --> G["Verify Defender State"]
  F --> H["Upload Diagnostics<br/>Artifacts"]
  G --> H

File Changes

1. .github/workflows/backend-tests.yml 🧪 Tests +184/-32

Add Windows diagnostics and JIT debugger probes

• Added ProcDump JIT debugger installation step (probe G) to both Windows backend test jobs,
 downloading and registering procdump64.exe with -i -ma flags to capture full memory dumps on
 process crashes
• Implemented Defender real-time monitoring disable step (probe A) before tests with state
 verification before and after
• Added event log clearing pre-test and comprehensive post-test event log dumping (probe H) for
 Application, System, Defender Operational, and WER logs
• Fixed tasklist sidecar command (probe I) replacing bash tasklist with PowerShell
 Get-CimInstance to properly capture process metrics including HandleCount and WorkingSetSize
• Added new artifact upload step to always capture Defender and Event Log diagnostics regardless of
 test pass/fail status


@qodo-free-for-open-source-projects

qodo-free-for-open-source-projects Bot commented May 27, 2026

Code Review by Qodo

🐞 Bugs (4) 📘 Rule violations (0) 📎 Requirement gaps (0)

Remediation recommended

1. Unverified ProcDump download 🐞 Bug ⛨ Security
Description
The workflow downloads ProcDump from the internet and executes it with system-wide effects (-i
installs a JIT debugger) without any integrity or signature verification, creating a CI supply-chain
execution risk.
Code

.github/workflows/backend-tests.yml[R233-240]

Evidence
The workflow downloads ProcDump from an external URL, extracts it, and executes it to install as the
system JIT debugger, with no intervening hash/signature checks.

.github/workflows/backend-tests.yml[217-243]
.github/workflows/backend-tests.yml[427-453]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The workflow downloads `Procdump.zip` and immediately executes `procdump64.exe` to install it as the system JIT debugger. Without validating the downloaded content (hash/signature), a compromised upstream or interception could result in executing attacker-controlled code in CI.
## Issue Context
This happens in both Windows jobs (`withoutpluginsWindows` and `withpluginsWindows`).
## Fix Focus Areas
- .github/workflows/backend-tests.yml[233-243]
- .github/workflows/backend-tests.yml[443-453]
## Suggested fix
- After download+extract, validate the binary before running it:
  - Prefer verifying the Authenticode signature: `Get-AuthenticodeSignature C:\procdump\procdump64.exe` and require `Status -eq 'Valid'`.
  - Optionally also pin a known SHA256 for the zip or exe via `Get-FileHash` and compare to a constant.
  - If validation fails, `Write-Error` and `exit 1` to prevent executing untrusted code.
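
One way the suggested validation could be wired in, as a sketch rather than a vetted implementation. `Get-AuthenticodeSignature` and `Get-FileHash` are standard PowerShell cmdlets; the step name, the extra signer-subject check, and the `PROCDUMP_SHA256` environment variable are assumptions introduced for illustration.

```yaml
# Sketch only: step name, signer check and PROCDUMP_SHA256 are assumptions.
- name: Verify ProcDump before running it (sketch)
  shell: powershell
  run: |
    $exe = "C:\procdump\procdump64.exe"

    # Require a valid Authenticode signature on the extracted binary.
    $sig = Get-AuthenticodeSignature -FilePath $exe
    if ($sig.Status -ne 'Valid') {
      Write-Error "ProcDump signature status is '$($sig.Status)'"
      exit 1
    }

    # 'Valid' only proves some trusted signer; also check that the signer looks like Microsoft.
    if ($sig.SignerCertificate.Subject -notmatch 'Microsoft') {
      Write-Error "Unexpected signer: $($sig.SignerCertificate.Subject)"
      exit 1
    }

    # Optional pin: compare against a SHA256 supplied through a (hypothetical) env variable.
    if ($env:PROCDUMP_SHA256) {
      $actual = (Get-FileHash -Path $exe -Algorithm SHA256).Hash
      if ($actual -ne $env:PROCDUMP_SHA256) {
        Write-Error "ProcDump SHA256 mismatch: $actual"
        exit 1
      }
    }
```

A pinned hash breaks whenever Sysinternals ships a new build, so the signature check is the part that can stay on permanently; the pin is better treated as opt-in.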


2. Full dump artifact exposure 🐞 Bug ⛨ Security
Description
ProcDump is configured to write full memory dumps (-ma) into node-report/dumps, and the existing
failure artifact upload publishes the entire node-report/ directory, which can expose sensitive
in-memory data in CI artifacts.
Code

.github/workflows/backend-tests.yml[R236-240]

Evidence
The ProcDump command explicitly requests full dumps and writes them under node-report/dumps, and
the workflow uploads node-report/ on failure, which will include those dumps.

.github/workflows/backend-tests.yml[217-243]
.github/workflows/backend-tests.yml[332-340]
.github/workflows/backend-tests.yml[427-453]
.github/workflows/backend-tests.yml[542-550]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The workflow creates full memory dumps (`-ma`) in a directory that is subsequently uploaded as an artifact on failure. Full dumps can include credentials/tokens/env values and other sensitive runtime data.
## Issue Context
Dumps are written to `${{ github.workspace }}\node-report\dumps`, and failures upload `node-report/` wholesale.
## Fix Focus Areas
- .github/workflows/backend-tests.yml[236-240]
- .github/workflows/backend-tests.yml[332-340]
- .github/workflows/backend-tests.yml[446-450]
- .github/workflows/backend-tests.yml[542-550]
## Suggested fix
One (or combine multiple):
- Prefer a smaller dump type unless full dumps are strictly required (e.g., use a minidump option instead of `-ma`).
- Keep dumps out of the default `node-report/` artifact path (upload them under a separate artifact name with stricter conditions/retention).
- Gate dump upload behind an explicit opt-in (e.g., a workflow input, a repo variable, or a branch/owner condition) so routine pushes do not publish dumps.
- If keeping dumps, consider reducing `retention-days` for dump artifacts compared to other logs.
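
A sketch combining three of the options above: a separate artifact, an explicit opt-in, and shorter retention. The `crash-dumps/` directory, the `UPLOAD_CRASH_DUMPS` repository variable, the artifact name, and the three-day retention are all hypothetical; ProcDump would have to be pointed at that directory instead of `node-report/dumps` for this to pick anything up.

```yaml
# Sketch only: directory, variable name, artifact name and retention are assumptions.
- name: Upload crash dumps (sketch)
  # Only on failure, and only when a maintainer has opted in via a repository variable.
  if: failure() && vars.UPLOAD_CRASH_DUMPS == 'true'
  uses: actions/upload-artifact@v4
  with:
    name: crash-dumps-${{ github.run_attempt }}
    path: crash-dumps/*.dmp
    # Keep full memory dumps for a shorter time than ordinary logs.
    retention-days: 3
    if-no-files-found: ignore
```

Keeping dumps outside `node-report/` means the existing wholesale upload never publishes them by accident.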


3. ProcDump install not checked 🐞 Bug ☼ Reliability
Description
The ProcDump installation step does not check the external process exit code, so procdump64.exe -i
can fail (leaving no JIT debugger installed) while the step still appears successful.
Code

.github/workflows/backend-tests.yml[R233-243]

Evidence
The step invokes procdump64.exe and then only prints AeDebug registry keys; it never checks
$LASTEXITCODE nor asserts that AeDebug changed to ProcDump.

.github/workflows/backend-tests.yml[217-243]
.github/workflows/backend-tests.yml[427-453]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The workflow invokes `procdump64.exe` but does not validate `$LASTEXITCODE` (or otherwise assert registry changes). PowerShell will not automatically fail the step on a non-zero exit code from an external executable.
## Issue Context
The step’s purpose is diagnostic; silent failure defeats the probe and wastes CI cycles.
## Fix Focus Areas
- .github/workflows/backend-tests.yml[233-243]
- .github/workflows/backend-tests.yml[443-453]
## Suggested fix
- Immediately after invoking ProcDump, check `$LASTEXITCODE` and either:
  - `if ($LASTEXITCODE -ne 0) { Write-Error "ProcDump JIT install failed ($LASTEXITCODE)"; exit $LASTEXITCODE }`, or
  - emit a loud warning to the log/step summary and continue if you intentionally don’t want to fail the job.
- Optionally parse/validate the `AeDebug` registry values and fail/warn if they don’t reference ProcDump.
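
A sketch of the check, taking the warn-on-exit-code option and making the registry assertion the hard failure. This is a design assumption: the PR does not say which exit code `procdump64.exe -i` returns on success, so the registry value is treated as the more reliable signal. The step name is hypothetical; the key path is the standard 64-bit AeDebug location.

```yaml
# Sketch only: step name is an assumption; exit-code semantics of `procdump -i` are unverified.
- name: Install ProcDump as JIT debugger and verify (sketch)
  shell: powershell
  run: |
    & "C:\procdump\procdump64.exe" -accepteula -i -ma "${{ github.workspace }}\node-report\dumps"
    if ($LASTEXITCODE -ne 0) {
      # Warn rather than fail: a non-zero code may not always mean the install failed.
      Write-Warning "procdump64.exe -i exited with code $LASTEXITCODE"
    }

    # Hard assertion: the AeDebug Debugger value must now reference ProcDump.
    $key = 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\AeDebug'
    $debugger = (Get-ItemProperty -Path $key -Name Debugger -ErrorAction SilentlyContinue).Debugger
    if ($debugger -notmatch 'procdump') {
      Write-Error "AeDebug Debugger does not reference ProcDump: '$debugger'"
      exit 1
    }
    Write-Host "AeDebug Debugger: $debugger"
```

Printing the registered command line also exposes which dump path was actually recorded, which is useful when dumps later turn up somewhere unexpected.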


Advisory comments

4. Potential 32-bit JIT gap 🐞 Bug ☼ Reliability
Description
Only procdump64.exe is installed as the JIT debugger, so crashes from 32-bit processes (which can
consult the WOW6432Node AeDebug key) may not be captured by ProcDump.
Code

.github/workflows/backend-tests.yml[R240-243]

Evidence
The step runs procdump64.exe -i but separately reads both the 64-bit AeDebug registry key and the
WOW6432Node key, indicating awareness of two paths while only installing one executable.

.github/workflows/backend-tests.yml[217-243]
.github/workflows/backend-tests.yml[427-453]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The workflow installs only `procdump64.exe` as the JIT debugger but also queries the WOW6432Node AeDebug key. If any relevant process is 32-bit, dumps might not be produced.
## Issue Context
This is likely low-impact if everything under test is 64-bit, but it’s easy to harden.
## Fix Focus Areas
- .github/workflows/backend-tests.yml[240-243]
- .github/workflows/backend-tests.yml[450-453]
## Suggested fix
- Also install the 32-bit handler (e.g., run `procdump.exe -i ...` if present in the zip) or otherwise ensure both 64-bit and 32-bit AeDebug registrations point to ProcDump.
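
A hedged sketch of the hardening: inspect the WOW6432Node key first, and only fall back to the 32-bit executable if that key does not already point at ProcDump. It assumes `procdump.exe` was extracted next to `procdump64.exe` and accepts the same `-i -ma` arguments; neither is confirmed by the PR, and the step name is hypothetical.

```yaml
# Sketch only: presence and behaviour of the 32-bit procdump.exe are assumptions.
- name: Check 32-bit AeDebug registration (sketch)
  shell: powershell
  run: |
    $key32 = 'HKLM:\SOFTWARE\WOW6432Node\Microsoft\Windows NT\CurrentVersion\AeDebug'
    $dbg32 = (Get-ItemProperty -Path $key32 -Name Debugger -ErrorAction SilentlyContinue).Debugger

    if ($dbg32 -match 'procdump') {
      Write-Host "32-bit AeDebug already references ProcDump: $dbg32"
    } elseif (Test-Path "C:\procdump\procdump.exe") {
      # Fall back to registering the 32-bit binary for WOW64 processes.
      & "C:\procdump\procdump.exe" -accepteula -i -ma "${{ github.workspace }}\node-report\dumps"
      Write-Host "procdump.exe -i exited with code $LASTEXITCODE"
    } else {
      Write-Warning "No 32-bit ProcDump found; crashes in 32-bit processes may not be captured."
    }
```

If everything under test is 64-bit, the warning branch is harmless and simply documents the gap in the log.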


JohnMcLear and others added 2 commits May 27, 2026 13:36
…m workspace root

Rev 2 (PR #7862, run 26511524556) confirmed ProcDump was successfully
registered as JIT debugger via -i. The Win+plugins job then failed
with the silent ELIFECYCLE fingerprint, but NO .dmp file was
captured in the artifact. Two problems:

  1. The registered AeDebug command used -j with cwd (workspace
     root) as the dump path, not the dumps subdirectory I'd
     intended. So if a dump WAS written, it went to
     D:\a\etherpad\etherpad\<pid>.dmp, outside my upload path.

  2. More importantly: AeDebug only fires for unhandled SEH /
     __fastfail / WER-classified crashes. The fact that NOTHING
     fired tells us the kill class bypasses all of those.

Rev 3 attacks both problems:

  (a) Push-Location to node-report/dumps before procdump -i so the
      cwd at install time is the dumps subdirectory. Future AeDebug-
      triggered dumps land where the artifact upload picks them up.

  (b) Adds an attached procdump per node.exe pid. A bash background
      loop polls Get-Process node every 500 ms and spawns
      `procdump -ma -t -n 3 <pid> dumps/` for each new PID. The -t
      flag dumps on process TERMINATION — including external
      TerminateProcess — which AeDebug never sees.

  (c) After pnpm test exits, the test step now walks the workspace
      root for any stray .dmp files and copies them into the upload
      directory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
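
The three Rev 3 changes can be sketched together in one step. This is an illustration under assumptions, not the PR's actual code: the commit message describes a bash background loop, while this sketch uses a PowerShell background job for the same polling; the step name and the bare `pnpm test` invocation are hypothetical. The ProcDump flags (`-i`, `-ma`, `-t`, `-n 3`) are the ones named in the commit message.

```yaml
# Sketch only: PowerShell job stands in for the bash loop; step name and test command are assumptions.
- name: Run tests with per-PID ProcDump attach (sketch)
  shell: powershell
  run: |
    $procdump = "C:\procdump\procdump64.exe"
    $dumps = "${{ github.workspace }}\node-report\dumps"
    New-Item -ItemType Directory -Force -Path $dumps | Out-Null

    # (a) Register the JIT debugger with the dumps directory as cwd.
    Push-Location $dumps
    & $procdump -accepteula -i -ma $dumps
    Pop-Location

    # (b) Attach a terminate-triggered procdump to every new node.exe PID.
    $watcher = Start-Job -ArgumentList $procdump, $dumps -ScriptBlock {
      param($procdump, $dumps)
      $seen = @{}
      while ($true) {
        foreach ($p in Get-Process -Name node -ErrorAction SilentlyContinue) {
          if (-not $seen.ContainsKey($p.Id)) {
            $seen[$p.Id] = $true
            # Assumes the dump path contains no spaces (Start-Process does not quote arguments).
            Start-Process -FilePath $procdump -WindowStyle Hidden -ArgumentList @('-accepteula', '-ma', '-t', '-n', '3', "$($p.Id)", $dumps)
          }
        }
        Start-Sleep -Milliseconds 500
      }
    }

    pnpm test
    $code = $LASTEXITCODE
    Stop-Job $watcher
    Remove-Job $watcher

    # (c) Sweep stray dumps from the workspace root into the upload directory.
    Get-ChildItem -Path "${{ github.workspace }}" -Filter *.dmp -File |
      Copy-Item -Destination $dumps -Force
    exit $code
```

With a 500 ms poll, a process that starts and dies between two samples is never attached, so a kill that lands very early in a process's life could still go uncaptured by (b).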