fix(test): enable HTTP keep-alive on the global agent for backend tests#7852
fix(test): enable HTTP keep-alive on the global agent for backend tests#7852JohnMcLear wants to merge 1 commit into
Conversation
The Windows backend-test silent-ELIFECYCLE flake (silent exit 255 mid test phase, no JS-handler trace, no native abort report; documented by PRs #7838, #7842) correlates with localhost server-side TIME_WAIT socket accumulation. The OS-level netstat sidecar added in PR #7846 captured the smoking gun in run 26419467467: netstat poll server-side TIME_WAIT on [::1]:50398 ---------------- ------------------------------------- 21:02:18.337 56 21:02:20.765 103 21:02:23.204 136 21:02:25.662 169 21:02:28.146 201 21:02:30.638 228 ← silent kill 37 ms later That's ~14 server-active-close TIME_WAITs per second growing linearly toward the kill. All TIME_WAITs are on the Etherpad listening port — the SERVER is the active closer, because supertest opens a fresh TCP connection per request and the server sends FIN after the response. The hard kernel TIME_WAIT cap (Windows default ~2000 free TCBs) was not exceeded in this capture, but 228 half-dead handles forces libuv's IOCP socket-table scan to walk a much larger working set on every completion. The kill cluster is concentrated on tests that perform rapid sequential HTTP roundtrips (importexportGetPost.ts, pad.ts, import.ts DOCX round-trips) — exactly the pattern that grows TIME_WAIT fastest. Setting `http.globalAgent.keepAlive = true` in the backend-test common bootstrap makes supertest's underlying http.request reuse a single TCP connection for sequential requests to the same origin. Connection reuse collapses the TIME_WAIT churn from ~14/s to nearly zero — each test no longer leaves a half-dead socket behind. Linux's TCP recycling is fast enough that the same load doesn't symptomize there, so this keep-alive is a Windows-targeted mitigation that's also a strict improvement on Linux (less socket churn = less work overall). This is a real behavior change scoped to the test process — tests share a long-lived connection rather than opening fresh ones — so the shape of any race that depended on per-request connect/disconnect cycles will shift. None of the existing backend tests assert on that, but the change is observable and is being landed deliberately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review Summary by QodoEnable HTTP keep-alive to fix Windows backend-test silent-ELIFECYCLE flake
WalkthroughsDescription• Enable HTTP keep-alive on global agent to fix Windows backend-test flake • Prevents TIME_WAIT socket accumulation from rapid sequential requests • Reduces IOCP socket-table scan overhead on Windows test execution • Adds detailed diagnostic comments explaining root cause analysis Diagramflowchart LR
A["Rapid sequential HTTP requests"] -->|"without keep-alive"| B["Fresh TCP per request"]
B --> C["Server-side TIME_WAIT accumulation"]
C --> D["Large IOCP socket-table scan"]
D --> E["Silent ELIFECYCLE kill"]
A -->|"with keep-alive=true"| F["TCP connection reuse"]
F --> G["Minimal TIME_WAIT churn"]
G --> H["Reduced IOCP overhead"]
File Changes1. src/tests/backend/common.ts
|
Code Review by Qodo
1. Keep-alive breaks shutdown timeout
|
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
| // Enable HTTP keep-alive on the global agent for the test process. Without | ||
| // this, every supertest request opens a fresh TCP connection and the server | ||
| // closes it on response — the server side then enters TIME_WAIT for the | ||
| // default Windows TcpTimedWaitDelay (~120 s) before the ephemeral port is | ||
| // freed. | ||
| // | ||
| // The Windows backend-test job's OS-level netstat sidecar (PR #7846) | ||
| // captured the smoking gun for the silent-ELIFECYCLE flake in run | ||
| // 26419467467: localhost server-side TIME_WAIT counts on the Etherpad | ||
| // listening port climbed linearly at ~14/s, reaching 228 active TIME_WAIT | ||
| // entries on `[::1]:50398` 37 ms before the kill — all server-active-close | ||
| // half-dead sockets, all from rapid sequential supertest requests with no | ||
| // connection reuse. The kill cluster on Windows + Node 24 + plugins | ||
| // correlates tightly with this TIME_WAIT accumulation: it gives libuv a | ||
| // large pool of half-dead handles to walk on every IOCP completion. | ||
| // | ||
| // Setting keepAlive=true on http.globalAgent makes supertest's underlying | ||
| // http.request reuse a single TCP connection for sequential requests to | ||
| // the same origin, collapsing TIME_WAIT churn from ~14/s to nearly zero. | ||
| // Linux is unaffected; the flake was Windows-only because Linux's | ||
| // TIME_WAIT recycling is much faster and the kernel can sustain higher | ||
| // half-dead-socket counts without symptom. | ||
| http.globalAgent.keepAlive = true; | ||
|
|
There was a problem hiding this comment.
1. Keep-alive breaks shutdown timeout 🐞 Bug ☼ Reliability
Setting http.globalAgent.keepAlive=true can leave an idle HTTP connection open when backend test teardown calls server.exit(), and server.stop() will fail if shutdown hooks exceed its 3s timeout. express.closeServer() explicitly allows up to 5s before it forcibly destroys remaining sockets, so this change can cause teardown to error with "Timed out waiting for shutdown tasks" and fail the test run.
Agent Prompt
## Issue description
`src/tests/backend/common.ts` enables keep-alive on `http.globalAgent` at module load, but the test teardown does not close/destroy the agent’s sockets before calling `server.exit()`. Etherpad shutdown (`server.stop()`) enforces a 3s timeout for shutdown hooks, while the HTTP server shutdown path (`closeServer()`) may legitimately take up to 5s before it force-terminates remaining sockets.
This creates a new failure mode: an idle keep-alive connection can keep the HTTP server from fully closing within 3 seconds, causing `server.stop()` to throw `Timed out waiting for shutdown tasks` and the test process to exit non-zero.
## Issue Context
- Backend tests start a server and create a long-lived supertest client.
- With keep-alive enabled globally, it is normal for at least one connection to remain open briefly after the last request.
- The shutdown path already anticipates lingering connections (5s force-destroy), but the higher-level shutdown timeout is only 3s.
## Fix Focus Areas
- src/tests/backend/common.ts[156-161]
- src/tests/backend/common.ts[22-45]
### Suggested implementation direction
In the existing `after(async function () { ... })` teardown in `common.ts`, explicitly close keep-alive sockets before `await server.exit()`:
- Call `http.globalAgent.destroy()` (and optionally restore `http.globalAgent.keepAlive` to its previous value).
This keeps the keep-alive behavior during tests, but ensures shutdown hooks complete quickly and deterministically.
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
CI Feedback 🧐A test triggered by this PR failed. Here is an AI-generated analysis of the failure:
|
Summary
One-line behaviour change in
src/tests/backend/common.ts: sethttp.globalAgent.keepAlive = trueat module load. Targeted at the Windows backend-test silent-ELIFECYCLE flake we've been chasing across #7838, #7842, #7846.Smoking gun from PR #7846
The OS-level netstat sidecar in #7846 captured
run 26419467467 Win-no-plugins. localhost server-side TIME_WAIT count on the Etherpad listening port[::1]:50398:~14 server-active-close TIME_WAITs per second, growing linearly. ESTABLISHED count is ~zero because each supertest request opens-and-closes a fresh TCP. The SERVER is the active closer (sends FIN after response), so the SERVER side enters TIME_WAIT — for ~120 s on Windows default
TcpTimedWaitDelay.228 half-dead sockets isn't enough to hit the Windows free-TCB cap (~2000), but it forces libuv's IOCP socket-table scan to walk a much larger working set on every TCP completion. The kill cluster is concentrated on tests that do rapid sequential HTTP roundtrips —
importexportGetPost.ts > Import authorization checks,pad.ts > Tests,import.ts > DOCX export -> import round-trip, all of which match this profile.The fix
http.globalAgent.keepAlive = truemakes supertest's underlyinghttp.requestreuse a single TCP connection for sequential requests to the same origin. Each test no longer leaves a TIME_WAIT socket behind. Churn collapses from ~14/s to ~0.Linux is unaffected by the flake because its TCP recycling and kernel half-dead socket handling are far more permissive. This change is a strict improvement there too (less socket churn = less work overall).
What this does NOT do
diagnostics.ts— the heartbeat / boundary / mid-test reports stay exactly as merged in test(ci): heartbeat + running-test pointer to debug the silent backend-test ELIFECYCLE #7838 + test(ci): schedule a mid-test snapshot 150ms after every beforeEach #7842. If this fix doesn't fully resolve the flake, we still capture death data.Test plan
🤖 Generated with Claude Code