
CI timeout reliability improvements for flaky test failures#1194

Open
brooke-hamilton wants to merge 2 commits into devcontainers:main from brooke-hamilton:test-timeouts

Conversation

@brooke-hamilton

Summary

This PR adds timeout guards and retry logic to prevent CI failures caused by transient network timeouts when tests make OCI registry calls.

Problem

Recent non-dependabot CI runs have been failing due to mocha test timeouts on network operations. Every failure in the last two weeks was caused by a test exceeding its timeout while making HTTP calls to OCI registries like ghcr.io. Examples:

Run #4064 (PR #1189) — containerFeaturesOrder.test.ts

12 passing (1m)
2 failing

1) Feature Dependencies > dependsOn > valid dependsOn with published oci Features:
   Error: Timeout of 20000ms exceeded.

Run #4063 (PR #1188) — featuresCLICommands.test.ts

27 passing (3m)
2 failing

1) test functions getVersionsStrictSorted and getPublishedTags > should list published versions:
   Error: Timeout of 2000ms exceeded.

This test had no this.timeout() set, so it inherited mocha's 2-second default — far too short for network I/O.
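For context, a mocha test opts into a longer timeout by calling `this.timeout()` inside a regular `function` (arrow functions have no `this` binding, so the call is unavailable there). The sketch below is illustrative only, not code from this repository; the timeout value and test body are hypothetical:

```typescript
import { describe, it } from 'mocha';

describe('getVersionsStrictSorted', function () {
    // Without this call, mocha's 2000ms default applies, which a single
    // HTTP round trip to ghcr.io can easily exceed.
    this.timeout(30000); // hypothetical value; the PR's global default is 360000ms

    it('should list published versions', async function () {
        // network call to an OCI registry goes here
    });
});
```

Tests that omit this call are exactly the ones that fall back to whatever global default mocha is configured with, which is what the `.mocharc.yml` change below addresses.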

Run #4062 (PR #1183) — featureHelpers.test.ts

277 passing (10m)  
2 failing

1) validate processFeatureIdentifier > should process oci registry (with a digest):
   Error: Timeout of 4000ms exceeded.

Earlier dependabot runs also experienced multi-hour hangs (1.5–3+ hours) before eventually failing, wasting CI resources.

Solution

1. Add --retries 1 to the test-matrix npm script

- "test-matrix": "env TS_NODE_PROJECT=src/test/tsconfig.json mocha -r ts-node/register --exit"
+ "test-matrix": "env TS_NODE_PROJECT=src/test/tsconfig.json mocha -r ts-node/register --exit --retries 1"

This automatically retries any transiently failing test once before marking it as failed. This alone would have prevented all 4 failing jobs across the 3 recent non-dependabot runs.

2. Add .mocharc.yml with a 6-minute global timeout

timeout: 360000  # 6 minutes global safety net

Tests with explicit this.timeout() calls override this, but tests without a timeout (like the getVersionsStrictSorted tests) now get a reasonable default instead of 2 seconds.
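For reference, a sketch of what the new `.mocharc.yml` plausibly contains. Only the `timeout` key is confirmed by this PR; the comments are explanatory:

```yaml
# .mocharc.yml — picked up automatically by every mocha invocation in the repo.
# A per-test `this.timeout(...)` call still takes precedence over this value.
timeout: 360000  # 6 minutes, in milliseconds
```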

3. Add timeout-minutes to all GitHub Actions jobs

| Workflow | Job | Timeout |
| --- | --- | --- |
| dev-containers.yml | cli | 10 min |
| dev-containers.yml | tests-matrix | 30 min |
| dev-containers.yml | features-registry-compatibility | 10 min |
| dev-containers.yml | install-script | 10 min |
| test-windows.yml | tests-matrix | 15 min |
| test-docker-v29.yml | test-docker-v29 | 20 min |
| test-docker-v20.yml | test-docker-v20 | 20 min |

This prevents jobs from hanging for hours (as seen in dependabot runs #4055 and #4056, which each ran for more than 1h 50m).
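In a GitHub Actions workflow, `timeout-minutes` sits at the job level; a hedged sketch of the `tests-matrix` job shape (the steps and action versions are illustrative, only the timeout value comes from this PR):

```yaml
jobs:
  tests-matrix:
    runs-on: ubuntu-latest
    timeout-minutes: 30   # the runner cancels the job if it exceeds this
    steps:
      - uses: actions/checkout@v4
      - run: yarn install --frozen-lockfile   # illustrative steps
      - run: yarn test-matrix
```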

Files Changed

  • .github/workflows/dev-containers.yml — Added timeout-minutes to 4 jobs
  • .github/workflows/test-windows.yml — Added timeout-minutes: 15
  • .github/workflows/test-docker-v29.yml — Added timeout-minutes: 20
  • .github/workflows/test-docker-v20.yml — Added timeout-minutes: 20
  • .mocharc.yml — New file with 6-minute global timeout
  • package.json — Added --retries 1 to test-matrix script

Impact

| Recent Failure | Root Cause | Fixed by --retries 1? | Fixed by .mocharc.yml? |
| --- | --- | --- | --- |
| Run #4064 (20s timeout) | OCI network latency | ✅ Yes | No (overridden by test) |
| Run #4063 job 1 (2s timeout) | Missing this.timeout() | ✅ Yes | ✅ Yes (directly) |
| Run #4063 job 2 (4s timeout) | OCI network latency | ✅ Yes | No (overridden by test) |
| Run #4062 (4s timeout) | OCI network latency | ✅ Yes | No (overridden by test) |

All recent non-dependabot failures would be prevented by these changes.

Signed-off-by: Brooke Hamilton <45323234+brooke-hamilton@users.noreply.github.com>
@brooke-hamilton brooke-hamilton requested a review from a team as a code owner April 8, 2026 22:50
