fix: Prevent CI status getting stuck in 'pending' state (race condition) #952

cfsmp3 · 2025-12-23T16:08:52Z

Summary

This PR fixes a critical race condition that caused GitHub CI status to remain permanently stuck as "pending" / "Tests queued" even after tests had actually failed. This issue was discovered while investigating CCExtractor PR #1880.

Key fixes:

Prevent tests from being queued before artifacts are available
Ensure GitHub status is always updated, even if database operations fail
Improve artifact search with proper pagination handling

Bug Analysis

Real-World Case: CCExtractor PR #1880

The CI status for PR #1880 showed "Waiting for status to be reported — Tests queued" for both Linux and Windows, despite the tests having actually failed hours earlier.

Timeline of Events (2025-12-23, all times UTC)

Time	Event	GitHub Status
10:29:32	Workflow requested → tests scheduled	`pending` - "Waiting for actions to complete"
10:30:54	Linux test FAILS (artifact not ready yet!)	`error` - "No build artifact found"
10:31:24	Linux artifact finally created (29 sec late)	—
10:31:43	Windows test FAILS (artifact not ready yet!)	`error` - "No build artifact found"
11:08:23	Windows artifact finally created	—
11:09:22	Workflow completes → NEW tests queued	`pending` - "Tests queued" ← STUCK HERE

Root Causes Identified

1. Race Condition (Primary Issue)

When the workflow_run.completed webhook fires, the Sample Platform immediately queues tests. However, GitHub's artifact API may not have propagated the artifacts yet. In this case:

Linux test ran at 10:30:55, artifact created at 10:31:24 (29 seconds late)
Windows test ran at 10:31:43, artifact created at 11:08:23 (37 minutes late!)

2. Status Override

When new tests were queued at 11:09:22, the GitHub status was set to "pending" with "Tests queued", overwriting the previous "error" status. GitHub's commit status API replaces the previous status for each context.

3. Silent Failure in `mark_test_failed`

The function caught all exceptions and only logged them:

except Exception as e:
    log.error(f"Failed to mark test {test.id} as failed: {e}")
    # GitHub never notified! Status stays as "pending" forever

Solution

1. Artifact Verification Before Queuing Tests

Added verify_artifacts_exist() function that checks if artifacts are available before calling queue_test(). If the workflow succeeded but artifacts aren't found (race condition), it posts an ERROR status:

"Build succeeded but artifact not yet available - please retry"

This prevents queuing tests that will immediately fail.

2. Robust `mark_test_failed` Function

Completely rewrote to ensure GitHub is always notified:

Separated database and GitHub updates into independent try/catch blocks
If DB fails, still tries to update GitHub
If GitHub fails, logs specific error (GithubException vs other)
Logs CRITICAL when both fail
Now includes target_url for better traceability

3. Improved Artifact Search

New find_artifact_for_commit() function:

Properly handles GitHub API pagination
Searches up to 500 artifacts (configurable via MAX_ARTIFACTS_TO_SEARCH)
Better error handling and logging
Used by both webhook handler and start_test()

Files Changed

File	Changes
`mod_ci/controllers.py`	+200/-44 lines - Core fixes
`tests/test_ci/test_controllers.py`	+251 lines - 12 new tests

New Tests

TestFindArtifactForCommit (5 tests): Success, not found, wrong platform, exception handling, max search limit
TestVerifyArtifactsExist (3 tests): Both exist, only Linux, neither exists
TestMarkTestFailedImproved (4 tests): Both succeed, DB fails, both fail, includes target_url

Test Plan

Verify syntax and type checks pass (mypy, isort)
CI tests pass
Manual test: Create a PR in CCExtractor, verify status is correctly reported
Verify that if artifact isn't available, ERROR status is posted (not stuck pending)

Codecov Report

❌ Patch coverage is 77.77778% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.81%. Comparing base (53b3d7f) to head (363f94b).
⚠️ Report is 40 commits behind head on master.

Files with missing lines	Patch %	Lines
mod_ci/controllers.py	77.77%	16 Missing and 6 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #952      +/-   ##
==========================================
- Coverage   86.88%   86.81%   -0.08%     
==========================================
  Files          35       35              
  Lines        3759     3822      +63     
  Branches      767      776       +9     
==========================================
+ Hits         3266     3318      +52     
- Misses        355      364       +9     
- Partials      138      140       +2

Flag	Coverage Δ
unittests	`86.81% <77.77%> (-0.08%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

This fix addresses a critical race condition that caused GitHub CI status to remain stuck as "pending" / "Tests queued" even after tests failed. | Time | Event | GitHub Status | |------|-------|---------------| | 10:29:32 | Workflow requested, tests scheduled | pending - "Waiting for actions" | | 10:30:54 | Linux test fails (artifact not ready) | error - "No build artifact found" | | 10:31:24 | Linux artifact created | — | | 10:31:43 | Windows test fails (artifact not ready) | error - "No build artifact found" | | 11:08:23 | Windows artifact created | — | | 11:09:22 | Workflow completes, NEW tests queued | pending - "Tests queued" ← STUCK | 1. **Race Condition**: Tests were queued before artifacts were available. The workflow_run.completed webhook fired, but GitHub's artifact API hadn't propagated the artifacts yet (29 seconds gap for Linux). 2. **Status Override**: When new tests (7138/7139) were queued at 11:09:22, the status was set to "pending", overwriting the previous "error" status. 3. **Silent Failure**: The `mark_test_failed` function caught all exceptions silently, so if GitHub status update failed, no error was reported and the status remained stuck as "pending". - Separated database and GitHub updates so one failing doesn't block the other - Added detailed logging for each step (db_success, github_success) - Now includes target_url in GitHub status for better traceability - Logs CRITICAL error when both updates fail - Added `verify_artifacts_exist()` check before calling `queue_test()` - If workflow succeeded but artifact isn't found, posts ERROR status: "Build succeeded but artifact not yet available - please retry" - Prevents queuing tests that will immediately fail - New `find_artifact_for_commit()` function with proper pagination - Searches up to MAX_ARTIFACTS_TO_SEARCH (500) artifacts - Better error handling and logging - Used by both webhook handler and start_test() - All artifact search operations now log progress - Test failures include test ID in log messages - Artifact availability is logged before queuing - `mod_ci/controllers.py`: Core fixes - `tests/test_ci/test_controllers.py`: Added 12 new tests - TestFindArtifactForCommit: 5 tests - TestVerifyArtifactsExist: 3 tests - TestMarkTestFailedImproved: 4 tests 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

…rror Works around different PyGithub versions having different type stubs. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Tests that exercise the successful workflow completion path now mock verify_artifacts_exist to return {'linux': True, 'windows': True}. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

sonarqubecloud · 2025-12-23T19:11:57Z

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.4% Duplication on New Code

See analysis details on SonarQube Cloud

cfsmp3 requested review from canihavesomecoffee and thealphadollar as code owners December 23, 2025 16:08

cfsmp3 closed this Dec 23, 2025

cfsmp3 reopened this Dec 23, 2025

cfsmp3 force-pushed the fix/ci-status-race-condition branch 2 times, most recently from 0a0845b to 6152875 Compare December 23, 2025 19:00

canihavesomecoffee approved these changes Dec 23, 2025

View reviewed changes

cfsmp3 and others added 6 commits December 23, 2025 20:11

fix: Add type annotation to fix mypy error on GithubObject.NotSet

b70fd66

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

fix: Use getattr for GithubObject.NotSet to avoid mypy attr-defined e…

9022ea1

…rror Works around different PyGithub versions having different type stubs. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

style: Fix line length in deschedule_test calls

e45db5e

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

style: Fix continuation line indentation in tests

c8d796b

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

canihavesomecoffee force-pushed the fix/ci-status-race-condition branch from 6152875 to c8d796b Compare December 23, 2025 19:11

canihavesomecoffee merged commit c7ba746 into master Dec 23, 2025
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: Prevent CI status getting stuck in 'pending' state (race condition) #952

fix: Prevent CI status getting stuck in 'pending' state (race condition) #952

Uh oh!

cfsmp3 commented Dec 23, 2025

Uh oh!

codecov bot commented Dec 23, 2025 •

edited

Loading

Uh oh!

sonarqubecloud bot commented Dec 23, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

fix: Prevent CI status getting stuck in 'pending' state (race condition) #952

fix: Prevent CI status getting stuck in 'pending' state (race condition) #952

Uh oh!

Conversation

cfsmp3 commented Dec 23, 2025

Summary

Bug Analysis

Real-World Case: CCExtractor PR #1880

Timeline of Events (2025-12-23, all times UTC)

Root Causes Identified

1. Race Condition (Primary Issue)

2. Status Override

3. Silent Failure in mark_test_failed

Solution

1. Artifact Verification Before Queuing Tests

2. Robust mark_test_failed Function

3. Improved Artifact Search

Files Changed

New Tests

Test Plan

Related

Uh oh!

codecov bot commented Dec 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

sonarqubecloud bot commented Dec 23, 2025

Quality Gate passed

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

3. Silent Failure in `mark_test_failed`

2. Robust `mark_test_failed` Function

codecov bot commented Dec 23, 2025 •

edited

Loading