Skip to content

Conversation

@cfsmp3
Copy link
Contributor

@cfsmp3 cfsmp3 commented Dec 23, 2025

Summary

This PR fixes a critical race condition that caused GitHub CI status to remain permanently stuck as "pending" / "Tests queued" even after tests had actually failed. This issue was discovered while investigating CCExtractor PR #1880.

Key fixes:

  • Prevent tests from being queued before artifacts are available
  • Ensure GitHub status is always updated, even if database operations fail
  • Improve artifact search with proper pagination handling

Bug Analysis

Real-World Case: CCExtractor PR #1880

The CI status for PR #1880 showed "Waiting for status to be reported — Tests queued" for both Linux and Windows, despite the tests having actually failed hours earlier.

Timeline of Events (2025-12-23, all times UTC)

Time Event GitHub Status
10:29:32 Workflow requested → tests scheduled pending - "Waiting for actions to complete"
10:30:54 Linux test FAILS (artifact not ready yet!) error - "No build artifact found"
10:31:24 Linux artifact finally created (29 sec late)
10:31:43 Windows test FAILS (artifact not ready yet!) error - "No build artifact found"
11:08:23 Windows artifact finally created
11:09:22 Workflow completes → NEW tests queued pending - "Tests queued" ← STUCK HERE

Root Causes Identified

1. Race Condition (Primary Issue)

When the workflow_run.completed webhook fires, the Sample Platform immediately queues tests. However, GitHub's artifact API may not have propagated the artifacts yet. In this case:

  • Linux test ran at 10:30:55, artifact created at 10:31:24 (29 seconds late)
  • Windows test ran at 10:31:43, artifact created at 11:08:23 (37 minutes late!)

2. Status Override

When new tests were queued at 11:09:22, the GitHub status was set to "pending" with "Tests queued", overwriting the previous "error" status. GitHub's commit status API replaces the previous status for each context.

3. Silent Failure in mark_test_failed

The function caught all exceptions and only logged them:

except Exception as e:
    log.error(f"Failed to mark test {test.id} as failed: {e}")
    # GitHub never notified! Status stays as "pending" forever

Solution

1. Artifact Verification Before Queuing Tests

Added verify_artifacts_exist() function that checks if artifacts are available before calling queue_test(). If the workflow succeeded but artifacts aren't found (race condition), it posts an ERROR status:

"Build succeeded but artifact not yet available - please retry"

This prevents queuing tests that will immediately fail.

2. Robust mark_test_failed Function

Completely rewrote to ensure GitHub is always notified:

  • Separated database and GitHub updates into independent try/catch blocks
  • If DB fails, still tries to update GitHub
  • If GitHub fails, logs specific error (GithubException vs other)
  • Logs CRITICAL when both fail
  • Now includes target_url for better traceability

3. Improved Artifact Search

New find_artifact_for_commit() function:

  • Properly handles GitHub API pagination
  • Searches up to 500 artifacts (configurable via MAX_ARTIFACTS_TO_SEARCH)
  • Better error handling and logging
  • Used by both webhook handler and start_test()

Files Changed

File Changes
mod_ci/controllers.py +200/-44 lines - Core fixes
tests/test_ci/test_controllers.py +251 lines - 12 new tests

New Tests

  • TestFindArtifactForCommit (5 tests): Success, not found, wrong platform, exception handling, max search limit
  • TestVerifyArtifactsExist (3 tests): Both exist, only Linux, neither exists
  • TestMarkTestFailedImproved (4 tests): Both succeed, DB fails, both fail, includes target_url

Test Plan

  • Verify syntax and type checks pass (mypy, isort)
  • CI tests pass
  • Manual test: Create a PR in CCExtractor, verify status is correctly reported
  • Verify that if artifact isn't available, ERROR status is posted (not stuck pending)

Related

🤖 Generated with Claude Code

@codecov
Copy link

codecov bot commented Dec 23, 2025

Codecov Report

❌ Patch coverage is 77.77778% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.81%. Comparing base (53b3d7f) to head (363f94b).
⚠️ Report is 40 commits behind head on master.

Files with missing lines Patch % Lines
mod_ci/controllers.py 77.77% 16 Missing and 6 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #952      +/-   ##
==========================================
- Coverage   86.88%   86.81%   -0.08%     
==========================================
  Files          35       35              
  Lines        3759     3822      +63     
  Branches      767      776       +9     
==========================================
+ Hits         3266     3318      +52     
- Misses        355      364       +9     
- Partials      138      140       +2     
Flag Coverage Δ
unittests 86.81% <77.77%> (-0.08%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@cfsmp3 cfsmp3 force-pushed the fix/ci-status-race-condition branch 2 times, most recently from 0a0845b to 6152875 Compare December 23, 2025 19:00
cfsmp3 and others added 6 commits December 23, 2025 20:11
This fix addresses a critical race condition that caused GitHub CI status
to remain stuck as "pending" / "Tests queued" even after tests failed.

| Time | Event | GitHub Status |
|------|-------|---------------|
| 10:29:32 | Workflow requested, tests scheduled | pending - "Waiting for actions" |
| 10:30:54 | Linux test fails (artifact not ready) | error - "No build artifact found" |
| 10:31:24 | Linux artifact created | — |
| 10:31:43 | Windows test fails (artifact not ready) | error - "No build artifact found" |
| 11:08:23 | Windows artifact created | — |
| 11:09:22 | Workflow completes, NEW tests queued | pending - "Tests queued" ← STUCK |

1. **Race Condition**: Tests were queued before artifacts were available.
   The workflow_run.completed webhook fired, but GitHub's artifact API
   hadn't propagated the artifacts yet (29 seconds gap for Linux).

2. **Status Override**: When new tests (7138/7139) were queued at 11:09:22,
   the status was set to "pending", overwriting the previous "error" status.

3. **Silent Failure**: The `mark_test_failed` function caught all exceptions
   silently, so if GitHub status update failed, no error was reported and
   the status remained stuck as "pending".

- Separated database and GitHub updates so one failing doesn't block the other
- Added detailed logging for each step (db_success, github_success)
- Now includes target_url in GitHub status for better traceability
- Logs CRITICAL error when both updates fail

- Added `verify_artifacts_exist()` check before calling `queue_test()`
- If workflow succeeded but artifact isn't found, posts ERROR status:
  "Build succeeded but artifact not yet available - please retry"
- Prevents queuing tests that will immediately fail

- New `find_artifact_for_commit()` function with proper pagination
- Searches up to MAX_ARTIFACTS_TO_SEARCH (500) artifacts
- Better error handling and logging
- Used by both webhook handler and start_test()

- All artifact search operations now log progress
- Test failures include test ID in log messages
- Artifact availability is logged before queuing

- `mod_ci/controllers.py`: Core fixes
- `tests/test_ci/test_controllers.py`: Added 12 new tests

- TestFindArtifactForCommit: 5 tests
- TestVerifyArtifactsExist: 3 tests
- TestMarkTestFailedImproved: 4 tests

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…rror

Works around different PyGithub versions having different type stubs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tests that exercise the successful workflow completion path now mock
verify_artifacts_exist to return {'linux': True, 'windows': True}.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@canihavesomecoffee canihavesomecoffee force-pushed the fix/ci-status-race-condition branch from 6152875 to c8d796b Compare December 23, 2025 19:11
@sonarqubecloud
Copy link

@canihavesomecoffee canihavesomecoffee merged commit c7ba746 into master Dec 23, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants