fix: Retry tests when build is still in progress instead of failing #973

cfsmp3 · 2025-12-25T08:04:49Z

Summary

This PR fixes a critical regression introduced in the Phase 2 reliability improvements (PRs #943/#946) that causes tests to be permanently marked as failed when the GitHub Actions build hasn't completed yet.

The Problem

When a PR is opened, both the GitHub Actions build and the Sample Platform test are triggered. If the cron job picks up the test before the build completes (which can take up to 40 minutes for Windows builds), the test is marked as permanently failed instead of being retried later.

Real-world example: PR CCExtractor/ccextractor#1896:

PR opened at 07:38:50 UTC
Windows build started at 07:38:57 UTC (takes ~40 min)
Sample Platform test triggered at 07:40:32 UTC (< 2 min later)
Test marked as failed with "Build still in progress"
Test will never run even after build completes

Root Cause Analysis

Previous behavior (before December 2025):

if not artifact_saved:
    log.critical("Could not find an artifact for this commit")
    return  # Silent return - test stays "pending"

If artifact not found → return silently
Test stays in pending state
Next cron cycle (10 min later) retries
Eventually build completes → test runs successfully

Regression behavior (after Phase 2 fixes):

if artifact is None:
    error_detail = _diagnose_missing_artifact(...)
    mark_test_failed(db, test, repository, error_detail)  # PERMANENT FAILURE
    return

If artifact not found → mark test as failed immediately
GitHub status set to ERROR
Test marked as "canceled" in database
Test is never retried even after build completes

The Fix

Modified _diagnose_missing_artifact() to return a tuple indicating whether the failure is retryable:

Scenario	Retryable	Behavior
Build in progress (`in_progress`, `queued`)	✅ Yes	Return silently, cron will retry
No workflow run found (may be queued)	✅ Yes	Return silently, cron will retry
Diagnostic check failed (API error)	✅ Yes	Return silently, cron will retry
Build failed (`failure`, `cancelled`)	❌ No	Mark test as failed
Artifact expired (build succeeded but artifact gone)	❌ No	Mark test as failed

Fixed behavior:

if artifact is None:
    error_detail, is_retryable = _diagnose_missing_artifact(...)
    
    if is_retryable:
        # Build is still in progress - don't mark as failed
        # Just return and let the next cron cycle retry this test
        log.info(f"Test {test.id}: {error_detail}")
        return
    
    # Permanent failure - mark the test as failed
    log.critical(f"Test {test.id}: ...")
    mark_test_failed(db, test, repository, error_detail)
    return

Files Changed

File	Changes
`mod_ci/controllers.py`	Modified `_diagnose_missing_artifact()` to return `(message, is_retryable)` tuple; Updated `start_test()` to handle retryable vs permanent failures
`tests/test_ci/test_controllers.py`	Added 8 new tests for retry behavior

Test Coverage

New Tests Added

TestDiagnoseMissingArtifact class (6 tests):

test_build_in_progress_is_retryable - Build with status in_progress returns is_retryable=True
test_build_queued_is_retryable - Build with status queued returns is_retryable=True
test_build_failed_is_not_retryable - Build with conclusion failure returns is_retryable=False
test_artifact_expired_is_not_retryable - Successful build with missing artifact returns is_retryable=False
test_no_workflow_run_is_retryable - No workflow run found returns is_retryable=True
test_diagnostic_exception_is_retryable - API errors default to is_retryable=True (safe default)

TestStartTestRetryBehavior class (2 tests):

test_retryable_failure_does_not_mark_test_failed - Verifies mark_test_failed() is NOT called for retryable failures
test_permanent_failure_marks_test_failed - Verifies mark_test_failed() IS called for permanent failures

Test Results

$ TZ=America/New_York python -m nose2 -v tests.test_ci.test_controllers.TestDiagnoseMissingArtifact tests.test_ci.test_controllers.TestStartTestRetryBehavior

test_artifact_expired_is_not_retryable ... ok
test_build_failed_is_not_retryable ... ok
test_build_in_progress_is_retryable ... ok
test_build_queued_is_retryable ... ok
test_diagnostic_exception_is_retryable ... ok
test_no_workflow_run_is_retryable ... ok
test_permanent_failure_marks_test_failed ... ok
test_retryable_failure_does_not_mark_test_failed ... ok

----------------------------------------------------------------------
Ran 8 tests in 0.831s

OK

How to Verify

Before This Fix

Open a PR that triggers a Windows build
If cron runs before build completes (~40 min), test is permanently failed
GitHub check shows ERROR status with "Build still in progress"
Test never runs even after build finishes

After This Fix

Open a PR that triggers a Windows build
If cron runs before build completes, test stays pending
Logs show: "Test {id}: Build still in progress: ... Will retry when build completes."
Next cron cycle retries the test
When build completes, test runs successfully

Manual Verification

After deploying, monitor logs for:

INFO: Test 1234: Build still in progress: 'Build CCExtractor on Windows' is in_progress. Will retry when build completes.

This indicates the retry logic is working correctly.

Deployment Notes

No configuration changes required
No database migrations required
Safe to deploy immediately - this is a pure bug fix
Backward compatible - existing tests in final states (completed/failed) are unaffected

Related Issues

Discovered while investigating CCExtractor/ccextractor#1896 Windows test failure
Related to Phase 2 reliability improvements: PRs fix(reliability): Add timeouts and error handling for CI operations #943, fix(reliability): Phase 2 - Error handling and retry logic #946

🤖 Generated with Claude Code

This fixes a regression introduced in Phase 2 reliability improvements (PRs #943/#946) where tests would be permanently marked as failed when the GitHub Actions build hadn't completed yet. Previous behavior (before Dec 2025): - If artifact not found, just return silently - Test stays in "pending" state - Next cron cycle retries the test - Eventually build completes and test runs Regression behavior (after Phase 2): - If artifact not found, mark test as failed immediately - Test is permanently marked as "canceled" with ERROR status - No retry - test never runs even after build completes Fixed behavior (this PR): - Distinguish between retryable and permanent failures - Retryable: build in progress, workflow queued, diagnostic error - Permanent: build failed, artifact expired - Only mark as failed for permanent failures - Return silently for retryable failures (next cron cycle retries) Changes: - Modified _diagnose_missing_artifact() to return tuple (message, is_retryable) - Modified start_test() to check is_retryable before marking as failed - Added 8 comprehensive tests for retry behavior Fixes issue where Windows tests fail with "Build still in progress" when the cron job runs before the 40-minute Windows build completes. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

sonarqubecloud · 2025-12-25T08:43:39Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

cfsmp3 requested review from canihavesomecoffee and thealphadollar as code owners December 25, 2025 08:04

Merge branch 'master' into fix/build-in-progress-retry

a06b56c

cfsmp3 merged commit 6a637a5 into master Dec 25, 2025
6 checks passed

cfsmp3 deleted the fix/build-in-progress-retry branch December 25, 2025 08:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: Retry tests when build is still in progress instead of failing #973

fix: Retry tests when build is still in progress instead of failing #973

cfsmp3 commented Dec 25, 2025

Uh oh!

sonarqubecloud bot commented Dec 25, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

fix: Retry tests when build is still in progress instead of failing #973

fix: Retry tests when build is still in progress instead of failing #973

Conversation

cfsmp3 commented Dec 25, 2025

Summary

The Problem

Root Cause Analysis

The Fix

Files Changed

Test Coverage

New Tests Added

Test Results

How to Verify

Before This Fix

After This Fix

Manual Verification

Deployment Notes

Related Issues

Uh oh!

sonarqubecloud bot commented Dec 25, 2025

Quality Gate passed

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants