Skip to content

Conversation

@cfsmp3
Copy link
Contributor

@cfsmp3 cfsmp3 commented Dec 25, 2025

Summary

This PR fixes a critical regression introduced in the Phase 2 reliability improvements (PRs #943/#946) that causes tests to be permanently marked as failed when the GitHub Actions build hasn't completed yet.

The Problem

When a PR is opened, both the GitHub Actions build and the Sample Platform test are triggered. If the cron job picks up the test before the build completes (which can take up to 40 minutes for Windows builds), the test is marked as permanently failed instead of being retried later.

Real-world example: PR CCExtractor/ccextractor#1896:

  • PR opened at 07:38:50 UTC
  • Windows build started at 07:38:57 UTC (takes ~40 min)
  • Sample Platform test triggered at 07:40:32 UTC (< 2 min later)
  • Test marked as failed with "Build still in progress"
  • Test will never run even after build completes

Root Cause Analysis

Previous behavior (before December 2025):

if not artifact_saved:
    log.critical("Could not find an artifact for this commit")
    return  # Silent return - test stays "pending"
  • If artifact not found → return silently
  • Test stays in pending state
  • Next cron cycle (10 min later) retries
  • Eventually build completes → test runs successfully

Regression behavior (after Phase 2 fixes):

if artifact is None:
    error_detail = _diagnose_missing_artifact(...)
    mark_test_failed(db, test, repository, error_detail)  # PERMANENT FAILURE
    return
  • If artifact not found → mark test as failed immediately
  • GitHub status set to ERROR
  • Test marked as "canceled" in database
  • Test is never retried even after build completes

The Fix

Modified _diagnose_missing_artifact() to return a tuple indicating whether the failure is retryable:

Scenario Retryable Behavior
Build in progress (in_progress, queued) ✅ Yes Return silently, cron will retry
No workflow run found (may be queued) ✅ Yes Return silently, cron will retry
Diagnostic check failed (API error) ✅ Yes Return silently, cron will retry
Build failed (failure, cancelled) ❌ No Mark test as failed
Artifact expired (build succeeded but artifact gone) ❌ No Mark test as failed

Fixed behavior:

if artifact is None:
    error_detail, is_retryable = _diagnose_missing_artifact(...)
    
    if is_retryable:
        # Build is still in progress - don't mark as failed
        # Just return and let the next cron cycle retry this test
        log.info(f"Test {test.id}: {error_detail}")
        return
    
    # Permanent failure - mark the test as failed
    log.critical(f"Test {test.id}: ...")
    mark_test_failed(db, test, repository, error_detail)
    return

Files Changed

File Changes
mod_ci/controllers.py Modified _diagnose_missing_artifact() to return (message, is_retryable) tuple; Updated start_test() to handle retryable vs permanent failures
tests/test_ci/test_controllers.py Added 8 new tests for retry behavior

Test Coverage

New Tests Added

TestDiagnoseMissingArtifact class (6 tests):

  • test_build_in_progress_is_retryable - Build with status in_progress returns is_retryable=True
  • test_build_queued_is_retryable - Build with status queued returns is_retryable=True
  • test_build_failed_is_not_retryable - Build with conclusion failure returns is_retryable=False
  • test_artifact_expired_is_not_retryable - Successful build with missing artifact returns is_retryable=False
  • test_no_workflow_run_is_retryable - No workflow run found returns is_retryable=True
  • test_diagnostic_exception_is_retryable - API errors default to is_retryable=True (safe default)

TestStartTestRetryBehavior class (2 tests):

  • test_retryable_failure_does_not_mark_test_failed - Verifies mark_test_failed() is NOT called for retryable failures
  • test_permanent_failure_marks_test_failed - Verifies mark_test_failed() IS called for permanent failures

Test Results

$ TZ=America/New_York python -m nose2 -v tests.test_ci.test_controllers.TestDiagnoseMissingArtifact tests.test_ci.test_controllers.TestStartTestRetryBehavior

test_artifact_expired_is_not_retryable ... ok
test_build_failed_is_not_retryable ... ok
test_build_in_progress_is_retryable ... ok
test_build_queued_is_retryable ... ok
test_diagnostic_exception_is_retryable ... ok
test_no_workflow_run_is_retryable ... ok
test_permanent_failure_marks_test_failed ... ok
test_retryable_failure_does_not_mark_test_failed ... ok

----------------------------------------------------------------------
Ran 8 tests in 0.831s

OK

How to Verify

Before This Fix

  1. Open a PR that triggers a Windows build
  2. If cron runs before build completes (~40 min), test is permanently failed
  3. GitHub check shows ERROR status with "Build still in progress"
  4. Test never runs even after build finishes

After This Fix

  1. Open a PR that triggers a Windows build
  2. If cron runs before build completes, test stays pending
  3. Logs show: "Test {id}: Build still in progress: ... Will retry when build completes."
  4. Next cron cycle retries the test
  5. When build completes, test runs successfully

Manual Verification

After deploying, monitor logs for:

INFO: Test 1234: Build still in progress: 'Build CCExtractor on Windows' is in_progress. Will retry when build completes.

This indicates the retry logic is working correctly.

Deployment Notes

  • No configuration changes required
  • No database migrations required
  • Safe to deploy immediately - this is a pure bug fix
  • Backward compatible - existing tests in final states (completed/failed) are unaffected

Related Issues


🤖 Generated with Claude Code

This fixes a regression introduced in Phase 2 reliability improvements
(PRs #943/#946) where tests would be permanently marked as failed when
the GitHub Actions build hadn't completed yet.

Previous behavior (before Dec 2025):
- If artifact not found, just return silently
- Test stays in "pending" state
- Next cron cycle retries the test
- Eventually build completes and test runs

Regression behavior (after Phase 2):
- If artifact not found, mark test as failed immediately
- Test is permanently marked as "canceled" with ERROR status
- No retry - test never runs even after build completes

Fixed behavior (this PR):
- Distinguish between retryable and permanent failures
- Retryable: build in progress, workflow queued, diagnostic error
- Permanent: build failed, artifact expired
- Only mark as failed for permanent failures
- Return silently for retryable failures (next cron cycle retries)

Changes:
- Modified _diagnose_missing_artifact() to return tuple (message, is_retryable)
- Modified start_test() to check is_retryable before marking as failed
- Added 8 comprehensive tests for retry behavior

Fixes issue where Windows tests fail with "Build still in progress"
when the cron job runs before the 40-minute Windows build completes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@sonarqubecloud
Copy link

@cfsmp3 cfsmp3 merged commit 6a637a5 into master Dec 25, 2025
6 checks passed
@cfsmp3 cfsmp3 deleted the fix/build-in-progress-retry branch December 25, 2025 08:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants