@cfsmp3 cfsmp3 commented Dec 24, 2025

Summary

This PR introduces an optional Celery task queue system as an alternative to the current cron-based test processing. This is part of the architectural improvements outlined in the Sample Platform Reliability Assessment.

Why This Change?

The current cron-based polling approach has several limitations:

  • Tests accumulate while waiting for the next cron cycle (every 10 minutes)
  • If cron crashes, no tests run until manually restarted
  • No parallel processing capability
  • Limited retry mechanisms for failed tasks

The Celery task queue provides:

  • Event-driven processing: Tests start immediately when queued (instead of waiting for cron)
  • Parallel execution: Multiple tests can run concurrently
  • Built-in retry logic: Automatic retries with exponential backoff
  • Better visibility: Queue depth monitoring, task status tracking
  • Graceful degradation: Falls back to cron if Celery is unavailable

Changes Overview

New Files (5)

| File | Purpose |
| --- | --- |
| celery_app.py | Celery application factory with Flask context integration |
| mod_ci/tasks.py | Task definitions for test execution, cleanup, and pending test discovery |
| install/celery-worker.service | Systemd service file for Celery worker process |
| install/celery-beat.service | Systemd service file for Celery beat scheduler (periodic tasks) |
| tests/test_ci/test_tasks.py | Unit tests for Celery-related functions (6 tests) |

Modified Files (4)

| File | Changes |
| --- | --- |
| requirements.txt | Added celery[redis]==5.3.6 and redis==5.0.1 |
| config_sample.py | Added Celery configuration options (broker URL, feature flag, etc.) |
| mod_ci/controllers.py | Added trigger_test_tasks() function, modified add_test_entry() to return test IDs |
| install/installation.md | Added comprehensive Celery setup and deployment documentation |

Architecture

Task Definitions

| Task | Purpose | Schedule |
| --- | --- | --- |
| start_test_task(test_id, bot_token) | Execute a single test (download artifacts, create GCP VM) | On-demand |
| check_expired_instances_task() | Clean up timed-out VMs | Every 5 minutes |
| process_pending_tests_task() | Find pending tests and queue them | Every 1 minute |
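
The two periodic rows above could be expressed as a Celery beat schedule along the following lines. This is a minimal sketch, not the PR's actual configuration: the entry keys, the dotted task names, and the queue assignments are assumptions, since the PR text does not show them.

```python
# Hypothetical beat schedule. Entry keys, dotted task names and queue
# assignments are assumptions for illustration only.
BEAT_SCHEDULE = {
    "check-expired-instances": {
        "task": "mod_ci.tasks.check_expired_instances_task",
        "schedule": 5 * 60.0,  # every 5 minutes, expressed in seconds
        "options": {"queue": "maintenance"},
    },
    "process-pending-tests": {
        "task": "mod_ci.tasks.process_pending_tests_task",
        "schedule": 60.0,  # every 1 minute
        "options": {"queue": "default"},
    },
}


def runs_per_hour(entry):
    """Return how often a fixed-interval schedule entry fires per hour."""
    return int(3600 // entry["schedule"])
```

In a Celery application a dict of this shape is normally assigned to the `beat_schedule` setting (for example `celery.conf.beat_schedule = BEAT_SCHEDULE`).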

Queue Configuration

| Queue | Purpose |
| --- | --- |
| default | General tasks, periodic test discovery |
| test_execution | Test execution tasks (one per test) |
| maintenance | Cleanup and maintenance tasks |
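
As a rough illustration of how tasks might be mapped onto these three queues, the sketch below uses a routing table in the shape Celery's `task_routes` setting accepts. The dotted task names and the `queue_for` helper are hypothetical; only the queue names come from this PR.

```python
# Hypothetical routing table; only the queue names come from the PR.
TASK_ROUTES = {
    "mod_ci.tasks.start_test_task": {"queue": "test_execution"},
    "mod_ci.tasks.check_expired_instances_task": {"queue": "maintenance"},
}

# Anything without an explicit route lands on the general-purpose queue.
DEFAULT_QUEUE = "default"


def queue_for(task_name):
    """Resolve the queue a task name would be routed to."""
    return TASK_ROUTES.get(task_name, {}).get("queue", DEFAULT_QUEUE)
```

With this table, `queue_for("mod_ci.tasks.process_pending_tests_task")` falls through to `default`, matching the "periodic test discovery" row above.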

Feature Flag

The USE_CELERY_TASKS configuration option controls whether Celery is active:

  • False (default): Cron-based processing continues as before
  • True: Webhooks trigger Celery tasks immediately after test creation

This allows a gradual, low-risk migration from cron to Celery: while the flag is False, the existing cron path is unchanged.
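
The flag check and the fallback to cron can be sketched roughly as follows. This is an illustration under assumptions, not the code in mod_ci/controllers.py: the signature, the `config` dict, the `load_task` parameter, and the integer return value are all hypothetical.

```python
import logging

log = logging.getLogger("trigger_sketch")


def _load_start_test_task():
    """Import lazily so a missing Celery install cannot break webhook handling."""
    from mod_ci.tasks import start_test_task  # module path named in this PR
    return start_test_task


def trigger_test_tasks(test_ids, bot_token, config, load_task=_load_start_test_task):
    """Queue one task per test ID and return how many were queued.

    A return value of 0 means nothing was queued and cron remains responsible.
    """
    if not config.get("USE_CELERY_TASKS", False) or not test_ids:
        return 0
    try:
        task = load_task()
    except ImportError:
        log.warning("Celery tasks unavailable; leaving tests for cron")
        return 0
    queued = 0
    for test_id in test_ids:
        try:
            task.delay(test_id, bot_token)
            queued += 1
        except Exception:
            # Broker unreachable or similar: the test stays pending for cron.
            log.exception("Could not queue test %s", test_id)
    return queued
```

Passing the loader as a parameter keeps the sketch testable without Celery installed; a fake object with a `delay` method is enough to exercise it.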


How to Deploy

Prerequisites

  1. Install Redis on the server:

    sudo apt update
    sudo apt install redis-server
    
    # Configure Redis
    sudo nano /etc/redis/redis.conf
    # Set: supervised systemd
    # Set: bind 127.0.0.1 ::1
    
    # Enable and start Redis
    sudo systemctl enable redis-server
    sudo systemctl start redis-server
    
    # Verify Redis is running
    redis-cli ping  # Should return PONG

  2. Install Python dependencies:

    cd /var/www/sample-platform
    source venv/bin/activate
    pip install -r requirements.txt

Configuration

Add the following to your config.py:

# Celery Configuration
CELERY_BROKER_URL = 'redis://localhost:6379/0'
CELERY_RESULT_BACKEND = 'redis://localhost:6379/0'

# Feature flag (set to True to enable Celery, False for cron fallback)
USE_CELERY_TASKS = False  # Start with False for safe rollout

Install Systemd Services

# Create log directory
sudo mkdir -p /var/www/sample-platform/logs/celery
sudo chown -R www-data:www-data /var/www/sample-platform/logs/celery

# Create runtime directory
sudo mkdir -p /var/run/celery
sudo chown www-data:www-data /var/run/celery

# Install systemd services
sudo cp /var/www/sample-platform/install/celery-worker.service /etc/systemd/system/
sudo cp /var/www/sample-platform/install/celery-beat.service /etc/systemd/system/

# Reload systemd and enable services
sudo systemctl daemon-reload
sudo systemctl enable celery-worker celery-beat

# Start the services
sudo systemctl start celery-worker
sudo systemctl start celery-beat

Recommended Rollout Strategy

Stage 1: Install & Monitor (Week 1-2)

USE_CELERY_TASKS = False  # Cron remains primary

  • Deploy code changes
  • Start Celery services
  • Monitor logs at /var/www/sample-platform/logs/celery/
  • Verify periodic tasks (cleanup, pending check) execute correctly
  • Cron continues processing tests as before

Stage 2: Celery Primary (Week 3-4)

USE_CELERY_TASKS = True

  • Enable Celery for new tests (webhooks trigger tasks immediately)
  • Reduce cron frequency to every 30 minutes (fallback only)
  • Monitor for any missed tests

Stage 3: Celery Only (Week 5+)

  • Disable cron job entirely (sudo crontab -e and comment out the line)
  • Full Celery operation
  • Optionally remove cron fallback code

How to Test

Unit Tests

cd /var/www/sample-platform
source venv/bin/activate
TZ=America/New_York python -m nose2 -v tests.test_ci.test_tasks

Expected output: 6 tests pass

Manual Testing

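  1. Verify Redis is running:

    redis-cli ping  # Should return PONG

  2. Verify Celery worker is running:

    sudo systemctl status celery-worker
    celery -A celery_app.celery inspect active

  3. Verify Celery beat is running:

    sudo systemctl status celery-beat
    # Note: `inspect scheduled` lists ETA/countdown tasks held by workers,
    # not the beat schedule itself; also check the beat log for periodic dispatches
    celery -A celery_app.celery inspect scheduled

  4. Monitor queue depth:

    # With the Redis broker each queue is a list keyed by its queue name;
    # `celery` is Celery's built-in default queue name
    redis-cli LLEN celery
    redis-cli LLEN test_execution

  5. View logs:

    tail -f /var/www/sample-platform/logs/celery/*.log

  6. Test task execution manually (Python shell):

    from mod_ci.tasks import check_expired_instances_task
    result = check_expired_instances_task.apply()  # runs in-process, no worker needed
    print(result.get())  # Should return {'status': 'success'}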

Optional: Flower Dashboard

For web-based monitoring:

pip install flower
celery -A celery_app.celery flower --port=5555

Then access http://localhost:5555 for a real-time task dashboard.


How to Roll Back

If issues occur, rollback is simple:

# 1. Stop Celery services
sudo systemctl stop celery-beat celery-worker

# 2. Disable in config: edit config.py and set
#    USE_CELERY_TASKS = False

# 3. Restart platform
sudo systemctl restart platform

# 4. Ensure cron is running (every 10 minutes)
sudo crontab -e
# Verify/uncomment: */10 * * * * python /var/www/sample-platform/mod_ci/cron.py ...

# 5. Verify cron picks up tests
tail -f /var/www/sample-platform/logs/cron.log

Files Changed in Detail

celery_app.py (New)

  • Celery application factory function make_celery(app)
  • Flask application context integration for database access
  • Beat schedule configuration for periodic tasks
  • Queue routing configuration
  • Graceful handling of missing config (for testing)

mod_ci/tasks.py (New)

  • start_test_task: Wraps existing start_test() with Celery retry logic
  • check_expired_instances_task: Periodic cleanup task
  • process_pending_tests_task: Finds pending tests and queues them
  • Proper error handling and logging
  • Flask app context management
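
To make the retry behaviour concrete, here is a standard-library sketch of retrying with exponential backoff. It is an analogue of Celery's retry support (for example `self.retry(countdown=...)` on a bound task), not the task code from this PR; `backoff_delay`, `run_with_retries`, and the base, cap, and retry-count values are assumptions.

```python
import time


def backoff_delay(attempt, base=60, cap=900):
    """Seconds to wait before retry number `attempt` (0-based).

    The delay doubles on each attempt and is capped at `cap`.
    """
    return min(cap, base * (2 ** attempt))


def run_with_retries(func, max_retries=3, sleep=time.sleep):
    """Call `func`, retrying up to `max_retries` times with exponential backoff."""
    attempt = 0
    while True:
        try:
            return func()
        except Exception:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the last error
            sleep(backoff_delay(attempt))
            attempt += 1
```

With the defaults shown, a task that keeps failing waits 60, 120, and 240 seconds between attempts before the final error is raised.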

mod_ci/controllers.py (Modified)

  • add_test_entry(): Now returns list of created test IDs (was None)
  • trigger_test_tasks(): New function to optionally queue Celery tasks
  • Webhook handlers updated to call trigger_test_tasks() after test creation
  • Added db.flush() calls to get test IDs before commit
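
The point of flushing before commit is that primary keys become available while the transaction is still open. The sketch below shows the same idea with the standard-library `sqlite3` module rather than the platform's actual database layer; the table, column, and function names are hypothetical.

```python
import sqlite3


def add_rows_and_collect_ids(conn, commit_hashes):
    """Insert one row per commit hash and return the new IDs without committing."""
    ids = []
    cur = conn.cursor()
    for commit_hash in commit_hashes:
        cur.execute("INSERT INTO test (commit_hash) VALUES (?)", (commit_hash,))
        ids.append(cur.lastrowid)  # ID is known before COMMIT
    return ids


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER PRIMARY KEY, commit_hash TEXT)")
conn.commit()

ids = add_rows_and_collect_ids(conn, ["abc123", "def456"])
conn.commit()  # the caller decides when to commit; the IDs were already usable
```

Returning the IDs to the caller is what lets a webhook handler hand them to a queueing function after the rows exist.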

config_sample.py (Modified)

Added configuration options:

CELERY_BROKER_URL = 'redis://localhost:6379/0'
CELERY_RESULT_BACKEND = 'redis://localhost:6379/0'
CELERY_TASK_SERIALIZER = 'json'
CELERY_RESULT_SERIALIZER = 'json'
CELERY_ACCEPT_CONTENT = ['json']
CELERY_TIMEZONE = 'UTC'
CELERY_ENABLE_UTC = True
CELERY_TASK_ACKS_LATE = True
CELERY_WORKER_PREFETCH_MULTIPLIER = 1
CELERY_TASK_REJECT_ON_WORKER_LOST = True
CELERY_TASK_SOFT_TIME_LIMIT = 3600  # 1 hour
CELERY_TASK_TIME_LIMIT = 3900  # 1 hour 5 minutes
USE_CELERY_TASKS = False
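
A few of these values constrain each other: the soft time limit should sit below the hard limit, and the serializers should be in the accepted content list. A small sanity check along these lines could catch a bad edit early; `check_celery_config` is a hypothetical helper, not part of the PR.

```python
from urllib.parse import urlparse


def check_celery_config(cfg):
    """Return a list of problems found; an empty list means the values look consistent."""
    problems = []
    broker = urlparse(cfg["CELERY_BROKER_URL"])
    if broker.scheme not in ("redis", "rediss"):
        problems.append("CELERY_BROKER_URL is not a Redis URL")
    if cfg["CELERY_TASK_SOFT_TIME_LIMIT"] >= cfg["CELERY_TASK_TIME_LIMIT"]:
        problems.append("soft time limit must be below the hard time limit")
    for key in ("CELERY_TASK_SERIALIZER", "CELERY_RESULT_SERIALIZER"):
        if cfg[key] not in cfg["CELERY_ACCEPT_CONTENT"]:
            problems.append(f"{key} is not listed in CELERY_ACCEPT_CONTENT")
    return problems


# Values copied from the sample configuration above.
SAMPLE = {
    "CELERY_BROKER_URL": "redis://localhost:6379/0",
    "CELERY_TASK_SERIALIZER": "json",
    "CELERY_RESULT_SERIALIZER": "json",
    "CELERY_ACCEPT_CONTENT": ["json"],
    "CELERY_TASK_SOFT_TIME_LIMIT": 3600,
    "CELERY_TASK_TIME_LIMIT": 3900,
}
```

With the sample values the check returns an empty list; the 300-second gap between the two limits gives a task time to clean up after the soft limit fires.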

install/celery-worker.service (New)

  • Systemd service for Celery worker
  • Runs as www-data user
  • Processes 3 queues with 2 concurrent workers
  • Auto-restart on failure

install/celery-beat.service (New)

  • Systemd service for Celery beat scheduler
  • Runs as www-data user
  • Triggers periodic tasks (cleanup, pending test discovery)
  • Auto-restart on failure

install/installation.md (Modified)

  • Added comprehensive "Optional: Setting up Celery Task Queue" section
  • Redis installation instructions
  • Celery configuration guide
  • Service installation steps
  • Monitoring commands
  • Gradual migration guide
  • Rollback procedure

tests/test_ci/test_tasks.py (New)

6 unit tests covering:

  • trigger_test_tasks() with Celery disabled (default)
  • trigger_test_tasks() with empty test ID list
  • trigger_test_tasks() handling import errors gracefully
  • add_test_entry() returning test IDs
  • add_test_entry() with invalid commit hash
  • add_test_entry() with database commit failure

Backward Compatibility

  • 100% backward compatible: Default behavior unchanged
  • USE_CELERY_TASKS = False (default) means cron continues as before
  • No changes to existing test execution flow unless explicitly enabled
  • Gradual migration path with cron as fallback
  • Easy rollback if issues occur

Related Issues


Test Plan

  • Unit tests pass (6 new tests for Celery functionality)
  • Deploy to staging with USE_CELERY_TASKS = False
  • Verify Celery services start and run correctly
  • Verify periodic tasks execute (check logs)
  • Enable USE_CELERY_TASKS = True on staging
  • Submit test PR and verify tests are queued via Celery
  • Verify test execution completes successfully
  • Monitor for 1 week before production rollout

🤖 Generated with Claude Code

cfsmp3 and others added 3 commits December 24, 2025 13:27

Fixes #901

When GCP fails to create a VM (e.g., due to zone resource exhaustion),
the error was logged but the test status showed a raw error dict that
was not user-friendly.

Changes:
- Added parse_gcp_error() helper function to extract meaningful messages
  from GCP API error responses
- Added GCP_ERROR_MESSAGES dict mapping known error codes to user-friendly
  messages (ZONE_RESOURCE_POOL_EXHAUSTED, QUOTA_EXCEEDED, TIMEOUT, etc.)
- Updated start_test() to use parse_gcp_error() when VM creation fails
- For unknown errors, shows the error code and a truncated message

Now users see messages like:
"GCP resources temporarily unavailable in the configured zone.
The test will be retried automatically when resources become available."

Instead of raw dicts like:
"{'errors': [{'code': 'ZONE_RESOURCE_POOL_EXHAUSTED', ...}]}"

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Addresses review feedback: For unknown GCP error codes, the raw error
message could contain sensitive info (project names, zones, etc).

Changes:
- Log full error details server-side for debugging
- Return generic "VM creation failed" message to users for unknown errors
- Known error codes (ZONE_RESOURCE_POOL_EXHAUSTED, etc.) still return
  user-friendly messages
- Added optional `log` parameter for testability
- Updated tests to verify sensitive info is logged but not returned
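
The behaviour described in these two commits can be sketched roughly as below. This is an illustration, not the committed code: the function body, the fallback logger name, and the assumed error shape `{'errors': [{'code': ...}]}` are inferred from the commit text, and only the one message quoted above is filled in.

```python
import logging

GCP_ERROR_MESSAGES = {
    "ZONE_RESOURCE_POOL_EXHAUSTED": (
        "GCP resources temporarily unavailable in the configured zone. "
        "The test will be retried automatically when resources become available."
    ),
    # Entries for QUOTA_EXCEEDED, TIMEOUT, etc. are omitted here because
    # their wording is not shown in the commit message.
}

GENERIC_MESSAGE = "VM creation failed"


def parse_gcp_error(result, log=None):
    """Map a GCP error response to a user-facing message.

    Unknown codes are logged in full server-side but only the generic
    message is returned, so project names or zones never reach the user.
    """
    log = log or logging.getLogger("gcp_error_sketch")
    errors = result.get("errors") if isinstance(result, dict) else None
    first = errors[0] if errors else {}
    code = first.get("code", "UNKNOWN") if isinstance(first, dict) else "UNKNOWN"
    if code in GCP_ERROR_MESSAGES:
        return GCP_ERROR_MESSAGES[code]
    log.error("Unrecognised GCP error: %r", result)
    return GENERIC_MESSAGE
```

The optional `log` argument mirrors the testability hook mentioned above: a test can pass a recording stub and assert that the details were logged but not returned.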

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Replace cron-based polling with an optional Celery task queue system
for faster, more reliable test execution. This architectural improvement
enables event-driven processing, parallel test execution, and better
retry handling.

New files:
- celery_app.py: Celery application factory with Flask context
- mod_ci/tasks.py: Task definitions (start_test, cleanup, pending check)
- install/celery-worker.service: Systemd service for Celery worker
- install/celery-beat.service: Systemd service for Celery beat scheduler
- tests/test_ci/test_tasks.py: Unit tests for Celery functionality

Modified files:
- requirements.txt: Added celery[redis] and redis packages
- config_sample.py: Added Celery configuration options
- mod_ci/controllers.py: Added trigger_test_tasks(), updated add_test_entry()
- install/installation.md: Added Celery setup documentation

Key features:
- USE_CELERY_TASKS feature flag for gradual migration (default: False)
- Parallel mode: cron continues as fallback during transition
- Three task queues: default, test_execution, maintenance
- Periodic tasks via Celery Beat for cleanup and pending test discovery
- Graceful degradation if Celery unavailable

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>