
Fix CDR workflow flakiness: Fleet API retries, None-version guards, artifact checks, unique GCP deployment names#4091

Draft
Copilot wants to merge 2 commits into main from copilot/fix-github-actions-workflow-failures

Conversation

Contributor

Copilot AI commented Apr 2, 2026

The CDR ("Create Environment with Cloud Logs") workflow was failing/flaking due to four distinct root causes: transient Kibana/Fleet 502/503 errors crashing scripts immediately, None package versions causing TypeError in downstream version comparisons, missing generated artifacts from cascading failures, and deterministic 409 collisions on GCP deployment re-runs.

tests/fleet_api/base_call_api.py — Retry with exponential backoff

  • perform_api_call now retries up to 8× on HTTP 502/503/504 and on ConnectionError/Timeout/ChunkedEncodingError
  • Backoff is jittered exponential, capped at 60s
  • Added 60s request timeout to all calls
  • New private _retry_sleep() helper applies jitter before the cap
# before: one shot, crashes immediately on 502/503
response = requests.request(method=method, url=url, headers=headers, auth=auth, **params)

# after: retries with backoff, logs each transient failure
while True:
    attempt += 1
    try:
        response = requests.request(..., timeout=60, ...)
    except TRANSIENT_EXCEPTIONS as exc:
        if attempt <= retries:
            time.sleep(_retry_sleep(attempt, ...)); continue
        raise
    if response.status_code in TRANSIENT_STATUSES and attempt <= retries:
        time.sleep(_retry_sleep(attempt, ...)); continue
    ...

Integration installers — None package-version guard

Added early-exit guard in three installers matching the pattern already used in install_elastic_defend_integration.py:

  • install_aws_asset_inventory_integration.py
  • install_azure_asset_inventory_integration.py
  • install_gcp_asset_inventory_integration.py (also calls version.parse() directly, making the guard critical)
package_version = get_package_version(cfg=cnfg.elk_config, package_name=PACKAGE_NAME)
if not package_version:
    logger.error("Could not resolve ... package version from Fleet (Kibana may be unavailable).")
    raise SystemExit(1)
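For context on why the guard is critical: packaging's version parser raises TypeError when handed None instead of a version string, which is exactly the crash seen in the job logs. A quick illustration (the version strings here are arbitrary examples):

```python
from packaging import version

# Normal path: a parsed Version object supports comparisons.
assert version.parse("8.12.0") < version.parse("8.13.0")

# Failure path: when Fleet is unavailable, get_package_version returns
# None, and parsing that raises TypeError instead of a useful error.
try:
    version.parse(None)
except TypeError as exc:
    print(f"TypeError: {exc}")
```

Failing fast with SystemExit and a clear log message turns that cryptic TypeError into an actionable "Kibana may be unavailable" signal.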

.github/actions/cdr/action.yml — Workflow-level fixes

"Deploy Azure Asset Inventory agent"

  • Gate condition changed from asset_inventory_supported == 'true' to azure-asset-inventory-integration.outcome == 'success' (stops cascading failures when the installer step fails)
  • Added missing STACK_VERSION env var (script uses set -euo pipefail; was failing on unbound variable)
  • Added explicit arm_parameters.json existence check before az deployment (file is only written on successful fleet calls)

"Deploy AWS Asset Inventory agent"

  • Gate condition changed to require aws-asset-inventory.outcome == 'success'
  • Added explicit check for aws-asset-inventory-linux.sh before SCP (file is generated at the end of the installer; missing it caused cryptic scp: No such file or directory)
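The actual check lives in the action's shell step, but the fail-fast pattern can be sketched in Python for illustration. The function name and error wording below are assumptions; only the artifact path is taken from the PR:

```python
import sys
from pathlib import Path

# Path from the PR: the installer writes this script as its last step,
# so its absence means an earlier step failed.
ARTIFACT = Path("tests/integrations_setup/aws-asset-inventory-linux.sh")


def require_artifact(path: Path) -> None:
    """Exit with an actionable message instead of a cryptic scp error."""
    if not path.is_file():
        sys.exit(
            f"ERROR: expected generated installer {path} is missing; "
            "the AWS asset-inventory installer step likely failed earlier."
        )
```

Checking for the artifact before scp replaces the opaque "scp: No such file or directory" with a message that points at the real root cause.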

"Create GCP Service Account"

  • Deployment name changed from <name>-acc to <name>-acc-${{ github.run_id }} to eliminate 409 "already exists" errors on re-runs

Screenshot/Data

N/A — workflow reliability fix; no new features or findings format changes.

Related Issues

Checklist

  • I have added tests that prove my fix is effective or that my feature works
  • I have added the necessary README/documentation (if appropriate)

Introducing a new rule?

Original prompt

Repo: elastic/cloudbeat

A GitHub Actions workflow run for "Create Environment with Cloud Logs (CDR)" is failing/flaky due to transient Kibana/Fleet API errors (502/503), missing generated artifacts cascading from earlier failures, and deterministic infra creation collisions.

Evidence (job logs ref 48fce74):

  • Fleet API calls intermittently return 502 "The instance rejected the connection." and 503 "The requested resource is currently unavailable." causing scripts to exit.
  • Several installer scripts log Package version: None and crash when parsing/comparing versions (TypeError from packaging.version).
  • Downstream steps fail because expected generated scripts do not exist (e.g. scp of tests/integrations_setup/aws-asset-inventory-linux.sh fails).
  • GCP deployment creation fails deterministically with 409 already exists (.../deployments/<name>-acc already exists).
  • Azure CLI fails to parse parameters file @arm_parameters.json, likely due to path/working-directory mismatch.

Implement fixes and open a PR:

  1. Make Fleet API calls resilient to transient failures:

    • Update tests/fleet_api/base_call_api.py perform_api_call to retry with exponential backoff + jitter on transient HTTP statuses (502/503/504) and on transient request exceptions (ConnectionError/Timeout/etc.).
    • Ensure requests include a reasonable timeout.
  2. Prevent None package versions from causing TypeError and provide actionable errors:

    • Update integration installer scripts that call get_package_version to explicitly check for a falsey/None return and fail fast with a clear log message (or optionally retry via shared API layer). At minimum include:
      • tests/integrations_setup/install_aws_asset_inventory_integration.py
      • tests/integrations_setup/install_azure_asset_inventory_integration.py
    • Ensure version compatibility checks are not invoked with None.
  3. Reduce cascading failures from missing generated artifacts:

    • Where workflow/scripts assume generated install scripts exist (e.g., aws-asset-inventory-linux.sh), add explicit checks and fail with clear message before attempting scp/install.
    • If this logic is in shell scripts under tests/integrations_setup or called workflows, implement accordingly.
  4. Fix deterministic infra collisions in GCP deployment creation:

    • Identify the script/workflow step creating the GCP Deployment Manager deployment <deployment-name>-acc and modify it to use a unique name per run (e.g., append short random suffix or ${GITHUB_RUN_ID}/${GITHUB_RUN_ATTEMPT}), or implement create-or-update semantics if appropriate.
  5. Fix Azure ARM parameter file parsing failure:

    • Ensure the az deployment command references the correct parameters file path (use an explicit path to deploy/azure/arm_parameters.json generated by the python installer, or cd to the right directory). Update the relevant script/workflow.

Constraints:

  • Keep changes minimal and focused on reliability.
  • Add/adjust logging to make transient readiness issues obvious.
  • Update/extend tests if present, but prioritize workflow stability.

Create a pull request with these changes.

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: Please find a solution for this failing job. Use the logs, job definition, and any referenced files where the failure occurred. Keep your response focused on the solution and include code suggestions when appropriate.

Assistant: The job is failing because parts of the workflow assume external services (Elastic Cloud Kibana/Fleet, Azure ARM deploy, GCP Deployment Manager) are immediately available and that certain generated files exist, but none of those assumptions hold reliably in this run. You’re seeing three distinct failure classes:

  1. Transient Elastic Cloud / Kibana Fleet API failures (502/503)
    Logs show repeated failures like:

    • API call failed, status code 502 ... "The instance rejected the connection."
    • API call failed, status code 503 ... "The requested resource is currently unavailable."

    These happen during:

    • agent policy creation (create_agent_policy)
    • package version resolution (get_package_version)
    • entity store enablement (enable_entity_store_v2)

    Concrete fix: add retry/backoff (and optionally a readiness wait) around Fleet API calls, and make package-version parsing robust when Fleet is temporarily unavailable.

    Code suggestion: add retries to the shared Fleet API caller

    Update tests/fleet_api/base_call_api.py (ref 48fce74b80b498d4c9b2fdd6a07f3bca43f1f18b) so transient 502/503/504 (and connection errors) are retried instead of failing the whole job immediately.

    Suggested patch (illustrative):

    # tests/fleet_api/base_call_api.py
    
    import time
    import random
    import requests
    
    TRANSIENT_STATUSES = {502, 503, 504}
    TRANSIENT_EXCEPTIONS = (
        requests.excepti...

This pull request was created from Copilot chat.

@mergify

mergify bot commented Apr 2, 2026

This pull request does not have a backport label. Could you fix it @Copilot? 🙏
To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v\d.\d.\d is the label to automatically backport to the 8.\d branch. \d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

…ecks, unique GCP names

Agent-Logs-Url: https://github.com/elastic/cloudbeat/sessions/9d369ca2-51a6-4ff9-ad51-9f61348b182b

Co-authored-by: romulets <5350001+romulets@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix GitHub Actions workflow failures for Cloud Environment creation Fix CDR workflow flakiness: Fleet API retries, None-version guards, artifact checks, unique GCP deployment names Apr 2, 2026
Copilot AI requested a review from romulets April 2, 2026 13:54
