Conversation
This pull request does not have a backport label. Could you fix it @Copilot? 🙏
…ecks, unique GCP names
Agent-Logs-Url: https://github.com/elastic/cloudbeat/sessions/9d369ca2-51a6-4ff9-ad51-9f61348b182b
Co-authored-by: romulets <5350001+romulets@users.noreply.github.com>
Copilot (AI) changed the title from "[WIP] Fix GitHub Actions workflow failures for Cloud Environment creation" to "Fix CDR workflow flakiness: Fleet API retries, None-version guards, artifact checks, unique GCP deployment names" on Apr 2, 2026.
The CDR ("Create Environment with Cloud Logs") workflow was failing/flaking due to four distinct root causes: transient Kibana/Fleet 502/503 errors crashing scripts immediately, `None` package versions causing `TypeError` in downstream version comparisons, missing generated artifacts from cascading failures, and deterministic 409 collisions on GCP deployment re-runs.

`tests/fleet_api/base_call_api.py` — Retry with exponential backoff
- `perform_api_call` now retries up to 8× on HTTP 502/503/504 and on `ConnectionError`/`Timeout`/`ChunkedEncodingError`
- Added a `timeout` to all calls
- `_retry_sleep()` helper applies jitter before the cap

Integration installers — `None` package-version guard
- Added an early-exit guard in three installers, matching the pattern already used in `install_elastic_defend_integration.py`:
  - `install_aws_asset_inventory_integration.py`
  - `install_azure_asset_inventory_integration.py`
  - `install_gcp_asset_inventory_integration.py` (also calls `version.parse()` directly, making the guard critical)

`.github/actions/cdr/action.yml` — Workflow-level fixes

"Deploy Azure Asset Inventory agent"
- `asset_inventory_supported == 'true'` → `azure-asset-inventory-integration.outcome == 'success'` (stops cascading when the installer fails)
- Added the `STACK_VERSION` env var (script uses `set -euo pipefail`; was failing on an unbound variable)
- Added an `arm_parameters.json` existence check before `az deployment` (the file is only written on successful fleet calls)

"Deploy AWS Asset Inventory agent"
- Gate on `aws-asset-inventory.outcome == 'success'`
- Check for `aws-asset-inventory-linux.sh` before SCP (the file is generated at the end of the installer; missing it caused a cryptic `scp: No such file or directory`)

"Create GCP Service Account"
- `<name>-acc` → `<name>-acc-${{ github.run_id }}` to eliminate 409 "already exists" errors on re-runs

Screenshot/Data

N/A — workflow reliability fix; no new features or findings format changes.
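The `None`-version guard added to the installers can be sketched as follows. This is a minimal illustration of the pattern, not the exact cloudbeat code; the function name and log message are assumptions:

```python
# Hypothetical sketch of the early-exit guard for a None package version.
import logging
import sys

from packaging import version  # the installers already use packaging.version

logger = logging.getLogger(__name__)


def guard_package_version(pkg_version):
    """Fail fast on a falsey/None version instead of a TypeError downstream."""
    if not pkg_version:
        logger.error(
            "Package version is None/empty - Fleet may be temporarily "
            "unavailable; aborting installer."
        )
        sys.exit(1)
    # Safe to parse/compare once the guard has passed.
    return version.parse(pkg_version)
```

Calling a guard like this before any `version.parse()` comparison turns a cryptic `TypeError` into an actionable log line and a clean exit.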
Related Issues
Checklist
Introducing a new rule?
Original prompt
Repo: elastic/cloudbeat
A GitHub Actions workflow run for "Create Environment with Cloud Logs (CDR)" is failing/flaky due to transient Kibana/Fleet API errors (502/503), missing generated artifacts cascading from earlier failures, and deterministic infra creation collisions.
Evidence (job logs ref 48fce74):
- `Package version: None` and a crash when parsing/comparing versions (`TypeError` from `packaging.version`).
- `tests/integrations_setup/aws-asset-inventory-linux.sh` fails).
- `…/deployments/<name>-acc already exists`).
- `@arm_parameters.json`, likely due to a path/working-directory mismatch.

Implement fixes and open a PR:
Make Fleet API calls resilient to transient failures:
In `tests/fleet_api/base_call_api.py`, update `perform_api_call` to retry with exponential backoff + jitter on transient HTTP statuses (502/503/504) and on transient request exceptions (`ConnectionError`/`Timeout`/etc.).

Prevent `None` package versions from causing `TypeError` and provide actionable errors:
Update `get_package_version` to explicitly check for a falsey/`None` return and fail fast with a clear log message (or optionally retry via the shared API layer). At minimum include:
- `tests/integrations_setup/install_aws_asset_inventory_integration.py`
- `tests/integrations_setup/install_azure_asset_inventory_integration.py`

Reduce cascading failures from missing generated artifacts:
Where the workflow expects generated artifacts (e.g., `aws-asset-inventory-linux.sh`), add explicit checks and fail with a clear message before attempting scp/install. If the checks belong in `tests/integrations_setup` or called workflows, implement accordingly.

Fix deterministic infra collisions in GCP deployment creation:
Find where the deployment name `<deployment-name>-acc` is built and modify it to use a unique name per run (e.g., append a short random suffix or `${GITHUB_RUN_ID}`/`${GITHUB_RUN_ATTEMPT}`), or implement create-or-update semantics if appropriate.

Fix Azure ARM parameter file parsing failure:
Ensure the `az deployment` command references the correct parameters file path (use an explicit path to `deploy/azure/arm_parameters.json` generated by the Python installer, or `cd` to the right directory). Update the relevant script/workflow.

Constraints:
Create a pull request with these changes.
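The artifact pre-flight check requested in item 3 above might look like the following sketch; the helper name and error wording are assumptions, not existing cloudbeat code:

```python
# Hypothetical pre-flight check for a generated artifact before scp/install.
import sys
from pathlib import Path


def require_artifact(path: str) -> Path:
    """Exit with a clear error instead of a cryptic 'scp: No such file or directory'."""
    artifact = Path(path)
    if not artifact.is_file():
        print(
            f"ERROR: expected artifact {artifact} was not generated by the "
            "installer; skipping deploy.",
            file=sys.stderr,
        )
        sys.exit(1)
    return artifact
```

Running a check like this immediately after the installer, and again before the SCP step, surfaces the real failure at its source instead of several steps later.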
The following is the prior conversation context from the user's chat exploration (may be truncated):
User: Please find a solution for this failing job. Use the logs, job definition, and any referenced files where the failure occurred. Keep your response focused on the solution and include code suggestions when appropriate.
Assistant: The job is failing because parts of the workflow assume external services (Elastic Cloud Kibana/Fleet, Azure ARM deploy, GCP Deployment Manager) are immediately available and that certain generated files exist, but none of those assumptions hold reliably in this run. You’re seeing three distinct failure classes:
Transient Elastic Cloud / Kibana Fleet API failures (502/503)
Logs show repeated failures like:
API call failed, status code 502 ... "The instance rejected the connection."
API call failed, status code 503 ... "The requested resource is currently unavailable."

These happen during:
- `create_agent_policy`
- `get_package_version`
- `enable_entity_store_v2`

Concrete fix: add retry/backoff (and optionally a readiness wait) around Fleet API calls, and make package-version parsing robust when Fleet is temporarily unavailable.
Code suggestion: add retries to the shared Fleet API caller
Update `tests/fleet_api/base_call_api.py` (ref `48fce74b80b498d4c9b2fdd6a07f3bca43f1f18b`) so transient 502/503/504 (and connection errors) are retried instead of failing the whole job immediately. Suggested patch (illustrative):
This pull request was created from Copilot chat.