diff --git a/.github/workflows/post-rca-comment.yml b/.github/workflows/post-rca-comment.yml new file mode 100644 index 0000000..72638fc --- /dev/null +++ b/.github/workflows/post-rca-comment.yml @@ -0,0 +1,40 @@ +name: Post Root Cause Analysis Comment + +on: + workflow_dispatch: + inputs: + issue_number: + description: 'Issue number to comment on' + required: true + default: '12' + +permissions: + issues: write + contents: read + +jobs: + post-comment: + runs-on: ubuntu-latest + steps: + - name: Checkout repository + uses: actions/checkout@v4 + + - name: Post Root Cause Analysis + uses: actions/github-script@v7 + with: + script: | + const fs = require('fs'); + const issueNumber = Number(context.payload.inputs.issue_number); + + // Read the root cause analysis file + const commentBody = fs.readFileSync('Act-3/ROOT_CAUSE_ANALYSIS.md', 'utf8'); + + // Post the comment + await github.rest.issues.createComment({ + owner: context.repo.owner, + repo: context.repo.repo, + issue_number: issueNumber, + body: commentBody + }); + + console.log(`Posted root cause analysis to issue #${issueNumber}`); diff --git a/Act-3/HOW_TO_POST_RCA.md b/Act-3/HOW_TO_POST_RCA.md new file mode 100644 index 0000000..7c9cd30 --- /dev/null +++ b/Act-3/HOW_TO_POST_RCA.md @@ -0,0 +1,42 @@ +# How to Post the Root Cause Analysis to GitHub Issue + +The root cause analysis for the ArgoCD deployment failure has been completed and documented in `ROOT_CAUSE_ANALYSIS.md`. + +## Automated Options + +### Option 1: Using GitHub CLI +```bash +cd Act-3 +gh issue comment 12 --body-file ROOT_CAUSE_ANALYSIS.md +``` + +### Option 2: Using the Bash Script +```bash +cd Act-3 +export GITHUB_TOKEN="your_github_token_here" +./post-rca-to-issue.sh 12 +``` + +### Option 3: Using GitHub Actions Workflow +1. Go to the Actions tab in the repository +2. Select the "Post Root Cause Analysis Comment" workflow +3. Click "Run workflow" +4. Enter issue number: `12` +5. 
Click "Run workflow" + +## Manual Option + +If automated options are not available: + +1. Open the GitHub issue: https://github.com/DevExpGbb/agentic-platform-engineering/issues/12 +2. Copy the content from `ROOT_CAUSE_ANALYSIS.md` +3. Paste it as a new comment on the issue +4. Click "Comment" + +## Summary of Findings + +**Root Cause:** Invalid Kubernetes manifest with malformed `apiVersion` field +**Location:** `apps/broken-aks-store-all-in-one.yaml` line 178 in source repository +**Issue:** `apiVersion: apps/v` should be `apiVersion: apps/v1` + +See `ROOT_CAUSE_ANALYSIS.md` for complete details and remediation recommendations. diff --git a/Act-3/INVESTIGATION_SUMMARY.md b/Act-3/INVESTIGATION_SUMMARY.md new file mode 100644 index 0000000..c247d92 --- /dev/null +++ b/Act-3/INVESTIGATION_SUMMARY.md @@ -0,0 +1,87 @@ +# Investigation Summary: ArgoCD Deployment Failure + +**Date:** 2026-02-03 +**Issue:** #12 - 🚨 ArgoCD Deployment Failed: 2-broken-apps +**Application:** 2-broken-apps +**Status:** ✅ Root Cause Identified + +--- + +## Executive Summary + +The ArgoCD deployment failure for the `2-broken-apps` application has been thoroughly investigated. The root cause has been identified as an **intentionally broken Kubernetes manifest** in the source repository used for testing the ArgoCD notification system. 
+ +## Root Cause + +**Problem:** Invalid `apiVersion` field in Deployment manifest +**Location:** `https://github.com/dcasati/argocd-notification-examples.git` +- File: `apps/broken-aks-store-all-in-one.yaml` +- Line: 178 +- Current value: `apiVersion: apps/v` (incomplete) +- Expected value: `apiVersion: apps/v1` (complete) + +**Affected Resource:** `order-service` Deployment + +## Why This Matters + +- Kubernetes cannot parse manifests with invalid `apiVersion` values +- ArgoCD validation fails before attempting to apply the resource +- Results in "synchronization tasks are not valid" error +- Application remains in "Degraded" health and "OutOfSync" status + +## Context + +Based on analysis of the repository and commit history: + +1. **Repository Name:** `argocd-notification-examples` - suggests this is a testing repository +2. **Commit Message:** "break apiVersion formatting in deployment YAML" - explicitly indicates intentional breakage +3. **Purpose:** This appears to be a test case to validate the ArgoCD notification system + +**Result:** ✅ The notification system is working correctly. The automated workflow successfully detected the failure and created GitHub issue #12. + +## Documentation Provided + +All findings have been documented in: + +1. **`Act-3/ROOT_CAUSE_ANALYSIS.md`** - Complete technical analysis with: + - Detailed problem description + - 4 remediation options + - Verification steps + - Investigation methodology + +2. **`Act-3/HOW_TO_POST_RCA.md`** - Instructions for posting the analysis to GitHub issue + +3. **`Act-3/post-rca-to-issue.sh`** - Bash script for automated posting (requires GitHub token) + +4. 
**`.github/workflows/post-rca-comment.yml`** - GitHub Actions workflow for posting via UI + +## Next Steps + +### If This Is a Test (Most Likely) +- ✅ Mark test as successful - notification system is working +- Consider deleting the application: `argocd app delete 2-broken-apps` +- Update documentation about the test case + +### If This Needs to Be Fixed +- Follow Option 1 in `ROOT_CAUSE_ANALYSIS.md` to fix the source repository +- Change line 178: `apiVersion: apps/v` → `apiVersion: apps/v1` +- Commit, push, and trigger ArgoCD sync + +## Verification + +The notification workflow successfully: +1. Detected the ArgoCD sync failure +2. Extracted failure details including error message and revision +3. Triggered a GitHub `repository_dispatch` event +4. Created GitHub issue #12 with comprehensive failure information +5. Applied appropriate labels: `argocd-deployment-failure`, `automated`, `bug` + +## Conclusion + +**Root Cause:** Malformed `apiVersion` field in the test repository (apparently intentional) +**System Status:** ArgoCD notification system is functioning correctly +**Recommendation:** If testing is complete, delete the test application. Otherwise, fix the source repository manifest. + +--- + +For complete technical details and remediation options, see `Act-3/ROOT_CAUSE_ANALYSIS.md`. diff --git a/Act-3/README.md b/Act-3/README.md new file mode 100644 index 0000000..aadfb8b --- /dev/null +++ b/Act-3/README.md @@ -0,0 +1,119 @@ +# Act 3: ArgoCD Deployment Failure Investigation + +This directory contains the investigation results for the ArgoCD deployment failure of the `2-broken-apps` application. 
+ +## Quick Links + +- 📊 **[Investigation Summary](INVESTIGATION_SUMMARY.md)** - Executive summary of findings +- 🔍 **[Root Cause Analysis](ROOT_CAUSE_ANALYSIS.md)** - Detailed technical analysis with remediation options +- 📝 **[How to Post RCA](HOW_TO_POST_RCA.md)** - Instructions for posting the analysis to GitHub issue #12 + +## Investigation Results + +### Root Cause +Invalid Kubernetes manifest with incomplete `apiVersion` field in the source repository. + +- **Location:** `apps/broken-aks-store-all-in-one.yaml` (line 178) +- **Issue:** `apiVersion: apps/v` (should be `apiVersion: apps/v1`) +- **Repository:** https://github.com/dcasati/argocd-notification-examples.git +- **Revision:** `8cd04df204028ff78613a69fdb630625864037c6` + +### Conclusion +This appears to be an **intentional test case** to validate the ArgoCD notification system: +- ✅ The notification system detected the failure +- ✅ GitHub issue #12 was automatically created +- ✅ All error details were properly captured and reported + +## Files in This Directory + +| File | Description | +|------|-------------| +| `INVESTIGATION_SUMMARY.md` | Executive summary of the investigation | +| `ROOT_CAUSE_ANALYSIS.md` | Complete technical analysis with 4 remediation options | +| `HOW_TO_POST_RCA.md` | Instructions for posting analysis to GitHub issue | +| `post-rca-to-issue.sh` | Bash script for automated posting (requires GitHub token) | +| `argocd-test-app.yaml` | ArgoCD Application manifest (the one causing the issue) | + +## Related Files + +| File | Description | +|------|-------------| +| `../.github/workflows/post-rca-comment.yml` | GitHub Actions workflow for posting RCA to issue | +| `../.github/workflows/argocd-deployment-failure.yml` | Workflow that creates issues on ArgoCD failures | +| `../.github/argocd/argocd-notifications-config.yaml` | ArgoCD notification configuration | + +## Remediation Options + +The `ROOT_CAUSE_ANALYSIS.md` provides four options: + +1. 
**Fix the source repository** (recommended if not a test) +2. **Use a different revision** (rollback to working commit) +3. **Use a different source repository** (point to valid repo) +4. **Delete the application** (if testing is complete) + +## How to Use These Files + +### To Post the Analysis to GitHub Issue #12 + +Choose one of these methods: + +```bash +# Option 1: Using GitHub CLI +gh issue comment 12 --body-file ROOT_CAUSE_ANALYSIS.md + +# Option 2: Using the bash script (requires GITHUB_TOKEN) +export GITHUB_TOKEN="your_token_here" +./post-rca-to-issue.sh 12 + +# Option 3: Manual copy/paste +# Open ROOT_CAUSE_ANALYSIS.md and copy content to GitHub issue #12 +``` + +### To Fix the Issue + +If this is not a test and needs to be fixed: + +```bash +# Clone the source repository +git clone https://github.com/dcasati/argocd-notification-examples.git +cd argocd-notification-examples + +# Fix the apiVersion +sed -i 's/apiVersion: apps\/v$/apiVersion: apps\/v1/' apps/broken-aks-store-all-in-one.yaml + +# Commit and push +git add apps/broken-aks-store-all-in-one.yaml +git commit -m "Fix: Complete apiVersion for order-service Deployment" +git push origin main + +# Trigger ArgoCD sync +argocd app sync 2-broken-apps +``` + +## Background: ArgoCD Notifications + +This investigation demonstrates the ArgoCD notification system working correctly: + +``` +ArgoCD detects failure + ↓ +ArgoCD Notifications sends webhook + ↓ +GitHub repository_dispatch + ↓ +GitHub Actions creates issue + ↓ +GitHub Copilot investigates + ↓ +Root cause identified and documented +``` + +## Related Issues + +- GitHub Issue #12: 🚨 ArgoCD Deployment Failed: 2-broken-apps +- GitHub Issue #11: 🚨 ArgoCD Deployment Failed: 2-broken-apps (duplicate) + +--- + +**Investigation completed:** 2026-02-03 +**Investigated by:** GitHub Copilot Agent diff --git a/Act-3/ROOT_CAUSE_ANALYSIS.md b/Act-3/ROOT_CAUSE_ANALYSIS.md new file mode 100644 index 0000000..ed6f4cd --- /dev/null +++ b/Act-3/ROOT_CAUSE_ANALYSIS.md @@ 
-0,0 +1,169 @@ +# Root Cause Analysis: ArgoCD Deployment Failure (2-broken-apps) + +**Investigation Date:** 2026-02-03 +**Issue:** #12 - 🚨 ArgoCD Deployment Failed: 2-broken-apps +**Status:** Root Cause Identified + +--- + +## 🔍 Root Cause Analysis + +I've investigated the ArgoCD deployment failure for the `2-broken-apps` application and identified the root cause. + +### Summary +The deployment is failing due to an **invalid Kubernetes manifest** in the source repository. Specifically, there is a malformed `apiVersion` field in the `order-service` Deployment manifest. + +### Root Cause Details + +**Location:** `apps/broken-aks-store-all-in-one.yaml` (lines 178-179) + +**Issue:** The `apiVersion` field is incomplete: +```yaml +apiVersion: apps/v # ❌ INVALID - incomplete version +kind: Deployment +metadata: + name: order-service +``` + +**Expected:** +```yaml +apiVersion: apps/v1 # ✅ CORRECT +kind: Deployment +metadata: + name: order-service +``` + +### Technical Analysis + +1. **Repository:** https://github.com/dcasati/argocd-notification-examples.git +2. **Broken Commit:** `8cd04df204028ff78613a69fdb630625864037c6` +3. **Commit Message:** "break apiVersion formatting in deployment YAML" +4. **Affected Resource:** `order-service` Deployment in `apps/broken-aks-store-all-in-one.yaml` + +The error message "one or more synchronization tasks are not valid" is ArgoCD's response to encountering an invalid Kubernetes manifest that cannot be parsed or validated against the Kubernetes API. + +### Impact + +- **Health Status:** Degraded (as reported) +- **Sync Status:** OutOfSync (as reported) +- **Failed Resource:** order-service Deployment +- **Retry Behavior:** ArgoCD attempted to sync 2 times before giving up (as configured in the retry policy) + +--- + +## 📋 Remediation Recommendations + +### Option 1: Fix the Source Repository (Recommended) +This is the proper long-term fix if you control the source repository: + +```bash +# 1. 
Clone the source repository +git clone https://github.com/dcasati/argocd-notification-examples.git +cd argocd-notification-examples + +# 2. Edit the broken manifest +# Change line 178 from "apiVersion: apps/v" to "apiVersion: apps/v1" +sed -i 's/apiVersion: apps\/v$/apiVersion: apps\/v1/' apps/broken-aks-store-all-in-one.yaml + +# 3. Commit and push the fix +git add apps/broken-aks-store-all-in-one.yaml +git commit -m "Fix: Complete apiVersion for order-service Deployment" +git push origin main + +# 4. Trigger ArgoCD sync +argocd app sync 2-broken-apps +``` + +### Option 2: Use a Different Revision +Point the ArgoCD Application to a working commit (if one exists before the breaking change): + +```bash +# Find a working commit +git log --oneline apps/broken-aks-store-all-in-one.yaml + +# Update the ArgoCD Application to use that revision +argocd app set 2-broken-apps --revision <working-commit-sha> +argocd app sync 2-broken-apps +``` + +### Option 3: Use a Different Source Repository +If this repository is intentionally broken for testing, update the ArgoCD Application manifest to point to a working repository: + +```bash +# Edit Act-3/argocd-test-app.yaml +# Change spec.source.repoURL to a valid repository +# For example: https://github.com/Azure-Samples/aks-store-demo.git +# Change spec.source.path to a valid path +# For example: aks-store-all-in-one.yaml +``` + +### Option 4: Delete the Application (If Testing) +If this was intentionally created to test the ArgoCD notification system and is no longer needed: + +```bash +# Delete the application from ArgoCD +argocd app delete 2-broken-apps + +# Or delete the Application resource defined in the manifest +kubectl delete -f Act-3/argocd-test-app.yaml +``` + +--- + +## 🔐 Additional Observations + +Based on the repository structure and commit message, this appears to be an **intentional test case** to validate the ArgoCD notification system. The repository is named "argocd-notification-examples" and the commit explicitly states it's breaking the YAML. 
+ +**If this is a test:** +- ✅ The notification system is working correctly +- ✅ GitHub Actions workflow successfully created this issue +- ✅ The error detection and reporting mechanism is functioning as designed + +**If this is not a test:** +- Follow Option 1 above to fix the source repository +- Verify the fix by running: `kubectl apply --dry-run=server -f apps/broken-aks-store-all-in-one.yaml` + +--- + +## 📊 Verification Steps + +After applying any fix, verify the deployment: + +```bash +# 1. Check application status +argocd app get 2-broken-apps + +# 2. Watch for sync completion +argocd app wait 2-broken-apps --health + +# 3. Verify pods are running +kubectl get pods -n default -l app=order-service + +# 4. Check deployment status +kubectl describe deployment order-service -n default +``` + +--- + +## Investigation Methodology + +1. **Examined ArgoCD Application Manifest** + - Located at: `Act-3/argocd-test-app.yaml` + - Identified source repository and path + +2. **Cloned Source Repository** + - Repository: https://github.com/dcasati/argocd-notification-examples.git + - Analyzed commit history and current state + +3. **Identified Broken Manifest** + - File: `apps/broken-aks-store-all-in-one.yaml` + - Line 178: Malformed `apiVersion: apps/v` (missing the `1`) + +4. **Confirmed Root Cause** + - The incomplete apiVersion prevents Kubernetes from parsing the manifest + - ArgoCD cannot validate or apply the resource + - Results in "synchronization tasks are not valid" error + +--- + +**Note:** This root cause analysis was performed by examining the source repository at revision `8cd04df204028ff78613a69fdb630625864037c6` and identifying the malformed `apiVersion` field in the order-service Deployment manifest. 
diff --git a/Act-3/post-rca-to-issue.sh b/Act-3/post-rca-to-issue.sh new file mode 100755 index 0000000..0cbefd8 --- /dev/null +++ b/Act-3/post-rca-to-issue.sh @@ -0,0 +1,38 @@ +#!/bin/bash +# Script to post root cause analysis to GitHub issue + +set -e + +ISSUE_NUMBER="${1:-12}" +REPO_OWNER="DevExpGbb" +REPO_NAME="agentic-platform-engineering" +COMMENT_FILE="$(dirname "$0")/ROOT_CAUSE_ANALYSIS.md" + +echo "Posting root cause analysis to issue #${ISSUE_NUMBER}..." + +# Check if GitHub token is available +if [ -z "${GITHUB_TOKEN}" ] && [ -z "${GH_TOKEN}" ]; then + echo "ERROR: No GitHub token found in environment" + echo "Please set GITHUB_TOKEN or GH_TOKEN environment variable" + echo "" + echo "Alternatively, you can manually post the comment from: ${COMMENT_FILE}" + echo "Or trigger the workflow: .github/workflows/post-rca-comment.yml" + exit 1 +fi + +# Use GITHUB_TOKEN if available, otherwise GH_TOKEN +TOKEN="${GITHUB_TOKEN:-$GH_TOKEN}" + +# Read the comment body and encode it as a JSON string +COMMENT_BODY=$(jq -Rs . "${COMMENT_FILE}") + +# Send the API request (--fail exits non-zero on HTTP errors, so set -e aborts before the success message) +curl --fail -X POST \ + -H "Accept: application/vnd.github+json" \ + -H "Authorization: Bearer ${TOKEN}" \ + -H "X-GitHub-Api-Version: 2022-11-28" \ + "https://api.github.com/repos/${REPO_OWNER}/${REPO_NAME}/issues/${ISSUE_NUMBER}/comments" \ + -d "{\"body\":${COMMENT_BODY}}" + +echo "" +echo "Successfully posted root cause analysis to issue #${ISSUE_NUMBER}"