diff --git a/Act-3/ISSUE-12-FINDINGS.md b/Act-3/ISSUE-12-FINDINGS.md
new file mode 100644
index 0000000..29c6926
--- /dev/null
+++ b/Act-3/ISSUE-12-FINDINGS.md
@@ -0,0 +1,143 @@
+# Issue #12 - Root Cause Analysis Findings
+
+**Issue:** [🚨 ArgoCD Deployment Failed: 2-broken-apps](https://github.com/DevExpGbb/agentic-platform-engineering/issues/12)
+**Date:** 2026-02-03
+**Analyzed By:** GitHub Copilot Agent
+
+---
+
+## 🔍 Root Cause Analysis Complete
+
+I've completed a thorough analysis of the deployment failure for the `2-broken-apps` application.
+
+### 📋 Summary
+
+The deployment is failing due to **TWO critical issues** in the source repository manifest (`apps/broken-aks-store-all-in-one.yaml`):
+
+---
+
+### ❌ Issue #1: Invalid API Version (Line 178)
+
+**Location:** [`apps/broken-aks-store-all-in-one.yaml:178`](https://github.com/dcasati/argocd-notification-examples/blob/main/apps/broken-aks-store-all-in-one.yaml#L178)
+
+**Current Code:**
+```yaml
+apiVersion: apps/v
+kind: Deployment
+metadata:
+  name: order-service
+```
+
+**Problem:** The `apiVersion` is incomplete - it should be `apps/v1` not `apps/v`.
+
+**Impact:**
+- ❌ Kubernetes API server rejects the manifest
+- ❌ ArgoCD sync fails with "synchronization tasks are not valid"
+- ❌ The `order-service` deployment cannot be created
+- ❌ Cascading failure blocks entire application stack
+
+**Fix:**
+```yaml
+apiVersion: apps/v1 # Changed from apps/v
+```
+
+---
+
+### ❌ Issue #2: Typo in Docker Image Name (Line 475)
+
+**Location:** [`apps/broken-aks-store-all-in-one.yaml:475`](https://github.com/dcasati/argocd-notification-examples/blob/main/apps/broken-aks-store-all-in-one.yaml#L475)
+
+**Current Code:**
+```yaml
+- name: store-admin
+  image: ghcr.io/azure-samples/aks-store-demo/store-dmin:2.1.0
+```
+
+**Problem:** Typo in image name - `store-dmin` should be `store-admin`.
+
+**Impact:**
+- ❌ Container image pull fails with `ImagePullBackOff`
+- ❌ Pod remains stuck in `Pending` state
+- ❌ Deployment becomes degraded
+- ❌ Health checks fail
+
+**Fix:**
+```yaml
+image: ghcr.io/azure-samples/aks-store-demo/store-admin:2.1.0
+```
+
+---
+
+## 🎯 Recommended Remediation
+
+Since both issues exist in the **external source repository** (https://github.com/dcasati/argocd-notification-examples.git), you have three options:
+
+### Option 1: Fork & Fix (Recommended for testing) ⭐
+
+1. Fork the repository: https://github.com/dcasati/argocd-notification-examples
+2. Fix both issues in your fork
+3. Update the ArgoCD app to point to your fork temporarily:
+   ```yaml
+   spec:
+     source:
+       repoURL: https://github.com/<your-username>/argocd-notification-examples.git
+       targetRevision: fixed-branch
+   ```
+
+### Option 2: Host Corrected Manifests Locally
+
+1. Copy the corrected manifest to this repository under `manifests/aks-store/`
+2. Update [`Act-3/argocd-test-app.yaml`](https://github.com/DevExpGbb/agentic-platform-engineering/blob/main/Act-3/argocd-test-app.yaml) to point to the local path
+
+### Option 3: Use Kustomize Overlays (Advanced)
+
+Create Kustomize patches to fix the remote manifest without forking.
+
+---
+
+## 📄 Full Documentation
+
+I've created a comprehensive Root Cause Analysis document with detailed remediation steps, verification procedures, and impact analysis:
+
+**📖 [View Full RCA Document](./RCA-2-broken-apps.md)**
+
+---
+
+## ✅ Next Steps
+
+1. **Choose a remediation option** based on your requirements
+2. **Implement the fixes** to the source manifests
+3. **Sync the ArgoCD application**:
+   ```bash
+   argocd app sync 2-broken-apps
+   ```
+4. **Verify deployment**:
+   ```bash
+   kubectl get pods -n default
+   argocd app get 2-broken-apps
+   ```
+
+---
+
+## 📊 Impact Summary
+
+| Issue | Location | Severity | Fix Required |
+|-------|----------|----------|--------------|
+| Invalid apiVersion | Line 178 | **Critical** | `apps/v` → `apps/v1` |
+| Invalid image name | Line 475 | **High** | `store-dmin` → `store-admin` |
+
+---
+
+**Analysis Status:** ✅ Complete
+**Root Cause:** Identified (syntax errors in upstream repository)
+**Documentation:** [RCA-2-broken-apps.md](./RCA-2-broken-apps.md)
+**Ready for Remediation:** Yes
+
+---
+
+## 🔗 Related Links
+
+- [Issue #12](https://github.com/DevExpGbb/agentic-platform-engineering/issues/12)
+- [Source Repository](https://github.com/dcasati/argocd-notification-examples)
+- [ArgoCD Application Definition](./argocd-test-app.yaml)
+- [Full RCA Document](./RCA-2-broken-apps.md)
diff --git a/Act-3/RCA-2-broken-apps.md b/Act-3/RCA-2-broken-apps.md
new file mode 100644
index 0000000..e5fda02
--- /dev/null
+++ b/Act-3/RCA-2-broken-apps.md
@@ -0,0 +1,269 @@
+# Root Cause Analysis: 2-broken-apps ArgoCD Deployment Failure
+
+**Date:** 2026-02-03
+**Application:** `2-broken-apps`
+**Status:** ❌ Deployment Failed
+**Analyzed By:** GitHub Copilot Agent
+
+---
+
+## 🔍 Executive Summary
+
+The ArgoCD application `2-broken-apps` is failing to deploy due to **TWO critical issues** in the source repository manifest file (`apps/broken-aks-store-all-in-one.yaml`):
+
+1. **Invalid API Version** (Line 178) - Critical syntax error
+2. **Invalid Docker Image Name** (Line 475) - Typo in image name
+
+Both issues originate from the upstream source repository (https://github.com/dcasati/argocd-notification-examples.git) and require fixes at the source.
+
+---
+
+## 🕐 Timeline
+
+- **Initial Detection:** ArgoCD sync failure detected
+- **Investigation Started:** Manual review of manifest file
+- **Root Cause Identified:** Two syntax/configuration errors found (lines 178, 475)
+- **RCA Documented:** 2026-02-03
+- **Status:** Awaiting remediation decision
+
+---
+
+## 🐛 Issue Details
+
+### Issue #1: Invalid API Version (Line 178) ❌
+
+**Location:** `apps/broken-aks-store-all-in-one.yaml:178`
+
+**Current Code:**
+```yaml
+apiVersion: apps/v
+kind: Deployment
+metadata:
+  name: order-service
+```
+
+**Problem:**
+The `apiVersion` field is incomplete. It reads `apps/v` when it should be `apps/v1`.
+
+**Impact:**
+- Kubernetes API server rejects the manifest immediately
+- ArgoCD reports: "one or more synchronization tasks are not valid"
+- The `order-service` Deployment cannot be created
+- Cascading failure blocks entire application stack
+
+**Fix Required:**
+```yaml
+apiVersion: apps/v1 # Changed from apps/v
+kind: Deployment
+metadata:
+  name: order-service
+```
+
+---
+
+### Issue #2: Invalid Docker Image Name (Line 475) ❌
+
+**Location:** `apps/broken-aks-store-all-in-one.yaml:475`
+
+**Current Code:**
+```yaml
+containers:
+  - name: store-admin
+    image: ghcr.io/azure-samples/aks-store-demo/store-dmin:2.1.0
+```
+
+**Problem:**
+The image name contains a typo: `store-dmin` instead of `store-admin`.
+
+**Impact:**
+- Container image pull fails with `ImagePullBackOff`
+- Pod remains in `Pending` state indefinitely
+- Deployment becomes degraded
+- Application health check fails
+
+**Fix Required:**
+```yaml
+containers:
+  - name: store-admin
+    image: ghcr.io/azure-samples/aks-store-demo/store-admin:2.1.0 # Fixed typo
+```
+
+---
+
+## 🎯 Remediation Options
+
+Since these issues exist in an **external source repository**, there are three approaches to fix them:
+
+### Option 1: Fix the Source Repository (Recommended) ⭐
+
+**Best for:** Collaborative/Open Source scenarios
+
+1. Fork https://github.com/dcasati/argocd-notification-examples
+2. Create a new branch (e.g., `fix/manifest-errors`)
+3. Fix both issues:
+   - Line 178: `apps/v` → `apps/v1`
+   - Line 475: `store-dmin` → `store-admin`
+4. Submit a Pull Request to the upstream repository
+5. Wait for merge or use your fork temporarily
+
+**ArgoCD Update (temporary):**
+```yaml
+spec:
+  source:
+    repoURL: https://github.com/<your-username>/argocd-notification-examples.git
+    targetRevision: fix/manifest-errors
+    path: apps
+```
+
+---
+
+### Option 2: Host Corrected Manifests Locally
+
+**Best for:** Quick fix/Internal control
+
+1. Copy the corrected manifest to this repository:
+   ```bash
+   mkdir -p manifests/aks-store
+   # Copy and fix the manifest
+   ```
+
+2. Update ArgoCD Application (`Act-3/argocd-test-app.yaml`):
+   ```yaml
+   spec:
+     source:
+       repoURL: https://github.com/DevExpGbb/agentic-platform-engineering.git
+       targetRevision: main
+       path: manifests/aks-store
+   ```
+
+3. Apply the updated ArgoCD Application:
+   ```bash
+   kubectl apply -f Act-3/argocd-test-app.yaml
+   ```
+
+---
+
+### Option 3: Use Kustomize Overlays
+
+**Best for:** Advanced patching without forking
+
+1. Create a Kustomize overlay structure:
+   ```
+   manifests/aks-store/
+   ├── kustomization.yaml
+   └── patches/
+       ├── order-service-apiversion.yaml
+       └── store-admin-image.yaml
+   ```
+
+2. Configure Kustomize to patch the remote manifest:
+   ```yaml
+   # kustomization.yaml
+   resources:
+     - https://raw.githubusercontent.com/dcasati/argocd-notification-examples/main/apps/broken-aks-store-all-in-one.yaml
+
+   patches:
+     - path: patches/order-service-apiversion.yaml
+     - path: patches/store-admin-image.yaml
+   ```
+
+3. Update ArgoCD to use Kustomize:
+   ```yaml
+   spec:
+     source:
+       repoURL: https://github.com/DevExpGbb/agentic-platform-engineering.git
+       targetRevision: main
+       path: manifests/aks-store
+       kustomize: {}
+   ```
+
+---
+
+## 🧪 Verification Steps
+
+After applying the fix:
+
+### 1. Validate Syntax
+```bash
+kubectl apply --dry-run=client -f apps/broken-aks-store-all-in-one.yaml
+```
+
+### 2. Verify Image Availability
+```bash
+docker pull ghcr.io/azure-samples/aks-store-demo/store-admin:2.1.0
+```
+
+### 3. Sync ArgoCD Application
+```bash
+argocd app sync 2-broken-apps
+```
+
+### 4. Monitor Deployment
+```bash
+# Check application status
+argocd app get 2-broken-apps
+
+# Watch pods come online
+kubectl get pods -n default -w
+
+# Verify specific deployments
+kubectl get deployment order-service -n default
+kubectl get deployment store-admin -n default
+```
+
+### 5. Check Health Status
+```bash
+# All pods should be running
+kubectl get pods -n default | grep -E "(order-service|store-admin)"
+
+# Check events for any issues
+kubectl get events -n default --sort-by='.lastTimestamp' | tail -20
+```
+
+---
+
+## 📊 Impact Analysis
+
+| Component | Status | Impact Level |
+|-----------|--------|--------------|
+| `order-service` | ❌ Failed | **Critical** - Blocks deployment |
+| `store-admin` | ❌ Failed | **High** - ImagePullBackOff |
+| `store-front` | ⚠️ Degraded | **Medium** - Depends on order-service |
+| `product-service` | ✅ OK | **Low** - Independent |
+| `makeline-service` | ⚠️ Degraded | **Medium** - Depends on order-service |
+
+---
+
+## 🔑 Key Learnings
+
+1. **ArgoCD Error Messages:** The generic error "synchronization tasks are not valid" can indicate basic YAML/API syntax issues
+2. **Validation First:** Always validate Kubernetes manifests before deploying:
+   ```bash
+   kubectl apply --dry-run=client -f <manifest-file>
+   ```
+3. **Source Control:** Issues in upstream repositories require coordination with repository owners
+4. **Testing:** Test deployments should use intentionally broken manifests to validate notification workflows
+5. **Dependency Awareness:** A single invalid resource can block entire application deployment; understand service dependencies
+
+---
+
+## 📁 Related Files
+
+- **ArgoCD Application Definition:** `Act-3/argocd-test-app.yaml`
+- **Source Repository:** https://github.com/dcasati/argocd-notification-examples.git
+- **Problematic Manifest:** `apps/broken-aks-store-all-in-one.yaml`
+- **Workflow Handler:** `.github/workflows/argocd-deployment-failure.yml`
+
+---
+
+## 🔗 References
+
+- [ArgoCD Application CRD Documentation](https://argo-cd.readthedocs.io/en/stable/operator-manual/declarative-setup/#applications)
+- [Kubernetes API Versions](https://kubernetes.io/docs/reference/using-api/#api-versioning)
+- [Kustomize Patching](https://kubectl.docs.kubernetes.io/references/kustomize/patches/)
+- [ArgoCD Sync Phases and Waves](https://argo-cd.readthedocs.io/en/stable/user-guide/sync-waves/)
+
+---
+
+**Status:** ✅ Root Cause Identified | ⏳ Awaiting Remediation Decision
+**Next Steps:** Choose remediation option and implement fix
diff --git a/Act-3/README.md b/Act-3/README.md
new file mode 100644
index 0000000..84aacbd
--- /dev/null
+++ b/Act-3/README.md
@@ -0,0 +1,71 @@
+# Act-3: ArgoCD Deployment Monitoring & Root Cause Analysis
+
+This directory contains ArgoCD application definitions and associated troubleshooting documentation for deployment failures.
+
+## Contents
+
+### ArgoCD Application Definitions
+- **[argocd-test-app.yaml](./argocd-test-app.yaml)** - Test ArgoCD application definition for `2-broken-apps`
+
+### Root Cause Analysis Documentation
+- **[ISSUE-12-FINDINGS.md](./ISSUE-12-FINDINGS.md)** - Quick summary of findings for Issue #12
+- **[RCA-2-broken-apps.md](./RCA-2-broken-apps.md)** - Comprehensive root cause analysis for the `2-broken-apps` deployment failure
+
+## Quick Links
+
+- [Issue #12: ArgoCD Deployment Failed](https://github.com/DevExpGbb/agentic-platform-engineering/issues/12)
+- [ArgoCD Deployment Failure Workflow](../.github/workflows/argocd-deployment-failure.yml)
+- [ArgoCD Notifications Setup](../.github/argocd/SETUP.md)
+
+## Overview
+
+This act demonstrates an automated ArgoCD deployment monitoring and issue-creation workflow:
+
+1. **ArgoCD detects deployment failure** (sync failed or health degraded)
+2. **ArgoCD Notifications** sends webhook to GitHub
+3. **GitHub Actions workflow** creates/updates issue automatically
+4. **Copilot Agent** analyzes the issue and provides root cause analysis
+
+## Current Status: Issue #12 Analysis
+
+**Application:** `2-broken-apps`
+**Status:** Root cause identified ✅
+**Issues Found:** 2 critical errors in upstream repository
+
+### Quick Summary
+
+Two issues identified in the source repository (https://github.com/dcasati/argocd-notification-examples.git):
+
+1. **Invalid API Version** (Line 178) - Critical
+   - `apiVersion: apps/v` should be `apiVersion: apps/v1`
+
+2. **Typo in Image Name** (Line 475) - High
+   - `store-dmin` should be `store-admin`
+
+📖 **[View Full Analysis](./ISSUE-12-FINDINGS.md)**
+
+## Troubleshooting
+
+To investigate ArgoCD deployment issues:
+
+```bash
+# Check application status
+argocd app get 2-broken-apps
+
+# Check pods in namespace
+kubectl get pods -n default
+
+# Describe failed pods
+kubectl describe pods -n default
+
+# Get pod logs
+kubectl logs <pod-name> -n default
+
+# Check events
+kubectl get events -n default --sort-by='.lastTimestamp'
+```
+
+## Related Documentation
+
+- [ArgoCD Notifications Configuration](../.github/argocd/argocd-notifications-config.yaml)
+- [Setup Guide](../.github/argocd/SETUP.md)
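
The `apps/v` defect described in the findings above is a malformed value that a plain-text check can flag before ArgoCD attempts a sync. What follows is a minimal sketch and not part of this change set: the helper name `find_bad_api_versions` is hypothetical, and the pattern assumes that well-formed values look like `v1` or `apps/v1`, optionally with an `alpha` or `beta` suffix such as `v1beta1`. It only inspects unquoted `apiVersion:` lines and never contacts the cluster, so it complements `kubectl apply --dry-run=client` rather than replacing it.

```bash
#!/usr/bin/env bash
# find_bad_api_versions FILE   (hypothetical helper, illustration only)
# Prints "LINE:VALUE" for each apiVersion entry whose value does not look like
# "v1" or "group/v1" (with an optional alpha/beta suffix).
# Exit status: 0 if every value looks well formed, 1 if any value was flagged.
find_bad_api_versions() {
  awk '
    /^[ \t]*apiVersion:/ {
      value = $0
      # Drop the key and any whitespace that follows it.
      sub(/^[ \t]*apiVersion:[ \t]*/, "", value)
      # Drop trailing whitespace and an optional trailing "# comment".
      sub(/[ \t\r]*(#.*)?$/, "", value)
      if (value !~ /^([a-z0-9.-]+\/)?v[0-9]+((alpha|beta)[0-9]+)?$/) {
        printf "%d:%s\n", NR, value
        bad = 1
      }
    }
    END { exit bad }
  ' "$1"
}
```

Calling `find_bad_api_versions` on a local copy of the manifest prints one `line:value` pair per suspicious entry and returns a non-zero status if anything was flagged, which makes it usable as a pre-commit or CI gate. The image-name typo is a different class of error: the value is syntactically valid, so only a registry lookup such as the `docker pull` step in the verification section can catch it.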