@@ -0,0 +1,132 @@
# Ephemeral Stack Cleanup — AgentCore ENI-Aware Redesign

**Date:** 2026-06-08
**Branch / PR:** `feat/cleanup-ephemeral-stacks` / [PR #109](https://github.com/aws-samples/sample-autonomous-cloud-coding-agents/pull/109)
**Related issues:** [#72](https://github.com/aws-samples/sample-autonomous-cloud-coding-agents/issues/72) (scheduled ephemeral cleanup — *not yet `approved`*), [#111](https://github.com/aws-samples/sample-autonomous-cloud-coding-agents/issues/111) (document AgentCore ENI cleanup workflow), [#278](https://github.com/aws-samples/sample-autonomous-cloud-coding-agents/issues/278) (shellcheck/shell-test tooling gap)
**Target file:** `scripts/cleanup-ephemeral-stacks.sh`

## Problem

The current `cleanup-ephemeral-stacks.sh` (PR #109) follows a **"pre-clean ENIs → fire `delete-stack`"** model. For stacks that contain a Bedrock AgentCore Runtime, this model is structurally broken.

### Root cause (validated against a live stuck stack)

A real ephemeral stack (`scoschre`, account `465528542731`, `us-east-1`) entered `DELETE_FAILED` with:

```
The following resource(s) failed to delete:
[AgentVpcRuntimeSG…, AgentVpcPrivateSubnet1…, AgentVpcPrivateSubnet2…]
subnet 'subnet-…' has dependencies and cannot be deleted
security group 'sg-…' has a dependent object
```

The dependency was **two `agentic_ai`-type ENIs** left in the stack's private subnets / runtime SG:

| Fact | Evidence |
|------|----------|
| ENIs are Hyperplane-managed | Attachment IDs are `ela-attach-*`; `Attachment.InstanceOwnerId = amazon-aws` |
| They **cannot** be force-detached | `detach-network-interface --force` → `OperationNotPermitted: You are not allowed to manage 'ela-attach' attachments` |
| They **cannot** be force-deleted while attached | `delete-network-interface` → `InvalidParameterValue: ... currently in use` |
| `--dry-run` is **not** a reliable probe | Both ops returned `DryRunOperation: Request would have succeeded` — dry-run validates IAM only, **not** managed-attachment/resource state |
| They are reclaimed **asynchronously by AWS** | The stack's `Runtime` resource reached `DELETE_COMPLETE` at ~19:37; the ENIs persisted **>1 hour**, then AWS reclaimed them on its own (`InvalidNetworkInterfaceID.NotFound`), after which `delete-stack` succeeded with zero manual ENI action |

**Conclusion:** the existing ENI force-detach/delete block (current lines ~146–197) is incapable of clearing these ENIs under any IAM principal. It only adds `sleep 15` delays and false confidence, then races AWS's async reclamation with an immediate `delete-stack` → `DELETE_FAILED`.
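The evidence table suggests a cheap, read-only way to recognize ENIs that cannot be managed manually: look at the attachment ID prefix and the attachment owner instead of probing with `--dry-run`. A minimal sketch follows, assuming those two markers are sufficient. `is_managed_attachment` and `list_eni_attachments` are hypothetical helper names, not functions in the current script.

```bash
# Hypothetical helper (not in the current script): decide from attachment
# metadata alone whether an ENI is service-managed and therefore off-limits
# for detach/delete. Both markers come from the evidence table above.
is_managed_attachment() {
  local attachment_id="$1" instance_owner="$2"
  case "$attachment_id" in
    ela-attach-*) return 0 ;;   # attachment ID prefix seen on the stuck ENIs
  esac
  [ "$instance_owner" = "amazon-aws" ]
}

# Read-only listing of the fields the helper needs, one ENI per line:
#   <eni-id> <attachment-id> <instance-owner-id>
list_eni_attachments() {
  local sg_id="$1"
  aws ec2 describe-network-interfaces \
    --filters "Name=group-id,Values=$sg_id" \
    --query 'NetworkInterfaces[].[NetworkInterfaceId,Attachment.AttachmentId,Attachment.InstanceOwnerId]' \
    --output text
}
```

Because the check uses only `describe-network-interfaces` output, it reflects actual attachment state rather than IAM permissions, which is the gap the `--dry-run` probe left open.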

This is an **architectural** problem (per systematic-debugging Phase 4.5), not a patchable bug: the fix is to stop trying to force ENI cleanup and instead **observe** reclamation read-only and let repeated passes retry.

## Goals

1. Reliably delete aged, unprotected ABCA ephemeral stacks **without** attempting impossible ENI manipulation.
2. Be **idempotent and cron-safe**: a stack stuck on ENI-reclamation lag is *expected*, and a later pass finishes it automatically.
3. Provide a precise operator signal when a stack is waiting on AWS reclamation (satisfies #111).
4. Support an interactive `--wait` mode.

## Non-goals

- Force-detaching or force-deleting Hyperplane (`ela-attach`) ENIs — proven impossible.
- Deleting the live, in-use AgentCore runtime's ENIs (they live in a *different* VPC; never in scope).
- Guaranteeing that a single run fully tears down every stack (ENI reclamation lag can exceed 1 hour).

## Design

### Run modes (hybrid)

- **Default (cron-safe, fire-and-forget):** issue `delete-stack` for every eligible stack, do not block. Print a summary. This is the primary unattended path.
- **`--wait` (interactive):** after issuing deletes, poll each stack to a terminal state and report `DELETE_COMPLETE` vs. `DELETE_FAILED` (with the blocking reason).
- **`--dry-run`:** unchanged — report intended actions, mutate nothing.
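The `--wait` path could be sketched for a single stack as below. This is an illustration under assumptions, not the script's implementation: `wait_for_stack_delete` is a hypothetical name, the region default is arbitrary, and the blocking reason is taken from the most recent `DELETE_FAILED` stack event.

```bash
# Hypothetical sketch of --wait for one stack. Uses the CLI's built-in
# waiter, which returns non-zero on DELETE_FAILED or when it times out.
wait_for_stack_delete() {
  local stack_name="$1" region="${2:-us-east-1}"
  if aws cloudformation wait stack-delete-complete \
       --region "$region" --stack-name "$stack_name" 2>/dev/null; then
    echo "$stack_name: DELETE_COMPLETE"
    return 0
  fi
  # Surface the blocking reason from the newest failed resource event.
  local reason
  reason=$(aws cloudformation describe-stack-events \
    --region "$region" --stack-name "$stack_name" \
    --query "StackEvents[?ResourceStatus=='DELETE_FAILED'].ResourceStatusReason | [0]" \
    --output text 2>/dev/null) || reason="unknown"
  echo "$stack_name: DELETE_FAILED or still deleting (${reason:-unknown})"
  return 1
}
```

The waiter polls on a fixed schedule and eventually gives up, so with reclamation lag above an hour a non-zero result may mean "still pending" rather than a hard failure. Reporting it that way would stay consistent with the exit semantics described later.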

### Per-stack flow (after the existing age/safety filters, which are unchanged)

The age/safety filters stay exactly as they are: prefix match → `describe-stacks` succeeds → `Description == "ABCA Development Stack"` → not termination-protected → not `*IN_PROGRESS*` → parseable creation time → older than `MAX_AGE_HOURS`.

After a stack passes those filters, branch on **stack status**:

1. **Fresh aged stack** (`CREATE_COMPLETE`, `UPDATE_COMPLETE`, `ROLLBACK_COMPLETE`, `UPDATE_ROLLBACK_COMPLETE`):
   - The Runtime resource still exists and is deleted *during* `delete-stack`. There are no orphan ENIs to check yet.
   - → Issue `delete-stack` **unconditionally**.

2. **`DELETE_FAILED` stack** (retry path: stack stuck only on ENI-reclamation lag):
   - Run a **read-only** check for blocking `agentic_ai` ENIs in the stack's subnets and runtime SG (gather subnet/SG physical IDs via `list-stack-resources`, then `describe-network-interfaces --filters subnet-id=…` / `group-id=…`).
   - **If blocking ENIs are present:** SKIP this pass. Log `"<stack>: pending reclamation (N AgentCore ENIs not yet released by AWS)"`. Count as `Pending`.
   - **If none remain:** re-issue `delete-stack` (it will now succeed).

`DELETE_FAILED` is included in the `list-stacks --stack-status-filter`, so stuck stacks are naturally re-evaluated on every pass. The `*IN_PROGRESS*` skip is deliberately narrow: it catches `DELETE_IN_PROGRESS` (don't disturb a stack mid-teardown) but **not** `DELETE_FAILED` (the terminal stuck state we *do* retry). This distinction is load-bearing and is pinned by a regression test.
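The read-only gate for the `DELETE_FAILED` branch might look like the sketch below. It is an assumption-laden illustration: `count_blocking_enis` is a hypothetical name, the region default is arbitrary, and the `interface-type` filter value `agentic_ai` is taken from this document rather than verified independently.

```bash
# Hypothetical sketch of the DELETE_FAILED retry gate. Read-only: it lists
# the stack's subnets and security groups, then counts agentic_ai ENIs that
# still reference them. It never detaches or deletes anything.
count_blocking_enis() {
  local stack_name="$1" region="${2:-us-east-1}"
  local rtype filter ids id enis all=""
  for rtype in AWS::EC2::Subnet AWS::EC2::SecurityGroup; do
    case "$rtype" in
      AWS::EC2::Subnet) filter="subnet-id" ;;
      *)                filter="group-id" ;;
    esac
    ids=$(aws cloudformation list-stack-resources \
      --region "$region" --stack-name "$stack_name" \
      --query "StackResourceSummaries[?ResourceType=='$rtype'].PhysicalResourceId" \
      --output text 2>/dev/null) || ids=""
    for id in $ids; do
      if [ "$id" = "None" ]; then continue; fi
      # A lookup error (e.g. the subnet is already gone) is treated as
      # "no blocking ENI" for that resource, so the retry is allowed.
      enis=$(aws ec2 describe-network-interfaces \
        --region "$region" \
        --filters "Name=$filter,Values=$id" "Name=interface-type,Values=agentic_ai" \
        --query 'NetworkInterfaces[].NetworkInterfaceId' \
        --output text 2>/dev/null) || enis=""
      all="$all $enis"
    done
  done
  # De-duplicate: one ENI can match both a subnet and a security group.
  # shellcheck disable=SC2086  # word splitting is intended, one ID per line
  printf '%s\n' $all | sort -u | grep -c '^eni-' || true
}
```

A caller would skip the stack and count it as `Pending` when the printed number is greater than zero, and re-issue `delete-stack` when it is `0`.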

### What is removed vs. added

- **Removed:** the ENI force-detach / `sleep 15` / force-delete block. It is proven impossible for `ela-attach` ENIs and these stacks contain no other ENI type.
- **Added (read-only only):** a small diagnostic that *observes* blocking `agentic_ai` ENIs in a stack's subnets/SG. It is used **only** as the `DELETE_FAILED` retry gate and to produce the operator signal. It never mutates ENIs.

### Exit semantics

- **Exit 0** if all attempted `delete-stack` calls were issued without API error. A stack left `DELETE_FAILED` / `Pending` awaiting AWS reclamation is **expected**, not a failure — the next pass handles it. This keeps cron quiet.
- **Exit 1** only on real errors: credential/auth failure (`sts:GetCallerIdentity`), or unexpected CloudFormation/EC2 API errors.

### Summary output

```
=== Summary ===
Deleted: <delete-stack issued this pass>
Skipped: <too young / protected / not ABCA / in progress>
Pending: <DELETE_FAILED, blocking ENIs not yet reclaimed — will retry next pass>
Failed: <real API error>
```
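One way to tie the summary to the exit semantics is a single function that prints the four counters and returns non-zero only when real errors occurred. This is a sketch; `print_summary` is a hypothetical name, and the current script tracks only three counters (it has no `Pending`).

```bash
# Hypothetical sketch: print the summary and derive the exit status.
# Pending stacks are expected and never affect the status; only the
# Failed counter (real API errors) does.
print_summary() {
  local deleted="$1" skipped="$2" pending="$3" failed="$4"
  echo "=== Summary ==="
  echo "  Deleted: $deleted"
  echo "  Skipped: $skipped"
  echo "  Pending: $pending"
  echo "  Failed:  $failed"
  [ "$failed" -eq 0 ]
}
```

At the end of a run this could be used as `print_summary "$DELETED" "$SKIPPED" "$PENDING" "$FAILED" || exit 1`, which keeps cron quiet while stacks are merely waiting on reclamation.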

## Operator guidance (docs — folds in #111)

Add to `docs/guides/DEPLOYMENT_GUIDE.md` an "AgentCore ENI reclamation" subsection:

- **Why** a stack with an AgentCore Runtime can sit in `DELETE_FAILED`: Hyperplane `agentic_ai` ENIs are released asynchronously by AWS after the Runtime backend tears down (observed lag: >1 hour).
- **These ENIs cannot be force-detached or force-deleted** — do not try; `ela-attach` attachments reject manual management.
- **Recovery:** wait for reclamation, then re-run the cleanup script (or `aws cloudformation delete-stack`). Check reclamation with:
  ```
  aws ec2 describe-network-interfaces \
    --filters Name=subnet-id,Values=<stack-subnet-ids> Name=interface-type,Values=agentic_ai \
    --query 'NetworkInterfaces[].NetworkInterfaceId'
  ```
  An empty result means the stack will now delete cleanly.
- **Escape hatch** for an indefinitely stuck stack: `aws cloudformation delete-stack --stack-name <name> --retain-resources <VPC/Subnet/SG logical-ids>` to drop the stack shell, then clean the VPC once ENIs clear.

Regenerate Starlight mirrors (`cd docs && node scripts/sync-starlight.mjs`) and commit them alongside.

## Testing

The repo currently has **no shell-test harness** (no `bats`, no `*.bats`), and shellcheck is not yet wired in (tracked by #278). To pin the load-bearing behavior without over-investing in net-new tooling:

- **Minimum:** a small `bats`-style or plain-`bash` assertion test that the status-classification logic selects a `DELETE_FAILED` ABCA stack (with no blocking ENIs) for retry, and skips it (counts `Pending`) when blocking ENIs are present. Refactor the classification into a pure, testable function (`classify_stack` taking status + ENI-count → action) so it can be unit-tested without AWS calls.
- **Lint:** run `shellcheck` on the script (manually for this PR; #278 wires it into CI).
- **Manual integration evidence (already captured):** the `scoschre` stack was unstuck by exactly this gated-retry sequence — blocking-ENI query returned `0`, `delete-stack` then succeeded, VPC removed, live `mainRuntime` untouched.
- **Acceptance validation (planned):** deploy a fresh ephemeral stack, let its first `delete-stack` reach `DELETE_FAILED` on AgentCore ENIs, then confirm a subsequent cleanup-script pass reports `Pending` while ENIs linger and completes the deletion on the first pass after AWS reclaims them — end-to-end exercise of both status branches.
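The pure classifier and its plain-`bash` assertions could be sketched as follows. Only the `classify_stack` name and its inputs (status plus ENI count) come from the plan above. The action strings (`delete`, `retry`, `pending`, `skip`) and the helper names `assert_action` and `run_classify_tests` are illustrative assumptions.

```bash
# Sketch of the pure classifier: status + blocking-ENI count in, action out.
# No AWS calls, so it can be unit-tested directly.
classify_stack() {
  local status="$1" eni_count="${2:-0}"
  case "$status" in
    *IN_PROGRESS*) echo "skip" ;;                 # never disturb a transition
    DELETE_FAILED)
      if [ "$eni_count" -gt 0 ]; then echo "pending"; else echo "retry"; fi ;;
    CREATE_COMPLETE|UPDATE_COMPLETE|ROLLBACK_COMPLETE|UPDATE_ROLLBACK_COMPLETE)
      echo "delete" ;;                            # fresh aged stack
    *) echo "skip" ;;                             # unknown status: fail closed
  esac
}

# Plain-bash assertion helper; no bats required.
assert_action() {
  local expected="$1" actual
  shift
  actual=$(classify_stack "$@")
  if [ "$actual" != "$expected" ]; then
    echo "FAIL: classify_stack $* -> $actual (want $expected)" >&2
    return 1
  fi
}

# Pins the load-bearing distinction: DELETE_FAILED is retried (or pending),
# while DELETE_IN_PROGRESS is skipped.
run_classify_tests() {
  assert_action retry   DELETE_FAILED 0 &&
  assert_action pending DELETE_FAILED 2 &&
  assert_action skip    DELETE_IN_PROGRESS 0 &&
  assert_action delete  CREATE_COMPLETE 0
}
```

Calling `run_classify_tests` from a small test script would cover the regression named in the risks table without introducing new tooling.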

## Risks & mitigations

| Risk | Mitigation |
|------|------------|
| A future "simplification" collapses the `*IN_PROGRESS*` skip into `DELETE_*`, silently killing the retry path | Regression test asserts `DELETE_FAILED` → retry selected |
| `list-stack-resources` on a partially-deleted stack returns stale subnet/SG IDs | Gate query tolerates `NotFound` per resource; treat unresolvable subnet/SG as "no blocking ENI" and allow retry |
| Misclassifying a *live* runtime's ENIs as orphans | Gate queries **only** the target stack's own subnet/SG physical IDs; live runtime is in a separate VPC (verified) |
| Cron noise | Exit 0 on `Pending`; only real API/auth errors are non-zero |

## Governance

Implements #72 (not yet `approved`). PR #109 is already open against this branch, so work continues under the existing artifact; the missing `approved` label on #72 should be flagged to an admin but does not block refinement of the existing PR.
225 changes: 225 additions & 0 deletions scripts/cleanup-ephemeral-stacks.sh
@@ -0,0 +1,225 @@
#!/usr/bin/env bash
# cleanup-ephemeral-stacks.sh — Delete ephemeral CloudFormation stacks older than MAX_AGE_HOURS.
#
# Targets stacks deployed by this CDK app that do NOT have termination protection.
# Handles stuck ENI cleanup (AgentCore/Lambda Hyperplane ENIs) before deletion.
#
# Usage:
#   AWS_PROFILE=abca ./scripts/cleanup-ephemeral-stacks.sh [--dry-run] [--max-age-hours N] [--prefix PREFIX]
#
# Options:
#   --dry-run            Show what would be deleted without acting
#   --max-age-hours N    Delete stacks older than N hours (default: $MAX_AGE_HOURS if set, else 48)
#   --prefix PREFIX      Only target stacks matching this prefix (default: all ABCA stacks)
#
# Safety:
#   - Never touches stacks with termination protection enabled
#   - Only targets stacks with description matching "ABCA Development Stack"
#   - Skips stacks in any *IN_PROGRESS state (create, update, rollback, or delete)

set -euo pipefail

MAX_AGE_HOURS=${MAX_AGE_HOURS:-48}
DRY_RUN=false
PREFIX=""
REGION="${AWS_DEFAULT_REGION:-us-east-1}"

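while [[ $# -gt 0 ]]; do
  case $1 in
    --dry-run) DRY_RUN=true; shift ;;
    --max-age-hours|--prefix)
      # Both options take a value. Fail with a clear message instead of
      # tripping `set -u` on an unbound "$2".
      if [[ $# -lt 2 ]]; then
        echo "Error: $1 requires a value" >&2
        exit 1
      fi
      if [[ "$1" == "--max-age-hours" ]]; then MAX_AGE_HOURS="$2"; else PREFIX="$2"; fi
      shift 2 ;;
    *) echo "Unknown option: $1" >&2; exit 1 ;;
  esac
done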

# Validate numeric input — guards the age arithmetic against injection/garbage.
if ! [[ "$MAX_AGE_HOURS" =~ ^[0-9]+$ ]]; then
echo "Error: --max-age-hours must be a non-negative integer (got: '$MAX_AGE_HOURS')" >&2
exit 1
fi

MAX_AGE_SECONDS=$((MAX_AGE_HOURS * 3600))
NOW=$(date +%s)

# Surface the blast radius before touching anything. Confirms the operator is
# pointed at the account/identity they think they are (defense in depth).
CALLER_IDENTITY=$(aws sts get-caller-identity \
--region "$REGION" \
--query '[Account,Arn]' --output text 2>/dev/null) || {
echo "Error: unable to resolve AWS identity (sts:GetCallerIdentity failed). Check credentials." >&2
exit 1
}
ACCOUNT_ID=$(echo "$CALLER_IDENTITY" | cut -f1)
CALLER_ARN=$(echo "$CALLER_IDENTITY" | cut -f2)

echo "=== Ephemeral Stack Cleanup ==="
echo " Account: $ACCOUNT_ID"
echo " Identity: $CALLER_ARN"
echo " Region: $REGION"
echo " Max age: ${MAX_AGE_HOURS}h"
echo " Dry run: $DRY_RUN"
echo " Prefix filter: ${PREFIX:-<none>}"
echo ""

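# List candidate stacks. DELETE_FAILED is included so stuck stacks are retried.
# Fail loudly on an API error: under `set -e` a failed command substitution
# would otherwise abort the script with no message.
STACKS=$(aws cloudformation list-stacks \
  --region "$REGION" \
  --stack-status-filter \
    CREATE_COMPLETE UPDATE_COMPLETE ROLLBACK_COMPLETE \
    UPDATE_ROLLBACK_COMPLETE DELETE_FAILED \
  --query 'StackSummaries[*].[StackName,CreationTime]' \
  --output text) || {
  echo "Error: cloudformation list-stacks failed." >&2
  exit 1
}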

if [[ -z "$STACKS" ]]; then
echo "No stacks found."
exit 0
fi

DELETED=0
SKIPPED=0
FAILED=0

while IFS=$'\t' read -r STACK_NAME CREATION_TIME; do
# Apply prefix filter
if [[ -n "$PREFIX" && "$STACK_NAME" != "$PREFIX"* ]]; then
continue
fi

# Get stack details (description, termination protection, tags)
STACK_INFO=$(aws cloudformation describe-stacks \
--region "$REGION" \
--stack-name "$STACK_NAME" \
--query 'Stacks[0].[Description,EnableTerminationProtection,StackStatus]' \
--output text 2>/dev/null) || continue

DESCRIPTION=$(echo "$STACK_INFO" | cut -f1)
TERMINATION_PROTECTED=$(echo "$STACK_INFO" | cut -f2)
STATUS=$(echo "$STACK_INFO" | cut -f3)

# Only target stacks from this CDK app
if [[ "$DESCRIPTION" != "ABCA Development Stack" ]]; then
continue
fi

# Never touch termination-protected stacks
if [[ "$TERMINATION_PROTECTED" == "True" ]]; then
echo " SKIP (protected): $STACK_NAME"
((SKIPPED++)) || true
continue
fi

# Skip stacks in active transitions
if [[ "$STATUS" == *"IN_PROGRESS"* ]]; then
echo " SKIP (in progress): $STACK_NAME ($STATUS)"
((SKIPPED++)) || true
continue
fi

# Check age. Parse the CreationTime to epoch seconds (GNU date, then BSD date).
# CloudFormation reports UTC, so the BSD fallback passes -u; without it the
# timezone-less string is read as local time and the computed age is skewed.
# FAIL CLOSED: if both parsers fail we cannot trust the age, so SKIP rather than
# risk deleting a stack we can't prove is old enough.
CREATED_EPOCH=$(date -d "$CREATION_TIME" +%s 2>/dev/null || date -u -j -f "%Y-%m-%dT%H:%M:%S" "${CREATION_TIME%%.*}" +%s 2>/dev/null || echo "")
if ! [[ "$CREATED_EPOCH" =~ ^[0-9]+$ ]]; then
echo " SKIP (unparseable creation time '$CREATION_TIME'): $STACK_NAME"
((SKIPPED++)) || true
continue
fi
AGE_SECONDS=$((NOW - CREATED_EPOCH))

if [[ $AGE_SECONDS -lt $MAX_AGE_SECONDS ]]; then
AGE_HOURS=$((AGE_SECONDS / 3600))
echo " SKIP (too young: ${AGE_HOURS}h): $STACK_NAME"
((SKIPPED++)) || true
continue
fi

AGE_HOURS=$((AGE_SECONDS / 3600))
echo " TARGET: $STACK_NAME (age: ${AGE_HOURS}h, status: $STATUS)"

if [[ "$DRY_RUN" == "true" ]]; then
echo " [dry-run] Would delete $STACK_NAME"
((DELETED++)) || true
continue
fi

# --- ENI cleanup (handles stuck VPC deletion) ---
# Find security groups owned by this stack
SG_IDS=$(aws cloudformation list-stack-resources \
--region "$REGION" \
--stack-name "$STACK_NAME" \
--query "StackResourceSummaries[?ResourceType=='AWS::EC2::SecurityGroup'].PhysicalResourceId" \
--output text 2>/dev/null) || true

if [[ -n "$SG_IDS" && "$SG_IDS" != "None" ]]; then
for SG_ID in $SG_IDS; do
# Find ENIs attached to this security group.
# shellcheck disable=SC2016 # backticks are JMESPath literal syntax for --query, must NOT expand
ENIS=$(aws ec2 describe-network-interfaces \
--region "$REGION" \
--filters "Name=group-id,Values=$SG_ID" \
--query 'NetworkInterfaces[?Status==`in-use`].[NetworkInterfaceId,Attachment.AttachmentId]' \
--output text 2>/dev/null) || true

if [[ -n "$ENIS" && "$ENIS" != "None" ]]; then
echo " Cleaning up ENIs in security group $SG_ID..."
while IFS=$'\t' read -r ENI_ID ATTACHMENT_ID; do
if [[ -n "$ENI_ID" && "$ENI_ID" != "None" ]]; then
echo " Force-detaching $ENI_ID ($ATTACHMENT_ID)"
aws ec2 detach-network-interface \
--region "$REGION" \
--attachment-id "$ATTACHMENT_ID" \
--force 2>/dev/null || true
fi
done <<< "$ENIS"

# Wait briefly for detachment
echo " Waiting 15s for ENI detachment..."
sleep 15

# Delete the ENIs
AVAILABLE_ENIS=$(aws ec2 describe-network-interfaces \
--region "$REGION" \
--filters "Name=group-id,Values=$SG_ID" "Name=status,Values=available" \
--query 'NetworkInterfaces[*].NetworkInterfaceId' \
--output text 2>/dev/null) || true

for ENI_ID in $AVAILABLE_ENIS; do
if [[ -n "$ENI_ID" && "$ENI_ID" != "None" ]]; then
echo " Deleting $ENI_ID"
aws ec2 delete-network-interface \
--region "$REGION" \
--network-interface-id "$ENI_ID" 2>/dev/null || true
fi
done
fi
done
fi

# --- Delete the stack ---
# Only count a deletion we actually initiated. Tolerate a single failure
# (e.g. AccessDenied, transient throttling) without aborting the whole run —
# set -e would otherwise kill the loop mid-pass and orphan later stacks.
echo " Deleting stack $STACK_NAME..."
if aws cloudformation delete-stack \
--region "$REGION" \
--stack-name "$STACK_NAME" 2>/dev/null; then
((DELETED++)) || true
else
echo " ERROR: delete-stack failed for $STACK_NAME (continuing)" >&2
((FAILED++)) || true
fi

done <<< "$STACKS"

echo ""
echo "=== Summary ==="
echo " Deleted: $DELETED"
echo " Skipped: $SKIPPED"
echo " Failed: $FAILED"

if [[ "$DELETED" -gt 0 && "$DRY_RUN" == "false" ]]; then
echo ""
echo "Note: Stack deletion is asynchronous. Monitor with:"
echo " aws cloudformation list-stacks --stack-status-filter DELETE_IN_PROGRESS --region $REGION"
fi