feat(infra): add SQS DLQ + CloudWatch alarm to GitHub webhook processor #284
Open
Labels
P2: Priority 2 - medium priority
approved: When an issue has been approved and ready
enhancement: New feature or request
infra-cdk: CDK stacks/constructs, bootstrap, deploy topology, tags, IAM wiring, teardown
observability: Tracing, attribution, dashboards, metrics, alarms, telemetry redaction
Component
CDK / infrastructure (notification plane / observability)
Problem
The screenshot pipeline's processor Lambda (GitHubScreenshotIntegration/WebhookProcessorFn) is invoked async (InvocationType: 'Event') by the receiver. Today it has:
- no onFailure Lambda destination
- no alarm on its Errors metric
Every failure path inside the handler logs and returns. Because the receiver has already returned 200 to GitHub, GitHub will not redeliver, so a systemic break (IAM regression, AgentCore quota exhaustion, OAuth token rotation issue, dependency outage) silently stops 100% of screenshots with no signal anywhere.
Spawned from krokoko's review on PR #241 (item #3, blocking).
Expected behavior
A failed processor invocation should produce an operator-visible signal. After Lambda's automatic async retries exhaust, the failed event lands on a DLQ and a CloudWatch alarm fires.
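The expected flow can be illustrated with a toy model. This is only a sketch of the retry-then-DLQ behavior, not how the Lambda service is implemented; `simulateAsyncInvoke`, `FailedEvent`, `logAndReturn`, and `rethrow` are hypothetical names introduced for illustration. The default of two retries mirrors Lambda's documented default for async invocations.

```typescript
// Toy model of async-invoke failure handling. Illustrative only.
interface FailedEvent<E> {
  event: E;
  attempts: number;
  lastError: string;
}

// Runs the handler up to (retryAttempts + 1) times. If every attempt throws,
// the event is appended to the DLQ array and false is returned.
function simulateAsyncInvoke<E>(
  handler: (event: E) => void,
  event: E,
  dlq: FailedEvent<E>[],
  retryAttempts: number = 2,
): boolean {
  const maxAttempts = retryAttempts + 1;
  let lastError = '';
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      handler(event);
      return true; // the invocation counts as successful
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  dlq.push({ event, attempts: maxAttempts, lastError });
  return false;
}

// Hypothetical handler in the current style: the error is swallowed, so the
// invocation looks successful and nothing is retried or dead-lettered.
const logAndReturn = (_event: string): void => {
  try {
    throw new Error('AccessDenied');
  } catch {
    // logged, then swallowed
  }
};

// Hypothetical handler that lets the error propagate, so retries and the
// DLQ hand-off can take effect.
const rethrow = (_event: string): void => {
  throw new Error('AccessDenied');
};
```

In this model only the handler that lets the error propagate ever reaches the DLQ, which suggests the log-and-return paths would need to rethrow for a DLQ or an Errors alarm to see them.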
Current behavior
Failed invocations log-and-return; the failure is invisible to operators unless they happen to inspect Lambda metrics for that specific function.
Proposed solution
- Add an SQS DLQ to WebhookProcessorFn via Lambda's async-invoke onFailure destination (so failed events land on the DLQ after Lambda's automatic retries).
- Add a CloudWatch alarm on the function's Errors metric (>= 1 in 5 min, sustained for 2 evaluation periods).
Reference
- Any existing DLQ setup on LinearIntegration/WebhookProcessorFn (check first; if none exists, this issue establishes the pattern)
- cdk/src/constructs/task-orchestrator.ts is also a candidate reference
Acceptance criteria
- SQS DLQ attached to WebhookProcessorFn async-invoke onFailure
- Section in docs/guides/DEPLOY_PREVIEW_SCREENSHOTS_GUIDE.md explaining how to inspect the DLQ
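A minimal CDK sketch of the proposed wiring, assuming aws-cdk-lib v2 and the constructs package. The stack class name, construct IDs, inline handler, runtime, and retention period are all hypothetical placeholders, not the repository's actual code. The treatMissingData setting is likewise an assumption, chosen so that an idle function does not sit in INSUFFICIENT_DATA.

```typescript
import { Duration, Stack, StackProps } from 'aws-cdk-lib';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as destinations from 'aws-cdk-lib/aws-lambda-destinations';
import * as sqs from 'aws-cdk-lib/aws-sqs';
import { Construct } from 'constructs';

// Hypothetical stack; the real construct layout in the repo may differ.
export class WebhookProcessorDlqSketch extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Queue that receives events once async retries are exhausted.
    const dlq = new sqs.Queue(this, 'WebhookProcessorDlq', {
      retentionPeriod: Duration.days(14), // assumption: keep failures for triage
      enforceSSL: true,
    });

    // Placeholder function standing in for WebhookProcessorFn.
    const processorFn = new lambda.Function(this, 'WebhookProcessorFn', {
      runtime: lambda.Runtime.NODEJS_20_X,
      handler: 'index.handler',
      code: lambda.Code.fromInline(
        'exports.handler = async () => { throw new Error("placeholder"); };',
      ),
      // Async-invoke onFailure destination, as proposed in the issue.
      onFailure: new destinations.SqsDestination(dlq),
      retryAttempts: 2, // Lambda's default for async invocations, made explicit
    });

    // Errors >= 1 per 5-minute period, for 2 consecutive evaluation periods.
    new cloudwatch.Alarm(this, 'WebhookProcessorErrorsAlarm', {
      metric: processorFn.metricErrors({
        period: Duration.minutes(5),
        statistic: 'Sum',
      }),
      threshold: 1,
      comparisonOperator:
        cloudwatch.ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
      evaluationPeriods: 2,
      treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
    });
  }
}
```

Both the onFailure destination and the Errors metric only register invocations in which the handler throws or the runtime fails, so handler paths that currently log and return would likely need to rethrow for this wiring to produce a signal.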