feat(infra): add SQS DLQ + CloudWatch alarm to GitHub webhook processor #284

@isadeks

Component

CDK / infrastructure (notification plane / observability)

Problem

The screenshot pipeline's processor Lambda (GitHubScreenshotIntegration/WebhookProcessorFn) is invoked asynchronously (InvocationType: 'Event') by the receiver. Today it has:

  • No SQS DLQ
  • No onFailure Lambda destination
  • No CloudWatch alarm on Errors

Every failure path inside the handler logs and returns. Because the receiver has already returned 200 to GitHub, GitHub will not redeliver, so a systemic break (IAM regression, AgentCore quota exhaustion, OAuth token rotation issue, dependency outage) silently stops 100% of screenshots with no signal anywhere.
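One consequence of the log-and-return pattern is worth checking before any wiring is added: Lambda only retries an async invocation, records an Errors datapoint, or routes to an onFailure destination when the invocation itself fails. A handler that catches, logs, and returns is recorded as a success. The sketch below contrasts the two patterns; `WebhookEvent`, `Step`, and both wrapper names are hypothetical, since the real handler's types are not shown in this issue.

```typescript
// Hypothetical event shape and processing step, for illustration only.
type WebhookEvent = { deliveryId: string };
type Step = (event: WebhookEvent) => Promise<void>;

// Current pattern: the error is swallowed, so Lambda sees a successful
// invocation. No retry, no Errors datapoint, nothing sent to onFailure.
function logAndReturn(step: Step) {
  return async (event: WebhookEvent): Promise<void> => {
    try {
      await step(event);
    } catch (err) {
      console.error('processor failed', { deliveryId: event.deliveryId, err });
      return;
    }
  };
}

// Alternative: log for context, then rethrow so the invocation fails and
// Lambda's async retry, Errors metric, and onFailure routing all apply.
function logAndRethrow(step: Step) {
  return async (event: WebhookEvent): Promise<void> => {
    try {
      await step(event);
    } catch (err) {
      console.error('processor failed', { deliveryId: event.deliveryId, err });
      throw err;
    }
  };
}
```

If some failure paths are expected and should not page anyone (for example, a webhook for an unsupported event type), those can keep returning normally; only the paths that indicate a systemic break need to throw.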

Spawned from krokoko's review on PR #241 (item #3, blocking).

Expected behavior

A failed processor invocation should produce an operator-visible signal. Once Lambda's automatic async retries are exhausted, the failed event lands on a DLQ and a CloudWatch alarm fires.
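For an SQS onFailure destination, the message body is Lambda's invocation record: a JSON envelope carrying the original payload plus the failure context. The sketch below shows one way an operator script might summarize such a record. The interface lists only the fields used here, and `summarizeFailure` is a hypothetical helper, not something that exists in this repository.

```typescript
// Subset of the async invocation record Lambda sends to a destination.
interface AsyncFailureRecord {
  version: string;
  timestamp: string;
  requestContext: {
    requestId: string;
    functionArn: string;
    condition: string; // e.g. "RetriesExhausted"
    approximateInvokeCount: number;
  };
  requestPayload: unknown; // the original event passed to the processor
  responseContext?: {
    statusCode: number;
    executedVersion: string;
    functionError?: string;
  };
  responsePayload?: { errorType?: string; errorMessage?: string };
}

// Turn one DLQ message body into a single log-friendly line.
function summarizeFailure(messageBody: string): string {
  const record = JSON.parse(messageBody) as AsyncFailureRecord;
  const { requestId, condition, approximateInvokeCount } = record.requestContext;
  const errorType = record.responsePayload?.errorType ?? 'unknown';
  const errorMessage = record.responsePayload?.errorMessage ?? 'no error message';
  return (
    `${record.timestamp} ${requestId} ${condition} ` +
    `after ${approximateInvokeCount} attempt(s): ${errorType}: ${errorMessage}`
  );
}
```

Because `requestPayload` preserves the original event, a record pulled from the queue can be replayed against the processor once the underlying break is fixed.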

Current behavior

Failed invocations log-and-return; the failure is invisible to operators unless they happen to inspect Lambda metrics for that specific function.

Proposed solution

  1. Add an SQS DLQ on WebhookProcessorFn via Lambda's async-invoke onFailure destination (so failed events land on the DLQ after Lambda's automatic retries).
  2. Add a CloudWatch alarm on the processor's Errors metric (>= 1 error per 5-minute period, sustained for 2 evaluation periods).
  3. DLQ retention: 14 days; SSE-KMS encryption.
  4. Add cdk-nag suppressions if needed for DLQ encryption / alarm threshold defaults.
  5. Notify via existing alarm SNS topic if one exists; otherwise create one specific to this stack.
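The steps above could be wired roughly as follows. This is a sketch under stated assumptions, not the pattern already used in this repository: the helper name `addProcessorFailureSignal` and both construct IDs are hypothetical, the alarm topic is assumed to be passed in by the caller, and `QueueEncryption.KMS_MANAGED` (the AWS-managed SQS key) is one way to satisfy "SSE-KMS"; a customer-managed key would need an explicit key and grants instead.

```typescript
import { Duration } from 'aws-cdk-lib';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as cloudwatchActions from 'aws-cdk-lib/aws-cloudwatch-actions';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as destinations from 'aws-cdk-lib/aws-lambda-destinations';
import * as sns from 'aws-cdk-lib/aws-sns';
import * as sqs from 'aws-cdk-lib/aws-sqs';
import { Construct } from 'constructs';

// Hypothetical helper; names and construct IDs are illustrative.
export function addProcessorFailureSignal(
  scope: Construct,
  processorFn: lambda.Function,
  alarmTopic: sns.ITopic,
): { dlq: sqs.Queue; alarm: cloudwatch.Alarm } {
  // Step 3: 14-day retention, SSE-KMS (AWS-managed key assumed here).
  const dlq = new sqs.Queue(scope, 'WebhookProcessorDlq', {
    retentionPeriod: Duration.days(14),
    encryption: sqs.QueueEncryption.KMS_MANAGED,
    enforceSSL: true,
  });

  // Step 1: route events that still fail after Lambda's async retries.
  // SqsDestination also grants the function permission to send to the queue.
  processorFn.configureAsyncInvoke({
    onFailure: new destinations.SqsDestination(dlq),
  });

  // Step 2: >= 1 error per 5-minute period, for 2 evaluation periods.
  const alarm = new cloudwatch.Alarm(scope, 'WebhookProcessorErrorsAlarm', {
    metric: processorFn.metricErrors({
      period: Duration.minutes(5),
      statistic: 'Sum',
    }),
    threshold: 1,
    comparisonOperator:
      cloudwatch.ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
    evaluationPeriods: 2,
  });

  // Step 5: notify through the SNS topic supplied by the caller.
  alarm.addAlarmAction(new cloudwatchActions.SnsAction(alarmTopic));

  return { dlq, alarm };
}
```

`configureAsyncInvoke` can only be called once per function, so if the processor already has an async invoke configuration, the `onFailure` setting belongs there instead. For step 4, cdk-nag may flag the queue for not having a DLQ of its own, since it is a Lambda destination rather than the target of an SQS redrive policy; that would be the place for a suppression with a stated reason.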

Reference

  • Pattern to copy: an existing DLQ in the stack, if there is one (e.g. LinearIntegration/WebhookProcessorFn; check first). If none exists, this issue establishes the pattern.
  • cdk/src/constructs/task-orchestrator.ts is also a candidate reference

Acceptance criteria

  • DLQ attached to WebhookProcessorFn async-invoke onFailure
  • CloudWatch alarm on processor Errors (>= 1 error per 5-minute period, 2 evaluation periods)
  • cdk-nag clean
  • Smoke test: deliberately break the processor (e.g. revoke an IAM permission), confirm the failed event lands on the DLQ and the alarm fires
  • Operator documentation update in docs/guides/DEPLOY_PREVIEW_SCREENSHOTS_GUIDE.md explaining how to inspect the DLQ

Labels

  • P2: Priority 2 - medium priority
  • approved: When an issue has been approved and ready
  • enhancement: New feature or request
  • infra-cdk: CDK stacks/constructs, bootstrap, deploy topology, tags, IAM wiring, teardown
  • observability: Tracing, attribution, dashboards, metrics, alarms, telemetry redaction
