Skip to content

fix(lambda): return batch failures on unhandled errors instead of dropping messages#5129

Open
vegardx wants to merge 2 commits into
github-aws-runners:mainfrom
vegardx:fix/batch-item-failures-on-unhandled-errors
Open

fix(lambda): return batch failures on unhandled errors instead of dropping messages#5129
vegardx wants to merge 2 commits into
github-aws-runners:mainfrom
vegardx:fix/batch-item-failures-on-unhandled-errors

Conversation

@vegardx
Copy link
Copy Markdown

@vegardx vegardx commented May 26, 2026

Problem

The scaleUpHandler catch block returns empty batchItemFailures for non-ScaleError exceptions. This tells SQS that all messages were processed successfully, causing them to be permanently deleted. The corresponding workflow_job events are silently lost — no EC2 instances are launched, no retry occurs, and affected jobs remain queued in GitHub until they time out (24 hours).

Any transient error (GitHub API 404, rate limit, network timeout) that isn't a ScaleError triggers this path.

Fix

Return all message IDs as batchItemFailures when an unhandled exception occurs. SQS will retry delivery according to the queue's visibility timeout. The redrive policy / maxReceiveCount moves persistently-failing messages to the DLQ, preventing infinite retries.

} else {
  logger.error(
    `Error processing batch (size: ${sqsMessages.length}): ${(e as Error).message}, returning batch for retry`,
    { error: e },
  );
  batchItemFailures.push(...sqsMessages.map(({ messageId }) => ({ itemIdentifier: messageId })));
}

Changes

  • lambdas/functions/control-plane/src/lambda.ts — return all messages as batch failures on non-ScaleError
  • lambdas/functions/control-plane/src/lambda.test.ts — update tests to assert new behavior + add log message test

Impact

  • Non-ScaleError exceptions → messages retried via SQS (previously: silently dropped)
  • ScaleError path → unchanged
  • DLQ catches persistently-failing messages (no infinite retry loop)

Fixes #5024
Also fixes #5105

…pping messages

The scaleUpHandler catch block previously returned empty batchItemFailures
for non-ScaleError exceptions, causing SQS to delete all messages in the
batch. Jobs were permanently lost with no retry.

Return all message IDs as batchItemFailures so SQS retries them. The
redrive policy / maxReceiveCount will move persistently-failing messages
to the DLQ, preventing infinite retries.

Fixes github-aws-runners#5024
@vegardx vegardx requested a review from a team as a code owner May 26, 2026 20:19
Copilot AI review requested due to automatic review settings May 26, 2026 20:19
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates the SQS scale-up Lambda to retry the entire batch on unexpected (non-ScaleError) failures, aligning behavior with partial batch response semantics and adding/adjusting tests to verify retries and logging.

Changes:

  • Return all SQS message IDs as batchItemFailures when scaleUp throws a non-ScaleError, causing SQS/Lambda to retry the batch.
  • Update existing tests to assert the new retry behavior for single and multi-record batches.
  • Add a test asserting the updated error log message for non-ScaleError failures.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
lambdas/functions/control-plane/src/lambda.ts Changes non-ScaleError handling to return all messages for retry and updates the log message accordingly.
lambdas/functions/control-plane/src/lambda.test.ts Updates expectations to match new retry behavior and adds a log assertion test.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +221 to +223
expect(result).toEqual({
batchItemFailures: [{ itemIdentifier: 'message-0' }, { itemIdentifier: 'message-1' }],
});
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed in d949963 — switched to expect.arrayContaining with a length check.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants