fix(lambda): return batch failures on unhandled errors instead of dropping messages#5129
Open
vegardx wants to merge 2 commits into
Open
Conversation
…pping messages The scaleUpHandler catch block previously returned empty batchItemFailures for non-ScaleError exceptions, causing SQS to delete all messages in the batch. Jobs were permanently lost with no retry. Return all message IDs as batchItemFailures so SQS retries them. The redrive policy / maxReceiveCount will move persistently-failing messages to the DLQ, preventing infinite retries. Fixes github-aws-runners#5024
Contributor
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Updates the SQS scale-up Lambda to retry the entire batch on unexpected (non-ScaleError) failures, aligning behavior with partial batch response semantics and adding/adjusting tests to verify retries and logging.
Changes:
- Return all SQS message IDs as
batchItemFailureswhenscaleUpthrows a non-ScaleError, causing SQS/Lambda to retry the batch. - Update existing tests to assert the new retry behavior for single and multi-record batches.
- Add a test asserting the updated error log message for non-
ScaleErrorfailures.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| lambdas/functions/control-plane/src/lambda.ts | Changes non-ScaleError handling to return all messages for retry and updates the log message accordingly. |
| lambdas/functions/control-plane/src/lambda.test.ts | Updates expectations to match new retry behavior and adds a log assertion test. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+221
to
+223
| expect(result).toEqual({ | ||
| batchItemFailures: [{ itemIdentifier: 'message-0' }, { itemIdentifier: 'message-1' }], | ||
| }); |
Author
There was a problem hiding this comment.
Good catch. Fixed in d949963 — switched to expect.arrayContaining with a length check.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The
scaleUpHandlercatch block returns emptybatchItemFailuresfor non-ScaleErrorexceptions. This tells SQS that all messages were processed successfully, causing them to be permanently deleted. The correspondingworkflow_jobevents are silently lost — no EC2 instances are launched, no retry occurs, and affected jobs remainqueuedin GitHub until they time out (24 hours).Any transient error (GitHub API 404, rate limit, network timeout) that isn't a
ScaleErrortriggers this path.Fix
Return all message IDs as
batchItemFailureswhen an unhandled exception occurs. SQS will retry delivery according to the queue's visibility timeout. The redrive policy /maxReceiveCountmoves persistently-failing messages to the DLQ, preventing infinite retries.Changes
lambdas/functions/control-plane/src/lambda.ts— return all messages as batch failures on non-ScaleErrorlambdas/functions/control-plane/src/lambda.test.ts— update tests to assert new behavior + add log message testImpact
Fixes #5024
Also fixes #5105