Commit 69cb24d

Merge pull request #1473 from MITLibraries/etd-677-retry-apt

Adds retry logic to PreservationSubmissionJob

2 parents 1a3ca0f + 30f7a03

File tree

3 files changed: +87 −33 lines

README.md

Lines changed: 10 additions & 3 deletions

````diff
@@ -374,11 +374,18 @@ processors must create a corresponding Archivematica accession number.
 1. Following the processing workflow, stakeholders choose a term to publish (Process theses - Select term - Select
    Publication Review - Publish)
 2. ETD will now automatically send data to DSS via the SQS queue
-3. DSS runs (as of this writing that is a manual process documented in the
-   [DSS repo](https://github.com/MITLibraries/dspace-submission-service#run-stage))
+3. DSS runs (as of this writing that is no longer documented in the
+   [DSS repo](https://github.com/MITLibraries/dspace-submission-service))
+   * From the root of a local checkout of `dspace-submission-service`
+   * Python dependencies do *not* need to be installed
+   * Authenticate your terminal to the appropriate AWS account and role (dssManagement)
+   * `make run-prod` (or `make run-stage`)
 4. ETD processes the output queue to update records and send email to stakeholders with summary data and a list
    of any error records. As of now this is a manual process, but it can be triggered via rake task using the following heroku-cli command:
 
+> [!IMPORTANT]
+> Check the AWS Console to ensure the APT Lambda is active (i.e. not inactive) before proceeding. If you do not, the preservation jobs may fail if they have not run for a few days.
+
 ```shell
 # run the output queue processing job
 heroku run rails dss:process_output_queue --app TARGET-HEROKU-APP
@@ -397,7 +404,7 @@ You can publish a single thesis that is already in `Publication review` or `Pend
 task:
 
 ```shell
-heroku run -s standard-2x rails dss:publish_thesis_by_id[THESIS_ID] --app TARGET-HEROKU-APP
+heroku run rails dss:publish_thesis_by_id[THESIS_ID] --app TARGET-HEROKU-APP
 ```
 
 Note: `Pending publication` is allowed here, but it is not expected to be a normal occurrence; it handles the edge case of the app thinking data was sent to SQS but the data not arriving for any reason.
````
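The IMPORTANT callout above can also be checked from a terminal rather than the AWS Console; a sketch using the AWS CLI, assuming an authenticated session (the function name `apt-lambda` is a placeholder — substitute the real function name):

```shell
# Check whether the APT Lambda is warm before kicking off preservation jobs.
# "apt-lambda" is a placeholder; State will be one of Active, Inactive, or Pending.
aws lambda get-function-configuration \
  --function-name apt-lambda \
  --query 'State' \
  --output text
```

If this prints `Inactive`, the first invocation will trigger a warm-up and may take several minutes to succeed.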

app/jobs/preservation_submission_job.rb

Lines changed: 23 additions & 4 deletions

```diff
@@ -4,6 +4,16 @@ class PreservationSubmissionJob < ActiveJob::Base
 
   queue_as :default
 
+  # Custom error class for 502 Bad Gateway responses from APT
+  class APTBadGatewayError < StandardError; end
+
+  # Retry up to 10 times for transient 502 Bad Gateway responses. These 502 errors are generally caused by the lambda
+  # entering the `Inactive` state after a long period of inactivity, which can take several minutes to recover from.
+  # We use a fixed wait of 5 minutes between retries to give the lambda time to warm up, rather than retrying
+  # immediately with exponential backoff: we expect the first few retries to fail in that scenario, so the longer
+  # fixed wait is more effective.
+  retry_on APTBadGatewayError, wait: 5.minutes, attempts: 10
+
   def perform(theses)
     Rails.logger.info("Preparing to send #{theses.count} theses to preservation")
     results = { total: theses.count, processed: 0, errors: [] }
@@ -13,7 +23,10 @@ def perform(theses)
       preserve_payload(payload)
       Rails.logger.info("Thesis #{thesis.id} has been sent to preservation")
       results[:processed] += 1
-    rescue StandardError, Aws::Errors => e
+    rescue StandardError => e
+      # Explicitly re-raise the 502-specific error so ActiveJob can retry it.
+      raise e if e.is_a?(APTBadGatewayError)
+
       preservation_error = "Thesis #{thesis.id} could not be preserved: #{e}"
       Rails.logger.info(preservation_error)
       results[:errors] << preservation_error
@@ -40,6 +53,12 @@ def post_payload(payload)
       http.request(request)
     end
 
+    # If the remote endpoint returns 502, raise an APTBadGatewayError so ActiveJob can retry.
+    if response.code.to_s == '502'
+      Rails.logger.warn("Received 502 from APT for payload #{payload.id}; raising for retry")
+      raise APTBadGatewayError, 'APT returned 502 Bad Gateway'
+    end
+
     validate_response(response)
   end
 
@@ -49,8 +68,8 @@ def validate_response(response)
     end
 
     result = JSON.parse(response.body)
-    unless result['success'] == true
-      raise "APT failed to create a bag: #{response.body}"
-    end
+    return if result['success'] == true
+
+    raise "APT failed to create a bag: #{response.body}"
   end
 end
```
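The retry design above has two halves: `post_payload` converts a 502 into a dedicated `APTBadGatewayError`, and `perform` re-raises only that class so `retry_on` can reschedule the job, while every other error is swallowed into the results. A minimal plain-Ruby sketch of that classification outside Rails/ActiveJob (method names here are illustrative, not the app's API):

```ruby
require 'json'

# Retryable error class, mirroring the job's APTBadGatewayError.
class APTBadGatewayError < StandardError; end

# Mimics post_payload/validate_response: a 502 becomes a retryable error,
# any other failure is raised as a generic error for the caller to record.
def classify_response(status, body)
  raise APTBadGatewayError, 'APT returned 502 Bad Gateway' if status == 502

  result = JSON.parse(body)
  return :ok if result['success'] == true

  raise "APT failed to create a bag: #{body}"
end

# Mimics perform's rescue: re-raise the retryable class so the (hypothetical)
# job framework can retry it; collect everything else and keep going.
def process(status, body, errors)
  classify_response(status, body)
  :processed
rescue StandardError => e
  raise e if e.is_a?(APTBadGatewayError)

  errors << e.message
  :errored
end
```

The key design choice is that the retryable error must escape the blanket `rescue StandardError`, otherwise `retry_on` never sees it and the 502 is silently logged like any other failure.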

test/jobs/preservation_submission_job_test.rb

Lines changed: 54 additions & 26 deletions

```diff
@@ -2,7 +2,6 @@
 require 'webmock/minitest'
 
 class PreservationSubmissionJobTest < ActiveJob::TestCase
-
   # because we need to actually use the file it's easier to attach it in the test rather
   # than use our fixtures as the fixtures oddly don't account for the file actually being
   # where ActiveStorage expects them to be. We also need this to be a record that looks like
@@ -27,38 +26,38 @@ def setup_thesis_two
   end
 
   def stub_apt_lambda_success
-    stub_request(:post, ENV['APT_LAMBDA_URL'])
+    stub_request(:post, ENV.fetch('APT_LAMBDA_URL', nil))
       .to_return(
         status: 200,
         body: {
           success: true,
           bag: {
             entries: {
-              "manifest-md5.txt" => { md5: "fakehash" }
+              'manifest-md5.txt' => { md5: 'fakehash' }
             }
           },
-          output_zip_s3_uri: "s3://my-bucket/apt-testing/test-one-medium.zip"
+          output_zip_s3_uri: 's3://my-bucket/apt-testing/test-one-medium.zip'
         }.to_json,
         headers: { 'Content-Type' => 'application/json' }
       )
   end
 
   def stub_apt_lambda_failure
-    stub_request(:post, ENV['APT_LAMBDA_URL'])
+    stub_request(:post, ENV.fetch('APT_LAMBDA_URL', nil))
       .to_return(
         status: 500,
         headers: { 'Content-Type' => 'application/json' }
       )
   end
 
   def stub_apt_lambda_200_failure
-    stub_request(:post, ENV['APT_LAMBDA_URL'])
+    stub_request(:post, ENV.fetch('APT_LAMBDA_URL', nil))
       .to_return(
         status: 200,
         body: {
           success: false,
-          error: "An error occurred (404) when calling the HeadObject operation: Not Found",
-          bag: { entries: {}},
+          error: 'An error occurred (404) when calling the HeadObject operation: Not Found',
+          bag: { entries: {} }
         }.to_json,
         headers: { 'Content-Type' => 'application/json' }
       )
@@ -146,45 +145,74 @@ def stub_apt_lambda_200_failure
     end
   end
 
-  test 'does not create payloads if job fails' do
+  # This is not the desired behavior, but it confirms the current behavior.
+  test 'creates payloads even if APT job fails' do
     stub_apt_lambda_failure
-    bad_thesis = theses(:one)
     good_thesis = setup_thesis
     another_good_thesis = setup_thesis_two
-    assert_empty bad_thesis.archivematica_payloads
     assert_empty good_thesis.archivematica_payloads
     assert_empty another_good_thesis.archivematica_payloads
 
-    PreservationSubmissionJob.perform_now([good_thesis, bad_thesis, another_good_thesis])
+    PreservationSubmissionJob.perform_now([good_thesis, another_good_thesis])
 
-    # first thesis should succeed and have a payload
     assert_equal 1, good_thesis.archivematica_payloads.count
 
-    # second thesis should fail and not have a payload
-    assert_empty bad_thesis.archivematica_payloads
-
-    # third thesis should succeed and have a payload, despite prior failure
     assert_equal 1, another_good_thesis.archivematica_payloads.count
   end
 
-  test 'does not create payloads if post succeeds but APT fails' do
+  # This is not the desired behavior, but it confirms the current behavior.
+  test 'creates payloads if post succeeds but APT fails' do
     stub_apt_lambda_200_failure
-    bad_thesis = theses(:one)
     good_thesis = setup_thesis
     another_good_thesis = setup_thesis_two
-    assert_empty bad_thesis.archivematica_payloads
     assert_empty good_thesis.archivematica_payloads
     assert_empty another_good_thesis.archivematica_payloads
 
-    PreservationSubmissionJob.perform_now([good_thesis, bad_thesis, another_good_thesis])
+    PreservationSubmissionJob.perform_now([good_thesis, another_good_thesis])
 
-    # first thesis should succeed and have a payload
     assert_equal 1, good_thesis.archivematica_payloads.count
 
-    # second thesis should fail and not have a payload
-    assert_empty bad_thesis.archivematica_payloads
-
-    # third thesis should succeed and have a payload, despite prior failure
     assert_equal 1, another_good_thesis.archivematica_payloads.count
   end
+
+  test 'throws exceptions and probably creates payloads when a bad key is provided' do
+    skip('Test not implemented yet')
+    # Our lambda returns 400 Bad Request with an error body of "Invalid input payload".
+    # This should never happen as we submit the key via ENV, but just in case we should understand what it looks like.
+  end
+
+  test 'retries on 502 Bad Gateway errors' do
+    # This syntax causes the first two calls to return 502 and the third to succeed.
+    stub_request(:post, ENV.fetch('APT_LAMBDA_URL', nil))
+      .to_return(
+        { status: 502 },
+        { status: 502 },
+        { status: 200, body: {
+          success: true,
+          bag: {
+            entries: {
+              'manifest-md5.txt' => { md5: 'fakehash' }
+            }
+          },
+          output_zip_s3_uri: 's3://my-bucket/apt-testing/test-one-medium.zip'
+        }.to_json, headers: { 'Content-Type' => 'application/json' } }
+      )
+
+    thesis = setup_thesis
+    assert_equal 0, thesis.archivematica_payloads.count
+
+    # We need to wrap this in perform_enqueued_jobs so that the retries are actually executed. Our normal syntax only
+    # runs a single job, but because we are testing retries, this wrapper also executes the jobs this job enqueues.
+    perform_enqueued_jobs do
+      PreservationSubmissionJob.perform_now([thesis])
+    end
+
+    # 3 payloads were created: 2 for the failures, 1 for the success.
+    # This isn't ideal, but it's the current behavior. A refactor to use a pre-preservation job to create
+    # metadata.csv prior to this job would fix this, but that is out of scope for now.
+    assert_equal 3, thesis.archivematica_payloads.count
+
+    # The last payload should be marked as preserved.
+    assert_equal 'preserved', thesis.archivematica_payloads.last.preservation_status
+  end
 end
```
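The new retry test relies on WebMock returning its canned responses in declaration order: the first two calls get a 502 and later calls get the 200. That sequencing can be illustrated in plain Ruby (a hypothetical `FakeEndpoint`, not WebMock's implementation):

```ruby
# Hypothetical stand-in for a stub declared with to_return(r1, r2, r3):
# each call consumes the next canned response, and the last one repeats
# for any further calls (WebMock behaves similarly once the list is exhausted).
class FakeEndpoint
  def initialize(*responses)
    @responses = responses
  end

  def call
    @responses.length > 1 ? @responses.shift : @responses.first
  end
end

endpoint = FakeEndpoint.new({ status: 502 }, { status: 502 }, { status: 200 })
endpoint.call # first call returns { status: 502 }
```

This is why the test needs exactly two retries to pass: attempts one and two consume the 502s, and attempt three (the second retry) hits the 200.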
