Skip to content

fix(ia-save): validate save job status before recording success#590

Open
maxachis wants to merge 2 commits intodevfrom
mc_387_save_status_polling_fix
Open

fix(ia-save): validate save job status before recording success#590
maxachis wants to merge 2 commits intodevfrom
mc_387_save_status_polling_fix

Conversation

@maxachis
Copy link
Collaborator

@maxachis maxachis commented Feb 27, 2026

Summary

  • switch IA save flow to request JSON capture output and require a job_id
  • poll /save/status/{job_id} and only treat status=success as success
  • include authorization header on status polling requests
  • return failure/timeout messages as task errors instead of writing false-positive save metadata
  • add unit tests for save submit success/failure and status polling behavior

Why

Issue #387 reports many URLs marked as saved even when no archive entry exists. The previous implementation treated any 2xx from /save as success, but save is asynchronous and requires status verification.

Testing

  • python3 -m compileall src/external/internet_archives/client.py tests/automated/unit/external/internet_archives/test_client_save.py
  • POSTGRES_USER=test_source_collector_user POSTGRES_PASSWORD=HanviliciousHamiltonHilltops POSTGRES_HOST=localhost POSTGRES_PORT=5432 POSTGRES_DB=source_collector_test_db UV_CACHE_DIR=/tmp/uv-cache uv run pytest -q tests/automated/integration/tasks/scheduled/impl/internet_archives/save/test_new_insert.py tests/automated/integration/tasks/scheduled/impl/internet_archives/save/test_updated_insert.py tests/automated/integration/tasks/scheduled/impl/internet_archives/save/test_error.py tests/automated/unit/external/internet_archives/test_client_save.py
    • result: 8 passed
  • live smoke (real INTERNET_ARCHIVE_S3_KEYS):
    • https://example.com -> has_error=False
    • https://www.pdap.io -> has_error=False

Fixes #387

@maxachis
Copy link
Collaborator Author

Validation summary:

  1. Automated tests (local Docker Postgres):
  • tests/automated/integration/tasks/scheduled/impl/internet_archives/save/test_new_insert.py
  • tests/automated/integration/tasks/scheduled/impl/internet_archives/save/test_updated_insert.py
  • tests/automated/integration/tasks/scheduled/impl/internet_archives/save/test_error.py
  • tests/automated/unit/external/internet_archives/test_client_save.py
  • Result: 8 passed.
  1. Live smoke test (with real INTERNET_ARCHIVE_S3_KEYS):
  • https://example.com -> has_error=False
  • https://www.pdap.io -> has_error=False
  1. Additional bug found/fixed during validation:
  • /save/status/{job_id} was initially returning 401 because status polling lacked Authorization.
  • Added auth header for status polling and re-ran tests + live smoke successfully.

@maxachis
Copy link
Collaborator Author

Additional verification note (live IA capture confirmation):

For the two live-smoke URLs, I verified actual captures were created by checking save-status payloads and opening the resulting archive URLs.

  • https://example.com

    • save status: success
    • timestamp: 20260227000026
    • archive URL: https://web.archive.org/web/20260227000026/https://example.com/
    • archive response: HTTP 200
  • https://www.pdap.io

    • save status: success
    • timestamp: 20260227005638
    • archive URL: https://web.archive.org/web/20260227005638/https://www.pdap.io/
    • archive response: HTTP 200

So these are confirmed captures on IA, not only successful client responses.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant