mc_89_add_scraping_logic_for_non_pending_urls #355

maxachis · 2025-07-31T19:21:41Z

This PR adds scraping logic for non-pending URLs. The URL HTML task now scrapes non-pending URLs as well, storing their compressed HTML and other information from the scrape in the database.
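For illustration, storing compressed HTML can be sketched as a round trip through a compression step. This is a minimal sketch, not the code in this PR: the choice of `zlib` and the helper names `compress_html` and `decompress_html` are assumptions made for the example.

```python
import zlib


def compress_html(html: str) -> bytes:
    """Encode the page as UTF-8 and compress it for storage (hypothetical helper)."""
    return zlib.compress(html.encode("utf-8"))


def decompress_html(blob: bytes) -> str:
    """Reverse compress_html: decompress the stored bytes and decode them."""
    return zlib.decompress(blob).decode("utf-8")


if __name__ == "__main__":
    page = "<html><body>" + "<p>hello</p>" * 200 + "</body></html>"
    blob = compress_html(page)
    # The compressed blob is what would be written to a binary column.
    print(len(page.encode("utf-8")), "bytes raw,", len(blob), "bytes compressed")
    assert decompress_html(blob) == page
```

Repetitive markup usually compresses well, which is one plausible reason to store the compressed form rather than the raw page.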

To capture better information about scrapes without disrupting the status of existing URLs, several other changes were made:

Addition of a url_web_metadata table, which records the HTTP status returned when the link is accessed, as well as the content type returned (which is in turn used in scraping).
Addition of a url_scrape_info table, which describes the outcome of a scrape.
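How the two tables could work together can be sketched roughly as follows. Only the table names come from this PR; the column names (`url_id`, `status_code`, `content_type`, `status`), the use of SQLite, and the rule "scrape only URLs that returned 200 with an HTML content type and have no recorded scrape yet" are assumptions chosen to keep the example self-contained.

```python
import sqlite3

# Hypothetical schema: the column names are illustrative, not taken from the PR.
SCHEMA = """
CREATE TABLE url_web_metadata (
    url_id INTEGER PRIMARY KEY,
    status_code INTEGER,
    content_type TEXT
);
CREATE TABLE url_scrape_info (
    url_id INTEGER PRIMARY KEY,
    status TEXT NOT NULL
);
"""


def is_html(content_type):
    """True if the media type is text/html, ignoring parameters such as charset."""
    if content_type is None:
        return False
    return content_type.split(";")[0].strip().lower() == "text/html"


def urls_to_scrape(conn):
    """Return ids of URLs that responded 200 with HTML and have no scrape recorded."""
    rows = conn.execute(
        """
        SELECT m.url_id, m.content_type
        FROM url_web_metadata AS m
        LEFT JOIN url_scrape_info AS s ON s.url_id = m.url_id
        WHERE m.status_code = 200 AND s.url_id IS NULL
        ORDER BY m.url_id
        """
    ).fetchall()
    return [url_id for url_id, content_type in rows if is_html(content_type)]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO url_web_metadata VALUES (?, ?, ?)",
        [(1, 200, "text/html; charset=utf-8"), (2, 200, "application/pdf")],
    )
    print(urls_to_scrape(conn))
```

Keeping the scrape outcome in its own table means a scrape can be recorded without touching the URL's existing status, which matches the stated goal of not disrupting existing URLs.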

maxachis added 3 commits July 30, 2025 17:12

Add scraping logic for non-pending URLs

f0f33c4

Merge branch 'dev' into mc_89_add_scraping_logic_for_non_pending_urls

93d9c26

Add scraping logic for non pending URLs

15e8bee

maxachis requested a review from josh-chamberlain as a code owner July 31, 2025 19:21

maxachis marked this pull request as draft July 31, 2025 19:51

maxachis added 11 commits August 1, 2025 08:11

Clean up logic, refactor URL Requests Interface, begin setting up probe task

e92cd66

Finish draft of Probe Task logic

20f1f9b

Begin draft of test logic

0c8c5eb

Finish tests for URL Probe

24f2cac

Adjust URL Html Task logic.

ab3071e

Add task to loader

b7a0af0

Fix bugs and refine

7a78aed

Refactor

98edd9a

Refine HTML task

158f211

Fix broken imports

7b80acf

fix bug when checking for marked as 404

284eb66

maxachis marked this pull request as ready for review August 3, 2025 13:05

maxachis merged commit 073b247 into dev Aug 3, 2025
4 checks passed

maxachis deleted the mc_89_add_scraping_logic_for_non_pending_urls branch August 3, 2025 13:05

2 participants