@maxachis
Collaborator

maxachis commented Jul 31, 2025

#89

This PR adds scraping logic for non-pending URLs. The URL HTML task now scrapes non-pending URLs as well as pending ones, and stores their compressed HTML, along with other information from the scrape, in the database.
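For reviewers unfamiliar with storing compressed HTML, here is a minimal sketch of the round trip. It assumes zlib compression and UTF-8 encoding; the PR description does not say which codec the task actually uses, and the function names `compress_html` and `decompress_html` are hypothetical.

```python
import zlib


def compress_html(html: str) -> bytes:
    """Encode HTML as UTF-8 and compress it for storage (assumed codec: zlib)."""
    return zlib.compress(html.encode("utf-8"))


def decompress_html(blob: bytes) -> str:
    """Reverse compress_html: decompress the stored bytes and decode as UTF-8."""
    return zlib.decompress(blob).decode("utf-8")


if __name__ == "__main__":
    page = "<html><body>" + "<p>hello</p>" * 200 + "</body></html>"
    blob = compress_html(page)
    # Repetitive markup compresses well, so the stored blob is smaller.
    print(len(page.encode("utf-8")), len(blob))
```

The compressed bytes would typically go into a binary column, with decompression happening only when the HTML is needed downstream.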

To capture better information about each scrape without disrupting the status of existing URLs, this PR also makes the following changes:

  • Added a url_web_metadata table, which records the HTTP status returned when the link is accessed, as well as the content type of the response (which is in turn used in scraping).
  • Added a url_scrape_info table, which describes the outcome of a scrape.
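The two tables above could be sketched roughly as follows. This is an illustration only: it uses an in-memory SQLite database, and every column name (`url_id`, `status_code`, `content_type`, `status`), the helper functions, and the "scrape only HTML responses with status 200" rule are assumptions, not the schema or logic in this PR.

```python
import sqlite3
from typing import Optional

# Hypothetical schema; only the table names come from the PR description.
SCHEMA = """
CREATE TABLE url_web_metadata (
    url_id       INTEGER PRIMARY KEY,
    status_code  INTEGER,
    content_type TEXT
);
CREATE TABLE url_scrape_info (
    url_id INTEGER PRIMARY KEY,
    status TEXT NOT NULL
);
"""


def make_db() -> sqlite3.Connection:
    """Create an in-memory database containing both tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def record_web_metadata(conn: sqlite3.Connection, url_id: int,
                        status_code: Optional[int],
                        content_type: Optional[str]) -> None:
    """Store the HTTP status and content type observed for a URL."""
    conn.execute(
        "INSERT INTO url_web_metadata (url_id, status_code, content_type) "
        "VALUES (?, ?, ?)",
        (url_id, status_code, content_type),
    )


def should_scrape(status_code: Optional[int],
                  content_type: Optional[str]) -> bool:
    """Assumed rule: scrape only successful responses that return HTML."""
    if status_code != 200 or content_type is None:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    return content_type.split(";")[0].strip().lower() == "text/html"


def record_scrape(conn: sqlite3.Connection, url_id: int, status: str) -> None:
    """Store the outcome of a scrape attempt for a URL."""
    conn.execute(
        "INSERT INTO url_scrape_info (url_id, status) VALUES (?, ?)",
        (url_id, status),
    )
```

Keeping the scrape outcome in its own table, rather than on the URL row, is what lets a non-pending URL be scraped without its existing status being touched.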

maxachis marked this pull request as draft July 31, 2025 19:51
maxachis marked this pull request as ready for review August 3, 2025 13:05
maxachis merged commit 073b247 into dev Aug 3, 2025
4 checks passed
maxachis deleted the mc_89_add_scraping_logic_for_non_pending_urls branch August 3, 2025 13:05