Anna-Grid/404_error_analysis_tool

🔍 Automated 404 Error Analysis Tool (SEO / GSC)

Status: Portfolio Showcase (Mock Logic) | Domain: E-Commerce, SEO, Data Analysis

Python | Data Analysis | Web Scraping | SEO

Overview

Large e-commerce platforms often generate thousands of 404 errors, making manual analysis inefficient and error-prone. This tool automates the investigation process by combining multiple data sources and applying validation logic to detect patterns and classify issues at scale.

Key Features

  • Multi-source data enrichment:
    • Google Search Console (error reports)
    • Sitemap (XML/CSV parsing)
    • CMS database export
    • Live HTTP requests (status, title, images)
  • URL structure analysis: Slug extraction, category and path parsing.
  • Data validation layers: Presence in sitemap, existence in CMS, publication status (active/hidden), live HTTP response.
  • Automated issue classification: Identifies deleted pages, hidden products, active items returning 404 (critical bugs), and inconsistencies between the CMS and the live site.
  • Controlled processing: Implements rate-limited scraping (Cloudscraper) to avoid triggering anti-bot protections.
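The URL structure analysis step can be illustrated with a minimal sketch. The function name `parse_url_structure`, the example domain, and the rule that the first path segment is the category are assumptions made for illustration; the repository's actual parsing rules may differ.

```python
from urllib.parse import urlsplit


def parse_url_structure(url: str) -> dict:
    """Split a URL into its path, path segments, a category guess, and a slug."""
    path = urlsplit(url).path  # query string and fragment are dropped
    segments = [s for s in path.split("/") if s]
    # The slug is taken to be the last non-empty path segment.
    slug = segments[-1] if segments else ""
    # Assumption: the first segment names the category when the path
    # has at least two segments (e.g. /shoes/red-sneaker-42/).
    category = segments[0] if len(segments) > 1 else ""
    return {"path": path, "segments": segments, "category": category, "slug": slug}


info = parse_url_structure("https://shop.example.com/shoes/red-sneaker-42/?ref=gsc")
print(info["category"], info["slug"])
```

The extracted slug can then serve as a lookup key against the sitemap and the CMS export.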

Output & Insights

The tool generates a structured Excel report with enriched diagnostics, transforming raw signals into actionable SEO insights:

  • “Active in CMS but returns 404”: Critical issue
  • “Not in CMS and not in sitemap”: Likely deleted
  • “Hidden in CMS”: Expected 404 behavior
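The three diagnostics above can be expressed as a small rule function. This is a sketch under stated assumptions, not the repository's actual logic: the parameter names, the `"active"`/`"hidden"` status strings, and the `"Needs manual review"` fallback label are hypothetical.

```python
from typing import Optional


def classify_url(in_cms: bool, cms_status: Optional[str],
                 in_sitemap: bool, http_status: int) -> str:
    """Map validation signals for one URL to a diagnostic label."""
    # An active CMS item that returns 404 on the live site is the critical case.
    if in_cms and cms_status == "active" and http_status == 404:
        return "Critical issue"
    # A hidden item is expected to return 404.
    if in_cms and cms_status == "hidden":
        return "Expected 404 behavior"
    # Absent from both the CMS and the sitemap: probably removed on purpose.
    if not in_cms and not in_sitemap:
        return "Likely deleted"
    # Assumption: any other combination is left for a human to inspect.
    return "Needs manual review"


print(classify_url(in_cms=True, cms_status="active", in_sitemap=True, http_status=404))
```

With Pandas, a function like this could be applied row by row to the enriched table before the Excel export.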

Technical Highlights

  • Built with Python (Pandas, NumPy).
  • Robust parsing of inconsistent CSV and XML inputs.
  • Custom CMS data reconstruction (handling broken exports).
  • Concurrent requests with an anti-blocking strategy designed for large datasets (~15k+ URLs).
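As an illustration of tolerant XML parsing, the sketch below collects `<loc>` values from a sitemap whether or not the file declares the standard sitemap namespace. The function name `sitemap_urls` and the sample data are assumptions; the repository's parser may handle more cases, such as sitemap index files or CSV inputs.

```python
import xml.etree.ElementTree as ET


def sitemap_urls(xml_text: str) -> set:
    """Return the set of <loc> URLs found in a sitemap document."""
    root = ET.fromstring(xml_text)
    urls = set()
    for el in root.iter():
        # ElementTree writes namespaced tags as "{namespace}loc";
        # keep only the local name so both variants match.
        local_name = el.tag.rsplit("}", 1)[-1]
        if local_name == "loc" and el.text:
            url = el.text.strip()  # exports sometimes pad values with whitespace
            if url:
                urls.add(url)
    return urls


SAMPLE = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc> https://shop.example.com/shoes/red-sneaker </loc></url>"
    "<url><loc>https://shop.example.com/bags/tote</loc></url>"
    "</urlset>"
)
print(len(sitemap_urls(SAMPLE)))
```

Returning a set makes the later "presence in sitemap" check a constant-time membership test per URL.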

Notes: Processing is intentionally rate-limited (≈1000 URLs/hour) to avoid blocking. Sequential execution is required in notebook environments. Due to NDA, this repository contains only sanitized core logic.
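A rate of roughly 1000 URLs per hour works out to one request about every 3.6 seconds. The sketch below shows one way to enforce such a spacing. The `Throttle` class is hypothetical and is not taken from the repository; the clock and sleep functions are injectable only so that the behavior can be checked without real waiting.

```python
import time


class Throttle:
    """Space out calls so that at most `per_hour` of them happen per hour."""

    def __init__(self, per_hour=1000, clock=time.monotonic, sleep=time.sleep):
        self.interval = 3600.0 / per_hour  # seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # no request has been made yet

    def wait(self):
        """Block until the next request is allowed, then reserve the slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._clock()
        self._next_allowed = now + self.interval


throttle = Throttle(per_hour=1000)
print(round(throttle.interval, 1))  # seconds between consecutive requests
```

In a sequential loop, `throttle.wait()` would be called before each HTTP request; the request itself is left out here.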


Developed by Anna Grid
