Status: Portfolio Showcase (Mock Logic) | Domain: E-Commerce, SEO, Data Analysis
Large e-commerce platforms often generate thousands of 404 errors, making manual analysis inefficient and error-prone. This tool automates the investigation process by combining multiple data sources and applying validation logic to detect patterns and classify issues at scale.
- Multi-source data enrichment:
  - Google Search Console (error reports)
  - Sitemap (XML/CSV parsing)
  - CMS database export
  - Live HTTP requests (status, title, images)
- URL structure analysis: Slug extraction, category and path parsing.
- Data validation layers: Presence in sitemap, existence in CMS, publication status (active/hidden), live HTTP response.
- Automated issue classification: Identifies deleted pages, hidden products, active items returning 404 (critical bugs), and inconsistencies between the CMS and the live site.
- Controlled processing: Implements rate-limited scraping (Cloudscraper) to avoid triggering anti-bot protections.
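
As an illustration of the URL structure analysis item above, here is a minimal sketch of slug, category, and path extraction using only the standard library. The function name `parse_url_structure`, the example domain, and the convention that the first path segment is the category and the last is the slug are assumptions for illustration; the actual URL scheme of the platform is not part of this repository.

```python
from urllib.parse import urlsplit, unquote


def parse_url_structure(url: str) -> dict:
    """Split a URL path into a normalized path, category, slug, and depth.

    Assumption (hypothetical convention): the first path segment is the
    category and the last segment is the product slug.
    """
    # Query strings and fragments are ignored; only the path is analyzed.
    path = unquote(urlsplit(url).path)
    # Drop empty segments produced by leading, trailing, or doubled slashes.
    segments = [s for s in path.split("/") if s]
    slug = segments[-1] if segments else ""
    # A single-segment path has no separate category under this convention.
    category = segments[0] if len(segments) > 1 else ""
    return {
        "path": "/" + "/".join(segments),
        "category": category,
        "slug": slug,
        "depth": len(segments),
    }


info = parse_url_structure(
    "https://shop.example.com/shoes/running/trail-runner-x/?ref=gsc"
)
print(info["category"], info["slug"], info["depth"])
```

Normalizing the path (no trailing slash, no query string) before comparison makes it easier to match the same URL across Search Console, the sitemap, and the CMS export.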
The tool generates a structured Excel report with enriched diagnostics, transforming raw signals into actionable SEO insights:
- “Active in CMS but returns 404” → Critical issue
- “Not in CMS and not in sitemap” → Likely deleted
- “Hidden in CMS” → Expected 404 behavior
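
The three rules above can be sketched as a small decision function. This is an illustration only, not the project's actual logic: the function name `classify_404`, its parameters, the status strings `"active"` and `"hidden"`, and the fallback label `"Needs manual review"` are all assumptions.

```python
from typing import Optional


def classify_404(
    in_cms: bool,
    cms_status: Optional[str],
    in_sitemap: bool,
    http_status: int,
) -> str:
    """Map validation flags for one URL to a diagnostic label."""
    # Active in the CMS but the live site returns 404: the page should exist.
    if in_cms and cms_status == "active" and http_status == 404:
        return "Critical issue"
    # Hidden in the CMS: a 404 is the expected behavior.
    if in_cms and cms_status == "hidden":
        return "Expected 404 behavior"
    # Unknown to both the CMS and the sitemap: probably removed.
    if not in_cms and not in_sitemap:
        return "Likely deleted"
    # Assumption: any other combination is left for a human to inspect.
    return "Needs manual review"


print(classify_404(True, "active", True, 404))
print(classify_404(False, None, False, 404))
```

In a Pandas workflow, a function like this could be applied row by row to the enriched table before the report is written.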
- Built with Python (Pandas, NumPy).
- Robust parsing of inconsistent CSV and XML inputs.
- Custom CMS data reconstruction (handling broken exports).
- Concurrent requests with an anti-blocking strategy designed for large datasets (~15k+ URLs).
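
To illustrate the XML parsing item in the list above, here is a minimal sketch that collects `<loc>` values from a sitemap whether or not the file declares the sitemap namespace. The function name `extract_sitemap_urls` and the sample document are hypothetical, and the sketch assumes the input is well-formed XML; recovering from malformed files would need extra handling that is not shown here.

```python
import xml.etree.ElementTree as ET


def extract_sitemap_urls(xml_text: str) -> set:
    """Return the set of URLs found in <loc> elements of a sitemap."""
    root = ET.fromstring(xml_text.strip())
    urls = set()
    for element in root.iter():
        # Namespaced tags look like "{namespace-uri}loc"; keep the local name.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "loc" and element.text and element.text.strip():
            urls.add(element.text.strip())
    return urls


sample = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://shop.example.com/shoes/trail-runner-x</loc></url>
  <url><loc> https://shop.example.com/bags/city-pack </loc></url>
  <url><loc></loc></url>
</urlset>
"""
print(sorted(extract_sitemap_urls(sample)))
```

Returning a set makes the "presence in sitemap" check a constant-time lookup per URL.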
Notes: Processing is intentionally rate-limited (≈1000 URLs/hour) to avoid being blocked by anti-bot protections. Sequential execution is required in notebook environments. Because of an NDA, this repository contains only sanitized core logic.
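
A rate of ≈1000 URLs/hour corresponds to one request every 3600 / 1000 = 3.6 seconds, so a run over ~15k URLs takes roughly 15 hours. The sketch below shows one simple way to enforce such a spacing between sequential requests. The `Throttle` class and its parameters are hypothetical and are not taken from the project's code; the clock and sleep functions are injectable only so the behavior can be checked without real waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, max_per_hour: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 3600.0 / max_per_hour  # seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous request was released

    def wait(self) -> float:
        """Sleep until the minimum interval has elapsed; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay


throttle = Throttle(max_per_hour=1000)
print(throttle.min_interval)
# Hypothetical usage: call throttle.wait() before each HTTP request in the loop.
```

A fixed interval is the simplest option; adding random jitter to each delay is a common variation when request timing should look less regular.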
Developed by Anna Grid