A Python web scraper that extracts content from reactome.org pages and saves them organized by their URL routes.
The scraper extracts two types of content elements:
- `<div class="item-page">` – Main content pages
- `<div class="leading-n" itemprop="blogpost">` – Blog post entries
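The extraction of these two element types can be sketched with BeautifulSoup (assumed to be the parser in `requirements.txt`; `extract_elements` is an illustrative helper, not the scraper's actual API):

```python
from bs4 import BeautifulSoup

def extract_elements(html: str) -> list[str]:
    """Return the outer HTML of each matching content element."""
    soup = BeautifulSoup(html, "html.parser")
    # Main content pages use <div class="item-page">; blog entries use
    # <div itemprop="blogpost"> (with a class like "leading-n").
    matches = soup.select('div.item-page, div[itemprop="blogpost"]')
    return [str(el) for el in matches]
```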
1. Create a virtual environment (recommended):

   ```
   python3 -m venv venv
   source venv/bin/activate   # On Linux/Mac
   # or
   .\venv\Scripts\activate    # On Windows
   ```

2. Install dependencies:

   ```
   pip install -r requirements.txt
   ```
Run the scraper with default settings:

```
python scraper.py
```

This will:
- Crawl reactome.org starting from predefined seed URLs
- Save extracted content to the `./scraped_pages/` directory
- Wait 1 second between requests (to be polite to the server)
```
python scraper.py [options]
```

Options:

```
-o, --output DIR       Output directory (default: scraped_pages)
-d, --delay SECONDS    Delay between requests (default: 1.0)
-m, --max-pages NUM    Maximum pages to scrape (default: unlimited)
-s, --seed-only        Only scrape seed URLs, don't crawl further
```

Scrape only the predefined seed URLs (faster, limited coverage):

```
python scraper.py --seed-only
```

Limit to 50 pages with a 2-second delay:

```
python scraper.py --max-pages 50 --delay 2.0
```

Specify a custom output directory:

```
python scraper.py --output ./my_scraped_data
```

The scraper organizes files based on their URL routes:
```
scraped_pages/
├── what-is-reactome/
│   └── item-page.html
├── about/
│   ├── news/
│   │   ├── item-page.html
│   │   ├── blogpost-1.html
│   │   └── blogpost-2.html
│   ├── team/
│   │   └── item-page.html
│   └── statistics/
│       └── item-page.html
├── documentation/
│   ├── item-page.html
│   └── faq/
│       └── item-page.html
└── ...
```
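The route-to-directory mapping above can be sketched with the standard library (`route_to_path` is illustrative, not the scraper's actual function):

```python
from pathlib import Path
from urllib.parse import urlparse

def route_to_path(url: str, output_dir: str = "scraped_pages") -> Path:
    """Map a page URL to the directory its content is saved under."""
    route = urlparse(url).path.strip("/")  # e.g. "about/news"
    return Path(output_dir) / route if route else Path(output_dir)
```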
Each saved HTML file includes comments with:
- Source URL
- Timestamp when scraped
The scraper logs its activity to both:
- Console output
- A `scraper.log` file
- The scraper respects a configurable delay between requests to avoid overloading the server
- URLs to API endpoints, download pages, and non-HTML resources are automatically skipped
- Only internal reactome.org links are followed
- Duplicate URLs are automatically handled
- Images are automatically downloaded and saved to `scraped_pages/images/<page-route>/`
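The crawling behavior described in these notes (duplicate handling via a visited set, a politeness delay between requests) can be sketched as follows. `fetch` and `extract_links` are injected so the sketch stays self-contained; the real scraper's internals may differ:

```python
import time
from collections import deque

def crawl(seeds, fetch, extract_links, delay=1.0, max_pages=None):
    """Breadth-first crawl with duplicate handling and a politeness delay."""
    visited, queue, pages = set(), deque(seeds), {}
    while queue and (max_pages is None or len(pages) < max_pages):
        url = queue.popleft()
        if url in visited:
            continue  # duplicate URLs are handled automatically
        visited.add(url)
        pages[url] = fetch(url)
        for link in extract_links(pages[url]):
            if link not in visited:
                queue.append(link)
        time.sleep(delay)  # be polite to the server
    return pages
```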
Edit the `get_seed_urls()` function in `scraper.py` to modify the starting URLs.
Modify the `is_valid_url()` method in the `ReactomeScraper` class to change which URLs are scraped or skipped.
Modify the `extract_content()` method to target different HTML elements.
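As an illustration of the kind of filtering `is_valid_url()` performs (the exact rules in `scraper.py` may differ; this standalone version and its skip list are assumptions):

```python
from urllib.parse import urlparse

SKIP_EXTENSIONS = {".pdf", ".zip", ".png", ".jpg", ".gz"}  # illustrative list

def is_valid_url(url: str) -> bool:
    """Follow only internal reactome.org HTML pages."""
    parsed = urlparse(url)
    if parsed.netloc not in ("reactome.org", "www.reactome.org"):
        return False  # external link
    path = parsed.path.lower()
    if "/api/" in path or "/download" in path:
        return False  # API endpoints and download pages
    return not any(path.endswith(ext) for ext in SKIP_EXTENSIONS)
```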
Converts scraped HTML files to MDX format with frontmatter metadata.
```
python3 convert_to_mdx.py [options]
```
Options:

```
-i, --input DIR    Input directory containing HTML files (default: scraped_pages)
-o, --output DIR   Output directory for MDX files (default: mdx_pages)
-v, --verbose      Enable verbose logging
```

The converter:
- Extracts the title, category (route), and body content
- For article pages (news, spotlight): extracts author, date, and tags
- Converts HTML to clean Markdown using html2text
- Copies images from `scraped_pages/images/` to `mdx_pages/images/`
- Updates image paths to work with the MDX output structure
Each MDX file includes YAML frontmatter:
```yaml
---
title: "Page Title"
category: "about/news"
author: "Author Name"      # For articles only
date: "2024-01-15"         # For articles only
tags: ["news", "release"]  # For articles only
---
```

The output structure:

```
mdx_pages/
├── images/
│   ├── about/
│   │   └── news/
│   │       └── image.png
│   └── userguide/
│       └── diagram.png
├── about/
│   └── news/
│       └── item-page.mdx
├── userguide/
│   └── item-page.mdx
└── ...
```
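Assembling such a file can be sketched with plain string formatting (a simplified stand-in for what `convert_to_mdx.py` does; field names follow the frontmatter example above):

```python
def build_mdx(title, category, body_md, author=None, date=None, tags=None):
    """Render YAML frontmatter followed by the Markdown body."""
    lines = ["---", f'title: "{title}"', f'category: "{category}"']
    if author:
        lines.append(f'author: "{author}"')
    if date:
        lines.append(f'date: "{date}"')
    if tags:
        lines.append("tags: [" + ", ".join(f'"{t}"' for t in tags) + "]")
    lines.append("---")
    return "\n".join(lines) + "\n\n" + body_md
```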
Reorganizes mdx_pages to match a desired navigation structure by moving folders to new locations.
```
python3 reorganize_pages.py
```

Edit the `MOVES` list in `reorganize_pages.py` to define source and destination paths:

```python
MOVES = [
    ("what-is-reactome", "about/what-is-reactome"),
    ("userguide", "documentation/userguide"),
    # Add more moves as needed
]
```

The script:
- Moves page directories to new locations
- Automatically moves corresponding images from `mdx_pages/images/`
- Updates image paths within MDX files after reorganization
- Skips a move if the source doesn't exist or the destination already exists
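The move logic can be sketched like this (paths and skip rules as described above; the actual script may differ in details):

```python
import shutil
from pathlib import Path

def apply_moves(moves, root="mdx_pages"):
    """Move page directories, skipping missing sources and existing destinations."""
    root = Path(root)
    for src_rel, dst_rel in moves:
        src, dst = root / src_rel, root / dst_rel
        if not src.exists() or dst.exists():
            continue  # skip: source missing or destination already taken
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dst))
        # Corresponding images move the same way
        img_src, img_dst = root / "images" / src_rel, root / "images" / dst_rel
        if img_src.exists() and not img_dst.exists():
            img_dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(img_src), str(img_dst))
```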
Updates the category field in MDX frontmatter to match the file's actual location after reorganization.
```
python3 fix_categories.py
```

After running `reorganize_pages.py`, the `category` field in frontmatter may not match the new file location. This script:
- Scans all `.mdx` files in `mdx_pages/`
- Compares the `category` value with the actual file path
- Updates the `category` to match the current location
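The check itself is a simple path comparison; a sketch of the per-file fix (the regex-based frontmatter handling here is an assumption, the script may parse YAML differently):

```python
import re
from pathlib import Path

def fix_category(mdx_path: Path, root: Path) -> str:
    """Rewrite the category field to the file's directory relative to root."""
    actual = mdx_path.parent.relative_to(root).as_posix()
    text = mdx_path.read_text(encoding="utf-8")
    fixed = re.sub(r'^category: ".*"$', f'category: "{actual}"',
                   text, count=1, flags=re.M)
    mdx_path.write_text(fixed, encoding="utf-8")
    return actual
```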
Simplifies the directory structure by moving item-page.mdx files up one level and renaming them.
```
python3 flatten_folders.py
```

Converts structures like:

```
mdx_pages/
└── about/
    └── news/
        └── item-page.mdx
```

to:

```
mdx_pages/
└── about/
    └── news.mdx
```
This creates cleaner URLs and simpler file organization for static site generators.
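A sketch of the flattening step (the real script may handle name collisions and root-level files differently):

```python
from pathlib import Path

def flatten(root: Path) -> list[Path]:
    """Move each item-page.mdx up one level, renamed to its folder's name."""
    moved = []
    for item in sorted(root.rglob("item-page.mdx")):
        target = item.parent.with_suffix(".mdx")  # news/ -> news.mdx
        if target.exists():
            continue  # don't clobber an existing file
        item.rename(target)
        if not any(item.parent.iterdir()):
            item.parent.rmdir()  # remove the now-empty folder
        moved.append(target)
    return moved
```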
Fixes broken image references in scraped HTML files caused by the scraper's deduplication logic.
```
python3 fix_image_paths.py
```

The scraper has a bug: when the same image URL is downloaded by multiple pages, each page's HTML ends up with a different relative path, but the image is only saved at the first location encountered. This script:
- Builds an index of all existing images by filename
- Scans HTML files for image references
- Identifies references pointing to non-existent paths
- Copies images from their actual location to expected locations
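The repair steps above can be sketched roughly as follows (a filename-keyed index, which assumes filenames are unique enough across routes; `repair_image_refs` is illustrative, not the script's actual API):

```python
import shutil
from pathlib import Path

def repair_image_refs(images_root: Path, referenced: list[Path]) -> int:
    """Copy images to any referenced path that doesn't exist yet."""
    # 1. Index every existing image by its filename
    index = {p.name: p for p in images_root.rglob("*") if p.is_file()}
    copied = 0
    # 2-4. For each reference to a missing path, copy from the indexed location
    for ref in referenced:
        if not ref.exists() and ref.name in index:
            ref.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(index[ref.name], ref)
            copied += 1
    return copied
```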
Fixes image paths in MDX files to match the actual image locations after flattening and reorganization.
```
python3 fix_mdx_image_paths.py
```

The script:
- Fixes malformed image references (trailing `>` characters)
- Updates paths for flattened images (e.g., `folder/subfolder/image.png` → `folder/subfolder.png`)
- Resolves general path mismatches by looking up images by filename
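The three fixes can be sketched for a single reference as follows (`fix_image_ref` is a hypothetical standalone version; the script operates on whole MDX files and may order its checks differently):

```python
def fix_image_ref(path: str, existing: set[str]) -> str:
    """Apply the three fixes described above to one image path."""
    path = path.rstrip(">")  # drop malformed trailing '>' characters
    if path in existing:
        return path
    # Flattened pages: folder/subfolder/image.png -> folder/subfolder.png
    parts = path.rsplit("/", 1)
    if len(parts) == 2:
        flattened = parts[0] + "." + parts[1].rsplit(".", 1)[-1]
        if flattened in existing:
            return flattened
    # Fall back to a lookup by filename anywhere in the tree
    name = path.rsplit("/", 1)[-1]
    for candidate in existing:
        if candidate.rsplit("/", 1)[-1] == name:
            return candidate
    return path
```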
Renames remaining item-page.mdx files to index.mdx for cleaner URLs.
```
python3 rename_to_index.py
```

Converts structures like:

```
mdx_pages/
└── documentation/
    └── item-page.mdx
```

to:

```
mdx_pages/
└── documentation/
    └── index.mdx
```
To scrape and convert the Reactome website to MDX:
```
# Run all steps automatically
./run_all.sh
```

Or run each step individually:

```
# 1. Scrape pages and images from reactome.org
python3 scraper.py

# 2. Fix image path issues in scraped HTML files
python3 fix_image_paths.py

# 3. Convert HTML to MDX format
python3 convert_to_mdx.py

# 4. Reorganize pages to match desired nav structure
python3 reorganize_pages.py

# 5. Flatten folder structure (rename item-page.mdx to parent folder name)
python3 flatten_folders.py

# 6. Fix category metadata in frontmatter to match new locations
python3 fix_categories.py

# 7. Fix any remaining image path issues in MDX files
python3 fix_mdx_image_paths.py

# 8. Rename remaining item-page.mdx to index.mdx
python3 rename_to_index.py
```

| Script | Purpose |
|---|---|
| `scraper.py` | Crawls reactome.org and saves HTML pages + images |
| `fix_image_paths.py` | Fixes missing image references in scraped HTML |
| `convert_to_mdx.py` | Converts HTML to MDX with frontmatter metadata |
| `reorganize_pages.py` | Moves pages to match navigation structure |
| `flatten_folders.py` | Renames item-page.mdx files to parent folder names |
| `fix_categories.py` | Updates category frontmatter to match file locations |
| `fix_mdx_image_paths.py` | Fixes image paths in MDX after reorganization |
| `rename_to_index.py` | Renames remaining item-page.mdx to index.mdx |
After running all scripts, you'll have:
- `scraped_pages/` – Original HTML files and images
- `mdx_pages/` – Converted MDX files with proper structure and working image paths