CrawlerScope

CrawlerScope is a public crawler intelligence dataset and static GitHub Pages dashboard for crawler, AI bot, SEO bot, monitoring probe, and scanner infrastructure.

The project aggregates operator-published IP ranges, normalizes them into CIDR prefixes, tracks source provenance, and publishes operational exports for gateways, analytics pipelines, SIEM enrichment, bot management, and infrastructure visibility.

Live Dashboard

https://ipanalytics.github.io/CrawlerScope/

Repository:

https://github.com/ipanalytics/CrawlerScope

Overview

Crawler infrastructure is fragmented across vendor JSON feeds, documentation pages, robots specifications, and unofficial community-maintained lists.

CrawlerScope consolidates those sources into a normalized operational dataset with:

CIDR normalization
source attribution
operator metadata
category classification
service labeling
export tooling

The repository is designed for direct machine consumption and lightweight browser-based inspection.
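The CIDR normalization step can be illustrated with Python's standard `ipaddress` module. This is a minimal sketch, not the project's actual code: the function name `normalize_prefixes` is hypothetical, and merging adjacent prefixes is an assumption about what normalization involves here. The sample addresses come from the reserved documentation ranges.

```python
import ipaddress


def normalize_prefixes(raw_entries):
    """Parse raw IP/CIDR strings and return sorted, collapsed CIDR prefixes."""
    v4, v6 = [], []
    for entry in raw_entries:
        text = entry.strip()
        if not text:
            continue
        # strict=False masks host bits, so "2001:db8::1/32" becomes 2001:db8::/32.
        # A bare address such as "198.51.100.7" becomes a /32 (or /128) prefix.
        network = ipaddress.ip_network(text, strict=False)
        (v4 if network.version == 4 else v6).append(network)
    # collapse_addresses rejects mixed IP versions, so each family is merged separately.
    collapsed = list(ipaddress.collapse_addresses(v4))
    collapsed += list(ipaddress.collapse_addresses(v6))
    return [str(network) for network in collapsed]


if __name__ == "__main__":
    sample = ["192.0.2.0/25", "192.0.2.128/25", "198.51.100.7", "2001:db8::1/32"]
    print(normalize_prefixes(sample))
```

The two adjacent /25 ranges in the sample merge into a single /24, which keeps downstream allow and block lists shorter.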

Current Coverage

Search Crawlers

Googlebot
Bingbot
DuckDuckGo
Applebot
YandexBot
Baiduspider

AI Crawlers and Fetchers

OpenAI
Anthropic
Perplexity
Meta
Amazonbot
Bytespider

SEO Crawlers

AhrefsBot
SemrushBot

Security Scanners

Shodan
Censys

Monitoring Probes

Datadog Synthetics
Pingdom
UptimeRobot
Better Stack
StatusCake

Archive and Social Crawlers

Common Crawl
Pinterestbot
LinkedInBot

Source Trust Model

CrawlerScope separates datasets by source quality and publication model.

Source Type	Description
`official_json`	Operator-published structured JSON
`official_text`	Operator-published text-based CIDR lists
`documented_user_agent`	Publicly documented crawler identity without authoritative IP feed
`known_static`	Operationally useful static ranges with limited authority guarantees

This distinction is preserved in exports and dashboard filters.
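A consumer that only trusts operator-published feeds could filter on the source type before using the data. The sketch below assumes each record carries its source type under a `source_type` key; that field name is hypothetical, since this README does not show the record schema beyond `category` and `prefix`.

```python
# Source types that the table above describes as operator-published.
OPERATOR_PUBLISHED = frozenset({"official_json", "official_text"})


def filter_by_source_type(records, allowed=OPERATOR_PUBLISHED, field="source_type"):
    """Keep only records whose source type is in `allowed`.

    `field` is an assumed key name; adjust it to the real dataset schema.
    """
    return [record for record in records if record.get(field) in allowed]


if __name__ == "__main__":
    sample = [
        {"prefix": "192.0.2.0/24", "source_type": "official_json"},
        {"prefix": "198.51.100.0/24", "source_type": "known_static"},
        {"prefix": "203.0.113.0/24", "source_type": "official_text"},
    ]
    for record in filter_by_source_type(sample):
        print(record["prefix"])
```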

Dashboard Features

Feature	Description
Interactive map	Country-level operator distribution
Category analytics	Operator/category mix charts
Cascading filters	Filter by category, operator, source type, or service
Full-text search	Search across operators, tags, URLs, and user-agents
Export generation	JSON, CSV, CIDR text, robots.txt, NGINX maps
Presets	AI crawlers, monitoring probes, official feeds
Service table	Sortable infrastructure inventory
Clipboard export	Copy filtered CIDR selections

Architecture

              Public Sources
                     │
      ┌──────────────┼──────────────┐
      │              │              │
      ▼              ▼              ▼
 Vendor JSON     Documentation    Static Lists
      │              │              │
      └──────────────┴──────┬───────┘
                             ▼
                    Normalization Layer
                    CIDR + metadata merge
                             ▼
                    Classification Engine
                 category / tags / source type
                             ▼
                      Export Pipeline
             JSON / CSV / robots / nginx
                             ▼
                     Static Dashboard

Published Outputs

File	Description
`data/current/crawlers.json`	Full normalized crawler dataset
`data/current/robots-ai.txt`	robots.txt snippets for AI crawlers
`data/current/nginx-ai-map.conf`	NGINX user-agent mapping
`data/history/summary.csv`	Historical build metrics
`data/snapshots/*.json`	Compact snapshot summaries

Export Examples

Download current dataset

curl -fsSLO \
  https://raw.githubusercontent.com/ipanalytics/CrawlerScope/main/data/current/crawlers.json

Extract AI crawler ranges

jq -r '
  .records[]
  | select(.category=="ai-crawler")
  | .prefix
' crawlers.json
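For pipelines that do not have `jq` available, roughly the same extraction can be done in Python. The sketch relies only on the keys used in the `jq` filter above (`records`, `category`, `prefix`); the helper names and the sample records are illustrative.

```python
import json


def load_dataset(path):
    """Load a downloaded crawlers.json file from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def prefixes_for_category(dataset, category):
    """Return the prefix of every record in the given category."""
    return [
        record["prefix"]
        for record in dataset["records"]
        if record.get("category") == category
    ]


if __name__ == "__main__":
    # In-memory sample so the sketch runs without the real file.
    # "other" is a placeholder category, not a value taken from the dataset.
    sample = {
        "records": [
            {"category": "ai-crawler", "prefix": "192.0.2.0/24"},
            {"category": "other", "prefix": "198.51.100.0/24"},
        ]
    }
    print("\n".join(prefixes_for_category(sample, "ai-crawler")))
```

With the real file, `prefixes_for_category(load_dataset("crawlers.json"), "ai-crawler")` would take the place of the in-memory sample.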

Generate robots rules

curl -fsSL \
  https://raw.githubusercontent.com/ipanalytics/CrawlerScope/main/data/current/robots-ai.txt
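The exact contents of `robots-ai.txt` are not shown in this README. As a rough illustration of what a robots.txt snippet for a set of crawler tokens looks like, the sketch below renders one `User-agent` / `Disallow` group per token. The function name is hypothetical, and `GPTBot` and `ClaudeBot` are used only as example tokens.

```python
def render_robots_rules(user_agents, disallow="/"):
    """Render one robots.txt group per user-agent token, separated by blank lines."""
    groups = [f"User-agent: {agent}\nDisallow: {disallow}\n" for agent in user_agents]
    return "\n".join(groups)


if __name__ == "__main__":
    print(render_robots_rules(["GPTBot", "ClaudeBot"]), end="")
```

Note that robots.txt is advisory: it only affects crawlers that choose to honor it.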

Use exported NGINX map

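# The map directive is only valid in the http context, so include the exported
# file there. This assumes the file defines $is_ai_crawler through a map block.
include /etc/nginx/nginx-ai-map.conf;

# Inside a server or location block, reject requests the map flags.
if ($is_ai_crawler = 1) {
    return 403;
}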
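The layout of the exported map file is not documented here. For readers unfamiliar with NGINX maps, the sketch below shows how a user-agent map that sets `$is_ai_crawler` could be generated. Everything in it is an assumption for illustration: the function name, the case-insensitive regex matching, and the example tokens.

```python
import re

# Restrict tokens to characters that are safe inside a quoted NGINX regex.
_SAFE_TOKEN = re.compile(r"^[A-Za-z0-9_-]+$")


def render_nginx_map(user_agents, variable="is_ai_crawler"):
    """Render an NGINX map block that sets $<variable> to 1 for matching user agents."""
    lines = [f"map $http_user_agent ${variable} {{", "    default 0;"]
    for agent in user_agents:
        if not _SAFE_TOKEN.match(agent):
            raise ValueError(f"unsupported user-agent token: {agent!r}")
        # "~*" makes NGINX treat the pattern as a case-insensitive regex.
        lines.append(f'    "~*{agent}" 1;')
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_nginx_map(["GPTBot", "ClaudeBot"]), end="")
```

Because the match is on the `User-Agent` header alone, a rule like this only catches clients that identify themselves honestly.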

Repository Layout

CrawlerScope/
├── .github/
│   └── workflows/
├── data/
│   ├── current/
│   ├── history/
│   └── snapshots/
├── public/
│   ├── assets/
│   └── index.html
├── scripts/
├── LICENSE
└── README.md

Generated site/ artifacts are intentionally excluded from version control.

Local Development

Update datasets

python3 scripts/update.py

Local preview

rm -rf site

cp -R public site
cp -R data site/data

python3 -m http.server 8080 --directory site

Open:

http://127.0.0.1:8080/

GitHub Pages Deployment

CrawlerScope is deployed through GitHub Actions.

Workflow:

.github/workflows/crawler-scope.yml

Pages configuration:

Source: GitHub Actions
Branch deployment is not required
Generated assets are published from workflow artifacts

Update Schedule

Default refresh interval (every six hours, at minute 23; GitHub Actions evaluates cron schedules in UTC):

schedule:
  - cron: "23 */6 * * *"

Most upstream crawler sources update daily or less frequently, so sub-hour refresh intervals generally provide limited value.
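To make the cron expression concrete, the sketch below lists the next few times that `23 */6 * * *` matches. It is hand-rolled for this one pattern (a fixed minute on every sixth hour) and is not a general cron parser; the function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def next_runs(after, count=4, minute=23, hour_step=6):
    """Return the next `count` times strictly after `after` at `minute` past every `hour_step`-th hour."""
    candidate = after.replace(minute=minute, second=0, microsecond=0)
    runs = []
    while len(runs) < count:
        # "*/6" in the hour field matches hours 0, 6, 12, and 18.
        if candidate > after and candidate.hour % hour_step == 0:
            runs.append(candidate)
        candidate += timedelta(hours=1)
    return runs


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 5, 0, tzinfo=timezone.utc)
    for run in next_runs(start):
        print(run.isoformat())
```

The schedule therefore yields four builds per day, at 00:23, 06:23, 12:23, and 18:23 UTC.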

Operational Notes

IP inventories are only as complete as upstream disclosures
User-Agent strings are trivially spoofable
Some operators publish crawler identities without stable IP feeds
Static/public ranges should be treated as operational hints, not authoritative truth
Multiple services may legitimately share infrastructure prefixes
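Because User-Agent strings can be spoofed, a common complement is to check the client IP against the published prefixes. The following is a minimal sketch using the standard `ipaddress` module; the function names are hypothetical, and a linear scan is used for clarity rather than speed.

```python
import ipaddress


def build_index(prefixes):
    """Parse CIDR strings into network objects once, ahead of lookups."""
    return [ipaddress.ip_network(prefix) for prefix in prefixes]


def matches_any(client_ip, networks):
    """Return True if the client IP falls inside any of the given networks."""
    address = ipaddress.ip_address(client_ip)
    return any(
        address.version == network.version and address in network
        for network in networks
    )


if __name__ == "__main__":
    # Documentation ranges stand in for real crawler prefixes.
    networks = build_index(["192.0.2.0/24", "2001:db8::/32"])
    for ip in ("192.0.2.55", "203.0.113.9", "2001:db8::1"):
        print(ip, matches_any(ip, networks))
```

A request that claims a crawler User-Agent but fails this check may be spoofed, though a miss can also mean the operator's published ranges are incomplete, as noted above.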

Use Cases

Domain	Example
Bot Management	AI crawler detection and filtering
SIEM Enrichment	Infrastructure attribution
Analytics	Search and crawler traffic classification
WAF Pipelines	Allow/block automation logic
SEO Monitoring	Search crawler visibility
Threat Hunting	Scanner infrastructure correlation

Roadmap

Planned additions:

ASN-level crawler attribution
Historical prefix diffing
Provider overlap analysis
Signed dataset releases
Compressed bulk exports
Additional crawler verification metadata

License

Licensed under CC0-1.0.

See LICENSE.

Disclaimer

CrawlerScope aggregates publicly available infrastructure information for operational and analytical use. Consumers are responsible for validating suitability within their own environments.
