A Python tool for scraping Google Maps local services data and enriching the results with lead-scoring signals.
Two entry points, fully independent:
| Script | Purpose |
|---|---|
mapScraperX.py |
Original scraper CLI β unchanged, backward-compatible |
main.py |
New pipeline CLI with scrape / enrich / full modes |
- Place ID, Maps URL, business name, category, full address
- Phone (local + international format)
- Website domain + URL
- GPS coordinates
- Average star rating + review count
- Concurrent async processing, configurable language / country
- Feature engineering from existing CSV columns
- Lightweight website scraping (contact page, service keywords, modern-stack detection)
- Interpretable lead scoring (0β100) with segment labels
- Works on any previously generated CSV β scraper never has to re-run
- Python 3.10+
- pip
git clone https://github.com/christivn/mapScraper.git
cd mapScraper
pip install -r requirements.txtDependencies: aiohttp, tqdm, pandas, beautifulsoup4
Everything here works exactly as before.
# Single query
python mapScraperX.py "restaurants in Miami" --limit 50
# Multiple queries from file
python mapScraperX.py --queries-file query_example.txt
# With concurrency and custom output
python mapScraperX.py --queries-file query_example.txt \
--lang en --country us --limit 25 \
--output-file data/custom.csv --concurrent 5| Option | Default | Description |
|---|---|---|
query |
β | Single search query |
--queries-file FILE |
β | File with one query per line |
--lang CODE |
en |
Language code |
--country CODE |
us |
Country code |
--limit N |
no limit | Max results (total / per query) |
--output-file PATH |
data/output.csv |
Output CSV path |
--concurrent N |
3 |
Max concurrent queries (3β5 recommended) |
| Mode | What it does |
|---|---|
scrape |
Google Maps scraping only β identical output to mapScraperX.py |
enrich |
Load an existing CSV β add features + lead scores β save enriched CSV |
full |
Scrape first, then enrich the result |
# Scrape (same as mapScraperX.py)
python main.py --mode scrape "marketing agencies in New York" --limit 50
python main.py --mode scrape --queries-file query_example.txt
# Enrich a previously generated CSV
python main.py --mode enrich --input data/output.csv
# Enrich without fetching websites (faster, offline-safe)
python main.py --mode enrich --input data/output.csv --no-web-scraping
# Full pipeline in one command
python main.py --mode full \
--queries-file query_example.txt \
--output-file data/leads.csv
# Verbose debug output
python main.py --mode enrich --input data/output.csv --log-level DEBUG--mode {scrape,enrich,full} Pipeline mode (default: scrape)
query Single search query
--queries-file FILE File with one query per line
--lang CODE Language code (default: en)
--country CODE Country code (default: us)
--limit N Max results per query
--output-file PATH Raw scrape output (default: data/output.csv)
--concurrent N Concurrent scraper tasks (default: 3)
--input PATH Input CSV for enrich mode
--no-web-scraping Skip website fetching during enrichment
--web-concurrent N Concurrent website fetch tasks (default: 10)
--web-batch-size N Rows per website-scraping batch (default: 100)
--web-timeout SEC Per-request timeout for websites (default: 10)
--log-level {DEBUG,INFO,...} Logging verbosity (default: INFO)
| Column | Description | Example |
|---|---|---|
id |
Google Place ID | ChIJN1t_tDeuEmsRUsoyG83frY4 |
url_place |
Google Maps link | https://www.google.com/maps/place/?q=place_id:... |
title |
Business name | Joe's Pizza |
category |
Business category | Pizza restaurant |
address |
Full address | 123 Main St, New York, NY 10001 |
phoneNumber |
Local phone | (555) 123-4567 |
completePhoneNumber |
International phone | +1 555-123-4567 |
domain |
Website domain | joespizza.com |
url |
Full website URL | https://www.joespizza.com |
coor |
Coordinates | 40.7128,-74.0060 |
stars |
Average rating | 4.5 |
reviews |
Review count | 234 |
source_query |
Original query | pizza in New York |
| Column | Type | Description |
|---|---|---|
has_phone |
bool | Phone number present |
has_website |
bool | Website domain or URL present |
domain_valid |
bool | Domain passes basic format validation |
rating_score |
float | stars Γ log(reviews+1) β penalises high ratings with few reviews |
review_density |
float | Normalised review count within the batch (0β1) |
web_has_contact |
bool | Contact page / section detected on the website |
web_has_services |
bool | Services / products section detected |
web_keywords |
str | Top 10 content keywords (comma-separated) |
web_is_modern |
bool | Modern JS framework detected (React, Vue, Next.js, β¦) |
web_scraped |
bool | Whether the website was reachable and scraped |
score |
float | Lead score 0β100 (see breakdown below) |
segment |
str | micro / small / medium / large |
| Signal | Max pts | Thresholds |
|---|---|---|
| Review count | 30 | β₯500β30, β₯200β24, β₯100β18, β₯50β12, β₯10β7, β₯1β3 |
| Star rating | 25 | β₯4.5β25, β₯4.0β20, β₯3.5β15, β₯3.0β10, >0β5 |
| Website presence | 30 | has_website+10, domain_valid+5, web_has_contact+5, web_has_services+5, web_is_modern+5 |
| Phone | 15 | has_phoneβ15 |
| Segment | Score range |
|---|---|
| micro | 0β24 |
| small | 25β49 |
| medium | 50β74 |
| large | 75β100 |
mapScraper/
βββ mapScraperX.py original scraper CLI (unchanged)
βββ main.py new pipeline CLI
βββ requirements.txt
βββ mapScraper/
β βββ placesCrawlerV2.py async Google Maps scraper (deduplicates by id on save)
βββ enrichment/
β βββ features.py feature engineering from CSV columns
β βββ web_scraper.py async website signal extraction
β βββ scoring.py lead scoring (0β100) + segmentation
βββ pipeline/
β βββ orchestrator.py run_pipeline() β wires scrape & enrich
βββ data/
βββ output.csv scrape output (example)
- Scraper and enrichment are fully independent β enrichment never imports scraper logic and vice versa.
- Scraper output schema is frozen β the 13-column CSV is never modified.
- Enriched output is a superset of the raw CSV β every original column is preserved.
- Website scraping is fault-tolerant β any domain that fails returns empty signals without crashing the pipeline.
- Enrichment accepts any previously generated CSV as input β no need to re-scrape.
- Duplicates are removed by
idat save time β first occurrence wins.
| Code | Language | Code | Country |
|---|---|---|---|
en |
English | us |
United States |
es |
Spanish | es |
Spain |
fr |
French | fr |
France |
de |
German | de |
Germany |
it |
Italian | it |
Italy |
pt |
Portuguese | br |
Brazil |
ja |
Japanese | jp |
Japan |
ko |
Korean | kr |
South Korea |
zh |
Chinese | cn |
China |
Google shut down the /localservices/prolist endpoint (HTTP 410).
The scraper now uses a two-step approach:
GET https://www.google.com/maps/search/{query}β extracts a canonicalpb=URL from the Maps SPA page.GET https://www.google.com/search?tbm=map&β¦&pb=β¦β parses the)]}'-prefixed JSON atdata[64].
requests-html / pyppeteer removed; only aiohttp and tqdm needed for scraping.
Empty results / "Could not find pb= search URL"
Google may be serving a consent wall. Try matching --lang and --country to your locale.
"data[64] is missing"
Google may have changed the response structure again. Run with --log-level DEBUG and open an issue.
Enriched CSV has empty web_* columns
The domain may be unreachable. Check web_scraped column β False means the fetch failed silently (expected behaviour). Use --web-timeout 20 for slow sites.
Large CSVs are slow to enrich
Web scraping is the bottleneck. Increase --web-concurrent 20 or skip it entirely with --no-web-scraping.
Provided as-is for educational and research purposes. Please respect Google's Terms of Service.
