
Commit 116c1a1

initial work to support crawler api
1 parent 4a353f5 commit 116c1a1


43 files changed: 14,835 additions and 13 deletions

.env.example

Lines changed: 8 additions & 0 deletions

@@ -0,0 +1,8 @@
# Scrapfly API Configuration
# Copy this file to .env and fill in your actual values

# Your Scrapfly API key
SCRAPFLY_KEY=scp-live-your-api-key-here

# Scrapfly API host (optional, defaults to production)
SCRAPFLY_API_HOST=https://api.scrapfly.io
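
A minimal sketch of how these two values might be passed to the SDK; reading them explicitly with `os.environ` is an assumption about how a script would consume this file, and the `host=` keyword on `ScrapflyClient` should be checked against the SDK constructor before relying on it:

```python
import os

from scrapfly import ScrapflyClient

# Values defined in .env.example; assumes they have been exported or loaded into the environment.
key = os.environ["SCRAPFLY_KEY"]  # required
host = os.environ.get("SCRAPFLY_API_HOST", "https://api.scrapfly.io")  # optional override

# `host=` is assumed here; omitting it keeps the production default.
client = ScrapflyClient(key=key, host=host)
```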

.gitignore

Lines changed: 3 additions & 1 deletion

@@ -6,4 +6,6 @@ scrapfly_sdk.egg-info
 venv
 examples/scrapy/demo/images
 examples/scrapy/demo/*.csv
-!examples/scrapy/demo/images/.gitkeep
+!examples/scrapy/demo/images/.gitkeep
+/tests/crawler/*.gz
+.env

examples/crawler/.env.example

Lines changed: 10 additions & 0 deletions

@@ -0,0 +1,10 @@
# Scrapfly API Configuration
# Get your API key from: https://scrapfly.io/dashboard

# Required: Your Scrapfly API key
SCRAPFLY_API_KEY=scp-live-your-key-here

# Usage:
# 1. Copy this file to .env
# 2. Replace 'scp-live-your-key-here' with your actual API key
# 3. The examples will automatically load your API key from the .env file

examples/crawler/README.md

Lines changed: 271 additions & 0 deletions

@@ -0,0 +1,271 @@
# Scrapfly Crawler API Examples

This directory contains examples demonstrating the Scrapfly Crawler API integration.

## Setup

### Get Your API Key

Get your API key from [https://scrapfly.io/dashboard](https://scrapfly.io/dashboard)

### Configure Your API Key

You have **two options** to provide your API key:

#### Option A: Environment Variable (Recommended)

Export the API key in your terminal:

```bash
export SCRAPFLY_API_KEY='scp-live-your-key-here'
```

Then run any example:

```bash
python3 sync_crawl.py
```

#### Option B: .env File

1. Copy the example .env file:

   ```bash
   cp .env.example .env
   ```

2. Edit `.env` and replace the placeholder with your actual API key:

   ```
   SCRAPFLY_API_KEY=scp-live-your-actual-key-here
   ```

3. Run any example (the .env file will be loaded automatically):

   ```bash
   python3 sync_crawl.py
   ```

> **Note:** Install `python-dotenv` for automatic .env file loading: `pip install python-dotenv`
>
> If you don't install it, the examples will still work with environment variables exported in your shell.
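
The loading described in the note above can be sketched as the usual optional-dotenv pattern; the exact code inside each example script may differ, so treat this as an illustration:

```python
import os

# Load .env if python-dotenv is available; otherwise fall back to variables exported in the shell.
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass

api_key = os.environ.get("SCRAPFLY_API_KEY")
if not api_key:
    raise SystemExit("SCRAPFLY_API_KEY environment variable not set")
```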
## Quick Start

The easiest way to use the Crawler API is with the high-level `Crawl` object:

```python
from scrapfly import ScrapflyClient, CrawlerConfig, Crawl

client = ScrapflyClient(key='your-key')

# Method chaining for concise usage
crawl = Crawl(
    client,
    CrawlerConfig(
        url='https://example.com',
        page_limit=10,
        max_depth=2
    )
).crawl().wait()

# Get results
pages = crawl.warc().get_pages()
for page in pages:
    print(f"{page['url']}: {page['status_code']}")
```

## Examples

### High-Level API (Recommended)

- **[test_simple_crawl.py](test_simple_crawl.py)** - Simplest example using method chaining
- **[test_crawl_object.py](test_crawl_object.py)** - Complete demonstration of Crawl object features
- **[demo_markdown.py](demo_markdown.py)** - Retrieve markdown-formatted content
- **[demo_read_iter.py](demo_read_iter.py)** - Pattern-based URL iteration with wildcards
- **[demo_har.py](demo_har.py)** - HAR format with timing analysis

### Low-Level API

- **[test_simple.py](test_simple.py)** - Basic workflow: start, check status
- **[test_complete_workflow.py](test_complete_workflow.py)** - Full workflow: start, poll, download artifact
- **[test_crawl.py](test_crawl.py)** - Original comprehensive test

## Crawl Object Features

The `Crawl` object provides a stateful, high-level interface:

### Methods

- **`crawl()`** - Start the crawler job
- **`wait(poll_interval=5, max_wait=None, verbose=False)`** - Wait for completion
- **`status(refresh=True)`** - Get current status
- **`warc(artifact_type='warc')`** - Download WARC artifact
- **`har()`** - Download HAR (HTTP Archive) artifact with timing data
- **`read(url, format='html')`** - Get content for a specific URL
- **`read_iter(pattern, format='html')`** - Iterate through URLs matching a wildcard pattern (see the sketch below)
- **`stats()`** - Get comprehensive statistics
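
`read_iter()` is the only method above without an inline example in this README; a hypothetical sketch, assuming the wildcard pattern and `format` argument shown in its signature (the pattern is illustrative, and the shape of each yielded item is not documented here, so see [demo_read_iter.py](demo_read_iter.py) for real usage):

```python
# Hypothetical: iterate over crawled pages whose URL matches a wildcard pattern.
for item in crawl.read_iter('https://example.com/products/*', format='html'):
    print(item)  # the structure of each yielded item is an assumption; check demo_read_iter.py
```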

### Properties

- **`uuid`** - Crawler job UUID
- **`started`** - Whether crawler has been started
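
A small illustration using only the two properties listed above:

```python
# `started` guards against starting the same job twice; `uuid` identifies the job for later lookups.
if not crawl.started:
    crawl.crawl()
print(f"Crawler job UUID: {crawl.uuid}")
```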

### Usage Patterns

#### 1. Method Chaining (Most Concise)

```python
crawl = Crawl(client, config).crawl().wait()
pages = crawl.warc().get_pages()
```

#### 2. Step-by-Step (More Control)

```python
crawl = Crawl(client, config)
crawl.crawl()
crawl.wait(verbose=True, max_wait=300)

# Check status
status = crawl.status()
print(f"Crawled {status.urls_crawled} URLs")

# Get results
artifact = crawl.warc()
pages = artifact.get_pages()
```

#### 3. Read Specific URLs

```python
# Get content for a specific URL
html = crawl.read('https://example.com/page1')
if html:
    print(html.decode('utf-8'))
```

#### 4. Statistics

```python
stats = crawl.stats()
print(f"URLs discovered: {stats['urls_discovered']}")
print(f"URLs crawled: {stats['urls_crawled']}")
print(f"Crawl rate: {stats['crawl_rate']:.1f}%")
print(f"Total size: {stats['total_size_kb']:.2f} KB")
```

## Configuration Options

The `CrawlerConfig` class supports all crawler parameters:

```python
config = CrawlerConfig(
    url='https://example.com',
    page_limit=100,
    max_depth=3,
    exclude_paths=['/admin/*', '/api/*'],
    include_paths=['/products/*'],
    content_formats=['html', 'markdown'],
    # ... and many more options
)
```

See `CrawlerConfig` class documentation for all available parameters.

## Artifact Formats

### WARC Format

The crawler returns results in WARC (Web ARChive) format by default, which is automatically parsed:

```python
artifact = crawl.warc()

# Easy way: Get all pages as dictionaries
pages = artifact.get_pages()
for page in pages:
    url = page['url']
    status_code = page['status_code']
    headers = page['headers']
    content = page['content']  # bytes

# Memory-efficient: Iterate one record at a time
for record in artifact.iter_responses():
    print(f"{record.url}: {len(record.content)} bytes")

# Save to file
artifact.save('results.warc.gz')
```

### HAR Format

HAR (HTTP Archive) format includes detailed timing information for performance analysis:

```python
artifact = crawl.har()

# Access timing data
for entry in artifact.iter_responses():
    print(f"{entry.url}")
    print(f" Status: {entry.status_code}")
    print(f" Total time: {entry.time}ms")
    print(f" Content type: {entry.content_type}")

    # Detailed timing breakdown
    timings = entry.timings
    print(f" DNS: {timings.get('dns', 0)}ms")
    print(f" Connect: {timings.get('connect', 0)}ms")
    print(f" Wait: {timings.get('wait', 0)}ms")
    print(f" Receive: {timings.get('receive', 0)}ms")

# Same easy interface as WARC
pages = artifact.get_pages()
```

See [demo_har.py](demo_har.py) for a complete HAR example.

## Error Handling

```python
from scrapfly import Crawl, CrawlerConfig

try:
    crawl = Crawl(client, config)
    crawl.crawl().wait(max_wait=300)

    if crawl.status().is_complete:
        pages = crawl.warc().get_pages()
        print(f"Success! Got {len(pages)} pages")
    elif crawl.status().is_failed:
        print("Crawler failed")

except RuntimeError as e:
    print(f"Error: {e}")
```

## Troubleshooting

### "SCRAPFLY_API_KEY environment variable not set"

Make sure you've either:
1. Exported the environment variable: `export SCRAPFLY_API_KEY='your-key'`
2. Created a `.env` file with your API key
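
If it is unclear which of the two applies, a quick way to see what the running process actually picks up (prints `None` when the variable is missing):

```python
import os

# Shows whether the key is visible to this Python process.
print(os.environ.get("SCRAPFLY_API_KEY"))
```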

### "Invalid API key" error

Double-check that:
1. Your API key is correct and starts with `scp-live-`
2. You have an active Scrapfly subscription
3. You're using the correct API key from your dashboard

### Import errors for dotenv

The `python-dotenv` package is optional. If you see import warnings, you can either:
1. Install it: `pip install python-dotenv`
2. Ignore them - environment variables will still work

## Learn More

- [Scrapfly Crawler API Documentation](https://scrapfly.io/docs/crawler-api)
- [Python SDK Documentation](https://scrapfly.io/docs/sdk/python)
