Commit c64b7cf

Hello again, world.xy
1 parent 15f6a79 commit c64b7cf

43 files changed

Lines changed: 2791 additions & 2653 deletions

.gitignore

Lines changed: 8 additions & 4 deletions
@@ -1,4 +1,8 @@
-.idea
-.DS_Store
-*.pyc
-config.py
+# Python
+.venv
+__pycache__
+.pytest_cache
+.env
+
+# Claude
+.claude/settings.local.json

AGENTS.md

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+# Agent Guidelines
+
+## README Examples
+
+Whenever a code example in `README.md` is added or updated, the corresponding test must be added or updated in `tests/test_readme_examples.py`. Run `python -m pytest tests/test_readme_examples.py` to validate before considering the work complete.
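
The guideline above ties every README example to a test, but `tests/test_readme_examples.py` is not shown in this view. As a rough illustration only, here is one way such a check could be sketched with the standard library: pull the `python` fences out of a Markdown string and confirm that each one at least compiles. The helper names `extract_python_blocks` and `check_blocks_compile` are hypothetical and are not taken from the repository.

```python
import re
from typing import List

# Matches a fenced block that opens with ```python and closes with ``` on its own line.
FENCE_RE = re.compile(
    r"^```python[ \t]*\n(.*?)^```[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_python_blocks(markdown: str) -> List[str]:
    """Return the source of every ```python fenced block, in document order."""
    return [match.group(1) for match in FENCE_RE.finditer(markdown)]


def check_blocks_compile(markdown: str) -> int:
    """Compile each block (syntax check only, nothing is executed).

    Raises SyntaxError on the first invalid block and returns the number
    of blocks checked otherwise.
    """
    blocks = extract_python_blocks(markdown)
    for index, source in enumerate(blocks):
        compile(source, f"<readme-block-{index}>", "exec")
    return len(blocks)
```

A test could read `README.md` and call `check_blocks_compile` on its text. This only catches syntax errors; checking behaviour would need the API calls mocked, which is outside this sketch.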

CLAUDE.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
AGENTS.md

INSTALL.md

Lines changed: 0 additions & 10 deletions
This file was deleted.

LICENSE

Lines changed: 21 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,21 @@
1+
MIT License
2+
3+
Copyright (c) 2023 Diffbot
4+
5+
Permission is hereby granted, free of charge, to any person obtaining a copy
6+
of this software and associated documentation files (the "Software"), to deal
7+
in the Software without restriction, including without limitation the rights
8+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9+
copies of the Software, and to permit persons to whom the Software is
10+
furnished to do so, subject to the following conditions:
11+
12+
The above copyright notice and this permission notice shall be included in all
13+
copies or substantial portions of the Software.
14+
15+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21+
SOFTWARE.

README.md

Lines changed: 147 additions & 80 deletions
@@ -1,128 +1,195 @@
-# Python Diffbot API Client
+# Diffbot Python Library
 
+Python client library for [Diffbot](https://www.diffbot.com) APIs.
 
-## Preface
-Identify and extract the important parts of any web page in Python! This client currently supports calls to Diffbot's Automatic APIs and Crawlbot.
 
+## Installation
 
-Installation
-To install activate a new virtual environment and run the following command:
+```bash
+pip install git+https://github.com/diffbot/diffbot-python.git
+```
 
-$ pip install -r requirements.txt
+Or, for local development:
 
-## Configuration
+```bash
+pip install -e ".[dev]"
+```
 
-To run the example, you must first configure a working API token in config.py:
+## Usage
 
-$ cp config.py.example config.py; vim config.py;
+### Authentication
+Set your Diffbot API token in your environment or .env.
 
-Then replace the string "SOME_TOKEN" with your API token. Finally, to run the example:
+```bash
+export DIFFBOT_API_TOKEN=<TOKEN>
+```
 
-$ python example.py
+### Extract structured content
+```python
+from diffbot import Diffbot
 
-## Usage
+db = Diffbot(token="YOUR_TOKEN")
+data = db.extract("https://www.example.com")
+```
 
-### Article API
-An example call to the Article API:
+### Ask Diffbot LLM
+```python
+from diffbot import Diffbot
 
-```
-diffbot = DiffbotClient()
-token = "SOME_TOKEN"
-version = 2
-url = "http://shichuan.github.io/javascript-patterns/"
-api = "article"
-response = diffbot.request(url, token, api, version=2)
+db = Diffbot(token="YOUR_TOKEN")
+for chunk in db.ask([{"role": "user", "content": "What's the capital of France?"}]):
+    print(chunk, end="")
 ```
 
-### Product API
-An example call to the Product API:
+### Crawl a site for structured content
+```python
+from diffbot import Diffbot
 
-```
-diffbot = DiffbotClient()
-token = "SOME_TOKEN"
-version = 2
-url = "http://www.overstock.com/Home-Garden/iRobot-650-Roomba-Vacuuming-Robot/7886009/product.html"
-api = "product"
-response = diffbot.request(url, token, api, version=version)
+db = Diffbot(token="YOUR_TOKEN")
+for event in db.crawl("https://www.example.com", hops=1):
+    print(event)
 ```
 
-### Image API
-An example call to the Image API:
+### Query the Knowledge Graph
+```python
+from diffbot import Diffbot
 
-```
-diffbot = DiffbotClient()
-token = "SOME_TOKEN"
-version = 2
-url = "http://www.google.com/"
-api = "image"
-response = diffbot.request(url, token, api, version=version)
+db = Diffbot(token="YOUR_TOKEN")
+results = db.dql('type:Organization name:"Diffbot"')
 ```
 
-### Analyze API
-An example call to the Analyze API:
+### Web Search
+```python
+from diffbot import Diffbot
 
-```
-diffbot = DiffbotClient()
-token = "SOME_TOKEN"
-version = 2
-url = "http://www.twitter.com/"
-api = "analyze"
-response = diffbot.request(url, token, api, version=version)
+db = Diffbot(token="YOUR_TOKEN")
+results = db.web_search("diffbot knowledge graph")
+for r in results["search_results"]:
+    print(r["score"], r["title"], r["pageUrl"])
+    print(r["content"])
 ```
 
-### Crawlbot API
-To start a new crawl, specify a crawl name, seed URLs, and the API via which URLs should be processed. An example call to the Crawlbot API:
+### Entities (NLP)
+```python
+from diffbot import Diffbot
 
-```
-token = "SOME_TOKEN"
-name = "sampleCrawlName"
-seeds = "http://www.twitter.com/"
-api = "analyze"
-sampleCrawl = DiffbotCrawl(token,name,seeds=seeds,api=api)
+db = Diffbot(token="YOUR_TOKEN")
+result = db.entities("Apple CEO Tim Cook announced record quarterly earnings.")
+for entity in result["entities"]:
+    print(entity["name"], entity.get("type"), entity.get("id"))
+print("sentiment:", result.get("sentiment"))
 ```
 
-Omit "seeds" and "api" to load an existing crawl, or create a crawl as a placeholder.
+## Async Usage
 
-To check the status of a crawl:
+### Extract structured content
+```python
+import asyncio
+from diffbot import DiffbotAsync
 
-```
-sampleCrawl.status()
+async def main():
+    async with DiffbotAsync(token="YOUR_TOKEN") as db:
+        data = await db.extract("https://www.example.com")
+        print(data)
+
+asyncio.run(main())
 ```
 
-To update a crawl:
+### Ask Diffbot LLM
+```python
+import asyncio
+from diffbot import DiffbotAsync
 
-```
-maxToCrawl = 100
-upp = "diffbot"
-sampleCrawl.update(maxToCrawl=maxToCrawl,urlProcessPattern=upp)
+async def main():
+    async with DiffbotAsync(token="YOUR_TOKEN") as db:
+        async for chunk in db.ask([{"role": "user", "content": "What's the capital of France?"}]):
+            print(chunk, end="")
+
+asyncio.run(main())
 ```
 
-To delete or restart a crawl:
+### Crawl a site for structured content
+```python
+import asyncio
+from diffbot import DiffbotAsync
 
-```
-sampleCrawl.delete()
-sampleCrawl.restart()
+async def main():
+    async with DiffbotAsync(token="YOUR_TOKEN") as db:
+        async for event in db.crawl("https://www.example.com", hops=1):
+            print(event)
+
+asyncio.run(main())
 ```
 
-To download crawl data:
+### Query the Knowledge Graph
+```python
+import asyncio
+from diffbot import DiffbotAsync
 
+async def main():
+    async with DiffbotAsync(token="YOUR_TOKEN") as db:
+        results = await db.dql('type:Organization name:"Diffbot"')
+        print(results)
+
+asyncio.run(main())
 ```
-sampleCrawl.download() # returns JSON by default
-sampleCrawl.download(data_format="csv")
-```
 
-To pass additional arguments to a crawl:
+### Web Search
+```python
+import asyncio
+from diffbot import DiffbotAsync
+
+async def main():
+    async with DiffbotAsync(token="YOUR_TOKEN") as db:
+        results = await db.web_search("diffbot knowledge graph")
+        for r in results["search_results"]:
+            print(r["score"], r["title"], r["pageUrl"])
+            print(r["content"])
 
+asyncio.run(main())
 ```
-sampleCrawl = DiffbotCrawl(token,name,seeds,apiUrl,maxToCrawl=100,maxToProcess=50,notifyEmail="support@diffbot.com")
+
+### Entities (NLP)
+```python
+import asyncio
+from diffbot import DiffbotAsync
+
+async def main():
+    async with DiffbotAsync(token="YOUR_TOKEN") as db:
+        result = await db.entities("Apple CEO Tim Cook announced record quarterly earnings.")
+        for entity in result["entities"]:
+            print(entity["name"], entity.get("type"), entity.get("id"))
+        print("sentiment:", result.get("sentiment"))
+
+asyncio.run(main())
 ```
 
-## Testing
+## CLI
 
-First install the test requirements with the following command:
+This library also includes a CLI.
 
-$ pip install -r test_requirements.txt
+```bash
+export DIFFBOT_API_TOKEN=your-token-here
+
+db extract https://www.example.com
+db ask "What's the capital of France?"
+db crawl https://www.example.com --hops 1
+db crawl-list-jobs
+db crawl-delete-job crawl-1234567890
+db web-search "diffbot knowledge graph"
+db web-search "diffbot knowledge graph" -n 5 -f json
+db entities "Apple CEO Tim Cook announced record quarterly earnings."
+db entities "Apple CEO Tim Cook announced record quarterly earnings." -f dql
+```
 
-Currently there are some simple unit tests that mock the API calls and return data from fixtures in the filesystem. From the project directory, simply run:
+## Tests
 
-$ nosetests
+Run the mock test suite:
+```bash
+python -m pytest
+```
+
+Run live integration tests against the real API (requires a valid token):
+```bash
+DIFFBOT_TOKEN=your_token python -m pytest -m live
+```
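
The Authentication section of the new README says the token can be set in the environment or in `.env`, while its Python examples pass `token="YOUR_TOKEN"` explicitly. The diff does not show how the library itself resolves the token, so the following is only a sketch of how calling code might look it up before constructing a client. `resolve_token` is a hypothetical helper, and the plain `KEY=VALUE` `.env` format is an assumption.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_token(
    env: Optional[Mapping[str, str]] = None,
    dotenv_path: str = ".env",
    name: str = "DIFFBOT_API_TOKEN",
) -> Optional[str]:
    """Hypothetical helper: prefer the environment, then fall back to a .env file."""
    env = os.environ if env is None else env
    value = env.get(name)
    if value:
        return value

    path = Path(dotenv_path)
    if not path.is_file():
        return None

    # Assumed format: one KEY=VALUE pair per line, '#' starts a comment line.
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, val = line.partition("=")
        if key.strip() == name:
            return val.strip().strip('"').strip("'") or None
    return None
```

The result could then be passed as the `token=` argument shown in the README examples, for instance `Diffbot(token=resolve_token())`, with a `None` result handled by the caller.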
