|
1 | | -# Python Diffbot API Client |
| 1 | +# Diffbot Python Library |
2 | 2 |
|
| 3 | +Python client library for [Diffbot](https://www.diffbot.com) APIs. |
3 | 4 |
|
4 | | -## Preface |
5 | | -Identify and extract the important parts of any web page in Python! This client currently supports calls to Diffbot's Automatic APIs and Crawlbot. |
6 | 5 |
|
| 6 | +## Installation |
7 | 7 |
|
8 | | -Installation |
9 | | -To install activate a new virtual environment and run the following command: |
| 8 | +```bash |
| 9 | +pip install git+https://github.com/diffbot/diffbot-python.git |
| 10 | +``` |
10 | 11 |
|
11 | | - $ pip install -r requirements.txt |
| 12 | +Or, for local development: |
12 | 13 |
|
13 | | -## Configuration |
| 14 | +```bash |
| 15 | +pip install -e ".[dev]" |
| 16 | +``` |
14 | 17 |
|
15 | | -To run the example, you must first configure a working API token in config.py: |
| 18 | +## Usage |
16 | 19 |
|
17 | | - $ cp config.py.example config.py; vim config.py; |
| 20 | +### Authentication |
| 21 | +Set your Diffbot API token as an environment variable or in a `.env` file. |
18 | 22 |
|
19 | | -Then replace the string "SOME_TOKEN" with your API token. Finally, to run the example: |
| 23 | +```bash |
| 24 | +export DIFFBOT_API_TOKEN=your-token-here |
| 25 | +``` |
20 | 26 |
|
21 | | - $ python example.py |
| 27 | +### Extract structured content |
| 28 | +```python |
| 29 | +from diffbot import Diffbot |
22 | 30 |
|
23 | | -## Usage |
| 31 | +db = Diffbot(token="YOUR_TOKEN") |
| 32 | +data = db.extract("https://www.example.com") |
| 33 | +``` |
24 | 34 |
|
25 | | -### Article API |
26 | | -An example call to the Article API: |
| 35 | +### Ask Diffbot LLM |
| 36 | +```python |
| 37 | +from diffbot import Diffbot |
27 | 38 |
|
28 | | -``` |
29 | | -diffbot = DiffbotClient() |
30 | | -token = "SOME_TOKEN" |
31 | | -version = 2 |
32 | | -url = "http://shichuan.github.io/javascript-patterns/" |
33 | | -api = "article" |
34 | | -response = diffbot.request(url, token, api, version=2) |
| 39 | +db = Diffbot(token="YOUR_TOKEN") |
| 40 | +for chunk in db.ask([{"role": "user", "content": "What's the capital of France?"}]): |
| 41 | + print(chunk, end="") |
35 | 42 | ``` |
36 | 43 |
|
37 | | -### Product API |
38 | | -An example call to the Product API: |
| 44 | +### Crawl a site for structured content |
| 45 | +```python |
| 46 | +from diffbot import Diffbot |
39 | 47 |
|
40 | | -``` |
41 | | -diffbot = DiffbotClient() |
42 | | -token = "SOME_TOKEN" |
43 | | -version = 2 |
44 | | -url = "http://www.overstock.com/Home-Garden/iRobot-650-Roomba-Vacuuming-Robot/7886009/product.html" |
45 | | -api = "product" |
46 | | -response = diffbot.request(url, token, api, version=version) |
| 48 | +db = Diffbot(token="YOUR_TOKEN") |
| 49 | +for event in db.crawl("https://www.example.com", hops=1): |
| 50 | + print(event) |
47 | 51 | ``` |
48 | 52 |
|
49 | | -### Image API |
50 | | -An example call to the Image API: |
| 53 | +### Query the Knowledge Graph |
| 54 | +```python |
| 55 | +from diffbot import Diffbot |
51 | 56 |
|
52 | | -``` |
53 | | -diffbot = DiffbotClient() |
54 | | -token = "SOME_TOKEN" |
55 | | -version = 2 |
56 | | -url = "http://www.google.com/" |
57 | | -api = "image" |
58 | | -response = diffbot.request(url, token, api, version=version) |
| 57 | +db = Diffbot(token="YOUR_TOKEN") |
| 58 | +results = db.dql('type:Organization name:"Diffbot"') |
59 | 59 | ``` |
60 | 60 |
|
61 | | -### Analyze API |
62 | | -An example call to the Analyze API: |
| 61 | +### Web Search |
| 62 | +```python |
| 63 | +from diffbot import Diffbot |
63 | 64 |
|
64 | | -``` |
65 | | -diffbot = DiffbotClient() |
66 | | -token = "SOME_TOKEN" |
67 | | -version = 2 |
68 | | -url = "http://www.twitter.com/" |
69 | | -api = "analyze" |
70 | | -response = diffbot.request(url, token, api, version=version) |
| 65 | +db = Diffbot(token="YOUR_TOKEN") |
| 66 | +results = db.web_search("diffbot knowledge graph") |
| 67 | +for r in results["search_results"]: |
| 68 | + print(r["score"], r["title"], r["pageUrl"]) |
| 69 | + print(r["content"]) |
71 | 70 | ``` |
72 | 71 |
|
73 | | -### Crawlbot API |
74 | | -To start a new crawl, specify a crawl name, seed URLs, and the API via which URLs should be processed. An example call to the Crawlbot API: |
| 72 | +### Entities (NLP) |
| 73 | +```python |
| 74 | +from diffbot import Diffbot |
75 | 75 |
|
76 | | -``` |
77 | | -token = "SOME_TOKEN" |
78 | | -name = "sampleCrawlName" |
79 | | -seeds = "http://www.twitter.com/" |
80 | | -api = "analyze" |
81 | | -sampleCrawl = DiffbotCrawl(token,name,seeds=seeds,api=api) |
| 76 | +db = Diffbot(token="YOUR_TOKEN") |
| 77 | +result = db.entities("Apple CEO Tim Cook announced record quarterly earnings.") |
| 78 | +for entity in result["entities"]: |
| 79 | + print(entity["name"], entity.get("type"), entity.get("id")) |
| 80 | +print("sentiment:", result.get("sentiment")) |
82 | 81 | ``` |
83 | 82 |
|
84 | | -Omit "seeds" and "api" to load an existing crawl, or create a crawl as a placeholder. |
| 83 | +## Async Usage |
85 | 84 |
|
86 | | -To check the status of a crawl: |
| 85 | +### Extract structured content |
| 86 | +```python |
| 87 | +import asyncio |
| 88 | +from diffbot import DiffbotAsync |
87 | 89 |
|
88 | | -``` |
89 | | -sampleCrawl.status() |
| 90 | +async def main(): |
| 91 | + async with DiffbotAsync(token="YOUR_TOKEN") as db: |
| 92 | + data = await db.extract("https://www.example.com") |
| 93 | + print(data) |
| 94 | + |
| 95 | +asyncio.run(main()) |
90 | 96 | ``` |
91 | 97 |
|
92 | | -To update a crawl: |
| 98 | +### Ask Diffbot LLM |
| 99 | +```python |
| 100 | +import asyncio |
| 101 | +from diffbot import DiffbotAsync |
93 | 102 |
|
94 | | -``` |
95 | | -maxToCrawl = 100 |
96 | | -upp = "diffbot" |
97 | | -sampleCrawl.update(maxToCrawl=maxToCrawl,urlProcessPattern=upp) |
| 103 | +async def main(): |
| 104 | + async with DiffbotAsync(token="YOUR_TOKEN") as db: |
| 105 | + async for chunk in db.ask([{"role": "user", "content": "What's the capital of France?"}]): |
| 106 | + print(chunk, end="") |
| 107 | + |
| 108 | +asyncio.run(main()) |
98 | 109 | ``` |
99 | 110 |
|
100 | | -To delete or restart a crawl: |
| 111 | +### Crawl a site for structured content |
| 112 | +```python |
| 113 | +import asyncio |
| 114 | +from diffbot import DiffbotAsync |
101 | 115 |
|
102 | | -``` |
103 | | -sampleCrawl.delete() |
104 | | -sampleCrawl.restart() |
| 116 | +async def main(): |
| 117 | + async with DiffbotAsync(token="YOUR_TOKEN") as db: |
| 118 | + async for event in db.crawl("https://www.example.com", hops=1): |
| 119 | + print(event) |
| 120 | + |
| 121 | +asyncio.run(main()) |
105 | 122 | ``` |
106 | 123 |
|
107 | | -To download crawl data: |
| 124 | +### Query the Knowledge Graph |
| 125 | +```python |
| 126 | +import asyncio |
| 127 | +from diffbot import DiffbotAsync |
108 | 128 |
|
| 129 | +async def main(): |
| 130 | + async with DiffbotAsync(token="YOUR_TOKEN") as db: |
| 131 | + results = await db.dql('type:Organization name:"Diffbot"') |
| 132 | + print(results) |
| 133 | + |
| 134 | +asyncio.run(main()) |
109 | 135 | ``` |
110 | | -sampleCrawl.download() # returns JSON by default |
111 | | -sampleCrawl.download(data_format="csv") |
112 | | -``` |
113 | 136 |
|
114 | | -To pass additional arguments to a crawl: |
| 137 | +### Web Search |
| 138 | +```python |
| 139 | +import asyncio |
| 140 | +from diffbot import DiffbotAsync |
| 141 | + |
| 142 | +async def main(): |
| 143 | + async with DiffbotAsync(token="YOUR_TOKEN") as db: |
| 144 | + results = await db.web_search("diffbot knowledge graph") |
| 145 | + for r in results["search_results"]: |
| 146 | + print(r["score"], r["title"], r["pageUrl"]) |
| 147 | + print(r["content"]) |
115 | 148 |
|
| 149 | +asyncio.run(main()) |
116 | 150 | ``` |
117 | | -sampleCrawl = DiffbotCrawl(token,name,seeds,apiUrl,maxToCrawl=100,maxToProcess=50,notifyEmail="support@diffbot.com") |
| 151 | + |
| 152 | +### Entities (NLP) |
| 153 | +```python |
| 154 | +import asyncio |
| 155 | +from diffbot import DiffbotAsync |
| 156 | + |
| 157 | +async def main(): |
| 158 | + async with DiffbotAsync(token="YOUR_TOKEN") as db: |
| 159 | + result = await db.entities("Apple CEO Tim Cook announced record quarterly earnings.") |
| 160 | + for entity in result["entities"]: |
| 161 | + print(entity["name"], entity.get("type"), entity.get("id")) |
| 162 | + print("sentiment:", result.get("sentiment")) |
| 163 | + |
| 164 | +asyncio.run(main()) |
118 | 165 | ``` |
119 | 166 |
|
120 | | -## Testing |
| 167 | +## CLI |
121 | 168 |
|
122 | | -First install the test requirements with the following command: |
| 169 | +This library also includes a CLI. |
123 | 170 |
|
124 | | - $ pip install -r test_requirements.txt |
| 171 | +```bash |
| 172 | +export DIFFBOT_API_TOKEN=your-token-here |
| 173 | + |
| 174 | +db extract https://www.example.com |
| 175 | +db ask "What's the capital of France?" |
| 176 | +db crawl https://www.example.com --hops 1 |
| 177 | +db crawl-list-jobs |
| 178 | +db crawl-delete-job crawl-1234567890 |
| 179 | +db web-search "diffbot knowledge graph" |
| 180 | +db web-search "diffbot knowledge graph" -n 5 -f json |
| 181 | +db entities "Apple CEO Tim Cook announced record quarterly earnings." |
| 182 | +db entities "Apple CEO Tim Cook announced record quarterly earnings." -f dql |
| 183 | +``` |
125 | 184 |
|
126 | | -Currently there are some simple unit tests that mock the API calls and return data from fixtures in the filesystem. From the project directory, simply run: |
| 185 | +## Tests |
127 | 186 |
|
128 | | - $ nosetests |
| 187 | +Run the mock test suite: |
| 188 | +```bash |
| 189 | +python -m pytest |
| 190 | +``` |
| 191 | + |
| 192 | +Run live integration tests against the real API (requires a valid token): |
| 193 | +```bash |
| 194 | +DIFFBOT_TOKEN=your_token python -m pytest -m live |
| 195 | +``` |
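
The new README's Authentication step sets `DIFFBOT_API_TOKEN` in the environment, while the Python examples pass `token="YOUR_TOKEN"` explicitly and the live-test command uses `DIFFBOT_TOKEN`. The diff does not say whether the library reads either variable on its own, so the following is only a minimal sketch of bridging the two by hand. The names `load_token` and `TOKEN_ENV_VARS` are hypothetical helpers, not part of the library, and checking both variable names is an assumption of this sketch.

```python
import os

# Variable names taken from the README: DIFFBOT_API_TOKEN (Usage, CLI) and
# DIFFBOT_TOKEN (live tests). Checking both is an assumption of this sketch.
TOKEN_ENV_VARS = ("DIFFBOT_API_TOKEN", "DIFFBOT_TOKEN")


def load_token(env=None):
    """Return the first non-empty token found, or raise a clear error."""
    env = os.environ if env is None else env
    for name in TOKEN_ENV_VARS:
        value = env.get(name, "").strip()
        if value:
            return value
    raise RuntimeError(
        "No Diffbot token found; set one of: " + ", ".join(TOKEN_ENV_VARS)
    )


if __name__ == "__main__":
    try:
        token = load_token()
    except RuntimeError as exc:
        print(exc)
    else:
        # The token would then be passed explicitly, as in the README examples:
        #   db = Diffbot(token=token)
        print("token loaded, length:", len(token))
```

Failing early with a message that names the expected variables keeps a missing token from surfacing later as an opaque authentication error.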