DANK is a Dense Analysis project focused on collecting and analyzing live data from the public Internet. It uses API access, web scraping, RSS feeds, and semantic indexing tools to ingest external content in real time. It applies sentiment analysis, semantic clustering, and AI models to build structured insights about the world, including trends, public perception, and evolving narratives. The goal is to automate contextual understanding and surface relevant knowledge as it emerges.
- Python 3.13
- uv
- ClickHouse (local server)
- Install ClickHouse: https://clickhouse.com/docs/en/install
- Start the ClickHouse server (systemd or `clickhouse server`).
- Create the schema: `~/clickhouse/clickhouse client --multiquery < schema.sql`
The schema uses the `dank` database by default. Adjust `config.toml` if you
need a different database name.
Configuration lives in `config.toml` and should not be committed. Example:

```toml
sources = [
    { domain = "x.com", accounts = ["example"] },
    "blog.codinghorror.com",
]

[clickhouse]
host = "localhost"
port = 8123
database = "dank"
username = "default"
password = ""
secure = false
use_http = true

[x]
username = "your-x-username"
password = "your-x-password"
max_posts = 200
max_scrolls = 20
scroll_pause_seconds = 1.5

[storage]
data_dir = "data"
max_asset_bytes = 10485760

[browser]
# Optional: full path or command name for a Chromium-based browser.
executable_path = "thorium-browser"
# Optional: extra time to wait for the browser to start.
connection_timeout = 1.0
# Optional: connection retry count for slow browser startups.
connection_max_tries = 30

[email]
# Optional: IMAP settings for OTP codes.
host = "imap.example.com"
username = "you@example.com"
password = "your-imap-password"
port = 993

[logging]
# Optional: file path for scrape/process logs.
file = "dank.log"
# Optional: logging level (DEBUG, INFO, WARNING, ERROR).
level = "INFO"
```

`sources` controls which domains to scrape and process. Each entry can provide
`accounts` for account-based sources like x.com.
If a domain has no specific configuration, DANK scrapes the domain root to discover RSS feeds to read from.
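Feed discovery conventionally means scanning the page head for `<link rel="alternate">` tags that advertise RSS or Atom feeds. The following stdlib sketch shows the idea; it is an assumption about the approach, not DANK's actual implementation:

```python
from html.parser import HTMLParser

# Sketch of RSS/Atom feed discovery from a page's HTML. Feeds are
# conventionally advertised in the document head as
# <link rel="alternate" type="application/rss+xml" href="...">.
class FeedLinkParser(HTMLParser):
    FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

    def __init__(self):
        super().__init__()
        self.feeds: list[str] = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if (tag == "link" and a.get("rel") == "alternate"
                and a.get("type") in self.FEED_TYPES and "href" in a):
            self.feeds.append(a["href"])

page = ('<html><head><link rel="alternate" '
        'type="application/rss+xml" href="/feed.xml"></head></html>')
parser = FeedLinkParser()
parser.feed(page)
print(parser.feeds)  # ['/feed.xml']
```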
`browser.executable_path` sets the browser binary to launch. If unset, DANK
will try common Chromium locations.
`storage.max_asset_bytes` caps asset downloads in bytes. Larger assets are
skipped but still recorded.
When X prompts for a one-time code, DANK will poll the IMAP inbox for messages
from x.com that arrived after the login attempt and extract the confirmation
code.
If the browser takes longer to start, increase
`browser.connection_timeout` or `browser.connection_max_tries`.
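The retry behavior those two settings describe amounts to a bounded retry loop. A generic sketch, with names that are assumptions rather than DANK's actual API:

```python
import time

# Generic retry loop mirroring browser.connection_max_tries and a pause
# between attempts; not DANK's actual connection code.
def connect_with_retries(connect, max_tries: int = 30, pause: float = 1.0):
    last_error = None
    for _ in range(max_tries):
        try:
            return connect()
        except ConnectionError as error:
            last_error = error
            time.sleep(pause)
    raise last_error

# Simulated browser endpoint that only accepts the third attempt.
attempts = {"n": 0}
def flaky_connect():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("browser not ready")
    return "connected"

print(connect_with_retries(flaky_connect, max_tries=5, pause=0.0))  # connected
```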
`logging.file` controls where scrape/process logs are written. Relative paths
are resolved from the current working directory.
DANK offers the following commands.

- `uv run scrape` -- Scrape the web for data
  - Pass `--domains` to scrape only matching domains from `sources`, for example `--domains '^x\.com$'`.
- `uv run process` -- Process previously scraped data
  - The `--age` argument can be given a duration to process, for example `6hours` or `2days`.
- `uv run clickhouse-query` -- Run queries on the database
  - You can only run `SELECT`, `SHOW`, or `EXPLAIN` queries through this tool
  - Query results are well formatted and easy to read
  - Query results are truncated unless you pass `--full`
- `uv run embed-text "your text"` -- Print an embedding vector
  - Output is a JSON `list[float]` for easy copy/paste into other tools.
- `uv run download-embedding-model` -- Download and cache the embeddings model
  - Pass `--model` to choose another Hugging Face model id.
- `uv run web` -- Start a simple web server to view content.
  - Pass `--no-reload` to disable hot code reloading.
  - Supports search filters for domain/account and a days-back slider.
- `uv run pytest` -- Run the default test suite.
- `uv run pytest -m embeddings -s` -- Run real-model embedding checks.
  - These tests are skipped by default and require the model cache.
  - Includes per-case similarity and margin output for each model.