diff --git a/README.md b/README.md index 4897050..91fe80b 100644 --- a/README.md +++ b/README.md @@ -1 +1,346 @@ -# open-data-agent \ No newline at end of file +# Open Data Agent + +A local CLI tool (`oda`) that lets developers, data analysts, data scientists, and AI agents query databases safely — with auto-generated schema context, read-only enforcement, and self-healing diagnostics. + +--- + +## What is Open Data Agent? + +Open Data Agent (`oda`) is a local command-line tool that bridges the gap between natural language questions and SQL databases. It is designed to be used by: + +- **Data analysts and data scientists** doing exploratory data analysis (EDA) — query any connected database without memorising schema details +- **Developers** running ad-hoc SQL safely, with automatic LIMIT injection and query history +- **AI coding agents** (primarily [OpenCode](https://opencode.ai)) — `oda` acts as a structured, read-only database interface that an agent can call as a tool, using auto-generated schema documentation as context + +Key properties: + +- **Read-only by default** — write operations are hard-blocked with no bypass mechanism +- **Schema docs as context** — human- and agent-readable markdown catalog generated from your live schema +- **Self-healing diagnostics** — zero-row results and errors emit structured diagnostic context (row counts, sample values, NULL counts) so an AI agent can self-correct and retry +- **Local-first** — no server, no network service, all data on disk +- **Multi-dialect** — supports PostgreSQL, MySQL, and SQLite + +--- + +## How It Works + +### End-to-End Flow + +``` +Question (natural language or SQL) + │ + ├─ [AI agent] reads .opencode/rules/data-agent.md ← generated by `oda connect` + ├─ [AI agent] reads docs/data-catalog/_index.md ← generated by `oda docs generate` + ├─ [AI agent] navigates to relevant table docs + ├─ [AI agent] checks memory/ for known data quirks + │ + └─ oda query "SELECT ..." + │ + ├─ SafetyChecker read-only whitelist, dangerous pattern detection + ├─ LIMIT injection auto-appends or clamps LIMIT (default 1000, max 10000) + ├─ Timeout server-side (PostgreSQL/MySQL) or thread-based (SQLite) + ├─ Execution results printed to stdout + └─ History log appended to ~/.config/open-data-agent/history.jsonl + │ + └─ (zero rows or error) + └─ DiagnosticEngine → structured context to stderr → agent retries +``` + +### Core Components + +**1. Schema Catalog** + +`oda docs generate` introspects your connected database and produces a hierarchical markdown catalog under `docs/data-catalog/`. Each table gets its own file with columns, types, nullability, sample rows, and (optionally) column statistics. This catalog is the primary context source — neither you nor the agent needs to run live introspection queries. + +**2. Memory Store** + +`memory/` contains curated markdown files (with YAML frontmatter) that capture data quirks, business logic, known anomalies, and tribal knowledge about your data. The agent checks memory before constructing a query. + +**3. Query Engine** + +Every query passes through a safety pipeline: +- **Whitelist check** (primary guard) — only explicitly allowed prefixes pass: `SELECT`, `WITH`, `EXPLAIN`, and dialect-specific commands (`PRAGMA` for SQLite; `SHOW` for PostgreSQL and MySQL; `TABLE` for MySQL). Any token not on the list is blocked outright. +- **Blacklist check** (secondary guard) — catches injection patterns inside otherwise-valid SQL: `INSERT`, `UPDATE`, `DELETE`, `DROP`, `CREATE`, `ALTER`, `TRUNCATE`, `REPLACE`, `MERGE`, `GRANT`, `REVOKE`, `CALL`, `EXEC`, `EXECUTE`, and dialect-specific patterns such as `COPY`, `LOAD DATA`, `ATTACH DATABASE` +- **LIMIT injection** — if no LIMIT is present, one is appended automatically; if the limit exceeds the configured maximum, it is clamped +- **Timeout** — queries are killed after a configurable timeout (default 30 seconds) +- **History** — every query is appended to a local JSONL log with timing and metadata + +**4. Diagnostics** + +When a query returns zero rows or errors, `DiagnosticEngine` automatically emits structured context to `stderr`: table row counts, sample column values, and NULL counts for relevant columns. This gives an AI agent (or a developer) enough signal to self-correct and retry without running additional queries manually. + +### Using with OpenCode + +`oda` is designed as a first-class tool for the [OpenCode](https://opencode.ai) AI coding agent. When you run `oda connect `, it renders a rules file at `.opencode/rules/data-agent.md` that tells OpenCode: + +- Which database is active and how to explore it +- How to navigate the schema catalog +- How to call `oda query`, `oda memory`, and `oda docs` commands +- How to interpret diagnostic output and self-heal on failures + +This means **no extra LLM setup is required for natural language queries** — OpenCode reads your schema docs, understands your data, constructs the SQL, and calls `oda query` as a tool. The full NL→SQL→results loop works out of the box. + +See [Asking Questions in Natural Language with OpenCode](#asking-questions-in-natural-language-with-opencode) for a step-by-step walkthrough. + +--- + +## How to Use It + +### Prerequisites + +- Python 3.12+ +- [`uv`](https://docs.astral.sh/uv/) — install with `curl -LsSf https://astral.sh/uv/install.sh | sh` + +### Install + +```bash +git clone +cd open-data-agent +uv sync +``` + +### Initialise + +```bash +uv run oda init +``` + +Creates `~/.config/open-data-agent/` with default config files. + +### Add a Connection + +```bash +uv run oda connections add +``` + +Prompts for connection name, database type (postgresql / mysql / sqlite), host, port, database name, username, and password. Passwords are stored in the OS keychain where available (macOS Keychain, GNOME Keyring, Windows Credential Manager). On headless or CI environments without a keychain backend, passwords fall back to plaintext in `~/.config/open-data-agent/connections.yaml` with a warning. + +### Activate a Connection + +```bash +uv run oda connect +``` + +Sets the active connection and renders `.opencode/rules/data-agent.md` for OpenCode. + +> **Note:** `oda connect` writes the rules file relative to the current working directory. Always run this command from your project root, otherwise the file will be created in the wrong location. + +### Generate Schema Docs + +```bash +uv run oda docs generate + +# Include column statistics (null counts, distinct counts, min/max): +uv run oda docs generate --enrich +``` + +### Run a Query + +```bash +uv run oda query "SELECT * FROM orders LIMIT 10" + +# Output as JSON or CSV: +uv run oda query "SELECT * FROM orders" --format json +uv run oda query "SELECT * FROM orders" --format csv +``` + +### Command Reference + +| Command | Description | +|---|---| +| `oda init` | First-run setup: create config directories and default files | +| `oda connect ` | Activate a connection; render OpenCode rules file | +| `oda connections list` | List all configured connections | +| `oda connections add` | Add a new connection (interactive) | +| `oda connections remove ` | Remove a connection | +| `oda connections test ` | Test live connectivity | +| `oda schemas` | List schemas in the active database | +| `oda tables []` | List tables (optional schema filter) | +| `oda describe ` | Show columns and types | +| `oda sample
[--n N]` | Show N sample rows (default 5) | +| `oda profile
` | Column statistics: null count, distinct, min, max | +| `oda query ""` | Execute SQL; auto-logged to history | +| `oda docs generate [--enrich]` | Generate schema documentation catalog | +| `oda docs status` | Show freshness report for schema docs | +| `oda memory list` | List memory entries | +| `oda memory add` | Add a memory entry | +| `oda memory search ` | Search memory entries | +| `oda history list [--n N]` | Show most recent N query history entries | +| `oda history search ` | Search query history | +| `oda history stats` | Query history statistics | + +### Configuration + +Global config lives at `~/.config/open-data-agent/config.yaml`: + +```yaml +row_limit: 1000 # default LIMIT auto-injected if absent +max_row_limit: 10000 # hard ceiling; never exceeded +query_timeout_seconds: 30 # query execution timeout +docs_staleness_days: 7 # warn if schema docs older than this +log_level: INFO +strict_mode: false # if true: block queries when docs are stale + # equivalent to passing --strict on every oda query call +``` + +### Development + +```bash +# Unit tests (no external database required) +uv run pytest tests/unit/ -q + +# Lint and format +uv run ruff check . +uv run ruff format --check . + +# Type check +uv run mypy src/open_data_agent + +# Integration tests (requires Docker) +docker compose up -d +uv run pytest -m integration +docker compose down +``` + +--- + +## Asking Questions in Natural Language with OpenCode + +[OpenCode](https://opencode.ai) is an AI coding agent that runs in your terminal. When paired with `oda`, it acts as a natural language interface to your database — you ask a question in plain English, OpenCode reads your schema docs, writes the SQL, and calls `oda query` to execute it. No extra LLM configuration or API keys are needed beyond your OpenCode setup. + +### Prerequisites + +- OpenCode installed and configured (`opencode` available in your PATH) +- `oda` set up with a connection and schema docs generated (see [How to Use It](#how-to-use-it)) + +### Step 1 — Connect and generate schema docs + +```bash +# Run from your project root — oda connect writes .opencode/rules/data-agent.md +# relative to the current working directory +uv run oda connect my-db + +# Generate the schema catalog OpenCode will use as context +uv run oda docs generate +``` + +After this, `docs/data-catalog/` contains a markdown file for every table in your database. OpenCode reads these files to understand your schema without running any live introspection queries. + +### Step 2 — Open OpenCode in your project + +```bash +opencode +``` + +OpenCode automatically loads `.opencode/rules/data-agent.md` on startup. This file tells it which database is active, how to navigate the schema catalog, and how to use `oda` commands. + +### Step 3 — Ask a question in natural language + +Type your question directly in the OpenCode chat. Examples: + +``` +How many orders were placed last month, broken down by status? +``` + +``` +Which customers have spent more than $10,000 in total? +``` + +``` +Show me the top 10 products by revenue in Q1 2025. +``` + +OpenCode will: + +1. Read the relevant table docs from `docs/data-catalog/` +2. Check `memory/` for any known data quirks affecting those tables +3. Construct a safe, read-only SQL query +4. Call `oda query "..."` to execute it +5. Present the results — and if zero rows are returned, use the diagnostic output to self-correct and retry + +### Step 4 — Refine with follow-up questions + +You can ask follow-up questions in the same session: + +``` +Filter that to just the 'enterprise' customer segment. +``` + +``` +Now group by region instead of status. +``` + +``` +Export that as CSV. +``` + +OpenCode maintains conversation context, so follow-ups build on the previous query. + +### Step 5 — Save data knowledge to memory + +When you or OpenCode discovers something important about your data (a quirky column, a misleading field name, a known data quality issue), save it so future sessions benefit: + +```bash +uv run oda memory add --title "Revenue column" \ + --category data_quality \ + --content "Use net_item_price not item_price — item_price includes tax" +``` + +Or ask OpenCode to do it: + +``` +Remember that the revenue column to use is net_item_price, not item_price. +``` + +### How OpenCode self-heals + +If a query returns zero rows, `oda query` automatically emits structured diagnostics to stderr: + +- Row counts for each table referenced in the SQL +- Sample values for filter columns (e.g. `Column 'status' sample values: ['active', 'pending']`) +- NULL counts for filter columns + +OpenCode reads this output and retries with a corrected query, without any input from you. + +### Keeping schema docs fresh + +If your database schema changes, regenerate the catalog before asking questions: + +```bash +uv run oda docs generate + +# Check freshness at any time: +uv run oda docs status +``` + +Use `--strict` mode to block queries against stale docs: + +```bash +uv run oda query "SELECT ..." --strict +``` + +--- + +## License + +MIT License + +Copyright (c) 2026 Adoit + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE.