# Open Data Agent

A local CLI tool (`oda`) that lets developers, data analysts, data scientists, and AI agents query databases safely — with auto-generated schema context, read-only enforcement, and self-healing diagnostics.

---

## What is Open Data Agent?

Open Data Agent (`oda`) is a local command-line tool that bridges the gap between natural language questions and SQL databases. It is designed to be used by:

- **Data analysts and data scientists** doing exploratory data analysis (EDA) — query any connected database without memorising schema details
- **Developers** running ad-hoc SQL safely, with automatic LIMIT injection and query history
- **AI coding agents** (primarily [OpenCode](https://opencode.ai)) — `oda` acts as a structured, read-only database interface that an agent can call as a tool, using auto-generated schema documentation as context

Key properties:

- **Read-only by default** — write operations are hard-blocked with no bypass mechanism
- **Schema docs as context** — human- and agent-readable markdown catalog generated from your live schema
- **Self-healing diagnostics** — zero-row results and errors emit structured diagnostic context (row counts, sample values, NULL counts) so an AI agent can self-correct and retry
- **Local-first** — no server, no network service, all data on disk
- **Multi-dialect** — supports PostgreSQL, MySQL, and SQLite

---

## How It Works

### End-to-End Flow

```
Question (natural language or SQL)
├─ [AI agent] reads .opencode/rules/data-agent.md   ← generated by `oda connect`
├─ [AI agent] reads docs/data-catalog/_index.md     ← generated by `oda docs generate`
├─ [AI agent] navigates to relevant table docs
├─ [AI agent] checks memory/ for known data quirks
└─ oda query "SELECT ..."
   ├─ SafetyChecker    read-only whitelist, dangerous pattern detection
   ├─ LIMIT injection  auto-appends or clamps LIMIT (default 1000, max 10000)
   ├─ Timeout          server-side (PostgreSQL/MySQL) or thread-based (SQLite)
   ├─ Execution        results printed to stdout
   ├─ History          log appended to ~/.config/open-data-agent/history.jsonl
   └─ (zero rows or error)
      └─ DiagnosticEngine → structured context to stderr → agent retries
```

### Core Components

**1. Schema Catalog**

`oda docs generate` introspects your connected database and produces a hierarchical markdown catalog under `docs/data-catalog/`. Each table gets its own file with columns, types, nullability, sample rows, and (optionally) column statistics. This catalog is the primary context source — neither you nor the agent needs to run live introspection queries.

**2. Memory Store**

`memory/` contains curated markdown files (with YAML frontmatter) that capture data quirks, business logic, known anomalies, and tribal knowledge about your data. The agent checks memory before constructing a query.
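
For example, a memory entry could look like this (the frontmatter fields shown here are illustrative, not a documented schema):

```markdown
---
title: Revenue column
category: data_quality
tables: [orders]
---

Use `net_item_price`, not `item_price`: `item_price` includes tax.
```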

**3. Query Engine**

Every query passes through a safety pipeline:
- **Whitelist check** (primary guard) — only explicitly allowed prefixes pass: `SELECT`, `WITH`, `EXPLAIN`, and dialect-specific commands (`PRAGMA` for SQLite; `SHOW` for PostgreSQL and MySQL; `TABLE` for MySQL). Any statement whose opening keyword is not on the list is blocked outright.
- **Blacklist check** (secondary guard) — catches injection patterns inside otherwise-valid SQL: `INSERT`, `UPDATE`, `DELETE`, `DROP`, `CREATE`, `ALTER`, `TRUNCATE`, `REPLACE`, `MERGE`, `GRANT`, `REVOKE`, `CALL`, `EXEC`, `EXECUTE`, and dialect-specific patterns such as `COPY`, `LOAD DATA`, `ATTACH DATABASE`
- **LIMIT injection** — if no LIMIT is present, one is appended automatically; if the limit exceeds the configured maximum, it is clamped
- **Timeout** — queries are killed after a configurable timeout (default 30 seconds)
- **History** — every query is appended to a local JSONL log with timing and metadata
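
A minimal sketch of what the first three stages could look like (hypothetical helper, not the actual `SafetyChecker` implementation; the real tool likely uses proper SQL parsing rather than regexes):

```python
import re

# Illustrative subsets of the whitelist and blacklist described above
ALLOWED_PREFIXES = ("SELECT", "WITH", "EXPLAIN", "PRAGMA", "SHOW", "TABLE")
BLOCKED_KEYWORDS = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|CREATE|ALTER|TRUNCATE|REPLACE|MERGE|"
    r"GRANT|REVOKE|CALL|EXEC|EXECUTE)\b",
    re.IGNORECASE,
)

def check_and_limit(sql: str, default_limit: int = 1000, max_limit: int = 10000) -> str:
    """Whitelist check, then blacklist scan, then LIMIT injection/clamping."""
    stripped = sql.strip().rstrip(";")
    if not stripped.upper().startswith(ALLOWED_PREFIXES):
        raise ValueError("blocked: statement type not on read-only whitelist")
    if BLOCKED_KEYWORDS.search(stripped):
        raise ValueError("blocked: write/DDL keyword detected")
    match = re.search(r"\bLIMIT\s+(\d+)\s*$", stripped, re.IGNORECASE)
    if match is None:
        # No LIMIT present: append the default
        return f"{stripped} LIMIT {default_limit}"
    if int(match.group(1)) > max_limit:
        # LIMIT present but too large: clamp to the ceiling
        return re.sub(r"\bLIMIT\s+\d+\s*$", f"LIMIT {max_limit}", stripped,
                      flags=re.IGNORECASE)
    return stripped
```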

**4. Diagnostics**

When a query returns zero rows or errors, `DiagnosticEngine` automatically emits structured context to `stderr`: table row counts, sample column values, and NULL counts for relevant columns. This gives an AI agent (or a developer) enough signal to self-correct and retry without running additional queries manually.
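
The kind of text such an engine might write to stderr can be sketched as follows (the function name and message format are illustrative, and a real engine would gather these values by querying the database):

```python
def format_diagnostics(table: str, row_count: int,
                       samples: dict[str, list], null_counts: dict[str, int]) -> str:
    """Render zero-row diagnostic context as human- and agent-readable text."""
    lines = [
        f"Diagnostic context for '{table}' (query returned zero rows):",
        f"  Table row count: {row_count}",
    ]
    for col, values in samples.items():
        lines.append(f"  Column '{col}' sample values: {values}")
    for col, nulls in null_counts.items():
        lines.append(f"  Column '{col}' NULL count: {nulls}")
    return "\n".join(lines)
```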

### Using with OpenCode

`oda` is designed as a first-class tool for the [OpenCode](https://opencode.ai) AI coding agent. When you run `oda connect <name>`, it renders a rules file at `.opencode/rules/data-agent.md` that tells OpenCode:

- Which database is active and how to explore it
- How to navigate the schema catalog
- How to call `oda query`, `oda memory`, and `oda docs` commands
- How to interpret diagnostic output and self-heal on failures

This means **no extra LLM setup is required for natural language queries** — OpenCode reads your schema docs, understands your data, constructs the SQL, and calls `oda query` as a tool. The full NL→SQL→results loop works out of the box.

See [Asking Questions in Natural Language with OpenCode](#asking-questions-in-natural-language-with-opencode) for a step-by-step walkthrough.

---

## How to Use It

### Prerequisites

- Python 3.12+
- [`uv`](https://docs.astral.sh/uv/) — install with `curl -LsSf https://astral.sh/uv/install.sh | sh`

### Install

```bash
git clone <repository-url>
cd open-data-agent
uv sync
```

### Initialise

```bash
uv run oda init
```

Creates `~/.config/open-data-agent/` with default config files.

### Add a Connection

```bash
uv run oda connections add
```

Prompts for connection name, database type (postgresql / mysql / sqlite), host, port, database name, username, and password. Passwords are stored in the OS keychain where available (macOS Keychain, GNOME Keyring, Windows Credential Manager). On headless or CI environments without a keychain backend, passwords fall back to plaintext in `~/.config/open-data-agent/connections.yaml` with a warning.

### Activate a Connection

```bash
uv run oda connect <name>
```

Sets the active connection and renders `.opencode/rules/data-agent.md` for OpenCode.

> **Note:** `oda connect` writes the rules file relative to the current working directory. Always run this command from your project root, otherwise the file will be created in the wrong location.

### Generate Schema Docs

```bash
uv run oda docs generate

# Include column statistics (null counts, distinct counts, min/max):
uv run oda docs generate --enrich
```

### Run a Query

```bash
uv run oda query "SELECT * FROM orders LIMIT 10"

# Output as JSON or CSV:
uv run oda query "SELECT * FROM orders" --format json
uv run oda query "SELECT * FROM orders" --format csv
```

### Command Reference

| Command | Description |
|---|---|
| `oda init` | First-run setup: create config directories and default files |
| `oda connect <name>` | Activate a connection; render OpenCode rules file |
| `oda connections list` | List all configured connections |
| `oda connections add` | Add a new connection (interactive) |
| `oda connections remove <name>` | Remove a connection |
| `oda connections test <name>` | Test live connectivity |
| `oda schemas` | List schemas in the active database |
| `oda tables [<schema>]` | List tables (optional schema filter) |
| `oda describe <table>` | Show columns and types |
| `oda sample <table> [--n N]` | Show N sample rows (default 5) |
| `oda profile <table>` | Column statistics: null count, distinct, min, max |
| `oda query "<sql>"` | Execute SQL; auto-logged to history |
| `oda docs generate [--enrich]` | Generate schema documentation catalog |
| `oda docs status` | Show freshness report for schema docs |
| `oda memory list` | List memory entries |
| `oda memory add` | Add a memory entry |
| `oda memory search <term>` | Search memory entries |
| `oda history list [--n N]` | Show most recent N query history entries |
| `oda history search <term>` | Search query history |
| `oda history stats` | Query history statistics |
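
Each executed query becomes one JSON line in `~/.config/open-data-agent/history.jsonl`. A record could plausibly look like this (the field names are illustrative, not `oda`'s actual schema):

```python
import json
from datetime import datetime, timezone

def history_record(connection: str, sql: str, rows: int, duration_ms: int) -> str:
    """Build one JSONL history line with timing and metadata."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "connection": connection,
        "sql": sql,
        "rows_returned": rows,
        "duration_ms": duration_ms,
    }
    # json.dumps emits no newlines, so each record stays on a single line
    return json.dumps(entry)
```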

### Configuration

Global config lives at `~/.config/open-data-agent/config.yaml`:

```yaml
row_limit: 1000 # default LIMIT auto-injected if absent
max_row_limit: 10000 # hard ceiling; never exceeded
query_timeout_seconds: 30 # query execution timeout
docs_staleness_days: 7 # warn if schema docs older than this
log_level: INFO
strict_mode: false # if true: block queries when docs are stale
# equivalent to passing --strict on every oda query call
```
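
A freshness check driven by `docs_staleness_days` could be as simple as comparing the catalog index's modification time against the threshold (a sketch under the assumption that the catalog root is `docs/data-catalog/_index.md`; the real `oda docs status` may track freshness differently):

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path

def docs_are_stale(catalog_dir: str, staleness_days: int = 7) -> bool:
    """Treat missing docs as stale; otherwise compare _index.md mtime to the threshold."""
    index = Path(catalog_dir) / "_index.md"
    if not index.exists():
        return True
    modified = datetime.fromtimestamp(index.stat().st_mtime, tz=timezone.utc)
    return datetime.now(timezone.utc) - modified > timedelta(days=staleness_days)
```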

### Development

```bash
# Unit tests (no external database required)
uv run pytest tests/unit/ -q

# Lint and format
uv run ruff check .
uv run ruff format --check .

# Type check
uv run mypy src/open_data_agent

# Integration tests (requires Docker)
docker compose up -d
uv run pytest -m integration
docker compose down
```

---

## Asking Questions in Natural Language with OpenCode

[OpenCode](https://opencode.ai) is an AI coding agent that runs in your terminal. When paired with `oda`, it acts as a natural language interface to your database — you ask a question in plain English, OpenCode reads your schema docs, writes the SQL, and calls `oda query` to execute it. No extra LLM configuration or API keys are needed beyond your OpenCode setup.

### Prerequisites

- OpenCode installed and configured (`opencode` available in your PATH)
- `oda` set up with a connection and schema docs generated (see [How to Use It](#how-to-use-it))

### Step 1 — Connect and generate schema docs

```bash
# Run from your project root — oda connect writes .opencode/rules/data-agent.md
# relative to the current working directory
uv run oda connect my-db

# Generate the schema catalog OpenCode will use as context
uv run oda docs generate
```

After this, `docs/data-catalog/` contains a markdown file for every table in your database. OpenCode reads these files to understand your schema without running any live introspection queries.

### Step 2 — Open OpenCode in your project

```bash
opencode
```

OpenCode automatically loads `.opencode/rules/data-agent.md` on startup. This file tells it which database is active, how to navigate the schema catalog, and how to use `oda` commands.

### Step 3 — Ask a question in natural language

Type your question directly in the OpenCode chat. Examples:

```
How many orders were placed last month, broken down by status?
```

```
Which customers have spent more than $10,000 in total?
```

```
Show me the top 10 products by revenue in Q1 2025.
```

OpenCode will:

1. Read the relevant table docs from `docs/data-catalog/`
2. Check `memory/` for any known data quirks affecting those tables
3. Construct a safe, read-only SQL query
4. Call `oda query "..."` to execute it
5. Present the results — and if zero rows are returned, use the diagnostic output to self-correct and retry

### Step 4 — Refine with follow-up questions

You can ask follow-up questions in the same session:

```
Filter that to just the 'enterprise' customer segment.
```

```
Now group by region instead of status.
```

```
Export that as CSV.
```

OpenCode maintains conversation context, so follow-ups build on the previous query.

### Step 5 — Save data knowledge to memory

When you or OpenCode discovers something important about your data (a quirky column, a misleading field name, a known data quality issue), save it so future sessions benefit:

```bash
uv run oda memory add --title "Revenue column" \
  --category data_quality \
  --content "Use net_item_price not item_price — item_price includes tax"
```

Or ask OpenCode to do it:

```
Remember that the revenue column to use is net_item_price, not item_price.
```

### How OpenCode self-heals

If a query returns zero rows, `oda query` automatically emits structured diagnostics to stderr:

- Row counts for each table referenced in the SQL
- Sample values for filter columns (e.g. `Column 'status' sample values: ['active', 'pending']`)
- NULL counts for filter columns

OpenCode reads this output and retries with a corrected query, without any input from you.

### Keeping schema docs fresh

If your database schema changes, regenerate the catalog before asking questions:

```bash
uv run oda docs generate

# Check freshness at any time:
uv run oda docs status
```

Use `--strict` mode to block queries against stale docs:

```bash
uv run oda query "SELECT ..." --strict
```

---

## License

MIT License

Copyright (c) 2026 Adoit

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.