Merged
156 changes: 54 additions & 102 deletions README.md
@@ -1,163 +1,115 @@
# ibis-hotdata

Experimental [Ibis](https://ibis-project.org/) backend for [Hotdata](https://www.hotdata.dev/docs/api-reference)—federated, Postgres-compatible SQL executed over HTTPS.
Experimental [Ibis](https://ibis-project.org/) backend for [Hotdata](https://www.hotdata.dev/docs/api-reference): compile expressions with Ibis, run federated SQL over the Hotdata API. REST calls use the official **[hotdata](https://github.com/hotdata-dev/sdk-python)** Python SDK. Repo examples use **httpx** (listed under the **dev** dependency group).

Hotdata exposes `POST /v1/query`, optional asynchronous execution (`202` + `GET /v1/query-runs/{id}` + `GET /v1/results/{id}`), and catalog metadata via `GET /v1/information_schema`. This package forwards compiled Ibis SQL through those endpoints.
**Requirements:** Python 3.10+, **ibis-framework** 10.x, **hotdata** ≥0.1.

## Install

**From PyPI** (pick your installer):

```bash
uv pip install ibis-hotdata
# or
python -m pip install ibis-hotdata
# or: python -m pip install ibis-hotdata
```

Use Python **3.10+**. This package pins **`ibis-framework>=10,<11`** to match the Ibis major line.

## Connect

Programmatic API:

```python
import ibis

con = ibis.hotdata.connect(
api_url="https://api.hotdata.dev",
token="YOUR_API_TOKEN",
workspace_id="ws_…",
session_id=None, # optional sandbox: X-Session-Id
session_id=None, # optional: X-Session-Id (sandbox)
verify_ssl=True,
timeout=120.0,
default_connection=None, # Hotdata connection id (Ibis “catalog”); see below
default_schema=None, # remote schema name (Ibis “database”)
prefer_async=False, # set True to prefer async query submission
default_connection=None, # Hotdata connection id (Ibis catalog)
default_schema=None, # remote schema (Ibis database)
prefer_async=False,
)
```

### URL form
URL style (token may live in the query string or the URL “password” segment):

```python
con = ibis.connect(
"hotdata://api.hotdata.dev/?token=…&workspace_id=ws_…&verify_ssl=true"
)
```

The host becomes `https://{host}` (plus any path on the URL). You may place the token in the password segment (`hotdata://x:TOKEN@host/…`) instead of the query string.
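
To make the URL rules concrete, here is a minimal standard-library sketch of how a `hotdata://` URL could be split into `connect()` keyword arguments. The helper name `parse_hotdata_url` is hypothetical and the backend's own parsing may differ; the sketch only follows the rules stated above (host plus path becomes the HTTPS API URL, token from the password segment or the query string).

```python
from __future__ import annotations

from urllib.parse import parse_qs, urlsplit


def parse_hotdata_url(url: str) -> dict[str, object]:
    """Hypothetical helper: split a hotdata:// URL into connect() kwargs."""
    parts = urlsplit(url)
    if parts.scheme != "hotdata":
        raise ValueError(f"expected a hotdata:// URL, got scheme {parts.scheme!r}")

    # Query values arrive as lists; keep the last occurrence of each key.
    query = {key: values[-1] for key, values in parse_qs(parts.query).items()}

    # Host (and optional port) plus any path becomes the HTTPS API URL.
    host = parts.hostname or ""
    if parts.port:
        host = f"{host}:{parts.port}"
    api_url = f"https://{host}{parts.path.rstrip('/')}"

    # The token may sit in the password segment or in the query string.
    token = parts.password or query.get("token")
    query.pop("token", None)

    kwargs: dict[str, object] = {"api_url": api_url, "token": token}
    if "verify_ssl" in query:
        # Assumption: common truthy spellings are accepted.
        kwargs["verify_ssl"] = query.pop("verify_ssl").lower() in {"1", "true", "yes"}
    kwargs.update(query)  # remaining keys, e.g. workspace_id, session_id
    return kwargs
```

Both URL spellings then resolve to the same kind of mapping, for example `parse_hotdata_url("hotdata://x:TOKEN@host/?workspace_id=ws_1")["token"]` yields the password segment.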

After `pip install`, both `ibis.hotdata.connect(...)` and `ibis.connect("hotdata://…")` resolve to this backend via the `ibis.backends` entry point.

## Headers and sessions

Per the [Hotdata API](https://www.hotdata.dev/docs/api-reference), the client sends:

- `Authorization: Bearer <token>`
- `X-Workspace-Id: <workspace_public_id>`
- optionally `X-Session-Id: <sandbox_public_id>` when `session_id` is set.
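
As an illustration of the header rules listed above, the following sketch builds the header mapping by hand. The function name `build_headers` is hypothetical; in practice the SDK or HTTP client assembles these headers for you.

```python
from __future__ import annotations


def build_headers(
    token: str,
    workspace_id: str,
    session_id: str | None = None,
) -> dict[str, str]:
    """Hypothetical helper: assemble the request headers described above."""
    headers = {
        "Authorization": f"Bearer {token}",
        "X-Workspace-Id": workspace_id,
    }
    # The sandbox header is only sent when a session id was configured.
    if session_id is not None:
        headers["X-Session-Id"] = session_id
    return headers
```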

## Ibis identifiers vs Hotdata hierarchy

Following Ibis terminology ([catalog → database → table](https://ibis-project.org/concepts/backend-table-hierarchy.qmd)), this backend maps:

| Ibis surface | Hotdata meaning |
|-------------|----------------|
| **Catalog** | Connection **id** from `GET /v1/connections` (same identifier as `connection` on `information_schema` rows). |
| **Database** | Remote **schema name** surfaced by Hotdata. |
| **Table name** | Remote table name. |

Typical federated references in SQL are `connection.schema.table` (quoted as needed):

```python
orders = con.table("orders", database=("conn_abc", "public"))
```

If the workspace exposes **exactly one** connection and **one** schema discovered for it, defaults are inferred; otherwise provide `default_connection` / `default_schema` when connecting.
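
The inference rule can be sketched as a small pure function. This is an assumption-laden illustration, not the backend's code: `infer_defaults` and its input shape (a mapping from connection id to the schema names discovered for it) are hypothetical.

```python
from __future__ import annotations


def infer_defaults(
    schemas_by_connection: dict[str, list[str]],
) -> tuple[str | None, str | None]:
    """Hypothetical helper: pick (connection, schema) defaults when unambiguous."""
    if len(schemas_by_connection) != 1:
        # Zero or several connections: the caller must pass default_connection.
        return None, None
    ((connection_id, schemas),) = schemas_by_connection.items()
    # Only infer the schema when exactly one was discovered for the connection.
    schema = schemas[0] if len(schemas) == 1 else None
    return connection_id, schema
```

When either element comes back as `None`, pass `default_connection` / `default_schema` to `connect()` or qualify the table with `database=(conn_id, schema)`.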

## SQL dialect and compilation

The backend reuses Ibis’s **PostgreSQL SQLGlot compiler** (`postgres` dialect) so expressions compile to Postgres-oriented SQL aligned with Hotdata’s documented Postgres-style surface. Operational SQL details and federation edge cases belong in the [Hotdata SQL docs](https://www.hotdata.dev/docs/sql)—this client does not re-validate server capabilities.

## Query execution and async

- By default queries use synchronous `POST /v1/query` with `"async": false`.
- With `prefer_async=True`, requests use `"async": true`. The HTTP client honors `202` by polling **`GET /v1/query-runs/{id}`** until `succeeded`, then **`GET /v1/results/{id}`** until tabular payload is available.
- You can tune `poll_interval_s` and `poll_timeout_s` on `connect()`.
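
The polling behavior described in the list above can be sketched roughly as follows. This is a generic loop, not the client's implementation: `poll_until_succeeded`, the `status` field name, the `failed` / `cancelled` terminal states, and the default interval and timeout values are all assumptions made for illustration; only the `succeeded` state and the `poll_interval_s` / `poll_timeout_s` names come from the text.

```python
from __future__ import annotations

import time
from typing import Any, Callable


def poll_until_succeeded(
    get_run: Callable[[], dict[str, Any]],
    *,
    poll_interval_s: float = 0.5,
    poll_timeout_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict[str, Any]:
    """Call get_run() repeatedly until the run reports 'succeeded'."""
    deadline = clock() + poll_timeout_s
    while True:
        run = get_run()  # e.g. one GET on the query-run endpoint
        status = run.get("status")  # assumed field name
        if status == "succeeded":
            return run
        if status in {"failed", "cancelled"}:  # assumed terminal states
            raise RuntimeError(f"query run ended with status {status!r}")
        if clock() >= deadline:
            raise TimeoutError(f"query run not finished after {poll_timeout_s}s")
        sleep(poll_interval_s)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting; a caller would pass a closure that performs the actual HTTP request as `get_run`.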

## Types and result materialization

- **Known tables:** column types come from `information_schema` when `include_columns=true` and are parsed with the same `PostgresType` mapper Ibis uses for PostgreSQL, with graceful fallback to `string`.
- **`con.sql(...)`:** inferred via `SELECT * FROM (<your query>) AS ibis_hotdata_preview LIMIT 1`, using HTTP `columns`/`nullable` and the first JSON row shape for coarse inference (Decimals from JSON rarely round-trip cleanly; timestamps may appear as ISO strings unless the API returns richer metadata; nested structures map toward `JSON` / `Array(JSON)`).
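
To show what "coarse inference from the first JSON row" can look like, here is a small sketch that maps decoded JSON values to type names in Ibis's string notation. The function `infer_column_types` and the exact mapping are assumptions for illustration; the backend's real inference may use additional metadata such as `nullable`.

```python
from __future__ import annotations

from typing import Any


def infer_column_types(columns: list[str], first_row: list[Any]) -> dict[str, str]:
    """Hypothetical helper: guess a type name per column from one JSON row."""

    def guess(value: Any) -> str:
        # bool must be checked before int because bool is a subclass of int.
        if isinstance(value, bool):
            return "boolean"
        if isinstance(value, int):
            return "int64"
        if isinstance(value, float):
            return "float64"
        if isinstance(value, dict):
            return "json"
        if isinstance(value, list):
            return "array<json>"
        # Strings, nulls, and anything unrecognized fall back to string.
        return "string"

    return {name: guess(value) for name, value in zip(columns, first_row)}
```

This also makes the caveats above visible: a decimal serialized as a JSON number is seen as `float64`, and an ISO timestamp is seen as `string`.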
**Mapping:** Ibis **catalog** = Hotdata connection id; **database** = remote schema; **table** = table name. SQL references look like `connection.schema.table`. With a single connection and schema, defaults are inferred; otherwise set `default_connection` / `default_schema` or qualify `con.table(..., database=(conn_id, schema))`.

Results are fetched into **pandas** by default (`execute`), matching core SQL backends. PyArrow batches follow Ibis’s `to_pyarrow` / `to_pyarrow_batches` path over the same row materialization.
**Execution:** SQL is compiled with Ibis’s **Postgres** SQLGlot compiler. The client uses `POST /v1/query`; with `prefer_async=True` it follows `202` and polls query-run and result endpoints until rows are ready. Tuning: `poll_interval_s`, `poll_timeout_s` on `connect()`.

## Out of scope (v1)
**Types:** Typed tables come from Hotdata’s information schema. `con.sql(...)` types are inferred from a small preview query; see [Hotdata SQL](https://www.hotdata.dev/docs/sql) for server behavior.

Table creation/DML helpers, uploads, embeddings, indexes, dataset lifecycle—these remain unimplemented unless you drive them explicitly with `.sql(...)`.
**Not in v1:** Ibis `create_table`, embeddings, indexes. **Uploads:** use `upload_file` + `create_dataset_from_upload` on the connection object (or raw SQL); query datasets as `datasets.<schema>.<table>` per Hotdata.
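
A minimal sketch of the upload path, assuming `con` is a connected Hotdata backend exposing `upload_file` and `create_dataset_from_upload` as added in this change. The wrapper name `upload_csv_as_dataset` is hypothetical; the `id`, `schema_name`, and `table_name` keys follow the method docstrings.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any


def upload_csv_as_dataset(con: Any, path: str | Path, label: str) -> str:
    """Upload a CSV file and return the SQL reference of the new dataset table."""
    data = Path(path).read_bytes()
    upload = con.upload_file(data)  # upload record; "id" identifies it
    dataset = con.create_dataset_from_upload(upload["id"], label, file_format="csv")
    # Datasets are addressed as datasets.<schema_name>.<table_name> in SQL.
    return f"datasets.{dataset['schema_name']}.{dataset['table_name']}"
```

The returned name can then be used in a query such as `con.sql(f"SELECT * FROM {ref}", dialect="postgres")`.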

## Development

This repo uses **[uv](https://docs.astral.sh/uv/)** for environments and **`uv.lock`**.

```bash
uv sync # editable project + dev group (pytest, pytest-httpserver, ruff)
uv sync --group dev # pytest, ruff, httpx (for examples)
uv run pytest
uv run ruff check src tests examples
```

Optional Python pin:
Lockfile CI: `uv sync --locked --group dev && uv run pytest`.

```bash
uv python pin 3.12
uv sync
```
## TPC-H for the examples

Examples assume something like **`tpch.tpch_sf1.customer`**. Provision TPC-H in your workspace (commonly a **DuckDB** connection, then DuckDB’s `tpch` extension and `CALL dbgen(sf = 1)` — see [DuckDB TPC-H](https://www.duckdb.org/docs/current/core_extensions/tpch.html) and [Hotdata Quick Start](https://www.hotdata.dev/docs/quick-start)). If your data lives under `main` instead, pass `--default-schema` / `--default-connection` or set `HOTDATA_DEFAULT_*` (see `examples/_helpers.py`).

## Examples

CI-oriented checks:
Needs `HOTDATA_TOKEN` and `HOTDATA_WORKSPACE_ID`.

```bash
uv sync --locked # fail if uv.lock is out of date relative to pyproject.toml
uv run pytest
uv sync --group dev
export HOTDATA_TOKEN=…
export HOTDATA_WORKSPACE_ID=…
uv run python examples/01_catalog_introspection.py
uv run python examples/02_execute_sql.py 'SELECT COUNT(*) AS n FROM tpch.tpch_sf1.customer'
uv run python examples/03_connect_via_url.py
uv run python examples/04_ibis_table_workflows.py
```

Without uv, use `pip install -e .` and install dev tools separately (`pytest`, `pytest-httpserver`, `ruff`).
### Ibis tables → pandas DataFrames

## TPC-H in Hotdata
Calling **`.execute()`** on a table expression runs the compiled SQL on Hotdata and returns a **pandas** `DataFrame` (Ibis’s default for this backend).

Hotdata does not ship TPC-H as a single “upload this file” dataset. You expose the benchmark tables through a **connection** in your workspace, then query them like any other federated tables. See [Quick Start](https://www.hotdata.dev/docs/quick-start) (workspaces and connections) and [Data Sources](https://www.hotdata.dev/docs/data-sources) (supported engines).
Hotdata’s SQL often uses a **federated prefix** (for example `tpch.tpch_sf1`) that may not match the Ibis **catalog** string (the connection id). A reliable pattern is to start from **`con.sql("SELECT * FROM tpch.tpch_sf1.mytable", dialect="postgres")`**, then chain filters and aggregates—see **`examples/04_ibis_table_workflows.py`**.

A practical approach is a **DuckDB** connection: in the [Hotdata app](https://app.hotdata.dev/), add DuckDB for your workspace, then run SQL against that connection (for example with `hotdata query '…' --workspace-id … --connection …` from the CLI) to install and generate data using DuckDB’s built-in TPC-H extension:
When **`con.table("mytable")`** is enough (single connection/schema and names align with compiled SQL), the same operations apply:

```sql
INSTALL tpch;
LOAD tpch;
CALL dbgen(sf = 1);
```
```python
t = con.table("customer") # or con.table("customer", database=(conn_id, "tpch_sf1"))

Details, cleanup between runs, and optional query harnesses are in the [DuckDB TPC-H extension](https://www.duckdb.org/docs/current/core_extensions/tpch.html) documentation. By default, `dbgen` creates the TPC-H tables in DuckDB’s default schema (often `main`).
df = (
t.filter(t.c_mktsegment == "AUTOMOBILE")
.select("c_custkey", "c_name")
.limit(100)
.execute()
)

The examples in this repo assume federated names like **`tpch.tpch_sf1.customer`**: a connection whose id matches **`tpch`** (or is picked by the helper’s resolver) and a schema **`tpch_sf1`**. If your tables live in `main` instead, run the examples with `--default-schema main` and the correct `--default-connection`, or set **`HOTDATA_TPCH_RESOLVE=false`** and **`HOTDATA_DEFAULT_SCHEMA`** / **`HOTDATA_DEFAULT_CONNECTION`** (see `examples/_helpers.py`). Alternatively, create a `tpch_sf1` schema in DuckDB and move or recreate the generated tables there so the layout matches the defaults.
by_seg = t.group_by(t.c_mktsegment).agg(n=t.count()).execute()

## Examples

The `examples/` directory has small CLIs that assume TPC-H defaults (**`tpch` / `tpch_sf1`**
for REST metadata, aligning with federation SQL **`tpch.tpch_sf1.*`**). Helpers resolve the
friendly labels to Hotdata connection ids when possible (`examples/_helpers.py`). Override
via `--default-connection`, `--default-schema`, or **`HOTDATA_DEFAULT_*`**.
o = con.table("orders")
orders_with_names = (
t.join(o, t.c_custkey == o.o_custkey)
.select(t.c_name, o.o_totalprice)
.limit(50)
.execute()
)

```bash
uv sync
export HOTDATA_TOKEN=...
export HOTDATA_WORKSPACE_ID=...
uv run python examples/01_catalog_introspection.py
uv run python examples/02_execute_sql.py 'SELECT COUNT(*) AS n FROM tpch.tpch_sf1.customer'
uv run python examples/03_connect_via_url.py
total = t.c_acctbal.sum().execute()
```

See each script's docstring and `examples/_helpers.py` for flags (`--catalog`, `--schema`, `--prefer-async`, `--insecure`, …).

Tests use **pytest-httpserver**; no workspace tokens are embedded in this repository.
Other useful paths: **`.to_pyarrow()`** / **`.to_pyarrow_batches()`** for Arrow; **`con.sql("SELECT …", dialect="postgres")`** then chain the returned table expression.

## References

- [Hotdata API reference](https://www.hotdata.dev/docs/api-reference)
- [Hotdata SQL reference](https://www.hotdata.dev/docs/sql)
- [Ibis](https://ibis-project.org/)
- [Hotdata Python SDK](https://github.com/hotdata-dev/sdk-python)
- [Hotdata API](https://www.hotdata.dev/docs/api-reference) · [Hotdata SQL](https://www.hotdata.dev/docs/sql)
- [Ibis](https://ibis-project.org/) · [Ibis backend hierarchy](https://ibis-project.org/concepts/backend-table-hierarchy.qmd)
70 changes: 70 additions & 0 deletions examples/04_ibis_table_workflows.py
@@ -0,0 +1,70 @@
#!/usr/bin/env python3
"""
Ibis table expressions on TPC-H, executed to pandas via Hotdata.

Hotdata SQL often uses a short federated prefix (e.g. ``tpch.tpch_sf1``) that may not
match the Ibis **catalog** string (connection id). Building from ``con.sql(...)`` keeps
qualifiers aligned with working ``SELECT ... FROM tpch.tpch_sf1.*`` queries.

From the repo root::

HOTDATA_TOKEN=... HOTDATA_WORKSPACE_ID=... \\
uv run python examples/04_ibis_table_workflows.py
"""

from __future__ import annotations

import sys
from pathlib import Path

_examples = Path(__file__).resolve().parent
sys.path.insert(0, str(_examples))

import ibis

from _helpers import connect_kwargs, parsed_args, parser

_argp = parser("Ibis table workflows → pandas (Hotdata / TPC-H).")
_ns = parsed_args(_argp)
con = ibis.hotdata.connect(**connect_kwargs(_ns))

# Federation prefix as in ``examples/02_execute_sql.py`` (not always == Ibis catalog id).
FED = "tpch.tpch_sf1"


def main() -> None:
    customer = con.sql(f"SELECT * FROM {FED}.customer", dialect="postgres")
    orders = con.sql(f"SELECT * FROM {FED}.orders", dialect="postgres")

    print("— project + limit —")
    q1 = customer.select("c_custkey", "c_name", "c_mktsegment").limit(5)
    print(con.compile(q1))
    print(q1.execute(), end="\n\n")

    print("— filter + limit —")
    q2 = customer.filter(customer.c_mktsegment == "AUTOMOBILE").limit(5)
    print(con.compile(q2))
    print(q2.execute(), end="\n\n")

    print("— group by segment —")
    q3 = customer.group_by(customer.c_mktsegment).agg(n=customer.count())
    print(con.compile(q3))
    print(q3.execute(), end="\n\n")

    print("— join customer to orders —")
    q4 = (
        customer.join(orders, customer.c_custkey == orders.o_custkey)
        .select(customer.c_name, orders.o_totalprice, orders.o_orderkey)
        .limit(8)
    )
    print(con.compile(q4))
    print(q4.execute(), end="\n\n")

    print("— scalar aggregate —")
    expr = customer.c_acctbal.sum()
    print(con.compile(expr))
    print(expr.execute())


if __name__ == "__main__":
    main()
9 changes: 5 additions & 4 deletions pyproject.toml
@@ -24,7 +24,7 @@ classifiers = [
]
dependencies = [
"ibis-framework>=10.0,<11",
"httpx>=0.27",
"hotdata>=0.1.0",
"pyarrow>=15",
"pyarrow-hotfix>=0.6",
"pandas>=2",
@@ -33,9 +33,10 @@ dependencies = [

[dependency-groups]
dev = [
"pytest>=8",
"pytest-httpserver>=1",
"ruff>=0.5",
"httpx>=0.27",
"pytest>=8",
"pytest-httpserver>=1",
"ruff>=0.5",
]

[project.urls]
41 changes: 37 additions & 4 deletions src/ibis_hotdata/backend.py
@@ -176,7 +176,7 @@ def do_connect(
timeout
HTTP timeout in seconds (per request).
verify_ssl
Passed to ``httpx`` (boolean or path to a CA bundle).
Passed through to the Hotdata SDK configuration (boolean or path to a CA bundle).
default_connection
Optional default **catalog** (Hotdata connection id). If omitted and the
workspace exposes exactly one connection, it is chosen automatically;
@@ -270,7 +270,7 @@ def _to_catalog_db_tuple(self, table_loc: sge.Table):
return sg_cat, sg_db

def _connection_ids(self) -> list[str]:
data = self._http.get_json("/v1/connections")
data = self._http.list_connections()
return [c["id"] for c in data["connections"]]

def list_catalogs(self, *, like: str | None = None) -> list[str]:
@@ -322,7 +322,7 @@ def _iterate_information_schema(
params["include_columns"] = include_columns
if cursor:
params["cursor"] = cursor
chunk = self._http.get_json("/v1/information_schema", params=params)
chunk = self._http.get_information_schema(params)
yield from chunk["tables"]
if not chunk.get("has_more"):
break
@@ -425,8 +425,41 @@ def _fetch_from_cursor(self, cursor, schema: sch.Schema) -> pd.DataFrame:
df = PandasData.convert_table(df, schema)
return df

def upload_file(self, data: bytes) -> dict[str, Any]:
"""POST ``/v1/files``; returns the upload record (use ``id`` with :meth:`create_dataset_from_upload`)."""
try:
return self._http.upload_file(data)
except HotdataAPIError as exc:
raise com.IbisError(str(exc)) from exc

def create_dataset_from_upload(
self,
upload_id: str,
label: str,
*,
table_name: str | None = None,
file_format: str = "csv",
) -> dict[str, Any]:
"""POST ``/v1/datasets`` with an upload source—materializes a queryable dataset table.

The response includes ``schema_name`` and ``table_name``. Reference the table in SQL as
``datasets.<schema_name>.<table_name>`` (see Hotdata ``datasets`` documentation).
"""
try:
return self._http.create_dataset_from_upload(
upload_id=upload_id,
label=label,
table_name=table_name,
file_format=file_format,
)
except HotdataAPIError as exc:
raise com.IbisError(str(exc)) from exc

def create_table(self, *_args: Any, **_kwargs: Any) -> ir.Table:
raise NotImplementedError("Hotdata backend does not implement create_table in v1.")
raise NotImplementedError(
"Hotdata does not implement Ibis create_table in v1; use upload_file + "
"create_dataset_from_upload, then SQL or con.table with the returned names."
)

def drop_table(self, *_args: Any, **_kwargs: Any) -> None:
raise NotImplementedError("Hotdata backend does not implement drop_table in v1.")