Commit 6c66835

eddietejeda and claude committed
fix: address PR review feedback
- backend.py: add _find_managed_connection helper that returns None for not-found vs raising IbisError; use it in create_database so real 5xx API failures are no longer swallowed by the broad `except IbisError: pass`
- backend.py: always overwrite _database_id in _table_location (drop the `or`) so both cached fields stay in sync when multiple managed databases are used
- backend.py: add explicit parens to api_conn ternary in get_schema for clarity
- backend.py: document the database_id parameter in do_connect docstring
- README.md: rewrite as user-facing docs: quick start first, plain language, no private method calls in examples, support table replaces spec-style prose

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
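
The first item in the commit message describes a common pattern: a lookup helper returns `None` when the resource does not exist and lets every other failure propagate, so the caller no longer needs a broad `except ...: pass`. The following is a minimal sketch of that pattern, not the backend's actual code. `ApiError`, `find_connection`, `create_database`, and the `fetch` / `create` callables are hypothetical stand-ins for `IbisError`, `_find_managed_connection`, and the Hotdata API calls, and treating "already exists" as an error is an assumption made only for the example.

```python
from typing import Callable, Optional


class ApiError(Exception):
    """Hypothetical stand-in for an API failure carrying an HTTP status."""

    def __init__(self, status: int, message: str = "") -> None:
        super().__init__(message or f"HTTP {status}")
        self.status = status


def find_connection(fetch: Callable[[str], dict], name: str) -> Optional[dict]:
    """Return the connection, or None if it does not exist.

    Only a 404 is treated as "not found"; any other failure (for example a
    5xx) propagates to the caller instead of being swallowed.
    """
    try:
        return fetch(name)
    except ApiError as exc:
        if exc.status == 404:
            return None
        raise


def create_database(
    fetch: Callable[[str], dict],
    create: Callable[[str], dict],
    name: str,
) -> dict:
    """Create the database only when the lookup reports it as missing."""
    existing = find_connection(fetch, name)
    if existing is not None:
        # Assumption for this sketch: an existing database is an error.
        raise ValueError(f"database {name!r} already exists")
    return create(name)
```

With this shape, a server outage during the lookup surfaces as an `ApiError` from `create_database` rather than being mistaken for "database not found".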
1 parent d43d8a5 commit 6c66835

2 files changed

Lines changed: 137 additions & 128 deletions

README.md

Lines changed: 106 additions & 113 deletions
Original file line number | Diff line number | Diff line change
@@ -1,91 +1,109 @@
11
# hotdata-ibis
22

3-
Experimental [Ibis](https://ibis-project.org/) backend for [Hotdata](https://www.hotdata.dev/docs/api-reference): compile expressions with Ibis, run federated SQL over the Hotdata API. REST calls use the official **[hotdata](https://github.com/hotdata-dev/sdk-python)** Python SDK. Repo examples use **httpx** (listed under the **dev** dependency group).
3+
Use [Ibis](https://ibis-project.org/) to query and upload data in your [Hotdata](https://www.hotdata.dev/docs/api-reference) workspace — write Python expressions instead of SQL, get pandas or Arrow results back.
44

55
**Requirements:** Python 3.10+, **ibis-framework** 10.x, **hotdata** ≥0.2.3.
66

77
## Install
88

99
```bash
1010
uv pip install hotdata-ibis
11-
# or: python -m pip install hotdata-ibis
11+
# or: pip install hotdata-ibis
1212
```
1313

14-
## Features
14+
## Quick start
1515

16-
- **Ibis connection API** — connect with `ibis.hotdata.connect(...)` or `ibis.connect("hotdata://...")`.
17-
- **Hotdata catalog mapping** — expose Hotdata connections, schemas, and tables through Ibis catalogs, databases, and tables.
18-
- **SQL-backed expression execution** — compile Ibis expressions with the Postgres SQLGlot compiler and execute them through Hotdata query APIs.
19-
- **Typed table discovery** — load schema metadata from Hotdata information schema and map SQL types into Ibis types. Both SQL-style names (`INTEGER`, `VARCHAR`) and Arrow-style names (`Float64`, `Utf8`) returned by Parquet/managed tables are handled.
20-
- **Arrow and pandas results** — materialize expressions as pandas DataFrames, PyArrow tables, or local Arrow record batches.
21-
- **Raw SQL escape hatch** — use `con.sql(..., dialect="postgres")` when Hotdata-specific federated SQL is clearer than modeled Ibis expressions.
22-
- **Managed database writes** — create managed connections with `create_database`, load local pandas or PyArrow data through `create_table`, and clean up with `drop_table` / `drop_database`.
16+
```python
17+
import ibis
2318

24-
## Connect
19+
con = ibis.hotdata.connect(
20+
api_url="https://api.hotdata.dev",
21+
token="YOUR_API_TOKEN",
22+
workspace_id="ws_…",
23+
)
2524

26-
Programmatic API:
25+
# List available tables
26+
con.list_tables()
2727

28-
```python
29-
import ibis
28+
# Query with Ibis expressions
29+
t = con.table("customer", database=("my_connection", "tpch_sf1"))
30+
df = (
31+
t.filter(t.c_mktsegment == "AUTOMOBILE")
32+
.select("c_custkey", "c_name")
33+
.limit(100)
34+
.execute() # returns a pandas DataFrame
35+
)
36+
```
37+
38+
## Connect
3039

40+
```python
3141
con = ibis.hotdata.connect(
3242
api_url="https://api.hotdata.dev",
3343
token="YOUR_API_TOKEN",
3444
workspace_id="ws_…",
35-
session_id=None, # optional: X-Session-Id (sandbox)
36-
verify_ssl=True,
45+
default_connection="my_connection", # skip qualifying every table reference
46+
default_schema="public", # skip qualifying every table reference
47+
session_id=None, # optional sandbox session
3748
timeout=120.0,
38-
default_connection=None, # Hotdata connection id → Ibis catalog
39-
default_schema=None, # remote schema → Ibis database
49+
verify_ssl=True,
4050
poll_interval_s=0.25,
4151
poll_timeout_s=600.0,
4252
)
4353
```
4454

45-
URL style (token may live in the query string or the URL password segment):
55+
URL style also works — token can go in the query string or the URL password segment:
4656

4757
```python
48-
con = ibis.connect(
49-
"hotdata://api.hotdata.dev/?token=…&workspace_id=ws_…&verify_ssl=true"
50-
)
58+
con = ibis.connect("hotdata://api.hotdata.dev/?token=…&workspace_id=ws_…")
5159
```
5260

53-
**Mapping:** Ibis **catalog** = Hotdata connection id; **database** = remote schema; **table** = table name. SQL references look like `connection.schema.table`. With a single connection and schema, defaults are inferred; otherwise set `default_connection` / `default_schema` or qualify `con.table(..., database=(conn_id, schema))`.
61+
**Table addressing:** Hotdata organizes data as `connection → schema → table`. In Ibis terms that maps to `catalog → database → table`. With a single connection and schema, defaults are inferred automatically. For multiple connections or schemas, pass `database=(connection_id, schema)` when referencing a table, or set `default_connection` / `default_schema` at connect time.
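
The resolution order described above can be sketched as a small pure function. This is an illustration only, not the backend's code: `resolve_table` and its parameters are hypothetical names, and the automatic inference for a single connection and schema is left out.

```python
from typing import Optional, Tuple


def resolve_table(
    table: str,
    database: Optional[Tuple[str, str]] = None,
    default_connection: Optional[str] = None,
    default_schema: Optional[str] = None,
) -> Tuple[str, str, str]:
    """Return (catalog, database, table), i.e. (connection, schema, table).

    An explicit database=(connection_id, schema) wins; otherwise the
    connect-time defaults are used. Missing parts raise ValueError.
    """
    if database is not None:
        connection, schema = database
    else:
        connection, schema = default_connection, default_schema
    if not connection or not schema:
        raise ValueError(
            f"cannot resolve {table!r}: pass database=(connection_id, schema) "
            "or set default_connection / default_schema"
        )
    return connection, schema, table
```

For example, `resolve_table("orders", database=("my_connection", "public"))` yields the triple `("my_connection", "public", "orders")`, which corresponds to the SQL reference `my_connection.public.orders`.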
5462

55-
> **Managed databases:** SQL and Ibis expressions against managed database tables use `"default"` as the catalog rather than the connection id. The backend resolves this automatically — see [Managed databases](#managed-databases) below.
63+
## Querying
5664

57-
**Execution:** SQL is compiled with Ibis’s **Postgres** SQLGlot compiler. The client submits queries asynchronously with `POST /v1/query`, polls `GET /v1/query-runs/{id}`, then downloads ready results as Arrow IPC from `GET /v1/results/{id}`. Tuning: `poll_interval_s`, `poll_timeout_s` on `connect()`.
65+
### Ibis expressions
5866

59-
**Types:** Typed tables come from Hotdata’s information schema. `con.sql(...)` types are inferred from a small preview query and Arrow schema. Both SQL-style names (`INTEGER`, `DOUBLE PRECISION`) and Arrow-style names (`Float64`, `Utf8`, `Date32`) returned by Parquet/managed tables are supported; see [Hotdata SQL](https://www.hotdata.dev/docs/sql) for server behavior.
67+
```python
68+
t = con.table("orders")
69+
70+
# Filter, select, aggregate — all run as SQL on Hotdata
71+
summary = (
72+
t.filter(t.status == "shipped")
73+
.group_by("region")
74+
.agg(total=t.amount.sum(), n=t.count())
75+
.order_by("total", ascending=False)
76+
.execute()
77+
)
78+
```
6079

61-
## Ibis Support Overview
80+
`.execute()` returns a **pandas DataFrame**. Use `.to_pyarrow()` for an Arrow table or `.to_pyarrow_batches()` for a record batch reader.
6281

63-
`hotdata-ibis` is a read-oriented SQL backend. It is useful for exploring Hotdata workspaces with Ibis expressions, running federated SQL, and materializing results locally, but it is not a full mutable database backend.
82+
### Raw SQL
6483

65-
Supported today:
84+
When you need Hotdata-specific syntax, federated table names, or SQL that Ibis doesn't model:
6685

67-
- **Connection setup:** `ibis.hotdata.connect(...)` and `ibis.connect("hotdata://...")` with token, workspace, optional sandbox session, TLS, timeout, and polling settings.
68-
- **Catalog discovery:** `list_catalogs`, `list_databases`, `list_tables`, `current_catalog`, and `current_database` map Hotdata connections and remote schemas into Ibis' catalog/database/table hierarchy.
69-
- **Table schemas:** `con.table(...)` uses Hotdata information schema column metadata and maps SQL types through Ibis' Postgres type parser.
70-
- **SQL-backed expressions:** Ibis expressions compile with the Postgres SQLGlot compiler and execute through Hotdata. Common `SELECT` workloads such as projection, filtering, joins, grouping, aggregation, ordering, limits, scalar expressions, and `con.sql(...)` work when the generated SQL is accepted by Hotdata.
71-
- **Result materialization:** `.execute()` returns pandas objects. `.to_pyarrow()` and `.to_pyarrow_batches()` use the Arrow IPC result data exposed by Hotdata without converting through JSON rows; batches are split locally after the result is downloaded.
72-
- **Raw SQL escape hatch:** `con.sql("SELECT ...", dialect="postgres")` is the most reliable way to use Hotdata-specific federated table names or SQL that Ibis does not model directly.
73-
- **Managed database lifecycle:** `create_database("sales", schema="public", tables=["orders"])` provisions a managed connection (Ibis catalog). `create_table("orders", pandas_df, database=("sales", "public"))` uploads Parquet and loads it. Query using `database=("default", "public")` or the `"default"."public"."orders"` SQL prefix. `drop_table` clears a managed table; `drop_database` deletes the connection. See [Managed databases](#managed-databases) for a complete example.
74-
- **Parquet uploads:** `create_table` accepts pandas DataFrames, PyArrow tables, or schema-only empty tables. Tables must live in a managed connection — declare them with `create_database(..., tables=[...])` first. Loads are asynchronous; poll `_managed_table_synced(conn_id, schema, table)` if you need to query immediately. Loads always use replace mode; pass `overwrite=True` to replace an existing synced table (the default `overwrite=False` raises if the table already exists).
86+
```python
87+
df = con.sql(
88+
"SELECT region, SUM(amount) AS total FROM my_conn.public.orders GROUP BY region",
89+
dialect="postgres",
90+
).execute()
91+
```
92+
93+
You can chain Ibis expressions on the result of `con.sql(...)` the same way you would on `con.table(...)`.
7594

76-
Not supported as full Ibis backend features:
95+
### Discover what's available
7796

78-
- **General DDL and mutations:** Arbitrary remote DDL, inserts, updates, deletes, and schema-altering operations on external connections are not implemented. Managed-database writes are limited to `create_database`, `create_table`, `drop_table`, and `drop_database` as described above.
79-
- **Temporary tables and in-memory registration:** `supports_temporary_tables` is false, and in-memory tables are not uploaded automatically for joins.
80-
- **Python UDFs:** `supports_python_udfs` is false.
81-
- **Transactions and sessions as database state:** Hotdata sandbox sessions can be passed as `session_id`, but the backend does not expose transaction APIs.
82-
- **Backend-native SQL dialect:** Compilation uses Ibis' Postgres dialect as the closest fit. Hotdata SQL and federation rules are authoritative, so not every Ibis expression that compiles is guaranteed to execute remotely.
83-
- **Complete Ibis compliance:** The backend is experimental and has focused test coverage for connection, discovery, schema mapping, execution, uploads, and Arrow results. It has not yet been validated against the full Ibis backend test suite.
84-
- **Hotdata platform APIs beyond SQL and managed databases:** embeddings, indexes, query history management, sandbox lifecycle management, and other Hotdata-specific APIs are outside the Ibis backend surface.
97+
```python
98+
con.list_catalogs() # Hotdata connection ids
99+
con.list_databases(catalog="my_connection") # schemas for a connection
100+
con.list_tables(database=("my_connection", "public"))
101+
con.get_schema("orders", catalog="my_connection", database="public")
102+
```
85103

86104
## Managed databases
87105

88-
Managed databases are temporary, workspace-owned connections for uploading and querying your own data. Tables must be declared at creation time, loads are asynchronous, and SQL uses `"default"` as the catalog (not the raw connection id).
106+
Managed databases let you upload your own data (pandas DataFrames or PyArrow tables) and query it alongside your other Hotdata connections. They are provisioned on demand and scoped to your workspace.
89107

90108
```python
91109
import time
@@ -98,63 +116,69 @@ con = ibis.hotdata.connect(
98116
workspace_id="ws_…",
99117
)
100118

101-
# 1. Create the managed database and declare tables upfront.
102-
# Tables must be declared here — load_managed_table rejects undeclared names.
119+
# 1. Create the database and declare which tables you'll upload.
120+
# Table names must be declared here — uploads to undeclared names are rejected.
103121
con.create_database("my-dataset", schema="public", tables=["orders"])
104122

105-
# 2. Resolve the database id + underlying connection id.
106-
db = con._resolve_managed_connection("my-dataset")
107-
db_id = db["id"] # "dbid…"
108-
conn_id = db["default_connection_id"] # "conn…"
109-
110-
# 3. Upload data (pandas DataFrame or PyArrow table).
123+
# 2. Upload data.
111124
df = pd.DataFrame({"order_id": [1, 2, 3], "amount": [9.99, 49.99, 5.00]})
112-
con.create_table("orders", df, database=(db_id, "public"), overwrite=True)
125+
con.create_table("orders", df, database=("my-dataset", "public"), overwrite=True)
113126

114-
# 4. Loads are async — wait for the table to sync before querying.
115-
while not con._managed_table_synced(conn_id, "public", "orders"):
116-
time.sleep(1)
127+
# 3. Uploads are asynchronous — wait a moment before querying.
128+
time.sleep(2)
117129

118-
# 5. Query with Ibis expressions.
119-
# Use database=("default", schema) — managed databases require "default" as the
120-
# SQL catalog; the backend resolves the underlying connection automatically.
130+
# 4. Query with Ibis expressions.
131+
# Managed tables use "default" as the catalog — the backend handles this automatically.
121132
t = con.table("orders", database=("default", "public"))
122133
result = t.filter(t.amount > 10).order_by("amount").execute()
123134

124-
# 6. Or with raw SQL (same "default" catalog prefix).
125-
result = con.sql('SELECT sum(amount) AS total FROM "default"."public"."orders"').execute()
135+
# 5. Or with raw SQL.
136+
result = con.sql('SELECT SUM(amount) AS total FROM "default"."public"."orders"').execute()
126137

127-
# 7. Clean up.
138+
# 6. Clean up.
139+
con.drop_table("orders", database=("my-dataset", "public"))
128140
con.drop_database("my-dataset")
129141
```
130142

131-
**Key points:**
132-
- `create_database(..., tables=[...])` — table names must be listed here before uploading.
133-
- `create_table(..., database=(db_id, schema))` — pass the managed database id (from `_resolve_managed_connection`) as the first element of the tuple, not the connection id.
134-
- SQL catalog is `"default"`, not the connection id — `"default"."schema"."table"` is the correct form.
135-
- After `create_table`, ibis table references automatically use `database=("default", schema)`; use the same form for subsequent `con.table(...)` calls.
136-
- Loads are asynchronous. Poll `_managed_table_synced(conn_id, schema, table)` or add a small sleep before querying.
143+
**Things to know:**
144+
- Declare all table names in `create_database(..., tables=[...])` before uploading — you can't add them later without recreating the database.
145+
- Use `database=("my-dataset", schema)` when uploading (`create_table`) or dropping tables (`drop_table`).
146+
- Use `database=("default", schema)` when querying — managed tables always use `"default"` as the SQL catalog prefix.
147+
- `create_table` accepts pandas DataFrames, PyArrow tables, or an Ibis schema for creating an empty table.
148+
- Uploads use replace mode. Pass `overwrite=True` to replace a table that already exists; without it, uploading to an existing table raises an error.
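
Because uploads are asynchronous, a fixed sleep can be too short for large tables. One alternative is to poll with a deadline. The sketch below is generic: `wait_until` and the `is_ready` callable are hypothetical, and whatever you check inside `is_ready` is an assumption about how readiness can be observed, not documented backend behavior.

```python
import time
from typing import Callable


def wait_until(
    is_ready: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 0.5,
) -> None:
    """Poll is_ready() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"not ready after {timeout_s} seconds")
        time.sleep(interval_s)
```

A possible use is `wait_until(lambda: "orders" in con.list_tables(database=("default", "public")))`, assuming a synced table shows up in that listing. Using `time.monotonic()` keeps the deadline unaffected by system clock adjustments.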
149+
150+
## What's supported
151+
152+
| Feature | Status |
153+
|---|---|
154+
| `list_catalogs`, `list_databases`, `list_tables` | ✅ |
155+
| `con.table(...)` with full schema metadata | ✅ |
156+
| Ibis expressions: filter, select, join, group\_by, agg, order\_by, limit | ✅ |
157+
| `con.sql(...)` raw SQL | ✅ |
158+
| `.execute()` → pandas, `.to_pyarrow()`, `.to_pyarrow_batches()` | ✅ |
159+
| `create_database` / `drop_database` (managed) | ✅ |
160+
| `create_table` / `drop_table` (managed, Parquet upload) | ✅ |
161+
| Temporary tables | ❌ |
162+
| Python UDFs | ❌ |
163+
| INSERT / UPDATE / DELETE on external connections | ❌ |
164+
165+
SQL compilation uses Ibis's Postgres dialect as the closest fit. Most common `SELECT` workloads run fine; complex expressions may generate SQL that Hotdata doesn't support — use `con.sql(...)` as a fallback.
137166

138167
## Development
139168

140169
```bash
141-
uv sync # installs dev group by default (pytest, ruff, httpx for examples)
170+
uv sync # installs dev group (pytest, ruff, httpx)
142171
uv run pytest
143-
uv run ruff check src tests examples
172+
uv run ruff check src tests
144173
```
145174

146-
Lockfile CI: `uv sync --locked && uv run pytest`.
147-
148-
## TPC-H for the examples
149-
150-
Examples assume something like **`tpch.tpch_sf1.customer`**. Provision TPC-H in your workspace (commonly a **DuckDB** connection, then DuckDB’s `tpch` extension and `CALL dbgen(sf = 1)` — see [DuckDB TPC-H](https://www.duckdb.org/docs/current/core_extensions/tpch.html) and [Hotdata Quick Start](https://www.hotdata.dev/docs/quick-start)). If your data lives under `main` instead, pass `--default-schema` / `--default-connection` or set `HOTDATA_DEFAULT_*` (see `examples/_helpers.py`).
175+
CI: `uv sync --locked && uv run pytest`.
151176

152177
## Examples
153178

154-
Needs `HOTDATA_API_KEY` and `HOTDATA_WORKSPACE`.
179+
Set your credentials, then run any example script:
155180

156181
```bash
157-
uv sync
158182
export HOTDATA_API_KEY=…
159183
export HOTDATA_WORKSPACE=…
160184
uv run python examples/01_catalog_introspection.py
@@ -163,41 +187,10 @@ uv run python examples/03_connect_via_url.py
163187
uv run python examples/04_ibis_table_workflows.py
164188
```
165189

166-
### Ibis tables → pandas DataFrames
167-
168-
Calling **`.execute()`** on a table expression runs the compiled SQL on Hotdata and returns a **pandas** `DataFrame` (Ibis’s default for this backend).
169-
170-
Hotdata’s SQL often uses a **federated prefix** (for example `tpch.tpch_sf1`) that may not match the Ibis **catalog** string (the connection id). A reliable pattern is to start from **`con.sql("SELECT * FROM tpch.tpch_sf1.mytable", dialect="postgres")`**, then chain filters and aggregates—see **`examples/04_ibis_table_workflows.py`**.
171-
172-
When **`con.table("mytable")`** is enough (single connection/schema and names align with compiled SQL), the same operations apply:
173-
174-
```python
175-
t = con.table("customer") # or con.table("customer", database=(conn_id, "tpch_sf1"))
176-
177-
df = (
178-
t.filter(t.c_mktsegment == "AUTOMOBILE")
179-
.select("c_custkey", "c_name")
180-
.limit(100)
181-
.execute()
182-
)
183-
184-
by_seg = t.group_by(t.c_mktsegment).agg(n=t.count()).execute()
185-
186-
o = con.table("orders")
187-
orders_with_names = (
188-
t.join(o, t.c_custkey == o.o_custkey)
189-
.select(t.c_name, o.o_totalprice)
190-
.limit(50)
191-
.execute()
192-
)
193-
194-
total = t.c_acctbal.sum().execute()
195-
```
196-
197-
Other useful paths: **`.to_pyarrow()`** / **`.to_pyarrow_batches()`** for Arrow; **`con.sql("SELECT …", dialect="postgres")`** then chain the returned table expression.
190+
The examples assume a TPC-H dataset at `tpch.tpch_sf1`. To provision it: create a DuckDB connection in Hotdata, then run `CALL dbgen(sf = 1)` using DuckDB's [tpch extension](https://duckdb.org/docs/extensions/tpch.html).
198191

199192
## References
200193

201194
- [Hotdata Python SDK](https://github.com/hotdata-dev/sdk-python)
202-
- [Hotdata API](https://www.hotdata.dev/docs/api-reference) · [Hotdata SQL](https://www.hotdata.dev/docs/sql)
203-
- [Ibis](https://ibis-project.org/) · [Ibis backend hierarchy](https://ibis-project.org/concepts/backend-table-hierarchy.qmd)
195+
- [Hotdata API reference](https://www.hotdata.dev/docs/api-reference) · [Hotdata SQL](https://www.hotdata.dev/docs/sql)
196+
- [Ibis documentation](https://ibis-project.org/) · [Ibis backend concepts](https://ibis-project.org/concepts/backend-table-hierarchy.qmd)
