Commit 1f9f5eb

feat: managed database demo flow — explicit flags, catalog resolution, BM25 search (#127)
* feat(databases): explicit flags for load, unset command, and star in list
  - Replace dot-notation positional <TARGET> on `databases load` with explicit --catalog, --schema, --table flags
  - Add `databases unset` to clear the active database from config
  - Show * marker on the active database in `databases list`
  - Remove parse_db_target and its tests (no longer needed)
* feat(databases): resolve database by catalog alias; auto-declare table on load
  - try_resolve_database now falls through to match by default_catalog when neither id nor name match, so --catalog works as a lookup key everywhere
  - databases load auto-recovers from "not declared": deletes the empty database, recreates it with the table declared, then retries the load
  - Add default_catalog to DatabaseSummary so the list response can be matched without a per-row fetch
* fix(indexes,search): resolve catalog aliases in connection lookup; fix duplicate score column
  - resolve_connection_id falls back to managed database catalog lookup so `airbnb4.listings[description]` works in indexes create and search
  - BM25 search no longer appends 'score' when --select already includes it
* refactor(indexes): replace dot/bracket notation with explicit --catalog/--schema/--table/--column flags
  Removes the positional `connection.table[col1,col2]` target argument and parse_index_target helper. All index creation now uses named flags, consistent with databases load and search.
* fix: prefer active database connection when resolving catalog name
* docs: update README and skills to reflect new CLI syntax
  - databases load: explicit --catalog/--schema/--table flags (no more dot-notation)
  - databases list: note * marker on active database
  - databases set/unset: documented
  - indexes create: --catalog option for managed databases (in addition to --connection-id)
  - search: --type and --column are now optional (inferred from indexes)
  - workflows: updated examples throughout
* fix: address PR review feedback
  - auto-declare: collect existing tables before delete+recreate so they are preserved in the new database; also pass expires_at through
  - databases create hint: update to new --catalog/--table flag syntax
  - api.rs: fix workspace_id() doc comment placement
* fix: warn before auto-declare when existing tables have synced data
  When a 'not declared' error triggers delete+recreate, check if any existing tables are synced. If so, show a yellow warning listing the tables whose data will be lost and prompt for y/N confirmation. In non-interactive mode (CI, piped stdin, --no-input) the command errors out with a clear message instead of silently destroying data.
* docs: update hotdata-analytics skill to new databases load and --column flag syntax

---------

Co-authored-by: Eddie A Tejeda <669988+eddietejeda@users.noreply.github.com>
1 parent af4a090 commit 1f9f5eb
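
The fallback order described for `try_resolve_database` (match by id, then by name, then by `default_catalog`) can be illustrated with a small shell sketch. This is not the CLI's implementation: the `resolve_database` function and the tab-separated `id<TAB>name<TAB>default_catalog` input format are assumptions made for the example, and the relative priority of id over name is also assumed.

```sh
# Hypothetical helper (not part of hotdata): print the id of the database
# that matches key $1, preferring an id match, then a name match, then a
# default_catalog match. Rows arrive on stdin as id<TAB>name<TAB>default_catalog.
resolve_database() {
  awk -F '\t' -v key="$1" '
    $1 == key && by_id   == "" { by_id   = $1 }
    $2 == key && by_name == "" { by_name = $1 }
    $3 == key && by_cat  == "" { by_cat  = $1 }
    END {
      if (by_id != "")        print by_id
      else if (by_name != "") print by_name
      else if (by_cat != "")  print by_cat
      else exit 1   # no match on any of the three keys
    }'
}
```

Because the catalog match is consulted only when neither id nor name matched, a database whose name equals the key is chosen over another database whose catalog alias equals the same key.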

13 files changed

Lines changed: 350 additions & 276 deletions

README.md

Lines changed: 28 additions & 17 deletions
Original file line number | Diff line number | Diff line change
@@ -135,28 +135,35 @@ Managed databases are Hotdata-owned catalogs you create and populate yourself (n
135135
```sh
136136
hotdata databases list [-w <id>] [-o table|json|yaml]
137137
hotdata databases create [--name <display_name>] [--catalog <alias>] [--table <table> ...] [--schema public] [--expires-at <duration|timestamp>] [-o table|json|yaml]
138+
hotdata databases set <id>
139+
hotdata databases unset
138140
hotdata databases <name_or_id> [-o table|json|yaml]
139141
hotdata databases delete <name_or_id>
140142
hotdata databases run [--database <id>] [--name <label>] [--schema public] [--table <table> ...] [--expires-at <duration|timestamp>] <cmd> [args...]
141143
hotdata databases <id> run <cmd> [args...]
142144

143-
hotdata databases tables list <database> [--schema <name>] [-o table|json|yaml]
144-
hotdata databases tables load <database> <table> --file ./data.parquet [--schema public]
145-
hotdata databases tables load <database> <table> --upload-id <id> [--schema public]
146-
hotdata databases tables delete <database> <table> [--schema public]
145+
# Preferred: load by catalog alias (auto-declares table if needed)
146+
hotdata databases load --catalog <alias> --table <table> [--schema public] (--file <path> | --url <url> | --upload-id <id>)
147+
148+
# Also available: explicit database flag
149+
hotdata databases tables list [--database <id_or_name>] [--schema <name>] [-o table|json|yaml]
150+
hotdata databases tables load <table> [--database <id_or_name>] [--schema public] (--file <path> | --url <url> | --upload-id <id>)
151+
hotdata databases tables delete <table> [--database <id_or_name>] [--schema public]
147152
```
148153

149-
- `create` registers a managed connection with no external credentials. `--name` is a human-readable display name; `--catalog` sets the SQL alias used in queries (`SELECT … FROM <catalog>.schema.table`) and must be `[a-z_][a-z0-9_]*`. Use `--table` to declare tables up front (required before `tables load` on the current API).
154+
- `create` registers a managed connection with no external credentials. `--name` is a human-readable display name; `--catalog` sets the SQL alias used in queries (`SELECT … FROM <catalog>.schema.table`) and must be `[a-z_][a-z0-9_]*`.
155+
- `set` / `unset` — save or clear the active database. All `databases tables` and `context` commands default to it. The active database is marked with `*` in `databases list`.
156+
- `load` (top-level shorthand) — loads a parquet file into `--catalog.--schema.--table`. If the table was not declared at create time, the CLI automatically deletes and recreates the database with the table declared, then retries the load.
150157
- `tables load` uploads a **parquet** file (or uses a staged `upload_id` from `POST /v1/files`) and publishes it as the table generation (`replace` mode).
151-
- `run` mints a database-scoped JWT and execs `<cmd>` with `HOTDATA_DATABASE_TOKEN`, `HOTDATA_DATABASE_REFRESH_TOKEN`, `HOTDATA_DATABASE`, `HOTDATA_WORKSPACE`, and `HOTDATA_API_URL` injected into its environment. Pass a database id (group-positional `<id>` like `sandbox run`, or `--database <id>`) to scope an existing database; omit both to auto-create a scratch one using `--name` / `--schema` / `--table` / `--expires-at`. Useful for launching an agent or child process whose API access is restricted to a single database.
158+
- `run` mints a database-scoped JWT and execs `<cmd>` with `HOTDATA_DATABASE_TOKEN`, `HOTDATA_DATABASE_REFRESH_TOKEN`, `HOTDATA_DATABASE`, `HOTDATA_WORKSPACE`, and `HOTDATA_API_URL` injected into its environment.
152159
- For CSV/JSON uploads without a managed database, use `hotdata datasets create` instead (`datasets.main.*`).
153160

154161
Example:
155162

156163
```sh
157-
hotdata databases create --name "Sales reporting" --catalog sales --table orders
158-
hotdata databases tables load sales orders --file ./orders.parquet
159-
hotdata query "SELECT count(*) FROM sales.public.orders"
164+
hotdata databases create --catalog airbnb
165+
hotdata databases load --catalog airbnb --table listings --url https://example.com/listings.parquet
166+
hotdata query "SELECT count(*) FROM airbnb.public.listings"
160167
```
161168

162169
## Tables
@@ -233,14 +240,14 @@ hotdata queries <query_run_id> [-o table|json|yaml]
233240

234241
## Search
235242

236-
`--type` is **required** — no default. Pass either `vector` (similarity search via the index's embedding provider) or `bm25` (full-text search). Both run entirely server-side.
243+
Both run entirely server-side. `--type` and `--column` are **optional** when the table has exactly one search index — they are inferred automatically. Pass them explicitly when multiple indexes exist.
237244

238245
```sh
239246
# BM25 full-text search (requires a BM25 index on the column)
240-
hotdata search "<query>" --type bm25 --table <connection.schema.table> --column <column> [--select <columns>] [--limit <n>] [-o table|json|csv]
247+
hotdata search "<query>" --table <connection.schema.table> [--type bm25] [--column <column>] [--select <columns>] [--limit <n>] [-o table|json|csv]
241248

242249
# Vector search (requires a vector index with auto-embedding on the column)
243-
hotdata search "<query>" --type vector --table <table> --column <source_text_column> [--limit <n>]
250+
hotdata search "<query>" --table <table> [--type vector] [--column <source_text_column>] [--limit <n>]
244251
```
245252

246253
- **`--type vector`** — pass your query as **plain text**, name the **source text column** (e.g. `title`). The server embeds the query at the same time, using the same provider that auto-embedded the column when the index was built — so distance metric, model, and dimensions all match automatically. No `OPENAI_API_KEY`, no client-side embedding, no need to know about the auto-generated `_embedding` column. Generated SQL: `vector_distance(col, 'query')` server-side.
@@ -255,17 +262,21 @@ hotdata search "<query>" --type vector --table <table> --column <source_text_col
255262
Indexes attach to either a connection-table (`--connection-id` + `--schema` + `--table`) or a dataset (`--dataset-id`). The two scopes are mutually exclusive.
256263

257264
```sh
258-
# Connection-table scope
265+
# Managed database scope (catalog alias resolves via active database)
266+
hotdata indexes create --catalog <alias> --schema <schema> --table <table> \
267+
--column <cols> --type bm25|vector|sorted \
268+
[--name <name>] [--metric l2|cosine|dot] [--async] \
269+
[--embedding-provider-id <id>] [--dimensions <n>] [--output-column <name>] [--description <text>]
270+
271+
# Connection-table scope (for non-managed connections)
259272
hotdata indexes list --connection-id <id> --schema <schema> --table <table> [-o table|json|yaml]
260273
hotdata indexes create --connection-id <id> --schema <schema> --table <table> \
261-
--name <name> --columns <cols> --type sorted|bm25|vector \
262-
[--metric l2|cosine|dot] [--async] \
263-
[--embedding-provider-id <id>] [--dimensions <n>] [--output-column <name>] [--description <text>]
274+
--column <cols> --type sorted|bm25|vector [--name <name>] ...
264275
hotdata indexes delete --connection-id <id> --schema <schema> --table <table> --name <name>
265276

266277
# Dataset scope
267278
hotdata indexes list --dataset-id <id> [-o table|json|yaml]
268-
hotdata indexes create --dataset-id <id> --name <name> --columns <cols> --type sorted|bm25|vector ...
279+
hotdata indexes create --dataset-id <id> --column <cols> --type sorted|bm25|vector [--name <name>] ...
269280
hotdata indexes delete --dataset-id <id> --name <name>
270281
```
271282
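
The README states that a `--catalog` alias must match `[a-z_][a-z0-9_]*`. A script that generates aliases might check them before calling `hotdata databases create`; the sketch below shows one way to do that in shell. The function name `is_valid_catalog_alias` is hypothetical, and the check is only a client-side convenience under the assumption that the documented pattern is the complete rule.

```sh
# Hypothetical pre-flight check (not part of hotdata): succeeds only when $1
# matches the documented alias pattern [a-z_][a-z0-9_]*.
# LC_ALL=C keeps the [a-z] range limited to ASCII lowercase letters.
is_valid_catalog_alias() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z_][a-z0-9_]*$'
}

# Example use in a script:
#   is_valid_catalog_alias "$alias" && hotdata databases create --catalog "$alias"
```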

skills/hotdata-analytics/SKILL.md

Lines changed: 3 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -89,9 +89,8 @@ hotdata results <result_id> [--workspace-id <workspace_id>] [--output table|json
8989
Or managed parquet:
9090

9191
```bash
92-
hotdata databases create --name analytics --table slice
93-
hotdata databases set <returned-id>
94-
hotdata databases tables load slice --file ./slice.parquet
92+
hotdata databases create --catalog analytics
93+
hotdata databases load --catalog analytics --table slice --file ./slice.parquet
9594
```
9695

9796
3. **Chain query** — use printed **`full_name`** or `datasets list` **FULL NAME** column:
@@ -113,7 +112,7 @@ For equality, range, and sort-heavy OLAP — not full-text or vector (see **`hot
113112

114113
```bash
115114
hotdata indexes create --connection-id <id> --schema <schema> --table <table> \
116-
--name idx_orders_created --columns created_at --type sorted [--async]
115+
--name idx_orders_created --column created_at --type sorted [--async]
117116
```
118117

119118
List and delete use the same `hotdata indexes` commands as in the search skill; only **`--type sorted`** is the analytics focus here.

skills/hotdata-analytics/references/WORKFLOWS.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -76,8 +76,8 @@ hotdata datasets create --label "from saved" --query-id <query_id> [--table-name
7676
**Managed database** (parquet → `<database>.<schema>.<table>`):
7777

7878
```bash
79-
hotdata databases create --name chain_db --table revenue_slice
80-
hotdata databases tables load chain_db revenue_slice --file ./revenue_slice.parquet
79+
hotdata databases create --catalog chain_db
80+
hotdata databases load --catalog chain_db --table revenue_slice --file ./revenue_slice.parquet
8181
```
8282

8383
Note the printed **`full_name`** (e.g. `datasets.main.chain_revenue_slice` or `chain_db.public.revenue_slice`). For datasets, **`FULL NAME`** from `datasets list` is authoritative.

skills/hotdata-search/SKILL.md

Lines changed: 15 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -16,15 +16,15 @@ Retrieval workloads in Hotdata: **BM25 full-text**, **vector similarity**, and t
1616

1717
## Search CLI
1818

19-
`--type` is **required**: `bm25` or `vector`. Both run server-side.
19+
Both run server-side. `--type` and `--column` are **optional** when the table has exactly one search index — they are inferred automatically. Specify them when multiple indexes exist.
2020

2121
```bash
2222
# BM25 (requires a BM25 index on the column)
23-
hotdata search "<query>" --type bm25 --table <connection.schema.table> --column <column> \
23+
hotdata search "<query>" --table <connection.schema.table> [--type bm25] [--column <column>] \
2424
[--select <columns>] [--limit <n>] [--workspace-id <workspace_id>] [--output table|json|csv]
2525

2626
# Vector (requires a vector index; server auto-embeds the query text)
27-
hotdata search "<query>" --type vector --table <connection.schema.table> --column <source_text_column> \
27+
hotdata search "<query>" --table <connection.schema.table> [--type vector] [--column <source_text_column>] \
2828
[--select <columns>] [--limit <n>] [--workspace-id <workspace_id>] [--output table|json|csv]
2929
```
3030

@@ -33,6 +33,7 @@ hotdata search "<query>" --type vector --table <connection.schema.table> --colum
3333
| **`bm25`** | Server generates `bm25_search(table, col, 'text')`. Results sort by score (descending). |
3434
| **`vector`** | Pass plain-text query; name the **source text column** (e.g. `title`). Server embeds using the same provider/metric/dimensions as the index. SQL uses `vector_distance(col, 'text')`. Results sort by distance (ascending). |
3535

36+
- **Inference:** when `--type` or `--column` are omitted, the CLI fetches the table's indexes and selects the only BM25/vector index. If multiple exist, you must specify both flags.
3637
- **No vector index, or custom embedding model?** Use raw SQL via `hotdata query` (e.g. `cosine_distance(col, [<vec>])`). The removed `--model` / stdin-vector paths hardcoded `l2_distance` and are not supported.
3738
- **Before search:** create the right index (`indexes create --type bm25` or `--type vector`). See [references/INDEXES.md](references/INDEXES.md).
3839
- Default `--limit` is 10.
@@ -48,15 +49,19 @@ Indexes attach to a **connection table** (`--connection-id` + `--schema` + `--ta
4849
hotdata indexes list [--connection-id <id>] [--schema <schema>] [--table <table>] [--workspace-id <ws>] [--output table|json|yaml]
4950
hotdata indexes list --dataset-id <dataset_id> [--workspace-id <ws>] [--output table|json|yaml]
5051

51-
# Connection table
52-
hotdata indexes create --connection-id <id> --schema <schema> --table <table> \
53-
--name <name> --columns <cols> --type bm25|vector \
54-
[--metric l2|cosine|dot] [--async] \
52+
# Managed database (catalog alias — uses the active database when the catalog matches)
53+
hotdata indexes create --catalog <alias> --schema <schema> --table <table> \
54+
--column <col> --type bm25|vector \
55+
[--name <name>] [--metric l2|cosine|dot] [--async] \
5556
[--embedding-provider-id <id>] [--dimensions <n>] [--output-column <name>] [--description <text>]
57+
58+
# Connection table (raw connection ID)
59+
hotdata indexes create --connection-id <id> --schema <schema> --table <table> \
60+
--column <col> --type bm25|vector [--name <name>] ...
5661
hotdata indexes delete --connection-id <id> --schema <schema> --table <table> --name <name>
5762

5863
# Dataset
59-
hotdata indexes create --dataset-id <dataset_id> --name <name> --columns <cols> --type bm25|vector ...
64+
hotdata indexes create --dataset-id <dataset_id> --column <col> --type bm25|vector [--name <name>] ...
6065
hotdata indexes delete --dataset-id <dataset_id> --name <name>
6166
```
6267

@@ -89,6 +94,6 @@ hotdata embedding-providers delete <id> [--workspace-id <workspace_id>]
8994

9095
1. `hotdata tables list --connection-id <id>` — confirm column types.
9196
2. `hotdata indexes list` — avoid duplicate indexes.
92-
3. `hotdata indexes create ... --type bm25|vector` (add `--async` if large).
93-
4. `hotdata search "..." --type bm25|vector --table ... --column ...`
97+
3. `hotdata indexes create --catalog <alias> --table <table> --column <col> --type bm25|vector` (add `--async` if large).
98+
4. `hotdata search "..." --table <catalog.table>`: `--type` and `--column` are inferred when there is one search index.
9499
5. Record what exists in **context:DATAMODEL** (core skill) when the workspace should remember index choices.
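
The inference rule above (use the table's only BM25/vector index, otherwise require explicit flags) can be sketched as a shell filter. This is an illustration only: the `infer_search_index` name and the `type<TAB>column` input format are assumptions, and the real CLI fetches index metadata from the API rather than from stdin.

```sh
# Hypothetical helper (not part of hotdata): read "type<TAB>column" rows for
# one table on stdin and print the single search index as "type<TAB>column".
# Sorted indexes are ignored. Fails when there are zero or several search
# indexes, in which case --type and --column would have to be passed explicitly.
infer_search_index() {
  awk -F '\t' '
    $1 == "bm25" || $1 == "vector" { n++; pick = $1 "\t" $2 }
    END { if (n == 1) print pick; else exit 1 }'
}
```

A table with one `sorted` index and one `bm25` index still resolves unambiguously, since only BM25 and vector indexes count as search indexes here.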

skills/hotdata-search/references/INDEXES.md

Lines changed: 14 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -30,12 +30,24 @@ Skip duplicates (same table, column, and purpose).
3030

3131
## 3. Create indexes
3232

33+
For managed databases (catalog alias — auto-selects the active database connection):
34+
35+
```bash
36+
hotdata indexes create --catalog <alias> --schema <schema> --table <table> \
37+
--column body --type bm25
38+
39+
hotdata indexes create --catalog <alias> --schema <schema> --table <table> \
40+
--column embedding --type vector --metric cosine
41+
```
42+
43+
For regular connections (explicit connection ID):
44+
3345
```bash
3446
hotdata indexes create --connection-id <id> --schema <schema> --table <table> \
35-
--name idx_posts_body_bm25 --columns body --type bm25
47+
--name idx_posts_body_bm25 --column body --type bm25
3648

3749
hotdata indexes create --connection-id <id> --schema <schema> --table <table> \
38-
--name idx_chunks_embedding --columns embedding --type vector --metric cosine
50+
--name idx_chunks_embedding --column embedding --type vector --metric cosine
3951
```
4052

4153
Large builds: `--async`, then `hotdata jobs list` / `hotdata jobs <job_id>`.

skills/hotdata/SKILL.md

Lines changed: 17 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -189,25 +189,28 @@ hotdata connections create \
189189
hotdata databases list [--workspace-id <workspace_id>] [--output table|json|yaml]
190190
hotdata databases create [--name <display_name>] [--catalog <alias>] [--table <table> ...] [--schema public] [--expires-at <duration|timestamp>] [--workspace-id <workspace_id>] [--output table|json|yaml]
191191
hotdata databases set <id_or_name>
192+
hotdata databases unset
192193
hotdata databases <id_or_name> [--workspace-id <workspace_id>] [--output table|json|yaml]
193194
hotdata databases delete <id_or_name> [--workspace-id <workspace_id>]
194195
hotdata databases run [--database <id>] [--name <label>] [--schema public] [--table <table> ...] [--expires-at <duration|timestamp>] [--workspace-id <workspace_id>] <cmd> [args...]
195196
hotdata databases <id> run <cmd> [args...]
196197
197-
# Dot-notation shorthand for load: database.table or database.schema.table
198-
hotdata databases load <database.table> [--file ./data.parquet] [--url <url>] [--upload-id <id>] [--workspace-id <workspace_id>]
198+
# Preferred: load by catalog alias (auto-declares table if needed)
199+
hotdata databases load --catalog <alias> --table <table> [--schema public] (--file <path> | --url <url> | --upload-id <id>) [--workspace-id <workspace_id>]
199200
201+
# Also available via tables subcommand
200202
hotdata databases tables list [--database <id_or_name>] [--schema <name>] [--workspace-id <workspace_id>] [--output table|json|yaml]
201-
hotdata databases tables load <table> [--database <id_or_name>] [--schema public] [--file ./data.parquet] [--url <url>] [--upload-id <id>] [--workspace-id <workspace_id>]
203+
hotdata databases tables load <table> [--database <id_or_name>] [--schema public] (--file <path> | --url <url> | --upload-id <id>) [--workspace-id <workspace_id>]
202204
hotdata databases tables delete <table> [--database <id_or_name>] [--schema public] [--workspace-id <workspace_id>]
203205
```
204206

205-
- `list` — all managed databases in the workspace.
207+
- `list` — all managed databases in the workspace. Active database is marked with `*`.
206208
- `create` — creates a new managed database. `--name` is an optional human-readable display name. `--catalog` sets the SQL alias used in queries (`SELECT … FROM <catalog>.schema.table`); must be `[a-z_][a-z0-9_]*`. `--expires-at` accepts relative durations (`24h`, `7d`, `90m`) or an RFC 3339 timestamp; omitting means no expiry. Repeat `--table` to declare tables up front.
207209
- `set` — saves `<id_or_name>` as the active database. Subsequent `databases tables` and `context` commands use it automatically.
210+
- `unset` — clears the active database from config.
208211
- `<id_or_name>` — inspect one database (id, catalog, name, expires_at).
209212
- `delete` — removes the managed database; clears the active-database config if it matched.
210-
- `load` shorthand with dot notation (`database.table` or `database.schema.table`). Schema defaults to `public`.
213+
- `load` (top-level shorthand) — loads parquet into `--catalog.--schema.--table`. Accepts `--file`, `--url`, or `--upload-id`. If the table was not declared at create time, the CLI automatically deletes and recreates the database with the table declared, then retries the load.
211214
- `tables list` — lists tables with `TABLE` (`<catalog>.<schema>.<table>`), `SYNCED`, `LAST_SYNC`. Uses active database when `--database` is omitted.
212215
- `tables load` — uploads a local parquet file (`--file`), a remote parquet URL (`--url`), or a pre-staged upload (`--upload-id`) and publishes with **replace** mode.
213216
- `tables delete` — drops a table from the managed database.
@@ -216,10 +219,9 @@ hotdata databases tables delete <table> [--database <id_or_name>] [--schema publ
216219
Example:
217220

218221
```
219-
hotdata databases create --name "Sales reporting" --catalog sales --table orders
220-
hotdata databases set <returned-id>
221-
hotdata databases tables load orders --file ./orders.parquet
222-
hotdata query "SELECT count(*) FROM sales.public.orders"
222+
hotdata databases create --catalog airbnb
223+
hotdata databases load --catalog airbnb --table listings --url https://example.com/listings.parquet
224+
hotdata query "SELECT count(*) FROM airbnb.public.listings"
223225
```
224226

225227
### List Tables and Columns
@@ -457,17 +459,18 @@ Use a sandbox to explore tables and capture **analysis-oriented** notes in sandb
457459

458460
## Workflow: Creating a managed database (parquet)
459461

460-
1. Create the database and declare tables up front:
462+
1. Create the database with a catalog alias:
461463
```
462-
hotdata databases create --name mydb --table events --table users
464+
hotdata databases create --catalog mydb
463465
```
464-
2. Load parquet into each table:
466+
2. Load parquet per table (tables are auto-declared if needed):
465467
```
466-
hotdata databases tables load mydb events --file ./events.parquet
468+
hotdata databases load --catalog mydb --table events --file ./events.parquet
469+
hotdata databases load --catalog mydb --table events --url https://example.com/events.parquet
467470
```
468471
3. Confirm tables and query:
469472
```
470-
hotdata databases tables list mydb
473+
hotdata databases tables list
471474
hotdata query "SELECT * FROM mydb.public.events LIMIT 10"
472475
```
473476
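
The commit message notes that auto-declare deletes and recreates the database, and that the CLI refuses to do so non-interactively when existing tables hold synced data. The guard below is a rough shell rendering of that behavior, not the CLI's code: the `confirm_data_loss` function, its arguments, and the message wording are assumptions, and the colored warning is omitted.

```sh
# Hypothetical guard (not part of hotdata).
#   $1: space-separated list of synced tables a delete+recreate would drop
#   $2: "1" when a --no-input style flag was given (defaults to 0)
# Returns 0 when it is safe to proceed, 1 otherwise.
confirm_data_loss() {
  synced_tables=$1
  no_input=${2:-0}

  # Nothing synced means nothing can be lost, so proceed without asking.
  if [ -z "$synced_tables" ]; then
    return 0
  fi

  # Non-interactive run (flag set, or stdin is not a terminal): refuse
  # instead of silently destroying data.
  if [ "$no_input" = "1" ] || [ ! -t 0 ]; then
    echo "error: recreating the database would drop synced tables: $synced_tables" >&2
    return 1
  fi

  # Interactive run: warn and default to "no".
  printf 'warning: data in [%s] will be lost. Continue? [y/N] ' "$synced_tables" >&2
  read -r answer
  case $answer in
    y|Y) return 0 ;;
    *)   return 1 ;;
  esac
}
```

Defaulting to "no" and failing closed when stdin is not a terminal keeps CI jobs and piped invocations from dropping data by accident.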
