Skip to content
Merged
Show file tree
Hide file tree
Changes from 8 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
45 changes: 28 additions & 17 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -135,28 +135,35 @@ Managed databases are Hotdata-owned catalogs you create and populate yourself (n
```sh
hotdata databases list [-w <id>] [-o table|json|yaml]
hotdata databases create [--name <display_name>] [--catalog <alias>] [--table <table> ...] [--schema public] [--expires-at <duration|timestamp>] [-o table|json|yaml]
hotdata databases set <id>
hotdata databases unset
hotdata databases <name_or_id> [-o table|json|yaml]
hotdata databases delete <name_or_id>
hotdata databases run [--database <id>] [--name <label>] [--schema public] [--table <table> ...] [--expires-at <duration|timestamp>] <cmd> [args...]
hotdata databases <id> run <cmd> [args...]

hotdata databases tables list <database> [--schema <name>] [-o table|json|yaml]
hotdata databases tables load <database> <table> --file ./data.parquet [--schema public]
hotdata databases tables load <database> <table> --upload-id <id> [--schema public]
hotdata databases tables delete <database> <table> [--schema public]
# Preferred: load by catalog alias (auto-declares table if needed)
hotdata databases load --catalog <alias> --table <table> [--schema public] (--file <path> | --url <url> | --upload-id <id>)

# Also available: explicit database flag
hotdata databases tables list [--database <id_or_name>] [--schema <name>] [-o table|json|yaml]
hotdata databases tables load <table> [--database <id_or_name>] [--schema public] (--file <path> | --url <url> | --upload-id <id>)
hotdata databases tables delete <table> [--database <id_or_name>] [--schema public]
```

- `create` registers a managed connection with no external credentials. `--name` is a human-readable display name; `--catalog` sets the SQL alias used in queries (`SELECT … FROM <catalog>.schema.table`) and must be `[a-z_][a-z0-9_]*`. Use `--table` to declare tables up front (required before `tables load` on the current API).
- `create` registers a managed connection with no external credentials. `--name` is a human-readable display name; `--catalog` sets the SQL alias used in queries (`SELECT … FROM <catalog>.schema.table`) and must be `[a-z_][a-z0-9_]*`.
- `set` / `unset` — save or clear the active database. All `databases tables` and `context` commands default to it. The active database is marked with `*` in `databases list`.
- `load` (top-level shorthand) — loads a parquet file into `--catalog.--schema.--table`. If the table was not declared at create time, the CLI automatically deletes and recreates the database with the table declared, then retries the load.
- `tables load` uploads a **parquet** file (or uses a staged `upload_id` from `POST /v1/files`) and publishes it as the table generation (`replace` mode).
- `run` mints a database-scoped JWT and execs `<cmd>` with `HOTDATA_DATABASE_TOKEN`, `HOTDATA_DATABASE_REFRESH_TOKEN`, `HOTDATA_DATABASE`, `HOTDATA_WORKSPACE`, and `HOTDATA_API_URL` injected into its environment. Pass a database id (group-positional `<id>` like `sandbox run`, or `--database <id>`) to scope an existing database; omit both to auto-create a scratch one using `--name` / `--schema` / `--table` / `--expires-at`. Useful for launching an agent or child process whose API access is restricted to a single database.
- `run` mints a database-scoped JWT and execs `<cmd>` with `HOTDATA_DATABASE_TOKEN`, `HOTDATA_DATABASE_REFRESH_TOKEN`, `HOTDATA_DATABASE`, `HOTDATA_WORKSPACE`, and `HOTDATA_API_URL` injected into its environment.
- For CSV/JSON uploads without a managed database, use `hotdata datasets create` instead (`datasets.main.*`).

Example:

```sh
hotdata databases create --name "Sales reporting" --catalog sales --table orders
hotdata databases tables load sales orders --file ./orders.parquet
hotdata query "SELECT count(*) FROM sales.public.orders"
hotdata databases create --catalog airbnb
hotdata databases load --catalog airbnb --table listings --url https://example.com/listings.parquet
hotdata query "SELECT count(*) FROM airbnb.public.listings"
```

## Tables
Expand Down Expand Up @@ -233,14 +240,14 @@ hotdata queries <query_run_id> [-o table|json|yaml]

## Search

`--type` is **required** — no default. Pass either `vector` (similarity search via the index's embedding provider) or `bm25` (full-text search). Both run entirely server-side.
Both run entirely server-side. `--type` and `--column` are **optional** when the table has exactly one search index — they are inferred automatically. Pass them explicitly when multiple indexes exist.

```sh
# BM25 full-text search (requires a BM25 index on the column)
hotdata search "<query>" --type bm25 --table <connection.schema.table> --column <column> [--select <columns>] [--limit <n>] [-o table|json|csv]
hotdata search "<query>" --table <connection.schema.table> [--type bm25] [--column <column>] [--select <columns>] [--limit <n>] [-o table|json|csv]

# Vector search (requires a vector index with auto-embedding on the column)
hotdata search "<query>" --type vector --table <table> --column <source_text_column> [--limit <n>]
hotdata search "<query>" --table <table> [--type vector] [--column <source_text_column>] [--limit <n>]
```

- **`--type vector`** — pass your query as **plain text**, name the **source text column** (e.g. `title`). The server embeds the query at the same time, using the same provider that auto-embedded the column when the index was built — so distance metric, model, and dimensions all match automatically. No `OPENAI_API_KEY`, no client-side embedding, no need to know about the auto-generated `_embedding` column. Generated SQL: `vector_distance(col, 'query')` server-side.
Expand All @@ -255,17 +262,21 @@ hotdata search "<query>" --type vector --table <table> --column <source_text_col
Indexes attach to either a connection-table (`--connection-id` + `--schema` + `--table`) or a dataset (`--dataset-id`). The two scopes are mutually exclusive.

```sh
# Connection-table scope
# Managed database scope (catalog alias resolves via active database)
hotdata indexes create --catalog <alias> --schema <schema> --table <table> \
--column <cols> --type bm25|vector|sorted \
[--name <name>] [--metric l2|cosine|dot] [--async] \
[--embedding-provider-id <id>] [--dimensions <n>] [--output-column <name>] [--description <text>]

# Connection-table scope (for non-managed connections)
hotdata indexes list --connection-id <id> --schema <schema> --table <table> [-o table|json|yaml]
hotdata indexes create --connection-id <id> --schema <schema> --table <table> \
--name <name> --columns <cols> --type sorted|bm25|vector \
[--metric l2|cosine|dot] [--async] \
[--embedding-provider-id <id>] [--dimensions <n>] [--output-column <name>] [--description <text>]
--column <cols> --type sorted|bm25|vector [--name <name>] ...
hotdata indexes delete --connection-id <id> --schema <schema> --table <table> --name <name>

# Dataset scope
hotdata indexes list --dataset-id <id> [-o table|json|yaml]
hotdata indexes create --dataset-id <id> --name <name> --columns <cols> --type sorted|bm25|vector ...
hotdata indexes create --dataset-id <id> --column <cols> --type sorted|bm25|vector [--name <name>] ...
hotdata indexes delete --dataset-id <id> --name <name>
```

Expand Down
25 changes: 15 additions & 10 deletions skills/hotdata-search/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,15 +16,15 @@ Retrieval workloads in Hotdata: **BM25 full-text**, **vector similarity**, and t

## Search CLI

`--type` is **required**: `bm25` or `vector`. Both run server-side.
Both run server-side. `--type` and `--column` are **optional** when the table has exactly one search index — they are inferred automatically. Specify them when multiple indexes exist.

```bash
# BM25 (requires a BM25 index on the column)
hotdata search "<query>" --type bm25 --table <connection.schema.table> --column <column> \
hotdata search "<query>" --table <connection.schema.table> [--type bm25] [--column <column>] \
[--select <columns>] [--limit <n>] [--workspace-id <workspace_id>] [--output table|json|csv]

# Vector (requires a vector index; server auto-embeds the query text)
hotdata search "<query>" --type vector --table <connection.schema.table> --column <source_text_column> \
hotdata search "<query>" --table <connection.schema.table> [--type vector] [--column <source_text_column>] \
[--select <columns>] [--limit <n>] [--workspace-id <workspace_id>] [--output table|json|csv]
```

Expand All @@ -33,6 +33,7 @@ hotdata search "<query>" --type vector --table <connection.schema.table> --colum
| **`bm25`** | Server generates `bm25_search(table, col, 'text')`. Results sort by score (descending). |
| **`vector`** | Pass plain-text query; name the **source text column** (e.g. `title`). Server embeds using the same provider/metric/dimensions as the index. SQL uses `vector_distance(col, 'text')`. Results sort by distance (ascending). |

- **Inference:** when `--type` or `--column` are omitted, the CLI fetches the table's indexes and selects the only BM25/vector index. If multiple exist, you must specify both flags.
- **No vector index, or custom embedding model?** Use raw SQL via `hotdata query` (e.g. `cosine_distance(col, [<vec>])`). The removed `--model` / stdin-vector paths hardcoded `l2_distance` and are not supported.
- **Before search:** create the right index (`indexes create --type bm25` or `--type vector`). See [references/INDEXES.md](references/INDEXES.md).
- Default `--limit` is 10.
Expand All @@ -48,15 +49,19 @@ Indexes attach to a **connection table** (`--connection-id` + `--schema` + `--ta
hotdata indexes list [--connection-id <id>] [--schema <schema>] [--table <table>] [--workspace-id <ws>] [--output table|json|yaml]
hotdata indexes list --dataset-id <dataset_id> [--workspace-id <ws>] [--output table|json|yaml]

# Connection table
hotdata indexes create --connection-id <id> --schema <schema> --table <table> \
--name <name> --columns <cols> --type bm25|vector \
[--metric l2|cosine|dot] [--async] \
# Managed database (catalog alias — uses the active database when the catalog matches)
hotdata indexes create --catalog <alias> --schema <schema> --table <table> \
--column <col> --type bm25|vector \
[--name <name>] [--metric l2|cosine|dot] [--async] \
[--embedding-provider-id <id>] [--dimensions <n>] [--output-column <name>] [--description <text>]

# Connection table (raw connection ID)
hotdata indexes create --connection-id <id> --schema <schema> --table <table> \
--column <col> --type bm25|vector [--name <name>] ...
hotdata indexes delete --connection-id <id> --schema <schema> --table <table> --name <name>

# Dataset
hotdata indexes create --dataset-id <dataset_id> --name <name> --columns <cols> --type bm25|vector ...
hotdata indexes create --dataset-id <dataset_id> --column <col> --type bm25|vector [--name <name>] ...
hotdata indexes delete --dataset-id <dataset_id> --name <name>
```

Expand Down Expand Up @@ -89,6 +94,6 @@ hotdata embedding-providers delete <id> [--workspace-id <workspace_id>]

1. `hotdata tables list --connection-id <id>` — confirm column types.
2. `hotdata indexes list` — avoid duplicate indexes.
3. `hotdata indexes create ... --type bm25|vector` (add `--async` if large).
4. `hotdata search "..." --type bm25|vector --table ... --column ...`
3. `hotdata indexes create --catalog <alias> --table <table> --column <col> --type bm25|vector` (add `--async` if large).
4. `hotdata search "..." --table <catalog.table>` — `--type` and `--column` are inferred when there is one search index.
5. Record what exists in **context:DATAMODEL** (core skill) when the workspace should remember index choices.
16 changes: 14 additions & 2 deletions skills/hotdata-search/references/INDEXES.md
Original file line number Diff line number Diff line change
Expand Up @@ -30,12 +30,24 @@ Skip duplicates (same table, column, and purpose).

## 3. Create indexes

For managed databases (catalog alias — auto-selects the active database connection):

```bash
hotdata indexes create --catalog <alias> --schema <schema> --table <table> \
--column body --type bm25

hotdata indexes create --catalog <alias> --schema <schema> --table <table> \
--column embedding --type vector --metric cosine
```

For regular connections (explicit connection ID):

```bash
hotdata indexes create --connection-id <id> --schema <schema> --table <table> \
--name idx_posts_body_bm25 --columns body --type bm25
--name idx_posts_body_bm25 --column body --type bm25

hotdata indexes create --connection-id <id> --schema <schema> --table <table> \
--name idx_chunks_embedding --columns embedding --type vector --metric cosine
--name idx_chunks_embedding --column embedding --type vector --metric cosine
```

Large builds: `--async`, then `hotdata jobs list` / `hotdata jobs <job_id>`.
Expand Down
31 changes: 17 additions & 14 deletions skills/hotdata/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -189,25 +189,28 @@ hotdata connections create \
hotdata databases list [--workspace-id <workspace_id>] [--output table|json|yaml]
hotdata databases create [--name <display_name>] [--catalog <alias>] [--table <table> ...] [--schema public] [--expires-at <duration|timestamp>] [--workspace-id <workspace_id>] [--output table|json|yaml]
hotdata databases set <id_or_name>
hotdata databases unset
hotdata databases <id_or_name> [--workspace-id <workspace_id>] [--output table|json|yaml]
hotdata databases delete <id_or_name> [--workspace-id <workspace_id>]
hotdata databases run [--database <id>] [--name <label>] [--schema public] [--table <table> ...] [--expires-at <duration|timestamp>] [--workspace-id <workspace_id>] <cmd> [args...]
hotdata databases <id> run <cmd> [args...]

# Dot-notation shorthand for load: database.table or database.schema.table
hotdata databases load <database.table> [--file ./data.parquet] [--url <url>] [--upload-id <id>] [--workspace-id <workspace_id>]
# Preferred: load by catalog alias (auto-declares table if needed)
hotdata databases load --catalog <alias> --table <table> [--schema public] (--file <path> | --url <url> | --upload-id <id>) [--workspace-id <workspace_id>]

# Also available via tables subcommand
hotdata databases tables list [--database <id_or_name>] [--schema <name>] [--workspace-id <workspace_id>] [--output table|json|yaml]
hotdata databases tables load <table> [--database <id_or_name>] [--schema public] [--file ./data.parquet] [--url <url>] [--upload-id <id>] [--workspace-id <workspace_id>]
hotdata databases tables load <table> [--database <id_or_name>] [--schema public] (--file <path> | --url <url> | --upload-id <id>) [--workspace-id <workspace_id>]
hotdata databases tables delete <table> [--database <id_or_name>] [--schema public] [--workspace-id <workspace_id>]
```

- `list` — all managed databases in the workspace.
- `list` — all managed databases in the workspace. Active database is marked with `*`.
- `create` — creates a new managed database. `--name` is an optional human-readable display name. `--catalog` sets the SQL alias used in queries (`SELECT … FROM <catalog>.schema.table`); must be `[a-z_][a-z0-9_]*`. `--expires-at` accepts relative durations (`24h`, `7d`, `90m`) or an RFC 3339 timestamp; omitting means no expiry. Repeat `--table` to declare tables up front.
- `set` — saves `<id_or_name>` as the active database. Subsequent `databases tables` and `context` commands use it automatically.
- `unset` — clears the active database from config.
- `<id_or_name>` — inspect one database (id, catalog, name, expires_at).
- `delete` — removes the managed database; clears the active-database config if it matched.
- `load` shorthand with dot notation (`database.table` or `database.schema.table`). Schema defaults to `public`.
- `load` (top-level shorthand) — loads parquet into `--catalog.--schema.--table`. Accepts `--file`, `--url`, or `--upload-id`. If the table was not declared at create time, the CLI automatically deletes and recreates the database with the table declared, then retries the load.
- `tables list` — lists tables with `TABLE` (`<catalog>.<schema>.<table>`), `SYNCED`, `LAST_SYNC`. Uses active database when `--database` is omitted.
- `tables load` — uploads a local parquet file (`--file`), a remote parquet URL (`--url`), or a pre-staged upload (`--upload-id`) and publishes with **replace** mode.
- `tables delete` — drops a table from the managed database.
Expand All @@ -216,10 +219,9 @@ hotdata databases tables delete <table> [--database <id_or_name>] [--schema publ
Example:

```
hotdata databases create --name "Sales reporting" --catalog sales --table orders
hotdata databases set <returned-id>
hotdata databases tables load orders --file ./orders.parquet
hotdata query "SELECT count(*) FROM sales.public.orders"
hotdata databases create --catalog airbnb
hotdata databases load --catalog airbnb --table listings --url https://example.com/listings.parquet
hotdata query "SELECT count(*) FROM airbnb.public.listings"
```

### List Tables and Columns
Expand Down Expand Up @@ -457,17 +459,18 @@ Use a sandbox to explore tables and capture **analysis-oriented** notes in sandb

## Workflow: Creating a managed database (parquet)

1. Create the database and declare tables up front:
1. Create the database with a catalog alias:
```
hotdata databases create --name mydb --table events --table users
hotdata databases create --catalog mydb
```
2. Load parquet into each table:
2. Load parquet per table (tables are auto-declared if needed):
```
hotdata databases tables load mydb events --file ./events.parquet
hotdata databases load --catalog mydb --table events --file ./events.parquet
hotdata databases load --catalog mydb --table events --url https://example.com/events.parquet
```
3. Confirm tables and query:
```
hotdata databases tables list mydb
hotdata databases tables list
hotdata query "SELECT * FROM mydb.public.events LIMIT 10"
```

Expand Down
23 changes: 11 additions & 12 deletions skills/hotdata/references/WORKFLOWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -68,12 +68,12 @@ End-to-end checklists. Use the linked sections for command detail and guardrails
1. [ ] `hotdata tables list --connection-id <id>` — pick text column (BM25) or embedding/text column (vector)
2. [ ] `hotdata indexes list` — avoid duplicate bm25/vector indexes on the same column
3. [ ] Create index:
- [ ] **Keyword:** `hotdata indexes create … --type bm25 --columns <text_col>`
- [ ] **Semantic:** `hotdata indexes create … --type vector --columns <col> [--metric cosine|l2|dot]`
- [ ] **Managed DB:** `hotdata indexes create --catalog <alias> --table <tbl> --column <text_col> --type bm25|vector`
- [ ] **Connection:** `hotdata indexes create --connection-id <id> --schema <s> --table <t> --column <col> --type bm25|vector [--metric cosine|l2|dot]`
- [ ] Large build: add `--async`, then `hotdata jobs <job_id>`
4. [ ] Search:
- [ ] `hotdata search "…" --type bm25 --table <connection.schema.table> --column <col>`
- [ ] `hotdata search "…" --type vector --table … --column <source_text_col>`
4. [ ] Search (--type and --column inferred when one search index exists):
- [ ] `hotdata search "…" --table <catalog.schema.table>` (auto-infer)
- [ ] `hotdata search "…" --table … --type bm25 --column <col>` (explicit)
5. [ ] (Optional) Note indexes in **context:DATAMODEL → Search & index summary**

**Detail:** [hotdata-search INDEXES.md](../../hotdata-search/references/INDEXES.md)
Expand Down Expand Up @@ -116,24 +116,23 @@ Both land queryable tables in the workspace; the path depends on **format** and

### Workflow: managed database (parquet)

1. Create the database and **declare tables** up front:
1. Create the database with a catalog alias:

```bash
hotdata databases create --name sales --table orders --table customers
hotdata databases create --catalog sales
```

2. Load parquet per table:
2. Load parquet per table (tables are auto-declared if needed):

```bash
hotdata databases tables load sales orders --file ./orders.parquet
hotdata databases load --catalog sales --table orders --file ./orders.parquet
hotdata databases load --catalog sales --table customers --url https://example.com/customers.parquet
```

If load fails with *not declared*, add `--table` at create time. There is no `--url` on load — download parquet locally first.

3. Confirm and query:

```bash
hotdata databases tables list sales
hotdata databases tables list
hotdata query "SELECT count(*) FROM sales.public.orders"
```

Expand Down
Loading
Loading