Merged
45 changes: 28 additions & 17 deletions README.md
@@ -135,28 +135,35 @@ Managed databases are Hotdata-owned catalogs you create and populate yourself (n
```sh
hotdata databases list [-w <id>] [-o table|json|yaml]
hotdata databases create [--name <display_name>] [--catalog <alias>] [--table <table> ...] [--schema public] [--expires-at <duration|timestamp>] [-o table|json|yaml]
hotdata databases set <id>
hotdata databases unset
hotdata databases <name_or_id> [-o table|json|yaml]
hotdata databases delete <name_or_id>
hotdata databases run [--database <id>] [--name <label>] [--schema public] [--table <table> ...] [--expires-at <duration|timestamp>] <cmd> [args...]
hotdata databases <id> run <cmd> [args...]

hotdata databases tables list <database> [--schema <name>] [-o table|json|yaml]
hotdata databases tables load <database> <table> --file ./data.parquet [--schema public]
hotdata databases tables load <database> <table> --upload-id <id> [--schema public]
hotdata databases tables delete <database> <table> [--schema public]
# Preferred: load by catalog alias (auto-declares table if needed)
hotdata databases load --catalog <alias> --table <table> [--schema public] (--file <path> | --url <url> | --upload-id <id>)

# Also available: explicit database flag
hotdata databases tables list [--database <id_or_name>] [--schema <name>] [-o table|json|yaml]
hotdata databases tables load <table> [--database <id_or_name>] [--schema public] (--file <path> | --url <url> | --upload-id <id>)
hotdata databases tables delete <table> [--database <id_or_name>] [--schema public]
```

- `create` registers a managed connection with no external credentials. `--name` is a human-readable display name; `--catalog` sets the SQL alias used in queries (`SELECT … FROM <catalog>.schema.table`) and must be `[a-z_][a-z0-9_]*`. Use `--table` to declare tables up front (required before `tables load` on the current API).
- `create` registers a managed connection with no external credentials. `--name` is a human-readable display name; `--catalog` sets the SQL alias used in queries (`SELECT … FROM <catalog>.schema.table`) and must be `[a-z_][a-z0-9_]*`.
- `set` / `unset` — save or clear the active database. All `databases tables` and `context` commands default to it. The active database is marked with `*` in `databases list`.
- `load` (top-level shorthand) loads parquet data from `--file`, `--url`, or `--upload-id` into `<catalog>.<schema>.<table>`. If the table was not declared at create time, the CLI automatically deletes and recreates the database with the table declared, then retries the load.
- `tables load` uploads a **parquet** file (or uses a staged `upload_id` from `POST /v1/files`) and publishes it as the table generation (`replace` mode).
- `run` mints a database-scoped JWT and execs `<cmd>` with `HOTDATA_DATABASE_TOKEN`, `HOTDATA_DATABASE_REFRESH_TOKEN`, `HOTDATA_DATABASE`, `HOTDATA_WORKSPACE`, and `HOTDATA_API_URL` injected into its environment. Pass a database id (group-positional `<id>` like `sandbox run`, or `--database <id>`) to scope an existing database; omit both to auto-create a scratch one using `--name` / `--schema` / `--table` / `--expires-at`. Useful for launching an agent or child process whose API access is restricted to a single database.
- `run` mints a database-scoped JWT and execs `<cmd>` with `HOTDATA_DATABASE_TOKEN`, `HOTDATA_DATABASE_REFRESH_TOKEN`, `HOTDATA_DATABASE`, `HOTDATA_WORKSPACE`, and `HOTDATA_API_URL` injected into its environment.
- For CSV/JSON uploads without a managed database, use `hotdata datasets create` instead (`datasets.main.*`).
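
As an illustration of the `run` bullet above, a child script could confirm that the documented variables are present before doing any work. This is a minimal sketch: `require_hotdata_env` is a hypothetical helper name, not part of the CLI, and it checks only that each variable is non-empty.

```sh
# Sketch: check for the variables that `hotdata databases run` is documented to inject.
# `require_hotdata_env` is a hypothetical helper, not a hotdata command.
require_hotdata_env() {
  missing=""
  for name in HOTDATA_DATABASE_TOKEN HOTDATA_DATABASE_REFRESH_TOKEN \
    HOTDATA_DATABASE HOTDATA_WORKSPACE HOTDATA_API_URL; do
    # Indirect lookup; the names are fixed literals, so eval is safe here.
    eval "value=\${$name:-}"
    [ -n "$value" ] || missing="$missing $name"
  done
  if [ -n "$missing" ]; then
    # Report every missing variable on stderr and fail.
    echo "missing:$missing" >&2
    return 1
  fi
  return 0
}
```

A script started with `hotdata databases run <cmd>` might call `require_hotdata_env || exit 1` as its first step, so a launch outside `run` fails early with a readable message.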

Example:

```sh
hotdata databases create --name "Sales reporting" --catalog sales --table orders
hotdata databases tables load sales orders --file ./orders.parquet
hotdata query "SELECT count(*) FROM sales.public.orders"
hotdata databases create --catalog airbnb
hotdata databases load --catalog airbnb --table listings --url https://example.com/listings.parquet
hotdata query "SELECT count(*) FROM airbnb.public.listings"
```
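
The `--catalog` value in the example above must match `[a-z_][a-z0-9_]*`. A small pre-check along these lines could catch a bad alias before calling `hotdata databases create`. It is a sketch only: `is_valid_catalog` is a hypothetical helper name, and the CLI's own validation remains authoritative.

```sh
# Sketch: validate a catalog alias against the documented pattern [a-z_][a-z0-9_]*.
# `is_valid_catalog` is a hypothetical helper, not a hotdata command.
# The body runs in a subshell so the LC_ALL change does not leak to the caller;
# `exit` therefore ends only the subshell and sets the function's status.
is_valid_catalog() (
  LC_ALL=C # keep the a-z and 0-9 ranges ASCII-only
  case "$1" in
    '') exit 1 ;;             # empty alias
    [!a-z_]*) exit 1 ;;       # first character must be a-z or _
    *[!a-z0-9_]*) exit 1 ;;   # remaining characters must be a-z, 0-9, or _
  esac
  exit 0
)
```

For example, `is_valid_catalog airbnb` succeeds, while `is_valid_catalog Sales` and `is_valid_catalog sales-eu` fail.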

## Tables
@@ -233,14 +240,14 @@ hotdata queries <query_run_id> [-o table|json|yaml]

## Search

`--type` is **required** — no default. Pass either `vector` (similarity search via the index's embedding provider) or `bm25` (full-text search). Both run entirely server-side.
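Search supports two types, `vector` (similarity search via the index's embedding provider) and `bm25` (full-text search); both run entirely server-side. `--type` and `--column` are **optional** when the table has exactly one search index, in which case they are inferred automatically. Pass them explicitly when multiple indexes exist.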

```sh
# BM25 full-text search (requires a BM25 index on the column)
hotdata search "<query>" --type bm25 --table <connection.schema.table> --column <column> [--select <columns>] [--limit <n>] [-o table|json|csv]
hotdata search "<query>" --table <connection.schema.table> [--type bm25] [--column <column>] [--select <columns>] [--limit <n>] [-o table|json|csv]

# Vector search (requires a vector index with auto-embedding on the column)
hotdata search "<query>" --type vector --table <table> --column <source_text_column> [--limit <n>]
hotdata search "<query>" --table <table> [--type vector] [--column <source_text_column>] [--limit <n>]
```
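
The inference rule for `--type` and `--column` can be sketched as a small filter. This is an illustration under stated assumptions, not the CLI's implementation: it reads hypothetical `<type> <column>` lines on stdin (the real `indexes list` output format is not shown here), and it assumes `sorted` indexes do not count as search indexes.

```sh
# Sketch: infer search flags when exactly one bm25/vector index exists.
# Input format (hypothetical): one "<type> <column>" pair per line on stdin.
# `infer_search_flags` is a hypothetical helper, not a hotdata command.
infer_search_flags() {
  # Keep only search-capable index types; other types are ignored (assumption).
  candidates=$(awk '$1 == "bm25" || $1 == "vector" { print $1, $2 }')
  count=$(printf '%s\n' "$candidates" | grep -c . || true)
  if [ "$count" -eq 1 ]; then
    # Split the single "<type> <column>" line into $1 and $2.
    set -- $candidates
    printf '%s %s %s %s\n' --type "$1" --column "$2"
  else
    echo "found $count search indexes; pass --type and --column explicitly" >&2
    return 1
  fi
}
```

With one `bm25` index on `body`, the function prints `--type bm25 --column body`; with zero or several search indexes it fails, mirroring the requirement to pass both flags explicitly.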

- **`--type vector`** takes your query as **plain text** plus the name of the **source text column** (e.g. `title`). The server embeds the query at search time, using the same provider that auto-embedded the column when the index was built, so distance metric, model, and dimensions all match automatically. No `OPENAI_API_KEY`, no client-side embedding, and no need to know about the auto-generated `_embedding` column. Generated SQL: `vector_distance(col, 'query')` server-side.
@@ -255,17 +262,21 @@ hotdata search "<query>" --type vector --table <table> --column <source_text_col
Indexes attach to either a connection-table (`--connection-id` + `--schema` + `--table`) or a dataset (`--dataset-id`). The two scopes are mutually exclusive.

```sh
# Connection-table scope
# Managed database scope (catalog alias resolves via active database)
hotdata indexes create --catalog <alias> --schema <schema> --table <table> \
--column <cols> --type bm25|vector|sorted \
[--name <name>] [--metric l2|cosine|dot] [--async] \
[--embedding-provider-id <id>] [--dimensions <n>] [--output-column <name>] [--description <text>]

# Connection-table scope (for non-managed connections)
hotdata indexes list --connection-id <id> --schema <schema> --table <table> [-o table|json|yaml]
hotdata indexes create --connection-id <id> --schema <schema> --table <table> \
--name <name> --columns <cols> --type sorted|bm25|vector \
[--metric l2|cosine|dot] [--async] \
[--embedding-provider-id <id>] [--dimensions <n>] [--output-column <name>] [--description <text>]
--column <cols> --type sorted|bm25|vector [--name <name>] ...
hotdata indexes delete --connection-id <id> --schema <schema> --table <table> --name <name>

# Dataset scope
hotdata indexes list --dataset-id <id> [-o table|json|yaml]
hotdata indexes create --dataset-id <id> --name <name> --columns <cols> --type sorted|bm25|vector ...
hotdata indexes create --dataset-id <id> --column <cols> --type sorted|bm25|vector [--name <name>] ...
hotdata indexes delete --dataset-id <id> --name <name>
```

7 changes: 3 additions & 4 deletions skills/hotdata-analytics/SKILL.md
@@ -89,9 +89,8 @@ hotdata results <result_id> [--workspace-id <workspace_id>] [--output table|json
Or managed parquet:

```bash
hotdata databases create --name analytics --table slice
hotdata databases set <returned-id>
hotdata databases tables load slice --file ./slice.parquet
hotdata databases create --catalog analytics
hotdata databases load --catalog analytics --table slice --file ./slice.parquet
```

3. **Chain query** — use printed **`full_name`** or `datasets list` **FULL NAME** column:
@@ -113,7 +112,7 @@ For equality, range, and sort-heavy OLAP — not full-text or vector (see **`hot

```bash
hotdata indexes create --connection-id <id> --schema <schema> --table <table> \
--name idx_orders_created --columns created_at --type sorted [--async]
--name idx_orders_created --column created_at --type sorted [--async]
```

List and delete use the same `hotdata indexes` commands as in the search skill; only **`--type sorted`** is the analytics focus here.
4 changes: 2 additions & 2 deletions skills/hotdata-analytics/references/WORKFLOWS.md
@@ -76,8 +76,8 @@ hotdata datasets create --label "from saved" --query-id <query_id> [--table-name
**Managed database** (parquet → `<database>.<schema>.<table>`):

```bash
hotdata databases create --name chain_db --table revenue_slice
hotdata databases tables load chain_db revenue_slice --file ./revenue_slice.parquet
hotdata databases create --catalog chain_db
hotdata databases load --catalog chain_db --table revenue_slice --file ./revenue_slice.parquet
```

Note the printed **`full_name`** (e.g. `datasets.main.chain_revenue_slice` or `chain_db.public.revenue_slice`). For datasets, **`FULL NAME`** from `datasets list` is authoritative.
25 changes: 15 additions & 10 deletions skills/hotdata-search/SKILL.md
@@ -16,15 +16,15 @@ Retrieval workloads in Hotdata: **BM25 full-text**, **vector similarity**, and t

## Search CLI

`--type` is **required**: `bm25` or `vector`. Both run server-side.
`bm25` and `vector` search both run server-side. `--type` and `--column` are **optional** when the table has exactly one search index, because the CLI infers them automatically. Specify them when multiple indexes exist.

```bash
# BM25 (requires a BM25 index on the column)
hotdata search "<query>" --type bm25 --table <connection.schema.table> --column <column> \
hotdata search "<query>" --table <connection.schema.table> [--type bm25] [--column <column>] \
[--select <columns>] [--limit <n>] [--workspace-id <workspace_id>] [--output table|json|csv]

# Vector (requires a vector index; server auto-embeds the query text)
hotdata search "<query>" --type vector --table <connection.schema.table> --column <source_text_column> \
hotdata search "<query>" --table <connection.schema.table> [--type vector] [--column <source_text_column>] \
[--select <columns>] [--limit <n>] [--workspace-id <workspace_id>] [--output table|json|csv]
```

@@ -33,6 +33,7 @@ hotdata search "<query>" --type vector --table <connection.schema.table> --colum
| **`bm25`** | Server generates `bm25_search(table, col, 'text')`. Results sort by score (descending). |
| **`vector`** | Pass plain-text query; name the **source text column** (e.g. `title`). Server embeds using the same provider/metric/dimensions as the index. SQL uses `vector_distance(col, 'text')`. Results sort by distance (ascending). |

- **Inference:** when `--type` or `--column` is omitted, the CLI fetches the table's indexes and uses the table's single BM25 or vector index. If more than one exists, you must specify both flags.
- **No vector index, or custom embedding model?** Use raw SQL via `hotdata query` (e.g. `cosine_distance(col, [<vec>])`). The removed `--model` / stdin-vector paths hardcoded `l2_distance` and are not supported.
- **Before search:** create the right index (`indexes create --type bm25` or `--type vector`). See [references/INDEXES.md](references/INDEXES.md).
- Default `--limit` is 10.
@@ -48,15 +49,19 @@ Indexes attach to a **connection table** (`--connection-id` + `--schema` + `--ta
hotdata indexes list [--connection-id <id>] [--schema <schema>] [--table <table>] [--workspace-id <ws>] [--output table|json|yaml]
hotdata indexes list --dataset-id <dataset_id> [--workspace-id <ws>] [--output table|json|yaml]

# Connection table
hotdata indexes create --connection-id <id> --schema <schema> --table <table> \
--name <name> --columns <cols> --type bm25|vector \
[--metric l2|cosine|dot] [--async] \
# Managed database (catalog alias — uses the active database when the catalog matches)
hotdata indexes create --catalog <alias> --schema <schema> --table <table> \
--column <col> --type bm25|vector \
[--name <name>] [--metric l2|cosine|dot] [--async] \
[--embedding-provider-id <id>] [--dimensions <n>] [--output-column <name>] [--description <text>]

# Connection table (raw connection ID)
hotdata indexes create --connection-id <id> --schema <schema> --table <table> \
--column <col> --type bm25|vector [--name <name>] ...
hotdata indexes delete --connection-id <id> --schema <schema> --table <table> --name <name>

# Dataset
hotdata indexes create --dataset-id <dataset_id> --name <name> --columns <cols> --type bm25|vector ...
hotdata indexes create --dataset-id <dataset_id> --column <col> --type bm25|vector [--name <name>] ...
hotdata indexes delete --dataset-id <dataset_id> --name <name>
```

@@ -89,6 +94,6 @@ hotdata embedding-providers delete <id> [--workspace-id <workspace_id>]

1. `hotdata tables list --connection-id <id>` — confirm column types.
2. `hotdata indexes list` — avoid duplicate indexes.
3. `hotdata indexes create ... --type bm25|vector` (add `--async` if large).
4. `hotdata search "..." --type bm25|vector --table ... --column ...`
3. `hotdata indexes create --catalog <alias> --table <table> --column <col> --type bm25|vector` (add `--async` if large).
4. `hotdata search "..." --table <catalog.table>` — `--type` and `--column` are inferred when there is one search index.
5. Record what exists in **context:DATAMODEL** (core skill) when the workspace should remember index choices.
16 changes: 14 additions & 2 deletions skills/hotdata-search/references/INDEXES.md
@@ -30,12 +30,24 @@ Skip duplicates (same table, column, and purpose).

## 3. Create indexes

For managed databases (catalog alias — auto-selects the active database connection):

```bash
hotdata indexes create --catalog <alias> --schema <schema> --table <table> \
--column body --type bm25

hotdata indexes create --catalog <alias> --schema <schema> --table <table> \
--column embedding --type vector --metric cosine
```

For regular connections (explicit connection ID):

```bash
hotdata indexes create --connection-id <id> --schema <schema> --table <table> \
--name idx_posts_body_bm25 --columns body --type bm25
--name idx_posts_body_bm25 --column body --type bm25

hotdata indexes create --connection-id <id> --schema <schema> --table <table> \
--name idx_chunks_embedding --columns embedding --type vector --metric cosine
--name idx_chunks_embedding --column embedding --type vector --metric cosine
```

Large builds: `--async`, then `hotdata jobs list` / `hotdata jobs <job_id>`.
31 changes: 17 additions & 14 deletions skills/hotdata/SKILL.md
@@ -189,25 +189,28 @@ hotdata connections create \
hotdata databases list [--workspace-id <workspace_id>] [--output table|json|yaml]
hotdata databases create [--name <display_name>] [--catalog <alias>] [--table <table> ...] [--schema public] [--expires-at <duration|timestamp>] [--workspace-id <workspace_id>] [--output table|json|yaml]
hotdata databases set <id_or_name>
hotdata databases unset
hotdata databases <id_or_name> [--workspace-id <workspace_id>] [--output table|json|yaml]
hotdata databases delete <id_or_name> [--workspace-id <workspace_id>]
hotdata databases run [--database <id>] [--name <label>] [--schema public] [--table <table> ...] [--expires-at <duration|timestamp>] [--workspace-id <workspace_id>] <cmd> [args...]
hotdata databases <id> run <cmd> [args...]

# Dot-notation shorthand for load: database.table or database.schema.table
hotdata databases load <database.table> [--file ./data.parquet] [--url <url>] [--upload-id <id>] [--workspace-id <workspace_id>]
# Preferred: load by catalog alias (auto-declares table if needed)
hotdata databases load --catalog <alias> --table <table> [--schema public] (--file <path> | --url <url> | --upload-id <id>) [--workspace-id <workspace_id>]

# Also available via tables subcommand
hotdata databases tables list [--database <id_or_name>] [--schema <name>] [--workspace-id <workspace_id>] [--output table|json|yaml]
hotdata databases tables load <table> [--database <id_or_name>] [--schema public] [--file ./data.parquet] [--url <url>] [--upload-id <id>] [--workspace-id <workspace_id>]
hotdata databases tables load <table> [--database <id_or_name>] [--schema public] (--file <path> | --url <url> | --upload-id <id>) [--workspace-id <workspace_id>]
hotdata databases tables delete <table> [--database <id_or_name>] [--schema public] [--workspace-id <workspace_id>]
```

- `list` — all managed databases in the workspace.
- `list` — all managed databases in the workspace. Active database is marked with `*`.
- `create` — creates a new managed database. `--name` is an optional human-readable display name. `--catalog` sets the SQL alias used in queries (`SELECT … FROM <catalog>.schema.table`); must be `[a-z_][a-z0-9_]*`. `--expires-at` accepts relative durations (`24h`, `7d`, `90m`) or an RFC 3339 timestamp; omitting means no expiry. Repeat `--table` to declare tables up front.
- `set` — saves `<id_or_name>` as the active database. Subsequent `databases tables` and `context` commands use it automatically.
- `unset` — clears the active database from config.
- `<id_or_name>` — inspect one database (id, catalog, name, expires_at).
- `delete` — removes the managed database; clears the active-database config if it matched.
- `load` shorthand with dot notation (`database.table` or `database.schema.table`). Schema defaults to `public`.
- `load` (top-level shorthand) loads parquet into `<catalog>.<schema>.<table>`. Accepts `--file`, `--url`, or `--upload-id`. If the table was not declared at create time, the CLI automatically deletes and recreates the database with the table declared, then retries the load.
- `tables list` — lists tables with `TABLE` (`<catalog>.<schema>.<table>`), `SYNCED`, `LAST_SYNC`. Uses active database when `--database` is omitted.
- `tables load` — uploads a local parquet file (`--file`), a remote parquet URL (`--url`), or a pre-staged upload (`--upload-id`) and publishes with **replace** mode.
- `tables delete` — drops a table from the managed database.
@@ -216,10 +219,9 @@ hotdata databases tables delete <table> [--database <id_or_name>] [--schema publ
Example:

```
hotdata databases create --name "Sales reporting" --catalog sales --table orders
hotdata databases set <returned-id>
hotdata databases tables load orders --file ./orders.parquet
hotdata query "SELECT count(*) FROM sales.public.orders"
hotdata databases create --catalog airbnb
hotdata databases load --catalog airbnb --table listings --url https://example.com/listings.parquet
hotdata query "SELECT count(*) FROM airbnb.public.listings"
```
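
The `create` notes above say `--expires-at` accepts relative durations (`24h`, `7d`, `90m`) or an RFC 3339 timestamp. A wrapper script could classify a value before passing it on, roughly as follows. This is a sketch under assumptions: `classify_expires_at` is a hypothetical helper, the unit set `m`/`h`/`d` is inferred only from the documented examples, and the timestamp check tests shape rather than calendar validity.

```sh
# Sketch: classify an --expires-at value as none, duration, or timestamp.
# `classify_expires_at` is a hypothetical helper, not a hotdata command.
classify_expires_at() {
  if [ -z "$1" ]; then
    echo "none" # omitting the flag means no expiry
    return 0
  fi
  # Relative duration: digits followed by m, h, or d (units assumed from the examples).
  if printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[0-9]+[mhd]$'; then
    echo "duration"
    return 0
  fi
  # RFC 3339 shape: date, "T", time, optional fraction, then "Z" or a numeric offset.
  if printf '%s\n' "$1" | LC_ALL=C grep -Eq \
    '^[0-9]{4}-[0-9]{2}-[0-9]{2}[Tt][0-9]{2}:[0-9]{2}:[0-9]{2}(\.[0-9]+)?([Zz]|[+-][0-9]{2}:[0-9]{2})$'; then
    echo "timestamp"
    return 0
  fi
  echo "invalid"
  return 1
}
```

The CLI may accept forms this check rejects, so treat an `invalid` result as a prompt to review the value, not as a definitive error.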

### List Tables and Columns
@@ -457,17 +459,18 @@ Use a sandbox to explore tables and capture **analysis-oriented** notes in sandb

## Workflow: Creating a managed database (parquet)

1. Create the database and declare tables up front:
1. Create the database with a catalog alias:
```
hotdata databases create --name mydb --table events --table users
hotdata databases create --catalog mydb
```
2. Load parquet into each table:
2. Load parquet into each table from a local file (`--file`) or a URL (`--url`); tables are auto-declared if needed:
```
hotdata databases tables load mydb events --file ./events.parquet
hotdata databases load --catalog mydb --table events --file ./events.parquet
hotdata databases load --catalog mydb --table events --url https://example.com/events.parquet
```
3. Confirm tables and query:
```
hotdata databases tables list mydb
hotdata databases tables list
hotdata query "SELECT * FROM mydb.public.events LIMIT 10"
```
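
When several parquet files need loading, the per-table `load` step in the workflow above can be generated instead of typed. The sketch below only prints the commands (a dry run) so they can be reviewed first. `plan_loads` is a hypothetical helper, and deriving the table name from the file name is an assumption, not CLI behavior.

```sh
# Sketch: print one `hotdata databases load` command per parquet file (dry run).
# `plan_loads` is a hypothetical helper; it does not invoke the CLI.
# Usage: plan_loads <catalog> <file.parquet>...
plan_loads() {
  catalog=$1
  shift
  for path in "$@"; do
    # Table name is taken from the file name without its .parquet suffix (assumption).
    table=$(basename "$path" .parquet)
    printf 'hotdata databases load --catalog %s --table %s --file %s\n' \
      "$catalog" "$table" "$path"
  done
}
```

For example, `plan_loads mydb ./events.parquet ./users.parquet` prints two commands, one for `events` and one for `users`. Paths containing spaces would need quoting before the printed lines are executed.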
