
Commit 27dcdb0

feat: data/dataset refresh + indexes auto-embedding + embedding providers (#67)
* feat(connections): add data refresh with async, scope, and include-uncached

  `hotdata connections refresh` previously only triggered a synchronous schema refresh. The /v1/refresh endpoint actually supports data refresh (per-table or whole-connection), an async/background-job mode, and an include_uncached toggle for picking up newly-discovered tables, none of which were exposed.

  Adds:
  - --data: refresh data instead of schema metadata
  - --schema/--table: narrow scope (server requires both together for data refresh)
  - --async: submit as a background job, returns a job id to poll via `hotdata jobs <id>`
  - --include-uncached: connection-wide data refresh only, includes uncached tables
  - CLI-side validation mirroring server rules so we fail fast with clear errors
  - Richer output: schema refresh now reports tables_discovered/added/modified; data refresh reports rows_synced and duration

  Also adds `dataset_refresh` to the allowed values for `jobs list --job-type`, which the server emits but the CLI didn't accept as a filter.

* docs: document connections refresh data/async/include-uncached flags

  Updates the README and the bundled hotdata skill to match the expanded `hotdata connections refresh` surface (--data, --schema/--table, --async, --include-uncached) and to add `dataset_refresh` to the documented values for `hotdata jobs list --job-type`.

* feat(datasets): add refresh subcommand with async support

  Adds `hotdata datasets refresh <dataset_id> [--async]` for re-running a dataset's source (URL fetch or saved query) to create a new version. Calls the same `/v1/refresh` endpoint as `connections refresh`, but with `dataset_id` set instead of `connection_id`.

  The sync path prints the new version and status; the async path prints the job ID and points the user at `hotdata jobs <id>` to poll. Upload-source datasets have no remote to re-pull from, so the server's 400 ("Refresh not supported for source type 'upload'") is surfaced directly.

  Updates README.md and SKILL.md to document the new subcommand.

* feat(indexes): dataset scope, auto-embedding flags, delete, embedding-providers CRUD

  INDEXES
  - New `--dataset-id` scope alternative to `--connection-id`/`--schema`/`--table` on `indexes list`, `indexes create`, and the new `indexes delete` subcommand. Scopes are mutually exclusive (clap-enforced).
  - New auto-embedding flags on `indexes create`: --embedding-provider-id --dimensions --output-column --description. When `--type vector` runs against a text column, the server generates embeddings automatically using the named provider (or the first system provider). Generated column defaults to `{column}_embedding`.
  - `--type` is now required on `indexes create` (previously defaulted to `sorted`). Forces deliberate choice. BREAKING for scripts that omitted it.
  - New `indexes delete` subcommand for both connection and dataset scopes.
  - CLI-side pre-validation:
    * scope flags can't be mixed (clap mutex)
    * `--schema`/`--table` require `--connection-id` (clap)
    * `--connection-id` requires both `--schema` and `--table` (clap)
    * auto-embed flags only valid with `--type vector` (custom)
    * `--type vector` requires exactly one column in `--columns` (custom)

  BUG FIX
  - `indexes create --async` previously read `parsed["job_id"]` from the response, but the server returns `id` (per `SubmitJobResponse`). Result: it always printed `job_id: unknown`. Now reads `id` correctly. Confirmed end-to-end against prod with `hotdata jobs <id>` lookups working.

  EMBEDDING PROVIDERS
  - New `hotdata embedding-providers` command surface: list, get, create, update, delete
  - The "inline API key" flag is named `--inline-api-key` (struct field `inline_api_key`) to avoid colliding with the global `--api-key` auth flag: clap merges fields by their internal id, so reusing the name `api_key` would silently route the value to the auth layer.

  JOBS
  - Added `create_dataset_index` to the `--job-type` value list (server emits this type for async dataset index creation; the CLI was rejecting it as an invalid filter value).

  API LAYER
  - Added `ApiClient::delete_raw`, needed for `indexes delete` and `embedding-providers delete`. Mirrors `post_raw`/`get_raw` shape.

* docs: document indexes scope flags, auto-embedding, and embedding-providers CRUD

  Updates README.md and skills/hotdata/SKILL.md for the new surface:
  - `hotdata indexes` now supports both connection and dataset scope; show both invocation forms side-by-side, note `--type` is required, and document the auto-embedding flags (--embedding-provider-id, --dimensions, --output-column, --description).
  - `hotdata indexes delete` is new; documented for both scopes.
  - `hotdata embedding-providers` is new; full list/get/create/update/delete surface documented, with a callout that `--inline-api-key` (not `--api-key`) is the inline-secret flag, to avoid colliding with the global auth `--api-key`.
  - `--job-type` filter list updated with `create_dataset_index`.

* refactor(embedding-providers): rename --inline-api-key to --provider-api-key

  Renames the flag (and Rust struct field) on `embedding-providers create` and `embedding-providers update` from `--inline-api-key` / `inline_api_key` to `--provider-api-key` / `provider_api_key`.

  Why:
  - Pairs naturally with the existing `--provider-type` flag on the same subcommand (consistent prefix family).
  - Self-documenting: this is the embedding service's own API key (e.g. an OpenAI sk-... key), not the user's Hotdata auth credential.
  - Avoids the clap field-id collision with the global `Cli::api_key` that motivated the original rename, but does so via a name that reads more naturally than `--inline-api-key`.

  The JSON request body field stays `api_key` per the OpenAPI schema; only the user-facing CLI flag and Rust field are renamed.

* feat(search): require --type, route to server-side bm25_search/vector_distance

  `hotdata search` now requires `--type vector|bm25` (no default; same rule as `indexes create --type`) and a positional query text argument. Both modes run entirely server-side with no client-side embedding.

  Routing:
  - `--type vector "<query>"` → SELECT *, vector_distance(<col>, '<query>') AS dist FROM <t> ORDER BY dist. Server resolves the embedding column, model, dimensions, and metric from the index metadata. The user names the source text column.
  - `--type bm25 "<query>"` → existing bm25_search() server-side path.

  Removed:
  - `--model` flag (was: client-side OpenAI embedding + `l2_distance` SQL).
  - Stdin-piped-vector path (was: read JSON vector from stdin, generate `l2_distance` SQL).
  - `src/embedding.rs` module (its only callers were the two paths above).

  Both removed paths hardcoded `l2_distance` regardless of the index's actual metric, which silently produced wrong rankings on cosine indexes. They also required the user to point `--column` at the auto-generated `_embedding` column rather than the source text column. Power users who need client-side embedding or want to query with a precomputed vector can use raw SQL via `hotdata query` (e.g. `SELECT *, cosine_distance(...)`).

  Verified against prod on `my_ducklake.main.internet_pages_small`:
  - BM25 "basketball" → finds the basketball ProCamp title (score 2.92)
  - BM25 "HIV" → finds the HIV Story titles (score 4.81)
  - Vector "sports games athletes" → ranks the basketball ProCamp first (cosine distance 0.69), heart-attack-fitness second (0.80)
  - Vector "travel vacation cruise" → ranks the cruise excursion first (0.63), 48-hours-in-Cesky-Krumlov second (0.74)

  The semantically meaningful vector results confirm auto-embedding produced useful vectors AND the server-side rewrite correctly resolves provider+metric+output_column from index metadata. Cleaned up indexes after the test run.

* fix(embedding-providers): use renamed --provider-api-key in update no-op error

  `update`'s "provide at least one field" guard message still listed `--api-key`, which was the original local flag name before it was renamed to `--provider-api-key` to avoid colliding with the global Hotdata auth flag. A user following the error guidance would reach for `--api-key` (the global auth flag), not the provider key. One-character class of fix; caught by the PR review bot on #67.

* test: add fixture tests for new code paths to match repo patterns

  Adds 7 unit tests covering the lightest-weight gaps in the new code:
  - `src/api.rs`: 2 mockito tests for `delete_raw` (204 success, 404 with error body), matching the existing `get_none_if_not_found_*` pattern.
  - `src/indexes.rs`: 2 path-construction tests for the new `IndexScope` enum (`Connection` and `Dataset` variants), matching the existing pure-function tests in this module.
  - `src/embedding_providers.rs`: 3 deserialization fixture tests for the `Provider` and `ListResponse` shapes plus a `parse_config` smoke test, matching the runtimedb-payload-deserialization pattern from `datasets.rs::update_response_deserializes_runtimedb_payload`.

  Skipped `datasets.rs::refresh`: adding a typed response struct purely for test fixture purposes would be over-engineering since the refresh function reads the response as `serde_json::Value` directly.

  110 tests → 117 (105 → 112 unit + 5 integration unchanged). Patch coverage on the PR is still informational by repo policy (codecov.yml).

* docs(search): make auto-embedding flow explicit in the Search section

  Reading just the Search section, a user might miss that vector search is end-to-end auto-embedded: both the column's embeddings (built when the index was created) and the query embedding (computed at search time) come from the same server-configured provider, with matching metric, model, and dimensions.

  Spells that out at the top of the `--type vector` bullet, and adds an explicit pointer to raw SQL via `hotdata query` for cases where the user needs a different model than the index, or has no index at all (the SQL reference covers the underlying distance functions and table UDFs).
1 parent ae8ec55 commit 27dcdb0

10 files changed

Lines changed: 1093 additions & 266 deletions


README.md

Lines changed: 56 additions & 20 deletions
Original file line number | Diff line number | Diff line change
@@ -71,7 +71,8 @@ API key priority (lowest to highest): config file → `HOTDATA_API_KEY` env var
7171
| `query` | | Execute a SQL query |
7272
| `queries` | `list` | Inspect query run history |
7373
| `search` | | Full-text search across a table column |
74-
| `indexes` | `list`, `create` | Manage indexes on a table |
74+
| `indexes` | `list`, `create`, `delete` | Manage indexes on a table or dataset |
75+
| `embedding-providers` | `list`, `get`, `create`, `update`, `delete` | Manage embedding providers used by vector indexes |
7576
| `results` | `list` | Retrieve stored query results |
7677
| `jobs` | `list` | Manage background jobs |
7778
| `sandbox` | `list`, `new`, `set`, `read`, `update`, `run` | Manage sandboxes |
@@ -101,13 +102,16 @@ hotdata workspaces set [<workspace_id>]
101102
```sh
102103
hotdata connections list [-w <id>] [-o table|json|yaml]
103104
hotdata connections <connection_id> [-w <id>] [-o table|json|yaml]
104-
hotdata connections refresh <connection_id> [-w <id>]
105+
hotdata connections refresh <connection_id> [-w <id>] [--data] [--schema <name> --table <name>] [--async] [--include-uncached]
105106
hotdata connections new [-w <id>]
106107
```
107108

108109
- `list` returns `id`, `name`, `source_type` for each connection.
109110
- Pass a connection ID to view details (id, name, source type, table counts).
110-
- `refresh` triggers a schema refresh for a connection.
111+
- `refresh` triggers a schema refresh by default. Pass `--data` to refresh cached row data instead.
112+
- `--schema` and `--table` narrow a data refresh to a single table (must be supplied together).
113+
- `--async` submits a data refresh as a background job and returns a job ID; poll with `hotdata jobs <job_id>`. Only valid with `--data` — schema refresh is always synchronous.
114+
- `--include-uncached` includes tables that haven't been cached yet in a connection-wide data refresh. Only valid with `--data` and no `--table`.
111115
- `new` launches an interactive connection creation wizard.
112116

113117
### Create a connection
@@ -143,13 +147,16 @@ hotdata datasets create --file data.csv [--label "My Dataset"] [--table-name my_
143147
hotdata datasets create --sql "SELECT ..." --label "My Dataset"
144148
hotdata datasets create --url "https://example.com/data.parquet" --label "My Dataset"
145149
hotdata datasets update <dataset_id> [--label "New Label"] [--table-name new_table]
150+
hotdata datasets refresh <dataset_id> [--workspace-id <id>] [--async]
146151
```
147152

148153
- Datasets are queryable as `datasets.main.<table_name>`.
149154
- `--file`, `--sql`, `--query-id`, and `--url` are mutually exclusive.
150155
- `--url` imports data directly from a URL (supports csv, json, parquet).
151156
- Format is auto-detected from file extension or content.
152157
- Piped stdin is supported: `cat data.csv | hotdata datasets create --label "My Dataset"`
158+
- `refresh` re-runs the dataset's source (URL fetch or saved query) and creates a new version. Not supported for upload-source datasets.
159+
- `--async` submits the refresh as a background job and returns a job ID; poll with `hotdata jobs <job_id>`.
153160

154161
## Workspace context
155162

@@ -194,33 +201,62 @@ hotdata queries <query_run_id> [-o table|json|yaml]
194201

195202
## Search
196203

197-
```sh
198-
# BM25 full-text search
199-
hotdata search "query text" --table <connection.schema.table> --column <column> [--select <columns>] [--limit <n>] [-o table|json|csv]
204+
`--type` is **required** — no default. Pass either `vector` (similarity search via the index's embedding provider) or `bm25` (full-text search). Both run entirely server-side.
200205

201-
# Vector search with --model (calls OpenAI to embed the query)
202-
hotdata search "query text" --table <table> --column <vector_column> --model text-embedding-3-small [--limit <n>]
206+
```sh
207+
# BM25 full-text search (requires a BM25 index on the column)
208+
hotdata search "<query>" --type bm25 --table <connection.schema.table> --column <column> [--select <columns>] [--limit <n>] [-o table|json|csv]
203209

204-
# Vector search with piped embedding
205-
echo '[0.1, -0.2, ...]' | hotdata search --table <table> --column <vector_column> [--limit <n>]
210+
# Vector search (requires a vector index with auto-embedding on the column)
211+
hotdata search "<query>" --type vector --table <table> --column <source_text_column> [--limit <n>]
206212
```
207213

208-
- Without `--model` and with query text: BM25 full-text search. Requires a BM25 index on the target column.
209-
- With `--model`: generates an embedding via OpenAI and performs vector search using `l2_distance`. Requires `OPENAI_API_KEY` env var.
210-
- Without query text and with piped stdin: reads a vector (raw JSON array or OpenAI embedding response) and performs vector search.
211-
- BM25 results are ordered by relevance score (descending). Vector results are ordered by distance (ascending).
214+
- **`--type vector`**: pass your query as **plain text** and name the **source text column** (e.g. `title`). The server embeds the query at search time, using the same provider that auto-embedded the column when the index was built, so the distance metric, model, and dimensions all match automatically. No `OPENAI_API_KEY`, no client-side embedding, and no need to know about the auto-generated `_embedding` column. Generated SQL: `vector_distance(col, 'query')`, evaluated server-side.
215+
- **`--type bm25`** runs `bm25_search(table, col, 'query')` — requires a BM25 index on the column.
216+
- **No vector index, or want to use a different model than the index?** Skip `hotdata search` and use raw SQL via `hotdata query` (e.g. `SELECT *, cosine_distance(col, [<your_vec>]) FROM ...`). The SQL reference covers the available distance functions and table UDFs.
217+
- BM25 results sort by score (descending). Vector results sort by distance (ascending).
212218
- `--select` specifies which columns to return (comma-separated, defaults to all).
219+
- The previous `--model` flag and stdin-piped-vector path are **removed** — both hardcoded `l2_distance` regardless of the index's actual metric, which silently produced wrong rankings on cosine indexes. For client-side embedding or precomputed-vector workflows, use raw SQL via `hotdata query` (e.g. `SELECT *, cosine_distance(col, [<vec>]) ...`).
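To see why a hardcoded L2 distance can misrank results on a cosine index, consider a small worked example. The two distance functions below are textbook definitions written for this illustration (they are not the server's `l2_distance` or `cosine_distance` implementations), and the vectors are made up.

```rust
/// Euclidean (L2) distance between two equal-length vectors.
fn l2_distance(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

/// Cosine distance: 1 minus the cosine of the angle between the vectors.
fn cosine_distance(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm = |v: &[f64]| v.iter().map(|x| x * x).sum::<f64>().sqrt();
    1.0 - dot / (norm(a) * norm(b))
}

/// Index of the candidate closest to `query` under `dist`.
/// Assumes `candidates` is non-empty.
fn nearest(query: &[f64], candidates: &[Vec<f64>], dist: fn(&[f64], &[f64]) -> f64) -> usize {
    let mut best = 0;
    for i in 1..candidates.len() {
        if dist(query, &candidates[i]) < dist(query, &candidates[best]) {
            best = i;
        }
    }
    best
}
```

With the query `[1.0, 0.0]` and the candidates `[10.0, 0.0]` and `[1.0, 1.0]`, the first candidate points in exactly the same direction as the query (cosine distance 0) but lies 9 units away, while the second is only 1 unit away but at a 45 degree angle. L2 therefore picks the second candidate and cosine picks the first, so the two metrics disagree about the top result.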
213220

214221
## Indexes
215222

223+
Indexes attach to either a connection-table (`--connection-id` + `--schema` + `--table`) or a dataset (`--dataset-id`). The two scopes are mutually exclusive.
224+
225+
```sh
226+
# Connection-table scope
227+
hotdata indexes list --connection-id <id> --schema <schema> --table <table> [-o table|json|yaml]
228+
hotdata indexes create --connection-id <id> --schema <schema> --table <table> \
229+
--name <name> --columns <cols> --type sorted|bm25|vector \
230+
[--metric l2|cosine|dot] [--async] \
231+
[--embedding-provider-id <id>] [--dimensions <n>] [--output-column <name>] [--description <text>]
232+
hotdata indexes delete --connection-id <id> --schema <schema> --table <table> --name <name>
233+
234+
# Dataset scope
235+
hotdata indexes list --dataset-id <id> [-o table|json|yaml]
236+
hotdata indexes create --dataset-id <id> --name <name> --columns <cols> --type sorted|bm25|vector ...
237+
hotdata indexes delete --dataset-id <id> --name <name>
238+
```
239+
240+
- `--type` is **required** — choose `sorted` (B-tree-like), `bm25` (full-text), or `vector` (similarity).
241+
- `--type vector` requires exactly one column.
242+
- `--async` submits index creation as a background job and returns a job ID; poll with `hotdata jobs <job_id>`.
243+
- **Auto-embedding (text → vector):** when `--type vector` is used on a text column, embeddings are generated automatically. The embedding provider can be specified with `--embedding-provider-id`; if omitted, the first system provider is used. The generated column defaults to `{column}_embedding` and can be overridden with `--output-column`.
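The `indexes create` rules above (one column for vector indexes, auto-embedding flags only with `--type vector`, and the `{column}_embedding` default) can be sketched as follows. The `CreateIndexArgs` struct, both function names, and the error strings are hypothetical; the real CLI enforces part of this through clap and may be structured differently.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IndexType {
    Sorted,
    Bm25,
    Vector,
}

/// Hypothetical container for the `indexes create` flags discussed above.
#[derive(Debug, Clone)]
pub struct CreateIndexArgs {
    pub index_type: IndexType,
    pub columns: Vec<String>,
    pub embedding_provider_id: Option<String>,
    pub dimensions: Option<u32>,
    pub output_column: Option<String>,
    pub description: Option<String>,
}

/// Checks the two custom rules named in the commit message.
pub fn validate_create(args: &CreateIndexArgs) -> Result<(), String> {
    let has_auto_embed_flag = args.embedding_provider_id.is_some()
        || args.dimensions.is_some()
        || args.output_column.is_some()
        || args.description.is_some();
    if has_auto_embed_flag && args.index_type != IndexType::Vector {
        return Err("auto-embedding flags are only valid with --type vector".to_string());
    }
    if args.index_type == IndexType::Vector && args.columns.len() != 1 {
        return Err("--type vector requires exactly one column".to_string());
    }
    Ok(())
}

/// Name of the generated embedding column for a vector index:
/// `--output-column` if given, otherwise `{column}_embedding`.
/// Returns None for non-vector indexes or when no column is listed.
pub fn embedding_column(args: &CreateIndexArgs) -> Option<String> {
    if args.index_type != IndexType::Vector {
        return None;
    }
    let source = args.columns.first()?;
    Some(
        args.output_column
            .clone()
            .unwrap_or_else(|| format!("{}_embedding", source)),
    )
}
```

For a vector index on `title` with no `--output-column`, `embedding_column` yields `title_embedding`, which matches the default described above.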
244+
245+
## Embedding providers
246+
216247
```sh
217-
hotdata indexes list --connection-id <id> --schema <schema> --table <table> [--workspace-id <id>] [--format table|json|yaml]
218-
hotdata indexes create --connection-id <id> --schema <schema> --table <table> --name <name> --columns <cols> [--type sorted|bm25|vector] [--metric l2|cosine|dot] [--async]
248+
hotdata embedding-providers list [-o table|json|yaml]
249+
hotdata embedding-providers get <id> [-o table|json|yaml]
250+
hotdata embedding-providers create --name <name> --provider-type service|local \
251+
[--config '<json>'] [--provider-api-key <key> | --secret-name <name>]
252+
hotdata embedding-providers update <id> [--name <name>] [--config '<json>'] \
253+
[--provider-api-key <key> | --secret-name <name>]
254+
hotdata embedding-providers delete <id>
219255
```
220256

221-
- `list` shows indexes on a table with name, type, columns, status, and creation date.
222-
- `create` creates an index. Use `--type bm25` for full-text search, `--type vector` for vector search (requires `--metric`).
223-
- `--async` submits index creation as a background job.
257+
- `list`/`get` show registered providers (system providers like `sys_emb_openai` come pre-configured).
258+
- `--provider-api-key` auto-creates a managed secret for the provider; `--secret-name` references an existing secret. They are mutually exclusive.
259+
- `--provider-api-key` pairs with `--provider-type` and avoids colliding with the global `--api-key` (Hotdata auth).
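As a rough sketch of the checks implied for `embedding-providers update`, the snippet below enforces the mutual exclusion of `--provider-api-key` and `--secret-name` and rejects a call that changes nothing. The struct, the function, and the exact error wording are assumptions made for this example; only the flag names come from the documentation above.

```rust
/// Hypothetical container for the `embedding-providers update` flags.
#[derive(Debug, Default, Clone)]
pub struct UpdateProviderArgs {
    pub name: Option<String>,
    pub config: Option<String>,
    pub provider_api_key: Option<String>,
    pub secret_name: Option<String>,
}

/// Rejects conflicting secret flags and no-op updates.
pub fn validate_update(args: &UpdateProviderArgs) -> Result<(), String> {
    // An inline key and a reference to an existing secret cannot both be given.
    if args.provider_api_key.is_some() && args.secret_name.is_some() {
        return Err("--provider-api-key and --secret-name are mutually exclusive".to_string());
    }
    // An update must change at least one field. The message names the
    // provider flag, not the global `--api-key` auth flag.
    let has_change = args.name.is_some()
        || args.config.is_some()
        || args.provider_api_key.is_some()
        || args.secret_name.is_some();
    if !has_change {
        return Err(
            "provide at least one of --name, --config, --provider-api-key, --secret-name"
                .to_string(),
        );
    }
    Ok(())
}
```

Naming `--provider-api-key` in the guard message matters because a user who follows a message that says `--api-key` would end up passing the Hotdata auth flag instead of the provider's key.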
224260

225261
## Results
226262

@@ -239,7 +275,7 @@ hotdata jobs <job_id> [--workspace-id <id>] [--format table|json|yaml]
239275
```
240276

241277
- `list` shows only active jobs (`pending` and `running`) by default. Use `--all` to see all jobs.
242-
- `--job-type` accepts: `data_refresh_table`, `data_refresh_connection`, `create_index`.
278+
- `--job-type` accepts: `data_refresh_table`, `data_refresh_connection`, `dataset_refresh`, `create_index`, `create_dataset_index`.
243279
- `--status` accepts: `pending`, `running`, `succeeded`, `partially_succeeded`, `failed`.
244280

245281
## Sandboxes

skills/hotdata/SKILL.md

Lines changed: 60 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -94,12 +94,15 @@ hotdata connections <connection_id> [--workspace-id <workspace_id>] [--output ta
9494
- `list` returns `id`, `name`, `source_type` for each connection.
9595
- Pass a connection ID to view details (id, name, source type, table counts).
9696

97-
### Refresh connection schema
97+
### Refresh connection schema or data
9898
```
99-
hotdata connections refresh <connection_id> [--workspace-id <workspace_id>]
99+
hotdata connections refresh <connection_id> [--workspace-id <workspace_id>] [--data] [--schema <name> --table <name>] [--async] [--include-uncached]
100100
```
101-
- Refreshes the connection’s catalog so new or changed tables and columns appear in `hotdata tables list` and queries.
102-
- Use after DDL or other changes in the source database when the workspace view is stale.
101+
- Default (no flags) refreshes the connection’s catalog so new or changed tables and columns appear in `hotdata tables list` and queries. Use after DDL or other changes in the source database when the workspace view is stale.
102+
- `--data` re-syncs cached row data from the source instead of refreshing the catalog.
103+
- `--schema` and `--table` narrow a data refresh to a single table (must be supplied together).
104+
- `--async` submits a data refresh as a background job and returns a job ID; poll with `hotdata jobs <job_id>`. Only valid with `--data` — schema refresh is always synchronous.
105+
- `--include-uncached` includes tables that haven't been cached yet in a connection-wide data refresh. Only valid with `--data` and no `--table`.
103106

104107
### Create a Connection
105108

@@ -212,6 +215,14 @@ hotdata datasets create --label "My Dataset" --upload-id <upload_id> [--format c
212215
- `--table-name` is optional — derived from the label if omitted.
213216
- After **`datasets create`**, the CLI prints a **`full_name`** line (for example `datasets.main.my_table` or `datasets.s_ufmblmvq.tac_csat`). **Always use that `full_name` in SQL**—do not assume `datasets.main`.
214217

218+
#### Refresh a dataset
219+
```
220+
hotdata datasets refresh <dataset_id> [--workspace-id <workspace_id>] [--async]
221+
```
222+
- Re-runs the dataset's source (URL fetch or saved query) and creates a **new version**. Use after the upstream source has changed.
223+
- **Not supported for upload-source datasets** — those have no remote source to re-pull from. The CLI surfaces the server's `400` directly.
224+
- `--async` submits the refresh as a background job and returns a `job_id`; poll with **`hotdata jobs <job_id>`**.
225+
215226
#### Querying datasets
216227

217228
Qualified dataset tables are **`datasets.<schema>.<table_name>`**: **`main`** for workspace-scoped datasets (created outside a sandbox), or the **sandbox id** for sandbox-created data (e.g. `datasets.s_ufmblmvq.tac_csat`). The create output’s **`full_name`** is authoritative—copy it into `FROM` / `JOIN` clauses instead of guessing `datasets.main.…`.
@@ -286,40 +297,67 @@ These commands use the **active workspace only** (the `queries` command has no `
286297
To create a dataset from a **saved query** still registered for the workspace, use **`hotdata datasets create --query-id <saved_query_id>`** (this CLI does not expose separate saved-query create/run subcommands).
287298

288299
### Search
289-
```
290-
# BM25 full-text search
291-
hotdata search "query text" --table <connection.schema.table> --column <column> [--select <columns>] [--limit <n>] [--output table|json|csv]
292300

293-
# Vector search with --model (calls OpenAI to embed the query)
294-
hotdata search "query text" --table <table> --column <vector_column> --model text-embedding-3-small [--limit <n>]
301+
`--type` is **required**. Pass `vector` or `bm25`. Both run entirely server-side.
302+
303+
```
304+
# BM25 full-text search (requires BM25 index on the column)
305+
hotdata search "<query>" --type bm25 --table <connection.schema.table> --column <column> [--select <columns>] [--limit <n>] [--output table|json|csv]
295306
296-
# Vector search with piped embedding
297-
echo '[0.1, -0.2, ...]' | hotdata search --table <table> --column <vector_column> [--limit <n>]
307+
# Vector similarity search via server-side auto-embed (requires a vector index on the column)
308+
hotdata search "<query>" --type vector --table <table> --column <source_text_column> [--limit <n>]
298309
```
299-
- Without `--model` and with query text: BM25 full-text search. Requires a BM25 index on the target column.
300-
- With `--model`: generates an embedding via OpenAI and performs vector search using `l2_distance`. Requires `OPENAI_API_KEY` env var. Supported models: `text-embedding-3-small`, `text-embedding-3-large`.
301-
- Without query text and with piped stdin: reads a vector (raw JSON array or OpenAI embedding response) and performs vector search.
302-
- BM25 results are ordered by relevance score (descending). Vector results are ordered by distance (ascending).
310+
- **`--type vector`**: pass the query as **plain text** and name the **source text column** (e.g. `title`). The server embeds the query at search time, using the same provider that auto-embedded the column when the index was built, so the distance metric, model, and dimensions match automatically. No client-side embedding, no `OPENAI_API_KEY` required. Generated SQL: `vector_distance(col, 'text')`.
311+
- **`--type bm25`** generates `bm25_search(table, col, 'text')` server-side; requires a BM25 index on the column.
312+
- **No vector index on the column, or want a different embedding model?** `hotdata search` won't help — drop down to raw SQL via `hotdata query` (e.g. `SELECT *, cosine_distance(col, [<vec>]) FROM ...`). See the SQL reference for available distance functions and table UDFs.
313+
- BM25 results sort by score (descending). Vector results sort by distance (ascending).
303314
- `--select` specifies which columns to return (comma-separated, defaults to all).
304315
- Default limit is 10.
305-
- **For BM25 search, create a BM25 index on the target column first. For vector search, create a vector index.**
316+
- **For BM25 search, create a BM25 index on the target column first (`hotdata indexes create ... --type bm25`). For vector search, create a vector index, optionally with auto-embedding on a text column.**
317+
- The earlier `--model` flag and stdin-piped-vector path have both been removed. They hardcoded `l2_distance` regardless of the index's metric (silently wrong on cosine indexes). For client-side embedding or precomputed-vector workflows, use raw SQL via `hotdata query`.
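The SQL shapes named above can be sketched as simple string builders. The vector statement follows the form quoted in the commit message; the `LIMIT` clause, the quote-doubling escape, and all function names are assumptions added for this illustration, and the BM25 helper only reproduces the `bm25_search(table, col, 'text')` call fragment rather than a full statement. Table and column names are inserted as-is, without identifier quoting.

```rust
/// Wraps text in single quotes, doubling any embedded single quote
/// (standard SQL string-literal escaping; assumed, not taken from the CLI).
pub fn sql_string_literal(text: &str) -> String {
    format!("'{}'", text.replace('\'', "''"))
}

/// Builds the vector-search statement in the shape given in the commit message.
/// The trailing LIMIT clause is an assumption.
pub fn vector_search_sql(table: &str, column: &str, query: &str, limit: usize) -> String {
    format!(
        "SELECT *, vector_distance({}, {}) AS dist FROM {} ORDER BY dist LIMIT {}",
        column,
        sql_string_literal(query),
        table,
        limit
    )
}

/// Builds only the `bm25_search(...)` call fragment documented above.
pub fn bm25_call(table: &str, column: &str, query: &str) -> String {
    format!("bm25_search({}, {}, {})", table, column, sql_string_literal(query))
}
```

Note that the column passed to `vector_search_sql` is the source text column such as `title`, not the generated `_embedding` column, since the server resolves the embedding column from the index metadata.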
306318

307319
### Indexes
320+
321+
Indexes attach to either a connection-table (`--connection-id` + `--schema` + `--table`) or a dataset (`--dataset-id`) — the two scopes are mutually exclusive. `--type` is required (no default).
322+
323+
```
324+
# Connection-table scope
325+
hotdata indexes list --connection-id <connection_id> --schema <schema> --table <table> [--workspace-id <workspace_id>] [--output table|json|yaml]
326+
hotdata indexes create --connection-id <connection_id> --schema <schema> --table <table> \
327+
--name <name> --columns <cols> --type sorted|bm25|vector \
328+
[--metric l2|cosine|dot] [--async] \
329+
[--embedding-provider-id <id>] [--dimensions <n>] [--output-column <name>] [--description <text>]
330+
hotdata indexes delete --connection-id <connection_id> --schema <schema> --table <table> --name <name>
331+
332+
# Dataset scope (positional dataset_id replaced by --dataset-id flag)
333+
hotdata indexes list --dataset-id <dataset_id> [--workspace-id <workspace_id>] [--output table|json|yaml]
334+
hotdata indexes create --dataset-id <dataset_id> --name <name> --columns <cols> --type sorted|bm25|vector ...
335+
hotdata indexes delete --dataset-id <dataset_id> --name <name>
336+
```
337+
- `--type` accepts `sorted` (B-tree-like; range/exact lookups), `bm25` (full-text), or `vector` (similarity). It is **required**.
338+
- `--type vector` requires exactly one column.
339+
- `--async` submits index creation as a background job; poll with `hotdata jobs <job_id>`.
340+
- **Auto-embedding:** with `--type vector` on a **text** column, the server generates embeddings automatically. Pass `--embedding-provider-id` to pick a specific provider; if omitted, the first system provider is used. The generated column defaults to `{column}_embedding` (override with `--output-column`).
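The scope rules for `indexes` (connection-table versus dataset, never both) can be illustrated with a small resolver. The enum name `IndexScope` and its `Connection` and `Dataset` variants appear in the commit message; the variant fields, the `resolve_scope` function, and the error strings are assumptions for this sketch, and the real CLI enforces these rules through clap rather than hand-written matching.

```rust
/// The two index scopes. Variant fields are assumed for illustration.
#[derive(Debug, PartialEq, Eq)]
pub enum IndexScope {
    Connection { connection_id: String, schema: String, table: String },
    Dataset { dataset_id: String },
}

/// Maps the four scope flags to a scope, or explains which rule was broken.
pub fn resolve_scope(
    connection_id: Option<&str>,
    schema: Option<&str>,
    table: Option<&str>,
    dataset_id: Option<&str>,
) -> Result<IndexScope, String> {
    match (connection_id, schema, table, dataset_id) {
        (Some(c), Some(s), Some(t), None) => Ok(IndexScope::Connection {
            connection_id: c.to_string(),
            schema: s.to_string(),
            table: t.to_string(),
        }),
        (None, None, None, Some(d)) => Ok(IndexScope::Dataset { dataset_id: d.to_string() }),
        (None, None, None, None) => {
            Err("pass --connection-id with --schema and --table, or --dataset-id".to_string())
        }
        // Any remaining combination that includes --dataset-id mixes the scopes.
        (_, _, _, Some(_)) => Err(
            "--dataset-id cannot be combined with --connection-id, --schema, or --table"
                .to_string(),
        ),
        (None, _, _, None) => Err("--schema and --table require --connection-id".to_string()),
        (Some(_), _, _, None) => {
            Err("--connection-id requires both --schema and --table".to_string())
        }
    }
}
```

Resolving the flags into a single enum value up front means the rest of the command only has to handle two well-formed cases.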
341+
342+
### Embedding providers
308343
```
309-
hotdata indexes list --connection-id <connection_id> --schema <schema> --table <table> [--workspace-id <workspace_id>] [--output table|json|yaml]
310-
hotdata indexes create --connection-id <connection_id> --schema <schema> --table <table> --name <name> --columns <cols> [--workspace-id <workspace_id>] [--type sorted|bm25|vector] [--metric l2|cosine|dot] [--async]
344+
hotdata embedding-providers list [--workspace-id <workspace_id>] [--output table|json|yaml]
345+
hotdata embedding-providers get <id> [--workspace-id <workspace_id>] [--output table|json|yaml]
346+
hotdata embedding-providers create --name <name> --provider-type service|local \
347+
[--config '<json>'] [--provider-api-key <key> | --secret-name <name>] [--workspace-id <workspace_id>]
348+
hotdata embedding-providers update <id> [--name <name>] [--config '<json>'] [--provider-api-key <key> | --secret-name <name>]
349+
hotdata embedding-providers delete <id> [--workspace-id <workspace_id>]
311350
```
312-
- `list` shows indexes on a table with name, type, columns, status, and creation date.
313-
- `create` creates an index. Use `--type bm25` for full-text search, `--type vector` for vector search (requires `--metric`).
314-
- `--async` submits index creation as a background job. Use `hotdata jobs <job_id>` to check status.
351+
- System providers (e.g. `sys_emb_openai`) come pre-configured. `list` shows IDs to pass to `--embedding-provider-id`.
352+
- `--provider-api-key` (the embedding service's own key, e.g. an OpenAI `sk-...`) auto-creates a managed secret. Pairs with `--provider-type`; named to avoid colliding with the global `--api-key` (Hotdata auth). `--secret-name` references an existing secret. Mutually exclusive.
315353

316354
### Jobs
317355
```
318356
hotdata jobs list [--workspace-id <workspace_id>] [--job-type <type>] [--status <status>] [--all] [--limit <n>] [--offset <n>] [--output table|json|yaml]
319357
hotdata jobs <job_id> [--workspace-id <workspace_id>] [--output table|json|yaml]
320358
```
321359
- `list` shows only active jobs (`pending`, `running`) by default. Use `--all` to see all jobs.
322-
- `--job-type`: `data_refresh_table`, `data_refresh_connection`, `create_index`.
360+
- `--job-type`: `data_refresh_table`, `data_refresh_connection`, `dataset_refresh`, `create_index`, `create_dataset_index`.
323361
- `--status`: `pending`, `running`, `succeeded`, `partially_succeeded`, `failed`.
324362
- Use `hotdata jobs <job_id>` to inspect a specific job's status, error, and result.
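The `--job-type` filter accepts a fixed set of values, which can be sketched as an allow-list check. The five values are the ones listed above; the constant, the function name, and the error wording are hypothetical.

```rust
/// Values accepted by `--job-type`, as documented above.
pub const JOB_TYPES: [&str; 5] = [
    "data_refresh_table",
    "data_refresh_connection",
    "dataset_refresh",
    "create_index",
    "create_dataset_index",
];

/// Returns the value unchanged if it is an accepted job type,
/// otherwise an error listing the accepted values.
pub fn check_job_type(value: &str) -> Result<&str, String> {
    if JOB_TYPES.contains(&value) {
        Ok(value)
    } else {
        Err(format!(
            "invalid --job-type '{}'; expected one of: {}",
            value,
            JOB_TYPES.join(", ")
        ))
    }
}
```

Keeping this list in sync with the job types the server emits is the point of the change: before `dataset_refresh` and `create_dataset_index` were added, the CLI rejected them as filter values even though such jobs existed.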
325363
