Commit 27dcdb0
feat: data/dataset refresh + indexes auto-embedding + embedding providers (#67)
* feat(connections): add data refresh with async, scope, and include-uncached
`hotdata connections refresh` previously only triggered a synchronous
schema refresh. The /v1/refresh endpoint actually supports data refresh
(per-table or whole-connection), an async/background-job mode, and an
include_uncached toggle for picking up newly-discovered tables — none
of which were exposed.
Adds:
- --data: refresh data instead of schema metadata
- --schema/--table: narrow scope (server requires both together for data refresh)
- --async: submit as a background job, returns a job id to poll via `hotdata jobs <id>`
- --include-uncached: connection-wide data refresh only, includes uncached tables
- CLI-side validation mirroring server rules so we fail fast with clear errors
- Richer output: schema refresh now reports tables_discovered/added/modified;
data refresh reports rows_synced and duration
Also adds `dataset_refresh` to the allowed values for `jobs list --job-type`,
which the server emits but the CLI didn't accept as a filter.
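For illustration, the two scope rules named above (`--schema`/`--table` together for a data refresh, `--include-uncached` only for a connection-wide data refresh) could be checked roughly as follows. This is a minimal sketch using only the standard library; `RefreshArgs`, `validate_refresh`, and the error strings are hypothetical names, not the CLI's actual code, and any further server rules are not covered.

```rust
/// Hypothetical flag bundle for `hotdata connections refresh`.
/// Field names are illustrative, not the CLI's real struct.
#[derive(Debug, Default)]
struct RefreshArgs {
    data: bool,             // --data
    schema: Option<String>, // --schema
    table: Option<String>,  // --table
    include_uncached: bool, // --include-uncached
}

/// Checks the two rules stated in the commit message before a request is sent.
fn validate_refresh(args: &RefreshArgs) -> Result<(), String> {
    let scoped = args.schema.is_some() || args.table.is_some();

    // Rule 1: for a data refresh, --schema and --table must be given together.
    if args.data && args.schema.is_some() != args.table.is_some() {
        return Err("--schema and --table must be used together with --data".to_string());
    }

    // Rule 2: --include-uncached only applies to a connection-wide data refresh.
    if args.include_uncached && (!args.data || scoped) {
        return Err("--include-uncached requires --data without --schema/--table".to_string());
    }

    Ok(())
}
```

Checking these locally lets the command fail before any network call, with a message that names the offending flag.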
* docs: document connections refresh data/async/include-uncached flags
Updates the README and the bundled hotdata skill to match the expanded
`hotdata connections refresh` surface (--data, --schema/--table, --async,
--include-uncached) and to add `dataset_refresh` to the documented values
for `hotdata jobs list --job-type`.
* feat(datasets): add refresh subcommand with async support
Adds `hotdata datasets refresh <dataset_id> [--async]` for re-running a
dataset's source (URL fetch or saved query) to create a new version.
Calls the same `/v1/refresh` endpoint as `connections refresh`, but with
`dataset_id` set instead of `connection_id`.
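As a rough sketch of the shared-endpoint idea, the only difference between the two callers is which id field goes into the request body. `RefreshTarget` and `body_field` are hypothetical names for illustration; only the field names `connection_id` and `dataset_id` come from this commit.

```rust
/// What a refresh request targets. Hypothetical type for illustration.
enum RefreshTarget<'a> {
    Connection(&'a str),
    Dataset(&'a str),
}

impl<'a> RefreshTarget<'a> {
    /// Returns the (field name, id) pair that identifies the target in the
    /// body sent to `/v1/refresh`.
    fn body_field(&self) -> (&'static str, &'a str) {
        match self {
            RefreshTarget::Connection(id) => ("connection_id", *id),
            RefreshTarget::Dataset(id) => ("dataset_id", *id),
        }
    }
}
```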
The sync path prints the new version and status; the async path prints
the job ID and points the user at `hotdata jobs <id>` to poll. Upload-
source datasets have no remote to re-pull from, so the server's 400
("Refresh not supported for source type 'upload'") is surfaced directly.
Updates README.md and SKILL.md to document the new subcommand.
* feat(indexes): dataset scope, auto-embedding flags, delete, embedding-providers CRUD
INDEXES
- New `--dataset-id` scope alternative to `--connection-id`/`--schema`/`--table`
on `indexes list`, `indexes create`, and the new `indexes delete` subcommand.
Scopes are mutually exclusive (clap-enforced).
- New auto-embedding flags on `indexes create`:
--embedding-provider-id --dimensions --output-column --description
When `--type vector` runs against a text column, the server generates
embeddings automatically using the named provider (or the first system
provider). Generated column defaults to `{column}_embedding`.
- `--type` is now required on `indexes create` (previously defaulted to
`sorted`). Forces deliberate choice. BREAKING for scripts that omitted it.
- New `indexes delete` subcommand for both connection and dataset scopes.
- CLI-side pre-validation:
* scope flags can't be mixed (clap mutex)
* `--schema`/`--table` require `--connection-id` (clap)
* `--connection-id` requires both `--schema` and `--table` (clap)
* auto-embed flags only valid with `--type vector` (custom)
* `--type vector` requires exactly one column in `--columns` (custom)
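The two custom (non-clap) checks and the default output-column name could look roughly like this. It is a sketch under assumptions: `CreateIndexArgs`, `validate_create`, `default_output_column`, and the error strings are hypothetical, and the clap-enforced scope rules are left out.

```rust
/// Hypothetical subset of the `indexes create` arguments.
struct CreateIndexArgs {
    index_type: String,                    // --type, e.g. "vector" or "sorted"
    columns: Vec<String>,                  // --columns
    embedding_provider_id: Option<String>, // --embedding-provider-id
    dimensions: Option<u32>,               // --dimensions
    output_column: Option<String>,         // --output-column
    description: Option<String>,           // --description
}

/// Applies the two custom checks listed in the commit message.
fn validate_create(args: &CreateIndexArgs) -> Result<(), String> {
    let is_vector = args.index_type == "vector";
    let has_auto_embed_flag = args.embedding_provider_id.is_some()
        || args.dimensions.is_some()
        || args.output_column.is_some()
        || args.description.is_some();

    // Auto-embedding flags are only meaningful for vector indexes.
    if has_auto_embed_flag && !is_vector {
        return Err("auto-embedding flags require --type vector".to_string());
    }
    // A vector index is built over exactly one column.
    if is_vector && args.columns.len() != 1 {
        return Err("--type vector requires exactly one column in --columns".to_string());
    }
    Ok(())
}

/// Default name of the generated column: `{column}_embedding`.
fn default_output_column(column: &str) -> String {
    format!("{}_embedding", column)
}
```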
BUG FIX
- `indexes create --async` previously read `parsed["job_id"]` from the
response, but the server returns `id` (per `SubmitJobResponse`). Result:
it always printed `job_id: unknown`. Now reads `id` correctly. Confirmed
end-to-end against prod with `hotdata jobs <id>` lookups working.
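The failure mode is easy to reproduce in miniature. The sketch below stands in a plain `HashMap` for the parsed response (the real code indexes a JSON value), and `job_id_for_display` is a hypothetical helper: looking up the wrong key never errors, it just falls through to the `unknown` placeholder.

```rust
use std::collections::HashMap;

/// Returns the value stored under `key`, or "unknown" when the key is absent.
/// A missing key is silent, which is why the bug went unnoticed.
fn job_id_for_display(parsed: &HashMap<String, String>, key: &str) -> String {
    parsed
        .get(key)
        .cloned()
        .unwrap_or_else(|| "unknown".to_string())
}
```

With a response that carries only `id`, asking for `job_id` yields `unknown`, while asking for `id` yields the real job id.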
EMBEDDING PROVIDERS
- New `hotdata embedding-providers` command surface:
list, get, create, update, delete
- The "inline API key" flag is named `--inline-api-key` (struct field
`inline_api_key`) to avoid colliding with the global `--api-key` auth
flag — clap merges fields by their internal id, so reusing the name
`api_key` would silently route the value to the auth layer.
JOBS
- Added `create_dataset_index` to the `--job-type` value list (server
emits this type for async dataset index creation; the CLI was rejecting
it as an invalid filter value).
API LAYER
- Added `ApiClient::delete_raw` — needed for `indexes delete` and
`embedding-providers delete`. Mirrors `post_raw`/`get_raw` shape.
* docs: document indexes scope flags, auto-embedding, and embedding-providers CRUD
Updates README.md and skills/hotdata/SKILL.md for the new surface:
- `hotdata indexes` now supports both connection and dataset scope; show
both invocation forms side-by-side, note `--type` is required, and
document the auto-embedding flags (--embedding-provider-id, --dimensions,
--output-column, --description).
- `hotdata indexes delete` is new; documented for both scopes.
- `hotdata embedding-providers` is new; full list/get/create/update/delete
surface documented, with a callout that `--inline-api-key` (not
`--api-key`) is the inline-secret flag — to avoid colliding with the
global auth `--api-key`.
- `--job-type` filter list updated with `create_dataset_index`.
* refactor(embedding-providers): rename --inline-api-key to --provider-api-key
Renames the flag (and Rust struct field) on `embedding-providers create`
and `embedding-providers update` from `--inline-api-key` /
`inline_api_key` to `--provider-api-key` / `provider_api_key`.
Why:
- Pairs naturally with the existing `--provider-type` flag on the same
subcommand (consistent prefix family).
- Self-documenting: this is the embedding service's own API key (e.g.
an OpenAI sk-... key), not the user's Hotdata auth credential.
- Avoids the clap field-id collision with the global `Cli::api_key`
that motivated the original rename, but does so via a name that
reads more naturally than `--inline-api-key`.
The JSON request body field stays `api_key` per the OpenAPI schema —
only the user-facing CLI flag and Rust field are renamed.
* feat(search): require --type, route to server-side bm25_search/vector_distance
`hotdata search` now requires `--type vector|bm25` (no default; same rule
as `indexes create --type`) and a positional query text argument. Both
modes run entirely server-side with no client-side embedding.
Routing:
- `--type vector "<query>"` →
SELECT *, vector_distance(<col>, '<query>') AS dist FROM <t> ORDER BY dist
Server resolves the embedding column, model, dimensions, and metric from
the index metadata. The user names the source text column.
- `--type bm25 "<query>"` → existing bm25_search() server-side path.
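A minimal sketch of how the vector route's SQL could be assembled from the template above. `sql_quote` and `vector_search_sql` are hypothetical helpers; escaping the query by doubling single quotes is an assumption (the commit does not say how the CLI quotes input), and table and column names are inserted verbatim.

```rust
/// Wraps text in a single-quoted SQL literal, doubling embedded quotes.
/// Assumption: standard SQL quote doubling; the CLI's real escaping may differ.
fn sql_quote(text: &str) -> String {
    format!("'{}'", text.replace('\'', "''"))
}

/// Builds the vector-search statement from the template in the commit message.
/// `column` is the source text column; the server resolves the embedding
/// column, model, dimensions, and metric from the index metadata.
fn vector_search_sql(table: &str, column: &str, query: &str) -> String {
    format!(
        "SELECT *, vector_distance({}, {}) AS dist FROM {} ORDER BY dist",
        column,
        sql_quote(query),
        table
    )
}
```

Because the query text is passed as a string literal rather than a vector, no embedding happens on the client.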
Removed:
- `--model` flag (was: client-side OpenAI embedding + `l2_distance` SQL).
- Stdin-piped-vector path (was: read JSON vector from stdin, generate
`l2_distance` SQL).
- `src/embedding.rs` module (its only callers were the two paths above).
Both removed paths hardcoded `l2_distance` regardless of the index's
actual metric, which silently produced wrong rankings on cosine indexes.
They also required the user to point `--column` at the auto-generated
`_embedding` column rather than the source text column. Power users who
need client-side embedding or want to query with a precomputed vector
can use raw SQL via `hotdata query` (e.g. `SELECT *, cosine_distance(...)`).
Verified against prod on `my_ducklake.main.internet_pages_small`:
- BM25 "basketball" → finds the basketball ProCamp title (score 2.92)
- BM25 "HIV" → finds the HIV Story titles (score 4.81)
- Vector "sports games athletes" → ranks the basketball ProCamp first
(cosine distance 0.69), heart-attack-fitness second (0.80)
- Vector "travel vacation cruise" → ranks the cruise excursion first
(0.63), 48-hours-in-Cesky-Krumlov second (0.74)
The semantically meaningful vector results confirm auto-embedding produced
useful vectors AND the server-side rewrite correctly resolves
provider+metric+output_column from index metadata. Cleaned up indexes
after the test run.
* fix(embedding-providers): use renamed --provider-api-key in update no-op error
`update`'s "provide at least one field" guard message still listed
`--api-key`, which was the original local flag name before it was renamed
to `--provider-api-key` to avoid colliding with the global Hotdata auth
flag. A user following the error guidance would reach for `--api-key`
(the global auth flag), not the provider key.
A trivially small fix; caught by the PR review bot on #67.
* test: add fixture tests for new code paths to match repo patterns
Adds 7 unit tests covering the lightest-weight gaps in the new code:
- `src/api.rs` — 2 mockito tests for `delete_raw` (204 success, 404 with
error body), matching the existing `get_none_if_not_found_*` pattern.
- `src/indexes.rs` — 2 path-construction tests for the new `IndexScope`
enum (`Connection` and `Dataset` variants), matching the existing
pure-function tests in this module.
- `src/embedding_providers.rs` — 3 deserialization fixture tests for the
`Provider` and `ListResponse` shapes plus a `parse_config` smoke test,
matching the runtimedb-payload-deserialization pattern from
`datasets.rs::update_response_deserializes_runtimedb_payload`.
Skipped `datasets.rs::refresh` — adding a typed response struct purely
for test fixture purposes would be over-engineering since the refresh
function reads the response as `serde_json::Value` directly.
110 tests → 117 (105 → 112 unit + 5 integration unchanged). Patch
coverage on the PR is still informational by repo policy (codecov.yml).
* docs(search): make auto-embedding flow explicit in the Search section
Reading just the Search section, a user might miss that vector search is
end-to-end auto-embedded — both the column's embeddings (built when the
index was created) and the query embedding (computed at search time)
come from the same server-configured provider, with matching metric,
model, and dimensions.
Spells that out at the top of the `--type vector` bullet, and adds an
explicit pointer to raw SQL via `hotdata query` for cases where the user
needs a different model than the index, or has no index at all (the SQL
reference covers the underlying distance functions and table UDFs).
1 parent ae8ec55 commit 27dcdb0
10 files changed
Lines changed: 1093 additions & 266 deletions