
Commit 7dbbe95

feat(search): require --type, route to server-side bm25_search/vector_distance
`hotdata search` now requires `--type vector|bm25` (no default; same rule as `indexes create --type`) and a positional query text argument. Both modes run entirely server-side with no client-side embedding.

Routing:
- `--type vector "<query>"` →
    SELECT *, vector_distance(<col>, '<query>') AS dist FROM <t> ORDER BY dist
  Server resolves the embedding column, model, dimensions, and metric from the index metadata. The user names the source text column.
- `--type bm25 "<query>"` → existing bm25_search() server-side path.

Removed:
- `--model` flag (was: client-side OpenAI embedding + `l2_distance` SQL).
- Stdin-piped-vector path (was: read JSON vector from stdin, generate `l2_distance` SQL).
- `src/embedding.rs` module (its only callers were the two paths above).

Both removed paths hardcoded `l2_distance` regardless of the index's actual metric, which silently produced wrong rankings on cosine indexes. They also required the user to point `--column` at the auto-generated `_embedding` column rather than the source text column.

Power users who need client-side embedding or want to query with a precomputed vector can use raw SQL via `hotdata query` (e.g. `SELECT *, cosine_distance(...)`).

Verified against prod on `my_ducklake.main.internet_pages_small`:
- BM25 "basketball" → finds the basketball ProCamp title (score 2.92)
- BM25 "HIV" → finds the HIV Story titles (score 4.81)
- Vector "sports games athletes" → ranks the basketball ProCamp first (cosine distance 0.69), heart-attack-fitness second (0.80)
- Vector "travel vacation cruise" → ranks the cruise excursion first (0.63), 48-hours-in-Cesky-Krumlov second (0.74)

The semantically meaningful vector results confirm auto-embedding produced useful vectors AND the server-side rewrite correctly resolves provider+metric+output_column from index metadata.

Cleaned up indexes after the test run.
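
The claim that a hardcoded `l2_distance` gives wrong rankings on cosine indexes can be illustrated with a small, self-contained sketch. The vectors are made up for illustration, and `l2_distance`, `cosine_distance`, and `nearest` are local Rust helpers written for this example (assumptions, not the server's SQL functions). It assumes cosine distance is defined as 1 minus cosine similarity.

```rust
/// Euclidean (L2) distance between two equal-length vectors.
fn l2_distance(a: &[f64], b: &[f64]) -> f64 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).powi(2))
        .sum::<f64>()
        .sqrt()
}

/// Euclidean norm of a vector.
fn norm(v: &[f64]) -> f64 {
    v.iter().map(|x| x * x).sum::<f64>().sqrt()
}

/// Cosine distance, taken here as 1 - cosine similarity.
/// Returns NaN if either vector has zero norm.
fn cosine_distance(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    1.0 - dot / (norm(a) * norm(b))
}

/// Index of the candidate closest to `query` under the given metric,
/// or `None` when there are no candidates.
fn nearest(
    query: &[f64],
    candidates: &[Vec<f64>],
    dist: fn(&[f64], &[f64]) -> f64,
) -> Option<usize> {
    candidates
        .iter()
        .enumerate()
        .map(|(i, c)| (i, dist(query, c)))
        .min_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(i, _)| i)
}
```

With the query `[1.0, 0.0]` and candidates `[10.0, 0.0]` and `[1.0, 1.0]`, the L2 distances are 9 and 1, so L2 ranks the second candidate first. The cosine distances are 0 and about 0.293, so cosine ranks the first candidate first. The two metrics agree only when vectors are normalized to unit length, which is why the metric has to come from the index metadata rather than being fixed in the client.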
1 parent f7a2532 commit 7dbbe95

5 files changed

Lines changed: 62 additions & 197 deletions


README.md

Lines changed: 10 additions & 11 deletions
@@ -201,22 +201,21 @@ hotdata queries <query_run_id> [-o table|json|yaml]

 ## Search

-```sh
-# BM25 full-text search
-hotdata search "query text" --table <connection.schema.table> --column <column> [--select <columns>] [--limit <n>] [-o table|json|csv]
+`--type` is **required** — no default. Pass either `vector` (similarity search via the index's embedding provider) or `bm25` (full-text search). Both run entirely server-side.

-# Vector search with --model (calls OpenAI to embed the query)
-hotdata search "query text" --table <table> --column <vector_column> --model text-embedding-3-small [--limit <n>]
+```sh
+# BM25 full-text search (requires a BM25 index on the column)
+hotdata search "<query>" --type bm25 --table <connection.schema.table> --column <column> [--select <columns>] [--limit <n>] [-o table|json|csv]

-# Vector search with piped embedding
-echo '[0.1, -0.2, ...]' | hotdata search --table <table> --column <vector_column> [--limit <n>]
+# Vector search (requires a vector index with auto-embedding on the column)
+hotdata search "<query>" --type vector --table <table> --column <source_text_column> [--limit <n>]
 ```

-- Without `--model` and with query text: BM25 full-text search. Requires a BM25 index on the target column.
-- With `--model`: generates an embedding via OpenAI and performs vector search using `l2_distance`. Requires `OPENAI_API_KEY` env var.
-- Without query text and with piped stdin: reads a vector (raw JSON array or OpenAI embedding response) and performs vector search.
-- BM25 results are ordered by relevance score (descending). Vector results are ordered by distance (ascending).
+- **`--type vector`** runs server-side `vector_distance(col, 'query')`. The server resolves the embedding column, model, dimensions, and metric from the index metadata. Name the **source text column** (e.g. `title`), not the auto-generated `_embedding` column. No `OPENAI_API_KEY` required.
+- **`--type bm25`** runs `bm25_search(table, col, 'query')` — requires a BM25 index on the column.
+- BM25 results sort by score (descending). Vector results sort by distance (ascending).
 - `--select` specifies which columns to return (comma-separated, defaults to all).
+- The previous `--model` flag and stdin-piped-vector path are **removed** — both hardcoded `l2_distance` regardless of the index's actual metric, which silently produced wrong rankings on cosine indexes. For client-side embedding or precomputed-vector workflows, use raw SQL via `hotdata query` (e.g. `SELECT *, cosine_distance(col, [<vec>]) ...`).

 ## Indexes

skills/hotdata/SKILL.md

Lines changed: 12 additions & 12 deletions
@@ -297,23 +297,23 @@ These commands use the **active workspace only** (the `queries` command has no `
 To create a dataset from a **saved query** still registered for the workspace, use **`hotdata datasets create --query-id <saved_query_id>`** (this CLI does not expose separate saved-query create/run subcommands).

 ### Search
-```
-# BM25 full-text search
-hotdata search "query text" --table <connection.schema.table> --column <column> [--select <columns>] [--limit <n>] [--output table|json|csv]

-# Vector search with --model (calls OpenAI to embed the query)
-hotdata search "query text" --table <table> --column <vector_column> --model text-embedding-3-small [--limit <n>]
+`--type` is **required**. Pass `vector` or `bm25`. Both run entirely server-side.
+
+```
+# BM25 full-text search (requires BM25 index on the column)
+hotdata search "<query>" --type bm25 --table <connection.schema.table> --column <column> [--select <columns>] [--limit <n>] [--output table|json|csv]

-# Vector search with piped embedding
-echo '[0.1, -0.2, ...]' | hotdata search --table <table> --column <vector_column> [--limit <n>]
+# Vector similarity search via server-side auto-embed (requires a vector index on the column)
+hotdata search "<query>" --type vector --table <table> --column <source_text_column> [--limit <n>]
 ```
-- Without `--model` and with query text: BM25 full-text search. Requires a BM25 index on the target column.
-- With `--model`: generates an embedding via OpenAI and performs vector search using `l2_distance`. Requires `OPENAI_API_KEY` env var. Supported models: `text-embedding-3-small`, `text-embedding-3-large`.
-- Without query text and with piped stdin: reads a vector (raw JSON array or OpenAI embedding response) and performs vector search.
-- BM25 results are ordered by relevance score (descending). Vector results are ordered by distance (ascending).
+- **`--type vector`** generates `vector_distance(col, 'text')` server-side. The server resolves the embedding column, model, and metric from the index metadata. Name the **source text column** (e.g. `title`), not the auto-generated `_embedding` column. No client-side embedding, no `OPENAI_API_KEY` required.
+- **`--type bm25`** generates `bm25_search(table, col, 'text')` server-side; requires a BM25 index on the column.
+- BM25 results sort by score (descending). Vector results sort by distance (ascending).
 - `--select` specifies which columns to return (comma-separated, defaults to all).
 - Default limit is 10.
-- **For BM25 search, create a BM25 index on the target column first. For vector search, create a vector index.**
+- **For BM25 search, create a BM25 index on the target column first (`hotdata indexes create ... --type bm25`). For vector search, create a vector index, optionally with auto-embedding on a text column.**
+- The earlier `--model` flag and stdin-piped-vector path have both been removed. They hardcoded `l2_distance` regardless of the index's metric (silently wrong on cosine indexes). For client-side embedding or precomputed-vector workflows, use raw SQL via `hotdata query`.

 ### Indexes

src/command.rs

Lines changed: 14 additions & 7 deletions
@@ -138,14 +138,25 @@ pub enum Commands {

     /// Full-text or vector search across a table column
     Search {
-        /// Search query text (omit to read a vector from stdin for vector search)
-        query: Option<String>,
+        /// Search query text — required for both --type bm25 and --type vector
+        query: String,
+
+        /// Search type — required (no default; choose deliberately)
+        ///
+        /// `vector` runs server-side `vector_distance(col, 'text')` — the server resolves the
+        /// embedding column, model, and metric from the index metadata.
+        ///
+        /// `bm25` runs server-side `bm25_search(table, col, 'text')` and requires a BM25 index
+        /// on the column.
+        #[arg(long, value_parser = ["vector", "bm25"])]
+        r#type: String,

         /// Table to search (connection.schema.table)
         #[arg(long)]
         table: String,

-        /// Column to search
+        /// Column to search. For `--type vector`, name the source text column — the server
+        /// resolves the embedding column from the index metadata.
         #[arg(long)]
         column: String,

@@ -157,10 +168,6 @@ pub enum Commands {
         #[arg(long, default_value = "10")]
         limit: u32,

-        /// Embedding model to generate a vector from the query text (e.g. text-embedding-3-small)
-        #[arg(long, value_parser = ["text-embedding-3-small", "text-embedding-3-large"])]
-        model: Option<String>,
-
         /// Workspace ID (defaults to first workspace from login)
         #[arg(long, short = 'w')]
         workspace_id: Option<String>,
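
The diff above keeps `r#type` as a `String` restricted to `"vector"` or `"bm25"`. One possible alternative, sketched below with only the standard library, is to parse the value into an enum so that later code can match exhaustively. The `SearchType` name and the error text are hypothetical and not part of this commit; clap's `ValueEnum` derive could likely serve the same purpose, but that is an assumption not exercised here.

```rust
use std::str::FromStr;

/// Hypothetical typed form of the `--type` flag (not in the commit).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SearchType {
    Vector,
    Bm25,
}

impl FromStr for SearchType {
    type Err = String;

    /// Accepts exactly the two values the commit allows for `--type`.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "vector" => Ok(SearchType::Vector),
            "bm25" => Ok(SearchType::Bm25),
            other => Err(format!(
                "invalid search type '{}' (expected 'vector' or 'bm25')",
                other
            )),
        }
    }
}
```

For example, `"bm25".parse::<SearchType>()` yields `Ok(SearchType::Bm25)`, while any other string yields an `Err` carrying the message. A `match` over the enum then needs no catch-all arm.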

src/embedding.rs

Lines changed: 0 additions & 123 deletions
This file was deleted.

src/main.rs

Lines changed: 26 additions & 44 deletions
@@ -6,7 +6,6 @@ mod connections;
 mod connections_new;
 mod context;
 mod datasets;
-mod embedding;
 mod embedding_providers;
 mod indexes;
 mod jobs;
@@ -554,60 +553,43 @@ fn main() {
         }
         Commands::Search {
             query,
+            r#type,
             table,
             column,
             select,
             limit,
-            model,
             workspace_id,
             output,
         } => {
             let workspace_id = resolve_workspace(workspace_id);
             let select_cols = select.as_deref().unwrap_or("*");

-            // Determine search mode:
-            // 1. --model flag: embed the query text via the model provider
-            // 2. No query + piped stdin: read vector from stdin
-            // 3. Query text without --model: BM25 text search
-            let sql = if let Some(ref model_name) = model {
-                let query_text = match query {
-                    Some(ref q) => q.as_str(),
-                    None => {
-                        eprintln!("error: --model requires a search query text");
-                        std::process::exit(1);
-                    }
-                };
-                let vec = embedding::openai_embed(query_text, model_name);
-                let vec_str = embedding::vector_to_sql(&vec);
-                format!(
-                    "SELECT {}, l2_distance({}, {}) as dist FROM {} ORDER BY dist LIMIT {}",
-                    select_cols, column, vec_str, table, limit,
-                )
-            } else if let Some(q) = query.as_ref() {
-                let bm25_columns = match select.as_deref() {
-                    Some(cols) => format!("{}, score", cols),
-                    None => "*".to_string(),
-                };
-                format!(
-                    "SELECT {} FROM bm25_search('{}', '{}', '{}') ORDER BY score DESC LIMIT {}",
-                    bm25_columns,
-                    table.replace('\'', "''"),
-                    column.replace('\'', "''"),
-                    q.replace('\'', "''"),
-                    limit,
-                )
-            } else {
-                use std::io::IsTerminal;
-                if std::io::stdin().is_terminal() {
-                    eprintln!("error: provide a search query or pipe a vector via stdin");
-                    std::process::exit(1);
+            let sql = match r#type.as_str() {
+                "bm25" => {
+                    let bm25_columns = match select.as_deref() {
+                        Some(cols) => format!("{}, score", cols),
+                        None => "*".to_string(),
+                    };
+                    format!(
+                        "SELECT {} FROM bm25_search('{}', '{}', '{}') ORDER BY score DESC LIMIT {}",
+                        bm25_columns,
+                        table.replace('\'', "''"),
+                        column.replace('\'', "''"),
+                        query.replace('\'', "''"),
+                        limit,
+                    )
                 }
-                let vec = embedding::read_vector_from_stdin();
-                let vec_str = embedding::vector_to_sql(&vec);
-                format!(
-                    "SELECT {}, l2_distance({}, {}) as dist FROM {} ORDER BY dist LIMIT {}",
-                    select_cols, column, vec_str, table, limit,
-                )
+                // Server-side vector_distance: resolves the embedding column, model,
+                // and metric from the index metadata. The user names the source text column.
+                "vector" => format!(
+                    "SELECT {}, vector_distance({}, '{}') AS dist FROM {} ORDER BY dist LIMIT {}",
+                    select_cols,
+                    column,
+                    query.replace('\'', "''"),
+                    table,
+                    limit,
+                ),
+                _ => unreachable!(),
             };
             query::execute(&sql, &workspace_id, None, &output)
         }
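
The SQL routing in the new `match` can be pulled into a pure function for unit testing without a server. The sketch below mirrors the two format strings from the diff. The function names `sql_escape` and `build_search_sql`, the `Result` return type, and the error message are hypothetical additions for illustration; the commit itself builds the string inline and ends the match with `unreachable!()`.

```rust
/// Double single quotes so a value can sit inside a single-quoted SQL
/// string literal (the same `replace` the diff applies inline).
fn sql_escape(s: &str) -> String {
    s.replace('\'', "''")
}

/// Build the search SQL for a given `--type` value.
/// Hypothetical helper: returns `Err` for an unknown type rather than panicking.
fn build_search_sql(
    kind: &str,
    query: &str,
    table: &str,
    column: &str,
    select: Option<&str>,
    limit: u32,
) -> Result<String, String> {
    match kind {
        "bm25" => {
            // With an explicit column list, also return the relevance score.
            let cols = match select {
                Some(c) => format!("{}, score", c),
                None => "*".to_string(),
            };
            Ok(format!(
                "SELECT {} FROM bm25_search('{}', '{}', '{}') ORDER BY score DESC LIMIT {}",
                cols,
                sql_escape(table),
                sql_escape(column),
                sql_escape(query),
                limit,
            ))
        }
        // As in the diff, only the query text is escaped here; `column` and
        // `table` are interpolated as identifiers, unquoted.
        "vector" => Ok(format!(
            "SELECT {}, vector_distance({}, '{}') AS dist FROM {} ORDER BY dist LIMIT {}",
            select.unwrap_or("*"),
            column,
            sql_escape(query),
            table,
            limit,
        )),
        other => Err(format!("unknown search type: {}", other)),
    }
}
```

Calling `build_search_sql("vector", "it's fast", "c.s.t", "title", None, 10)` produces `SELECT *, vector_distance(title, 'it''s fast') AS dist FROM c.s.t ORDER BY dist LIMIT 10`. Returning a `Result` keeps an unexpected type value as an ordinary error even if the argument parser's allowed values and the match arms drift apart.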
