feat(search): require --type, route to server-side bm25_search/vector_distance

anoop-narang · anoop-narang · commit 7dbbe9579ad4 · 2026-04-29T21:31:34.000+05:30
`hotdata search` now requires `--type vector|bm25` (no default; same rule
as `indexes create --type`) and a positional query text argument. Both
modes run entirely server-side with no client-side embedding.

Routing:
- `--type vector "<query>"` →
    SELECT *, vector_distance(<col>, '<query>') AS dist FROM <t> ORDER BY dist
  Server resolves the embedding column, model, dimensions, and metric from
  the index metadata. The user names the source text column.
- `--type bm25 "<query>"` → existing bm25_search() server-side path.
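
The routing above can be sketched as a standalone function. This is only an
illustration: `build_search_sql` and `escape_sql_literal` are hypothetical
names (the commit builds the SQL inline in `main.rs`), and returning `None`
for an unknown type stands in for the CLI's parse-time rejection.

```rust
/// Escape a value for use inside a single-quoted SQL string literal.
/// (Hypothetical helper; the commit calls `.replace('\'', "''")` inline.)
fn escape_sql_literal(s: &str) -> String {
    s.replace('\'', "''")
}

/// Build the SQL for a search, mirroring the two routes in this commit.
/// Returns `None` for any type other than "bm25" or "vector".
fn build_search_sql(
    search_type: &str,
    table: &str,
    column: &str,
    query: &str,
    select: Option<&str>,
    limit: u32,
) -> Option<String> {
    match search_type {
        "bm25" => {
            // With an explicit --select, append `score` so ORDER BY can use it.
            let cols = match select {
                Some(cols) => format!("{}, score", cols),
                None => "*".to_string(),
            };
            Some(format!(
                "SELECT {} FROM bm25_search('{}', '{}', '{}') ORDER BY score DESC LIMIT {}",
                cols,
                escape_sql_literal(table),
                escape_sql_literal(column),
                escape_sql_literal(query),
                limit,
            ))
        }
        // The column is the source text column; the server maps it to the
        // embedding column and metric from the index metadata.
        "vector" => Some(format!(
            "SELECT {}, vector_distance({}, '{}') AS dist FROM {} ORDER BY dist LIMIT {}",
            select.unwrap_or("*"),
            column,
            escape_sql_literal(query),
            table,
            limit,
        )),
        _ => None,
    }
}
```

Note that only the query text is escaped on the vector route; the table and
column names are interpolated as identifiers, as in the diff below.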

Removed:
- `--model` flag (was: client-side OpenAI embedding + `l2_distance` SQL).
- Stdin-piped-vector path (was: read JSON vector from stdin, generate
  `l2_distance` SQL).
- `src/embedding.rs` module (its only callers were the two paths above).

Both removed paths hardcoded `l2_distance` regardless of the index's
actual metric, which silently produced wrong rankings on cosine indexes.
They also required the user to point `--column` at the auto-generated
`_embedding` column rather than the source text column. Power users who
need client-side embedding or want to query with a precomputed vector
can use raw SQL via `hotdata query` (e.g. `SELECT *, cosine_distance(...)`).
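
To illustrate why a hardcoded metric can change rankings, here is a small
sketch with toy, unnormalized 2-D vectors. The functions are local
illustrations written for this note, not the server's `l2_distance` or
`cosine_distance` implementations, and the vectors are made up. Whether a
given index stores normalized embeddings is not stated in this commit.

```rust
/// Euclidean (L2) distance between two equal-length vectors.
fn l2_distance(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

/// Cosine distance: 1 - cos(angle between a and b).
fn cosine_distance(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm = |v: &[f64]| v.iter().map(|x| x * x).sum::<f64>().sqrt();
    1.0 - dot / (norm(a) * norm(b))
}

/// Index of the document closest to `query` under `dist`.
/// Assumes `docs` is non-empty.
fn nearest(query: &[f64], docs: &[Vec<f64>], dist: fn(&[f64], &[f64]) -> f64) -> usize {
    let mut best = 0;
    for i in 1..docs.len() {
        if dist(query, &docs[i]) < dist(query, &docs[best]) {
            best = i;
        }
    }
    best
}
```

With query `[1.0, 0.0]` and documents `[10.0, 0.0]` and `[1.0, 1.0]`, cosine
distance picks the first document (same direction as the query), while L2
picks the second (closer in absolute position), so the two metrics return
different top results for the same data.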

Verified against prod on `my_ducklake.main.internet_pages_small`:
- BM25 "basketball" → finds the basketball ProCamp title (score 2.92)
- BM25 "HIV" → finds the HIV Story titles (score 4.81)
- Vector "sports games athletes" → ranks the basketball ProCamp first
  (cosine distance 0.69), heart-attack-fitness second (0.80)
- Vector "travel vacation cruise" → ranks the cruise excursion first
  (0.63), 48-hours-in-Cesky-Krumlov second (0.74)

The semantically meaningful vector results confirm that auto-embedding
produced useful vectors and that the server-side rewrite correctly resolves
the provider, metric, and output column from index metadata. The test
indexes were cleaned up after the run.
diff --git a/README.md b/README.md
@@ -201,22 +201,21 @@ hotdata queries <query_run_id> [-o table|json|yaml]
 
 ## Search
 
-```sh
-# BM25 full-text search
-hotdata search "query text" --table <connection.schema.table> --column <column> [--select <columns>] [--limit <n>] [-o table|json|csv]
+`--type` is **required** — no default. Pass either `vector` (similarity search via the index's embedding provider) or `bm25` (full-text search). Both run entirely server-side.
 
-# Vector search with --model (calls OpenAI to embed the query)
-hotdata search "query text" --table <table> --column <vector_column> --model text-embedding-3-small [--limit <n>]
+```sh
+# BM25 full-text search (requires a BM25 index on the column)
+hotdata search "<query>" --type bm25 --table <connection.schema.table> --column <column> [--select <columns>] [--limit <n>] [-o table|json|csv]
 
-# Vector search with piped embedding
-echo '[0.1, -0.2, ...]' | hotdata search --table <table> --column <vector_column> [--limit <n>]
+# Vector search (requires a vector index with auto-embedding on the column)
+hotdata search "<query>" --type vector --table <table> --column <source_text_column> [--limit <n>]
 ```
 
-- Without `--model` and with query text: BM25 full-text search. Requires a BM25 index on the target column.
-- With `--model`: generates an embedding via OpenAI and performs vector search using `l2_distance`. Requires `OPENAI_API_KEY` env var.
-- Without query text and with piped stdin: reads a vector (raw JSON array or OpenAI embedding response) and performs vector search.
-- BM25 results are ordered by relevance score (descending). Vector results are ordered by distance (ascending).
+- **`--type vector`** runs server-side `vector_distance(col, 'query')`. The server resolves the embedding column, model, dimensions, and metric from the index metadata. Name the **source text column** (e.g. `title`), not the auto-generated `_embedding` column. No `OPENAI_API_KEY` required.
+- **`--type bm25`** runs `bm25_search(table, col, 'query')` — requires a BM25 index on the column.
+- BM25 results sort by score (descending). Vector results sort by distance (ascending).
 - `--select` specifies which columns to return (comma-separated, defaults to all).
+- The previous `--model` flag and stdin-piped-vector path are **removed** — both hardcoded `l2_distance` regardless of the index's actual metric, which silently produced wrong rankings on cosine indexes. For client-side embedding or precomputed-vector workflows, use raw SQL via `hotdata query` (e.g. `SELECT *, cosine_distance(col, [<vec>]) ...`).
 
 ## Indexes
 
diff --git a/skills/hotdata/SKILL.md b/skills/hotdata/SKILL.md
@@ -297,23 +297,23 @@ These commands use the **active workspace only** (the `queries` command has no `
 To create a dataset from a **saved query** still registered for the workspace, use **`hotdata datasets create --query-id <saved_query_id>`** (this CLI does not expose separate saved-query create/run subcommands).
 
 ### Search
-```
-# BM25 full-text search
-hotdata search "query text" --table <connection.schema.table> --column <column> [--select <columns>] [--limit <n>] [--output table|json|csv]
 
-# Vector search with --model (calls OpenAI to embed the query)
-hotdata search "query text" --table <table> --column <vector_column> --model text-embedding-3-small [--limit <n>]
+`--type` is **required**. Pass `vector` or `bm25`. Both run entirely server-side.
+
+```
+# BM25 full-text search (requires BM25 index on the column)
+hotdata search "<query>" --type bm25 --table <connection.schema.table> --column <column> [--select <columns>] [--limit <n>] [--output table|json|csv]
 
-# Vector search with piped embedding
-echo '[0.1, -0.2, ...]' | hotdata search --table <table> --column <vector_column> [--limit <n>]
+# Vector similarity search via server-side auto-embed (requires a vector index on the column)
+hotdata search "<query>" --type vector --table <table> --column <source_text_column> [--limit <n>]
 ```
-- Without `--model` and with query text: BM25 full-text search. Requires a BM25 index on the target column.
-- With `--model`: generates an embedding via OpenAI and performs vector search using `l2_distance`. Requires `OPENAI_API_KEY` env var. Supported models: `text-embedding-3-small`, `text-embedding-3-large`.
-- Without query text and with piped stdin: reads a vector (raw JSON array or OpenAI embedding response) and performs vector search.
-- BM25 results are ordered by relevance score (descending). Vector results are ordered by distance (ascending).
+- **`--type vector`** generates `vector_distance(col, 'text')` server-side. The server resolves the embedding column, model, and metric from the index metadata. Name the **source text column** (e.g. `title`), not the auto-generated `_embedding` column. No client-side embedding, no `OPENAI_API_KEY` required.
+- **`--type bm25`** generates `bm25_search(table, col, 'text')` server-side; requires a BM25 index on the column.
+- BM25 results sort by score (descending). Vector results sort by distance (ascending).
 - `--select` specifies which columns to return (comma-separated, defaults to all).
 - Default limit is 10.
-- **For BM25 search, create a BM25 index on the target column first. For vector search, create a vector index.**
+- **For BM25 search, create a BM25 index on the target column first (`hotdata indexes create ... --type bm25`). For vector search, create a vector index, optionally with auto-embedding on a text column.**
+- The earlier `--model` flag and stdin-piped-vector path have both been removed. They hardcoded `l2_distance` regardless of the index's metric (silently wrong on cosine indexes). For client-side embedding or precomputed-vector workflows, use raw SQL via `hotdata query`.
 
 ### Indexes
 
diff --git a/src/command.rs b/src/command.rs
@@ -138,14 +138,25 @@ pub enum Commands {
 
     /// Full-text or vector search across a table column
     Search {
-        /// Search query text (omit to read a vector from stdin for vector search)
-        query: Option<String>,
+        /// Search query text — required for both --type bm25 and --type vector
+        query: String,
+
+        /// Search type — required (no default; choose deliberately)
+        ///
+        /// `vector` runs server-side `vector_distance(col, 'text')` — the server resolves the
+        /// embedding column, model, and metric from the index metadata.
+        ///
+        /// `bm25` runs server-side `bm25_search(table, col, 'text')` and requires a BM25 index
+        /// on the column.
+        #[arg(long, value_parser = ["vector", "bm25"])]
+        r#type: String,
 
         /// Table to search (connection.schema.table)
         #[arg(long)]
         table: String,
 
-        /// Column to search
+        /// Column to search. For `--type vector`, name the source text column — the server
+        /// resolves the embedding column from the index metadata.
         #[arg(long)]
         column: String,
 
@@ -157,10 +168,6 @@ pub enum Commands {
         #[arg(long, default_value = "10")]
         limit: u32,
 
-        /// Embedding model to generate a vector from the query text (e.g. text-embedding-3-small)
-        #[arg(long, value_parser = ["text-embedding-3-small", "text-embedding-3-large"])]
-        model: Option<String>,
-
         /// Workspace ID (defaults to first workspace from login)
         #[arg(long, short = 'w')]
         workspace_id: Option<String>,
diff --git a/src/embedding.rs b/src/embedding.rs
diff --git a/src/main.rs b/src/main.rs
@@ -6,7 +6,6 @@ mod connections;
 mod connections_new;
 mod context;
 mod datasets;
-mod embedding;
 mod embedding_providers;
 mod indexes;
 mod jobs;
@@ -554,60 +553,43 @@ fn main() {
             }
             Commands::Search {
                 query,
+                r#type,
                 table,
                 column,
                 select,
                 limit,
-                model,
                 workspace_id,
                 output,
             } => {
                 let workspace_id = resolve_workspace(workspace_id);
                 let select_cols = select.as_deref().unwrap_or("*");
 
-                // Determine search mode:
-                // 1. --model flag: embed the query text via the model provider
-                // 2. No query + piped stdin: read vector from stdin
-                // 3. Query text without --model: BM25 text search
-                let sql = if let Some(ref model_name) = model {
-                    let query_text = match query {
-                        Some(ref q) => q.as_str(),
-                        None => {
-                            eprintln!("error: --model requires a search query text");
-                            std::process::exit(1);
-                        }
-                    };
-                    let vec = embedding::openai_embed(query_text, model_name);
-                    let vec_str = embedding::vector_to_sql(&vec);
-                    format!(
-                        "SELECT {}, l2_distance({}, {}) as dist FROM {} ORDER BY dist LIMIT {}",
-                        select_cols, column, vec_str, table, limit,
-                    )
-                } else if let Some(q) = query.as_ref() {
-                    let bm25_columns = match select.as_deref() {
-                        Some(cols) => format!("{}, score", cols),
-                        None => "*".to_string(),
-                    };
-                    format!(
-                        "SELECT {} FROM bm25_search('{}', '{}', '{}') ORDER BY score DESC LIMIT {}",
-                        bm25_columns,
-                        table.replace('\'', "''"),
-                        column.replace('\'', "''"),
-                        q.replace('\'', "''"),
-                        limit,
-                    )
-                } else {
-                    use std::io::IsTerminal;
-                    if std::io::stdin().is_terminal() {
-                        eprintln!("error: provide a search query or pipe a vector via stdin");
-                        std::process::exit(1);
+                let sql = match r#type.as_str() {
+                    "bm25" => {
+                        let bm25_columns = match select.as_deref() {
+                            Some(cols) => format!("{}, score", cols),
+                            None => "*".to_string(),
+                        };
+                        format!(
+                            "SELECT {} FROM bm25_search('{}', '{}', '{}') ORDER BY score DESC LIMIT {}",
+                            bm25_columns,
+                            table.replace('\'', "''"),
+                            column.replace('\'', "''"),
+                            query.replace('\'', "''"),
+                            limit,
+                        )
                     }
-                    let vec = embedding::read_vector_from_stdin();
-                    let vec_str = embedding::vector_to_sql(&vec);
-                    format!(
-                        "SELECT {}, l2_distance({}, {}) as dist FROM {} ORDER BY dist LIMIT {}",
-                        select_cols, column, vec_str, table, limit,
-                    )
+                    // Server-side vector_distance: resolves the embedding column, model,
+                    // and metric from the index metadata. The user names the source text column.
+                    "vector" => format!(
+                        "SELECT {}, vector_distance({}, '{}') AS dist FROM {} ORDER BY dist LIMIT {}",
+                        select_cols,
+                        column,
+                        query.replace('\'', "''"),
+                        table,
+                        limit,
+                    ),
+                    _ => unreachable!(),
                 };
                 query::execute(&sql, &workspace_id, None, &output)
             }