fix(embedder): handle HashingVectorizer empty input gracefully

voorhs · claude · voorhs · commit f714be83c0db · 2026-06-26T01:04:13.000+03:00
sklearn's HashingVectorizer.transform([]) raises StopIteration (&gt;=1.5); guard
empty input to return a (0, n_features) array instead, matching the regression
test and the spec's intent.

Co-Authored-By: Claude Opus 4.8 &lt;noreply@anthropic.com&gt;
diff --git a/docs/superpowers/specs/2026-06-25-sqlite-embedding-cache-design.md b/docs/superpowers/specs/2026-06-25-sqlite-embedding-cache-design.md
@@ -378,9 +378,11 @@ def _to_tensor(self, arr: npt.NDArray[np.float32]) -> "torch.Tensor":
   - **OpenAI:** keep `ValueError` on empty; prepend prompt if present; run sync/async path; return float32.
   - **vLLM:** keep `ValueError` on empty; prepend prompt if present; `model.encode`; stack float32.
   - **HashingVectorizer:** ignore prompt (as today, so `_embed_uncached`'s `prompt` param is unused →
-    `# noqa: ARG002`); transform → dense float32; **keep returning `(0, dim)` for empty input** (no
-    behavior change). Sets `supports_cache = False` so it is never cached (preserving today's behavior and
-    avoiding ~1 MB BLOBs).
+    `# noqa: ARG002`); transform → dense float32. **Empty input returns a `(0, n_features)` array** via an
+    explicit guard — note sklearn's `HashingVectorizer.transform([])` actually raises `StopIteration`
+    (sklearn ≥1.5), which the old `embed` propagated; the guard makes empty input graceful (this is the
+    one small, deliberate behavior improvement, pinned by a regression test in §6.3). Sets
+    `supports_cache = False` so it is never cached (avoiding ~1 MB BLOBs).
 - **`FakeOpenaiEmbeddingBackend`** is migrated to implement `_embed_uncached(utterances, prompt)` and
   **inherit** the template `embed`. It also re-declares `config: OpenaiEmbeddingConfig` (the narrowing from
   the ABC change). Its body uses the **passed `prompt`** directly (it must NOT call `get_prompt(task_type)`
diff --git a/src/autointent/_wrappers/embedder/hashing_vectorizer.py b/src/autointent/_wrappers/embedder/hashing_vectorizer.py
@@ -76,6 +76,10 @@ def get_hash(self) -> int:
 
     def _embed_uncached(self, utterances: list[str], prompt: str | None) -> npt.NDArray[np.float32]:  # noqa: ARG002
         """Compute HashingVectorizer embeddings (prompt is ignored; never cached)."""
+        if not utterances:
+            # sklearn's HashingVectorizer.transform([]) raises StopIteration; return an
+            # empty (0, n_features) matrix instead so empty input is handled gracefully.
+            return np.empty((0, self.config.n_features), dtype=np.float32)
         embeddings_sparse = self._vectorizer.transform(utterances)
         embeddings: npt.NDArray[np.float32] = embeddings_sparse.toarray().astype(np.float32)
         return embeddings