Skip to content

Commit f714be8

Browse files
voorhsclaude
andcommitted
fix(embedder): handle HashingVectorizer empty input gracefully
sklearn's HashingVectorizer.transform([]) raises StopIteration (>=1.5); guard empty input to return a (0, n_features) array instead, matching the regression test and the spec's intent. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1 parent 77482e8 commit f714be8

2 files changed

Lines changed: 9 additions & 3 deletions

File tree

docs/superpowers/specs/2026-06-25-sqlite-embedding-cache-design.md

Lines changed: 5 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -378,9 +378,11 @@ def _to_tensor(self, arr: npt.NDArray[np.float32]) -> "torch.Tensor":
378378
- **OpenAI:** keep `ValueError` on empty; prepend prompt if present; run sync/async path; return float32.
379379
- **vLLM:** keep `ValueError` on empty; prepend prompt if present; `model.encode`; stack float32.
380380
- **HashingVectorizer:** ignore prompt (as today, so `_embed_uncached`'s `prompt` param is unused →
381-
`# noqa: ARG002`); transform → dense float32; **keep returning `(0, dim)` for empty input** (no
382-
behavior change). Sets `supports_cache = False` so it is never cached (preserving today's behavior and
383-
avoiding ~1 MB BLOBs).
381+
`# noqa: ARG002`); transform → dense float32. **Empty input returns a `(0, n_features)` array** via an
382+
explicit guard — note sklearn's `HashingVectorizer.transform([])` actually raises `StopIteration`
383+
(sklearn ≥1.5), which the old `embed` propagated; the guard makes empty input graceful (this is the
384+
one small, deliberate behavior improvement, pinned by a regression test in §6.3). Sets
385+
`supports_cache = False` so it is never cached (avoiding ~1 MB BLOBs).
384386
- **`FakeOpenaiEmbeddingBackend`** is migrated to implement `_embed_uncached(utterances, prompt)` and
385387
**inherit** the template `embed`. It also re-declares `config: OpenaiEmbeddingConfig` (the narrowing from
386388
the ABC change). Its body uses the **passed `prompt`** directly (it must NOT call `get_prompt(task_type)`

src/autointent/_wrappers/embedder/hashing_vectorizer.py

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -76,6 +76,10 @@ def get_hash(self) -> int:
7676

7777
def _embed_uncached(self, utterances: list[str], prompt: str | None) -> npt.NDArray[np.float32]: # noqa: ARG002
7878
"""Compute HashingVectorizer embeddings (prompt is ignored; never cached)."""
79+
if not utterances:
80+
# sklearn's HashingVectorizer.transform([]) raises StopIteration; return an
81+
# empty (0, n_features) matrix instead so empty input is handled gracefully.
82+
return np.empty((0, self.config.n_features), dtype=np.float32)
7983
embeddings_sparse = self._vectorizer.transform(utterances)
8084
embeddings: npt.NDArray[np.float32] = embeddings_sparse.toarray().astype(np.float32)
8185
return embeddings

0 commit comments

Comments
 (0)