Bug
Collection.embed_multi_with_metadata() has a deduplication check that queries existing rows by content_hash but then compares the wrong key. It stores the returned database row IDs in existing_ids, then filters the incoming batch by checking whether each incoming item's user-provided ID is in existing_ids.
Because user-provided IDs and database row IDs are semantically different, content that is already stored but submitted under a new ID is never recognized as a duplicate and always bypasses dedup.
Root cause
In llm/embeddings.py, the dedup logic does:
    # Queries by content_hash, gets back row IDs
    existing = list(db.execute(
        "SELECT id FROM embeddings WHERE collection_id = ? AND content_hash IN (?)",
        ...
    ))
    existing_ids = {row[0] for row in existing}
    # Filters by incoming item ID -- wrong comparison
    for item in items:
        if item.id not in existing_ids:  # item.id is user-provided, existing_ids are DB row IDs
            to_embed.append(item)
Fix: compare incoming content hashes against existing content_hash values, not incoming IDs against returned row IDs.
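A minimal sketch of the hash-to-hash comparison, not the library's actual code. It assumes items arrive as (id, text) pairs, that content_hash holds the MD5 digest of the UTF-8 text, and that db is a plain sqlite3 connection. The helper names content_hash() and filter_new_items() are hypothetical and used only for illustration.

```python
import hashlib
import sqlite3


def content_hash(text: str) -> bytes:
    # Assumption: MD5 digest of the UTF-8 text; the real column may differ.
    return hashlib.md5(text.encode("utf-8")).digest()


def filter_new_items(db, collection_id, items):
    """Return the (id, text) pairs whose content is not stored yet."""
    hashes = {content_hash(text) for _, text in items}
    if not hashes:
        return []
    # One placeholder per hash, so the IN clause matches the whole batch.
    placeholders = ",".join("?" * len(hashes))
    rows = db.execute(
        "SELECT content_hash FROM embeddings "
        f"WHERE collection_id = ? AND content_hash IN ({placeholders})",
        [collection_id, *hashes],
    ).fetchall()
    existing_hashes = {row[0] for row in rows}
    # Compare hash against hash, never a user ID against a row ID.
    return [
        (item_id, text)
        for item_id, text in items
        if content_hash(text) not in existing_hashes
    ]
```

With this shape, an item whose text is already stored is dropped regardless of the ID it arrives under. The sketch does not deduplicate repeated content within a single batch, and very large batches may need to be split to stay under SQLite's bound-parameter limit.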
Impact
Redundant embeddings accumulate, increasing storage and API costs. Similarity search performance degrades with duplicate vectors.
Note
This is related to but distinct from #224, which describes a different dedup issue with --store flag behavior. This is about the fundamental ID-vs-hash comparison logic.
(Found during a multi-LLM code review using sqry AST analysis + Codex + Gemini cross-validation.)