embed_multi_with_metadata() compares item IDs against row IDs instead of content hashes, breaking deduplication #1397

@verivusOSS-releases

Bug

Collection.embed_multi_with_metadata() has a deduplication check that queries existing rows by content_hash but then compares the wrong key. It stores the returned database row IDs in existing_ids, then filters the incoming batch by checking whether each incoming item's user-provided ID is in existing_ids.

Since user-provided IDs and database row IDs are semantically different, content that is already stored but submitted under a new ID passes the filter and bypasses dedup.
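
To make the mismatch concrete, here is a minimal sketch of the comparison as described above. All values are made up for illustration and are not taken from the library: existing_ids stands in for the row IDs returned by the content_hash query, and incoming stands in for a batch of (user-provided ID, content) pairs.

```python
# Hypothetical values, for illustration only.
# Row IDs that the content_hash query is described as returning.
existing_ids = {1, 2, 3}

# Incoming batch as (user-provided ID, content) pairs. Assume this content
# is already stored under a different ID.
incoming = [("doc-a-copy", "hello world")]

# The filter described in this report: membership test on the user-provided ID.
to_embed = [item for item in incoming if item[0] not in existing_ids]

# "doc-a-copy" is not one of the row IDs, so the duplicate content
# is queued for embedding again.
print(to_embed)
```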

Root cause

In llm/embeddings.py, the dedup logic does:

# Queries by content_hash, gets back row IDs
existing = list(db.execute(
    "SELECT id FROM embeddings WHERE collection_id = ? AND content_hash IN (?)",
    ...
))
existing_ids = {row[0] for row in existing}

# Filters by incoming item ID -- wrong comparison
for item in items:
    if item.id not in existing_ids:  # item.id is user-provided, existing_ids are DB row IDs
        to_embed.append(item)

Fix: compare incoming content hashes against existing content_hash values, not incoming IDs against returned row IDs.
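
One way the proposed fix could look, as a sketch under stated assumptions rather than a patch against the actual code. The Item class, the content_hash() helper, the filter_new_items() function, and the three-column table are all hypothetical stand-ins, and SHA-256 is used only because any stable digest works for the demonstration; the library's real schema and hash may differ.

```python
import hashlib
import sqlite3
from dataclasses import dataclass


@dataclass
class Item:
    id: str        # user-provided ID
    content: str   # text to embed


def content_hash(text: str) -> str:
    # Assumption: any stable digest is enough for this sketch.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def filter_new_items(db, collection_id, items):
    """Return the items whose content hash is not yet stored in the collection."""
    hashes = sorted({content_hash(item.content) for item in items})
    if not hashes:
        return []
    # One placeholder per hash, so the IN clause matches every candidate.
    placeholders = ",".join("?" for _ in hashes)
    rows = db.execute(
        "SELECT content_hash FROM embeddings "
        f"WHERE collection_id = ? AND content_hash IN ({placeholders})",
        [collection_id, *hashes],
    ).fetchall()
    existing_hashes = {row[0] for row in rows}
    # Compare hash against hash, never ID against row ID.
    return [
        item for item in items
        if content_hash(item.content) not in existing_hashes
    ]


# Hypothetical, simplified table for the demonstration.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE embeddings (id TEXT, collection_id INTEGER, content_hash TEXT)"
)
db.execute(
    "INSERT INTO embeddings VALUES (?, ?, ?)",
    ("doc-a", 1, content_hash("hello world")),
)

batch = [Item("doc-a-copy", "hello world"), Item("doc-b", "something new")]
new_items = filter_new_items(db, 1, batch)
print([item.id for item in new_items])
```

In this sketch the duplicate content submitted as doc-a-copy is filtered out even though its ID is new, while doc-b is kept. Whether an existing ID resubmitted with changed content should be re-embedded is a separate question that this sketch does not address.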

Impact

Redundant embeddings accumulate, increasing storage and API costs. Similarity search performance degrades with duplicate vectors.

Note

This is related to but distinct from #224, which describes a different dedup issue with --store flag behavior. This is about the fundamental ID-vs-hash comparison logic.

(Found during a multi-LLM code review using sqry AST analysis + Codex + Gemini cross-validation.)
