Reusing ChromaDocumentStore from disk throws an error #1622

@savank7

I’ve stored a ChromaDocumentStore locally using store.py, and it works perfectly—creating and persisting the DB as expected.

However, when I try to reuse this persisted ChromaDocumentStore in query.py, I encounter an error.

store.py - Used to create and persist the ChromaDocumentStore

import os
from pathlib import Path

from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.writers import DocumentWriter

from haystack_integrations.document_stores.chroma import ChromaDocumentStore

file_paths = [Path("data") / name for name in os.listdir("data")]

# Chroma is persisted to disk at persist_path so that query.py can reopen the same collection later
document_store = ChromaDocumentStore(persist_path="./chroma_db_test", collection_name="my_documents")

indexing = Pipeline()
indexing.add_component("converter", TextFileToDocument())
indexing.add_component("writer", DocumentWriter(document_store))
indexing.connect("converter", "writer")
indexing.run({"converter": {"sources": file_paths}})

print('done')

query.py - Trying to use the persisted ChromaDocumentStore

from haystack import Pipeline
from haystack_integrations.components.retrievers.chroma import ChromaQueryTextRetriever
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses import ChatMessage

from haystack_integrations.document_stores.chroma import ChromaDocumentStore

import os

os.environ["OPENAI_API_KEY"] = "my-key"

# Reopen the document store persisted by store.py (same persist_path and collection_name)
# The intent is to reuse the existing DB rather than wipe and recreate it
document_store = ChromaDocumentStore(persist_path="./chroma_db_test", collection_name="my_documents")

prompt = [
    ChatMessage.from_user(
      """
      According to the contents of this website:
      {% for document in documents %}
        {{document.content}}
      {% endfor %}
      Answer the given question: {{query}}
      Answer:
      """
    )
]

prompt_builder = ChatPromptBuilder(template=prompt)
llm = OpenAIChatGenerator()
retriever = ChromaQueryTextRetriever(document_store)

querying = Pipeline()
querying.add_component("retriever", retriever)
querying.add_component("prompt_builder", prompt_builder)
querying.add_component("llm", llm)

querying.connect("retriever.documents", "prompt_builder.documents")
querying.connect("prompt_builder", "llm")

query = "How to apply discount before tax in POS ?"
results = querying.run(data={"retriever": {"query": query},
                        "prompt_builder": {"query": query}})

print(results["llm"]["replies"][0].text)

Error from the query.py code

haystack.core.errors.PipelineRuntimeError: The following component failed to run:
Component name: 'retriever'
Component type: 'ChromaQueryTextRetriever'
Error: Collection [my_documents] already exists

🔍 Problem
I want to reuse the existing vector DB (chroma_db_test) without recreating it every time. However, the query.py script throws an error when trying to load the stored ChromaDocumentStore.
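
Since both scripts pass the relative path "./chroma_db_test", one thing worth ruling out is that they resolve to different directories depending on where each script is launched from. The following is a minimal diagnostic sketch using only the standard library. The helper name describe_persist_path is hypothetical, and the check for a chroma.sqlite3 file is an assumption about how recent Chroma versions lay out their persisted data, not something confirmed in this report.

```python
from pathlib import Path


def describe_persist_path(persist_path: str) -> dict:
    """Report where a (possibly relative) persist path points on disk."""
    # Resolve against the current working directory, as the scripts would.
    resolved = Path(persist_path).resolve()
    return {
        "resolved": resolved,
        "exists": resolved.is_dir(),
        # Assumption: recent Chroma versions keep their data in chroma.sqlite3.
        "has_sqlite": (resolved / "chroma.sqlite3").is_file(),
    }


# Example usage (run from the same directory as store.py and query.py):
# print(describe_persist_path("./chroma_db_test"))
```

If the resolved path differs between the two scripts, passing an absolute path as persist_path in both would remove that variable.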

💬 Request
Can you please help me correctly load and reuse the existing Chroma vector DB? I want to avoid re-indexing or wiping the DB each time I run the query pipeline.

Thanks in advance!
