Qdrant With all-MiniLM-L6-v2 #2066

@ghost

Description

This might be a stupid question. I have set embedding_dim on the document store to 384, to match the output dimension of the embedder, all-MiniLM-L6-v2:

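# Imports assumed for Haystack 2.x with the qdrant-haystack integration;
# `config` is a dict loaded elsewhere in my script.
from haystack import Pipeline
from haystack.components.converters import MarkdownToDocument
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.routers import FileTypeRouter
from haystack.components.writers import DocumentWriter
from haystack.utils import ComponentDevice
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

document_store = QdrantDocumentStore(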
    path=config["document_store"]["persist_path"],
    recreate_index=True,
    return_embedding=True,
    wait_result_from_api=True,
    embedding_dim=384
)

# Define the file type router (this routes each file to its appropriate converter)
# In this example, we only allow markdown files.
file_type_router = FileTypeRouter(mime_types=["text/markdown"])


# Define the converter used for .md -> Document
markdown_converter = MarkdownToDocument()


# Define the document cleaner, which removes extraneous material (empty lines, extra whitespace)
# You can change this behaviour by passing different parameters into DocumentCleaner()
document_cleaner = DocumentCleaner()



# Define the embedder. This is where the documents are converted to vectors/embeddings
# These vectors are then searched when we submit a query, to find the most relevant chunks of text
document_embedder = SentenceTransformersDocumentEmbedder(
    model="all-MiniLM-L6-v2",
    device=ComponentDevice.from_str("cuda"),
    local_files_only=True,
)
# nvidia-smi -l

# Define the document writer, which writes the documents and their vectors to the DB
document_writer = DocumentWriter(document_store)

# This is where the pipeline is actually created
# First we add the router, then the converter, cleaner, embedder, and finally the writer.
# Adding the components...
preprocessing_pipeline = Pipeline()
preprocessing_pipeline.add_component(instance=file_type_router, name="file_type_router")
preprocessing_pipeline.add_component(instance=markdown_converter, name="markdown_converter")
preprocessing_pipeline.add_component(instance=document_cleaner, name="document_cleaner")
preprocessing_pipeline.add_component(instance=document_embedder, name="document_embedder")
preprocessing_pipeline.add_component(instance=document_writer, name="document_writer")

# Connecting the components...
preprocessing_pipeline.connect("file_type_router.text/markdown", "markdown_converter.sources")
preprocessing_pipeline.connect("markdown_converter", "document_cleaner")
preprocessing_pipeline.connect("document_cleaner", "document_embedder")
preprocessing_pipeline.connect("document_embedder", "document_writer")

But I get the following error. I am pretty sure I am doing something stupid, so any help would be appreciated:

haystack.core.errors.PipelineRuntimeError: The following component failed to run:
Component name: 'document_writer'
Component type: 'DocumentWriter'
Error: could not broadcast input array from shape (384,) into shape (768,)
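
The message suggests that something on the write side expects 768-dimensional vectors while the embedder produces 384-dimensional ones. To narrow that down, a check along these lines could be run on the embeddings before they reach the writer. This is only a sketch in plain Python with no Haystack or Qdrant calls; the helper name find_dimension_mismatches and the sample vectors are made up for illustration and are not part of any library.

```python
from typing import Iterable, List, Sequence, Tuple


def find_dimension_mismatches(
    embeddings: Iterable[Sequence[float]], expected_dim: int
) -> List[Tuple[int, int]]:
    """Return (index, actual_dim) for each embedding whose length differs from expected_dim."""
    return [
        (index, len(vector))
        for index, vector in enumerate(embeddings)
        if len(vector) != expected_dim
    ]


# Hypothetical stand-ins for real embeddings: two 384-dim vectors and one 768-dim vector.
sample_embeddings = [[0.0] * 384, [0.0] * 384, [0.0] * 768]

mismatches = find_dimension_mismatches(sample_embeddings, expected_dim=384)
print(mismatches)  # [(2, 768)]
```

If the embedder's vectors all report 384 here, the 768 would have to come from the store side, for example from a collection that already exists at the persist path.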
