Commit e20abaf

Update milvus-document-store.md (#323)
* Update milvus-document-store.md

  a quick sync with https://github.com/milvus-io/milvus-haystack/blob/main/README.md

* refine the milvus document store example

Signed-off-by: ChengZi <chen.zhang@zilliz.com>

---------

Signed-off-by: ChengZi <chen.zhang@zilliz.com>
1 parent ceab1db commit e20abaf

1 file changed

Lines changed: 251 additions & 34 deletions

integrations/milvus-document-store.md

Original file line number | Diff line number | Diff line change
@@ -16,28 +16,29 @@ version: Haystack 2.0
1616
toc: true
1717
---
1818

19-
### Table of Contents
2019

21-
- [Overview](#overview)
20+
[![Twitter Follow](https://img.shields.io/twitter/follow/milvusio?style=social)](https://twitter.com/milvusio)
21+
<a href="https://discord.gg/mKc3R95yE5"><img height="20" src="https://img.shields.io/badge/Discord-%235865F2.svg?style=for-the-badge&logo=discord&logoColor=white" alt="discord"/></a>
22+
23+
## Table of Contents
24+
- [Recent Updates](#recent-updates)
2225
- [Installation](#installation)
2326
- [Usage](#usage)
2427
- [Dive deep usage](#dive-deep-usage)
28+
- [Sparse Retrieval](#sparse-retrieval)
29+
- [Hybrid Retrieval](#hybrid-retrieval)
30+
- [License](#license)
2531

26-
## Overview
27-
28-
[![PyPI - Version](https://img.shields.io/pypi/v/milvus-haystack.svg)](https://pypi.org/project/milvus-haystack)
29-
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/milvus-haystack.svg)](https://pypi.org/project/milvus-haystack)
32+
## Recent Updates
3033

31-
---
34+
- [2025.4.17] [Full-text Search with Milvus and Haystack](https://milvus.io/docs/full_text_search_with_milvus_and_haystack.md) - Learn how to implement full-text and hybrid search in your application using Haystack and Milvus.
3235

3336
## Installation
3437

3538
```shell
3639
pip install --upgrade pymilvus milvus-haystack
3740
```
3841

39-
*If you are using Google Colab, you may need to restart the runtime to enable dependencies just installed.*
40-
4142
## Usage
4243

4344
Use the `MilvusDocumentStore` in a Haystack pipeline as a quick start.
@@ -47,8 +48,7 @@ from haystack import Document
4748
from milvus_haystack import MilvusDocumentStore
4849

4950
document_store = MilvusDocumentStore(
50-
connection_args={"uri": "./milvus.db"}, # Milvus Lite
51-
# connection_args={"uri": "http://localhost:19530"}, # Milvus standalone docker service.
51+
connection_args={"uri": "./milvus.db"},
5252
drop_old=True,
5353
)
5454
documents = [Document(
@@ -59,9 +59,38 @@ documents = [Document(
5959
document_store.write_documents(documents)
6060
print(document_store.count_documents()) # 1
6161
```
62-
In the `connection_args`, setting the URI as a local file, e.g.`./milvus.db`, is the most convenient method, as it automatically utilizes [Milvus Lite](https://milvus.io/docs/milvus_lite.md) to store all data in this file.
62+
### Different ways to connect to Milvus
63+
64+
- For [Milvus Lite](https://milvus.io/docs/milvus_lite.md), the most convenient option, set the `uri` to a local file path.
65+
```python
66+
document_store = MilvusDocumentStore(
67+
connection_args={"uri": "./milvus.db"},
68+
drop_old=True,
69+
)
70+
```
71+
72+
- For a Milvus server on [Docker or Kubernetes](https://milvus.io/docs/quickstart.md), which is recommended for large-scale data, start the Milvus service and then connect to it with the server URI.
73+
```python
74+
document_store = MilvusDocumentStore(
75+
connection_args={"uri": "http://localhost:19530"},
76+
drop_old=True,
77+
)
78+
```
79+
80+
- For [Zilliz Cloud](https://zilliz.com/cloud), the fully managed cloud service for Milvus, set `uri` and `token` to the [Public Endpoint and API key](https://docs.zilliz.com/docs/on-zilliz-cloud-console#free-cluster-details) of your Zilliz Cloud cluster.
81+
```python
82+
from haystack.utils import Secret
83+
document_store = MilvusDocumentStore(
84+
connection_args={
85+
"uri": "https://in03-ba4234asae.api.gcp-us-west1.zillizcloud.com", # Your Public Endpoint
86+
"token": Secret.from_env_var("ZILLIZ_CLOUD_API_KEY"), # API key; loading it from an environment variable with the Secret class is recommended for security.
87+
"secure": True
88+
},
89+
drop_old=True,
90+
)
91+
```
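The three connection styles above differ only in the contents of `connection_args`. The following is a minimal sketch of how an application might pick between them; `build_connection_args` and `describe_target` are hypothetical helpers written for illustration and are not part of `milvus-haystack`. The sketch passes the token through unchanged, so in real use you would hand it a `Secret` as recommended above.

```python
from typing import Any, Dict, Optional


def build_connection_args(uri: str, token: Optional[Any] = None) -> Dict[str, Any]:
    """Build a connection_args dict (hypothetical helper, for illustration only)."""
    args: Dict[str, Any] = {"uri": uri}
    if token is not None:
        # Zilliz Cloud style: Public Endpoint plus API key over a secure connection.
        args["token"] = token
        args["secure"] = True
    return args


def describe_target(uri: str, token: Optional[Any] = None) -> str:
    """Name the setup a uri/token pair corresponds to (hypothetical helper)."""
    if token is not None:
        return "zilliz-cloud"
    if uri.startswith(("http://", "https://")):
        return "milvus-server"
    # Anything else is treated as a local file path, i.e. Milvus Lite.
    return "milvus-lite"


print(describe_target("./milvus.db"))             # milvus-lite
print(describe_target("http://localhost:19530"))  # milvus-server
print(build_connection_args("./milvus.db"))       # {'uri': './milvus.db'}
```

The resulting dict can then be passed as `connection_args=` to `MilvusDocumentStore`.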
92+
6393

64-
If you have large scale of data such as more than a million docs, we recommend setting up a more performant Milvus server on [docker or kubernetes](https://milvus.io/docs/quickstart.md). When using this setup, please use the server URI, e.g.`http://localhost:19530`, as your URI.
6594

6695
## Dive deep usage
6796

@@ -71,15 +100,10 @@ Prepare an OpenAI API key and set it as an environment variable:
71100
export OPENAI_API_KEY=<your_api_key>
72101
```
73102

74-
Here are the ways to
75-
76-
- Create the indexing Pipeline
77-
- Create the retrieval pipeline
78-
- Create the RAG pipeline
79-
80103
### Create the indexing Pipeline and index some documents
81104

82105
```python
106+
import glob
83107
import os
84108

85109
from haystack import Pipeline
@@ -95,8 +119,7 @@ current_file_path = os.path.abspath(__file__)
95119
file_paths = [current_file_path] # You can replace it with your own file paths.
96120

97121
document_store = MilvusDocumentStore(
98-
connection_args={"uri": "./milvus.db"}, # Milvus Lite
99-
# connection_args={"uri": "http://localhost:19530"}, # Milvus standalone docker service.
122+
connection_args={"uri": "./milvus.db"},
100123
drop_old=True,
101124
)
102125
indexing_pipeline = Pipeline()
@@ -132,8 +155,11 @@ for doc in retrieval_results["retriever"]["documents"]:
132155
### Create the RAG pipeline and try a query
133156

134157
```python
135-
from haystack.components.builders import PromptBuilder
136-
from haystack.components.generators import OpenAIGenerator
158+
from haystack.utils import Secret
159+
160+
from haystack.components.generators.chat import OpenAIChatGenerator
161+
from haystack.components.builders import ChatPromptBuilder
162+
from haystack.dataclasses import ChatMessage
137163

138164
prompt_template = """Answer the following query based on the provided context. If the context does
139165
not include an answer, reply with 'I don't know'.\n
@@ -145,21 +171,212 @@ prompt_template = """Answer the following query based on the provided context. I
145171
Answer:
146172
"""
147173

174+
llm = OpenAIChatGenerator(api_key=Secret.from_env_var("OPENAI_API_KEY"), model="gpt-4o-mini")
175+
148176
rag_pipeline = Pipeline()
149177
rag_pipeline.add_component("text_embedder", OpenAITextEmbedder())
150178
rag_pipeline.add_component("retriever", MilvusEmbeddingRetriever(document_store=document_store, top_k=3))
151-
rag_pipeline.add_component("prompt_builder", PromptBuilder(template=prompt_template))
152-
rag_pipeline.add_component("generator", OpenAIGenerator(generation_kwargs={"temperature": 0}))
179+
rag_pipeline.add_component("prompt_builder", ChatPromptBuilder(template=[ChatMessage.from_user(prompt_template)]))
180+
181+
rag_pipeline.add_component("llm", llm)
182+
rag_pipeline.connect("prompt_builder.prompt", "llm.messages")
153183
rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
154-
rag_pipeline.connect("retriever.documents", "prompt_builder.documents")
155-
rag_pipeline.connect("prompt_builder", "generator")
156-
157-
results = rag_pipeline.run(
158-
{
159-
"text_embedder": {"text": question},
160-
"prompt_builder": {"query": question},
161-
}
184+
rag_pipeline.connect("retriever.documents", "prompt_builder.documents")
186+
188+
results = rag_pipeline.run({"text_embedder": {"text": question}, "prompt_builder": {"query": question}})
189+
190+
print('RAG answer:', results["llm"]["replies"][0].text)
191+
```
192+
193+
## Sparse Retrieval
194+
### Sparse retrieval with Haystack sparse embedder
195+
This example demonstrates the basic approach to sparse indexing and retrieval using Haystack's sparse embedders.
196+
197+
```python
198+
from haystack import Document, Pipeline
199+
from haystack.components.writers import DocumentWriter
200+
from haystack.document_stores.types import DuplicatePolicy
201+
from haystack_integrations.components.embedders.fastembed import (
202+
FastembedSparseDocumentEmbedder,
203+
FastembedSparseTextEmbedder,
204+
)
205+
206+
from milvus_haystack import MilvusDocumentStore, MilvusSparseEmbeddingRetriever
207+
208+
document_store = MilvusDocumentStore(
209+
connection_args={"uri": "./milvus.db"},
210+
sparse_vector_field="sparse_vector", # Name the sparse vector field to enable sparse retrieval.
211+
drop_old=True,
212+
)
213+
214+
documents = [
215+
Document(content="My name is Wolfgang and I live in Berlin"),
216+
Document(content="I saw a black horse running"),
217+
Document(content="Germany has many big cities"),
218+
Document(content="full text search is supported by Milvus."),
219+
]
220+
221+
sparse_document_embedder = FastembedSparseDocumentEmbedder()
222+
writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.NONE)
223+
224+
indexing_pipeline = Pipeline()
225+
indexing_pipeline.add_component("sparse_document_embedder", sparse_document_embedder)
226+
indexing_pipeline.add_component("writer", writer)
227+
indexing_pipeline.connect("sparse_document_embedder", "writer")
228+
229+
indexing_pipeline.run({"sparse_document_embedder": {"documents": documents}})
230+
231+
retrieval_pipeline = Pipeline()
232+
retrieval_pipeline.add_component("sparse_text_embedder", FastembedSparseTextEmbedder())
233+
retrieval_pipeline.add_component("sparse_retriever", MilvusSparseEmbeddingRetriever(document_store=document_store))
234+
retrieval_pipeline.connect("sparse_text_embedder.sparse_embedding", "sparse_retriever.query_sparse_embedding")
235+
236+
query = "who supports full text search?"
237+
238+
result = retrieval_pipeline.run({"sparse_text_embedder": {"text": query}})
239+
240+
print(result["sparse_retriever"]["documents"][0])
241+
242+
# Document(id=..., content: 'full text search is supported by Milvus.', sparse_embedding: vector with 48 non-zero elements)
243+
```
244+
### Sparse retrieval with Milvus built-in BM25 function
245+
Milvus provides a built-in BM25 function that can generate sparse vectors directly from text fields. This approach simplifies the pipeline construction compared to using Haystack's sparse embedders. The main differences are:
246+
247+
1. We need to specify a `BM25BuiltInFunction` in the document store, together with the `text_field` and `sparse_vector_field` it reads from and writes to.
248+
2. We don't need an explicit embedder, because Milvus generates the sparse embeddings on the server side.
249+
3. The pipeline is simpler with fewer components and connections.
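
To give a feel for what a BM25 function computes, here is a minimal pure-Python sketch of textbook Okapi BM25 scoring. It is not Milvus's implementation: the whitespace tokenizer, the `k1` and `b` values (commonly used defaults), and the IDF variant are all assumptions chosen for illustration, and the function expects a non-empty document list.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    # Naive tokenizer: split on whitespace, strip basic punctuation, lowercase.
    tokens = (t.strip(".,?!").lower() for t in text.split())
    return [t for t in tokens if t]


def bm25_scores(query: str, docs: List[str], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "My name is Wolfgang and I live in Berlin",
    "I saw a black horse running",
    "Germany has many big cities",
    "full text search is supported by Milvus.",
]
scores = bm25_scores("who supports full text search?", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(docs[best])
```

Only the last document shares terms with the query (`full`, `text`, `search`), so it is the only one with a non-zero score.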
250+
251+
Here is an example:
252+
253+
```python
254+
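# `documents` is the list defined in the previous example.
from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy

from milvus_haystack import MilvusDocumentStore, MilvusSparseEmbeddingRetriever
from milvus_haystack.function import BM25BuiltInFunction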
255+
256+
document_store = MilvusDocumentStore(
257+
connection_args={"uri": "http://localhost:19530"},
258+
sparse_vector_field="sparse_vector",
259+
text_field="text",
260+
builtin_function=[
261+
BM25BuiltInFunction( # The BM25 function converts the text into a sparse vector.
262+
input_field_names="text", output_field_names="sparse_vector",
263+
)
264+
],
265+
drop_old=True,
266+
)
267+
writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.NONE)
268+
indexing_pipeline = Pipeline()
269+
indexing_pipeline.add_component("writer", writer)
270+
indexing_pipeline.run({"writer": {"documents": documents}})
271+
retrieval_pipeline = Pipeline()
272+
retrieval_pipeline.add_component("sparse_retriever", MilvusSparseEmbeddingRetriever(document_store=document_store))
273+
query = "who supports full text search?"
274+
result = retrieval_pipeline.run({"sparse_retriever": {"query_text": query}})
275+
print(result["sparse_retriever"]["documents"][0])
276+
```
277+
278+
279+
## Hybrid Retrieval
280+
### Hybrid retrieval with Haystack sparse embedder
281+
This example demonstrates the basic approach to hybrid retrieval, combining a dense OpenAI embedder with Haystack's sparse embedders.
282+
```python
283+
from haystack import Document, Pipeline
284+
from haystack.components.embedders import OpenAIDocumentEmbedder, OpenAITextEmbedder
285+
from haystack.components.writers import DocumentWriter
286+
from haystack.document_stores.types import DuplicatePolicy
287+
from haystack_integrations.components.embedders.fastembed import (
288+
FastembedSparseDocumentEmbedder,
289+
FastembedSparseTextEmbedder,
290+
)
291+
292+
from milvus_haystack import MilvusDocumentStore, MilvusHybridRetriever
293+
294+
document_store = MilvusDocumentStore(
295+
connection_args={"uri": "./milvus.db"},
296+
drop_old=True,
297+
sparse_vector_field="sparse_vector", # Name the sparse vector field to enable hybrid retrieval.
298+
)
299+
300+
documents = [
301+
Document(content="My name is Wolfgang and I live in Berlin"),
302+
Document(content="I saw a black horse running"),
303+
Document(content="Germany has many big cities"),
304+
Document(content="full text search is supported by Milvus."),
305+
]
306+
307+
writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.NONE)
308+
309+
indexing_pipeline = Pipeline()
310+
indexing_pipeline.add_component("sparse_doc_embedder", FastembedSparseDocumentEmbedder())
311+
indexing_pipeline.add_component("dense_doc_embedder", OpenAIDocumentEmbedder())
312+
indexing_pipeline.add_component("writer", writer)
313+
indexing_pipeline.connect("sparse_doc_embedder", "dense_doc_embedder")
314+
indexing_pipeline.connect("dense_doc_embedder", "writer")
315+
316+
indexing_pipeline.run({"sparse_doc_embedder": {"documents": documents}})
317+
318+
retrieval_pipeline = Pipeline()
319+
retrieval_pipeline.add_component("sparse_text_embedder",
320+
FastembedSparseTextEmbedder(model="prithvida/Splade_PP_en_v1"))
321+
322+
retrieval_pipeline.add_component("dense_text_embedder", OpenAITextEmbedder())
323+
retrieval_pipeline.add_component(
324+
"retriever",
325+
MilvusHybridRetriever(
326+
document_store=document_store,
327+
# reranker=WeightedRanker(0.5, 0.5), # Default is RRFRanker()
328+
)
162329
)
163-
print('RAG answer:', results["generator"]["replies"][0])
164330

331+
retrieval_pipeline.connect("sparse_text_embedder.sparse_embedding", "retriever.query_sparse_embedding")
332+
retrieval_pipeline.connect("dense_text_embedder.embedding", "retriever.query_embedding")
333+
334+
question = "who supports full text search?"
335+
336+
results = retrieval_pipeline.run(
337+
{"dense_text_embedder": {"text": question},
338+
"sparse_text_embedder": {"text": question}}
339+
)
340+
341+
print(results["retriever"]["documents"][0])
342+
343+
# Document(id=..., content: 'full text search is supported by Milvus.', embedding: vector of size 1536, sparse_embedding: vector with 48 non-zero elements)
165344
```
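
The commented-out `reranker` argument in the example above mentions `RRFRanker` as the default. As a rough illustration of the idea behind reciprocal rank fusion, here is a minimal pure-Python sketch that merges a dense and a sparse ranking. It is not the Milvus implementation: `reciprocal_rank_fusion` is a hypothetical helper, and `k=60` is a commonly cited constant assumed here for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one ranked list."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); higher ranks contribute more.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; break ties by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_ranking = ["a", "b", "c"]   # order returned by the dense retriever
sparse_ranking = ["b", "c", "a"]  # order returned by the sparse retriever
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)  # ['b', 'a', 'c']
```

Document `b` comes out first because it ranks near the top of both lists, even though it is not first in the dense ranking.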
345+
### Hybrid retrieval with Milvus built-in BM25 function
346+
Milvus provides a built-in BM25 function that can generate sparse vectors directly from text fields. This approach simplifies the pipeline construction compared to using Haystack's sparse embedders, making it a useful complement to semantic search. The main differences are:
347+
348+
1. We need to specify a `BM25BuiltInFunction` in the document store, together with the `text_field` and `sparse_vector_field` it reads from and writes to.
349+
2. We don't need an explicit embedder, because Milvus generates the sparse embeddings on the server side.
350+
3. The pipeline is simpler with fewer components and connections, which is especially beneficial in hybrid retrieval setups.
351+
352+
Here is an example:
353+
354+
```python
355+
from milvus_haystack.function import BM25BuiltInFunction
356+
357+
document_store = MilvusDocumentStore(
358+
connection_args={"uri": "http://localhost:19530"},
359+
sparse_vector_field="sparse_vector",
360+
text_field="text",
361+
builtin_function=[
362+
BM25BuiltInFunction( # The BM25 function converts the text into a sparse vector.
363+
input_field_names="text", output_field_names="sparse_vector",
364+
)
365+
],
366+
drop_old=True,
367+
)
368+
writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.NONE)
369+
indexing_pipeline = Pipeline()
370+
indexing_pipeline.add_component("writer", writer)
371+
indexing_pipeline.run({"writer": {"documents": documents}})
372+
retrieval_pipeline = Pipeline()
373+
retrieval_pipeline.add_component("sparse_retriever", MilvusSparseEmbeddingRetriever(document_store=document_store))
374+
query = "who supports full text search?"
375+
result = retrieval_pipeline.run({"sparse_retriever": {"query_text": query}})
376+
print(result["sparse_retriever"]["documents"][0])
377+
```
378+
379+
380+
## License
381+
382+
`milvus-haystack` is distributed under the terms of the [Apache-2.0](https://spdx.org/licenses/Apache-2.0.html) license.
