| title | retrieve_chunks |
|---|---|
| description | Retrieve relevant chunks from Morphik |

- `query` (str, optional): Search query text. Mutually exclusive with `query_image`.
- `filters` (Dict[str, Any], optional): Optional metadata filters.
- `k` (int, optional): Number of results. Defaults to 4.
- `min_score` (float, optional): Minimum similarity threshold. Defaults to 0.0.
- `use_colpali` (bool, optional): Whether to use the ColPali-style embedding model to retrieve the chunks (only works for documents ingested with `use_colpali=True`). Defaults to True.
- `folder_name` (str | List[str], optional): Optional folder scope. Accepts a single folder name or a list of folder names.
- `padding` (int, optional): Number of additional chunks/pages to retrieve before and after matched chunks (ColPali only). Defaults to 0.
- `output_format` (str, optional): Controls how image chunks are returned:
  - `"base64"` (default): Returns base64-encoded image data
  - `"url"`: Returns presigned HTTPS URLs
  - `"text"`: Converts images to markdown text via OCR
- `query_image` (str, optional): Base64-encoded image for reverse image search. Mutually exclusive with `query`. Requires `use_colpali=True`.

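
The constraints between `query`, `query_image`, and `use_colpali` can be checked locally before a request is sent. The following is a minimal sketch, not part of the SDK: `build_retrieve_kwargs` is a hypothetical helper, and it assumes that exactly one of `query` or `query_image` must be supplied. How the SDK itself reacts to invalid combinations is not covered here.

```python
from typing import Any, Dict, Optional


def build_retrieve_kwargs(
    query: Optional[str] = None,
    query_image: Optional[str] = None,
    use_colpali: bool = True,
    **extra: Any,
) -> Dict[str, Any]:
    """Hypothetical helper: validate arguments before calling retrieve_chunks."""
    # query and query_image are mutually exclusive; this sketch also
    # assumes that one of them is required.
    if (query is None) == (query_image is None):
        raise ValueError("Provide exactly one of query or query_image")
    # Image queries require the ColPali-style embedding model.
    if query_image is not None and not use_colpali:
        raise ValueError("query_image requires use_colpali=True")

    kwargs: Dict[str, Any] = {"use_colpali": use_colpali, **extra}
    if query is not None:
        kwargs["query"] = query
    else:
        kwargs["query_image"] = query_image
    return kwargs
```

Assuming the parameter names listed above are accepted as keyword arguments, the result could be passed on as `db.retrieve_chunks(**build_retrieve_kwargs(query="delta status", k=6))`.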
Filters follow the same JSON syntax across the API. See the Metadata Filtering guide for supported operators and typed comparisons. Example:

```python
filters = {
    "$and": [
        {"department": {"$eq": "research"}},
        {"priority": {"$gte": 40}},
        {"start_date": {"$lte": "2024-06-01T00:00:00Z"}},
    ]
}
chunks = db.retrieve_chunks("delta status", filters=filters, k=6)
```

Returns `List[FinalChunkResult]`: a list of chunk results.

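
The `$and` filter from the metadata example can also be assembled programmatically, which helps when date bounds come from `datetime` objects rather than hand-written strings. This is a small sketch under stated assumptions: `iso_utc` and `all_of` are hypothetical helpers, not SDK functions, and the timestamp format simply mirrors the `2024-06-01T00:00:00Z` string used in the example.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def iso_utc(dt: datetime) -> str:
    """Format a timezone-aware datetime as an ISO 8601 UTC string ending in Z."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def all_of(*conditions: Dict[str, Any]) -> Dict[str, Any]:
    """Combine several conditions under a single $and clause."""
    return {"$and": list(conditions)}


# Builds the same filter dictionary as the hand-written example.
filters = all_of(
    {"department": {"$eq": "research"}},
    {"priority": {"$gte": 40}},
    {"start_date": {"$lte": iso_utc(datetime(2024, 6, 1, tzinfo=timezone.utc))}},
)
```

The resulting `filters` dictionary can be passed to `retrieve_chunks` unchanged.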

```python
from morphik import Morphik

# Synchronous client
db = Morphik()

chunks = db.retrieve_chunks(
    "What are the key findings?",
    filters={"department": "research"},
    k=5,
    min_score=0.5,
    padding=1,
    output_format="url",  # Return image chunks as presigned URLs
)

for chunk in chunks:
    print(f"Score: {chunk.score}")
    # For image chunks with output_format="url", content will be a URL string
    print(f"Content: {chunk.content}")
    print(f"Document ID: {chunk.document_id}")
    print(f"Chunk Number: {chunk.chunk_number}")
    print(f"Metadata: {chunk.metadata}")
    print("---")
```

```python
import asyncio

from morphik import AsyncMorphik


async def main():
    # Asynchronous client
    async with AsyncMorphik() as db:
        chunks = await db.retrieve_chunks(
            "What are the key findings?",
            filters={"department": "research"},
            k=5,
            min_score=0.5,
            padding=1,
            output_format="url",  # Return image chunks as presigned URLs
        )

        for chunk in chunks:
            print(f"Score: {chunk.score}")
            # For image chunks with output_format="url", content will be a URL string
            print(f"Content: {chunk.content}")
            print(f"Document ID: {chunk.document_id}")
            print(f"Chunk Number: {chunk.chunk_number}")
            print(f"Metadata: {chunk.metadata}")
            print("---")


asyncio.run(main())
```

The `FinalChunkResult` objects returned by this method have the following properties:

- `content` (str | PILImage): Chunk content (text or image)
- `score` (float): Relevance score
- `document_id` (str): Parent document ID
- `chunk_number` (int): Chunk sequence number
- `metadata` (Dict[str, Any]): Document metadata
- `content_type` (str): Content type
- `filename` (Optional[str]): Original filename
- `download_url` (Optional[str]): URL to download full document

The `output_format` parameter controls how image chunks appear in these results:

- `"base64"` (default): Image chunks are returned as base64 data (the SDK attempts to decode these into a `PIL.Image` for `FinalChunkResult.content`).
- `"url"`: Image chunks are returned as presigned HTTPS URLs in `content`. This is convenient for UIs and LLMs that accept remote image URLs (e.g., via `image_url`).
- `"text"`: Image chunks are converted to markdown text via OCR. Use this when you need faster inference or when documents are mostly text-based.
- Text chunks are unaffected by `output_format` and are always returned as strings.
- The `download_url` field may be populated for image chunks. When using `output_format="url"`, it will typically match `content` for those chunks.

| Format | Best For |
|---|---|
| `base64` | Direct image processing, local applications |
| `url` | Web UIs, LLMs with vision capabilities (lighter on network) |
| `text` | Faster inference, text-heavy documents, context length concerns |

When to use text: Passing images to LLMs for inference can be slow and consume significant context tokens. Use output_format="text" when you need faster inference speeds or when your documents are primarily text-based.
If you're hitting context limits with images, it may be because they aren't being passed correctly to the model. See Generating Completions with Retrieved Chunks for examples of properly passing images (both base64 and URLs) to vision-capable models like GPT-4o.
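
As a rough illustration of passing URL-format results to a vision-capable model, the sketch below turns retrieved chunks into a list of message content parts. It is not taken from the linked guide: `to_content_parts` is a hypothetical helper, the part layout follows the OpenAI Chat Completions convention for `image_url` inputs, and treating any `content` string that starts with `https://` as an image is a heuristic that assumes the chunks were retrieved with `output_format="url"`. The `SimpleNamespace` objects stand in for real results.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List


def to_content_parts(question: str, chunks: Iterable[Any]) -> List[Dict[str, Any]]:
    """Hypothetical helper: build message content parts from retrieved chunks."""
    parts: List[Dict[str, Any]] = [{"type": "text", "text": question}]
    for chunk in chunks:
        content = chunk.content
        # Heuristic (assumption): presigned image URLs are HTTPS strings.
        if isinstance(content, str) and content.startswith("https://"):
            parts.append({"type": "image_url", "image_url": {"url": content}})
        else:
            parts.append({"type": "text", "text": str(content)})
    return parts


# Stand-in chunks for illustration only; real code would use retrieve_chunks results.
sample_chunks = [
    SimpleNamespace(content="Revenue grew in Q2."),
    SimpleNamespace(content="https://example.com/page-3.png"),
]
parts = to_content_parts("What are the key findings?", sample_chunks)
```

The resulting list can serve as the `content` of a single user message. Other providers use different part layouts, so the dictionary shapes would need adapting.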
Tip: To download the original raw file for a document, use get_document_download_url.
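
Because each result carries a `document_id` and a `chunk_number`, results can be regrouped per document and put back into reading order, which is useful when `padding` returns neighbouring chunks. The following is a minimal sketch: `ChunkStub` is a stand-in with a few `FinalChunkResult`-like fields, and `group_by_document` is a hypothetical helper rather than an SDK function.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List


@dataclass
class ChunkStub:
    """Stand-in with a few FinalChunkResult-like fields, for illustration only."""
    content: Any
    score: float
    document_id: str
    chunk_number: int


def group_by_document(chunks: Iterable[Any]) -> Dict[str, List[Any]]:
    """Group chunks by document_id and order each group by chunk_number."""
    groups: Dict[str, List[Any]] = defaultdict(list)
    for chunk in chunks:
        groups[chunk.document_id].append(chunk)
    for doc_chunks in groups.values():
        doc_chunks.sort(key=lambda c: c.chunk_number)
    return dict(groups)


sample = [
    ChunkStub("second", 0.71, "doc_a", 2),
    ChunkStub("other", 0.64, "doc_b", 0),
    ChunkStub("first", 0.83, "doc_a", 1),
]
grouped = group_by_document(sample)
```

The helper only reads `document_id` and `chunk_number`, so it should work on any objects exposing those two attributes.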
You can search using an image instead of text by providing query_image with a base64-encoded image. This enables finding visually similar content in your documents.

```python
import base64

from morphik import Morphik

# Synchronous client
db = Morphik()

# Load and encode your query image
with open("query_image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# Search using the image
chunks = db.retrieve_chunks(
    query_image=image_b64,
    use_colpali=True,  # Required for image queries
    k=5,
)

for chunk in chunks:
    print(f"Score: {chunk.score}")
    print(f"Document ID: {chunk.document_id}")
    print("---")
```

```python
import asyncio
import base64

from morphik import AsyncMorphik


async def main():
    # Asynchronous client
    async with AsyncMorphik() as db:
        # Load and encode your query image
        with open("query_image.png", "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode("utf-8")

        # Search using the image
        chunks = await db.retrieve_chunks(
            query_image=image_b64,
            use_colpali=True,  # Required for image queries
            k=5,
        )

        for chunk in chunks:
            print(f"Score: {chunk.score}")
            print(f"Document ID: {chunk.document_id}")
            print("---")


asyncio.run(main())
```