title: Markitdown
id: integrations-markitdown
description: Markitdown integration for Haystack
slug: /integrations-markitdown

haystack_integrations.components.converters.markitdown.markitdown_converter

MarkItDownConverter

Converts files to Haystack Documents using MarkItDown.

MarkItDown is a Microsoft library that converts many file formats to Markdown, including PDF, Word (.docx), PowerPoint (.pptx), Excel (.xlsx), HTML, images, audio, and more. All processing is performed locally.

Usage example

from haystack_integrations.components.converters.markitdown import MarkItDownConverter

converter = MarkItDownConverter()
result = converter.run(sources=["document.pdf", "report.docx"])
documents = result["documents"]

__init__

__init__(store_full_path: bool = False) -> None

Initializes the MarkItDownConverter.

Parameters:

  • store_full_path (bool) – If True, the full file path is stored in the Document metadata. If False, only the file name is stored. Defaults to False.
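
The effect of store_full_path can be pictured with a small standalone sketch. This is not the converter's code: the helper name stored_file_path is hypothetical, and it only illustrates the "full path versus file name" distinction described above, using POSIX-style paths.

```python
from pathlib import PurePosixPath


def stored_file_path(source: str, store_full_path: bool = False) -> str:
    """Return the value that would be kept for a source path.

    Hypothetical helper for illustration only; it is not part of
    MarkItDownConverter.
    """
    path = PurePosixPath(source)
    # Full path when requested, otherwise only the final path component.
    return str(path) if store_full_path else path.name


print(stored_file_path("/data/reports/q3.pdf"))
print(stored_file_path("/data/reports/q3.pdf", store_full_path=True))
```

Keeping only the file name avoids recording local directory layouts in Document metadata, which is presumably why it is the default.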

run

run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, list[Document]]

Converts files to Documents using MarkItDown.

Parameters:

  • sources (list[str | Path | ByteStream]) – List of file paths or ByteStream objects to convert.
  • meta (dict[str, Any] | list[dict[str, Any]] | None) – Optional metadata to attach to the Documents. Either a single dict that is applied to every Document, or a list of dicts with one entry per source, in the same order as sources.
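
The two accepted shapes of meta can be illustrated with a minimal standalone sketch. The function align_meta is a hypothetical name, not the integration's implementation, and raising ValueError on a length mismatch is an assumption of this sketch rather than documented behavior.

```python
from typing import Any, Optional, Union

Meta = Union[dict[str, Any], list[dict[str, Any]], None]


def align_meta(sources: list[Any], meta: Optional[Meta] = None) -> list[dict[str, Any]]:
    """Expand meta into one dict per source (illustrative sketch only)."""
    if meta is None:
        # No metadata supplied: every source gets an empty dict.
        return [{} for _ in sources]
    if isinstance(meta, dict):
        # A single dict is copied for every source.
        return [dict(meta) for _ in sources]
    if len(meta) != len(sources):
        # Assumption of this sketch: a mismatched list is an error.
        raise ValueError("meta must contain exactly one dict per source")
    # A list is matched to sources by position.
    return [dict(m) for m in meta]


print(align_meta(["a.pdf", "b.docx"], {"lang": "en"}))
print(align_meta(["a.pdf", "b.docx"], [{"year": 2023}, {"year": 2024}]))
```

Copying each dict means a later change to one Document's metadata does not leak into the others.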

Returns:

  • dict[str, list[Document]] – A dictionary whose documents key holds the list of converted Documents.