| title | Markitdown |
|---|---|
| id | integrations-markitdown |
| description | Markitdown integration for Haystack |
| slug | /integrations-markitdown |

Converts files to Haystack Documents using MarkItDown.

MarkItDown is a Microsoft library that converts many file formats to Markdown, including PDF, Word (.docx), PowerPoint (.pptx), Excel (.xlsx), HTML, images, audio, and more. All processing is performed locally.

    from haystack_integrations.components.converters.markitdown import MarkItDownConverter

    converter = MarkItDownConverter()
    result = converter.run(sources=["document.pdf", "report.docx"])
    documents = result["documents"]

`__init__(store_full_path: bool = False) -> None`

Initializes the MarkItDownConverter.

Parameters:
- store_full_path (`bool`) – If `True`, the full file path is stored in the Document metadata. If `False`, only the file name is stored. Defaults to `False`.
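
The effect of `store_full_path` can be sketched in plain Python. This is an illustration only, not the converter's actual code: the helper name `file_path_for_meta` is hypothetical, and which metadata key the value is stored under is not stated on this page.

```python
import os


def file_path_for_meta(path: str, store_full_path: bool = False) -> str:
    """Hypothetical helper: the path value a converter would keep for a source."""
    # With store_full_path=True the path is kept exactly as given;
    # otherwise only the final component (the file name) is kept.
    return path if store_full_path else os.path.basename(path)


full = file_path_for_meta("/data/reports/report.docx", store_full_path=True)
short = file_path_for_meta("/data/reports/report.docx")
print(full)
print(short)
```

Keeping only the file name by default avoids leaking local directory layouts into Document metadata, at the cost of losing the ability to tell apart two files with the same name in different folders.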

`run(sources: list[str | Path | ByteStream], meta: dict[str, Any] | list[dict[str, Any]] | None = None) -> dict[str, list[Document]]`

Converts files to Documents using MarkItDown.

Parameters:
- sources (`list[str | Path | ByteStream]`) – List of file paths or ByteStream objects to convert.
- meta (`dict[str, Any] | list[dict[str, Any]] | None`) – Optional metadata to attach to the Documents. Can be a single dict applied to all Documents, or a list of dicts aligned with `sources`.
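
The two accepted shapes of `meta` can be illustrated with a small stand-alone sketch. The helper `align_meta` is hypothetical and is not part of the integration. The length check that raises `ValueError` is an assumption about how a mismatched list would be handled, not documented behavior.

```python
from __future__ import annotations

from typing import Any


def align_meta(
    sources: list[Any],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> list[dict[str, Any]]:
    """Hypothetical helper: produce one metadata dict per source."""
    if meta is None:
        # No metadata supplied: every source gets an empty dict.
        return [{} for _ in sources]
    if isinstance(meta, dict):
        # A single dict is applied to all sources; copy it so the
        # resulting entries do not share one mutable object.
        return [dict(meta) for _ in sources]
    if len(meta) != len(sources):
        # Assumed behavior: a list must line up one-to-one with sources.
        raise ValueError("meta list must have the same length as sources")
    return [dict(m) for m in meta]


sources = ["document.pdf", "report.docx"]
shared = align_meta(sources, {"project": "demo"})
per_source = align_meta(sources, [{"lang": "en"}, {"lang": "de"}])
print(shared)
print(per_source)
```

A single dict suits metadata that describes the whole batch, while a list suits per-file attributes such as language or author.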
Returns:
`dict[str, list[Document]]` – A dictionary with key `documents` containing the converted Documents.
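
As a sketch of how the returned dictionary might be consumed, the helper below walks `result["documents"]` and reports the size of each converted text. The function name `summarize` is hypothetical, and the `file_path` metadata key is an assumption, not something this page confirms. The example uses stand-in objects with `content` and `meta` attributes so it runs without any files on disk.

```python
from __future__ import annotations

from types import SimpleNamespace
from typing import Any


def summarize(result: dict[str, list[Any]]) -> list[tuple[str, int]]:
    """Return (file_path, content length) for each converted Document."""
    return [
        # "file_path" is an assumed metadata key; fall back if it is absent.
        (doc.meta.get("file_path", "<unknown>"), len(doc.content or ""))
        for doc in result["documents"]
    ]


# Stand-in for the dictionary returned by run(); real entries are Haystack Documents.
fake_result = {
    "documents": [
        SimpleNamespace(content="# Title\n\nBody", meta={"file_path": "document.pdf"}),
        SimpleNamespace(content="", meta={}),
    ]
}
summary = summarize(fake_result)
print(summary)
```

With a real converter, `fake_result` would be replaced by the value returned from `converter.run(...)`.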