| 1 | +--- |
| 2 | +layout: integration |
| 3 | +name: MarkItDown |
| 4 | +description: Use Microsoft's MarkItDown to locally convert PDF, DOCX, PPTX, XLSX, HTML, images, and more into Markdown in Haystack |
| 5 | +authors: |
| 6 | +  - name: deepset |
| 7 | +    socials: |
| 8 | +      github: deepset-ai |
| 9 | +      twitter: deepset_ai |
| 10 | +      linkedin: https://www.linkedin.com/company/deepset-ai/ |
| 11 | +pypi: https://pypi.org/project/markitdown-haystack |
| 12 | +repo: https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/markitdown |
| 13 | +type: Data Ingestion |
| 14 | +report_issue: https://github.com/deepset-ai/haystack-core-integrations/issues |
| 15 | +logo: /logos/microsoft.png |
| 16 | +version: Haystack 2.0 |
| 17 | +toc: true |
| 18 | +--- |
| 19 | + |
| 20 | +### **Table of Contents** |
| 21 | +- [Overview](#overview) |
| 22 | +- [Installation](#installation) |
| 23 | +- [Usage](#usage) |
| 24 | +- [License](#license) |
| 25 | + |
| 26 | +## Overview |
| 27 | + |
| 28 | +[MarkItDown](https://github.com/microsoft/markitdown) is a Python library by Microsoft that converts files into Markdown. Supported formats include PDF, Word (.docx), PowerPoint (.pptx), Excel (.xlsx), HTML, and images, and all conversion runs locally. |
| 29 | + |
| 30 | +This integration provides a `MarkItDownConverter` component that wraps Microsoft's MarkItDown library, enabling Haystack users to convert files into Haystack `Document` objects with Markdown content. |
| 31 | + |
| 32 | +## Installation |
| 33 | + |
| 34 | +```bash |
| 35 | +pip install markitdown-haystack |
| 36 | +``` |
| 37 | + |
| 38 | +## Usage |
| 39 | + |
| 40 | +### Standalone |
| 41 | + |
| 42 | +```python |
| 43 | +from haystack_integrations.components.converters.markitdown import MarkItDownConverter |
| 44 | + |
| 45 | +converter = MarkItDownConverter() |
| 46 | +result = converter.run(sources=["document.pdf", "report.docx"]) |
| 47 | +documents = result["documents"] |
| 48 | +``` |
| 49 | + |
| 50 | +You can also pass metadata to attach to the resulting documents: |
| 51 | + |
| 52 | +```python |
| 53 | +from haystack_integrations.components.converters.markitdown import MarkItDownConverter |
| 54 | + |
| 55 | +converter = MarkItDownConverter() |
| 56 | +result = converter.run( |
| 57 | +    sources=["document.pdf", "report.docx"], |
| 58 | +    meta=[{"author": "Alice"}, {"author": "Bob"}] |
| 59 | +) |
| 60 | +documents = result["documents"] |
| 61 | +``` |
| 62 | + |
| 63 | +To convert `ByteStream` objects: |
| 64 | + |
| 65 | +```python |
| 66 | +from haystack.dataclasses import ByteStream |
| 67 | +from haystack_integrations.components.converters.markitdown import MarkItDownConverter |
| 68 | + |
| 69 | +converter = MarkItDownConverter() |
| 70 | +bytestream = ByteStream(data=file_bytes, meta={"file_path": "document.pdf"}) |
| 71 | +result = converter.run(sources=[bytestream]) |
| 72 | +documents = result["documents"] |
| 73 | +``` |
| 74 | + |
| 75 | +### In a Haystack Pipeline |
| 76 | + |
| 77 | +```python |
| 78 | +from haystack import Pipeline |
| 79 | +from haystack.components.writers import DocumentWriter |
| 80 | +from haystack.document_stores.in_memory import InMemoryDocumentStore |
| 81 | +from haystack_integrations.components.converters.markitdown import MarkItDownConverter |
| 82 | + |
| 83 | +document_store = InMemoryDocumentStore() |
| 84 | + |
| 85 | +indexing = Pipeline() |
| 86 | +indexing.add_component("converter", MarkItDownConverter()) |
| 87 | +indexing.add_component("writer", DocumentWriter(document_store)) |
| 88 | +indexing.connect("converter", "writer") |
| 89 | + |
| 90 | +indexing.run({"converter": {"sources": ["a/file/path.pdf", "another/file.docx"]}}) |
| 91 | +``` |
| 92 | + |
| 93 | +## License |
| 94 | + |
| 95 | +`markitdown-haystack` is distributed under the terms of the [Apache-2.0](https://spdx.org/licenses/Apache-2.0.html) license. |
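
The `ByteStream` example in this diff passes a `file_bytes` variable that is never defined. One way a reader might obtain it is sketched below, using only the standard library. The helper name `load_file_bytes` is hypothetical and is not part of Haystack or of this integration; the `file_path` meta key simply mirrors the one used in the example.

```python
from pathlib import Path
import tempfile


def load_file_bytes(path):
    """Hypothetical helper: return (raw bytes, meta dict) for a file on disk."""
    file_path = Path(path)
    file_bytes = file_path.read_bytes()  # whole file is read into memory
    meta = {"file_path": str(file_path)}  # same meta key as the ByteStream example
    return file_bytes, meta


# Demo on a throwaway file so the sketch runs without a real PDF.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "document.txt"
    sample.write_bytes(b"hello")
    file_bytes, meta = load_file_bytes(sample)
```

The returned values could then be passed on as `ByteStream(data=file_bytes, meta=meta)`, matching the call shown in the example.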