Commit 8e8a455

Add MarkItDown integration (#428)
* Add MarkItDown integration
* Add logo for MarkItDown integration
1 parent dbf2486 commit 8e8a455

2 files changed: +95 -0 lines changed

integrations/markitdown.md

Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
---
layout: integration
name: MarkItDown
description: Use Microsoft's MarkItDown to locally convert PDF, DOCX, PPTX, XLSX, HTML, images, and more into Markdown in Haystack
authors:
  - name: deepset
    socials:
      github: deepset-ai
      twitter: deepset_ai
      linkedin: https://www.linkedin.com/company/deepset-ai/
pypi: https://pypi.org/project/markitdown-haystack
repo: https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/markitdown
type: Data Ingestion
report_issue: https://github.com/deepset-ai/haystack-core-integrations/issues
logo: /logos/microsoft.png
version: Haystack 2.0
toc: true
---

### **Table of Contents**
- [Overview](#overview)
- [Installation](#installation)
- [Usage](#usage)
- [License](#license)

## Overview

[MarkItDown](https://github.com/microsoft/markitdown) is a Python library by Microsoft for converting various file formats into Markdown. It supports a wide range of formats, including PDF, Word (.docx), PowerPoint (.pptx), Excel (.xlsx), HTML, and images, and processes them all locally.

This integration provides a `MarkItDownConverter` component that wraps the MarkItDown library, so Haystack users can convert files into Haystack `Document` objects with Markdown content.

## Installation

```bash
pip install markitdown-haystack
```

## Usage

### Standalone

```python
from haystack_integrations.components.converters.markitdown import MarkItDownConverter

converter = MarkItDownConverter()
result = converter.run(sources=["document.pdf", "report.docx"])
documents = result["documents"]
```
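
When the files live in a directory rather than in a hand-written list, the `sources` argument can be assembled first. The sketch below is one way to do that with the standard library only. `collect_sources` and `SUPPORTED_EXTENSIONS` are hypothetical names chosen for illustration and are not part of the integration; the extension set is simply taken from the formats named in the Overview.

```python
from pathlib import Path

# Hypothetical filter, based on the formats listed in the Overview.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx", ".html"}


def collect_sources(root, extensions=SUPPORTED_EXTENSIONS):
    """Return sorted file paths under `root` whose suffix is in `extensions`."""
    root_path = Path(root)
    return sorted(
        str(path)
        for path in root_path.rglob("*")  # walk the directory tree recursively
        if path.is_file() and path.suffix.lower() in extensions
    )
```

The returned list could then be passed to the converter, for example `converter.run(sources=collect_sources("docs"))`, where `docs` is a placeholder directory.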

You can also pass metadata to attach to the resulting documents:

```python
from haystack_integrations.components.converters.markitdown import MarkItDownConverter

converter = MarkItDownConverter()
result = converter.run(
    sources=["document.pdf", "report.docx"],
    meta=[{"author": "Alice"}, {"author": "Bob"}]
)
documents = result["documents"]
```
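
The example above suggests that `meta` carries one dictionary per source, in the same order. Assuming that pairing, a small helper can derive the list from the paths so the two stay aligned. `build_meta` and the keys `file_name`, `file_type`, and `project` are hypothetical and only illustrate the shape of the argument.

```python
from pathlib import Path


def build_meta(sources, shared=None):
    """Build one metadata dict per source, in the same order as `sources`."""
    shared = shared or {}
    meta = []
    for source in sources:
        path = Path(source)
        entry = dict(shared)  # copy so entries do not share state
        entry["file_name"] = path.name
        entry["file_type"] = path.suffix.lower().lstrip(".")
        meta.append(entry)
    return meta


sources = ["document.pdf", "report.docx"]
meta = build_meta(sources, shared={"project": "demo"})
```

Under the one-dictionary-per-source assumption, `converter.run(sources=sources, meta=meta)` would attach each entry to the document produced from the matching source.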

To convert `ByteStream` objects:

```python
from pathlib import Path

from haystack.dataclasses import ByteStream
from haystack_integrations.components.converters.markitdown import MarkItDownConverter

# Read the raw bytes of the file to convert (the path is an example)
file_bytes = Path("document.pdf").read_bytes()

converter = MarkItDownConverter()
bytestream = ByteStream(data=file_bytes, meta={"file_path": "document.pdf"})
result = converter.run(sources=[bytestream])
documents = result["documents"]
```

### In a Haystack Pipeline

```python
from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.converters.markitdown import MarkItDownConverter

document_store = InMemoryDocumentStore()

indexing = Pipeline()
indexing.add_component("converter", MarkItDownConverter())
indexing.add_component("writer", DocumentWriter(document_store))
indexing.connect("converter", "writer")

indexing.run({"converter": {"sources": ["a/file/path.pdf", "another/file.docx"]}})
```

## License

`markitdown-haystack` is distributed under the terms of the [Apache-2.0](https://spdx.org/licenses/Apache-2.0.html) license.

logos/microsoft.png

1.39 KB