Commit 77887ca

Add integration page for LibreOfficeFileConverter (#427)
* First pass at integration page
* add license info
* Add logo
* Add author
1 parent b73d6ac commit 77887ca

2 files changed

Lines changed: 125 additions & 0 deletions

@@ -0,0 +1,125 @@
---
layout: integration
name: LibreOffice File Converter
description: Convert office documents, spreadsheets, and presentations between formats using LibreOffice in Haystack pipelines.
authors:
  - name: Max Swain
    socials:
      github: maxdswain
  - name: deepset
    socials:
      github: deepset-ai
      twitter: deepset_ai
      linkedin: https://www.linkedin.com/company/deepset-ai/
pypi: https://pypi.org/project/libreoffice-haystack/
repo: https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/libreoffice
type: Data Ingestion
report_issue: https://github.com/deepset-ai/haystack-core-integrations/issues
logo: /logos/libreoffice.png
version: Haystack 2.0
toc: true
---

**Table of Contents**

- [Overview](#overview)
- [Installation](#installation)
- [Usage](#usage)
  - [Standalone](#standalone)
  - [In a Haystack Pipeline](#in-a-haystack-pipeline)
  - [Async Usage](#async-usage)
- [License](#license)

## Overview

`LibreOfficeFileConverter` is a Haystack component that uses the command-line utility (`soffice`) of [LibreOffice](https://www.libreoffice.org/) to convert office files between formats. It supports documents, spreadsheets, and presentations, and it outputs `ByteStream` objects that plug directly into other Haystack components.

Sources can be file paths (`str` or `Path`) or `ByteStream` objects. Both synchronous (`run`) and asynchronous (`run_async`) execution modes are supported.

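For orientation, a headless LibreOffice conversion on the command line typically takes the form `soffice --headless --convert-to <type> --outdir <dir> <file>`. The sketch below only builds such a command as a list and does not run it. `build_soffice_command` is a hypothetical helper, and whether the component invokes `soffice` with exactly these flags is an assumption.

```python
from pathlib import Path


def build_soffice_command(source: Path, output_file_type: str, out_dir: Path) -> list[str]:
    """Build a headless LibreOffice conversion command (illustrative helper only)."""
    return [
        "soffice",
        "--headless",                      # run without opening the GUI
        "--convert-to", output_file_type,  # target format, e.g. "docx" or "pdf"
        "--outdir", str(out_dir),          # directory that receives the converted file
        str(source),
    ]


cmd = build_soffice_command(Path("report.doc"), "docx", Path("converted"))
print(" ".join(cmd))
# soffice --headless --convert-to docx --outdir converted report.doc
```
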
## Installation

First, install LibreOffice on your system:

- **macOS:** `brew install --cask libreoffice`
- **Ubuntu/Debian:** `sudo apt-get install libreoffice`
- **Windows:** Download from [libreoffice.org](https://www.libreoffice.org/download/download/)

Then install the Python package:

```bash
pip install libreoffice-haystack
```

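After installing, it can help to confirm that the LibreOffice command-line binary is reachable from Python. The following is a minimal sketch using only the standard library. `find_soffice` is a hypothetical helper, and the assumption that the binary is exposed on `PATH` as `soffice` or `libreoffice` may not hold for every platform or install method.

```python
import shutil
from typing import Optional


def find_soffice() -> Optional[str]:
    """Return the path of the LibreOffice CLI if it is on PATH, else None."""
    # Assumption: the binary is named `soffice` or `libreoffice` on this system.
    for name in ("soffice", "libreoffice"):
        path = shutil.which(name)
        if path:
            return path
    return None


location = find_soffice()
if location is None:
    print("LibreOffice CLI not found on PATH; install it or add it to PATH.")
else:
    print(f"Found LibreOffice CLI at {location}")
```
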
## Usage

### Standalone

```python
from pathlib import Path
from haystack_integrations.components.converters.libreoffice import LibreOfficeFileConverter

converter = LibreOfficeFileConverter()
result = converter.run(sources=[Path("report.doc")], output_file_type="docx")
print(result["output"])  # [ByteStream(data=b'...')]
```

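Because each converted file comes back as a `ByteStream`, you may want to write the bytes to disk. Here is a minimal sketch that relies only on the `data` attribute of each stream. `save_outputs` and the `converted_<i>` naming scheme are hypothetical choices for illustration, not part of the integration.

```python
from pathlib import Path
from typing import Iterable


def save_outputs(streams: Iterable, out_dir: Path, extension: str) -> list[Path]:
    """Write each stream's raw bytes to out_dir as converted_<i>.<extension>."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, stream in enumerate(streams):
        target = out_dir / f"converted_{index}.{extension}"
        target.write_bytes(stream.data)  # ByteStream exposes its payload as `.data`
        written.append(target)
    return written


# Usage with the result of the standalone example:
# save_outputs(result["output"], Path("converted"), "docx")
```
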
You can set `output_file_type` at initialization or pass it per `run()` call. When both are given, the per-call value takes precedence:

```python
# Set at init
converter = LibreOfficeFileConverter(output_file_type="pdf")
result = converter.run(sources=[Path("report.docx")])

# Override per call
result = converter.run(sources=[Path("slides.pptx")], output_file_type="png")
```

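The precedence rule can be pictured as a small resolution step. This is only a sketch of the described behaviour, not the component's actual code: `resolve_output_file_type` is a hypothetical function, and raising `ValueError` when neither value is set is an assumption.

```python
from typing import Optional


def resolve_output_file_type(init_value: Optional[str], run_value: Optional[str]) -> str:
    """Pick the effective output type: the per-call value wins over the init value."""
    effective = run_value if run_value is not None else init_value
    if effective is None:
        # Assumption: with no value from either place there is no target format.
        raise ValueError("output_file_type must be set at init or passed to run()")
    return effective


print(resolve_output_file_type("pdf", None))   # pdf
print(resolve_output_file_type("pdf", "png"))  # png
```
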
### In a Haystack Pipeline

`LibreOfficeFileConverter` outputs `list[ByteStream]`, which connects directly to Haystack's built-in converters. Here is an example that converts a legacy `.doc` file to `.docx` and then extracts its text as Haystack `Document` objects:

```python
from pathlib import Path
from haystack import Pipeline
from haystack.components.converters import DOCXToDocument
from haystack_integrations.components.converters.libreoffice import LibreOfficeFileConverter

pipeline = Pipeline()
pipeline.add_component("libreoffice_converter", LibreOfficeFileConverter())
pipeline.add_component("docx_converter", DOCXToDocument())
pipeline.connect("libreoffice_converter.output", "docx_converter.sources")

result = pipeline.run({
    "libreoffice_converter": {
        "sources": [Path("legacy_report.doc")],
        "output_file_type": "docx",
    }
})
print(result["docx_converter"]["documents"])
```

### Async Usage

`LibreOfficeFileConverter` also exposes a `run_async` method with the same signature as `run`, for use in async Haystack pipelines:

```python
import asyncio
from pathlib import Path
from haystack_integrations.components.converters.libreoffice import LibreOfficeFileConverter

async def main():
    converter = LibreOfficeFileConverter()
    result = await converter.run_async(
        sources=[Path("presentation.pptx")],
        output_file_type="pdf",
    )
    print(result["output"])

asyncio.run(main())
```

> **Note:** LibreOffice only supports one running `soffice` instance at a time. Conversions within a single `run_async` call are executed sequentially.

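If several coroutines in your own application might call `run_async` at the same time, one way to respect the single-instance limit is to serialize the calls with an `asyncio.Lock`. The sketch below is an assumption-laden illustration: `convert_serialized` is a hypothetical wrapper, `fake_run_async` is a stand-in so the example runs without LibreOffice, and whether the component already guards against overlapping calls is not stated here.

```python
import asyncio


async def convert_serialized(lock: asyncio.Lock, convert, **kwargs):
    """Await `convert(**kwargs)` while holding `lock`, so calls never overlap."""
    async with lock:
        return await convert(**kwargs)


async def fake_run_async(sources, output_file_type):
    """Stand-in for converter.run_async so the sketch runs without LibreOffice."""
    await asyncio.sleep(0.01)
    return {"output": [f"{source} -> {output_file_type}" for source in sources]}


async def main():
    lock = asyncio.Lock()  # created inside the running event loop
    jobs = [
        convert_serialized(lock, fake_run_async, sources=["a.pptx"], output_file_type="pdf"),
        convert_serialized(lock, fake_run_async, sources=["b.docx"], output_file_type="pdf"),
    ]
    return await asyncio.gather(*jobs)


results = asyncio.run(main())
print(results)
```

In real use, you would pass `converter.run_async` in place of `fake_run_async` and share one lock across all callers.
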
## License

`libreoffice-haystack` is distributed under the [Apache-2.0 License](https://github.com/deepset-ai/haystack-core-integrations/blob/main/integrations/libreoffice/LICENSE.txt).

logos/libreoffice.png

59 KB