Commit f946c1c

MrChengLen and claude committed
docs(formats): document mammoth+WeasyPrint pipeline; drop LibreOffice troubleshooting
Pairs with the converter rewrite in fb7da9f^..ab796ef. The previous formats.md row promised "Requires Microsoft Word on Windows, or LibreOffice on Linux" which is no longer true — the pure-Python pipeline runs the same on every host.

Replaces the row text and the long-form notes section, and removes the now-stale Linux LibreOffice troubleshooting block from installation.md so a self-hoster doesn't waste an apt install on a dead path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 453cc96 commit f946c1c

2 files changed

Lines changed: 17 additions & 17 deletions

docs/formats.md

Lines changed: 17 additions & 9 deletions
@@ -46,7 +46,7 @@ Re-encode an image at a lower quality to reduce file size without changing forma
 
 | From | To | Notes |
 |------|-----|-------|
-| **DOCX** | PDF | Requires Microsoft Word on Windows, or LibreOffice on Linux. |
+| **DOCX** | PDF | Best-effort: tables, images, hyperlinks and basic styles preserved. Footnotes, headers/footers and embedded OLE objects are simplified. |
 | **DOCX** | TXT | Extracts plain text from all paragraphs. Formatting (bold, tables) is lost. |
 | **TXT** | PDF | Creates a clean PDF with Helvetica font, A4 page size. |
 | **PDF** | TXT | Extracts text from each page using PyPDF. Complex layouts (columns, forms) may not extract cleanly. |
@@ -55,16 +55,24 @@ Re-encode an image at a lower quality to reduce file size without changing forma
 
 ### Notes on DOCX → PDF
 
-**Windows**: Uses `docx2pdf` which interfaces with Microsoft Word via COM. Word must be installed.
+The pipeline runs in pure Python: `mammoth` extracts the DOCX body as HTML
+(with images inlined as `data:` URIs), then `WeasyPrint` renders the HTML
+to PDF. No external binary, no Microsoft Word, no LibreOffice required —
+works the same on Linux, macOS, Windows and inside the standard container.
 
-**Linux**: Requires LibreOffice:
-```bash
-sudo apt install libreoffice
-pip install docx2pdf
-```
-`docx2pdf` on Linux uses LibreOffice in headless mode.
+**What is preserved**: paragraphs, basic character formatting (bold, italic),
+tables (with cell borders), inline images, hyperlinks, and standard list
+styles.
 
-**Alternative** (any platform): Export manually from Microsoft Word or LibreOffice.
+**What is simplified**: footnotes and endnotes, headers and footers, page
+breaks, embedded OLE objects (Excel charts, Visio diagrams), and DOCX-native
+style hierarchies. When the source DOCX uses any of these, the resulting PDF
+includes a small notice banner at the top.
+
+**Security**: The HTML pipeline runs WeasyPrint with `_deny_url_fetcher`,
+blocking any external resource load that a malformed DOCX might attempt. See
+`tests/test_convert_document.py::test_docx_to_pdf_ssrf_blocked` for the
+regression guard.
 
 ---

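For readers who have not seen the converter itself, the two-stage pipeline and the URL-fetcher guard described in the formats.md notes can be sketched roughly as follows. This is a minimal illustration and not the project's code: `deny_url_fetcher_sketch`, `wrap_html_fragment` and `docx_to_pdf_sketch` are hypothetical names, and letting `data:` URIs through is an assumption based on the note that images are inlined that way. Only `mammoth.convert_to_html` and WeasyPrint's `HTML(..., url_fetcher=...)` and `write_pdf` are real library calls; the callable fetcher interface may differ between WeasyPrint versions, so check the one you run.

```python
import base64
from urllib.parse import unquote_to_bytes, urlsplit


def deny_url_fetcher_sketch(url, *args, **kwargs):
    """Hypothetical stand-in for the project's `_deny_url_fetcher`.

    Refuses every URL except inline `data:` URIs, so rendering never
    reaches the network or the local filesystem.
    """
    if urlsplit(url).scheme.lower() != "data":
        raise ValueError(f"external resource blocked: {url!r}")
    header, sep, payload = url.split(":", 1)[1].partition(",")
    if not sep:
        raise ValueError("malformed data: URI")
    raw = unquote_to_bytes(payload)
    if header.endswith(";base64"):
        raw = base64.b64decode(raw)
        header = header[: -len(";base64")]
    mime_type = header.split(";", 1)[0] or "text/plain"
    # WeasyPrint URL fetchers return a dict; `string` carries the raw bytes.
    return {"string": raw, "mime_type": mime_type}


def wrap_html_fragment(fragment):
    """mammoth returns an HTML fragment; wrap it into a full document."""
    return (
        '<!DOCTYPE html><html><head><meta charset="utf-8"></head>'
        f"<body>{fragment}</body></html>"
    )


def docx_to_pdf_sketch(docx_path, pdf_path):
    """Stage 1: DOCX -> HTML with mammoth. Stage 2: HTML -> PDF with WeasyPrint."""
    # Imported lazily so the helpers above work without either package.
    import mammoth
    from weasyprint import HTML

    with open(docx_path, "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file)
    html = wrap_html_fragment(result.value)
    HTML(string=html, url_fetcher=deny_url_fetcher_sketch).write_pdf(pdf_path)
    # mammoth reports anything it could not map as a list of messages.
    return result.messages
```

Calling `docx_to_pdf_sketch("report.docx", "report.pdf")` (hypothetical file names) would write the PDF and return mammoth's conversion messages. Those messages are one plausible input for the notice banner the notes mention, though the commit does not say how the banner is actually triggered.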
docs/installation.md

Lines changed: 0 additions & 8 deletions
@@ -289,14 +289,6 @@ pip install pillow-heif
 
 On Linux, also install: `sudo apt install libheif-dev`
 
-### "DOCX to PDF conversion failed" (Linux)
-
-On Linux, DOCX → PDF requires LibreOffice:
-
-```bash
-sudo apt install libreoffice
-```
-
 ### Permission denied on `data/api_keys.json` (Linux)
 
 ```bash
