docs/formats.md (17 additions & 9 deletions)
@@ -46,7 +46,7 @@ Re-encode an image at a lower quality to reduce file size without changing forma

 | From | To | Notes |
 |------|-----|-------|
-|**DOCX**| PDF |Requires Microsoft Word on Windows, or LibreOffice on Linux. |
+|**DOCX**| PDF |Best-effort: tables, images, hyperlinks and basic styles preserved. Footnotes, headers/footers and embedded OLE objects are simplified. |
 |**DOCX**| TXT | Extracts plain text from all paragraphs. Formatting (bold, tables) is lost. |
 |**TXT**| PDF | Creates a clean PDF with Helvetica font, A4 page size. |
 |**PDF**| TXT | Extracts text from each page using PyPDF. Complex layouts (columns, forms) may not extract cleanly. |
@@ -55,16 +55,24 @@ Re-encode an image at a lower quality to reduce file size without changing forma

 ### Notes on DOCX → PDF

-**Windows**: Uses `docx2pdf` which interfaces with Microsoft Word via COM. Word must be installed.
+The pipeline runs in pure Python: `mammoth` extracts the DOCX body as HTML
+(with images inlined as `data:` URIs), then `WeasyPrint` renders the HTML
+to PDF. No external binary, no Microsoft Word, no LibreOffice required;
+it works the same on Linux, macOS, Windows and inside the standard container.

-**Linux**: Requires LibreOffice:
-```bash
-sudo apt install libreoffice
-pip install docx2pdf
-```
-`docx2pdf` on Linux uses LibreOffice in headless mode.
+**What is preserved**: paragraphs, basic character formatting (bold, italic),
+tables (with cell borders), inline images, hyperlinks, and standard list
+styles.

-**Alternative** (any platform): Export manually from Microsoft Word or LibreOffice.
+**What is simplified**: footnotes and endnotes, headers and footers, page
+breaks, embedded OLE objects (Excel charts, Visio diagrams), and DOCX-native
+style hierarchies. When the source DOCX uses any of these, the resulting PDF
+includes a small notice banner at the top.
+
+**Security**: The HTML pipeline runs WeasyPrint with `_deny_url_fetcher`,
+blocking any external resource load that a malformed DOCX might attempt. See
+`tests/test_convert_document.py::test_docx_to_pdf_ssrf_blocked` for the
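For reviewers, the pipeline described in the new "Notes on DOCX → PDF" text (mammoth to HTML, then WeasyPrint to PDF with a deny-all URL fetcher) could look roughly like the sketch below. It is not the code from this PR: `deny_url_fetcher`, `wrap_body` and `docx_to_pdf` are hypothetical names, and the project's actual `_deny_url_fetcher` is not shown in the diff. The sketch assumes that inline `data:` URIs must still resolve so mammoth's inlined images survive, and that a custom fetcher may return a dictionary with `string` and `mime_type` keys, which is the shape WeasyPrint has documented for custom fetchers; check this against the installed version.

```python
import base64
import html
from pathlib import Path
from urllib.parse import unquote_to_bytes


def deny_url_fetcher(url, *args, **kwargs):
    """Hypothetical stand-in for the `_deny_url_fetcher` named in the docs.

    Serves only inline `data:` URIs and refuses every other scheme, so
    rendering never touches the network or the local filesystem.
    """
    if not url.lower().startswith("data:"):
        raise ValueError(f"external resource blocked: {url}")
    header, _, payload = url[5:].partition(",")
    mime_type = header.split(";")[0] or "text/plain"
    raw = unquote_to_bytes(payload)  # undo any percent-encoding first
    if header.lower().endswith(";base64"):
        raw = base64.b64decode(raw)
    # Assumption: dict return shape documented for WeasyPrint custom fetchers.
    return {"string": raw, "mime_type": mime_type}


def wrap_body(body_html, title="Converted document"):
    """Wrap the HTML fragment produced by mammoth in a standalone page."""
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        f"<title>{html.escape(title)}</title></head>"
        f"<body>{body_html}</body></html>"
    )


def docx_to_pdf(src, dst):
    """Convert a DOCX file to PDF; returns mammoth's conversion messages."""
    # Third-party imports are local so the helpers above stay importable
    # without mammoth or WeasyPrint installed.
    import mammoth
    from weasyprint import HTML

    with open(src, "rb") as docx_file:
        # mammoth inlines images as data: URIs by default.
        result = mammoth.convert_to_html(docx_file)
    page = wrap_body(result.value, title=Path(src).stem)
    HTML(string=page, url_fetcher=deny_url_fetcher).write_pdf(str(dst))
    return result.messages
```

Calling `docx_to_pdf("report.docx", "report.pdf")` would then write the PDF and return mammoth's warnings about elements it could not map. Raising inside the fetcher means a DOCX that smuggles in a reference such as `http://169.254.169.254/` yields a missing resource instead of an outbound request, which is the behaviour the SSRF test named above presumably checks.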