|
2 | 2 | <img src="docs-site/static/img/logo.jpeg" alt="paperjam logo" width="250"> |
3 | 3 | </p> |
4 | 4 |
|
| 5 | +# paperjam |
5 | 6 |
|
6 | | -<p align="center">Fast PDF processing powered by Rust.</p> |
| 7 | +[PyPI](https://pypi.org/project/paperjam/)
| 8 | +[License](LICENSE)
| 9 | +[Python](https://www.python.org/downloads/)
| 10 | + |
| 11 | +Fast document processing powered by Rust. One API. Every document format. |
| 12 | + |
| 13 | +## Supported Formats |
| 14 | + |
| 15 | +| Format | Read | Write | Extract Text | Extract Tables | Convert | |
| 16 | +|--------|------|-------|--------------|----------------|---------| |
| 17 | +| PDF | Yes | Yes | Yes | Yes | Yes | |
| 18 | +| DOCX | Yes | Yes | Yes | Yes | Yes | |
| 19 | +| XLSX | Yes | Yes | Yes | Yes | Yes | |
| 20 | +| PPTX | Yes | Yes | Yes | Yes | Yes | |
| 21 | +| HTML | Yes | Yes | Yes | Yes | Yes | |
| 22 | +| EPUB | Yes | Yes | Yes | - | Yes | |
7 | 23 |
|
8 | 24 | ## Installation |
9 | 25 |
|
10 | 26 | ```bash |
11 | 27 | pip install paperjam |
12 | 28 | ``` |
13 | 29 |
|
| 30 | +CLI tool (Rust): |
| 31 | + |
| 32 | +```bash |
| 33 | +cargo install paperjam-cli |
| 34 | +``` |
| 35 | + |
14 | 36 | ## Quick Start |
15 | 37 |
|
| 38 | +### Open any format |
| 39 | + |
16 | 40 | ```python |
17 | 41 | import paperjam |
18 | 42 |
|
19 | 43 | doc = paperjam.open("report.pdf") |
| 44 | +docx = paperjam.open("document.docx") |
| 45 | +xlsx = paperjam.open("data.xlsx") |
| 46 | +pptx = paperjam.open("slides.pptx") |
| 47 | +``` |
20 | 48 |
|
21 | | -# Extract text |
22 | | -text = doc.pages[0].extract_text() |
| 49 | +### Extract text and tables |
23 | 50 |
|
24 | | -# Extract tables |
25 | | -tables = doc.pages[0].extract_tables() |
| 51 | +```python |
| 52 | +doc = paperjam.open("report.pdf") |
26 | 53 |
|
27 | | -# Convert to Markdown |
| 54 | +text = doc.pages[0].extract_text() |
| 55 | +tables = doc.pages[0].extract_tables() |
28 | 56 | md = doc.to_markdown(layout_aware=True) |
| 57 | +``` |
| 58 | + |
| 59 | +### Convert between formats |
29 | 60 |
|
30 | | -# Async support |
31 | | -doc = await paperjam.aopen("report.pdf") |
32 | | -md = await doc.ato_markdown() |
| 61 | +```python |
| 62 | +paperjam.convert("report.pdf", "report.docx") |
| 63 | +paperjam.convert("data.xlsx", "data.pdf") |
| 64 | +paperjam.convert("page.html", "page.epub") |
| 65 | +``` |
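The one-off conversions above extend naturally to a batch. The following is a minimal sketch, not part of paperjam: `plan_conversions` and `batch_convert` are hypothetical helpers, and the conversion function is passed in as a parameter on the assumption that `paperjam.convert(src, dst)` accepts two path strings as in the snippet above.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def plan_conversions(
    sources: Iterable[str], target_ext: str, out_dir: str
) -> List[Tuple[str, str]]:
    """Pair each source path with an output path in out_dir.

    Hypothetical helper: outputs are named after the source file's stem,
    so two sources with the same stem would map to the same output path.
    """
    ext = target_ext if target_ext.startswith(".") else "." + target_ext
    out = Path(out_dir)
    return [(src, str(out / (Path(src).stem + ext))) for src in sources]


def batch_convert(
    sources: Iterable[str],
    target_ext: str,
    out_dir: str,
    convert: Callable[[str, str], object],
) -> List[Tuple[str, str]]:
    """Call convert(src, dst) for every planned pair and return the pairs.

    `convert` is injected so the helper does not depend on any library;
    passing `paperjam.convert` here is an assumption about its signature.
    """
    pairs = plan_conversions(sources, target_ext, out_dir)
    for src, dst in pairs:
        convert(src, dst)
    return pairs
```

Assuming the signature holds, a call such as `batch_convert(glob.glob("reports/*.pdf"), "docx", "converted", paperjam.convert)` would convert every matching report. Whether the output directory must already exist is not stated in the README.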
| 66 | + |
| 67 | +### Run a pipeline |
| 68 | + |
| 69 | +```yaml |
| 70 | +# pipeline.yaml |
| 71 | +steps:
| 72 | +  - open: "reports/*.pdf"
| 73 | +  - extract_tables:
| 74 | +      strategy: auto
| 75 | +      output: tables.csv
| 76 | +  - convert:
| 77 | +      format: docx
| 78 | +      output: "converted/"
| 79 | +``` |
| 80 | +
|
| 81 | +```bash |
| 82 | +paperjam pipeline run pipeline.yaml |
| 83 | +``` |
| 84 | + |
| 85 | +### CLI usage |
| 86 | + |
| 87 | +```bash |
| 88 | +paperjam extract text report.pdf |
| 89 | +paperjam extract tables data.pdf --format csv |
| 90 | +paperjam convert report.pdf report.docx |
| 91 | +paperjam info document.pdf |
| 92 | +``` |
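The CLI commands above can also be driven from a Python script. This is a minimal sketch using only the standard library; `paperjam_command` and `run_paperjam` are hypothetical wrappers, and the subcommands are taken verbatim from the examples above rather than from a verified CLI reference.

```python
import shutil
import subprocess
from typing import List, Optional


def paperjam_command(*args: str) -> List[str]:
    """Build the argument list for a paperjam CLI invocation."""
    return ["paperjam", *args]


def run_paperjam(*args: str) -> Optional[str]:
    """Run the CLI and return its stdout.

    Returns None when no `paperjam` binary is found on PATH, and raises
    subprocess.CalledProcessError when the command exits with an error.
    """
    if shutil.which("paperjam") is None:
        return None
    result = subprocess.run(
        paperjam_command(*args),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, `run_paperjam("extract", "text", "report.pdf")` would mirror the first command above. Passing arguments as a list avoids shell quoting problems with file names that contain spaces.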
| 93 | + |
| 94 | +### MCP server |
| 95 | + |
| 96 | +Add to your MCP client configuration: |
| 97 | + |
| 98 | +```json |
| 99 | +{
| 100 | +  "mcpServers": {
| 101 | +    "paperjam": {
| 102 | +      "command": "paperjam",
| 103 | +      "args": ["mcp", "serve"]
| 104 | +    }
| 105 | +  }
| 106 | +}
33 | 107 | ``` |
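If the client configuration already lists other servers, the entry can be merged in rather than pasted by hand. A minimal sketch, assuming the client stores its configuration as JSON with a top-level `mcpServers` object as shown above; `add_paperjam_server` and the `other` server entry are hypothetical and used only for illustration.

```python
import json


def add_paperjam_server(config: dict) -> dict:
    """Return a copy of config with the paperjam MCP server entry added.

    Existing entries under "mcpServers" are preserved; the input dict
    is not modified.
    """
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["paperjam"] = {"command": "paperjam", "args": ["mcp", "serve"]}
    merged["mcpServers"] = servers
    return merged


# Illustrative existing configuration with one unrelated server.
existing = {"mcpServers": {"other": {"command": "other-server", "args": []}}}
print(json.dumps(add_paperjam_server(existing), indent=2))
```

The location of the configuration file depends on the MCP client and is not specified in the README.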
34 | 108 |
|
35 | 109 | ## Features |
36 | 110 |
|
37 | | -- **Text extraction** — plain text, positioned lines, spans with font info |
38 | | -- **Table extraction** — lattice and stream strategies with CSV/DataFrame export |
39 | | -- **PDF to Markdown** — layout-aware conversion for LLM/RAG pipelines |
40 | | -- **Page manipulation** — split, merge, reorder, rotate, delete, insert blank pages |
41 | | -- **Search** — full-text search across pages with bounding boxes |
42 | | -- **Metadata & bookmarks** — read and edit document properties and outline |
43 | | -- **Annotations & watermarks** — add, read, remove annotations; text watermarks |
44 | | -- **Forms** — inspect, fill, create, and modify form fields |
45 | | -- **Security** — encryption (AES-128/256, RC4), sanitization, true content-stream redaction |
46 | | -- **PDF diff** — text-level comparison of two documents |
47 | | -- **Layout analysis** — multi-column detection, header/footer identification |
48 | | -- **Native async** — powered by Rust and tokio, no Python thread pools |
49 | | -- **Digital signatures** — sign, verify, and inspect with LTV timestamp support |
50 | | -- **PDF/A** — validation and conversion (XMP, ICC profiles, transparency removal) |
51 | | -- **PDF/UA** — accessibility validation (structure tree, alt text, tagged content) |
52 | | -- **WASM playground** — try it in the browser at [docs.byteveda.org/paperjam](https://docs.byteveda.org/paperjam/) |
| 111 | +- **Multi-format support** -- PDF, DOCX, XLSX, PPTX, HTML, EPUB through one unified API |
| 112 | +- **Text extraction** -- plain text, positioned lines, spans with font info |
| 113 | +- **Table extraction** -- lattice and stream strategies with CSV/DataFrame export |
| 114 | +- **Format conversion** -- convert between any two supported formats
| 115 | +- **Pipeline engine** -- define multi-step document workflows in YAML |
| 116 | +- **MCP server** -- expose document operations as tools for AI agents |
| 117 | +- **PDF manipulation** -- split, merge, reorder, rotate, delete, insert blank pages |
| 118 | +- **Metadata & bookmarks** -- read and edit document properties and outline |
| 119 | +- **Annotations & watermarks** -- add, read, remove annotations; text watermarks |
| 120 | +- **Forms** -- inspect, fill, create, and modify form fields |
| 121 | +- **Security** -- encryption (AES-128/256, RC4), sanitization, true content-stream redaction |
| 122 | +- **Digital signatures** -- sign, verify, and inspect with LTV timestamp support |
| 123 | +- **PDF/A & PDF/UA** -- validation and conversion, accessibility checks |
| 124 | +- **Native async** -- powered by Rust and tokio, no Python thread pools |
| 125 | +- **CLI tool** -- full-featured command-line interface for scripting and automation |
| 126 | +- **WASM playground** -- try it in the browser at [docs.byteveda.org/paperjam](https://docs.byteveda.org/paperjam/) |
53 | 127 |
|
54 | 128 | ## Documentation |
55 | 129 |
|
56 | 130 | Full docs, API reference, and interactive playground at **[docs.byteveda.org/paperjam](https://docs.byteveda.org/paperjam/)**. |
57 | 131 |
|
58 | | -## Changelog |
59 | | - |
60 | | -See [CHANGELOG.md](CHANGELOG.md) for a detailed release history. |
61 | | - |
62 | 132 | ## License |
63 | 133 |
|
64 | 134 | MIT |