Commit 26fbbd3

Multi-format document ecosystem (#42)
* add paperjam-model crate, refactor core to use shared types
* add document format crates: docx, xlsx, pptx, html, epub
* add infra crates: convert, pipeline, cli, mcp, studio
* update python, wasm, and async bindings for ecosystem
* update python API and examples for multi-format support
* update docs: new guides, API docs, and landing page
* update README for ecosystem expansion
* fix(wasm): use pure-Rust compression to avoid libz-sys on wasm32
* fix(tests): update validation level assertions for PDF/A-Xb format
* add 0.2.0 changelog entry for multi-format ecosystem
1 parent 8fc3ec7 commit 26fbbd3

287 files changed

Lines changed: 21554 additions & 1859 deletions

CHANGELOG.md

Lines changed: 37 additions & 0 deletions
@@ -7,6 +7,43 @@ and this project adheres to [Semantic Versioning](https://semver.org/).

 ## [Unreleased]

+## [0.2.0] — 2026-04-04
+
+### Added
+
+- **Multi-format ecosystem**: new crates for DOCX (`paperjam-docx`), XLSX (`paperjam-xlsx`), PPTX (`paperjam-pptx`), HTML (`paperjam-html`), and EPUB (`paperjam-epub`)
+- **Shared model layer**: `paperjam-model` crate with format-agnostic types shared across all crates
+- **Format conversion engine**: `paperjam-convert` crate for converting between document formats
+- **Processing pipelines**: `paperjam-pipeline` crate for YAML-driven multi-step document workflows
+- **CLI tool**: `paperjam-cli` crate with commands for all core operations
+- **MCP server**: `paperjam-mcp` crate exposing document operations over the Model Context Protocol
+- **Studio UI**: `paperjam-studio` — web-based document viewer, converter, and pipeline builder
+- **AnyDocument API**: format-agnostic Python wrapper for non-PDF documents
+- `open()` auto-detects format and returns `Document` (PDF) or `AnyDocument` (other formats)
+- `open_pdf()` for explicit PDF-only usage with strict typing
+- `convert()`, `convert_bytes()`, and `detect_format()` Python functions
+- `Pipeline` Python class for building and running document processing pipelines
+- Async conversion helpers in `paperjam-async`
+- AES-256 encryption support
+- LTV (Long-Term Validation) signature embedding
+- PDF/A conversion (`convert_to_pdf_a`)
+- PDF/UA accessibility validation (`validate_pdf_ua`)
+- WASM bindings for multi-format operations, conversion, and format detection
+
+### Fixed
+
+- WASM: draw black rectangles over redacted text instead of just removing it
+- WASM: serialize `doc_bytes` as `Uint8Array` instead of `Array<number>`
+- WASM: make `owner_password` optional in encrypt binding
+- WASM: cache-bust module URL to prevent stale JS/WASM mismatch
+- WASM: use pure-Rust compression to avoid `libz-sys` failure on `wasm32-unknown-unknown`
+
+### Changed
+
+- Core types (annotations, bookmarks, layout, metadata, etc.) moved from `paperjam-core` to `paperjam-model` and re-exported
+- Validation report `level` field now returns full format string (e.g. `"PDF/A-1b"` instead of `"1b"`)
+- Examples updated to use `open_pdf()` for type-safe PDF operations
+
 ## [0.1.3] — 2026-03-28

 ### Fixed
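
Note on the 0.2.0 "Changed" entry in the hunk above: because the validation report `level` field now carries the full string (`"PDF/A-1b"`) rather than the short form (`"1b"`), code that compares against the short form will stop matching. A minimal sketch of a compatibility shim follows, assuming a caller that may receive either form during migration. `normalize_level` and `PDFA_PREFIX` are hypothetical names for illustration and are not part of paperjam.

```python
PDFA_PREFIX = "PDF/A-"  # assumed prefix, taken from the changelog example


def normalize_level(level: str) -> str:
    """Return the full-format level string.

    Accepts either the pre-0.2.0 short form ("1b") or the 0.2.0 full
    form ("PDF/A-1b") and always returns the full form.
    """
    level = level.strip()
    if level.startswith(PDFA_PREFIX):
        # Already in the new format: pass through unchanged.
        return level
    # Old short form: prepend the prefix.
    return PDFA_PREFIX + level


if __name__ == "__main__":
    print(normalize_level("1b"))
    print(normalize_level("PDF/A-2u"))
```

Comparing `normalize_level(report_level)` against full-form constants lets one assertion work against both 0.1.x and 0.2.0 reports.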

Cargo.toml

Lines changed: 11 additions & 0 deletions
@@ -1,10 +1,21 @@
 [workspace]
 resolver = "2"
 members = [
+    "crates/paperjam-model",
     "crates/paperjam-core",
+    "crates/paperjam-cli",
     "crates/paperjam-async",
     "crates/paperjam-py",
     "crates/paperjam-wasm",
+    "crates/paperjam-docx",
+    "crates/paperjam-xlsx",
+    "crates/paperjam-pptx",
+    "crates/paperjam-convert",
+    "crates/paperjam-html",
+    "crates/paperjam-epub",
+    "crates/paperjam-pipeline",
+    "crates/paperjam-mcp",
+    "crates/paperjam-studio",
 ]

 [workspace.package]

README.md

Lines changed: 99 additions & 29 deletions
@@ -2,63 +2,133 @@
 <img src="docs-site/static/img/logo.jpeg" alt="paperjam logo" width="250">
 </p>

+# paperjam

-<p align="center">Fast PDF processing powered by Rust.</p>
+[![PyPI](https://img.shields.io/pypi/v/paperjam)](https://pypi.org/project/paperjam/)
+[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
+[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
+
+Fast document processing powered by Rust. One API. Every document format.
+
+## Supported Formats
+
+| Format | Read | Write | Extract Text | Extract Tables | Convert |
+|--------|------|-------|--------------|----------------|---------|
+| PDF | Yes | Yes | Yes | Yes | Yes |
+| DOCX | Yes | Yes | Yes | Yes | Yes |
+| XLSX | Yes | Yes | Yes | Yes | Yes |
+| PPTX | Yes | Yes | Yes | Yes | Yes |
+| HTML | Yes | Yes | Yes | Yes | Yes |
+| EPUB | Yes | Yes | Yes | - | Yes |

 ## Installation

 ```bash
 pip install paperjam
 ```

+CLI tool (Rust):
+
+```bash
+cargo install paperjam-cli
+```
+
 ## Quick Start

+### Open any format
+
 ```python
 import paperjam

 doc = paperjam.open("report.pdf")
+docx = paperjam.open("document.docx")
+xlsx = paperjam.open("data.xlsx")
+pptx = paperjam.open("slides.pptx")
+```

-# Extract text
-text = doc.pages[0].extract_text()
+### Extract text and tables

-# Extract tables
-tables = doc.pages[0].extract_tables()
+```python
+doc = paperjam.open("report.pdf")

-# Convert to Markdown
+text = doc.pages[0].extract_text()
+tables = doc.pages[0].extract_tables()
 md = doc.to_markdown(layout_aware=True)
+```
+
+### Convert between formats

-# Async support
-doc = await paperjam.aopen("report.pdf")
-md = await doc.ato_markdown()
+```python
+paperjam.convert("report.pdf", "report.docx")
+paperjam.convert("data.xlsx", "data.pdf")
+paperjam.convert("page.html", "page.epub")
+```
+
+### Run a pipeline
+
+```yaml
+# pipeline.yaml
+steps:
+  - open: "reports/*.pdf"
+  - extract_tables:
+      strategy: auto
+      output: tables.csv
+  - convert:
+      format: docx
+      output: "converted/"
+```
+
+```bash
+paperjam pipeline run pipeline.yaml
+```
+
+### CLI usage
+
+```bash
+paperjam extract text report.pdf
+paperjam extract tables data.pdf --format csv
+paperjam convert report.pdf report.docx
+paperjam info document.pdf
+```
+
+### MCP server
+
+Add to your MCP client configuration:
+
+```json
+{
+  "mcpServers": {
+    "paperjam": {
+      "command": "paperjam",
+      "args": ["mcp", "serve"]
+    }
+  }
+}
 ```

 ## Features

-- **Text extraction** — plain text, positioned lines, spans with font info
-- **Table extraction** — lattice and stream strategies with CSV/DataFrame export
-- **PDF to Markdown** — layout-aware conversion for LLM/RAG pipelines
-- **Page manipulation** — split, merge, reorder, rotate, delete, insert blank pages
-- **Search** — full-text search across pages with bounding boxes
-- **Metadata & bookmarks** — read and edit document properties and outline
-- **Annotations & watermarks** — add, read, remove annotations; text watermarks
-- **Forms** — inspect, fill, create, and modify form fields
-- **Security** — encryption (AES-128/256, RC4), sanitization, true content-stream redaction
-- **PDF diff** — text-level comparison of two documents
-- **Layout analysis** — multi-column detection, header/footer identification
-- **Native async** — powered by Rust and tokio, no Python thread pools
-- **Digital signatures** — sign, verify, and inspect with LTV timestamp support
-- **PDF/A** — validation and conversion (XMP, ICC profiles, transparency removal)
-- **PDF/UA** — accessibility validation (structure tree, alt text, tagged content)
-- **WASM playground** try it in the browser at [docs.byteveda.org/paperjam](https://docs.byteveda.org/paperjam/)
+- **Multi-format support** -- PDF, DOCX, XLSX, PPTX, HTML, EPUB through one unified API
+- **Text extraction** -- plain text, positioned lines, spans with font info
+- **Table extraction** -- lattice and stream strategies with CSV/DataFrame export
+- **Format conversion** -- convert between any supported formats
+- **Pipeline engine** -- define multi-step document workflows in YAML
+- **MCP server** -- expose document operations as tools for AI agents
+- **PDF manipulation** -- split, merge, reorder, rotate, delete, insert blank pages
+- **Metadata & bookmarks** -- read and edit document properties and outline
+- **Annotations & watermarks** -- add, read, remove annotations; text watermarks
+- **Forms** -- inspect, fill, create, and modify form fields
+- **Security** -- encryption (AES-128/256, RC4), sanitization, true content-stream redaction
+- **Digital signatures** -- sign, verify, and inspect with LTV timestamp support
+- **PDF/A & PDF/UA** -- validation and conversion, accessibility checks
+- **Native async** -- powered by Rust and tokio, no Python thread pools
+- **CLI tool** -- full-featured command-line interface for scripting and automation
+- **WASM playground** -- try it in the browser at [docs.byteveda.org/paperjam](https://docs.byteveda.org/paperjam/)

 ## Documentation

 Full docs, API reference, and interactive playground at **[docs.byteveda.org/paperjam](https://docs.byteveda.org/paperjam/)**.

-## Changelog
-
-See [CHANGELOG.md](CHANGELOG.md) for a detailed release history.
-
 ## License

 MIT

crates/paperjam-async/Cargo.toml

Lines changed: 7 additions & 1 deletion
@@ -3,8 +3,14 @@ name = "paperjam-async"
 version.workspace = true
 edition.workspace = true
 license.workspace = true
-description = "Async wrappers for paperjam-core via tokio::spawn_blocking"
+description = "Async wrappers for paperjam operations via tokio::spawn_blocking"

 [dependencies]
 paperjam-core = { path = "../paperjam-core", features = ["render", "signatures", "validation"] }
+paperjam-model = { path = "../paperjam-model" }
+paperjam-convert = { path = "../paperjam-convert", optional = true }
 tokio = { workspace = true }
+
+[features]
+default = ["convert"]
+convert = ["dep:paperjam-convert"]
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+use paperjam_model::format::DocumentFormat;
+
+/// Async version of `paperjam_convert::convert_bytes`.
+pub async fn convert_bytes(
+    input: Vec<u8>,
+    from_format: DocumentFormat,
+    to_format: DocumentFormat,
+) -> Result<Vec<u8>, paperjam_convert::ConvertError> {
+    tokio::task::spawn_blocking(move || {
+        paperjam_convert::convert_bytes(&input, from_format, to_format)
+    })
+    .await
+    .unwrap_or_else(|e| Err(paperjam_convert::ConvertError::Extraction(e.to_string())))
+}
+
+/// Async version of `paperjam_convert::convert` (file-to-file).
+pub async fn convert(
+    input_path: String,
+    output_path: String,
+) -> Result<paperjam_convert::ConvertReport, paperjam_convert::ConvertError> {
+    tokio::task::spawn_blocking(move || {
+        paperjam_convert::convert(
+            std::path::Path::new(&input_path),
+            std::path::Path::new(&output_path),
+        )
+    })
+    .await
+    .unwrap_or_else(|e| Err(paperjam_convert::ConvertError::Extraction(e.to_string())))
+}
+
+/// Async extraction to `IntermediateDoc`.
+pub async fn extract(
+    bytes: Vec<u8>,
+    format: DocumentFormat,
+) -> Result<paperjam_convert::IntermediateDoc, paperjam_convert::ConvertError> {
+    tokio::task::spawn_blocking(move || paperjam_convert::extract::extract(&bytes, format))
+        .await
+        .unwrap_or_else(|e| Err(paperjam_convert::ConvertError::Extraction(e.to_string())))
+}
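
Note on the wrappers above: each one hands the blocking conversion to `tokio::task::spawn_blocking` and folds a failed join into `ConvertError::Extraction`, so callers only ever see one error type. The same shape can be sketched in Python with `asyncio.to_thread`. This is an analogue of the pattern only. It is not how the paperjam Python bindings are implemented, and `ConvertError`, `blocking_convert`, and `convert_bytes` below are stand-ins invented for this sketch.

```python
import asyncio


class ConvertError(Exception):
    """Stand-in domain error (hypothetical, for this sketch only)."""


def blocking_convert(data: bytes, src: str, dst: str) -> bytes:
    """Placeholder for CPU-bound conversion work; it only tags the payload."""
    if src == dst:
        raise ValueError("source and target formats are identical")
    return f"{src}->{dst}:".encode() + data


async def convert_bytes(data: bytes, src: str, dst: str) -> bytes:
    """Run the blocking call off the event loop and unify its failures."""
    try:
        # to_thread runs the function in a worker thread and awaits the result.
        return await asyncio.to_thread(blocking_convert, data, src, dst)
    except Exception as exc:
        # Any worker failure is re-raised as the single domain error type.
        raise ConvertError(str(exc)) from exc


if __name__ == "__main__":
    print(asyncio.run(convert_bytes(b"payload", "pdf", "docx")))
```

Funnelling worker failures into one error type keeps the async signature identical to the synchronous one, which is what the `Result<_, ConvertError>` return types above achieve.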
Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
+use std::sync::Arc;
+
+use paperjam_model::document::DocumentTrait;
+use paperjam_model::metadata::Metadata;
+use paperjam_model::structure::ContentBlock;
+use paperjam_model::table::Table;
+use paperjam_model::text::TextLine;
+
+/// Extract text from any document implementing `DocumentTrait`.
+pub async fn extract_text<D>(doc: Arc<D>) -> Result<String, D::Error>
+where
+    D: DocumentTrait + Send + Sync + 'static,
+    D::Error: Send + 'static,
+{
+    tokio::task::spawn_blocking(move || doc.extract_text())
+        .await
+        .unwrap_or_else(|e| panic!("spawn_blocking panicked: {}", e))
+}
+
+/// Extract text lines from any document implementing `DocumentTrait`.
+pub async fn extract_text_lines<D>(doc: Arc<D>) -> Result<Vec<TextLine>, D::Error>
+where
+    D: DocumentTrait + Send + Sync + 'static,
+    D::Error: Send + 'static,
+{
+    tokio::task::spawn_blocking(move || doc.extract_text_lines())
+        .await
+        .unwrap_or_else(|e| panic!("spawn_blocking panicked: {}", e))
+}
+
+/// Extract tables from any document implementing `DocumentTrait`.
+pub async fn extract_tables<D>(doc: Arc<D>) -> Result<Vec<Table>, D::Error>
+where
+    D: DocumentTrait + Send + Sync + 'static,
+    D::Error: Send + 'static,
+{
+    tokio::task::spawn_blocking(move || doc.extract_tables())
+        .await
+        .unwrap_or_else(|e| panic!("spawn_blocking panicked: {}", e))
+}
+
+/// Extract structure from any document implementing `DocumentTrait`.
+pub async fn extract_structure<D>(doc: Arc<D>) -> Result<Vec<ContentBlock>, D::Error>
+where
+    D: DocumentTrait + Send + Sync + 'static,
+    D::Error: Send + 'static,
+{
+    tokio::task::spawn_blocking(move || doc.extract_structure())
+        .await
+        .unwrap_or_else(|e| panic!("spawn_blocking panicked: {}", e))
+}
+
+/// Extract metadata from any document implementing `DocumentTrait`.
+pub async fn metadata<D>(doc: Arc<D>) -> Result<Metadata, D::Error>
+where
+    D: DocumentTrait + Send + Sync + 'static,
+    D::Error: Send + 'static,
+{
+    tokio::task::spawn_blocking(move || doc.metadata())
+        .await
+        .unwrap_or_else(|e| panic!("spawn_blocking panicked: {}", e))
+}
+
+/// Convert to markdown from any document implementing `DocumentTrait`.
+pub async fn to_markdown<D>(doc: Arc<D>) -> Result<String, D::Error>
+where
+    D: DocumentTrait + Send + Sync + 'static,
+    D::Error: Send + 'static,
+{
+    tokio::task::spawn_blocking(move || doc.to_markdown())
+        .await
+        .unwrap_or_else(|e| panic!("spawn_blocking panicked: {}", e))
+}
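
Note on the helpers above: they are generic over any `D: DocumentTrait`, so one async function serves every format crate that implements the trait. A rough Python analogue of that design uses a structural `Protocol` in place of the trait bound, sketched below for illustration. `DocumentLike`, `FakeDocx`, `FakeSheet`, and this `extract_text` are hypothetical names and do not correspond to paperjam's Python API.

```python
import asyncio
from typing import Protocol


class DocumentLike(Protocol):
    """Structural stand-in for a trait bound: anything with extract_text()."""

    def extract_text(self) -> str: ...


class FakeDocx:
    """Toy document type holding paragraphs (illustration only)."""

    def __init__(self, paragraphs):
        self._paragraphs = list(paragraphs)

    def extract_text(self) -> str:
        return "\n".join(self._paragraphs)


class FakeSheet:
    """Second toy type, unrelated to FakeDocx by inheritance."""

    def __init__(self, rows):
        self._rows = [list(row) for row in rows]

    def extract_text(self) -> str:
        return "\n".join("\t".join(row) for row in self._rows)


async def extract_text(doc: DocumentLike) -> str:
    """One async helper for every type that satisfies DocumentLike."""
    # The blocking method runs in a worker thread, off the event loop.
    return await asyncio.to_thread(doc.extract_text)


if __name__ == "__main__":
    print(asyncio.run(extract_text(FakeDocx(["Title", "Body"]))))
    print(asyncio.run(extract_text(FakeSheet([["a", "b"], ["c", "d"]]))))
```

Neither toy class inherits from `DocumentLike`. Like a Rust trait bound, the protocol only constrains which methods must exist.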

crates/paperjam-async/src/lib.rs

Lines changed: 4 additions & 0 deletions
@@ -1,6 +1,10 @@
 pub mod document;
+pub mod generic;
 pub mod page;

+#[cfg(feature = "convert")]
+pub mod convert;
+
 use paperjam_core::error::PdfError;

 /// Convert a tokio JoinError into a PdfError.
