Commit ddd8909

chore: polish — tooling, rustdoc, CI, changelog (#71)
* chore: add rust-toolchain and justfile for consistent dev tooling

  rust-toolchain.toml pins every contributor and CI invocation to the same stable toolchain with rustfmt, clippy, and the wasm32-unknown-unknown target. Previously CI used dtolnay/rust-toolchain@stable while contributors installed their own; minor version drift between them could produce clippy lint discrepancies at merge time.

  justfile captures the common build / test / lint commands documented in CLAUDE.md as executable recipes. `just` (no args) prints the full list, and the common flows (build, test, check, fmt, clean-all) are one step each so local iteration matches the pre-commit chain.

* chore(async): stop force-enabling signatures/validation on core

  paperjam-async currently only reaches into paperjam_core::render, yet its manifest force-enabled the signatures and validation features on paperjam-core for every consumer. Downstream crates that need those features (paperjam-py does, explicitly) keep working unchanged; lightweight async consumers no longer drag in the x509-parser / cms / rsa / p256 / sha1 / pkcs8 / spki / ureq / rustls / roxmltree tree.

* docs: crate-level rustdoc across the workspace

  Every library crate now has a `//!` summary describing its scope, its entry points, and how it fits into the broader paperjam ecosystem. Uniform style: plain prose, no intra-doc links in crate-level summaries (simpler to maintain, no rustdoc link warnings to manage).

  Also fixes two pre-existing rustdoc warnings uncovered along the way: an `[OPTIONAL]` literal in signature/tsa.rs that rustdoc was parsing as an intra-doc link, and a bare URL in model/annotations.rs flagged for auto-linking.

  The PyO3 `PyDocument` and `PyPage` classes get class-level docs that clarify they are the native layer beneath the pure-Python `paperjam.Document` / `paperjam.Page` wrappers.

  After this commit `cargo doc --workspace --no-deps` produces zero warnings.

* chore(ci): run docs workflow on PRs and install wasm-opt

  The docs workflow previously fired only on pushes to main, so docs regressions (broken wasm builds, Docusaurus compile errors, bad links) were invisible until after merge. Now PRs with matching paths run the full build (without deploying) so problems surface in the PR check run.

  Also installs binaryen, whose wasm-opt binary wasm-pack invokes automatically when present on PATH. Release-mode WASM bundles shrink by 20-30% with no code changes.

  Concurrency group is keyed on ref so PR builds and deploy builds don't cancel each other; the deploy job is skipped on pull_request events to preserve production pages behaviour.

* docs(changelog): record [Unreleased] entries since 0.2.0

  Document the audit-driven work that has landed on main but hasn't been cut into a release yet: the ZIP-entry and MCP sandbox security hardening (#69), the panic-surface cleanup in the PDF engine (#70), the form-bindings stub sync and metadata / docs refresh (#68), plus the tooling, docs, and paperjam-async feature adjustments from this polish branch.

* fix(ci): install pinned binaryen release instead of apt binaryen

  Ubuntu's apt-shipped binaryen is ~v108, which predates the default enablement of bulk-memory and sign-extension instructions in rustc output. The result is wasm-pack invoking /usr/bin/wasm-opt on a valid modern wasm module and wasm-opt rejecting it with "[wasm-validator error] Bulk memory operation (bulk memory is disabled)", observed on the PR #71 run.

  Download and install a pinned binaryen release tarball from the upstream GitHub releases page. version_119 is known-good against the current rustc and supports all default features. Future bumps change one env var.

* chore(ci): verify binaryen tarball checksum and cache across runs

  Harden the binaryen install step that landed in the previous commit:

  - SHA256-pin the downloaded tarball (value verified against a local download of version_119). Guards against upstream tampering or an accidental silent swap.
  - Split the version-check into a dedicated Verify step so the log shows the installed wasm-opt version unambiguously.
  - Wrap the install in actions/cache keyed on the pinned version so subsequent runs skip the download. Saves ~3-5s per run.

* fix(wasm): tell wasm-pack to enable bulk-memory and sign-ext in wasm-opt

  rustc 1.82+ emits bulk-memory and sign-extension instructions in its default wasm output. wasm-pack's baseline wasm-opt invocation ("-O") does not pass --enable-bulk-memory / --enable-sign-ext, so even a modern binaryen rejects the module with "Bulk memory operations require bulk memory [--enable-bulk-memory]" during validation.

  Configure the flags in paperjam-wasm's Cargo.toml metadata block so wasm-pack invokes wasm-opt with the right feature set. This is what was blocking CI #71 even after installing a modern binaryen.

* fix(wasm): extend wasm-opt feature set to the full rustc default list

  Rust 1.87 / LLVM 20 enabled bulk-memory and nontrapping-fptoint in the default wasm32-unknown-unknown feature set, alongside the previously-defaulted multivalue, mutable-globals, reference-types, and sign-ext. wasm-pack's baseline "-O" invocation of wasm-opt does not pass any of them, so the optimiser rejects a perfectly valid rustc-emitted module.

  The previous commit only enabled bulk-memory and sign-ext, which exposed a follow-on validator error on `i32.trunc_sat_f64_s` (nontrapping-fptoint). Rather than re-play whack-a-mole for each feature, pass the full list that matches the rustc default set documented in the wasm32-unknown-unknown platform-support page.

  Ref: https://doc.rust-lang.org/rustc/platform-support/wasm32-unknown-unknown.html

1 parent 749ee14 commit ddd8909

23 files changed

Lines changed: 342 additions & 5 deletions

.github/workflows/docs.yml

Lines changed: 41 additions & 1 deletion
@@ -7,6 +7,12 @@ on:
       - "docs-site/**"
       - "crates/paperjam-wasm/**"
       - ".github/workflows/docs.yml"
+  pull_request:
+    branches: [main]
+    paths:
+      - "docs-site/**"
+      - "crates/paperjam-wasm/**"
+      - ".github/workflows/docs.yml"
   workflow_dispatch:
 
 permissions:
@@ -15,7 +21,7 @@ permissions:
   id-token: write
 
 concurrency:
-  group: pages
+  group: pages-${{ github.ref }}
   cancel-in-progress: false
 
 jobs:
@@ -35,6 +41,38 @@ jobs:
       - name: Install wasm-pack
         run: curl https://rustwasm.github.io/wasm-pack/installer/init.sh -sSf | sh
 
+      # wasm-pack invokes wasm-opt automatically when it is on PATH.
+      # Ubuntu's apt-packaged binaryen (~v108) is too old to validate
+      # modern rustc output — rustc 1.95+ emits bulk-memory and
+      # sign-extension instructions by default, which that wasm-opt
+      # rejects. We install a pinned upstream release, verify its
+      # SHA256, and cache it across runs.
+      - name: Cache binaryen
+        id: cache-binaryen
+        uses: actions/cache@v4
+        env:
+          BINARYEN_VERSION: version_119
+        with:
+          path: /usr/local/bin/wasm-opt
+          key: binaryen-${{ env.BINARYEN_VERSION }}-x86_64-linux
+
+      - name: Install binaryen (wasm-opt)
+        if: steps.cache-binaryen.outputs.cache-hit != 'true'
+        env:
+          BINARYEN_VERSION: version_119
+          BINARYEN_SHA256: 716bcf9f5f36a6f466239fbb09a925eeaf54c46411ccefac979ec649e7c06d2d
+        run: |
+          set -euo pipefail
+          tarball="binaryen-${BINARYEN_VERSION}-x86_64-linux.tar.gz"
+          url="https://github.com/WebAssembly/binaryen/releases/download/${BINARYEN_VERSION}/${tarball}"
+          curl -fsSL "$url" -o "/tmp/${tarball}"
+          echo "${BINARYEN_SHA256}  /tmp/${tarball}" | sha256sum --check --strict
+          tar -xzf "/tmp/${tarball}" -C /tmp
+          sudo install -m 0755 "/tmp/binaryen-${BINARYEN_VERSION}/bin/wasm-opt" /usr/local/bin/wasm-opt
+
+      - name: Verify wasm-opt
+        run: wasm-opt --version
+
       - name: Build WASM
         run: wasm-pack build crates/paperjam-wasm --target web --release --out-dir ../../docs-site/static/wasm
 
@@ -52,12 +90,14 @@ jobs:
         run: cd docs-site && npm run build
 
       - name: Upload Pages artifact
+        if: github.event_name == 'push' || github.event_name == 'workflow_dispatch'
         uses: actions/upload-pages-artifact@v5
         with:
           path: docs-site/build
 
   deploy:
     needs: build
+    if: github.event_name == 'push' || github.event_name == 'workflow_dispatch'
     runs-on: ubuntu-latest
     environment:
       name: github-pages

CHANGELOG.md

Lines changed: 58 additions & 0 deletions
@@ -7,6 +7,64 @@ and this project adheres to [Semantic Versioning](https://semver.org/).
 
 ## [Unreleased]
 
+### Security
+
+- Bound ZIP entry reads in EPUB, PPTX, and DOCX parsers. A crafted archive
+  declaring a tiny compressed size could previously expand to multi-GB on
+  decompression; entries are now rejected when the declared or observed
+  decompressed size exceeds a per-entry cap.
+- Cap `Vec::with_capacity` preallocations in XLSX sheet parsing and PPTX
+  slide parsing at reasonable ceilings so attacker-controlled counts can
+  no longer trigger large allocations up front.
+- `paperjam-mcp`: resolved paths are now sandboxed to the configured
+  working directory by default. Absolute paths and `..` traversal that
+  escape the working dir are rejected with a structured error. Operators
+  can opt out with `--allow-absolute-paths` (or
+  `ServerConfig::allow_absolute_paths`).
+
+### Fixed
+
+- Replace panic-prone `f64::partial_cmp(..).unwrap()` in table detection
+  (`table/{grid,lattice,stream}.rs`) with `total_cmp`, so malformed PDFs
+  producing NaN coordinates no longer crash the parser.
+- Replace `get_object_mut().unwrap()` / `as_dict_mut().unwrap()` /
+  `from_utf8().unwrap()` across the stamp, watermark, bookmarks, and
+  PDF/UA validation modules with structured `PdfError` returns. Malformed
+  PDFs now surface typed errors instead of panicking the process.
+- Stub drift: add `modify_form_field`, `add_form_field`, and the
+  `fill_form.generate_appearances` parameter to `_paperjam.pyi` so mypy
+  sees the full PyO3 surface.
+
+### Added
+
+- Crate-level `//!` rustdoc summaries on every workspace crate.
+- `rust-toolchain.toml` pins the contributor toolchain to stable with
+  `rustfmt`, `clippy`, and the `wasm32-unknown-unknown` target.
+- `justfile` with shortcuts for common build / test / lint tasks.
+- `[profile.release]` with thin LTO, `codegen-units = 1`, and symbol
+  strip. Adds a `release-with-debug` profile for profiling.
+
+### Changed
+
+- `paperjam-async` no longer force-enables `signatures` and `validation`
+  on `paperjam-core`. Consumers that need them (e.g. `paperjam-py`)
+  continue to enable them explicitly; lightweight async users no longer
+  drag in the full signing / validation stack.
+- Docs site CI now builds on pull requests (without deploying) so docs
+  regressions are caught pre-merge. Binaryen's `wasm-opt` is installed
+  so release WASM bundles are size-optimized.
+
+### Docs
+
+- README: CLI examples now use the correct `pj` binary name and accurate
+  flags; removed the nonexistent `extract tables --format csv` flag.
+- `docs-site/docs/getting-started/installation.md`: replace leftover
+  Sphinx build instructions with the Docusaurus workflow, fix the
+  clone org, expand the feature-flag table.
+- `pyproject.toml`: fill in multi-format description, `readme`,
+  `project.urls`, and extra classifiers/keywords so the PyPI page is
+  populated. Drop the stale Sphinx `[docs]` extra.
+
 ## [0.2.0] — 2026-04-04
 
 ### Added
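
The `total_cmp` entry under Fixed can be illustrated with a minimal sketch. This is not the code from `table/{grid,lattice,stream}.rs`: `sort_coords` is a hypothetical helper, and the only assumption is that coordinates are plain `f64` values. `f64::partial_cmp` returns `None` when either operand is NaN, so unwrapping it inside a sort comparator panics, whereas `f64::total_cmp` defines a total order that includes NaN.

```rust
/// Hypothetical helper (not from paperjam): sort coordinates without
/// panicking when a malformed PDF yields NaN values.
fn sort_coords(mut coords: Vec<f64>) -> Vec<f64> {
    // Old pattern: coords.sort_by(|a, b| a.partial_cmp(b).unwrap());
    // That unwrap panics as soon as a NaN is compared.
    coords.sort_by(|a, b| a.total_cmp(b));
    coords
}
```

With `total_cmp`, NaN values sort to one end of the vector (which end depends on the NaN's sign bit) and the finite values keep their numeric order.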

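The `paperjam-mcp` sandboxing entry under Security can be sketched in a similar way. The function below is a hypothetical illustration, not the crate's implementation: `resolve_in_sandbox`, its `String` error type, and the purely lexical handling of `..` are all assumptions, and it models neither symlink resolution nor the `--allow-absolute-paths` opt-out.

```rust
use std::path::{Component, Path, PathBuf};

/// Hypothetical sketch (not paperjam-mcp's code): resolve `requested`
/// against `root` lexically and reject anything that would escape it.
fn resolve_in_sandbox(root: &Path, requested: &Path) -> Result<PathBuf, String> {
    // Absolute paths are only acceptable when they already point inside root.
    let relative = if requested.is_absolute() {
        requested
            .strip_prefix(root)
            .map_err(|_| format!("path escapes sandbox: {}", requested.display()))?
    } else {
        requested
    };

    let mut resolved = root.to_path_buf();
    let mut depth = 0usize; // number of components currently below root
    for component in relative.components() {
        match component {
            Component::Normal(part) => {
                resolved.push(part);
                depth += 1;
            }
            Component::CurDir => {}
            Component::ParentDir => {
                // Refuse to climb above the sandbox root.
                if depth == 0 {
                    return Err(format!("path escapes sandbox: {}", requested.display()));
                }
                resolved.pop();
                depth -= 1;
            }
            Component::RootDir | Component::Prefix(_) => {
                return Err(format!("unexpected absolute component: {}", requested.display()));
            }
        }
    }
    Ok(resolved)
}
```

Under these assumptions, `a/../b.pdf` resolves to a path inside the root, while `../etc/passwd` and `/etc/passwd` both return an error.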
crates/paperjam-async/Cargo.toml

Lines changed: 6 additions & 1 deletion
@@ -6,7 +6,12 @@ license.workspace = true
 description = "Async wrappers for paperjam operations via tokio::spawn_blocking"
 
 [dependencies]
-paperjam-core = { path = "../paperjam-core", features = ["render", "signatures", "validation"] }
+# paperjam-async currently only uses `paperjam_core::render`, so we enable
+# that feature here but leave `signatures`/`validation` to be opted in by
+# the downstream crate that actually needs them (paperjam-py does this
+# explicitly). Keeps the async surface lightweight for callers that only
+# want basic document operations.
+paperjam-core = { path = "../paperjam-core", features = ["render"] }
 paperjam-model = { path = "../paperjam-model" }
 paperjam-convert = { path = "../paperjam-convert", optional = true }
 tokio = { workspace = true }

crates/paperjam-async/src/lib.rs

Lines changed: 8 additions & 0 deletions
@@ -1,3 +1,11 @@
+//! Tokio-native async wrappers around paperjam's blocking operations.
+//!
+//! Each heavy operation (`open`, `save`, `render`, `to_markdown`,
+//! `merge`, `redact_text`, ...) is re-exposed as an `async fn` that runs
+//! the blocking work on `tokio::spawn_blocking`. This is what powers the
+//! `paperjam.aopen` / `paperjam.arender_*` / `paperjam.amerge` helpers on
+//! the Python side.
+
 pub mod document;
 pub mod generic;
 pub mod page;

crates/paperjam-convert/src/lib.rs

Lines changed: 8 additions & 0 deletions
@@ -1,3 +1,11 @@
+//! Cross-format document conversion.
+//!
+//! Orchestrates conversion between every pair of formats supported by the
+//! paperjam workspace (PDF, DOCX, XLSX, PPTX, HTML, EPUB, Markdown). Each
+//! format crate is an optional dependency so consumers only pay for the
+//! formats they want; features named after the source and target crates
+//! gate those conversions in and out.
+
 pub mod convert;
 pub mod detect;
 pub mod error;

crates/paperjam-core/src/lib.rs

Lines changed: 20 additions & 0 deletions
@@ -1,3 +1,23 @@
+//! Pure-Rust PDF engine: parsing, text and table extraction, page
+//! manipulation, form fields, digital signatures, encryption, rendering,
+//! and PDF/A / PDF/UA validation.
+//!
+//! `paperjam-core` is the PDF-specific implementation behind the
+//! `paperjam` library. Non-PDF formats live in sibling crates
+//! (`paperjam-docx`, `paperjam-xlsx`, `paperjam-pptx`, `paperjam-html`,
+//! `paperjam-epub`); cross-format operations go through `paperjam-convert`.
+//!
+//! Heavy optional pieces are feature-gated:
+//!
+//! | Feature | Enables |
+//! |--------------|----------------------------------------------------------|
+//! | `render` | page-to-image rasterisation via pdfium |
+//! | `signatures` | sign / verify / inspect digital signatures |
+//! | `ltv` | long-term validation (TSA, OCSP, CRL embedding) |
+//! | `validation` | PDF/A and PDF/UA conformance checks |
+//! | `parallel` | rayon-based parallel processing (default on) |
+//! | `mmap` | memory-mapped file access for large documents |
+
 pub mod annotations;
 pub mod bookmarks;
 #[cfg(feature = "validation")]

crates/paperjam-core/src/signature/tsa.rs

Lines changed: 1 addition & 1 deletion
@@ -54,7 +54,7 @@ pub fn build_timestamp_request(signature_value: &[u8]) -> Result<Vec<u8>> {
 
 /// Parse an RFC 3161 timestamp response and extract the TimeStampToken.
 ///
-/// The response is: SEQUENCE { status PKIStatusInfo, timeStampToken [OPTIONAL] }
+/// The response is: `SEQUENCE { status PKIStatusInfo, timeStampToken [OPTIONAL] }`
 /// We check the status integer and return the token bytes.
 pub fn parse_timestamp_response(resp_bytes: &[u8]) -> Result<Vec<u8>> {
     // Minimal DER parsing: skip the outer SEQUENCE, read PKIStatusInfo, extract token

crates/paperjam-docx/src/lib.rs

Lines changed: 8 additions & 0 deletions
@@ -1,3 +1,11 @@
+//! DOCX (Office Open XML word-processing) support for the paperjam
+//! ecosystem.
+//!
+//! Reads and writes `.docx` files and exposes text, tables, and metadata
+//! through the `DocumentTrait` implementation on `DocxDocument`. Body
+//! parsing is delegated to `docx-rs`; an internal size-capped ZIP reader
+//! handles the metadata parts the upstream API does not expose.
+
 mod document;
 mod error;
 mod image;
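
The size-capped ZIP reads mentioned in this crate summary (and in the changelog's Security section) can be sketched as follows. This is an illustrative assumption rather than the crate's internal reader: `read_capped`, its parameters, and the error messages are hypothetical, and a plain `std::io::Read` stands in for a decompressing ZIP entry stream.

```rust
use std::io::{self, Read};

/// Hypothetical sketch (not paperjam's reader): read an archive entry,
/// rejecting it when the declared or the observed size exceeds `cap`.
fn read_capped<R: Read>(reader: R, declared_size: u64, cap: u64) -> io::Result<Vec<u8>> {
    // Cheap check first: trust nothing, but reject obvious offenders early.
    if declared_size > cap {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "declared entry size exceeds cap",
        ));
    }
    let mut buf = Vec::new();
    // Read at most cap + 1 bytes so a stream that lies about its size
    // is detected without ever buffering more than one byte past the cap.
    reader.take(cap.saturating_add(1)).read_to_end(&mut buf)?;
    if buf.len() as u64 > cap {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "observed entry size exceeds cap",
        ));
    }
    Ok(buf)
}
```

Checking both sizes matters because the declared size comes from attacker-controlled archive metadata, so only the observed byte count bounds what decompression actually produces.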

crates/paperjam-epub/src/lib.rs

Lines changed: 9 additions & 0 deletions
@@ -1,3 +1,12 @@
+//! EPUB document support for the paperjam ecosystem.
+//!
+//! Parses EPUB archives (container.xml → OPF → spine) and exposes each
+//! chapter as an `HtmlDocument`, delegating per-chapter rendering to
+//! `paperjam-html`. Implements `DocumentTrait` so EPUB files participate
+//! in the shared model (chapter → page).
+//!
+//! Archive reads are size-capped internally to mitigate zip-bomb attacks.
+
 mod document;
 mod error;
 mod image;

crates/paperjam-html/src/lib.rs

Lines changed: 7 additions & 0 deletions
@@ -1,3 +1,10 @@
+//! HTML document support for the paperjam ecosystem.
+//!
+//! Parses HTML bytes via `scraper`, extracts text and tables, and
+//! implements `DocumentTrait` so HTML documents share the same API
+//! surface as the office formats. Also used by `paperjam-epub` for
+//! chapter content (EPUB spine entries are XHTML).
+
 mod document;
 mod error;
 pub mod image;
