EdgeParse — Project Overview

"Code is law" — every statement here is traceable to a source file.

EdgeParse is a high-performance PDF-to-structured-data extraction engine written in Rust. It converts raw PDF byte streams into semantically rich, machine-readable output (JSON, Markdown, HTML, plain text) using a deterministic 20-stage processing pipeline.


Repository Layout

edgepdf/
├── Cargo.toml                          ← Workspace root: 5 crates
│
├── crates/
│   ├── pdf-cos/                        ← Low-level PDF object model (fork of lopdf 0.39.0)
│   ├── edgeparse-core/                 ← The extraction engine (all business logic)
│   │   ├── src/
│   │   │   ├── lib.rs                  ← Public API: convert()
│   │   │   ├── api/                    ← Config, filter, batch
│   │   │   ├── models/                 ← Type system: BoundingBox → ContentElement
│   │   │   ├── pdf/                    ← PDF loading & chunk extraction
│   │   │   ├── pipeline/               ← 20-stage orchestrator + stages
│   │   │   ├── output/                 ← Renderers: JSON, MD, HTML, text
│   │   │   ├── tagged/                 ← Tagged-PDF structure tree
│   │   │   └── utils/                  ← XY-Cut, layout analysis, sanitizer
│   │   └── tests/
│   ├── edgeparse-cli/                  ← Binary: edgeparse <input> [opts]
│   ├── edgeparse-python/               ← PyO3 extension: edgeparse.convert()
│   └── edgeparse-node/                 ← NAPI-RS extension: edgeparse.convert()
│
├── sdks/
│   ├── python/                         ← Python package (wraps edgeparse-python)
│   └── node/                           ← Node.js package (wraps edgeparse-node)
│
├── benchmark/                          ← Accuracy benchmark suite (Python)
├── benches/                            ← Rust micro-benchmarks (criterion)
└── tests/                              ← Integration test fixtures

Workspace definition: Cargo.toml


Ten-Second Mental Model

                      ┌────────────────────────────────────────────────────────┐
  PDF File            │                   edgeparse-core                        │
  ─────────           │                                                          │
   bytes ──▶ pdf-cos  │  chunk       pipeline        output                    │
             (lopdf)  │  parser ──▶ orchestrator ──▶ renderers ──▶ String/File │
                      │  (Stage 0)  (Stages 1–19)   (Stage 20)                 │
                      └────────────────────────────────────────────────────────┘
                                          │
                              ┌───────────┼───────────┐
                              │           │           │
                           CLI        Python SDK   Node.js SDK

The engine is pure Rust and has zero Python/Node runtime dependencies at parse time. The Python and Node.js SDKs are thin FFI wrappers — all logic lives in edgeparse-core.
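
The flow in the diagram can be sketched as a list of stages that all borrow one mutable state, followed by a renderer. This is a minimal illustration, not the engine's code: `Chunk`, `PipelineState`, `Stage`, `drop_blank`, `trim_text`, `run_pipeline`, and `render_text` are hypothetical names chosen for the example, and the two stages stand in for the real ones.

```rust
// Illustrative sketch only: none of these names come from edgeparse-core.

/// A raw text fragment, as a chunk parser might emit it.
#[derive(Debug, Clone, PartialEq)]
pub struct Chunk {
    pub page: usize,
    pub text: String,
}

/// State shared by all stages; it is mutated in place rather than cloned.
#[derive(Debug, Default)]
pub struct PipelineState {
    pub chunks: Vec<Chunk>,
}

/// A stage borrows the state mutably and reports failure as a message.
pub type Stage = fn(&mut PipelineState) -> Result<(), String>;

/// Hypothetical stage: drop chunks that contain only whitespace.
pub fn drop_blank(state: &mut PipelineState) -> Result<(), String> {
    state.chunks.retain(|c| !c.text.trim().is_empty());
    Ok(())
}

/// Hypothetical stage: trim surrounding whitespace from each chunk.
pub fn trim_text(state: &mut PipelineState) -> Result<(), String> {
    for chunk in &mut state.chunks {
        chunk.text = chunk.text.trim().to_string();
    }
    Ok(())
}

/// Runs the stages in order. On failure, returns the 1-based stage number
/// together with the stage's message.
pub fn run_pipeline(
    state: &mut PipelineState,
    stages: &[Stage],
) -> Result<(), (usize, String)> {
    for (index, stage) in stages.iter().enumerate() {
        stage(state).map_err(|msg| (index + 1, msg))?;
    }
    Ok(())
}

/// Final phase: render the processed chunks as plain text, one per line.
pub fn render_text(state: &PipelineState) -> String {
    state
        .chunks
        .iter()
        .map(|c| c.text.as_str())
        .collect::<Vec<_>>()
        .join("\n")
}
```

A caller would fill a `PipelineState` with parsed chunks, call `run_pipeline(&mut state, &stages)` with a `[Stage; N]` array, and pass the state to `render_text`. Because every stage takes `&mut PipelineState`, no page data is copied between stages.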


Public Contracts

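| Layer    | Entry point                 | Return type                        |
|----------|-----------------------------|------------------------------------|
| Rust API | edgeparse_core::convert()   | Result<PdfDocument, EdgePdfError>  |
| CLI      | edgeparse-cli/src/main.rs   | Exit code + files on disk          |
| Python   | edgeparse.convert()         | str                                |
| Node.js  | convert()                   | string                             |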

Key Design Decisions

| Decision                 | Rationale                                                    | Code location    |
|--------------------------|--------------------------------------------------------------|------------------|
| Single-pass chunk parser | Avoids multiple content-stream decodes                       | chunk_parser.rs  |
| Rayon parallel per-page  | Pages are independent; saturates CPU cores                   | parallel.rs      |
| Mutable PipelineState    | Avoids cloning large page vectors between stages             | orchestrator.rs  |
| ContentElement enum      | Single type models all abstraction levels (raw→semantic)     | content.rs       |
| pdf-cos (lopdf fork)     | Custom encryption/CMaps without upstream release dependency  | crates/pdf-cos/  |
| XY-Cut++ reading order   | Handles multi-column layouts without ML dependency           | xycut.rs         |
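
To make the reading-order decision concrete, here is a minimal sketch of the classic recursive XY-cut idea: project the boxes onto each axis, split at the widest empty gap, and recurse. It is not the XY-Cut++ variant in xycut.rs and makes no claim about that file's contents. `Rect`, `widest_gap`, `cut`, and `reading_order` are hypothetical names, and the sketch assumes a top-left origin with y growing downward.

```rust
// Simplified classic XY-cut, for illustration only (not XY-Cut++).

/// Axis-aligned box; assumes a top-left origin with y growing downward.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Rect {
    pub left: f64,
    pub top: f64,
    pub right: f64,
    pub bottom: f64,
}

/// Finds the widest empty gap in a set of 1-D spans.
/// Returns (gap_width, cut_position), or None if the spans leave no gap.
fn widest_gap(mut spans: Vec<(f64, f64)>) -> Option<(f64, f64)> {
    spans.sort_by(|a, b| a.0.total_cmp(&b.0));
    let mut best: Option<(f64, f64)> = None;
    let mut covered_to = spans.first()?.1;
    for &(start, end) in &spans[1..] {
        let gap = start - covered_to;
        if gap > 0.0 && best.map_or(true, |(width, _)| gap > width) {
            best = Some((gap, covered_to + gap / 2.0));
        }
        covered_to = covered_to.max(end);
    }
    best
}

/// Recursively splits `ids` at the widest gap on either axis.
fn cut(boxes: &[Rect], mut ids: Vec<usize>, out: &mut Vec<usize>) {
    if ids.len() <= 1 {
        out.extend(ids);
        return;
    }
    let x_gap = widest_gap(ids.iter().map(|&i| (boxes[i].left, boxes[i].right)).collect());
    let y_gap = widest_gap(ids.iter().map(|&i| (boxes[i].top, boxes[i].bottom)).collect());
    // Prefer the axis with the wider gap; `true` means a vertical cut line.
    let split = match (x_gap, y_gap) {
        (Some((xw, x)), Some((yw, _))) if xw >= yw => Some((true, x)),
        (Some((_, x)), None) => Some((true, x)),
        (_, Some((_, y))) => Some((false, y)),
        (None, None) => None,
    };
    match split {
        Some((vertical, pos)) => {
            // Left-of-cut (or above-cut) boxes are read first.
            let (first, second): (Vec<usize>, Vec<usize>) = ids.into_iter().partition(|&i| {
                if vertical { boxes[i].right <= pos } else { boxes[i].bottom <= pos }
            });
            cut(boxes, first, out);
            cut(boxes, second, out);
        }
        None => {
            // No gap left: fall back to top-to-bottom, then left-to-right.
            ids.sort_by(|&a, &b| {
                boxes[a].top.total_cmp(&boxes[b].top)
                    .then(boxes[a].left.total_cmp(&boxes[b].left))
            });
            out.extend(ids);
        }
    }
}

/// Returns indices into `boxes` in estimated reading order.
pub fn reading_order(boxes: &[Rect]) -> Vec<usize> {
    let mut order = Vec::with_capacity(boxes.len());
    cut(boxes, (0..boxes.len()).collect(), &mut order);
    order
}
```

On a page with a full-width title above two columns, the first cut is horizontal (only the y-axis has a gap), the second is vertical along the gutter, and each column is then split line by line. This yields title, left column, right column, using geometry alone.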

Error Architecture

EdgePdfError  (lib.rs)
├── LoadError(String)          ← pdf::loader fails to open/decrypt
├── PipelineError{stage, msg}  ← any stage returns Err
├── OutputError(String)        ← renderer write failure
├── IoError(std::io::Error)    ← file I/O (#[from])
├── ConfigError(String)        ← invalid config values
└── LopdfError(String)         ← pdf-cos object errors (#[from] lopdf::Error)

Source: crates/edgeparse-core/src/lib.rs#L124
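
As a rough illustration of how such an enum behaves with the `?` operator, the sketch below hand-writes the variants above using only the standard library. The `#[from]` annotations in the tree suggest a derive macro in the real code, so treat this as an equivalent under assumptions rather than the contents of lib.rs: the field types of `PipelineError`, the display messages, the `CosError` stand-in for the pdf-cos error type, and the `read_pdf_bytes` helper are all hypothetical.

```rust
use std::fmt;
use std::io;

/// Hypothetical stand-in for the object-level error type of pdf-cos.
#[derive(Debug)]
pub struct CosError(pub String);

/// Hand-written mirror of the variant list above (field types assumed).
#[derive(Debug)]
pub enum EdgePdfError {
    LoadError(String),
    PipelineError { stage: u32, msg: String },
    OutputError(String),
    IoError(io::Error),
    ConfigError(String),
    LopdfError(String),
}

impl fmt::Display for EdgePdfError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Message wording is invented for this example.
        match self {
            EdgePdfError::LoadError(msg) => write!(f, "failed to load PDF: {msg}"),
            EdgePdfError::PipelineError { stage, msg } => {
                write!(f, "pipeline stage {stage} failed: {msg}")
            }
            EdgePdfError::OutputError(msg) => write!(f, "output error: {msg}"),
            EdgePdfError::IoError(err) => write!(f, "I/O error: {err}"),
            EdgePdfError::ConfigError(msg) => write!(f, "invalid configuration: {msg}"),
            EdgePdfError::LopdfError(msg) => write!(f, "PDF object error: {msg}"),
        }
    }
}

impl std::error::Error for EdgePdfError {
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        match self {
            // Only the I/O variant wraps an underlying error value.
            EdgePdfError::IoError(err) => Some(err),
            _ => None,
        }
    }
}

/// Lets `?` turn an `io::Error` into `EdgePdfError::IoError`.
impl From<io::Error> for EdgePdfError {
    fn from(err: io::Error) -> Self {
        EdgePdfError::IoError(err)
    }
}

/// Flattens the object-level error into a message string.
impl From<CosError> for EdgePdfError {
    fn from(err: CosError) -> Self {
        EdgePdfError::LopdfError(err.0)
    }
}

/// Hypothetical helper: `?` converts the `io::Error` automatically.
pub fn read_pdf_bytes(path: &str) -> Result<Vec<u8>, EdgePdfError> {
    Ok(std::fs::read(path)?)
}
```

With the `From` impls in place, callers can propagate file and object errors with `?` and still match on a single error type at the API boundary.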


Cross-Reference to Other Docs

| Document              | What you'll learn                                          |
|-----------------------|------------------------------------------------------------|
| 01-architecture.md    | Crate dependency graph, module structure, threading model  |
| 02-pipeline.md        | All 20 stages with inputs/outputs and code pointers        |
| 03-data-model.md      | Full type hierarchy from BoundingBox to PdfDocument        |
| 04-pdf-extraction.md  | Content stream parsing, font resolution, graphics state    |
| 05-output-formats.md  | Renderers: JSON, Markdown, HTML, text                      |
| 06-sdk-integration.md | CLI flags, Python API, Node.js API                         |