What already exists (do not re-implement)
The MCP contract, request shape, and pipeline status tracking are complete:
| Symbol | Location | State |
|---|---|---|
| proxy_docling_ingest_pdf MCP tool | crates/ledgerr-mcp/src/bin/ledgerr-mcp-server.rs:132 | Registered, routes to handle_ingest_pdf() |
| handle_ingest_pdf<T>() | crates/ledgerr-mcp/src/mcp_adapter.rs:999 | Implemented; calls ingest_statement_rows() with pre-parsed extracted_rows |
| IngestPdfRequest | crates/ledgerr-mcp/src/lib.rs:91 | { pdf_path: String, journal_path, workbook_path, ontology_path, raw_context_bytes, extracted_rows: Vec<TransactionInput> } |
| docling_ready: bool | mcp_adapter.rs:193 | In get_pipeline_status(); hardcoded true at call site bin/ledgerr-mcp-server.rs:130 |
| DocumentChunk | crates/ledger-core/src/rule_registry.rs:122 | NDJSON sidecar output type: { node_id, text, parent_id, semantic_id, anchors: Vec<[u32;2]> } |
| IngestStatementOp | crates/ledger-core/src/ledger_ops.rs:169 | Handles CSV/XLSX via calamine; PDF branch missing, returns error on .pdf input |
| test_ingest_statement_via_pdf_sidecar | crates/ledger-core/src/integration_tests.rs:73 | #[ignore]; contract written, awaiting implementation |
| ProcessSurface + Requirement::BinaryOnPath | crates/b00t-iface/src/core/surface.rs:14–66 | Surface lifecycle trait with typed requirement declarations |
| HandshakeSurface / HandshakeDocument | crates/b00t-iface/src/handshake/mod.rs:63 | Writes _b00t_/handshake/l3dg3rr.json; surfaces: Vec<String> field carries capability advertisement |
What needs to be built
1. IngestStatementOp::execute() — PDF branch (ledger-core/src/ledger_ops.rs)
Currently execute() opens input_path via calamine::open_workbook_auto(), which cannot read PDFs, so .pdf input fails with an error. Add a PDF branch before the calamine block:
    if matches!(doc_type, DocType::Pdf) {
        return ingest_pdf_via_docling(input_path, &account_id, ctx);
    }
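The branch above presumes doc_type has already been resolved to DocType::Pdf. If that resolution is driven by the file extension, a case-insensitive check along these lines might be enough. This is only a sketch: has_pdf_extension is a hypothetical helper, and the real DocType detection in ledger-core may already cover this.

```rust
use std::path::Path;

/// Returns true when the path ends in `.pdf`, ignoring ASCII case.
/// Hypothetical helper: not part of the existing ledger-core API.
fn has_pdf_extension(path: &Path) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str()) // non-UTF-8 extensions count as "not PDF"
        .map(|ext| ext.eq_ignore_ascii_case("pdf"))
        .unwrap_or(false)
}
```

An extension check says nothing about the file contents, so the Docling call still has to handle files that are named .pdf but are not valid PDFs.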
ingest_pdf_via_docling(path, account_id, ctx) must:
- Check which::which("docling").is_ok(); if the binary is absent, return LedgerOpError::MissingDependency("docling not on PATH") rather than panicking.
- Spawn: std::process::Command::new("docling").args(["convert", "--to", "json", path]).output()
- Deserialize stdout as DoclingDocument (see schema below).
- Map DoclingDocument.tables[*].data.grid rows → TransactionInput { account_id, date, amount, description, source_ref }. amount must be parsed with rust_decimal::Decimal::from_str(), never via f64.
- Compute a Blake3 content-hash ID per row: blake3(account_id + date + amount_str + description).
- Return OperationResult::success("ingest-statement", rows.len()).
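One detail of the hash step may deserve care: if the four fields are concatenated with nothing between them, two different rows can produce the same byte string before hashing ("ab" + "c" equals "a" + "bc"). The sketch below builds the hash input with a separator byte. The separator (ASCII unit separator, 0x1F) and the helper name row_id_preimage are assumptions, not something this issue specifies. If the existing CSV/XLSX path already fixes a preimage format, that format should take precedence so IDs stay comparable across sources.

```rust
/// Builds the byte string that would be passed to the content hash.
/// ASSUMPTION: fields are joined with the ASCII unit separator (0x1F);
/// the issue only says `account_id + date + amount_str + description`.
fn row_id_preimage(
    account_id: &str,
    date: &str,
    amount_str: &str,
    description: &str,
) -> Vec<u8> {
    const SEP: u8 = 0x1F;
    let fields = [account_id, date, amount_str, description];
    let mut buf = Vec::new();
    for (i, field) in fields.iter().enumerate() {
        if i > 0 {
            buf.push(SEP); // separator goes between fields, not after the last
        }
        buf.extend_from_slice(field.as_bytes());
    }
    buf
}
```

The resulting bytes would then go to the Blake3 hasher. Using the amount's canonical string form (amount_str) rather than a float keeps the ID stable across runs.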
2. DoclingDocument deserialization target (ledger-core/src/ingest.rs or new ledger-core/src/docling.rs)
Docling 2.78.0 JSON schema (relevant subset):
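    use serde::Deserialize;

    /// Top-level Docling output; only the tables are needed here.
    #[derive(Debug, Deserialize)]
    pub struct DoclingDocument {
        pub tables: Vec<DoclingTable>,
    }

    #[derive(Debug, Deserialize)]
    pub struct DoclingTable {
        pub data: DoclingTableData,
    }

    /// Row-major grid of cells; grid[0] is treated as the header row.
    #[derive(Debug, Deserialize)]
    pub struct DoclingTableData {
        pub grid: Vec<Vec<DoclingCell>>,
    }

    #[derive(Debug, Deserialize)]
    pub struct DoclingCell {
        pub text: String,
        // Falls back to 0 when the field is absent from the JSON.
        #[serde(default)]
        pub col_span: u32,
        // Falls back to 0 when the field is absent from the JSON.
        #[serde(default)]
        pub row_span: u32,
    }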
Column heuristic: first row of grid is the header. Map headers to date, amount, description via DocumentShape::column_map (same path as the XLSX branch at ledger_ops.rs:257–271).
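To make the header heuristic concrete, here is a minimal sketch of resolving the three column positions from the first grid row. ColumnIndices and map_header are hypothetical stand-ins; the real mapping should go through DocumentShape::column_map, whose signature is not shown in this issue. Matching by case-insensitive substring is likewise an assumption.

```rust
/// Column positions for the three fields the spec asks for.
/// Hypothetical stand-in for whatever `DocumentShape::column_map` yields.
#[derive(Debug, PartialEq)]
struct ColumnIndices {
    date: usize,
    amount: usize,
    description: usize,
}

/// Scans the header row and returns the index of the date, amount and
/// description columns, or None if any of the three is missing.
/// ASSUMPTION: headers match by case-insensitive substring.
fn map_header(header: &[&str]) -> Option<ColumnIndices> {
    let find = |needle: &str| {
        header
            .iter()
            .position(|h| h.trim().to_lowercase().contains(needle))
    };
    Some(ColumnIndices {
        date: find("date")?,
        amount: find("amount")?,
        description: find("description")?,
    })
}
```

Returning None for an unrecognized header lets the caller skip tables that are not transaction listings (for example, an account summary box) instead of producing rows with guessed columns.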
3. docling_ready real probe (ledgerr-mcp/src/bin/ledgerr-mcp-server.rs)
Replace hardcoded true at line 130:
    // Before:
    mcp_adapter::handle_pipeline_status(true, true, true, Vec::new())

    // After:
    let docling_ready = which::which("docling").is_ok();
    mcp_adapter::handle_pipeline_status(true, true, docling_ready, Vec::new())
The which crate should already be in the workspace (confirm in Cargo.lock); if it is absent, fall back to std::process::Command::new("which").arg("docling").status().map(|s| s.success()).unwrap_or(false).
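If neither the which crate nor an external which binary can be relied on, a standard-library probe is another option. The sketch below walks the PATH entries directly; binary_in_path_var and binary_on_path are hypothetical names. It is deliberately simpler than the which crate: it does not check the executable bit and does not apply Windows PATHEXT rules.

```rust
use std::env;
use std::ffi::OsStr;

/// Returns true if a regular file named `name` exists in any directory
/// listed in `path_var` (formatted like the PATH environment variable).
/// Simplification: no executable-bit or PATHEXT handling.
fn binary_in_path_var(name: &str, path_var: &OsStr) -> bool {
    env::split_paths(path_var).any(|dir| dir.join(name).is_file())
}

/// Probes the current process's PATH; false when PATH is unset.
fn binary_on_path(name: &str) -> bool {
    env::var_os("PATH")
        .map(|paths| binary_in_path_var(name, &paths))
        .unwrap_or(false)
}
```

With this in place, the call site could compute docling_ready as binary_on_path("docling") without adding a dependency or spawning a process.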
4. DoclingProcessSurface — b00t attestation (crates/b00t-iface/src/ or crates/ledgerr-mcp/src/)
Implement ProcessSurface for Docling as a b00t datum. This is the node-level attestation: whichever Docling binary was compiled or optimized for this system, b00t can verify that it is operational before l3dg3rr claims docling_ready: true.
    pub struct DoclingProcessSurface;

    impl ProcessSurface for DoclingProcessSurface {
        type Config = ();
        type Error = DoclingError;
        type Handle = ();

        fn capability(&self) -> SurfaceCapability {
            SurfaceCapability {
                name: "docling",
                requirements: vec![
                    Requirement::BinaryOnPath("docling".into()),
                ],
                governance: GovernancePolicy::default(),
            }
        }

        fn init(&mut self, _config: ()) -> Result<(), DoclingError> {
            which::which("docling").map(|_| ()).map_err(|_| DoclingError::NotOnPath)
        }

        fn operate(&mut self) -> Result<(), DoclingError> {
            // Smoke: docling --version
            let out = std::process::Command::new("docling")
                .arg("--version")
                .output()
                .map_err(|e| DoclingError::SpawnFailed(e.to_string()))?;
            if out.status.success() { Ok(()) } else { Err(DoclingError::VersionCheckFailed) }
        }

        fn maintain(&mut self) -> MaintenanceAction { MaintenanceAction::NoOp }

        fn terminate(&mut self, _: ()) -> AuditRecord { /* ... */ }
    }
On HandshakeSurface::operate(), append "docling" to surfaces in the HandshakeDocument written to _b00t_/handshake/l3dg3rr.json only when DoclingProcessSurface.init() succeeds. This makes the datum self-attesting: the handshake file's surfaces array is the proof that Docling is operational on this node.
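The "only when init() succeeds" rule can be sketched independently of the b00t types. In the snippet below, advertise_if_ready is a hypothetical helper and the plain Vec<String> stands in for HandshakeDocument::surfaces; the duplicate check is an added assumption so that repeated operate() calls do not grow the list.

```rust
/// Appends `name` to `surfaces` only if the readiness probe succeeded,
/// and never adds the same name twice. Returns whether the surface is ready.
/// Hypothetical helper: `surfaces` stands in for HandshakeDocument::surfaces.
fn advertise_if_ready<E>(
    surfaces: &mut Vec<String>,
    name: &str,
    probe: Result<(), E>,
) -> bool {
    match probe {
        Ok(()) => {
            if !surfaces.iter().any(|s| s == name) {
                surfaces.push(name.to_string());
            }
            true
        }
        // A failed probe leaves the advertised surfaces untouched.
        Err(_) => false,
    }
}
```

Inside HandshakeSurface::operate() this might be called with the result of DoclingProcessSurface.init(), so the written handshake file lists "docling" exactly when the probe passed.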
Acceptance Criteria
- IngestStatementOp::execute() with a .pdf input and docling on $PATH produces ≥ 1 TransactionInput with non-None date and Decimal amount.
- IngestStatementOp::execute() with docling absent returns LedgerOpError::MissingDependency without panicking.
- get_pipeline_status(true, true, false, vec![]) returns blockers: ["docling_unreachable"] (existing test at mcp_adapter_contract.rs:65 must still pass).
- test_ingest_statement_via_pdf_sidecar (currently #[ignore]) is un-ignored and passes when docling is on $PATH; in CI it is skipped unless the DOCLING_INTEGRATION=1 env var is set.
- DoclingProcessSurface::init() returns Err(DoclingError::NotOnPath) when docling is absent.
- HandshakeDocument { surfaces } includes "docling" iff DoclingProcessSurface.init().is_ok().
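For the DOCLING_INTEGRATION gate, one possible shape is a small predicate that the integration test consults before doing any work. Both function names below are hypothetical, and treating only the exact value "1" as enabled is an assumption based on the criterion above.

```rust
/// Decides whether the Docling integration test should run, given the
/// raw value of the DOCLING_INTEGRATION variable (None when unset).
/// ASSUMPTION: only the exact value "1" enables the test.
fn integration_enabled(flag: Option<&str>) -> bool {
    flag == Some("1")
}

/// Reads the flag from the process environment.
fn docling_integration_enabled() -> bool {
    integration_enabled(std::env::var("DOCLING_INTEGRATION").ok().as_deref())
}
```

The test body could then start with an early return when docling_integration_enabled() is false. Keeping the decision in a pure function makes it checkable without mutating the process environment.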
Files
| File | Change |
|---|---|
| crates/ledger-core/src/ledger_ops.rs:188 | Add PDF branch in IngestStatementOp::execute() |
| crates/ledger-core/src/ingest.rs or new docling.rs | DoclingDocument, DoclingTable, DoclingCell deserialization; ingest_pdf_via_docling() |
| crates/ledgerr-mcp/src/bin/ledgerr-mcp-server.rs:130 | Replace docling_ready: true with which::which("docling").is_ok() |
| crates/b00t-iface/src/ (new file) | DoclingProcessSurface implementing ProcessSurface |
| crates/b00t-iface/src/handshake/mod.rs | Append "docling" to HandshakeDocument::surfaces when surface is ready |
| crates/ledger-core/src/integration_tests.rs:73 | Remove #[ignore] gate; add DOCLING_INTEGRATION env guard |
Dependency
Independent of #55–#57. IngestPdfRequest.extracted_rows is the output that feeds #55 (TransactionFacts population) — implementing this makes the PDF → legal verification path end-to-end.