docling bridge: wire IngestStatementOp PDF branch + DoclingProcessSurface b00t attestation #60

@promptexecutionerr

Description

What already exists (do not re-implement)

The MCP contract, request shape, and pipeline status tracking are complete:

| Symbol | Location | State |
| --- | --- | --- |
| proxy_docling_ingest_pdf MCP tool | crates/ledgerr-mcp/src/bin/ledgerr-mcp-server.rs:132 | Registered, routes to handle_ingest_pdf() |
| handle_ingest_pdf<T>() | crates/ledgerr-mcp/src/mcp_adapter.rs:999 | Implemented; calls ingest_statement_rows() with pre-parsed extracted_rows |
| IngestPdfRequest | crates/ledgerr-mcp/src/lib.rs:91 | { pdf_path: String, journal_path, workbook_path, ontology_path, raw_context_bytes, extracted_rows: Vec<TransactionInput> } |
| docling_ready: bool | mcp_adapter.rs:193 | In get_pipeline_status(); hardcoded true at call site bin/ledgerr-mcp-server.rs:130 |
| DocumentChunk | crates/ledger-core/src/rule_registry.rs:122 | NDJSON sidecar output type: { node_id, text, parent_id, semantic_id, anchors: Vec<[u32;2]> } |
| IngestStatementOp | crates/ledger-core/src/ledger_ops.rs:169 | Handles CSV/XLSX via calamine; PDF branch missing, returns an error on .pdf input |
| test_ingest_statement_via_pdf_sidecar | crates/ledger-core/src/integration_tests.rs:73 | #[ignore]; contract written, awaiting implementation |
| ProcessSurface + Requirement::BinaryOnPath | crates/b00t-iface/src/core/surface.rs:14–66 | Surface lifecycle trait with typed requirement declarations |
| HandshakeSurface / HandshakeDocument | crates/b00t-iface/src/handshake/mod.rs:63 | Writes _b00t_/handshake/l3dg3rr.json; the surfaces: Vec<String> field carries the capability advertisement |

What needs to be built

1. IngestStatementOp::execute() — PDF branch (ledger-core/src/ledger_ops.rs)

Currently execute() opens input_path via calamine::open_workbook_auto(), which panics or errors on .pdf. Add a PDF branch before the calamine block:

if matches!(doc_type, DocType::Pdf) {
    return ingest_pdf_via_docling(input_path, &account_id, ctx);
}

ingest_pdf_via_docling(path, account_id, ctx) must:

  1. Check which::which("docling").is_ok() — return LedgerOpError::MissingDependency("docling not on PATH") if absent (not panic).
  2. Spawn: std::process::Command::new("docling").args(["convert", "--to", "json", path]).output()
  3. Deserialize stdout as DoclingDocument (see schema below).
  4. Map DoclingDocument.tables[*].data.grid rows → TransactionInput { account_id, date, amount, description, source_ref }.
  5. amount must be rust_decimal::Decimal::from_str() — never f64.
  6. Compute Blake3 content-hash ID per row: blake3(account_id + date + amount_str + description).
  7. Return OperationResult::success("ingest-statement", rows.len()).
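Step 5 rules out f64 for amounts. As a rough illustration of why exact parsing matters, the sketch below converts an amount string to integer minor units using only the standard library. `parse_amount_minor_units` is a hypothetical helper, not part of the codebase, and it assumes two-decimal currencies, an optional leading minus, and comma thousands separators. The real branch would call `rust_decimal::Decimal::from_str()` as the issue requires.

```rust
/// Parse an amount such as "-1,234.56" into integer minor units (cents)
/// without passing through f64.
/// Hypothetical stand-in for rust_decimal::Decimal::from_str(); assumes
/// at most two fractional digits and ',' as the thousands separator.
fn parse_amount_minor_units(raw: &str) -> Option<i64> {
    let trimmed = raw.trim();
    // Peel off an optional leading minus sign.
    let (negative, body) = match trimmed.strip_prefix('-') {
        Some(rest) => (true, rest),
        None => (false, trimmed),
    };
    // Drop thousands separators before splitting on the decimal point.
    let cleaned: String = body.chars().filter(|c| *c != ',').collect();
    let (whole, frac) = match cleaned.split_once('.') {
        Some((w, f)) => (w, f),
        None => (cleaned.as_str(), ""),
    };
    let all_digits = |s: &str| s.chars().all(|c| c.is_ascii_digit());
    if (whole.is_empty() && frac.is_empty())
        || frac.len() > 2
        || !all_digits(whole)
        || !all_digits(frac)
    {
        return None; // reject rather than guess
    }
    let whole_units: i64 = if whole.is_empty() { 0 } else { whole.parse().ok()? };
    // "5" after the point means 50 cents; "05" means 5 cents.
    let frac_units: i64 = match frac.len() {
        0 => 0,
        1 => frac.parse::<i64>().ok()? * 10,
        _ => frac.parse().ok()?,
    };
    let total = whole_units.checked_mul(100)?.checked_add(frac_units)?;
    Some(if negative { -total } else { total })
}
```

Returning `None` on anything unexpected mirrors the "no panic" requirement: a malformed cell becomes a recoverable error rather than a silently rounded value.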

2. DoclingDocument deserialization target (ledger-core/src/ingest.rs or new ledger-core/src/docling.rs)

Docling 2.78.0 JSON schema (relevant subset):

use serde::Deserialize;

#[derive(Debug, Deserialize)]
pub struct DoclingDocument {
    pub tables: Vec<DoclingTable>,
}

#[derive(Debug, Deserialize)]
pub struct DoclingTable {
    pub data: DoclingTableData,
}

#[derive(Debug, Deserialize)]
pub struct DoclingTableData {
    pub grid: Vec<Vec<DoclingCell>>,
}

#[derive(Debug, Deserialize)]
pub struct DoclingCell {
    pub text: String,
    #[serde(default)]
    pub col_span: u32,
    #[serde(default)]
    pub row_span: u32,
}

Column heuristic: first row of grid is the header. Map headers to date, amount, description via DocumentShape::column_map (same path as the XLSX branch at ledger_ops.rs:257–271).
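The header heuristic can be sketched as follows. `DocumentShape::column_map` is not shown in this issue, so `ColumnMap`, `map_header`, and `split_grid` here are hypothetical stand-ins, and the case-insensitive substring match on "date", "amount", and "description" is an assumption about how headers are recognised. Cells are plain `&str` rather than `DoclingCell` to keep the example self-contained.

```rust
/// Column indices for the three fields the ingest step needs.
/// Hypothetical type; the real mapping lives in DocumentShape::column_map.
#[derive(Debug, PartialEq)]
struct ColumnMap {
    date: usize,
    amount: usize,
    description: usize,
}

/// Locate the date, amount, and description columns in a header row.
/// Returns None if any of the three cannot be found.
fn map_header(header: &[&str]) -> Option<ColumnMap> {
    // Assumed matching rule: case-insensitive substring match.
    let find = |needle: &str| {
        header
            .iter()
            .position(|cell| cell.trim().to_lowercase().contains(needle))
    };
    Some(ColumnMap {
        date: find("date")?,
        amount: find("amount")?,
        description: find("description")?,
    })
}

/// Treat row 0 of the grid as the header and return the column map
/// together with the remaining data rows.
fn split_grid<'a>(grid: &'a [Vec<&'a str>]) -> Option<(ColumnMap, &'a [Vec<&'a str>])> {
    let (header, rows) = grid.split_first()?;
    Some((map_header(header)?, rows))
}
```

An empty grid or a header missing any of the three columns yields `None`, which the caller can turn into a `LedgerOpError` instead of indexing out of bounds.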

3. docling_ready real probe (ledgerr-mcp/src/bin/ledgerr-mcp-server.rs)

Replace hardcoded true at line 130:

// Before:
mcp_adapter::handle_pipeline_status(true, true, true, Vec::new())
// After:
let docling_ready = which::which("docling").is_ok();
mcp_adapter::handle_pipeline_status(true, true, docling_ready, Vec::new())

The which crate should already be in the workspace (verify in Cargo.lock); if it is absent, fall back to std::process::Command::new("which").arg("docling").status().map(|s| s.success()).unwrap_or(false).
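If neither the which crate nor a `which` binary is available, the probe can also be written against the standard library. This is an alternative sketch, not something the issue prescribes: `find_on_path` and `docling_ready` are hypothetical names, and the check is deliberately simple (it does not inspect the executable bit or Windows extensions such as `.exe`).

```rust
use std::env;
use std::ffi::OsStr;
use std::path::PathBuf;

/// Return the first directory entry in `path_var` that contains a regular
/// file named `binary`. Simplification: no executable-bit or PATHEXT check.
fn find_on_path(binary: &str, path_var: &OsStr) -> Option<PathBuf> {
    env::split_paths(path_var)
        .map(|dir| dir.join(binary))
        .find(|candidate| candidate.is_file())
}

/// Probe the real process environment for `docling`.
/// A missing PATH variable is treated as "not ready".
fn docling_ready() -> bool {
    env::var_os("PATH")
        .map(|path| find_on_path("docling", &path).is_some())
        .unwrap_or(false)
}
```

Taking the PATH value as a parameter keeps `find_on_path` testable without mutating the process environment.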

4. DoclingProcessSurface — b00t attestation (crates/b00t-iface/src/ or crates/ledgerr-mcp/src/)

Implement ProcessSurface for Docling as a b00t datum. This is the node-level attestation: when the binary was compiled/optimized for this system, b00t can verify Docling is operational before l3dg3rr claims docling_ready: true.

pub struct DoclingProcessSurface;

impl ProcessSurface for DoclingProcessSurface {
    type Config = ();
    type Error = DoclingError;
    type Handle = ();

    fn capability(&self) -> SurfaceCapability {
        SurfaceCapability {
            name: "docling",
            requirements: vec![
                Requirement::BinaryOnPath("docling".into()),
            ],
            governance: GovernancePolicy::default(),
        }
    }

    fn init(&mut self, _config: ()) -> Result<(), DoclingError> {
        which::which("docling").map(|_| ()).map_err(|_| DoclingError::NotOnPath)
    }

    fn operate(&mut self) -> Result<(), DoclingError> {
        // Smoke: docling --version
        let out = std::process::Command::new("docling")
            .arg("--version")
            .output()
            .map_err(|e| DoclingError::SpawnFailed(e.to_string()))?;
        if out.status.success() { Ok(()) } else { Err(DoclingError::VersionCheckFailed) }
    }

    fn maintain(&mut self) -> MaintenanceAction { MaintenanceAction::NoOp }
    fn terminate(&mut self, _: ()) -> AuditRecord { /* ... */ }
}
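The trait implementation above uses `DoclingError` without defining it. A minimal definition consistent with the three variants it references might look like the following; the message strings are assumptions, and the project may prefer a derive-based error crate over hand-written impls.

```rust
use std::error::Error;
use std::fmt;

/// Error type for the Docling surface, limited to the three variants
/// referenced by the ProcessSurface implementation. Messages are assumed.
#[derive(Debug, PartialEq)]
pub enum DoclingError {
    /// `docling` could not be located on PATH.
    NotOnPath,
    /// The process could not be started; carries the OS error text.
    SpawnFailed(String),
    /// `docling --version` ran but exited with a non-zero status.
    VersionCheckFailed,
}

impl fmt::Display for DoclingError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DoclingError::NotOnPath => write!(f, "docling not on PATH"),
            DoclingError::SpawnFailed(reason) => {
                write!(f, "failed to spawn docling: {reason}")
            }
            DoclingError::VersionCheckFailed => {
                write!(f, "docling --version exited with a non-zero status")
            }
        }
    }
}

impl Error for DoclingError {}
```

Deriving `PartialEq` lets the acceptance test compare against `Err(DoclingError::NotOnPath)` directly.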

On HandshakeSurface::operate(), append "docling" to surfaces in the HandshakeDocument written to _b00t_/handshake/l3dg3rr.json only when DoclingProcessSurface.init() succeeds. This makes the datum self-attesting: the handshake file's surfaces array is the proof that Docling is operational on this node.
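The conditional append can be sketched in isolation. `advertise_surfaces` is a hypothetical helper, and `docling_init` stands in for the result of `DoclingProcessSurface::init()`; the duplicate check is an assumption so that repeated `operate()` calls do not grow the list.

```rust
/// Build the `surfaces` list for the handshake document.
/// "docling" is advertised only when the init probe succeeded.
/// Hypothetical helper; `docling_init` stands in for
/// DoclingProcessSurface::init().
fn advertise_surfaces<E>(mut surfaces: Vec<String>, docling_init: Result<(), E>) -> Vec<String> {
    // Assumed behaviour: never list the same surface twice.
    if docling_init.is_ok() && !surfaces.iter().any(|s| s == "docling") {
        surfaces.push("docling".to_string());
    }
    surfaces
}
```

With this shape, the "iff" in the acceptance criteria reduces to a single branch that is easy to unit-test with a synthetic `Ok(())` or `Err(..)`.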

Acceptance Criteria

  • IngestStatementOp::execute() with a .pdf input and docling on $PATH produces ≥ 1 TransactionInput with non-None date and Decimal amount.
  • IngestStatementOp::execute() with docling absent returns LedgerOpError::MissingDependency — no panic.
  • get_pipeline_status(true, true, false, vec![]) returns blockers: ["docling_unreachable"] (existing test at mcp_adapter_contract.rs:65 must still pass).
  • test_ingest_statement_via_pdf_sidecar (currently #[ignore]) has its #[ignore] removed and passes when docling is on $PATH; in CI it is skipped by an env guard unless DOCLING_INTEGRATION=1 is set.
  • DoclingProcessSurface::init() returns Err(DoclingError::NotOnPath) when docling is absent.
  • HandshakeDocument { surfaces } includes "docling" iff DoclingProcessSurface.init().is_ok().
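The DOCLING_INTEGRATION guard could be as small as the sketch below. Both function names are hypothetical, and treating only the exact value "1" as enabled is an assumption drawn from the wording of the criteria.

```rust
/// True only when the variable is set to exactly "1".
/// Split out from the environment read so it can be unit-tested.
fn integration_enabled(value: Option<&str>) -> bool {
    value == Some("1")
}

/// Read DOCLING_INTEGRATION from the process environment.
/// Unset or non-UTF-8 values count as disabled.
fn docling_integration_enabled() -> bool {
    integration_enabled(std::env::var("DOCLING_INTEGRATION").ok().as_deref())
}
```

The integration test would start with `if !docling_integration_enabled() { return; }`. Note that a test which returns early is reported as passed rather than skipped, so CI output will not distinguish the two cases.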

Files

| File | Change |
| --- | --- |
| crates/ledger-core/src/ledger_ops.rs:188 | Add PDF branch in IngestStatementOp::execute() |
| crates/ledger-core/src/ingest.rs or new docling.rs | DoclingDocument, DoclingTable, DoclingCell deserialization; ingest_pdf_via_docling() |
| crates/ledgerr-mcp/src/bin/ledgerr-mcp-server.rs:130 | Replace docling_ready: true with which::which("docling").is_ok() |
| crates/b00t-iface/src/ (new file) | DoclingProcessSurface implementing ProcessSurface |
| crates/b00t-iface/src/handshake/mod.rs | Append "docling" to HandshakeDocument::surfaces when surface is ready |
| crates/ledger-core/src/integration_tests.rs:73 | Remove #[ignore] gate; add DOCLING_INTEGRATION env guard |

Dependency

Independent of #55, #57. IngestPdfRequest.extracted_rows is the output that feeds #55 (TransactionFacts population); implementing this makes the PDF → legal verification path end-to-end.

Labels: docling (Document extraction bridge), enhancement (New feature or request), pipeline (Transaction pipeline)
