Ledger Storage Architecture

This note is the canonical storage plan for Ledger while the source-package harness stabilizes. The detailed agent workflow remains in docs/agent-source-package-harness.md; this document only defines where each class of Ledger data belongs.

Decision Summary

Ledger uses three storage layers with different jobs.

ledger-raw is the immutable source-byte archive. It stores exact publisher artifacts as fetched: workbooks, CSVs, PDFs, ZIPs, HTML snapshots, and similar government-statistics release files. Raw objects are content-addressed by checksum and should never be overwritten in place.

ledger-derived is the reproducible artifact archive. It stores build outputs that Ledger can regenerate from raw bytes, package specs, parser code, and build configuration. Examples include parsed-cell or parsed-row Parquet/JSONL files, source record outputs, ledger.db, mirror JSONL exports, QA reports, Data Package metadata, and RO-Crate metadata.

Supabase/Postgres is the queryable relational registry for accepted Ledger builds. It stores rows that applications, agents, and downstream systems need to search and join: source artifacts, source rows/cells, source records, aggregate facts, aggregate constraints, concept alignments, lineage edges, build metadata, validation status, and object-location pointers. Supabase is not the source of raw bytes and should not be hand-edited as the ingestion authority.

The deterministic local build remains the authority for source-backed facts. Hosted tables mirror accepted build outputs and provide a shared query surface.

Ownership Matrix

Data class	Git/local package	`ledger-raw` R2	`ledger-derived` R2	SQLite `ledger.db`	Supabase/Postgres
Source package specs	Authoritative YAML and parser code	No	Optional packaged snapshot	No	Metadata only
Raw publisher files	Tiny fixtures only	Authoritative bytes	No	Metadata only	Metadata plus R2 pointer
Source manifests	Authoritative checked metadata	No	Optional snapshot	Metadata loaded into tables	Queryable artifact registry
Parsed source rows/cells	Generated local output	No	Snapshot artifact	Queryable table	Queryable mirror
Source records/facts	Generated local output	No	Snapshot artifact	Queryable table	Queryable mirror
Aggregate constraints	Generated local output	No	Snapshot artifact	Queryable table	Queryable mirror
Build reports and QA	Generated local output	No	Snapshot artifact	Build summary rows	Queryable validation status
Mirror JSONL exports	Generated local output	No	Snapshot artifact	Export source	Bulk-load input
Populace active targets	No	No	No	No	Future adapter output outside Ledger core

Object Key Conventions

Raw source artifacts use the implemented content-addressed key shape:

raw/{source_id}/{package_id}/{year}/{sha256}/{filename}

For example:

raw/irs_soi/soi-table-1-1/2023/842da11...aca17/23in11si.xls

Derived build artifacts should use build-scoped keys so different builds can coexist and be audited:

derived/{source_id}/{package_id}/{year}/{build_id}/{artifact_name}

Examples:

derived/irs_soi/soi-table-1-1/2023/{build_id}/source_cells.jsonl
derived/bea/bea-nipa-pension-contributions/2022/{build_id}/source_rows.jsonl
derived/irs_soi/soi-table-1-1/2023/{build_id}/ledger.db
derived/irs_soi/soi-table-1-1/2023/{build_id}/reports/build_summary.json
derived/irs_soi/soi-table-1-1/2023/{build_id}/mirror/aggregate_facts.jsonl

Derived artifacts are reproducible and may be replaced by a new build, but a specific {build_id} path should be immutable once published.

Relational Registry Contract

The hosted ledger schema should be the lookup surface for Ledger, not the place where agents invent source facts. Rows should be bulk-loaded from deterministic build outputs.

The registry should expose:

source artifact identity: source name, table/file, URL, vintage, extraction date, extraction method, checksum, size, and raw R2 bucket/key/URI;
source rows/cells and source records, including exact source-row and source-cell lineage;
source columns and source-row values, so row-oriented artifacts are queryable by raw or normalized column names without JSON scans;
aggregate facts and aggregate constraints with stable keys, dimensions, filters, units, aggregation semantics, labels, and source provenance;
concept alignments, including source concept, canonical concept, relation, authority, legal vintage, and evidence;
build metadata, validation status, and derived artifact R2 bucket/key/URI.

The current Supabase migration mirrors the core relational tables and includes R2 location fields for raw source artifacts and derived build artifacts, so the registry can serve as the shared index over both R2 buckets.

Build And Publish Flow

The intended flow is:

Register raw source artifacts with uv run ledger fetch-artifact, which writes local bytes, records checksums in manifest.yaml, and can upload the exact bytes to ledger-raw. Existing manifest-declared artifacts can be checksum-validated, uploaded, and linked with uv run ledger publish-raw. Production package specs may omit raw bytes from Git as long as the manifest keeps source_url and SHA-256 metadata; builds can fill LEDGER_SOURCE_ARTIFACT_CACHE_DIR by setting LEDGER_SOURCE_ARTIFACT_FETCH=1. The old LEDGER_-prefixed environment variables remain accepted only as migration fallbacks.
Validate and build a source package with uv run ledger validate-package and uv run ledger build-suite.
Produce local deterministic outputs: parsed rows/cells, source records, aggregate facts, ledger.db, QA reports, Data Package metadata, and RO-Crate metadata.
Export relational mirror files with uv run ledger export-db-tables.

Publish derived build outputs to ledger-derived:

uv run ledger publish-derived \
  --dir /tmp/ledger-suite \
  --source-id irs_soi \
  --package-id soi-table-1-1 \
  --year 2023 \
  --build-artifacts-out /tmp/ledger-build-artifacts.jsonl

Bulk-load or upsert accepted relational rows into Supabase/Postgres:

uv run ledger load-supabase-mirror \
  --dir /tmp/ledger-mirror \
  --build-artifacts /tmp/ledger-build-artifacts.jsonl

The Supabase project must have the checked migration applied and the ledger schema exposed in PostgREST/Data API settings before the REST loader can write to it. Use --dry-run to verify local JSONL files without writing.

Non-Goals

Supabase should not store large raw binary artifacts. It should point to R2.

Ledger should not store raw survey or administrative microdata. It reflects government statistics releases and the provenance needed to audit those published facts.

R2 should not be the schema authority. It stores bytes and reproducible build files, while Ledger code and checked specs define semantics.

Ledger should not own Populace source selection, aging, reconciliation, activation profiles, or simulator-specific target mappings. Those belong in Populace or thin downstream adapters.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Ledger Storage Architecture

Decision Summary

Ownership Matrix

Object Key Conventions

Relational Registry Contract

Build And Publish Flow

Non-Goals

Uh oh!

FilesExpand file tree

storage-architecture.md

Latest commit

History

storage-architecture.md

File metadata and controls

Ledger Storage Architecture

Decision Summary

Ownership Matrix

Object Key Conventions

Relational Registry Contract

Build And Publish Flow

Non-Goals