This note is the canonical storage plan for Arch while the source-package
harness stabilizes. The detailed agent workflow remains in
docs/agent-source-package-harness.md; this document only defines where each
class of Arch data belongs.
Arch uses three storage layers with different jobs.
arch-raw is the immutable source-byte archive. It stores exact publisher
artifacts as fetched: workbooks, CSVs, PDFs, ZIPs, HTML snapshots, public raw
microdata files when allowed, and similar source files. Raw objects are
content-addressed by checksum and should never be overwritten in place.
arch-derived is the reproducible artifact archive. It stores build outputs
that Arch can regenerate from raw bytes, package specs, parser code, and build
configuration. Examples include parsed-cell or parsed-row Parquet/JSONL files,
source record outputs, arch.db, mirror JSONL exports, QA reports, Data
Package metadata, and RO-Crate metadata.
Supabase/Postgres is the queryable relational registry for accepted Arch builds. It stores rows that applications, agents, and downstream systems need to search and join: source artifacts, source rows/cells, source records, aggregate facts, aggregate constraints, concept alignments, lineage edges, build metadata, validation status, and object-location pointers. Supabase is not the source of raw bytes and should not be hand-edited as the ingestion authority.
The deterministic local build remains the authority for source-backed facts. Hosted tables mirror accepted build outputs and provide a shared query surface.
| Data class | Git/local package | arch-raw R2 |
arch-derived R2 |
SQLite arch.db |
Supabase/Postgres |
|---|---|---|---|---|---|
| Source package specs | Authoritative YAML and parser code | No | Optional packaged snapshot | No | Metadata only |
| Raw publisher files | Tiny fixtures only | Authoritative bytes | No | Metadata only | Metadata plus R2 pointer |
| Source manifests | Authoritative checked metadata | No | Optional snapshot | Metadata loaded into tables | Queryable artifact registry |
| Parsed source rows/cells | Generated local output | No | Snapshot artifact | Queryable table | Queryable mirror |
| Source records/facts | Generated local output | No | Snapshot artifact | Queryable table | Queryable mirror |
| Aggregate constraints | Generated local output | No | Snapshot artifact | Queryable table | Queryable mirror |
| Build reports and QA | Generated local output | No | Snapshot artifact | Build summary rows | Queryable validation status |
| Mirror JSONL exports | Generated local output | No | Snapshot artifact | Export source | Bulk-load input |
| Microplex active targets | No | No | No | No | Future adapter output outside Arch core |
Raw source artifacts use the implemented content-addressed key shape:
raw/{source_id}/{package_id}/{year}/{sha256}/{filename}
For example:
raw/irs_soi/soi-table-1-1/2023/842da11...aca17/23in11si.xls
Derived build artifacts should use build-scoped keys so different builds can coexist and be audited:
derived/{source_id}/{package_id}/{year}/{build_id}/{artifact_name}
Examples:
derived/irs_soi/soi-table-1-1/2023/{build_id}/source_cells.jsonl
derived/bea/bea-nipa-pension-contributions/2022/{build_id}/source_rows.jsonl
derived/irs_soi/soi-table-1-1/2023/{build_id}/arch.db
derived/irs_soi/soi-table-1-1/2023/{build_id}/reports/build_summary.json
derived/irs_soi/soi-table-1-1/2023/{build_id}/mirror/aggregate_facts.jsonl
Derived artifacts are reproducible and may be replaced by a new build, but a
specific {build_id} path should be immutable once published.
The hosted arch schema should be the lookup surface for Arch, not the place
where agents invent source facts. Rows should be bulk-loaded from deterministic
build outputs.
The registry should expose:
- source artifact identity: source name, table/file, URL, vintage, extraction date, extraction method, checksum, size, and raw R2 bucket/key/URI;
- source rows/cells and source records, including exact source-row and source-cell lineage;
- source columns and source-row values, so row-oriented artifacts are queryable by raw or normalized column names without JSON scans;
- aggregate facts and aggregate constraints with stable keys, dimensions, filters, units, aggregation semantics, labels, and source provenance;
- concept alignments, including source concept, canonical concept, relation, authority, legal vintage, and evidence;
- build metadata, validation status, and derived artifact R2 bucket/key/URI.
The current Supabase migration mirrors the core relational tables and includes R2 location fields for raw source artifacts and derived build artifacts, so the registry can serve as the shared index over both R2 buckets.
The intended flow is:
-
Register raw source artifacts with
uv run arch fetch-artifact, which writes local bytes, records checksums inmanifest.yaml, and can upload the exact bytes toarch-raw. Existing manifest-declared artifacts can be checksum-validated, uploaded, and linked withuv run arch publish-raw. Production package specs may omit raw bytes from Git as long as the manifest keepssource_urland SHA-256 metadata; builds can fillARCH_SOURCE_ARTIFACT_CACHE_DIRby settingARCH_SOURCE_ARTIFACT_FETCH=1. -
Validate and build a source package with
uv run arch validate-packageanduv run arch build-suite. -
Produce local deterministic outputs: parsed rows/cells, source records, aggregate facts,
arch.db, QA reports, Data Package metadata, and RO-Crate metadata. -
Export relational mirror files with
uv run arch export-db-tables. -
Publish derived build outputs to
arch-derived:uv run arch publish-derived \ --dir /tmp/arch-suite \ --source-id irs_soi \ --package-id soi-table-1-1 \ --year 2023 \ --build-artifacts-out /tmp/arch-build-artifacts.jsonl
-
Bulk-load or upsert accepted relational rows into Supabase/Postgres:
uv run arch load-supabase-mirror \ --dir /tmp/arch-mirror \ --build-artifacts /tmp/arch-build-artifacts.jsonl
The Supabase project must have the checked migration applied and the arch
schema exposed in PostgREST/Data API settings before the REST loader can write
to it. Use --dry-run to verify local JSONL files without writing.
Supabase should not store large raw binary artifacts. It should point to R2.
R2 should not be the schema authority. It stores bytes and reproducible build files, while Arch code and checked specs define semantics.
Arch should not own Microplex source selection, aging, reconciliation, activation profiles, or simulator-specific target mappings. Those belong in Microplex or thin downstream adapters.