Background
Currently each ARC document stored in CouchDB embeds the full RO-Crate JSON as arc_content
inside the same document that holds metadata (metadata.events, metadata.arc_hash, etc.).
Problem
Any metadata-only query (health checks, harvest statistics, event log reads) loads the entire
RO-Crate payload, which can be several MB, over the network unnecessarily.
Although the field-projection workaround (fields=[...] in CouchDBClient.find()) was
added as an interim fix, the root cause is the mixed document layout.
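For context, the interim workaround maps onto the `fields` projection of a CouchDB Mango query. The following is a minimal sketch of such a request body, assuming the metadata sits under a `metadata` key as described above. The helper name `build_metadata_query`, the selector, and the `limit` default are hypothetical and not taken from the codebase.

```python
import json


def build_metadata_query(limit=100):
    """Build a Mango query body that returns only metadata fields (hypothetical helper)."""
    return {
        # Assumption: every ARC document carries a "metadata" object.
        "selector": {"metadata": {"$exists": True}},
        # Projection: only these fields are returned, so arc_content is left out.
        "fields": ["_id", "metadata.arc_hash", "metadata.events"],
        "limit": limit,
    }


if __name__ == "__main__":
    # Print the body that would be sent to the database's _find endpoint.
    print(json.dumps(build_metadata_query(limit=10), indent=2))
```

The projection keeps the payload out of the response, but every metadata-only caller has to remember to request it.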
Proposed solution
Store ARC content in a separate companion document:
arc_<id> — lightweight metadata document (hash, events, status, RDI).
arc_<id>_content — large payload document containing arc_content.
DocumentStore.get_arc_content() fetches the _content companion; all other queries
target the metadata document only.
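The proposed layout could look roughly like the sketch below. It is an in-memory stand-in, not the actual DocumentStore: a plain dict replaces CouchDB, the class name `InMemoryDocumentStore` and the method signatures are assumptions, and only the `arc_<id>` / `arc_<id>_content` naming comes from the proposal.

```python
class InMemoryDocumentStore:
    """Toy stand-in for DocumentStore; a dict replaces CouchDB (assumption)."""

    def __init__(self):
        self._docs = {}

    @staticmethod
    def _meta_id(arc_id):
        # Metadata document id, as named in the proposal.
        return f"arc_{arc_id}"

    @staticmethod
    def _content_id(arc_id):
        # Companion document id, as named in the proposal.
        return f"arc_{arc_id}_content"

    def store_arc(self, arc_id, metadata, arc_content):
        """Write the metadata document and its content companion separately."""
        meta_id = self._meta_id(arc_id)
        content_id = self._content_id(arc_id)
        # Lightweight document: no payload embedded.
        self._docs[meta_id] = {"_id": meta_id, "metadata": metadata}
        # Large payload lives only in the companion document.
        self._docs[content_id] = {"_id": content_id, "arc_content": arc_content}

    def get_metadata(self, arc_id):
        """Read metadata without touching the companion document."""
        return self._docs[self._meta_id(arc_id)]["metadata"]

    def get_arc_content(self, arc_id):
        """Fetch the RO-Crate payload from the companion document."""
        return self._docs[self._content_id(arc_id)]["arc_content"]


if __name__ == "__main__":
    store = InMemoryDocumentStore()
    store.store_arc("42", {"arc_hash": "abc", "events": []}, {"@graph": []})
    print(store.get_metadata("42"))
    print(store.get_arc_content("42"))
```

One thing a real implementation would need to check: an ARC id that itself ends in `_content` would produce a metadata id that collides with another ARC's companion id.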
Acceptance criteria
CouchDB.store_arc(), get_arc_content(), and get_metadata() updated.
get_harvest_statistics() no longer needs a fields projection workaround.