perf: split arc_content out of the ARC CouchDB document (P1) #184

@Zalfsten

Background

Currently each ARC document stored in CouchDB embeds the full RO-Crate JSON as arc_content
inside the same document that holds metadata (metadata.events, metadata.arc_hash, etc.).

Problem

Any metadata-only query (health checks, harvest statistics, event log reads) loads the entire
RO-Crate payload, which can be several megabytes, over the network unnecessarily.

Although the field-projection workaround (fields=[...] in CouchDBClient.find()) was
added as an interim fix, the root cause is the mixed document layout.
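
For context, the interim workaround can be sketched as a Mango query body with a `fields` projection, which is what CouchDB's `_find` endpoint accepts. This is only an illustration: `build_metadata_query` is a hypothetical helper, not the signature of `CouchDBClient.find()`, and the selector is an assumed example.

```python
import json


def build_metadata_query(selector, fields, limit=25):
    """Build a Mango query body that returns only the listed fields.

    Hypothetical helper for illustration; the real CouchDBClient.find()
    may assemble its request differently.
    """
    return {"selector": selector, "fields": list(fields), "limit": limit}


# Ask only for metadata fields, so arc_content is left out of the response.
query = build_metadata_query(
    selector={"metadata.arc_hash": {"$exists": True}},  # assumed selector
    fields=["_id", "metadata.arc_hash", "metadata.events"],
)

# JSON body that would be POSTed to /<db>/_find
body = json.dumps(query)
```

The projection only helps on code paths that remember to pass `fields`; any query without it still returns the full document.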

Proposed solution

Store ARC content in a separate companion document:

  • arc_<id> — lightweight metadata document (hash, events, status, RDI).
  • arc_<id>_content — large payload document containing arc_content.

DocumentStore.get_arc_content() fetches the _content companion; all other queries
target the metadata document only.
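
A minimal sketch of the proposed layout, using an in-memory dict in place of CouchDB. The class `InMemoryStore`, the helper `split_arc_document`, and the method signatures are assumptions for illustration; only the `arc_<id>` / `arc_<id>_content` naming and the `arc_content` and `metadata` fields come from this issue.

```python
from typing import Any, Dict, Tuple

CONTENT_SUFFIX = "_content"


def split_arc_document(
    arc_id: str, legacy: Dict[str, Any]
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split a combined ARC document into (metadata_doc, content_doc)."""
    # Everything except the payload (and CouchDB bookkeeping) stays in metadata.
    metadata_doc = {
        k: v for k, v in legacy.items() if k not in ("arc_content", "_id", "_rev")
    }
    metadata_doc["_id"] = f"arc_{arc_id}"
    content_doc = {
        "_id": f"arc_{arc_id}{CONTENT_SUFFIX}",
        "arc_content": legacy.get("arc_content"),
    }
    return metadata_doc, content_doc


class InMemoryStore:
    """Stand-in for the real document store (hypothetical API)."""

    def __init__(self) -> None:
        self._docs: Dict[str, Dict[str, Any]] = {}

    def store_arc(self, arc_id: str, legacy: Dict[str, Any]) -> None:
        meta, content = split_arc_document(arc_id, legacy)
        self._docs[meta["_id"]] = meta
        self._docs[content["_id"]] = content

    def get_metadata(self, arc_id: str) -> Dict[str, Any]:
        # Never touches the companion document.
        return self._docs[f"arc_{arc_id}"]

    def get_arc_content(self, arc_id: str) -> Any:
        return self._docs[f"arc_{arc_id}{CONTENT_SUFFIX}"]["arc_content"]


store = InMemoryStore()
store.store_arc(
    "42",
    {
        "metadata": {"arc_hash": "abc123", "events": []},
        "arc_content": {"@graph": []},  # placeholder RO-Crate payload
    },
)
meta = store.get_metadata("42")
payload = store.get_arc_content("42")
```

With this split, metadata reads cannot pull in the payload by accident, because the payload lives under a different document ID.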

Acceptance criteria

  • New document schema implemented and migration path documented.
  • CouchDB.store_arc(), get_arc_content(), and get_metadata() updated.
  • get_harvest_statistics() no longer needs a fields projection workaround.
  • All existing tests updated/extended.
  • No regression in API behaviour.
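
One possible shape for the migration path, sketched against a plain dict of documents keyed by `_id`. The function name `migrate_legacy_docs` is hypothetical, and the sketch assumes legacy documents already use `arc_<id>` as their `_id`; a real migration would go through the CouchDB client and handle revisions and conflicts.

```python
from typing import Any, Dict


def migrate_legacy_docs(docs: Dict[str, Dict[str, Any]]) -> int:
    """Move arc_content out of each combined document; return how many were split.

    Safe to run repeatedly: already migrated documents are skipped.
    """
    migrated = 0
    for doc_id in list(docs):  # copy the keys, since docs is modified in the loop
        doc = docs[doc_id]
        if doc_id.endswith("_content") or "arc_content" not in doc:
            continue  # companion document, or metadata already split
        companion_id = f"{doc_id}_content"
        # Create the companion first, then strip the payload from the metadata
        # document. Against a real database this order avoids a window where
        # the payload exists nowhere.
        docs[companion_id] = {"_id": companion_id, "arc_content": doc["arc_content"]}
        del doc["arc_content"]
        migrated += 1
    return migrated


docs = {
    "arc_1": {"_id": "arc_1", "metadata": {"arc_hash": "h1"}, "arc_content": {"a": 1}},
    "arc_2": {"_id": "arc_2", "metadata": {"arc_hash": "h2"}},  # already migrated
    "arc_2_content": {"_id": "arc_2_content", "arc_content": {"b": 2}},
}
first_run = migrate_legacy_docs(docs)
second_run = migrate_legacy_docs(docs)
```

Running the function a second time finds nothing left to split, which is the property a rerunnable migration script would need.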