The document store is the single persistence layer for all structured data in the middleware. It uses one CouchDB database that holds three types of documents:
- ARC documents: the serialized RO-Crate content together with all lifecycle metadata for that ARC (content hash, timestamps, event log). Content and metadata are co-located in a single document, keyed by arc_id.
- Harvest documents: metadata about a harvest run (owner, counters, status). Separate from ARC documents; referenced by harvest_id.
- Task records: optional records written by background workers to track Celery task outcomes.
All three types share one database. Isolation between them is by document key prefix, not by separate databases.
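A minimal sketch of prefix-based isolation, assuming a simple "type:id" key scheme. The prefix values and helper names below are hypothetical; this document does not state the actual prefixes.

```python
# Hypothetical prefixes; the real values are not given in this document.
ARC_PREFIX = "arc:"
HARVEST_PREFIX = "harvest:"
TASK_PREFIX = "task:"


def arc_key(arc_id: str) -> str:
    """Build the document key for an ARC document."""
    return f"{ARC_PREFIX}{arc_id}"


def harvest_key(harvest_id: str) -> str:
    """Build the document key for a harvest document."""
    return f"{HARVEST_PREFIX}{harvest_id}"


def task_key(task_id: str) -> str:
    """Build the document key for a task record."""
    return f"{TASK_PREFIX}{task_id}"


def document_type(key: str) -> str:
    """Classify a document by its key prefix."""
    prefixes = (("arc", ARC_PREFIX), ("harvest", HARVEST_PREFIX), ("task", TASK_PREFIX))
    for name, prefix in prefixes:
        if key.startswith(prefix):
            return name
    raise ValueError(f"unknown document key prefix: {key!r}")
```

Because every key carries its type, a single database can hold all three document kinds without an ARC ID ever colliding with a harvest ID.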
- On initialization, ensure the application database exists before use.
- On initialization, create the CouchDB system databases (_users, _replicator) if they do not yet exist.
- Handle 412 Precondition Failed (database already exists) as a success during initialization; parallel service startups must not cause crashes.
- Store ARC documents keyed by arc_id; return flags indicating whether the document was newly created and whether its content changed, based on content hash comparison.
- Support harvest run lifecycle operations: create, retrieve, increment statistics counters, and finalize a harvest run.
- Append event records to an ARC document's event log.
- Release the underlying HTTP session and database client on shutdown.
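The harvest run lifecycle listed above (create, retrieve, increment counters, finalize) could look roughly like the following. This is an in-memory sketch, not the CouchDB-backed implementation; the class name, method names, field names, and status values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, Optional


@dataclass
class HarvestRun:
    harvest_id: str
    owner: str
    status: str = "running"  # hypothetical status value
    counters: Dict[str, int] = field(default_factory=dict)
    finished_at: Optional[str] = None


class InMemoryHarvestStore:
    """Stand-in for the real store; keeps harvest runs in a dict."""

    def __init__(self) -> None:
        self._runs: Dict[str, HarvestRun] = {}

    def create_harvest(self, harvest_id: str, owner: str) -> HarvestRun:
        run = HarvestRun(harvest_id=harvest_id, owner=owner)
        self._runs[harvest_id] = run
        return run

    def get_harvest(self, harvest_id: str) -> Optional[HarvestRun]:
        # Unknown IDs yield None rather than raising.
        return self._runs.get(harvest_id)

    def increment(self, harvest_id: str, counter: str, amount: int = 1) -> int:
        run = self._runs[harvest_id]
        run.counters[counter] = run.counters.get(counter, 0) + amount
        return run.counters[counter]

    def finalize(self, harvest_id: str, status: str = "completed") -> HarvestRun:
        run = self._runs[harvest_id]
        run.status = status
        run.finished_at = datetime.now(timezone.utc).isoformat()
        return run
```

In the real store each of these operations is a read or write of the harvest document, so counter increments are subject to the same revision conflicts as any other CouchDB update.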
Parallel service startup (two containers connecting simultaneously) → both attempt
database creation; 412 Precondition Failed is treated as success, not an error.
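One way to express this rule is to classify the HTTP status returned by the database-creation request. The sketch below assumes a hypothetical `create_db` callable that returns the status code of a CouchDB `PUT /{db}` call; `ensure_databases`, `DatabaseInitError`, and `make_fake_server` are illustrative names, not part of the middleware.

```python
from typing import Callable, Iterable, Set

HTTP_CREATED = 201
HTTP_ACCEPTED = 202
HTTP_PRECONDITION_FAILED = 412  # CouchDB: database already exists


class DatabaseInitError(RuntimeError):
    """Hypothetical error for unexpected responses during initialization."""


def ensure_databases(create_db: Callable[[str], int], names: Iterable[str]) -> None:
    """Create each database, treating 'already exists' as success."""
    for name in names:
        status = create_db(name)
        if status in (HTTP_CREATED, HTTP_ACCEPTED, HTTP_PRECONDITION_FAILED):
            continue
        raise DatabaseInitError(f"creating {name!r} failed with HTTP {status}")


def make_fake_server(existing: Set[str]) -> Callable[[str], int]:
    """Simulate CouchDB: 201 on first creation, 412 afterwards."""

    def create_db(name: str) -> int:
        if name in existing:
            return HTTP_PRECONDITION_FAILED
        existing.add(name)
        return HTTP_CREATED

    return create_db
```

Calling `ensure_databases` twice against the same fake server succeeds both times, which mirrors two containers racing through startup.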
ARC document already exists with identical content → the content-changed flag is
false; no CouchDB write is performed for the body; only timestamp fields may be updated.
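The flag computation can be sketched as a hash comparison. The hash algorithm (SHA-256 over canonical JSON), the `content_hash` field name, and the choice to report a brand-new document as changed are assumptions for illustration; the document does not specify them.

```python
import hashlib
import json
from typing import Any, Dict, Optional, Tuple


def content_hash(rocrate: Dict[str, Any]) -> str:
    """Hash the RO-Crate content in a key-order-independent way."""
    canonical = json.dumps(rocrate, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compare_content(
    existing: Optional[Dict[str, Any]], rocrate: Dict[str, Any]
) -> Tuple[bool, bool]:
    """Return (is_new, has_changed) for an incoming ARC.

    `existing` is the stored document, or None if there is none yet.
    """
    if existing is None:
        # Assumption: a newly created document counts as changed.
        return True, True
    return False, existing.get("content_hash") != content_hash(rocrate)
```

When `has_changed` is false, the caller can skip rewriting the body and touch only the timestamp fields.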
Concurrent writes to the same ARC document (e.g. two harvest workers submitting the
same ARC simultaneously) → the store strips the stale _rev from the payload,
re-fetches the current revision on each attempt, and retries up to a configurable
maximum (default 3) on ConflictError before raising DocumentConflictError.
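The retry behaviour might be sketched as follows, using a tiny in-memory model of CouchDB's revision check instead of a real client. `FakeDatabase`, `save_with_retry`, and the exception classes defined here are stand-ins, and the sketch interprets the maximum as the total number of attempts, which is an assumption.

```python
from typing import Any, Dict, Optional


class ConflictError(Exception):
    """Stand-in for the client library's revision conflict error."""


class DocumentConflictError(Exception):
    """Raised once all attempts are exhausted."""


class FakeDatabase:
    """In-memory model: a write succeeds only if _rev matches the stored one."""

    def __init__(self) -> None:
        self.docs: Dict[str, Dict[str, Any]] = {}
        self.conflicts_to_inject = 0  # simulate competing writers

    def get_rev(self, doc_id: str) -> Optional[str]:
        doc = self.docs.get(doc_id)
        return doc["_rev"] if doc else None

    def put(self, doc_id: str, doc: Dict[str, Any]) -> str:
        if self.conflicts_to_inject > 0:
            self.conflicts_to_inject -= 1
            raise ConflictError(doc_id)
        current = self.get_rev(doc_id)
        if doc.get("_rev") != current:
            raise ConflictError(doc_id)
        generation = int(current.split("-")[0]) + 1 if current else 1
        self.docs[doc_id] = dict(doc, _rev=f"{generation}-fake")
        return self.docs[doc_id]["_rev"]


def save_with_retry(
    db: FakeDatabase, doc_id: str, payload: Dict[str, Any], max_attempts: int = 3
) -> str:
    # Drop whatever _rev the caller supplied; it may already be stale.
    body = {k: v for k, v in payload.items() if k != "_rev"}
    for _ in range(max_attempts):
        attempt = dict(body)
        current_rev = db.get_rev(doc_id)  # re-fetch on every attempt
        if current_rev is not None:
            attempt["_rev"] = current_rev
        try:
            return db.put(doc_id, attempt)
        except ConflictError:
            continue
    raise DocumentConflictError(
        f"{doc_id}: still conflicting after {max_attempts} attempts"
    )
```

Re-fetching the revision inside the loop is what lets a later attempt succeed after another writer has won the race.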
Fetching a harvest by an unknown ID → the store returns nothing; callers raise
ResourceNotFoundError.
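The division of responsibility between store and caller can be illustrated like this. The function names and the dict-based storage are hypothetical; only the behaviour (the store returns nothing, the caller raises ResourceNotFoundError) comes from the description above.

```python
from typing import Any, Dict, Optional


class ResourceNotFoundError(Exception):
    """Stand-in for the error raised by callers."""


def get_harvest(docs: Dict[str, Dict[str, Any]], harvest_id: str) -> Optional[Dict[str, Any]]:
    """Store side: return the harvest document, or None if unknown."""
    return docs.get(harvest_id)


def require_harvest(docs: Dict[str, Dict[str, Any]], harvest_id: str) -> Dict[str, Any]:
    """Caller side: turn a missing document into an error."""
    harvest = get_harvest(docs, harvest_id)
    if harvest is None:
        raise ResourceNotFoundError(f"harvest {harvest_id!r} not found")
    return harvest
```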