Server-side harvest error tracking: duplicate ARC detection and per-item error persistence in CouchDB #240

@Zalfsten

Problem

Currently, per-item errors during a harvest run (duplicate ARC identifiers, failed ARC submissions) are handled client-side only inside harvest_arcs() of the api_client. They are counted and logged, but never persisted. As a result:

  • A harvester building on top of the api_client cannot generate a complete error report.
  • After the harvest, GET /v3/harvests/{id} returns no information about which ARCs failed or were duplicates.
  • Querying old harvests for diagnostic purposes is impossible.

Goal

Move all harvest error tracking to the server side so that CouchDB becomes the single source of truth. Any client that submits ARCs to a harvest should be able to retrieve the full error list afterwards via the normal harvest query endpoints.

Proposed Solution (Option C — fully server-side)

1. Server-side duplicate detection

The server tracks a per-harvest mapping of ARC identifier → arc_id (content hash). When the same identifier is submitted a second time within the same harvest run:

  • The server records a DUPLICATE error in the harvest document.
  • The second submission is rejected with a 409 Conflict response.
  • The harvest continues normally (non-catastrophic error).

This requires a new CouchDB index on (type=arc, metadata.last_harvest_id, identifier) or a lightweight lookup table stored as part of the harvest document.
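The duplicate check described above can be sketched in memory as follows. This is only an illustration under stated assumptions: `HarvestState` and `submit_arc` are hypothetical names, the errors are plain dicts rather than the proposed model, the 201 status for an accepted submission is assumed, and a real implementation would keep the mapping in CouchDB (via the index or the lookup table mentioned above) rather than in a Python dict.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from http import HTTPStatus


@dataclass
class HarvestState:
    """Hypothetical in-memory stand-in for one harvest document."""
    # ARC identifier -> arc_id (content hash) seen in this harvest run
    seen: dict[str, str] = field(default_factory=dict)
    # Recorded per-item errors (plain dicts here for brevity)
    errors: list[dict] = field(default_factory=list)


def submit_arc(harvest: HarvestState, identifier: str, content_hash: str) -> HTTPStatus:
    """Accept the first submission of an identifier, reject repeats with 409."""
    if identifier in harvest.seen:
        # Non-catastrophic: record the error and let the harvest continue.
        harvest.errors.append({
            "arc_id": identifier,
            "error_type": "duplicate",
            "message": f"Identifier {identifier!r} was already submitted in this harvest",
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
        return HTTPStatus.CONFLICT
    harvest.seen[identifier] = content_hash
    return HTTPStatus.CREATED  # assumed success status, not specified in this issue


harvest = HarvestState()
first = submit_arc(harvest, "arc-001", "hash-a")
second = submit_arc(harvest, "arc-001", "hash-b")  # same identifier again
third = submit_arc(harvest, "arc-002", "hash-c")
```

Keeping the mapping per harvest (rather than global) means the same identifier can still be resubmitted in a later harvest run without being flagged.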

2. New HarvestError model

Add to harvest_document.py (server) and to the shared API models:

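from datetime import datetime
from enum import StrEnum

from pydantic import BaseModel

class HarvestErrorType(StrEnum):
    """Kinds of per-item errors recorded during a harvest run."""
    DUPLICATE = "duplicate"
    SUBMISSION_FAILED = "submission_failed"

class HarvestError(BaseModel):
    """One per-item error, persisted in the harvest document."""
    arc_id: str | None           # ARC identifier (not the content hash), if known
    error_type: HarvestErrorType
    message: str
    timestamp: datetime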

3. HarvestDocument extended

from pydantic import BaseModel, Field

class HarvestDocument(BaseModel):
    ...
    errors: list[HarvestError] = Field(default_factory=list)  # empty when no errors were recorded

4. API response updated

HarvestResponse and HarvestResult (api_client) include the errors list. All harvest query methods (get_harvest, list_harvests) naturally return the errors.

5. api_client simplified

  • Remove client-side duplicate detection logic from _submit_arcs_parallel().
  • Remove the failed_submissions counter.
  • harvest_arcs() returns a plain HarvestResult — the caller reads result.errors for the full picture.
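As a rough sketch of what a caller could do with result.errors after this change: the dataclasses below are stdlib stand-ins for the proposed models (the timestamp is omitted), `harvest_id` and `summarize_errors` are hypothetical names, and the sample errors are made-up values for illustration only.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HarvestError:
    """Stdlib stand-in for the proposed pydantic model (timestamp omitted)."""
    arc_id: Optional[str]
    error_type: str
    message: str


@dataclass
class HarvestResult:
    """Hypothetical minimal shape; the real HarvestResult has more fields."""
    harvest_id: str
    errors: list[HarvestError] = field(default_factory=list)


def summarize_errors(result: HarvestResult) -> str:
    """Build a short, human-readable error report from result.errors."""
    if not result.errors:
        return f"Harvest {result.harvest_id}: no errors"
    counts = Counter(e.error_type for e in result.errors)
    lines = [f"Harvest {result.harvest_id}: {len(result.errors)} error(s)"]
    # Per-type totals first, then one line per failed item.
    for error_type, count in sorted(counts.items()):
        lines.append(f"  {error_type}: {count}")
    for e in result.errors:
        lines.append(f"  - [{e.error_type}] {e.arc_id or '<unknown>'}: {e.message}")
    return "\n".join(lines)


# Made-up sample data, only to show the shape of the report.
result = HarvestResult(
    harvest_id="h-1",
    errors=[
        HarvestError("arc-001", "duplicate", "already submitted"),
        HarvestError(None, "submission_failed", "server returned 500"),
    ],
)
report = summarize_errors(result)
print(report)
```

Because the errors come from the server, the same report could be produced later from get_harvest for an old harvest, not only directly after harvest_arcs().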

Affected Components

| Component | Change |
| --- | --- |
| middleware/api/src/middleware/api/document_store/harvest_document.py | Add HarvestError, HarvestErrorType; extend HarvestDocument |
| middleware/api/src/middleware/api/document_store/couchdb.py | Implement server-side identifier tracking per harvest; append_harvest_error() |
| middleware/api/src/middleware/api/document_store/__init__.py | Extend DocumentStore interface |
| middleware/api/src/middleware/api/business_logic/arc_manager.py | Record duplicate / submission errors in harvest |
| middleware/shared/src/middleware/shared/api_models/v3/models.py | Add errors to HarvestResponse |
| middleware/api_client/src/middleware/api_client/models.py | Add errors: list[HarvestError] to HarvestResult |
| middleware/api_client/src/middleware/api_client/api_client.py | Remove client-side duplicate detection; simplify _submit_arcs_parallel() |

Open Questions

  • Should errors be paginated in GET /v3/harvests/{id} for very large error lists (e.g. 10 000 duplicates)?
  • Should HarvestStatistics include an aggregate error count alongside arcs_submitted etc., or should clients rely solely on len(errors)?
  • Backward compatibility: old HarvestDocument records in CouchDB have no errors field — the model validator should default to [].
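On the backward-compatibility question, a minimal sketch of the intended behaviour, using a stdlib dataclass as a stand-in for the pydantic model: `from_couch` and `doc_id` are hypothetical names, and the raw dicts are made-up examples of an old and a new record. With pydantic, declaring the field with `Field(default_factory=list)` should give the same result for a missing key, so an explicit validator may only be needed if old records can contain `"errors": null`.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class HarvestDocument:
    """Stdlib stand-in with only the fields needed for this illustration."""
    doc_id: str
    errors: list[dict[str, Any]] = field(default_factory=list)

    @classmethod
    def from_couch(cls, raw: dict[str, Any]) -> "HarvestDocument":
        # Old records have no "errors" key (or possibly null);
        # both cases are treated as an empty list.
        return cls(doc_id=raw["_id"], errors=list(raw.get("errors") or []))


# A record written before the errors field existed
old_doc = HarvestDocument.from_couch({"_id": "harvest-1"})

# A record written after the change
new_doc = HarvestDocument.from_couch({
    "_id": "harvest-2",
    "errors": [{"arc_id": "arc-001", "error_type": "duplicate"}],
})
```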
