Problem
Currently, per-item errors during a harvest run (duplicate ARC identifiers, failed ARC submissions) are handled client-side only inside harvest_arcs() of the api_client. They are counted and logged, but never persisted. As a result:
- A harvester building on top of the api_client cannot generate a complete error report.
- After the harvest, GET /v3/harvests/{id} returns no information about which ARCs failed or were duplicates.
- Querying old harvests for diagnostic purposes is impossible.
Goal
Move all harvest error tracking to the server side so that CouchDB becomes the single source of truth. Any client that submits ARCs to a harvest should be able to retrieve the full error list afterwards via the normal harvest query endpoints.
Proposed Solution (Option C — fully server-side)
1. Server-side duplicate detection
The server tracks a per-harvest mapping of ARC identifier → arc_id (content hash). When the same identifier is submitted a second time within the same harvest run:
- The server records a DUPLICATE error in the harvest document.
- The second submission is rejected with a 409 Conflict response.
- The harvest continues normally (non-catastrophic error).
This requires a new CouchDB index on (type=arc, metadata.last_harvest_id, identifier) or a lightweight lookup table stored as part of the harvest document.
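The duplicate check described above could be sketched roughly as follows. This is an in-memory illustration only, not the CouchDB implementation: HarvestIdentifierTracker and submit() are hypothetical names, the 201 status for an accepted submission is an assumption (only the 409 is specified here), and a real server would persist the mapping in an index or in the harvest document.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class HarvestIdentifierTracker:
    """Hypothetical in-memory stand-in for the per-harvest lookup table."""

    # Maps ARC identifier -> arc_id (content hash) for one harvest run.
    seen: dict = field(default_factory=dict)
    # Error records that would be appended to the harvest document.
    errors: list = field(default_factory=list)

    def submit(self, identifier: str, arc_id: str) -> int:
        """Return the HTTP status the server would answer with."""
        if identifier in self.seen:
            # Second submission of the same identifier: record and reject.
            self.errors.append(
                {
                    "arc_id": identifier,  # ARC identifier, not the content hash
                    "error_type": "duplicate",
                    "message": f"identifier {identifier!r} already submitted",
                    "timestamp": datetime.now(timezone.utc).isoformat(),
                }
            )
            return 409
        self.seen[identifier] = arc_id
        return 201  # assumed success status, not specified in this proposal


tracker = HarvestIdentifierTracker()
first = tracker.submit("arc-1", "hash-a")   # accepted
second = tracker.submit("arc-1", "hash-b")  # rejected as duplicate
```

The tracker keeps accepting other identifiers after a duplicate, which mirrors the requirement that the harvest continues normally.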
2. New HarvestError model
Add to harvest_document.py (server) and to the shared API models:
from datetime import datetime
from enum import StrEnum  # requires Python 3.11+

from pydantic import BaseModel


class HarvestErrorType(StrEnum):
    DUPLICATE = "duplicate"
    SUBMISSION_FAILED = "submission_failed"

class HarvestError(BaseModel):
    arc_id: str | None  # ARC identifier (not content hash)
    error_type: HarvestErrorType
    message: str
    timestamp: datetime
3. HarvestDocument extended
class HarvestDocument(BaseModel):
    ...
    errors: list[HarvestError] = Field(default_factory=list)
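To illustrate how an error might be appended to a stored harvest document, here is a dependency-free sketch that treats the document as a plain dict. The signature of append_harvest_error() shown here is an assumption (this proposal only names the function), and the "_id" and "type" fields of the sample document are illustrative.

```python
from datetime import datetime, timezone
from typing import Optional


def append_harvest_error(
    harvest_doc: dict,
    identifier: Optional[str],
    error_type: str,
    message: str,
) -> dict:
    """Append one error record to a harvest document held as a plain dict."""
    error = {
        "arc_id": identifier,  # ARC identifier, may be None if unknown
        "error_type": error_type,
        "message": message,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    # setdefault also covers documents written before the errors field existed.
    harvest_doc.setdefault("errors", []).append(error)
    return harvest_doc


# Legacy record without an "errors" field (field names are illustrative).
old_doc = {"_id": "harvest-1", "type": "harvest"}
append_harvest_error(old_doc, "arc-42", "submission_failed", "validation error")
```

Using setdefault means the same code path works for new documents and for records created before the errors field was introduced.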
4. API response updated
HarvestResponse and HarvestResult (api_client) include the errors list. All harvest query methods (get_harvest, list_harvests) naturally return the errors.
5. api_client simplified
- Remove client-side duplicate detection logic from _submit_arcs_parallel().
- Remove the failed_submissions counter.
- harvest_arcs() returns a plain HarvestResult; the caller reads result.errors for the full picture.
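From the caller's side, building an error report from result.errors might look like the sketch below. The dataclasses are simplified stand-ins for the pydantic models proposed above (the timestamp field is omitted and a str-based Enum replaces StrEnum so the example runs on older Python versions), and summarize_errors() is a hypothetical helper, not part of the api_client.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class HarvestErrorType(str, Enum):
    DUPLICATE = "duplicate"
    SUBMISSION_FAILED = "submission_failed"


@dataclass
class HarvestError:
    arc_id: Optional[str]  # ARC identifier (not content hash)
    error_type: HarvestErrorType
    message: str


@dataclass
class HarvestResult:
    errors: list = field(default_factory=list)


def summarize_errors(result: HarvestResult) -> dict:
    """Count errors per error type (hypothetical helper)."""
    return dict(Counter(err.error_type.value for err in result.errors))


# In real use, result would come from harvest_arcs(); here it is built by hand.
result = HarvestResult(
    errors=[
        HarvestError("arc-1", HarvestErrorType.DUPLICATE, "identifier already submitted"),
        HarvestError("arc-7", HarvestErrorType.SUBMISSION_FAILED, "server rejected ARC"),
        HarvestError("arc-1", HarvestErrorType.DUPLICATE, "identifier already submitted"),
    ]
)
summary = summarize_errors(result)
```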
Affected Components
| Component | Change |
|---|---|
| middleware/api/src/middleware/api/document_store/harvest_document.py | Add HarvestError, HarvestErrorType; extend HarvestDocument |
| middleware/api/src/middleware/api/document_store/couchdb.py | Implement server-side identifier tracking per harvest; add append_harvest_error() |
| middleware/api/src/middleware/api/document_store/__init__.py | Extend DocumentStore interface |
| middleware/api/src/middleware/api/business_logic/arc_manager.py | Record duplicate / submission errors in harvest |
| middleware/shared/src/middleware/shared/api_models/v3/models.py | Add errors to HarvestResponse |
| middleware/api_client/src/middleware/api_client/models.py | Add errors: list[HarvestError] to HarvestResult |
| middleware/api_client/src/middleware/api_client/api_client.py | Remove client-side duplicate detection; simplify _submit_arcs_parallel() |
Open Questions
- Should errors be paginated in GET /v3/harvests/{id} for very large error lists (e.g. 10 000 duplicates)?
- Should HarvestStatistics aggregate an errors count alongside arcs_submitted etc., or rely solely on len(errors)?
- Backward compatibility: old HarvestDocument records in CouchDB have no errors field, so the model validator should default to [].
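If pagination turns out to be needed, one possible shape is an offset/limit slice that also reports the total, which would let clients get a count without fetching every error. This is only a sketch to make the first two questions concrete: paginate_errors(), its parameters, and the response keys are all hypothetical and not part of the current API.

```python
def paginate_errors(errors: list, offset: int = 0, limit: int = 100) -> dict:
    """Return one page of errors plus the total, so clients can iterate."""
    if offset < 0 or limit < 1:
        raise ValueError("offset must be >= 0 and limit must be >= 1")
    return {
        "total": len(errors),  # full count, independent of the page size
        "offset": offset,
        "limit": limit,
        "errors": errors[offset:offset + limit],
    }


# Integers stand in for error records to keep the example short.
all_errors = list(range(250))
last_page = paginate_errors(all_errors, offset=200, limit=100)
```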