Context
During a code review of the ARC status model, we identified a mismatch between the intended state space (as expressed by the enum definitions) and the actually implemented state space (as reflected by the code that reads and writes these values). This issue captures the desired target state and the work needed to reach it.
Current situation
Three distinct concepts — currently conflated
| Concept | Today | Problem |
|---|---|---|
| Operation result | ArcStatus (CREATED, UPDATED, DELETED, REQUESTED) | DELETED and REQUESTED are never set; ArcStatus is not a state but a verb in the HTTP response body |
| Harvest lifecycle state | ArcLifecycleStatus (ACTIVE, MISSING, DELETED, PROCESSING, INVALID) | Only ACTIVE is ever written; PROCESSING and INVALID belong to the Git dimension, not the harvest dimension |
| Git sync state | GitMetadata.status (free string "PENDING"/"SYNCED"/"FAILED") | Never updated by the worker, so it always stays "PENDING"; GIT_PUSH_SUCCESS/FAILED is only written as an ArcEvent |
What ArcEvents are used for today
- Harvest statistics: get_harvest_statistics() reconstructs arcs_new / arcs_updated / arcs_unchanged by scanning the event log for ARC_CREATED / ARC_UPDATED events. This is fragile: statistics depend on event-log archaeology rather than a structured field.
- Git-sync outcome: the Celery worker appends GIT_PUSH_SUCCESS or GIT_PUSH_FAILED to the event log. This is the only way the worker reports back; GitMetadata.status is never touched.
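For contrast with the event-log scan, here is a minimal sketch of how the three counters could be derived from structured per-ARC flags. The is_new / has_changes flags are the ones this issue proposes for ArcMetadata; the reduced ArcMetadata class, HarvestStats, and summarize_harvest are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ArcMetadata:
    """Hypothetical, reduced stand-in for the real ArcMetadata model."""
    arc_id: str
    is_new: bool        # ARC was created by this harvest run
    has_changes: bool   # ARC content differed from the stored version


@dataclass
class HarvestStats:
    arcs_new: int = 0
    arcs_updated: int = 0
    arcs_unchanged: int = 0


def summarize_harvest(arcs: Iterable[ArcMetadata]) -> HarvestStats:
    """Count outcomes directly from structured flags, without reading events."""
    stats = HarvestStats()
    for arc in arcs:
        if arc.is_new:
            stats.arcs_new += 1
        elif arc.has_changes:
            stats.arcs_updated += 1
        else:
            stats.arcs_unchanged += 1
    return stats
```

With flags like these, the statistics no longer depend on which events happen to be present in the log.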
Desired target state
Concept 1 — Operation result (ArcStatus)
ArcStatus is not a state. It is an operation result returned in the HTTP response body, analogous to HTTP 201 vs 200. It should only contain values that are actually assigned:
ArcStatus: CREATED | UPDATED
DELETED and REQUESTED are removed (never assigned today; future deletion semantics belong to ArcLifecycleStatus).
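A minimal sketch of the reduced enum and of the 201-vs-200 analogy. The string values and the helper http_status_for are assumptions for illustration; the issue does not specify how the API maps results to HTTP status codes.

```python
from enum import Enum
from http import HTTPStatus


class ArcStatus(str, Enum):
    """Operation result reported in the response body (values assumed)."""
    CREATED = "CREATED"
    UPDATED = "UPDATED"


def http_status_for(result: ArcStatus) -> HTTPStatus:
    """Hypothetical helper: mirror the operation result in the HTTP status."""
    if result is ArcStatus.CREATED:
        return HTTPStatus.CREATED  # 201
    return HTTPStatus.OK           # 200
```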
Concept 2 — Harvest lifecycle state (ArcLifecycleStatus)
Describes where an ARC stands in the harvest cycle. Transitions are driven by harvest runs, not by Git sync.
ArcLifecycleStatus: ACTIVE | MISSING | DELETED
State diagram:
    (new ARC submitted)
            │
            ▼
         ACTIVE ◄────────────────────────────────┐
            │                                    │
            │ harvest completed,                 │ ARC reappears
            │ ARC not seen                       │ in a later harvest
            ▼                                    │
         MISSING ────────────────────────────────┘
            │
            │ missing for N consecutive harvests
            ▼
         DELETED ──── (ARC re-submitted) ──────► ACTIVE
PROCESSING and INVALID are removed from this enum — they describe the Git sync dimension, not the harvest dimension.
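The harvest-driven transitions can be sketched as a pure function applied to each ARC after a harvest run. This is a sketch under assumptions, not the implementation: ArcState, its missed_harvests counter, apply_harvest, and the max_missed parameter (the "N" in the diagram) are hypothetical. It also assumes an ARC always passes through MISSING before it can become DELETED.

```python
from dataclasses import dataclass
from enum import Enum


class ArcLifecycleStatus(str, Enum):
    ACTIVE = "ACTIVE"
    MISSING = "MISSING"
    DELETED = "DELETED"


@dataclass(frozen=True)
class ArcState:
    """Hypothetical per-ARC bookkeeping; the real document layout may differ."""
    status: ArcLifecycleStatus = ArcLifecycleStatus.ACTIVE
    missed_harvests: int = 0  # consecutive harvests in which the ARC was absent


def apply_harvest(state: ArcState, seen: bool, max_missed: int) -> ArcState:
    """Return the state after one completed harvest run."""
    if seen:
        # Seen again or re-submitted: back to ACTIVE, counter resets.
        return ArcState(ArcLifecycleStatus.ACTIVE, 0)
    if state.status is ArcLifecycleStatus.DELETED:
        return state  # stays DELETED until the ARC is re-submitted
    missed = state.missed_harvests + 1
    if state.status is ArcLifecycleStatus.MISSING and missed >= max_missed:
        return ArcState(ArcLifecycleStatus.DELETED, missed)
    return ArcState(ArcLifecycleStatus.MISSING, missed)
```

Keeping the function free of I/O makes the diagram testable on its own, independent of CouchDB.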
Concept 3 — Git sync state (GitSyncStatus)
Describes the outcome of the asynchronous GitLab synchronisation driven by the Celery worker.
GitSyncStatus: PENDING | SYNCING | SYNCED | FAILED
State diagram:
    PENDING ──► SYNCING ──► SYNCED
                   │
                   └──► FAILED ──► PENDING (on Celery retry)
GitMetadata.status is retyped from str to GitSyncStatus. The worker writes the new status to this field in addition to appending an ArcEvent.
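A minimal sketch of the retyping, using only the standard library. The real GitMetadata model and its framework are not shown in this issue, so the reduced dataclass here is a hypothetical stand-in. Deriving the enum from str is an assumption that keeps the already stored "PENDING"/"SYNCED"/"FAILED" strings readable.

```python
from dataclasses import dataclass
from enum import Enum


class GitSyncStatus(str, Enum):
    PENDING = "PENDING"
    SYNCING = "SYNCING"
    SYNCED = "SYNCED"
    FAILED = "FAILED"


@dataclass
class GitMetadata:
    """Hypothetical, reduced stand-in for the real GitMetadata model."""
    status: GitSyncStatus = GitSyncStatus.PENDING

    def __post_init__(self) -> None:
        # Coerce legacy free-string values; unknown strings raise ValueError.
        self.status = GitSyncStatus(self.status)
```

With this shape, a typo such as "SYNCHED" fails at construction time instead of being silently persisted.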
State machine & atomicity
Desired: explicit transition table
Rather than ad-hoc if statements scattered across couchdb.py, transitions should be validated centrally — ideally enforced inside the pre_save_validator hook that already exists on save_document:
    _LIFECYCLE_TRANSITIONS: dict[ArcLifecycleStatus, set[ArcLifecycleStatus]] = {
        ArcLifecycleStatus.ACTIVE: {ArcLifecycleStatus.MISSING},
        ArcLifecycleStatus.MISSING: {ArcLifecycleStatus.ACTIVE, ArcLifecycleStatus.DELETED},
        ArcLifecycleStatus.DELETED: {ArcLifecycleStatus.ACTIVE},
    }
    _GIT_TRANSITIONS: dict[GitSyncStatus, set[GitSyncStatus]] = {
        GitSyncStatus.PENDING: {GitSyncStatus.SYNCING},
        GitSyncStatus.SYNCING: {GitSyncStatus.SYNCED, GitSyncStatus.FAILED},
        GitSyncStatus.FAILED: {GitSyncStatus.PENDING},
        GitSyncStatus.SYNCED: set(),  # terminal unless ARC content changes → reset to PENDING
    }
Any attempt to write an invalid transition raises an error before touching CouchDB.
Atomicity analysis
| Transition | Atomic? | Reason |
|---|---|---|
| ArcLifecycleStatus changes (harvest path) | ✅ Yes | Single CouchDB document, protected by OCC retry in save_document |
| GitSyncStatus: PENDING → SYNCING + Celery enqueue | ❌ No | CouchDB write and RabbitMQ enqueue are two separate systems, so there is no distributed transaction |
| GitSyncStatus: SYNCING → SYNCED/FAILED | ✅ Yes | Worker writes only to CouchDB |
The non-atomic PENDING → SYNCING + enqueue transition is handled pragmatically, in the spirit of the outbox pattern: write to CouchDB first, then enqueue. If the process crashes between the two steps, the document stays stuck in SYNCING. A periodic watchdog job can detect such documents and reset them.
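The detection half of such a watchdog can be sketched as a pure filter. This assumes each document records when its Git status was last written, which this issue does not specify; SyncRecord, status_changed_at, and find_stuck_syncing are hypothetical. What "reset" means for a stuck document is left to the caller.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List


@dataclass(frozen=True)
class SyncRecord:
    """Hypothetical projection of an ARC document; field names are assumptions."""
    arc_id: str
    git_status: str
    status_changed_at: datetime  # when git_status was last written


def find_stuck_syncing(
    records: Iterable[SyncRecord], now: datetime, timeout: timedelta
) -> List[str]:
    """Return ids of records that have been SYNCING for longer than `timeout`."""
    return [
        r.arc_id
        for r in records
        if r.git_status == "SYNCING" and now - r.status_changed_at > timeout
    ]


# Usage with fixed timestamps so the result is reproducible.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
records = [
    SyncRecord("arc-1", "SYNCING", now - timedelta(hours=1)),
    SyncRecord("arc-2", "SYNCING", now - timedelta(minutes=5)),
    SyncRecord("arc-3", "SYNCED", now - timedelta(days=1)),
]
stuck = find_stuck_syncing(records, now, timedelta(minutes=15))
```

The timeout should comfortably exceed the longest expected Git push, otherwise the watchdog would flag documents that are still being processed.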
Tasks
- Remove DELETED and REQUESTED from ArcStatus; update all tests
- Remove PROCESSING and INVALID from ArcLifecycleStatus; update all tests
- Add GitSyncStatus enum (PENDING | SYNCING | SYNCED | FAILED) in middleware/shared/api_models/common/models.py
- Retype GitMetadata.status from str to GitSyncStatus in arc_document.py
- Write GitSyncStatus in the Celery worker (sync_to_gitlab): PENDING → SYNCING before Git push, SYNCING → SYNCED/FAILED after
- Implement ArcLifecycleStatus transition table; enforce it via pre_save_validator in save_document
- Implement GitSyncStatus transition table; enforce it in the worker
- Replace event-log scanning in get_harvest_statistics() with structured is_new / has_changes fields on ArcMetadata
- Expose git_sync_status in the v3 ArcMetadata response model and update the API client
- Update api_client/models.py (ArcStatus, ArcLifecycleStatus, add GitSyncStatus)
- Update middleware/api/spec/document-store/design.md
- Add a watchdog job for stuck SYNCING documents