5 changes: 5 additions & 0 deletions AGENTS.md
@@ -241,6 +241,10 @@ Before generating or modifying code, read the relevant spec folders:
- **[`middleware/api/spec/document-store/`](middleware/api/spec/document-store/)** — CouchDB persistence layer, race-condition-safe initialization, and content-hash idempotency.
- **[`middleware/api/spec/harvest-manager/`](middleware/api/spec/harvest-manager/)** — Harvest run lifecycle, ownership validation, and progress tracking.

+**API Client component** (`middleware/api_client/spec/`) — client internals:
+
+- **[`middleware/api_client/spec/harvest-client/`](middleware/api_client/spec/harvest-client/)** — Harvest lifecycle: parallel ARC submission, per-item error collection (`HarvestError`, `HarvestErrorType`), typed statistics (`HarvestStatistics`), and compatibility shim for issue #240.
+
For the AI agent workflow documentation, see [`docs/ai_workflow.md`](docs/ai_workflow.md).

### Spec-to-Code Mapping
@@ -256,6 +260,7 @@ The `spec-to-code` agent uses this table in Step 3 to locate affected code.
| `middleware/api/spec/harvest-manager/` | `middleware/api/src/middleware/api/business_logic/harvest_manager.py` |
| `middleware/api/spec/arc-upload/` | `middleware/api/src/middleware/api/api/v3/arcs.py` |
| `middleware/api/spec/harvest-arc-upload/` | `middleware/api/src/middleware/api/api/v3/harvests.py` |
+| `middleware/api_client/spec/harvest-client/` | `middleware/api_client/src/middleware/api_client/api_client.py`, `models.py` |
| `spec/` (project-level) | Follow links in **Architecture & Design** above to the affected component. |

---
6 changes: 3 additions & 3 deletions docker/Dockerfile.api
@@ -15,7 +15,7 @@ COPY pyproject.toml uv.lock ./
COPY middleware ./middleware

# Upgrade pip and install uv
-RUN pip install --no-cache-dir --upgrade pip==26.0.1 uv==0.11.7
+RUN pip install --no-cache-dir --upgrade pip==26.1.1 uv==0.11.16

# Build wheels
RUN uv build --package fairagro-middleware-shared --wheel && \
@@ -38,7 +38,7 @@ RUN apk add --no-cache \
WORKDIR /build

# Install uv and PyInstaller
-RUN pip install --no-cache-dir --upgrade pip==26.0.1 uv==0.11.7
+RUN pip install --no-cache-dir --upgrade pip==26.1.1 uv==0.11.16

# Copy built wheel from package-builder stage
COPY --from=package-builder /build/dist/*.whl /tmp/wheels/
@@ -100,7 +100,7 @@ ENV UVICORN_LOG_LEVEL=info

# Create non-root user and group and fix permissions
RUN apk add --no-cache --upgrade \
-curl=8.17.0-r1 \
+curl=8.19.0-r0 \
git=2.52.0-r0 \
zlib=1.3.2-r0 \
tzdata \
14 changes: 10 additions & 4 deletions docs/architecture.md
@@ -9,13 +9,17 @@ architecture-beta

group rdi2(database)[RDI 2]
service rdi2db(database)[DB] in rdi2
-service csw(server)[csw] in rdi2
+service csw(server)["CSW / INSPIRE"] in rdi2

+group rdi3(database)[RDI 3]
+service rdi3db(database)[DB] in rdi3
+service web(internet)["Web / schema.org"] in rdi3
+
group middleware(cloud)[Middleware]
service api(server)[API] in middleware
service db(database)[CouchDB] in middleware
service git(database)[DataHUB] in middleware
-service inspire2arc(server)[inspire2arc] in middleware
+service harvester(server)[Harvester] in middleware

service searchhub(server)[SearchHUB]
service sciwin(server)[SciWIn]
@@ -25,8 +29,10 @@ architecture-beta
sql2arc:L --> R:rdi1db
sql2arc:R --> L:api
csw:L --> R:rdi2db
-inspire2arc:T --> B:api
-inspire2arc:L --> R:csw
+web:L --> R:rdi3db
+harvester:T --> B:api
+harvester:L --> R:csw
+harvester:B --> T:web
searchhub:L --> R:git
sciwin:L --> B:git
```
62 changes: 62 additions & 0 deletions middleware/api_client/spec/harvest-client/design.md
@@ -0,0 +1,62 @@
# Harvest Client — Design

## Module Overview

`ApiClient` (`api_client.py`) orchestrates the harvest lifecycle.
`HarvestResult`, `HarvestStatistics`, `HarvestError`, and `HarvestErrorType`
(`models.py`) are the stable public types exposed to harvesters.

```text
harvester
└─→ ApiClient.harvest_arcs(rdi, arcs)
├─→ create_harvest → HarvestResult (RUNNING)
├─→ _submit_arcs_parallel
│ ├─→ duplicate check (client-side) → HarvestError(DUPLICATE)
│ └─→ POST v3/harvests/{id}/arcs → HarvestError(SUBMISSION_FAILED) on error
└─→ complete_harvest → HarvestResult (COMPLETED)
└─→ inject client_errors via model_copy → HarvestResult.errors
```

## Key Decisions

1. **`HarvestStatistics` is a typed Pydantic model, not `dict`**
— The server serializes its internal `HarvestStatistics` via `model_dump()`
before sending it over the wire. The field names and types are stable and
known. A typed model gives consumers validated, IDE-navigable fields rather
than requiring dict key lookups with no type safety.

2. **`HarvestError` is a client-facing type in `models.py`, independent of any server model**
— Per-item errors are currently generated client-side. When the server
persists them natively (issue #240), `_parse_harvest_response` will
populate `HarvestResult.errors` from the server response automatically —
the type and consumer interface remain unchanged.

3. **`arc_id: str | None` in `HarvestError`**
— The `DUPLICATE` and `SUBMISSION_FAILED` categories always have a
known ARC identifier (when one is extractable from the RO-Crate). Future
error categories — such as harvest-level timeouts or config failures —
may not be associated with any specific ARC. `None` is the semantically
correct representation; an empty string would be an invisible sentinel
value that callers would need to treat specially.

4. **Client-side error collection as compatibility shim until issue #240**
— `harvest_arcs()` collects errors from `_submit_arcs_parallel()` and
merges them into the server response via `model_copy(update=...)`.
This shim is removed once the server persists and returns per-item errors
natively. The `model_copy` merge is additive: if the server already
returns errors in its response (post-#240), client-side errors are
appended rather than overwriting.

5. **Duplicate detection is performed client-side before the HTTP request**
— Submitting both duplicates would cause the server to process two ARCs
with the same identifier in the same harvest run, resulting in an opaque
conflict. Client-side detection gives an explicit `DUPLICATE` error,
prevents the wasted round-trip, and avoids requiring the server to handle
intra-harvest identity conflicts.

6. **Item-level failures are non-fatal; harvest-level failures are fatal**
— A submission failure for one ARC (e.g. server 422 on bad content) must
not abort the entire harvest because the remaining ARCs may be valid. A
catastrophic failure (e.g. 401 Unauthorized, harvest already closed) means
no further submissions will succeed, so the harvest is aborted, marked
`FAILED`, and the exception propagates to the caller.
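
Decisions 2 to 4 can be illustrated with a minimal sketch. It is not the code from `models.py`: it uses standard-library dataclasses and `dataclasses.replace` as a stand-in for the Pydantic models and `model_copy(update=...)`, and the field names `error_type`, `message`, `timestamp`, `harvest_id`, and `status`, as well as the helper `merge_client_errors`, are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from enum import Enum


class HarvestErrorType(str, Enum):
    # Category values taken from the spec; the enum layout is an assumption.
    DUPLICATE = "duplicate"
    SUBMISSION_FAILED = "submission_failed"


@dataclass(frozen=True)
class HarvestError:
    error_type: HarvestErrorType  # hypothetical field name
    message: str                  # hypothetical field name
    timestamp: str                # ISO 8601 string; hypothetical field name
    arc_id: str | None = None     # None when no ARC is associated (decision 3)


@dataclass(frozen=True)
class HarvestResult:
    harvest_id: str               # hypothetical field name
    status: str                   # hypothetical field name
    errors: tuple[HarvestError, ...] = ()


def merge_client_errors(
    server_result: HarvestResult, client_errors: list[HarvestError]
) -> HarvestResult:
    """Additive merge (decision 4): keep server errors, append client errors.

    dataclasses.replace plays the role that model_copy(update=...) plays
    for a Pydantic model: it returns a copy and leaves the input untouched.
    """
    merged = (*server_result.errors, *client_errors)
    return replace(server_result, errors=merged)
```

Because the merge returns a copy, the server response object is never mutated, and a response that already carries errors (the post-#240 case) keeps them in front of the client-side ones.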
50 changes: 50 additions & 0 deletions middleware/api_client/spec/harvest-client/spec.md
@@ -0,0 +1,50 @@
# Harvest Client

Manage the full lifecycle of a harvest run — creation, parallel ARC
submission, error collection, and finalization — on behalf of a harvester
process. The client returns a typed result that captures both statistics
and per-item errors so harvesters can produce complete reports.

## Requirements

- [ ] Create a harvest run for a given RDI, submit all ARCs from an async
source in bounded parallelism, and return the completed harvest result
as a single operation.
- [ ] Accept an optional expected-dataset count at the start of a harvest to
enable progress tracking on the server side.
- [ ] Return typed harvest statistics (submitted, new, updated, unchanged,
missing counts, and optional expected-dataset count) as structured
fields rather than an opaque mapping.
- [ ] Record per-item errors encountered during submission and include them
in the returned harvest result.
- [ ] Classify each per-item error into one of the following categories:
`duplicate` (two ARCs share the same identifier) or `submission_failed`
(the server rejected or could not process the ARC).
- [ ] Each per-item error carries: the error category, a human-readable
message, and an ISO 8601 timestamp of when the error occurred.
- [ ] Optionally associate a per-item error with an ARC identifier; errors
that do not relate to a specific ARC (e.g. harvest-level failures) may
omit the identifier.
- [ ] Detect duplicate ARC identifiers before submission and record them as
`duplicate` errors; do not submit the duplicate.
- [ ] Skip individual ARC submission failures and continue the harvest with
remaining items; record each failure as a `submission_failed` error.
- [ ] Abort the entire harvest on catastrophic errors (e.g. authentication
failure, invalid harvest state) and mark the harvest as failed before
propagating the exception to the caller.
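
The submission requirements above (bounded parallelism, non-fatal item failures, abort on catastrophic errors) could be sketched roughly as follows. This is an illustration under stated assumptions, not the behaviour of `_submit_arcs_parallel`: `ItemError`, `FatalError`, `submit_all`, and the `submit` callback are hypothetical names, the ARC source is a plain list rather than an async source, and marking the harvest as `FAILED` on the server is left out.

```python
import asyncio


class ItemError(Exception):
    """Hypothetical: the server rejected a single ARC (e.g. HTTP 422)."""


class FatalError(Exception):
    """Hypothetical: no further submission can succeed (e.g. HTTP 401)."""


async def submit_all(arcs, submit, max_parallel=4):
    """Submit ARCs concurrently, at most max_parallel at a time.

    Returns (submitted_count, item_errors). A FatalError cancels the
    outstanding tasks and is re-raised to the caller.
    """
    semaphore = asyncio.Semaphore(max_parallel)
    item_errors = []
    submitted = 0

    async def worker(arc):
        nonlocal submitted
        async with semaphore:
            try:
                await submit(arc)
                submitted += 1
            except ItemError as exc:
                # Item-level failure: record it and keep going.
                item_errors.append((arc, str(exc)))

    tasks = [asyncio.create_task(worker(arc)) for arc in arcs]
    try:
        await asyncio.gather(*tasks)
    except FatalError:
        # Harvest-level failure: cancel whatever is still pending,
        # wait for the cancellations to settle, then propagate.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
        raise
    return submitted, item_errors
```

A caller would wrap `submit_all` in its own `try`/`except FatalError` block to transition the harvest to `FAILED` before re-raising.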

## Edge Cases

ARC with no extractable RO-Crate identifier → submitted normally; any
resulting error records no ARC identifier (`null`).

Two ARCs share the same identifier → the second is skipped; a `duplicate`
error is recorded for it; the first continues to be submitted normally.

Catastrophic error during submission → remaining tasks are cancelled; the
harvest is transitioned to `FAILED`; the exception propagates to the caller.

No per-item errors → the returned result contains an empty errors list.

`expected_datasets` not provided → harvest is created without a progress
denominator; statistics show raw counts only.
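
The first two edge cases can be made concrete with a small sketch of client-side duplicate detection. The function name `partition_duplicates` and the `get_identifier` callback are hypothetical, and ARCs are modelled as plain dictionaries purely for illustration.

```python
def partition_duplicates(arcs, get_identifier):
    """Split ARCs into those to submit and the identifiers seen twice.

    ARCs without an extractable identifier (None) are never treated as
    duplicates of each other and are always submitted.
    """
    seen = set()
    to_submit = []
    duplicate_ids = []
    for arc in arcs:
        identifier = get_identifier(arc)
        if identifier is not None and identifier in seen:
            # Second occurrence: skip it and report a duplicate.
            duplicate_ids.append(identifier)
            continue
        if identifier is not None:
            seen.add(identifier)
        to_submit.append(arc)
    return to_submit, duplicate_ids


# Usage: two ARCs share "arc-1"; two ARCs have no identifier at all.
arcs = [{"id": "arc-1"}, {"id": None}, {"id": "arc-1"}, {"id": None}]
to_submit, duplicate_ids = partition_duplicates(arcs, lambda arc: arc["id"])
```

In this usage the first `arc-1` and both identifier-less ARCs are kept for submission, while the second `arc-1` is reported once as a duplicate.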
16 changes: 15 additions & 1 deletion middleware/api_client/src/middleware/api_client/__init__.py
@@ -2,7 +2,18 @@

from .api_client import ApiClient, ApiClientError
from .config import Config
-from .models import ArcEventSummary, ArcLifecycleStatus, ArcMetadata, ArcResult, ArcStatus, HarvestResult, HarvestStatus
+from .models import (
+ArcEventSummary,
+ArcLifecycleStatus,
+ArcMetadata,
+ArcResult,
+ArcStatus,
+HarvestError,
+HarvestErrorType,
+HarvestResult,
+HarvestStatistics,
+HarvestStatus,
+)

__all__ = [
"Config",
@@ -14,5 +25,8 @@
"ArcMetadata",
"ArcEventSummary",
"HarvestResult",
+"HarvestStatistics",
"HarvestStatus",
+"HarvestError",
+"HarvestErrorType",
]