Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
22 commits
Select commit Hold shift + click to select a range
0c0b3f9
Remove comited by mistake
MrHinsh Jun 30, 2026
03a8523
fix: replace fragile fan-out channels with unified agent→CP→CLI comms…
MrHinsh Jun 30, 2026
64f8998
Merge branch 'main' into update-for-comms
MrHinsh Jun 30, 2026
6d9dd4f
docs: update all comms-model docs for iron-comms Phases A-E
MrHinsh Jun 30, 2026
e615caf
fix: resolve enum serialisation mismatch and flush race in agent→CP c…
MrHinsh Jun 30, 2026
645f97e
feat: add EnqueueMetrics and EnqueueSnapshot to UnifiedWorkerEventWriter
MrHinsh Jul 1, 2026
799825b
refactor: route ControlPlaneTelemetryTimer through UnifiedWorkerEvent…
MrHinsh Jul 1, 2026
9145da1
docs: fix stale lease-gating doc comment in ControlPlaneTelemetryTimer
MrHinsh Jul 1, 2026
f61831c
refactor: route SignalTerminalAsync through UnifiedWorkerEventWriter
MrHinsh Jul 1, 2026
6c68768
fix: repair TfsJobAgentWorkerTests terminal-signal assertions after S…
MrHinsh Jul 1, 2026
56daddc
fix: repair JobAgentWorkerDispatchTests terminal-signal assertions an…
MrHinsh Jul 1, 2026
43a2c73
refactor: remove ControlPlaneLoggerProvider legacy HTTP fallback path
MrHinsh Jul 1, 2026
919a9fe
refactor: migrate TfsJobAgentWorker task-list push off IControlPlaneT…
MrHinsh Jul 1, 2026
2faccc8
Merge branch 'update-for-comms' of https://github.com/nkdAgility/azur…
MrHinsh Jul 1, 2026
a0ffdf8
refactor: delete dead ControlPlaneProgressSink and ControlPlaneTeleme…
MrHinsh Jul 1, 2026
027cb6e
refactor: delete IControlPlaneTelemetryClient and scrub stale doc ref…
MrHinsh Jul 1, 2026
d6e8908
fix: address PR review findings in comms channel
MrHinsh Jul 1, 2026
5894545
fix: eliminate SSE stream abort race at job termination
MrHinsh Jul 1, 2026
d0c02c1
refactor: remove legacy per-signal agent endpoints from ControlPlane
MrHinsh Jul 1, 2026
e872281
fix: align store cap tests and options with Phase D append-only contract
MrHinsh Jul 1, 2026
7d7333c
docs: update all comms documentation for the unified worker-event cha…
MrHinsh Jul 1, 2026
acc215c
feat: auto-reconnect the unified job stream with replay from last seq…
MrHinsh Jul 1, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
26 changes: 22 additions & 4 deletions .agents/10-contracts/specs/lease-coordination-contract.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,25 +15,43 @@ Canonical contract for lease polling, job dispatch, and terminal signaling.

1. Worker polls control plane lease endpoint and dispatches leased jobs.
2. Lease state is set before dispatch and cleared after completion/failure.
3. Terminal lease status must be reported explicitly (`complete` or `fail`).
3. Terminal signals (`Terminal` kind with `failed` flag) are sent through `UnifiedWorkerEventWriter` as part of the unified event batch channel — not as separate `/complete` or `/fail` HTTP calls.
4. A 15-second heartbeat runs in parallel with job execution via `POST /agents/lease/{leaseId}/heartbeat`. The CP uses this to distinguish "agent alive but quiet" from "agent dead".

## Sequence Diagram

```mermaid
sequenceDiagram
participant AW as AgentWorkerBase
participant UEW as UnifiedWorkerEventWriter
participant CP as ControlPlane
participant LS as ActiveLeaseState
participant JW as JobAgentWorker

AW->>CP: GET /agents/lease?capabilities=...
CP-->>AW: leaseId + Job
AW->>LS: Set current lease
AW->>JW: OnJobAsync(job, leaseId)
AW->>UEW: Start (BackgroundService)

par Heartbeat loop (15 s)
loop Every 15 s
AW->>CP: POST /agents/lease/{leaseId}/heartbeat
CP-->>AW: 204 No Content
end
and Job execution
AW->>JW: OnJobAsync(job, leaseId)
JW->>UEW: EnqueueTasks / Emit(ProgressEvent) / EnqueueDiagnostic
Note over UEW: Batch ≤50 events or 500 ms
UEW->>CP: POST /workers/{workerId}/events (WorkerEventBatch)
CP-->>UEW: WorkerEventAck {lastAcceptedSeq}
end

alt Success
AW->>CP: POST /agents/lease/{leaseId}/complete
AW->>UEW: EnqueueTerminal(failed: false)
UEW->>CP: POST /workers/{workerId}/events [{kind: Terminal, failed: false}]
else Failure
AW->>CP: POST /agents/lease/{leaseId}/fail
AW->>UEW: EnqueueTerminal(failed: true)
UEW->>CP: POST /workers/{workerId}/events [{kind: Terminal, failed: true}]
end
AW->>LS: Clear current lease
```
Expand Down
73 changes: 56 additions & 17 deletions .agents/10-contracts/specs/observability-transport-contract.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,39 +7,78 @@ Canonical contract for runtime observability transport channels.
- `IProgressSink`
- `CompositeProgressSink`
- `PackageProgressSink`
- `ControlPlaneProgressSink`
- `ControlPlaneTelemetryClient`
- `ControlPlaneTelemetryTimer`
- `ControlPlaneLoggerProvider`
- `UnifiedWorkerEventWriter` ← **the only Agent → ControlPlane telemetry transport**
- `ControlPlaneLoggerProvider` (enqueues Diagnostic events through `UnifiedWorkerEventWriter`; it has no HTTP path of its own)
- `ControlPlaneTelemetryTimer` (samples metrics/snapshots and enqueues them through `UnifiedWorkerEventWriter`)
- `PackageLoggerProvider`
- `PlatformMetrics`
- `WorkerEvent`, `WorkerEventBatch`, `WorkerEventAck` — wire DTOs
- `WorkerEventsController` — sole CP ingestion endpoint (`POST /workers/{workerId}/events`)
- `JobStreamController` — unified CP → CLI SSE stream (`GET /jobs/{jobId}/stream?from={seq}`)

Removed (must not reappear): `ControlPlaneProgressSink`, `ControlPlaneTelemetryClient` / `IControlPlaneTelemetryClient`, the `ControlPlaneLoggerProvider` HTTP fallback, and the seven per-lease CP telemetry endpoints under `POST /agents/lease/{leaseId}/…` (progress, complete, fail, metrics, snapshot, tasks, diagnostics).

## Wire Schema

**Request** — `POST /workers/{workerId}/events`:

```json
WorkerEventBatch {
"workerId": "string",
"leaseId": "string",
"events": [
WorkerEvent {
"seq": long, // monotonic per worker
"timestamp": "ISO-8601",
"kind": "heartbeat" | "progress" | "diagnostic" | "metrics" | "snapshot" | "tasks" | "terminal",
"payloadJson": "string" // kind-specific payload, serialised JSON
}
]
}
```

`kind` values are the `WorkerEventKind` enum members Heartbeat / Progress / Diagnostic / Metrics / Snapshot / Tasks / Terminal, serialized as camelCase strings.

**Ack** — the CP validates that the route `workerId` matches the batch, resolves the job via the lease, dispatches each event kind to its store, and responds:

```json
WorkerEventAck { "lastAcceptedSeq": long }
```

## Required Semantics

1. Subsystems emit progress, diagnostics, traces, and metric snapshots through the canonical transport surfaces.
2. Progress is transported to both control-plane and package run logs.
3. Diagnostics are transported to control-plane diagnostics stream and package diagnostics log stream.
4. Telemetry snapshots are transported to control-plane telemetry endpoints.
5. Transport contract is cross-cutting and must preserve O-1..O-5 requirements.
2. Progress is transported to both control-plane (via `UnifiedWorkerEventWriter`) and package run logs (via `PackageProgressSink`).
3. Diagnostics are enqueued through `ControlPlaneLoggerProvider` → `UnifiedWorkerEventWriter` → CP; and written to the package diagnostics log via `PackageLoggerProvider`.
4. Task lists are enqueued through `UnifiedWorkerEventWriter.EnqueueTasks()`; metrics and snapshots via `ControlPlaneTelemetryTimer` → `UnifiedWorkerEventWriter`.
5. Terminal signals (job complete/fail) are enqueued via `AgentWorkerBase.SignalTerminalAsync` as `Terminal` events and are **flushed immediately** (batch timer bypassed).
6. Batching: a single unbounded channel drained by a background flush loop; a batch closes at ≤50 events or 500 ms, whichever first.
7. Delivery: acknowledged, not fire-and-forget. Failed batches retry with exponential backoff (up to 5 attempts); HTTP 429 is honoured (2 s back-off). No silent loss.
8. The only other agent-originated HTTP calls are `GET /agents/lease?capabilities=...` (lease acquisition long-poll) and `POST /agents/lease/{leaseId}/heartbeat` (15 s liveness).
9. CP storage is **append-only** per job (`JobProgressStore`, `DiagnosticLogStore`): events are never evicted; safety caps `MaxEventsPerJob` / `MaxRecordsPerJob` (default 50,000) discard further events with a warning once reached.
10. CP → CLI: `GET /jobs/{jobId}/stream?from={seq}` is the unified SSE stream — multiplexes progress + diagnostics, replays the full append-only log from `seq`, emits a heartbeat comment every 15 s, and closes with `event: job-ended` / `event: job-failed`. The CLI additionally polls `GET /jobs/{id}/bootstrap` for task lists and metrics.
11. Transport contract is cross-cutting and must preserve O-1..O-5 requirements.

## Sequence Diagram

```mermaid
sequenceDiagram
participant SUB as AnySubsystem
participant CPS as ControlPlaneProgressSink
participant UEW as UnifiedWorkerEventWriter
participant PPS as PackageProgressSink
participant CLP as ControlPlaneLoggerProvider
participant CTT as ControlPlaneTelemetryTimer
participant CP as ControlPlane
participant CLI as CLI/TUI

SUB->>CPS: Emit ProgressEvent
SUB->>PPS: Emit ProgressEvent
SUB->>UEW: IProgressSink.Emit(ProgressEvent)
SUB->>PPS: IProgressSink.Emit(ProgressEvent)
SUB->>CLP: ILogger records
SUB->>CTT: IMigrationMetrics instruments
CPS->>CP: POST progress event
CLP->>CP: POST diagnostics record
CTT->>CP: POST telemetry snapshot
CLP->>UEW: EnqueueDiagnostic(records[])
SUB->>UEW: EnqueueTasks(JobTaskList)
Note over UEW: Batch ≤50 events or 500 ms<br/>(Terminal flushes immediately)
UEW->>CP: POST /workers/{workerId}/events (WorkerEventBatch)
CP-->>UEW: WorkerEventAck {lastAcceptedSeq}
PPS-->>PPS: Append progress.ndjson
CLI->>CP: GET /jobs/{jobId}/stream?from={seq}
CP-->>CLI: SSE replay + live (progress + diagnostics)
```

2 changes: 1 addition & 1 deletion .agents/10-contracts/specs/task-plan-contract.md
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,6 @@ sequenceDiagram
TB->>ST: Write plan.json
TB-->>JW: Return fresh JobTaskList
end
JW->>CP: POST /agents/lease/{leaseId}/tasks
JW->>CP: EnqueueTasks via UnifiedWorkerEventWriter → POST /workers/{workerId}/events (Tasks kind)
```

2 changes: 1 addition & 1 deletion .agents/10-contracts/specs/validation-safety-contract.md
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,7 @@ sequenceDiagram
PV-->>JW: ValidationResult
alt Validation failed
JW->>PS: Emit failure ProgressEvent
JW->>CP: POST /agents/lease/{leaseId}/fail
JW->>CP: POST /workers/{workerId}/events (Terminal: fail — flushed immediately)
else Validation passed
JW-->>JW: Continue phase execution
end
Expand Down
4 changes: 2 additions & 2 deletions .agents/20-guardrails/workflow/definition-of-done.md
Original file line number Diff line number Diff line change
Expand Up @@ -37,8 +37,8 @@ Every module/tool must pass all four checks:

**Pipeline wiring:** Verify both paths are intact:

- Metrics path: Module → `IMigrationMetrics` → OTel → `SnapshotMetricExporter` → `JobMetrics` → `POST /telemetry` → CLI polls `GET /jobs/{id}/telemetry` → `BuildProgressRenderable`
- Progress path: Module → `IProgressSink.Emit` → `ControlPlaneProgressSink` → `POST /progress` → SSE → CLI subscribes `GET /jobs/{id}/progress?follow=true`
- Metrics path: Module → `IMigrationMetrics` → OTel → `SnapshotMetricExporter` → `JobMetrics` → `UnifiedWorkerEventWriter` (Metrics kind) → `POST /workers/{workerId}/events` → CLI polls `GET /jobs/{id}/telemetry` → `BuildProgressRenderable`
- Progress path: Module → `IProgressSink.Emit` → `UnifiedWorkerEventWriter` (Progress kind) → `POST /workers/{workerId}/events` → CLI subscribes unified SSE `GET /jobs/{id}/stream?from={seq}`

**FAIL conditions:** Any link missing; counter read from `ProgressEvent.Metrics` in CLI/TUI (null for .NET 10 = silent zeros); direct `IProgressSink` wiring in CLI/TUI.

Expand Down
51 changes: 51 additions & 0 deletions .agents/20-guardrails/workflow/spec-coverage-completeness.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
# Guardrail: Spec Coverage Completeness

## Rule

Before implementing a spec that changes an interface, method, or communication
pattern, you MUST verify that the spec accounts for ALL call sites of the thing
being changed. If even one call site is not covered, **stop and ask the operator**
before writing any code.

## Why

Past incidents (e.g., migrating `IControlPlaneClient` in Phase E of the Iron
Communications plan) showed that specs written for one call site missed several
others. The agent implemented the change for the documented site and left the
remaining sites with the old pattern — creating an inconsistent codebase that
still compiled but violated the intended contract.

## How to Apply

1. When a spec says "change X" (method, interface, pattern, field), grep the
codebase for ALL usages of X before touching any code:
```
grep -r "FollowLogsAsync\|StreamDiagnosticsAsync\|GetTelemetryAsync" src/ tests/
```
2. Compare the grep results with the call sites listed in the spec.
3. If ANY call site is missing from the spec, surface the gap to the operator:

> **STOP — spec coverage gap detected.**
> The spec covers N call sites but the codebase has M. The following are
> not covered: [list]. Should I extend the spec to cover them, or is this
> intentional scope?

4. Do not proceed until the operator confirms the scope is complete or
explicitly accepts the gap.

## Scope

Applies to all specs that change:
- Interface members (add, remove, rename, change signature)
- Communication patterns (e.g., replacing SSE methods, changing HTTP verbs)
- DI registrations that affect multiple commands
- Shared abstractions used across more than one project

Does NOT apply to:
- Purely additive changes (new method, new endpoint) that do not touch existing callers
- Changes scoped to a single file with no shared consumers

## Verification After Implementation

After completing an implementation, run a final grep to confirm zero remaining
uses of the old pattern. If any remain, treat them as bugs — not scope creep.
39 changes: 0 additions & 39 deletions .output/nkda-testdsl/agent-job-context/01-feature-assessment.md

This file was deleted.

37 changes: 0 additions & 37 deletions .output/nkda-testdsl/agent-job-context/02-dsl-design.md

This file was deleted.

19 changes: 0 additions & 19 deletions .output/nkda-testdsl/agent-job-context/03-extraction-summary.md

This file was deleted.

12 changes: 0 additions & 12 deletions .output/nkda-testdsl/agent-job-context/04-conversion-summary.md

This file was deleted.

9 changes: 0 additions & 9 deletions .output/nkda-testdsl/agent-job-context/05-refactor-summary.md

This file was deleted.

24 changes: 0 additions & 24 deletions .output/nkda-testdsl/agent-job-context/06-verification.md

This file was deleted.

This file was deleted.

Loading
Loading