chore(doc): update roadmap

krokoko · krokoko · commit 03b8c831913e · 2026-04-01T10:22:56.000-05:00
diff --git a/docs/guides/ROADMAP.md b/docs/guides/ROADMAP.md
@@ -6,6 +6,21 @@ The order and scope of items may shift as we learn; the list below reflects curr
 
 ---
 
+## Ongoing engineering practice (cross-iteration)
+
+These practices apply continuously across iterations and are not treated as one-time feature milestones.
+
+- **Property-based correctness testing for orchestration invariants** — Complement example-based tests (Jest/pytest) with property-based testing (`fast-check` for TypeScript and `hypothesis` for Python) so randomized inputs and interleavings validate invariants over many runs. The goal is to verify safety properties that are timing-sensitive or hard to cover with scenario tests alone (for example, concurrent state transitions and lock/contention behavior).
+- **Machine-readable property catalog** — Maintain a versioned property set with an explicit mapping from each property to the code paths and tests that enforce it. Initial properties include:
+  - `P-ABCA-1` terminal-state immutability: tasks in `COMPLETED` / `FAILED` / `CANCELLED` / `TIMED_OUT` cannot transition further.
+  - `P-ABCA-2` concurrency counter consistency: for each user, `active_count` equals the number of tasks in active states (`SUBMITTED`, `HYDRATING`, `RUNNING`, `FINALIZING`).
+  - `P-ABCA-3` event ordering: `TaskEvents` are strictly increasing by `event_id` (ULID order).
+  - `P-ABCA-4` memory fallback guarantee: if task finalization sees `memory_written = false`, a fallback episode write is attempted and its result is observable.
+  - `P-ABCA-5` branch-name uniqueness: simultaneous tasks for the same repo generate distinct branch names (ULID-based suffix).
+- **Definition-of-done hook** — New orchestrator/concurrency changes should include: updated property mappings, at least one property-based test where applicable, and invariant notes in `ORCHESTRATOR.md` to keep docs and executable checks aligned.
+
+---
+
 ## Iteration 1 — First shippable slice (done)
 
 **Goal:** An agent runs on AWS in an isolated environment; user submits a task from the CLI and gets a PR when done.
@@ -137,6 +152,8 @@ The order and scope of items may shift as we learn; the list below reflects curr
 **Goal:** Multi-layered validation catches errors, enforces code quality, and assesses change risk before PRs are created; the platform supports more than one task type; multi-modal input broadens what users can express.
 
 - **Per-repo GitHub credentials (GitHub App)** — Replace the single shared OAuth token with a **GitHub App** installed per-organization or per-repository. Each onboarded repo is associated with a GitHub App installation that grants fine-grained permissions (read/write to that repo only). This eliminates the security gap where any authenticated user can trigger agent work against any repo the shared token can access. Token management (installation token generation, rotation) is handled by the platform, not by the agent. AgentCore Identity's token vault can store and refresh installation tokens. This is a prerequisite for any multi-user or multi-team deployment.
+- **Orchestrator pre-flight checks (fail-closed)** — Add a `pre-flight` step before `start-session` so doomed tasks fail fast without consuming AgentCore runtime. The orchestrator performs lightweight readiness checks with strict timeouts (for example, 5 seconds): verify GitHub API reachability, verify repository existence and credential access (`GET /repos/{owner}/{repo}` or equivalent), and optionally verify AgentCore Runtime availability when a status probe exists. If pre-flight fails, the task transitions to `FAILED` immediately with a clear terminal reason (`GITHUB_UNREACHABLE`, `REPO_NOT_FOUND_OR_NO_ACCESS`, `RUNTIME_UNAVAILABLE`), releases the concurrency slot, emits an event/notification, and does **not** invoke the agent. Unlike memory/context hydration (fail-open), pre-flight is explicitly fail-closed: inability to verify repo access blocks execution by design.
+- **Pre-execution task risk classification** — Add a lightweight risk classifier at task submission (before orchestration starts) to drive proportional controls for agent execution. Initial implementation can be rule-based and Blueprint-configurable: prompt keywords (for example, `database`, `auth`, `security`, `infrastructure`), metadata from issue labels, and file/path signals when available (for example, `**/migrations/**`, `**/.github/**`, infra directories). Persist `risk_level` (`low` / `medium` / `high` / `critical`) on the task record and use it to set defaults and policy: model tier/cascade, turn and budget defaults, prompt strictness/conservatism, approval requirements before merge, and optional autonomous-execution blocks for `critical` tasks. This is intentionally pre-execution and complements (does not replace) post-execution PR risk/blast-radius analysis.
 - **Tiered validation pipeline** — Three tiers of post-agent validation run sequentially after the agent finishes but before finalization. Each tier can fail the PR independently, and failure output is fed back to the agent for a fix cycle (capped at 2 retries per tier to bound cost). If the agent still fails, the PR is created with a validation report (labels, comments, and a risk summary) so the reviewer knows. All three tiers are implemented via the blueprint framework's Layer 2 custom steps (`phase: 'post-agent'`). See [REPO_ONBOARDING.md](../design/REPO_ONBOARDING.md#blueprint-execution-framework) for the 3-layer customization model, [ORCHESTRATOR.md](../design/ORCHESTRATOR.md) for the step execution contract, and [EVALUATION.md](../design/EVALUATION.md#tiered-validation-pipeline) for the full design.
   - **Tier 1 — Tool validation (build, test, lint)** — Run deterministic tooling: test suites, linters, type checkers, SAST scanners, or a custom script. This is the existing "deterministic validation" concept. Binary pass/fail; failures are concrete (test output, lint errors) and actionable by the agent in a fix cycle. Already partially implemented via the system prompt instructing the agent to run tests.
   - **Tier 2 — Code quality analysis** — Static analysis of the agent's diff against code quality principles: DRY (duplicated code detection), SOLID violations, design pattern adherence, complexity metrics (cyclomatic, cognitive), naming conventions, and repo-specific style rules (from onboarding config). Implemented as an LLM-based review step or a combination of static analysis tools (e.g. SonarQube rules, custom linters) and LLM judgment. Produces structured findings (severity, location, rule, suggestion) that the agent can act on in a fix cycle. Findings below a configurable severity threshold are advisory (included in the PR as comments) rather than blocking.
@@ -190,7 +207,8 @@ The order and scope of items may shift as we learn; the list below reflects curr
 - **Memory isolation for multi-tenancy** — AgentCore Memory has no per-namespace IAM isolation. For multi-tenant deployments, private repo knowledge could leak cross-repo unless isolation is enforced. Options: silo model (separate memory resource per org — strongest), pool model (single resource with strict application-layer namespace scoping — sufficient for single-org), or shared model (intentional cross-repo learning — only for same-org repos). The onboarding pipeline should create or assign memory resources based on the isolation model. See [SECURITY.md](../design/SECURITY.md) and [MEMORY.md](../design/MEMORY.md).
 - **Full cost management** — per-user and per-team monthly budgets, cost attribution dashboards (cost per task, per repo, per user), alerts when budgets are approaching limits. Token usage and compute cost are tracked per task and aggregated. The control panel (Iter 4) displays cost dashboards.
 - **Adaptive model router with cost-aware cascade** — Per-turn model selection via a lightweight heuristic engine. File reads and simple edits use a cheaper model (Haiku); multi-file refactors use Sonnet; complex reasoning escalates to Opus. Error escalation: if the agent fails twice on the same step, upgrade model for the retry. As the cost budget ceiling approaches, cascade down to cheaper models. Blueprint `modelCascade` config enables per-repo tuning. Potential 30-40% cost reduction on inference-dominated workloads. Requires agent harness changes to support mid-session model switching.
-- **Advanced evaluation and feedback loop** — Extend the basic evaluation pipeline from Iteration 3d: ML-based or LLM-based trace analysis (not just rules), A/B prompt comparison framework, automated feedback into prompt templates (e.g. "for repo X, always run tests before opening PR"), and per-repo or per-failure-type improvement tracking. Evaluation results can update the repo's agent configuration stored during onboarding.
+- **Advanced evaluation and feedback loop** — Extend the basic evaluation pipeline from Iteration 3d: ML-based or LLM-based trace analysis (not just rules), A/B prompt comparison framework, automated feedback into prompt templates (e.g. "for repo X, always run tests before opening PR"), and per-repo or per-failure-type improvement tracking. Evaluation results can update the repo's agent configuration stored during onboarding. **Optional patterns from adaptive teaching research** (e.g. plan → targeted critique → execution; separate **evaluator** vs **prompt/reflection** roles; fitness from LLM judging plus efficiency metrics; evolution of teaching templates from failed trajectories with Pareto-style candidate sets for diverse failure modes) can inform offline or scheduled improvement of Blueprint prompts and checklists without replacing ABCA's core orchestrator.
+- **Formal orchestrator verification (TLA+)** — Add a formal specification of the orchestrator in TLA+ and verify it with TLC model checking. Scope includes the task state machine (8 states, valid transitions, terminal states), concurrency admission control (atomic increment + max check), cancellation races (cancel arriving during any orchestration step), reconciler/orchestrator interleavings (counter drift correction while tasks are active), and the polling loop (agent writes terminal status, orchestrator observes and finalizes). Define invariants such as valid-state progression, no illegal transitions, and repo-level safety constraints (for example, at most one `RUNNING` task per repo when configured). Keep the spec aligned with `src/constructs/task-status.ts` and orchestrator docs so regressions surface as model-check counterexamples before production.
 - **Guardrails** — Natural-language or policy-based **guardrails** on agent tool calls using Amazon Bedrock Guardrails. Defends against prompt injection, restricts sensitive content generation, and enforces organizational policies (e.g. "do not modify files in `/infrastructure`"). See [SECURITY.md](../design/SECURITY.md). Guardrails configuration can be per-repo (via onboarding) or platform-wide.
 - **Capability-based security model** — Fine-grained enforcement beyond Bedrock Guardrails, operating at three levels: (1) **Tool-level capabilities** — Bash command allowlist (git, npm, make permitted; curl, wget blocked), configurable per capability tier (standard / elevated / read-only). (2) **File-system scope** — Blueprint declares include/exclude path patterns; Write/Edit/Read tools are filtered to the declared scope. (3) **Input trust scoring** — Authenticated user input = trusted; external GitHub issues = untrusted; PR review comments entering memory = adversarial. Trust level selects the capability set. Essential once review feedback memory (Iter 3d) introduces attacker-controlled content into the agent's context. Blueprint `security` prop configures the capability profile per repo.
 - **Additional execution environment** — Support an alternative to AgentCore Runtime (e.g. ECS/Fargate, EKS) behind the **ComputeStrategy** interface (see [REPO_ONBOARDING.md](../design/REPO_ONBOARDING.md#compute-strategy-interface)). The orchestrator calls abstract methods (`startSession`, `stopSession`, `pollSession`); the implementation maps to AgentCore, Fargate, or EKS. Repos select the strategy via `compute_type` in their blueprint configuration. Reduces vendor lock-in and enables workloads that exceed AgentCore limits (e.g. GPU, larger images, longer sessions). The ComputeStrategy interface contract is defined in Iteration 3a; Iteration 5 adds alternative implementations.
@@ -224,11 +242,11 @@ The order and scope of items may shift as we learn; the list below reflects curr
 - **Iteration 2** — Production orchestrator, API contract, task management (list/status/cancel), durable execution, observability, threat model, network isolation, basic cost guardrails, CI/CD.
 - **Iteration 3a** — Repo onboarding, DNS Firewall (domain-level egress filtering), webhook trigger, GitHub Actions, per-repo customization (prompt from repo), data retention, turn/iteration caps, cost budget caps, user prompt guide, agent harness improvements (turn budget, default branch, safety net, lint, softened conventions), operator dashboard, WAF, model invocation logging, input length limits.
 - **Iteration 3b** ✅ — Memory Tier 1 (repo knowledge, task episodes), insights, agent self-feedback, prompt versioning, per-prompt commit attribution. CDK L2 construct with named semantic + episodic strategies using namespace templates (`/{actorId}/knowledge/`, `/{actorId}/episodes/{sessionId}/`), fail-open memory load/write, orchestrator fallback episode, SHA-256 prompt hashing, git trailer attribution.
-- **Iteration 3c** — Per-repo GitHub App credentials, tiered validation pipeline (tool validation, code quality analysis, risk/blast radius analysis), PR risk level, PR review task type, multi-modal input.
+- **Iteration 3c** — Per-repo GitHub App credentials, orchestrator pre-flight checks (fail-closed before session start), pre-execution task risk classification (model/limits/approval policy selection), tiered validation pipeline (tool validation, code quality analysis, post-execution risk/blast radius analysis), PR risk level, PR review task type, multi-modal input.
 - **Iteration 3d** — Review feedback memory loop (Tier 2), PR outcome tracking, evaluation pipeline (basic).
 - **Iteration 3bis** (hardening) — Orchestrator IAM grant for Memory (was silently AccessDenied), memory schema versioning (`schema_version: "2"`), Python repo format validation, severity-aware error logging in Python memory, narrowed entrypoint try-catch, orchestrator fallback episode observability, conditional writes in agent task_state.py (ConditionExpression guards), orchestrator Lambda error alarm (CloudWatch, retryAttempts: 0), concurrency counter reconciliation (scheduled Lambda, drift correction), multi-AZ NAT documentation (already configurable), Python unit tests (pytest), entrypoint decomposition (4 extracted subfunctions), dual prompt assembly deprecation docstring, graceful thread drain in server.py (shutdown hook + atexit), dead QUEUED state removal (8 states, 4 active).
 - **Iteration 4** — Additional git providers, visual proof (screenshots/videos), Slack channel, skills pipeline, user preference memory (Tier 3), control panel (restrict CORS to dashboard origin), real-time event streaming (WebSocket), live session replay and mid-task nudge, browser extension client, MFA for production.
-- **Iteration 5** — Snapshot-on-schedule pre-warming, multi-user/team, memory isolation for multi-tenancy, full cost management, adaptive model router with cost-aware cascade, full Bedrock Guardrails (PII, denied topics, output filters), capability-based security model, alternate runtime, advanced customization with tiered tool access (MCP/plugins via AgentCore Gateway), full dashboard, AI-specific WAF rules.
+- **Iteration 5** — Snapshot-on-schedule pre-warming, multi-user/team, memory isolation for multi-tenancy, full cost management, adaptive model router with cost-aware cascade, advanced evaluation (optional adaptive-teaching / trajectory-driven prompt patterns), formal orchestrator verification with TLA+/TLC, full Bedrock Guardrails (PII, denied topics, output filters), capability-based security model, alternate runtime, advanced customization with tiered tool access (MCP/plugins via AgentCore Gateway), full dashboard, AI-specific WAF rules.
 - **Iteration 6** — Agent swarm orchestration, skills learning, multi-repo, iterative feedback and multiplayer sessions, HITL approval, scheduled triggers, CDK constructs.
 
 Design docs to keep in sync: [ARCHITECTURE.md](../design/ARCHITECTURE.md), [ORCHESTRATOR.md](../design/ORCHESTRATOR.md), [API_CONTRACT.md](../design/API_CONTRACT.md), [INPUT_GATEWAY.md](../design/INPUT_GATEWAY.md), [REPO_ONBOARDING.md](../design/REPO_ONBOARDING.md), [MEMORY.md](../design/MEMORY.md), [OBSERVABILITY.md](../design/OBSERVABILITY.md), [COMPUTE.md](../design/COMPUTE.md), [CONTROL_PANEL.md](../design/CONTROL_PANEL.md), [SECURITY.md](../design/SECURITY.md), [EVALUATION.md](../design/EVALUATION.md).
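
Note on the `P-ABCA-1` and `P-ABCA-2` properties added in the first hunk: the sketch below shows, under stated assumptions, what a randomized invariant check over the task state machine could look like. It is not ABCA's actual code. The transition table (a linear happy path, plus any active state being allowed to end in any terminal state), the in-memory `Model`, and every function name are hypothetical. The model covers a single user, and a small seeded generator stands in for `fast-check` so the snippet has no dependencies.

```typescript
// Hypothetical in-memory model of the task state machine (not ABCA's real code).
type TaskStatus =
  | "SUBMITTED" | "HYDRATING" | "RUNNING" | "FINALIZING"
  | "COMPLETED" | "FAILED" | "CANCELLED" | "TIMED_OUT";

// Active and terminal states as listed in the roadmap (8 states, 4 active).
const ACTIVE: readonly TaskStatus[] = ["SUBMITTED", "HYDRATING", "RUNNING", "FINALIZING"];
const TERMINAL: readonly TaskStatus[] = ["COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"];
const ALL: readonly TaskStatus[] = ACTIVE.concat(TERMINAL);

function isTerminal(s: TaskStatus): boolean {
  return TERMINAL.indexOf(s) !== -1;
}

// ASSUMPTION: active states advance one step at a time along ACTIVE,
// and any active state may move to any terminal state.
function canTransition(from: TaskStatus, to: TaskStatus): boolean {
  if (isTerminal(from)) return false; // P-ABCA-1: terminal states are immutable
  if (isTerminal(to)) return true;
  return ACTIVE.indexOf(to) === ACTIVE.indexOf(from) + 1;
}

// Single-user model: task statuses plus the concurrency counter.
interface Model {
  tasks: TaskStatus[];
  activeCount: number;
}

function submit(m: Model): void {
  m.tasks.push("SUBMITTED");
  m.activeCount += 1; // admission takes a concurrency slot
}

function transition(m: Model, id: number, to: TaskStatus): boolean {
  const from = m.tasks[id];
  if (from === undefined || !canTransition(from, to)) return false;
  m.tasks[id] = to;
  if (isTerminal(to)) m.activeCount -= 1; // reaching a terminal state releases the slot
  return true;
}

// Small seeded linear congruential generator so runs are reproducible.
function lcg(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

// Applies `steps` random operations and returns a description of the first
// violated property, or null if both properties held throughout.
function checkInvariants(seed: number, steps: number): string | null {
  const rand = lcg(seed);
  const m: Model = { tasks: [], activeCount: 0 };
  for (let i = 0; i < steps; i++) {
    if (m.tasks.length === 0 || rand() < 0.3) {
      submit(m);
    } else {
      const id = Math.floor(rand() * m.tasks.length);
      const to = ALL[Math.floor(rand() * ALL.length)];
      const before = m.tasks[id];
      const applied = transition(m, id, to);
      if (isTerminal(before) && (applied || m.tasks[id] !== before)) {
        return `P-ABCA-1 violated at step ${i}: ${before} -> ${to}`;
      }
    }
    const active = m.tasks.filter((s) => !isTerminal(s)).length;
    if (active !== m.activeCount) {
      return `P-ABCA-2 violated at step ${i}: counter=${m.activeCount}, actual=${active}`;
    }
  }
  return null;
}
```

Calling `checkInvariants(seed, 200)` over a range of seeds exercises many random operation sequences. In a real suite, the hand-rolled loop would likely give way to `fast-check`'s model-based testing support, which adds shrinking of failing sequences, and the model would be driven against the actual orchestrator code rather than this stand-in.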