docs/guides/ROADMAP.md
Lines changed: 21 additions & 3 deletions
@@ -6,6 +6,21 @@ The order and scope of items may shift as we learn; the list below reflects curr
---
+
## Ongoing engineering practice (cross-iteration)
+
+
These practices apply continuously across iterations and are not treated as one-time feature milestones.
+
+
- **Property-based correctness testing for orchestration invariants** — Complement example-based tests (Jest/pytest) with property-based testing (`fast-check` for TypeScript and `hypothesis` for Python) so randomized inputs and interleavings validate invariants over many runs. The goal is to verify safety properties that are timing-sensitive or hard to cover with scenario tests alone (for example, concurrent state transitions and lock/contention behavior).
+
- **Machine-readable property catalog** — Maintain a versioned property set with explicit mapping from each property to enforcing code paths and tests. Initial properties include:
  - `P-ABCA-2` concurrency counter consistency: for each user, `active_count` equals the number of tasks in active states (`SUBMITTED`, `HYDRATING`, `RUNNING`, `FINALIZING`).
+
  - `P-ABCA-3` event ordering: `TaskEvents` are strictly increasing by `event_id` (ULID order).
+
  - `P-ABCA-4` memory fallback guarantee: if task finalization sees `memory_written = false`, a fallback episode write is attempted and its result is observable.
+
  - `P-ABCA-5` branch-name uniqueness: simultaneous tasks for the same repo generate distinct branch names (ULID-based suffix).
+
- **Definition-of-done hook** — New orchestrator/concurrency changes should include: updated property mappings, at least one property-based test where applicable, and invariant notes in `ORCHESTRATOR.md` to keep docs and executable checks aligned.
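
As a rough illustration of the property-based style described in this list, the sketch below drives a toy in-memory task store with random operation sequences and checks `P-ABCA-2` after every step. It uses only Python's `random` module as a stand-in for `hypothesis`, and it exercises random sequential orderings rather than true concurrency. `TaskStore`, its methods, the linear lifecycle, and the `COMPLETED` state are hypothetical names chosen for the example, not the project's actual code.

```python
import random

ACTIVE_STATES = {"SUBMITTED", "HYDRATING", "RUNNING", "FINALIZING"}

# Hypothetical linear lifecycle; COMPLETED is an assumed terminal state.
NEXT_STATE = {
    "SUBMITTED": "HYDRATING",
    "HYDRATING": "RUNNING",
    "RUNNING": "FINALIZING",
    "FINALIZING": "COMPLETED",
}


class TaskStore:
    """Toy model of task records plus a per-user active counter."""

    def __init__(self):
        self.tasks = {}         # task_id -> (user, state)
        self.active_count = {}  # user -> number of active tasks

    def submit(self, task_id, user):
        self.tasks[task_id] = (user, "SUBMITTED")
        self.active_count[user] = self.active_count.get(user, 0) + 1

    def advance(self, task_id):
        user, state = self.tasks[task_id]
        if state not in NEXT_STATE:
            return  # already terminal, nothing to do
        new_state = NEXT_STATE[state]
        self.tasks[task_id] = (user, new_state)
        if new_state not in ACTIVE_STATES:
            self.active_count[user] -= 1  # task left the active set


def counter_is_consistent(store):
    """P-ABCA-2: active_count equals the number of tasks in active states."""
    for user, count in store.active_count.items():
        actual = sum(
            1 for owner, state in store.tasks.values()
            if owner == user and state in ACTIVE_STATES
        )
        if count != actual:
            return False
    return True


def run_random_trial(seed, steps=200):
    """Apply a seeded random sequence of submits and advances, checking each step."""
    rng = random.Random(seed)
    store = TaskStore()
    for step in range(steps):
        if not store.tasks or rng.random() < 0.3:
            store.submit(f"task-{step}", rng.choice(["alice", "bob"]))
        else:
            store.advance(rng.choice(sorted(store.tasks)))
        if not counter_is_consistent(store):
            return False
    return True


all_trials_pass = all(run_random_trial(seed) for seed in range(50))
```

With a real property-based library, the seeded loop would be replaced by generated operation sequences, and a failing sequence would be shrunk to a minimal counterexample.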
+
+
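
One possible shape for the machine-readable property catalog is sketched below, together with a check that could back the definition-of-done hook by listing properties that lack a code-path or test mapping. The property IDs and summaries come from the list above. The version value, the field names, and every path and test name are hypothetical placeholders.

```python
import json
from dataclasses import dataclass, asdict

CATALOG_VERSION = "1"  # assumed versioning scheme


@dataclass(frozen=True)
class PropertyEntry:
    property_id: str
    summary: str
    code_paths: tuple  # hypothetical locations of the enforcing code
    tests: tuple       # hypothetical test identifiers


CATALOG = (
    PropertyEntry("P-ABCA-2", "concurrency counter consistency",
                  ("orchestrator/admission",), ("test_active_count_matches_tasks",)),
    PropertyEntry("P-ABCA-3", "event ordering",
                  ("orchestrator/events",), ("test_event_ids_increase",)),
    PropertyEntry("P-ABCA-4", "memory fallback guarantee",
                  ("orchestrator/finalize",), ()),  # deliberately left without a test
    PropertyEntry("P-ABCA-5", "branch-name uniqueness",
                  ("orchestrator/branch",), ("test_branch_names_distinct",)),
)


def unmapped_properties(catalog):
    """Return IDs of properties missing an enforcing code path or a test."""
    return [e.property_id for e in catalog if not e.code_paths or not e.tests]


def to_json(catalog):
    """Serialize the catalog so other tools (CI, docs) can read it."""
    payload = {"version": CATALOG_VERSION, "properties": [asdict(e) for e in catalog]}
    return json.dumps(payload, indent=2)
```

A CI step could fail the build whenever `unmapped_properties(CATALOG)` is non-empty, which keeps the mapping from drifting away from the tests.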
---
+
## Iteration 1 — First shippable slice (done)
**Goal:** An agent runs on AWS in an isolated environment; user submits a task from the CLI and gets a PR when done.
@@ -137,6 +152,8 @@ The order and scope of items may shift as we learn; the list below reflects curr
**Goal:** Multi-layered validation catches errors, enforces code quality, and assesses change risk before PRs are created; the platform supports more than one task type; multi-modal input broadens what users can express.
- **Per-repo GitHub credentials (GitHub App)** — Replace the single shared OAuth token with a **GitHub App** installed per-organization or per-repository. Each onboarded repo is associated with a GitHub App installation that grants fine-grained permissions (read/write to that repo only). This eliminates the security gap where any authenticated user can trigger agent work against any repo the shared token can access. Token management (installation token generation, rotation) is handled by the platform, not by the agent. AgentCore Identity's token vault can store and refresh installation tokens. This is a prerequisite for any multi-user or multi-team deployment.
+
- **Orchestrator pre-flight checks (fail-closed)** — Add a `pre-flight` step before `start-session` so doomed tasks fail fast without consuming AgentCore runtime. The orchestrator performs lightweight readiness checks with strict timeouts (for example, 5 seconds): verify GitHub API reachability, verify repository existence and credential access (`GET /repos/{owner}/{repo}` or equivalent), and optionally verify AgentCore Runtime availability when a status probe exists. If pre-flight fails, the task transitions to `FAILED` immediately with a clear terminal reason (`GITHUB_UNREACHABLE`, `REPO_NOT_FOUND_OR_NO_ACCESS`, `RUNTIME_UNAVAILABLE`), releases the concurrency slot, emits an event/notification, and does **not** invoke the agent. Unlike memory/context hydration (fail-open), pre-flight is explicitly fail-closed: inability to verify repo access blocks execution by design.
+
- **Pre-execution task risk classification** — Add a lightweight risk classifier at task submission (before orchestration starts) to drive proportional controls for agent execution. Initial implementation can be rule-based and Blueprint-configurable: prompt keywords (for example, `database`, `auth`, `security`, `infrastructure`), metadata from issue labels, and file/path signals when available (for example, `**/migrations/**`, `**/.github/**`, infra directories). Persist `risk_level` (`low` / `medium` / `high` / `critical`) on the task record and use it to set defaults and policy: model tier/cascade, turn and budget defaults, prompt strictness/conservatism, approval requirements before merge, and optional autonomous-execution blocks for `critical` tasks. This is intentionally pre-execution and complements (does not replace) post-execution PR risk/blast-radius analysis.
- **Tiered validation pipeline** — Three tiers of post-agent validation run sequentially after the agent finishes but before finalization. Each tier can fail the PR independently, and failure output is fed back to the agent for a fix cycle (capped at 2 retries per tier to bound cost). If the agent still fails, the PR is created with a validation report (labels, comments, and a risk summary) so the reviewer knows what failed. All three tiers are implemented via the blueprint framework's Layer 2 custom steps (`phase: 'post-agent'`). See [REPO_ONBOARDING.md](../design/REPO_ONBOARDING.md#blueprint-execution-framework) for the 3-layer customization model, [ORCHESTRATOR.md](../design/ORCHESTRATOR.md) for the step execution contract, and [EVALUATION.md](../design/EVALUATION.md#tiered-validation-pipeline) for the full design.
  - **Tier 1 — Tool validation (build, test, lint)** — Run deterministic tooling: test suites, linters, type checkers, SAST scanners, or a custom script. This is the existing "deterministic validation" concept. Binary pass/fail; failures are concrete (test output, lint errors) and actionable by the agent in a fix cycle. Already partially implemented via the system prompt instructing the agent to run tests.
  - **Tier 2 — Code quality analysis** — Static analysis of the agent's diff against code quality principles: DRY (duplicated code detection), SOLID violations, design pattern adherence, complexity metrics (cyclomatic, cognitive), naming conventions, and repo-specific style rules (from onboarding config). Implemented as an LLM-based review step or a combination of static analysis tools (e.g. SonarQube rules, custom linters) and LLM judgment. Produces structured findings (severity, location, rule, suggestion) that the agent can act on in a fix cycle. Findings below a configurable severity threshold are advisory (included in the PR as comments) rather than blocking.
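
The fail-closed pre-flight step described earlier in this list could be sketched roughly as follows. The terminal reasons and the `FAILED` status come from the text. The task dictionary, the probe callables, and the `slot_released` / `session_started` flags are hypothetical stand-ins, and each probe is assumed to enforce the timeout it is given (for example through its HTTP client).

```python
PREFLIGHT_TIMEOUT_SECONDS = 5.0  # the example timeout from the text


def run_preflight(checks, timeout_seconds=PREFLIGHT_TIMEOUT_SECONDS):
    """Run (reason, probe) pairs in order.

    Returns None when every probe passes, otherwise the reason of the first
    failure. Fail-closed: a probe that returns False, raises, or times out
    counts as a failure.
    """
    for reason, probe in checks:
        try:
            ok = bool(probe(timeout_seconds))
        except Exception:
            ok = False
        if not ok:
            return reason
    return None


def start_task(task, checks, start_session):
    """Invoke the agent session only when pre-flight passes."""
    reason = run_preflight(checks)
    if reason is not None:
        task["status"] = "FAILED"
        task["failure_reason"] = reason
        task["slot_released"] = True  # stand-in for releasing the concurrency slot
        return task
    start_session(task)
    task["session_started"] = True
    return task


def _unreachable_repo(timeout_seconds):
    raise TimeoutError("repository probe timed out")  # simulated failure


started = []
demo_checks = [
    ("GITHUB_UNREACHABLE", lambda timeout_seconds: True),
    ("REPO_NOT_FOUND_OR_NO_ACCESS", _unreachable_repo),
    ("RUNTIME_UNAVAILABLE", lambda timeout_seconds: True),  # optional probe
]
demo_task = start_task({"id": "t1"}, demo_checks, started.append)
```

In this sketch a timeout is treated exactly like a negative answer, which is the difference from a fail-open step such as hydration, where the task would proceed anyway.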
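
A rule-based version of the pre-execution risk classifier from this list might look like the sketch below. The keywords, glob patterns, and level names are taken from the text. The level assigned to each signal, the `risk:critical` label, and the "highest matched level wins" rule are assumptions made for the example.

```python
import re
from fnmatch import fnmatchcase

RISK_LEVELS = ("low", "medium", "high", "critical")  # ordered least to most severe

# Signals from the text; the level mapped to each one is an assumption.
KEYWORD_LEVELS = {"database": "high", "auth": "high", "security": "high", "infrastructure": "high"}
PATH_LEVELS = {"**/migrations/**": "high", "**/.github/**": "medium"}
LABEL_LEVELS = {"risk:critical": "critical"}  # hypothetical issue label


def classify_risk(prompt, labels=(), paths=()):
    """Return the highest risk level matched by any rule, defaulting to 'low'."""
    matched = ["low"]
    words = set(re.findall(r"[a-z]+", prompt.lower()))  # naive whole-word match
    matched += [level for keyword, level in KEYWORD_LEVELS.items() if keyword in words]
    matched += [LABEL_LEVELS[label] for label in labels if label in LABEL_LEVELS]
    # fnmatch's '*' also matches '/', so '**/x/**' matches any path containing '/x/'.
    matched += [
        level for pattern, level in PATH_LEVELS.items()
        if any(fnmatchcase(path, pattern) for path in paths)
    ]
    return max(matched, key=RISK_LEVELS.index)


def submit_task(task):
    """Persist risk_level on the task record before orchestration starts."""
    task["risk_level"] = classify_risk(
        task["prompt"], task.get("labels", ()), task.get("paths", ())
    )
    return task
```

For example, `classify_risk("Add an index to the database schema")` returns `"high"` under these assumed rules, while a prompt with no matching signal stays at `"low"`.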
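
The fix cycle for the validation tiers above can be sketched as a loop with a per-tier retry cap. The cap of 2 comes from the text. The `(name, validate)` tier shape, the `request_fix` callback, and the report fields are hypothetical, and the sketch assumes that later tiers still run after an earlier tier exhausts its retries, which the text does not specify.

```python
MAX_FIX_RETRIES_PER_TIER = 2  # cap stated in the text


def run_validation(tiers, request_fix):
    """Run tiers in order, feeding failure output back to the agent.

    tiers: list of (name, validate) where validate() returns (passed, output).
    request_fix: callable(name, output) that asks the agent for a fix.
    Returns one report entry per tier.
    """
    report = []
    for name, validate in tiers:
        passed, output = validate()
        retries = 0
        while not passed and retries < MAX_FIX_RETRIES_PER_TIER:
            request_fix(name, output)  # fix cycle: hand the failure to the agent
            retries += 1
            passed, output = validate()
        report.append({"tier": name, "passed": passed, "retries": retries, "output": output})
    return report


def make_flaky_validator(failures_before_pass):
    """Test helper: fails the given number of times, then passes."""
    state = {"calls": 0}

    def validate():
        state["calls"] += 1
        if state["calls"] <= failures_before_pass:
            return False, f"failure #{state['calls']}"
        return True, "ok"

    return validate


fix_requests = []
demo_report = run_validation(
    [
        ("tier-1-tools", make_flaky_validator(1)),    # fixed after one retry
        ("tier-2-quality", make_flaky_validator(5)),  # never fixed within the cap
    ],
    lambda name, output: fix_requests.append((name, output)),
)
pr_has_validation_failures = any(not entry["passed"] for entry in demo_report)
```

The resulting report is the kind of data that could feed the labels, comments, and risk summary attached to the PR when a tier still fails.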
@@ -190,7 +207,8 @@ The order and scope of items may shift as we learn; the list below reflects curr
- **Memory isolation for multi-tenancy** — AgentCore Memory has no per-namespace IAM isolation. For multi-tenant deployments, private repo knowledge could leak cross-repo unless isolation is enforced. Options: silo model (separate memory resource per org — strongest), pool model (single resource with strict application-layer namespace scoping — sufficient for single-org), or shared model (intentional cross-repo learning — only for same-org repos). The onboarding pipeline should create or assign memory resources based on the isolation model. See [SECURITY.md](../design/SECURITY.md) and [MEMORY.md](../design/MEMORY.md).
- **Full cost management** — Per-user and per-team monthly budgets, cost attribution dashboards (cost per task, per repo, per user), and alerts when budgets approach their limits. Token usage and compute cost are tracked per task and aggregated. The control panel (Iter 4) displays cost dashboards.
- **Adaptive model router with cost-aware cascade** — Per-turn model selection via a lightweight heuristic engine. File reads and simple edits use a cheaper model (Haiku); multi-file refactors use Sonnet; complex reasoning escalates to Opus. Error escalation: if the agent fails twice on the same step, upgrade the model for the retry. As the cost budget ceiling approaches, cascade down to cheaper models. Blueprint `modelCascade` config enables per-repo tuning. Potential 30-40% cost reduction on inference-dominated workloads. Requires agent harness changes to support mid-session model switching.
-
- **Advanced evaluation and feedback loop** — Extend the basic evaluation pipeline from Iteration 3d: ML-based or LLM-based trace analysis (not just rules), A/B prompt comparison framework, automated feedback into prompt templates (e.g. "for repo X, always run tests before opening PR"), and per-repo or per-failure-type improvement tracking. Evaluation results can update the repo's agent configuration stored during onboarding.
+
- **Advanced evaluation and feedback loop** — Extend the basic evaluation pipeline from Iteration 3d: ML-based or LLM-based trace analysis (not just rules), A/B prompt comparison framework, automated feedback into prompt templates (e.g. "for repo X, always run tests before opening PR"), and per-repo or per-failure-type improvement tracking. Evaluation results can update the repo's agent configuration stored during onboarding. **Optional patterns from adaptive teaching research** (e.g. plan → targeted critique → execution; separate **evaluator** vs **prompt/reflection** roles; fitness from LLM judging plus efficiency metrics; evolution of teaching templates from failed trajectories with Pareto-style candidate sets for diverse failure modes) can inform offline or scheduled improvement of Blueprint prompts and checklists without replacing ABCA's core orchestrator.
+
- **Formal orchestrator verification (TLA+)** — Add a formal specification of the orchestrator in TLA+ and verify it with TLC model checking. Scope includes the task state machine (8 states, valid transitions, terminal states), concurrency admission control (atomic increment + max check), cancellation races (cancel arriving during any orchestration step), reconciler/orchestrator interleavings (counter drift correction while tasks are active), and the polling loop (agent writes terminal status, orchestrator observes and finalizes). Define invariants such as valid-state progression, no illegal transitions, and repo-level safety constraints (for example, at most one active `RUNNING` task per repo when configured). Keep the spec aligned with `src/constructs/task-status.ts` and orchestrator docs so regressions surface as model-check counterexamples before production.
- **Guardrails** — Natural-language or policy-based **guardrails** on agent tool calls using Amazon Bedrock Guardrails. Defends against prompt injection, restricts sensitive content generation, and enforces organizational policies (e.g. "do not modify files in `/infrastructure`"). See [SECURITY.md](../design/SECURITY.md). Guardrails configuration can be per-repo (via onboarding) or platform-wide.
- **Capability-based security model** — Fine-grained enforcement beyond Bedrock Guardrails, operating at three levels: (1) **Tool-level capabilities** — Bash command allowlist (git, npm, make permitted; curl, wget blocked), configurable per capability tier (standard / elevated / read-only). (2) **File-system scope** — Blueprint declares include/exclude path patterns; Write/Edit/Read tools are filtered to the declared scope. (3) **Input trust scoring** — Authenticated user input = trusted; external GitHub issues = untrusted; PR review comments entering memory = adversarial. Trust level selects the capability set. Essential once review feedback memory (Iter 3d) introduces attacker-controlled content into the agent's context. Blueprint `security` prop configures the capability profile per repo.
- **Additional execution environment** — Support an alternative to AgentCore Runtime (e.g. ECS/Fargate, EKS) behind the **ComputeStrategy** interface (see [REPO_ONBOARDING.md](../design/REPO_ONBOARDING.md#compute-strategy-interface)). The orchestrator calls abstract methods (`startSession`, `stopSession`, `pollSession`); the implementation maps to AgentCore, Fargate, or EKS. Repos select the strategy via `compute_type` in their blueprint configuration. Reduces vendor lock-in and enables workloads that exceed AgentCore limits (e.g. GPU, larger images, longer sessions). The ComputeStrategy interface contract is defined in Iteration 3a; Iteration 5 adds alternative implementations.
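
The adaptive model router described in this list could start from a heuristic as small as the one below. The three model names and the "fails twice, then upgrade" rule come from the text. The step kinds, the 0.8 budget threshold, and the choice to apply the budget cascade after error escalation are assumptions for illustration.

```python
MODEL_TIERS = ["haiku", "sonnet", "opus"]  # cheapest to most capable

# Hypothetical step kinds mapped to a starting tier index.
BASE_TIER = {
    "file_read": 0,
    "simple_edit": 0,
    "multi_file_refactor": 1,
    "complex_reasoning": 2,
}


def select_model(step_kind, failures_on_step=0, budget_used_fraction=0.0,
                 cascade_threshold=0.8):
    """Pick a model tier for one turn using simple heuristics."""
    tier = BASE_TIER.get(step_kind, 1)  # unknown steps default to the middle tier
    if failures_on_step >= 2:
        tier = min(tier + 1, len(MODEL_TIERS) - 1)  # error escalation
    if budget_used_fraction >= cascade_threshold:
        tier = max(tier - 1, 0)  # cascade down near the budget ceiling
    return MODEL_TIERS[tier]
```

Under these assumptions a twice-failed multi-file refactor moves from Sonnet to Opus, and any step running at 90% of budget drops one tier. A per-repo `modelCascade` config could override `BASE_TIER` and `cascade_threshold`.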
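
To make the ComputeStrategy idea above more concrete, here is a minimal sketch of an abstract interface with one toy implementation selected by `compute_type`. The three operations mirror the methods named in the text, written in snake_case for Python. Their signatures, the returned status strings, the `InMemoryStrategy` class, and the registry are hypothetical.

```python
from abc import ABC, abstractmethod


class ComputeStrategy(ABC):
    """Abstract session lifecycle the orchestrator depends on."""

    @abstractmethod
    def start_session(self, task_id: str) -> str:
        """Start a session for the task and return a session id."""

    @abstractmethod
    def poll_session(self, session_id: str) -> str:
        """Return the current session status."""

    @abstractmethod
    def stop_session(self, session_id: str) -> None:
        """Stop the session."""


class InMemoryStrategy(ComputeStrategy):
    """Toy backend standing in for AgentCore, Fargate, or EKS."""

    def __init__(self):
        self._sessions = {}

    def start_session(self, task_id: str) -> str:
        session_id = f"session-{task_id}"
        self._sessions[session_id] = "RUNNING"
        return session_id

    def poll_session(self, session_id: str) -> str:
        return self._sessions[session_id]

    def stop_session(self, session_id: str) -> None:
        self._sessions[session_id] = "STOPPED"


STRATEGIES = {"in_memory": InMemoryStrategy}  # hypothetical registry


def strategy_for(blueprint):
    """Select the implementation named by the blueprint's compute_type."""
    return STRATEGIES[blueprint["compute_type"]]()
```

Because the orchestrator only sees `ComputeStrategy`, adding another backend in this sketch means registering one more class rather than changing orchestration code.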
@@ -224,11 +242,11 @@ The order and scope of items may shift as we learn; the list below reflects curr