This roadmap outlines multiple iterations for ABCA. Each iteration adds features incrementally and builds on the previous one. Delivering a working slice at the end of each iteration is the goal. Non–backward-compatible changes between iterations are acceptable (e.g. switching CLI auth from IAM to Cognito, or changing the orchestration model) when they simplify the design or align with the target architecture.
The order and scope of items may shift as we learn; the list below reflects current design docs (ARCHITECTURE.md and component docs in docs/design/).
These practices apply continuously across iterations and are not treated as one-time feature milestones.
- Property-based correctness testing for orchestration invariants — Complement example-based tests (Jest/pytest) with property-based testing (`fast-check` for TypeScript and `hypothesis` for Python) so randomized inputs and interleavings validate invariants over many runs. The goal is to verify safety properties that are timing-sensitive or hard to cover with scenario tests alone (for example, concurrent state transitions and lock/contention behavior).
- Machine-readable property catalog — Maintain a versioned property set with explicit mapping from each property to enforcing code paths and tests. Initial properties include:
  - `P-ABCA-1` terminal-state immutability: tasks in `COMPLETED` / `FAILED` / `CANCELLED` / `TIMED_OUT` cannot transition further.
  - `P-ABCA-2` concurrency counter consistency: for each user, `active_count` equals the number of tasks in active states (`SUBMITTED`, `HYDRATING`, `RUNNING`, `FINALIZING`).
  - `P-ABCA-3` event ordering: `TaskEvents` are strictly monotonic by `event_id` (ULID order).
  - `P-ABCA-4` memory fallback guarantee: if task finalization sees `memory_written = false`, fallback episode write is attempted and result is observable.
  - `P-ABCA-5` branch-name uniqueness: simultaneous tasks for the same repo generate distinct branch names (ULID-based suffix).
- Definition-of-done hook — New orchestrator/concurrency changes should include: updated property mappings, at least one property-based test where applicable, and invariant notes in `ORCHESTRATOR.md` to keep docs and executable checks aligned.
- Memory extraction prompt versioning — Hash memory extraction prompts (in `agent/memory.py`: `write_task_episode`, `write_repo_learnings`) alongside system prompts so changes to extraction logic are tracked by `prompt_version`. This enables correlating memory quality changes with extraction prompt updates in the evaluation pipeline.
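As a rough illustration of how `P-ABCA-1` and `P-ABCA-2` could be exercised with randomized inputs, the sketch below drives an in-memory task store with random submit and transition requests and checks both invariants after every step. It uses only the standard library's `random` module rather than `hypothesis`, so it runs without extra dependencies. The `TaskStore` class and the `TRANSITIONS` table are assumptions made for this example; only the state names and the two properties come from the catalog above.

```python
import random

TERMINAL = {"COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"}
ACTIVE = {"SUBMITTED", "HYDRATING", "RUNNING", "FINALIZING"}

# Assumed transition table, for illustration only.
TRANSITIONS = {
    "SUBMITTED": {"HYDRATING", "FAILED", "CANCELLED"},
    "HYDRATING": {"RUNNING", "FAILED", "CANCELLED"},
    "RUNNING": {"FINALIZING", "FAILED", "CANCELLED", "TIMED_OUT"},
    "FINALIZING": {"COMPLETED", "FAILED", "TIMED_OUT"},
}


class TaskStore:
    """Hypothetical in-memory stand-in for the task table and per-user counter."""

    def __init__(self):
        self.status = {}        # task_id -> status
        self.owner = {}         # task_id -> user
        self.active_count = {}  # user -> stored counter

    def submit(self, task_id, user):
        self.status[task_id] = "SUBMITTED"
        self.owner[task_id] = user
        self.active_count[user] = self.active_count.get(user, 0) + 1

    def transition(self, task_id, new_status):
        current = self.status[task_id]
        if new_status not in TRANSITIONS.get(current, set()):
            return False  # rejected; terminal states have no outgoing edges
        self.status[task_id] = new_status
        if new_status in TERMINAL:
            self.active_count[self.owner[task_id]] -= 1
        return True


def check_invariants(store):
    """P-ABCA-2: the stored counter equals the number of active tasks per user."""
    for user, stored in store.active_count.items():
        actual = sum(
            1 for task_id, status in store.status.items()
            if store.owner[task_id] == user and status in ACTIVE
        )
        assert stored == actual, f"P-ABCA-2 violated for {user}"


def run_random_trial(seed, steps=200):
    rng = random.Random(seed)
    store = TaskStore()
    all_states = sorted(TERMINAL | ACTIVE)
    for i in range(steps):
        if not store.status or rng.random() < 0.3:
            store.submit(f"task-{i}", rng.choice(["alice", "bob"]))
        else:
            task_id = rng.choice(sorted(store.status))
            before = store.status[task_id]
            accepted = store.transition(task_id, rng.choice(all_states))
            if before in TERMINAL:
                # P-ABCA-1: a terminal task never changes state.
                assert not accepted and store.status[task_id] == before
        check_invariants(store)
    return store


if __name__ == "__main__":
    for seed in range(50):
        run_random_trial(seed)
    print("invariants held for 50 random trials")
```

A real property-based framework would add input shrinking and reproducible failure reports on top of this; the fixed seeds here only make a failing run repeatable.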
Goal: An agent runs on AWS in an isolated environment; user submits a task from the CLI and gets a PR when done.
- Agent on AWS — Agent runs in a sandboxed compute environment (AgentCore Runtime MicroVM or equivalent). Each task gets an isolated session (compute, memory, filesystem). Container/image has shell, filesystem, dev tooling; session isolation is built-in.
- CLI trigger — User can submit a task via CLI (script or simple CLI): provide repo + task description (text and/or GitHub issue ref). Single entry path; no multi-channel yet.
- Autonomous agent loop — Agent SDK runs with full tool access in headless mode (read, write, edit, bash, glob, grep; `permissionMode: "bypassPermissions"` or equivalent). No human prompts during execution.
- Git workflow — Agent creates a branch, commits incrementally, pushes to GitHub, and creates a pull request when done. Branch naming convention: e.g. `bgagent/<task-id>/<short-desc>`.
- GitHub only — Single git provider (GitHub). Agent clones repo, works on branch, opens PR via GitHub API (OAuth or token via AgentCore Identity).
- Minimal orchestration — Task is created, execution is triggered (e.g. Lambda or direct invoke), agent runs to completion or failure. Platform infers outcome from GitHub (PR created or not) or from session end. No durable orchestration (e.g. no Step Functions / Durable Functions) required for this slice if we accept "fire-and-forget" plus polling.
- Task state (minimal) — At least: task id, status (e.g. running / completed / failed), repo, and a way to poll or wait for completion. Persistence can be minimal (e.g. DynamoDB or single table).
- API authentication — CLI authenticates to the API (e.g. IAM SigV4 or Cognito JWT). Prevents unauthorized task submission.
- Scaling — Each task runs in its own isolated session; no shared mutable state so the system can scale with concurrent tasks (within runtime quotas).
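The branch naming convention from the Git workflow item can be sketched as a small helper. This is a minimal illustration, not the harness code: `slugify` and `branch_name` here are hypothetical stand-ins, and the 40-character cap and the `task` fallback are assumptions.

```python
import re


def slugify(text, max_len=40):
    """Lowercase, collapse runs of non-alphanumerics to '-', and cap the length."""
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    # Assumed fallback so the branch name never ends with an empty segment.
    return slug[:max_len].rstrip("-") or "task"


def branch_name(task_id, description):
    """Build a branch name following bgagent/<task-id>/<short-desc>."""
    return f"bgagent/{task_id}/{slugify(description)}"


if __name__ == "__main__":
    print(branch_name("01HXYZ", "Fix login bug!"))
```

Because the task id is part of the name, two tasks with the same description still get distinct branches as long as their ids differ.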
Out of scope for Iteration 1: Repo onboarding (any repo the credentials can access is allowed), multiple channels, durable execution with checkpoint/resume, rich observability, memory/code attribution, webhook, Slack.
Goal: Robust task lifecycle, durable execution, security foundations, basic cost guardrails, and visibility into what's running. This iteration makes the platform production-grade for single-channel (CLI) usage.
- Task management — Submit, list (e.g. my tasks), get status (per task), cancel (stop a running task). Clear task state machine (SUBMITTED → HYDRATING → RUNNING → FINALIZING → COMPLETED / FAILED / CANCELLED / TIMED_OUT). See ORCHESTRATOR.md.
- API contract — Implement the external API: `POST /v1/tasks`, `GET /v1/tasks`, `GET /v1/tasks/{id}`, `DELETE /v1/tasks/{id}`, `GET /v1/tasks/{id}/events`. Consistent error format, pagination, idempotency. See API_CONTRACT.md.
- Input gateway (single entry point) — All requests go through one gateway: verify auth, normalize payload to an internal message schema, validate (required fields, repo/issue refs), then dispatch to the task pipeline. The gateway is designed for extensibility — adding new channels later requires only new adapters, not core changes. In this iteration, CLI is the only channel; the gateway architecture is established so future channels (webhook, Slack) plug in cleanly. See INPUT_GATEWAY.md.
- Idempotency — Task submit accepts an idempotency key (e.g. `Idempotency-Key` header); duplicate submits with the same key do not create a second task. Prevents duplicate work on retries. Keys are stored with a 24-hour TTL.
- Improve CLI — Dedicated CLI package (`@abca/cli` in `cli/`) with commands: `configure`, `login`, `submit`, `list`, `status`, `cancel`, `events`. Cognito auth with token caching and auto-refresh, `--wait` mode that polls until completion, `--output json` for scripting, and `--verbose` for debugging.
- Durable execution — Orchestrator on top of the agent using Lambda Durable Functions: checkpoint/resume, async session monitoring via DynamoDB polling, timeout recovery, idempotent step execution. Long-running sessions (hours) survive transient failures; agent commits regularly so work is not lost. See ORCHESTRATOR.md for the task state machine, execution model, failure modes, concurrency management, data model, and implementation strategy.
- Storage — (1) Task and event storage — Tasks table, TaskEvents (audit log), UserConcurrency counters in DynamoDB. (2) Durable execution state — Lambda Durable Functions checkpoints (managed by the service). (3) Artifact storage (optional) — S3 bucket for future screenshot/video uploads.
- Threat model — Document the threat model for the current architecture using threat-composer. Cover: input validation, agent isolation, credential management, data flow, and trust boundaries. Update the threat model as new features land in future iterations. Threat modeling informs the security controls built in this and subsequent iterations — it must come before, not after, the production gateway and orchestrator.
- Network isolation (basic) — Deploy the agent compute environment within a VPC. Restrict outbound egress to allowlisted endpoints: GitHub API, Amazon Bedrock, AgentCore services, and necessary AWS service endpoints (DynamoDB, CloudWatch, S3). No open internet access by default. This prevents a compromised or confused agent from reaching arbitrary endpoints. Fine-grained per-repo allowlisting and egress logging are deferred to Iteration 3a.
- Observability — Metrics: task duration, token usage (from agent SDK result), cold start, error rate, active task counts, and submitted backlog. Dashboards: active tasks, submitted backlog, completion rate, basic task list. Alarms: stuck tasks (e.g. RUNNING > 9 hours), sustained submitted backlog over threshold, orchestration failures, counter drift. Logs: Agent/runtime logs (e.g. CloudWatch) tied to task id. See OBSERVABILITY.md.
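The idempotency behaviour described in this iteration (a repeated key returns the existing task, and keys expire after 24 hours) can be sketched with an in-memory store. This is an illustration under stated assumptions: the `TaskSubmitter` class is hypothetical, keys are assumed to be scoped per user, and a real implementation would presumably rely on DynamoDB conditional writes and TTL attributes rather than a Python dict.

```python
import time
import uuid

IDEMPOTENCY_TTL_SECONDS = 24 * 60 * 60  # 24-hour TTL, as stated in the roadmap


class TaskSubmitter:
    """Hypothetical in-memory model of idempotent task submission."""

    def __init__(self, clock=time.time):
        self._clock = clock  # injectable so tests can control time
        self._keys = {}      # (user_id, idempotency_key) -> (task_id, expires_at)
        self.tasks = {}      # task_id -> task record

    def submit(self, user_id, repo, description, idempotency_key=None):
        now = self._clock()
        if idempotency_key is not None:
            entry = self._keys.get((user_id, idempotency_key))
            if entry is not None and entry[1] > now:
                # Replay within the TTL: return the original task, create nothing.
                return self.tasks[entry[0]]
        task_id = uuid.uuid4().hex
        task = {
            "task_id": task_id,
            "user_id": user_id,
            "repo": repo,
            "description": description,
            "status": "SUBMITTED",
        }
        self.tasks[task_id] = task
        if idempotency_key is not None:
            expires_at = now + IDEMPOTENCY_TTL_SECONDS
            self._keys[(user_id, idempotency_key)] = (task_id, expires_at)
        return task


if __name__ == "__main__":
    submitter = TaskSubmitter()
    first = submitter.submit("u1", "owner/repo", "fix bug", idempotency_key="k1")
    retry = submitter.submit("u1", "owner/repo", "fix bug", idempotency_key="k1")
    print(first["task_id"] == retry["task_id"])
```

Scoping the key by user is a design assumption: it keeps two users who happen to pick the same key from seeing each other's tasks.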
Builds on Iteration 1: Same agent + git workflow; adds orchestrator, gateway, task CRUD, API contract, observability, security foundations, and cost guardrails.
Out of scope for Iteration 2: Webhook trigger (no second channel yet), multi-modal input (text-based tasks are sufficient), repo onboarding, memory, customization.
Goal: Only onboarded repos can receive tasks; per-repo credentials replace the single shared OAuth token; agent environment is customizable per repo.
- Repository onboarding pipeline — Repos must be onboarded before tasks can target them. Onboarding registers a repo with the platform and produces a per-repo agent configuration (workload, security, customization). Submitting a task for a non-onboarded repo returns an error (`REPO_NOT_ONBOARDED`). The pipeline can discover static config (e.g. rules, README) and optionally generate dynamic artifacts (summaries, dependency graphs). See REPO_ONBOARDING.md.
- Basic customization: prompt from repo — The full project-level configuration scope is loaded at runtime via the Claude Agent SDK's `setting_sources=["project"]` parameter. This includes `CLAUDE.md` / `.claude/CLAUDE.md` (instructions), `.claude/rules/*.md` (path-scoped rules), `.claude/settings.json` (project settings, hooks, env), `.claude/agents/` (custom subagents), and `.mcp.json` (MCP servers). The CLI natively discovers and injects these — no custom file parsing needed. Additionally, Blueprint `system_prompt_overrides` from DynamoDB are wired through `server.py` → `entrypoint.py` and appended after template substitution. Composable prompt model: platform default + Blueprint overrides (appended) + repo-level project configuration (loaded by CLI).
- Network isolation (fine-grained) — Route 53 Resolver DNS Firewall enforces a platform-wide domain allowlist. Per-repo `networking.egressAllowlist` feeds the aggregate policy (VPC-wide, not per-session). DNS query logging provides egress audit. Deployed in observation mode (ALERT) with a path to enforcement mode (BLOCK). See NETWORK_ARCHITECTURE.md and SECURITY.md.
- Webhook / API trigger — Expose task submission as a webhook (HMAC-authenticated) so external systems can create tasks programmatically. Same API contract as CLI; gateway normalizes and validates. This is the foundation for GitHub Actions integration and CI-triggered tasks. Webhook management API (create/list/revoke) protected by Cognito; per-integration secrets stored in Secrets Manager; HMAC-SHA256 REQUEST authorizer on the webhook endpoint.
- Better context hydration — Dedicated pre-processing step before the agent runs: gather relevant context (user message, GitHub issue body/comments, optionally recent commits or related paths). Assemble into a structured prompt. Basic version for this iteration: user message + issue body + system prompt template. Advanced sources (related code, linked issues, memory) are added in later iterations.
- Data retention and cleanup — Define and implement retention policies: task record TTL in DynamoDB (e.g. 90 days for completed tasks, configurable), CloudWatch log retention (e.g. 30 days).
- Turn / iteration caps — Complement time-based timeouts with configurable per-task turn limits (default 100, range 1–500). Users can set `max_turns` via the API or CLI (`--max-turns`). The value is validated, persisted in the task record, passed through the orchestrator payload, and consumed by the agent's `server.py` → `ClaudeAgentOptions(max_turns=...)`. The `MAX_TURNS` env var on the AgentCore Runtime provides a defense-in-depth fallback. Per-repo overrides via `blueprint_config` are supported. See ORCHESTRATOR.md.
- Cost budget caps — Complement turn limits with a configurable per-task cost budget (`max_budget_usd`, range $0.01–$100). When the budget is reached, the agent stops regardless of remaining turns. Users can set it via the API (`max_budget_usd`) or CLI (`--max-budget`). Per-repo defaults are configurable via `blueprint_config.max_budget_usd`. Follows a 2-tier override: per-task → Blueprint config; if neither is set, no budget limit is applied. See ORCHESTRATOR.md and COST_MODEL.md.
- User prompt guide and anti-patterns — Publish a best-practices guide for writing effective task descriptions. Common anti-patterns are: (1) overly generic prompts that expect the agent to infer intent, and (2) overly specific prompts that break when encountering unexpected scenarios. The guide should include concrete examples of good vs. bad prompts, guidance on when to use issue references vs. free-text descriptions, and tips for defining verifiable goals (e.g. "add tests for X" rather than "make this better"). Can be part of onboarding docs or a standalone user guide. See REPO_ONBOARDING.md and PROMPT_GUIDE.md.
- Agent turn budget awareness — The system prompt now includes the `max_turns` value so the agent can prioritize effectively. An agent that knows it has 20 turns left behaves differently from one that doesn't — it avoids excessive exploration and focuses on impactful changes first. Injected via `{max_turns}` placeholder in `agent/system_prompt.py`.
- Default branch detection — Replaced all hardcoded `main` references in the agent harness with dynamic detection via `gh repo view --json defaultBranchRef`. The system prompt now includes `{default_branch}`, and `ensure_pr()` targets the detected default branch. Repos using `master`, `develop`, or `trunk` now work correctly.
- Uncommitted work safety net — Added `ensure_committed()` as a deterministic post-hook before PR creation. If the agent left uncommitted tracked-file changes (e.g. due to turn limit or timeout), the harness stages them with `git add -u` and creates a safety-net commit. Prevents silent loss of agent work.
- Pre-agent lint baseline — Added `mise run lint` during `setup_repo()` alongside the existing `mise run build` baseline. Records lint state before agent changes so post-agent lint failures can be attributed to the agent (same pattern as `build_before`).
- Post-agent lint verification — Added `verify_lint()` alongside `verify_build()` in post-hooks. Lint pass/fail is recorded in the task result, persisted to DynamoDB, emitted as a span attribute (`lint.passed`), and included in the PR body's verification section.
- Softened commit/PR conventions — The system prompt now instructs the agent to follow the repo's commit conventions if discoverable (from CONTRIBUTING.md, CLAUDE.md, or prior commits), defaulting to conventional commit format only when no repo convention is apparent. Reduces review friction for repos with non-standard commit styles.
- Operator metrics dashboard — CloudWatch Dashboard (`BackgroundAgent-Tasks`) providing immediate operator visibility: task success rate, cost per task, turns per task, duration distribution, build/lint pass rates, and AgentCore invocations/errors/latency. Lightweight alternative to the full web control panel (Iteration 4). See `src/constructs/task-dashboard.ts`.
- WAF on API Gateway — AWS WAFv2 Web ACL protects the Task API with AWS managed rule groups (`AWSManagedRulesCommonRuleSet`, `AWSManagedRulesKnownBadInputsRuleSet`) and a rate-based rule (1,000 requests per 5-minute window per IP). Provides edge-layer protection against common web exploits, known bad inputs, and volumetric abuse. See SECURITY.md.
- Bedrock model invocation logging — Account-level Bedrock model invocation logging enabled via custom resource, sending prompt and response text to CloudWatch (`/aws/bedrock/model-invocation-logs`, 90-day retention). Provides full auditability of model inputs and outputs for prompt injection investigation, compliance, and debugging.
- Task description length limit — Task descriptions capped at 2,000 characters (as recommended by the threat model) to bound prompt injection attack surface and prevent oversized payloads.
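The HMAC-SHA256 check that the webhook authorizer performs can be sketched with the standard library. This is a minimal illustration, not the authorizer itself: the `sha256=` prefix on the signature value is an assumption borrowed from common webhook conventions, and `sign_payload` / `verify_signature` are hypothetical helper names.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """Compute the signature a sender would attach to a request body."""
    digest = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return "sha256=" + digest  # prefix format is an assumption


def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Recompute the signature over the raw body and compare in constant time."""
    expected = sign_payload(secret, body)
    # Encode both sides so compare_digest also accepts non-ASCII header values.
    return hmac.compare_digest(expected.encode("utf-8"), signature_header.encode("utf-8"))


if __name__ == "__main__":
    secret = b"per-integration-secret"
    body = b'{"repo": "owner/repo", "description": "add tests for X"}'
    header = sign_payload(secret, body)
    print(verify_signature(secret, body, header))
```

The signature must be computed over the exact raw request bytes; re-serializing parsed JSON before verifying would change whitespace or key order and cause valid requests to fail.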
Builds on Iteration 2: Gateway and orchestration stay; adds onboarding gate, webhook channel, DNS Firewall, better context hydration, turn caps, cost budget caps, prompt guide, data lifecycle, agent harness improvements (turn budget, default branch, safety net, lint verification), operator dashboard, WAF, model invocation logging, and input length limits.
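The turn and budget caps in this iteration combine range validation with an override order. The sketch below shows one way that resolution could look. The budget precedence (per-task, then Blueprint config, else no limit) follows the description above; applying the same per-task-then-Blueprint order to `max_turns` before falling back to the default of 100 is an assumption, as are the function and constant names.

```python
DEFAULT_MAX_TURNS = 100
MAX_TURNS_MIN, MAX_TURNS_MAX = 1, 500
MAX_BUDGET_MIN, MAX_BUDGET_MAX = 0.01, 100.0


def _first_set(*values):
    """Return the first value that is not None, or None if all are unset."""
    for value in values:
        if value is not None:
            return value
    return None


def resolve_limits(task_request, blueprint_config):
    """Resolve effective limits from a task request and per-repo Blueprint config."""
    max_turns = _first_set(
        task_request.get("max_turns"),
        blueprint_config.get("max_turns"),
        DEFAULT_MAX_TURNS,
    )
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_turns, bool) or not isinstance(max_turns, int):
        raise ValueError("max_turns must be an integer")
    if not MAX_TURNS_MIN <= max_turns <= MAX_TURNS_MAX:
        raise ValueError(f"max_turns must be in {MAX_TURNS_MIN}-{MAX_TURNS_MAX}")

    max_budget = _first_set(
        task_request.get("max_budget_usd"),
        blueprint_config.get("max_budget_usd"),
    )
    if max_budget is not None:
        if isinstance(max_budget, bool) or not isinstance(max_budget, (int, float)):
            raise ValueError("max_budget_usd must be a number")
        if not MAX_BUDGET_MIN <= max_budget <= MAX_BUDGET_MAX:
            raise ValueError(f"max_budget_usd must be in {MAX_BUDGET_MIN}-{MAX_BUDGET_MAX}")

    # A budget of None means no budget limit is applied.
    return {"max_turns": max_turns, "max_budget_usd": max_budget}


if __name__ == "__main__":
    print(resolve_limits({"max_turns": 20}, {"max_budget_usd": 5.0}))
```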
Goal: Agents learn from past interactions; memory Tier 1 (repository knowledge + task execution history) is operational; prompt versioning and commit attribution provide traceability.
- Interaction memory / code attribution (Tier 1) — AgentCore Memory resource provisioned via CDK L2 construct (`@aws-cdk/aws-bedrock-agentcore-alpha`) with named semantic (`SemanticKnowledge`) and episodic (`TaskEpisodes`) extraction strategies using explicit namespace templates: `/{actorId}/knowledge/` for semantic records, `/{actorId}/episodes/{sessionId}/` for per-task episodes, and `/{actorId}/episodes/` for episodic reflection (cross-task summaries). Events are written with `actorId = repo` ("owner/repo") and `sessionId = taskId`, so the extraction pipeline places records at `/{repo}/knowledge/` and `/{repo}/episodes/{taskId}/`. Memory is loaded at task start during context hydration (two parallel `RetrieveMemoryRecordsCommand` calls using repo-derived namespace prefixes — `/{repo}/knowledge/` for semantic, `/{repo}/episodes/` for episodic) with a 5-second timeout and 2,000-token budget. Memory is written at task end by the agent (`agent/memory.py`: `write_task_episode` and `write_repo_learnings` via `create_event`). An orchestrator fallback (`writeMinimalEpisode` in `orchestrator.ts`) writes a minimal episode if the agent container crashes or times out. All memory operations are fail-open — failures never block task execution. See MEMORY.md and OBSERVABILITY.md (Code attribution). Implementation: `src/constructs/agent-memory.ts`, `src/handlers/shared/memory.ts`, `agent/memory.py`.
- Insights and agent self-feedback — The agent writes structured summaries at the end of each task via `write_task_episode` (status, PR URL, cost, duration) and `write_repo_learnings` (codebase patterns and conventions). Agent self-feedback is captured via an "## Agent notes" section in the PR body, extracted post-task by the entrypoint (`_extract_agent_notes` in `agent/entrypoint.py`) and stored as part of the task episode. See MEMORY.md (Extraction prompts) and EVALUATION.md.
- Prompt versioning — System prompts are hashed (SHA-256 of deterministic prompt parts, excluding memory context which varies per run) via `computePromptVersion` in `src/handlers/shared/prompt-version.ts`. The `prompt_version` is stored on the task record in DynamoDB during hydration, enabling future A/B comparison of prompt changes against task outcomes. See EVALUATION.md and ORCHESTRATOR.md (data model).
- Per-prompt commit attribution — A `prepare-commit-msg` git hook (`agent/prepare-commit-msg.sh`) is installed during repo setup and appends `Task-Id: <task_id>` and `Prompt-Version: <hash>` trailers to every agent commit. The hook gracefully skips trailers when `TASK_ID` is unset (e.g. during manual commits). See MEMORY.md.
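The prompt-versioning and commit-attribution items above can be illustrated with a short Python sketch. The real hashing lives in TypeScript (`computePromptVersion`) and the real hook is a shell script, so the functions below are hypothetical analogues. Which prompt parts are hashed and the length-prefix framing are assumptions; the SHA-256 digest, the trailer names, and the skip when the task id is unset come from the descriptions above.

```python
import hashlib


def compute_prompt_version(parts):
    """Hash the deterministic prompt parts (memory context excluded) with SHA-256."""
    digest = hashlib.sha256()
    for part in parts:
        data = part.encode("utf-8")
        # Assumed framing: a length prefix keeps ("ab", "c") distinct from ("a", "bc").
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


def commit_trailers(task_id, prompt_version):
    """Return the trailer block for a commit, or '' when no task id is set."""
    if not task_id:
        return ""  # mirrors the hook skipping trailers for manual commits
    return f"Task-Id: {task_id}\nPrompt-Version: {prompt_version}"


if __name__ == "__main__":
    version = compute_prompt_version(["platform default prompt", "blueprint override"])
    print(commit_trailers("01HXYZ", version))
```

Keeping per-run memory context out of the hash input is what makes the version stable across tasks that share the same prompt template.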
Builds on Iteration 3a: Onboarding and per-repo config are in place; adds memory Tier 1 (repo knowledge, task episodes), insights, agent self-feedback, prompt versioning, and commit attribution. These are all write-at-end / read-at-start additions that do not change the orchestrator blueprint.
Goal: Address architectural risks identified by external review before moving to new features. These are fixes to existing code, not new capabilities.
- Conditional writes in agent `task_state.py` — Added `ConditionExpression` guards to `write_running()` (requires status IN SUBMITTED, HYDRATING) and `write_terminal()` (requires status IN RUNNING, HYDRATING, FINALIZING). `ConditionalCheckFailedException` is caught by `type(e).__name__` (avoids botocore import) and logged as a skip. Prevents the agent from silently overwriting orchestrator-managed CANCELLED status. See `agent/task_state.py`.
- Orchestrator Lambda error alarm — Added CloudWatch alarm on `fn.metricErrors()` (threshold: 3, evaluation: 2 periods of 5min, treatMissingData: NOT_BREACHING). Skipped SQS DLQ since durable execution (`withDurableExecution`, 14-day retention) manages its own retries; a DLQ would conflict. Added `retryAttempts: 0` on the alias async invoke config to prevent Lambda-level duplicate invocations. Alarm exported as `errorAlarm` public property for dashboard/SNS wiring. See `src/constructs/task-orchestrator.ts`.
- Concurrency counter reconciliation — Implemented `ConcurrencyReconciler` construct with a scheduled Lambda (EventBridge rate 15min). Handler scans the concurrency table, queries the task table's `UserStatusIndex` GSI per user with a `FilterExpression` on active statuses (SUBMITTED, HYDRATING, RUNNING, FINALIZING), compares actual count with stored `active_count`, and corrects drift. See `src/constructs/concurrency-reconciler.ts`, `src/handlers/reconcile-concurrency.ts`.
- Multi-AZ NAT for production — Already configurable via `AgentVpcProps.natGateways` (default: 1) at `src/constructs/agent-vpc.ts:60`. Deployers can set `natGateways: 2` or higher for multi-AZ redundancy. No code changes needed — documentation-only update.
- Orchestrator IAM grant for Memory — The orchestrator Lambda had `MEMORY_ID` in its env vars and called `loadMemoryContext` / `writeMinimalEpisode`, but was never granted `bedrock-agentcore:RetrieveMemoryRecords` or `bedrock-agentcore:CreateEvent` permissions. The fail-open pattern silently swallowed `AccessDeniedException`, making memory appear empty. Fixed by adding `agentMemory.grantReadWrite(orchestrator.fn)` in `agent.ts`, with a new stack test asserting the grant. See `src/stacks/agent.ts:255`.
- Memory schema versioning — Added `schema_version: "2"` metadata field to all memory write operations (Python agent `memory.py` and TypeScript `memory.ts`). Enables distinguishing records written under the old namespace scheme (v1, `repos/` prefix) from the new namespace-template scheme (v2, `/{actorId}/knowledge/`). Supports future migration tooling and debugging.
- Python repo format validation — Added `_validate_repo()` in `agent/memory.py` that asserts the `repo` parameter matches `^[a-zA-Z0-9._-]+/[a-zA-Z0-9._-]+$` (mirrors TypeScript `isValidRepo`). Catches format mismatches (full URLs, extra whitespace, wrong casing) that would cause namespace divergence between write and read paths.
- Severity-aware error logging in Python memory — Replaced bare `except Exception` blocks with `_log_error()` helper that distinguishes infrastructure errors (network, auth, throttling → WARN) from programming errors (`TypeError`, `ValueError`, `AttributeError`, `KeyError` → ERROR). All exceptions are still caught (fail-open preserved), but bugs surface as ERROR-level logs instead of being hidden at WARN.
- Narrowed entrypoint try-catch — Separated `_extract_agent_notes()` extraction from memory writes in `agent/entrypoint.py`. Agent notes parsing failure now logs `"Agent notes extraction failed"` (specific) instead of `"Memory write failed"` (misleading). Memory writes (`write_task_episode`, `write_repo_learnings`) are no longer nested inside the same try-catch, since they are individually fail-open.
- Orchestrator fallback episode observability — `writeMinimalEpisode` return value is now checked and logged: `logger.warn('Fallback episode write returned false')` when the inner function reports failure via its return value (previously discarded). New test `logs warning when writeMinimalEpisode returns false` covers this path.
- Python unit tests — Added pytest-based unit tests (`agent/tests/`) for pure functions: `slugify()`, `redact_secrets()`, `format_bytes()`, `truncate()`, `build_config()`, `assemble_prompt()`, `_discover_project_config()`, `_build_system_prompt()` (entrypoint), `_validate_repo()` (memory), `_now_iso()`, `_build_logs_url()` (task_state). Added pytest to dev dependency group with `pythonpath` config for in-tree imports.
- Decompose `entrypoint.py` — Initially extracted four named subfunctions (`_build_system_prompt()`, `_discover_project_config()`, `_write_memory()`, `_setup_agent_env()`). Subsequently, the agent code was further decomposed into a full `agent/src/` module structure: `config.py` (configuration and validation), `models.py` (Pydantic data models and enumerations), `pipeline.py` (task orchestration), `runner.py` (agent execution), `context.py` (context hydration), `prompt_builder.py` (prompt assembly), `hooks.py` (PreToolUse policy hooks), `policy.py` (Cedar policy engine), `post_hooks.py` (deterministic post-hooks), `repo.py` (repository setup), `shell.py` (utilities), `telemetry.py` (metrics and trajectory). The original `entrypoint.py` is now a re-export shim for backward compatibility with tests.
- Deprecate dual prompt assembly — Added deprecation docstring to `assemble_prompt()` clarifying that production uses the orchestrator's `assembleUserPrompt()` via `HydratedContext.user_prompt` (validated from the incoming JSON). Python version retained only for local batch mode and dry-run mode. No code deletion — just documentation of the intended flow.
- Graceful thread drain in `server.py` — Added `_active_threads` list for tracking background threads, `_drain_threads(timeout=300)` function that joins all alive threads, registered via `@app.on_event("shutdown")` (FastAPI lifecycle — uvicorn translates SIGTERM) and `atexit.register()` as backup. Thread list is cleaned on each new invocation.
- Remove dead QUEUED state — Removed `QUEUED` from `TaskStatus`, `VALID_TRANSITIONS`, and `ACTIVE_STATUSES` in `task-status.ts`. Updated SUBMITTED transitions to `[HYDRATING, FAILED, CANCELLED]`. Removed QUEUED from all tests (count assertions, cancel test, validation test) and documentation (ORCHESTRATOR.md, OBSERVABILITY.md, API_CONTRACT.md, ARCHITECTURE.md).
- Hardening fixes (review round) — Thread race in `server.py` (track thread before `start()`), defensive `.get()` on `ClientError.response` in `task_state.py`, wired `fallback_error` through `orchestrator.ts` (warning log + event metadata), TOCTOU `ConditionExpression` on reconciler update, per-user error isolation in reconciler, `TaskStatusType` propagation across types/orchestrator/memory, graduated trajectory writer failure, subprocess timeouts, FastAPI lifespan pattern, `decrementConcurrency` CCF distinction.
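The repo format check from the list above can be sketched as follows. The regular expression is the one quoted in the roadmap; the function name, the error message, and the use of `fullmatch` are choices made for this example. `fullmatch` is used because in Python a trailing `$` also matches just before a final newline, so `match` alone would accept `"owner/repo\n"`. Note also that the pattern as quoted accepts mixed case, so casing differences would need a separate check.

```python
import re

# Pattern quoted in the roadmap for "owner/repo" identifiers.
REPO_PATTERN = re.compile(r"^[a-zA-Z0-9._-]+/[a-zA-Z0-9._-]+$")


def validate_repo(repo):
    """Return repo unchanged if it looks like 'owner/repo', else raise ValueError."""
    if not isinstance(repo, str) or REPO_PATTERN.fullmatch(repo) is None:
        raise ValueError(f"repo must look like 'owner/repo', got {repo!r}")
    return repo


if __name__ == "__main__":
    print(validate_repo("owner/repo"))
    for bad in ("https://github.com/owner/repo", " owner/repo", "owner/repo\n"):
        try:
            validate_repo(bad)
        except ValueError as err:
            print(f"rejected: {err}")
```

Rejecting malformed values on the write path matters because the repo string becomes part of the memory namespace; a value that differs between write and read would silently point at a different namespace.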
Follow-ups (identified during review, not blocking):
- Reconciler batch error tracking — Added `errors` counter to `reconcile-concurrency.ts`. Incremented in the per-user catch block. Final log line now includes `{ scanned, corrected, errors }`. Logs at ERROR if `errors === scanned && scanned > 0` (systemic failure).
- Test: `decrementConcurrency` CCF path — Added two tests in `orchestrate-task.test.ts`: one for `ConditionalCheckFailedException` (best-effort, no throw) and one for non-CCF errors (swallowed with warn log, no throw).
- Test: reconciler non-CCF update failure — Added test in `reconcile-concurrency.test.ts`: two users with drift, user-1's `UpdateItemCommand` fails with non-CCF error, user-2 still corrected (per-user error isolation).
- Consistent error serialization — Replaced all `String(err)` in error/warn log contexts with `err instanceof Error ? err.message : String(err)` across `context-hydration.ts`, `orchestrator.ts`, `memory.ts`, and `repo-config.ts`.
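The reconciler behaviour described in this section (recount active tasks per user, correct drift, isolate per-user failures, and report `{ scanned, corrected, errors }`) can be modelled in a few lines. The real handler is TypeScript and talks to DynamoDB; this Python sketch is a hypothetical in-memory analogue in which `update_counter` stands in for the conditional update call.

```python
import logging

logger = logging.getLogger("reconciler")

ACTIVE_STATUSES = {"SUBMITTED", "HYDRATING", "RUNNING", "FINALIZING"}


def reconcile(counters, tasks, update_counter):
    """Compare stored per-user counters with actual active tasks and fix drift.

    counters: dict of user_id -> stored active_count
    tasks: iterable of dicts with "user_id" and "status"
    update_counter: callable(user_id, stored, actual); may raise on failure
    """
    tasks = list(tasks)
    summary = {"scanned": 0, "corrected": 0, "errors": 0}
    for user_id, stored in list(counters.items()):
        summary["scanned"] += 1
        try:
            actual = sum(
                1 for task in tasks
                if task["user_id"] == user_id and task["status"] in ACTIVE_STATUSES
            )
            if actual != stored:
                update_counter(user_id, stored, actual)
                summary["corrected"] += 1
        except Exception as err:
            # Per-user isolation: one failure must not stop the remaining users.
            summary["errors"] += 1
            logger.warning("reconcile failed for %s: %s", user_id, err)
    systemic = summary["scanned"] > 0 and summary["errors"] == summary["scanned"]
    logger.log(logging.ERROR if systemic else logging.INFO, "reconcile summary: %s", summary)
    return summary


if __name__ == "__main__":
    stored = {"user-1": 3, "user-2": 0}
    running = [{"user_id": "user-2", "status": "RUNNING"}]
    print(reconcile(stored, running, lambda user, old, new: stored.update({user: new})))
```

Passing the previously stored value to `update_counter` lets a real implementation make the update conditional on the counter not having changed since it was read, which is the TOCTOU guard mentioned in the hardening fixes.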
Goal: Multi-layered validation catches errors, enforces code quality, and assesses change risk before PRs are created; the platform supports more than one task type; multi-modal input broadens what users can express.
- Per-repo GitHub credentials (GitHub App + AgentCore Token Vault) — Replace the single shared OAuth token with a GitHub App installed per-organization or per-repository, using AgentCore Identity's Token Vault for credential management (recommended approach). Each onboarded repo is associated with a GitHub App installation that grants fine-grained permissions (read/write to that repo only). This eliminates the security gap where any authenticated user can trigger agent work against any repo the shared token can access.

  Implementation approach — AgentCore Token Vault integration:

  - WorkloadIdentity resource — Create a `CfnWorkloadIdentity` in CDK representing the agent's identity, enabling token exchange with GitHub.
  - Token Vault credential provider — Register the GitHub App's credentials in the AgentCore Token Vault. For server-to-server authentication, the GitHub App uses a private key to sign JWTs that are exchanged for installation tokens via the GitHub API. For the user-authorization OAuth flow (acting on behalf of a user), the App's client ID and client secret are registered as an OAuth credential provider. The Token Vault handles token refresh automatically — no expiry issues for long-running tasks (sessions exceeding 1 hour).
  - Orchestrator token generation — At task hydration time, the orchestrator calls the GitHub API to generate an installation token (1-hour TTL, scoped to the target repo) and passes it to the agent at session start.
  - Agent-side token refresh — For tasks running longer than 1 hour, the agent calls `GetWorkloadAccessToken` (permissions already granted to the runtime execution role: `bedrock-agentcore:GetWorkloadAccessToken`, `GetWorkloadAccessTokenForJWT`, `GetWorkloadAccessTokenForUserId`) to obtain a fresh token from the Token Vault. No Secrets Manager reads needed at runtime.
  - Blueprint configuration — Extend `Blueprint` credentials with `githubAppId`, `githubAppPrivateKeySecretArn`, and `githubAppInstallationId` (per-org or per-repo).
  - Gateway integration (future) — Wire an AgentCore Gateway target for GitHub API calls with automatic credential injection, enabling audit trails and Cedar policy enforcement per request. Git transport (clone/push) still requires a token in the remote URL, so Gateway-mediated access applies to API operations only.

  Why Token Vault over Secrets Manager: The runtime already has `GetWorkloadAccessToken` permissions (granted by the AgentCore Runtime construct). Token Vault is purpose-built for dynamic credential vending — it manages refresh automatically, supports arbitrary OAuth providers (GitHub, GitLab, Jira, Slack via the same pattern), and keeps credentials out of the sandbox as static secrets. This sets up the pattern for all future third-party integrations.

  Per-user identity flow (future, connects to SSO): With a GitHub App, installation tokens can be scoped per-repository and per-permission set. Combined with federated identity (SSO), the orchestrator can look up the user's GitHub identity and generate tokens scoped to the target repo with only the permissions that user would have. Git commits are attributed to the GitHub App acting on behalf of the user.

  This is a prerequisite for any multi-user or multi-team deployment.
- Orchestrator pre-flight checks (fail-closed) — Add a `pre-flight` step before `start-session` so doomed tasks fail fast without consuming AgentCore runtime. The orchestrator performs lightweight readiness checks with strict timeouts (for example, 5 seconds): verify GitHub API reachability, verify repository existence and credential access (`GET /repos/{owner}/{repo}` or equivalent), and optionally verify AgentCore Runtime availability when a status probe exists. If pre-flight fails, the task transitions to `FAILED` immediately with a clear terminal reason (`GITHUB_UNREACHABLE`, `REPO_NOT_FOUND_OR_NO_ACCESS`, `RUNTIME_UNAVAILABLE`), releases the concurrency slot, emits an event/notification, and does not invoke the agent. Unlike memory/context hydration (fail-open), pre-flight is explicitly fail-closed: inability to verify repo access blocks execution by design.
- Persistent session storage (cache layer) — Enabled AgentCore Runtime persistent session storage (preview) for selective cache persistence across stop/resume. A per-session filesystem is mounted at `/mnt/workspace` via `FilesystemConfigurations` (CFN escape hatch on the L2 construct). The S3-backed FUSE mount does not support `flock()` (returns `ENOTRECOVERABLE` / os error 524), so only caches whose tools never call `flock()` go on the mount (`npm_config_cache`, `CLAUDE_CONFIG_DIR`). Caches for tools that use `flock()` stay on local ephemeral disk (`MISE_DATA_DIR=/tmp/mise-data` — mise's pipx backend delegates to `uv` which flocks inside installs/; `UV_CACHE_DIR=/tmp/uv-cache`). Repo clones stay on `/workspace` (local) for the same reason. The `AGENT_WORKSPACE` env var and `{workspace}` system prompt placeholder are wired for a future move to persistent repo clones if the mount adds `flock()` support. Each `runtimeSessionId` gets isolated storage (no cross-task leakage). 14-day TTL; data deleted on runtime version update. See COMPUTE.md.
- Pre-execution task risk classification — Add a lightweight risk classifier at task submission (before orchestration starts) to drive proportional controls for agent execution. Initial implementation can be rule-based and Blueprint-configurable: prompt keywords (for example, `database`, `auth`, `security`, `infrastructure`), metadata from issue labels, and file/path signals when available (for example, `**/migrations/**`, `**/.github/**`, infra directories). Persist `risk_level` (low/medium/high/critical) on the task record and use it to set defaults and policy: model tier/cascade, turn and budget defaults, prompt strictness/conservatism, approval requirements before merge, and optional autonomous-execution blocks for `critical` tasks. This is intentionally pre-execution and complements (does not replace) post-execution PR risk/blast-radius analysis.
- Principal-to-repository authorization mapping — Bind repository access to the requesting principal, not merely any authenticated user. Map Cognito identities to allowed repository sets so that User A cannot trigger agent work on User B's repositories. This is distinct from the credential mechanism (GitHub App tokens solve the credential blast radius but not the authorization problem). Implementation: add a `user_id → repo[]` authorization table (or extend onboarding config with `authorized_users`), check authorization in the orchestrator before session start, and return `UNAUTHORIZED_REPO` on mismatch. See SECURITY.md.
- Tiered validation pipeline — Three tiers of post-agent validation run sequentially after the agent finishes but before finalization. Each tier can fail the PR independently, and failure output is fed back to the agent for a fix cycle (capped at 2 retries per tier to bound cost). If the agent still fails, the PR is created with a validation report (labels, comments, and a risk summary) so the reviewer knows. All three tiers are implemented via the blueprint framework's Layer 2 custom steps (`phase: 'post-agent'`). See REPO_ONBOARDING.md for the 3-layer customization model, ORCHESTRATOR.md for the step execution contract, and EVALUATION.md for the full design.
  - Tier 1 — Tool validation (build, test, lint) — Run deterministic tooling: test suites, linters, type checkers, SAST scanners, or a custom script. This is the existing "deterministic validation" concept. Binary pass/fail; failures are concrete (test output, lint errors) and actionable by the agent in a fix cycle. Already partially implemented via the system prompt instructing the agent to run tests.
  - Tier 2 — Code quality analysis — Static analysis of the agent's diff against code quality principles: DRY (duplicated code detection), SOLID violations, design pattern adherence, complexity metrics (cyclomatic, cognitive), naming conventions, and repo-specific style rules (from onboarding config). Implemented as an LLM-based review step or a combination of static analysis tools (e.g. SonarQube rules, custom linters) and LLM judgment. Produces structured findings (severity, location, rule, suggestion) that the agent can act on in a fix cycle. Findings below a configurable severity threshold are advisory (included in the PR as comments) rather than blocking.
  - Tier 3 — Risk and blast radius analysis — Analyze the scope and impact of the agent's changes to detect unintended side effects in other parts of the codebase. Includes: dependency graph analysis (what modules/functions consume the changed code), change surface area (number of files, lines, and modules touched), semantic impact assessment (does the change alter public APIs, shared types, configuration, or database schemas), and regression risk scoring. Produces a risk level (low / medium / high / critical) attached to the PR as a label and included in the validation report. High-risk changes may require explicit human approval before merge (foundation for the HITL approval mode in Iteration 6). The risk level considers: number of downstream dependents affected, whether the change touches shared infrastructure or core abstractions, test coverage of the affected area, and whether the change introduces new external dependencies.
- PR risk level and validation report — Every agent-created PR includes a structured validation report (as a PR comment or check run) summarizing: Tier 1 results (pass/fail per tool), Tier 2 findings (code quality issues by severity), Tier 3 risk assessment (risk level, blast radius summary, affected modules). The PR is labeled with the computed risk level (`risk:low`, `risk:medium`, `risk:high`, `risk:critical`). Risk level is persisted in the task record for evaluation and trending. See EVALUATION.md.
- Other task types: PR review and PR-iteration — Support additional task types beyond "implement from issue": iterate on pull request (`pr_iteration`) reads review comments and addresses them (implement changes, push updates, post summary). Review pull request (`pr_review`) is a read-only task type where the agent analyzes a PR's changes and posts structured review comments via the GitHub Reviews API. The `pr_review` agent runs without `Write` or `Edit` tools (defense-in-depth), skips `ensure_committed` and push, and treats build status as informational only. Each review comment uses a structured format: type (comment/question/issue/good_point), severity for issues (minor/medium/major/critical), title, description with memory attribution, proposed fix, and a ready-to-use AI prompt. The CLI exposes `--review-pr <number>` (mutually exclusive with `--pr`).
- Input guardrail screening (Bedrock Guardrails) — Amazon Bedrock Guardrails screen task descriptions at submission time and assembled PR prompts during context hydration (`pr_iteration`, `pr_review`). Uses the `PROMPT_ATTACK` content filter at `HIGH` strength. Fail-closed: Bedrock outages block tasks rather than letting unscreened content through. See SECURITY.md.
- Guardrail screening for GitHub issue content (`new_task`) — Bedrock Guardrail screening now covers GitHub issue bodies and comments fetched during context hydration for `new_task` tasks. The assembled user prompt is screened through the `PROMPT_ATTACK` filter when issue content is present; when no issue content is fetched (task_description only), hydration-time screening is skipped because the task description was already screened at submission time. Same fail-closed pattern as PR tasks. See SECURITY.md.
- Multi-modal input — Accept text and images (or other modalities) in the task payload; pass through to the agent. Gateway and schema support it; agent harness supports it where available. Primary use case: screenshots of bugs, UI mockups, or design specs attached to issues.
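As an illustration of the rule-based risk classifier described in the list above, the sketch below combines keyword, label, and path signals and keeps the highest level that any rule produces. The rule tables, the level assigned to each signal, and the function name `classify_risk` are assumptions made for this example; the roadmap does not fix them, and a real implementation would presumably load them from Blueprint config.

```python
import re
from fnmatch import fnmatchcase

# Ordered from least to most severe; matches the risk_level values in the roadmap.
RISK_ORDER = ["low", "medium", "high", "critical"]

# Hypothetical rule tables (assumed values, not taken from the design docs).
KEYWORD_LEVELS = {"database": "high", "auth": "high", "security": "high", "infrastructure": "critical"}
LABEL_LEVELS = {"security": "high", "breaking-change": "medium"}
PATH_LEVELS = {"**/migrations/**": "high", "**/.github/**": "critical"}


def classify_risk(prompt, labels=(), paths=()):
    """Return the highest risk level triggered by any keyword, label, or path rule."""
    hits = ["low"]  # default when nothing matches
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    hits += [level for keyword, level in KEYWORD_LEVELS.items() if keyword in words]
    hits += [LABEL_LEVELS[label] for label in labels if label in LABEL_LEVELS]
    for path in paths:
        # Prefix a slash so patterns like **/.github/** also match root-level directories.
        rooted = "/" + path.lstrip("/")
        hits += [level for pattern, level in PATH_LEVELS.items() if fnmatchcase(rooted, pattern)]
    return max(hits, key=RISK_ORDER.index)
```

Keyword matching here is whole-word only, so `auth` does not match "authentication"; whether that is the desired behavior is a tuning decision left open by the roadmap.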
Scope note: Iteration 3c contains a wide range of items — from security-critical (GitHub App credentials, guardrail screening) to quality-improving (tiered validation, risk classification) to capability-expanding (multi-modal input). Items marked [x] are done. The remaining items can be delivered incrementally; the tiered validation pipeline and risk classification in particular can ship independently of per-repo credentials and multi-modal input.
Builds on Iteration 3b: Memory is operational; this iteration changes the orchestrator blueprint (tiered validation pipeline, new task type) and broadens the input schema. These are independently testable from memory.
Goal: The primary feedback loop (PR reviews → memory → future tasks) is operational; automated evaluation provides measurable quality signals; PR outcomes are tracked as feedback.
- Post-execution output screening — Post-execution screening for secrets, PII, and unsafe content is enforced as a runtime control. Tool outputs are screened after each tool call completes via a PostToolUse hook (`agent/src/hooks.py`) backed by a regex-based output scanner (`agent/src/output_scanner.py`). Detected patterns include AWS access keys, AWS secret keys, GitHub tokens (PAT, OAuth, App, fine-grained), private keys (PEM blocks), Bearer tokens, and connection strings with embedded passwords. When sensitive content is found, the hook returns `updatedMCPToolOutput` with redacted content (steered enforcement — content is sanitized, not blocked). Findings emit `OUTPUT_SCREENING` telemetry events via `agent/src/telemetry.py`. This closes the gap where an agent could leak a `.env` value into a PR description or commit message — input-only guardrails cannot catch this. Informed by the ABCA Threat Model Matrix (Threat 7: Sensitive data disclosure, rated Medium-High; Priority 3). See SECURITY.md (Mid-execution enforcement).
- Context hydration screening for untrusted content — Add Bedrock Guardrails screening of hydrated context (PR review comments, issue bodies, review feedback) at the point of injection into the agent prompt, not only at task submission time. The current guardrail screening happens at submission for task descriptions and during hydration for `pr_iteration`/`pr_review` task types, but if an attacker posts a malicious PR review comment after the task is created, it may be hydrated into context without screening when fetched during the review feedback memory loop (Iteration 3d). Implementation: extend `context-hydration.ts` to screen all externally-sourced content through the `PROMPT_ATTACK` filter before including it in the assembled prompt, with fail-closed semantics matching the existing guardrail pattern. Tag screened content with `trust_level: untrusted-external` metadata. Informed by the ABCA Threat Model Matrix (Threats 1 and 6: Agent goal hijack and Memory/context poisoning). See SECURITY.md.
- Review feedback memory loop (Tier 2) — Capture PR review comments via GitHub webhook, extract actionable rules via LLM, and persist them as searchable memory so the agent internalizes reviewer preferences over time. This is the primary feedback loop between human reviewers and the agent — no shipping coding agent does this today. Requires a GitHub webhook → API Gateway → Lambda pipeline (separate from agent execution). Two types of extracted knowledge: repo-level rules ("don't use `any` types") and task-specific corrections. See MEMORY.md (Review feedback memory) and SECURITY.md (prompt injection via review comments).
- PR outcome tracking — Track whether agent-created PRs are merged, revised, or rejected via GitHub webhooks (`pull_request.closed` events). A merged PR is a positive signal; closed-without-merge is a negative signal. These outcome signals feed into the evaluation pipeline and enable the episodic memory to learn which approaches succeed. See MEMORY.md (PR outcome signals) and EVALUATION.md.
- Evaluation pipeline (basic) — Automated evaluation of agent runs: failure categorization (reasoning errors, missed instructions, missing tests, timeouts, tool failures). Results are stored and surfaced in observability dashboards. Basic version: rules-based analysis of task outcomes and agent responses. Track memory effectiveness metrics: first-review merge rate, revision cycles, CI pass rate on first push, review comment density, and repeated mistakes. Advanced version (ML-based trace analysis, A/B prompt comparison, feedback loop into prompts) is deferred to Iteration 5. See EVALUATION.md and OBSERVABILITY.md.
- Behavioral circuit breaker specification — Define the concrete specification for mid-execution behavioral monitoring (currently listed as planned work in Iteration 5). The circuit breaker monitors aggregate agent behavior within a running session and triggers pause/terminate/alert actions when anomalous patterns are detected. Signals: tool-call frequency (calls per minute), cumulative session cost velocity, repeated failures on the same tool (>N consecutive), file mutation rate (files written per minute), anomalous file access patterns (reads outside the repo tree, access to sensitive paths like `/etc/`, `~/.ssh/`), and memory write bursts (>N writes in a window). Actions: `pause` (suspend session, emit alert, await operator decision), `terminate` (stop session, transition to FAILED with `CIRCUIT_BREAKER` reason code), `alert` (continue but emit high-priority notification). Thresholds: configurable per-repo via Blueprint `security.circuitBreaker` with platform-wide defaults (e.g., >50 tool calls/minute, >$10 cumulative cost, >5 consecutive same-tool failures). The specification is delivered in this iteration as a design artifact; implementation ships in Iteration 5 as part of mid-execution behavioral monitoring. Informed by the ABCA Threat Model Matrix (Threats 2, 8, 9: Tool misuse, Runaway cost, and Rogue behavior). See SECURITY.md (Mid-execution enforcement).
- Per-tool-call structured telemetry — Instrument the agent harness (`agent/src/telemetry.py`) to emit structured events for every tool call: tool name, input hash (SHA-256), output hash, duration, cost attribution, and result status. Events flow through the existing `create_event` path and are surfaced in CloudWatch. This is foundational for: (a) the evaluation pipeline (tool-call-level success/failure analysis), (b) the centralized policy framework Phase 1 (tool calls become `PolicyDecisionEvent` sources in Iteration 5), and (c) future mid-execution policy enforcement (tool-call interceptor in Iteration 5). Without per-tool-call telemetry, the platform can only observe sessions as opaque black boxes — model invocation logs capture LLM reasoning but not the tool execution that connects reasoning to action. Informed by the Guardian system's tool-call interception architecture (Hu et al. 2025). See OBSERVABILITY.md and SECURITY.md (Mid-execution enforcement).
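To make the post-execution output screening item above more concrete, here is a minimal sketch of a regex-based scanner that redacts matches instead of blocking the output. It is not the contents of `agent/src/output_scanner.py`: the pattern set, the `[REDACTED:...]` marker format, and the function name `scan_and_redact` are assumptions for illustration, and the patterns are deliberately simplified rather than exhaustive. Wiring the result into the PostToolUse hook and emitting telemetry are left out.

```python
import re

# Hypothetical, simplified patterns; real credential formats vary and evolve.
PATTERNS = {
    "aws_access_key": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "github_token": re.compile(r"\bgh[pousr]_[A-Za-z0-9]{36,}\b"),
    "private_key": re.compile(
        r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
        re.DOTALL,
    ),
    "bearer_token": re.compile(r"\bBearer\s+[A-Za-z0-9\-._~+/]+=*"),
    # scheme://user:password@ prefix of a connection string
    "connection_string": re.compile(r"\b[a-z][a-z0-9+.-]*://[^\s:/@]+:[^\s@]+@"),
}


def scan_and_redact(text):
    """Return (redacted_text, finding_names). Content is sanitized, not blocked."""
    findings = []
    for name, pattern in PATTERNS.items():
        text, count = pattern.subn(f"[REDACTED:{name}]", text)
        if count:
            findings.append(name)
    return text, findings
```

A caller would pass each tool output through `scan_and_redact` and, when the findings list is non-empty, return the redacted text to the agent and record the finding names for telemetry.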
Prerequisite: 3e Phase 1 (input hardening) ships with this iteration. The review feedback memory loop writes attacker-controlled content (PR review comments) to persistent memory. Without content sanitization, provenance tagging, and integrity hashing (3e Phase 1), this creates a known attack vector — poisoned review comments stored as persistent rules that influence all future tasks on the repo. 3e Phase 1 items (memory content sanitization, GitHub issue input sanitization, source provenance on memory writes, content integrity hashing) must be implemented before or concurrently with the review feedback pipeline. See SECURITY.md (Prompt injection via PR review comments).
Builds on Iteration 3c: Validation and PR review task type are in place; this iteration adds new infrastructure (webhook → Lambda → LLM extraction pipeline), connects the feedback loop, and closes output screening and context hydration screening gaps identified by the ABCA Threat Model Matrix.
Goal: Harden the memory system against both adversarial corruption (prompt injection into memory, poisoned tool outputs, experience grafting) and emergent corruption (hallucination crystallization, feedback loops, stale context accumulation). OWASP classifies this as ASI06 — Memory & Context Poisoning in the 2026 Top 10 for Agentic Applications.
Deep research identified 9 memory-layer security gaps in the current architecture (see the Memory Security Analysis section in MEMORY.md). The platform has strong network-layer security (VPC isolation, DNS Firewall, HTTPS-only egress) but lacks memory content validation, provenance tracking, trust scoring, anomaly detection, and rollback capabilities. Research shows that MINJA-style attacks achieve 95%+ injection success rates against undefended agent memory systems, and that emergent self-corruption (hallucination crystallization, error compounding feedback loops) is equally dangerous because it lacks an external attacker signature.
Phase 1 is a prerequisite for Iteration 3d's review feedback memory loop. Attacker-controlled PR review comments must not enter persistent memory without sanitization, provenance tagging, and integrity checking. These items ship concurrently with 3d, not after it.
- Memory content sanitization — Add content validation in `loadMemoryContext()` (`src/handlers/shared/memory.ts`). Scan retrieved memory records for injection patterns (embedded instructions, system prompt overrides, command injection payloads) before including them in the agent's context. Implement a `sanitizeMemoryContent()` function that strips or flags suspicious patterns while preserving legitimate repository knowledge.
- GitHub issue input sanitization — Add trust-boundary-aware sanitization in `context-hydration.ts` for GitHub issue bodies and comments. These are attacker-controlled inputs that currently flow into the agent's context without differentiation. Strip control characters, embedded instruction patterns, and known injection payloads. Tag the content source as `untrusted-external` in the hydrated context.
- Source provenance on memory writes — Tag all memory writes with source provenance metadata. In `memory.ts` (`writeMinimalEpisode`) and `agent/memory.py` (`write_task_episode`, `write_repo_learnings`), add a `source_type` field to event metadata: `agent_episode`, `agent_learning`, `orchestrator_fallback`, `github_issue`, or `review_feedback`. This enables trust-differentiated retrieval in Phase 2.
- Content integrity hashing — Add SHA-256 content hashing on all memory writes. Store the hash in event metadata. At read time, verify that content has not been modified between write and read. Implementation: compute hash before `CreateEventCommand`, store as `content_hash` metadata, verify on `RetrieveMemoryRecordsCommand` results.
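The write-then-verify flow for content integrity hashing can be sketched as follows. This is a minimal illustration, assuming the metadata is a plain dictionary carrying the `source_type` and `content_hash` fields named above; the helper names are hypothetical, and the calls to the memory service are omitted. Treating a record without a hash as unverified is also an assumption, since the roadmap does not say how pre-existing records should be handled.

```python
import hashlib
import hmac


def content_hash(content: str) -> str:
    """SHA-256 hex digest of the UTF-8 encoded memory content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def build_event_metadata(content: str, source_type: str) -> dict:
    """Metadata to attach at write time (hypothetical shape)."""
    return {"source_type": source_type, "content_hash": content_hash(content)}


def verify_record(content: str, metadata: dict) -> bool:
    """Check at read time that content still matches the hash stored at write time."""
    expected = metadata.get("content_hash")
    if not expected:
        return False  # assumption: records without a hash are treated as unverified
    return hmac.compare_digest(expected, content_hash(content))
```

A plain hash detects modification of the content alone; it does not protect against an attacker who can rewrite both the content and its metadata, which is the gap the later provenance-chain item addresses.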
- Trust scoring at retrieval — Modify `loadMemoryContext()` to weight retrieved memories by temporal freshness, source type reliability, and pattern consistency with other memories. Memories from `orchestrator_fallback` and `agent_episode` sources receive higher trust than memories derived from external inputs. Entries below a configurable trust threshold are deprioritized or excluded from the 2,000-token budget.
- Configurable temporal decay — Implement per-entry TTL with configurable decay rates. Unverified or externally-sourced memory entries decay faster (e.g., 30-day default) than agent-generated or human-confirmed entries (e.g., 365-day default). Add `trust_tier` and `decay_rate` to the memory metadata schema.
- Memory validation Lambda — Add a lightweight validation function triggered on `CreateEventCommand` (via EventBridge rule on AgentCore events or as a post-write hook). The validator runs a classifier that checks whether new memory content looks like legitimate repository knowledge or could influence future agent behavior in unintended ways (the "guardian pattern"). Flag suspicious entries for operator review.
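One possible shape for trust-scored retrieval with temporal decay is sketched below. The scoring formula (source weight multiplied by an exponential decay), the specific weights, the interpretation of the 30-day and 365-day defaults as half-lives, and the 0.3 threshold are all assumptions chosen for illustration; the roadmap only names the inputs. Pattern consistency with other memories is not modeled here.

```python
from dataclasses import dataclass

# Hypothetical reliability weights per source_type.
SOURCE_WEIGHT = {
    "orchestrator_fallback": 1.0,
    "agent_episode": 1.0,
    "agent_learning": 0.9,
    "review_feedback": 0.6,
    "github_issue": 0.4,
}
TRUSTED_SOURCES = {"orchestrator_fallback", "agent_episode", "agent_learning"}


@dataclass
class MemoryRecord:
    text: str
    source_type: str
    age_days: float
    tokens: int


def trust_score(record: MemoryRecord) -> float:
    """Source weight decayed by age; externally sourced entries decay faster."""
    weight = SOURCE_WEIGHT.get(record.source_type, 0.0)
    half_life_days = 365.0 if record.source_type in TRUSTED_SOURCES else 30.0
    return weight * 0.5 ** (record.age_days / half_life_days)


def select_for_context(records, threshold=0.3, token_budget=2000):
    """Drop low-trust records, then fill the token budget in descending trust order."""
    ranked = sorted(
        (r for r in records if trust_score(r) >= threshold),
        key=trust_score,
        reverse=True,
    )
    chosen, used = [], 0
    for record in ranked:
        if used + record.tokens <= token_budget:
            chosen.append(record)
            used += record.tokens
    return chosen
```

With these assumed values, a 30-day-old `github_issue` entry scores 0.2 and is excluded, while a fresh `agent_episode` entry scores 1.0 and is packed first.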
- Memory write anomaly detection — Instrument memory write operations with CloudWatch custom metrics: write frequency per repo, average content length, source type distribution. Add CloudWatch Alarms for anomalous patterns (e.g., burst of writes from a single task, unusually long content, writes with `untrusted-external` source type exceeding a threshold).
- Circuit breaker in orchestrator — Add circuit breaker logic in `orchestrator.ts`: if the agent's tool invocation patterns or memory write patterns deviate from a baseline (e.g., sudden increase in memory writes, writes containing instruction-like patterns), pause the task and emit an alert. The circuit breaker transitions the task to a new `MEMORY_REVIEW` state that requires operator intervention.
- Memory quarantine API — Expose an operator API endpoint (`POST /v1/memory/quarantine`, `GET /v1/memory/quarantine`) for flagging and isolating suspicious memory entries. Quarantined entries are excluded from retrieval but preserved for forensic analysis.
- Memory rollback capability — Implement point-in-time memory snapshots. Before each task starts, snapshot the current memory state for the target repo (via the existing `loadMemoryContext` path, persisted to S3). If poisoning is detected post-task, operators can restore the repo's memory to the pre-task snapshot. Add a `POST /v1/memory/rollback` endpoint.
- Write-ahead validation (guardian model) — Route proposed memory writes through a smaller, cheaper model (e.g., Haiku) that evaluates whether the content is legitimate learned context or could be adversarial. Adds latency (~100-500ms per write) but catches sophisticated attacks that evade pattern-based sanitization. Configurable per-repo via Blueprint.
- Cross-task behavioral drift detection — Compare agent reasoning patterns and tool invocation sequences across tasks for the same repo. Detect drift from established baselines that could indicate memory-influenced behavioral manipulation. Implemented as a post-task analysis step in the evaluation pipeline.
- Cryptographic provenance chain — Implement Merkle tree-based provenance for memory entry chains per repo. Each new entry includes a hash of the previous entry, creating an append-only, tamper-evident chain. Enables cryptographic verification that no entries have been inserted, modified, or deleted between known-good checkpoints.
- Red team validation — Red team the memory system using published attack methodologies: MINJA (query-based memory injection), AgentPoison (RAG retrieval poisoning), and experience grafting. Document results and adjust defenses. Add automated red team tests to the evaluation pipeline using the DeepTeam framework (OWASP ASI06 attack categories).
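The tamper-evident property described in the cryptographic provenance chain item can be illustrated with a linear hash chain, where each entry commits to the hash of its predecessor. Note that this sketch is a simple chain, not a full Merkle tree, and the entry layout, the all-zero genesis value, and the function names are assumptions for the example. Detecting truncation of the tail requires comparing against a head hash recorded at a known-good checkpoint, which is modeled by the optional `expected_head` argument.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder for "no previous entry"


def entry_hash(prev_hash: str, content: str) -> str:
    """Hash that binds an entry's content to the hash of the entry before it."""
    payload = json.dumps({"prev": prev_hash, "content": content}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(chain: list, content: str) -> None:
    """Append a new entry linked to the current head of the chain."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"content": content, "prev_hash": prev, "hash": entry_hash(prev, content)})


def verify_chain(chain: list, expected_head=None) -> bool:
    """Recompute every link; optionally compare the head to a checkpointed hash."""
    prev = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev or entry["hash"] != entry_hash(prev, entry["content"]):
            return False
        prev = entry["hash"]
    return expected_head is None or prev == expected_head
```

Modifying, inserting, or removing an entry in the middle breaks a link and fails verification; removing entries from the end is only caught when a checkpointed head hash is supplied.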
- Memory metadata schema changes (`source_type`, `content_hash`, `trust_tier`, `decay_rate`) require `schema_version: "3"` and are not readable by v2 code paths without migration.
- The `MEMORY_REVIEW` task state is a new addition to the state machine (requires orchestrator, API contract, and observability updates).
- Trust-scored retrieval changes the memory context budget allocation, which may affect prompt version hashing.
Builds on Iteration 3d: Review feedback memory and PR outcome tracking are in place; Phases 2–4 harden the memory system that those components write to. Phase 1 (input hardening) ships with 3d as a prerequisite — see Iteration 3d. The phased approach allows incremental deployment with measurable security improvement at each phase.
Goal: Additional git providers; agent can run the app and attach visual proof; Slack integration; web dashboard for operators and users; real-time streaming.
- Additional git providers — Support GitLab (and optionally Bitbucket or others). Same workflow (clone, branch, commit, push, PR/MR). Provider-specific APIs, auth, and webhook adapters. The gateway and task schema are already channel-agnostic (repo is `owner/repo`); this iteration adds a `git_provider` field and provider-specific adapters. Onboarding (Iter 3a) must support non-GitHub repos.
- Live execution and visual proof — Agent can execute the application after build/tests, capture screenshots or videos as proof that changes work, and upload them (e.g. as PR attachments or to an S3 artifact store linked from the PR). Requires compute support: virtual display (Xvfb) or headless browser (Playwright/Puppeteer), capture scripts, and outbound upload. See COMPUTE.md (Visual proof). This may require a larger compute profile (more CPU/RAM/disk) or a dedicated "visual proof" step in the blueprint.
- Slack channel — Slack adapter for the input gateway: users can submit tasks, check status, and receive notifications from Slack. Inbound: verify Slack signing secret, normalize Slack payload to the internal message schema. Outbound: render internal notifications as Slack Block Kit messages, post to the originating channel/thread. Requires a Slack→platform user mapping. See INPUT_GATEWAY.md.
- Automated skills creation pipeline — Pipeline that creates or updates agent skills (or similar artifacts) from repo interaction or from onboarding. For example: the pipeline observes that a repo always requires `npm run lint:fix` before tests pass, and generates a skill or rule that the agent uses automatically. Builds on customization (Iter 3a) and memory (Iter 3b–3d).
- User preference memory (Tier 3) — Per-user memory for PR style, commit conventions, test coverage expectations, and other execution preferences. Extracted from task descriptions (explicit) and review feedback patterns (implicit). Lower priority than repo-level and review feedback memory, but enables personalization when multiple users submit tasks. See MEMORY.md (User preference memory, Tier 3).
- Control panel (web dashboard) — Web UI for operators and users: list tasks (with filters by status, repo, user), view task detail and status history, cancel tasks, link to agent logs, and show basic metrics (active tasks, submitted backlog, completion rate, error rate). Optional: submit a task from the UI (the panel becomes another channel via the input gateway). See CONTROL_PANEL.md. Tech stack TBD (e.g. React + AppSync or REST).
- Real-time event streaming (WebSocket) — Replace or supplement the polling-based `GET /v1/tasks/{id}/events` with an API Gateway WebSocket API for real-time task status updates. WebSocket is chosen over SSE because multiplayer sessions (Iteration 6) and iterative feedback require bidirectional communication. This improves the experience for the control panel, Slack integration, and CLI `--wait` mode. Requires connection management (DynamoDB connection table). See API_CONTRACT.md (OQ1).
- Live session replay and mid-task nudge — Extend WebSocket streaming with structured trajectory events (thinking steps, tool calls, cost, timing) for real-time session observation and post-hoc replay with timeline scrubbing. Add a "nudge" mechanism to inject one-shot course corrections between agent turns (via TaskNudges table and mid-session message injection). Structured streaming with cost telemetry provides better debugging and operational visibility than raw terminal logs. Requires bidirectional WebSocket (same as real-time streaming) plus agent harness support for consuming nudge messages.
- Browser extension client — A lightweight Chrome/Firefox extension that lets users trigger tasks directly from the browser (e.g. while viewing a GitHub issue, click a button to submit it as a task). The extension calls the existing webhook API (Iteration 3a) with the current page's issue URL, requiring minimal new infrastructure — just a small client-side wrapper over the webhook endpoint. See INPUT_GATEWAY.md.
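For the Slack channel item in the list above, the inbound step "verify Slack signing secret" follows Slack's documented request-signing scheme: the platform recomputes an HMAC-SHA256 over `v0:{timestamp}:{raw body}` and compares it with the `X-Slack-Signature` header, rejecting requests whose `X-Slack-Request-Timestamp` is too old. The sketch below shows that check in isolation; the function name and parameter names are assumptions, and extracting the headers and raw body from the gateway event is left out.

```python
import hashlib
import hmac
import time


def verify_slack_signature(signing_secret, timestamp, body, signature,
                           now=None, max_skew_seconds=300):
    """Return True if the request signature is valid and the timestamp is recent.

    timestamp: value of the X-Slack-Request-Timestamp header (str)
    body:      raw request body exactly as received (bytes)
    signature: value of the X-Slack-Signature header (str, "v0=<hex>")
    """
    now = time.time() if now is None else now
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    # Reject stale or future-dated requests to limit replay attacks.
    if abs(now - sent_at) > max_skew_seconds:
        return False
    base = b"v0:" + timestamp.encode("utf-8") + b":" + body
    digest = hmac.new(signing_secret.encode("utf-8"), base, hashlib.sha256).hexdigest()
    expected = "v0=" + digest
    # Constant-time comparison on bytes avoids timing side channels.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

The check must run against the unparsed body bytes, so the adapter should verify the signature before normalizing the Slack payload to the internal message schema.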
Builds on Iteration 3d: Onboarding, memory (Tiers 1–2), evaluation, and validation are in place; adds git providers, visual proof, Slack, skills pipeline, user preference memory, control panel, real-time streaming, and browser extension.
Goal: Faster cold start, multi-user/team, full cost management, guardrails, and alternative runtime support.
-
Automated container (devbox) from repo — Optionally derive or customize the agent container image from the repo (e.g. Dockerfile, dev container config, language-specific base images). Tied to onboarding: per-repo workload config. Reduces cold start for repos with known environments and ensures the agent has the right tools (compilers, SDKs, linters) pre-installed.
-
CI/CD pipeline — Automated deployment pipeline for the platform itself: source → build → test → synth → deploy to staging → deploy to production. Use CDK Pipelines or equivalent. The current ad-hoc CDK deploy workflow is not sufficient for a production orchestrator managing long-running tasks — deployments need to be safe (canary, rollback), auditable, and repeatable.
-
Environment pre-warming (snapshot-on-schedule) — Pre-build container layers or repo snapshots (code + deps pre-installed) per repo; store in ECR or equivalent. Reduces cold start from minutes to seconds for known repos. The onboarding pipeline (Iter 3a) can trigger pre-warming as part of repo setup or on a schedule. Periodically snapshot the onboarded repo's container image (code + deps) to ECR, rebuild on push to the default branch (via webhook or EventBridge), and use that as the base for new sessions. Optionally begin sandbox warming when a user starts composing a task (proactive warming). Snapshot-based session starts (if AgentCore supports it) further reduce startup time. See COMPUTE.md.
-
Multi-user / team support — Multiple users with shared task history, team-level visibility, and optionally shared approval queues or budgets. Adds a
team_idororg_idto the task model. Team admins can view all tasks for their team, set team-level concurrency limits, and configure team-wide cost budgets. Builds on existing task model (user_id, filters) and adds authorization rules (team members can view each other's tasks). -
Memory isolation for multi-tenancy — AgentCore Memory has no per-namespace IAM isolation. For multi-tenant deployments, private repo knowledge could leak cross-repo unless isolation is enforced. Options: silo model (separate memory resource per org — strongest), pool model (single resource with strict application-layer namespace scoping — sufficient for single-org), or shared model (intentional cross-repo learning — only for same-org repos). The onboarding pipeline should create or assign memory resources based on the isolation model. See SECURITY.md and MEMORY.md.
-
Full cost management — per-user and per-team monthly budgets, cost attribution dashboards (cost per task, per repo, per user), alerts when budgets are approaching limits. Token usage and compute cost are tracked per task and aggregated. The control panel (Iter 4) displays cost dashboards.
-
Adaptive model router with cost-aware cascade — Per-turn model selection via a lightweight heuristic engine. File reads and simple edits use a cheaper model (Haiku); multi-file refactors use Sonnet; complex reasoning escalates to Opus. Error escalation: if the agent fails twice on the same step, upgrade model for the retry. As the cost budget ceiling approaches, cascade down to cheaper models. Blueprint
modelCascadeconfig enables per-repo tuning. Potential 30-40% cost reduction on inference-dominated workloads. Requires agent harness changes to support mid-session model switching. -
Advanced evaluation and feedback loop — Extend the basic evaluation pipeline from Iteration 3d: ML-based or LLM-based trace analysis (not just rules), A/B prompt comparison framework, automated feedback into prompt templates (e.g. "for repo X, always run tests before opening PR"), and per-repo or per-failure-type improvement tracking. Evaluation results can update the repo's agent configuration stored during onboarding. Optional patterns from adaptive teaching research (e.g. plan → targeted critique → execution; separate evaluator vs prompt/reflection roles; fitness from LLM judging plus efficiency metrics; evolution of teaching templates from failed trajectories with Pareto-style candidate sets for diverse failure modes) can inform offline or scheduled improvement of Blueprint prompts and checklists without replacing ABCA's core orchestrator.
-
Formal orchestrator verification (TLA+) — Add a formal specification of the orchestrator in TLA+ and verify it with TLC model checking. Scope includes the task state machine (8 states, valid transitions, terminal states), concurrency admission control (atomic increment + max check), cancellation races (cancel arriving during any orchestration step), reconciler/orchestrator interleavings (counter drift correction while tasks are active), and the polling loop (agent writes terminal status, orchestrator observes and finalizes). Define invariants such as valid-state progression, no illegal transitions, and repo-level safety constraints (for example, at most one active
RUNNINGtask per repo when configured). Keep the spec aligned withsrc/constructs/task-status.tsand orchestrator docs so regressions surface as model-check counterexamples before production. Note: The TLA+ specification can be started earlier (e.g. during Iteration 3d) since the state machine and concurrency model are already stable. The spec is documentation that also catches bugs — writing it does not depend on Iteration 5 features. Consider starting the state machine and cancellation models as part of the ongoing engineering practice. -
Guardrails (output and tool-call) with interceptor pattern — Extend Bedrock Guardrails from input screening (implemented in Iteration 3c) to output filtering and agent tool-call guardrails. Apply content filters to model responses during agent execution, restrict sensitive content generation, and enforce organizational policies (e.g. "do not modify files in `/infrastructure`"). Guardrails configuration can be per-repo (via onboarding) or platform-wide.
Tool-call interceptor (Guardian pattern) — pre- and post-execution stages implemented: A Cedar-based policy engine (`agent/src/policy.py`) with PreToolUse hooks and a regex-based output scanner (`agent/src/output_scanner.py`) with PostToolUse hooks (`agent/src/hooks.py`) intercept tool calls between the agent SDK's decision and actual execution.
Pre-execution stage (implemented): Every tool call is evaluated against Cedar deny-list policies: `pr_review` agents are denied `Write`/`Edit` tools, writes to protected paths (`.github/workflows/*`, `.git/*`) are blocked, and destructive bash commands (`rm -rf /`, `git push --force`) are denied. The engine is fail-closed — if `cedarpy` is unavailable or evaluation errors occur, all tool calls are denied. Per-repo custom Cedar policies are supported via Blueprint `security.cedarPolicies`. Denied decisions emit `POLICY_DECISION` telemetry events via `agent/src/telemetry.py`.
Post-execution stage (implemented): Tool outputs are screened for secrets and PII (AWS keys, GitHub tokens, private keys, connection strings, Bearer tokens) via `output_scanner.py`. When sensitive content is found, the PostToolUse hook returns `updatedMCPToolOutput` with redacted content (steered enforcement). Findings emit `OUTPUT_SCREENING` telemetry events.
Remaining work: Cost threshold checks, bash command allowlist per capability tier, and Bedrock Guardrails-based output filtering (complementing the regex-based scanner). Combined with per-tool-call structured telemetry (Iteration 3d), every interceptor decision will be logged as a `PolicyDecisionEvent`. This pattern is informed by the Guardian system (Hu et al. 2025) — a "guardian agent" that monitors and can intercept tool calls before execution. See SECURITY.md (Mid-execution enforcement). -
Mid-execution behavioral monitoring — Lightweight monitoring of agent behavior within a running session, filling the gap between input guardrails (pre-session) and validation (post-session). A behavioral circuit breaker in the agent harness tracks aggregate metrics: tool-call frequency (calls per minute), cumulative session cost, repeated failures on the same tool, and file mutation rate. When metrics exceed configurable thresholds (e.g. >50 tool calls/minute, >$10 cumulative cost, >5 consecutive failures on the same tool), the circuit breaker pauses or terminates the session and emits a `circuit_breaker_triggered` event. This catches runaway loops, cost explosions, and stuck agents before the hard session timeout. Thresholds are configurable per-repo via Blueprint `security` props. The circuit breaker operates within the existing agent harness — no sidecar process or external service required. For ABCA's single-agent-per-task model, embedded monitoring is simpler and more reliable than an external sidecar; sidecar architecture becomes relevant when multi-agent orchestration lands (Iteration 6). See SECURITY.md (Mid-execution enforcement). -
Centralized policy framework — Consolidate the platform's distributed policy decisions into a unified policy framework and audit layer. Policy logic today is scattered across 20+ files (input validation in `validation.ts` and `create-task-core.ts`, admission control in `orchestrator.ts`, guardrail screening in `context-hydration.ts`, budget resolution across `validation.ts` / `orchestrator.ts` / `agent/src/config.py`, tool access in `agent/src/policy.py` + `agent/src/hooks.py`, network egress in `dns-firewall.ts` / `agent.ts`, state transitions in `task-status.ts` / `orchestrator.ts`). This fragmentation makes it difficult to audit what policies exist, verify consistency, or change policy behavior without touching multiple files. The agent-side Cedar policy engine (`agent/src/policy.py`) is a first step: it provides in-process tool-call governance with fail-closed semantics and per-repo custom policies. The full framework extends this to the TypeScript orchestrator side.
Phase 1 — Policy audit normalization: Define a stable `PolicyDecisionEvent` schema: `decision_id` (ULID), `policy_name` (e.g. `admission.concurrency`, `budget.max_turns`, `guardrail.input_screening`), `policy_version`, `phase` (`submission` | `admission` | `pre_flight` | `hydration` | `session_start` | `session` | `finalization`), `input_hash` (SHA-256 of the decision input for reproducibility), `result` (`allow` | `deny` | `modify`), `reason_codes[]`, `enforcement` (`enforced` | `observed` | `steered`), and `task_id`. The three enforcement modes serve distinct purposes: `enforced` means the decision is binding (deny blocks, allow proceeds), `observed` means the decision is logged but not enforced (shadow mode for safe rollout), and `steered` means the decision modifies the input or output rather than blocking (redact PII, sanitize paths, mask secrets). New rules deploy in `observed` mode first; operators validate false-positive rates via `PolicyDecisionEvent` logs, then promote to `enforced` or `steered`. This observe-before-enforce workflow enables gradual rollout of security policies without risking false blocks on legitimate tasks. Emit a `policy_decision` event via `emitTaskEvent` at every existing enforcement point. Today, some decisions emit events (`admission_rejected`, `preflight_failed`, `guardrail_blocked`) while others silently return HTTP errors — normalize them all. This is pure instrumentation of existing code paths; no behavior change.
Phase 2 — Cedar policy engine (partially implemented): Introduce Cedar (not OPA) as the single policy engine for both operational policy (budget/quota/tool-access resolution, tool-call interception rules) and authorization (extended for multi-tenant access control when multi-user/team support lands). Cedar is AWS-native, has formal verification guarantees, and integrates with AgentCore Gateway.
Current state: An in-process Cedar policy engine is implemented in the agent harness (`agent/src/policy.py`) using `cedarpy` for tool-call governance. The engine enforces a deny-list model: `pr_review` agents are forbidden from `Write`/`Edit`, writes to `.github/workflows/*` and `.git/*` are blocked, and destructive bash commands are denied. The engine is fail-closed (denies on error, `cedarpy` unavailability, or Cedar `NoDecision`). Per-repo custom Cedar policies can be injected via Blueprint `security.cedarPolicies` and are validated at initialization. Task types are validated against the `TaskType` enum (`agent/src/models.py`). Denied decisions emit `POLICY_DECISION` telemetry events.
Remaining work: Extend Cedar to the TypeScript orchestrator side. Cedar replaces the scattered budget/quota/tool-access merge logic (3-tier `max_turns` resolution, 2-tier `max_budget_usd` resolution, per-repo configuration merge in `loadBlueprintConfig`) with a unified policy evaluation. A thin `policy.ts` adapter module translates Cedar decisions into `PolicyDecision` objects (`PolicyInput` → Cedar evaluation → `PolicyDecision` with computed budgets, tool profile, risk tier, redaction directives) consumed by existing handlers — no new service, no network hop. Input validation (format checks, range checks) remains at the input boundary; Cedar handles resolution and policy composition. Migrate from in-process `cedarpy` to Amazon Verified Permissions for runtime-configurable policies.
Operational tool-call policies use a virtual-action classification pattern to support the three enforcement modes (`enforced`, `observed`, `steered`) within Cedar's binary permit/forbid model. Instead of asking Cedar "allow or deny?", the interceptor evaluates against multiple virtual actions (`invoke_tool`, `invoke_tool_steered`, `invoke_tool_denied`) and uses the first permitted action to determine the mode. For example: `forbid(principal, action == Action::"invoke_tool", resource) when { resource.path like ".github/workflows/*" && principal.capability_tier != "elevated" }` blocks the call, while `permit(principal, action == Action::"invoke_tool_steered", resource) when { context.output_contains_pii }` triggers PII redaction. This keeps Cedar doing what it does best (binary decisions with formal verification) while the interceptor interprets the combination of decisions as allow/steer/deny.
Authorization policies (extended with multi-user/team): When multi-user/team support lands, the same Cedar policy store expands to cover tenant-specific authorization: "users in team X can submit tasks to repos A, B, C", "team Y has a monthly budget of $500", "repos tagged `critical` require `pr_review` before `new_task`". This replaces the current single-dimensional ownership check (`record.user_id !== userId`) with multi-dimensional authorization (user, team, repo, action, risk level). No new policy engine — the same Cedar instance grows to cover authorization alongside operational policy.
Runtime-configurable policies: Cedar policies are stored in Amazon Verified Permissions and loaded at hydration/session-start time. Policy changes take effect without CDK redeployment — operators update policies via the Verified Permissions API, and the next task evaluation picks them up. Deployment-time invariants (schema validation, state machine transitions) remain in CDK code.
Policy versioning, rollback, and observe-before-enforce semantics carry forward from Phase 1. Cedar policies are evaluated at submission, admission, hydration, session (tool-call interception), and finalization.
Why not OPA: OPA uses Rego (a custom DSL) and runs as a sidecar or external service. ABCA's policies change at the same cadence as infrastructure (deployed via CDK). A separate service with a separate language adds operational burden without proportionate benefit for a single-tenant platform. Cedar is a better fit: it's a typed language with formal verification, it's AWS-native (used by Amazon Verified Permissions and AgentCore Gateway), and policies can be evaluated in-process via the Cedar SDK without a separate service. Unlike OPA/Rego (which can return arbitrary JSON), Cedar's binary decisions require the virtual-action pattern for steering — but this keeps policy evaluation formally verifiable, which OPA cannot guarantee.
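The virtual-action classification pattern described above can be sketched in a few lines. This is an illustrative sketch only, not the code in `agent/src/policy.py`: the `is_permitted` callable is a hypothetical stand-in for a Cedar authorization request for one virtual action, and the mode names returned by `classify` are assumptions. The action names and the "first permitted action wins" rule come from the text above.

```python
from typing import Callable

# Virtual actions in evaluation order, each paired with the mode it selects.
# The action names come from the roadmap text; the mode strings are assumed.
VIRTUAL_ACTIONS = (
    ("invoke_tool", "allow"),
    ("invoke_tool_steered", "steer"),
    ("invoke_tool_denied", "deny"),
)


def classify(is_permitted: Callable[[str], bool]) -> str:
    """Map binary per-action decisions onto allow / steer / deny.

    `is_permitted` is a hypothetical stand-in for one Cedar evaluation per
    virtual action. Any evaluation error is treated as a denial, mirroring
    the fail-closed behavior the roadmap describes for the policy engine.
    """
    for action, mode in VIRTUAL_ACTIONS:
        try:
            if is_permitted(action):
                return mode
        except Exception:
            return "deny"
    # No virtual action was permitted: fail closed.
    return "deny"


# Example: the plain call is forbidden, but the steered variant is permitted,
# so the interceptor would redact the output instead of blocking the call.
mode = classify(lambda action: action == "invoke_tool_steered")
```

Keeping the ordering in a single table makes the precedence between modes explicit and easy to review alongside the Cedar policies themselves.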
What stays out of the policy framework: Schema validation (repo format, `max_turns` range, task description length) stays at the input boundary. State machine transitions stay in the orchestrator. DNS Firewall stays in CDK. These are infrastructure invariants, not policy decisions — they don't vary by tenant, user, or context.
See SECURITY.md (Policy enforcement and audit).
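A minimal sketch of the Phase 1 `PolicyDecisionEvent` schema and the observe-before-enforce rule follows. The field names and enum values are taken from the schema described above; everything else is an assumption made for illustration: the canonical-JSON hashing in `hash_decision_input`, the `blocks_task` helper, the placeholder `decision_id`, and the example `policy_version` and `task_id` values.

```python
import hashlib
import json
from dataclasses import dataclass, replace
from enum import Enum


class Result(str, Enum):
    ALLOW = "allow"
    DENY = "deny"
    MODIFY = "modify"


class Enforcement(str, Enum):
    ENFORCED = "enforced"
    OBSERVED = "observed"
    STEERED = "steered"


@dataclass(frozen=True)
class PolicyDecisionEvent:
    decision_id: str       # ULID in the roadmap; supplied by the caller here
    policy_name: str       # e.g. "admission.concurrency"
    policy_version: str
    phase: str             # submission, admission, pre_flight, ...
    input_hash: str        # SHA-256 of the decision input
    result: Result
    enforcement: Enforcement
    task_id: str
    reason_codes: tuple = ()


def hash_decision_input(decision_input: dict) -> str:
    """Hash a decision input reproducibly.

    Assumption: inputs are JSON-serializable and are canonicalized with
    sorted keys so that key order does not change the hash.
    """
    canonical = json.dumps(decision_input, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def blocks_task(event: PolicyDecisionEvent) -> bool:
    """Only an enforced deny is binding; observed decisions are log-only."""
    return event.enforcement is Enforcement.ENFORCED and event.result is Result.DENY


# A new rule first ships in observed (shadow) mode, then gets promoted.
shadow = PolicyDecisionEvent(
    decision_id="placeholder-id",  # not a real ULID
    policy_name="admission.concurrency",
    policy_version="1",            # hypothetical version label
    phase="admission",
    input_hash=hash_decision_input({"user_id": "u-1", "active_count": 3}),
    result=Result.DENY,
    enforcement=Enforcement.OBSERVED,
    task_id="task-1",              # hypothetical task id
    reason_codes=("concurrency_limit",),
)
promoted = replace(shadow, enforcement=Enforcement.ENFORCED)
```

In this sketch the same deny decision is logged in both cases, but only `promoted` would stop the task, which is the property that lets operators measure false positives before enforcing a rule.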
-
Capability-based security model — Fine-grained enforcement beyond Bedrock Guardrails, operating at three levels: (1) Tool-level capabilities — Bash command allowlist (git, npm, make permitted; curl, wget blocked), configurable per capability tier (standard / elevated / read-only). (2) File-system scope — Blueprint declares include/exclude path patterns; Write/Edit/Read tools are filtered to the declared scope. (3) Input trust scoring — Authenticated user input = trusted; external GitHub issues = untrusted; PR review comments entering memory = adversarial. Trust level selects the capability set. Essential once review feedback memory (Iter 3d) introduces attacker-controlled content into the agent's context. Blueprint `security` prop configures the capability profile per repo. Capability tiers become inputs to the centralized policy framework and are governed by Cedar policies (Phase 2). -
Additional execution environment — Support an alternative to AgentCore Runtime (e.g. ECS/Fargate, EKS) behind the ComputeStrategy interface (see REPO_ONBOARDING.md). The orchestrator calls abstract methods (`startSession`, `stopSession`, `pollSession`); the implementation maps to AgentCore, Fargate, or EKS. Repos select the strategy via `compute_type` in their blueprint configuration. Reduces vendor lock-in and enables workloads that exceed AgentCore limits (e.g. GPU, larger images, longer sessions). The ComputeStrategy interface contract is defined in Iteration 3a; Iteration 5 adds alternative implementations. -
Full web dashboard — Extend the control panel from Iteration 4: detailed dashboards (cost, performance, evaluation), reasoning trace viewer or log explorer (linked to OpenTelemetry traces from AgentCore), task submit/cancel from the UI, and admin views (system health, capacity, user management).
-
Customization (advanced) with tiered tool access — Agent can be extended with MCP servers, plugins, and skills beyond the basic prompt-from-repo customization in Iteration 3a. Composable tool sets per repo. MCP server discovery and lifecycle management. More tools increase behavioral unpredictability, so use a tiered tool access model: a minimal default tool set (bash allowlist, git, verify/lint/test) that all repos get, with MCP servers and plugins as opt-in per repo during onboarding. Per-repo tool profiles are stored in the onboarding config and loaded by the orchestrator. This balances flexibility with predictability. See SECURITY.md and REPO_ONBOARDING.md.
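The behavioral circuit breaker described under mid-execution behavioral monitoring above could look roughly like the following. This is a sketch under stated assumptions, not the harness implementation: the class and field names and the trip-reason strings are hypothetical, the defaults reuse the example thresholds from the text (50 calls/minute, $10 cumulative cost, 5 consecutive failures), file mutation rate is omitted for brevity, and any successful call resets the failure streak.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Thresholds:
    # Defaults mirror the example thresholds in the roadmap text.
    max_calls_per_minute: int = 50
    max_cost_usd: float = 10.0
    max_consecutive_failures: int = 5


@dataclass
class CircuitBreaker:
    thresholds: Thresholds = field(default_factory=Thresholds)
    call_times: deque = field(default_factory=deque)
    cost_usd: float = 0.0
    last_failed_tool: Optional[str] = None
    failure_streak: int = 0

    def record(self, tool: str, ok: bool, cost_usd: float, now: float) -> Optional[str]:
        """Record one tool call; return a trip reason, or None to continue."""
        # Sliding 60-second window of call timestamps.
        self.call_times.append(now)
        while now - self.call_times[0] >= 60.0:
            self.call_times.popleft()

        self.cost_usd += cost_usd

        # Track consecutive failures on the same tool.
        if ok:
            self.failure_streak = 0
            self.last_failed_tool = None
        elif tool == self.last_failed_tool:
            self.failure_streak += 1
        else:
            self.last_failed_tool = tool
            self.failure_streak = 1

        limits = self.thresholds
        if len(self.call_times) > limits.max_calls_per_minute:
            return "tool_call_rate"
        if self.cost_usd > limits.max_cost_usd:
            return "cumulative_cost"
        if self.failure_streak > limits.max_consecutive_failures:
            return "consecutive_failures"
        return None


# Example: six consecutive failures on the same tool trip the breaker.
breaker = CircuitBreaker()
reasons = [breaker.record("Bash", ok=False, cost_usd=0.0, now=10.0 * i) for i in range(6)]
```

A harness using something like this would emit the `circuit_breaker_triggered` event and pause or terminate the session when `record` returns a reason; passing `now` explicitly keeps the logic deterministic and easy to test.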
Builds on Iteration 4: Adds pre-warming, multi-user, cost management, guardrails, alternate runtime, and advanced customization with tiered tool access.
Goal: Skills learned from repo interaction; multi-repo tasks; iterative human-agent collaboration; reusable CDK constructs.
- GitHub Actions integration — Publish a GitHub Action that triggers an ABCA task (e.g. on an issue label like `agent:fix`, on flaky test detection, or on a PR comment command). The Action calls the webhook endpoint from Iteration 3a. Natural integration for GitHub-centric workflows.
- Automated pipeline for learning skills from repo interaction — Pipeline that observes agent interactions with repositories and produces reusable skills (rules, prompts, tools) that improve future runs. Builds on memory, code attribution, and evaluation. Example: the pipeline notices that tasks on repo X frequently fail because of a missing env variable, and generates a rule that the agent always sets it.
- Agent swarm orchestration — Planner-worker architecture for complex, multi-file tasks that overwhelm a single agent session. A lightweight planner decomposes the task into a DAG of subtasks with scope boundaries and interface contracts. Each subtask runs as an independent child task in its own AgentCore session. A merge orchestrator cherry-picks commits, resolves conflicts, and runs the full test suite before opening one consolidated PR. New DynamoDB fields: `parent_task_id`, `child_task_ids[]`, `subtask_contract`. New blueprint steps: `decompose-task`, fan-out + wait-all, merge-and-verify. Naturally bounds PR size and enables work that no single-session agent can handle (large features, cross-cutting refactors, migrations).
- Multi-repo support — Tasks that span multiple repositories (e.g. change an API in repo A and update the consumer in repo B). Requires: multi-branch orchestration (one branch per repo), coordinated PR creation (linked PRs), cross-repo auth (GitHub App installations for both repos), and cross-repo testing. This is architecturally significant and needs a dedicated design doc before implementation.
- Iterative feedback and multiplayer sessions — User can send follow-up instructions to a completed or running task (e.g. "also add tests for X" or "change the approach to use library Y"). For completed tasks, the platform starts a new session on the same branch with the follow-up context. For running tasks, this requires message injection into a live session — which depends on agent harness support for session persistence and message channels. Design the interaction model carefully: what happens to in-flight work when instructions change? Multiplayer extension: allow multiple authorized users to inject context into a running or follow-up session (e.g. team code reviews or collaborative debugging with the agent). Per-prompt commit attribution (Iter 3b) supports tracking which user's input led to which changes.
- HITL approval mode — Optional mid-task approval gates for high-risk operations (e.g. "agent wants to delete 50 files — approve?"). The orchestrator pauses the task, emits a notification, and waits for user approval before continuing. Requires changes to the agent harness (pause/resume) and the orchestrator (a new `AWAITING_APPROVAL` state in the state machine).
- Scheduled triggers — Cron or schedule-based task creation (e.g. "run dependency update every Monday", "check for flaky tests nightly"). Implemented as EventBridge Scheduler rules that call the task creation API. Schedules are configured per repo during onboarding or via the control panel.
- CDK constructs — Publish reusable CDK constructs (e.g. `BackgroundAgentStack`, `OnboardingPipelineStack`, `TaskOrchestrator`) so other teams can compose the platform into their own CDK apps. Document construct APIs, publish to a construct library (e.g. Construct Hub), and version following semver.
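The fan-out + wait-all step of the agent swarm item above can be illustrated with a small scheduling sketch. It assumes the planner's output is a mapping from each subtask to the subtasks it depends on; the subtask names and the `plan_waves` helper are hypothetical, and grouping subtasks into waves is a simplification of whatever the orchestrator would actually do. It relies on Python's standard-library `graphlib.TopologicalSorter` (Python 3.9+).

```python
from graphlib import TopologicalSorter


def plan_waves(dependencies: dict) -> list:
    """Group subtasks into waves that can run in parallel.

    `dependencies` maps each subtask to the set of subtasks it depends on.
    Every subtask in a wave has all of its dependencies in earlier waves,
    so a wave can fan out as independent child tasks and the next wave
    starts once all of them have finished (wait-all).
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the plan is not a DAG
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical decomposition of a cross-cutting change.
plan = {
    "api": set(),
    "client": {"api"},
    "docs": {"api"},
    "integration-tests": {"client", "docs"},
}
waves = plan_waves(plan)
```

Validating the plan with `prepare()` before any child task starts means a cyclic decomposition is rejected up front rather than discovered when a wave can never become ready.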
Builds on Iteration 5: Leverages memory, evaluation, and customization to close the loop (learn → improve); adds advanced workflows and exposes the platform as constructs.
- Iteration 1 — Core agent + git (isolated run, CLI submit, branch + PR, minimal task state).
- Iteration 2 — Production orchestrator, API contract, task management (list/status/cancel), durable execution, observability, threat model, network isolation, basic cost guardrails, CI/CD.
- Iteration 3a — Repo onboarding, DNS Firewall (domain-level egress filtering), webhook trigger (foundation for GitHub Actions integration in Iteration 6), per-repo customization (prompt from repo), data retention, turn/iteration caps, cost budget caps, user prompt guide, agent harness improvements (turn budget, default branch, safety net, lint, softened conventions), operator dashboard, WAF, model invocation logging, input length limits.
- Iteration 3b ✅ — Memory Tier 1 (repo knowledge, task episodes), insights, agent self-feedback, prompt versioning, per-prompt commit attribution. CDK L2 construct with named semantic + episodic strategies using namespace templates (`/{actorId}/knowledge/`, `/{actorId}/episodes/{sessionId}/`), fail-open memory load/write, orchestrator fallback episode, SHA-256 prompt hashing, git trailer attribution.
- Iteration 3c — Per-repo GitHub App credentials via AgentCore Token Vault (`CfnWorkloadIdentity` + Token Vault credential provider for automatic token refresh; agent uses `GetWorkloadAccessToken` for long-running sessions; sets pattern for GitLab/Jira/Slack integrations), principal-to-repository authorization mapping (Cognito identity → allowed repo sets, distinct from credential scoping — Threat Model Priority 1), orchestrator pre-flight checks (fail-closed before session start), persistent session storage for select caches (AgentCore Runtime `/mnt/workspace` mount for npm/Claude config; mise/uv/repo on local disk due to FUSE `flock()` limitation), pre-execution task risk classification (model/limits/approval policy selection), tiered validation pipeline (tool validation, code quality analysis, post-execution risk/blast radius analysis), PR risk level, PR review task type (`pr_review` — read-only structured review with tool restriction, defense-in-depth enforcement, CLI `--review-pr` flag), input guardrail screening (Bedrock Guardrails, fail-closed — including GitHub issue content for `new_task`), multi-modal input.
- Iteration 3d — Post-execution output screening (done — regex-based secret/PII scanner in `agent/src/output_scanner.py` with PostToolUse hook in `agent/src/hooks.py`; screens AWS keys, GitHub tokens, private keys, connection strings, Bearer tokens; steered enforcement via `updatedMCPToolOutput` redaction; `OUTPUT_SCREENING` telemetry events), context hydration screening for untrusted content (PR review comments, issue bodies screened at injection point, not only at submission — Threats 1/6), behavioral circuit breaker specification (signal taxonomy, threshold defaults, action model — design artifact, implementation in Iteration 5 — Threats 2/8/9), review feedback memory loop (Tier 2), PR outcome tracking, evaluation pipeline (basic), per-tool-call structured telemetry (tool name, input/output hash, duration, cost — foundational for evaluation and Iteration 5 policy enforcement). Co-ships with 3e Phase 1 (memory input hardening: content sanitization, provenance tagging, integrity hashing) as a prerequisite for safely writing attacker-controlled content to memory.
- Iteration 3e — Memory security and integrity: Phase 1 (input hardening — content sanitization, provenance tagging, integrity hashing) ships with 3d as a prerequisite; Phases 2–4 follow: trust-aware retrieval (trust scoring, temporal decay, guardian validation), detection and response (anomaly detection, circuit breaker, quarantine, rollback), advanced protections (write-ahead validation, behavioral drift detection, cryptographic provenance, red teaming). Addresses OWASP ASI06 (Memory & Context Poisoning).
- Iteration 3bis (hardening) — Orchestrator IAM grant for Memory (was silently AccessDenied), memory schema versioning (`schema_version: "2"`), Python repo format validation, severity-aware error logging in Python memory, narrowed entrypoint try-catch, orchestrator fallback episode observability, conditional writes in agent `task_state.py` (ConditionExpression guards), orchestrator Lambda error alarm (CloudWatch, `retryAttempts: 0`), concurrency counter reconciliation (scheduled Lambda, drift correction), multi-AZ NAT documentation (already configurable), Python unit tests (pytest), entrypoint decomposition into `agent/src/` modules (config, models, pipeline, runner, context, prompt_builder, hooks, policy, post_hooks, repo, shell, telemetry — with `entrypoint.py` as re-export shim), Cedar policy engine (in-process `cedarpy`, fail-closed deny-list for tool-call governance, PreToolUse hooks, per-repo custom policies via Blueprint `security.cedarPolicies`), `TaskType` enum with validation, dual prompt assembly deprecation docstring, graceful thread drain in `server.py` (shutdown hook + atexit), dead `QUEUED` state removal (8 states, 4 active).
- Iteration 4 — Additional git providers, visual proof (screenshots/videos), Slack channel, skills pipeline, user preference memory (Tier 3), control panel (restrict CORS to dashboard origin), real-time event streaming (WebSocket), live session replay and mid-task nudge, browser extension client, MFA for production.
- Iteration 5 — Automated container (devbox) from repo, CI/CD pipeline, snapshot-on-schedule pre-warming, multi-user/team, memory isolation for multi-tenancy, full cost management, adaptive model router with cost-aware cascade, advanced evaluation (optional adaptive-teaching / trajectory-driven prompt patterns), formal orchestrator verification with TLA+/TLC, Bedrock Guardrails output/tool-call with Guardian interceptor pattern (pre-execution stage implemented via Cedar `agent/src/policy.py` + PreToolUse hooks; post-execution stage implemented via `agent/src/output_scanner.py` + PostToolUse hooks in `agent/src/hooks.py`; remaining: cost threshold checks, bash command allowlist per capability tier, Bedrock Guardrails-based output filtering complementing regex scanner) — input screening in 3c, mid-execution behavioral monitoring (tool-call frequency circuit breaker, cost runaway detection, aggregate behavioral bounds within agent harness), centralized policy framework (Phase 1: policy audit normalization with `PolicyDecisionEvent` schema across all enforcement points, three enforcement modes — `enforced` | `observed` | `steered` — with observe-before-enforce rollout workflow; Phase 2: Cedar partially implemented in agent harness with in-process `cedarpy` for tool-call governance; remaining: extend Cedar to TypeScript orchestrator for budget/quota resolution, migrate to Amazon Verified Permissions for runtime-configurable policies, virtual-action classification pattern for enforce/observe/steer, extended for multi-tenant authorization when multi-user/team lands), capability-based security model (tiers feed into policy framework), alternate runtime, advanced customization with tiered tool access (MCP/plugins via AgentCore Gateway), full dashboard, AI-specific WAF rules.
- Iteration 6 — Agent swarm orchestration, skills learning, multi-repo, iterative feedback and multiplayer sessions, HITL approval, scheduled triggers, CDK constructs.
Design docs to keep in sync: ARCHITECTURE.md, ORCHESTRATOR.md, API_CONTRACT.md, INPUT_GATEWAY.md, REPO_ONBOARDING.md, MEMORY.md, OBSERVABILITY.md, COMPUTE.md, CONTROL_PANEL.md, SECURITY.md, EVALUATION.md.