diff --git a/AGENTS.md b/AGENTS.md
index 89b1a99..f4e950e 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -19,7 +19,7 @@ Use this routing before editing so the right package and tests get updated:
 | REST API, Lambdas, task validation, orchestration | `cdk/src/handlers/`, `cdk/src/stacks/`, `cdk/src/constructs/` | Matching tests under `cdk/test/` |
 | Shared API request/response shapes | `cdk/src/handlers/shared/types.ts` | **`cli/src/types.ts`** (must stay in sync) |
 | `bgagent` CLI commands and HTTP client | `cli/src/`, `cli/test/` | `cli/src/types.ts` if API types change |
-| Agent runtime (clone, tools, prompts, container) | `agent/` (`entrypoint.py`, `prompts/`, `Dockerfile`, etc.) | `agent/tests/`, `agent/README.md` for env/PAT |
+| Agent runtime (clone, tools, prompts, container) | `agent/src/` (`pipeline.py`, `runner.py`, `config.py`, `hooks.py`, `policy.py`, `prompts/`, Dockerfile, etc.) | `agent/tests/`, `agent/README.md` for env/PAT |
 | User-facing or design prose | `docs/guides/`, `docs/design/` | Run **`mise //docs:sync`** or **`mise //docs:build`** (do not edit `docs/src/content/docs/` by hand) |
 | Monorepo tasks, CI glue | Root `mise.toml`, `scripts/`, `.github/workflows/` | — |
 
diff --git a/agent/Dockerfile b/agent/Dockerfile
index d0a052f..4d00e3b 100644
--- a/agent/Dockerfile
+++ b/agent/Dockerfile
@@ -50,8 +50,7 @@ RUN uv sync --frozen --no-dev --directory /app
 
 # Copy agent code (ARG busts cache so file edits are always picked up)
 ARG CACHE_BUST=0
-COPY entrypoint.py system_prompt.py server.py task_state.py observability.py memory.py /app/
-COPY prompts/ /app/prompts/
+COPY src/ /app/src/
 COPY prepare-commit-msg.sh /app/
 COPY test_sdk_smoke.py test_subprocess_threading.py /app/
@@ -69,4 +68,4 @@ WORKDIR /workspace
 
 EXPOSE 8080
 
-CMD ["opentelemetry-instrument", "uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8080", "--app-dir", "/app", "--loop", "asyncio"]
+CMD ["opentelemetry-instrument", "uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8080", "--app-dir", "/app/src", "--loop", "asyncio"]
diff --git a/agent/README.md b/agent/README.md
index 88dade0..a8e8103 100644
--- a/agent/README.md
+++ b/agent/README.md
@@ -81,7 +81,7 @@ The second argument is auto-detected:
 
 When an issue number is given, the optional third argument provides additional instructions on top of the issue context.
 
-The `run.sh` script overrides the container's default CMD to run `python /app/entrypoint.py` (batch mode) instead of the uvicorn server.
+The `run.sh` script overrides the container's default CMD to run `python /app/src/entrypoint.py` (batch mode) instead of the uvicorn server.
 
 ### Environment Variables
@@ -320,24 +320,45 @@ docker images bgagent-local --format "{{.Size}}"
 agent/
 ├── Dockerfile                    Python 3.13 + Node.js 20 + Claude Code CLI + git + gh + mise (default platform linux/arm64)
 ├── .dockerignore
-├── pyproject.toml                App dependencies (claude-agent-sdk, FastAPI, boto3, OpenTelemetry distro, MCP, …)
+├── pyproject.toml                App dependencies (claude-agent-sdk, FastAPI, boto3, OpenTelemetry distro, MCP, cedarpy, …)
 ├── uv.lock                       Locked deps for reproducible `uv sync` in the image
 ├── mise.toml                     Tool versions / tasks used when the target repo relies on mise
-├── entrypoint.py                 Config, context hydration, ClaudeSDKClient pipeline, metrics, run_task()
-├── server.py                     FastAPI — async /invocations (background thread) and /ping; OTEL session correlation
-├── task_state.py                 Best-effort DynamoDB task status (no-op if TASK_TABLE_NAME unset)
-├── observability.py              OpenTelemetry helpers (e.g. AgentCore session id)
-├── memory.py                     Optional memory / episode integration for the agent
-├── prompts/                      Per-task-type system prompt workflows
-│   ├── __init__.py               Prompt registry — assembles base template + workflow for each task type
-│   ├── base.py                   Shared base template (environment, rules, placeholders)
-│   ├── new_task.py               Workflow for new_task (create branch, implement, open PR)
-│   ├── pr_iteration.py           Workflow for pr_iteration (read feedback, address, push)
-│   └── pr_review.py              Workflow for pr_review (read-only analysis, structured review comments)
-├── system_prompt.py              Behavioral contract (PRD Section 11)
+├── src/                          Agent source modules (pythonpath configured in pyproject.toml)
+│   ├── __init__.py
+│   ├── entrypoint.py             Re-export shim for backward compatibility (tests); delegates to specific modules
+│   ├── config.py                 Configuration: build_config(), get_config(), resolve_github_token(), TaskType validation
+│   ├── models.py                 Data models and enumerations (TaskType StrEnum with is_pr_task property)
+│   ├── pipeline.py               Top-level pipeline: main() CLI entry, run_task() orchestration
+│   ├── runner.py                 Agent runner: run_agent() — ClaudeSDKClient connect/query/receive_response
+│   ├── context.py                Context hydration: fetch_github_issue(), assemble_prompt() (local/dry-run only)
+│   ├── prompt_builder.py         System prompt assembly + memory context, repo config scanning
+│   ├── hooks.py                  PreToolUse hook callback for Cedar policy enforcement (Claude Agent SDK hooks)
+│   ├── policy.py                 Cedar policy engine — in-process cedarpy evaluation, fail-closed, deny-list model
+│   ├── post_hooks.py             Deterministic post-hooks: ensure_committed, ensure_pushed, ensure_pr, verify_build, verify_lint
+│   ├── repo.py                   Repository setup: clone, branch, git auth, mise trust/install/build/lint
+│   ├── shell.py                  Shell utilities: log(), run_cmd(), redact_secrets(), slugify(), truncate()
+│   ├── telemetry.py              Metrics, disk usage, trajectory writer (_TrajectoryWriter with write_policy_decision)
+│   ├── server.py                 FastAPI — async /invocations (background thread) and /ping; OTEL session correlation
+│   ├── task_state.py             Best-effort DynamoDB task status (no-op if TASK_TABLE_NAME unset)
+│   ├── observability.py          OpenTelemetry helpers (e.g. AgentCore session id)
+│   ├── memory.py                 Optional memory / episode integration for the agent
+│   ├── system_prompt.py          Behavioral contract (PRD Section 11)
+│   └── prompts/                  Per-task-type system prompt workflows
+│       ├── __init__.py           Prompt registry — assembles base template + workflow for each task type
+│       ├── base.py               Shared base template (environment, rules, placeholders)
+│       ├── new_task.py           Workflow for new_task (create branch, implement, open PR)
+│       ├── pr_iteration.py       Workflow for pr_iteration (read feedback, address, push)
+│       └── pr_review.py          Workflow for pr_review (read-only analysis, structured review comments)
 ├── prepare-commit-msg.sh         Git hook (Task-Id / Prompt-Version trailers on commits)
 ├── run.sh                        Build + run helper for local/server mode with AgentCore constraints
-├── tests/                        pytest unit tests for pure functions and prompt assembly
+├── tests/                        pytest unit tests (pythonpath: src/)
+│   ├── test_config.py            Config validation and TaskType tests
+│   ├── test_hooks.py             PreToolUse hook and hook matcher tests
+│   ├── test_models.py            TaskType enum tests
+│   ├── test_policy.py            Cedar policy engine tests (fail-closed, deny-list)
+│   ├── test_pipeline.py          Pipeline orchestration tests (cedar_policies injection)
+│   ├── test_shell.py             Shell utility tests (slugify, redact_secrets, truncate, format_bytes)
+│   └── ...
 ├── test_sdk_smoke.py             Diagnostic: minimal SDK smoke test (ClaudeSDKClient → CLI → Bedrock)
 └── test_subprocess_threading.py  Diagnostic: subprocess-in-background-thread verification
 ```
diff --git a/agent/entrypoint.py b/agent/entrypoint.py
deleted file mode 100644
index 42e06cc..0000000
--- a/agent/entrypoint.py
+++ /dev/null
@@ -1,2091 +0,0 @@
-"""Background Agent entrypoint.
-
-Mirrors the durable function orchestration flow from the PRD (Section 6).
-Supports two modes:
-  - Local batch mode: `python entrypoint.py` (reads config from env vars)
-  - AgentCore server mode: imported by server.py via `run_task()`
-
-Flow:
-  1. Build configuration
-  2. Context hydration: fetch GitHub issue, assemble prompt
-  3. Setup: clone repo, create branch, mise install, initial build
-  4. Invoke Claude Agent SDK (one-shot, unattended)
-  5. Post-hooks: safety-net commit, verify build, verify lint, ensure PR
-  6. Collect and return metrics
-"""
-
-import asyncio
-import glob
-import hashlib
-import json
-import os
-import re
-import subprocess
-import sys
-import time
-import uuid
-from urllib.parse import quote
-
-import requests
-
-import memory as agent_memory
-import task_state
-from observability import task_span
-from prompts import get_system_prompt
-from system_prompt import SYSTEM_PROMPT
-
-# ---------------------------------------------------------------------------
-# Configuration
-# ---------------------------------------------------------------------------
-
-AGENT_WORKSPACE = os.environ.get("AGENT_WORKSPACE", "/workspace")
-
-# Task types that operate on an existing pull request.
-PR_TASK_TYPES = frozenset(("pr_iteration", "pr_review"))
-
-
-def resolve_github_token() -> str:
-    """Resolve GitHub token from Secrets Manager or environment variable.
-
-    In deployed mode, GITHUB_TOKEN_SECRET_ARN is set and the token is fetched
-    from Secrets Manager on first call, then cached in os.environ.
-    For local development, falls back to GITHUB_TOKEN.
-    """
-    # Return cached value if already resolved
-    cached = os.environ.get("GITHUB_TOKEN", "")
-    if cached:
-        return cached
-    secret_arn = os.environ.get("GITHUB_TOKEN_SECRET_ARN")
-    if secret_arn:
-        import boto3
-
-        region = os.environ.get("AWS_REGION") or os.environ.get("AWS_DEFAULT_REGION")
-        client = boto3.client("secretsmanager", region_name=region)
-        resp = client.get_secret_value(SecretId=secret_arn)
-        token = resp["SecretString"]
-        # Cache in env so downstream tools (git, gh CLI) work unchanged
-        os.environ["GITHUB_TOKEN"] = token
-        return token
-    return ""
-
-
-def build_config(
-    repo_url: str,
-    task_description: str = "",
-    issue_number: str = "",
-    github_token: str = "",
-    anthropic_model: str = "",
-    max_turns: int = 10,
-    max_budget_usd: float | None = None,
-    aws_region: str = "",
-    dry_run: bool = False,
-    task_id: str = "",
-    system_prompt_overrides: str = "",
-    task_type: str = "new_task",
-    branch_name: str = "",
-    pr_number: str = "",
-) -> dict:
-    """Build and validate configuration from explicit parameters.
-
-    Parameters fall back to environment variables if empty.
-    """
-    config = {
-        "repo_url": repo_url or os.environ.get("REPO_URL", ""),
-        "issue_number": issue_number or os.environ.get("ISSUE_NUMBER", ""),
-        "task_description": task_description or os.environ.get("TASK_DESCRIPTION", ""),
-        "github_token": github_token or resolve_github_token(),
-        "aws_region": aws_region or os.environ.get("AWS_REGION", ""),
-        "anthropic_model": anthropic_model
-        or os.environ.get("ANTHROPIC_MODEL", "us.anthropic.claude-sonnet-4-6"),
-        "dry_run": dry_run,
-        "max_turns": max_turns,
-        "max_budget_usd": max_budget_usd,
-        "system_prompt_overrides": system_prompt_overrides,
-        "task_type": task_type,
-        "branch_name": branch_name,
-        "pr_number": pr_number,
-    }
-
-    errors = []
-    if not config["repo_url"]:
-        errors.append("repo_url is required (e.g., 'owner/repo')")
-    if not config["github_token"]:
-        errors.append("github_token is required")
-    if not config["aws_region"]:
-        errors.append("aws_region is required for Bedrock")
-    if config["task_type"] in PR_TASK_TYPES:
-        if not config["pr_number"]:
-            errors.append("pr_number is required for pr_iteration/pr_review task type")
-    elif not config["issue_number"] and not config["task_description"]:
-        errors.append("Either issue_number or task_description is required")
-
-    if errors:
-        raise ValueError("; ".join(errors))
-
-    config["task_id"] = task_id or uuid.uuid4().hex[:12]
-    return config
-
-
-def get_config() -> dict:
-    """Parse configuration from environment variables (local batch mode)."""
-    try:
-        return build_config(
-            repo_url=os.environ.get("REPO_URL", ""),
-            task_description=os.environ.get("TASK_DESCRIPTION", ""),
-            issue_number=os.environ.get("ISSUE_NUMBER", ""),
-            github_token=os.environ.get("GITHUB_TOKEN", ""),
-            anthropic_model=os.environ.get("ANTHROPIC_MODEL", ""),
-            max_turns=int(os.environ.get("MAX_TURNS", "100")),
-            max_budget_usd=float(os.environ.get("MAX_BUDGET_USD", "0")) or None,
-            aws_region=os.environ.get("AWS_REGION", ""),
-            dry_run=os.environ.get("DRY_RUN", "").lower() in ("1",
"true", "yes"), - ) - except ValueError as e: - print(f"ERROR: {e}", file=sys.stderr) - sys.exit(1) - - -# --------------------------------------------------------------------------- -# Context hydration -# --------------------------------------------------------------------------- - - -def fetch_github_issue(repo_url: str, issue_number: str, token: str) -> dict: - """Fetch a GitHub issue's title, body, and comments.""" - headers = { - "Authorization": f"token {token}", - "Accept": "application/vnd.github.v3+json", - } - - # Fetch issue - issue_resp = requests.get( - f"https://api.github.com/repos/{repo_url}/issues/{issue_number}", - headers=headers, - timeout=30, - ) - issue_resp.raise_for_status() - issue = issue_resp.json() - - # Fetch comments - comments = [] - if issue.get("comments", 0) > 0: - comments_resp = requests.get( - f"https://api.github.com/repos/{repo_url}/issues/{issue_number}/comments", - headers=headers, - timeout=30, - ) - comments_resp.raise_for_status() - comments = [{"author": c["user"]["login"], "body": c["body"]} for c in comments_resp.json()] - - return { - "title": issue["title"], - "body": issue.get("body", ""), - "number": issue["number"], - "comments": comments, - } - - -def assemble_prompt(config: dict) -> str: - """Assemble the user prompt from issue context and task description. - - .. deprecated:: - In production (AgentCore server mode), the orchestrator's - ``assembleUserPrompt()`` in ``context-hydration.ts`` is the sole prompt - assembler. The hydrated prompt arrives via ``hydrated_context["user_prompt"]``. - This Python implementation is retained only for **local batch mode** - (``python entrypoint.py``) and **dry-run mode** (``DRY_RUN=1``). 
- """ - parts = [] - - parts.append(f"Task ID: {config['task_id']}") - parts.append(f"Repository: {config['repo_url']}") - - if config.get("issue"): - issue = config["issue"] - parts.append(f"\n## GitHub Issue #{issue['number']}: {issue['title']}\n") - parts.append(issue["body"] or "(no description)") - if issue["comments"]: - parts.append("\n### Comments\n") - for c in issue["comments"]: - parts.append(f"**@{c['author']}**: {c['body']}\n") - - if config["task_description"]: - parts.append(f"\n## Task\n\n{config['task_description']}") - elif config.get("issue"): - parts.append( - "\n## Task\n\nResolve the GitHub issue described above. " - "Follow the workflow in your system instructions." - ) - - return "\n".join(parts) - - -# --------------------------------------------------------------------------- -# Repository setup (deterministic pre-hooks) -# --------------------------------------------------------------------------- - - -def slugify(text: str, max_len: int = 40) -> str: - """Convert text to a URL-safe slug for branch names.""" - text = text.lower().strip() - text = re.sub(r"[^a-z0-9\s-]", "", text) - text = re.sub(r"[\s-]+", "-", text) - text = text.strip("-") - if len(text) > max_len: - text = text[:max_len].rstrip("-") - return text or "task" - - -def redact_secrets(text: str) -> str: - """Redact tokens and secrets from log output.""" - # GitHub and generic token-like values. - text = re.sub(r"(ghp_|github_pat_|gho_|ghs_|ghr_)[A-Za-z0-9_]+", r"\1***", text) - text = re.sub(r"(x-access-token:)[^\s@]+", r"\1***", text) - text = re.sub(r"(authorization:\s*(?:bearer|token)\s+)[^\s]+", r"\1***", text, flags=re.I) - text = re.sub( - r"([?&](?:token|access_token|api_key|apikey|password)=)[^&\s]+", - r"\1***", - text, - flags=re.I, - ) - text = re.sub(r"(gh[opusr]_[A-Za-z0-9_]+)", "***", text) - return text - - -def _clean_env() -> dict[str, str]: - """Return a copy of os.environ with OTEL auto-instrumentation vars removed. 
- - The ``opentelemetry-instrument`` wrapper injects PYTHONPATH and OTEL_* - env vars that would cause child Python processes (e.g. mise run build → - semgrep in the target repo) to attempt OTEL auto-instrumentation and fail - because the target repo's Python environment doesn't have the OTEL - packages installed. Stripping these vars isolates target-repo commands - from the agent's own instrumentation. - """ - env = {k: v for k, v in os.environ.items() if not k.startswith("OTEL_")} - # Strip only OTEL-injected PYTHONPATH components (the sitecustomize.py - # directory), preserving any entries the target repo's toolchain may need. - pythonpath = env.get("PYTHONPATH", "") - if pythonpath: - cleaned = os.pathsep.join( - p for p in pythonpath.split(os.pathsep) if "opentelemetry" not in p - ) - if cleaned: - env["PYTHONPATH"] = cleaned - else: - env.pop("PYTHONPATH", None) - return env - - -def run_cmd( - cmd: list[str], - label: str, - cwd: str | None = None, - timeout: int = 600, - check: bool = True, -) -> subprocess.CompletedProcess: - """Run a command with logging.""" - log("CMD", redact_secrets(f"{label}: {' '.join(cmd)}")) - result = subprocess.run( - cmd, - cwd=cwd, - capture_output=True, - text=True, - timeout=timeout, - env=_clean_env(), - ) - if result.returncode != 0: - log("CMD", f"{label}: FAILED (exit {result.returncode})") - if result.stderr: - for line in result.stderr.strip().splitlines()[:20]: - log("CMD", f" {line}") - if check: - stderr_snippet = redact_secrets(result.stderr.strip()[:500]) if result.stderr else "" - raise RuntimeError(f"{label} failed (exit {result.returncode}): {stderr_snippet}") - else: - log("CMD", f"{label}: OK") - return result - - -def setup_repo(config: dict) -> dict: - """Clone repo, create branch, configure git auth, run mise install. - - Returns a dict with keys: repo_dir, branch, notes, build_before, - lint_before, and default_branch. 
- """ - repo_dir = f"{AGENT_WORKSPACE}/{config['task_id']}" - setup: dict[str, str | list[str] | bool] = {"repo_dir": repo_dir, "notes": []} - - if config.get("task_type") in PR_TASK_TYPES and config.get("branch_name"): - branch = config["branch_name"] - setup["branch"] = branch - else: - # Derive branch slug from issue title or task description - title = "" - if config.get("issue"): - title = config["issue"]["title"] - if not title: - title = config["task_description"] - slug = slugify(title) - branch = f"bgagent/{config['task_id']}/{slug}" - setup["branch"] = branch - - # Mark the repo directory as safe for git. On persistent session storage - # the mount may be owned by a different UID than the container user, - # triggering git's "dubious ownership" check on clone/resume. - run_cmd( - ["git", "config", "--global", "--add", "safe.directory", repo_dir], - label="safe-directory", - ) - - # Clone - log("SETUP", f"Cloning {config['repo_url']}...") - run_cmd( - ["gh", "repo", "clone", config["repo_url"], repo_dir], - label="clone", - ) - - # Configure remote URL with embedded token so git push works without - # credential helpers or extra auth setup inside the agent. 
- token = config["github_token"] - run_cmd( - [ - "git", - "remote", - "set-url", - "origin", - f"https://x-access-token:{token}@github.com/{config['repo_url']}.git", - ], - label="set-remote-url", - cwd=repo_dir, - ) - - # Branch setup - if config.get("task_type") in PR_TASK_TYPES and config.get("branch_name"): - log("SETUP", f"Checking out existing PR branch: {branch}") - run_cmd( - ["git", "fetch", "origin", branch], - label="fetch-pr-branch", - cwd=repo_dir, - ) - run_cmd( - ["git", "checkout", "-b", branch, f"origin/{branch}"], - label="checkout-pr-branch", - cwd=repo_dir, - ) - else: - log("SETUP", f"Creating branch: {branch}") - run_cmd(["git", "checkout", "-b", branch], label="create-branch", cwd=repo_dir) - - # Trust mise config files in the cloned repo (required before mise install) - run_cmd( - ["mise", "trust", repo_dir], - label="mise-trust", - cwd=repo_dir, - check=False, - ) - - # mise install (deterministic — not left to the LLM) - log("SETUP", "Running mise install...") - result = run_cmd( - ["mise", "install"], - label="mise-install", - cwd=repo_dir, - check=False, - ) - if result.returncode != 0: - note = f"mise install failed (exit {result.returncode})" - setup["notes"].append(note) - else: - setup["notes"].append("mise install: OK") - - # Initial build (record whether the project builds before agent changes) - log("SETUP", "Running initial build (mise run build)...") - result = run_cmd( - ["mise", "run", "build"], - label="mise-run-build-pre", - cwd=repo_dir, - check=False, - ) - if result.returncode != 0: - note = "Initial build (mise run build) FAILED before agent changes" - setup["notes"].append(note) - setup["build_before"] = False - else: - setup["notes"].append("Initial build (mise run build): OK") - setup["build_before"] = True - - # Initial lint baseline (record whether lint passes before agent changes) - log("SETUP", "Running initial lint (mise run lint)...") - result = run_cmd( - ["mise", "run", "lint"], - label="mise-run-lint-pre", - 
cwd=repo_dir, - check=False, - ) - if result.returncode != 0: - note = "Initial lint (mise run lint) FAILED before agent changes" - setup["notes"].append(note) - setup["lint_before"] = False - else: - setup["notes"].append("Initial lint (mise run lint): OK") - setup["lint_before"] = True - - # Detect default branch - # For PR tasks (pr_iteration, pr_review): use base_branch from orchestrator if available - if config.get("task_type") in PR_TASK_TYPES and config.get("base_branch"): - setup["default_branch"] = config["base_branch"] - else: - setup["default_branch"] = detect_default_branch(config["repo_url"], repo_dir) - - # Install prepare-commit-msg hook for code attribution - _install_commit_hook(repo_dir) - - return setup - - -def _install_commit_hook(repo_dir: str) -> None: - """Install the prepare-commit-msg git hook for Task-Id/Prompt-Version trailers.""" - try: - hooks_dir = os.path.join(repo_dir, ".git", "hooks") - os.makedirs(hooks_dir, exist_ok=True) - - hook_src = os.path.join(os.path.dirname(__file__), "prepare-commit-msg.sh") - hook_dst = os.path.join(hooks_dir, "prepare-commit-msg") - - if not os.path.isfile(hook_src): - log("ERROR", f"Hook not found at {hook_src}") - return - - import shutil - import stat - - shutil.copy2(hook_src, hook_dst) - current = os.stat(hook_dst).st_mode - exec_bits = stat.S_IXUSR | stat.S_IXGRP - os.chmod(hook_dst, current | exec_bits) # nosemgrep - log("SETUP", "Installed prepare-commit-msg hook") - except Exception as e: - log("WARN", f"Commit hook install failed: {type(e).__name__}: {e}") - - -def detect_default_branch(repo_url: str, repo_dir: str) -> str: - """Detect the repository's default branch via gh CLI. - - Falls back to 'main' if detection fails (timeout, auth error, etc.). 
- """ - try: - result = subprocess.run( - [ - "gh", - "repo", - "view", - repo_url, - "--json", - "defaultBranchRef", - "-q", - ".defaultBranchRef.name", - ], - cwd=repo_dir, - capture_output=True, - text=True, - timeout=30, - ) - except subprocess.TimeoutExpired: - log("WARN", "Default branch detection timed out — defaulting to 'main'") - return "main" - - if result.returncode == 0 and result.stdout.strip(): - branch = result.stdout.strip() - log("SETUP", f"Detected default branch: {branch}") - return branch - - stderr = result.stderr.strip()[:200] if result.stderr else "(no stderr)" - log( - "WARN", - f"Could not detect default branch (exit {result.returncode}): " - f"{stderr} — defaulting to 'main'", - ) - return "main" - - -def verify_build(repo_dir: str) -> bool: - """Run mise run build after agent completion to verify the build.""" - log("POST", "Running post-agent build verification (mise run build)...") - try: - result = run_cmd( - ["mise", "run", "build"], - label="mise-run-build-post", - cwd=repo_dir, - check=False, - ) - except subprocess.TimeoutExpired: - log("WARN", "Post-agent build timed out — treating as failed") - return False - if result.returncode != 0: - log("POST", "Post-agent build FAILED") - return False - log("POST", "Post-agent build: OK") - return True - - -def verify_lint(repo_dir: str) -> bool: - """Run mise run lint after agent completion to verify lint passes.""" - log("POST", "Running post-agent lint verification (mise run lint)...") - try: - result = run_cmd( - ["mise", "run", "lint"], - label="mise-run-lint-post", - cwd=repo_dir, - check=False, - ) - except subprocess.TimeoutExpired: - log("WARN", "Post-agent lint timed out — treating as failed") - return False - if result.returncode != 0: - log("POST", "Post-agent lint FAILED") - return False - log("POST", "Post-agent lint: OK") - return True - - -def ensure_committed(repo_dir: str) -> bool: - """Safety net: commit any uncommitted tracked changes before finalization. 
- - This catches work the agent wrote but forgot to commit (e.g. due to turn - limit or timeout). Only stages tracked-but-modified files (git add -u) to - avoid accidentally committing temp files or build artifacts. - - Returns True if a safety-net commit was created, False if nothing to commit - or if git operations fail. - """ - try: - result = subprocess.run( - ["git", "status", "--porcelain"], - cwd=repo_dir, - capture_output=True, - text=True, - timeout=60, - ) - except subprocess.TimeoutExpired: - log("WARN", "git status timed out in safety-net commit") - return False - - if result.returncode != 0: - stderr = result.stderr.strip()[:200] if result.stderr else "" - log("WARN", f"git status failed (exit {result.returncode}): {stderr}") - return False - if not result.stdout.strip(): - return False - - log("POST", "Uncommitted changes detected — creating safety-net commit") - # Stage tracked-but-modified files only (not untracked files) - try: - add_result = subprocess.run( - ["git", "add", "-u"], - cwd=repo_dir, - capture_output=True, - text=True, - timeout=60, - ) - except subprocess.TimeoutExpired: - log("WARN", "git add -u timed out in safety-net commit") - return False - - if add_result.returncode != 0: - stderr = add_result.stderr.strip()[:200] if add_result.stderr else "" - log("WARN", f"git add -u failed (exit {add_result.returncode}): {stderr}") - return False - - # Check if there's anything staged after add -u - staged = subprocess.run( - ["git", "diff", "--cached", "--quiet"], - cwd=repo_dir, - capture_output=True, - timeout=30, - ) - if staged.returncode == 0: - # Nothing staged (changes were only untracked files) — skip - log("POST", "No tracked file changes to commit") - return False - - commit_result = subprocess.run( - ["git", "commit", "-m", "chore(agent): save uncommitted work from session end"], - cwd=repo_dir, - capture_output=True, - text=True, - timeout=60, - ) - if commit_result.returncode == 0: - log("POST", "Safety-net commit created") - 
return True - log("POST", f"Safety-net commit failed: {commit_result.stderr.strip()[:200]}") - return False - - -def ensure_pushed(repo_dir: str, branch: str) -> bool: - """Push the branch if there are unpushed commits.""" - result = subprocess.run( - ["git", "log", f"origin/{branch}..HEAD", "--oneline"], - cwd=repo_dir, - capture_output=True, - text=True, - timeout=60, - ) - # If the remote branch doesn't exist or there are unpushed commits - if result.returncode != 0 or result.stdout.strip(): - log("POST", "Pushing unpushed commits...") - push_result = run_cmd( - ["git", "push", "-u", "origin", branch], - label="push", - cwd=repo_dir, - check=False, - ) - return push_result.returncode == 0 - return True - - -def ensure_pr( - config: dict, - setup: dict, - build_passed: bool, - lint_passed: bool, - agent_result: dict | None = None, -) -> str | None: - """Check if a PR exists for the branch; if not, create one. - - For ``new_task``: creates a new PR if needed. - For ``pr_iteration``: pushes commits, then resolves the existing PR URL. - For ``pr_review``: resolves the existing PR URL without pushing (read-only). - - Returns the PR URL, or None if there are no commits beyond the default - branch or PR creation failed. ``build_passed`` and ``lint_passed`` control - the verification status shown in the PR body. 
- """ - repo_dir = setup["repo_dir"] - branch = setup["branch"] - default_branch = setup.get("default_branch", "main") - - # PR iteration/review: skip PR creation — just resolve existing PR URL - if config.get("task_type") in PR_TASK_TYPES: - if config.get("task_type") == "pr_iteration": - if not ensure_pushed(repo_dir, branch): - log("WARN", "Failed to push commits before resolving PR URL") - else: - log("POST", "pr_review task — skipping push (read-only)") - log("POST", f"{config.get('task_type')} — returning existing PR URL") - result = subprocess.run( - [ - "gh", - "pr", - "view", - branch, - "--repo", - config["repo_url"], - "--json", - "url", - "-q", - ".url", - ], - cwd=repo_dir, - capture_output=True, - text=True, - timeout=60, - ) - if result.returncode == 0 and result.stdout.strip(): - pr_url = result.stdout.strip() - log("POST", f"Existing PR: {pr_url}") - return pr_url - stderr_msg = result.stderr.strip() if result.stderr else "(no stderr)" - log("WARN", f"Could not resolve existing PR URL (rc={result.returncode}): {stderr_msg}") - return None - - # Check if the agent already created a PR for this branch - log("POST", "Checking for existing PR...") - result = subprocess.run( - [ - "gh", - "pr", - "view", - branch, - "--repo", - config["repo_url"], - "--json", - "url", - "-q", - ".url", - ], - cwd=repo_dir, - capture_output=True, - text=True, - timeout=60, - ) - if result.returncode == 0 and result.stdout.strip(): - pr_url = result.stdout.strip() - log("POST", f"PR already exists: {pr_url}") - return pr_url - - # Check if there are any commits on this branch beyond the default branch - diff_result = subprocess.run( - ["git", "log", f"origin/{default_branch}..HEAD", "--oneline"], - cwd=repo_dir, - capture_output=True, - text=True, - timeout=60, - ) - if diff_result.returncode != 0 or not diff_result.stdout.strip(): - log("POST", "No commits to create PR from — skipping PR creation") - return None - - # Ensure all commits are pushed - 
ensure_pushed(repo_dir, branch) - - # Collect commit messages for the PR body - log_result = subprocess.run( - ["git", "log", f"origin/{default_branch}..HEAD", "--pretty=format:%s%n%b---"], - cwd=repo_dir, - capture_output=True, - text=True, - timeout=60, - ) - commits = log_result.stdout.strip() if log_result.returncode == 0 else "" - - # Derive PR title from first commit message - first_commit = subprocess.run( - ["git", "log", f"origin/{default_branch}..HEAD", "--pretty=format:%s", "--reverse"], - cwd=repo_dir, - capture_output=True, - text=True, - timeout=60, - ) - pr_title = ( - first_commit.stdout.strip().split("\n")[0] - if first_commit.stdout.strip() - else f"chore: bgagent/{config['task_id']}" - ) - - # Build PR body - task_source = "" - if config["issue_number"]: - task_source = f"Resolves #{config['issue_number']}\n\n" - elif config["task_description"]: - task_source = f"**Task:** {config['task_description']}\n\n" - - build_status = "PASS" if build_passed else "FAIL" - lint_status = "PASS" if lint_passed else "FAIL" - - cost_line = "" - if agent_result and agent_result.get("cost_usd") is not None: - cost_line = f"- Agent cost: **${agent_result['cost_usd']:.4f}**\n" - - pr_body = ( - f"## Summary\n\n" - f"{task_source}" - f"### Commits\n\n" - f"```\n{commits}\n```\n\n" - f"## Verification\n\n" - f"- `mise run build` (post-agent): **{build_status}**\n" - f"- `mise run lint` (post-agent): **{lint_status}**\n" - f"{cost_line}\n" - f"---\n\n" - f"By submitting this pull request, I confirm that you can use, modify, copy, " - f"and redistribute this contribution, under the terms of the [project license](https://github.com/aws-samples/sample-autonomous-cloud-coding-agents/blob/main/LICENSE)." 
- ) - - log("POST", f"Creating PR: {pr_title}") - pr_result = run_cmd( - [ - "gh", - "pr", - "create", - "--repo", - config["repo_url"], - "--head", - branch, - "--base", - default_branch, - "--title", - pr_title, - "--body", - pr_body, - ], - label="create-pr", - cwd=repo_dir, - check=False, - ) - if pr_result.returncode == 0: - pr_url = pr_result.stdout.strip() - log("POST", f"PR created: {pr_url}") - return pr_url - else: - log("POST", "Failed to create PR") - return None - - -# --------------------------------------------------------------------------- -# Self-feedback extraction -# --------------------------------------------------------------------------- - - -def _extract_agent_notes(repo_dir: str, branch: str, config: dict) -> str | None: - """Extract the "## Agent notes" section from the PR body. - - Checks the existing PR body via `gh pr view`. Returns the text content - of the "## Agent notes" section, or None if not found. - """ - try: - result = subprocess.run( - [ - "gh", - "pr", - "view", - branch, - "--repo", - config["repo_url"], - "--json", - "body", - "-q", - ".body", - ], - cwd=repo_dir, - capture_output=True, - text=True, - timeout=30, - ) - if result.returncode != 0 or not result.stdout.strip(): - return None - - body = result.stdout.strip() - # Find "## Agent notes" section - match = re.search( - r"##\s*Agent\s*notes\s*\n(.*?)(?=\n##\s|\Z)", - body, - re.DOTALL | re.IGNORECASE, - ) - if match: - notes = match.group(1).strip() - return notes if notes else None - return None - except Exception as e: - log("WARN", f"Failed to extract agent notes from PR body: {type(e).__name__}: {e}") - return None - - -# --------------------------------------------------------------------------- -# Metrics -# --------------------------------------------------------------------------- - - -def get_disk_usage(path: str = AGENT_WORKSPACE) -> float: - """Return disk usage in bytes for the given path.""" - try: - result = subprocess.run( - ["du", "-sb", path], - 
capture_output=True, - text=True, - timeout=30, - ) - return int(result.stdout.split()[0]) if result.returncode == 0 else 0 - except (subprocess.TimeoutExpired, ValueError, IndexError): - return 0 - - -def format_bytes(size: float) -> str: - """Human-readable byte size.""" - for unit in ("B", "KB", "MB", "GB"): - if abs(size) < 1024: - return f"{size:.1f} {unit}" - size /= 1024 - return f"{size:.1f} TB" - - -def _emit_metrics_to_cloudwatch(json_payload: dict) -> None: - """Write the METRICS_REPORT JSON event directly to CloudWatch Logs. - - Writes the log event directly to the APPLICATION_LOGS log group using the - CloudWatch Logs API, ensuring metrics are reliably available for dashboard - Logs Insights queries regardless of container stdout routing. - """ - log_group = os.environ.get("LOG_GROUP_NAME") - if not log_group: - return - - try: - import contextlib - - import boto3 - - region = os.environ.get("AWS_REGION") or os.environ.get("AWS_DEFAULT_REGION") - client = boto3.client("logs", region_name=region) - - task_id = json_payload.get("task_id", "unknown") - log_stream = f"metrics/{task_id}" - - # Create the log stream (ignore if it already exists) - with contextlib.suppress(client.exceptions.ResourceAlreadyExistsException): - client.create_log_stream(logGroupName=log_group, logStreamName=log_stream) - - client.put_log_events( - logGroupName=log_group, - logStreamName=log_stream, - logEvents=[ - { - "timestamp": int(time.time() * 1000), - "message": json.dumps(json_payload), - } - ], - ) - except ImportError: - print("[metrics] boto3 not available — skipping CloudWatch write", flush=True) - except Exception as e: - exc_type = type(e).__name__ - print(f"[metrics] CloudWatch Logs write failed (best-effort): {exc_type}: {e}", flush=True) - if "Credential" in exc_type or "Endpoint" in exc_type or "AccessDenied" in str(e): - print( - "[metrics] WARNING: This may indicate a deployment misconfiguration " - "(IAM role, VPC endpoint, or credentials). 
Dashboard data will be missing.", - flush=True, - ) - - -class _TrajectoryWriter: - """Write per-turn trajectory events to CloudWatch Logs. - - Follows the same pattern as ``_emit_metrics_to_cloudwatch()``: lazy boto3 - import, best-effort error handling, ``contextlib.suppress`` for idempotent - stream creation. Log stream: ``trajectory/{task_id}`` (parallel to the - existing ``metrics/{task_id}`` stream). - - Events are progressively truncated to stay under the CloudWatch Logs 262 KB - event-size limit: large fields (thinking, tool result content) are truncated - first, then a hard byte-level safety-net truncation is applied. - """ - - _CW_MAX_EVENT_BYTES = 262_144 # CloudWatch limit per event - - _MAX_FAILURES = 3 - - def __init__(self, task_id: str) -> None: - self._task_id = task_id - self._log_group = os.environ.get("LOG_GROUP_NAME") - self._client = None - self._disabled = False - self._failure_count = 0 - - def _ensure_client(self): - """Lazily create the CloudWatch Logs client and log stream.""" - if self._client is not None: - return - if not self._log_group: - self._disabled = True - return - - import contextlib - - import boto3 - - region = os.environ.get("AWS_REGION") or os.environ.get("AWS_DEFAULT_REGION") - self._client = boto3.client("logs", region_name=region) - - log_stream = f"trajectory/{self._task_id}" - with contextlib.suppress(self._client.exceptions.ResourceAlreadyExistsException): - self._client.create_log_stream(logGroupName=self._log_group, logStreamName=log_stream) - - def _put_event(self, payload: dict) -> None: - """Serialize *payload* to JSON, truncate if needed, and write.""" - if not self._log_group or self._disabled: - return - try: - self._ensure_client() - if self._client is None: - self._disabled = True - return - - message = json.dumps(payload, default=str) - - # Safety-net: hard byte-level truncation - encoded = message.encode("utf-8") - if len(encoded) > self._CW_MAX_EVENT_BYTES: - print( - f"[trajectory] WARNING: Event 
exceeded CW limit even after field " - f"truncation ({len(encoded)} bytes). Hard-truncating — event JSON " - f"will be invalid.", - flush=True, - ) - message = ( - encoded[: self._CW_MAX_EVENT_BYTES - 100].decode("utf-8", errors="ignore") - + " [TRUNCATED]" - ) - - self._client.put_log_events( - logGroupName=self._log_group, - logStreamName=f"trajectory/{self._task_id}", - logEvents=[ - { - "timestamp": int(time.time() * 1000), - "message": message, - } - ], - ) - except ImportError: - self._disabled = True - print("[trajectory] boto3 not available — skipping", flush=True) - except Exception as e: - self._failure_count += 1 - exc_type = type(e).__name__ - if self._failure_count >= self._MAX_FAILURES: - self._disabled = True - print( - f"[trajectory] CloudWatch write failed {self._failure_count} times, " - f"disabling trajectory: {exc_type}: {e}", - flush=True, - ) - else: - print( - f"[trajectory] CloudWatch write failed ({self._failure_count}/" - f"{self._MAX_FAILURES}): {exc_type}: {e}", - flush=True, - ) - if "Credential" in exc_type or "Endpoint" in exc_type or "AccessDenied" in str(e): - print( - "[trajectory] WARNING: This may indicate a deployment misconfiguration " - "(IAM role, VPC endpoint, or credentials). Trajectory data will be missing.", - flush=True, - ) - - @staticmethod - def _truncate_field(value: str, max_len: int = 4000) -> str: - """Truncate a large string field for trajectory events.""" - if not value or len(value) <= max_len: - return value - return value[:max_len] + f"... 
[truncated, {len(value)} chars total]" - - def write_turn( - self, - turn: int, - model: str, - thinking: str, - text: str, - tool_calls: list[dict], - tool_results: list[dict], - ) -> None: - """Write a TRAJECTORY_TURN event for one agent turn.""" - # Truncate large fields to stay under CloudWatch event limit - truncated_thinking = self._truncate_field(thinking) - truncated_text = self._truncate_field(text) - truncated_results = [] - for tr in tool_results: - entry = dict(tr) - if isinstance(entry.get("content"), str): - entry["content"] = self._truncate_field(entry["content"], 2000) - truncated_results.append(entry) - - self._put_event( - { - "event": "TRAJECTORY_TURN", - "task_id": self._task_id, - "turn": turn, - "model": model, - "thinking": truncated_thinking, - "text": truncated_text, - "tool_calls": tool_calls, - "tool_results": truncated_results, - } - ) - - def write_result( - self, - subtype: str, - num_turns: int, - cost_usd: float | None, - duration_ms: int, - duration_api_ms: int, - session_id: str, - usage: dict | None, - ) -> None: - """Write a TRAJECTORY_RESULT summary event at session end.""" - self._put_event( - { - "event": "TRAJECTORY_RESULT", - "task_id": self._task_id, - "subtype": subtype, - "num_turns": num_turns, - "cost_usd": cost_usd, - "duration_ms": duration_ms, - "duration_api_ms": duration_api_ms, - "session_id": session_id, - "usage": usage, - } - ) - - -# Values under these keys may contain tool stderr, paths, or incidental secrets. 
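The byte-level safety net used by `_TrajectoryWriter._put_event` above can be isolated as a small helper. A sketch under the same assumptions as the removed code: a 262,144-byte CloudWatch Logs per-event limit, 100 bytes of headroom before the `[TRUNCATED]` suffix, and `errors="ignore"` to drop any multi-byte UTF-8 character that the cut lands inside (the helper name is illustrative):

```python
CW_MAX_EVENT_BYTES = 262_144  # CloudWatch Logs per-event limit


def hard_truncate(message: str, limit: int = CW_MAX_EVENT_BYTES) -> str:
    """Byte-level safety-net truncation (may leave invalid JSON, as noted above)."""
    encoded = message.encode("utf-8")
    if len(encoded) <= limit:
        return message
    # Cut 100 bytes early, then decode ignoring any split multi-byte character,
    # so the result is always valid UTF-8 even if it is no longer valid JSON.
    return encoded[: limit - 100].decode("utf-8", errors="ignore") + " [TRUNCATED]"


msg = "é" * 200_000  # 400,000 bytes in UTF-8 — well over the event limit
out = hard_truncate(msg)
assert len(out.encode("utf-8")) <= CW_MAX_EVENT_BYTES
```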
-_METRICS_REDACT_KEYS = frozenset({"error"})
-
-
-def _metrics_payload_for_logging(metrics: dict) -> dict:
-    """Build metrics dict for stdout / CloudWatch JSON (redacts sensitive fields)."""
-    out: dict = {}
-    for k, v in metrics.items():
-        if k in _METRICS_REDACT_KEYS:
-            out[k] = None if v is None else "[redacted]"
-            continue
-        if isinstance(v, (bool, int, float, type(None))):
-            out[k] = v
-        else:
-            out[k] = str(v)
-    return out
-
-
-def print_metrics(metrics: dict):
-    """Emit a METRICS_REPORT event and print a human-readable summary.
-
-    Writes the JSON event directly to CloudWatch Logs via
-    ``_emit_metrics_to_cloudwatch()`` for dashboard querying, and prints a
-    human-readable banner to stdout for operator console inspection (the
-    metrics table itself is intentionally omitted).
-
-    Native types (int, float, bool, None) are preserved in the JSON payload.
-    None values become JSON ``null`` and are excluded by ``ispresent()``
-    filters in the dashboard queries. Raw ``error`` text is never logged verbatim.
-    """
-    safe = _metrics_payload_for_logging(metrics)
-    json_payload: dict = {"event": "METRICS_REPORT", **safe}
-
-    # Write directly to CloudWatch Logs (reliable — doesn't depend on stdout capture)
-    _emit_metrics_to_cloudwatch(json_payload)
-
-    # Also print to stdout for operator console visibility
-    print(json.dumps(json_payload), flush=True)
-
-    # Human-readable banner only; do not print keys/values from ``metrics``
-    # (that would taint logging sinks).
- print("\n" + "=" * 60) - print("METRICS REPORT") - print("=" * 60) - print( - " See structured JSON on the previous line — table omitted so metric " - "keys are not echoed next to log sinks.", - flush=True, - ) - print("=" * 60) - - -# --------------------------------------------------------------------------- -# Agent invocation -# --------------------------------------------------------------------------- - - -def log(prefix: str, text: str): - """Print a timestamped log line without dynamic payload text.""" - ts = time.strftime("%H:%M:%S") - _ = text - print(f"[{ts}] {prefix}", flush=True) - - -def truncate(text: str, max_len: int = 200) -> str: - """Truncate text for log display.""" - if not text: - return "" - text = text.replace("\n", " ").strip() - if len(text) > max_len: - return text[:max_len] + "..." - return text - - -def _setup_agent_env(config: dict) -> tuple[str | None, str | None]: - """Configure process environment for the Claude Code CLI subprocess. - - Sets Bedrock credentials, strips OTEL auto-instrumentation vars, and - optionally enables CLI-native OTel telemetry. - - Returns (otlp_endpoint, otlp_protocol) for logging. - """ - os.environ["CLAUDE_CODE_USE_BEDROCK"] = "1" - os.environ["AWS_REGION"] = config["aws_region"] - os.environ["ANTHROPIC_MODEL"] = config["anthropic_model"] - os.environ["GITHUB_TOKEN"] = config["github_token"] - os.environ["GH_TOKEN"] = config["github_token"] - # DO NOT set ANTHROPIC_LOG — any logging level causes the CLI to write to - # stderr, which fills the OS pipe buffer (64 KB) and deadlocks the - # single-threaded Node.js CLI process (blocked stderr write prevents stdout - # writes, while the SDK is waiting on stdout). The stderr callback in - # ClaudeAgentOptions cannot drain fast enough to prevent this. 
- os.environ.pop("ANTHROPIC_LOG", None) - os.environ["ANTHROPIC_DEFAULT_HAIKU_MODEL"] = "anthropic.claude-haiku-4-5-20251001-v1:0" - - # Save OTLP endpoint/protocol configured by ADOT auto-instrumentation - # before stripping, so we can re-use it for Claude Code CLI telemetry. - otlp_endpoint = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT") - otlp_protocol = os.environ.get("OTEL_EXPORTER_OTLP_PROTOCOL") - - # Strip OTEL auto-instrumentation vars from os.environ so target-repo - # child processes (mise run build, pytest, semgrep, etc.) don't attempt - # Python OTEL auto-instrumentation using the agent's packages. - # The agent's own TracerProvider is already configured at startup — it does - # not re-read env vars, so removing them is safe. - for key in [k for k in os.environ if k.startswith("OTEL_")]: - del os.environ[key] - pythonpath = os.environ.get("PYTHONPATH", "") - if pythonpath: - cleaned = os.pathsep.join( - p for p in pythonpath.split(os.pathsep) if "opentelemetry" not in p - ) - if cleaned: - os.environ["PYTHONPATH"] = cleaned - else: - os.environ.pop("PYTHONPATH", None) - - # Enable Claude Code CLI's native OTel telemetry if an OTLP endpoint is - # available. The CLI exports events (tool results, API requests/errors, - # tool decisions) as OTLP logs with per-prompt granularity — beyond the - # aggregate ResultMessage at session end. - # - # Gated on ENABLE_CLI_TELEMETRY env var (opt-in) because the ADOT sidecar - # in AgentCore Runtime is only confirmed to forward traces (configured via - # CfnRuntimeLogsMixin.TRACES.toXRay() in CDK). Whether the sidecar also - # forwards OTLP logs is unconfirmed. Set ENABLE_CLI_TELEMETRY=1 in the - # runtime environment to enable and verify logs appear in CloudWatch. - # - # Configuration choices based on AWS documentation: - # - OTEL_METRICS_EXPORTER=none: All AWS ADOT examples disable metrics - # export. CloudWatch does not ingest OTLP metrics from the sidecar. - # - OTEL_TRACES_EXPORTER=none: Explicitly disabled. 
The agent's own - # custom spans (task.pipeline, task.agent_execution, etc.) already - # provide trace-level coverage via the Python ADOT auto-instrumentation. - # - OTEL_LOGS_EXPORTER=otlp: SDK events (tool_result, api_request, etc.) - # are the primary telemetry of interest and are exported as OTLP logs. - # - OTEL_EXPORTER_OTLP_LOGS_HEADERS: Includes the application log group - # name so that, if the exporter sends directly to CloudWatch's OTLP - # endpoint, logs land in the correct log group. Ignored by the sidecar - # if it has its own routing config. - # - Protocol defaults to http/protobuf (AWS-recommended for OTLP). - # - # NOTE: These env vars are set on os.environ (process-global) because the - # Claude Agent SDK spawns the CLI subprocess from the process environment. - # This is safe for single-task-per-container deployments (AgentCore Runtime - # allocates one session per container). If concurrent tasks ever share a - # process, this must be revisited (pass env via subprocess instead). - if os.environ.get("ENABLE_CLI_TELEMETRY") == "1": - if not otlp_endpoint: - log("WARN", "OTEL_EXPORTER_OTLP_ENDPOINT not set by ADOT") - # Default to http/protobuf on port 4318 (AWS-recommended protocol). - otlp_endpoint = "http://localhost:4318" - if not otlp_protocol: - otlp_protocol = "http/protobuf" - - os.environ["CLAUDE_CODE_ENABLE_TELEMETRY"] = "1" - os.environ["OTEL_METRICS_EXPORTER"] = "none" - os.environ["OTEL_TRACES_EXPORTER"] = "none" - os.environ["OTEL_LOGS_EXPORTER"] = "otlp" - os.environ["OTEL_EXPORTER_OTLP_PROTOCOL"] = otlp_protocol - os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = otlp_endpoint - os.environ["OTEL_LOG_TOOL_DETAILS"] = "1" - - # Route OTLP logs to the application log group. This header is used - # when sending directly to CloudWatch's OTLP logs endpoint - # (https://logs.{region}.amazonaws.com/v1/logs). If the exporter - # sends to the ADOT sidecar instead, the sidecar may ignore this. 
- log_group = os.environ.get("LOG_GROUP_NAME", "") - if log_group: - os.environ["OTEL_EXPORTER_OTLP_LOGS_HEADERS"] = f"x-aws-log-group={log_group}" - - # Tag all SDK telemetry with task metadata for correlation in CloudWatch. - # Values are percent-encoded per the OTEL_RESOURCE_ATTRIBUTES spec to - # handle any special characters (commas, equals, spaces) in config values. - os.environ["OTEL_RESOURCE_ATTRIBUTES"] = ( - f"task.id={quote(config.get('task_id', 'unknown'), safe='')}," - f"repo.url={quote(config.get('repo_url', 'unknown'), safe='')}," - f"agent.model={quote(config.get('anthropic_model', 'unknown'), safe='')}" - ) - log( - "AGENT", - f"Claude Code telemetry enabled: endpoint={otlp_endpoint} " - f"protocol={otlp_protocol} logs_log_group={log_group or '(not set)'}", - ) - else: - log("AGENT", "Claude Code CLI telemetry disabled (set ENABLE_CLI_TELEMETRY=1 to enable)") - - return otlp_endpoint, otlp_protocol - - -async def run_agent( - prompt: str, system_prompt: str, config: dict, cwd: str = AGENT_WORKSPACE -) -> dict: - """Invoke the Claude Agent SDK and stream output.""" - from claude_agent_sdk import ( - AssistantMessage, - ClaudeAgentOptions, - ClaudeSDKClient, - ResultMessage, - SystemMessage, - TextBlock, - ThinkingBlock, - ToolResultBlock, - ToolUseBlock, - ) - - _setup_agent_env(config) - - stderr_line_count = 0 - - def _on_stderr(line: str) -> None: - nonlocal stderr_line_count - stderr_line_count += 1 - log("CLI", line.rstrip()) - - # Log SDK and CLI versions for diagnosing protocol mismatches - import claude_agent_sdk as _sdk - - sdk_version = getattr(_sdk, "__version__", "unknown") - log("AGENT", f"claude-agent-sdk version: {sdk_version}") - cli_path = subprocess.run(["which", "claude"], capture_output=True, text=True, timeout=5) - if cli_path.returncode == 0: - cli_ver = subprocess.run( - ["claude", "--version"], capture_output=True, text=True, timeout=10 - ) - log("AGENT", f"claude CLI: {cli_path.stdout.strip()} 
version={cli_ver.stdout.strip()}") - else: - log("WARN", "claude CLI not found on PATH") - - if config.get("task_type") == "pr_review": - allowed_tools = ["Bash", "Read", "Glob", "Grep", "WebFetch"] - else: - allowed_tools = ["Bash", "Read", "Write", "Edit", "Glob", "Grep", "WebFetch"] - - options = ClaudeAgentOptions( - model=config["anthropic_model"], - system_prompt=system_prompt, - allowed_tools=allowed_tools, - permission_mode="bypassPermissions", - cwd=cwd, - max_turns=config["max_turns"], - setting_sources=["project"], - **({"max_budget_usd": config["max_budget_usd"]} if config.get("max_budget_usd") else {}), - stderr=_on_stderr, - ) - - result: dict[str, object] = {"status": "unknown", "turns": 0, "cost_usd": None} - message_counts = {"system": 0, "assistant": 0, "result": 0, "other": 0} - trajectory = _TrajectoryWriter(config.get("task_id", "unknown")) - - # Use ClaudeSDKClient (connect/query/receive_response) instead of the - # standalone query() function. This matches the official AWS sample: - # https://github.com/aws-samples/sample-deploy-ClaudeAgentSDK-based-agents-to-AgentCore-Runtime - client = ClaudeSDKClient(options=options) - log("AGENT", "Connecting to Claude Code CLI subprocess...") - await client.connect() - log("AGENT", "Connected. Sending prompt...") - await client.query(prompt=prompt) - log("AGENT", "Prompt sent. 
Receiving messages...") - try: - async for message in client.receive_response(): - if isinstance(message, SystemMessage): - message_counts["system"] += 1 - log("SYS", f"{message.subtype}: {message.data}") - if message.subtype == "init" and isinstance(message.data, dict): - cli_ver = message.data.get("claude_code_version", "?") - log("SYS", f"CLI reports version: {cli_ver}") - log("AGENT", "Waiting for next message from CLI...") - - elif isinstance(message, AssistantMessage): - message_counts["assistant"] += 1 - result["turns"] += 1 - log("TURN", f"#{result['turns']} (model: {message.model})") - - # Per-turn accumulators for trajectory - turn_thinking = "" - turn_text = "" - turn_tool_calls: list[dict] = [] - turn_tool_results: list[dict] = [] - - for block in message.content: - if isinstance(block, ThinkingBlock): - log("THINK", truncate(block.thinking, 200)) - turn_thinking += block.thinking + "\n" - elif isinstance(block, TextBlock): - print(block.text, flush=True) - turn_text += block.text + "\n" - elif isinstance(block, ToolUseBlock): - tool_input = block.input - if block.name == "Bash": - cmd = tool_input.get("command", "") - log("TOOL", f"Bash: {truncate(cmd, 300)}") - elif block.name in ("Read", "Glob", "Grep"): - log("TOOL", f"{block.name}: {truncate(str(tool_input))}") - elif block.name in ("Write", "Edit"): - path = tool_input.get("file_path", "") - log("TOOL", f"{block.name}: {path}") - else: - log("TOOL", f"{block.name}: {truncate(str(tool_input))}") - turn_tool_calls.append({"name": block.name, "input": tool_input}) - elif isinstance(block, ToolResultBlock): - status = "ERROR" if block.is_error else "ok" - content = ( - block.content if isinstance(block.content, str) else str(block.content) - ) - log("RESULT", f"[{status}] {truncate(content)}") - turn_tool_results.append( - { - "tool_use_id": getattr(block, "tool_use_id", ""), - "is_error": block.is_error, - "content": content, - } - ) - - # Write trajectory event for this turn - trajectory.write_turn( 
- turn=result["turns"], - model=message.model, - thinking=turn_thinking.strip(), - text=turn_text.strip(), - tool_calls=turn_tool_calls, - tool_results=turn_tool_results, - ) - - elif isinstance(message, ResultMessage): - message_counts["result"] += 1 - result["status"] = message.subtype - result["cost_usd"] = getattr(message, "total_cost_usd", None) - result["num_turns"] = getattr(message, "num_turns", 0) - result["duration_ms"] = getattr(message, "duration_ms", 0) - result["duration_api_ms"] = getattr(message, "duration_api_ms", 0) - result["session_id"] = getattr(message, "session_id", "") - - # Capture token usage from ResultMessage - raw_usage = getattr(message, "usage", None) - if raw_usage is not None: - # Handle both object (dataclass) and dict forms - if isinstance(raw_usage, dict): - usage_dict: dict | None = { - "input_tokens": raw_usage.get("input_tokens", 0), - "output_tokens": raw_usage.get("output_tokens", 0), - "cache_read_input_tokens": raw_usage.get("cache_read_input_tokens", 0), - "cache_creation_input_tokens": raw_usage.get( - "cache_creation_input_tokens", 0 - ), - } - else: - usage_dict = { - "input_tokens": getattr(raw_usage, "input_tokens", 0), - "output_tokens": getattr(raw_usage, "output_tokens", 0), - "cache_read_input_tokens": getattr( - raw_usage, "cache_read_input_tokens", 0 - ), - "cache_creation_input_tokens": getattr( - raw_usage, "cache_creation_input_tokens", 0 - ), - } - result["usage"] = usage_dict - if all(v == 0 for v in usage_dict.values()): - log( - "WARN", - f"All token usage values are zero — usage object " - f"type={type(raw_usage).__name__}", - ) - else: - log( - "USAGE", - f"input={usage_dict['input_tokens']} " - f"output={usage_dict['output_tokens']} " - f"cache_read={usage_dict['cache_read_input_tokens']} " - f"cache_create={usage_dict['cache_creation_input_tokens']}", - ) - else: - usage_dict = None - result["usage"] = None - - log( - "DONE", - f"status={message.subtype} turns={message.num_turns} " - 
f"cost=${message.total_cost_usd or 0:.4f} " - f"duration={message.duration_ms / 1000:.1f}s", - ) - if message.is_error and message.result: - log("ERROR", message.result) - - # Write trajectory result summary - trajectory.write_result( - subtype=message.subtype, - num_turns=getattr(message, "num_turns", 0), - cost_usd=getattr(message, "total_cost_usd", None), - duration_ms=getattr(message, "duration_ms", 0), - duration_api_ms=getattr(message, "duration_api_ms", 0), - session_id=getattr(message, "session_id", ""), - usage=usage_dict, - ) - - else: - message_counts["other"] += 1 - log( - "MSG", - f"Unrecognized message type: {type(message).__name__}: " - f"{truncate(str(message), 300)}", - ) - - except Exception as e: - log("ERROR", f"Exception during receive_response(): {type(e).__name__}: {e}") - if result["status"] == "unknown": - result["status"] = "error" - result["error"] = f"receive_response() failed: {e}" - - log("AGENT", f"Generator finished. Messages received: {message_counts}") - log("AGENT", f"CLI stderr lines received: {stderr_line_count}") - if message_counts["assistant"] == 0 and message_counts["system"] > 0: - log( - "WARN", - "Got init SystemMessage but zero AssistantMessages. The CLI subprocess " - "started but produced no turns. Likely causes: (1) Bedrock API auth/connectivity " - "failure, (2) SDK↔CLI protocol mismatch, (3) CLI crash after init. 
" - "Check [CLI] stderr lines above for errors.", - ) - if message_counts["result"] == 0: - log( - "WARN", - "No ResultMessage received from the agent SDK — " - "agent metrics (cost, turns) will be unavailable", - ) - - return result - - -# --------------------------------------------------------------------------- -# run_task — core pipeline callable from server.py or main() -# --------------------------------------------------------------------------- - - -def _build_system_prompt( - config: dict, - setup: dict, - hydrated_context: dict | None, - overrides: str, -) -> str: - """Assemble the system prompt with task-specific values and memory context.""" - task_type = config.get("task_type", "new_task") - try: - system_prompt = get_system_prompt(task_type) - except ValueError: - log("ERROR", f"Unknown task_type {task_type!r} — falling back to default system prompt") - system_prompt = SYSTEM_PROMPT - system_prompt = system_prompt.replace("{repo_url}", config["repo_url"]) - system_prompt = system_prompt.replace("{task_id}", config["task_id"]) - system_prompt = system_prompt.replace("{workspace}", AGENT_WORKSPACE) - system_prompt = system_prompt.replace("{branch_name}", setup["branch"]) - default_branch = setup.get("default_branch", "main") - system_prompt = system_prompt.replace("{default_branch}", default_branch) - system_prompt = system_prompt.replace("{max_turns}", str(config.get("max_turns", 100))) - setup_notes = ( - "\n".join(f"- {n}" for n in setup["notes"]) - if setup["notes"] - else "All setup steps completed successfully." 
- ) - system_prompt = system_prompt.replace("{setup_notes}", setup_notes) - - # Inject memory context from orchestrator hydration - memory_context_text = "(No previous knowledge available for this repository.)" - if hydrated_context and hydrated_context.get("memory_context"): - mc = hydrated_context["memory_context"] - mc_parts = [] - if mc.get("repo_knowledge"): - mc_parts.append("**Repository knowledge:**") - for item in mc["repo_knowledge"]: - mc_parts.append(f"- {item}") - if mc.get("past_episodes"): - mc_parts.append("\n**Past task episodes:**") - for item in mc["past_episodes"]: - mc_parts.append(f"- {item}") - if mc_parts: - memory_context_text = "\n".join(mc_parts) - system_prompt = system_prompt.replace("{memory_context}", memory_context_text) - - # Substitute PR-specific placeholders - pr_number_val = config.get("pr_number", "") - if pr_number_val: - system_prompt = system_prompt.replace("{pr_number}", str(pr_number_val)) - elif "{pr_number}" in system_prompt: - log("WARN", "System prompt contains {pr_number} placeholder but no pr_number in config") - system_prompt = system_prompt.replace("{pr_number}", "(unknown)") - - # Append Blueprint system_prompt_overrides after all placeholder - # substitutions (avoids double-substitution if overrides contain - # template placeholders like {repo_url}). - if overrides: - system_prompt += f"\n\n## Additional instructions\n\n{overrides}" - n = len(overrides) - log("TASK", f"Applied system prompt overrides ({n} chars)") - - return system_prompt - - -def _discover_project_config(repo_dir: str) -> dict[str, list[str]]: - """Scan the cloned repo for project-level configuration files. - - Returns a dict mapping config categories to lists of file paths found. 
- """ - project_config: dict[str, list[str]] = {} - try: - # CLAUDE.md instructions - for md in ["CLAUDE.md", os.path.join(".claude", "CLAUDE.md")]: - if os.path.isfile(os.path.join(repo_dir, md)): - project_config.setdefault("instructions", []).append(md) - # .claude/rules/*.md - rules_dir = os.path.join(repo_dir, ".claude", "rules") - if os.path.isdir(rules_dir): - for p in glob.glob(os.path.join(rules_dir, "*.md")): - project_config.setdefault("rules", []).append(os.path.relpath(p, repo_dir)) - # .claude/settings.json - settings = os.path.join(repo_dir, ".claude", "settings.json") - if os.path.isfile(settings): - project_config["settings"] = [".claude/settings.json"] - # .claude/agents/*.md - agents_dir = os.path.join(repo_dir, ".claude", "agents") - if os.path.isdir(agents_dir): - for p in glob.glob(os.path.join(agents_dir, "*.md")): - project_config.setdefault("agents", []).append(os.path.relpath(p, repo_dir)) - # .mcp.json - mcp = os.path.join(repo_dir, ".mcp.json") - if os.path.isfile(mcp): - project_config["mcp_servers"] = [".mcp.json"] - except OSError as e: - log("WARN", f"Error scanning project config: {e}") - return project_config - - -def _write_memory( - config: dict, - setup: dict, - agent_result: dict, - start_time: float, - build_passed: bool, - pr_url: str | None, - memory_id: str, -) -> bool: - """Write task episode and repo learnings to AgentCore Memory. - - Returns True if any memory was successfully written. - """ - # Parse self-feedback from PR body — separate try-catch so extraction - # failures don't mask memory write errors (and vice versa). 
-    self_feedback = None
-    try:
-        self_feedback = _extract_agent_notes(setup["repo_dir"], setup["branch"], config)
-    except Exception as e:
-        log(
-            "WARN",
-            f"Agent notes extraction failed (non-fatal): {type(e).__name__}: {e}",
-        )
-
-    raw_cost = agent_result.get("cost_usd")
-    try:
-        episode_cost: float | None = float(raw_cost) if raw_cost is not None else None
-    except (ValueError, TypeError):
-        log("WARN", f"Invalid cost_usd: '{raw_cost}'")
-        episode_cost = None
-
-    # Memory writes are individually fail-open (return False on error)
-    episode_ok = agent_memory.write_task_episode(
-        memory_id=memory_id,
-        repo=config["repo_url"],
-        task_id=config["task_id"],
-        status="COMPLETED" if build_passed else "FAILED",
-        pr_url=pr_url,
-        cost_usd=episode_cost,
-        duration_s=round(time.time() - start_time, 1),
-        self_feedback=self_feedback,
-    )
-
-    learnings_ok = False
-    if self_feedback:
-        learnings_ok = agent_memory.write_repo_learnings(
-            memory_id=memory_id,
-            repo=config["repo_url"],
-            task_id=config["task_id"],
-            learnings=self_feedback,
-        )
-
-    log("MEMORY", f"Memory write: episode={episode_ok}, learnings={learnings_ok}")
-    return episode_ok or learnings_ok
-
-
-def run_task(
-    repo_url: str,
-    task_description: str = "",
-    issue_number: str = "",
-    github_token: str = "",
-    anthropic_model: str = "",
-    max_turns: int = 100,
-    max_budget_usd: float | None = None,
-    aws_region: str = "",
-    task_id: str = "",
-    hydrated_context: dict | None = None,
-    system_prompt_overrides: str = "",
-    prompt_version: str = "",
-    memory_id: str = "",
-    task_type: str = "new_task",
-    branch_name: str = "",
-    pr_number: str = "",
-) -> dict:
-    """Run the full agent pipeline and return a result dict.
-
-    This is the main entry point for both:
-    - AgentCore server mode (called by server.py /invocations)
-    - Local batch mode (called by main())
-
-    Returns a dict with: status, pr_url, build_passed, cost_usd,
-    turns, duration_s, task_id, error.
- """ - from opentelemetry.trace import StatusCode - - # Build config - config = build_config( - repo_url=repo_url, - task_description=task_description, - issue_number=issue_number, - github_token=github_token, - anthropic_model=anthropic_model, - max_turns=max_turns, - max_budget_usd=max_budget_usd, - aws_region=aws_region, - task_id=task_id, - system_prompt_overrides=system_prompt_overrides, - task_type=task_type, - branch_name=branch_name, - pr_number=pr_number, - ) - - log("TASK", f"Task ID: {config['task_id']}") - log("TASK", f"Repository: {config['repo_url']}") - log("TASK", f"Issue: {config['issue_number'] or '(none)'}") - log("TASK", f"Model: {config['anthropic_model']}") - - with task_span( - "task.pipeline", - attributes={ - "task.id": config["task_id"], - "repo.url": config["repo_url"], - "issue.number": config.get("issue_number", ""), - "agent.model": config["anthropic_model"], - }, - ) as root_span: - task_state.write_running(config["task_id"]) - - try: - # Context hydration - with task_span("task.context_hydration"): - if hydrated_context: - log("TASK", "Using hydrated context from orchestrator") - prompt = hydrated_context["user_prompt"] - if hydrated_context.get("issue"): - config["issue"] = hydrated_context["issue"] - if hydrated_context.get("resolved_base_branch"): - config["base_branch"] = hydrated_context["resolved_base_branch"] - if hydrated_context.get("truncated"): - log("WARN", "Context was truncated by orchestrator token budget") - else: - # Local batch mode — fetch issue and assemble prompt in-container - if config["issue_number"]: - log("TASK", f"Fetching issue #{config['issue_number']}...") - config["issue"] = fetch_github_issue( - config["repo_url"], config["issue_number"], config["github_token"] - ) - log("TASK", f" Title: {config['issue']['title']}") - - prompt = assemble_prompt(config) - - # Configure git and gh auth before setup_repo() uses them - subprocess.run( - ["git", "config", "--global", "user.name", "bgagent"], - check=True, 
- capture_output=True, - timeout=60, - ) - subprocess.run( - ["git", "config", "--global", "user.email", "bgagent@noreply.github.com"], - check=True, - capture_output=True, - timeout=60, - ) - os.environ["GITHUB_TOKEN"] = config["github_token"] - os.environ["GH_TOKEN"] = config["github_token"] - - # Set env vars for the prepare-commit-msg hook BEFORE setup_repo() - # so the hook has access to TASK_ID/PROMPT_VERSION from the start. - os.environ["TASK_ID"] = config["task_id"] - if prompt_version: - os.environ["PROMPT_VERSION"] = prompt_version - - # Setup repo (deterministic pre-hooks) - with task_span("task.repo_setup") as setup_span: - setup = setup_repo(config) - setup_span.set_attribute("build.before", setup.get("build_before", False)) - - system_prompt = _build_system_prompt( - config, setup, hydrated_context, system_prompt_overrides - ) - - # Log discovered repo-level project configuration - # (all files loaded by setting_sources=["project"]) - repo_dir = setup["repo_dir"] - project_config = _discover_project_config(repo_dir) - if project_config: - log("TASK", f"Repo project configuration: {project_config}") - else: - log("TASK", "No repo-level project configuration found") - - # Run agent - disk_before = get_disk_usage(AGENT_WORKSPACE) - start_time = time.time() - - log("TASK", "Starting agent...") - if config.get("max_budget_usd"): - log("TASK", f"Budget limit: ${config['max_budget_usd']:.2f}") - # Warn if uvloop is the active policy — subprocess SIGCHLD conflicts. 
-            policy = asyncio.get_event_loop_policy()
-            policy_name = type(policy).__name__
-            if "uvloop" in policy_name.lower():
-                log(
-                    "WARN",
-                    f"uvloop detected ({policy_name}) — this may cause subprocess "
-                    f"SIGCHLD conflicts with the Claude Agent SDK",
-                )
-            with task_span("task.agent_execution") as agent_span:
-                try:
-                    agent_result = asyncio.run(
-                        run_agent(prompt, system_prompt, config, cwd=setup["repo_dir"])
-                    )
-                except Exception as e:
-                    log("ERROR", f"Agent failed: {e}")
-                    agent_span.set_status(StatusCode.ERROR, str(e))
-                    agent_span.record_exception(e)
-                    agent_result = {
-                        "status": "error",
-                        "turns": 0,
-                        "cost_usd": None,
-                        "error": str(e),
-                    }
-
-            # Post-hooks
-            with task_span("task.post_hooks") as post_span:
-                # Safety net: commit any uncommitted tracked changes (skip for read-only tasks)
-                if config.get("task_type") == "pr_review":
-                    safety_committed = False
-                else:
-                    safety_committed = ensure_committed(setup["repo_dir"])
-                post_span.set_attribute("safety_net.committed", safety_committed)
-
-                build_passed = verify_build(setup["repo_dir"])
-                lint_passed = verify_lint(setup["repo_dir"])
-                pr_url = ensure_pr(
-                    config, setup, build_passed, lint_passed, agent_result=agent_result
-                )
-                post_span.set_attribute("build.passed", build_passed)
-                post_span.set_attribute("lint.passed", lint_passed)
-                post_span.set_attribute("pr.url", pr_url or "")
-
-            # Memory write — capture task episode and repo learnings
-            memory_written = False
-            effective_memory_id = memory_id or os.environ.get("MEMORY_ID", "")
-            if effective_memory_id:
-                memory_written = _write_memory(
-                    config,
-                    setup,
-                    agent_result,
-                    start_time,
-                    build_passed,
-                    pr_url,
-                    effective_memory_id,
-                )
-
-            # Metrics
-            duration = time.time() - start_time
-            disk_after = get_disk_usage(AGENT_WORKSPACE)
-
-            # Determine overall status:
-            #  - "success" if the agent reported success/end_turn and the build passes
-            #    (or the build was already broken before the agent ran — pre-existing failure)
-            #  - "success" if agent_status is unknown (SDK didn't yield ResultMessage)
-            #    but the pipeline produced a PR and the build didn't regress
-            #  - "error" otherwise
-            # NOTE: lint_passed is intentionally NOT used in the status
-            # determination — lint failures are advisory and reported in the PR
-            # body and span attributes but do not affect the task's terminal
-            # status. Lint regression detection is planned for Iteration 3c.
-            agent_status = agent_result["status"]
-            # Default True = assume build was green before, so a post-agent
-            # failure IS counted as a regression (conservative).
-            build_before = setup.get("build_before", True)
-            if config.get("task_type") == "pr_review":
-                build_ok = True  # Review task — build status is informational only
-                if not build_passed:
-                    log("INFO", "pr_review: build failed — informational only, not gating")
-            else:
-                build_ok = build_passed or not build_before
-                if not build_passed and not build_before and config.get("task_type") != "pr_review":
-                    log(
-                        "WARN",
-                        "Post-agent build failed, but build was already failing before "
-                        "agent changes — not counting as regression",
-                    )
-            if agent_status in ("success", "end_turn") and build_ok:
-                overall_status = "success"
-            elif agent_status == "unknown" and pr_url and build_ok:
-                log(
-                    "WARN",
-                    "Agent SDK did not yield a ResultMessage, but PR was created "
-                    "and build didn't regress — treating as success",
-                )
-                overall_status = "success"
-            else:
-                overall_status = "error"
-
-            result = {
-                "status": overall_status,
-                "agent_status": agent_status,
-                "pr_url": pr_url,
-                "build_passed": build_passed,
-                "lint_passed": lint_passed,
-                "cost_usd": agent_result.get("cost_usd"),
-                "turns": agent_result.get("num_turns") or agent_result.get("turns"),
-                "duration_s": round(duration, 1),
-                "task_id": config["task_id"],
-                "disk_before": format_bytes(disk_before),
-                "disk_after": format_bytes(disk_after),
-                "disk_delta": format_bytes(disk_after - disk_before),
-                "prompt_version": prompt_version or None,
-                "memory_written": memory_written,
-            }
-            if agent_result.get("error"):
-                result["error"] = agent_result["error"]
-            if agent_result.get("session_id"):
-                result["session_id"] = agent_result["session_id"]
-
-            # Propagate token usage from agent result into metrics
-            usage = agent_result.get("usage")
-            if isinstance(usage, dict):
-                result["input_tokens"] = usage.get("input_tokens", 0)
-                result["output_tokens"] = usage.get("output_tokens", 0)
-                result["cache_read_input_tokens"] = usage.get("cache_read_input_tokens", 0)
-                result["cache_creation_input_tokens"] = usage.get("cache_creation_input_tokens", 0)
-            elif usage is not None:
-                log(
-                    "WARN",
-                    f"agent_result['usage'] has unexpected type {type(usage).__name__} — "
-                    f"token usage will not be recorded in metrics or span attributes",
-                )
-
-            # Record terminal attributes on the root span for CloudWatch querying
-            root_span.set_attribute("task.status", overall_status)
-            cost = agent_result.get("cost_usd")
-            if cost is not None:
-                root_span.set_attribute("agent.cost_usd", float(cost))
-            turns = agent_result.get("num_turns") or agent_result.get("turns")
-            if turns is not None:
-                root_span.set_attribute("agent.turns", int(turns))
-            root_span.set_attribute("build.passed", build_passed)
-            root_span.set_attribute("lint.passed", lint_passed)
-            root_span.set_attribute("pr.url", pr_url or "")
-            root_span.set_attribute("task.duration_s", round(duration, 1))
-            if isinstance(usage, dict):
-                root_span.set_attribute("agent.input_tokens", usage.get("input_tokens", 0))
-                root_span.set_attribute("agent.output_tokens", usage.get("output_tokens", 0))
-                root_span.set_attribute(
-                    "agent.cache_read_input_tokens",
-                    usage.get("cache_read_input_tokens", 0),
-                )
-                root_span.set_attribute(
-                    "agent.cache_creation_input_tokens",
-                    usage.get("cache_creation_input_tokens", 0),
-                )
-            if overall_status != "success":
-                root_span.set_status(
-                    StatusCode.ERROR, str(result.get("error", "task did not succeed"))
-                )
-
-            # Emit metrics to CloudWatch Logs and print summary to stdout
-            print_metrics(result)
-
-            # Persist terminal state to DynamoDB
-            terminal_status = "COMPLETED" if overall_status == "success" else "FAILED"
-            task_state.write_terminal(config["task_id"], terminal_status, result)
-
-            return result
-
-        except Exception as e:
-            # Ensure the task is marked FAILED in DynamoDB even if the pipeline
-            # crashes before reaching the normal terminal-state write.
-            task_state.write_terminal(config["task_id"], "FAILED", {"error": str(e)})
-            raise
-
-
-# ---------------------------------------------------------------------------
-# Local batch mode
-# ---------------------------------------------------------------------------
-
-
-def main():
-    config = get_config()
-
-    print("Task configuration loaded.", flush=True)
-    print("Dry run mode detected.", flush=True)
-    print()
-
-    if config["dry_run"]:
-        # Context hydration for dry run
-        if config["issue_number"]:
-            config["issue"] = fetch_github_issue(
-                config["repo_url"], config["issue_number"], config["github_token"]
-            )
-        prompt = assemble_prompt(config)
-        system_prompt = SYSTEM_PROMPT.replace("{repo_url}", config["repo_url"])
-        system_prompt = system_prompt.replace("{task_id}", config["task_id"])
-        system_prompt = system_prompt.replace("{workspace}", AGENT_WORKSPACE)
-        system_prompt = system_prompt.replace("{branch_name}", "bgagent/{task_id}/dry-run")
-        system_prompt = system_prompt.replace("{default_branch}", "main")
-        system_prompt = system_prompt.replace("{max_turns}", str(config.get("max_turns", 100)))
-        system_prompt = system_prompt.replace("{setup_notes}", "(dry run — setup not executed)")
-        system_prompt = system_prompt.replace("{memory_context}", "(dry run — memory not loaded)")
-        overrides = config.get("system_prompt_overrides", "")
-        if overrides:
-            system_prompt += f"\n\n## Additional instructions\n\n{overrides}"
-        system_prompt_hash = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()[:12]
-        prompt_hash = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]
-        print("\n--- SYSTEM PROMPT (REDACTED) ---")
-        print(
-            f"length={len(system_prompt)} chars sha256={system_prompt_hash} "
-            "(set DEBUG_DRY_RUN_PROMPTS=1 to print full text)",
-            flush=True,
-        )
-        print("\n--- USER PROMPT (REDACTED) ---")
-        print(
-            f"length={len(prompt)} chars sha256={prompt_hash} "
-            "(set DEBUG_DRY_RUN_PROMPTS=1 to print full text)",
-            flush=True,
-        )
-        if os.environ.get("DEBUG_DRY_RUN_PROMPTS") == "1":
-            print(
-                "\nDEBUG_DRY_RUN_PROMPTS=1 is set, but full prompt printing is disabled "
-                "for secure logging compliance.",
-                flush=True,
-            )
-        print("\n--- DRY RUN COMPLETE ---")
-        return
-
-    # Run the full pipeline. run_task() is sync and calls asyncio.run()
-    # internally, so main() must NOT be async (nested asyncio.run() is illegal).
-    result = run_task(
-        repo_url=config["repo_url"],
-        task_description=config["task_description"],
-        issue_number=config["issue_number"],
-        github_token=config["github_token"],
-        anthropic_model=config["anthropic_model"],
-        max_turns=config["max_turns"],
-        max_budget_usd=config.get("max_budget_usd"),
-        aws_region=config["aws_region"],
-        system_prompt_overrides=config.get("system_prompt_overrides", ""),
-    )
-
-    # Exit with error if agent failed
-    if result["status"] != "success":
-        sys.exit(1)
-
-
-if __name__ == "__main__":
-    main()
diff --git a/agent/pyproject.toml b/agent/pyproject.toml
index 41d71f1..9f910a7 100644
--- a/agent/pyproject.toml
+++ b/agent/pyproject.toml
@@ -11,6 +11,7 @@ dependencies = [
     "uvicorn==0.42.0",
     "aws-opentelemetry-distro~=0.15.0",
     "mcp==1.23.0",
+    "cedarpy>=4.8.0",
 ]

 [tool.bandit]
@@ -55,13 +56,15 @@ ignore = [
 ]

 [tool.ruff.lint.per-file-ignores]
-"system_prompt.py" = ["E501"]  # long prompt strings
-"prompts/*.py" = ["E501"]  # long prompt strings
-"tests/**" = ["S101", "S106"]  # assert in tests is fine; hardcoded test tokens are not secrets
+"src/entrypoint.py" = ["E402"]  # re-export shim: importlib.reload() call must precede re-export from-imports
+"src/system_prompt.py" = ["E501"]  # long prompt strings
+"src/prompts/*.py" = ["E501"]  # long prompt strings
+"tests/**" = ["S101", "S106", "S108", "E402"]  # assert; test tokens; /tmp paths; importorskip

 [tool.pytest.ini_options]
 testpaths = ["tests"]
-pythonpath = ["."]
+pythonpath = ["src"]

 [tool.ty.environment]
 python-version = "3.13"
+extra-paths = ["src"]
diff --git a/agent/run.sh b/agent/run.sh
index da1230e..dbb8579 100755
--- a/agent/run.sh
+++ b/agent/run.sh
@@ -227,5 +227,5 @@ if [[ "$MODE" == "server" ]]; then
   echo ""
   docker run "${DOCKER_ARGS[@]}" bgagent-local
 else
-  docker run "${DOCKER_ARGS[@]}" bgagent-local python /app/entrypoint.py
+  docker run "${DOCKER_ARGS[@]}" bgagent-local python /app/src/entrypoint.py
 fi
diff --git a/agent/src/__init__.py b/agent/src/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/agent/src/config.py b/agent/src/config.py
new file mode 100644
index 0000000..4e26e2e
--- /dev/null
+++ b/agent/src/config.py
@@ -0,0 +1,124 @@
+"""Agent configuration: constants and config-builder."""
+
+import os
+import sys
+import uuid
+
+from models import TaskConfig, TaskType
+
+AGENT_WORKSPACE = os.environ.get("AGENT_WORKSPACE", "/workspace")
+
+# Task types that operate on an existing pull request.
+PR_TASK_TYPES = frozenset(("pr_iteration", "pr_review"))
+
+
+def resolve_github_token() -> str:
+    """Resolve GitHub token from Secrets Manager or environment variable.
+
+    In deployed mode, GITHUB_TOKEN_SECRET_ARN is set and the token is fetched
+    from Secrets Manager on first call, then cached in os.environ.
+    For local development, falls back to GITHUB_TOKEN.
+    """
+    # Return cached value if already resolved
+    cached = os.environ.get("GITHUB_TOKEN", "")
+    if cached:
+        return cached
+    secret_arn = os.environ.get("GITHUB_TOKEN_SECRET_ARN")
+    if secret_arn:
+        import boto3
+
+        region = os.environ.get("AWS_REGION") or os.environ.get("AWS_DEFAULT_REGION")
+        client = boto3.client("secretsmanager", region_name=region)
+        resp = client.get_secret_value(SecretId=secret_arn)
+        token = resp["SecretString"]
+        # Cache in env so downstream tools (git, gh CLI) work unchanged
+        os.environ["GITHUB_TOKEN"] = token
+        return token
+    return ""
+
+
+def build_config(
+    repo_url: str,
+    task_description: str = "",
+    issue_number: str = "",
+    github_token: str = "",
+    anthropic_model: str = "",
+    max_turns: int = 10,
+    max_budget_usd: float | None = None,
+    aws_region: str = "",
+    dry_run: bool = False,
+    task_id: str = "",
+    system_prompt_overrides: str = "",
+    task_type: str = "new_task",
+    branch_name: str = "",
+    pr_number: str = "",
+) -> TaskConfig:
+    """Build and validate configuration from explicit parameters.
+
+    Parameters fall back to environment variables if empty.
+    """
+    resolved_repo_url = repo_url or os.environ.get("REPO_URL", "")
+    resolved_issue_number = issue_number or os.environ.get("ISSUE_NUMBER", "")
+    resolved_task_description = task_description or os.environ.get("TASK_DESCRIPTION", "")
+    resolved_github_token = github_token or resolve_github_token()
+    resolved_aws_region = aws_region or os.environ.get("AWS_REGION", "")
+    resolved_anthropic_model = anthropic_model or os.environ.get(
+        "ANTHROPIC_MODEL", "us.anthropic.claude-sonnet-4-6"
+    )
+
+    errors = []
+    if not resolved_repo_url:
+        errors.append("repo_url is required (e.g., 'owner/repo')")
+    if not resolved_github_token:
+        errors.append("github_token is required")
+    if not resolved_aws_region:
+        errors.append("aws_region is required for Bedrock")
+    try:
+        task = TaskType(task_type)
+    except ValueError:
+        errors.append(f"Invalid task_type: '{task_type}'")
+        task = None
+    if task and task.is_pr_task:
+        if not pr_number:
+            errors.append("pr_number is required for pr_iteration/pr_review task type")
+    elif task and not resolved_issue_number and not resolved_task_description:
+        errors.append("Either issue_number or task_description is required")
+
+    if errors:
+        raise ValueError("; ".join(errors))
+
+    return TaskConfig(
+        repo_url=resolved_repo_url,
+        issue_number=resolved_issue_number,
+        task_description=resolved_task_description,
+        github_token=resolved_github_token,
+        aws_region=resolved_aws_region,
+        anthropic_model=resolved_anthropic_model,
+        dry_run=dry_run,
+        max_turns=max_turns,
+        max_budget_usd=max_budget_usd,
+        system_prompt_overrides=system_prompt_overrides,
+        task_type=task_type,
+        branch_name=branch_name,
+        pr_number=pr_number,
+        task_id=task_id or uuid.uuid4().hex[:12],
+    )
+
+
+def get_config() -> TaskConfig:
+    """Parse configuration from environment variables (local batch mode)."""
+    try:
+        return build_config(
+            repo_url=os.environ.get("REPO_URL", ""),
+            task_description=os.environ.get("TASK_DESCRIPTION", ""),
+            issue_number=os.environ.get("ISSUE_NUMBER", ""),
+            github_token=os.environ.get("GITHUB_TOKEN", ""),
+            anthropic_model=os.environ.get("ANTHROPIC_MODEL", ""),
+            max_turns=int(os.environ.get("MAX_TURNS", "100")),
+            max_budget_usd=float(os.environ.get("MAX_BUDGET_USD", "0")) or None,
+            aws_region=os.environ.get("AWS_REGION", ""),
+            dry_run=os.environ.get("DRY_RUN", "").lower() in ("1", "true", "yes"),
+        )
+    except ValueError as e:
+        print(f"ERROR: {e}", file=sys.stderr)
+        sys.exit(1)
diff --git a/agent/src/context.py b/agent/src/context.py
new file mode 100644
index 0000000..abb2e9d
--- /dev/null
+++ b/agent/src/context.py
@@ -0,0 +1,78 @@
+"""Context hydration: GitHub issue fetching and prompt assembly."""
+
+import requests
+
+from models import GitHubIssue, IssueComment, TaskConfig
+
+
+def fetch_github_issue(repo_url: str, issue_number: str, token: str) -> GitHubIssue:
+    """Fetch a GitHub issue's title, body, and comments."""
+    headers = {
+        "Authorization": f"token {token}",
+        "Accept": "application/vnd.github.v3+json",
+    }
+
+    # Fetch issue
+    issue_resp = requests.get(
+        f"https://api.github.com/repos/{repo_url}/issues/{issue_number}",
+        headers=headers,
+        timeout=30,
+    )
+    issue_resp.raise_for_status()
+    issue = issue_resp.json()
+
+    # Fetch comments
+    comments: list[IssueComment] = []
+    if issue.get("comments", 0) > 0:
+        comments_resp = requests.get(
+            f"https://api.github.com/repos/{repo_url}/issues/{issue_number}/comments",
+            headers=headers,
+            timeout=30,
+        )
+        comments_resp.raise_for_status()
+        comments = [
+            IssueComment(author=c["user"]["login"], body=c["body"]) for c in comments_resp.json()
+        ]
+
+    return GitHubIssue(
+        title=issue["title"],
+        body=issue.get("body", "") or "",
+        number=issue["number"],
+        comments=comments,
+    )
+
+
+def assemble_prompt(config: TaskConfig) -> str:
+    """Assemble the user prompt from issue context and task description.
+
+    .. deprecated::
+        In production (AgentCore server mode), the orchestrator's
+        ``assembleUserPrompt()`` in ``context-hydration.ts`` is the sole prompt
+        assembler. The hydrated prompt arrives via
+        ``HydratedContext.user_prompt`` (validated from the incoming JSON).
+        This Python implementation is retained only for **local batch mode**
+        (``python src/entrypoint.py``) and **dry-run mode** (``DRY_RUN=1``).
+    """
+    parts = []
+
+    parts.append(f"Task ID: {config.task_id}")
+    parts.append(f"Repository: {config.repo_url}")
+
+    if config.issue:
+        issue = config.issue
+        parts.append(f"\n## GitHub Issue #{issue.number}: {issue.title}\n")
+        parts.append(issue.body or "(no description)")
+        if issue.comments:
+            parts.append("\n### Comments\n")
+            for c in issue.comments:
+                parts.append(f"**@{c.author}**: {c.body}\n")
+
+    if config.task_description:
+        parts.append(f"\n## Task\n\n{config.task_description}")
+    elif config.issue:
+        parts.append(
+            "\n## Task\n\nResolve the GitHub issue described above. "
+            "Follow the workflow in your system instructions."
+        )
+
+    return "\n".join(parts)
diff --git a/agent/src/entrypoint.py b/agent/src/entrypoint.py
new file mode 100644
index 0000000..7a2611b
--- /dev/null
+++ b/agent/src/entrypoint.py
@@ -0,0 +1,50 @@
+"""Re-export shim for backward compatibility.
+
+Existing callers (tests) that import from ``entrypoint`` continue
+to work. New code should import from the specific module directly.
+"""
+
+import importlib as _importlib
+
+# Reload config so that ``importlib.reload(entrypoint)`` picks up env changes
+# (e.g. AGENT_WORKSPACE) — needed for backward-compatible test patterns.
+import config as _config
+
+_importlib.reload(_config)
+
+from config import (  # noqa: F401
+    AGENT_WORKSPACE,
+    PR_TASK_TYPES,
+    build_config,
+    get_config,
+    resolve_github_token,
+)
+from context import assemble_prompt, fetch_github_issue  # noqa: F401
+from models import (  # noqa: F401
+    AgentResult,
+    GitHubIssue,
+    HydratedContext,
+    IssueComment,
+    MemoryContext,
+    RepoSetup,
+    TaskConfig,
+    TaskResult,
+    TaskType,
+    TokenUsage,
+)
+from pipeline import main, run_task  # noqa: F401
+from post_hooks import (  # noqa: F401
+    ensure_committed,
+    ensure_pr,
+    ensure_pushed,
+    verify_build,
+    verify_lint,
+)
+from prompt_builder import build_system_prompt as _build_system_prompt  # noqa: F401
+from prompt_builder import discover_project_config as _discover_project_config  # noqa: F401
+from runner import run_agent  # noqa: F401
+from shell import log, redact_secrets, run_cmd, slugify, truncate  # noqa: F401
+from telemetry import format_bytes, get_disk_usage, print_metrics  # noqa: F401
+
+if __name__ == "__main__":
+    main()
diff --git a/agent/src/hooks.py b/agent/src/hooks.py
new file mode 100644
index 0000000..ba7df83
--- /dev/null
+++ b/agent/src/hooks.py
@@ -0,0 +1,117 @@
+"""PreToolUse hook callback for Cedar policy enforcement.
+
+Integrates the PolicyEngine with the Claude Agent SDK's hook system
+to enforce tool-use policies at runtime.
+"""
+
+from __future__ import annotations
+
+import json
+from typing import TYPE_CHECKING, Any
+
+from shell import log
+
+if TYPE_CHECKING:
+    from policy import PolicyEngine
+    from telemetry import _TrajectoryWriter
+
+
+async def pre_tool_use_hook(
+    hook_input: Any,
+    tool_use_id: str | None,
+    hook_context: Any,
+    *,
+    engine: PolicyEngine,
+    trajectory: _TrajectoryWriter | None = None,
+) -> dict:
+    """PreToolUse hook: evaluate tool call against Cedar policies.
+
+    Returns a dict with hookSpecificOutput containing:
+    - permissionDecision: "allow" or "deny"
+    - permissionDecisionReason: explanation string
+    """
+    if not isinstance(hook_input, dict):
+        log("WARN", "PreToolUse hook received non-dict input — denying")
+        return {
+            "hookSpecificOutput": {
+                "hookEventName": "PreToolUse",
+                "permissionDecision": "deny",
+                "permissionDecisionReason": "invalid hook input",
+            }
+        }
+
+    tool_name = hook_input.get("tool_name", "unknown")
+    tool_input = hook_input.get("tool_input", {})
+    if isinstance(tool_input, str):
+        try:
+            tool_input = json.loads(tool_input)
+        except (json.JSONDecodeError, TypeError):
+            log("WARN", f"PreToolUse hook failed to parse tool_input — denying {tool_name}")
+            return {
+                "hookSpecificOutput": {
+                    "hookEventName": "PreToolUse",
+                    "permissionDecision": "deny",
+                    "permissionDecisionReason": "unparseable tool input",
+                }
+            }
+
+    decision = engine.evaluate_tool_use(tool_name, tool_input)
+
+    # Emit telemetry for all non-permitted decisions (including fail-closed)
+    if trajectory and decision.reason != "permitted":
+        trajectory.write_policy_decision(
+            tool_name, decision.allowed, decision.reason, decision.duration_ms
+        )
+
+    if not decision.allowed:
+        log("POLICY", f"DENIED: {tool_name} — {decision.reason}")
+        return {
+            "hookSpecificOutput": {
+                "hookEventName": "PreToolUse",
+                "permissionDecision": "deny",
+                "permissionDecisionReason": decision.reason,
+            }
+        }
+
+    return {
+        "hookSpecificOutput": {
+            "hookEventName": "PreToolUse",
+            "permissionDecision": "allow",
+            "permissionDecisionReason": "permitted",
+        }
+    }
+
+
+def build_hook_matchers(
+    engine: PolicyEngine,
+    trajectory: _TrajectoryWriter | None = None,
+) -> dict:
+    """Build hook matchers dict for ClaudeAgentOptions.
+
+    Returns a dict mapping HookEvent strings to lists of HookMatcher
+    instances, ready to pass as ``hooks=...`` to ClaudeAgentOptions.
+
+    The SDK expects ``dict[HookEvent, list[HookMatcher]]`` where HookMatcher
+    has ``matcher: str | None`` and ``hooks: list[HookCallback]``.
+    """
+    from claude_agent_sdk.types import (
+        HookContext,
+        HookInput,
+        HookJSONOutput,
+        HookMatcher,
+        SyncHookJSONOutput,
+    )
+
+    # Closure-based wrapper matches the HookCallback signature exactly:
+    # (HookInput, str | None, HookContext) -> Awaitable[HookJSONOutput]
+    async def _pre(
+        hook_input: HookInput, tool_use_id: str | None, ctx: HookContext
+    ) -> HookJSONOutput:
+        result = await pre_tool_use_hook(
+            hook_input, tool_use_id, ctx, engine=engine, trajectory=trajectory
+        )
+        return SyncHookJSONOutput(**result)  # type: ignore[typeddict-item]
+
+    return {
+        "PreToolUse": [HookMatcher(matcher=None, hooks=[_pre])],
+    }
diff --git a/agent/memory.py b/agent/src/memory.py
similarity index 100%
rename from agent/memory.py
rename to agent/src/memory.py
diff --git a/agent/src/models.py b/agent/src/models.py
new file mode 100644
index 0000000..05c155f
--- /dev/null
+++ b/agent/src/models.py
@@ -0,0 +1,134 @@
+"""Data models and enumerations for the agent pipeline."""
+
+from __future__ import annotations
+
+from enum import StrEnum
+
+from pydantic import BaseModel, ConfigDict
+
+
+class TaskType(StrEnum):
+    """Supported task types."""
+
+    new_task = "new_task"
+    pr_iteration = "pr_iteration"
+    pr_review = "pr_review"
+
+    @property
+    def is_pr_task(self) -> bool:
+        return self in (TaskType.pr_iteration, TaskType.pr_review)
+
+    @property
+    def is_read_only(self) -> bool:
+        return self == TaskType.pr_review
+
+
+class IssueComment(BaseModel):
+    model_config = ConfigDict(frozen=True)
+
+    author: str
+    body: str
+
+
+class GitHubIssue(BaseModel):
+    model_config = ConfigDict(frozen=True)
+
+    title: str
+    body: str = ""
+    number: int
+    comments: list[IssueComment] = []
+
+
+class MemoryContext(BaseModel):
+    model_config = ConfigDict(frozen=True)
+
+    repo_knowledge: list[str] = []
+    past_episodes: list[str] = []
+
+
+class HydratedContext(BaseModel):
+    model_config = ConfigDict(frozen=True)
+
+    user_prompt: str
+    issue: GitHubIssue | None = None
+    resolved_base_branch: str | None = None
+    truncated: bool = False
+    memory_context: MemoryContext | None = None
+
+
+class TaskConfig(BaseModel):
+    model_config = ConfigDict(validate_assignment=True)
+
+    repo_url: str
+    issue_number: str = ""
+    task_description: str = ""
+    github_token: str
+    aws_region: str
+    anthropic_model: str = "us.anthropic.claude-sonnet-4-6"
+    dry_run: bool = False
+    max_turns: int = 10
+    max_budget_usd: float | None = None
+    system_prompt_overrides: str = ""
+    task_type: str = "new_task"
+    branch_name: str = ""
+    pr_number: str = ""
+    task_id: str = ""
+    # Enriched mid-flight by pipeline.py:
+    cedar_policies: list[str] = []
+    issue: GitHubIssue | None = None
+    base_branch: str | None = None
+
+
+class RepoSetup(BaseModel):
+    model_config = ConfigDict(frozen=True)
+
+    repo_dir: str
+    branch: str
+    notes: list[str] = []
+    build_before: bool = True
+    lint_before: bool = True
+    default_branch: str = "main"
+
+
+class TokenUsage(BaseModel):
+    model_config = ConfigDict(frozen=True)
+
+    input_tokens: int = 0
+    output_tokens: int = 0
+    cache_read_input_tokens: int = 0
+    cache_creation_input_tokens: int = 0
+
+
+class AgentResult(BaseModel):
+    status: str = "unknown"
+    turns: int = 0
+    num_turns: int = 0
+    cost_usd: float | None = None
+    duration_ms: int = 0
+    duration_api_ms: int = 0
+    session_id: str = ""
+    error: str | None = None
+    usage: TokenUsage | None = None
+
+
+class TaskResult(BaseModel):
+    status: str
+    agent_status: str = "unknown"
+    pr_url: str | None = None
+    build_passed: bool = False
+    lint_passed: bool = False
+    cost_usd: float | None = None
+    turns: int | None = None
+    duration_s: float = 0.0
+    task_id: str = ""
+    disk_before: str = ""
+    disk_after: str = ""
+    disk_delta: str = ""
+    prompt_version: str | None = None
+    memory_written: bool = False
+    error: str | None = None
+    session_id: str | None = None
+    input_tokens: int | None = None
+    output_tokens: int | None = None
+    cache_read_input_tokens: int | None = None
+    cache_creation_input_tokens: int | None = None
diff --git a/agent/observability.py b/agent/src/observability.py
similarity index 100%
rename from agent/observability.py
rename to agent/src/observability.py
diff --git a/agent/src/pipeline.py b/agent/src/pipeline.py
new file mode 100644
index 0000000..8708800
--- /dev/null
+++ b/agent/src/pipeline.py
@@ -0,0 +1,453 @@
+"""Task pipeline: the main orchestrator that wires all modules together."""
+
+from __future__ import annotations
+
+import asyncio
+import hashlib
+import os
+import subprocess
+import sys
+import time
+
+import memory as agent_memory
+import task_state
+from config import AGENT_WORKSPACE, build_config, get_config
+from context import assemble_prompt, fetch_github_issue
+from models import AgentResult, HydratedContext, RepoSetup, TaskConfig, TaskResult
+from observability import task_span
+from post_hooks import (
+    _extract_agent_notes,
+    ensure_committed,
+    ensure_pr,
+    verify_build,
+    verify_lint,
+)
+from prompt_builder import build_system_prompt, discover_project_config
+from runner import run_agent
+from shell import log
+from system_prompt import SYSTEM_PROMPT
+from telemetry import format_bytes, get_disk_usage, print_metrics
+
+
+def _write_memory(
+    config: TaskConfig,
+    setup: RepoSetup,
+    agent_result: AgentResult,
+    start_time: float,
+    build_passed: bool,
+    pr_url: str | None,
+    memory_id: str,
+) -> bool:
+    """Write task episode and repo learnings to AgentCore Memory.
+
+    Returns True if any memory was successfully written.
+    """
+    # Parse self-feedback from PR body — separate try-catch so extraction
+    # failures don't mask memory write errors (and vice versa).
+    self_feedback = None
+    try:
+        self_feedback = _extract_agent_notes(setup.repo_dir, setup.branch, config)
+    except Exception as e:
+        log(
+            "WARN",
+            f"Agent notes extraction failed (non-fatal): {type(e).__name__}: {e}",
+        )
+
+    episode_cost = agent_result.cost_usd
+
+    # Memory writes are individually fail-open (return False on error)
+    episode_ok = agent_memory.write_task_episode(
+        memory_id=memory_id,
+        repo=config.repo_url,
+        task_id=config.task_id,
+        status="COMPLETED" if build_passed else "FAILED",
+        pr_url=pr_url,
+        cost_usd=episode_cost,
+        duration_s=round(time.time() - start_time, 1),
+        self_feedback=self_feedback,
+    )
+
+    learnings_ok = False
+    if self_feedback:
+        learnings_ok = agent_memory.write_repo_learnings(
+            memory_id=memory_id,
+            repo=config.repo_url,
+            task_id=config.task_id,
+            learnings=self_feedback,
+        )
+
+    log("MEMORY", f"Memory write: episode={episode_ok}, learnings={learnings_ok}")
+    return episode_ok or learnings_ok
+
+
+def run_task(
+    repo_url: str,
+    task_description: str = "",
+    issue_number: str = "",
+    github_token: str = "",
+    anthropic_model: str = "",
+    max_turns: int = 100,
+    max_budget_usd: float | None = None,
+    aws_region: str = "",
+    task_id: str = "",
+    hydrated_context: dict | None = None,
+    system_prompt_overrides: str = "",
+    prompt_version: str = "",
+    memory_id: str = "",
+    task_type: str = "new_task",
+    branch_name: str = "",
+    pr_number: str = "",
+    cedar_policies: list[str] | None = None,
+) -> dict:
+    """Run the full agent pipeline and return a serialized result dict.
+
+    This is the main entry point for both:
+    - AgentCore server mode (called by server.py /invocations)
+    - Local batch mode (called by main())
+
+    Builds a ``TaskResult`` Pydantic model internally, then returns
+    ``TaskResult.model_dump()`` for downstream consumers (DynamoDB,
+    metrics, server response).
+    """
+    from opentelemetry.trace import StatusCode
+
+    from repo import setup_repo
+
+    # Build config
+    config = build_config(
+        repo_url=repo_url,
+        task_description=task_description,
+        issue_number=issue_number,
+        github_token=github_token,
+        anthropic_model=anthropic_model,
+        max_turns=max_turns,
+        max_budget_usd=max_budget_usd,
+        aws_region=aws_region,
+        task_id=task_id,
+        system_prompt_overrides=system_prompt_overrides,
+        task_type=task_type,
+        branch_name=branch_name,
+        pr_number=pr_number,
+    )
+
+    # Inject Cedar policies into config for the PolicyEngine in runner.py
+    if cedar_policies:
+        config.cedar_policies = cedar_policies
+
+    log("TASK", f"Task ID: {config.task_id}")
+    log("TASK", f"Repository: {config.repo_url}")
+    log("TASK", f"Issue: {config.issue_number or '(none)'}")
+    log("TASK", f"Model: {config.anthropic_model}")
+
+    with task_span(
+        "task.pipeline",
+        attributes={
+            "task.id": config.task_id,
+            "repo.url": config.repo_url,
+            "issue.number": config.issue_number,
+            "agent.model": config.anthropic_model,
+        },
+    ) as root_span:
+        task_state.write_running(config.task_id)
+
+        try:
+            # Context hydration
+            with task_span("task.context_hydration"):
+                if hydrated_context:
+                    log("TASK", "Using hydrated context from orchestrator")
+                    hc = HydratedContext.model_validate(hydrated_context)
+                    prompt = hc.user_prompt
+                    if hc.issue:
+                        config.issue = hc.issue
+                    if hc.resolved_base_branch:
+                        config.base_branch = hc.resolved_base_branch
+                    if hc.truncated:
+                        log("WARN", "Context was truncated by orchestrator token budget")
+                else:
+                    hc = None
+                    # Local batch mode — fetch issue and assemble prompt in-container
+                    if config.issue_number:
+                        log("TASK", f"Fetching issue #{config.issue_number}...")
+                        config.issue = fetch_github_issue(
+                            config.repo_url, config.issue_number, config.github_token
+                        )
+                        log("TASK", f"  Title: {config.issue.title}")
+
+                    prompt = assemble_prompt(config)
+
+            # Configure git and gh auth before setup_repo() uses them
+            subprocess.run(
+                ["git", "config", "--global", "user.name", "bgagent"],
+                check=True,
+                capture_output=True,
+                timeout=60,
+            )
+            subprocess.run(
+                ["git", "config", "--global", "user.email", "bgagent@noreply.github.com"],
+                check=True,
+                capture_output=True,
+                timeout=60,
+            )
+            os.environ["GITHUB_TOKEN"] = config.github_token
+            os.environ["GH_TOKEN"] = config.github_token
+
+            # Set env vars for the prepare-commit-msg hook BEFORE setup_repo()
+            # so the hook has access to TASK_ID/PROMPT_VERSION from the start.
+            os.environ["TASK_ID"] = config.task_id
+            if prompt_version:
+                os.environ["PROMPT_VERSION"] = prompt_version
+
+            # Setup repo (deterministic pre-hooks)
+            with task_span("task.repo_setup") as setup_span:
+                setup = setup_repo(config)
+                setup_span.set_attribute("build.before", setup.build_before)
+
+            system_prompt = build_system_prompt(config, setup, hc, system_prompt_overrides)
+
+            # Log discovered repo-level project configuration
+            # (all files loaded by setting_sources=["project"])
+            repo_dir = setup.repo_dir
+            project_config = discover_project_config(repo_dir)
+            if project_config:
+                log("TASK", f"Repo project configuration: {project_config}")
+            else:
+                log("TASK", "No repo-level project configuration found")
+
+            # Run agent
+            disk_before = get_disk_usage(AGENT_WORKSPACE)
+            start_time = time.time()
+
+            log("TASK", "Starting agent...")
+            if config.max_budget_usd:
+                log("TASK", f"Budget limit: ${config.max_budget_usd:.2f}")
+            # Warn if uvloop is the active policy — subprocess SIGCHLD conflicts.
+ policy = asyncio.get_event_loop_policy() + policy_name = type(policy).__name__ + if "uvloop" in policy_name.lower(): + log( + "WARN", + f"uvloop detected ({policy_name}) — this may cause subprocess " + f"SIGCHLD conflicts with the Claude Agent SDK", + ) + with task_span("task.agent_execution") as agent_span: + try: + agent_result = asyncio.run( + run_agent(prompt, system_prompt, config, cwd=setup.repo_dir) + ) + except Exception as e: + log("ERROR", f"Agent failed: {e}") + agent_span.set_status(StatusCode.ERROR, str(e)) + agent_span.record_exception(e) + agent_result = AgentResult(status="error", error=str(e)) + + # Post-hooks + with task_span("task.post_hooks") as post_span: + # Safety net: commit any uncommitted tracked changes (skip for read-only tasks) + if config.task_type == "pr_review": + safety_committed = False + else: + safety_committed = ensure_committed(setup.repo_dir) + post_span.set_attribute("safety_net.committed", safety_committed) + + build_passed = verify_build(setup.repo_dir) + lint_passed = verify_lint(setup.repo_dir) + pr_url = ensure_pr( + config, setup, build_passed, lint_passed, agent_result=agent_result + ) + post_span.set_attribute("build.passed", build_passed) + post_span.set_attribute("lint.passed", lint_passed) + post_span.set_attribute("pr.url", pr_url or "") + + # Memory write — capture task episode and repo learnings + memory_written = False + effective_memory_id = memory_id or os.environ.get("MEMORY_ID", "") + if effective_memory_id: + memory_written = _write_memory( + config, + setup, + agent_result, + start_time, + build_passed, + pr_url, + effective_memory_id, + ) + + # Metrics + duration = time.time() - start_time + disk_after = get_disk_usage(AGENT_WORKSPACE) + + # Determine overall status: + # - "success" if the agent reported success/end_turn and the build passes + # (or the build was already broken before the agent ran — pre-existing failure) + # - "success" if agent_status is unknown (SDK didn't yield ResultMessage) + # 
but the pipeline produced a PR and the build didn't regress + # - "error" otherwise + # NOTE: lint_passed is intentionally NOT used in the status + # determination — lint failures are advisory and reported in the PR + # body and span attributes but do not affect the task's terminal + # status. Lint regression detection is planned for Iteration 3c. + agent_status = agent_result.status + # Default True = assume build was green before, so a post-agent + # failure IS counted as a regression (conservative). + build_before = setup.build_before + if config.task_type == "pr_review": + build_ok = True # Review task — build status is informational only + if not build_passed: + log("INFO", "pr_review: build failed — informational only, not gating") + else: + build_ok = build_passed or not build_before + if not build_passed and not build_before and config.task_type != "pr_review": + log( + "WARN", + "Post-agent build failed, but build was already failing before " + "agent changes — not counting as regression", + ) + if agent_status in ("success", "end_turn") and build_ok: + overall_status = "success" + elif agent_status == "unknown" and pr_url and build_ok: + log( + "WARN", + "Agent SDK did not yield a ResultMessage, but PR was created " + "and build didn't regress — treating as success", + ) + overall_status = "success" + else: + overall_status = "error" + + # Build TaskResult + usage = agent_result.usage + result = TaskResult( + status=overall_status, + agent_status=agent_status, + pr_url=pr_url, + build_passed=build_passed, + lint_passed=lint_passed, + cost_usd=agent_result.cost_usd, + turns=agent_result.num_turns or agent_result.turns, + duration_s=round(duration, 1), + task_id=config.task_id, + disk_before=format_bytes(disk_before), + disk_after=format_bytes(disk_after), + disk_delta=format_bytes(disk_after - disk_before), + prompt_version=prompt_version or None, + memory_written=memory_written, + error=agent_result.error, + session_id=agent_result.session_id or None, + 
input_tokens=usage.input_tokens if usage else None, + output_tokens=usage.output_tokens if usage else None, + cache_read_input_tokens=usage.cache_read_input_tokens if usage else None, + cache_creation_input_tokens=usage.cache_creation_input_tokens if usage else None, + ) + + result_dict = result.model_dump() + + # Record terminal attributes on the root span for CloudWatch querying + root_span.set_attribute("task.status", result.status) + if result.cost_usd is not None: + root_span.set_attribute("agent.cost_usd", float(result.cost_usd)) + if result.turns: + root_span.set_attribute("agent.turns", int(result.turns)) + root_span.set_attribute("build.passed", result.build_passed) + root_span.set_attribute("lint.passed", result.lint_passed) + root_span.set_attribute("pr.url", result.pr_url or "") + root_span.set_attribute("task.duration_s", result.duration_s) + if usage: + root_span.set_attribute("agent.input_tokens", usage.input_tokens) + root_span.set_attribute("agent.output_tokens", usage.output_tokens) + root_span.set_attribute( + "agent.cache_read_input_tokens", + usage.cache_read_input_tokens, + ) + root_span.set_attribute( + "agent.cache_creation_input_tokens", + usage.cache_creation_input_tokens, + ) + if result.status != "success": + root_span.set_status(StatusCode.ERROR, str(result.error or "task did not succeed")) + + # Emit metrics to CloudWatch Logs and print summary to stdout + print_metrics(result_dict) + + # Persist terminal state to DynamoDB + terminal_status = "COMPLETED" if overall_status == "success" else "FAILED" + task_state.write_terminal(config.task_id, terminal_status, result_dict) + + return result_dict + + except Exception as e: + # Ensure the task is marked FAILED in DynamoDB even if the pipeline + # crashes before reaching the normal terminal-state write. 
+ crash_result = TaskResult(status="error", error=str(e), task_id=config.task_id) + task_state.write_terminal(config.task_id, "FAILED", crash_result.model_dump()) + raise + + +def main(): + config = get_config() + + print("Task configuration loaded.", flush=True) + print() + + if config.dry_run: + print("Dry run mode detected.", flush=True) + # Context hydration for dry run + if config.issue_number: + config.issue = fetch_github_issue( + config.repo_url, config.issue_number, config.github_token + ) + prompt = assemble_prompt(config) + system_prompt = SYSTEM_PROMPT.replace("{repo_url}", config.repo_url) + system_prompt = system_prompt.replace("{task_id}", config.task_id) + system_prompt = system_prompt.replace("{workspace}", AGENT_WORKSPACE) + system_prompt = system_prompt.replace("{branch_name}", f"bgagent/{config.task_id}/dry-run") + system_prompt = system_prompt.replace("{default_branch}", "main") + system_prompt = system_prompt.replace("{max_turns}", str(config.max_turns)) + system_prompt = system_prompt.replace("{setup_notes}", "(dry run — setup not executed)") + system_prompt = system_prompt.replace("{memory_context}", "(dry run — memory not loaded)") + overrides = config.system_prompt_overrides + if overrides: + system_prompt += f"\n\n## Additional instructions\n\n{overrides}" + system_prompt_hash = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()[:12] + prompt_hash = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12] + print("\n--- SYSTEM PROMPT (REDACTED) ---") + print( + f"length={len(system_prompt)} chars sha256={system_prompt_hash} " + "(full text redacted for secure logging compliance)", + flush=True, + ) + print("\n--- USER PROMPT (REDACTED) ---") + print( + f"length={len(prompt)} chars sha256={prompt_hash} " + "(full text redacted for secure logging compliance)", + flush=True, + ) + if os.environ.get("DEBUG_DRY_RUN_PROMPTS") == "1": + print( + "\nDEBUG_DRY_RUN_PROMPTS=1 is set, but full prompt printing is disabled " + "for secure logging
compliance.", + flush=True, + ) + print("\n--- DRY RUN COMPLETE ---") + return + + # Run the full pipeline. run_task() is sync and calls asyncio.run() + # internally, so main() must NOT be async (nested asyncio.run() is illegal). + result = run_task( + repo_url=config.repo_url, + task_description=config.task_description, + issue_number=config.issue_number, + github_token=config.github_token, + anthropic_model=config.anthropic_model, + max_turns=config.max_turns, + max_budget_usd=config.max_budget_usd, + aws_region=config.aws_region, + system_prompt_overrides=config.system_prompt_overrides, + ) + + # Exit with error if agent failed + if result["status"] != "success": + sys.exit(1) + + +if __name__ == "__main__": + main() diff --git a/agent/src/policy.py b/agent/src/policy.py new file mode 100644 index 0000000..bbda292 --- /dev/null +++ b/agent/src/policy.py @@ -0,0 +1,266 @@ +"""Cedar policy engine for tool-call governance. + +Uses cedarpy (in-process Cedar evaluation) to enforce per-task-type +tool restrictions. No network calls, no AWS Verified Permissions. + +Custom Cedar policies (via Blueprint ``security.cedarPolicies``) must +use ``context`` conditions in ``when`` clauses — not ``resource ==`` +matching — for ``write_file`` and ``execute_bash`` actions. The engine +passes fixed sentinel resource IDs (``Agent::File::"file"``, +``Agent::BashCommand::"command"``) because Cedar entity UIDs cannot +contain the special characters found in file paths and bash commands. +The actual values are available in ``context.file_path`` and +``context.command`` respectively. For ``invoke_tool`` actions, the +resource ID is the real tool name (e.g. ``Agent::Tool::"Write"``), so +``resource ==`` matching works normally there. 
+ +Example — correct custom policy:: + + forbid (principal, action == Agent::Action::"execute_bash", resource) + when { context.command like "*curl*" }; + +Example — WILL NOT WORK (resource is always ``"command"``):: + + forbid (principal, action == Agent::Action::"execute_bash", + resource == Agent::BashCommand::"curl http://evil.com"); +""" + +import time +from dataclasses import dataclass + +from shell import log + +# Baseline: allow all. Specific forbid rules override (deny-list approach). +_DEFAULT_POLICIES = """\ +// Catch-all permit (deny-list model) +permit (principal, action, resource); + +// pr_review: forbid Write and Edit tools +forbid ( + principal == Agent::TaskAgent::"pr_review", + action == Agent::Action::"invoke_tool", + resource == Agent::Tool::"Write" +); +forbid ( + principal == Agent::TaskAgent::"pr_review", + action == Agent::Action::"invoke_tool", + resource == Agent::Tool::"Edit" +); + +// All agents: forbid writes to .git internals +forbid (principal, action == Agent::Action::"write_file", resource) +when { context.file_path like ".git/*" }; +forbid (principal, action == Agent::Action::"write_file", resource) +when { context.file_path like "*/.git/*" }; + +// All agents: forbid destructive bash commands +forbid (principal, action == Agent::Action::"execute_bash", resource) +when { context.command like "*rm -rf /*" }; +forbid (principal, action == Agent::Action::"execute_bash", resource) +when { context.command like "*git push --force*" }; +forbid (principal, action == Agent::Action::"execute_bash", resource) +when { context.command like "*git push -f *" }; +forbid (principal, action == Agent::Action::"execute_bash", resource) +when { context.command like "*git push -f" }; +""" + + +@dataclass(frozen=True) +class PolicyDecision: + """Result of a Cedar policy evaluation.""" + + allowed: bool + reason: str + duration_ms: float = 0 + + +class PolicyEngine: + """Evaluate tool-use requests against Cedar policies.""" + + def __init__( + self, + 
task_type: str, + repo: str, + extra_policies: list[str] | None = None, + ) -> None: + self._task_type = task_type + self._repo = repo + self._disabled = False + + # Import cedarpy at init time so failures are caught early + try: + import cedarpy + + self._cedarpy = cedarpy + except ImportError: + log("ERROR", "cedarpy not available — policy engine disabled (fail-closed)") + self._cedarpy = None + self._disabled = True + self._policies = _DEFAULT_POLICIES + return + + # Validate task_type + from models import TaskType + + try: + TaskType(task_type) + except ValueError: + log("WARN", f"Unknown task_type '{task_type}' — using default deny-list policies") + + # Build combined policies + self._policies = _DEFAULT_POLICIES + if extra_policies: + combined = _DEFAULT_POLICIES + "\n" + "\n".join(extra_policies) + # Validate combined policies with a test authorization + try: + test_request = { + "principal": f'Agent::TaskAgent::"{task_type}"', + "action": 'Agent::Action::"invoke_tool"', + "resource": 'Agent::Tool::"Read"', + "context": {"task_type": task_type, "repo": repo}, + } + test_entities = [ + { + "uid": {"type": "Agent::TaskAgent", "id": task_type}, + "attrs": {}, + "parents": [], + }, + { + "uid": {"type": "Agent::Tool", "id": "Read"}, + "attrs": {}, + "parents": [], + }, + ] + cedarpy.is_authorized(test_request, combined, test_entities) + self._policies = combined + except Exception as e: + log( + "WARN", + f"Extra Cedar policies failed validation " + f"({type(e).__name__}: {e}) — using defaults only", + ) + + @property + def task_type(self) -> str: + return self._task_type + + def _evaluate( + self, + action: str, + resource_type: str, + resource_id: str, + context: dict, + ) -> tuple[bool, str]: + """Run a single Cedar authorization check. + + Returns (allowed, reason). Fails closed on NoDecision. + + ``resource_id`` must be a simple identifier safe for Cedar entity + UID parsing (no quotes, newlines, or special chars). 
Callers that + evaluate user-supplied values (bash commands, file paths) should + pass a fixed sentinel and put the real value in ``context`` where + the policies match against it. + """ + cedarpy = self._cedarpy + if cedarpy is None: + return False, "policy engine unavailable" + request = { + "principal": f'Agent::TaskAgent::"{self._task_type}"', + "action": f'Agent::Action::"{action}"', + "resource": f'{resource_type}::"{resource_id}"', + "context": context, + } + entities = [ + { + "uid": {"type": "Agent::TaskAgent", "id": self._task_type}, + "attrs": {}, + "parents": [], + }, + { + "uid": {"type": resource_type, "id": resource_id}, + "attrs": {}, + "parents": [], + }, + ] + result = cedarpy.is_authorized(request, self._policies, entities) + + if result.decision == cedarpy.Decision.NoDecision: + return False, "fail-closed: NoDecision (no valid policies loaded)" + + return result.allowed, "" + + def evaluate_tool_use(self, tool_name: str, tool_input: dict) -> PolicyDecision: + """Evaluate whether a tool call is permitted. + + Returns PolicyDecision with allowed=True/False and reason. + Fails closed on errors and NoDecision. + """ + start = time.monotonic() + + if self._disabled or self._cedarpy is None: + elapsed = (time.monotonic() - start) * 1000 + return PolicyDecision( + allowed=False, + reason="policy engine unavailable", + duration_ms=elapsed, + ) + + try: + base_context = {"task_type": self._task_type, "repo": self._repo} + + # Base evaluation: is this tool allowed? + allowed, deny_reason = self._evaluate( + "invoke_tool", + "Agent::Tool", + tool_name, + base_context, + ) + if not allowed: + elapsed = (time.monotonic() - start) * 1000 + reason = ( + deny_reason + or f"Cedar policy denied {tool_name} for task_type={self._task_type}" + ) + return PolicyDecision(allowed=False, reason=reason, duration_ms=elapsed) + + # Write/Edit: check file path against write_file policies. + # Sentinel resource_id avoids Cedar UID parsing issues (see _evaluate docstring). 
+ if tool_name in ("Write", "Edit"): + file_path = tool_input.get("file_path", "") + if file_path: + allowed, deny_reason = self._evaluate( + "write_file", + "Agent::File", + "file", + {**base_context, "file_path": file_path}, + ) + if not allowed: + elapsed = (time.monotonic() - start) * 1000 + reason = deny_reason or f"Cedar policy denied write to {file_path}" + return PolicyDecision(allowed=False, reason=reason, duration_ms=elapsed) + + # Bash: check command against execute_bash policies. + # Sentinel resource_id avoids Cedar UID parsing issues (see _evaluate docstring). + if tool_name == "Bash": + command = tool_input.get("command", "") + if command: + allowed, deny_reason = self._evaluate( + "execute_bash", + "Agent::BashCommand", + "command", + {**base_context, "command": command}, + ) + if not allowed: + elapsed = (time.monotonic() - start) * 1000 + reason = deny_reason or "Cedar policy denied bash command" + return PolicyDecision(allowed=False, reason=reason, duration_ms=elapsed) + + elapsed = (time.monotonic() - start) * 1000 + return PolicyDecision(allowed=True, reason="permitted", duration_ms=elapsed) + + except Exception as e: + elapsed = (time.monotonic() - start) * 1000 + log("WARN", f"Cedar evaluation error (fail-closed): {type(e).__name__}: {e}") + return PolicyDecision( + allowed=False, reason=f"fail-closed: {type(e).__name__}", duration_ms=elapsed + ) diff --git a/agent/src/post_hooks.py b/agent/src/post_hooks.py new file mode 100644 index 0000000..e66ae9b --- /dev/null +++ b/agent/src/post_hooks.py @@ -0,0 +1,371 @@ +"""Post-agent hooks: build/lint verification, commit, push, PR creation.""" + +from __future__ import annotations + +import re +import subprocess +from typing import TYPE_CHECKING + +from shell import log, run_cmd + +if TYPE_CHECKING: + from models import AgentResult, RepoSetup, TaskConfig + + +def verify_build(repo_dir: str) -> bool: + """Run mise run build after agent completion to verify the build.""" + log("POST", "Running 
post-agent build verification (mise run build)...") + try: + result = run_cmd( + ["mise", "run", "build"], + label="mise-run-build-post", + cwd=repo_dir, + check=False, + ) + except subprocess.TimeoutExpired: + log("WARN", "Post-agent build timed out — treating as failed") + return False + if result.returncode != 0: + log("POST", "Post-agent build FAILED") + return False + log("POST", "Post-agent build: OK") + return True + + +def verify_lint(repo_dir: str) -> bool: + """Run mise run lint after agent completion to verify lint passes.""" + log("POST", "Running post-agent lint verification (mise run lint)...") + try: + result = run_cmd( + ["mise", "run", "lint"], + label="mise-run-lint-post", + cwd=repo_dir, + check=False, + ) + except subprocess.TimeoutExpired: + log("WARN", "Post-agent lint timed out — treating as failed") + return False + if result.returncode != 0: + log("POST", "Post-agent lint FAILED") + return False + log("POST", "Post-agent lint: OK") + return True + + +def ensure_committed(repo_dir: str) -> bool: + """Safety net: commit any uncommitted tracked changes before finalization. + + This catches work the agent wrote but forgot to commit (e.g. due to turn + limit or timeout). Only stages tracked-but-modified files (git add -u) to + avoid accidentally committing temp files or build artifacts. + + Returns True if a safety-net commit was created, False if nothing to commit + or if git operations fail. 
+ """ + try: + result = subprocess.run( + ["git", "status", "--porcelain"], + cwd=repo_dir, + capture_output=True, + text=True, + timeout=60, + ) + except subprocess.TimeoutExpired: + log("WARN", "git status timed out in safety-net commit") + return False + + if result.returncode != 0: + stderr = result.stderr.strip()[:200] if result.stderr else "" + log("WARN", f"git status failed (exit {result.returncode}): {stderr}") + return False + if not result.stdout.strip(): + return False + + log("POST", "Uncommitted changes detected — creating safety-net commit") + # Stage tracked-but-modified files only (not untracked files) + try: + add_result = subprocess.run( + ["git", "add", "-u"], + cwd=repo_dir, + capture_output=True, + text=True, + timeout=60, + ) + except subprocess.TimeoutExpired: + log("WARN", "git add -u timed out in safety-net commit") + return False + + if add_result.returncode != 0: + stderr = add_result.stderr.strip()[:200] if add_result.stderr else "" + log("WARN", f"git add -u failed (exit {add_result.returncode}): {stderr}") + return False + + # Check if there's anything staged after add -u + staged = subprocess.run( + ["git", "diff", "--cached", "--quiet"], + cwd=repo_dir, + capture_output=True, + timeout=30, + ) + if staged.returncode == 0: + # Nothing staged (changes were only untracked files) — skip + log("POST", "No tracked file changes to commit") + return False + + commit_result = subprocess.run( + ["git", "commit", "-m", "chore(agent): save uncommitted work from session end"], + cwd=repo_dir, + capture_output=True, + text=True, + timeout=60, + ) + if commit_result.returncode == 0: + log("POST", "Safety-net commit created") + return True + log("POST", f"Safety-net commit failed: {commit_result.stderr.strip()[:200]}") + return False + + +def ensure_pushed(repo_dir: str, branch: str) -> bool: + """Push the branch if there are unpushed commits.""" + result = subprocess.run( + ["git", "log", f"origin/{branch}..HEAD", "--oneline"], + cwd=repo_dir, + 
capture_output=True, + text=True, + timeout=60, + ) + # If the remote branch doesn't exist or there are unpushed commits + if result.returncode != 0 or result.stdout.strip(): + log("POST", "Pushing unpushed commits...") + push_result = run_cmd( + ["git", "push", "-u", "origin", branch], + label="push", + cwd=repo_dir, + check=False, + ) + return push_result.returncode == 0 + return True + + +def ensure_pr( + config: TaskConfig, + setup: RepoSetup, + build_passed: bool, + lint_passed: bool, + agent_result: AgentResult | None = None, +) -> str | None: + """Check if a PR exists for the branch; if not, create one. + + For ``new_task``: creates a new PR if needed. + For ``pr_iteration``: pushes commits, then resolves the existing PR URL. + For ``pr_review``: resolves the existing PR URL without pushing (read-only). + + Returns the PR URL, or None if there are no commits beyond the default + branch or PR creation failed. ``build_passed`` and ``lint_passed`` control + the verification status shown in the PR body. 
+ """ + repo_dir = setup.repo_dir + branch = setup.branch + default_branch = setup.default_branch + + # PR iteration/review: skip PR creation — just resolve existing PR URL + from config import PR_TASK_TYPES + + if config.task_type in PR_TASK_TYPES: + if config.task_type == "pr_iteration": + if not ensure_pushed(repo_dir, branch): + log("WARN", "Failed to push commits before resolving PR URL") + else: + log("POST", "pr_review task — skipping push (read-only)") + log("POST", f"{config.task_type} — returning existing PR URL") + result = subprocess.run( + [ + "gh", + "pr", + "view", + branch, + "--repo", + config.repo_url, + "--json", + "url", + "-q", + ".url", + ], + cwd=repo_dir, + capture_output=True, + text=True, + timeout=60, + ) + if result.returncode == 0 and result.stdout.strip(): + pr_url = result.stdout.strip() + log("POST", f"Existing PR: {pr_url}") + return pr_url + stderr_msg = result.stderr.strip() if result.stderr else "(no stderr)" + log("WARN", f"Could not resolve existing PR URL (rc={result.returncode}): {stderr_msg}") + return None + + # Check if the agent already created a PR for this branch + log("POST", "Checking for existing PR...") + result = subprocess.run( + [ + "gh", + "pr", + "view", + branch, + "--repo", + config.repo_url, + "--json", + "url", + "-q", + ".url", + ], + cwd=repo_dir, + capture_output=True, + text=True, + timeout=60, + ) + if result.returncode == 0 and result.stdout.strip(): + pr_url = result.stdout.strip() + log("POST", f"PR already exists: {pr_url}") + return pr_url + + # Check if there are any commits on this branch beyond the default branch + diff_result = subprocess.run( + ["git", "log", f"origin/{default_branch}..HEAD", "--oneline"], + cwd=repo_dir, + capture_output=True, + text=True, + timeout=60, + ) + if diff_result.returncode != 0 or not diff_result.stdout.strip(): + log("POST", "No commits to create PR from — skipping PR creation") + return None + + # Ensure all commits are pushed + ensure_pushed(repo_dir, branch) 
+ + # Collect commit messages for the PR body + log_result = subprocess.run( + ["git", "log", f"origin/{default_branch}..HEAD", "--pretty=format:%s%n%b---"], + cwd=repo_dir, + capture_output=True, + text=True, + timeout=60, + ) + commits = log_result.stdout.strip() if log_result.returncode == 0 else "" + + # Derive PR title from first commit message + first_commit = subprocess.run( + ["git", "log", f"origin/{default_branch}..HEAD", "--pretty=format:%s", "--reverse"], + cwd=repo_dir, + capture_output=True, + text=True, + timeout=60, + ) + pr_title = ( + first_commit.stdout.strip().split("\n")[0] + if first_commit.stdout.strip() + else f"chore: bgagent/{config.task_id}" + ) + + # Build PR body + task_source = "" + if config.issue_number: + task_source = f"Resolves #{config.issue_number}\n\n" + elif config.task_description: + task_source = f"**Task:** {config.task_description}\n\n" + + build_status = "PASS" if build_passed else "FAIL" + lint_status = "PASS" if lint_passed else "FAIL" + + cost_line = "" + if agent_result and agent_result.cost_usd is not None: + cost_line = f"- Agent cost: **${agent_result.cost_usd:.4f}**\n" + + pr_body = ( + f"## Summary\n\n" + f"{task_source}" + f"### Commits\n\n" + f"```\n{commits}\n```\n\n" + f"## Verification\n\n" + f"- `mise run build` (post-agent): **{build_status}**\n" + f"- `mise run lint` (post-agent): **{lint_status}**\n" + f"{cost_line}\n" + f"---\n\n" + f"By submitting this pull request, I confirm that you can use, modify, copy, " + f"and redistribute this contribution, under the terms of the [project license](https://github.com/aws-samples/sample-autonomous-cloud-coding-agents/blob/main/LICENSE)." 
+ ) + + log("POST", f"Creating PR: {pr_title}") + pr_result = run_cmd( + [ + "gh", + "pr", + "create", + "--repo", + config.repo_url, + "--head", + branch, + "--base", + default_branch, + "--title", + pr_title, + "--body", + pr_body, + ], + label="create-pr", + cwd=repo_dir, + check=False, + ) + if pr_result.returncode == 0: + pr_url = pr_result.stdout.strip() + log("POST", f"PR created: {pr_url}") + return pr_url + else: + log("POST", "Failed to create PR") + return None + + +def _extract_agent_notes(repo_dir: str, branch: str, config: TaskConfig) -> str | None: + """Extract the "## Agent notes" section from the PR body. + + Checks the existing PR body via `gh pr view`. Returns the text content + of the "## Agent notes" section, or None if not found. + """ + try: + result = subprocess.run( + [ + "gh", + "pr", + "view", + branch, + "--repo", + config.repo_url, + "--json", + "body", + "-q", + ".body", + ], + cwd=repo_dir, + capture_output=True, + text=True, + timeout=30, + ) + if result.returncode != 0 or not result.stdout.strip(): + return None + + body = result.stdout.strip() + # Find "## Agent notes" section + match = re.search( + r"##\s*Agent\s*notes\s*\n(.*?)(?=\n##\s|\Z)", + body, + re.DOTALL | re.IGNORECASE, + ) + if match: + notes = match.group(1).strip() + return notes if notes else None + return None + except Exception as e: + log("WARN", f"Failed to extract agent notes from PR body: {type(e).__name__}: {e}") + return None diff --git a/agent/src/prompt_builder.py b/agent/src/prompt_builder.py new file mode 100644 index 0000000..e523512 --- /dev/null +++ b/agent/src/prompt_builder.py @@ -0,0 +1,111 @@ +"""System prompt construction and project config discovery.""" + +from __future__ import annotations + +import glob +import os +from typing import TYPE_CHECKING + +from config import AGENT_WORKSPACE +from prompts import get_system_prompt +from shell import log +from system_prompt import SYSTEM_PROMPT + +if TYPE_CHECKING: + from models import HydratedContext, 
RepoSetup, TaskConfig + + +def build_system_prompt( + config: TaskConfig, + setup: RepoSetup, + hydrated_context: HydratedContext | None, + overrides: str, +) -> str: + """Assemble the system prompt with task-specific values and memory context.""" + task_type = config.task_type + try: + system_prompt = get_system_prompt(task_type) + except ValueError: + log("ERROR", f"Unknown task_type {task_type!r} — falling back to default system prompt") + system_prompt = SYSTEM_PROMPT + system_prompt = system_prompt.replace("{repo_url}", config.repo_url) + system_prompt = system_prompt.replace("{task_id}", config.task_id) + system_prompt = system_prompt.replace("{workspace}", AGENT_WORKSPACE) + system_prompt = system_prompt.replace("{branch_name}", setup.branch) + system_prompt = system_prompt.replace("{default_branch}", setup.default_branch) + system_prompt = system_prompt.replace("{max_turns}", str(config.max_turns)) + setup_notes = ( + "\n".join(f"- {n}" for n in setup.notes) + if setup.notes + else "All setup steps completed successfully." 
+ ) + system_prompt = system_prompt.replace("{setup_notes}", setup_notes) + + # Inject memory context from orchestrator hydration + memory_context_text = "(No previous knowledge available for this repository.)" + if hydrated_context and hydrated_context.memory_context: + mc = hydrated_context.memory_context + mc_parts: list[str] = [] + if mc.repo_knowledge: + mc_parts.append("**Repository knowledge:**") + for item in mc.repo_knowledge: + mc_parts.append(f"- {item}") + if mc.past_episodes: + mc_parts.append("\n**Past task episodes:**") + for item in mc.past_episodes: + mc_parts.append(f"- {item}") + if mc_parts: + memory_context_text = "\n".join(mc_parts) + system_prompt = system_prompt.replace("{memory_context}", memory_context_text) + + # Substitute PR-specific placeholders + pr_number_val = config.pr_number + if pr_number_val: + system_prompt = system_prompt.replace("{pr_number}", str(pr_number_val)) + elif "{pr_number}" in system_prompt: + log("WARN", "System prompt contains {pr_number} placeholder but no pr_number in config") + system_prompt = system_prompt.replace("{pr_number}", "(unknown)") + + # Append Blueprint system_prompt_overrides after all placeholder + # substitutions (avoids double-substitution if overrides contain + # template placeholders like {repo_url}). + if overrides: + system_prompt += f"\n\n## Additional instructions\n\n{overrides}" + n = len(overrides) + log("TASK", f"Applied system prompt overrides ({n} chars)") + + return system_prompt + + +def discover_project_config(repo_dir: str) -> dict[str, list[str]]: + """Scan the cloned repo for project-level configuration files. + + Returns a dict mapping config categories to lists of file paths found. 
+ """ + project_config: dict[str, list[str]] = {} + try: + # CLAUDE.md instructions + for md in ["CLAUDE.md", os.path.join(".claude", "CLAUDE.md")]: + if os.path.isfile(os.path.join(repo_dir, md)): + project_config.setdefault("instructions", []).append(md) + # .claude/rules/*.md + rules_dir = os.path.join(repo_dir, ".claude", "rules") + if os.path.isdir(rules_dir): + for p in glob.glob(os.path.join(rules_dir, "*.md")): + project_config.setdefault("rules", []).append(os.path.relpath(p, repo_dir)) + # .claude/settings.json + settings = os.path.join(repo_dir, ".claude", "settings.json") + if os.path.isfile(settings): + project_config["settings"] = [".claude/settings.json"] + # .claude/agents/*.md + agents_dir = os.path.join(repo_dir, ".claude", "agents") + if os.path.isdir(agents_dir): + for p in glob.glob(os.path.join(agents_dir, "*.md")): + project_config.setdefault("agents", []).append(os.path.relpath(p, repo_dir)) + # .mcp.json + mcp = os.path.join(repo_dir, ".mcp.json") + if os.path.isfile(mcp): + project_config["mcp_servers"] = [".mcp.json"] + except OSError as e: + log("WARN", f"Error scanning project config: {e}") + return project_config diff --git a/agent/prompts/__init__.py b/agent/src/prompts/__init__.py similarity index 100% rename from agent/prompts/__init__.py rename to agent/src/prompts/__init__.py diff --git a/agent/prompts/base.py b/agent/src/prompts/base.py similarity index 100% rename from agent/prompts/base.py rename to agent/src/prompts/base.py diff --git a/agent/prompts/new_task.py b/agent/src/prompts/new_task.py similarity index 100% rename from agent/prompts/new_task.py rename to agent/src/prompts/new_task.py diff --git a/agent/prompts/pr_iteration.py b/agent/src/prompts/pr_iteration.py similarity index 100% rename from agent/prompts/pr_iteration.py rename to agent/src/prompts/pr_iteration.py diff --git a/agent/prompts/pr_review.py b/agent/src/prompts/pr_review.py similarity index 100% rename from agent/prompts/pr_review.py rename to 
agent/src/prompts/pr_review.py diff --git a/agent/src/repo.py b/agent/src/repo.py new file mode 100644 index 0000000..d5b4513 --- /dev/null +++ b/agent/src/repo.py @@ -0,0 +1,216 @@ +"""Repository setup: clone, branch, mise install, initial build.""" + +import os +import subprocess + +from config import AGENT_WORKSPACE, PR_TASK_TYPES +from models import RepoSetup, TaskConfig +from shell import log, run_cmd, slugify + + +def setup_repo(config: TaskConfig) -> RepoSetup: + """Clone repo, create branch, configure git auth, run mise install. + + Returns a RepoSetup with repo_dir, branch, notes, build_before, + lint_before, and default_branch. + """ + repo_dir = f"{AGENT_WORKSPACE}/{config.task_id}" + notes: list[str] = [] + + if config.task_type in PR_TASK_TYPES and config.branch_name: + branch = config.branch_name + else: + # Derive branch slug from issue title or task description + title = "" + if config.issue: + title = config.issue.title + if not title: + title = config.task_description + slug = slugify(title) + branch = f"bgagent/{config.task_id}/{slug}" + + # Mark the repo directory as safe for git. On persistent session storage + # the mount may be owned by a different UID than the container user, + # triggering git's "dubious ownership" check on clone/resume. + run_cmd( + ["git", "config", "--global", "--add", "safe.directory", repo_dir], + label="safe-directory", + ) + + # Clone + log("SETUP", f"Cloning {config.repo_url}...") + run_cmd( + ["gh", "repo", "clone", config.repo_url, repo_dir], + label="clone", + ) + + # Configure remote URL with embedded token so git push works without + # credential helpers or extra auth setup inside the agent. 
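As a standalone illustration of the branch derivation in `setup_repo` above, here is a condensed copy of `shell.slugify` applied to a made-up issue title and task id (real values come from `TaskConfig`):

```python
import re

def slugify(text: str, max_len: int = 40) -> str:
    # Condensed copy of shell.slugify, for illustration only
    text = text.lower().strip()
    text = re.sub(r"[^a-z0-9\s-]", "", text)   # drop punctuation
    text = re.sub(r"[\s-]+", "-", text)        # collapse whitespace/dashes
    text = text.strip("-")
    if len(text) > max_len:
        text = text[:max_len].rstrip("-")
    return text or "task"

# Hypothetical inputs
task_id = "task-123"
title = "Fix: Broken CI on main!"
branch = f"bgagent/{task_id}/{slugify(title)}"
print(branch)  # bgagent/task-123/fix-broken-ci-on-main
```

Empty or all-punctuation titles fall back to the literal slug `task`, so a branch name is always produced.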
+ token = config.github_token + run_cmd( + [ + "git", + "remote", + "set-url", + "origin", + f"https://x-access-token:{token}@github.com/{config.repo_url}.git", + ], + label="set-remote-url", + cwd=repo_dir, + ) + + # Branch setup + if config.task_type in PR_TASK_TYPES and config.branch_name: + log("SETUP", f"Checking out existing PR branch: {branch}") + run_cmd( + ["git", "fetch", "origin", branch], + label="fetch-pr-branch", + cwd=repo_dir, + ) + run_cmd( + ["git", "checkout", "-b", branch, f"origin/{branch}"], + label="checkout-pr-branch", + cwd=repo_dir, + ) + else: + log("SETUP", f"Creating branch: {branch}") + run_cmd(["git", "checkout", "-b", branch], label="create-branch", cwd=repo_dir) + + # Trust mise config files in the cloned repo (required before mise install) + run_cmd( + ["mise", "trust", repo_dir], + label="mise-trust", + cwd=repo_dir, + check=False, + ) + + # mise install (deterministic — not left to the LLM) + log("SETUP", "Running mise install...") + result = run_cmd( + ["mise", "install"], + label="mise-install", + cwd=repo_dir, + check=False, + ) + if result.returncode != 0: + note = f"mise install failed (exit {result.returncode})" + notes.append(note) + else: + notes.append("mise install: OK") + + # Initial build (record whether the project builds before agent changes) + log("SETUP", "Running initial build (mise run build)...") + result = run_cmd( + ["mise", "run", "build"], + label="mise-run-build-pre", + cwd=repo_dir, + check=False, + ) + if result.returncode != 0: + note = "Initial build (mise run build) FAILED before agent changes" + notes.append(note) + build_before = False + else: + notes.append("Initial build (mise run build): OK") + build_before = True + + # Initial lint baseline (record whether lint passes before agent changes) + log("SETUP", "Running initial lint (mise run lint)...") + result = run_cmd( + ["mise", "run", "lint"], + label="mise-run-lint-pre", + cwd=repo_dir, + check=False, + ) + if result.returncode != 0: + note = 
"Initial lint (mise run lint) FAILED before agent changes" + notes.append(note) + lint_before = False + else: + notes.append("Initial lint (mise run lint): OK") + lint_before = True + + # Detect default branch + # For PR tasks (pr_iteration, pr_review): use base_branch from orchestrator if available + if config.task_type in PR_TASK_TYPES and config.base_branch: + default_branch = config.base_branch + else: + default_branch = detect_default_branch(config.repo_url, repo_dir) + + # Install prepare-commit-msg hook for code attribution + _install_commit_hook(repo_dir) + + return RepoSetup( + repo_dir=repo_dir, + branch=branch, + notes=notes, + build_before=build_before, + lint_before=lint_before, + default_branch=default_branch, + ) + + +def _install_commit_hook(repo_dir: str) -> None: + """Install the prepare-commit-msg git hook for Task-Id/Prompt-Version trailers.""" + try: + hooks_dir = os.path.join(repo_dir, ".git", "hooks") + os.makedirs(hooks_dir, exist_ok=True) + + # prepare-commit-msg.sh is at the agent root (/app/ in container, parent of src/) + hook_src = os.path.join(os.path.dirname(os.path.dirname(__file__)), "prepare-commit-msg.sh") + hook_dst = os.path.join(hooks_dir, "prepare-commit-msg") + + if not os.path.isfile(hook_src): + log("ERROR", f"Hook not found at {hook_src}") + return + + import shutil + import stat + + shutil.copy2(hook_src, hook_dst) + current = os.stat(hook_dst).st_mode + exec_bits = stat.S_IXUSR | stat.S_IXGRP + os.chmod(hook_dst, current | exec_bits) # nosemgrep + log("SETUP", "Installed prepare-commit-msg hook") + except Exception as e: + log("WARN", f"Commit hook install failed: {type(e).__name__}: {e}") + + +def detect_default_branch(repo_url: str, repo_dir: str) -> str: + """Detect the repository's default branch via gh CLI. + + Falls back to 'main' if detection fails (timeout, auth error, etc.). 
+ """ + try: + result = subprocess.run( + [ + "gh", + "repo", + "view", + repo_url, + "--json", + "defaultBranchRef", + "-q", + ".defaultBranchRef.name", + ], + cwd=repo_dir, + capture_output=True, + text=True, + timeout=30, + ) + except subprocess.TimeoutExpired: + log("WARN", "Default branch detection timed out — defaulting to 'main'") + return "main" + + if result.returncode == 0 and result.stdout.strip(): + branch = result.stdout.strip() + log("SETUP", f"Detected default branch: {branch}") + return branch + + stderr = result.stderr.strip()[:200] if result.stderr else "(no stderr)" + log( + "WARN", + f"Could not detect default branch (exit {result.returncode}): " + f"{stderr} — defaulting to 'main'", + ) + return "main" diff --git a/agent/src/runner.py b/agent/src/runner.py new file mode 100644 index 0000000..71e1a55 --- /dev/null +++ b/agent/src/runner.py @@ -0,0 +1,402 @@ +"""Agent invocation: environment setup and Claude Agent SDK execution.""" + +from __future__ import annotations + +import os +import subprocess +from typing import Any +from urllib.parse import quote + +from config import AGENT_WORKSPACE +from models import AgentResult, TaskConfig, TokenUsage +from shell import log, truncate +from telemetry import _TrajectoryWriter + + +def _format_tool_result(block) -> tuple[str, str]: + """Extract status label and content string from a ToolResultBlock.""" + status = "ERROR" if block.is_error else "ok" + content = block.content if isinstance(block.content, str) else str(block.content) + return status, content + + +def _parse_token_usage(raw_usage: Any) -> TokenUsage: + """Normalize a raw usage value (dict or dataclass) into a TokenUsage model.""" + fields = ( + "input_tokens", + "output_tokens", + "cache_read_input_tokens", + "cache_creation_input_tokens", + ) + if isinstance(raw_usage, dict): + values = {f: raw_usage.get(f, 0) for f in fields} + else: + values = {f: getattr(raw_usage, f, 0) for f in fields} + return TokenUsage(**values) + + +def 
_setup_agent_env(config: TaskConfig) -> tuple[str | None, str | None]: + """Configure process environment for the Claude Code CLI subprocess. + + Sets Bedrock credentials, strips OTEL auto-instrumentation vars, and + optionally enables CLI-native OTel telemetry. + + Returns (otlp_endpoint, otlp_protocol) for logging. + """ + os.environ["CLAUDE_CODE_USE_BEDROCK"] = "1" + os.environ["AWS_REGION"] = config.aws_region + os.environ["ANTHROPIC_MODEL"] = config.anthropic_model + os.environ["GITHUB_TOKEN"] = config.github_token + os.environ["GH_TOKEN"] = config.github_token + # DO NOT set ANTHROPIC_LOG — any logging level causes the CLI to write to + # stderr, which fills the OS pipe buffer (64 KB) and deadlocks the + # single-threaded Node.js CLI process (blocked stderr write prevents stdout + # writes, while the SDK is waiting on stdout). The stderr callback in + # ClaudeAgentOptions cannot drain fast enough to prevent this. + os.environ.pop("ANTHROPIC_LOG", None) + os.environ["ANTHROPIC_DEFAULT_HAIKU_MODEL"] = "anthropic.claude-haiku-4-5-20251001-v1:0" + + # Save OTLP endpoint/protocol configured by ADOT auto-instrumentation + # before stripping, so we can re-use it for Claude Code CLI telemetry. + otlp_endpoint = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT") + otlp_protocol = os.environ.get("OTEL_EXPORTER_OTLP_PROTOCOL") + + # Strip OTEL auto-instrumentation vars from os.environ so target-repo + # child processes (mise run build, pytest, semgrep, etc.) don't attempt + # Python OTEL auto-instrumentation using the agent's packages. + # The agent's own TracerProvider is already configured at startup — it does + # not re-read env vars, so removing them is safe. 
+ for key in [k for k in os.environ if k.startswith("OTEL_")]: + del os.environ[key] + pythonpath = os.environ.get("PYTHONPATH", "") + if pythonpath: + cleaned = os.pathsep.join( + p for p in pythonpath.split(os.pathsep) if "opentelemetry" not in p + ) + if cleaned: + os.environ["PYTHONPATH"] = cleaned + else: + os.environ.pop("PYTHONPATH", None) + + # Enable Claude Code CLI's native OTel telemetry if an OTLP endpoint is + # available. The CLI exports events (tool results, API requests/errors, + # tool decisions) as OTLP logs with per-prompt granularity — beyond the + # aggregate ResultMessage at session end. + # + # Gated on ENABLE_CLI_TELEMETRY env var (opt-in) because the ADOT sidecar + # in AgentCore Runtime is only confirmed to forward traces (configured via + # CfnRuntimeLogsMixin.TRACES.toXRay() in CDK). Whether the sidecar also + # forwards OTLP logs is unconfirmed. Set ENABLE_CLI_TELEMETRY=1 in the + # runtime environment to enable and verify logs appear in CloudWatch. + # + # Configuration choices based on AWS documentation: + # - OTEL_METRICS_EXPORTER=none: All AWS ADOT examples disable metrics + # export. CloudWatch does not ingest OTLP metrics from the sidecar. + # - OTEL_TRACES_EXPORTER=none: Explicitly disabled. The agent's own + # custom spans (task.pipeline, task.agent_execution, etc.) already + # provide trace-level coverage via the Python ADOT auto-instrumentation. + # - OTEL_LOGS_EXPORTER=otlp: SDK events (tool_result, api_request, etc.) + # are the primary telemetry of interest and are exported as OTLP logs. + # - OTEL_EXPORTER_OTLP_LOGS_HEADERS: Includes the application log group + # name so that, if the exporter sends directly to CloudWatch's OTLP + # endpoint, logs land in the correct log group. Ignored by the sidecar + # if it has its own routing config. + # - Protocol defaults to http/protobuf (AWS-recommended for OTLP). 
+ # + # NOTE: These env vars are set on os.environ (process-global) because the + # Claude Agent SDK spawns the CLI subprocess from the process environment. + # This is safe for single-task-per-container deployments (AgentCore Runtime + # allocates one session per container). If concurrent tasks ever share a + # process, this must be revisited (pass env via subprocess instead). + if os.environ.get("ENABLE_CLI_TELEMETRY") == "1": + if not otlp_endpoint: + log("WARN", "OTEL_EXPORTER_OTLP_ENDPOINT not set by ADOT") + # Default to http/protobuf on port 4318 (AWS-recommended protocol). + otlp_endpoint = "http://localhost:4318" + if not otlp_protocol: + otlp_protocol = "http/protobuf" + + os.environ["CLAUDE_CODE_ENABLE_TELEMETRY"] = "1" + os.environ["OTEL_METRICS_EXPORTER"] = "none" + os.environ["OTEL_TRACES_EXPORTER"] = "none" + os.environ["OTEL_LOGS_EXPORTER"] = "otlp" + os.environ["OTEL_EXPORTER_OTLP_PROTOCOL"] = otlp_protocol + os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = otlp_endpoint + os.environ["OTEL_LOG_TOOL_DETAILS"] = "1" + + # Route OTLP logs to the application log group. This header is used + # when sending directly to CloudWatch's OTLP logs endpoint + # (https://logs.{region}.amazonaws.com/v1/logs). If the exporter + # sends to the ADOT sidecar instead, the sidecar may ignore this. + log_group = os.environ.get("LOG_GROUP_NAME", "") + if log_group: + os.environ["OTEL_EXPORTER_OTLP_LOGS_HEADERS"] = f"x-aws-log-group={log_group}" + + # Tag all SDK telemetry with task metadata for correlation in CloudWatch. + # Values are percent-encoded per the OTEL_RESOURCE_ATTRIBUTES spec to + # handle any special characters (commas, equals, spaces) in config values. 
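In isolation, the percent-encoding the comment above refers to can be sketched like this (the task id is a made-up value containing reserved characters that would otherwise break `key=value,key=value` parsing):

```python
from urllib.parse import quote

# Hypothetical task id containing '=' and ','
task_id = "task=42,beta"
attr = f"task.id={quote(task_id, safe='')}"
print(attr)  # task.id=task%3D42%2Cbeta
```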
+ os.environ["OTEL_RESOURCE_ATTRIBUTES"] = ( + f"task.id={quote(config.task_id or 'unknown', safe='')}," + f"repo.url={quote(config.repo_url or 'unknown', safe='')}," + f"agent.model={quote(config.anthropic_model or 'unknown', safe='')}" + ) + log( + "AGENT", + f"Claude Code telemetry enabled: endpoint={otlp_endpoint} " + f"protocol={otlp_protocol} logs_log_group={log_group or '(not set)'}", + ) + else: + log("AGENT", "Claude Code CLI telemetry disabled (set ENABLE_CLI_TELEMETRY=1 to enable)") + + return otlp_endpoint, otlp_protocol + + +async def run_agent( + prompt: str, system_prompt: str, config: TaskConfig, cwd: str = AGENT_WORKSPACE +) -> AgentResult: + """Invoke the Claude Agent SDK and stream output.""" + from claude_agent_sdk import ( + AssistantMessage, + ClaudeAgentOptions, + ClaudeSDKClient, + ResultMessage, + SystemMessage, + TextBlock, + ThinkingBlock, + ToolResultBlock, + ToolUseBlock, + UserMessage, + ) + + _setup_agent_env(config) + + stderr_line_count = 0 + + def _on_stderr(line: str) -> None: + nonlocal stderr_line_count + stderr_line_count += 1 + log("CLI", line.rstrip()) + + # Log SDK and CLI versions for diagnosing protocol mismatches + import claude_agent_sdk as _sdk + + sdk_version = getattr(_sdk, "__version__", "unknown") + log("AGENT", f"claude-agent-sdk version: {sdk_version}") + cli_path = subprocess.run(["which", "claude"], capture_output=True, text=True, timeout=5) + if cli_path.returncode == 0: + cli_ver = subprocess.run( + ["claude", "--version"], capture_output=True, text=True, timeout=10 + ) + log("AGENT", f"claude CLI: {cli_path.stdout.strip()} version={cli_ver.stdout.strip()}") + else: + log("WARN", "claude CLI not found on PATH") + + # All tools are allowed at the SDK level; Cedar policy engine enforces + # per-task-type restrictions via PreToolUse hooks. 
+ allowed_tools = ["Bash", "Read", "Write", "Edit", "Glob", "Grep", "WebFetch"] + + # Create trajectory writer and Cedar policy engine with hook matchers + trajectory = _TrajectoryWriter(config.task_id or "unknown") + + from hooks import build_hook_matchers + from policy import PolicyEngine + + task_type = config.task_type + repo_url = config.repo_url + cedar_policies = config.cedar_policies + policy_engine = PolicyEngine( + task_type=task_type, + repo=repo_url, + extra_policies=cedar_policies if cedar_policies else None, + ) + log( + "AGENT", + f"Cedar policy engine initialized for task_type={task_type}" + + (f" with {len(cedar_policies)} extra policies" if cedar_policies else ""), + ) + + hooks = build_hook_matchers(engine=policy_engine, trajectory=trajectory) + + options = ClaudeAgentOptions( + model=config.anthropic_model, + system_prompt=system_prompt, + allowed_tools=allowed_tools, + permission_mode="bypassPermissions", + cwd=cwd, + max_turns=config.max_turns, + setting_sources=["project"], + hooks=hooks, + max_budget_usd=config.max_budget_usd, + stderr=_on_stderr, + ) + + result = AgentResult() + message_counts = {"system": 0, "assistant": 0, "result": 0, "other": 0} + + # Use ClaudeSDKClient (connect/query/receive_response) instead of the + # standalone query() function. This matches the official AWS sample: + # https://github.com/aws-samples/sample-deploy-ClaudeAgentSDK-based-agents-to-AgentCore-Runtime + client = ClaudeSDKClient(options=options) + log("AGENT", "Connecting to Claude Code CLI subprocess...") + await client.connect() + log("AGENT", "Connected. Sending prompt...") + await client.query(prompt=prompt) + log("AGENT", "Prompt sent. 
Receiving messages...") + try: + async for message in client.receive_response(): + if isinstance(message, SystemMessage): + message_counts["system"] += 1 + log("SYS", f"{message.subtype}: {message.data}") + if message.subtype == "init" and isinstance(message.data, dict): + cli_ver = message.data.get("claude_code_version", "?") + log("SYS", f"CLI reports version: {cli_ver}") + log("AGENT", "Waiting for next message from CLI...") + + elif isinstance(message, AssistantMessage): + message_counts["assistant"] += 1 + result.turns += 1 + log("TURN", f"#{result.turns} (model: {message.model})") + + # Per-turn accumulators for trajectory + turn_thinking = "" + turn_text = "" + turn_tool_calls: list[dict] = [] + turn_tool_results: list[dict] = [] + + for block in message.content: + if isinstance(block, ThinkingBlock): + log("THINK", truncate(block.thinking, 200)) + turn_thinking += block.thinking + "\n" + elif isinstance(block, TextBlock): + print(block.text, flush=True) + turn_text += block.text + "\n" + elif isinstance(block, ToolUseBlock): + tool_input = block.input + if block.name == "Bash": + cmd = tool_input.get("command", "") + log("TOOL", f"Bash: {truncate(cmd, 300)}") + elif block.name in ("Read", "Glob", "Grep"): + log("TOOL", f"{block.name}: {truncate(str(tool_input))}") + elif block.name in ("Write", "Edit"): + path = tool_input.get("file_path", "") + log("TOOL", f"{block.name}: {path}") + else: + log("TOOL", f"{block.name}: {truncate(str(tool_input))}") + turn_tool_calls.append({"name": block.name, "input": tool_input}) + elif isinstance(block, ToolResultBlock): + status, content = _format_tool_result(block) + log("RESULT", f"[{status}] {truncate(content)}") + turn_tool_results.append( + { + "tool_use_id": getattr(block, "tool_use_id", ""), + "is_error": block.is_error, + "content": content, + } + ) + + # Write trajectory event for this turn + trajectory.write_turn( + turn=result.turns, + model=message.model, + thinking=turn_thinking.strip(), + 
text=turn_text.strip(), + tool_calls=turn_tool_calls, + tool_results=turn_tool_results, + ) + + elif isinstance(message, ResultMessage): + message_counts["result"] += 1 + result.status = message.subtype + result.cost_usd = getattr(message, "total_cost_usd", None) + result.num_turns = getattr(message, "num_turns", 0) + result.duration_ms = getattr(message, "duration_ms", 0) + result.duration_api_ms = getattr(message, "duration_api_ms", 0) + result.session_id = getattr(message, "session_id", "") or "" + + # Capture token usage from ResultMessage + raw_usage = getattr(message, "usage", None) + usage: TokenUsage | None = None + if raw_usage is not None: + # Handle both object (dataclass) and dict forms + usage = _parse_token_usage(raw_usage) + result.usage = usage + if all(v == 0 for v in usage.model_dump().values()): + log( + "WARN", + f"All token usage values are zero — usage object " + f"type={type(raw_usage).__name__}", + ) + else: + log( + "USAGE", + f"input={usage.input_tokens} " + f"output={usage.output_tokens} " + f"cache_read={usage.cache_read_input_tokens} " + f"cache_create={usage.cache_creation_input_tokens}", + ) + + log( + "DONE", + f"status={message.subtype} turns={message.num_turns} " + f"cost=${message.total_cost_usd or 0:.4f} " + f"duration={message.duration_ms / 1000:.1f}s", + ) + if message.is_error and message.result: + log("ERROR", message.result) + + # Write trajectory result summary + trajectory.write_result( + subtype=message.subtype, + num_turns=getattr(message, "num_turns", 0), + cost_usd=getattr(message, "total_cost_usd", None), + duration_ms=getattr(message, "duration_ms", 0), + duration_api_ms=getattr(message, "duration_api_ms", 0), + session_id=getattr(message, "session_id", ""), + usage=usage, + ) + + elif isinstance(message, UserMessage): + message_counts["other"] += 1 + # UserMessage carries tool results fed back to the model. + # For hook-denied calls, content is a ToolResultBlock with + # is_error=True and the denial reason. 
+ if isinstance(message.content, list): + for block in message.content: + if isinstance(block, ToolResultBlock): + status, content = _format_tool_result(block) + log("RESULT", f"[{status}] {truncate(content)}") + elif isinstance(message.content, str): + log("USER", truncate(message.content)) + + else: + message_counts["other"] += 1 + log( + "MSG", + f"Unrecognized message type: {type(message).__name__}: " + f"{truncate(str(message), 300)}", + ) + + except Exception as e: + log("ERROR", f"Exception during receive_response(): {type(e).__name__}: {e}") + if result.status == "unknown": + result.status = "error" + result.error = f"receive_response() failed: {e}" + + log("AGENT", f"Generator finished. Messages received: {message_counts}") + log("AGENT", f"CLI stderr lines received: {stderr_line_count}") + if message_counts["assistant"] == 0 and message_counts["system"] > 0: + log( + "WARN", + "Got init SystemMessage but zero AssistantMessages. The CLI subprocess " + "started but produced no turns. Likely causes: (1) Bedrock API auth/connectivity " + "failure, (2) SDK↔CLI protocol mismatch, (3) CLI crash after init. " + "Check [CLI] stderr lines above for errors.", + ) + if message_counts["result"] == 0: + log( + "WARN", + "No ResultMessage received from the agent SDK — " + "agent metrics (cost, turns) will be unavailable", + ) + + return result diff --git a/agent/server.py b/agent/src/server.py similarity index 96% rename from agent/server.py rename to agent/src/server.py index a65e934..d501c2e 100644 --- a/agent/server.py +++ b/agent/src/server.py @@ -20,8 +20,9 @@ from fastapi import FastAPI, Request from pydantic import BaseModel -from entrypoint import resolve_github_token, run_task +from config import resolve_github_token from observability import set_session_id +from pipeline import run_task # Log the active event loop policy at import time so operators can diagnose # uvloop-related subprocess conflicts (see: uvloop SIGCHLD bug). 
@@ -106,6 +107,7 @@ def _run_task_background( task_type: str = "new_task", branch_name: str = "", pr_number: str = "", + cedar_policies: list[str] | None = None, ) -> None: """Run the agent task in a background thread.""" try: @@ -131,6 +133,7 @@ def _run_task_background( task_type=task_type, branch_name=branch_name, pr_number=pr_number, + cedar_policies=cedar_policies, ) except Exception as e: print(f"Background task {task_id} failed: {type(e).__name__}: {e}") @@ -170,6 +173,7 @@ def invoke_agent(request: Request, body: InvocationRequest): task_type = inp.get("task_type", "new_task") branch_name = inp.get("branch_name", "") pr_number = str(inp.get("pr_number", "")) + cedar_policies = inp.get("cedar_policies") or [] # Extract AgentCore session ID from request headers for OTEL correlation session_id = request.headers.get("x-amzn-bedrock-agentcore-runtime-session-id", "") @@ -194,6 +198,7 @@ def invoke_agent(request: Request, body: InvocationRequest): task_type, branch_name, pr_number, + cedar_policies, ), ) # Track the thread for graceful shutdown BEFORE starting it so diff --git a/agent/src/shell.py b/agent/src/shell.py new file mode 100644 index 0000000..f5e99ee --- /dev/null +++ b/agent/src/shell.py @@ -0,0 +1,104 @@ +"""Shell utilities: logging, command execution, and text helpers.""" + +import os +import re +import subprocess +import time + + +def log(prefix: str, text: str): + """Print a timestamped, redacted log line.""" + ts = time.strftime("%H:%M:%S") + print(f"[{ts}] {prefix} {redact_secrets(text)}", flush=True) + + +def truncate(text: str, max_len: int = 200) -> str: + """Truncate text for log display.""" + if not text: + return "" + text = text.replace("\n", " ").strip() + if len(text) > max_len: + return text[:max_len] + "..." 
+ return text + + +def slugify(text: str, max_len: int = 40) -> str: + """Convert text to a URL-safe slug for branch names.""" + text = text.lower().strip() + text = re.sub(r"[^a-z0-9\s-]", "", text) + text = re.sub(r"[\s-]+", "-", text) + text = text.strip("-") + if len(text) > max_len: + text = text[:max_len].rstrip("-") + return text or "task" + + +def redact_secrets(text: str) -> str: + """Redact tokens and secrets from log output.""" + # GitHub and generic token-like values. + text = re.sub(r"(ghp_|github_pat_|gho_|ghs_|ghr_)[A-Za-z0-9_]+", r"\1***", text) + text = re.sub(r"(x-access-token:)[^\s@]+", r"\1***", text) + text = re.sub(r"(authorization:\s*(?:bearer|token)\s+)[^\s]+", r"\1***", text, flags=re.I) + text = re.sub( + r"([?&](?:token|access_token|api_key|apikey|password)=)[^&\s]+", + r"\1***", + text, + flags=re.I, + ) + text = re.sub(r"(gh[opusr]_[A-Za-z0-9_]+)", "***", text) + return text + + +def _clean_env() -> dict[str, str]: + """Return a copy of os.environ with OTEL auto-instrumentation vars removed. + + The ``opentelemetry-instrument`` wrapper injects PYTHONPATH and OTEL_* + env vars that would cause child Python processes (e.g. mise run build → + semgrep in the target repo) to attempt OTEL auto-instrumentation and fail + because the target repo's Python environment doesn't have the OTEL + packages installed. Stripping these vars isolates target-repo commands + from the agent's own instrumentation. + """ + env = {k: v for k, v in os.environ.items() if not k.startswith("OTEL_")} + # Strip only OTEL-injected PYTHONPATH components (the sitecustomize.py + # directory), preserving any entries the target repo's toolchain may need. 
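The stripping described above can be exercised on a plain dict (sample values are invented; `shell._clean_env` applies the same logic against `os.environ`):

```python
import os

def strip_otel(env: dict[str, str]) -> dict[str, str]:
    # Mirrors shell._clean_env: drop OTEL_* vars and any PYTHONPATH
    # entries injected by opentelemetry auto-instrumentation
    env = {k: v for k, v in env.items() if not k.startswith("OTEL_")}
    pythonpath = env.get("PYTHONPATH", "")
    cleaned = os.pathsep.join(
        p for p in pythonpath.split(os.pathsep) if p and "opentelemetry" not in p
    )
    if cleaned:
        env["PYTHONPATH"] = cleaned
    else:
        env.pop("PYTHONPATH", None)
    return env

sample = {
    "OTEL_TRACES_EXPORTER": "otlp",
    "PYTHONPATH": os.pathsep.join(["/app/src", "/otel/opentelemetry/sitecustomize"]),
    "PATH": "/usr/bin",
}
print(strip_otel(sample))
```

Only the `opentelemetry`-bearing path component is removed; other `PYTHONPATH` entries the target repo's toolchain may need are preserved.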
+ pythonpath = env.get("PYTHONPATH", "") + if pythonpath: + cleaned = os.pathsep.join( + p for p in pythonpath.split(os.pathsep) if "opentelemetry" not in p + ) + if cleaned: + env["PYTHONPATH"] = cleaned + else: + env.pop("PYTHONPATH", None) + return env + + +def run_cmd( + cmd: list[str], + label: str, + cwd: str | None = None, + timeout: int = 600, + check: bool = True, +) -> subprocess.CompletedProcess: + """Run a command with logging.""" + log("CMD", redact_secrets(f"{label}: {' '.join(cmd)}")) + result = subprocess.run( + cmd, + cwd=cwd, + capture_output=True, + text=True, + timeout=timeout, + env=_clean_env(), + ) + if result.returncode != 0: + log("CMD", f"{label}: FAILED (exit {result.returncode})") + if result.stderr: + for line in result.stderr.strip().splitlines()[:20]: + log("CMD", f" {line}") + if check: + stderr_snippet = redact_secrets(result.stderr.strip()[:500]) if result.stderr else "" + raise RuntimeError(f"{label} failed (exit {result.returncode}): {stderr_snippet}") + else: + log("CMD", f"{label}: OK") + return result diff --git a/agent/system_prompt.py b/agent/src/system_prompt.py similarity index 100% rename from agent/system_prompt.py rename to agent/src/system_prompt.py diff --git a/agent/task_state.py b/agent/src/task_state.py similarity index 100% rename from agent/task_state.py rename to agent/src/task_state.py diff --git a/agent/src/telemetry.py b/agent/src/telemetry.py new file mode 100644 index 0000000..094ea38 --- /dev/null +++ b/agent/src/telemetry.py @@ -0,0 +1,322 @@ +"""Telemetry: metrics, trajectory writer, and disk usage.""" + +from __future__ import annotations + +import json +import os +import subprocess +import time +from typing import TYPE_CHECKING + +from config import AGENT_WORKSPACE + +if TYPE_CHECKING: + from models import TokenUsage + + +def get_disk_usage(path: str = AGENT_WORKSPACE) -> float: + """Return disk usage in bytes for the given path.""" + try: + result = subprocess.run( + ["du", "-sb", path], + 
capture_output=True, + text=True, + timeout=30, + ) + return int(result.stdout.split()[0]) if result.returncode == 0 else 0 + except (subprocess.TimeoutExpired, ValueError, IndexError, OSError): + return 0 + + +def format_bytes(size: float) -> str: + """Human-readable byte size.""" + for unit in ("B", "KB", "MB", "GB"): + if abs(size) < 1024: + return f"{size:.1f} {unit}" + size /= 1024 + return f"{size:.1f} TB" + + +def _emit_metrics_to_cloudwatch(json_payload: dict) -> None: + """Write the METRICS_REPORT JSON event directly to CloudWatch Logs. + + Writes the log event directly to the APPLICATION_LOGS log group using the + CloudWatch Logs API, ensuring metrics are reliably available for dashboard + Logs Insights queries regardless of container stdout routing. + """ + log_group = os.environ.get("LOG_GROUP_NAME") + if not log_group: + return + + try: + import contextlib + + import boto3 + + region = os.environ.get("AWS_REGION") or os.environ.get("AWS_DEFAULT_REGION") + client = boto3.client("logs", region_name=region) + + task_id = json_payload.get("task_id", "unknown") + log_stream = f"metrics/{task_id}" + + # Create the log stream (ignore if it already exists) + with contextlib.suppress(client.exceptions.ResourceAlreadyExistsException): + client.create_log_stream(logGroupName=log_group, logStreamName=log_stream) + + client.put_log_events( + logGroupName=log_group, + logStreamName=log_stream, + logEvents=[ + { + "timestamp": int(time.time() * 1000), + "message": json.dumps(json_payload), + } + ], + ) + except ImportError: + print("[metrics] boto3 not available — skipping CloudWatch write", flush=True) + except Exception as e: + exc_type = type(e).__name__ + print(f"[metrics] CloudWatch Logs write failed (best-effort): {exc_type}: {e}", flush=True) + if "Credential" in exc_type or "Endpoint" in exc_type or "AccessDenied" in str(e): + print( + "[metrics] WARNING: This may indicate a deployment misconfiguration " + "(IAM role, VPC endpoint, or credentials). 
Dashboard data will be missing.",
+                flush=True,
+            )
+
+
+class _TrajectoryWriter:
+    """Write per-turn trajectory events to CloudWatch Logs.
+
+    Follows the same pattern as ``_emit_metrics_to_cloudwatch()``: lazy boto3
+    import, best-effort error handling, ``contextlib.suppress`` for idempotent
+    stream creation. Log stream: ``trajectory/{task_id}`` (parallel to the
+    existing ``metrics/{task_id}`` stream).
+
+    Events are progressively truncated to stay under the CloudWatch Logs 262 KB
+    event-size limit: large fields (thinking, tool result content) are truncated
+    first, then a hard byte-level safety-net truncation is applied.
+    """
+
+    _CW_MAX_EVENT_BYTES = 262_144  # CloudWatch limit per event
+
+    _MAX_FAILURES = 3
+
+    def __init__(self, task_id: str) -> None:
+        self._task_id = task_id
+        self._log_group = os.environ.get("LOG_GROUP_NAME")
+        self._client = None
+        self._disabled = False
+        self._failure_count = 0
+
+    def _ensure_client(self):
+        """Lazily create the CloudWatch Logs client and log stream."""
+        if self._client is not None:
+            return
+        if not self._log_group:
+            self._disabled = True
+            return
+
+        import contextlib
+
+        import boto3
+
+        region = os.environ.get("AWS_REGION") or os.environ.get("AWS_DEFAULT_REGION")
+        self._client = boto3.client("logs", region_name=region)
+
+        log_stream = f"trajectory/{self._task_id}"
+        with contextlib.suppress(self._client.exceptions.ResourceAlreadyExistsException):
+            self._client.create_log_stream(logGroupName=self._log_group, logStreamName=log_stream)
+
+    def _put_event(self, payload: dict) -> None:
+        """Serialize *payload* to JSON, truncate if needed, and write."""
+        if not self._log_group or self._disabled:
+            return
+        try:
+            self._ensure_client()
+            if self._client is None:
+                self._disabled = True
+                return
+
+            message = json.dumps(payload, default=str)
+
+            # Safety-net: hard byte-level truncation
+            encoded = message.encode("utf-8")
+            if len(encoded) > self._CW_MAX_EVENT_BYTES:
+                print(
+                    f"[trajectory] WARNING: Event exceeded CW limit even after field "
+                    f"truncation ({len(encoded)} bytes). Hard-truncating — event JSON "
+                    f"will be invalid.",
+                    flush=True,
+                )
+                message = (
+                    encoded[: self._CW_MAX_EVENT_BYTES - 100].decode("utf-8", errors="ignore")
+                    + " [TRUNCATED]"
+                )
+
+            self._client.put_log_events(
+                logGroupName=self._log_group,
+                logStreamName=f"trajectory/{self._task_id}",
+                logEvents=[
+                    {
+                        "timestamp": int(time.time() * 1000),
+                        "message": message,
+                    }
+                ],
+            )
+        except ImportError:
+            self._disabled = True
+            print("[trajectory] boto3 not available — skipping", flush=True)
+        except Exception as e:
+            self._failure_count += 1
+            exc_type = type(e).__name__
+            if self._failure_count >= self._MAX_FAILURES:
+                self._disabled = True
+                print(
+                    f"[trajectory] CloudWatch write failed {self._failure_count} times, "
+                    f"disabling trajectory: {exc_type}: {e}",
+                    flush=True,
+                )
+            else:
+                print(
+                    f"[trajectory] CloudWatch write failed ({self._failure_count}/"
+                    f"{self._MAX_FAILURES}): {exc_type}: {e}",
+                    flush=True,
+                )
+            if "Credential" in exc_type or "Endpoint" in exc_type or "AccessDenied" in str(e):
+                print(
+                    "[trajectory] WARNING: This may indicate a deployment misconfiguration "
+                    "(IAM role, VPC endpoint, or credentials). Trajectory data will be missing.",
+                    flush=True,
+                )
+
+    @staticmethod
+    def _truncate_field(value: str, max_len: int = 4000) -> str:
+        """Truncate a large string field for trajectory events."""
+        if not value or len(value) <= max_len:
+            return value
+        return value[:max_len] + f"... [truncated, {len(value)} chars total]"
+
+    def write_turn(
+        self,
+        turn: int,
+        model: str,
+        thinking: str,
+        text: str,
+        tool_calls: list[dict],
+        tool_results: list[dict],
+    ) -> None:
+        """Write a TRAJECTORY_TURN event for one agent turn."""
+        # Truncate large fields to stay under CloudWatch event limit
+        truncated_thinking = self._truncate_field(thinking)
+        truncated_text = self._truncate_field(text)
+        truncated_results = []
+        for tr in tool_results:
+            entry = dict(tr)
+            if isinstance(entry.get("content"), str):
+                entry["content"] = self._truncate_field(entry["content"], 2000)
+            truncated_results.append(entry)
+
+        self._put_event(
+            {
+                "event": "TRAJECTORY_TURN",
+                "task_id": self._task_id,
+                "turn": turn,
+                "model": model,
+                "thinking": truncated_thinking,
+                "text": truncated_text,
+                "tool_calls": tool_calls,
+                "tool_results": truncated_results,
+            }
+        )
+
+    def write_result(
+        self,
+        subtype: str,
+        num_turns: int,
+        cost_usd: float | None,
+        duration_ms: int,
+        duration_api_ms: int,
+        session_id: str,
+        usage: TokenUsage | None,
+    ) -> None:
+        """Write a TRAJECTORY_RESULT summary event at session end."""
+        self._put_event(
+            {
+                "event": "TRAJECTORY_RESULT",
+                "task_id": self._task_id,
+                "subtype": subtype,
+                "num_turns": num_turns,
+                "cost_usd": cost_usd,
+                "duration_ms": duration_ms,
+                "duration_api_ms": duration_api_ms,
+                "session_id": session_id,
+                "usage": usage.model_dump() if usage else None,
+            }
+        )
+
+    def write_policy_decision(
+        self, tool_name: str, allowed: bool, reason: str, duration_ms: float
+    ) -> None:
+        """Write a POLICY_DECISION event for a tool-use policy evaluation."""
+        self._put_event(
+            {
+                "event": "POLICY_DECISION",
+                "task_id": self._task_id,
+                "tool_name": tool_name,
+                "allowed": allowed,
+                "reason": reason,
+                "duration_ms": duration_ms,
+            }
+        )
+
+
+# Values under these keys may contain tool stderr, paths, or incidental secrets.
+_METRICS_REDACT_KEYS = frozenset({"error"})
+
+
+def _metrics_payload_for_logging(metrics: dict) -> dict:
+    """Build metrics dict for stdout / CloudWatch JSON (redacts sensitive fields)."""
+    out: dict = {}
+    for k, v in metrics.items():
+        if k in _METRICS_REDACT_KEYS:
+            out[k] = None if v is None else "[redacted]"
+            continue
+        if isinstance(v, (bool, int, float, type(None))):
+            out[k] = v
+        else:
+            out[k] = str(v)
+    return out
+
+
+def print_metrics(metrics: dict):
+    """Emit a METRICS_REPORT event and print a human-readable summary.
+
+    Writes the JSON event directly to CloudWatch Logs via
+    ``_emit_metrics_to_cloudwatch()`` for dashboard querying, and prints a
+    human-readable banner to stdout for operator console inspection.
+
+    Native types (int, float, bool, None) are preserved in the JSON payload.
+    None values become JSON ``null`` and are excluded by ``ispresent()``
+    filters in the dashboard queries. Raw ``error`` text is never logged verbatim.
+    """
+    safe = _metrics_payload_for_logging(metrics)
+    json_payload: dict = {"event": "METRICS_REPORT", **safe}
+
+    # Write directly to CloudWatch Logs (reliable — doesn't depend on stdout capture)
+    _emit_metrics_to_cloudwatch(json_payload)
+
+    # Also print to stdout for operator console visibility
+    print(json.dumps(json_payload), flush=True)
+
+    # Human-readable banner only; do not print keys/values from ``metrics``
+    # (that would taint logging sinks).
+    print("\n" + "=" * 60)
+    print("METRICS REPORT")
+    print("=" * 60)
+    print(
+        "  See structured JSON on the previous line — table omitted so metric "
+        "keys are not echoed to log sinks.",
+        flush=True,
+    )
+    print("=" * 60)
diff --git a/agent/tests/test_config.py b/agent/tests/test_config.py
new file mode 100644
index 0000000..d9e32c8
--- /dev/null
+++ b/agent/tests/test_config.py
@@ -0,0 +1,87 @@
+"""Unit tests for config.py — build_config and constants."""
+
+import pytest
+
+from config import PR_TASK_TYPES, build_config
+from models import TaskConfig
+
+
+class TestAgentWorkspaceConstant:
+    def test_default_value(self, monkeypatch):
+        monkeypatch.delenv("AGENT_WORKSPACE", raising=False)
+        import importlib
+
+        import config
+
+        importlib.reload(config)
+        assert config.AGENT_WORKSPACE == "/workspace"
+
+
+class TestPRTaskTypes:
+    def test_contains_pr_iteration(self):
+        assert "pr_iteration" in PR_TASK_TYPES
+
+    def test_contains_pr_review(self):
+        assert "pr_review" in PR_TASK_TYPES
+
+    def test_does_not_contain_new_task(self):
+        assert "new_task" not in PR_TASK_TYPES
+
+
+class TestTaskTypeValidation:
+    def test_invalid_task_type_raises(self):
+        with pytest.raises(ValueError, match="Invalid task_type"):
+            build_config(
+                repo_url="owner/repo",
+                task_description="fix bug",
+                github_token="ghp_test123",
+                aws_region="us-east-1",
+                task_type="unknown_type",
+            )
+
+    def test_valid_task_types_accepted(self):
+        for tt in ("new_task", "pr_iteration", "pr_review"):
+            desc = "" if tt in ("pr_iteration", "pr_review") else "fix bug"
+            pr = "42" if tt in ("pr_iteration", "pr_review") else ""
+            config = build_config(
+                repo_url="owner/repo",
+                task_description=desc,
+                github_token="ghp_test123",
+                aws_region="us-east-1",
+                task_type=tt,
+                pr_number=pr,
+            )
+            assert config.task_type == tt
+
+
+class TestBuildConfig:
+    def test_valid_config_returns_task_config(self):
+        config = build_config(
+            repo_url="owner/repo",
+            task_description="fix bug",
+            github_token="ghp_test123",
+            aws_region="us-east-1",
+            task_id="test-id",
+        )
+        assert isinstance(config, TaskConfig)
+        assert config.repo_url == "owner/repo"
+        assert config.task_id == "test-id"
+
+    def test_missing_repo_raises(self):
+        with pytest.raises(ValueError, match="repo_url"):
+            build_config(
+                repo_url="",
+                task_description="fix bug",
+                github_token="ghp_test",
+                aws_region="us-east-1",
+            )
+
+    def test_auto_generated_task_id(self):
+        config = build_config(
+            repo_url="owner/repo",
+            task_description="do something",
+            github_token="ghp_test",
+            aws_region="us-east-1",
+        )
+        assert config.task_id
+        assert len(config.task_id) == 12
diff --git a/agent/tests/test_entrypoint.py b/agent/tests/test_entrypoint.py
index 8f527dd..296da65 100644
--- a/agent/tests/test_entrypoint.py
+++ b/agent/tests/test_entrypoint.py
@@ -15,6 +15,14 @@
     slugify,
     truncate,
 )
+from models import (
+    GitHubIssue,
+    HydratedContext,
+    IssueComment,
+    MemoryContext,
+    RepoSetup,
+    TaskConfig,
+)
 
 # ---------------------------------------------------------------------------
 # AGENT_WORKSPACE
@@ -150,9 +158,9 @@ def test_valid_config(self):
             aws_region="us-east-1",
             task_id="test-id",
         )
-        assert config["repo_url"] == "owner/repo"
-        assert config["task_id"] == "test-id"
-        assert config["max_turns"] == 10  # default
+        assert config.repo_url == "owner/repo"
+        assert config.task_id == "test-id"
+        assert config.max_turns == 10  # default
 
     def test_missing_repo_url(self):
         with pytest.raises(ValueError, match="repo_url"):
@@ -187,8 +195,8 @@ def test_auto_generated_task_id(self):
             github_token="ghp_test",
             aws_region="us-east-1",
         )
-        assert config["task_id"]  # non-empty
-        assert len(config["task_id"]) == 12
+        assert config.task_id  # non-empty
+        assert len(config.task_id) == 12
 
     def test_env_fallback(self, monkeypatch):
         monkeypatch.setenv("AWS_REGION", "eu-west-1")
@@ -197,7 +205,7 @@
             task_description="do something",
             github_token="ghp_test",
         )
-        assert config["aws_region"] == "eu-west-1"
+        assert config.aws_region == "eu-west-1"
 
 
 # ---------------------------------------------------------------------------
@@ -207,29 +215,33 @@
 class TestAssemblePrompt:
     def test_with_description(self):
-        config = {
-            "task_id": "abc123",
-            "repo_url": "owner/repo",
-            "task_description": "Fix the login bug",
-            "issue_number": "",
-        }
+        config = TaskConfig(
+            task_id="abc123",
+            repo_url="owner/repo",
+            task_description="Fix the login bug",
+            issue_number="",
+            github_token="ghp_test",
+            aws_region="us-east-1",
+        )
         result = assemble_prompt(config)
         assert "abc123" in result
         assert "owner/repo" in result
         assert "Fix the login bug" in result
 
     def test_with_issue(self):
-        config = {
-            "task_id": "abc123",
-            "repo_url": "owner/repo",
-            "task_description": "",
-            "issue": {
-                "number": 42,
-                "title": "Login broken",
-                "body": "Users cannot log in",
-                "comments": [{"author": "alice", "body": "Confirmed!"}],
-            },
-        }
+        config = TaskConfig(
+            task_id="abc123",
+            repo_url="owner/repo",
+            task_description="",
+            github_token="ghp_test",
+            aws_region="us-east-1",
+            issue=GitHubIssue(
+                number=42,
+                title="Login broken",
+                body="Users cannot log in",
+                comments=[IssueComment(author="alice", body="Confirmed!")],
+            ),
+        )
         result = assemble_prompt(config)
         assert "#42" in result
         assert "Login broken" in result
@@ -298,12 +310,19 @@ def test_finds_mcp(self):
 
 
 class TestBuildSystemPrompt:
     def test_placeholder_substitution(self):
-        config = {"repo_url": "owner/repo", "task_id": "t123", "max_turns": 50}
-        setup = {
-            "branch": "bgagent/t123/fix",
-            "default_branch": "main",
-            "notes": ["Note 1"],
-        }
+        config = TaskConfig(
+            repo_url="owner/repo",
+            task_id="t123",
+            max_turns=50,
+            github_token="ghp_test",
+            aws_region="us-east-1",
+        )
+        setup = RepoSetup(
+            repo_dir="/workspace/t123",
+            branch="bgagent/t123/fix",
+            default_branch="main",
+            notes=["Note 1"],
+        )
         result = _build_system_prompt(config, setup, None, "")
         assert "owner/repo" in result
         assert "t123" in result
@@ -311,21 +330,44 @@
         assert "50" in result
 
     def test_memory_context_injected(self):
-        config = {"repo_url": "o/r", "task_id": "t1", "max_turns": 10}
-        setup = {"branch": "b", "default_branch": "main", "notes": []}
-        hydrated = {
-            "memory_context": {
-                "repo_knowledge": ["Uses TypeScript"],
-                "past_episodes": ["Task t0 completed"],
-            }
-        }
+        config = TaskConfig(
+            repo_url="o/r",
+            task_id="t1",
+            max_turns=10,
+            github_token="ghp_test",
+            aws_region="us-east-1",
+        )
+        setup = RepoSetup(
+            repo_dir="/workspace/t1",
+            branch="b",
+            default_branch="main",
+            notes=[],
+        )
+        hydrated = HydratedContext(
+            user_prompt="test",
+            memory_context=MemoryContext(
+                repo_knowledge=["Uses TypeScript"],
+                past_episodes=["Task t0 completed"],
+            ),
+        )
         result = _build_system_prompt(config, setup, hydrated, "")
         assert "Uses TypeScript" in result
         assert "Task t0 completed" in result
 
     def test_overrides_appended(self):
-        config = {"repo_url": "o/r", "task_id": "t1", "max_turns": 10}
-        setup = {"branch": "b", "default_branch": "main", "notes": []}
+        config = TaskConfig(
+            repo_url="o/r",
+            task_id="t1",
+            max_turns=10,
+            github_token="ghp_test",
+            aws_region="us-east-1",
+        )
+        setup = RepoSetup(
+            repo_dir="/workspace/t1",
+            branch="b",
+            default_branch="main",
+            notes=[],
+        )
         result = _build_system_prompt(config, setup, None, "Always use tabs")
         assert "Always use tabs" in result
         assert "Additional instructions" in result
@@ -345,8 +387,8 @@ def test_pr_iteration_with_pr_number(self):
             task_type="pr_iteration",
             pr_number="42",
         )
-        assert config["task_type"] == "pr_iteration"
-        assert config["pr_number"] == "42"
+        assert config.task_type == "pr_iteration"
+        assert config.pr_number == "42"
 
     def test_pr_iteration_without_pr_number_raises(self):
         with pytest.raises(ValueError, match="pr_number is required"):
@@ -364,7 +406,7 @@ def test_new_task_default(self):
             github_token="ghp_test",
             aws_region="us-east-1",
         )
-        assert config["task_type"] == "new_task"
+        assert config.task_type == "new_task"
 
     def test_pr_review_with_pr_number(self):
         config = build_config(
@@ -374,8 +416,8 @@
             task_type="pr_review",
             pr_number="55",
         )
-        assert config["task_type"] == "pr_review"
-        assert config["pr_number"] == "55"
+        assert config.task_type == "pr_review"
+        assert config.pr_number == "55"
 
     def test_pr_review_without_pr_number_raises(self):
         with pytest.raises(ValueError, match="pr_number is required"):
@@ -394,51 +436,60 @@
 class TestBuildSystemPromptTaskType:
     def test_selects_new_task_prompt(self):
-        config = {
-            "repo_url": "owner/repo",
-            "task_id": "test-123",
-            "max_turns": 100,
-            "task_type": "new_task",
-        }
-        setup = {
-            "branch": "bgagent/test-123/fix",
-            "default_branch": "main",
-            "notes": ["All OK"],
-        }
+        config = TaskConfig(
+            repo_url="owner/repo",
+            task_id="test-123",
+            max_turns=100,
+            task_type="new_task",
+            github_token="ghp_test",
+            aws_region="us-east-1",
+        )
+        setup = RepoSetup(
+            repo_dir="/workspace/test-123",
+            branch="bgagent/test-123/fix",
+            default_branch="main",
+            notes=["All OK"],
+        )
         prompt = _build_system_prompt(config, setup, None, "")
         assert "Create a Pull Request" in prompt
 
     def test_selects_pr_iteration_prompt(self):
-        config = {
-            "repo_url": "owner/repo",
-            "task_id": "test-123",
-            "max_turns": 100,
-            "task_type": "pr_iteration",
-            "pr_number": "42",
-        }
-        setup = {
-            "branch": "feature/fix",
-            "default_branch": "main",
-            "notes": ["All OK"],
-        }
+        config = TaskConfig(
+            repo_url="owner/repo",
+            task_id="test-123",
+            max_turns=100,
+            task_type="pr_iteration",
+            pr_number="42",
+            github_token="ghp_test",
+            aws_region="us-east-1",
+        )
+        setup = RepoSetup(
+            repo_dir="/workspace/test-123",
+            branch="feature/fix",
+            default_branch="main",
+            notes=["All OK"],
+        )
         prompt = _build_system_prompt(config, setup, None, "")
         assert "Post a summary comment on the PR" in prompt
         assert "Reply to each review comment thread" in prompt
         assert "42" in prompt
 
     def test_selects_pr_review_prompt(self):
-        config = {
-            "repo_url": "owner/repo",
-            "task_id": "test-123",
-            "max_turns": 100,
-            "task_type": "pr_review",
-            "pr_number": "55",
-        }
-        setup = {
-            "branch": "feature/review",
-            "default_branch": "main",
-            "notes": ["All OK"],
-        }
+        config = TaskConfig(
+            repo_url="owner/repo",
+            task_id="test-123",
+            max_turns=100,
+            task_type="pr_review",
+            pr_number="55",
+            github_token="ghp_test",
+            aws_region="us-east-1",
+        )
+        setup = RepoSetup(
+            repo_dir="/workspace/test-123",
+            branch="feature/review",
+            default_branch="main",
+            notes=["All OK"],
+        )
         prompt = _build_system_prompt(config, setup, None, "")
         assert "READ-ONLY" in prompt
         assert "must NOT modify" in prompt
diff --git a/agent/tests/test_hooks.py b/agent/tests/test_hooks.py
new file mode 100644
index 0000000..cf51d78
--- /dev/null
+++ b/agent/tests/test_hooks.py
@@ -0,0 +1,147 @@
+"""Unit tests for hooks.py — Cedar policy SDK hook callbacks."""
+
+import asyncio
+
+import pytest
+
+cedarpy = pytest.importorskip("cedarpy")
+
+from hooks import build_hook_matchers, pre_tool_use_hook
+from policy import PolicyEngine
+
+
+def _run(coro):
+    """Helper to run an async coroutine in tests."""
+    return asyncio.run(coro)
+
+
+class TestPreToolUseHook:
+    def test_allows_permitted_tool(self):
+        engine = PolicyEngine(task_type="new_task", repo="owner/repo")
+        hook_input = {
+            "hook_event_name": "PreToolUse",
+            "tool_name": "Read",
+            "tool_input": {"file_path": "src/main.py"},
+            "tool_use_id": "test-123",
+            "session_id": "sess-1",
+            "transcript_path": "/tmp/t",
+            "cwd": "/workspace",
+        }
+        result = _run(pre_tool_use_hook(hook_input, "test-123", {}, engine=engine))
+        assert result["hookSpecificOutput"]["permissionDecision"] == "allow"
+
+    def test_denies_restricted_tool(self):
+        engine = PolicyEngine(task_type="pr_review", repo="owner/repo")
+        hook_input = {
+            "hook_event_name": "PreToolUse",
+            "tool_name": "Write",
+            "tool_input": {"file_path": "src/main.py"},
+            "tool_use_id": "test-456",
+            "session_id": "sess-1",
+            "transcript_path": "/tmp/t",
+            "cwd": "/workspace",
+        }
+        result = _run(pre_tool_use_hook(hook_input, "test-456", {}, engine=engine))
+        assert result["hookSpecificOutput"]["permissionDecision"] == "deny"
+        assert "pr_review" in result["hookSpecificOutput"]["permissionDecisionReason"]
+
+    def test_denies_git_internals_path(self):
+        engine = PolicyEngine(task_type="new_task", repo="owner/repo")
+        hook_input = {
+            "hook_event_name": "PreToolUse",
+            "tool_name": "Write",
+            "tool_input": {"file_path": ".git/config"},
+            "tool_use_id": "test-789",
+            "session_id": "sess-1",
+            "transcript_path": "/tmp/t",
+            "cwd": "/workspace",
+        }
+        result = _run(pre_tool_use_hook(hook_input, "test-789", {}, engine=engine))
+        assert result["hookSpecificOutput"]["permissionDecision"] == "deny"
+
+    def test_denies_destructive_bash(self):
+        engine = PolicyEngine(task_type="new_task", repo="owner/repo")
+        hook_input = {
+            "hook_event_name": "PreToolUse",
+            "tool_name": "Bash",
+            "tool_input": {"command": "rm -rf /"},
+            "tool_use_id": "test-abc",
+            "session_id": "sess-1",
+            "transcript_path": "/tmp/t",
+            "cwd": "/workspace",
+        }
+        result = _run(pre_tool_use_hook(hook_input, "test-abc", {}, engine=engine))
+        assert result["hookSpecificOutput"]["permissionDecision"] == "deny"
+
+    def test_allows_normal_bash(self):
+        engine = PolicyEngine(task_type="new_task", repo="owner/repo")
+        hook_input = {
+            "hook_event_name": "PreToolUse",
+            "tool_name": "Bash",
+            "tool_input": {"command": "npm test"},
+            "tool_use_id": "test-def",
+            "session_id": "sess-1",
+            "transcript_path": "/tmp/t",
+            "cwd": "/workspace",
+        }
+        result = _run(pre_tool_use_hook(hook_input, "test-def", {}, engine=engine))
+        assert result["hookSpecificOutput"]["permissionDecision"] == "allow"
+
+    def test_handles_string_tool_input(self):
+        engine = PolicyEngine(task_type="new_task", repo="owner/repo")
+        hook_input = {
+            "hook_event_name": "PreToolUse",
+            "tool_name": "Read",
+            "tool_input": '{"file_path": "test.py"}',
+            "tool_use_id": "test-ghi",
+            "session_id": "sess-1",
+            "transcript_path": "/tmp/t",
+            "cwd": "/workspace",
+        }
+        result = _run(pre_tool_use_hook(hook_input, "test-ghi", {}, engine=engine))
+        assert result["hookSpecificOutput"]["permissionDecision"] == "allow"
+
+    def test_denies_non_dict_hook_input(self):
+        engine = PolicyEngine(task_type="new_task", repo="owner/repo")
+        result = _run(pre_tool_use_hook("not a dict", "test-x", {}, engine=engine))
+        assert result["hookSpecificOutput"]["permissionDecision"] == "deny"
+        assert "invalid hook input" in result["hookSpecificOutput"]["permissionDecisionReason"]
+
+    def test_denies_unparseable_string_tool_input(self):
+        engine = PolicyEngine(task_type="new_task", repo="owner/repo")
+        hook_input = {
+            "hook_event_name": "PreToolUse",
+            "tool_name": "Read",
+            "tool_input": "not valid json{{{",
+            "tool_use_id": "test-bad",
+            "session_id": "sess-1",
+            "transcript_path": "/tmp/t",
+            "cwd": "/workspace",
+        }
+        result = _run(pre_tool_use_hook(hook_input, "test-bad", {}, engine=engine))
+        assert result["hookSpecificOutput"]["permissionDecision"] == "deny"
+        assert "unparseable tool input" in result["hookSpecificOutput"]["permissionDecisionReason"]
+
+
+class TestBuildHookMatchers:
+    def test_returns_correct_structure(self):
+        engine = PolicyEngine(task_type="new_task", repo="owner/repo")
+        matchers = build_hook_matchers(engine=engine)
+        assert "PreToolUse" in matchers
+        assert "PostToolUse" not in matchers
+        assert len(matchers["PreToolUse"]) == 1
+
+    def test_hook_matchers_have_callbacks(self):
+        engine = PolicyEngine(task_type="new_task", repo="owner/repo")
+        matchers = build_hook_matchers(engine=engine)
+        pre_matcher = matchers["PreToolUse"][0]
+        # HookMatcher has matcher=None (match all) and a hooks list
+        assert pre_matcher.matcher is None
+        assert len(pre_matcher.hooks) == 1
+
+    def test_matchers_with_trajectory(self):
+        engine = PolicyEngine(task_type="new_task", repo="owner/repo")
+        # Pass None for trajectory — should still work
+        matchers = build_hook_matchers(engine=engine, trajectory=None)
+        assert "PreToolUse" in matchers
+        assert "PostToolUse" not in matchers
diff --git a/agent/tests/test_models.py b/agent/tests/test_models.py
new file mode 100644
index 0000000..4a51531
--- /dev/null
+++ b/agent/tests/test_models.py
@@ -0,0 +1,266 @@
+"""Unit tests for models.py — TaskType enum and Pydantic models."""
+
+import pytest
+from pydantic import ValidationError
+
+from models import (
+    AgentResult,
+    GitHubIssue,
+    HydratedContext,
+    IssueComment,
+    MemoryContext,
+    RepoSetup,
+    TaskConfig,
+    TaskResult,
+    TaskType,
+    TokenUsage,
+)
+
+
+class TestTaskType:
+    def test_new_task_value(self):
+        assert TaskType.new_task == "new_task"
+
+    def test_pr_iteration_value(self):
+        assert TaskType.pr_iteration == "pr_iteration"
+
+    def test_pr_review_value(self):
+        assert TaskType.pr_review == "pr_review"
+
+    def test_new_task_is_not_pr_task(self):
+        assert not TaskType.new_task.is_pr_task
+
+    def test_pr_iteration_is_pr_task(self):
+        assert TaskType.pr_iteration.is_pr_task
+
+    def test_pr_review_is_pr_task(self):
+        assert TaskType.pr_review.is_pr_task
+
+    def test_new_task_is_not_read_only(self):
+        assert not TaskType.new_task.is_read_only
+
+    def test_pr_iteration_is_not_read_only(self):
+        assert not TaskType.pr_iteration.is_read_only
+
+    def test_pr_review_is_read_only(self):
+        assert TaskType.pr_review.is_read_only
+
+    def test_str_enum_membership(self):
+        assert TaskType.new_task == "new_task"
+        assert TaskType.pr_review == "pr_review"
+
+
+class TestIssueComment:
+    def test_construction(self):
+        c = IssueComment(author="alice", body="Looks good!")
+        assert c.author == "alice"
+        assert c.body == "Looks good!"
+
+    def test_frozen(self):
+        c = IssueComment(author="alice", body="text")
+        with pytest.raises(ValidationError):
+            c.author = "bob"
+
+    def test_model_dump(self):
+        c = IssueComment(author="alice", body="text")
+        d = c.model_dump()
+        assert d == {"author": "alice", "body": "text"}
+
+
+class TestGitHubIssue:
+    def test_construction_with_defaults(self):
+        issue = GitHubIssue(title="Bug", number=1)
+        assert issue.title == "Bug"
+        assert issue.body == ""
+        assert issue.number == 1
+        assert issue.comments == []
+
+    def test_construction_with_comments(self):
+        issue = GitHubIssue(
+            title="Bug",
+            body="desc",
+            number=42,
+            comments=[IssueComment(author="bob", body="noted")],
+        )
+        assert len(issue.comments) == 1
+        assert issue.comments[0].author == "bob"
+
+    def test_frozen(self):
+        issue = GitHubIssue(title="Bug", number=1)
+        with pytest.raises(ValidationError):
+            issue.title = "Feature"
+
+
+class TestMemoryContext:
+    def test_defaults(self):
+        mc = MemoryContext()
+        assert mc.repo_knowledge == []
+        assert mc.past_episodes == []
+
+    def test_construction(self):
+        mc = MemoryContext(repo_knowledge=["Uses TS"], past_episodes=["Task t0"])
+        assert mc.repo_knowledge == ["Uses TS"]
+        assert mc.past_episodes == ["Task t0"]
+
+    def test_frozen(self):
+        mc = MemoryContext()
+        with pytest.raises(ValidationError):
+            mc.repo_knowledge = ["new"]
+
+
+class TestHydratedContext:
+    def test_construction(self):
+        hc = HydratedContext(user_prompt="Fix the bug")
+        assert hc.user_prompt == "Fix the bug"
+        assert hc.issue is None
+        assert hc.resolved_base_branch is None
+        assert hc.truncated is False
+        assert hc.memory_context is None
+
+    def test_with_nested_models(self):
+        hc = HydratedContext(
+            user_prompt="Fix it",
+            issue=GitHubIssue(title="Bug", number=1),
+            memory_context=MemoryContext(repo_knowledge=["TS"]),
+        )
+        assert hc.issue is not None and hc.issue.title == "Bug"
+        assert hc.memory_context is not None and hc.memory_context.repo_knowledge == ["TS"]
+
+    def test_frozen(self):
+        hc = HydratedContext(user_prompt="test")
+        with pytest.raises(ValidationError):
+            hc.user_prompt = "changed"
+
+    def test_model_validate_from_dict(self):
+        data = {
+            "user_prompt": "Fix bug",
+            "issue": {"title": "Bug", "number": 42, "body": "", "comments": []},
+            "truncated": True,
+        }
+        hc = HydratedContext.model_validate(data)
+        assert hc.user_prompt == "Fix bug"
+        assert hc.issue is not None and hc.issue.number == 42
+        assert hc.truncated is True
+
+
+class TestTaskConfig:
+    def test_required_fields(self):
+        config = TaskConfig(
+            repo_url="owner/repo",
+            github_token="ghp_test",
+            aws_region="us-east-1",
+        )
+        assert config.repo_url == "owner/repo"
+        assert config.task_type == "new_task"
+        assert config.cedar_policies == []
+        assert config.issue is None
+
+    def test_mutable_assignment(self):
+        config = TaskConfig(
+            repo_url="owner/repo",
+            github_token="ghp_test",
+            aws_region="us-east-1",
+        )
+        config.cedar_policies = ["policy1"]
+        assert config.cedar_policies == ["policy1"]
+
+        config.issue = GitHubIssue(title="Bug", number=1)
+        assert config.issue.title == "Bug"
+
+        config.base_branch = "develop"
+        assert config.base_branch == "develop"
+
+    def test_validate_assignment(self):
+        config = TaskConfig(
+            repo_url="owner/repo",
+            github_token="ghp_test",
+            aws_region="us-east-1",
+        )
+        # max_turns should be validated as int
+        config.max_turns = 50
+        assert config.max_turns == 50
+
+
+class TestRepoSetup:
+    def test_construction(self):
+        setup = RepoSetup(repo_dir="/workspace/abc", branch="bgagent/abc/fix")
+        assert setup.repo_dir == "/workspace/abc"
+        assert setup.branch == "bgagent/abc/fix"
+        assert setup.notes == []
+        assert setup.build_before is True
+        assert setup.default_branch == "main"
+
+    def test_frozen(self):
+        setup = RepoSetup(repo_dir="/workspace/abc", branch="b")
+        with pytest.raises(ValidationError):
+            setup.repo_dir = "/other"
+
+    def test_model_dump(self):
+        setup = RepoSetup(
+            repo_dir="/workspace/abc",
+            branch="b",
+            notes=["OK"],
+            build_before=False,
+        )
+        d = setup.model_dump()
+        assert d["repo_dir"] == "/workspace/abc"
+        assert d["build_before"] is False
+        assert d["notes"] == ["OK"]
+
+
+class TestTokenUsage:
+    def test_defaults(self):
+        u = TokenUsage()
+        assert u.input_tokens == 0
+        assert u.output_tokens == 0
+        assert u.cache_read_input_tokens == 0
+        assert u.cache_creation_input_tokens == 0
+
+    def test_construction(self):
+        u = TokenUsage(input_tokens=100, output_tokens=50)
+        assert u.input_tokens == 100
+        assert u.output_tokens == 50
+
+    def test_frozen(self):
+        u = TokenUsage(input_tokens=100)
+        with pytest.raises(ValidationError):
+            u.input_tokens = 200
+
+
+class TestAgentResult:
+    def test_defaults(self):
+        r = AgentResult()
+        assert r.status == "unknown"
+        assert r.turns == 0
+        assert r.cost_usd is None
+        assert r.usage is None
+
+    def test_progressive_mutation(self):
+        r = AgentResult()
+        r.status = "success"
+        r.turns = 5
+        r.cost_usd = 0.05
+        r.usage = TokenUsage(input_tokens=1000)
+        assert r.status == "success"
+        assert r.usage.input_tokens == 1000
+
+
+class TestTaskResult:
+    def test_construction(self):
+        r = TaskResult(status="success", task_id="t1")
+        assert r.status == "success"
+        assert r.task_id == "t1"
+        assert r.pr_url is None
+        assert r.error is None
+
+    def test_model_dump(self):
+        r = TaskResult(
+            status="success",
+            build_passed=True,
+            cost_usd=0.05,
+            task_id="t1",
+        )
+        d = r.model_dump()
+        assert d["status"] == "success"
+        assert d["build_passed"] is True
+        assert d["cost_usd"] == 0.05
diff --git a/agent/tests/test_pipeline.py b/agent/tests/test_pipeline.py
new file mode 100644
index 0000000..882f1a6
--- /dev/null
+++ b/agent/tests/test_pipeline.py
@@ -0,0 +1,139 @@
+"""Unit tests for pipeline.py — cedar_policies injection."""
+
+from unittest.mock import MagicMock, patch
+
+from models import AgentResult, RepoSetup, TaskConfig
+
+
+class TestCedarPoliciesInjection:
+    @patch("pipeline.run_agent")
+    @patch("pipeline.build_system_prompt")
+    @patch("pipeline.discover_project_config")
+    @patch("repo.setup_repo")
+    @patch("pipeline.task_span")
+    @patch("pipeline.task_state")
+    def test_cedar_policies_injected_into_config(
+        self,
+        _mock_task_state,
+        mock_task_span,
+        mock_setup_repo,
+        _mock_discover,
+        _mock_build_prompt,
+        mock_run_agent,
+        monkeypatch,
+    ):
+        """When cedar_policies are passed, they appear in the config."""
+        monkeypatch.setenv("GITHUB_TOKEN", "ghp_test")
+        monkeypatch.setenv("AWS_REGION", "us-east-1")
+
+        mock_setup_repo.return_value = RepoSetup(
+            repo_dir="/workspace/repo",
+            branch="bgagent/test/branch",
+            build_before=True,
+        )
+
+        captured_config: TaskConfig | None = None
+
+        async def fake_run_agent(_prompt, _system_prompt, config, cwd=None):
+            nonlocal captured_config
+            captured_config = config
+            return AgentResult(status="success", turns=1, cost_usd=0.01, num_turns=1)
+
+        mock_run_agent.side_effect = fake_run_agent
+
+        mock_span = MagicMock()
+        mock_span.__enter__ = MagicMock(return_value=mock_span)
+        mock_span.__exit__ = MagicMock(return_value=False)
+        mock_task_span.return_value = mock_span
+
+        with (
+            patch("pipeline.ensure_committed", return_value=False),
+            patch("pipeline.verify_build", return_value=True),
+            patch("pipeline.verify_lint", return_value=True),
+            patch(
+                "pipeline.ensure_pr",
+                return_value="https://github.com/org/repo/pull/1",
+            ),
+            patch("pipeline.get_disk_usage", return_value=0),
+            patch("pipeline.print_metrics"),
+        ):
+            from pipeline import run_task
+
+            policies = [
+                'forbid (principal, action, resource) when { resource == Agent::Tool::"Bash" };'
+            ]
+            run_task(
+                repo_url="owner/repo",
+                task_description="fix bug",
+                github_token="ghp_test",
+                aws_region="us-east-1",
+                task_id="test-id",
+                cedar_policies=policies,
+            )
+
+        assert captured_config is not None
+        assert captured_config.cedar_policies == policies
+
+    @patch("pipeline.run_agent")
+    @patch("pipeline.build_system_prompt")
+    @patch("pipeline.discover_project_config")
+    @patch("repo.setup_repo")
+    @patch("pipeline.task_span")
+    @patch("pipeline.task_state")
+    def test_cedar_policies_absent_when_not_passed(
+        self,
+        _mock_task_state,
+        mock_task_span,
+        mock_setup_repo,
+        _mock_discover,
+        _mock_build_prompt,
+        mock_run_agent,
+        monkeypatch,
+    ):
+        """When cedar_policies are not passed, the default empty list is on config."""
+        monkeypatch.setenv("GITHUB_TOKEN", "ghp_test")
+        monkeypatch.setenv("AWS_REGION", "us-east-1")
+
+        mock_setup_repo.return_value = RepoSetup(
+            repo_dir="/workspace/repo",
+            branch="bgagent/test/branch",
+            build_before=True,
+        )
+
+        captured_config: TaskConfig | None = None
+
+        async def fake_run_agent(_prompt, _system_prompt, config, cwd=None):
+            nonlocal captured_config
+            captured_config = config
+            return AgentResult(status="success", turns=1, cost_usd=0.01, num_turns=1)
+
+        mock_run_agent.side_effect = fake_run_agent
+
+        mock_span = MagicMock()
+        mock_span.__enter__ = MagicMock(return_value=mock_span)
+        mock_span.__exit__ = MagicMock(return_value=False)
+        mock_task_span.return_value = mock_span
+
+        with (
+            patch("pipeline.ensure_committed", return_value=False),
+            patch("pipeline.verify_build", return_value=True),
+            patch("pipeline.verify_lint", return_value=True),
+            patch(
+                "pipeline.ensure_pr",
+                return_value="https://github.com/org/repo/pull/1",
+            ),
+            patch("pipeline.get_disk_usage", return_value=0),
+            patch("pipeline.print_metrics"),
+        ):
+            from pipeline import run_task
+
+            run_task(
+                repo_url="owner/repo",
+                task_description="fix bug",
+                github_token="ghp_test",
+                aws_region="us-east-1",
+                task_id="test-id",
+            )
+
+        assert captured_config is not None
+        assert captured_config.cedar_policies == []
diff --git a/agent/tests/test_policy.py b/agent/tests/test_policy.py
new file mode 100644
index 0000000..e4ad69a
--- /dev/null
+++ b/agent/tests/test_policy.py
@@ -0,0 +1,226 @@
+"""Unit tests for policy.py — Cedar policy engine."""
+
+import pytest
+
+cedarpy = pytest.importorskip("cedarpy")
+
+from policy import PolicyDecision, PolicyEngine
+
+
+class TestPolicyDecision:
+    def test_allowed_decision(self):
+        d = PolicyDecision(allowed=True, reason="permitted", duration_ms=0.5)
+        assert d.allowed is True
+        assert d.reason == "permitted"
+        assert d.duration_ms >= 0
+
+    def test_denied_decision(self):
+        d = PolicyDecision(allowed=False, reason="denied", duration_ms=1.0)
+        assert d.allowed is False
+
+
+class TestNewTaskPermissions:
+    def test_allows_write(self):
+        engine = PolicyEngine(task_type="new_task", repo="owner/repo")
+        result = engine.evaluate_tool_use("Write", {"file_path": "src/main.py"})
+        assert result.allowed is True
+
+    def test_allows_edit(self):
+        engine = PolicyEngine(task_type="new_task", repo="owner/repo")
+        result = engine.evaluate_tool_use("Edit", {"file_path": "src/main.py"})
+        assert result.allowed is True
+
+    def test_allows_bash(self):
+        engine = PolicyEngine(task_type="new_task", repo="owner/repo")
+        result = engine.evaluate_tool_use("Bash", {"command": "npm test"})
+        assert result.allowed is True
+
+    def test_allows_read(self):
+        engine = PolicyEngine(task_type="new_task", repo="owner/repo")
+        result = engine.evaluate_tool_use("Read", {"file_path": "src/main.py"})
+        assert result.allowed is True
+
+    def test_allows_glob(self):
+        engine = PolicyEngine(task_type="new_task", repo="owner/repo")
+        result = engine.evaluate_tool_use("Glob", {"pattern": "**/*.py"})
+        assert result.allowed is True
+
+    def test_allows_grep(self):
+        engine = PolicyEngine(task_type="new_task", repo="owner/repo")
+        result = engine.evaluate_tool_use("Grep", {"pattern": "TODO"})
+        assert result.allowed is True
+
+
+class TestPrReviewPermissions:
+    def test_denies_write(self):
+        engine = PolicyEngine(task_type="pr_review", repo="owner/repo")
+        result = engine.evaluate_tool_use("Write", {"file_path": "src/main.py"})
+        assert result.allowed is False
+        assert "pr_review" in result.reason
+
+    def test_denies_edit(self):
+        engine = PolicyEngine(task_type="pr_review", repo="owner/repo")
+        result = engine.evaluate_tool_use("Edit", {"file_path": "src/main.py"})
+        assert result.allowed is False
+
+    def test_allows_read(self):
+        engine = PolicyEngine(task_type="pr_review", repo="owner/repo")
+        result = engine.evaluate_tool_use("Read", {"file_path": "src/main.py"})
+        assert result.allowed is True
+
+    def test_allows_glob(self):
+        engine = PolicyEngine(task_type="pr_review", repo="owner/repo")
+        result = engine.evaluate_tool_use("Glob", {"pattern": "**/*.py"})
+        assert result.allowed is True
+
+    def test_allows_grep(self):
+        engine = PolicyEngine(task_type="pr_review", repo="owner/repo")
+        result = engine.evaluate_tool_use("Grep", {"pattern": "TODO"})
+        assert result.allowed is True
+
+    def test_allows_bash(self):
+        engine = PolicyEngine(task_type="pr_review", repo="owner/repo")
+        result = engine.evaluate_tool_use("Bash", {"command": "npm test"})
+        assert result.allowed is True
+
+
+class TestProtectedPaths:
+    def test_denies_write_to_git_dir(self):
+        engine = PolicyEngine(task_type="new_task", repo="owner/repo")
+        result = engine.evaluate_tool_use("Write", {"file_path": ".git/config"})
+        assert result.allowed is False
+
+    def test_denies_write_to_git_dir_absolute_path(self):
+        engine = PolicyEngine(task_type="new_task", repo="owner/repo")
+        result = engine.evaluate_tool_use("Write", {"file_path": "/workspace/abc123/.git/config"})
+        assert result.allowed is False
+
+    def test_allows_write_to_normal_path(self):
+        engine = PolicyEngine(task_type="new_task", repo="owner/repo")
+        result = engine.evaluate_tool_use("Write", {"file_path": "src/app.ts"})
+        assert result.allowed is True
+
+    def test_allows_write_to_github_workflows(self):
+        engine = PolicyEngine(task_type="new_task", repo="owner/repo")
+        result = engine.evaluate_tool_use("Write", {"file_path": ".github/workflows/ci.yml"})
+        assert result.allowed is True
+
+    def test_allows_edit_to_github_workflows(self):
+        engine = PolicyEngine(task_type="new_task", repo="owner/repo")
+        result
= engine.evaluate_tool_use("Edit", {"file_path": ".github/workflows/deploy.yml"}) + assert result.allowed is True + + +class TestDestructiveBashCommands: + def test_denies_rm_rf_root(self): + engine = PolicyEngine(task_type="new_task", repo="owner/repo") + result = engine.evaluate_tool_use("Bash", {"command": "rm -rf /"}) + assert result.allowed is False + + def test_denies_git_push_force(self): + engine = PolicyEngine(task_type="new_task", repo="owner/repo") + result = engine.evaluate_tool_use("Bash", {"command": "git push --force origin main"}) + assert result.allowed is False + + def test_denies_git_push_f(self): + engine = PolicyEngine(task_type="new_task", repo="owner/repo") + result = engine.evaluate_tool_use("Bash", {"command": "git push -f origin main"}) + assert result.allowed is False + + def test_denies_git_push_f_no_trailing_args(self): + engine = PolicyEngine(task_type="new_task", repo="owner/repo") + result = engine.evaluate_tool_use("Bash", {"command": "git push -f"}) + assert result.allowed is False + + def test_allows_normal_bash(self): + engine = PolicyEngine(task_type="new_task", repo="owner/repo") + result = engine.evaluate_tool_use("Bash", {"command": "npm test"}) + assert result.allowed is True + + def test_allows_mise_run_build(self): + engine = PolicyEngine(task_type="new_task", repo="owner/repo") + result = engine.evaluate_tool_use("Bash", {"command": "mise run build"}) + assert result.allowed is True + + +class TestBashCommandsWithQuotes: + """Commands containing double quotes must not cause NoDecision.""" + + @pytest.mark.parametrize( + "cmd", + [ + 'git commit -m "fix: login bug"', + 'git commit-tree HEAD^{tree} -m "squash"', + 'gh pr create --title "my PR" --body "desc"', + 'gh api --method POST /repos/o/r/pulls -f title="PR"', + "git commit -m \"$(cat <<'EOF'\nFix the bug\nEOF\n)\"", + ], + ids=["git-commit-msg", "git-commit-tree", "gh-pr-create", "gh-api-post", "heredoc-commit"], + ) + def test_allows_command_with_quotes(self, cmd): + 
engine = PolicyEngine(task_type="new_task", repo="owner/repo") + result = engine.evaluate_tool_use("Bash", {"command": cmd}) + assert result.allowed is True + + def test_denies_force_push_with_quotes(self): + engine = PolicyEngine(task_type="new_task", repo="owner/repo") + result = engine.evaluate_tool_use("Bash", {"command": 'git push --force "origin" main'}) + assert result.allowed is False + + +class TestFilePathsWithSpecialChars: + """File paths with special characters must not cause NoDecision.""" + + def test_allows_path_with_quotes(self): + engine = PolicyEngine(task_type="new_task", repo="owner/repo") + result = engine.evaluate_tool_use("Write", {"file_path": '/workspace/it"s-a-file.ts'}) + assert result.allowed is True + + def test_denies_git_dir_path_with_quotes(self): + engine = PolicyEngine(task_type="new_task", repo="owner/repo") + result = engine.evaluate_tool_use("Write", {"file_path": '.git/hooks/pre"commit'}) + assert result.allowed is False + + +class TestExtraPolicies: + def test_extra_forbid_applied(self): + extra = [ + 'forbid (principal, action == Agent::Action::"invoke_tool", ' + 'resource == Agent::Tool::"WebFetch");' + ] + engine = PolicyEngine(task_type="new_task", repo="owner/repo", extra_policies=extra) + result = engine.evaluate_tool_use("WebFetch", {}) + assert result.allowed is False + + +class TestFailClosed: + def test_invalid_policy_syntax_fails_closed(self): + """Invalid Cedar policy syntax should fail closed (deny the call).""" + engine = PolicyEngine(task_type="new_task", repo="owner/repo") + # Override with invalid policies + engine._policies = "THIS IS NOT VALID CEDAR" + result = engine.evaluate_tool_use("Write", {"file_path": "test.py"}) + assert result.allowed is False + assert "fail-closed" in result.reason or "NoDecision" in result.reason + + +class TestDurationMetrics: + def test_decision_has_nonnegative_duration(self): + engine = PolicyEngine(task_type="new_task", repo="owner/repo") + result = 
engine.evaluate_tool_use("Read", {"file_path": "test.py"}) + assert result.duration_ms >= 0 + + def test_denied_decision_has_nonnegative_duration(self): + engine = PolicyEngine(task_type="pr_review", repo="owner/repo") + result = engine.evaluate_tool_use("Write", {"file_path": "test.py"}) + assert result.duration_ms >= 0 + + +class TestTaskTypeProperty: + def test_task_type_property(self): + engine = PolicyEngine(task_type="new_task", repo="owner/repo") + assert engine.task_type == "new_task" + + def test_task_type_pr_review(self): + engine = PolicyEngine(task_type="pr_review", repo="owner/repo") + assert engine.task_type == "pr_review" diff --git a/agent/tests/test_shell.py b/agent/tests/test_shell.py new file mode 100644 index 0000000..2640327 --- /dev/null +++ b/agent/tests/test_shell.py @@ -0,0 +1,57 @@ +"""Unit tests for shell.py — slugify, redact_secrets, truncate.""" + +from shell import redact_secrets, slugify, truncate + + +class TestSlugify: + def test_basic(self): + assert slugify("Fix the login bug") == "fix-the-login-bug" + + def test_special_chars(self): + assert slugify("Add feature: OAuth2.0!") == "add-feature-oauth20" + + def test_max_len(self): + result = slugify("a very long task description indeed", max_len=10) + assert len(result) <= 10 + assert not result.endswith("-") + + def test_empty(self): + assert slugify("") == "task" + + def test_only_special_chars(self): + assert slugify("!!!") == "task" + + +class TestRedactSecrets: + def test_ghp_token(self): + assert "***" in redact_secrets("ghp_abc123XYZ") + assert "abc123XYZ" not in redact_secrets("ghp_abc123XYZ") + + def test_github_pat_token(self): + result = redact_secrets("github_pat_abcDEF123") + assert "abcDEF123" not in result + + def test_x_access_token(self): + result = redact_secrets("https://x-access-token:mysecret@github.com/foo/bar") + assert "mysecret" not in result + + def test_no_secrets(self): + text = "nothing secret here" + assert redact_secrets(text) == text + + +class 
TestTruncate: + def test_short_text(self): + assert truncate("hello") == "hello" + + def test_long_text(self): + long = "a" * 300 + result = truncate(long, max_len=100) + assert len(result) == 103 # 100 + "..." + assert result.endswith("...") + + def test_empty(self): + assert truncate("") == "" + + def test_newlines_replaced(self): + assert truncate("line1\nline2") == "line1 line2" diff --git a/agent/uv.lock b/agent/uv.lock index 5e1a0f5..ddbaf59 100644 --- a/agent/uv.lock +++ b/agent/uv.lock @@ -128,6 +128,7 @@ source = { virtual = "." } dependencies = [ { name = "aws-opentelemetry-distro" }, { name = "boto3" }, + { name = "cedarpy" }, { name = "claude-agent-sdk" }, { name = "fastapi" }, { name = "mcp" }, @@ -147,6 +148,7 @@ dev = [ requires-dist = [ { name = "aws-opentelemetry-distro", specifier = "~=0.15.0" }, { name = "boto3", specifier = "==1.42.54" }, + { name = "cedarpy", specifier = ">=4.8.0" }, { name = "claude-agent-sdk", specifier = "==0.1.53" }, { name = "fastapi", specifier = "==0.135.2" }, { name = "mcp", specifier = "==1.23.0" }, @@ -199,6 +201,26 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/2c/fc/1d7b80d0eb7b714984ce40efc78859c022cd930e402f599d8ca9e39c78a4/cachetools-6.2.4-py3-none-any.whl", hash = "sha256:69a7a52634fed8b8bf6e24a050fb60bff1c9bd8f6d24572b99c32d4e71e62a51", size = 11551 }, ] +[[package]] +name = "cedarpy" +version = "4.8.0" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/8b/60/bab3dcc838a7b214bfbf97ed7b4b52b496407d8f10f5831c60fbb1cf07ae/cedarpy-4.8.0.tar.gz", hash = "sha256:5ee4b743e8559e8483f3945b1bc24011a66f1216895d56eed4193c4e82c39612", size = 197033 } +wheels = [ + { url = "https://files.pythonhosted.org/packages/fc/1b/e710bf73aab96085db38cfc68f2c1aacc44ce3a24f8c8aa4a386b7146287/cedarpy-4.8.0-cp313-cp313-macosx_10_12_x86_64.whl", hash = "sha256:5c1b27a04399e1889035cc5bc9c86ab06aa8d936dfbfc88c6e63f3a46785c956", size = 4017278 }, + { url = 
"https://files.pythonhosted.org/packages/94/4f/70d4a3b1e86d60c55e314deaf67b811ab6b4b913d4de60047773137968b8/cedarpy-4.8.0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:cfbeb0b13d5b4d7a2508f228d5f731683e29340ec635ca770e656c19aa45984d", size = 3904172 }, + { url = "https://files.pythonhosted.org/packages/de/76/f002be0235352796fa6ed9ef640662ca80b94b08d9b1470322a63018529c/cedarpy-4.8.0-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:e1e7cc2f4b965a5c6bfa0c736d4df141213a5bec7dee3a051569c447178c31a3", size = 4292410 }, + { url = "https://files.pythonhosted.org/packages/29/67/1a481d251c34e3a4d5a69ba5dcdf7fa9bd276d2029a41b426eb79e1e2588/cedarpy-4.8.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:38585b66ef5f95ff0a20e87c6274b8ce1761802f135d537edabf5908027347c0", size = 4407765 }, + { url = "https://files.pythonhosted.org/packages/ab/f7/8a65d186db58479687c53c77c5440db85e163bf5c59eb49ed2171a8f8bd1/cedarpy-4.8.0-cp313-cp313-win_amd64.whl", hash = "sha256:3e457cd9a038763967baaa0dc496a696998b6741822c9a72c449cc5eb3d0eaf6", size = 3788124 }, + { url = "https://files.pythonhosted.org/packages/b0/47/7fbc65ea257b199e4720849314354ebd34e68ac3f30d5a2d2271810ffca2/cedarpy-4.8.0-cp313-cp313t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:ed4f5fb785eaaa599e519e0bf05bb4d12b0eed55fe2cac4d9b8cc88bf87c7e54", size = 4292263 }, + { url = "https://files.pythonhosted.org/packages/03/9e/39085b3b346c940adc5654586ef4252726f087ff2b23df474148473f2f36/cedarpy-4.8.0-cp314-cp314-macosx_10_12_x86_64.whl", hash = "sha256:bdbfd1551dde8d4538ec00b3ee33083b823cc405b984b56c8478a50e7ce09593", size = 4015993 }, + { url = "https://files.pythonhosted.org/packages/d2/16/7785f2c013c73474e30895b60cf6491ca2d367a41bfdde3f52735a405b5e/cedarpy-4.8.0-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:c49982888562bf92d5c4282fb669fab3bb71b5d3fc6414fa995ad40aa2a9e24d", size = 3902874 }, + { url = 
"https://files.pythonhosted.org/packages/a0/bd/762be74a9d8de7e6a575bac93c5afd71ce648a1853f85ee93888a2fe9a1c/cedarpy-4.8.0-cp314-cp314-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:4d6bb5b61e7548e245c9468b9e48aab2845dd9cf2aaf37712b0da5a97e4f4716", size = 4291656 }, + { url = "https://files.pythonhosted.org/packages/e1/47/91e0f8f873904984833189a7a3a8841f5815b1211f413f0e593df03077c8/cedarpy-4.8.0-cp314-cp314-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:17227fc51724fa778db0379bab66a88f9571d3d31af257aaa512375fbc828606", size = 4408129 }, + { url = "https://files.pythonhosted.org/packages/3a/6c/29f66ac1c6c7db1021b7aa9843abd5a10fb9eef2fb66713aa32330c0eb2b/cedarpy-4.8.0-cp314-cp314-win_amd64.whl", hash = "sha256:3c41717161c6ca035bbdb396d8db58547cd805cdb00b8c0181cae9d505df9137", size = 3788010 }, + { url = "https://files.pythonhosted.org/packages/0d/de/217397e7830a17dc40cabad56396b56c9f990dfa6218602c161aa9bfc12f/cedarpy-4.8.0-cp314-cp314t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:4f8195276bc8db6dd5d2d84b22722c1fa4e4cacb662b4026ef59a653e10e2f17", size = 4292748 }, +] + [[package]] name = "certifi" version = "2026.2.25" diff --git a/cdk/src/constructs/blueprint.ts b/cdk/src/constructs/blueprint.ts index 61b2996..55ce7ed 100644 --- a/cdk/src/constructs/blueprint.ts +++ b/cdk/src/constructs/blueprint.ts @@ -95,6 +95,17 @@ export interface BlueprintProps { readonly pollIntervalMs?: number; }; + /** + * Security configuration. + */ + readonly security?: { + /** + * Additional Cedar policy strings evaluated by the agent's PolicyEngine. + * These are appended to the default policies (deny-list model). + */ + readonly cedarPolicies?: string[]; + }; + /** * Network configuration for the agent. */ @@ -128,10 +139,16 @@ export class Blueprint extends Construct { */ public readonly egressAllowlist: readonly string[]; + /** + * Cedar policies from the security.cedarPolicies prop, exposed for inspection. 
+ */ + public readonly cedarPolicies: readonly string[]; + constructor(scope: Construct, id: string, props: BlueprintProps) { super(scope, id); this.egressAllowlist = [...(props.networking?.egressAllowlist ?? [])]; + this.cedarPolicies = [...(props.security?.cedarPolicies ?? [])]; // Validate repo format at construct time this.node.addValidation(new RepoFormatValidation(props.repo)); @@ -171,6 +188,9 @@ export class Blueprint extends Construct { if (this.egressAllowlist.length > 0) { item.egress_allowlist = { L: this.egressAllowlist.map(d => ({ S: d })) }; } + if (this.cedarPolicies.length > 0) { + item.cedar_policies = { L: this.cedarPolicies.map(p => ({ S: p })) }; + } new cr.AwsCustomResource(this, 'RepoConfigCR', { onCreate: { @@ -240,6 +260,7 @@ export class Blueprint extends Construct { if (props.credentials?.githubTokenSecretArn) fields.push(', #github_token_secret_arn = :github_token_secret_arn'); if (props.pipeline?.pollIntervalMs !== undefined) fields.push(', #poll_interval_ms = :poll_interval_ms'); if (this.egressAllowlist.length > 0) fields.push(', #egress_allowlist = :egress_allowlist'); + if (this.cedarPolicies.length > 0) fields.push(', #cedar_policies = :cedar_policies'); return fields.join(''); } @@ -253,6 +274,7 @@ export class Blueprint extends Construct { if (props.credentials?.githubTokenSecretArn) names['#github_token_secret_arn'] = 'github_token_secret_arn'; if (props.pipeline?.pollIntervalMs !== undefined) names['#poll_interval_ms'] = 'poll_interval_ms'; if (this.egressAllowlist.length > 0) names['#egress_allowlist'] = 'egress_allowlist'; + if (this.cedarPolicies.length > 0) names['#cedar_policies'] = 'cedar_policies'; return names; } @@ -266,6 +288,7 @@ export class Blueprint extends Construct { if (props.credentials?.githubTokenSecretArn) values[':github_token_secret_arn'] = { S: props.credentials.githubTokenSecretArn }; if (props.pipeline?.pollIntervalMs !== undefined) values[':poll_interval_ms'] = { N: 
String(props.pipeline.pollIntervalMs) }; if (this.egressAllowlist.length > 0) values[':egress_allowlist'] = { L: this.egressAllowlist.map(d => ({ S: d })) }; + if (this.cedarPolicies.length > 0) values[':cedar_policies'] = { L: this.cedarPolicies.map(p => ({ S: p })) }; return values; } } diff --git a/cdk/src/handlers/shared/context-hydration.ts b/cdk/src/handlers/shared/context-hydration.ts index cfc2348..efd2551 100644 --- a/cdk/src/handlers/shared/context-hydration.ts +++ b/cdk/src/handlers/shared/context-hydration.ts @@ -27,6 +27,20 @@ import { isPrTaskType, type TaskRecord, type TaskType } from './types'; // Types // --------------------------------------------------------------------------- +/** Detail of a single guardrail filter that triggered. */ +export interface GuardrailAssessmentDetail { + readonly filter_type: 'CONTENT' | 'TOPIC' | 'WORD' | 'SENSITIVE_INFO'; + readonly filter_name: string; + readonly confidence?: string; + readonly action: string; +} + +/** Result of guardrail screening including assessment details. */ +export interface GuardrailScreeningResult { + readonly action: 'GUARDRAIL_INTERVENED' | 'NONE'; + readonly assessments?: GuardrailAssessmentDetail[]; +} + /** * A single comment on a GitHub issue. */ @@ -123,16 +137,74 @@ export class GuardrailScreeningError extends Error { } } +/** Mapping from policy response keys to assessment detail extraction rules. 
*/ +const POLICY_EXTRACTORS: ReadonlyArray<{ + readonly policyKey: string; + readonly itemsKey: string; + readonly filterType: GuardrailAssessmentDetail['filter_type']; + readonly nameField: string; +}> = [ + { policyKey: 'contentPolicy', itemsKey: 'filters', filterType: 'CONTENT', nameField: 'type' }, + { policyKey: 'topicPolicy', itemsKey: 'topics', filterType: 'TOPIC', nameField: 'name' }, + { policyKey: 'wordPolicy', itemsKey: 'customWords', filterType: 'WORD', nameField: 'match' }, + { policyKey: 'wordPolicy', itemsKey: 'managedWordLists', filterType: 'WORD', nameField: 'match' }, + { policyKey: 'sensitiveInformationPolicy', itemsKey: 'piiEntities', filterType: 'SENSITIVE_INFO', nameField: 'type' }, +]; + +/** + * Extract assessment details from the Bedrock ApplyGuardrail response. + */ +function extractAssessmentDetails( + assessments: Array> | undefined, +): GuardrailAssessmentDetail[] { + const details: GuardrailAssessmentDetail[] = []; + if (!assessments) return details; + + for (const assessment of assessments) { + for (const extractor of POLICY_EXTRACTORS) { + const policy = assessment[extractor.policyKey] as Record | undefined; + const items = policy?.[extractor.itemsKey] as Array> | undefined; + if (items) { + for (const item of items) { + details.push({ + filter_type: extractor.filterType, + filter_name: (item[extractor.nameField] as string) ?? 'UNKNOWN', + ...(item.confidence !== undefined && { confidence: item.confidence as string }), + action: (item.action as string) ?? 'BLOCKED', + }); + } + } + } + } + + return details; +} + +/** + * Format a guardrail-blocked message from the screening result. + * Returns undefined when the guardrail did not intervene. 
+ */ +function formatGuardrailBlocked( + screenResult: GuardrailScreeningResult | undefined, + prefix: string, +): string | undefined { + if (screenResult?.action !== 'GUARDRAIL_INTERVENED') return undefined; + const details = screenResult.assessments + ?.map(a => `${a.filter_type}/${a.filter_name}${a.confidence ? ` (${a.confidence})` : ''}`) + .join(', '); + return `${prefix} blocked by content policy${details ? ': ' + details : ''}`; +} + /** * Screen text through the Bedrock Guardrail for prompt injection detection. * Fail-closed: throws on Bedrock errors so unscreened content never reaches the agent. * @param text - the text to screen. * @param taskId - the task ID (for logging). - * @returns 'GUARDRAIL_INTERVENED' if blocked, 'NONE' if allowed, undefined when guardrail is - * not configured (env vars missing). + * @returns a GuardrailScreeningResult with action and assessment details, or undefined when + * guardrail is not configured (env vars missing). * @throws GuardrailScreeningError when the Bedrock Guardrail API call fails (fail-closed). */ -export async function screenWithGuardrail(text: string, taskId: string): Promise<'GUARDRAIL_INTERVENED' | 'NONE' | undefined> { +export async function screenWithGuardrail(text: string, taskId: string): Promise<GuardrailScreeningResult | undefined> { if (!bedrockClient || !GUARDRAIL_ID || !GUARDRAIL_VERSION) { logger.info('Guardrail screening skipped — guardrail not configured', { task_id: taskId, @@ -149,16 +221,24 @@ content: [{ text: { text } }], })); + + const assessments = extractAssessmentDetails( + result.assessments as Array<Record<string, unknown>> | undefined, + ); + if (result.action === 'GUARDRAIL_INTERVENED') { logger.warn('Content blocked by guardrail', { task_id: taskId, guardrail_id: GUARDRAIL_ID, guardrail_version: GUARDRAIL_VERSION, + assessment_details: assessments.length > 0 ?
JSON.stringify(assessments) : undefined, }); - return 'GUARDRAIL_INTERVENED'; + return { + action: 'GUARDRAIL_INTERVENED', + ...(assessments.length > 0 && { assessments }), + }; } - return 'NONE'; + return { action: 'NONE' }; } catch (err) { logger.error('Guardrail screening failed (fail-closed)', { task_id: taskId, @@ -971,7 +1051,9 @@ export async function hydrateContext(task: TaskRecord, options?: HydrateContextO resolvedBaseBranch = prResult.base_ref; // Screen assembled PR prompt through Bedrock Guardrail for prompt injection - const guardrailAction = await screenWithGuardrail(userPrompt, task.task_id); + const screenResult = await screenWithGuardrail(userPrompt, task.task_id); + + const guardrailBlocked = formatGuardrailBlocked(screenResult, 'PR context'); const prContext: HydratedContext = { version: 1, @@ -982,9 +1064,7 @@ export async function hydrateContext(task: TaskRecord, options?: HydrateContextO sources, token_estimate: estimateTokens(userPrompt), truncated, - ...(guardrailAction === 'GUARDRAIL_INTERVENED' && { - guardrail_blocked: 'PR context blocked by content policy', - }), + ...(guardrailBlocked && { guardrail_blocked: guardrailBlocked }), }; return prContext; @@ -999,10 +1079,12 @@ export async function hydrateContext(task: TaskRecord, options?: HydrateContextO // Screen assembled prompt when it includes GitHub issue content (attacker-controlled input). // Skipped when no issue is present — task_description is already screened at submission time. - const guardrailAction = issue + const screenResult = issue ? 
await screenWithGuardrail(userPrompt, task.task_id) : undefined; + + const guardrailBlocked = formatGuardrailBlocked(screenResult, 'Task context'); + return { version: 1, user_prompt: userPrompt, @@ -1011,9 +1093,7 @@ sources, token_estimate: tokenEstimate, truncated: budgetResult.truncated, - ...(guardrailAction === 'GUARDRAIL_INTERVENED' && { - guardrail_blocked: 'Task context blocked by content policy', - }), + ...(guardrailBlocked && { guardrail_blocked: guardrailBlocked }), }; } catch (err) { // Guardrail failures must propagate (fail-closed) — unscreened content must not reach the agent diff --git a/cdk/src/handlers/shared/orchestrator.ts b/cdk/src/handlers/shared/orchestrator.ts index 0c2e9d2..539f376 100644 --- a/cdk/src/handlers/shared/orchestrator.ts +++ b/cdk/src/handlers/shared/orchestrator.ts @@ -234,6 +234,7 @@ export async function loadBlueprintConfig(task: TaskRecord): Promise<BlueprintConfig> + ...(blueprintConfig.cedar_policies && blueprintConfig.cedar_policies.length > 0 && { cedar_policies: blueprintConfig.cedar_policies }), prompt_version: promptVersion, ...(MEMORY_ID && { memory_id: MEMORY_ID }), hydrated_context: hydratedContext, diff --git a/cdk/src/handlers/shared/repo-config.ts b/cdk/src/handlers/shared/repo-config.ts index 5376063..e80b7a0 100644 --- a/cdk/src/handlers/shared/repo-config.ts +++ b/cdk/src/handlers/shared/repo-config.ts @@ -39,6 +39,7 @@ export interface RepoConfig { readonly github_token_secret_arn?: string; readonly poll_interval_ms?: number; readonly egress_allowlist?: string[]; + readonly cedar_policies?: string[]; } /** @@ -55,6 +56,7 @@ export interface BlueprintConfig { readonly github_token_secret_arn?: string; readonly poll_interval_ms?: number; readonly egress_allowlist?: string[]; + readonly cedar_policies?: string[]; } const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({})); diff --git a/cdk/test/constructs/blueprint.test.ts b/cdk/test/constructs/blueprint.test.ts index 62a80df..82c9b74 100644 ---
a/cdk/test/constructs/blueprint.test.ts +++ b/cdk/test/constructs/blueprint.test.ts @@ -235,6 +235,71 @@ describe('Blueprint construct', () => { expect(blueprint.egressAllowlist).toEqual([]); }); + test('exposes cedarPolicies as public property', () => { + const app = new App(); + const stack = new Stack(app, 'TestStack'); + const repoTable = new dynamodb.Table(stack, 'RepoTable', { + partitionKey: { name: 'repo', type: dynamodb.AttributeType.STRING }, + }); + + const blueprint = new Blueprint(stack, 'Blueprint', { + repo: 'org/my-repo', + repoTable, + security: { cedarPolicies: ['permit (principal, action, resource);'] }, + }); + + expect(blueprint.cedarPolicies).toEqual(['permit (principal, action, resource);']); + }); + + test('cedarPolicies defaults to empty array', () => { + const app = new App(); + const stack = new Stack(app, 'TestStack'); + const repoTable = new dynamodb.Table(stack, 'RepoTable', { + partitionKey: { name: 'repo', type: dynamodb.AttributeType.STRING }, + }); + + const blueprint = new Blueprint(stack, 'Blueprint', { + repo: 'org/my-repo', + repoTable, + }); + + expect(blueprint.cedarPolicies).toEqual([]); + }); + + test('maps security cedar policies to DynamoDB list', () => { + const { template } = createStack({ + security: { cedarPolicies: ['forbid (principal, action, resource) when { resource == Agent::Tool::"Bash" };'] }, + }); + const parts = getCreateJoinParts(template); + const serialized = parts.join(''); + expect(serialized).toContain('"cedar_policies":{"L":[{"S":"forbid (principal, action, resource) when { resource == Agent::Tool::\\"Bash\\" };"}]}'); + }); + + test('omits cedar_policies when security is absent', () => { + const { template } = createStack(); + const parts = getCreateJoinParts(template); + const serialized = parts.join(''); + expect(serialized).not.toContain('cedar_policies'); + }); + + test('omits cedar_policies when cedarPolicies is empty', () => { + const { template } = createStack({ + security: { cedarPolicies: [] 
}, + }); + const parts = getCreateJoinParts(template); + const serialized = parts.join(''); + expect(serialized).not.toContain('cedar_policies'); + }); + + test('onUpdate includes cedar_policies in UpdateExpression', () => { + const { template } = createStack({ + security: { cedarPolicies: ['permit (principal, action, resource);'] }, + }); + const parts = getUpdateJoinParts(template); + const serialized = parts.join(''); + expect(serialized).toContain('#cedar_policies'); + }); + test('onUpdate uses DynamoDB updateItem to preserve onboarded_at', () => { const { template } = createStack(); const parts = getUpdateJoinParts(template); diff --git a/cdk/test/handlers/orchestrate-task.test.ts b/cdk/test/handlers/orchestrate-task.test.ts index e08627c..a495732 100644 --- a/cdk/test/handlers/orchestrate-task.test.ts +++ b/cdk/test/handlers/orchestrate-task.test.ts @@ -347,6 +347,30 @@ describe('loadBlueprintConfig', () => { const config = await loadBlueprintConfig(baseTask as any); expect(config.poll_interval_ms).toBeUndefined(); }); + + test('passes cedar_policies from repo config', async () => { + const policies = ['forbid (principal, action, resource) when { resource == Agent::Tool::"Bash" };']; + mockLoadRepoConfig.mockResolvedValueOnce({ + repo: 'org/repo', + status: 'active', + onboarded_at: '2024-01-01T00:00:00Z', + updated_at: '2024-01-01T00:00:00Z', + cedar_policies: policies, + }); + const config = await loadBlueprintConfig(baseTask as any); + expect(config.cedar_policies).toEqual(policies); + }); + + test('returns undefined cedar_policies when repo config has none', async () => { + mockLoadRepoConfig.mockResolvedValueOnce({ + repo: 'org/repo', + status: 'active', + onboarded_at: '2024-01-01T00:00:00Z', + updated_at: '2024-01-01T00:00:00Z', + }); + const config = await loadBlueprintConfig(baseTask as any); + expect(config.cedar_policies).toBeUndefined(); + }); }); describe('hydrateAndTransition with blueprint config', () => { @@ -426,6 +450,39 @@ 
describe('hydrateAndTransition with blueprint config', () => { expect.objectContaining({ githubTokenSecretArn: 'arn:aws:secretsmanager:us-east-1:123:secret:per-repo-token' }), ); }); + + test('includes cedar_policies in payload when blueprint config has them', async () => { + mockDdbSend.mockResolvedValue({}); + mockHydrateContext.mockResolvedValueOnce(mockHydratedContext); + const policies = ['forbid (principal, action, resource) when { resource == Agent::Tool::"Bash" };']; + const payload = await hydrateAndTransition(baseTask as any, { + compute_type: 'agentcore', + runtime_arn: 'arn:test', + cedar_policies: policies, + }); + expect(payload.cedar_policies).toEqual(policies); + }); + + test('omits cedar_policies from payload when blueprint config has none', async () => { + mockDdbSend.mockResolvedValue({}); + mockHydrateContext.mockResolvedValueOnce(mockHydratedContext); + const payload = await hydrateAndTransition(baseTask as any, { + compute_type: 'agentcore', + runtime_arn: 'arn:test', + }); + expect(payload.cedar_policies).toBeUndefined(); + }); + + test('omits cedar_policies from payload when array is empty', async () => { + mockDdbSend.mockResolvedValue({}); + mockHydrateContext.mockResolvedValueOnce(mockHydratedContext); + const payload = await hydrateAndTransition(baseTask as any, { + compute_type: 'agentcore', + runtime_arn: 'arn:test', + cedar_policies: [], + }); + expect(payload.cedar_policies).toBeUndefined(); + }); }); describe('startSession with blueprint config', () => { diff --git a/cdk/test/handlers/shared/context-hydration.test.ts b/cdk/test/handlers/shared/context-hydration.test.ts index 5586b58..64a0383 100644 --- a/cdk/test/handlers/shared/context-hydration.test.ts +++ b/cdk/test/handlers/shared/context-hydration.test.ts @@ -54,6 +54,7 @@ import { resolveGitHubToken, screenWithGuardrail, type GitHubIssueContext, + type GuardrailScreeningResult, type IssueComment, } from '../../../src/handlers/shared/context-hydration'; @@ -1010,17 +1011,83 @@ 
describe('assemblePrIterationPrompt', () => { // --------------------------------------------------------------------------- describe('screenWithGuardrail', () => { - test('returns NONE when guardrail allows the text', async () => { + test('returns {action: NONE} when guardrail allows the text', async () => { mockBedrockSend.mockResolvedValueOnce({ action: 'NONE' }); const result = await screenWithGuardrail('safe text', 'TASK001'); - expect(result).toBe('NONE'); + expect(result).toEqual({ action: 'NONE' }); expect(mockBedrockSend).toHaveBeenCalledTimes(1); }); - test('returns GUARDRAIL_INTERVENED when guardrail blocks the text', async () => { + test('returns {action: GUARDRAIL_INTERVENED} when guardrail blocks the text', async () => { mockBedrockSend.mockResolvedValueOnce({ action: 'GUARDRAIL_INTERVENED' }); const result = await screenWithGuardrail('malicious text', 'TASK001'); - expect(result).toBe('GUARDRAIL_INTERVENED'); + expect(result!.action).toBe('GUARDRAIL_INTERVENED'); + }); + + test('returns assessment details when guardrail blocks with content policy filters', async () => { + mockBedrockSend.mockResolvedValueOnce({ + action: 'GUARDRAIL_INTERVENED', + assessments: [{ + contentPolicy: { + filters: [ + { type: 'PROMPT_ATTACK', confidence: 'HIGH', action: 'BLOCKED' }, + { type: 'HATE', confidence: 'MEDIUM', action: 'BLOCKED' }, + ], + }, + }], + }); + const result = await screenWithGuardrail('attack text', 'TASK001') as GuardrailScreeningResult; + expect(result.action).toBe('GUARDRAIL_INTERVENED'); + expect(result.assessments).toHaveLength(2); + expect(result.assessments![0]).toEqual({ + filter_type: 'CONTENT', + filter_name: 'PROMPT_ATTACK', + confidence: 'HIGH', + action: 'BLOCKED', + }); + expect(result.assessments![1]).toEqual({ + filter_type: 'CONTENT', + filter_name: 'HATE', + confidence: 'MEDIUM', + action: 'BLOCKED', + }); + }); + + test('returns assessment details for topic, word, and sensitive info policies', async () => { + 
mockBedrockSend.mockResolvedValueOnce({ + action: 'GUARDRAIL_INTERVENED', + assessments: [{ + topicPolicy: { topics: [{ name: 'FINANCIAL_ADVICE', action: 'BLOCKED' }] }, + wordPolicy: { customWords: [{ match: 'badword', action: 'BLOCKED' }] }, + sensitiveInformationPolicy: { piiEntities: [{ type: 'SSN', action: 'BLOCKED' }] }, + }], + }); + const result = await screenWithGuardrail('sensitive text', 'TASK001') as GuardrailScreeningResult; + expect(result.action).toBe('GUARDRAIL_INTERVENED'); + expect(result.assessments).toHaveLength(3); + expect(result.assessments![0]).toEqual({ filter_type: 'TOPIC', filter_name: 'FINANCIAL_ADVICE', action: 'BLOCKED' }); + expect(result.assessments![1]).toEqual({ filter_type: 'WORD', filter_name: 'badword', action: 'BLOCKED' }); + expect(result.assessments![2]).toEqual({ filter_type: 'SENSITIVE_INFO', filter_name: 'SSN', action: 'BLOCKED' }); + }); + + test('returns assessment details for managed word lists', async () => { + mockBedrockSend.mockResolvedValueOnce({ + action: 'GUARDRAIL_INTERVENED', + assessments: [{ + wordPolicy: { managedWordLists: [{ match: 'profanity', action: 'BLOCKED' }] }, + }], + }); + const result = await screenWithGuardrail('bad text', 'TASK001') as GuardrailScreeningResult; + expect(result.action).toBe('GUARDRAIL_INTERVENED'); + expect(result.assessments).toHaveLength(1); + expect(result.assessments![0]).toEqual({ filter_type: 'WORD', filter_name: 'profanity', action: 'BLOCKED' }); + }); + + test('returns no assessments when GUARDRAIL_INTERVENED but assessments array is empty', async () => { + mockBedrockSend.mockResolvedValueOnce({ action: 'GUARDRAIL_INTERVENED', assessments: [] }); + const result = await screenWithGuardrail('text', 'TASK001') as GuardrailScreeningResult; + expect(result.action).toBe('GUARDRAIL_INTERVENED'); + expect(result.assessments).toBeUndefined(); }); test('throws GuardrailScreeningError on Bedrock error (fail-closed)', async () => { @@ -1066,7 +1133,7 @@ describe('hydrateContext — 
guardrail screening', () => { .mockResolvedValueOnce({ ok: true, json: async () => ([]) }); } - test('returns guardrail_blocked when PR context is blocked', async () => { + test('returns guardrail_blocked when PR context is blocked (no assessment details)', async () => { mockPrFetch(); mockBedrockSend.mockResolvedValueOnce({ action: 'GUARDRAIL_INTERVENED' }); @@ -1080,6 +1147,21 @@ describe('hydrateContext — guardrail screening', () => { expect(result.version).toBe(1); }); + test('returns enriched guardrail_blocked with assessment details for PR task', async () => { + mockPrFetch(); + mockBedrockSend.mockResolvedValueOnce({ + action: 'GUARDRAIL_INTERVENED', + assessments: [{ + contentPolicy: { + filters: [{ type: 'PROMPT_ATTACK', confidence: 'HIGH', action: 'BLOCKED' }], + }, + }], + }); + + const result = await hydrateContext(basePrTask as any); + expect(result.guardrail_blocked).toBe('PR context blocked by content policy: CONTENT/PROMPT_ATTACK (HIGH)'); + }); + test('proceeds normally when PR context passes guardrail', async () => { mockPrFetch(); mockBedrockSend.mockResolvedValueOnce({ action: 'NONE' }); @@ -1116,7 +1198,7 @@ describe('hydrateContext — guardrail screening', () => { pr_number: 20, }; const result = await hydrateContext(prReviewTask as any); - expect(result.guardrail_blocked).toBe('PR context blocked by content policy'); + expect(result.guardrail_blocked).toMatch(/^PR context blocked by content policy/); expect(mockBedrockSend).toHaveBeenCalledTimes(1); }); @@ -1168,6 +1250,26 @@ describe('hydrateContext — guardrail screening', () => { expect(mockBedrockSend).toHaveBeenCalledTimes(1); }); + test('returns enriched guardrail_blocked with assessment details for new_task', async () => { + mockIssueFetch(); + mockBedrockSend.mockResolvedValueOnce({ + action: 'GUARDRAIL_INTERVENED', + assessments: [{ + contentPolicy: { + filters: [{ type: 'HATE', confidence: 'MEDIUM', action: 'BLOCKED' }], + }, + topicPolicy: { + topics: [{ name: 'FINANCIAL_ADVICE', 
action: 'BLOCKED' }], + }, + }], + }); + + const result = await hydrateContext({ ...baseNewTask, issue_number: 42 } as any); + expect(result.guardrail_blocked).toBe( + 'Task context blocked by content policy: CONTENT/HATE (MEDIUM), TOPIC/FINANCIAL_ADVICE', + ); + }); + test('proceeds normally when new_task issue context passes guardrail', async () => { mockIssueFetch(); mockBedrockSend.mockResolvedValueOnce({ action: 'NONE' }); diff --git a/cdk/test/handlers/shared/repo-config.test.ts b/cdk/test/handlers/shared/repo-config.test.ts index a40e164..bc79f8d 100644 --- a/cdk/test/handlers/shared/repo-config.test.ts +++ b/cdk/test/handlers/shared/repo-config.test.ts @@ -125,6 +125,21 @@ describe('loadRepoConfig', () => { } }); + test('returns cedar_policies when present in config', async () => { + const policies = ['forbid (principal, action, resource) when { resource == Agent::Tool::"Bash" };']; + const config = { + repo: 'org/repo', + status: 'active', + onboarded_at: '2024-01-01T00:00:00Z', + updated_at: '2024-01-01T00:00:00Z', + cedar_policies: policies, + }; + mockSend.mockResolvedValueOnce({ Item: config }); + + const result = await loadRepoConfig('org/repo'); + expect(result?.cedar_policies).toEqual(policies); + }); + test('throws on DynamoDB error', async () => { mockSend.mockRejectedValueOnce(new Error('AccessDeniedException')); await expect(loadRepoConfig('org/repo')).rejects.toThrow( diff --git a/docs/design/AGENT_HARNESS.md b/docs/design/AGENT_HARNESS.md index 6fb7425..9750f22 100644 --- a/docs/design/AGENT_HARNESS.md +++ b/docs/design/AGENT_HARNESS.md @@ -36,7 +36,7 @@ Plugins, skills, and MCP servers are **out of scope for MVP**. The harness must The following are desired properties for the harness; MVP satisfies some and defers others: - **Add additional tools** — In addition to the harness’s built-in tools (e.g. file, shell), the platform must be able to attach more (e.g. via AgentCore Gateway). MVP: satisfied by Gateway (GitHub, web search). 
-- **Deterministic hooks** — Support for deterministic steps or hooks (e.g. pre/post tool execution, validation) so the platform can mix coded logic with the agent loop. The **blueprint execution framework** (see [REPO_ONBOARDING.md](./REPO_ONBOARDING.md#blueprint-execution-framework)) realizes this requirement: the orchestrator runs custom Lambda-backed steps at configurable pipeline phases (`pre-agent`, `post-agent`) with framework-enforced invariants (state transitions, events, cancellation). The agent harness itself does not need to implement hooks — they run at the orchestrator level, outside the agent session. +- **Deterministic hooks** — Support for deterministic steps or hooks (e.g. pre/post tool execution, validation) so the platform can mix coded logic with the agent loop. The **blueprint execution framework** (see [REPO_ONBOARDING.md](./REPO_ONBOARDING.md#blueprint-execution-framework)) realizes this requirement at the orchestrator level: custom Lambda-backed steps at configurable pipeline phases (`pre-agent`, `post-agent`) with framework-enforced invariants (state transitions, events, cancellation). Additionally, the **agent harness implements PreToolUse hooks** (`agent/src/hooks.py`) for real-time tool-call policy enforcement via the Cedar policy engine (`agent/src/policy.py`). The PreToolUse hook evaluates every tool call against Cedar policies before execution: `pr_review` agents are denied `Write`/`Edit` tools, writes to protected paths (`.git/*`) are blocked, and destructive bash commands are denied. The engine is fail-closed — if `cedarpy` is unavailable or evaluation errors occur, all tool calls are denied. Denied decisions emit `POLICY_DECISION` telemetry events. Per-repo custom Cedar policies can be injected via Blueprint `security.cedarPolicies` (e.g. to protect `.github/workflows/*`, which is not blocked by default). - **Plugins / skills / MCP** — Support for plugins, skills, or MCP servers for extensibility. Out of scope for MVP.
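The deny-list semantics described above can be sketched in plain Python. This is an illustrative model only: the function and constant names (`evaluate_tool_call`, `PROTECTED_PATHS`) are hypothetical, and the real `agent/src/policy.py` evaluates Cedar policies via `cedarpy` rather than hardcoded Python rules.

```python
# Illustrative model of the deny-list semantics; names are hypothetical and
# do not match the actual agent/src/policy.py API, which evaluates Cedar.
from fnmatch import fnmatch

PROTECTED_PATHS = [".git/*"]                      # blocked for all agents
DESTRUCTIVE_COMMANDS = ["rm -rf /", "git push --force"]
PR_REVIEW_DENIED_TOOLS = {"Write", "Edit"}


def evaluate_tool_call(task_type: str, tool: str, tool_input: dict) -> str:
    """Return 'allow' or 'deny'; any evaluation error denies (fail-closed)."""
    try:
        if task_type == "pr_review" and tool in PR_REVIEW_DENIED_TOOLS:
            return "deny"
        if tool in ("Write", "Edit"):
            path = tool_input.get("file_path", "")
            if any(fnmatch(path, pattern) for pattern in PROTECTED_PATHS):
                return "deny"
        if tool == "Bash":
            command = tool_input.get("command", "")
            if any(bad in command for bad in DESTRUCTIVE_COMMANDS):
                return "deny"
        return "allow"
    except Exception:
        return "deny"  # fail-closed: never allow on evaluation error
```

The fail-closed `except` branch mirrors the documented behavior: a broken or missing policy engine degrades to denying everything rather than allowing everything.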
- **Access to external memory** — The agent should be able to read and write short- and long-term memory (e.g. AgentCore Memory). MVP: AgentCore Memory is available to the agent via the runtime; the SDK or platform wires it in. - **Session persistence** — Persisting conversation and agent state across session boundaries for crash recovery or resume. MVP: Claude Code SDK has no built-in session manager; durability is via frequent commits. **Update:** AgentCore Runtime persistent session storage (preview) now mounts a per-session filesystem at `/mnt/workspace` that survives stop/resume cycles. Tool caches (mise, npm, Claude Code config) persist across invocations within a session (14-day TTL). Repo clones remain on local ephemeral disk because the S3-backed FUSE mount does not support `flock()`, which breaks build tools like `uv`. See [COMPUTE.md](./COMPUTE.md#session-storage-persistent-filesystem). diff --git a/docs/design/ARCHITECTURE.md b/docs/design/ARCHITECTURE.md index 20443c5..bb763fe 100644 --- a/docs/design/ARCHITECTURE.md +++ b/docs/design/ARCHITECTURE.md @@ -183,7 +183,7 @@ Each concept has a **source-of-truth document** and one or more documents that r | Agent self-feedback | MEMORY.md (Insights section) | EVALUATION.md (Agent self-feedback section) | | Prompt versioning | EVALUATION.md (Prompt versioning) | ORCHESTRATOR.md (data model: `prompt_version`), ROADMAP.md (3b), `src/handlers/shared/prompt-version.ts` | | Extraction prompts | MEMORY.md (Extraction prompts) | EVALUATION.md (references), ROADMAP.md (3b) | -| Tiered tool access | SECURITY.md (Input validation) | REPO_ONBOARDING.md, ROADMAP.md (Iter 5) | +| Tiered tool access / Cedar policy engine | SECURITY.md (Input validation, Policy enforcement), `agent/src/policy.py` | REPO_ONBOARDING.md, ROADMAP.md (Iter 3bis partial, Iter 5 full) | | Memory isolation | SECURITY.md (Memory-specific threats) | MEMORY.md (Requirements), ROADMAP.md (Iter 5) | | Data protection / DR | SECURITY.md (Data protection) 
| — | | 2GB image limit | COMPUTE.md (AgentCore Runtime 2GB) | ROADMAP.md (Iter 5: alternate runtime) | @@ -202,7 +202,7 @@ Each concept has a **source-of-truth document** and one or more documents that r | Agent swarm orchestration | ROADMAP.md (Iter 6) | — | | Adaptive model router | ROADMAP.md (Iter 5) | COST_MODEL.md | | Capability-based security | ROADMAP.md (Iter 5) | SECURITY.md | -| Centralized policy framework | ROADMAP.md (Iter 5), SECURITY.md (Policy enforcement and audit) | ORCHESTRATOR.md, OBSERVABILITY.md | +| Centralized policy framework | ROADMAP.md (Iter 5), SECURITY.md (Policy enforcement and audit), `agent/src/policy.py` (in-process Cedar, partially implemented) | ORCHESTRATOR.md, OBSERVABILITY.md | | GitHub App + AgentCore Token Vault | ROADMAP.md (Iter 3c), SECURITY.md (Authentication) | ORCHESTRATOR.md (context hydration), COMPUTE.md | | Live session replay | ROADMAP.md (Iter 4) | API_CONTRACT.md | | PR iteration task type | API_CONTRACT.md, ORCHESTRATOR.md | USER_GUIDE.md, PROMPT_GUIDE.md, SECURITY.md, AGENT_HARNESS.md | @@ -212,7 +212,7 @@ Each concept has a **source-of-truth document** and one or more documents that r | Memory input hardening (3e Phase 1) | ROADMAP.md (Iter 3e Phase 1, co-ships with 3d) | MEMORY.md, SECURITY.md (Memory-specific threats) | | Per-tool-call structured telemetry | ROADMAP.md (Iter 3d) | SECURITY.md (Mid-execution enforcement), EVALUATION.md, OBSERVABILITY.md | | Mid-execution behavioral monitoring | ROADMAP.md (Iter 5), SECURITY.md (Mid-execution enforcement) | OBSERVABILITY.md | -| Tool-call interceptor (Guardian pattern) | SECURITY.md (Mid-execution enforcement), ROADMAP.md (Iter 5) | REPO_ONBOARDING.md (Blueprint security props) | +| Tool-call interceptor (Guardian pattern) | SECURITY.md (Mid-execution enforcement), `agent/src/hooks.py` + `agent/src/policy.py` (pre-execution implemented), ROADMAP.md (Iter 5 for post-execution) | REPO_ONBOARDING.md (Blueprint security props) | ### Per-repo model selection 
diff --git a/docs/design/ORCHESTRATOR.md b/docs/design/ORCHESTRATOR.md index d8d3727..4c6aec7 100644 --- a/docs/design/ORCHESTRATOR.md +++ b/docs/design/ORCHESTRATOR.md @@ -22,7 +22,7 @@ These boundaries matter whenever you change task submission, the CLI, or the run |---------|-------------------|--------| | REST request/response types | `cdk/src/handlers/shared/types.ts` | **Mirror** in `cli/src/types.ts` for `bgagent` — keep them aligned on every API change. | | HTTP handlers & orchestration code | `cdk/src/handlers/` (e.g. shared `orchestrator.ts`, `create-task-core.ts`, `preflight.ts`) | Colocated Jest tests under `cdk/test/handlers/` and `cdk/test/handlers/shared/`. | -| Agent runtime behavior | `agent/` (`entrypoint.py`, `prompts/`, `system_prompt.py`, Dockerfile) | Consumes task payload and environment set by CDK/Lambda; see `agent/README.md` for PAT, tools, and local run. | +| Agent runtime behavior | `agent/src/` (`entrypoint.py` re-export shim, `pipeline.py`, `runner.py`, `config.py`, `hooks.py`, `policy.py`, `prompts/`, `system_prompt.py`, Dockerfile) | Consumes task payload and environment set by CDK/Lambda; see `agent/README.md` for PAT, tools, and local run. | | User-facing API documentation | `docs/guides/USER_GUIDE.md` (and synced site) | Regenerate Starlight content with `mise //docs:sync` after guide edits. | The orchestrator document describes **behavior** (state machine, admission, cancellation). The TypeScript `types.ts` files are the **schema** the API and CLI share; the agent implements the **work** inside compute. @@ -271,7 +271,7 @@ The orchestrator's `hydrateAndTransition()` function calls `hydrateContext()` (` - **`pr_iteration`** / **`pr_review`**: Fetches the pull request context via `fetchGitHubPullRequest()` — four parallel calls: three REST API calls (PR metadata, conversation comments, changed files) plus one GraphQL query for inline review comments. 
The GraphQL query filters out resolved review threads at fetch time so the agent only sees unresolved feedback. PR metadata includes title, body, head/base refs, and state; the diff summary covers changed files. The PR's `head_ref` is stored as `resolved_branch_name` and `base_ref` as `resolved_base_branch` on the hydrated context. These are used by the orchestrator to update the task record's `branch_name` from the placeholder `pending:pr_resolution` to the actual PR branch. For `pr_review`, if no `task_description` is provided, a default review instruction is used. 3. **Enforces a token budget** on the combined context. Uses a character-based heuristic (~4 chars per token). Default budget: 100K tokens (configurable via `USER_PROMPT_TOKEN_BUDGET` environment variable). When the budget is exceeded, oldest comments are removed first. The `truncated` flag is set in the result. 4. **Assembles the user prompt** based on task type: - - **`new_task`**: A structured markdown document with Task ID, Repository, GitHub Issue section, and Task section. The format mirrors the Python `assemble_prompt()` in `agent/entrypoint.py`. + - **`new_task`**: A structured markdown document with Task ID, Repository, GitHub Issue section, and Task section. The format mirrors the Python `assemble_prompt()` in `agent/src/context.py`. - **`pr_iteration`**: Assembled by `assemblePrIterationPrompt()` — includes PR metadata (number, title, body), the diff summary (changed files and patches), review comments (inline and conversation), and optional user instructions from `task_description`. 5. **Screens through Bedrock Guardrail** (PR tasks; `new_task` when issue content is present): The assembled user prompt is screened through Amazon Bedrock Guardrails (`screenWithGuardrail()`) using the `PROMPT_ATTACK` content filter. For `new_task` tasks without issue content, screening is skipped because the task description was already screened at submission time. 
If the guardrail detects prompt injection, `guardrail_blocked` is set on the result and the orchestrator fails the task. If the Bedrock API is unavailable, a `GuardrailScreeningError` is thrown (fail-closed — unscreened content never reaches the agent). Task descriptions for all task types are screened at submission time in `create-task-core.ts`. 6. **Returns a `HydratedContext` object** containing `version`, `user_prompt`, `issue`, `sources`, `token_estimate`, `truncated`, and for `pr_iteration`/`pr_review` tasks: `resolved_branch_name` and `resolved_base_branch`. diff --git a/docs/design/SECURITY.md b/docs/design/SECURITY.md index 2a9410f..1655300 100644 --- a/docs/design/SECURITY.md +++ b/docs/design/SECURITY.md @@ -119,22 +119,22 @@ The platform enforces policies at multiple points in the task lifecycle. Today, | **Hydration** | Guardrail prompt screening (PR + issue content) | `context-hydration.ts` | `guardrail_blocked` event emitted | | **Hydration** | Budget/quota resolution (3-tier max_turns, 2-tier max_budget_usd) | `orchestrator.ts` (`hydrateAndTransition`) | Values persisted on task record — no policy decision event | | **Hydration** | Token budget for prompt assembly | `context-hydration.ts` | No event emitted | -| **Session** | Tool access control (pr_review restrictions) | `agent/entrypoint.py` | No event emitted | +| **Session** | Tool access control (pr_review restrictions, Cedar deny-list) | `agent/src/hooks.py`, `agent/src/policy.py` (PreToolUse hook + Cedar engine) | `POLICY_DECISION` telemetry event on deny | | **Session** | Budget enforcement (turns, cost) | Claude Agent SDK | Agent SDK enforces; cost in task result | -| **Finalization** | Build/lint verification | `agent/entrypoint.py` | Results in task record and PR body | +| **Finalization** | Build/lint verification | `agent/src/post_hooks.py` | Results in task record and PR body | | **Infrastructure** | DNS Firewall egress allowlist | `dns-firewall.ts`, `agent.ts` (CDK synth) | DNS query 
logs in CloudWatch | | **Infrastructure** | WAF rate limiting | `task-api.ts` (CDK synth) | WAF logs | | **State machine** | Valid transition enforcement | `task-status.ts`, `orchestrator.ts` | DynamoDB conditional writes | ### Audit gaps (planned remediation) -Submission-time policy decisions (validation, onboarding gate, guardrail screening, idempotency) currently return HTTP errors without emitting structured audit events. Budget resolution decisions are persisted but not logged as policy decisions with reason codes. Tool access selection is implicit (hardcoded in agent code) with no audit event. +Submission-time policy decisions (validation, onboarding gate, guardrail screening, idempotency) currently return HTTP errors without emitting structured audit events. Budget resolution decisions are persisted but not logged as policy decisions with reason codes. Tool access is enforced by the Cedar policy engine (`agent/src/policy.py`) via PreToolUse hooks (`agent/src/hooks.py`); denied decisions emit `POLICY_DECISION` telemetry events, but these are not yet part of a unified `PolicyDecisionEvent` schema. **Planned (Iteration 5, Phase 1):** A unified `PolicyDecisionEvent` schema will normalize all policy decisions into structured events with: decision ID, policy name, version, phase, input hash, result, reason codes, and enforcement mode. Enforcement supports three modes: `enforced` (decision is binding — deny blocks, allow proceeds), `observed` (decision is logged but not enforced — shadow mode for safe rollout), and `steered` (decision modifies the input or output rather than blocking — redact PII, sanitize paths, mask secrets). New rules deploy in `observed` mode first; operators validate false-positive rates via `PolicyDecisionEvent` logs, then promote to `enforced` or `steered`. This observe-before-enforce workflow enables gradual rollout of security policies without risking false blocks on legitimate tasks. 
See [ROADMAP.md Iteration 5](../guides/ROADMAP.md) for the full centralized policy framework design. ### Policy resolution and authorization (planned) -**Planned (Iteration 5, Phase 2):** Cedar as the single policy engine for both **operational policy** (budget/quota/tool-access resolution, tool-call interception rules) and **authorization** (multi-tenant access control, extended when multi-user/team lands). Cedar replaces the scattered merge logic across handlers with a unified policy evaluation. A thin `policy.ts` adapter translates Cedar decisions into `PolicyDecision` objects consumed by existing handlers. Cedar is preferred over OPA: it is AWS-native, has formal verification guarantees, integrates with AgentCore Gateway, and policies can be evaluated in-process via the Cedar SDK without a separate service dependency. Cedar's binary permit/forbid model supports the three enforcement modes (`enforced`, `observed`, `steered`) via a **virtual-action classification pattern**: the interceptor evaluates against multiple virtual actions (`invoke_tool`, `invoke_tool_steered`, `invoke_tool_denied`) and uses the first permitted action to determine the mode. For example, `forbid(principal, action == Action::"invoke_tool", resource) when { resource.path like ".github/workflows/*" && principal.capability_tier != "elevated" }` blocks the call, while `permit(principal, action == Action::"invoke_tool_steered", resource) when { context.output_contains_pii }` triggers PII redaction instead of blocking. Cedar policies are stored in Amazon Verified Permissions and loaded at hydration/session-start time — policy changes take effect without CDK redeployment. When multi-user/team support lands, the same Cedar policy store expands to cover tenant-specific authorization (user/team/repo scoping, team budgets, risk-based approval requirements). 
+**Partially implemented / Planned (Iteration 5, Phase 2):** Cedar as the single policy engine for both **operational policy** (budget/quota/tool-access resolution, tool-call interception rules) and **authorization** (multi-tenant access control, extended when multi-user/team lands). **Current state:** An in-process Cedar policy engine (`agent/src/policy.py`, using `cedarpy`) enforces a deny-list model for tool-call governance: `pr_review` agents are forbidden from using `Write` and `Edit` tools, writes to `.git/*` internals are blocked for all agents, and destructive bash commands (`rm -rf /`, `git push --force`) are denied. The engine is fail-closed — if `cedarpy` is unavailable or evaluation errors occur, all tool calls are denied. Per-repo custom Cedar policies can be injected via Blueprint `security.cedarPolicies`. The PreToolUse hook (`agent/src/hooks.py`) integrates the policy engine with the Claude Agent SDK's hook system, and denied decisions emit `POLICY_DECISION` telemetry events via `agent/src/telemetry.py`. **Planned:** Cedar replaces the scattered merge logic across TypeScript handlers with a unified policy evaluation. A thin `policy.ts` adapter translates Cedar decisions into `PolicyDecision` objects consumed by existing handlers. Cedar is preferred over OPA: it is AWS-native, has formal verification guarantees, integrates with AgentCore Gateway, and policies can be evaluated in-process via the Cedar SDK without a separate service dependency. Cedar's binary permit/forbid model supports the three enforcement modes (`enforced`, `observed`, `steered`) via a **virtual-action classification pattern**: the interceptor evaluates against multiple virtual actions (`invoke_tool`, `invoke_tool_steered`, `invoke_tool_denied`) and uses the first permitted action to determine the mode. 
For example, `forbid(principal, action == Action::"invoke_tool", resource) when { resource.path like ".github/workflows/*" && principal.capability_tier != "elevated" }` blocks the call, while `permit(principal, action == Action::"invoke_tool_steered", resource) when { context.output_contains_pii }` triggers PII redaction instead of blocking. Cedar policies will be stored in Amazon Verified Permissions and loaded at hydration/session-start time — policy changes take effect without CDK redeployment. When multi-user/team support lands, the same Cedar policy store expands to cover tenant-specific authorization (user/team/repo scoping, team budgets, risk-based approval requirements). ### Mid-execution enforcement (planned) @@ -142,7 +142,7 @@ Today, once an agent session starts, the orchestrator can only observe it via po **Planned (Iteration 5):** Two complementary mechanisms address this gap: -1. **Tool-call interceptor (Guardian pattern)** — A policy-evaluation layer in the agent harness (`entrypoint.py`) that sits between the agent SDK's tool-call decision and actual tool execution. Evaluation is split into two stages: a **pre-execution stage** that validates tool inputs before the tool runs (file path deny patterns, bash command allowlist per capability tier, cost threshold checks, and per-repo rules from Blueprint `security` configuration) and blocks disallowed operations before they execute, and a **post-execution stage** that screens tool outputs after the tool runs (PII patterns in file content, secrets in command output, sensitive data leakage) and can redact or flag content before it re-enters the agent context. The interceptor can allow, modify (e.g. redact secrets from output), or deny tool calls. Denied calls return a structured error to the agent, which can retry with a different approach. This follows the Guardian interceptor pattern (Hu et al. 
2025) — enforcement happens at tool-call time, not before the session starts (input guardrails) or after it ends (validation pipeline). Combined with per-tool-call structured telemetry (Iteration 3d), every interceptor decision is logged as a `PolicyDecisionEvent`. +1. **Tool-call interceptor (Guardian pattern)** — A policy-evaluation layer in the agent harness (`agent/src/hooks.py` + `agent/src/policy.py`) that sits between the agent SDK's tool-call decision and actual tool execution. **Current state:** The pre-execution stage is implemented: a Cedar-based `PolicyEngine` evaluates tool calls via a PreToolUse hook before execution. The deny-list model blocks `Write`/`Edit` for `pr_review` tasks, protects `.git/*` internals, and denies destructive bash commands. The engine is fail-closed (denies on error or missing `cedarpy`). Per-repo custom Cedar policies are supported via Blueprint `security.cedarPolicies`. **Planned extensions:** Evaluation is split into two stages: a **pre-execution stage** (implemented) that validates tool inputs before the tool runs (tool-level deny-list via Cedar policies, file path deny patterns for protected paths, bash command deny patterns for destructive commands, and per-repo custom Cedar policies from Blueprint `security.cedarPolicies`) and blocks disallowed operations before they execute, and a **post-execution stage** (planned) that screens tool outputs after the tool runs (PII patterns in file content, secrets in command output, sensitive data leakage) and can redact or flag content before it re-enters the agent context. The interceptor can allow, modify (e.g. redact secrets from output), or deny tool calls. Denied calls return a structured error to the agent, which can retry with a different approach. This follows the Guardian interceptor pattern (Hu et al. 2025) — enforcement happens at tool-call time, not before the session starts (input guardrails) or after it ends (validation pipeline). 
Denied decisions emit `POLICY_DECISION` telemetry events via `agent/src/telemetry.py`. Combined with per-tool-call structured telemetry (Iteration 3d), every interceptor decision will be logged as a `PolicyDecisionEvent`. 2. **Behavioral circuit breaker** — Lightweight monitoring of tool-call patterns within a session: call frequency (calls per minute), cumulative cost, repeated failures on the same tool, and file mutation rate. When metrics exceed configurable thresholds (e.g. >50 tool calls/minute, >$10 cumulative cost, >5 consecutive failures), the circuit breaker pauses or terminates the session and emits a `circuit_breaker_triggered` event. This catches runaway loops and cost explosions before the hard session timeout. Thresholds are configurable per-repo via Blueprint `security` props. @@ -257,7 +257,7 @@ AgentCore Memory has **no native backup mechanism**. This is a significant gap f - **No customer-managed KMS** — all encryption at rest uses AWS-managed keys. Customer-managed KMS can be added if required by compliance policy. - **CORS is fully open** — `ALL_ORIGINS` is configured for CLI consumption. Restrict origins when exposing browser clients. - **DNS Firewall IP bypass** — DNS Firewall does not block direct IP connections (see [NETWORK_ARCHITECTURE.md](./NETWORK_ARCHITECTURE.md#dns-firewall)). -- **No tiered tool access** — all agent sessions currently have the same tool set. +- **Partial tool access control** — Cedar-based policy enforcement (`agent/src/policy.py`) provides per-task-type tool restrictions (e.g. `pr_review` agents cannot use `Write`/`Edit`), `.git/*` write protection, and destructive command blocking. `.github/workflows/*` is not blocked by default because agents may legitimately need to modify CI workflows; operators can add workflow protection via Blueprint `security.cedarPolicies` if needed. Per-repo custom Cedar policies are supported via Blueprint `security.cedarPolicies`. 
**Important:** custom policies for `write_file` and `execute_bash` actions must use `context.file_path` / `context.command` in `when` clauses — not `resource ==` matching — because the engine uses fixed sentinel resource IDs to avoid Cedar entity UID parsing failures on special characters. `invoke_tool` actions use the real tool name as resource ID, so `resource ==` matching works for tool-level policies. Full tiered tool access (capability tiers, MCP server allowlisting) is planned for Iteration 5. ## Reference diff --git a/docs/guides/ROADMAP.md b/docs/guides/ROADMAP.md index 2e4efdc..e06a3fa 100644 --- a/docs/guides/ROADMAP.md +++ b/docs/guides/ROADMAP.md @@ -134,8 +134,8 @@ These practices apply continuously across iterations and are not treated as one- - [x] **Orchestrator fallback episode observability** — `writeMinimalEpisode` return value is now checked and logged: `logger.warn('Fallback episode write returned false')` when the inner function reports failure via its return value (previously discarded). New test `logs warning when writeMinimalEpisode returns false` covers this path. - [x] **Python unit tests** — Added pytest-based unit tests (`agent/tests/`) for pure functions: `slugify()`, `redact_secrets()`, `format_bytes()`, `truncate()`, `build_config()`, `assemble_prompt()`, `_discover_project_config()`, `_build_system_prompt()` (entrypoint), `_validate_repo()` (memory), `_now_iso()`, `_build_logs_url()` (task_state). Added pytest to dev dependency group with `pythonpath` config for in-tree imports. -- [x] **Decompose entrypoint.py** — Extracted four named subfunctions from `run_task()` and `run_agent()`: `_build_system_prompt()` (system prompt assembly + memory context), `_discover_project_config()` (repo config scanning), `_write_memory()` (episode + learnings writes), `_setup_agent_env()` (Bedrock/OTEL env var setup). All functions stay in `entrypoint.py` (no import changes). `run_task()` and `run_agent()` now call the extracted functions. 
-- [x] **Deprecate dual prompt assembly** — Added deprecation docstring to `assemble_prompt()` clarifying that production uses the orchestrator's `assembleUserPrompt()` via `hydrated_context["user_prompt"]`. Python version retained only for local batch mode and dry-run mode. No code deletion — just documentation of the intended flow. +- [x] **Decompose entrypoint.py** — Initially extracted four named subfunctions (`_build_system_prompt()`, `_discover_project_config()`, `_write_memory()`, `_setup_agent_env()`). Subsequently, the agent code was further decomposed into a full `agent/src/` module structure: `config.py` (configuration and validation), `models.py` (Pydantic data models and enumerations), `pipeline.py` (task orchestration), `runner.py` (agent execution), `context.py` (context hydration), `prompt_builder.py` (prompt assembly), `hooks.py` (PreToolUse policy hooks), `policy.py` (Cedar policy engine), `post_hooks.py` (deterministic post-hooks), `repo.py` (repository setup), `shell.py` (utilities), `telemetry.py` (metrics and trajectory). The original `entrypoint.py` is now a re-export shim for backward compatibility with tests. +- [x] **Deprecate dual prompt assembly** — Added deprecation docstring to `assemble_prompt()` clarifying that production uses the orchestrator's `assembleUserPrompt()` via `HydratedContext.user_prompt` (validated from the incoming JSON). Python version retained only for local batch mode and dry-run mode. No code deletion — just documentation of the intended flow. - [x] **Graceful thread drain in server.py** — Added `_active_threads` list for tracking background threads, `_drain_threads(timeout=300)` function that joins all alive threads, registered via `@app.on_event("shutdown")` (FastAPI lifecycle — uvicorn translates SIGTERM) and `atexit.register()` as backup. Thread list is cleaned on each new invocation. 
- [x] **Remove dead QUEUED state** — Removed `QUEUED` from `TaskStatus`, `VALID_TRANSITIONS`, and `ACTIVE_STATUSES` in `task-status.ts`. Updated SUBMITTED transitions to `[HYDRATING, FAILED, CANCELLED]`. Removed QUEUED from all tests (count assertions, cancel test, validation test) and documentation (ORCHESTRATOR.md, OBSERVABILITY.md, API_CONTRACT.md, ARCHITECTURE.md). - [x] **Hardening fixes (review round)** — Thread race in `server.py` (track thread before `start()`), defensive `.get()` on `ClientError.response` in `task_state.py`, wired `fallback_error` through `orchestrator.ts` (warning log + event metadata), TOCTOU `ConditionExpression` on reconciler update, per-user error isolation in reconciler, `TaskStatusType` propagation across types/orchestrator/memory, graduated trajectory writer failure, subprocess timeouts, FastAPI lifespan pattern, `decrementConcurrency` CCF distinction. @@ -193,7 +193,7 @@ These practices apply continuously across iterations and are not treated as one- - **Review feedback memory loop (Tier 2)** — Capture PR review comments via GitHub webhook, extract actionable rules via LLM, and persist them as searchable memory so the agent internalizes reviewer preferences over time. This is the primary feedback loop between human reviewers and the agent — no shipping coding agent does this today. Requires a GitHub webhook → API Gateway → Lambda pipeline (separate from agent execution). Two types of extracted knowledge: repo-level rules ("don't use `any` types") and task-specific corrections. See [MEMORY.md](../design/MEMORY.md) (Review feedback memory) and [SECURITY.md](../design/SECURITY.md) (prompt injection via review comments). - **PR outcome tracking** — Track whether agent-created PRs are merged, revised, or rejected via GitHub webhooks (`pull_request.closed` events). A merged PR is a positive signal; closed-without-merge is a negative signal. 
These outcome signals feed into the evaluation pipeline and enable the episodic memory to learn which approaches succeed. See [MEMORY.md](../design/MEMORY.md) (PR outcome signals) and [EVALUATION.md](../design/EVALUATION.md). - **Evaluation pipeline (basic)** — Automated evaluation of agent runs: failure categorization (reasoning errors, missed instructions, missing tests, timeouts, tool failures). Results are stored and surfaced in observability dashboards. Basic version: rules-based analysis of task outcomes and agent responses. Track memory effectiveness metrics: first-review merge rate, revision cycles, CI pass rate on first push, review comment density, and repeated mistakes. Advanced version (ML-based trace analysis, A/B prompt comparison, feedback loop into prompts) is deferred to Iteration 5. See [EVALUATION.md](../design/EVALUATION.md) and [OBSERVABILITY.md](../design/OBSERVABILITY.md). -- **Per-tool-call structured telemetry** — Instrument the agent harness (`entrypoint.py`) to emit structured events for every tool call: tool name, input hash (SHA-256), output hash, duration, cost attribution, and result status. Events flow through the existing `create_event` path and are surfaced in CloudWatch. This is foundational for: (a) the evaluation pipeline (tool-call-level success/failure analysis), (b) the centralized policy framework Phase 1 (tool calls become `PolicyDecisionEvent` sources in Iteration 5), and (c) future mid-execution policy enforcement (tool-call interceptor in Iteration 5). Without per-tool-call telemetry, the platform can only observe sessions as opaque black boxes — model invocation logs capture LLM reasoning but not the tool execution that connects reasoning to action. Informed by the Guardian system's tool-call interception architecture (Hu et al. 2025). See [OBSERVABILITY.md](../design/OBSERVABILITY.md) and [SECURITY.md](../design/SECURITY.md) (Mid-execution enforcement). 
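A sketch of what one such structured tool-call event could contain (field names are illustrative, not the actual `create_event` schema; hashing rather than storing raw input/output avoids leaking payloads into logs):

```python
import hashlib
import json
import time

def tool_call_event(tool_name: str, tool_input: dict, output: str,
                    started_at: float, cost_usd: float, status: str) -> dict:
    """Build a per-tool-call event: name, input/output hashes, duration, cost.

    Field names are illustrative; the real events flow through the
    existing create_event path.
    """
    def sha256(obj) -> str:
        # Canonicalize dicts before hashing so equal inputs hash equally.
        data = obj if isinstance(obj, str) else json.dumps(obj, sort_keys=True)
        return hashlib.sha256(data.encode()).hexdigest()

    return {
        "event_type": "tool_call",
        "tool_name": tool_name,
        "input_hash": sha256(tool_input),    # hash, not raw input
        "output_hash": sha256(output),
        "duration_ms": int((time.time() - started_at) * 1000),
        "cost_usd": cost_usd,
        "status": status,                    # e.g. "success" | "error"
    }
```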
+- **Per-tool-call structured telemetry** — Instrument the agent harness (`agent/src/telemetry.py`) to emit structured events for every tool call: tool name, input hash (SHA-256), output hash, duration, cost attribution, and result status. Events flow through the existing `create_event` path and are surfaced in CloudWatch. This is foundational for: (a) the evaluation pipeline (tool-call-level success/failure analysis), (b) the centralized policy framework Phase 1 (tool calls become `PolicyDecisionEvent` sources in Iteration 5), and (c) future mid-execution policy enforcement (tool-call interceptor in Iteration 5). Without per-tool-call telemetry, the platform can only observe sessions as opaque black boxes — model invocation logs capture LLM reasoning but not the tool execution that connects reasoning to action. Informed by the Guardian system's tool-call interception architecture (Hu et al. 2025). See [OBSERVABILITY.md](../design/OBSERVABILITY.md) and [SECURITY.md](../design/SECURITY.md) (Mid-execution enforcement). **Prerequisite: 3e Phase 1 (input hardening) ships with this iteration.** The review feedback memory loop writes attacker-controlled content (PR review comments) to persistent memory. Without content sanitization, provenance tagging, and integrity hashing (3e Phase 1), this creates a known attack vector — poisoned review comments stored as persistent rules that influence all future tasks on the repo. 3e Phase 1 items (memory content sanitization, GitHub issue input sanitization, source provenance on memory writes, content integrity hashing) must be implemented before or concurrently with the review feedback pipeline. See [SECURITY.md](../design/SECURITY.md) (Prompt injection via PR review comments). @@ -281,17 +281,19 @@ Deep research identified **9 memory-layer security gaps** in the current archite - **Formal orchestrator verification (TLA+)** — Add a formal specification of the orchestrator in TLA+ and verify it with TLC model checking. 
Scope includes the task state machine (8 states, valid transitions, terminal states), concurrency admission control (atomic increment + max check), cancellation races (cancel arriving during any orchestration step), reconciler/orchestrator interleavings (counter drift correction while tasks are active), and the polling loop (agent writes terminal status, orchestrator observes and finalizes). Define invariants such as valid-state progression, no illegal transitions, and repo-level safety constraints (for example, at most one active `RUNNING` task per repo when configured). Keep the spec aligned with `src/constructs/task-status.ts` and orchestrator docs so regressions surface as model-check counterexamples before production. **Note:** The TLA+ specification can be started earlier (e.g. during Iteration 3d) since the state machine and concurrency model are already stable. The spec is documentation that also catches bugs — writing it does not depend on Iteration 5 features. Consider starting the state machine and cancellation models as part of the ongoing engineering practice. - **Guardrails (output and tool-call) with interceptor pattern** — Extend Bedrock Guardrails from input screening (implemented in Iteration 3c) to **output filtering** and **agent tool-call guardrails**. Apply content filters to model responses during agent execution, restrict sensitive content generation, and enforce organizational policies (e.g. "do not modify files in `/infrastructure`"). Guardrails configuration can be per-repo (via onboarding) or platform-wide. - **Tool-call interceptor (Guardian pattern):** Implement a policy-evaluation layer in the agent harness (`entrypoint.py`) that intercepts tool calls between the agent SDK's decision and actual execution — enforcement happens at tool-call time, not before the session starts (input guardrails) or after it ends (validation pipeline). 
Each tool call is evaluated against a policy: file path restrictions (deny writes to `.github/workflows/`, `**/migrations/**`), bash command allowlist per capability tier, cost threshold checks, and per-repo rules from Blueprint `security` configuration. The interceptor can **allow**, **modify** (e.g. redact secrets from output), or **deny** (return structured error to agent, which retries with a different approach). Evaluation is split into two stages: a **pre-execution stage** that validates tool inputs before the tool runs (file path deny patterns, bash command allowlist, cost threshold checks) and blocks disallowed operations before they execute, and a **post-execution stage** that screens tool outputs after the tool runs (PII patterns in file content, secrets in command output, sensitive data leakage) and can redact or flag content before it re-enters the agent context. Combined with per-tool-call structured telemetry (Iteration 3d), every interceptor decision is logged as a `PolicyDecisionEvent`. This pattern is informed by the Guardian system (Hu et al. 2025) — a "guardian agent" that monitors and can intercept tool calls before execution. See [SECURITY.md](../design/SECURITY.md) (Mid-execution enforcement). + **Tool-call interceptor (Guardian pattern) — partially implemented:** A Cedar-based policy engine (`agent/src/policy.py`) with PreToolUse hooks (`agent/src/hooks.py`) intercepts tool calls between the agent SDK's decision and actual execution. **Current state (pre-execution stage implemented):** Every tool call is evaluated against Cedar deny-list policies: `pr_review` agents are denied `Write`/`Edit` tools, writes to protected paths (`.github/workflows/*`, `.git/*`) are blocked, and destructive bash commands (`rm -rf /`, `git push --force`) are denied. The engine is fail-closed — if `cedarpy` is unavailable or evaluation errors occur, all tool calls are denied. Per-repo custom Cedar policies are supported via Blueprint `security.cedarPolicies`. 
Denied decisions emit `POLICY_DECISION` telemetry events via `agent/src/telemetry.py`. **Remaining work:** Extend to a **post-execution stage** that screens tool outputs after the tool runs (PII patterns in file content, secrets in command output, sensitive data leakage) and can redact or flag content before it re-enters the agent context. Add cost threshold checks, bash command allowlist per capability tier, and `modify` mode (e.g. redact secrets from output). Combined with per-tool-call structured telemetry (Iteration 3d), every interceptor decision will be logged as a `PolicyDecisionEvent`. This pattern is informed by the Guardian system (Hu et al. 2025) — a "guardian agent" that monitors and can intercept tool calls before execution. See [SECURITY.md](../design/SECURITY.md) (Mid-execution enforcement). - **Mid-execution behavioral monitoring** — Lightweight monitoring of agent behavior within a running session, filling the gap between input guardrails (pre-session) and validation (post-session). A **behavioral circuit breaker** in the agent harness tracks aggregate metrics: tool-call frequency (calls per minute), cumulative session cost, repeated failures on the same tool, and file mutation rate. When metrics exceed configurable thresholds (e.g. >50 tool calls/minute, >$10 cumulative cost, >5 consecutive failures on the same tool), the circuit breaker pauses or terminates the session and emits a `circuit_breaker_triggered` event. This catches runaway loops, cost explosions, and stuck agents before the hard session timeout. Thresholds are configurable per-repo via Blueprint `security` props. The circuit breaker operates within the existing agent harness — no sidecar process or external service required. For ABCA's single-agent-per-task model, embedded monitoring is simpler and more reliable than an external sidecar; sidecar architecture becomes relevant when multi-agent orchestration lands (Iteration 6). 
See [SECURITY.md](../design/SECURITY.md) (Mid-execution enforcement). -- **Centralized policy framework** — Consolidate the platform's distributed policy decisions into a unified policy framework and audit layer. Policy logic today is scattered across 20+ files (input validation in `validation.ts` and `create-task-core.ts`, admission control in `orchestrator.ts`, guardrail screening in `context-hydration.ts`, budget resolution across `validation.ts`/`orchestrator.ts`/`entrypoint.py`, tool access in `entrypoint.py`, network egress in `dns-firewall.ts`/`agent.ts`, state transitions in `task-status.ts`/`orchestrator.ts`). This fragmentation makes it difficult to audit what policies exist, verify consistency, or change policy behavior without touching multiple files. +- **Centralized policy framework** — Consolidate the platform's distributed policy decisions into a unified policy framework and audit layer. Policy logic today is scattered across 20+ files (input validation in `validation.ts` and `create-task-core.ts`, admission control in `orchestrator.ts`, guardrail screening in `context-hydration.ts`, budget resolution across `validation.ts`/`orchestrator.ts`/`agent/src/config.py`, tool access in `agent/src/policy.py` + `agent/src/hooks.py`, network egress in `dns-firewall.ts`/`agent.ts`, state transitions in `task-status.ts`/`orchestrator.ts`). The agent-side Cedar policy engine (`agent/src/policy.py`) is a first step — it provides in-process tool-call governance with fail-closed semantics and per-repo custom policies. The full framework extends this to the TypeScript orchestrator side. This fragmentation makes it difficult to audit what policies exist, verify consistency, or change policy behavior without touching multiple files. **Phase 1 — Policy audit normalization:** Define a stable `PolicyDecisionEvent` schema: `decision_id` (ULID), `policy_name` (e.g. 
`admission.concurrency`, `budget.max_turns`, `guardrail.input_screening`), `policy_version`, `phase` (`submission` | `admission` | `pre_flight` | `hydration` | `session_start` | `session` | `finalization`), `input_hash` (SHA-256 of the decision input for reproducibility), `result` (`allow` | `deny` | `modify`), `reason_codes[]`, `enforcement` (`enforced` | `observed` | `steered`), and `task_id`. The three enforcement modes serve distinct purposes: `enforced` means the decision is binding (deny blocks, allow proceeds), `observed` means the decision is logged but not enforced (shadow mode for safe rollout), and `steered` means the decision modifies the input or output rather than blocking (redact PII, sanitize paths, mask secrets). New rules deploy in `observed` mode first; operators validate false-positive rates via `PolicyDecisionEvent` logs, then promote to `enforced` or `steered`. This observe-before-enforce workflow enables gradual rollout of security policies without risking false blocks on legitimate tasks. Emit a `policy_decision` event via `emitTaskEvent` at every existing enforcement point. Today, some decisions emit events (`admission_rejected`, `preflight_failed`, `guardrail_blocked`) while others silently return HTTP errors — normalize them all. This is pure instrumentation of existing code paths; no behavior change. - **Phase 2 — Cedar policy engine:** + **Phase 2 — Cedar policy engine (partially implemented):** Introduce **Cedar** (not OPA) as the single policy engine for both **operational policy** (budget/quota/tool-access resolution, tool-call interception rules) and **authorization** (extended for multi-tenant access control when multi-user/team support lands). Cedar is AWS-native, has formal verification guarantees, and integrates with AgentCore Gateway. 
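The schema above could be sketched as a dataclass (Python here for brevity; the real schema lives on the TypeScript orchestrator side, and the ULID shown in usage is a placeholder):

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import List

@dataclass
class PolicyDecisionEvent:
    """Sketch of the PolicyDecisionEvent schema described in the text."""
    decision_id: str        # ULID in the real schema
    policy_name: str        # e.g. "admission.concurrency", "budget.max_turns"
    policy_version: str
    phase: str              # submission | admission | pre_flight | hydration | ...
    input_hash: str         # SHA-256 of the decision input, for reproducibility
    result: str             # allow | deny | modify
    enforcement: str        # enforced | observed | steered
    task_id: str
    reason_codes: List[str] = field(default_factory=list)

def hash_decision_input(decision_input: dict) -> str:
    """Stable SHA-256: the same input always reproduces the same hash."""
    canonical = json.dumps(decision_input, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()
```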
- **Policy resolution:** Cedar replaces the scattered budget/quota/tool-access merge logic (3-tier `max_turns` resolution, 2-tier `max_budget_usd` resolution, tool access determination in `entrypoint.py`, per-repo configuration merge in `loadBlueprintConfig`) with a unified policy evaluation. A thin `policy.ts` adapter module translates Cedar decisions into `PolicyDecision` objects (`PolicyInput` → Cedar evaluation → `PolicyDecision` with computed budgets, tool profile, risk tier, redaction directives) consumed by existing handlers — no new service, no network hop. Input validation (format checks, range checks) remains at the input boundary; Cedar handles resolution and policy composition. + **Current state:** An in-process Cedar policy engine is implemented in the agent harness (`agent/src/policy.py`) using `cedarpy` for tool-call governance. The engine enforces a deny-list model: `pr_review` agents are forbidden from `Write`/`Edit`, writes to `.github/workflows/*` and `.git/*` are blocked, and destructive bash commands are denied. The engine is fail-closed (denies on error, `cedarpy` unavailability, or Cedar `NoDecision`). Per-repo custom Cedar policies can be injected via Blueprint `security.cedarPolicies` and are validated at initialization. Task types are validated against the `TaskType` enum (`agent/src/models.py`). Denied decisions emit `POLICY_DECISION` telemetry events. + + **Remaining work:** Extend Cedar to the TypeScript orchestrator side. Cedar replaces the scattered budget/quota/tool-access merge logic (3-tier `max_turns` resolution, 2-tier `max_budget_usd` resolution, per-repo configuration merge in `loadBlueprintConfig`) with a unified policy evaluation. A thin `policy.ts` adapter module translates Cedar decisions into `PolicyDecision` objects (`PolicyInput` → Cedar evaluation → `PolicyDecision` with computed budgets, tool profile, risk tier, redaction directives) consumed by existing handlers — no new service, no network hop. 
Input validation (format checks, range checks) remains at the input boundary; Cedar handles resolution and policy composition. Migrate from in-process `cedarpy` to Amazon Verified Permissions for runtime-configurable policies. **Operational tool-call policies** use a **virtual-action classification pattern** to support the three enforcement modes (`enforced`, `observed`, `steered`) within Cedar's binary permit/forbid model. Instead of asking Cedar "allow or deny?", the interceptor evaluates against multiple virtual actions (`invoke_tool`, `invoke_tool_steered`, `invoke_tool_denied`) and uses the first permitted action to determine the mode. For example: `forbid(principal, action == Action::"invoke_tool", resource) when { resource.path like ".github/workflows/*" && principal.capability_tier != "elevated" }` blocks the call, while `permit(principal, action == Action::"invoke_tool_steered", resource) when { context.output_contains_pii }` triggers PII redaction. This keeps Cedar doing what it does best (binary decisions with formal verification) while the interceptor interprets the combination of decisions as allow/steer/deny. 
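A minimal sketch of the interceptor-side interpretation of the virtual-action pattern (the priority ordering of the three actions is an assumption of this sketch, not a documented contract):

```python
from typing import Callable

# Virtual actions checked in priority order; the first permitted one
# determines the enforcement mode.
VIRTUAL_ACTIONS = [
    ("invoke_tool_denied", "deny"),
    ("invoke_tool_steered", "steer"),
    ("invoke_tool", "allow"),
]

def resolve_enforcement_mode(is_permitted: Callable[[str], bool]) -> str:
    """Map Cedar's binary permit/forbid answers onto allow/steer/deny.

    `is_permitted(action)` stands for one Cedar authorization query per
    virtual action.
    """
    for action, mode in VIRTUAL_ACTIONS:
        if is_permitted(action):
            return mode
    return "deny"  # nothing permitted: fail closed
```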
@@ -342,9 +344,9 @@ Deep research identified **9 memory-layer security gaps** in the current archite - **Iteration 3c** — Per-repo GitHub App credentials via AgentCore Token Vault (`CfnWorkloadIdentity` + Token Vault credential provider for automatic token refresh; agent uses `GetWorkloadAccessToken` for long-running sessions; sets pattern for GitLab/Jira/Slack integrations), orchestrator pre-flight checks (fail-closed before session start), persistent session storage for select caches (AgentCore Runtime `/mnt/workspace` mount for npm/Claude config; mise/uv/repo on local disk due to FUSE `flock()` limitation), pre-execution task risk classification (model/limits/approval policy selection), tiered validation pipeline (tool validation, code quality analysis, post-execution risk/blast radius analysis), PR risk level, PR review task type (`pr_review` — read-only structured review with tool restriction, defense-in-depth enforcement, CLI `--review-pr` flag), input guardrail screening (Bedrock Guardrails, fail-closed — including GitHub issue content for `new_task`), multi-modal input. - **Iteration 3d** — Review feedback memory loop (Tier 2), PR outcome tracking, evaluation pipeline (basic), per-tool-call structured telemetry (tool name, input/output hash, duration, cost — foundational for evaluation and Iteration 5 policy enforcement). Co-ships with 3e Phase 1 (memory input hardening: content sanitization, provenance tagging, integrity hashing) as a prerequisite for safely writing attacker-controlled content to memory. 
- **Iteration 3e** — Memory security and integrity: Phase 1 (input hardening — content sanitization, provenance tagging, integrity hashing) ships with 3d as a prerequisite; Phases 2–4 follow: trust-aware retrieval (trust scoring, temporal decay, guardian validation), detection and response (anomaly detection, circuit breaker, quarantine, rollback), advanced protections (write-ahead validation, behavioral drift detection, cryptographic provenance, red teaming). Addresses OWASP ASI06 (Memory & Context Poisoning). -- **Iteration 3bis** (hardening) — Orchestrator IAM grant for Memory (was silently AccessDenied), memory schema versioning (`schema_version: "2"`), Python repo format validation, severity-aware error logging in Python memory, narrowed entrypoint try-catch, orchestrator fallback episode observability, conditional writes in agent task_state.py (ConditionExpression guards), orchestrator Lambda error alarm (CloudWatch, retryAttempts: 0), concurrency counter reconciliation (scheduled Lambda, drift correction), multi-AZ NAT documentation (already configurable), Python unit tests (pytest), entrypoint decomposition (4 extracted subfunctions), dual prompt assembly deprecation docstring, graceful thread drain in server.py (shutdown hook + atexit), dead QUEUED state removal (8 states, 4 active). 
+- **Iteration 3bis** (hardening) — Orchestrator IAM grant for Memory (was silently AccessDenied), memory schema versioning (`schema_version: "2"`), Python repo format validation, severity-aware error logging in Python memory, narrowed entrypoint try-catch, orchestrator fallback episode observability, conditional writes in agent task_state.py (ConditionExpression guards), orchestrator Lambda error alarm (CloudWatch, retryAttempts: 0), concurrency counter reconciliation (scheduled Lambda, drift correction), multi-AZ NAT documentation (already configurable), Python unit tests (pytest), entrypoint decomposition into `agent/src/` modules (config, models, pipeline, runner, context, prompt_builder, hooks, policy, post_hooks, repo, shell, telemetry — with entrypoint.py as re-export shim), Cedar policy engine (in-process `cedarpy`, fail-closed deny-list for tool-call governance, PreToolUse hooks, per-repo custom policies via Blueprint `security.cedarPolicies`), TaskType enum with validation, dual prompt assembly deprecation docstring, graceful thread drain in server.py (shutdown hook + atexit), dead QUEUED state removal (8 states, 4 active). - **Iteration 4** — Additional git providers, visual proof (screenshots/videos), Slack channel, skills pipeline, user preference memory (Tier 3), control panel (restrict CORS to dashboard origin), real-time event streaming (WebSocket), live session replay and mid-task nudge, browser extension client, MFA for production. 
-- **Iteration 5** — Automated container (devbox) from repo, CI/CD pipeline, snapshot-on-schedule pre-warming, multi-user/team, memory isolation for multi-tenancy, full cost management, adaptive model router with cost-aware cascade, advanced evaluation (optional adaptive-teaching / trajectory-driven prompt patterns), formal orchestrator verification with TLA+/TLC, Bedrock Guardrails output/tool-call with Guardian interceptor pattern (pre/post tool-call evaluation stages — pre-execution validates inputs before tool runs, post-execution screens outputs for PII/secrets/sensitive data before re-entering agent context; per-tool-call policy evaluation between agent decision and execution; PII, denied topics, output filters) — input screening in 3c, mid-execution behavioral monitoring (tool-call frequency circuit breaker, cost runaway detection, aggregate behavioral bounds within agent harness), centralized policy framework (Phase 1: policy audit normalization with `PolicyDecisionEvent` schema across all enforcement points, three enforcement modes — `enforced` | `observed` | `steered` — with observe-before-enforce rollout workflow; Phase 2: Cedar as single policy engine for operational tool-call policy and authorization — virtual-action classification pattern for enforce/observe/steer within Cedar's binary model, replaces scattered budget/quota/tool-access resolution, runtime-configurable policies via Amazon Verified Permissions, extended for multi-tenant authorization when multi-user/team lands, AWS-native with formal verification, integrates with AgentCore Gateway), capability-based security model (tiers feed into policy framework), alternate runtime, advanced customization with tiered tool access (MCP/plugins via AgentCore Gateway), full dashboard, AI-specific WAF rules. 
+- **Iteration 5** — Automated container (devbox) from repo, CI/CD pipeline, snapshot-on-schedule pre-warming, multi-user/team, memory isolation for multi-tenancy, full cost management, adaptive model router with cost-aware cascade, advanced evaluation (optional adaptive-teaching / trajectory-driven prompt patterns), formal orchestrator verification with TLA+/TLC, Bedrock Guardrails output/tool-call with Guardian interceptor pattern (pre-execution stage implemented via Cedar `agent/src/policy.py` + PreToolUse hooks `agent/src/hooks.py`; remaining: post-execution output screening for PII/secrets/sensitive data, cost threshold checks, `modify` mode) — input screening in 3c, mid-execution behavioral monitoring (tool-call frequency circuit breaker, cost runaway detection, aggregate behavioral bounds within agent harness), centralized policy framework (Phase 1: policy audit normalization with `PolicyDecisionEvent` schema across all enforcement points, three enforcement modes — `enforced` | `observed` | `steered` — with observe-before-enforce rollout workflow; Phase 2: Cedar partially implemented in agent harness with in-process `cedarpy` for tool-call governance; remaining: extend Cedar to TypeScript orchestrator for budget/quota resolution, migrate to Amazon Verified Permissions for runtime-configurable policies, virtual-action classification pattern for enforce/observe/steer, extended for multi-tenant authorization when multi-user/team lands), capability-based security model (tiers feed into policy framework), alternate runtime, advanced customization with tiered tool access (MCP/plugins via AgentCore Gateway), full dashboard, AI-specific WAF rules. - **Iteration 6** — Agent swarm orchestration, skills learning, multi-repo, iterative feedback and multiplayer sessions, HITL approval, scheduled triggers, CDK constructs. 
Design docs to keep in sync: [ARCHITECTURE.md](../design/ARCHITECTURE.md), [ORCHESTRATOR.md](../design/ORCHESTRATOR.md), [API_CONTRACT.md](../design/API_CONTRACT.md), [INPUT_GATEWAY.md](../design/INPUT_GATEWAY.md), [REPO_ONBOARDING.md](../design/REPO_ONBOARDING.md), [MEMORY.md](../design/MEMORY.md), [OBSERVABILITY.md](../design/OBSERVABILITY.md), [COMPUTE.md](../design/COMPUTE.md), [CONTROL_PANEL.md](../design/CONTROL_PANEL.md), [SECURITY.md](../design/SECURITY.md), [EVALUATION.md](../design/EVALUATION.md). diff --git a/docs/src/content/docs/design/Agent-harness.md b/docs/src/content/docs/design/Agent-harness.md index effda15..62459fd 100644 --- a/docs/src/content/docs/design/Agent-harness.md +++ b/docs/src/content/docs/design/Agent-harness.md @@ -40,7 +40,7 @@ Plugins, skills, and MCP servers are **out of scope for MVP**. The harness must The following are desired properties for the harness; MVP satisfies some and defers others: - **Add additional tools** — In addition to the harness’s built-in tools (e.g. file, shell), the platform must be able to attach more (e.g. via AgentCore Gateway). MVP: satisfied by Gateway (GitHub, web search). -- **Deterministic hooks** — Support for deterministic steps or hooks (e.g. pre/post tool execution, validation) so the platform can mix coded logic with the agent loop. The **blueprint execution framework** (see [REPO_ONBOARDING.md](/design/repo-onboarding#blueprint-execution-framework)) realizes this requirement: the orchestrator runs custom Lambda-backed steps at configurable pipeline phases (`pre-agent`, `post-agent`) with framework-enforced invariants (state transitions, events, cancellation). The agent harness itself does not need to implement hooks — they run at the orchestrator level, outside the agent session. +- **Deterministic hooks** — Support for deterministic steps or hooks (e.g. pre/post tool execution, validation) so the platform can mix coded logic with the agent loop. 
The **blueprint execution framework** (see [REPO_ONBOARDING.md](/design/repo-onboarding#blueprint-execution-framework)) realizes this requirement at the orchestrator level: custom Lambda-backed steps at configurable pipeline phases (`pre-agent`, `post-agent`) with framework-enforced invariants (state transitions, events, cancellation). Additionally, the **agent harness implements PreToolUse hooks** (`agent/src/hooks.py`) for real-time tool-call policy enforcement via the Cedar policy engine (`agent/src/policy.py`). The PreToolUse hook evaluates every tool call against Cedar policies before execution: `pr_review` agents are denied `Write`/`Edit` tools, writes to protected paths (`.github/workflows/*`, `.git/*`) are blocked, and destructive bash commands are denied. The engine is fail-closed — if `cedarpy` is unavailable or evaluation errors occur, all tool calls are denied. Denied decisions emit `POLICY_DECISION` telemetry events. Per-repo custom Cedar policies can be injected via Blueprint `security.cedarPolicies`. - **Plugins / skills / MCP** — Support for plugins, skills, or MCP servers for extensibility. Out of scope for MVP. - **Access to external memory** — The agent should be able to read and write short- and long-term memory (e.g. AgentCore Memory). MVP: AgentCore Memory is available to the agent via the runtime; the SDK or platform wires it in. - **Session persistence** — Persisting conversation and agent state across session boundaries for crash recovery or resume. MVP: Claude Code SDK has no built-in session manager; durability is via frequent commits. **Update:** AgentCore Runtime persistent session storage (preview) now mounts a per-session filesystem at `/mnt/workspace` that survives stop/resume cycles. Tool caches (mise, npm, Claude Code config) persist across invocations within a session (14-day TTL). Repo clones remain on local ephemeral disk because the S3-backed FUSE mount does not support `flock()`, which breaks build tools like `uv`. 
See [COMPUTE.md](/design/compute#session-storage-persistent-filesystem). diff --git a/docs/src/content/docs/design/Architecture.md b/docs/src/content/docs/design/Architecture.md index d6297a1..2a7fa80 100644 --- a/docs/src/content/docs/design/Architecture.md +++ b/docs/src/content/docs/design/Architecture.md @@ -187,7 +187,7 @@ Each concept has a **source-of-truth document** and one or more documents that r | Agent self-feedback | MEMORY.md (Insights section) | EVALUATION.md (Agent self-feedback section) | | Prompt versioning | EVALUATION.md (Prompt versioning) | ORCHESTRATOR.md (data model: `prompt_version`), ROADMAP.md (3b), `src/handlers/shared/prompt-version.ts` | | Extraction prompts | MEMORY.md (Extraction prompts) | EVALUATION.md (references), ROADMAP.md (3b) | -| Tiered tool access | SECURITY.md (Input validation) | REPO_ONBOARDING.md, ROADMAP.md (Iter 5) | +| Tiered tool access / Cedar policy engine | SECURITY.md (Input validation, Policy enforcement), `agent/src/policy.py` | REPO_ONBOARDING.md, ROADMAP.md (Iter 3bis partial, Iter 5 full) | | Memory isolation | SECURITY.md (Memory-specific threats) | MEMORY.md (Requirements), ROADMAP.md (Iter 5) | | Data protection / DR | SECURITY.md (Data protection) | — | | 2GB image limit | COMPUTE.md (AgentCore Runtime 2GB) | ROADMAP.md (Iter 5: alternate runtime) | @@ -206,7 +206,7 @@ Each concept has a **source-of-truth document** and one or more documents that r | Agent swarm orchestration | ROADMAP.md (Iter 6) | — | | Adaptive model router | ROADMAP.md (Iter 5) | COST_MODEL.md | | Capability-based security | ROADMAP.md (Iter 5) | SECURITY.md | -| Centralized policy framework | ROADMAP.md (Iter 5), SECURITY.md (Policy enforcement and audit) | ORCHESTRATOR.md, OBSERVABILITY.md | +| Centralized policy framework | ROADMAP.md (Iter 5), SECURITY.md (Policy enforcement and audit), `agent/src/policy.py` (in-process Cedar, partially implemented) | ORCHESTRATOR.md, OBSERVABILITY.md | | GitHub App + AgentCore Token Vault | 
ROADMAP.md (Iter 3c), SECURITY.md (Authentication) | ORCHESTRATOR.md (context hydration), COMPUTE.md | | Live session replay | ROADMAP.md (Iter 4) | API_CONTRACT.md | | PR iteration task type | API_CONTRACT.md, ORCHESTRATOR.md | USER_GUIDE.md, PROMPT_GUIDE.md, SECURITY.md, AGENT_HARNESS.md | @@ -216,7 +216,7 @@ Each concept has a **source-of-truth document** and one or more documents that r | Memory input hardening (3e Phase 1) | ROADMAP.md (Iter 3e Phase 1, co-ships with 3d) | MEMORY.md, SECURITY.md (Memory-specific threats) | | Per-tool-call structured telemetry | ROADMAP.md (Iter 3d) | SECURITY.md (Mid-execution enforcement), EVALUATION.md, OBSERVABILITY.md | | Mid-execution behavioral monitoring | ROADMAP.md (Iter 5), SECURITY.md (Mid-execution enforcement) | OBSERVABILITY.md | -| Tool-call interceptor (Guardian pattern) | SECURITY.md (Mid-execution enforcement), ROADMAP.md (Iter 5) | REPO_ONBOARDING.md (Blueprint security props) | +| Tool-call interceptor (Guardian pattern) | SECURITY.md (Mid-execution enforcement), `agent/src/hooks.py` + `agent/src/policy.py` (pre-execution implemented), ROADMAP.md (Iter 5 for post-execution) | REPO_ONBOARDING.md (Blueprint security props) | ### Per-repo model selection diff --git a/docs/src/content/docs/design/Orchestrator.md b/docs/src/content/docs/design/Orchestrator.md index 8d1d88f..c9a344f 100644 --- a/docs/src/content/docs/design/Orchestrator.md +++ b/docs/src/content/docs/design/Orchestrator.md @@ -26,7 +26,7 @@ These boundaries matter whenever you change task submission, the CLI, or the run |---------|-------------------|--------| | REST request/response types | `cdk/src/handlers/shared/types.ts` | **Mirror** in `cli/src/types.ts` for `bgagent` — keep them aligned on every API change. | | HTTP handlers & orchestration code | `cdk/src/handlers/` (e.g. shared `orchestrator.ts`, `create-task-core.ts`, `preflight.ts`) | Colocated Jest tests under `cdk/test/handlers/` and `cdk/test/handlers/shared/`. 
| -| Agent runtime behavior | `agent/` (`entrypoint.py`, `prompts/`, `system_prompt.py`, Dockerfile) | Consumes task payload and environment set by CDK/Lambda; see `agent/README.md` for PAT, tools, and local run. | +| Agent runtime behavior | `agent/src/` (`entrypoint.py` re-export shim, `pipeline.py`, `runner.py`, `config.py`, `hooks.py`, `policy.py`, `prompts/`, `system_prompt.py`, Dockerfile) | Consumes task payload and environment set by CDK/Lambda; see `agent/README.md` for PAT, tools, and local run. | | User-facing API documentation | `docs/guides/USER_GUIDE.md` (and synced site) | Regenerate Starlight content with `mise //docs:sync` after guide edits. | The orchestrator document describes **behavior** (state machine, admission, cancellation). The TypeScript `types.ts` files are the **schema** the API and CLI share; the agent implements the **work** inside compute. @@ -275,7 +275,7 @@ The orchestrator's `hydrateAndTransition()` function calls `hydrateContext()` (` - **`pr_iteration`** / **`pr_review`**: Fetches the pull request context via `fetchGitHubPullRequest()` — four parallel calls: three REST API calls (PR metadata, conversation comments, changed files) plus one GraphQL query for inline review comments. The GraphQL query filters out resolved review threads at fetch time so the agent only sees unresolved feedback. PR metadata includes title, body, head/base refs, and state; the diff summary covers changed files. The PR's `head_ref` is stored as `resolved_branch_name` and `base_ref` as `resolved_base_branch` on the hydrated context. These are used by the orchestrator to update the task record's `branch_name` from the placeholder `pending:pr_resolution` to the actual PR branch. For `pr_review`, if no `task_description` is provided, a default review instruction is used. 3. **Enforces a token budget** on the combined context. Uses a character-based heuristic (~4 chars per token). 
Default budget: 100K tokens (configurable via `USER_PROMPT_TOKEN_BUDGET` environment variable). When the budget is exceeded, oldest comments are removed first. The `truncated` flag is set in the result. 4. **Assembles the user prompt** based on task type: - - **`new_task`**: A structured markdown document with Task ID, Repository, GitHub Issue section, and Task section. The format mirrors the Python `assemble_prompt()` in `agent/entrypoint.py`. + - **`new_task`**: A structured markdown document with Task ID, Repository, GitHub Issue section, and Task section. The format mirrors the Python `assemble_prompt()` in `agent/src/context.py`. - **`pr_iteration`**: Assembled by `assemblePrIterationPrompt()` — includes PR metadata (number, title, body), the diff summary (changed files and patches), review comments (inline and conversation), and optional user instructions from `task_description`. 5. **Screens through Bedrock Guardrail** (PR tasks; `new_task` when issue content is present): The assembled user prompt is screened through Amazon Bedrock Guardrails (`screenWithGuardrail()`) using the `PROMPT_ATTACK` content filter. For `new_task` tasks without issue content, screening is skipped because the task description was already screened at submission time. If the guardrail detects prompt injection, `guardrail_blocked` is set on the result and the orchestrator fails the task. If the Bedrock API is unavailable, a `GuardrailScreeningError` is thrown (fail-closed — unscreened content never reaches the agent). Task descriptions for all task types are screened at submission time in `create-task-core.ts`. 6. **Returns a `HydratedContext` object** containing `version`, `user_prompt`, `issue`, `sources`, `token_estimate`, `truncated`, and for `pr_iteration`/`pr_review` tasks: `resolved_branch_name` and `resolved_base_branch`. 
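The token-budget step above (step 3) can be sketched as follows. This is an illustrative Python rendering only — the production logic is TypeScript in `context-hydration.ts`, and the function and parameter names here are hypothetical:

```python
# Illustrative sketch of the hydration token budget: a ~4 chars/token
# heuristic, evicting oldest comments first when over budget.
# Hypothetical names; the real implementation lives in context-hydration.ts.
CHARS_PER_TOKEN = 4  # character-based heuristic (~4 chars per token)


def estimate_tokens(text: str) -> int:
    return len(text) // CHARS_PER_TOKEN


def apply_token_budget(
    base_prompt: str,
    comments: list[str],
    budget_tokens: int = 100_000,  # default; configurable via USER_PROMPT_TOKEN_BUDGET
) -> tuple[list[str], bool]:
    """Drop oldest comments first until the combined context fits the budget."""
    kept = list(comments)  # comments are ordered oldest-first
    truncated = False

    def total() -> int:
        return estimate_tokens(base_prompt) + sum(estimate_tokens(c) for c in kept)

    while kept and total() > budget_tokens:
        kept.pop(0)  # evict the oldest comment
        truncated = True  # surfaced as the `truncated` flag on the result
    return kept, truncated
```

A usage note: with a 100-token base prompt and two 100-token comments against a 200-token budget, only the newest comment survives and the `truncated` flag is set.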
diff --git a/docs/src/content/docs/design/Security.md b/docs/src/content/docs/design/Security.md index 6704993..5009f5c 100644 --- a/docs/src/content/docs/design/Security.md +++ b/docs/src/content/docs/design/Security.md @@ -123,22 +123,22 @@ The platform enforces policies at multiple points in the task lifecycle. Today, | **Hydration** | Guardrail prompt screening (PR + issue content) | `context-hydration.ts` | `guardrail_blocked` event emitted | | **Hydration** | Budget/quota resolution (3-tier max_turns, 2-tier max_budget_usd) | `orchestrator.ts` (`hydrateAndTransition`) | Values persisted on task record — no policy decision event | | **Hydration** | Token budget for prompt assembly | `context-hydration.ts` | No event emitted | -| **Session** | Tool access control (pr_review restrictions) | `agent/entrypoint.py` | No event emitted | +| **Session** | Tool access control (pr_review restrictions, Cedar deny-list) | `agent/src/hooks.py`, `agent/src/policy.py` (PreToolUse hook + Cedar engine) | `POLICY_DECISION` telemetry event on deny | | **Session** | Budget enforcement (turns, cost) | Claude Agent SDK | Agent SDK enforces; cost in task result | -| **Finalization** | Build/lint verification | `agent/entrypoint.py` | Results in task record and PR body | +| **Finalization** | Build/lint verification | `agent/src/post_hooks.py` | Results in task record and PR body | | **Infrastructure** | DNS Firewall egress allowlist | `dns-firewall.ts`, `agent.ts` (CDK synth) | DNS query logs in CloudWatch | | **Infrastructure** | WAF rate limiting | `task-api.ts` (CDK synth) | WAF logs | | **State machine** | Valid transition enforcement | `task-status.ts`, `orchestrator.ts` | DynamoDB conditional writes | ### Audit gaps (planned remediation) -Submission-time policy decisions (validation, onboarding gate, guardrail screening, idempotency) currently return HTTP errors without emitting structured audit events. 
Budget resolution decisions are persisted but not logged as policy decisions with reason codes. Tool access selection is implicit (hardcoded in agent code) with no audit event. +Submission-time policy decisions (validation, onboarding gate, guardrail screening, idempotency) currently return HTTP errors without emitting structured audit events. Budget resolution decisions are persisted but not logged as policy decisions with reason codes. Tool access is enforced by the Cedar policy engine (`agent/src/policy.py`) via PreToolUse hooks (`agent/src/hooks.py`); denied decisions emit `POLICY_DECISION` telemetry events, but these are not yet part of a unified `PolicyDecisionEvent` schema. **Planned (Iteration 5, Phase 1):** A unified `PolicyDecisionEvent` schema will normalize all policy decisions into structured events with: decision ID, policy name, version, phase, input hash, result, reason codes, and enforcement mode. Enforcement supports three modes: `enforced` (decision is binding — deny blocks, allow proceeds), `observed` (decision is logged but not enforced — shadow mode for safe rollout), and `steered` (decision modifies the input or output rather than blocking — redact PII, sanitize paths, mask secrets). New rules deploy in `observed` mode first; operators validate false-positive rates via `PolicyDecisionEvent` logs, then promote to `enforced` or `steered`. This observe-before-enforce workflow enables gradual rollout of security policies without risking false blocks on legitimate tasks. See [ROADMAP.md Iteration 5](/roadmap/roadmap) for the full centralized policy framework design. ### Policy resolution and authorization (planned) -**Planned (Iteration 5, Phase 2):** Cedar as the single policy engine for both **operational policy** (budget/quota/tool-access resolution, tool-call interception rules) and **authorization** (multi-tenant access control, extended when multi-user/team lands). 
Cedar replaces the scattered merge logic across handlers with a unified policy evaluation. A thin `policy.ts` adapter translates Cedar decisions into `PolicyDecision` objects consumed by existing handlers. Cedar is preferred over OPA: it is AWS-native, has formal verification guarantees, integrates with AgentCore Gateway, and policies can be evaluated in-process via the Cedar SDK without a separate service dependency. Cedar's binary permit/forbid model supports the three enforcement modes (`enforced`, `observed`, `steered`) via a **virtual-action classification pattern**: the interceptor evaluates against multiple virtual actions (`invoke_tool`, `invoke_tool_steered`, `invoke_tool_denied`) and uses the first permitted action to determine the mode. For example, `forbid(principal, action == Action::"invoke_tool", resource) when { resource.path like ".github/workflows/*" && principal.capability_tier != "elevated" }` blocks the call, while `permit(principal, action == Action::"invoke_tool_steered", resource) when { context.output_contains_pii }` triggers PII redaction instead of blocking. Cedar policies are stored in Amazon Verified Permissions and loaded at hydration/session-start time — policy changes take effect without CDK redeployment. When multi-user/team support lands, the same Cedar policy store expands to cover tenant-specific authorization (user/team/repo scoping, team budgets, risk-based approval requirements). +**Partially implemented / Planned (Iteration 5, Phase 2):** Cedar as the single policy engine for both **operational policy** (budget/quota/tool-access resolution, tool-call interception rules) and **authorization** (multi-tenant access control, extended when multi-user/team lands). 
**Current state:** An in-process Cedar policy engine (`agent/src/policy.py`, using `cedarpy`) enforces a deny-list model for tool-call governance: `pr_review` agents are forbidden from using `Write` and `Edit` tools, writes to `.git/*` internals are blocked for all agents, and destructive bash commands (`rm -rf /`, `git push --force`) are denied. The engine is fail-closed — if `cedarpy` is unavailable or evaluation errors occur, all tool calls are denied. Per-repo custom Cedar policies can be injected via Blueprint `security.cedarPolicies`. The PreToolUse hook (`agent/src/hooks.py`) integrates the policy engine with the Claude Agent SDK's hook system, and denied decisions emit `POLICY_DECISION` telemetry events via `agent/src/telemetry.py`. **Planned:** Cedar replaces the scattered merge logic across TypeScript handlers with a unified policy evaluation. A thin `policy.ts` adapter translates Cedar decisions into `PolicyDecision` objects consumed by existing handlers. Cedar is preferred over OPA: it is AWS-native, has formal verification guarantees, integrates with AgentCore Gateway, and policies can be evaluated in-process via the Cedar SDK without a separate service dependency. Cedar's binary permit/forbid model supports the three enforcement modes (`enforced`, `observed`, `steered`) via a **virtual-action classification pattern**: the interceptor evaluates against multiple virtual actions (`invoke_tool`, `invoke_tool_steered`, `invoke_tool_denied`) and uses the first permitted action to determine the mode. For example, `forbid(principal, action == Action::"invoke_tool", resource) when { resource.path like ".github/workflows/*" && principal.capability_tier != "elevated" }` blocks the call, while `permit(principal, action == Action::"invoke_tool_steered", resource) when { context.output_contains_pii }` triggers PII redaction instead of blocking. 
Cedar policies will be stored in Amazon Verified Permissions and loaded at hydration/session-start time — policy changes take effect without CDK redeployment. When multi-user/team support lands, the same Cedar policy store expands to cover tenant-specific authorization (user/team/repo scoping, team budgets, risk-based approval requirements). ### Mid-execution enforcement (planned) @@ -146,7 +146,7 @@ Today, once an agent session starts, the orchestrator can only observe it via po **Planned (Iteration 5):** Two complementary mechanisms address this gap: -1. **Tool-call interceptor (Guardian pattern)** — A policy-evaluation layer in the agent harness (`entrypoint.py`) that sits between the agent SDK's tool-call decision and actual tool execution. Evaluation is split into two stages: a **pre-execution stage** that validates tool inputs before the tool runs (file path deny patterns, bash command allowlist per capability tier, cost threshold checks, and per-repo rules from Blueprint `security` configuration) and blocks disallowed operations before they execute, and a **post-execution stage** that screens tool outputs after the tool runs (PII patterns in file content, secrets in command output, sensitive data leakage) and can redact or flag content before it re-enters the agent context. The interceptor can allow, modify (e.g. redact secrets from output), or deny tool calls. Denied calls return a structured error to the agent, which can retry with a different approach. This follows the Guardian interceptor pattern (Hu et al. 2025) — enforcement happens at tool-call time, not before the session starts (input guardrails) or after it ends (validation pipeline). Combined with per-tool-call structured telemetry (Iteration 3d), every interceptor decision is logged as a `PolicyDecisionEvent`. +1. 
**Tool-call interceptor (Guardian pattern)** — A policy-evaluation layer in the agent harness (`agent/src/hooks.py` + `agent/src/policy.py`) that sits between the agent SDK's tool-call decision and actual tool execution. **Current state:** The pre-execution stage is implemented: a Cedar-based `PolicyEngine` evaluates tool calls via a PreToolUse hook before execution. The deny-list model blocks `Write`/`Edit` for `pr_review` tasks, protects `.git/*` internals, and denies destructive bash commands. The engine is fail-closed (denies on error or missing `cedarpy`). Per-repo custom Cedar policies are supported via Blueprint `security.cedarPolicies`. **Planned extensions:** a **post-execution stage** that screens tool outputs after the tool runs (PII patterns in file content, secrets in command output, sensitive data leakage) and can redact or flag content before it re-enters the agent context. The interceptor can allow, modify (e.g. redact secrets from output), or deny tool calls. Denied calls return a structured error to the agent, which can retry with a different approach. This follows the Guardian interceptor pattern (Hu et al. 2025) — enforcement happens at tool-call time, not before the session starts (input guardrails) or after it ends (validation pipeline). Denied decisions emit `POLICY_DECISION` telemetry events via `agent/src/telemetry.py`. Combined with per-tool-call structured telemetry (Iteration 3d), every interceptor decision will be logged as a `PolicyDecisionEvent`. 2. 
**Behavioral circuit breaker** — Lightweight monitoring of tool-call patterns within a session: call frequency (calls per minute), cumulative cost, repeated failures on the same tool, and file mutation rate. When metrics exceed configurable thresholds (e.g. >50 tool calls/minute, >$10 cumulative cost, >5 consecutive failures), the circuit breaker pauses or terminates the session and emits a `circuit_breaker_triggered` event. This catches runaway loops and cost explosions before the hard session timeout. Thresholds are configurable per-repo via Blueprint `security` props. @@ -261,7 +261,7 @@ AgentCore Memory has **no native backup mechanism**. This is a significant gap f - **No customer-managed KMS** — all encryption at rest uses AWS-managed keys. Customer-managed KMS can be added if required by compliance policy. - **CORS is fully open** — `ALL_ORIGINS` is configured for CLI consumption. Restrict origins when exposing browser clients. - **DNS Firewall IP bypass** — DNS Firewall does not block direct IP connections (see [NETWORK_ARCHITECTURE.md](/design/network-architecture#dns-firewall)). -- **No tiered tool access** — all agent sessions currently have the same tool set. +- **Partial tool access control** — Cedar-based policy enforcement (`agent/src/policy.py`) provides per-task-type tool restrictions (e.g. `pr_review` agents cannot use `Write`/`Edit`), `.git/*` write protection, and destructive command blocking. `.github/workflows/*` is not blocked by default because agents may legitimately need to modify CI workflows; operators can add workflow protection via Blueprint `security.cedarPolicies` if needed. Per-repo custom Cedar policies are supported via Blueprint `security.cedarPolicies`. 
**Important:** custom policies for `write_file` and `execute_bash` actions must use `context.file_path` / `context.command` in `when` clauses — not `resource ==` matching — because the engine uses fixed sentinel resource IDs to avoid Cedar entity UID parsing failures on special characters. `invoke_tool` actions use the real tool name as resource ID, so `resource ==` matching works for tool-level policies. Full tiered tool access (capability tiers, MCP server allowlisting) is planned for Iteration 5. ## Reference diff --git a/docs/src/content/docs/roadmap/Roadmap.md b/docs/src/content/docs/roadmap/Roadmap.md index 3697860..a4059a5 100644 --- a/docs/src/content/docs/roadmap/Roadmap.md +++ b/docs/src/content/docs/roadmap/Roadmap.md @@ -138,8 +138,8 @@ These practices apply continuously across iterations and are not treated as one- - [x] **Orchestrator fallback episode observability** — `writeMinimalEpisode` return value is now checked and logged: `logger.warn('Fallback episode write returned false')` when the inner function reports failure via its return value (previously discarded). New test `logs warning when writeMinimalEpisode returns false` covers this path. - [x] **Python unit tests** — Added pytest-based unit tests (`agent/tests/`) for pure functions: `slugify()`, `redact_secrets()`, `format_bytes()`, `truncate()`, `build_config()`, `assemble_prompt()`, `_discover_project_config()`, `_build_system_prompt()` (entrypoint), `_validate_repo()` (memory), `_now_iso()`, `_build_logs_url()` (task_state). Added pytest to dev dependency group with `pythonpath` config for in-tree imports. -- [x] **Decompose entrypoint.py** — Extracted four named subfunctions from `run_task()` and `run_agent()`: `_build_system_prompt()` (system prompt assembly + memory context), `_discover_project_config()` (repo config scanning), `_write_memory()` (episode + learnings writes), `_setup_agent_env()` (Bedrock/OTEL env var setup). All functions stay in `entrypoint.py` (no import changes). 
`run_task()` and `run_agent()` now call the extracted functions. -- [x] **Deprecate dual prompt assembly** — Added deprecation docstring to `assemble_prompt()` clarifying that production uses the orchestrator's `assembleUserPrompt()` via `hydrated_context["user_prompt"]`. Python version retained only for local batch mode and dry-run mode. No code deletion — just documentation of the intended flow. +- [x] **Decompose entrypoint.py** — Initially extracted four named subfunctions (`_build_system_prompt()`, `_discover_project_config()`, `_write_memory()`, `_setup_agent_env()`). Subsequently, the agent code was further decomposed into a full `agent/src/` module structure: `config.py` (configuration and validation), `models.py` (Pydantic data models and enumerations), `pipeline.py` (task orchestration), `runner.py` (agent execution), `context.py` (context hydration), `prompt_builder.py` (prompt assembly), `hooks.py` (PreToolUse policy hooks), `policy.py` (Cedar policy engine), `post_hooks.py` (deterministic post-hooks), `repo.py` (repository setup), `shell.py` (utilities), `telemetry.py` (metrics and trajectory). The original `entrypoint.py` is now a re-export shim for backward compatibility with tests. +- [x] **Deprecate dual prompt assembly** — Added deprecation docstring to `assemble_prompt()` clarifying that production uses the orchestrator's `assembleUserPrompt()` via `HydratedContext.user_prompt` (validated from the incoming JSON). Python version retained only for local batch mode and dry-run mode. No code deletion — just documentation of the intended flow. - [x] **Graceful thread drain in server.py** — Added `_active_threads` list for tracking background threads, `_drain_threads(timeout=300)` function that joins all alive threads, registered via `@app.on_event("shutdown")` (FastAPI lifecycle — uvicorn translates SIGTERM) and `atexit.register()` as backup. Thread list is cleaned on each new invocation. 
- [x] **Remove dead QUEUED state** — Removed `QUEUED` from `TaskStatus`, `VALID_TRANSITIONS`, and `ACTIVE_STATUSES` in `task-status.ts`. Updated SUBMITTED transitions to `[HYDRATING, FAILED, CANCELLED]`. Removed QUEUED from all tests (count assertions, cancel test, validation test) and documentation (ORCHESTRATOR.md, OBSERVABILITY.md, API_CONTRACT.md, ARCHITECTURE.md). - [x] **Hardening fixes (review round)** — Thread race in `server.py` (track thread before `start()`), defensive `.get()` on `ClientError.response` in `task_state.py`, wired `fallback_error` through `orchestrator.ts` (warning log + event metadata), TOCTOU `ConditionExpression` on reconciler update, per-user error isolation in reconciler, `TaskStatusType` propagation across types/orchestrator/memory, graduated trajectory writer failure, subprocess timeouts, FastAPI lifespan pattern, `decrementConcurrency` CCF distinction. @@ -197,7 +197,7 @@ These practices apply continuously across iterations and are not treated as one- - **Review feedback memory loop (Tier 2)** — Capture PR review comments via GitHub webhook, extract actionable rules via LLM, and persist them as searchable memory so the agent internalizes reviewer preferences over time. This is the primary feedback loop between human reviewers and the agent — no shipping coding agent does this today. Requires a GitHub webhook → API Gateway → Lambda pipeline (separate from agent execution). Two types of extracted knowledge: repo-level rules ("don't use `any` types") and task-specific corrections. See [MEMORY.md](/design/memory) (Review feedback memory) and [SECURITY.md](/design/security) (prompt injection via review comments). - **PR outcome tracking** — Track whether agent-created PRs are merged, revised, or rejected via GitHub webhooks (`pull_request.closed` events). A merged PR is a positive signal; closed-without-merge is a negative signal. 
These outcome signals feed into the evaluation pipeline and enable the episodic memory to learn which approaches succeed. See [MEMORY.md](/design/memory) (PR outcome signals) and [EVALUATION.md](/design/evaluation). - **Evaluation pipeline (basic)** — Automated evaluation of agent runs: failure categorization (reasoning errors, missed instructions, missing tests, timeouts, tool failures). Results are stored and surfaced in observability dashboards. Basic version: rules-based analysis of task outcomes and agent responses. Track memory effectiveness metrics: first-review merge rate, revision cycles, CI pass rate on first push, review comment density, and repeated mistakes. Advanced version (ML-based trace analysis, A/B prompt comparison, feedback loop into prompts) is deferred to Iteration 5. See [EVALUATION.md](/design/evaluation) and [OBSERVABILITY.md](/design/observability). -- **Per-tool-call structured telemetry** — Instrument the agent harness (`entrypoint.py`) to emit structured events for every tool call: tool name, input hash (SHA-256), output hash, duration, cost attribution, and result status. Events flow through the existing `create_event` path and are surfaced in CloudWatch. This is foundational for: (a) the evaluation pipeline (tool-call-level success/failure analysis), (b) the centralized policy framework Phase 1 (tool calls become `PolicyDecisionEvent` sources in Iteration 5), and (c) future mid-execution policy enforcement (tool-call interceptor in Iteration 5). Without per-tool-call telemetry, the platform can only observe sessions as opaque black boxes — model invocation logs capture LLM reasoning but not the tool execution that connects reasoning to action. Informed by the Guardian system's tool-call interception architecture (Hu et al. 2025). See [OBSERVABILITY.md](/design/observability) and [SECURITY.md](/design/security) (Mid-execution enforcement). 
+- **Per-tool-call structured telemetry** — Instrument the agent harness (`agent/src/telemetry.py`) to emit structured events for every tool call: tool name, input hash (SHA-256), output hash, duration, cost attribution, and result status. Events flow through the existing `create_event` path and are surfaced in CloudWatch. This is foundational for: (a) the evaluation pipeline (tool-call-level success/failure analysis), (b) the centralized policy framework Phase 1 (tool calls become `PolicyDecisionEvent` sources in Iteration 5), and (c) future mid-execution policy enforcement (tool-call interceptor in Iteration 5). Without per-tool-call telemetry, the platform can only observe sessions as opaque black boxes — model invocation logs capture LLM reasoning but not the tool execution that connects reasoning to action. Informed by the Guardian system's tool-call interception architecture (Hu et al. 2025). See [OBSERVABILITY.md](/design/observability) and [SECURITY.md](/design/security) (Mid-execution enforcement). **Prerequisite: 3e Phase 1 (input hardening) ships with this iteration.** The review feedback memory loop writes attacker-controlled content (PR review comments) to persistent memory. Without content sanitization, provenance tagging, and integrity hashing (3e Phase 1), this creates a known attack vector — poisoned review comments stored as persistent rules that influence all future tasks on the repo. 3e Phase 1 items (memory content sanitization, GitHub issue input sanitization, source provenance on memory writes, content integrity hashing) must be implemented before or concurrently with the review feedback pipeline. See [SECURITY.md](/design/security) (Prompt injection via PR review comments). @@ -285,17 +285,19 @@ Deep research identified **9 memory-layer security gaps** in the current archite - **Formal orchestrator verification (TLA+)** — Add a formal specification of the orchestrator in TLA+ and verify it with TLC model checking. 
Scope includes the task state machine (8 states, valid transitions, terminal states), concurrency admission control (atomic increment + max check), cancellation races (cancel arriving during any orchestration step), reconciler/orchestrator interleavings (counter drift correction while tasks are active), and the polling loop (agent writes terminal status, orchestrator observes and finalizes). Define invariants such as valid-state progression, no illegal transitions, and repo-level safety constraints (for example, at most one active `RUNNING` task per repo when configured). Keep the spec aligned with `src/constructs/task-status.ts` and orchestrator docs so regressions surface as model-check counterexamples before production. **Note:** The TLA+ specification can be started earlier (e.g. during Iteration 3d) since the state machine and concurrency model are already stable. The spec is documentation that also catches bugs — writing it does not depend on Iteration 5 features. Consider starting the state machine and cancellation models as part of the ongoing engineering practice. - **Guardrails (output and tool-call) with interceptor pattern** — Extend Bedrock Guardrails from input screening (implemented in Iteration 3c) to **output filtering** and **agent tool-call guardrails**. Apply content filters to model responses during agent execution, restrict sensitive content generation, and enforce organizational policies (e.g. "do not modify files in `/infrastructure`"). Guardrails configuration can be per-repo (via onboarding) or platform-wide. - **Tool-call interceptor (Guardian pattern):** Implement a policy-evaluation layer in the agent harness (`entrypoint.py`) that intercepts tool calls between the agent SDK's decision and actual execution — enforcement happens at tool-call time, not before the session starts (input guardrails) or after it ends (validation pipeline). 
Each tool call is evaluated against a policy: file path restrictions (deny writes to `.github/workflows/`, `**/migrations/**`), bash command allowlist per capability tier, cost threshold checks, and per-repo rules from Blueprint `security` configuration. The interceptor can **allow**, **modify** (e.g. redact secrets from output), or **deny** (return structured error to agent, which retries with a different approach). Evaluation is split into two stages: a **pre-execution stage** that validates tool inputs before the tool runs (file path deny patterns, bash command allowlist, cost threshold checks) and blocks disallowed operations before they execute, and a **post-execution stage** that screens tool outputs after the tool runs (PII patterns in file content, secrets in command output, sensitive data leakage) and can redact or flag content before it re-enters the agent context. Combined with per-tool-call structured telemetry (Iteration 3d), every interceptor decision is logged as a `PolicyDecisionEvent`. This pattern is informed by the Guardian system (Hu et al. 2025) — a "guardian agent" that monitors and can intercept tool calls before execution. See [SECURITY.md](/design/security) (Mid-execution enforcement). + **Tool-call interceptor (Guardian pattern) — partially implemented:** A Cedar-based policy engine (`agent/src/policy.py`) with PreToolUse hooks (`agent/src/hooks.py`) intercepts tool calls between the agent SDK's decision and actual execution. **Current state (pre-execution stage implemented):** Every tool call is evaluated against Cedar deny-list policies: `pr_review` agents are denied `Write`/`Edit` tools, writes to `.git/*` internals are blocked, and destructive bash commands (`rm -rf /`, `git push --force`) are denied. The engine is fail-closed — if `cedarpy` is unavailable or evaluation errors occur, all tool calls are denied. Per-repo custom Cedar policies are supported via Blueprint `security.cedarPolicies`. 
Denied decisions emit `POLICY_DECISION` telemetry events via `agent/src/telemetry.py`. **Remaining work:** Extend to a **post-execution stage** that screens tool outputs after the tool runs (PII patterns in file content, secrets in command output, sensitive data leakage) and can redact or flag content before it re-enters the agent context. Add cost threshold checks, bash command allowlist per capability tier, and `modify` mode (e.g. redact secrets from output). Combined with per-tool-call structured telemetry (Iteration 3d), every interceptor decision will be logged as a `PolicyDecisionEvent`. This pattern is informed by the Guardian system (Hu et al. 2025) — a "guardian agent" that monitors and can intercept tool calls before execution. See [SECURITY.md](/design/security) (Mid-execution enforcement). - **Mid-execution behavioral monitoring** — Lightweight monitoring of agent behavior within a running session, filling the gap between input guardrails (pre-session) and validation (post-session). A **behavioral circuit breaker** in the agent harness tracks aggregate metrics: tool-call frequency (calls per minute), cumulative session cost, repeated failures on the same tool, and file mutation rate. When metrics exceed configurable thresholds (e.g. >50 tool calls/minute, >$10 cumulative cost, >5 consecutive failures on the same tool), the circuit breaker pauses or terminates the session and emits a `circuit_breaker_triggered` event. This catches runaway loops, cost explosions, and stuck agents before the hard session timeout. Thresholds are configurable per-repo via Blueprint `security` props. The circuit breaker operates within the existing agent harness — no sidecar process or external service required. For ABCA's single-agent-per-task model, embedded monitoring is simpler and more reliable than an external sidecar; sidecar architecture becomes relevant when multi-agent orchestration lands (Iteration 6). 
See [SECURITY.md](/design/security) (Mid-execution enforcement). -- **Centralized policy framework** — Consolidate the platform's distributed policy decisions into a unified policy framework and audit layer. Policy logic today is scattered across 20+ files (input validation in `validation.ts` and `create-task-core.ts`, admission control in `orchestrator.ts`, guardrail screening in `context-hydration.ts`, budget resolution across `validation.ts`/`orchestrator.ts`/`entrypoint.py`, tool access in `entrypoint.py`, network egress in `dns-firewall.ts`/`agent.ts`, state transitions in `task-status.ts`/`orchestrator.ts`). This fragmentation makes it difficult to audit what policies exist, verify consistency, or change policy behavior without touching multiple files. +- **Centralized policy framework** — Consolidate the platform's distributed policy decisions into a unified policy framework and audit layer. Policy logic today is scattered across 20+ files (input validation in `validation.ts` and `create-task-core.ts`, admission control in `orchestrator.ts`, guardrail screening in `context-hydration.ts`, budget resolution across `validation.ts`/`orchestrator.ts`/`agent/src/config.py`, tool access in `agent/src/policy.py` + `agent/src/hooks.py`, network egress in `dns-firewall.ts`/`agent.ts`, state transitions in `task-status.ts`/`orchestrator.ts`). This fragmentation makes it difficult to audit what policies exist, verify consistency, or change policy behavior without touching multiple files. The agent-side Cedar policy engine (`agent/src/policy.py`) is a first step — it provides in-process tool-call governance with fail-closed semantics and per-repo custom policies. The full framework extends this to the TypeScript orchestrator side. **Phase 1 — Policy audit normalization:** Define a stable `PolicyDecisionEvent` schema: `decision_id` (ULID), `policy_name` (e.g.
`admission.concurrency`, `budget.max_turns`, `guardrail.input_screening`), `policy_version`, `phase` (`submission` | `admission` | `pre_flight` | `hydration` | `session_start` | `session` | `finalization`), `input_hash` (SHA-256 of the decision input for reproducibility), `result` (`allow` | `deny` | `modify`), `reason_codes[]`, `enforcement` (`enforced` | `observed` | `steered`), and `task_id`. The three enforcement modes serve distinct purposes: `enforced` means the decision is binding (deny blocks, allow proceeds), `observed` means the decision is logged but not enforced (shadow mode for safe rollout), and `steered` means the decision modifies the input or output rather than blocking (redact PII, sanitize paths, mask secrets). New rules deploy in `observed` mode first; operators validate false-positive rates via `PolicyDecisionEvent` logs, then promote to `enforced` or `steered`. This observe-before-enforce workflow enables gradual rollout of security policies without risking false blocks on legitimate tasks. Emit a `policy_decision` event via `emitTaskEvent` at every existing enforcement point. Today, some decisions emit events (`admission_rejected`, `preflight_failed`, `guardrail_blocked`) while others silently return HTTP errors — normalize them all. This is pure instrumentation of existing code paths; no behavior change. - **Phase 2 — Cedar policy engine:** + **Phase 2 — Cedar policy engine (partially implemented):** Introduce **Cedar** (not OPA) as the single policy engine for both **operational policy** (budget/quota/tool-access resolution, tool-call interception rules) and **authorization** (extended for multi-tenant access control when multi-user/team support lands). Cedar is AWS-native, has formal verification guarantees, and integrates with AgentCore Gateway. 
- **Policy resolution:** Cedar replaces the scattered budget/quota/tool-access merge logic (3-tier `max_turns` resolution, 2-tier `max_budget_usd` resolution, tool access determination in `entrypoint.py`, per-repo configuration merge in `loadBlueprintConfig`) with a unified policy evaluation. A thin `policy.ts` adapter module translates Cedar decisions into `PolicyDecision` objects (`PolicyInput` → Cedar evaluation → `PolicyDecision` with computed budgets, tool profile, risk tier, redaction directives) consumed by existing handlers — no new service, no network hop. Input validation (format checks, range checks) remains at the input boundary; Cedar handles resolution and policy composition. + **Current state:** An in-process Cedar policy engine is implemented in the agent harness (`agent/src/policy.py`) using `cedarpy` for tool-call governance. The engine enforces a deny-list model: `pr_review` agents are forbidden from `Write`/`Edit`, writes to `.github/workflows/*` and `.git/*` are blocked, and destructive bash commands are denied. The engine is fail-closed (denies on error, `cedarpy` unavailability, or Cedar `NoDecision`). Per-repo custom Cedar policies can be injected via Blueprint `security.cedarPolicies` and are validated at initialization. Task types are validated against the `TaskType` enum (`agent/src/models.py`). Denied decisions emit `POLICY_DECISION` telemetry events. + + **Remaining work:** Extend Cedar to the TypeScript orchestrator side. Cedar replaces the scattered budget/quota/tool-access merge logic (3-tier `max_turns` resolution, 2-tier `max_budget_usd` resolution, per-repo configuration merge in `loadBlueprintConfig`) with a unified policy evaluation. A thin `policy.ts` adapter module translates Cedar decisions into `PolicyDecision` objects (`PolicyInput` → Cedar evaluation → `PolicyDecision` with computed budgets, tool profile, risk tier, redaction directives) consumed by existing handlers — no new service, no network hop. 
Input validation (format checks, range checks) remains at the input boundary; Cedar handles resolution and policy composition. Migrate from in-process `cedarpy` to Amazon Verified Permissions for runtime-configurable policies. **Operational tool-call policies** use a **virtual-action classification pattern** to support the three enforcement modes (`enforced`, `observed`, `steered`) within Cedar's binary permit/forbid model. Instead of asking Cedar "allow or deny?", the interceptor evaluates against multiple virtual actions (`invoke_tool`, `invoke_tool_steered`, `invoke_tool_denied`) and uses the first permitted action to determine the mode. For example: `forbid(principal, action == Action::"invoke_tool", resource) when { resource.path like ".github/workflows/*" && principal.capability_tier != "elevated" }` blocks the call, while `permit(principal, action == Action::"invoke_tool_steered", resource) when { context.output_contains_pii }` triggers PII redaction. This keeps Cedar doing what it does best (binary decisions with formal verification) while the interceptor interprets the combination of decisions as allow/steer/deny. 
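The Phase 1 `PolicyDecisionEvent` schema above can be sketched as a Python dataclass. This is a hedged sketch: field names follow the schema as stated, but the `uuid4` stand-in for ULID and the canonical-JSON hashing helper are illustrative assumptions, not the platform's actual implementation.

```python
import hashlib
import json
import uuid
from dataclasses import dataclass
from typing import Literal

Phase = Literal["submission", "admission", "pre_flight", "hydration",
                "session_start", "session", "finalization"]
Result = Literal["allow", "deny", "modify"]
Enforcement = Literal["enforced", "observed", "steered"]

def input_hash(decision_input: dict) -> str:
    """SHA-256 of the canonical (sorted-key) JSON of the decision input,
    so the same input always reproduces the same hash."""
    canonical = json.dumps(decision_input, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

@dataclass(frozen=True)
class PolicyDecisionEvent:
    decision_id: str            # ULID in the real schema; uuid4 stands in here
    policy_name: str            # e.g. "admission.concurrency"
    policy_version: str
    phase: Phase
    input_hash: str
    result: Result
    reason_codes: list[str]
    enforcement: Enforcement
    task_id: str

# Shadow-mode example: the rule fired (deny) but is only observed, so the
# task proceeds while operators measure false-positive rates from the logs.
event = PolicyDecisionEvent(
    decision_id=str(uuid.uuid4()),
    policy_name="guardrail.input_screening",
    policy_version="2025-01",
    phase="pre_flight",
    input_hash=input_hash({"repo": "acme/api", "task_type": "new_task"}),
    result="deny",
    reason_codes=["PII_DETECTED"],
    enforcement="observed",
    task_id="task-123",
)
```

Because `input_hash` canonicalizes the input, two emitters evaluating the same decision input always log the same hash, which is what makes decisions reproducible across the audit trail.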
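The fail-closed pre-execution stage can be sketched as follows. This is an illustrative sketch, not the real `agent/src/policy.py`: the actual engine evaluates Cedar policies via `cedarpy`, while this stub hard-codes the deny-list rules described above (the function name, constants, and reason strings are hypothetical).

```python
import fnmatch

# Deny-list rules mirroring the ones described in the text (hypothetical names).
PROTECTED_PATHS = [".github/workflows/*", ".git/*"]
DESTRUCTIVE_BASH = ["rm -rf /", "git push --force"]

def evaluate_tool_call(task_type: str, tool: str, tool_input: dict) -> tuple[bool, str]:
    """Pre-execution stage: returns (allowed, reason). Fail-closed: any
    unexpected error denies the call rather than letting it through."""
    try:
        if task_type == "pr_review" and tool in ("Write", "Edit"):
            return False, "pr_review_is_read_only"
        if tool in ("Write", "Edit"):
            path = tool_input.get("file_path", "")
            if any(fnmatch.fnmatch(path, pat) for pat in PROTECTED_PATHS):
                return False, "protected_path"
        if tool == "Bash":
            command = tool_input.get("command", "")
            if any(bad in command for bad in DESTRUCTIVE_BASH):
                return False, "destructive_command"
        return True, "allowed"
    except Exception:
        return False, "policy_evaluation_error"  # fail-closed on any error
```

A PreToolUse hook would call this before the SDK executes the tool, turning a `(False, reason)` result into a structured denial the agent can react to.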
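The virtual-action classification pattern can be sketched as follows. The `make_engine` stub stands in for a real Cedar evaluation (via `cedarpy` or Amazon Verified Permissions); only the interpretation logic in `classify` is the point of the sketch, and all names here are illustrative.

```python
from typing import Callable, Literal

Mode = Literal["deny", "steer", "allow"]

# A Cedar engine answers a binary permit/forbid question per request. The
# interceptor asks about several *virtual* actions and interprets the
# combination of answers as allow / steer / deny.
CedarEvaluate = Callable[[str], bool]  # action name -> is_permitted

def classify(is_permitted: CedarEvaluate) -> Mode:
    # Order matters: an explicit deny wins, then steering, then plain allow.
    if not is_permitted("invoke_tool"):
        return "deny"            # e.g. forbid(...) on .github/workflows/*
    if is_permitted("invoke_tool_steered"):
        return "steer"           # e.g. permit(...) when output_contains_pii
    return "allow"

# Stub engine standing in for Cedar: answers depend on the tool call.
def make_engine(path: str, output_has_pii: bool) -> CedarEvaluate:
    def is_permitted(action: str) -> bool:
        if action == "invoke_tool":
            return not path.startswith(".github/workflows/")
        if action == "invoke_tool_steered":
            return output_has_pii
        return False
    return is_permitted
```

For example, `classify(make_engine(".github/workflows/ci.yml", False))` yields `"deny"`, while a call whose output matched a PII pattern would classify as `"steer"` and trigger redaction instead of a block.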
@@ -346,9 +348,9 @@ Deep research identified **9 memory-layer security gaps** in the current architecture - **Iteration 3c** — Per-repo GitHub App credentials via AgentCore Token Vault (`CfnWorkloadIdentity` + Token Vault credential provider for automatic token refresh; agent uses `GetWorkloadAccessToken` for long-running sessions; sets pattern for GitLab/Jira/Slack integrations), orchestrator pre-flight checks (fail-closed before session start), persistent session storage for select caches (AgentCore Runtime `/mnt/workspace` mount for npm/Claude config; mise/uv/repo on local disk due to FUSE `flock()` limitation), pre-execution task risk classification (model/limits/approval policy selection), tiered validation pipeline (tool validation, code quality analysis, post-execution risk/blast radius analysis), PR risk level, PR review task type (`pr_review` — read-only structured review with tool restriction, defense-in-depth enforcement, CLI `--review-pr` flag), input guardrail screening (Bedrock Guardrails, fail-closed — including GitHub issue content for `new_task`), multi-modal input. - **Iteration 3d** — Review feedback memory loop (Tier 2), PR outcome tracking, evaluation pipeline (basic), per-tool-call structured telemetry (tool name, input/output hash, duration, cost — foundational for evaluation and Iteration 5 policy enforcement). Co-ships with 3e Phase 1 (memory input hardening: content sanitization, provenance tagging, integrity hashing) as a prerequisite for safely writing attacker-controlled content to memory.
- **Iteration 3e** — Memory security and integrity: Phase 1 (input hardening — content sanitization, provenance tagging, integrity hashing) ships with 3d as a prerequisite; Phases 2–4 follow: trust-aware retrieval (trust scoring, temporal decay, guardian validation), detection and response (anomaly detection, circuit breaker, quarantine, rollback), advanced protections (write-ahead validation, behavioral drift detection, cryptographic provenance, red teaming). Addresses OWASP ASI06 (Memory & Context Poisoning). -- **Iteration 3bis** (hardening) — Orchestrator IAM grant for Memory (was silently AccessDenied), memory schema versioning (`schema_version: "2"`), Python repo format validation, severity-aware error logging in Python memory, narrowed entrypoint try-catch, orchestrator fallback episode observability, conditional writes in agent task_state.py (ConditionExpression guards), orchestrator Lambda error alarm (CloudWatch, retryAttempts: 0), concurrency counter reconciliation (scheduled Lambda, drift correction), multi-AZ NAT documentation (already configurable), Python unit tests (pytest), entrypoint decomposition (4 extracted subfunctions), dual prompt assembly deprecation docstring, graceful thread drain in server.py (shutdown hook + atexit), dead QUEUED state removal (8 states, 4 active). 
+- **Iteration 3bis** (hardening) — Orchestrator IAM grant for Memory (was silently AccessDenied), memory schema versioning (`schema_version: "2"`), Python repo format validation, severity-aware error logging in Python memory, narrowed entrypoint try-catch, orchestrator fallback episode observability, conditional writes in agent task_state.py (ConditionExpression guards), orchestrator Lambda error alarm (CloudWatch, retryAttempts: 0), concurrency counter reconciliation (scheduled Lambda, drift correction), multi-AZ NAT documentation (already configurable), Python unit tests (pytest), entrypoint decomposition into `agent/src/` modules (config, models, pipeline, runner, context, prompt_builder, hooks, policy, post_hooks, repo, shell, telemetry — with entrypoint.py as re-export shim), Cedar policy engine (in-process `cedarpy`, fail-closed deny-list for tool-call governance, PreToolUse hooks, per-repo custom policies via Blueprint `security.cedarPolicies`), TaskType enum with validation, dual prompt assembly deprecation docstring, graceful thread drain in server.py (shutdown hook + atexit), dead QUEUED state removal (8 states, 4 active). - **Iteration 4** — Additional git providers, visual proof (screenshots/videos), Slack channel, skills pipeline, user preference memory (Tier 3), control panel (restrict CORS to dashboard origin), real-time event streaming (WebSocket), live session replay and mid-task nudge, browser extension client, MFA for production. 
-- **Iteration 5** — Automated container (devbox) from repo, CI/CD pipeline, snapshot-on-schedule pre-warming, multi-user/team, memory isolation for multi-tenancy, full cost management, adaptive model router with cost-aware cascade, advanced evaluation (optional adaptive-teaching / trajectory-driven prompt patterns), formal orchestrator verification with TLA+/TLC, Bedrock Guardrails output/tool-call with Guardian interceptor pattern (pre/post tool-call evaluation stages — pre-execution validates inputs before tool runs, post-execution screens outputs for PII/secrets/sensitive data before re-entering agent context; per-tool-call policy evaluation between agent decision and execution; PII, denied topics, output filters) — input screening in 3c, mid-execution behavioral monitoring (tool-call frequency circuit breaker, cost runaway detection, aggregate behavioral bounds within agent harness), centralized policy framework (Phase 1: policy audit normalization with `PolicyDecisionEvent` schema across all enforcement points, three enforcement modes — `enforced` | `observed` | `steered` — with observe-before-enforce rollout workflow; Phase 2: Cedar as single policy engine for operational tool-call policy and authorization — virtual-action classification pattern for enforce/observe/steer within Cedar's binary model, replaces scattered budget/quota/tool-access resolution, runtime-configurable policies via Amazon Verified Permissions, extended for multi-tenant authorization when multi-user/team lands, AWS-native with formal verification, integrates with AgentCore Gateway), capability-based security model (tiers feed into policy framework), alternate runtime, advanced customization with tiered tool access (MCP/plugins via AgentCore Gateway), full dashboard, AI-specific WAF rules. 
+- **Iteration 5** — Automated container (devbox) from repo, CI/CD pipeline, snapshot-on-schedule pre-warming, multi-user/team, memory isolation for multi-tenancy, full cost management, adaptive model router with cost-aware cascade, advanced evaluation (optional adaptive-teaching / trajectory-driven prompt patterns), formal orchestrator verification with TLA+/TLC, Bedrock Guardrails output/tool-call with Guardian interceptor pattern (pre-execution stage implemented via Cedar `agent/src/policy.py` + PreToolUse hooks `agent/src/hooks.py`; remaining: post-execution output screening for PII/secrets/sensitive data, cost threshold checks, `modify` mode) — input screening in 3c, mid-execution behavioral monitoring (tool-call frequency circuit breaker, cost runaway detection, aggregate behavioral bounds within agent harness), centralized policy framework (Phase 1: policy audit normalization with `PolicyDecisionEvent` schema across all enforcement points, three enforcement modes — `enforced` | `observed` | `steered` — with observe-before-enforce rollout workflow; Phase 2: Cedar partially implemented in agent harness with in-process `cedarpy` for tool-call governance; remaining: extend Cedar to TypeScript orchestrator for budget/quota resolution, migrate to Amazon Verified Permissions for runtime-configurable policies, virtual-action classification pattern for enforce/observe/steer, extended for multi-tenant authorization when multi-user/team lands), capability-based security model (tiers feed into policy framework), alternate runtime, advanced customization with tiered tool access (MCP/plugins via AgentCore Gateway), full dashboard, AI-specific WAF rules. - **Iteration 6** — Agent swarm orchestration, skills learning, multi-repo, iterative feedback and multiplayer sessions, HITL approval, scheduled triggers, CDK constructs. 
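The behavioral circuit breaker described for mid-execution monitoring can be sketched as an in-harness aggregate tracker. Thresholds, class names, and the trip reasons below are illustrative assumptions; in the platform they would come from Blueprint `security` props.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Thresholds:
    max_calls_per_minute: int = 50
    max_session_cost_usd: float = 10.0
    max_consecutive_failures: int = 5

class CircuitBreaker:
    """Tracks aggregate session metrics; trips when any bound is exceeded."""
    def __init__(self, thresholds: Thresholds):
        self.t = thresholds
        self.call_times: deque[float] = deque()   # timestamps of recent calls
        self.session_cost = 0.0
        self.consecutive_failures: dict[str, int] = {}
        self.tripped_reason: str | None = None

    def record(self, now: float, tool: str, cost_usd: float, failed: bool) -> bool:
        """Record one tool call; returns True once the breaker has tripped."""
        self.call_times.append(now)
        while self.call_times and now - self.call_times[0] > 60.0:
            self.call_times.popleft()             # keep a 60s sliding window
        self.session_cost += cost_usd
        if failed:
            self.consecutive_failures[tool] = self.consecutive_failures.get(tool, 0) + 1
        else:
            self.consecutive_failures[tool] = 0

        if len(self.call_times) > self.t.max_calls_per_minute:
            self.tripped_reason = "tool_call_frequency"
        elif self.session_cost > self.t.max_session_cost_usd:
            self.tripped_reason = "cost_runaway"
        elif self.consecutive_failures[tool] > self.t.max_consecutive_failures:
            self.tripped_reason = "repeated_tool_failures"
        return self.tripped_reason is not None

# A stuck agent failing the same tool repeatedly trips the breaker:
breaker = CircuitBreaker(Thresholds(max_consecutive_failures=2))
breaker.record(0.0, "Bash", 0.01, failed=True)
breaker.record(1.0, "Bash", 0.01, failed=True)
breaker.record(2.0, "Bash", 0.01, failed=True)   # 3rd consecutive failure
```

On trip, the harness would pause or terminate the session and emit the `circuit_breaker_triggered` event; the sketch deliberately keeps all state in-process, matching the no-sidecar design choice above.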
Design docs to keep in sync: [ARCHITECTURE.md](/design/architecture), [ORCHESTRATOR.md](/design/orchestrator), [API_CONTRACT.md](/design/api-contract), [INPUT_GATEWAY.md](/design/input-gateway), [REPO_ONBOARDING.md](/design/repo-onboarding), [MEMORY.md](/design/memory), [OBSERVABILITY.md](/design/observability), [COMPUTE.md](/design/compute), [CONTROL_PANEL.md](/design/control-panel), [SECURITY.md](/design/security), [EVALUATION.md](/design/evaluation).