aws-samples · tannermcrae · May 6, 2026
diff --git a/Workload Specific Evaluations/Coding Assistant/.github-action-example/pr-review.yml b/Workload Specific Evaluations/Coding Assistant/.github-action-example/pr-review.yml
@@ -0,0 +1,65 @@
+name: PR Review (aligned)
+
+on:
+  pull_request:
+    types: [opened, synchronize, reopened]
+
+# Concurrency: one review per PR. New pushes cancel the prior run.
+concurrency:
+  group: pr-review-${{ github.event.pull_request.number }}
+  cancel-in-progress: true
+
+permissions:
+  id-token: write        # required for AWS OIDC
+  contents: read
+  pull-requests: write   # to post the review comment
+
+jobs:
+  review:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout PR
+        uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+          ref: ${{ github.event.pull_request.head.sha }}
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+
+      - name: Install reviewer deps
+        run: pip install boto3
+
+      - name: Configure AWS credentials via OIDC
+        uses: aws-actions/configure-aws-credentials@v4
+        with:
+          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
+          aws-region: ${{ vars.AWS_REGION || 'us-east-1' }}
+
+      # Adjust PYTHONPATH (set under `env` below) to the parent directory of wherever you copy the pr_reviewer package.
+      # Here we assume the package lives at .github/pr_reviewer/ inside the consumer repo.
+      - name: Run aligned PR reviewer
+        id: review
+        run: |
+          set +e
+          python -m pr_reviewer \
+            --repo . \
+            --base origin/${{ github.event.pull_request.base.ref }} \
+            --format markdown > review.md
+          echo "exit=$?" >> "$GITHUB_OUTPUT"
+        env:
+          PYTHONPATH: ${{ github.workspace }}/.github
+
+      - name: Post review as PR comment
+        uses: marocchino/sticky-pull-request-comment@v2
+        with:
+          header: aligned-pr-review
+          path: review.md
+
+      - name: Fail the check if review failed
+        if: steps.review.outputs.exit != '0'
+        run: |
+          echo "Aligned PR review failed. See the PR comment for details."
+          exit 1
diff --git a/Workload Specific Evaluations/Coding Assistant/01 intro and approach.ipynb b/Workload Specific Evaluations/Coding Assistant/01 intro and approach.ipynb
@@ -0,0 +1,59 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "intro-00",
+   "metadata": {},
+   "source": "# Coding Assistant Evaluations — Notebook 01: Intro & Approach\n\nThis workshop scores three coding agents — **Claude Code**, **Kiro**, and a **custom agent you build** — on a curated, prebuilt task set, along **two axes**:\n\n1. **Pair-programmer** — when you ask a question about the codebase, does the agent surface the right files (an information-retrieval problem) and answer correctly?\n2. **Autonomous** — given a task, does it produce a mergeable diff reliably?\n\n## How this workshop is structured\n\nThe eval data is **prebuilt and curated**, not LLM-generated. That matters because if Claude writes the tasks *and* gets graded on them, you measure self-preference, not capability. We ship:\n\n- A 9-task `tasks.yaml` grounded in real files in [`aws-samples/sample-agentic-platform`](https://github.com/aws-samples/sample-agentic-platform) at a pinned SHA.\n- Per-task rubrics under `scaffolding/ground_truth/`.\n- A hand-authored gold-standard set under `scaffolding/gold_standard/` for calibrating the LLM judge.\n\nYou spend the first few notebooks **inspecting** that data so you understand the schema and trust the methodology. Then you fill in a **custom-agent harness skeleton** at `my_agent/` (CLI plumbing already done — you implement the agent loop and tools), and run the eval against all three agents.\n\nBy the end you will have:\n\n1. A reproducible 9-task eval set (5 normal + 2 trap + 2 nav-only).\n2. Calibrated rubrics: 7 ground-truth rubrics + 8 hand-authored gold-standard pairs proving the LLM judge agrees with humans ≥ 80 % of the time.\n3. A working custom Strands-based agent that you wired up.\n4. **Pair-programmer scorecard** — precision@5, recall@10, MRR, answer accuracy, citation grounding, honesty.\n5. **Autonomous scorecard** — pass rates by difficulty, reliability across 3 seeds on the hardest tasks, sequence-aware tool-call quality, wall-clock efficiency.\n6. A reusable PR-review CI workflow — same reviewer, dropped into GitHub Actions.\n\n## Why not just use SWE-Bench?\n\nPublic benchmarks are fine for vendor comparison but wrong for deciding whether an agent will work on **your** codebase. Two reasons:\n\n- **Relevance**: the tasks don't look like your code, use your MCP tooling, or follow your review standards.\n- **Contamination**: SWE-Bench tasks come from high-traffic public repos. Model providers almost certainly have the issues, PRs, and discussion in training data. Strong scores mix capability with leakage at an unknowable ratio.\n\nThis workshop uses a public repo (`aws-samples/sample-agentic-platform`) as a stand-in so everyone can follow along, but the whole structure generalizes: swap the repo URL in `tasks.yaml`, hand-curate new tasks following the schema docs in notebooks 02 and 03, run the eval.\n\n## Pair-programmer metrics (notebook 06)\n\n| Metric | What it asks |\n|---|---|\n| **precision@5** | Of the first 5 files the agent touched, how many were relevant? |\n| **recall@10** | Of the relevant files, how many surfaced in the first 10? |\n| **MRR** | How quickly did the first relevant file appear? |\n| **answer accuracy** | LLM judge against your ground-truth answer |\n| **citation grounding** | All `path:line` refs in the answer actually exist |\n| **honesty** | On trap tasks, did the agent refuse to fabricate a fix? |\n\n## Autonomous metrics (notebook 07)\n\n| Signal | What it measures |\n|---|---|\n| **Rubric review** | Does the produced PR meet our review standards? |\n| **Tests** | Does it still work? |\n| **Static** | Does it pass the repo's linters? |\n| **Tool-call (sequence-aware)** | Right tools, called *before* the edit, results consumed? |\n| **Reliability** | Pass-rate across 3 seeds on the hardest tasks |\n| **Wall-clock efficiency** | Uniform across all 3 agents (Kiro can't be intercepted for tokens) |"
+  },
+  {
+   "cell_type": "markdown",
+   "id": "intro-01",
+   "metadata": {},
+   "source": "## How the notebooks work\n\nTwo phases:\n\n**Phase 1 — Inspect the prebuilt data (notebooks 02-04).** You read the curated tasks, rubrics, and gold-standard set, validate them with the included validators, and learn the schema in case you want to adapt the workshop to your own repo. No code-writing.\n\n**Phase 2 — Run the eval (notebooks 05-07).** You fill in the `my_agent/` harness skeleton, then run the pair-programmer and autonomous evals against all three agents.\n\n## Module layout\n\n| Notebook | Purpose |\n|---|---|\n| **01** (this) | Why + how + env check |\n| **02** | Inspect the prebuilt task set + schema reference |\n| **03** | Inspect the prebuilt rubrics + gold standard + schema reference |\n| **04** | Calibrate the automated PR reviewer against the gold set |\n| **05** | Fill in the `my_agent/` harness skeleton (CLI plumbing already done) |\n| **06** | Pair-programmer eval — IR + correctness + grounding + honesty |\n| **07** | Autonomous eval + reliability sub-study + final report |\n\n## Prereqs\n\nAmong the Phase 1 notebooks, notebook 01 (this one) is the only one with a meaningful code cell, which verifies your environment. Notebooks 02-04 just read prebuilt data. Notebook 05 is where you'll start writing code."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "intro-02",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%pip install -q -r requirements.txt"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "intro-03",
+   "metadata": {},
+   "source": "### Environment smoke test\n\nVerifies: AWS creds, Bedrock model access, `claude` CLI, `kiro-cli`, `uv` and `git`.\n\nIf `kiro-cli` is missing, you can still do the workshop — just exclude it from the final run in notebook 07.\n\n> If `claude` shows FAIL but you know it's installed, your Jupyter kernel's `PATH` doesn't include `~/.local/bin` (or wherever the binary lives). Restart the kernel from a shell that has it on `PATH`, or set `os.environ['PATH']` in this notebook before the check."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "intro-04",
+   "metadata": {},
+   "outputs": [],
+   "source": "import os, shutil, boto3\n\n# Make sure user-local bins are on PATH for the kernel — Jupyter often\n# launches with a stripped-down PATH that misses ~/.local/bin etc.\nfor p in (os.path.expanduser('~/.local/bin'), '/opt/homebrew/bin', '/usr/local/bin'):\n    if p not in os.environ.get('PATH', '').split(':') and os.path.isdir(p):\n        os.environ['PATH'] = p + ':' + os.environ.get('PATH', '')\n\nMODEL_ID = 'us.anthropic.claude-sonnet-4-5-20250929-v1:0'\n\ndef check(label, ok, detail=''):\n    mark = 'OK  ' if ok else 'FAIL'\n    print(f'{mark}  {label}' + (f'  ({detail})' if detail else ''))\n\ntry:\n    who = boto3.client('sts').get_caller_identity()\n    check('AWS credentials', True, who['Arn'])\nexcept Exception as e:\n    check('AWS credentials', False, str(e))\n\ntry:\n    rt = boto3.client('bedrock-runtime', region_name='us-east-1')\n    resp = rt.converse(\n        modelId=MODEL_ID,\n        messages=[{'role': 'user', 'content': [{'text': 'Say hi in one word.'}]}],\n        inferenceConfig={'maxTokens': 10},\n    )\n    check('Bedrock Claude Sonnet 4.5', True, resp['output']['message']['content'][0]['text'])\nexcept Exception as e:\n    check('Bedrock Claude Sonnet 4.5', False, str(e))\n\n# `kiro-cli` is the binary name; the `kiro` shell alias only works in\n# interactive shells, not subprocess calls.\nfor tool in ('claude', 'kiro-cli', 'uv', 'git'):\n    path = shutil.which(tool)\n    check(f'`{tool}` on PATH', bool(path), path or 'not found')"
+  },
+  {
+   "cell_type": "markdown",
+   "id": "intro-05",
+   "metadata": {},
+   "source": "### What you need next\n\n- **Optional**: A terminal open to a Claude Code (or Kiro) session if you want help filling in `my_agent/` in notebook 05. Not required — the skeleton + TODOs are enough to do it by hand.\n- Keep this notebook open for reference; the next six notebooks cover the rest of the workshop.\n\nMove on to **`02 inspect prebuilt tasks.ipynb`**."
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
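
Note on the pair-programmer metrics table in notebook 01: the three retrieval metrics it names (precision@5, recall@10, MRR) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the workshop's implementation (which lives in notebook 06 and is not part of this diff). The function names, the choice to deduplicate files by first touch, and dividing precision by `k` even when fewer than `k` files were touched are all assumptions made for the example.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def dedupe(files: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each file, preserving touch order (assumed policy)."""
    seen: Set[str] = set()
    ordered: List[str] = []
    for f in files:
        if f not in seen:
            seen.add(f)
            ordered.append(f)
    return ordered


def precision_at_k(touched: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the first k distinct touched files that are relevant."""
    top = dedupe(touched)[:k]
    return sum(1 for f in top if f in relevant) / k


def recall_at_k(touched: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant files that appear among the first k distinct touched files."""
    if not relevant:
        return 0.0  # assumed convention when a task has no relevant files
    top = set(dedupe(touched)[:k])
    return len(top & relevant) / len(relevant)


def reciprocal_rank(touched: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant file, or 0.0 if none was touched."""
    for rank, f in enumerate(dedupe(touched), start=1):
        if f in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average the reciprocal rank over (touched, relevant) pairs, one per task."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(t, r) for t, r in runs) / len(runs)


# Hypothetical trace: files an agent touched, in order, for one task.
ranked = ["a.py", "b.py", "a.py", "c.py", "d.py", "e.py", "f.py"]
relevant = {"b.py", "e.py", "z.py"}

p5 = precision_at_k(ranked, relevant, 5)
r10 = recall_at_k(ranked, relevant, 10)
rr = reciprocal_rank(ranked, relevant)
```

In this hypothetical trace the repeated `a.py` is counted once, so the first relevant file (`b.py`) sits at rank 2. Whether repeats should count, and whether reads and edits are both "touches", are design choices the real harness may make differently.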