Original file line number Diff line number Diff line change
@@ -0,0 +1,65 @@
name: PR Review (aligned)

on:
  pull_request:
    types: [opened, synchronize, reopened]

# Concurrency: one review per PR. New pushes cancel the prior run.
concurrency:
  group: pr-review-${{ github.event.pull_request.number }}
  cancel-in-progress: true

permissions:
  id-token: write # required for AWS OIDC
  contents: read
  pull-requests: write # to post the review comment

jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout PR
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
          ref: ${{ github.event.pull_request.head.sha }}

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install reviewer deps
        run: pip install boto3

      - name: Configure AWS credentials via OIDC
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: ${{ vars.AWS_REGION || 'us-east-1' }}

      # Adjust PYTHONPATH below to the parent directory of wherever you copy the pr_reviewer package.
      # Here we assume it lives at .github/pr_reviewer/ inside the consumer repo.
      - name: Run aligned PR reviewer
        id: review
        run: |
          set +e
          python -m pr_reviewer \
            --repo . \
            --base origin/${{ github.event.pull_request.base.ref }} \
            --format markdown > review.md
          echo "exit=$?" >> "$GITHUB_OUTPUT"
        env:
          PYTHONPATH: ${{ github.workspace }}/.github

      - name: Post review as PR comment
        uses: marocchino/sticky-pull-request-comment@v2
        with:
          header: aligned-pr-review
          path: review.md

      - name: Fail the check if review failed
        if: steps.review.outputs.exit != '0'
        run: |
          echo "Aligned PR review failed. See the PR comment for details."
          exit 1
@@ -0,0 +1,59 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "intro-00",
"metadata": {},
"source": "# Coding Assistant Evaluations — Notebook 01: Intro & Approach\n\nThis workshop scores three coding agents — **Claude Code**, **Kiro**, and a **custom agent you build** — on a curated, prebuilt task set, along **two axes**:\n\n1. **Pair-programmer** — when you ask a question about the codebase, does the agent surface the right files (an information-retrieval problem) and answer correctly?\n2. **Autonomous** — given a task, does it produce a mergeable diff reliably?\n\n## How this workshop is structured\n\nThe eval data is **prebuilt and curated**, not LLM-generated. That matters because if Claude writes the tasks *and* gets graded on them, you measure self-preference, not capability. We ship:\n\n- A 9-task `tasks.yaml` grounded in real files in [`aws-samples/sample-agentic-platform`](https://github.com/aws-samples/sample-agentic-platform) at a pinned SHA.\n- Per-task rubrics under `scaffolding/ground_truth/`.\n- A hand-authored gold-standard set under `scaffolding/gold_standard/` for calibrating the LLM judge.\n\nYou spend the first few notebooks **inspecting** that data so you understand the schema and trust the methodology. Then you fill in a **custom-agent harness skeleton** at `my_agent/` (CLI plumbing already done — you implement the agent loop and tools), and run the eval against all three agents.\n\nBy the end you will have:\n\n1. A reproducible 9-task eval set (5 normal + 2 trap + 2 nav-only).\n2. Calibrated rubrics: 7 ground-truth rubrics + 8 hand-authored gold-standard pairs proving the LLM judge agrees with humans ≥ 80 % of the time.\n3. A working custom Strands-based agent that you wired up.\n4. **Pair-programmer scorecard** — precision@5, recall@10, MRR, answer accuracy, citation grounding, honesty.\n5. **Autonomous scorecard** — pass rates by difficulty, reliability across 3 seeds on the hardest tasks, sequence-aware tool-call quality, wall-clock efficiency.\n6. 
A reusable PR-review CI workflow — same reviewer, dropped into GitHub Actions.\n\n## Why not just use SWE-Bench?\n\nPublic benchmarks are fine for vendor comparison but wrong for deciding whether an agent will work on **your** codebase. Two reasons:\n\n- **Relevance**: the tasks don't look like your code, use your MCP tooling, or follow your review standards.\n- **Contamination**: SWE-Bench tasks come from high-traffic public repos. Model providers almost certainly have the issues, PRs, and discussion in training data. Strong scores mix capability with leakage at an unknowable ratio.\n\nThis workshop uses a public repo (`aws-samples/sample-agentic-platform`) as a stand-in so everyone can follow along, but the whole structure generalizes: swap the repo URL in `tasks.yaml`, hand-curate new tasks following the schema docs in notebooks 02 and 03, run the eval.\n\n## Pair-programmer metrics (notebook 06)\n\n| Metric | What it asks |\n|---|---|\n| **precision@5** | Of the first 5 files the agent touched, how many were relevant? |\n| **recall@10** | Of the relevant files, how many surfaced in the first 10? |\n| **MRR** | How quickly did the first relevant file appear? |\n| **answer accuracy** | LLM judge against your ground-truth answer |\n| **citation grounding** | All `path:line` refs in the answer actually exist |\n| **honesty** | On trap tasks, did the agent refuse to fabricate a fix? |\n\n## Autonomous metrics (notebook 07)\n\n| Signal | What it measures |\n|---|---|\n| **Rubric review** | Does the produced PR meet our review standards? |\n| **Tests** | Does it still work? |\n| **Static** | Does it pass the repo's linters? |\n| **Tool-call (sequence-aware)** | Right tools, called *before* the edit, results consumed? |\n| **Reliability** | Pass-rate across 3 seeds on the hardest tasks |\n| **Wall-clock efficiency** | Uniform across all 3 agents (Kiro can't be intercepted for tokens) |"
},
{
"cell_type": "markdown",
"id": "intro-01",
"metadata": {},
"source": "## How the notebooks work\n\nTwo phases:\n\n**Phase 1 — Inspect the prebuilt data (notebooks 02-04).** You read the curated tasks, rubrics, and gold-standard, validate them with the included validators, and learn the schema in case you want to adapt the workshop to your own repo. No code-writing.\n\n**Phase 2 — Run the eval (notebooks 05-07).** You fill in the `my_agent/` harness skeleton, then run the pair-programmer and autonomous evals against all three agents.\n\n## Module layout\n\n| | |\n|---|---|\n| **01** (this) | Why + how + env check |\n| **02** | Inspect the prebuilt task set + schema reference |\n| **03** | Inspect the prebuilt rubrics + gold standard + schema reference |\n| **04** | Calibrate the automated PR reviewer against the gold set |\n| **05** | Fill in the `my_agent/` harness skeleton (CLI plumbing already done) |\n| **06** | Pair-programmer eval — IR + correctness + grounding + honesty |\n| **07** | Autonomous eval + reliability sub-study + final report |\n\n## Prereqs\n\nNotebook 01 (this one) is the only one with a meaningful code cell — it verifies your environment. Notebooks 02-04 just read prebuilt data. Notebook 05 is where you'll start writing code."
},
{
"cell_type": "code",
"execution_count": null,
"id": "intro-02",
"metadata": {},
"outputs": [],
"source": [
"%pip install -q -r requirements.txt"
]
},
{
"cell_type": "markdown",
"id": "intro-03",
"metadata": {},
"source": "### Environment smoke test\n\nVerifies: AWS creds, Bedrock model access, `claude` CLI, `kiro-cli`, `uv` and `git`.\n\nIf `kiro-cli` is missing, you can still do the workshop — just exclude it from the final run in notebook 07.\n\n> If `claude` shows FAIL but you know it's installed, your Jupyter kernel's `PATH` doesn't include `~/.local/bin` (or wherever the binary lives). Restart the kernel from a shell that has it on `PATH`, or set `os.environ['PATH']` in this notebook before the check."
},
{
"cell_type": "code",
"execution_count": null,
"id": "intro-04",
"metadata": {},
"outputs": [],
"source": "import os, shutil, boto3\n\n# Make sure user-local bins are on PATH for the kernel — Jupyter often\n# launches with a stripped-down PATH that misses ~/.local/bin etc.\nfor p in (os.path.expanduser('~/.local/bin'), '/opt/homebrew/bin', '/usr/local/bin'):\n if p not in os.environ.get('PATH', '').split(':') and os.path.isdir(p):\n os.environ['PATH'] = p + ':' + os.environ.get('PATH', '')\n\nMODEL_ID = 'us.anthropic.claude-sonnet-4-5-20250929-v1:0'\n\ndef check(label, ok, detail=''):\n mark = 'OK ' if ok else 'FAIL'\n print(f'{mark} {label}' + (f' ({detail})' if detail else ''))\n\ntry:\n who = boto3.client('sts').get_caller_identity()\n check('AWS credentials', True, who['Arn'])\nexcept Exception as e:\n check('AWS credentials', False, str(e))\n\ntry:\n rt = boto3.client('bedrock-runtime', region_name='us-east-1')\n resp = rt.converse(\n modelId=MODEL_ID,\n messages=[{'role': 'user', 'content': [{'text': 'Say hi in one word.'}]}],\n inferenceConfig={'maxTokens': 10},\n )\n check('Bedrock Claude Sonnet 4.5', True, resp['output']['message']['content'][0]['text'])\nexcept Exception as e:\n check('Bedrock Claude Sonnet 4.5', False, str(e))\n\n# `kiro-cli` is the binary name; the `kiro` shell alias only works in\n# interactive shells, not subprocess calls.\nfor tool in ('claude', 'kiro-cli', 'uv', 'git'):\n path = shutil.which(tool)\n check(f'`{tool}` on PATH', bool(path), path or 'not found')"
},
{
"cell_type": "markdown",
"id": "intro-05",
"metadata": {},
"source": "### What you need next\n\n- **Optional**: A terminal open to a Claude Code (or Kiro) session if you want help filling in `my_agent/` in notebook 05. Not required — the skeleton + TODOs are enough to do it by hand.\n- This notebook stays open; the next six will do the rest.\n\nMove on to **`02 inspect prebuilt tasks.ipynb`**."
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
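The pair-programmer metrics named in the notebook's intro cell (precision@5, recall@10, MRR) can be sketched as below. This is a minimal illustration and not the workshop's actual scoring code. The function names are hypothetical, and it assumes each agent run yields a deduplicated list of file paths in first-touch order, scored against a hand-curated set of relevant files. It also assumes precision divides by `k` rather than by the number of files actually touched; the notebook does not say which convention it uses.

```python
from typing import Iterable, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the first k touched files that are relevant (denominator is k)."""
    if k <= 0:
        return 0.0
    hits = sum(1 for path in ranked[:k] if path in relevant)
    return hits / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the relevant files that appear among the first k touched files."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / (1-based position of the first relevant file), or 0.0 if none appears."""
    for position, path in enumerate(ranked, start=1):
        if path in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked, relevant) pairs, one pair per task."""
    scores = [reciprocal_rank(ranked, relevant) for ranked, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical task: the agent touched six files, three files are relevant.
    touched = ["a.py", "b.py", "c.py", "d.py", "e.py", "f.py"]
    gold = {"b.py", "f.py", "z.py"}
    print(precision_at_k(touched, gold, 5))
    print(recall_at_k(touched, gold, 10))
    print(reciprocal_rank(touched, gold))
```

Dividing by `k` penalizes an agent that touches fewer than `k` files; dividing by `len(ranked[:k])` instead would not, so the choice is worth stating alongside any reported score.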