Commit 5cd5970

weiyi authored and committed
Clarify the monorepo as an agent eval toolkit
The repository already behaved like an eval and reliability toolchain, but the old monorepo name and demo output shape hid that story. This change renames the public surface to AgentEvalKit, aligns root docs and assets, and adds a root demo manifest so downstream automation can treat one file as the entrypoint to the generated artifact set.

Constraint: Preserve the existing four-tool workflow and artifact contracts
Rejected: Keep the AgentCode name and only tweak the README | still too vague about eval/regression focus
Rejected: Add another root CLI wrapper first | increases surface area before clarifying the product story
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep the root demo manifest stable because it is now the simplest machine-readable handoff for CI and dashboards
Tested: Package unit suites via unittest; end-to-end demo script with manifest verification
Not-tested: GitHub Actions run after push
1 parent 107e63e commit 5cd5970

7 files changed: 130 additions & 39 deletions

.github/workflows/ci.yml

Lines changed: 9 additions & 4 deletions
````diff
@@ -128,20 +128,25 @@ jobs:
         with:
           python-version: "3.11"
       - name: Run monorepo automation demo
-        run: ./scripts/run_automation_demo.sh /tmp/agentcode-automation-demo
+        run: ./scripts/run_automation_demo.sh /tmp/agentevalkit-automation-demo
       - name: Validate automation outputs
         run: |
           python - <<'PY'
           import json
           from pathlib import Path

-          out = Path("/tmp/agentcode-automation-demo")
+          out = Path("/tmp/agentevalkit-automation-demo")
+          assert (out / "manifest.json").exists()
           assert (out / "agentci-summary.json").exists()
           assert (out / "agentci-regression.json").exists()
           assert (out / "tracepack-pack" / "manifest.json").exists()
           assert (out / "failmap-clusters.json").exists()
           assert (out / "packslice" / "summary.json").exists()

+          demo_manifest = json.loads((out / "manifest.json").read_text())
+          assert demo_manifest["format"] == "agentevalkit-demo-v1"
+          assert demo_manifest["summary"]["agentci"]["regression_passed"] is True
+
           agentci_summary = json.loads((out / "agentci-summary.json").read_text())
           assert agentci_summary["episode_id"] == "openai-agents-demo"
           assert agentci_summary["tool_calls"] >= 1
@@ -162,5 +167,5 @@ jobs:
       - name: Upload automation demo artifacts
         uses: actions/upload-artifact@v4
         with:
-          name: agentcode-automation-demo
-          path: /tmp/agentcode-automation-demo
+          name: agentevalkit-automation-demo
+          path: /tmp/agentevalkit-automation-demo
````

CONTRIBUTING.md

Lines changed: 7 additions & 5 deletions
````diff
@@ -1,6 +1,6 @@
-# Contributing to AgentCode
+# Contributing to AgentEvalKit

-Thanks for checking out `AgentCode`.
+Thanks for checking out `AgentEvalKit`.

 This monorepo is intentionally narrow: each project should solve a concrete gap in agent reproducibility, regression testing, failure analysis, or benchmark preparation. Contributions are most useful when they strengthen that end-to-end story instead of adding unrelated demos.

@@ -63,9 +63,11 @@ For monorepo automation checks, the root demo script is often the fastest way to

 ```bash
 chmod +x scripts/run_automation_demo.sh
-./scripts/run_automation_demo.sh /tmp/agentcode-demo
+./scripts/run_automation_demo.sh /tmp/agentevalkit-demo
 ```

+That demo now writes a root `manifest.json` alongside the per-tool artifacts, which is the best single file to inspect when you want to confirm the end-to-end handoff shape.
+
 ## Repo layout

 ```text
@@ -113,7 +115,7 @@ cd projects/packslice && python -m unittest discover -s tests -v
 End-to-end validation:

 ```bash
-./scripts/run_automation_demo.sh /tmp/agentcode-demo
+./scripts/run_automation_demo.sh /tmp/agentevalkit-demo
 ```

 If you change CLI output that is documented in the README, examples, or CI workflow, update those references in the same pull request.
@@ -160,6 +162,6 @@ If you want to propose a new project for the monorepo, start by describing:
 - the missing workflow in today's agent tooling
 - why the problem is not already well served by existing OSS
 - the minimal artifact contract and CLI that would make it useful
-- how it would connect to the rest of `AgentCode`
+- how it would connect to the rest of `AgentEvalKit`

 The best proposals usually start small: one tight workflow, one useful artifact, one clear CLI, and one obvious connection to the rest of the toolchain.
````

README.md

Lines changed: 12 additions & 9 deletions
````diff
@@ -1,8 +1,8 @@
-# AgentCode
+# AgentEvalKit

-[![CI](https://github.com/Jasvina/AgentCode/actions/workflows/ci.yml/badge.svg)](https://github.com/Jasvina/AgentCode/actions/workflows/ci.yml)
-[![License](https://img.shields.io/github/license/Jasvina/AgentCode)](LICENSE)
-[![Monorepo](https://img.shields.io/badge/layout-agent%20tooling%20monorepo-0a7bbb)](https://github.com/Jasvina/AgentCode)
+[![CI](https://github.com/Jasvina/AgentEvalKit/actions/workflows/ci.yml/badge.svg)](https://github.com/Jasvina/AgentEvalKit/actions/workflows/ci.yml)
+[![License](https://img.shields.io/github/license/Jasvina/AgentEvalKit)](LICENSE)
+[![Monorepo](https://img.shields.io/badge/layout-agent%20tooling%20monorepo-0a7bbb)](https://github.com/Jasvina/AgentEvalKit)

 A public monorepo for practical open-source projects in the LLM Agent stack.

@@ -17,12 +17,12 @@ After surveying today's high-star Agent repositories, four opportunities stood o
 - teams can collect failures, but still lack a simple OSS layer for clustering recurring failure modes and prioritizing fixes across releases
 - teams can build eval packs, but still lack balanced split tooling for train/eval/test and release slicing

-`AgentCode` is a place to build those missing layers as focused OSS projects.
+`AgentEvalKit` is a place to build those missing layers as focused OSS projects.

 ## Architecture at a glance

 <p align="center">
-  <img src="docs/assets/agentcode-overview.svg" alt="AgentCode architecture overview" width="100%" />
+  <img src="docs/assets/agentevalkit-overview.svg" alt="AgentEvalKit architecture overview" width="100%" />
 </p>

 This is the intended product story for the monorepo:
@@ -36,13 +36,13 @@ This is the intended product story for the monorepo:
 ## Quick demo output

 <p align="center">
-  <img src="docs/assets/agentcode-demo-terminal.svg" alt="AgentCode terminal-style demo output" width="100%" />
+  <img src="docs/assets/agentevalkit-demo-terminal.svg" alt="AgentEvalKit terminal-style demo output" width="100%" />
 </p>

 If you want a one-command walkthrough of the whole repo:

 ```bash
-./scripts/run_automation_demo.sh /tmp/agentcode-demo
+./scripts/run_automation_demo.sh /tmp/agentevalkit-demo
 ```

 That gives visitors an immediate answer to the most important README question: “what does this repo actually produce when I run it?”
@@ -102,6 +102,7 @@ For a fuller walkthrough, see `docs/automation.md`, the companion script `script
 The root workflow now runs an end-to-end automation demo and uploads artifacts that mirror a real team handoff:

 ```text
+manifest.json
 agentci-summary.json
 agentci-regression.json
 tracepack-pack/
@@ -115,12 +116,14 @@ packslice/
   test/
 ```

+The top-level `manifest.json` acts as a machine-readable index for the full demo run, so CI jobs, dashboards, or artifact consumers can discover the output set and key summary metrics from one stable entrypoint.
+
 That makes the repo feel less like four isolated READMEs and more like one coherent toolchain.

 Here is a visual snapshot of that terminal-style demo flow:

 <p align="center">
-  <img src="docs/assets/agentcode-demo-terminal.svg" alt="AgentCode quick demo output" width="100%" />
+  <img src="docs/assets/agentevalkit-demo-terminal.svg" alt="AgentEvalKit quick demo output" width="100%" />
 </p>

 ## Monorepo structure
````
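The README changes above pitch the root `manifest.json` as a stable, machine-readable entrypoint for CI jobs and dashboards. A minimal consumer sketch: the only fields used are the two the updated CI workflow asserts on (`format` and `summary.agentci.regression_passed`), and the temporary directory here is a stand-in for the real demo output path you pass to `scripts/run_automation_demo.sh`.

```python
import json
import tempfile
from pathlib import Path

# Stand-in demo directory so this runs anywhere; in practice `out`
# is the directory passed to scripts/run_automation_demo.sh.
out = Path(tempfile.mkdtemp())
sample = {
    "format": "agentevalkit-demo-v1",
    "summary": {"agentci": {"regression_passed": True}},
}
(out / "manifest.json").write_text(json.dumps(sample))

# The root manifest is the single machine-readable entrypoint
# to the generated artifact set.
manifest = json.loads((out / "manifest.json").read_text())

# The two fields the CI workflow in this commit asserts on.
assert manifest["format"] == "agentevalkit-demo-v1"
assert manifest["summary"]["agentci"]["regression_passed"] is True
```

A dashboard or downstream job only needs to read this one file to decide whether the rest of the artifact set is worth fetching.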

docs/assets/agentcode-demo-terminal.svg renamed to docs/assets/agentevalkit-demo-terminal.svg

Lines changed: 5 additions & 5 deletions
Lines changed: 2 additions & 2 deletions

docs/automation.md

Lines changed: 15 additions & 13 deletions
````diff
@@ -1,6 +1,6 @@
 # Automation Guide

-This guide shows how to automate the full `AgentCode` toolchain in a way that works for local scripts, CI jobs, and later dashboards:
+This guide shows how to automate the full `AgentEvalKit` toolchain in a way that works for local scripts, CI jobs, and later dashboards:

 ```text
 AgentCI -> record or validate episodes
@@ -35,7 +35,7 @@ For automation, `PYTHONPATH=src python -m ...` is often the simplest because it

 Examples in this doc assume:

-- repo root: `AgentCode/`
+- repo root: `AgentEvalKit/`
 - Python 3.10+
 - commands run from the relevant project directory, or with explicit `cd`

@@ -49,6 +49,7 @@ The repo includes a companion script:

 It writes a timestamped output directory under `/tmp` by default and produces:

+- one root `manifest.json` that indexes the whole run
 - AgentCI JSON summaries
 - a TracePack pack
 - a FailMap cluster snapshot
@@ -57,7 +58,7 @@ It writes a timestamped output directory under `/tmp` by default and produces:
 To write into a fixed directory instead:

 ```bash
-./scripts/run_automation_demo.sh /tmp/agentcode-demo
+./scripts/run_automation_demo.sh /tmp/agentevalkit-demo
 ```

 ## Step-by-step pipeline
@@ -105,14 +106,14 @@ PYTHONPATH=src python3 -m tracepack.cli scan \

 PYTHONPATH=src python3 -m tracepack.cli build \
   examples/source_episodes \
-  /tmp/agentcode-demo/tracepack-pack \
+  /tmp/agentevalkit-demo/tracepack-pack \
   --only-failures \
   --redact \
   --max-per-signature 2 \
   --json

 PYTHONPATH=src python3 -m tracepack.cli inspect \
-  /tmp/agentcode-demo/tracepack-pack \
+  /tmp/agentevalkit-demo/tracepack-pack \
   --json
 ```

@@ -136,12 +137,12 @@ FailMap reads the TracePack output and turns it into a triage-oriented snapshot.
 cd projects/failmap

 PYTHONPATH=src python3 -m failmap.cli cluster \
-  /tmp/agentcode-demo/tracepack-pack \
-  /tmp/agentcode-demo/failmap-clusters.json \
+  /tmp/agentevalkit-demo/tracepack-pack \
+  /tmp/agentevalkit-demo/failmap-clusters.json \
   --json

 PYTHONPATH=src python3 -m failmap.cli summarize \
-  /tmp/agentcode-demo/failmap-clusters.json \
+  /tmp/agentevalkit-demo/failmap-clusters.json \
   --json
 ```

@@ -179,16 +180,16 @@ PackSlice works directly on the TracePack artifact, so you can prepare datasets
 cd projects/packslice

 PYTHONPATH=src python3 -m packslice.cli split \
-  /tmp/agentcode-demo/tracepack-pack \
-  /tmp/agentcode-demo/packslice \
+  /tmp/agentevalkit-demo/tracepack-pack \
+  /tmp/agentevalkit-demo/packslice \
   --group-by signature \
   --train-ratio 70 \
   --eval-ratio 15 \
   --test-ratio 15 \
   --json

 PYTHONPATH=src python3 -m packslice.cli summarize \
-  /tmp/agentcode-demo/packslice \
+  /tmp/agentevalkit-demo/packslice \
   --json
 ```

@@ -230,13 +231,13 @@ You do not need `jq`; plain Python works everywhere GitHub Actions already has P
 agentci summarize examples/openai_agents_episode.json --json > summary.json
 python -c "import json; data=json.load(open('summary.json')); assert data['tool_calls'] >= 1"

-tracepack inspect /tmp/agentcode-demo/tracepack-pack --json > inspect.json
+tracepack inspect /tmp/agentevalkit-demo/tracepack-pack --json > inspect.json
 python -c "import json; data=json.load(open('inspect.json')); assert data['case_count'] >= 1"

 failmap compare-summary compare.json --json > compare-summary.json
 python -c "import json; data=json.load(open('compare-summary.json')); assert 'summary' in data"

-packslice summarize /tmp/agentcode-demo/packslice --json > split-summary.json
+packslice summarize /tmp/agentevalkit-demo/packslice --json > split-summary.json
 python -c "import json; data=json.load(open('split-summary.json')); assert len(data['splits']) == 3"
 ```

@@ -246,6 +247,7 @@ For a real CI job, a layout like this works well:

 ```text
 artifacts/
+  manifest.json
   agentci/
     summary.json
     regression.json
````
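The CI artifacts layout sketched in the final hunk can be enforced with a small pre-publish gate. This is a hedged sketch, not part of the commit: the `REQUIRED` list and `missing_artifacts` helper are hypothetical names, and the temporary directory stands in for a job's real `artifacts/` tree.

```python
import json
import tempfile
from pathlib import Path

# Required entries from the artifacts/ layout above; extend per job.
REQUIRED = ["manifest.json", "agentci/summary.json", "agentci/regression.json"]

def missing_artifacts(root: Path) -> list[str]:
    """Return the required files that are absent under root."""
    return [rel for rel in REQUIRED if not (root / rel).exists()]

# Stand-in tree so the gate runs anywhere; in CI, point this at the
# job's real artifacts/ directory instead.
root = Path(tempfile.mkdtemp())
(root / "agentci").mkdir()
(root / "manifest.json").write_text(json.dumps({"format": "agentevalkit-demo-v1"}))
(root / "agentci" / "summary.json").write_text("{}")
(root / "agentci" / "regression.json").write_text("{}")

assert missing_artifacts(root) == []
```

Failing the job when `missing_artifacts` is non-empty keeps the manifest-plus-artifacts contract from drifting silently.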
