Commit 5cd5970

weiyi authored and committed
Clarify the monorepo as an agent eval toolkit
The repository already behaved like an eval and reliability toolchain, but the old monorepo name and demo output shape hid that story. This change renames the public surface to AgentEvalKit, aligns root docs and assets, and adds a root demo manifest so downstream automation can treat one file as the entrypoint to the generated artifact set.

Constraint: Preserve the existing four-tool workflow and artifact contracts
Rejected: Keep the AgentCode name and only tweak the README | still too vague about eval/regression focus
Rejected: Add another root CLI wrapper first | increases surface area before clarifying the product story
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep the root demo manifest stable because it is now the simplest machine-readable handoff for CI and dashboards
Tested: Package unit suites via unittest; end-to-end demo script with manifest verification
Not-tested: GitHub Actions run after push
1 parent 107e63e commit 5cd5970

7 files changed: 130 additions & 39 deletions

.github/workflows/ci.yml

Lines changed: 9 additions & 4 deletions
````diff
@@ -128,20 +128,25 @@ jobs:
         with:
           python-version: "3.11"
       - name: Run monorepo automation demo
-        run: ./scripts/run_automation_demo.sh /tmp/agentcode-automation-demo
+        run: ./scripts/run_automation_demo.sh /tmp/agentevalkit-automation-demo
       - name: Validate automation outputs
         run: |
           python - <<'PY'
           import json
           from pathlib import Path

-          out = Path("/tmp/agentcode-automation-demo")
+          out = Path("/tmp/agentevalkit-automation-demo")
+          assert (out / "manifest.json").exists()
           assert (out / "agentci-summary.json").exists()
           assert (out / "agentci-regression.json").exists()
           assert (out / "tracepack-pack" / "manifest.json").exists()
           assert (out / "failmap-clusters.json").exists()
           assert (out / "packslice" / "summary.json").exists()

+          demo_manifest = json.loads((out / "manifest.json").read_text())
+          assert demo_manifest["format"] == "agentevalkit-demo-v1"
+          assert demo_manifest["summary"]["agentci"]["regression_passed"] is True
+
           agentci_summary = json.loads((out / "agentci-summary.json").read_text())
           assert agentci_summary["episode_id"] == "openai-agents-demo"
           assert agentci_summary["tool_calls"] >= 1
@@ -162,5 +167,5 @@ jobs:
       - name: Upload automation demo artifacts
         uses: actions/upload-artifact@v4
         with:
-          name: agentcode-automation-demo
-          path: /tmp/agentcode-automation-demo
+          name: agentevalkit-automation-demo
+          path: /tmp/agentevalkit-automation-demo
````

CONTRIBUTING.md

Lines changed: 7 additions & 5 deletions
````diff
@@ -1,6 +1,6 @@
-# Contributing to AgentCode
+# Contributing to AgentEvalKit

-Thanks for checking out `AgentCode`.
+Thanks for checking out `AgentEvalKit`.

 This monorepo is intentionally narrow: each project should solve a concrete gap in agent reproducibility, regression testing, failure analysis, or benchmark preparation. Contributions are most useful when they strengthen that end-to-end story instead of adding unrelated demos.

@@ -63,9 +63,11 @@ For monorepo automation checks, the root demo script is often the fastest way to

 ```bash
 chmod +x scripts/run_automation_demo.sh
-./scripts/run_automation_demo.sh /tmp/agentcode-demo
+./scripts/run_automation_demo.sh /tmp/agentevalkit-demo
 ```

+That demo now writes a root `manifest.json` alongside the per-tool artifacts, which is the best single file to inspect when you want to confirm the end-to-end handoff shape.
+
 ## Repo layout

 ```text
@@ -113,7 +115,7 @@ cd projects/packslice && python -m unittest discover -s tests -v
 End-to-end validation:

 ```bash
-./scripts/run_automation_demo.sh /tmp/agentcode-demo
+./scripts/run_automation_demo.sh /tmp/agentevalkit-demo
 ```

 If you change CLI output that is documented in the README, examples, or CI workflow, update those references in the same pull request.
@@ -160,6 +162,6 @@ If you want to propose a new project for the monorepo, start by describing:
 - the missing workflow in today's agent tooling
 - why the problem is not already well served by existing OSS
 - the minimal artifact contract and CLI that would make it useful
-- how it would connect to the rest of `AgentCode`
+- how it would connect to the rest of `AgentEvalKit`

 The best proposals usually start small: one tight workflow, one useful artifact, one clear CLI, and one obvious connection to the rest of the toolchain.
````

README.md

Lines changed: 12 additions & 9 deletions
````diff
@@ -1,8 +1,8 @@
-# AgentCode
+# AgentEvalKit

-[![CI](https://github.com/Jasvina/AgentCode/actions/workflows/ci.yml/badge.svg)](https://github.com/Jasvina/AgentCode/actions/workflows/ci.yml)
-[![License](https://img.shields.io/github/license/Jasvina/AgentCode)](LICENSE)
-[![Monorepo](https://img.shields.io/badge/layout-agent%20tooling%20monorepo-0a7bbb)](https://github.com/Jasvina/AgentCode)
+[![CI](https://github.com/Jasvina/AgentEvalKit/actions/workflows/ci.yml/badge.svg)](https://github.com/Jasvina/AgentEvalKit/actions/workflows/ci.yml)
+[![License](https://img.shields.io/github/license/Jasvina/AgentEvalKit)](LICENSE)
+[![Monorepo](https://img.shields.io/badge/layout-agent%20tooling%20monorepo-0a7bbb)](https://github.com/Jasvina/AgentEvalKit)

 A public monorepo for practical open-source projects in the LLM Agent stack.

@@ -17,12 +17,12 @@ After surveying today's high-star Agent repositories, four opportunities stood o
 - teams can collect failures, but still lack a simple OSS layer for clustering recurring failure modes and prioritizing fixes across releases
 - teams can build eval packs, but still lack balanced split tooling for train/eval/test and release slicing

-`AgentCode` is a place to build those missing layers as focused OSS projects.
+`AgentEvalKit` is a place to build those missing layers as focused OSS projects.

 ## Architecture at a glance

 <p align="center">
-  <img src="docs/assets/agentcode-overview.svg" alt="AgentCode architecture overview" width="100%" />
+  <img src="docs/assets/agentevalkit-overview.svg" alt="AgentEvalKit architecture overview" width="100%" />
 </p>

 This is the intended product story for the monorepo:
@@ -36,13 +36,13 @@ This is the intended product story for the monorepo:
 ## Quick demo output

 <p align="center">
-  <img src="docs/assets/agentcode-demo-terminal.svg" alt="AgentCode terminal-style demo output" width="100%" />
+  <img src="docs/assets/agentevalkit-demo-terminal.svg" alt="AgentEvalKit terminal-style demo output" width="100%" />
 </p>

 If you want a one-command walkthrough of the whole repo:

 ```bash
-./scripts/run_automation_demo.sh /tmp/agentcode-demo
+./scripts/run_automation_demo.sh /tmp/agentevalkit-demo
 ```

 That gives visitors an immediate answer to the most important README question: “what does this repo actually produce when I run it?”
@@ -102,6 +102,7 @@ For a fuller walkthrough, see `docs/automation.md`, the companion script `script
 The root workflow now runs an end-to-end automation demo and uploads artifacts that mirror a real team handoff:

 ```text
+manifest.json
 agentci-summary.json
 agentci-regression.json
 tracepack-pack/
@@ -115,12 +116,14 @@ packslice/
   test/
 ```

+The top-level `manifest.json` acts as a machine-readable index for the full demo run, so CI jobs, dashboards, or artifact consumers can discover the output set and key summary metrics from one stable entrypoint.
+
 That makes the repo feel less like four isolated READMEs and more like one coherent toolchain.

 Here is a visual snapshot of that terminal-style demo flow:

 <p align="center">
-  <img src="docs/assets/agentcode-demo-terminal.svg" alt="AgentCode quick demo output" width="100%" />
+  <img src="docs/assets/agentevalkit-demo-terminal.svg" alt="AgentEvalKit quick demo output" width="100%" />
 </p>

 ## Monorepo structure
````
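The README changes above pitch the root `manifest.json` as a stable, machine-readable entrypoint for CI jobs and dashboards. A minimal consumer sketch: the only fields used are the two the updated CI workflow asserts on (`format` and `summary.agentci.regression_passed`), and the temporary directory here is a stand-in for the real demo output path you pass to `scripts/run_automation_demo.sh`.

```python
import json
import tempfile
from pathlib import Path

# Stand-in demo directory so this runs anywhere; in practice `out`
# is the directory passed to scripts/run_automation_demo.sh.
out = Path(tempfile.mkdtemp())
sample = {
    "format": "agentevalkit-demo-v1",
    "summary": {"agentci": {"regression_passed": True}},
}
(out / "manifest.json").write_text(json.dumps(sample))

# The root manifest is the single machine-readable entrypoint
# to the generated artifact set.
manifest = json.loads((out / "manifest.json").read_text())

# The two fields the CI workflow in this commit asserts on.
assert manifest["format"] == "agentevalkit-demo-v1"
assert manifest["summary"]["agentci"]["regression_passed"] is True
```

A dashboard or downstream job only needs to read this one file to decide whether the rest of the artifact set is worth fetching.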

docs/assets/agentcode-demo-terminal.svg renamed to docs/assets/agentevalkit-demo-terminal.svg

Lines changed: 5 additions & 5 deletions
Lines changed: 2 additions & 2 deletions

docs/automation.md

Lines changed: 15 additions & 13 deletions
````diff
@@ -1,6 +1,6 @@
 # Automation Guide

-This guide shows how to automate the full `AgentCode` toolchain in a way that works for local scripts, CI jobs, and later dashboards:
+This guide shows how to automate the full `AgentEvalKit` toolchain in a way that works for local scripts, CI jobs, and later dashboards:

 ```text
 AgentCI -> record or validate episodes
@@ -35,7 +35,7 @@ For automation, `PYTHONPATH=src python -m ...` is often the simplest because it

 Examples in this doc assume:

-- repo root: `AgentCode/`
+- repo root: `AgentEvalKit/`
 - Python 3.10+
 - commands run from the relevant project directory, or with explicit `cd`

@@ -49,6 +49,7 @@ The repo includes a companion script:

 It writes a timestamped output directory under `/tmp` by default and produces:

+- one root `manifest.json` that indexes the whole run
 - AgentCI JSON summaries
 - a TracePack pack
 - a FailMap cluster snapshot
@@ -57,7 +58,7 @@ It writes a timestamped output directory under `/tmp` by default and produces:
 To write into a fixed directory instead:

 ```bash
-./scripts/run_automation_demo.sh /tmp/agentcode-demo
+./scripts/run_automation_demo.sh /tmp/agentevalkit-demo
 ```

 ## Step-by-step pipeline
@@ -105,14 +106,14 @@ PYTHONPATH=src python3 -m tracepack.cli scan \

 PYTHONPATH=src python3 -m tracepack.cli build \
   examples/source_episodes \
-  /tmp/agentcode-demo/tracepack-pack \
+  /tmp/agentevalkit-demo/tracepack-pack \
   --only-failures \
   --redact \
   --max-per-signature 2 \
   --json

 PYTHONPATH=src python3 -m tracepack.cli inspect \
-  /tmp/agentcode-demo/tracepack-pack \
+  /tmp/agentevalkit-demo/tracepack-pack \
   --json
 ```

@@ -136,12 +137,12 @@ FailMap reads the TracePack output and turns it into a triage-oriented snapshot.
 cd projects/failmap

 PYTHONPATH=src python3 -m failmap.cli cluster \
-  /tmp/agentcode-demo/tracepack-pack \
-  /tmp/agentcode-demo/failmap-clusters.json \
+  /tmp/agentevalkit-demo/tracepack-pack \
+  /tmp/agentevalkit-demo/failmap-clusters.json \
   --json

 PYTHONPATH=src python3 -m failmap.cli summarize \
-  /tmp/agentcode-demo/failmap-clusters.json \
+  /tmp/agentevalkit-demo/failmap-clusters.json \
   --json
 ```

@@ -179,16 +180,16 @@ PackSlice works directly on the TracePack artifact, so you can prepare datasets
 cd projects/packslice

 PYTHONPATH=src python3 -m packslice.cli split \
-  /tmp/agentcode-demo/tracepack-pack \
-  /tmp/agentcode-demo/packslice \
+  /tmp/agentevalkit-demo/tracepack-pack \
+  /tmp/agentevalkit-demo/packslice \
   --group-by signature \
   --train-ratio 70 \
   --eval-ratio 15 \
   --test-ratio 15 \
   --json

 PYTHONPATH=src python3 -m packslice.cli summarize \
-  /tmp/agentcode-demo/packslice \
+  /tmp/agentevalkit-demo/packslice \
   --json
 ```

@@ -230,13 +231,13 @@ You do not need `jq`; plain Python works everywhere GitHub Actions already has P
 agentci summarize examples/openai_agents_episode.json --json > summary.json
 python -c "import json; data=json.load(open('summary.json')); assert data['tool_calls'] >= 1"

-tracepack inspect /tmp/agentcode-demo/tracepack-pack --json > inspect.json
+tracepack inspect /tmp/agentevalkit-demo/tracepack-pack --json > inspect.json
 python -c "import json; data=json.load(open('inspect.json')); assert data['case_count'] >= 1"

 failmap compare-summary compare.json --json > compare-summary.json
 python -c "import json; data=json.load(open('compare-summary.json')); assert 'summary' in data"

-packslice summarize /tmp/agentcode-demo/packslice --json > split-summary.json
+packslice summarize /tmp/agentevalkit-demo/packslice --json > split-summary.json
 python -c "import json; data=json.load(open('split-summary.json')); assert len(data['splits']) == 3"
 ```

@@ -246,6 +247,7 @@ For a real CI job, a layout like this works well:

 ```text
 artifacts/
+  manifest.json
   agentci/
     summary.json
     regression.json
````
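The CI artifacts layout sketched in the final hunk can be enforced with a small pre-publish gate. This is a hedged sketch, not part of the commit: the `REQUIRED` list and `missing_artifacts` helper are hypothetical names, and the temporary directory stands in for a job's real `artifacts/` tree.

```python
import json
import tempfile
from pathlib import Path

# Required entries from the artifacts/ layout above; extend per job.
REQUIRED = ["manifest.json", "agentci/summary.json", "agentci/regression.json"]

def missing_artifacts(root: Path) -> list[str]:
    """Return the required files that are absent under root."""
    return [rel for rel in REQUIRED if not (root / rel).exists()]

# Stand-in tree so the gate runs anywhere; in CI, point this at the
# job's real artifacts/ directory instead.
root = Path(tempfile.mkdtemp())
(root / "agentci").mkdir()
(root / "manifest.json").write_text(json.dumps({"format": "agentevalkit-demo-v1"}))
(root / "agentci" / "summary.json").write_text("{}")
(root / "agentci" / "regression.json").write_text("{}")

assert missing_artifacts(root) == []
```

Failing the job when `missing_artifacts` is non-empty keeps the manifest-plus-artifacts contract from drifting silently.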
