
Commit 2cd6640

weiyiweiyi authored and committed
Make the GitHub repo feel like a mature OSS project
The repo already had a stronger name and clearer eval-tooling scope, but the public GitHub surface still lacked the trust signals and landing-page structure that help visitors quickly understand the project and help contributors engage with it safely. This commit adds core community-health docs and rewrites the README to lead with the product story, the artifact flow, and machine-readable demo outputs.

Constraint: keep the README aligned with the existing four-tool workflow and validated demo artifacts.
Rejected: add long governance/process docs first (too much ceremony for the current project stage).
Rejected: keep the previous README structure and only tweak wording (still buried the core value proposition too deep).
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: keep the README optimized for first-time GitHub visitors; lead with the reliability loop and artifact outputs, not internal repo structure.
Tested: package unit suites via unittest; end-to-end demo script; presence of community-health docs and root manifest verification.
Not-tested: GitHub community-health checklist UI after push.
1 parent 37e5af4 commit 2cd6640

4 files changed

Lines changed: 231 additions & 113 deletions

File tree

CODE_OF_CONDUCT.md

Lines changed: 43 additions & 0 deletions
New file:

# Code of Conduct

## Our Commitment

We want `AgentEvalKit` to be a useful, welcoming open-source project for people working on agent evals, infrastructure, reliability, and research tooling.

Contributors, maintainers, and community members are expected to keep interactions respectful, constructive, and focused on improving the work.

## Expected Behavior

Examples of behavior that help this project:

- giving actionable technical feedback
- assuming good intent while still being precise about problems
- discussing tradeoffs without turning disagreements personal
- keeping critique focused on artifacts, code, docs, or decisions
- helping others reproduce bugs and validate fixes

## Unacceptable Behavior

Examples of unacceptable behavior include:

- harassment, abuse, or personal attacks
- discriminatory or sexualized language or imagery
- doxxing, threats, or intimidation
- bad-faith trolling or deliberate derailment
- publishing private or sensitive information without permission

## Enforcement

Project maintainers may remove comments, issues, pull requests, or contributors whose behavior violates this code of conduct.

## Reporting

To report conduct issues, email:

- `conduct@jasvina.com`

Please include links, screenshots, or other context when possible.

## Attribution

This policy is a lightweight, project-specific adaptation inspired by common open-source community standards, including the Contributor Covenant.

README.md

Lines changed: 105 additions & 113 deletions
````diff
@@ -4,141 +4,82 @@
 [![License](https://img.shields.io/github/license/Jasvina/AgentEvalKit)](LICENSE)
 [![Monorepo](https://img.shields.io/badge/layout-agent%20tooling%20monorepo-0a7bbb)](https://github.com/Jasvina/AgentEvalKit)

-A public monorepo for practical open-source projects in the LLM Agent stack.
+Open-source tooling for agent evals, regression testing, trace packaging, failure clustering, and dataset slicing.

-I am deliberately not collecting random demos here. Each project is chosen from a specific gap in the current GitHub landscape: crowded categories already have plenty of frameworks, browser agents, coding agents, and memory layers, so this repo focuses on under-built infrastructure around reproducibility, regression testing, and turning real traces into reusable eval assets.
+`AgentEvalKit` is a focused monorepo for a specific gap in the LLM agent stack: teams can build agents, but still struggle to replay failures, turn real traces into reusable eval assets, cluster recurring failure modes, and produce stable train/eval/test slices from the same evidence.

-## Why this repo exists
+## Why this exists

-After surveying today's high-star Agent repositories, four opportunities stood out:
+A lot of agent repos optimize for demos, orchestration, or UI. Fewer repos help with the reliability loop after a run goes wrong.

-- teams can build agents, but still struggle to replay failures and guard against regressions
-- teams can trace agents, but still lack clean tooling to turn real trajectories into reusable eval packs and benchmark cases
-- teams can collect failures, but still lack a simple OSS layer for clustering recurring failure modes and prioritizing fixes across releases
-- teams can build eval packs, but still lack balanced split tooling for train/eval/test and release slicing
+This repo is built around that loop:

-`AgentEvalKit` is a place to build those missing layers as focused OSS projects.
+1. capture a real run
+2. replay or diff it in CI
+3. package it into a reusable eval artifact
+4. cluster repeated failures across runs or releases
+5. slice the same artifact into reproducible datasets

-## Architecture at a glance
+That makes `AgentEvalKit` closer to an eval-and-reliability toolkit than a general agent framework.

-<p align="center">
-<img src="docs/assets/agentevalkit-overview.svg" alt="AgentEvalKit architecture overview" width="100%" />
-</p>
-
-This is the intended product story for the monorepo:
+## What you get

-- `AgentCI` turns real runs into replayable regression artifacts
-- `TracePack` packages those runs into reusable benchmark cases
-- `FailMap` turns repeated failures into triage-ready clusters
-- `PackSlice` creates stable train/eval/test slices from the same pack
-- the root CI workflow validates that the whole chain works end to end
+- `AgentCI` for replay-first regression testing of tool-using agents
+- `TracePack` for turning real traces into reusable benchmark packs
+- `FailMap` for clustering recurring failures and comparing releases
+- `PackSlice` for balanced train/eval/test splits from the same pack
+- a root automation flow that proves the whole chain works together

-## Quick demo output
+## Toolchain at a glance

 <p align="center">
-<img src="docs/assets/agentevalkit-demo-terminal.svg" alt="AgentEvalKit terminal-style demo output" width="100%" />
+<img src="docs/assets/agentevalkit-overview.svg" alt="AgentEvalKit architecture overview" width="100%" />
 </p>

-If you want a one-command walkthrough of the whole repo:
-
-```bash
-./scripts/run_automation_demo.sh /tmp/agentevalkit-demo
-```
-
-That gives visitors an immediate answer to the most important README question: "what does this repo actually produce when I run it?"
-
-## Projects
-
-### 1. AgentCI
-
-Path: `projects/agentci`
-
-Replay-first regression testing for tool-using LLM agents, with portable episode traces, HTML diff reports, and pytest-friendly regression assertions.
-
-### 2. TracePack
-
-Path: `projects/tracepack`
-
-Build reusable benchmark packs from real agent traces, with recursive redaction, case labels, jsonl/chat export, and signature-aware sampling for eval pipelines.
-
-### 3. FailMap
-
-Path: `projects/failmap`
-
-Cluster recurring agent failures from TracePack packs, compare releases, generate issue-ready triage drafts with rules-driven routing, bundle them for planning, and track failure trends across snapshots.
-
-### 4. PackSlice
-
-Path: `projects/packslice`
-
-Create balanced train/eval/test splits from TracePack packs with distribution-aware, label-aware, and chronological slicing modes.
-
-## Toolchain story
-
 ```text
 AgentCI -> record and diff trajectories
 TracePack -> turn trajectories into reusable benchmark packs
 FailMap -> cluster failures, compare releases, generate triage issues, bundle work
 PackSlice -> split packs into balanced train/eval/test datasets
 ```

-## Machine-readable CLI story
+## What the demo produces

-All four projects now support JSON-friendly CLI flows, so they can be chained in CI without scraping human text output:
+<p align="center">
+<img src="docs/assets/agentevalkit-demo-terminal.svg" alt="AgentEvalKit terminal-style demo output" width="100%" />
+</p>
+
+Run the end-to-end repo demo with:

 ```bash
-agentci summarize projects/agentci/examples/math_episode.json --json
-tracepack scan projects/tracepack/examples/source_episodes --json
-failmap summarize projects/failmap/examples/clusters.json --json
-packslice summarize projects/packslice/examples/split_demo --json
+./scripts/run_automation_demo.sh /tmp/agentevalkit-demo
 ```

-That makes it easier to build release checks, artifact pipelines, and automated dashboards on top of the same OSS commands shown in the READMEs.
-
-For a fuller walkthrough, see `docs/automation.md`, the companion script `scripts/run_automation_demo.sh`, and the monorepo contributor guide in `CONTRIBUTING.md`.
-
-## What the monorepo demo produces
-
-The root workflow now runs an end-to-end automation demo and uploads artifacts that mirror a real team handoff:
+The output is intentionally machine-readable. A successful run gives you a root `manifest.json` plus per-tool artifacts:

 ```text
 manifest.json
 agentci-summary.json
 agentci-regression.json
+tracepack-scan.json
+tracepack-build.json
+tracepack-inspect.json
 tracepack-pack/
   manifest.json
   cases/
+failmap-cluster.json
 failmap-clusters.json
+failmap-summary.json
+packslice-split.json
+packslice-summary.json
 packslice/
   summary.json
   train/
   eval/
   test/
 ```

-The top-level `manifest.json` acts as a machine-readable index for the full demo run, so CI jobs, dashboards, or artifact consumers can discover the output set and key summary metrics from one stable entrypoint.
-
-That makes the repo feel less like four isolated READMEs and more like one coherent toolchain.
-
-Here is a visual snapshot of that terminal-style demo flow:
-
-<p align="center">
-<img src="docs/assets/agentevalkit-demo-terminal.svg" alt="AgentEvalKit quick demo output" width="100%" />
-</p>
-
-## Monorepo structure
-
-```text
-projects/
-  agentci/    replay-first regression testing
-  tracepack/  trace-to-benchmark packaging
-  failmap/    failure clustering and release comparison
-  packslice/  balanced dataset splitting for trace packs
-.github/
-  workflows/  monorepo CI
-```
-
-If you want to contribute at the monorepo level, start with `CONTRIBUTING.md`.
+The root `manifest.json` is the single best entrypoint for CI jobs, dashboards, or downstream automation that needs to discover the whole artifact set.

 ## Quick start
````
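The root `manifest.json` kept in this hunk is meant to be consumed programmatically. A minimal sketch of a CI-side consumer, assuming an illustrative `{"artifacts": [{"name": ..., "path": ...}]}` schema (the real manifest layout is not documented in this diff, and `load_artifact_index` is a hypothetical helper):

```python
import json
from pathlib import Path


def load_artifact_index(demo_dir):
    """Load the root manifest and resolve the artifact paths it lists.

    Assumption: the manifest has an "artifacts" list of {"name", "path"}
    entries; this is illustrative, not AgentEvalKit's documented format.
    """
    root = Path(demo_dir)
    manifest = json.loads((root / "manifest.json").read_text())
    return {
        entry["name"]: root / entry["path"]
        for entry in manifest.get("artifacts", [])
    }
```

A CI job could call this once per demo run and fail fast if an expected artifact name is missing from the index.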
````diff
@@ -154,9 +95,6 @@ agentci diff examples/math_episode.json examples/math_episode_candidate.json
 agentci diff-html examples/math_episode.json examples/math_episode_candidate.json examples/math_diff.html
 agentci assert-regression examples/math_episode.json examples/math_episode_latency_candidate.json --ignore-diff-prefix metric:latency_ms
 agentci detect-flaky examples/math_episode.json examples/math_episode_latency_candidate.json examples/math_episode_candidate.json
-# optional: install pytest extra for regression-suite integration
-# pip install -e .[pytest]
-# pytest -q --agentci-ignore-diff metric:latency_ms
 ```
````
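The `--ignore-diff-prefix metric:latency_ms` flag above filters latency noise out of regression checks. A minimal sketch of that idea over flat key/value episode summaries (real AgentCI episodes are nested traces, and `diff_episodes` is a hypothetical helper, not AgentCI's API):

```python
def diff_episodes(baseline, candidate, ignore_prefixes=()):
    """Compare two flat episode summaries, skipping keys with ignored prefixes.

    Sketch of the idea behind `--ignore-diff-prefix`: noisy keys such as
    latency metrics are excluded before deciding whether a run regressed.
    """
    changed = {}
    for key in sorted(set(baseline) | set(candidate)):
        if any(key.startswith(p) for p in ignore_prefixes):
            continue  # e.g. metric:latency_ms is treated as expected noise
        if baseline.get(key) != candidate.get(key):
            changed[key] = (baseline.get(key), candidate.get(key))
    return changed
```

An empty result means the candidate run is considered equivalent to the baseline for regression purposes.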
162100
### TracePack
````diff
@@ -167,12 +105,9 @@ python -m venv .venv
 source .venv/bin/activate
 pip install -e .
 python examples/make_sample_episodes.py
-tracepack scan examples/source_episodes
-tracepack build examples/source_episodes examples/demo_pack --only-failures --redact --max-per-signature 1
-tracepack inspect examples/demo_pack
-tracepack export-jsonl examples/demo_pack examples/demo_pack.jsonl
-tracepack export-chat examples/demo_pack examples/demo_chat.jsonl
 tracepack scan examples/source_episodes --json
+tracepack build examples/source_episodes examples/demo_pack --only-failures --redact --max-per-signature 1
+tracepack inspect examples/demo_pack --json
 ```
````
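The `--redact` flag above hints at how TracePack sanitizes traces before packaging. A sketch of recursive redaction under an assumed set of sensitive key names (TracePack's actual redaction policy is configurable and not reproduced here):

```python
SENSITIVE_KEYS = {"api_key", "token", "password", "authorization"}


def redact(obj, sensitive=SENSITIVE_KEYS):
    """Recursively mask values stored under sensitive-looking keys.

    Walks dicts and lists; any value whose key matches the sensitive set
    (case-insensitive) is replaced with a fixed placeholder.
    """
    if isinstance(obj, dict):
        return {
            k: "[REDACTED]" if k.lower() in sensitive else redact(v, sensitive)
            for k, v in obj.items()
        }
    if isinstance(obj, list):
        return [redact(v, sensitive) for v in obj]
    return obj
```

Running this over every episode before it lands in a shared pack keeps secrets out of benchmark artifacts.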
### FailMap
````diff
@@ -185,7 +120,6 @@ pip install -e .
 failmap compare examples/baseline_clusters.json examples/candidate_clusters.json examples/compare.json
 failmap issue-drafts examples/compare.json examples/issues --rules examples/triage_rules.json
 failmap issue-bundle examples/issues examples/bundle
-failmap trend examples/trends.json examples/baseline_clusters.json examples/candidate_clusters.json examples/release3_clusters.json
 failmap compare-summary examples/compare.json --json
 ```
````

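`failmap compare` above contrasts two cluster snapshots. A sketch of the underlying comparison, assuming clusters reduce to `{signature: count}` maps (an illustrative shape, not FailMap's real cluster file format):

```python
def compare_clusters(baseline, candidate):
    """Classify failure signatures as new, resolved, grown, or shrunk.

    Assumption: each snapshot is a {signature: failure_count} dict; the
    report lists signatures per category in sorted order.
    """
    report = {"new": [], "resolved": [], "grown": [], "shrunk": []}
    for sig in sorted(set(baseline) | set(candidate)):
        before, after = baseline.get(sig, 0), candidate.get(sig, 0)
        if before == 0:
            report["new"].append(sig)       # appeared in the candidate release
        elif after == 0:
            report["resolved"].append(sig)  # disappeared after the fix
        elif after > before:
            report["grown"].append(sig)
        elif after < before:
            report["shrunk"].append(sig)
    return report
```

A release check could fail the build whenever `new` or `grown` is non-empty.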
````diff
@@ -197,28 +131,86 @@ python -m venv .venv
 source .venv/bin/activate
 pip install -e .
 packslice split examples/sample_pack examples/split_demo --group-by signature
-packslice summarize examples/split_demo
-packslice markdown examples/split_demo examples/split_demo/REPORT.md
 packslice summarize examples/split_demo --json
+packslice markdown examples/split_demo examples/split_demo/REPORT.md
 ```

-## Why these projects have star potential
+## JSON-first workflow
+
+All four tools support machine-readable CLI output, so they can be chained in CI without scraping terminal prose:

-High-star Agent infra repos usually win when they are:
+```bash
+agentci summarize projects/agentci/examples/math_episode.json --json
+tracepack scan projects/tracepack/examples/source_episodes --json
+failmap summarize projects/failmap/examples/clusters.json --json
+packslice summarize projects/packslice/examples/split_demo --json
+```
+
+That is the core design choice of the repo: artifacts first, dashboards and release checks second.
+
+## Projects
+
+### 1. AgentCI
+
+Path: `projects/agentci`
+
+Replay-first regression testing for tool-using LLM agents, with portable episode traces, HTML diff reports, and pytest-friendly regression assertions.
+
+### 2. TracePack
+
+Path: `projects/tracepack`
+
+Build reusable benchmark packs from real agent traces, with recursive redaction, case labels, jsonl/chat export, and signature-aware sampling for eval pipelines.
+
+### 3. FailMap
+
+Path: `projects/failmap`
+
+Cluster recurring agent failures from TracePack packs, compare releases, generate issue-ready triage drafts with rules-driven routing, bundle them for planning, and track failure trends across snapshots.
+
+### 4. PackSlice
+
+Path: `projects/packslice`
+
+Create balanced train/eval/test splits from TracePack packs with distribution-aware, label-aware, and chronological slicing modes.
+
+## Why this repo can be valuable
+
+The most useful agent infra repos are usually:

 1. painkiller products, not toy abstractions
 2. compatible with existing stacks
-3. easy to demo in under five minutes
+3. demoable in a few minutes
 4. useful to both researchers and production teams

-The projects in this repo are designed around that rule.
+`AgentEvalKit` is built around that rule.
+
+## Monorepo structure
+
+```text
+projects/
+  agentci/    replay-first regression testing
+  tracepack/  trace-to-benchmark packaging
+  failmap/    failure clustering and release comparison
+  packslice/  balanced dataset splitting for trace packs
+.github/
+  workflows/  monorepo CI
+```
+
+## Docs and contribution entrypoints
+
+- repo-level walkthrough: `docs/automation.md`
+- contributor guide: `CONTRIBUTING.md`
+- issue and PR templates: `.github/`
+- security policy: `SECURITY.md`
+- support guidance: `SUPPORT.md`

 ## Roadmap

-- add more AgentCI integrations and richer HTML diff reports
-- strengthen TracePack redaction policies, labeling workflows, and dataset export formats
-- add richer FailMap issue templates, trend views, and release-to-release cluster drilldowns
-- expand PackSlice with temporal and label-aware slicing
+- add more `AgentCI` integrations and richer HTML diff reports
+- strengthen `TracePack` redaction policies, labeling workflows, and export formats
+- add richer `FailMap` issue templates, trend views, and release-to-release drilldowns
+- expand `PackSlice` with temporal and label-aware slicing
 - add more focused projects around agent eval infra, failure mining, and trajectory analytics

 ## License
````
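The `packslice split --group-by signature` command shown in this diff keeps near-duplicate cases out of opposing splits. A sketch of signature-grouped splitting under assumed ratios (`split_by_signature` is a hypothetical helper, not PackSlice's real algorithm):

```python
import random
from collections import defaultdict


def split_by_signature(cases, ratios=(0.8, 0.1, 0.1), seed=0):
    """Assign whole signature groups to train/eval/test splits.

    Keeping every case with a given signature in exactly one split avoids
    near-duplicate leakage between train and test. Cases are assumed to be
    dicts with a "signature" key; ratios apply to signature groups.
    """
    groups = defaultdict(list)
    for case in cases:
        groups[case["signature"]].append(case)
    signatures = sorted(groups)
    random.Random(seed).shuffle(signatures)  # deterministic for a fixed seed
    n = len(signatures)
    cut1 = int(n * ratios[0])
    cut2 = cut1 + int(n * ratios[1])
    buckets = {
        "train": signatures[:cut1],
        "eval": signatures[cut1:cut2],
        "test": signatures[cut2:],
    }
    return {name: [c for sig in sigs for c in groups[sig]] for name, sigs in buckets.items()}
```

Because the shuffle is seeded, re-running the split on the same pack reproduces the same partition, which is the property a release pipeline needs.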

SECURITY.md

Lines changed: 42 additions & 0 deletions
New file:

# Security Policy

## Scope

`AgentEvalKit` is a public toolkit for agent eval, regression testing, trace packaging, failure clustering, and dataset slicing. Security reports are especially helpful when they involve:

- secret leakage or incomplete redaction in `TracePack`
- unsafe artifact handling in `AgentCI`, `FailMap`, or `PackSlice`
- path handling, file overwrite, or unintended data exposure issues
- supply-chain or packaging concerns that affect the published tools

## Reporting a Vulnerability

Please do **not** open a public GitHub issue for undisclosed security problems.

Instead, report the issue privately by emailing:

- `security@jasvina.com`

Include as much of the following as you can:

- affected component (`AgentCI`, `TracePack`, `FailMap`, `PackSlice`, or root automation)
- exact version, commit, or file path involved
- reproduction steps or a minimal proof of concept
- impact assessment
- whether the issue is already being exploited, if known

## Response Expectations

Best-effort targets:

- initial acknowledgment within 7 days
- status update after reproduction and triage
- coordinated disclosure after a fix is available or mitigation guidance is ready

## Supported Surfaces

Because this repository is evolving quickly, support is best-effort for the current `main` branch and the latest public package state in this repository.

## Safe Harbor

If you act in good faith, avoid privacy violations, avoid service disruption, and do not exfiltrate data beyond what is necessary to demonstrate the issue, your research will be treated as authorized.
