Make the GitHub repo feel like a mature OSS project
The repo already had a stronger name and clearer eval-tooling scope,
but the public GitHub surface still lacked the trust signals and landing
page structure that help visitors quickly understand the project and
contributors engage with it safely. This adds core community-health docs
and rewrites the README to lead with the product story, artifact flow,
and machine-readable demo outputs.
Constraint: Keep the README aligned with the existing four-tool workflow and validated demo artifacts
Rejected: Add long governance/process docs first | too much ceremony for the current project stage
Rejected: Keep the previous README structure and only tweak wording | still buried the core value proposition too deep
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep the README optimized for first-time GitHub visitors; lead with the reliability loop and artifact outputs, not internal repo structure
Tested: Package unit suites via unittest; end-to-end demo script; presence of community-health docs and root manifest verification
Not-tested: GitHub community health checklist UI after push
`CODE_OF_CONDUCT.md`

We want `AgentEvalKit` to be a useful, welcoming open-source project for people working on agent evals, infrastructure, reliability, and research tooling.
Contributors, maintainers, and community members are expected to keep interactions respectful, constructive, and focused on improving the work.
## Expected Behavior
Examples of behavior that help this project:
- giving actionable technical feedback
- assuming good intent while still being precise about problems
- discussing tradeoffs without turning disagreements personal
- keeping critique focused on artifacts, code, docs, or decisions
- helping others reproduce bugs and validate fixes
## Unacceptable Behavior
Examples of unacceptable behavior include:
- harassment, abuse, or personal attacks
- discriminatory or sexualized language or imagery
- doxxing, threats, or intimidation
- bad-faith trolling or deliberate derailment
- publishing private or sensitive information without permission
## Enforcement
Project maintainers may remove comments, issues, pull requests, or contributors whose behavior violates this code of conduct.
## Reporting
To report conduct issues, email:
- `conduct@jasvina.com`
Please include links, screenshots, or other context when possible.
## Attribution
This policy is a lightweight project-specific adaptation inspired by common open-source community standards, including the Contributor Covenant.
`README.md`
Open-source tooling for agent evals, regression testing, trace packaging, failure clustering, and dataset slicing.
`AgentEvalKit` is a focused monorepo for a specific gap in the LLM agent stack: teams can build agents, but still struggle to replay failures, turn real traces into reusable eval assets, cluster recurring failure modes, and produce stable train/eval/test slices from the same evidence.
## Why this exists
A lot of agent repos optimize for demos, orchestration, or UI. Fewer repos help with the reliability loop after a run goes wrong.
This repo is built around that loop:
1. capture a real run
2. replay or diff it in CI
3. package it into a reusable eval artifact
4. cluster repeated failures across runs or releases
5. slice the same artifact into reproducible datasets
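
As a rough sketch, the whole loop can be driven from one script. Everything in the sketch below is hypothetical glue: the CLI names, subcommands, and flags are illustrative placeholders rather than the tools' documented interfaces; only the artifact names mirror the demo output shown later.

```python
# Hypothetical driver for the five-step loop. The subcommands and flags
# are illustrative placeholders, not documented CLIs; only the artifact
# names mirror the demo output described below.
import subprocess

steps = [
    ["agentci", "record", "--out", "runs/episode.json"],          # 1. capture
    ["agentci", "diff", "runs/episode.json",
     "--baseline", "runs/baseline.json"],                         # 2. replay/diff
    ["tracepack", "build", "runs/", "--out", "tracepack-pack"],   # 3. package
    ["failmap", "cluster", "tracepack-pack",
     "--out", "failmap-clusters.json"],                           # 4. cluster
    ["packslice", "split", "tracepack-pack", "--out", "packslice"],  # 5. slice
]

for cmd in steps:
    subprocess.run(cmd, check=True)  # stop the pipeline on the first failure
```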
That makes `AgentEvalKit` closer to an eval-and-reliability toolkit than a general agent framework.
It also gives visitors an immediate answer to the most important README question: “what does this repo actually produce when I run it?”
```text
AgentCI   -> record and diff trajectories
TracePack -> turn trajectories into reusable benchmark packs
FailMap   -> cluster recurring failures and compare releases
PackSlice -> slice packs into balanced train/eval/test datasets
```

That makes it easier to build release checks, artifact pipelines, and automated dashboards on top of the same OSS commands shown in the READMEs.
The output is intentionally machine-readable. A successful run gives you a root `manifest.json` plus per-tool artifacts:
```text
manifest.json
agentci-summary.json
agentci-regression.json
tracepack-scan.json
tracepack-build.json
tracepack-inspect.json
tracepack-pack/
  manifest.json
  cases/
failmap-cluster.json
failmap-clusters.json
failmap-summary.json
packslice-split.json
packslice-summary.json
packslice/
  summary.json
  train/
  eval/
  test/
```
The root `manifest.json` is the single best entrypoint for CI jobs, dashboards, or downstream automation that needs to discover the whole artifact set.
That is the core design choice of the repo: artifacts first, dashboards and release checks second.
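
For example, a CI job might index a run from that one file. The sketch below is a minimal consumer; the `artifacts` and `summary` keys are assumptions about the manifest schema, not its documented shape.

```python
# Minimal manifest consumer sketch. The "artifacts" and "summary" keys
# are assumed for illustration; check the real manifest.json schema.
import json
from pathlib import Path

manifest = json.loads(Path("manifest.json").read_text())

# Enumerate everything the demo run produced.
for artifact in manifest.get("artifacts", []):
    print("produced:", artifact)

# Gate a release check on a summary metric, if one is exposed.
regressions = manifest.get("summary", {}).get("regressions", 0)
if regressions:
    raise SystemExit(f"release check failed: {regressions} regression(s)")
```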
## Projects
### 1. AgentCI
Path: `projects/agentci`
Replay-first regression testing for tool-using LLM agents, with portable episode traces, HTML diff reports, and pytest-friendly regression assertions.
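
In spirit, a pytest-friendly regression assertion looks like the sketch below; the trace paths and the `steps`/`tool` fields are assumed stand-ins, not AgentCI's actual trace schema, so treat `projects/agentci` as the source of truth.

```python
# Sketch of a pytest-style regression check over recorded episode
# traces. The file paths and the "steps"/"tool" field names are assumed
# stand-ins, not AgentCI's actual trace schema.
import json
from pathlib import Path


def load_episode(path: str) -> dict:
    return json.loads(Path(path).read_text())


def test_checkout_agent_tool_sequence_is_stable():
    baseline = load_episode("traces/checkout-baseline.json")
    candidate = load_episode("traces/checkout-candidate.json")
    # The tool-call sequence is a cheap, strong regression signal.
    expected = [step["tool"] for step in baseline["steps"]]
    actual = [step["tool"] for step in candidate["steps"]]
    assert actual == expected, "agent changed its tool-call sequence"
```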
### 2. TracePack
Path: `projects/tracepack`
Build reusable benchmark packs from real agent traces, with recursive redaction, case labels, jsonl/chat export, and signature-aware sampling for eval pipelines.
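
The recursive-redaction idea is easy to illustrate on its own. This standalone sketch shows the technique, not TracePack's implementation, and the sensitive-key list is just an example.

```python
# Illustration of recursive redaction over nested trace data. This is a
# standalone sketch of the idea, not TracePack's implementation; the
# sensitive-key list is an example.
SENSITIVE_KEYS = {"api_key", "authorization", "password", "cookie"}


def redact(value):
    """Recursively walk dicts/lists, masking values under sensitive keys."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if k.lower() in SENSITIVE_KEYS else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value


trace = {
    "request": {"headers": {"Authorization": "Bearer abc123"}},
    "steps": [{"tool": "search", "api_key": "sk-..."}],
}
print(redact(trace))
```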
### 3. FailMap
Path: `projects/failmap`
Cluster recurring agent failures from TracePack packs, compare releases, generate issue-ready triage drafts with rules-driven routing, bundle them for planning, and track failure trends across snapshots.
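
The underlying idea, grouping failures by a normalized error signature, can be shown standalone; this sketch illustrates the technique, not FailMap's actual algorithm.

```python
# Standalone sketch of signature-based failure clustering (not FailMap's
# actual algorithm): normalize error messages, then group by signature.
import re
from collections import defaultdict


def signature(error: str) -> str:
    """Collapse volatile details (numbers, hex ids, paths) into one key."""
    sig = re.sub(r"0x[0-9a-f]+|\d+", "<N>", error.lower())
    return re.sub(r"/\S+", "<PATH>", sig)


failures = [
    "Timeout after 30s calling /api/search",
    "Timeout after 45s calling /api/search",
    "KeyError: 'user_id' at step 3",
]

clusters = defaultdict(list)
for failure in failures:
    clusters[signature(failure)].append(failure)

# Most frequent failure modes first, ready for triage.
for sig, members in sorted(clusters.items(), key=lambda kv: -len(kv[1])):
    print(f"{len(members):3d}x {sig}")
```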
### 4. PackSlice
Path: `projects/packslice`
Create balanced train/eval/test splits from TracePack packs with distribution-aware, label-aware, and chronological slicing modes.
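
A label-aware split, in miniature, works like the sketch below; this is a generic stratified split over assumed case dicts with a `label` key, not PackSlice's implementation.

```python
# Generic stratified train/eval/test split sketch (not PackSlice's
# implementation). Cases are assumed to be dicts with a "label" key.
import random
from collections import defaultdict


def stratified_split(cases, ratios=(0.8, 0.1, 0.1), seed=42):
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    by_label = defaultdict(list)
    for case in cases:
        by_label[case["label"]].append(case)

    splits = {"train": [], "eval": [], "test": []}
    for label_cases in by_label.values():
        rng.shuffle(label_cases)
        n_train = int(len(label_cases) * ratios[0])
        n_eval = int(len(label_cases) * ratios[1])
        splits["train"] += label_cases[:n_train]
        splits["eval"] += label_cases[n_train:n_train + n_eval]
        splits["test"] += label_cases[n_train + n_eval:]
    return splits


cases = [{"id": i, "label": "timeout" if i % 3 else "bad_tool_call"}
         for i in range(30)]
print({name: len(split) for name, split in stratified_split(cases).items()})
```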
## Why this repo can be valuable
The most useful agent infra repos are usually:
1. painkiller products, not toy abstractions
2. compatible with existing stacks
3. demoable in a few minutes
4. useful to both researchers and production teams
`AgentEvalKit` is built around that rule.
## Monorepo structure
```text
projects/
  agentci/    replay-first regression testing
  tracepack/  trace-to-benchmark packaging
  failmap/    failure clustering and release comparison
  packslice/  balanced dataset splitting for trace packs
.github/
  workflows/  monorepo CI
```
## Docs and contribution entrypoints
- repo-level walkthrough: `docs/automation.md`
- contributor guide: `CONTRIBUTING.md`
- issue and PR templates: `.github/`
- security policy: `SECURITY.md`
- support guidance: `SUPPORT.md`
## Roadmap
- add more AgentCI integrations and richer HTML diff reports
`SECURITY.md`

`AgentEvalKit` is a public toolkit for agent evals, regression testing, trace packaging, failure clustering, and dataset slicing. Security reports are especially helpful when they involve:
- secret leakage or incomplete redaction in `TracePack`
- unsafe artifact handling in `AgentCI`, `FailMap`, or `PackSlice`
- path handling, file overwrite, or unintended data exposure issues
- supply-chain or packaging concerns that affect the published tools
## Reporting a Vulnerability
Please do **not** open a public GitHub issue for undisclosed security problems.
Instead, report the issue privately by emailing:
- `security@jasvina.com`
Include as much of the following as you can:
- affected component (`AgentCI`, `TracePack`, `FailMap`, `PackSlice`, or root automation)
- exact version, commit, or file path involved
- reproduction steps or a minimal proof of concept
- impact assessment
- whether the issue is already being exploited, if known
## Response Expectations
Best-effort targets:
- initial acknowledgment within 7 days
- status update after reproduction and triage
- coordinated disclosure after a fix is available or mitigation guidance is ready
## Supported Surfaces
Because this repository is evolving quickly, support is best-effort for the current `main` branch and the latest public package state.
## Safe Harbor
If you act in good faith, avoid privacy violations, avoid service disruption, and do not exfiltrate data beyond what is necessary to demonstrate the issue, your research will be treated as authorized.