
Commit 2cd6640

weiyiweiyi authored and committed
Make the GitHub repo feel like a mature OSS project
The repo already had a stronger name and clearer eval-tooling scope, but the public GitHub surface still lacked the trust signals and landing-page structure that help visitors quickly understand the project and help contributors engage with it safely. This commit adds core community-health docs and rewrites the README to lead with the product story, the artifact flow, and machine-readable demo outputs.

Constraint: keep the README aligned with the existing four-tool workflow and validated demo artifacts.
Rejected: add long governance/process docs first (too much ceremony for the current project stage).
Rejected: keep the previous README structure and only tweak wording (still buried the core value proposition too deep).
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: keep the README optimized for first-time GitHub visitors; lead with the reliability loop and artifact outputs, not internal repo structure.
Tested: package unit suites via unittest; end-to-end demo script; presence of community-health docs and root manifest verification.
Not-tested: GitHub community-health checklist UI after push.
1 parent 37e5af4 commit 2cd6640

4 files changed

Lines changed: 231 additions & 113 deletions

File tree

CODE_OF_CONDUCT.md

Lines changed: 43 additions & 0 deletions
New file:

# Code of Conduct

## Our Commitment

We want `AgentEvalKit` to be a useful, welcoming open-source project for people working on agent evals, infrastructure, reliability, and research tooling.

Contributors, maintainers, and community members are expected to keep interactions respectful, constructive, and focused on improving the work.

## Expected Behavior

Examples of behavior that help this project:

- giving actionable technical feedback
- assuming good intent while still being precise about problems
- discussing tradeoffs without turning disagreements personal
- keeping critique focused on artifacts, code, docs, or decisions
- helping others reproduce bugs and validate fixes

## Unacceptable Behavior

Examples of unacceptable behavior include:

- harassment, abuse, or personal attacks
- discriminatory or sexualized language or imagery
- doxxing, threats, or intimidation
- bad-faith trolling or deliberate derailment
- publishing private or sensitive information without permission

## Enforcement

Project maintainers may remove comments, issues, pull requests, or contributors whose behavior violates this code of conduct.

## Reporting

To report conduct issues, email:

- `conduct@jasvina.com`

Please include links, screenshots, or other context when possible.

## Attribution

This policy is a lightweight, project-specific adaptation inspired by common open-source community standards, including the Contributor Covenant.

README.md

Lines changed: 105 additions & 113 deletions
````diff
@@ -4,141 +4,82 @@
 [![License](https://img.shields.io/github/license/Jasvina/AgentEvalKit)](LICENSE)
 [![Monorepo](https://img.shields.io/badge/layout-agent%20tooling%20monorepo-0a7bbb)](https://github.com/Jasvina/AgentEvalKit)

-A public monorepo for practical open-source projects in the LLM Agent stack.
+Open-source tooling for agent evals, regression testing, trace packaging, failure clustering, and dataset slicing.

-I am deliberately not collecting random demos here. Each project is chosen from a specific gap in the current GitHub landscape: crowded categories already have plenty of frameworks, browser agents, coding agents, and memory layers, so this repo focuses on under-built infrastructure around reproducibility, regression testing, and turning real traces into reusable eval assets.
+`AgentEvalKit` is a focused monorepo for a specific gap in the LLM agent stack: teams can build agents, but still struggle to replay failures, turn real traces into reusable eval assets, cluster recurring failure modes, and produce stable train/eval/test slices from the same evidence.

-## Why this repo exists
+## Why this exists

-After surveying today's high-star Agent repositories, four opportunities stood out:
+A lot of agent repos optimize for demos, orchestration, or UI. Fewer repos help with the reliability loop after a run goes wrong.

-- teams can build agents, but still struggle to replay failures and guard against regressions
-- teams can trace agents, but still lack clean tooling to turn real trajectories into reusable eval packs and benchmark cases
-- teams can collect failures, but still lack a simple OSS layer for clustering recurring failure modes and prioritizing fixes across releases
-- teams can build eval packs, but still lack balanced split tooling for train/eval/test and release slicing
+This repo is built around that loop:

-`AgentEvalKit` is a place to build those missing layers as focused OSS projects.
+1. capture a real run
+2. replay or diff it in CI
+3. package it into a reusable eval artifact
+4. cluster repeated failures across runs or releases
+5. slice the same artifact into reproducible datasets

-## Architecture at a glance
+That makes `AgentEvalKit` closer to an eval-and-reliability toolkit than a general agent framework.

-<p align="center">
-<img src="docs/assets/agentevalkit-overview.svg" alt="AgentEvalKit architecture overview" width="100%" />
-</p>
-
-This is the intended product story for the monorepo:
+## What you get

-- `AgentCI` turns real runs into replayable regression artifacts
-- `TracePack` packages those runs into reusable benchmark cases
-- `FailMap` turns repeated failures into triage-ready clusters
-- `PackSlice` creates stable train/eval/test slices from the same pack
-- the root CI workflow validates that the whole chain works end to end
+- `AgentCI` for replay-first regression testing of tool-using agents
+- `TracePack` for turning real traces into reusable benchmark packs
+- `FailMap` for clustering recurring failures and comparing releases
+- `PackSlice` for balanced train/eval/test splits from the same pack
+- a root automation flow that proves the whole chain works together

-## Quick demo output
+## Toolchain at a glance

 <p align="center">
-<img src="docs/assets/agentevalkit-demo-terminal.svg" alt="AgentEvalKit terminal-style demo output" width="100%" />
+<img src="docs/assets/agentevalkit-overview.svg" alt="AgentEvalKit architecture overview" width="100%" />
 </p>

-If you want a one-command walkthrough of the whole repo:
-
-```bash
-./scripts/run_automation_demo.sh /tmp/agentevalkit-demo
-```
-
-That gives visitors an immediate answer to the most important README question: "what does this repo actually produce when I run it?"
-
-## Projects
-
-### 1. AgentCI
-
-Path: `projects/agentci`
-
-Replay-first regression testing for tool-using LLM agents, with portable episode traces, HTML diff reports, and pytest-friendly regression assertions.
-
-### 2. TracePack
-
-Path: `projects/tracepack`
-
-Build reusable benchmark packs from real agent traces, with recursive redaction, case labels, jsonl/chat export, and signature-aware sampling for eval pipelines.
-
-### 3. FailMap
-
-Path: `projects/failmap`
-
-Cluster recurring agent failures from TracePack packs, compare releases, generate issue-ready triage drafts with rules-driven routing, bundle them for planning, and track failure trends across snapshots.
-
-### 4. PackSlice
-
-Path: `projects/packslice`
-
-Create balanced train/eval/test splits from TracePack packs with distribution-aware, label-aware, and chronological slicing modes.
-
-## Toolchain story
-
 ```text
 AgentCI -> record and diff trajectories
 TracePack -> turn trajectories into reusable benchmark packs
 FailMap -> cluster failures, compare releases, generate triage issues, bundle work
 PackSlice -> split packs into balanced train/eval/test datasets
 ```

-## Machine-readable CLI story
+## What the demo produces

-All four projects now support JSON-friendly CLI flows, so they can be chained in CI without scraping human text output:
+<p align="center">
+<img src="docs/assets/agentevalkit-demo-terminal.svg" alt="AgentEvalKit terminal-style demo output" width="100%" />
+</p>
+
+Run the end-to-end repo demo with:

 ```bash
-agentci summarize projects/agentci/examples/math_episode.json --json
-tracepack scan projects/tracepack/examples/source_episodes --json
-failmap summarize projects/failmap/examples/clusters.json --json
-packslice summarize projects/packslice/examples/split_demo --json
+./scripts/run_automation_demo.sh /tmp/agentevalkit-demo
 ```

-That makes it easier to build release checks, artifact pipelines, and automated dashboards on top of the same OSS commands shown in the READMEs.
-
-For a fuller walkthrough, see `docs/automation.md`, the companion script `scripts/run_automation_demo.sh`, and the monorepo contributor guide in `CONTRIBUTING.md`.
-
-## What the monorepo demo produces
-
-The root workflow now runs an end-to-end automation demo and uploads artifacts that mirror a real team handoff:
+The output is intentionally machine-readable. A successful run gives you a root `manifest.json` plus per-tool artifacts:

 ```text
 manifest.json
 agentci-summary.json
 agentci-regression.json
+tracepack-scan.json
+tracepack-build.json
+tracepack-inspect.json
 tracepack-pack/
   manifest.json
   cases/
+failmap-cluster.json
 failmap-clusters.json
+failmap-summary.json
+packslice-split.json
+packslice-summary.json
 packslice/
   summary.json
   train/
   eval/
   test/
 ```

-The top-level `manifest.json` acts as a machine-readable index for the full demo run, so CI jobs, dashboards, or artifact consumers can discover the output set and key summary metrics from one stable entrypoint.
-
-That makes the repo feel less like four isolated READMEs and more like one coherent toolchain.
-
-Here is a visual snapshot of that terminal-style demo flow:
-
-<p align="center">
-<img src="docs/assets/agentevalkit-demo-terminal.svg" alt="AgentEvalKit quick demo output" width="100%" />
-</p>
-
-## Monorepo structure
-
-```text
-projects/
-  agentci/    replay-first regression testing
-  tracepack/  trace-to-benchmark packaging
-  failmap/    failure clustering and release comparison
-  packslice/  balanced dataset splitting for trace packs
-.github/
-  workflows/  monorepo CI
-```
-
-If you want to contribute at the monorepo level, start with `CONTRIBUTING.md`.
+The root `manifest.json` is the single best entrypoint for CI jobs, dashboards, or downstream automation that needs to discover the whole artifact set.

 ## Quick start
````
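The root `manifest.json` kept in this hunk is meant to be consumed programmatically. A minimal sketch of a CI-side consumer, assuming an illustrative `{"artifacts": [{"name": ..., "path": ...}]}` schema (the real manifest layout is not documented in this diff, and `load_artifact_index` is a hypothetical helper):

```python
import json
from pathlib import Path


def load_artifact_index(demo_dir):
    """Load the root manifest and resolve the artifact paths it lists.

    Assumption: the manifest has an "artifacts" list of {"name", "path"}
    entries; this is illustrative, not AgentEvalKit's documented format.
    """
    root = Path(demo_dir)
    manifest = json.loads((root / "manifest.json").read_text())
    return {
        entry["name"]: root / entry["path"]
        for entry in manifest.get("artifacts", [])
    }
```

A CI job could call this once per demo run and fail fast if an expected artifact name is missing from the index.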
````diff
@@ -154,9 +95,6 @@ agentci diff examples/math_episode.json examples/math_episode_candidate.json
 agentci diff-html examples/math_episode.json examples/math_episode_candidate.json examples/math_diff.html
 agentci assert-regression examples/math_episode.json examples/math_episode_latency_candidate.json --ignore-diff-prefix metric:latency_ms
 agentci detect-flaky examples/math_episode.json examples/math_episode_latency_candidate.json examples/math_episode_candidate.json
-# optional: install pytest extra for regression-suite integration
-# pip install -e .[pytest]
-# pytest -q --agentci-ignore-diff metric:latency_ms
 ```
````
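The `--ignore-diff-prefix metric:latency_ms` flag above filters latency noise out of regression checks. A minimal sketch of that idea over flat key/value episode summaries (real AgentCI episodes are nested traces, and `diff_episodes` is a hypothetical helper, not AgentCI's API):

```python
def diff_episodes(baseline, candidate, ignore_prefixes=()):
    """Compare two flat episode summaries, skipping keys with ignored prefixes.

    Sketch of the idea behind `--ignore-diff-prefix`: noisy keys such as
    latency metrics are excluded before deciding whether a run regressed.
    """
    changed = {}
    for key in sorted(set(baseline) | set(candidate)):
        if any(key.startswith(p) for p in ignore_prefixes):
            continue  # e.g. metric:latency_ms is treated as expected noise
        if baseline.get(key) != candidate.get(key):
            changed[key] = (baseline.get(key), candidate.get(key))
    return changed
```

An empty result means the candidate run is considered equivalent to the baseline for regression purposes.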
162100
### TracePack
````diff
@@ -167,12 +105,9 @@ python -m venv .venv
 source .venv/bin/activate
 pip install -e .
 python examples/make_sample_episodes.py
-tracepack scan examples/source_episodes
-tracepack build examples/source_episodes examples/demo_pack --only-failures --redact --max-per-signature 1
-tracepack inspect examples/demo_pack
-tracepack export-jsonl examples/demo_pack examples/demo_pack.jsonl
-tracepack export-chat examples/demo_pack examples/demo_chat.jsonl
 tracepack scan examples/source_episodes --json
+tracepack build examples/source_episodes examples/demo_pack --only-failures --redact --max-per-signature 1
+tracepack inspect examples/demo_pack --json
 ```
````
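The `--redact` flag above hints at how TracePack sanitizes traces before packaging. A sketch of recursive redaction under an assumed set of sensitive key names (TracePack's actual redaction policy is configurable and not reproduced here):

```python
SENSITIVE_KEYS = {"api_key", "token", "password", "authorization"}


def redact(obj, sensitive=SENSITIVE_KEYS):
    """Recursively mask values stored under sensitive-looking keys.

    Walks dicts and lists; any value whose key matches the sensitive set
    (case-insensitive) is replaced with a fixed placeholder.
    """
    if isinstance(obj, dict):
        return {
            k: "[REDACTED]" if k.lower() in sensitive else redact(v, sensitive)
            for k, v in obj.items()
        }
    if isinstance(obj, list):
        return [redact(v, sensitive) for v in obj]
    return obj
```

Running this over every episode before it lands in a shared pack keeps secrets out of benchmark artifacts.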
### FailMap
````diff
@@ -185,7 +120,6 @@ pip install -e .
 failmap compare examples/baseline_clusters.json examples/candidate_clusters.json examples/compare.json
 failmap issue-drafts examples/compare.json examples/issues --rules examples/triage_rules.json
 failmap issue-bundle examples/issues examples/bundle
-failmap trend examples/trends.json examples/baseline_clusters.json examples/candidate_clusters.json examples/release3_clusters.json
 failmap compare-summary examples/compare.json --json
 ```
````

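`failmap compare` above contrasts two cluster snapshots. A sketch of the underlying comparison, assuming clusters reduce to `{signature: count}` maps (an illustrative shape, not FailMap's real cluster file format):

```python
def compare_clusters(baseline, candidate):
    """Classify failure signatures as new, resolved, grown, or shrunk.

    Assumption: each snapshot is a {signature: failure_count} dict; the
    report lists signatures per category in sorted order.
    """
    report = {"new": [], "resolved": [], "grown": [], "shrunk": []}
    for sig in sorted(set(baseline) | set(candidate)):
        before, after = baseline.get(sig, 0), candidate.get(sig, 0)
        if before == 0:
            report["new"].append(sig)       # appeared in the candidate release
        elif after == 0:
            report["resolved"].append(sig)  # disappeared after the fix
        elif after > before:
            report["grown"].append(sig)
        elif after < before:
            report["shrunk"].append(sig)
    return report
```

A release check could fail the build whenever `new` or `grown` is non-empty.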
````diff
@@ -197,28 +131,86 @@ python -m venv .venv
 source .venv/bin/activate
 pip install -e .
 packslice split examples/sample_pack examples/split_demo --group-by signature
-packslice summarize examples/split_demo
-packslice markdown examples/split_demo examples/split_demo/REPORT.md
 packslice summarize examples/split_demo --json
+packslice markdown examples/split_demo examples/split_demo/REPORT.md
 ```

-## Why these projects have star potential
+## JSON-first workflow
+
+All four tools support machine-readable CLI output, so they can be chained in CI without scraping terminal prose:

-High-star Agent infra repos usually win when they are:
+```bash
+agentci summarize projects/agentci/examples/math_episode.json --json
+tracepack scan projects/tracepack/examples/source_episodes --json
+failmap summarize projects/failmap/examples/clusters.json --json
+packslice summarize projects/packslice/examples/split_demo --json
+```
+
+That is the core design choice of the repo: artifacts first, dashboards and release checks second.
+
+## Projects
+
+### 1. AgentCI
+
+Path: `projects/agentci`
+
+Replay-first regression testing for tool-using LLM agents, with portable episode traces, HTML diff reports, and pytest-friendly regression assertions.
+
+### 2. TracePack
+
+Path: `projects/tracepack`
+
+Build reusable benchmark packs from real agent traces, with recursive redaction, case labels, jsonl/chat export, and signature-aware sampling for eval pipelines.
+
+### 3. FailMap
+
+Path: `projects/failmap`
+
+Cluster recurring agent failures from TracePack packs, compare releases, generate issue-ready triage drafts with rules-driven routing, bundle them for planning, and track failure trends across snapshots.
+
+### 4. PackSlice
+
+Path: `projects/packslice`
+
+Create balanced train/eval/test splits from TracePack packs with distribution-aware, label-aware, and chronological slicing modes.
+
+## Why this repo can be valuable
+
+The most useful agent infra repos are usually:

 1. painkiller products, not toy abstractions
 2. compatible with existing stacks
-3. easy to demo in under five minutes
+3. demoable in a few minutes
 4. useful to both researchers and production teams

-The projects in this repo are designed around that rule.
+`AgentEvalKit` is built around that rule.
+
+## Monorepo structure
+
+```text
+projects/
+  agentci/    replay-first regression testing
+  tracepack/  trace-to-benchmark packaging
+  failmap/    failure clustering and release comparison
+  packslice/  balanced dataset splitting for trace packs
+.github/
+  workflows/  monorepo CI
+```
+
+## Docs and contribution entrypoints
+
+- repo-level walkthrough: `docs/automation.md`
+- contributor guide: `CONTRIBUTING.md`
+- issue and PR templates: `.github/`
+- security policy: `SECURITY.md`
+- support guidance: `SUPPORT.md`

 ## Roadmap

-- add more AgentCI integrations and richer HTML diff reports
-- strengthen TracePack redaction policies, labeling workflows, and dataset export formats
-- add richer FailMap issue templates, trend views, and release-to-release cluster drilldowns
-- expand PackSlice with temporal and label-aware slicing
+- add more `AgentCI` integrations and richer HTML diff reports
+- strengthen `TracePack` redaction policies, labeling workflows, and export formats
+- add richer `FailMap` issue templates, trend views, and release-to-release drilldowns
+- expand `PackSlice` with temporal and label-aware slicing
 - add more focused projects around agent eval infra, failure mining, and trajectory analytics

 ## License
````
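The `packslice split --group-by signature` command shown in this diff keeps near-duplicate cases out of opposing splits. A sketch of signature-grouped splitting under assumed ratios (`split_by_signature` is a hypothetical helper, not PackSlice's real algorithm):

```python
import random
from collections import defaultdict


def split_by_signature(cases, ratios=(0.8, 0.1, 0.1), seed=0):
    """Assign whole signature groups to train/eval/test splits.

    Keeping every case with a given signature in exactly one split avoids
    near-duplicate leakage between train and test. Cases are assumed to be
    dicts with a "signature" key; ratios apply to signature groups.
    """
    groups = defaultdict(list)
    for case in cases:
        groups[case["signature"]].append(case)
    signatures = sorted(groups)
    random.Random(seed).shuffle(signatures)  # deterministic for a fixed seed
    n = len(signatures)
    cut1 = int(n * ratios[0])
    cut2 = cut1 + int(n * ratios[1])
    buckets = {
        "train": signatures[:cut1],
        "eval": signatures[cut1:cut2],
        "test": signatures[cut2:],
    }
    return {name: [c for sig in sigs for c in groups[sig]] for name, sigs in buckets.items()}
```

Because the shuffle is seeded, re-running the split on the same pack reproduces the same partition, which is the property a release pipeline needs.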

SECURITY.md

Lines changed: 42 additions & 0 deletions
New file:

# Security Policy

## Scope

`AgentEvalKit` is a public toolkit for agent eval, regression testing, trace packaging, failure clustering, and dataset slicing. Security reports are especially helpful when they involve:

- secret leakage or incomplete redaction in `TracePack`
- unsafe artifact handling in `AgentCI`, `FailMap`, or `PackSlice`
- path handling, file overwrite, or unintended data exposure issues
- supply-chain or packaging concerns that affect the published tools

## Reporting a Vulnerability

Please do **not** open a public GitHub issue for undisclosed security problems.

Instead, report the issue privately by emailing:

- `security@jasvina.com`

Include as much of the following as you can:

- affected component (`AgentCI`, `TracePack`, `FailMap`, `PackSlice`, or root automation)
- exact version, commit, or file path involved
- reproduction steps or a minimal proof of concept
- impact assessment
- whether the issue is already being exploited, if known

## Response Expectations

Best-effort targets:

- initial acknowledgment within 7 days
- status update after reproduction and triage
- coordinated disclosure after a fix is available or mitigation guidance is ready

## Supported Surfaces

Because this repository is evolving quickly, support is best-effort for the current `main` branch and the latest public package state in this repository.

## Safe Harbor

If you act in good faith, avoid privacy violations, avoid service disruption, and do not exfiltrate data beyond what is necessary to demonstrate the issue, your research will be treated as authorized.
