These are non-negotiable. Every PR, feature, and design decision must respect them.

- **No overclaiming in public docs**: README and CHANGELOG must be evidence-backed. Don't claim capabilities that aren't shipped and tested.
- **internal-docs is private**: Never commit `internal-docs/` pointer changes unless explicitly intended. The submodule is always dirty locally; ignore it.

## Evaluation Integrity (NON-NEGOTIABLE)

These rules prevent metric gaming, overfitting, and false quality claims. Violation of these rules means the feature CANNOT ship.

### Rule 1: Eval Sets are Frozen Before Implementation

- **Define test queries and expected results BEFORE writing any code**
- Commit the eval fixture (e.g., `tests/fixtures/eval-queries.json`) BEFORE starting implementation
- **NEVER adjust expected results to match system output** - if the system returns different results, that's a failure, not a fixture bug
- Exception: if the original expected result was factually wrong (file doesn't exist, query is ambiguous), document the correction with justification

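A frozen fixture committed up front might look like the following. Only the `tests/fixtures/eval-queries.json` path comes from this document; the schema, field names, and entries are illustrative:

```json
{
  "frozen_at": "2025-01-15",
  "queries": [
    {
      "id": "q01",
      "query": "where is the auth token refreshed",
      "expected": ["src/auth/refresh.ts"],
      "category": "conceptual"
    },
    {
      "id": "q02",
      "query": "parseConfig",
      "expected": ["src/config/parse.ts"],
      "category": "exact-name"
    }
  ]
}
```

Because the file is committed before implementation, any later diff to an `expected` array is visible in review and must carry the justification Rule 1 requires.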
### Rule 2: Eval Sets Must Be General

- **Minimum 20 queries** across diverse patterns (exact names, conceptual, multi-concept, edge cases)
- Test on **multiple codebases** (minimum 2: one you control, one public/real-world)
- Include queries that are HARD and likely to fail - don't cherry-pick easy wins
- Eval set must represent real user queries, not synthetic examples designed to pass

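The size and diversity floors above are mechanically checkable. A hypothetical CI guard (the function and type names are not from this codebase) could reject an eval set that is too small or too narrow:

```typescript
// Hypothetical CI guard for Rule 2. Thresholds mirror the rule above:
// at least 20 queries, spread across at least 4 query categories.
type EvalQuery = { id: string; query: string; category: string };

function checkEvalSetGenerality(queries: EvalQuery[]): string[] {
  const problems: string[] = [];
  if (queries.length < 20) {
    problems.push(`only ${queries.length} queries; minimum is 20`);
  }
  const categories = new Set(queries.map((q) => q.category));
  if (categories.size < 4) {
    problems.push(
      `only ${categories.size} categories; want exact-name, conceptual, multi-concept, edge-case`
    );
  }
  return problems; // empty array means the eval set passes
}
```

A check like this cannot detect cherry-picking (an easy query looks the same as a hard one), so the "include queries likely to fail" requirement still needs human review.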
### Rule 3: Public Eval Methodology

- Full eval harness code must be in `tests/` (public repository)
- Eval fixtures must be public (or provide reproducible public examples)
- Document how to run the eval: `npm run eval -- /path/to/codebase`
- Results must be reproducible by external users

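The core of such a harness is small. This is a minimal sketch, not the actual harness in `tests/`: it takes the frozen fixture and a search function and reports top-1 accuracy (all names here are hypothetical):

```typescript
// Minimal eval-harness sketch: fraction of fixture queries whose
// top-ranked result is one of the frozen expected paths.
type FixtureQuery = { query: string; expected: string[] };
type SearchFn = (query: string) => string[]; // ranked file paths

function top1Accuracy(fixture: FixtureQuery[], search: SearchFn): number {
  let hits = 0;
  for (const { query, expected } of fixture) {
    const results = search(query);
    if (results.length > 0 && expected.includes(results[0])) hits++;
  }
  return hits / fixture.length;
}
```

Because the harness only reads the fixture and calls the public search entry point, anyone with the repository can rerun it and reproduce the reported numbers.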
### Rule 4: No Score Manipulation

- **NEVER add heuristics specifically to game eval metrics** (e.g., "if query contains X, boost Y")
- **NEVER adjust scoring to break ties just to improve top-1 accuracy**
- If you add ranking heuristics, they must be general-purpose and justified by search theory, not by "it makes test #7 pass"
- Document all ranking heuristics with research citations or principled justification
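To make the distinction concrete, here is an illustrative contrast; neither snippet is from this codebase. The forbidden shape keys on a specific query, while an acceptable heuristic is query-independent and justified by a general principle:

```typescript
// ANTI-PATTERN (forbidden by Rule 4): a boost keyed to a specific
// eval query, e.g.  if (query.includes("auth")) score *= 2;

// Acceptable shape: a general, query-independent signal with a stated
// rationale. Assumption for this sketch: shallower paths tend to be
// higher-level entry points, so deeper paths get mild logarithmic
// damping (analogous in spirit to length normalization in BM25).
function lengthNormalizedScore(rawScore: number, pathDepth: number): number {
  return rawScore / (1 + Math.log(1 + pathDepth));
}
```

The litmus test: a heuristic that would survive swapping in a completely different eval set is probably general-purpose; one that only helps a specific query is gaming the metric.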