|
10 | 10 | - selected_benchmark_tasks.json entries: {benchmark, task_id, repo, language, difficulty, mcp_benefit_score, ...} |
11 | 11 | - Archive location: benchmarks/archive/ for deprecated suites, configs/archive/ for deprecated configs |
12 | 12 | - Shared scorer template at `benchmarks/templates/weighted_checklist_scorer.sh` is parameterizable via REPORT_PATH env var |
| 13 | +- F1 JSON scorer at `benchmarks/templates/f1_json_scorer.sh` — configurable via OUTPUT_PATH env var and `key_fields` in ground_truth.json |
13 | 14 | - New suites output paths: nlqa→`/logs/agent/investigation.md`, security→`/logs/agent/triage.md`, onboarding→`/logs/agent/onboarding.md`, docgen→`/workspace/documentation.md` |
14 | 15 | - Config script run function names must be unique per suite (e.g., `_nlqa_run_single`) to avoid shell function collisions |
15 | 16 | - Run directory prefix in JOBS_BASE (e.g., `nlqa_`) must match DIR_PREFIX_TO_SUITE mappings in generate_manifest.py and aggregate_status.py |
|
68 | 69 | - DIR_PREFIX_TO_SUITE mappings for repoqa_ were left in place in scripts — archived suites' run data may still exist in runs/official and the scripts should still recognize them |
69 | 70 | - When archiving suites: move benchmark dir, move config script, remove from selected_benchmark_tasks.json, update benchmarks/README.md |
70 | 71 | --- |
| 72 | + |
| 73 | +## 2026-02-16 - US-005 |
| 74 | +- Created F1 JSON scorer template at benchmarks/templates/f1_json_scorer.sh |
| 75 | +- Scores structured JSON output (e.g., callers.json, symbols.json) against ground truth |
| 76 | +- Computes precision, recall, and F1 by matching entries on configurable composite key fields |
| 77 | +- Handles all edge cases: empty output (0.0), missing ground truth (error → 0.0), malformed JSON (0.0), markdown-fenced JSON (stripped) |
| 78 | +- Detailed output: shows matched entries, missed entries, TP/FP/FN counts |
| 79 | +- Files changed: benchmarks/templates/f1_json_scorer.sh |
| 80 | +- **Learnings for future iterations:** |
| 81 | + - F1 scorer uses `key_fields` in ground_truth.json to define composite match key — more flexible than hardcoded field names |
| 82 | + - The codereview test.sh has a hybrid approach (F1 detection + fix verification) — F1 scorer template only handles the detection/matching portion |
| 83 | + - Both scorer templates follow the same shell structure: bash prereq checks → embedded Python for scoring logic |
| 84 | + - Agent output can be wrapped in markdown code fences (```json blocks) — always strip these before parsing |
| 85 | + - Scorer templates write to /logs/verifier/reward.txt (absolute path) — they're designed to run inside Docker containers, not locally |
| 86 | +--- |
0 commit comments