chore: mark US-005 as passing, update progress log

LoCoBench Bot · claude · LoCoBench Bot · commit dbf648bd08a9 · 2026-02-16T15:25:15.000Z
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
diff --git a/ralph-gapfill-infra/prd.json b/ralph-gapfill-infra/prd.json
@@ -81,7 +81,7 @@
         "bash -n benchmarks/templates/f1_json_scorer.sh passes (syntax valid)"
       ],
       "priority": 5,
-      "passes": false,
+      "passes": true,
       "notes": "The scorer should match entries by a composite key (e.g., repo+file+function) and compute standard IR metrics. Look at ccb_codereview's test.sh for F1 scoring pattern to reuse."
     }
   ]
diff --git a/ralph-gapfill-infra/progress.txt b/ralph-gapfill-infra/progress.txt
@@ -10,6 +10,7 @@
 - selected_benchmark_tasks.json entries: {benchmark, task_id, repo, language, difficulty, mcp_benefit_score, ...}
 - Archive location: benchmarks/archive/ for deprecated suites, configs/archive/ for deprecated configs
 - Shared scorer template at `benchmarks/templates/weighted_checklist_scorer.sh` is parameterizable via REPORT_PATH env var
+- F1 JSON scorer at `benchmarks/templates/f1_json_scorer.sh` — configurable via OUTPUT_PATH env var and `key_fields` in ground_truth.json
 - New suites output paths: nlqa→`/logs/agent/investigation.md`, security→`/logs/agent/triage.md`, onboarding→`/logs/agent/onboarding.md`, docgen→`/workspace/documentation.md`
 - Config script run function names must be unique per suite (e.g., `_nlqa_run_single`) to avoid shell function collisions
 - Run directory prefix in JOBS_BASE (e.g., `nlqa_`) must match DIR_PREFIX_TO_SUITE mappings in generate_manifest.py and aggregate_status.py
@@ -68,3 +69,18 @@
   - DIR_PREFIX_TO_SUITE mappings for repoqa_ were left in place in scripts — archived suites' run data may still exist in runs/official and the scripts should still recognize them
   - When archiving suites: move benchmark dir, move config script, remove from selected_benchmark_tasks.json, update benchmarks/README.md
 ---
+
+## 2026-02-16 - US-005
+- Created F1 JSON scorer template at benchmarks/templates/f1_json_scorer.sh
+- Scores structured JSON output (e.g., callers.json, symbols.json) against ground truth
+- Computes precision, recall, and F1 by matching entries on configurable composite key fields
+- Handles all edge cases: empty output (0.0), missing ground truth (error → 0.0), malformed JSON (0.0), markdown-fenced JSON (stripped)
+- Detailed output: shows matched entries, missed entries, TP/FP/FN counts
+- Files changed: benchmarks/templates/f1_json_scorer.sh
+- **Learnings for future iterations:**
+  - F1 scorer uses `key_fields` in ground_truth.json to define composite match key — more flexible than hardcoded field names
+  - The codereview test.sh has a hybrid approach (F1 detection + fix verification) — F1 scorer template only handles the detection/matching portion
+  - Both scorer templates follow the same shell structure: bash prereq checks → embedded Python for scoring logic
+  - Agent output can be wrapped in markdown code fences (```json blocks) — always strip these before parsing
+  - Scorer templates write to /logs/verifier/reward.txt (absolute path) — they're designed to run inside Docker containers, not locally
+---
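
Note on the scoring logic described in the progress entry above: the following is a minimal Python sketch of composite-key F1 matching, not the contents of `f1_json_scorer.sh`. It assumes the agent output is a JSON list of objects and that ground_truth.json holds its reference items under an `entries` field; that field name and every function name here are hypothetical. Only `key_fields` comes from the commit itself.

```python
import json
import re

# Build the fence marker programmatically so this snippet nests cleanly in markdown.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*\n(.*?)" + FENCE, re.DOTALL)

ZERO = {"tp": 0, "fp": 0, "fn": 0, "precision": 0.0, "recall": 0.0, "f1": 0.0}


def strip_code_fences(text: str) -> str:
    """Return the body of the first fenced block, or the text itself if unfenced."""
    match = FENCED_BLOCK.search(text)
    return match.group(1).strip() if match else text.strip()


def composite_key(entry: dict, key_fields: list) -> tuple:
    """Build a hashable match key; a missing field serializes as 'null'."""
    return tuple(json.dumps(entry.get(field), sort_keys=True) for field in key_fields)


def f1_score(raw_output: str, ground_truth: dict) -> dict:
    """Score agent output against ground truth; unparseable output scores 0.0."""
    try:
        predicted = json.loads(strip_code_fences(raw_output))
    except json.JSONDecodeError:
        return dict(ZERO)  # covers empty output and malformed JSON
    if not isinstance(predicted, list):
        return dict(ZERO)

    key_fields = ground_truth["key_fields"]
    # "entries" is an assumed field name for the reference items.
    expected = {composite_key(e, key_fields) for e in ground_truth["entries"]}
    found = {composite_key(e, key_fields) for e in predicted if isinstance(e, dict)}

    tp = len(expected & found)   # matched entries
    fp = len(found - expected)   # reported but not in ground truth
    fn = len(expected - found)   # missed entries
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}
```

Using sets means duplicate entries in the output count once, which may or may not match the template's behavior. Calling `f1_score(agent_text, json.load(open("ground_truth.json")))` returns the TP/FP/FN counts alongside the three metrics, so a wrapper script could write the `f1` value to its reward file.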
