Commit dbf648b

LoCoBench Bot and claude committed
chore: mark US-005 as passing, update progress log
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 6542883 commit dbf648b

2 files changed

Lines changed: 17 additions & 1 deletion

ralph-gapfill-infra/prd.json

Lines changed: 1 addition & 1 deletion
@@ -81,7 +81,7 @@
 "bash -n benchmarks/templates/f1_json_scorer.sh passes (syntax valid)"
 ],
 "priority": 5,
-"passes": false,
+"passes": true,
 "notes": "The scorer should match entries by a composite key (e.g., repo+file+function) and compute standard IR metrics. Look at ccb_codereview's test.sh for F1 scoring pattern to reuse."
 }
 ]

ralph-gapfill-infra/progress.txt

Lines changed: 16 additions & 0 deletions
@@ -10,6 +10,7 @@
 - selected_benchmark_tasks.json entries: {benchmark, task_id, repo, language, difficulty, mcp_benefit_score, ...}
 - Archive location: benchmarks/archive/ for deprecated suites, configs/archive/ for deprecated configs
 - Shared scorer template at `benchmarks/templates/weighted_checklist_scorer.sh` is parameterizable via REPORT_PATH env var
+- F1 JSON scorer at `benchmarks/templates/f1_json_scorer.sh` — configurable via OUTPUT_PATH env var and `key_fields` in ground_truth.json
 - New suites output paths: nlqa→`/logs/agent/investigation.md`, security→`/logs/agent/triage.md`, onboarding→`/logs/agent/onboarding.md`, docgen→`/workspace/documentation.md`
 - Config script run function names must be unique per suite (e.g., `_nlqa_run_single`) to avoid shell function collisions
 - Run directory prefix in JOBS_BASE (e.g., `nlqa_`) must match DIR_PREFIX_TO_SUITE mappings in generate_manifest.py and aggregate_status.py
@@ -68,3 +69,18 @@
 - DIR_PREFIX_TO_SUITE mappings for repoqa_ were left in place in scripts — archived suites' run data may still exist in runs/official and the scripts should still recognize them
 - When archiving suites: move benchmark dir, move config script, remove from selected_benchmark_tasks.json, update benchmarks/README.md
 ---
+
+## 2026-02-16 - US-005
+- Created F1 JSON scorer template at benchmarks/templates/f1_json_scorer.sh
+- Scores structured JSON output (e.g., callers.json, symbols.json) against ground truth
+- Computes precision, recall, and F1 by matching entries on configurable composite key fields
+- Handles all edge cases: empty output (0.0), missing ground truth (error → 0.0), malformed JSON (0.0), markdown-fenced JSON (stripped)
+- Detailed output: shows matched entries, missed entries, TP/FP/FN counts
+- Files changed: benchmarks/templates/f1_json_scorer.sh
+- **Learnings for future iterations:**
+- F1 scorer uses `key_fields` in ground_truth.json to define composite match key — more flexible than hardcoded field names
+- The codereview test.sh has a hybrid approach (F1 detection + fix verification) — F1 scorer template only handles the detection/matching portion
+- Both scorer templates follow the same shell structure: bash prereq checks → embedded Python for scoring logic
+- Agent output can be wrapped in markdown code fences (```json blocks) — always strip these before parsing
+- Scorer templates write to /logs/verifier/reward.txt (absolute path) — they're designed to run inside Docker containers, not locally
+---
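
The matching logic described in the US-005 entry can be sketched roughly as follows. This is a minimal illustration, not the contents of `f1_json_scorer.sh`: the function names, the `entries` field in the ground truth, and the set-based deduplication are all assumptions made for the example. Only `key_fields`, the fence stripping, and the 0.0 fallbacks come from the progress log.

```python
import json
import re

FENCE = "`" * 3  # markdown code-fence marker, built here to keep this example fence-safe


def strip_code_fences(text):
    """Return the body of a surrounding markdown fence if there is one, else the text."""
    pattern = r"^\s*" + FENCE + r"[A-Za-z0-9_-]*\s*\n(.*?)\n?\s*" + FENCE + r"\s*$"
    match = re.match(pattern, text, re.DOTALL)
    return match.group(1) if match else text


def parse_entries(raw):
    """Parse agent output into a list of entries; return None if it is malformed."""
    try:
        data = json.loads(strip_code_fences(raw))
    except (json.JSONDecodeError, TypeError):
        return None
    return data if isinstance(data, list) else None


def composite_key(entry, key_fields):
    """Build the match key from the configured fields (missing fields become '')."""
    return tuple(str(entry.get(field, "")) for field in key_fields)


def f1_score(output_text, ground_truth):
    """Compute TP/FP/FN, precision, recall, and F1 over composite keys.

    Assumes ground_truth looks like {"key_fields": [...], "entries": [...]};
    the "entries" name is hypothetical.
    """
    key_fields = ground_truth["key_fields"]
    expected = {composite_key(e, key_fields) for e in ground_truth["entries"]}
    entries = parse_entries(output_text)
    if not entries:  # empty or malformed output scores 0.0
        return {"tp": 0, "fp": 0, "fn": len(expected),
                "precision": 0.0, "recall": 0.0, "f1": 0.0}
    predicted = {composite_key(e, key_fields) for e in entries if isinstance(e, dict)}
    tp = len(predicted & expected)
    fp = len(predicted - expected)
    fn = len(expected - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    truth = {
        "key_fields": ["repo", "file", "function"],
        "entries": [
            {"repo": "r", "file": "a.py", "function": "f"},
            {"repo": "r", "file": "a.py", "function": "g"},
            {"repo": "r", "file": "b.py", "function": "h"},
        ],
    }
    agent_output = FENCE + "json\n" + json.dumps([
        {"repo": "r", "file": "a.py", "function": "f"},
        {"repo": "r", "file": "a.py", "function": "g"},
        {"repo": "r", "file": "c.py", "function": "x"},
    ]) + "\n" + FENCE
    print(f1_score(agent_output, truth))
```

Matching on a tuple of stringified key fields keeps the comparison independent of any extra fields the agent emits. The sketch returns the metrics rather than writing `/logs/verifier/reward.txt`, so it can be run outside a container.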
