Add detailed, human readable, eval_results schema (status, F2P/P2P breakdown) #104
Replace the per-instance bool in eval_results.json with a structured
dict that exposes which specific FAIL_TO_PASS / PASS_TO_PASS tests
passed or failed, plus an error message on failure paths. This lets
users triage failures without opening per-instance log files.
Schema change
- Before: {instance_id: True/False}
- After: {instance_id: {
"status": "Pass" | "Fail",
"resolved": bool, # convenience flag
"PASS_TO_PASS": "N/M passed (failed: t1, t2, ...)",
"FAIL_TO_PASS": "N/M passed (failed: t1, t2, ...)",
"error": "..." # only on failure paths
}}
This is a breaking change for any downstream consumer that parses
eval_results.json as {id: bool}. The "resolved" key is provided as a
convenience flag so most migrations are mechanical:
# Before
if eval_results[id]: ...
sum(eval_results.values())
# After
if eval_results[id]["resolved"]: ...
sum(r["resolved"] for r in eval_results.values())
Also pretty-prints eval_results.json with indent=2 since the new
structure benefits from human-readable formatting.
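For consumers that may read result files written both before and after this change, the migration above can be wrapped in a small compatibility shim. This is a minimal sketch: `is_resolved` and `resolved_count` are hypothetical helper names, not part of the PR, and the sample entries are invented for illustration.

```python
from typing import Union


def is_resolved(entry: Union[bool, dict]) -> bool:
    """Return the resolved flag for an old-style bool or a new-style dict entry."""
    if isinstance(entry, bool):
        # Old schema: {instance_id: True/False}
        return entry
    # New schema: {instance_id: {"resolved": bool, ...}}
    return bool(entry.get("resolved", False))


def resolved_count(eval_results: dict) -> int:
    """Count resolved instances regardless of which schema the file uses."""
    return sum(is_resolved(entry) for entry in eval_results.values())


# Invented sample data in both shapes.
old_style = {"instance-a": True, "instance-b": False}
new_style = {
    "instance-a": {"status": "Pass", "resolved": True,
                   "PASS_TO_PASS": "2/2 passed", "FAIL_TO_PASS": "1/1 passed"},
    "instance-b": {"status": "Fail", "resolved": False,
                   "PASS_TO_PASS": "", "FAIL_TO_PASS": "",
                   "error": "container failed to start"},
}

print(resolved_count(old_style), resolved_count(new_style))
```

A shim like this lets a consumer upgrade first and drop the `isinstance` branch once no old-format files remain.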
Internal cleanup
- Extracted _make_failure_result, _format_test_breakdown,
_build_detailed_result, and _running_accuracy helpers so the main
loop reads at one level of abstraction.
- Uses output.get("tests", []) to avoid KeyError on partially-populated
outputs.
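The PR page does not show `_build_detailed_result` itself, so the following is only a sketch of how such a helper could combine the breakdown formatting with `output.get("tests", [])`. The function names, and the assumption that each test entry looks like `{"name": ..., "status": "PASSED"}`, are illustrative and not confirmed by the source.

```python
def breakdown(expected: set, passed: set) -> str:
    """Format "N/M passed", listing any expected tests that did not pass."""
    failed = expected - passed
    line = f"{len(expected & passed)}/{len(expected)} passed"
    if failed:
        line += f" (failed: {', '.join(sorted(failed))})"
    return line


def build_detailed_result_sketch(output: dict, f2p: set, p2p: set) -> dict:
    """Build a per-instance result dict from a (possibly partial) output."""
    # .get() avoids a KeyError when the output has no "tests" key.
    # Assumed entry shape: {"name": str, "status": str}.
    passed = {
        t["name"] for t in output.get("tests", []) if t.get("status") == "PASSED"
    }
    # Resolved only if every expected test is among the passed tests.
    resolved = (f2p | p2p) <= passed
    return {
        "status": "Pass" if resolved else "Fail",
        "resolved": resolved,
        "PASS_TO_PASS": breakdown(p2p, passed),
        "FAIL_TO_PASS": breakdown(f2p, passed),
    }


partial = build_detailed_result_sketch({}, {"test_fix"}, {"test_old"})
complete = build_detailed_result_sketch(
    {"tests": [{"name": "test_fix", "status": "PASSED"},
               {"name": "test_old", "status": "PASSED"}]},
    {"test_fix"}, {"test_old"},
)
print(partial["status"], complete["status"])
```

With a partially populated output, the instance is reported as a normal test failure with every expected test listed, rather than aborting the whole run.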
No new dependencies.
Co-authored-by: Cursor <cursoragent@cursor.com>
def _make_failure_result(error_msg: str) -> dict:
    return {
        "status": "Fail",
        "resolved": False,
        "PASS_TO_PASS": "",
        "FAIL_TO_PASS": "",
        "error": error_msg,
    }
Inconsistent `PASS_TO_PASS`/`FAIL_TO_PASS` value format between the two result constructors. `_make_failure_result` stores empty strings `""` for these fields, while `_build_detailed_result` always stores a formatted `"N/M passed (failed: ...)"` string. The module-level schema comment documents both fields as `"N/M passed (failed: a, b, c)"` without noting the empty-string exception, so any consumer that tries to parse the count (`int(result["FAIL_TO_PASS"].split("/")[0])`, etc.) will raise `ValueError` on infra-failure entries. Setting these to `"N/A"` for infra failures would give every entry a non-empty value and an explicit sentinel that consumers can check before parsing.
Suggested change:
 def _make_failure_result(error_msg: str) -> dict:
     return {
         "status": "Fail",
         "resolved": False,
-        "PASS_TO_PASS": "",
-        "FAIL_TO_PASS": "",
+        "PASS_TO_PASS": "N/A",
+        "FAIL_TO_PASS": "N/A",
         "error": error_msg,
     }
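On the consumer side, a tolerant parser avoids depending on which placeholder the writer uses for infra failures. The sketch below assumes only the `"N/M passed"` prefix documented in the schema; `parse_breakdown` is a hypothetical name, not something the PR provides.

```python
import re
from typing import Optional, Tuple

# Matches the documented "N/M passed" prefix; anything else is not a count.
_COUNT_RE = re.compile(r"^(\d+)/(\d+) passed")


def parse_breakdown(value: str) -> Optional[Tuple[int, int]]:
    """Return (passed, total), or None for placeholders such as "" or "N/A"."""
    match = _COUNT_RE.match(value)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


for sample in ["3/5 passed (failed: t1, t2)", "2/2 passed", "", "N/A"]:
    print(repr(sample), "->", parse_breakdown(sample))
```

Returning `None` instead of raising keeps infra-failure entries distinguishable from real counts without a try/except at every call site.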
Path: swe_bench_pro_eval.py, lines 295-302
def _format_test_breakdown(expected: set, passed: set) -> str:
    actually_passed = expected & passed
    failed = expected - passed
    line = f"{len(actually_passed)}/{len(expected)} passed"
    if failed:
        line += f" (failed: {', '.join(sorted(failed))})"
    return line
`_format_test_breakdown` returns `"0/0 passed"` when `expected` is an empty set (i.e., the instance has no FAIL_TO_PASS or no PASS_TO_PASS tests). This is ambiguous: the string reads like an ordinary pass count even though there were no tests of that type to run. A special case would make it unambiguous to readers and downstream parsers.
Suggested change:
 def _format_test_breakdown(expected: set, passed: set) -> str:
+    if not expected:
+        return "N/A (no tests)"
     actually_passed = expected & passed
     failed = expected - passed
     line = f"{len(actually_passed)}/{len(expected)} passed"
     if failed:
         line += f" (failed: {', '.join(sorted(failed))})"
     return line
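To make the difference concrete, the suggested version can be run against the three cases that matter: no expected tests, all expected tests passing, and some failing. The function body is the one from the suggestion, renamed without the leading underscore; the test names are invented.

```python
def format_test_breakdown(expected: set, passed: set) -> str:
    """Suggested variant: an explicit marker when no tests of this type exist."""
    if not expected:
        return "N/A (no tests)"
    actually_passed = expected & passed
    failed = expected - passed
    line = f"{len(actually_passed)}/{len(expected)} passed"
    if failed:
        # Sorting keeps the failure list stable across runs.
        line += f" (failed: {', '.join(sorted(failed))})"
    return line


no_tests = format_test_breakdown(set(), {"test_a"})
all_pass = format_test_breakdown({"test_a", "test_b"}, {"test_a", "test_b", "test_c"})
some_fail = format_test_breakdown({"test_a", "test_b", "test_c"}, {"test_b"})

print(no_tests)
print(all_pass)
print(some_fail)
```

Tests that passed but were not expected (here `test_c` in the second call) do not affect the count, because the numerator is the intersection with `expected`.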
Path: swe_bench_pro_eval.py, lines 305-311
Greptile Summary
This PR replaces the flat `{instance_id: bool}` format in `eval_results.json` with a richer `{instance_id: {status, resolved, PASS_TO_PASS, FAIL_TO_PASS, error}}` schema, extracting four focused helper functions to keep the main loop readable and adding `indent=2` pretty-printing.
- The new schema adds an `error` field on infra-failure paths, making triage possible without consulting per-instance logs.
- `_make_failure_result`, `_format_test_breakdown`, `_build_detailed_result`, and `_running_accuracy` are all well-scoped; `output.get("tests", [])` defensively handles partially-populated outputs.
- Consumers that read `eval_results[id]` as a bool must migrate to `eval_results[id]["resolved"]`; a migration guide is included in the PR description.
Confidence Score: 4/5
Safe to merge: the core logic is correct and this is additive schema enrichment; the main risk is downstream consumers parsing the human-readable breakdown strings, which have a subtle format inconsistency between infra-failure and test-failure entries.
The result-building logic is sound and `_running_accuracy` faithfully replicates the old denominator semantics. Two format inconsistencies exist: `_make_failure_result` stores `""` for `PASS_TO_PASS`/`FAIL_TO_PASS` while `_build_detailed_result` always stores a formatted `"N/M passed ..."` string, and `_format_test_breakdown` emits the ambiguous `"0/0 passed"` for instances with no tests of a given type. Neither causes a runtime failure in the writer, but both can surprise consumers who parse these strings.
The file to check is swe_bench_pro_eval.py, specifically `_make_failure_result` and `_format_test_breakdown` and their interaction with the documented schema.
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[future.result] -->|is None| B[_make_failure_result]
    A -->|output dict| C{instance_id in raw_sample_df?}
    C -->|No| D[_make_failure_result]
    C -->|Yes| E[collect passed_tests via output.get tests]
    E --> F[eval f2p and p2p from raw_sample]
    F --> G[_build_detailed_result]
    G --> H{f2p union p2p subset of passed_tests?}
    H -->|Yes| I[status: Pass, resolved: True]
    H -->|No| J[status: Fail, resolved: False]
    A -->|raises Exception| K[_make_failure_result with str of exc]
    B --> L[result with error key, PASS_TO_PASS empty string]
    D --> L
    K --> L
    I --> M[result without error key, formatted PASS_TO_PASS and FAIL_TO_PASS]
    J --> M
    L --> N[eval_results dict]
    M --> N
    N --> O[_running_accuracy]
    N --> P[json.dump with indent=2]
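The three failure branches in the flowchart can be sketched as a collection loop over futures. This is an illustration under assumptions, not the PR's code: it assumes `concurrent.futures` is the executor in use, and `collect_results`, `make_failure`, the `build` callback, and both hard-coded error messages are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Optional, Set


def make_failure(error_msg: str) -> dict:
    """Infra-failure entry, mirroring the shape shown in the PR."""
    return {"status": "Fail", "resolved": False,
            "PASS_TO_PASS": "", "FAIL_TO_PASS": "", "error": error_msg}


def collect_results(tasks: Dict[str, Callable[[], Optional[dict]]],
                    known_ids: Set[str],
                    build: Callable[[dict], dict]) -> Dict[str, dict]:
    """Run each task and map every outcome to a result dict."""
    results: Dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(task): iid for iid, task in tasks.items()}
        for future in as_completed(futures):
            iid = futures[future]
            try:
                output = future.result()
            except Exception as exc:
                # Branch K: the task raised, so record str(exc).
                results[iid] = make_failure(str(exc))
                continue
            if output is None:
                # Branch B: no output at all (message text is an assumption).
                results[iid] = make_failure("evaluation returned no output")
            elif iid not in known_ids:
                # Branch D: instance missing from the dataset (assumed message).
                results[iid] = make_failure("instance_id not found in dataset")
            else:
                # Branches I/J: a normal result, with no "error" key.
                results[iid] = build(output)
    return results
```

Every instance ends up with an entry in the returned dict, so a single infra failure does not drop the instance from the accuracy denominator.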
Reviews (1): Last reviewed commit: "Add detailed eval_results schema (status..."