Add detailed, human readable, eval_results schema (status, F2P/P2P breakdown) by jhmblundin · Pull Request #104 · scaleapi/SWE-bench_Pro-os

jhmblundin · 2026-06-03T01:06:08Z

Replace the per-instance bool in eval_results.json with a structured dict that exposes which specific FAIL_TO_PASS / PASS_TO_PASS tests passed or failed, plus an error message on failure paths. This lets users triage failures without opening per-instance log files.

Schema change

Before: {instance_id: True/False}
After: {instance_id: {
    "status": "Pass" | "Fail",
    "resolved": bool,  # convenience flag
    "PASS_TO_PASS": "N/M passed (failed: t1, t2, ...)",
    "FAIL_TO_PASS": "N/M passed (failed: t1, t2, ...)",
    "error": "..."  # only on failure paths
}}

This is a breaking change for any downstream consumer that parses eval_results.json as {id: bool}. The "resolved" key is provided as a convenience flag so most migrations are mechanical:

# Before
if eval_results[id]: ...
sum(eval_results.values())

# After
if eval_results[id]["resolved"]: ...
sum(r["resolved"] for r in eval_results.values())
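For consumers that may encounter files written both before and after this change, a small compatibility shim is one option. The sketch below is not part of the PR; `is_resolved` and `count_resolved` are hypothetical helper names, and it assumes only the two shapes described above (a bare bool, or a dict carrying a "resolved" key).

```python
def is_resolved(entry) -> bool:
    """Return the resolved flag for either schema version (hypothetical helper)."""
    if isinstance(entry, bool):
        # Old schema: {instance_id: bool}
        return entry
    # New schema: structured dict; a missing "resolved" key is treated as unresolved.
    return bool(entry.get("resolved", False))


def count_resolved(eval_results: dict) -> int:
    """Count resolved instances regardless of which schema the file uses."""
    return sum(is_resolved(entry) for entry in eval_results.values())


old_style = {"inst-1": True, "inst-2": False}
new_style = {
    "inst-1": {"status": "Pass", "resolved": True,
               "PASS_TO_PASS": "2/2 passed", "FAIL_TO_PASS": "1/1 passed"},
    "inst-2": {"status": "Fail", "resolved": False,
               "PASS_TO_PASS": "", "FAIL_TO_PASS": "", "error": "no output"},
}
print(count_resolved(old_style), count_resolved(new_style))
```

A shim like this lets a downstream script migrate once and keep reading archived results, at the cost of hiding which schema a given file actually uses.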

Also pretty-prints eval_results.json with indent=2 since the new structure benefits from human-readable formatting.

Internal cleanup

Extracted _make_failure_result, _format_test_breakdown, _build_detailed_result, and _running_accuracy helpers so the main loop reads at one level of abstraction.
Uses output.get("tests", []) to avoid KeyError on partially-populated outputs.
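The defensive lookup mentioned above can be illustrated roughly as follows. This is a sketch under assumptions, not the PR's code: the function name `collect_passed_tests` is hypothetical, and the shape of each test entry (a dict with "name" and "status" keys, where "PASSED" marks success) is assumed for illustration.

```python
def collect_passed_tests(output: dict) -> set:
    """Collect names of passed tests, tolerating outputs with no "tests" key.

    Assumed entry shape (not confirmed by the PR text):
    {"name": "<test id>", "status": "PASSED" | "FAILED" | ...}
    """
    return {
        test["name"]
        for test in output.get("tests", [])  # missing key yields an empty list
        if test.get("status") == "PASSED"
    }


partial_output = {}  # e.g. a run that produced no test section at all
full_output = {"tests": [
    {"name": "test_a", "status": "PASSED"},
    {"name": "test_b", "status": "FAILED"},
]}
print(collect_passed_tests(partial_output), collect_passed_tests(full_output))
```

With `output["tests"]` the empty case would raise KeyError; with `.get` it degrades to "no tests passed", which then surfaces as a Fail entry rather than a crash.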

No new dependencies.

Greptile Summary

This PR replaces the flat {instance_id: bool} format in eval_results.json with a richer {instance_id: {status, resolved, PASS_TO_PASS, FAIL_TO_PASS, error}} schema, extracting four focused helper functions to keep the main loop readable and adding indent=2 pretty-printing.

Schema change: each entry now carries a human-readable breakdown of which tests passed/failed and an error field on infra-failure paths, making triage possible without consulting per-instance logs.
Helpers: _make_failure_result, _format_test_breakdown, _build_detailed_result, and _running_accuracy are all well-scoped; output.get("tests", []) defensively handles partially-populated outputs.
Breaking change: downstream consumers reading eval_results[id] as a bool must migrate to eval_results[id]["resolved"]; a migration guide is included in the PR description.

Confidence Score: 4/5

Safe to merge — the core logic is correct and this is additive schema enrichment; the main risk is downstream consumers parsing the human-readable breakdown strings, which have a subtle format inconsistency between infra-failure and test-failure entries.

The result-building logic is sound and _running_accuracy faithfully replicates the old denominator semantics. Two format inconsistencies exist: _make_failure_result stores "" for PASS_TO_PASS/FAIL_TO_PASS while _build_detailed_result always stores a formatted "N/M passed ..." string, and _format_test_breakdown emits the ambiguous "0/0 passed" for instances with no tests of a given type. Neither causes a runtime failure in the writer, but both can surprise consumers who parse these strings.

swe_bench_pro_eval.py — specifically _make_failure_result and _format_test_breakdown and their interaction with the documented schema.
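To make the parsing concern concrete, here is a sketch of how a consumer might read the breakdown strings without crashing on the non-numeric variants. `parse_breakdown` is a hypothetical helper, not part of the PR; it assumes the "N/M passed ..." format described above and treats anything else (an empty string, "N/A", or similar) as "no counts available".

```python
import re
from typing import Optional, Tuple

# Matches the leading "N/M passed" portion of a breakdown string.
_BREAKDOWN_RE = re.compile(r"^(\d+)/(\d+) passed")


def parse_breakdown(value: str) -> Optional[Tuple[int, int]]:
    """Return (passed, total) if the string carries counts, else None."""
    match = _BREAKDOWN_RE.match(value)
    if match is None:
        return None  # "", "N/A", or any other non-numeric placeholder
    return int(match.group(1)), int(match.group(2))


for sample in ["3/5 passed (failed: t1, t2)", "0/0 passed", "", "N/A"]:
    print(repr(sample), "->", parse_breakdown(sample))
```

Note that "0/0 passed" still parses to (0, 0), so a consumer relying on the counts would need its own rule for instances with no tests of a given type.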

Important Files Changed

Filename	Overview
swe_bench_pro_eval.py	Adds structured per-instance result dicts to eval_results.json, replacing plain booleans; logic is correct but PASS_TO_PASS/FAIL_TO_PASS field format is inconsistent between infra-failure and test-failure paths.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[future.result] -->|is None| B[_make_failure_result]
    A -->|output dict| C{instance_id in raw_sample_df?}
    C -->|No| D[_make_failure_result]
    C -->|Yes| E[collect passed_tests via output.get tests]
    E --> F[eval f2p and p2p from raw_sample]
    F --> G[_build_detailed_result]
    G --> H{f2p union p2p subset of passed_tests?}
    H -->|Yes| I[status: Pass, resolved: True]
    H -->|No| J[status: Fail, resolved: False]
    A -->|raises Exception| K[_make_failure_result with str of exc]
    B --> L[result with error key, PASS_TO_PASS empty string]
    D --> L
    K --> L
    I --> M[result without error key, formatted PASS_TO_PASS and FAIL_TO_PASS]
    J --> M
    L --> N[eval_results dict]
    M --> N
    N --> O[_running_accuracy]
    N --> P[json.dump with indent=2]
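The success path in the flowchart can be sketched as below. This is an approximation for illustration: the actual `_build_detailed_result` signature is not shown on this page, so the function names and parameters here are assumptions; the formatting helper mirrors the `_format_test_breakdown` body quoted in the review comments.

```python
def format_test_breakdown(expected: set, passed: set) -> str:
    """Render "N/M passed (failed: ...)" for one test category."""
    actually_passed = expected & passed
    failed = expected - passed
    line = f"{len(actually_passed)}/{len(expected)} passed"
    if failed:
        line += f" (failed: {', '.join(sorted(failed))})"
    return line


def build_detailed_result(f2p: set, p2p: set, passed_tests: set) -> dict:
    """Assumed shape of the success-path constructor (no "error" key)."""
    # Resolved only if every expected test, of either kind, passed.
    resolved = (f2p | p2p) <= passed_tests
    return {
        "status": "Pass" if resolved else "Fail",
        "resolved": resolved,
        "PASS_TO_PASS": format_test_breakdown(p2p, passed_tests),
        "FAIL_TO_PASS": format_test_breakdown(f2p, passed_tests),
    }


result = build_detailed_result(
    f2p={"t1"}, p2p={"t2", "t3"}, passed_tests={"t1", "t2"}
)
print(result)
```

In this example the single FAIL_TO_PASS test passes but one PASS_TO_PASS test regresses, so the entry is a Fail whose PASS_TO_PASS field names the regressed test.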

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
swe_bench_pro_eval.py:295-302
Inconsistent `PASS_TO_PASS`/`FAIL_TO_PASS` value format between the two result constructors. `_make_failure_result` stores empty strings `""` for these fields, while `_build_detailed_result` always stores a formatted `"N/M passed (failed: ...)"` string. The module-level schema comment documents both fields as `"N/M passed (failed: a, b, c)"` without noting the empty-string exception, so any consumer that tries to parse the count (`result["FAIL_TO_PASS"].split("/")[0]`, etc.) will crash on infra-failure entries. Setting these to `"N/A"` for infra failures would make the format uniform and safe to parse.

```suggestion
def _make_failure_result(error_msg: str) -> dict:
    return {
        "status": "Fail",
        "resolved": False,
        "PASS_TO_PASS": "N/A",
        "FAIL_TO_PASS": "N/A",
        "error": error_msg,
    }
```

### Issue 2 of 2
swe_bench_pro_eval.py:305-311
`_format_test_breakdown` returns `"0/0 passed"` when `expected` is an empty set (i.e., the instance has no FAIL_TO_PASS or no PASS_TO_PASS tests). This is ambiguous — it looks identical to a case where all zero-of-zero tests ran and passed. A special case would make it unambiguous to readers and downstream parsers.

```suggestion
def _format_test_breakdown(expected: set, passed: set) -> str:
    if not expected:
        return "N/A (no tests)"
    actually_passed = expected & passed
    failed = expected - passed
    line = f"{len(actually_passed)}/{len(expected)} passed"
    if failed:
        line += f" (failed: {', '.join(sorted(failed))})"
    return line
```


Greptile also left 2 inline comments on this PR.

Co-authored-by: Cursor <cursoragent@cursor.com>

greptile-apps · 2026-06-03T01:09:46Z

+def _make_failure_result(error_msg: str) -> dict:
+    return {
+        "status": "Fail",
+        "resolved": False,
+        "PASS_TO_PASS": "",
+        "FAIL_TO_PASS": "",
+        "error": error_msg,
+    }


Inconsistent PASS_TO_PASS/FAIL_TO_PASS value format between the two result constructors. _make_failure_result stores empty strings "" for these fields, while _build_detailed_result always stores a formatted "N/M passed (failed: ...)" string. The module-level schema comment documents both fields as "N/M passed (failed: a, b, c)" without noting the empty-string exception, so any consumer that tries to parse the count (result["FAIL_TO_PASS"].split("/")[0], etc.) will crash on infra-failure entries. Setting these to "N/A" for infra failures would make the format uniform and safe to parse.

Suggested change

 def _make_failure_result(error_msg: str) -> dict:
     return {
         "status": "Fail",
         "resolved": False,
-        "PASS_TO_PASS": "",
-        "FAIL_TO_PASS": "",
+        "PASS_TO_PASS": "N/A",
+        "FAIL_TO_PASS": "N/A",
         "error": error_msg,
     }


greptile-apps · 2026-06-03T01:09:47Z

+def _format_test_breakdown(expected: set, passed: set) -> str:
+    actually_passed = expected & passed
+    failed = expected - passed
+    line = f"{len(actually_passed)}/{len(expected)} passed"
+    if failed:
+        line += f" (failed: {', '.join(sorted(failed))})"
+    return line


_format_test_breakdown returns "0/0 passed" when expected is an empty set (i.e., the instance has no FAIL_TO_PASS or no PASS_TO_PASS tests). This is ambiguous — it looks identical to a case where all zero-of-zero tests ran and passed. A special case would make it unambiguous to readers and downstream parsers.

Suggested change

 def _format_test_breakdown(expected: set, passed: set) -> str:
+    if not expected:
+        return "N/A (no tests)"
     actually_passed = expected & passed
     failed = expected - passed
     line = f"{len(actually_passed)}/{len(expected)} passed"
     if failed:
         line += f" (failed: {', '.join(sorted(failed))})"
     return line


jhmblundin wants to merge 1 commit into scaleapi:main from blitzy-showcase:upstream-contrib/detailed-eval-results
