
Add detailed, human readable, eval_results schema (status, F2P/P2P breakdown) #104

Open
jhmblundin wants to merge 1 commit into scaleapi:main from blitzy-showcase:upstream-contrib/detailed-eval-results


@jhmblundin commented Jun 3, 2026


Replace the per-instance bool in eval_results.json with a structured dict that exposes which specific FAIL_TO_PASS / PASS_TO_PASS tests passed or failed, plus an error message on failure paths. This lets users triage failures without opening per-instance log files.

Schema change

  • Before: {instance_id: True/False}
  • After: {instance_id: {
    "status": "Pass" | "Fail",
    "resolved": bool, # convenience flag
    "PASS_TO_PASS": "N/M passed (failed: t1, t2, ...)",
    "FAIL_TO_PASS": "N/M passed (failed: t1, t2, ...)",
    "error": "..." # only on failure paths
    }}
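As a rough illustration of the triage this schema is meant to enable, the sketch below groups instances into resolved, test failures, and infrastructure failures using only the keys listed above. The function name `triage`, the instance ids, and the sample values are hypothetical and not part of the PR. It assumes that the `"error"` key appears only on infrastructure-failure entries.

```python
def triage(eval_results: dict) -> dict:
    """Group instance ids by outcome using the new per-instance dicts."""
    groups = {"resolved": [], "test_failure": [], "infra_failure": []}
    for instance_id, result in eval_results.items():
        if result["resolved"]:
            groups["resolved"].append(instance_id)
        elif "error" in result:
            # Assumption: "error" is only present when evaluation itself failed.
            groups["infra_failure"].append(instance_id)
        else:
            groups["test_failure"].append(instance_id)
    return groups


# Hypothetical sample data shaped like the schema above.
sample = {
    "inst-1": {"status": "Pass", "resolved": True,
               "PASS_TO_PASS": "3/3 passed", "FAIL_TO_PASS": "2/2 passed"},
    "inst-2": {"status": "Fail", "resolved": False,
               "PASS_TO_PASS": "3/3 passed",
               "FAIL_TO_PASS": "1/2 passed (failed: test_b)"},
    "inst-3": {"status": "Fail", "resolved": False,
               "PASS_TO_PASS": "", "FAIL_TO_PASS": "",
               "error": "container timed out"},
}
groups = triage(sample)
print(groups)
```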

This is a breaking change for any downstream consumer that parses eval_results.json as {id: bool}. The "resolved" key is provided as a convenience flag so most migrations are mechanical:

# Before
if eval_results[id]: ...
sum(eval_results.values())

# After
if eval_results[id]["resolved"]: ...
sum(r["resolved"] for r in eval_results.values())
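For consumers that may read result files written both before and after this change, one possible approach is a small reader that accepts either entry shape. This is a sketch only: `is_resolved` and `count_resolved` are hypothetical helper names, not functions from the PR.

```python
def is_resolved(entry) -> bool:
    """Accept either the old bool entry or the new dict entry."""
    if isinstance(entry, bool):
        # Old schema: {instance_id: True/False}
        return entry
    # New schema: {instance_id: {"resolved": bool, ...}}
    return bool(entry.get("resolved", False))


def count_resolved(eval_results: dict) -> int:
    """Count resolved instances regardless of schema version."""
    return sum(is_resolved(entry) for entry in eval_results.values())


# Hypothetical inputs in both shapes.
old_style = {"a": True, "b": False}
new_style = {
    "a": {"status": "Pass", "resolved": True},
    "b": {"status": "Fail", "resolved": False},
}
print(count_resolved(old_style), count_resolved(new_style))
```

A shim like this lets downstream code migrate gradually, at the cost of hiding which schema version a file actually uses.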

Also pretty-prints eval_results.json with indent=2 since the new structure benefits from human-readable formatting.

Internal cleanup

  • Extracted _make_failure_result, _format_test_breakdown, _build_detailed_result, and _running_accuracy helpers so the main loop reads at one level of abstraction.
  • Uses output.get("tests", []) to avoid KeyError on partially-populated outputs.
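The defensive `output.get("tests", [])` access might look roughly like the sketch below. The shape of each test entry (a dict with `"name"` and `"status"` keys, where `"PASSED"` marks success) is an assumption for illustration, and `collect_passed_tests` is a hypothetical name. The PR description confirms only the `.get` call itself.

```python
def collect_passed_tests(output: dict) -> set:
    """Return the names of passed tests, tolerating a missing "tests" key."""
    # Assumed entry shape: {"name": str, "status": str}; not confirmed by the PR.
    return {
        test["name"]
        for test in output.get("tests", [])
        if test.get("status") == "PASSED"
    }


# A partially populated output no longer raises KeyError.
print(collect_passed_tests({}))
print(collect_passed_tests({"tests": [
    {"name": "test_a", "status": "PASSED"},
    {"name": "test_b", "status": "FAILED"},
]}))
```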

No new dependencies.

Greptile Summary

This PR replaces the flat {instance_id: bool} format in eval_results.json with a richer {instance_id: {status, resolved, PASS_TO_PASS, FAIL_TO_PASS, error}} schema, extracting four focused helper functions to keep the main loop readable and adding indent=2 pretty-printing.

  • Schema change: each entry now carries a human-readable breakdown of which tests passed/failed and an error field on infra-failure paths, making triage possible without consulting per-instance logs.
  • Helpers: _make_failure_result, _format_test_breakdown, _build_detailed_result, and _running_accuracy are all well-scoped; output.get("tests", []) defensively handles partially-populated outputs.
  • Breaking change: downstream consumers reading eval_results[id] as a bool must migrate to eval_results[id]["resolved"]; a migration guide is included in the PR description.

Confidence Score: 4/5

Safe to merge: the core logic is correct and the change enriches the schema, although it is breaking for consumers that expect a bool. The main remaining risk is downstream consumers parsing the human-readable breakdown strings, which have a subtle format inconsistency between infra-failure and test-failure entries.

The result-building logic is sound and _running_accuracy faithfully replicates the old denominator semantics. Two format inconsistencies exist: _make_failure_result stores "" for PASS_TO_PASS/FAIL_TO_PASS while _build_detailed_result always stores a formatted "N/M passed ..." string, and _format_test_breakdown emits the ambiguous "0/0 passed" for instances with no tests of a given type. Neither causes a runtime failure in the writer, but both can surprise consumers who parse these strings.
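A consumer that wants counts from these strings could guard against both edge cases with a tolerant parser. The sketch below is one way to do that and is not part of the PR. `parse_breakdown` and `Breakdown` are hypothetical names, and the parser returns `None` for anything that is not in the `"N/M passed ..."` form, including `""` and `"N/A"`.

```python
import re
from typing import NamedTuple, Optional


class Breakdown(NamedTuple):
    passed: int
    total: int
    failed: list


# Matches "N/M passed" with an optional " (failed: a, b, c)" suffix.
_PATTERN = re.compile(r"^(\d+)/(\d+) passed(?: \(failed: (.*)\))?$")


def parse_breakdown(text: str) -> Optional[Breakdown]:
    """Parse a breakdown string, or return None if it carries no counts."""
    match = _PATTERN.match(text)
    if match is None:
        return None
    passed, total, failed = match.groups()
    # Caveat: test names that themselves contain ", " cannot be split reliably.
    names = failed.split(", ") if failed else []
    return Breakdown(int(passed), int(total), names)


print(parse_breakdown("1/3 passed (failed: t1, t2)"))
print(parse_breakdown(""))
```

Returning `None` rather than raising keeps infra-failure entries from aborting an aggregation loop. Callers still have to decide how to treat a `0/0` result.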

File needing attention: swe_bench_pro_eval.py, specifically _make_failure_result and _format_test_breakdown and their interaction with the documented schema.

Important Files Changed

| Filename | Overview |
| --- | --- |
| swe_bench_pro_eval.py | Adds structured per-instance result dicts to eval_results.json, replacing plain booleans; logic is correct but PASS_TO_PASS/FAIL_TO_PASS field format is inconsistent between infra-failure and test-failure paths. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[future.result] -->|is None| B[_make_failure_result]
    A -->|output dict| C{instance_id in raw_sample_df?}
    C -->|No| D[_make_failure_result]
    C -->|Yes| E[collect passed_tests via output.get tests]
    E --> F[eval f2p and p2p from raw_sample]
    F --> G[_build_detailed_result]
    G --> H{f2p union p2p subset of passed_tests?}
    H -->|Yes| I[status: Pass, resolved: True]
    H -->|No| J[status: Fail, resolved: False]
    A -->|raises Exception| K[_make_failure_result with str of exc]
    B --> L[result with error key, PASS_TO_PASS empty string]
    D --> L
    K --> L
    I --> M[result without error key, formatted PASS_TO_PASS and FAIL_TO_PASS]
    J --> M
    L --> N[eval_results dict]
    M --> N
    N --> O[_running_accuracy]
    N --> P[json.dump with indent=2]


Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
swe_bench_pro_eval.py:295-302
Inconsistent `PASS_TO_PASS`/`FAIL_TO_PASS` value format between the two result constructors. `_make_failure_result` stores empty strings `""` for these fields, while `_build_detailed_result` always stores a formatted `"N/M passed (failed: ...)"` string. The module-level schema comment documents both fields as `"N/M passed (failed: a, b, c)"` without noting the empty-string exception, so a consumer that parses the count (for example `int(result["FAIL_TO_PASS"].split("/")[0])`) will raise `ValueError` on infra-failure entries. Setting these to `"N/A"` for infra failures would give the fields an explicit sentinel in place of an empty string, although consumers must still check for the sentinel before parsing counts, since `"N/A"` itself contains a `/`.

```suggestion
def _make_failure_result(error_msg: str) -> dict:
    return {
        "status": "Fail",
        "resolved": False,
        "PASS_TO_PASS": "N/A",
        "FAIL_TO_PASS": "N/A",
        "error": error_msg,
    }
```

### Issue 2 of 2
swe_bench_pro_eval.py:305-311
`_format_test_breakdown` returns `"0/0 passed"` when `expected` is an empty set (i.e., the instance has no FAIL_TO_PASS or no PASS_TO_PASS tests). This is ambiguous: it reads like a vacuous success rather than an explicit statement that no tests of that type exist. A special case would make it unambiguous to readers and downstream parsers.

```suggestion
def _format_test_breakdown(expected: set, passed: set) -> str:
    if not expected:
        return "N/A (no tests)"
    actually_passed = expected & passed
    failed = expected - passed
    line = f"{len(actually_passed)}/{len(expected)} passed"
    if failed:
        line += f" (failed: {', '.join(sorted(failed))})"
    return line
```

Reviews (1): Last reviewed commit: "Add detailed eval_results schema (status..."

Greptile also left 2 inline comments on this PR.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread swe_bench_pro_eval.py
Comment on lines +295 to +302
def _make_failure_result(error_msg: str) -> dict:
    return {
        "status": "Fail",
        "resolved": False,
        "PASS_TO_PASS": "",
        "FAIL_TO_PASS": "",
        "error": error_msg,
    }


P2 Inconsistent PASS_TO_PASS/FAIL_TO_PASS value format between the two result constructors. _make_failure_result stores empty strings "" for these fields, while _build_detailed_result always stores a formatted "N/M passed (failed: ...)" string. The module-level schema comment documents both fields as "N/M passed (failed: a, b, c)" without noting the empty-string exception, so a consumer that parses the count (for example int(result["FAIL_TO_PASS"].split("/")[0])) will raise ValueError on infra-failure entries. Setting these to "N/A" for infra failures would give the fields an explicit sentinel in place of an empty string, although consumers must still check for the sentinel before parsing counts, since "N/A" itself contains a "/".

Suggested change
- def _make_failure_result(error_msg: str) -> dict:
-     return {
-         "status": "Fail",
-         "resolved": False,
-         "PASS_TO_PASS": "",
-         "FAIL_TO_PASS": "",
-         "error": error_msg,
-     }
+ def _make_failure_result(error_msg: str) -> dict:
+     return {
+         "status": "Fail",
+         "resolved": False,
+         "PASS_TO_PASS": "N/A",
+         "FAIL_TO_PASS": "N/A",
+         "error": error_msg,
+     }

Comment thread swe_bench_pro_eval.py
Comment on lines +305 to +311
def _format_test_breakdown(expected: set, passed: set) -> str:
    actually_passed = expected & passed
    failed = expected - passed
    line = f"{len(actually_passed)}/{len(expected)} passed"
    if failed:
        line += f" (failed: {', '.join(sorted(failed))})"
    return line


P2 _format_test_breakdown returns "0/0 passed" when expected is an empty set (i.e., the instance has no FAIL_TO_PASS or no PASS_TO_PASS tests). This is ambiguous: it reads like a vacuous success rather than an explicit statement that no tests of that type exist. A special case would make it unambiguous to readers and downstream parsers.

Suggested change
- def _format_test_breakdown(expected: set, passed: set) -> str:
-     actually_passed = expected & passed
-     failed = expected - passed
-     line = f"{len(actually_passed)}/{len(expected)} passed"
-     if failed:
-         line += f" (failed: {', '.join(sorted(failed))})"
-     return line
+ def _format_test_breakdown(expected: set, passed: set) -> str:
+     if not expected:
+         return "N/A (no tests)"
+     actually_passed = expected & passed
+     failed = expected - passed
+     line = f"{len(actually_passed)}/{len(expected)} passed"
+     if failed:
+         line += f" (failed: {', '.join(sorted(failed))})"
+     return line
