Add SWE-bench Pro model run analyses #106

Open
chirag9127 wants to merge 7 commits into scaleapi:main from chirag9127:swe-bench-pro-model-runs

chirag9127 commented Jun 8, 2026

Summary

  • Add mini SWE-agent SWE-bench Pro pilot artifacts
  • Add model error taxonomy analysis for Claude Sonnet 4 and GPT-4o
  • Add public task analysis artifacts

Notes

  • is not included
  • Error taxonomy markdown is in

Greptile Summary

This PR adds mini SWE-agent pilot run artifacts for 15 benchmark tasks across three models (claude-haiku-4-5, claude-opus-4-8, kimi-k2.5), along with analysis documents covering public task distribution, golden patch test coverage, and a model error taxonomy for Claude Sonnet 4 and GPT-4o.

  • helper_code/run_mini_swe_pro_modal.py: New driver script that orchestrates parallel agent runs inside Modal sandboxes, collecting predictions, trajectories, usage stats, and metadata.
  • Analysis docs (error_analysis/model_error_taxonomy.md, docs/golden_patch_test_coverage_analysis.md, SWE_BENCH_PRO_PUBLIC_TASK_ANALYSIS.md, traj/paper_run_results.md): Summarize benchmark task composition and per-model failure modes from LLM-as-judge classification.

Confidence Score: 4/5

Safe to merge with one operational fix recommended: the hardcoded personal DockerHub username should be addressed before the script is run by other team members.

The Python orchestration script has a hardcoded personal DockerHub username ("jefzda") as the CLI default. Anyone running the script without explicitly passing --dockerhub-username will silently attempt to pull images from that personal account and fail. The analysis documents and run artifact files are additive and carry no logic risk.

helper_code/run_mini_swe_pro_modal.py — specifically the --dockerhub-username default in parse_args.

Important Files Changed

| Filename | Overview |
|---|---|
| helper_code/run_mini_swe_pro_modal.py | New orchestration script for Modal-backed agent runs; contains a hardcoded personal DockerHub username as the CLI default and several inline magic numbers for sandbox resource limits. |
| error_analysis/model_error_taxonomy.md | New analysis doc summarizing LLM-judge error taxonomy for Claude Sonnet 4 and GPT-4o; documentation only, no logic concerns. |
| docs/golden_patch_test_coverage_analysis.md | New doc analysing golden patch size and test coverage distribution across 731 tasks; documentation only. |
| SWE_BENCH_PRO_PUBLIC_TASK_ANALYSIS.md | New doc breaking down the 731 public tasks by repo, language, and heuristic task type; documentation only. |
| traj/paper_run_results.md | New summary of paper-run solve rates for five models; documentation only. |

Sequence Diagram

sequenceDiagram
    participant CLI as main()
    participant TP as ThreadPoolExecutor
    participant RA as run_attempt()
    participant MM as get_model()
    participant MS as ModalSandboxEnvironment
    participant DA as DefaultAgent
    participant FS as Filesystem

    CLI->>TP: submit tasks × models
    TP->>RA: run_attempt(task, model_alias)
    RA->>MM: get_model(config)
    MM-->>RA: model
    RA->>MS: ModalSandboxEnvironment(image_uri)
    MS-->>RA: env (Modal sandbox)
    RA->>DA: DefaultAgent(model, env, ...)
    DA-->>RA: agent
    RA->>DA: agent.run(problem_statement)
    DA->>MS: execute(command)
    MS-->>DA: "{output, returncode}"
    DA-->>RA: exit_status, result
    RA->>FS: write pred / traj / usage / metadata
    RA->>MS: env.cleanup()
    RA-->>TP: metadata row
    TP-->>CLI: all rows
    CLI->>FS: write_patches_json + write_status_csv
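The fan-out shown in the diagram can be sketched roughly as follows. This is a minimal illustration rather than the script's actual code: `run_attempt` is a stub standing in for the real function, and `run_all`, `max_workers`, and the row fields are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from itertools import product


def run_attempt(task: dict, model_alias: str) -> dict:
    """Stub standing in for the real run_attempt (hypothetical body)."""
    # The real function would build the model, sandbox, and agent, then run it.
    return {"model": model_alias, "instance_id": task["instance_id"], "status": "ok"}


def run_all(tasks: list[dict], model_aliases: list[str], max_workers: int = 4) -> list[dict]:
    """Submit one attempt per (task, model) pair and collect the metadata rows."""
    rows = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(run_attempt, task, alias): (task["instance_id"], alias)
            for task, alias in product(tasks, model_aliases)
        }
        for future in as_completed(futures):
            instance_id, alias = futures[future]
            try:
                rows.append(future.result())
            except Exception as exc:  # one failed attempt should not lose the rest
                rows.append({"model": alias, "instance_id": instance_id,
                             "status": "error", "error": repr(exc)})
    return rows
```

Rows arrive in completion order, so anything that needs a stable order (for example a status CSV) would have to sort them first.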


Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
helper_code/run_mini_swe_pro_modal.py:395
The default value `"jefzda"` is a personal DockerHub username. Any caller who omits `--dockerhub-username` will silently attempt to pull images from that personal account, which will fail with an image-not-found error for anyone outside the original author's setup. Consider requiring the flag explicitly (no default) or sourcing it from an environment variable so the failure is obvious rather than silently misconfigured.

```suggestion
    parser.add_argument("--dockerhub-username", default=os.environ.get("DOCKERHUB_USERNAME", "jefzda"))
```
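The suggestion above still falls back to the personal account when the environment variable is unset. One possible variant, sketched here under the assumption that `parse_args` is a plain `argparse` wrapper (its real signature is not shown in this PR page), makes the flag required whenever `DOCKERHUB_USERNAME` is absent so the failure is explicit:

```python
import argparse
import os


def parse_args(argv=None):
    """Hypothetical parse_args: no personal default, env var or flag required."""
    parser = argparse.ArgumentParser()
    env_default = os.environ.get("DOCKERHUB_USERNAME")
    parser.add_argument(
        "--dockerhub-username",
        default=env_default,
        # Only demand the flag when the environment does not supply a value.
        required=env_default is None,
        help="DockerHub account hosting the task images (or set DOCKERHUB_USERNAME)",
    )
    return parser.parse_args(argv)
```

With neither the flag nor the variable set, `argparse` exits with a usage error instead of attempting a pull from an account the caller may not have access to.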

### Issue 2 of 2
helper_code/run_mini_swe_pro_modal.py:149-156
The sandbox CPU and memory bounds are inline magic numbers. Per project conventions, numeric resource constants should be stored as named variables so their intent is clear and they are easy to tune without hunting through the implementation body. The same applies to `step_limit=250` and `cost_limit=3.0` in `run_attempt`.

```suggestion
        _CPU_MIN, _CPU_MAX = 1, 4
        _MEM_MIN_MB, _MEM_MAX_MB = 5 * 1024, 30 * 1024
        self.sandbox = modal.Sandbox.create(
            image=image,
            app=app,
            timeout=self.config.sandbox_timeout,
            cpu=(_CPU_MIN, _CPU_MAX),
            memory=(_MEM_MIN_MB, _MEM_MAX_MB),
            block_network=self.config.block_network,
        )
```
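An alternative that also covers `step_limit` and `cost_limit` is to group the limits in one place. The sketch below is illustrative only: `RunLimits` and `sandbox_kwargs` are hypothetical names, and the values are the ones quoted in this issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunLimits:
    """Hypothetical container for the tunable limits named in the review."""
    cpu_min: int = 1
    cpu_max: int = 4
    mem_min_mb: int = 5 * 1024
    mem_max_mb: int = 30 * 1024
    step_limit: int = 250
    cost_limit: float = 3.0

    def sandbox_kwargs(self) -> dict:
        """Resource bounds as (min, max) tuples, mirroring the cpu/memory arguments above."""
        return {
            "cpu": (self.cpu_min, self.cpu_max),
            "memory": (self.mem_min_mb, self.mem_max_mb),
        }
```

A caller could then override a single bound, for example `RunLimits(cpu_max=8)`, without touching the sandbox creation code.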

Reviews (3): Last reviewed commit: "Replace Haiku API-error SWE-bench Pro ar..."

Context used:

  • Rule used - Store magic numbers as class or instance variables... (source)

Learned From
scaleapi/scaleapi#126388

Comment on lines +368 to +380

def write_status_csv(output_dir: Path, rows: list[dict[str, Any]]):
    path = output_dir / "summary" / "generation_status.csv"
    path.parent.mkdir(parents=True, exist_ok=True)
    fieldnames = [
        "model",
        "instance_id",
        "status",
        "exit_status",
        "wall_seconds",
        "pred_path",
        "traj_path",
        "usage_path",


P1 save_traj called with potentially None agent

When get_model or ModalSandboxEnvironment raises an exception before DefaultAgent is instantiated, agent remains None. The code then calls save_traj(agent, traj_path, ...) unconditionally, which will raise an AttributeError inside save_traj and lose the error metadata entirely. The usage_from_messages call on the following line correctly guards with agent is not None, but save_traj does not receive the same treatment.
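
A minimal sketch of one way to guard the call, assuming `save_traj` reads attributes off the agent. The `save_traj` body below is a stand-in, and `save_traj_or_error` is a hypothetical helper, not code from the PR:

```python
import json
from pathlib import Path
from typing import Any, Optional


def save_traj(agent: Any, path: Path, **extra: Any) -> None:
    """Stand-in for the real save_traj; assumed to read attributes off the agent."""
    path.write_text(json.dumps({"messages": agent.messages, **extra}))


def save_traj_or_error(agent: Optional[Any], path: Path, **extra: Any) -> None:
    """Write the trajectory if an agent exists, otherwise record the metadata alone."""
    if agent is not None:
        save_traj(agent, path, **extra)
    else:
        # Setup failed before the agent was built: keep the error details on disk.
        path.write_text(json.dumps({"messages": [], **extra}))
```

This keeps `exit_status` and `error_info` in the trajectory file even when `get_model` or the sandbox constructor raises.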


Comment on lines +282 to +305
    }

    started = time.time()
    model_config = MODEL_SPECS[model_alias].copy()
    if model_alias == "kimi-k2.5":
        model_config["model_name"] = "openai/" + os.getenv("MODEL_NAME", "kimi-k2.5")
        model_config.setdefault("model_kwargs", {})["api_base"] = os.getenv("MODEL_API_URL", "")

    image_uri = get_dockerhub_image_uri(instance_id, dockerhub_username, task.get("repo", ""))
    agent = None
    env = None
    exit_status = "Unknown"
    result = ""
    error_info = None

    try:
        model = get_model(config=model_config)
        env = ModalSandboxEnvironment(image=image_uri, timeout=command_timeout)
        agent = DefaultAgent(
            model,
            env,
            system_template=DEFAULT_SYSTEM_TEMPLATE,
            instance_template=DEFAULT_INSTANCE_TEMPLATE,
            action_observation_template=DEFAULT_ACTION_OBSERVATION_TEMPLATE,


P1 Anthropic token double-counting in usage_from_messages

The generic for key in list(usage) loop accumulates any key from item that matches a key in the usage dict, including prompt_tokens and completion_tokens. The subsequent Anthropic-specific block then also adds item["input_tokens"] to prompt_tokens and item["output_tokens"] to completion_tokens. If the response object contains both OpenAI-style (prompt_tokens) and Anthropic-style (input_tokens) keys simultaneously, which is possible via some proxy layers, the prompt and completion token counts are added twice. Additionally, api_calls and instance_cost are also in the usage dict and would be incorrectly accumulated if item happens to contain those keys.
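
One way to avoid the double count is to treat the two naming styles as aliases and take only the first one present. This is a sketch under assumptions: `TOKEN_ALIASES` and `add_token_usage` are hypothetical names, and the real `usage_from_messages` may be structured differently.

```python
from typing import Any

# Normalized counter name -> (OpenAI-style key, Anthropic-style key)
TOKEN_ALIASES = {
    "prompt_tokens": ("prompt_tokens", "input_tokens"),
    "completion_tokens": ("completion_tokens", "output_tokens"),
}


def add_token_usage(usage: dict[str, float], item: dict[str, Any]) -> None:
    """Add one response's token counts to `usage`, counting each side once."""
    for target, aliases in TOKEN_ALIASES.items():
        for key in aliases:
            value = item.get(key)
            if isinstance(value, (int, float)):
                usage[target] = usage.get(target, 0) + value
                break  # first matching style wins; never add both
```

Because only the two token counters are read from `item`, stray `api_calls` or `instance_cost` keys in a response are ignored rather than accumulated.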


chirag9127 force-pushed the swe-bench-pro-model-runs branch from 920d91b to a759cab on June 8, 2026 13:40