Add SWE-bench Pro model run analyses by chirag9127 · Pull Request #106 · scaleapi/SWE-bench_Pro-os

chirag9127 · 2026-06-08T13:36:06Z

Summary

- Add mini SWE-agent SWE-bench Pro pilot artifacts
- Add model error taxonomy analysis for Claude Sonnet 4 and GPT-4o
- Add public task analysis artifacts

Notes

is not included
Error taxonomy markdown is in

Greptile Summary

This PR adds mini SWE-agent pilot run artifacts for 15 benchmark tasks across three models (claude-haiku-4-5, claude-opus-4-8, kimi-k2.5), along with analysis documents covering public task distribution, golden patch test coverage, and a model error taxonomy for Claude Sonnet 4 and GPT-4o.

- helper_code/run_mini_swe_pro_modal.py: New driver script that orchestrates parallel agent runs inside Modal sandboxes, collecting predictions, trajectories, usage stats, and metadata.
- Analysis docs (error_analysis/model_error_taxonomy.md, docs/golden_patch_test_coverage_analysis.md, SWE_BENCH_PRO_PUBLIC_TASK_ANALYSIS.md, traj/paper_run_results.md): Summarize benchmark task composition and per-model failure modes from LLM-as-judge classification.

Confidence Score: 4/5

Safe to merge with one operational fix recommended: the hardcoded personal DockerHub username should be addressed before the script is run by other team members.

The Python orchestration script has a hardcoded personal DockerHub username ("jefzda") as the CLI default. Anyone running the script without explicitly passing --dockerhub-username will silently attempt to pull images from that personal account and fail. The analysis documents and run artifact files are additive and carry no logic risk.

helper_code/run_mini_swe_pro_modal.py — specifically the --dockerhub-username default in parse_args.

Important Files Changed

| Filename | Overview |
| --- | --- |
| helper_code/run_mini_swe_pro_modal.py | New orchestration script for Modal-backed agent runs; contains a hardcoded personal DockerHub username as the CLI default and several inline magic numbers for sandbox resource limits. |
| error_analysis/model_error_taxonomy.md | New analysis doc summarizing LLM-judge error taxonomy for Claude Sonnet 4 and GPT-4o; documentation only, no logic concerns. |
| docs/golden_patch_test_coverage_analysis.md | New doc analysing golden patch size and test coverage distribution across 731 tasks; documentation only. |
| SWE_BENCH_PRO_PUBLIC_TASK_ANALYSIS.md | New doc breaking down the 731 public tasks by repo, language, and heuristic task type; documentation only. |
| traj/paper_run_results.md | New summary of paper-run solve rates for five models; documentation only. |

Sequence Diagram

sequenceDiagram
    participant CLI as main()
    participant TP as ThreadPoolExecutor
    participant RA as run_attempt()
    participant MM as get_model()
    participant MS as ModalSandboxEnvironment
    participant DA as DefaultAgent
    participant FS as Filesystem

    CLI->>TP: submit tasks × models
    TP->>RA: run_attempt(task, model_alias)
    RA->>MM: get_model(config)
    MM-->>RA: model
    RA->>MS: ModalSandboxEnvironment(image_uri)
    MS-->>RA: env (Modal sandbox)
    RA->>DA: DefaultAgent(model, env, ...)
    DA-->>RA: agent
    RA->>DA: agent.run(problem_statement)
    DA->>MS: execute(command)
    MS-->>DA: "{output, returncode}"
    DA-->>RA: exit_status, result
    RA->>FS: write pred / traj / usage / metadata
    RA->>MS: env.cleanup()
    RA-->>TP: metadata row
    TP-->>CLI: all rows
    CLI->>FS: write_patches_json + write_status_csv
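
The fan-out at the top of the diagram (tasks × models submitted to a thread pool, one metadata row returned per attempt) can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: `run_all` and `max_workers` are hypothetical names, and `run_attempt` is passed in with the simplified two-argument shape shown in the diagram.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from itertools import product


def run_all(tasks, model_aliases, run_attempt, max_workers=4):
    """Hypothetical driver: run every (task, model) pair and collect the rows."""
    rows = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One future per (task, model_alias) combination.
        futures = [
            pool.submit(run_attempt, task, alias)
            for task, alias in product(tasks, model_aliases)
        ]
        # Rows arrive in completion order, not submission order.
        for future in as_completed(futures):
            rows.append(future.result())
    return rows
```

Because rows are collected in completion order, anything written afterwards (such as the status CSV) may want to sort them by model and instance first.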

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
helper_code/run_mini_swe_pro_modal.py:395
The default value `"jefzda"` is a personal DockerHub username. Any caller who omits `--dockerhub-username` will silently attempt to pull images from that personal account, which will fail with an image-not-found error for anyone outside the original author's setup. Consider requiring the flag explicitly (no default) or sourcing it from an environment variable so the failure is obvious rather than silently misconfigured.

```suggestion
    parser.add_argument("--dockerhub-username", default=os.environ.get("DOCKERHUB_USERNAME", "jefzda"))
```
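
A stricter variant, sketched under the assumption that the script builds its CLI with `argparse` (as the `parser.add_argument` call implies): drop the personal fallback so that a missing value fails loudly instead of pulling from someone else's account. The `DOCKERHUB_USERNAME` variable name comes from the suggestion above; `build_parser` and its `env` parameter are hypothetical and exist only to keep the example self-contained.

```python
import argparse
import os


def build_parser(env=None):
    """Hypothetical helper: --dockerhub-username has no personal fallback."""
    env = os.environ if env is None else env
    # Treat an unset or empty variable as "no default available".
    default = env.get("DOCKERHUB_USERNAME") or None
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--dockerhub-username",
        default=default,
        # The flag is only mandatory when the environment supplies nothing.
        required=default is None,
        help="DockerHub account that hosts the task images",
    )
    return parser
```

With this shape, running without the flag and without the environment variable exits with an argparse usage error rather than a later image-not-found failure.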

### Issue 2 of 2
helper_code/run_mini_swe_pro_modal.py:149-156
The sandbox CPU and memory bounds are inline magic numbers. Per project conventions, numeric resource constants should be stored as named variables so their intent is clear and they are easy to tune without hunting through the implementation body. The same applies to `step_limit=250` and `cost_limit=3.0` in `run_attempt`.

```suggestion
        _CPU_MIN, _CPU_MAX = 1, 4
        _MEM_MIN_MB, _MEM_MAX_MB = 5 * 1024, 30 * 1024
        self.sandbox = modal.Sandbox.create(
            image=image,
            app=app,
            timeout=self.config.sandbox_timeout,
            cpu=(_CPU_MIN, _CPU_MAX),
            memory=(_MEM_MIN_MB, _MEM_MAX_MB),
            block_network=self.config.block_network,
        )
```
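
Since the cited rule asks for class or instance variables rather than locals, one possible arrangement is to group the limits in small frozen dataclasses. This is only a sketch: `SandboxLimits` and `AgentLimits` are hypothetical names, and the numeric values are copied from the suggestion and comment above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SandboxLimits:
    """Hypothetical container for the sandbox resource bounds."""
    cpu_min: int = 1
    cpu_max: int = 4
    mem_min_mb: int = 5 * 1024
    mem_max_mb: int = 30 * 1024

    @property
    def cpu(self):
        # (min, max) pair in the shape used by the suggestion above.
        return (self.cpu_min, self.cpu_max)

    @property
    def memory(self):
        return (self.mem_min_mb, self.mem_max_mb)


@dataclass(frozen=True)
class AgentLimits:
    """Hypothetical container for the per-attempt agent limits."""
    step_limit: int = 250
    cost_limit: float = 3.0
```

A caller could then pass `limits.cpu` and `limits.memory` where the tuples appear today, and tune the values in one place.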

Reviews (3): Last reviewed commit: "Replace Haiku API-error SWE-bench Pro ar..."

Context used:

Rule used - Store magic numbers as class or instance variables... (source)

Learned From
scaleapi/scaleapi#126388

greptile-apps · 2026-06-08T13:39:18Z

+
+def write_status_csv(output_dir: Path, rows: list[dict[str, Any]]):
+    path = output_dir / "summary" / "generation_status.csv"
+    path.parent.mkdir(parents=True, exist_ok=True)
+    fieldnames = [
+        "model",
+        "instance_id",
+        "status",
+        "exit_status",
+        "wall_seconds",
+        "pred_path",
+        "traj_path",
+        "usage_path",


save_traj called with potentially None agent

When get_model or ModalSandboxEnvironment raises an exception before DefaultAgent is instantiated, agent remains None. The code then calls save_traj(agent, traj_path, ...) unconditionally, which will raise an AttributeError inside save_traj and lose the error metadata entirely. The usage_from_messages call on the following line correctly guards with agent is not None, but save_traj does not receive the same treatment.
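
One possible shape for the guard, as a hedged sketch: call `save_traj` only when an agent exists, and otherwise write a small stub so the error metadata still reaches disk. `save_traj_or_stub` is a hypothetical wrapper, the stub's field names are illustrative, and `save_traj` is passed in as a callable because its real signature is not shown in this excerpt.

```python
import json
from pathlib import Path


def save_traj_or_stub(agent, traj_path, save_traj, *, exit_status, result, error_info):
    """Hypothetical wrapper: never hand a None agent to save_traj."""
    traj_path = Path(traj_path)
    traj_path.parent.mkdir(parents=True, exist_ok=True)
    if agent is not None:
        # Normal path: the agent was constructed, so delegate as before.
        save_traj(agent, traj_path, exit_status=exit_status, result=result)
        return
    # Setup failed before the agent existed: keep the error details on disk.
    stub = {
        "exit_status": exit_status,
        "result": result,
        "error": error_info,
        "messages": [],
    }
    traj_path.write_text(json.dumps(stub, indent=2))
```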

Prompt To Fix With AI

This is a comment left during a code review.
Path: helper_code/run_mini_swe_pro_modal.py
Line: 368-380

Comment:
**`save_traj` called with potentially `None` agent**

When `get_model` or `ModalSandboxEnvironment` raises an exception before `DefaultAgent` is instantiated, `agent` remains `None`. The code then calls `save_traj(agent, traj_path, ...)` unconditionally, which will raise an `AttributeError` inside `save_traj` and lose the error metadata entirely. The `usage_from_messages` call on the following line correctly guards with `agent is not None`, but `save_traj` does not receive the same treatment.

How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-06-08T13:39:19Z

+        }
+
+    started = time.time()
+    model_config = MODEL_SPECS[model_alias].copy()
+    if model_alias == "kimi-k2.5":
+        model_config["model_name"] = "openai/" + os.getenv("MODEL_NAME", "kimi-k2.5")
+        model_config.setdefault("model_kwargs", {})["api_base"] = os.getenv("MODEL_API_URL", "")
+
+    image_uri = get_dockerhub_image_uri(instance_id, dockerhub_username, task.get("repo", ""))
+    agent = None
+    env = None
+    exit_status = "Unknown"
+    result = ""
+    error_info = None
+
+    try:
+        model = get_model(config=model_config)
+        env = ModalSandboxEnvironment(image=image_uri, timeout=command_timeout)
+        agent = DefaultAgent(
+            model,
+            env,
+            system_template=DEFAULT_SYSTEM_TEMPLATE,
+            instance_template=DEFAULT_INSTANCE_TEMPLATE,
+            action_observation_template=DEFAULT_ACTION_OBSERVATION_TEMPLATE,


Anthropic token double-counting in usage_from_messages

The generic for key in list(usage) loop accumulates any key from item that matches a key in the usage dict — including prompt_tokens and completion_tokens. The subsequent Anthropic-specific block then also adds item["input_tokens"] → prompt_tokens and item["output_tokens"] → completion_tokens. If the response object contains both OpenAI-style (prompt_tokens) and Anthropic-style (input_tokens) keys simultaneously — which is possible via some proxy layers — the prompt and completion token counts are added twice. Additionally, api_calls and instance_cost are also in the usage dict and would be incorrectly accumulated if item happens to contain those keys.
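
A minimal sketch of one way to avoid the double count, assuming each `item` is a plain dict of usage fields: map each total to an ordered list of aliases, take the first alias present, and touch nothing else. `TOKEN_KEY_ALIASES` and `add_token_usage` are hypothetical names; the real `usage_from_messages` body is not shown in this excerpt.

```python
# Each total accepts the OpenAI-style key first, then the Anthropic-style key.
TOKEN_KEY_ALIASES = {
    "prompt_tokens": ("prompt_tokens", "input_tokens"),
    "completion_tokens": ("completion_tokens", "output_tokens"),
}


def add_token_usage(totals, item):
    """Hypothetical helper: add one response's tokens, counting each field once."""
    for target, aliases in TOKEN_KEY_ALIASES.items():
        for key in aliases:
            value = item.get(key)
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                totals[target] = totals.get(target, 0) + value
                break  # first matching alias wins; the other style is ignored
    # Keys such as api_calls and instance_cost are deliberately left alone here.
    return totals
```

Restricting the loop to the two token fields also sidesteps the second concern in the comment, since `api_calls` and `instance_cost` are never read from `item`.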

Prompt To Fix With AI

This is a comment left during a code review.
Path: helper_code/run_mini_swe_pro_modal.py
Line: 282-305

Comment:
**Anthropic token double-counting in `usage_from_messages`**

The generic `for key in list(usage)` loop accumulates any key from `item` that matches a key in the `usage` dict, including `prompt_tokens` and `completion_tokens`. The subsequent Anthropic-specific block then also adds `item["input_tokens"]` → `prompt_tokens` and `item["output_tokens"]` → `completion_tokens`. If the response object contains both OpenAI-style (`prompt_tokens`) and Anthropic-style (`input_tokens`) keys simultaneously (which is possible via some proxy layers), the prompt and completion token counts are added twice. Additionally, `api_calls` and `instance_cost` are also in the `usage` dict and would be incorrectly accumulated if `item` happens to contain those keys.

How can I resolve this? If you propose a fix, please make it concise.

chirag9127 added 5 commits June 8, 2026 06:15

Add mini SWE-agent SWE-bench Pro pilot artifacts

6ff2a00

Add model error taxonomy analysis

c391cb5

Add SWE-bench Pro public task analysis

712c7bb

Add golden patch coverage analysis

a759cab

Add SWE-bench Pro paper run results analysis

920d91b

greptile-apps Bot reviewed Jun 8, 2026

chirag9127 force-pushed the swe-bench-pro-model-runs branch from 920d91b to a759cab on June 8, 2026 13:40

chirag9127 and others added 2 commits June 8, 2026 06:41

Merge pull request #2 from chirag9127/traj-paper-run-results-analysis

74b3fcc

Add paper run results analysis

Replace Haiku API-error SWE-bench Pro artifacts

16437c7

chirag9127 wants to merge 7 commits into scaleapi:main from chirag9127:swe-bench-pro-model-runs
