Add SWE-bench Pro model run analyses #106
def write_status_csv(output_dir: Path, rows: list[dict[str, Any]]):
    path = output_dir / "summary" / "generation_status.csv"
    path.parent.mkdir(parents=True, exist_ok=True)
    fieldnames = [
        "model",
        "instance_id",
        "status",
        "exit_status",
        "wall_seconds",
        "pred_path",
        "traj_path",
        "usage_path",
**`save_traj` called with potentially `None` agent**
When `get_model` or `ModalSandboxEnvironment` raises an exception before `DefaultAgent` is instantiated, `agent` remains `None`. The code then calls `save_traj(agent, traj_path, ...)` unconditionally, which will raise an `AttributeError` inside `save_traj` and lose the error metadata entirely. The `usage_from_messages` call on the following line correctly guards with `agent is not None`, but `save_traj` does not receive the same treatment.
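One possible shape for the guard, as a minimal sketch rather than the script's actual code: `save_traj_stub` stands in for the real `save_traj` (assumed here to read attributes off `agent`), and `finalize_attempt` is a hypothetical helper name. The idea is to write a small error record when no agent was ever constructed, so the failure metadata survives.

```python
import json
from pathlib import Path
from typing import Any, Optional


def save_traj_stub(agent: Any, path: Path, exit_status: str) -> None:
    """Stand-in for the real save_traj; assumed to read attributes of `agent`."""
    payload = {"exit_status": exit_status, "messages": agent.messages}
    path.write_text(json.dumps(payload))


def finalize_attempt(
    agent: Optional[Any],
    traj_path: Path,
    exit_status: str,
    error_info: Optional[dict],
) -> None:
    """Persist a trajectory if an agent exists, otherwise a minimal error record."""
    traj_path.parent.mkdir(parents=True, exist_ok=True)
    if agent is not None:
        save_traj_stub(agent, traj_path, exit_status)
    else:
        # Setup failed before the agent was built: keep the error metadata
        # instead of letting an AttributeError discard it.
        record = {"exit_status": exit_status, "error": error_info, "messages": []}
        traj_path.write_text(json.dumps(record))
```

With this shape, a failure in model or sandbox setup still leaves a file at `traj_path` that downstream analysis can read.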
Path: helper_code/run_mini_swe_pro_modal.py
Line: 368-380
}

started = time.time()
model_config = MODEL_SPECS[model_alias].copy()
if model_alias == "kimi-k2.5":
    model_config["model_name"] = "openai/" + os.getenv("MODEL_NAME", "kimi-k2.5")
    model_config.setdefault("model_kwargs", {})["api_base"] = os.getenv("MODEL_API_URL", "")

image_uri = get_dockerhub_image_uri(instance_id, dockerhub_username, task.get("repo", ""))
agent = None
env = None
exit_status = "Unknown"
result = ""
error_info = None

try:
    model = get_model(config=model_config)
    env = ModalSandboxEnvironment(image=image_uri, timeout=command_timeout)
    agent = DefaultAgent(
        model,
        env,
        system_template=DEFAULT_SYSTEM_TEMPLATE,
        instance_template=DEFAULT_INSTANCE_TEMPLATE,
        action_observation_template=DEFAULT_ACTION_OBSERVATION_TEMPLATE,
**Anthropic token double-counting in `usage_from_messages`**
The generic `for key in list(usage)` loop accumulates any key from `item` that matches a key in the `usage` dict, including `prompt_tokens` and `completion_tokens`. The subsequent Anthropic-specific block then also adds `item["input_tokens"]` to `prompt_tokens` and `item["output_tokens"]` to `completion_tokens`. If the response object contains both OpenAI-style (`prompt_tokens`) and Anthropic-style (`input_tokens`) keys simultaneously, which is possible via some proxy layers, the prompt and completion token counts are added twice. Additionally, `api_calls` and `instance_cost` are also in the `usage` dict and would be incorrectly accumulated if `item` happens to contain those keys.
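A rough sketch of one way to avoid the double count, under stated assumptions: each `item` is taken to be a flat dict of usage fields for one API response, `accumulate_usage` and `TOKEN_ALIASES` are hypothetical names, and `api_calls` is assumed to mean "one per response" (cost tracking is left out). Each total takes the first alias present in an item, so mixed OpenAI-style and Anthropic-style keys count once.

```python
from typing import Any

# Each normalized total maps to the source keys that may carry it,
# in priority order. Only the first key found in an item is used.
TOKEN_ALIASES = {
    "prompt_tokens": ("prompt_tokens", "input_tokens"),
    "completion_tokens": ("completion_tokens", "output_tokens"),
}


def accumulate_usage(items: list[dict[str, Any]]) -> dict[str, int]:
    """Sum token usage across responses without counting aliases twice."""
    totals = {"prompt_tokens": 0, "completion_tokens": 0, "api_calls": 0}
    for item in items:
        # Count calls ourselves; never read api_calls from the item.
        totals["api_calls"] += 1
        for target, aliases in TOKEN_ALIASES.items():
            for alias in aliases:
                value = item.get(alias)
                if isinstance(value, (int, float)):
                    totals[target] += int(value)
                    break  # stop at the first alias so the pair is counted once
    return totals
```

Restricting the loop to an explicit allowlist of token keys also keeps bookkeeping fields such as `api_calls` from being summed out of response payloads.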
Path: helper_code/run_mini_swe_pro_modal.py
Line: 282-305
920d91b to a759cab
Add paper run results analysis
Summary
Notes
Greptile Summary
This PR adds mini SWE-agent pilot run artifacts for 15 benchmark tasks across three models (`claude-haiku-4-5`, `claude-opus-4-8`, `kimi-k2.5`), along with analysis documents covering public task distribution, golden patch test coverage, and a model error taxonomy for Claude Sonnet 4 and GPT-4o.

- `helper_code/run_mini_swe_pro_modal.py`: New driver script that orchestrates parallel agent runs inside Modal sandboxes, collecting predictions, trajectories, usage stats, and metadata.
- `error_analysis/model_error_taxonomy.md`, `docs/golden_patch_test_coverage_analysis.md`, `SWE_BENCH_PRO_PUBLIC_TASK_ANALYSIS.md`, `traj/paper_run_results.md`: Summarize benchmark task composition and per-model failure modes from LLM-as-judge classification.

Confidence Score: 4/5

Safe to merge with one operational fix recommended: the hardcoded personal DockerHub username should be addressed before the script is run by other team members.

The Python orchestration script has a hardcoded personal DockerHub username (`"jefzda"`) as the CLI default. Anyone running the script without explicitly passing `--dockerhub-username` will silently attempt to pull images from that personal account and fail. The analysis documents and run artifact files are additive and carry no logic risk.

`helper_code/run_mini_swe_pro_modal.py`: specifically the `--dockerhub-username` default in `parse_args`.
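The hardcoded `--dockerhub-username` default flagged above could be replaced along these lines. This is only a sketch: the environment variable name `DOCKERHUB_USERNAME` is an assumption, and this `parse_args` shows a single flag rather than the script's full parser. The flag becomes required unless the environment supplies a value, so a missing username fails loudly at startup.

```python
import argparse
import os
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse CLI args; the DockerHub account must be given explicitly."""
    parser = argparse.ArgumentParser()
    # Hypothetical env var name; used as a fallback instead of a personal default.
    env_default = os.getenv("DOCKERHUB_USERNAME")
    parser.add_argument(
        "--dockerhub-username",
        default=env_default,
        required=env_default is None,
        help="DockerHub account hosting the task images.",
    )
    return parser.parse_args(argv)
```

Running without the flag and without the environment variable then exits with an argparse usage error instead of attempting pulls from someone else's account.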
Sequence Diagram
sequenceDiagram
    participant CLI as main()
    participant TP as ThreadPoolExecutor
    participant RA as run_attempt()
    participant MM as get_model()
    participant MS as ModalSandboxEnvironment
    participant DA as DefaultAgent
    participant FS as Filesystem
    CLI->>TP: submit tasks × models
    TP->>RA: run_attempt(task, model_alias)
    RA->>MM: get_model(config)
    MM-->>RA: model
    RA->>MS: ModalSandboxEnvironment(image_uri)
    MS-->>RA: env (Modal sandbox)
    RA->>DA: DefaultAgent(model, env, ...)
    DA-->>RA: agent
    RA->>DA: agent.run(problem_statement)
    DA->>MS: execute(command)
    MS-->>DA: "{output, returncode}"
    DA-->>RA: exit_status, result
    RA->>FS: write pred / traj / usage / metadata
    RA->>MS: env.cleanup()
    RA-->>TP: metadata row
    TP-->>CLI: all rows
    CLI->>FS: write_patches_json + write_status_csv
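The fan-out step at the top of the sequence diagram can be sketched as follows. The real `main()` is not shown in this conversation, so `run_all` is a hypothetical name, the `run_attempt(task, model_alias)` call shape is taken from the diagram, and the error-row fields are illustrative. Each task and model pair is submitted to a thread pool, and a crashed attempt becomes an error row rather than aborting the batch.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable


def run_all(
    tasks: list[dict[str, Any]],
    model_aliases: list[str],
    run_attempt: Callable[[dict[str, Any], str], dict[str, Any]],
    max_workers: int = 4,
) -> list[dict[str, Any]]:
    """Run every (task, model) pair in parallel and collect metadata rows."""
    rows: list[dict[str, Any]] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(run_attempt, task, alias): (task, alias)
            for task in tasks
            for alias in model_aliases
        }
        for future in as_completed(futures):
            task, alias = futures[future]
            try:
                rows.append(future.result())
            except Exception as exc:
                # Record the failure so the status table stays complete.
                rows.append({
                    "model": alias,
                    "instance_id": task.get("instance_id"),
                    "status": "error",
                    "error": repr(exc),
                })
    return rows
```

Rows arrive in completion order, so a caller that needs stable output would sort them before writing the status CSV.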
Reviews (3): Last reviewed commit: "Replace Haiku API-error SWE-bench Pro ar..."
Context used:
Learned From
scaleapi/scaleapi#126388