fix(swe-bench-pro): make the default rk run work out of the box #26
Merged
…orkarounds

A default swe-bench-pro run (`rk run examples/specs/swe-bench-pro-spacedock-codex.yaml`, default `--materialize bind`) hit two blockers before the solver ever ran:

1. Docker build failed: `failed to read dockerfile: no such file or directory`. In bind/link mode the materializer symlinked the whole task tree, including `environment/Dockerfile`. `docker compose build` uses the view's `environment/` as its build context, and BuildKit cannot read a Dockerfile that symlinks outside the context, so every build-from-source benchmark broke under the default mode. Fix: always materialize the `environment/` build context as real files, even in link mode (mirrors the existing task.toml carve-out); bulk task files still symlink, preserving bind's no-eager-duplication benefit.

2. Agent setup failed: `codex runtime adapter cannot honor ... 'max_turns'`. The example spec set `max_turns: 400`, but the codex runtime only accepts the default (200); any other value raises (intentional: claude honors max_turns, codex does not). Fix: set the example spec to 200 with a comment, and make the rejection message actionable (point users at the default + timeout budgeting).

Leakage hardening (from codex review): copying the build context follows symlinks (shutil.copy2), and the name-based deny filter only sees a link's own path, so a disguised symlink (`environment/leak.patch -> ../gold_patch.diff`) could embed a denied answer artifact's bytes under an allowed view path. The materializer now resolves any source symlink and re-applies the deny check + source containment, dropping denied/out-of-tree targets in both copy and link modes.

Verified end-to-end: a default-settings N=1 smoke now builds, solves, and scores (reward 1.0 on the ansible task). Unit tests cover all three (link-mode build context stays real; max_turns rejection is actionable; disguised symlink to a denied target is dropped, proven to fail pre-fix).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pull request overview
This PR fixes two “out of the box” blockers for running the SWE-bench-pro Codex example with the default rk run materialization mode, by making Harbor task view materialization Docker-build friendly and making Codex’s max_turns rejection message actionable.
Changes:
- Materialize the `environment/` Docker build context as real files even in `view_mode="link"`, and harden leakage filtering by re-checking resolved symlink targets (deny-globs + source-tree containment).
- Improve Codex runtime adapter errors for unsupported `harbor_agent_kwargs.max_turns` with an actionable hint (keep `200`, use timeouts for wall-clock budget).
- Add unit tests covering link-mode build-context materialization, Codex actionable error messaging, and disguised-symlink leakage prevention; update the SWE-bench-pro Codex example spec to use `max_turns: 200` with guidance.
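
To make the first change concrete, here is a minimal sketch of a link-mode materializer that carves out `environment/` and `task.toml` as real files. It is not the code in `materialize.py`: `materialize_view`, `must_be_real`, and the two constants are hypothetical names, and deny filtering is deliberately left out.

```python
import os
import shutil
from pathlib import Path

# Hypothetical constants: paths that must be real files in the view,
# even when everything else is symlinked.
REAL_FILE_PREFIXES = ("environment/",)
REAL_FILE_NAMES = ("task.toml",)


def must_be_real(rel_path: str) -> bool:
    """True if this view path has to be a real file rather than a symlink."""
    return rel_path in REAL_FILE_NAMES or rel_path.startswith(REAL_FILE_PREFIXES)


def materialize_view(source: Path, view: Path, view_mode: str = "link") -> None:
    """Build `view` from `source`, copying or symlinking each file."""
    for src in sorted(source.rglob("*")):
        if not src.is_file():
            continue  # directories are created on demand below
        rel = src.relative_to(source).as_posix()
        dst = view / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        if view_mode == "copy" or must_be_real(rel):
            # Real bytes inside the view, so a Docker build context
            # never contains a link pointing outside itself.
            shutil.copy2(src, dst)
        else:
            # Bulk task files stay as symlinks: no eager duplication.
            os.symlink(src.resolve(), dst)
```

Note that `shutil.copy2` follows symlinks by default, which is exactly why the copy path also needs the resolved-target check mentioned in the same bullet.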
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `src/razorback/harbor_tasks/materialize.py` | Forces `environment/` to be copied as real files in link mode; prevents symlink-target smuggling of denied/out-of-tree content. |
| `src/razorback/agents/_runtime/codex.py` | Adds a targeted, actionable hint when rejecting non-default `max_turns` for Codex. |
| `tests/unit/test_harbor_task_view_materializer.py` | Adds regression tests for link-mode Docker build context behavior and symlink leakage hardening. |
| `tests/unit/test_runtime_adapters.py` | Adds a test ensuring the Codex `max_turns` rejection message contains actionable guidance. |
| `examples/specs/swe-bench-pro-spacedock-codex.yaml` | Updates the example to `max_turns: 200` and documents using timeouts for wall-clock budget. |
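
For the `codex.py` row, an actionable rejection might look roughly like the sketch below. The function name, the exception class, and the hint wording are assumptions made for illustration; only the default of 200 and the leading error text come from this PR.

```python
# Default stated in the PR; the codex runtime accepts no other value.
CODEX_DEFAULT_MAX_TURNS = 200


class UnsupportedAgentKwargError(ValueError):
    """Hypothetical error type for kwargs a runtime adapter cannot honor."""


def check_codex_max_turns(harbor_agent_kwargs: dict) -> None:
    """Reject a non-default max_turns and tell the user what to do instead."""
    value = harbor_agent_kwargs.get("max_turns", CODEX_DEFAULT_MAX_TURNS)
    if value != CODEX_DEFAULT_MAX_TURNS:
        raise UnsupportedAgentKwargError(
            "codex runtime adapter cannot honor unsupported "
            f"harbor_agent_kwargs field 'max_turns' (got {value!r}). "
            # Hint wording is illustrative, not the adapter's actual message.
            f"Hint: keep max_turns at the default ({CODEX_DEFAULT_MAX_TURNS}) "
            "and use override_timeout_sec / max_timeout_sec to budget "
            "wall-clock time."
        )
```

The message names both the accepted value and the alternative knob, so a user who hits it can fix the spec without reading the adapter source.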
Problem
A default swe-bench-pro run (`rk run examples/specs/swe-bench-pro-spacedock-codex.yaml` with the default `--materialize bind`) hit two blockers before the solver ever ran, each requiring a per-spec/per-invocation workaround to get past:
- `failed to read dockerfile: open Dockerfile: no such file or directory`
- `codex runtime adapter cannot honor unsupported harbor_agent_kwargs field 'max_turns'`

Root causes & fixes
1. Symlinked Docker build context. In bind/link mode the materializer symlinked the whole task tree, including `environment/Dockerfile`. `docker compose build` uses the view's `environment/` as its build context, and BuildKit cannot read a Dockerfile that symlinks outside the context, so every build-from-source benchmark broke under the default mode. Fix: always materialize the `environment/` build context as real files, even in link mode (mirrors the existing `task.toml` carve-out). Bulk task files still symlink, preserving bind's no-eager-duplication benefit.
2. `max_turns` on codex. The example spec set `max_turns: 400`, but the codex runtime only accepts the default (200); any other value raises (intentional: claude honors `max_turns`, codex does not). Fix: set the example spec to `200` with an explanatory comment, and make the rejection message actionable (point users at the default + `override_timeout_sec`/`max_timeout_sec` for wall-clock budget).
3. Leakage hardening (from a codex adversarial review of this diff). Copying the build context follows symlinks (`shutil.copy2`), and the name-based deny filter only sees a link's own path, so a disguised symlink (`environment/leak.patch -> ../gold_patch.diff`) could embed a denied answer artifact's bytes under an allowed view path. The materializer now resolves any source symlink and re-applies the deny check + source-containment, dropping denied/out-of-tree targets in both copy and link modes.

Verification
- End-to-end: a default-settings N=1 smoke now builds, solves, and scores (reward 1.0) on the `ansible` task (no `--materialize copy`, no spec edits).
- Unit tests cover all three (link-mode build context stays real; `max_turns` rejection is actionable; disguised symlink to a denied target is dropped, proven to fail pre-fix). 104 passing across the materialize / leakage / translate / runtime-adapter / swe-bench-pro surfaces, including the integration explain test.

🤖 Generated with Claude Code
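
The leakage hardening in item 3 can be illustrated with a small sketch that checks a path twice: once under its own name and once under its resolved target. `DENY_GLOBS`, `is_denied`, and `is_allowed` are hypothetical names and patterns, not the materializer's real deny list or API.

```python
import fnmatch
from pathlib import Path

# Hypothetical deny patterns; the real list lives in the materializer.
DENY_GLOBS = ("gold_patch.diff", "solution/*")


def is_denied(rel_path: str) -> bool:
    """Match the deny globs against the full relative path and the basename."""
    name = rel_path.rsplit("/", 1)[-1]
    return any(
        fnmatch.fnmatch(rel_path, glob) or fnmatch.fnmatch(name, glob)
        for glob in DENY_GLOBS
    )


def is_allowed(src: Path, source_root: Path) -> bool:
    """Allow `src` only if both its own path and its resolved target pass."""
    root = source_root.resolve()
    own_rel = src.relative_to(source_root).as_posix()
    if is_denied(own_rel):
        return False  # the name-based filter, as before the fix
    try:
        # Follow symlinks, then require the target to stay inside the tree.
        target_rel = src.resolve().relative_to(root).as_posix()
    except ValueError:
        return False  # target is outside the source tree
    return not is_denied(target_rel)  # re-apply the deny check to the target
```

With this check, `environment/leak.patch -> ../gold_patch.diff` passes the name test but fails on its resolved target, so it would be dropped in both copy and link modes.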