Commit 13ac345

DvirDukhan and Copilot committed
bench: force tool usage in per-config instance template
Smoke #2 confirmed that even with cg/lsp shims on PATH, indexed repos, REPO_NAME set, and explicit "use cg/lsp first" framing in the system preamble, Claude Opus 4.5 ignored the differentiating tools and fell straight back to grep/sed/cat. The 3-way comparison was real but uninformative: tool choice was identical across configs.

This commit adds two new instance templates (INSTANCE_TEMPLATE_LSP and INSTANCE_TEMPLATE_CODE_GRAPH) that embed a 'Required workflow.' block directly in the task description, the first thing the model sees each turn. Selection via load_instance_template(config); baseline keeps the original template.

Smoke #3 result: lsp track now invokes 'lsp' 3x, code_graph track invokes 'cg' 5x (including cg auto-complete returning the exact buggy function with line numbers + docstring). The structured-navigation tools are finally exercised, so token deltas measured against baseline are now meaningful signal rather than noise.

n=1 finding: both lsp (+128%) and code_graph (+85%) use MORE tokens than baseline on this instance. Bigger preambles + verbose JSON tool replies + occasional retries (cg find-symbol exact-match bug) outweigh any savings. Headline run should scale n or pivot to a function-calling harness.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
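The two measurements quoted above (how often a track invokes its tool, and the token delta against baseline) can be sketched as follows. This is a minimal illustration, not the harness's own accounting: `count_tool_calls` and `percent_delta` are hypothetical helpers, the command list is made up, and the token counts are invented round numbers chosen only to show the arithmetic.

```python
import shlex


def count_tool_calls(commands, tool):
    """Count bash commands whose first word is `tool` (e.g. "cg" or "lsp").

    Only the leading word is inspected, so a tool called later in a
    pipeline or after `&&` is not counted by this simple sketch.
    """
    count = 0
    for cmd in commands:
        try:
            words = shlex.split(cmd)
        except ValueError:
            # Unbalanced quotes: fall back to plain whitespace splitting.
            words = cmd.split()
        if words and words[0] == tool:
            count += 1
    return count


def percent_delta(tokens, baseline_tokens):
    """Signed percentage change of `tokens` relative to `baseline_tokens`."""
    return 100.0 * (tokens - baseline_tokens) / baseline_tokens


# Hypothetical trajectory for a code_graph run (not taken from a real log).
commands = [
    'cg find-symbol --repo "$REPO_NAME" --name parse_date',
    'cg get-neighbors --repo "$REPO_NAME" --ids 42',
    "sed -n '1,40p' utils/dates.py",
    'cg note-edit --repo "$REPO_NAME" --path utils/dates.py',
]

cg_calls = count_tool_calls(commands, "cg")
# Invented token totals: 1000 for baseline, 2280 for the tool-using run.
delta = percent_delta(2280, 1000)
print(cg_calls, f"{delta:+.0f}%")
```

With n=1 a single delta like this says little about the distribution, which is why the message recommends scaling n before drawing conclusions.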
1 parent 03c7a73 commit 13ac345

1 file changed

Lines changed: 67 additions & 1 deletion

bench/runners/mini_runner.py

@@ -97,6 +97,72 @@ class Task:
 """
 
 
+# The lsp / code_graph configs use a sharper template that mandates an
+# initial tool call. Smoke #2 showed Claude reads the system preamble's
+# "use cg/lsp first" guidance and then ignores it; embedding the
+# requirement in the per-instance task description is more obtrusive.
+
+INSTANCE_TEMPLATE_LSP = """\
+You are working in the repository at {{cwd}}.
+
+The task to solve:
+
+{{task}}
+
+**Required workflow.** Before reading or editing any file, your first
+two bash commands MUST be:
+
+1. `grep -rn "<a symbol named in the task description>" --include='*.py' .`
+to locate a `file:line` for jedi to anchor on.
+2. `lsp goto-definition --file <file> --line <line> --col <col>` to
+resolve the true definition.
+
+Then use `lsp find-references` whenever you would have used a recursive
+grep, and `lsp document-symbols` whenever you would have run a textual
+outline pass. Reach for plain grep/sed/cat only after you've exhausted
+the LSP for navigation.
+
+When you believe the task is complete, finish your turn with a final
+message that contains a unified diff of your changes inside a fenced
+``` block, then exit. Do not commit; the harness reads the diff via
+`git diff`.
+"""
+
+INSTANCE_TEMPLATE_CODE_GRAPH = """\
+You are working in the repository at {{cwd}}.
+The code-graph service has already indexed this repository under the
+name `$REPO_NAME` (use the env var literally).
+
+The task to solve:
+
+{{task}}
+
+**Required workflow.** Before reading or editing any file, your first
+bash command MUST be:
+
+`cg find-symbol --repo "$REPO_NAME" --name <a symbol named in the task description>`
+
+then use `cg get-neighbors --repo "$REPO_NAME" --ids <id>` to expand
+relationships before doing any textual search. After every file edit,
+run `cg note-edit --repo "$REPO_NAME" --path <relpath>` so subsequent
+graph queries reflect your change. Reach for grep/sed/cat only for
+content reading after `cg` has located the right place.
+
+When you believe the task is complete, finish your turn with a final
+message that contains a unified diff of your changes inside a fenced
+``` block, then exit. Do not commit; the harness reads the diff via
+`git diff`.
+"""
+
+
+def load_instance_template(config: str) -> str:
+    if config == "lsp":
+        return INSTANCE_TEMPLATE_LSP
+    if config == "code_graph":
+        return INSTANCE_TEMPLATE_CODE_GRAPH
+    return INSTANCE_TEMPLATE
+
+
 def load_preamble(config: str) -> str:
     """Read the per-config system preamble; fall back to a generic stub."""
     path = TOOLS_DIR / config / "system_preamble.md"
@@ -310,7 +376,7 @@ def run_task(
         model,
         env,
         system_template=preamble,
-        instance_template=INSTANCE_TEMPLATE,
+        instance_template=load_instance_template(config),
         step_limit=step_limit,
         cost_limit=cost_limit,
         wall_time_limit_seconds=wall_time_limit_seconds,
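
A minimal sketch of how the new selection behaves end to end, under stated assumptions. The three template strings are shortened stand-ins for the real ones in the diff, `_TEMPLATES` is a hypothetical table-driven equivalent of the `if` chain, and `render` is a hypothetical stand-in for whatever the harness uses to fill the `{{cwd}}` and `{{task}}` placeholders.

```python
# Shortened stand-ins for the real templates in bench/runners/mini_runner.py.
INSTANCE_TEMPLATE = "Repository: {{cwd}}\nTask: {{task}}\n"
INSTANCE_TEMPLATE_LSP = INSTANCE_TEMPLATE + "First run `lsp goto-definition`.\n"
INSTANCE_TEMPLATE_CODE_GRAPH = INSTANCE_TEMPLATE + "First run `cg find-symbol`.\n"

# Hypothetical lookup table; behaves like the if-chain in the commit.
_TEMPLATES = {
    "lsp": INSTANCE_TEMPLATE_LSP,
    "code_graph": INSTANCE_TEMPLATE_CODE_GRAPH,
}


def load_instance_template(config: str) -> str:
    """Return the per-config template, falling back to the baseline one."""
    return _TEMPLATES.get(config, INSTANCE_TEMPLATE)


def render(template: str, **values: str) -> str:
    """Hypothetical placeholder filler: replaces each {{name}} literally."""
    out = template
    for key, value in values.items():
        out = out.replace("{{" + key + "}}", value)
    return out


prompt = render(load_instance_template("code_graph"), cwd="/repo", task="Fix the bug")
print(prompt)
```

One consequence of the fallback, in either form: a misspelled config such as `"code-graph"` silently receives the baseline template, so a run can look like a tool track while actually measuring baseline behavior. Raising on unknown names would be a stricter alternative if that risk matters.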
