Commit 13ac345
bench: force tool usage in per-config instance template
Smoke #2 confirmed that even with cg/lsp shims on PATH, indexed repos,
REPO_NAME set, and explicit "use cg/lsp first" framing in the system
preamble, Claude Opus 4.5 ignored the differentiating tools and fell
straight back to grep/sed/cat. The 3-way comparison was real but
uninformative: tool choice was identical across configs.
This commit adds two new instance templates (INSTANCE_TEMPLATE_LSP and
INSTANCE_TEMPLATE_CODE_GRAPH) that embed a 'Required workflow.' block
directly in the task description — the first thing the model sees
each turn. Selection via load_instance_template(config); baseline keeps
the original template.
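
The selection step described above could look roughly like the following. This is a minimal sketch, not the code from this commit: the template bodies are placeholders, the baseline constant name `INSTANCE_TEMPLATE` is an assumption, and the config keys `"lsp"` and `"code_graph"` are assumed from the track names used in this message.

```python
# Placeholder bodies: the real template text lives in the changed file
# and is not reproduced here.
INSTANCE_TEMPLATE = "{task}"  # assumed name for the original (baseline) template
INSTANCE_TEMPLATE_LSP = (
    "Required workflow. Use `lsp` before falling back to grep/sed/cat.\n\n{task}"
)
INSTANCE_TEMPLATE_CODE_GRAPH = (
    "Required workflow. Use `cg` before falling back to grep/sed/cat.\n\n{task}"
)

# Assumed config keys; any other value falls through to the baseline template.
_TEMPLATES_BY_CONFIG = {
    "lsp": INSTANCE_TEMPLATE_LSP,
    "code_graph": INSTANCE_TEMPLATE_CODE_GRAPH,
}


def load_instance_template(config: str) -> str:
    """Return the per-config instance template, defaulting to baseline."""
    return _TEMPLATES_BY_CONFIG.get(config, INSTANCE_TEMPLATE)


if __name__ == "__main__":
    for name in ("baseline", "lsp", "code_graph"):
        print(name, "->", load_instance_template(name).splitlines()[0])
```

Defaulting unknown configs to the baseline template keeps the baseline run byte-for-byte unchanged, which is what makes it a usable reference point.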
Smoke #3 result: lsp track now invokes 'lsp' 3x, code_graph track
invokes 'cg' 5x (including cg auto-complete returning the exact buggy
function with line numbers + docstring). The structured-navigation
tools are finally exercised, so token deltas measured against
baseline are now meaningful signal rather than noise.
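
One way to check that the tools are actually exercised is to count invocations in the shell commands a run emits. The sketch below assumes the harness can hand over each command as a string; `count_tool_calls` is a hypothetical helper, and the sample commands (other than the `cg` subcommand names mentioned in this message) are made up for illustration.

```python
import shlex
from collections import Counter


def count_tool_calls(commands, tools=("cg", "lsp")):
    """Count commands whose first token is one of the tracked tools.

    Only the leading token is inspected, so a tool called later in a
    pipeline or after `&&` is not counted by this simple version.
    """
    counts = Counter()
    for cmd in commands:
        try:
            tokens = shlex.split(cmd)
        except ValueError:
            # Unbalanced quotes etc.: skip rather than guess.
            continue
        if tokens and tokens[0] in tools:
            counts[tokens[0]] += 1
    return counts


if __name__ == "__main__":
    # Illustrative commands only, not taken from a real trajectory.
    sample = [
        "cg find-symbol parse_config",
        "grep -rn parse_config .",
        "lsp some-hypothetical-subcommand src/app.py",
        "cg auto-complete parse_",
    ]
    print(dict(count_tool_calls(sample)))
```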
n=1 finding: both lsp (+128%) and code_graph (+85%) use MORE tokens
than baseline on this instance. Bigger preambles + verbose JSON tool
replies + occasional retries (cg find-symbol exact-match bug) outweigh
any savings. Headline run should scale n or pivot to a function-calling
harness.
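
For reference, the percentages above are relative token deltas against baseline. The sketch below shows the arithmetic only; the token counts are invented so that they reproduce +128% and +85%, and are not the measured values from Smoke #3.

```python
def pct_delta(config_tokens: int, baseline_tokens: int) -> float:
    """Relative token change versus baseline, in percent."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (config_tokens - baseline_tokens) / baseline_tokens


if __name__ == "__main__":
    # Hypothetical counts chosen to match the reported percentages.
    baseline = 10_000
    runs = {"lsp": 22_800, "code_graph": 18_500}
    for name, tokens in runs.items():
        print(f"{name}: {pct_delta(tokens, baseline):+.0f}%")
```

With n=1 these deltas describe a single instance, so the same calculation would need to be averaged over many instances before it supports a headline claim.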
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent 03c7a73 commit 13ac345
1 file changed
Lines changed: 67 additions & 1 deletion
Diff content not preserved. Hunk 1: new lines 100-165 added (66 lines) after original line 99. Hunk 2: original line 313 replaced by new line 379.