Commit 5829f53
test(reasoning): use temperature=0 to prevent Metal flap in ReasoningBudgetTest
The two tests that assert reasoning_content is present relied on the Qwen3
model sampling into a <think> block, which is non-deterministic at the
default temperature. Metal (macOS arm64) GPU arithmetic occasionally produces
different logit distributions than CPU backends and can sample a non-thinking
first token, causing a spurious failure.
Adding withTemperature(0.0f) (greedy sampling) makes both tests deterministic
across all platforms. The llama.cpp copy-loop bug (per-request
reasoning_budget_tokens ignored) is still present in b9739; this commit
improves only test stability.
Co-Authored-By: Claude <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_017xJt9CFqWYC4tWmC13YwVv1
parent b20c2d3 commit 5829f53
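The commit message's reasoning about temperature can be illustrated with a small, self-contained sketch. It is not code from this repository or from llama.cpp: the class `GreedySamplingSketch`, the methods `sampleToken` and `countPicks`, and the logit values are all hypothetical, and index 0 merely stands in for a `<think>` opening token. The sketch assumes that temperature 0 means argmax selection, as the message's "greedy sampling" suggests, and compares that with softmax sampling at a non-zero temperature on two slightly different logit vectors, as two backends might produce.

```java
import java.util.Random;

public class GreedySamplingSketch {

    /**
     * Picks a token index from raw logits.
     * A temperature <= 0 is treated as greedy decoding (argmax); rng is not consulted.
     */
    public static int sampleToken(float[] logits, float temperature, Random rng) {
        if (temperature <= 0.0f) {
            int best = 0;
            for (int i = 1; i < logits.length; i++) {
                if (logits[i] > logits[best]) {
                    best = i;
                }
            }
            return best;
        }
        // Temperature-scaled softmax weights, shifted by the max for numerical stability.
        double max = Double.NEGATIVE_INFINITY;
        for (float logit : logits) {
            max = Math.max(max, logit / temperature);
        }
        double[] weights = new double[logits.length];
        double total = 0.0;
        for (int i = 0; i < logits.length; i++) {
            weights[i] = Math.exp(logits[i] / temperature - max);
            total += weights[i];
        }
        // Draw one index in proportion to its weight.
        double r = rng.nextDouble() * total;
        for (int i = 0; i < weights.length; i++) {
            r -= weights[i];
            if (r < 0.0) {
                return i;
            }
        }
        return weights.length - 1;
    }

    /** Counts how often index target is picked over a number of draws from one seeded generator. */
    public static int countPicks(float[] logits, float temperature, int target, int draws, long seed) {
        Random rng = new Random(seed);
        int hits = 0;
        for (int i = 0; i < draws; i++) {
            if (sampleToken(logits, temperature, rng) == target) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical logits: index 0 stands in for "<think>", index 1 for any other token.
        float[] backendA = {2.00f, 1.90f};
        float[] backendB = {2.01f, 1.89f}; // small numeric drift, as between two backends

        System.out.println("greedy, backend A:  " + countPicks(backendA, 0.0f, 0, 1000, 42L) + " / 1000 picks of index 0");
        System.out.println("greedy, backend B:  " + countPicks(backendB, 0.0f, 0, 1000, 42L) + " / 1000 picks of index 0");
        System.out.println("T=0.8,  backend A:  " + countPicks(backendA, 0.8f, 0, 1000, 42L) + " / 1000 picks of index 0");
    }
}
```

With greedy selection every draw returns index 0 for both logit vectors, whereas at temperature 0.8 a substantial share of draws land on index 1. Greedy selection stays stable only while backend differences are too small to reorder the top two logits, which is the situation the commit message describes.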
1 file changed
Lines changed: 12 additions & 0 deletions
@@ -97,11 +97,17 @@ 5 lines added at new lines 100-104, 1 line added at new line 110
@@ -132,11 +138,17 @@ 5 lines added at new lines 141-145, 1 line added at new line 151