Replies: 2 comments 1 reply
-
|
this looks close to a couple existing llama.cpp threads rather than a totally new config issue. #21445 has the accepted pointer for per-request control: For the Step 3.7-specific loop/overthinking behavior with tool-ish reasoning, #24181 is probably the better thread to watch/add your repro details to. |
Beta Was this translation helpful? Give feedback.
-
|
I don't think Step 3.7 Flash really supports reasoning efforts like GPT-OSS. Even though it is documented, I've tried setting it to "low" and saw no difference in its output when using the official API. In the end the amount of reasoning seems to be decided by the complexity of the task and some randomness. llama.cpp reasoning budget options can work quite well if set up proper.y Before I reported the parser bug (#24181), I had been using reasoning_budget as a workaround. See this for more details: https://huggingface.co/stepfun-ai/Step-3.7-Flash-GGUF/discussions/6 |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
I wonder if I'm doing something wrong. The 3.7 model seems to massively overthink even simple questions such as writing C++ AVL implementation. It's to the point where it's effectively several times slower than comparably-sized models. It does not get into a loop and eventually finishes.
However, I don't see the value of "checking" these test cases when it's actually NOT executing the code and testing. It's just regurgitating test cases.
I was planning to use this for non-coding scenarios but coding is simpler to test. Thanks.
Arguments: --jinja --chat-template-kwargs {"reasoning_effort":"low"}
Llama.cpp: b9496
Model: bartowski/Step-3.7-Flash-GGUF
Quant: Q8_0
MTP layer: Step3.7-flash-mtp-Q8_0.gguf
I observed the same behavior with the stepfun-ai/Step-3.7-Flash-GGUF model. MTP is working and t/s increases by 25% from 8 t/s to 10 t/s.
Code Thinking Snippet:
Test Case Thinking Snippet:
Beta Was this translation helpful? Give feedback.
All reactions