Commit 743b7f2

[argus] loop_detector + prompt: nudge toward observation when stuck
Two related changes addressing a failure mode observed on Qwen3-27B: the model has full multimodal vision but tends to skip past observing tool output and pattern-match the bug from priors. When iteration doesn't change the visible result, a single-line "stop calling tools" warning amplifies the wrong instinct: the model concludes "I should do this all at once" and rewrites the file from scratch, usually introducing new bugs.

1. loop_detection_middleware: rewrite the soft-warning text.

   Old: "[LOOP DETECTED] You are repeating the same tool calls. Stop calling tools and produce your final answer now."

   New: "[REPEAT TOOL CALL DETECTED] You have just made the same tool call several times in a row. If the observable result is not changing between calls, your model of the bug is likely wrong... (a) describe what you actually observe in the latest result, (b) note explicitly what is different from what you expected, (c) instrument or pick a clearly different angle. Do not 'rewrite from scratch' as a debugging strategy."

   Hard-stop messages (>= hard_limit) are unchanged; at that point producing a final answer is the right outcome. The soft-warning threshold is the place where a strategy switch helps, so the new text points at strategy, not at termination.

   Test assertions updated from "LOOP DETECTED" to "REPEAT TOOL CALL DETECTED": 8 sites in test_loop_detection_middleware.py.

2. lead_agent prompt: add an "Observe before you diagnose" paragraph to <debugging_when_stuck>.

   Tells the model to describe what it actually sees in tool output (visible elements, missing elements, error messages and warnings verbatim) BEFORE proposing a fix. This forces the next fix to be grounded in observation rather than priors, breaking the pattern-match-from-prior loop that produces consecutive same-area blind fixes.

   Companion to a render-and-verify SKILL.md update on the Argus side that turns this into an explicit numbered flow.

PR-candidate: maybe (loop detector text), maybe (prompt block)
Reason: Both changes target a Qwen-shaped failure mode that we observed on a specific minion-render thread; the principles generalise but upstream may have different priors on what soft-warning text should say.
1 parent ef623b5 commit 743b7f2

4 files changed

Lines changed: 27 additions & 11 deletions

backend/packages/harness/deerflow/agents/lead_agent/prompt.py

Lines changed: 4 additions & 0 deletions
@@ -464,6 +464,10 @@ def _build_subagent_section(max_concurrent: int) -> str:
 **For deliverables in `/mnt/user-data/outputs`:** write the file once with `write_file`. If you find a bug after writing, use `str_replace` to fix in place. Do not re-run a HEREDOC `cat > ...` to rewrite the whole file.
 </file_editing>
 <debugging_when_stuck>
+**Observe before you diagnose.** When a tool returns output (rendered screenshot, test results, log dump, command stdout), describe what you actually see in plain words *before* proposing a fix.
+List visible elements, missing elements, error messages, and warnings — verbatim, not paraphrased. State explicitly what differs from your expectation.
+This forces you to ground the next fix in observation rather than priors. The cost is one short paragraph; the benefit is that you stop pattern-matching bugs that aren't there.
+
 **Two failed fixes in a row that don't change the observable result is a signal — your model of the bug is wrong.**
 A third blind fix is the most expensive thing you can do: it costs tokens, takes time, and probably won't work either. Stop fixing and start instrumenting.
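The "two failed fixes in a row that don't change the observable result" rule in the prompt block can also be checked mechanically. The following is a minimal sketch, not part of this commit: `StagnationTracker` and its `limit` parameter are hypothetical names, and it assumes the observable result can be captured as a string (test output, log text, or a description of a rendered screenshot).

```python
import hashlib
from typing import Optional


class StagnationTracker:
    """Hypothetical helper: flags consecutive fixes that leave the observed output unchanged."""

    def __init__(self, limit: int = 2) -> None:
        # Number of consecutive unchanged observations that counts as "stuck".
        self.limit = limit
        self._last_digest: Optional[str] = None
        self._unchanged = 0

    def record(self, observation: str) -> bool:
        """Record the output seen after an attempt; return True when it is time to instrument."""
        digest = hashlib.sha256(observation.encode()).hexdigest()
        if digest == self._last_digest:
            self._unchanged += 1
        else:
            # The result moved, so the current model of the bug has some traction.
            self._last_digest = digest
            self._unchanged = 0
        return self._unchanged >= self.limit
```

The first call records the baseline observation; each later call represents the output seen after one attempted fix. With the default `limit=2`, the tracker returns `True` after the second fix that leaves the output byte-for-byte identical.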

backend/packages/harness/deerflow/agents/middlewares/loop_detection_middleware.py

Lines changed: 13 additions & 2 deletions
@@ -126,10 +126,21 @@ def _hash_tool_calls(tool_calls: list[dict]) -> str:
     return hashlib.md5(blob.encode()).hexdigest()[:12]


-_WARNING_MSG = "[LOOP DETECTED] You are repeating the same tool calls. Stop calling tools and produce your final answer now. If you cannot complete the task, summarize what you accomplished so far."
+_WARNING_MSG = (
+    "[REPEAT TOOL CALL DETECTED] You have just made the same tool call (or near-identical: same tool, similar arguments) several times in a row. "
+    "If the observable result is not changing between calls, your model of the bug is likely wrong — making another similar call is unlikely to help. "
+    "Before the next tool call: (a) describe what you actually observe in the latest result, (b) note explicitly what is different from what you expected, "
+    "(c) instrument (add logging, inspect intermediate values, reduce the test surface) or pick a clearly different angle. "
+    "Do not 'rewrite from scratch' as a debugging strategy — that hides the bug rather than finding it. "
+    "If the task genuinely cannot be completed, summarize what you accomplished and stop."
+)

 _TOOL_FREQ_WARNING_MSG = (
-    "[LOOP DETECTED] You have called {tool_name} {count} times without producing a final answer. Stop calling tools and produce your final answer now. If you cannot complete the task, summarize what you accomplished so far."
+    "[REPEAT TOOL CALL DETECTED] You have called {tool_name} {count} times in this conversation. "
+    "Step back: are these calls converging on a result, or are you cycling through similar variations? "
+    "If you are cycling, switch strategy — instrument, reduce the test surface, or pick a clearly different angle. "
+    "Do not rewrite the artifact from scratch; that usually introduces new bugs without fixing the original. "
+    "If the task genuinely cannot be completed, summarize what you accomplished and stop."
 )

 _HARD_STOP_MSG = "[FORCED STOP] Repeated tool calls exceeded the safety limit. Producing final answer with results collected so far."
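For readers unfamiliar with the soft-warning versus hard-stop split, the sketch below shows one way such a detector could be structured. It is an illustration under stated assumptions, not the actual `LoopDetectionMiddleware`: `RepeatDetector`, `hash_tool_calls`, `warn_threshold`, and `hard_limit` are hypothetical names here, the messages are shortened to their prefixes, and the sketch counts total occurrences of a call signature rather than tracking a sliding window.

```python
import hashlib
import json
from collections import defaultdict
from typing import Optional

WARNING = "[REPEAT TOOL CALL DETECTED]"  # shortened stand-in for the soft warning
HARD_STOP = "[FORCED STOP]"              # shortened stand-in for the hard stop


def hash_tool_calls(tool_calls: list) -> str:
    """Short, stable digest of a batch of tool calls (insensitive to key and call order)."""
    normalized = sorted(
        json.dumps({"name": c.get("name"), "args": c.get("args")}, sort_keys=True)
        for c in tool_calls
    )
    return hashlib.md5("|".join(normalized).encode()).hexdigest()[:12]


class RepeatDetector:
    """Hypothetical detector: warn once per signature, force a stop at the hard limit."""

    def __init__(self, warn_threshold: int = 3, hard_limit: int = 5) -> None:
        self.warn_threshold = warn_threshold
        self.hard_limit = hard_limit
        self._counts = defaultdict(int)
        self._warned = set()

    def check(self, tool_calls: list) -> Optional[str]:
        key = hash_tool_calls(tool_calls)
        self._counts[key] += 1
        count = self._counts[key]
        if count >= self.hard_limit:
            # Past this point a strategy hint no longer helps; terminate.
            return HARD_STOP
        if count >= self.warn_threshold and key not in self._warned:
            # Soft warning is injected only once per signature.
            self._warned.add(key)
            return WARNING
        return None
```

The design point the commit relies on is visible in `check`: the soft warning fires while the agent can still change strategy, so its text should steer rather than terminate, while the hard stop remains a pure termination signal.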

backend/tests/test_lead_agent_prompt.py

Lines changed: 2 additions & 1 deletion
@@ -123,7 +123,8 @@ def test_apply_prompt_template_includes_debugging_when_stuck_block(monkeypatch):

     assert "<debugging_when_stuck>" in prompt
     assert "</debugging_when_stuck>" in prompt
-    # Three core principles must all be present
+    # Four core principles must all be present
+    assert "Observe before you diagnose" in prompt
     assert "Two failed fixes in a row" in prompt
     assert "Instrument first, fix second" in prompt
     assert "reduce the test surface" in prompt
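The assertions above check that each phrase appears somewhere in the prompt. A stricter variant would confirm the phrases sit inside the `<debugging_when_stuck>` block itself. The helper below is a hypothetical sketch of that idea (`missing_markers` and `REQUIRED_MARKERS` are not names from the repository); the marker strings are the four asserted in this test.

```python
import re

REQUIRED_MARKERS = (
    "Observe before you diagnose",
    "Two failed fixes in a row",
    "Instrument first, fix second",
    "reduce the test surface",
)


def missing_markers(prompt: str, tag: str = "debugging_when_stuck") -> list:
    """Return the required markers that are absent from the <tag>...</tag> block."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", prompt, flags=re.DOTALL)
    if match is None:
        # No block at all: every marker counts as missing.
        return list(REQUIRED_MARKERS)
    block = match.group(1)
    return [marker for marker in REQUIRED_MARKERS if marker not in block]
```

A test could then assert `missing_markers(prompt) == []`, which also reports exactly which principle went missing when it fails.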

backend/tests/test_loop_detection_middleware.py

Lines changed: 8 additions & 8 deletions
@@ -200,7 +200,7 @@ def test_warn_at_threshold(self):
         msgs = result["messages"]
         assert len(msgs) == 1
         assert isinstance(msgs[0], HumanMessage)
-        assert "LOOP DETECTED" in msgs[0].content
+        assert "REPEAT TOOL CALL DETECTED" in msgs[0].content

     def test_warn_only_injected_once(self):
         """Warning for the same hash should only be injected once per thread."""
@@ -215,7 +215,7 @@ def test_warn_only_injected_once(self):
         # Third — warning injected
         result = mw._apply(_make_state(tool_calls=call), runtime)
         assert result is not None
-        assert "LOOP DETECTED" in result["messages"][0].content
+        assert "REPEAT TOOL CALL DETECTED" in result["messages"][0].content

         # Fourth — warning already injected, should return None
         result = mw._apply(_make_state(tool_calls=call), runtime)
@@ -306,12 +306,12 @@ def test_thread_id_from_runtime_context(self):
         # Second call on thread A — triggers warning (2 >= warn_threshold)
         result = mw._apply(_make_state(tool_calls=call), runtime_a)
         assert result is not None
-        assert "LOOP DETECTED" in result["messages"][0].content
+        assert "REPEAT TOOL CALL DETECTED" in result["messages"][0].content

         # Second call on thread B — also triggers (independent tracking)
         result = mw._apply(_make_state(tool_calls=call), runtime_b)
         assert result is not None
-        assert "LOOP DETECTED" in result["messages"][0].content
+        assert "REPEAT TOOL CALL DETECTED" in result["messages"][0].content

     def test_lru_eviction(self):
         """Old threads should be evicted when max_tracked_threads is exceeded."""
@@ -533,7 +533,7 @@ def test_freq_warn_at_threshold(self):
         msg = result["messages"][0]
         assert isinstance(msg, HumanMessage)
         assert "read_file" in msg.content
-        assert "LOOP DETECTED" in msg.content
+        assert "REPEAT TOOL CALL DETECTED" in msg.content

     def test_freq_warn_only_injected_once(self):
         mw = LoopDetectionMiddleware(tool_freq_warn=3, tool_freq_hard_limit=10)
@@ -545,7 +545,7 @@ def test_freq_warn_only_injected_once(self):
         # 3rd triggers warning
         result = mw._apply(_make_state(tool_calls=[self._read_call("/file_2.py")]), runtime)
         assert result is not None
-        assert "LOOP DETECTED" in result["messages"][0].content
+        assert "REPEAT TOOL CALL DETECTED" in result["messages"][0].content

         # 4th should not re-warn (already warned for read_file)
         result = mw._apply(_make_state(tool_calls=[self._read_call("/file_3.py")]), runtime)
@@ -619,7 +619,7 @@ def test_freq_reset_per_thread_clears_only_target(self):
         # thread-B state should still be intact — 3rd call triggers warn
         result = mw._apply(_make_state(tool_calls=[self._read_call("/b_2.py")]), runtime_b)
         assert result is not None
-        assert "LOOP DETECTED" in result["messages"][0].content
+        assert "REPEAT TOOL CALL DETECTED" in result["messages"][0].content

         # thread-A restarted from 0 — should not trigger
         result = mw._apply(_make_state(tool_calls=[self._read_call("/a_new.py")]), runtime_a)
@@ -642,7 +642,7 @@ def test_freq_per_thread_isolation(self):
         # 3rd call on thread A — triggers (count=3 for thread A only)
         result = mw._apply(_make_state(tool_calls=[self._read_call("/file_2.py")]), runtime_a)
         assert result is not None
-        assert "LOOP DETECTED" in result["messages"][0].content
+        assert "REPEAT TOOL CALL DETECTED" in result["messages"][0].content

     def test_multi_tool_single_response_counted(self):
         """When a single response has multiple tool calls, each is counted."""
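The frequency tests above exercise three behaviours: a warning at the threshold, no repeat warning for the same tool, and independent counts per thread. A minimal sketch of a tracker with those properties follows. `ToolFrequencyTracker`, `warn_at`, `record`, and `reset` are hypothetical names chosen for illustration, and the template is only the first sentence of the new `_TOOL_FREQ_WARNING_MSG`; the real middleware may be organised differently.

```python
from collections import Counter, defaultdict
from typing import Optional

FREQ_TEMPLATE = (
    "[REPEAT TOOL CALL DETECTED] You have called {tool_name} {count} times in this conversation."
)


class ToolFrequencyTracker:
    """Hypothetical per-thread, per-tool call counter with a one-shot warning."""

    def __init__(self, warn_at: int = 3) -> None:
        self.warn_at = warn_at
        self._counts = defaultdict(Counter)  # thread_id -> Counter of tool names
        self._warned = defaultdict(set)      # thread_id -> tools already warned about

    def record(self, thread_id: str, tool_name: str) -> Optional[str]:
        counts = self._counts[thread_id]
        counts[tool_name] += 1
        if counts[tool_name] >= self.warn_at and tool_name not in self._warned[thread_id]:
            self._warned[thread_id].add(tool_name)
            return FREQ_TEMPLATE.format(tool_name=tool_name, count=counts[tool_name])
        return None

    def reset(self, thread_id: str) -> None:
        """Clear one thread's state without touching the others."""
        self._counts.pop(thread_id, None)
        self._warned.pop(thread_id, None)
```

Keying both the counts and the warned set by thread is what lets a reset of one thread leave another thread's progress toward the threshold intact.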
