Problem Description
When GA executes a structured multi-step plan (via plan_sop.md + enter_plan_mode in ga.py), the LLM consistently exhibits execution drift — silently omitting steps, swapping intended approaches for simpler approximations, and confirming completion while tasks remain unfinished. This is not a transient hallucination but a cumulative degradation pattern:
- Execute plan v1.0 → omissions/deviations found on review
- Upgrade to plan v2.0 with more explicit constraints → still deviates on different steps
- Upgrade to plan v3.0 → drift shifts but doesn't shrink
- Ask the LLM to "verify its own work" → it confirms completeness while still missing items (the verification paradox)
- After several plan iterations → the plan itself becomes "rotten" from accumulated patches
This manifests most severely in medium-to-long tasks (10+ turns) with interdependent steps, where early deviations compound into later failures.
Root Cause Analysis
Based on examination of the official source code (commit 9aeb80fd, 2026-06-13), I believe the problem has multiple contributing causes:
1. Plan mode verification is purely formal (ga.py lines 434-439)
    def _check_plan_completion(self):
        if not os.path.isfile(self.working.get('in_plan_mode', '')):
            return 0
        text = open(self.working['in_plan_mode'], encoding='utf-8').read()
        # Only checks for [ ] brackets in plan.md
        # ('待完成' is a literal marker meaning "to be completed")
        return text.count('[ ]') + text.count('待完成')
_check_plan_completion() only counts unchecked [ ] checkboxes, plus occurrences of the literal marker 待完成 ("to be completed"), in the plan markdown file. It does not verify:
- Whether the executed output matches the plan's intent
- Whether the LLM's understanding of "what counts as done" aligns with the user's intent
- Whether steps were performed in the correct order
The LLM can check off all boxes while having done only superficial work. The plan mode provides no semantic gate — only a syntactic one.
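To make the gap concrete, here is a minimal standalone sketch of the same counting logic. The function name `count_open_items` and the sample plan text are hypothetical and written only for illustration; only the two `count` calls mirror the excerpt above.

```python
def count_open_items(plan_text: str) -> int:
    """Mirror the excerpt's logic: count unchecked boxes and the '待完成' marker."""
    return plan_text.count('[ ]') + plan_text.count('待完成')


# A hypothetical plan in which every box is ticked, but nothing records
# what was produced or how it was checked.
plan_all_ticked = """\
- [x] Step 1: migrate the schema
- [x] Step 2: backfill existing rows
- [x] Step 3: run the integration tests
"""

# The same plan with one step still open.
plan_one_open = plan_all_ticked.replace('[x] Step 3', '[ ] Step 3')

print(count_open_items(plan_all_ticked))  # 0
print(count_open_items(plan_one_open))    # 1
```

A result of 0 says only that no unchecked box remains in the file. It carries no information about whether the schema was migrated or the tests were run.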
2. No step-level execution constraints (agent_loop.py lines 42-134)
The core loop in agent_loop.py is a flat pipeline:
    while turn < handler.max_turns:
        response = (yield from client.chat(messages=messages, tools=tools_schema))
        tool_calls = extract_calls(response)
        for tc in tool_calls:
            outcome = yield from handler.dispatch(tool_name, args, response)
        messages.append(assistant_msg)
        messages.append(user_msg_with_results)
There is no step gating — no mechanism that:
- Freezes the current plan step so the LLM must work on it before moving on
- Validates step output against a spec before allowing the next step
- Prevents the LLM from "looking ahead" and mixing steps
- Forces evidence collection before conclusions
Every tool call and response is processed identically. The LLM has full freedom to decide what constitutes "step N complete."
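One way to picture the missing mechanism is a small state machine that holds the current step and refuses to advance until a validator accepts the collected tool output. This is a sketch under assumptions, not GA code: `Step`, `StepGate`, `record`, and `try_advance` are hypothetical names, and the validators are toy substring checks.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Step:
    name: str
    # Hypothetical validator: receives the tool outputs collected for this
    # step and returns True only if they satisfy the step's spec.
    validate: Callable[[List[str]], bool]


@dataclass
class StepGate:
    steps: List[Step]
    index: int = 0
    evidence: Dict[str, List[str]] = field(default_factory=dict)

    @property
    def finished(self) -> bool:
        return self.index >= len(self.steps)

    @property
    def current(self) -> Step:
        return self.steps[self.index]

    def record(self, tool_output: str) -> None:
        """Attach a tool result to the step that is currently active."""
        if self.finished:
            raise RuntimeError("plan already finished")
        self.evidence.setdefault(self.current.name, []).append(tool_output)

    def try_advance(self) -> bool:
        """Move to the next step only if the validator accepts the evidence."""
        if self.finished:
            return False
        collected = self.evidence.get(self.current.name, [])
        if not self.current.validate(collected):
            return False
        self.index += 1
        return True


gate = StepGate(steps=[
    Step("write_config", lambda out: any("config.yaml written" in o for o in out)),
    Step("run_tests", lambda out: any("tests passed" in o for o in out)),
])

advanced_without_evidence = gate.try_advance()  # nothing recorded yet
gate.record("config.yaml written (212 bytes)")
advanced_with_evidence = gate.try_advance()
print(advanced_without_evidence, advanced_with_evidence, gate.current.name)
# False True run_tests
```

In a loop like the one excerpted above, `record` would be called with each dispatch result and `try_advance` before the next model call, so the decision that a step is complete is made outside the LLM.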
3. The verification paradox: same LLM, same bias (ga.py lines 549-581)
When the LLM is asked to "verify its own work" (which happens naturally in plan mode), the turn_end_callback injects periodic hints like:
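    if _plan and turn >= 10 and turn % 5 == 0:
        # Prompt text, roughly: "Currently in plan mode. You must
        # file_read({_plan}) to confirm the current step..."
        next_prompt = f"[Plan Hint] 正在计划模式。必须 file_read({_plan}) 确认当前步骤..."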
This re-reading of the plan is helpful, but the verification is performed by the same LLM instance that holds a biased, self-consistent view of what it has already done. This is well-documented in LLM research as confirmation bias in autoregressive models — once the model has committed to a course of action in its context, subsequent "verification" tends to rationalize rather than correct.
4. No "I don't know" / "incomplete" safe harbor (assets/sys_prompt_en.txt)
The system prompt (in sys_prompt_en.txt) is optimized for execution confidence — it tells the LLM to "probe with tools, never speculate." However, it does not provide a structured way for the LLM to report:
- "I cannot find evidence to verify this step"
- "This step has only been partially completed"
- "The approach I tried for step 3 failed, what should I do?"
Without a safe "incomplete" state, the LLM defaults to optimistic reporting — claiming completion on the basis of weak evidence.
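A structured report format could give the model that safe state. The sketch below is an assumption about what such a protocol might look like; `StepStatus`, `StepReport`, `normalize`, and `needs_attention` are hypothetical names and do not exist in GA.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List


class StepStatus(Enum):
    DONE = "done"
    PARTIAL = "partial"
    BLOCKED = "blocked"
    UNVERIFIED = "unverified"


@dataclass(frozen=True)
class StepReport:
    step: int
    status: StepStatus
    evidence: str = ""  # e.g. a tool output excerpt or a file path
    note: str = ""      # what is missing or what failed


def normalize(report: StepReport) -> StepReport:
    """Downgrade a DONE claim that carries no evidence to UNVERIFIED."""
    if report.status is StepStatus.DONE and not report.evidence.strip():
        note = report.note or "completion claimed without evidence"
        return StepReport(report.step, StepStatus.UNVERIFIED, "", note)
    return report


def needs_attention(reports: Iterable[StepReport]) -> List[int]:
    """Return the step numbers that are not DONE after normalization."""
    return [r.step for r in map(normalize, reports)
            if r.status is not StepStatus.DONE]


reports = [
    StepReport(1, StepStatus.DONE, evidence="pytest: 14 passed"),
    StepReport(2, StepStatus.DONE),  # claimed, but no evidence attached
    StepReport(3, StepStatus.BLOCKED, note="API returned 403"),
]
print(needs_attention(reports))  # [2, 3]
```

The design choice here is that PARTIAL, BLOCKED, and UNVERIFIED are ordinary return values rather than errors, so reporting them costs the model nothing.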
5. Memory persistence without truth validation (ga.py lines 544, 576-579)
The update_working_checkpoint tool (do_update_working_checkpoint in ga.py) stores key info that persists across turns:
    if self.working.get('key_info'):
        prompt += f"\n<key_info>{self.working.get('key_info')}</key_info>"
This is useful for context retention, but incorrect information, once written to working memory, persists and reinforces future drift. There is no mechanism to challenge or verify claims stored in key_info before they influence subsequent turns.
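One possible mitigation is to store each claim together with its provenance and to mark unsupported claims when they are rendered back into the prompt. The following is a sketch only: `Claim`, its `source` field, and `render_key_info` are hypothetical, and the `<key_info>` wrapper is the only part taken from the excerpt above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Claim:
    text: str
    # Hypothetical: identifier of the tool call that supports the claim.
    source: Optional[str] = None


def render_key_info(claims: List[Claim]) -> str:
    """Render working memory so that unsupported claims are visibly marked."""
    lines = []
    for claim in claims:
        tag = f"[verified via {claim.source}]" if claim.source else "[UNVERIFIED]"
        lines.append(f"{tag} {claim.text}")
    return "<key_info>\n" + "\n".join(lines) + "\n</key_info>"


claims = [
    Claim("Step 2 output saved to out/report.csv", source="file_read#17"),
    Claim("All 40 records were migrated"),
]
print(render_key_info(claims))
```

With this rendering, a later turn sees which statements rest on a tool result and which are only the model's earlier assertion.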
Existing Mitigations and Their Gaps
| Mechanism | Location | What it does | Why it's insufficient |
|---|---|---|---|
| Plan Mode + _check_plan_completion | ga.py:428-501 | Counts [ ] checkboxes | Formal only; no semantic verification |
| Secondary confirmation | ga.py:459-496 | Blocks LLM from outputting code without tool calls | Only catches one pattern (talk-only) |
| Periodic turn hints | ga.py:564-574 | Reminds LLM to re-read plan, checkpoint, etc. | Advisory only; the LLM can ignore them |
| update_working_checkpoint | ga.py tool | Stores key info across turns | No validation of stored claims |
| start_long_term_update | ga.py tool | Persists verified facts to long-term memory | "Verified" is self-reported |
| Fresh context each turn | ga.py:537-547 | Injects recent history + working memory | History only; no plan-vs-actual comparison |
| Ultraplan orchestrator | assets/ga_ultraplan.py + memory/ultraplan_sop.md | Phase/Pipeline orchestration (very new) | Too early to evaluate; likely same gaps without explicit gates |
Community Context
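I searched the GitHub issues, PRs, and commit history for this repository. Key findings:
- ultraplan orchestrator with phase/pipeline orchestration (June 26)
- worldline checkpoint-tree rewind (June 18): drift recovery, not prevention
- conductor approval workflow (June 15): human-in-the-loop for critical ops
This suggests the problem exists but hasn't been systematically characterized or addressed yet.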
Suggested Direction
While I don't have a complete solution, I've observed a pattern that effectively addresses both hallucination and drift in other systems: separating execution from verification by introducing an independent reference frame.
Concretely, this could mean:
- Plan freezing: Once a plan step is defined, its spec should be "frozen" and referenced independently — not subject to the LLM's reinterpretation when executing.
- Independent verification: The "did we complete step N correctly?" check should not be done by the same LLM instance that executed the step. A separate verification pass (or a constrained checker) would catch confirmation bias.
- Evidence gates: Before a step is marked complete, require evidence (specific file content, tool output, or user confirmation) that satisfies a pre-defined criterion — not a self-assessment.
- "Incomplete" as a valid state: The LLM needs a structured protocol to report "step X cannot be verified as complete" without this being treated as failure.
The recent ultraplan orchestrator (assets/ga_ultraplan.py) introduces phase-based execution, which could be a foundation for this kind of structured gating. However, based on the current implementation, it likely inherits the same verification gap unless explicit output gates are added between phases.
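As a rough illustration of the plan-freezing and evidence-gate ideas listed above, the sketch below fingerprints a step spec when the plan is approved and later accepts a step only if the spec is unchanged and the evidence matches a pre-defined pattern. Everything here is an assumption for illustration: `StepSpec`, `digest`, `gate`, and the regex criterion are hypothetical, and nothing in it reflects how ga_ultraplan.py works.

```python
import hashlib
import re
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after creation
class StepSpec:
    step_id: str
    description: str
    evidence_pattern: str  # regex that the evidence text must match

    def digest(self) -> str:
        """Stable fingerprint of the spec, recorded when the plan is approved."""
        payload = "\x1f".join([self.step_id, self.description, self.evidence_pattern])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def gate(spec: StepSpec, approved_digest: str, evidence: str) -> str:
    """Return 'done', 'incomplete', or 'spec_changed' for one step."""
    if spec.digest() != approved_digest:
        return "spec_changed"  # the step was reworded after approval
    if re.search(spec.evidence_pattern, evidence):
        return "done"
    return "incomplete"  # a valid outcome, not an error


spec = StepSpec("3", "Run the test suite", r"\b\d+ passed, 0 failed\b")
approved = spec.digest()

print(gate(spec, approved, "12 passed, 0 failed"))  # done
print(gate(spec, approved, "tests look fine"))      # incomplete

# A spec whose criterion was loosened after approval is rejected outright.
weakened = StepSpec("3", "Run the test suite", r"tests")
print(gate(weakened, approved, "tests look fine"))  # spec_changed
```

Because the digest is taken before execution starts, a later reinterpretation of the step shows up as a mismatch instead of silently redefining what "done" means.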
Reproduction
This pattern is most reliably triggered with:
- A complex multi-step task (5+ interdependent steps)
- Using plan_sop.md + enter_plan_mode for structured execution
- Using a medium-strength LLM backend (the effect is less pronounced with very strong models like Claude Opus, but still present)
- Tasks where steps 3-5 depend on the quality of steps 1-2 (drift compounds)
The drift becomes visible when, after the LLM reports "all steps complete," you manually re-run the verification criteria for each step and find gaps.
Environment
- GA version: main branch (latest commit tested: 9aeb80fd, 2026-06-13)
- Tested with multiple LLM backends: GPT-4o, Claude Sonnet 4, Qwen3
- Plan mode activated via the plan_sop.md protocol
- No custom modifications to agent_loop.py or ga.py