Execution drift: LLM systematically deviates from structured plans, and self-verification compounds the problem

## Problem Description

When GA executes a structured multi-step plan (via `plan_sop.md` + `enter_plan_mode` in `ga.py`), the LLM consistently exhibits **execution drift** — silently omitting steps, swapping intended approaches for simpler approximations, and confirming completion while tasks remain unfinished. This is not a transient hallucination but a **cumulative degradation pattern**:

1. Execute plan v1.0 → omissions/deviations found on review
2. Upgrade to plan v2.0 with more explicit constraints → still deviates on different steps
3. Upgrade to plan v3.0 → drift shifts but doesn't shrink
4. Ask the LLM to "verify its own work" → it confirms completeness while still missing items (the **verification paradox**)
5. After several plan iterations → the plan itself becomes "rotten" from accumulated patches

This manifests most severely in medium-to-long tasks (10+ turns) with interdependent steps, where early deviations compound into later failures.

---

## Root Cause Analysis

Based on examination of the official source code (commit `9aeb80fd`, 2026-06-13), I believe the problem has multiple contributing causes:

### 1. Plan mode verification is purely formal (`ga.py` lines 434-439)

```python
def _check_plan_completion(self):
    if not os.path.isfile(self.working.get('in_plan_mode', '')):
        return 0
    text = open(self.working['in_plan_mode'], encoding='utf-8').read()
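    # Only counts the '[ ]' checkbox markers in plan.md.
    # The literal '待完成' means "to be completed" and is counted as an open item too.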
    return text.count('[ ]') + text.count('待完成')
```

`_check_plan_completion()` only counts unchecked `[ ]` checkboxes and occurrences of the literal marker `待完成` ("to be completed") in the plan markdown file. It does **not** verify:
- Whether the executed output matches the plan's intent
- Whether the LLM's understanding of "what counts as done" aligns with the user's intent
- Whether steps were performed in the correct order

The LLM can check off all boxes while having done only superficial work. The plan mode provides **no semantic gate** — only a syntactic one.
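To make the gap concrete, here is a minimal standalone sketch of the same counting logic. The function name `count_open_items` and the two sample plans are hypothetical; the real method reads the plan path from `self.working`, which is omitted here so the snippet runs on its own.

```python
def count_open_items(plan_text: str) -> int:
    """Mirror of the counting logic quoted above, applied to a string."""
    return plan_text.count('[ ]') + plan_text.count('待完成')


# One plan with an honestly open item, one where the box was ticked
# even though (in this hypothetical scenario) the tests were never written.
honest_plan = "- [x] Write parser\n- [ ] Add tests\n"
ticked_plan = "- [x] Write parser\n- [x] Add tests\n"

remaining_honest = count_open_items(honest_plan)
remaining_ticked = count_open_items(ticked_plan)
print(remaining_honest, remaining_ticked)  # 1 0
```

The counter cannot distinguish the two situations: ticking the box is sufficient to reach zero, whether or not any corresponding output exists.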

### 2. No step-level execution constraints (`agent_loop.py` lines 42-134)

The core loop in `agent_loop.py` is a flat pipeline:

```python
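# Simplified excerpt: turn bookkeeping and the unpacking of
# tool_name/args from each tool call are elided.
while turn < handler.max_turns:
    response = (yield from client.chat(messages=messages, tools=tools_schema))
    tool_calls = extract_calls(response)
    for tc in tool_calls:
        # Every call is dispatched the same way; nothing ties it to a plan step.
        outcome = yield from handler.dispatch(tool_name, args, response)
    messages.append(assistant_msg)
    messages.append(user_msg_with_results)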
```

There is **no step gating** — no mechanism that:
- Freezes the current plan step so the LLM must work on it before moving on
- Validates step output against a spec before allowing the next step
- Prevents the LLM from "looking ahead" and mixing steps
- Forces evidence collection before conclusions

Every tool call and response is processed identically. The LLM has full freedom to decide what constitutes "step N complete."
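As a rough illustration of what step gating could look like, the sketch below holds a cursor on the current step and advances only when a deterministic check passes. `Step`, `StepGate`, and the in-memory `artifacts` set are all hypothetical names invented for this example; nothing like them exists in `agent_loop.py` at the tested commit, and a real check would inspect files or tool output rather than a set.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    name: str
    check: Callable[[], bool]  # deterministic check, evaluated outside the LLM


@dataclass
class StepGate:
    steps: List[Step]
    index: int = 0

    def current(self) -> Optional[Step]:
        """Return the step the agent is currently held on, or None when finished."""
        return self.steps[self.index] if self.index < len(self.steps) else None

    def try_advance(self) -> bool:
        """Move to the next step only if the current step's check passes."""
        step = self.current()
        if step is None or not step.check():
            return False
        self.index += 1
        return True


# Stand-in for the filesystem: names of artifacts produced so far.
artifacts = set()
gate = StepGate([
    Step("write report", lambda: "report.md" in artifacts),
    Step("summarize", lambda: "summary.md" in artifacts),
])

blocked = gate.try_advance()   # no report yet, so the gate stays on step 0
artifacts.add("report.md")
advanced = gate.try_advance()  # evidence is present, so the gate moves on
```

The point of the design is that "step N complete" is decided by the check, not by the model's own statement.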

### 3. The verification paradox: same LLM, same bias (`ga.py` lines 549-581)

When the LLM is asked to "verify its own work" (which happens naturally in plan mode), the `turn_end_callback` injects periodic hints like:

```python
if _plan and turn >= 10 and turn % 5 == 0:
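    # Hint text, translated: "In plan mode. You must file_read({_plan}) to confirm the current step..."
    next_prompt = f"[Plan Hint] 正在计划模式。必须 file_read({_plan}) 确认当前步骤..."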
```

Re-reading the plan helps, but the verification is still performed by **the same LLM instance**, which already holds a biased, self-consistent view of what it has done. This behavior is often described as **confirmation bias in autoregressive models**: once the model has committed to a course of action in its context, subsequent "verification" tends to rationalize that course rather than correct it.

### 4. No "I don't know" / "incomplete" safe harbor (`assets/sys_prompt_en.txt`)

The system prompt (in `sys_prompt_en.txt`) is optimized for **execution confidence** — it tells the LLM to "probe with tools, never speculate." However, it does not provide a structured way for the LLM to report:

- "I cannot find evidence to verify this step"
- "This step has only been partially completed"
- "The approach I tried for step 3 failed, what should I do?"

Without a safe "incomplete" state, the LLM defaults to **optimistic reporting** — claiming completion on the basis of weak evidence.
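One possible shape for such a safe harbor is a structured step report in which "not done" states are first-class values. The sketch below is an assumption-laden illustration: `StepStatus`, `StepReport`, and `validate_report` are hypothetical names, and the rule that a `DONE` claim without evidence is downgraded is one design choice among several.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class StepStatus(Enum):
    DONE = "done"
    PARTIAL = "partial"        # some of the step was completed
    UNVERIFIED = "unverified"  # no evidence available either way
    BLOCKED = "blocked"        # the attempted approach failed


@dataclass
class StepReport:
    step: int
    status: StepStatus
    evidence: Optional[str] = None
    note: str = ""


def validate_report(report: StepReport) -> StepReport:
    """Downgrade a DONE claim that carries no evidence to UNVERIFIED."""
    if report.status is StepStatus.DONE and not (report.evidence or "").strip():
        return StepReport(report.step, StepStatus.UNVERIFIED, None,
                          "claimed done without evidence")
    return report


claimed = validate_report(StepReport(step=3, status=StepStatus.DONE))
honest = validate_report(
    StepReport(step=4, status=StepStatus.PARTIAL, note="2 of 5 files migrated")
)
```

Here an honest `PARTIAL` report passes through unchanged, while an unsupported `DONE` is treated as a weaker claim instead of being accepted.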

### 5. Memory persistence without truth validation (`ga.py` lines 544, 576-579)

The `update_working_checkpoint` tool (`do_update_working_checkpoint` in `ga.py`) stores key info that persists across turns:

```python
if self.working.get('key_info'): 
    prompt += f"\n<key_info>{self.working.get('key_info')}</key_info>"
```

This is useful for context retention, but **incorrect information, once written to working memory, persists and reinforces future drift**. There is no mechanism to challenge or verify claims stored in `key_info` before they influence subsequent turns.
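A lightweight way to limit this, sketched below under stated assumptions, is to store each claim with its provenance and render unverified claims visibly as such when building the prompt. `Claim` and `render_key_info` are hypothetical; the real `key_info` is a plain value in `self.working`, and only the `<key_info>` tag is taken from the excerpt above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Claim:
    text: str
    source: str = ""       # e.g. the tool call or file that supports the claim
    verified: bool = False


def render_key_info(claims: List[Claim]) -> str:
    """Render claims so that unverified ones are clearly marked in the prompt."""
    lines = []
    for c in claims:
        # A claim counts as verified only if it is flagged AND names a source.
        tag = "verified" if c.verified and c.source else "UNVERIFIED"
        suffix = f" (source: {c.source})" if c.source else ""
        lines.append(f"[{tag}] {c.text}{suffix}")
    return "<key_info>\n" + "\n".join(lines) + "\n</key_info>"


claims = [
    Claim("config lives in settings.toml",
          source="file_read settings.toml", verified=True),
    Claim("all tests pass"),  # asserted by the model, never checked
]
rendered = render_key_info(claims)
```

This does not validate anything by itself, but it keeps a self-reported claim from being re-read later as established fact.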

---

## Existing Mitigations and Their Gaps

| Mechanism | Location | What it does | Why it's insufficient |
|-----------|----------|-------------|----------------------|
| **Plan Mode** + `_check_plan_completion` | `ga.py:428-501` | Counts `[ ]` checkboxes | Formal only — no semantic verification |
| **Secondary confirmation** | `ga.py:459-496` | Blocks LLM from outputting code without tool calls | Only catches one pattern (talk-only) |
| **Periodic turn hints** | `ga.py:564-574` | Reminds LLM to re-read plan, checkpoint, etc. | Advisory only — LLM can ignore |
| **`update_working_checkpoint`** | `ga.py` tool | Stores key info across turns | No validation of stored claims |
| **`start_long_term_update`** | `ga.py` tool | Persists verified facts to long-term memory | "Verified" is self-reported |
| **Fresh context each turn** | `ga.py:537-547` | Injects recent history + working memory | History only — no plan-vs-actual comparison |
| **Ultraplan orchestrator** | `assets/ga_ultraplan.py` + `memory/ultraplan_sop.md` | Phase/Pipeline orchestration (very new) | Too early to evaluate; likely same gaps without explicit gates |

---

## Community Context

I searched the GitHub issues, PRs, and commit history for this repository. Key findings:

- **No existing issues about execution drift or hallucination.** The closest is #474 (code quality scan) and #522 (session sorting drift — unrelated).
- **PR #561** ("fix: add codebase verification step to task_planning to prevent duplicate TODOs") shows awareness of the need for pre-execution verification, but scoped to task planning only.
- **Recent commits** (2026-06-15 to 2026-06-26) show movement in the right direction:
  - `ultraplan` orchestrator with phase/pipeline orchestration (June 26)
  - `worldline checkpoint-tree rewind` (June 18) — drift recovery, not prevention
  - `conductor approval workflow` (June 15) — human-in-the-loop for critical ops
- **No commit specifically addresses semantic drift detection or step-level verification.**

This suggests the problem exists but hasn't been systematically characterized or addressed yet.

---

## Suggested Direction

While I don't have a complete solution, I've observed a pattern that effectively addresses both hallucination and drift in other systems: **separating execution from verification by introducing an independent reference frame.**

Concretely, this could mean:

1. **Plan freezing**: Once a plan step is defined, its spec should be "frozen" and referenced independently — not subject to the LLM's reinterpretation when executing.
2. **Independent verification**: The "did we complete step N correctly?" check should not be done by the same LLM instance that executed the step. A separate verification pass (or a constrained checker) would catch confirmation bias.
3. **Evidence gates**: Before a step is marked complete, require evidence (specific file content, tool output, or user confirmation) that satisfies a pre-defined criterion — not a self-assessment.
4. **"Incomplete" as a valid state**: The LLM needs a structured protocol to report "step X cannot be verified as complete" without this being treated as failure.
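Points 1 to 3 above can be sketched together as follows. This is an illustration under assumptions, not a proposed patch: `FrozenStep`, `freeze`, `spec_unchanged`, `verify`, and the toy checker `has_test_file` are hypothetical names, and in practice the checker might be a separate model call or a script rather than a one-line function.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class FrozenStep:
    step_id: str
    spec: str
    digest: str  # SHA-256 of the spec at freeze time


def freeze(step_id: str, spec: str) -> FrozenStep:
    """Record the step spec together with a digest of its exact wording."""
    return FrozenStep(step_id, spec, hashlib.sha256(spec.encode("utf-8")).hexdigest())


def spec_unchanged(step: FrozenStep, current_spec: str) -> bool:
    """Detect whether the spec was reworded after it was frozen."""
    return hashlib.sha256(current_spec.encode("utf-8")).hexdigest() == step.digest


def verify(step: FrozenStep, evidence: Dict[str, str],
           checker: Callable[[str, Dict[str, str]], bool]) -> bool:
    """The checker sees only the frozen spec and the evidence, never the executor's transcript."""
    return checker(step.spec, evidence)


def has_test_file(spec: str, evidence: Dict[str, str]) -> bool:
    """Toy criterion: the evidence must include the test file named in the spec."""
    return "tests/test_parser.py" in evidence


step = freeze("step-2", "Add unit tests in tests/test_parser.py")
tampered = not spec_unchanged(step, "Add unit tests (optional)")
ok_without = verify(step, {}, has_test_file)
ok_with = verify(step, {"tests/test_parser.py": "def test_x(): ..."}, has_test_file)
```

Because the verifier receives only the frozen spec and the evidence, it has no access to the executor's own account of what happened, which is the source of the bias described earlier.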

The recent `ultraplan` orchestrator (`assets/ga_ultraplan.py`) introduces phase-based execution, which could be a foundation for this kind of structured gating. However, based on the current implementation, it likely inherits the same verification gap unless explicit output gates are added between phases.

---

## Reproduction

This pattern is most reliably triggered with:

1. A complex multi-step task (5+ interdependent steps)
2. Using `plan_sop.md` + `enter_plan_mode` for structured execution
3. Using a medium-strength LLM backend (the effect is less pronounced with very strong models like Claude Opus, but still present)
4. Tasks where steps 3-5 depend on the quality of steps 1-2 (drift compounds)

The drift becomes visible after the LLM reports "all steps complete": manually re-running the verification criteria for each step reveals the gaps.
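The manual audit can be approximated with a small script that re-runs one criterion per step and lists the steps that were reported done but fail. Everything here is hypothetical: the `audit` helper, the step names, and the in-memory `produced` dict, which stands in for the files a real run would leave on disk.

```python
from typing import Callable, Dict, List


def audit(criteria: Dict[str, Callable[[], bool]],
          reported_done: List[str]) -> List[str]:
    """Return the steps that were reported done but fail their criterion when re-run."""
    return [name for name in reported_done if not criteria[name]()]


# Stand-in for the outputs of a finished run (file name -> content).
produced = {
    "schema.sql": "CREATE TABLE users (id INT);",
    "README.md": "",  # the file exists but is empty
}

criteria = {
    "create schema": lambda: "CREATE TABLE" in produced.get("schema.sql", ""),
    "document setup": lambda: len(produced.get("README.md", "")) > 0,
}

gaps = audit(criteria, ["create schema", "document setup"])
print(gaps)  # ['document setup']
```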

---

## Environment

- GA version: `main` branch (latest commit tested: `9aeb80fd`, 2026-06-13)
- Tested with multiple LLM backends: GPT-4o, Claude Sonnet 4, Qwen3
- Plan mode activated via `plan_sop.md` protocol
- No custom modifications to `agent_loop.py` or `ga.py`

