Description
DagScheduler holds the entire TaskGraph in-memory. GraphPersistence::save() is called only at two points in plan.rs:
- Defensively before
scheduler_result? (on scheduler error)
- Terminally after the final graph is assembled
A process crash during execution (between task completions) loses all intermediate task state. TaskStatus transitions (Pending → Running → Completed/Failed), retry counts, predicate outcomes, lineage chains — all are lost.
Reproduction Steps
- Start a multi-task plan with 10+ tasks
- Kill the process mid-execution (e.g. SIGKILL after 3 tasks complete)
- Resume with
/plan list — the graph shows status from the last terminal save only
Expected Behavior
Completed/failed task states should survive a crash. At minimum, a per-tick snapshot after each task completion would reduce the replay window.
Actual Behavior
All in-flight and completed-since-last-save state is lost. The defensive save at scheduler_result? only fires when the scheduler loop exits — not at individual task completion boundaries.
Environment
- Affected:
crates/zeph-orchestration/src/scheduler/, crates/zeph-core/src/agent/plan.rs
- All execution modes
Logs / Evidence
plan.rs:292 — defensive save before scheduler_result?
plan.rs:315 — terminal authoritative save after into_graph()
No save call inside the scheduler tick loop or per-task completion handler.
Description
DagSchedulerholds the entireTaskGraphin-memory.GraphPersistence::save()is called only at two points inplan.rs:scheduler_result?(on scheduler error)A process crash during execution (between task completions) loses all intermediate task state.
TaskStatustransitions (Pending → Running → Completed/Failed), retry counts, predicate outcomes, lineage chains — all are lost.Reproduction Steps
/plan list— the graph shows status from the last terminal save onlyExpected Behavior
Completed/failed task states should survive a crash. At minimum, a per-tick snapshot after each task completion would reduce the replay window.
Actual Behavior
All in-flight and completed-since-last-save state is lost. The defensive save at
scheduler_result?only fires when the scheduler loop exits — not at individual task completion boundaries.Environment
crates/zeph-orchestration/src/scheduler/,crates/zeph-core/src/agent/plan.rsLogs / Evidence
plan.rs:292— defensive save beforescheduler_result?plan.rs:315— terminal authoritative save afterinto_graph()No save call inside the scheduler tick loop or per-task completion handler.