feat(state,engine): run checkpointing + resume (agentctl run --resume) #127

Merged
leo-aa88 merged 4 commits into main from feat/run-checkpointing-resume-105 on Jun 2, 2026

leo-aa88 (Member) commented Jun 2, 2026

Summary

  • Adds SQLite run_checkpoints table (migration 003) with SaveCheckpoint, GetLatestCheckpoint, and UpdateRunStatus on RuntimeStore. Checkpoints cascade when trace retention prunes runs.
  • Makes the workflow engine checkpoint-aware: after each completed step it persists a canonical JSON snapshot of interpolation context (${input.*}, ${steps.*}) and accumulated cost. Resume continues from the next step without replaying earlier steps.
  • Adds agentctl run --resume <run-id> to rehydrate and continue an interrupted or crash-recovered run. Interrupted runs exit cleanly (status interrupted, exit code 0).
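
The checkpoint-and-resume loop described in the summary can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's code: `Snapshot`, `Store`, `Step`, and `Execute` are hypothetical names, the store is an in-memory stand-in for the SQLite-backed `RuntimeStore`, and the `interruptAfter` parameter only imitates the `InterruptAfterStepIndex` test stub.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// ErrInterrupted signals a clean pause. The name appears in the PR;
// everything else in this sketch is assumed for illustration.
var ErrInterrupted = errors.New("run interrupted")

// Snapshot is a hypothetical checkpoint payload: the interpolation context
// (input and step outputs), accumulated cost, and the next step to execute.
type Snapshot struct {
	NextStep int               `json:"next_step"`
	Input    map[string]string `json:"input"`
	Steps    map[string]string `json:"steps"`
	CostUSD  float64           `json:"cost_usd"`
}

// Store is a hypothetical in-memory stand-in for the SQLite-backed store.
type Store struct{ latest map[string][]byte }

func NewStore() *Store { return &Store{latest: map[string][]byte{}} }

// SaveCheckpoint keeps only the newest snapshot per run.
func (s *Store) SaveCheckpoint(runID string, snap Snapshot) error {
	// encoding/json sorts map keys, so equal snapshots encode to equal bytes.
	b, err := json.Marshal(snap)
	if err != nil {
		return err
	}
	s.latest[runID] = b
	return nil
}

// GetLatestCheckpoint reports found=false when the run has no checkpoint yet.
func (s *Store) GetLatestCheckpoint(runID string) (Snapshot, bool, error) {
	b, ok := s.latest[runID]
	if !ok {
		return Snapshot{}, false, nil
	}
	var snap Snapshot
	err := json.Unmarshal(b, &snap)
	return snap, err == nil, err
}

// Step is a hypothetical workflow step with a fixed cost.
type Step struct {
	Name string
	Cost float64
	Run  func(ctx Snapshot) (string, error)
}

// Execute starts at the checkpointed position (or at step 0 for a fresh run)
// and checkpoints after every completed step. interruptAfter pauses the run
// after the step with that index; pass -1 to disable.
func Execute(store *Store, runID string, input map[string]string, steps []Step, interruptAfter int) (Snapshot, error) {
	snap, found, err := store.GetLatestCheckpoint(runID)
	if err != nil {
		return Snapshot{}, err
	}
	if !found {
		snap = Snapshot{Input: input, Steps: map[string]string{}}
	}
	for i := snap.NextStep; i < len(steps); i++ {
		out, err := steps[i].Run(snap)
		if err != nil {
			return snap, fmt.Errorf("step %q: %w", steps[i].Name, err)
		}
		snap.Steps[steps[i].Name] = out
		snap.CostUSD += steps[i].Cost
		snap.NextStep = i + 1
		if err := store.SaveCheckpoint(runID, snap); err != nil {
			return snap, err
		}
		if i == interruptAfter {
			return snap, ErrInterrupted
		}
	}
	return snap, nil
}
```

Calling `Execute` once with `interruptAfter` set to 0 and then again with -1 for the same run ID runs every step exactly once: the second call reads the stored snapshot and begins at `NextStep` instead of replaying step 0.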

Closes #105

Design notes

Test plan

  • make ci (gofmt, vet, go test -race ./...)
  • State: checkpoint round-trip, FK enforcement, cascade on DeleteRunsStartedBefore
  • Engine: interrupt → resume without replaying completed steps; crash recovery from running checkpoint; reject completed checkpoint
  • Runtime: end-to-end interrupt → resume with coherent trace log
  • CLI: --resume validation paths (missing run, conflicting args)
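
As an illustration of the `--resume` validation paths in the last item, here is a hedged sketch of argument checking. `RunArgs`, `ValidateRunArgs`, `ExitCode`, and the `runExists` callback are hypothetical; the sketch assumes that "conflicting args" means combining `--resume` with positional arguments, and that non-usage failures exit with code 1. Only the exit code 2 for argument validation comes from the PR's commits.

```go
package main

import (
	"errors"
	"fmt"
)

// ExitUsage is the exit code for argument validation failures.
const ExitUsage = 2

// ErrUsage marks errors caused by how the command was invoked.
var ErrUsage = errors.New("usage error")

// RunArgs is a hypothetical parsed form of the `agentctl run` arguments.
type RunArgs struct {
	ResumeID   string   // value of --resume, empty when the flag is absent
	Positional []string // e.g. a workflow name for a fresh run (assumed)
}

// ValidateRunArgs rejects conflicting or missing arguments before any work
// starts. runExists is a hypothetical lookup against the run store.
func ValidateRunArgs(a RunArgs, runExists func(id string) bool) error {
	switch {
	case a.ResumeID != "" && len(a.Positional) > 0:
		return fmt.Errorf("%w: --resume cannot be combined with positional arguments", ErrUsage)
	case a.ResumeID == "" && len(a.Positional) == 0:
		return fmt.Errorf("%w: provide a workflow or --resume <run-id>", ErrUsage)
	case a.ResumeID != "" && !runExists(a.ResumeID):
		return fmt.Errorf("run %q not found", a.ResumeID)
	}
	return nil
}

// ExitCode maps an error to a process exit code: 0 for success, 2 for usage
// errors, and 1 (an assumption) for everything else.
func ExitCode(err error) int {
	switch {
	case err == nil:
		return 0
	case errors.Is(err, ErrUsage):
		return ExitUsage
	default:
		return 1
	}
}
```

Wrapping usage failures in a sentinel error keeps the exit-code decision in one place, so each validation path can be tested without spawning a process.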

Made with Cursor

leo-aa88 and others added 3 commits June 2, 2026 01:22
Persist per-run execution snapshots in SQLite so workflow runs can pause
and resume. Adds migration 003, RunCheckpoint model, SaveCheckpoint,
GetLatestCheckpoint, and UpdateRunStatus with cascade on trace retention.

Closes #105 (state layer).

Co-authored-by: Cursor <cursoragent@cursor.com>

Write canonical JSON checkpoints after every completed step. Resume
rehydrates interpolation context from the latest checkpoint. ErrInterrupted
signals a clean pause for future approval gates; stub via
InterruptAfterStepIndex for tests.

Co-authored-by: Cursor <cursoragent@cursor.com>

Resume reuses the persisted run row and checkpoint, emits run.resumed
trace events, and exits 0 on interrupted runs awaiting human action.

Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions Bot commented Jun 2, 2026

ReviewGate: WARN

Stats

  • Files changed: 25
  • Raw LOC changed: 1607
  • LOC after §10.4 exclusions (human_loc_changed): 1607
  • PR author class: human (human collaborator account) — login leo-aa88 (§10.4.2).

Warnings (3)

  • medium too_many_files_changed -- PR reaches the warn files_changed threshold: 25 files (threshold 25).
  • medium too_large_human_loc -- PR exceeds warn human_loc_changed threshold: 1607 lines (threshold 800).
  • medium many_risky_files -- This PR touches 2 risky files, at or above the warning threshold of 2 (risky_files_changed).

Suggested labels: reviewability-warn, too-large, risky-change

File categories: 25 files (2 risky)


github-actions Bot commented Jun 2, 2026

Automated review

Summary

Feature introduces run checkpointing and resume functionality.

Findings

  • high · internal/engine/checkpoint.go · Risk of data loss on checkpointing
    Checkpoints are designed to be written under specific conditions. If these conditions fail, such as database write errors during a crash, it may lead to data inconsistencies or loss.
  • medium · internal/cli/run.go · Potential security risk with resumed runs
    The implementation of --resume should ensure the run ID is safe and validated. Without verification, it can allow arbitrary workflow continuations, potentially leading to conflicts or unauthorized data modifications.
  • medium · internal/runtime/local/runner.go · Workflow state synchronization issues
    Resuming workflows without validating the spec hash can introduce issues if the workflow definitions change during an existing run.
  • medium · internal/state/sqlite/checkpoint.go · Foreign key integrity risks
    Checkpoints have foreign key dependencies on runs. If run data is deleted, the integrity of checkpoints could be compromised without proper cascading rules in place.
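
One way to guard against the definition-drift finding is to pin a hash of the workflow spec and the environment name when the run starts, then compare both on resume. The sketch below is an assumption-laden illustration: `PinnedRun`, `SpecHash`, `CheckResumable`, and `ErrSpecDrift` are hypothetical names, and SHA-256 is chosen here only as a plausible hash, since the source does not say which one is used.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

// ErrSpecDrift is a hypothetical sentinel for a resume rejected because the
// workflow definition or environment no longer matches the original run.
var ErrSpecDrift = errors.New("workflow spec or environment changed since run started")

// PinnedRun holds the values recorded on the run row at start time.
type PinnedRun struct {
	WorkflowSpecHash string
	EnvironmentName  string
}

// SpecHash returns a hex-encoded SHA-256 digest of the raw spec bytes.
// The hash algorithm is an assumption made for this sketch.
func SpecHash(spec []byte) string {
	sum := sha256.Sum256(spec)
	return hex.EncodeToString(sum[:])
}

// CheckResumable rejects a resume when the current spec or environment
// differs from what was pinned when the run began.
func CheckResumable(run PinnedRun, currentSpec []byte, currentEnv string) error {
	if got := SpecHash(currentSpec); got != run.WorkflowSpecHash {
		return fmt.Errorf("%w: spec hash %s, pinned %s", ErrSpecDrift, got, run.WorkflowSpecHash)
	}
	if currentEnv != run.EnvironmentName {
		return fmt.Errorf("%w: environment %q, pinned %q", ErrSpecDrift, currentEnv, run.EnvironmentName)
	}
	return nil
}
```

Hashing the raw bytes means even a whitespace-only edit to the spec blocks a resume; a real implementation might hash a normalized form instead.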

Persist checkpoints before run_steps succeeded rows to close the crash
replay window. Pin workflow_spec_hash and environment_name on runs (migration
004) and reject resume on drift. Add checkpoint payload version, size bounds,
and step validation; typed run status constants; DRY prepareProject helper;
CLI Args validation exit 2; happy-path resume integration test.

Co-authored-by: Cursor <cursoragent@cursor.com>
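
The payload version, size bound, and step validation mentioned in this commit could look roughly like the following. This is a sketch only: `Payload`, `DecodeCheckpoint`, `PayloadVersion`, `MaxPayloadBytes`, and `ErrInvalidCheckpoint` are hypothetical names, the 1 MiB limit is an assumed value, and "step validation" is interpreted here as a simple range check on the next step index.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

const (
	PayloadVersion  = 1       // assumed current payload version
	MaxPayloadBytes = 1 << 20 // assumed upper bound of 1 MiB
)

// ErrInvalidCheckpoint is a hypothetical sentinel for rejected payloads.
var ErrInvalidCheckpoint = errors.New("invalid checkpoint")

// Payload is a hypothetical versioned envelope around the saved context.
type Payload struct {
	Version  int             `json:"version"`
	NextStep int             `json:"next_step"`
	Context  json.RawMessage `json:"context"`
}

// DecodeCheckpoint validates size, version, and step index before the
// payload is trusted. stepCount is the number of steps in the workflow.
func DecodeCheckpoint(raw []byte, stepCount int) (Payload, error) {
	if len(raw) > MaxPayloadBytes {
		return Payload{}, fmt.Errorf("%w: %d bytes exceeds limit %d", ErrInvalidCheckpoint, len(raw), MaxPayloadBytes)
	}
	var p Payload
	if err := json.Unmarshal(raw, &p); err != nil {
		return Payload{}, fmt.Errorf("%w: %v", ErrInvalidCheckpoint, err)
	}
	if p.Version != PayloadVersion {
		return Payload{}, fmt.Errorf("%w: unsupported version %d", ErrInvalidCheckpoint, p.Version)
	}
	if p.NextStep < 0 || p.NextStep > stepCount {
		return Payload{}, fmt.Errorf("%w: next step %d out of range 0..%d", ErrInvalidCheckpoint, p.NextStep, stepCount)
	}
	return p, nil
}
```

Checking the size before unmarshalling avoids parsing an oversized blob at all, and the version check gives later payload changes a clear place to branch.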
leo-aa88 merged commit 0b870ae into main on Jun 2, 2026
7 checks passed
leo-aa88 deleted the feat/run-checkpointing-resume-105 branch on June 2, 2026 05:00