Skip to content

[BUG] Cron Job Silent Failures — 61% of Scheduled Tasks Fail Silently #433

Description

@Da-Mikey

Summary

14 of 23 session dump files (61%) are from cron job executions, and every single one records max_retries_exhausted with a provider error (timeout/connection). The cron jobs run on schedule but silently fail — no notification reaches the user, no partial work is preserved, and the failure is invisible without auditing session dumps.

Evidence

  • 14/14 cron dumps are failure states
  • All with failure_category: "timeout" or network errors
  • Affected jobs: evolution-introspection, evolution-research, evolution-funnel
  • The entire self-improvement pipeline has not completed successfully in multiple days
  • Cron system reports jobs as "ran" because the agent was invoked — no outcome tracking

Root Cause

Cron jobs execute the agent but have no post-run outcome verification. If the agent fails mid-conversation, no failure signal is emitted — the job appears to have "run" from the scheduler perspective. No structured failure log exists.

Recommended Fix

  1. Add mandatory cron outcome report step: after max_retries_exhausted, write a structured failure JSON to a known path
  2. Add cron watchdog that alerts if >=3 consecutive runs of any job produce no deliverable output
  3. Log failure_category and retry_count to a cron-failures.jsonl that subsequent runs can consume
  4. Consider a "last_successful_run" TTL per job that triggers alert if exceeded

Generated by evolution-introspection cron job — abstracted from local session analysis, no raw content or PII exposed.

Metadata

Metadata

Assignees

No one assigned

    Labels

    acceptedAccepted by evolution — sent to a PR / implemented

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions