Skip to content

feat(retention): strict bool env parsing + stale-RUNNING cleanup override#306

Open
gabrycina wants to merge 1 commit into
mainfrom
retention/stale-running-and-strict-bool-env
Open

feat(retention): strict bool env parsing + stale-RUNNING cleanup override#306
gabrycina wants to merge 1 commit into
mainfrom
retention/stale-running-and-strict-bool-env

Conversation

@gabrycina

@gabrycina gabrycina commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Two hardening changes to the scheduled task-retention cleanup (#272 follow-up) ahead of the EY rollout. Context: EY data-retention thread + the sgp-dev verification doc.

1. Strict boolean env parsing

RETENTION_CLEANUP_DRY_RUN was parsed with os.environ.get(...) == "true", so any unrecognized value silently meant dry_run=False → live deletion. DRY_RUN=True (capital T — exactly what YAML/Helm tooling tends to render) was the documented gotcha in the sgp-dev test writeup; in the EY environment this config flows through system-manager value_overrides → Helm → env, which makes that footgun very real.

Booleans now accept true/false/1/0 case-insensitively and raise on anything else, so a misconfigured worker refuses to run instead of deleting data. Applied to RETENTION_CLEANUP_DRY_RUN, RETENTION_CLEANUP_ENABLED, and ENABLE_HEALTH_CHECK_WORKFLOW (same pattern).

2. RETENTION_CLEANUP_STALE_RUNNING_DAYS (default 0 = disabled, current behavior unchanged)

The sweep unconditionally refuses RUNNING tasks. Tasks abandoned mid-run (agent crashed, workflow never reached terminal state) therefore stay in Mongo/Postgres forever — the sgp-dev test surfaced 121 such tasks on a single agent (test-v2: cleaned 0, skipped 121). For a retention-compliance feature that's a structural gap: the data most likely to be forgotten is precisely the data the policy can never touch.

When set > 0, a RUNNING task with no interaction for at least that many days (same last-interaction definition as the idle check: max(task.updated_at, latest message)) is treated as abandoned and becomes cleanable. Each override emits a WARNING forensic log. The unprocessed-events correctness guard is unchanged. Recommended setting for OneEdge/EY: 30.

Wiring: env → load_cleanup_config → sweep → child workflow → clean_task activity → use case → service. The new activity parameter defaults to 0, so in-flight workflows started before a deploy are unaffected.

Tests

  • env parsing: case-insensitivity matrix, garbage → ValueError, new var default/parse
  • service: RUNNING refused by default; stale-RUNNING cleaned with override; idle-but-not-stale-enough refused; recent Mongo message blocks override; preview honors override without writes
  • workflow: stale_running_days propagation sweep → child → activity; existing fakes updated for the new activity arg
  • uv run --group test pytest tests/unit/...retention...36 passed; ruff + ruff-format clean

cc @stas

Greptile Summary

This PR adds two hardening changes to the scheduled task-retention cleanup ahead of the EY rollout: strict boolean env parsing that raises on unrecognized values (replacing silent == "true" coercion that caused DRY_RUN=True to silently enable live deletion), and a new RETENTION_CLEANUP_STALE_RUNNING_DAYS opt-in that allows abandoned RUNNING tasks to be cleaned after a configurable inactivity period.

  • Strict bool parsing: _parse_bool_env now accepts true/false/1/0 case-insensitively and raises ValueError on anything else; applied to RETENTION_CLEANUP_DRY_RUN, RETENTION_CLEANUP_ENABLED, and ENABLE_HEALTH_CHECK_WORKFLOW. Workers with a misconfigured value will refuse to start rather than silently flip to live-deletion mode.
  • Stale-RUNNING override: When RETENTION_CLEANUP_STALE_RUNNING_DAYS > 0, a RUNNING task with no interaction for at least that many days is treated as abandoned and becomes eligible for cleanup; the default is 0 (disabled), preserving prior behavior and leaving in-flight workflows unaffected by a rolling deploy.
  • Wiring: The new parameter threads cleanly through env → activity config → sweep workflow → child workflow → clean_task activity → use case → service, with continue_as_new carrying it across sweep pages for consistency.

Confidence Score: 4/5

Safe to merge with the dry-run logging concern addressed; the stale-RUNNING default of 0 means existing deployments are unaffected until the override is explicitly set.

The _check_running_guard helper emits a WARNING log intended as a forensic record of live RUNNING-task overrides, but it is called unconditionally from preview_clean_task (dry-run mode) as well. A dry-run sweep over 121 stale RUNNING tasks would produce 121 spurious warnings, polluting the audit trail this feature is specifically designed to maintain for EY compliance. The bool-parsing hardening and end-to-end wiring are correct; no data correctness risk exists at the default configuration.

agentex/src/domain/services/task_retention_service.py — specifically the _check_running_guard method and how it interacts with the dry-run path in preview_clean_task.

Important Files Changed

Filename Overview
agentex/src/config/environment_variables.py Adds _parse_bool_env for strict true/false/1/0 parsing (replaces silent == "true" coercion), wires all three bool env vars through it, and adds RETENTION_CLEANUP_STALE_RUNNING_DAYS int field with default 0. Logic is correct and well-tested.
agentex/src/domain/services/task_retention_service.py Adds _check_running_guard helper and threads stale_running_days through clean_task and preview_clean_task. The WARNING forensic log fires in dry-run mode (false forensic records), and _is_task_idle is called twice for stale-RUNNING tasks.
agentex/src/temporal/activities/retention_cleanup_activities.py Adds stale_running_days to load_cleanup_config dict and clean_task activity signature; propagation is correct end-to-end.
agentex/src/temporal/workflows/retention_cleanup_workflow.py Correctly extracts stale_running_days from config args, passes it through child workflow dicts, and includes it in continue_as_new state so the value is consistent across all pages of a sweep.
agentex/tests/unit/services/test_task_retention_service.py Good coverage: refuses RUNNING by default, cleans stale-RUNNING with override, refuses when not stale enough, respects recent Mongo messages, and verifies preview doesn't write. All scenarios are distinct and well-structured.
agentex/tests/unit/config/test_retention_cleanup_env.py Tests case-insensitivity matrix, garbage-value rejection, and new STALE_RUNNING_DAYS default/parse. Complete and correct.
agentex/tests/unit/temporal/test_retention_cleanup_workflow.py Existing fake activities updated for new stale_running_days param; new test verifies end-to-end sweep→child→activity propagation of stale_running_days=30 from config.
agentex/docker-compose.yml Adds RETENTION_CLEANUP_STALE_RUNNING_DAYS env var with default 0 to both retention workers. Consistent with the new field.

Sequence Diagram

sequenceDiagram
    participant S as RetentionCleanupSweepWorkflow
    participant LC as load_cleanup_config activity
    participant FC as find_cleanup_candidates activity
    participant CW as RetentionCleanupTaskWorkflow (child)
    participant CA as clean_task activity
    participant UC as TaskRetentionUseCase
    participant SVC as TaskRetentionService

    S->>LC: (first page only) load config
    LC-->>S: "{enabled, idle_days, dry_run, stale_running_days, ...}"
    S->>FC: find_cleanup_candidates(after_id, idle_days, agent_names)
    FC-->>S: [task_ids] (no status filter — RUNNING tasks included)
    loop per batch (max_in_flight)
        S->>CW: "{task_id, idle_days, dry_run, stale_running_days}"
        CW->>CA: clean_task(task_id, idle_days, dry_run, stale_running_days)
        CA->>UC: clean_task / preview_clean_task(..., stale_running_days)
        UC->>SVC: clean_task(..., stale_running_days)
        SVC->>SVC: _check_running_guard(task, stale_running_days)
        alt "task RUNNING AND stale_running_days > 0 AND idle >= stale_running_days"
            SVC-->>SVC: log WARNING task_cleanup_stale_running_override
            SVC->>SVC: continue to idle + events checks then delete
        else task RUNNING not stale enough
            SVC-->>CA: raise ClientError skipped outcome
        end
        CA-->>CW: "CleanTaskOutcome status cleaned|skipped|failed"
        CW-->>S: outcome
    end
    S->>S: "continue_as_new({..., stale_running_days, after_id: page_last_id})"
Loading

Fix All in Cursor Fix All in Claude Code Fix All in Codex

Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
agentex/src/domain/services/task_retention_service.py:460-481
**Forensic WARNING fires during dry-runs / previews**

`_check_running_guard` is shared between `clean_task` (real deletion) and `preview_clean_task` (no writes). The `logger.warning("task_cleanup_stale_running_override", ...)` is documented as a forensic record that "cleaning a RUNNING task means we are declaring its workflow abandoned" — but it fires on every dry-run sweep as well. An operator running a dry-run to audit what _would_ be cleaned will produce WARNING entries that are indistinguishable from the entries produced by a live deletion run. Forensic tooling or alert rules keyed on this event name would fire false positives on every dry-run invocation.

### Issue 2 of 2
agentex/src/domain/services/task_retention_service.py:468-480
**Redundant MongoDB round-trip for stale-RUNNING tasks**

When a task is RUNNING and `stale_running_days > 0`, `_is_task_idle` is called here (inside `_check_running_guard`) with `stale_running_days` as the threshold. If the guard passes, `clean_task` immediately calls `_is_task_idle` a second time with `idle_days`. Because `stale_running_days` is always expected to be ≥ `idle_days` (e.g. 30 vs 7), the second call's result is always `True` when the first call returns `True`, so the second query is always redundant for RUNNING tasks that pass the stale check. Each call issues a MongoDB message-fetch — at scale this doubles the read traffic for every stale-RUNNING task in the sweep.

Reviews (1): Last reviewed commit: "feat(retention): strict bool env parsing..." | Re-trigger Greptile

Greptile also left 2 inline comments on this PR.

…ride

Two hardening changes to the scheduled task-retention cleanup ahead of the
EY rollout:

1. Strict boolean env parsing (_parse_bool_env). The previous pattern
   (os.environ.get(...) == "true") silently coerced any unrecognized value
   to False. For RETENTION_CLEANUP_DRY_RUN that failure mode is destructive:
   DRY_RUN=True (capital T, as YAML/Helm tooling tends to render booleans)
   meant dry_run=False, i.e. live deletion. Booleans now parse
   true/false/1/0 case-insensitively and raise on anything else, so a
   misconfigured worker refuses to run instead of deleting data.
   Applied to RETENTION_CLEANUP_DRY_RUN, RETENTION_CLEANUP_ENABLED, and
   ENABLE_HEALTH_CHECK_WORKFLOW (same pattern).

2. RETENTION_CLEANUP_STALE_RUNNING_DAYS (default 0 = disabled). The sweep
   refuses RUNNING tasks unconditionally, so tasks abandoned mid-run (agent
   crashed, workflow never reached a terminal state) are exempt from
   retention forever — in sgp-dev testing one agent had 121 such tasks,
   all unpurgeable. When this is set > 0, a RUNNING task with no
   interaction (same last-interaction definition as the idle check) for at
   least that many days is treated as abandoned and cleaned, with a
   WARNING-level forensic log per override. The unprocessed-events
   correctness guard still applies. Wired env -> load_cleanup_config ->
   sweep -> child workflow -> activity -> use case -> service; the new
   activity arg defaults to 0 so in-flight workflows are unaffected.
@gabrycina gabrycina requested a review from a team as a code owner June 12, 2026 13:03
Comment on lines +460 to +481
async def _check_running_guard(self, task, stale_running_days: int) -> None:
"""
Raise ClientError if `task` is RUNNING, unless the stale-RUNNING override
applies: stale_running_days > 0 and the task has had no interaction for
at least that many days (same last-interaction definition as
_is_task_idle). Overrides are logged at WARNING for forensics — cleaning
a RUNNING task means we are declaring its workflow abandoned.
"""
if task.status != TaskStatus.RUNNING:
return
if stale_running_days > 0 and await self._is_task_idle(
task, stale_running_days
):
logger.warning(
"task_cleanup_stale_running_override",
extra={
"task_id": task.id,
"stale_running_days": stale_running_days,
},
)
return
raise ClientError(f"Cannot clean task {task.id}: status is RUNNING (active)")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Forensic WARNING fires during dry-runs / previews

_check_running_guard is shared between clean_task (real deletion) and preview_clean_task (no writes). The logger.warning("task_cleanup_stale_running_override", ...) is documented as a forensic record that "cleaning a RUNNING task means we are declaring its workflow abandoned" — but it fires on every dry-run sweep as well. An operator running a dry-run to audit what would be cleaned will produce WARNING entries that are indistinguishable from the entries produced by a live deletion run. Forensic tooling or alert rules keyed on this event name would fire false positives on every dry-run invocation.

Prompt To Fix With AI
This is a comment left during a code review.
Path: agentex/src/domain/services/task_retention_service.py
Line: 460-481

Comment:
**Forensic WARNING fires during dry-runs / previews**

`_check_running_guard` is shared between `clean_task` (real deletion) and `preview_clean_task` (no writes). The `logger.warning("task_cleanup_stale_running_override", ...)` is documented as a forensic record that "cleaning a RUNNING task means we are declaring its workflow abandoned" — but it fires on every dry-run sweep as well. An operator running a dry-run to audit what _would_ be cleaned will produce WARNING entries that are indistinguishable from the entries produced by a live deletion run. Forensic tooling or alert rules keyed on this event name would fire false positives on every dry-run invocation.

How can I resolve this? If you propose a fix, please make it concise.

Fix in Cursor Fix in Claude Code Fix in Codex

Comment on lines +468 to +480
if task.status != TaskStatus.RUNNING:
return
if stale_running_days > 0 and await self._is_task_idle(
task, stale_running_days
):
logger.warning(
"task_cleanup_stale_running_override",
extra={
"task_id": task.id,
"stale_running_days": stale_running_days,
},
)
return

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Redundant MongoDB round-trip for stale-RUNNING tasks

When a task is RUNNING and stale_running_days > 0, _is_task_idle is called here (inside _check_running_guard) with stale_running_days as the threshold. If the guard passes, clean_task immediately calls _is_task_idle a second time with idle_days. Because stale_running_days is always expected to be ≥ idle_days (e.g. 30 vs 7), the second call's result is always True when the first call returns True, so the second query is always redundant for RUNNING tasks that pass the stale check. Each call issues a MongoDB message-fetch — at scale this doubles the read traffic for every stale-RUNNING task in the sweep.

Prompt To Fix With AI
This is a comment left during a code review.
Path: agentex/src/domain/services/task_retention_service.py
Line: 468-480

Comment:
**Redundant MongoDB round-trip for stale-RUNNING tasks**

When a task is RUNNING and `stale_running_days > 0`, `_is_task_idle` is called here (inside `_check_running_guard`) with `stale_running_days` as the threshold. If the guard passes, `clean_task` immediately calls `_is_task_idle` a second time with `idle_days`. Because `stale_running_days` is always expected to be ≥ `idle_days` (e.g. 30 vs 7), the second call's result is always `True` when the first call returns `True`, so the second query is always redundant for RUNNING tasks that pass the stale check. Each call issues a MongoDB message-fetch — at scale this doubles the read traffic for every stale-RUNNING task in the sweep.

How can I resolve this? If you propose a fix, please make it concise.

Fix in Cursor Fix in Claude Code Fix in Codex

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant