feat(retention): strict bool env parsing + stale-RUNNING cleanup override by gabrycina · Pull Request #306 · scaleapi/scale-agentex

gabrycina · 2026-06-12T13:03:23Z

Two hardening changes to the scheduled task-retention cleanup (#272 follow-up) ahead of the EY rollout. Context: EY data-retention thread + the sgp-dev verification doc.

1. Strict boolean env parsing

RETENTION_CLEANUP_DRY_RUN was parsed with os.environ.get(...) == "true", so any unrecognized value silently meant dry_run=False → live deletion. DRY_RUN=True (capital T — exactly what YAML/Helm tooling tends to render) was the documented gotcha in the sgp-dev test writeup; in the EY environment this config flows through system-manager value_overrides → Helm → env, which makes that footgun very real.

Booleans now accept true/false/1/0 case-insensitively and raise on anything else, so a misconfigured worker refuses to run instead of deleting data. Applied to RETENTION_CLEANUP_DRY_RUN, RETENTION_CLEANUP_ENABLED, and ENABLE_HEALTH_CHECK_WORKFLOW (same pattern).

2. `RETENTION_CLEANUP_STALE_RUNNING_DAYS` (default `0` = disabled, current behavior unchanged)

The sweep unconditionally refuses RUNNING tasks. Tasks abandoned mid-run (agent crashed, workflow never reached terminal state) therefore stay in Mongo/Postgres forever — the sgp-dev test surfaced 121 such tasks on a single agent (test-v2: cleaned 0, skipped 121). For a retention-compliance feature that's a structural gap: the data most likely to be forgotten is precisely the data the policy can never touch.

When set > 0, a RUNNING task with no interaction for at least that many days (same last-interaction definition as the idle check: max(task.updated_at, latest message)) is treated as abandoned and becomes cleanable. Each override emits a WARNING forensic log. The unprocessed-events correctness guard is unchanged. Recommended setting for OneEdge/EY: 30.

Wiring: env → load_cleanup_config → sweep → child workflow → clean_task activity → use case → service. The new activity parameter defaults to 0, so in-flight workflows started before a deploy are unaffected.

Tests

env parsing: case-insensitivity matrix, garbage → ValueError, new var default/parse
service: RUNNING refused by default; stale-RUNNING cleaned with override; idle-but-not-stale-enough refused; recent Mongo message blocks override; preview honors override without writes
workflow: stale_running_days propagation sweep → child → activity; existing fakes updated for the new activity arg
uv run --group test pytest tests/unit/...retention... → 36 passed; ruff + ruff-format clean

cc @stas

Greptile Summary

This PR adds two hardening changes to the scheduled task-retention cleanup ahead of the EY rollout: strict boolean env parsing that raises on unrecognized values (replacing silent == "true" coercion that caused DRY_RUN=True to silently enable live deletion), and a new RETENTION_CLEANUP_STALE_RUNNING_DAYS opt-in that allows abandoned RUNNING tasks to be cleaned after a configurable inactivity period.

Strict bool parsing: _parse_bool_env now accepts true/false/1/0 case-insensitively and raises ValueError on anything else; applied to RETENTION_CLEANUP_DRY_RUN, RETENTION_CLEANUP_ENABLED, and ENABLE_HEALTH_CHECK_WORKFLOW. Workers with a misconfigured value will refuse to start rather than silently flip to live-deletion mode.
Stale-RUNNING override: When RETENTION_CLEANUP_STALE_RUNNING_DAYS > 0, a RUNNING task with no interaction for at least that many days is treated as abandoned and becomes eligible for cleanup; the default is 0 (disabled), preserving prior behavior and leaving in-flight workflows unaffected by a rolling deploy.
Wiring: The new parameter threads cleanly through env → activity config → sweep workflow → child workflow → clean_task activity → use case → service, with continue_as_new carrying it across sweep pages for consistency.

Confidence Score: 4/5

Safe to merge with the dry-run logging concern addressed; the stale-RUNNING default of 0 means existing deployments are unaffected until the override is explicitly set.

The _check_running_guard helper emits a WARNING log intended as a forensic record of live RUNNING-task overrides, but it is called unconditionally from preview_clean_task (dry-run mode) as well. A dry-run sweep over 121 stale RUNNING tasks would produce 121 spurious warnings, polluting the audit trail this feature is specifically designed to maintain for EY compliance. The bool-parsing hardening and end-to-end wiring are correct; no data correctness risk exists at the default configuration.

agentex/src/domain/services/task_retention_service.py — specifically the _check_running_guard method and how it interacts with the dry-run path in preview_clean_task.

Important Files Changed

Filename	Overview
agentex/src/config/environment_variables.py	Adds `_parse_bool_env` for strict true/false/1/0 parsing (replaces silent `== "true"` coercion), wires all three bool env vars through it, and adds `RETENTION_CLEANUP_STALE_RUNNING_DAYS` int field with default 0. Logic is correct and well-tested.
agentex/src/domain/services/task_retention_service.py	Adds `_check_running_guard` helper and threads `stale_running_days` through `clean_task` and `preview_clean_task`. The WARNING forensic log fires in dry-run mode (false forensic records), and `_is_task_idle` is called twice for stale-RUNNING tasks.
agentex/src/temporal/activities/retention_cleanup_activities.py	Adds `stale_running_days` to `load_cleanup_config` dict and `clean_task` activity signature; propagation is correct end-to-end.
agentex/src/temporal/workflows/retention_cleanup_workflow.py	Correctly extracts `stale_running_days` from config args, passes it through child workflow dicts, and includes it in `continue_as_new` state so the value is consistent across all pages of a sweep.
agentex/tests/unit/services/test_task_retention_service.py	Good coverage: refuses RUNNING by default, cleans stale-RUNNING with override, refuses when not stale enough, respects recent Mongo messages, and verifies preview doesn't write. All scenarios are distinct and well-structured.
agentex/tests/unit/config/test_retention_cleanup_env.py	Tests case-insensitivity matrix, garbage-value rejection, and new STALE_RUNNING_DAYS default/parse. Complete and correct.
agentex/tests/unit/temporal/test_retention_cleanup_workflow.py	Existing fake activities updated for new `stale_running_days` param; new test verifies end-to-end sweep→child→activity propagation of `stale_running_days=30` from config.
agentex/docker-compose.yml	Adds `RETENTION_CLEANUP_STALE_RUNNING_DAYS` env var with default 0 to both retention workers. Consistent with the new field.

Sequence Diagram

sequenceDiagram
    participant S as RetentionCleanupSweepWorkflow
    participant LC as load_cleanup_config activity
    participant FC as find_cleanup_candidates activity
    participant CW as RetentionCleanupTaskWorkflow (child)
    participant CA as clean_task activity
    participant UC as TaskRetentionUseCase
    participant SVC as TaskRetentionService

    S->>LC: (first page only) load config
    LC-->>S: "{enabled, idle_days, dry_run, stale_running_days, ...}"
    S->>FC: find_cleanup_candidates(after_id, idle_days, agent_names)
    FC-->>S: [task_ids] (no status filter — RUNNING tasks included)
    loop per batch (max_in_flight)
        S->>CW: "{task_id, idle_days, dry_run, stale_running_days}"
        CW->>CA: clean_task(task_id, idle_days, dry_run, stale_running_days)
        CA->>UC: clean_task / preview_clean_task(..., stale_running_days)
        UC->>SVC: clean_task(..., stale_running_days)
        SVC->>SVC: _check_running_guard(task, stale_running_days)
        alt "task RUNNING AND stale_running_days > 0 AND idle >= stale_running_days"
            SVC-->>SVC: log WARNING task_cleanup_stale_running_override
            SVC->>SVC: continue to idle + events checks then delete
        else task RUNNING not stale enough
            SVC-->>CA: raise ClientError skipped outcome
        end
        CA-->>CW: "CleanTaskOutcome status cleaned|skipped|failed"
        CW-->>S: outcome
    end
    S->>S: "continue_as_new({..., stale_running_days, after_id: page_last_id})"

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
agentex/src/domain/services/task_retention_service.py:460-481
**Forensic WARNING fires during dry-runs / previews**

`_check_running_guard` is shared between `clean_task` (real deletion) and `preview_clean_task` (no writes). The `logger.warning("task_cleanup_stale_running_override", ...)` is documented as a forensic record that "cleaning a RUNNING task means we are declaring its workflow abandoned" — but it fires on every dry-run sweep as well. An operator running a dry-run to audit what _would_ be cleaned will produce WARNING entries that are indistinguishable from the entries produced by a live deletion run. Forensic tooling or alert rules keyed on this event name would fire false positives on every dry-run invocation.

### Issue 2 of 2
agentex/src/domain/services/task_retention_service.py:468-480
**Redundant MongoDB round-trip for stale-RUNNING tasks**

When a task is RUNNING and `stale_running_days > 0`, `_is_task_idle` is called here (inside `_check_running_guard`) with `stale_running_days` as the threshold. If the guard passes, `clean_task` immediately calls `_is_task_idle` a second time with `idle_days`. Because `stale_running_days` is always expected to be ≥ `idle_days` (e.g. 30 vs 7), the second call's result is always `True` when the first call returns `True`, so the second query is always redundant for RUNNING tasks that pass the stale check. Each call issues a MongoDB message-fetch — at scale this doubles the read traffic for every stale-RUNNING task in the sweep.

_{Reviews (1): Last reviewed commit: "feat(retention): strict bool env parsing..." | Re-trigger Greptile}

Greptile also left 2 inline comments on this PR.

…ride Two hardening changes to the scheduled task-retention cleanup ahead of the EY rollout: 1. Strict boolean env parsing (_parse_bool_env). The previous pattern (os.environ.get(...) == "true") silently coerced any unrecognized value to False. For RETENTION_CLEANUP_DRY_RUN that failure mode is destructive: DRY_RUN=True (capital T, as YAML/Helm tooling tends to render booleans) meant dry_run=False, i.e. live deletion. Booleans now parse true/false/1/0 case-insensitively and raise on anything else, so a misconfigured worker refuses to run instead of deleting data. Applied to RETENTION_CLEANUP_DRY_RUN, RETENTION_CLEANUP_ENABLED, and ENABLE_HEALTH_CHECK_WORKFLOW (same pattern). 2. RETENTION_CLEANUP_STALE_RUNNING_DAYS (default 0 = disabled). The sweep refuses RUNNING tasks unconditionally, so tasks abandoned mid-run (agent crashed, workflow never reached a terminal state) are exempt from retention forever — in sgp-dev testing one agent had 121 such tasks, all unpurgeable. When this is set > 0, a RUNNING task with no interaction (same last-interaction definition as the idle check) for at least that many days is treated as abandoned and cleaned, with a WARNING-level forensic log per override. The unprocessed-events correctness guard still applies. Wired env -> load_cleanup_config -> sweep -> child workflow -> activity -> use case -> service; the new activity arg defaults to 0 so in-flight workflows are unaffected.

greptile-apps · 2026-06-12T13:08:36Z

+    async def _check_running_guard(self, task, stale_running_days: int) -> None:
+        """
+        Raise ClientError if `task` is RUNNING, unless the stale-RUNNING override
+        applies: stale_running_days > 0 and the task has had no interaction for
+        at least that many days (same last-interaction definition as
+        _is_task_idle). Overrides are logged at WARNING for forensics — cleaning
+        a RUNNING task means we are declaring its workflow abandoned.
+        """
+        if task.status != TaskStatus.RUNNING:
+            return
+        if stale_running_days > 0 and await self._is_task_idle(
+            task, stale_running_days
+        ):
+            logger.warning(
+                "task_cleanup_stale_running_override",
+                extra={
+                    "task_id": task.id,
+                    "stale_running_days": stale_running_days,
+                },
+            )
+            return
+        raise ClientError(f"Cannot clean task {task.id}: status is RUNNING (active)")


Forensic WARNING fires during dry-runs / previews

_check_running_guard is shared between clean_task (real deletion) and preview_clean_task (no writes). The logger.warning("task_cleanup_stale_running_override", ...) is documented as a forensic record that "cleaning a RUNNING task means we are declaring its workflow abandoned" — but it fires on every dry-run sweep as well. An operator running a dry-run to audit what would be cleaned will produce WARNING entries that are indistinguishable from the entries produced by a live deletion run. Forensic tooling or alert rules keyed on this event name would fire false positives on every dry-run invocation.

Prompt To Fix With AI

This is a comment left during a code review. Path: agentex/src/domain/services/task_retention_service.py Line: 460-481 Comment: **Forensic WARNING fires during dry-runs / previews** `_check_running_guard` is shared between `clean_task` (real deletion) and `preview_clean_task` (no writes). The `logger.warning("task_cleanup_stale_running_override", ...)` is documented as a forensic record that "cleaning a RUNNING task means we are declaring its workflow abandoned" — but it fires on every dry-run sweep as well. An operator running a dry-run to audit what _would_ be cleaned will produce WARNING entries that are indistinguishable from the entries produced by a live deletion run. Forensic tooling or alert rules keyed on this event name would fire false positives on every dry-run invocation. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-06-12T13:08:37Z

+        if task.status != TaskStatus.RUNNING:
+            return
+        if stale_running_days > 0 and await self._is_task_idle(
+            task, stale_running_days
+        ):
+            logger.warning(
+                "task_cleanup_stale_running_override",
+                extra={
+                    "task_id": task.id,
+                    "stale_running_days": stale_running_days,
+                },
+            )
+            return


Redundant MongoDB round-trip for stale-RUNNING tasks

When a task is RUNNING and stale_running_days > 0, _is_task_idle is called here (inside _check_running_guard) with stale_running_days as the threshold. If the guard passes, clean_task immediately calls _is_task_idle a second time with idle_days. Because stale_running_days is always expected to be ≥ idle_days (e.g. 30 vs 7), the second call's result is always True when the first call returns True, so the second query is always redundant for RUNNING tasks that pass the stale check. Each call issues a MongoDB message-fetch — at scale this doubles the read traffic for every stale-RUNNING task in the sweep.

Prompt To Fix With AI

This is a comment left during a code review. Path: agentex/src/domain/services/task_retention_service.py Line: 468-480 Comment: **Redundant MongoDB round-trip for stale-RUNNING tasks** When a task is RUNNING and `stale_running_days > 0`, `_is_task_idle` is called here (inside `_check_running_guard`) with `stale_running_days` as the threshold. If the guard passes, `clean_task` immediately calls `_is_task_idle` a second time with `idle_days`. Because `stale_running_days` is always expected to be ≥ `idle_days` (e.g. 30 vs 7), the second call's result is always `True` when the first call returns `True`, so the second query is always redundant for RUNNING tasks that pass the stale check. Each call issues a MongoDB message-fetch — at scale this doubles the read traffic for every stale-RUNNING task in the sweep. How can I resolve this? If you propose a fix, please make it concise.

gabrycina requested a review from a team as a code owner June 12, 2026 13:03

greptile-apps Bot reviewed Jun 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(retention): strict bool env parsing + stale-RUNNING cleanup override#306

feat(retention): strict bool env parsing + stale-RUNNING cleanup override#306
gabrycina wants to merge 1 commit into
mainfrom
retention/stale-running-and-strict-bool-env

gabrycina commented Jun 12, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

greptile-apps Bot Jun 12, 2026

Uh oh!

greptile-apps Bot Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

gabrycina commented Jun 12, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

1. Strict boolean env parsing

2. RETENTION_CLEANUP_STALE_RUNNING_DAYS (default 0 = disabled, current behavior unchanged)

Tests

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

gabrycina commented Jun 12, 2026 •

edited by greptile-apps Bot

Loading

2. `RETENTION_CLEANUP_STALE_RUNNING_DAYS` (default `0` = disabled, current behavior unchanged)