feat(retention): strict bool env parsing + stale-RUNNING cleanup override#306
feat(retention): strict bool env parsing + stale-RUNNING cleanup override#306gabrycina wants to merge 1 commit into
Conversation
…ride Two hardening changes to the scheduled task-retention cleanup ahead of the EY rollout: 1. Strict boolean env parsing (_parse_bool_env). The previous pattern (os.environ.get(...) == "true") silently coerced any unrecognized value to False. For RETENTION_CLEANUP_DRY_RUN that failure mode is destructive: DRY_RUN=True (capital T, as YAML/Helm tooling tends to render booleans) meant dry_run=False, i.e. live deletion. Booleans now parse true/false/1/0 case-insensitively and raise on anything else, so a misconfigured worker refuses to run instead of deleting data. Applied to RETENTION_CLEANUP_DRY_RUN, RETENTION_CLEANUP_ENABLED, and ENABLE_HEALTH_CHECK_WORKFLOW (same pattern). 2. RETENTION_CLEANUP_STALE_RUNNING_DAYS (default 0 = disabled). The sweep refuses RUNNING tasks unconditionally, so tasks abandoned mid-run (agent crashed, workflow never reached a terminal state) are exempt from retention forever — in sgp-dev testing one agent had 121 such tasks, all unpurgeable. When this is set > 0, a RUNNING task with no interaction (same last-interaction definition as the idle check) for at least that many days is treated as abandoned and cleaned, with a WARNING-level forensic log per override. The unprocessed-events correctness guard still applies. Wired env -> load_cleanup_config -> sweep -> child workflow -> activity -> use case -> service; the new activity arg defaults to 0 so in-flight workflows are unaffected.
| async def _check_running_guard(self, task, stale_running_days: int) -> None: | ||
| """ | ||
| Raise ClientError if `task` is RUNNING, unless the stale-RUNNING override | ||
| applies: stale_running_days > 0 and the task has had no interaction for | ||
| at least that many days (same last-interaction definition as | ||
| _is_task_idle). Overrides are logged at WARNING for forensics — cleaning | ||
| a RUNNING task means we are declaring its workflow abandoned. | ||
| """ | ||
| if task.status != TaskStatus.RUNNING: | ||
| return | ||
| if stale_running_days > 0 and await self._is_task_idle( | ||
| task, stale_running_days | ||
| ): | ||
| logger.warning( | ||
| "task_cleanup_stale_running_override", | ||
| extra={ | ||
| "task_id": task.id, | ||
| "stale_running_days": stale_running_days, | ||
| }, | ||
| ) | ||
| return | ||
| raise ClientError(f"Cannot clean task {task.id}: status is RUNNING (active)") |
There was a problem hiding this comment.
Forensic WARNING fires during dry-runs / previews
_check_running_guard is shared between clean_task (real deletion) and preview_clean_task (no writes). The logger.warning("task_cleanup_stale_running_override", ...) is documented as a forensic record that "cleaning a RUNNING task means we are declaring its workflow abandoned" — but it fires on every dry-run sweep as well. An operator running a dry-run to audit what would be cleaned will produce WARNING entries that are indistinguishable from the entries produced by a live deletion run. Forensic tooling or alert rules keyed on this event name would fire false positives on every dry-run invocation.
Prompt To Fix With AI
This is a comment left during a code review.
Path: agentex/src/domain/services/task_retention_service.py
Line: 460-481
Comment:
**Forensic WARNING fires during dry-runs / previews**
`_check_running_guard` is shared between `clean_task` (real deletion) and `preview_clean_task` (no writes). The `logger.warning("task_cleanup_stale_running_override", ...)` is documented as a forensic record that "cleaning a RUNNING task means we are declaring its workflow abandoned" — but it fires on every dry-run sweep as well. An operator running a dry-run to audit what _would_ be cleaned will produce WARNING entries that are indistinguishable from the entries produced by a live deletion run. Forensic tooling or alert rules keyed on this event name would fire false positives on every dry-run invocation.
How can I resolve this? If you propose a fix, please make it concise.| if task.status != TaskStatus.RUNNING: | ||
| return | ||
| if stale_running_days > 0 and await self._is_task_idle( | ||
| task, stale_running_days | ||
| ): | ||
| logger.warning( | ||
| "task_cleanup_stale_running_override", | ||
| extra={ | ||
| "task_id": task.id, | ||
| "stale_running_days": stale_running_days, | ||
| }, | ||
| ) | ||
| return |
There was a problem hiding this comment.
Redundant MongoDB round-trip for stale-RUNNING tasks
When a task is RUNNING and stale_running_days > 0, _is_task_idle is called here (inside _check_running_guard) with stale_running_days as the threshold. If the guard passes, clean_task immediately calls _is_task_idle a second time with idle_days. Because stale_running_days is always expected to be ≥ idle_days (e.g. 30 vs 7), the second call's result is always True when the first call returns True, so the second query is always redundant for RUNNING tasks that pass the stale check. Each call issues a MongoDB message-fetch — at scale this doubles the read traffic for every stale-RUNNING task in the sweep.
Prompt To Fix With AI
This is a comment left during a code review.
Path: agentex/src/domain/services/task_retention_service.py
Line: 468-480
Comment:
**Redundant MongoDB round-trip for stale-RUNNING tasks**
When a task is RUNNING and `stale_running_days > 0`, `_is_task_idle` is called here (inside `_check_running_guard`) with `stale_running_days` as the threshold. If the guard passes, `clean_task` immediately calls `_is_task_idle` a second time with `idle_days`. Because `stale_running_days` is always expected to be ≥ `idle_days` (e.g. 30 vs 7), the second call's result is always `True` when the first call returns `True`, so the second query is always redundant for RUNNING tasks that pass the stale check. Each call issues a MongoDB message-fetch — at scale this doubles the read traffic for every stale-RUNNING task in the sweep.
How can I resolve this? If you propose a fix, please make it concise.
Two hardening changes to the scheduled task-retention cleanup (#272 follow-up) ahead of the EY rollout. Context: EY data-retention thread + the sgp-dev verification doc.
1. Strict boolean env parsing
RETENTION_CLEANUP_DRY_RUNwas parsed withos.environ.get(...) == "true", so any unrecognized value silently meantdry_run=False→ live deletion.DRY_RUN=True(capital T — exactly what YAML/Helm tooling tends to render) was the documented gotcha in the sgp-dev test writeup; in the EY environment this config flows through system-manager value_overrides → Helm → env, which makes that footgun very real.Booleans now accept
true/false/1/0case-insensitively and raise on anything else, so a misconfigured worker refuses to run instead of deleting data. Applied toRETENTION_CLEANUP_DRY_RUN,RETENTION_CLEANUP_ENABLED, andENABLE_HEALTH_CHECK_WORKFLOW(same pattern).2.
RETENTION_CLEANUP_STALE_RUNNING_DAYS(default0= disabled, current behavior unchanged)The sweep unconditionally refuses RUNNING tasks. Tasks abandoned mid-run (agent crashed, workflow never reached terminal state) therefore stay in Mongo/Postgres forever — the sgp-dev test surfaced 121 such tasks on a single agent (
test-v2: cleaned 0, skipped 121). For a retention-compliance feature that's a structural gap: the data most likely to be forgotten is precisely the data the policy can never touch.When set
> 0, a RUNNING task with no interaction for at least that many days (same last-interaction definition as the idle check:max(task.updated_at, latest message)) is treated as abandoned and becomes cleanable. Each override emits a WARNING forensic log. The unprocessed-events correctness guard is unchanged. Recommended setting for OneEdge/EY:30.Wiring: env →
load_cleanup_config→ sweep → child workflow →clean_taskactivity → use case → service. The new activity parameter defaults to0, so in-flight workflows started before a deploy are unaffected.Tests
ValueError, new var default/parsestale_running_dayspropagation sweep → child → activity; existing fakes updated for the new activity arguv run --group test pytest tests/unit/...retention...→ 36 passed; ruff + ruff-format cleancc @stas
Greptile Summary
This PR adds two hardening changes to the scheduled task-retention cleanup ahead of the EY rollout: strict boolean env parsing that raises on unrecognized values (replacing silent
== "true"coercion that causedDRY_RUN=Trueto silently enable live deletion), and a newRETENTION_CLEANUP_STALE_RUNNING_DAYSopt-in that allows abandoned RUNNING tasks to be cleaned after a configurable inactivity period._parse_bool_envnow acceptstrue/false/1/0case-insensitively and raisesValueErroron anything else; applied toRETENTION_CLEANUP_DRY_RUN,RETENTION_CLEANUP_ENABLED, andENABLE_HEALTH_CHECK_WORKFLOW. Workers with a misconfigured value will refuse to start rather than silently flip to live-deletion mode.RETENTION_CLEANUP_STALE_RUNNING_DAYS > 0, a RUNNING task with no interaction for at least that many days is treated as abandoned and becomes eligible for cleanup; the default is0(disabled), preserving prior behavior and leaving in-flight workflows unaffected by a rolling deploy.clean_taskactivity → use case → service, withcontinue_as_newcarrying it across sweep pages for consistency.Confidence Score: 4/5
Safe to merge with the dry-run logging concern addressed; the stale-RUNNING default of 0 means existing deployments are unaffected until the override is explicitly set.
The
_check_running_guardhelper emits a WARNING log intended as a forensic record of live RUNNING-task overrides, but it is called unconditionally frompreview_clean_task(dry-run mode) as well. A dry-run sweep over 121 stale RUNNING tasks would produce 121 spurious warnings, polluting the audit trail this feature is specifically designed to maintain for EY compliance. The bool-parsing hardening and end-to-end wiring are correct; no data correctness risk exists at the default configuration.agentex/src/domain/services/task_retention_service.py — specifically the
_check_running_guardmethod and how it interacts with the dry-run path inpreview_clean_task.Important Files Changed
_parse_bool_envfor strict true/false/1/0 parsing (replaces silent== "true"coercion), wires all three bool env vars through it, and addsRETENTION_CLEANUP_STALE_RUNNING_DAYSint field with default 0. Logic is correct and well-tested._check_running_guardhelper and threadsstale_running_daysthroughclean_taskandpreview_clean_task. The WARNING forensic log fires in dry-run mode (false forensic records), and_is_task_idleis called twice for stale-RUNNING tasks.stale_running_daystoload_cleanup_configdict andclean_taskactivity signature; propagation is correct end-to-end.stale_running_daysfrom config args, passes it through child workflow dicts, and includes it incontinue_as_newstate so the value is consistent across all pages of a sweep.stale_running_daysparam; new test verifies end-to-end sweep→child→activity propagation ofstale_running_days=30from config.RETENTION_CLEANUP_STALE_RUNNING_DAYSenv var with default 0 to both retention workers. Consistent with the new field.Sequence Diagram
sequenceDiagram participant S as RetentionCleanupSweepWorkflow participant LC as load_cleanup_config activity participant FC as find_cleanup_candidates activity participant CW as RetentionCleanupTaskWorkflow (child) participant CA as clean_task activity participant UC as TaskRetentionUseCase participant SVC as TaskRetentionService S->>LC: (first page only) load config LC-->>S: "{enabled, idle_days, dry_run, stale_running_days, ...}" S->>FC: find_cleanup_candidates(after_id, idle_days, agent_names) FC-->>S: [task_ids] (no status filter — RUNNING tasks included) loop per batch (max_in_flight) S->>CW: "{task_id, idle_days, dry_run, stale_running_days}" CW->>CA: clean_task(task_id, idle_days, dry_run, stale_running_days) CA->>UC: clean_task / preview_clean_task(..., stale_running_days) UC->>SVC: clean_task(..., stale_running_days) SVC->>SVC: _check_running_guard(task, stale_running_days) alt "task RUNNING AND stale_running_days > 0 AND idle >= stale_running_days" SVC-->>SVC: log WARNING task_cleanup_stale_running_override SVC->>SVC: continue to idle + events checks then delete else task RUNNING not stale enough SVC-->>CA: raise ClientError skipped outcome end CA-->>CW: "CleanTaskOutcome status cleaned|skipped|failed" CW-->>S: outcome end S->>S: "continue_as_new({..., stale_running_days, after_id: page_last_id})"Prompt To Fix All With AI
Reviews (1): Last reviewed commit: "feat(retention): strict bool env parsing..." | Re-trigger Greptile