-
Notifications
You must be signed in to change notification settings - Fork 49
feat(retention): strict bool env parsing + stale-RUNNING cleanup override #306
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -177,6 +177,7 @@ async def clean_task( | |
| *, | ||
| enforce_idle_threshold: bool = True, | ||
| idle_days: int = 7, | ||
| stale_running_days: int = 0, | ||
| ) -> TaskCleanupResultEntity: | ||
| """ | ||
| Delete content-bearing rows for a stale task. Idempotent: re-running on a | ||
|
|
@@ -189,9 +190,18 @@ async def clean_task( | |
| scheduled Temporal sweep always sets True. The admin endpoint | ||
| accepts a force=true flag that flips this to False. | ||
| idle_days: Idle threshold in days (when enforce_idle_threshold=True). | ||
| stale_running_days: When > 0, a task whose status is RUNNING but | ||
| whose last interaction is at least this many days old is treated | ||
| as abandoned and may be cleaned. 0 (default) keeps the strict | ||
| behavior: RUNNING tasks are never cleaned. Tasks that hang in | ||
| RUNNING forever (agent crashed mid-run, workflow never reached a | ||
| terminal state) would otherwise be exempt from retention | ||
| indefinitely, defeating the policy for exactly the data most | ||
| likely to be forgotten. | ||
|
|
||
| Refuses (raises) if: | ||
| - task is currently active (status == RUNNING). | ||
| - task is currently active (status == RUNNING) and not stale per | ||
| stale_running_days. | ||
| - enforce_idle_threshold=True and the task is not idle long enough. | ||
| - unprocessed events exist past agent_task_tracker cursors. | ||
|
|
||
|
|
@@ -236,10 +246,7 @@ async def clean_task( | |
| ) | ||
|
|
||
| # 2. Status + idle threshold guards. | ||
| if task.status == TaskStatus.RUNNING: | ||
| raise ClientError( | ||
| f"Cannot clean task {task_id}: status is RUNNING (active)" | ||
| ) | ||
| await self._check_running_guard(task, stale_running_days) | ||
| if enforce_idle_threshold and not await self._is_task_idle(task, idle_days): | ||
| raise ClientError( | ||
| f"Cannot clean task {task_id}: not idle for {idle_days} days " | ||
|
|
@@ -293,6 +300,7 @@ async def preview_clean_task( | |
| *, | ||
| enforce_idle_threshold: bool = True, | ||
| idle_days: int = 7, | ||
| stale_running_days: int = 0, | ||
| ) -> TaskCleanupResultEntity: | ||
| """ | ||
| Run the same safety checks as clean_task without deleting or updating data. | ||
|
|
@@ -306,10 +314,7 @@ async def preview_clean_task( | |
| if task.cleaned_at is not None: | ||
| cleaned_at = task.cleaned_at | ||
| else: | ||
| if task.status == TaskStatus.RUNNING: | ||
| raise ClientError( | ||
| f"Cannot clean task {task_id}: status is RUNNING (active)" | ||
| ) | ||
| await self._check_running_guard(task, stale_running_days) | ||
| if enforce_idle_threshold and not await self._is_task_idle(task, idle_days): | ||
| raise ClientError( | ||
| f"Cannot clean task {task_id}: not idle for {idle_days} days " | ||
|
|
@@ -452,6 +457,29 @@ async def rehydrate_task( | |
|
|
||
| # ---- internal helpers ---- | ||
|
|
||
| async def _check_running_guard(self, task, stale_running_days: int) -> None: | ||
| """ | ||
| Raise ClientError if `task` is RUNNING, unless the stale-RUNNING override | ||
| applies: stale_running_days > 0 and the task has had no interaction for | ||
| at least that many days (same last-interaction definition as | ||
| _is_task_idle). Overrides are logged at WARNING for forensics — cleaning | ||
| a RUNNING task means we are declaring its workflow abandoned. | ||
| """ | ||
| if task.status != TaskStatus.RUNNING: | ||
| return | ||
| if stale_running_days > 0 and await self._is_task_idle( | ||
| task, stale_running_days | ||
| ): | ||
| logger.warning( | ||
| "task_cleanup_stale_running_override", | ||
| extra={ | ||
| "task_id": task.id, | ||
| "stale_running_days": stale_running_days, | ||
| }, | ||
| ) | ||
| return | ||
| raise ClientError(f"Cannot clean task {task.id}: status is RUNNING (active)") | ||
|
Comment on lines
+460
to
+481
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Prompt To Fix With AIThis is a comment left during a code review.
Path: agentex/src/domain/services/task_retention_service.py
Line: 460-481
Comment:
**Forensic WARNING fires during dry-runs / previews**
`_check_running_guard` is shared between `clean_task` (real deletion) and `preview_clean_task` (no writes). The `logger.warning("task_cleanup_stale_running_override", ...)` is documented as a forensic record that "cleaning a RUNNING task means we are declaring its workflow abandoned" — but it fires on every dry-run sweep as well. An operator running a dry-run to audit what _would_ be cleaned will produce WARNING entries that are indistinguishable from the entries produced by a live deletion run. Forensic tooling or alert rules keyed on this event name would fire false positives on every dry-run invocation.
How can I resolve this? If you propose a fix, please make it concise. |
||
|
|
||
| async def _is_task_idle(self, task, idle_days: int) -> bool: | ||
| """ | ||
| True iff the task has no interaction within the idle window. | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When a task is RUNNING and
stale_running_days > 0,_is_task_idleis called here (inside_check_running_guard) withstale_running_daysas the threshold. If the guard passes,clean_taskimmediately calls_is_task_idlea second time withidle_days. Becausestale_running_daysis always expected to be ≥idle_days(e.g. 30 vs 7), the second call's result is alwaysTruewhen the first call returnsTrue, so the second query is always redundant for RUNNING tasks that pass the stale check. Each call issues a MongoDB message-fetch — at scale this doubles the read traffic for every stale-RUNNING task in the sweep.Prompt To Fix With AI