|
| 1 | +# Exception and Cleanup Findings (2026-05-18) |
| 2 | + |
| 3 | +This document captures a focused review of exception handling and shutdown/cleanup behavior in the Python SDK runtime lifecycle. |
| 4 | + |
| 5 | +## Scope |
| 6 | + |
| 7 | +- Command provider session lifecycle |
| 8 | +- Command consumer session lifecycle |
| 9 | +- Context startup/shutdown orchestration |
| 10 | +- Hook exception isolation |
| 11 | + |
| 12 | +## Findings Already Handled |
| 13 | + |
| 14 | +### 1. Consumer hook exception isolation is robust |
| 15 | + |
| 16 | +- `on_status`, `on_ack`, `on_exec_status`, and `on_terminal` hook exceptions are caught and logged. |
| 17 | +- Consumer session cleanup still completes even if `on_terminal` raises. |
| 18 | +- Existing tests cover this behavior (`test_consumer_hook_isolation.py`). |
| 19 | + |
| 20 | +### 2. Shutdown is idempotent |
| 21 | + |
| 22 | +- `DDSContext.shutdown()` uses `_is_shutdown` guard and returns on repeated calls. |
| 23 | + |
| 24 | +### 3. Shutdown order is correct for dispatcher safety |
| 25 | + |
| 26 | +- Dispatcher is stopped before task cancellation and DDS entity teardown. |
| 27 | +- Service-level `close()` is used for logical cleanup only. |
| 28 | + |
| 29 | +### 4. Context-level service close failures are contained |
| 30 | + |
| 31 | +- `DDSContext.shutdown()` catches/logs service `close()` exceptions and continues teardown. |
| 32 | + |
| 33 | +## Open Findings (Not Yet Fully Handled) |
| 34 | + |
| 35 | +### 1. Provider `on_terminal` exceptions can skip provider session cleanup |
| 36 | + |
| 37 | +In `CommandProviderSession.run()` finalization, `await self._provider.on_terminal(self)` executes before: |
| 38 | + |
| 39 | +- provider instance disposal |
| 40 | +- active-session map removal |
| 41 | + |
| 42 | +If `on_terminal()` raises, disposal and map cleanup may be skipped for that session. |
| 43 | + |
| 44 | +Risk: |
| 45 | + |
| 46 | +- lingering `_active_sessions` entries |
| 47 | +- missed instance disposal |
| 48 | +- shutdown path inconsistencies under hook failure |
| 49 | + |
| 50 | +### 2. `run_until_shutdown()` can double-start already-started services |
| 51 | + |
| 52 | +`DDSContext.run_until_shutdown()` currently creates a new task for every service exposing `_run()` without checking whether a prior `start()` already created a live task. |
| 53 | + |
| 54 | +Risk: |
| 55 | + |
| 56 | +- duplicate reader loops for services that were manually started |
| 57 | +- hard-to-debug duplicated processing |
| 58 | + |
| 59 | +### 3. Provider `close()` can abort early on non-cancel exception from `_run` task await |
| 60 | + |
| 61 | +`CommandProvider.close()` cancels `_task` and only handles `asyncio.CancelledError` when awaiting it. |
| 62 | +If awaiting `_task` raises a different exception, active-session fail/cleanup logic below may not execute in that `close()` call. |
| 63 | + |
| 64 | +Risk: |
| 65 | + |
| 66 | +- incomplete fail-on-shutdown behavior for active sessions |
| 67 | +- reduced cleanup resilience after reader-loop failure |
| 68 | + |
| 69 | +## Suggested Fixes |
| 70 | + |
| 71 | +### A. Harden provider session finalization |
| 72 | + |
| 73 | +Wrap provider `on_terminal` in `try/except` in the `finally` block, and always run disposal and `_active_sessions.pop(...)` afterward. |
| 74 | + |
| 75 | +Suggested shape: |
| 76 | + |
| 77 | +1. `try: await on_terminal(...)` |
| 78 | +2. `except Exception: log` |
| 79 | +3. always dispose instances |
| 80 | +4. always remove session from active map |
| 81 | + |
| 82 | +### B. Prevent double-start in `run_until_shutdown()` |
| 83 | + |
| 84 | +Before creating a task for `_run()`, check whether `_task` exists and is still running. |
| 85 | + |
| 86 | +Suggested guard: |
| 87 | + |
| 88 | +- start only when `_task is None` or `_task.done()` |
| 89 | + |
| 90 | +### C. Make provider `close()` resilient to non-cancel task failures |
| 91 | + |
| 92 | +When awaiting canceled `_task`, catch generic exceptions (log and continue) so active sessions are still failed and awaited. |
| 93 | + |
| 94 | +## Test Gaps to Add |
| 95 | + |
| 96 | +1. Provider hook isolation test: |
| 97 | + |
| 98 | +- Provider subclass whose `on_terminal()` raises. |
| 99 | +- Assert session still disposes and is removed from `_active_sessions`. |
| 100 | + |
| 101 | +2. Lifecycle double-start guard test: |
| 102 | + |
| 103 | +- Call `service.start()` and then `ctx.run_until_shutdown()`. |
| 104 | +- Assert only one `_run()` task loop executes. |
| 105 | + |
| 106 | +3. Provider close resilience test: |
| 107 | + |
| 108 | +- Force `_run()` task to fail with non-cancel exception. |
| 109 | +- Assert `close()` still proceeds to fail/await active sessions. |
| 110 | + |
| 111 | +## Priority |
| 112 | + |
| 113 | +- High: provider `on_terminal` finalization hardening |
| 114 | +- High: double-start guard in `run_until_shutdown()` |
| 115 | +- Medium: provider `close()` robustness for non-cancel task exceptions |
| 116 | + |
| 117 | +## Notes |
| 118 | + |
| 119 | +- Findings are based on code-path verification in runtime implementation, not architecture docs alone. |
| 120 | +- Consumer exception isolation is in better shape than provider finalization paths. |
0 commit comments