|
| 1 | +# Process-Level Singleton Fix for GitHub Copilot Provider |
| 2 | + |
| 3 | +## Goal |
| 4 | + |
| 5 | +Fix the GitHub Copilot provider so that all sub-agents spawned by a recipe within a single Amplifier session share one `CopilotClientWrapper` instance (and therefore one copilot CLI subprocess), rather than each spawning their own. |
| 6 | + |
| 7 | +Tracked as: [Issue #7](https://github.com/microsoft/amplifier-module-provider-github-copilot/issues/7) |
| 8 | + |
| 9 | +## Background |
| 10 | + |
| 11 | +The copilot CLI binary is Electron-bundled at ~500 MB per process. With the current implementation, N sub-agents spawned by a recipe creates N subprocesses consuming N × ~500 MB RAM — the root cause of the OOM being investigated. |
| 12 | + |
| 13 | +Two architectural facts constrain the solution: |
| 14 | + |
| 15 | +- Sub-agents spawned by the `task` tool run as async coroutines in the **same Python process** and same asyncio event loop as the parent session (kernel-guaranteed) |
| 16 | +- Each sub-agent gets its own fresh `ModuleCoordinator` — coordinators are not shared across sessions |
| 17 | + |
| 18 | +There is no kernel facility for cross-session resource sharing; resource pooling is module-level policy, consistent with Amplifier's "mechanism not policy" principle. |
| 19 | + |
| 20 | +## Approach |
| 21 | + |
| 22 | +A process-level singleton held as module-level state in `__init__.py`, with reference counting to ensure the subprocess lives exactly as long as at least one session is mounted. The change is entirely contained to `__init__.py` — no other files change. |
| 23 | + |
| 24 | +Two other approaches were considered and rejected: |
| 25 | + |
| 26 | +- **Process-level cap (semaphore)**: Limits blast radius but doesn't eliminate waste; more complex without being more correct |
| 27 | +- **Status quo + polish only**: Valid if the OOM were purely a runaway delegation bug, but the singleton is the right architecture independently — even legitimate heavy recipes waste N × 500 MB today |
| 28 | + |
| 29 | +## Architecture |
| 30 | + |
| 31 | +The change is surgical. `CopilotClientWrapper` itself doesn't change. The only change is in when and how many times it gets created. |
| 32 | + |
| 33 | +**Before:** |
| 34 | +``` |
| 35 | +mount() called → new CopilotClientWrapper() → new copilot subprocess |
| 36 | +mount() called → new CopilotClientWrapper() → new copilot subprocess |
| 37 | +mount() called → new CopilotClientWrapper() → new copilot subprocess |
| 38 | +``` |
| 39 | + |
| 40 | +**After:** |
| 41 | +``` |
| 42 | +mount() called → _acquire_shared_client() → new copilot subprocess (ref=1) |
| 43 | +mount() called → _acquire_shared_client() → reuse existing (ref=2) |
| 44 | +mount() called → _acquire_shared_client() → reuse existing (ref=3) |
| 45 | +cleanup called → _release_shared_client() → ref=2 |
| 46 | +cleanup called → _release_shared_client() → ref=1 |
| 47 | +cleanup called → _release_shared_client() → ref=0 → subprocess shuts down |
| 48 | +``` |
| 49 | + |
| 50 | +**Blast radius:** `__init__.py` only. `CopilotClientWrapper`, `provider.py`, `tool_capture.py`, `_constants.py`, and `converters.py` are untouched. |
| 51 | + |
| 52 | +## Components |
| 53 | + |
| 54 | +### Module-Level Singleton State |
| 55 | + |
| 56 | +Three module-level variables at the top of `__init__.py`: |
| 57 | + |
| 58 | +```python |
| 59 | +_shared_client: CopilotClientWrapper | None = None |
| 60 | +_shared_client_refcount: int = 0 |
| 61 | +_shared_client_lock: asyncio.Lock | None = None |
| 62 | +``` |
| 63 | + |
| 64 | +The lock is initialized lazily rather than at module load time. `asyncio.Lock()` must be created on the running event loop — creating it at import time can cause issues if the module is imported before an event loop exists (common in test environments). A `_get_lock()` helper handles this: |
| 65 | + |
| 66 | +```python |
| 67 | +def _get_lock() -> asyncio.Lock: |
| 68 | + global _shared_client_lock |
| 69 | + if _shared_client_lock is None: |
| 70 | + _shared_client_lock = asyncio.Lock() |
| 71 | + return _shared_client_lock |
| 72 | +``` |
| 73 | + |
| 74 | +### Acquire and Release Functions |
| 75 | + |
| 76 | +```python |
| 77 | +async def _acquire_shared_client(config, timeout) -> CopilotClientWrapper: |
| 78 | + global _shared_client, _shared_client_refcount |
| 79 | + async with _get_lock(): |
| 80 | + if _shared_client is None: |
| 81 | + _shared_client = CopilotClientWrapper(config, timeout) |
| 82 | + _shared_client_refcount += 1 |
| 83 | + return _shared_client |
| 84 | + |
| 85 | +async def _release_shared_client() -> None: |
| 86 | + global _shared_client, _shared_client_refcount |
| 87 | + async with _get_lock(): |
| 88 | + _shared_client_refcount -= 1 |
| 89 | + if _shared_client_refcount <= 0: |
| 90 | + if _shared_client is not None: |
| 91 | + await _shared_client.close() |
| 92 | + _shared_client = None |
| 93 | + _shared_client_refcount = 0 # safety floor against accidental negatives |
| 94 | +``` |
| 95 | + |
| 96 | +The safety floor at zero matters: if cleanup is ever called more times than mount (e.g., in test teardown), a negative count must not prevent the next acquisition from creating a fresh client. |
| 97 | + |
| 98 | +### Updated `mount()` |
| 99 | + |
| 100 | +`mount()` changes in one place only — acquiring the shared client instead of constructing a new one directly: |
| 101 | + |
| 102 | +```python |
| 103 | +async def mount(coordinator, config): |
| 104 | + timeout = config.get("timeout", DEFAULT_TIMEOUT) |
| 105 | + |
| 106 | + client = await _acquire_shared_client(config, timeout) |
| 107 | + |
| 108 | + provider = CopilotSdkProvider(client=client, config=config) |
| 109 | + coordinator.providers.register("github-copilot", provider) |
| 110 | + |
| 111 | + async def cleanup(): |
| 112 | + await _release_shared_client() |
| 113 | + |
| 114 | + return cleanup |
| 115 | +``` |
| 116 | + |
| 117 | +Everything else — provider construction, coordinator registration, returning a cleanup callable — is unchanged. |
| 118 | + |
| 119 | +## Data Flow |
| 120 | + |
| 121 | +`_acquire_shared_client()` only uses `config` and `timeout` on the **first** call when it constructs the wrapper. Subsequent calls reuse the existing client regardless of the values passed. Sub-agents inherit bundle config from the parent, so in practice they all pass identical values. |
| 122 | + |
| 123 | +If a sub-agent passes a different `timeout` than the first mount, `_acquire_shared_client()` emits a `DEBUG` log warning and returns the existing client unchanged. No exception is raised — the caller receives a working client and the discrepancy is visible in logs. |
| 124 | + |
| 125 | +## Error Handling |
| 126 | + |
| 127 | +**Config mismatch**: `DEBUG` warning logged, existing client returned, no exception raised. |
| 128 | + |
| 129 | +**Concurrent mounts at startup**: Two sub-agents mounting simultaneously before the client exists are serialized by the `asyncio.Lock`. The second waiter finds `_shared_client` already populated, increments the refcount, and returns. Double-creation is not possible. |
| 130 | + |
| 131 | +**Unclean shutdown** (SIGKILL / process crash): The refcount never reaches zero, so `close()` never fires and the copilot subprocess becomes an orphan. This is acceptable — the OS reclaims the subprocess when the parent Python process dies. A code comment will document this reasoning explicitly so future maintainers don't treat it as an oversight. |
| 132 | + |
| 133 | +## Observability |
| 134 | + |
| 135 | +Three events are emitted via `coordinator.hooks.emit()`, consistent with existing module patterns (e.g., `provider:tool_sequence_repaired`): |
| 136 | + |
| 137 | +| Event | When | Payload | |
| 138 | +|---|---|---| |
| 139 | +| `github-copilot:subprocess_created` | First acquisition | `session_id` | |
| 140 | +| `github-copilot:subprocess_reused` | Subsequent acquisitions | `session_id`, `refcount` | |
| 141 | +| `github-copilot:subprocess_shutdown` | Refcount reaches zero | `session_id` | |
| 142 | + |
| 143 | +These are visible in session logs without any extra configuration. |
| 144 | + |
| 145 | +## Testing Strategy |
| 146 | + |
| 147 | +New tests live in `tests/test_mount.py`. `CopilotClientWrapper.__init__` is mocked to count instantiations. A pytest fixture resets `_shared_client`, `_shared_client_refcount`, and `_shared_client_lock` to `None`/`0`/`None` before each test — required for isolation with module-level state. |
| 148 | + |
| 149 | +| Test | What it verifies | |
| 150 | +|---|---| |
| 151 | +| **Singleton creation** | `mount()` once → one `CopilotClientWrapper` created, refcount=1 | |
| 152 | +| **Reuse** | `mount()` three times → one `CopilotClientWrapper` created, refcount=3 | |
| 153 | +| **Cleanup lifecycle** | Three mounts, three cleanups → `close()` called only on the third (refcount=0) | |
| 154 | +| **Concurrent mounts** | `asyncio.gather()` fires multiple `mount()` calls simultaneously → still one client (exercises the lock) | |
| 155 | +| **Config mismatch warning** | Mount with `timeout=300`, mount again with `timeout=600` → `DEBUG` warning emitted, no exception | |
| 156 | + |
| 157 | +No integration tests are needed — the copilot subprocess is already mocked in the existing suite. These tests only exercise the reference counting logic, which is pure Python with no external dependencies. Existing tests for `CopilotClientWrapper` and `CopilotSdkProvider` are unaffected since neither class changes. |
| 158 | + |
| 159 | +## Open Questions |
| 160 | + |
| 161 | +None — design is fully validated and ready for implementation. |
0 commit comments