Skip to content

Commit 2f2c5fb

Browse files
docs: add process-singleton design document for OOM fix
Design for sharing a single CopilotClientWrapper instance (and copilot CLI subprocess) across all sub-agents within a session via a process-level singleton with reference counting for lifecycle management. Addresses the OOM issue tracked in GitHub issue #7. 🤖 Generated with [Amplifier](https://github.com/microsoft/amplifier) Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>
1 parent 1e0734d commit 2f2c5fb

1 file changed

Lines changed: 161 additions & 0 deletions

File tree

Lines changed: 161 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,161 @@
1+
# Process-Level Singleton Fix for GitHub Copilot Provider
2+
3+
## Goal
4+
5+
Fix the GitHub Copilot provider so that all sub-agents spawned by a recipe within a single Amplifier session share one `CopilotClientWrapper` instance (and therefore one copilot CLI subprocess), rather than each spawning their own.
6+
7+
Tracked as: [Issue #7](https://github.com/microsoft/amplifier-module-provider-github-copilot/issues/7)
8+
9+
## Background
10+
11+
The copilot CLI binary is Electron-bundled at ~500 MB per process. With the current implementation, N sub-agents spawned by a recipe creates N subprocesses consuming N × ~500 MB RAM — the root cause of the OOM being investigated.
12+
13+
Two architectural facts constrain the solution:
14+
15+
- Sub-agents spawned by the `task` tool run as async coroutines in the **same Python process** and same asyncio event loop as the parent session (kernel-guaranteed)
16+
- Each sub-agent gets its own fresh `ModuleCoordinator` — coordinators are not shared across sessions
17+
18+
There is no kernel facility for cross-session resource sharing; resource pooling is module-level policy, consistent with Amplifier's "mechanism not policy" principle.
19+
20+
## Approach
21+
22+
A process-level singleton held as module-level state in `__init__.py`, with reference counting to ensure the subprocess lives exactly as long as at least one session is mounted. The change is entirely contained to `__init__.py` — no other files change.
23+
24+
Two other approaches were considered and rejected:
25+
26+
- **Process-level cap (semaphore)**: Limits blast radius but doesn't eliminate waste; more complex without being more correct
27+
- **Status quo + polish only**: Valid if the OOM were purely a runaway delegation bug, but the singleton is the right architecture independently — even legitimate heavy recipes waste N × 500 MB today
28+
29+
## Architecture
30+
31+
The change is surgical. `CopilotClientWrapper` itself doesn't change. The only change is in when and how many times it gets created.
32+
33+
**Before:**
34+
```
35+
mount() called → new CopilotClientWrapper() → new copilot subprocess
36+
mount() called → new CopilotClientWrapper() → new copilot subprocess
37+
mount() called → new CopilotClientWrapper() → new copilot subprocess
38+
```
39+
40+
**After:**
41+
```
42+
mount() called → _acquire_shared_client() → new copilot subprocess (ref=1)
43+
mount() called → _acquire_shared_client() → reuse existing (ref=2)
44+
mount() called → _acquire_shared_client() → reuse existing (ref=3)
45+
cleanup called → _release_shared_client() → ref=2
46+
cleanup called → _release_shared_client() → ref=1
47+
cleanup called → _release_shared_client() → ref=0 → subprocess shuts down
48+
```
49+
50+
**Blast radius:** `__init__.py` only. `CopilotClientWrapper`, `provider.py`, `tool_capture.py`, `_constants.py`, and `converters.py` are untouched.
51+
52+
## Components
53+
54+
### Module-Level Singleton State
55+
56+
Three module-level variables at the top of `__init__.py`:
57+
58+
```python
59+
_shared_client: CopilotClientWrapper | None = None
60+
_shared_client_refcount: int = 0
61+
_shared_client_lock: asyncio.Lock | None = None
62+
```
63+
64+
The lock is initialized lazily rather than at module load time. `asyncio.Lock()` must be created on the running event loop — creating it at import time can cause issues if the module is imported before an event loop exists (common in test environments). A `_get_lock()` helper handles this:
65+
66+
```python
67+
def _get_lock() -> asyncio.Lock:
68+
global _shared_client_lock
69+
if _shared_client_lock is None:
70+
_shared_client_lock = asyncio.Lock()
71+
return _shared_client_lock
72+
```
73+
74+
### Acquire and Release Functions
75+
76+
```python
77+
async def _acquire_shared_client(config, timeout) -> CopilotClientWrapper:
78+
global _shared_client, _shared_client_refcount
79+
async with _get_lock():
80+
if _shared_client is None:
81+
_shared_client = CopilotClientWrapper(config, timeout)
82+
_shared_client_refcount += 1
83+
return _shared_client
84+
85+
async def _release_shared_client() -> None:
86+
global _shared_client, _shared_client_refcount
87+
async with _get_lock():
88+
_shared_client_refcount -= 1
89+
if _shared_client_refcount <= 0:
90+
if _shared_client is not None:
91+
await _shared_client.close()
92+
_shared_client = None
93+
_shared_client_refcount = 0 # safety floor against accidental negatives
94+
```
95+
96+
The safety floor at zero matters: if cleanup is ever called more times than mount (e.g., in test teardown), a negative count must not prevent the next acquisition from creating a fresh client.
97+
98+
### Updated `mount()`
99+
100+
`mount()` changes in one place only — acquiring the shared client instead of constructing a new one directly:
101+
102+
```python
103+
async def mount(coordinator, config):
104+
timeout = config.get("timeout", DEFAULT_TIMEOUT)
105+
106+
client = await _acquire_shared_client(config, timeout)
107+
108+
provider = CopilotSdkProvider(client=client, config=config)
109+
coordinator.providers.register("github-copilot", provider)
110+
111+
async def cleanup():
112+
await _release_shared_client()
113+
114+
return cleanup
115+
```
116+
117+
Everything else — provider construction, coordinator registration, returning a cleanup callable — is unchanged.
118+
119+
## Data Flow
120+
121+
`_acquire_shared_client()` only uses `config` and `timeout` on the **first** call when it constructs the wrapper. Subsequent calls reuse the existing client regardless of the values passed. Sub-agents inherit bundle config from the parent, so in practice they all pass identical values.
122+
123+
If a sub-agent passes a different `timeout` than the first mount, `_acquire_shared_client()` emits a `DEBUG` log warning and returns the existing client unchanged. No exception is raised — the caller receives a working client and the discrepancy is visible in logs.
124+
125+
## Error Handling
126+
127+
**Config mismatch**: `DEBUG` warning logged, existing client returned, no exception raised.
128+
129+
**Concurrent mounts at startup**: Two sub-agents mounting simultaneously before the client exists are serialized by the `asyncio.Lock`. The second waiter finds `_shared_client` already populated, increments the refcount, and returns. Double-creation is not possible.
130+
131+
**Unclean shutdown** (SIGKILL / process crash): The refcount never reaches zero, so `close()` never fires and the copilot subprocess becomes an orphan. This is acceptable — the OS reclaims the subprocess when the parent Python process dies. A code comment will document this reasoning explicitly so future maintainers don't treat it as an oversight.
132+
133+
## Observability
134+
135+
Three events are emitted via `coordinator.hooks.emit()`, consistent with existing module patterns (e.g., `provider:tool_sequence_repaired`):
136+
137+
| Event | When | Payload |
138+
|---|---|---|
139+
| `github-copilot:subprocess_created` | First acquisition | `session_id` |
140+
| `github-copilot:subprocess_reused` | Subsequent acquisitions | `session_id`, `refcount` |
141+
| `github-copilot:subprocess_shutdown` | Refcount reaches zero | `session_id` |
142+
143+
These are visible in session logs without any extra configuration.
144+
145+
## Testing Strategy
146+
147+
New tests live in `tests/test_mount.py`. `CopilotClientWrapper.__init__` is mocked to count instantiations. A pytest fixture resets `_shared_client`, `_shared_client_refcount`, and `_shared_client_lock` to `None`/`0`/`None` before each test — required for isolation with module-level state.
148+
149+
| Test | What it verifies |
150+
|---|---|
151+
| **Singleton creation** | `mount()` once → one `CopilotClientWrapper` created, refcount=1 |
152+
| **Reuse** | `mount()` three times → one `CopilotClientWrapper` created, refcount=3 |
153+
| **Cleanup lifecycle** | Three mounts, three cleanups → `close()` called only on the third (refcount=0) |
154+
| **Concurrent mounts** | `asyncio.gather()` fires multiple `mount()` calls simultaneously → still one client (exercises the lock) |
155+
| **Config mismatch warning** | Mount with `timeout=300`, mount again with `timeout=600``DEBUG` warning emitted, no exception |
156+
157+
No integration tests are needed — the copilot subprocess is already mocked in the existing suite. These tests only exercise the reference counting logic, which is pure Python with no external dependencies. Existing tests for `CopilotClientWrapper` and `CopilotSdkProvider` are unaffected since neither class changes.
158+
159+
## Open Questions
160+
161+
None — design is fully validated and ready for implementation.

0 commit comments

Comments
 (0)