Commit ae3e5af

feat(box): shared MCP container with persistent session support
- All MCP servers share single 'mcp-shared' Docker container
- Each server gets unique process_id within shared session
- Persistent workspace survives LangBot restarts
- Cleanup only removes managed process, not shared session
- Add process_id passthrough in service, workspace, and MCP layers
- Fix workspace.py docstring placement for ruff E402
- Update tests for shared session model and default image change
1 parent 1bdf62e commit ae3e5af
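
The shared-session model described in the commit message can be sketched roughly as follows. This is a minimal in-memory illustration, not the Box runtime's implementation: `SharedSession`, `start_process`, and `stop_process` are hypothetical names, and only the `'mcp-shared'` session id and the idea of one `process_id` per MCP server come from the commit.

```python
from dataclasses import dataclass, field

# Session id taken from _build_box_session_id() in this commit.
SHARED_SESSION_ID = 'mcp-shared'


@dataclass
class SharedSession:
    """Hypothetical in-memory stand-in for one shared Box session."""

    session_id: str = SHARED_SESSION_ID
    # Maps process_id (one per MCP server) to the command it runs.
    processes: dict[str, str] = field(default_factory=dict)

    def start_process(self, process_id: str, command: str) -> None:
        if process_id in self.processes:
            raise ValueError(f'process_id {process_id!r} is already running')
        self.processes[process_id] = command

    def stop_process(self, process_id: str) -> None:
        # Remove only this server's process; the session and its siblings stay.
        self.processes.pop(process_id, None)


session = SharedSession()
session.start_process('server-uuid-a', 'python server_a.py')
session.start_process('server-uuid-b', 'node server_b.js')
session.stop_process('server-uuid-a')
remaining = list(session.processes)
```

Stopping one server leaves the other entry and the session itself in place, which is the property the shared container depends on.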

18 files changed

Lines changed: 2299 additions & 2007 deletions

docker/docker-compose.yaml

Lines changed: 1 addition & 4 deletions
@@ -23,10 +23,7 @@ services:
     container_name: langbot_box
     volumes:
       - ./data/box:/workspaces
-      # Mount container runtime socket for Box sandbox backend.
-      # Uncomment the one that matches your container runtime:
-      # - /var/run/podman/podman.sock:/var/run/podman/podman.sock # Podman
-      - /var/run/docker.sock:/var/run/docker.sock # Docker
+      - /var/run/docker.sock:/var/run/docker.sock
     restart: on-failure
     environment:
       - TZ=Asia/Shanghai

docs/review/box-architecture.md

Lines changed: 9 additions & 10 deletions
@@ -39,8 +39,7 @@
 │ BoxRuntime (session management / process lifecycle) │
 │ │ │
 │ Backend (one selected at startup): │
-│ PodmanBackend ─┐ │
-│ DockerBackend ─┤── CLISandboxBackend │
+│ DockerBackend ─┬── CLISandboxBackend │
 │ NsjailBackend ─┘ │
 │ │
 │ aiohttp WS Relay (:5410) │
@@ -50,7 +49,7 @@


 ┌──────────────────────────────────────────────────────────────┐
-│ Container/sandbox (Podman/Docker container or nsjail sandbox) │
+│ Container/sandbox (Docker container or nsjail sandbox)
 │ - Isolated filesystem, network, and PID namespaces │
 │ - Resource limits (CPU, memory, PID count) │
 │ - exec: user commands run here │
@@ -158,14 +157,14 @@ Session lifecycle:

 ### 3.2 Backend system

-#### CLISandboxBackend (`box/backend.py`, 389 lines)
+#### CLISandboxBackend (`box/backend.py`)

-Common base class for Podman/Docker
+Base class for Docker

 ```
 start_session(spec):
 1. validate_sandbox_security(spec) security validation
-2. docker/podman run -d --rm --name <name>
+2. docker run -d --rm --name <name>
 --network none (optional)
 --cpus/--memory/--pids-limit resource limits
 --read-only + --tmpfs /tmp read-only root filesystem
@@ -174,11 +173,11 @@ start_session(spec):
 3. Return BoxSessionInfo

 exec(session, spec):
-docker/podman exec -e KEY=VAL <container>
+docker exec -e KEY=VAL <container>
 sh -lc 'mkdir -p <workdir> && cd <workdir> && <cmd>'

 start_managed_process(session, spec):
-docker/podman exec -i <container>
+docker exec -i <container>
 sh -lc 'mkdir -p <cwd> && cd <cwd> && exec <command> <args>'
 Returns asyncio.subprocess.Process (stdin/stdout PIPE)
 ```
@@ -197,7 +196,7 @@ start_managed_process(session, spec):
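 - Resource limits: cgroup v2 preferred, falls back to rlimit
 - **No custom image**: uses the host OS; the `image` field is fixed to `'host'`

-**Backend selection priority**: Podman → Docker → nsjail (probed one by one at startup; the first available backend wins; no runtime failover)
+**Backend selection priority**: Docker → nsjail (probed one by one at startup; the first available backend wins; no runtime failover)

 ### 3.3 Server (`box/server.py`, 268 lines)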

@@ -237,7 +236,7 @@ start_managed_process(session, spec):

 ### 3.6 Security (`box/security.py`, 54 lines)

-`validate_sandbox_security()`: denylist check on host_path; blocks mounting `/etc`/`/proc`/`/sys`/`/dev`/`/root`/`/boot` and the Docker/Podman socket.
+`validate_sandbox_security()`: denylist check on host_path; blocks mounting `/etc`/`/proc`/`/sys`/`/dev`/`/root`/`/boot` and the Docker socket.

 **Known defects**: the root path `/` is not blocked, user home directories are not blocked, and the policy is a denylist rather than an allowlist. See [issue list #5](./box-issues.md)

docs/review/box-test-coverage.md

Lines changed: 2 additions & 2 deletions
@@ -68,7 +68,7 @@
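 |------|------|
 | **Session TTL expiry** | Tests configure `session_ttl_sec` but never advance time to verify expiry cleanup |
 | **Concurrent session access** | No tests for concurrent exec, concurrent creation, or race conditions |
-| **Container backend (Podman/Docker)** | Covered only by integration tests (not run in CI); unit tests all use FakeBackend |
+| **Container backend (Docker)** | Covered only by integration tests (not run in CI); unit tests all use FakeBackend |
 | **BoxRuntime shutdown()** | Called in test cleanup, but its behavior is never verified |
 | **BoxServerHandler error paths** | Malformed requests, unknown action types |
 | **WS relay** | Covered only by integration tests (not run in CI) |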
@@ -83,7 +83,7 @@
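 |------|------|
 | BoxSpec validation | Invalid session_id format, overlong commands, special characters in env |
 | BoxExecutionResult | Only COMPLETED and TIMED_OUT; no test for the ERROR status |
-| Multi-backend fallback | Single-backend configuration only; no test for Podman unavailable → fallback to Docker |
+| Multi-backend fallback | Single-backend configuration only; no test for Docker unavailable → fallback to nsjail |
 | Profile YAML loading | Tests use hard-coded strings and never load from a real config.yaml |

 ---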

docs/review/box-tob-analysis.md

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@
 | Capability | toB value | Code location |
 |------|---------|---------|
 | **Sandboxed, isolated execution** | Foundational capability for enterprises to run untrusted code safely | SDK `box/backend.py` |
-| **Multi-backend support** | Adapts to different enterprise container infrastructure (Podman/Docker/nsjail) | SDK `box/runtime.py` `_select_backend()` |
+| **Multi-backend support** | Adapts to different enterprise container infrastructure (Docker/nsjail) | SDK `box/runtime.py` `_select_backend()` |
 | **Profile + locked fields** | Operators lock the security boundary; LLMs and users cannot bypass it | `pkg/box/service.py`, SDK `box/models.py` |
 | **Resource limits** | CPU/memory/PID-count limits prevent resource abuse | SDK `backend.py` `--cpus/--memory/--pids-limit` |
 | **Workspace quota** | Disk usage control | `pkg/box/service.py` `_enforce_workspace_quota` |

src/langbot/pkg/box/connector.py

Lines changed: 23 additions & 3 deletions
@@ -43,10 +43,21 @@ def resolve_box_ws_relay_url(ap: core_app.Application) -> str:
     """
     box_cfg = _get_box_config(ap)

-    # Explicit relay URL takes precedence.
+    # Explicit runtime URL takes precedence. The config value should be
+    # a bare ``ws://host:port`` (no path) – the connector appends paths
+    # like ``/rpc/ws`` or ``/v1/sessions/…`` as needed.
     runtime_url = str(box_cfg.get('runtime_url', '')).strip()
     if runtime_url:
-        return runtime_url
+        parsed = urlparse(runtime_url)
+        scheme = parsed.scheme or 'ws'
+        # Normalise WebSocket schemes to HTTP for the relay base URL.
+        if scheme == 'ws':
+            scheme = 'http'
+        elif scheme == 'wss':
+            scheme = 'https'
+        host = parsed.hostname or '127.0.0.1'
+        port = parsed.port or _DEFAULT_PORT
+        return f'{scheme}://{host}:{port}'

     # In Docker, relay lives on the box runtime container.
     if platform.get_platform() == 'docker':
@@ -192,9 +203,18 @@ def _resolve_rpc_ws_url(self) -> str:
         """Determine the action-RPC WebSocket URL.

         All endpoints share a single port; action RPC is at ``/rpc/ws``.
+        The configured ``runtime_url`` is a bare ``ws://host:port`` base;
+        the ``/rpc/ws`` path is always appended by this method.
         """
         if self.configured_runtime_url:
-            return self.configured_runtime_url
+            base = self.configured_runtime_url.rstrip('/')
+            parsed = urlparse(base)
+            scheme = parsed.scheme or 'ws'
+            if scheme in ('http', 'https'):
+                scheme = 'wss' if scheme == 'https' else 'ws'
+            host = parsed.hostname or '127.0.0.1'
+            port = parsed.port or _DEFAULT_PORT
+            return f'{scheme}://{host}:{port}/rpc/ws'

         if platform.get_platform() == 'docker':
             return f'ws://{_DOCKER_BOX_HOST}:{_DEFAULT_PORT}/rpc/ws'
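
The two normalisations in this file (relay base URL over HTTP, action RPC over WebSocket) can be illustrated together in one standalone sketch. `normalize_runtime_url` and `DEFAULT_PORT` are hypothetical names, and the value 5410 is assumed from the relay port shown elsewhere in this commit. Unlike the connector code, this simplified version maps any scheme other than `wss`/`https` to the insecure pair.

```python
from urllib.parse import urlparse

# Assumption: mirrors the 5410 port that appears elsewhere in this commit.
DEFAULT_PORT = 5410


def normalize_runtime_url(runtime_url: str, default_port: int = DEFAULT_PORT) -> tuple[str, str]:
    """Return (relay_base_url, rpc_ws_url) for a configured base URL.

    Only the scheme, host, and port are used; any path is discarded.
    """
    parsed = urlparse(runtime_url.strip().rstrip('/'))
    host = parsed.hostname or '127.0.0.1'
    port = parsed.port or default_port
    secure = parsed.scheme in ('wss', 'https')
    http_scheme = 'https' if secure else 'http'
    ws_scheme = 'wss' if secure else 'ws'
    relay_base_url = f'{http_scheme}://{host}:{port}'
    rpc_ws_url = f'{ws_scheme}://{host}:{port}/rpc/ws'
    return relay_base_url, rpc_ws_url


plain = normalize_runtime_url('ws://localhost:5410')
secure = normalize_runtime_url('https://box.example:8443/')
```

Both results are derived from the same host and port, so a single `runtime_url` setting is enough to reach every endpoint on the shared port.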

src/langbot/pkg/box/service.py

Lines changed: 5 additions & 5 deletions
@@ -134,7 +134,7 @@ async def execute_spec_payload(
         skip_host_mount_validation: bool = False,
     ) -> dict:
         if not self._available:
-            raise BoxError('Box runtime is not available. Install and start Podman or Docker to use sandbox features.')
+            raise BoxError('Box runtime is not available. Install and start Docker to use sandbox features.')
         try:
             spec = self.build_spec(spec_payload, skip_host_mount_validation=skip_host_mount_validation)
         except BoxError as exc:
@@ -280,10 +280,10 @@ async def start_managed_process(self, session_id: str, process_payload: dict) ->
         process_spec = BoxManagedProcessSpec.model_validate(process_payload)
         return await self.client.start_managed_process(session_id, process_spec)

-    async def get_managed_process(self, session_id: str) -> BoxManagedProcessInfo:
-        return await self.client.get_managed_process(session_id)
+    async def get_managed_process(self, session_id: str, process_id: str = 'default') -> BoxManagedProcessInfo:
+        return await self.client.get_managed_process(session_id, process_id)

-    def get_managed_process_websocket_url(self, session_id: str) -> str:
+    def get_managed_process_websocket_url(self, session_id: str, process_id: str = 'default') -> str:
         getter = getattr(self.client, 'get_managed_process_websocket_url', None)
         if getter is None:
             raise BoxValidationError('box runtime client does not support managed process websocket attach')
@@ -292,7 +292,7 @@ def get_managed_process_websocket_url(self, session_id: str) -> str:
             if self._runtime_connector is not None
             else 'http://127.0.0.1:5410'
         )
-        return getter(session_id, ws_relay_base_url)
+        return getter(session_id, ws_relay_base_url, process_id)

     def _serialize_result(self, result: BoxExecutionResult) -> dict:
         stdout, stdout_truncated = self._truncate(result.stdout)

src/langbot/pkg/box/workspace.py

Lines changed: 13 additions & 8 deletions
@@ -1,4 +1,3 @@
-from __future__ import annotations
 """Reusable workspace/session helpers built on top of Box.

 This module is the middle layer between the raw Box runtime primitives and
@@ -14,6 +13,8 @@
 - MCP stdio chooses how to prepare dependencies and attaches to a managed process
 """

+from __future__ import annotations
+
 import os
 import textwrap
 from typing import Any
@@ -99,9 +100,7 @@ def list_python_manifest_files(host_path: str | None) -> list[str]:
     normalized_root = normalize_host_path(host_path)
     if not normalized_root:
         return []
-    return [
-        filename for filename in PYTHON_MANIFEST_FILES if os.path.isfile(os.path.join(normalized_root, filename))
-    ]
+    return [filename for filename in PYTHON_MANIFEST_FILES if os.path.isfile(os.path.join(normalized_root, filename))]


 def classify_python_workspace(host_path: str | None) -> str | None:
@@ -269,6 +268,7 @@ def __init__(
         cpus: float | None = None,
         memory_mb: int | None = None,
         pids_limit: int | None = None,
+        persistent: bool = False,
     ):
         self.box_service = box_service
         self.session_id = session_id
@@ -283,6 +283,7 @@ def __init__(
         self.cpus = cpus
         self.memory_mb = memory_mb
         self.pids_limit = pids_limit
+        self.persistent = persistent

     def rewrite_path(self, path: str) -> str:
         return rewrite_mounted_path(path, self.host_path, mount_path=self.mount_path)
@@ -298,6 +299,8 @@ def build_session_payload(self) -> dict[str, Any]:
             'workdir': self.workdir,
             'env': self.env,
         }
+        if self.persistent:
+            payload['persistent'] = True
         if self.network is not None:
             payload['network'] = self.network
         if self.read_only_rootfs is not None:
@@ -388,17 +391,19 @@ async def start_managed_process(
         command: str,
         args: list[str] | None = None,
         *,
+        process_id: str = 'default',
         env: dict[str, str] | None = None,
         cwd: str = '/workspace',
     ):
         payload = self.build_process_payload(command, args, env=env, cwd=cwd)
+        payload['process_id'] = process_id
         return await self.box_service.start_managed_process(self.session_id, payload)

-    async def get_managed_process(self):
-        return await self.box_service.get_managed_process(self.session_id)
+    async def get_managed_process(self, process_id: str = 'default'):
+        return await self.box_service.get_managed_process(self.session_id, process_id)

-    def get_managed_process_websocket_url(self) -> str:
-        return self.box_service.get_managed_process_websocket_url(self.session_id)
+    def get_managed_process_websocket_url(self, process_id: str = 'default') -> str:
+        return self.box_service.get_managed_process_websocket_url(self.session_id, process_id)

     async def cleanup(self) -> None:
         await self.box_service.client.delete_session(self.session_id)
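
The `process_id` passthrough with its `'default'` fallback can be illustrated with stand-in classes. This is only a sketch: `FakeBoxService` and `Workspace` are hypothetical and far smaller than the real service and workspace layers. The point is that callers which omit `process_id` keep working, while MCP servers can address their own process.

```python
import asyncio


class FakeBoxService:
    """Hypothetical stand-in that records which process each call targets."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, str]] = []

    async def get_managed_process(self, session_id: str, process_id: str = 'default') -> dict:
        self.calls.append((session_id, process_id))
        return {'session_id': session_id, 'process_id': process_id, 'status': 'running'}


class Workspace:
    """Hypothetical, trimmed-down analogue of the workspace layer."""

    def __init__(self, service: FakeBoxService, session_id: str) -> None:
        self.service = service
        self.session_id = session_id

    async def get_managed_process(self, process_id: str = 'default') -> dict:
        # Forward process_id unchanged; callers that omit it get 'default'.
        return await self.service.get_managed_process(self.session_id, process_id)


service = FakeBoxService()
workspace = Workspace(service, 'mcp-shared')
legacy = asyncio.run(workspace.get_managed_process())
scoped = asyncio.run(workspace.get_managed_process('server-uuid-1'))
```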

src/langbot/pkg/provider/tools/loaders/mcp.py

Lines changed: 3 additions & 2 deletions
@@ -20,7 +20,7 @@
 import langbot_plugin.api.entities.builtin.resource.tool as resource_tool
 import langbot_plugin.api.entities.builtin.provider.message as provider_message
 from ....entity.persistence import mcp as persistence_mcp
-from .mcp_stdio import BoxStdioSessionRuntime, MCPServerBoxConfig, MCPSessionErrorPhase
+from .mcp_stdio import BoxStdioSessionRuntime, MCPSessionErrorPhase


 class MCPSessionStatus(enum.Enum):
@@ -320,6 +320,7 @@ def get_runtime_info_dict(self) -> dict:
         }
         if self._uses_box_stdio():
             info['box_session_id'] = self._build_box_session_id()
+            info['box_process_id'] = self._box_stdio_runtime.process_id
             info['box_enabled'] = True
         return info

@@ -349,7 +350,7 @@ def _uses_box_stdio(self) -> bool:
         return self._box_stdio_runtime.uses_box_stdio()

     def _build_box_session_id(self) -> str:
-        return f'mcp-{self.server_uuid}'
+        return 'mcp-shared'

     def _rewrite_path(self, path: str, host_path: str | None) -> str:
         return self._box_stdio_runtime.rewrite_path(path, host_path)

src/langbot/pkg/provider/tools/loaders/mcp_stdio.py

Lines changed: 23 additions & 8 deletions
@@ -81,8 +81,14 @@ def _build_workspace(self) -> BoxWorkspaceSession:
             cpus=self.config.cpus,
             memory_mb=self.config.memory_mb,
             pids_limit=self.config.pids_limit,
+            persistent=True,
         )

+    @property
+    def process_id(self) -> str:
+        """Each MCP server gets a unique process_id within the shared session."""
+        return self.owner.server_uuid
+
     def uses_box_stdio(self) -> bool:
         if self.server_config.get('mode') != 'stdio':
             return False
@@ -104,7 +110,9 @@ async def initialize(self) -> None:
         if host_path:
             install_cmd = self.owner._detect_install_command(host_path)
             if install_cmd:
-                self.ap.logger.info(f'MCP server {self.server_name}: installing dependencies in Box with: {install_cmd}')
+                self.ap.logger.info(
+                    f'MCP server {self.server_name}: installing dependencies in Box with: {install_cmd}'
+                )
                 try:
                     result = await workspace.execute_raw(
                         install_cmd,
@@ -122,17 +130,20 @@ async def initialize(self) -> None:
             await workspace.start_managed_process(
                 self.server_config['command'],
                 self.server_config.get('args', []),
+                process_id=self.process_id,
                 env=self.server_config.get('env', {}),
             )
         except Exception:
             self.owner.error_phase = MCPSessionErrorPhase.PROCESS_START
             raise

         try:
-            websocket_url = workspace.get_managed_process_websocket_url()
+            websocket_url = workspace.get_managed_process_websocket_url(self.process_id)
             transport = await self.owner.exit_stack.enter_async_context(websocket_client(websocket_url))
             read_stream, write_stream = transport
-            self.owner.session = await self.owner.exit_stack.enter_async_context(ClientSession(read_stream, write_stream))
+            self.owner.session = await self.owner.exit_stack.enter_async_context(
+                ClientSession(read_stream, write_stream)
+            )
         except Exception:
             self.owner.error_phase = MCPSessionErrorPhase.RELAY_CONNECT
             raise
@@ -150,7 +161,7 @@ async def monitor_process_health(self) -> None:
         consecutive_errors = 0
         while not self.owner._shutdown_event.is_set():
             try:
-                info = await workspace.get_managed_process()
+                info = await workspace.get_managed_process(self.process_id)
                 if isinstance(info, dict):
                     status = info.get('status', '')
                 else:
@@ -173,10 +184,14 @@ async def cleanup_session(self) -> None:
         if not self.uses_box_stdio():
             return

-        try:
-            await self._build_workspace().cleanup()
-        except Exception as exc:
-            self.ap.logger.warning(f'Failed to cleanup Box session for MCP server {self.server_name}: {exc}')
+        # In the shared-session model, we do NOT delete the session itself.
+        # The managed process is cleaned up when it exits or when the Box
+        # runtime shuts down. Deleting the session would kill all other
+        # MCP servers sharing the same container.
+        self.ap.logger.info(
+            f'MCP server {self.server_name}: process_id={self.process_id} cleanup complete '
+            f'(shared session {self.owner._build_box_session_id()} kept alive)'
+        )

     def rewrite_path(self, path: str, host_path: str | None) -> str:
         return rewrite_mounted_path(path, host_path)

src/langbot/templates/config.yaml

Lines changed: 1 addition & 1 deletion
@@ -90,7 +90,7 @@ monitoring:
 box:
   profile: 'default'
   image: '' # Custom sandbox container image. Leave empty to use the profile default (python:3.11-slim).
-  runtime_url: '' # Action-RPC WebSocket URL of an external Box Runtime. Leave empty for auto-detection (stdio locally, Docker service in containers).
+  runtime_url: '' # Base URL of an external Box Runtime (e.g. ws://localhost:5410). Leave empty for auto-detection (stdio locally, Docker service in containers).
   shared_host_root: './data/box' # For Docker deployment, use '/workspaces'
   default_host_workspace: '' # Defaults to '<shared_host_root>/default'
   allowed_host_mount_roots: # Defaults to ['<shared_host_root>'] when left empty
