Commit 5053fb6

Pangjiping and claude authored
docs(osep): OSEP-0012 for multi tenant (#838)
* docs(osep): OSEP-0012 for multi tenant

* docs(osep): simplify OSEP-0012, add TenantProvider abstraction

  - Replace code blocks with pseudocode/flows, ~40% shorter
  - Add TenantProvider interface decoupling auth from config source
  - FileTenantProvider as initial backend; room for HTTP/IAM providers
  - Make Docker unsupported explicit across summary, goals, requirements

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(osep): clarify ingress is tenant-unaware, fix proxy ownership guard

  Ingress gateway intentionally does not enforce tenant isolation. Proxy routes bypass auth (design); tenancy enforced at lifecycle API boundary, not data-plane. Isolation via unguessable IDs + signed tokens.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
1 parent ccb5a6c commit 5053fb6

1 file changed

Lines changed: 391 additions & 0 deletions

oseps/0012-multi-tenancy.md

@@ -0,0 +1,391 @@
---
title: Multi-Tenancy Support for Kubernetes Runtime
authors:
  - "@Pangjiping"
creation-date: 2026-04-29
last-updated: 2026-05-07
status: draft
---

# OSEP-0012: Multi-Tenancy Support for Kubernetes Runtime

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Requirements](#requirements)
- [Proposal](#proposal)
  - [Notes/Constraints/Caveats](#notesconstraintscaveats)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [TenantProvider Abstraction](#tenantprovider-abstraction)
  - [Config Model & Loading Flow (FileTenantProvider)](#config-model--loading-flow-filetenantprovider)
  - [Auth Middleware Flow](#auth-middleware-flow)
  - [Sandbox Service — Namespace Resolution](#sandbox-service--namespace-resolution)
  - [Startup Guards](#startup-guards)
  - [Deployment Changes](#deployment-changes)
  - [Tenant Isolation Model (Reference)](#tenant-isolation-model-reference)
- [Test Plan](#test-plan)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [Infrastructure Needed](#infrastructure-needed)
- [Upgrade & Migration Strategy](#upgrade--migration-strategy)
<!-- /toc -->

## Summary

Add multi-tenancy support to OpenSandbox Server when running on Kubernetes. A new config file `tenants.toml` maps API keys to Kubernetes namespaces, enabling K8s-level isolation between tenants. The feature is opt-in: when `tenants.toml` exists, the server enters multi-tenant mode; when it is absent, single-tenant behavior is unchanged.

**Docker runtime is explicitly unsupported.** If `runtime.type = "docker"` and `tenants.toml` exists, the server refuses to start with a clear error. Multi-tenancy requires Kubernetes namespaces; Docker has no equivalent isolation primitive.

## Motivation

The current deployment shares a single API key and a single K8s namespace across all sandbox consumers. Problems:

1. **No workload isolation.** All sandboxes live in one namespace, so one misbehaving consumer affects all others. ResourceQuota, NetworkPolicy, and LimitRange cannot be applied per consumer.
2. **No credential isolation.** One shared key means no per-consumer audit trail, no per-consumer revocation, and no per-consumer rate limiting.

Multi-tenancy gives each tenant its own namespace and API key(s) within a single server deployment.

### Goals

- Define tenants in an independent config file (`tenants.toml`), with zero changes to `server.toml`
- Each tenant → dedicated K8s namespace
- Multiple API keys per tenant (key rotation without downtime)
- Hot-reload via fsnotify, no restart required
- Single-tenant mode fully intact when `tenants.toml` is absent
- Docker runtime explicitly unsupported: server refuses to start if `tenants.toml` is present with `runtime.type = "docker"`

### Non-Goals

- Docker runtime multi-tenancy: Docker has no namespace concept; `tenants.toml` with Docker is a startup error, not silently ignored
- Ingress gateway tenant isolation: ingress is a data-plane routing layer and intentionally tenant-unaware; isolation at the proxy layer relies on unguessable sandbox IDs + signed tokens + K8s NetworkPolicy
- Dynamic tenant CRUD via REST API (future OSEP)
- Per-tenant rate limiting at server layer (delegate to K8s/ingress)
- Server-side resource quotas (delegate to K8s ResourceQuota)
- Migration tooling (manual, documented)

## Requirements

- `tenants.toml` existence = sole trigger for multi-tenant mode
- When `tenants.toml` exists, `server.api_key` in `server.toml` MUST be rejected
- Each tenant entry MUST have: `name`, `namespace`, `api_keys` (non-empty)
- Auth MUST use constant-time comparison on API keys
- Startup MUST validate all tenant namespaces exist and are accessible
- Sandbox `create`/`get`/`list`/`delete` operate within authenticated tenant's namespace
- Proxy routes MUST validate tenant ownership of target sandbox
- Tenant config changes propagate to all server replicas without restart
- `runtime.type = "docker"` with `tenants.toml` present MUST cause a fatal startup error; multi-tenancy is a K8s-only feature and Docker has no namespace primitive

## Proposal

Introduce a `TenantProvider` abstraction for tenant resolution. The initial implementation is `FileTenantProvider`, backed by `tenants.toml` at `~/.opensandbox/tenants.toml` (overridable via `SANDBOX_TENANTS_CONFIG_PATH`). Auth middleware depends only on the interface, not the file, which leaves room for future providers (HTTP API, K8s Secret, external IAM) without touching auth code.

```
┌─────────────────────────────────┐
│ server.toml (unchanged)         │
│ [server] api_key = "..."        │
│ [kubernetes] namespace = "..."  │
└─────────────────────────────────┘
                 +
┌─────────────────────────────────┐
│ tenants.toml (new, optional)    │
│ [[tenants]]                     │
│ name = "team-a"                 │
│ namespace = "ns-a"              │
│ api_keys = ["key1", "key2"]     │
└─────────────────────────────────┘

FileTenantProvider (initial backend)
TenantProvider interface (extension point)
```

**Request routing flow:**

```
Server startup
  │
  ├── runtime.type = "docker" AND tenants.toml exists?
  │     └── YES → FATAL: exit with error. Docker has no namespace isolation.
  │
  └── runtime.type = "kubernetes" (or Docker without tenants.toml)

Request with OPEN-SANDBOX-API-KEY header
  │
  ├── tenants.toml exists?
  │     ├── YES → lookup key in tenant api_keys
  │     │     ├── found → inject tenant context, route to tenant.namespace
  │     │     └── not found → 401
  │     └── NO → validate against server.api_key (legacy single-tenant)
  │           ├── valid → route to kubernetes.namespace
  │           └── invalid → 401
```

### Notes/Constraints/Caveats

- **Docker runtime NOT supported.** If `runtime.type = "docker"` and `tenants.toml` exists, the server exits with a fatal error at startup. The Docker daemon has no namespace concept, so multi-tenant isolation is impossible. This is a hard rejection, not a silent skip.
- **`server.api_key` disabled in multi-tenant mode.** The existing key must be migrated into `tenants.toml` as a tenant entry.
- **No server-side quotas.** Delegated to K8s ResourceQuota/LimitRange per namespace.
- **In-memory lookup, no file I/O on hot path.** Config is loaded into `dict[str, TenantEntry]` at startup and on fsnotify events.

### Risks and Mitigations

| Risk | Mitigation |
|------|------------|
| Plaintext API keys in `tenants.toml` | File permissions 0600; ConfigMap with restricted RBAC; future: K8s Secret reference |
| ConfigMap update delay on multi-replica | kubelet syncs ~1 min; fsnotify triggers reload on each replica independently |
| Namespace doesn't exist at tenant creation | Startup validation; `create_sandbox` returns clear 400 |
| Timing attack on API key comparison | `secrets.compare_digest` (constant-time) |
| Informer memory growth with many namespaces | Lazily created per namespace, only for active sandboxes |

## Design Details

Implementation in 6 steps. No step blocks another except where noted.

---

### TenantProvider Abstraction

Tenant resolution sits behind a `TenantProvider` interface, decoupling auth middleware from any specific config source. This lets the initial implementation ship with a simple file-based provider while leaving a clean extension point for enterprise deployments that already manage tenants in an external IAM or tenant management system.

**Interface (pseudocode):**
```
TenantProvider (Protocol):
    lookup(api_key: str) → TenantEntry | None
    list_tenants() → list[TenantEntry]      # for startup validation
    ready() → bool                          # provider has loaded initial state
    on_reload(callback) → None              # notify consumers on config change (optional)
```

**Initial provider (FileTenantProvider):**
- Backed by `tenants.toml`, loaded at startup, hot-reloaded via fsnotify
- Implements full `TenantProvider` interface
- `ready()` returns `True` after initial file parse succeeds
- `on_reload` triggers on fsnotify events; auth middleware picks up new key→tenant mappings without restart

**Future providers (not in this OSEP, but the interface accommodates them):**
- `HTTPTenantProvider`: polls or streams from an internal IAM API; tenant metadata, key rotation, enable/disable all managed in the external system
- `K8sConfigMapProvider`: watches a ConfigMap or Secret across namespaces
- Composite/chained providers for fallback (e.g., file + external API merge)

**Startup wiring (pseudocode):**
```
if tenants.toml exists:
    provider = FileTenantProvider(path)
    if not provider.ready():
        → SystemExit (parse error, duplicates, etc.)
else:
    provider = None   # single-tenant mode
```
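
The interface and wiring above could be expressed in Python with `typing.Protocol`. This is a minimal sketch, not the proposed implementation: `StaticTenantProvider` and `make_provider` are hypothetical names introduced here for illustration, and the fixed in-memory table stands in for `FileTenantProvider`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Protocol, Tuple, runtime_checkable


@dataclass(frozen=True)
class TenantEntry:
    name: str
    namespace: str
    api_keys: Tuple[str, ...]


@runtime_checkable
class TenantProvider(Protocol):
    """Structural interface: any class with these methods qualifies."""

    def lookup(self, api_key: str) -> Optional[TenantEntry]: ...
    def list_tenants(self) -> List[TenantEntry]: ...
    def ready(self) -> bool: ...
    def on_reload(self, callback: Callable[[], None]) -> None: ...


class StaticTenantProvider:
    """Hypothetical fixed-table provider (e.g. for tests); not named in the OSEP."""

    def __init__(self, tenants: List[TenantEntry]) -> None:
        self._tenants = list(tenants)
        self._by_key: Dict[str, TenantEntry] = {
            key: tenant for tenant in tenants for key in tenant.api_keys
        }

    def lookup(self, api_key: str) -> Optional[TenantEntry]:
        return self._by_key.get(api_key)

    def list_tenants(self) -> List[TenantEntry]:
        return list(self._tenants)

    def ready(self) -> bool:
        return True  # a fixed table is always loaded

    def on_reload(self, callback: Callable[[], None]) -> None:
        return None  # static data never changes, so the callback never fires


def make_provider(tenants: Optional[List[TenantEntry]]) -> Optional[TenantProvider]:
    """Startup wiring: None means single-tenant mode."""
    if tenants is None:
        return None
    provider = StaticTenantProvider(tenants)
    if not provider.ready():
        raise SystemExit("tenant provider failed to load")
    return provider


provider = make_provider([TenantEntry("team-a", "ns-a", ("key1", "key2"))])
```

Because the protocol is structural, `StaticTenantProvider` satisfies `TenantProvider` without inheriting from it, which is what keeps middleware code independent of any one backend.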

Auth middleware depends only on `TenantProvider`, not on `FileTenantProvider` directly. Switching backends in the future does not touch auth code.

---

### Config Model & Loading Flow (FileTenantProvider)

**New package:** `opensandbox_server/tenants/`

This is the initial `TenantProvider` implementation. It reads `tenants.toml` and hot-reloads on file changes.

**Data model (pseudocode):**
```
TenantEntry:
  - name: str
  - namespace: str
  - api_keys: list[str]

TenantsConfig:
  - entries: list[TenantEntry]
  - validation: reject duplicate api_keys across tenants (on parse)
```

**Loading flow:**
```
FileTenantProvider(path):
  1. resolve path: env SANDBOX_TENANTS_CONFIG_PATH || ~/.opensandbox/tenants.toml
  2. if file absent → ready() returns False → server stays in single-tenant mode
  3. parse TOML → TenantsConfig → build dict[api_key → TenantEntry]
  4. on parse error or duplicate keys → raise, server exits
  5. start fsnotify watcher thread for hot-reload
```
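
Step 3 of the loading flow (validate entries, then build the key index) might look like the sketch below. It assumes the TOML has already been parsed into a plain dict (for example by `tomllib.loads` on Python 3.11+); `build_key_index` is a hypothetical helper name, not one from the proposal.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class TenantEntry:
    name: str
    namespace: str
    api_keys: Tuple[str, ...]


def build_key_index(raw: Dict[str, Any]) -> Dict[str, TenantEntry]:
    """Validate a parsed tenants.toml document and index it by API key."""
    index: Dict[str, TenantEntry] = {}
    for item in raw.get("tenants", []):
        # name, namespace and a non-empty api_keys list are all required.
        missing = [f for f in ("name", "namespace", "api_keys") if not item.get(f)]
        if missing:
            raise ValueError(f"tenant entry missing required fields: {missing}")
        entry = TenantEntry(item["name"], item["namespace"], tuple(item["api_keys"]))
        for key in entry.api_keys:
            other = index.get(key)
            if other is not None and other is not entry:
                # Report tenant names only; never echo the key itself.
                raise ValueError(
                    f"duplicate API key shared by tenants {other.name!r} and {entry.name!r}"
                )
            index[key] = entry
    return index


# Shape a TOML parser would produce for the [[tenants]] example in the Proposal.
parsed = {"tenants": [{"name": "team-a", "namespace": "ns-a", "api_keys": ["key1", "key2"]}]}
index = build_key_index(parsed)
```

Raising `ValueError` here matches the unit-test expectation in the Test Plan; the caller decides whether that is fatal (startup) or only logged (reload).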

**Hot-reload behavior:**
```
- maintains dict[api_key → TenantEntry] under threading.Lock
- on file change: reload atomically (swap dict under lock)
- on parse error during reload: log warning, keep old entries (no downtime)
- file delete → clear all entries (all tenant keys → 401)
- new key added → live immediately on next lookup
```
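
The swap-under-lock behavior listed above can be sketched as follows. `ReloadableIndex` and the `loader` callable are hypothetical names; the loader is assumed to return the new table, return `None` when the file has been deleted, and raise on a parse error. Values are plain namespace strings to keep the example short.

```python
import logging
import threading
from typing import Callable, Dict, Optional

logger = logging.getLogger("tenants")


class ReloadableIndex:
    """API key -> namespace table that is replaced atomically on reload."""

    def __init__(self, loader: Callable[[], Optional[Dict[str, str]]]) -> None:
        self._loader = loader
        self._lock = threading.Lock()
        self._entries: Dict[str, str] = {}

    def reload(self) -> None:
        try:
            fresh = self._loader()  # parse outside the lock so lookups are not blocked
        except Exception as exc:
            # Parse error: log and keep serving the previous table.
            logger.warning("tenant reload failed, keeping old entries: %s", exc)
            return
        with self._lock:
            # File deleted (None) clears every entry; otherwise swap in the new table.
            self._entries = {} if fresh is None else dict(fresh)

    def lookup(self, api_key: str) -> Optional[str]:
        with self._lock:
            return self._entries.get(api_key)


state = {"mode": "ok"}


def load_tenants() -> Optional[Dict[str, str]]:
    """Stand-in for reading and parsing tenants.toml."""
    if state["mode"] == "broken":
        raise ValueError("invalid TOML")
    if state["mode"] == "deleted":
        return None
    return {"key1": "ns-a"}


index = ReloadableIndex(load_tenants)
index.reload()             # initial load
state["mode"] = "broken"
index.reload()             # parse error: the previous table stays live
```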

The watcher monitors the parent directory rather than the file itself, because Kubernetes updates a mounted ConfigMap through an atomic symlink swap.

---

### Auth Middleware Flow

**Modify:** `middleware/auth.py`

**Mode detection:** `TenantProvider` instance passed in → multi-tenant; `None` → single-tenant. Middleware depends only on the `TenantProvider` interface, not on `FileTenantProvider`.

**Startup validation:**
```
if provider is not None AND server.api_key is set:
    → SystemExit("Remove server.api_key from server.toml")
```

**Auth flow (pseudocode):**
```
authenticate(request) → (ok: bool, tenant: TenantEntry | None):
    api_key = request.headers["OPEN-SANDBOX-API-KEY"]

    if multi-tenant mode:
        tenant = provider.lookup(api_key)          # TenantEntry or None
        return (tenant is not None, tenant)
    else:
        # single-tenant: there is no TenantEntry, so success is carried by `ok`
        # server.api_key not set = no key configured, reject
        ok = server.api_key is set AND constant_time_compare(server.api_key, api_key)
        return (ok, None)
```

**Tenant context propagation:**
```
dispatch(request):
    ok, tenant = authenticate(request)
    if not ok → 401                                # covers both modes
    request.state.tenant = tenant                  # TenantEntry | None
    ContextVar("current_tenant").set(tenant)       # for downstream access
```
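
A framework-free sketch of the auth and propagation flow, using `secrets.compare_digest` and `contextvars` from the standard library. The function names and the plain `handler` callable are illustrative assumptions; a real middleware would receive a request object and call the next handler. Scanning every key in `lookup_constant_time` is one possible way to keep timing independent of which key matched, not something the proposal prescribes.

```python
import secrets
from contextvars import ContextVar
from dataclasses import dataclass
from typing import Callable, Mapping, Optional, Tuple


@dataclass(frozen=True)
class TenantEntry:
    name: str
    namespace: str


_current_tenant: ContextVar[Optional[TenantEntry]] = ContextVar("current_tenant", default=None)


def get_current_tenant() -> Optional[TenantEntry]:
    return _current_tenant.get()


def lookup_constant_time(keys: Mapping[str, TenantEntry], presented: str) -> Optional[TenantEntry]:
    """Compare against every configured key; do not stop at the first match."""
    match = None
    for key, tenant in keys.items():
        if secrets.compare_digest(key.encode(), presented.encode()):
            match = tenant
    return match


def authenticate(
    presented: str,
    keys: Optional[Mapping[str, TenantEntry]],
    server_api_key: Optional[str],
) -> Tuple[bool, Optional[TenantEntry]]:
    if keys is not None:  # multi-tenant mode
        tenant = lookup_constant_time(keys, presented)
        return tenant is not None, tenant
    # Single-tenant mode: an unset or empty server key rejects everything.
    ok = bool(server_api_key) and secrets.compare_digest(
        server_api_key.encode(), presented.encode()
    )
    return ok, None


def dispatch(
    presented: str,
    handler: Callable[[], str],
    keys: Optional[Mapping[str, TenantEntry]] = None,
    server_api_key: Optional[str] = None,
) -> Tuple[int, Optional[str]]:
    ok, tenant = authenticate(presented, keys, server_api_key)
    if not ok:
        return 401, None
    token = _current_tenant.set(tenant)
    try:
        return 200, handler()
    finally:
        _current_tenant.reset(token)  # do not leak tenant state past this request


def handler() -> str:
    tenant = get_current_tenant()
    return tenant.namespace if tenant else "default"


tenant_keys = {"key1": TenantEntry("team-a", "ns-a")}
```

Resetting the `ContextVar` in a `finally` block keeps tenant state scoped to a single request even when the handler raises.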

Downstream code reads tenant via `get_current_tenant() → TenantEntry | None`.

---

### Sandbox Service — Namespace Resolution

**Modify:** `services/kubernetes_service.py`

All K8s API calls replace `self.namespace` with a runtime-resolved namespace:

```
_resolve_namespace():
    tenant = get_current_tenant()
    return tenant.namespace if tenant else self.namespace   # config default

_resolve_tenant_name():
    tenant = get_current_tenant()
    return tenant.name if tenant else "default"
```

Methods affected: `create_sandbox`, `list_sandboxes`, `get_sandbox`, `delete_sandbox`.

**Sandbox labels on create:** add `opensandbox.io/tenant = <tenant_name>`.

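As a rough illustration of how namespace resolution and the tenant label fit together, the sketch below builds the metadata a create call would carry. `KubernetesSandboxService.build_create_request` and the default namespace `"opensandbox"` are hypothetical; no Kubernetes client is involved.

```python
from contextvars import ContextVar
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class TenantEntry:
    name: str
    namespace: str


current_tenant: ContextVar[Optional[TenantEntry]] = ContextVar("current_tenant", default=None)


class KubernetesSandboxService:
    def __init__(self, default_namespace: str) -> None:
        self.namespace = default_namespace  # kubernetes.namespace from server.toml

    def _resolve_namespace(self) -> str:
        tenant = current_tenant.get()
        return tenant.namespace if tenant else self.namespace

    def _resolve_tenant_name(self) -> str:
        tenant = current_tenant.get()
        return tenant.name if tenant else "default"

    def build_create_request(self, sandbox_id: str) -> Dict[str, object]:
        """Hypothetical helper: metadata a create call would send to the K8s API."""
        return {
            "namespace": self._resolve_namespace(),
            "name": sandbox_id,
            "labels": {"opensandbox.io/tenant": self._resolve_tenant_name()},
        }


service = KubernetesSandboxService("opensandbox")
single_tenant_request = service.build_create_request("sb-1")
```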

**Proxy route ownership:** proxy routes (`/sandboxes/{id}/proxy/{port}/...`) bypass API key auth by design, because end users hitting sandboxes don't carry `OPEN-SANDBOX-API-KEY`. The ingress gateway is intentionally tenant-unaware.

Isolation at the proxy layer relies on:
- **Unguessable sandbox IDs** (random UUIDs): knowing one tenant's sandbox ID doesn't reveal another's
- **Signed route tokens** (OSEP-0011): time-limited, cryptographically bound to a single sandbox
- **K8s namespace isolation**: even if traffic reaches a pod, NetworkPolicy restricts cross-namespace pod-to-pod communication

No tenant context is injected on proxy paths. The server resolves the sandbox endpoint purely by sandbox ID and forwards. Tenancy is enforced at lifecycle API boundaries (create/get/list/delete), not at data-plane proxy boundaries.

---

### Startup Guards

**Modify:** `main.py` or `app.py`, before server start.

```
validate_tenant_startup():
    1. Docker + tenants.toml → SystemExit
    2. Missing tenant namespaces → SystemExit (list missing)
    3. server.api_key + tenants.toml coexisting → SystemExit
```
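
The three guards could be written as one function that raises `SystemExit` on the first violation. In this sketch the parameters are assumptions: `tenant_namespaces` is `None` when `tenants.toml` is absent, and `namespace_exists` is a hypothetical callable wrapping whatever Kubernetes namespace read the server uses.

```python
from typing import Callable, Iterable, Optional


def validate_tenant_startup(
    runtime_type: str,
    tenant_namespaces: Optional[Iterable[str]],
    server_api_key: Optional[str],
    namespace_exists: Callable[[str], bool],
) -> None:
    """Raise SystemExit if the multi-tenant configuration is not usable."""
    if tenant_namespaces is None:
        return  # tenants.toml absent: single-tenant mode, nothing to check

    # Guard 1: Docker has no namespace primitive.
    if runtime_type == "docker":
        raise SystemExit('tenants.toml is not supported with runtime.type = "docker"')

    # Guard 2: every tenant namespace must already exist; report all missing ones.
    missing = sorted({ns for ns in tenant_namespaces if not namespace_exists(ns)})
    if missing:
        raise SystemExit(f"Tenant namespaces not found: {', '.join(missing)}")

    # Guard 3: the legacy key and tenants.toml must not coexist.
    if server_api_key:
        raise SystemExit("Remove server.api_key from server.toml")


existing_namespaces = {"ns-a"}
validate_tenant_startup("kubernetes", ["ns-a"], None, existing_namespaces.__contains__)
```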

Namespace validation: iterate all tenant entries, call `k8s.read_namespace()` for each, and collect any that are missing. All must exist at startup.

---

### Deployment Changes

**New file:** `deploy/kubernetes/configmap-tenants.yaml`. **Modified:** `rbac.yaml`, `deployment.yaml`.

- **Split ConfigMaps:** `opensandbox-server` (server.toml) + `opensandbox-tenants` (tenants.toml)
- **Deployment:** mount both ConfigMaps, set `SANDBOX_TENANTS_CONFIG_PATH` env var
- **RBAC:** upgrade `Role` → `ClusterRole` + `ClusterRoleBinding` (multi-namespace access required)

---

### Tenant Isolation Model (Reference)

Server does not enforce quotas. Isolation is delegated to K8s:

| Isolation dimension | K8s mechanism | Scope |
|--------------------|---------------|-------|
| Resource quota | `ResourceQuota` | Per-ns CPU, memory, storage |
| Default limits | `LimitRange` | Per-ns default container resources |
| Network policy | `NetworkPolicy` | Per-ns ingress/egress |
| Sandbox count | `count/batchsandboxes` via `ResourceQuota` | Per-ns CR count |
| RBAC | `RoleBinding` | Per-ns API access |

The cluster admin creates each per-tenant namespace with ResourceQuota + LimitRange before tenant onboarding.

## Test Plan

**Unit tests:**
- Duplicate API keys across tenants → `ValueError` at config parse
- Auth: multi-tenant rejects `server.api_key`; accepts valid tenant key; rejects invalid → 401
- `FileTenantProvider`: file delete → entries cleared; new key → live in lookup; parse error → old entries kept
- Docker + tenants → `SystemExit`

**Integration tests:**
- Create with tenant A key → sandbox in ns-a with label `opensandbox.io/tenant=team-a`
- List with tenant A → only ns-a sandboxes
- Get/delete tenant A sandbox with tenant B key → 404
- Hot reload: new key works without restart; removed key → 401
- Legacy: delete `tenants.toml` and restart → `server.api_key` works again

**End-to-end:**
- Key rotation: add new key, verify both work, remove old key
- Multi-replica: update ConfigMap, all replicas pick up within 60s

## Drawbacks

- **Two config files.** Mitigated by clear startup logging of which mode is active.
- **ClusterRole required.** Broader RBAC than single-namespace RoleBinding. Inherent to multi-tenancy; scoped by resource types.
- **No dynamic tenant CRUD.** Static config only. REST API / CRD deferred to future OSEP.

## Alternatives

| Approach | Rejected because |
|----------|-----------------|
| Embed tenants in `server.toml` | Tenant changes require server restart |
| Couple auth directly to `tenants.toml` file format | Locks out enterprise deployments where tenants already live in IAM/external systems; `TenantProvider` interface avoids this |
| SQLite for tenant storage | Single-node; breaks multi-replica |
| One server instance per tenant | High operational cost (N processes) |
| Soft multi-tenancy (labels, one namespace) | No K8s-native isolation; ResourceQuota/NetworkPolicy not per-tenant |
| Single API key per tenant | No key rotation; replacing key causes downtime |

## Infrastructure Needed

- One K8s namespace per tenant (cluster admin creates)
- Per-namespace ResourceQuota + LimitRange (recommended)
- `opensandbox-tenants` ConfigMap in server namespace
- ClusterRole + ClusterRoleBinding for server ServiceAccount

## Upgrade & Migration Strategy

**Existing single-tenant → multi-tenant:**

1. Create target namespace(s)
2. Write `tenants.toml` with the existing key as a tenant entry (same namespace)
3. Remove `api_key` from `server.toml` (required: the startup guard rejects `server.api_key` when `tenants.toml` is present)
4. Mount `tenants.toml` via ConfigMap alongside `server.toml`
5. Deploy; the old key continues working as a tenant key
6. Add more tenants as needed

**Rollback:** Delete the `tenants.toml` ConfigMap, restore `api_key` in `server.toml`, and restart. The server falls back to `server.api_key` + `kubernetes.namespace`.

**No data migration needed.** Existing sandboxes stay in their namespace.
