|
| 1 | +--- |
| 2 | +title: Multi-Tenancy Support for Kubernetes Runtime |
| 3 | +authors: |
| 4 | + - "@Pangjiping" |
| 5 | +creation-date: 2026-04-29 |
| 6 | +last-updated: 2026-05-07 |
| 7 | +status: draft |
| 8 | +--- |
| 9 | + |
| 10 | +# OSEP-0012: Multi-Tenancy Support for Kubernetes Runtime |
| 11 | + |
| 12 | +<!-- toc --> |
| 13 | +- [Summary](#summary) |
| 14 | +- [Motivation](#motivation) |
| 15 | + - [Goals](#goals) |
| 16 | + - [Non-Goals](#non-goals) |
| 17 | +- [Requirements](#requirements) |
| 18 | +- [Proposal](#proposal) |
| 19 | + - [Notes/Constraints/Caveats](#notesconstraintscaveats) |
| 20 | + - [Risks and Mitigations](#risks-and-mitigations) |
| 21 | +- [Design Details](#design-details) |
| 22 | + - [TenantProvider Abstraction](#tenantprovider-abstraction) |
| 23 | + - [Config Model & Loading Flow (FileTenantProvider)](#config-model--loading-flow-filetenantprovider) |
| 24 | + - [Auth Middleware Flow](#auth-middleware-flow) |
| 25 | + - [Sandbox Service — Namespace Resolution](#sandbox-service--namespace-resolution) |
| 26 | + - [Startup Guards](#startup-guards) |
| 27 | + - [Deployment Changes](#deployment-changes) |
| 28 | + - [Tenant Isolation Model (Reference)](#tenant-isolation-model-reference) |
| 29 | +- [Test Plan](#test-plan) |
| 30 | +- [Drawbacks](#drawbacks) |
| 31 | +- [Alternatives](#alternatives) |
| 32 | +- [Infrastructure Needed](#infrastructure-needed) |
| 33 | +- [Upgrade & Migration Strategy](#upgrade--migration-strategy) |
| 34 | +<!-- /toc --> |
| 35 | + |
| 36 | +## Summary |
| 37 | + |
| 38 | +Add multi-tenancy support to OpenSandbox Server when running on Kubernetes. A new config file `tenants.toml` maps API keys to Kubernetes namespaces, enabling K8s-level isolation between tenants. The feature is opt-in: when `tenants.toml` exists, the server enters multi-tenant mode; when it is absent, single-tenant behavior is unchanged. |
| 39 | + |
| 40 | +**Docker runtime is explicitly unsupported.** If `runtime.type = "docker"` and `tenants.toml` exists, the server refuses to start with a clear error. Multi-tenancy requires Kubernetes namespaces — Docker has no equivalent isolation primitive. |
| 41 | + |
| 42 | +## Motivation |
| 43 | + |
| 44 | +Current deployment shares a single API key and a single K8s namespace across all sandbox consumers. Problems: |
| 45 | + |
| 46 | +1. **No workload isolation.** All sandboxes in one namespace — one misbehaving consumer affects all. ResourceQuota, NetworkPolicy, LimitRange cannot be per-consumer. |
| 47 | +2. **No credential isolation.** One shared key = no per-consumer audit trail, no per-consumer revocation, no per-consumer rate limiting. |
| 48 | + |
| 49 | +Multi-tenancy gives each tenant its own namespace and API key(s) while keeping a single server deployment. |
| 50 | + |
| 51 | +### Goals |
| 52 | + |
| 53 | +- Define tenants in an independent config file (`tenants.toml`); no schema changes to `server.toml` (only `server.api_key` must be unset in multi-tenant mode) |
| 54 | +- Each tenant → dedicated K8s namespace |
| 55 | +- Multiple API keys per tenant (key rotation without downtime) |
| 56 | +- Hot-reload via fsnotify — no restart |
| 57 | +- Single-tenant mode fully intact when `tenants.toml` absent |
| 58 | +- Docker runtime explicitly unsupported — server refuses to start if `tenants.toml` present with `runtime.type = "docker"` |
| 59 | + |
| 60 | +### Non-Goals |
| 61 | + |
| 62 | +- Docker runtime multi-tenancy — Docker has no namespace concept; `tenants.toml` with Docker is a startup error, not silently ignored |
| 63 | +- Ingress gateway tenant isolation — ingress is a data-plane routing layer, intentionally tenant-unaware; isolation at proxy layer relies on unguessable sandbox IDs + signed tokens + K8s NetworkPolicy |
| 64 | +- Dynamic tenant CRUD via REST API (future OSEP) |
| 65 | +- Per-tenant rate limiting at server layer (delegate to K8s/ingress) |
| 66 | +- Server-side resource quotas (delegate to K8s ResourceQuota) |
| 67 | +- Migration tooling (manual, documented) |
| 68 | + |
| 69 | +## Requirements |
| 70 | + |
| 71 | +- `tenants.toml` existence = sole trigger for multi-tenant mode |
| 72 | +- When `tenants.toml` exists, `server.api_key` in `server.toml` MUST be rejected |
| 73 | +- Each tenant entry MUST have: `name`, `namespace`, `api_keys` (non-empty) |
| 74 | +- Auth MUST use constant-time comparison on API keys |
| 75 | +- Startup MUST validate all tenant namespaces exist and are accessible |
| 76 | +- Sandbox `create`/`get`/`list`/`delete` operate within authenticated tenant's namespace |
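| 77 | +- Proxy routes stay tenant-unaware: they resolve the target sandbox by ID only and rely on unguessable sandbox IDs, signed route tokens (OSEP-0011), and K8s NetworkPolicy for isolation |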
| 78 | +- Tenant config changes propagate to all server replicas without restart |
| 79 | +- `runtime.type = "docker"` with `tenants.toml` present MUST cause a fatal startup error — multi-tenancy is a K8s-only feature and Docker has no namespace primitive |
| 80 | + |
| 81 | +## Proposal |
| 82 | + |
| 83 | +Introduce a `TenantProvider` abstraction for tenant resolution. The initial implementation is `FileTenantProvider`, backed by `tenants.toml` at `~/.opensandbox/tenants.toml` (overridable via `SANDBOX_TENANTS_CONFIG_PATH`). Auth middleware depends only on the interface, not the file — this leaves room for future providers (HTTP API, K8s Secret, external IAM) without touching auth code. |
| 84 | + |
| 85 | +``` |
| 86 | + ┌───────────────────────────────┐ |
| 87 | + │ server.toml (unchanged) │ |
| 88 | + │ [server] api_key = "..." │ |
| 89 | + │ [kubernetes] namespace = "..." │ |
| 90 | + └───────────────────────────────┘ |
| 91 | + + |
| 92 | + ┌───────────────────────────────┐ |
| 93 | + │ tenants.toml (new, optional) │ |
| 94 | + │ [[tenants]] │ |
| 95 | + │ name = "team-a" │ |
| 96 | + │ namespace = "ns-a" │ |
| 97 | + │ api_keys = ["key1", "key2"] │ |
| 98 | + └───────────────────────────────┘ |
| 99 | + |
| 100 | + FileTenantProvider (initial backend) |
| 101 | + TenantProvider interface (extension point) |
| 102 | +``` |
| 103 | + |
| 104 | +**Request routing flow:** |
| 105 | + |
| 106 | +``` |
| 107 | +Server startup |
| 108 | + │ |
| 109 | + ├── runtime.type = "docker" AND tenants.toml exists? |
| 110 | + │ └── YES → FATAL: exit with error. Docker has no namespace isolation. |
| 111 | + │ |
| 112 | + └── runtime.type = "kubernetes" (or Docker without tenants.toml) |
| 113 | + │ |
| 114 | +Request with OPEN-SANDBOX-API-KEY header |
| 115 | + │ |
| 116 | + ├── tenants.toml exists? |
| 117 | + │ ├── YES → lookup key in tenant api_keys |
| 118 | + │ │ ├── found → inject tenant context, route to tenant.namespace |
| 119 | + │ │ └── not found → 401 |
| 120 | + │ └── NO → validate against server.api_key (legacy single-tenant) |
| 121 | + │ ├── valid → route to kubernetes.namespace |
| 122 | + │ └── invalid → 401 |
| 123 | +``` |
| 124 | + |
| 125 | +### Notes/Constraints/Caveats |
| 126 | + |
| 127 | +- **Docker runtime NOT supported.** If `runtime.type = "docker"` and `tenants.toml` exists, server exits with a fatal error at startup. Docker daemon has no namespace concept — multi-tenancy isolation is impossible. This is a hard rejection, not a silent skip. |
| 128 | +- **`server.api_key` disabled in multi-tenant.** Must migrate it into `tenants.toml` as a tenant entry. |
| 129 | +- **No server-side quotas.** Delegated to K8s ResourceQuota/LimitRange per namespace. |
| 130 | +- **In-memory lookup, no file I/O on hot path.** Config loaded into `dict[str, TenantEntry]` at startup and on fsnotify events. |
| 131 | + |
| 132 | +### Risks and Mitigations |
| 133 | + |
| 134 | +| Risk | Mitigation | |
| 135 | +|------|------------| |
| 136 | +| Plaintext API keys in `tenants.toml` | File permissions 0600; ConfigMap with restricted RBAC; future: K8s Secret reference | |
| 137 | +| ConfigMap update delay on multi-replica | kubelet syncs ~1 min; fsnotify triggers reload on each replica independently | |
| 138 | +| Namespace doesn't exist at tenant creation | Startup validation; `create_sandbox` returns clear 400 | |
| 139 | +| Timing attack on API key comparison | `secrets.compare_digest` (constant-time) | |
| 140 | +| Informer memory growth with many namespaces | Lazily created per namespace, only for active sandboxes | |
| 141 | + |
| 142 | +## Design Details |
| 143 | + |
| 144 | +Implementation in 6 steps. No step blocks another except where noted. |
| 145 | + |
| 146 | +--- |
| 147 | + |
| 148 | +### TenantProvider Abstraction |
| 149 | + |
| 150 | +Tenant resolution is behind a `TenantProvider` interface, decoupling auth middleware from any specific config source. This lets the initial implementation ship with a simple file-based provider while leaving a clean extension point for enterprise deployments that already manage tenants in an external IAM or tenant management system. |
| 151 | + |
| 152 | +**Interface (pseudocode):** |
| 153 | +``` |
| 154 | +TenantProvider (Protocol): |
| 155 | + lookup(api_key: str) → TenantEntry | None |
| 156 | + list_tenants() → list[TenantEntry] # for startup validation |
| 157 | + ready() → bool # provider has loaded initial state |
| 158 | + on_reload(callback) → None # notify consumers on config change (optional) |
| 159 | +``` |
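The interface above could be expressed in Python roughly as follows. This is a minimal sketch, not the proposed implementation: the `typing.Protocol` form, the frozen dataclass, and the `StaticTenantProvider` class are assumptions added for illustration (the last one is a hypothetical in-memory provider that would mostly be useful in tests).

```python
from dataclasses import dataclass
from typing import Callable, Optional, Protocol, runtime_checkable


@dataclass(frozen=True)
class TenantEntry:
    name: str
    namespace: str
    api_keys: tuple  # tuple of str; immutable so entries can be shared safely


@runtime_checkable
class TenantProvider(Protocol):
    """Structural interface: any class with these methods qualifies."""

    def lookup(self, api_key: str) -> Optional[TenantEntry]: ...
    def list_tenants(self) -> list: ...
    def ready(self) -> bool: ...
    def on_reload(self, callback: Callable[[], None]) -> None: ...


class StaticTenantProvider:
    """Hypothetical provider backed by a fixed list (illustration only)."""

    def __init__(self, tenants: list) -> None:
        self._tenants = list(tenants)
        # Flatten tenants into an api_key -> TenantEntry map.
        self._by_key = {key: t for t in tenants for key in t.api_keys}

    def lookup(self, api_key: str) -> Optional[TenantEntry]:
        return self._by_key.get(api_key)

    def list_tenants(self) -> list:
        return list(self._tenants)

    def ready(self) -> bool:
        return True  # nothing to load, so always ready

    def on_reload(self, callback: Callable[[], None]) -> None:
        # Static data never changes, so the callback is never invoked.
        return None
```

Because the protocol is structural, `StaticTenantProvider` does not inherit from `TenantProvider`; auth code typed against the protocol would accept it as-is.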
| 160 | + |
| 161 | +**Initial provider — FileTenantProvider:** |
| 162 | +- Backed by `tenants.toml`, loaded at startup, hot-reloaded via fsnotify |
| 163 | +- Implements full `TenantProvider` interface |
| 164 | +- `ready()` returns `True` after initial file parse succeeds |
| 165 | +- `on_reload` triggers on fsnotify events; auth middleware picks up new key→tenant mappings without restart |
| 166 | + |
| 167 | +**Future providers (not in this OSEP, but the interface accommodates):** |
| 168 | +- `HTTPTenantProvider` — polls or streams from an internal IAM API; tenant metadata, key rotation, enable/disable all managed in the external system |
| 169 | +- `K8sConfigMapProvider` — watches a ConfigMap or Secret across namespaces |
| 170 | +- Composite/chained providers for fallback (e.g., file + external API merge) |
| 171 | + |
| 172 | +**Startup wiring (pseudocode):** |
| 173 | +``` |
| 174 | +if tenants.toml exists: |
| 175 | + provider = FileTenantProvider(path) |
| 176 | + if not provider.ready(): |
| 177 | + → SystemExit (parse error, duplicates, etc.) |
| 178 | +else: |
| 179 | + provider = None # single-tenant mode |
| 180 | +``` |
| 181 | + |
| 182 | +Auth middleware depends only on `TenantProvider`, not on `FileTenantProvider` directly. Switching backends in the future does not touch auth code. |
| 183 | + |
| 184 | +--- |
| 185 | + |
| 186 | +### Config Model & Loading Flow (FileTenantProvider) |
| 187 | + |
| 188 | +**New package:** `opensandbox_server/tenants/` |
| 189 | + |
| 190 | +This is the initial `TenantProvider` implementation. It reads `tenants.toml` and hot-reloads on file changes. |
| 191 | + |
| 192 | +**Data model (pseudocode):** |
| 193 | +``` |
| 194 | +TenantEntry: |
| 195 | + - name: str |
| 196 | + - namespace: str |
| 197 | + - api_keys: list[str] |
| 198 | + |
| 199 | +TenantsConfig: |
| 200 | + - entries: list[TenantEntry] |
| 201 | + - validation: reject duplicate api_keys across tenants (on parse) |
| 202 | +``` |
| 203 | + |
| 204 | +**Loading flow:** |
| 205 | +``` |
| 206 | +FileTenantProvider(path): |
| 207 | + 1. resolve path: env SANDBOX_TENANTS_CONFIG_PATH || ~/.opensandbox/tenants.toml |
| 208 | + 2. if file absent → provider is not constructed; server stays in single-tenant mode |
| 209 | + 3. parse TOML → TenantsConfig → build dict[api_key → TenantEntry] |
| 210 | + 4. on parse error or duplicate keys → raise, server exits |
| 211 | + 5. start fsnotify watcher thread for hot-reload |
| 212 | +``` |
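Steps 3 and 4 of the loading flow might look like the sketch below. The function names `build_index` and `load_index` are hypothetical, and the error messages are illustrative. `tomllib` is in the standard library from Python 3.11; an older runtime would need a third-party TOML parser instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantEntry:
    name: str
    namespace: str
    api_keys: tuple


def build_index(raw: dict) -> dict:
    """Validate parsed TOML data and return an api_key -> TenantEntry map."""
    index = {}
    for item in raw.get("tenants", []):
        # Every entry needs a non-empty name, namespace, and api_keys list.
        for field in ("name", "namespace", "api_keys"):
            if not item.get(field):
                raise ValueError(f"tenant entry has missing or empty field: {field}")
        entry = TenantEntry(item["name"], item["namespace"], tuple(item["api_keys"]))
        for key in entry.api_keys:
            if key in index:
                # Report tenant names only; never echo the key itself.
                raise ValueError(
                    f"duplicate API key shared by {index[key].name!r} and {entry.name!r}"
                )
            index[key] = entry
    return index


def load_index(path: str) -> dict:
    """Read tenants.toml from disk and build the lookup map."""
    import tomllib  # imported lazily; standard library since Python 3.11

    with open(path, "rb") as fh:
        return build_index(tomllib.load(fh))
```

Note that this sketch also rejects a key repeated inside a single tenant, which is slightly stricter than the cross-tenant rule stated above.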
| 213 | + |
| 214 | +**Hot-reload behavior:** |
| 215 | +``` |
| 216 | + - maintains dict[api_key → TenantEntry] under threading.Lock |
| 217 | + - on file change: reload atomically (swap dict under lock) |
| 218 | + - on parse error during reload: log warning, keep old entries (no downtime) |
| 219 | + - file delete → clear all entries (all tenant keys → 401) |
| 220 | + - new key added → live immediately on next lookup |
| 221 | +``` |
| 222 | +The watcher monitors the parent directory rather than the file itself, because a ConfigMap update arrives as an atomic symlink swap and not as an in-place write. |
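The swap-under-lock behavior can be sketched as below. The `TenantIndex` class and its `reload` signature are hypothetical, and the map stores `api_key -> namespace` strings only to keep the example short. File watching is left out; `reload` is what a watcher callback would call.

```python
import logging
import threading
from typing import Callable, Dict, Optional

log = logging.getLogger("tenants")


class TenantIndex:
    """In-memory key map that is replaced wholesale on reload."""

    def __init__(self, initial: Dict[str, str]) -> None:
        self._lock = threading.Lock()
        self._entries = dict(initial)

    def lookup(self, api_key: str) -> Optional[str]:
        with self._lock:
            return self._entries.get(api_key)

    def reload(self, load: Callable[[], Dict[str, str]]) -> bool:
        """Apply a fresh map; return False if the old one was kept."""
        try:
            fresh = dict(load())  # parse outside the lock
        except FileNotFoundError:
            fresh = {}  # file deleted: every tenant key now fails lookup
        except Exception as exc:
            # Parse or validation error: keep serving the previous entries.
            log.warning("tenant reload failed, keeping old entries: %s", exc)
            return False
        with self._lock:
            self._entries = fresh  # single reference swap
        return True
```

Parsing happens before the lock is taken, so a slow or failing parse never blocks request-path lookups.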
| 223 | + |
| 224 | +--- |
| 225 | + |
| 226 | +### Auth Middleware Flow |
| 227 | + |
| 228 | +**Modify:** `middleware/auth.py` |
| 229 | + |
| 230 | +**Mode detection:** `TenantProvider` instance passed in → multi-tenant; `None` → single-tenant. Middleware depends only on the `TenantProvider` interface, not on `FileTenantProvider`. |
| 231 | + |
| 232 | +**Startup validation:** |
| 233 | +``` |
| 234 | +if provider is not None AND server.api_key is set: |
| 235 | + → SystemExit("Remove server.api_key from server.toml") |
| 236 | +``` |
| 237 | + |
| 238 | +**Auth flow (pseudocode):** |
| 239 | +``` |
| 240 | +authenticate(request) → TenantEntry | None: |
| 241 | + api_key = request.headers.get("OPEN-SANDBOX-API-KEY") |
| 242 | + if api_key is missing or empty → 401 |
| 243 | + if multi-tenant mode: |
| 244 | + return provider.lookup(api_key) # TenantEntry, or None → 401 in dispatch |
| 245 | + else: |
| 246 | + if server.api_key is unset or not constant_time_compare(server.api_key, api_key) → 401 |
| 247 | + return None # single-tenant: authenticated, no tenant context |
| 248 | + # None therefore means "rejected" in multi-tenant mode and "allowed" in single-tenant mode |
| 249 | +``` |
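A plain dictionary lookup is not a constant-time comparison, and the proposal does not say how `provider.lookup` should reconcile the two. One possible approach, sketched here as an assumption, is to compare the presented key against every configured key with `secrets.compare_digest` (the function named in the Risks table) and never return early. The function name and the `api_key -> namespace` map are hypothetical.

```python
import secrets
from typing import Dict, Optional


def lookup_constant_time(index: Dict[str, str], presented: str) -> Optional[str]:
    """Return the namespace for `presented`, scanning every configured key."""
    presented_b = presented.encode("utf-8")
    match = None
    for key, namespace in index.items():
        # compare_digest takes time independent of where the inputs differ.
        # Bytes are used because str arguments must be ASCII-only.
        if secrets.compare_digest(key.encode("utf-8"), presented_b):
            match = namespace  # keep scanning; no early return
    return match
```

The cost grows linearly with the number of keys, which is likely acceptable for a file-based tenant list but is a trade-off against the dictionary lookup described above.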
| 250 | + |
| 251 | +**Tenant context propagation:** |
| 252 | +``` |
| 253 | +dispatch(request): |
| 254 | + tenant = authenticate(request) |
| 255 | + if multi-tenant and tenant is None → 401 |
| 256 | + if single-tenant and auth failed → 401 |
| 257 | + request.state.tenant = tenant # TenantEntry | None |
| 258 | + current_tenant.set(tenant) # current_tenant is a module-level ContextVar("current_tenant"); read by downstream code |
| 259 | +``` |
| 260 | + |
| 261 | +Downstream code reads tenant via `get_current_tenant() → TenantEntry | None`. |
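A minimal sketch of the context propagation, assuming the variable is created once at module level and reset after each request. `run_with_tenant` is a hypothetical stand-in for the middleware's dispatch, and the tenant is represented by its name for brevity.

```python
import contextvars
from typing import Callable, Optional

# Created once at import time; creating it per request would lose the value.
_current_tenant = contextvars.ContextVar("current_tenant", default=None)


def get_current_tenant() -> Optional[str]:
    """Return the tenant bound to the current request context, if any."""
    return _current_tenant.get()


def run_with_tenant(tenant: Optional[str], handler: Callable[[], object]) -> object:
    """Bind `tenant` for the duration of `handler`, then restore the old value."""
    token = _current_tenant.set(tenant)
    try:
        return handler()
    finally:
        # Reset so the value cannot leak into later work in the same context.
        _current_tenant.reset(token)
```

For example, `run_with_tenant("team-a", get_current_tenant)` returns `"team-a"`, and a call to `get_current_tenant()` afterwards returns `None` again.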
| 262 | + |
| 263 | +--- |
| 264 | + |
| 265 | +### Sandbox Service — Namespace Resolution |
| 266 | + |
| 267 | +**Modify:** `services/kubernetes_service.py` |
| 268 | + |
| 269 | +All K8s API calls replace `self.namespace` with runtime-resolved namespace: |
| 270 | + |
| 271 | +``` |
| 272 | +_resolve_namespace(): |
| 273 | + tenant = get_current_tenant() |
| 274 | + return tenant.namespace if tenant else self.namespace # config default |
| 275 | + |
| 276 | +_resolve_tenant_name(): |
| 277 | + tenant = get_current_tenant() |
| 278 | + return tenant.name if tenant else "default" |
| 279 | +``` |
| 280 | + |
| 281 | +Methods affected: `create_sandbox`, `list_sandboxes`, `get_sandbox`, `delete_sandbox`. |
| 282 | + |
| 283 | +**Sandbox labels on create:** add `opensandbox.io/tenant = <tenant_name>`. |
| 284 | + |
| 285 | +**Proxy route ownership:** proxy routes (`/sandboxes/{id}/proxy/{port}/...`) bypass API key auth by design — end users hitting sandboxes don't carry `OPEN-SANDBOX-API-KEY`. Ingress gateway is intentionally tenant-unaware. |
| 286 | + |
| 287 | +Isolation at proxy layer relies on: |
| 288 | +- **Unguessable sandbox IDs** (random UUIDs) — knowing one tenant's sandbox ID doesn't reveal another's |
| 289 | +- **Signed route tokens** (OSEP-0011) — time-limited, cryptographically bound to a single sandbox |
| 290 | +- **K8s namespace isolation** — even if traffic reaches a pod, NetworkPolicy restricts cross-namespace pod-to-pod communication |
| 291 | + |
| 292 | +No tenant context is injected on proxy paths. The server resolves the sandbox endpoint purely by sandbox ID and forwards. Tenancy is enforced at lifecycle API boundaries (create/get/list/delete), not at data-plane proxy boundaries. |
| 293 | + |
| 294 | +--- |
| 295 | + |
| 296 | +### Startup Guards |
| 297 | + |
| 298 | +**Modify:** `main.py` or `app.py` — before server start. |
| 299 | + |
| 300 | +``` |
| 301 | +validate_tenant_startup(): |
| 302 | + 1. Docker + tenants.toml → SystemExit |
| 303 | + 2. Missing tenant namespaces → SystemExit (list missing) |
| 304 | + 3. server.api_key + tenants.toml coexisting → SystemExit |
| 305 | +``` |
| 306 | + |
| 307 | +Namespace validation: iterate all tenant entries, call `k8s.read_namespace()` for each. Collect missing. All must exist at startup. |
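The three guards could be combined as in the sketch below. The parameter names are hypothetical, and `namespace_exists` is an injected callable so the example does not depend on a Kubernetes client; in a real server it would wrap the namespace read call and translate a not-found response to `False`.

```python
from typing import Callable, Iterable, Optional


def validate_tenant_startup(
    runtime_type: str,
    tenants_present: bool,
    server_api_key: Optional[str],
    namespaces: Iterable[str],
    namespace_exists: Callable[[str], bool],
) -> None:
    """Exit the process if the multi-tenant configuration is inconsistent."""
    if not tenants_present:
        return  # single-tenant mode: nothing to validate

    # 1. Docker has no namespace primitive, so tenants.toml is a hard error.
    if runtime_type == "docker":
        raise SystemExit('tenants.toml is not supported with runtime.type = "docker"')

    # 2. Collect every missing namespace so the operator sees them all at once.
    missing = sorted({ns for ns in namespaces if not namespace_exists(ns)})
    if missing:
        raise SystemExit("Missing tenant namespaces: " + ", ".join(missing))

    # 3. The legacy key must not coexist with tenants.toml.
    if server_api_key:
        raise SystemExit("Remove server.api_key from server.toml")
```

The checks run in the order listed above. Running the `server.api_key` check before the namespace check would avoid API calls on an obviously invalid config, at the cost of diverging from that order.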
| 308 | + |
| 309 | +--- |
| 310 | + |
| 311 | +### Deployment Changes |
| 312 | + |
| 313 | +**New file:** `deploy/kubernetes/configmap-tenants.yaml`. **Modify:** `rbac.yaml`, `deployment.yaml`. |
| 314 | + |
| 315 | +- **Split ConfigMaps:** `opensandbox-server` (server.toml) + `opensandbox-tenants` (tenants.toml) |
| 316 | +- **Deployment:** mount both ConfigMaps, set `SANDBOX_TENANTS_CONFIG_PATH` env var |
| 317 | +- **RBAC:** upgrade `Role` → `ClusterRole` + `ClusterRoleBinding` (multi-namespace access required) |
| 318 | + |
| 319 | +--- |
| 320 | + |
| 321 | +### Tenant Isolation Model (Reference) |
| 322 | + |
| 323 | +Server does not enforce quotas. Isolation delegated to K8s: |
| 324 | + |
| 325 | +| Isolation dimension | K8s mechanism | Scope | |
| 326 | +|--------------------|---------------|-------| |
| 327 | +| Resource quota | `ResourceQuota` | Per-ns CPU, memory, storage | |
| 328 | +| Default limits | `LimitRange` | Per-ns default container resources | |
| 329 | +| Network policy | `NetworkPolicy` | Per-ns ingress/egress | |
| 330 | +| Sandbox count | `count/batchsandboxes` via `ResourceQuota` | Per-ns CR count | |
| 331 | +| RBAC | `RoleBinding` | Per-ns API access | |
| 332 | + |
| 333 | +Cluster admin creates per-tenant namespace with ResourceQuota + LimitRange before tenant onboarding. |
| 334 | + |
| 335 | +## Test Plan |
| 336 | + |
| 337 | +**Unit tests:** |
| 338 | +- Duplicate API keys across tenants → `ValueError` at config parse |
| 339 | +- Auth: multi-tenant rejects `server.api_key`; accepts valid tenant key; rejects invalid → 401 |
| 340 | +- `FileTenantProvider`: file delete → entries cleared; new key → live in lookup; parse error → old entries kept |
| 341 | +- Docker + tenants → `SystemExit` |
| 342 | + |
| 343 | +**Integration tests:** |
| 344 | +- Create with tenant A key → sandbox in ns-a with label `opensandbox.io/tenant=team-a` |
| 345 | +- List with tenant A → only ns-a sandboxes |
| 346 | +- Get/delete tenant A sandbox with tenant B key → 404 |
| 347 | +- Hot reload: new key works without restart; removed key → 401 |
| 348 | +- Legacy: remove `tenants.toml`, restore `server.api_key`, and restart → `server.api_key` works again |
| 349 | + |
| 350 | +**End-to-end:** |
| 351 | +- Key rotation: add new key, verify both work, remove old key |
| 352 | +- Multi-replica: update ConfigMap, all replicas pick up within 60s |
| 353 | + |
| 354 | +## Drawbacks |
| 355 | + |
| 356 | +- **Two config files.** Mitigated by clear startup logging of which mode is active. |
| 357 | +- **ClusterRole required.** Broader RBAC than single-namespace RoleBinding. Inherent to multi-tenancy; scoped by resource types. |
| 358 | +- **No dynamic tenant CRUD.** Static config only. REST API / CRD deferred to future OSEP. |
| 359 | + |
| 360 | +## Alternatives |
| 361 | + |
| 362 | +| Approach | Rejected because | |
| 363 | +|----------|-----------------| |
| 364 | +| Embed tenants in `server.toml` | Tenant changes require server restart | |
| 365 | +| Couple auth directly to `tenants.toml` file format | Locks out enterprise deployments where tenants already live in IAM/external systems; `TenantProvider` interface avoids this | |
| 366 | +| SQLite for tenant storage | Single-node; breaks multi-replica | |
| 367 | +| One server instance per tenant | High operational cost (N processes) | |
| 368 | +| Soft multi-tenancy (labels, one namespace) | No K8s-native isolation; ResourceQuota/NetworkPolicy not per-tenant | |
| 369 | +| Single API key per tenant | No key rotation; replacing key causes downtime | |
| 370 | + |
| 371 | +## Infrastructure Needed |
| 372 | + |
| 373 | +- One K8s namespace per tenant (cluster admin creates) |
| 374 | +- Per-namespace ResourceQuota + LimitRange (recommended) |
| 375 | +- `opensandbox-tenants` ConfigMap in server namespace |
| 376 | +- ClusterRole + ClusterRoleBinding for server ServiceAccount |
| 377 | + |
| 378 | +## Upgrade & Migration Strategy |
| 379 | + |
| 380 | +**Existing single-tenant → multi-tenant:** |
| 381 | + |
| 382 | +1. Create target namespace(s) |
| 383 | +2. Write `tenants.toml` with existing key as a tenant entry (same namespace) |
| 384 | +3. Mount via ConfigMap alongside `server.toml` |
| 385 | +4. Remove `api_key` from `server.toml` (required: the server refuses to start when both `server.api_key` and `tenants.toml` are present) |
| 386 | +5. Deploy; the old key continues working as a tenant key |
| 387 | +6. Add more tenants as needed |
| 388 | + |
| 389 | +**Rollback:** Delete the `tenants.toml` ConfigMap, restore `api_key` in `server.toml`, and restart. The server falls back to `server.api_key` + `kubernetes.namespace`. |
| 390 | + |
| 391 | +**No data migration needed.** Existing sandboxes stay in their namespace. |