10 changes: 10 additions & 0 deletions .github/workflows/kubernetes-nightly-build.yml
@@ -13,6 +13,7 @@ on:
- '.github/workflows/kubernetes-nightly-build.yml'
- 'scripts/python-k8s-e2e.sh'
- 'scripts/python-k8s-e2e-ingress.sh'
- 'scripts/python-k8s-e2e-multi-tenant.sh'
- 'scripts/common/kubernetes-e2e.sh'
- 'kubernetes/charts/**'

@@ -37,6 +38,14 @@ jobs:
- variant: ingress-uri
script: scripts/python-k8s-e2e-ingress.sh
e2e_gateway_route_mode: uri
- variant: multi-tenant-file
script: scripts/python-k8s-e2e-multi-tenant.sh
e2e_gateway_route_mode: ""
e2e_tenant_provider: file
- variant: multi-tenant-http
script: scripts/python-k8s-e2e-multi-tenant.sh
e2e_gateway_route_mode: ""
e2e_tenant_provider: http
env:
KIND_CLUSTER: opensandbox-e2e
KIND_K8S_VERSION: v1.30.4
@@ -75,6 +84,7 @@ jobs:
- name: Run Kubernetes runtime E2E
env:
E2E_GATEWAY_ROUTE_MODE: ${{ matrix.e2e_gateway_route_mode }}
TENANT_PROVIDER: ${{ matrix.e2e_tenant_provider }}
run: bash "./${{ matrix.script }}"

- name: Dump kind diagnostics
28 changes: 28 additions & 0 deletions kubernetes/charts/opensandbox-server/templates/server.yaml
@@ -68,6 +68,19 @@ data:
config.toml: |
{{ .Values.configToml | indent 4 }}
{{ include "opensandbox-server.ingressConfigToml" . | indent 4 }}
{{- if .Values.tenantsToml }}
---
apiVersion: v1
kind: ConfigMap
metadata:
name: {{ include "opensandbox-server.fullname" . }}-tenants
namespace: {{ include "opensandbox-server.namespace" . }}
labels:
{{- include "opensandbox-server.labels" . | nindent 4 }}
data:
tenants.toml: |
{{ .Values.tenantsToml | indent 4 }}
{{- end }}
---
apiVersion: apps/v1
kind: Deployment
@@ -106,6 +119,10 @@ spec:
env:
- name: SANDBOX_CONFIG_PATH
value: "/etc/opensandbox/config.toml"
{{- if .Values.tenantsToml }}
- name: SANDBOX_TENANTS_CONFIG_PATH
value: "/etc/opensandbox/tenants.toml"
{{- end }}
{{- with .Values.server.env }}
{{- toYaml . | nindent 12 }}
{{- end }}
@@ -114,6 +131,12 @@ spec:
mountPath: /etc/opensandbox/config.toml
subPath: config.toml
readOnly: true
{{- if .Values.tenantsToml }}
- name: tenants
mountPath: /etc/opensandbox/tenants.toml
subPath: tenants.toml
readOnly: true
Comment on lines +135 to +138
P2 Badge Mount tenants ConfigMap without subPath for reloads

The file provider is documented and implemented as a hot-reloading tenants.toml, but the Helm chart mounts the ConfigMap key via subPath; Kubernetes documents that ConfigMap mounts using subPath do not receive updates when the ConfigMap changes. As a result, updating tenantsToml in a Helm release will not change /etc/opensandbox/tenants.toml in the running server pod, so key rotation/add/remove will not be observed until the pod restarts.

{{- end }}
{{- with .Values.server.volumeMounts }}
{{- toYaml . | nindent 12 }}
{{- end }}
@@ -137,6 +160,11 @@ spec:
- name: config
configMap:
name: {{ include "opensandbox-server.fullname" . }}-config
{{- if .Values.tenantsToml }}
- name: tenants
configMap:
name: {{ include "opensandbox-server.fullname" . }}-tenants
{{- end }}
{{- with .Values.server.volumes }}
{{- toYaml . | nindent 8 }}
{{- end }}
9 changes: 9 additions & 0 deletions kubernetes/charts/opensandbox-server/values.yaml
@@ -100,3 +100,12 @@ configToml: |
[egress]
image = "sandbox-registry.cn-zhangjiakou.cr.aliyuncs.com/opensandbox/egress:v1.0.12"
mode = "dns+nft"

# Optional: multi-tenant tenants.toml content.
# When set, a separate ConfigMap is created and mounted at /etc/opensandbox/tenants.toml.
# tenantsToml: |
# [[tenants]]
# name = "team-a"
# namespace = "sandbox-team-a"
# api_keys = ["sk-a-1"]
tenantsToml: ""
187 changes: 136 additions & 51 deletions oseps/0012-multi-tenancy.md
@@ -3,8 +3,8 @@ title: Multi-Tenancy Support for Kubernetes Runtime
authors:
- "@Pangjiping"
creation-date: 2026-04-29
last-updated: 2026-05-07
status: draft
last-updated: 2026-06-05
status: implemented
---

# OSEP-0012: Multi-Tenancy Support for Kubernetes Runtime
@@ -20,7 +20,7 @@ status: draft
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [TenantProvider Abstraction](#tenantprovider-abstraction)
- [Config Model & Loading Flow (FileTenantProvider)](#config-model--loading-flow-filetenantprovider)
- [Tenants Config File Format (FileTenantProvider)](#tenants-config-file-format-filetenantprovider)
- [Auth Middleware Flow](#auth-middleware-flow)
- [Sandbox Service — Namespace Resolution](#sandbox-service--namespace-resolution)
- [Startup Guards](#startup-guards)
@@ -149,77 +149,162 @@ Implementation in 6 steps. No step blocks another except where noted.

Tenant resolution is behind a `TenantProvider` interface, decoupling auth middleware from any specific config source. This lets the initial implementation ship with a simple file-based provider while leaving a clean extension point for enterprise deployments that already manage tenants in an external IAM or tenant management system.

**Interface (pseudocode):**
**Interface (`opensandbox_server/tenants/provider.py`):**

```python
class TenantProvider(Protocol):
def lookup(self, api_key: str) -> Optional[TenantEntry]:
"""Resolve API key → tenant. Returns None if not recognized.
Raises TenantProviderUnavailable if provider cannot serve."""
...

def list_tenants(self) -> List[TenantEntry]:
"""All known tenant entries (startup validation)."""
...

def ready(self) -> bool:
"""True once provider can serve lookups."""
...

def start(self) -> None:
"""Start background resources (watchers, connections). Called at server startup."""
...

def close(self) -> None:
"""Release resources. Called on server shutdown."""
...

def on_reload(self, callback: Callable[[List[TenantEntry]], None]) -> None:
"""Register callback invoked on tenant data change.
Not all providers support this; those that don't may ignore."""
...
```
TenantProvider (Protocol):
lookup(api_key: str) → TenantEntry | None
list_tenants() → list[TenantEntry] # for startup validation
ready() → bool # provider has loaded initial state
on_reload(callback) → None # notify consumers on config change (optional)

**Exception:**
- `TenantProviderUnavailable` — raised when provider cannot serve lookups (e.g., HTTP endpoint unreachable + cache expired beyond `max_stale_seconds`)

**Data model (`opensandbox_server/tenants/models.py`):**

```python
@dataclass(frozen=True)
class TenantEntry:
name: str
namespace: str
api_keys: List[str]
```

**Initial provider — FileTenantProvider:**
- Backed by `tenants.toml`, loaded at startup, hot-reloaded via fsnotify
---

#### Provider 1 — FileTenantProvider

Backed by `tenants.toml`, loaded at startup, hot-reloaded via filesystem mtime polling.

- Implements full `TenantProvider` interface
- `start()` parses file and starts watcher thread (2s mtime poll)
- `ready()` returns `True` after initial file parse succeeds
- `on_reload` triggers on fsnotify events; auth middleware picks up new key→tenant mappings without restart
- `on_reload` triggers on file change; auth middleware picks up new key→tenant mappings without restart
- File delete → all entries cleared (all tenant keys → 401)
- Parse error during reload → log warning, keep previous state (no downtime)
- Watcher monitors parent directory for ConfigMap atomic symlink swap
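
The reload rules listed above can be sketched as a single polling step. This is a minimal illustration, not the code in `opensandbox_server/tenants/`: the class name `ReloadingFile`, the `poll_once` method, and the injected `parse` callable are all hypothetical, and the sketch stores a plain `api_key -> namespace` mapping instead of `TenantEntry` objects. It assumes parse failures surface as `ValueError` (which `tomllib.TOMLDecodeError` subclasses).

```python
import logging
import os
import threading
from typing import Callable, Dict, Optional

log = logging.getLogger(__name__)


class ReloadingFile:
    """Keeps a parsed snapshot of a file and refreshes it when its mtime changes."""

    def __init__(self, path: str, parse: Callable[[str], Dict[str, str]]) -> None:
        self._path = path
        self._parse = parse  # hypothetical: turns file text into api_key -> namespace
        self._lock = threading.Lock()
        self._entries: Dict[str, str] = {}
        self._mtime_ns: Optional[int] = None

    def poll_once(self) -> None:
        """One watcher iteration; a watcher thread would call this every 2 seconds."""
        try:
            mtime_ns = os.stat(self._path).st_mtime_ns
        except FileNotFoundError:
            with self._lock:
                self._entries = {}  # file deleted: every tenant key now misses
                self._mtime_ns = None
            return
        if mtime_ns == self._mtime_ns:
            return  # unchanged since the last poll
        try:
            with open(self._path, encoding="utf-8") as fh:
                fresh = self._parse(fh.read())
        except (OSError, ValueError) as exc:
            log.warning("reload failed, keeping previous state: %s", exc)
            with self._lock:
                self._mtime_ns = mtime_ns  # do not re-parse the same broken file
            return
        with self._lock:
            self._entries = fresh  # atomic swap under the lock
            self._mtime_ns = mtime_ns

    def lookup(self, api_key: str) -> Optional[str]:
        with self._lock:
            return self._entries.get(api_key)
```

Building the new mapping outside the lock and swapping it in one assignment keeps lookups from ever seeing a half-loaded state, which is what lets a failed reload leave the previous entries in place.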

**Future providers (not in this OSEP, but the interface accommodates):**
- `HTTPTenantProvider` — polls or streams from an internal IAM API; tenant metadata, key rotation, enable/disable all managed in the external system
- `K8sConfigMapProvider` — watches a ConfigMap or Secret across namespaces
- Composite/chained providers for fallback (e.g., file + external API merge)
---

**Startup wiring (pseudocode):**
#### Provider 2 — HTTPTenantProvider

Per-key lookup against a remote HTTP endpoint with in-memory TTL cache. No background thread, no file persistence, no bulk fetch. Keys not looked up are not cached.

**Endpoint contract:**

```
GET {endpoint}
Header: OPEN-SANDBOX-API-KEY: <api_key>   // the client's original key, forwarded unchanged

200 OK:
{
"namespace": "ns-a", // target K8s namespace for this key
"ttl": 60 // suggested cache duration in seconds
}

401 Unauthorized:
{
"code": "UNAUTHORIZED",
"message": "API key is invalid or revoked"
}
```

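The server forwards the client's `OPEN-SANDBOX-API-KEY` unchanged to the HTTP provider for validation. The provider is the authority: it decides whether a key is valid and which namespace it maps to. The server only needs `namespace` and `ttl`.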

**Cache behavior:**

| Scenario | Action |
|----------|--------|
| Cache hit + within server-suggested TTL | Return cached entry immediately |
| Cache hit + TTL expired | Sync GET → success: update cache with new TTL; failure + within `max_stale_seconds`: return stale; failure + beyond `max_stale_seconds`: raise `TenantProviderUnavailable` |
| Cache miss | Sync GET → 200: cache + return; 401: return `None`; network error: raise `TenantProviderUnavailable` |
| Remote returns 401 for previously cached key | Evict from cache + return `None` (key revoked) |
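
The four rows of the table can be sketched as one lookup function. This is an illustration under stated assumptions, not the `HTTPTenantProvider` implementation: `TTLTenantCache`, `CacheSlot`, and the `fetch` callable are hypothetical names, the HTTP call is injected so the logic can be exercised without a network, and staleness is assumed to be measured from the moment the suggested TTL expired.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


class TenantProviderUnavailable(Exception):
    """No fresh answer and no acceptably stale cache entry."""


@dataclass
class CacheSlot:
    namespace: str
    expires_at: float  # end of the TTL suggested by the endpoint


# Hypothetical fetch contract: (namespace, ttl) for a 200 response,
# None for a 401, OSError for network failures or timeouts.
Fetch = Callable[[str], Optional[Tuple[str, float]]]


class TTLTenantCache:
    def __init__(self, fetch: Fetch, max_stale_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._slots: Dict[str, CacheSlot] = {}

    def lookup(self, api_key: str) -> Optional[str]:
        now = self._clock()
        slot = self._slots.get(api_key)
        if slot is not None and now < slot.expires_at:
            return slot.namespace  # cache hit within the suggested TTL
        try:
            result = self._fetch(api_key)  # cache miss or TTL expired: sync GET
        except OSError as exc:
            if slot is not None and now - slot.expires_at <= self._max_stale:
                return slot.namespace  # endpoint down, serve the stale entry
            raise TenantProviderUnavailable(str(exc)) from exc
        if result is None:
            self._slots.pop(api_key, None)  # 401: evict a revoked key
            return None
        namespace, ttl = result
        self._slots[api_key] = CacheSlot(namespace, now + ttl)
        return namespace
```

Passing the clock in as a parameter is only a testing convenience here; it makes the TTL and stale-window branches reproducible without sleeping.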

**Configuration (`HTTPTenantProviderConfig`):**

| Field | Default | Description |
|-------|---------|-------------|
| `endpoint` | (required) | Remote tenant lookup URL |
| `max_stale_seconds` | 300 | Maximum time to serve stale cache when endpoint unreachable |
| `timeout_seconds` | 5 | HTTP request timeout |
| `auth_header` | None | Optional header name for provider-level authentication |
| `auth_token` | None | Optional token value for provider-level authentication |

**Security properties:**
- No persistent cache file → no disk attack surface, no stale file after long downtime
- Cold start (`start()`) only marks ready, does not bulk-fetch (per-key on demand)
- Revoked key (401) immediately evicted from cache
- `max_stale_seconds` bounds the window in which an unreachable endpoint combined with a stale cache entry could still admit a revoked key

---

#### Provider Selection

Provider type is determined at startup:

```python
# Config field: tenant_provider_type = "file" | "http"
# Or auto-detect:
if tenants.toml exists:
provider = FileTenantProvider(path)
if not provider.ready():
→ SystemExit (parse error, duplicates, etc.)
elif http_tenant_endpoint configured:
provider = HTTPTenantProvider(config)
else:
provider = None # single-tenant mode

provider.start()
if not provider.ready():
→ SystemExit
```

Auth middleware depends only on `TenantProvider`, not on `FileTenantProvider` directly. Switching backends in the future does not touch auth code.
Auth middleware depends only on `TenantProvider`, not on any specific implementation. Switching backends does not touch auth code.
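
The selection order in the startup wiring can be sketched as a small pure function. The function name `select_provider_type` and its parameters are hypothetical; only the `"file"` and `"http"` values and the precedence (explicit setting, then file presence, then configured endpoint, then single-tenant mode) come from the pseudocode above.

```python
import os
from typing import Optional


def select_provider_type(explicit: Optional[str],
                         tenants_path: str,
                         http_endpoint: Optional[str]) -> Optional[str]:
    """Return "file", "http", or None for single-tenant mode."""
    if explicit is not None:
        # An explicit tenant_provider_type wins over auto-detection.
        if explicit not in ("file", "http"):
            raise ValueError(f"unknown tenant provider type: {explicit!r}")
        return explicit
    if os.path.exists(tenants_path):
        return "file"
    if http_endpoint:
        return "http"
    return None  # no tenant source configured: single-tenant mode
```

When the result is `None` there is no provider object, so a caller following this sketch would skip the `start()` and `ready()` checks in that case.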

---

### Config Model & Loading Flow (FileTenantProvider)
### Tenants Config File Format (FileTenantProvider)

**New package:** `opensandbox_server/tenants/`
**Package:** `opensandbox_server/tenants/`

This is the initial `TenantProvider` implementation. It reads `tenants.toml` and hot-reloads on file changes.
**File:** `tenants.toml` (path resolved via `SANDBOX_TENANTS_CONFIG_PATH` env or default `~/.opensandbox/tenants.toml`)

**Data model (pseudocode):**
```
TenantEntry:
- name: str
- namespace: str
- api_keys: list[str]

TenantsConfig:
- entries: list[TenantEntry]
- validation: reject duplicate api_keys across tenants (on parse)
```
```toml
[[tenants]]
name = "team-a"
namespace = "sandbox-team-a"
api_keys = ["sk-a-1", "sk-a-2"]

**Loading flow:**
```
FileTenantProvider(path):
1. resolve path: env SANDBOX_TENANTS_CONFIG_PATH || ~/.opensandbox/tenants.toml
2. if file absent → ready() returns False → server stays in single-tenant mode
3. parse TOML → TenantsConfig → build dict[api_key → TenantEntry]
4. on parse error or duplicate keys → raise, server exits
5. start fsnotify watcher thread for hot-reload
[[tenants]]
name = "team-b"
namespace = "sandbox-team-b"
api_keys = ["sk-b-1"]
```

**Hot-reload behavior:**
```
- maintains dict[api_key → TenantEntry] under threading.Lock
- on file change: reload atomically (swap dict under lock)
- on parse error during reload: log warning, keep old entries (no downtime)
- file delete → clear all entries (all tenant keys → 401)
- new key added → live immediately on next lookup
```
Watcher monitors parent directory for ConfigMap atomic symlink swap.
**Validation rules (on parse):**
- Each tenant must have non-empty `name`, `namespace`, `api_keys`
- Duplicate `api_keys` across tenants → `ValueError`, server exits

---
