diff --git a/gateway/spec/architecture/architecture.md b/gateway/spec/architecture/architecture.md index 32195c092..a87318098 100644 --- a/gateway/spec/architecture/architecture.md +++ b/gateway/spec/architecture/architecture.md @@ -2,109 +2,562 @@ ## Overview -Envoy-based gateway system with Go xDS control plane for dynamic API configuration, policy enforcement, and traffic management. Supports both single-instance deployments with SQLite and scalable cloud deployments. +The API Platform Gateway is an Envoy-based, AI-ready API gateway with a Go xDS control plane. Everything beyond basic routing — authentication, rate limiting, transformation, AI guardrails, MCP handling — is implemented as composable, versioned **policies**. + +Policies are not built into the runtime. They are compiled (Go) or installed (Python) into a Gateway Runtime image at build time by the **Gateway Builder**, and pushed to the runtime at deploy time over xDS. This means a Gateway Runtime image is always a self-contained, reproducible artifact: a fixed Envoy version, a fixed policy set, a fixed SDK version. + +### Control Plane vs Data Plane + +There are two layers in the wider API Platform. The Gateway as a whole is a **Data Plane** product — it terminates client traffic and forwards it to upstream services. The **Control Plane** is the WSO2 **Platform API**, which is a separate, optional central management surface that one or more independent Gateways can register with. + +Inside a single Gateway, the **Gateway Controller** acts as an internal control plane for its own **Gateway Runtime** instances (it pushes xDS to them) — but the Controller itself is part of the Data Plane deployment. A Gateway can run with or without a Platform API in front of it. + +## Top-Level Architecture + +```mermaid +graph TB + subgraph CP["Control Plane (optional)"] + PlatformAPI["Platform API
central management for
1..N gateways"] + end + + subgraph DP["Data Plane"] + subgraph GW1["Gateway A"] + C1["Gateway Controller
REST :9090
Envoy xDS :18000
Policy xDS :18001"] + subgraph RT1["Gateway Runtime"] + R1["Router (Envoy)
:8080 / :8443"] + PE1["Policy Engine (Go)"] + Py1["Python Executor
(if Python policies)"] + end + end + + subgraph GW2["Gateway B (distinct gateway, not a replica)"] + C2["Gateway Controller"] + RT2["Gateway Runtime"] + end + + Builder["Gateway Builder
(image build time only)"] + end + + Backend[("Upstream
Backends / LLMs
MCP Servers")] + DB[("Shared PostgreSQL
rows scoped by gateway_id")] + Redis[(Redis
distributed rate-limit)] + Client["Client"] + + C1 -. "REST: manifest + version
WebSocket: events + heartbeat" .-> PlatformAPI + C2 -. "REST + WebSocket" .-> PlatformAPI + PlatformAPI -. "WebSocket: deploy / undeploy" .-> C1 + PlatformAPI -. "WebSocket" .-> C2 + + Builder -- "produces
controller + runtime images" --> GW1 + Builder -- "produces images" --> GW2 + + C1 --- DB + C2 --- DB + + C1 -- "Envoy xDS
(LDS/RDS/CDS/EDS/SDS)" --> R1 + C1 -- "Policy xDS +
RouteConfig +
APIKey +
Subscription +
LazyResource" --> PE1 + R1 <-- "ext_proc
over UDS" --> PE1 + PE1 <-- "gRPC
over UDS" --> Py1 + + Client --> R1 + R1 --> Backend + PE1 -. "rate-limit state
(optional)" .-> Redis + + classDef cp fill:#e1f5ff,stroke:#01579b + classDef dp fill:#fff3e0,stroke:#e65100 + classDef ext fill:#f3e5f5,stroke:#4a148c + class CP cp + class DP,GW1,GW2,RT1,RT2 dp + class Backend,DB,Redis,Client ext +``` + +A single Gateway is composed of two deployable units, released as a matched version pair — the controller's policy YAMLs and the runtime's compiled policy-engine binary come from the same builder run: + +| Unit | Contains | +| ----------------------- | --------------------------------------------------------------------------------------- | +| **Gateway Controller** | REST API, Envoy xDS server, Policy xDS server, policy definitions, persistence | +| **Gateway Runtime** | Envoy + Policy Engine binary (with policies linked in) + Python Executor + Python deps | + +The **Gateway Builder** is a build-time tool that produces both images. End users do not run the builder unless they want to ship a custom policy set; the default WSO2-published images are pre-built. When custom images are needed, the **CLI** is the primary user-facing entry point — it wraps the builder in a Docker container, supplies the policy manifest, and produces both images locally. + +### Multi-Gateway Database Sharing + +Each Gateway is identified by a unique `gateway_id`. Multiple **distinct Gateways** (not just replicas of one Gateway) can point at the **same shared database** — every persistent row is scoped by `gateway_id`, so two gateways sharing a PostgreSQL instance see only their own APIs, subscriptions, API keys, and events. This is independent of the multi-replica EventHub sync described later (which is for replicas of the *same* gateway_id). + +```mermaid +graph LR + subgraph DB["Shared PostgreSQL"] + T1["artifacts
{gateway_id, uuid, ...}"] + T2["api_keys
{gateway_id, ...}"] + T3["events
{gateway_id, event_id, ...}"] + end + + GA["Gateway A
gateway_id=gw-a"] + GB["Gateway B
gateway_id=gw-b"] + + GA <--> DB + GB <--> DB +``` + +--- ## Components -### Gateway-Controller (Port 9090 REST, 18000 xDS) -- REST API server accepting YAML/JSON API configurations using Gin router. -- Validation layer providing field-level error messages with structured reporting. -- xDS v3 server implementing State-of-the-World protocol for Envoy configuration. -- SQLite database for persistent storage (`./data/gateway.db`) with WAL mode. -- In-memory cache for fast configuration access with thread-safe operations. - -### Router (Envoy Proxy, Port 8080) -- Envoy Proxy 1.35.3 routing HTTP traffic to backend services. -- Bootstrap configuration connecting to Gateway-Controller xDS server. -- JSON-formatted access logs to stdout for observability. -- Zero-downtime configuration updates via xDS protocol. - -### Policy Engine (Standard Tier) -- Authentication policies: API Key, OAuth, JWT validation. -- Authorization policies: RBAC, scope validation. -- Traffic management policies: Header modification, request/response transformation. - -### Rate Limiter (Standard Tier) -- Distributed rate limiting with Redis backend. -- Quota management and throttling. -- Spike arrest and burst protection. - -### Database -- SQLite database file (`./data/gateway.db`). -- Schema with `deployments` table storing configurations as JSON TEXT. -- Composite unique constraint on `(name, version)`. -- Indexes on frequently queried fields: `name+version`, `status`, `context`, `kind`. -- Migration path to PostgreSQL/MySQL for cloud deployments. - -## Container Structure +### 1. Gateway Controller + +The Gateway's internal control plane — it manages and pushes configuration to its own Gateway Runtime(s). A single Go binary that: + +- Serves the **Management REST API** for create/read/update/delete of all gateway resources. +- Serves an **Admin/debug API** — config dump, health, xDS sync status. 
+- Runs an **Envoy xDS server** implementing the State-of-the-World v3 protocol (LDS, RDS, CDS, EDS, SDS). +- Runs a separate **Policy xDS server** that pushes policy chains, route configs, API keys, subscriptions, and lazy resources to the Policy Engine. +- Persists all state in **SQLite** (default, WAL mode) or **PostgreSQL** (HA deployments). +- Optionally connects to the **Platform API** (REST for the manifest/version push, WebSocket for the live event channel) for centralized multi-gateway management. + +#### Resource Kinds + +The controller manages a typed set of API and policy resources, each with its own validator: + +| Kind | Purpose | +| --------------------- | ---------------------------------------------------------------------- | +| `RestApi` | HTTP/REST API definition (operations, upstream, policies) | +| `WebSubApi` | Event-driven WebSub API (Kafka-backed, async; served by event-gateway) | +| `LlmProviderTemplate` | Reusable template for an LLM vendor (OpenAI, Anthropic, Bedrock, …) | +| `LlmProvider` | A configured LLM provider instance | +| `LlmProxy` | A multi-provider AI gateway endpoint with model routing & guardrails | +| `Mcp` | Model Context Protocol proxy | +| `Certificate` | Trusted-cert and listener-cert management for upstream/downstream TLS | +| `SubscriptionPlan` | Quota/rate plan definition | +| `Subscription` | Plan binding to an Application; carries billing IDs for analytics | +| `ApiKey` | Per-API key issuance; stored and pushed to runtime as SHA-256 hashes only | +| `Application` | Logical consumer that owns API keys and subscriptions (synced from Platform API) | +| `Secret` | Secret storage with AES-GCM encryption at rest | +| `Policy` | Installed policy definitions compiled into the runtime image (read-only) | + +Each resource has its **own database table** keyed by an immutable **UUID** primary identifier. 
All cross-references (subscription → plan, API key → API, events, analytics) carry the UUID, so renames and re-deployments stay valid. Resources additionally carry a URL-friendly `handle` and a human-readable `displayName`, both unique per gateway and kind. + +Resource versions for `RestApi`, `WebSubApi`, `Mcp`, `LlmProvider`, and `LlmProxy` use a `vMAJOR.MINOR` form. Patch versions are intentionally not exposed — a backend bug fix should never force consumers to migrate. Policies follow a different scheme: their patch versions are visible to operators so security and bug fixes can be pinned at deploy time. + +Operations on these kinds are converted at handler time to a kind-agnostic **`RuntimeDeployConfig`** before being snapshotted into xDS. This keeps the xDS translators and Policy Engine free of per-kind branching: a `RestApi`, an `LlmProvider`, and a `WebSubApi` all reach the runtime as the same intermediate shape. + +A REST/LLM API may declare a **main** upstream and an optional **sandbox** upstream, selected per request via header or path convention — both upstreams share the same policy chain. Resources may also carry arbitrary `metadata.labels` (string→string map) for analytics, routing, and operational metadata; labels are propagated into the runtime context and into emitted analytics events. + +#### Multi-Replica Sync (EventHub) + +A single Gateway can run multiple Controller replicas (same `gateway_id`, same DB) for HA. Replicas stay in sync through a DB-backed **event hub**: each mutation writes a row to an events table; every replica polls the table on a short interval and applies events to its in-memory caches and xDS snapshots. This avoids the need for a separate message broker. + +Events are also scoped by `gateway_id`, so a replica only consumes events for its own gateway — a different gateway sharing the same database is invisible to it. + +### 2. 
Gateway Runtime + +A single OCI image that bundles three processes managed by a shared entrypoint: + +```mermaid +graph LR + subgraph Container["Gateway Runtime Container"] + Entry["entrypoint
(process manager)"] + Router["Envoy
:8080 / :8443 / :9901"] + PE["Policy Engine (Go)
ext_proc :9001"] + PyExec["Python Executor
(only if Python policies exist)"] + UDS1[/"policy-engine.sock"/] + UDS2[/"python-executor.sock"/] + end + + Entry -- "starts in order: 1 (if Python policies)" --> PyExec + Entry -- "starts in order: 2" --> PE + Entry -- "starts in order: 3" --> Router + + Router <-- "ext_proc gRPC" --> UDS1 + UDS1 --- PE + PE <-- "Execute gRPC" --> UDS2 + UDS2 --- PyExec +``` + +The entrypoint starts the **Python Executor** first (only if any Python policies are present), then starts the **Policy Engine** and waits for it to come up, and finally starts **Envoy**. If any one process exits, the entrypoint terminates the rest and the container restarts. + +#### Router (Envoy) + +The Router is a standard upstream Envoy build. The bootstrap is minimal: an admin listener, ADS pointing at the controller's xDS port, and a placeholder cluster. All listeners, routes, clusters, endpoints, and TLS secrets are pushed dynamically by the controller. The Router speaks `ext_proc` to the Policy Engine over a UDS for every request/response on configured routes. + +Body-processing mode is decided **per request, per chain**: the Policy Engine sends back a `mode_override` that puts Envoy in `SKIP` mode when no policy in the chain needs the body, and in `BUFFERED` mode only when one does. This keeps headers-only policies (auth, header rewrite, routing) on the fast path while still allowing body-aware policies (transformation, guardrails) to opt into buffering. + +#### Policy Engine (Go) + +The Policy Engine is the heart of the data plane. It: + +- Receives Envoy `ext_proc` streams on a UDS. +- Maintains an in-memory map of **PolicyChains** keyed by route, kept in sync via xDS streams (`PolicyChainConfig`, `RouteConfig`, `APIKeyConfig`, `SubscriptionConfig`, `LazyResourceConfig`) from the controller. +- For each request, looks up the chain (route key resolution is pluggable via `PolicyChainResolver`), builds an execution context, runs the **request** policies, then on the response path runs the **response** policies. 
+- Translates per-policy `Action`s (header set/remove, immediate response, dynamic metadata, body replacement, host rewrite, …) into Envoy ext_proc responses. +- Exposes Prometheus metrics on `:9003` and an admin/debug API on `:9002` (config dump with secret redaction, health). + +Policies are **compiled in** at image build time — the engine has zero built-in policies; the gateway-builder generates a `plugin_registry.go` that wires them into the binary. From the engine's runtime perspective, all policies (request-phase, response-phase, body-requiring or not) are uniform plugins implementing the SDK policy interfaces. + +#### Python Executor + +Optional gRPC sidecar process for Python policies. It is a Python 3 process that: + +- Listens on a Unix Domain Socket — or on TCP for local debugging. +- Loads all installed Python policies from a builder-generated registry. +- Serves `Execute` RPCs from the Go Policy Engine; the Go side translates each policy invocation into a gRPC request/response. +- Uses a single event loop with bounded worker concurrency, configurable from the entrypoint. + +```mermaid +sequenceDiagram + participant Envoy as Router (Envoy) + participant PE as Policy Engine (Go) + participant Py as Python Executor (Python) + + Envoy->>PE: ext_proc HeadersRequest (UDS) + PE->>PE: lookup PolicyChain by route key + loop for each policy in chain + alt Go policy + PE->>PE: invoke in-process + else Python policy + PE->>Py: Execute(...) (UDS gRPC) + Py->>Py: load + run Python policy + Py-->>PE: ExecuteResponse(actions) + end + end + PE-->>Envoy: ext_proc Response (mode_override, headers/body actions) + Envoy->>Envoy: apply mutations + Envoy->>Backend: forward request +``` + +Python dependencies are installed into the runtime image at build time from a **locked requirements file** produced by the builder. The SDK ships from PyPI by default, or can be installed from the monorepo for local development. + +### 3. 
Gateway Builder + +A build-time Go tool that produces both the gateway-runtime and gateway-controller images. It is invoked from the gateway-runtime `Dockerfile` and runs a six-phase pipeline: + +```mermaid +flowchart LR + A["build.yaml
(policy manifest)"] --> P1 + SP["system policies"] --> P1 + P1["1. Discovery
resolve policy refs"] --> P2 + P2["2. Validation
schema + ID checks"] --> P3 + P3["3. Code generation"] --> P4 + P4["4. Compilation
go build policy-engine"] --> P5 + P5["5. Dockerfile generation"] --> P6 + P6["6. Manifest emission
build info + policy YAMLs"] +``` + +The two output images use an **extend base image** pattern: the gateway-runtime image is built on an Envoy base plus the freshly compiled `policy-engine` binary plus Python dependencies; the gateway-controller image is the controller base plus the policy-definition YAMLs extracted from the builder output. + +The canonical policy set for the current gateway version is declared in [`gateway/build.yaml`](../../build.yaml), covering auth, rate limiting, AI guardrails, AI traffic management, MCP, mediation, and subscription policies. Refer to that file for the authoritative list and pinned versions. + +--- + +## xDS Streams Between Controller and Runtime + +The controller drives the runtime through several independent xDS channels. Envoy and the Policy Engine connect to different gRPC ports on the controller: + +```mermaid +graph LR + subgraph Controller["Gateway Controller"] + EnvoyXDS["Envoy xDS server
(SotW, ADS)"] + PolicyXDS["Policy xDS server"] + end + + subgraph Runtime["Gateway Runtime"] + Envoy["Envoy"] + PE["Policy Engine"] + end + + EnvoyXDS -- "LDS / RDS / CDS / EDS / SDS" --> Envoy + PolicyXDS -- "policy chains" --> PE + PolicyXDS -- "per-route metadata" --> PE + PolicyXDS -- "API key state (atomic replace)" --> PE + PolicyXDS -- "subscription state" --> PE + PolicyXDS -- "lazy resources" --> PE +``` + +Notable properties of these streams: + +- **Envoy xDS** uses State-of-the-World: a full LDS/RDS/CDS/EDS/SDS snapshot is published per change. Each route carries only a stable `route_name` in metadata. +- **Per-route metadata** (api name, version, kind, …) is delivered to the Policy Engine at deploy time as a separate stream, not parsed per request — this avoids per-request protobuf metadata unmarshaling in the data path. +- **API keys** are indexed by **SHA-256 hash** of the raw key — the runtime never sees plaintext. Keys are swapped atomically per snapshot so auth never gaps during rotation. +- **Subscriptions** carry active plan limits and billing IDs needed by analytics. +- **Lazy resources** is a generic channel for resources that should be loaded on first use rather than at startup. + +The Policy Engine exposes its current xDS resource versions on the controller's admin API, so integration tests and operators can gate readiness on a known sync version. + +--- + +## Request Lifecycle + +```mermaid +sequenceDiagram + participant Client + participant Envoy as Router (Envoy) + participant PE as Policy Engine + participant Py as Python Executor + participant Upstream as Backend / LLM / MCP + + Client->>Envoy: HTTP request + Envoy->>PE: ext_proc HeadersRequest (UDS)
+ dynamic_metadata.route_name + PE->>PE: resolve route_name → RouteConfig + PolicyChain + PE->>PE: decide body mode (SKIP / BUFFERED)
from chain RequiresRequestBody + alt body required + PE-->>Envoy: continue with BUFFERED + Envoy->>PE: BodyRequest (full body) + else headers only + PE-->>Envoy: continue with SKIP + end + loop request policies + PE->>PE: execute Go policy + opt policy is Python + PE->>Py: Execute (UDS gRPC) + Py-->>PE: ExecuteResponse + end + end + PE-->>Envoy: HeadersResponse (header/body/immediate-response actions) + Envoy->>Upstream: forward (possibly with host-rewrite, mutated headers/body) + Upstream-->>Envoy: response + Envoy->>PE: ext_proc ResponseHeaders / ResponseBody + loop response policies + PE->>PE: execute + end + PE-->>Envoy: response mutations + Envoy-->>Client: HTTP response + PE->>PE: publish analytics event (async) +``` + +Short-circuits are honoured at every step: an auth policy may emit an `ImmediateResponse` action, in which case the chain ends without ever touching the upstream. + +--- + +## Configuration Management + +### Layered Configuration + +The controller, Policy Engine, and Python Executor share the same configuration model, with sources listed from highest to lowest precedence: + +``` +CLI flags > env vars > TOML config file > built-in defaults +``` + +A single TOML file covers every section needed across the three processes. + +### Artifact Templating + +Resource YAMLs (RestApi, LlmProvider, etc.) are rendered through **Go templates** before validation. The available helpers cover the things artifacts actually need: resolving a value from the gateway secret store, reading an environment variable, supplying a default, requiring a value to be present, and marking a value as sensitive so admin endpoints redact it. + +Example: + +```yaml +spec: + upstream: + main: + url: '{{ env "BACKEND_URL" | default "https://api.example.com" }}' + auth: + type: bearer + token: '{{ secret "BACKEND_TOKEN" | redact }}' +``` + +Rendering errors are typed and surfaced as HTTP 400s by the management API. + +### Secrets + +Secrets are stored encrypted at rest in the controller database using **AES-GCM**. 
The `secret` template function resolves to the decrypted value at render time. Values marked sensitive are masked in downstream admin config-dump endpoints. + +--- + +## Deployment Modes + +### Mutable Mode (default) + +Configurations are managed at runtime through the Management REST API or through the Platform API (control-plane push). The gateway's database is the source of truth; changes are persisted, replicated to peer controllers via EventHub, and pushed to runtimes via xDS. + +### Immutable Mode + +For GitOps-style and Kubernetes-native deployments, the controller can run in **immutable mode**. When enabled: + +- On startup, the controller walks an artifacts directory and applies all YAML resources via the same service layer the REST handlers use, in dependency order. Any failure aborts startup. +- The SQLite database file is **deleted on startup** to guarantee a fresh, reproducible state. Postgres is rejected — immutable mode is SQLite-only. +- All write methods on the management API return `405 Method Not Allowed`; read endpoints remain available. + +This mode is the recommended path for Kubernetes ConfigMap-based deployments and for baking a fully-formed gateway into a custom container image. + +### Standalone Distribution + +A `make` target produces a standalone zip containing the controller binary, the runtime image references, and a self-contained Docker Compose file for installation outside the monorepo. + +### Platform-API Control Plane Mode + +The Gateway can run standalone (configurations submitted directly to the Controller REST API) or it can register with a central **Platform API** — the system's actual Control Plane — using a combination of REST and WebSocket. The Platform API can manage **multiple, independent gateways** at once. + +#### Authentication + +Both channels authenticate with the same **gateway registration token**, sent as an HTTP header on: + +1. 
The WebSocket upgrade request — this also serves as the registration handshake (the WebSocket dial *is* the register call). +2. Every REST request the gateway makes to the Platform API. + +A `401 Unauthorized` from either channel is treated as a **permanent failure** — the gateway exits rather than reconnecting. Other permanent statuses (forbidden, not-found, conflict, unprocessable) cause the same exit-on-failure behaviour so a misconfigured gateway doesn't loop forever against a control plane that will never accept it. + +#### Channels + +| Channel | Direction | Used for | +| -------------- | -------------------- | ---------------------------------------------------------------------------------------------- | +| **REST** (HTTPS) | Gateway → Platform API | Well-known discovery; **manifest + version push** on every connect, carrying gateway version, functionality type, and the list of installed policy definitions | +| **WebSocket** | bidirectional | Long-lived event channel — deploy / undeploy / API key / subscription events pushed down; heartbeat | + +The platform may reject the manifest if version or policy set is incompatible — also treated as a permanent failure. + +#### Custom Policy Sync + +The manifest push carries every policy installed in the gateway, each entry tagged with a `managedBy` field — `"wso2"` for built-in policies and `"customer"` for policies added via `ap gateway image build`. System policies (those whose name is prefixed `wso2_apip_sys_`) are filtered out at the gateway before the manifest is sent; they are an internal concern of the data plane and the Platform API has no need to know about them. Customer-managed entries include the policy's full `parameters` and `systemParameters` JSON-Schema blocks; for WSO2-managed entries those are dropped on the platform side because the schema is already known centrally. 
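The gateway-side filtering described above can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the gateway's actual code: the `PolicyEntry` type, the `BuildManifest` helper, the `name` and `version` JSON fields, and the example policy names are hypothetical. Only the `wso2_apip_sys_` prefix, the `managedBy` values, and the `parameters` / `systemParameters` field names come from the text.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// systemPolicyPrefix is the name prefix the document gives for system policies.
const systemPolicyPrefix = "wso2_apip_sys_"

// PolicyEntry is a hypothetical manifest entry. Only managedBy, parameters and
// systemParameters are named in the document; the other fields are assumptions.
type PolicyEntry struct {
	Name             string          `json:"name"`
	Version          string          `json:"version"`
	ManagedBy        string          `json:"managedBy"` // "wso2" or "customer"
	Parameters       json.RawMessage `json:"parameters,omitempty"`
	SystemParameters json.RawMessage `json:"systemParameters,omitempty"`
}

// BuildManifest returns the entries that would be sent to the Platform API:
// every installed policy except system policies, in the original order.
func BuildManifest(installed []PolicyEntry) []PolicyEntry {
	out := make([]PolicyEntry, 0, len(installed))
	for _, p := range installed {
		if strings.HasPrefix(p.Name, systemPolicyPrefix) {
			continue // internal to the data plane, never reported
		}
		out = append(out, p)
	}
	return out
}

func main() {
	// Policy names below are made up for the example.
	installed := []PolicyEntry{
		{Name: "api-key-auth", Version: "v1.0.0", ManagedBy: "wso2"},
		{Name: "wso2_apip_sys_analytics", Version: "v1.0.0", ManagedBy: "wso2"},
		{Name: "acme-header-check", Version: "v0.1.0", ManagedBy: "customer",
			Parameters: json.RawMessage(`{"type":"object"}`)},
	}
	body, err := json.MarshalIndent(BuildManifest(installed), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

The sketch keeps the schema blocks on every entry it is given and leaves any dropping of WSO2-managed schemas to the receiving side, which is one reading of the paragraph above.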
+ +The Platform API persists the manifest into a `gateways.manifest` column on receipt, but does **not** automatically promote customer-managed policies into the catalogue the Console uses for attachment. That step is **Console-triggered** — the Console calls `POST /api/v1/gateway-custom-policies/sync` with the gateway, policy name, and version. The service looks up the stored manifest, verifies the entry's `managedBy == "customer"`, and writes the extracted definition into the org-scoped `gateway_custom_policies` table. Only after this Console sync is a custom policy attachable to APIs through the Console UI. + +#### Deployment Acknowledgement + +Deployments pushed from the Platform API are not fire-and-forget. After the gateway applies (or fails to apply) a deployment or undeployment, it sends an acknowledgement back over the same WebSocket carrying the originating deployment ID, the action, and a terminal status (`success`/`failed` with an optional error code). Acknowledgements are sent for every WebSocket-pushed resource type — REST APIs, LLM providers, LLM proxies, MCP proxies, and WebSub APIs. + +The Platform API drives its own internal in-flight state machine off these acks — that intermediate state is platform-side; the gateway only reports the terminal outcome. + +#### Startup Sync + +Because WebSocket events can be missed while a gateway is down, every gateway runs a **background reconciliation** with the Platform API on startup: + +1. The gateway fetches the platform's expected deployment set for its `gateway_id` over REST. +2. It diffs that set against its own local state. +3. Missing or stale deployments are pulled and applied; orphaned local deployments are removed. + +The diff is computed **gateway-side** — gateways scale out far more than the Platform API, so doing it server-side would create a fan-out bottleneck. The sync is **asynchronous**: the gateway begins serving traffic immediately and reconciles in the background. 
Any WebSocket event arriving mid-sync naturally wins via deployment-ID ordering — operations are idempotent. + +```mermaid +graph TB + subgraph CP["Control Plane"] + Mgr["Platform API
Deployment Manager"] + end + + subgraph DP["Data Plane"] + subgraph GW1["Gateway A (gateway_id=gw-a)"] + C1["Controller"] + R1["Runtime"] + end + + subgraph GW2["Gateway B (gateway_id=gw-b)"] + C2["Controller"] + R2["Runtime"] + end + end + + C1 -- "REST
POST /gateways/{id}/manifest
(version, policies)" --> Mgr + Mgr <-- "WebSocket
register · deploy · undeploy
health · heartbeat" --> C1 + C2 -- "REST" --> Mgr + Mgr <-- "WebSocket" --> C2 + C1 --> R1 + C2 --> R2 ``` -+-------------------------------------------------------------+ -| Gateway-Controller (container) | -| +-------------------+ +-------------------+ | -| | REST API Server | -> | Validation Layer | | -| | (Port 9090) | +-------------------+ | -| +-------------------+ | | -| | v | -| | +----------------------+ | -| | | SQLite + In-Memory | | -| | | Cache | | -| | +----------------------+ | -| | | | -| v v | -| +-------------------+ +-------------------+ | -| | xDS Translator | -> | xDS v3 Server | | -| +-------------------+ | (Port 18000) | | -| +-------------------+ | -+-------------------------------------------------------------+ - | - | xDS gRPC - v -+-------------------------------------------------------------+ -| Router (Envoy container) | -| +-------------------+ +-------------------+ | -| | Envoy Proxy | -> | Backend Services | | -| | (Port 8080) | +-------------------+ | -| +-------------------+ | -+-------------------------------------------------------------+ + +### Event Gateway + +For event-driven (`WebSubApi`) traffic the same Gateway Controller drives a separate **event-gateway runtime** instead of the Envoy-based runtime. The controller's REST API, persistence, xDS streams, and `RuntimeDeployConfig` translator are reused unchanged; only the data-plane runtime differs. + +> **TODO**: A dedicated architecture document for the event-gateway runtime does not yet exist. Add `event-gateway/spec/architecture/architecture.md` covering the WebSub subscription flow, Kafka delivery model, and runtime/controller interaction. + +### CLI + +The `ap` CLI is the local-user equivalent of the Platform API — it talks to the Gateway Controller's **management REST API** directly to deploy, list, update, and undeploy artifacts. 
There is no separate channel: every CLI operation maps 1:1 to a REST call against the management API, just like the Platform API and the Gateway Operator. It also offers a `kubectl apply`-style bulk apply of a directory of artifact YAMLs, and a wrapper that runs the gateway-builder in Docker to produce custom runtime + controller images locally. + +The CLI is therefore the same kind of REST-API client as the Platform API and the Gateway Operator — just driven interactively from a developer's machine. + +### Kubernetes Integration + +On Kubernetes the gateway is deployed and managed by the **Gateway Operator**. The operator is a *client* of the Gateway Controller's REST API — it does not bypass the controller — and supports two reconciliation flows side by side: + +| Flow | CRDs | Behavior | +| ----------------------- | ----------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------- | +| **WSO2 CRD flow** | `ApiGateway`, `RestApi`, `LlmProvider`, `LlmProxy`, `Mcp`, `WebSubApi`, `ApiKey`, `Subscription`, `SubscriptionPlan`, `Certificate`, `Secret` | Operator deploys the gateway via Helm and POSTs each CR's spec to the controller's management REST API. CRs mirror controller resource kinds 1:1. | +| **Kubernetes Gateway API flow** | `Gateway`, `HTTPRoute`, `APIPolicy` | Operator deploys the gateway from a `Gateway` CR and translates `HTTPRoute` + `APIPolicy` into the controller's `RestApi` shape via the same REST API. | + +Both flows converge on the same REST API of the same Gateway Controller — the Kubernetes layer is just another producer alongside CLI users, Platform API, and immutable-mode file artifacts. + +--- + +## High Availability + +HA is configured **per gateway** (per `gateway_id`). Two HA gateways can still share the same physical PostgreSQL because every row is scoped by `gateway_id`. 
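To make the `gateway_id` scoping concrete, here is a small Go sketch of an events table shared by two gateways, where each replica polls only the rows of its own gateway. It is an in-memory stand-in rather than the real persistence layer: the `Event` and `EventTable` types, the integer event IDs, and the cursor-style `Poll` method are assumptions made for illustration.

```go
package main

import "fmt"

// Event is a hypothetical row of the events table; the document names only
// gateway_id and event_id as columns.
type Event struct {
	GatewayID string
	EventID   int64
	Payload   string
}

// EventTable is an in-memory stand-in for the shared PostgreSQL table.
type EventTable struct {
	rows []Event
	next int64
}

// Append stores a mutation event for one gateway and returns its event ID.
func (t *EventTable) Append(gatewayID, payload string) int64 {
	t.next++
	t.rows = append(t.rows, Event{GatewayID: gatewayID, EventID: t.next, Payload: payload})
	return t.next
}

// Poll returns the events of gatewayID with an ID greater than afterID. It
// mimics a query filtered on gateway_id and event_id, ordered by event_id.
func (t *EventTable) Poll(gatewayID string, afterID int64) []Event {
	var out []Event
	for _, e := range t.rows {
		if e.GatewayID == gatewayID && e.EventID > afterID {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	table := &EventTable{}
	table.Append("gw-a", "deploy orders-api")
	table.Append("gw-b", "deploy billing-api")
	table.Append("gw-a", "undeploy orders-api")

	// A gw-a replica that has already applied event 1 sees only event 3;
	// the gw-b row in between is invisible to it.
	for _, e := range table.Poll("gw-a", 1) {
		fmt.Printf("%s #%d: %s\n", e.GatewayID, e.EventID, e.Payload)
	}
}
```

Because the filter is part of every read, two HA gateways can keep separate cursors over the same table without coordinating with each other.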
+ +```mermaid +graph TB + LB["Load Balancer"] + + LB --> R1[Gateway Runtime 1] + LB --> R2[Gateway Runtime 2] + LB --> R3[Gateway Runtime 3] + + subgraph Controllers["Gateway Controller replicas
(same gateway_id)"] + C1[Controller 1] <-->|EventHub poll| DB + C2[Controller 2] <-->|EventHub poll| DB + end + + R1 -. "Envoy xDS + Policy xDS" .-> C1 + R2 -. "Envoy xDS + Policy xDS" .-> C2 + R3 -. "Envoy xDS + Policy xDS" .-> C1 + + R1 --- Redis[(Redis
shared rate-limit)] + R2 --- Redis + R3 --- Redis + + DB[("PostgreSQL
rows scoped by gateway_id")] ``` -## Integration Points - -- **API Developers** → Gateway-Controller: Submit API configurations via REST API. -- **Router** ← Gateway-Controller: Receives xDS configuration updates via gRPC. -- **Backend Services** ← Router: Forwards HTTP requests based on API configurations. -- **Platform API** → Gateway: Orchestrates API deployments to gateways. -- **Portals/CLI** → Platform API → Gateway: Indirect configuration management. - -## Deployment Tiers - -### Basic Gateway -- Components: Gateway-Controller (memory-only), Router, Policy Engine. -- No persistence (configurations lost on restart). -- Basic rate limiting built into Router. -- Use case: Development, testing, 14-day trial. - -### Standard Gateway -- Components: All Basic + Rate Limiter + Redis + SQLite. -- Persistent storage with SQLite (configurable to PostgreSQL/MySQL). -- Advanced distributed rate limiting. -- Use case: Production, enterprise deployments. - -## Data Flow - -### API Configuration Lifecycle -1. User submits API config (YAML/JSON) to REST API (port 9090). -2. Gateway-Controller validates configuration structure and fields. -3. Configuration persisted to SQLite and cached in memory. -4. xDS translator generates Envoy configuration from API config. -5. xDS server pushes new snapshot to Router via gRPC (port 18000). -6. Router applies configuration gracefully (zero downtime). - -### Runtime Request Flow -1. HTTP request arrives at Router (port 8080). -2. Router matches request to API configuration (method, path, context). -3. Policy Engine evaluates policies (auth, rate limit, etc.). -4. Request forwarded to backend service upstream URL. -5. Response returned to client. -6. Access log written to stdout in JSON format. +- **Controller replicas** of one gateway share a PostgreSQL database and a `gateway_id`. They use the DB-backed EventHub to keep their in-memory caches and xDS snapshots in sync — no separate broker required. +- **Runtime replicas** are stateless. 
Each connects to one controller's xDS streams. Configuration is reconstructed entirely from xDS, so a restart is safe.
+- **Other gateways** with a different `gateway_id` can share the same PostgreSQL instance without interfering: their data, events, and xDS state are isolated by ID.
+- **Distributed rate limiting** uses Redis as the shared counter store for the `advanced-ratelimit` policy. Without Redis, rate limiting is per-replica.
+- **Certificate rotation** is handled by the controller, which hot-reloads certificates (no restart required) and republishes them via SDS.
+
+---
+
+## Observability
+
+- **Metrics**: All three processes expose Prometheus metrics. The Policy Engine emits per-request, per-policy, and per-chain metrics: request count, latency histograms, action counts, chain length, xDS connection state, snapshot version, and body mode distribution.
+
+- **Tracing**: OpenTelemetry tracing is enabled in both Envoy and the Policy Engine. The default exporter points at an `otel-collector` sidecar that fans out to Jaeger or any OTLP backend. The Policy Engine creates a child span per policy execution and links spans across the ext_proc boundary using a propagated request ID.
+
+- **Logging**: Both Go and Python processes emit structured logs with consistent per-process prefixes, so a single `docker logs` stream stays readable.
+
+- **Analytics**: Per-request events are published asynchronously to configurable sinks (Moesif, gRPC ALS).
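The asynchronous analytics publishing described above can be illustrated with a minimal Go sketch. It is not the gateway's implementation: the `Event`, `Sink`, `Publisher`, and `memorySink` names are hypothetical, and dropping events when the buffer is full is an assumed policy chosen only to show how the request path can stay non-blocking.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical per-request analytics record.
type Event struct {
	RequestID string
	Status    int
}

// Sink is a hypothetical destination (for example, a Moesif or gRPC ALS client).
type Sink interface {
	Publish(Event)
}

// memorySink stands in for a real sink and records events in memory.
type memorySink struct {
	mu     sync.Mutex
	events []Event
}

func (s *memorySink) Publish(e Event) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.events = append(s.events, e)
}

func (s *memorySink) Len() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.events)
}

// Publisher decouples the request path from sink latency with a buffered queue.
type Publisher struct {
	queue chan Event
	sinks []Sink
	done  chan struct{}
}

func NewPublisher(buffer int, sinks ...Sink) *Publisher {
	p := &Publisher{
		queue: make(chan Event, buffer),
		sinks: sinks,
		done:  make(chan struct{}),
	}
	go p.run()
	return p
}

// run fans each queued event out to every configured sink.
func (p *Publisher) run() {
	defer close(p.done)
	for e := range p.queue {
		for _, s := range p.sinks {
			s.Publish(e)
		}
	}
}

// Enqueue never blocks. It returns false and drops the event when the
// buffer is full (an assumed policy, not taken from the gateway).
func (p *Publisher) Enqueue(e Event) bool {
	select {
	case p.queue <- e:
		return true
	default:
		return false
	}
}

// Close drains the queued events and waits for the worker to exit.
func (p *Publisher) Close() {
	close(p.queue)
	<-p.done
}

func main() {
	sink := &memorySink{}
	p := NewPublisher(16, sink)
	for i := 0; i < 3; i++ {
		p.Enqueue(Event{RequestID: fmt.Sprintf("req-%d", i), Status: 200})
	}
	p.Close()
	fmt.Println("delivered:", sink.Len())
}
```

The non-blocking `select` in `Enqueue` is the key design choice in this sketch: a slow or unavailable sink costs analytics completeness rather than request latency. A real deployment might prefer batching or a retry queue instead.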
+
+---
+
+## Key Architectural Decisions
+
+| Decision | Why |
+| ---------------------------------------------------------------- | ------------------------------------------------------------------------------ |
+| Policy Engine has **zero built-in policies**; all linked at build time | Reproducibility, security review surface, custom-policy support without a plugin loader |
+| **Go templates** for artifact field interpolation | Composable, gives typed render errors with clear messages |
+| **`RuntimeDeployConfig`** as kind-agnostic intermediate | Frees xDS translator and Policy Engine from per-kind branching |
+| **RouteConfig delivered via xDS** (not extracted from request metadata) | Avoids per-request protobuf metadata unmarshal in the data path |
+| **Per-chain body mode** with `mode_override` | Headers-only chains skip body buffering entirely |
+| **Atomic API-key replacement** on every xDS snapshot | No auth gap during xDS key rotation |
+| **UDS** between Router ↔ Policy Engine ↔ Python Executor | Lowest-latency local IPC; no port management; security via filesystem perms |
+| **Optional Python Executor**, started only if Python policies exist | Zero overhead for Go-only deployments |
+| **Single Dockerfile** for all three runtime processes | One artifact to scan, sign, and ship; matching Python versions guarantee C-ext compatibility |
+| **EventHub via DB polling** for controller multi-replica sync | Avoids adding Kafka/Redis/etc. as a hard dependency |
+| **`gateway_id` scoping on every persistent row** | Lets multiple distinct gateways share one PostgreSQL without interference |
+| **Immutable mode wipes SQLite on boot** | Guarantees the file artifacts are the single source of truth |
+| **Extend-base-image** custom builds via `gateway-builder` | Custom policy sets compose cleanly on top of WSO2-published base images |
+
+---
+
+## Versioning and Compatibility
+
+- The gateway and its Management/Admin REST APIs follow independent version tracks.
+- The Envoy version is pinned in the runtime Dockerfile.
+- Policies are pinned by minor version in the build manifest and resolved against the Go-module / pip-package references in the policy manifest. The Policy Engine and Controller key on the policy **major** version, which allows forward-compatible minor upgrades without redeployment.
+- The runtime reports its version on the Platform API connection; the platform can enforce a manifest version match or verification flag before accepting a deployment.
+
+---
+
+## Document Status
+
+- **Document Version**: 2.0
+- **Last Updated**: 2026-05-20
+- **Applies To**: Gateway `1.2.0-SNAPSHOT`
+- **Status**: Active