|
| 1 | +--- |
| 2 | +id: talos-architecture |
| 3 | +title: Ory Talos architecture |
| 4 | +sidebar_label: Ory Talos architecture |
| 5 | +--- |
| 6 | + |
| 7 | +# Architecture |
| 8 | + |
| 9 | +Talos separates API key management into two planes. |
| 10 | + |
| 11 | +## Admin plane |
| 12 | + |
| 13 | +The admin plane handles all key management and verification operations: key issuance, rotation, revocation, token derivation, |
| 14 | +JWKS, and verification (single and batch). It is exposed only to internal services and clients with admin credentials. |
| 15 | + |
| 16 | +Endpoints: `/v2alpha1/admin/`, including `/v2alpha1/admin/apiKeys:verify` and `/v2alpha1/admin/apiKeys:batchVerify`. |
| 17 | + |
| 18 | +For low-latency verification close to clients, deploy the commercial [edge proxy](../operate/deploy/edge-proxy.md) as a sidecar. |
| 19 | +The proxy caches admin verify responses locally, so applications get sub-millisecond cache hits without exposing the admin plane |
| 20 | +publicly. |
| 21 | + |
| 22 | +## Data plane |
| 23 | + |
| 24 | +The data plane handles self-service operations that credential holders perform by proving possession of the credential itself. |
| 25 | +No admin authentication is required. |
| 26 | + |
| 27 | +Endpoints: `POST /v2alpha1/apiKeys:selfRevoke` |
| 28 | + |
| 29 | +## Verification flow |
| 30 | + |
| 31 | +``` |
| 32 | +Client --> Verifier --> Cache (hit?) --> Database --> Response |
| 33 | +                           |                            ^ |
| 34 | +                           +-- cache hit ---------------+ |
| 35 | +``` |
| 36 | + |
| 37 | +1. Client sends credential to `POST /v2alpha1/admin/apiKeys:verify` |
| 38 | +2. Talos identifies the credential type (generated, imported, JWT, macaroon) |
| 39 | +3. For generated keys, the UUID is extracted from the token identifier |
| 40 | +4. For imported keys, a tenant-scoped SHA-512/256 hash is computed |
| 41 | +5. Database lookup (or cache hit) returns key metadata |
| 42 | +6. Response includes key status, owner, scopes, and metadata |
| 43 | + |
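Step 4 above can be sketched in Go as follows. This is a minimal illustration, not Talos's implementation: the function name `tenantScopedHash` and the way the tenant ID is mixed into the digest (tenant ID, a zero-byte separator, then the key) are assumptions. Only the choice of SHA-512/256 comes from the text.

```go
package main

import (
	"crypto/sha512"
	"encoding/hex"
	"fmt"
)

// tenantScopedHash returns a hex-encoded SHA-512/256 digest over a tenant ID
// and an imported key. The input layout is an assumption for illustration;
// the real scheme may scope the hash differently.
func tenantScopedHash(tenantID, importedKey string) string {
	h := sha512.New512_256()
	h.Write([]byte(tenantID))
	h.Write([]byte{0}) // separator so ("ab", "c") and ("a", "bc") differ
	h.Write([]byte(importedKey))
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := tenantScopedHash("tenant-a", "legacy-key-123")
	b := tenantScopedHash("tenant-b", "legacy-key-123")
	// The same imported key yields different lookup hashes per tenant.
	fmt.Println(len(a), a != b)
}
```

Scoping the digest by tenant means the same imported key registered by two tenants maps to two distinct lookup values.
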
| 44 | +## Deployment topologies |
| 45 | + |
| 46 | +| Topology | Edition | Description | |
| 47 | +| ------------ | ---------- | -------------------------------------------------------------------- | |
| 48 | +| Single-node | OSS | One process serves both planes | |
| 49 | +| Split planes | Commercial | Admin and data planes as separate deployments | |
| 50 | +| Edge proxy | Commercial | Sidecar proxy at the edge that caches admin verify responses locally | |
| 51 | + |
| 52 | +Both planes share the same database. Verification uses caching (memory or Redis) to minimize database load. |
| 53 | + |
| 54 | +## Ports |
| 55 | + |
| 56 | +| Port | Purpose | |
| 57 | +| ---- | ------------------ | |
| 58 | +| 4420 | HTTP API (default) | |
| 59 | +| 4422 | Prometheus metrics | |
| 60 | + |
| 61 | +## Design philosophy |
| 62 | + |
| 63 | +### Separation of concerns |
| 64 | + |
| 65 | +The system is divided into distinct layers: |
| 66 | + |
| 67 | +- **Admin plane**: Management operations (CRUD for keys, rotation, import, token derivation) |
| 68 | +- **Data plane**: High-throughput verification operations |
| 69 | +- **Persistence layer**: Database abstraction with pluggable drivers |
| 70 | +- **Cache layer**: Performance optimization with multiple backends |
| 71 | + |
| 72 | +This separation allows independent scaling of components, different SLOs for different operations (admin targets \<100ms p99, data |
| 73 | +plane targets \<3ms p99), and clear boundaries between responsibilities. |
| 74 | + |
| 75 | +### Production-first design |
| 76 | + |
| 77 | +- Hard isolation between admin and data operations |
| 78 | +- Metrics, traces, and structured logs are emitted by default |
| 79 | +- Graceful degradation when the database or cache backend is unavailable |
| 80 | +- Zero-downtime deployments via rolling updates and stateless verification |
| 81 | + |
| 82 | +### Performance characteristics |
| 83 | + |
| 84 | +- Self-contained tokens (JWT/macaroon) enable stateless verification |
| 85 | +- HMAC-SHA256 keeps the revocation check on the order of microseconds; bcrypt would cap a single core at roughly 10 verifications |
| 86 | + per second |
| 87 | +- LRU caching for hot paths |
| 88 | +- Minimal allocations in the verification path |
| 89 | + |
| 90 | +## System architecture |
| 91 | + |
| 92 | +``` |
| 93 | +Clients (CLI, SDK, HTTP) |
| 94 | +               | |
| 95 | +               v |
| 96 | ++----------------------------------+ |
| 97 | +|    HTTP Server (grpc-gateway)    | |
| 98 | +|            Port: 4420            | |
| 99 | ++----------------------------------+ |
| 100 | +               | |
| 101 | +               v |
| 102 | ++----------------------------------+ |
| 103 | +|            Middleware            | |
| 104 | +|    Logging, Metrics, Tracing     | |
| 105 | ++----------------------------------+ |
| 106 | +               | |
| 107 | +         +-----+----------+ |
| 108 | +         |                | |
| 109 | +         v                v |
| 110 | +   +-----------+    +-----------+ |
| 111 | +   |   Admin   |    |   Data    | |
| 112 | +   |   Plane   |    |   Plane   | |
| 113 | +   |  <100ms   |    | <3ms p99  | |
| 114 | +   +-----------+    +-----------+ |
| 115 | +         |                | |
| 116 | +         v                v |
| 117 | ++----------------------------------+ |
| 118 | +|          Service Layer           | |
| 119 | +|    Business logic, Validation    | |
| 120 | ++----------------------------------+ |
| 121 | +               | |
| 122 | +         +-----+----------+ |
| 123 | +         |                | |
| 124 | +         v                v |
| 125 | +   +-----------+    +-----------+ |
| 126 | +   | Persist.  |    |   Cache   | |
| 127 | +   |  SQLite   |    |  Memory   | |
| 128 | +   | PG/MySQL  |    |    LRU    | |
| 129 | +   |   CRDB    |    |   Redis   | |
| 130 | +   +-----------+    +-----------+ |
| 131 | +``` |
| 132 | + |
| 133 | +All requests enter through a single HTTP server built on grpc-gateway (port 4420) and pass through middleware for logging, |
| 134 | +metrics, and tracing before being routed to the appropriate plane. |
| 135 | + |
| 136 | +## Component overview |
| 137 | + |
| 138 | +### HTTP server |
| 139 | + |
| 140 | +The API layer uses grpc-gateway for HTTP/JSON routing with protobuf-based schemas. It serves both planes through a single port, |
| 141 | +handles CORS and compression, and exposes OpenAPI documentation. |
| 142 | + |
| 143 | +### Service layer |
| 144 | + |
| 145 | +Business logic is split between the admin plane service (key lifecycle, import, token derivation, input validation) and the data |
| 146 | +plane verifier (token parsing, signature verification, revocation checking, cache management). The verifier is optimized for the |
| 147 | +hot path with minimal allocations. |
| 148 | + |
| 149 | +### Persistence |
| 150 | + |
| 151 | +Database access uses sqlc-generated type-safe queries with pluggable drivers: |
| 152 | + |
| 153 | +- **SQLite** -- OSS edition, zero-config, suitable for millions of keys |
| 154 | +- **PostgreSQL** -- production workloads |
| 155 | +- **MySQL** -- production workloads |
| 156 | +- **CockroachDB** -- distributed deployments |
| 157 | + |
| 158 | +Schema changes are managed through versioned migrations using golang-migrate. |
| 159 | + |
| 160 | +### Cache |
| 161 | + |
| 162 | +The cache layer reduces database load on the verification path: |
| 163 | + |
| 164 | +- **Memory LRU** (OSS) -- local to each instance, configurable size limits |
| 165 | +- **Redis** (Commercial) -- distributed, supports cluster and sentinel modes |
| 166 | +- **Hierarchical L1+L2** (Commercial) -- memory for speed, Redis for shared state |
| 167 | + |
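The hierarchical L1+L2 arrangement can be pictured as a read-through lookup: check memory first, then the shared cache, then the database. The Go sketch below is an illustration under assumptions. The `Store` interface, `Tiered` type, and `Load` callback are hypothetical names, and plain maps stand in for both the LRU and Redis, with no eviction or TTL handling.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrMiss signals that a key is not present in a cache tier.
var ErrMiss = errors.New("cache miss")

// Store is a hypothetical minimal cache interface.
type Store interface {
	Get(key string) (string, error)
	Set(key, value string)
}

// mapStore is a map-backed stand-in for a real LRU or Redis client.
type mapStore map[string]string

func (m mapStore) Get(key string) (string, error) {
	v, ok := m[key]
	if !ok {
		return "", ErrMiss
	}
	return v, nil
}

func (m mapStore) Set(key, value string) { m[key] = value }

// Tiered reads L1, then L2, then falls back to Load (the database).
type Tiered struct {
	L1, L2 Store
	Load   func(key string) (string, error)
}

func (t *Tiered) Get(key string) (string, error) {
	if v, err := t.L1.Get(key); err == nil {
		return v, nil
	}
	if v, err := t.L2.Get(key); err == nil {
		t.L1.Set(key, v) // promote to the local tier
		return v, nil
	}
	v, err := t.Load(key)
	if err != nil {
		return "", err
	}
	t.L2.Set(key, v)
	t.L1.Set(key, v)
	return v, nil
}

func main() {
	loads := 0
	c := &Tiered{L1: mapStore{}, L2: mapStore{}, Load: func(k string) (string, error) {
		loads++
		return "metadata-for-" + k, nil
	}}
	c.Get("key-1")
	c.Get("key-1") // served from L1, no second database load
	fmt.Println("database loads:", loads)
}
```

A real deployment would also need invalidation on revocation so that a cached "active" entry does not outlive the key.
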
| 168 | +### Crypto |
| 169 | + |
| 170 | +Talos supports multiple JWT signing algorithms and a separate API key hashing mechanism: |
| 171 | + |
| 172 | +- **JWT signing algorithms** |
| 173 | +  - `Ed25519 (EdDSA)` -- default, fastest signing and smallest keys |
| 174 | +  - `RSA-2048/4096 (RS256)` -- legacy compatibility |
| 175 | +- **API key hashing** |
| 176 | +  - `HMAC-SHA256` -- used for API key revocation checks (\<1ms with constant-time comparison) |
| 177 | + |
| 178 | +The JWT signing algorithm is determined per JWK by its `alg` field, so one JWKS can contain keys for multiple signing algorithms |
| 179 | +at the same time. |
| 180 | + |
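As a rough illustration of the HMAC-SHA256 check with constant-time comparison, the Go sketch below uses only the standard library. How Talos keys the HMAC and what it stores are not described here, so `keyDigest`, `matchesStored`, and the server-side secret are assumptions.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// keyDigest computes HMAC-SHA256 over an API key with a server-side secret.
// Storing this digest instead of the raw key is an assumed design.
func keyDigest(secret []byte, apiKey string) []byte {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(apiKey))
	return mac.Sum(nil)
}

// matchesStored compares digests with hmac.Equal, which runs in constant
// time so the comparison does not leak how many bytes matched.
func matchesStored(secret []byte, presented string, stored []byte) bool {
	return hmac.Equal(keyDigest(secret, presented), stored)
}

func main() {
	secret := []byte("example-server-secret")
	stored := keyDigest(secret, "example-api-key")
	fmt.Println(matchesStored(secret, "example-api-key", stored))
	fmt.Println(matchesStored(secret, "wrong-key", stored))
}
```

Unlike bcrypt, HMAC-SHA256 has no tunable work factor, which is why it suits high-entropy generated keys on a hot path but not low-entropy passwords.
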
| 181 | +### Observability |
| 182 | + |
| 183 | +Built-in instrumentation across three pillars: |
| 184 | + |
| 185 | +- **Metrics** -- Prometheus exposition on port 4422 with request latency histograms and error rate counters |
| 186 | +- **Tracing** -- OpenTelemetry with W3C Trace Context propagation, configurable sampling, OTLP and Jaeger exporters |
| 187 | +- **Logging** -- structured JSON logging via slog with correlation IDs and contextual fields |
| 188 | + |
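The logging pillar can be sketched with the standard `log/slog` package. This is an assumption-laden example: the helper `logLine` and the field name `correlation_id` are hypothetical, and the timestamp is dropped only to keep the sketch's output deterministic.

```go
package main

import (
	"bytes"
	"fmt"
	"log/slog"
)

// logLine renders one structured JSON log record carrying a correlation ID.
// The field name "correlation_id" is an assumption for illustration.
func logLine(correlationID, msg string) string {
	var buf bytes.Buffer
	opts := &slog.HandlerOptions{
		// Drop the top-level timestamp so the output is deterministic here.
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			if len(groups) == 0 && a.Key == slog.TimeKey {
				return slog.Attr{}
			}
			return a
		},
	}
	// With attaches the correlation ID to every record from this logger.
	logger := slog.New(slog.NewJSONHandler(&buf, opts)).With("correlation_id", correlationID)
	logger.Info(msg)
	return buf.String()
}

func main() {
	fmt.Print(logLine("req-42", "key verified"))
}
```

In a server, the derived logger would typically be created once per request in middleware and passed down through the request context.
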
| 189 | +## Scalability |
| 190 | + |
| 191 | +### Small (\<1k RPS) |
| 192 | + |
| 193 | +A single Talos instance handles both planes with SQLite and an in-memory LRU cache. No external dependencies required. |
| 194 | + |
| 195 | +- OSS edition sufficient |
| 196 | +- 1 CPU, 512MB RAM |
| 197 | +- Cost: $5-10/month |
| 198 | + |
| 199 | +### Medium (10-50k RPS) |
| 200 | + |
| 201 | +Separate admin and data plane deployments behind a load balancer. PostgreSQL replaces SQLite for durability. Redis provides shared |
| 202 | +caching across data plane instances. |
| 203 | + |
| 204 | +- Commercial edition |
| 205 | +- Auto-scaling for data plane |
| 206 | +- Cost: $100-500/month |
| 207 | + |
| 208 | +### Large (200k+ RPS) |
| 209 | + |
| 210 | +A cluster of 10-50+ stateless data plane instances with auto-scaling, backed by a distributed Redis cache and PostgreSQL with read |
| 211 | +replicas and connection pooling. Supports multi-region deployment. |
| 212 | + |
| 213 | +- Commercial edition |
| 214 | +- Regional data plane deployment |
| 215 | +- Cost: $1-5k/month |