- Title: Quorum Distributed Cache, Phase 1 Production Cutover
- Owning Team: Platform Caching
- Author: Principal Engineer, Platform Caching
- Reviewers: Architecture Review Board (ARB), Storage SIG, SRE Foundations, Security Platform
- Status: Draft, pre-ARB
- Revision: 0.4
- Related Docs: ADR-0142 (Raft selection), ADR-0151 (RESP3 compatibility shim), CAP-2024-Q3 (capacity model), RFC-Quorum-Protocol-v1
- Distribution: internal, infra-eng
The current cache tier is Redis Cluster, deployed across 28 shards in two regions (us-east-1 and us-west-2). Versions are mixed: 17 shards on Redis 6.2.x, 11 shards on Redis 5.0.x carried forward from the 2021 build-out. Working set across the fleet is approximately 340GB hot, with peak aggregate throughput of 3.1M ops/sec measured at the 95th percentile of the daily curve. Replication is single-region primary/replica with async cross-region copy via a custom side-car (xreplica) that has accumulated several known consistency edge cases under partition.
Operational pain points motivating replacement:
- No multi-region active-active. Failover is operator-driven and has run 9 to 14 minutes in the last three GameDay exercises.
- Mixed-version fleet blocks Cluster API features (notably RESP3 push, client-side caching) and creates a long tail of CVE patching cost.
- xreplica is unowned. The original authors have left. It has produced three SEV-3 incidents in the last 18 months.
- Memory efficiency. jemalloc fragmentation on long-running 5.0 shards has measured between 1.31 and 1.58 RSS/used ratio, costing roughly 22 percent of provisioned memory.
- No native tenancy primitives. Namespace isolation is enforced in client libraries, inconsistently.
Quorum is the in-house replacement, developed by Platform Caching over the last two quarters. This PRD covers Phase 1: production cutover for the 11 legacy Redis 5.0 shards, representing approximately 38 percent of working set.
- Replace the 11 Redis 5.0 shards with Quorum clusters in both regions.
- Deliver multi-region active-active replication via Raft per partition.
- Maintain or improve current in-region p99 latency.
- Provide a wire-compatible path (RESP3) so existing clients require no immediate change.
- Land a gRPC + protobuf surface for new clients and for features RESP3 cannot express.
- Stand up an OTel-native observability surface from day one.
Non-goals for Phase 1:
- Replacing the Redis 6.2 shards in this phase. Phase 2 will address them after Quorum has 90 days of production soak.
- Supporting Redis Streams (XADD, XREAD, consumer groups). No internal customer is on Streams in the 5.0 fleet; confirmed by namespace audit.
- Supporting geospatial commands (GEOADD, GEOSEARCH). Same audit result.
- Lua scripting (EVAL, EVALSHA). Replaced by a server-side function registry in a later phase.
- Disk-backed persistence. Quorum is memory-resident. Cold start rebuilds from peers; durable storage is explicitly out of scope.
- Client library rewrites. Existing go-redis, jedis, and redis-py clients must continue to work via the RESP3 shim.
Quorum is a memory-resident key-value store organized as a set of Raft groups, one per partition. The partition map is a consistent hash ring with 4096 virtual nodes, assigned to physical Quorum nodes by the control plane. Each partition runs a 5-member Raft group spanning both regions (3 in the primary region for the partition, 2 in the secondary), with leader placement biased to minimize cross-region writes per partition.
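As a rough illustration of the key-to-partition lookup described above, here is a minimal consistent-hash ring sketch. Only the 4096 virtual nodes come from this document. The hash function (SHA-256), the `HashRing` class, and the round-robin vnode-to-node assignment are hypothetical stand-ins for whatever the control plane actually does.

```python
import bisect
import hashlib

VNODES = 4096  # ring size stated in this document


def _token(data: bytes) -> int:
    """Map bytes to a 64-bit ring position (SHA-256 is an illustrative choice)."""
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


class HashRing:
    """Hypothetical partition map: sorted vnode tokens plus an owner per vnode."""

    def __init__(self, nodes: list, vnodes: int = VNODES) -> None:
        if not nodes:
            raise ValueError("at least one node is required")
        # Each vnode gets a pseudo-random position on the ring.
        entries = sorted((_token(f"vnode-{i}".encode()), i) for i in range(vnodes))
        self._tokens = [t for t, _ in entries]
        self._vnode_ids = [v for _, v in entries]
        # Round-robin assignment stands in for the control plane's placement logic.
        self._owner = {v: nodes[v % len(nodes)] for v in range(vnodes)}

    def vnode_for(self, key: bytes) -> int:
        """Return the first vnode clockwise from the key's ring position."""
        idx = bisect.bisect_left(self._tokens, _token(key))
        return self._vnode_ids[idx % len(self._tokens)]  # wrap past the last token

    def node_for(self, key: bytes) -> str:
        """Return the physical node that currently owns the key's vnode."""
        return self._owner[self.vnode_for(key)]
```

For example, `HashRing(["node-a", "node-b", "node-c"]).node_for(b"user:42")` returns the same node on every call until the assignment changes, which is the property the partition map relies on.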
                      +-------------------------+
                      |  Control Plane (etcd)   |
                      |  partition map, leases  |
                      +-----------+-------------+
                                  |
                +-----------------+-----------------+
                |                                   |
        Region: us-east-1                   Region: us-west-2
    +-----------------------+             +-----------------------+
    | Quorum Node Pool      |  Raft RPC   | Quorum Node Pool      |
    | - partition 0..N      | <---------> | - partition 0..N      |
    | - RESP3 listener      |   (mTLS)    | - RESP3 listener      |
    | - gRPC listener       |             | - gRPC listener       |
    +----------+------------+             +----------+------------+
               |                                     |
         Client Proxy                          Client Proxy
     (RESP3 + dual-write)                  (RESP3 + dual-write)
               |                                     |
       Application Pods                      Application Pods
Components:
- Control Plane. etcd-backed metadata service. Holds partition assignments, node membership, lease state. Not in the data path.
- Quorum Node. Stateful pod. Hosts N partition replicas. Single-process, multi-threaded. io_uring on Linux 6.1+, with BPF-based socket steering to pin connections to worker shards.
- Client Proxy (qproxy). Sidecar deployed per application pod during migration. Speaks RESP3 to the application, routes to Redis Cluster, Quorum, or both (dual-write). Removed after cutover.
- 5 members per partition.
- 3 in the partition's home region, 2 in the peer region.
- Leadership is sticky to the home region under normal operation. Leader can fail over to the peer region under home-region loss.
- Pre-vote enabled. Leader leases of 500ms. Heartbeat interval 50ms. Election timeout randomized 1000 to 1500ms.
- Snapshot every 100k log entries or 256MB, whichever first. Snapshots streamed via dedicated TCP connection, not the Raft RPC channel.
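The timing values above interact: a leader lease is only safe if it lapses before any follower can start an election, and several heartbeats should fit inside one lease. The sketch below encodes those relationships as a sanity check. The `RaftTiming` class and its field names are hypothetical. The defaults are the numbers in this document, and the three-missed-heartbeats figure is taken from the failure-mode table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RaftTiming:
    """Hypothetical container for the Raft timing values in this document."""

    heartbeat_ms: int = 50
    leader_lease_ms: int = 500
    election_timeout_min_ms: int = 1000
    election_timeout_max_ms: int = 1500
    missed_heartbeats_for_detection: int = 3

    def detection_ms(self) -> int:
        """Time until heartbeat loss is flagged (3 x 50ms = 150ms by default)."""
        return self.heartbeat_ms * self.missed_heartbeats_for_detection

    def problems(self) -> list:
        """Return human-readable violations; an empty list means the values are coherent."""
        issues = []
        if self.election_timeout_min_ms >= self.election_timeout_max_ms:
            issues.append("election timeout range is empty")
        # The lease must expire before the earliest possible election,
        # otherwise an old and a new leader could both serve leader reads.
        if self.leader_lease_ms >= self.election_timeout_min_ms:
            issues.append("leader lease must be shorter than the minimum election timeout")
        # Detection should happen inside one lease period.
        if self.detection_ms() >= self.leader_lease_ms:
            issues.append("heartbeat loss detection is slower than the leader lease")
        return issues
```

With the defaults, `RaftTiming().problems()` is empty. This check ignores clock drift, which a real lease implementation would need to budget for.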
- Single-key operations: linearizable within a partition (Raft log order is the linearization point).
- Multi-key operations within a single tenant namespace: sequential consistency. Cross-partition transactions are not supported in Phase 1.
- Cross-region: by default, all writes go through Raft and are linearizable across regions for that partition. Per-key opt-in to "eventual" mode is available via a key tag ({eventual}) for cache-style use cases where staleness is tolerable. Eventual keys replicate via async log shipping, not Raft.
- Reads: by default, leader reads. Follower reads are opt-in per call with a bounded-staleness parameter (max 200ms).
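The read and write routing rules above can be sketched as two small decisions. This is an illustration under assumptions, not the Quorum router: `choose_read_path` and `is_eventual` are hypothetical names, follower lag is assumed to be known to the caller, and falling back to the leader when a follower is too stale is one possible policy (rejecting the read would be another).

```python
from enum import Enum
from typing import Optional

MAX_FOLLOWER_STALENESS_MS = 200  # cap stated in this document
EVENTUAL_TAG = b"{eventual}"     # tag spelling as currently proposed


class ReadPath(Enum):
    LEADER = "leader"
    FOLLOWER = "follower"


def is_eventual(key: bytes) -> bool:
    """Assumption: any key containing the tag is replicated in eventual mode."""
    return EVENTUAL_TAG in key


def choose_read_path(max_staleness_ms: Optional[int], follower_lag_ms: float) -> ReadPath:
    """Pick where to serve a read.

    max_staleness_ms is None for the default (leader) read, or the per-call
    bounded-staleness value when the caller opts in to follower reads.
    """
    if max_staleness_ms is None:
        return ReadPath.LEADER
    if not 0 < max_staleness_ms <= MAX_FOLLOWER_STALENESS_MS:
        raise ValueError(f"staleness must be in (0, {MAX_FOLLOWER_STALENESS_MS}] ms")
    # Serve from the follower only if it is within the requested bound.
    if follower_lag_ms <= max_staleness_ms:
        return ReadPath.FOLLOWER
    return ReadPath.LEADER
```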
Two listeners. Same data, two protocols.
- Wire-compatible with Redis 7 RESP3.
- Supported commands (Phase 1):
  - Strings: GET, SET, SETEX, SETNX, MSET, MGET, INCR, INCRBY, DECR, DECRBY, APPEND, GETRANGE, STRLEN.
  - Hashes: HSET, HGET, HMGET, HGETALL, HDEL, HEXISTS, HINCRBY, HLEN, HKEYS, HVALS.
  - Sorted sets: ZADD, ZRANGE, ZRANGEBYSCORE, ZRANGEBYLEX, ZREM, ZSCORE, ZCARD, ZINCRBY, ZRANK.
  - HyperLogLog: PFADD, PFCOUNT, PFMERGE.
  - TTL: EXPIRE, EXPIREAT, PEXPIRE, TTL, PTTL, PERSIST.
  - Server: PING, CLIENT, HELLO, RESET, INFO (subset).
- Explicitly unsupported: XADD/XREAD/XGROUP, GEOADD/GEOSEARCH, EVAL/EVALSHA, SUBSCRIBE/PUBLISH (deferred to Phase 3), MULTI/EXEC (deferred), CLUSTER admin (proxied to no-op or rewritten by qproxy).
- Behavior delta: OBJECT ENCODING returns Quorum-native encoding names. Clients that key off encoding strings will need to update. The audit found 2 such call sites; both owned by Platform Search.
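The command lists above amount to a lookup table. The sketch below shows one way a listener or audit script might classify an incoming command. The `Support` enum and `classify` function are hypothetical. Treating any command absent from every list as unsupported is an assumption, since the document does not say how unlisted commands are handled.

```python
from enum import Enum


class Support(Enum):
    SUPPORTED = "supported"
    UNSUPPORTED = "unsupported"
    DEFERRED = "deferred"
    PROXY_HANDLED = "proxy_handled"


# Phase 1 command lists, copied from this document.
SUPPORTED = frozenset(
    "GET SET SETEX SETNX MSET MGET INCR INCRBY DECR DECRBY APPEND GETRANGE STRLEN "
    "HSET HGET HMGET HGETALL HDEL HEXISTS HINCRBY HLEN HKEYS HVALS "
    "ZADD ZRANGE ZRANGEBYSCORE ZRANGEBYLEX ZREM ZSCORE ZCARD ZINCRBY ZRANK "
    "PFADD PFCOUNT PFMERGE "
    "EXPIRE EXPIREAT PEXPIRE TTL PTTL PERSIST "
    "PING CLIENT HELLO RESET INFO".split()
)
DEFERRED = frozenset({"SUBSCRIBE", "PUBLISH", "MULTI", "EXEC"})
PROXY_HANDLED = frozenset({"CLUSTER"})  # no-op or rewritten by qproxy


def classify(argv: list) -> Support:
    """Classify a parsed RESP command (list of bytes) by its first token."""
    if not argv:
        raise ValueError("empty command")
    name = argv[0].decode("ascii", "replace").upper()  # command names are case-insensitive
    if name in SUPPORTED:
        return Support.SUPPORTED
    if name in DEFERRED:
        return Support.DEFERRED
    if name in PROXY_HANDLED:
        return Support.PROXY_HANDLED
    # Assumption: anything else, including XADD, GEOADD, and EVAL, is rejected.
    return Support.UNSUPPORTED
```

Running a table like this over captured traffic during shadow mode would be one way to confirm the namespace audit results before dual-write begins.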
- gRPC listener: new surface for clients that want richer types and observability.
- Service definitions in qcache.v1.proto.
- Methods cover the same data operations plus:
  - BatchGet/BatchSet with per-key consistency hints.
  - Watch (server-stream) for key invalidation push.
  - Lease and LeaseRenew for distributed-lock-style usage.
  - Tenant.* admin RPCs (rate limit config, quota inspection).
- All gRPC RPCs carry an x-tenant-id header validated against the partition map and the tenant ACL.
- Flat keyspace per tenant. Keys are bytes, max 4KB.
- Values are bytes, max 16MB per value. Sorted sets and hashes have a per-collection cap of 64MB.
- Memory accounting:
  - 16-byte fixed overhead per key (pointer, type tag, expiry, ref count).
  - Slab allocator with 14 size classes: 32, 64, 96, 128, 192, 256, 384, 512, 1024, 2048, 4096, 8192, 16384, 65536 bytes. Allocations above 64KB fall through to the system allocator.
  - Slab class utilization is a first-class metric (quorum_slab_util_ratio{class="..."}).
- TTL: per-key expiry, second and millisecond precision. Expiration runs on a background reaper at 100Hz; lazy expiry on access guarantees no stale read returns a logically expired key.
- Eviction: per-tenant LRU with a configurable working set cap. No global eviction; tenants that exceed their cap are rejected with QUOTA_EXCEEDED rather than silently evicting other tenants' data.
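To make the slab accounting concrete, the sketch below picks the size class for an allocation and reports the internal waste. The size classes and the 16-byte overhead are from this document. The function names are hypothetical, and charging the per-key overhead separately from the slab allocation is an assumption, since the document does not say whether the overhead lives inside the slab slot.

```python
import bisect
from typing import Optional

# The 14 size classes listed in this document, in bytes.
SLAB_CLASSES = (32, 64, 96, 128, 192, 256, 384, 512,
                1024, 2048, 4096, 8192, 16384, 65536)
KEY_OVERHEAD_BYTES = 16  # fixed per-key overhead stated in this document


def slab_class_for(size: int) -> Optional[int]:
    """Return the smallest class that fits `size`, or None above 64KB
    (those allocations fall through to the system allocator)."""
    if size <= 0:
        raise ValueError("size must be positive")
    idx = bisect.bisect_left(SLAB_CLASSES, size)
    return SLAB_CLASSES[idx] if idx < len(SLAB_CLASSES) else None


def internal_waste(size: int) -> int:
    """Bytes lost to rounding up to the class size (0 for system allocations)."""
    cls = slab_class_for(size)
    return 0 if cls is None else cls - size


def estimated_footprint(payload_bytes: int) -> int:
    """Assumed accounting: per-key overhead plus the slab slot (or raw size)."""
    cls = slab_class_for(payload_bytes)
    return KEY_OVERHEAD_BYTES + (cls if cls is not None else payload_bytes)
```

Under these assumptions a 100-byte value lands in the 128-byte class and wastes 28 bytes. The jump from 16384 to 65536 means values just over 16KB waste the most, which is the kind of pattern the per-class utilization metric should expose.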
In-region (single datacenter, warm cache, 90 percent reads):
- p50 < 250µs end-to-end (client wire to server wire).
- p99 < 2ms.
- p99.9 < 10ms.
- Sustained throughput per shard cluster: 1.2M ops/sec. Burst (60s window): 2.4M ops/sec.
Cross-region (linearizable write through Raft):
- p50 < 25ms (driven by inter-region RTT, currently 18ms at p95).
- p99 < 80ms.
- p99.9 < 200ms.
Eventual-mode keys cross-region:
- p99 replication lag < 250ms.
- p99.9 replication lag < 1.5s under healthy network.
Availability SLO:
- 99.99 percent monthly for in-region single-key reads.
- 99.95 percent monthly for cross-region linearizable writes.
Error budget consumption is tracked per partition and aggregated per tenant.
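For reference, the availability targets above translate into monthly downtime budgets as follows. This is a minimal sketch assuming a 30-day month and a time-based budget. The document does not state whether the SLO is measured in time or in request counts, and the function names are hypothetical.

```python
def error_budget_minutes(slo_percent: float, days: int = 30) -> float:
    """Allowed unavailability per period, in minutes, for a time-based SLO."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100")
    return (1 - slo_percent / 100) * days * 24 * 60


def budget_consumed(bad_minutes: float, slo_percent: float, days: int = 30) -> float:
    """Fraction of the period's budget already spent (1.0 means exhausted)."""
    return bad_minutes / error_budget_minutes(slo_percent, days)


in_region_reads = error_budget_minutes(99.99)      # about 4.3 minutes per 30 days
cross_region_writes = error_budget_minutes(99.95)  # about 21.6 minutes per 30 days
```

Because consumption is tracked per partition and aggregated per tenant, the same arithmetic would apply at each level with that level's measured unavailability as the input.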
| Failure | Detection | Response | Time to Restore |
|---|---|---|---|
| Single node failure | Raft heartbeat loss (3 missed, 150ms) | Follower promoted on partitions where node was leader | < 2s for leader re-election; reads served from peers immediately |
| Disk full on a node (snapshot dir) | Disk metric < 5GB free | Node enters read-only; control plane reassigns partitions | < 5 min via partition drain |
| Network partition, minority side | Raft loses quorum | Minority partitions reject writes, serve stale reads if opted in | Restored when partition heals |
| Network partition, majority side | Raft retains quorum | Continues serving | n/a |
| Full region failure | Control plane health check + Raft membership view | Partitions whose home is the lost region elect leader in peer region | < 30s for leader election fleet-wide |
| Control plane (etcd) unavailable | etcd client errors | Data plane continues with cached partition map; no new placement decisions | Up to 10 min cached, then degraded mode |
| qproxy crash | Pod readiness | Sidecar restarts; application connections re-establish | < 5s |
| Slow client (head-of-line) | Per-connection write buffer > 16MB | Connection closed with OUTPUT_BUFFER_LIMIT | immediate |
| Hot key | Per-key ops counter > 100k/s | Read replica fanout enabled for that key, surfaced to tenant dashboard | < 30s |
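The hot-key row in the table above depends on a per-key rate counter. The sketch below shows a simple fixed-window version. It is not Quorum's detector: the `HotKeyDetector` class is hypothetical, the one-second window is an assumption, and a production detector would likely use a sliding window or a sampled sketch to bound memory.

```python
from collections import Counter
from typing import Optional

HOT_KEY_OPS_PER_SEC = 100_000  # threshold stated in this document


class HotKeyDetector:
    """Fixed-window per-key op counter (illustrative only)."""

    def __init__(self, threshold_per_sec: int = HOT_KEY_OPS_PER_SEC,
                 window_s: float = 1.0) -> None:
        self.threshold_per_sec = threshold_per_sec
        self.window_s = window_s
        self._counts: Counter = Counter()
        self._window_start: Optional[float] = None

    def record(self, key: bytes, now: float, ops: int = 1) -> bool:
        """Count `ops` operations on `key` at time `now` (seconds).

        Returns True once the key exceeds the threshold in the current window.
        """
        if self._window_start is None or now - self._window_start >= self.window_s:
            # Start a new window and forget the previous counts.
            self._counts.clear()
            self._window_start = now
        self._counts[key] += ops
        return self._counts[key] > self.threshold_per_sec * self.window_s
```

The caller supplies `now` so the detector can be driven by a monotonic clock in production and by fixed timestamps in tests.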
- Kubernetes 1.29+ required (uses PodDisruptionBudget v1 with unhealthyPodEvictionPolicy: AlwaysAllow).
- StatefulSet per region per cluster, with stable network identity via headless service.
- Pod spec:
  - 32 vCPU, 256GB RAM, local NVMe (snapshot scratch only, not durable store).
  - Hugepages 2MB, 200 pages reserved.
  - CAP_BPF, CAP_NET_ADMIN for BPF socket steering programs.
- Linux kernel 6.1 minimum (io_uring features, BPF CO-RE).
- Node image: internal "platform-base-2026.04" with io_uring, BPF toolchain, and tuned scheduler profile.
- Rollouts: one partition replica at a time, with a 60s soak between replicas. Full cluster rollout takes approximately 45 minutes.
- OpenTelemetry SDK embedded. Spans for every RPC, with sampling at 1 percent by default, 100 percent on error.
- Prometheus exposition on /metrics, scrape interval 15s.
  - RED metrics per RPC method.
  - Raft metrics: leader changes, log lag, snapshot count, commit latency.
  - Slab metrics: utilization, fragmentation ratio, eviction rate.
  - Per-tenant counters: ops, bytes, quota usage.
- Jaeger trace export via OTLP. Cross-region trace propagation via W3C traceparent.
- Structured logs in JSON, shipped via Fluent Bit to the central log bus.
- Required dashboards (delivered as code in quorum-dashboards/):
  - Fleet overview
  - Per-cluster RED + Raft health
  - Per-tenant usage and SLO burn
  - Migration progress (Phase 1 specific)
- mTLS on all listeners. Internal CA. Cert rotation every 24 hours via SPIFFE workload identity.
- Tenant authentication via signed tenant tokens issued by the internal auth service. Tokens scoped to a tenant ID and a TTL of 1 hour.
- Authorization at the partition router: every RPC is checked against tenant ACL before being dispatched.
- At-rest: not applicable, memory-resident. RAM scrubbing on partition release (MADV_FREE followed by explicit zeroing for tenants with secure: true).
- In-flight: TLS 1.3 only. RESP3 listener requires AUTH followed by HELLO; no anonymous access.
- Audit log: every admin RPC (Tenant.*, partition rebalance, snapshot trigger) emitted to the central audit pipeline.
Phase 1 targets the 11 Redis 5.0 shards, with combined hot working set of approximately 129GB and peak 1.18M ops/sec aggregate.
Sizing:
- 6 Quorum nodes per region, 256GB RAM each, 70 percent target utilization. Total 12 nodes across two regions.
- Headroom for 40 percent organic growth over 12 months and 2x burst.
- Partition count: 1024 (subset of the 4096 ring), bound to the Phase 1 tenants.
Cost model (vs. current Redis 5.0 fleet):
- Compute: +18 percent (denser nodes but new active-active footprint).
- Memory: -22 percent (fragmentation recovery + slab efficiency).
- Operational ($/incident estimated): -60 percent (target, modeled on xreplica incident history).
Net: roughly flat infrastructure spend in Phase 1; the win is operational.
Phased per namespace. 14 namespaces in the 5.0 fleet, ordered by risk (lowest first).
- Pre-flight (week 0)
  - Deploy qproxy sidecar to a canary application pod per namespace.
  - Configure qproxy in shadow mode: every read served from Redis, also issued to Quorum, responses compared, divergences logged.
  - Run for 72 hours minimum; divergence rate must be < 0.01 percent.
- Dual-write (weeks 1 to 4, staggered)
  - Switch qproxy to dual-write. Reads still from Redis.
  - Per-key TTLs honored on both sides.
  - Quorum considered hot when 99.9 percent of expected keys present, verified by working-set sampling.
- Cutover (weeks 4 to 7, staggered)
  - Per namespace: flip reads to Quorum, writes remain dual.
  - 24-hour soak.
  - Disable Redis writes. Redis becomes read-only fallback for 7 days.
  - Decommission Redis shards for that namespace.
- Post-cutover (week 8+)
  - qproxy removed from applications; native clients (RESP3 or gRPC) talk directly to Quorum.
  - Redis 5.0 shards torn down.
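The shadow-mode comparison in the pre-flight step can be sketched as a small counter. This is illustrative only: `ShadowStats` is a hypothetical name, replies are compared by plain equality, and a real comparator would likely need to tolerate benign differences such as TTL skew or the OBJECT ENCODING delta.

```python
from dataclasses import dataclass

MAX_DIVERGENCE_RATE = 0.0001  # 0.01 percent, the threshold stated in this document


@dataclass
class ShadowStats:
    """Running tally of shadow reads for one namespace."""

    compared: int = 0
    diverged: int = 0

    def observe(self, redis_reply: object, quorum_reply: object) -> bool:
        """Record one shadowed read; return True when the replies match."""
        self.compared += 1
        same = redis_reply == quorum_reply
        if not same:
            self.diverged += 1  # a real proxy would also log the key and both replies
        return same

    @property
    def divergence_rate(self) -> float:
        return self.diverged / self.compared if self.compared else 0.0

    def passes(self) -> bool:
        """True only with at least one sample and a rate strictly below the limit."""
        return self.compared > 0 and self.divergence_rate < MAX_DIVERGENCE_RATE
```

At this threshold a namespace can tolerate fewer than 1 divergent read per 10,000 compared, so low-traffic namespaces may need the full 72 hours or longer to accumulate a meaningful sample.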
Cutover gates (all must pass per namespace before advancing):
- Divergence rate < 0.01 percent over the prior 24 hours.
- p99 latency on Quorum within 10 percent of the Redis baseline.
- No SEV-3+ incidents attributable to Quorum in prior 7 days.
- Tenant owner sign-off.
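The four gates above can be expressed as a single check that reports which ones failed. The sketch below is an illustration, not an existing tool. `GateInputs` and `failed_gates` are hypothetical names, and "within 10 percent" is read as "at most 10 percent slower than Redis", so a faster Quorum p99 passes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateInputs:
    """Per-namespace measurements needed to evaluate the cutover gates."""

    divergence_rate_24h: float      # fraction, e.g. 0.00005 for 0.005 percent
    quorum_p99_ms: float
    redis_p99_ms: float
    sev3_plus_incidents_7d: int     # incidents attributable to Quorum
    owner_signed_off: bool


def failed_gates(g: GateInputs) -> list:
    """Return the names of failed gates; an empty list means cutover may advance."""
    failures = []
    if not g.divergence_rate_24h < 0.0001:  # 0.01 percent
        failures.append("divergence")
    # Assumed reading of "within 10 percent": no more than 10 percent slower.
    if g.quorum_p99_ms > g.redis_p99_ms * 1.10:
        failures.append("p99_latency")
    if g.sev3_plus_incidents_7d > 0:
        failures.append("incidents")
    if not g.owner_signed_off:
        failures.append("owner_signoff")
    return failures
```

Returning the list of failures rather than a boolean keeps the result usable on the migration progress dashboard, where the blocking gate matters more than the overall verdict.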
Reversible at every phase up through "Disable Redis writes."
- Shadow mode: no-op rollback (Redis is still authoritative).
- Dual-write: flip qproxy reads back to Redis. Quorum data discarded for that namespace.
- Reads on Quorum: flip qproxy reads back to Redis, retain dual-write. Investigate divergence.
- Redis writes disabled: re-enable Redis writes within the 7-day fallback window. After 7 days, rollback requires a forward-fix (replay from Quorum to Redis via a one-shot tool, not yet built and not in scope for Phase 1).
Rollback authority: on-call SRE for the affected namespace, no further approvals.
- Kubernetes 1.29+ in both regions. (us-east-1 at 1.30, us-west-2 at 1.29.2. OK.)
- Linux kernel 6.1+ on all data-plane nodes. (Currently 6.1.62 on the platform-base image.)
- BPF toolchain in the node image (CO-RE, libbpf 1.4+).
- etcd 3.5+ for the control plane. (Existing platform etcd cluster, capacity confirmed.)
- Internal auth service v2 (SPIFFE workload identity). Available, GA Q1.
- OTel collector fleet. Available.
- Inter-region network: 18ms p95 RTT, 9.7 Gbps available. Confirmed with Network Eng.
- OBJECT ENCODING semantics. Two callers (Platform Search) inspect encoding strings. Either change them or expose a compatibility map. Decision needed before shadow mode.
- qproxy connection pool sizing under dual-write doubles upstream connection count. Validate that Redis 5.0 shards tolerate the increase (modeling says yes, but we have not load-tested above 1.6x).
- Snapshot streaming bandwidth. A worst-case snapshot stream is ~12GB; with the per-node 10Gbps NIC and other Raft traffic, we may need a dedicated QoS class for snapshot streams.
- Hot key replica fanout: detection threshold currently 100k ops/sec/key. Validate against production workload traces. Possible the threshold should be tenant-relative.
- RESP3 CLIENT TRACKING (server-assisted client caching) is supported in the spec but not in the Phase 1 implementation. Confirm no callers depend on it (audit indicates none, but worth a final pass).
- Eventual-mode key tag {eventual} collides with the existing Redis hash-tag convention used by 3 internal libraries to control slotting. Decide whether to use a distinct sigil (<eventual>, [[eventual]]) or change semantics.
- Phase 2 (Redis 6.2 fleet) will require streams support or a migration story for the 4 namespaces using XADD. Out of scope here but flagged for sequencing.
- Cost modeling assumes 70 percent memory utilization target. SRE Foundations has asked us to revisit at 60 percent given the active-active overhead. Open.
- RESP3 listener negotiated via HELLO 3. Clients that send HELLO 2 get a downgraded surface (no push frames, no map type). Push-frame use in Phase 1 is limited to invalidation messages for clients that opt in.
- gRPC listener supports both unary and server-streaming. Bidirectional streaming is reserved for the future pub/sub surface.
- Protobuf schema versioning: all messages tagged with schema_version. Server accepts N and N-1.
- Working set: 129GB hot.
- Per-replica overhead: 1.15x (Raft log + snapshot scratch in RAM cache).
- Per-region replication factor: 3 in home, 2 in peer = 5 total.
- Effective RAM per region for hot data: 129 * 1.15 * 3 / 2 ~= 222GB.
- Node capacity (256GB * 0.70 target util): ~179GB usable per node.
- Nodes per region: ceil(222 / 179) = 2 minimum.
- Add headroom and partition spread: 6 per region.
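The sizing arithmetic above can be reproduced with a few lines, which also makes it easy to rerun at the 60 percent utilization target that SRE Foundations has raised. The formula is copied as written in this document. The function and parameter names are hypothetical, and reading the "3 / 2" factor as home replicas divided by regions is an interpretation rather than something the document states.

```python
import math


def ram_per_region_gb(working_set_gb: float = 129, overhead: float = 1.15,
                      home_replicas: int = 3, regions: int = 2) -> float:
    """Effective RAM per region, using the formula as written in this document."""
    return working_set_gb * overhead * home_replicas / regions


def usable_per_node_gb(node_ram_gb: float = 256, target_util: float = 0.70) -> float:
    """RAM available for data on one node at the target utilization."""
    return node_ram_gb * target_util


def min_nodes_per_region(target_util: float = 0.70) -> int:
    """Minimum node count before headroom and partition spread are added."""
    return math.ceil(ram_per_region_gb() / usable_per_node_gb(target_util=target_util))
```

With the defaults this gives roughly 222.5GB per region, 179.2GB usable per node, and a minimum of 2 nodes. The 6 nodes per region in the plan then come from headroom and partition spread, not from this floor.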
- ADR-0142 Raft library selection (etcd/raft fork vs. hashicorp/raft vs. in-house). Decision: etcd/raft fork.
- ADR-0151 RESP3 compatibility shim approach.
- CAP-2024-Q3 Capacity model and assumptions.
- RFC-Quorum-Protocol-v1 Wire protocol specification.
- Incident reports IR-2024-0231, IR-2024-0712, IR-2025-0118 (xreplica incidents motivating replacement).