|
| 1 | +# User-Defined Cluster Replica Sizes |
| 2 | + |
| 3 | +- Associated: [PR #35971](https://github.com/MaterializeInc/materialize/pull/35971) |
| 4 | + |
| 5 | +## The Problem |
| 6 | + |
| 7 | +Cluster replica sizes in Materialize are configured exclusively through |
| 8 | +the `CLUSTER_REPLICA_SIZES` environment variable, passed as a JSON blob |
| 9 | +at startup. This creates several problems: |
| 10 | + |
| 11 | +* **Adding or changing sizes requires an environmentd restart.** Any |
| 12 | + modification to the set of available sizes -- adding a new size, |
| 13 | + adjusting resource limits, disabling an existing one -- requires |
| 14 | + updating a CLI flag and restarting environmentd. In cloud |
| 15 | + environments this means a deployment rollout; in self-managed |
| 16 | + environments it means coordinating downtime. |
| 17 | +* **No ad-hoc experimentation.** Operators cannot quickly test a new |
| 18 | + size to explore whether a different CPU/memory/worker configuration |
| 19 | + better fits a workload. Every experiment requires the full |
| 20 | + restart-and-redeploy cycle, making iterative sizing impractical. |
| 21 | +* **No durability or safety guarantees.** Sizes are held only in |
| 22 | + memory. If the env var changes between restarts, sizes can silently |
| 23 | + appear or disappear with no audit trail and no protection against |
| 24 | + removing a size that existing replicas depend on. |
| 25 | + |
| 26 | +## Success Criteria |
| 27 | + |
| 28 | +* Operators can create and remove cluster replica sizes without |
| 29 | + restarting environmentd or coordinating a deployment rollout. |
| 30 | +* Operators can inspect the full definition of a cluster replica |
| 31 | + size — including node selectors, swap configuration, and resource |
| 32 | + limits — through SQL. |
| 33 | +* Removing a size that is actively in use by a replica is prevented, |
| 34 | + avoiding silent breakage of running workloads. |
| 35 | +* Existing CLI-flag-based sizes continue to work unchanged. Operators |
| 36 | + who don't use the new DDL see no behavior difference. |
| 37 | + |
| 38 | +## Out of Scope |
| 39 | + |
| 40 | +* **ALTER CLUSTER REPLICA SIZE.** Sizes are immutable once created. |
| 41 | + To change a size definition, drop and recreate it (after migrating |
| 42 | + replicas off it). This avoids the complexity of drift between |
| 43 | + running replicas and their declared size, and avoids needing to |
| 44 | + track size history per replica. |
| 45 | + |
| 46 | +## Solution Proposal |
| 47 | + |
| 48 | +### Durable storage |
| 49 | + |
| 50 | +`ClusterReplicaSize` is added as a new durable catalog object, |
| 51 | +following the established pattern used by `NetworkPolicy`, `Cluster`, |
| 52 | +and other catalog objects. |
| 53 | + |
| 54 | +#### Identity |
| 55 | + |
| 56 | +A new `ClusterReplicaSizeId` type in `mz-repr` with `User(u64)` and |
| 57 | +`System(u64)` variants, matching `NetworkPolicyId`, `ClusterId`, |
| 58 | +`RoleId`, etc. Two allocator keys track the next ID in each namespace. |
| 59 | +Builtin sizes synced from the env var get `System` IDs; user-defined |
| 60 | +sizes created via SQL get `User` IDs. |
| 61 | + |
| 62 | +Replicas still reference sizes by name (`ManagedLocation.size` remains |
| 63 | +a `String`). The ID is catalog-internal identity only; the name is the |
| 64 | +user-facing reference. This avoids migrating the replica storage format. |
| 65 | + |
| 66 | +#### Proto definitions |
| 67 | + |
| 68 | +```rust |
| 69 | +// Key — the durable identity |
| 70 | +pub struct ClusterReplicaSizeKey { |
| 71 | + pub id: ClusterReplicaSizeId, |
| 72 | +} |
| 73 | + |
| 74 | +pub enum ClusterReplicaSizeId { |
| 75 | + System(u64), |
| 76 | + User(u64), |
| 77 | +} |
| 78 | + |
| 79 | +// Value — the allocation details and metadata |
| 80 | +pub struct ClusterReplicaSizeValue { |
| 81 | + pub name: String, |
| 82 | + pub memory_limit: Option<u64>, // bytes |
| 83 | + pub memory_request: Option<u64>, |
| 84 | + pub cpu_limit: Option<u64>, // nanocpus |
| 85 | + pub cpu_request: Option<u64>, |
| 86 | + pub disk_limit: Option<u64>, // bytes |
| 87 | + pub scale: u16, |
| 88 | + pub workers: u64, |
| 89 | + pub credits_per_hour: String, // Numeric as string |
| 90 | + pub cpu_exclusive: bool, |
| 91 | + pub is_cc: bool, |
| 92 | + pub swap_enabled: bool, |
| 93 | + pub disabled: bool, |
| 94 | + pub selectors: BTreeMap<String, String>, |
| 95 | + pub builtin: bool, |
| 96 | +} |
| 97 | +``` |
| 98 | + |
| 99 | +#### Combined durable type |
| 100 | + |
| 101 | +```rust |
| 102 | +pub struct ClusterReplicaSize { |
| 103 | + pub id: ClusterReplicaSizeId, |
| 104 | + pub name: String, |
| 105 | + pub allocation: ReplicaAllocation, |
| 106 | + pub builtin: bool, |
| 107 | +} |
| 108 | +``` |
| 109 | + |
| 110 | +Name uniqueness across the table is enforced by the `UniqueName` |
| 111 | +trait on `ClusterReplicaSizeValue`. |
| 112 | + |
| 113 | +### Builtin sync |
| 114 | + |
| 115 | +On every catalog open, `sync_builtin_cluster_replica_sizes()` reconciles |
| 116 | +the env-var `ClusterReplicaSizeMap` with durable state: |
| 117 | + |
| 118 | +* **New sizes** in the env var are inserted with `builtin: true`. |
| 119 | +* **Changed allocations** are retracted and reinserted. |
| 120 | +* **Removed sizes** are handled based on usage: |
| 121 | + * If no replica references the size, it is deleted. |
| 122 | + * If a replica still uses it, it is marked `disabled: true` with a |
| 123 | + warning log, preventing new replicas from using it while avoiding |
| 124 | + panics in `concretize_replica_location`. |
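The reconciliation rules above can be sketched as a pure planning function. This is a minimal illustration, not the code in the PR: `Allocation`, `SyncAction`, and `plan_builtin_sync` are hypothetical names, the allocation is reduced to three fields, and skipping sizes that are already disabled is an assumption made so the plan stays idempotent across restarts.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical stand-in for the allocation fields relevant to the sync.
#[derive(Clone, Debug, PartialEq, Eq)]
struct Allocation {
    memory_limit: Option<u64>,
    workers: u64,
    disabled: bool,
}

/// One change the sync would apply to durable state (illustrative only).
#[derive(Debug, PartialEq, Eq)]
enum SyncAction {
    /// Size is new in the env var: insert it with `builtin: true`.
    Insert(String),
    /// Allocation changed: retract the old row and reinsert the new one.
    Replace(String),
    /// Size was removed from the env var and no replica references it.
    Delete(String),
    /// Size was removed from the env var but a replica still uses it.
    Disable(String),
}

/// Compare the env-var sizes with the durable builtin sizes and plan changes.
/// `in_use` holds the names of sizes referenced by at least one replica.
fn plan_builtin_sync(
    env: &BTreeMap<String, Allocation>,
    durable: &BTreeMap<String, Allocation>,
    in_use: &BTreeSet<String>,
) -> Vec<SyncAction> {
    let mut actions = Vec::new();

    // New and changed sizes are driven by the env var.
    for (name, alloc) in env {
        match durable.get(name) {
            None => actions.push(SyncAction::Insert(name.clone())),
            Some(existing) if existing != alloc => {
                actions.push(SyncAction::Replace(name.clone()))
            }
            Some(_) => {}
        }
    }

    // Removed sizes are driven by what is left over in durable state.
    for (name, existing) in durable {
        if env.contains_key(name) {
            continue;
        }
        if in_use.contains(name) {
            // Assumption: an already-disabled size needs no further update.
            if !existing.disabled {
                actions.push(SyncAction::Disable(name.clone()));
            }
        } else {
            actions.push(SyncAction::Delete(name.clone()));
        }
    }

    actions
}
```

Separating the plan from its application keeps the four cases easy to test without a catalog; the real sync would apply the equivalent changes inside a catalog transaction.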
| 125 | + |
| 126 | +`CatalogState.cluster_replica_sizes` is initialized empty and populated |
| 127 | +from durable state updates (via `apply_cluster_replica_size_update`) |
| 128 | +rather than directly from config. |
| 129 | + |
| 130 | +### SQL DDL |
| 131 | + |
| 132 | +```sql |
| 133 | +CREATE CLUSTER REPLICA SIZE <name> ( |
| 134 | + CREDITS PER HOUR = '<numeric>', -- required |
| 135 | + WORKERS = <n>, -- default 1 |
| 136 | + SCALE = <n>, -- default 1 |
| 137 | + MEMORY LIMIT = '<size>', -- e.g. '4GiB', '512MiB' |
| 138 | + CPU LIMIT = '<cpu>', -- e.g. '0.5', '500m' |
| 139 | + DISK LIMIT = '<size>', |
| 140 | + CPU EXCLUSIVE = <bool>, |
| 141 | + DISABLED = <bool>, |
| 142 | + IS CC = <bool>, -- default true |
| 143 | + SWAP ENABLED = <bool>, |
| 144 | + NODE SELECTORS = '<json>' -- e.g. '{"k8s.io/arch": "arm64"}' |
| 145 | +); |
| 146 | + |
| 147 | +DROP CLUSTER REPLICA SIZE <name>; |
| 148 | +``` |
| 149 | + |
| 150 | +**Human-readable units:** |
| 151 | +* Memory/disk: `GiB`, `MiB`, `GB`, `MB`, `kB`, or raw bytes. |
| 152 | +* CPU: cores (`0.5`, `2`), millicpus (`500m`), or raw nanocpus. |
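To make the unit handling concrete, here is a small parsing sketch. It rests on assumptions the proposal does not spell out: `GiB`/`MiB` are treated as powers of 1024, `GB`/`MB`/`kB` as powers of 1000, and only whole numbers are accepted before a memory suffix. The CPU parser covers cores and millicpus only, because the text does not say how a raw nanocpu value is told apart from a core count. `parse_bytes` and `parse_nanocpus` are hypothetical names.

```rust
/// Parse a memory or disk size such as "4GiB" or "512MiB" into bytes.
/// Assumption: binary units are powers of 1024, decimal units powers of 1000.
fn parse_bytes(input: &str) -> Option<u64> {
    const UNITS: [(&str, u64); 5] = [
        ("GiB", 1 << 30),
        ("MiB", 1 << 20),
        ("GB", 1_000_000_000),
        ("MB", 1_000_000),
        ("kB", 1_000),
    ];
    let s = input.trim();
    for (suffix, factor) in UNITS.iter() {
        if let Some(number) = s.strip_suffix(*suffix) {
            // Reject overflow instead of wrapping around.
            return number.trim().parse::<u64>().ok()?.checked_mul(*factor);
        }
    }
    // No recognized suffix: treat the value as raw bytes.
    s.parse::<u64>().ok()
}

/// Parse a CPU amount given as cores ("0.5", "2") or millicpus ("500m")
/// into nanocpus. Raw nanocpu input is deliberately not handled here.
fn parse_nanocpus(input: &str) -> Option<u64> {
    let s = input.trim();
    if let Some(millis) = s.strip_suffix('m') {
        // 1 millicpu = 1_000_000 nanocpus.
        return millis.parse::<u64>().ok()?.checked_mul(1_000_000);
    }
    let cores: f64 = s.parse().ok()?;
    if !cores.is_finite() || cores < 0.0 {
        return None;
    }
    // 1 core = 1_000_000_000 nanocpus.
    Some((cores * 1e9).round() as u64)
}
```

Under these assumptions `'0.5'` and `'500m'` both map to 500,000,000 nanocpus, which matches the `cpu_limit` unit noted in the durable value.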
| 153 | + |
| 154 | +**Access control:** |
| 155 | +* Gated behind `enable_custom_cluster_replica_sizes` feature flag |
| 156 | + (default off). |
| 157 | +* `mz_system` bypasses the feature flag. |
| 158 | +* RBAC requires superuser for both CREATE and DROP. |
| 159 | +* Cannot drop builtin sizes or sizes in use by existing replicas. |
| 160 | +* Cannot create a size with a name that already exists. |
| 161 | + |
| 162 | +**Credit enforcement:** In self-managed deployments, `credits_per_hour` |
| 163 | +is automatically calculated as `(memory_limit * scale) / 1 GiB`, |
| 164 | +matching the existing behavior of env-var sizes. The user-provided |
| 165 | +`CREDITS PER HOUR` value is ignored. This ensures the DDL cannot be |
| 166 | +used to bypass license-based billing limits. Cloud deployments (where |
| 167 | +the license allows credit consumption overrides) honor the user-provided |
| 168 | +value. |
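The credit rule can be illustrated with a short sketch. It uses `f64` as a stand-in for `Numeric`, assumes a memory limit is always present, and models the license check as a plain boolean; `effective_credits_per_hour` is a hypothetical name, not a function from the PR.

```rust
const GIB: u64 = 1 << 30;

/// Decide which credit rate a newly created size ends up with.
/// Assumption: `license_allows_override` is true for cloud deployments
/// and false for self-managed ones.
fn effective_credits_per_hour(
    memory_limit_bytes: u64,
    scale: u16,
    user_provided: f64,
    license_allows_override: bool,
) -> f64 {
    if license_allows_override {
        // Cloud: honor the value given in CREDITS PER HOUR.
        user_provided
    } else {
        // Self-managed: ignore the user value and derive the rate as
        // (memory_limit * scale) / 1 GiB.
        (memory_limit_bytes as f64 * f64::from(scale)) / GIB as f64
    }
}
```

For example, a self-managed size with a 4 GiB memory limit and `SCALE = 2` comes out at 8 credits per hour regardless of the value passed in the DDL.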
| 169 | + |
| 170 | +**Immutability:** Sizes cannot be altered after creation. To change a |
| 171 | +size definition, drop it (after migrating replicas off it) and recreate |
| 172 | +with the new parameters. |
| 173 | + |
| 174 | +### System tables |
| 175 | + |
| 176 | +**`mz_catalog.mz_cluster_replica_sizes`** (existing, public): Updated |
| 177 | +via durable state updates instead of manual startup packing. Shows |
| 178 | +`size`, `processes`, `workers`, `cpu_nano_cores`, `memory_bytes`, |
| 179 | +`disk_bytes`, `credits_per_hour`. |
| 180 | + |
| 181 | +**`mz_internal.mz_cluster_replica_size_details`** (new, public): |
| 182 | +Exposes the full allocation including `cpu_exclusive`, `is_cc`, |
| 183 | +`swap_enabled`, `disabled`, `builtin`, and `node_selectors` (as JSONB). |
| 184 | + |
| 185 | +### Audit logging |
| 186 | + |
| 187 | +`ObjectType::ClusterReplicaSize` is added to the audit log enum. CREATE |
| 188 | +and DROP operations emit audit events with `EventDetails::IdNameV1`. |
| 189 | + |
| 190 | +## Alternatives |
| 191 | + |
| 192 | +### String key (name) as the catalog key |
| 193 | + |
| 194 | +An earlier iteration used the size name as the durable key directly, |
| 195 | +avoiding the need for a separate ID type. This worked but was |
| 196 | +inconsistent with every other catalog object and would have made |
| 197 | +size rename impossible without replica migration (since the name |
| 198 | +would be the primary identity). The ID approach adds minor |
| 199 | +complexity in exchange for pattern consistency and future rename |
| 200 | +support, without requiring any changes to how replicas reference |
| 201 | +sizes today. |
| 202 | + |
| 203 | +### Mutable sizes with drift tracking |
| 204 | + |
| 205 | +An alternative design would allow `ALTER CLUSTER REPLICA SIZE` and |
| 206 | +track which allocation each replica was created with. This enables |
| 207 | +live reconfiguration but introduces significant complexity: snapshot |
| 208 | +allocation at creation time, drift detection columns, history tables. |
| 209 | +The immutable approach is simpler and sufficient for the initial use |
| 210 | +case. ALTER can be added later if needed. |
| 211 | + |
| 212 | +### Storing `ReplicaAllocation` directly in the durable value |
| 213 | + |
| 214 | +The durable `ClusterReplicaSizeValue` could embed `ReplicaAllocation` |
| 215 | +directly. However, `ReplicaAllocation` contains `Numeric` (for |
| 216 | +`credits_per_hour`), which doesn't implement `Eq`/`Ord`, both required |
| 217 | +by `TableTransaction`. Storing raw `u64`/`String` fields and |
| 218 | +converting in `DurableType::from_key_value` avoids this constraint. |
| 219 | + |
| 220 | +## Open Questions |
| 221 | + |
| 222 | +* Should we actually provide this to self-managed customers? |
| 223 | +* Adding new replica sizes in cloud is tricky and requires input from finance. |
| 224 | + How do we prevent support from going wild and creating new replica sizes? |