Commit 8355c8e

jubrad and claude committed
doc: add design document for user-defined cluster replica sizes
Design document for making cluster replica sizes durable catalog objects with SQL DDL support (CREATE/DROP CLUSTER REPLICA SIZE).

Covers: problem statement, success criteria, durable storage design with ClusterReplicaSizeId, builtin sync, SQL DDL syntax, access control, credit enforcement, system tables, and alternatives considered.

Prototype: #35971

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 6737635 commit 8355c8e

1 file changed

Lines changed: 224 additions & 0 deletions

@@ -0,0 +1,224 @@
# User-Defined Cluster Replica Sizes

- Associated: [PR #35971](https://github.com/MaterializeInc/materialize/pull/35971)

## The Problem

Cluster replica sizes in Materialize are configured exclusively through
the `CLUSTER_REPLICA_SIZES` environment variable, passed as a JSON blob
at startup. This creates several problems:

* **Adding or changing sizes requires an environmentd restart.** Any
  modification to the set of available sizes -- adding a new size,
  adjusting resource limits, disabling an existing one -- requires
  updating a CLI flag and restarting environmentd. In cloud
  environments this means a deployment rollout; in self-managed
  environments it means coordinating downtime.
* **No ad-hoc experimentation.** Operators cannot quickly test a new
  size to explore whether a different CPU/memory/worker configuration
  better fits a workload. Every experiment requires the full
  restart-and-redeploy cycle, making iterative sizing impractical.
* **No durability or safety guarantees.** Sizes are held only in
  memory. If the env var changes between restarts, sizes can silently
  appear or disappear with no audit trail and no protection against
  removing a size that existing replicas depend on.

## Success Criteria

* Operators can create and remove cluster replica sizes without
  restarting environmentd or coordinating a deployment rollout.
* Operators can inspect the full definition of a cluster replica
  size (including node selectors, swap configuration, and resource
  limits) through SQL.
* Removing a size that is actively in use by a replica is prevented,
  avoiding silent breakage of running workloads.
* Existing CLI-flag-based sizes continue to work unchanged. Operators
  who don't use the new DDL see no behavior difference.

## Out of Scope

* **ALTER CLUSTER REPLICA SIZE.** Sizes are immutable once created.
  To change a size definition, drop and recreate it (after migrating
  replicas off it). This avoids the complexity of drift between
  running replicas and their declared size, and avoids needing to
  track size history per replica.

## Solution Proposal

### Durable storage

`ClusterReplicaSize` is added as a new durable catalog object,
following the established pattern used by `NetworkPolicy`, `Cluster`,
and other catalog objects.

#### Identity

A new `ClusterReplicaSizeId` type is added in `mz-repr` with `User(u64)`
and `System(u64)` variants, matching `NetworkPolicyId`, `ClusterId`,
`RoleId`, etc. Two allocator keys track the next ID in each namespace.
Builtin sizes synced from the env var get `System` IDs; user-defined
sizes created via SQL get `User` IDs.

Replicas still reference sizes by name (`ManagedLocation.size` remains
a `String`). The ID is catalog-internal identity only; the name is the
user-facing reference. This avoids migrating the replica storage format.

#### Proto definitions

```rust
// Key: the durable identity
pub struct ClusterReplicaSizeKey {
    pub id: ClusterReplicaSizeId,
}

pub enum ClusterReplicaSizeId {
    System(u64),
    User(u64),
}

// Value: the allocation details and metadata
pub struct ClusterReplicaSizeValue {
    pub name: String,
    pub memory_limit: Option<u64>,   // bytes
    pub memory_request: Option<u64>,
    pub cpu_limit: Option<u64>,      // nanocpus
    pub cpu_request: Option<u64>,
    pub disk_limit: Option<u64>,     // bytes
    pub scale: u16,
    pub workers: u64,
    pub credits_per_hour: String,    // Numeric as string
    pub cpu_exclusive: bool,
    pub is_cc: bool,
    pub swap_enabled: bool,
    pub disabled: bool,
    pub selectors: BTreeMap<String, String>,
    pub builtin: bool,
}
```
98+
99+
#### Combined durable type
100+
101+
```rust
102+
pub struct ClusterReplicaSize {
103+
pub id: ClusterReplicaSizeId,
104+
pub name: String,
105+
pub allocation: ReplicaAllocation,
106+
pub builtin: bool,
107+
}
108+
```
109+
110+
Name uniqueness across the table is enforced by the `UniqueName`
111+
trait on `ClusterReplicaSizeValue`.

### Builtin sync

On every catalog open, `sync_builtin_cluster_replica_sizes()` reconciles
the env-var `ClusterReplicaSizeMap` with durable state:

* **New sizes** in the env var are inserted with `builtin: true`.
* **Changed allocations** are retracted and reinserted.
* **Removed sizes** are handled based on usage:
  * If no replica references the size, it is deleted.
  * If a replica still uses it, it is marked `disabled: true` with a
    warning log, preventing new replicas from using it while avoiding
    panics in `concretize_replica_location`.

`CatalogState.cluster_replica_sizes` is initialized empty and populated
from durable state updates (via `apply_cluster_replica_size_update`)
rather than directly from config.
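
The reconciliation rules above can be sketched roughly as follows. This is a simplified illustration, not the actual `sync_builtin_cluster_replica_sizes()` implementation: `SizeRecord`, `sync_builtin_sizes`, the opaque `allocation` string, and keying the durable map by name are all assumptions made to keep the example short. It also assumes that user-defined sizes are never touched by the sync, which the design implies but does not spell out.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Simplified stand-in for a durable size record (hypothetical type).
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct SizeRecord {
    /// Opaque placeholder for the full allocation details.
    pub allocation: String,
    pub builtin: bool,
    pub disabled: bool,
}

/// Reconciles env-var sizes with durable state (hypothetical helper).
/// `in_use` holds the names of sizes referenced by existing replicas.
pub fn sync_builtin_sizes(
    env_sizes: &BTreeMap<String, String>,
    durable: &mut BTreeMap<String, SizeRecord>,
    in_use: &BTreeSet<String>,
) {
    // New and changed sizes: insert, or retract and reinsert.
    for (name, allocation) in env_sizes {
        let desired = SizeRecord {
            allocation: allocation.clone(),
            builtin: true,
            disabled: false,
        };
        match durable.get(name) {
            // Already up to date: nothing to write.
            Some(existing) if *existing == desired => {}
            // Assumption: a user-defined size with the same name is left alone.
            Some(existing) if !existing.builtin => {}
            // Missing or changed: overwrite the durable record.
            _ => {
                durable.insert(name.clone(), desired);
            }
        }
    }

    // Removed sizes: builtin records that no longer appear in the env var.
    let removed: Vec<String> = durable
        .iter()
        .filter(|(name, record)| record.builtin && !env_sizes.contains_key(*name))
        .map(|(name, _)| name.clone())
        .collect();
    for name in removed {
        if in_use.contains(&name) {
            // Still referenced by a replica: keep the record but disable it.
            if let Some(record) = durable.get_mut(&name) {
                record.disabled = true;
            }
        } else {
            durable.remove(&name);
        }
    }
}
```

Collecting the removed names before mutating the map keeps the borrow checker happy and makes the delete-versus-disable decision easy to read.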

### SQL DDL

```sql
CREATE CLUSTER REPLICA SIZE <name> (
    CREDITS PER HOUR = '<numeric>',  -- required
    WORKERS = <n>,                   -- default 1
    SCALE = <n>,                     -- default 1
    MEMORY LIMIT = '<size>',         -- e.g. '4GiB', '512MiB'
    CPU LIMIT = '<cpu>',             -- e.g. '0.5', '500m'
    DISK LIMIT = '<size>',
    CPU EXCLUSIVE = <bool>,
    DISABLED = <bool>,
    IS CC = <bool>,                  -- default true
    SWAP ENABLED = <bool>,
    NODE SELECTORS = '<json>'        -- e.g. '{"k8s.io/arch": "arm64"}'
);

DROP CLUSTER REPLICA SIZE <name>;
```

**Human-readable units:**
* Memory/disk: `GiB`, `MiB`, `GB`, `MB`, `kB`, or raw bytes.
* CPU: cores (`0.5`, `2`), millicpus (`500m`), or raw nanocpus.
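
As a rough illustration of the unit handling, the following sketch converts the listed suffixes to bytes and nanocpus. The function names `parse_bytes` and `parse_nanocpus` are hypothetical. It assumes `GB`/`MB`/`kB` are decimal multiples and `GiB`/`MiB` are binary multiples, and it accepts only whole-number sizes. The design does not say how a bare integer is distinguished between cores and raw nanocpus, so this sketch treats every unsuffixed CPU value as cores.

```rust
/// Parses a memory or disk size such as "4GiB", "512MiB", "2GB", or "1024"
/// (raw bytes) into bytes. Returns `None` for unknown units or overflow.
pub fn parse_bytes(input: &str) -> Option<u64> {
    let s = input.trim();
    // Split at the first character that is not an ASCII digit.
    let split = s.find(|c: char| !c.is_ascii_digit()).unwrap_or(s.len());
    let (digits, unit) = s.split_at(split);
    let value: u64 = digits.parse().ok()?;
    let multiplier: u64 = match unit.trim() {
        "" => 1,
        // Assumption: decimal (SI) multiples.
        "kB" => 1_000,
        "MB" => 1_000_000,
        "GB" => 1_000_000_000,
        // Assumption: binary (IEC) multiples.
        "MiB" => 1 << 20,
        "GiB" => 1 << 30,
        _ => return None,
    };
    value.checked_mul(multiplier)
}

/// Parses a CPU amount into nanocpus: "500m" is millicpus, while "0.5" or
/// "2" is a number of cores (assumption: bare numbers are always cores).
pub fn parse_nanocpus(input: &str) -> Option<u64> {
    let s = input.trim();
    if let Some(milli) = s.strip_suffix('m') {
        let m: u64 = milli.parse().ok()?;
        return m.checked_mul(1_000_000);
    }
    let cores: f64 = s.parse().ok()?;
    if !cores.is_finite() || cores < 0.0 {
        return None;
    }
    Some((cores * 1e9).round() as u64)
}
```

Under these assumptions `'0.5'` and `'500m'` both resolve to 500,000,000 nanocpus, and `'4GiB'` resolves to 4,294,967,296 bytes.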

**Access control:**
* Gated behind `enable_custom_cluster_replica_sizes` feature flag
  (default off).
* `mz_system` bypasses the feature flag.
* RBAC requires superuser for both CREATE and DROP.
* Cannot drop builtin sizes or sizes in use by existing replicas.
* Cannot create a size with a name that already exists.
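
The checks that apply to `DROP CLUSTER REPLICA SIZE` could be combined along these lines. This is only a sketch: `Session`, `DropSizeError`, and `check_drop_size` are hypothetical names, and the order in which the checks run is an assumption rather than something the design specifies.

```rust
use std::collections::BTreeSet;

/// Reasons a drop could be rejected (hypothetical names).
#[derive(Debug, PartialEq, Eq)]
pub enum DropSizeError {
    FeatureDisabled,
    NotSuperuser,
    BuiltinSize,
    SizeInUse,
}

/// Minimal stand-in for the session state the checks need (hypothetical).
pub struct Session {
    pub is_mz_system: bool,
    pub is_superuser: bool,
}

/// Validates a drop request against the access-control rules listed above.
pub fn check_drop_size(
    session: &Session,
    feature_flag_enabled: bool,
    size_name: &str,
    size_is_builtin: bool,
    sizes_in_use: &BTreeSet<String>,
) -> Result<(), DropSizeError> {
    // The feature flag gates the DDL, except for `mz_system`.
    if !feature_flag_enabled && !session.is_mz_system {
        return Err(DropSizeError::FeatureDisabled);
    }
    // RBAC: only superusers may create or drop sizes.
    if !session.is_superuser {
        return Err(DropSizeError::NotSuperuser);
    }
    // Builtin sizes are owned by the env-var sync, not by SQL.
    if size_is_builtin {
        return Err(DropSizeError::BuiltinSize);
    }
    // Dropping a size that a replica references would break that replica.
    if sizes_in_use.contains(size_name) {
        return Err(DropSizeError::SizeInUse);
    }
    Ok(())
}
```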

**Credit enforcement:** In self-managed deployments, `credits_per_hour`
is automatically calculated as `(memory_limit * scale) / 1 GiB`,
matching the existing behavior of env-var sizes. The user-provided
`CREDITS PER HOUR` value is ignored. This ensures the DDL cannot be
used to bypass license-based billing limits. Cloud deployments (where
the license allows credit consumption overrides) honor the user-provided
value.
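
A worked version of the credit rule, as a sketch: the function name `effective_credits_per_hour` and the `license_allows_override` parameter are hypothetical, `f64` stands in for the catalog's `Numeric` type, and returning `None` when no memory limit is set is an assumption, since the design does not describe that case.

```rust
/// Bytes in one GiB.
pub const GIB: u64 = 1 << 30;

/// Returns the credits per hour that would apply to a new size.
pub fn effective_credits_per_hour(
    memory_limit_bytes: Option<u64>,
    scale: u16,
    user_provided: f64,
    license_allows_override: bool,
) -> Option<f64> {
    // Cloud-style licenses honor the value from `CREDITS PER HOUR`.
    if license_allows_override {
        return Some(user_provided);
    }
    // Otherwise: (memory_limit * scale) / 1 GiB, ignoring the user value.
    let bytes = memory_limit_bytes?;
    Some(bytes as f64 * f64::from(scale) / GIB as f64)
}
```

For example, a size with a 4 GiB memory limit and `SCALE = 2` is charged 8 credits per hour in a self-managed deployment regardless of the `CREDITS PER HOUR` value supplied.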

**Immutability:** Sizes cannot be altered after creation. To change a
size definition, drop it (after migrating replicas off it) and recreate
with the new parameters.

### System tables

**`mz_catalog.mz_cluster_replica_sizes`** (existing, public): Updated
via durable state updates instead of manual startup packing. Shows
`size`, `processes`, `workers`, `cpu_nano_cores`, `memory_bytes`,
`disk_bytes`, `credits_per_hour`.

**`mz_internal.mz_cluster_replica_size_details`** (new, public):
Exposes the full allocation including `cpu_exclusive`, `is_cc`,
`swap_enabled`, `disabled`, `builtin`, and `node_selectors` (as JSONB).

### Audit logging

`ObjectType::ClusterReplicaSize` is added to the audit log enum. CREATE
and DROP operations emit audit events with `EventDetails::IdNameV1`.

## Alternatives

### String key (name) as the catalog key

An earlier iteration used the size name as the durable key directly,
avoiding the need for a separate ID type. This worked but was
inconsistent with every other catalog object and would have made
size rename impossible without replica migration (since the name
would be the primary identity). The ID approach adds minor
complexity in exchange for pattern consistency and future rename
support, without requiring any changes to how replicas reference
sizes today.

### Mutable sizes with drift tracking

An alternative design would allow `ALTER CLUSTER REPLICA SIZE` and
track which allocation each replica was created with. This enables
live reconfiguration but introduces significant complexity: snapshot
allocation at creation time, drift detection columns, history tables.
The immutable approach is simpler and sufficient for the initial use
case. ALTER can be added later if needed.

### Storing `ReplicaAllocation` directly in the durable value

The durable `ClusterReplicaSizeValue` could embed `ReplicaAllocation`
directly. However, `ReplicaAllocation` contains `Numeric` (for
`credits_per_hour`) which doesn't implement `Eq`/`Ord`, both required
by `TableTransaction`. Storing raw `u64`/`String` fields and
converting in `DurableType::from_key_value` avoids this constraint.
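
The constraint can be illustrated with a reduced example. Here `f64` stands in for `Numeric` (it likewise implements neither `Eq` nor `Ord`), and `DurableSizeValue`, `Allocation`, `to_durable`, and `from_durable` are hypothetical names with most fields omitted. This is a sketch of the pattern only, not the actual `DurableType` implementation.

```rust
/// Durable form: every field implements `Eq` and `Ord`, so the derives
/// compile and the value can be stored in ordered collections.
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct DurableSizeValue {
    pub name: String,
    pub memory_limit: Option<u64>,
    /// The credit rate serialized as a string.
    pub credits_per_hour: String,
}

/// In-memory form: `f64` prevents deriving `Eq` or `Ord` here.
#[derive(Clone, Debug, PartialEq)]
pub struct Allocation {
    pub memory_limit: Option<u64>,
    pub credits_per_hour: f64,
}

/// Converts the in-memory form into the durable form.
pub fn to_durable(name: &str, allocation: &Allocation) -> DurableSizeValue {
    DurableSizeValue {
        name: name.to_string(),
        memory_limit: allocation.memory_limit,
        credits_per_hour: allocation.credits_per_hour.to_string(),
    }
}

/// Converts the durable form back, parsing the stored credit rate.
pub fn from_durable(
    value: &DurableSizeValue,
) -> Result<(String, Allocation), std::num::ParseFloatError> {
    let credits_per_hour = value.credits_per_hour.parse::<f64>()?;
    Ok((
        value.name.clone(),
        Allocation {
            memory_limit: value.memory_limit,
            credits_per_hour,
        },
    ))
}
```

The parsing cost is paid once when durable state is loaded, while the stored value keeps the ordering traits that the transaction layer requires.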

## Open Questions

* Should we actually provide this to self-managed customers?
* Adding new replica sizes in cloud is tricky and requires input from finance.
  How do we prevent support from going wild and creating new replica sizes?
