
Commit 6ca9e16

Add some doc about interface endpoint layout (#13281)
Generated while reviewing PR #13275. Updated to describe the techniques that apply across protocol-compatible releases (expected to be needed rarely, if ever) and the techniques (what the agent called interface compaction) that are safe only across protocol-incompatible releases. Uses the latter to fix a latent bug left on main by an earlier code removal.
1 parent bab63fd commit 6ca9e16

2 files changed

Lines changed: 78 additions & 14 deletions


design/AI-generated/subsystem_02_rpc_transport.md

Lines changed: 77 additions & 13 deletions
@@ -27,7 +27,7 @@ class Endpoint {
2727
};
2828
```
2929

30-
- **Token** = `UID` (pair of `uint64_t`). Lower 32 bits encode an index into the EndpointMap; upper 32 bits encode task priority.
30+
- **Token** = `UID` (pair of `uint64_t`). The low 32 bits of the second word are the endpoint's index into the `EndpointMap` — that is what `get()` looks up — and the map entry reuses that same field to hold the receiver's `TaskPriority`. The first word is a random base shared across an interface's contiguous block of endpoints.
3131
- **Well-known tokens**: `wellKnownToken(int id)` returns `UID(-1, id)`. Reserved IDs: `WLTOKEN_ENDPOINT_NOT_FOUND(0)`, `WLTOKEN_PING_PACKET`, `WLTOKEN_UNAUTHORIZED_ENDPOINT`, plus system services (leader election, config transactions, etc.)
3232
- **Address selection**: `choosePrimaryAddress()` swaps primary/secondary based on local TLS preference.
3333

@@ -58,7 +58,7 @@ struct Peer : ReferenceCounted<Peer> {
5858
3. Sends `ConnectPacket` (protocol version, local address, connection ID)
5959
4. Spawns `connectionWriter()` (async write loop) and `connectionReader()` (async read loop)
6060
5. `connectionMonitor()` sends periodic pings, detects timeouts
61-
6. On failure: exponential backoff (INITIAL_RECONNECTION_TIME to MAX_RECONNECTION_TIME, growth factor 1.5x)
61+
6. On failure: exponential backoff (INITIAL_RECONNECTION_TIME to MAX_RECONNECTION_TIME, growth factor 1.2x)
6262
7. `discardUnreliablePackets()` on disconnect; reliable packets resent after reconnect
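
The reconnect backoff in step 6 can be sketched as follows. This is a minimal illustration, not the `Peer` implementation: the `ReconnectKnobs` struct and `backoffSchedule` function are hypothetical names, the defaults are the values listed in the knob table of this document, and clamping with `std::min` is an assumption about how the growth factor and the maximum combine.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical container for the three reconnection knobs (values in seconds).
struct ReconnectKnobs {
    double initial = 0.05; // INITIAL_RECONNECTION_TIME
    double max = 0.5;      // MAX_RECONNECTION_TIME
    double growth = 1.2;   // RECONNECTION_TIME_GROWTH_RATE
};

// Returns the delay used before each of `failures` consecutive reconnect attempts.
// Assumption: the delay is multiplied by `growth` after every failure and clamped to `max`.
std::vector<double> backoffSchedule(const ReconnectKnobs& k, int failures) {
    std::vector<double> delays;
    double delay = k.initial;
    for (int i = 0; i < failures; ++i) {
        delays.push_back(delay);
        delay = std::min(delay * k.growth, k.max);
    }
    return delays;
}
```

Under these assumptions, `backoffSchedule(ReconnectKnobs{}, 20)` starts at 0.05s and reaches the 0.5s cap on the fourteenth attempt, since 1.2^13 is the first power of 1.2 that exceeds 10.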
6363

6464
### EndpointMap -- [`FlowTransport.cpp`](https://github.com/apple/foundationdb/blob/main/fdbrpc/FlowTransport.cpp)`:90-230`
@@ -75,7 +75,7 @@ struct Entry {
7575
- Pre-allocates slots for well-known endpoints (indices 0 to wellKnownEndpointCount-1)
7676
- Dynamic endpoints allocated from free list; doubles table size when full
7777
- `get(token)` -- O(1) lookup by token's lower 32 bits
78-
- `insert()` -- allocates from free list, encodes priority in upper 32 bits
78+
- `insert()` -- allocates from free list (single endpoint) or a contiguous block (the `streams` overload, keyed off a fresh random base UID); stores the receiver's priority in the entry token's low 32 bits
7979
- `remove()` -- returns slot to free list
8080

8181
### FlowTransport -- [`FlowTransport.h`](https://github.com/apple/foundationdb/blob/main/fdbrpc/include/fdbrpc/FlowTransport.h)`:199-315`
@@ -155,6 +155,70 @@ Stream of replies with flow control:
155155

156156
---
157157

158+
## Interface Endpoint Layout
159+
160+
Service interfaces (e.g., `CommitProxyInterface`, `GrvProxyInterface`, `StorageServerInterface`) bundle many `RequestStream<T>` channels but ship a single endpoint over the wire. The rest are reconstructed locally by adding a fixed offset to that anchor.
161+
162+
### Convention
163+
164+
Each interface picks an "anchor" stream (typically the most-used one: `commit` for the commit proxy, `getConsistentReadVersion` for the GRV proxy, `getValue` for storage servers). It is the only `RequestStream` actually serialized. All other streams are reconstructed in the `if (Archive::isDeserializing)` branch via `anchor.getEndpoint().getAdjustedEndpoint(N)`, where N is the stream's position in `initEndpoints`'s `push_back` order.
165+
166+
```cpp
167+
// CommitProxyInterface.h (excerpt)
168+
template <class Archive>
169+
void serialize(Archive& ar) {
170+
serializer(ar, processId, provisional, commit); // commit is the anchor — the only RequestStream on the wire
171+
if (Archive::isDeserializing) {
172+
legacyGetConsistentReadVersion = ...(commit.getEndpoint().getAdjustedEndpoint(1));
173+
getKeyServersLocations = ...(commit.getEndpoint().getAdjustedEndpoint(2));
174+
// ...
175+
}
176+
}
177+
178+
void initEndpoints() {
179+
std::vector<...> streams;
180+
streams.push_back(commit.getReceiver(...)); // index 0 — anchor
181+
streams.push_back(legacyGetConsistentReadVersion.getReceiver(...)); // 1
182+
streams.push_back(getKeyServersLocations.getReceiver(...)); // 2
183+
// ...
184+
FlowTransport::transport().addEndpoints(streams);
185+
}
186+
```
187+
188+
`EndpointMap::insert` allocates the registered receivers as a contiguous block keyed off a fresh random `base` UID. Stream `i` ends up at token offset `i` from the anchor, and `getAdjustedEndpoint(N)` produces the matching token. Client and server agree iff the client's deserialization index matches the server's registration order.
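
The offset arithmetic can be modeled in a few lines. The sketch below is a deliberately simplified model, not the real `Endpoint::getAdjustedEndpoint` or `EndpointMap::insert`: `ToyToken`, `adjusted`, and `registerBlock` are hypothetical names, and the only behavior modeled is "add N to the low 32 bits of the second word", as described in the token layout earlier in this document. The real code may touch other bits of the token as well.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a UID token: two 64-bit words.
struct ToyToken {
    std::uint64_t first;
    std::uint64_t second;
    bool operator==(const ToyToken& o) const { return first == o.first && second == o.second; }
};

// Simplified model of getAdjustedEndpoint(N): keep the high 32 bits of the
// second word and add N to the index held in its low 32 bits.
inline ToyToken adjusted(ToyToken anchor, std::uint32_t n) {
    std::uint32_t index = static_cast<std::uint32_t>(anchor.second) + n;
    return ToyToken{ anchor.first, (anchor.second & 0xFFFFFFFF00000000ULL) | index };
}

// Server side of the model: register `count` streams as one contiguous block,
// so stream i sits at offset i from the anchor.
inline std::vector<ToyToken> registerBlock(ToyToken anchor, std::uint32_t count) {
    std::vector<ToyToken> tokens;
    for (std::uint32_t i = 0; i < count; ++i) {
        tokens.push_back(adjusted(anchor, i));
    }
    return tokens;
}
```

In this model the client recovers stream `i` with `adjusted(anchor, i)`, and the lookup succeeds only when `i` is the position the server used when it built the block.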
189+
190+
### The endpoint layout is a wire-compatibility contract
191+
192+
The offsets look like a private, per-build implementation detail. They are not. The offset of every stream within an interface is part of the wire protocol, and it must stay identical across **every binary that runs a compatible protocol version** — not merely across binaries built from the same source tree.
193+
194+
The binaries that actually differ are `fdbserver` and `fdbclient`. A cluster's server processes are upgraded together: an upgrade is a coordinated restart that triggers a recovery, so the cluster's `fdbserver` processes always run a single build, and server-to-server interface reconstruction — for example, Ratekeeper rebuilding a `CommitProxyInterface` from the broadcast `ServerDBInfo` — is effectively same-build. Client libraries are not: they are versioned and deployed independently of the cluster and are *not* upgraded in lockstep with it. An application links `fdb_c` libraries that may have been built long before, or after, the `fdbserver` it happens to connect to.
195+
196+
What lets a client talk to that server is *protocol compatibility*, not an identical build:
197+
198+
1. **FlowTransport connects compatible peers, not identical ones.** Two protocol versions are compatible when their high 48 bits match — `isCompatible` compares `version() & compatibleProtocolVersionMask`, where `compatibleProtocolVersionMask = 0xFFFFFFFFFFFF0000` (see [`ProtocolVersion.h`](https://github.com/apple/foundationdb/blob/main/flow/ProtocolVersion.h.cmake)). The low 16 bits never affect compatibility, and patch releases of an `x.y` line are *required* to keep the same protocol version (see `cmake/ProtocolVersions.cmake`: "This version impacts both communications and the deserialization of certain database and IKeyValueStore keys"). A single compatible protocol version therefore spans many distinct builds.
199+
2. **The multi-version client selects a library by *normalized* (compatible) version.** `MultiVersionDatabase` indexes loaded client libraries by `protocolVersion.normalizedVersion()` and keeps the same connection when the cluster's protocol version changes but stays compatible (see [`subsystem_03_client_library.md`](subsystem_03_client_library.md)). The library it picks need only be *compatible* with the cluster, so it is routinely a different build than the `fdbserver` it connects to — and its compiled-in endpoint offsets must still match what that server registered.
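
The compatibility rule in item 1 amounts to a masked comparison. The sketch below works on raw 64-bit values and uses the mask quoted above; `isCompatibleRaw` and `normalizedRaw` are hypothetical helpers written for illustration, not the `ProtocolVersion` API, and the version numbers in the checks are made-up examples.

```cpp
#include <cassert>
#include <cstdint>

// Mask quoted in the text: only the high 48 bits decide compatibility.
constexpr std::uint64_t kCompatibleMask = 0xFFFFFFFFFFFF0000ULL;

// Hypothetical helper: two raw version values are compatible when their
// masked forms are equal.
constexpr bool isCompatibleRaw(std::uint64_t a, std::uint64_t b) {
    return (a & kCompatibleMask) == (b & kCompatibleMask);
}

// Hypothetical helper: the masked value, usable as a lookup key shared by
// every build in the same compatibility class.
constexpr std::uint64_t normalizedRaw(std::uint64_t v) {
    return v & kCompatibleMask;
}

// Illustrative values only. A difference in the low 16 bits is ignored...
static_assert(isCompatibleRaw(0x0FDB00B073000000ULL, 0x0FDB00B073000001ULL),
              "low 16 bits must not affect compatibility");
// ...while a difference in bit 16 or above is not.
static_assert(!isCompatibleRaw(0x0FDB00B073000000ULL, 0x0FDB00B073010000ULL),
              "high 48 bits must match");
```

Because many builds collapse to the same normalized value, nothing in this comparison can detect that two of them disagree about endpoint offsets.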
200+
201+
A client reconstructs the *entire* interface (every stream in `commitProxies`/`grvProxies`) but only sends to the streams it actually uses. So a client-facing stream carries the cross-build `fdbserver`↔`fdbclient` contract, whereas a server-only stream (one no client sends to, such as `setThrottledShard`) is exercised only on the same-build server-to-server path — its realistic failure mode is the *local* `serialize`/`initEndpoints` misalignment described below, not a cross-build mismatch.
202+
203+
Two facts about tokens are true but do *not* license repacking: tokens are not persistent (every process gets a fresh random `base` UID, so no stale token survives a restart), and recovery reissues every interface via fresh `ServerDBInfo`/`ClientDBInfo` (see [`subsystem_09_cluster_recovery.md`](subsystem_09_cluster_recovery.md)). Both keep the *anchor* token fresh — but the offsets relative to that anchor are still reconstructed on the far side from the reader's compiled-in indices, so they remain a cross-binary contract whenever a client and server are different builds.
204+
205+
### Evolving an interface safely
206+
207+
There are two invariants, one local and one global:
208+
209+
- **Local (necessary).** Within a single build, the `getAdjustedEndpoint(N)` argument in `serialize()` must equal the `push_back` position of the same stream in `initEndpoints()`. If they differ, even two *identical* binaries mis-route: the reconstructed endpoint points at a slot the server never registered, and `EndpointMap::get` returns `nullptr`.
210+
- **Global (the real contract).** The offset of each stream must be stable across all compatible binaries, per the section above.
211+
212+
From these, the only safe ways to change an interface are:
213+
214+
1. **Append new streams at the end.** Existing offsets are untouched, so an older client built against the previous layout keeps resolving every stream it knows about.
215+
2. **When removing a stream within a compatible protocol version, leave a placeholder in its slot** so every successor keeps its offset. The retained `legacyGetConsistentReadVersion` field in [`CommitProxyInterface.h`](https://github.com/apple/foundationdb/blob/main/fdbclient/include/fdbclient/CommitProxyInterface.h) is an example — a typed-but-unused `RequestStream` is enough to hold the slot. Removing a stream and letting successors shift down is a *silent* wire break for any compatible client still reconstructing the old layout.
216+
3. **Repack or compact offsets only across an incompatible protocol-version change.** A new *major* version bumps the protocol version incompatibly, so builds on either side of the boundary refuse to connect rather than mis-route. A major version change is therefore exactly when it is safe to drop accumulated placeholders and renumber an interface's endpoints densely (as long as the `serialize()` offsets and `initEndpoints()` order are renumbered together — the local invariant still applies). Doing the same *within* a compatible protocol version — for example, a patch release — is a silent wire break.
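
The three rules can be checked mechanically by comparing two layouts position by position. The sketch below is a toy model under stated assumptions: a layout is just the list of stream names in registration order, `movedStreams` is a hypothetical helper, and `newStream` in the usage note is an invented name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns the streams of `oldLayout` that an older reader would no longer
// find at the same offset in `newLayout` (moved or removed).
std::vector<std::string> movedStreams(const std::vector<std::string>& oldLayout,
                                      const std::vector<std::string>& newLayout) {
    std::vector<std::string> moved;
    for (std::size_t i = 0; i < oldLayout.size(); ++i) {
        if (i >= newLayout.size() || newLayout[i] != oldLayout[i]) {
            moved.push_back(oldLayout[i]);
        }
    }
    return moved;
}
```

Starting from `{commit, legacyGetConsistentReadVersion, getKeyServersLocations}`, appending a hypothetical `newStream` yields an empty result, as does keeping `legacyGetConsistentReadVersion` as a placeholder. Dropping it without a placeholder reports both the removed stream and `getKeyServersLocations`, which has shifted down by one.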
217+
218+
A violation of this contract fails quietly. A send to a token the receiver never registered elicits a `WLTOKEN_ENDPOINT_NOT_FOUND` reply (surfacing as an `EndpointNotFound` trace event), but fire-and-forget sends (`RequestStream::send`) observe no application-level error, so the request is simply dropped. The layout is therefore best guarded proactively, with two kinds of test. First, a unit test that round-trips each interface through `serialize`/`initEndpoints` and asserts that every reconstructed stream resolves to a registered receiver catches a violation of the local invariant immediately. Second, pinning each stream's offset across *compatible* protocol versions catches a removal that forgets to reserve a placeholder; such a test is expected to be updated when a major version bump intentionally renumbers.
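
A check for the local invariant might look like the sketch below. It assumes the two sides of an interface have already been extracted into plain data, which the real code does not provide out of the box: `registrationOrder` stands for the `push_back` order in `initEndpoints()`, `serializeOffsets` for the `getAdjustedEndpoint(N)` arguments in `serialize()` (with the anchor at 0), and `misalignedStreams` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Returns the streams whose serialize() offset is missing or differs from
// their position in the registration order.
std::vector<std::string> misalignedStreams(
    const std::vector<std::string>& registrationOrder,
    const std::map<std::string, std::uint32_t>& serializeOffsets) {
    std::vector<std::string> bad;
    for (std::uint32_t pos = 0; pos < registrationOrder.size(); ++pos) {
        const std::string& name = registrationOrder[pos];
        auto it = serializeOffsets.find(name);
        if (it == serializeOffsets.end() || it->second != pos) {
            bad.push_back(name);
        }
    }
    return bad;
}
```

An empty result means the two functions agree. A stream that is registered at one position but reconstructed with a different offset shows up by name in the result.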
219+
220+
---
221+
158222
## Wire Protocol
159223
160224
### Packet Format
@@ -206,10 +270,10 @@ class SimpleFailureMonitor : public IFailureMonitor {
206270

207271
**Detection mechanism** (`connectionMonitor()` actor):
208272
1. Sends periodic ping to `WLTOKEN_PING_PACKET`
209-
2. Waits for reply with `CONNECTION_MONITOR_TIMEOUT` (1s default)
273+
2. Waits for reply with `CONNECTION_MONITOR_TIMEOUT` (2s default; 1.5s in simulation)
210274
3. On timeout: increment count, eventually throw `connection_failed()`
211275
4. Records ping latency in `DDSketch` histogram
212-
5. After `FAILURE_DETECTION_DELAY` (5s): mark address as failed
276+
5. After `FAILURE_DETECTION_DELAY` (4s): mark address as failed
213277

214278
---
215279

@@ -313,14 +377,14 @@ Client-side request distribution:
313377

314378
| Knob | Default | Purpose |
315379
|------|---------|---------|
316-
| `INITIAL_RECONNECTION_TIME` | 0.2s | First reconnect delay |
317-
| `MAX_RECONNECTION_TIME` | 30s | Max backoff |
318-
| `RECONNECTION_TIME_GROWTH_RATE` | 1.5x | Backoff multiplier |
319-
| `CONNECTION_MONITOR_TIMEOUT` | 1s | Ping timeout |
320-
| `FAILURE_DETECTION_DELAY` | 5s | Before marking failed |
321-
| `PACKET_LIMIT` | 512MB | Max packet size |
322-
| `MIN_COALESCE_DELAY` | 0.5ms | Min buffer wait |
323-
| `MAX_COALESCE_DELAY` | 2ms | Max buffer wait |
380+
| `INITIAL_RECONNECTION_TIME` | 0.05s | First reconnect delay |
381+
| `MAX_RECONNECTION_TIME` | 0.5s | Max backoff |
382+
| `RECONNECTION_TIME_GROWTH_RATE` | 1.2x | Backoff multiplier |
383+
| `CONNECTION_MONITOR_TIMEOUT` | 2s (1.5s in sim) | Ping timeout |
384+
| `FAILURE_DETECTION_DELAY` | 4s | Before marking failed |
385+
| `PACKET_LIMIT` | 100MB | Max packet size |
386+
| `MIN_COALESCE_DELAY` | 10µs | Min buffer wait |
387+
| `MAX_COALESCE_DELAY` | 20µs | Max buffer wait |
324388

325389
---
326390

fdbclient/include/fdbclient/CommitProxyInterface.h

Lines changed: 1 addition & 1 deletion
@@ -82,7 +82,7 @@ struct CommitProxyInterface {
8282
expireIdempotencyId =
8383
PublicRequestStream<struct ExpireIdempotencyIdRequest>(commit.getEndpoint().getAdjustedEndpoint(9));
8484
setThrottledShard =
85-
RequestStream<struct SetThrottledShardRequest>(commit.getEndpoint().getAdjustedEndpoint(12));
85+
RequestStream<struct SetThrottledShardRequest>(commit.getEndpoint().getAdjustedEndpoint(10));
8686
}
8787
}
8888
