90 changes: 77 additions & 13 deletions design/AI-generated/subsystem_02_rpc_transport.md
@@ -27,7 +27,7 @@ class Endpoint {
};
```

- **Token** = `UID` (pair of `uint64_t`). Lower 32 bits encode an index into the EndpointMap; upper 32 bits encode task priority.
- **Token** = `UID` (pair of `uint64_t`). The low 32 bits of the second word are the endpoint's index into the `EndpointMap` — that is what `get()` looks up — and the map entry reuses that same field to hold the receiver's `TaskPriority`. The first word is a random base shared across an interface's contiguous block of endpoints.
- **Well-known tokens**: `wellKnownToken(int id)` returns `UID(-1, id)`. Reserved IDs: `WLTOKEN_ENDPOINT_NOT_FOUND(0)`, `WLTOKEN_PING_PACKET`, `WLTOKEN_UNAUTHORIZED_ENDPOINT`, plus system services (leader election, config transactions, etc.)
- **Address selection**: `choosePrimaryAddress()` swaps primary/secondary based on local TLS preference.
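
The token layout described above can be sketched as follows. This is a minimal illustration, not the FoundationDB `UID` class: `SketchUID`, `endpointIndex`, and `sketchWellKnownToken` are hypothetical names, and the sketch assumes only what the bullets state (two 64-bit words, index in the low 32 bits of the second word, well-known tokens shaped like `UID(-1, id)`).

```cpp
#include <cstdint>

// Hypothetical stand-in for a UID: a pair of 64-bit words.
struct SketchUID {
    uint64_t first;
    uint64_t second;
};

// Assumption from the text above: the endpoint's index into the map
// is carried in the low 32 bits of the second word.
inline uint32_t endpointIndex(const SketchUID& token) {
    return static_cast<uint32_t>(token.second & 0xFFFFFFFFull);
}

// Mirrors the described shape of wellKnownToken(id): UID(-1, id).
// As an unsigned 64-bit value, -1 is all ones.
inline SketchUID sketchWellKnownToken(uint32_t id) {
    return SketchUID{ UINT64_MAX, id };
}
```

Under these assumptions a well-known token's index is simply its ID, which is consistent with well-known endpoints occupying the first slots of the map.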

@@ -58,7 +58,7 @@ struct Peer : ReferenceCounted<Peer> {
3. Sends `ConnectPacket` (protocol version, local address, connection ID)
4. Spawns `connectionWriter()` (async write loop) and `connectionReader()` (async read loop)
5. `connectionMonitor()` sends periodic pings, detects timeouts
6. On failure: exponential backoff (INITIAL_RECONNECTION_TIME to MAX_RECONNECTION_TIME, growth factor 1.5x)
6. On failure: exponential backoff (INITIAL_RECONNECTION_TIME to MAX_RECONNECTION_TIME, growth factor 1.2x)
7. `discardUnreliablePackets()` on disconnect; reliable packets resent after reconnect
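
The backoff in step 6 can be sketched as a capped geometric sequence. This is an illustration under stated assumptions, not the code in `FlowTransport.cpp`: `ReconnectKnobs`, `nextReconnectDelay`, and `backoffSchedule` are hypothetical names, and the defaults are the values from the updated side of this diff (0.05s initial, 0.5s cap, 1.2x growth).

```cpp
#include <algorithm>
#include <vector>

// Hypothetical knob bundle; the values are assumed from this document.
struct ReconnectKnobs {
    double initial = 0.05; // INITIAL_RECONNECTION_TIME, seconds
    double max = 0.5;      // MAX_RECONNECTION_TIME, seconds
    double growth = 1.2;   // RECONNECTION_TIME_GROWTH_RATE
};

// Multiply the current delay by the growth rate, never exceeding the cap.
inline double nextReconnectDelay(double current, const ReconnectKnobs& k) {
    return std::min(current * k.growth, k.max);
}

// Delays that would be applied after each of `failures` consecutive failures.
inline std::vector<double> backoffSchedule(int failures,
                                           const ReconnectKnobs& k = ReconnectKnobs()) {
    std::vector<double> delays;
    double d = k.initial;
    for (int i = 0; i < failures; ++i) {
        delays.push_back(d);
        d = nextReconnectDelay(d, k);
    }
    return delays;
}
```

With these values the delay starts at 0.05s and reaches the 0.5s cap after roughly a dozen consecutive failures, then stays there.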

### EndpointMap -- [`FlowTransport.cpp`](https://github.com/apple/foundationdb/blob/main/fdbrpc/FlowTransport.cpp)`:90-230`
@@ -75,7 +75,7 @@ struct Entry {
- Pre-allocates slots for well-known endpoints (indices 0 to wellKnownEndpointCount-1)
- Dynamic endpoints allocated from free list; doubles table size when full
- `get(token)` -- O(1) lookup by token's lower 32 bits
- `insert()` -- allocates from free list, encodes priority in upper 32 bits
- `insert()` -- allocates from free list (single endpoint) or a contiguous block (the `streams` overload, keyed off a fresh random base UID); stores the receiver's priority in the entry token's low 32 bits
- `remove()` -- returns slot to free list
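
The slot management described in the bullets above can be sketched as a vector with an intrusive free list. This is a simplified model, not `EndpointMap` itself: `SketchEndpointMap` and `kNoSlot` are hypothetical, receivers are plain `void*`, a null receiver is assumed to mean "free", and the priority and block-allocation details are left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint32_t kNoSlot = UINT32_MAX; // hypothetical "end of free list" marker

class SketchEndpointMap {
public:
    // Slots [0, wellKnownCount) are reserved and never enter the free list.
    explicit SketchEndpointMap(uint32_t wellKnownCount)
      : slots_(wellKnownCount), wellKnownCount_(wellKnownCount) {}

    // Take a slot from the free list, doubling the table first if none is free.
    uint32_t insert(void* receiver) {
        if (firstFree_ == kNoSlot) grow();
        uint32_t index = firstFree_;
        firstFree_ = slots_[index].nextFree;
        slots_[index].receiver = receiver;
        slots_[index].nextFree = kNoSlot;
        return index;
    }

    // O(1) lookup by index; nullptr for unknown or freed slots.
    void* get(uint32_t index) const {
        return index < slots_.size() ? slots_[index].receiver : nullptr;
    }

    // Return a dynamic slot to the free list.
    void remove(uint32_t index) {
        if (index < wellKnownCount_ || index >= slots_.size()) return;
        if (slots_[index].receiver == nullptr) return; // already free
        slots_[index].receiver = nullptr;
        slots_[index].nextFree = firstFree_;
        firstFree_ = index;
    }

    std::size_t capacity() const { return slots_.size(); }

private:
    struct Slot {
        void* receiver = nullptr;
        uint32_t nextFree = kNoSlot;
    };

    // Double the table and chain the new slots into the free list.
    void grow() {
        uint32_t oldSize = static_cast<uint32_t>(slots_.size());
        uint32_t newSize = oldSize == 0 ? 1 : oldSize * 2;
        slots_.resize(newSize);
        for (uint32_t i = oldSize; i < newSize; ++i)
            slots_[i].nextFree = (i + 1 < newSize) ? i + 1 : kNoSlot;
        firstFree_ = oldSize;
    }

    std::vector<Slot> slots_;
    uint32_t wellKnownCount_;
    uint32_t firstFree_ = kNoSlot;
};
```

In this model a removed slot is the next one handed out, so indices are reused rather than growing without bound.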

### FlowTransport -- [`FlowTransport.h`](https://github.com/apple/foundationdb/blob/main/fdbrpc/include/fdbrpc/FlowTransport.h)`:199-315`
@@ -155,6 +155,70 @@ Stream of replies with flow control:

---

## Interface Endpoint Layout

Service interfaces (e.g., `CommitProxyInterface`, `GrvProxyInterface`, `StorageServerInterface`) bundle many `RequestStream<T>` channels but ship a single endpoint over the wire. The rest are reconstructed locally by adding a fixed offset to that anchor.

### Convention

Each interface picks an "anchor" stream (typically the most-used one: `commit` for the commit proxy, `getConsistentReadVersion` for the GRV proxy, `getValue` for storage servers). It is the only `RequestStream` actually serialized. All other streams are reconstructed in the `if (Archive::isDeserializing)` branch via `anchor.getEndpoint().getAdjustedEndpoint(N)`, where N is the stream's position in `initEndpoints`'s `push_back` order.

```cpp
// CommitProxyInterface.h (excerpt)
template <class Archive>
void serialize(Archive& ar) {
    serializer(ar, processId, provisional, commit); // commit is the anchor: the only RequestStream on the wire
    if (Archive::isDeserializing) {
        legacyGetConsistentReadVersion = ...(commit.getEndpoint().getAdjustedEndpoint(1));
        getKeyServersLocations = ...(commit.getEndpoint().getAdjustedEndpoint(2));
        // ...
    }
}

void initEndpoints() {
    std::vector<...> streams;
    streams.push_back(commit.getReceiver(...)); // index 0: anchor
    streams.push_back(legacyGetConsistentReadVersion.getReceiver(...)); // 1
    streams.push_back(getKeyServersLocations.getReceiver(...)); // 2
    // ...
    FlowTransport::transport().addEndpoints(streams);
}
```

`EndpointMap::insert` allocates the registered receivers as a contiguous block keyed off a fresh random `base` UID. Stream `i` ends up at token offset `i` from the anchor, and `getAdjustedEndpoint(N)` produces the matching token. Client and server agree iff the client's deserialization index matches the server's registration order.
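
The agreement between registration order and deserialization index can be modeled on indices alone. The sketch below is an assumption-laden simplification: `SketchBlock`, `resolve`, and `adjustedIndex` are hypothetical names, streams are represented by their names, and only the index arithmetic is shown (the rest of the token is ignored).

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Server side: a contiguous block of registered streams.
struct SketchBlock {
    uint32_t baseIndex;               // index assigned to the anchor (stream 0)
    std::vector<std::string> streams; // names in registration (push_back) order
};

// Server side: which stream, if any, is registered at an absolute index.
inline const std::string* resolve(const SketchBlock& block, uint32_t index) {
    if (index < block.baseIndex) return nullptr;
    uint32_t offset = index - block.baseIndex;
    return offset < block.streams.size() ? &block.streams[offset] : nullptr;
}

// Client side: index-only analogue of getAdjustedEndpoint(N) on the anchor.
inline uint32_t adjustedIndex(uint32_t anchorIndex, uint32_t n) {
    return anchorIndex + n;
}
```

In this model, an offset that matches the registration position resolves to the intended stream, while an offset past the end of the block resolves to nothing.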

### The endpoint layout is a wire-compatibility contract

The offsets look like a private, per-build implementation detail. They are not. The offset of every stream within an interface is part of the wire protocol, and it must stay identical across **every binary that runs a compatible protocol version** — not merely across binaries built from the same source tree.

The builds that actually differ are `fdbserver` and the client library (`fdbclient`, which applications load as `fdb_c`). A cluster's server processes are upgraded together: an upgrade is a coordinated restart that triggers a recovery, so the cluster's `fdbserver` processes always run a single build, and server-to-server interface reconstruction (for example, Ratekeeper rebuilding a `CommitProxyInterface` from the broadcast `ServerDBInfo`) is effectively same-build. Client libraries are not: they are versioned and deployed independently of the cluster and are *not* upgraded in lockstep with it. An application links an `fdb_c` library that may have been built long before, or after, the `fdbserver` it happens to connect to.

What lets a client talk to that server is *protocol compatibility*, not an identical build:

1. **FlowTransport connects compatible peers, not identical ones.** Two protocol versions are compatible when their high 48 bits match — `isCompatible` compares `version() & compatibleProtocolVersionMask`, where `compatibleProtocolVersionMask = 0xFFFFFFFFFFFF0000` (see [`ProtocolVersion.h`](https://github.com/apple/foundationdb/blob/main/flow/ProtocolVersion.h.cmake)). The low 16 bits never affect compatibility, and patch releases of an `x.y` line are *required* to keep the same protocol version (see `cmake/ProtocolVersions.cmake`: "This version impacts both communications and the deserialization of certain database and IKeyValueStore keys"). A single compatible protocol version therefore spans many distinct builds.
2. **The multi-version client selects a library by *normalized* (compatible) version.** `MultiVersionDatabase` indexes loaded client libraries by `protocolVersion.normalizedVersion()` and keeps the same connection when the cluster's protocol version changes but stays compatible (see [`subsystem_03_client_library.md`](subsystem_03_client_library.md)). The library it picks need only be *compatible* with the cluster, so it is routinely a different build than the `fdbserver` it connects to — and its compiled-in endpoint offsets must still match what that server registered.
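
The compatibility rule in item 1 reduces to a masked comparison. The sketch below uses the mask value quoted above; `normalized` and `isCompatibleSketch` are hypothetical free functions standing in for the `ProtocolVersion` methods, and the version constants are made-up values chosen only to exercise the mask.

```cpp
#include <cstdint>

// Mask quoted in the text: the high 48 bits decide compatibility.
constexpr uint64_t kCompatibleProtocolVersionMask = 0xFFFFFFFFFFFF0000ull;

// Drop the low 16 bits, which never affect compatibility.
constexpr uint64_t normalized(uint64_t version) {
    return version & kCompatibleProtocolVersionMask;
}

// Two versions are compatible when their normalized forms are equal.
constexpr bool isCompatibleSketch(uint64_t a, uint64_t b) {
    return normalized(a) == normalized(b);
}

// Illustrative, made-up version values.
static_assert(isCompatibleSketch(0x0123456789AB0000ull, 0x0123456789AB0007ull),
              "differences confined to the low 16 bits stay compatible");
static_assert(!isCompatibleSketch(0x0123456789AB0000ull, 0x0123456789AC0000ull),
              "a difference in the high 48 bits is incompatible");
```

Indexing client libraries by the normalized value, as item 2 describes, means two builds that differ only in the low 16 bits map to the same key.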

A client reconstructs the *entire* interface (every stream in `commitProxies`/`grvProxies`) but only sends to the streams it actually uses. So a client-facing stream carries the cross-build `fdbserver`↔`fdbclient` contract, whereas a server-only stream (one no client sends to, such as `setThrottledShard`) is exercised only on the same-build server-to-server path — its realistic failure mode is the *local* `serialize`/`initEndpoints` misalignment described below, not a cross-build mismatch.

Two facts about tokens are true but do *not* license repacking: tokens are not persistent (every process gets a fresh random `base` UID, so no stale token survives a restart), and recovery reissues every interface via fresh `ServerDBInfo`/`ClientDBInfo` (see [`subsystem_09_cluster_recovery.md`](subsystem_09_cluster_recovery.md)). Both keep the *anchor* token fresh — but the offsets relative to that anchor are still reconstructed on the far side from the reader's compiled-in indices, so they remain a cross-binary contract whenever a client and server are different builds.

### Evolving an interface safely

There are two invariants, one local and one global:

- **Local (necessary).** Within a single build, the `getAdjustedEndpoint(N)` argument in `serialize()` must equal the `push_back` position of the same stream in `initEndpoints()`. If they differ, even two *identical* binaries mis-route: the reconstructed endpoint points at a slot the server never registered, and `EndpointMap::get` returns `nullptr`.
- **Global (the real contract).** The offset of each stream must be stable across all compatible binaries, per the section above.

From these, the only safe ways to change an interface are:

1. **Append new streams at the end.** Existing offsets are untouched, so an older client built against the previous layout keeps resolving every stream it knows about.
2. **When removing a stream within a compatible protocol version, leave a placeholder in its slot** so every successor keeps its offset. The retained `legacyGetConsistentReadVersion` field in [`CommitProxyInterface.h`](https://github.com/apple/foundationdb/blob/main/fdbclient/include/fdbclient/CommitProxyInterface.h) is an example — a typed-but-unused `RequestStream` is enough to hold the slot. Removing a stream and letting successors shift down is a *silent* wire break for any compatible client still reconstructing the old layout.
3. **Repack or compact offsets only across an incompatible protocol-version change.** A new *major* version bumps the protocol version incompatibly, so builds on either side of the boundary refuse to connect rather than mis-route. A major version change is therefore exactly when it is safe to drop accumulated placeholders and renumber an interface's endpoints densely (as long as the `serialize()` offsets and `initEndpoints()` order are renumbered together — the local invariant still applies). Doing the same *within* a compatible protocol version — for example, a patch release — is a silent wire break.

A violation of this contract fails quietly. A send to a token the receiver never registered elicits a `WLTOKEN_ENDPOINT_NOT_FOUND` reply (surfacing as an `EndpointNotFound` trace event), but fire-and-forget sends (`RequestStream::send`) observe no application-level error, so the request is simply dropped. The layout is therefore best guarded proactively, with two tests. The first, for each interface, round-trips through `serialize`/`initEndpoints` and asserts that every reconstructed stream resolves to a registered receiver; it catches a violation of the local invariant immediately. The second pins each stream's offset across *compatible* protocol versions; it catches a removal that forgets to reserve a placeholder, and is expected to be updated when a major version bump intentionally renumbers.
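
A check of the local invariant could look roughly like the sketch below. It is not tied to any real test harness: `StreamLayout` and `findMisalignedStreams` are hypothetical, and the layout and registration order are assumed to be written out by hand or extracted by some other means, since the sketch does not inspect real interfaces.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One stream as seen by serialize(): its name and the offset passed
// to getAdjustedEndpoint for it (0 for the anchor).
struct StreamLayout {
    std::string name;
    std::size_t serializeOffset;
};

// registrationOrder[i] is the name of the stream pushed at position i
// in initEndpoints(). Returns the names whose offset does not match.
inline std::vector<std::string> findMisalignedStreams(
    const std::vector<StreamLayout>& layout,
    const std::vector<std::string>& registrationOrder) {
    std::vector<std::string> misaligned;
    for (const StreamLayout& s : layout) {
        bool ok = s.serializeOffset < registrationOrder.size() &&
                  registrationOrder[s.serializeOffset] == s.name;
        if (!ok) misaligned.push_back(s.name);
    }
    return misaligned;
}
```

A test would assert that the returned list is empty. An offset that points past the registered block, or at a different stream's slot, shows up by name instead of as a silently dropped request.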

---

## Wire Protocol

### Packet Format
@@ -206,10 +270,10 @@ class SimpleFailureMonitor : public IFailureMonitor {

**Detection mechanism** (`connectionMonitor()` actor):
1. Sends periodic ping to `WLTOKEN_PING_PACKET`
2. Waits for reply with `CONNECTION_MONITOR_TIMEOUT` (1s default)
2. Waits for reply with `CONNECTION_MONITOR_TIMEOUT` (2s default; 1.5s in simulation)
3. On timeout: increment count, eventually throw `connection_failed()`
4. Records ping latency in `DDSketch` histogram
5. After `FAILURE_DETECTION_DELAY` (5s): mark address as failed
5. After `FAILURE_DETECTION_DELAY` (4s): mark address as failed

---

@@ -313,14 +377,14 @@ Client-side request distribution:

| Knob | Default | Purpose |
|------|---------|---------|
| `INITIAL_RECONNECTION_TIME` | 0.2s | First reconnect delay |
| `MAX_RECONNECTION_TIME` | 30s | Max backoff |
| `RECONNECTION_TIME_GROWTH_RATE` | 1.5x | Backoff multiplier |
| `CONNECTION_MONITOR_TIMEOUT` | 1s | Ping timeout |
| `FAILURE_DETECTION_DELAY` | 5s | Before marking failed |
| `PACKET_LIMIT` | 512MB | Max packet size |
| `MIN_COALESCE_DELAY` | 0.5ms | Min buffer wait |
| `MAX_COALESCE_DELAY` | 2ms | Max buffer wait |
| `INITIAL_RECONNECTION_TIME` | 0.05s | First reconnect delay |
| `MAX_RECONNECTION_TIME` | 0.5s | Max backoff |
| `RECONNECTION_TIME_GROWTH_RATE` | 1.2x | Backoff multiplier |
| `CONNECTION_MONITOR_TIMEOUT` | 2s (1.5s in sim) | Ping timeout |
| `FAILURE_DETECTION_DELAY` | 4s | Before marking failed |
| `PACKET_LIMIT` | 100MB | Max packet size |
| `MIN_COALESCE_DELAY` | 10µs | Min buffer wait |
| `MAX_COALESCE_DELAY` | 20µs | Max buffer wait |

---

2 changes: 1 addition & 1 deletion fdbclient/include/fdbclient/CommitProxyInterface.h
@@ -82,7 +82,7 @@ struct CommitProxyInterface {
expireIdempotencyId =
PublicRequestStream<struct ExpireIdempotencyIdRequest>(commit.getEndpoint().getAdjustedEndpoint(9));
setThrottledShard =
RequestStream<struct SetThrottledShardRequest>(commit.getEndpoint().getAdjustedEndpoint(12));
RequestStream<struct SetThrottledShardRequest>(commit.getEndpoint().getAdjustedEndpoint(10));
}
}
