diff --git a/design/AI-generated/subsystem_02_rpc_transport.md b/design/AI-generated/subsystem_02_rpc_transport.md
index f00fcf71315..dd1239f90fb 100644
--- a/design/AI-generated/subsystem_02_rpc_transport.md
+++ b/design/AI-generated/subsystem_02_rpc_transport.md
@@ -27,7 +27,7 @@ class Endpoint {
 };
 ```
 
-- **Token** = `UID` (pair of `uint64_t`). Lower 32 bits encode an index into the EndpointMap; upper 32 bits encode task priority.
+- **Token** = `UID` (pair of `uint64_t`). The low 32 bits of the second word are the endpoint's index into the `EndpointMap` — that is what `get()` looks up — and the map entry reuses that same field to hold the receiver's `TaskPriority`. The first word is a random base shared across an interface's contiguous block of endpoints.
 - **Well-known tokens**: `wellKnownToken(int id)` returns `UID(-1, id)`. Reserved IDs: `WLTOKEN_ENDPOINT_NOT_FOUND(0)`, `WLTOKEN_PING_PACKET`, `WLTOKEN_UNAUTHORIZED_ENDPOINT`, plus system services (leader election, config transactions, etc.)
 - **Address selection**: `choosePrimaryAddress()` swaps primary/secondary based on local TLS preference.
 
@@ -58,7 +58,7 @@ struct Peer : ReferenceCounted {
 3. Sends `ConnectPacket` (protocol version, local address, connection ID)
 4. Spawns `connectionWriter()` (async write loop) and `connectionReader()` (async read loop)
 5. `connectionMonitor()` sends periodic pings, detects timeouts
-6. On failure: exponential backoff (INITIAL_RECONNECTION_TIME to MAX_RECONNECTION_TIME, growth factor 1.5x)
+6. On failure: exponential backoff (INITIAL_RECONNECTION_TIME to MAX_RECONNECTION_TIME, growth factor 1.2x)
 7. `discardUnreliablePackets()` on disconnect; reliable packets resent after reconnect
 
 ### EndpointMap -- [`FlowTransport.cpp`](https://github.com/apple/foundationdb/blob/main/fdbrpc/FlowTransport.cpp)`:90-230`
@@ -75,7 +75,7 @@ struct Entry {
 - Pre-allocates slots for well-known endpoints (indices 0 to wellKnownEndpointCount-1)
 - Dynamic endpoints allocated from free list; doubles table size when full
 - `get(token)` -- O(1) lookup by token's lower 32 bits
-- `insert()` -- allocates from free list, encodes priority in upper 32 bits
+- `insert()` -- allocates from free list (single endpoint) or a contiguous block (the `streams` overload, keyed off a fresh random base UID); stores the receiver's priority in the entry token's low 32 bits
 - `remove()` -- returns slot to free list
 
 ### FlowTransport -- [`FlowTransport.h`](https://github.com/apple/foundationdb/blob/main/fdbrpc/include/fdbrpc/FlowTransport.h)`:199-315`
@@ -155,6 +155,70 @@ Stream of replies with flow control:
 
 ---
 
+## Interface Endpoint Layout
+
+Service interfaces (e.g., `CommitProxyInterface`, `GrvProxyInterface`, `StorageServerInterface`) bundle many `RequestStream` channels but ship a single endpoint over the wire. The rest are reconstructed locally by adding a fixed offset to that anchor.
+
+### Convention
+
+Each interface picks an "anchor" stream (typically the most-used one: `commit` for the commit proxy, `getConsistentReadVersion` for the GRV proxy, `getValue` for storage servers). It is the only `RequestStream` actually serialized. All other streams are reconstructed in the `if (Archive::isDeserializing)` branch via `anchor.getEndpoint().getAdjustedEndpoint(N)`, where N is the stream's position in `initEndpoints`'s `push_back` order.
+
+```cpp
+// CommitProxyInterface.h (excerpt)
+template <class Archive>
+void serialize(Archive& ar) {
+    serializer(ar, processId, provisional, commit); // commit is the anchor — the only RequestStream on the wire
+    if (Archive::isDeserializing) {
+        legacyGetConsistentReadVersion = ...(commit.getEndpoint().getAdjustedEndpoint(1));
+        getKeyServersLocations = ...(commit.getEndpoint().getAdjustedEndpoint(2));
+        // ...
+    }
+}
+
+void initEndpoints() {
+    std::vector<...> streams;
+    streams.push_back(commit.getReceiver(...)); // index 0 — anchor
+    streams.push_back(legacyGetConsistentReadVersion.getReceiver(...)); // 1
+    streams.push_back(getKeyServersLocations.getReceiver(...)); // 2
+    // ...
+    FlowTransport::transport().addEndpoints(streams);
+}
+```
+
+`EndpointMap::insert` allocates the registered receivers as a contiguous block keyed off a fresh random `base` UID. Stream `i` ends up at token offset `i` from the anchor, and `getAdjustedEndpoint(N)` produces the matching token. Client and server agree iff the client's deserialization index matches the server's registration order.
+
+### The endpoint layout is a wire-compatibility contract
+
+The offsets look like a private, per-build implementation detail. They are not. The offset of every stream within an interface is part of the wire protocol, and it must stay identical across **every binary that runs a compatible protocol version** — not merely across binaries built from the same source tree.
+
+The binaries that actually differ are `fdbserver` and `fdbclient`. A cluster's server processes are upgraded together: an upgrade is a coordinated restart that triggers a recovery, so the cluster's `fdbserver` processes always run a single build, and server-to-server interface reconstruction — for example, Ratekeeper rebuilding a `CommitProxyInterface` from the broadcast `ServerDBInfo` — is effectively same-build. Client libraries are not: they are versioned and deployed independently of the cluster and are *not* upgraded in lockstep with it. An application links `fdb_c` libraries that may have been built long before, or after, the `fdbserver` it happens to connect to.
+
+What lets a client talk to that server is *protocol compatibility*, not an identical build:
+
+1. **FlowTransport connects compatible peers, not identical ones.** Two protocol versions are compatible when their high 48 bits match — `isCompatible` compares `version() & compatibleProtocolVersionMask`, where `compatibleProtocolVersionMask = 0xFFFFFFFFFFFF0000` (see [`ProtocolVersion.h`](https://github.com/apple/foundationdb/blob/main/flow/ProtocolVersion.h.cmake)). The low 16 bits never affect compatibility, and patch releases of an `x.y` line are *required* to keep the same protocol version (see `cmake/ProtocolVersions.cmake`: "This version impacts both communications and the deserialization of certain database and IKeyValueStore keys"). A single compatible protocol version therefore spans many distinct builds.
+2. **The multi-version client selects a library by *normalized* (compatible) version.** `MultiVersionDatabase` indexes loaded client libraries by `protocolVersion.normalizedVersion()` and keeps the same connection when the cluster's protocol version changes but stays compatible (see [`subsystem_03_client_library.md`](subsystem_03_client_library.md)). The library it picks need only be *compatible* with the cluster, so it is routinely a different build than the `fdbserver` it connects to — and its compiled-in endpoint offsets must still match what that server registered.
+
+A client reconstructs the *entire* interface (every stream in `commitProxies`/`grvProxies`) but only sends to the streams it actually uses. So a client-facing stream carries the cross-build `fdbserver`↔`fdbclient` contract, whereas a server-only stream (one no client sends to, such as `setThrottledShard`) is exercised only on the same-build server-to-server path — its realistic failure mode is the *local* `serialize`/`initEndpoints` misalignment described below, not a cross-build mismatch.
+
+Two facts about tokens are true but do *not* license repacking: tokens are not persistent (every process gets a fresh random `base` UID, so no stale token survives a restart), and recovery reissues every interface via fresh `ServerDBInfo`/`ClientDBInfo` (see [`subsystem_09_cluster_recovery.md`](subsystem_09_cluster_recovery.md)). Both keep the *anchor* token fresh — but the offsets relative to that anchor are still reconstructed on the far side from the reader's compiled-in indices, so they remain a cross-binary contract whenever a client and server are different builds.
+
+### Evolving an interface safely
+
+There are two invariants, one local and one global:
+
+- **Local (necessary).** Within a single build, the `getAdjustedEndpoint(N)` argument in `serialize()` must equal the `push_back` position of the same stream in `initEndpoints()`. If they differ, even two *identical* binaries mis-route: the reconstructed endpoint points at a slot the server never registered, and `EndpointMap::get` returns `nullptr`.
+- **Global (the real contract).** The offset of each stream must be stable across all compatible binaries, per the section above.
+
+From these, the only safe ways to change an interface are:
+
+1. **Append new streams at the end.** Existing offsets are untouched, so an older client built against the previous layout keeps resolving every stream it knows about.
+2. **When removing a stream within a compatible protocol version, leave a placeholder in its slot** so every successor keeps its offset. The retained `legacyGetConsistentReadVersion` field in [`CommitProxyInterface.h`](https://github.com/apple/foundationdb/blob/main/fdbclient/include/fdbclient/CommitProxyInterface.h) is an example — a typed-but-unused `RequestStream` is enough to hold the slot. Removing a stream and letting successors shift down is a *silent* wire break for any compatible client still reconstructing the old layout.
+3. **Repack or compact offsets only across an incompatible protocol-version change.** A new *major* version bumps the protocol version incompatibly, so builds on either side of the boundary refuse to connect rather than mis-route. A major version change is therefore exactly when it is safe to drop accumulated placeholders and renumber an interface's endpoints densely (as long as the `serialize()` offsets and `initEndpoints()` order are renumbered together — the local invariant still applies). Doing the same *within* a compatible protocol version — for example, a patch release — is a silent wire break.
+
+A violation of this contract fails quietly. A send to a token the receiver never registered elicits a `WLTOKEN_ENDPOINT_NOT_FOUND` reply (surfacing as an `EndpointNotFound` trace event), but fire-and-forget sends (`RequestStream::send`) observe no application-level error, so the request is simply dropped. Guarding the layout is therefore best done proactively: a unit test that, for each interface, round-trips through `serialize`/`initEndpoints` and asserts every reconstructed stream resolves to a registered receiver catches the local invariant immediately, and pinning each stream's offset across *compatible* protocol versions catches a removal that forgets to reserve a placeholder (such a test is expected to be updated when a major version bump intentionally renumbers).
+
+---
+
 ## Wire Protocol
 
 ### Packet Format
@@ -206,10 +270,10 @@ class SimpleFailureMonitor : public IFailureMonitor {
 **Detection mechanism** (`connectionMonitor()` actor):
 
 1. Sends periodic ping to `WLTOKEN_PING_PACKET`
-2. Waits for reply with `CONNECTION_MONITOR_TIMEOUT` (1s default)
+2. Waits for reply with `CONNECTION_MONITOR_TIMEOUT` (2s default; 1.5s in simulation)
 3. On timeout: increment count, eventually throw `connection_failed()`
 4. Records ping latency in `DDSketch` histogram
-5. After `FAILURE_DETECTION_DELAY` (5s): mark address as failed
+5. After `FAILURE_DETECTION_DELAY` (4s): mark address as failed
 
 ---
 
@@ -313,14 +377,14 @@ Client-side request distribution:
 
 | Knob | Default | Purpose |
 |------|---------|---------|
-| `INITIAL_RECONNECTION_TIME` | 0.2s | First reconnect delay |
-| `MAX_RECONNECTION_TIME` | 30s | Max backoff |
-| `RECONNECTION_TIME_GROWTH_RATE` | 1.5x | Backoff multiplier |
-| `CONNECTION_MONITOR_TIMEOUT` | 1s | Ping timeout |
-| `FAILURE_DETECTION_DELAY` | 5s | Before marking failed |
-| `PACKET_LIMIT` | 512MB | Max packet size |
-| `MIN_COALESCE_DELAY` | 0.5ms | Min buffer wait |
-| `MAX_COALESCE_DELAY` | 2ms | Max buffer wait |
+| `INITIAL_RECONNECTION_TIME` | 0.05s | First reconnect delay |
+| `MAX_RECONNECTION_TIME` | 0.5s | Max backoff |
+| `RECONNECTION_TIME_GROWTH_RATE` | 1.2x | Backoff multiplier |
+| `CONNECTION_MONITOR_TIMEOUT` | 2s (1.5s in sim) | Ping timeout |
+| `FAILURE_DETECTION_DELAY` | 4s | Before marking failed |
+| `PACKET_LIMIT` | 100MB | Max packet size |
+| `MIN_COALESCE_DELAY` | 10µs | Min buffer wait |
+| `MAX_COALESCE_DELAY` | 20µs | Max buffer wait |
 
 ---
 
diff --git a/fdbclient/include/fdbclient/CommitProxyInterface.h b/fdbclient/include/fdbclient/CommitProxyInterface.h
index 5fa33fae819..84b5e1d7e32 100644
--- a/fdbclient/include/fdbclient/CommitProxyInterface.h
+++ b/fdbclient/include/fdbclient/CommitProxyInterface.h
@@ -82,7 +82,7 @@ struct CommitProxyInterface {
 			expireIdempotencyId =
 			    PublicRequestStream(commit.getEndpoint().getAdjustedEndpoint(9));
 			setThrottledShard =
-			    RequestStream(commit.getEndpoint().getAdjustedEndpoint(12));
+			    RequestStream(commit.getEndpoint().getAdjustedEndpoint(10));
 		}
 	}
 