simplex-chat
diff --git a/‎spec/TOPICS.md‎
Lines changed: 6 additions & 0 deletions b/‎spec/TOPICS.md‎
Lines changed: 6 additions & 0 deletions
diff --git a/‎spec/modules/Simplex/Messaging/Client.md‎
Lines changed: 93 additions & 0 deletions b/‎spec/modules/Simplex/Messaging/Client.md‎
Lines changed: 93 additions & 0 deletions
diff --git a/‎spec/modules/Simplex/Messaging/Client/Agent.md‎
Lines changed: 92 additions & 0 deletions b/‎spec/modules/Simplex/Messaging/Client/Agent.md‎
Lines changed: 92 additions & 0 deletions
@@ -12,4 +12,10 @@
 
 - **Certificate chain trust model**: ChainCertificates (Shared.hs) defines 0–4 cert chain semantics, used by both Client.hs (validateCertificateChain) and Server.hs (validateClientCertificate, SNI credential switching). The 4-length case skipping index 2 (operator cert) and the FQHN-disabled x509validate are decisions that span the entire transport security model.
 
+- **SMP proxy protocol flow**: The PRXY/PFWD/RFWD proxy protocol involves Client.hs (proxySMPCommand with 10 error scenarios, forwardSMPTransmission with sessionSecret encryption), Protocol.hs (command types, version-dependent encoding), Transport.hs (proxiedSMPRelayVersion cap, proxyServer flag disabling block encryption). The double encryption (client-relay via PFWD + proxy-relay via RFWD), combined timeout (tcpConnect + tcpTimeout), nonce/reverseNonce pairing, and version downgrade logic are not visible from any single module.
+
+- **Service certificate subscription model**: Service subscriptions (SUBS/NSUBS) and per-queue subscriptions (SUB/NSUB) coexist with complex state transitions. Client/Agent.hs manages dual active/pending subscription maps with session-aware cleanup. Protocol.hs defines useServiceAuth (only NEW/SUB/NSUB). Client.hs implements authTransmission with dual signing (entity key over cert hash + transmission, service key over transmission only). Transport.hs handles the service certificate handshake extension (v16+). The full subscription lifecycle — from DBService credentials through handshake to service subscription to disconnect/reconnect — spans all four modules.
+
+- **Two agent layers**: Client/Agent.hs ("small agent") is used only in servers — SMP proxy and notification server — to manage client connections to other SMP servers. Agent.hs + Agent/Client.hs ("big agent") is used in client applications. Both manage SMP client connections with subscription tracking and reconnection, but the big agent adds the full messaging agent layer (connections, double ratchet, file transfer). When documenting Agent/Client.hs, Client/Agent.hs should be reviewed for shared patterns and differences.
+
 - **Handshake protocol family**: SMP (Transport.hs), NTF (Notifications/Transport.hs), and XFTP (FileTransfer/Transport.hs) all have handshake protocols with the same structure (version negotiation + session binding + key exchange) but different feature sets. NTF is a strict subset. XFTP doesn't use the TLS handshake at all (HTTP2 layer). The shared types (THandle, THandleParams, THandleAuth) mean changes to the handshake infrastructure affect all three protocols.
@@ -0,0 +1,93 @@
+# Simplex.Messaging.Client
+
+> Generic protocol client: connection management, command sending/receiving, batching, proxy protocol, reconnection.
+
+**Source**: [`Client.hs`](../../../../src/Simplex/Messaging/Client.hs)
+
+**Protocol spec**: [`protocol/simplex-messaging.md`](../../../../protocol/simplex-messaging.md) — SimpleX Messaging Protocol.
+
+## Overview
+
+This module implements the client side of the `Protocol` typeclass — connecting to servers, sending commands, receiving responses, and managing connection lifecycle. It is generic over `Protocol v err msg`, instantiated for SMP as `SMPClient` (= `ProtocolClient SMPVersion ErrorType BrokerMsg`). The SMP proxy protocol (PRXY/PFWD/RFWD) is also implemented here.
+
+## Four concurrent threads — teardown semantics
+
+`getProtocolClient` launches four threads via `raceAny_`:
+- `send`: reads from `sndQ` (TBQueue) and writes to TLS
+- `receive`: reads from TLS and writes to `rcvQ` (TBQueue), updates `lastReceived`
+- `process`: reads from `rcvQ` and dispatches to response vars or `msgQ`
+- `monitor`: periodic ping loop (only when `smpPingInterval > 0`)
+
+When ANY thread exits (normally or exceptionally), `raceAny_` cancels all others. `E.finally` ensures the `disconnected` callback always fires. Implication: a single stuck thread (e.g., TLS read blocked on a half-open connection) keeps the entire client alive until `monitor` drops it. There is no per-thread health check — liveness depends entirely on the monitor's timeout logic.
+
+## Request lifecycle and leak risk
+
+`mkRequest` inserts a `Request` into `sentCommands` TMap BEFORE the transmission is written to TLS. If the TLS write fails silently or the connection drops before the response, the entry remains in `sentCommands` until the monitor's timeout counter exceeds `maxCnt` and drops the entire client. There is no per-request cleanup on send failure — individual request entries are only removed by `processMsg` (on response) or by `getResponse` timeout (which sets `pending = False` but doesn't remove the entry).
+
+## getResponse — pending flag race contract
+
+This is the core concurrency contract between timeout and response processing:
+
+1. `getResponse` waits with `timeout` for `takeTMVar responseVar`
+2. Regardless of result, atomically sets `pending = False` and tries `tryTakeTMVar` again (see comment on `getResponse`)
+3. In `processMsg`, when a response arrives for a request where `pending` is already `False` (timeout won), `wasPending` is `False` and the response is forwarded to `msgQ` as `STResponse` rather than discarded
+
+The double-check pattern (`swapTVar pending False` + `tryTakeTMVar`) handles the race window where a response arrives between timeout firing and `pending` being set to `False`. Without this, responses arriving in that gap would be silently lost.
+
+`timeoutErrorCount` is reset to 0 in three places: in `getResponse` when a response arrives, in `receive` on every TLS read, and the monitor uses this count to decide when to drop the connection.
+
+## processMsg — server events vs expired responses
+
+When `corrId` is empty, the message is an `STEvent` (server-initiated). When non-empty and the request was already expired (`wasPending` is `False`), the response becomes `STResponse` — not discarded, but forwarded to `msgQ` with the original command context. Entity ID mismatch is `STUnexpectedError`.
+
+## nonBlockingWriteTBQueue — fork on full
+
+If `tryWriteTBQueue` returns `False`, a new thread is forked for the blocking write. No backpressure mechanism — under sustained overload, thread count grows without bound. This is a deliberate tradeoff: the caller never blocks (preventing deadlock between send and process threads), at the cost of potential unbounded thread creation.
+
+## Batch commands do not expire
+
+See comment on `sendBatch`. Batched commands are written with `Nothing` as the request parameter — the send thread skips the `pending` flag check. Individual commands use `Just r` and the send thread checks `pending` after dequeue. The coupling: if the server stops responding, batched commands can block the send queue indefinitely since they have no timeout-based expiry.
+
+## monitor — quasi-periodic adaptive ping
+
+The ping loop sleeps for `smpPingInterval`, then checks elapsed time since `lastReceived`. If significant time remains in the interval (> 1 second), it re-sleeps for just the remaining time rather than sending a ping. This means ping frequency adapts to actual receive activity — frequent receives suppress pings.
+
+Pings are only sent when `sendPings` is `True`, set by `enablePings` (called from `subscribeSMPQueue`, `subscribeSMPQueues`, `subscribeSMPQueueNotifications`, `subscribeSMPQueuesNtfs`, `subscribeService`). The client drops the connection when `maxCnt` commands have timed out in sequence AND at least `recoverWindow` (15 minutes) has passed since the last received response.
+
+## clientCorrId — dual-purpose random values
+
+`clientCorrId` is a `TVar ChaChaDRG` generating random `CbNonce` values that serve as both correlation IDs and nonces for proxy encryption. When a nonce is explicitly passed (e.g., by `createSMPQueue`), it is used instead of generating a random one.
+
+## Proxy command re-parameterization
+
+`proxySMPCommand` constructs modified `thParams` per-request — setting `sessionId`, `peerServerPubKey`, and `thVersion` to the proxy-relay connection's parameters rather than the client-proxy connection's. A single `SMPClient` connection to the proxy carries commands with different auth parameters per destination relay. The encoding, signing, and encryption all use these per-request params, not the connection's original params.
+
+## proxySMPCommand — error classification
+
+See comment above `proxySMPCommand` for the 9 error scenarios (0-9) mapping each combination of success/error at client-proxy and proxy-relay boundaries. Errors from the destination relay wrapped in `PRES` are thrown as `ExceptT` errors (transparent proxy). Errors from the proxy itself are returned as `Left ProxyClientError`.
+
+## forwardSMPTransmission — proxy-side forwarding
+
+Used by the proxy server to forward `RFWD` to the destination relay. Uses `cbEncryptNoPad`/`cbDecryptNoPad` (no padding) with the session secret from the proxy-relay connection. Response nonce is `reverseNonce` of the request nonce.
+
+## authTransmission — dual auth with service signature
+
+When `useServiceAuth` is `True` and a service certificate is present, the entity key signs over `serviceCertHash <> transmission` (not just the transmission) — see comment on `authTransmission`. The service key only signs the transmission itself. For X25519 keys, `cbAuthenticate` produces a `TAAuthenticator`; for Ed25519/Ed448, `C.sign'` produces a `TASignature`.
+
+The service signature is only added when the entity authenticator is non-empty. If authenticator generation fails silently (returns empty bytes), service signing is silently skipped. This mirrors the [state-dependent parser contract](./Protocol.md#service-signature--state-dependent-parser-contract) in Protocol.hs.
+
+## connectSMPProxiedRelay — combined timeout
+
+The timeout for the `PRXY` command is `netTimeoutInt tcpConnectTimeout nm + netTimeoutInt tcpTimeout nm` — both timeouts are transformed by `netTimeoutInt` before summing.
+
+## ProxiedRelay — stored auth
+
+See comment on `prBasicAuth` — auth is stored to allow reconnecting via the same proxy after `NO_SESSION` error.
+
+## action — weak thread reference
+
+`action` stores a `Weak ThreadId` (via `mkWeakThreadId`) to the main client thread. `closeProtocolClient` dereferences and kills it. The weak reference allows the thread to be garbage collected if all other references are dropped.
+
+## writeSMPMessage — server-side event injection
+
+`writeSMPMessage` writes directly to `msgQ` as `STEvent`, bypassing the entire command/response pipeline. This is used by the server to inject MSG events into the subscription response path.
@@ -0,0 +1,92 @@
+# Simplex.Messaging.Client.Agent
+
+> SMP client connections with subscription management, reconnection, and service certificate support.
+
+**Source**: [`Client/Agent.hs`](../../../../../src/Simplex/Messaging/Client/Agent.hs)
+
+## Overview
+
+This is the "small agent" — used only in servers (SMP proxy, notification server) to manage client connections to other SMP servers. The "big agent" in `Simplex.Messaging.Agent` + `Simplex.Messaging.Agent.Client` serves client applications and adds the full messaging agent layer. See [Two agent layers](../../../../TOPICS.md) topic.
+
+`SMPClientAgent` manages `SMPClient` connections via `smpClients :: TMap SMPServer SMPClientVar` (one per SMP server), tracks active and pending subscriptions, and handles automatic reconnection. It is parameterized by `Party` (`p`) and uses the `ServiceParty` constraint to support both `RecipientService` and `NotifierService` modes.
+
+## Dual subscription model
+
+Four TMap fields track subscriptions in two dimensions:
+
+| | Active | Pending |
+|---|---|---|
+| **Service** | `activeServiceSubs` (TMap SMPServer (TVar (Maybe (ServiceSub, SessionId)))) | `pendingServiceSubs` (TMap SMPServer (TVar (Maybe ServiceSub))) |
+| **Queue** | `activeQueueSubs` (TMap SMPServer (TMap QueueId (SessionId, C.APrivateAuthKey))) | `pendingQueueSubs` (TMap SMPServer (TMap QueueId C.APrivateAuthKey)) |
+
+See comments on `activeServiceSubs` and `pendingServiceSubs` for the coexistence rules. Key constraint: only one service subscription per server. Active subs store the `SessionId` that established them.
+
+## SessionVar compare-and-swap — core concurrency safety
+
+`removeSessVar` (in Session.hs) uses `sessionVarId` (monotonically increasing counter from `sessSeq`) to prevent stale removal. When a disconnected client's cleanup runs after a new client already replaced the map entry, the ID mismatch causes removal to silently no-op. See comment on `removeSessVar`. This is used throughout: `removeClientAndSubs` for client map, `cleanup` for worker map.
+
+## removeClientAndSubs — outside-STM lookup optimization
+
+See comment on `removeClientAndSubs`. Subscription TVar references are obtained outside STM (via `TM.lookupIO`), then modified inside `atomically`. This is safe because the invariant is that subscription TVar entries for a server are never deleted from the outer TMap, only their contents change. Moving lookups inside the STM transaction would cause excessive re-evaluation under contention.
+
+## Disconnect preserves others' subscriptions
+
+`updateServiceSub` only moves active→pending when `sessId` matches the disconnected client (see its comment). If a new client already established different subscriptions on the same server, those are preserved. Queue subs use `M.partition` to split by SessionId — only matching subs move to pending, non-matching remain active.
+
+## Pending never reset to Nothing on disconnect
+
+See comment on `updateServiceSub`. After clearing an active service sub, the code sets pending to the cleared value but does NOT reset pending to `Nothing`. This avoids the race where a concurrent new client session has already set a different pending subscription. Implication: pending subs can only grow (be set) during disconnect, never shrink (be cleared).
+
+## persistErrorInterval — delayed error cleanup
+
+When `connectClient` calls `newSMPClient` and it fails, the error is stored with an expiry timestamp. `waitForSMPClient` checks expiry before retrying. When `persistErrorInterval` is 0, the error is stored without timestamp and the SessionVar is immediately removed from the map.
+
+## Session validation after subscription RPC
+
+Both `smpSubscribeQueues` and `smpSubscribeService` validate `activeClientSession` AFTER the subscription RPC completes, before committing results to state. If the session changed during the RPC (client reconnected), results are discarded and reconnection is triggered. This is optimistic execution with post-hoc validation — the RPC may succeed but its results are thrown away if the session is stale.
+
+## groupSub — subscription response classification
+
+Each queue response is classified by a `foldr` over the (subs, responses) zip:
+
+- **Success with matching serviceId**: counted as service-subscribed (`sQs` list)
+- **Success without matching serviceId**: counted as queue-only (`qOks` list with SessionId and key)
+- **Not in pending map**: silently skipped (handles concurrent activation by another path)
+- **Temporary error** (network, timeout): sets the `tempErrs` flag but does NOT remove from pending — queue stays pending for retry on reconnect
+- **Permanent error**: removes from pending and added to `finalErrs` — terminal, no automatic retry
+
+Even if multiple temporary errors occur in a batch, only one `reconnectClient` call is made (via the boolean accumulator flag).
+
+## updateActiveServiceSub — accumulative merge
+
+When serviceId and sessionId match the existing active subscription, queue count is added (`n + n'`) and IdsHash is XOR-merged (`idsHash <> idsHash'`). This accumulates across multiple subscription batches for the same service. When they don't match, the subscription is replaced entirely (silently drops old data).
+
+## serviceAvailable — dual check
+
+`smpSubscribeService` checks both that `serviceId` matches the connection's service ID AND that `partyServiceRole` matches the connection's `serviceRole`. If either fails → `CAServiceUnavailable`. See comment on `CAServiceUnavailable` for the cascade: the app must resubscribe all queues individually, creating new associations.
+
+## getPending — polymorphic over STM/IO
+
+`getPending` uses rank-2 polymorphism to work in both STM (for the "should we spawn a worker?" check, providing a consistent snapshot) and IO (for the actual reconnection data read, providing fresh data). Between these two calls, new pending subs could be added — the worker loop handles this by re-checking on each iteration.
+
+## Reconnect worker lifecycle
+
+### Spawn decision
+`reconnectClient` checks `active` outside STM, then atomically checks for pending subs and gets/creates a worker SessionVar. If no pending subs exist, no worker is spawned — this prevents race with cleanup and adding pending queues in another call.
+
+### Worker cleanup blocks on TMVar fill
+See comment on `cleanup`. The STM `retry` loop waits until the async handle is inserted into the TMVar before removing the worker from the map. Without this, cleanup could race ahead of the `putTMVar` in `newSubWorker`, leaving a terminated worker in the map.
+
+### Double timeout on reconnection
+`runSubWorker` wraps the entire reconnection in `System.Timeout.timeout` using `tcpConnectTimeout` in addition to the network-layer timeout. Two layers — network for the connection attempt, outer for the entire operation including subscription.
+
+### Reconnect filters already-active queues
+During reconnection, `reconnectSMPClient` reads current active queue subs (outside STM, same "vars never removed" invariant) and filters them out before resubscribing. Subscription is chunked by `agentSubsBatchSize` — partial success is possible across chunks.
+
+## Agent shutdown ordering
+
+`closeSMPClientAgent` executes in order: set `active = False`, close all client connections, then swap workers map to empty and fork cancellation threads. The cancel threads use `uninterruptibleCancel` but are fire-and-forget — `closeSMPClientAgent` may return before all workers are actually cancelled.
+
+## addSubs_ — left-biased union
+
+`addSubs_` uses `TM.union` which delegates to `M.union` (left-biased). If a queue subscription already exists, the new auth key from the incoming map wins. Service subs use `writeTVar` (overwrite) since only one service sub exists per server.