Commit f291b61 — "further analysis" (parent dd14ca5)

1 file changed: docs/plans/memory-analysis.md (+177 −70)

---
## Root Cause Analysis: SMP Server Memory Growth (23.5GB)

### Environment

- **Server**: smp19.simplex.im, ~21,927 connected clients
- **Storage**: PostgreSQL backend with `useCache = False`
- **RTS flags**: `+RTS -N -A16m -I0.01 -Iw15 -s -RTS` (16 cores)
- **Memory**: 23.5GB RES / 1031GB VIRT (75% of available RAM)

### Log Summary

- **Duration**: ~22 hours (Mar 16 12:12 → Mar 17 10:20)
---

### Known Factor: GHC Heap Sizing

With 16 cores and `-A16m`:

- **Nursery**: 16 × 16MB = **256MB baseline**
- GHC's default major GC threshold is **2× live data** — if live data is 10GB, the heap grows to ~20GB before a major GC
- The server is rarely idle with 22K clients, so major GC is deferred despite `-I0.01`
- This is an amplifier — whatever the actual live data size is, GHC roughly doubles it
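The amplifier arithmetic above can be put in a back-of-envelope model. This is a sketch under stated assumptions (the real heap size also depends on `-F`, allocation rate, and pinned/large objects); `expectedHeapGB` is an illustrative name, not server code, and 2.0 is GHC's default old-generation growth factor.

```haskell
-- Rough heap-size model for GHC's copying GC: live data times the growth
-- factor, plus one -A16m nursery per capability.
expectedHeapGB :: Double -> Double -> Int -> Double
expectedHeapGB liveGB growthFactor cores =
  liveGB * growthFactor + fromIntegral cores * nurseryGB
  where
    nurseryGB = 0.016 -- -A16m nursery per capability
```

With the flags above, `expectedHeapGB 10 2.0 16` ≈ 20.3 GB — the right ballpark for the observed 23.5GB RES if live data is around 10GB.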

---

### Candidate Structures That Could Grow Unboundedly

Analysis of the full codebase identified these structures that either grow without bound or have uncertain cleanup:
#### 1. `SubscribedClients` maps — `Env/STM.hs:378`

Both `subscribers.queueSubscribers` and `ntfSubscribers.queueSubscribers` (and their `serviceSubscribers`) use `SubscribedClients (TMap EntityId (TVar (Maybe (Client s))))`.

Comment at line 376: *"The subscriptions that were made at any point are not removed"*

`deleteSubcribedClient` IS called on disconnect (Server.hs:1112) and DOES call `TM.delete`. But it only deletes the entry if the stored client matches the disconnecting one — if another client has already re-subscribed, the old client's disconnect won't remove the entry. This is by design for mobile client continuity, but the net effect on map size over time is unclear without measurement.
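The compare-before-delete behaviour described above can be sketched as follows (an illustrative `IORef`-based stand-in with hypothetical names, not the server's STM code): a disconnecting client removes its map entry only if it is still the stored subscriber, so an entry taken over by a re-subscribed client survives the old client's disconnect.

```haskell
import Data.IORef
import qualified Data.Map.Strict as M

-- Delete the entry for k only if it still holds the caller's value.
-- Returns True if the entry was removed.
deleteIfCurrent :: (Ord k, Eq v) => IORef (M.Map k v) -> k -> v -> IO Bool
deleteIfCurrent ref k old = atomicModifyIORef' ref $ \m ->
  case M.lookup k m of
    Just v | v == old -> (M.delete k m, True) -- still ours: remove it
    _ -> (m, False) -- re-subscribed by another client (or gone): keep it
```

If "clientB" has taken over queue "q1", the disconnect of "clientA" leaves the entry in place — which is why the map's long-term size needs measuring rather than reasoning from the cleanup code alone.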

#### 2. ProxyAgent's subscription TMaps — `Client/Agent.hs:145-151`

The `SMPClientAgent` has 4 TMaps that accumulate one top-level entry per unique destination server and **never remove** them:

- `activeServiceSubs :: TMap SMPServer (TVar ...)` (line 145)
- `activeQueueSubs :: TMap SMPServer (TMap QueueId ...)` (line 146)
- `pendingServiceSubs :: TMap SMPServer (TVar ...)` (line 149)
- `pendingQueueSubs :: TMap SMPServer (TMap QueueId ...)` (line 150)

Comment at line 262: *"these vars are never removed, they are only added"*

These are only used for the proxy agent (`SParty 'Sender`), so they grow with each unique destination SMP server proxied to. With 292 unique servers in this log period, they are likely small — but long-running servers may accumulate thousands of entries.

`closeSMPClientAgent` (line 369) does NOT clear these 4 maps.

#### 3. `NtfStore` — `NtfStore.hs:26`

`NtfStore (TMap NotifierId (TVar [MsgNtf]))` — one entry per NotifierId.

`deleteExpiredNtfs` (line 47) filters expired notifications out of the lists but does **not remove entries with empty lists** from the TMap. Over time, NotifierIds that no longer receive notifications leave zombie `TVar []` entries.

`deleteNtfs` (line 44) does remove the full entry via `TM.lookupDelete` — but it is only called when a notifier is explicitly deleted.
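An expiry pass that also drops entries whose lists become empty would avoid the zombie `TVar []` entries described above. A minimal pure sketch (illustrative types and name; the server's `NtfStore` is STM-based):

```haskell
import qualified Data.Map.Strict as M

-- Filter expired notifications and remove entries left empty, instead of
-- keeping an empty list under the NotifierId forever.
expireAndPrune :: Ord k => (a -> Bool) -> M.Map k [a] -> M.Map k [a]
expireAndPrune expired = M.mapMaybe $ \ntfs ->
  case filter (not . expired) ntfs of
    [] -> Nothing -- all notifications expired: drop the whole entry
    ns -> Just ns -- keep the remaining notifications
```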

#### 4. `serviceLocks` in PostgresQueueStore — `Postgres.hs:112,469`

`serviceLocks :: TMap CertFingerprint Lock` — one Lock per unique certificate fingerprint.

`getCreateService` (line 469) calls `withLockMap (serviceLocks st) fp`, which calls `getMapLock` (Agent/Client.hs:1029-1032) — this **unconditionally inserts** a Lock into the TMap. There is **no cleanup code** for serviceLocks anywhere, and it is NOT guarded by `useCache`.
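The insert-and-never-delete pattern behind `getMapLock` can be sketched like this (an `IORef`/`MVar` stand-in for the STM original, assuming nothing about the actual implementation beyond what is described above): the lock for a key is created on first use, and since nothing ever deletes entries, the map's size equals the number of distinct fingerprints ever seen.

```haskell
import Control.Concurrent.MVar
import Data.IORef
import qualified Data.Map.Strict as M

type Lock = MVar ()

-- Look up the per-key lock, inserting a fresh one if absent; nothing removes it.
getMapLock :: Ord k => IORef (M.Map k Lock) -> k -> IO Lock
getMapLock ref k = do
  fresh <- newMVar () -- speculatively allocated; discarded if a lock exists
  atomicModifyIORef' ref $ \m -> case M.lookup k m of
    Just l -> (m, l) -- reuse the existing lock
    Nothing -> (M.insert k fresh m, fresh) -- insert; never deleted
```

After locking three distinct fingerprints (some repeatedly), the map holds three permanent entries — harmless per-entry, unbounded in aggregate.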

#### 5. `sentCommands` per proxy client connection — `Client.hs:580`

Each `PClient` has `sentCommands :: TMap CorrId (Request err msg)`. Entries are added per command sent (line 1369) and only removed when a response arrives (line 698). If a connection drops before all responses arrive, entries remain until the `PClient` is GC'd. Since the `PClient` is captured by the connection thread, which terminates on error, it should become GC-eligible — but GC timing depends on heap pressure.

#### 6. `subQ :: TQueue (ClientSub, ClientId)` — `Env/STM.hs:363`

An unbounded `TQueue` of subscription changes. If the subscriber thread (`serverThread`) can't process changes fast enough, this queue grows without backpressure. With 22K clients subscribing and unsubscribing, sustained bursts could cause it to bloat.

---

### Ruled Out

1. **PostgreSQL queue cache**: `useCache = False` — the `queues`, `senders`, `links`, and `notifiers` TMaps are empty.
2. **`notifierLocks`**: Guarded by `useCache` (Postgres.hs:377,405) — not used with `useCache = False`.
3. **Client structures**: 22K × ~3KB = ~66MB — negligible.
4. **TBQueues**: Bounded (`tbqSize = 128`).
5. **Thread management**: `forkClient` uses weak refs + `finally` blocks. `endThreads` cleared on disconnect.
6. **Proxy `smpClients`/`smpSessions`**: Properly cleaned on disconnect/expiry.
7. **`smpSubWorkers`**: Properly cleaned on worker completion; also cleared in `closeSMPClientAgent`.
8. **`pendingEvents`**: Atomically swapped empty every `pendingENDInterval`.
9. **Stats IORef counters**: Fixed number, bounded.
10. **DB connection pool**: Bounded `TBQueue` with bracket-based return.

---

### Insufficient Data to Determine Root Cause

Without measuring the actual sizes of these structures at runtime, we cannot determine which (if any) is the primary contributor. The following exact logging changes will identify the root cause.

---

### EXACT LOGS TO ADD

Add a new periodic logging thread in `src/Simplex/Messaging/Server.hs`.

Insert at `Server.hs:197` (after `prometheusMetricsThread_`):

```haskell
<> memoryDiagThread_ cfg
```

Then define:

```haskell
memoryDiagThread_ :: ServerConfig s -> [M s ()]
memoryDiagThread_ ServerConfig {prometheusInterval = Just _} =
  [memoryDiagThread]
memoryDiagThread_ _ = []

memoryDiagThread :: M s ()
memoryDiagThread = do
  labelMyThread "memoryDiag"
  Env { ntfStore = NtfStore ntfMap
      , server = srv@Server {subscribers, ntfSubscribers}
      , proxyAgent = ProxyAgent {smpAgent = pa}
      , msgStore_ = ms
      } <- ask
  let interval = 300_000_000 -- 5 minutes
  liftIO $ forever $ do
    threadDelay interval
    -- GHC RTS stats
    rts <- getRTSStats
    let liveBytes = gcdetails_live_bytes $ gc rts
        heapSize = gcdetails_mem_in_use_bytes $ gc rts
        gcCount = gcs rts
    -- Server structures
    clientCount <- IM.size <$> getServerClients srv
    -- SubscribedClients (queue and service subscribers for both SMP and NTF)
    smpQSubs <- M.size <$> getSubscribedClients (queueSubscribers subscribers)
    smpSSubs <- M.size <$> getSubscribedClients (serviceSubscribers subscribers)
    ntfQSubs <- M.size <$> getSubscribedClients (queueSubscribers ntfSubscribers)
    ntfSSubs <- M.size <$> getSubscribedClients (serviceSubscribers ntfSubscribers)
    -- Pending events
    smpPending <- IM.size <$> readTVarIO (pendingEvents subscribers)
    ntfPending <- IM.size <$> readTVarIO (pendingEvents ntfSubscribers)
    -- NtfStore
    ntfStoreSize <- M.size <$> readTVarIO ntfMap
    -- ProxyAgent maps
    let SMPClientAgent {smpClients, smpSessions, activeServiceSubs, activeQueueSubs, pendingServiceSubs, pendingQueueSubs, smpSubWorkers} = pa
    paClients <- M.size <$> readTVarIO smpClients
    paSessions <- M.size <$> readTVarIO smpSessions
    paActSvc <- M.size <$> readTVarIO activeServiceSubs
    paActQ <- M.size <$> readTVarIO activeQueueSubs
    paPndSvc <- M.size <$> readTVarIO pendingServiceSubs
    paPndQ <- M.size <$> readTVarIO pendingQueueSubs
    paWorkers <- M.size <$> readTVarIO smpSubWorkers
    -- Loaded queue counts
    lc <- loadedQueueCounts $ fromMsgStore ms
    -- Log everything
    logInfo $
      "MEMORY "
        <> "rts_live=" <> tshow liveBytes
        <> " rts_heap=" <> tshow heapSize
        <> " rts_gc=" <> tshow gcCount
        <> " clients=" <> tshow clientCount
        <> " smpQSubs=" <> tshow smpQSubs
        <> " smpSSubs=" <> tshow smpSSubs
        <> " ntfQSubs=" <> tshow ntfQSubs
        <> " ntfSSubs=" <> tshow ntfSSubs
        <> " smpPending=" <> tshow smpPending
        <> " ntfPending=" <> tshow ntfPending
        <> " ntfStore=" <> tshow ntfStoreSize
        <> " paClients=" <> tshow paClients
        <> " paSessions=" <> tshow paSessions
        <> " paActSvc=" <> tshow paActSvc
        <> " paActQ=" <> tshow paActQ
        <> " paPndSvc=" <> tshow paPndSvc
        <> " paPndQ=" <> tshow paPndQ
        <> " paWorkers=" <> tshow paWorkers
        <> " loadedQ=" <> tshow (loadedQueueCount lc)
        <> " loadedNtf=" <> tshow (loadedNotifierCount lc)
        <> " ntfLocks=" <> tshow (notifierLockCount lc)
```

Note: `smpSubs.subsCount` (queueSubscribers size) and `smpSubs.subServicesCount` (serviceSubscribers size) are **already logged** in Prometheus (lines 475-496). The log above adds all other candidate structures plus GHC RTS memory stats.

This produces a single log line every 5 minutes:

```
[INFO] MEMORY rts_live=10737418240 rts_heap=23488102400 rts_gc=4521 clients=21927 smpQSubs=1847233 smpSSubs=42 ntfQSubs=982112 ntfSSubs=31 smpPending=0 ntfPending=0 ntfStore=512844 paClients=12 paSessions=12 paActSvc=0 paActQ=0 paPndSvc=0 paPndQ=0 paWorkers=3 loadedQ=0 loadedNtf=0 ntfLocks=0
```
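A log line in this format is easy to post-process. A small sketch (hypothetical helper, not part of the plan) that extracts the key=value pairs so successive samples can be diffed to spot monotonic growth:

```haskell
import Data.Char (isDigit)
import qualified Data.Map.Strict as M

-- Parse "k=v" words with a numeric value out of a MEMORY log line,
-- ignoring everything else (timestamps, log level, the MEMORY tag).
parseMemoryLine :: String -> M.Map String Integer
parseMemoryLine = M.fromList . concatMap kv . words
  where
    kv w = case break (== '=') w of
      (k, '=' : v) | not (null v), all isDigit v -> [(k, read v)]
      _ -> []
```

Subtracting consecutive parsed maps (per key) turns the 5-minute samples into growth rates, which is what the interpretation guide below the table needs.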

### What Each Metric Tells Us

| Metric | What it reveals | If growing = suspect |
|--------|-----------------|----------------------|
| `rts_live` | Actual live data after the last major GC | Baseline — everything else should add up to this |
| `rts_heap` | Total heap (should be ~2× `rts_live`) | If >> 2× live, fragmentation issue |
| `clients` | Connected client count | Known: ~22K |
| `smpQSubs` | SubscribedClients map size (queue subs) | If >> clients × avg_subs, entries not cleaned |
| `smpSSubs` | SubscribedClients map size (service subs) | Should be small |
| `ntfQSubs` | NTF SubscribedClients map (queue subs) | Same concern as `smpQSubs` |
| `ntfSSubs` | NTF SubscribedClients map (service subs) | Should be small |
| `smpPending` / `ntfPending` | Pending END/DELD events per client | If large, subscriber thread is lagging |
| `ntfStore` | NotifierId count in NtfStore | If growing monotonically, zombie entries |
| `paClients` | Proxy connections to other servers | Should be <= unique destination servers |
| `paSessions` | Active proxy sessions | Should match `paClients` |
| `paActSvc` / `paActQ` | Proxy active subscriptions | If growing, entries never removed |
| `paPndSvc` / `paPndQ` | Proxy pending subscriptions | If growing, resubscription stuck |
| `paWorkers` | Active reconnect workers | If growing, workers stuck in retry |
| `loadedQ` | Cached queues in store (0 with `useCache = False`) | Should be 0 |
| `ntfLocks` | Notifier locks in store | Should be 0 with `useCache = False` |

### Interpretation Guide

**If `smpQSubs` is in the millions**: SubscribedClients is the primary leak. Entries accumulate for every queue ever subscribed to.

**If `ntfStore` grows monotonically**: Zombie notification entries (empty lists after expiration). Fix: `deleteExpiredNtfs` should remove entries with empty lists.

**If `paActSvc` + `paActQ` grow**: Proxy agent subscription maps are the leak. Fix: add cleanup when no active/pending subs exist for a server.

**If `rts_live` is much smaller than `rts_heap`**: GHC heap fragmentation. Fix: tune the `-F` flag (GC growth factor) or use `-c` (compacting GC).
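If that case is confirmed, the tuning could look like the following (hedged: the `-F1.2` value is an example to experiment with, not a measured recommendation, and the invocation assumes the server's usual start command):

```shell
# Example RTS tuning to test if fragmentation is confirmed:
# -F1.2 lowers the old-generation growth factor from the default 2.0,
# -c switches the oldest generation to compacting (rather than copying) GC.
smp-server start +RTS -N -A16m -I0.01 -Iw15 -F1.2 -c -s -RTS
```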

**If `rts_live` is ~10-12GB**: The live data is genuinely large. Look at which metric is the largest contributor.

**If nothing above is large but `rts_live` is large**: The leak is in a structure not measured here — likely TLS connection buffers, ByteString retention from Postgres queries, or GHC runtime overhead. The next step would be heap profiling with `-hT`.
