Commit 05757ca (smp-server: memory usage analysis)

docs/plans/memory-analysis.md: 118 additions, 0 deletions
## Root Cause Analysis: SMP Server Memory Growth (23.5GB)

### Log Summary

- **Duration**: ~22 hours (Mar 16 12:12 → Mar 17 10:20)
- **92,277 proxy connection errors** out of 92,656 total log lines (99.6%)
- **292 unique failing destination servers**, top offender: `nowhere.moe` (12,875 errors)
- Only **145 successful proxy connections**
---

### Root Cause #1 (PRIMARY): PostgreSQL Queue Cache Never Evicts

**Files**: `src/Simplex/Messaging/Server/QueueStore/Postgres.hs`

The Postgres queue store has `useCache = True` hard-coded (via `Journal.hs:429`). Every queue accessed or created gets inserted into `queues :: TMap RecipientId q` and **is never removed**, even after deletion.

**Evidence**: `deleteStoreQueue` at Postgres.hs:448-465:

```haskell
deleteStoreQueue st sq = ... do
  atomically $ writeTVar qr Nothing                         -- QueueRec set to Nothing...
  when (useCache st) $ do
    atomically $ TM.delete (senderId q) $ senders st        -- ✅ cleaned
    forM_ (notifier q) $ \NtfCreds {notifierId} -> do
      atomically $ TM.delete notifierId $ notifiers st      -- ✅ cleaned
      atomically $ TM.delete notifierId $ notifierLocks st  -- ✅ cleaned
    -- ❌ no `TM.delete rId $ queues st` — the zombie entry stays forever!
```

Similarly, `links :: TMap LinkId RecipientId` is never cleaned on queue deletion.

**Impact**: Every queue created and then deleted leaves a zombie in the `queues` map (~200 bytes minimum per entry). On a busy server like smp19 running for days/weeks:

- Millions of queues created/deleted → millions of zombie cache entries
- Active queues also stay cached forever once loaded from Postgres
- **This is the primary unbounded growth mechanism**

The `loadedQueueCount` Prometheus metric would confirm this — it shows the size of this cache (Prometheus.hs:500).
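The growth mechanism can be reproduced in miniature. This is a hypothetical sketch, not simplexmq code: a plain `Data.Map` in a `TVar` stands in for the `TMap` cache, and `Store`, `createQueue`, and `deleteQueue` are invented names. Deletion cleans the sender index but, like the real `deleteStoreQueue`, never touches `queues`:

```haskell
import Control.Concurrent.STM
import qualified Data.Map.Strict as M

-- Miniature of the queue store cache: `queues` is the recipient-id cache,
-- `senders` the sender-id index (all names hypothetical).
data Store = Store
  { queues  :: TVar (M.Map Int String)
  , senders :: TVar (M.Map String Int)
  }

createQueue :: Store -> Int -> String -> STM ()
createQueue st rId sId = do
  modifyTVar' (queues st) (M.insert rId sId)
  modifyTVar' (senders st) (M.insert sId rId)

-- Mirrors the bug: the sender index is cleaned, `queues` is not.
deleteQueue :: Store -> Int -> String -> STM ()
deleteQueue st _rId sId =
  modifyTVar' (senders st) (M.delete sId)
  -- no delete on `queues` — a zombie entry remains per deleted queue

main :: IO ()
main = do
  st <- Store <$> newTVarIO M.empty <*> newTVarIO M.empty
  mapM_ (\i -> atomically $ createQueue st i (show i) >> deleteQueue st i (show i))
        [1 .. 1000 :: Int]
  nQ <- M.size <$> readTVarIO (queues st)
  nS <- M.size <$> readTVarIO (senders st)
  putStrLn $ "queues: " ++ show nQ ++ ", senders: " ++ show nS
  -- prints "queues: 1000, senders: 0"
```

After 1,000 create/delete cycles the sender index is empty but the queue cache still holds 1,000 entries — the same shape as the unbounded growth described above.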
---

### Root Cause #2 (AMPLIFIER): GHC Heap Sizing with `-A16m -N`

**RTS flags**: `+RTS -N -A16m -I0.01 -Iw15 -s -RTS`

With 16 cores:

- **Nursery**: 16 × 16MB = **256MB baseline**
- GHC's default major GC threshold is **2× live data** — if live data is 10GB, GHC allows the heap to grow to **~20GB before triggering a major GC**
- VIRT = 1031GB is normal for GHC (address-space reservation, not actual memory)

The `-I0.01` setting is an aggressive idle-GC interval (10ms), but with 22K clients the server is rarely idle, so major GC is deferred.
---

### Root Cause #3 (CONTRIBUTOR): No Cache Eviction by Design

Looking at `getQueue_` in Postgres.hs:193-244:

- When `useCache = True`, queues are loaded from Postgres on first access
- They are cached in the `queues` TMap **forever**
- There is no LRU eviction, no TTL, no max cache size
- The comment at line 233 acknowledges the interaction with `withAllMsgQueues`, but no eviction mechanism exists

---

### What is NOT the Main Cause

1. **Proxy connection state (`smpClients`)**: Properly managed — one entry per destination server, cleaned after the 30-second `persistErrorInterval`. Max 292 entries. Code at `Client/Agent.hs:196-242` and `Session.hs:24-39` is correct.

2. **SubscribedClients**: Despite the misleading comment at `Env/STM.hs:376` saying "subscriptions are never removed," they ARE cleaned up on both client disconnect (`Server.hs:1112`) and queue deletion (`Server.hs:308` via `lookupDeleteSubscribedClient`).

3. **Client structures**: 22K × ~3KB = ~66MB — negligible.

4. **TBQueues**: Bounded (`TBQueue` with `tbqSize = 128`), properly sized.

5. **Thread management**: `forkClient` uses weak references, and `finally` blocks ensure cleanup. Well-managed.

---

### Memory Budget Estimate

| Component | Estimated Size |
|-----------|----------------|
| Postgres queue cache (growing) | **5-15+ GB** |
| TLS connection state (22K × ~30KB) | ~660 MB |
| GHC nursery (16 × 16MB) | 256 MB |
| Client structures | ~66 MB |
| Proxy agent state | ~50 KB |
| **GHC old gen headroom (2× live)** | **doubles effective usage** |

---

### Recommended Fixes (Priority Order)

**1. Remove zombie queue entries from cache on deletion** (immediate fix)

In `deleteStoreQueue` in `Postgres.hs`, add `TM.delete rId $ queues st` and `forM_ (queueData q) $ \(lnkId, _) -> TM.delete lnkId $ links st`:

```haskell
-- After the existing cleanup:
when (useCache st) $ do
  atomically $ TM.delete rId $ queues st             -- ADD THIS
  atomically $ TM.delete (senderId q) $ senders st
  forM_ (queueData q) $ \(lnkId, _) ->
    atomically $ TM.delete lnkId $ links st          -- ADD THIS
  forM_ (notifier q) $ \NtfCreds {notifierId} -> do
    atomically $ TM.delete notifierId $ notifiers st
    atomically $ TM.delete notifierId $ notifierLocks st
```

**2. Add a configuration option to disable the cache** — allow `useCache = False` via the INI config for Postgres deployments where the DB is fast enough.

**3. Add cache eviction** — periodically evict queues not accessed recently (LRU or TTL-based).

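A TTL-based sweep could look like the following. This is a self-contained sketch under stated assumptions, not a patch: `Tick`, `touch`, and `evictIdle` are hypothetical names, an `Int` tick stands in for a monotonic clock, and a plain `Data.Map` in a `TVar` stands in for the `TMap` cache.

```haskell
import Control.Concurrent.STM
import qualified Data.Map.Strict as M

type Tick = Int

-- Record the last access tick whenever a cache entry is read or written.
touch :: TVar (M.Map Int (String, Tick)) -> Tick -> Int -> String -> STM ()
touch cache now k v = modifyTVar' cache (M.insert k (v, now))

-- Evict every entry idle for longer than `ttl` ticks; return the count evicted.
evictIdle :: TVar (M.Map Int (String, Tick)) -> Tick -> Tick -> STM Int
evictIdle cache now ttl = do
  m <- readTVar cache
  let (stale, fresh) = M.partition (\(_, t) -> now - t > ttl) m
  writeTVar cache fresh
  pure (M.size stale)

main :: IO ()
main = do
  cache <- newTVarIO M.empty
  atomically $ touch cache 0 1 "old queue"      -- last accessed at tick 0
  atomically $ touch cache 9 2 "recent queue"   -- last accessed at tick 9
  evicted   <- atomically $ evictIdle cache 10 5  -- now = 10, ttl = 5 ticks
  remaining <- M.size <$> readTVarIO cache
  putStrLn $ "evicted: " ++ show evicted ++ ", remaining: " ++ show remaining
  -- prints "evicted: 1, remaining: 1"
```

In the server, the sweep would run on a timer; evicted entries are only cache entries, so a later access simply reloads the queue from Postgres.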
**4. Tune the GHC RTS**:

- Add `-F1.5` (or lower) to reduce the major GC growth factor from its default of 2.0
- Consider `-M<limit>` to set a hard heap cap
- Consider `-N8` instead of `-N` to halve the nursery size and reduce per-capability overhead
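Combined, one possible flag set looks like the fragment below. This is illustrative only: the capability count and heap cap are untuned guesses (a `-M` limit below actual live data would make the runtime abort with a heap-overflow error, so it must be sized from measured live data, not from current RSS).

```
+RTS -N8 -A16m -F1.5 -M20g -I0.01 -Iw15 -s -RTS
```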
**5. Add observability** — log `loadedQueueCount` periodically to track cache growth. This metric exists in Prometheus but should also appear in the regular stats logs.
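The stats line itself is trivial — the only new code is reading the cache size where the stats logger already runs. A minimal sketch, with `logLoadedQueueCount` as a hypothetical helper and a plain `Data.Map` in a `TVar` standing in for the `TMap` cache:

```haskell
import Control.Concurrent.STM
import qualified Data.Map.Strict as M

-- Emit the size of the `queues` cache (what the `loadedQueueCount`
-- Prometheus gauge exposes) as a regular stats log line.
logLoadedQueueCount :: TVar (M.Map Int String) -> IO ()
logLoadedQueueCount cache = do
  n <- M.size <$> readTVarIO cache
  putStrLn $ "stats: loadedQueueCount=" ++ show n

main :: IO ()
main = do
  cache <- newTVarIO (M.fromList [(i, "q" ++ show i) | i <- [1 .. 42 :: Int]])
  logLoadedQueueCount cache   -- in the server this would run on the existing
                              -- stats timer, not once at startup
  -- prints "stats: loadedQueueCount=42"
```

Graphed over a day, this line alone would distinguish a bounded cache from the unbounded growth described in Root Cause #1.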
