## Root Cause Analysis: SMP Server Memory Growth (23.5GB)

### Log Summary

- **Duration**: ~22 hours (Mar 16 12:12 → Mar 17 10:20)
- **92,277 proxy connection errors** out of 92,656 total log lines (99.6%)
- **292 unique failing destination servers**, top offender: `nowhere.moe` (12,875 errors)
- Only **145 successful proxy connections**

---

### Root Cause #1 (PRIMARY): PostgreSQL Queue Cache Never Evicts

**Files**: `src/Simplex/Messaging/Server/QueueStore/Postgres.hs`

The Postgres queue store has `useCache = True` hard-coded (via `Journal.hs:429`). Every queue accessed or created is inserted into `queues :: TMap RecipientId q` and **is never removed**, even after deletion.

**Evidence**: `deleteStoreQueue` at Postgres.hs:448-465:
```haskell
deleteStoreQueue st sq = ... do
  atomically $ writeTVar qr Nothing -- QueueRec set to Nothing...
  when (useCache st) $ do
    atomically $ TM.delete (senderId q) $ senders st -- ✅ cleaned
    forM_ (notifier q) $ \NtfCreds {notifierId} -> do
      atomically $ TM.delete notifierId $ notifiers st -- ✅ cleaned
      atomically $ TM.delete notifierId $ notifierLocks st -- ✅ cleaned
  -- ❌ NO TM.delete rId $ queues st — the zombie entry stays forever!
```

Similarly, `links :: TMap LinkId RecipientId` is never cleaned on queue deletion.

**Impact**: every queue created and then deleted leaves a zombie entry in the `queues` map (~200 bytes minimum per entry). On a busy server like smp19 running for days or weeks:
- Millions of queues created/deleted → millions of zombie cache entries
- Active queues also stay cached forever once loaded from Postgres
- **This is the primary unbounded growth mechanism**

The `loadedQueueCount` Prometheus metric would confirm this: it reports the size of this cache (Prometheus.hs:500).
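A back-of-envelope check of why this dominates the heap. The entry size comes from the ~200-byte floor above; the churn rate and uptime are illustrative assumptions, not measurements:

```haskell
-- Convert bytes to GiB.
gib :: Integer -> Double
gib b = fromIntegral b / 1024 ^ (3 :: Int)

-- Zombie bytes accumulated after `days` days at `churnPerDay` queue
-- deletions, ~200 bytes per leaked cache entry (both rates are
-- illustrative assumptions, not measured values).
zombieBytes :: Integer -> Integer -> Integer
zombieBytes churnPerDay days = 200 * churnPerDay * days

main :: IO ()
main = do
  let raw = zombieBytes 2000000 14 -- hypothetical: 2M deletions/day for 2 weeks
  putStrLn $ "zombie entries alone: " ++ show (gib raw) ++ " GiB"
  -- GHC's default -F2.0 lets the heap reach ~2x live data before major GC:
  putStrLn $ "with 2x GC headroom:  " ++ show (gib (2 * raw)) ++ " GiB"
```

And this counts only the deleted queues; live queues loaded from Postgres and cached forever add their full `QueueRec` size on top.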

---

### Root Cause #2 (AMPLIFIER): GHC Heap Sizing with `-A16m -N`

**RTS flags**: `+RTS -N -A16m -I0.01 -Iw15 -s -RTS`

With 16 cores:
- **Nursery**: 16 × 16MB = **256MB baseline**
- GHC's default major GC threshold is **2× live data** (`-F2.0`): if live data is 10GB, the heap can grow to **~20GB before a major GC triggers**
- VIRT = 1031GB is normal for GHC (address-space reservation, not actual memory)

The `-I0.01` setting makes idle GC aggressive (10ms), but with 22K clients the server is rarely idle, so major GC is deferred.

---

### Root Cause #3 (CONTRIBUTOR): No Cache Eviction by Design

Looking at `getQueue_` in Postgres.hs:193-244:
- When `useCache = True`, queues are loaded from Postgres on first access
- They are cached in the `queues` TMap **forever**
- There is no LRU eviction, no TTL, and no maximum cache size
- The comment at line 233 acknowledges the interaction with `withAllMsgQueues`, but no eviction mechanism exists

---

### What is NOT the Main Cause

1. **Proxy connection state (`smpClients`)**: properly managed; one entry per destination server, cleaned up after the 30-second `persistErrorInterval`. At most 292 entries. The code at `Client/Agent.hs:196-242` and `Session.hs:24-39` is correct.

2. **SubscribedClients**: despite the misleading comment at `Env/STM.hs:376` saying "subscriptions are never removed," they ARE cleaned up on both client disconnect (`Server.hs:1112`) and queue deletion (`Server.hs:308` via `lookupDeleteSubscribedClient`).

3. **Client structures**: 22K × ~3KB = ~66MB; negligible.

4. **TBQueues**: bounded (`TBQueue` with `tbqSize = 128`) and properly sized.

5. **Thread management**: `forkClient` uses weak references, and `finally` blocks ensure cleanup. Well managed.
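The bounded-queue claim in item 4 is easy to sanity-check against `stm`'s `TBQueue` directly. A standalone sketch, not server code:

```haskell
import Control.Concurrent.STM

-- Write n items into a TBQueue of capacity n and report whether it is
-- full. One more writeTBQueue would retry (block) until a reader drains
-- the queue, so a slow client applies backpressure to producers instead
-- of growing the heap without bound.
fillToCapacity :: Int -> IO Bool
fillToCapacity n = do
  q <- newTBQueueIO (fromIntegral n) -- tiny stand-in for the server's tbqSize = 128
  mapM_ (atomically . writeTBQueue q) [1 .. n]
  atomically $ isFullTBQueue q
```

This is exactly why per-client queues are not a growth vector: memory per client is capped at `tbqSize` messages.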

---

### Memory Budget Estimate

| Component | Estimated Size |
|-----------|---------------|
| Postgres queue cache (growing) | **5-15+ GB** |
| TLS connection state (22K × ~30KB) | ~660 MB |
| GHC nursery (16 × 16MB) | 256 MB |
| Client structures | ~66 MB |
| Proxy agent state | ~50 KB |
| **GHC old-gen headroom (2× live)** | **doubles effective usage** |

---

### Recommended Fixes (Priority Order)

**1. Remove zombie queue entries from the cache on deletion** (immediate fix)

In `deleteStoreQueue` in `Postgres.hs`, add `TM.delete rId $ queues st` and `forM_ (queueData q) $ \(lnkId, _) -> TM.delete lnkId $ links st`:

```haskell
-- After the existing cleanup:
when (useCache st) $ do
  atomically $ TM.delete rId $ queues st -- ADD THIS
  atomically $ TM.delete (senderId q) $ senders st
  forM_ (queueData q) $ \(lnkId, _) ->
    atomically $ TM.delete lnkId $ links st -- ADD THIS
  forM_ (notifier q) $ \NtfCreds {notifierId} -> do
    atomically $ TM.delete notifierId $ notifiers st
    atomically $ TM.delete notifierId $ notifierLocks st
```

**2. Add a configuration option to disable the cache**: allow `useCache = False` via the INI config for Postgres deployments where the DB is fast enough.

**3. Add cache eviction**: periodically evict queues not accessed recently (LRU- or TTL-based).
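A minimal sketch of what TTL-based eviction could look like, using a plain `TVar (Map ...)` as a stand-in for the server's `TMap` (the names, timestamp representation, and sweep logic are all hypothetical, not existing simplexmq code):

```haskell
import Control.Concurrent.STM
import qualified Data.Map.Strict as M

-- Hypothetical cache entry carrying a last-access time (monotonic seconds).
data Cached q = Cached {cachedQueue :: q, lastAccess :: Int}

type Cache q = TVar (M.Map String (Cached q))

-- Record an access so the sweeper sees the entry as fresh.
touch :: Int -> String -> Cache q -> STM ()
touch now k cache = modifyTVar' cache $ M.adjust (\c -> c {lastAccess = now}) k

-- Drop every entry idle for longer than ttl seconds; return the eviction
-- count. A background thread would run this every few minutes.
evictIdle :: Int -> Int -> Cache q -> STM Int
evictIdle now ttl cache = do
  m <- readTVar cache
  let (stale, live) = M.partition (\c -> now - lastAccess c > ttl) m
  writeTVar cache live
  pure (M.size stale)
```

An LRU variant would bound the map by size and order entries by access time instead; either way the key property is that `queues` stops being append-only.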

**4. Tune GHC RTS**:
- Add `-F1.5` (or lower) to reduce the major GC threshold multiplier from the default 2.0
- Consider `-M<limit>` to set a hard heap cap
- Consider `-N8` instead of `-N` to halve the nursery size and reduce per-capability overhead
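Taken together, the tuned invocation might look like the following; the `-M12g` cap is an illustrative value that must be sized per host, not a measured recommendation:

```shell
# Hypothetical tuned RTS options (same flags as above, plus -F/-M, -N pinned to 8):
smp-server start +RTS -N8 -A16m -F1.5 -M12g -I0.01 -Iw15 -s -RTS
```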

**5. Add observability**: log `loadedQueueCount` periodically to track cache growth. This metric exists in Prometheus but should also appear in the regular stats logs.
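A sketch of what in-process logging could look like, again with a `TVar (Map ...)` stand-in for the cache (the interval and log format are placeholders, not the server's stats machinery):

```haskell
import Control.Concurrent (threadDelay)
import Control.Concurrent.STM
import Control.Monad (forever)
import qualified Data.Map.Strict as M

-- Build one report line; separated from the loop so it is easy to test.
reportCacheSize :: TVar (M.Map k v) -> IO String
reportCacheSize cache = do
  n <- M.size <$> readTVarIO cache
  pure ("loadedQueueCount=" ++ show n)

-- Background loop a server could fork at startup.
logCacheSizeLoop :: TVar (M.Map k v) -> IO ()
logCacheSizeLoop cache = forever $ do
  threadDelay (300 * 1000000) -- every 5 minutes (placeholder interval)
  reportCacheSize cache >>= putStrLn
```

A steadily climbing `loadedQueueCount` under stable client counts would confirm Root Cause #1 directly in the logs, without needing a Prometheus scrape.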