|
| 1 | +# Trinity LLM Scale Report: Multi-Node Distributed Inference |
| 2 | + |
| 3 | +## Key Metrics |
| 4 | + |
| 5 | +| Metric | Single Node | v1 (per-token) | v2 (batched) | v1->v2 | |
| 6 | +|--------|------------|-----------------|-------------|--------| |
| 7 | +| Prefill (20 tokens) | 52s | 77s | **39s** | **2x faster** | |
| 8 | +| Decode (per token) | ~2.6s | ~1.7s | **~1.1s** | **1.5x faster** | |
| 9 | +| Total (20+20 tokens) | ~105s | 143s | **83s** | **1.7x faster** | |
| 10 | +| Memory per node | ~1.2GB | ~600MB | ~600MB | Same | |
| 11 | +| Network transfer/prefill | 0 | 20x 8KB = 160KB | 1x 160KB | **1 round-trip** | |
| 12 | +| Network fraction | 0% | ~100% | **56.9%** | Measurable | |
| 13 | + |
| 14 | +## Architecture: Pipeline Parallelism |
| 15 | + |
| 16 | +``` |
| 17 | +[Coordinator: layers 0-10] [Worker: layers 11-21] |
| 18 | + embed(all tokens) |
| 19 | + forwardShard(all tokens) |
| 20 | + TCP send ALL hidden_states --------> recv batch (160KB) |
| 21 | + (1 round-trip for prefill) forwardShard(each sequentially) |
| 22 | + computeLogits + sample (each) |
| 23 | + recv ALL tokens <------------------ TCP send batch response |
| 24 | +
|
| 25 | + [decode: single-token per RT] [decode: single-token per RT] |
| 26 | +``` |
| 27 | + |
| 28 | +### v2 Optimizations |
| 29 | + |
| 30 | +1. **Batch prefill**: All prompt hidden states sent in 1 TCP round-trip (was 20 separate round-trips) |
| 31 | +2. **TCP_NODELAY**: Disabled Nagle's algorithm on both coordinator and worker sockets |
| 32 | +3. **Coalesced writes**: Header + payload combined into single `write()` syscall |
| 33 | +4. **Pre-allocated buffers**: Worker reuses `output_buf`, `logits_buf`, `probs_buf` (zero heap allocs per token) |
| 34 | +5. **Zero-alloc methods**: `computeLogitsInto()` and `sampleFromLogitsInto()` write into caller buffers |
| 35 | +6. **Timing instrumentation**: Compute vs network breakdown per phase |
| 36 | + |
| 37 | +### Design Decisions |
| 38 | + |
| 39 | +1. **Worker-side sampling**: Worker samples token and returns 4 bytes instead of 128KB logits (32000x less traffic) |
| 40 | +2. **Persistent TCP connection**: Single keepalive connection per generation session |
| 41 | +3. **Tied embeddings**: Worker loads embedding table for output projection (TinyLlama ties weights) |
| 42 | +4. **KV caches per shard**: Each node maintains KV caches only for its local layers |
| 43 | +5. **Partial model loading**: `loadPartialWeights(start, end, embed, output)` loads only required layers |
| 44 | + |
| 45 | +## Detailed Profile (v2 Batched) |
| 46 | + |
| 47 | +``` |
| 48 | +╔══════════════════════════════════════════════════════════╗ |
| 49 | +║ DISTRIBUTED INFERENCE PROFILE ║ |
| 50 | +╠══════════════════════════════════════════════════════════╣ |
| 51 | +║ Prefill: 20 tokens |
| 52 | +║ Local compute: 13,874ms (coordinator layers 0-10) |
| 53 | +║ Network (batch): 24,877ms (worker layers 11-21 + sampling) |
| 54 | +║ Total prefill: 38,751ms |
| 55 | +║ Decode: 20 tokens |
| 56 | +║ Total compute: 21,968ms (coordinator local layers) |
| 57 | +║ Total network: 22,367ms (worker forward + response) |
| 58 | +║ Total decode: 44,335ms |
| 59 | +║ Network fraction: 56.9% |
| 60 | +║ Total: 83,093ms |
| 61 | +╚══════════════════════════════════════════════════════════╝ |
| 62 | +``` |
| 63 | + |
| 64 | +## What This Means |
| 65 | + |
| 66 | +### For localhost (same machine) |
| 67 | +Both nodes share the same CPU and memory bandwidth. Prefill improved from 77s to 39s by eliminating 19 TCP round-trips. Decode improved from 1.7s to 1.1s/token via TCP_NODELAY + zero-alloc. Total: 143s -> 83s (1.7x improvement). Memory per node remains halved (~600MB). |
| 68 | + |
| 69 | +### For multi-machine deployment |
| 70 | +On separate machines with dedicated RAM and CPU: |
| 71 | +- Coordinator and worker compute **in parallel** (currently sequential on localhost) |
| 72 | +- Expected prefill: **~25s** (coordinator 14s local + worker 25s remote, overlapped) |
| 73 | +- Expected decode: **~1.1s/token** (similar, pipeline overlap) |
| 74 | +- Memory per machine: **50% reduction** -- enables models that exceed single-machine RAM |
| 75 | + |
| 76 | +### For scaling beyond 2 nodes |
| 77 | +The `ShardConfig.autoSplit()` handles 2-node splits. N-node splits require: |
| 78 | +- Chain of TCP connections (node 0 -> node 1 -> ... -> node N-1) |
| 79 | +- Last node samples and returns token to coordinator |
| 80 | +- Linear pipeline depth scales with N |
| 81 | + |
| 82 | +## Technical Details |
| 83 | + |
| 84 | +### Files Modified/Created |
| 85 | + |
| 86 | +| File | Change | |
| 87 | +|------|--------| |
| 88 | +| `src/vibeec/gguf_model.zig` | `loadPartialWeights()`, `forwardShard()`, `computeLogits()`, `sampleFromLogits()`, `computeLogitsInto()`, `sampleFromLogitsInto()` | |
| 89 | +| `src/trinity_node/protocol.zig` | `ForwardRequest`/`ForwardResponse` + `BatchForwardRequest`/`BatchForwardResponse` (0x11-0x14) | |
| 90 | +| `src/trinity_node/distributed.zig` | `ShardConfig`, `PipelineWorker` (pre-alloc buffers, batch handler), `PipelineCoordinator` (batch prefill, timing), `setTcpNodelay()` | |
| 91 | +| `src/trinity_node/main.zig` | `--distributed` CLI flag | |
| 92 | +| `build.zig` | `gguf_model_mod` module for trinity-node | |
| 93 | +| `src/tri/tri_utils.zig` | `.distributed` command | |
| 94 | +| `src/tri/main.zig` + `tri_commands.zig` | dispatch + `runDistributedCommand()` | |
| 95 | + |
| 96 | +### Network Protocol |
| 97 | + |
| 98 | +``` |
| 99 | +ForwardRequest (8220 bytes for TinyLlama, single-token decode): |
| 100 | + TRIN header: [4B magic] [1B type=0x11] [4B length] |
| 101 | + Payload: [4B seq_id] [4B pos] [4B hidden_size] [4B temp] [hidden_size*4B data] |
| 102 | +
|
| 103 | +BatchForwardRequest (~164KB for 20-token prefill): |
| 104 | + TRIN header: [4B magic] [1B type=0x13] [4B length] |
| 105 | + Payload: [4B seq_id] [4B batch_size] [4B hidden_size] [4B temp] |
| 106 | + per token: [4B pos] [hidden_size*4B data] |
| 107 | +
|
| 108 | +ForwardResponse (12 bytes): |
| 109 | + [4B seq_id] [4B pos] [4B token] |
| 110 | +
|
| 111 | +BatchForwardResponse (8 + batch_size*4 bytes): |
| 112 | + [4B seq_id] [4B batch_size] [batch_size * 4B tokens] |
| 113 | +``` |
| 114 | + |
| 115 | +### CLI Usage |
| 116 | + |
| 117 | +```bash |
| 118 | +# Terminal 1 (Worker) |
| 119 | +./zig-out/bin/trinity-node --distributed --role worker \ |
| 120 | + --model models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \ |
| 121 | + --layers 11-21 --port 9335 |
| 122 | + |
| 123 | +# Terminal 2 (Coordinator) |
| 124 | +./zig-out/bin/trinity-node --distributed --role coordinator \ |
| 125 | + --model models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \ |
| 126 | + --layers 0-10 --peer 127.0.0.1:9335 \ |
| 127 | + --prompt "Hello, how are you?" --max-tokens 20 --temperature 0.7 |
| 128 | +``` |
| 129 | + |
| 130 | +## Test Results |
| 131 | + |
| 132 | +### v1 Baseline (2026-02-08, per-token TCP) |
| 133 | + |
| 134 | +``` |
| 135 | +Model: TinyLlama 1.1B Chat Q4_K_M (638MB GGUF) |
| 136 | +Platform: macOS arm64 (Apple Silicon), Zig 0.15.2 ReleaseFast |
| 137 | +Nodes: 2 (localhost) |
| 138 | +
|
| 139 | +Prefill: 20 tokens in 77,344ms (3.9s/token, 20 TCP round-trips) |
| 140 | +Decode: 21 tokens, avg 1.7s/token |
| 141 | +Total: 142,913ms |
| 142 | +``` |
| 143 | + |
| 144 | +### v2 Optimized (2026-02-08, batched prefill) |
| 145 | + |
| 146 | +``` |
| 147 | +Model: TinyLlama 1.1B Chat Q4_K_M (638MB GGUF) |
| 148 | +Platform: macOS arm64 (Apple Silicon), Zig 0.15.2 ReleaseFast |
| 149 | +Nodes: 2 (localhost) |
| 150 | +
|
| 151 | +Prefill: 20 tokens in 38,751ms (local=13,874ms, net=24,877ms, 1 batch RT) |
| 152 | +Decode: 20 tokens, avg 1.1s/token (compute=22s, net=22s) |
| 153 | +Total: 83,093ms |
| 154 | +Network fraction: 56.9% |
| 155 | +Improvement: 1.7x faster total, 2x faster prefill, 1.5x faster decode |
| 156 | +``` |
| 157 | + |
| 158 | +## Conclusion |
| 159 | + |
| 160 | +Distributed inference v2 with batch prefill reduces total time by **1.7x** on localhost: |
| 161 | +- Prefill: 77s -> 39s (2x, via batch TCP) |
| 162 | +- Decode: 1.7s -> 1.1s/token (1.5x, via TCP_NODELAY + zero-alloc) |
| 163 | +- Network fraction now measurable: 56.9% |
| 164 | + |
| 165 | +### Next Steps |
| 166 | + |
| 167 | +1. **Multi-machine test**: Deploy on 2 separate VPS to measure real parallel speedup |
| 168 | +2. **Tokenizer integration**: GGUF tokenizer for coherent text output |
| 169 | +3. **Larger models**: Qwen2.5 7B Q4_K_M (requires download, ~4GB per shard) |
| 170 | +4. **N-way pipeline**: Extend for >2 nodes |
| 171 | +5. **Tensor parallelism**: Split matmul across nodes (complementary to pipeline) |
0 commit comments