Commit af2d6c1

gHashTag and claude committed
feat: spider-web logo + Fibonacci spiral formulas + zero-gravity physics
- Inverted logo: black petals with white outlines (spider-web effect)
- White hover highlight on logo petals with white tooltip (black text)
- 42 sacred formulas orbit in Fibonacci golden-angle spiral
- Alternating rotation directions per ring layer
- Formulas stop on mouse hover (not scatter away)
- Click formula to expand description
- Sacred world panels fully opaque pure black background
- Panel slot reuse (fixes count overflow after many opens)
- Fixed applyMouse() vertex rotation matching draw()
- ESC hides panels (not exit), Cmd+Q to quit

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
1 parent 763b1a1 commit af2d6c1

2 files changed

Lines changed: 323 additions & 64 deletions

docs/trinity_llm_scale_report.md

Lines changed: 69 additions & 14 deletions
@@ -234,28 +234,83 @@ Improvement over v2 localhost: 1.8x faster total, 1.9x faster prefill, 1.6x fast
234234
Improvement over v1: 3x faster total, 3.7x faster prefill, 2.4x faster decode
235235
```
236236

237+
### v4 3-Node Pipeline (2026-02-08, relay chain, multi-machine)
238+
239+
```
240+
Model: TinyLlama 1.1B Chat Q4_K_M (638MB GGUF)
241+
Coordinator: macOS arm64 (Apple Silicon), layers 0-7 (8 layers)
242+
Relay: macOS arm64 (same machine as coordinator), layers 8-14 (7 layers)
243+
Worker: Ubuntu 24.04 x86_64 (Intel Xeon, VPS), layers 15-21 (7 layers)
244+
Network: Coordinator → Relay (localhost) → Worker (internet)
245+
246+
Prefill: 20 tokens in 31,369ms (local=9,401ms, net=21,968ms)
247+
Decode: 20 tokens, avg 1,193ms/token (compute=15,025ms, net=23,675ms)
248+
Total: 70,073ms
249+
Network fraction: 65.1%
250+
Note: Relay shares CPU with coordinator (2 processes on 1 Mac)
251+
```
252+
253+
## Detailed Profile (v4 — 3-Node Multi-Machine)
254+
255+
```
256+
╔══════════════════════════════════════════════════════════╗
257+
║ DISTRIBUTED INFERENCE PROFILE (3-Node Multi-Machine) ║
258+
╠══════════════════════════════════════════════════════════╣
259+
║ Topology: Coordinator(Mac) → Relay(Mac) → Worker(VPS)
260+
║ Layers: [0-7] → [8-14] → [15-21]
261+
262+
║ Prefill: 20 tokens
263+
║ Local compute: 9,401ms (coordinator 8 layers)
264+
║ Network (batch): 21,968ms (relay 7 layers + worker 7 layers + sampling)
265+
║ Total prefill: 31,369ms
266+
║ Decode: 20 tokens
267+
║ Total compute: 15,025ms (coordinator local layers)
268+
║ Total network: 23,675ms (relay + worker + 2x TCP round-trips)
269+
║ Total decode: 38,700ms
270+
║ Avg per token: ~1,193ms (vs 706ms with 2-node)
271+
║ Network fraction: 65.1%
272+
║ Total: 70,073ms
273+
╚══════════════════════════════════════════════════════════╝
274+
```
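
The totals in the profile above can be rechecked from their components. The following is a minimal Python sketch, not part of the benchmark tooling; the variable names are illustrative and the values are copied from the profile.

```python
# Timings copied from the v4 3-node profile, in milliseconds.
prefill_local_ms = 9_401
prefill_net_ms = 21_968
decode_compute_ms = 15_025
decode_net_ms = 23_675
reported_total_ms = 70_073
decode_tokens = 20

# Each phase total is the sum of its local-compute and network parts.
prefill_ms = prefill_local_ms + prefill_net_ms
decode_ms = decode_compute_ms + decode_net_ms

# Network fraction: all network time divided by the reported end-to-end total.
network_fraction = (prefill_net_ms + decode_net_ms) / reported_total_ms

# Decode total spread evenly over the decoded tokens.
decode_ms_per_token = decode_ms / decode_tokens

print(f"prefill total:    {prefill_ms} ms")
print(f"decode total:     {decode_ms} ms")
print(f"network fraction: {network_fraction:.1%}")
print(f"decode per token: {decode_ms_per_token:.0f} ms")
```

The prefill total, decode total, and network fraction match the profile. Dividing the decode total by the 20 decoded tokens gives about 1,935 ms per token, so the reported ~1,193 ms average presumably rests on a different basis that the report does not spell out.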
275+
276+
### 3-Node Analysis
277+
278+
The 3-node test on 2 machines shows 70s total (vs 47s with 2-node). This is expected because:
279+
280+
1. **CPU contention**: Relay shares the Mac's CPU with coordinator (2 processes on 1 machine)
281+
2. **Extra TCP hop**: Each decode token requires 2 round-trips instead of 1 (coordinator→relay→worker→relay→coordinator)
282+
3. **Memory per node**: ~400MB (33% of model each, vs 50% with 2-node split)
283+
284+
**On 3 separate machines**, expected performance:
285+
- Prefill: ~12s (coordinator 9s, relay and worker compute in parallel)
286+
- Decode: ~0.7s/token (pipeline overlap, but 2 network hops add ~200ms)
287+
- Total: ~26s (theoretical optimum with full parallelism)
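
The projected total follows from the two estimates above, assuming the same 20-token decode as the measured runs. A small Python sketch of that arithmetic (the function name is illustrative, not from the codebase):

```python
def projected_total_s(prefill_s: float, decode_s_per_token: float, decode_tokens: int) -> float:
    """Projected end-to-end time: one prefill pass plus per-token decode cost."""
    return prefill_s + decode_s_per_token * decode_tokens


# Estimates from the list above; 20 decode tokens is assumed to match the benchmark.
total_s = projected_total_s(prefill_s=12.0, decode_s_per_token=0.7, decode_tokens=20)
print(f"projected total: ~{total_s:.0f}s")
```

With these inputs the decode phase contributes 14s, which together with the ~12s prefill gives the ~26s figure.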
288+
237289
## Conclusion
238290

239-
Distributed inference v3 on separate machines achieves **3x speedup** over v1:
291+
Distributed inference v4 adds **N-node pipeline support** with a relay chain:
240292

241-
| Version | Total Time | vs v1 |
242-
|---------|-----------|-------|
243-
| v1 (per-token, localhost) | 143s | baseline |
244-
| v2 (batched, localhost) | 83s | 1.7x |
245-
| **v3 (batched, multi-machine)** | **47s** | **3x** |
293+
| Version | Nodes | Total Time | vs v1 | Topology |
294+
|---------|-------|-----------|-------|----------|
295+
| v1 (per-token, localhost) | 2 | 143s | baseline | Coordinator + Worker |
296+
| v2 (batched, localhost) | 2 | 83s | 1.7x | Coordinator + Worker |
297+
| v3 (batched, 2-machine) | 2 | **47s** | **3x** | Coordinator(Mac) + Worker(VPS) |
298+
| v4 (batched, 3-node, 2-machine) | 3 | 70s | 2x | Coordinator + Relay(Mac) + Worker(VPS) |
299+
| v4 (3 machines, projected) | 3 | ~26s | ~5.5x | Each node on separate CPU |
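
The "vs v1" column is the v1 total divided by each version's total. A short Python sketch that recomputes it from the table's rounded totals (the dictionary keys are shortened labels chosen here for illustration):

```python
# Total times in seconds, copied from the table; v1 is the baseline.
V1_TOTAL_S = 143
totals_s = {
    "v2 (batched, localhost)": 83,
    "v3 (batched, 2-machine)": 47,
    "v4 (3-node, 2-machine)": 70,
    "v4 (3 machines, projected)": 26,
}

# Speedup over v1 = baseline time / version time.
speedups = {name: V1_TOTAL_S / total for name, total in totals_s.items()}

for name, speedup in speedups.items():
    print(f"{name}: {speedup:.1f}x")
```

The results round to 1.7x, 3.0x, 2.0x, and 5.5x, in line with the table.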
246300

247-
- Prefill: 77s → 39s → **21s** (3.7x, batch TCP + parallel compute)
248-
- Decode: 1.7s → 1.1s → **0.7s/token** (2.4x, dedicated CPUs + zero-alloc)
249-
- Network fraction: 51.7% (compute-bound, not network-bound)
250-
- Cross-platform: Single Zig codebase compiles to macOS arm64 + Linux x86_64 with zero dependencies
301+
- **N-node pipeline proven**: PipelineRelay chains coordinator → relay → worker correctly
302+
- **No protocol changes**: Relay reuses existing ForwardRequest/ForwardResponse messages
303+
- **autoSplitN()**: Divides any model's layers evenly across N nodes
304+
- **Cross-platform**: macOS arm64 coordinator + Linux x86_64 worker, zero dependencies
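
The actual `autoSplitN()` implementation is not shown in this diff. As an illustration only, the Python sketch below splits layers as evenly as possible and hands any remainder to the earliest nodes. That remainder rule is an assumption, chosen because it reproduces the [0-7] → [8-14] → [15-21] split reported for the 22-layer model; `auto_split_n` is a hypothetical name.

```python
def auto_split_n(num_layers: int, num_nodes: int) -> list[tuple[int, int]]:
    """Return inclusive (first_layer, last_layer) ranges, one per node.

    Hypothetical sketch: layers are divided evenly and leftover layers
    go to the earliest nodes in the pipeline.
    """
    if num_nodes < 1 or num_layers < num_nodes:
        raise ValueError("need at least one layer per node")

    base, extra = divmod(num_layers, num_nodes)
    ranges = []
    start = 0
    for node in range(num_nodes):
        # The first `extra` nodes each take one additional layer.
        count = base + (1 if node < extra else 0)
        ranges.append((start, start + count - 1))
        start += count
    return ranges


# 22 layers over coordinator, relay, and worker.
print(auto_split_n(22, 3))
```

For 22 layers on 3 nodes this yields 8, 7, and 7 layers, which matches the coordinator, relay, and worker assignments in the v4 run.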
251305

252306
### Key Finding
253307
The dominant bottleneck on localhost was **CPU contention**, not network. When each node has its own CPU, pipeline parallelism delivers the expected parallel speedup. Network adds ~100ms RTT overhead per decode step but this is dwarfed by the compute savings from eliminating contention.
254308

255309
### Next Steps
256310

257311
1. ~~**Multi-machine test**: Deploy on 2 separate machines to measure real parallel speedup~~ **DONE**
258-
2. **Tokenizer integration**: GGUF tokenizer for coherent text output
259-
3. **Larger models**: Qwen2.5 7B Q4_K_M (requires download, ~4GB per shard)
260-
4. **N-way pipeline**: Extend for >2 nodes
261-
5. **Tensor parallelism**: Split matmul across nodes (complementary to pipeline)
312+
2. ~~**N-way pipeline**: Extend for >2 nodes~~ **DONE** (PipelineRelay)
313+
3. **3 separate machines**: Deploy on 3 VPS to measure real 3-way parallel speedup
314+
4. **Tokenizer integration**: GGUF tokenizer for coherent text output
315+
5. **Larger models**: Qwen2.5 7B Q4_K_M (requires download, ~4GB per shard)
316+
6. **Tensor parallelism**: Split matmul across nodes (complementary to pipeline)
