Commit 99639ff

gHashTag and ona-agent committed
feat: configure Fly.io Volumes for NVMe SSD storage
CRITICAL FIX: Use NVMe volume instead of ephemeral disk

Volume performance (performance-16x):
- NVMe: 32,000 IOPs, 128 MiB/s
- Ephemeral: 2,000 IOPs, 8 MiB/s
- Improvement: 16x faster I/O

Changes:
- fly.toml: Add [[mounts]] for trinity_models volume
- Dockerfile: Use entrypoint.sh for model initialization
- entrypoint.sh: Download model to volume on first run

Expected load time: 208s → ~13s

Co-authored-by: Ona <no-reply@ona.com>
1 parent afd50f8 commit 99639ff

4 files changed

Lines changed: 104 additions & 24 deletions

Dockerfile

Lines changed: 8 additions & 12 deletions
@@ -1,5 +1,7 @@
 # TRINITY LLM - Zig-based LLM Inference Engine
 # phi^2 + 1/phi^2 = 3 = TRINITY
+#
+# Uses Fly.io Volumes for NVMe SSD storage (16x faster than ephemeral)
 
 FROM debian:bookworm-slim AS builder
 
@@ -36,23 +38,17 @@ WORKDIR /app
 # Copy binary from builder
 COPY --from=builder /build/vibee /app/vibee
 
-# Create models directory
-RUN mkdir -p /app/models
-
-# Download SmolLM2-1.7B Q8_0 (better quality, larger model)
-# Size: ~1.8GB, loads in ~10-15 seconds
-# For smaller/faster option, use SmolLM2-360M or SmolLM-135M
-RUN echo "Downloading SmolLM2-1.7B-Instruct Q8_0..." && \
-    curl -L -o /app/models/smollm2-1.7b-instruct-q8_0.gguf \
-    "https://huggingface.co/bartowski/SmolLM2-1.7B-Instruct-GGUF/resolve/main/SmolLM2-1.7B-Instruct-Q8_0.gguf" && \
-    ls -la /app/models/
+# Copy entrypoint script
+COPY entrypoint.sh /app/entrypoint.sh
+RUN chmod +x /app/entrypoint.sh
 
 # Set environment
-ENV MODEL_PATH=/app/models/smollm2-1.7b-instruct-q8_0.gguf
+# MODEL_PATH points to volume mount (NVMe SSD)
+ENV MODEL_PATH=/data/models/smollm2-1.7b-instruct-q8_0.gguf
 ENV TEMPERATURE=0.7
 ENV TOP_P=0.9
 ENV NUM_THREADS=16
 
 # Run HTTP API server
 EXPOSE 8080
-CMD ["/app/vibee", "serve", "--model", "/app/models/smollm2-1.7b-instruct-q8_0.gguf", "--port", "8080"]
+ENTRYPOINT ["/app/entrypoint.sh"]

docs/DISCOVERIES.md

Lines changed: 39 additions & 1 deletion
@@ -234,17 +234,55 @@ Dequantization and SIMD are fast - the bottleneck is FILE READ.
 
 ### Recommended Solutions
 
-1. **Fly.io Volumes** - Use local SSD storage (HIGH IMPACT)
+1. **Fly.io Volumes** - Use local SSD storage (HIGH IMPACT) ✅ IMPLEMENTED
 2. **Memory-map model** - mmap() for lazy loading (MEDIUM)
 3. **Smaller model** - Use 360M instead of 1.7B (WORKAROUND)
 4. **Pre-warm on deploy** - Keep model in memory (WORKAROUND)
 
 ---
 
+## Fly.io Volumes Configuration
+
+**Status**: ✅ Implemented
+
+### Volume Performance (performance-16x)
+
+| Storage Type | IOPs | Bandwidth |
+|--------------|------|-----------|
+| Ephemeral disk | 2,000 | 8 MiB/s |
+| **NVMe Volume** | **32,000** | **128 MiB/s** |
+| **Improvement** | **16x** | **16x** |
+
+### Configuration Changes
+
+**fly.toml:**
+```toml
+[[mounts]]
+source = "trinity_models"
+destination = "/data/models"
+initial_size = "3gb"
+```
+
+**entrypoint.sh:**
+- Downloads model to volume on first run
+- Subsequent starts use cached model (instant)
+- Model persists across deploys
+
+### Expected Impact
+
+| Metric | Before (Ephemeral) | After (Volume) |
+|--------|-------------------|----------------|
+| Load time | 208s | ~13s (estimated) |
+| First deploy | 208s | ~60s (download) |
+| Subsequent | 208s | ~13s |
+
+---
+
 ## Version History
 
 | Version | Date | Changes |
 |---------|------|---------|
+| v1.4.0 | 2026-02-02 | Fly.io Volumes for NVMe SSD storage |
 | v1.3.0 | 2026-02-02 | Load profiling - found I/O bottleneck |
 | v1.2.0 | 2026-02-02 | Parallel dequantization (OPT-003) |
 | v1.1.0 | 2026-02-02 | SIMD optimization (OPT-001) |
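
Note on the load-time estimate: the 208s → ~13s figure in the Expected Impact table is consistent with a load that is purely bandwidth-bound. The sketch below reproduces that arithmetic. The helper name `estimate_load_seconds` is hypothetical, and the 1664 MiB read size is not stated in the commit: it is derived here as 208 s × 8 MiB/s (about 1.7 GB, in line with the "~1.8GB" comment removed from the Dockerfile). Real load time also depends on IOPS, page cache, and dequantization, so treat the result as a rough bound rather than a measurement.

```shell
#!/bin/bash
# Hypothetical helper: estimate seconds needed to read SIZE_MIB at BANDWIDTH MiB/s.
# Assumes the load is limited only by sequential read bandwidth.
estimate_load_seconds() {
    local size_mib="$1" bandwidth_mib_s="$2"
    # Integer ceiling division: round partial seconds up.
    echo $(( (size_mib + bandwidth_mib_s - 1) / bandwidth_mib_s ))
}

# 1664 MiB is an assumption derived from the commit's numbers (208 s x 8 MiB/s).
estimate_load_seconds 1664 8     # ephemeral disk limit: prints 208
estimate_load_seconds 1664 128   # volume limit:         prints 13
```

The two results differ by exactly the 16x bandwidth ratio quoted in the Volume Performance table.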

entrypoint.sh

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
+#!/bin/bash
+# ═══════════════════════════════════════════════════════════════════════════════
+# TRINITY LLM - Entrypoint Script
+# Downloads model to NVMe volume on first run
+# φ² + 1/φ² = 3 = TRINITY
+# ═══════════════════════════════════════════════════════════════════════════════
+
+set -e
+
+MODEL_DIR="/data/models"
+MODEL_FILE="smollm2-1.7b-instruct-q8_0.gguf"
+MODEL_PATH="${MODEL_DIR}/${MODEL_FILE}"
+MODEL_URL="https://huggingface.co/bartowski/SmolLM2-1.7B-Instruct-GGUF/resolve/main/SmolLM2-1.7B-Instruct-Q8_0.gguf"
+
+echo "╔══════════════════════════════════════════════════════════════╗"
+echo "║ TRINITY LLM - Volume Initialization ║"
+echo "╚══════════════════════════════════════════════════════════════╝"
+
+# Create models directory on volume
+mkdir -p "${MODEL_DIR}"
+
+# Check if model exists on volume
+if [ -f "${MODEL_PATH}" ]; then
+    echo "✓ Model found on NVMe volume: ${MODEL_PATH}"
+    ls -lh "${MODEL_PATH}"
+else
+    echo "⚡ Model not found on volume. Downloading to NVMe SSD..."
+    echo " This is a one-time operation. Future starts will be instant."
+    echo ""
+    echo " Downloading: ${MODEL_FILE}"
+    echo " From: ${MODEL_URL}"
+    echo ""
+
+    # Download with progress
+    curl -L --progress-bar -o "${MODEL_PATH}" "${MODEL_URL}"
+
+    echo ""
+    echo "✓ Download complete!"
+    ls -lh "${MODEL_PATH}"
+fi
+
+echo ""
+echo "Starting TRINITY LLM server..."
+echo "Model: ${MODEL_PATH}"
+echo ""
+
+# Start the server
+exec /app/vibee serve --model "${MODEL_PATH}" --port 8080
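
Review note on the download step: entrypoint.sh writes directly to `${MODEL_PATH}` and treats any existing file as a valid model. If a download is interrupted, the partial file stays on the volume and passes the `[ -f ]` check on the next start. Without `--fail`, curl can also save an HTTP error page under the model's name. One possible hardening, sketched below, is to download to a temporary name and rename only on success. The function name `fetch_model` and the `.part` suffix are hypothetical and not part of this commit.

```shell
#!/bin/bash
# Hypothetical helper: download URL to DEST so that DEST is either absent or complete.
fetch_model() {
    local url="$1" dest="$2"
    local tmp="${dest}.part"   # assumed temporary name, on the same filesystem as DEST

    # A complete file is already in place: nothing to do.
    if [ -f "${dest}" ]; then
        return 0
    fi

    mkdir -p "$(dirname "${dest}")"

    # --fail makes curl exit non-zero on HTTP errors instead of saving the error body.
    if ! curl -L --fail --progress-bar -o "${tmp}" "${url}"; then
        rm -f "${tmp}"
        return 1
    fi

    # A rename within one filesystem is atomic, so readers never see a partial DEST.
    mv "${tmp}" "${dest}"
}

# Possible use in entrypoint.sh (commented out so this sketch has no side effects):
# fetch_model "${MODEL_URL}" "${MODEL_PATH}" || { echo "Model download failed" >&2; exit 1; }
```

This sketch does not verify file contents. Comparing a checksum before the rename would be a further step, but the commit does not provide one.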

fly.toml

Lines changed: 9 additions & 11 deletions
@@ -11,27 +11,25 @@ primary_region = "iad"
 dockerfile = "Dockerfile"
 
 [env]
-MODEL_PATH = "/app/models/smollm2-1.7b-instruct-q8_0.gguf"
+# Model path on persistent volume (NVMe SSD - 16x faster than ephemeral)
+MODEL_PATH = "/data/models/smollm2-1.7b-instruct-q8_0.gguf"
 TEMPERATURE = "0.7"
 TOP_P = "0.9"
 NUM_THREADS = "16"
 
 # SmolLM2-1.7B requires more RAM (~4GB for model + buffers)
-# performance-4x: 4 dedicated CPU cores, 8GB RAM
 [[vm]]
 size = "performance-16x"
 memory = "32gb"
 cpus = 16
 
-# Alternative sizes:
-# performance-8x: 8 CPU, 16GB RAM (faster, more expensive)
-# performance-16x: 16 CPU, 32GB RAM (maximum speed)
-# shared-cpu-4x: 4 shared CPU, 8GB RAM (cheaper)
-
-# Persistent volume for models
-# [[mounts]]
-# source = "trinity_models"
-# destination = "/app/models"
+# Persistent volume for models (NVMe SSD)
+# Volume limits for performance-16x: 32,000 IOPs, 128 MiB/s
+# vs ephemeral disk: 2,000 IOPs, 8 MiB/s (16x slower!)
+[[mounts]]
+source = "trinity_models"
+destination = "/data/models"
+initial_size = "3gb"
 
 # HTTP API service
 [http_service]
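
Review note on the `[[mounts]]` block: entrypoint.sh runs `mkdir -p /data/models` unconditionally, so if the volume is ever not attached, the model would be downloaded to the ephemeral root disk without any visible error, and the expected speedup would not apply. A guard along the following lines could make that case fail fast. The helper `is_mounted` is hypothetical, and its second parameter exists only so the check can be exercised against a sample file instead of `/proc/mounts`.

```shell
#!/bin/bash
# Hypothetical helper: succeed if TARGET appears as a mount point in the mounts table.
# Defaults to /proc/mounts, where the second field of each line is the mount point.
is_mounted() {
    local target="$1" mounts_file="${2:-/proc/mounts}"
    local device mount_point rest
    while read -r device mount_point rest; do
        if [ "${mount_point}" = "${target}" ]; then
            return 0
        fi
    done < "${mounts_file}"
    return 1
}

# Possible use in entrypoint.sh before the download (commented out, no side effects):
# if ! is_mounted /data/models; then
#     echo "Volume is not mounted at /data/models" >&2
#     exit 1
# fi
```

The comparison is an exact string match on the mount point, which is sufficient for a path such as `/data/models` that contains no spaces or escaped characters.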
