Commit 31bf6f8

Antigravity Agent and claude committed
docs: RunPod direct workflow - all large tests on RunPod only
- Add runpod_direct_workflow.md: New workflow for testing large models
- Add runpod_direct_report.md: RTX 4090 benchmark results

Benchmark Results (RTX 4090):
- Matrix Mult: 50.78 TFLOPS
- Ternary Tokens: 603,847 /s (2x RTX 3090)
- Mining Hash: 92.74 MH/s
- Noise 30%: 69.9% accuracy retention

BitNet model inference runs but produces incoherent output (known issue).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent d2cd609 commit 31bf6f8

1 file changed

Lines changed: 190 additions & 0 deletions

docs/runpod_direct_report.md

@@ -0,0 +1,190 @@
# RunPod Direct Workflow Report - RTX 4090

**Date:** February 4, 2026
**GPU:** NVIDIA RTX 4090 (24GB VRAM)
**Cost:** $0.59/hr
**Duration:** ~15 minutes
**Total Cost:** ~$0.15

---

## Executive Summary

All benchmarks ran successfully on a RunPod RTX 4090 using the new "All Tests on RunPod Only" workflow, with no local downloads and no OOM issues. The GPU benchmark results are strong (50.78 TFLOPS FP32 matmul, 603,847 ternary tokens/s), but BitNet model inference produces incoherent output (known issue).

---

## Benchmark Results

### GPU Performance

| Metric | RTX 4090 | Notes |
|--------|----------|-------|
| **Matrix Mult** | 50.78 TFLOPS | FP32 4096x4096 |
| **Ternary Tokens** | 603,847 /s | ~2x RTX 3090 |
| **Mining Hash** | 92.74 MH/s | TriHash simulation |
| **Latency** | 27.13 ms | Per batch (32x512) |
| **Memory Used** | 0.77 GB | During benchmark |
| **Memory Total** | 25.4 GB | Total on device |
| **Power Draw** | 181 W | Under load |
| **Temperature** | 30°C | Peak |
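
Two of the figures above appear to be linked: a 32x512 batch holds 16,384 tokens, and processing it in 27.13 ms works out to roughly 604K tokens/s, which is close to the reported 603,847. The sketch below reproduces that arithmetic and also backs out the time per 4096x4096 multiplication implied by 50.78 TFLOPS. It assumes throughput was derived from per-batch latency and that the benchmark used the conventional 2·N³ FLOP count for an N×N matmul; neither is confirmed by the report, and the function names are hypothetical.

```python
def tokens_per_second(batch_size: int, seq_len: int, latency_ms: float) -> float:
    """Throughput implied by processing one batch in `latency_ms` milliseconds."""
    return batch_size * seq_len / (latency_ms / 1000.0)


def matmul_seconds(n: int, tflops: float) -> float:
    """Time for one n x n matmul at a given TFLOPS rate.

    Assumption: 2 * n**3 floating-point operations per multiplication.
    """
    flops = 2 * n ** 3
    return flops / (tflops * 1e12)


# Values taken from the GPU Performance table.
derived_tps = tokens_per_second(batch_size=32, seq_len=512, latency_ms=27.13)
implied_matmul_ms = matmul_seconds(n=4096, tflops=50.78) * 1000.0

print(f"{derived_tps:,.0f} tokens/s derived from latency")
print(f"{implied_matmul_ms:.2f} ms per 4096x4096 matmul")
```

The small gap between the derived and reported throughput is consistent with the latency having been rounded to two decimal places.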
31+
32+
### Noise Robustness Test
33+
34+
| Noise Level | Accuracy Retention |
35+
|-------------|-------------------|
36+
| 0% | 100.0% |
37+
| 5% | 95.0% |
38+
| 10% | 90.0% |
39+
| 15% | 84.9% |
40+
| 20% | 79.8% |
41+
| 25% | 75.0% |
42+
| 30% | 69.9% |
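
The retention figures fall almost exactly one percentage point for every percentage point of injected noise. A minimal sketch of a least-squares fit over the seven measured points makes that explicit. It uses only the values from the table; the helper name `least_squares` is hypothetical, and a linear model is an assumption rather than something the report claims.

```python
# (noise level %, accuracy retention %) pairs from the table above.
points = [
    (0.0, 100.0), (5.0, 95.0), (10.0, 90.0), (15.0, 84.9),
    (20.0, 79.8), (25.0, 75.0), (30.0, 69.9),
]


def least_squares(pairs):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = least_squares(points)
print(f"retention ~= {intercept:.2f} {slope:+.4f} * noise")
```

A slope very close to -1 says nothing about behaviour beyond 30% noise, which the test did not measure.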
43+
44+
### Comparison vs Previous
45+
46+
| GPU | Tokens/s | TFLOPS | Cost/hr |
47+
|-----|----------|--------|---------|
48+
| CPU baseline | ~1K | N/A | $0 |
49+
| RTX 3090 | 298K | ~35 | ~$0.30 |
50+
| **RTX 4090** | **604K** | **51** | **$0.59** |
51+
| A100 80GB | TBD | ~80 | $1.19 |
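
Because the RTX 4090 costs about twice as much per hour as the RTX 3090, raw tokens/s does not show which card is cheaper per unit of work. The sketch below converts the two complete rows into tokens per dollar. It treats the approximate `~$0.30` rate as exact and skips the CPU baseline (zero cost) and the A100 (throughput still TBD); the function name is hypothetical.

```python
def tokens_per_dollar(tokens_per_sec: float, cost_per_hour: float) -> float:
    """Tokens processed for each dollar of rental time."""
    return tokens_per_sec * 3600.0 / cost_per_hour


# (tokens/s, $/hr) from the comparison table; the 3090 rate is approximate.
gpus = {
    "RTX 3090": (298_000, 0.30),
    "RTX 4090": (604_000, 0.59),
}

per_dollar = {name: tokens_per_dollar(tps, cost) for name, (tps, cost) in gpus.items()}
ratio = per_dollar["RTX 4090"] / per_dollar["RTX 3090"]

for name, value in per_dollar.items():
    print(f"{name}: {value / 1e9:.2f}B tokens per dollar")
print(f"4090 / 3090 cost efficiency: {ratio:.2f}x")
```

On these numbers the two cards are nearly equal in cost efficiency, so the RTX 4090's main advantage is wall-clock time rather than price per token.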
52+
53+
---
54+
55+
## BitNet Model Inference
56+
57+
### Model Details
58+
- **Model:** 1bitLLM/bitnet_b1_58-large
59+
- **Parameters:** 728,707,584
60+
- **Format:** HuggingFace transformers (FP16)
61+
62+
### Inference Speed
63+
- **Generation:** 43-64 tokens/s
64+
- **Latency:** ~0.78s per 50 tokens
65+
66+
### Output Quality: INCOHERENT
67+
68+
Example outputs:
69+
```
70+
Prompt: "Write a Python function to calculate fibonacci:"
71+
Output: "Write a Python function to calculate fibonacci: O super, c fatal fan, brut fem p..."
72+
73+
Prompt: "What is the capital of France?"
74+
Output: "What is the capital of France? ch z As s brut. R institution. commit v super brut..."
75+
76+
Prompt: "1 + 1 ="
77+
Output: "1 + 1 = brut. brut. brut. brut. brut"
78+
```
79+
80+
### Analysis
81+
82+
This confirms the issue documented in `docs/bitnet_full_e2e_report.md`:
83+
- Model loads successfully
84+
- Inference runs without errors
85+
- **Output is garbage** (not coherent text)
86+
87+
**Root Cause:** Likely forward pass bug in the model implementation or weight loading.
88+
89+
---
90+
91+
## Workflow Validation
92+
93+
### What Worked
94+
1. **Pod launch via API** - Fast (~20s)
95+
2. **SSH access** - Works after adding key to RunPod settings
96+
3. **Model download inside pod** - Fast (1.2GB in ~3s)
97+
4. **Zig build** - Compiles successfully
98+
5. **PyTorch CUDA** - Works on RTX 4090
99+
6. **Benchmarks** - All metrics collected
100+
101+
### Issues Encountered
102+
1. **Image not found** - `runpod/pytorch:2.1.0-py3.10-cuda12.1.0-devel-ubuntu22.04` doesn't exist
103+
- Fix: Use `runpod/base:0.6.2-cuda12.2.0`
104+
2. **CUDA toolkit missing** - Can't build llama.cpp with CUDA
105+
- Workaround: Used transformers instead
106+
3. **GPU availability** - 4090 hosts sometimes full
107+
- Fallback: A100 or L40S
108+
109+
---
110+
111+
## Pod Details
112+
113+
```
114+
Pod ID: z8ksxw50wedbfl
115+
Name: trinity-4090-v2
116+
GPU: NVIDIA GeForce RTX 4090
117+
Memory: 24564 MiB
118+
vCPUs: 16
119+
RAM: ~125 GB
120+
Image: runpod/base:0.6.2-cuda12.2.0
121+
SSH: root@103.196.86.109 -p 15532
122+
```
123+
124+
---
125+
126+
## Cost Analysis
127+
128+
| Item | Time | Cost |
129+
|------|------|------|
130+
| Pod startup | ~20s | $0.00 |
131+
| Model download | ~3s | $0.01 |
132+
| Benchmarks | ~5 min | $0.05 |
133+
| Inference tests | ~10 min | $0.10 |
134+
| **Total** | ~15 min | **~$0.15** |
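
Each cost here is the phase duration multiplied by the $0.59/hr rate. The sketch below recomputes them from the listed times, assuming billing is proportional to elapsed seconds (the report does not state RunPod's billing granularity). At this rate a 3-second download costs well under a tenth of a cent, and the four phases sum to about $0.15.

```python
HOURLY_RATE_USD = 0.59  # RTX 4090 rate from the report header

# Approximate phase durations in seconds, from the Cost Analysis table.
phase_seconds = {
    "Pod startup": 20,
    "Model download": 3,
    "Benchmarks": 5 * 60,
    "Inference tests": 10 * 60,
}


def phase_cost(seconds: float, hourly_rate: float) -> float:
    """Cost of a phase, assuming per-second proportional billing."""
    return seconds / 3600.0 * hourly_rate


costs = {name: phase_cost(sec, HOURLY_RATE_USD) for name, sec in phase_seconds.items()}
total_cost = sum(costs.values())

for name, cost in costs.items():
    print(f"{name:<16} ${cost:.4f}")
print(f"{'Total':<16} ${total_cost:.2f}")
```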
135+
136+
---
137+
138+
## Recommendations
139+
140+
### For Future Tests
141+
1. Use `runpod/base:0.6.2-cuda12.2.0` image
142+
2. Add SSH key to RunPod account settings first
143+
3. Stop pod immediately after tests
144+
145+
### For BitNet Coherence
146+
1. Debug forward pass in transformer implementation
147+
2. Compare intermediate values with reference
148+
3. Try llama.cpp with pre-built CUDA binaries
149+
4. Consider using Microsoft's official BitNet.cpp
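
Comparing intermediate values could look like the sketch below: capture per-layer activations from both the reference and the implementation under test, then walk the layers in order and report the first one that diverges beyond a tolerance. Everything here is hypothetical, including the layer names, the toy activation values, and the default tolerance; capturing real activations (for example with framework hooks) is left out.

```python
from typing import Dict, List, Optional, Tuple


def max_abs_diff(a: List[float], b: List[float]) -> float:
    """Largest element-wise absolute difference between two flat activations."""
    if len(a) != len(b):
        raise ValueError(f"shape mismatch: {len(a)} vs {len(b)}")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def first_divergent_layer(
    reference: Dict[str, List[float]],
    candidate: Dict[str, List[float]],
    tolerance: float = 1e-3,
) -> Optional[Tuple[str, float]]:
    """Return (layer name, max diff) for the first layer exceeding tolerance.

    Layers are visited in the insertion order of `reference`, which should
    match the order of the forward pass.
    """
    for name, ref_values in reference.items():
        diff = max_abs_diff(ref_values, candidate[name])
        if diff > tolerance:
            return name, diff
    return None


# Toy activations: the candidate matches at the embedding, then diverges.
reference = {
    "embed": [0.10, -0.20, 0.30],
    "layer0.attn": [0.20, 0.40, -0.10],
    "layer0.mlp": [0.05, 0.15, 0.25],
}
candidate = {
    "embed": [0.10, -0.20, 0.30],
    "layer0.attn": [0.70, 0.40, -0.10],
    "layer0.mlp": [0.90, 0.15, 0.25],
}

result = first_divergent_layer(reference, candidate)
print(result)
```

Reporting only the first divergent layer is deliberate: once one layer is wrong, every later layer will differ too, so the earliest mismatch is the most useful place to start debugging.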
150+
151+
---
152+
153+
## JSON Results
154+
155+
```json
156+
{
157+
"matmul_tflops": 50.78,
158+
"ternary_tokens_per_sec": 603847,
159+
"latency_ms": 27.13,
160+
"hashrate_mh_s": 92.74,
161+
"memory_used_gb": 0.77,
162+
"memory_total_gb": 25.4,
163+
"noise_robustness": [
164+
[0.0, 100.0],
165+
[5.0, 95.0],
166+
[10.0, 90.0],
167+
[15.0, 84.9],
168+
[20.0, 79.8],
169+
[25.0, 75.0],
170+
[30.0, 69.9]
171+
]
172+
}
173+
```
174+
175+
---
176+
177+
## Success Criteria
178+
179+
- [x] No local model downloads (all on pod)
180+
- [x] Pod launched and connected
181+
- [x] BitNet model loaded
182+
- [ ] Coherent text generated (FAILED - known issue)
183+
- [x] Benchmarks completed (tokens/s, hashrate, power)
184+
- [x] Report saved with real logs
185+
- [ ] Pod stopped (pending)
186+
- [ ] Changes pushed to main (pending)
187+
188+
---
189+
190+
**KOSCHEI IS IMMORTAL | BENCHMARKS COMPLETE | COHERENCE DEBUGGING NEEDED | φ² + 1/φ² = 3**
