Commit c7a9177

abrichr and claude authored
docs: add AWS spot instance cost analysis for GPU training (#100)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent dd6b6fc commit c7a9177

1 file changed

Lines changed: 74 additions & 0 deletions

docs/spot_instance_analysis.md

# AWS Spot Instance Analysis for GPU Training

**Date**: 2026-03-04
**Context**: Cost optimization for verl-agent/VAGEN RL training on AWS GPU instances

## Spot Pricing vs On-Demand

### Single-GPU (Development/Validation)

| Instance | GPU | VRAM | On-Demand | Spot | Savings |
|----------|-----|------|-----------|------|---------|
| g5.xlarge | 1x A10G | 24GB | $1.006/hr | $0.43/hr | 57% |
| g5.2xlarge | 1x A10G | 24GB | $1.21/hr | ~$0.52/hr | 57% |
| g6.xlarge | 1x L4 | 24GB | $0.805/hr | $0.38-0.55/hr | 32-53% |

### Multi-GPU (Production Training)

| Instance | GPUs | Total VRAM | On-Demand | Spot | Savings |
|----------|------|------------|-----------|------|---------|
| g5.12xlarge | 4x A10G | 96GB | $5.67/hr | $2.90/hr | 49% |
| g6.12xlarge | 4x L4 | 96GB | $4.60/hr | $2.26/hr | 51% |

**g6 (NVIDIA L4) is a viable alternative** to g5 (A10G): both have 24GB VRAM per GPU. The L4 has better inference performance, which suits RL's rollout-heavy workload.

## Cost Projections

### 7-Day Training Run (Multi-GPU)

| Strategy | Hourly | 7-Day (168 hr) | vs On-Demand g5.12xlarge |
|----------|--------|----------------|--------------------------|
| On-Demand g5.12xlarge | $5.67 | $953 | baseline |
| Spot g5.12xlarge | $2.90 | $487 | -49% |
| Spot g6.12xlarge | $2.26 | $380 | -60% |

With a ~7% interruption rate and 15-minute checkpoints, expect ~10-15% overhead from re-computation. Adjusted real savings: 42-55%.
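
The projection arithmetic can be checked with a short script. This is a minimal sketch: the hourly rates are copied from the tables above, and treating the 10-15% re-computation overhead as a simple multiplier on spot cost is an assumption made here, not something the analysis states. Under that assumption the adjusted savings come out at roughly 41% to 56%, close to the 42-55% quoted above.

```python
HOURS_7_DAYS = 7 * 24  # 168 hours

# Hourly rates (USD/hr) copied from the pricing tables.
ON_DEMAND_G5 = 5.67
SPOT_RATES = {"g5.12xlarge": 2.90, "g6.12xlarge": 2.26}


def run_cost(hourly_rate, hours=HOURS_7_DAYS, overhead=0.0):
    """Cost of a run; `overhead` is the fraction of work redone after interruptions.

    Modeling overhead as a multiplier on cost is an illustrative assumption.
    """
    return hourly_rate * hours * (1.0 + overhead)


def savings(baseline_cost, cost):
    """Fractional savings of `cost` relative to `baseline_cost`."""
    return 1.0 - cost / baseline_cost


baseline = run_cost(ON_DEMAND_G5)
print(f"on-demand g5.12xlarge: ${baseline:.0f}")
for name, rate in SPOT_RATES.items():
    for overhead in (0.0, 0.10, 0.15):
        cost = run_cost(rate, overhead=overhead)
        print(
            f"spot {name}, overhead {overhead:.0%}: "
            f"${cost:.0f} ({savings(baseline, cost):.0%} saved)"
        )
```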

## Interruption Risk

- AWS overall: 95% of spot instances run to completion
- GPU instances: estimated 5-10% interruption rate in trailing month
- **AZ-level variance is significant**: some AZs exceed 20% while others stay below 5%
- 2-minute termination warning via instance metadata
- For a 24-hour run with ~7% rate: expect 1-2 interruptions

## Recommendations

1. **Start with spot for dev/validation** (g5.xlarge at $0.43/hr)

2. **For multi-GPU production training**, use EC2 Fleet with:
   - Instance types: g5.12xlarge + g6.12xlarge (diversification)
   - Allocation strategy: `price-capacity-optimized`
   - Regions: us-east-1 primary, us-east-2 fallback

3. **Checkpoint to S3 every 15 minutes**: do NOT rely on EBS survival

4. **Add a termination handler**: poll instance metadata every 5s and trigger an immediate checkpoint on the 2-minute warning

5. **Set `DeleteOnTermination=false`** on EBS volumes, or better yet, use S3 for all checkpoints
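
The termination handler in recommendation 4 could look roughly like the sketch below. It polls the instance metadata service (IMDSv2) at `/latest/meta-data/spot/instance-action`, which returns 404 until an interruption is scheduled. The function names, the `on_interrupt` callback, and the injectable `fetch`/`sleep` parameters are hypothetical and exist only for illustration and testing; they are not part of `aws_vm.py`.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Callable, Optional

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout: float = 2.0) -> Optional[dict]:
    """Return the pending spot instance action, or None if none is scheduled."""
    # IMDSv2: obtain a short-lived session token first.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()

    action_req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(action_req, timeout=timeout) as resp:
            return json.loads(resp.read().decode())
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled
            return None
        raise


def watch_for_interruption(
    on_interrupt: Callable[[dict], None],
    fetch: Callable[[], Optional[dict]] = fetch_instance_action,
    poll_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    max_polls: Optional[int] = None,
) -> bool:
    """Poll until an interruption notice appears, then call `on_interrupt` once.

    Returns True if a notice was seen, False if `max_polls` ran out first.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        action = fetch()
        if action is not None:
            on_interrupt(action)  # e.g. save a checkpoint and upload it to S3
            return True
        polls += 1
        sleep(poll_seconds)
    return False
```

One way to use this is to run `watch_for_interruption(save_checkpoint)` in a daemon thread next to the training loop, where `save_checkpoint` is whatever checkpoint routine the trainer already exposes (a hypothetical name here).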

## Gotchas

- **EBS root volumes are deleted on spot termination by default**: set `DeleteOnTermination=false` or checkpoint to S3
- **p3 (V100) is incompatible** with the OSS NVIDIA driver (needs GSP/Ampere+)
- **g4dn (T4) has only 16GB VRAM**, likely insufficient for VLM RL
- **SageMaker Managed Spot** adds a 15-40% markup; not recommended for custom verl-agent loops
- **Reattaching EBS across AZs requires snapshots**: S3 checkpoints avoid this entirely
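
For the first gotcha, the launch parameters below show one way to request a spot instance while keeping its root volume after termination. This is a sketch under assumptions: the helper name, the `/dev/sda1` root device name (it depends on the AMI), and the volume size are illustrative, and the dict only mirrors the `InstanceMarketOptions` and `BlockDeviceMappings` parameters of the EC2 `RunInstances` API.

```python
from typing import Any, Dict


def spot_run_instances_kwargs(
    image_id: str,
    instance_type: str = "g5.xlarge",
    root_device: str = "/dev/sda1",  # assumption: check the AMI's root device name
    volume_gb: int = 200,            # assumption: illustrative size
) -> Dict[str, Any]:
    """Build keyword arguments for an EC2 RunInstances call on the spot market."""
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        "InstanceMarketOptions": {
            "MarketType": "spot",
            "SpotOptions": {
                "SpotInstanceType": "one-time",
                "InstanceInterruptionBehavior": "terminate",
            },
        },
        "BlockDeviceMappings": [
            {
                "DeviceName": root_device,
                "Ebs": {
                    "VolumeSize": volume_gb,
                    "VolumeType": "gp3",
                    # Keep the volume when the instance is terminated.
                    "DeleteOnTermination": False,
                },
            }
        ],
    }


# Usage (requires boto3 and AWS credentials; not executed here):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   ec2.run_instances(**spot_run_instances_kwargs("ami-..."))
```

A retained volume still lives in a single AZ, so this complements S3 checkpoints and does not replace them.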

## Implementation TODO

- [ ] Add spot instance support to `aws_vm.py` (`create_vm` with `InstanceMarketOptions`)
- [ ] Add S3 checkpoint upload to training loop
- [ ] Add termination handler (metadata polling + checkpoint trigger)
- [ ] Add g6 instance types to `GPU_INSTANCE_TYPE_FALLBACKS` in `aws_vm.py`
- [ ] Test EC2 Fleet with `price-capacity-optimized` allocation
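
As a starting point for the EC2 Fleet test, the request could be assembled along these lines. This is a sketch, not the project's implementation: `build_spot_fleet_request` and its parameters are hypothetical, it assumes a launch template already exists, and the resulting dict is meant to be passed to the EC2 `CreateFleet` API (for example `boto3.client("ec2").create_fleet(**request)`).

```python
from typing import Any, Dict, Sequence


def build_spot_fleet_request(
    launch_template_id: str,
    instance_types: Sequence[str] = ("g5.12xlarge", "g6.12xlarge"),
    subnet_ids: Sequence[str] = (),
    target_capacity: int = 1,
    version: str = "$Latest",
) -> Dict[str, Any]:
    """Build a CreateFleet request that diversifies across instance types."""
    if subnet_ids:
        # One override per (instance type, subnet) pair spreads the request across AZs.
        overrides = [
            {"InstanceType": itype, "SubnetId": subnet}
            for itype in instance_types
            for subnet in subnet_ids
        ]
    else:
        overrides = [{"InstanceType": itype} for itype in instance_types]

    return {
        "Type": "instant",  # synchronous, one-time request
        "SpotOptions": {"AllocationStrategy": "price-capacity-optimized"},
        "LaunchTemplateConfigs": [
            {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateId": launch_template_id,
                    "Version": version,
                },
                "Overrides": overrides,
            }
        ],
        "TargetCapacitySpecification": {
            "TotalTargetCapacity": target_capacity,
            "DefaultTargetCapacityType": "spot",
        },
    }
```

A fleet is scoped to one region, so the us-east-2 fallback would need a second request against a client created for that region.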
