|
| 1 | +# AWS Spot Instance Analysis for GPU Training |
| 2 | + |
| 3 | +**Date**: 2026-03-04 |
| 4 | +**Context**: Cost optimization for verl-agent/VAGEN RL training on AWS GPU instances |
| 5 | + |
| 6 | +## Spot Pricing vs On-Demand |
| 7 | + |
| 8 | +### Single-GPU (Development/Validation) |
| 9 | + |
| 10 | +| Instance | GPU | VRAM | On-Demand | Spot | Savings | |
| 11 | +|----------|-----|------|-----------|------|---------| |
| 12 | +| g5.xlarge | 1x A10G | 24GB | $1.006/hr | $0.43/hr | 57% | |
| 13 | +| g5.2xlarge | 1x A10G | 24GB | $1.21/hr | ~$0.52/hr | 57% | |
| 14 | +| g6.xlarge | 1x L4 | 24GB | $0.805/hr | $0.38-0.55/hr | 32-53% | |
| 15 | + |
| 16 | +### Multi-GPU (Production Training) |
| 17 | + |
| 18 | +| Instance | GPUs | VRAM | On-Demand | Spot | Savings | |
| 19 | +|----------|------|------|-----------|------|---------| |
| 20 | +| g5.12xlarge | 4x A10G | 96GB | $5.67/hr | $2.90/hr | 49% | |
| 21 | +| g6.12xlarge | 4x L4 | 96GB | $4.60/hr | $2.26/hr | 51% | |
| 22 | + |
| 23 | +**g6 (NVIDIA L4) is a viable alternative** to g5 (A10G) — both have 24GB VRAM per GPU. L4 has better inference performance, which suits RL's rollout-heavy workload. |
| 24 | + |
| 25 | +## Cost Projections |
| 26 | + |
| 27 | +### 7-Day Training Run (Multi-GPU) |
| 28 | + |
| 29 | +| Strategy | Hourly | 7-Day | vs On-Demand | |
| 30 | +|----------|--------|-------|-------------| |
| 31 | +| On-Demand g5.12xlarge | $5.67 | $953 | baseline | |
| 32 | +| Spot g5.12xlarge | $2.90 | $487 | -49% | |
| 33 | +| Spot g6.12xlarge | $2.26 | $380 | -60% | |
| 34 | + |
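| 35 | +With a ~7% interruption rate and 15-minute checkpoints, expect ~10-15% overhead from re-computation. Adjusted real savings against on-demand g5.12xlarge: roughly 41-56% (about 41-44% on spot g5.12xlarge, 54-56% on spot g6.12xlarge). |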
| 36 | + |
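The adjusted-savings figure can be reproduced with a few lines of arithmetic. The sketch below assumes the overhead simply stretches the billed spot hours by 10-15% and that on-demand g5.12xlarge is the baseline for both spot options; the function name is illustrative and not part of the codebase.

```python
HOURS_7_DAYS = 7 * 24  # 168 billed hours for an uninterrupted run


def adjusted_savings(on_demand_hourly, spot_hourly, overhead):
    """Fraction saved vs on-demand when the spot run bills (1 + overhead) times the hours."""
    effective_spot = spot_hourly * (1 + overhead)
    return 1 - effective_spot / on_demand_hourly


ON_DEMAND_G5 = 5.67  # on-demand g5.12xlarge, $/hr (from the table above)

for name, spot in [("g5.12xlarge", 2.90), ("g6.12xlarge", 2.26)]:
    for overhead in (0.10, 0.15):
        cost = spot * (1 + overhead) * HOURS_7_DAYS
        saved = adjusted_savings(ON_DEMAND_G5, spot, overhead)
        print(f"spot {name}, overhead {overhead:.0%}: ${cost:.0f} / 7 days, {saved:.0%} saved")
```

Under these assumptions the results fall in line with the adjusted savings range quoted above; a different overhead estimate can be plugged in directly.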
| 37 | +## Interruption Risk |
| 38 | + |
| 39 | +- AWS overall: 95% of spot instances run to completion |
| 40 | +- GPU instances: estimated 5-10% interruption rate in trailing month |
| 41 | +- **AZ-level variance is significant** — some AZs >20% while others <5% |
| 42 | +- 2-minute termination warning via instance metadata |
| 43 | +- For a 24-hour run with ~7% rate: expect 1-2 interruptions |
| 44 | + |
| 45 | +## Recommendations |
| 46 | + |
| 47 | +1. **Start with spot for dev/validation** (g5.xlarge at $0.43/hr) |
| 48 | + |
| 49 | +2. **For multi-GPU production training**, use EC2 Fleet with: |
| 50 | + - Instance types: g5.12xlarge + g6.12xlarge (diversification) |
| 51 | + - Allocation strategy: `price-capacity-optimized` |
| 52 | +   - Regions: us-east-1 primary, us-east-2 fallback (an EC2 Fleet is scoped to a single region, so the fallback needs its own fleet request) |
| 53 | + |
| 54 | +3. **Checkpoint to S3 every 15 minutes** — do NOT rely on EBS survival |
| 55 | + |
| 56 | +4. **Add a termination handler**: poll instance metadata (`spot/instance-action`) every 5s and trigger an immediate checkpoint on the 2-minute warning |
| 57 | + |
| 58 | +5. **Set `DeleteOnTermination=false`** on EBS volumes, or better yet, use S3 for all checkpoints |
| 59 | + |
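A minimal sketch of the termination handler from recommendation 4, using only the standard library. It assumes IMDSv2 is enabled on the instance; `watch_for_interruption` and the `on_interrupt` callback are hypothetical names, and the callback is where the training loop's own checkpoint-to-S3 routine would be plugged in.

```python
import time
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"


def fetch_spot_action(timeout=1.0):
    """Return the interruption notice (JSON text), or None if none is scheduled."""
    # IMDSv2: request a short-lived session token first.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    try:
        with urllib.request.urlopen(token_req, timeout=timeout) as resp:
            token = resp.read().decode()
        action_req = urllib.request.Request(
            f"{IMDS}/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": token},
        )
        with urllib.request.urlopen(action_req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # endpoint returns 404 until an interruption is scheduled
            return None
        raise


def watch_for_interruption(on_interrupt, fetch=fetch_spot_action,
                           poll_seconds=5.0, sleep=time.sleep):
    """Poll until a notice appears, then call on_interrupt(notice) once and return it."""
    while True:
        try:
            notice = fetch()
        except OSError:
            notice = None  # transient metadata/network error: retry on the next poll
        if notice is not None:
            on_interrupt(notice)
            return notice
        sleep(poll_seconds)
```

One way to use this is to run `watch_for_interruption` in a daemon thread next to the training loop, with `on_interrupt` setting a flag that the loop checks between steps. The `fetch` and `sleep` parameters exist only so the loop can be exercised off-instance with stubs.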
| 60 | +## Gotchas |
| 61 | + |
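| 62 | +- **Root EBS volume is deleted on termination by default**: set `DeleteOnTermination=false` or keep checkpoints in S3 |
| 63 | +- **p3 (V100) incompatible** with the open-source NVIDIA kernel driver, which requires a GSP-capable GPU (Turing or newer; V100 is Volta) |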
| 64 | +- **g4dn (T4) only 16GB VRAM** — likely insufficient for VLM RL |
| 65 | +- **SageMaker Managed Spot** adds 15-40% markup, not recommended for custom verl-agent loops |
| 66 | +- **Reattaching EBS across AZs requires snapshots** — S3 checkpoints avoid this entirely |
| 67 | + |
| 68 | +## Implementation TODO |
| 69 | + |
| 70 | +- [ ] Add spot instance support to `aws_vm.py` (`create_vm` with `InstanceMarketOptions`) |
| 71 | +- [ ] Add S3 checkpoint upload to training loop |
| 72 | +- [ ] Add termination handler (metadata polling + checkpoint trigger) |
| 73 | +- [ ] Add g6 instance types to `GPU_INSTANCE_TYPE_FALLBACKS` in `aws_vm.py` |
| 74 | +- [ ] Test EC2 Fleet with `price-capacity-optimized` allocation |
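As a starting point for the first TODO item, the sketch below builds the keyword arguments that boto3's EC2 client accepts in `run_instances`, with `InstanceMarketOptions` requesting a one-time spot instance and the root volume marked to survive termination. The helper name `spot_launch_params` and the default root device name are assumptions; the actual `create_vm` signature in `aws_vm.py` is not shown here, and the root device name depends on the AMI.

```python
def spot_launch_params(image_id, instance_type, root_device="/dev/sda1"):
    """Build kwargs for ec2_client.run_instances(**params) requesting a spot instance.

    root_device is an assumption: it must match the AMI's root device name
    (for example /dev/sda1 or /dev/xvda).
    """
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        "InstanceMarketOptions": {
            "MarketType": "spot",
            "SpotOptions": {
                # One-time request: the instance is terminated on interruption.
                "SpotInstanceType": "one-time",
                "InstanceInterruptionBehavior": "terminate",
            },
        },
        "BlockDeviceMappings": [
            # Keep the root volume after termination (checkpoints should still go to S3).
            {"DeviceName": root_device, "Ebs": {"DeleteOnTermination": False}},
        ],
    }
```

A call would then look like `boto3.client("ec2").run_instances(**spot_launch_params(ami_id, "g5.12xlarge"))`, where `ami_id` is a placeholder for the project's AMI.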