Commit 828733f

abrichr and claude authored
fix: use C8i/M8i instances for AWS nested virt (10x cheaper than metal) (#124)
AWS Intel Xeon 6 families (C8i, M8i, R8i) support nested virtualization on standard (non-metal) instances since late 2025. Update default from m5.metal ($4.61/hr) to m8i.2xlarge ($0.46/hr) with fallbacks through c8i.2xlarge, r8i.2xlarge, m8i.4xlarge, and m5.metal as legacy option.

Updated files:
- aws_vm.py: new INSTANCE_TYPE and INSTANCE_TYPE_FALLBACKS
- CLAUDE.md: updated cost table
- docs/rl_quick_start.md: updated cost estimates
- docs/ec2_setup_guide.md: updated instance types, costs, and instructions

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 7213cee commit 828733f

4 files changed

Lines changed: 46 additions & 35 deletions

CLAUDE.md

Lines changed: 5 additions & 5 deletions
@@ -154,14 +154,14 @@ oa-vm pool-wait --cloud aws --timeout 45
 oa-vm pool-cleanup --cloud aws -y
 ```

-AWS requires `m5.metal` ($4.61/hr) for KVM/QEMU nested virtualization. First boot takes ~35 min (Windows download + install). Costs per full WAA stack test:
+AWS uses `m8i.2xlarge` (~$0.46/hr) for KVM/QEMU nested virtualization (Intel Xeon 6 families C8i/M8i/R8i support nested virt on standard instances since late 2025). First boot takes ~35 min (Windows download + install). Costs per full WAA stack test:

 | Phase | Time | Cost |
 |-------|------|------|
-| VM + Docker setup | ~14 min | $1.08 |
-| Docker image build | ~7 min | $0.54 |
-| Windows install + boot | ~20 min | $1.54 |
-| Benchmark runtime | varies | $4.61/hr |
+| VM + Docker setup | ~14 min | $0.11 |
+| Docker image build | ~7 min | $0.05 |
+| Windows install + boot | ~20 min | $0.15 |
+| Benchmark runtime | varies | $0.46/hr |

 ![Windows 11 on AWS EC2](docs/aws-waa-windows-desktop.png)

docs/ec2_setup_guide.md

Lines changed: 23 additions & 21 deletions
@@ -20,7 +20,7 @@ This guide walks through deploying WAA on AWS EC2 for GUI agent evaluation. WAA
 ## Architecture overview

 ```
-LOCAL MACHINE (macOS/Linux) AWS EC2 (Ubuntu 22.04, m5.metal)
+LOCAL MACHINE (macOS/Linux) AWS EC2 (Ubuntu 22.04, m8i.2xlarge)
 +---------------------------+ +------------------------------------+
 | oa-vm CLI | SSH Tunnel | Docker (waa-auto:latest) |
 | (pool management) | --------→ | +- evaluate_server (:5050) |
@@ -33,7 +33,7 @@ LOCAL MACHINE (macOS/Linux) AWS EC2 (Ubuntu 22.04, m5.metal)
 ```

 Key points:
-- **Instance type**: `m5.metal` is required (bare-metal for KVM/QEMU nested virtualization). Standard instances like `g4dn.xlarge` or `t3.xlarge` do NOT expose `/dev/kvm` and cannot run QEMU.
+- **Instance type**: `m8i.2xlarge` is recommended (~$0.46/hr). Intel Xeon 6 families (C8i, M8i, R8i) support nested virtualization on standard (non-metal) instances since late 2025. Legacy metal instances (`m5.metal` at ~$4.61/hr) also work but at ~10x the cost. Older standard instances like `t3.xlarge` do NOT expose `/dev/kvm` and cannot run QEMU.
 - **OS**: Ubuntu 22.04 LTS (Canonical official AMI, auto-discovered by the CLI)
 - **Ports**: Only SSH (22) is opened in the security group. All other access goes through SSH tunnels.
 - **First boot**: ~35 minutes (Windows 11 download + install). Subsequent resumes: ~1-5 minutes.
@@ -110,7 +110,7 @@ This performs 5 read-only checks:
 1. AWS credentials (via `sts.get_caller_identity()`)
 2. SSH public key exists at `~/.ssh/id_rsa.pub`
 3. Latest Ubuntu 22.04 AMI lookup (Canonical official)
-4. `m5.metal` instance type availability across regions
+4. `m8i.2xlarge` (or fallback) instance type availability across regions
 5. VPC infrastructure (creates VPC, subnet, security group, internet gateway if needed)

 For a full lifecycle test (creates and deletes a real EC2 instance, costs ~$0.01):
@@ -121,11 +121,11 @@ oa-vm smoke-test-aws --full

 ### 5. EC2 service quota

-`m5.metal` requires sufficient vCPU quota. By default, new AWS accounts have a limit of 0 for metal instances. To request an increase:
+`m8i.2xlarge` requires sufficient vCPU quota. By default, new AWS accounts may have limited quotas. To check or request an increase:

 1. Go to **AWS Console > Service Quotas > Amazon EC2**
 2. Search for "Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances"
-3. Request a quota increase to at least 96 vCPUs (the `m5.metal` instance has 96 vCPUs)
+3. Request a quota increase to at least 8 vCPUs (the `m8i.2xlarge` instance has 8 vCPUs). If using `m5.metal` as a fallback, 96 vCPUs are required.

 ## Path 1: Automated setup with oa-vm CLI (recommended)

@@ -142,7 +142,7 @@ oa-vm pool-create --cloud aws --workers 3
 ```

 What happens behind the scenes:
-1. Finds an available `m5.metal` instance and region (tries us-east-1, us-west-2, us-east-2, eu-west-1)
+1. Finds an available instance type with nested virt support (tries m8i.2xlarge, c8i.2xlarge, r8i.2xlarge, m8i.4xlarge, m5.metal in order) and region (us-east-1, us-west-2, us-east-2, eu-west-1)
 2. Creates VPC infrastructure if needed (VPC, subnet, internet gateway, security group, key pair)
 3. Launches Ubuntu 22.04 EC2 instance with 128GB gp3 EBS root volume
 4. Waits for SSH to become available
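
Step 1 in the hunk above amounts to a search over instance types and regions. The following is a minimal sketch of that search, not the code from `aws_vm.py`. It assumes the instance type is the outer loop (the diff does not show the real loop order) and uses a hypothetical `is_offered(instance_type, region)` callback in place of the actual AWS availability check.

```python
from typing import Callable, Optional, Tuple

# Orders copied from the guide text. The probe callback is a hypothetical stand-in
# for whatever availability check the CLI really performs.
INSTANCE_TYPES = ["m8i.2xlarge", "c8i.2xlarge", "r8i.2xlarge", "m8i.4xlarge", "m5.metal"]
REGIONS = ["us-east-1", "us-west-2", "us-east-2", "eu-west-1"]


def find_instance_and_region(
    is_offered: Callable[[str, str], bool],
) -> Optional[Tuple[str, str]]:
    """Return the first (instance_type, region) pair the probe accepts, or None."""
    for instance_type in INSTANCE_TYPES:  # assumption: preferred type is tried first
        for region in REGIONS:
            if is_offered(instance_type, region):
                return instance_type, region
    return None  # corresponds to "No available EC2 instance type/region found"
```

With this ordering, a cheaper instance type in a later region is preferred over a more expensive type in the first region. Swapping the two loops would favor region locality instead.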
@@ -255,10 +255,10 @@ aws ec2 authorize-security-group-ingress \
 --protocol tcp --port 22 \
 --cidr 0.0.0.0/0

-# Launch m5.metal instance with 128GB disk
+# Launch m8i.2xlarge instance with 128GB disk (nested virt supported)
 INSTANCE_ID=$(aws ec2 run-instances \
 --image-id $AMI_ID \
---instance-type m5.metal \
+--instance-type m8i.2xlarge \
 --key-name waa-pool-key \
 --security-group-ids $SG_ID \
 --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":128,"VolumeType":"gp3"}}]' \
@@ -617,20 +617,22 @@ Tasks can include a `config` array with setup steps that run before the task beg

 | Instance type | vCPU | RAM | Cost/hr | KVM support | Notes |
 |---------------|------|-----|---------|-------------|-------|
-| `m5.metal` | 96 | 384 GB | $4.61 | Yes | Primary choice |
-| `m5n.metal` | 96 | 384 GB | $5.71 | Yes | Network-optimized fallback |
-| `c5.metal` | 96 | 192 GB | $4.08 | Yes | Compute-optimized fallback |
+| `m8i.2xlarge` | 8 | 32 GB | $0.46 | Yes | Primary choice (Intel Xeon 6, nested virt) |
+| `c8i.2xlarge` | 8 | 16 GB | $0.41 | Yes | Compute-optimized, cheapest with nested virt |
+| `r8i.2xlarge` | 8 | 64 GB | $0.60 | Yes | Memory-optimized |
+| `m8i.4xlarge` | 16 | 64 GB | $0.92 | Yes | Bigger option |
+| `m5.metal` | 96 | 384 GB | $4.61 | Yes | Legacy fallback (expensive) |

 ### Time and cost per phase

-| Phase | Time | Cost (m5.metal) |
-|-------|------|-----------------|
-| EC2 launch + SSH ready | ~2 min | $0.15 |
-| Docker + image build | ~12 min | $0.92 |
-| Windows 11 download + install (first boot) | ~20 min | $1.54 |
-| Windows boot (subsequent) | ~1-5 min | $0.08-0.38 |
-| **Total first boot** | **~35 min** | **~$2.61** |
-| Benchmark runtime | varies | $4.61/hr |
+| Phase | Time | Cost (m8i.2xlarge) |
+|-------|------|--------------------|
+| EC2 launch + SSH ready | ~2 min | $0.02 |
+| Docker + image build | ~12 min | $0.09 |
+| Windows 11 download + install (first boot) | ~20 min | $0.15 |
+| Windows boot (subsequent) | ~1-5 min | $0.01-0.04 |
+| **Total first boot** | **~35 min** | **~$0.27** |
+| Benchmark runtime | varies | $0.46/hr |

 ### Storage costs (when paused)

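
The "Time and cost per phase" figures in the hunk above appear to be the phase duration multiplied by the hourly on-demand rate. A minimal sketch of that arithmetic, assuming simple linear billing at the listed rates (`phase_cost` is an illustrative helper, not part of the project):

```python
def phase_cost(minutes: float, hourly_rate: float) -> float:
    """Dollar cost of running an instance for `minutes`, rounded to the cent."""
    return round(minutes / 60 * hourly_rate, 2)


M8I_2XLARGE = 0.46  # $/hr, from the instance table
M5_METAL = 4.61     # $/hr, legacy fallback

phases = [
    ("EC2 launch + SSH ready", 2),
    ("Docker + image build", 12),
    ("Windows 11 download + install", 20),
]
for label, minutes in phases:
    print(f"{label}: ${phase_cost(minutes, M8I_2XLARGE):.2f}")
```

Each row is rounded on its own, so the rows can sum to a cent less than the 35-minute total computed directly.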
@@ -642,9 +644,9 @@ Paused VMs (stopped instances) do not incur compute charges, but EBS storage con

 ### "No available EC2 instance type/region found"

-`m5.metal` may not be available in all regions. The CLI tries us-east-1, us-west-2, us-east-2, eu-west-1 in order. If all fail:
+The preferred instance type (`m8i.2xlarge`) may not be available in all regions. The CLI tries multiple instance types (m8i, c8i, r8i, then m5.metal) across regions (us-east-1, us-west-2, us-east-2, eu-west-1) in order. If all fail:
 1. Check your vCPU quota: AWS Console > Service Quotas > EC2 > "Running On-Demand Standard instances"
-2. Request a quota increase to at least 96 vCPUs
+2. Request a quota increase to at least 8 vCPUs (or 96 vCPUs if falling back to m5.metal)
 3. Try a different region: `oa-vm smoke-test-aws --region eu-west-1`

 ### SSH connection timeout

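The quota figures in the troubleshooting steps above cover a single instance. For a multi-worker pool the requirement presumably scales with the worker count. A minimal sketch, assuming each pool worker is a separate EC2 instance and that the quota counts vCPUs across running instances. The `VCPUS` table repeats the guide's instance table, and `required_vcpu_quota` is a hypothetical helper.

```python
# vCPU counts copied from the guide's instance table.
VCPUS = {
    "m8i.2xlarge": 8,
    "c8i.2xlarge": 8,
    "r8i.2xlarge": 8,
    "m8i.4xlarge": 16,
    "m5.metal": 96,
}


def required_vcpu_quota(instance_type: str, workers: int = 1) -> int:
    """vCPU quota needed to run `workers` instances of one type at the same time."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return VCPUS[instance_type] * workers
```

Under these assumptions, a three-worker pool of `m8i.2xlarge` instances would need 24 vCPUs of quota, while a single `m5.metal` fallback needs 96.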
docs/rl_quick_start.md

Lines changed: 8 additions & 2 deletions
@@ -184,8 +184,14 @@ Running WAA requires a VM with nested virtualisation support:

 | Resource | Spec | Approximate cost |
 |----------|------|-----------------|
-| Cloud VM (general purpose, 8 vCPU, 32 GB) | D8ds_v5 equivalent | ~$0.38/hr |
-| Cloud VM (bare metal, 96 vCPU) | m5.metal equivalent | ~$4.61/hr |
+| Azure VM (general purpose, 8 vCPU, 32 GB) | D8ds_v5 | ~$0.38/hr |
+| AWS VM (general purpose, 8 vCPU, 32 GB) | m8i.2xlarge | ~$0.46/hr |
+| AWS VM (compute-optimized, 8 vCPU, 16 GB) | c8i.2xlarge | ~$0.41/hr |
+| AWS VM (legacy bare metal, 96 vCPU) | m5.metal | ~$4.61/hr |
+
+Intel Xeon 6 families (C8i, M8i, R8i) support nested virtualisation on
+standard (non-metal) AWS instances since late 2025, reducing AWS costs
+by ~10x compared to legacy metal instances.

 A single rollout (15 steps) typically completes in 1--3 minutes depending
 on action delay and evaluator latency. At the lower rate that is roughly

openadapt_evals/infrastructure/aws_vm.py

Lines changed: 10 additions & 7 deletions
@@ -36,14 +36,17 @@
 logger = logging.getLogger(__name__)

 # Instance types with nested virtualization (KVM) support.
-# WAA requires QEMU/KVM, which only bare-metal instances expose on AWS.
-# m5.metal: 96 vCPU, 384GB — supports /dev/kvm for nested virtualization.
-INSTANCE_TYPE = "m5.metal"
+# WAA requires QEMU/KVM. As of late 2025, Intel Xeon 6 families (C8i, M8i, R8i)
+# support nested virtualization on standard (non-metal) instances, dramatically
+# reducing cost vs legacy metal instances.
+INSTANCE_TYPE = "m8i.2xlarge" # 8 vCPU, 32GB, ~$0.46/hr — nested virt supported
 INSTANCE_TYPE_FALLBACKS = [
-    ("m5.metal", 4.608),
-    ("m5n.metal", 5.712),
-    ("c5.metal", 4.080),
-    ("m5a.xlarge", 0.172), # Non-KVM fallback (won't run QEMU, for testing only)
+    ("m8i.2xlarge", 0.461), # 8 vCPU, 32GB — best value with nested virt
+    ("c8i.2xlarge", 0.408), # 8 vCPU, 16GB — compute-optimized, cheaper
+    ("r8i.2xlarge", 0.604), # 8 vCPU, 64GB — memory-optimized
+    ("m8i.4xlarge", 0.922), # 16 vCPU, 64GB — bigger option
+    ("m5.metal", 4.608), # Legacy: 96 vCPU, 384GB — expensive but proven
+    ("m5a.xlarge", 0.172), # Non-KVM fallback (testing only, won't run QEMU)
 ]

 # GPU instance types for verl-agent RL training.

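The last fallback entry, `m5a.xlarge`, is marked as unable to run QEMU, so a launched host may or may not expose KVM depending on which entry was selected. One way to confirm this on the instance itself is to look for the `/dev/kvm` device node. A minimal sketch follows. `kvm_available` is a hypothetical helper, and whether the project performs such a check is not shown in this commit.

```python
import os


def kvm_available(device: str = "/dev/kvm") -> bool:
    """True if the device node exists and the current user can read and write it."""
    return os.path.exists(device) and os.access(device, os.R_OK | os.W_OK)


if __name__ == "__main__":
    if kvm_available():
        print("KVM is usable on this host")
    else:
        print("/dev/kvm is missing or not accessible; QEMU cannot use KVM here")
```

The `device` parameter exists only to make the function easy to exercise against other paths. On a real host the default is the path that matters.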