feat: add multi-cloud VM support with AWS backend and VMProvider protocol#66

Merged: abrichr merged 4 commits into main from feat/aws-vm-backend on Mar 2, 2026

abrichr commented Mar 2, 2026

Summary

  • VMProvider Protocol (typing.Protocol): Cloud-agnostic interface for VM lifecycle management. Both AzureVMManager and AWSVMManager satisfy it via structural subtyping — no inheritance changes needed
  • AWSVMManager (boto3): Full EC2 lifecycle management (create/delete/start/stop instances, EIP allocation, VPC/subnet/SG idempotent setup, tag-based pool resource discovery and cleanup)
  • PoolManager + vm_cli.py: Updated to accept any VMProvider instead of hardcoded AzureVMManager. All pool commands (pool-create, pool-wait, pool-run, pool-cleanup, pool-pause, pool-resume) accept --cloud azure|aws
  • Parameterized SSH: All hardcoded azureuser references replaced with vm_manager.ssh_username / {home_dir} template variables
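
To illustrate the structural-subtyping point above, here is a minimal sketch, not the PR's actual code. The protocol body is a hypothetical subset (the real one has 10 methods and 2 properties), `delete_vm` and the two stub managers are invented for the example, and `"ubuntu"` as the AWS username is an assumption. Neither stub inherits from `VMProvider`, yet both satisfy it.

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@runtime_checkable
class VMProvider(Protocol):
    """Hypothetical subset of the cloud-agnostic interface."""

    @property
    def ssh_username(self) -> str: ...

    def create_vm(self, name: str) -> str: ...

    def delete_vm(self, name: str) -> None: ...


@dataclass
class StubAzureManager:
    # No inheritance from VMProvider: matching members are enough.
    ssh_username: str = "azureuser"

    def create_vm(self, name: str) -> str:
        return f"azure:{name}"

    def delete_vm(self, name: str) -> None:
        return None


@dataclass
class StubAWSManager:
    ssh_username: str = "ubuntu"  # assumed default, not taken from the PR

    def create_vm(self, name: str) -> str:
        return f"aws:{name}"

    def delete_vm(self, name: str) -> None:
        return None


def home_dir(provider: VMProvider) -> str:
    """Derive the remote home directory from the provider's SSH user."""
    return f"/home/{provider.ssh_username}"
```

Code that previously hardcoded `azureuser` can then call something like `home_dir(vm_manager)` and work with either backend.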

Changes by file

| File | Change |
| --- | --- |
| infrastructure/vm_provider.py | NEW: VMProvider Protocol (10 methods + 2 properties) |
| infrastructure/aws_vm.py | NEW: AWSVMManager dataclass using boto3 |
| infrastructure/azure_vm.py | Added resource_scope, ssh_username properties; list_pool_resources, cleanup_pool_resources methods; parameterized ssh_run/wait_for_ssh with username |
| infrastructure/pool.py | Type: VMProvider instead of AzureVMManager; parameterized scripts with {home_dir}/{ssh_username}; delegated cleanup to provider |
| benchmarks/vm_cli.py | _create_vm_manager() factory; --cloud arg on all pool subparsers |
| config.py | Added cloud_provider, aws_region settings |
| infrastructure/__init__.py | Export VMProvider, AWSVMManager |
| pyproject.toml | aws = ["boto3>=1.34.0"] optional dep |
| tests/test_evaluate_server_deploy.py | Updated for WAA_START_SCRIPT_TEMPLATE rename |
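
The `--cloud` flag and factory described for `benchmarks/vm_cli.py` could look roughly like the sketch below. This is an assumption-laden illustration: `StubManager`, `create_vm_manager`, and `build_parser` are hypothetical names, the `azure` default is inferred from the "Azure pool commands unchanged" test item, and the real factory returns `AzureVMManager` or `AWSVMManager` rather than a stub.

```python
import argparse
from dataclasses import dataclass

POOL_COMMANDS = (
    "pool-create", "pool-wait", "pool-run",
    "pool-cleanup", "pool-pause", "pool-resume",
)


@dataclass
class StubManager:
    """Hypothetical stand-in for a real provider manager."""
    cloud: str
    ssh_username: str


def create_vm_manager(cloud: str) -> StubManager:
    """Map the --cloud value to a provider (stubbed here)."""
    if cloud == "azure":
        return StubManager("azure", "azureuser")
    if cloud == "aws":
        return StubManager("aws", "ubuntu")  # username is an assumption
    raise ValueError(f"Unsupported cloud provider: {cloud!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="oa-vm")
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name in POOL_COMMANDS:
        sub = subparsers.add_parser(name)
        # Same flag on every pool subcommand; default assumed to be azure.
        sub.add_argument("--cloud", choices=["azure", "aws"], default="azure")
    return parser


args = build_parser().parse_args(["pool-create", "--cloud", "aws"])
manager = create_vm_manager(args.cloud)
```

Keeping the provider choice in one factory means the pool commands themselves never branch on the cloud name.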

Test plan

  • 531 tests pass (29 pre-existing failures unrelated to this PR)
  • WAA_START_SCRIPT_TEMPLATE tests updated and passing
  • All imports work (VMProvider, AWSVMManager with/without boto3)
  • Azure pool commands unchanged: oa-vm pool-create, pool-pause, pool-resume, pool-cleanup
  • oa-vm pool-create --cloud aws --workers 1 — creates EC2 instance
  • oa-vm pool-cleanup --cloud aws -y — terminates instances, releases resources

🤖 Generated with Claude Code

abrichr and others added 4 commits March 2, 2026 10:01
…ocol

- Create VMProvider Protocol (typing.Protocol) for cloud-agnostic VM management
- Create AWSVMManager with boto3 for EC2 lifecycle (create, delete, start, stop)
- Add resource_scope/ssh_username properties to AzureVMManager
- Add list_pool_resources/cleanup_pool_resources to AzureVMManager
- Parameterize pool.py SSH calls and scripts with username/home_dir
- Add --cloud flag (azure|aws) to all pool CLI commands
- Add cloud_provider/aws_region to config.py settings
- Add boto3 optional dependency (openadapt-evals[aws])
- Update tests for WAA_START_SCRIPT_TEMPLATE rename

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix DOCKER_SETUP_SCRIPT_WITH_ACR daemon.json double-brace corruption
  that produced invalid JSON ({{"data-root"...}}) breaking Docker start
- Use .metal instance types for AWS (KVM/nested virt required for QEMU)
- Fix region mismatch: update self.region and invalidate cached clients
  when create_vm uses a different region than the manager default
- Fix hardcoded "azureuser" in pool-wait diagnostic message
- Set AWSVMManager = None on ImportError so `import *` doesn't raise
- Only delete pool registry on successful cleanup (prevents orphaned
  cloud resources when deletion fails)
- Remove unused `time` import from aws_vm.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
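
For context on the daemon.json fix in the commit above: one common way to end up with `{{"data-root"...}}` on disk is a template that escapes braces for `str.format` but is then written out without being formatted. Whether that is the exact cause in this PR is not stated; the sketch below only demonstrates the failure mode, with a hypothetical template and path.

```python
import json

# Braces are doubled because the template is meant for str.format();
# {data_root} is the only real placeholder. Path below is illustrative.
DAEMON_JSON_TEMPLATE = '{{"data-root": "{data_root}"}}'

rendered = DAEMON_JSON_TEMPLATE.format(data_root="/mnt/docker")
unrendered = DAEMON_JSON_TEMPLATE  # emitted without .format()


def is_valid_json(text: str) -> bool:
    """Return True if text parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(rendered, is_valid_json(rendered))
print(unrendered, is_valid_json(unrendered))
```

A check like `is_valid_json` on the generated file content is a cheap guard before the script is shipped to a VM.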
- Fix pool-vnc/pool-logs/pool-exec hardcoded azureuser: read ssh_username
  from pool registry with backward-compatible default
- Store ssh_username in VMPool dataclass and persist to registry on create
- Move set_auto_shutdown after SSH is available (was racing with boot)
- Fix cleanup_pool_resources: handle raw instance IDs and allocation IDs
  for resources without Name tags (prevents orphaned resources)
- Narrow key pair exception handling: re-raise unless InvalidKeyPair.NotFound
- Add TODO for restricting SSH security group to user's IP

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
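
The backward-compatible default for `ssh_username` mentioned above can be sketched as follows. This is a simplified assumption of how a registry might serialize a pool: the `pool_id` and `workers` fields and the `save_pool`/`load_pool` helpers are hypothetical, and the real `VMPool` and `VMPoolRegistry` hold more state.

```python
import json
from dataclasses import asdict, dataclass, field

DEFAULT_SSH_USERNAME = "azureuser"  # keeps pre-existing Azure registries working


@dataclass
class VMPool:
    pool_id: str
    workers: list = field(default_factory=list)
    ssh_username: str = DEFAULT_SSH_USERNAME


def save_pool(pool: VMPool) -> str:
    """Serialize the pool, including ssh_username, to JSON."""
    return json.dumps(asdict(pool))


def load_pool(text: str) -> VMPool:
    """Load a pool; registries written before the field existed get the default."""
    data = json.loads(text)
    return VMPool(
        pool_id=data["pool_id"],
        workers=data.get("workers", []),
        ssh_username=data.get("ssh_username", DEFAULT_SSH_USERNAME),
    )


aws_pool = load_pool(save_pool(VMPool("p1", ["w0"], ssh_username="ubuntu")))
legacy_pool = load_pool('{"pool_id": "old", "workers": []}')
```

The field has to be read back in the load path as well as written on create, otherwise every new process silently falls back to the default.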
- Add ssh_username to VMPoolRegistry.load() so it persists across
  process restarts (was silently reverting to "azureuser" default)
- Fix disassociate_address for raw allocation IDs: look up AssociationId
  via describe_addresses first (disassociate_address does not accept
  AllocationId parameter)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
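
The Elastic IP fix above can be sketched like this. The function expects a boto3 EC2 client and uses its `describe_addresses`, `disassociate_address`, and `release_address` calls; the function name, the decision to release after disassociating, and the `RecordingEC2` test double are assumptions made for the example, not the PR's code.

```python
def release_elastic_ip(ec2, allocation_id: str) -> None:
    """Disassociate (if needed) and release an Elastic IP by allocation ID.

    `ec2` is expected to behave like boto3.client("ec2").
    """
    response = ec2.describe_addresses(AllocationIds=[allocation_id])
    for address in response.get("Addresses", []):
        association_id = address.get("AssociationId")
        if association_id:
            # disassociate_address takes AssociationId, not AllocationId.
            ec2.disassociate_address(AssociationId=association_id)
    ec2.release_address(AllocationId=allocation_id)


class RecordingEC2:
    """Test double that records calls instead of talking to AWS."""

    def __init__(self, association_id=None):
        self.association_id = association_id
        self.calls = []

    def describe_addresses(self, AllocationIds):
        self.calls.append(("describe_addresses", tuple(AllocationIds)))
        address = {"AllocationId": AllocationIds[0]}
        if self.association_id:
            address["AssociationId"] = self.association_id
        return {"Addresses": [address]}

    def disassociate_address(self, AssociationId):
        self.calls.append(("disassociate_address", AssociationId))

    def release_address(self, AllocationId):
        self.calls.append(("release_address", AllocationId))


attached = RecordingEC2(association_id="eipassoc-123")
release_elastic_ip(attached, "eipalloc-abc")

detached = RecordingEC2()
release_elastic_ip(detached, "eipalloc-def")
```

Passing the client in as a parameter keeps the cleanup logic testable without AWS credentials.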
@abrichr abrichr merged commit 7a27d60 into main Mar 2, 2026
1 check passed