
Add OpenVLA AgentCore Orchestrator: autonomous Slurm training orchestration #1576

Open
mijanur132 wants to merge 1 commit into awslabs:main from mijanur132:add-ml-training-agent

Conversation

@mijanur132


Summary

Adds a new use case under 02-use-cases/ml-training-agent/ demonstrating how Amazon Bedrock AgentCore can autonomously manage GPU training jobs on a SageMaker HyperPod Slurm cluster.

What's included

| File | Purpose |
|---|---|
| slurm_mcp_server.py | MCP server exposing 6 Slurm tools (submit, status, logs, cancel, info, metrics) over SSH |
| vla_training_agent.py | Autonomous agent with anomaly detection and recovery |
| openvla.Dockerfile | Training container with pinned deps (torch, transformers, peft) on EFA/NCCL base |
| slurm/finetune_openvla.sbatch | Batch script using srun --container-image + torchrun |
| .env.example | Configuration template |
| requirements.txt | MCP server dependency |
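
To make the "Slurm tools over SSH" idea concrete, here is a minimal sketch of how each of the six tool names could map to a Slurm CLI call that is then wrapped in an `ssh` invocation. This is not the code in `slurm_mcp_server.py`: the function names `slurm_argv` and `ssh_argv`, the chosen flags, the default `slurm-<jobid>.out` log path, and the use of `sacct` for the metrics tool are all assumptions made for illustration.

```python
import shlex
from typing import List, Optional

JOB_TOOLS = {"status", "cancel", "logs", "metrics"}


def slurm_argv(tool: str, job_id: Optional[str] = None,
               script: Optional[str] = None) -> List[str]:
    """Return the Slurm CLI argv for one tool (hypothetical mapping)."""
    if tool == "submit":
        if not script:
            raise ValueError("submit needs a batch script path")
        # --parsable makes sbatch print just the job ID
        return ["sbatch", "--parsable", script]
    if tool == "info":
        return ["sinfo"]
    if tool not in JOB_TOOLS:
        raise ValueError(f"unknown tool: {tool}")
    # Only accept numeric job IDs so nothing unexpected reaches the shell
    if not job_id or not job_id.isdigit():
        raise ValueError("job_id must be numeric")
    if tool == "status":
        return ["squeue", "-j", job_id, "-h", "-o", "%T"]
    if tool == "cancel":
        return ["scancel", job_id]
    if tool == "logs":
        # Assumes the default Slurm output file name, slurm-<jobid>.out
        return ["tail", "-n", "100", f"slurm-{job_id}.out"]
    # tool == "metrics": accounting data only; an assumption, not the PR's logic
    return ["sacct", "-j", job_id, "--format=JobID,State,Elapsed",
            "--parsable2", "--noheader"]


def ssh_argv(host: str, remote: List[str]) -> List[str]:
    """Wrap a remote argv in an ssh call, quoting it for the remote shell."""
    return ["ssh", "-o", "BatchMode=yes", host, shlex.join(remote)]
```

The resulting list can be passed to `subprocess.run(..., capture_output=True, text=True)`. Building an argv list and quoting it with `shlex.join` avoids handing unquoted, agent-supplied strings to the remote shell.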

Architecture

This follows the same pattern as the sibling SRE-agent use case:

  • MCP tools → Slurm operations (instead of K8s/logs/metrics APIs)
  • Agent → Monitors training metrics (instead of infrastructure health)
  • Recovery → Cancels diverging jobs, adjusts LR, resubmits (instead of pod restarts)
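
The monitor-then-recover loop above can be sketched roughly as follows. This is an illustrative sketch, not the logic in `vla_training_agent.py`: the function names, the window size, the divergence ratio, the plateau-based definition of a stall, and the halving of the learning rate are all assumed values chosen for the example.

```python
import math
from typing import List, Optional


def detect_anomaly(losses: List[float], window: int = 5,
                   divergence_ratio: float = 2.0,
                   stall_tol: float = 1e-4) -> Optional[str]:
    """Classify recent loss history as 'nan', 'divergence', 'stall', or None."""
    recent_vals = losses[-window:]
    # A NaN or infinite loss is reported immediately
    if any(math.isnan(x) or math.isinf(x) for x in recent_vals):
        return "nan"
    # Need two full windows before comparing trends
    if len(losses) < 2 * window:
        return None
    prev = sum(losses[-2 * window:-window]) / window
    recent = sum(recent_vals) / window
    if recent > divergence_ratio * prev:
        return "divergence"
    # Here a stall means the mean loss has stopped moving (assumed definition)
    if abs(prev - recent) < stall_tol:
        return "stall"
    return None


def next_learning_rate(lr: float, anomaly: Optional[str],
                       factor: float = 0.5) -> float:
    """Shrink the LR before resubmitting after a NaN or divergence."""
    return lr * factor if anomaly in ("nan", "divergence") else lr
```

In this sketch, an agent would cancel the job when `detect_anomaly` returns `"nan"` or `"divergence"`, then resubmit with the value from `next_learning_rate`. How a stall should be handled (wait, cancel, or alert) is left open here.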

Concrete workload

OpenVLA-7B LoRA fine-tuning on the LIBERO robotics benchmark. Tested on one HyperPod P5en node (8x H200): 500 steps in ~10 min, producing a 15 GB checkpoint.

Relation to awsome-distributed-ai PR

The standalone training recipe (Dockerfile + sbatch, no agent) is submitted separately as awslabs/awsome-distributed-ai#1112. This PR bundles the full agent + training workload as a self-contained AgentCore demo.

@github-actions github-actions Bot added the 02-use-cases 02-use-cases label May 29, 2026
@mijanur132 mijanur132 force-pushed the add-ml-training-agent branch from 7b5a3a7 to 4801535 Compare May 29, 2026 02:23
@mijanur132 mijanur132 changed the title Add ML Training Agent use case: autonomous Slurm training orchestration Add OpenVLA AgentCore Orchestrator: autonomous Slurm training orchestration May 29, 2026
@github-actions

github-actions Bot commented May 29, 2026


Latest scan for commit: 977f929 | Updated: 2026-06-02 15:13:27 UTC

Security Scan Results

Scan Metadata

  • Project: ASH
  • Scan executed: 2026-06-02T15:13:06+00:00
  • ASH version: 3.0.0

Summary

Scanner Results

The table below shows findings by scanner, with status based on severity thresholds and dependencies:

Column Explanations:

Severity Levels (S/C/H/M/L/I):

  • Suppressed (S): Security findings that have been explicitly suppressed/ignored and don't affect the scanner's pass/fail status
  • Critical (C): The most severe security vulnerabilities requiring immediate remediation (e.g., SQL injection, remote code execution)
  • High (H): Serious security vulnerabilities that should be addressed promptly (e.g., authentication bypasses, privilege escalation)
  • Medium (M): Moderate security risks that should be addressed in normal development cycles (e.g., weak encryption, input validation issues)
  • Low (L): Minor security concerns with limited impact (e.g., information disclosure, weak recommendations)
  • Info (I): Informational findings for awareness with minimal security risk (e.g., code quality suggestions, best practice recommendations)

Other Columns:

  • Time: Duration taken by each scanner to complete its analysis
  • Action: Total number of actionable findings at or above the configured severity threshold that require attention

Scanner Results:

  • PASSED: Scanner found no security issues at or above the configured severity threshold - code is clean for this scanner
  • FAILED: Scanner found security vulnerabilities at or above the threshold that require attention and remediation
  • MISSING: Scanner could not run because required dependencies/tools are not installed or available
  • SKIPPED: Scanner was intentionally disabled or excluded from this scan
  • ERROR: Scanner encountered an execution error and could not complete successfully

Severity Thresholds (Thresh Column):

  • CRITICAL: Only Critical severity findings cause scanner to fail
  • HIGH: High and Critical severity findings cause scanner to fail
  • MEDIUM (MED): Medium, High, and Critical severity findings cause scanner to fail
  • LOW: Low, Medium, High, and Critical severity findings cause scanner to fail
  • ALL: Any finding of any severity level causes scanner to fail
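
As a worked example of the threshold rules above, the sketch below counts actionable findings for one scanner and derives a PASSED or FAILED result. It is not ASH's implementation; the names `SEVERITY_ORDER`, `THRESHOLD_FLOOR`, `actionable_count`, and `scanner_result` are made up for illustration, and suppressed findings are simply left out of the input.

```python
from typing import Dict

# Severities from least to most severe
SEVERITY_ORDER = ["info", "low", "medium", "high", "critical"]

# Lowest severity that counts as actionable under each threshold
THRESHOLD_FLOOR = {
    "ALL": "info",
    "LOW": "low",
    "MEDIUM": "medium",
    "HIGH": "high",
    "CRITICAL": "critical",
}


def actionable_count(findings: Dict[str, int], threshold: str) -> int:
    """Count findings at or above the threshold's severity floor."""
    floor = SEVERITY_ORDER.index(THRESHOLD_FLOOR[threshold])
    return sum(count for severity, count in findings.items()
               if SEVERITY_ORDER.index(severity) >= floor)


def scanner_result(findings: Dict[str, int], threshold: str) -> str:
    """A scanner fails as soon as it has one actionable finding."""
    return "FAILED" if actionable_count(findings, threshold) > 0 else "PASSED"


# Two Low findings under a MEDIUM threshold are not actionable
bandit_like = {"low": 2}
print(actionable_count(bandit_like, "MEDIUM"), scanner_result(bandit_like, "MEDIUM"))
```

Under a LOW or ALL threshold, the same two Low findings would become actionable and the result would flip to FAILED.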

Threshold Source: Values in parentheses indicate where the threshold is configured:

  • (g) = global: Set in the global_settings section of ASH configuration
  • (c) = config: Set in the individual scanner configuration section
  • (s) = scanner: Default threshold built into the scanner itself

Statistics calculation:

  • All statistics are calculated from the final aggregated SARIF report
  • Suppressed findings are counted separately and do not contribute to actionable findings
  • Scanner status is determined by comparing actionable findings to the threshold
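
| Scanner | S | C | H | M | L | I | Time | Action | Result | Thresh |
|---|---|---|---|---|---|---|---|---|---|---|
| bandit | 0 | 0 | 0 | 0 | 2 | 0 | 797ms | 0 | PASSED | MED (g) |
| cdk-nag | 0 | 0 | 0 | 0 | 0 | 0 | 30.1s | 0 | PASSED | MED (g) |
| cfn-nag | 0 | 0 | 0 | 0 | 0 | 0 | 34ms | 0 | PASSED | MED (g) |
| checkov | 0 | 0 | 0 | 0 | 0 | 0 | 5.4s | 0 | PASSED | MED (g) |
| detect-secrets | 0 | 0 | 0 | 0 | 0 | 0 | 760ms | 0 | PASSED | MED (g) |
| grype | 0 | 0 | 0 | 0 | 0 | 0 | 46.0s | 0 | PASSED | MED (g) |
| npm-audit | 0 | 0 | 0 | 0 | 0 | 0 | 205ms | 0 | PASSED | MED (g) |
| opengrep | 0 | 0 | 0 | 0 | 0 | 0 | <1ms | 0 | SKIPPED | MED (g) |
| semgrep | 0 | 0 | 0 | 0 | 0 | 0 | <1ms | 0 | MISSING | MED (g) |
| syft | 0 | 0 | 0 | 0 | 0 | 0 | 2.5s | 0 | PASSED | MED (g) |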

@mijanur132 mijanur132 force-pushed the add-ml-training-agent branch 2 times, most recently from 21413fc to 73158b5 Compare June 2, 2026 01:49
…ration

Demonstrates AgentCore managing GPU training jobs on HyperPod Slurm:
- MCP server exposing 6 Slurm tools (submit, status, logs, cancel, info, metrics)
- Autonomous agent with anomaly detection (divergence, stall, NaN) and recovery
- OpenVLA-7B LoRA fine-tuning on LIBERO as concrete workload
- Container-based (Pyxis/Enroot) with pinned deps for reproducibility
- All config via environment variables, no hardcoded paths/credentials

Tested: 500 steps, ~10 min on 1x P5en node (8x H200), 15 GB checkpoint output.
@mijanur132 mijanur132 force-pushed the add-ml-training-agent branch from 73158b5 to 977f929 Compare June 2, 2026 15:07

Labels

02-use-cases 02-use-cases
