Add OpenVLA AgentCore Orchestrator: autonomous Slurm training orchestration by mijanur132 · Pull Request #1576 · awslabs/agentcore-samples

mijanur132 · 2026-05-29T02:21:03Z

Summary

Adds a new use case under 02-use-cases/ml-training-agent/ demonstrating how Amazon Bedrock AgentCore can autonomously manage GPU training jobs on a SageMaker HyperPod Slurm cluster.

What's included

| File | Purpose |
| --- | --- |
| `slurm_mcp_server.py` | MCP server exposing 6 Slurm tools (submit, status, logs, cancel, info, metrics) over SSH |
| `vla_training_agent.py` | Autonomous agent with anomaly detection and recovery |
| `openvla.Dockerfile` | Training container with pinned deps (torch, transformers, peft) on EFA/NCCL base |
| `slurm/finetune_openvla.sbatch` | Batch script using `srun --container-image` + `torchrun` |
| `.env.example` | Configuration template |
| `requirements.txt` | MCP server dependency |
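For readers unfamiliar with how the six tool names could map onto Slurm, here is a minimal sketch. It is not the code in `slurm_mcp_server.py`. The function names `slurm_command` and `over_ssh`, the chosen flags, and the assumption that logs land in Slurm's default `slurm-<jobid>.out` file are all illustrative. The sketch only builds argument lists and does not open an SSH connection.

```python
import shlex
from typing import List, Optional


def slurm_command(tool: str, job_id: Optional[str] = None,
                  script: Optional[str] = None) -> List[str]:
    """Return a Slurm CLI argv for one of the six tool names (hypothetical mapping)."""
    if tool == "submit":
        if not script:
            raise ValueError("submit needs a batch script path")
        # --parsable makes sbatch print just the job ID (plus cluster name, if any).
        return ["sbatch", "--parsable", script]
    if tool == "info":
        return ["sinfo"]
    # Every remaining tool targets one job; accept plain numeric IDs only,
    # so that untrusted input cannot smuggle extra shell arguments.
    if job_id is None or not job_id.isdigit():
        raise ValueError("job_id must be a numeric Slurm job ID")
    per_job = {
        "status": ["squeue", "-h", "-j", job_id, "-o", "%T"],
        # Assumes the default output file name; a real sbatch script may override it.
        "logs": ["tail", "-n", "50", f"slurm-{job_id}.out"],
        "cancel": ["scancel", job_id],
        "metrics": ["sacct", "-j", job_id, "--parsable2",
                    "--format=JobID,State,Elapsed,MaxRSS"],
    }
    if tool not in per_job:
        raise ValueError(f"unknown tool: {tool}")
    return per_job[tool]


def over_ssh(host: str, argv: List[str]) -> List[str]:
    """Wrap a remote argv in an ssh invocation, quoting it as one remote command."""
    return ["ssh", host, shlex.join(argv)]
```

Passing the result of `over_ssh("login-node", slurm_command("cancel", job_id="42"))` to `subprocess.run` would cancel job 42 on a host named `login-node` (a placeholder). Keeping command construction separate from execution makes the mapping easy to unit test without a cluster.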

Architecture

Same pattern as the SRE-agent sibling:

MCP tools → Slurm operations (instead of K8s/logs/metrics APIs)
Agent → Monitors training metrics (instead of infrastructure health)
Recovery → Cancels diverging jobs, adjusts LR, resubmits (instead of pod restarts)
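The monitor-then-recover loop described above can be sketched as follows. This is an illustration under stated assumptions, not the logic in `vla_training_agent.py`. The window size, the 2x divergence ratio, the stall tolerance, the halving of the learning rate, and the names `detect_anomaly` and `recovery_plan` are all hypothetical. The PR states only that the agent cancels diverging jobs, adjusts LR, and resubmits.

```python
import math
from typing import Dict, Optional, Sequence


def detect_anomaly(losses: Sequence[float], window: int = 5,
                   diverge_ratio: float = 2.0,
                   stall_tol: float = 1e-3) -> Optional[str]:
    """Classify a loss history as 'nan', 'divergence', 'stall', or None (healthy)."""
    if any(math.isnan(x) or math.isinf(x) for x in losses):
        return "nan"
    if len(losses) < 2 * window:
        return None  # not enough history to compare two windows
    prev = sum(losses[-2 * window:-window]) / window
    recent = sum(losses[-window:]) / window
    if recent > diverge_ratio * prev:
        return "divergence"
    if abs(prev - recent) < stall_tol * abs(prev):
        return "stall"
    return None


def recovery_plan(anomaly: Optional[str], lr: float) -> Dict[str, object]:
    """Map an anomaly to an action. Halving the LR is an assumed policy."""
    if anomaly in ("nan", "divergence"):
        return {"cancel": True, "resubmit": True, "lr": lr * 0.5}
    # Stalls and healthy runs are left alone in this sketch.
    return {"cancel": False, "resubmit": False, "lr": lr}
```

With these defaults, a loss history that jumps from a mean of 1.0 to a mean of 3.0 across two consecutive five-step windows is classified as `divergence`, and the plan is to cancel and resubmit at half the learning rate.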

Concrete workload

OpenVLA-7B LoRA fine-tuning on the LIBERO robotics benchmark. Tested on one HyperPod P5en node (8x H200): ~10 min for 500 steps, 15 GB checkpoint output.

Relation to awsome-distributed-ai PR

The standalone training recipe (Dockerfile + sbatch, no agent) is submitted separately as awslabs/awsome-distributed-ai#1112. This PR bundles the full agent + training workload as a self-contained AgentCore demo.

github-actions · 2026-05-29T02:26:17Z

Latest scan for commit: 977f929 | Updated: 2026-06-02 15:13:27 UTC

Security Scan Results

Scan Metadata

Project: ASH
Scan executed: 2026-06-02T15:13:06+00:00
ASH version: 3.0.0

Summary

Scanner Results

The table below shows findings by scanner, with status based on severity thresholds and dependencies:

Column Explanations:

Severity Levels (S/C/H/M/L/I):

Suppressed (S): Security findings that have been explicitly suppressed/ignored and don't affect the scanner's pass/fail status
Critical (C): The most severe security vulnerabilities requiring immediate remediation (e.g., SQL injection, remote code execution)
High (H): Serious security vulnerabilities that should be addressed promptly (e.g., authentication bypasses, privilege escalation)
Medium (M): Moderate security risks that should be addressed in normal development cycles (e.g., weak encryption, input validation issues)
Low (L): Minor security concerns with limited impact (e.g., information disclosure, weak recommendations)
Info (I): Informational findings for awareness with minimal security risk (e.g., code quality suggestions, best practice recommendations)

Other Columns:

Time: Duration taken by each scanner to complete its analysis
Action: Total number of actionable findings at or above the configured severity threshold that require attention

Scanner Results:

PASSED: Scanner found no security issues at or above the configured severity threshold - code is clean for this scanner
FAILED: Scanner found security vulnerabilities at or above the threshold that require attention and remediation
MISSING: Scanner could not run because required dependencies/tools are not installed or available
SKIPPED: Scanner was intentionally disabled or excluded from this scan
ERROR: Scanner encountered an execution error and could not complete successfully

Severity Thresholds (Thresh Column):

CRITICAL: Only Critical severity findings cause scanner to fail
HIGH: High and Critical severity findings cause scanner to fail
MEDIUM (MED): Medium, High, and Critical severity findings cause scanner to fail
LOW: Low, Medium, High, and Critical severity findings cause scanner to fail
ALL: Any finding of any severity level causes scanner to fail

Threshold Source: Values in parentheses indicate where the threshold is configured:

(g) = global: Set in the global_settings section of ASH configuration
(c) = config: Set in the individual scanner configuration section
(s) = scanner: Default threshold built into the scanner itself

Statistics calculation:

All statistics are calculated from the final aggregated SARIF report
Suppressed findings are counted separately and do not contribute to actionable findings
Scanner status is determined by comparing actionable findings to the threshold
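The threshold rules above can be sketched in a few lines. This is an illustration of the rules as stated in this comment, not ASH's implementation. The names `actionable_count` and `scanner_status` and the dictionary-of-counts input are hypothetical.

```python
from typing import Dict

# Severities ordered from least to most severe.
SEVERITIES = ["info", "low", "medium", "high", "critical"]
ALIASES = {"med": "medium", "all": "info"}  # ALL means any severity fails


def actionable_count(counts: Dict[str, int], threshold: str) -> int:
    """Count findings at or above the threshold; suppressed findings are ignored."""
    floor = threshold.lower()
    floor = ALIASES.get(floor, floor)
    start = SEVERITIES.index(floor)
    return sum(counts.get(sev, 0) for sev in SEVERITIES[start:])


def scanner_status(counts: Dict[str, int], threshold: str) -> str:
    """Return PASSED when no actionable findings exist, otherwise FAILED."""
    return "FAILED" if actionable_count(counts, threshold) > 0 else "PASSED"
```

Under these rules, a scanner reporting two Low findings against a MED threshold still passes, because Low sits below the threshold. Reading the `L` column as Low, this is consistent with the bandit result reported here.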

| Scanner | L | Time | Result | Thresh |
| --- | --- | --- | --- | --- |
| bandit | 2 | 797ms | PASSED | MED (g) |
| cdk-nag | 0 | 30.1s | PASSED | MED (g) |
| cfn-nag | 0 | 34ms | PASSED | MED (g) |
| checkov | 0 | 5.4s | PASSED | MED (g) |
| detect-secrets | 0 | 760ms | PASSED | MED (g) |
| grype | 0 | 46.0s | PASSED | MED (g) |
| npm-audit | 0 | 205ms | PASSED | MED (g) |
| opengrep | 0 | <1ms | SKIPPED | MED (g) |
| semgrep | 0 | <1ms | MISSING | MED (g) |
| syft | 0 | 2.5s | PASSED | MED (g) |

…ration

Demonstrates AgentCore managing GPU training jobs on HyperPod Slurm:

- MCP server exposing 6 Slurm tools (submit, status, logs, cancel, info, metrics)
- Autonomous agent with anomaly detection (divergence, stall, NaN) and recovery
- OpenVLA-7B LoRA fine-tuning on LIBERO as concrete workload
- Container-based (Pyxis/Enroot) with pinned deps for reproducibility
- All config via environment variables, no hardcoded paths/credentials

Tested: 500 steps, ~10 min on 1x P5en node (8x H200), 15 GB checkpoint output.

github-actions Bot added the 02-use-cases label May 29, 2026

mijanur132 force-pushed the add-ml-training-agent branch from 7b5a3a7 to 4801535 May 29, 2026 02:23

mijanur132 changed the title ~~Add ML Training Agent use case: autonomous Slurm training orchestration~~ Add OpenVLA AgentCore Orchestrator: autonomous Slurm training orchestration May 29, 2026

mijanur132 force-pushed the add-ml-training-agent branch 2 times, most recently from 21413fc to 73158b5 June 2, 2026 01:49

mijanur132 force-pushed the add-ml-training-agent branch from 73158b5 to 977f929 June 2, 2026 15:07

mijanur132 wants to merge 1 commit into awslabs:main from mijanur132:add-ml-training-agent
