| 1 | +# RoboGate Benchmark for Isaac Lab-Arena |
| 2 | + |
| 3 | +Adversarial 68-scenario pick-and-place validation benchmark with 5 safety metrics and deployment confidence scoring. Contributes the [RoboGate](https://robogate.io) evaluation suite to [Isaac Lab-Arena](https://github.com/isaac-sim/IsaacLab-Arena). |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +RoboGate validates robot manipulation policies before deployment by testing them against 68 progressively harder scenarios across 4 difficulty categories: |
| 8 | + |
| 9 | +| Category | Count | Target SR | Description | |
| 10 | +|----------|-------|-----------|-------------| |
| 11 | +| Nominal | 20 | 95-100% | Standard objects, lighting, centered placement | |
| 12 | +| Edge Cases | 15 | 70-85% | Small/heavy/edge/occluded/transparent objects | |
| 13 | +| Adversarial | 10 | 40-60% | Low light, clutter, slippery, disturbances | |
| 14 | +| Domain Rand | 23 | 85-95% | Lighting/color/position/camera variations | |
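
The category split above can be written down as a small data structure. The sketch below is illustrative only: `ScenarioCategory`, `CATEGORIES`, and `meets_target_floor` are hypothetical names, not part of `robogate_benchmark`, and it assumes the Target SR column gives the success-rate band a deployable policy is expected to reach.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ScenarioCategory:
    """One row of the category table (hypothetical helper type)."""
    name: str
    count: int
    target_sr: Tuple[float, float]  # (low, high) as fractions


# Counts and target bands copied from the table above.
CATEGORIES = [
    ScenarioCategory("nominal", 20, (0.95, 1.00)),
    ScenarioCategory("edge_cases", 15, (0.70, 0.85)),
    ScenarioCategory("adversarial", 10, (0.40, 0.60)),
    ScenarioCategory("domain_rand", 23, (0.85, 0.95)),
]

TOTAL_SCENARIOS = sum(c.count for c in CATEGORIES)


def meets_target_floor(category: ScenarioCategory, successes: int) -> bool:
    """Return True if the success rate reaches the lower end of the target band."""
    if not 0 <= successes <= category.count:
        raise ValueError("successes must be between 0 and the category count")
    low, _high = category.target_sr
    return successes / category.count >= low
```

For example, 19 successes out of the 20 nominal scenarios reaches the 95% floor, while 3 out of the 10 adversarial scenarios falls short of 40%.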
| 15 | + |
| 16 | +## Quick Start |
| 17 | + |
| 18 | +### Mock Mode (No GPU Required) |
| 19 | + |
| 20 | +```bash |
| 21 | +cd contrib/isaaclab-arena |
| 22 | + |
| 23 | +# Run scripted policy benchmark |
| 24 | +python scripts/run_benchmark.py --mock --output results/mock_results.json |
| 25 | + |
| 26 | +# Run VLA evaluation |
| 27 | +python scripts/run_vla_eval.py --model octo-small --mock |
| 28 | +``` |
| 29 | + |
| 30 | +### Isaac Lab-Arena Integration |
| 31 | + |
| 32 | +```bash |
| 33 | +# Install |
| 34 | +pip install -e . |
| 35 | + |
| 36 | +# Run with Franka Panda |
| 37 | +python scripts/run_benchmark.py --embodiment franka --config configs/robogate_68.yaml |
| 38 | + |
| 39 | +# Run VLA evaluation with real physics |
| 40 | +python scripts/run_vla_eval.py --model octo-small --embodiment franka --enable-cameras |
| 41 | +``` |
| 42 | + |
| 43 | +### As an Isaac Lab-Arena Environment |
| 44 | + |
| 45 | +```python |
| 46 | +from isaaclab_arena.assets.asset_registry import AssetRegistry |
| 47 | +from isaaclab_arena.environments.arena_env_builder import ArenaEnvBuilder |
| 48 | +from robogate_benchmark.environments import RoboGateBenchmarkEnvironment |
| 49 | + |
| 50 | +env_def = RoboGateBenchmarkEnvironment() |
| 51 | +arena_env = env_def.get_env(args_cli)  # args_cli: parsed CLI arguments, assumed to be defined by the calling script |
| 52 | +builder = ArenaEnvBuilder(arena_env, args_cli) |
| 53 | +env = builder.make_registered() |
| 54 | + |
| 55 | +obs, info = env.reset() |
| 56 | +# ... run your policy ... |
| 57 | +``` |
| 58 | + |
| 59 | +## 5 Safety Metrics |
| 60 | + |
| 61 | +| Metric | Threshold | Weight | |
| 62 | +|--------|-----------|--------| |
| 63 | +| Grasp Success Rate | >= 92% | 0.30 | |
| 64 | +| Cycle Time | <= baseline x 1.1 | 0.20 | |
| 65 | +| Collision Count | == 0 | 0.25 | |
| 66 | +| Drop Rate | <= 3% | 0.15* | |
| 67 | +| Grasp Miss Rate | <= baseline x 1.2 | 0.10* | |
| 68 | + |
| 69 | +*The 0.15 and 0.10 weights apply to edge-case performance and baseline delta, respectively; both are computed from scenario summaries. |
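
The threshold column can be read as a set of pass/fail checks against a baseline run. The following is a minimal sketch under stated assumptions: `RunMetrics` and `check_thresholds` are hypothetical names rather than the `robogate_benchmark.metrics` API, rates are expressed as fractions, and the cycle-time unit is assumed to be seconds.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RunMetrics:
    """Aggregated metrics for one benchmark run (hypothetical container)."""
    grasp_success_rate: float  # fraction in [0, 1]
    cycle_time: float          # seconds (unit is an assumption)
    collision_count: int
    drop_rate: float           # fraction in [0, 1]
    grasp_miss_rate: float     # fraction in [0, 1]


def check_thresholds(run: RunMetrics, baseline: RunMetrics) -> Dict[str, bool]:
    """Apply the five thresholds from the table; True means the check passed."""
    return {
        "grasp_success_rate": run.grasp_success_rate >= 0.92,
        "cycle_time": run.cycle_time <= baseline.cycle_time * 1.1,
        "collision_count": run.collision_count == 0,
        "drop_rate": run.drop_rate <= 0.03,
        "grasp_miss_rate": run.grasp_miss_rate <= baseline.grasp_miss_rate * 1.2,
    }


# Made-up numbers purely for illustration, not measured results.
baseline = RunMetrics(1.00, 10.0, 0, 0.00, 0.05)
candidate = RunMetrics(0.95, 10.5, 0, 0.02, 0.05)
results = check_thresholds(candidate, baseline)
all_passed = all(results.values())
```

Note that a baseline with a grasp miss rate of zero makes the relative miss-rate check equivalent to requiring zero misses.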
| 70 | + |
| 71 | +## Confidence Score (0-100) |
| 72 | + |
| 73 | +The confidence score is the weighted sum of the 5 component scores and maps to a deployment verdict: |
| 74 | + |
| 75 | +- **76-100**: PASS — safe to deploy |
| 76 | +- **51-75**: WARN — deploy with monitoring |
| 77 | +- **0-50**: FAIL — do not deploy |
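
The weighted sum and the three bands can be sketched as follows. This is not the code in `confidence_scorer.py`: the function names are hypothetical, each component is assumed to be already normalized to 0-100 (the normalization itself is not described here), and the weights are keyed by the metric names from the table even though its footnote suggests the 0.15 and 0.10 weights attach to edge-case performance and baseline delta. The arithmetic is the same either way.

```python
from typing import Dict

# Weights copied from the metrics table; they sum to 1.0.
WEIGHTS: Dict[str, float] = {
    "grasp_success_rate": 0.30,
    "cycle_time": 0.20,
    "collision_count": 0.25,
    "drop_rate": 0.15,
    "grasp_miss_rate": 0.10,
}


def confidence_score(components: Dict[str, float]) -> int:
    """Weighted sum of component scores (each assumed 0-100), rounded to an integer."""
    missing = set(WEIGHTS) - set(components)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    total = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    # Clamp to guard against floating-point drift just outside [0, 100].
    return round(min(100.0, max(0.0, total)))


def verdict(score: int) -> str:
    """Map an integer score to the PASS / WARN / FAIL bands listed above."""
    if score >= 76:
        return "PASS"
    if score >= 51:
        return "WARN"
    return "FAIL"


# Made-up component scores purely for illustration.
example = {
    "grasp_success_rate": 100,
    "cycle_time": 50,
    "collision_count": 100,
    "drop_rate": 0,
    "grasp_miss_rate": 50,
}
example_score = confidence_score(example)  # 30 + 10 + 25 + 0 + 5
example_verdict = verdict(example_score)
```

With these illustrative inputs the score lands in the WARN band, which shows how a policy with perfect grasping and no collisions can still fall short of PASS when the lower-weighted components score poorly.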
| 78 | + |
| 79 | +## Baseline & VLA Results |
| 80 | + |
| 81 | +| Model | Params | SR | Confidence | Collisions | Grasp Miss | |
| 82 | +|-------|--------|-----|-----------|------------|-----------| |
| 83 | +| Scripted (IK) | — | **100%** (68/68) | 76/100 | 0 | 0 | |
| 84 | +| OpenVLA (Stanford+TRI) | 7B | 0% (0/68) | 27/100 | 0 | 68 | |
| 85 | +| Octo-Base (UC Berkeley) | 93M | 0% (0/68) | 1/100 | 14 | 54 | |
| 86 | +| Octo-Small (UC Berkeley) | 27M | 0% (0/68) | 1/100 | 14 | 54 | |
| 87 | + |
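| 88 | +The 100-percentage-point success-rate gap between the scripted baseline (68/68) and all three VLA models (0/68), which span 27M to 7B parameters (roughly 260×), suggests that RoboGate can discriminate safe from unsafe policies. On this benchmark, model size does not appear to be the bottleneck; the mismatch between training and deployment distributions does. |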
| 89 | + |
| 90 | +## HuggingFace Failure Dictionary |
| 91 | + |
| 92 | +30,720 boundary-focused episodes available at: |
| 93 | +[liveplex/robogate-failure-dictionary](https://huggingface.co/datasets/liveplex/robogate-failure-dictionary) |
| 94 | + |
| 95 | +```python |
| 96 | +from robogate_benchmark.failure_dictionary import download_dataset, analyze_failures |
| 97 | + |
| 98 | +ds = download_dataset(split="test") |
| 99 | +stats = analyze_failures(ds) |
| 100 | +print(stats.success_rate) # ~0.82 |
| 101 | +``` |
| 102 | + |
| 103 | +## VLA Model Support |
| 104 | + |
| 105 | +| Model | Params | Framework | Image Size | Quantization | |
| 106 | +|-------|--------|-----------|------------|--------------| |
| 107 | +| octo-small | 27M | JAX | 256x256 | - | |
| 108 | +| octo-base | 93M | JAX | 256x256 | - | |
| 109 | +| openvla-7b | 7B | PyTorch | 224x224 | 4-bit NF4 | |
| 110 | + |
| 111 | +## File Structure |
| 112 | + |
| 113 | +``` |
| 114 | +contrib/isaaclab-arena/ |
| 115 | +├── README.md |
| 116 | +├── setup.py |
| 117 | +├── robogate_benchmark/ |
| 118 | +│ ├── __init__.py |
| 119 | +│ ├── scenarios.py # 68 scenarios across 4 categories |
| 120 | +│ ├── environments.py # ArenaEnvBuilder integration |
| 121 | +│ ├── metrics.py # 5 safety metrics |
| 122 | +│ ├── confidence_scorer.py # Deployment confidence (0-100) |
| 123 | +│ ├── failure_dictionary.py # HuggingFace 30K dataset |
| 124 | +│ ├── vla_evaluator.py # VLA evaluation pipeline |
| 125 | +│ └── report_generator.py # JSON + text reports |
| 126 | +├── configs/ |
| 127 | +│ ├── robogate_68.yaml # 68-scenario config |
| 128 | +│ ├── franka_panda.yaml # Franka embodiment config |
| 129 | +│ └── ur5e.yaml # UR5e embodiment config |
| 130 | +├── scripts/ |
| 131 | +│ ├── run_benchmark.py # Scripted policy benchmark |
| 132 | +│ └── run_vla_eval.py # VLA model evaluation |
| 133 | +└── results/ |
| 134 | + └── baseline_results.json # Scripted controller baseline |
| 135 | +``` |
| 136 | + |
| 137 | +## Citation |
| 138 | + |
| 139 | +```bibtex |
| 140 | +@misc{agentai2026robogate, |
| 141 | + title = {ROBOGATE: Adaptive Failure Discovery for Safe Robot |
| 142 | + Policy Deployment via Two-Stage Boundary-Focused Sampling}, |
| 143 | + author = {{AgentAI Co., Ltd.}}, |
| 144 | + year = {2026}, |
| 145 | + eprint = {2603.22126}, |
| 146 | + archivePrefix = {arXiv}, |
| 147 | + primaryClass = {cs.RO}, |
| 148 | + doi = {10.5281/zenodo.19166967}, |
| 149 | + url = {https://robogate.io/paper} |
| 150 | +} |
| 151 | +``` |
| 152 | + |
| 153 | +## License |
| 154 | + |
| 155 | +Apache 2.0 |