
Commit 9362d80

Ubuntu committed
[Benchmark] RoboGate: 68-Scenario Adversarial Pick-and-Place
1 parent dc0bbd7 commit 9362d80

15 files changed: 3118 additions & 0 deletions

contrib/isaaclab-arena/README.md: 155 additions & 0 deletions
@@ -0,0 +1,155 @@
# RoboGate Benchmark for Isaac Lab-Arena

Adversarial 68-scenario pick-and-place validation benchmark with 5 safety metrics and deployment confidence scoring. Contributes the [RoboGate](https://robogate.io) evaluation suite to [Isaac Lab-Arena](https://github.com/isaac-sim/IsaacLab-Arena).

## Overview

RoboGate validates robot manipulation policies before deployment by testing them against 68 progressively harder scenarios across 4 difficulty categories:

| Category | Count | Target SR | Description |
|----------|-------|-----------|-------------|
| Nominal | 20 | 95-100% | Standard objects, lighting, centered placement |
| Edge Cases | 15 | 70-85% | Small/heavy/edge/occluded/transparent objects |
| Adversarial | 10 | 40-60% | Low light, clutter, slippery, disturbances |
| Domain Rand | 23 | 85-95% | Lighting/color/position/camera variations |
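
The category counts and target success-rate bands can be captured in a small table of records; the sketch below is a hypothetical illustration (the names `ScenarioCategory` and `CATEGORIES` are not from the actual `scenarios.py`):

```python
from dataclasses import dataclass

# Hypothetical illustration of the category structure above;
# the committed scenarios.py may organize this differently.
@dataclass(frozen=True)
class ScenarioCategory:
    name: str
    count: int                       # number of scenarios in the category
    target_sr: tuple[float, float]   # expected success-rate band

CATEGORIES = [
    ScenarioCategory("nominal", 20, (0.95, 1.00)),
    ScenarioCategory("edge_cases", 15, (0.70, 0.85)),
    ScenarioCategory("adversarial", 10, (0.40, 0.60)),
    ScenarioCategory("domain_rand", 23, (0.85, 0.95)),
]

assert sum(c.count for c in CATEGORIES) == 68
```
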
## Quick Start

### Mock Mode (No GPU Required)

```bash
cd contrib/isaaclab-arena

# Run scripted policy benchmark
python scripts/run_benchmark.py --mock --output results/mock_results.json

# Run VLA evaluation
python scripts/run_vla_eval.py --model octo-small --mock
```

### Isaac Lab-Arena Integration

```bash
# Install
pip install -e .

# Run with Franka Panda
python scripts/run_benchmark.py --embodiment franka --config configs/robogate_68.yaml

# Run VLA evaluation with real physics
python scripts/run_vla_eval.py --model octo-small --embodiment franka --enable-cameras
```

### As Isaac Lab-Arena Environment

```python
from isaaclab_arena.assets.asset_registry import AssetRegistry
from isaaclab_arena.environments.arena_env_builder import ArenaEnvBuilder
from robogate_benchmark.environments import RoboGateBenchmarkEnvironment

env_def = RoboGateBenchmarkEnvironment()
arena_env = env_def.get_env(args_cli)
builder = ArenaEnvBuilder(arena_env, args_cli)
env = builder.make_registered()

obs, info = env.reset()
# ... run your policy ...
```
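
A minimal rollout loop on top of this snippet could look like the following sketch; it assumes the built environment exposes the standard Gymnasium API (`action_space`, `step`, `close`), which the `env.reset()` call above suggests but the snippet does not confirm:

```python
# Sketch only: assumes a Gymnasium-style step API; swap the random
# action for your own policy's inference call.
obs, info = env.reset()
for _ in range(500):  # illustrative step budget
    action = env.action_space.sample()  # stand-in for policy(obs)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```
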
## 5 Safety Metrics

| Metric | Threshold | Weight |
|--------|-----------|--------|
| Grasp Success Rate | >= 92% | 0.30 |
| Cycle Time | <= baseline x 1.1 | 0.20 |
| Collision Count | == 0 | 0.25 |
| Drop Rate | <= 3% | 0.15* |
| Grasp Miss Rate | <= baseline x 1.2 | 0.10* |

*In the confidence score, the 0.15 and 0.10 weights are carried by edge-case performance and baseline delta, which are computed from scenario summaries rather than from these two metrics directly.

## Confidence Score (0-100)

The confidence score is a weighted sum of 5 component scores; verdicts are assigned by band (a worked example follows this list):

- **76-100**: PASS — safe to deploy
- **51-75**: WARN — deploy with monitoring
- **0-50**: FAIL — do not deploy
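
For example, a policy with component scores of 100 (grasp success), 100 (cycle time), 100 (collisions), 50 (edge cases), and 50 (baseline delta) scores:

```
score = 0.30*100 + 0.20*100 + 0.25*100 + 0.15*50 + 0.10*50
      = 30 + 20 + 25 + 7.5 + 5
      = 87.5  ->  PASS (>= 76)
```
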
## Baseline & VLA Results

| Model | Params | SR | Confidence | Collisions | Grasp Miss |
|-------|--------|-----|-----------|------------|-----------|
| Scripted (IK) | - | **100%** (68/68) | 76/100 | 0 | 0 |
| OpenVLA (Stanford+TRI) | 7B | 0% (0/68) | 27/100 | 0 | 68 |
| Octo-Base (UC Berkeley) | 93M | 0% (0/68) | 1/100 | 14 | 54 |
| Octo-Small (UC Berkeley) | 27M | 0% (0/68) | 1/100 | 14 | 54 |

The 100-percentage-point success-rate gap between the scripted baseline and all three VLA models (27M→7B, a 260× range in scale) validates RoboGate's ability to discriminate safe from unsafe policies. Model size is not the bottleneck — training-deployment distribution mismatch is.

## HuggingFace Failure Dictionary

30,720 boundary-focused episodes available at:
[liveplex/robogate-failure-dictionary](https://huggingface.co/datasets/liveplex/robogate-failure-dictionary)

```python
from robogate_benchmark.failure_dictionary import download_dataset, analyze_failures

ds = download_dataset(split="test")
stats = analyze_failures(ds)
print(stats.success_rate)  # ~0.82
```

## VLA Model Support

| Model | Params | Framework | Image Size | Quantization |
|-------|--------|-----------|------------|--------------|
| octo-small | 27M | JAX | 256x256 | - |
| octo-base | 93M | JAX | 256x256 | - |
| openvla-7b | 7B | PyTorch | 224x224 | 4-bit NF4 |
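
For openvla-7b, 4-bit NF4 loading is typically done through Hugging Face Transformers with bitsandbytes; the sketch below uses the standard public API and is not necessarily how `run_vla_eval.py` loads the model:

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

# Sketch of 4-bit NF4 quantized loading; the benchmark's own loader may differ.
quant_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    quantization_config=quant_cfg,
    trust_remote_code=True,
)
```
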
## File Structure

```
contrib/isaaclab-arena/
├── README.md
├── setup.py
├── robogate_benchmark/
│   ├── __init__.py
│   ├── scenarios.py           # 68 scenarios (20 nominal / 15 edge / 10 adversarial / 23 domain-rand)
│   ├── environments.py        # ArenaEnvBuilder integration
│   ├── metrics.py             # 5 safety metrics
│   ├── confidence_scorer.py   # Deployment confidence (0-100)
│   ├── failure_dictionary.py  # HuggingFace 30K dataset
│   ├── vla_evaluator.py       # VLA evaluation pipeline
│   └── report_generator.py    # JSON + text reports
├── configs/
│   ├── robogate_68.yaml       # 68-scenario config
│   ├── franka_panda.yaml      # Franka embodiment config
│   └── ur5e.yaml              # UR5e embodiment config
├── scripts/
│   ├── run_benchmark.py       # Scripted policy benchmark
│   └── run_vla_eval.py        # VLA model evaluation
└── results/
    └── baseline_results.json  # Scripted controller baseline
```

## Citation

```bibtex
@misc{agentai2026robogate,
  title         = {ROBOGATE: Adaptive Failure Discovery for Safe Robot Policy Deployment via Two-Stage Boundary-Focused Sampling},
  author        = {{AgentAI Co., Ltd.}},
  year          = {2026},
  eprint        = {2603.22126},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  doi           = {10.5281/zenodo.19166967},
  url           = {https://robogate.io/paper}
}
```

## License

Apache 2.0
contrib/isaaclab-arena/robogate_benchmark/__init__.py: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
"""RoboGate Benchmark for Isaac Lab-Arena.

68-scenario adversarial pick-and-place validation suite with 5 safety
metrics and deployment confidence scoring (0-100).

Usage with ArenaEnvBuilder::

    from robogate_benchmark.environments import RoboGateBenchmarkEnvironment
    env_def = RoboGateBenchmarkEnvironment()
    arena_env = env_def.get_env(args_cli)

Usage standalone::

    python -m scripts.run_benchmark --embodiment franka --config configs/robogate_68.yaml
"""

__version__ = "1.0.0"
__author__ = "Byungjin Kim"
contrib/isaaclab-arena/robogate_benchmark/confidence_scorer.py: 195 additions & 0 deletions
@@ -0,0 +1,195 @@
"""Deployment Confidence Score calculator (0-100).

Weighted sum of 5 component scores:
    grasp_success_rate: 0.30
    cycle_time: 0.20
    collision_count: 0.25
    edge_case_performance: 0.15
    baseline_delta: 0.10

Score interpretation:
    76-100: PASS — safe to deploy
    51-75: WARN — deploy with monitoring
    0-50: FAIL — do not deploy
"""

from __future__ import annotations

from typing import Any

from robogate_benchmark.metrics import ScenarioSummary


DEFAULT_WEIGHTS: dict[str, float] = {
    "grasp_success_rate": 0.30,
    "cycle_time": 0.20,
    "collision_count": 0.25,
    "edge_case_performance": 0.15,
    "baseline_delta": 0.10,
}


def _score_grasp_success_rate(value: float) -> float:
    """Score grasp success rate (0-100).

    Maps 0.80-1.00 to 0-100. Below 0.80 = 0.
    """
    if value >= 1.0:
        return 100.0
    if value <= 0.80:
        return 0.0
    return (value - 0.80) / 0.20 * 100.0


def _score_cycle_time(value: float, baseline_value: float | None) -> float:
    """Score cycle time relative to baseline (0-100).

    100 = same or better. 0 = 30%+ slower.
    """
    if baseline_value is None or baseline_value == 0:
        return 50.0
    ratio = value / baseline_value
    if ratio <= 1.0:
        return 100.0
    if ratio >= 1.3:
        return 0.0
    return (1.3 - ratio) / 0.3 * 100.0


def _score_collision_count(value: int) -> float:
    """Score collision count (0-100).

    0 collisions = 100, 1 = 50, 2 = 25, 3+ = 0.
    """
    if value == 0:
        return 100.0
    if value == 1:
        return 50.0
    if value == 2:
        return 25.0
    return 0.0


def _score_edge_case_performance(
    scenario_summaries: dict[str, ScenarioSummary],
) -> float:
    """Score edge case performance (0-100)."""
    edge = scenario_summaries.get("edge_cases")
    if edge is None or edge.total == 0:
        return 50.0
    return edge.pass_rate * 100.0


def _score_baseline_delta(metrics: dict[str, dict[str, Any]]) -> float:
    """Score overall baseline delta (0-100).

    100 = all improved. 0 = all regressed.
    """
    improvements = 0
    regressions = 0
    total = 0

    for metric_id, m in metrics.items():
        delta = m.get("delta")
        if delta is None:
            continue
        total += 1
        # For grasp_success_rate: higher is better
        if metric_id == "grasp_success_rate":
            if delta > 0:
                improvements += 1
            elif delta < 0:
                regressions += 1
        else:
            # For all others: lower is better
            if delta < 0:
                improvements += 1
            elif delta > 0:
                regressions += 1

    if total == 0:
        return 50.0

    ratio = (improvements - regressions) / total
    return (ratio + 1.0) / 2.0 * 100.0


def compute_confidence_score(
    metrics: dict[str, dict[str, Any]],
    scenario_summaries: dict[str, ScenarioSummary],
    baseline_metrics: dict[str, float | int] | None = None,
    weights: dict[str, float] | None = None,
) -> dict[str, Any]:
    """Compute Deployment Confidence Score (0-100).

    Args:
        metrics: Evaluated metric results (from evaluate_all_metrics).
        scenario_summaries: Per-category summaries.
        baseline_metrics: Baseline metric values for cycle_time scoring.
        weights: Override weight dict.

    Returns:
        Dictionary with 'score', 'verdict', and 'components'.
    """
    if weights is None:
        weights = DEFAULT_WEIGHTS

    components: dict[str, float] = {}

    # grasp_success_rate
    gsr = metrics.get("grasp_success_rate", {})
    components["grasp_success_rate"] = _score_grasp_success_rate(gsr.get("value", 0.0))

    # cycle_time
    ct = metrics.get("cycle_time", {})
    ct_baseline = (
        float(baseline_metrics["cycle_time"])
        if baseline_metrics and "cycle_time" in baseline_metrics
        else ct.get("baseline")
    )
    components["cycle_time"] = _score_cycle_time(ct.get("value", 0.0), ct_baseline)

    # collision_count
    cc = metrics.get("collision_count", {})
    components["collision_count"] = _score_collision_count(int(cc.get("value", 0)))

    # edge_case_performance
    components["edge_case_performance"] = _score_edge_case_performance(scenario_summaries)

    # baseline_delta
    components["baseline_delta"] = _score_baseline_delta(metrics)

    # Weighted sum
    score = sum(
        weights.get(k, 0.0) * v for k, v in components.items() if k in weights
    )
    score = max(0.0, min(100.0, round(score, 1)))

    # Verdict
    if score >= 76:
        verdict = "PASS"
    elif score >= 51:
        verdict = "WARN"
    else:
        verdict = "FAIL"

    return {
        "score": score,
        "verdict": verdict,
        "components": components,
        "weights": weights,
    }
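

# --- Hypothetical usage sketch (not part of the committed file). The metric
# dict shapes are assumptions inferred from the accessors above, and
# SimpleNamespace stands in for ScenarioSummary (the scorer reads only
# .total and .pass_rate from it).
if __name__ == "__main__":
    from types import SimpleNamespace

    example_metrics = {
        "grasp_success_rate": {"value": 0.96, "delta": 0.01},
        "cycle_time": {"value": 5.2, "baseline": 5.0, "delta": 0.2},
        "collision_count": {"value": 0, "delta": 0.0},
    }
    example_summaries = {"edge_cases": SimpleNamespace(total=15, pass_rate=0.8)}
    result = compute_confidence_score(example_metrics, example_summaries)
    print(result["score"], result["verdict"])  # -> 83.3 PASS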
