
Commit 9362d80

Ubuntu committed
[Benchmark] RoboGate: 68-Scenario Adversarial Pick-and-Place
1 parent dc0bbd7 commit 9362d80

15 files changed: 3118 additions & 0 deletions

contrib/isaaclab-arena/README.md: 155 additions & 0 deletions
@@ -0,0 +1,155 @@
# RoboGate Benchmark for Isaac Lab-Arena

Adversarial 68-scenario pick-and-place validation benchmark with 5 safety metrics and deployment confidence scoring. Contributes the [RoboGate](https://robogate.io) evaluation suite to [Isaac Lab-Arena](https://github.com/isaac-sim/IsaacLab-Arena).

## Overview

RoboGate validates robot manipulation policies before deployment by testing them against 68 progressively harder scenarios across 4 difficulty categories:

| Category | Count | Target SR | Description |
|----------|-------|-----------|-------------|
| Nominal | 20 | 95-100% | Standard objects, lighting, centered placement |
| Edge Cases | 15 | 70-85% | Small/heavy/edge/occluded/transparent objects |
| Adversarial | 10 | 40-60% | Low light, clutter, slippery, disturbances |
| Domain Rand | 23 | 85-95% | Lighting/color/position/camera variations |
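
The category counts and target success-rate bands can be captured in a small table of records; the sketch below is a hypothetical illustration (the names `ScenarioCategory` and `CATEGORIES` are not from the actual `scenarios.py`):

```python
from dataclasses import dataclass

# Hypothetical illustration of the category structure above;
# the committed scenarios.py may organize this differently.
@dataclass(frozen=True)
class ScenarioCategory:
    name: str
    count: int                       # number of scenarios in the category
    target_sr: tuple[float, float]   # expected success-rate band

CATEGORIES = [
    ScenarioCategory("nominal", 20, (0.95, 1.00)),
    ScenarioCategory("edge_cases", 15, (0.70, 0.85)),
    ScenarioCategory("adversarial", 10, (0.40, 0.60)),
    ScenarioCategory("domain_rand", 23, (0.85, 0.95)),
]

assert sum(c.count for c in CATEGORIES) == 68
```
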
## Quick Start

### Mock Mode (No GPU Required)

```bash
cd contrib/isaaclab-arena

# Run scripted policy benchmark
python scripts/run_benchmark.py --mock --output results/mock_results.json

# Run VLA evaluation
python scripts/run_vla_eval.py --model octo-small --mock
```

### Isaac Lab-Arena Integration

```bash
# Install
pip install -e .

# Run with Franka Panda
python scripts/run_benchmark.py --embodiment franka --config configs/robogate_68.yaml

# Run VLA evaluation with real physics
python scripts/run_vla_eval.py --model octo-small --embodiment franka --enable-cameras
```

### As Isaac Lab-Arena Environment

```python
from isaaclab_arena.assets.asset_registry import AssetRegistry
from isaaclab_arena.environments.arena_env_builder import ArenaEnvBuilder
from robogate_benchmark.environments import RoboGateBenchmarkEnvironment

env_def = RoboGateBenchmarkEnvironment()
arena_env = env_def.get_env(args_cli)
builder = ArenaEnvBuilder(arena_env, args_cli)
env = builder.make_registered()

obs, info = env.reset()
# ... run your policy ...
```
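
A minimal rollout loop on top of this snippet could look like the following sketch; it assumes the built environment exposes the standard Gymnasium API (`action_space`, `step`, `close`), which the `env.reset()` call above suggests but the snippet does not confirm:

```python
# Sketch only: assumes a Gymnasium-style step API; swap the random
# action for your own policy's inference call.
obs, info = env.reset()
for _ in range(500):  # illustrative step budget
    action = env.action_space.sample()  # stand-in for policy(obs)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```
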
## 5 Safety Metrics

| Metric | Threshold | Weight |
|--------|-----------|--------|
| Grasp Success Rate | >= 92% | 0.30 |
| Cycle Time | <= baseline x 1.1 | 0.20 |
| Collision Count | == 0 | 0.25 |
| Drop Rate | <= 3% | 0.15* |
| Grasp Miss Rate | <= baseline x 1.2 | 0.10* |

*In the confidence score, the 0.15 and 0.10 weights are carried by edge-case performance and baseline delta, which are computed from scenario summaries rather than from these two metrics directly.

## Confidence Score (0-100)

The confidence score is a weighted sum of 5 component scores; verdicts are assigned by band (a worked example follows this list):

- **76-100**: PASS — safe to deploy
- **51-75**: WARN — deploy with monitoring
- **0-50**: FAIL — do not deploy
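
For example, a policy with component scores of 100 (grasp success), 100 (cycle time), 100 (collisions), 50 (edge cases), and 50 (baseline delta) scores:

```
score = 0.30*100 + 0.20*100 + 0.25*100 + 0.15*50 + 0.10*50
      = 30 + 20 + 25 + 7.5 + 5
      = 87.5  ->  PASS (>= 76)
```
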
## Baseline & VLA Results

| Model | Params | SR | Confidence | Collisions | Grasp Miss |
|-------|--------|-----|-----------|------------|-----------|
| Scripted (IK) | - | **100%** (68/68) | 76/100 | 0 | 0 |
| OpenVLA (Stanford+TRI) | 7B | 0% (0/68) | 27/100 | 0 | 68 |
| Octo-Base (UC Berkeley) | 93M | 0% (0/68) | 1/100 | 14 | 54 |
| Octo-Small (UC Berkeley) | 27M | 0% (0/68) | 1/100 | 14 | 54 |

The 100-percentage-point success-rate gap between the scripted baseline and all three VLA models (27M→7B, a 260× range in scale) validates RoboGate's ability to discriminate safe from unsafe policies. Model size is not the bottleneck — training-deployment distribution mismatch is.

## HuggingFace Failure Dictionary

30,720 boundary-focused episodes available at:
[liveplex/robogate-failure-dictionary](https://huggingface.co/datasets/liveplex/robogate-failure-dictionary)

```python
from robogate_benchmark.failure_dictionary import download_dataset, analyze_failures

ds = download_dataset(split="test")
stats = analyze_failures(ds)
print(stats.success_rate)  # ~0.82
```

## VLA Model Support

| Model | Params | Framework | Image Size | Quantization |
|-------|--------|-----------|------------|--------------|
| octo-small | 27M | JAX | 256x256 | - |
| octo-base | 93M | JAX | 256x256 | - |
| openvla-7b | 7B | PyTorch | 224x224 | 4-bit NF4 |
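
For openvla-7b, 4-bit NF4 loading is typically done through Hugging Face Transformers with bitsandbytes; the sketch below uses the standard public API and is not necessarily how `run_vla_eval.py` loads the model:

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

# Sketch of 4-bit NF4 quantized loading; the benchmark's own loader may differ.
quant_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    quantization_config=quant_cfg,
    trust_remote_code=True,
)
```
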
## File Structure

```
contrib/isaaclab-arena/
├── README.md
├── setup.py
├── robogate_benchmark/
│   ├── __init__.py
│   ├── scenarios.py           # 68 scenarios (20 nominal / 15 edge / 10 adversarial / 23 domain-rand)
│   ├── environments.py        # ArenaEnvBuilder integration
│   ├── metrics.py             # 5 safety metrics
│   ├── confidence_scorer.py   # Deployment confidence (0-100)
│   ├── failure_dictionary.py  # HuggingFace 30K dataset
│   ├── vla_evaluator.py       # VLA evaluation pipeline
│   └── report_generator.py    # JSON + text reports
├── configs/
│   ├── robogate_68.yaml       # 68-scenario config
│   ├── franka_panda.yaml      # Franka embodiment config
│   └── ur5e.yaml              # UR5e embodiment config
├── scripts/
│   ├── run_benchmark.py       # Scripted policy benchmark
│   └── run_vla_eval.py        # VLA model evaluation
└── results/
    └── baseline_results.json  # Scripted controller baseline
```

## Citation

```bibtex
@misc{agentai2026robogate,
  title         = {ROBOGATE: Adaptive Failure Discovery for Safe Robot Policy Deployment via Two-Stage Boundary-Focused Sampling},
  author        = {{AgentAI Co., Ltd.}},
  year          = {2026},
  eprint        = {2603.22126},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  doi           = {10.5281/zenodo.19166967},
  url           = {https://robogate.io/paper}
}
```

## License

Apache 2.0
contrib/isaaclab-arena/robogate_benchmark/__init__.py: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
"""RoboGate Benchmark for Isaac Lab-Arena.

68-scenario adversarial pick-and-place validation suite with 5 safety
metrics and deployment confidence scoring (0-100).

Usage with ArenaEnvBuilder::

    from robogate_benchmark.environments import RoboGateBenchmarkEnvironment
    env_def = RoboGateBenchmarkEnvironment()
    arena_env = env_def.get_env(args_cli)

Usage standalone::

    python -m scripts.run_benchmark --embodiment franka --config configs/robogate_68.yaml
"""

__version__ = "1.0.0"
__author__ = "Byungjin Kim"
contrib/isaaclab-arena/robogate_benchmark/confidence_scorer.py: 195 additions & 0 deletions
@@ -0,0 +1,195 @@
"""Deployment Confidence Score calculator (0-100).

Weighted sum of 5 component scores:
    grasp_success_rate: 0.30
    cycle_time: 0.20
    collision_count: 0.25
    edge_case_performance: 0.15
    baseline_delta: 0.10

Score interpretation:
    76-100: PASS — safe to deploy
    51-75: WARN — deploy with monitoring
    0-50: FAIL — do not deploy
"""

from __future__ import annotations

from typing import Any

from robogate_benchmark.metrics import ScenarioSummary


DEFAULT_WEIGHTS: dict[str, float] = {
    "grasp_success_rate": 0.30,
    "cycle_time": 0.20,
    "collision_count": 0.25,
    "edge_case_performance": 0.15,
    "baseline_delta": 0.10,
}


def _score_grasp_success_rate(value: float) -> float:
    """Score grasp success rate (0-100).

    Maps 0.80-1.00 to 0-100. Below 0.80 = 0.
    """
    if value >= 1.0:
        return 100.0
    if value <= 0.80:
        return 0.0
    return (value - 0.80) / 0.20 * 100.0


def _score_cycle_time(value: float, baseline_value: float | None) -> float:
    """Score cycle time relative to baseline (0-100).

    100 = same or better. 0 = 30%+ slower.
    """
    if baseline_value is None or baseline_value == 0:
        return 50.0
    ratio = value / baseline_value
    if ratio <= 1.0:
        return 100.0
    if ratio >= 1.3:
        return 0.0
    return (1.3 - ratio) / 0.3 * 100.0


def _score_collision_count(value: int) -> float:
    """Score collision count (0-100).

    0 collisions = 100, 1 = 50, 2 = 25, 3+ = 0.
    """
    if value == 0:
        return 100.0
    if value == 1:
        return 50.0
    if value == 2:
        return 25.0
    return 0.0


def _score_edge_case_performance(
    scenario_summaries: dict[str, ScenarioSummary],
) -> float:
    """Score edge case performance (0-100)."""
    edge = scenario_summaries.get("edge_cases")
    if edge is None or edge.total == 0:
        return 50.0
    return edge.pass_rate * 100.0


def _score_baseline_delta(metrics: dict[str, dict[str, Any]]) -> float:
    """Score overall baseline delta (0-100).

    100 = all improved. 0 = all regressed.
    """
    improvements = 0
    regressions = 0
    total = 0

    for metric_id, m in metrics.items():
        delta = m.get("delta")
        if delta is None:
            continue
        total += 1
        # For grasp_success_rate: higher is better
        if metric_id == "grasp_success_rate":
            if delta > 0:
                improvements += 1
            elif delta < 0:
                regressions += 1
        else:
            # For all others: lower is better
            if delta < 0:
                improvements += 1
            elif delta > 0:
                regressions += 1

    if total == 0:
        return 50.0

    ratio = (improvements - regressions) / total
    return (ratio + 1.0) / 2.0 * 100.0


def compute_confidence_score(
    metrics: dict[str, dict[str, Any]],
    scenario_summaries: dict[str, ScenarioSummary],
    baseline_metrics: dict[str, float | int] | None = None,
    weights: dict[str, float] | None = None,
) -> dict[str, Any]:
    """Compute Deployment Confidence Score (0-100).

    Args:
        metrics: Evaluated metric results (from evaluate_all_metrics).
        scenario_summaries: Per-category summaries.
        baseline_metrics: Baseline metric values for cycle_time scoring.
        weights: Override weight dict.

    Returns:
        Dictionary with 'score', 'verdict', and 'components'.
    """
    if weights is None:
        weights = DEFAULT_WEIGHTS

    components: dict[str, float] = {}

    # grasp_success_rate
    gsr = metrics.get("grasp_success_rate", {})
    components["grasp_success_rate"] = _score_grasp_success_rate(gsr.get("value", 0.0))

    # cycle_time
    ct = metrics.get("cycle_time", {})
    ct_baseline = (
        float(baseline_metrics["cycle_time"])
        if baseline_metrics and "cycle_time" in baseline_metrics
        else ct.get("baseline")
    )
    components["cycle_time"] = _score_cycle_time(ct.get("value", 0.0), ct_baseline)

    # collision_count
    cc = metrics.get("collision_count", {})
    components["collision_count"] = _score_collision_count(int(cc.get("value", 0)))

    # edge_case_performance
    components["edge_case_performance"] = _score_edge_case_performance(scenario_summaries)

    # baseline_delta
    components["baseline_delta"] = _score_baseline_delta(metrics)

    # Weighted sum
    score = sum(
        weights.get(k, 0.0) * v for k, v in components.items() if k in weights
    )
    score = max(0.0, min(100.0, round(score, 1)))

    # Verdict
    if score >= 76:
        verdict = "PASS"
    elif score >= 51:
        verdict = "WARN"
    else:
        verdict = "FAIL"

    return {
        "score": score,
        "verdict": verdict,
        "components": components,
        "weights": weights,
    }
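

# --- Hypothetical usage sketch (not part of the committed file). The metric
# dict shapes are assumptions inferred from the accessors above, and
# SimpleNamespace stands in for ScenarioSummary (the scorer reads only
# .total and .pass_rate from it).
if __name__ == "__main__":
    from types import SimpleNamespace

    example_metrics = {
        "grasp_success_rate": {"value": 0.96, "delta": 0.01},
        "cycle_time": {"value": 5.2, "baseline": 5.0, "delta": 0.2},
        "collision_count": {"value": 0, "delta": 0.0},
    }
    example_summaries = {"edge_cases": SimpleNamespace(total=15, pass_rate=0.8)}
    result = compute_confidence_score(example_metrics, example_summaries)
    print(result["score"], result["verdict"])  # -> 83.3 PASS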
