NVIDIA-NeMo
diff --git a/README.md b/README.md
diff --git a/resources_servers/grl_sokoban/README.md b/resources_servers/grl_sokoban/README.md
diff --git a/resources_servers/grl_sokoban/app.py b/resources_servers/grl_sokoban/app.py
diff --git a/resources_servers/grl_sokoban/configs/grl_sokoban.yaml b/resources_servers/grl_sokoban/configs/grl_sokoban.yaml
diff --git a/resources_servers/grl_sokoban/data/example.jsonl b/resources_servers/grl_sokoban/data/example.jsonl
diff --git a/resources_servers/grl_sokoban/data/example_metrics.json b/resources_servers/grl_sokoban/data/example_metrics.json
@@ -171,6 +171,7 @@ The Dataset column links to publicly available datasets (e.g., on HuggingFace).
 | Genrm Compare                                 | rlhf                  | GenRM pairwise comparison for RLHF training                                                                                                                                                                                  | Compare multiple candidate responses using GenRM model                                                                       | -     | -          | -                                                         | <a href='resources_servers/genrm_compare/configs/genrm_compare.yaml'>genrm_compare.yaml</a>                                                                                                                                 | -                                                                                                                                                              |
 | Google Search                                 | agent                 | Multi-choice question answering problems with search tools integrated                                                                                                                                                        | Improve knowledge-related benchmarks with search tools                                                                       | ✓     | -          | Apache 2.0                                                | <a href='resources_servers/google_search/configs/google_search.yaml'>google_search.yaml</a>                                                                                                                                 | <a href='https://huggingface.co/datasets/nvidia/Nemotron-RL-knowledge-web_search-mcqa'>Nemotron-RL-knowledge-web_search-mcqa</a>                               |
 | Gpqa Diamond                                  | knowledge             | GPQA Diamond multiple-choice question answering problems                                                                                                                                                                     | Evaluate graduate-level scientific reasoning via MCQ verification                                                            | ✓     | -          | MIT                                                       | <a href='resources_servers/gpqa_diamond/configs/gpqa_diamond.yaml'>gpqa_diamond.yaml</a>                                                                                                                                    | -                                                                                                                                                              |
+| Grl Sokoban                                   | games                 | Single-box Sokoban in Gymnasium API style.                                                                                                                                                                                   | Model emits one move per turn until the puzzle is solved.                                                                    | -     | -          | -                                                         | <a href='resources_servers/grl_sokoban/configs/grl_sokoban.yaml'>grl_sokoban.yaml</a>                                                                                                                                       | -                                                                                                                                                              |
 | Instruction Following                         | instruction_following | Instruction following datasets targeting IFEval and IFBench style instruction following capabilities                                                                                                                         | Improve IFEval and IFBench                                                                                                   | ✓     | -          | Apache 2.0                                                | <a href='resources_servers/instruction_following/configs/instruction_following.yaml'>instruction_following.yaml</a>                                                                                                         | <a href='https://huggingface.co/datasets/nvidia/Nemotron-RL-instruction_following'>Nemotron-RL-instruction_following</a>                                       |
 | Jailbreak Detection                           | safety                | Jailbreak detection with Nemotron judge + combined reward                                                                                                                                                                    | -                                                                                                                            | -     | ✓          | -                                                         | <a href='resources_servers/jailbreak_detection/configs/jailbreak_detection_nemotron_combined_reward_tp8.yaml'>jailbreak_detection_nemotron_combined_reward_tp8.yaml</a>                                                     | -                                                                                                                                                              |
 | Math Advanced Calculations                    | agent                 | An instruction following math environment with counter-intuitive calculators                                                                                                                                                 | Improve instruction following capabilities in specific math environments                                                     | ✓     | -          | Apache 2.0                                                | <a href='resources_servers/math_advanced_calculations/configs/math_advanced_calculations.yaml'>math_advanced_calculations.yaml</a>                                                                                          | <a href='https://huggingface.co/datasets/nvidia/Nemotron-RL-math-advanced_calculations'>Nemotron-RL-math-advanced_calculations</a>                             |
 
@@ -0,0 +1,51 @@
+# GRL Sokoban Resource Server
+
+Single-box Sokoban puzzle environment adapted to the `GymnasiumServer` interface introduced in PR `#1072`. The environment is implemented under `resources_servers/grl_sokoban/sokoban_env` and mirrors the Sokoban implementation in the GRL repo (https://github.com/lmgame-org/GRL). It is based on code from https://github.com/lmgame-org/lmenv, developed in collaboration with NVIDIA, and uses `gym-sokoban` (https://github.com/mpSchrader/gym-sokoban).
+
+## Why it exists
+- **Domain**: Deterministic Sokoban puzzles.
+- **Interaction style**: The environment returns a board observation and expects exactly one move per turn.
+- **Evaluation**: Reward is accumulated directly through `/step`, so this server should be paired with `responses_api_agents/gymnasium_agent`.
+
+## Setup
+
+Follow the setup instructions in the repo root README or the [detailed setup docs](../../docs/get-started/detailed-setup.md).
+
+## Running
+Spin up the server alongside a compatible agent:
+
+```bash
+config_paths="responses_api_models/openai_model/configs/openai_model.yaml,\
+responses_api_agents/gymnasium_agent/configs/gymnasium_agent.yaml,\
+resources_servers/grl_sokoban/configs/grl_sokoban.yaml"
+ng_run "+config_paths=[$config_paths]"
+```
+
+Collect trajectories:
+
+```bash
+ng_collect_rollouts +agent_name=grl_sokoban_gymnasium_agent \
+    +input_jsonl_fpath=resources_servers/grl_sokoban/data/example.jsonl \
+    +output_jsonl_fpath=resources_servers/grl_sokoban/data/example_rollouts.jsonl \
+    +limit=5
+```
+
+Launch the rollout viewer:
+
+```bash
+ng_viewer +jsonl_fpath=resources_servers/grl_sokoban/data/example_rollouts.jsonl
+```
+
+## Prompt format
+
+The example dataset assumes the model will:
+
+1. Read the initial board returned by `/reset`.
+2. Emit exactly one move per turn using an action tag such as `<action>Up</action>`.
+3. Continue until the environment terminates or the agent reaches `max_steps`.
+
+## Tests
+
+```bash
+pytest resources_servers/grl_sokoban/tests responses_api_agents/gymnasium_agent/tests
+```
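The "Prompt format" steps in the README above describe a reset, then one move per turn, until the environment terminates or `max_steps` is reached. The loop below is a minimal sketch of that control flow only. `StubEnv`, `run_episode`, the scripted policy, and the reward values are all hypothetical stand-ins; the real loop lives in `responses_api_agents/gymnasium_agent`, whose code is not part of this diff.

```python
import re
from typing import Callable, Tuple

ACTION_TAG = re.compile(r"<action>\s*(up|down|left|right)\s*</action>", re.IGNORECASE)


class StubEnv:
    """Hypothetical stand-in for the server: the episode ends after two Right moves."""

    def __init__(self) -> None:
        self.pos = 0

    def reset(self) -> str:
        self.pos = 0
        return f"player at {self.pos}, target at 2"

    def step(self, move: str) -> Tuple[str, float, bool]:
        if move == "right":
            self.pos += 1
        done = self.pos == 2
        reward = 1.0 if done else 0.0  # made-up reward scheme for illustration
        return f"player at {self.pos}, target at 2", reward, done


def run_episode(env: StubEnv, policy: Callable[[str], str], max_steps: int) -> Tuple[float, int, bool]:
    """Reset once, then take one parsed move per turn until done or max_steps."""
    observation = env.reset()
    total_reward, steps, done = 0.0, 0, False
    while steps < max_steps and not done:
        reply = policy(observation)
        match = ACTION_TAG.search(reply)
        if match is None:
            raise ValueError(f"no action tag in reply: {reply!r}")
        observation, reward, done = env.step(match.group(1).lower())
        total_reward += reward
        steps += 1
    return total_reward, steps, done


def always_right(observation: str) -> str:
    # Scripted policy standing in for the model.
    return "<action>Right</action>"


solved = run_episode(StubEnv(), always_right, max_steps=20)
cut_short = run_episode(StubEnv(), always_right, max_steps=1)
```

With `max_steps=20` the stub episode ends as soon as the environment reports `done`; with `max_steps=1` the step budget ends it first and `done` stays `False`.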
@@ -0,0 +1,163 @@
+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import annotations
+
+import re
+from dataclasses import dataclass, field
+from typing import Any, Dict, Optional
+
+from fastapi import HTTPException
+from pydantic import Field
+
+from nemo_gym.base_resources_server import BaseResourcesServerConfig
+from nemo_gym.openai_utils import NeMoGymResponse
+from resources_servers.base_gymnasium import GymnasiumServer, extract_text
+from resources_servers.grl_sokoban.sokoban_env import SokobanEnv
+
+
+DEFAULT_GRID_LOOKUP = {0: "#", 1: "_", 2: "O", 3: "√", 4: "X", 5: "P", 6: "S"}
+DEFAULT_ACTION_LOOKUP = {1: "Up", 2: "Down", 3: "Left", 4: "Right"}
+ACTION_TAG_PATTERN = re.compile(r"<action>\s*(up|down|left|right|[1-4])\s*</action>", re.IGNORECASE)
+ACTION_WORD_PATTERN = re.compile(r"\b(up|down|left|right|[1-4])\b", re.IGNORECASE)
+
+
+class GrlSokobanResourcesServerConfig(BaseResourcesServerConfig):
+    env_config: Dict[str, Any] = Field(
+        default_factory=lambda: {
+            "grid_lookup": DEFAULT_GRID_LOOKUP,
+            "action_lookup": DEFAULT_ACTION_LOOKUP,
+            "search_depth": 100,
+            "dim_room": (6, 6),
+            "max_steps": 100,
+            "num_boxes": 1,
+            "render_mode": "text",
+        }
+    )
+
+
+@dataclass
+class SokobanSessionState:
+    env: Any
+    observation: str
+    total_reward: float = 0.0
+    done: bool = False
+    last_info: Dict[str, Any] = field(default_factory=dict)
+
+
+class GrlSokobanResourcesServer(GymnasiumServer):
+    config: GrlSokobanResourcesServerConfig
+    session_id_to_state: Dict[str, SokobanSessionState] = Field(default_factory=dict)
+
+    async def reset(self, metadata: dict, session_id: Optional[str] = None) -> tuple[Optional[str], dict]:
+        if session_id is None:
+            raise HTTPException(status_code=400, detail="Missing session id.")
+
+        self._close_env(session_id)
+
+        env = SokobanEnv(self._env_config_from_metadata(metadata))
+        observation = env.reset(seed=metadata.get("seed"))
+        self.session_id_to_state[session_id] = SokobanSessionState(env=env, observation=observation)
+        return self._format_observation(observation), {}
+
+    async def step(
+        self, action: NeMoGymResponse, metadata: dict, session_id: Optional[str] = None
+    ) -> tuple[Optional[str], float, bool, bool, dict]:
+        if session_id is None or session_id not in self.session_id_to_state:
+            raise HTTPException(status_code=400, detail="Session not initialized. Call /reset first.")
+
+        session_state = self.session_id_to_state[session_id]
+        if session_state.done:
+            return session_state.observation, 0.0, True, False, dict(session_state.last_info)
+
+        env = session_state.env
+        action_id = self._parse_action(action, env.ACTION_LOOKUP)
+        next_obs, reward, done, info = env.step(action_id)
+
+        session_state.total_reward += reward
+        session_state.observation = next_obs
+        session_state.last_info = info | {
+            "action_id": action_id,
+            "action_label": env.ACTION_LOOKUP[action_id],
+            "total_reward": session_state.total_reward,
+        }
+        session_state.done = bool(done)
+
+        return (
+            self._format_observation(next_obs) if not session_state.done else next_obs,
+            reward,
+            session_state.done,
+            False,
+            dict(session_state.last_info),
+        )
+
+    def _env_config_from_metadata(self, metadata: dict) -> dict[str, Any]:
+        env_config = dict(self.config.env_config)
+        for key in (
+            "grid_lookup",
+            "action_lookup",
+            "search_depth",
+            "dim_room",
+            "max_steps",
+            "num_boxes",
+            "render_mode",
+        ):
+            if key in metadata:
+                env_config[key] = metadata[key]
+        return env_config
+
+    def _close_env(self, session_id: str) -> None:
+        session_state = self.session_id_to_state.pop(session_id, None)
+        if session_state is None:
+            return
+        try:
+            session_state.env.close()
+        except Exception:
+            pass  # Best-effort cleanup: a failing close() should not block the next reset.
+
+    @staticmethod
+    def _format_observation(observation: str) -> str:
+        return (
+            "Sokoban board:\n"
+            f"{observation}\n\n"
+            "Legend: #=wall, _=floor, O=target, X=box, √=box on target, P=player, S=player on target.\n"
+            "Respond with exactly one move using <action>Up</action>, <action>Down</action>, "
+            "<action>Left</action>, or <action>Right</action>."
+        )
+
+    @staticmethod
+    def _parse_action(response: NeMoGymResponse, action_lookup: Dict[int, str]) -> int:
+        text = extract_text(response).strip()
+        match = ACTION_TAG_PATTERN.search(text) or ACTION_WORD_PATTERN.search(text)
+        if match is None:
+            raise HTTPException(status_code=400, detail=f"Unable to parse action from response: {text!r}")
+
+        token = match.group(1).strip()
+        if token.isdigit():
+            action_id = int(token)
+            if action_id in action_lookup:
+                return action_id
+            raise HTTPException(status_code=400, detail=f"Invalid action identifier: {action_id}")
+
+        reverse_lookup = {label.lower(): idx for idx, label in action_lookup.items()}
+        action_id = reverse_lookup.get(token.lower())
+        if action_id is None:
+            raise HTTPException(status_code=400, detail=f"Invalid action identifier: {token}")
+        return action_id
+
+
+GrlSokobanEnv = GrlSokobanResourcesServer
+
+
+if __name__ == "__main__":
+    GrlSokobanResourcesServer.run_webserver()
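To illustrate how the two regular expressions in `_parse_action` interact, here is a standalone sketch. The patterns and lookup table are copied from `app.py` above; `parse_action` is a hypothetical helper that takes plain text and raises `ValueError`, whereas the real method takes a `NeMoGymResponse` and raises `HTTPException`.

```python
import re

# Patterns and lookup copied from app.py in this PR.
ACTION_LOOKUP = {1: "Up", 2: "Down", 3: "Left", 4: "Right"}
ACTION_TAG_PATTERN = re.compile(r"<action>\s*(up|down|left|right|[1-4])\s*</action>", re.IGNORECASE)
ACTION_WORD_PATTERN = re.compile(r"\b(up|down|left|right|[1-4])\b", re.IGNORECASE)


def parse_action(text: str) -> int:
    """Hypothetical standalone helper: same matching order as _parse_action,
    but raises ValueError instead of HTTPException."""
    match = ACTION_TAG_PATTERN.search(text) or ACTION_WORD_PATTERN.search(text)
    if match is None:
        raise ValueError(f"Unable to parse action from response: {text!r}")
    token = match.group(1).lower()
    if token.isdigit():
        return int(token)  # the patterns only admit 1-4, all present in ACTION_LOOKUP
    reverse_lookup = {label.lower(): idx for idx, label in ACTION_LOOKUP.items()}
    return reverse_lookup[token]


tagged = parse_action("I could go up, but the box is to the west. <action>Left</action>")
bare_word = parse_action("move right")
numeric = parse_action("<action> 2 </action>")
stray_digit = parse_action("Step 2: go up")
```

The tagged reply resolves to `3` (Left) even though "up" appears earlier, because the tag pattern is tried first. The last reply resolves to `2` (Down): with no tag present, the word fallback takes the leftmost matching token, and the stray digit comes before "up". This is one reason the prompts ask for the `<action>` tag explicitly.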
@@ -0,0 +1,29 @@
+grl_sokoban:
+  resources_servers:
+    grl_sokoban:
+      entrypoint: app.py
+      domain: games
+      verified: false
+      description: Single-box Sokoban in Gymnasium API style.
+      value: Model emits one move per turn until the puzzle is solved.
+grl_sokoban_gymnasium_agent:
+  responses_api_agents:
+    gymnasium_agent:
+      entrypoint: app.py
+      env_server:
+        type: resources_servers
+        name: grl_sokoban
+      model_server:
+        type: responses_api_models
+        name: policy_model
+      max_steps: 20
+      datasets:
+      - name: example
+        type: example
+        jsonl_fpath: resources_servers/grl_sokoban/data/example.jsonl
+        num_repeats: 1
+        gitlab_identifier:
+          dataset_name: grl_sokoban
+          version: 0.0.1
+          artifact_fpath: example.jsonl
+        license: Apache 2.0
@@ -0,0 +1,5 @@
+{"level_id": 1, "seed": 84810, "dim_room": [5, 5], "num_boxes": 1, "responses_create_params": {"reasoning": {"effort": "medium"}, "input": [{"role": "system", "content": "You are a Sokoban-solving assistant. You will receive a board observation after reset and after every move. Symbols: #=wall, _=floor, O=target, X=box, √=box on target, P=player, S=player on target. Reply with exactly one move each turn using one of <action>Up</action>, <action>Down</action>, <action>Left</action>, or <action>Right</action>."}, {"role": "user", "content": "Solve the Sokoban puzzle step by step."}]}, "agent_ref": {"type": "responses_api_agents", "name": "grl_sokoban_gymnasium_agent"}}
+{"level_id": 2, "seed": 98293, "dim_room": [8, 8], "num_boxes": 1, "responses_create_params": {"reasoning": {"effort": "medium"}, "input": [{"role": "system", "content": "You are a Sokoban-solving assistant. You will receive a board observation after reset and after every move. Symbols: #=wall, _=floor, O=target, X=box, √=box on target, P=player, S=player on target. Reply with exactly one move each turn using one of <action>Up</action>, <action>Down</action>, <action>Left</action>, or <action>Right</action>."}, {"role": "user", "content": "Solve the Sokoban puzzle step by step."}]}, "agent_ref": {"type": "responses_api_agents", "name": "grl_sokoban_gymnasium_agent"}}
+{"level_id": 3, "seed": 30450, "dim_room": [6, 6], "num_boxes": 1, "responses_create_params": {"reasoning": {"effort": "medium"}, "input": [{"role": "system", "content": "You are a Sokoban-solving assistant. You will receive a board observation after reset and after every move. Symbols: #=wall, _=floor, O=target, X=box, √=box on target, P=player, S=player on target. Reply with exactly one move each turn using one of <action>Up</action>, <action>Down</action>, <action>Left</action>, or <action>Right</action>."}, {"role": "user", "content": "Solve the Sokoban puzzle step by step."}]}, "agent_ref": {"type": "responses_api_agents", "name": "grl_sokoban_gymnasium_agent"}}
+{"level_id": 4, "seed": 89987, "dim_room": [6, 7], "num_boxes": 1, "responses_create_params": {"reasoning": {"effort": "medium"}, "input": [{"role": "system", "content": "You are a Sokoban-solving assistant. You will receive a board observation after reset and after every move. Symbols: #=wall, _=floor, O=target, X=box, √=box on target, P=player, S=player on target. Reply with exactly one move each turn using one of <action>Up</action>, <action>Down</action>, <action>Left</action>, or <action>Right</action>."}, {"role": "user", "content": "Solve the Sokoban puzzle step by step."}]}, "agent_ref": {"type": "responses_api_agents", "name": "grl_sokoban_gymnasium_agent"}}
+{"level_id": 5, "seed": 78785, "dim_room": [6, 4], "num_boxes": 1, "responses_create_params": {"reasoning": {"effort": "medium"}, "input": [{"role": "system", "content": "You are a Sokoban-solving assistant. You will receive a board observation after reset and after every move. Symbols: #=wall, _=floor, O=target, X=box, √=box on target, P=player, S=player on target. Reply with exactly one move each turn using one of <action>Up</action>, <action>Down</action>, <action>Left</action>, or <action>Right</action>."}, {"role": "user", "content": "Solve the Sokoban puzzle step by step."}]}, "agent_ref": {"type": "responses_api_agents", "name": "grl_sokoban_gymnasium_agent"}}
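Each row in `example.jsonl` carries per-level fields such as `dim_room` and `num_boxes`. The sketch below mirrors the override logic of `_env_config_from_metadata` from `app.py` as a plain function, assuming the row's top-level fields reach the server as `metadata` (the agent code that forwards them is not shown in this diff). `env_config_from_metadata` and `OVERRIDABLE_KEYS` are illustrative names, and the defaults omit the two lookup tables for brevity.

```python
from typing import Any, Dict

# Subset of the defaults from GrlSokobanResourcesServerConfig (lookup tables omitted).
DEFAULT_ENV_CONFIG: Dict[str, Any] = {
    "search_depth": 100,
    "dim_room": (6, 6),
    "max_steps": 100,
    "num_boxes": 1,
    "render_mode": "text",
}

# Keys that _env_config_from_metadata allows a request to override.
OVERRIDABLE_KEYS = (
    "grid_lookup",
    "action_lookup",
    "search_depth",
    "dim_room",
    "max_steps",
    "num_boxes",
    "render_mode",
)


def env_config_from_metadata(defaults: Dict[str, Any], metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Copy the defaults, then apply only the allow-listed keys found in metadata."""
    env_config = dict(defaults)
    for key in OVERRIDABLE_KEYS:
        if key in metadata:
            env_config[key] = metadata[key]
    return env_config


# Top-level fields of the first row in example.jsonl.
row = {"level_id": 1, "seed": 84810, "dim_room": [5, 5], "num_boxes": 1}
config = env_config_from_metadata(DEFAULT_ENV_CONFIG, row)
```

Here `config["dim_room"]` becomes `[5, 5]` (a list, since it comes from JSON), while `max_steps` keeps its default. `seed` and `level_id` are not copied into the config; in `app.py` the seed is passed separately to `env.reset(seed=...)`.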
@@ -0,0 +1,8 @@
+{
+    "name": "example",
+    "type": "example",
+    "jsonl_fpath": "resources_servers/grl_sokoban/data/example.jsonl",
+    "gitlab_identifier": null,
+    "license": "Apache 2.0",
+    "Number of examples": 5
+}