Commit b8f2a16

Yixin Huang authored and committed
Adapt grl_sokoban to GymnasiumServer and add rollout flow
Signed-off-by: Yixin Huang <yixinhuang@Yixins-MacBook-Pro-2.local>
1 parent 3266544 commit b8f2a16

13 files changed

Lines changed: 1068 additions & 0 deletions

README.md

Lines changed: 1 addition & 0 deletions
@@ -171,6 +171,7 @@ The Dataset column links to publicly available datasets (e.g., on HuggingFace).
 | Genrm Compare | rlhf | GenRM pairwise comparison for RLHF training | Compare multiple candidate responses using GenRM model | - | - | - | <a href='resources_servers/genrm_compare/configs/genrm_compare.yaml'>genrm_compare.yaml</a> | - |
 | Google Search | agent | Multi-choice question answering problems with search tools integrated | Improve knowledge-related benchmarks with search tools || - | Apache 2.0 | <a href='resources_servers/google_search/configs/google_search.yaml'>google_search.yaml</a> | <a href='https://huggingface.co/datasets/nvidia/Nemotron-RL-knowledge-web_search-mcqa'>Nemotron-RL-knowledge-web_search-mcqa</a> |
 | Gpqa Diamond | knowledge | GPQA Diamond multiple-choice question answering problems | Evaluate graduate-level scientific reasoning via MCQ verification || - | MIT | <a href='resources_servers/gpqa_diamond/configs/gpqa_diamond.yaml'>gpqa_diamond.yaml</a> | - |
+| Grl Sokoban | games | Single-box Sokoban in Gymnasium API style. | Model emits one move per turn until the puzzle is solved. | - | - | - | <a href='resources_servers/grl_sokoban/configs/grl_sokoban.yaml'>grl_sokoban.yaml</a> | - |
 | Instruction Following | instruction_following | Instruction following datasets targeting IFEval and IFBench style instruction following capabilities | Improve IFEval and IFBench || - | Apache 2.0 | <a href='resources_servers/instruction_following/configs/instruction_following.yaml'>instruction_following.yaml</a> | <a href='https://huggingface.co/datasets/nvidia/Nemotron-RL-instruction_following'>Nemotron-RL-instruction_following</a> |
 | Jailbreak Detection | safety | Jailbreak detection with Nemotron judge + combined reward | - | - || - | <a href='resources_servers/jailbreak_detection/configs/jailbreak_detection_nemotron_combined_reward_tp8.yaml'>jailbreak_detection_nemotron_combined_reward_tp8.yaml</a> | - |
 | Math Advanced Calculations | agent | An instruction following math environment with counter-intuitive calculators | Improve instruction following capabilities in specific math environments || - | Apache 2.0 | <a href='resources_servers/math_advanced_calculations/configs/math_advanced_calculations.yaml'>math_advanced_calculations.yaml</a> | <a href='https://huggingface.co/datasets/nvidia/Nemotron-RL-math-advanced_calculations'>Nemotron-RL-math-advanced_calculations</a> |
Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
+# GRL Sokoban Resource Server
+
+Single-box Sokoban puzzle environment adapted to the `GymnasiumServer` interface introduced in PR `#1072`. The environment is implemented under `resources_servers/grl_sokoban/sokoban_env`, mirroring the Sokoban implementation in the GRL repo (https://github.com/lmgame-org/GRL) and based on code from https://github.com/lmgame-org/lmenv, developed in collaboration with NVIDIA. The implementation uses `gym-sokoban` (https://github.com/mpSchrader/gym-sokoban).
+
+## Why it exists
+- **Domain**: Deterministic Sokoban puzzles.
+- **Interaction style**: The environment returns a board observation and expects exactly one move per turn.
+- **Evaluation**: Reward is accumulated directly through `/step`, so this server should be paired with `responses_api_agents/gymnasium_agent`.
+
+## Setup
+
+Please follow the setup instructions in the repo root README or [detailed setup docs](../../docs/get-started/detailed-setup.md).
+
+## Running
+Spin up the server alongside a compatible agent:
+
+```bash
+config_paths="responses_api_models/openai_model/configs/openai_model.yaml,\
+responses_api_agents/gymnasium_agent/configs/gymnasium_agent.yaml,\
+resources_servers/grl_sokoban/configs/grl_sokoban.yaml"
+ng_run "+config_paths=[$config_paths]"
+```
+
+Collect trajectories:
+
+```bash
+ng_collect_rollouts +agent_name=grl_sokoban_gymnasium_agent \
+    +input_jsonl_fpath=resources_servers/grl_sokoban/data/example.jsonl \
+    +output_jsonl_fpath=resources_servers/grl_sokoban/data/example_rollouts.jsonl \
+    +limit=5
+```
+
+Launch the rollout viewer:
+
+```bash
+ng_viewer +jsonl_fpath=resources_servers/grl_sokoban/data/example_rollouts.jsonl
+```
+
+## Prompt format
+
+The example dataset assumes the model will:
+
+1. Read the initial board returned by `/reset`.
+2. Emit exactly one move per turn using an action tag such as `<action>Up</action>`.
+3. Continue until the environment terminates or the agent reaches `max_steps`.
+
+## Tests
+
+```bash
+pytest resources_servers/grl_sokoban/tests responses_api_agents/gymnasium_agent/tests
+```
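Reviewer note: the three-step prompt format in the README hunk above amounts to a reset/step loop. The following is a minimal sketch of that loop under stated assumptions, not the `gymnasium_agent` implementation. `ToyCorridorEnv`, `run_episode`, and the scripted policy are hypothetical stand-ins so the example runs without a server or a model.

```python
import re
from typing import Callable, Tuple

# Tagged-action pattern in the style the README asks the model to use.
ACTION_TAG = re.compile(r"<action>\s*(up|down|left|right)\s*</action>", re.IGNORECASE)


class ToyCorridorEnv:
    """Hypothetical stand-in for the Sokoban server: walk right to reach the goal."""

    def __init__(self, length: int = 3) -> None:
        self.length = length
        self.position = 0

    def reset(self) -> str:
        self.position = 0
        return self._render()

    def step(self, move: str) -> Tuple[str, float, bool]:
        if move == "right":
            self.position = min(self.position + 1, self.length)
        elif move == "left":
            self.position = max(self.position - 1, 0)
        done = self.position == self.length
        return self._render(), (1.0 if done else 0.0), done

    def _render(self) -> str:
        cells = ["_"] * (self.length + 1)
        cells[self.length] = "O"
        cells[self.position] = "S" if self.position == self.length else "P"
        return "".join(cells)


def run_episode(env: ToyCorridorEnv, policy: Callable[[str], str], max_steps: int) -> Tuple[float, bool, int]:
    """Reset once, then take one tagged move per turn until done or max_steps."""
    observation = env.reset()
    total_reward, done, steps = 0.0, False, 0
    while not done and steps < max_steps:
        reply = policy(observation)
        match = ACTION_TAG.search(reply)
        if match is None:
            raise ValueError(f"No action tag in reply: {reply!r}")
        observation, reward, done = env.step(match.group(1).lower())
        total_reward += reward
        steps += 1
    return total_reward, done, steps


# Scripted "model" that always moves right.
result = run_episode(ToyCorridorEnv(3), lambda obs: "<action>Right</action>", max_steps=20)
print(result)
```

The loop stops on whichever comes first, termination or the step budget, which is the behavior step 3 of the prompt format describes.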
Lines changed: 163 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,163 @@
+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import annotations
+
+import re
+from dataclasses import dataclass, field
+from typing import Any, Dict, Optional
+
+from fastapi import HTTPException
+from pydantic import Field
+
+from nemo_gym.base_resources_server import BaseResourcesServerConfig
+from nemo_gym.openai_utils import NeMoGymResponse
+from resources_servers.base_gymnasium import GymnasiumServer, extract_text
+from resources_servers.grl_sokoban.sokoban_env import SokobanEnv
+
+
+DEFAULT_GRID_LOOKUP = {0: "#", 1: "_", 2: "O", 3: "√", 4: "X", 5: "P", 6: "S"}
+DEFAULT_ACTION_LOOKUP = {1: "Up", 2: "Down", 3: "Left", 4: "Right"}
+ACTION_TAG_PATTERN = re.compile(r"<action>\s*(up|down|left|right|[1-4])\s*</action>", re.IGNORECASE)
+ACTION_WORD_PATTERN = re.compile(r"\b(up|down|left|right|[1-4])\b", re.IGNORECASE)
+
+
+class GrlSokobanResourcesServerConfig(BaseResourcesServerConfig):
+    env_config: Dict[str, Any] = Field(
+        default_factory=lambda: {
+            "grid_lookup": DEFAULT_GRID_LOOKUP,
+            "action_lookup": DEFAULT_ACTION_LOOKUP,
+            "search_depth": 100,
+            "dim_room": (6, 6),
+            "max_steps": 100,
+            "num_boxes": 1,
+            "render_mode": "text",
+        }
+    )
+
+
+@dataclass
+class SokobanSessionState:
+    env: Any
+    observation: str
+    total_reward: float = 0.0
+    done: bool = False
+    last_info: Dict[str, Any] = field(default_factory=dict)
+
+
+class GrlSokobanResourcesServer(GymnasiumServer):
+    config: GrlSokobanResourcesServerConfig
+    session_id_to_state: Dict[str, SokobanSessionState] = Field(default_factory=dict)
+
+    async def reset(self, metadata: dict, session_id: Optional[str] = None) -> tuple[Optional[str], dict]:
+        if session_id is None:
+            raise HTTPException(status_code=400, detail="Missing session id.")
+
+        self._close_env(session_id)
+
+        env = SokobanEnv(self._env_config_from_metadata(metadata))
+        observation = env.reset(seed=metadata.get("seed"))
+        self.session_id_to_state[session_id] = SokobanSessionState(env=env, observation=observation)
+        return self._format_observation(observation), {}
+
+    async def step(
+        self, action: NeMoGymResponse, metadata: dict, session_id: Optional[str] = None
+    ) -> tuple[Optional[str], float, bool, bool, dict]:
+        if session_id is None or session_id not in self.session_id_to_state:
+            raise HTTPException(status_code=400, detail="Session not initialized. Call /reset first.")
+
+        session_state = self.session_id_to_state[session_id]
+        if session_state.done:
+            return session_state.observation, 0.0, True, False, dict(session_state.last_info)
+
+        env = session_state.env
+        action_id = self._parse_action(action, env.ACTION_LOOKUP)
+        next_obs, reward, done, info = env.step(action_id)
+
+        session_state.total_reward += reward
+        session_state.observation = next_obs
+        session_state.last_info = info | {
+            "action_id": action_id,
+            "action_label": env.ACTION_LOOKUP[action_id],
+            "total_reward": session_state.total_reward,
+        }
+        session_state.done = bool(done)
+
+        return (
+            self._format_observation(next_obs) if not session_state.done else next_obs,
+            reward,
+            session_state.done,
+            False,
+            dict(session_state.last_info),
+        )
+
+    def _env_config_from_metadata(self, metadata: dict) -> dict[str, Any]:
+        env_config = dict(self.config.env_config)
+        for key in (
+            "grid_lookup",
+            "action_lookup",
+            "search_depth",
+            "dim_room",
+            "max_steps",
+            "num_boxes",
+            "render_mode",
+        ):
+            if key in metadata:
+                env_config[key] = metadata[key]
+        return env_config
+
+    def _close_env(self, session_id: str) -> None:
+        session_state = self.session_id_to_state.pop(session_id, None)
+        if session_state is None:
+            return
+        try:
+            session_state.env.close()
+        except Exception:
+            pass
+
+    @staticmethod
+    def _format_observation(observation: str) -> str:
+        return (
+            "Sokoban board:\n"
+            f"{observation}\n\n"
+            "Legend: #=wall, _=floor, O=target, X=box, √=box on target, P=player, S=player on target.\n"
+            "Respond with exactly one move using <action>Up</action>, <action>Down</action>, "
+            "<action>Left</action>, or <action>Right</action>."
+        )
+
+    @staticmethod
+    def _parse_action(response: NeMoGymResponse, action_lookup: Dict[int, str]) -> int:
+        text = extract_text(response).strip()
+        match = ACTION_TAG_PATTERN.search(text) or ACTION_WORD_PATTERN.search(text)
+        if match is None:
+            raise HTTPException(status_code=400, detail=f"Unable to parse action from response: {text!r}")
+
+        token = match.group(1).strip()
+        if token.isdigit():
+            action_id = int(token)
+            if action_id in action_lookup:
+                return action_id
+            raise HTTPException(status_code=400, detail=f"Invalid action identifier: {action_id}")
+
+        reverse_lookup = {label.lower(): idx for idx, label in action_lookup.items()}
+        action_id = reverse_lookup.get(token.lower())
+        if action_id is None:
+            raise HTTPException(status_code=400, detail=f"Invalid action identifier: {token}")
+        return action_id
+
+
+GrlSokobanEnv = GrlSokobanResourcesServer
+
+
+if __name__ == "__main__":
+    GrlSokobanResourcesServer.run_webserver()
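Reviewer note: `_parse_action` above tries the tagged pattern first and only then falls back to a bare direction word or digit. The sketch below reuses the two regular expressions and the default action lookup from this file to illustrate that precedence. It is a simplified stand-in that assumes plain-string input, and the name `parse_move` and the use of `ValueError` in place of `HTTPException` are choices made here so the example runs without FastAPI.

```python
import re
from typing import Dict

# Patterns and lookup copied from the diff above.
ACTION_TAG_PATTERN = re.compile(r"<action>\s*(up|down|left|right|[1-4])\s*</action>", re.IGNORECASE)
ACTION_WORD_PATTERN = re.compile(r"\b(up|down|left|right|[1-4])\b", re.IGNORECASE)
DEFAULT_ACTION_LOOKUP = {1: "Up", 2: "Down", 3: "Left", 4: "Right"}


def parse_move(text: str, action_lookup: Dict[int, str] = DEFAULT_ACTION_LOOKUP) -> int:
    """Hypothetical helper: return the action id for a model reply given as plain text."""
    # A tagged action anywhere in the reply wins over untagged direction words.
    match = ACTION_TAG_PATTERN.search(text) or ACTION_WORD_PATTERN.search(text)
    if match is None:
        raise ValueError(f"Unable to parse action from response: {text!r}")

    token = match.group(1).strip()
    if token.isdigit():
        action_id = int(token)
        if action_id in action_lookup:
            return action_id
        raise ValueError(f"Invalid action identifier: {action_id}")

    reverse_lookup = {label.lower(): idx for idx, label in action_lookup.items()}
    action_id = reverse_lookup.get(token.lower())
    if action_id is None:
        raise ValueError(f"Invalid action identifier: {token}")
    return action_id


# Tagged reply: the tag is used even though "left" appears earlier in the text.
tagged = parse_move("The box is to my left, so <action>Right</action>")
# Untagged reply: the fallback picks the FIRST direction word it sees.
untagged = parse_move("I think up is blocked, so down")
print(tagged, untagged)
```

One consequence worth keeping in mind when reading rollouts: with an untagged reply, the fallback selects the first direction word or digit 1 to 4 in the text, which may not be the move the model intended.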
Lines changed: 29 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,29 @@
1+
+grl_sokoban:
+  resources_servers:
+    grl_sokoban:
+      entrypoint: app.py
+      domain: games
+      verified: false
+      description: Single-box Sokoban in Gymnasium API style.
+      value: Model emits one move per turn until the puzzle is solved.
+grl_sokoban_gymnasium_agent:
+  responses_api_agents:
+    gymnasium_agent:
+      entrypoint: app.py
+      env_server:
+        type: resources_servers
+        name: grl_sokoban
+      model_server:
+        type: responses_api_models
+        name: policy_model
+      max_steps: 20
+      datasets:
+      - name: example
+        type: example
+        jsonl_fpath: resources_servers/grl_sokoban/data/example.jsonl
+        num_repeats: 1
+        gitlab_identifier:
+          dataset_name: grl_sokoban
+          version: 0.0.1
+          artifact_fpath: example.jsonl
+        license: Apache 2.0
Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,5 @@
1+
+{"level_id": 1, "seed": 84810, "dim_room": [5, 5], "num_boxes": 1, "responses_create_params": {"reasoning": {"effort": "medium"}, "input": [{"role": "system", "content": "You are a Sokoban-solving assistant. You will receive a board observation after reset and after every move. Symbols: #=wall, _=floor, O=target, X=box, √=box on target, P=player, S=player on target. Reply with exactly one move each turn using one of <action>Up</action>, <action>Down</action>, <action>Left</action>, or <action>Right</action>."}, {"role": "user", "content": "Solve the Sokoban puzzle step by step."}]}, "agent_ref": {"type": "responses_api_agents", "name": "grl_sokoban_gymnasium_agent"}}
+{"level_id": 2, "seed": 98293, "dim_room": [8, 8], "num_boxes": 1, "responses_create_params": {"reasoning": {"effort": "medium"}, "input": [{"role": "system", "content": "You are a Sokoban-solving assistant. You will receive a board observation after reset and after every move. Symbols: #=wall, _=floor, O=target, X=box, √=box on target, P=player, S=player on target. Reply with exactly one move each turn using one of <action>Up</action>, <action>Down</action>, <action>Left</action>, or <action>Right</action>."}, {"role": "user", "content": "Solve the Sokoban puzzle step by step."}]}, "agent_ref": {"type": "responses_api_agents", "name": "grl_sokoban_gymnasium_agent"}}
+{"level_id": 3, "seed": 30450, "dim_room": [6, 6], "num_boxes": 1, "responses_create_params": {"reasoning": {"effort": "medium"}, "input": [{"role": "system", "content": "You are a Sokoban-solving assistant. You will receive a board observation after reset and after every move. Symbols: #=wall, _=floor, O=target, X=box, √=box on target, P=player, S=player on target. Reply with exactly one move each turn using one of <action>Up</action>, <action>Down</action>, <action>Left</action>, or <action>Right</action>."}, {"role": "user", "content": "Solve the Sokoban puzzle step by step."}]}, "agent_ref": {"type": "responses_api_agents", "name": "grl_sokoban_gymnasium_agent"}}
+{"level_id": 4, "seed": 89987, "dim_room": [6, 7], "num_boxes": 1, "responses_create_params": {"reasoning": {"effort": "medium"}, "input": [{"role": "system", "content": "You are a Sokoban-solving assistant. You will receive a board observation after reset and after every move. Symbols: #=wall, _=floor, O=target, X=box, √=box on target, P=player, S=player on target. Reply with exactly one move each turn using one of <action>Up</action>, <action>Down</action>, <action>Left</action>, or <action>Right</action>."}, {"role": "user", "content": "Solve the Sokoban puzzle step by step."}]}, "agent_ref": {"type": "responses_api_agents", "name": "grl_sokoban_gymnasium_agent"}}
+{"level_id": 5, "seed": 78785, "dim_room": [6, 4], "num_boxes": 1, "responses_create_params": {"reasoning": {"effort": "medium"}, "input": [{"role": "system", "content": "You are a Sokoban-solving assistant. You will receive a board observation after reset and after every move. Symbols: #=wall, _=floor, O=target, X=box, √=box on target, P=player, S=player on target. Reply with exactly one move each turn using one of <action>Up</action>, <action>Down</action>, <action>Left</action>, or <action>Right</action>."}, {"role": "user", "content": "Solve the Sokoban puzzle step by step."}]}, "agent_ref": {"type": "responses_api_agents", "name": "grl_sokoban_gymnasium_agent"}}
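Reviewer note: each dataset row above carries per-level fields (`seed`, `dim_room`, `num_boxes`) next to the prompt. In `app.py`, `_env_config_from_metadata` copies a fixed set of keys from the row over the server defaults, and `seed` is passed separately to `env.reset`. The sketch below illustrates that merge on a shortened row. `env_config_for_row` is a hypothetical helper name, the defaults omit the two lookup tables for brevity, and the shortened row is an abbreviation of the first example, not a new data point.

```python
import json
from typing import Any, Dict

# Defaults as in GrlSokobanResourcesServerConfig, minus the lookup tables (omitted here for brevity).
DEFAULT_ENV_CONFIG: Dict[str, Any] = {
    "search_depth": 100,
    "dim_room": (6, 6),
    "max_steps": 100,
    "num_boxes": 1,
    "render_mode": "text",
}

# The keys that app.py allows a dataset row to override.
OVERRIDABLE_KEYS = (
    "grid_lookup",
    "action_lookup",
    "search_depth",
    "dim_room",
    "max_steps",
    "num_boxes",
    "render_mode",
)


def env_config_for_row(row: Dict[str, Any], defaults: Dict[str, Any] = DEFAULT_ENV_CONFIG) -> Dict[str, Any]:
    """Hypothetical helper: overlay the allowed row keys on a copy of the defaults."""
    merged = dict(defaults)
    for key in OVERRIDABLE_KEYS:
        if key in row:
            merged[key] = row[key]
    return merged


# First example row, shortened to the fields relevant to the environment.
row = json.loads('{"level_id": 1, "seed": 84810, "dim_room": [5, 5], "num_boxes": 1}')
config = env_config_for_row(row)
print(config)
```

Two details follow from this merge. Keys outside the allow-list, such as `level_id` and `seed`, never reach the environment config. Also, `dim_room` arrives from JSON as a list while the default is a tuple; whether `SokobanEnv` treats the two identically is not visible in this diff.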
Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,8 @@
1+
+{
+    "name": "example",
+    "type": "example",
+    "jsonl_fpath": "resources_servers/grl_sokoban/data/example.jsonl",
+    "gitlab_identifier": null,
+    "license": "Apache 2.0",
+    "Number of examples": 5
+}
