Emerge-Lab
diff --git a/docs/simulator.md b/docs/simulator.md
Lines changed: 25 additions & 15 deletions
diff --git a/docs/train.md b/docs/train.md
Lines changed: 59 additions & 0 deletions
diff --git a/mkdocs.yml b/mkdocs.yml
Lines changed: 1 addition & 0 deletions
diff --git a/pufferlib/config/ocean/drive.ini b/pufferlib/config/ocean/drive.ini
Lines changed: 11 additions & 7 deletions
diff --git a/pufferlib/ocean/benchmark/metrics.py b/pufferlib/ocean/benchmark/metrics.py
Lines changed: 3 additions & 3 deletions
diff --git a/pufferlib/ocean/drive/binding.c b/pufferlib/ocean/drive/binding.c
Lines changed: 14 additions & 8 deletions
@@ -3,8 +3,9 @@
 Deep dive into how the Drive environment is wired, what it expects as inputs, and how observations/actions/configs are shaped. The environment entrypoint is `pufferlib/ocean/drive/drive.py`, which wraps the C core in `pufferlib/ocean/drive/drive.h` via `binding.c`.
 
 ## Runtime inputs and lifecycle
+
 - **Map binaries**: The environment scans `resources/drive/binaries` for `map_*.bin` files and requires at least one to load. Keep `num_maps` no larger than what is present on disk. During vectorized setup, `binding.shared` samples maps until it accumulates at least `num_agents` controllable entities, skipping maps with no valid agents (`set_active_agents` in `drive.h`).
-- **Episode length**: Default `scenario_length = 91` to match the Waymo logs (trajectory data is 91 steps), but you can set `env.scenario_length` (CLI or `.ini`) to any positive value. Metrics are logged and `c_reset` is called when `timestep == scenario_length`.
+- **Episode length**: Defaults to `episode_length = 91` to match the Waymo logs (trajectory data is 91 steps), but you can set `env.episode_length` (CLI or `.ini`) to any positive value. Metrics are logged and `c_reset` is called when `timestep == episode_length`.
 - **Resampling maps**: Python-side `Drive.step` reinitializes the vectorized environments every `resample_frequency` steps (default `910`, ~10 episodes) with fresh map IDs and seeds.
 - **Initialization controls**:
   - `init_steps` starts agents from a later timestep in the logged trajectory.
@@ -15,6 +16,7 @@ Deep dive into how the Drive environment is wired, what it expects as inputs, an
 See [Data](data.md) for how to produce the `.bin` inputs, including the binary layout.
 
 ## Actions and dynamics
+
 - **Action types** (`env.action_type`):
   - `discrete` (default): classic dynamics use a single `MultiDiscrete([7*13])` index decoded into acceleration (`ACCELERATION_VALUES`) and steering (`STEERING_VALUES`); jerk dynamics use `MultiDiscrete([4, 3])` over `JERK_LONG`/`JERK_LAT`.
   - `continuous`: a 2-D Box in `[-1, 1]`. Classic scales to the max accel/steer magnitudes used in the discrete table. Jerk scales asymmetrically: negative values reach up to `-15 m/s^3` braking, positives up to `4 m/s^3` acceleration, lateral jerk up to `±4 m/s^3`.
@@ -23,6 +25,7 @@ See [Data](data.md) for how to produce the `.bin` inputs, including the binary l
   - `jerk`: integrates longitudinal/lateral jerk into accel, then into velocity/pose with steering limited to `±0.55 rad`. Speeds are clipped to `[0, 20] m/s`.
 
 ## Observation space
+
 Shape is `ego_features + 63 * 7 + 200 * 7` = `1848` for classic dynamics (`ego_features = 7`) or `1851` for jerk dynamics (`ego_features = 10`). Computed in `compute_observations` (`drive.h`):
 
 - **Ego block** (classic):
@@ -37,16 +40,17 @@ Shape is `ego_features + 63 * 7 + 200 * 7` = `1848` for classic dynamics (`ego_f
   - Longitudinal acceleration normalized to `[-15, 4]`
   - Lateral acceleration normalized to `[-4, 4]`
   - Respawn flag (index 9)
-- **Partner blocks**: Up to 63 other agents (active first, then static experts) within 50 m. Each uses 7 values: relative (x, y) in ego frame scaled by `0.02`, width/length normalized as above, relative heading encoded as `(cos Δθ, sin Δθ)`, and speed / `MAX_SPEED`. Zero-padded when fewer neighbors are present or when agents are in respawn.
+- **Partner blocks**: Up to `MAX_AGENTS - 1` other agents (active first, then static experts) within 50 m. Each uses 7 values: relative (x, y) in ego frame scaled by `0.02`, width/length normalized as above, relative heading encoded as `(cos Δθ, sin Δθ)`, and speed / `MAX_SPEED`. Zero-padded when fewer neighbors are present or when agents are in respawn.
 - **Road blocks**: Up to 200 nearby road segments pulled from a precomputed grid (`vision_range = 21`). Each entry stores relative midpoint (x, y) scaled by `0.02`, segment length / `MAX_ROAD_SEGMENT_LENGTH` (100 m), width / `MAX_ROAD_SCALE` (100), `(cos, sin)` of the segment direction in ego frame, and a type ID (`ROAD_LANE`..`DRIVEWAY` stored as `0..6`). Remaining slots are zero-padded.
 
 ## Rewards, termination, and metrics
+
 - **Per-step rewards** (`c_step`):
   - Collision with another actor: `reward_vehicle_collision` (default `-0.5`)
   - Off-road (road-edge intersection): `reward_offroad_collision` (default `-0.2`)
   - Goal reached: `reward_goal` (default `1.0`) or `reward_goal_post_respawn` after a respawn
-  - Optional ADE shaping: `reward_ade * avg_displacement_error`, where ADE is accumulated in `compute_agent_metrics`
-- **Termination**: No early truncation; episodes roll to `scenario_length` steps. If `goal_behavior` is respawn, `respawn_agent` resets the pose and marks `respawn_timestep` so the respawn flag shows up in observations.
+
+- **Termination**: No early truncation; episodes roll to `episode_length` steps. If `goal_behavior` is respawn, `respawn_agent` resets the pose and marks `respawn_timestep` so the respawn flag shows up in observations.
 - **Logged metrics** (`add_log` aggregates over all active agents across envs):
   - `score`: reached goal without collision/off-road
   - `collision_rate` / `offroad_rate`: fraction of agents with ≥1 event in the episode
@@ -57,23 +61,28 @@ Shape is `ego_features + 63 * 7 + 200 * 7` = `1848` for classic dynamics (`ego_f
 `collision_behavior`, `offroad_behavior`, `reward_vehicle_collision_post_respawn`, and `spawn_immunity_timer` are parsed from the INI but currently unused in the stepping logic.
 
 ## Configuration files (`.ini`)
+
 `pufferlib/config/default.ini` supplies global defaults. Environment-specific overrides live in `pufferlib/config/ocean/drive.ini` and are loaded first when you run `puffer train puffer_drive`; CLI flags (e.g., `--env.num-maps 128`) override both.
 
 Key sections in `pufferlib/config/ocean/drive.ini`:
-- **[env]**: Simulator knobs: `num_agents` (policy slots, C core cap 64), `num_maps`, `scenario_length`, `resample_frequency`, `action_type`, `dynamics_model`, rewards, `goal_radius`, `goal_behavior`, `init_steps`, `init_mode`, `control_mode`; rendering toggles `render`, `render_interval`, `obs_only`, `show_grid`, `show_lasers`, `show_human_logs`, `render_map`.
+
+- **[env]**: Simulator knobs: `num_agents` (policy slots, C core cap 64), `num_maps`, `episode_length`, `resample_frequency`, `action_type`, `dynamics_model`, rewards, `goal_radius`, `goal_behavior`, `init_steps`, `init_mode`, `control_mode`; rendering toggles `render`, `render_interval`, `obs_only`, `show_grid`, `show_lasers`, `show_human_logs`, `render_map`.
 - **[vec]**: Vectorization sizing (`num_envs`, `num_workers`, `batch_size`; backend defaults to multiprocessing).
 - **[policy]/[rnn]**: Model widths for the Torch policy (`input_size`, `hidden_size`) and optional LSTM wrapper.
 - **[train]**: PPO-style hyperparameters (timesteps, learning rate, clipping, batch/minibatch, BPTT horizon, optimizer choice) merged with any unspecified defaults from `pufferlib/config/default.ini`.
 - **[eval]**: WOSAC/human-replay switches and sizing (`eval.wosac_*`, `eval.human_replay_*`) mapped directly to the `Drive` kwargs in evaluation subprocesses.
 
 ## Model overview
+
 Defined in `pufferlib/ocean/torch.py:Drive`:
+
 - Three MLP encoders (ego, partners, roads) with LayerNorm. Partner and road encodings are max-pooled across instances.
 - Concatenated embedding → GELU → linear to `hidden_size`, then split into actor/value heads.
 - Discrete actions are emitted as logits per dimension (`MultiDiscrete`), continuous actions as Gaussian parameters (`softplus` std). Value head is a single linear output.
 - `Recurrent = pufferlib.models.LSTMWrapper` can wrap the policy using the `rnn` config entries; otherwise the policy is feed-forward.
 
 ## Drive source files (what lives where)
+
 - `pufferlib/ocean/drive/drive.py`: Python Gymnasium-style wrapper that sets up buffers, validates map availability, seeds the C core via `binding.env_init`, and handles map resampling.
 - `pufferlib/ocean/drive/drive.h`: Main C implementation of stepping, observations, rewards/metrics, grid map, lane graph, and collision checking.
 - `pufferlib/ocean/drive/binding.c`: Python C-extension glue that exposes `Drive` to Python, handles shared buffer setup, logging, and reading the `.ini` config.
@@ -89,24 +98,25 @@ Defined in `pufferlib/ocean/torch.py:Drive`:
 
 Determines which agents are **created** in the environment.
 
-| Option | Description |
-| --- | --- |
-| `create_all_valid` | Create all entities valid at initialization (`traj_valid[init_steps] == 1`). |
-| `create_only_controlled` | Create only those agents that are controlled by the policy. |
+| Option                     | Description                                                                    |
+| -------------------------- | ------------------------------------------------------------------------------ |
+| `create_all_valid`       | Create all entities valid at initialization (`traj_valid[init_steps] == 1`). |
+| `create_only_controlled` | Create only those agents that are controlled by the policy.                    |
 
 #### `control_mode`
 
 Determines which created agents are **controlled** by the policy.
 
-| Option | Description |
-| --- | --- |
-| `control_vehicles` (default) | Control only valid **vehicles** (not experts, beyond `MIN_DISTANCE_TO_GOAL`, under `MAX_AGENTS`). |
-| `control_agents` | Control all valid **agent types** (vehicles, cyclists, pedestrians). |
-| `control_tracks_to_predict` *(WOMD only)* | Control agents listed in the `tracks_to_predict` metadata. |
+| Option                                        | Description                                                                                                |
+| --------------------------------------------- | ---------------------------------------------------------------------------------------------------------- |
+| `control_vehicles` (default)                | Control only valid **vehicles** (not experts, beyond `MIN_DISTANCE_TO_GOAL`, under `MAX_AGENTS`). |
+| `control_agents`                            | Control all valid **agent types** (vehicles, cyclists, pedestrians).                                  |
+| `control_tracks_to_predict` *(WOMD only)* | Control agents listed in the `tracks_to_predict` metadata.                                               |
+| `control_sdc_only` *(WOMD only)* | Control just the self-driving car (SDC).                                             |
 
 ### Termination conditions (`done`)
 
-Episodes are never truncated before reaching `episode_len`. The `goal_behavior` argument controls agent behavior after reaching a goal early:
+The `goal_behavior` argument controls agent behavior after reaching a goal early:
 
 - **`goal_behavior=0` (default):** Agents respawn at their initial position after reaching their goal (last valid log position).
 - **`goal_behavior=1`:** Agents receive new goals indefinitely after reaching each goal.
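
The observation size quoted in the simulator docs (`ego_features + 63 * 7 + 200 * 7`) can be checked with a short calculation. This is a minimal sketch, not code from the repository: the constant names are hypothetical, and `MAX_AGENTS = 64` is an assumption taken from the "C core cap 64" note, which makes `MAX_AGENTS - 1` equal to the 63 partner slots in the formula.

```python
# Hypothetical constants mirroring the numbers quoted in docs/simulator.md.
MAX_AGENTS = 64          # assumption: the "C core cap 64" from the [env] notes
MAX_ROAD_SEGMENTS = 200  # road slots per observation
PARTNER_FEATURES = 7     # values per partner block
ROAD_FEATURES = 7        # values per road block
EGO_FEATURES = {"classic": 7, "jerk": 10}


def observation_size(dynamics_model: str) -> int:
    """Return the flat observation length for the given dynamics model."""
    ego = EGO_FEATURES[dynamics_model]
    partners = (MAX_AGENTS - 1) * PARTNER_FEATURES
    roads = MAX_ROAD_SEGMENTS * ROAD_FEATURES
    return ego + partners + roads


print(observation_size("classic"))  # 1848
print(observation_size("jerk"))     # 1851
```

If `MAX_AGENTS` differs in a given build, the partner term and therefore the policy's `input_size` would change with it.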
 
@@ -0,0 +1,59 @@
+## Training
+
+### Basic training
+
+Launch a training run with Weights & Biases logging:
+```bash
+puffer train puffer_drive --wandb --wandb-project "pufferdrive"
+```
+
+### Environment configurations
+
+**Default configuration (Waymo maps)**
+
+The default settings in `drive.ini` are optimized for:
+
+- Training on thousands of Waymo maps
+- Short episodes (91 steps)
+
+**Carla maps configuration**
+For training agents to drive indefinitely in larger Carla maps, we recommend modifying `drive.ini` as follows:
+```ini
+[env]
+goal_speed = 30.0  # Target speed in m/s at the goal. Lower values discourage excessive speeding
+goal_behavior = 1  # 0: respawn, 1: generate_new_goals, 2: stop
+goal_target_distance = 25.0  # Distance to new goal when using generate_new_goals
+
+# Episode settings
+episode_length = 200 # Increase for longer episode horizon
+resample_frequency = 100000 # No resampling needed (there are only a few Carla maps)
+termination_mode = 0  # 0: terminate at episode_length, 1: terminate after all agents reset
+
+# Map settings
+map_dir = "resources/drive/binaries/carla"
+num_maps = 1
+```
+
+**Note:** The default training hyperparameters work well for both configurations and typically don't need adjustment.
+
+
+## Controlled experiments
+
+Run parameter sweeps for architecture search or multi-seed experiments:
+```bash
+puffer controlled_exp puffer_drive --wandb --wandb-project "pufferdrive2.0_carla" --tag speed
+```
+
+Define parameter sweeps in `drive.ini`:
+```ini
+[controlled_exp.env.goal_speed]
+values = [10, 20, 30]
+```
+
+This will launch separate training runs for each value in the list, useful for:
+- Hyperparameter tuning
+- Architecture search
+- Running multiple random seeds
+- Ablation studies
+
+You can specify multiple controlled experiment parameters, and the system will iterate through all combinations.
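
To make "iterate through all combinations" concrete, the sketch below expands a set of sweep values into one configuration per run using a Cartesian product. It illustrates the idea only and is not the launcher's actual implementation; `expand_sweep` is a hypothetical helper, and sweeping `env.goal_target_distance` alongside `env.goal_speed` is an assumed example.

```python
from itertools import product


def expand_sweep(sweep: dict[str, list]) -> list[dict]:
    """Return one {parameter: value} dict per combination of sweep values."""
    keys = list(sweep)
    return [dict(zip(keys, combo)) for combo in product(*(sweep[k] for k in keys))]


sweep = {
    "env.goal_speed": [10, 20, 30],
    "env.goal_target_distance": [25.0, 50.0],  # hypothetical second parameter
}
runs = expand_sweep(sweep)

print(len(runs))  # 6
print(runs[0])    # {'env.goal_speed': 10, 'env.goal_target_distance': 25.0}
```

The run count is the product of the list lengths, so it grows quickly as more parameters are added.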
@@ -27,6 +27,7 @@ nav:
   - Get started: index.md
   - Docs:
       - Getting started: getting-started.md
+      - Training agents: train.md
       - Simulator: simulator.md
       - Interactive scenario editor: scene-editor.md
       - Visualizer: visualizer.md
 
@@ -25,23 +25,27 @@ action_type = discrete
 ; Options: classic, jerk
 dynamics_model = classic
 reward_vehicle_collision = -0.5
-reward_offroad_collision = -0.5     # Use -0.05 for carla maps
-reward_ade = 0.0
+reward_offroad_collision = -0.5
 dt = 0.1
 reward_goal = 1.0
 reward_goal_post_respawn = 0.25
 ; Meters around goal to be considered "reached"
 goal_radius = 2.0
+; Max target speed in m/s for the agent to maintain towards the goal
+goal_speed = 100.0
 ; What to do when the goal is reached. Options: 0:"respawn", 1:"generate_new_goals", 2:"stop"
 goal_behavior = 0
+; Target distance to the new goal when goal_behavior = 1 (generate_new_goals).
+; Larger values select a goal point farther from the agent's current position.
+goal_target_distance = 25.0
 ; Options: 0 - Ignore, 1 - Stop, 2 - Remove
 collision_behavior = 0
 ; Options: 0 - Ignore, 1 - Stop, 2 - Remove
 offroad_behavior = 0
-; Number of steps before reset
-scenario_length = 91
+; Number of steps before reset
+episode_length = 91
 resample_frequency = 910
-termination_mode = 1 # 0 - terminate at scenario_length, 1 - terminate after all agents have been reset
+termination_mode = 1 # 0 - terminate at episode_length, 1 - terminate after all agents have been reset
 map_dir = "resources/drive/binaries/training"
 num_maps = 10000
 ; Determines which step of the trajectory to initialize the agents at upon reset
@@ -85,7 +89,7 @@ render_interval = 1000
 ; If True, show exactly what the agent sees in agent observation
 obs_only = True
 ; Show grid lines
-show_grid = False
+show_grid = True
 ; Draws lines from ego agent observed ORUs and road elements to show detection range
 show_lasers = False
 ; Display human xy logs in the background
@@ -136,7 +140,7 @@ scale = auto
 distribution = log_normal
 min = 0.001
 mean = 0.005
-max = 0.01
+max = 0.03
 scale = auto
 
 [sweep.train.gamma]
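
The relationship between `episode_length` and `resample_frequency` in `drive.ini` (910 steps, roughly 10 episodes of 91 steps) can be checked with a small script. This is a sketch that uses Python's standard `configparser` on a copied fragment; it is an assumption that this parser reads these lines the same way the project's own INI loader does, and the estimate only holds if every episode runs the full `episode_length`.

```python
import configparser

# Fragment copied from the [env] section of drive.ini for illustration.
INI_FRAGMENT = """
[env]
; Number of steps before reset
episode_length = 91
resample_frequency = 910
termination_mode = 1 # 0 - terminate at episode_length, 1 - terminate after all agents have been reset
"""


def episodes_per_resample(ini_text: str) -> float:
    """Estimate how many full-length episodes fit between map resamples."""
    # '#' is treated as an inline comment marker so the trailing notes are dropped.
    parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
    parser.read_string(ini_text)
    env = parser["env"]
    return env.getint("resample_frequency") / env.getint("episode_length")


print(episodes_per_resample(INI_FRAGMENT))  # 10.0
```

With the CARLA settings suggested in `docs/train.md` (`episode_length = 200`, `resample_frequency = 100000`), the same calculation gives 500 episodes between resamples.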
 
@@ -249,7 +249,7 @@ def compute_interaction_features(
         scenario_mask = torch.as_tensor(scenario_mask_np, dtype=torch.bool, device=x_t.device)
         scenario_x = x_t[scenario_mask]
         scenario_y = y_t[scenario_mask]
-        scenario_length = length_broadcast[scenario_mask]
+        episode_length = length_broadcast[scenario_mask]  # per-object lengths for this scenario, not the episode step count
         scenario_width = width_broadcast[scenario_mask]
         scenario_heading = heading_t[scenario_mask]
         scenario_valid = valid_t[scenario_mask]
@@ -260,7 +260,7 @@ def compute_interaction_features(
         distances_to_objects = interaction_features.compute_distance_to_nearest_object(
             center_x=scenario_x,
             center_y=scenario_y,
-            length=scenario_length,
+            length=episode_length,
             width=scenario_width,
             heading=scenario_heading,
             valid=scenario_valid,
@@ -273,7 +273,7 @@ def compute_interaction_features(
         times_to_collision = interaction_features.compute_time_to_collision(
             center_x=scenario_x,
             center_y=scenario_y,
-            length=scenario_length,
+            length=episode_length,
             width=scenario_width,
             heading=scenario_heading,
             valid=scenario_valid,
 
@@ -75,6 +75,7 @@ static PyObject *my_shared(PyObject *self, PyObject *args, PyObject *kwargs) {
     int control_mode = unpack(kwargs, "control_mode");
     int init_steps = unpack(kwargs, "init_steps");
     int goal_behavior = unpack(kwargs, "goal_behavior");
+    float goal_target_distance = unpack(kwargs, "goal_target_distance");
     int use_all_maps = unpack(kwargs, "use_all_maps");
     clock_gettime(CLOCK_REALTIME, &ts);
     srand(ts.tv_nsec);
@@ -94,6 +95,7 @@ static PyObject *my_shared(PyObject *self, PyObject *args, PyObject *kwargs) {
         env->control_mode = control_mode;
         env->init_steps = init_steps;
         env->goal_behavior = goal_behavior;
+        env->goal_target_distance = goal_target_distance;
         snprintf(map_file, sizeof(map_file), "%s/map_%03d.bin", map_dir, map_id);
         env->entities = load_map_binary(map_file, env);
         set_active_agents(env);
@@ -175,11 +177,11 @@ static int my_init(Env *env, PyObject *args, PyObject *kwargs) {
     if (ini_parse(env->ini_file, handler, &conf) < 0) {
         printf("Error while loading %s", env->ini_file);
     }
-    if (kwargs && PyDict_GetItemString(kwargs, "scenario_length")) {
-        conf.scenario_length = (int)unpack(kwargs, "scenario_length");
+    if (kwargs && PyDict_GetItemString(kwargs, "episode_length")) {
+        conf.episode_length = (int)unpack(kwargs, "episode_length");
     }
-    if (conf.scenario_length <= 0) {
-        PyErr_SetString(PyExc_ValueError, "scenario_length must be > 0 (set in INI or kwargs)");
+    if (conf.episode_length <= 0) {
+        PyErr_SetString(PyExc_ValueError, "episode_length must be > 0 (set in INI or kwargs)");
         return -1;
     }
     env->action_type = conf.action_type;
@@ -188,8 +190,7 @@ static int my_init(Env *env, PyObject *args, PyObject *kwargs) {
     env->reward_offroad_collision = conf.reward_offroad_collision;
     env->reward_goal = conf.reward_goal;
     env->reward_goal_post_respawn = conf.reward_goal_post_respawn;
-    env->reward_ade = conf.reward_ade;
-    env->scenario_length = conf.scenario_length;
+    env->episode_length = conf.episode_length;
     env->termination_mode = conf.termination_mode;
     env->collision_behavior = conf.collision_behavior;
     env->offroad_behavior = conf.offroad_behavior;
@@ -198,7 +199,9 @@ static int my_init(Env *env, PyObject *args, PyObject *kwargs) {
     env->init_mode = (int)unpack(kwargs, "init_mode");
     env->control_mode = (int)unpack(kwargs, "control_mode");
     env->goal_behavior = (int)unpack(kwargs, "goal_behavior");
+    env->goal_target_distance = (float)unpack(kwargs, "goal_target_distance");
     env->goal_radius = (float)unpack(kwargs, "goal_radius");
+    env->goal_speed = (float)unpack(kwargs, "goal_speed");
     char *map_dir = unpack_str(kwargs, "map_dir");
     int map_id = unpack(kwargs, "map_id");
     int max_agents = unpack(kwargs, "max_agents");
@@ -223,8 +226,11 @@ static int my_log(PyObject *dict, Log *log) {
     assign_to_dict(dict, "dnf_rate", log->dnf_rate);
     assign_to_dict(dict, "completion_rate", log->completion_rate);
     assign_to_dict(dict, "lane_alignment_rate", log->lane_alignment_rate);
-    assign_to_dict(dict, "avg_offroad_per_agent", log->avg_offroad_per_agent);
-    assign_to_dict(dict, "avg_collisions_per_agent", log->avg_collisions_per_agent);
+    assign_to_dict(dict, "offroad_per_agent", log->offroad_per_agent);
+    assign_to_dict(dict, "collisions_per_agent", log->collisions_per_agent);
+    assign_to_dict(dict, "goals_sampled_this_episode", log->goals_sampled_this_episode);
+    assign_to_dict(dict, "goals_reached_this_episode", log->goals_reached_this_episode);
+    assign_to_dict(dict, "speed_at_goal", log->speed_at_goal);
     // assign_to_dict(dict, "avg_displacement_error", log->avg_displacement_error);
     return 0;
 }
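
The override-then-validate order that `my_init` applies to `episode_length` can be mirrored in Python to make the precedence explicit: the INI value is the baseline, a keyword argument replaces it when present, and the result must be positive. This is a sketch of the logic visible in the hunk above, not code from the repository; `resolve_episode_length` is a hypothetical helper.

```python
def resolve_episode_length(ini_value: int, kwargs: dict) -> int:
    """Pick episode_length from kwargs if given, else from the INI, then validate."""
    value = ini_value
    if kwargs and "episode_length" in kwargs:
        # Like the (int) cast in binding.c, int() truncates toward zero.
        value = int(kwargs["episode_length"])
    if value <= 0:
        raise ValueError("episode_length must be > 0 (set in INI or kwargs)")
    return value


print(resolve_episode_length(91, {}))                       # 91
print(resolve_episode_length(91, {"episode_length": 200}))  # 200
```

A non-positive value from either source raises `ValueError`, matching the error string set in the C binding.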