Commit 7bbe406

Merge branch 'xyao/exp/ci_stall' into xyao/exp/revert_eval_runner_tests

2 parents: b59c562 + 867391f

6 files changed: 58 additions & 9 deletions
docs/pages/example_workflows/dexsuite_lift/step_2_policy_training.rst (23 additions, 0 deletions)

@@ -64,6 +64,29 @@ Hyperparameters can be overridden with Hydra-style CLI arguments:
 
     agent.max_iterations=20000 agent.save_interval=500 agent.algorithm.learning_rate=0.0005
 
+Resuming from a Checkpoint
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To resume training from a previously saved checkpoint, use the ``--resume`` flag
+together with ``--load_run`` (run folder name) and ``--checkpoint`` (model filename).
+Both arguments are optional; when omitted, the most recent run and latest checkpoint
+are used automatically.
+
+.. code-block:: bash
+
+   python submodules/IsaacLab/scripts/reinforcement_learning/rsl_rl/train.py \
+       --task Isaac-Dexsuite-Kuka-Allegro-Lift-v0 \
+       --num_envs 512 \
+       --resume \
+       --load_run <timestamp> \
+       --checkpoint model_5000.pt \
+       presets=newton presets=cube
+
+Replace ``<timestamp>`` with the run folder name under ``logs/rsl_rl/dexsuite_kuka_allegro/``.
+If ``--load_run`` is omitted, the latest run is selected. If ``--checkpoint`` is omitted,
+the latest checkpoint in that run is loaded.
+
 Monitoring Training
 ^^^^^^^^^^^^^^^^^^^

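The resume defaults documented in the diff above (latest run when ``--load_run`` is omitted, latest checkpoint when ``--checkpoint`` is omitted) can be sketched as a small resolver. This is an illustration only, not rsl_rl's actual code: the ``resolve_checkpoint`` helper and its sort rules are assumptions based on the ``logs/rsl_rl/<experiment>/<timestamp>/model_<iter>.pt`` layout the docs describe.

```python
from pathlib import Path


def resolve_checkpoint(log_root, load_run=None, checkpoint=None):
    """Pick a run folder and checkpoint file, defaulting to the newest.

    Hypothetical sketch: assumes timestamped run-folder names sort
    chronologically and checkpoints are named ``model_<iteration>.pt``.
    """
    root = Path(log_root)
    if load_run is None:
        # Latest run: timestamp-named folders sort chronologically by name.
        runs = sorted(p for p in root.iterdir() if p.is_dir())
        run = runs[-1]
    else:
        run = root / load_run
    if checkpoint is None:
        # Latest checkpoint: highest iteration number in the filename.
        ckpts = sorted(run.glob("model_*.pt"),
                       key=lambda p: int(p.stem.split("_")[1]))
        return ckpts[-1]
    return run / checkpoint
```

The same resolver covers both training docs changed in this commit, since they describe identical fallback behavior.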
docs/pages/example_workflows/dexsuite_lift/step_3_evaluation.rst (1 addition, 1 deletion)

@@ -9,7 +9,7 @@ Once inside the container, set the models directory:
 
 .. code-block:: bash
 
-   export MODELS_DIR=models/isaaclab_arena/dexsuite_lift
+   export MODELS_DIR=/models/isaaclab_arena/dexsuite_lift
    mkdir -p $MODELS_DIR
 
 This step evaluates a checkpoint using Arena's ``dexsuite_lift`` environment.

docs/pages/example_workflows/reinforcement_learning/index.rst (1 addition, 1 deletion)

@@ -72,7 +72,7 @@ You'll need to create folders for logs, checkpoints, and models:
 
    export LOG_DIR=logs/rsl_rl
    mkdir -p $LOG_DIR
-   export MODELS_DIR=models/isaaclab_arena/reinforcement_learning
+   export MODELS_DIR=/models/isaaclab_arena/reinforcement_learning
    mkdir -p $MODELS_DIR
 
 Workflow Steps

docs/pages/example_workflows/reinforcement_learning/step_2_policy_training.rst (27 additions, 1 deletion)

@@ -61,6 +61,31 @@ For example, to train with relu activation and a higher learning rate:
 
     agent.algorithm.learning_rate=0.001
 
+Resuming from a Checkpoint
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To resume training from a previously saved checkpoint, use the ``--resume`` flag
+together with ``--load_run`` (run folder name) and ``--checkpoint`` (model filename).
+Both arguments are optional; when omitted, the most recent run and latest checkpoint
+are used automatically.
+
+.. code-block:: bash
+
+   python submodules/IsaacLab/scripts/reinforcement_learning/rsl_rl/train.py \
+       --external_callback isaaclab_arena.environments.isaaclab_interop.environment_registration_callback \
+       --task lift_object \
+       --rl_training_mode \
+       --num_envs 4096 \
+       --max_iterations 4000 \
+       --resume \
+       --load_run <timestamp> \
+       --checkpoint model_1999.pt
+
+Replace ``<timestamp>`` with the run folder name under ``logs/rsl_rl/generic_experiment/``.
+If ``--load_run`` is omitted, the latest run is selected. If ``--checkpoint`` is omitted,
+the latest checkpoint in that run is loaded.
+
 Monitoring Training
 ^^^^^^^^^^^^^^^^^^^

@@ -101,6 +126,7 @@ During training, each iteration prints a summary to the console:
 
     ETA: 00:00:49
 
+
 Multi-GPU Training
 ^^^^^^^^^^^^^^^^^^

@@ -112,7 +138,7 @@ Add ``--distributed`` to spread environments across all available GPUs:
 
        --external_callback isaaclab_arena.environments.isaaclab_interop.environment_registration_callback \
        --task lift_object \
        --rl_training_mode \
-       --num_envs 4096\
+       --num_envs 4096 \
        --max_iterations 2000 \
        --distributed

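For the ``--distributed`` run above, each GPU rank simulates its own share of the environments, so the per-rank count is simply the total divided by the number of ranks. A minimal arithmetic sketch; the ``envs_per_rank`` helper is illustrative, not part of the launcher:

```python
def envs_per_rank(total_envs, world_size):
    """Split environments evenly across GPU ranks.

    Illustrative assumption: the total divides evenly; the real
    launcher may handle remainders differently.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    if total_envs % world_size != 0:
        raise ValueError("num_envs should be divisible by the number of GPUs")
    return total_envs // world_size
```

With ``--num_envs 4096`` on a 4-GPU node, each rank would step 1024 environments under this assumption.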
docs/pages/example_workflows/reinforcement_learning/step_3_evaluation.rst (1 addition, 1 deletion)

@@ -9,7 +9,7 @@ Once inside the container, set the models directory if you plan to download pre-trained models:
 
 .. code:: bash
 
-   export MODELS_DIR=models/isaaclab_arena/reinforcement_learning
+   export MODELS_DIR=/models/isaaclab_arena/reinforcement_learning
    mkdir -p $MODELS_DIR
 
 This tutorial assumes you've completed :doc:`step_2_policy_training` and have a trained checkpoint,

isaaclab_arena/tests/utils/subprocess.py (5 additions, 5 deletions)

@@ -42,11 +42,11 @@ def run_subprocess(
 ) -> subprocess.CompletedProcess | None:
     """Run a command in a subprocess with timeout.
 
-    ``start_new_session=True`` isolates the child into its own process group.
-    The child-side ``SimulationAppContext`` uses this to SIGTERM its entire
-    group before ``os._exit()``, preventing orphaned Kit children (shader
-    compiler, GPU workers, …) from holding GPU resources and blocking the
-    next subprocess.
+    The child is launched with ``start_new_session=True`` so it lives in its
+    own process group. The child-side ``SimulationAppContext`` uses this to
+    SIGTERM its entire group before ``os._exit()``, preventing orphaned Kit
+    children (shader compiler, GPU workers, …) from holding GPU resources and
+    blocking the next subprocess.
 
     Args:
         cmd: Command to run (list of strings).

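The reworded docstring describes a standard POSIX pattern: launch the child in its own process group, then signal the whole group on timeout so grandchildren die too. A minimal standalone sketch of that pattern (not the project's actual code; uses ``sleep`` as a stand-in for a stalled child):

```python
import os
import signal
import subprocess

# start_new_session=True puts the child in its own session and process
# group, whose group id equals the child's pid.
proc = subprocess.Popen(["sleep", "60"], start_new_session=True)
try:
    # Pretend the child stalled past its timeout.
    proc.wait(timeout=0.2)
except subprocess.TimeoutExpired:
    # Signal the entire group so any processes the child spawned are
    # terminated with it, instead of lingering and holding resources.
    os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
    proc.wait()

# A negative return code means "terminated by that signal number".
assert proc.returncode == -signal.SIGTERM
```

Without the group-wide ``os.killpg``, terminating only ``proc`` would leave its descendants running, which is exactly the orphaned-Kit-children problem the docstring describes.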