diff --git a/docs/source/en/api/pipelines/cosmos3.md b/docs/source/en/api/pipelines/cosmos3.md
Lines changed: 234 additions & 225 deletions
diff --git a/examples/cosmos3/README.md b/examples/cosmos3/README.md
Lines changed: 112 additions & 5 deletions
diff --git a/examples/cosmos3/inference_cosmos3.py b/examples/cosmos3/inference_cosmos3.py
Lines changed: 125 additions & 20 deletions
diff --git a/src/diffusers/__init__.py b/src/diffusers/__init__.py
Lines changed: 2 additions & 0 deletions
@@ -48,16 +48,123 @@ python examples/cosmos3/inference_cosmos3.py \
     --enable-sound
 ```
 
+Action forward dynamics, robot domain (predict video from an observation video and a provided action chunk):
+
+```bash
+python examples/cosmos3/inference_cosmos3.py \
+    --model nano \
+    --prompt "Put the pot to the left of the purple item." \
+    --vision-path "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/action/bridge_0.mp4" \
+    --action-mode forward_dynamics \
+    --action-path "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/action/bridge_0.json" \
+    --action-chunk-size 16 \
+    --domain-name bridge_orig_lerobot \
+    --resolution-tier 480 --fps 5 \
+    --num-inference-steps 30 --guidance-scale 1.0 --flow-shift 10.0 --seed 0 \
+    --output results/cosmos3_forward_dynamics_robot
+```
+
+Action forward dynamics, autonomous-vehicle domain:
+
+```bash
+python examples/cosmos3/inference_cosmos3.py \
+    --model nano \
+    --prompt "You are an autonomous vehicle planning system." \
+    --vision-path "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/action/av_vision_25_73d01c91-51f0-46cf-9b76-5682a76fb349.mp4" \
+    --action-mode forward_dynamics \
+    --action-path "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/action/av_action_25.json" \
+    --action-chunk-size 60 \
+    --domain-name av \
+    --resolution-tier 480 --fps 10 \
+    --num-inference-steps 30 --guidance-scale 1.0 --flow-shift 10.0 --seed 0 \
+    --output results/cosmos3_forward_dynamics_av
+```
+
+Action inverse dynamics, robot domain (predict actions from an observed video):
+
+```bash
+python examples/cosmos3/inference_cosmos3.py \
+    --model nano \
+    --prompt "Put the pot to the left of the purple item." \
+    --vision-path "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/action/bridge_0.mp4" \
+    --action-mode inverse_dynamics \
+    --action-chunk-size 16 \
+    --domain-name bridge_orig_lerobot \
+    --resolution-tier 480 --fps 5 \
+    --num-inference-steps 30 --guidance-scale 1.0 --flow-shift 10.0 --seed 0 \
+    --output results/cosmos3_inverse_dynamics_robot
+```
+
+Action inverse dynamics, autonomous-vehicle domain:
+
+```bash
+python examples/cosmos3/inference_cosmos3.py \
+    --model nano \
+    --prompt "You are an autonomous vehicle planning system." \
+    --vision-path "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/action/av_vision_25_73d01c91-51f0-46cf-9b76-5682a76fb349.mp4" \
+    --action-mode inverse_dynamics \
+    --action-chunk-size 60 \
+    --domain-name av \
+    --resolution-tier 480 --fps 10 \
+    --num-inference-steps 30 --guidance-scale 1.0 --flow-shift 10.0 --seed 0 \
+    --output results/cosmos3_inverse_dynamics_av
+```
+
+Action policy, robot domain (predict both future video and actions from the first observation frame):
+
+```bash
+python examples/cosmos3/inference_cosmos3.py \
+    --model nano \
+    --prompt "Put the pot to the left of the purple item." \
+    --vision-path "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/action/bridge_0.mp4" \
+    --action-mode policy \
+    --action-chunk-size 16 \
+    --domain-name bridge_orig_lerobot \
+    --resolution-tier 480 --fps 5 \
+    --num-inference-steps 30 --guidance-scale 1.0 --flow-shift 10.0 --seed 0 \
+    --output results/cosmos3_policy_robot
+```
+
+Action policy, autonomous-vehicle domain:
+
+```bash
+python examples/cosmos3/inference_cosmos3.py \
+    --model nano \
+    --prompt "You are an autonomous vehicle planning system. Please go backward." \
+    --vision-path "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/action/av_vision_25_73d01c91-51f0-46cf-9b76-5682a76fb349.mp4" \
+    --action-mode policy \
+    --action-chunk-size 60 \
+    --domain-name av \
+    --resolution-tier 480 --fps 10 \
+    --num-inference-steps 30 --guidance-scale 1.0 --flow-shift 10.0 --seed 0 \
+    --output results/cosmos3_policy_av
+```
+
+Action modes operate on `action_chunk_size + 1` video frames. `forward_dynamics` reads its actions from `--action-path`; `inverse_dynamics` and `policy` write their predicted actions to `sample_action.json` in the model-normalized action space. The script loads `--vision-path` as a video in every action mode: `policy` and `forward_dynamics` condition only on its first frame, while `inverse_dynamics` uses the whole video.
+
+Pass `--prompt` as a plain task description and select the camera perspective with `--view-point` (default `ego_view`); the pipeline builds the structured action caption (task, viewpoint, duration, FPS, resolution) the model was trained on. Do not hand-write the viewpoint sentence into `--prompt`.
+
+`--resolution-tier` selects a resolution *tier* (`256`/`480`/`704`/`720`), not the output frame size. Each tier keys a table of predefined aspect-ratio canvases, and the canvas closest to the input aspect ratio becomes the padded conditioning canvas. The input is downscaled (never upscaled) and padded to fill that canvas; the padding is then cropped from the latents, so the decoded output follows the downscaled input content. `--height` / `--width` (and `--num-frames`) are ignored for action modes.
+
+Pick the tier that matches the native resolution of your conditioning input (`480` for ~480p, `720` for ~720p). A tier below your input downscales it and discards detail. A tier above your input gains no resolution (content is never upscaled), wastes compute on padding, and creates a train/inference distribution mismatch that can degrade quality.
+
 ### Useful flags
 
 | Flag | Default | Description |
 |---|---|---|
 | `--prompt` | (required) | Text prompt. |
-| `--vision-path` | `None` | URL or local path for an image-conditioning frame (image-to-video). |
-| `--num-frames` | `189` | `1` = image, otherwise number of video frames (`189` ≈ 7.9 s @ 24 FPS). |
-| `--height` / `--width` | `720` / `1280` | Output resolution (must be a multiple of the VAE spatial scale factor). |
+| `--vision-path` | `None` | URL or local path for an image-conditioning frame (image-to-video), or for the conditioning video in action modes. |
+| `--num-frames` | `189` | `1` = image, otherwise number of video frames (`189` ≈ 7.9 s @ 24 FPS). Ignored for action modes (derived from `--action-chunk-size`). |
+| `--height` / `--width` | `720` / `1280` | Output resolution (must be a multiple of the VAE spatial scale factor). Ignored for action modes; use `--resolution-tier`. |
+| `--resolution-tier` | `480` | Action resolution tier (`256`/`480`/`704`/`720`): selects the aspect bin / padded conditioning canvas, not the output size. |
 | `--fps` | `24.0` | Frame rate of the generated video. |
+| `--flow-shift` | `None` | Override `UniPCMultistepScheduler.flow_shift` (and force `use_karras_sigmas=False`); left at the checkpoint default when unset. Cosmos3 runs use `10.0`. |
 | `--enable-sound` | off | Generate a synchronized audio track. |
-| `--no-duration-template` | off | Skip the duration metadata sentence appended to the prompt and negative prompt. Ignored for `--num-frames 1`. |
-| `--no-resolution-template` | off | Skip the resolution metadata sentence appended to the prompt and negative prompt. |
+| `--action-mode` | `None` | Enable action conditioning/generation. One of `forward_dynamics`, `inverse_dynamics`, or `policy`. |
+| `--action-path` | `None` | URL or local JSON action path for `forward_dynamics`. |
+| `--action-chunk-size` | `None` | Number of action tokens. Action runs generate/use `action_chunk_size + 1` video frames. |
+| `--domain-name` | `None` | Action embodiment domain, for example `bridge_orig_lerobot` or `av`. |
+| `--view-point` | `ego_view` | Camera perspective for the action caption's framing (`ego_view`, `third_person_view`, `wrist_view`, `concat_view`). Action only. |
+| `--no-duration-template` | off | Skip the duration metadata sentence appended to the prompt and negative prompt. Ignored for `--num-frames 1` and for action modes (which build a structured caption instead). |
+| `--no-resolution-template` | off | Skip the resolution metadata sentence appended to the prompt and negative prompt. Ignored for action modes. |
 | `--output` | `.` | Directory to write `sample.jpg` or `sample.mp4`. |
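
The tier behaviour described in the README text above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's code: the `HYPOTHETICAL_CANVASES` table, the `CanvasFit` type, and `fit_to_tier` are names and values invented here, because the real canvas table is not part of this diff. The sketch shows the closest-aspect canvas being chosen, the input being downscaled but never upscaled, and the leftover area becoming padding.

```python
from dataclasses import dataclass

# Hypothetical (width, height) canvases per tier. The real table lives in the
# pipeline and is not shown in this diff; these values are placeholders.
HYPOTHETICAL_CANVASES = {
    480: [(640, 480), (832, 480), (480, 832), (480, 480)],
    720: [(960, 720), (1280, 720), (720, 1280), (720, 720)],
}


@dataclass
class CanvasFit:
    canvas: tuple[int, int]   # padded conditioning canvas (width, height)
    content: tuple[int, int]  # input size after downscaling (width, height)
    padding: tuple[int, int]  # total padding added (width, height)


def fit_to_tier(width: int, height: int, tier: int) -> CanvasFit:
    """Pick the closest-aspect canvas, downscale (never upscale), then pad."""
    aspect = width / height
    canvas = min(HYPOTHETICAL_CANVASES[tier], key=lambda wh: abs(wh[0] / wh[1] - aspect))
    # Capping the scale at 1.0 is what "never upscaled" means here.
    scale = min(1.0, canvas[0] / width, canvas[1] / height)
    content = (round(width * scale), round(height * scale))
    padding = (canvas[0] - content[0], canvas[1] - content[1])
    return CanvasFit(canvas, content, padding)


print(fit_to_tier(1280, 720, 480))
print(fit_to_tier(320, 240, 480))
```

With these placeholder canvases, a 1280x720 input at tier `480` lands on the 832x480 canvas with 12 rows of padding, while a 320x240 input keeps its size and leaves three quarters of the 640x480 canvas as padding, which is the wasted compute the README warns about.
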
@@ -23,13 +23,15 @@
 """
 
 import argparse
+import json
 import pathlib
+import urllib.request
 
 import torch
 from huggingface_hub import snapshot_download
 
-from diffusers import Cosmos3OmniPipeline
-from diffusers.utils import encode_video, export_to_video, load_image
+from diffusers import Cosmos3OmniPipeline, CosmosActionCondition, UniPCMultistepScheduler
+from diffusers.utils import encode_video, export_to_video, load_image, load_video
 
 
 HF_REPOS = {
@@ -38,6 +40,22 @@
 }
 
 
+def _load_action(path: str | None):
+    if path is None:
+        raise ValueError("--action-path is required for forward_dynamics mode.")
+    if path.startswith(("http://", "https://")):
+        with urllib.request.urlopen(path) as response:
+            action = json.loads(response.read().decode("utf-8"))
+    else:
+        action = json.loads(pathlib.Path(path).read_text())
+    tensor = torch.as_tensor(action, dtype=torch.float32)
+    if tensor.ndim == 3 and tensor.shape[0] == 1:
+        tensor = tensor.squeeze(0)
+    if tensor.ndim != 2:
+        raise ValueError(f"Cosmos3 action must have shape [T, D], got {tuple(tensor.shape)}.")
+    return tensor
+
+
 def main():
     parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
     parser.add_argument("--prompt", required=True, help="Text prompt.")
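
When preparing a `forward_dynamics` input, the shape rule enforced by `_load_action` can be checked before launching a run. The sketch below is a torch-free approximation under stated assumptions: `action_shape` is a hypothetical helper, not part of the script, and the 16 x 7 payload is a placeholder, since the real action dimension per domain is not given in this diff.

```python
import json
import pathlib
import tempfile


def action_shape(action) -> tuple[int, int]:
    """Return (T, D) for a nested-list action shaped [T, D] or [1, T, D]."""
    # Mirror the squeeze in _load_action: drop a leading batch dimension of size 1.
    if (
        isinstance(action, list)
        and len(action) == 1
        and isinstance(action[0], list)
        and action[0]
        and isinstance(action[0][0], list)
    ):
        action = action[0]
    if not isinstance(action, list) or not action:
        raise ValueError("action must have shape [T, D]")
    if not all(isinstance(row, list) for row in action):
        raise ValueError("action must have shape [T, D]")
    widths = {len(row) for row in action}
    has_nested = any(isinstance(value, list) for row in action for value in row)
    if len(widths) != 1 or has_nested:
        raise ValueError("action must have shape [T, D]")
    return len(action), widths.pop()


# Placeholder payload: 16 steps of a 7-dimensional action (both numbers are assumptions).
payload = [[0.0] * 7 for _ in range(16)]
with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "action.json"
    path.write_text(json.dumps(payload))
    shape = action_shape(json.loads(path.read_text()))
print(shape)
```

A file that passes this check is a JSON list of `T` rows with `D` numbers each, optionally wrapped in one extra outer list, which is the layout `_load_action` accepts.
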
@@ -50,24 +68,68 @@ def main():
     parser.add_argument(
         "--vision-path",
         default=None,
-        help="Optional URL or local path for an image-conditioning frame (enables image-to-video).",
+        help="Optional URL or local path for an image-conditioning frame, or an action conditioning video.",
     )
     parser.add_argument("--output", default=".", help="Directory to save generated video/image/audio files.")
-    parser.add_argument("--height", type=int, default=720)
-    parser.add_argument("--width", type=int, default=1280)
+    parser.add_argument(
+        "--height",
+        type=int,
+        default=None,
+        help="Output height in pixels (default 720). Ignored for action modes; use --resolution-tier instead.",
+    )
+    parser.add_argument(
+        "--width",
+        type=int,
+        default=None,
+        help="Output width in pixels (default 1280). Ignored for action modes; use --resolution-tier instead.",
+    )
     parser.add_argument(
         "--num-frames",
         type=int,
         default=189,
         help="Number of frames to generate. Use 1 for text-to-image; defaults to 189 for video (≈ 7.9s @ 24 FPS).",
     )
     parser.add_argument("--fps", type=float, default=24.0)
+    parser.add_argument("--guidance-scale", type=float, default=6.0, help="Classifier-free guidance scale.")
+    parser.add_argument("--num-inference-steps", type=int, default=35, help="Number of denoising steps.")
+    parser.add_argument(
+        "--flow-shift",
+        type=float,
+        default=None,
+        help="Override the scheduler's flow-matching shift (UniPCMultistepScheduler.flow_shift).",
+    )
+    parser.add_argument("--seed", type=int, default=None, help="Random seed for latent initialization.")
     parser.add_argument(
         "--enable-sound",
         action="store_true",
         default=False,
         help="Generate sound alongside video (requires a sound-capable checkpoint).",
     )
+    parser.add_argument(
+        "--action-mode",
+        choices=["forward_dynamics", "inverse_dynamics", "policy"],
+        default=None,
+        help="Enable Cosmos3 action generation with a loaded conditioning video.",
+    )
+    parser.add_argument("--action-path", default=None, help="JSON action path for forward_dynamics mode.")
+    parser.add_argument("--action-chunk-size", type=int, default=None, help="Number of action tokens to generate/use.")
+    parser.add_argument("--domain-name", default=None, help="Cosmos3 action embodiment domain name.")
+    parser.add_argument(
+        "--view-point",
+        choices=["ego_view", "third_person_view", "wrist_view", "concat_view"],
+        default="ego_view",
+        help="Camera perspective for the action caption's cinematography.framing field (default: ego_view).",
+    )
+    parser.add_argument(
+        "--resolution-tier",
+        type=int,
+        default=480,
+        choices=[256, 480, 704, 720],
+        help=(
+            "Action resolution tier (256/480/704/720). Selects the aspect bin / padded conditioning canvas, "
+            "not the output frame size."
+        ),
+    )
     parser.add_argument(
         "--no-duration-template",
         dest="add_duration_template",
@@ -108,23 +170,59 @@ def main():
     )
     print("Pipeline loaded successfully.")
 
+    if args.flow_shift is not None:
+        pipeline.scheduler = UniPCMultistepScheduler.from_config(
+            pipeline.scheduler.config, flow_shift=args.flow_shift, use_karras_sigmas=False
+        )
+
     output_dir = pathlib.Path(args.output)
     output_dir.mkdir(parents=True, exist_ok=True)
-
-    image = load_image(args.vision_path) if args.vision_path is not None else None
-
-    result = pipeline(
-        prompt=args.prompt,
-        image=image,
-        num_frames=args.num_frames,
-        height=args.height,
-        width=args.width,
-        fps=args.fps,
-        enable_sound=args.enable_sound,
-        add_resolution_template=args.add_resolution_template,
-        add_duration_template=args.add_duration_template,
-        enable_safety_check=not args.no_safety_check,
-    )
+    generator = torch.Generator().manual_seed(args.seed) if args.seed is not None else None
+
+    if args.action_mode is not None:
+        if args.vision_path is None:
+            raise ValueError("--vision-path must point to a conditioning video for action modes.")
+        if args.action_chunk_size is None:
+            raise ValueError("--action-chunk-size is required for action modes.")
+        video = load_video(args.vision_path)
+        raw_actions = _load_action(args.action_path) if args.action_mode == "forward_dynamics" else None
+        result = pipeline(
+            prompt=args.prompt,
+            action=CosmosActionCondition(
+                mode=args.action_mode,
+                chunk_size=args.action_chunk_size,
+                domain_name=args.domain_name,
+                resolution_tier=args.resolution_tier,
+                raw_actions=raw_actions,
+                video=video,
+                view_point=args.view_point,
+            ),
+            fps=args.fps,
+            num_inference_steps=args.num_inference_steps,
+            guidance_scale=args.guidance_scale,
+            generator=generator,
+            use_system_prompt=False,
+            add_resolution_template=args.add_resolution_template,
+            add_duration_template=args.add_duration_template,
+            enable_safety_check=not args.no_safety_check,
+        )
+    else:
+        image = load_image(args.vision_path) if args.vision_path is not None else None
+        result = pipeline(
+            prompt=args.prompt,
+            image=image,
+            num_frames=args.num_frames,
+            height=args.height,
+            width=args.width,
+            fps=args.fps,
+            num_inference_steps=args.num_inference_steps,
+            enable_sound=args.enable_sound,
+            guidance_scale=args.guidance_scale,
+            generator=generator,
+            add_resolution_template=args.add_resolution_template,
+            add_duration_template=args.add_duration_template,
+            enable_safety_check=not args.no_safety_check,
+        )
 
     if args.num_frames == 1:
         save_path = output_dir / "sample.jpg"
@@ -145,6 +243,13 @@ def main():
             export_to_video(result.video, str(save_path), fps=int(args.fps), quality=10, macro_block_size=1)
     print(f"Saved: {save_path}")
 
+    if result.action is not None:
+        for index, action in enumerate(result.action):
+            action_path = output_dir / ("sample_action.json" if index == 0 else f"sample_action_{index}.json")
+            with open(action_path, "w") as f:
+                json.dump(action.tolist(), f)
+            print(f"Saved: {action_path}")
+
 
 if __name__ == "__main__":
     main()
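
The flag interplay in the action branch of the script can be summarized as a small planning function. This is a sketch under stated assumptions rather than code from the PR: `ActionRunPlan` and `plan_action_run` are hypothetical names, the duration follows the README's frames-over-FPS convention (`189` frames ≈ 7.9 s at 24 FPS), and the `writes_action_json` field reflects the README's description of which modes write `sample_action.json`.

```python
from dataclasses import dataclass

ACTION_MODES = ("forward_dynamics", "inverse_dynamics", "policy")


@dataclass(frozen=True)
class ActionRunPlan:
    num_frames: int           # action_chunk_size + 1 video frames
    duration_s: float         # num_frames / fps
    reads_action_path: bool   # only forward_dynamics consumes --action-path
    writes_action_json: bool  # inverse_dynamics and policy write sample_action.json


def plan_action_run(mode, vision_path, action_path, chunk_size, fps) -> ActionRunPlan:
    """Apply the same argument checks as the script, then summarize the run."""
    if mode not in ACTION_MODES:
        raise ValueError(f"unknown action mode: {mode!r}")
    if vision_path is None:
        raise ValueError("--vision-path must point to a conditioning video for action modes.")
    if chunk_size is None:
        raise ValueError("--action-chunk-size is required for action modes.")
    if mode == "forward_dynamics" and action_path is None:
        raise ValueError("--action-path is required for forward_dynamics mode.")
    num_frames = chunk_size + 1
    return ActionRunPlan(
        num_frames=num_frames,
        duration_s=num_frames / fps,
        reads_action_path=mode == "forward_dynamics",
        writes_action_json=mode in ("inverse_dynamics", "policy"),
    )


# The robot and autonomous-vehicle settings from the README examples.
robot = plan_action_run("forward_dynamics", "bridge_0.mp4", "bridge_0.json", 16, 5.0)
av = plan_action_run("policy", "av_vision.mp4", None, 60, 10.0)
print(robot)
print(av)
```

Under these assumptions the robot examples cover 17 frames (3.4 s at 5 FPS) and the autonomous-vehicle examples cover 61 frames (6.1 s at 10 FPS).
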
@@ -553,6 +553,7 @@
             "Cosmos2TextToImagePipeline",
             "Cosmos2VideoToWorldPipeline",
             "Cosmos3OmniPipeline",
+            "CosmosActionCondition",
             "CosmosTextToWorldPipeline",
             "CosmosVideoToWorldPipeline",
             "CycleDiffusionPipeline",
@@ -1373,6 +1374,7 @@
             Cosmos2TextToImagePipeline,
             Cosmos2VideoToWorldPipeline,
             Cosmos3OmniPipeline,
+            CosmosActionCondition,
             CosmosTextToWorldPipeline,
             CosmosVideoToWorldPipeline,
             CycleDiffusionPipeline,