feat(cosmos3): add Cosmos3 Super Omni inference tasks by gushiqiao · Pull Request #1196 · ModelTC/LightX2V

gushiqiao · 2026-06-30T04:29:10Z

Summary

Add end-to-end LightX2V inference support for Cosmos3 Super / Cosmos3 Super Omni tasks, with configs and scripts aligned by task name.

Supported Tasks

t2i / cosmos3_super_t2i
- Text-to-image generation from a text prompt.
t2v / cosmos3_super_omni_t2v
- Text-to-video generation from a text prompt.
i2v / cosmos3_super_i2v, cosmos3_super_omni_i2v
- Image-to-video generation conditioned on an input first frame plus prompt.
t2av / cosmos3_super_omni_t2av
- Text-to-audio-video generation from a prompt, producing video with generated audio.
i2av / cosmos3_super_omni_i2av
- Image-to-audio-video generation from a first frame and prompt, producing video with generated audio.
i2va forward dynamics / cosmos3_super_omni_action_fd_agibotworld
- Action-conditioned video rollout from an initial observation image and a provided robot action chunk.
i2va multi-chunk forward dynamics / cosmos3_super_omni_action_fd_agibotworld_multichunk
- Autoregressive multi-segment action rollout: each generated segment feeds its last frame into the next segment, using subsequent action chunks.
v2av inverse dynamics / cosmos3_super_omni_action_id_av
- Inverse dynamics for the autonomous-driving domain: condition on an observed video and predict the corresponding action sequence, saving action output to JSON.
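
The multi-chunk forward-dynamics rollout listed above can be sketched roughly as follows. This is an illustration only, not the PR's runner code: `generate_segment` is a hypothetical stand-in for one action-conditioned generation call, frames are plain strings, and the sketch assumes a generated segment does not repeat its conditioning frame.

```python
from typing import Callable, List, Sequence

Frame = str  # placeholder type (assumption); real frames would be image tensors


def rollout_multichunk(
    first_frame: Frame,
    action_chunks: Sequence[Sequence[float]],
    generate_segment: Callable[[Frame, Sequence[float]], List[Frame]],
) -> List[Frame]:
    """Autoregressively roll out one video segment per action chunk."""
    video: List[Frame] = []
    condition = first_frame
    for chunk in action_chunks:
        segment = generate_segment(condition, chunk)
        if not segment:
            raise ValueError("generate_segment returned no frames")
        video.extend(segment)
        # The last generated frame becomes the condition for the next segment.
        condition = segment[-1]
    return video


def fake_generate_segment(condition: Frame, chunk: Sequence[float]) -> List[Frame]:
    """Hypothetical generator: emits one labelled frame per action in the chunk."""
    return [f"{condition}>a{action}" for action in chunk]
```

Calling `rollout_multichunk("f0", [[1, 2], [3]], fake_generate_segment)` shows the hand-off: the second segment is conditioned on the last frame of the first.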

Implementation Notes

Reuses the Cosmos3 runner/model path instead of importing diffusers model code.
Adds action conditioning support for forward dynamics, inverse dynamics, and multi-chunk rollout.
Adds audio decoding/muxing support for Omni audio-video tasks.
Aligns Cosmos3 config filenames with the script names under scripts/cosmos3.
Keeps per-task configs and scripts consistent with existing LightX2V style.

Validation

Verified Cosmos3 config JSON files parse successfully.
Verified Cosmos3 shell scripts reference existing matching config files.
Verified updated Cosmos3 runner Python syntax compiles successfully.
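
Checks of this kind could be scripted along the following lines. This is a hedged sketch, not the validation actually run for this PR: the function names are hypothetical, and it assumes scripts refer to configs by a path relative to the repository root (possibly behind a shell variable such as `${some_var}/`).

```python
import json
import re
from pathlib import Path
from typing import List

# Assumption: config references in scripts look like path-ish tokens ending in .json.
CONFIG_REF = re.compile(r"[\w./-]+\.json")


def check_configs(config_dir: Path) -> List[str]:
    """Return the config files under config_dir that fail to parse as JSON."""
    broken = []
    for path in sorted(config_dir.rglob("*.json")):
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            broken.append(str(path))
    return broken


def check_script_refs(script_dir: Path, repo_root: Path) -> List[str]:
    """Return '<script>: <config>' entries for referenced configs that do not exist."""
    missing = []
    for script in sorted(script_dir.glob("*.sh")):
        text = script.read_text(encoding="utf-8")
        for ref in CONFIG_REF.findall(text):
            # Drop a leading slash left over from a '${var}/' prefix.
            relative = ref.lstrip("/")
            if not (repo_root / relative).is_file():
                missing.append(f"{script.name}: {relative}")
    return missing


def check_python_syntax(path: Path) -> bool:
    """Return True if the file compiles; nothing is executed or written to disk."""
    try:
        compile(path.read_text(encoding="utf-8"), str(path), "exec")
        return True
    except SyntaxError:
        return False
```

An empty list from both `check_configs` and `check_script_refs` corresponds to the first two validation items; `check_python_syntax` corresponds to the third.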

gemini-code-assist

Code Review

This pull request introduces support for Cosmos3 models, enabling multi-modal generation capabilities including video, sound, and action outputs. Key changes include the addition of a sound tokenizer, updates to the inference and post-inference modules to handle sound and action segments, and runner/scheduler support for multi-chunk action rollouts and audio muxing. The review feedback highlights several improvement opportunities: optimizing video loading by breaking early when only the first frame is needed, adding UTF-8 encoding when reading prompt files, wrapping the ffmpeg subprocess in a try-except block to prevent pipeline crashes, and safely defaulting action_domain_id to prevent a TypeError when converting to a tensor.

gemini-code-assist · 2026-06-30T04:31:07Z

+        try:
+            for frame in reader:
+                frames.append(self._frame_array_to_tensor(np.asarray(frame), height, width))
+                if len(frames) >= num_frames:
+                    break


When keep_first is True, we only need the first frame of the video. However, the current loop continues decoding up to num_frames frames from the reader before discarding them. Breaking early when keep_first is True significantly improves performance by avoiding redundant video decoding.

Suggested change

-        try:
-            for frame in reader:
-                frames.append(self._frame_array_to_tensor(np.asarray(frame), height, width))
-                if len(frames) >= num_frames:
-                    break
+        try:
+            for frame in reader:
+                frames.append(self._frame_array_to_tensor(np.asarray(frame), height, width))
+                if keep_first or len(frames) >= num_frames:
+                    break
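
The effect of the early exit can be illustrated with a small self-contained sketch. `read_frames` and `CountingReader` are hypothetical names, and integers stand in for decoded frames; the actual loader in the PR converts each frame to a tensor.

```python
from typing import Iterable, Iterator, List


def read_frames(reader: Iterable[int], num_frames: int, keep_first: bool) -> List[int]:
    """Collect frames, stopping as soon as no more are needed."""
    frames: List[int] = []
    for frame in reader:
        frames.append(frame)
        # With keep_first set, one frame is enough, so stop decoding immediately.
        if keep_first or len(frames) >= num_frames:
            break
    return frames


class CountingReader:
    """Hypothetical reader that records how many frames were decoded."""

    def __init__(self, total: int) -> None:
        self.total = total
        self.decoded = 0

    def __iter__(self) -> Iterator[int]:
        for index in range(self.total):
            self.decoded += 1  # stands in for an expensive decode step
            yield index
```

Reading from `CountingReader(100)` with `keep_first=True` leaves its `decoded` counter at 1, whereas `keep_first=False` with `num_frames=16` decodes 16 frames.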

gemini-code-assist · 2026-06-30T04:31:07Z

+
+        action_domain_ids = None
+        if action_latents is not None:
+            action_domain_id = torch.as_tensor(action_domain_id, device=device, dtype=torch.long).reshape(1)


If action_domain_id is None, calling torch.as_tensor(None) will raise a TypeError. Defaulting it to 0 (representing no_action) prevents this potential crash.

Suggested change

-            action_domain_id = torch.as_tensor(action_domain_id, device=device, dtype=torch.long).reshape(1)
+            action_domain_id = 0 if action_domain_id is None else action_domain_id
+            action_domain_id = torch.as_tensor(action_domain_id, device=device, dtype=torch.long).reshape(1)

gemini-code-assist · 2026-06-30T04:31:07Z

+        if text.endswith(".json"):
+            with open(text, "r") as f:
+                return json.dumps(json.load(f))
+        with open(text, "r") as f:
+            return f.read().strip()


When opening files for reading prompt text, it is highly recommended to specify encoding="utf-8" to prevent potential UnicodeDecodeError on systems where the default encoding is not UTF-8 (e.g., Windows).

Suggested change

-        if text.endswith(".json"):
-            with open(text, "r") as f:
-                return json.dumps(json.load(f))
-        with open(text, "r") as f:
-            return f.read().strip()
+        if text.endswith(".json"):
+            with open(text, "r", encoding="utf-8") as f:
+                return json.dumps(json.load(f))
+        with open(text, "r", encoding="utf-8") as f:
+            return f.read().strip()
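
A standalone version of the same idea, for illustration only: `load_prompt` is a hypothetical helper that mirrors the snippet's behaviour (JSON files are re-serialized to a string, other files are read and stripped) with the encoding fixed to UTF-8.

```python
import json


def load_prompt(path: str) -> str:
    """Read a prompt file as UTF-8, independent of the platform's default encoding."""
    with open(path, "r", encoding="utf-8") as f:
        if path.endswith(".json"):
            # json.dumps escapes non-ASCII characters by default (ensure_ascii=True).
            return json.dumps(json.load(f))
        return f.read().strip()
```

Without the explicit `encoding`, `open` falls back to a locale-dependent default, so a prompt containing non-ASCII text may fail to decode or decode incorrectly on some systems.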

gemini-code-assist · 2026-06-30T04:31:08Z

+            result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
+            if result.returncode != 0:
+                stderr = result.stderr.decode(errors="ignore") if result.stderr else "Unknown error"
+                logger.warning(f"Cosmos3 generated audio mux failed, keep silent video. Error: {stderr}")
+                return
+            os.replace(tmp_video_path, video_path)


If ffmpeg is not installed or fails to execute, subprocess.run can raise a FileNotFoundError or other OS exceptions. Wrapping the execution in a try-except block ensures that the entire inference pipeline does not crash at the very end, allowing the silent video to be preserved as intended.

Suggested change

-            result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
-            if result.returncode != 0:
-                stderr = result.stderr.decode(errors="ignore") if result.stderr else "Unknown error"
-                logger.warning(f"Cosmos3 generated audio mux failed, keep silent video. Error: {stderr}")
-                return
-            os.replace(tmp_video_path, video_path)
+            try:
+                result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
+                if result.returncode != 0:
+                    stderr = result.stderr.decode(errors="ignore") if result.stderr else "Unknown error"
+                    logger.warning(f"Cosmos3 generated audio mux failed, keep silent video. Error: {stderr}")
+                    return
+                os.replace(tmp_video_path, video_path)
+            except Exception as e:
+                logger.warning(f"Cosmos3 generated audio mux failed with exception, keep silent video. Error: {e}")
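
A variant of this guard as a self-contained sketch. `run_mux` is a hypothetical helper, not the PR's function, and it deliberately catches `OSError` (the base class of `FileNotFoundError`) around the process launch instead of the broad `Exception` used in the suggestion; an `os.replace` failure would still propagate here.

```python
import logging
import os
import subprocess
from typing import Sequence

logger = logging.getLogger(__name__)


def run_mux(cmd: Sequence[str], tmp_video_path: str, video_path: str) -> bool:
    """Run a mux command; return True only if the muxed file replaced the video."""
    try:
        result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    except OSError as e:
        # Raised when the binary is missing or cannot be executed.
        logger.warning("mux command could not be started, keeping silent video: %s", e)
        return False
    if result.returncode != 0:
        stderr = result.stderr.decode(errors="ignore") if result.stderr else "Unknown error"
        logger.warning("mux command failed, keeping silent video: %s", stderr)
        return False
    # Only overwrite the silent video once the command has succeeded.
    os.replace(tmp_video_path, video_path)
    return True
```

Returning a boolean rather than `None` lets a caller decide whether to clean up the temporary file or report that the output has no audio.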

feat(cosmos3): add Cosmos3 Super Omni inference tasks

8029c3c

gemini-code-assist Bot reviewed Jun 30, 2026

llmc-reviewer approved these changes Jul 1, 2026

llmc-reviewer merged commit 2cbe1f2 into main Jul 1, 2026
2 checks passed

llmc-reviewer deleted the gsq/dev-cosmos-3-super-omni branch July 1, 2026 04:41
