feat(cosmos3): add Cosmos3 Super Omni inference tasks #1196
Code Review
This pull request introduces support for Cosmos3 models, enabling multi-modal generation capabilities including video, sound, and action outputs. Key changes include the addition of a sound tokenizer, updates to the inference and post-inference modules to handle sound and action segments, and runner/scheduler support for multi-chunk action rollouts and audio muxing. The review feedback highlights several improvement opportunities: optimizing video loading by breaking early when only the first frame is needed, adding UTF-8 encoding when reading prompt files, wrapping the ffmpeg subprocess in a try-except block to prevent pipeline crashes, and safely defaulting action_domain_id to prevent a TypeError when converting to a tensor.
    try:
        for frame in reader:
            frames.append(self._frame_array_to_tensor(np.asarray(frame), height, width))
            if len(frames) >= num_frames:
                break
When keep_first is True, we only need the first frame of the video. However, the current loop continues decoding up to num_frames frames from the reader before discarding them. Breaking early when keep_first is True significantly improves performance by avoiding redundant video decoding.
Suggested change:

    try:
        for frame in reader:
            frames.append(self._frame_array_to_tensor(np.asarray(frame), height, width))
            if keep_first or len(frames) >= num_frames:
                break
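The effect of the early exit can be checked with a minimal sketch. It assumes only that the reader is a lazy iterable of frames; `collect_frames` and `counting_reader` are hypothetical stand-ins for illustration, not functions from the PR, and the per-frame tensor conversion is omitted.

```python
from typing import Iterable, Iterator, List


def collect_frames(reader: Iterable[int], num_frames: int, keep_first: bool) -> List[int]:
    """Collect frames, stopping after the first one when keep_first is set."""
    frames: List[int] = []
    for frame in reader:
        frames.append(frame)
        # Same condition as the suggested change: stop as soon as we have enough.
        if keep_first or len(frames) >= num_frames:
            break
    return frames


def counting_reader(total: int, decoded: List[int]) -> Iterator[int]:
    """Hypothetical lazy reader that records every frame it 'decodes'."""
    for index in range(total):
        decoded.append(index)  # stands in for the cost of decoding one frame
        yield index


decoded_first: List[int] = []
first_only = collect_frames(counting_reader(100, decoded_first), num_frames=16, keep_first=True)

decoded_all: List[int] = []
full_clip = collect_frames(counting_reader(100, decoded_all), num_frames=16, keep_first=False)
```

With `keep_first=True` the stand-in reader decodes a single frame instead of 16, which is the saving the comment describes.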
    action_domain_ids = None
    if action_latents is not None:
        action_domain_id = torch.as_tensor(action_domain_id, device=device, dtype=torch.long).reshape(1)
If action_domain_id is None, calling torch.as_tensor(None) will raise a TypeError. Defaulting it to 0 (representing no_action) prevents this potential crash.
Suggested change:

    action_domain_id = 0 if action_domain_id is None else action_domain_id
    action_domain_id = torch.as_tensor(action_domain_id, device=device, dtype=torch.long).reshape(1)
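The defaulting step can be isolated into a small helper and tested without a tensor library. This is a sketch only: `resolve_action_domain_id` and `NO_ACTION_DOMAIN_ID` are hypothetical names, and the mapping of 0 to `no_action` is taken from the review comment rather than verified against the model code.

```python
from typing import Optional

# Assumption from the review comment: domain id 0 represents "no_action".
NO_ACTION_DOMAIN_ID = 0


def resolve_action_domain_id(action_domain_id: Optional[int],
                             default: int = NO_ACTION_DOMAIN_ID) -> int:
    """Return a concrete integer id, substituting the default only for None."""
    # `is None` is deliberate: an explicit id of 0 must pass through unchanged,
    # and the intent stays clear if the default ever becomes non-zero.
    return default if action_domain_id is None else int(action_domain_id)
```

The returned integer can then be passed to the tensor conversion shown in the reviewed line.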
    if text.endswith(".json"):
        with open(text, "r") as f:
            return json.dumps(json.load(f))
    with open(text, "r") as f:
        return f.read().strip()
When opening files for reading prompt text, it is highly recommended to specify encoding="utf-8" to prevent potential UnicodeDecodeError on systems where the default encoding is not UTF-8 (e.g., Windows).
Suggested change:

    if text.endswith(".json"):
        with open(text, "r", encoding="utf-8") as f:
            return json.dumps(json.load(f))
    with open(text, "r", encoding="utf-8") as f:
        return f.read().strip()
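A self-contained version of the explicit-encoding read might look like the sketch below. `read_prompt_file` is a hypothetical name, and the surrounding logic that decides whether `text` is a path at all is not shown in the review, so it is left out here.

```python
import json


def read_prompt_file(path: str) -> str:
    """Read a prompt from disk, decoding as UTF-8 regardless of the OS default."""
    if path.endswith(".json"):
        with open(path, "r", encoding="utf-8") as f:
            # Re-serialize so the caller always receives a JSON string.
            return json.dumps(json.load(f))
    with open(path, "r", encoding="utf-8") as f:
        return f.read().strip()
```

Note that `json.dumps` escapes non-ASCII characters by default, so a JSON prompt round-trips through `json.loads` but is not byte-identical to the file on disk.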
    result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if result.returncode != 0:
        stderr = result.stderr.decode(errors="ignore") if result.stderr else "Unknown error"
        logger.warning(f"Cosmos3 generated audio mux failed, keep silent video. Error: {stderr}")
        return
    os.replace(tmp_video_path, video_path)
If ffmpeg is not installed or fails to execute, subprocess.run can raise a FileNotFoundError or other OS exceptions. Wrapping the execution in a try-except block ensures that the entire inference pipeline does not crash at the very end, allowing the silent video to be preserved as intended.
Suggested change:

    try:
        result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        if result.returncode != 0:
            stderr = result.stderr.decode(errors="ignore") if result.stderr else "Unknown error"
            logger.warning(f"Cosmos3 generated audio mux failed, keep silent video. Error: {stderr}")
            return
        os.replace(tmp_video_path, video_path)
    except Exception as e:
        logger.warning(f"Cosmos3 generated audio mux failed with exception, keep silent video. Error: {e}")
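The guarded mux can be exercised without ffmpeg by passing any command. The sketch below is an assumption-laden variant, not the PR's code: `mux_or_keep_silent` is a hypothetical name, it returns a boolean for testability, and it catches `OSError` (which covers `FileNotFoundError` for a missing binary) instead of the broader `Exception` used in the suggestion.

```python
import logging
import os
import subprocess
from typing import Sequence

logger = logging.getLogger(__name__)


def mux_or_keep_silent(cmd: Sequence[str], tmp_video_path: str, video_path: str) -> bool:
    """Run the mux command; replace the video only on success, never raise on failure."""
    try:
        result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        if result.returncode != 0:
            stderr = result.stderr.decode(errors="ignore") if result.stderr else "Unknown error"
            logger.warning("Audio mux failed, keeping silent video. Error: %s", stderr)
            return False
        # Swap in the muxed file only after the command reported success.
        os.replace(tmp_video_path, video_path)
        return True
    except OSError as e:
        # Raised when the executable is missing or the replace itself fails.
        logger.warning("Audio mux raised an exception, keeping silent video. Error: %s", e)
        return False
```

Catching a narrower exception type keeps unrelated programming errors visible, at the cost of not covering every failure the broader `except Exception` would absorb.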
Summary
Add end-to-end LightX2V inference support for Cosmos3 Super / Cosmos3 Super Omni tasks, with configs and scripts aligned by task name.
Supported Tasks
- t2i / cosmos3_super_t2i
- t2v / cosmos3_super_omni_t2v
- i2v / cosmos3_super_i2v, cosmos3_super_omni_i2v
- t2av / cosmos3_super_omni_t2av
- i2av / cosmos3_super_omni_i2av
- i2va forward dynamics / cosmos3_super_omni_action_fd_agibotworld
- i2va multi-chunk forward dynamics / cosmos3_super_omni_action_fd_agibotworld_multichunk
- v2av inverse dynamics / cosmos3_super_omni_action_id_av
Implementation Notes
scripts/cosmos3.
Validation