
Cogvideox #13402

Merged

comfyanonymous merged 15 commits into master from cogvideox on Apr 29, 2026

Conversation

@Talmaj (Contributor) commented Apr 14, 2026

This is a prerequisite PR for CORE-38, Netflix's VOID model. The code was written by @kijai and rebased by me; I've tested it only with the VOID model, where it works well.

This PR is opened directly within the main repo and supersedes the initial one, #13351, which I'm closing.

@Talmaj Talmaj mentioned this pull request Apr 14, 2026
@coderabbitai Bot commented Apr 14, 2026

📝 Walkthrough

Adds full CogVideoX support:

  • a new latent format (latent_formats.CogVideoX);
  • a 3D transformer UNet (comfy/ldm/cogvideo/model.py) with patch embedding, rotary/positional embeddings, and multi-block transformer layers;
  • a causal 3D VAE (comfy/ldm/cogvideo/vae.py) with encoder/decoder and rolling/causal decode paths;
  • a new sampling mode V_PREDICTION_DDPM and a corresponding ModelType entry;
  • model-detection heuristics for CogVideoX;
  • a VAE backend branch and registry entries for the two supported models (CogVideoX_T2V, CogVideoX_I2V);
  • a CogVideoX T5 tokenizer subclass.
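For orientation, a minimal sketch of the shape such a latent-format entry typically takes in ComfyUI (the base class here is a local stub, and the channel count and scale factor are placeholders for illustration, not the real CogVideoX constants from comfy/latent_formats.py):

```python
class LatentFormat:
    # Local stand-in for comfy.latent_formats.LatentFormat.
    latent_channels = 4
    scale_factor = 1.0

    def process_in(self, latent):
        # Scale VAE-encoded latents into the model's working range.
        return latent * self.scale_factor

    def process_out(self, latent):
        # Undo the scaling before handing latents back to the VAE.
        return latent / self.scale_factor


class CogVideoX(LatentFormat):
    latent_channels = 16   # placeholder value
    scale_factor = 0.5     # placeholder value
```

The round trip process_out(process_in(x)) is the identity, which is the invariant any real scale factor must preserve.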

🚥 Pre-merge checks: 3 passed, 2 flagged

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 7.95%, which is insufficient; the required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title 'Cogvideox' is vague and generic, using only a model name without describing what is actually being added or changed in the pull request. | Consider a more descriptive title like 'Add CogVideoX model support' or 'Implement CogVideoX latent format, VAE, and transformer' to clarify the scope of changes. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description check | ✅ Passed | The description is related to the changeset by identifying it as a prerequisite for the VOID model and mentioning testing with that model, providing adequate context. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
comfy/ldm/cogvideo/vae_backup.py (1)

1-485: Avoid shipping a second, unused CogVideoX VAE implementation.

comfy/sd.py only wires up comfy.ldm.cogvideo.vae.AutoencoderKLCogVideoX, so this backup copy will drift as soon as the primary path gets fixes. If you want to keep it around, it should at least live behind an explicit dev-only flag or out of the runtime package surface.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/ldm/cogvideo/vae_backup.py` around lines 1 - 485, This file contains an
extra, unused CogVideoX VAE implementation (class AutoencoderKLCogVideoX and
related Encoder3D/Decoder3D, etc.) that duplicates the canonical
comfy.ldm.cogvideo.vae module wired by comfy/sd.py; remove this backup from the
runtime package surface (delete the file) or move it into a dev-only location
(e.g., tests/dev or gated behind an explicit runtime check/ENV flag) so it won't
drift from the primary implementation; ensure no imports reference
comfy.ldm.cogvideo.vae_backup.AutoencoderKLCogVideoX and that comfy/sd.py
continues to import the canonical comfy.ldm.cogvideo.vae.AutoencoderKLCogVideoX.
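As a concrete illustration of the "explicit dev-only flag" option, one minimal pattern is a guard at the top of the backup module (the flag name COMFY_DEV_COGVIDEO_VAE_BACKUP is hypothetical, not anything the repo defines):

```python
import os

def require_dev_flag(flag="COMFY_DEV_COGVIDEO_VAE_BACKUP"):
    """Raise unless an explicit opt-in env var is set.

    Calling require_dev_flag() at the top of vae_backup.py keeps the file
    out of the normal runtime import surface while leaving it usable for
    development and comparison against the primary VAE implementation.
    """
    if not os.environ.get(flag):
        raise ImportError(f"dev-only module; set {flag}=1 to import it")
```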
comfy/supported_models.py (1)

1813-1818: Nested tokenizer class is functional but unconventional.

The CogVideoXT5Tokenizer nested class inside clip_target() works correctly, but defining a class inside a method is unusual in this codebase. Consider moving it to module level for better discoverability and potential reuse.

That said, this pattern has no functional issues and the min_length=226 correctly matches the transformer's max_text_seq_length parameter (per context snippet 1).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/supported_models.py` around lines 1813 - 1818, Nested class
CogVideoXT5Tokenizer inside clip_target is unconventional; move
CogVideoXT5Tokenizer to module level (define it alongside other tokenizers)
keeping the same initializer signature (embedding_directory=None,
tokenizer_data={}, min_length=226) and base class
comfy.text_encoders.sd3_clip.T5XXLTokenizer, then update the clip_target method
to return supported_models_base.ClipTarget(CogVideoXT5Tokenizer,
comfy.text_encoders.sd3_clip.T5XXLModel) so the method simply references the
module-level class for discoverability and reuse.
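A sketch of the suggested module-level refactor (the base class here is a local stub standing in for comfy.text_encoders.sd3_clip.T5XXLTokenizer; only the min_length=226 wiring is taken from the review):

```python
class T5XXLTokenizer:
    # Local stub for comfy.text_encoders.sd3_clip.T5XXLTokenizer.
    def __init__(self, embedding_directory=None, tokenizer_data={}, min_length=77):
        self.min_length = min_length


class CogVideoXT5Tokenizer(T5XXLTokenizer):
    # Module-level rather than nested inside clip_target(), so the class is
    # discoverable and reusable; clip_target() would just reference it.
    def __init__(self, embedding_directory=None, tokenizer_data={}):
        super().__init__(embedding_directory=embedding_directory,
                         tokenizer_data=tokenizer_data,
                         min_length=226)  # matches max_text_seq_length
```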
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@comfy/ldm/cogvideo/vae.py`:
- Around line 519-539: The current precompute builds full [B,C,T,H,W] entries in
z_at_res by calling _interpolate_zq(z, target) for every Phase 2 resolution,
which OOMs; instead, expand z along time once (to t_expanded) outside the
resolution loop and then perform spatial interpolation from that
temporally-expanded tensor inside the chunk/decode loop so you only create
per-chunk spatially-interpolated tensors when needed. Concretely: keep a single
temporally-expanded_z (use t_expanded) and remove the eager per-resolution calls
to _interpolate_zq in the remaining_blocks/decoder.up_blocks loop; when
processing each chunk/resolution, call _interpolate_zq on the
temporally-expanded slice for that chunk/target and cache only the
spatially-interpolated result in z_at_res keyed by (h,w) to avoid allocating
full clip-sized [B,C,T,H,W] per resolution.
- Around line 453-465: The encode method currently folds the remainder frames
into the first chunk causing batches larger than num_sample_frames_batch_size;
change the batching so chunks are capped to self.num_sample_frames_batch_size by
computing num_batches = ceil(t / frame_batch) and for each i set start = i *
frame_batch and end = min((i + 1) * frame_batch, t), then call self.encoder(x[:,
:, start:end], conv_cache=conv_cache) appending results and carrying forward
conv_cache; this ensures each chunk size <= num_sample_frames_batch_size and
prevents oversized VRAM spikes while preserving conv_cache behavior in encode.
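The first prompt's restructuring boils down to a lazy per-resolution cache: interpolate from the single temporally-expanded tensor only on first use of each (h, w) target. A minimal stand-in for that pattern (the interpolate callable mirrors _interpolate_zq; names are illustrative):

```python
def make_lazy_interp_cache(interpolate):
    """Cache spatial interpolations keyed by (h, w), computed on first use.

    `interpolate` stands in for _interpolate_zq; `z_expanded` stands in for
    the one temporally-expanded tensor built outside the resolution loop,
    so no full-clip tensor is allocated eagerly per resolution.
    """
    cache = {}

    def get(z_expanded, h, w):
        if (h, w) not in cache:
            cache[(h, w)] = interpolate(z_expanded, (h, w))
        return cache[(h, w)]

    return get
```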
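The second prompt's batching arithmetic, as a standalone helper (frame_chunks is a hypothetical name; the real code would index x[:, :, start:end] inside encode()):

```python
import math

def frame_chunks(t, frame_batch):
    """Yield (start, end) index pairs so no chunk exceeds frame_batch frames."""
    num_batches = math.ceil(t / frame_batch)
    for i in range(num_batches):
        start = i * frame_batch
        end = min((i + 1) * frame_batch, t)
        yield start, end
```

For example, 49 frames with frame_batch=16 yields chunks of 16, 16, 16 and 1 frames, instead of the oversized 17-frame first chunk that folding the remainder forward would produce.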


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2df899fb-7a4b-40ce-ae3b-2acdcbf1ff24

📥 Commits

Reviewing files that changed from the base of the PR and between fed4ac0 and 541f26a.

📒 Files selected for processing (11)
  • comfy/latent_formats.py
  • comfy/ldm/cogvideo/__init__.py
  • comfy/ldm/cogvideo/model.py
  • comfy/ldm/cogvideo/vae.py
  • comfy/ldm/cogvideo/vae_backup.py
  • comfy/model_base.py
  • comfy/model_detection.py
  • comfy/model_sampling.py
  • comfy/sd.py
  • comfy/supported_models.py
  • nodes.py

Comment thread comfy/ldm/cogvideo/vae.py
Comment thread comfy/ldm/cogvideo/vae.py (outdated)

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@comfy/ldm/cogvideo/vae.py`:
- Around line 50-60: The fast-path in encode() returns None for conv_cache when
x.shape[2] == 1, which drops causal context; compute and return the same cache
that the non-fast path would produce instead of None. After computing w2d and
out, build the next conv_cache equivalent to what the regular 3D path produces
(i.e., the trailing temporal context frames expected by self.conv.kernel_size
and self.conv.dilation) by replicating the single input frame to the needed
temporal length and running the same conv logic (using w2d/self.conv and
F.conv3d or by assembling the expected cache tensor), then return (out,
conv_cache) so subsequent chunks preserve causal state. Ensure you use the
existing symbols conv_cache, self.conv, w2d, and F.conv3d and manage GPU memory
like the rest of comfy/** code.

In `@comfy/model_sampling.py`:
- Around line 62-76: The sampler is inconsistent: calculate_denoised() expects
DDPM x_t but calculate_input() returns raw noise while noise_scaling() currently
adds latent_image, over-scaling the model input; fix by making noise_scaling()
produce pure sigma-space noise samples (i.e., return noise * sigma, or noise *
sqrt(1.0 + sigma**2) when max_denoise=True) and do NOT add latent_image there so
calculate_input(self, sigma, noise) can remain returning the sigma-space sample;
update references to reshape_sigma as needed and ensure
calculate_denoised(model_input, ...) consumes that sigma-space x_t consistently.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d11ebf2e-ca2d-4490-bd07-83135a1955e2

📥 Commits

Reviewing files that changed from the base of the PR and between 52156ed and 0399bd3.

📒 Files selected for processing (11)
  • comfy/latent_formats.py
  • comfy/ldm/cogvideo/__init__.py
  • comfy/ldm/cogvideo/model.py
  • comfy/ldm/cogvideo/vae.py
  • comfy/model_base.py
  • comfy/model_detection.py
  • comfy/model_sampling.py
  • comfy/sd.py
  • comfy/supported_models.py
  • comfy/text_encoders/cogvideo.py
  • nodes.py
✅ Files skipped from review due to trivial changes (1)
  • nodes.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • comfy/text_encoders/cogvideo.py
  • comfy/model_detection.py
  • comfy/ldm/cogvideo/model.py

Comment thread comfy/ldm/cogvideo/vae.py
Comment on lines +50 to +60
```python
if conv_cache is None and x.shape[2] == 1:
    # Fast path: single frame, no cache. All temporal padding
    # frames are copies of the input (replicate-style), so the
    # 3D conv reduces to a 2D conv with summed temporal kernel.
    w = comfy.ops.cast_to_input(self.conv.weight, x)
    b = comfy.ops.cast_to_input(self.conv.bias, x) if self.conv.bias is not None else None
    w2d = w.sum(dim=2, keepdim=True)
    out = F.conv3d(x, w2d, b,
                   self.conv.stride, self.conv.padding,
                   self.conv.dilation, self.conv.groups)
    return out, None
```

⚠️ Potential issue | 🟠 Major

Preserve the causal cache in the single-frame fast path.

When encode() starts with a one-frame remainder chunk, this path returns None, so the next chunk loses frame-0 causal context and treats its first frame as a fresh sequence boundary. Return the same cache the non-fast path would have produced.

🐛 Proposed fix

```diff
                 if conv_cache is None and x.shape[2] == 1:
                     # Fast path: single frame, no cache. All temporal padding
                     # frames are copies of the input (replicate-style), so the
                     # 3D conv reduces to a 2D conv with summed temporal kernel.
                     w = comfy.ops.cast_to_input(self.conv.weight, x)
                     b = comfy.ops.cast_to_input(self.conv.bias, x) if self.conv.bias is not None else None
                     w2d = w.sum(dim=2, keepdim=True)
                     out = F.conv3d(x, w2d, b,
                                    self.conv.stride, self.conv.padding,
                                    self.conv.dilation, self.conv.groups)
-                    return out, None
+                    conv_cache = x[:, :, :1].repeat(1, 1, kernel_t - 1, 1, 1).clone()
+                    return out, conv_cache
```

As per coding guidelines, comfy/** changes should focus on memory management and GPU resource handling.

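Independent of the cache question, the fast path's algebra is easy to sanity-check in one dimension: with replicate padding, convolving a single repeated frame against a temporal kernel equals scaling the frame by the summed kernel (toy numbers, unrelated to the real weights):

```python
frame = 4.0
kernel = [0.25, 0.5, 0.25]      # toy temporal kernel taps

# Replicate-style temporal padding: every padded frame equals the input frame.
padded = [frame] * len(kernel)
conv_out = sum(k * f for k, f in zip(kernel, padded))

# The summed-kernel shortcut the fast path uses (w.sum(dim=2) in the 3D case).
summed_out = sum(kernel) * frame

print(conv_out, summed_out)     # 4.0 4.0
```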

Comment thread comfy/model_sampling.py
Comment on lines +62 to +76
```python
def calculate_input(self, sigma, noise):
    return noise

def calculate_denoised(self, sigma, model_output, model_input):
    sigma = reshape_sigma(sigma, model_output.ndim)
    return model_input / (sigma ** 2 + 1.0) ** 0.5 - model_output * sigma / (sigma ** 2 + 1.0) ** 0.5

def noise_scaling(self, sigma, noise, latent_image, max_denoise=False):
    sigma = reshape_sigma(sigma, noise.ndim)
    if max_denoise:
        noise = noise * torch.sqrt(1.0 + sigma ** 2.0)
    else:
        noise = noise * sigma
    noise += latent_image
    return noise
```

⚠️ Potential issue | 🟠 Major

Keep the DDPM v-prediction sample scale consistent.

calculate_denoised() assumes model_input is DDPM-normalized x_t, but noise_scaling() still builds sigma-space samples (latent + sigma * noise, or sqrt(1 + sigma²) * noise for max denoise) while calculate_input() passes them through unchanged. That over-scales what the CogVideoX model sees and changes the denoised estimate.

Please either normalize the sampler input path for this class, or reuse the standard sigma-space v-prediction formulas.

One possible sigma-space fix:

```diff
 class V_PREDICTION_DDPM:
@@
     def calculate_input(self, sigma, noise):
-        return noise
+        sigma = reshape_sigma(sigma, noise.ndim)
+        return noise / (sigma ** 2 + 1.0) ** 0.5

     def calculate_denoised(self, sigma, model_output, model_input):
         sigma = reshape_sigma(sigma, model_output.ndim)
-        return model_input / (sigma ** 2 + 1.0) ** 0.5 - model_output * sigma / (sigma ** 2 + 1.0) ** 0.5
+        return model_input / (sigma ** 2 + 1.0) - model_output * sigma / (sigma ** 2 + 1.0) ** 0.5
```
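Both self-consistent pairings the comment describes can be checked numerically. The following toy calculation (scalar values, sigma_data = 1, standard v-prediction conventions; not code from the PR) shows that either pairing recovers the clean latent:

```python
import math

x0, eps, sigma = 0.7, -1.3, 2.5          # clean latent, unit noise, noise level
s = math.sqrt(sigma ** 2 + 1.0)

v = (eps - sigma * x0) / s               # v-prediction target for this (x0, eps)
x_t = x0 + sigma * eps                   # sigma-space noised sample

# Pairing 1: DDPM-normalized model input + the PR's calculate_denoised().
x_t_ddpm = x_t / s
denoised_ddpm = x_t_ddpm / s - v * sigma / s

# Pairing 2: sigma-space model input + the diff's adjusted calculate_denoised().
denoised_sigma = x_t / (sigma ** 2 + 1.0) - v * sigma / s

# Both consistent pairings recover x0 (up to float rounding).
assert abs(denoised_ddpm - x0) < 1e-12 and abs(denoised_sigma - x0) < 1e-12
```

The inconsistency flagged by the review is mixing the two: feeding the sigma-space x_t into the DDPM-form calculate_denoised() over-scales the input by a factor of sqrt(sigma² + 1).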

@comfyanonymous comfyanonymous merged commit 5eeae3f into master Apr 29, 2026
21 checks passed
@comfyanonymous comfyanonymous deleted the cogvideox branch April 29, 2026 23:31
