Cogvideox #13402
📝 Walkthrough

Adds full CogVideoX support: a new latent format (…)

🚥 Pre-merge checks: ✅ 3 | ❌ 2
✅ Passed checks (3 passed)
❌ Failed checks (1 warning, 1 inconclusive)
Actionable comments posted: 2
🧹 Nitpick comments (2)
comfy/ldm/cogvideo/vae_backup.py (1)
1-485: Avoid shipping a second, unused CogVideoX VAE implementation.
comfy/sd.py only wires up comfy.ldm.cogvideo.vae.AutoencoderKLCogVideoX, so this backup copy will drift as soon as the primary path gets fixes. If you want to keep it around, it should at least live behind an explicit dev-only flag or out of the runtime package surface.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@comfy/ldm/cogvideo/vae_backup.py` around lines 1-485: this file contains an extra, unused CogVideoX VAE implementation (class AutoencoderKLCogVideoX and related Encoder3D/Decoder3D, etc.) that duplicates the canonical comfy.ldm.cogvideo.vae module wired by comfy/sd.py. Remove this backup from the runtime package surface (delete the file) or move it into a dev-only location (e.g., tests/dev or gated behind an explicit runtime check/ENV flag) so it won't drift from the primary implementation. Ensure no imports reference comfy.ldm.cogvideo.vae_backup.AutoencoderKLCogVideoX and that comfy/sd.py continues to import the canonical comfy.ldm.cogvideo.vae.AutoencoderKLCogVideoX.

comfy/supported_models.py (1)

1813-1818: Nested tokenizer class is functional but unconventional.
The CogVideoXT5Tokenizer nested class inside clip_target() works correctly, but defining a class inside a method is unusual in this codebase. Consider moving it to module level for better discoverability and potential reuse.
That said, this pattern has no functional issues and the min_length=226 correctly matches the transformer's max_text_seq_length parameter (per context snippet 1).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@comfy/supported_models.py` around lines 1813-1818: nested class CogVideoXT5Tokenizer inside clip_target is unconventional; move CogVideoXT5Tokenizer to module level (define it alongside other tokenizers), keeping the same initializer signature (embedding_directory=None, tokenizer_data={}, min_length=226) and base class comfy.text_encoders.sd3_clip.T5XXLTokenizer, then update the clip_target method to return supported_models_base.ClipTarget(CogVideoXT5Tokenizer, comfy.text_encoders.sd3_clip.T5XXLModel) so the method simply references the module-level class for discoverability and reuse.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@comfy/ldm/cogvideo/vae.py`:
- Around line 519-539: The current precompute builds full [B,C,T,H,W] entries in
z_at_res by calling _interpolate_zq(z, target) for every Phase 2 resolution,
which OOMs; instead, expand z along time once (to t_expanded) outside the
resolution loop and then perform spatial interpolation from that
temporally-expanded tensor inside the chunk/decode loop so you only create
per-chunk spatially-interpolated tensors when needed. Concretely: keep a single
temporally expanded z (use t_expanded) and remove the eager per-resolution calls
to _interpolate_zq in the remaining_blocks/decoder.up_blocks loop; when
processing each chunk/resolution, call _interpolate_zq on the
temporally-expanded slice for that chunk/target and cache only the
spatially-interpolated result in z_at_res keyed by (h,w) to avoid allocating
full clip-sized [B,C,T,H,W] per resolution.
- Around line 453-465: The encode method currently folds the remainder frames
into the first chunk causing batches larger than num_sample_frames_batch_size;
change the batching so chunks are capped to self.num_sample_frames_batch_size by
computing num_batches = ceil(t / frame_batch) and for each i set start = i *
frame_batch and end = min((i + 1) * frame_batch, t), then call self.encoder(x[:,
:, start:end], conv_cache=conv_cache) appending results and carrying forward
conv_cache; this ensures each chunk size <= num_sample_frames_batch_size and
prevents oversized VRAM spikes while preserving conv_cache behavior in encode.
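The capped batching described in the second bullet can be sketched as a small range helper. This is a hedged illustration, not the actual encode() code; frame_batch stands in for self.num_sample_frames_batch_size:

```python
import math

def chunk_ranges(t, frame_batch):
    # Frame index ranges capped at frame_batch, mirroring the suggested
    # batching: num_batches = ceil(t / frame_batch), each chunk slices
    # x[:, :, start:end] with end clamped to t.
    num_batches = math.ceil(t / frame_batch)
    return [(i * frame_batch, min((i + 1) * frame_batch, t))
            for i in range(num_batches)]
```

With t = 10 and frame_batch = 4 this yields chunks of sizes 4, 4, and 2, so no chunk exceeds the cap and the remainder lands in the final chunk instead of inflating the first one.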
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2df899fb-7a4b-40ce-ae3b-2acdcbf1ff24
📒 Files selected for processing (11)
- comfy/latent_formats.py
- comfy/ldm/cogvideo/__init__.py
- comfy/ldm/cogvideo/model.py
- comfy/ldm/cogvideo/vae.py
- comfy/ldm/cogvideo/vae_backup.py
- comfy/model_base.py
- comfy/model_detection.py
- comfy/model_sampling.py
- comfy/sd.py
- comfy/supported_models.py
- nodes.py
Actionable comments posted: 2
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d11ebf2e-ca2d-4490-bd07-83135a1955e2
📒 Files selected for processing (11)
- comfy/latent_formats.py
- comfy/ldm/cogvideo/__init__.py
- comfy/ldm/cogvideo/model.py
- comfy/ldm/cogvideo/vae.py
- comfy/model_base.py
- comfy/model_detection.py
- comfy/model_sampling.py
- comfy/sd.py
- comfy/supported_models.py
- comfy/text_encoders/cogvideo.py
- nodes.py
✅ Files skipped from review due to trivial changes (1)
- nodes.py
🚧 Files skipped from review as they are similar to previous changes (3)
- comfy/text_encoders/cogvideo.py
- comfy/model_detection.py
- comfy/ldm/cogvideo/model.py
```python
if conv_cache is None and x.shape[2] == 1:
    # Fast path: single frame, no cache. All temporal padding
    # frames are copies of the input (replicate-style), so the
    # 3D conv reduces to a 2D conv with summed temporal kernel.
    w = comfy.ops.cast_to_input(self.conv.weight, x)
    b = comfy.ops.cast_to_input(self.conv.bias, x) if self.conv.bias is not None else None
    w2d = w.sum(dim=2, keepdim=True)
    out = F.conv3d(x, w2d, b,
                   self.conv.stride, self.conv.padding,
                   self.conv.dilation, self.conv.groups)
    return out, None
```
Preserve the causal cache in the single-frame fast path.
When encode() starts with a one-frame remainder chunk, this path returns None, so the next chunk loses frame-0 causal context and treats its first frame as a fresh sequence boundary. Return the same cache the non-fast path would have produced.
🐛 Proposed fix

```diff
 if conv_cache is None and x.shape[2] == 1:
     # Fast path: single frame, no cache. All temporal padding
     # frames are copies of the input (replicate-style), so the
     # 3D conv reduces to a 2D conv with summed temporal kernel.
     w = comfy.ops.cast_to_input(self.conv.weight, x)
     b = comfy.ops.cast_to_input(self.conv.bias, x) if self.conv.bias is not None else None
     w2d = w.sum(dim=2, keepdim=True)
     out = F.conv3d(x, w2d, b,
                    self.conv.stride, self.conv.padding,
                    self.conv.dilation, self.conv.groups)
-    return out, None
+    conv_cache = x[:, :, :1].repeat(1, 1, kernel_t - 1, 1, 1).clone()
+    return out, conv_cache
```

As per coding guidelines, comfy/** changes should focus on memory management and GPU resource handling.
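The fast path relies on the identity that, with replicate-style temporal padding, a causal conv over a single frame collapses to applying the temporally summed kernel once. A minimal NumPy sketch of that identity, with toy shapes and the spatial part reduced to elementwise weights (not the actual comfy code):

```python
import numpy as np

# Toy model of the fast path: one H x W frame, temporal kernel of size 3;
# replicate padding means every temporal tap sees the same frame.
rng = np.random.default_rng(0)
frame = rng.standard_normal((4, 4))   # single input frame
w_t = rng.standard_normal(3)          # temporal kernel weights

# "3D" path: stack K replicate-padded copies, convolve along time.
padded = np.stack([frame, frame, frame])
out_3d = sum(w_t[k] * padded[k] for k in range(3))

# "2D" path: apply the summed temporal kernel once.
out_2d = w_t.sum() * frame

assert np.allclose(out_3d, out_2d)  # identical, as the fast path assumes
```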
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@comfy/ldm/cogvideo/vae.py` around lines 50 - 60, The fast-path in encode()
returns None for conv_cache when x.shape[2] == 1, which drops causal context;
compute and return the same cache that the non-fast path would produce instead
of None. After computing w2d and out, build the next conv_cache equivalent to
what the regular 3D path produces (i.e., the trailing temporal context frames
expected by self.conv.kernel_size and self.conv.dilation) by replicating the
single input frame to the needed temporal length and running the same conv logic
(using w2d/self.conv and F.conv3d or by assembling the expected cache tensor),
then return (out, conv_cache) so subsequent chunks preserve causal state. Ensure
you use the existing symbols conv_cache, self.conv, w2d, and F.conv3d and manage
GPU memory like the rest of comfy/** code.
```python
def calculate_input(self, sigma, noise):
    return noise

def calculate_denoised(self, sigma, model_output, model_input):
    sigma = reshape_sigma(sigma, model_output.ndim)
    return model_input / (sigma ** 2 + 1.0) ** 0.5 - model_output * sigma / (sigma ** 2 + 1.0) ** 0.5

def noise_scaling(self, sigma, noise, latent_image, max_denoise=False):
    sigma = reshape_sigma(sigma, noise.ndim)
    if max_denoise:
        noise = noise * torch.sqrt(1.0 + sigma ** 2.0)
    else:
        noise = noise * sigma
    noise += latent_image
    return noise
```
Keep the DDPM v-prediction sample scale consistent.
calculate_denoised() assumes model_input is DDPM-normalized x_t, but noise_scaling() still builds sigma-space samples (latent + sigma * noise, or sqrt(1 + sigma²) * noise for max denoise) while calculate_input() passes them through unchanged. That over-scales what the CogVideoX model sees and changes the denoised estimate.
Please either normalize the sampler input path for this class, or reuse the standard sigma-space v-prediction formulas.
One possible sigma-space fix

```diff
 class V_PREDICTION_DDPM:
     @@
     def calculate_input(self, sigma, noise):
-        return noise
+        sigma = reshape_sigma(sigma, noise.ndim)
+        return noise / (sigma ** 2 + 1.0) ** 0.5

     def calculate_denoised(self, sigma, model_output, model_input):
         sigma = reshape_sigma(sigma, model_output.ndim)
-        return model_input / (sigma ** 2 + 1.0) ** 0.5 - model_output * sigma / (sigma ** 2 + 1.0) ** 0.5
+        return model_input / (sigma ** 2 + 1.0) - model_output * sigma / (sigma ** 2 + 1.0) ** 0.5
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@comfy/model_sampling.py` around lines 62 - 76, The sampler is inconsistent:
calculate_denoised() expects DDPM x_t but calculate_input() returns raw noise
while noise_scaling() currently adds latent_image, over-scaling the model input;
fix by making noise_scaling() produce pure sigma-space noise samples (i.e.,
return noise * sigma, or noise * sqrt(1.0 + sigma**2) when max_denoise=True) and
do NOT add latent_image there so calculate_input(self, sigma, noise) can remain
returning the sigma-space sample; update references to reshape_sigma as needed
and ensure calculate_denoised(model_input, ...) consumes that sigma-space x_t
consistently.
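The sigma-space recovery in the suggested calculate_denoised can be sanity-checked numerically with plain NumPy. All values below are illustrative: assuming x_t = x0 + sigma * eps and a v-target of v = (eps - sigma * x0) / sqrt(1 + sigma**2), the suggested formula x_t / (sigma**2 + 1) - v * sigma / sqrt(sigma**2 + 1) recovers x0:

```python
import numpy as np

# Toy check of the sigma-space v-prediction recovery from the suggested fix.
# x0, eps, and sigma are made-up values, not anything taken from the PR.
rng = np.random.default_rng(1)
x0 = rng.standard_normal((2, 3))        # clean latent
eps = rng.standard_normal((2, 3))       # unit-variance noise
sigma = 0.7

x_t = x0 + sigma * eps                                  # sigma-space noised sample
v = (eps - sigma * x0) / np.sqrt(1.0 + sigma ** 2)      # v-prediction target

denoised = x_t / (sigma ** 2 + 1.0) - v * sigma / np.sqrt(sigma ** 2 + 1.0)
assert np.allclose(denoised, x0)  # the formula inverts the noising exactly
```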
This is a prerequisite PR for CORE-38, Netflix's VOID model.
It was written by @kijai and rebased by me. I've tested and used it only with the VOID model, and it works well.
This is a newly opened PR directly within the main repo. I'm closing the initial one: #13351