Add NVFP4 (4-bit floating point) quantization support for Blackwell GPUs#486
Conversation
…own3D, DupUp3D, and Wan2_2_VAE wrapper class
- Add VAEArchitectureConfig for encoder/decoder configuration
- Add VAEEncodingConfig for encoding parameters
- Add VAEModelConfig for complete model configuration
- Implement VAEConfigManager with full CRUD operations
- Support JSON serialization/deserialization
- Include predefined configs for Wan2.1 and Wan2.2
- Add config cloning, updating, saving, and loading
- Support batch import/export operations
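The commit message above describes the shape of the config layer; a minimal sketch of what such a manager could look like follows. The class layout and field names here are illustrative assumptions, not the actual code from this PR.

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class VAEModelConfig:
    # Illustrative fields only; the real VAEModelConfig may differ.
    name: str
    latent_channels: int = 16
    temporal_downsample: int = 4
    extra: dict = field(default_factory=dict)

class VAEConfigManager:
    """Minimal CRUD + JSON (de)serialization sketch."""

    def __init__(self):
        self._configs = {}

    def add(self, cfg: VAEModelConfig):
        self._configs[cfg.name] = cfg

    def get(self, name: str) -> VAEModelConfig:
        return self._configs[name]

    def to_json(self) -> str:
        # Serialize every registered config to one JSON document.
        return json.dumps({n: asdict(c) for n, c in self._configs.items()})

    @classmethod
    def from_json(cls, text: str) -> "VAEConfigManager":
        # Rebuild the manager (and its configs) from the JSON document.
        mgr = cls()
        for data in json.loads(text).values():
            mgr.add(VAEModelConfig(**data))
        return mgr

mgr = VAEConfigManager()
mgr.add(VAEModelConfig(name="Wan2.1"))
restored = VAEConfigManager.from_json(mgr.to_json())
print(restored.get("Wan2.1").latent_channels)  # 16
```

Saving/loading to disk and batch import/export would just wrap `to_json`/`from_json` around file I/O.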
Co-authored-by: naxci1 <206254294+naxci1@users.noreply.github.com>
…it__
Add NVFP4 (4-bit floating point) quantization support for Blackwell GPUs
NVFP4 model download links:

Model: 3b_nvfp4
Model: 3bQ8
…ackwell GPUs
…fication, remove private APIs
Add NVFP4 async offloading and pinned memory for Blackwell GPU optimization
run_nvidia_gpu_fast_fp16_accumulation - firefox.zip

You need to create a startup file like this for it to become active.
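For reference, a ComfyUI portable startup file of this kind is typically a one-line .bat placed next to the stock run_nvidia_gpu.bat. The contents below are an assumption modeled on that stock launcher plus ComfyUI's --fast fp16_accumulation flag, not the exact file from the zip above:

```shell
REM run_nvidia_gpu_fast_fp16_accumulation.bat (assumed contents)
REM Same as the stock run_nvidia_gpu.bat, with the fp16-accumulation fast path enabled.
.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build --fast fp16_accumulation
pause
```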
….h when unavailable
Integrate NVIDIA GPU Optimizations: Async Offloading, Pinned Memory, torch.compile for Windows/Blackwell
…-gpu-memory Revert "Integrate NVIDIA GPU Optimizations: Async Offloading, Pinned Memory, torch.compile for Windows/Blackwell"
…ell support
…ization-again Integrate SpargeAttn/Sage2 block-sparse attention with Blackwell GPU optimizations
Add performance_mode dropdown for Blackwell-specific sparge_sage2 tuning
I recommend building SageAttention 2 and 3 from source, since pip install points to the older SageAttention 1:

```
pip install packaging setuptools ninja
# install sageattention 3
cd ~
# install sageattention 2++ for fallback
cd ~
```
While version 2.5.23 used 15 GB of VRAM, this version uses 14 GB. The processing speed was 1.8 fps in version 2.5.23, but it's 1.91 fps in this version. Although the NVFP4 model doesn't give me exactly the results I want, it's worthwhile to add the other improvements to the main code.
@naxci1 The VAE decoding step is the slowest, so can we replace its fp16 model with NVFP4 as well?
I tried many different methods, but unfortunately it didn't work with the NVFP4 VAE model. SeedVR2's DNA is designed very differently: it doesn't accept other VAE models; it absolutely has to use this particular VAE, which is unfortunately very slow. The main slowdown occurs when decoding the VAE. I hope SeedVR3 comes out soon, with new code and a new VAE model. The code of this old model is very complex.
@naxci1 are you 100% sure you implemented nvfp4 correctly? I'd like to try your fork on an RTX 5090 today, but I'm put off by the fact that you aren't getting any speedup...
https://huggingface.co/Nexus24/vaeGGUF/tree/main?show_file_info=vae_nvfp4_blackwell.safetensors Here you can see the details of the model yourself; theoretically, everything is complete. Claude just couldn't fully optimize it, and it took me several days. The problem is that the SeedVR2 code is very complex: there are unnecessary repetitions, its DNA is very different, and the DiT and VAE are tightly interconnected, unlike in other models. In FlashVSR I can integrate what I want much more easily, but SeedVR2 really needs to be rewritten from scratch. That's why experts in this field need to get involved, and since they don't have the time, it remains unfinished. After all, I'm not a programmer; I'm just doing vibe coding.
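The per-tensor details shown on that HuggingFace file page come straight from the file's JSON header: a .safetensors file begins with an 8-byte little-endian length followed by a JSON table of dtypes, shapes, and data offsets, so a model's layout can be checked without loading the weights. A self-contained sketch (it builds a dummy in-memory file rather than downloading the real one, and the tensor name is made up):

```python
import io
import json
import struct

def read_safetensors_header(buf):
    """Parse the JSON header of a .safetensors stream.

    Layout: u64 little-endian header length, then that many bytes of JSON
    mapping tensor names to {"dtype", "shape", "data_offsets"}.
    """
    (hlen,) = struct.unpack("<Q", buf.read(8))
    return json.loads(buf.read(hlen))

# Build a tiny in-memory file so the example runs without downloading anything.
header = {"blocks.0.weight": {"dtype": "U8", "shape": [16, 8], "data_offsets": [0, 128]}}
hbytes = json.dumps(header).encode()
blob = io.BytesIO(struct.pack("<Q", len(hbytes)) + hbytes + b"\x00" * 128)

info = read_safetensors_header(blob)
print(info["blocks.0.weight"]["dtype"], info["blocks.0.weight"]["shape"])
# U8 [16, 8]
```

Against a real checkpoint, the same function can be pointed at an open file handle; NVFP4 weights would typically show up as packed integer storage plus separate scale tensors.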
@naxci1, I'm testing out the latest commit with the Blackwell optimizations right now (the one without NVFP4, since that’s on another branch). |
Which one do you mean? Write the link. |
@zelenooki87 Thanks, I had left this unfinished and forgot about it.
I just tested it with the 3bQ8 model and the speed was the same for me. I have a 5070 Ti GPU and you have a 5090. What were the FPS differences between the old and new versions for you?
@naxci1 , how is the nvfp4 integration going? I saw you created a new fork/repo with the -new suffix. |
Hi @zelenooki87 Actually, I've integrated and tested many new methods and technologies, but the tests take a very long time, and it leaves me frustrated, so I leave them unfinished. Unfortunately, AI isn't very smart when it comes to coding. Sometimes it claims to have done something, but when you look, it hasn't actually done it, which is why the tests take so long. There are many branches in my repository, all of which I've left unfinished. I abandon them when I don't get the results I want in the tests. I still haven't achieved the speed I want, unfortunately. SeedVR2 actually needs to be rewritten; it's very old and uses outdated methods, especially optimized for the H100 GPU. Therefore, changing some methods creates other problems, leading to wasted time. @IceClear actually wrote that he would be making SeedVR3; it would have been better if he had rewritten it. Complete code and methods would have been changed, and if it had been built for RT Tensors and especially if VAE had been chosen, integrating new features would have been much easier. Honestly, I don't have much time these days either. It's easier to fix code in FlashVSR; I can do everything in an hour, but integrating new features in SeedVR2 takes days. The code DNA is so complex that even Claude can't handle it. |
I know I shouldn't be talking as a 9070XT user here, but the "old" seedvr2_7b_nvfp4.safetensors model is working with ROCm. Though none of the new ones (like seedvr2_nvfp4_blackwell.safetensors) do. Seems to execute faster too than the current 7b fp16 model. |
Actually, I could convert those models again; it only takes a minute to convert, but it didn't give me the speed I wanted, so I deleted them. The highest quality and fastest model is the 3bQ8; I recommend using that one. |
Which model is that? Is it named differently in your HuggingFace repo? Also, if you could reconvert and upload them somewhere, I'd love to test them again. Thanks!
This is SeedVR2's own model; it downloads as standard, and its name is listed in the model section: 3B Q8 model. |

- src/optimization/nvfp4.py with E2M1 weight format support and E4M3 scaling factors
- src/core/model_loader.py to detect and load NVFP4 .safetensors weights
- src/optimization/compatibility.py
- optimization/__init__.py and use lazy imports in dit_model_loader.py
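The E2M1/E4M3 scheme mentioned for src/optimization/nvfp4.py can be sketched in a few lines: weights are grouped into 16-element blocks, each block gets a scale chosen so its maximum magnitude maps onto the largest FP4 E2M1 value (6.0), and each element is rounded to the nearest representable E2M1 value. The function below is a simplified illustration, not the module's real API; a real NVFP4 kernel would also store the scale itself in E4M3 and pack two 4-bit codes per byte.

```python
# Representable magnitudes of FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # NVFP4 scales weights in blocks of 16 elements

def quantize_block(values):
    """Quantize-dequantize one block of floats.

    Picks a scale so max|x| maps to 6.0, then rounds each element to the
    nearest signed E2M1 value and scales back, returning the dequantized
    values plus the block scale.
    """
    assert len(values) <= BLOCK
    amax = max(abs(v) for v in values)
    scale = amax / 6.0 if amax > 0 else 1.0
    out = []
    for v in values:
        mag = min(E2M1_VALUES, key=lambda m: abs(abs(v) / scale - m))
        out.append(mag * scale * (1 if v >= 0 else -1))
    return out, scale

deq, scale = quantize_block([0.1, -0.4, 0.8, 6.0] + [0.0] * 12)
print(round(scale, 3), [round(x, 2) for x in deq[:4]])
# 1.0 [0.0, -0.5, 1.0, 6.0]
```

The example shows both the appeal and the cost of the format: large values survive almost exactly, while small values near zero get snapped to the nearest coarse level, which is why block-wise scaling (rather than one scale per tensor) matters for quality.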