
Add NVFP4 (4-bit floating point) quantization support for Blackwell GPUs#486

Open
naxci1 wants to merge 36 commits into numz:main from naxci1:main

Conversation

Contributor

@naxci1 naxci1 commented Jan 12, 2026

  • Research & Analysis: Understand NVFP4 quantization requirements for Blackwell (RTX 50-series)
  • Create NVFP4 quantization module: Implement src/optimization/nvfp4.py with E2M1 weight format support and E4M3 scaling factors
  • Update model loader: Modify src/core/model_loader.py to detect and load NVFP4 .safetensors weights
  • Implement layer handling: Ensure Bias, Norm, and Embedding layers remain in FP16 for quality preservation
  • Add async offloading support: Implement pinned memory and async model offloading in AsyncModelOffloader class
  • Update compatibility module: Add NVFP4/Blackwell detection to src/optimization/compatibility.py
  • Update DiT model loader node: Add NVFP4 quantization options (enable_nvfp4, nvfp4_async_offload)
  • Testing & Validation: Created and ran 24 unit tests - all passing
  • Code review fixes: Fixed E2M1 bit layout, dequantization consistency, bounds checking, removed non-existent model entries
  • Security checks: CodeQL passed with 0 alerts
  • Fix node registration: Reverted eager imports in optimization/__init__.py and use lazy imports in dit_model_loader.py

naxci1 and others added 10 commits December 20, 2025 00:17
…own3D, DupUp3D, and Wan2_2_VAE wrapper class
- Add VAEArchitectureConfig for encoder/decoder configuration
- Add VAEEncodingConfig for encoding parameters
- Add VAEModelConfig for complete model configuration
- Implement VAEConfigManager with full CRUD operations
- Support JSON serialization/deserialization
- Include predefined configs for Wan2.1 and Wan2.2
- Add config cloning, updating, saving, and loading
- Support batch import/export operations
Co-authored-by: naxci1 <206254294+naxci1@users.noreply.github.com>
Co-authored-by: naxci1 <206254294+naxci1@users.noreply.github.com>
…it__

Co-authored-by: naxci1 <206254294+naxci1@users.noreply.github.com>
Add NVFP4 (4-bit floating point) quantization support for Blackwell GPUs
Contributor Author

naxci1 commented Jan 12, 2026

NVFP4 model download link:

https://huggingface.co/Nexus24/vaeGGUF/tree/main

Contributor Author

naxci1 commented Jan 12, 2026

Model: 3b_nvfp4

   v2.5.24                                    © ByteDance Seed · NumZ · AInVFX
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

[17:31:55.351] ℹ️ OS: Windows (10.0.19045) | GPU: NVIDIA GeForce RTX 5070 Ti (16GB)
[17:31:55.351] ℹ️ Python: 3.12.10 | PyTorch: 2.7.1+cu128 | FlashAttn: v2 ✓ | SageAttn: v2 ✓ | Triton: ✓
[17:31:55.351] ℹ️ CUDA: 12.8 | cuDNN: 90701 | ComfyUI: 0.3.68
[17:31:55.351]
[17:31:55.351]  ━━━━━━━━━ Model Preparation ━━━━━━━━━
[17:31:55.353] 📊 Before model preparation:
[17:31:55.353] 📊   [VRAM] 0.00GB allocated / 0.00GB reserved / Peak: 0.00GB / 14.59GB free / 15.92GB total
[17:31:55.353] 📊   [RAM] 2.29GB process / 13.37GB others / 80.14GB free / 95.79GB total
[17:31:55.353] 📊 Resetting VRAM peak memory statistics
[17:31:55.353] 📥 Checking and downloading models if needed...
[17:31:55.353] ⚠️ [WARNING] seedvr2_3b_nvfp4.safetensors not in registry, skipping validation
[17:31:55.354] 🔧 VAE model found: C:\ComfyUI\ComfyUI\models\SEEDVR2\ema_vae_fp16.safetensors
[17:31:55.354] 🔧 VAE model already validated (cache): C:\ComfyUI\ComfyUI\models\SEEDVR2\ema_vae_fp16.safetensors
[17:31:55.354] 🔧 Generation context initialized: DiT=cuda:0, VAE=cuda:0, Offload=[none], LOCAL_RANK=0
[17:31:55.354] 🎯 Unified compute dtype: torch.bfloat16 across entire pipeline for maximum performance
[17:31:55.354] 🏃 Configuring inference runner...
[17:31:55.354] 🏃 Creating new runner: DiT=seedvr2_3b_nvfp4.safetensors, VAE=ema_vae_fp16.safetensors
[17:31:55.364] 🚀 Creating DiT model structure on meta device
[17:31:55.421] 🎨 Creating VAE model structure on meta device
[17:31:55.455] 🎨 VAE downsample factors configured (spatial: 8x, temporal: 4x)
[17:31:55.457] 🔄 Moving text_pos_embeds from CPU to CUDA:0 (DiT inference)
[17:31:55.457] 🔄 Moving text_neg_embeds from CPU to CUDA:0 (DiT inference)
[17:31:55.457] 🚀 Loaded text embeddings for DiT
[17:31:55.460] 📊 After model preparation:
[17:31:55.460] 📊   [VRAM] 0.00GB allocated / 0.00GB reserved / Peak: 0.00GB / 14.59GB free / 15.92GB total
[17:31:55.460] 📊   [RAM] 2.29GB process / 13.38GB others / 80.13GB free / 95.79GB total
[17:31:55.460] 📊 Resetting VRAM peak memory statistics
[17:31:55.460] ⚡ Model preparation: 0.11s
[17:31:55.460] ⚡   └─ Model structures prepared: 0.09s
[17:31:55.460] ⚡     └─ DiT structure created: 0.05s
[17:31:55.460] ⚡     └─ VAE structure created: 0.03s
[17:31:55.460] 🔧   Initializing video transformation pipeline for 720px (shortest edge)
[17:31:55.469] 🔧   Target dimensions: 956x720 (padded to 960x720 for processing)
[17:31:55.470]
[17:31:55.470] 🎬 Starting upscaling generation...
[17:31:55.470] 🎬   Input: 150 frames, 306x230px → Padded: 960x720px → Output: 956x720px (shortest edge: 720px)
[17:31:55.470] 🎬   Batch size: 85, Seed: 1333, Channels: RGB
[17:31:55.470]
[17:31:55.470]  ━━━━━━━━ Phase 1: VAE encoding ━━━━━━━━
[17:31:55.470] ♻️ Reusing pre-initialized video transformation pipeline
[17:31:55.470]
[17:31:55.470] 💡 Tip: For 150 frames, batch_size=149 matches video length optimally
[17:31:55.470] 💡   Matching batch_size to shot length improves temporal coherence
[17:31:55.470]
[17:31:55.470] 🎨 Materializing VAE weights to CUDA:0: C:\ComfyUI\ComfyUI\models\SEEDVR2\ema_vae_fp16.safetensors
[17:31:55.594] 🎯 Converting VAE weights to torch.bfloat16 during loading
[17:31:55.597] 🎨 Materializing VAE: 250 parameters, 478.07MB total
[17:31:55.600] 🎨 VAE materialized directly from meta with loaded weights
[17:31:55.600] 🎨 VAE model set to eval mode (gradients disabled)
[17:31:55.601] 🎨 Configuring VAE causal slicing for temporal processing
[17:31:55.601] 🎨 Configuring VAE memory limits for causal convolutions
[17:31:55.602] 🎯 Model precision: VAE=torch.bfloat16, compute=torch.bfloat16
[17:31:55.602] 🎨 Using seed: 1001333 (VAE uses seed+1000000 for deterministic sampling)
[17:31:55.602] 🔄 VAE already on CUDA:0, skipping movement
[17:31:55.604] 📊 After VAE loading for encoding:
[17:31:55.605] 📊   [VRAM] 0.48GB allocated / 0.52GB reserved / Peak: 0.50GB / 14.07GB free / 15.92GB total
[17:31:55.605] 📊   [RAM] 2.28GB process / 13.37GB others / 80.14GB free / 95.79GB total
[17:31:55.605] 📊   Memory changes: VRAM +0.48GB
[17:31:55.605] 📊 Resetting VRAM peak memory statistics
[17:31:55.605] 🎨 Encoding batch 1/2
[17:31:55.605] 🔄   Moving video_batch_1 from CPU to CUDA:0, torch.float32 → torch.bfloat16 (VAE encoding)
[17:31:55.615] 📹   Sequence of 85 frames
[17:32:05.127] ℹ️   Latents shape: torch.Size([22, 90, 120, 16])
[17:32:05.127] 🎨 Encoding batch 2/2
[17:32:05.127] 🔄   Moving video_batch_2 from CPU to CUDA:0, torch.float32 → torch.bfloat16 (VAE encoding)
[17:32:06.092] 📹   Sequence of 65 frames
[17:32:13.067] ℹ️   Latents shape: torch.Size([17, 90, 120, 16])
[17:32:13.067] ⚡ Phase 1: VAE encoding complete: 17.60s
[17:32:13.067] ⚡   └─ Encoded batch 1: 9.52s
[17:32:13.067] ⚡   └─ Encoded batch 2: 7.94s
[17:32:13.068] ⚡   └─ VAE materialized: 0.13s
[17:32:13.068] ⚡     └─ VAE weights loaded from file: 0.12s
[17:32:13.070] 📊 After phase 1 (VAE encoding):
[17:32:13.070] 📊   [VRAM] 0.50GB allocated / 13.96GB reserved / Peak: 9.15GB / 0.00GB free / 15.92GB total
[17:32:13.071] 📊   [RAM] 2.28GB process / 13.42GB others / 80.09GB free / 95.79GB total
[17:32:13.071] 📊   Memory changes: VRAM +0.02GB
[17:32:13.071] 📊 Resetting VRAM peak memory statistics
[17:32:13.071]
[17:32:13.071]  ━━━━━━━━ Phase 2: DiT upscaling ━━━━━━━━
[17:32:14.035] 🚀 Materializing DiT weights to CUDA:0: C:\ComfyUI\ComfyUI\models\SEEDVR2\seedvr2_3b_nvfp4.safetensors
[17:32:14.035] 🔄 Detected NVFP4 checkpoint on Blackwell GPU - enabling 4-bit optimization
[17:32:14.917] 🔄 NVFP4 loading: 0 quantized, 133 preserved in FP16
[17:32:14.918] 🚀 Materializing DiT: 635 parameters, 3235.12MB total
[17:32:14.926] 🚀 DiT materialized directly from meta with loaded weights
[17:32:14.929] ✅ Initialized 64 non-persistent buffers
[17:32:14.929] 🔧 Applying DiT compatibility wrapper
[17:32:14.929] 🎯 Detected NaDiT 3B FP8 - Converting RoPE freqs for FP8 compatibility
[17:32:14.930] ✅ Converted 0 RoPE frequency buffers from FP8 to torch.bfloat16 for compatibility
[17:32:14.930] 🔧 Stabilizing RoPE computations for numerical stability
[17:32:14.930] ✅ Stabilized 64 RoPE modules
[17:32:14.930] 🔧 Applying sageattn_2 attention mode and torch.bfloat16 compute dtype to model
[17:32:14.931] ✅ Applied sageattn_2 and compute_dtype=torch.bfloat16 to 32 modules
[17:32:14.931] 🎯 Model precision: DiT=torch.float8_e4m3fn, VAE=torch.bfloat16, compute=torch.bfloat16
[17:32:14.931] 🔄 DiT already on CUDA:0, skipping movement
[17:32:14.934] 📊 After DiT loading for upscaling:
[17:32:14.934] 📊   [VRAM] 3.71GB allocated / 13.96GB reserved / Peak: 3.71GB / 0.00GB free / 15.92GB total
[17:32:14.934] 📊   [RAM] 2.28GB process / 13.41GB others / 80.10GB free / 95.79GB total
[17:32:14.934] 📊   Memory changes: VRAM +3.21GB
[17:32:14.934] 📊 Resetting VRAM peak memory statistics
[17:32:14.934] 🎬 Upscaling batch 1/2
[17:32:14.935] 🚀 Using seed: 1333 for deterministic generation
EulerSampler: 100%|██████████████████████████████████████████████████████████████████████| 1/1 [00:07<00:00,  7.33s/it]
[17:32:22.270] 🎬 Upscaling batch 2/2
[17:32:22.271] 🚀 Using seed: 1333 for deterministic generation
EulerSampler: 100%|██████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00,  5.47s/it]
[17:32:27.744] 🧹 Cleaning up DiT components
[17:32:27.745] 🔄 Moving DiT from CUDA:0 to CPU (releasing GPU memory)
[17:32:28.333] 🧹 DiT model deleted
[17:32:28.333] 🧹 Cleaned up text embeddings: texts_pos, texts_neg
[17:32:28.333] ⚡ Phase 2: DiT upscaling complete: 15.26s
[17:32:28.333] ⚡   └─ Upscaled batch 1: 7.33s
[17:32:28.334] ⚡     └─ DiT inference 1: 7.33s
[17:32:28.334] ⚡   └─ Upscaled batch 2: 5.47s
[17:32:28.334] ⚡     └─ DiT inference 2: 5.47s
[17:32:28.334] ⚡   └─ DiT materialized: 0.90s
[17:32:28.334] ⚡     └─ DiT weights loaded from file: 0.88s
[17:32:28.334] ⚡   └─ DiT moved to CPU: 0.58s
[17:32:28.334] ⚡   └─ (other operations): 0.97s
[17:32:28.336] 📊 After phase 2 (DiT upscaling):
[17:32:28.336] 📊   [VRAM] 0.50GB allocated / 15.00GB reserved / Peak: 12.64GB / 0.00GB free / 15.92GB total
[17:32:28.336] 📊   [RAM] 5.54GB process / 13.44GB others / 76.82GB free / 95.79GB total
[17:32:28.337] 📊   Memory changes: VRAM -3.21GB, RAM +3.26GB
[17:32:28.337] 📊 Resetting VRAM peak memory statistics
[17:32:28.463]
[17:32:28.463]  ━━━━━━━━ Phase 3: VAE decoding ━━━━━━━━
[17:32:28.464] 🔧 Pre-allocating output tensor: 150 frames, 956x720px, RGB (0.58GB)
[17:32:28.464] 🎯 Model precision: VAE=torch.bfloat16, compute=torch.bfloat16
[17:32:28.465] 🔄 VAE already on CUDA:0, skipping movement
[17:32:28.469] 📊 After VAE loading for decoding:
[17:32:28.469] 📊   [VRAM] 0.50GB allocated / 15.00GB reserved / Peak: 0.50GB / 0.00GB free / 15.92GB total
[17:32:28.469] 📊   [RAM] 2.38GB process / 13.45GB others / 79.96GB free / 95.79GB total
[17:32:28.469] 📊   Memory changes: RAM -3.16GB
[17:32:28.469] 📊 Resetting VRAM peak memory statistics
[17:32:28.469] 🎨 Decoding batch 1/2
[17:32:28.469] ℹ️   Latents shape: torch.Size([1, 22, 90, 120, 16])
[17:32:28.469] 🎨   Using VAE tiled decoding (Tile: (736, 736), Overlap: (32, 32))
[17:32:28.470] 🎨 Decoding 2 tiles (Tile: (736, 736), Overlap: (32, 32))
[17:32:28.470] 🎨   Decoding tiles 1-2 / 2
[17:32:53.072] 📹   Trimming spatial padding: 960x720 → 956x720
[17:32:53.072] 🔄   Moving sample_1 from CUDA:0 to CPU (writing to final_video)
[17:32:53.774] 📹   Wrote 85 frames to positions 0-85
[17:32:53.788] 🎨 Decoding batch 2/2
[17:32:53.788] ℹ️   Latents shape: torch.Size([1, 17, 90, 120, 16])
[17:32:53.788] 🎨   Using VAE tiled decoding (Tile: (736, 736), Overlap: (32, 32))
[17:32:53.789] 🎨 Decoding 2 tiles (Tile: (736, 736), Overlap: (32, 32))
[17:32:53.789] 🎨   Decoding tiles 1-2 / 2
[17:33:12.429] 📹   Trimming spatial padding: 960x720 → 956x720
[17:33:12.431] 🔄   Moving sample_2 from CUDA:0 to CPU (writing to final_video)
[17:33:13.102] 📹   Wrote 65 frames to positions 85-150
[17:33:13.113] 🧹 Cleaning up VAE components
[17:33:13.113] 🔄 Moving VAE from CUDA:0 to CPU (releasing GPU memory)
[17:33:13.238] 🧹 VAE model deleted
[17:33:13.239] ⚡ Phase 3: VAE decoding complete: 44.77s
[17:33:13.239] ⚡   └─ Decoded batch 1: 25.32s
[17:33:13.239] ⚡     └─ VAE decode: 18.64s
[17:33:13.240] ⚡   └─ Decoded batch 2: 19.32s
[17:33:13.240] ⚡     └─ VAE decode: 18.64s
[17:33:13.240] ⚡   └─ VAE moved to CPU: 0.10s
[17:33:13.240] ⚡   └─ (other operations): 0.03s
[17:33:13.244] 📊 After phase 3 (VAE decoding):
[17:33:13.244] 📊   [VRAM] 0.01GB allocated / 12.91GB reserved / Peak: 10.47GB / 0.96GB free / 15.92GB total
[17:33:13.244] 📊   [RAM] 2.97GB process / 13.43GB others / 79.40GB free / 95.79GB total
[17:33:13.244] 📊   Memory changes: VRAM -0.49GB, RAM +0.58GB
[17:33:13.244] 📊 Resetting VRAM peak memory statistics
[17:33:13.244]
[17:33:13.244]  ━━━━━━━━ Phase 4: Post-processing ━━━━━━━━
[17:33:13.244] 📹 Post-processing batch 1/2
[17:33:13.245] 🔄   Moving sample_1 from CPU to CUDA:0 (post-processing)
[17:33:13.269] 📹   Color correction disabled (set to none)
[17:33:13.270] 🔄   Moving sample_1_final from CUDA:0 to CPU (writing processed result to final_video)
[17:33:13.364] 📹 Post-processing batch 2/2
[17:33:13.364] 🔄   Moving sample_2 from CPU to CUDA:0 (post-processing)
[17:33:13.383] 📹   Color correction disabled (set to none)
[17:33:13.383] 🔄   Moving sample_2_final from CUDA:0 to CPU (writing processed result to final_video)
[17:33:13.456] 🎬 Output assembled: 150 frames, Resolution: 956x720px, Channels: RGB
[17:33:13.457] ⚡ Phase 4: Post-processing complete: 0.21s
[17:33:13.457] ⚡   └─ Post-processed batch 1: 0.12s
[17:33:13.457] ⚡   └─ Post-processed batch 2: 0.09s
[17:33:13.459] 📊 After phase 4 (Post-processing):
[17:33:13.460] 📊   [VRAM] 0.01GB allocated / 12.91GB reserved / Peak: 0.33GB / 0.96GB free / 15.92GB total
[17:33:13.460] 📊   [RAM] 2.97GB process / 13.43GB others / 79.40GB free / 95.79GB total
[17:33:13.460] 📊 Resetting VRAM peak memory statistics
[17:33:13.460]
[17:33:13.502] 🎯 Converted output from torch.bfloat16 to float32
[17:33:13.502] ✅ Upscaling completed successfully!
[17:33:13.502] 🧹 Starting full cleanup
[17:33:13.503] 🧹 Clearing memory caches (deep)...
[17:33:13.683] ✅ Completed full cleanup
[17:33:13.712] 📊 After all phases complete:
[17:33:13.712] 📊   [VRAM] 0.00GB allocated / 0.02GB reserved / Peak: 0.01GB / 14.58GB free / 15.92GB total
[17:33:13.712] 📊   [RAM] 3.51GB process / 13.46GB others / 78.82GB free / 95.79GB total
[17:33:13.712] 📊   Memory changes: RAM +0.54GB
[17:33:13.712] 📊 Resetting VRAM peak memory statistics
[17:33:13.713]
[17:33:13.713]  ────────────────────────
[17:33:13.713] 📊 Peak memory by phase:
[17:33:13.713] 📊   1. VAE encoding: VRAM 9.15GB allocated, 13.96GB reserved | RAM 2.28GB
[17:33:13.713] 📊   2. DiT upscaling: VRAM 12.64GB allocated, 15.00GB reserved | RAM 5.54GB
[17:33:13.713] 📊   3. VAE decoding: VRAM 10.47GB allocated, 15.00GB reserved | RAM 2.97GB
[17:33:13.713] 📊   4. Post-processing: VRAM 0.33GB allocated, 12.91GB reserved | RAM 3.51GB
[17:33:13.713] 📊 Overall peak: VRAM 12.64GB allocated, 15.00GB reserved | RAM 5.54GB
[17:33:13.714]
[17:33:13.714]  ────────────────────────
[17:33:13.714] ⚡ Total execution: 78.36s
[17:33:13.714] ⚡   └─ Video generation: 78.03s
[17:33:13.714] ⚡   └─   Phase 3: VAE decoding: 44.77s
[17:33:13.714] ⚡   └─   Phase 1: VAE encoding: 17.60s
[17:33:13.714] ⚡   └─   Phase 2: DiT upscaling: 15.26s
[17:33:13.714] ⚡   └─   Phase 4: Post-processing: 0.21s
[17:33:13.714] ⚡   └─ Final cleanup: 0.21s
[17:33:13.714] ⚡   └─ Model preparation: 0.11s
[17:33:13.714] ⚡ Average FPS: 1.91 frames/sec
[17:33:13.714]
[17:33:13.714]  ────────────────────────
[17:33:13.714] 💬 Questions? Updates? Watch, star & sponsor if you can!
[17:33:13.715] 🎬 https://www.youtube.com/@AInVFX
[17:33:13.715] ⭐💝 https://github.com/numz/ComfyUI-SeedVR2_VideoUpscaler
pop 1.mp4
Prompt executed in 79.40 seconds

Contributor Author

naxci1 commented Jan 12, 2026

Model: 3bQ8

   v2.5.24                                    © ByteDance Seed · NumZ · AInVFX
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

[17:34:38.826] ℹ️ OS: Windows (10.0.19045) | GPU: NVIDIA GeForce RTX 5070 Ti (16GB)
[17:34:38.826] ℹ️ Python: 3.12.10 | PyTorch: 2.7.1+cu128 | FlashAttn: v2 ✓ | SageAttn: v2 ✓ | Triton: ✓
[17:34:38.826] ℹ️ CUDA: 12.8 | cuDNN: 90701 | ComfyUI: 0.3.68
[17:34:38.826]
[17:34:38.826]  ━━━━━━━━━ Model Preparation ━━━━━━━━━
[17:34:38.828] 📊 Before model preparation:
[17:34:38.828] 📊   [VRAM] 0.00GB allocated / 0.00GB reserved / Peak: 0.00GB / 14.59GB free / 15.92GB total
[17:34:38.828] 📊   [RAM] 2.17GB process / 13.43GB others / 80.20GB free / 95.79GB total
[17:34:38.828] 📊 Resetting VRAM peak memory statistics
[17:34:38.828] 📥 Checking and downloading models if needed...
[17:34:38.828] 🔧 DiT model found: C:\ComfyUI\ComfyUI\models\SEEDVR2\seedvr2_ema_3b-Q8_0.gguf
[17:34:38.829] 🔧 DiT model already validated (cache): C:\ComfyUI\ComfyUI\models\SEEDVR2\seedvr2_ema_3b-Q8_0.gguf
[17:34:38.829] 🔧 VAE model found: C:\ComfyUI\ComfyUI\models\SEEDVR2\ema_vae_fp16.safetensors
[17:34:38.829] 🔧 VAE model already validated (cache): C:\ComfyUI\ComfyUI\models\SEEDVR2\ema_vae_fp16.safetensors
[17:34:38.829] 🔧 Generation context initialized: DiT=cuda:0, VAE=cuda:0, Offload=[none], LOCAL_RANK=0
[17:34:38.829] 🎯 Unified compute dtype: torch.bfloat16 across entire pipeline for maximum performance
[17:34:38.829] 🏃 Configuring inference runner...
[17:34:38.829] 🏃 Creating new runner: DiT=seedvr2_ema_3b-Q8_0.gguf, VAE=ema_vae_fp16.safetensors
[17:34:38.839] 🚀 Creating DiT model structure on meta device
[17:34:38.896] 🎨 Creating VAE model structure on meta device
[17:34:38.932] 🎨 VAE downsample factors configured (spatial: 8x, temporal: 4x)
[17:34:38.933] 🔄 Moving text_pos_embeds from CPU to CUDA:0 (DiT inference)
[17:34:38.934] 🔄 Moving text_neg_embeds from CPU to CUDA:0 (DiT inference)
[17:34:38.934] 🚀 Loaded text embeddings for DiT
[17:34:38.936] 📊 After model preparation:
[17:34:38.936] 📊   [VRAM] 0.00GB allocated / 0.00GB reserved / Peak: 0.00GB / 14.59GB free / 15.92GB total
[17:34:38.937] 📊   [RAM] 2.17GB process / 13.43GB others / 80.19GB free / 95.79GB total
[17:34:38.937] 📊 Resetting VRAM peak memory statistics
[17:34:38.937] ⚡ Model preparation: 0.11s
[17:34:38.937] ⚡   └─ Model structures prepared: 0.09s
[17:34:38.937] ⚡     └─ DiT structure created: 0.05s
[17:34:38.937] ⚡     └─ VAE structure created: 0.04s
[17:34:38.937] 🔧   Initializing video transformation pipeline for 720px (shortest edge)
[17:34:38.945] 🔧   Target dimensions: 956x720 (padded to 960x720 for processing)
[17:34:38.946]
[17:34:38.946] 🎬 Starting upscaling generation...
[17:34:38.946] 🎬   Input: 150 frames, 306x230px → Padded: 960x720px → Output: 956x720px (shortest edge: 720px)
[17:34:38.946] 🎬   Batch size: 85, Seed: 1333, Channels: RGB
[17:34:38.946]
[17:34:38.946]  ━━━━━━━━ Phase 1: VAE encoding ━━━━━━━━
[17:34:38.946] ♻️ Reusing pre-initialized video transformation pipeline
[17:34:38.947]
[17:34:38.947] 💡 Tip: For 150 frames, batch_size=149 matches video length optimally
[17:34:38.947] 💡   Matching batch_size to shot length improves temporal coherence
[17:34:38.947]
[17:34:38.947] 🎨 Materializing VAE weights to CUDA:0: C:\ComfyUI\ComfyUI\models\SEEDVR2\ema_vae_fp16.safetensors
[17:34:39.071] 🎯 Converting VAE weights to torch.bfloat16 during loading
[17:34:39.074] 🎨 Materializing VAE: 250 parameters, 478.07MB total
[17:34:39.077] 🎨 VAE materialized directly from meta with loaded weights
[17:34:39.077] 🎨 VAE model set to eval mode (gradients disabled)
[17:34:39.078] 🎨 Configuring VAE causal slicing for temporal processing
[17:34:39.078] 🎨 Configuring VAE memory limits for causal convolutions
[17:34:39.079] 🎯 Model precision: VAE=torch.bfloat16, compute=torch.bfloat16
[17:34:39.079] 🎨 Using seed: 1001333 (VAE uses seed+1000000 for deterministic sampling)
[17:34:39.079] 🔄 VAE already on CUDA:0, skipping movement
[17:34:39.082] 📊 After VAE loading for encoding:
[17:34:39.082] 📊   [VRAM] 0.48GB allocated / 0.52GB reserved / Peak: 0.50GB / 14.07GB free / 15.92GB total
[17:34:39.082] 📊   [RAM] 2.16GB process / 13.44GB others / 80.20GB free / 95.79GB total
[17:34:39.082] 📊   Memory changes: VRAM +0.48GB, RAM -0.01GB
[17:34:39.082] 📊 Resetting VRAM peak memory statistics
[17:34:39.082] 🎨 Encoding batch 1/2
[17:34:39.082] 🔄   Moving video_batch_1 from CPU to CUDA:0, torch.float32 → torch.bfloat16 (VAE encoding)
[17:34:39.092] 📹   Sequence of 85 frames
[17:34:48.656] ℹ️   Latents shape: torch.Size([22, 90, 120, 16])
[17:34:48.656] 🎨 Encoding batch 2/2
[17:34:48.656] 🔄   Moving video_batch_2 from CPU to CUDA:0, torch.float32 → torch.bfloat16 (VAE encoding)
[17:34:49.630] 📹   Sequence of 65 frames
[17:34:56.616] ℹ️   Latents shape: torch.Size([17, 90, 120, 16])
[17:34:56.616] ⚡ Phase 1: VAE encoding complete: 17.67s
[17:34:56.617] ⚡   └─ Encoded batch 1: 9.57s
[17:34:56.617] ⚡   └─ Encoded batch 2: 7.96s
[17:34:56.617] ⚡   └─ VAE materialized: 0.13s
[17:34:56.617] ⚡     └─ VAE weights loaded from file: 0.12s
[17:34:56.621] 📊 After phase 1 (VAE encoding):
[17:34:56.621] 📊   [VRAM] 0.50GB allocated / 13.96GB reserved / Peak: 9.15GB / 0.00GB free / 15.92GB total
[17:34:56.621] 📊   [RAM] 2.16GB process / 13.49GB others / 80.15GB free / 95.79GB total
[17:34:56.621] 📊   Memory changes: VRAM +0.02GB
[17:34:56.621] 📊 Resetting VRAM peak memory statistics
[17:34:56.621]
[17:34:56.621]  ━━━━━━━━ Phase 2: DiT upscaling ━━━━━━━━
[17:34:57.583] 🚀 Materializing DiT weights to CUDA:0: C:\ComfyUI\ComfyUI\models\SEEDVR2\seedvr2_ema_3b-Q8_0.gguf
[17:34:57.603] 🚀 Loading 635 tensors to cuda:0...
[17:34:57.701] 🚀   Loaded 100/635 tensors...
[17:34:57.804] 🚀   Loaded 200/635 tensors...
[17:34:57.913] 🚀   Loaded 300/635 tensors...
[17:34:58.010] 🚀   Loaded 400/635 tensors...
[17:34:58.127] 🚀   Loaded 500/635 tensors...
[17:34:58.240] 🚀   Loaded 600/635 tensors...
[17:34:58.296] ✅ Successfully loaded 635 tensors to cuda:0
[17:34:58.380] 🚀 Materializing DiT: 635 parameters, 3490.99MB total
[17:34:58.380] 🚀 Loading GGUF weights
[17:34:58.382] ✅ Architecture check complete, no shape mismatch
[17:34:58.389] ✅ GGUF loading complete: 635 parameters loaded
[17:34:58.389] ℹ️ Quantized parameters: 210
[17:34:58.390] 🚀 Replacing layers with GGUF-optimized versions for precision handling
[17:34:58.393] ✅ Replaced 210 layers with GGUF-optimized versions
[17:34:58.393] 🎯 GGUF precision path: Q8_0:210 → FP16 (preserve) → BF16/FP32 (compute)
[17:34:58.397] ✅ Initialized 64 non-persistent buffers
[17:34:58.397] 🔧 Applying DiT compatibility wrapper
[17:34:58.397] 🔧 Stabilizing RoPE computations for numerical stability
[17:34:58.398] ✅ Stabilized 64 RoPE modules
[17:34:58.398] 🔧 Applying sageattn_2 attention mode and torch.bfloat16 compute dtype to model
[17:34:58.398] ✅ Applied sageattn_2 and compute_dtype=torch.bfloat16 to 32 modules
[17:34:58.398] 🎯 Model precision: DiT=torch.float16, VAE=torch.bfloat16, compute=torch.bfloat16
[17:34:58.398] 🔄 DiT already on CUDA:0, skipping movement
[17:34:58.400] 📊 After DiT loading for upscaling:
[17:34:58.401] 📊   [VRAM] 3.96GB allocated / 13.96GB reserved / Peak: 3.96GB / 0.00GB free / 15.92GB total
[17:34:58.401] 📊   [RAM] 2.16GB process / 13.49GB others / 80.15GB free / 95.79GB total
[17:34:58.401] 📊   Memory changes: VRAM +3.46GB
[17:34:58.401] 📊 Resetting VRAM peak memory statistics
[17:34:58.401] 🎬 Upscaling batch 1/2
[17:34:58.402] 🚀 Using seed: 1333 for deterministic generation
EulerSampler: 100%|██████████████████████████████████████████████████████████████████████| 1/1 [00:07<00:00,  7.32s/it]
[17:35:05.722] 🎬 Upscaling batch 2/2
[17:35:05.723] 🚀 Using seed: 1333 for deterministic generation
EulerSampler: 100%|██████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00,  5.61s/it]
[17:35:11.331] 🧹 Cleaning up DiT components
[17:35:11.333] 🔄 Moving DiT from CUDA:0 to CPU (releasing GPU memory)
[17:35:11.992] 🧹 DiT model deleted
[17:35:11.992] 🧹 Cleaned up text embeddings: texts_pos, texts_neg
[17:35:11.992] ⚡ Phase 2: DiT upscaling complete: 15.37s
[17:35:11.992] ⚡   └─ Upscaled batch 1: 7.32s
[17:35:11.992] ⚡     └─ DiT inference 1: 7.32s
[17:35:11.992] ⚡   └─ Upscaled batch 2: 5.61s
[17:35:11.992] ⚡     └─ DiT inference 2: 5.61s
[17:35:11.992] ⚡   └─ DiT materialized: 0.81s
[17:35:11.992] ⚡     └─ DiT weights loaded from file: 0.80s
[17:35:11.992] ⚡   └─ DiT moved to CPU: 0.66s
[17:35:11.992] ⚡   └─ (other operations): 0.97s
[17:35:11.994] 📊 After phase 2 (DiT upscaling):
[17:35:11.995] 📊   [VRAM] 0.50GB allocated / 14.95GB reserved / Peak: 12.89GB / 0.00GB free / 15.92GB total
[17:35:11.995] 📊   [RAM] 5.73GB process / 13.46GB others / 76.61GB free / 95.79GB total
[17:35:11.995] 📊   Memory changes: VRAM -3.46GB, RAM +3.58GB
[17:35:11.995] 📊 Resetting VRAM peak memory statistics
[17:35:12.130]
[17:35:12.130]  ━━━━━━━━ Phase 3: VAE decoding ━━━━━━━━
[17:35:12.130] 🔧 Pre-allocating output tensor: 150 frames, 956x720px, RGB (0.58GB)
[17:35:12.131] 🎯 Model precision: VAE=torch.bfloat16, compute=torch.bfloat16
[17:35:12.131] 🔄 VAE already on CUDA:0, skipping movement
[17:35:12.134] 📊 After VAE loading for decoding:
[17:35:12.134] 📊   [VRAM] 0.50GB allocated / 14.95GB reserved / Peak: 0.50GB / 0.00GB free / 15.92GB total
[17:35:12.134] 📊   [RAM] 2.32GB process / 13.48GB others / 79.99GB free / 95.79GB total
[17:35:12.134] 📊   Memory changes: RAM -3.41GB
[17:35:12.134] 📊 Resetting VRAM peak memory statistics
[17:35:12.134] 🎨 Decoding batch 1/2
[17:35:12.135] ℹ️   Latents shape: torch.Size([1, 22, 90, 120, 16])
[17:35:12.135] 🎨   Using VAE tiled decoding (Tile: (736, 736), Overlap: (32, 32))
[17:35:12.135] 🎨 Decoding 2 tiles (Tile: (736, 736), Overlap: (32, 32))
[17:35:12.135] 🎨   Decoding tiles 1-2 / 2
[17:35:36.859] 📹   Trimming spatial padding: 960x720 → 956x720
[17:35:36.859] 🔄   Moving sample_1 from CUDA:0 to CPU (writing to final_video)
[17:35:37.559] 📹   Wrote 85 frames to positions 0-85
[17:35:37.576] 🎨 Decoding batch 2/2
[17:35:37.576] ℹ️   Latents shape: torch.Size([1, 17, 90, 120, 16])
[17:35:37.576] 🎨   Using VAE tiled decoding (Tile: (736, 736), Overlap: (32, 32))
[17:35:37.576] 🎨 Decoding 2 tiles (Tile: (736, 736), Overlap: (32, 32))
[17:35:37.576] 🎨   Decoding tiles 1-2 / 2
[17:35:56.321] 📹   Trimming spatial padding: 960x720 → 956x720
[17:35:56.322] 🔄   Moving sample_2 from CUDA:0 to CPU (writing to final_video)
[17:35:56.989] 📹   Wrote 65 frames to positions 85-150
[17:35:56.999] 🧹 Cleaning up VAE components
[17:35:57.000] 🔄 Moving VAE from CUDA:0 to CPU (releasing GPU memory)
[17:35:57.129] 🧹 VAE model deleted
[17:35:57.129] ⚡ Phase 3: VAE decoding complete: 45.00s
[17:35:57.130] ⚡   └─ Decoded batch 1: 25.44s
[17:35:57.130] ⚡     └─ VAE decode: 18.75s
[17:35:57.130] ⚡   └─ Decoded batch 2: 19.42s
[17:35:57.130] ⚡     └─ VAE decode: 18.75s
[17:35:57.130] ⚡   └─ VAE moved to CPU: 0.11s
[17:35:57.131] ⚡   └─ (other operations): 0.02s
[17:35:57.133] 📊 After phase 3 (VAE decoding):
[17:35:57.133] 📊   [VRAM] 0.01GB allocated / 12.46GB reserved / Peak: 10.47GB / 1.44GB free / 15.92GB total
[17:35:57.133] 📊   [RAM] 2.97GB process / 13.07GB others / 79.75GB free / 95.79GB total
[17:35:57.133] 📊   Memory changes: VRAM -0.49GB, RAM +0.65GB
[17:35:57.133] 📊 Resetting VRAM peak memory statistics
[17:35:57.133]
[17:35:57.133]  ━━━━━━━━ Phase 4: Post-processing ━━━━━━━━
[17:35:57.133] 📹 Post-processing batch 1/2
[17:35:57.133] 🔄   Moving sample_1 from CPU to CUDA:0 (post-processing)
[17:35:57.164] 📹   Color correction disabled (set to none)
[17:35:57.164] 🔄   Moving sample_1_final from CUDA:0 to CPU (writing processed result to final_video)
[17:35:57.255] 📹 Post-processing batch 2/2
[17:35:57.256] 🔄   Moving sample_2 from CPU to CUDA:0 (post-processing)
[17:35:57.273] 📹   Color correction disabled (set to none)
[17:35:57.274] 🔄   Moving sample_2_final from CUDA:0 to CPU (writing processed result to final_video)
[17:35:57.343] 🎬 Output assembled: 150 frames, Resolution: 956x720px, Channels: RGB
[17:35:57.343] ⚡ Phase 4: Post-processing complete: 0.21s
[17:35:57.343] ⚡   └─ Post-processed batch 1: 0.12s
[17:35:57.343] ⚡   └─ Post-processed batch 2: 0.09s
[17:35:57.346] 📊 After phase 4 (Post-processing):
[17:35:57.346] 📊   [VRAM] 0.01GB allocated / 12.46GB reserved / Peak: 0.33GB / 1.44GB free / 15.92GB total
[17:35:57.346] 📊   [RAM] 2.97GB process / 13.07GB others / 79.75GB free / 95.79GB total
[17:35:57.346] 📊 Resetting VRAM peak memory statistics
[17:35:57.346]
[17:35:57.389] 🎯 Converted output from torch.bfloat16 to float32
[17:35:57.389] ✅ Upscaling completed successfully!
[17:35:57.389] 🧹 Starting full cleanup
[17:35:57.390] 🧹 Clearing memory caches (deep)...
[17:35:57.569] ✅ Completed full cleanup
[17:35:57.598] 📊 After all phases complete:
[17:35:57.598] 📊   [VRAM] 0.00GB allocated / 0.02GB reserved / Peak: 0.01GB / 14.58GB free / 15.92GB total
[17:35:57.598] 📊   [RAM] 3.51GB process / 13.07GB others / 79.22GB free / 95.79GB total
[17:35:57.598] 📊   Memory changes: RAM +0.54GB
[17:35:57.598] 📊 Resetting VRAM peak memory statistics
[17:35:57.598]
[17:35:57.599]  ────────────────────────
[17:35:57.599] 📊 Peak memory by phase:
[17:35:57.599] 📊   1. VAE encoding: VRAM 9.15GB allocated, 13.96GB reserved | RAM 2.16GB
[17:35:57.599] 📊   2. DiT upscaling: VRAM 12.89GB allocated, 14.95GB reserved | RAM 5.73GB
[17:35:57.599] 📊   3. VAE decoding: VRAM 10.47GB allocated, 14.95GB reserved | RAM 2.97GB
[17:35:57.599] 📊   4. Post-processing: VRAM 0.33GB allocated, 12.46GB reserved | RAM 3.51GB
[17:35:57.599] 📊 Overall peak: VRAM 12.89GB allocated, 14.95GB reserved | RAM 5.73GB
[17:35:57.599]
[17:35:57.599]  ────────────────────────
[17:35:57.599] ⚡ Total execution: 78.77s
[17:35:57.599] ⚡   └─ Video generation: 78.44s
[17:35:57.599] ⚡   └─   Phase 3: VAE decoding: 45.00s
[17:35:57.599] ⚡   └─   Phase 1: VAE encoding: 17.67s
[17:35:57.599] ⚡   └─   Phase 2: DiT upscaling: 15.37s
[17:35:57.599] ⚡   └─   Phase 4: Post-processing: 0.21s
[17:35:57.599] ⚡   └─ Final cleanup: 0.21s
[17:35:57.599] ⚡   └─ Model preparation: 0.11s
[17:35:57.599] ⚡ Average FPS: 1.90 frames/sec
[17:35:57.599]
[17:35:57.600]  ────────────────────────
[17:35:57.600] 💬 Questions? Updates? Watch, star & sponsor if you can!
[17:35:57.600] 🎬 https://www.youtube.com/@AInVFX
[17:35:57.600] ⭐💝 https://github.com/numz/ComfyUI-SeedVR2_VideoUpscaler
pop 1.mp4
Prompt executed in 79.76 seconds

Copilot AI and others added 4 commits January 12, 2026 13:53
…ackwell GPUs

Co-authored-by: naxci1 <206254294+naxci1@users.noreply.github.com>
…fication, remove private APIs

Co-authored-by: naxci1 <206254294+naxci1@users.noreply.github.com>
Add NVFP4 async offloading and pinned memory for Blackwell GPU optimization
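The pinned-memory async offload named in the commit above works along these lines. A minimal sketch only: `offload_module_async` is a hypothetical helper for illustration, not the PR's actual `AsyncModelOffloader` API:

```python
import torch

def offload_module_async(module: torch.nn.Module, stream: torch.cuda.Stream):
    """Copy a module's parameters into pinned CPU buffers on a side stream.

    Pinned (page-locked) host memory allows the device-to-host copy to run
    asynchronously, overlapping with compute on the default stream.
    """
    cpu_buffers = {}
    stream.wait_stream(torch.cuda.current_stream())  # order after pending work
    with torch.cuda.stream(stream):
        for name, param in module.named_parameters():
            buf = torch.empty(param.shape, dtype=param.dtype,
                              device="cpu", pin_memory=True)
            buf.copy_(param.detach(), non_blocking=True)  # async D2H copy
            cpu_buffers[name] = buf
    # Caller must stream.synchronize() before reading the buffers.
    return cpu_buffers
```

Without pinned buffers the copy falls back to a synchronous pageable-memory path, which is the stall this commit is avoiding.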
Contributor Author

naxci1 commented Jan 12, 2026

Setting output directory to: D:\output
[START] Security scan
[DONE] Security scan
## ComfyUI-Manager: installing dependencies done.
** ComfyUI startup time: 2026-01-12 18:35:07.511
** Platform: Windows
** Python version: 3.12.10 (tags/v3.12.10:0cc8128, Apr  8 2025, 12:21:36) [MSC v.1943 64 bit (AMD64)]
** Python executable: C:\ComfyUI\python_embeded\python.exe
** ComfyUI Path: C:\ComfyUI\ComfyUI
** ComfyUI Base Folder Path: C:\ComfyUI\ComfyUI
** User directory: C:\ComfyUI\ComfyUI\user
** ComfyUI-Manager config path: C:\ComfyUI\ComfyUI\user\default\ComfyUI-Manager\config.ini
** Log path: C:\ComfyUI\ComfyUI\user\comfyui.log

Prestartup times for custom nodes:
   0.0 seconds: C:\ComfyUI\ComfyUI\custom_nodes\rgthree-comfy
   0.0 seconds: C:\ComfyUI\ComfyUI\custom_nodes\comfyui-easy-use
   1.5 seconds: C:\ComfyUI\ComfyUI\custom_nodes\comfyui-manager

Checkpoint files will always be loaded safely.
Total VRAM 16303 MB, total RAM 98093 MB
pytorch version: 2.7.1+cu128
Enabled fp16 accumulation.
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 5070 Ti : native
Using pytorch attention
Python version: 3.12.10 (tags/v3.12.10:0cc8128, Apr  8 2025, 12:21:36) [MSC v.1943 64 bit (AMD64)]
ComfyUI version: 0.3.68
ComfyUI frontend version: 1.28.8
[Prompt Server] web root: C:\ComfyUI\python_embeded\Lib\site-packages\comfyui_frontend_package\static
[ComfyUI-Easy-Use] server: v1.3.2 Loaded
[ComfyUI-Easy-Use] web root: C:\ComfyUI\ComfyUI\custom_nodes\comfyui-easy-use\web_version/v2 Loaded
### Loading: ComfyUI-Manager (V3.36)
[ComfyUI-Manager] network_mode: public
### ComfyUI Revision: 682 [265adad8] *DETACHED | Released on '2025-11-04'
WanVideoWrapper WARNING: FantasyPortrait nodes not available due to error in importing them: No module named 'onnx'
No Negpip.

[rgthree-comfy] Loaded 48 epic nodes. 🎉

⚡ SeedVR2 optimizations check: SageAttention ✅ | Flash Attention ✅ | Triton ✅
🚀 NVFP4 Blackwell optimization: ✅ (NVIDIA GeForce RTX 5070 Ti - 4-bit Tensor Core acceleration enabled)
   └─ Native FP4 dispatch configured (TF32 enabled, cuDNN benchmark active)
C:\ComfyUI\ComfyUI\custom_nodes\seedvr2_videoupscaler\src\optimization\compatibility.py:758: UserWarning: expandable_segments not supported on this platform (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\pytorch\c10/cuda/CUDAAllocatorConfig.h:28.)
  a = torch.randn(8, 8, dtype=torch.bfloat16, device='cuda:0')
📊 Initial CUDA memory: 14.62GB free / 15.92GB total
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/alter-list.json
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/model-list.json
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/github-stats.json

Import times for custom nodes:
   0.0 seconds: C:\ComfyUI\ComfyUI\custom_nodes\websocket_image_save.py
   0.0 seconds: C:\ComfyUI\ComfyUI\custom_nodes\comfyui-inpaint-cropandstitch
   0.0 seconds: C:\ComfyUI\ComfyUI\custom_nodes\comfyui_mittimi_load_presetlite
   0.0 seconds: C:\ComfyUI\ComfyUI\custom_nodes\ComfyMath
   0.0 seconds: C:\ComfyUI\ComfyUI\custom_nodes\comfyui-dream-video-batches
   0.0 seconds: C:\ComfyUI\ComfyUI\custom_nodes\comfyui-videodircombiner
   0.0 seconds: C:\ComfyUI\ComfyUI\custom_nodes\rgthree-comfy
   0.0 seconds: C:\ComfyUI\ComfyUI\custom_nodes\ComfyUI-FlashVSR_Stable
   0.0 seconds: C:\ComfyUI\ComfyUI\custom_nodes\ComfyUI-VideoHelperSuite
   0.1 seconds: C:\ComfyUI\ComfyUI\custom_nodes\ComfyUI-WanVideoWrapper
   0.3 seconds: C:\ComfyUI\ComfyUI\custom_nodes\comfyui-manager
   0.3 seconds: C:\ComfyUI\ComfyUI\custom_nodes\seedvr2_videoupscaler
   1.6 seconds: C:\ComfyUI\ComfyUI\custom_nodes\comfyui-easy-use

[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/extension-node-map.json
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/custom-node-list.json
Context impl SQLiteImpl.
Will assume non-transactional DDL.
No target revision found.
Starting server

To see the GUI go to: http://0.0.0.0:8188
To see the GUI go to: http://[::]:8188
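
The `🚀 NVFP4 Blackwell optimization` line above comes from a GPU-generation check in compatibility.py. The gist of such a check can be sketched as follows (the SM-to-generation mapping is my assumption; the plugin's actual logic isn't shown in this thread):

```python
def is_blackwell(capability: tuple[int, int]) -> bool:
    """True for GPUs with NVFP4-capable tensor cores: SM 10.x (B100/B200)
    or SM 12.x (RTX 50-series consumer Blackwell)."""
    major, _ = capability
    return major in (10, 12)

# In the plugin this would be fed torch.cuda.get_device_capability();
# the RTX 5070 Ti from the log reports SM (12, 0).
print(is_blackwell((12, 0)))  # True
print(is_blackwell((8, 9)))   # False (Ada, RTX 40-series)
```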

@naxci1
Contributor Author

naxci1 commented Jan 12, 2026

   v2.5.24                                    © ByteDance Seed · NumZ · AInVFX
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

[18:39:34.659] ℹ️ OS: Windows (10.0.19045) | GPU: NVIDIA GeForce RTX 5070 Ti (16GB)
[18:39:34.659] ℹ️ Python: 3.12.10 | PyTorch: 2.7.1+cu128 | FlashAttn: v2 ✓ | SageAttn: v2 ✓ | Triton: ✓
[18:39:34.659] ℹ️ CUDA: 12.8 | cuDNN: 90701 | ComfyUI: 0.3.68
[18:39:34.659]
[18:39:34.659]  ━━━━━━━━━ Model Preparation ━━━━━━━━━
[18:39:34.661] 📊 Before model preparation:
[18:39:34.661] 📊   [VRAM] 0.00GB allocated / 0.00GB reserved / Peak: 0.00GB / 14.59GB free / 15.92GB total
[18:39:34.661] 📊   [RAM] 5.36GB process / 10.98GB others / 79.45GB free / 95.79GB total
[18:39:34.661] 📊 Resetting VRAM peak memory statistics
[18:39:34.661] 📥 Checking and downloading models if needed...
[18:39:34.662] ⚠️ [WARNING] seedvr2_3b_nvfp4.safetensors not in registry, skipping validation
[18:39:34.662] 🔧 VAE model found: C:\ComfyUI\ComfyUI\models\SEEDVR2\ema_vae_fp16.safetensors
[18:39:34.662] 🔧 VAE model already validated (cache): C:\ComfyUI\ComfyUI\models\SEEDVR2\ema_vae_fp16.safetensors
[18:39:34.662] 🔧 Generation context initialized: DiT=cuda:0, VAE=cuda:0, Offload=[DiT offload=cpu], LOCAL_RANK=0
[18:39:34.662] 🎯 Unified compute dtype: torch.bfloat16 across entire pipeline for maximum performance
[18:39:34.662] 🏃 Configuring inference runner...
[18:39:34.662] 🏃 Creating new runner: DiT=seedvr2_3b_nvfp4.safetensors, VAE=ema_vae_fp16.safetensors
[18:39:34.672] ♻️ Reusing cached DiT (89): seedvr2_3b_nvfp4.safetensors
[18:39:34.673] 🚀 DiT configuration unchanged, reusing cached model
[18:39:34.676] 🎨 Creating VAE model structure on meta device
[18:39:34.712] 🎨 VAE downsample factors configured (spatial: 8x, temporal: 4x)
[18:39:34.713] 🔄 Moving text_pos_embeds from CPU to CUDA:0 (DiT inference)
[18:39:34.714] 🔄 Moving text_neg_embeds from CPU to CUDA:0 (DiT inference)
[18:39:34.714] 🚀 Loaded text embeddings for DiT
[18:39:34.716] 📊 After model preparation:
[18:39:34.716] 📊   [VRAM] 0.00GB allocated / 0.00GB reserved / Peak: 0.00GB / 14.59GB free / 15.92GB total
[18:39:34.716] 📊   [RAM] 5.36GB process / 10.98GB others / 79.45GB free / 95.79GB total
[18:39:34.716] 📊 Resetting VRAM peak memory statistics
[18:39:34.717] ⚡ Model preparation: 0.06s
[18:39:34.717] ⚡   └─ Model structures prepared: 0.04s
[18:39:34.717] ⚡     └─ VAE structure created: 0.04s
[18:39:34.717] 🔧   Initializing video transformation pipeline for 720px (shortest edge)
[18:39:34.725] 🔧   Target dimensions: 956x720 (padded to 960x720 for processing)
[18:39:34.726]
[18:39:34.726] 🎬 Starting upscaling generation...
[18:39:34.726] 🎬   Input: 150 frames, 306x230px → Padded: 960x720px → Output: 956x720px (shortest edge: 720px)
[18:39:34.726] 🎬   Batch size: 77, Seed: 1333, Channels: RGB
[18:39:34.727]
[18:39:34.727]  ━━━━━━━━ Phase 1: VAE encoding ━━━━━━━━
[18:39:34.727] ♻️ Reusing pre-initialized video transformation pipeline
[18:39:34.727]
[18:39:34.727] 💡 Tip: For 150 frames, batch_size=149 matches video length optimally
[18:39:34.727] 💡   Matching batch_size to shot length improves temporal coherence
[18:39:34.727]
[18:39:34.727] 🎨 Materializing VAE weights to CUDA:0: C:\ComfyUI\ComfyUI\models\SEEDVR2\ema_vae_fp16.safetensors
[18:39:34.848] 🎯 Converting VAE weights to torch.bfloat16 during loading
[18:39:34.852] 🎨 Materializing VAE: 250 parameters, 478.07MB total
[18:39:34.854] 🎨 VAE materialized directly from meta with loaded weights
[18:39:34.855] 🎨 VAE model set to eval mode (gradients disabled)
[18:39:34.856] 🎨 Configuring VAE causal slicing for temporal processing
[18:39:34.856] 🎨 Configuring VAE memory limits for causal convolutions
[18:39:34.857] 🎯 Model precision: DiT=torch.float8_e4m3fn, VAE=torch.bfloat16, compute=torch.bfloat16
[18:39:34.857] 🎨 Using seed: 1001333 (VAE uses seed+1000000 for deterministic sampling)
[18:39:34.858] 🔄 VAE already on CUDA:0, skipping movement
[18:39:34.860] 📊 After VAE loading for encoding:
[18:39:34.860] 📊   [VRAM] 0.48GB allocated / 0.52GB reserved / Peak: 0.50GB / 14.07GB free / 15.92GB total
[18:39:34.860] 📊   [RAM] 5.36GB process / 10.98GB others / 79.45GB free / 95.79GB total
[18:39:34.860] 📊   Memory changes: VRAM +0.48GB
[18:39:34.860] 📊 Resetting VRAM peak memory statistics
[18:39:34.860] 🎨 Encoding batch 1/2
[18:39:34.860] 🔄   Moving video_batch_1 from CPU to CUDA:0, torch.float32 → torch.bfloat16 (VAE encoding)
[18:39:34.871] 📹   Sequence of 77 frames
[18:39:44.325] ℹ️   Latents shape: torch.Size([20, 90, 120, 16])
[18:39:44.326] 🎨 Encoding batch 2/2
[18:39:44.326] 🔄   Moving video_batch_2 from CPU to CUDA:0, torch.float32 → torch.bfloat16 (VAE encoding)
[18:39:44.339] 📹   Sequence of 73 frames
[18:39:53.221] ℹ️   Latents shape: torch.Size([19, 90, 120, 16])
[18:39:53.222] ⚡ Phase 1: VAE encoding complete: 18.49s
[18:39:53.222] ⚡   └─ Encoded batch 1: 9.46s
[18:39:53.223] ⚡   └─ Encoded batch 2: 8.90s
[18:39:53.223] ⚡   └─ VAE materialized: 0.13s
[18:39:53.223] ⚡     └─ VAE weights loaded from file: 0.12s
[18:39:53.226] 📊 After phase 1 (VAE encoding):
[18:39:53.227] 📊   [VRAM] 0.50GB allocated / 14.30GB reserved / Peak: 9.12GB / 0.00GB free / 15.92GB total
[18:39:53.227] 📊   [RAM] 5.36GB process / 10.96GB others / 79.48GB free / 95.79GB total
[18:39:53.227] 📊   Memory changes: VRAM +0.02GB
[18:39:53.227] 📊 Resetting VRAM peak memory statistics
[18:39:53.228]
[18:39:53.228]  ━━━━━━━━ Phase 2: DiT upscaling ━━━━━━━━
[18:39:53.317] 🎯 Model precision: DiT=torch.float8_e4m3fn, VAE=torch.bfloat16, compute=torch.bfloat16
[18:39:53.317] 🔄 Moving DiT from CPU to CUDA:0 (inference requirement)
[18:39:53.735] 📊 After DiT loading for upscaling:
[18:39:53.735] 📊   [VRAM] 3.71GB allocated / 14.30GB reserved / Peak: 3.71GB / 0.00GB free / 15.92GB total
[18:39:53.735] 📊   [RAM] 2.16GB process / 10.95GB others / 82.68GB free / 95.79GB total
[18:39:53.736] 📊   Memory changes: VRAM +3.21GB, RAM -3.20GB
[18:39:53.736] 📊 Resetting VRAM peak memory statistics
[18:39:53.736] 🎬 Upscaling batch 1/2
[18:39:53.736] 🚀 Using seed: 1333 for deterministic generation
EulerSampler: 100%|██████████████████████████████████████████████████████████████████████| 1/1 [00:06<00:00,  6.47s/it]
[18:40:00.208] 🎬 Upscaling batch 2/2
[18:40:00.209] 🚀 Using seed: 1333 for deterministic generation
EulerSampler: 100%|██████████████████████████████████████████████████████████████████████| 1/1 [00:06<00:00,  6.20s/it]
[18:40:06.409] 🧹 Cleaning up DiT components
[18:40:06.411] 🔄 Moving DiT from CUDA:0 to CPU (model caching)
[18:40:07.061] 🧹 Cleaned up text embeddings: texts_pos, texts_neg
[18:40:07.061] ⚡ Phase 2: DiT upscaling complete: 13.83s
[18:40:07.062] ⚡   └─ Upscaled batch 1: 6.47s
[18:40:07.062] ⚡     └─ DiT inference 1: 6.47s
[18:40:07.062] ⚡   └─ Upscaled batch 2: 6.20s
[18:40:07.062] ⚡     └─ DiT inference 2: 6.20s
[18:40:07.062] ⚡   └─ DiT moved to CPU: 0.65s
[18:40:07.062] ⚡   └─ DiT moved to CUDA:0: 0.42s
[18:40:07.063] ⚡   └─ (other operations): 0.10s
[18:40:07.064] 📊 After phase 2 (DiT upscaling):
[18:40:07.065] 📊   [VRAM] 0.50GB allocated / 14.32GB reserved / Peak: 11.86GB / 0.00GB free / 15.92GB total
[18:40:07.065] 📊   [RAM] 5.35GB process / 10.97GB others / 79.47GB free / 95.79GB total
[18:40:07.065] 📊   Memory changes: VRAM -3.21GB, RAM +3.19GB
[18:40:07.065] 📊 Resetting VRAM peak memory statistics
[18:40:07.065]
[18:40:07.065]  ━━━━━━━━ Phase 3: VAE decoding ━━━━━━━━
[18:40:07.065] 🔧 Pre-allocating output tensor: 150 frames, 956x720px, RGB (0.58GB)
[18:40:07.066] 🎯 Model precision: DiT=torch.float8_e4m3fn, VAE=torch.bfloat16, compute=torch.bfloat16
[18:40:07.066] 🔄 VAE already on CUDA:0, skipping movement
[18:40:07.068] 📊 After VAE loading for decoding:
[18:40:07.068] 📊   [VRAM] 0.50GB allocated / 14.32GB reserved / Peak: 0.50GB / 0.00GB free / 15.92GB total
[18:40:07.068] 📊   [RAM] 5.35GB process / 10.98GB others / 79.46GB free / 95.79GB total
[18:40:07.068] 📊 Resetting VRAM peak memory statistics
[18:40:07.068] 🎨 Decoding batch 1/2
[18:40:07.068] ℹ️   Latents shape: torch.Size([1, 20, 90, 120, 16])
[18:40:07.069] 🎨   Using VAE tiled decoding (Tile: (736, 736), Overlap: (32, 32))
[18:40:07.069] 🎨 Decoding 2 tiles (Tile: (736, 736), Overlap: (32, 32))
[18:40:07.069] 🎨   Decoding tiles 1-2 / 2
[18:40:29.341] 📹   Trimming spatial padding: 960x720 → 956x720
[18:40:29.341] 🔄   Moving sample_1 from CUDA:0 to CPU (writing to final_video)
[18:40:30.034] 📹   Wrote 77 frames to positions 0-77
[18:40:30.046] 🎨 Decoding batch 2/2
[18:40:30.047] ℹ️   Latents shape: torch.Size([1, 19, 90, 120, 16])
[18:40:30.047] 🎨   Using VAE tiled decoding (Tile: (736, 736), Overlap: (32, 32))
[18:40:30.047] 🎨 Decoding 2 tiles (Tile: (736, 736), Overlap: (32, 32))
[18:40:30.047] 🎨   Decoding tiles 1-2 / 2
[18:40:51.114] 📹   Trimming spatial padding: 960x720 → 956x720
[18:40:51.115] 🔄   Moving sample_2 from CUDA:0 to CPU (writing to final_video)
[18:40:51.797] 📹   Wrote 73 frames to positions 77-150
[18:40:51.813] 🧹 Cleaning up VAE components
[18:40:51.813] 🔄 Moving VAE from CUDA:0 to CPU (releasing GPU memory)
[18:40:51.932] 🧹 VAE model deleted
[18:40:51.933] ⚡ Phase 3: VAE decoding complete: 44.87s
[18:40:51.933] ⚡   └─ Decoded batch 1: 22.98s
[18:40:51.933] ⚡     └─ VAE decode: 21.07s
[18:40:51.933] ⚡   └─ Decoded batch 2: 21.77s
[18:40:51.934] ⚡     └─ VAE decode: 21.07s
[18:40:51.934] ⚡   └─ VAE moved to CPU: 0.10s
[18:40:51.934] ⚡   └─ (other operations): 0.02s
[18:40:51.936] 📊 After phase 3 (VAE decoding):
[18:40:51.936] 📊   [VRAM] 0.01GB allocated / 13.45GB reserved / Peak: 10.47GB / 0.55GB free / 15.92GB total
[18:40:51.936] 📊   [RAM] 5.94GB process / 10.99GB others / 78.87GB free / 95.79GB total
[18:40:51.936] 📊   Memory changes: VRAM -0.49GB, RAM +0.58GB
[18:40:51.936] 📊 Resetting VRAM peak memory statistics
[18:40:51.936]
[18:40:51.936]  ━━━━━━━━ Phase 4: Post-processing ━━━━━━━━
[18:40:51.936] 📹 Post-processing batch 1/2
[18:40:51.936] 🔄   Moving sample_1 from CPU to CUDA:0 (post-processing)
[18:40:51.956] 📹   Color correction disabled (set to none)
[18:40:51.956] 🔄   Moving sample_1_final from CUDA:0 to CPU (writing processed result to final_video)
[18:40:52.040] 📹 Post-processing batch 2/2
[18:40:52.040] 🔄   Moving sample_2 from CPU to CUDA:0 (post-processing)
[18:40:52.059] 📹   Color correction disabled (set to none)
[18:40:52.059] 🔄   Moving sample_2_final from CUDA:0 to CPU (writing processed result to final_video)
[18:40:52.140] 🎬 Output assembled: 150 frames, Resolution: 956x720px, Channels: RGB
[18:40:52.141] ⚡ Phase 4: Post-processing complete: 0.21s
[18:40:52.141] ⚡   └─ Post-processed batch 1: 0.10s
[18:40:52.141] ⚡   └─ Post-processed batch 2: 0.10s
[18:40:52.144] 📊 After phase 4 (Post-processing):
[18:40:52.144] 📊   [VRAM] 0.01GB allocated / 13.45GB reserved / Peak: 0.30GB / 0.55GB free / 15.92GB total
[18:40:52.144] 📊   [RAM] 5.94GB process / 10.99GB others / 78.87GB free / 95.79GB total
[18:40:52.144] 📊 Resetting VRAM peak memory statistics
[18:40:52.144]
[18:40:52.205] 🎯 Converted output from torch.bfloat16 to float32
[18:40:52.206] ✅ Upscaling completed successfully!
[18:40:52.206] 🧹 Starting partial cleanup
[18:40:52.206] 🧹 Cleaning up DiT components
[18:40:52.208] 🧹 Clearing memory caches (deep)...
[18:40:52.402] 💾 Models cached for next run: DiT (seedvr2_3b_nvfp4.safetensors)
[18:40:52.402] ✅ Completed partial cleanup
[18:40:52.428] 📊 After all phases complete:
[18:40:52.437] 📊   [VRAM] 0.00GB allocated / 0.67GB reserved / Peak: 0.01GB / 13.93GB free / 15.92GB total
[18:40:52.437] 📊   [RAM] 6.51GB process / 10.96GB others / 78.32GB free / 95.79GB total
[18:40:52.437] 📊   Memory changes: RAM +0.58GB
[18:40:52.437] 📊 Resetting VRAM peak memory statistics
[18:40:52.438]
[18:40:52.438]  ────────────────────────
[18:40:52.438] 📊 Peak memory by phase:
[18:40:52.438] 📊   1. VAE encoding: VRAM 9.12GB allocated, 14.30GB reserved | RAM 5.36GB
[18:40:52.439] 📊   2. DiT upscaling: VRAM 11.86GB allocated, 14.32GB reserved | RAM 5.35GB
[18:40:52.439] 📊   3. VAE decoding: VRAM 10.47GB allocated, 14.32GB reserved | RAM 5.94GB
[18:40:52.439] 📊   4. Post-processing: VRAM 0.30GB allocated, 13.45GB reserved | RAM 6.51GB
[18:40:52.439] 📊 Overall peak: VRAM 11.86GB allocated, 14.32GB reserved | RAM 6.51GB
[18:40:52.439]
[18:40:52.439]  ────────────────────────
[18:40:52.440] ⚡ Total execution: 77.78s
[18:40:52.440] ⚡   └─ Video generation: 77.48s
[18:40:52.440] ⚡   └─   Phase 3: VAE decoding: 44.87s
[18:40:52.440] ⚡   └─   Phase 1: VAE encoding: 18.49s
[18:40:52.440] ⚡   └─   Phase 2: DiT upscaling: 13.83s
[18:40:52.440] ⚡   └─ Final cleanup: 0.22s
[18:40:52.440] ⚡   └─   Phase 4: Post-processing: 0.21s
[18:40:52.440] ⚡   └─ Model preparation: 0.06s
[18:40:52.440] ⚡ Average FPS: 1.93 frames/sec
[18:40:52.440]
[18:40:52.440]  ────────────────────────
[18:40:52.441] 💬 Questions? Updates? Watch, star & sponsor if you can!
[18:40:52.441] 🎬 https://www.youtube.com/@AInVFX
[18:40:52.441] ⭐💝 https://github.com/numz/ComfyUI-SeedVR2_VideoUpscaler
pop 1.mp4
Prompt executed in 78.96 seconds

@naxci1
Contributor Author

naxci1 commented Jan 12, 2026

run_nvidia_gpu_fast_fp16_accumulation - firefox.zip

You need to create the startup file like this for fp16 accumulation to become active.

@naxci1
Contributor Author

naxci1 commented Jan 12, 2026

@naxci1
Contributor Author

naxci1 commented Jan 12, 2026

image

Copilot AI and others added 8 commits January 13, 2026 10:20
Co-authored-by: naxci1 <206254294+naxci1@users.noreply.github.com>
Co-authored-by: naxci1 <206254294+naxci1@users.noreply.github.com>
….h when unavailable

Co-authored-by: naxci1 <206254294+naxci1@users.noreply.github.com>
Integrate NVIDIA GPU Optimizations: Async Offloading, Pinned Memory, torch.compile for Windows/Blackwell
…Memory, torch.compile for Windows/Blackwell"
…-gpu-memory

Revert "Integrate NVIDIA GPU Optimizations: Async Offloading, Pinned Memory, torch.compile for Windows/Blackwell"
Copilot AI and others added 11 commits January 14, 2026 12:21
Co-authored-by: naxci1 <206254294+naxci1@users.noreply.github.com>
…ell support

Co-authored-by: naxci1 <206254294+naxci1@users.noreply.github.com>
Co-authored-by: naxci1 <206254294+naxci1@users.noreply.github.com>
…ization-again

Integrate SpargeAttn/Sage2 block-sparse attention with Blackwell GPU optimizations
Co-authored-by: naxci1 <206254294+naxci1@users.noreply.github.com>
Co-authored-by: naxci1 <206254294+naxci1@users.noreply.github.com>
Co-authored-by: naxci1 <206254294+naxci1@users.noreply.github.com>
Co-authored-by: naxci1 <206254294+naxci1@users.noreply.github.com>
Co-authored-by: naxci1 <206254294+naxci1@users.noreply.github.com>
Add performance_mode dropdown for Blackwell-specific sparge_sage2 tuning
@dsouzaankit

I recommend building SageAttention 2 and 3 from source, since pip install points to the older SageAttention 1.

pip install packaging setuptools ninja

Install SageAttention 3:

cd ~
source ~/venv/bin/activate
git clone https://github.com/thu-ml/SageAttention.git
cd SageAttention/sageattention3_blackwell
python setup.py install
pip show sageattn3

Install SageAttention 2++ as a fallback:

cd ~
source ~/venv/bin/activate
cd ~/SageAttention
python setup.py install # or pip install -e .
pip show sageattention
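
Once built, a quick check from Python confirms which build actually imports (package names assumed from the `pip show` commands above):

```python
import importlib.util

def available_sageattention_builds() -> list[str]:
    """Importable SageAttention packages, preferred (newest) build first."""
    candidates = ["sageattn3", "sageattention"]
    return [name for name in candidates
            if importlib.util.find_spec(name) is not None]

print(available_sageattention_builds() or "no SageAttention build installed")
```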

@naxci1 naxci1 closed this Jan 17, 2026
@naxci1 naxci1 reopened this Jan 19, 2026
@naxci1
Contributor Author

naxci1 commented Jan 19, 2026

Version 2.5.23 used 15 GB of VRAM; this version uses 14 GB.

Processing speed was 1.8 fps in version 2.5.23 and is 1.91 fps in this version.

Although the NVFP4 model doesn't give me exactly the results I want, the other improvements are worth adding to the main code.
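
One of those other improvements is pinned-memory offloading. A minimal sketch of the idea (a hypothetical helper, not the PR's AsyncModelOffloader; pinned buffers are what let non_blocking device-to-host copies overlap with remaining GPU work):

```python
import torch

def offload_params_to_cpu(module: torch.nn.Module) -> None:
    """Rebind each parameter to a CPU buffer. When CUDA is present the
    buffer is pinned (page-locked), so the copy can run asynchronously."""
    pin = torch.cuda.is_available()  # pinned allocation needs a CUDA runtime
    for p in module.parameters():
        buf = torch.empty(p.shape, dtype=p.dtype, device="cpu", pin_memory=pin)
        buf.copy_(p.detach(), non_blocking=pin)
        p.data = buf
```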

@dsouzaankit

@naxci1 The VAE decoding step is the slowest. Can we replace its FP16 model with NVFP4 as well?

@naxci1
Contributor Author

naxci1 commented Jan 24, 2026

@naxci1 The VAE decoding step is the slowest. Can we replace its FP16 model with NVFP4 as well?

I tried many different methods, but unfortunately it didn't work with the NVFP4 VAE model. SeedVR2's DNA is designed very differently: it doesn't accept other VAE models, it absolutely has to use this VAE, which is unfortunately very slow. The main slowdown occurs during VAE decoding. I hope SeedVR3 comes out soon, with new code and a new VAE model; the code of this old model is very complex.
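
For context on why decoding dominates: the log shows each batch decoded as overlapping 736px tiles with a 32px overlap. The underlying tile arithmetic can be sketched generically (standard overlapped tiling, not SeedVR2's actual implementation):

```python
def tile_spans(size: int, tile: int, overlap: int) -> list[tuple[int, int]]:
    """Start/end offsets of overlapped tiles covering `size` pixels;
    the last tile is clamped so it ends exactly at the edge."""
    stride = tile - overlap
    spans, start = [], 0
    while True:
        end = min(start + tile, size)
        spans.append((max(end - tile, 0), end))
        if end >= size:
            return spans
        start += stride

# 960px-wide frames with 736px tiles and 32px overlap -> 2 tiles per row,
# matching the "Decoding 2 tiles" lines in the log:
print(tile_spans(960, 736, 32))  # [(0, 736), (224, 960)]
```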

@zelenooki87

@naxci1 are you 100% sure you implemented nvfp4 correctly? I'd like to try your fork on an RTX 5090 today, but I'm put off by the fact that you aren't getting any speedup...

@naxci1
Contributor Author

naxci1 commented Jan 30, 2026

https://huggingface.co/Nexus24/vaeGGUF/tree/main?show_file_info=vae_nvfp4_blackwell.safetensors

Here you can see the details of the model yourself; theoretically, everything is complete. Claude just couldn't fully optimize it, and it took me several days. The problem is that the SeedVR2 code is very complex: there are unnecessary repetitions, its DNA is very different, and the DiT and VAE are tightly interconnected, unlike in other models. In FlashVSR I can integrate what I want much more easily, but SeedVR2 really needs to be rewritten from scratch. That's why experts in this field need to get involved, and since they don't have the time, it remains unfinished. After all, I'm not a programmer; I'm just vibe coding.
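
For anyone inspecting that checkpoint: NVFP4 stores weights as 4-bit E2M1 codes with a higher-precision (E4M3) scale per small block of weights. A pure-Python decode sketch (the nibble order and block handling are my assumptions; the PR's nvfp4.py may lay the bits out differently):

```python
# Magnitudes representable by E2M1 (2 exponent bits, 1 mantissa bit),
# indexed by the low 3 bits of the code; bit 3 is the sign.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_e2m1(code: int) -> float:
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]

def dequantize_block(packed: bytes, scale: float) -> list[float]:
    """Two E2M1 codes per byte (low nibble first - an assumed layout),
    multiplied by the block's scale factor (stored as E4M3 in the
    checkpoint, already upcast to float here)."""
    out = []
    for byte in packed:
        out.append(decode_e2m1(byte & 0x0F) * scale)
        out.append(decode_e2m1(byte >> 4) * scale)
    return out

print(dequantize_block(bytes([0x21]), 2.0))  # [1.0, 2.0]
```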

@zelenooki87

@naxci1, I'm testing the latest commit with the Blackwell optimizations right now (the one without NVFP4, since that's on another branch).
Mate, I have to tell you: generation literally flies now.
Performance on the RTX 5090 is much faster, and the autoencoder part in particular is way quicker.

@naxci1
Contributor Author

naxci1 commented Feb 3, 2026

@zelenooki87

Which one do you mean? Please post the link.

@zelenooki87

@naxci1
Contributor Author

naxci1 commented Feb 3, 2026

@zelenooki87 Thanks

https://github.com/naxci1/ComfyUI-SeedVR2.5/blob/copilot/optimize-vae-performance-windows-50xx/docs/OPTIMIZATION_README.md

I had left this unfinished and forgot about it.

@naxci1
Contributor Author

naxci1 commented Feb 3, 2026

@zelenooki87

I just tested it with the 3bQ8 model and the speed was the same for me. I have a 5070 Ti GPU and you have a 5090. What FPS difference did you see between the old and new versions?

@zelenooki87

zelenooki87 commented Feb 9, 2026

@naxci1 , how is the nvfp4 integration going? I saw you created a new fork/repo with the -new suffix.
I'm actually testing nvfp4 right now and will report back with benchmarks against GGUF models shortly.
In the meantime, could you let me know how to load vae_nvfp4_blackwell.safetensors?
Thanks!

@naxci1
Contributor Author

naxci1 commented Feb 9, 2026

Hi @zelenooki87

Actually, I've integrated and tested many new methods and technologies, but the tests take a very long time, and that leaves me frustrated, so I leave them unfinished. Unfortunately, AI isn't very smart when it comes to coding; sometimes it claims to have done something, but when you look, it hasn't actually done it, which is why the tests take so long. There are many branches in my repository, all of which I've left unfinished: I abandon them when I don't get the results I want in testing, and I still haven't achieved the speed I want.

SeedVR2 really needs to be rewritten; it's very old and uses outdated methods, especially ones optimized for the H100 GPU. Changing some methods therefore creates other problems, leading to wasted time. @IceClear actually wrote that he would be making SeedVR3; it would have been better if he had rewritten it, with the code and methods changed completely. If it had been built for RTX tensor cores, and especially if a new VAE had been chosen, integrating new features would have been much easier.

Honestly, I don't have much time these days either. It's easier to fix code in FlashVSR; I can do everything in an hour, but integrating new features into SeedVR2 takes days. The code's DNA is so complex that even Claude can't handle it.

#164

@kotn3l

kotn3l commented Feb 11, 2026

I know I shouldn't be talking as a 9070 XT user here, but the "old" seedvr2_7b_nvfp4.safetensors model is working with ROCm, though none of the new ones (like seedvr2_nvfp4_blackwell.safetensors) do. It also seems to execute faster than the current 7B FP16 model.

@naxci1
Contributor Author

naxci1 commented Feb 11, 2026

Actually, I could convert those models again; it only takes a minute to convert, but it didn't give me the speed I wanted, so I deleted them. The highest quality and fastest model is the 3bQ8; I recommend using that one.

@kotn3l

kotn3l commented Feb 11, 2026

Actually, I could convert those models again; it only takes a minute to convert, but it didn't give me the speed I wanted, so I deleted them. The highest quality and fastest model is the 3bQ8; I recommend using that one.

Which model is that? Is it named differently in your HuggingFace repo? Also, if you could reconvert and upload them somewhere, I'd love to test them again. Thanks!

@naxci1
Contributor Author

naxci1 commented Feb 11, 2026

This is SeedVR2's own model; it downloads as standard, and it's listed in the model selection as the 3B Q8 model.
