Fix fp16 LoRA unscale crash after validation in train_dreambooth_lora.py (#13895)

HaozheZhang6 · sayakpaul · web-flow · commit e377c0a4ab09 · 2026-06-09T16:13:08.000+05:30
When training with `--mixed_precision="fp16"` and `--validation_prompt`, the first optimizer step after a validation run fails with `ValueError: Attempting to unscale FP16 gradients`.

Under fp16, `cast_training_params` keeps the trainable LoRA params in fp32. The in-loop validation pipeline is built with the same live `unet` object, and `log_validation` then calls `pipeline.to(device, dtype=torch_dtype)`, which downcasts those fp32 LoRA params back to fp16. The next backward therefore produces fp16 grads and `GradScaler.unscale_` raises.

Drop the dtype cast from that `.to(...)` so the shared `unet` keeps its fp32 LoRA params. This matches train_dreambooth_lora_sdxl.py, which moves the validation pipeline with `.to(accelerator.device)` only.

Fixes #13124

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
diff --git a/examples/dreambooth/train_dreambooth_lora.py b/examples/dreambooth/train_dreambooth_lora.py
@@ -147,7 +147,11 @@ def log_validation(
 
     pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config, **scheduler_args)
 
-    pipeline = pipeline.to(accelerator.device, dtype=torch_dtype)
+    # Don't pass `dtype` here: under fp16 the trainable LoRA params are kept in fp32 (see
+    # `cast_training_params` above) and the validation pipeline shares the training `unet`, so casting it
+    # to fp16 would break the next optimizer step ("Attempting to unscale FP16 gradients"). Matches the
+    # SDXL script.
+    pipeline = pipeline.to(accelerator.device)
     pipeline.set_progress_bar_config(disable=True)
 
     # run inference
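
The downcast described in the commit message can be reproduced in isolation with plain PyTorch. The sketch below is illustrative only and is not part of the patch: `ToyLoRALayer` and `trainable_dtypes` are hypothetical names, and a bare `nn.Module` stands in for the shared `unet`, on the assumption that moving the pipeline ends up calling `.to(...)` on each of its modules. It shows that a device-only `.to(...)` leaves the fp32 trainable params alone, while passing `dtype=torch.float16` casts them in place.

```python
import torch
from torch import nn


class ToyLoRALayer(nn.Module):
    """Hypothetical stand-in for a LoRA-wrapped layer inside the shared unet."""

    def __init__(self):
        super().__init__()
        # Frozen base weight, stored in fp16 as it would be under mixed precision.
        self.base = nn.Linear(4, 4, bias=False)
        self.base.requires_grad_(False)
        self.base.to(dtype=torch.float16)
        # Trainable LoRA factors, kept in fp32 (the state the training script relies on).
        self.lora_A = nn.Linear(4, 2, bias=False)
        self.lora_B = nn.Linear(2, 4, bias=False)


def trainable_dtypes(module):
    """Return the set of dtypes among parameters that require gradients."""
    return {p.dtype for p in module.parameters() if p.requires_grad}


layer = ToyLoRALayer()
before = trainable_dtypes(layer)

# Device-only move: parameter dtypes are untouched.
layer.to("cpu")
after_device_only = trainable_dtypes(layer)

# Move with a dtype: every floating-point parameter is cast in place,
# including the trainable ones that were meant to stay in fp32.
layer.to("cpu", dtype=torch.float16)
after_dtype_cast = trainable_dtypes(layer)

print(before, after_device_only, after_dtype_cast)
```

Because `nn.Module.to` modifies the module in place, any other reference to the same object (here, the training loop's `unet`) sees the fp16 params afterwards, which is why the next backward pass yields fp16 grads.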