The text encoder is Gemma2-2B. Text and image tokens are processed by lightweight single-stream blocks, and Multimodal RoPE (mRoPE) is used to model the joint text-image sequence.
Optimizer Hyperparameters
Optimizer: AdamW
Learning Rate: 2×10⁻⁴ (2e-4) for all three training stages (Low Res., High Res., HQ Tuning).
Flow Matching Hyperparameters
The paper mentions an auxiliary loss that computes the flow-matching objective during high-resolution training. No further flow-matching hyperparameters are listed here.
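For orientation, a flow-matching objective can be sketched as below. This is a minimal sketch of the common linear-interpolation (rectified-flow style) formulation, not the paper's auxiliary loss: the interpolation path, the target `data - noise`, and the names `flow_matching_loss` and `predict_velocity` are all assumptions introduced for illustration.

```python
from typing import Callable, List

Vector = List[float]


def flow_matching_loss(
    predict_velocity: Callable[[Vector, float], Vector],
    data: Vector,
    noise: Vector,
    t: float,
) -> float:
    """Mean squared error between a predicted and a target velocity.

    Assumed convention (not taken from the source): the sample at time t
    lies on a straight line from noise (t=0) to data (t=1), so the target
    velocity is constant along the path and equals data - noise.
    """
    # Point on the straight path between noise and data.
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    # Velocity of that path, which the model is trained to regress.
    target = [d - n for n, d in zip(noise, data)]
    pred = predict_velocity(x_t, t)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(data)


# Toy usage with a hypothetical predictor that always outputs zeros.
data = [1.0, 2.0]
noise = [0.0, 0.0]
zero_model = lambda x_t, t: [0.0 for _ in x_t]
loss = flow_matching_loss(zero_model, data, noise, t=0.5)
```

In a real training loop `noise` would be drawn from a standard Gaussian and `t` sampled per example; both are passed in explicitly here so the function stays deterministic.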
Training Hyperparameters
The training process is divided into three stages with the following configurations:
Low Resolution Stage:
Image Resolution: 256×256
#Images: 100M
Training Steps: 144K
Batch Size: 1024
GPU Days (A100): 191
High Resolution Stage:
Image Resolution: 1024×1024
#Images: 10M
Training Steps: 40K
Batch Size: 512
GPU Days (A100): 176
HQ Tuning Stage:
Image Resolution: 1024×1024
#Images: 1M
Training Steps: 15K
Batch Size: 512
GPU Days (A100): 224
The training was performed on 32 A100 GPUs for all stages.
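The three stages can be collected into a small config to make the arithmetic behind the numbers explicit. This is a sketch only: the `Stage` class and the helper names are hypothetical, and the wall-clock estimate assumes all 32 GPUs are busy for the whole stage.

```python
from dataclasses import dataclass

NUM_GPUS = 32  # from the summary: 32 A100 GPUs for all stages


@dataclass(frozen=True)
class Stage:
    name: str
    resolution: int       # square images, pixels per side
    num_images: int       # dataset size for the stage
    steps: int            # optimizer steps
    batch_size: int
    gpu_days: int         # A100 GPU days
    learning_rate: float = 2e-4  # AdamW, same value in every stage


STAGES = [
    Stage("Low Res.", 256, 100_000_000, 144_000, 1024, 191),
    Stage("High Res.", 1024, 10_000_000, 40_000, 512, 176),
    Stage("HQ Tuning", 1024, 1_000_000, 15_000, 512, 224),
]


def samples_seen(stage: Stage) -> int:
    """Training samples processed: steps times batch size."""
    return stage.steps * stage.batch_size


def wall_clock_days(stage: Stage, num_gpus: int = NUM_GPUS) -> float:
    """Rough wall-clock duration, assuming perfect use of all GPUs."""
    return stage.gpu_days / num_gpus


total_gpu_days = sum(s.gpu_days for s in STAGES)

for s in STAGES:
    print(f"{s.name}: {samples_seen(s):,} samples, "
          f"~{wall_clock_days(s):.1f} days on {NUM_GPUS} GPUs")
```

Under these assumptions the listed GPU days add up to 591, or roughly 18.5 days of wall-clock time on 32 GPUs.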
Inference Hyperparameters
For efficient inference, several techniques are discussed:
Classifier-Free Guidance (CFG): The paper discusses the use of CFG-Renormalization and CFG-Truncation.
CFG-Truncation threshold (a): A predefined threshold for switching off conditional velocity calculation. The specific value is not mentioned.
Flow-DPM-Solver (FDPM): Achieves convergence in 14-20 NFEs (Number of Function Evaluations).
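The two CFG variants above can be illustrated with a small sketch. Everything here is an assumption made for illustration: renormalization is taken to mean rescaling the guided velocity to the norm of the conditional velocity, truncation is modelled as a fraction of sampling steps after which only one model branch is evaluated, and a guided step is counted as two forward passes. The function names are hypothetical, and the sketch does not say which branch survives truncation.

```python
import math
from typing import List

Vector = List[float]


def l2_norm(v: Vector) -> float:
    return math.sqrt(sum(x * x for x in v))


def cfg(v_cond: Vector, v_uncond: Vector, scale: float) -> Vector:
    """Standard classifier-free guidance on velocities."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def renormalize(v_guided: Vector, v_cond: Vector) -> Vector:
    """Rescale the guided velocity to the conditional velocity's norm.

    Assumption: this is one plausible reading of CFG-Renormalization.
    """
    guided_norm = l2_norm(v_guided)
    if guided_norm == 0.0:
        return list(v_guided)
    factor = l2_norm(v_cond) / guided_norm
    return [x * factor for x in v_guided]


def guidance_active(step: int, num_steps: int, threshold: float) -> bool:
    """Assumption: guidance runs while the step fraction is <= threshold."""
    return (step + 1) / num_steps <= threshold


def count_forward_passes(num_steps: int, threshold: float) -> int:
    """Two passes per guided step, one per truncated step (assumed)."""
    return sum(
        2 if guidance_active(i, num_steps, threshold) else 1
        for i in range(num_steps)
    )


# Toy usage with made-up velocities and a made-up threshold of 0.5.
v_cond, v_uncond = [3.0, 4.0], [0.0, 0.0]
guided = renormalize(cfg(v_cond, v_uncond, scale=2.0), v_cond)
passes_full = count_forward_passes(20, threshold=1.0)
passes_truncated = count_forward_passes(20, threshold=0.5)
```

The point of the pass count is only to show where the savings come from: with guidance on for half of 20 steps, this accounting needs 30 forward passes instead of 40.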