DreamBooth is a method for personalizing text-to-image models such as Stable Diffusion using just a few (3 to 5) images of a subject.
The train_dreambooth_flux.py script, which shows how to implement the training procedure and adapt it for FLUX.1 [dev], is a work in progress. We currently provide a LoRA implementation in the train_dreambooth_lora_flux.py script.
Note
Gated model
As the model is gated, before using it with diffusers you first need to go to the FLUX.1 [dev] Hugging Face page, fill in the form and accept the gate. Once you are in, you need to log in so that your system knows you’ve accepted the gate. Use the command below to log in:
huggingface-cli login
This will also allow us to push the trained model parameters to the Hugging Face Hub platform.
Before running the scripts, make sure to install the library's training dependencies:
Important
To make sure you can successfully run the latest versions of the example scripts, we highly recommend installing from source and keeping the install up to date as we update the example scripts frequently and install some example-specific requirements. To do this, execute the following steps in a new virtual environment:
git clone https://github.com/mindspore-lab/mindone
cd mindone
pip install -e .[training]
Then cd into the examples/diffusers/dreambooth folder.
Now let's get our dataset. For this example we will use some dog images: https://huggingface.co/datasets/diffusers/dog-example.
Let's first download it locally:
from huggingface_hub import snapshot_download
local_dir = "./dog"
snapshot_download(
"diffusers/dog-example",
local_dir=local_dir, repo_type="dataset",
ignore_patterns=".gitattributes",
)
To better track our training experiments, we use the following flags in the training command below:
validation_prompt and validation_epochs allow the script to do a few validation inference runs. This lets us qualitatively check whether training is progressing as expected.
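Because validation_epochs counts epochs rather than optimizer steps, it can help to estimate how many validation runs a job will trigger. The sketch below is a rough estimate under two assumptions that this guide does not confirm: that the script derives epochs the way the upstream diffusers examples usually do (one epoch is ceil(batches / gradient_accumulation_steps) optimizer steps), and that the dog dataset contains 5 images. The helper name estimate_validation_runs is hypothetical.

```python
import math


def estimate_validation_runs(num_images, train_batch_size, grad_accum_steps,
                             max_train_steps, validation_epochs):
    """Rough count of in-training validation runs (assumed epoch bookkeeping)."""
    # Batches the dataloader yields per pass over the data.
    batches_per_epoch = math.ceil(num_images / train_batch_size)
    # Optimizer updates per epoch once gradients are accumulated.
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    # Epochs needed to reach the requested number of optimizer updates.
    num_epochs = math.ceil(max_train_steps / updates_per_epoch)
    # One validation run every `validation_epochs` epochs.
    return num_epochs // validation_epochs


# Values taken from the LoRA command in this guide; num_images is an assumption.
runs = estimate_validation_runs(
    num_images=5,
    train_batch_size=1,
    grad_accum_steps=4,
    max_train_steps=500,
    validation_epochs=25,
)
print(f"approximately {runs} validation runs")
```

If the estimate is too high or too low for your run, adjusting validation_epochs is the most direct lever.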
Note
If you want to train using long prompts with the T5 text encoder, you can use --max_sequence_length to set the token limit. The default is 77, but it can be increased to as high as 512. Note that this will use more resources and may slow down the training in some cases.
LoRA is a popular parameter-efficient fine-tuning technique that achieves performance comparable to full fine-tuning while training only a fraction of the parameters.
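To make the "fraction of the parameters" concrete, here is a minimal sketch of the parameter count for a single linear layer. The layer size (3072 x 3072) and the rank (4) are illustrative assumptions, not values read from the training script, and lora_param_counts is a hypothetical helper.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one linear layer.

    LoRA freezes the original weight W (d_out x d_in) and learns a low-rank
    update B @ A, with B of shape (d_out, rank) and A of shape (rank, d_in).
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


# Hypothetical layer size and rank, chosen only for illustration.
full, lora = lora_param_counts(d_out=3072, d_in=3072, rank=4)
print(f"full fine-tuning: {full:,} parameters")
print(f"LoRA (rank 4):    {lora:,} parameters ({lora / full:.2%} of full)")
```

The trainable count grows linearly with the rank, so a higher rank trades memory for capacity.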
To perform DreamBooth with LoRA, run:
export MODEL_NAME="black-forest-labs/FLUX.1-dev"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="trained-flux-lora"
python train_dreambooth_lora_flux.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--output_dir=$OUTPUT_DIR \
--mixed_precision="bf16" \
--instance_prompt="a photo of sks dog" \
--resolution=512 \
--train_batch_size=1 \
--guidance_scale=1 \
--gradient_accumulation_steps=4 \
--optimizer="AdamW" \
--learning_rate=1e-4 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=500 \
--validation_prompt="A photo of sks dog in a bucket" \
--validation_epochs=25 \
--seed="0"
Alongside the transformer, fine-tuning of the CLIP text encoder is also supported.
To do so, just specify --train_text_encoder while launching training. Please keep the following points in mind:
Note
This is still an experimental feature.
FLUX.1 has 2 text encoders (CLIP L/14 and T5-v1.1-XXL).
By enabling --train_text_encoder, fine-tuning of the CLIP encoder is performed.
At the moment, T5 fine-tuning is not supported and weights remain frozen when text encoder training is enabled.
To perform DreamBooth LoRA with text-encoder training, run:
export MODEL_NAME="black-forest-labs/FLUX.1-dev"
export OUTPUT_DIR="trained-flux-dev-dreambooth-lora"
python train_dreambooth_lora_flux.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--output_dir=$OUTPUT_DIR \
--mixed_precision="bf16" \
--train_text_encoder \
--instance_prompt="a photo of sks dog" \
--resolution=512 \
--train_batch_size=1 \
--guidance_scale=1 \
--gradient_accumulation_steps=4 \
--optimizer="AdamW" \
--learning_rate=1e-4 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=500 \
--validation_prompt="A photo of sks dog in a bucket" \
--seed="0"
As mentioned, Flux DreamBooth LoRA training is very memory intensive. Here are some options (some still experimental) for more memory-efficient training.
An easy way to mitigate some of the memory requirements is through --resolution. This flag sets the resolution of the input images; all images in the train/validation dataset are resized to it.
Note that images are resized to a resolution of 512 by default. Keep this in mind if you're accustomed to training at higher resolutions.
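As a rough illustration of why resolution matters, the sketch below counts the image tokens the transformer would process per sample. It assumes FLUX.1's usual layout of 8x VAE downsampling followed by 2x2 latent patches; both factors are assumptions rather than values taken from the training script, and flux_image_tokens is a hypothetical helper.

```python
def flux_image_tokens(resolution: int, vae_downscale: int = 8,
                      patch_size: int = 2) -> int:
    """Number of image tokens for a square input (assumed FLUX.1 layout)."""
    # Each token covers a (vae_downscale * patch_size)-pixel square.
    side = resolution // (vae_downscale * patch_size)
    return side * side


for res in (512, 768, 1024):
    print(f"--resolution={res}: {flux_image_tokens(res)} image tokens")
```

Under these assumptions, doubling the resolution quadruples the token count, and activation memory grows at least in proportion.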
--gradient_accumulation_steps sets the number of steps over which gradients are accumulated before an optimizer update is performed. Passing a value > 1 reduces the number of update passes and lets you keep a small, memory-friendly per-step batch size while still training with a larger effective batch size.
With --gradient_checkpointing we can save memory by not storing all intermediate activations during the forward pass. Instead, only a subset of these activations (the checkpoints) is stored, and the rest are recomputed as needed during the backward pass. Note that this comes at the expense of a slower backward pass.
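The idea behind gradient accumulation can be checked on a toy problem. The sketch below is not the training script's implementation; it uses a hypothetical one-parameter model (y_hat = w * x) in plain Python to show that four accumulated micro-batches of size 1 produce the same update as one batch of size 4.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error for the toy model y_hat = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5
lr = 0.01
accum_steps = 4

# One large batch: the gradient is computed over all four samples at once.
full_batch_grad = grad_mse(w, xs, ys)

# Four micro-batches of size 1: scale each gradient by 1 / accum_steps,
# sum them, and only then apply a single optimizer update.
accumulated_grad = 0.0
for x, y in zip(xs, ys):
    accumulated_grad += grad_mse(w, [x], [y]) / accum_steps

w_full = w - lr * full_batch_grad
w_accum = w - lr * accumulated_grad
print(full_batch_grad, accumulated_grad)
print(w_full, w_accum)
```

Only one micro-batch needs to be held in memory at a time, which is where the saving comes from.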
By default, trained transformer layers are saved in the dtype in which training was performed. For example, when mixed-precision training is enabled with --mixed_precision="bf16", the final fine-tuned layers are saved in mindspore.bfloat16 as well.
This reduces memory requirements significantly without a significant loss in quality. If you do wish to save the final layers in float32, at the expense of more memory usage, you can do so by passing --upcast_before_saving.
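For a sense of scale, the sketch below estimates the storage needed for the saved weights in each dtype. The parameter count is a hypothetical placeholder, not a measurement of an actual FLUX.1 LoRA, and checkpoint_size_mib is an illustrative helper rather than part of the script.

```python
# Storage per element for the dtypes discussed above.
BYTES_PER_ELEMENT = {"bfloat16": 2, "float32": 4}


def checkpoint_size_mib(num_params: int, dtype: str) -> float:
    """Approximate size of the raw weights in MiB (ignores file metadata)."""
    return num_params * BYTES_PER_ELEMENT[dtype] / (1024 ** 2)


num_lora_params = 10_000_000  # hypothetical count, for illustration only
for dtype in ("bfloat16", "float32"):
    size = checkpoint_size_mib(num_lora_params, dtype)
    print(f"{dtype}: {size:.1f} MiB")
```

Upcasting to float32 therefore roughly doubles the size of the saved weights.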