This repository was archived by the owner on Dec 14, 2023. It is now read-only.

Commit 2a09bb7

Merge pull request #90 from ExponentialML/feat/stable_lora
Stable LoRA addition and webui text2video extension support.
2 parents 99c9fd9 + 51d910b commit 2a09bb7

8 files changed

Lines changed: 1063 additions & 152 deletions

File tree

README.md

Lines changed: 13 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -4,6 +4,7 @@
44
[output.webm](https://user-images.githubusercontent.com/59846140/230748413-fe91e90b-94b9-49ea-97ec-250469ee9472.webm)
55

66
### Updates
7+
- **2023-7-12**: You can now train a LoRA that is compatible with the [webui extension](https://github.com/kabachuha/sd-webui-text2video)! See instructions [here](https://github.com/ExponentialML/Text-To-Video-Finetuning/edit/feat/stable_lora/README.md#training-a-lora).
78
- **2023-4-17**: You can now convert your trained models from diffusers to `.ckpt` format for the A1111 webui. Thanks @kabachuha!
89
- **2023-4-8**: LoRA Training released! Check out `configs/v2/lora_training_config.yaml` for instructions.
910
- **2023-4-8**: Version 2 is released!
@@ -46,15 +47,13 @@ It is **highly recommended** to install >= Torch 2.0. This way, you don't have t
4647

4748
If you don't have Xformers enabled, you can follow the instructions here: https://github.com/facebookresearch/xformers
4849

49-
5050
An RTX 3090 is recommended, but you should be able to train on GPUs with <= 16GB of VRAM with:
5151
- Validation turned off.
5252
- Xformers or Torch 2.0 Scaled Dot-Product Attention
5353
- Gradient checkpointing enabled.
5454
- Resolution of 256.
5555
- Enable all LoRA options.
5656

57-
5857
## Running inference
5958
The `inference.py` script can be used to render videos with trained checkpoints.
6059

@@ -164,6 +163,18 @@ Then, follow each line and configure it for your specific use case.
164163

165164
The instructions should be clear enough to get you up and running with your dataset, but feel free to ask any questions in the discussion board.
166165

166+
## Training a LoRA
167+
You can also train a LoRA that is compatible with the webui extension. By default, `lora_version` is set to 'cloneofsimo', which was the first LoRA implementation for Stable Diffusion.
168+
You can use this version with the `inference.py` script in this repository. It is **not** compatible with the webui.
169+
170+
To use a LoRA with the webui, change `lora_version` to "stable_lora" in your config. This will train a LoRA that is compatible with the [A1111 webui extension](https://github.com/kabachuha/sd-webui-text2video).
171+
You can get started at `configs/v2/stable_lora_config.yaml` and edit it from there. During and after training, LoRAs will be saved in your outputs directory with the prefix `_webui`.
172+
173+
### What you cannot do:
174+
- Use LoRA files that were made for SD image models in other trainers.
175+
- Use 'cloneofsimo' LoRAs in another project (unless you build it or create a PR)
176+
- Merge LoRA weights together (yet).
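
The constraints in this section can be checked before starting a training run. The following is a minimal sketch, not code from this repository: `check_lora_config` is a hypothetical helper, and the config is passed as a plain dict rather than loaded from YAML. Only the key names `lora_version` and `save_lora_for_webui` are taken from the config files in this commit.

```python
VALID_LORA_VERSIONS = ("cloneofsimo", "stable_lora")


def check_lora_config(config: dict) -> list:
    """Return human-readable warnings for a LoRA config dict (hypothetical helper)."""
    warnings = []
    version = config.get("lora_version")

    # Only two LoRA implementations are named in the config comments.
    if version not in VALID_LORA_VERSIONS:
        warnings.append(
            f"lora_version should be one of {VALID_LORA_VERSIONS}, got {version!r}"
        )

    # The webui export is documented to work only with 'stable_lora'.
    if config.get("save_lora_for_webui") and version != "stable_lora":
        warnings.append(
            "save_lora_for_webui only works when lora_version is 'stable_lora'"
        )
    return warnings


ok = check_lora_config({"lora_version": "stable_lora", "save_lora_for_webui": True})
mismatch = check_lora_config({"lora_version": "cloneofsimo", "save_lora_for_webui": True})
```

Here `ok` is empty and `mismatch` holds one warning, which is a cheap way to catch a config that would train a LoRA the webui cannot load.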
177+
167178
## Finetune.
168179
```python
169180
python train.py --config ./configs/v2/train_config.yaml

configs/v2/stable_lora_config.yaml

Lines changed: 212 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,212 @@
1+
# Pretrained diffusers model path.
2+
pretrained_model_path: "./models/model_scope_diffusers/" #https://huggingface.co/damo-vilab/text-to-video-ms-1.7b/tree/main
3+
4+
# The folder where your training outputs will be placed.
5+
output_dir: "./outputs"
6+
7+
# You can train multiple datasets at once. They will be joined together for training.
8+
# Simply remove the line you don't need, or keep them all for mixed training.
9+
10+
# 'image': A folder of images and captions (.txt)
11+
# 'folder': A folder of videos and captions (.txt)
12+
# 'json': The JSON file created with automatic BLIP2 captions using https://github.com/ExponentialML/Video-BLIP2-Preprocessor
13+
# 'single_video': A single video file.mp4 and text prompt
14+
dataset_types:
15+
- 'image'
16+
- 'folder'
17+
- 'json'
18+
- 'single_video'
19+
20+
# Adds offset noise to training. See https://www.crosslabs.org/blog/diffusion-with-offset-noise
21+
# If this is enabled, rescale_schedule will be disabled.
22+
offset_noise_strength: 0.1
23+
use_offset_noise: False
24+
25+
# Uses schedule rescale, also known as the "better" offset noise. See https://arxiv.org/pdf/2305.08891.pdf
26+
# If this is enabled, offset noise will be disabled.
27+
rescale_schedule: False
28+
29+
# When True, this extends all items in all enabled datasets to the highest length.
30+
# For example, if you have 200 videos and 10 images, 10 images will be duplicated to the length of 200.
31+
extend_dataset: False
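
The behavior described for `extend_dataset` can be pictured as repeating each shorter dataset until it matches the longest one. This is a sketch under assumptions: `extend_to_longest` is a hypothetical name, and cycling through the items in order is one plausible way to duplicate them; the trainer may do it differently.

```python
from itertools import cycle, islice


def extend_to_longest(datasets: dict) -> dict:
    """Repeat the items of each dataset until all match the longest one."""
    longest = max(len(items) for items in datasets.values())
    return {
        name: list(islice(cycle(items), longest))
        for name, items in datasets.items()
    }


extended = extend_to_longest({
    "folder": [f"video_{i}.mp4" for i in range(200)],
    "image": [f"image_{i}.png" for i in range(10)],
})
```

With 200 videos and 10 images, both lists in `extended` end up with 200 entries, and the images repeat every 10 items.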
32+
33+
# Caches the latents (Frames-Image -> VAE -> Latent) to an HDD or SSD.
34+
# The latents will be saved under your training folder, and loaded automatically for training.
35+
# This saves memory, speeds up training, and takes very little disk space.
36+
cache_latents: True
37+
38+
# If you have cached latents set to `True` and have a directory of cached latents,
39+
# you can skip the caching process and load previously saved ones.
40+
cached_latent_dir: null #/path/to/cached_latents
41+
42+
# https://github.com/cloneofsimo/lora (NOT Compatible with webui extension)
43+
# This is the first, original implementation of LoRA by cloneofsimo.
44+
# Use this version if you want to maintain compatibility with the original version.
45+
46+
# https://github.com/ExponentialML/Stable-LoRA/tree/main (Compatible with webui text2video extension)
47+
# This is an implementation based off of the original LoRA repository by Microsoft, and the default LoRA method here.
48+
# It works a bit differently, using embeddings instead of the intermediate activations (Linear || Conv).
49+
# This means that there isn't an extra function when doing low-rank adaptation.
50+
# It solely saves the weight differential between the initialized weights and updates.
51+
52+
# "cloneofsimo" or "stable_lora"
53+
lora_version: "stable_lora"
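
The "weight differential" idea in the comments above can be illustrated with plain matrices. This is a toy sketch, not the Stable-LoRA code: the helper names and values are made up for illustration, and the usual LoRA scaling factor is left out. A rank-1 update `B @ A` is added to the initial weight, and what would be saved is the difference between the updated and initial weights.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def elementwise(a, b, op):
    """Apply a binary operation to two same-shaped matrices."""
    return [[op(x, y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


w_init = [[1.0, 0.0], [0.0, 1.0]]  # initialized weight (2x2)
lora_b = [[0.5], [0.25]]           # 2x1 factor
lora_a = [[2.0, 4.0]]              # 1x2 factor, so the update has rank 1

delta = matmul(lora_b, lora_a)                               # low-rank update
w_updated = elementwise(w_init, delta, lambda x, y: x + y)   # W' = W + B @ A
saved = elementwise(w_updated, w_init, lambda x, y: x - y)   # differential to store
```

Storing only `saved` keeps the file small relative to a full checkpoint, since it can be re-applied to the same base weights later.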
54+
55+
# Use LoRA for the UNET model.
56+
use_unet_lora: True
57+
58+
# Use LoRA for the Text Encoder. If this is set, the text encoder for the model will not be trained.
59+
use_text_lora: True
60+
61+
# LoRA Dropout. This parameter sets the probability of randomly zeroing out elements. Helps prevent overfitting.
62+
# See: https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
63+
lora_unet_dropout: 0.1
64+
65+
lora_text_dropout: 0.1
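
For readers unfamiliar with dropout, the following pure-Python sketch mirrors what `torch.nn.Dropout` does in training mode: each element is zeroed with probability `p`, and the surviving elements are scaled by `1 / (1 - p)`. The function name and the list-based input are illustrative assumptions; the trainer itself would rely on the PyTorch layer.

```python
import random


def dropout(values, p, rng):
    """Zero each element with probability p and rescale the rest by 1 / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]


rng = random.Random(0)
activations = [1.0] * 1000
dropped = dropout(activations, p=0.1, rng=rng)
```

With `p=0.1`, roughly one element in ten is zeroed on each pass, which is the value both `lora_unet_dropout` and `lora_text_dropout` use in this config.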
66+
67+
# https://github.com/kabachuha/sd-webui-text2video
68+
# This saves a LoRA that is compatible with the text2video webui extension.
69+
# It only works when the lora version is 'stable_lora'.
70+
# This is also a DIFFERENT implementation than Kohya's, so it will NOT work with that implementation.
71+
save_lora_for_webui: True
72+
73+
# The LoRA file will be converted to a different format to be compatible with the webui extension.
74+
# The difference between this and 'save_lora_for_webui' is that you can continue training a Diffusers pipeline model
75+
# when this option is set to False.
76+
only_lora_for_webui: False
77+
78+
# Choose whether or not to save the full pretrained model weights for both checkpoints and after training.
79+
# The only time you want this off is if you're doing full LoRA training.
80+
save_pretrained_model: False
81+
82+
# The modules to use for LoRA. Advanced usage.
83+
unet_lora_modules:
84+
- "UNet3DConditionModel" # Defaults to training the entire UNET.
85+
#- "ResnetBlock2D"
86+
#- "TransformerTemporalModel"
87+
#- "Transformer2DModel"
88+
#- "CrossAttention"
89+
#- "Attention"
90+
#- "GEGLU"
91+
#- "TemporalConvLayer"
92+
93+
text_encoder_lora_modules:
94+
- "CLIPEncoderLayer" # Defaults to training the entire Text Encoder.
95+
#- "CLIPAttention"
96+
97+
# The rank for LoRA training. With ModelScope, the maximum should be 1024.
98+
# VRAM usage increases with a higher rank and decreases with a lower rank.
99+
lora_rank: 16
100+
101+
# Training data parameters
102+
train_data:
103+
104+
# The width and height in which you want your training data to be resized to.
105+
width: 384
106+
height: 384
107+
108+
# This will find the closest aspect ratio to your input width and height.
109+
# For example, 512x512 width and height with a video of resolution 1280x720 will be resized to 512x256
110+
use_bucketing: True
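
The 1280x720 to 512x256 example above can be reproduced with a small calculation. This is a sketch under stated assumptions, not the repository's bucketing code: `closest_bucket` is a hypothetical name, and snapping each side down to a multiple of 64 is an assumption chosen because it matches the example in the comment.

```python
def closest_bucket(target_w, target_h, video_w, video_h, multiple=64):
    """Fit the video's aspect ratio inside the target box, then snap down to `multiple`."""
    if video_w * target_h >= video_h * target_w:
        # Video is at least as wide as the target box: width is the limiting side.
        w = target_w
        h = video_h * target_w // video_w
    else:
        # Video is taller than the target box: height is the limiting side.
        h = target_h
        w = video_w * target_h // video_h

    # Snap both sides down to the nearest multiple, never going below one step.
    w = max(multiple, w // multiple * multiple)
    h = max(multiple, h // multiple * multiple)
    return w, h


landscape = closest_bucket(512, 512, 1280, 720)
portrait = closest_bucket(512, 512, 720, 1280)
```

Here `landscape` is `(512, 256)`, matching the comment above, and the portrait case gives the mirrored `(256, 512)`.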
111+
112+
# The start frame index where your videos should start (Leave this at one for json and folder based training).
113+
sample_start_idx: 1
114+
115+
# Used for 'folder'. The rate at which your frames are sampled. Does nothing for 'json' and 'single_video' dataset.
116+
fps: 24
117+
118+
# For 'single_video' and 'json'. The number of frames to "step" (1,2,3,4) (frame_step=2) -> (1,3,5,7, ...).
119+
frame_step: 1
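
The stepping pattern in the `frame_step` comment can be written out directly. The helper below is illustrative only: `sample_frame_indices` is a hypothetical name, and it ignores what happens when the indices run past the end of the video, which this config does not describe.

```python
def sample_frame_indices(start_idx, frame_step, n_sample_frames):
    """Return the frame indices picked when stepping through a video."""
    return [start_idx + i * frame_step for i in range(n_sample_frames)]


every_frame = sample_frame_indices(start_idx=1, frame_step=1, n_sample_frames=4)
every_other = sample_frame_indices(start_idx=1, frame_step=2, n_sample_frames=4)
```

With `frame_step=2` this yields `[1, 3, 5, 7]`, the sequence given in the comment, so a larger step covers more of the clip with the same `n_sample_frames`.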
120+
121+
# The number of frames to sample. The higher this number, the higher the VRAM (acts similar to batch size).
122+
n_sample_frames: 8
123+
124+
# 'single_video'
125+
single_video_path: "path/to/single/video.mp4"
126+
127+
# The prompt when using a single video file
128+
single_video_prompt: ""
129+
130+
# Fallback prompt if caption cannot be read. Enabled for 'image' and 'folder'.
131+
fallback_prompt: ''
132+
133+
# 'folder'
134+
path: "path/to/folder/of/videos/"
135+
136+
# 'json'
137+
json_path: 'path/to/train/json/'
138+
139+
# 'image'
140+
image_dir: 'path/to/image/directory'
141+
142+
# The prompt for all image files. Leave blank to use caption files (.txt)
143+
single_img_prompt: ""
144+
145+
# Validation data parameters.
146+
validation_data:
147+
148+
# A custom prompt that is different from your training dataset.
149+
prompt: ""
150+
151+
# Whether or not to sample preview during training (Requires more VRAM).
152+
sample_preview: True
153+
154+
# The number of frames to sample during validation.
155+
num_frames: 16
156+
157+
# Height and width of validation sample.
158+
width: 384
159+
height: 384
160+
161+
# Number of inference steps when generating the video.
162+
num_inference_steps: 25
163+
164+
# CFG scale
165+
guidance_scale: 9
166+
167+
# Learning rate for AdamW
168+
learning_rate: 2e-5
169+
170+
# Weight decay. Higher = more regularization. Lower = closer to dataset.
171+
adam_weight_decay: 0
172+
173+
# Optimizer parameters for the UNET. Overrides base learning rate parameters.
174+
extra_unet_params: null
175+
#learning_rate: 1e-5
176+
#adam_weight_decay: 1e-4
177+
178+
# Optimizer parameters for the Text Encoder. Overrides base learning rate parameters.
179+
extra_text_encoder_params: null
180+
#learning_rate: 5e-6
181+
#adam_weight_decay: 0.2
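
The override behavior of `extra_unet_params` and `extra_text_encoder_params` amounts to layering per-module values on top of the base optimizer settings. The sketch below shows that merge with plain dicts. `build_optimizer_params` is a hypothetical helper; how the trainer actually builds its optimizer groups is not shown in this diff.

```python
def build_optimizer_params(base: dict, extra) -> dict:
    """Start from the base optimizer settings and apply per-module overrides, if any."""
    params = dict(base)  # copy so the base settings stay untouched
    if extra:            # `null` in YAML loads as None, meaning no overrides
        params.update(extra)
    return params


base = {"learning_rate": 2e-5, "adam_weight_decay": 0}
unet_params = build_optimizer_params(base, None)
text_params = build_optimizer_params(
    base, {"learning_rate": 5e-6, "adam_weight_decay": 0.2}
)
```

Leaving an `extra_*` entry as `null` keeps the base values, while uncommenting the keys beneath it replaces only those keys for that module.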
182+
183+
# How many batches to train. Not to be confused with video frames.
184+
train_batch_size: 1
185+
186+
# Maximum number of train steps. Model is saved after training.
187+
max_train_steps: 10000
188+
189+
# Saves a model every nth step.
190+
checkpointing_steps: 2500
191+
192+
# How many steps to do for validation if sample_preview is enabled.
193+
validation_steps: 100
194+
195+
# Seed for validation.
196+
seed: 64
197+
198+
# Whether or not we want to use mixed precision with accelerate
199+
mixed_precision: "fp16"
200+
201+
# This seems to be incompatible at the moment.
202+
use_8bit_adam: False
203+
204+
# Trades VRAM usage for speed. You lose roughly 20% of training speed, but save a lot of VRAM.
205+
# If you need to save more VRAM, it can also be enabled for the text encoder, but reduces speed x2.
206+
gradient_checkpointing: True
207+
208+
# Xformers must be installed for best memory savings and performance (< Pytorch 2.0)
209+
enable_xformers_memory_efficient_attention: False
210+
211+
# Use scaled dot product attention (Only available with >= Torch 2.0)
212+
enable_torch_2_attn: True

configs/v2/train_config.yaml

Lines changed: 34 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -42,16 +42,47 @@ cached_latent_dir: null #/path/to/cached_latents
4242
# Train the text encoder for the model. LoRA Training overrides this setting.
4343
train_text_encoder: False
4444

45-
# https://github.com/cloneofsimo/lora
46-
# Use LoRA to train extra layers whilst saving memory. It trains both a LoRA & the model itself.
47-
# You can choose to train the entire model, or set individual paramaters to save memory.
45+
# https://github.com/cloneofsimo/lora (NOT Compatible with webui extension)
46+
# This is the first, original implementation of LoRA by cloneofsimo.
47+
# Use this version if you want to maintain compatibility with the original version.
48+
49+
# https://github.com/ExponentialML/Stable-LoRA/tree/main (Compatible with webui text2video extension)
50+
# This is an implementation based off of the original LoRA repository by Microsoft, and the default LoRA method here.
51+
# It works a bit differently, using embeddings instead of the intermediate activations (Linear || Conv).
52+
# This means that there isn't an extra function when doing low-rank adaptation.
53+
# It solely saves the weight differential between the initialized weights and updates.
54+
55+
# "cloneofsimo" or "stable_lora"
56+
lora_version: "cloneofsimo"
4857

4958
# Use LoRA for the UNET model.
5059
use_unet_lora: True
5160

5261
# Use LoRA for the Text Encoder. If this is set, the text encoder for the model will not be trained.
5362
use_text_lora: True
5463

64+
# LoRA Dropout. This parameter sets the probability of randomly zeroing out elements. Helps prevent overfitting.
65+
# See: https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
66+
lora_unet_dropout: 0.1
67+
68+
lora_text_dropout: 0.1
69+
70+
# https://github.com/kabachuha/sd-webui-text2video
71+
# This saves a LoRA that is compatible with the text2video webui extension.
72+
# It only works when the lora version is 'stable_lora'.
73+
# This is also a DIFFERENT implementation than Kohya's, so it will NOT work with that implementation.
74+
save_lora_for_webui: True
75+
76+
# The LoRA file will be converted to a different format to be compatible with the webui extension.
77+
# The difference between this and 'save_lora_for_webui' is that you can continue training a Diffusers pipeline model
78+
# when this option is set to False.
79+
only_lora_for_webui: False
80+
81+
# Choose whether or not to save the full pretrained model weights for both checkpoints and after training.
82+
# The only time you want this off is if you're doing full LoRA training.
83+
save_pretrained_model: True
84+
85+
# The modules to use for LoRA. Different from 'trainable_modules'.
5586
unet_lora_modules:
5687
- "UNet3DConditionModel"
5788
#- "ResnetBlock2D"

requirements.txt

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,7 @@ torchvision
44
torchaudio
55
git+https://github.com/huggingface/diffusers.git
66
git+https://github.com/cloneofsimo/lora.git
7+
git+https://github.com/microsoft/LoRA
78
transformers
89
einops
910
decord
