ExponentialML
diff --git a/README.md b/README.md
Lines changed: 13 additions & 2 deletions
diff --git a/configs/v2/stable_lora_config.yaml b/configs/v2/stable_lora_config.yaml
Lines changed: 212 additions & 0 deletions
diff --git a/configs/v2/train_config.yaml b/configs/v2/train_config.yaml
Lines changed: 34 additions & 3 deletions
diff --git a/requirements.txt b/requirements.txt
Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@
 [output.webm](https://user-images.githubusercontent.com/59846140/230748413-fe91e90b-94b9-49ea-97ec-250469ee9472.webm)
 
 ### Updates
+- **2023-7-12**: You can now train a LoRA that is compatible with the [webui extension](https://github.com/kabachuha/sd-webui-text2video)! See instructions [here.](https://github.com/ExponentialML/Text-To-Video-Finetuning/edit/feat/stable_lora/README.md#training-a-lora)
 - **2023-4-17**: You can now convert your trained models from diffusers to `.ckpt` format for A111 webui. Thanks @kabachuha!  
 - **2023-4-8**: LoRA Training released! Check out `configs/v2/lora_training_config.yaml` for instructions. 
 - **2023-4-8**: Version 2 is released! 
@@ -46,15 +47,13 @@ It is **highly recommended** to install >= Torch 2.0. This way, you don't have t
 
 If you don't have Xformers enabled, you can follow the instructions here: https://github.com/facebookresearch/xformers
 
-
 An RTX 3090 is recommended, but you should be able to train on GPUs with <= 16GB of VRAM with:
 - Validation turned off.
 - Xformers or Torch 2.0 Scaled Dot-Product Attention 
 - Gradient checkpointing enabled. 
 - Resolution of 256.
 - Enable all LoRA options.
 
-
 ## Running inference
 The `inference.py` script can be used to render videos with trained checkpoints.
 
@@ -164,6 +163,18 @@ Then, follow each line and configure it for your specific use case.
 
 The instructions should be clear enough to get you up and running with your dataset, but feel free to ask any questions in the discussion board.
 
+## Training a LoRA
+You can also train a LoRA that is compatible with the webui extension. By default, `lora_version` is set to 'cloneofsimo', which was the first LoRA implementation for Stable Diffusion.
+You can use this version with the `inference.py` script in this repository. It is **not** compatible with the webui.
+
+To use a LoRA with the webui, change `lora_version` to "stable_lora" in your config. This will train a LoRA that is compatible with the [A1111 webui extension](https://github.com/kabachuha/sd-webui-text2video).
+You can get started at `configs/v2/stable_lora_config.yaml` and edit it from there. During and after training, LoRAs will be saved in your outputs directory with the prefix `_webui`.
+
+### What you cannot do:
+- Use LoRA files that were made for SD image models in other trainers.
+- Use 'cloneofsimo' LoRAs in another project (unless you build it or create a PR)
+- Merge LoRA weights together (yet).
+
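Review note: the rules described in this section can be checked before a training run. The following is a minimal sketch, assuming the config has already been loaded into a plain dictionary. The helper name `check_lora_config` is hypothetical and is not part of this repository; it only encodes what the README and config comments state (two valid `lora_version` values, and `save_lora_for_webui` only taking effect with 'stable_lora').

```python
# Hypothetical helper, not part of the repository. It encodes only the rules
# stated in the README and config comments of this change.
VALID_LORA_VERSIONS = ("cloneofsimo", "stable_lora")


def check_lora_config(config):
    """Return a list of human-readable warnings for a config dictionary."""
    warnings = []
    version = config.get("lora_version")
    if version not in VALID_LORA_VERSIONS:
        warnings.append(f"unknown lora_version: {version!r}")
    # Per the config comments, webui saving only works with 'stable_lora'.
    if config.get("save_lora_for_webui", False) and version != "stable_lora":
        warnings.append("save_lora_for_webui only works with lora_version 'stable_lora'")
    return warnings


if __name__ == "__main__":
    example = {"lora_version": "cloneofsimo", "save_lora_for_webui": True}
    for message in check_lora_config(example):
        print("warning:", message)
```

Returning a list of warnings instead of raising keeps the check non-blocking, since a 'cloneofsimo' run is still valid for use with `inference.py`.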
 ## Finetune.
 ```bash
 python train.py --config ./configs/v2/train_config.yaml
 
@@ -0,0 +1,212 @@
+# Pretrained diffusers model path.
+pretrained_model_path: "./models/model_scope_diffusers/" #https://huggingface.co/damo-vilab/text-to-video-ms-1.7b/tree/main
+
+# The folder where your training outputs will be placed.
+output_dir: "./outputs"
+
+# You can train multiple datasets at once. They will be joined together for training.
+# Simply remove the line you don't need, or keep them all for mixed training.
+
+# 'image': A folder of images and captions (.txt)
+# 'folder': A folder of videos and captions (.txt)
+# 'json': The JSON file created with automatic BLIP2 captions using https://github.com/ExponentialML/Video-BLIP2-Preprocessor
+# 'single_video': A single video file.mp4 and text prompt
+dataset_types: 
+  - 'image'
+  - 'folder'
+  - 'json'
+  - 'single_video'
+
+# Adds offset noise to training. See https://www.crosslabs.org/blog/diffusion-with-offset-noise
+# If this is enabled, rescale_schedule will be disabled.
+offset_noise_strength: 0.1
+use_offset_noise: False
+
+# Uses schedule rescale, also known as the "better" offset noise. See https://arxiv.org/pdf/2305.08891.pdf
+# If this is enabled, offset noise will be disabled.
+rescale_schedule: False
+
+# When True, this extends all items in all enabled datasets to the highest length. 
+# For example, if you have 200 videos and 10 images, 10 images will be duplicated to the length of 200. 
+extend_dataset: False
+
+# Caches the latents (Frames-Image -> VAE -> Latent) to an HDD or SSD. 
+# The latents will be saved under your training folder, and loaded automatically for training.
+# This saves memory, speeds up training, and takes very little disk space.
+cache_latents: True
+
+# If you have cached latents set to `True` and have a directory of cached latents,
+# you can skip the caching process and load previously saved ones. 
+cached_latent_dir: null #/path/to/cached_latents
+
+# https://github.com/cloneofsimo/lora (NOT Compatible with webui extension)
+# This is the first, original implementation of LoRA by cloneofsimo.
+# Use this version if you want to maintain compatibility with the original version.
+
+# https://github.com/ExponentialML/Stable-LoRA/tree/main (Compatible with webui text2video extension)
+# This is an implementation based on the original LoRA repository by Microsoft, and the default LoRA method here.
+# It works a bit differently, using embeddings instead of the intermediate activations (Linear || Conv).
+# This means that there isn't an extra function when doing low-rank adaptation.
+# It only saves the weight differential between the initialized weights and the updated weights.
+
+# "cloneofsimo" or "stable_lora"
+lora_version: "stable_lora"
+
+# Use LoRA for the UNET model.
+use_unet_lora: True
+
+# Use LoRA for the Text Encoder. If this is set, the text encoder for the model will not be trained.
+use_text_lora: True
+
+# LoRA Dropout. This parameter sets the probability of randomly zeroing out elements. Helps prevent overfitting.
+# See: https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
+lora_unet_dropout: 0.1
+
+lora_text_dropout: 0.1
+
+# https://github.com/kabachuha/sd-webui-text2video
+# This saves a LoRA that is compatible with the text2video webui extension.
+# It only works when the lora version is 'stable_lora'.
+# This is also a DIFFERENT implementation than Kohya's, so it will NOT work with that implementation.
+save_lora_for_webui: True
+
+# The LoRA file will be converted to a different format to be compatible with the webui extension.
+# The difference between this and 'save_lora_for_webui' is that you can continue training a Diffusers pipeline model
+# when this option is set to False.
+only_lora_for_webui: False
+
+# Choose whether or not to save the full pretrained model weights for both checkpoints and after training.
+# The only time you want this off is if you're doing full LoRA training.
+save_pretrained_model: False
+
+# The modules to use for LoRA. Advanced usage.
+unet_lora_modules:
+  - "UNet3DConditionModel" # Defaults to training the entire UNET.
+  #- "ResnetBlock2D"
+  #- "TransformerTemporalModel"
+  #- "Transformer2DModel"
+  #- "CrossAttention"
+  #- "Attention"
+  #- "GEGLU"
+  #- "TemporalConvLayer"
+
+text_encoder_lora_modules:
+  - "CLIPEncoderLayer" # Defaults to training the entire Text Encoder.
+  #- "CLIPAttention"
+
+# The rank for LoRA training. With ModelScope, the maximum should be 1024. 
+# VRAM usage increases with a higher rank and decreases with a lower rank.
+lora_rank: 16
+
+# Training data parameters
+train_data:
+
+  # The width and height to which your training data will be resized.
+  width: 384      
+  height: 384
+
+  # This will find the closest aspect ratio to your input width and height. 
+  # For example, 512x512 width and height with a video of resolution 1280x720 will be resized to 512x256
+  use_bucketing: True
+
+  # The start frame index where your videos should start (Leave this at one for json and folder based training).
+  sample_start_idx: 1
+
+  # Used for 'folder'. The rate at which your frames are sampled. Does nothing for 'json' and 'single_video' dataset.
+  fps: 24
+
+  # For 'single_video' and 'json'. The number of frames to "step" (1,2,3,4) (frame_step=2) -> (1,3,5,7, ...).  
+  frame_step: 1
+
+  # The number of frames to sample. The higher this number, the higher the VRAM (acts similar to batch size).
+  n_sample_frames: 8
+  
+  # 'single_video'
+  single_video_path: "path/to/single/video.mp4"
+
+  # The prompt when using a single video file
+  single_video_prompt: ""
+
+  # Fallback prompt if caption cannot be read. Enabled for 'image' and 'folder'.
+  fallback_prompt: ''
+  
+  # 'folder'
+  path: "path/to/folder/of/videos/"
+
+  # 'json'
+  json_path: 'path/to/train/json/'
+
+  # 'image'
+  image_dir: 'path/to/image/directory'
+
+  # The prompt for all image files. Leave blank to use caption files (.txt) 
+  single_img_prompt: ""
+
+# Validation data parameters.
+validation_data:
+
+  # A custom prompt that is different from your training dataset. 
+  prompt: ""
+
+  # Whether or not to sample a preview during training (Requires more VRAM).
+  sample_preview: True
+
+  # The number of frames to sample during validation.
+  num_frames: 16
+
+  # Height and width of validation sample.
+  width: 384
+  height: 384
+
+  # Number of inference steps when generating the video.
+  num_inference_steps: 25
+
+  # CFG scale
+  guidance_scale: 9
+
+# Learning rate for AdamW
+learning_rate: 2e-5
+
+# Weight decay. Higher = more regularization. Lower = closer to dataset.
+adam_weight_decay: 0
+
+# Optimizer parameters for the UNET. Overrides base learning rate parameters.
+extra_unet_params: null
+  #learning_rate: 1e-5
+  #adam_weight_decay: 1e-4
+
+# Optimizer parameters for the Text Encoder. Overrides base learning rate parameters.
+extra_text_encoder_params: null
+  #learning_rate: 5e-6
+  #adam_weight_decay: 0.2
+
+# The batch size used for training. Not to be confused with video frames.
+train_batch_size: 1
+
+# Maximum number of train steps. Model is saved after training.
+max_train_steps: 10000
+
+# Saves a model every nth step.
+checkpointing_steps: 2500
+
+# Runs validation every nth step if sample_preview is enabled.
+validation_steps: 100
+
+# Seed for validation.
+seed: 64
+
+# Whether or not we want to use mixed precision with accelerate
+mixed_precision: "fp16"
+
+# This seems to be incompatible at the moment.
+use_8bit_adam: False 
+
+# Trades VRAM usage for speed. You lose roughly 20% of training speed, but save a lot of VRAM.
+# If you need to save more VRAM, it can also be enabled for the text encoder, but reduces speed x2.
+gradient_checkpointing: True
+
+# Xformers must be installed for best memory savings and performance (< Pytorch 2.0)
+enable_xformers_memory_efficient_attention: False
+
+# Use scaled dot product attention (Only available with >= Torch 2.0)
+enable_torch_2_attn: True
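Review note: two of the dataset options in this config are easier to follow with numbers. The sketch below illustrates `frame_step` together with `n_sample_frames`, and the duplication described for `extend_dataset`. Both function names are hypothetical and written only from the comments in this file; the repository's dataset code may differ, for example in how it clamps indices to the video length.

```python
def sample_frame_indices(start_idx, frame_step, n_sample_frames):
    """Pick n_sample_frames indices, starting at start_idx, every frame_step frames."""
    return [start_idx + i * frame_step for i in range(n_sample_frames)]


def extend_to_longest(datasets):
    """Repeat the items of each dataset until every dataset matches the longest one."""
    target = max(len(items) for items in datasets.values())
    extended = {}
    for name, items in datasets.items():
        if not items:
            # An empty dataset cannot be duplicated, so it is left empty.
            extended[name] = []
            continue
        repeats = -(-target // len(items))  # ceiling division
        extended[name] = (items * repeats)[:target]
    return extended


if __name__ == "__main__":
    # frame_step=2 starting at frame 1, as in the config comment.
    print(sample_frame_indices(start_idx=1, frame_step=2, n_sample_frames=4))

    # 200 videos and 10 images: the images are repeated up to 200 entries.
    data = {"videos": list(range(200)), "images": list(range(10))}
    print({name: len(items) for name, items in extend_to_longest(data).items()})
```

Cycling through the shorter dataset keeps every item represented equally often, which is one plausible reading of "duplicated to the length of 200".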
@@ -42,16 +42,47 @@ cached_latent_dir: null #/path/to/cached_latents
 # Train the text encoder for the model. LoRA Training overrides this setting.
 train_text_encoder: False
 
-# https://github.com/cloneofsimo/lora
-# Use LoRA to train extra layers whilst saving memory. It trains both a LoRA & the model itself.
-# You can choose to train the entire model, or set individual paramaters to save memory.
+# https://github.com/cloneofsimo/lora (NOT Compatible with webui extension)
+# This is the first, original implementation of LoRA by cloneofsimo.
+# Use this version if you want to maintain compatibility with the original version.
+
+# https://github.com/ExponentialML/Stable-LoRA/tree/main (Compatible with webui text2video extension)
+# This is an implementation based on the original LoRA repository by Microsoft, and the default LoRA method here.
+# It works a bit differently, using embeddings instead of the intermediate activations (Linear || Conv).
+# This means that there isn't an extra function when doing low-rank adaptation.
+# It only saves the weight differential between the initialized weights and the updated weights.
+
+# "cloneofsimo" or "stable_lora"
+lora_version: "cloneofsimo"
 
 # Use LoRA for the UNET model.
 use_unet_lora: True
 
 # Use LoRA for the Text Encoder. If this is set, the text encoder for the model will not be trained.
 use_text_lora: True
 
+# LoRA Dropout. This parameter sets the probability of randomly zeroing out elements. Helps prevent overfitting.
+# See: https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html
+lora_unet_dropout: 0.1
+
+lora_text_dropout: 0.1
+
+# https://github.com/kabachuha/sd-webui-text2video
+# This saves a LoRA that is compatible with the text2video webui extension.
+# It only works when the lora version is 'stable_lora'.
+# This is also a DIFFERENT implementation than Kohya's, so it will NOT work with that implementation.
+save_lora_for_webui: True
+
+# The LoRA file will be converted to a different format to be compatible with the webui extension.
+# The difference between this and 'save_lora_for_webui' is that you can continue training a Diffusers pipeline model
+# when this option is set to False.
+only_lora_for_webui: False
+
+# Choose whether or not to save the full pretrained model weights for both checkpoints and after training.
+# The only time you want this off is if you're doing full LoRA training.
+save_pretrained_model: True
+
+# The modules to use for LoRA. Different from 'trainable_modules'.
 unet_lora_modules:
   - "UNet3DConditionModel"
   #- "ResnetBlock2D"
 
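Review note: the config comments describe 'stable_lora' as saving a weight differential, and the rank setting as driving memory use. The sketch below shows the standard low-rank formulation from the LoRA paper in plain Python: the differential is `scale * (B @ A)`, and the number of trainable values grows linearly with the rank. It is an illustration under that assumption, not the Stable-LoRA or Microsoft LoRA code, and all function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_delta(lora_b, lora_a, scale=1.0):
    """Weight differential of a low-rank update: scale * (B @ A)."""
    return [[scale * value for value in row] for row in matmul(lora_b, lora_a)]


def lora_param_count(d_out, d_in, rank):
    """Trainable values for one layer: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)


if __name__ == "__main__":
    # Rank-1 update for a 2x3 weight matrix.
    lora_b = [[1.0], [2.0]]      # d_out x rank
    lora_a = [[0.5, 0.0, -1.0]]  # rank x d_in
    print(lora_delta(lora_b, lora_a))

    # Rank 16 on a 1024x1024 layer versus the full matrix.
    print(lora_param_count(1024, 1024, 16), "vs", 1024 * 1024)
```

Adding the differential to the frozen base weight gives the adapted weight, which is why only the small `A` and `B` factors (or their product) need to be stored.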
@@ -4,6 +4,7 @@ torchvision
 torchaudio
 git+https://github.com/huggingface/diffusers.git
 git+https://github.com/cloneofsimo/lora.git
+git+https://github.com/microsoft/LoRA
 transformers
 einops
 decord