Image generation benchmark #3082
Conversation
Adds an image generation benchmark to PEFT, similar to the existing MetaMathQA benchmark. Right now, it uses Flux2 Klein 4B to train an adapter on the cat toy dataset. In its current state, the model is hardly learning, so some debugging is needed.
|
To strengthen the benchmark, would it be possible to include a LoRA rank sweep (r = 8, 16, 32, 64) to better understand the trade-off between adapter capacity and generation quality, as well as a full fine-tune baseline to serve as an upper-bound reference for performance and convergence behavior? These additions could help determine whether the current learning instability is capacity-related or optimization-related. Additionally, GPU utilization is currently low (~30% at ~5 sec/iteration for 1024×1024), so improving compute efficiency will be important to accelerate experimentation, reduce turnaround time for rank sweeps and baselines, and make the benchmark more scalable. |
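(Illustrative only, not part of the PR.) A minimal sketch of what generating configs for such a rank sweep could look like with the public PEFT API; the target modules and output paths are placeholders, not values from this benchmark:

```python
from peft import LoraConfig

# Hypothetical rank sweep; target_modules and the output paths are illustrative
# placeholders, not the settings used in this benchmark.
for rank in (8, 16, 32, 64):
    config = LoraConfig(r=rank, lora_alpha=2 * rank, target_modules=["to_q", "to_k", "to_v", "to_out.0"])
    config.save_pretrained(f"experiments/lora/flux2-klein-r{rank}")
```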
|
@srijondasgit This is still WIP, there will be more work on performance and hyper-parameters before we merge this. At this stage, I'm still trying to figure out if something fundamental is broken. |
- use dog dataset (but may revert to cat later)
- several smaller fixes
- precompute stuff and onload for faster results
- 1024x1024 by default for Flux 2
Clean up data loading, precompute latents, simplify eval.
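For context, a rough sketch of the latent-precomputation idea using standard diffusers/torch APIs; the VAE checkpoint and scaling are assumptions (SD 1.5 is only mentioned elsewhere in this thread), not necessarily what the benchmark does:

```python
import torch
from diffusers import AutoencoderKL

# Encode the training images once and cache the latents, so the VAE does not
# have to run on every training step. The checkpoint name is a placeholder.
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae").eval()

@torch.no_grad()
def precompute_latents(pixel_values: torch.Tensor) -> torch.Tensor:
    # pixel_values: [N, 3, H, W], already normalized to [-1, 1]
    latents = vae.encode(pixel_values).latent_dist.sample()
    return latents * vae.config.scaling_factor
```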
|
@srijondasgit With the latest changes, the benchmark should run much faster and yield better results. Feel free to test it. If you have suggestions for better hyper-parameters, feel free to share. Note that there is already an experiment setting for full fine-tuning, but I can't run it right now (not enough memory on my machine). |
| pipeline.text_encoder.to(target_device) | ||
| try: | ||
| for idx, prompt in enumerate(train_config.sample_image_prompts, start=1): | ||
| generator = torch.Generator(device=target_device).manual_seed(train_config.seed + 100_000 + idx) |
Should this differ from how seeding is done in evaluate()?
Good point, I moved the generator out of the loop to be consistent.
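For reference, a minimal sketch of the pattern being described, i.e. seeding a single torch.Generator before the loop so that sampling and evaluate() follow the same scheme; the pipeline call arguments and output path are assumptions, not copied from the script:

```python
import torch

# Sketch only: `pipeline`, `train_config` and `target_device` stand for the
# objects from the benchmark script quoted above.
generator = torch.Generator(device=target_device).manual_seed(train_config.seed)
for idx, prompt in enumerate(train_config.sample_image_prompts, start=1):
    image = pipeline(prompt, generator=generator).images[0]
    image.save(f"sample_{idx}.png")  # illustrative output path
```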
| total_samples += current_batch_size | ||
| model_input_ids = pipeline._prepare_latent_ids(latents).to(latents.device) | ||
| noise = torch.randn_like(latents) |
This is not seeded and can cause non-determinism across runs.
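A sketch of the seeded alternative; torch.randn_like does not take a generator, so torch.randn with an explicit shape is the usual workaround (the tensors below are placeholders standing in for the script's latents and seeded generator):

```python
import torch

# Placeholders for the script's cached latents and seeded generator.
generator = torch.Generator().manual_seed(0)
latents = torch.zeros(1, 16, 64, 64)

# torch.randn_like(latents) cannot be seeded; torch.randn with the same
# shape/dtype/device and an explicit generator is deterministic.
noise = torch.randn(latents.shape, generator=generator, dtype=latents.dtype, device=latents.device)
```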
| train_config=train_config, print_fn=print_verbose | ||
| ) | ||
| train_size_base = len(train_dataset["prompts"]) | ||
| train_indices = torch.cat([torch.randperm(train_size_base) for _ in range(train_dataset["repeats"])]) |
Doesn't use a generator, which can introduce randomness between runs.
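A sketch of the seeded variant; torch.randperm accepts a generator argument (the sizes below are placeholders for the dataset values in the script):

```python
import torch

# Placeholders for len(train_dataset["prompts"]) and train_dataset["repeats"].
train_size_base, repeats = 5, 3
generator = torch.Generator().manual_seed(0)

# Same shuffling logic as in the script, but reproducible across runs.
train_indices = torch.cat([torch.randperm(train_size_base, generator=generator) for _ in range(repeats)])
```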
|
@sayakpaul Thank you for your review, good catch. I have addressed the seeding in the latest commit. When re-running the experiment, there is, however, still considerable variance in the similarity score. The losses are pretty tight though: |
This suggests that the generated images are a bit different. Could you disable the adapter and run the pipeline in the exact same setting a few times to see whether the results are the same or different? This should tell us whether, despite the seeding, the trained adapters still end up different across runs. |
|
With disabled adapters, similarity is always 0.0530. |
Okay then we can conclude this variance isn't coming from the pipeline. I will look into the similarity computation code tomorrow. |
|
Thanks. I did confirm that |
Could be. But do the results stay consistent across different generation rounds when using the same adapter? I think we can still report the DINO similarity scores, TBH
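For readers unfamiliar with the metric: DINO similarity for DreamBooth-style evaluation is typically the cosine similarity between DINO embeddings of a generated image and a reference image. A rough sketch under that assumption; the checkpoint and pooling choice are guesses, not necessarily what this benchmark implements:

```python
import torch
from transformers import AutoImageProcessor, AutoModel

# Assumed checkpoint; the benchmark may use a different DINO variant.
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base").eval()

@torch.no_grad()
def dino_similarity(generated_image, reference_image) -> float:
    # Cosine similarity between the CLS embeddings of the two (PIL) images.
    inputs = processor(images=[generated_image, reference_image], return_tensors="pt")
    emb = model(**inputs).last_hidden_state[:, 0]
    emb = torch.nn.functional.normalize(emb, dim=-1)
    return float((emb[0] * emb[1]).sum())
```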
|
The experiment was run on a Mac development machine, so training is significantly slower than typical GPU-based diffusion benchmarks. The GPU on my Mac is active sometimes, but it stays idle a lot, so overall training is quite slow. I created this PR to use a model that can be used for full fine-tuning vs PEFT on a Mac. All the weights of UNet2DConditionModel were initialized from the model checkpoint at runwayml/stable-diffusion-v1-5. Final step:
Steps: 100%|████████████████████████████████████████████████████████████████████| 500/500 [2:47:22<00:00, 20.08s/it, loss=0.00471, lr=0.0001]
Inference ... |
So I ran evaluation three times every 50 steps instead of just once and collected the similarity scores. With the same seed, the score is always identical, as should be expected. When using different seeds, the scores can vary quite a bit:
Yes, for sure. If there were a more reliable score (or a dataset that leads to a more reliable score), though, it would give users a better signal. For now, I'd say let's compute only one score for validation, as it's quite slow, but we can compute multiple scores for the test set and average those. |
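A sketch of the multi-seed test-set averaging; `score_fn` is a hypothetical hook into the eval code that generates the test images for one seeded generator and returns a scalar similarity:

```python
import torch

def averaged_test_score(score_fn, seeds=(0, 1, 2), device="cpu"):
    # score_fn(generator) -> float is a hypothetical callable, not from the PR.
    scores = [score_fn(torch.Generator(device=device).manual_seed(s)) for s in seeds]
    return sum(scores) / len(scores)
```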
|
Just tested OFT (default settings), which is intended especially for image tasks, and it got a similarity of 0.747 (average of 10): |
|
Very cool! Let's get this shipped ASAP then! |
|
Yep, I'll run a few more benchmarks with PEFT methods that explicitly mention they're good at image generation. One more step would be to convert the resulting checkpoints to LoRA and check if they retain the quality. |
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
|
@sayakpaul @githubnemo From my POV, this PR is ready to be reviewed. There are still some TODOs:
I don't think those are crucial for merging the PR, and I documented them in the corresponding README.md. I'll tackle them in follow-up PRs once this one is merged. |
|
So, #3082 (comment) is regular LoRA and #3082 (comment) is OFT converted to LoRA or just OFT? |
|
That was regular OFT. That comment showed a converted adapter. |
sayakpaul
left a comment
I didn't review all the files, as the results seem okay for me, confirming the effectiveness of the implementation. But if you want me to look into specific files, I can do that.
| "layers_to_transform": null, | ||
| "modules_to_save": null, | ||
| "peft_type": "GRALORA", | ||
| "peft_version": "0.18.2.dev0@UNKNOWN", |
https://pypi.org/project/setuptools-git-versioning/ is pretty useful here.
Yeah, we could do something more sophisticated here, but it's out of scope for this PR. The PEFT version is added here just for documentation purposes (and to run migrations if needed). It's not relevant for experiment reproducibility; we track the actual PEFT commit hash used for the experiment in the results JSON. The value here is just the commit used to create the adapter_config.json.
| @@ -0,0 +1,8 @@ | |||
| { | |||
What's the advantage of keeping the training params separate from the adapter params?
- I follow the same convention as in MetaMath.
- The adapter params are just serialized PEFT configs, which are easy to generate. A unified file would require extra effort to create.
- Easier to compose different settings.
| @@ -0,0 +1,90 @@ | |||
| # Makefile for running MetaMathQA experiments. | |||
| # --- Configuration --- | ||
| PYTHON := python |
Honestly, a Makefile for launching a couple of experiments seems more complicated than it needs to be. But I am sure I am missing out on something. What's the advantage of using Makefiles for this?
First, it's the same as MetaMath, so we keep it for consistency. Second, it checks which experiments have already run and only runs the missing ones. I'd say that's pretty much the main use case for make.
| @@ -0,0 +1,6 @@ | |||
| datasets | |||
| diffusers | |||
You mean a fixed version or a min version? I'd be okay with a min version if needed, but in general we don't want to fix the version. I think it's more useful to think of this benchmark as showing the current state of the tested PEFT methods, which is why we re-run the experiments every couple of months or when we suspect a recent change could affect the results.
For reproducibility, we log the versions of the most important packages in the results file, but guaranteeing 100% reproducibility is not a goal of this benchmark.
Maybe let's specify the version where Klein was introduced (0.37.0)?
- update comment in Makefile
- initialize the Generator consistently
BenjaminBossan
left a comment
@sayakpaul I addressed/replied to all of your comments.
I didn't review all the files, as the results seem okay for me, confirming the effectiveness of the implementation. But if you want me to look into specific files, I can do that.
Yes, fine not to review everything (a lot is adapted from the MetaMath benchmark anyway). Most important is whether the results look as expected or if there is a significant flaw in the training itself.
|
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. |
|
not stale |
no values changed
This helps save memory; for LoRA, the max goes from ~21 GB to ~17 GB. Training time increases from 50 sec to 71 sec per 100 steps, which is an acceptable trade-off overall. In contrast to the MetaMathQA benchmark, I turn this on by default, as the image-gen benchmark has several methods that OOM.
Pull request overview
Adds a new DreamBooth-style image generation benchmark under method_comparison/image-gen and extends the existing method-comparison pipeline (processing + Gradio app) to support multiple tasks (MetaMathQA + image-gen) with task-specific metrics and Pareto defaults.
Changes:
- Add a new `image-gen` benchmark (training/eval pipeline, dataset handling, logging, experiments, Makefile, docs).
- Refactor `processing.py` to support task-specific preprocessing, dtypes, and “important columns”.
- Update `method_comparison/app.py` to load multiple tasks and use task-specific metric preferences and defaults.
Reviewed changes
Copilot reviewed 41 out of 47 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| method_comparison/processing.py | Refactors preprocessing to support multiple tasks and adds image-gen metrics/dtypes/column ordering. |
| method_comparison/app.py | Adds multi-task loading, task-specific metric preferences/defaults, and updates Pareto plotting UI logic. |
| method_comparison/image-gen/utils.py | Implements shared utilities for the image-gen benchmark (config, pipeline setup, logging, HF bucket uploads). |
| method_comparison/image-gen/run.py | Adds the main training/eval script for the image-gen benchmark (including DINO similarity + drift). |
| method_comparison/image-gen/data.py | Adds dataset loading/splitting and image preprocessing for training/eval. |
| method_comparison/image-gen/default_training_params.json | Provides default training/eval parameters for image-gen runs. |
| method_comparison/image-gen/requirements.txt | Declares extra deps needed for the image-gen benchmark. |
| method_comparison/image-gen/README.md | Documents how to run the image-gen benchmark and configure experiments. |
| method_comparison/image-gen/Makefile | Adds sweep automation to run only experiments missing result files. |
| method_comparison/image-gen/results/.gitkeep | Keeps results dir in git. |
| method_comparison/image-gen/temporary_results/.gitkeep | Keeps temporary_results dir in git. |
| method_comparison/image-gen/cancelled_results/.gitkeep | Keeps cancelled_results dir in git. |
| method_comparison/image-gen/sample-images/results/.gitkeep | Keeps sample image results dir in git. |
| method_comparison/image-gen/sample-images/temporary_results/.gitkeep | Keeps sample image temporary_results dir in git. |
| method_comparison/image-gen/sample-images/cancelled_results/.gitkeep | Keeps sample image cancelled_results dir in git. |
| method_comparison/image-gen/experiments/adalora/flux2-klein-default/adapter_config.json | Adds an AdaLoRA experiment config for image-gen. |
| method_comparison/image-gen/experiments/boft/flux2-klein-default/adapter_config.json | Adds a BOFT experiment config for image-gen. |
| method_comparison/image-gen/experiments/c3a/flux2-klein-default/adapter_config.json | Adds a C3A experiment config for image-gen. |
| method_comparison/image-gen/experiments/c3a/flux2-klein-default/training_params.json | Overrides training params for C3A image-gen experiment. |
| method_comparison/image-gen/experiments/delora/flux2-klein-default/adapter_config.json | Adds a DeLoRA experiment config for image-gen. |
| method_comparison/image-gen/experiments/delora/flux2-klein-default/training_params.json | Overrides training params for DeLoRA image-gen experiment. |
| method_comparison/image-gen/experiments/fourierft/flux2-klein-default/adapter_config.json | Adds a FourierFT experiment config for image-gen. |
| method_comparison/image-gen/experiments/full-finetuning/flux2-klein-default/training_params.json | Adds a full fine-tuning experiment override for image-gen. |
| method_comparison/image-gen/experiments/gralora/flux2-klein-default/adapter_config.json | Adds a GraLoRA experiment config for image-gen. |
| method_comparison/image-gen/experiments/hra/flux2-klein-default/adapter_config.json | Adds an HRA experiment config for image-gen. |
| method_comparison/image-gen/experiments/ia3/flux2-klein-default/adapter_config.json | Adds an IA3 experiment config for image-gen. |
| method_comparison/image-gen/experiments/ia3/flux2-klein-default/training_params.json | Overrides training params for IA3 image-gen experiment. |
| method_comparison/image-gen/experiments/lily/flux2-klein-default/adapter_config.json | Adds a LILY experiment config for image-gen. |
| method_comparison/image-gen/experiments/ln_tuning/flux2-klein-default/adapter_config.json | Adds an LN_TUNING experiment config for image-gen. |
| method_comparison/image-gen/experiments/loha/flux2-klein-default/adapter_config.json | Adds a LoHa experiment config for image-gen. |
| method_comparison/image-gen/experiments/lokr/flux2-klein-default/adapter_config.json | Adds a LoKr experiment config for image-gen. |
| method_comparison/image-gen/experiments/lora/flux2-klein-default/adapter_config.json | Adds a LoRA experiment config for image-gen. |
| method_comparison/image-gen/experiments/miss/flux2-klein-default/adapter_config.json | Adds a MISS experiment config for image-gen. |
| method_comparison/image-gen/experiments/oft/flux2-klein-default/adapter_config.json | Adds an OFT experiment config for image-gen. |
| method_comparison/image-gen/experiments/osf/flux2-klein-default/adapter_config.json | Adds an OSF experiment config for image-gen. |
| method_comparison/image-gen/experiments/peanut/flux2-klein-default/adapter_config.json | Adds a PEANUT experiment config for image-gen. |
| method_comparison/image-gen/experiments/psoft/flux2-klein-default/adapter_config.json | Adds a PSOFT experiment config for image-gen. |
| method_comparison/image-gen/experiments/pvera/flux2-klein-default/adapter_config.json | Adds a PVERA experiment config for image-gen. |
| method_comparison/image-gen/experiments/pvera/flux2-klein-default/training_params.json | Overrides training params for PVERA image-gen experiment. |
| method_comparison/image-gen/experiments/randlora/flux2-klein-default/adapter_config.json | Adds a RandLoRA experiment config for image-gen. |
| method_comparison/image-gen/experiments/road/flux2-klein-default/adapter_config.json | Adds a ROAD experiment config for image-gen. |
| method_comparison/image-gen/experiments/road/flux2-klein-default/training_params.json | Overrides training params for ROAD image-gen experiment. |
| method_comparison/image-gen/experiments/shira/flux2-klein-default/adapter_config.json | Adds a SHIRA experiment config for image-gen. |
| method_comparison/image-gen/experiments/vblora/flux2-klein-default/adapter_config.json | Adds a VBLoRA experiment config for image-gen. |
| method_comparison/image-gen/experiments/vera/flux2-klein-default/adapter_config.json | Adds a VERA experiment config for image-gen. |
| method_comparison/image-gen/experiments/vera/flux2-klein-default/training_params.json | Overrides training params for VERA image-gen experiment. |
| method_comparison/image-gen/experiments/waveft/flux2-klein-default/adapter_config.json | Adds a WaveFT experiment config for image-gen. |
| [ | ||
| T.Resize(size, interpolation=T.InterpolationMode.BILINEAR), | ||
| T.ToTensor(), | ||
| T.Normalize([0.5], [0.5]), |
T.Normalize([0.5], [0.5]) expects one mean/std per channel, but images are explicitly converted to RGB (3 channels). With RGB tensors this will raise a shape mismatch. Use 3-value mean/std lists (e.g. one per channel) to match the [3, H, W] tensor shape.
| T.Normalize([0.5], [0.5]), | |
| T.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]), |
In contrast to MetaMathQA, the eval in the image-gen benchmark adds significant memory usage overall due to the use of a second model (DINO) for calculating the metric. This should not count towards the max memory usage, as it is not an inherent part of the PEFT method being used.
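One possible way to keep the metric model out of the reported peak, using the standard torch.cuda counters; where exactly this would be placed in the script is an assumption:

```python
import torch

# Snapshot the training peak before the eval-only model (e.g. DINO) is loaded,
# then reset the counter so its allocations are not attributed to the PEFT
# method under test.
if torch.cuda.is_available():
    peak_training_bytes = torch.cuda.max_memory_allocated()
    torch.cuda.reset_peak_memory_stats()
```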
|
@sayakpaul I made some further changes to the PR, but now it's ready from my point of view. I would be happy for another review. Results are logged into JSON files and shown in the Gradio app, same as with the existing MetaMath benchmark. Main changes since the last review:
As for potentially showing the sample images in the app, that's something I'd look into once we have created the images. |
(sample images attached to this comment)
Adds an image generation benchmark to PEFT, similar to the existing MetaMathQA benchmark.
Right now, it uses Flux2 Klein 4B to train an adapter on the ~~dog dataset~~ (update: now using the cat plushy dataset, as it contains more samples). The benchmark is based on this Diffusers script.
Sample images from my local runs can be inspected here.
Still to do: sample images are already being created and uploaded to an HF bucket (if a token is provided), but they are not yet shown in the app (not sure if/how that works); this can be added once we have the final results.