
Image generation benchmark#3082

Open
BenjaminBossan wants to merge 26 commits into huggingface:main from BenjaminBossan:feat-add-image-gen-benchmark

Conversation

@BenjaminBossan
Member

@BenjaminBossan BenjaminBossan commented Mar 4, 2026

Adds an image generation benchmark to PEFT, similar to the existing MetaMathQA benchmark.

Right now, it uses Flux2 klein 4b to train an adapter on the dog dataset (update: now using the cat plushy dataset, as it contains more samples).

The benchmark is based on this Diffusers script.

Sample images from my local runs can be inspected here.

Still to do: Sample images are already being created and uploaded to an HF bucket (if a token is provided). The images are not shown in the app yet (not sure if/how that works); this can be added once we have the final results.

Adds an image generation benchmark similar to the existing MetaMathQA to
PEFT.

Right now, uses Flux2 klein 4b to train an adapter on the cat toy
dataset.

Current state is that the model is hardly learning, so there is some
debugging needed.
@srijondasgit

To strengthen the benchmark, would it be possible to include a LoRA rank sweep (r = 8, 16, 32, 64) to better understand the trade-off between adapter capacity and generation quality, as well as a full fine-tune baseline to establish an upper-bound reference for performance and convergence behavior?

These additions could help determine whether the current learning instability is capacity-related or optimization-related. Additionally, GPU utilization is currently low (~30% at ~5 sec/iteration for 1024×1024), so improving compute efficiency will be important to accelerate experimentation, reduce turnaround time for rank sweeps and baselines, and make the benchmark more scalable.
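For illustration, a rank sweep could be expressed as a loop over LoRA configs. This is only a minimal sketch: the target modules and the run_experiment() helper are placeholders, not the benchmark's actual code.

```python
from peft import LoraConfig

def run_experiment(config, name):
    # Hypothetical stand-in for however the benchmark launches a single run;
    # in this PR each experiment is an adapter_config.json under experiments/.
    print(f"would run {name} with r={config.r}")

for rank in (8, 16, 32, 64):
    config = LoraConfig(
        r=rank,
        lora_alpha=rank,
        # Placeholder target modules for the transformer's projection layers.
        target_modules=["to_q", "to_k", "to_v", "to_out.0"],
    )
    run_experiment(config, name=f"lora-rank{rank}")
```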

@BenjaminBossan
Member Author

@srijondasgit This is still WIP, there will be more work on performance and hyper-parameters before we merge this. At this stage, I'm still trying to figure out if something fundamental is broken.

- use dog dataset (but may revert to cat later)
- several smaller fixes
- precompute stuff and onload for faster results
- 1024x1024 by default for Flux 2
Clean up data loading, precompute latents, simplify eval.
@BenjaminBossan
Member Author

@srijondasgit With the latest changes, the benchmark should run much faster and yield better results. Feel free to test it, and if you have suggestions for better hyper-parameters, please share them. Note that there is already an experiment setting for full fine-tuning, but I can't run it right now (not enough memory on my machine).

Comment thread method_comparison/image-gen/run.py Outdated
pipeline.text_encoder.to(target_device)
try:
for idx, prompt in enumerate(train_config.sample_image_prompts, start=1):
generator = torch.Generator(device=target_device).manual_seed(train_config.seed + 100_000 + idx)
Member

Should this differ from how seeding is done in evaluate()?

Member Author

Good point, I moved the generator out of the loop to be consistent.
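For reference, the change amounts to something like the following (a minimal sketch with placeholder values; `pipeline` stands for the diffusers pipeline used in run.py):

```python
import torch

seed = 0                # stands in for train_config.seed
target_device = "cuda"  # stands in for the device used in run.py
prompts = ["a photo of sks dog", "a photo of sks dog diving under water"]

# One seeded Generator, created once and reused for every sample prompt,
# mirroring how evaluate() seeds generation.
generator = torch.Generator(device=target_device).manual_seed(seed + 100_000)
for prompt in prompts:
    # image = pipeline(prompt, generator=generator).images[0]
    pass
```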

Comment thread method_comparison/image-gen/run.py Outdated
total_samples += current_batch_size

model_input_ids = pipeline._prepare_latent_ids(latents).to(latents.device)
noise = torch.randn_like(latents)
Member

This is not seeded and can cause randomness.
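A seeded variant could look like this (a minimal sketch with an illustrative tensor; note that torch.randn_like does not accept a generator, but torch.randn does):

```python
import torch

latents = torch.zeros(2, 16, 64, 64)  # stand-in for the precomputed latents
generator = torch.Generator(device=latents.device).manual_seed(0)
# Same shape/dtype/device as randn_like, but reproducible across runs.
noise = torch.randn(
    latents.shape, generator=generator, device=latents.device, dtype=latents.dtype
)
```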

Comment thread method_comparison/image-gen/run.py Outdated
train_config=train_config, print_fn=print_verbose
)
train_size_base = len(train_dataset["prompts"])
train_indices = torch.cat([torch.randperm(train_size_base) for _ in range(train_dataset["repeats"])])
Member

Doesn't use a generator, which can introduce randomness between runs.
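A seeded version might look like this (a minimal sketch with illustrative sizes):

```python
import torch

train_size_base, repeats = 5, 40  # illustrative values
generator = torch.Generator().manual_seed(0)
train_indices = torch.cat(
    [torch.randperm(train_size_base, generator=generator) for _ in range(repeats)]
)
```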

@BenjaminBossan
Member Author

@sayakpaul Thank you for your review, good catch. I have addressed the seeding in the latest commit. When re-running the experiment, there is, however, still considerable variance in the similarity score. The losses are pretty tight though:

# run 1
{"step": "  50", "loss avg": "0.6826", "valid sim": "0.0043", "train time": "17.7s"}
{"step": " 100", "loss avg": "0.6857", "valid sim": "0.2506", "train time": "17.7s"}
{"step": " 150", "loss avg": "0.6669", "valid sim": "0.7335", "train time": "17.7s"}
{"step": " 200", "loss avg": "0.6710", "valid sim": "0.5652", "train time": "17.7s"}

# run 2
{"step": "  50", "loss avg": "0.6826", "valid sim": "-0.0345", "train time": "17.8s"}
{"step": " 100", "loss avg": "0.6855", "valid sim": "0.1775", "train time": "17.7s"}
{"step": " 150", "loss avg": "0.6661", "valid sim": "0.6675", "train time": "17.8s"}
{"step": " 200", "loss avg": "0.6701", "valid sim": "0.6906", "train time": "17.8s"}

@sayakpaul
Member

When re-running the experiment, there is, however, still considerable variance in the similarity score.

This suggests that the generated images are a bit different. Could you disable the adapter and run the pipeline in the exact same setting a few times to see if the results are the same or different? This should tell us whether, even with the seed set, the trained adapters still end up differing across runs.
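One way to do that check, assuming the transformer is wrapped as a PEFT model and thus exposes the disable_adapter() context manager (this is only a sketch, not the benchmark's code):

```python
import torch

prompt = "a photo of sks dog"
images = []
for _ in range(2):
    # Re-seed identically for each round so only the adapter could differ.
    generator = torch.Generator(device="cuda").manual_seed(0)
    # with pipeline.transformer.disable_adapter():
    #     images.append(pipeline(prompt, generator=generator).images[0])
# If the two images (and their similarity scores) match exactly, the variance
# is not coming from the pipeline itself.
```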

@BenjaminBossan
Member Author

With disabled adapters, similarity is always 0.0530.

@sayakpaul
Member

With disabled adapters, similarity is always 0.0530.

Okay then we can conclude this variance isn't coming from the pipeline. I will look into the similarity computation code tomorrow.

@BenjaminBossan
Member Author

BenjaminBossan commented Mar 5, 2026

Thanks. I did confirm that get_dino_embeddings itself is deterministic. Maybe it's just that tiny changes in the adapter weights are amplified when generating with 20 steps.
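For context, a DINO-based similarity along these lines was used; this is only a rough sketch, and the actual get_dino_embeddings in run.py may use a different checkpoint or pooling:

```python
import torch
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
dino = AutoModel.from_pretrained("facebook/dinov2-base").eval()

@torch.no_grad()
def get_dino_embeddings(images):
    inputs = processor(images=images, return_tensors="pt")
    # Use the CLS token as the per-image embedding.
    return dino(**inputs).last_hidden_state[:, 0]

@torch.no_grad()
def dino_similarity(generated_images, reference_images):
    gen = get_dino_embeddings(generated_images)
    ref = get_dino_embeddings(reference_images)
    # Mean pairwise cosine similarity between generated and reference images.
    sim = torch.nn.functional.cosine_similarity(gen.unsqueeze(1), ref.unsqueeze(0), dim=-1)
    return sim.mean().item()
```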

@sayakpaul
Member

Maybe it's just that tiny changes in the adapter weights are amplified when generating with 20 steps.

Could be. But do the results stay consistent across different generation rounds when using the same adapter?

I think we can still report the DINO similarity scores, TBH.

@srijondasgit

srijondasgit commented Mar 5, 2026

The experiment was run on a Mac development machine, so training is significantly slower than typical GPU-based diffusion benchmarks. The GPU on my Mac is active sometimes, but it stays idle a lot, so overall training is quite slow.

I created this PR to use a model that can be fully fine-tuned (vs. trained with PEFT) on a Mac.

All the weights of UNet2DConditionModel were initialized from the model checkpoint at runwayml/stable-diffusion-v1-5.
If your task is similar to the task the model of the checkpoint was trained on, you can already use UNet2DConditionModel for predictions without further training.
03/05/2026 22:59:03 - INFO - main - [RANK 0] ***** Running training *****
03/05/2026 22:59:03 - INFO - main - [RANK 0] Num examples = 5
03/05/2026 22:59:03 - INFO - main - [RANK 0] Num batches each epoch = 5
03/05/2026 22:59:03 - INFO - main - [RANK 0] Num Epochs = 250
03/05/2026 22:59:03 - INFO - main - [RANK 0] Instantaneous batch size per device = 1
03/05/2026 22:59:03 - INFO - main - [RANK 0] Total train batch size (w. parallel, distributed & accumulation) = 4
03/05/2026 22:59:03 - INFO - main - [RANK 0] Gradient Accumulation steps = 4
03/05/2026 22:59:03 - INFO - main - [RANK 0] Total optimization steps = 500
Steps: 0%| | 0/500 [00:00<?, ?it/s]/Users/srijon/Desktop/peft_diffuser_benchmark1/peft/venv/lib/python3.11/site-packages/torch/_dynamo/variables/functions.py:2056: UserWarning: Dynamo detected a call to a functools.lru_cache-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a potential risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
torch._dynamo.utils.warn_once(msg)
Steps: 35%|███████████████████████▉ | 176/500 [1:14:05<3:30:40, 39.01s/it, loss=0.222, lr=0.0001]

Final step ..

Steps: 100%|████████████████████████████████████████████████████████████████████| 500/500 [2:47:22<00:00, 20.08s/it, loss=0.00471, lr=0.0001]

Inference ...
Loading pipeline components...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:10<00:00, 1.49s/it]
No LoRA keys associated to CLIPTextModel found with the prefix='text_encoder'. This is safe to ignore if LoRA state dict didn't originally have any CLIPTextModel related params. You can also try specifying prefix=None to resolve the warning. Otherwise, open an issue if you think it's unexpected: https://github.com/huggingface/diffusers/issues/new
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [01:10<00:00, 1.41s/it]
Image saved as dog_inference.png

The dog bouncing on a trampoline...
dog_inference

@BenjaminBossan
Member Author

But do the results stay consistent across different generation rounds when using the same adapter?

So I ran evaluation three times every 50 steps instead of just once and collected the similarity scores. With the same seed, the score is always identical, as should be expected. When using different seeds, the scores can vary quite a bit:

step  50: [0.1211815923452377, 0.5188983678817749, 0.04316230118274689]
step 100: [0.5750174522399902, 0.6544530391693115, 0.5297497510910034]
step 150: [0.5367261171340942, 0.673410177230835, 0.6793834567070007]
step 200: [0.8191202282905579, 0.7153311967849731, 0.8426690101623535]
step 250: [0.6548720598220825, 0.7877984046936035, 0.7447556853294373]

I think we can still report the DINO similarity scores, TBH.

Yes, for sure. If there were a more reliable score (or a dataset that leads to a more reliable score), it would give users a better signal, though.

For now, I'd say let's do only one score for validation, as it's quite slow, but we can do multiple scores for the test set and average those.

@BenjaminBossan
Member Author

BenjaminBossan commented Mar 6, 2026

Just tested OFT (default settings), which is intended especially for image tasks, and it got a similarity of 0.747 (average of 10):
image
So this benchmark seems to work for other PEFT methods too.

@sayakpaul
Member

Very cool! Let's get this shipped ASAP then!

@BenjaminBossan
Member Author

Yep, I'll run a few more benchmarks with PEFT methods that explicitly mention they're good at image generation. One more step would be to convert the resulting checkpoints to LoRA and check if they retain the quality.

@BenjaminBossan
Member Author

DeLoRA

0.815 similarity

delora--flux2-klein-default--2026-03-06T16-42-52+00-00_01

GraLoRA

0.8007 similarity

gralora--flux2-klein-default--2026-03-06T16-49-16+00-00_01

HRA

OOM

WaveFT

0.353 similarity

waveft--flux2-klein-default--2026-03-06T16-56-23+00-00_01

LoRA rank 64

0.8366 similarity

lora--flux2-klein-rank64--2026-03-06T17-03-08+00-00_01

@BenjaminBossan
Member Author

BenjaminBossan commented Mar 6, 2026

GraLoRA checkpoint converted to LoRA with rank 128. The similarity score of the converted LoRA is 0.8015, so no difference from the original. However, there is visible degradation of the subject in the generated images (this was just my first attempt, so there is probably room for improvement).

Original:

gralora--flux2-klein-default--2026-03-06T16-49-16+00-00_01

Converted:

image

"a photo of sks dog diving under water" original:

gralora--flux2-klein-default--2026-03-06T16-49-16+00-00_03

"a photo of sks dog diving under water" converted:

image

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@BenjaminBossan
Member Author

@sayakpaul @githubnemo From my POV, this PR is ready to be reviewed. There are still some TODOs:

  • Add further experiments (more PEFT methods) and explore better hyper-parameters (could be outsourced to the community?).
  • Test images are already created but they're not uploaded anywhere.
  • The method comparison Gradio app needs to be updated to show the results.

I don't think those are crucial for merging the PR, and I documented them in the corresponding README.md. I'll tackle them in follow-up PRs once this one is merged.

@sayakpaul
Member

So, #3082 (comment) is regular LoRA and #3082 (comment) is OFT converted to LoRA or just OFT?

@BenjaminBossan
Member Author

That was regular OFT. That comment showed a converted adapter.

Member

@sayakpaul sayakpaul left a comment

I didn't review all the files, as the results seem okay to me, confirming the effectiveness of the implementation. But if you want me to look into specific files, I can do that.

"layers_to_transform": null,
"modules_to_save": null,
"peft_type": "GRALORA",
"peft_version": "0.18.2.dev0@UNKNOWN",
Member

Member Author

Yeah, we could do something more sophisticated here, but it's out of scope for this PR. The PEFT version is added here just for documentation purposes (and to run migrations if needed). It's not relevant for experiment reproducibility; we track the actual PEFT commit hash used for the experiment in the results JSON. The value here is just the commit used to create the adapter_config.json.

@@ -0,0 +1,8 @@
{
Member

What's the advantage of keeping training params different from adapter params?

Member Author

  1. I follow the same convention as in MetaMath.
  2. The adapter params are just serialized PEFT configs that are easy to generate. A unified file would require extra effort to create.
  3. Easier to compose different settings.

Comment thread method_comparison/image-gen/Makefile Outdated
@@ -0,0 +1,90 @@
# Makefile for running MetaMathQA experiments.
Member

No MetaMathQA here.

Member Author

Thanks, fixed.

# Makefile for running MetaMathQA experiments.

# --- Configuration ---
PYTHON := python
Member

Honestly, a Makefile for launching a couple of experiments seems more complicated than it needs to be. But I am sure I am missing out on something. What's the advantage of using Makefiles for this?

Member Author

First, it's the same as MetaMath, so we keep it for consistency. Second, it checks which experiments have already run and only runs the missing ones. I'd say that's pretty much the main use case for make.

@@ -0,0 +1,6 @@
datasets
diffusers
Member

Let's fix the versions.

Member Author

You mean a fixed version or a min version? I'd be okay with a min version if needed, but in general we don't want to fix the version. I think it's more useful to think of this benchmark as showing the current state of the tested PEFT methods, which is why we re-run the experiments every couple of months or when we suspect a recent change could affect the results.

For reproducibility, we log the versions of the most important packages in the results file, but guaranteeing 100% reproducibility is not a goal of this benchmark.
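For illustration, capturing the relevant versions in the results file can be as simple as the following sketch (the benchmark's actual keys and set of packages may differ):

```python
import platform
from importlib.metadata import version

# Record the interpreter and key library versions alongside the results.
package_info = {
    "python": platform.python_version(),
    "torch": version("torch"),
    "transformers": version("transformers"),
    "diffusers": version("diffusers"),
    "peft": version("peft"),
}
```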

Member

Maybe let's specify the version where Klein was introduced (0.37.0)?

- update comment in Makefile
- initiate the Generator consistently
Member Author

@BenjaminBossan BenjaminBossan left a comment

@sayakpaul I addressed/replied to all of your comments.

I didn't review all the files, as the results seem okay for me, confirming the effectiveness of the implementation. But if you want me to look into specific files, I can do that.

Yes, fine not to review everything (a lot is adapted from the MetaMath benchmark anyway). Most important is whether the results look as expected or if there is a significant flaw in the training itself.


@BenjaminBossan
Member Author

I also updated the method comparison app to include the results from the image gen benchmark (choose benchmark through the dropdown menu). Right now, there are no results to show, but I ran a couple of experiments locally to check the app. Here is a screenshot:

image

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@BenjaminBossan
Member Author

not stale

This helps to save memory: for LoRA, from ~21 GB to ~17 GB max. Training
time increases from 50 sec to 71 sec per 100 steps, which is an
acceptable tradeoff overall.

In contrast to the MetaMathQA benchmark, I turn this on by default, as
the image gen benchmark has several methods that OOM.
Copilot AI left a comment

Pull request overview

Adds a new DreamBooth-style image generation benchmark under method_comparison/image-gen and extends the existing method-comparison pipeline (processing + Gradio app) to support multiple tasks (MetaMathQA + image-gen) with task-specific metrics and Pareto defaults.

Changes:

  • Add a new image-gen benchmark (training/eval pipeline, dataset handling, logging, experiments, Makefile, docs).
  • Refactor processing.py to support task-specific preprocessing, dtypes, and “important columns”.
  • Update method_comparison/app.py to load multiple tasks and use task-specific metric preferences and defaults.

Reviewed changes

Copilot reviewed 41 out of 47 changed files in this pull request and generated 12 comments.

Summary per file:

| File | Description |
| --- | --- |
| method_comparison/processing.py | Refactors preprocessing to support multiple tasks and adds image-gen metrics/dtypes/column ordering. |
| method_comparison/app.py | Adds multi-task loading, task-specific metric preferences/defaults, and updates Pareto plotting UI logic. |
| method_comparison/image-gen/utils.py | Implements shared utilities for the image-gen benchmark (config, pipeline setup, logging, HF bucket uploads). |
| method_comparison/image-gen/run.py | Adds the main training/eval script for the image-gen benchmark (including DINO similarity + drift). |
| method_comparison/image-gen/data.py | Adds dataset loading/splitting and image preprocessing for training/eval. |
| method_comparison/image-gen/default_training_params.json | Provides default training/eval parameters for image-gen runs. |
| method_comparison/image-gen/requirements.txt | Declares extra deps needed for the image-gen benchmark. |
| method_comparison/image-gen/README.md | Documents how to run the image-gen benchmark and configure experiments. |
| method_comparison/image-gen/Makefile | Adds sweep automation to run only experiments missing result files. |
| method_comparison/image-gen/results/.gitkeep | Keeps results dir in git. |
| method_comparison/image-gen/temporary_results/.gitkeep | Keeps temporary_results dir in git. |
| method_comparison/image-gen/cancelled_results/.gitkeep | Keeps cancelled_results dir in git. |
| method_comparison/image-gen/sample-images/results/.gitkeep | Keeps sample image results dir in git. |
| method_comparison/image-gen/sample-images/temporary_results/.gitkeep | Keeps sample image temporary_results dir in git. |
| method_comparison/image-gen/sample-images/cancelled_results/.gitkeep | Keeps sample image cancelled_results dir in git. |
| method_comparison/image-gen/experiments/adalora/flux2-klein-default/adapter_config.json | Adds an AdaLoRA experiment config for image-gen. |
| method_comparison/image-gen/experiments/boft/flux2-klein-default/adapter_config.json | Adds a BOFT experiment config for image-gen. |
| method_comparison/image-gen/experiments/c3a/flux2-klein-default/adapter_config.json | Adds a C3A experiment config for image-gen. |
| method_comparison/image-gen/experiments/c3a/flux2-klein-default/training_params.json | Overrides training params for C3A image-gen experiment. |
| method_comparison/image-gen/experiments/delora/flux2-klein-default/adapter_config.json | Adds a DeLoRA experiment config for image-gen. |
| method_comparison/image-gen/experiments/delora/flux2-klein-default/training_params.json | Overrides training params for DeLoRA image-gen experiment. |
| method_comparison/image-gen/experiments/fourierft/flux2-klein-default/adapter_config.json | Adds a FourierFT experiment config for image-gen. |
| method_comparison/image-gen/experiments/full-finetuning/flux2-klein-default/training_params.json | Adds a full fine-tuning experiment override for image-gen. |
| method_comparison/image-gen/experiments/gralora/flux2-klein-default/adapter_config.json | Adds a GraLoRA experiment config for image-gen. |
| method_comparison/image-gen/experiments/hra/flux2-klein-default/adapter_config.json | Adds an HRA experiment config for image-gen. |
| method_comparison/image-gen/experiments/ia3/flux2-klein-default/adapter_config.json | Adds an IA3 experiment config for image-gen. |
| method_comparison/image-gen/experiments/ia3/flux2-klein-default/training_params.json | Overrides training params for IA3 image-gen experiment. |
| method_comparison/image-gen/experiments/lily/flux2-klein-default/adapter_config.json | Adds a LILY experiment config for image-gen. |
| method_comparison/image-gen/experiments/ln_tuning/flux2-klein-default/adapter_config.json | Adds an LN_TUNING experiment config for image-gen. |
| method_comparison/image-gen/experiments/loha/flux2-klein-default/adapter_config.json | Adds a LoHa experiment config for image-gen. |
| method_comparison/image-gen/experiments/lokr/flux2-klein-default/adapter_config.json | Adds a LoKr experiment config for image-gen. |
| method_comparison/image-gen/experiments/lora/flux2-klein-default/adapter_config.json | Adds a LoRA experiment config for image-gen. |
| method_comparison/image-gen/experiments/miss/flux2-klein-default/adapter_config.json | Adds a MISS experiment config for image-gen. |
| method_comparison/image-gen/experiments/oft/flux2-klein-default/adapter_config.json | Adds an OFT experiment config for image-gen. |
| method_comparison/image-gen/experiments/osf/flux2-klein-default/adapter_config.json | Adds an OSF experiment config for image-gen. |
| method_comparison/image-gen/experiments/peanut/flux2-klein-default/adapter_config.json | Adds a PEANUT experiment config for image-gen. |
| method_comparison/image-gen/experiments/psoft/flux2-klein-default/adapter_config.json | Adds a PSOFT experiment config for image-gen. |
| method_comparison/image-gen/experiments/pvera/flux2-klein-default/adapter_config.json | Adds a PVERA experiment config for image-gen. |
| method_comparison/image-gen/experiments/pvera/flux2-klein-default/training_params.json | Overrides training params for PVERA image-gen experiment. |
| method_comparison/image-gen/experiments/randlora/flux2-klein-default/adapter_config.json | Adds a RandLoRA experiment config for image-gen. |
| method_comparison/image-gen/experiments/road/flux2-klein-default/adapter_config.json | Adds a ROAD experiment config for image-gen. |
| method_comparison/image-gen/experiments/road/flux2-klein-default/training_params.json | Overrides training params for ROAD image-gen experiment. |
| method_comparison/image-gen/experiments/shira/flux2-klein-default/adapter_config.json | Adds a SHIRA experiment config for image-gen. |
| method_comparison/image-gen/experiments/vblora/flux2-klein-default/adapter_config.json | Adds a VBLoRA experiment config for image-gen. |
| method_comparison/image-gen/experiments/vera/flux2-klein-default/adapter_config.json | Adds a VERA experiment config for image-gen. |
| method_comparison/image-gen/experiments/vera/flux2-klein-default/training_params.json | Overrides training params for VERA image-gen experiment. |
| method_comparison/image-gen/experiments/waveft/flux2-klein-default/adapter_config.json | Adds a WaveFT experiment config for image-gen. |


[
T.Resize(size, interpolation=T.InterpolationMode.BILINEAR),
T.ToTensor(),
T.Normalize([0.5], [0.5]),
Copilot AI Apr 27, 2026

T.Normalize([0.5], [0.5]) expects one mean/std per channel, but images are explicitly converted to RGB (3 channels). With RGB tensors this will raise a shape mismatch. Use 3-value mean/std lists (e.g. one per channel) to match the [3, H, W] tensor shape.

Suggested change
T.Normalize([0.5], [0.5]),
T.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),

Comment thread method_comparison/image-gen/default_training_params.json
Comment thread method_comparison/image-gen/README.md Outdated
Comment thread method_comparison/image-gen/README.md Outdated
Comment thread method_comparison/image-gen/README.md Outdated
Comment thread method_comparison/image-gen/default_training_params.json
Comment thread method_comparison/image-gen/run.py Outdated
Comment thread method_comparison/image-gen/run.py
Comment thread method_comparison/image-gen/run.py
Comment thread method_comparison/processing.py
Contrary to MetaMathQA, the eval in the image gen benchmark adds
significant memory usage overall due to the use of a second model (Dino)
for calculating the metric. This should not count towards the max memory
usage, as it is not an inherent part of the PEFT method being used.
@BenjaminBossan
Member Author

BenjaminBossan commented Apr 27, 2026

@sayakpaul I made some further changes to the PR but now it's ready from my point of view. I would be happy for another review.

Results are logged into jsons and shown in the Gradio app, same as with the existing MetaMath benchmark. Main changes since the last review:

  • add more experiments
  • update Gradio app to include results
  • option to upload the checkpoint files and sample images to an HF bucket
  • now track the memory only for the training part, not the eval (see the sketch below)

As for potentially showing the sample images in the app, that's something I'd look into once we have created the images.
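For reference, scoping the reported peak memory to training only can be done along these lines (a minimal sketch, not necessarily the exact accounting in run.py):

```python
import torch

# Reset the peak-memory counter right before training starts.
torch.cuda.reset_peak_memory_stats()
# ... training loop runs here ...
train_peak_memory_mb = torch.cuda.max_memory_allocated() / 2**20
# The DINO-based evaluation runs after this reading, so the extra memory it
# needs does not count towards the reported maximum.
```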
