Image generation benchmark #3082
Conversation
Adds an image generation benchmark to PEFT, similar to the existing MetaMathQA benchmark. Right now, it uses Flux2 Klein 4B to train an adapter on the cat toy dataset. In its current state, the model is hardly learning, so some debugging is needed.
|
To strengthen the benchmark, would it be possible to include a LoRA rank sweep (r = 8, 16, 32, 64) to better understand the trade-off between adapter capacity and generation quality, as well as a full fine-tune baseline to serve as an upper-bound reference for performance and convergence behavior? These additions could help determine whether the current learning instability is capacity-related or optimization-related. Additionally, GPU utilization is currently low (~30% at ~5 sec/iteration for 1024×1024), so improving compute efficiency will be important to accelerate experimentation, reduce turnaround time for rank sweeps and baselines, and make the benchmark more scalable. |
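(Illustrative only, not part of the PR.) A minimal sketch of what generating configs for such a rank sweep could look like with the public PEFT API; the target modules and output paths are placeholders, not values from this benchmark:

```python
from peft import LoraConfig

# Hypothetical rank sweep; target_modules and the output paths are illustrative
# placeholders, not the settings used in this benchmark.
for rank in (8, 16, 32, 64):
    config = LoraConfig(r=rank, lora_alpha=2 * rank, target_modules=["to_q", "to_k", "to_v", "to_out.0"])
    config.save_pretrained(f"experiments/lora/flux2-klein-r{rank}")
```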
|
@srijondasgit This is still WIP, there will be more work on performance and hyper-parameters before we merge this. At this stage, I'm still trying to figure out if something fundamental is broken. |
- use dog dataset (but may revert to cat later)
- several smaller fixes
- precompute stuff and onload for faster results
- 1024x1024 by default for Flux 2
Clean up data loading, precompute latents, simplify eval.
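For context, a rough sketch of the latent-precomputation idea using standard diffusers/torch APIs; the VAE checkpoint and scaling are assumptions (SD 1.5 is only mentioned elsewhere in this thread), not necessarily what the benchmark does:

```python
import torch
from diffusers import AutoencoderKL

# Encode the training images once and cache the latents, so the VAE does not
# have to run on every training step. The checkpoint name is a placeholder.
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae").eval()

@torch.no_grad()
def precompute_latents(pixel_values: torch.Tensor) -> torch.Tensor:
    # pixel_values: [N, 3, H, W], already normalized to [-1, 1]
    latents = vae.encode(pixel_values).latent_dist.sample()
    return latents * vae.config.scaling_factor
```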
|
@srijondasgit With the latest changes, the benchmark should run much faster and yield better results. Feel free to test it. If you have suggestions for better hyper-parameters, feel free to share. Note that there is already an experiment setting for full fine-tuning, but I can't run it right now (not enough memory on my machine). |
| pipeline.text_encoder.to(target_device) | ||
| try: | ||
| for idx, prompt in enumerate(train_config.sample_image_prompts, start=1): | ||
| generator = torch.Generator(device=target_device).manual_seed(train_config.seed + 100_000 + idx) |
Should this differ from how seeding is done in evaluate()?
Good point, I moved the generator out of the loop to be consistent.
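For reference, a minimal sketch of the pattern being described, i.e. seeding a single torch.Generator before the loop so that sampling and evaluate() follow the same scheme; the pipeline call arguments and output path are assumptions, not copied from the script:

```python
import torch

# Sketch only: `pipeline`, `train_config` and `target_device` stand for the
# objects from the benchmark script quoted above.
generator = torch.Generator(device=target_device).manual_seed(train_config.seed)
for idx, prompt in enumerate(train_config.sample_image_prompts, start=1):
    image = pipeline(prompt, generator=generator).images[0]
    image.save(f"sample_{idx}.png")  # illustrative output path
```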
| total_samples += current_batch_size | ||
| model_input_ids = pipeline._prepare_latent_ids(latents).to(latents.device) | ||
| noise = torch.randn_like(latents) |
This is not seeded and can cause non-determinism across runs.
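A sketch of the seeded alternative; torch.randn_like does not take a generator, so torch.randn with an explicit shape is the usual workaround (the tensors below are placeholders standing in for the script's latents and seeded generator):

```python
import torch

# Placeholders for the script's cached latents and seeded generator.
generator = torch.Generator().manual_seed(0)
latents = torch.zeros(1, 16, 64, 64)

# torch.randn_like(latents) cannot be seeded; torch.randn with the same
# shape/dtype/device and an explicit generator is deterministic.
noise = torch.randn(latents.shape, generator=generator, dtype=latents.dtype, device=latents.device)
```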
| train_config=train_config, print_fn=print_verbose | ||
| ) | ||
| train_size_base = len(train_dataset["prompts"]) | ||
| train_indices = torch.cat([torch.randperm(train_size_base) for _ in range(train_dataset["repeats"])]) |
Doesn't use a generator, which can introduce randomness between runs.
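A sketch of the seeded variant; torch.randperm accepts a generator argument (the sizes below are placeholders for the dataset values in the script):

```python
import torch

# Placeholders for len(train_dataset["prompts"]) and train_dataset["repeats"].
train_size_base, repeats = 5, 3
generator = torch.Generator().manual_seed(0)

# Same shuffling logic as in the script, but reproducible across runs.
train_indices = torch.cat([torch.randperm(train_size_base, generator=generator) for _ in range(repeats)])
```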
|
@sayakpaul Thank you for your review, good catch. I have addressed the seeding in the latest commit. When re-running the experiment, there is, however, still considerable variance in the similarity score. The losses are pretty tight though: |
This suggests that the generated images are a bit different. Could you disable the adapter and run the pipeline in the exact same setting a few times to see whether the results are the same or different? This should tell us whether, despite the seeding, the trained adapters still end up different across runs. |
|
With disabled adapters, similarity is always 0.0530. |
Okay then we can conclude this variance isn't coming from the pipeline. I will look into the similarity computation code tomorrow. |
|
Thanks. I did confirm that |
Could be. But do the results stay consistent across different generation rounds when using the same adapter? I think we can still report the DINO similarity scores, TBH
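For readers unfamiliar with the metric: DINO similarity for DreamBooth-style evaluation is typically the cosine similarity between DINO embeddings of a generated image and a reference image. A rough sketch under that assumption; the checkpoint and pooling choice are guesses, not necessarily what this benchmark implements:

```python
import torch
from transformers import AutoImageProcessor, AutoModel

# Assumed checkpoint; the benchmark may use a different DINO variant.
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base").eval()

@torch.no_grad()
def dino_similarity(generated_image, reference_image) -> float:
    # Cosine similarity between the CLS embeddings of the two (PIL) images.
    inputs = processor(images=[generated_image, reference_image], return_tensors="pt")
    emb = model(**inputs).last_hidden_state[:, 0]
    emb = torch.nn.functional.normalize(emb, dim=-1)
    return float((emb[0] * emb[1]).sum())
```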
|
The experiment was run on a Mac development machine, so training is significantly slower than typical GPU-based diffusion benchmarks. The GPU on my Mac is active sometimes, but it stays idle a lot, so overall training is quite slow. I created this PR to use a model that can be used for full fine-tuning vs PEFT on a Mac. All the weights of UNet2DConditionModel were initialized from the model checkpoint at runwayml/stable-diffusion-v1-5. Final step:
Steps: 100%|████████████████████████████████████████████████████████████████████| 500/500 [2:47:22<00:00, 20.08s/it, loss=0.00471, lr=0.0001]
Inference ... |
So I ran evaluation three times every 50 steps instead of just once and collected the similarity scores. With the same seed, the score is always identical, as should be expected. When using different seeds, the scores can vary quite a bit:
Yes, for sure. If there were a more reliable score (or a dataset that leads to a more reliable score), though, it would give users a better signal. For now, I'd say let's compute only one score for validation, as it's quite slow, but we can compute multiple scores for the test set and average those. |
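A sketch of the multi-seed test-set averaging; `score_fn` is a hypothetical hook into the eval code that generates the test images for one seeded generator and returns a scalar similarity:

```python
import torch

def averaged_test_score(score_fn, seeds=(0, 1, 2), device="cpu"):
    # score_fn(generator) -> float is a hypothetical callable, not from the PR.
    scores = [score_fn(torch.Generator(device=device).manual_seed(s)) for s in seeds]
    return sum(scores) / len(scores)
```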
|
Just tested OFT (default settings), which is intended especially for image tasks, and it got a similarity of 0.747 (average of 10): |
|
Very cool! Let's get this shipped ASAP then! |
|
Yep, I'll run a few more benchmarks with PEFT methods that explicitly mention they're good at image generation. One more step would be to convert the resulting checkpoints to LoRA and check if they retain the quality. |
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
|
@sayakpaul @githubnemo From my POV, this PR is ready to be reviewed. There are still some TODOs:
I don't think those are crucial for merging the PR, and I documented them in the corresponding README.md. I'll tackle them in follow-up PRs once this one is merged. |
|
So, #3082 (comment) is regular LoRA and #3082 (comment) is OFT converted to LoRA or just OFT? |
|
That was regular OFT. That comment showed a converted adapter. |
sayakpaul
left a comment
I didn't review all the files, as the results seem okay for me, confirming the effectiveness of the implementation. But if you want me to look into specific files, I can do that.
| "layers_to_transform": null, | ||
| "modules_to_save": null, | ||
| "peft_type": "GRALORA", | ||
| "peft_version": "0.18.2.dev0@UNKNOWN", |
https://pypi.org/project/setuptools-git-versioning/ is pretty useful here.
Yeah, we could do something more sophisticated here, but it's out of scope for this PR. The PEFT version is added here just for documentation purposes (and to run migrations if needed). It's not relevant for experiment reproducibility; we track the actual PEFT commit hash used for the experiment in the results JSON. The value here is just the commit used to create the adapter_config.json.
| @@ -0,0 +1,8 @@ | |||
| { | |||
What's the advantage of keeping the training params separate from the adapter params?
- I follow the same convention as in MetaMath.
- The adapter params are just serialized PEFT configs, which are easy to generate. A unified file would require extra effort to create.
- Easier to compose different settings.
| @@ -0,0 +1,90 @@ | |||
| # Makefile for running MetaMathQA experiments. | |||
| # --- Configuration --- | ||
| PYTHON := python |
Honestly, a Makefile for launching a couple of experiments seems more complicated than it needs to be. But I am sure I am missing out on something. What's the advantage of using Makefiles for this?
First, it's the same as MetaMath, so we keep it for consistency. Second, it checks which experiments have already run and only runs the missing ones. I'd say that's pretty much the main use case for make.
| @@ -0,0 +1,6 @@ | |||
| datasets | |||
| diffusers | |||
You mean a fixed version or a min version? I'd be okay with a min version if needed, but in general we don't want to fix the version. I think it's more useful to think of this benchmark as showing the current state of the tested PEFT methods, which is why we re-run the experiments every couple of months or when we suspect a recent change could affect the results.
For reproducibility, we log the versions of the most important packages in the results file, but guaranteeing 100% reproducibility is not a goal of this benchmark.
Maybe let's specify the version where Klein was introduced (0.37.0)?
- update comment in Makefile
- initialize the Generator consistently
BenjaminBossan
left a comment
@sayakpaul I addressed/replied to all of your comments.
I didn't review all the files, as the results seem okay for me, confirming the effectiveness of the implementation. But if you want me to look into specific files, I can do that.
Yes, fine not to review everything (a lot is adapted from the MetaMath benchmark anyway). Most important is whether the results look as expected or if there is a significant flaw in the training itself.
|
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. |
|
not stale |
no values changed
This helps save memory; for LoRA, the max goes from ~21 GB to ~17 GB. Training time increases from 50 sec to 71 sec per 100 steps, which is an acceptable trade-off overall. In contrast to the MetaMathQA benchmark, I turn this on by default, as the image-gen benchmark has several methods that OOM.
Pull request overview
Adds a new DreamBooth-style image generation benchmark under method_comparison/image-gen and extends the existing method-comparison pipeline (processing + Gradio app) to support multiple tasks (MetaMathQA + image-gen) with task-specific metrics and Pareto defaults.
Changes:
- Add a new `image-gen` benchmark (training/eval pipeline, dataset handling, logging, experiments, Makefile, docs).
- Refactor `processing.py` to support task-specific preprocessing, dtypes, and “important columns”.
- Update `method_comparison/app.py` to load multiple tasks and use task-specific metric preferences and defaults.
Reviewed changes
Copilot reviewed 41 out of 47 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| method_comparison/processing.py | Refactors preprocessing to support multiple tasks and adds image-gen metrics/dtypes/column ordering. |
| method_comparison/app.py | Adds multi-task loading, task-specific metric preferences/defaults, and updates Pareto plotting UI logic. |
| method_comparison/image-gen/utils.py | Implements shared utilities for the image-gen benchmark (config, pipeline setup, logging, HF bucket uploads). |
| method_comparison/image-gen/run.py | Adds the main training/eval script for the image-gen benchmark (including DINO similarity + drift). |
| method_comparison/image-gen/data.py | Adds dataset loading/splitting and image preprocessing for training/eval. |
| method_comparison/image-gen/default_training_params.json | Provides default training/eval parameters for image-gen runs. |
| method_comparison/image-gen/requirements.txt | Declares extra deps needed for the image-gen benchmark. |
| method_comparison/image-gen/README.md | Documents how to run the image-gen benchmark and configure experiments. |
| method_comparison/image-gen/Makefile | Adds sweep automation to run only experiments missing result files. |
| method_comparison/image-gen/results/.gitkeep | Keeps results dir in git. |
| method_comparison/image-gen/temporary_results/.gitkeep | Keeps temporary_results dir in git. |
| method_comparison/image-gen/cancelled_results/.gitkeep | Keeps cancelled_results dir in git. |
| method_comparison/image-gen/sample-images/results/.gitkeep | Keeps sample image results dir in git. |
| method_comparison/image-gen/sample-images/temporary_results/.gitkeep | Keeps sample image temporary_results dir in git. |
| method_comparison/image-gen/sample-images/cancelled_results/.gitkeep | Keeps sample image cancelled_results dir in git. |
| method_comparison/image-gen/experiments/adalora/flux2-klein-default/adapter_config.json | Adds an AdaLoRA experiment config for image-gen. |
| method_comparison/image-gen/experiments/boft/flux2-klein-default/adapter_config.json | Adds a BOFT experiment config for image-gen. |
| method_comparison/image-gen/experiments/c3a/flux2-klein-default/adapter_config.json | Adds a C3A experiment config for image-gen. |
| method_comparison/image-gen/experiments/c3a/flux2-klein-default/training_params.json | Overrides training params for C3A image-gen experiment. |
| method_comparison/image-gen/experiments/delora/flux2-klein-default/adapter_config.json | Adds a DeLoRA experiment config for image-gen. |
| method_comparison/image-gen/experiments/delora/flux2-klein-default/training_params.json | Overrides training params for DeLoRA image-gen experiment. |
| method_comparison/image-gen/experiments/fourierft/flux2-klein-default/adapter_config.json | Adds a FourierFT experiment config for image-gen. |
| method_comparison/image-gen/experiments/full-finetuning/flux2-klein-default/training_params.json | Adds a full fine-tuning experiment override for image-gen. |
| method_comparison/image-gen/experiments/gralora/flux2-klein-default/adapter_config.json | Adds a GraLoRA experiment config for image-gen. |
| method_comparison/image-gen/experiments/hra/flux2-klein-default/adapter_config.json | Adds an HRA experiment config for image-gen. |
| method_comparison/image-gen/experiments/ia3/flux2-klein-default/adapter_config.json | Adds an IA3 experiment config for image-gen. |
| method_comparison/image-gen/experiments/ia3/flux2-klein-default/training_params.json | Overrides training params for IA3 image-gen experiment. |
| method_comparison/image-gen/experiments/lily/flux2-klein-default/adapter_config.json | Adds a LILY experiment config for image-gen. |
| method_comparison/image-gen/experiments/ln_tuning/flux2-klein-default/adapter_config.json | Adds an LN_TUNING experiment config for image-gen. |
| method_comparison/image-gen/experiments/loha/flux2-klein-default/adapter_config.json | Adds a LoHa experiment config for image-gen. |
| method_comparison/image-gen/experiments/lokr/flux2-klein-default/adapter_config.json | Adds a LoKr experiment config for image-gen. |
| method_comparison/image-gen/experiments/lora/flux2-klein-default/adapter_config.json | Adds a LoRA experiment config for image-gen. |
| method_comparison/image-gen/experiments/miss/flux2-klein-default/adapter_config.json | Adds a MISS experiment config for image-gen. |
| method_comparison/image-gen/experiments/oft/flux2-klein-default/adapter_config.json | Adds an OFT experiment config for image-gen. |
| method_comparison/image-gen/experiments/osf/flux2-klein-default/adapter_config.json | Adds an OSF experiment config for image-gen. |
| method_comparison/image-gen/experiments/peanut/flux2-klein-default/adapter_config.json | Adds a PEANUT experiment config for image-gen. |
| method_comparison/image-gen/experiments/psoft/flux2-klein-default/adapter_config.json | Adds a PSOFT experiment config for image-gen. |
| method_comparison/image-gen/experiments/pvera/flux2-klein-default/adapter_config.json | Adds a PVERA experiment config for image-gen. |
| method_comparison/image-gen/experiments/pvera/flux2-klein-default/training_params.json | Overrides training params for PVERA image-gen experiment. |
| method_comparison/image-gen/experiments/randlora/flux2-klein-default/adapter_config.json | Adds a RandLoRA experiment config for image-gen. |
| method_comparison/image-gen/experiments/road/flux2-klein-default/adapter_config.json | Adds a ROAD experiment config for image-gen. |
| method_comparison/image-gen/experiments/road/flux2-klein-default/training_params.json | Overrides training params for ROAD image-gen experiment. |
| method_comparison/image-gen/experiments/shira/flux2-klein-default/adapter_config.json | Adds a SHIRA experiment config for image-gen. |
| method_comparison/image-gen/experiments/vblora/flux2-klein-default/adapter_config.json | Adds a VBLoRA experiment config for image-gen. |
| method_comparison/image-gen/experiments/vera/flux2-klein-default/adapter_config.json | Adds a VERA experiment config for image-gen. |
| method_comparison/image-gen/experiments/vera/flux2-klein-default/training_params.json | Overrides training params for VERA image-gen experiment. |
| method_comparison/image-gen/experiments/waveft/flux2-klein-default/adapter_config.json | Adds a WaveFT experiment config for image-gen. |
| [ | ||
| T.Resize(size, interpolation=T.InterpolationMode.BILINEAR), | ||
| T.ToTensor(), | ||
| T.Normalize([0.5], [0.5]), |
T.Normalize([0.5], [0.5]) expects one mean/std per channel, but images are explicitly converted to RGB (3 channels). With RGB tensors this will raise a shape mismatch. Use 3-value mean/std lists (e.g. one per channel) to match the [3, H, W] tensor shape.
| T.Normalize([0.5], [0.5]), | |
| T.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]), |
In contrast to MetaMathQA, the eval in the image-gen benchmark adds significant memory usage overall due to the use of a second model (DINO) for calculating the metric. This should not count towards the max memory usage, as it is not an inherent part of the PEFT method being used.
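One possible way to keep the metric model out of the reported peak, using the standard torch.cuda counters; where exactly this would be placed in the script is an assumption:

```python
import torch

# Snapshot the training peak before the eval-only model (e.g. DINO) is loaded,
# then reset the counter so its allocations are not attributed to the PEFT
# method under test.
if torch.cuda.is_available():
    peak_training_bytes = torch.cuda.max_memory_allocated()
    torch.cuda.reset_peak_memory_stats()
```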
|
@sayakpaul I made some further changes to the PR, but now it's ready from my point of view. I would be happy for another review. Results are logged into JSON files and shown in the Gradio app, same as with the existing MetaMath benchmark. Main changes since the last review:
As for potentially showing the sample images in the app, that's something I'd look into once we have created the images. |
(sample images attached to this comment)
Adds an image generation benchmark to PEFT, similar to the existing MetaMathQA benchmark.
Right now, it uses Flux2 Klein 4B to train an adapter on the ~~dog dataset~~ (update: now using the cat plushy dataset, as it contains more samples). The benchmark is based on this Diffusers script.
Sample images from my local runs can be inspected here.
Still to do: sample images are already being created and uploaded to an HF bucket (if a token is provided), but they are not yet shown in the app (not sure if/how that works); this can be added once we have the final results.