perf(diffusion): improve Flux training throughput #2251
Open
pthombre wants to merge 4 commits into
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
/claude review
/ok to test 0e61f24
/claude review
/ok to test ac3388d
What does this PR do?
Improves FLUX.1-dev diffusion training throughput for both full fine-tuning and LoRA by adding measured performance controls to the diffusion recipe and promoting the best validated Flux configs.
Changelog
- [...] _target_/fused AdamW kwargs.
- [...] NeMoAutoDiffusionPipeline, including safe-subset filtering for batch-only Flux conditioning/output projections.
- [...] transformer_blocks and single_transformer_blocks.
- [...] 3/2, local/global batch 4/32, and compile disabled.
- [...] 6/48, and TE FP8 disabled.
Experiment summary
The selected defaults come from bounded 8x H100 performance sweeps for FLUX.1-dev full fine-tuning and LoRA fine-tuning. Each promoted setting was required to complete the bounded run, keep loss and gradient norms finite, and avoid regressing generation/checkpoint validation where applicable.
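The promotion rule above can be sketched as a small filter-then-rank step. This is only an illustration, not the sweep tooling used for this PR: `SweepResult`, `pick_best`, and the `metrics_finite` flag are hypothetical names, and the global batch is assumed to be the local batch times 8 ranks with no gradient accumulation, which matches the local/global pairs quoted in this description. The throughput values are the full fine-tuning figures reported in the lists that follow.

```python
from dataclasses import dataclass
from typing import List, Optional

WORLD_SIZE = 8  # the sweeps ran on 8x H100


@dataclass
class SweepResult:
    """One bounded run from a sweep (hypothetical record layout)."""

    name: str
    local_batch: int
    samples_per_s: Optional[float]  # None when the run did not complete (e.g. OOM)
    metrics_finite: bool = True     # loss and gradient norms stayed finite

    @property
    def global_batch(self) -> int:
        # Assumption: no gradient accumulation, so global = local * ranks.
        return self.local_batch * WORLD_SIZE

    def is_valid(self) -> bool:
        # A setting is eligible only if it completed and stayed finite.
        return self.samples_per_s is not None and self.metrics_finite


def pick_best(results: List[SweepResult]) -> Optional[SweepResult]:
    """Return the fastest eligible run, or None if nothing qualifies."""
    valid = [r for r in results if r.is_valid()]
    if not valid:
        return None
    return max(valid, key=lambda r: r.samples_per_s)


# Full fine-tuning figures quoted in this PR description.
full_ft = [
    SweepResult("batch 1/8", local_batch=1, samples_per_s=16.795),
    SweepResult("batch 4/32, no compile", local_batch=4, samples_per_s=34.70),
    SweepResult("batch 6/48", local_batch=6, samples_per_s=None),  # OOMed
]

best = pick_best(full_ft)
speedup = best.samples_per_s / full_ft[0].samples_per_s
print(f"{best.name}: global batch {best.global_batch}, {speedup:.2f}x over batch 1/8")
```

The sketch leaves out generation and checkpoint validation, which this PR also applied where applicable before promoting a setting.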
Full fine-tuning
- [...] 1/8 measured 16.795 samples/s.
- [...] foreach=false and fused=true.
- [...] 3/2 was the best low-risk prefetch setting. Deeper 4/3 prefetch regressed and used more memory, so the config keeps 3/2.
- [...] 4/32 without compile completed at 34.70 samples/s; batch 6/48 OOMed and batch 5/40 was slower, so the full fine-tune config uses 4/32.
- torch.compile remains disabled by default because it helped small-batch runs but failed for batch 4/32.
- [...] flash: FlashAttention 3 was unavailable in the measured environment, while flex and flash_varlen regressed.
- [...] 39.6 samples/s. It beat current scaling, tied deeper prefetch, completed a 500-step larger-data validation run, and passed generation validation from the final checkpoint.
LoRA fine-tuning
- [...] 1/8 measured 21.83 samples/s and was dominated by FSDP all-gather.
- [...] 36.39 samples/s in the initial DDP check.
- [...] 6/48 was the best stable batch-size point; batch 7/56 OOMed.
- [...] 6/48 remained valid and produced a small/noisy improvement, so it is enabled for LoRA too.
- [...] 53.57 samples/s; a comparable FSDP checkpoint run measured 22.07 samples/s. Image verification from the baseline and optimized adapters produced coherent, comparable outputs.
Validation
uv run ruff format nemo_automodel/_diffusers/auto_diffusion_pipeline.py nemo_automodel/components/distributed/parallelizer.py nemo_automodel/recipes/diffusion/train.py examples/diffusion/generate/generate.py
uv run ruff check --fix nemo_automodel/_diffusers/auto_diffusion_pipeline.py nemo_automodel/components/distributed/parallelizer.py nemo_automodel/recipes/diffusion/train.py examples/diffusion/generate/generate.py
uv run python -m py_compile nemo_automodel/_diffusers/auto_diffusion_pipeline.py nemo_automodel/components/distributed/parallelizer.py nemo_automodel/recipes/diffusion/train.py examples/diffusion/generate/generate.py
- [...] ruff format, targeted ruff check --fix, py_compile, and git diff --cached --check for the changed diffusion files.
- [...] ruff format, targeted ruff check --fix, py_compile, and git diff --cached --check for the changed diffusion recipe/config files.
Before your PR is "Ready for review"
Pre checks:
Additional Information