Describe the bug
Image generation using black-forest-labs/FLUX.2-dev with diffusers, with int8 quantization and separate stages for prompt encoding and transformer inference, produces a bad output image: a random checkerboard pattern. The same workflow works fine with all other large models I have tried (QwenImage, z-image, Flux.1-dev, stable-diffusion-3-5-large).
System:
OS: Fedora
Kernel: x86_64 Linux 7.0.8-100.fc43.x86_64
CPU: Intel Xeon Silver 4114 @ 40x 3GHz [46.0°C]
GPU: AMD Radeon Pro W7900 (radeonsi, navi31, LLVM 21.1.8, DRM 3.64, 7.0.8-100.fc43.x86_64)
RAM: 321061MiB
Using docker images: rocm/pytorch
tags tested:
- latest (as of May 20, 2026)
- rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.7.1
Model resulting in bad output:
- black-forest-labs/FLUX.2-dev
The included reproduction script runs prompt encoding and inference with int8 quantization as two explicitly separate phases, unloading everything in between.
Output image:
Reproduction
import gc

import diffusers
import torch
import transformers

# tested with these docker images (rocm/pytorch):
#   rocm/pytorch:rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.7.1
#   rocm/pytorch:latest (as of 2026-05-20)
# Where latest was at pytorch version 2.8.0

# this seems to make no difference on the output or performance
# torch.backends.cuda.enable_mem_efficient_sdp(False)

model = "black-forest-labs/FLUX.2-dev"
outfile = "cool-cat.png"
prompt = "A cat with a banjo"

print("==== Phase 1: text encoder ====")
print("Loading text encoder (quantization config: llm_int8)...")
te_qconfig = transformers.BitsAndBytesConfig(
    load_in_8bit=True,
)
text_encoder = transformers.Mistral3ForConditionalGeneration.from_pretrained(
    model,
    subfolder="text_encoder",
    quantization_config=te_qconfig,
    tie_word_embeddings=False,
    torch_dtype=torch.bfloat16,
)

print("Building prompt-encoder pipeline (with quantization)...")
encoder_pipeline = diffusers.Flux2Pipeline.from_pretrained(
    model,
    text_encoder=text_encoder,
    transformer=None,
    vae=None,
    torch_dtype=torch.bfloat16,
)
encoder_pipeline.to("cuda")

print("Encoding prompt...")
with torch.no_grad():
    prompt_embeds, text_ids = encoder_pipeline.encode_prompt(prompt=prompt)

print("Unloading prompt-encoder pipeline...")
del encoder_pipeline
del text_encoder
gc.collect()
torch.cuda.empty_cache()

print("==== Phase 2: inference ====")
print("Loading transformer (quantization config: llm_int8)...")
tr_qconfig = diffusers.BitsAndBytesConfig(
    load_in_8bit=True,
)
transformer = diffusers.Flux2Transformer2DModel.from_pretrained(
    model,
    subfolder="transformer",
    quantization_config=tr_qconfig,
    torch_dtype=torch.bfloat16,
)

print("Building inference pipeline (with quantization)...")
pipeline = diffusers.Flux2Pipeline.from_pretrained(
    model,
    text_encoder=None,
    tokenizer=None,
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipeline = pipeline.to("cuda")

print("Running inference...")
result = pipeline(prompt_embeds=prompt_embeds)

print(f"Saving image to {outfile}...")
result.images[0].save(outfile)
print("Done.")
Logs
# python test-flux2-int8.py
==== Phase 1: text encoder ====
Loading text encoder (quantization config: llm_int8)...
Loading weights: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 585/585 [04:38<00:00, 2.10it/s]
Building prompt-encoder pipeline (with quantization)...
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00, 1.44it/s]
Encoding prompt...
[transformers] Kwargs passed to `processor.__call__` have to be in `processor_kwargs` dict, not in `**kwargs`
/opt/venv/lib/python3.12/site-packages/bitsandbytes/autograd/_functions.py:123: UserWarning: MatMul8bitLt: inputs will be cast from torch.bfloat16 to float16 during quantization
warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
Unloading prompt-encoder pipeline...
==== Phase 2: inference ====
Loading transformer (quantization config: llm_int8)...
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████| 7/7 [06:17<00:00, 53.96s/it]
Building inference pipeline (with quantization)...
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00, 1.51it/s]
Running inference...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [05:16<00:00, 6.33s/it]
Saving image to cool-cat.png...
Done.
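The logs above include a `MatMul8bitLt` warning that inputs are cast from bfloat16 to float16. Whether that is related to the checkerboard output is unknown. One check that might help narrow things down is to inspect the prompt embeddings from Phase 1 for non-finite or unusually large values before they are passed to Phase 2. The sketch below has not been run against this setup, and `report_tensor` is a hypothetical helper name, not part of diffusers or bitsandbytes.

```python
import torch


def report_tensor(name: str, t: torch.Tensor) -> dict:
    """Summarize a tensor so non-finite or extreme values are easy to spot.

    Hypothetical debugging helper, not a library API.
    """
    # Work on a float32 CPU copy so the statistics themselves cannot overflow.
    t32 = t.detach().float().cpu()
    finite = torch.isfinite(t32)
    has_finite = bool(finite.any())
    stats = {
        "name": name,
        "dtype": str(t.dtype),
        "shape": tuple(t.shape),
        # Count of NaN / +inf / -inf entries.
        "non_finite": int((~finite).sum().item()),
        # Largest magnitude among the finite entries (NaN if there are none).
        "abs_max_finite": float(t32[finite].abs().max().item())
        if has_finite
        else float("nan"),
    }
    print(stats)
    return stats
```

In the reproduction script this could be called as `report_tensor("prompt_embeds", prompt_embeds)` right after `encode_prompt`. A non-zero `non_finite` count, or an `abs_max_finite` above the float16 maximum of 65504, would point at the text-encoder phase rather than the transformer phase.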
System Info
- 🤗 Diffusers version: 0.38.0
- Platform: Linux-7.0.8-100.fc43.x86_64-x86_64-with-glibc2.39
- Running on Google Colab?: No
- Python version: 3.12.3
- PyTorch version (GPU?): 2.8.0+rocm7.0.0.git64359f59 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 1.15.0
- Transformers version: 5.8.1
- Accelerate version: 1.13.0
- PEFT version: 0.19.1
- Bitsandbytes version: 0.49.2
- Safetensors version: 0.8.0-rc.0
- xFormers version: not installed
- Accelerator: NA
System:
OS: Fedora
Kernel: x86_64 Linux 7.0.8-100.fc43.x86_64
Shell: zsh 5.9
Resolution: 10240x2880
DE: GNOME 49.7
WM: Mutter
WM Theme: Adwaita
GTK Theme: Adwaita [GTK2/3]
Icon Theme: Adwaita
Font: Adwaita Sans 11
CPU: Intel Xeon Silver 4114 @ 40x 3GHz [46.0°C]
GPU: AMD Radeon Pro W7900 (radeonsi, navi31, LLVM 21.1.8, DRM 3.64, 7.0.8-100.fc43.x86_64)
RAM: 321061MiB
Who can help?
This is a general-use issue about regular inference with a base model.
@sayakpaul @DN6