
Commit a50ade4

Carlofkl, yiyixuxu, sayakpaul, dg845, and github-actions[bot] authored
[Pipelines] Add DreamLite text-to-image and image-edit pipelines (#13815)
* feat(pipelines): add DreamLite text-to-image and image-edit pipelines

  Add ByteDance's DreamLite model family to diffusers. DreamLite is a UNet-based diffusion model that supports both text-to-image generation and reference-image editing through a shared 3-branch dual-CFG design.

  Two pipelines are shipped:
  * DreamLitePipeline - full 3-branch dual CFG (negative, reference, prompt); supports T2I and I2I editing at 1024x1024.
  * DreamLiteMobilePipeline - distilled single-branch variant for on-device inference; no CFG.

  New model code (all isolated under *_dreamlite.py / unet_dreamlite.py to avoid touching shared upstream files):
  * models/transformers/transformer_2d_dreamlite.py - DreamLite 2D transformer block.
  * models/unets/unet_dreamlite.py - DreamLiteUNetModel.
  * models/unets/unet_2d_blocks_dreamlite.py - DreamLite-specific down/up/mid blocks.
  * models/resnet_dreamlite.py - DreamLite ResNet variants.
  * models/attention_processor.py - add DreamLiteAttnProcessor2_0 (pure addition, no existing processor modified).

  Pipeline + tests + docs:
  * pipelines/dreamlite/{__init__.py, pipeline_dreamlite.py, pipeline_dreamlite_mobile.py, pipeline_output.py}.
  * tests/pipelines/dreamlite/{test_pipeline_dreamlite.py, test_pipeline_dreamlite_mobile.py} with the standard PipelineTesterMixin suite; setUp/tearDown auto-patches encode_prompt with a fake so MagicMock text encoders work without per-test boilerplate.
  * Skip 8 mixin tests that don't apply to DreamLite (MagicMock serialisation, custom attention processor, encode_prompt return shape, batch_size > 1 sweep), mirroring SD3 / Flux conventions.
  * docs/source/en/api/pipelines/dreamlite.md + _toctree.yml entry (alphabetically between DiT and EasyAnimate).
  * Register exports in 6 __init__.py files.

  Two real bugs surfaced by the mixin test suite are fixed in this commit:
  * num_images_per_prompt > 1: prompt_embeds and text_attention_mask are now repeated along the batch dimension in both pipelines' T2I and I2I branches before being passed to the UNet.
  * vae=None: __init__ now guards the encoder_block_out_channels lookup so encode_prompt can be tested in isolation per PipelineTesterMixin convention.

  SlowTests real-checkpoint resolution is set to 1024x1024 (the only size DreamLite is trained for).

  Test result: 27 passed, 50 skipped, 0 failed on CPU fast suite. make style && make quality: clean.

* docs+tests(pipelines/dreamlite): pin Hub repos to `diffusers` branch

  The `carlofkl/DreamLite-{base,mobile}` Hub repos host two flavours of the same checkpoint:
  * `main` branch - keeps `model_index.json` pointing at ByteDance's internal package path so the original (non-diffusers) reference code can still load these weights.
  * `diffusers` branch - rewrites the `unet` entry of `model_index.json` to `["diffusers", "DreamLiteUNetModel"]` so this integration loads correctly from `diffusers`.

  This commit pins every `from_pretrained(...)` call shipped with the diffusers integration (docs examples, pipeline docstrings, SlowTests) to `revision="diffusers"`. Local-override env vars (DREAMLITE_BASE_PATH / DREAMLITE_MOBILE_PATH) still bypass the revision pin.

* chore(pipelines/dreamlite): sync `# Copied from` blocks + dummy objects after rebase

  Mechanical changes after rebasing onto current `main`:
  * `pipeline_dreamlite.py::retrieve_timesteps`: re-synced from `diffusers.pipelines.flux.pipeline_flux.retrieve_timesteps` (PEP 604 type hints, expanded docstring, plus the new `accepts_timesteps` / `accept_sigmas` introspection guards). DreamLite's default code path uses `num_inference_steps` (uniform schedule) and never passes custom `timesteps` / `sigmas`, so the added guards are dead-code for this pipeline; behaviour is unchanged.
  * `dummy_pt_objects.py` / `dummy_torch_and_transformers_objects.py`: registered the dummy classes auto-generated by `make fix-copies` for `DreamLiteTransformer2DModel`, `DreamLiteUNetModel`, `DreamLitePipeline`, `DreamLiteMobilePipeline`, `DreamLitePipelineOutput`.

  Generated by `make fix-copies`. No hand edits.

* docs(dreamlite): register attention processor + split combined docstring entries

  - Register DreamLiteAttnProcessor2_0 in docs/source/en/api/attnprocessor.md (fixes check_support_list.py).
  - Split combined 'height / width' and 'guidance_scale / image_guidance_scale' entries in the two pipeline docstrings; add a complete Args block to DreamLiteTransformer2DModel.forward (fixes check_forward_call_docstrings.py).

  No behavioral change.

* refactor(dreamlite): address review feedback from #13815

  - Inline the down/up block factories and define DreamLiteCrossAttn{,NoSelfAttn}{Down,Up}Block2D directly (review #1, #2)
  - Rename DownBlock2DDreamLite/UpBlock2DDreamLite to DreamLiteDownBlock2D/DreamLiteUpBlock2D to match diffusers naming conventions (review #3, #4)
  - Merge unet_2d_blocks_dreamlite.py into unet_dreamlite.py to mirror recent transformer model files (review #5)
  - Wire max_sequence_length into the tokenizer call for generate mode (review #6)
  - Replace hard-coded drop_idx values (64/34) with self.prompt_template_encode_*_start_idx attributes plus a comment explaining how the offsets are derived (review #7, #8)
  - Drop the manual Image.resize call and rely on VaeImageProcessor's LANCZOS default in preprocess(image, height, width) (review #9)
  - Use self.guidance_scale / self.image_guidance_scale properties in the CFG combine instead of the underscore-prefixed attributes (review #10, #11)
  - Inline retrieve_latents / retrieve_timesteps / calculate_shift in the mobile pipeline with `# Copied from` markers, removing the cross-pipeline imports (review #12)
  - Add `# Copied from` marker to _extract_masked_hidden in the mobile pipeline (review #13)

* refactor(dreamlite): address dg845 follow-up review

  - Merge resnet_dreamlite.py (DepthwiseSeparableConv + ResnetBlock2DDreamLite) into unet_dreamlite.py and delete the standalone module (review #1)
  - Move DreamLiteAttnProcessor2_0 from attention_processor.py into unet_dreamlite.py to keep all DreamLite-specific code in one place; update docs autodoc reference accordingly (review #2)
  - Drop the PyTorch 2.0 hasattr/ImportError check in DreamLiteAttnProcessor2_0.__init__ (diffusers already requires torch>=2.0; matches Wan deprecation) (review #3)
  - Drop the deprecated `scale` argument handling from DreamLiteAttnProcessor2_0.__call__ (new model, no legacy callers) (review #4)
  - Switch SDPA call to dispatch_attention_fn so all diffusers attention backends (FlashAttention, FlashAttention-3, sageattention, etc.) are selectable (review #5)
  - Rename block dispatch keys in _get_{down,mid,up}_block_dreamlite to match the Python class names (DreamLiteCrossAttn{Down,Up}Block2D / DreamLiteCrossAttnNoSelfAttn{Down,Up}Block2D / DreamLiteUNetMidBlock2DCrossAttn / DreamLite{Down,Up}Block2D); default down/up/mid block_types in DreamLiteUNetModel and the test fixtures are updated to the new keys (review #6, #7); the carlofkl/DreamLite-{base,mobile} (diffusers branch) Hub configs are being updated in lock-step
  - Localize retrieve_latents inside pipeline_dreamlite.py with a `# Copied from` marker, removing the cross-pipeline import; mirrors the mobile pipeline (review #8)
  - Add a check_inputs() method to both DreamLitePipeline and DreamLiteMobilePipeline (mobile uses `# Copied from`); call it from __call__; pulls the image-type validation out of prepare_image_latents and adds prompt-type and h/w-divisibility checks (review #9)

* fix(dreamlite): correct Q/K/V layout for dispatch_attention_fn

  dispatch_attention_fn expects (batch, seq, heads, head_dim) and handles the transpose internally; the previous code passed (batch, heads, seq, head_dim), which collided with the dispatch's internal transpose and broke inference (RuntimeError: tensor size mismatch at non-singleton dimension 1).

* test(dreamlite): swap MagicMock for tiny real Qwen3-VL fixture

  Address dg845's review: rebuild the DreamLite fast-test fixture around a real (tiny) Qwen3VLForConditionalGeneration + Qwen3VLProcessor so the standard PipelineTesterMixin save/load, dtype, and offload tests run end-to-end against the actual encode_prompt code path. Override DreamLiteUNetModel.set_default_attn_processor to reinstall the GQA processor so mixin utilities that round-trip through it keep working.

* Apply style fixes

* fix(dreamlite): address blocking review issues from #13815

  - Override _no_split_modules / _repeated_blocks on DreamLiteUNetModel with the actual DreamLite class names (BasicTransformerBlockDreamLite, ResnetBlock2DDreamLite, DreamLiteCrossAttnUpBlock2D, DreamLiteUpBlock2D) so device_map="auto" and compile_repeated_blocks() match correctly.
  - Keep attention masks as bool tensors in DreamLiteTransformer2DModel instead of converting them to dense additive float biases. The dense format hard-raises on flash / _flash_3 / _sage backends in dispatch_attention_fn (which requires dtype == torch.bool).
  - Add explicit parentheses around each clause in check_inputs's mixed and/or condition (both pipelines) for readability.
  - Replace nn.Module.__init__(self) with ModelMixin.__init__(self) in DreamLiteUNetModel.__init__ so mixin state (e.g. _gradient_checkpointing_func) is properly initialised. ConfigMixin / PushToHubMixin don't define their own __init__, so this covers the full chain without re-running UNet2DConditionModel.__init__.

* fix(dreamlite): forward all processor outputs to Qwen3VL text encoder

  Recent versions of Qwen3VLProcessor add an mm_token_type_ids output, and Qwen3VLModel.compute_3d_position_ids raises ValueError whenever multimodal inputs are present (image_grid_thw is not None) but mm_token_type_ids is None. encode_prompt previously forwarded only input_ids / attention_mask / pixel_values / image_grid_thw, dropping the new field and breaking the fast pipeline tests against transformers main. Switch to ``self.text_encoder(**tk_out, output_hidden_states=True)`` (matching NucleusMoEImagePipeline) so all processor outputs are forwarded automatically and future additions don't regress this path.

* Apply style fixes

* docs(dreamlite): address final review nits from #13815

  - Replace broken cat.png URL in editing examples (both base and mobile) with the standard `huggingface/documentation-images` source used elsewhere in the diffusers docs.
  - Promote the recommended guidance_scale=3.5 / image_guidance_scale=1.5 to the default values of DreamLitePipeline.__call__, and drop the now-redundant explicit args from the docs examples.
  - Switch the EXAMPLE_DOC_STRING examples in both pipelines from torch.float16 to torch.bfloat16 for consistency with the rest of the docs.

---------

Co-authored-by: YiYi Xu <yixu310@gmail.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
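The revision pin and the local-override behaviour described in the commit message can be sketched roughly as below. This is an illustration only, not code from the commit: `resolve_checkpoint` is a hypothetical helper, and the assumption that a set environment variable replaces the Hub repo id and drops the `revision` argument is inferred from the statement that the env vars bypass the revision pin.

```python
import os


def resolve_checkpoint(env_var, hub_repo, revision="diffusers", environ=None):
    """Return (source, from_pretrained kwargs) for a DreamLite checkpoint.

    Hypothetical helper: a non-empty local path in `env_var` wins and no
    revision is pinned; otherwise the Hub repo is used with the pinned branch.
    """
    environ = os.environ if environ is None else environ
    local_path = environ.get(env_var)
    if local_path:
        # Local override: a directory on disk has no Hub revision to pin.
        return local_path, {}
    # Default: load from the Hub, pinned to the `diffusers` branch.
    return hub_repo, {"revision": revision}


source, kwargs = resolve_checkpoint("DREAMLITE_BASE_PATH", "carlofkl/DreamLite-base")
print(source, kwargs)
```

Under these assumptions the returned pair would be passed on as `DreamLitePipeline.from_pretrained(source, **kwargs)`.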
1 parent 0b83812 commit a50ade4

19 files changed

Lines changed: 4828 additions & 0 deletions

docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions
@@ -527,6 +527,8 @@
         title: DeepFloyd IF
       - local: api/pipelines/dit
         title: DiT
+      - local: api/pipelines/dreamlite
+        title: DreamLite
       - local: api/pipelines/easyanimate
         title: EasyAnimate
       - local: api/pipelines/ernie_image

docs/source/en/api/attnprocessor.md

Lines changed: 4 additions & 0 deletions
@@ -44,6 +44,10 @@ An attention processor is a class for applying different types of attention mech

 [[autodoc]] models.attention_processor.FusedCogVideoXAttnProcessor2_0

+## DreamLite
+
+[[autodoc]] models.unets.unet_dreamlite.DreamLiteAttnProcessor2_0
+
 ## CrossFrameAttnProcessor

 [[autodoc]] pipelines.deprecated.text_to_video_synthesis.pipeline_text_to_video_zero.CrossFrameAttnProcessor
docs/source/en/api/pipelines/dreamlite.md

Lines changed: 157 additions & 0 deletions
@@ -0,0 +1,157 @@
<!--Copyright 2026 The ByteDance Authors. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# DreamLite

DreamLite is a text-to-image and image-editing model from ByteDance. It pairs a custom 2D U-Net
(`DreamLiteUNetModel`) with the `Qwen3-VL` multimodal encoder as its prompt / image-instruction encoder,
and uses an `AutoencoderTiny` (TAESD-style) VAE for fast latent encode/decode.

Two pipelines are exposed:

| Pipeline | Modes | CFG | Use case |
|---|---|---|---|
| [`DreamLitePipeline`] | text-to-image **and** image-editing (auto-selected by whether `image` is `None`) | 3-branch dual CFG (`guidance_scale` on text branch, `image_guidance_scale` on image branch, à la InstructPix2Pix) | Highest quality |
| [`DreamLiteMobilePipeline`] | text-to-image **and** image-editing (auto-selected by whether `image` is `None`) | None (distilled, single UNet forward per step) | On-device / low-latency |

Official checkpoints:

* Base model: [carlofkl/DreamLite-base](https://huggingface.co/carlofkl/DreamLite-base)
* Distilled mobile model: [carlofkl/DreamLite-mobile](https://huggingface.co/carlofkl/DreamLite-mobile)

> [!TIP]
> Both pipelines auto-detect text-to-image vs. image-editing mode from whether the `image` argument is
> provided. There is no separate `Img2Img` class.

> [!TIP]
> When loading an input image for editing, prefer `diffusers.utils.load_image(...)` over raw `PIL.Image.open(...)`.
> `load_image` enforces an RGB conversion and applies EXIF orientation, both of which the pipeline assumes.
> A plain `Image.open` of an RGBA / palette / EXIF-rotated source will silently produce a different latent
> conditioning and degrade output quality.

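The mode auto-detection described in the table and the first tip can be pictured with the small sketch below. It is not the pipelines' actual code: `describe_call` is a hypothetical helper, and it only encodes the two facts stated above (the mode follows from whether `image` is `None`, and only `DreamLitePipeline` uses CFG).

```python
def describe_call(pipeline_name, image=None):
    """Hypothetical summary of how a DreamLite call is interpreted."""
    # Mode is chosen purely from the presence of `image`.
    mode = "text-to-image" if image is None else "image-editing"
    # Per the table: the base pipeline uses dual CFG, the mobile one uses none.
    uses_cfg = pipeline_name == "DreamLitePipeline"
    return {"mode": mode, "uses_cfg": uses_cfg}


print(describe_call("DreamLitePipeline"))
print(describe_call("DreamLiteMobilePipeline", image="any non-None image object"))
```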
## Text-to-image (Base)

```python
import torch
from diffusers import DreamLitePipeline

pipe = DreamLitePipeline.from_pretrained("carlofkl/DreamLite-base", revision="diffusers", torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

image = pipe(
    prompt="a dog running on the grass",
    negative_prompt="",
    height=1024,
    width=1024,
    num_inference_steps=28,
    generator=torch.Generator("cpu").manual_seed(42),
).images[0]
image.save("dreamlite_t2i.png")
```

## Image editing (Base)

Pass an `image` to enter edit mode. Both `guidance_scale` (text branch) and `image_guidance_scale`
(image branch) are active here.

```python
import torch
from diffusers import DreamLitePipeline
from diffusers.utils import load_image

pipe = DreamLitePipeline.from_pretrained("carlofkl/DreamLite-base", revision="diffusers", torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

source = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")

image = pipe(
    prompt="turn the cat into a corgi",
    image=source,
    height=1024,
    width=1024,
    num_inference_steps=28,
    generator=torch.Generator("cpu").manual_seed(42),
).images[0]
image.save("dreamlite_edit.png")
```

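To make the roles of the two scales concrete, here is a minimal sketch of an InstructPix2Pix-style dual-CFG combine over the three branches named in the commit message (negative, reference, prompt). The exact expression used inside `DreamLitePipeline` is not shown in this document, so treat the formula as an assumption borrowed from InstructPix2Pix; `combine_dual_cfg` is a hypothetical name, and plain Python lists stand in for noise-prediction tensors. The default scales are the ones the commit message reports as the pipeline defaults.

```python
def combine_dual_cfg(noise_negative, noise_reference, noise_prompt,
                     guidance_scale=3.5, image_guidance_scale=1.5):
    """InstructPix2Pix-style combine (assumed, not taken from the DreamLite source).

    noise_negative:  prediction without text or image conditioning
    noise_reference: prediction with the reference image only
    noise_prompt:    prediction with the reference image and the prompt
    """
    return [
        neg
        + image_guidance_scale * (ref - neg)   # push towards the reference image
        + guidance_scale * (full - ref)        # push towards the text instruction
        for neg, ref, full in zip(noise_negative, noise_reference, noise_prompt)
    ]


print(combine_dual_cfg([0.0], [1.0], [2.0]))
```

With both scales set to 1.0 the expression collapses to the fully conditioned prediction, which is a quick sanity check on the formula.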
## Text-to-image (Mobile)

The mobile pipeline is distilled and skips CFG entirely, running a single UNet forward per step. It accepts the
same `prompt` / `height` / `width` / `num_inference_steps` arguments, but **ignores** `guidance_scale` and
`image_guidance_scale` if passed (a warning is logged).

```python
import torch
from diffusers import DreamLiteMobilePipeline

pipe = DreamLiteMobilePipeline.from_pretrained("carlofkl/DreamLite-mobile", revision="diffusers", torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

image = pipe(
    prompt="a dog running on the grass",
    height=1024,
    width=1024,
    num_inference_steps=4,
    generator=torch.Generator("cpu").manual_seed(42),
).images[0]
image.save("dreamlite_mobile_t2i.png")
```

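The "ignored, with a warning" behaviour for the CFG arguments might look roughly like the sketch below. This is not the mobile pipeline's implementation: `strip_cfg_args`, the logger name, and the warning text are all hypothetical, and only the behaviour stated above (the two scales are ignored and a warning is logged) is taken from the docs.

```python
import logging

logger = logging.getLogger("dreamlite_mobile_sketch")  # hypothetical logger name

CFG_ONLY_ARGS = ("guidance_scale", "image_guidance_scale")


def strip_cfg_args(call_kwargs):
    """Drop CFG-only arguments and warn if any were actually supplied."""
    ignored = [name for name in CFG_ONLY_ARGS if call_kwargs.get(name) is not None]
    if ignored:
        logger.warning(
            "Ignoring %s: the distilled pipeline does not use CFG.", ", ".join(ignored)
        )
    # Everything else is passed through untouched.
    return {k: v for k, v in call_kwargs.items() if k not in CFG_ONLY_ARGS}


print(strip_cfg_args({"prompt": "a dog running on the grass", "guidance_scale": 3.5}))
```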
## Image editing (Mobile)

```python
import torch
from diffusers import DreamLiteMobilePipeline
from diffusers.utils import load_image

pipe = DreamLiteMobilePipeline.from_pretrained("carlofkl/DreamLite-mobile", revision="diffusers", torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

source = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")

image = pipe(
    prompt="turn the cat into a corgi",
    image=source,
    height=1024,
    width=1024,
    num_inference_steps=4,
    generator=torch.Generator("cpu").manual_seed(42),
).images[0]
image.save("dreamlite_mobile_edit.png")
```

## Notes and limitations

* Both pipelines force `batch_size = 1` internally; `num_images_per_prompt` controls how many samples
  are drawn from the same prompt rather than parallel batching.
* The prompt encoder is `Qwen3-VL`, which is a multimodal model. Loading the full pipeline therefore
  requires sufficient GPU memory for both the U-Net and the Qwen3-VL text encoder (~4 GB + ~0.7 GB
  in bf16 for the base release).
* The VAE is `AutoencoderTiny` and exposes `encoder_block_out_channels`; `vae_scale_factor` is derived
  from it at pipeline init time.

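The last note can be illustrated with a short sketch of how a scale factor is commonly derived from a list of encoder block widths. It assumes the usual diffusers convention of one 2x downsample between consecutive encoder blocks, which this document does not confirm for DreamLite; `derive_vae_scale_factor` and its fallback value of 8 are hypothetical. The `None` branch mirrors the `vae=None` guard mentioned in the commit message.

```python
def derive_vae_scale_factor(encoder_block_out_channels=None, default=8):
    """Hypothetical derivation of the latent downsampling factor.

    Assumes one 2x downsample between consecutive encoder blocks, so
    N blocks give a spatial reduction of 2 ** (N - 1).
    """
    if encoder_block_out_channels is None:
        # No VAE attached (e.g. encode_prompt tested in isolation): use a fallback.
        return default
    return 2 ** (len(encoder_block_out_channels) - 1)


# Four encoder blocks give an 8x reduction, i.e. 1024x1024 pixels map to 128x128 latents.
print(derive_vae_scale_factor((64, 64, 64, 64)))
```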
## DreamLitePipeline

[[autodoc]] DreamLitePipeline
  - all
  - __call__

## DreamLiteMobilePipeline

[[autodoc]] DreamLiteMobilePipeline
  - all
  - __call__

## DreamLitePipelineOutput

[[autodoc]] pipelines.dreamlite.pipeline_output.DreamLitePipelineOutput

src/diffusers/__init__.py

Lines changed: 10 additions & 0 deletions
@@ -254,6 +254,8 @@
             "CosmosControlNetModel",
             "CosmosTransformer3DModel",
             "DiTTransformer2DModel",
+            "DreamLiteTransformer2DModel",
+            "DreamLiteUNetModel",
             "EasyAnimateTransformer3DModel",
             "ErnieImageTransformer2DModel",
             "Flux2Transformer2DModel",
@@ -570,6 +572,9 @@
             "CosmosTextToWorldPipeline",
             "CosmosVideoToWorldPipeline",
             "CycleDiffusionPipeline",
+            "DreamLiteMobilePipeline",
+            "DreamLitePipeline",
+            "DreamLitePipelineOutput",
             "EasyAnimateControlPipeline",
             "EasyAnimateInpaintPipeline",
             "EasyAnimatePipeline",
@@ -1108,6 +1113,8 @@
             CosmosControlNetModel,
             CosmosTransformer3DModel,
             DiTTransformer2DModel,
+            DreamLiteTransformer2DModel,
+            DreamLiteUNetModel,
             EasyAnimateTransformer3DModel,
             ErnieImageTransformer2DModel,
             Flux2Transformer2DModel,
@@ -1399,6 +1406,9 @@
             CosmosTextToWorldPipeline,
             CosmosVideoToWorldPipeline,
             CycleDiffusionPipeline,
+            DreamLiteMobilePipeline,
+            DreamLitePipeline,
+            DreamLitePipelineOutput,
             EasyAnimateControlPipeline,
             EasyAnimateInpaintPipeline,
             EasyAnimatePipeline,

src/diffusers/models/__init__.py

Lines changed: 4 additions & 0 deletions
@@ -96,6 +96,7 @@
     _import_structure["transformers.stable_audio_transformer"] = ["StableAudioDiTModel"]
     _import_structure["transformers.t5_film_transformer"] = ["T5FilmDecoder"]
     _import_structure["transformers.transformer_2d"] = ["Transformer2DModel"]
+    _import_structure["transformers.transformer_2d_dreamlite"] = ["DreamLiteTransformer2DModel"]
     _import_structure["transformers.transformer_allegro"] = ["AllegroTransformer3DModel"]
     _import_structure["transformers.transformer_anyflow"] = ["AnyFlowTransformer3DModel"]
     _import_structure["transformers.transformer_anyflow_far"] = ["AnyFlowFARTransformer3DModel"]
@@ -145,6 +146,7 @@
     _import_structure["unets.unet_2d"] = ["UNet2DModel"]
     _import_structure["unets.unet_2d_condition"] = ["UNet2DConditionModel"]
     _import_structure["unets.unet_3d_condition"] = ["UNet3DConditionModel"]
+    _import_structure["unets.unet_dreamlite"] = ["DreamLiteUNetModel"]
     _import_structure["unets.unet_i2vgen_xl"] = ["I2VGenXLUNet"]
     _import_structure["unets.unet_kandinsky3"] = ["Kandinsky3UNet"]
     _import_structure["unets.unet_motion_model"] = ["MotionAdapter", "UNetMotionModel"]
@@ -236,6 +238,7 @@
             Cosmos3OmniTransformer,
             CosmosTransformer3DModel,
             DiTTransformer2DModel,
+            DreamLiteTransformer2DModel,
             DualTransformer2DModel,
             EasyAnimateTransformer3DModel,
             ErnieImageTransformer2DModel,
@@ -282,6 +285,7 @@
             ZImageTransformer2DModel,
         )
         from .unets import (
+            DreamLiteUNetModel,
             I2VGenXLUNet,
             Kandinsky3UNet,
             MotionAdapter,

src/diffusers/models/transformers/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -17,6 +17,7 @@
     from .stable_audio_transformer import StableAudioDiTModel
     from .t5_film_transformer import T5FilmDecoder
     from .transformer_2d import Transformer2DModel
+    from .transformer_2d_dreamlite import DreamLiteTransformer2DModel
     from .transformer_allegro import AllegroTransformer3DModel
     from .transformer_anyflow import AnyFlowTransformer3DModel
     from .transformer_anyflow_far import AnyFlowFARTransformer3DModel
