
Commit 51f528c

feat(model): Qwen Image VAE checkpoint (invoke-ai#9108)

* feat(qwen-image): standalone VAE checkpoint and Qwen2.5-VL encoder support

  Add standalone model types so Qwen Image can be run without downloading the
  full ~40 GB Diffusers pipeline. The VAE and Qwen2.5-VL encoder can now each
  come from their own model, with the Component Source (Diffusers) acting as a
  fallback for any submodel not provided separately.

* feat(qwen-image): support ComfyUI single-file Qwen2.5-VL encoder

  Add a checkpoint loader for ComfyUI-style consolidated Qwen2.5-VL encoder
  files (e.g. qwen_2.5_vl_7b_fp8_scaled.safetensors), which bundle the language
  model and visual tower into one safetensors file with FP8 + per-tensor
  weight_scale quantization. This drops the standalone encoder footprint from
  ~16 GB (Diffusers folder, FP16) to ~7 GB.

* feat(qwen-image): register standalone components as starter models

  Add three new starter models so users can install a complete GGUF Qwen Image
  setup in one click without ever touching the full ~40 GB Diffusers pipeline:

  - "Qwen Image VAE" — single-file VAE checkpoint pulled from the Qwen-Image
    repo (~250 MB).
  - "Qwen2.5-VL Encoder (fp8 scaled)" — ComfyUI single-file FP8 encoder (~7 GB).
  - "Qwen2.5-VL Encoder (Diffusers)" — full-precision encoder via multi-folder
    HF download (text_encoder + tokenizer + processor, ~16 GB).

  The 8 GGUF main starters (Q2_K / Q4_K_M / Q6_K / Q8_0 for both Edit and
  txt2img) now declare the VAE + fp8 encoder as dependencies, so installing any
  of them automatically pulls in everything needed to generate. The fp8 encoder
  is preferred as the default dependency since it's smaller and the on-the-fly
  dequantization is essentially free at runtime. The Qwen Image starter bundle
  gets the VAE and fp8 encoder prepended so the bundled Lightning LoRA variants
  also benefit.
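The "essentially free" dequantization mentioned above refers to per-tensor weight-scale quantization: each quantized tensor is paired with one scalar scale, so recovery is a single multiply per element. A minimal, framework-free sketch of the idea (the values and key names are illustrative; real checkpoints store FP8 tensors, which plain Python cannot represent):

```python
# Sketch of per-tensor "weight_scale" dequantization as used by ComfyUI-style
# fp8_scaled checkpoints: one scalar scale per quantized tensor.
def dequantize_per_tensor(quantized: list[float], weight_scale: float) -> list[float]:
    """Recover approximate full-precision weights: one multiply per element."""
    return [q * weight_scale for q in quantized]

# Toy state dict mimicking the `<name>.weight` / `<name>.weight_scale` pairing.
state_dict = {
    "visual.blocks.0.mlp.fc1.weight": [12.0, -7.0, 3.0],
    "visual.blocks.0.mlp.fc1.weight_scale": 0.5,
}

weights = dequantize_per_tensor(
    state_dict["visual.blocks.0.mlp.fc1.weight"],
    state_dict["visual.blocks.0.mlp.fc1.weight_scale"],
)
print(weights)  # [6.0, -3.5, 1.5]
```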
* chore: ruff format

* fix(qwen-image): backfill VAE/encoder fields on persisted state, recall in
  metadata, optimize scan

  - Bump the params slice persisted state to v3 with a v2→v3 migration that
    backfills qwenImageVaeModel and qwenImageQwenVLEncoderModel to null,
    preventing existing users from losing all persisted params on upgrade.
  - Emit qwen_image_vae and qwen_image_qwen_vl_encoder into graph metadata and
    add recall handlers so generations using standalone components are
    reproducible.
  - Clear the two new fields in the modelSelected listener when switching away
    from qwen-image, matching the existing cleanup pattern.
  - Identify single-file Qwen VL encoder checkpoints by reading only the
    safetensors key index via safe_open, instead of loading the full ~7 GB
    state dict into RAM during the model scan.
  - Log a clear info message and raise an actionable RuntimeError when the
    first-time HuggingFace tokenizer/config download is needed but offline,
    pointing users to the diffusers folder layout as an offline alternative.
  - Add unit tests for the migration, metadata recall, and identification.

* fix(qwen-image): auto-select VAE/encoder, clarify GGUF tip, fix fp8
  single-file encoder crash

  - Auto-select the first available standalone VAE and Qwen2.5-VL encoder when
    switching to a Qwen Image model, so GGUF users are ready to go without
    digging into Advanced. Prefers the diffusers-folder encoder over the
    single-file checkpoint.
  - Update the "Required for GGUF models" placeholder to clarify that the
    diffusers source is only required when a standalone VAE & encoder is not
    installed.
  - Fix the QwenVLEncoderCheckpointLoader crash on ComfyUI fp8_scaled
    single-file encoders. Two issues: (1) handle the `.scale_weight` /
    `.scale_input` quantization key scheme alongside `.weight_scale`, and
    (2) apply Qwen2_5_VLForConditionalGeneration's
    _checkpoint_conversion_mapping before load_state_dict so legacy
    `visual.*` / `model.*` keys map onto the new `model.visual.*` /
    `model.language_model.*` layout expected by transformers ≥4.50.

Co-authored-by: Jonathan <34005131+JPPhoto@users.noreply.github.com>
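The prefix remap described in issue (2) can be sketched as a pure key rewrite. This is not the actual transformers implementation, and the mapping shown is an illustrative subset of `_checkpoint_conversion_mapping`, not the full table:

```python
# Hedged sketch: rewrite legacy single-file checkpoint key prefixes
# (`visual.*`, `model.*`) into the layout expected by transformers >= 4.50
# (`model.visual.*`, `model.language_model.*`). Already-remapped keys and
# top-level keys like `lm_head.weight` pass through unchanged.
def remap_legacy_keys(state_dict: dict) -> dict:
    remapped = {}
    for key, value in state_dict.items():
        if key.startswith("visual."):
            new_key = "model.visual." + key[len("visual."):]
        elif key.startswith("model.") and not key.startswith(("model.visual.", "model.language_model.")):
            new_key = "model.language_model." + key[len("model."):]
        else:
            new_key = key
        remapped[new_key] = value
    return remapped

ckpt = {
    "visual.blocks.0.attn.qkv.weight": 1,
    "model.layers.0.self_attn.q_proj.weight": 2,
    "lm_head.weight": 3,
}
print(sorted(remap_legacy_keys(ckpt)))
```

Running the remap before load_state_dict means a single loader path can accept both the legacy and the current key layouts.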
1 parent b9bd8ef commit 51f528c

26 files changed

Lines changed: 1408 additions & 60 deletions

File tree

invokeai/app/invocations/qwen_image_model_loader.py

Lines changed: 62 additions & 22 deletions
@@ -34,19 +34,22 @@ class QwenImageModelLoaderOutput(BaseInvocationOutput):
     title="Main Model - Qwen Image",
     tags=["model", "qwen_image"],
     category="model",
-    version="1.1.0",
+    version="1.2.0",
     classification=Classification.Prototype,
 )
 class QwenImageModelLoaderInvocation(BaseInvocation):
     """Loads a Qwen Image model, outputting its submodels.
 
     The transformer is always loaded from the main model (Diffusers or GGUF).
 
-    For GGUF quantized models, the VAE and Qwen VL encoder must come from a
-    separate Diffusers model specified in the "Component Source" field.
+    Components can be mixed and matched:
+    - VAE: standalone Qwen Image VAE checkpoint, the Component Source (Diffusers),
+      or the main model if it's Diffusers.
+    - Qwen VL Encoder: standalone Qwen2.5-VL encoder, the Component Source
+      (Diffusers), or the main model if it's Diffusers.
 
-    For Diffusers models, all components are extracted from the main model
-    automatically. The "Component Source" field is ignored.
+    Together, the standalone VAE and standalone encoder allow running a GGUF
+    transformer without ever downloading the full ~40 GB Diffusers pipeline.
     """
 
     model: ModelIdentifierField = InputField(
@@ -57,11 +60,31 @@ class QwenImageModelLoaderInvocation(BaseInvocation):
         title="Transformer",
     )
 
+    vae_model: Optional[ModelIdentifierField] = InputField(
+        default=None,
+        description="Standalone Qwen Image VAE model. "
+        "If not provided, VAE will be loaded from the Component Source (or from the main model if it is Diffusers).",
+        input=Input.Direct,
+        ui_model_base=BaseModelType.QwenImage,
+        ui_model_type=ModelType.VAE,
+        title="VAE",
+    )
+
+    qwen_vl_encoder_model: Optional[ModelIdentifierField] = InputField(
+        default=None,
+        description="Standalone Qwen2.5-VL encoder model. "
+        "If not provided, the encoder will be loaded from the Component Source "
+        "(or from the main model if it is Diffusers).",
+        input=Input.Direct,
+        ui_model_type=ModelType.QwenVLEncoder,
+        title="Qwen VL Encoder",
+    )
+
     component_source: Optional[ModelIdentifierField] = InputField(
         default=None,
-        description="Diffusers Qwen Image model to extract the VAE and Qwen VL encoder from. "
-        "Required when using a GGUF quantized transformer. "
-        "Ignored when the main model is already in Diffusers format.",
+        description="Diffusers Qwen Image model to extract VAE and/or Qwen VL encoder from. "
+        "Use this if you don't have separate VAE/encoder models. "
+        "Ignored for any submodel that is provided separately.",
         input=Input.Direct,
         ui_model_base=BaseModelType.QwenImage,
         ui_model_type=ModelType.Main,
@@ -76,32 +99,49 @@ def invoke(self, context: InvocationContext) -> QwenImageModelLoaderOutput:
         # Transformer always comes from the main model
         transformer = self.model.model_copy(update={"submodel_type": SubModelType.Transformer})
 
-        if main_is_diffusers:
-            # Diffusers model: extract all components directly
+        # Resolve VAE: standalone override > main (if Diffusers) > component source
+        if self.vae_model is not None:
+            vae = self.vae_model.model_copy(update={"submodel_type": SubModelType.VAE})
+        elif main_is_diffusers:
             vae = self.model.model_copy(update={"submodel_type": SubModelType.VAE})
+        elif self.component_source is not None:
+            self._validate_component_source_format(context, self.component_source)
+            vae = self.component_source.model_copy(update={"submodel_type": SubModelType.VAE})
+        else:
+            raise ValueError(
+                "No source for VAE. Either set 'VAE' to a standalone Qwen Image VAE, "
+                "or set 'Component Source' to a Diffusers Qwen Image model."
+            )
+
+        # Resolve Qwen VL encoder: standalone override > main (if Diffusers) > component source
+        if self.qwen_vl_encoder_model is not None:
+            tokenizer = self.qwen_vl_encoder_model.model_copy(update={"submodel_type": SubModelType.Tokenizer})
+            text_encoder = self.qwen_vl_encoder_model.model_copy(update={"submodel_type": SubModelType.TextEncoder})
+        elif main_is_diffusers:
             tokenizer = self.model.model_copy(update={"submodel_type": SubModelType.Tokenizer})
             text_encoder = self.model.model_copy(update={"submodel_type": SubModelType.TextEncoder})
         elif self.component_source is not None:
-            # GGUF/checkpoint transformer: get VAE + encoder from the component source
-            source_config = context.models.get_config(self.component_source)
-            if source_config.format != ModelFormat.Diffusers:
-                raise ValueError(
-                    f"The Component Source model must be in Diffusers format. "
-                    f"The selected model '{source_config.name}' is in {source_config.format.value} format."
-                )
-            vae = self.component_source.model_copy(update={"submodel_type": SubModelType.VAE})
+            self._validate_component_source_format(context, self.component_source)
             tokenizer = self.component_source.model_copy(update={"submodel_type": SubModelType.Tokenizer})
             text_encoder = self.component_source.model_copy(update={"submodel_type": SubModelType.TextEncoder})
         else:
             raise ValueError(
-                "No source for VAE and Qwen VL encoder. "
-                "GGUF quantized models only contain the transformer — "
-                "please set 'Component Source' to a Diffusers Qwen Image model "
-                "to provide the VAE and text encoder."
+                "No source for Qwen VL encoder. "
+                "Either set 'Qwen VL Encoder' to a standalone Qwen2.5-VL encoder, "
+                "or set 'Component Source' to a Diffusers Qwen Image model."
             )
 
         return QwenImageModelLoaderOutput(
             transformer=TransformerField(transformer=transformer, loras=[]),
             qwen_vl_encoder=QwenVLEncoderField(tokenizer=tokenizer, text_encoder=text_encoder),
             vae=VAEField(vae=vae),
         )
+
+    @staticmethod
+    def _validate_component_source_format(context: InvocationContext, model: ModelIdentifierField) -> None:
+        source_config = context.models.get_config(model)
+        if source_config.format != ModelFormat.Diffusers:
+            raise ValueError(
+                f"The Component Source model must be in Diffusers format. "
+                f"The selected model '{source_config.name}' is in {source_config.format.value} format."
+            )

invokeai/app/invocations/qwen_image_text_encoder.py

Lines changed: 33 additions & 9 deletions
@@ -161,17 +161,35 @@ def _encode(
         # Build the processor
         tokenizer_config = context.models.get_config(self.qwen_vl_encoder.tokenizer)
         model_root = context.models.get_absolute_path(tokenizer_config)
-        tokenizer_dir = model_root / "tokenizer"
 
-        tokenizer = AutoTokenizer.from_pretrained(str(tokenizer_dir), local_files_only=True)
+        # Single-file checkpoints (e.g. ComfyUI fp8_scaled): model_root is the
+        # safetensors file itself, so there's no tokenizer/processor folder
+        # alongside it. Fall back to the canonical Qwen2.5-VL repo on HF (small
+        # ~10 MB download for tokenizer+processor configs, cached for offline use).
+        if model_root.is_file():
+            HF_REPO = "Qwen/Qwen2.5-VL-7B-Instruct"
+            try:
+                tokenizer = AutoTokenizer.from_pretrained(HF_REPO, local_files_only=True)
+            except OSError:
+                tokenizer = AutoTokenizer.from_pretrained(HF_REPO)
+            try:
+                image_processor = _ImageProcessorCls.from_pretrained(HF_REPO, local_files_only=True)
+            except OSError:
+                try:
+                    image_processor = _ImageProcessorCls.from_pretrained(HF_REPO)
+                except Exception:
+                    image_processor = _ImageProcessorCls()
+        else:
+            tokenizer_dir = model_root / "tokenizer"
+            tokenizer = AutoTokenizer.from_pretrained(str(tokenizer_dir), local_files_only=True)
 
-        image_processor = None
-        for search_dir in [model_root / "processor", tokenizer_dir, model_root, model_root / "image_processor"]:
-            if (search_dir / "preprocessor_config.json").exists():
-                image_processor = _ImageProcessorCls.from_pretrained(str(search_dir), local_files_only=True)
-                break
-        if image_processor is None:
-            image_processor = _ImageProcessorCls()
+            image_processor = None
+            for search_dir in [model_root / "processor", tokenizer_dir, model_root, model_root / "image_processor"]:
+                if (search_dir / "preprocessor_config.json").exists():
+                    image_processor = _ImageProcessorCls.from_pretrained(str(search_dir), local_files_only=True)
+                    break
+            if image_processor is None:
+                image_processor = _ImageProcessorCls()
 
         processor = Qwen2_5_VLProcessor(
             tokenizer=tokenizer,
@@ -264,6 +282,12 @@ def _load_quantized_encoder(self, context: InvocationContext):
 
         encoder_config = context.models.get_config(self.qwen_vl_encoder.text_encoder)
         model_root = context.models.get_absolute_path(encoder_config)
+        if model_root.is_file():
+            # Single-file checkpoint (e.g. ComfyUI fp8_scaled): BnB can't load from
+            # a single file, and the checkpoint is already FP8-compressed anyway.
+            # Fall back to the cached path; the user effectively gets fp8 instead of
+            # int8/nf4, which is comparable in size.
+            return self._load_cached_encoder(context)
         encoder_path = model_root / "text_encoder"
 
         if self.quantization == "nf4":

invokeai/backend/model_manager/configs/factory.py

Lines changed: 9 additions & 0 deletions
@@ -90,6 +90,10 @@
     Qwen3Encoder_GGUF_Config,
     Qwen3Encoder_Qwen3Encoder_Config,
 )
+from invokeai.backend.model_manager.configs.qwen_vl_encoder import (
+    QwenVLEncoder_Checkpoint_Config,
+    QwenVLEncoder_Diffusers_Config,
+)
 from invokeai.backend.model_manager.configs.siglip import SigLIP_Diffusers_Config
 from invokeai.backend.model_manager.configs.spandrel import Spandrel_Checkpoint_Config
 from invokeai.backend.model_manager.configs.t2i_adapter import (
@@ -111,6 +115,7 @@
     VAE_Checkpoint_Anima_Config,
     VAE_Checkpoint_Flux2_Config,
     VAE_Checkpoint_FLUX_Config,
+    VAE_Checkpoint_QwenImage_Config,
     VAE_Checkpoint_SD1_Config,
     VAE_Checkpoint_SD2_Config,
     VAE_Checkpoint_SDXL_Config,
@@ -194,6 +199,7 @@
     Annotated[VAE_Checkpoint_SDXL_Config, VAE_Checkpoint_SDXL_Config.get_tag()],
     Annotated[VAE_Checkpoint_FLUX_Config, VAE_Checkpoint_FLUX_Config.get_tag()],
     Annotated[VAE_Checkpoint_Flux2_Config, VAE_Checkpoint_Flux2_Config.get_tag()],
+    Annotated[VAE_Checkpoint_QwenImage_Config, VAE_Checkpoint_QwenImage_Config.get_tag()],
     Annotated[VAE_Checkpoint_Anima_Config, VAE_Checkpoint_Anima_Config.get_tag()],
     # VAE - diffusers format
     Annotated[VAE_Diffusers_SD1_Config, VAE_Diffusers_SD1_Config.get_tag()],
@@ -242,6 +248,9 @@
     Annotated[Qwen3Encoder_Qwen3Encoder_Config, Qwen3Encoder_Qwen3Encoder_Config.get_tag()],
     Annotated[Qwen3Encoder_Checkpoint_Config, Qwen3Encoder_Checkpoint_Config.get_tag()],
     Annotated[Qwen3Encoder_GGUF_Config, Qwen3Encoder_GGUF_Config.get_tag()],
+    # Qwen VL Encoder (Qwen2.5-VL multimodal encoder for Qwen Image)
+    Annotated[QwenVLEncoder_Diffusers_Config, QwenVLEncoder_Diffusers_Config.get_tag()],
+    Annotated[QwenVLEncoder_Checkpoint_Config, QwenVLEncoder_Checkpoint_Config.get_tag()],
     # TI - file format
     Annotated[TI_File_SD1_Config, TI_File_SD1_Config.get_tag()],
     Annotated[TI_File_SD2_Config, TI_File_SD2_Config.get_tag()],
invokeai/backend/model_manager/configs/qwen_vl_encoder.py

Lines changed: 154 additions & 0 deletions

@@ -0,0 +1,154 @@
import json
from pathlib import Path
from typing import Any, Iterable, Literal, Self

from pydantic import Field
from safetensors import safe_open

from invokeai.backend.model_manager.configs.base import Checkpoint_Config_Base, Config_Base
from invokeai.backend.model_manager.configs.identification_utils import (
    NotAMatchError,
    raise_for_override_fields,
    raise_if_not_dir,
    raise_if_not_file,
)
from invokeai.backend.model_manager.model_on_disk import ModelOnDisk
from invokeai.backend.model_manager.taxonomy import BaseModelType, ModelFormat, ModelType

_RECOGNIZED_TEXT_ENCODER_CLASSES = {
    "Qwen2_5_VLForConditionalGeneration",
    "Qwen2VLForConditionalGeneration",
}


def _has_qwen_vl_keys(keys: Iterable[str]) -> bool:
    """A Qwen2.5-VL/Qwen2-VL checkpoint must have both LM weights and a visual
    tower — that's what distinguishes it from text-only Qwen3/Qwen2 encoders."""
    has_lm = False
    has_vision = False
    for k in keys:
        if not isinstance(k, str):
            continue
        if not has_lm and (k == "model.embed_tokens.weight" or k.startswith("model.layers.")):
            has_lm = True
        if not has_vision and (k.startswith("visual.patch_embed.") or k.startswith("visual.blocks.")):
            has_vision = True
        if has_lm and has_vision:
            return True
    return False


def _read_safetensors_keys(path: Path) -> list[str]:
    """Read only the key index from a safetensors file without loading tensor data.

    Avoids holding multi-GB encoder weights in RAM just to classify the file.
    """
    with safe_open(str(path), framework="pt", device="cpu") as f:
        return list(f.keys())


class QwenVLEncoder_Diffusers_Config(Config_Base):
    """Configuration for standalone Qwen2.5-VL encoder models in diffusers-style folder layout.

    Expected structure:
        <model_root>/
            text_encoder/
                config.json  (with `_class_name` or `architectures` listing
                              `Qwen2_5_VLForConditionalGeneration`)
                model.safetensors
            tokenizer/
                tokenizer_config.json
                ...
            processor/  (optional, for vision preprocessing)
                preprocessor_config.json

    This lets users avoid downloading the full ~40 GB Qwen Image diffusers pipeline
    when they only need the Qwen2.5-VL encoder for use with a GGUF transformer.
    """

    base: Literal[BaseModelType.Any] = Field(default=BaseModelType.Any)
    type: Literal[ModelType.QwenVLEncoder] = Field(default=ModelType.QwenVLEncoder)
    format: Literal[ModelFormat.QwenVLEncoder] = Field(default=ModelFormat.QwenVLEncoder)

    @classmethod
    def from_model_on_disk(cls, mod: ModelOnDisk, override_fields: dict[str, Any]) -> Self:
        raise_if_not_dir(mod)

        raise_for_override_fields(cls, override_fields)

        # Reject anything that looks like a full pipeline (those are matched as Main models).
        if (mod.path / "model_index.json").exists() or (mod.path / "transformer").exists():
            raise NotAMatchError(
                "directory looks like a full diffusers pipeline (has model_index.json or transformer folder), "
                "not a standalone Qwen VL encoder"
            )

        text_encoder_dir = mod.path / "text_encoder"
        tokenizer_dir = mod.path / "tokenizer"

        if not text_encoder_dir.is_dir():
            raise NotAMatchError("missing text_encoder/ subfolder")
        if not tokenizer_dir.is_dir():
            raise NotAMatchError("missing tokenizer/ subfolder")

        config_path = text_encoder_dir / "config.json"
        if not config_path.is_file():
            raise NotAMatchError(f"missing {config_path}")

        try:
            with open(config_path, "r", encoding="utf-8") as f:
                cfg = json.load(f)
        except (OSError, json.JSONDecodeError) as e:
            raise NotAMatchError(f"could not read text_encoder/config.json: {e}") from e

        class_name = cfg.get("_class_name")
        architectures = cfg.get("architectures") or []
        candidates = {class_name, *architectures} - {None}

        if not candidates & _RECOGNIZED_TEXT_ENCODER_CLASSES:
            raise NotAMatchError(
                f"text_encoder class is {sorted(candidates) or 'unknown'}, "
                f"expected one of {sorted(_RECOGNIZED_TEXT_ENCODER_CLASSES)}"
            )

        return cls(**override_fields)


class QwenVLEncoder_Checkpoint_Config(Checkpoint_Config_Base, Config_Base):
    """Configuration for single-file Qwen2.5-VL encoder checkpoints (safetensors).

    This matches ComfyUI-style consolidated single-file encoders such as
    `qwen_2.5_vl_7b_fp8_scaled.safetensors`, which bundle the language model
    and the visual tower into one file (typically with FP8 + per-tensor
    `weight_scale` ComfyUI quantization).

    The matching tokenizer + processor are pulled from HuggingFace
    (`Qwen/Qwen2.5-VL-7B-Instruct`) on first use and cached for offline use.
    """

    base: Literal[BaseModelType.Any] = Field(default=BaseModelType.Any)
    type: Literal[ModelType.QwenVLEncoder] = Field(default=ModelType.QwenVLEncoder)
    format: Literal[ModelFormat.Checkpoint] = Field(default=ModelFormat.Checkpoint)

    @classmethod
    def from_model_on_disk(cls, mod: ModelOnDisk, override_fields: dict[str, Any]) -> Self:
        raise_if_not_file(mod)

        raise_for_override_fields(cls, override_fields)

        # Only safetensors checkpoints are supported as single-file Qwen VL encoders.
        # Reject other extensions cheaply before attempting to read keys.
        if mod.path.suffix != ".safetensors":
            raise NotAMatchError(f"expected a .safetensors file, got {mod.path.suffix or '(no suffix)'}")

        # Read only the key index — a 7GB fp8 encoder weighs ~7GB on disk, but we
        # only need the key names to classify it, not the tensor data.
        try:
            keys = _read_safetensors_keys(mod.path)
        except Exception as e:
            raise NotAMatchError(f"could not read safetensors header: {e}") from e

        if not _has_qwen_vl_keys(keys):
            raise NotAMatchError("state dict does not look like a Qwen2.5-VL/Qwen2-VL checkpoint")

        return cls(**override_fields)
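To see what the `_has_qwen_vl_keys` predicate above accepts and rejects, here is a self-contained copy of its logic exercised against made-up key names (no safetensors file needed, since classification only inspects key strings):

```python
# Self-contained restatement of the classification predicate from the config
# above: a match needs both language-model keys and visual-tower keys.
def has_qwen_vl_keys(keys) -> bool:
    has_lm = any(k == "model.embed_tokens.weight" or k.startswith("model.layers.") for k in keys)
    has_vision = any(k.startswith(("visual.patch_embed.", "visual.blocks.")) for k in keys)
    return has_lm and has_vision

full_vl = ["model.embed_tokens.weight", "model.layers.0.mlp.up_proj.weight", "visual.blocks.0.attn.qkv.weight"]
text_only = ["model.embed_tokens.weight", "model.layers.0.mlp.up_proj.weight"]  # e.g. a text-only encoder
vision_only = ["visual.patch_embed.proj.weight"]

print(has_qwen_vl_keys(full_vl))     # True
print(has_qwen_vl_keys(text_only))   # False
print(has_qwen_vl_keys(vision_only)) # False
```

Requiring both halves is what keeps text-only Qwen3/Qwen2 encoder checkpoints from being misclassified as Qwen2.5-VL.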
