Commit 3e83f43

apolinario and dg845 authored
Add structured prompt upsampling to Ideogram4 (#13860)
* Add structured prompt upsampling to Ideogram4

  Rewrite prompts into Ideogram4's native structured JSON caption before encoding, opt-in via `prompt_upsampling=True` in `__call__` or the standalone `upsample_prompt`. Upsampling is driven by a generative `text_encoder` (`Qwen3VLForConditionalGeneration`, which carries the LM head); the head-less `Qwen3VLModel` is still supported for plain conditioning, and `upsample_prompt` raises an instructive error when the encoder cannot generate. Captions are schema-constrained via `outlines` when installed, and the modular pipeline gains a matching prompt-enhancer block.

* Remove LM-head grafting; modular block uses a generative text_encoder

  Drop `graft_lm_head` and drive the modular `Ideogram4PromptUpsampleStep` off a generative `text_encoder` (`Qwen3VLForConditionalGeneration`), matching the standard pipeline: guard with `can_generate()` and an instructive error, and build the outlines logits processor lazily. Updates the copied `_get_text_encoder_hidden_states` to resolve the decoder for both encoder classes.

* Style docstrings with doc-builder; mark prompt strings docstyle-ignore

  Reflow the Ideogram4 prompt-enhancer docstrings to the 119-col doc-builder style, and add `# docstyle-ignore` to the functional `CAPTION_SYSTEM_MESSAGE` and `CAPTION_USER_TEMPLATE` strings so the styler doesn't rewrap them (matching Flux2's `system_messages.py`).

* Use an Ideogram4PromptEnhancerHead component for prompt upsampling

  Add `Ideogram4PromptEnhancerHead`, a small `ModelMixin` holding the Qwen3-VL LM head, as an optional `prompt_enhancer_head` pipeline component. Upsampling loads the head via a normal `from_pretrained` (its own repo, or bundled later) instead of an in-pipeline download, and grafts it onto the shared head-less `text_encoder` body so no second 8B body is loaded. Both the standard and modular pipelines build the generative model from `text_encoder` + the head; `upsample_prompt` raises an instructive error when the head component is absent.

* Update src/diffusers/pipelines/ideogram4/pipeline_ideogram4.py

  Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>

* Update src/diffusers/pipelines/ideogram4/pipeline_ideogram4.py

  Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>

* Apply suggestion from @apolinario

* Apply suggestion from @dg845

  Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>

* Apply suggestion from @apolinario

* Fix trailing whitespace

* docs: add prompt-upsampling examples (remote API + local head) for Ideogram4

---------

Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
1 parent 9b0818c commit 3e83f43
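
The commit message settles on pairing the already-loaded, head-less encoder body with a small, separately loaded LM head. The composition can be sketched in plain Python with toy stand-ins. `ToyBody`, `ToyHead`, and `GenerativeWrapper` are hypothetical names used only for illustration; the real `build_prompt_enhancer` is not shown in this diff, so its internals may differ.

```python
from dataclasses import dataclass


class ToyBody:
    """Stand-in for a head-less encoder: maps token ids to a hidden vector."""

    def hidden_state(self, token_ids: list[int]) -> list[float]:
        # Toy "encoding": two summary features of the input.
        return [float(sum(token_ids)), float(len(token_ids))]


class ToyHead:
    """Stand-in for an LM head: a linear map from hidden size to vocab size."""

    def __init__(self, weight: list[list[float]]):
        self.weight = weight  # shape: vocab_size x hidden_size

    def logits(self, hidden: list[float]) -> list[float]:
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.weight]


@dataclass
class GenerativeWrapper:
    """Combines an existing body with a head; the body is referenced, not copied."""

    body: ToyBody
    head: ToyHead

    def next_token(self, token_ids: list[int]) -> int:
        scores = self.head.logits(self.body.hidden_state(token_ids))
        return max(range(len(scores)), key=scores.__getitem__)  # greedy pick


body = ToyBody()  # assumed to be already loaded for plain conditioning
head = ToyHead([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # small, loaded separately
enhancer = GenerativeWrapper(body=body, head=head)
```

Because the wrapper only holds a reference to `body`, the same object serves both conditioning and generation, which is the property the commit message describes as avoiding a second 8B body.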

11 files changed

Lines changed: 552 additions & 11 deletions

docs/source/en/api/pipelines/ideogram4.md

Lines changed: 66 additions & 0 deletions
@@ -40,12 +40,78 @@ image = pipe(prompt, height=1024, width=1024, generator=torch.Generator("cuda").
 image.save("ideogram4.png")
 ```
 
+## Prompt upsampling
+
+Ideogram 4 is trained on a structured JSON caption rather than a free-form prompt, so a short prompt is best
+expanded into that native schema before generation. There are two ways to produce the caption.
+
+### Remote (Ideogram API)
+
+For the best results, expand the prompt with Ideogram's hosted magic-prompt API and pass the returned caption
+straight to the pipeline (get a key at [developer.ideogram.ai](https://developer.ideogram.ai/)):
+
+```python
+import json
+import requests
+import torch
+from diffusers import Ideogram4Pipeline
+
+pipe = Ideogram4Pipeline.from_pretrained("ideogram-ai/ideogram-4-nf4", torch_dtype=torch.bfloat16)
+pipe.to("cuda")
+
+# Expand the prompt into a structured JSON caption with Ideogram's hosted magic-prompt API.
+response = requests.post(
+    "https://api.ideogram.ai/v1/ideogram-v4/magic-prompt",
+    headers={"Api-Key": "your_ideogram_api_key"},
+    json={"text_prompt": "A photo of a cat holding a sign that says hello world", "aspect_ratio": "1x1"},
+).json()
+caption = json.dumps(response["json_prompt"])
+
+# The caption is already upsampled, so pass it directly (no prompt_upsampling).
+image = pipe(caption, height=1024, width=1024, generator=torch.Generator("cuda").manual_seed(0)).images[0]
+image.save("ideogram4_upsampled.png")
+```
+
+### Local (on-device)
+
+For a fully local pipeline, load a small [`Ideogram4PromptEnhancerHead`] (the Qwen3-VL LM head) as the optional
+`prompt_enhancer_head` component and pass `prompt_upsampling=True`. The head is grafted onto the shared
+`text_encoder`, so no second text encoder is loaded. Install `outlines` for schema-constrained captions (the nf4
+checkpoint also needs `bitsandbytes`):
+
+```python
+import torch
+from diffusers import Ideogram4Pipeline, Ideogram4PromptEnhancerHead
+
+prompt_enhancer_head = Ideogram4PromptEnhancerHead.from_pretrained(
+    "diffusers/qwen3-vl-8b-instruct-lm-head", torch_dtype=torch.bfloat16
+)
+pipe = Ideogram4Pipeline.from_pretrained(
+    "ideogram-ai/ideogram-4-nf4", prompt_enhancer_head=prompt_enhancer_head, torch_dtype=torch.bfloat16
+)
+pipe.to("cuda")
+
+prompt = "A photo of a cat holding a sign that says hello world"
+image = pipe(
+    prompt,
+    height=1024,
+    width=1024,
+    prompt_upsampling=True,
+    generator=torch.Generator("cuda").manual_seed(0),
+).images[0]
+image.save("ideogram4_upsampled.png")
+```
+
 ## Ideogram4Pipeline
 
 [[autodoc]] Ideogram4Pipeline
 - all
 - __call__
 
+## Ideogram4PromptEnhancerHead
+
+[[autodoc]] Ideogram4PromptEnhancerHead
+
 ## Ideogram4PipelineOutput
 
 [[autodoc]] pipelines.ideogram4.pipeline_output.Ideogram4PipelineOutput
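
The remote example in this file sends `text_prompt` and `aspect_ratio` and reads `json_prompt` from the reply. A minimal sketch of helpers that keep the aspect ratio in step with the requested `height` and `width`, and that fail clearly when the reply carries no caption, might look like the following. The helper names are hypothetical, and the reduced `"WxH"` format is extrapolated from the single `"1x1"` value shown above; the hosted API may accept only a fixed set of ratios, so its documentation should be checked.

```python
import json
from math import gcd


def aspect_ratio_string(width: int, height: int) -> str:
    """Reduce width:height to lowest terms, e.g. 1024 by 1024 gives "1x1".

    Assumption: the "WxH" format is extrapolated from the "1x1" value in the docs example.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    divisor = gcd(width, height)
    return f"{width // divisor}x{height // divisor}"


def build_payload(text_prompt: str, width: int, height: int) -> dict:
    """Request body with the same two fields the docs example sends."""
    return {"text_prompt": text_prompt, "aspect_ratio": aspect_ratio_string(width, height)}


def caption_from_response(response: dict) -> str:
    """Serialize the structured caption, failing clearly if it is missing."""
    if "json_prompt" not in response:
        raise KeyError(f"response has no 'json_prompt' field; keys: {sorted(response)}")
    return json.dumps(response["json_prompt"])
```

When the request is made with `requests`, calling `raise_for_status()` on the response before `.json()` surfaces HTTP errors such as a rejected API key before any caption parsing is attempted.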

src/diffusers/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -594,6 +594,7 @@
 "HunyuanVideoPipeline",
 "I2VGenXLPipeline",
 "Ideogram4Pipeline",
+"Ideogram4PromptEnhancerHead",
 "IFImg2ImgPipeline",
 "IFImg2ImgSuperResolutionPipeline",
 "IFInpaintingPipeline",
@@ -1413,6 +1414,7 @@
 HunyuanVideoPipeline,
 I2VGenXLPipeline,
 Ideogram4Pipeline,
+Ideogram4PromptEnhancerHead,
 IFImg2ImgPipeline,
 IFImg2ImgSuperResolutionPipeline,
 IFInpaintingPipeline,

src/diffusers/modular_pipelines/ideogram4/encoders.py

Lines changed: 141 additions & 1 deletion
@@ -17,7 +17,14 @@
 from transformers import Qwen2Tokenizer, Qwen3VLModel
 from transformers.masking_utils import create_causal_mask
 
-from ...utils import logging
+from ...pipelines.ideogram4.prompt_enhancer import (
+    PROMPT_UPSAMPLE_TEMPERATURE,
+    Ideogram4PromptEnhancerHead,
+    build_caption_logits_processor,
+    build_prompt_enhancer,
+    generate_captions,
+)
+from ...utils import is_outlines_available, logging
 from ..modular_pipeline import ModularPipelineBlocks, PipelineState
 from ..modular_pipeline_utils import ComponentSpec, InputParam, OutputParam
 from .modular_pipeline import Ideogram4ModularPipeline
@@ -31,6 +38,139 @@
 QWEN3_VL_ACTIVATION_LAYERS = (0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 35)
 
 
+# auto_docstring
+class Ideogram4PromptUpsampleStep(ModularPipelineBlocks):
+    """
+    Optional step that rewrites the prompt(s) into Ideogram4's native structured JSON caption (the format the model is
+    trained on) when ``prompt_upsampling=True``. Requires the optional ``prompt_enhancer_head`` component, which is
+    grafted onto the shared ``text_encoder`` body to make it generative; install ``outlines`` for schema-constrained
+    captions.
+
+    Components:
+        text_encoder (`Qwen3VLModel`): The Qwen3-VL text encoder. tokenizer (`Qwen2Tokenizer`): The tokenizer paired
+        with the text encoder. prompt_enhancer_head (`Ideogram4PromptEnhancerHead`): The LM head grafted onto the
+        text encoder for upsampling.
+
+    Inputs:
+        prompt (`str`):
+            The prompt or prompts to guide image generation.
+        prompt_upsampling (`bool`, *optional*, defaults to False):
+            If True, rewrite the prompt into the native JSON caption before encoding.
+        prompt_upsampling_temperature (`float`, *optional*, defaults to 1.0):
+            Sampling temperature for prompt upsampling.
+        height (`int`, *optional*):
+            Together with width, sets the caption's target aspect ratio.
+        width (`int`, *optional*):
+            Together with height, sets the caption's target aspect ratio.
+        generator (`Generator`, *optional*):
+            Reused to make the upsampling reproducible.
+
+    Outputs:
+        prompt (`str`):
+            The (possibly upsampled) prompt forwarded to the text encoder.
+    """
+
+    model_name = "ideogram4"
+
+    def __init__(self):
+        # Built lazily on first upsample: the head-less encoder body + `prompt_enhancer_head`, combined.
+        self._prompt_enhancer = None
+        # Outlines logits processor for schema-constrained captions; built lazily on first upsample.
+        self._caption_logits_processor = None
+        super().__init__()
+
+    @property
+    def description(self) -> str:
+        return (
+            "Optional step that rewrites the prompt(s) into Ideogram4's native structured JSON caption when "
+            "`prompt_upsampling=True` (the format the model is trained on). Requires a generative `text_encoder` "
+            "(a `Qwen3VLForConditionalGeneration`); install `outlines` for schema-constrained captions."
+        )
+
+    @property
+    def expected_components(self) -> list[ComponentSpec]:
+        return [
+            ComponentSpec("text_encoder", Qwen3VLModel, description="The Qwen3-VL text encoder."),
+            ComponentSpec("tokenizer", Qwen2Tokenizer, description="The tokenizer paired with the text encoder."),
+            ComponentSpec(
+                "prompt_enhancer_head",
+                Ideogram4PromptEnhancerHead,
+                description="LM head grafted onto the text encoder for prompt upsampling.",
+            ),
+        ]
+
+    @property
+    def inputs(self) -> list[InputParam]:
+        return [
+            InputParam.template("prompt", required=True),
+            InputParam(
+                name="prompt_upsampling",
+                type_hint=bool,
+                default=False,
+                description="If True, rewrite the prompt into Ideogram4's native JSON caption before encoding.",
+            ),
+            InputParam(
+                name="prompt_upsampling_temperature",
+                type_hint=float,
+                default=PROMPT_UPSAMPLE_TEMPERATURE,
+                description="Sampling temperature for prompt upsampling.",
+            ),
+            InputParam.template("height"),
+            InputParam.template("width"),
+            InputParam.template("max_sequence_length", default=2048),
+            InputParam.template("generator"),
+        ]
+
+    @property
+    def intermediate_outputs(self) -> list[OutputParam]:
+        return [
+            OutputParam(
+                name="prompt",
+                type_hint=list,
+                description="The (possibly upsampled) prompt forwarded to the text encoder.",
+            ),
+        ]
+
+    @torch.no_grad()
+    def __call__(self, components: Ideogram4ModularPipeline, state: PipelineState) -> PipelineState:
+        block_state = self.get_block_state(state)
+
+        if block_state.prompt_upsampling:
+            if components.prompt_enhancer_head is None:
+                raise ValueError(
+                    "Prompt upsampling requires the `prompt_enhancer_head` component, which is not loaded. Load an "
+                    "`Ideogram4PromptEnhancerHead` and add it to the pipeline."
+                )
+            if self._prompt_enhancer is None:
+                self._prompt_enhancer = build_prompt_enhancer(components.text_encoder, components.prompt_enhancer_head)
+            if self._caption_logits_processor is None and is_outlines_available():
+                self._caption_logits_processor = build_caption_logits_processor(
+                    self._prompt_enhancer, components.tokenizer
+                )
+            if self._caption_logits_processor is None:
+                logger.warning_once(
+                    "`outlines` is not installed; prompt upsampling runs unconstrained and may not return "
+                    "schema-valid JSON. Install with `pip install outlines` for structured captions."
+                )
+            height = block_state.height or components.default_height
+            width = block_state.width or components.default_width
+            block_state.prompt = generate_captions(
+                self._prompt_enhancer,
+                components.tokenizer,
+                self._caption_logits_processor,
+                block_state.prompt,
+                height,
+                width,
+                temperature=block_state.prompt_upsampling_temperature,
+                max_new_tokens=block_state.max_sequence_length,
+                generator=block_state.generator,
+                device=components._execution_device,
+            )
+
+        self.set_block_state(state, block_state)
+        return components, state
+
+
 # auto_docstring
 class Ideogram4TextEncoderStep(ModularPipelineBlocks):
     """
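
The warning in `Ideogram4PromptUpsampleStep.__call__` notes that, without `outlines`, the generated caption may not be schema-valid JSON. A caller-side guard for that case could be sketched as follows. `caption_or_fallback` is a hypothetical helper, not part of this commit; Ideogram4's caption schema is not shown in this diff, so the check only covers the top-level shape, and falling back to the original prompt is an assumed policy rather than the pipeline's behavior.

```python
import json


def caption_or_fallback(generated: str, original_prompt: str) -> str:
    """Return `generated` if it parses as a JSON object, otherwise `original_prompt`."""
    try:
        parsed = json.loads(generated)
    except json.JSONDecodeError:
        # Truncated or free-form output: keep the user's prompt instead.
        return original_prompt
    # A bare string, number, or list is valid JSON but not a structured caption.
    if not isinstance(parsed, dict):
        return original_prompt
    return generated
```

A check like this only catches malformed output; it does not verify field names or types, which is what schema-constrained decoding is meant to guarantee.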

src/diffusers/modular_pipelines/ideogram4/modular_blocks_ideogram4.py

Lines changed: 14 additions & 5 deletions
@@ -24,7 +24,7 @@
 )
 from .decoders import Ideogram4DecodeStep
 from .denoise import Ideogram4AfterDenoiseStep, Ideogram4DenoiseStep
-from .encoders import Ideogram4TextEncoderStep
+from .encoders import Ideogram4PromptUpsampleStep, Ideogram4TextEncoderStep
 
 
 logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
@@ -123,6 +123,10 @@ class Ideogram4AutoBlocks(SequentialPipelineBlocks):
     Inputs:
         prompt (`str`):
             The prompt or prompts to guide image generation.
+        prompt_upsampling (`bool`, *optional*, defaults to False):
+            Rewrite the prompt into Ideogram4's native structured JSON caption before encoding.
+        prompt_upsampling_temperature (`float`, *optional*, defaults to 1.0):
+            Sampling temperature for prompt upsampling.
         max_sequence_length (`int`, *optional*, defaults to 2048):
             Maximum sequence length for prompt encoding.
         num_images_per_prompt (`int`, *optional*, defaults to 1):
@@ -154,8 +158,13 @@ class Ideogram4AutoBlocks(SequentialPipelineBlocks):
     """
 
     model_name = "ideogram4"
-    block_classes = [Ideogram4TextEncoderStep(), Ideogram4CoreDenoiseStep(), Ideogram4DecodeStep()]
-    block_names = ["text_encoder", "denoise", "decode"]
+    block_classes = [
+        Ideogram4PromptUpsampleStep(),
+        Ideogram4TextEncoderStep(),
+        Ideogram4CoreDenoiseStep(),
+        Ideogram4DecodeStep(),
+    ]
+    block_names = ["prompt_upsample", "text_encoder", "denoise", "decode"]
 
     # Workflow map declaring the trigger conditions for each supported workflow.
     # `True` means the workflow triggers when the input is not None.
@@ -166,8 +175,8 @@ class Ideogram4AutoBlocks(SequentialPipelineBlocks):
     @property
     def description(self) -> str:
         return (
-            "Auto Modular pipeline for text-to-image generation using Ideogram4: encode text -> core denoise "
-            "(asymmetric CFG over two transformers) -> decode."
+            "Auto Modular pipeline for text-to-image generation using Ideogram4: (optional) prompt upsampling -> "
+            "encode text -> core denoise (asymmetric CFG over two transformers) -> decode."
         )
 
     @property
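
The change to `block_classes` puts an opt-in step in front of the existing sequence. The effect can be sketched with a toy sequential runner: the new first block leaves the state alone unless the flag is set, so existing callers see the same behavior. All names below are hypothetical stand-ins, not the modular pipeline API, and the `"caption"` key is a placeholder rather than Ideogram4's real schema.

```python
import json
from typing import Callable

State = dict
Block = Callable[[State], State]


def prompt_upsample(state: State) -> State:
    # No-op unless the caller opts in, so the default path is unchanged.
    if state.get("prompt_upsampling", False):
        state["prompt"] = json.dumps({"caption": state["prompt"]})  # toy stand-in for the rewrite
    return state


def text_encoder(state: State) -> State:
    state["encoded"] = state["prompt"].upper()  # toy stand-in for encoding
    return state


# Order matters: the rewrite must run before the prompt is encoded.
BLOCKS: list[tuple[str, Block]] = [
    ("prompt_upsample", prompt_upsample),
    ("text_encoder", text_encoder),
]


def run(state: State) -> State:
    for _name, block in BLOCKS:
        state = block(state)
    return state
```

Running `run({"prompt": "a cat"})` encodes the prompt untouched, while adding `"prompt_upsampling": True` makes the encoder see the rewritten prompt instead.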

src/diffusers/pipelines/__init__.py

Lines changed: 2 additions & 2 deletions
@@ -288,7 +288,7 @@
 ]
 _import_structure["hunyuan_video1_5"] = ["HunyuanVideo15Pipeline", "HunyuanVideo15ImageToVideoPipeline"]
 _import_structure["hunyuan_image"] = ["HunyuanImagePipeline", "HunyuanImageRefinerPipeline"]
-_import_structure["ideogram4"] = ["Ideogram4Pipeline"]
+_import_structure["ideogram4"] = ["Ideogram4Pipeline", "Ideogram4PromptEnhancerHead"]
 _import_structure["kandinsky"] = [
     "KandinskyCombinedPipeline",
     "KandinskyImg2ImgCombinedPipeline",
@@ -748,7 +748,7 @@
 )
 from .hunyuan_video1_5 import HunyuanVideo15ImageToVideoPipeline, HunyuanVideo15Pipeline
 from .hunyuandit import HunyuanDiTPipeline
-from .ideogram4 import Ideogram4Pipeline
+from .ideogram4 import Ideogram4Pipeline, Ideogram4PromptEnhancerHead
 from .joyimage import JoyImageEditPipeline, JoyImageEditPipelineOutput
 from .kandinsky import (
     KandinskyCombinedPipeline,

src/diffusers/pipelines/ideogram4/__init__.py

Lines changed: 3 additions & 0 deletions
@@ -25,6 +25,8 @@
 
 _import_structure["pipeline_output"] = ["Ideogram4PipelineOutput"]
 
+_import_structure["prompt_enhancer"] = ["Ideogram4PromptEnhancerHead"]
+
 if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
     try:
         if not (is_transformers_available() and is_torch_available()):
@@ -34,6 +36,7 @@
     else:
         from .pipeline_ideogram4 import Ideogram4Pipeline
         from .pipeline_output import Ideogram4PipelineOutput
+        from .prompt_enhancer import Ideogram4PromptEnhancerHead
 else:
     import sys
 
