feat: multimodal prompt for generateImage/generateVideo (image-to-image, image-to-video) (#624)
* feat(ai): add imageInputs / videoInputs / audioInputs for image-conditioned generation (closes #618)
Adds optional `imageInputs`, `videoInputs`, and `audioInputs` to `generateImage()`
and `generateVideo()` for image-to-image, multi-reference, mask / inpaint,
image-to-video, and starting-frame flows. Each input part may carry a
`metadata.role` hint (`'reference' | 'mask' | 'control' | 'start_frame' |
'end_frame' | 'character'`) that adapters use to route to the provider-specific
field.
Provider behavior:
- OpenAI image: gpt-image-1 / -mini route to `images.edit()` (up to 16 + mask);
dall-e-2 routes to `images.edit()` with one source; dall-e-3 throws.
- OpenAI video: Sora-2 / -pro accept a single `input_reference`; throws on >1.
- Gemini: native models receive inputs as multimodal `contents` parts; Imagen
throws (text-only).
- fal: 1 input → `image_url`, >1 → `image_urls`; metadata roles map to
`mask_url` / `control_image_url` / `reference_image_urls`; video adds
`start_image_url` / `end_image_url`. Interim mapping until the fal schemas
library lands.
- Grok, OpenRouter: throw with a link back to #618 (pending native Imagine API
rewrite and multimodal injection work respectively).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* ci: apply automated fixes
* feat(ai-fal): resolve image-input fields per endpoint from generated SDK type map
Replace the fal image-input field heuristic with a per-endpoint mapping
generated from @fal-ai/client's EndpointTypeMap (scripts/
generate-fal-image-field-map.ts, run via pnpm generate:fal-image-fields).
The committed artifact stores only the 362 endpoints whose field names
deviate from the defaults (e.g. nano-banana edit -> image_urls, Kling i2v
start frame -> image_url, Veo first-last-frame -> first_frame_url /
last_frame_url, Fooocus masks -> mask_image_url); the old heuristic
remains the fallback for endpoints newer than the installed SDK.
Safety rails: the generated file `satisfies`-checks every field name
against the SDK endpoint types (type-only, erased at runtime), and a unit
test hashes the installed endpoints.d.ts against the recorded hash so an
SDK bump without regeneration fails test:lib with the regen command.
Mappers are now typed: both return FalImageInputFields<TModel>, Pick'ed
from the endpoint's real input type via a generated field-name union.
Roles resolving to the same list field merge (source + reference on
nano-banana); colliding scalar fields throw instead of overwriting.
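
The merge-versus-collision rule described above can be pictured with a minimal sketch. The `assignField` helper and the `FieldValue` type are hypothetical names chosen for illustration; this is not the adapter's actual code.

```typescript
// Hypothetical sketch: list-valued fields merge, scalar fields must not collide.
type FieldValue = string | string[]

function assignField(
  out: Record<string, FieldValue>,
  field: string,
  value: FieldValue,
): void {
  const existing = out[field]
  if (existing === undefined) {
    out[field] = value
    return
  }
  if (Array.isArray(existing) && Array.isArray(value)) {
    // Two roles resolved to the same list field: concatenate in order.
    out[field] = [...existing, ...value]
    return
  }
  // At least one side is a scalar: refuse to overwrite silently.
  throw new Error(`Conflicting values for field "${field}"`)
}

const fields: Record<string, FieldValue> = {}
assignField(fields, 'image_urls', ['https://example.com/source.png'])
assignField(fields, 'image_urls', ['https://example.com/reference.png'])
// fields['image_urls'] now holds both URLs, source first.
```

Throwing on a scalar collision keeps a second mask or start frame from silently replacing the first one.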
Also fixes the remaining CI lint failures: duplicate @tanstack/ai import
and non-null assertion in ai-fal video.ts, switch-exhaustiveness errors
in image-inputs.ts (restructured away), and the non-null assertion in
ai-openai image.ts.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(ai-grok,ai-openrouter): support imageInputs for image-conditioned generation
Grok: add the xAI Imagine API image models (grok-imagine-image,
grok-imagine-image-quality) to model-meta. With imageInputs they route to
xAI's JSON POST /v1/images/edits endpoint via direct fetch (the OpenAI
SDK's images.edit() sends multipart/form-data, which xAI rejects) — a
single input as image:{url}, 2-3 inputs as images:[...] referenceable in
the prompt as <IMAGE_0>/<IMAGE_1>; >3 inputs and mask/control roles throw.
Their generic `size` uses an aspectRatio_resolution template ('16:9_2k',
suffix optional), mirroring Gemini's native image models, and maps to the
Imagine aspect_ratio/resolution parameters on both the generate and edit
paths. grok-2-image-1212 stays text-to-image only with a clear error.
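
As an illustration of the `aspectRatio_resolution` template, the following sketch splits a `size` string into the two Imagine parameters. `parseSizeTemplate` and `ImagineSize` are hypothetical names, and the `digits:digits` validation is an assumption rather than the adapter's documented rule.

```typescript
// Hypothetical parser for the 'aspectRatio_resolution' size template.
interface ImagineSize {
  aspect_ratio: string
  resolution?: string
}

function parseSizeTemplate(size: string): ImagineSize {
  const [aspectRatio, resolution] = size.split('_')
  // Assumption: aspect ratios look like '16:9'.
  if (!aspectRatio || !/^\d+:\d+$/.test(aspectRatio)) {
    throw new Error(`Invalid size "${size}": expected e.g. '16:9' or '16:9_2k'`)
  }
  // The resolution suffix is optional; omit the key when it is absent.
  return resolution
    ? { aspect_ratio: aspectRatio, resolution }
    : { aspect_ratio: aspectRatio }
}

const wide = parseSizeTemplate('16:9_2k') // aspect ratio plus resolution
const square = parseSizeTemplate('1:1') // aspect ratio only
```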
OpenRouter: imageInputs are injected as multimodal image_url content parts
alongside the prompt in the chat-completions message and forwarded to the
underlying image model.
Neither path fetches or base64-encodes URL sources in-process — URLs pass
through verbatim and are fetched by the provider; data sources become data
URIs. Bumps ai-grok and ai-openrouter to minor in the existing changeset.
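
The pass-through rule can be sketched as follows. The `MediaSource` shape (including `mimeType` and a base64 `value`) is an assumption made for this example and may not match the SDK's real part types.

```typescript
// Assumed source shape for illustration; the SDK's real types may differ.
type MediaSource =
  | { type: 'url'; value: string }
  | { type: 'data'; value: string; mimeType: string } // value is base64

function toWireUrl(source: MediaSource): string {
  if (source.type === 'url') {
    // Passed through verbatim: the provider fetches it, not this process.
    return source.value
  }
  // Inline data is wrapped as a data URI.
  return `data:${source.mimeType};base64,${source.value}`
}

const remote = toWireUrl({ type: 'url', value: 'https://example.com/cat.png' })
const inline = toWireUrl({ type: 'data', value: 'AAAA', mimeType: 'image/png' })
```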
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: adapt #618 branch to the packages/ restructure and post-rebase API drift
- Move the generated fal image-field map and the generator's paths from
packages/typescript/ai-fal to packages/ai-fal (repo flattened the layout)
- Add gpt-image-2 to EDIT_MAX_IMAGES (new model on main; same 16-image
edit limit as the other gpt-image models)
- Map edit-path usage through buildImagesUsage to match the new TokenUsage
shape, and drop two now-unnecessary type assertions
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(ai): make prompt multimodal for generateImage/generateVideo, pass text through verbatim
Replace the imageInputs / videoInputs / audioInputs fields with a multimodal
prompt: string | MediaPromptPart[]. Part order is meaningful — natively
multimodal providers (Gemini, OpenRouter) receive parts in interleaved order;
named-field providers (OpenAI, fal, xAI) extract media parts via the new
resolveMediaPrompt() utility and flatten the text.
Zero magic: prompt text is always sent verbatim. The SDK never injects or
rewrites in-prompt referencing markers — users write each provider's own
convention (fal Kling/Seedance @image1, OpenAI/FLUX.2 "image 1" prose, Gemini
content descriptions), now documented per provider in the media docs. An
earlier grok <IMAGE_n> auto-injection was removed after research showed the
convention is absent from xAI's official docs (images are addressed by
request order).
- Per-model compile-time prompt narrowing via TModelInputModalitiesByName
adapter generic (e.g. dall-e-3 / Imagen reject image parts as a type
error); fal modality maps are derived at the type level from the SDK's
endpoint input types
- metadata.tag added as an informational label (never read by adapters)
- Gemini now preserves true interleaving in contents; OpenRouter maps parts
1:1 onto chat content parts in order
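
A rough picture of how a per-model modality map can narrow the prompt type at compile time. All names here (`InputModalitiesByModel`, `PartFor`, `PromptFor`, `describePrompt`, the two model names, and the simplified part shapes) are hypothetical and stand in for the real adapter generics.

```typescript
// Hypothetical modality map and part types, for illustration only.
type TextPart = { type: 'text'; content: string }
type ImagePart = { type: 'image'; url: string }

interface InputModalitiesByModel {
  'text-only-model': 'text'
  'image-edit-model': 'text' | 'image'
}

// Select the part types a model accepts from its modality union.
type PartFor<M extends 'text' | 'image'> =
  | (M extends 'text' ? TextPart : never)
  | (M extends 'image' ? ImagePart : never)

type PromptFor<K extends keyof InputModalitiesByModel> =
  | string
  | Array<PartFor<InputModalitiesByModel[K]>>

function describePrompt<K extends keyof InputModalitiesByModel>(
  model: K,
  prompt: PromptFor<K>,
): string {
  const count = typeof prompt === 'string' ? 1 : prompt.length
  return `${model}: ${count} part(s)`
}

const summary = describePrompt('image-edit-model', [
  { type: 'text', content: 'Restyle this' },
  { type: 'image', url: 'https://example.com/in.png' },
])

// describePrompt('text-only-model', [{ type: 'image', url: '...' }])
// The line above would be rejected by the compiler: image parts are not
// in that model's modality map.
```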
Closes #618
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix: address PR review findings for image/video input support
- openai: add gpt-image-2 to the editImages error message and JSDoc
(the model is edit-capable via EDIT_MAX_IMAGES but was omitted from
user-facing guidance); same fix in docs, SKILL.md, and the changeset
- openai: throw when the images.edit() response contains no usable
images (matching grok's guard) instead of resolving to { images: [] }
- openai: drop the unnecessary input_reference cast in the Sora
adapter — the SDK types the field, so assign directly
- fal: reject metadata.role 'mask'/'control' in the video mapper
instead of silently folding them into source frames
- docs: mark Veo role mappings as planned (no Veo adapter yet), note
the Gemini ~14-image limit is provider-side, bump samples to
gpt-image-2
- tests: cover the Gemini image-conditioned path (interleaved
contents, fileData vs inlineData vs fetch+inline, Imagen/video/audio
rejection), the Sora input_reference upload and guards (new file),
the fal video createVideoJob field assembly and audio guard, and the
openai empty-edit-response guard
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(ai-openai): throw on empty generateImages responses too
Same defect class as the editImages guard in the previous commit: the
text-to-image path silently resolved to { images: [] } when response
items had neither b64_json nor url. Surface it as an error instead.
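
The guard amounts to something like the sketch below. `collectImages` and both interfaces are hypothetical names for illustration; only the `b64_json` / `url` fields come from the description above.

```typescript
// Minimal response item shape; fields mirror the b64_json / url pair.
interface ImageItem {
  b64_json?: string
  url?: string
}

interface GeneratedImage {
  b64Json?: string
  url?: string
}

function collectImages(items: ImageItem[]): GeneratedImage[] {
  const images: GeneratedImage[] = []
  for (const item of items) {
    if (item.b64_json) images.push({ b64Json: item.b64_json })
    else if (item.url) images.push({ url: item.url })
    // Items with neither field are skipped.
  }
  if (images.length === 0) {
    // Fail loudly instead of resolving to an empty image list.
    throw new Error('Image response contained no usable images')
  }
  return images
}
```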
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat: client-side multimodal prompts, e2e coverage, media example, fal field demotion
- ai-client: widen ImageGenerateInput.prompt / VideoGenerateInput.prompt
from string to MediaPrompt so useGenerateImage/useGenerateVideo can
carry image parts from the browser; re-export the MediaPrompt types
from @tanstack/ai/client
- ai-fal: demote media-conditioning fields (FalImageFieldName set plus
video_url/video_urls/reference_video_urls/audio_url) from required to
optional in FalImageProviderOptions / FalVideoProviderOptions — i2v
endpoints declare e.g. image_url as required, but with a multimodal
prompt the start frame arrives as a prompt part; modelOptions stays
available as the explicit escape hatch
- e2e: real coverage for image-to-image (OpenAI /v1/images/edits) and
image-to-video (Sora multipart /v1/videos with input_reference) — the
installed aimock 1.29 mocks both multipart endpoints, so the previous
"aimock can't mock this" empty provider sets were stale. New specs run
all three transports and assert via aimock's request journal that the
expected wire endpoint was hit. ImageGenUI/VideoGenUI gain a file
input, feature routing/fixtures/onVideo registration added, README
matrix updated
- examples/ts-react-media: ImageGenerator gains a multi-image reference
picker (Gemini native models); VideoGenerator sends the start frame as
a prompt part with role 'start_frame' instead of modelOptions URLs;
server functions narrow the wire prompt per model and throw on
unsupported part kinds instead of dropping them
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix: address CodeRabbit review findings
- fal image/video: spread modelOptions after derived media fields so
explicit user overrides win (matches documented intent)
- openai video: validate effective size (size ?? modelOptions.size)
- generate-fal-image-field-map: run arity check for default-selected
fields too
- ts-react-media example: correct reference-image support comment
(Gemini multimodal models, not NanoBanana)
- e2e VideoGenUI: reject on malformed data URL instead of resolving ''
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(ai,ai-gemini): add Google Veo video adapter on the typed-duration contract (#634)
Restacked on 618-image-to-image-and-image-to-video-support to adopt the
multimodal MediaPrompt format, carrying a minimal additive port of the
#534 typed-duration contract:
- @tanstack/ai (non-breaking): VideoAdapter/BaseVideoAdapter gain a
TModelDurationByName generic (default Record<string, number> preserves
existing duration?: number typing), DurationOptions, snapToDurationOption,
and default availableDurations()/snapDuration() implementations.
generateVideo's duration is typed via VideoDurationForAdapter.
- @tanstack/ai-gemini: GeminiVideoAdapter over generateVideos /
getVideosOperation with per-model typed durations (Veo 3.x 4|6|8,
Veo 2 5|6|8 per current Veo docs), MediaPrompt image routing
(start_frame → image, end_frame → lastFrame, reference/character →
referenceImages), RAI filter surfacing, geminiVideo/createGeminiVideo
factories, and finalized Veo model-meta entries.
- E2E: gemini added to video-gen with a custom aimock mount for
:predictLongRunning + operations polling; all transports pass.
- Docs + media-generation skill updated for Veo (typed durations,
image-to-video role table).
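
One plausible reading of "snapping" a requested duration to a model's allowed options is nearest-value selection, sketched below. `snapToNearest` is a hypothetical stand-in; the actual semantics of `snapToDurationOption` are not spelled out here, so treat the tie-breaking and rounding behavior as assumptions.

```typescript
// Hypothetical nearest-option snap; ties resolve to the earlier option.
function snapToNearest(requested: number, allowed: readonly number[]): number {
  if (allowed.length === 0) throw new Error('No allowed durations')
  return allowed.reduce((best, option) =>
    Math.abs(option - requested) < Math.abs(best - requested) ? option : best,
  )
}

// Veo 3.x durations as listed above.
const veo3Durations = [4, 6, 8] as const

const snapped = snapToNearest(7.5, veo3Durations) // closest option is 8
const clamped = snapToNearest(1, veo3Durations) // below the range, so 4
```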
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
`generateImage()` and `generateVideo()` now accept a multimodal `prompt`: a plain string, or an ordered array of content parts (`TextPart` / `ImagePart` / `VideoPart` / `AudioPart`) for image-conditioned generation, image-to-image, multi-reference, image-to-video, and edit / inpaint flows. Part order is meaningful (for example, "not like this _(image)_, more like this _(image)_"), and each media part may carry a `metadata.role` hint (`'reference' | 'mask' | 'control' | 'start_frame' | 'end_frame' | 'character'`) that adapters use to route to the provider-specific field, plus an informational `metadata.tag` label for your own bookkeeping. The accepted part types are narrowed per model at compile time via each adapter's input-modality map, so passing an image part to a text-only model is a type error (with a clear runtime throw as backstop).

Prompt text is always sent **verbatim**: the SDK never injects or rewrites in-prompt referencing markers. To reference inputs from your prompt, write the provider's own convention (fal Kling / Seedance `@Image1`, OpenAI / FLUX.2 `"image 1"` prose, Gemini content descriptions); see the image-generation docs for the per-provider table.

Provider behavior in this release:

- **OpenAI image**: Prompts with image parts route `gpt-image-2` / `gpt-image-1` / `gpt-image-1-mini` to `images.edit()` (up to 16 source images plus optional mask); `dall-e-2` routes to `images.edit()` with one source image; `dall-e-3` rejects image parts at compile time and at runtime.
- **OpenAI video**: Sora-2 / Sora-2-Pro accept a single image part as `input_reference`; passing more than one throws.
- **fal.ai**: Field names resolve per endpoint from a map generated from the fal SDK's endpoint types (362 endpoints with nonstandard fields, e.g. nano-banana edit → `image_urls`, Kling i2v start frame → `image_url`, Veo first-last-frame → `first_frame_url` / `last_frame_url`). Defaults for endpoints not in the map: single → `image_url`, multiple → `image_urls`; `role: 'mask'` → `mask_url`; `role: 'control'` → `control_image_url`; `role: 'reference'` / `'character'` → `reference_image_urls`; video `role: 'start_frame'` / `'end_frame'` → `start_image_url` / `end_image_url`. Per-model prompt modalities are derived at the type level from the SDK's endpoint input types. Regenerate the map after a fal SDK bump with `pnpm generate:fal-image-fields` (a unit test fails when it goes stale). In `FalImageProviderOptions` / `FalVideoProviderOptions`, media-conditioning fields the mappers can populate (`image_url`, `start_image_url`, `video_url`, `audio_url`, …) are demoted from required to optional: supply them as prompt parts, or keep passing them explicitly via `modelOptions`.
- **Grok**: New `grok-imagine-image` / `grok-imagine-image-quality` models. Prompts with image parts route to xAI's JSON `/v1/images/edits` endpoint (up to 3 source images, addressed by xAI in request order; the prompt is sent verbatim). `role: 'mask'` / `'control'` throw. Their `size` uses an `aspectRatio_resolution` template (`'16:9_2k'`, suffix optional) mirroring Gemini's native image models. `grok-2-image-1212` remains text-to-image only.
- **OpenRouter**: Prompt parts map 1:1 onto multimodal `text` / `image_url` chat content parts, preserving interleaved order, and are forwarded to the underlying image model. URL sources pass through verbatim (no fetching or re-encoding in your process); `data` sources become data URIs.
- **Anthropic**: Unchanged (no image generation API).
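
The fal.ai default mapping for endpoints outside the generated map can be sketched roughly as below. `defaultFalImageFields`, `ImageInput`, and the simplified `url` field are hypothetical. Counting only inputs without a role toward `image_url` / `image_urls` is one interpretation of the rule, and the sketch leaves out the collision handling and the video-only roles.

```typescript
// Hypothetical sketch of the default role-to-field mapping for images.
type Role = 'reference' | 'mask' | 'control' | 'character'

interface ImageInput {
  url: string
  role?: Role
}

function defaultFalImageFields(
  inputs: ImageInput[],
): Record<string, string | string[]> {
  const out: Record<string, string | string[]> = {}
  const plain: string[] = []
  const references: string[] = []

  for (const input of inputs) {
    switch (input.role) {
      case 'mask':
        out['mask_url'] = input.url
        break
      case 'control':
        out['control_image_url'] = input.url
        break
      case 'reference':
      case 'character':
        references.push(input.url)
        break
      default:
        // Assumption: inputs without a role are the source images.
        plain.push(input.url)
    }
  }

  const [first] = plain
  if (plain.length === 1 && first !== undefined) out['image_url'] = first
  else if (plain.length > 1) out['image_urls'] = plain
  if (references.length > 0) out['reference_image_urls'] = references
  return out
}

const falFields = defaultFalImageFields([
  { url: 'https://example.com/source.png' },
  { url: 'https://example.com/mask.png', role: 'mask' },
  { url: 'https://example.com/style.png', role: 'reference' },
])
```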
A new `resolveMediaPrompt()` utility (exported from `@tanstack/ai`) gives adapter authors a single conversion point from the canonical interleaved prompt shape to flattened text plus per-modality part buckets.

On the client side, `ImageGenerateInput.prompt` and `VideoGenerateInput.prompt` (`@tanstack/ai-client`, and the `useGenerateImage` / `useGenerateVideo` hooks built on them) are widened from `string` to the same `MediaPrompt` shape, so prompt parts can be sent from the browser through your server route to `generateImage()` / `generateVideo()`.
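
To make the "flattened text plus per-modality buckets" idea concrete, here is a minimal sketch of that kind of conversion. It is not `resolveMediaPrompt()` itself: `flattenPrompt`, the simplified `Part` shape, the `FlattenedPrompt` result, and joining text parts with a single space are all assumptions for illustration.

```typescript
// Hypothetical, simplified part shapes (the real parts carry more fields).
type Part =
  | { type: 'text'; content: string }
  | { type: 'image' | 'video' | 'audio'; url: string }

interface FlattenedPrompt {
  text: string
  images: string[]
  videos: string[]
  audios: string[]
}

function flattenPrompt(prompt: string | Part[]): FlattenedPrompt {
  const result: FlattenedPrompt = { text: '', images: [], videos: [], audios: [] }
  if (typeof prompt === 'string') return { ...result, text: prompt }

  const texts: string[] = []
  for (const part of prompt) {
    if (part.type === 'text') texts.push(part.content)
    else if (part.type === 'image') result.images.push(part.url)
    else if (part.type === 'video') result.videos.push(part.url)
    else result.audios.push(part.url)
  }
  // Assumption: text parts are joined with a single space, unmodified.
  result.text = texts.join(' ')
  return result
}

const flattened = flattenPrompt([
  { type: 'text', content: 'Not like this' },
  { type: 'image', url: 'https://example.com/a.png' },
  { type: 'text', content: 'more like this' },
  { type: 'image', url: 'https://example.com/b.png' },
])
```

Named-field providers would consume the buckets, while natively multimodal providers would keep the original interleaved array instead.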
docs/media/image-generation.md (165 additions, 2 deletions)
@@ -22,7 +22,7 @@ TanStack AI provides support for image generation through dedicated image adapte

 Image generation is handled by image adapters that follow the same tree-shakeable architecture as other adapters in TanStack AI. The image adapters support:

-- **OpenAI**: DALL-E 2, DALL-E 3, GPT-Image-1, and GPT-Image-1-Mini models
+- **OpenAI**: DALL-E 2, DALL-E 3, GPT-Image-1, GPT-Image-1-Mini, and GPT-Image-2 models
 - **Gemini**: Gemini native image models (NanoBanana) and Imagen 3/4 models
 - **fal.ai**: 600+ models including Nano Banana Pro, FLUX, and more

@@ -76,7 +76,7 @@ All image adapters support these common options:
 | Option | Type | Description |
 |--------|------|-------------|
 | `adapter` | `ImageAdapter` | Image adapter instance with model (required) |
-| `prompt` | `string` | Text description of the image to generate (required) |
+| `prompt` | `string \| MediaPromptPart[]` | Description of the image to generate (required). A plain string, or, on models that support image-conditioned generation, an ordered array of content parts interleaving text with image inputs. See [Image-Conditioned Generation](#image-conditioned-generation) below. |
 | `numberOfImages` | `number` | Number of images to generate |
 | `size` | `string` | Size of the generated image in WIDTHxHEIGHT format |
 | `modelOptions?` | `object` | Model-specific options (renamed from `providerOptions`) |
@@ -130,6 +130,169 @@ const result = await generateImage({
 })
 ```

+## Image-Conditioned Generation
+
+For image-to-image, reference-guided, multi-reference, and edit / inpaint
+flows, pass the `prompt` as an ordered array of content parts, using the same
+`TextPart` / `ImagePart` shapes used elsewhere for multimodal content:
+
+```typescript
+import { generateImage } from '@tanstack/ai'
+import { openaiImage } from '@tanstack/ai-openai'
+
+await generateImage({
+  adapter: openaiImage('gpt-image-2'),
+  prompt: [
+    { type: 'text', content: 'Turn this into a cinematic product photo' },
+| **OpenAI** | `gpt-image-2` / `gpt-image-1` / `gpt-image-1-mini` → routes to `images.edit()`, up to 16 source images plus optional mask.<br>`dall-e-2` → `images.edit()` with 1 source image only.<br>`dall-e-3` → throws (no edit support). |
+| **Gemini** | Native models (`gemini-*-flash-image`, "nano-banana", etc.) → prompt parts map 1:1 onto multimodal `contents`, preserving interleaved order. Up to ~14 input images (provider limit, not enforced by the SDK).<br>Imagen models → throws (text-to-image only). |
+| **fal.ai** | Field names resolve per endpoint from a map generated from the fal SDK's endpoint types (e.g. nano-banana edit gets `image_urls`, Fooocus masks get `mask_image_url`). Defaults for unknown endpoints: 1 input → `image_url`; multiple → `image_urls`; `role: 'mask'` → `mask_url`; `role: 'control'` → `control_image_url`; `role: 'reference'` / `'character'` → `reference_image_urls`. Override with `modelOptions` for endpoint-specific fields. |
+| **Grok** | grok-imagine models → xAI's `/v1/images/edits` (up to 3 source images, addressed by xAI in request order; prompt sent verbatim). `role: 'mask'` / `'control'` throw (no Imagine API equivalent). `grok-2-image-1212` throws (text-to-image only). |
+| **OpenRouter** | Prompt parts map 1:1 onto multimodal `image_url` / `text` content parts, preserving interleaved order, and are forwarded to the underlying image model. |
+| **Anthropic** | n/a (no image generation API). |
+
+Adapters that don't support image-conditioned generation throw a clear
+runtime error so calls fail fast rather than silently dropping the inputs.