
Commit 8fa6cc5

tombeckenham, claude, and autofix-ci[bot] authored
feat: multimodal prompt for generateImage/generateVideo (image-to-image, image-to-video) (#624)
* feat(ai): add imageInputs / videoInputs / audioInputs for image-conditioned generation (closes #618)

  Adds optional `imageInputs`, `videoInputs`, and `audioInputs` to `generateImage()` and `generateVideo()` for image-to-image, multi-reference, mask / inpaint, image-to-video, and starting-frame flows. Each input part may carry a `metadata.role` hint (`'reference' | 'mask' | 'control' | 'start_frame' | 'end_frame' | 'character'`) that adapters use to route to the provider-specific field.

  Provider behavior:
  - OpenAI image: gpt-image-1 / -mini route to `images.edit()` (up to 16 + mask); dall-e-2 routes to `images.edit()` with one source; dall-e-3 throws.
  - OpenAI video: Sora-2 / -pro accept a single `input_reference`; throws on >1.
  - Gemini: native models receive inputs as multimodal `contents` parts; Imagen throws (text-only).
  - fal: 1 input → `image_url`, >1 → `image_urls`; metadata roles map to `mask_url` / `control_image_url` / `reference_image_urls`; video adds `start_image_url` / `end_image_url`. Interim mapping until the fal schemas library lands.
  - Grok, OpenRouter: throw with a link back to #618 (pending native Imagine API rewrite and multimodal injection work respectively).

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: apply automated fixes

* feat(ai-fal): resolve image-input fields per endpoint from generated SDK type map

  Replace the fal image-input field heuristic with a per-endpoint mapping generated from @fal-ai/client's EndpointTypeMap (scripts/generate-fal-image-field-map.ts, run via pnpm generate:fal-image-fields). The committed artifact stores only the 362 endpoints whose field names deviate from the defaults (e.g. nano-banana edit -> image_urls, Kling i2v start frame -> image_url, Veo first-last-frame -> first_frame_url / last_frame_url, Fooocus masks -> mask_image_url); the old heuristic remains the fallback for endpoints newer than the installed SDK.

  Safety rails: the generated file `satisfies`-checks every field name against the SDK endpoint types (type-only, erased at runtime), and a unit test hashes the installed endpoints.d.ts against the recorded hash so an SDK bump without regeneration fails test:lib with the regen command.

  Mappers are now typed: both return FalImageInputFields<TModel>, Pick'ed from the endpoint's real input type via a generated field-name union. Roles resolving to the same list field merge (source + reference on nano-banana); colliding scalar fields throw instead of overwriting.

  Also fixes the remaining CI lint failures: duplicate @tanstack/ai import and non-null assertion in ai-fal video.ts, switch-exhaustiveness errors in image-inputs.ts (restructured away), and the non-null assertion in ai-openai image.ts.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(ai-grok,ai-openrouter): support imageInputs for image-conditioned generation

  Grok: add the xAI Imagine API image models (grok-imagine-image, grok-imagine-image-quality) to model-meta. With imageInputs they route to xAI's JSON POST /v1/images/edits endpoint via direct fetch (the OpenAI SDK's images.edit() sends multipart/form-data, which xAI rejects) — a single input as image:{url}, 2-3 inputs as images:[...] referenceable in the prompt as <IMAGE_0>/<IMAGE_1>; >3 inputs and mask/control roles throw. Their generic `size` uses an aspectRatio_resolution template ('16:9_2k', suffix optional), mirroring Gemini's native image models, and maps to the Imagine aspect_ratio/resolution parameters on both the generate and edit paths. grok-2-image-1212 stays text-to-image only with a clear error.

  OpenRouter: imageInputs are injected as multimodal image_url content parts alongside the prompt in the chat-completions message and forwarded to the underlying image model.

  Neither path fetches or base64-encodes URL sources in-process — URLs pass through verbatim and are fetched by the provider; data sources become data URIs.

  Bumps ai-grok and ai-openrouter to minor in the existing changeset.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: adapt #618 branch to the packages/ restructure and post-rebase API drift

  - Move the generated fal image-field map and the generator's paths from packages/typescript/ai-fal to packages/ai-fal (repo flattened the layout)
  - Add gpt-image-2 to EDIT_MAX_IMAGES (new model on main; same 16-image edit limit as the other gpt-image models)
  - Map edit-path usage through buildImagesUsage to match the new TokenUsage shape, and drop two now-unnecessary type assertions

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(ai): make prompt multimodal for generateImage/generateVideo, pass text through verbatim

  Replace the imageInputs / videoInputs / audioInputs fields with a multimodal prompt: string | MediaPromptPart[]. Part order is meaningful — natively multimodal providers (Gemini, OpenRouter) receive parts in interleaved order; named-field providers (OpenAI, fal, xAI) extract media parts via the new resolveMediaPrompt() utility and flatten the text.

  Zero magic: prompt text is always sent verbatim. The SDK never injects or rewrites in-prompt referencing markers — users write each provider's own convention (fal Kling/Seedance @image1, OpenAI/FLUX.2 "image 1" prose, Gemini content descriptions), now documented per provider in the media docs. An earlier grok <IMAGE_n> auto-injection was removed after research showed the convention is absent from xAI's official docs (images are addressed by request order).

  - Per-model compile-time prompt narrowing via TModelInputModalitiesByName adapter generic (e.g. dall-e-3 / Imagen reject image parts as a type error); fal modality maps are derived at the type level from the SDK's endpoint input types
  - metadata.tag added as an informational label (never read by adapters)
  - Gemini now preserves true interleaving in contents; OpenRouter maps parts 1:1 onto chat content parts in order

  Closes #618

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix: address PR review findings for image/video input support

  - openai: add gpt-image-2 to the editImages error message and JSDoc (the model is edit-capable via EDIT_MAX_IMAGES but was omitted from user-facing guidance); same fix in docs, SKILL.md, and the changeset
  - openai: throw when the images.edit() response contains no usable images (matching grok's guard) instead of resolving to { images: [] }
  - openai: drop the unnecessary input_reference cast in the Sora adapter — the SDK types the field, so assign directly
  - fal: reject metadata.role 'mask'/'control' in the video mapper instead of silently folding them into source frames
  - docs: mark Veo role mappings as planned (no Veo adapter yet), note the Gemini ~14-image limit is provider-side, bump samples to gpt-image-2
  - tests: cover the Gemini image-conditioned path (interleaved contents, fileData vs inlineData vs fetch+inline, Imagen/video/audio rejection), the Sora input_reference upload and guards (new file), the fal video createVideoJob field assembly and audio guard, and the openai empty-edit-response guard

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(ai-openai): throw on empty generateImages responses too

  Same defect class as the editImages guard in the previous commit: the text-to-image path silently resolved to { images: [] } when response items had neither b64_json nor url. Surface it as an error instead.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat: client-side multimodal prompts, e2e coverage, media example, fal field demotion

  - ai-client: widen ImageGenerateInput.prompt / VideoGenerateInput.prompt from string to MediaPrompt so useGenerateImage/useGenerateVideo can carry image parts from the browser; re-export the MediaPrompt types from @tanstack/ai/client
  - ai-fal: demote media-conditioning fields (FalImageFieldName set plus video_url/video_urls/reference_video_urls/audio_url) from required to optional in FalImageProviderOptions / FalVideoProviderOptions — i2v endpoints declare e.g. image_url as required, but with a multimodal prompt the start frame arrives as a prompt part; modelOptions stays available as the explicit escape hatch
  - e2e: real coverage for image-to-image (OpenAI /v1/images/edits) and image-to-video (Sora multipart /v1/videos with input_reference) — the installed aimock 1.29 mocks both multipart endpoints, so the previous "aimock can't mock this" empty provider sets were stale. New specs run all three transports and assert via aimock's request journal that the expected wire endpoint was hit. ImageGenUI/VideoGenUI gain a file input, feature routing/fixtures/onVideo registration added, README matrix updated
  - examples/ts-react-media: ImageGenerator gains a multi-image reference picker (Gemini native models); VideoGenerator sends the start frame as a prompt part with role 'start_frame' instead of modelOptions URLs; server functions narrow the wire prompt per model and throw on unsupported part kinds instead of dropping them

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix: address CodeRabbit review findings

  - fal image/video: spread modelOptions after derived media fields so explicit user overrides win (matches documented intent)
  - openai video: validate effective size (size ?? modelOptions.size)
  - generate-fal-image-field-map: run arity check for default-selected fields too
  - ts-react-media example: correct reference-image support comment (Gemini multimodal models, not NanoBanana)
  - e2e VideoGenUI: reject on malformed data URL instead of resolving ''

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(ai,ai-gemini): add Google Veo video adapter on the typed-duration contract (#634)

  Restacked on 618-image-to-image-and-image-to-video-support to adopt the multimodal MediaPrompt format, carrying a minimal additive port of the #534 typed-duration contract:

  - @tanstack/ai (non-breaking): VideoAdapter/BaseVideoAdapter gain a TModelDurationByName generic (default Record<string, number> preserves existing duration?: number typing), DurationOptions, snapToDurationOption, and default availableDurations()/snapDuration() implementations. generateVideo's duration is typed via VideoDurationForAdapter.
  - @tanstack/ai-gemini: GeminiVideoAdapter over generateVideos / getVideosOperation with per-model typed durations (Veo 3.x 4|6|8, Veo 2 5|6|8 per current Veo docs), MediaPrompt image routing (start_frame → image, end_frame → lastFrame, reference/character → referenceImages), RAI filter surfacing, geminiVideo/createGeminiVideo factories, and finalized Veo model-meta entries.
  - E2E: gemini added to video-gen with a custom aimock mount for :predictLongRunning + operations polling; all transports pass.
  - Docs + media-generation skill updated for Veo (typed durations, image-to-video role table).

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
1 parent eadabbc commit 8fa6cc5

76 files changed

Lines changed: 6407 additions & 184 deletions
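
The OpenRouter path described in the commit message (prompt parts mapped 1:1 onto chat content parts, URLs passed through, `data` sources turned into data URIs) can be sketched roughly as below. This is a minimal sketch, not the adapter's code: `PromptPart`, `ChatContentPart`, `toImageUrl`, and `toChatContent` are hypothetical local names, and the content-part shape is assumed to be the OpenAI-compatible chat-completions format.

```typescript
// Hypothetical local types; the real prompt part types live in @tanstack/ai.
type Source =
  | { type: 'url'; value: string }
  | { type: 'data'; value: string; mimeType: string }

type PromptPart =
  | { type: 'text'; content: string }
  | { type: 'image'; source: Source }

// Assumed wire shape: OpenAI-compatible chat-completions content parts.
type ChatContentPart =
  | { type: 'text'; text: string }
  | { type: 'image_url'; image_url: { url: string } }

// URLs pass through untouched; inline base64 becomes a data URI.
function toImageUrl(source: Source): string {
  return source.type === 'url'
    ? source.value
    : `data:${source.mimeType};base64,${source.value}`
}

// Map prompt parts 1:1 onto content parts, preserving their order.
function toChatContent(prompt: string | PromptPart[]): ChatContentPart[] {
  if (typeof prompt === 'string') return [{ type: 'text', text: prompt }]
  return prompt.map(
    (part): ChatContentPart =>
      part.type === 'text'
        ? { type: 'text', text: part.content }
        : { type: 'image_url', image_url: { url: toImageUrl(part.source) } },
  )
}
```

Because nothing is fetched or re-encoded, a URL source has to be reachable by the provider, which is the same constraint the docs state for xAI.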

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
+---
+'@tanstack/ai': minor
+'@tanstack/ai-gemini': minor
+---
+
+Add a Google Veo video adapter (`geminiVideo` / `createGeminiVideo`) and the
+per-model typed-duration video contract it is built on (#534, #634).
+
+**`@tanstack/ai`** (additive, non-breaking): `VideoAdapter` /
+`BaseVideoAdapter` gain a `TModelDurationByName` generic (defaulting to
+`Record<string, number>`, preserving today's `duration?: number` typing for
+adapters without a map) plus two introspection methods with safe defaults:
+
+- `availableDurations()` — a `DurationOptions` tagged union
+  (`discrete | range | mixed | none`) describing the durations the current
+  model accepts. Default: `{ kind: 'none' }`.
+- `snapDuration(seconds)` — coerce raw seconds to the closest valid duration
+  (`snapToDurationOption` is exported for adapter authors). Default:
+  `undefined`.
+
+`generateVideo({ duration })` is now typed per model via
+`VideoDurationForAdapter<TAdapter>`.
+
+**`@tanstack/ai-gemini`**: new Veo adapter over the long-running
+`:predictLongRunning` operation, supporting `veo-3.1-generate-preview`,
+`veo-3.1-fast-generate-preview`, `veo-3.0-generate-001`,
+`veo-3.0-fast-generate-001`, and `veo-2.0-generate-001`:
+
+- `geminiVideo('veo-3.0-generate-001')` → `duration?: 4 | 6 | 8`
+  (Veo 2: `5 | 6 | 8`); `adapter.snapDuration(7)` → `6`.
+- Multimodal prompts: the first un-roled / `'start_frame'` image part
+  becomes the input image, `'end_frame'` → `lastFrame`, `'reference'` /
+  `'character'` → `referenceImages`.
+- `size` takes Veo aspect ratios (`'16:9' | '9:16'`); everything else from
+  the SDK's `GenerateVideosConfig` (e.g. `resolution`, `generateAudio`,
+  `negativePrompt`) is available through `modelOptions`.
+- Responsible-AI filtering is surfaced as a failed job with the filter
+  reasons.
+
+Note: Veo result URLs are served by the Gemini Files API and require the
+Google API key to download (`x-goog-api-key` header or `key` query
+parameter).
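
The "closest valid duration" behaviour of `snapDuration` for a discrete set can be illustrated with a small stand-alone helper. This is a sketch under assumptions, not the SDK's `snapToDurationOption`: `snapToNearest` is a hypothetical name, and sending ties to the shorter duration is an assumption chosen only because it reproduces the `snapDuration(7)` → `6` example above.

```typescript
// Hypothetical helper, not the SDK's snapToDurationOption implementation.
// Picks the allowed duration closest to `seconds`; ties go to the shorter
// duration (assumption, chosen to match the snapDuration(7) -> 6 example).
function snapToNearest(
  allowed: readonly number[],
  seconds: number,
): number | undefined {
  let best: number | undefined
  for (const candidate of allowed) {
    if (best === undefined) {
      best = candidate
      continue
    }
    const distance = Math.abs(candidate - seconds)
    const bestDistance = Math.abs(best - seconds)
    if (distance < bestDistance || (distance === bestDistance && candidate < best)) {
      best = candidate
    }
  }
  // An empty list has nothing to snap to.
  return best
}

// Duration sets quoted in the changeset.
const VEO_3_DURATIONS = [4, 6, 8] as const
const VEO_2_DURATIONS = [5, 6, 8] as const
```

A caller could use a helper like this to pre-validate user input (for example a slider value) before passing `duration` to `generateVideo`.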
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+---
+'@tanstack/ai': minor
+'@tanstack/ai-openai': minor
+'@tanstack/ai-gemini': minor
+'@tanstack/ai-fal': minor
+'@tanstack/ai-grok': minor
+'@tanstack/ai-openrouter': minor
+'@tanstack/ai-client': minor
+'@tanstack/ai-event-client': patch
+---
+
+`generateImage()` and `generateVideo()` now accept a multimodal `prompt`: a plain string, or an ordered array of content parts (`TextPart` / `ImagePart` / `VideoPart` / `AudioPart`) for image-conditioned generation, image-to-image, multi-reference, image-to-video, and edit / inpaint flows. Part order is meaningful — "not like this _(image)_, more like this _(image)_" — and each media part may carry a `metadata.role` hint (`'reference' | 'mask' | 'control' | 'start_frame' | 'end_frame' | 'character'`) that adapters use to route to the provider-specific field, plus an informational `metadata.tag` label for your own bookkeeping. The accepted part types are narrowed per model at compile time via each adapter's input-modality map, so passing an image part to a text-only model is a type error (with a clear runtime throw as backstop).
+
+Prompt text is always sent **verbatim** — the SDK never injects or rewrites in-prompt referencing markers. To reference inputs from your prompt, write the provider's own convention (fal Kling / Seedance `@Image1`, OpenAI / FLUX.2 `"image 1"` prose, Gemini content descriptions); see the image-generation docs for the per-provider table.
+
+Provider behavior in this release:
+
+- **OpenAI image** — Prompts with image parts route `gpt-image-2` / `gpt-image-1` / `gpt-image-1-mini` to `images.edit()` (up to 16 source images plus optional mask); `dall-e-2` routes to `images.edit()` with one source image; `dall-e-3` rejects image parts at compile time and at runtime.
+- **OpenAI video** — Sora-2 / Sora-2-Pro accept a single image part as `input_reference`; passing more than one throws.
+- **Gemini image** — Native models (`gemini-*-flash-image`, "nano-banana") map prompt parts 1:1 onto multimodal `contents`, preserving interleaved order. Imagen is text-only (compile-time + runtime rejection).
+- **fal.ai** — Field names resolve per endpoint from a map generated from the fal SDK's endpoint types (362 endpoints with nonstandard fields, e.g. nano-banana edit → `image_urls`, Kling i2v start frame → `image_url`, Veo first-last-frame → `first_frame_url` / `last_frame_url`). Defaults for endpoints not in the map: single → `image_url`, multiple → `image_urls`; `role: 'mask'` → `mask_url`; `role: 'control'` → `control_image_url`; `role: 'reference'` / `'character'` → `reference_image_urls`; video `role: 'start_frame'` / `'end_frame'` → `start_image_url` / `end_image_url`. Per-model prompt modalities are derived at the type level from the SDK's endpoint input types. Regenerate the map after a fal SDK bump with `pnpm generate:fal-image-fields` (a unit test fails when it goes stale). In `FalImageProviderOptions` / `FalVideoProviderOptions`, media-conditioning fields the mappers can populate (`image_url`, `start_image_url`, `video_url`, `audio_url`, …) are demoted from required to optional — supply them as prompt parts, or keep passing them explicitly via `modelOptions`.
+- **Grok** — New `grok-imagine-image` / `grok-imagine-image-quality` models. Prompts with image parts route to xAI's JSON `/v1/images/edits` endpoint (up to 3 source images, addressed by xAI in request order; the prompt is sent verbatim). `role: 'mask'` / `'control'` throw. Their `size` uses an `aspectRatio_resolution` template (`'16:9_2k'`, suffix optional) mirroring Gemini's native image models. `grok-2-image-1212` remains text-to-image only.
+- **OpenRouter** — Prompt parts map 1:1 onto multimodal `text` / `image_url` chat content parts, preserving interleaved order, and are forwarded to the underlying image model. URL sources pass through verbatim (no fetching or re-encoding in your process); `data` sources become data URIs.
+- **Anthropic** — Unchanged (no image generation API).
+
+A new `resolveMediaPrompt()` utility (exported from `@tanstack/ai`) is the single downrev point from the canonical interleaved prompt shape to flattened text + per-modality part buckets, for adapter authors.
+
+On the client side, `ImageGenerateInput.prompt` and `VideoGenerateInput.prompt` (`@tanstack/ai-client`, and the `useGenerateImage` / `useGenerateVideo` hooks built on them) are widened from `string` to the same `MediaPrompt` shape, so prompt parts can be sent from the browser through your server route to `generateImage()` / `generateVideo()`.
+
+Closes #618.
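
The step `resolveMediaPrompt()` is described as performing (interleaved prompt in, flattened text plus per-modality buckets out) might look roughly like the sketch below. The real signature and return shape are not shown in this changeset, so `flattenPrompt`, `ResolvedPrompt`, and the local part types are hypothetical, and joining text parts with a blank line is an assumption based on the docs' "paragraph separated" wording.

```typescript
// Hypothetical local types; the real ones are exported by @tanstack/ai.
type Source =
  | { type: 'url'; value: string }
  | { type: 'data'; value: string; mimeType: string }

interface TextPart {
  type: 'text'
  content: string
}

interface MediaPart {
  type: 'image' | 'video' | 'audio'
  source: Source
  metadata?: { role?: string; tag?: string }
}

type PromptPart = TextPart | MediaPart

// Assumed output shape: flattened text plus one bucket per modality.
interface ResolvedPrompt {
  text: string
  images: MediaPart[]
  videos: MediaPart[]
  audio: MediaPart[]
}

function flattenPrompt(prompt: string | PromptPart[]): ResolvedPrompt {
  const resolved: ResolvedPrompt = { text: '', images: [], videos: [], audio: [] }
  if (typeof prompt === 'string') return { ...resolved, text: prompt }

  const texts: string[] = []
  for (const part of prompt) {
    if (part.type === 'text') texts.push(part.content)
    else if (part.type === 'image') resolved.images.push(part)
    else if (part.type === 'video') resolved.videos.push(part)
    else resolved.audio.push(part)
  }
  // Text is joined verbatim; the blank-line separator is an assumption.
  resolved.text = texts.join('\n\n')
  return resolved
}
```

Within each bucket the original part order is kept, which matters for providers that address images by request order.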

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -78,3 +78,4 @@ solo.yml
 # Agent scratch output (gap-analysis reports, triage notes — generated locally)
 .agent/gap-analysis/
 .agent/triage/
+.agent/research/

.prettierignore

Lines changed: 1 addition & 0 deletions
@@ -5,6 +5,7 @@
 **/coverage
 **/dist
 **/docs
+packages/ai-fal/src/image/generated/
 pnpm-lock.yaml
 
 .angular

docs/adapters/grok.md

Lines changed: 45 additions & 0 deletions
@@ -160,6 +160,51 @@ const result = await generateImage({
 console.log(result.images);
 ```
 
+The grok-imagine models (`grok-imagine-image`, `grok-imagine-image-quality`)
+are aspect-ratio sized — `size` takes an `aspectRatio_resolution` template
+like `"16:9_2k"` (the `_2k` suffix is optional):
+
+```typescript
+const result = await generateImage({
+  adapter: grokImage("grok-imagine-image"),
+  prompt: "A futuristic cityscape at sunset",
+  size: "16:9_2k",
+});
+```
+
+### Image Editing (image-to-image)
+
+The grok-imagine models accept image prompt parts for image-conditioned
+generation via xAI's `/v1/images/edits` endpoint — up to 3 source images,
+addressed by xAI in the order they appear in the prompt. Per xAI's docs
+there is no in-prompt referencing syntax; write the prompt naturally and
+your text is sent verbatim:
+
+```typescript
+const result = await generateImage({
+  adapter: grokImage("grok-imagine-image"),
+  prompt: [
+    {
+      type: "text",
+      content: "Render the product in the style of the second image",
+    },
+    {
+      type: "image",
+      source: { type: "url", value: "https://example.com/product.png" },
+    },
+    {
+      type: "image",
+      source: { type: "url", value: "https://example.com/style.png" },
+    },
+  ],
+});
+```
+
+URL sources are fetched by xAI's servers, so they must be publicly
+reachable; use a `data` source for private images. `grok-2-image-1212` is
+text-to-image only — image prompt parts are a compile-time type error and
+throw at runtime.
+
 ## Text-to-Speech
 
 Generate speech with Grok TTS:
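
The `aspectRatio_resolution` size template from the grok-imagine docs above could be split into the Imagine `aspect_ratio` / `resolution` parameters along these lines. This is only a sketch: `parseImagineSize` and `ImagineSize` are hypothetical names, the accepted pattern (digits, colon, digits, optional `_suffix`) is an assumption, and the set of resolutions xAI actually accepts is not validated here.

```typescript
// Hypothetical result shape using the parameter names from the commit message.
interface ImagineSize {
  aspect_ratio: string
  resolution?: string
}

// Split "16:9_2k" into { aspect_ratio: "16:9", resolution: "2k" }.
// The "_resolution" suffix is optional; anything else is rejected.
function parseImagineSize(size: string): ImagineSize {
  const match = /^(\d+:\d+)(?:_(\w+))?$/.exec(size)
  if (!match) {
    throw new Error(`Unsupported size template: ${size}`)
  }
  const aspectRatio = match[1] ?? ''
  const resolution = match[2]
  return resolution === undefined
    ? { aspect_ratio: aspectRatio }
    : { aspect_ratio: aspectRatio, resolution }
}
```

Rejecting anything that does not match the template (for example a `WIDTHxHEIGHT` string) keeps the failure close to the call site instead of surfacing as a provider error.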

docs/media/image-generation.md

Lines changed: 165 additions & 2 deletions
@@ -22,7 +22,7 @@ TanStack AI provides support for image generation through dedicated image adapte
 
 Image generation is handled by image adapters that follow the same tree-shakeable architecture as other adapters in TanStack AI. The image adapters support:
 
-- **OpenAI**: DALL-E 2, DALL-E 3, GPT-Image-1, and GPT-Image-1-Mini models
+- **OpenAI**: DALL-E 2, DALL-E 3, GPT-Image-1, GPT-Image-1-Mini, and GPT-Image-2 models
 - **Gemini**: Gemini native image models (NanoBanana) and Imagen 3/4 models
 - **fal.ai**: 600+ models including Nano Banana Pro, FLUX, and more
 
@@ -76,7 +76,7 @@ All image adapters support these common options:
 | Option | Type | Description |
 |--------|------|-------------|
 | `adapter` | `ImageAdapter` | Image adapter instance with model (required) |
-| `prompt` | `string` | Text description of the image to generate (required) |
+| `prompt` | `string \| MediaPromptPart[]` | Description of the image to generate (required). A plain string, or — on models that support image-conditioned generation — an ordered array of content parts interleaving text with image inputs. See [Image-Conditioned Generation](#image-conditioned-generation) below. |
 | `numberOfImages` | `number` | Number of images to generate |
 | `size` | `string` | Size of the generated image in WIDTHxHEIGHT format |
 | `modelOptions?` | `object` | Model-specific options (renamed from `providerOptions`) |
@@ -130,6 +130,169 @@ const result = await generateImage({
 })
 ```
 
+## Image-Conditioned Generation
+
+For image-to-image, reference-guided, multi-reference, and edit / inpaint
+flows, pass the `prompt` as an ordered array of content parts — the same
+`TextPart` / `ImagePart` shapes used elsewhere for multimodal content:
+
+```typescript
+import { generateImage } from '@tanstack/ai'
+import { openaiImage } from '@tanstack/ai-openai'
+
+await generateImage({
+  adapter: openaiImage('gpt-image-2'),
+  prompt: [
+    { type: 'text', content: 'Turn this into a cinematic product photo' },
+    {
+      type: 'image',
+      source: { type: 'url', value: 'https://example.com/product.png' },
+    },
+  ],
+})
+```
+
+Part order is meaningful. Providers with natively multimodal prompts
+(Gemini image models, OpenRouter) receive the parts exactly as written, so
+text can refer to its neighbouring images:
+
+```typescript
+await generateImage({
+  adapter: geminiImage('gemini-3.1-flash-image-preview'),
+  prompt: [
+    { type: 'text', content: 'Not like this' },
+    { type: 'image', source: { type: 'url', value: badExampleUrl } },
+    { type: 'text', content: 'more like this' },
+    { type: 'image', source: { type: 'url', value: goodExampleUrl } },
+  ],
+})
+```
+
+Providers with named request fields (OpenAI, fal, xAI) extract the image
+parts and flatten the text (text parts are joined verbatim, paragraph
+separated).
+
+The accepted part types are narrowed **per model at compile time**: passing
+an image part to a text-only model (e.g. `dall-e-3`, Imagen) is a type
+error, not just a runtime throw.
+
+### Referencing images from your prompt
+
+**Your prompt text is always sent verbatim — the SDK never injects or
+rewrites referencing markers.** When you want the text to refer to specific
+input images, write the provider's own convention yourself:
+
+| Provider | Convention | Example |
+| -------- | ---------- | ------- |
+| **OpenAI** (gpt-image) | Indexed prose, per OpenAI's prompting guide | `"apply the style of image 2 to image 1"` |
+| **FLUX.2 on fal / BFL** | Indexed prose (BFL's docs parse `image N`) | `"subject from image 1, style from image 2"` |
+| **Gemini** (native image models) | Describe the reference by content/role | `"using the attached fabric sample as the texture"` |
+| **fal Kling / Seedance endpoints** | `@`-tags, 1-indexed by input order | `"Put @Image1 in the style of @Image2"` |
+| **xAI grok-imagine** | No in-prompt syntax — images addressed in request order | `"render the product in the style of the second image"` |
+
+To keep track of which part you meant by "image 2" or `@Image2`, you can
+label parts with the informational `metadata.tag` field — the SDK ignores
+it, but it keeps your code self-documenting:
+
+```typescript
+prompt: [
+  { type: 'text', content: 'Put @Image1 in the style of @Image2' },
+  { type: 'image', source: { type: 'url', value: productUrl },
+    metadata: { tag: 'product' } },
+  { type: 'image', source: { type: 'url', value: styleUrl },
+    metadata: { tag: 'style' } },
+]
+```
+
+### Source format
+
+`ImagePart.source` is a discriminated union supporting both URLs and inline
+base64 data — pass whichever you have:
+
+```typescript
+// URL source
+{ type: 'image', source: { type: 'url', value: 'https://example.com/img.png' } }
+
+// Inline base64 data (mimeType required)
+{ type: 'image', source: { type: 'data', value: base64String, mimeType: 'image/png' } }
+```
+
+OpenAI's edit endpoint requires file uploads; the adapter fetches URL sources
+and converts base64 to a `File` automatically.
+
+### Role hints via `metadata.role`
+
+When a generation has multiple inputs with different roles (mask vs reference
+vs start/end frame), set `metadata.role` on each part. Adapters route by role
+to the provider-specific field; parts without a role fall back to positional
+mapping.
+
+| Role            | Maps to                                                                                |
+| --------------- | -------------------------------------------------------------------------------------- |
+| `'reference'`   | fal `reference_image_urls`; Gemini multimodal part; positional fallback                |
+| `'character'`   | Same as `'reference'`; Veo `referenceImages` slot (planned — no Veo adapter yet)       |
+| `'mask'`        | OpenAI `mask` (gpt-image-2, gpt-image-1, dall-e-2); fal `mask_url`                     |
+| `'control'`     | fal `control_image_url` (ControlNet / depth / pose conditioning)                       |
+| `'start_frame'` | fal `start_image_url`; Veo `image` (planned) (used by `generateVideo`)                 |
+| `'end_frame'`   | fal `end_image_url`; Veo `lastFrame` (planned) (used by `generateVideo`)               |
+
+#### Inpaint / edit with a mask
+
+```typescript
+await generateImage({
+  adapter: openaiImage('gpt-image-2'),
+  prompt: [
+    { type: 'text', content: 'Replace the masked region with a tree' },
+    {
+      type: 'image',
+      source: { type: 'url', value: photoUrl },
+    },
+    {
+      type: 'image',
+      source: { type: 'url', value: maskUrl },
+      metadata: { role: 'mask' },
+    },
+  ],
+})
+```
+
+#### Multi-reference composition
+
+```typescript
+await generateImage({
+  adapter: geminiImage('gemini-3.1-flash-image-preview'),
+  prompt: [
+    {
+      type: 'text',
+      content:
+        'Generate a new image of the product using the style of the second reference',
+    },
+    {
+      type: 'image',
+      source: { type: 'url', value: 'https://example.com/product.png' },
+    },
+    {
+      type: 'image',
+      source: { type: 'url', value: 'https://example.com/style.png' },
+    },
+  ],
+})
+```
+
+### Provider support
+
+| Provider     | Behavior                                                                                                  |
+| ------------ | --------------------------------------------------------------------------------------------------------- |
+| **OpenAI**   | `gpt-image-2` / `gpt-image-1` / `gpt-image-1-mini` → routes to `images.edit()`, up to 16 source images plus optional mask.<br>`dall-e-2` → `images.edit()` with 1 source image only.<br>`dall-e-3` → throws (no edit support). |
+| **Gemini**   | Native models (`gemini-*-flash-image`, "nano-banana", etc.) → prompt parts map 1:1 onto multimodal `contents`, preserving interleaved order. Up to ~14 input images (provider limit, not enforced by the SDK).<br>Imagen models → throws (text-to-image only). |
+| **fal.ai**   | Field names resolve per endpoint from a map generated from the fal SDK's endpoint types (e.g. nano-banana edit gets `image_urls`, Fooocus masks get `mask_image_url`). Defaults for unknown endpoints: 1 input → `image_url`; multiple → `image_urls`; `role: 'mask'` → `mask_url`; `role: 'control'` → `control_image_url`; `role: 'reference'` / `'character'` → `reference_image_urls`. Override with `modelOptions` for endpoint-specific fields. |
+| **Grok**     | grok-imagine models → xAI's `/v1/images/edits` (up to 3 source images, addressed by xAI in request order; prompt sent verbatim). `role: 'mask'` / `'control'` throw (no Imagine API equivalent). `grok-2-image-1212` throws (text-to-image only). |
+| **OpenRouter** | Prompt parts map 1:1 onto multimodal `image_url` / `text` content parts, preserving interleaved order, and are forwarded to the underlying image model. |
+| **Anthropic** | n/a — no image generation API. |
+
+Adapters that don't support image-conditioned generation throw a clear
+runtime error so calls fail fast rather than silently dropping the inputs.
+
 ## Model Options
 
 ### OpenAI Model Options
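
The default fal mapping listed in the provider table (the fallback for endpoints that are not in the generated map) could be sketched as follows. This is an illustration under assumptions, not the adapter's mapper: `defaultFalImageFields`, `ImageInput`, and `setScalar` are hypothetical names, inputs are reduced to plain URLs, and the video roles (`'start_frame'` / `'end_frame'`) are left out.

```typescript
// Image-side roles only; the video roles are omitted from this sketch.
type ImageRole = 'reference' | 'character' | 'mask' | 'control'

// Hypothetical simplified input: a URL plus an optional role hint.
interface ImageInput {
  url: string
  role?: ImageRole
}

type FalFields = Record<string, string | string[]>

// Scalar fields hold one value; a second input for the same field throws
// rather than overwriting the first.
function setScalar(fields: FalFields, key: string, url: string): void {
  if (key in fields) {
    throw new Error(`More than one input resolves to "${key}"`)
  }
  fields[key] = url
}

function defaultFalImageFields(inputs: ImageInput[]): FalFields {
  const fields: FalFields = {}
  const unroled: string[] = []
  const references: string[] = []

  for (const input of inputs) {
    switch (input.role) {
      case 'mask':
        setScalar(fields, 'mask_url', input.url)
        break
      case 'control':
        setScalar(fields, 'control_image_url', input.url)
        break
      case 'reference':
      case 'character':
        references.push(input.url)
        break
      default:
        unroled.push(input.url)
    }
  }

  if (references.length > 0) fields['reference_image_urls'] = references
  // Un-roled inputs: one goes to image_url, several go to image_urls.
  const [first] = unroled
  if (unroled.length === 1 && first !== undefined) fields['image_url'] = first
  else if (unroled.length > 1) fields['image_urls'] = unroled
  return fields
}
```

Per the docs, anything placed in `modelOptions` is spread after these derived fields, so an explicit endpoint-specific field still wins over the defaults.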
