Commit 1cf70d0

tombeckenham and claude committed
fix(ai-grok): grok-imagine-video-1.5 is image-to-video only
Confirmed against the live xAI API: grok-imagine-video-1.5 rejects text-to-video ("Text-to-video is not supported for this model") and only generates from a starting frame. createVideoJob now requires exactly one image prompt part and throws a clear error otherwise; model-meta, provider options, docs, changeset, and the media skill describe it as image-to-video only. The ts-react-media example drops the (non-working) text-to-video entry and keeps the image-to-video one.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent cd06312 commit 1cf70d0

10 files changed

Lines changed: 97 additions & 92 deletions


.changeset/grok-imagine-video-adapter.md

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -2,4 +2,4 @@
22
'@tanstack/ai-grok': minor
33
---
44

5-
Add a `grokVideo` adapter for the `grok-imagine-video-1.5` model via xAI's Imagine API. Follows the experimental `generateVideo()` jobs/polling architecture: `createVideoJob` posts to `/v1/videos/generations`, status polling reads `/v1/videos/{request_id}`, and the completed result carries the hosted video URL plus usage (`unitsBilled` seconds and exact `cost` in USD). Sizing uses the aspect-ratio template consistent with the grok-imagine image models (`size: '16:9_720p'` → `aspect_ratio` / `resolution`), durations are 1–15 integer seconds, and image-to-video starting frames are supplied as an `image` prompt part (public URL or base64 data source), consistent with the multimodal prompt convention used by the other image/video adapters.
5+
Add a `grokVideo` adapter for the `grok-imagine-video-1.5` model via xAI's Imagine API. The model is image-to-video only: every request needs exactly one `image` prompt part as the starting frame (public URL or base64 data source), with the text part describing the motion. Follows the experimental `generateVideo()` jobs/polling architecture: `createVideoJob` posts to `/v1/videos/generations`, status polling reads `/v1/videos/{request_id}`, and the completed result carries the hosted video URL plus usage (`unitsBilled` seconds and exact `cost` in USD). Sizing uses the aspect-ratio template consistent with the grok-imagine image models (`size: '16:9_720p'` → `aspect_ratio` / `resolution`), and durations are 1–15 integer seconds.
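The size template described in this changeset could be split along these lines. This is a minimal sketch, not the adapter's actual code: `parseSizeTemplate` and `ParsedSize` are hypothetical names, and the accepted aspect ratios and resolutions are copied from the lists in `docs/adapters/grok.md` in this commit.

```typescript
// Hypothetical helper for illustration; the real adapter may parse sizes differently.
type ParsedSize = { aspect_ratio: string; resolution?: string };

// Values taken from the documentation changed in this commit.
const ASPECT_RATIOS = ['1:1', '16:9', '9:16', '4:3', '3:4', '3:2', '2:3'];
const RESOLUTIONS = ['480p', '720p', '1080p'];

function parseSizeTemplate(size: string): ParsedSize {
  // "16:9_720p" -> ["16:9", "720p"]; "9:16" -> ["9:16"]
  const [aspectRatio = '', resolution, ...rest] = size.split('_');
  if (rest.length > 0 || !ASPECT_RATIOS.includes(aspectRatio)) {
    throw new Error(`Unsupported size template: ${size}`);
  }
  // The resolution suffix is optional.
  if (resolution === undefined) {
    return { aspect_ratio: aspectRatio };
  }
  if (!RESOLUTIONS.includes(resolution)) {
    throw new Error(`Unsupported resolution in size template: ${size}`);
  }
  return { aspect_ratio: aspectRatio, resolution };
}
```

Under these assumptions, `parseSizeTemplate('16:9_720p')` yields the `aspect_ratio` and `resolution` fields that the request-shape test in this commit expects in the JSON body.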

docs/adapters/grok.md

Lines changed: 19 additions & 27 deletions
@@ -209,20 +209,34 @@ throw at runtime.
209209

210210
## Video Generation (Experimental)
211211

212-
Generate short video clips (1–15 seconds, with audio) with the Grok Imagine video model via xAI's asynchronous jobs/polling API:
212+
Generate short video clips (1–15 seconds, with audio) with the Grok Imagine video model via xAI's asynchronous jobs/polling API.
213+
214+
`grok-imagine-video-1.5` is **image-to-video only**: every request must include an
215+
`image` prompt part as the starting frame, with the text part describing the
216+
desired motion. URL sources are fetched by xAI's servers (so they must be
217+
publicly reachable); use a `data` source for a base64 starting frame:
213218

214219
```typescript
215220
import { generateVideo, getVideoJobStatus } from "@tanstack/ai";
216221
import { grokVideo } from "@tanstack/ai-grok";
217222

218223
const adapter = grokVideo("grok-imagine-video-1.5");
219224

220-
// 1. Create the job
225+
// 1. Create the job — the prompt carries the starting frame plus motion text
221226
const { jobId } = await generateVideo({
222227
adapter,
223-
prompt: "A red panda balancing on a bamboo stalk in the rain",
228+
prompt: [
229+
{
230+
type: "text",
231+
content: "Make the waterfall crash down and slowly pan out the camera",
232+
},
233+
{
234+
type: "image",
235+
source: { type: "url", value: "https://example.com/waterfall-still.png" },
236+
},
237+
],
224238
size: "16:9_720p", // "aspectRatio" or "aspectRatio_resolution"
225-
duration: 5, // integer seconds, 1–15
239+
duration: 10, // integer seconds, 1–15
226240
});
227241

228242
// 2. Poll until complete, then read the video URL
@@ -237,32 +251,10 @@ console.log(status.url); // hosted .mp4 URL
237251

238252
Available model:
239253

240-
- `grok-imagine-video-1.5`: text-to-video and image-to-video, $0.08 per second of video. Per xAI's docs a starting image is optional for text-to-video and required for image-to-video.
254+
- `grok-imagine-video-1.5` — image-to-video, $0.08 per second of video.
241255

242256
Like the Grok Imagine image models, sizing is aspect-ratio based: the `size` option takes an `aspectRatio_resolution` template. Supported aspect ratios are `1:1`, `16:9`, `9:16`, `4:3`, `3:4`, `3:2`, and `2:3`; supported resolutions are `480p`, `720p`, and `1080p` (e.g. `"9:16_1080p"`). The resolution suffix is optional.
243257

244-
For image-to-video, include an `image` prompt part as the starting frame and
245-
describe the desired motion in the text part. URL sources are fetched by xAI's
246-
servers (so they must be publicly reachable); use a `data` source for a base64
247-
starting frame:
248-
249-
```typescript
250-
const { jobId } = await generateVideo({
251-
adapter: grokVideo("grok-imagine-video-1.5"),
252-
prompt: [
253-
{
254-
type: "text",
255-
content: "Make the waterfall crash down and slowly pan out the camera",
256-
},
257-
{
258-
type: "image",
259-
source: { type: "url", value: "https://example.com/waterfall-still.png" },
260-
},
261-
],
262-
duration: 10,
263-
});
264-
```
265-
266258
When the job completes, the adapter reports usage on the result: `usage.unitsBilled` carries the billed seconds of video and `usage.cost` the exact cost in USD, both as returned by the xAI API.
267259

268260
See [Video Generation](../media/video-generation) for the full jobs/polling flow, streaming mode, and the `useGenerateVideo` hook.
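Because the polling half of the documented example is elided in this hunk, here is a hedged sketch of what a polling loop might look like. `pollUntilDone` and the `JobStatus` shape are hypothetical; only the `url` field is confirmed by the example above, and the status strings are assumptions. The status lookup is injected as a function so the sketch does not presume the signature of `getVideoJobStatus`.

```typescript
// Hypothetical status shape; the real type exported by @tanstack/ai may differ.
type JobStatus = {
  status: 'pending' | 'processing' | 'completed' | 'failed';
  url?: string;
  error?: string;
};

async function pollUntilDone(
  fetchStatus: () => Promise<JobStatus>,
  { intervalMs = 5000, maxAttempts = 120 } = {},
): Promise<string> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const job = await fetchStatus();
    if (job.status === 'completed') {
      if (!job.url) throw new Error('Completed job carried no video URL');
      return job.url;
    }
    if (job.status === 'failed') {
      throw new Error(job.error ?? 'Video job failed');
    }
    // Wait before asking again so the API is not hammered.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Video job did not finish after ${maxAttempts} attempts`);
}
```

A caller would pass a closure that wraps whatever status call the library provides; the attempt cap keeps a stuck job from polling forever.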

docs/media/video-generation.md

Lines changed: 8 additions & 18 deletions
@@ -558,15 +558,21 @@ Adapters that haven't declared a per-model duration map keep the plain
558558
559559
### Grok (xAI Imagine) Model Options
560560

561-
Based on the [xAI video generation API](https://docs.x.ai/docs/guides/video-generations). The Grok Imagine models are aspect-ratio sized — the generic `size` option takes an `aspectRatio_resolution` template (like the Grok Imagine image models), and clips can be 1–15 seconds long:
561+
Based on the [xAI video generation API](https://docs.x.ai/docs/guides/video-generations). `grok-imagine-video-1.5` is **image-to-video only**: every request must include an `image` prompt part as the starting frame, with the text part describing the desired motion. URL sources are fetched by xAI's servers (so they must be publicly reachable); use a `data` source for a base64 starting frame. The model is aspect-ratio sized — the generic `size` option takes an `aspectRatio_resolution` template (like the Grok Imagine image models), and clips can be 1–15 seconds long:
562562

563563
```typescript
564564
import { generateVideo } from '@tanstack/ai'
565565
import { grokVideo } from '@tanstack/ai-grok'
566566

567567
const { jobId } = await generateVideo({
568568
adapter: grokVideo('grok-imagine-video-1.5'),
569-
prompt: 'A beautiful sunset over the ocean',
569+
prompt: [
570+
{ type: 'text', content: 'Slowly pan out as the waves roll in' },
571+
{
572+
type: 'image',
573+
source: { type: 'url', value: 'https://example.com/still.png' },
574+
},
575+
],
570576
size: '16:9_720p', // aspect ratio: '1:1' | '16:9' | '9:16' | '4:3' | '3:4' | '3:2' | '2:3'
571577
// resolution (optional suffix): '480p' | '720p' | '1080p'
572578
duration: 5, // integer seconds, 1-15
@@ -578,22 +584,6 @@ const { jobId } = await generateVideo({
578584
})
579585
```
580586

581-
For image-to-video, include an `image` prompt part as the starting frame and describe the desired motion in the text part. URL sources are fetched by xAI's servers (so they must be publicly reachable); use a `data` source for a base64 starting frame:
582-
583-
```typescript
584-
const { jobId } = await generateVideo({
585-
adapter: grokVideo('grok-imagine-video-1.5'),
586-
prompt: [
587-
{ type: 'text', content: 'Slowly pan out as the waves roll in' },
588-
{
589-
type: 'image',
590-
source: { type: 'url', value: 'https://example.com/still.png' },
591-
},
592-
],
593-
duration: 5,
594-
})
595-
```
596-
597587
Generated clips include an audio track. When the job completes, the adapter reports `usage.unitsBilled` (billed seconds of video) and `usage.cost` (exact USD cost as returned by the API) on the result.
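For budgeting before a job is submitted, the cost can be approximated from the requested duration. The sketch below assumes the $0.08 per second list price quoted in `docs/adapters/grok.md` in this commit; `estimateCostUsd` is a hypothetical helper, and the authoritative figure remains `usage.cost` as returned by the API.

```typescript
// List price quoted in docs/adapters/grok.md; treat it as an assumption that may change.
const USD_PER_SECOND = 0.08;

function estimateCostUsd(durationSeconds: number): number {
  // The documented duration range is 1-15 integer seconds.
  if (!Number.isInteger(durationSeconds) || durationSeconds < 1 || durationSeconds > 15) {
    throw new RangeError('duration must be an integer between 1 and 15 seconds');
  }
  // Round to whole cents to avoid floating-point drift.
  return Math.round(durationSeconds * USD_PER_SECOND * 100) / 100;
}
```

This is only a pre-flight estimate; billing should be reconciled against `usage.unitsBilled` and `usage.cost` on the completed result.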
598588

599589
## Response Types

examples/ts-react-media/src/lib/models.ts

Lines changed: 2 additions & 8 deletions
@@ -132,17 +132,11 @@ export const VIDEO_MODELS = [
132132
mode: 'image-to-video' as const,
133133
provider: 'fal' as const,
134134
},
135-
{
136-
id: 'grok-imagine-video-1.5',
137-
name: 'Grok Imagine Video 1.5 (Text-to-Video)',
138-
description: 'xAI Imagine API via the native grokVideo adapter',
139-
mode: 'text-to-video' as const,
140-
provider: 'xai' as const,
141-
},
142135
{
143136
id: 'grok-imagine-video-1.5/image-to-video',
144137
name: 'Grok Imagine Video 1.5 (Image-to-Video)',
145-
description: 'Animate a starting frame via the native grokVideo adapter',
138+
description:
139+
'Animate a starting frame via the native grokVideo adapter (image-to-video only)',
146140
mode: 'image-to-video' as const,
147141
provider: 'xai' as const,
148142
},

examples/ts-react-media/src/lib/server-functions.ts

Lines changed: 1 addition & 16 deletions
@@ -73,10 +73,7 @@ function asImageToVideoPrompt(
7373
* (XAI_API_KEY); everything else is a fal-hosted model.
7474
*/
7575
function videoAdapterForModel(model: string) {
76-
if (
77-
model === 'grok-imagine-video-1.5' ||
78-
model === 'grok-imagine-video-1.5/image-to-video'
79-
) {
76+
if (model === 'grok-imagine-video-1.5/image-to-video') {
8077
return grokVideo('grok-imagine-video-1.5')
8178
}
8279
return falVideo(model)
@@ -249,18 +246,6 @@ export const createVideoJobFn = createServerFn({ method: 'POST' })
249246
},
250247
})
251248
}
252-
case 'grok-imagine-video-1.5': {
253-
// Direct xAI Imagine API (XAI_API_KEY) — no fal in between. Sizing is
254-
// an "aspectRatio_resolution" template; durations are 1-15 integer
255-
// seconds. Completed jobs report usage.unitsBilled (billed seconds)
256-
// and usage.cost (exact USD).
257-
return generateVideo({
258-
adapter: grokVideo('grok-imagine-video-1.5'),
259-
prompt: asTextPrompt(data.prompt),
260-
size: '16:9_720p',
261-
duration: 5,
262-
})
263-
}
264249
case 'fal-ai/ltx-2.3/text-to-video/fast': {
265250
return generateVideo({
266251
adapter: falVideo('fal-ai/ltx-2.3/text-to-video/fast'),

packages/ai-grok/src/adapters/video.ts

Lines changed: 22 additions & 7 deletions
@@ -86,10 +86,13 @@ function buildGrokVideoUsage(
8686
/**
8787
* Grok Video Generation Adapter (xAI Imagine API)
8888
*
89-
* Tree-shakeable adapter for the grok-imagine video models using the
89+
* Tree-shakeable adapter for the grok-imagine-video-1.5 model using the
9090
* async jobs/polling architecture: create a generation request, poll it,
9191
* then read the completed video URL.
9292
*
93+
* The model is image-to-video only: every request needs exactly one image
94+
* prompt part (the starting frame) plus text describing the desired motion.
95+
*
9396
* The Imagine video endpoints are not part of the OpenAI SDK surface (and
9497
* xAI rejects the SDK's multipart paths), so requests are plain JSON calls
9598
* issued with the configured `fetch` (or the global one).
@@ -174,8 +177,8 @@ export class GrokVideoAdapter<
174177
const duration = options.duration ?? modelOptions?.duration
175178

176179
// The interleaved prompt decomposes into verbatim text plus typed media
177-
// buckets. The Imagine video endpoint takes a text prompt and an optional
178-
// starting frame; reject the modalities it can't consume.
180+
// buckets. grok-imagine-video-1.5 is image-to-video only: it needs exactly
181+
// one starting-frame image plus the text prompt describing the motion.
179182
const resolved = resolveMediaPrompt(options.prompt)
180183
if (resolved.videos.length > 0) {
181184
throw new Error(
@@ -187,9 +190,14 @@ export class GrokVideoAdapter<
187190
`${this.name}.createVideoJob does not support audio prompt parts (model: ${model}).`,
188191
)
189192
}
193+
if (resolved.images.length === 0) {
194+
throw new Error(
195+
`${this.name}: ${model} is image-to-video only — include exactly one image prompt part as the starting frame.`,
196+
)
197+
}
190198
if (resolved.images.length > 1) {
191199
throw new Error(
192-
`${this.name}: grok-imagine video accepts at most one starting-frame image; received ${resolved.images.length}.`,
200+
`${this.name}: ${model} accepts at most one starting-frame image; received ${resolved.images.length}.`,
193201
)
194202
}
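The two image-count checks in this hunk amount to an "exactly one starting frame" rule. A standalone sketch of that rule, detached from the adapter, is shown below. `requireSingleStartingFrame` is a hypothetical name, and images are modelled as plain strings rather than the adapter's resolved prompt parts.

```typescript
// Hypothetical standalone version of the guard; not the adapter's actual code.
function requireSingleStartingFrame(images: readonly string[], model: string): string {
  const [frame, ...extra] = images;
  if (frame === undefined) {
    // Text-only prompts are rejected before any network call.
    throw new Error(
      `${model} is image-to-video only: include exactly one image prompt part as the starting frame.`,
    );
  }
  if (extra.length > 0) {
    throw new Error(
      `${model} accepts at most one starting-frame image; received ${images.length}.`,
    );
  }
  return frame;
}
```

Failing early in this way mirrors the new test in this commit, which asserts that a text-only prompt throws and that `fetch` is never called.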
195203

@@ -352,11 +360,15 @@ export class GrokVideoAdapter<
352360
*
353361
* @example
354362
* ```typescript
363+
* // Image-to-video only: include the starting frame as an image prompt part.
355364
* const adapter = createGrokVideo('grok-imagine-video-1.5', 'xai-...');
356365
*
357366
* const { jobId } = await generateVideo({
358367
* adapter,
359-
* prompt: 'A beautiful sunset over the ocean',
368+
* prompt: [
369+
* { type: 'text', content: 'Slowly pan out as the waves roll in' },
370+
* { type: 'image', source: { type: 'url', value: 'https://example.com/still.png' } },
371+
* ],
360372
* size: '16:9_720p',
361373
* duration: 5
362374
* });
@@ -390,10 +402,13 @@ export function createGrokVideo<TModel extends GrokVideoModel>(
390402
* // Automatically uses XAI_API_KEY from environment
391403
* const adapter = grokVideo('grok-imagine-video-1.5');
392404
*
393-
* // Create a video generation job
405+
* // Image-to-video only: the prompt must carry a starting-frame image part.
394406
* const { jobId } = await generateVideo({
395407
* adapter,
396-
* prompt: 'A cat playing piano'
408+
* prompt: [
409+
* { type: 'text', content: 'Make the cat start playing the piano' },
410+
* { type: 'image', source: { type: 'url', value: 'https://example.com/cat.png' } },
411+
* ],
397412
* });
398413
*
399414
* // Poll for status

packages/ai-grok/src/model-meta.ts

Lines changed: 3 additions & 3 deletions
@@ -253,9 +253,9 @@ const GROK_IMAGINE_IMAGE_QUALITY = {
253253
} as const satisfies ModelMeta
254254

255255
// Imagine API video model. Pricing is per second of generated video
256-
// (output only); generated videos carry an audio track. Per xAI's docs the
257-
// model does text-to-video (a starting image is optional) and image-to-video
258-
// (a starting image is required).
256+
// (output only); generated videos carry an audio track. The model is
257+
// image-to-video only: a starting-frame image is required (the text prompt
258+
// describes the desired motion).
259259
const GROK_IMAGINE_VIDEO_1_5 = {
260260
name: 'grok-imagine-video-1.5',
261261
supports: {

packages/ai-grok/src/video/video-provider-options.ts

Lines changed: 2 additions & 3 deletions
@@ -170,9 +170,8 @@ export type GrokVideoModelSizeByName = {
170170

171171
/**
172172
* Type-only map from model name to the non-text prompt modalities it accepts.
173-
* grok-imagine-video-1.5 supports image-to-video: an `image` prompt part
174-
* supplies the starting frame (optional for text-to-video, required for
175-
* image-to-video).
173+
* grok-imagine-video-1.5 is image-to-video only: an `image` prompt part
174+
* supplies the required starting frame.
176175
*
177176
* @experimental Video generation is an experimental feature and may change.
178177
*/

packages/ai-grok/tests/video-adapter.test.ts

Lines changed: 37 additions & 7 deletions
@@ -45,6 +45,21 @@ function adapterWithFetch(
4545
})
4646
}
4747

48+
/**
49+
* grok-imagine-video-1.5 is image-to-video only, so every request needs a
50+
* starting-frame image part. This builds a text + image prompt for the
51+
* request-shape / status / error tests.
52+
*/
53+
function i2vPrompt(text = 'p') {
54+
return [
55+
{ type: 'text' as const, content: text },
56+
{
57+
type: 'image' as const,
58+
source: { type: 'url' as const, value: 'https://example.com/start.png' },
59+
},
60+
]
61+
}
62+
4863
describe('Grok Video Adapter', () => {
4964
describe('factories', () => {
5065
it('creates an adapter with the provided API key', () => {
@@ -73,7 +88,7 @@ describe('Grok Video Adapter', () => {
7388

7489
const result = await adapter.createVideoJob({
7590
model: 'grok-imagine-video-1.5',
76-
prompt: 'A red ball bouncing once',
91+
prompt: i2vPrompt('A red ball bouncing once'),
7792
size: '16:9_720p',
7893
duration: 5,
7994
logger: testLogger,
@@ -94,6 +109,7 @@ describe('Grok Video Adapter', () => {
94109
expect(JSON.parse(String(init?.body))).toEqual({
95110
model: 'grok-imagine-video-1.5',
96111
prompt: 'A red ball bouncing once',
112+
image: { url: 'https://example.com/start.png' },
97113
aspect_ratio: '16:9',
98114
resolution: '720p',
99115
duration: 5,
@@ -106,7 +122,7 @@ describe('Grok Video Adapter', () => {
106122

107123
await adapter.createVideoJob({
108124
model: 'grok-imagine-video-1.5',
109-
prompt: 'p',
125+
prompt: i2vPrompt(),
110126
size: '9:16',
111127
logger: testLogger,
112128
})
@@ -123,7 +139,7 @@ describe('Grok Video Adapter', () => {
123139

124140
await adapter.createVideoJob({
125141
model: 'grok-imagine-video-1.5',
126-
prompt: 'make the waterfall crash down',
142+
prompt: i2vPrompt('make the waterfall crash down'),
127143
modelOptions: {
128144
resolution: '1080p',
129145
duration: 10,
@@ -225,13 +241,27 @@ describe('Grok Video Adapter', () => {
225241
expect(fetchMock).not.toHaveBeenCalled()
226242
})
227243

244+
it('rejects a text-only prompt — the model is image-to-video only', async () => {
245+
const fetchMock = mockFetch(() => jsonResponse({ request_id: 'r' }))
246+
const adapter = adapterWithFetch(fetchMock)
247+
248+
await expect(
249+
adapter.createVideoJob({
250+
model: 'grok-imagine-video-1.5',
251+
prompt: 'a red ball bouncing once',
252+
logger: testLogger,
253+
}),
254+
).rejects.toThrow(/image-to-video only/)
255+
expect(fetchMock).not.toHaveBeenCalled()
256+
})
257+
228258
it('lets modelOptions win over the generic size template', async () => {
229259
const fetchMock = mockFetch(() => jsonResponse({ request_id: 'r' }))
230260
const adapter = adapterWithFetch(fetchMock)
231261

232262
await adapter.createVideoJob({
233263
model: 'grok-imagine-video-1.5',
234-
prompt: 'p',
264+
prompt: i2vPrompt(),
235265
size: '16:9_480p',
236266
modelOptions: { resolution: '1080p' },
237267
logger: testLogger,
@@ -305,7 +335,7 @@ describe('Grok Video Adapter', () => {
305335
await expect(
306336
adapter.createVideoJob({
307337
model: 'grok-imagine-video-1.5',
308-
prompt: 'p',
338+
prompt: i2vPrompt(),
309339
logger: testLogger,
310340
}),
311341
).rejects.toThrow(
@@ -320,7 +350,7 @@ describe('Grok Video Adapter', () => {
320350
await expect(
321351
adapter.createVideoJob({
322352
model: 'grok-imagine-video-1.5',
323-
prompt: 'p',
353+
prompt: i2vPrompt(),
324354
logger: testLogger,
325355
}),
326356
).rejects.toThrow(/no request_id/)
@@ -335,7 +365,7 @@ describe('Grok Video Adapter', () => {
335365

336366
await adapter.createVideoJob({
337367
model: 'grok-imagine-video-1.5',
338-
prompt: 'p',
368+
prompt: i2vPrompt(),
339369
logger: testLogger,
340370
})
341371
