fix(ai-grok): grok-imagine-video-1.5 is image-to-video only

tombeckenham · claude · tombeckenham · commit 1cf70d0098fb · 2026-06-23T21:17:05.000+10:00
Confirmed against the live xAI API: grok-imagine-video-1.5 rejects
text-to-video ("Text-to-video is not supported for this model") and only
generates from a starting frame. createVideoJob now requires exactly one
image prompt part and throws a clear error otherwise; model-meta, provider
options, docs, changeset, and the media skill describe it as image-to-video
only. The ts-react-media example drops the (non-working) text-to-video entry
and keeps the image-to-video one.
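
For reviewers, the "exactly one image prompt part" rule can be sketched in
isolation. This is a simplified illustration, not the adapter code: the part
types and the `requireStartingFrame` helper are hypothetical names, and the
real adapter goes through `resolveMediaPrompt` and also rejects video and
audio parts.

```typescript
// Hypothetical, simplified prompt-part shapes (assumed for illustration only).
type TextPart = { type: 'text'; content: string }
type ImagePart = {
  type: 'image'
  source: { type: 'url' | 'data'; value: string }
}
type PromptPart = TextPart | ImagePart
type Prompt = string | Array<PromptPart>

// Returns the single starting frame, or throws when the prompt carries
// no image part or more than one.
function requireStartingFrame(prompt: Prompt, model: string): ImagePart {
  // A plain string prompt is text-only, so it has no image parts.
  const parts: Array<PromptPart> = typeof prompt === 'string' ? [] : prompt
  const images = parts.filter(
    (part): part is ImagePart => part.type === 'image',
  )
  const [first, ...rest] = images
  if (first === undefined) {
    throw new Error(
      `${model} is image-to-video only: include exactly one image prompt part as the starting frame.`,
    )
  }
  if (rest.length > 0) {
    throw new Error(
      `${model} accepts at most one starting-frame image; received ${images.length}.`,
    )
  }
  return first
}
```

Validating before any network call means a text-only prompt fails locally
with a readable message instead of surfacing the xAI API's rejection.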

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
diff --git a/.changeset/grok-imagine-video-adapter.md b/.changeset/grok-imagine-video-adapter.md
@@ -2,4 +2,4 @@
 '@tanstack/ai-grok': minor
 ---
 
-Add a `grokVideo` adapter for the `grok-imagine-video-1.5` model via xAI's Imagine API. Follows the experimental `generateVideo()` jobs/polling architecture: `createVideoJob` posts to `/v1/videos/generations`, status polling reads `/v1/videos/{request_id}`, and the completed result carries the hosted video URL plus usage (`unitsBilled` seconds and exact `cost` in USD). Sizing uses the aspect-ratio template consistent with the grok-imagine image models (`size: '16:9_720p'` → `aspect_ratio` / `resolution`), durations are 1–15 integer seconds, and image-to-video starting frames are supplied as an `image` prompt part (public URL or base64 data source), consistent with the multimodal prompt convention used by the other image/video adapters.
+Add a `grokVideo` adapter for the `grok-imagine-video-1.5` model via xAI's Imagine API. The model is image-to-video only: every request needs exactly one `image` prompt part as the starting frame (public URL or base64 data source), with the text part describing the motion. Follows the experimental `generateVideo()` jobs/polling architecture: `createVideoJob` posts to `/v1/videos/generations`, status polling reads `/v1/videos/{request_id}`, and the completed result carries the hosted video URL plus usage (`unitsBilled` seconds and exact `cost` in USD). Sizing uses the aspect-ratio template consistent with the grok-imagine image models (`size: '16:9_720p'` → `aspect_ratio` / `resolution`), and durations are 1–15 integer seconds.
diff --git a/docs/adapters/grok.md b/docs/adapters/grok.md
@@ -209,20 +209,34 @@ throw at runtime.
 
 ## Video Generation (Experimental)
 
-Generate short video clips (1–15 seconds, with audio) with the Grok Imagine video model via xAI's asynchronous jobs/polling API:
+Generate short video clips (1–15 seconds, with audio) with the Grok Imagine video model via xAI's asynchronous jobs/polling API.
+
+`grok-imagine-video-1.5` is **image-to-video only**: every request must include an
+`image` prompt part as the starting frame, with the text part describing the
+desired motion. URL sources are fetched by xAI's servers (so they must be
+publicly reachable); use a `data` source for a base64 starting frame:
 
 ```typescript
 import { generateVideo, getVideoJobStatus } from "@tanstack/ai";
 import { grokVideo } from "@tanstack/ai-grok";
 
 const adapter = grokVideo("grok-imagine-video-1.5");
 
-// 1. Create the job
+// 1. Create the job — the prompt carries the starting frame plus motion text
 const { jobId } = await generateVideo({
   adapter,
-  prompt: "A red panda balancing on a bamboo stalk in the rain",
+  prompt: [
+    {
+      type: "text",
+      content: "Make the waterfall crash down and slowly pan out the camera",
+    },
+    {
+      type: "image",
+      source: { type: "url", value: "https://example.com/waterfall-still.png" },
+    },
+  ],
   size: "16:9_720p", // "aspectRatio" or "aspectRatio_resolution"
-  duration: 5, // integer seconds, 1–15
+  duration: 10, // integer seconds, 1–15
 });
 
 // 2. Poll until complete, then read the video URL
@@ -237,32 +251,10 @@ console.log(status.url); // hosted .mp4 URL
 
 Available model:
 
-- `grok-imagine-video-1.5` — text-to-video and image-to-video, $0.08 per second of video. Per xAI's docs a starting image is optional for text-to-video and required for image-to-video.
+- `grok-imagine-video-1.5` — image-to-video, $0.08 per second of video.
 
 Like the Grok Imagine image models, sizing is aspect-ratio based: the `size` option takes an `aspectRatio_resolution` template. Supported aspect ratios are `1:1`, `16:9`, `9:16`, `4:3`, `3:4`, `3:2`, and `2:3`; supported resolutions are `480p`, `720p`, and `1080p` (e.g. `"9:16_1080p"`). The resolution suffix is optional.
 
-For image-to-video, include an `image` prompt part as the starting frame and
-describe the desired motion in the text part. URL sources are fetched by xAI's
-servers (so they must be publicly reachable); use a `data` source for a base64
-starting frame:
-
-```typescript
-const { jobId } = await generateVideo({
-  adapter: grokVideo("grok-imagine-video-1.5"),
-  prompt: [
-    {
-      type: "text",
-      content: "Make the waterfall crash down and slowly pan out the camera",
-    },
-    {
-      type: "image",
-      source: { type: "url", value: "https://example.com/waterfall-still.png" },
-    },
-  ],
-  duration: 10,
-});
-```
-
 When the job completes, the adapter reports usage on the result: `usage.unitsBilled` carries the billed seconds of video and `usage.cost` the exact cost in USD, both as returned by the xAI API.
 
 See [Video Generation](../media/video-generation) for the full jobs/polling flow, streaming mode, and the `useGenerateVideo` hook.
diff --git a/docs/media/video-generation.md b/docs/media/video-generation.md
@@ -558,15 +558,21 @@ Adapters that haven't declared a per-model duration map keep the plain
 
 ### Grok (xAI Imagine) Model Options
 
-Based on the [xAI video generation API](https://docs.x.ai/docs/guides/video-generations). The Grok Imagine models are aspect-ratio sized — the generic `size` option takes an `aspectRatio_resolution` template (like the Grok Imagine image models), and clips can be 1–15 seconds long:
+Based on the [xAI video generation API](https://docs.x.ai/docs/guides/video-generations). `grok-imagine-video-1.5` is **image-to-video only**: every request must include an `image` prompt part as the starting frame, with the text part describing the desired motion. URL sources are fetched by xAI's servers (so they must be publicly reachable); use a `data` source for a base64 starting frame. The model is aspect-ratio sized — the generic `size` option takes an `aspectRatio_resolution` template (like the Grok Imagine image models), and clips can be 1–15 seconds long:
 
 ```typescript
 import { generateVideo } from '@tanstack/ai'
 import { grokVideo } from '@tanstack/ai-grok'
 
 const { jobId } = await generateVideo({
   adapter: grokVideo('grok-imagine-video-1.5'),
-  prompt: 'A beautiful sunset over the ocean',
+  prompt: [
+    { type: 'text', content: 'Slowly pan out as the waves roll in' },
+    {
+      type: 'image',
+      source: { type: 'url', value: 'https://example.com/still.png' },
+    },
+  ],
   size: '16:9_720p',  // aspect ratio: '1:1' | '16:9' | '9:16' | '4:3' | '3:4' | '3:2' | '2:3'
                       // resolution (optional suffix): '480p' | '720p' | '1080p'
   duration: 5,        // integer seconds, 1-15
@@ -578,22 +584,6 @@ const { jobId } = await generateVideo({
 })
 ```
 
-For image-to-video, include an `image` prompt part as the starting frame and describe the desired motion in the text part. URL sources are fetched by xAI's servers (so they must be publicly reachable); use a `data` source for a base64 starting frame:
-
-```typescript
-const { jobId } = await generateVideo({
-  adapter: grokVideo('grok-imagine-video-1.5'),
-  prompt: [
-    { type: 'text', content: 'Slowly pan out as the waves roll in' },
-    {
-      type: 'image',
-      source: { type: 'url', value: 'https://example.com/still.png' },
-    },
-  ],
-  duration: 5,
-})
-```
-
 Generated clips include an audio track. When the job completes, the adapter reports `usage.unitsBilled` (billed seconds of video) and `usage.cost` (exact USD cost as returned by the API) on the result.
 
 ## Response Types
diff --git a/examples/ts-react-media/src/lib/models.ts b/examples/ts-react-media/src/lib/models.ts
@@ -132,17 +132,11 @@ export const VIDEO_MODELS = [
     mode: 'image-to-video' as const,
     provider: 'fal' as const,
   },
-  {
-    id: 'grok-imagine-video-1.5',
-    name: 'Grok Imagine Video 1.5 (Text-to-Video)',
-    description: 'xAI Imagine API via the native grokVideo adapter',
-    mode: 'text-to-video' as const,
-    provider: 'xai' as const,
-  },
   {
     id: 'grok-imagine-video-1.5/image-to-video',
     name: 'Grok Imagine Video 1.5 (Image-to-Video)',
-    description: 'Animate a starting frame via the native grokVideo adapter',
+    description:
+      'Animate a starting frame via the native grokVideo adapter (image-to-video only)',
     mode: 'image-to-video' as const,
     provider: 'xai' as const,
   },
diff --git a/examples/ts-react-media/src/lib/server-functions.ts b/examples/ts-react-media/src/lib/server-functions.ts
@@ -73,10 +73,7 @@ function asImageToVideoPrompt(
  * (XAI_API_KEY); everything else is a fal-hosted model.
  */
 function videoAdapterForModel(model: string) {
-  if (
-    model === 'grok-imagine-video-1.5' ||
-    model === 'grok-imagine-video-1.5/image-to-video'
-  ) {
+  if (model === 'grok-imagine-video-1.5/image-to-video') {
     return grokVideo('grok-imagine-video-1.5')
   }
   return falVideo(model)
@@ -249,18 +246,6 @@ export const createVideoJobFn = createServerFn({ method: 'POST' })
           },
         })
       }
-      case 'grok-imagine-video-1.5': {
-        // Direct xAI Imagine API (XAI_API_KEY) — no fal in between. Sizing is
-        // an "aspectRatio_resolution" template; durations are 1-15 integer
-        // seconds. Completed jobs report usage.unitsBilled (billed seconds)
-        // and usage.cost (exact USD).
-        return generateVideo({
-          adapter: grokVideo('grok-imagine-video-1.5'),
-          prompt: asTextPrompt(data.prompt),
-          size: '16:9_720p',
-          duration: 5,
-        })
-      }
       case 'fal-ai/ltx-2.3/text-to-video/fast': {
         return generateVideo({
           adapter: falVideo('fal-ai/ltx-2.3/text-to-video/fast'),
diff --git a/packages/ai-grok/src/adapters/video.ts b/packages/ai-grok/src/adapters/video.ts
@@ -86,10 +86,13 @@ function buildGrokVideoUsage(
 /**
  * Grok Video Generation Adapter (xAI Imagine API)
  *
- * Tree-shakeable adapter for the grok-imagine video models using the
+ * Tree-shakeable adapter for the grok-imagine-video-1.5 model using the
  * async jobs/polling architecture: create a generation request, poll it,
  * then read the completed video URL.
  *
+ * The model is image-to-video only: every request needs exactly one image
+ * prompt part (the starting frame) plus text describing the desired motion.
+ *
  * The Imagine video endpoints are not part of the OpenAI SDK surface (and
  * xAI rejects the SDK's multipart paths), so requests are plain JSON calls
  * issued with the configured `fetch` (or the global one).
@@ -174,8 +177,8 @@ export class GrokVideoAdapter<
     const duration = options.duration ?? modelOptions?.duration
 
     // The interleaved prompt decomposes into verbatim text plus typed media
-    // buckets. The Imagine video endpoint takes a text prompt and an optional
-    // starting frame; reject the modalities it can't consume.
+    // buckets. grok-imagine-video-1.5 is image-to-video only: it needs exactly
+    // one starting-frame image plus the text prompt describing the motion.
     const resolved = resolveMediaPrompt(options.prompt)
     if (resolved.videos.length > 0) {
       throw new Error(
@@ -187,9 +190,14 @@ export class GrokVideoAdapter<
         `${this.name}.createVideoJob does not support audio prompt parts (model: ${model}).`,
       )
     }
+    if (resolved.images.length === 0) {
+      throw new Error(
+        `${this.name}: ${model} is image-to-video only — include exactly one image prompt part as the starting frame.`,
+      )
+    }
     if (resolved.images.length > 1) {
       throw new Error(
-        `${this.name}: grok-imagine video accepts at most one starting-frame image; received ${resolved.images.length}.`,
+        `${this.name}: ${model} accepts at most one starting-frame image; received ${resolved.images.length}.`,
       )
     }
 
@@ -352,11 +360,15 @@ export class GrokVideoAdapter<
  *
  * @example
  * ```typescript
+ * // Image-to-video only: include the starting frame as an image prompt part.
  * const adapter = createGrokVideo('grok-imagine-video-1.5', 'xai-...');
  *
  * const { jobId } = await generateVideo({
  *   adapter,
- *   prompt: 'A beautiful sunset over the ocean',
+ *   prompt: [
+ *     { type: 'text', content: 'Slowly pan out as the waves roll in' },
+ *     { type: 'image', source: { type: 'url', value: 'https://example.com/still.png' } },
+ *   ],
  *   size: '16:9_720p',
  *   duration: 5
  * });
@@ -390,10 +402,13 @@ export function createGrokVideo<TModel extends GrokVideoModel>(
  * // Automatically uses XAI_API_KEY from environment
  * const adapter = grokVideo('grok-imagine-video-1.5');
  *
- * // Create a video generation job
+ * // Image-to-video only: the prompt must carry a starting-frame image part.
  * const { jobId } = await generateVideo({
  *   adapter,
- *   prompt: 'A cat playing piano'
+ *   prompt: [
+ *     { type: 'text', content: 'Make the cat start playing the piano' },
+ *     { type: 'image', source: { type: 'url', value: 'https://example.com/cat.png' } },
+ *   ],
  * });
  *
  * // Poll for status
diff --git a/packages/ai-grok/src/model-meta.ts b/packages/ai-grok/src/model-meta.ts
@@ -253,9 +253,9 @@ const GROK_IMAGINE_IMAGE_QUALITY = {
 } as const satisfies ModelMeta
 
 // Imagine API video model. Pricing is per second of generated video
-// (output only); generated videos carry an audio track. Per xAI's docs the
-// model does text-to-video (a starting image is optional) and image-to-video
-// (a starting image is required).
+// (output only); generated videos carry an audio track. The model is
+// image-to-video only: a starting-frame image is required (the text prompt
+// describes the desired motion).
 const GROK_IMAGINE_VIDEO_1_5 = {
   name: 'grok-imagine-video-1.5',
   supports: {
diff --git a/packages/ai-grok/src/video/video-provider-options.ts b/packages/ai-grok/src/video/video-provider-options.ts
@@ -170,9 +170,8 @@ export type GrokVideoModelSizeByName = {
 
 /**
  * Type-only map from model name to the non-text prompt modalities it accepts.
- * grok-imagine-video-1.5 supports image-to-video: an `image` prompt part
- * supplies the starting frame (optional for text-to-video, required for
- * image-to-video).
+ * grok-imagine-video-1.5 is image-to-video only: an `image` prompt part
+ * supplies the required starting frame.
  *
  * @experimental Video generation is an experimental feature and may change.
  */
diff --git a/packages/ai-grok/tests/video-adapter.test.ts b/packages/ai-grok/tests/video-adapter.test.ts
@@ -45,6 +45,21 @@ function adapterWithFetch(
   })
 }
 
+/**
+ * grok-imagine-video-1.5 is image-to-video only, so every request needs a
+ * starting-frame image part. This builds a text + image prompt for the
+ * request-shape / status / error tests.
+ */
+function i2vPrompt(text = 'p') {
+  return [
+    { type: 'text' as const, content: text },
+    {
+      type: 'image' as const,
+      source: { type: 'url' as const, value: 'https://example.com/start.png' },
+    },
+  ]
+}
+
 describe('Grok Video Adapter', () => {
   describe('factories', () => {
     it('creates an adapter with the provided API key', () => {
@@ -73,7 +88,7 @@ describe('Grok Video Adapter', () => {
 
       const result = await adapter.createVideoJob({
         model: 'grok-imagine-video-1.5',
-        prompt: 'A red ball bouncing once',
+        prompt: i2vPrompt('A red ball bouncing once'),
         size: '16:9_720p',
         duration: 5,
         logger: testLogger,
@@ -94,6 +109,7 @@ describe('Grok Video Adapter', () => {
       expect(JSON.parse(String(init?.body))).toEqual({
         model: 'grok-imagine-video-1.5',
         prompt: 'A red ball bouncing once',
+        image: { url: 'https://example.com/start.png' },
         aspect_ratio: '16:9',
         resolution: '720p',
         duration: 5,
@@ -106,7 +122,7 @@ describe('Grok Video Adapter', () => {
 
       await adapter.createVideoJob({
         model: 'grok-imagine-video-1.5',
-        prompt: 'p',
+        prompt: i2vPrompt(),
         size: '9:16',
         logger: testLogger,
       })
@@ -123,7 +139,7 @@ describe('Grok Video Adapter', () => {
 
       await adapter.createVideoJob({
         model: 'grok-imagine-video-1.5',
-        prompt: 'make the waterfall crash down',
+        prompt: i2vPrompt('make the waterfall crash down'),
         modelOptions: {
           resolution: '1080p',
           duration: 10,
@@ -225,13 +241,27 @@ describe('Grok Video Adapter', () => {
       expect(fetchMock).not.toHaveBeenCalled()
     })
 
+    it('rejects a text-only prompt — the model is image-to-video only', async () => {
+      const fetchMock = mockFetch(() => jsonResponse({ request_id: 'r' }))
+      const adapter = adapterWithFetch(fetchMock)
+
+      await expect(
+        adapter.createVideoJob({
+          model: 'grok-imagine-video-1.5',
+          prompt: 'a red ball bouncing once',
+          logger: testLogger,
+        }),
+      ).rejects.toThrow(/image-to-video only/)
+      expect(fetchMock).not.toHaveBeenCalled()
+    })
+
     it('lets modelOptions win over the generic size template', async () => {
       const fetchMock = mockFetch(() => jsonResponse({ request_id: 'r' }))
       const adapter = adapterWithFetch(fetchMock)
 
       await adapter.createVideoJob({
         model: 'grok-imagine-video-1.5',
-        prompt: 'p',
+        prompt: i2vPrompt(),
         size: '16:9_480p',
         modelOptions: { resolution: '1080p' },
         logger: testLogger,
@@ -305,7 +335,7 @@ describe('Grok Video Adapter', () => {
       await expect(
         adapter.createVideoJob({
           model: 'grok-imagine-video-1.5',
-          prompt: 'p',
+          prompt: i2vPrompt(),
           logger: testLogger,
         }),
       ).rejects.toThrow(
@@ -320,7 +350,7 @@ describe('Grok Video Adapter', () => {
       await expect(
         adapter.createVideoJob({
           model: 'grok-imagine-video-1.5',
-          prompt: 'p',
+          prompt: i2vPrompt(),
           logger: testLogger,
         }),
       ).rejects.toThrow(/no request_id/)
@@ -335,7 +365,7 @@ describe('Grok Video Adapter', () => {
 
       await adapter.createVideoJob({
         model: 'grok-imagine-video-1.5',
-        prompt: 'p',
+        prompt: i2vPrompt(),
         logger: testLogger,
       })
 
diff --git a/packages/ai/skills/ai-core/media-generation/SKILL.md b/packages/ai/skills/ai-core/media-generation/SKILL.md