NVIDIA-NeMo · nabinchha · May 22, 2026 · May 18, 2026
diff --git a/docs/code_reference/config/models.md b/docs/code_reference/config/models.md
@@ -1,6 +1,6 @@
 # Models
 
-[ModelProvider](#data_designer.config.models.ModelProvider) stores connection and authentication details for model providers. [ModelConfig](#data_designer.config.models.ModelConfig) stores a model alias, model identifier, provider settings, and inference parameters. [Inference Parameters](../../concepts/models/inference-parameters.md) control model behavior. Chat-completion parameters include `temperature`, `top_p`, and `max_tokens`; `temperature` and `top_p` can be fixed values or configured distributions. [ImageContext](#data_designer.config.models.ImageContext) provides image inputs to multimodal models, and [ImageInferenceParams](#data_designer.config.models.ImageInferenceParams) configures image generation models.
+[ModelProvider](#data_designer.config.models.ModelProvider) stores connection and authentication details for model providers. [ModelConfig](#data_designer.config.models.ModelConfig) stores a model alias, model identifier, provider settings, and inference parameters. [Inference Parameters](../../concepts/models/inference-parameters.md) control model behavior. Chat-completion parameters include `temperature`, `top_p`, and `max_tokens`; `temperature` and `top_p` can be fixed values or configured distributions. [ImageContext](#data_designer.config.models.ImageContext), [AudioContext](#data_designer.config.models.AudioContext), and [VideoContext](#data_designer.config.models.VideoContext) provide image, audio, and video inputs to multimodal models, and [ImageInferenceParams](#data_designer.config.models.ImageInferenceParams) configures image generation models.
 
 Related guides:
 

@@ -24,9 +24,11 @@
     "#### 📚 What you'll learn\n",
     "\n",
     "This notebook demonstrates how to provide images as context to generate text descriptions using vision-language models.\n",
+    "The same `multi_modal_context` field can also carry audio or video context when the selected model supports those modalities.\n",
     "\n",
     "- ✨ **Visual Document Processing**: Converting images to chat-ready format for model consumption\n",
     "- 🔍 **Vision-Language Generation**: Using vision models to generate detailed summaries from images\n",
+    "- 🧩 **Media Context Pattern**: Understanding how `ImageContext`, `AudioContext`, and `VideoContext` fit into the same configuration field\n",
     "\n",
     "If this is your first time using Data Designer, we recommend starting with the [first notebook](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/1-the-basics/) in this tutorial series.\n"
    ]
@@ -268,6 +270,42 @@
     "config_builder.with_seed_dataset(dd.DataFrameSeedSource(df=df_seed))"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "media-context-capabilities",
+   "metadata": {},
+   "source": [
+    "### 🧩 Media context and model capabilities\n",
+    "\n",
+    "`multi_modal_context` accepts media context descriptors such as `ImageContext`, `AudioContext`, and `VideoContext`. Data Designer reads the referenced seed columns and serializes them for the model request, but the selected model still determines which modalities are valid.\n",
+    "\n",
+    "This notebook uses only image context, because image-capable VLMs are broadly available. Before combining image, audio, and video in one column, choose a model alias backed by an omni or otherwise modality-compatible model, and check that the provider accepts every context type you send.\n",
+    "\n",
+    "For base64 seed columns, store the raw base64 payload without a `data:<media-type>;base64,` prefix and specify the media format on the context object:\n",
+    "\n",
+    "```python\n",
+    "media_context = [\n",
+    "    dd.ImageContext(\n",
+    "        column_name=\"image_base64\",\n",
+    "        data_type=dd.ModalityDataType.BASE64,\n",
+    "        image_format=dd.ImageFormat.PNG,\n",
+    "    ),\n",
+    "    dd.AudioContext(\n",
+    "        column_name=\"audio_base64\",\n",
+    "        data_type=dd.ModalityDataType.BASE64,\n",
+    "        audio_format=dd.AudioFormat.MP3,\n",
+    "    ),\n",
+    "    dd.VideoContext(\n",
+    "        column_name=\"video_base64\",\n",
+    "        data_type=dd.ModalityDataType.BASE64,\n",
+    "        video_format=dd.VideoFormat.MP4,\n",
+    "    ),\n",
+    "]\n",
+    "```\n",
+    "\n",
+    "URL-backed media can use `data_type=dd.ModalityDataType.URL`, subject to the provider's URL support and file-size limits."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -456,7 +494,7 @@
     "\n",
     "- Experiment with different vision models for specific image types\n",
     "- Try different prompt variations to generate specialized descriptions (e.g., technical details, key findings)\n",
-    "- Combine vision-based descriptions with other column types for multi-modal workflows\n",
+    "- Combine image, audio, or video context with other column types after confirming your selected model supports those modalities\n",
     "- Apply this pattern to other vision tasks like image captioning, OCR validation, or visual question answering\n",
     "\n",
     "- [Generating images](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/5-generating-images/) with Data Designer\n"

@@ -19,9 +19,11 @@
 # #### 📚 What you'll learn
 #
 # This notebook demonstrates how to provide images as context to generate text descriptions using vision-language models.
+# The same `multi_modal_context` field can also carry audio or video context when the selected model supports those modalities.
 #
 # - ✨ **Visual Document Processing**: Converting images to chat-ready format for model consumption
 # - 🔍 **Vision-Language Generation**: Using vision models to generate detailed summaries from images
+# - 🧩 **Media Context Pattern**: Understanding how `ImageContext`, `AudioContext`, and `VideoContext` fit into the same configuration field
 #
 # If this is your first time using Data Designer, we recommend starting with the [first notebook](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/1-the-basics/) in this tutorial series.
 #
@@ -153,6 +155,37 @@ def convert_image_to_chat_format(record, height: int) -> dict:
 df_seed = pd.DataFrame(img_dataset)[["uuid", "label", "base64_image"]]
 config_builder.with_seed_dataset(dd.DataFrameSeedSource(df=df_seed))
 
+# %% [markdown]
+# ### 🧩 Media context and model capabilities
+#
+# `multi_modal_context` accepts media context descriptors such as `ImageContext`, `AudioContext`, and `VideoContext`. Data Designer reads the referenced seed columns and serializes them for the model request, but the selected model still determines which modalities are valid.
+#
+# This notebook uses only image context, because image-capable VLMs are broadly available. Before combining image, audio, and video in one column, choose a model alias backed by an omni or otherwise modality-compatible model, and check that the provider accepts every context type you send.
+#
+# For base64 seed columns, store the raw base64 payload without a `data:<media-type>;base64,` prefix and specify the media format on the context object:
+#
+# ```python
+# media_context = [
+#     dd.ImageContext(
+#         column_name="image_base64",
+#         data_type=dd.ModalityDataType.BASE64,
+#         image_format=dd.ImageFormat.PNG,
+#     ),
+#     dd.AudioContext(
+#         column_name="audio_base64",
+#         data_type=dd.ModalityDataType.BASE64,
+#         audio_format=dd.AudioFormat.MP3,
+#     ),
+#     dd.VideoContext(
+#         column_name="video_base64",
+#         data_type=dd.ModalityDataType.BASE64,
+#         video_format=dd.VideoFormat.MP4,
+#     ),
+# ]
+# ```
+#
+# URL-backed media can use `data_type=dd.ModalityDataType.URL`, subject to the provider's URL support and file-size limits.
+
 # %%
 # Add a column to generate detailed image descriptions
 config_builder.add_column(
@@ -257,7 +290,7 @@ def convert_image_to_chat_format(record, height: int) -> dict:
 #
 # - Experiment with different vision models for specific image types
 # - Try different prompt variations to generate specialized descriptions (e.g., technical details, key findings)
-# - Combine vision-based descriptions with other column types for multi-modal workflows
+# - Combine image, audio, or video context with other column types after confirming your selected model supports those modalities
 # - Apply this pattern to other vision tasks like image captioning, OCR validation, or visual question answering
 #
 # - [Generating images](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/5-generating-images/) with Data Designer

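The new notebook cell asks for raw base64 payloads without a `data:<media-type>;base64,` prefix. A minimal sketch of how a seed column could be normalized before it is handed to Data Designer is shown here. The helper name `strip_data_uri_prefix` is hypothetical and is not part of Data Designer; it uses only the Python standard library.

```python
import base64
import binascii


def strip_data_uri_prefix(payload: str) -> str:
    """Return the raw base64 part of `payload`.

    Hypothetical helper (not a Data Designer API): drops a leading
    `data:<media-type>;base64,` prefix if one is present, then checks
    that what remains decodes as base64.
    """
    if payload.startswith("data:"):
        header, sep, body = payload.partition(",")
        if not sep or not header.endswith(";base64"):
            raise ValueError("expected a 'data:<media-type>;base64,' prefix")
        payload = body
    payload = payload.strip()
    try:
        # validate=True rejects characters outside the base64 alphabet.
        base64.b64decode(payload, validate=True)
    except binascii.Error as exc:
        raise ValueError("payload is not valid base64") from exc
    return payload
```

With the notebook's seed frame, this could be applied as `df_seed["base64_image"] = df_seed["base64_image"].map(strip_data_uri_prefix)` before calling `with_seed_dataset`. Whether that step is needed depends on how the seed data was produced.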
@@ -95,6 +95,7 @@ Learn how to use vision-language models to generate text descriptions from image
 
 - Processing and converting images to base64 format for model consumption
 - Using vision-language models (VLMs) to analyze visual documents
+- Understanding how image, audio, and video context share the same `multi_modal_context` field, while still requiring model support for each modality
 - Generating detailed summaries from document images
 - Inspecting and validating vision-based generation results
 

diff --git a/fern/versions/latest/pages/code_reference/config/models.mdx b/fern/versions/latest/pages/code_reference/config/models.mdx
@@ -3,11 +3,13 @@ title: "Models"
 description: ""
 position: 1
 ---
-The `models` module defines configuration objects for model-based generation. [ModelProvider](#data_designer.config.models.ModelProvider) specifies connection and authentication details for custom providers. [ModelConfig](#data_designer.config.models.ModelConfig) encapsulates model details including the model alias, identifier, and inference parameters. [Inference Parameters](/concepts/models/inference-parameters) controls model behavior through settings like `temperature`, `top_p`, and `max_tokens`, with support for both fixed values and distribution-based sampling. The module includes [ImageContext](#data_designer.config.models.ImageContext) for providing image inputs to multimodal models, and [ImageInferenceParams](#data_designer.config.models.ImageInferenceParams) for configuring image generation models.
+The `models` module defines configuration objects for model-based generation. [ModelProvider](#data_designer.config.models.ModelProvider) specifies connection and authentication details for custom providers. [ModelConfig](#data_designer.config.models.ModelConfig) encapsulates model details including the model alias, identifier, and inference parameters. [Inference Parameters](/concepts/models/inference-parameters) controls model behavior through settings like `temperature`, `top_p`, and `max_tokens`, with support for both fixed values and distribution-based sampling. The module includes [ImageContext](#data_designer.config.models.ImageContext), [AudioContext](#data_designer.config.models.AudioContext), and [VideoContext](#data_designer.config.models.VideoContext) for providing image, audio, and video inputs to multimodal models, and [ImageInferenceParams](#data_designer.config.models.ImageInferenceParams) for configuring image generation models.
+
+`ImageContext`, `AudioContext`, and `VideoContext` describe the media blocks that Data Designer should send. They do not override provider limitations: the selected model must support every modality and media format included in a column's `multi_modal_context`.
 
 For more information on how they are used, see below:
 
 - **[Model Providers](/concepts/models/model-providers)**
 - **[Model Configurations](/concepts/models/model-configs)**
-- **[Image Context](/tutorials/providing-images-as-context)**
+- **[Image and Media Context](/tutorials/providing-images-as-context)**
 - **[Generating Images](/tutorials/generating-images)**
@@ -45,6 +45,8 @@ LLM-Text columns generate natural language text: product descriptions, customer
 
 Use **Jinja2 templating** in prompts to reference other columns. Data Designer automatically manages dependencies and injects the referenced column values into the prompt.
 
+LLM-Text and LLM-Structured columns can also include `multi_modal_context` with `ImageContext`, `AudioContext`, or `VideoContext`. Data Designer reads the referenced seed columns and serializes the media blocks, but it does not make an image-only model understand audio or video. Choose a `model_alias` whose underlying provider/model supports every modality in the column.
+
 <Note>
 Generation Traces
 LLM columns can optionally capture message traces in a separate `{column_name}__trace` column. Set `with_trace` on the column config to control what's captured: `TraceType.NONE` (default, no trace), `TraceType.LAST_MESSAGE` (final assistant message only), or `TraceType.ALL_MESSAGES` (full conversation history). The trace includes the ordered message history for the final generation attempt (system/user/assistant/tool calls/tool results), and may include model reasoning fields when the provider exposes them.
@@ -126,11 +128,11 @@ Image columns require a model configured with `ImageInferenceParams`. Model-spec
 - **Preview** (`data_designer.preview()`): Images are stored as base64-encoded strings directly in the DataFrame for quick iteration
 - **Create** (`data_designer.create()`): Images are saved to disk in an `images/<column_name>/` folder with UUID filenames; the DataFrame stores relative paths
 
-Image columns also support `multi_modal_context` for autoregressive models that accept image inputs, enabling image-to-image generation workflows.
+Image columns also support `multi_modal_context` for autoregressive multimodal models that accept media inputs, enabling image-to-image and other media-conditioned image generation workflows. Diffusion image-generation routes do not consume multimodal context, and not every autoregressive image model accepts every media type.
 
 <Tip>
 Tutorials
-The image tutorials cover three workflows: [Providing Images as Context](/tutorials/providing-images-as-context) (image → text), [Generating Images](/tutorials/generating-images) (text → image), and [Editing Images with Image Context](/tutorials/image-to-image-editing) (image → image).
+The image tutorials cover three workflows: [Providing Images as Context](/tutorials/providing-images-as-context) (image → text, with notes on audio/video-capable models), [Generating Images](/tutorials/generating-images) (text → image), and [Editing Images with Image Context](/tutorials/image-to-image-editing) (image → image).
 </Tip>
 
 ### 🧬 Embedding Columns

@@ -48,7 +48,7 @@ The following model configurations are automatically available when `NVIDIA_API_
 |-------|-------|----------|---------------------|
 | `nvidia-text` | `nvidia/nemotron-3-nano-30b-a3b` | General text generation | `temperature=1.0, top_p=1.0` |
 | `nvidia-reasoning` | `nvidia/nemotron-3-super-120b-a12b` | Reasoning and analysis tasks | `temperature=1.0, top_p=0.95, extra_body={"reasoning_effort": "medium"}` |
-| `nvidia-vision` | `nvidia/nemotron-nano-12b-v2-vl` | Vision and image understanding | `temperature=0.85, top_p=0.95` |
+| `nvidia-vision` | `nvidia/nemotron-3-nano-omni-30b-a3b-reasoning` | Omni multimodal understanding for image, audio, and video inputs | `temperature=0.60, top_p=0.95` |
 | `nvidia-embedding` | `nvidia/llama-3.2-nv-embedqa-1b-v2` | Text embeddings | `encoding_format="float", extra_body={"input_type": "query"}` |
 
 
@@ -59,8 +59,8 @@ The following model configurations are automatically available when `OPENAI_API_
 | Alias | Model | Use Case | Inference Parameters |
 |-------|-------|----------|---------------------|
 | `openai-text` | `gpt-4.1` | General text generation | `temperature=0.85, top_p=0.95` |
-| `openai-reasoning` | `gpt-5` | Reasoning and analysis tasks | `temperature=0.35, top_p=0.95` |
-| `openai-vision` | `gpt-5` | Vision and image understanding | `temperature=0.85, top_p=0.95` |
+| `openai-reasoning` | `gpt-5` | Reasoning and analysis tasks | `extra_body={"reasoning_effort": "medium"}` |
+| `openai-vision` | `gpt-5` | Vision and image understanding | `extra_body={"reasoning_effort": "medium"}` |
 | `openai-embedding` | `text-embedding-3-large` | Text embeddings | `encoding_format="float"` |
 
 ### OpenRouter Models
@@ -71,9 +71,13 @@ The following model configurations are automatically available when `OPENROUTER_
 |-------|-------|----------|---------------------|
 | `openrouter-text` | `nvidia/nemotron-3-nano-30b-a3b` | General text generation | `temperature=1.0, top_p=1.0` |
 | `openrouter-reasoning` | `openai/gpt-oss-20b` | Reasoning and analysis tasks | `temperature=0.35, top_p=0.95` |
-| `openrouter-vision` | `nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free` | Vision and image understanding | `temperature=0.60, top_p=0.95` |
+| `openrouter-vision` | `nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free` | Omni multimodal understanding for image, audio, and video inputs, subject to OpenRouter model support | `temperature=0.60, top_p=0.95` |
 | `openrouter-embedding` | `openai/text-embedding-3-large` | Text embeddings | `encoding_format="float"` |
 
+<Note title="Modality support depends on the model">
+  The `multi_modal_context` field can include image, audio, and video contexts, but each model/provider combination has its own accepted input formats, media-size limits, and supported combinations of modalities. Use an image-capable model for image-only workflows, and use an omni or otherwise multimodal model before sending audio or video context.
+</Note>
+
 
 ## Using Default Settings
 

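Because modality support depends on the model behind each alias, it can help to check a column's media types before running a job. The sketch below is one way to do that in plain Python. Everything in it is an assumption for illustration: the `Modality` enum, the `SUPPORTED_MODALITIES` table, and the alias names are hypothetical, and Data Designer is not known to expose such a table. Fill the table in from your provider's documentation.

```python
from enum import Enum


class Modality(str, Enum):
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"


# Hypothetical capability table keyed by model alias.
# The aliases and their capabilities here are placeholders, not real defaults.
SUPPORTED_MODALITIES: dict[str, set[Modality]] = {
    "image-only-alias": {Modality.IMAGE},
    "omni-alias": {Modality.IMAGE, Modality.AUDIO, Modality.VIDEO},
}


def unsupported_modalities(model_alias: str, requested: set[Modality]) -> set[Modality]:
    """Return the requested modalities that the alias is not known to support.

    An alias missing from the table is treated as supporting nothing,
    so every requested modality is reported.
    """
    supported = SUPPORTED_MODALITIES.get(model_alias, set())
    return requested - supported
```

A caller could raise an error, or skip the column, whenever the returned set is non-empty. Treating unknown aliases as unsupported is a deliberately conservative choice.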
@@ -9,6 +9,8 @@ Model configurations define the specific models you use for synthetic data gener
 
 A `ModelConfig` specifies which LLM model to use and how it should behave during generation. When you create column configurations (like `LLMText`, `LLMCode`, or `LLMStructured`), you reference a model by its alias. Data Designer uses the model configuration to determine which model to call and with what parameters.
 
+When a column includes `multi_modal_context`, the `ModelConfig` alias must point to a model that supports the media types you send. Data Designer can serialize image, audio, and video context blocks, but model capability is still provider-specific.
+
 ## ModelConfig Structure
 
 The `ModelConfig` class has the following fields:
@@ -81,13 +83,13 @@ model_configs = [
             max_tokens=4096,
         ),
     ),
-    # Vision tasks
+    # Omni multimodal tasks
     dd.ModelConfig(
         alias="vision-model",
-        model="nvidia/nemotron-nano-12b-v2-vl",
+        model="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
         provider="nvidia",
         inference_parameters=dd.ChatCompletionInferenceParams(
-            temperature=0.7,
+            temperature=0.60,
             top_p=0.95,
             max_tokens=2048,
         ),

@@ -1,6 +1,6 @@
 ---
 title: "Providing Images as Context"
-description: "Multimodal prompts with image inputs."
+description: "Multimodal prompts with image inputs and notes for audio/video-capable models."
 position: 5
 ---
 

@@ -11,6 +11,6 @@ These tutorials walk through Data Designer end-to-end with executable Jupyter no
 | [The Basics](/tutorials/the-basics) | Declare columns, generate your first dataset |
 | [Structured Outputs, Jinja Expressions, and Conditional Generation](/tutorials/structured-outputs-jinja-expressions-and-conditional-generation) | Schema-constrained outputs and dynamic prompts |
 | [Seeding with an External Dataset](/tutorials/seeding-with-an-external-dataset) | Use existing data as input for generation |
-| [Providing Images as Context](/tutorials/providing-images-as-context) | Multimodal prompts with image inputs |
+| [Providing Images as Context](/tutorials/providing-images-as-context) | Multimodal prompts with image inputs, plus the media-context pattern for models that support audio or video |
 | [Generating Images](/tutorials/generating-images) | Create image columns from text prompts |
 | [Image-to-Image Editing](/tutorials/image-to-image-editing) | Edit images using image context |