2 changes: 1 addition & 1 deletion docs/code_reference/config/models.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Models

[ModelProvider](#data_designer.config.models.ModelProvider) stores connection and authentication details for model providers. [ModelConfig](#data_designer.config.models.ModelConfig) stores a model alias, model identifier, provider settings, and inference parameters. [Inference Parameters](../../concepts/models/inference-parameters.md) control model behavior. Chat-completion parameters include `temperature`, `top_p`, and `max_tokens`; `temperature` and `top_p` can be fixed values or configured distributions. [ImageContext](#data_designer.config.models.ImageContext) provides image inputs to multimodal models, and [ImageInferenceParams](#data_designer.config.models.ImageInferenceParams) configures image generation models.
[ModelProvider](#data_designer.config.models.ModelProvider) stores connection and authentication details for model providers. [ModelConfig](#data_designer.config.models.ModelConfig) stores a model alias, model identifier, provider settings, and inference parameters. [Inference Parameters](../../concepts/models/inference-parameters.md) control model behavior. Chat-completion parameters include `temperature`, `top_p`, and `max_tokens`; `temperature` and `top_p` can be fixed values or configured distributions. [ImageContext](#data_designer.config.models.ImageContext), [AudioContext](#data_designer.config.models.AudioContext), and [VideoContext](#data_designer.config.models.VideoContext) provide image, audio, and video inputs to multimodal models, and [ImageInferenceParams](#data_designer.config.models.ImageInferenceParams) configures image generation models.

Related guides:

40 changes: 39 additions & 1 deletion docs/colab_notebooks/4-providing-images-as-context.ipynb
@@ -24,9 +24,11 @@
"#### 📚 What you'll learn\n",
"\n",
"This notebook demonstrates how to provide images as context to generate text descriptions using vision-language models.\n",
"The same `multi_modal_context` field can also carry audio or video context when the selected model supports those modalities.\n",
"\n",
"- ✨ **Visual Document Processing**: Converting images to chat-ready format for model consumption\n",
"- 🔍 **Vision-Language Generation**: Using vision models to generate detailed summaries from images\n",
"- 🧩 **Media Context Pattern**: Understanding how `ImageContext`, `AudioContext`, and `VideoContext` fit into the same configuration field\n",
"\n",
"If this is your first time using Data Designer, we recommend starting with the [first notebook](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/1-the-basics/) in this tutorial series.\n"
]
@@ -268,6 +270,42 @@
"config_builder.with_seed_dataset(dd.DataFrameSeedSource(df=df_seed))"
]
},
{
"cell_type": "markdown",
"id": "media-context-capabilities",
"metadata": {},
"source": [
"### 🧩 Media context and model capabilities\n",
"\n",
"`multi_modal_context` accepts media context descriptors such as `ImageContext`, `AudioContext`, and `VideoContext`. Data Designer reads the referenced seed columns and serializes them for the model request, but the selected model still determines which modalities are valid.\n",
"\n",
"This notebook uses image context only because image-capable VLMs are broadly available. Before combining image, audio, and video in one column, choose a model alias backed by an omni or otherwise modality-compatible model, and check that the provider accepts every context type you send.\n",
"\n",
"For base64 seed columns, store the raw base64 payload without a `data:<media-type>;base64,` prefix and specify the media format on the context object:\n",
"\n",
"```python\n",
"media_context = [\n",
" dd.ImageContext(\n",
" column_name=\"image_base64\",\n",
" data_type=dd.ModalityDataType.BASE64,\n",
" image_format=dd.ImageFormat.PNG,\n",
" ),\n",
" dd.AudioContext(\n",
" column_name=\"audio_base64\",\n",
" data_type=dd.ModalityDataType.BASE64,\n",
" audio_format=dd.AudioFormat.MP3,\n",
" ),\n",
" dd.VideoContext(\n",
" column_name=\"video_base64\",\n",
" data_type=dd.ModalityDataType.BASE64,\n",
" video_format=dd.VideoFormat.MP4,\n",
" ),\n",
"]\n",
"```\n",
"\n",
"URL-backed media can use `data_type=dd.ModalityDataType.URL`, subject to the provider's URL support and file-size limits."
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -456,7 +494,7 @@
"\n",
"- Experiment with different vision models for specific image types\n",
"- Try different prompt variations to generate specialized descriptions (e.g., technical details, key findings)\n",
"- Combine vision-based descriptions with other column types for multi-modal workflows\n",
"- Combine image, audio, or video context with other column types after confirming your selected model supports those modalities\n",
"- Apply this pattern to other vision tasks like image captioning, OCR validation, or visual question answering\n",
"\n",
"- [Generating images](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/5-generating-images/) with Data Designer\n"
35 changes: 34 additions & 1 deletion docs/notebook_source/4-providing-images-as-context.py
@@ -19,9 +19,11 @@
# #### 📚 What you'll learn
#
# This notebook demonstrates how to provide images as context to generate text descriptions using vision-language models.
# The same `multi_modal_context` field can also carry audio or video context when the selected model supports those modalities.
#
# - ✨ **Visual Document Processing**: Converting images to chat-ready format for model consumption
# - 🔍 **Vision-Language Generation**: Using vision models to generate detailed summaries from images
# - 🧩 **Media Context Pattern**: Understanding how `ImageContext`, `AudioContext`, and `VideoContext` fit into the same configuration field
#
# If this is your first time using Data Designer, we recommend starting with the [first notebook](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/1-the-basics/) in this tutorial series.
#
@@ -153,6 +155,37 @@ def convert_image_to_chat_format(record, height: int) -> dict:
df_seed = pd.DataFrame(img_dataset)[["uuid", "label", "base64_image"]]
config_builder.with_seed_dataset(dd.DataFrameSeedSource(df=df_seed))

# %% [markdown]
# ### 🧩 Media context and model capabilities
#
# `multi_modal_context` accepts media context descriptors such as `ImageContext`, `AudioContext`, and `VideoContext`. Data Designer reads the referenced seed columns and serializes them for the model request, but the selected model still determines which modalities are valid.
#
# This notebook uses image context only because image-capable VLMs are broadly available. Before combining image, audio, and video in one column, choose a model alias backed by an omni or otherwise modality-compatible model, and check that the provider accepts every context type you send.
#
# For base64 seed columns, store the raw base64 payload without a `data:<media-type>;base64,` prefix and specify the media format on the context object:
#
# ```python
# media_context = [
# dd.ImageContext(
# column_name="image_base64",
# data_type=dd.ModalityDataType.BASE64,
# image_format=dd.ImageFormat.PNG,
# ),
# dd.AudioContext(
# column_name="audio_base64",
# data_type=dd.ModalityDataType.BASE64,
# audio_format=dd.AudioFormat.MP3,
# ),
# dd.VideoContext(
# column_name="video_base64",
# data_type=dd.ModalityDataType.BASE64,
# video_format=dd.VideoFormat.MP4,
# ),
# ]
# ```
#
# URL-backed media can use `data_type=dd.ModalityDataType.URL`, subject to the provider's URL support and file-size limits.

# %%
# Add a column to generate detailed image descriptions
config_builder.add_column(
@@ -257,7 +290,7 @@ def convert_image_to_chat_format(record, height: int) -> dict:
#
# - Experiment with different vision models for specific image types
# - Try different prompt variations to generate specialized descriptions (e.g., technical details, key findings)
# - Combine vision-based descriptions with other column types for multi-modal workflows
# - Combine image, audio, or video context with other column types after confirming your selected model supports those modalities
# - Apply this pattern to other vision tasks like image captioning, OCR validation, or visual question answering
#
# - [Generating images](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/5-generating-images/) with Data Designer
1 change: 1 addition & 0 deletions docs/notebook_source/_README.md
@@ -95,6 +95,7 @@ Learn how to use vision-language models to generate text descriptions from image

- Processing and converting images to base64 format for model consumption
- Using vision-language models (VLMs) to analyze visual documents
- Understanding how image, audio, and video context share the same `multi_modal_context` field, while still requiring model support for each modality
- Generating detailed summaries from document images
- Inspecting and validating vision-based generation results

6 changes: 4 additions & 2 deletions fern/versions/latest/pages/code_reference/config/models.mdx
@@ -3,11 +3,13 @@ title: "Models"
description: ""
position: 1
---
The `models` module defines configuration objects for model-based generation. [ModelProvider](#data_designer.config.models.ModelProvider) specifies connection and authentication details for custom providers. [ModelConfig](#data_designer.config.models.ModelConfig) encapsulates model details including the model alias, identifier, and inference parameters. [Inference Parameters](/concepts/models/inference-parameters) controls model behavior through settings like `temperature`, `top_p`, and `max_tokens`, with support for both fixed values and distribution-based sampling. The module includes [ImageContext](#data_designer.config.models.ImageContext) for providing image inputs to multimodal models, and [ImageInferenceParams](#data_designer.config.models.ImageInferenceParams) for configuring image generation models.
The `models` module defines configuration objects for model-based generation. [ModelProvider](#data_designer.config.models.ModelProvider) specifies connection and authentication details for custom providers. [ModelConfig](#data_designer.config.models.ModelConfig) encapsulates model details including the model alias, identifier, and inference parameters. [Inference Parameters](/concepts/models/inference-parameters) controls model behavior through settings like `temperature`, `top_p`, and `max_tokens`, with support for both fixed values and distribution-based sampling. The module includes [ImageContext](#data_designer.config.models.ImageContext), [AudioContext](#data_designer.config.models.AudioContext), and [VideoContext](#data_designer.config.models.VideoContext) for providing image, audio, and video inputs to multimodal models, and [ImageInferenceParams](#data_designer.config.models.ImageInferenceParams) for configuring image generation models.

`ImageContext`, `AudioContext`, and `VideoContext` describe the media blocks that Data Designer should send. They do not override provider limitations: the selected model must support every modality and media format included in a column's `multi_modal_context`.

For more information on how they are used, see below:

- **[Model Providers](/concepts/models/model-providers)**
- **[Model Configurations](/concepts/models/model-configs)**
- **[Image Context](/tutorials/providing-images-as-context)**
- **[Image and Media Context](/tutorials/providing-images-as-context)**
- **[Generating Images](/tutorials/generating-images)**
6 changes: 4 additions & 2 deletions fern/versions/latest/pages/concepts/columns.mdx
@@ -45,6 +45,8 @@ LLM-Text columns generate natural language text: product descriptions, customer

Use **Jinja2 templating** in prompts to reference other columns. Data Designer automatically manages dependencies and injects the referenced column values into the prompt.

LLM-Text and LLM-Structured columns can also include `multi_modal_context` with `ImageContext`, `AudioContext`, or `VideoContext`. Data Designer reads the referenced seed columns and serializes the media blocks, but it does not make an image-only model understand audio or video. Choose a `model_alias` whose underlying provider/model supports every modality in the column.
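
Because Data Designer reads media from the referenced seed columns, it can help to normalize those columns before building the config. The image tutorial recommends storing the raw base64 payload without a `data:<media-type>;base64,` prefix. The sketch below shows one way to do that with the standard library only; `to_raw_base64` is a hypothetical helper, not part of Data Designer, and it assumes a simple prefix with no extra parameters.

```python
import base64
import binascii
import re

# Matches a leading data-URI header such as "data:image/png;base64,".
# Assumption: the header carries no extra parameters (e.g. ";charset=...").
_DATA_URI_PREFIX = re.compile(r"^data:[\w.+-]+/[\w.+-]+;base64,")


def to_raw_base64(value: str) -> str:
    """Return the bare base64 payload, dropping any data-URI prefix.

    Hypothetical helper for preparing seed columns; not a Data Designer API.
    """
    payload = _DATA_URI_PREFIX.sub("", value.strip())
    try:
        # validate=True rejects characters outside the base64 alphabet.
        base64.b64decode(payload, validate=True)
    except binascii.Error as exc:
        raise ValueError("value is not valid base64") from exc
    return payload
```

Applied to a seed DataFrame, this could be as simple as `df["image_base64"] = df["image_base64"].map(to_raw_base64)`, where the column name is illustrative.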

<Note>
Generation Traces
LLM columns can optionally capture message traces in a separate `{column_name}__trace` column. Set `with_trace` on the column config to control what's captured: `TraceType.NONE` (default, no trace), `TraceType.LAST_MESSAGE` (final assistant message only), or `TraceType.ALL_MESSAGES` (full conversation history). The trace includes the ordered message history for the final generation attempt (system/user/assistant/tool calls/tool results), and may include model reasoning fields when the provider exposes them.
@@ -126,11 +128,11 @@ Image columns require a model configured with `ImageInferenceParams`. Model-spec
- **Preview** (`data_designer.preview()`): Images are stored as base64-encoded strings directly in the DataFrame for quick iteration
- **Create** (`data_designer.create()`): Images are saved to disk in an `images/<column_name>/` folder with UUID filenames; the DataFrame stores relative paths

Image columns also support `multi_modal_context` for autoregressive models that accept image inputs, enabling image-to-image generation workflows.
Image columns also support `multi_modal_context` for autoregressive multimodal models that accept media inputs, enabling image-to-image and other media-conditioned image generation workflows. Diffusion image-generation routes do not consume multimodal context, and not every autoregressive image model accepts every media type.

<Tip>
Tutorials
The image tutorials cover three workflows: [Providing Images as Context](/tutorials/providing-images-as-context) (image → text), [Generating Images](/tutorials/generating-images) (text → image), and [Editing Images with Image Context](/tutorials/image-to-image-editing) (image → image).
The image tutorials cover three workflows: [Providing Images as Context](/tutorials/providing-images-as-context) (image → text, with notes on audio/video-capable models), [Generating Images](/tutorials/generating-images) (text → image), and [Editing Images with Image Context](/tutorials/image-to-image-editing) (image → image).
</Tip>

### 🧬 Embedding Columns
Original file line number Diff line number Diff line change
@@ -48,7 +48,7 @@ The following model configurations are automatically available when `NVIDIA_API_
|-------|-------|----------|---------------------|
| `nvidia-text` | `nvidia/nemotron-3-nano-30b-a3b` | General text generation | `temperature=1.0, top_p=1.0` |
| `nvidia-reasoning` | `nvidia/nemotron-3-super-120b-a12b` | Reasoning and analysis tasks | `temperature=1.0, top_p=0.95, extra_body={"reasoning_effort": "medium"}` |
| `nvidia-vision` | `nvidia/nemotron-nano-12b-v2-vl` | Vision and image understanding | `temperature=0.85, top_p=0.95` |
| `nvidia-vision` | `nvidia/nemotron-3-nano-omni-30b-a3b-reasoning` | Omni multimodal understanding for image, audio, and video inputs | `temperature=0.60, top_p=0.95` |
| `nvidia-embedding` | `nvidia/llama-3.2-nv-embedqa-1b-v2` | Text embeddings | `encoding_format="float", extra_body={"input_type": "query"}` |


@@ -59,8 +59,8 @@ The following model configurations are automatically available when `OPENAI_API_
| Alias | Model | Use Case | Inference Parameters |
|-------|-------|----------|---------------------|
| `openai-text` | `gpt-4.1` | General text generation | `temperature=0.85, top_p=0.95` |
| `openai-reasoning` | `gpt-5` | Reasoning and analysis tasks | `temperature=0.35, top_p=0.95` |
| `openai-vision` | `gpt-5` | Vision and image understanding | `temperature=0.85, top_p=0.95` |
| `openai-reasoning` | `gpt-5` | Reasoning and analysis tasks | `extra_body={"reasoning_effort": "medium"}` |
| `openai-vision` | `gpt-5` | Vision and image understanding | `extra_body={"reasoning_effort": "medium"}` |
| `openai-embedding` | `text-embedding-3-large` | Text embeddings | `encoding_format="float"` |

### OpenRouter Models
@@ -71,9 +71,13 @@ The following model configurations are automatically available when `OPENROUTER_
|-------|-------|----------|---------------------|
| `openrouter-text` | `nvidia/nemotron-3-nano-30b-a3b` | General text generation | `temperature=1.0, top_p=1.0` |
| `openrouter-reasoning` | `openai/gpt-oss-20b` | Reasoning and analysis tasks | `temperature=0.35, top_p=0.95` |
| `openrouter-vision` | `nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free` | Vision and image understanding | `temperature=0.60, top_p=0.95` |
| `openrouter-vision` | `nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free` | Omni multimodal understanding for image, audio, and video inputs, subject to OpenRouter model support | `temperature=0.60, top_p=0.95` |
| `openrouter-embedding` | `openai/text-embedding-3-large` | Text embeddings | `encoding_format="float"` |

<Note title="Modality support depends on the model">
The `multi_modal_context` field can include image, audio, and video contexts, but each model/provider combination has its own accepted input formats, media-size limits, and modality mix. Use an image-capable model for image-only workflows, and use an omni or otherwise multimodal model before sending audio or video context.
</Note>


## Using Default Settings

Original file line number Diff line number Diff line change
@@ -9,6 +9,8 @@ Model configurations define the specific models you use for synthetic data gener

A `ModelConfig` specifies which LLM model to use and how it should behave during generation. When you create column configurations (like `LLMText`, `LLMCode`, or `LLMStructured`), you reference a model by its alias. Data Designer uses the model configuration to determine which model to call and with what parameters.

When a column includes `multi_modal_context`, the `ModelConfig` alias must point to a model that supports the media types you send. Data Designer can serialize image, audio, and video context blocks, but model capability is still provider-specific.
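
Since model capability is provider-specific, one option is to keep your own record of what each alias accepts and check it before launching a run. The following is a minimal sketch of that idea; the `MODEL_MODALITIES` table, the alias names, and `unsupported_modalities` are all hypothetical and would need to be filled in from each provider's documentation. Data Designer itself is not assumed to expose such a lookup.

```python
from typing import Iterable

# Hypothetical capability table that you maintain yourself.
# The aliases and modality sets below are placeholders, not real defaults.
MODEL_MODALITIES: dict[str, set[str]] = {
    "image-only-alias": {"image"},
    "omni-alias": {"image", "audio", "video"},
}


def unsupported_modalities(model_alias: str, requested: Iterable[str]) -> list[str]:
    """Return the requested modalities that the alias is not known to support.

    Unknown aliases are treated as supporting nothing, so every requested
    modality is reported for them.
    """
    supported = MODEL_MODALITIES.get(model_alias, set())
    return sorted(set(requested) - supported)
```

An empty result means every modality in the column's `multi_modal_context` is covered by your table; a non-empty result is a signal to pick a different alias or drop the extra context before sending requests.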

## ModelConfig Structure

The `ModelConfig` class has the following fields:
@@ -81,13 +83,13 @@ model_configs = [
max_tokens=4096,
),
),
# Vision tasks
# Omni multimodal tasks
dd.ModelConfig(
alias="vision-model",
model="nvidia/nemotron-nano-12b-v2-vl",
model="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
provider="nvidia",
inference_parameters=dd.ChatCompletionInferenceParams(
temperature=0.7,
temperature=0.60,
top_p=0.95,
max_tokens=2048,
),
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
---
title: "Providing Images as Context"
description: "Multimodal prompts with image inputs."
description: "Multimodal prompts with image inputs and notes for audio/video-capable models."
position: 5
---

2 changes: 1 addition & 1 deletion fern/versions/latest/pages/notebooks/README.mdx
@@ -11,6 +11,6 @@ These tutorials walk through Data Designer end-to-end with executable Jupyter no
| [The Basics](/tutorials/the-basics) | Declare columns, generate your first dataset |
| [Structured Outputs, Jinja Expressions, and Conditional Generation](/tutorials/structured-outputs-jinja-expressions-and-conditional-generation) | Schema-constrained outputs and dynamic prompts |
| [Seeding with an External Dataset](/tutorials/seeding-with-an-external-dataset) | Use existing data as input for generation |
| [Providing Images as Context](/tutorials/providing-images-as-context) | Multimodal prompts with image inputs |
| [Providing Images as Context](/tutorials/providing-images-as-context) | Multimodal prompts with image inputs, plus the media-context pattern for models that support audio or video |
| [Generating Images](/tutorials/generating-images) | Create image columns from text prompts |
| [Image-to-Image Editing](/tutorials/image-to-image-editing) | Edit images using image context |