---
size_categories: {{ size_categories }}
tags:
{% for tag in tags %}
- {{ tag }}
{% endfor %}
---
This dataset was generated using NeMo Data Designer, a framework for creating high-quality synthetic datasets from scratch or from seed data.
NeMo Data Designer goes beyond simple LLM prompting. It provides:
- Diverse data generation using statistical samplers, LLMs, or existing seed datasets
- Relationship control between fields with dependency-aware generation
- Quality validation with built-in Python, SQL, and custom local and remote validators
- LLM-as-a-judge scoring for quality assessment
- Fast iteration with preview mode before full-scale generation
For more information, visit: https://github.com/NVIDIA-NeMo/DataDesigner
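To illustrate what dependency-aware generation and validation mean in practice, the sketch below samples one field and then draws a second field conditioned on it, using only the Python standard library. It does not use the NeMo Data Designer API; the value pools and the function names `generate_record`, `is_valid`, and `generate_dataset` are hypothetical and exist only for this illustration.

```python
import random

# Hypothetical value pools; not taken from any real Data Designer config.
DEPARTMENTS = ["cardiology", "oncology", "pediatrics"]
TITLES_BY_DEPARTMENT = {
    "cardiology": ["cardiologist", "cardiac nurse"],
    "oncology": ["oncologist", "radiation therapist"],
    "pediatrics": ["pediatrician", "pediatric nurse"],
}


def generate_record(rng):
    """Sample an independent field first, then a field that depends on it."""
    department = rng.choice(DEPARTMENTS)
    title = rng.choice(TITLES_BY_DEPARTMENT[department])
    return {"department": department, "title": title}


def is_valid(record):
    """A simple validator: the title must belong to the sampled department."""
    return record["title"] in TITLES_BY_DEPARTMENT[record["department"]]


def generate_dataset(num_records, seed=0):
    """Generate records reproducibly from a seeded random generator."""
    rng = random.Random(seed)
    return [generate_record(rng) for _ in range(num_records)]
```

Because the dependent field is drawn from a pool keyed by the first field, every generated record passes the validator by construction; a framework adds value when the dependent field comes from a less predictable source, such as an LLM.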
Load this dataset for fine-tuning:

    from datasets import load_dataset
    dataset = load_dataset("{{ repo_id }}")
    # Access the data
    df = dataset["train"].to_pandas()

Or with NeMo Data Designer:

    from data_designer.interface.results import DatasetCreationResults
    # Load dataset with all artifacts (analysis, configs, etc.)
    results = DatasetCreationResults.pull_from_hub("{{ repo_id }}")
    # Access the dataset
    df = results.load_dataset()
    # Access the analysis
    analysis = results.load_analysis()
    # Access the config builder
    config_builder = results._config_builder

- Number of records: {% if num_records is defined and num_records is not none %}{{ "{:,}".format(num_records) }}{% else %}N/A{% endif %}
- Number of columns: {{ num_columns }}
- Size category: {{ size_categories }} {% if target_num_records is defined and target_num_records is not none and target_num_records != num_records %}
- Target records: {{ "{:,}".format(target_num_records) }} ({{ "%.1f" | format(percent_complete) if percent_complete is defined and percent_complete is not none else "N/A" }}% complete) {% endif %}
{% if num_samples > 0 %} Here are sample records from the dataset:
{% for idx in range(num_samples) %}
{{ sample_records[idx] | tojson(indent=2) }}{% endfor %} {% else %} No sample records available. {% endif %}
{% if all_columns is defined and all_columns %}
| Column | Type | Description |
|---|---|---|
{% for col_name, dtype in all_columns | dictsort -%}
| {{ col_name }} | {{ dtype }} | {% if column_configs %}{% for col_config in column_configs %}{% if col_config.get('name') == col_name %}{% set col_type = col_config.get('column_type') %}{% if col_type is mapping %}{{ col_type.get('value', '') }}{% elif col_type %}{{ col_type }}{% endif %}{% endif %}{% endfor %}{% endif %} |
{% endfor -%}
{% else %}
No column information available.
{% endif %}
{% if column_stats_by_type %}
{% for col_type in sorted_column_types %} {% set stats_list = column_stats_by_type[col_type] %} {% if stats_list %} {% set col_type_label = col_type.replace("_", " ").title().replace("Llm", "LLM") %}
{% if col_type == "sampler" %}
| Column | Data Type | Unique Values | Sampler Type |
|---|---|---|---|
{% for stat in stats_list -%}
| {{ stat.get('column_name', 'unknown') }} | {{ stat.get('simple_dtype', 'unknown') }} | {% if 'num_unique' in stat and stat['num_unique'] is not none %}{{ stat['num_unique'] }}{% else %}N/A{% endif %} ({% if 'num_unique' in stat and stat['num_unique'] is not none and num_records > 0 %}{{ "%.1f" | format((stat['num_unique'] / num_records * 100)) }}{% else %}0.0{% endif %}%) |
{% endfor -%}
{% elif col_type in ["llm_text", "llm_structured", "llm_code", "llm_judge"] %}
| Column | Data Type | Unique Values | Prompt Tokens (avg) | Completion Tokens (avg) |
|---|---|---|---|---|
{% for stat in stats_list -%}
| {{ stat.get('column_name', 'unknown') }} | {{ stat.get('simple_dtype', 'unknown') }} | {% if 'num_unique' in stat and stat['num_unique'] is not none %}{{ stat['num_unique'] }}{% else %}N/A{% endif %} ({% if 'num_unique' in stat and stat['num_unique'] is not none and num_records > 0 %}{{ "%.1f" | format((stat['num_unique'] / num_records * 100)) }}{% else %}0.0{% endif %}%) | {% if 'prompt_tokens_mean' in stat and stat['prompt_tokens_mean'] is not none %}{{ "%.1f" |
{% endfor -%}
{% else %}
| Column | Data Type | Unique Values | Null Values |
|---|---|---|---|
{% for stat in stats_list -%}
| {{ stat.get('column_name', 'unknown') }} | {{ stat.get('simple_dtype', 'unknown') }} | {% if 'num_unique' in stat and stat['num_unique'] is not none %}{{ stat['num_unique'] }}{% else %}N/A{% endif %} ({% if 'num_unique' in stat and stat['num_unique'] is not none and num_records > 0 %}{{ "%.1f" | format((stat['num_unique'] / num_records * 100)) }}{% else %}0.0{% endif %}%) |
{% endfor -%}
{% endif %}
{% endif %}
{% endfor %} {% elif column_statistics %} {% for stat in column_statistics[:10] %}
- {{ stat.get('column_name', 'unknown') }} ({{ stat.get('column_type', 'unknown') }}): {% if 'num_unique' in stat and stat['num_unique'] is not none %}{{ stat['num_unique'] }} unique values{% if num_records > 0 %} ({{ "%.1f" | format((stat['num_unique'] / num_records * 100)) }}% coverage){% endif %}{% else %}N/A{% endif %}{% if 'num_null' in stat and stat['num_null'] is not none and stat['num_null'] > 0 %}, {{ stat['num_null'] }} nulls{% endif %} {% endfor %} {% if column_statistics | length > 10 %} ... and {{ (column_statistics | length) - 10 }} more columns {% endif %} {% endif %}
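The unique-value percentages reported above are the unique count divided by the number of records. A rough way to recompute comparable figures from loaded records is sketched below. `unique_value_summary` is a hypothetical helper, not part of NeMo Data Designer, and its counts may differ from the reported ones: it assumes that only `None` counts as null and that two values are equal when their JSON serializations match.

```python
import json


def unique_value_summary(records):
    """Count unique and null values per column for a list of dict records."""
    num_records = len(records)
    # Collect every column name that appears in at least one record.
    columns = sorted(set().union(*(record.keys() for record in records)))
    summary = []
    for column in columns:
        values = [record.get(column) for record in records]
        # Assumption: only None is treated as a null value.
        non_null = [value for value in values if value is not None]
        # Serialize to JSON so unhashable values (lists, dicts) can be compared.
        unique = {
            json.dumps(value, sort_keys=True, default=str) for value in non_null
        }
        percent = len(unique) / num_records * 100 if num_records > 0 else 0.0
        summary.append(
            {
                "column_name": column,
                "num_unique": len(unique),
                "percent_unique": round(percent, 1),
                "num_null": len(values) - len(non_null),
            }
        )
    return summary
```

For example, `unique_value_summary(list(dataset["train"]))` summarizes the split loaded with `load_dataset` earlier in this card.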
{% if column_configs %} This dataset was generated with {{ column_configs | length }} column configuration(s).
{% for config_type, count in config_types | dictsort %}
- {{ config_type }}: {{ count }} column(s) {% endfor %}
{% for col_config in column_configs %}
- {{ col_config.get('name', 'unknown') }}: {% set col_type = col_config.get('column_type') %}{% if col_type is mapping %}{{ col_type.get('value', 'unknown') }}{% elif col_type %}{{ col_type }}{% else %}unknown{% endif %} {% endfor %} {% else %} No column configurations available. {% endif %}
{% if metadata %}
{{ metadata | tojson(indent=2) }}{% endif %}
If you use this dataset in your research, please cite:

    @software{data_designer,
      title={NeMo Data Designer: A Framework for Synthetic Dataset Generation},
      author={NVIDIA},
      year={2025},
      url={https://github.com/NVIDIA-NeMo/DataDesigner}
    }