Describe the bug
When I use a custom dataset in OpenAI chat format with tool-call preservation enabled (use_preserving_dataset=true), I see the following error:
▶ Setting up data...
Using PreservingDataset to preserve heterogeneous tool argument schemas without None-filling.
Loaded dataset using PreservingDataset: 8000 samples.
Traceback (most recent call last):
File "/opt/nemo-rl/examples/run_sft.py", line 213, in <module>
main()
File "/opt/nemo-rl/examples/run_sft.py", line 185, in main
dataset, val_dataset = setup_data(tokenizer, config["data"])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/nemo-rl/examples/run_sft.py", line 89, in setup_data
merged_data = concatenate_datasets([data.dataset for data in data_list])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/nemo_rl_venv/lib/python3.12/site-packages/datasets/combine.py", line 209, in concatenate_datasets
raise ValueError(
ValueError: Expected a list of Dataset objects or a list of IterableDataset objects, but element at position 0 is a PreservingDataset.
2026-03-16 17:50:05,449 ERR cli.py:73 -- ----------------------------------------------
2026-03-16 17:50:05,449 ERR cli.py:74 -- Job 'test-sft-megatron-job-gkwmz-67pfr' failed
2026-03-16 17:50:05,449 ERR cli.py:75 -- ----------------------------------------------
2026-03-16 17:50:05,449 INFO cli.py:88 -- Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
main()
File "/opt/nemo-rl/examples/run_sft.py", line 185, in main
dataset, val_dataset = setup_data(tokenizer, config["data"])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/nemo-rl/examples/run_sft.py", line 89, in setup_data
merged_data = concatenate_datasets([data.dataset for data in data_list])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/nemo_rl_venv/lib/python3.12/site-packages/datasets/combine.py", line 209, in concatenate_datasets
raise ValueError(
ValueError: Expected a list of Dataset objects or a list of IterableDataset objects, but element at position 0 is a PreservingDataset.
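For context, the failure looks like a plain isinstance check inside datasets.combine: PreservingDataset is not a subclass of Dataset or IterableDataset, so concatenate_datasets rejects it before any actual merging happens. A stand-in sketch of that check (hypothetical simplification, not the real implementation in the datasets package):

```python
# Stand-in classes mimicking the types involved; the real ones live in
# the `datasets` package and in NeMo-RL respectively.
class Dataset: ...
class IterableDataset: ...
class PreservingDataset: ...

def concatenate_datasets(dsets):
    # Simplified version of the type check that raises in combine.py.
    for i, d in enumerate(dsets):
        if not isinstance(d, (Dataset, IterableDataset)):
            raise ValueError(
                f"Expected a list of Dataset objects or a list of "
                f"IterableDataset objects, but element at position {i} "
                f"is a {type(d).__name__}."
            )

try:
    concatenate_datasets([PreservingDataset()])
except ValueError as e:
    print(e)
```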
Steps/Code to reproduce bug
- Pull v0.5.0.nemotron_3_super
- Kick off an SFT job with a custom dataset.
use_preserving_dataset is set to true in the data config. The following is the complete set of overrides:
uv run examples/run_sft.py \
--config examples/configs/sft.yaml \
sft.max_num_epochs=1 \
policy.model_name=openai/gpt-oss-20b \
policy.tokenizer.name=openai/gpt-oss-20b \
~policy.tokenizer.chat_template \
policy.train_global_batch_size=32 \
policy.train_micro_batch_size=1 \
policy.max_total_sequence_length=1024 \
policy.sequence_packing.enabled=false \
policy.dtensor_cfg.enabled=false \
policy.megatron_cfg.enabled=true \
policy.megatron_cfg.sequence_parallel=false \
policy.megatron_cfg.expert_model_parallel_size=8 \
policy.megatron_cfg.tensor_model_parallel_size=1 \
policy.megatron_cfg.pipeline_model_parallel_size=1 \
policy.make_sequence_length_divisible_by=1 \
policy.megatron_cfg.freeze_moe_router=false \
policy.megatron_cfg.moe_router_dtype=fp64 \
policy.megatron_cfg.moe_router_load_balancing_type=aux_loss \
policy.megatron_cfg.moe_router_bias_update_rate=1e-3 \
+policy.megatron_cfg.env_vars.NRL_MEGATRON_CHECKPOINT_DIR=/fsx/models/megatron/gpt-oss-20b \
policy.megatron_cfg.optimizer.use_distributed_optimizer=true \
policy.megatron_cfg.optimizer.use_precision_aware_optimizer=true \
policy.megatron_cfg.optimizer.params_dtype=float32 \
logger.wandb_enabled=false \
logger.tensorboard_enabled=false \
logger.mlflow_enabled=true \
logger.mlflow.experiment_name=nemo-rl-experiments \
logger.mlflow.run_name=gpt-oss-20b-tool-call-sft-run \
+logger.mlflow.tracking_uri=<tracking_uri> \
cluster.gpus_per_node=8 \
cluster.num_nodes=2 \
checkpointing.checkpoint_dir=/fsx/ramasud/ws/dpo-k8s/checkpoints-megatron/gpt-oss-sft-20b/ramasud-nemo-tool-call-sft-ray \
checkpointing.keep_top_k=5 \
data.train.dataset_name=openai_format \
++data.train.data_path=/fsx/ramasud/ws/dpo-k8s/custom_datasets/xlam_toolcalls/train_data.jsonl \
~data.train.split \
~data.validation \
++data.default.system_key=system_key \
++data.default.chat_key=messages \
++data.default.tool_key=tools \
~data.default.processor \
++data.default.use_preserving_dataset=true
Expected behavior
SFT job to finish running successfully.
Additional context
- The custom dataset (xlam in OpenAI chat format) has the following structure:
[{'role': 'user',
'content': 'I need to know the GCD of 90 and 120, the monthly mortgage for $190,000 at 2.8% over 15 years, the standard deviation of [1, 1, 2, 2, 3, 3, 4, 4], and the cosine similarity of [0, 1, 0, 1] and [1, 0, 1, 0].'},
{'role': 'assistant',
'content': {},
'tool_calls': [{'id': 'call_0',
'type': 'function',
'function': {'name': 'greatest_common_divisor',
'arguments': '{"a": 90, "b": 120}'}},
{'id': 'call_1',
'type': 'function',
'function': {'name': 'monthly_mortgage_payment',
'arguments': '{"loan_amount": 190000, "annual_interest_rate": 0.028, "loan_term_years": 15}'}},
{'id': 'call_2',
'type': 'function',
'function': {'name': 'calculate_standard_deviation',
'arguments': '{"numbers": [1, 1, 2, 2, 3, 3, 4, 4]}'}},
{'id': 'call_3',
'type': 'function',
'function': {'name': 'cosine_similarity',
'arguments': '{"vector_a": [0, 1, 0, 1], "vector_b": [1, 0, 1, 0]}'}}]}]
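The per-call argument schemas in the sample above are all different, which is exactly why use_preserving_dataset is needed (a columnar Arrow-backed dataset would None-fill the missing keys). A quick check on the argument strings copied from the sample turn:

```python
import json

# Tool-call argument strings copied verbatim from the sample above.
arguments = [
    '{"a": 90, "b": 120}',
    '{"loan_amount": 190000, "annual_interest_rate": 0.028, "loan_term_years": 15}',
    '{"numbers": [1, 1, 2, 2, 3, 3, 4, 4]}',
    '{"vector_a": [0, 1, 0, 1], "vector_b": [1, 0, 1, 0]}',
]

# Each call has a disjoint key set: four calls, four distinct schemas.
schemas = [sorted(json.loads(a)) for a in arguments]
print(schemas)
```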
This is blocking my day-to-day work; if there are any workarounds, please suggest them.
Thanks in advance.
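The only workaround I can think of (untested; merge_data is a hypothetical stand-in for the concatenation step in setup_data, not actual NeMo-RL code) is to bypass concatenation entirely when only one dataset is configured, which is my case since validation is removed via ~data.validation:

```python
def merge_data(data_list):
    # Hypothetical sketch: return the sole dataset untouched instead of
    # calling datasets.concatenate_datasets, which rejects PreservingDataset.
    if len(data_list) == 1:
        return data_list[0]
    raise TypeError("don't know how to concatenate PreservingDataset objects")

# With a single train file and no validation split, data_list has exactly
# one entry, so the single-dataset path would be taken.
print(merge_data(["train"]))
```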