The official implementation of our NeurIPS 2025 Spotlight paper:
ReSim: Reliable World Simulation for Autonomous Driving
Jiazhi Yang, Kashyap Chitta, Shenyuan Gao, Long Chen, Yuqian Shao, Xiaosong Jia, Hongyang Li, Andreas Geiger, Xiangyu Yue, Li Chen
Primary contact: Jiazhi Yang (jzyang@link.cuhk.edu.hk)
ReSim is a driving world model for reliable simulation of future ego-view driving videos under a wide range of ego behaviors.
- Reliable action control. ReSim supports action-conditioned future prediction for expert, free-driving, commanded, and hazardous non-expert behaviors.
- Heterogeneous training data. The model is trained on a mixture of web driving videos, real driving datasets with action or trajectory labels, and simulated data containing non-expert behavior.
- High-fidelity open-world prediction. ReSim targets realistic future driving video generation while improving controllability for both expert and non-expert actions.
- [2026/05/05] Initial public code release.
- Initial code release with training and inference scripts.
- Release pretrained ReSim world-model weights trained on OpenDV + NavSim (expert actions only, without CARLA data).
- Release pretrained ReSim world-model weights trained on OpenDV + NavSim + CARLA data (expert + non-expert actions).
The main ReSim pipeline was developed with Python 3.10, PyTorch 2.4, CUDA 12.4, SAT (SwissArmyTransformer), and DeepSpeed-style distributed training.
conda create -n resim python=3.10 -y
conda activate resim
pip install torch==2.4.0 torchvision==0.19.0 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
cd SwissArmyTransformer
pip install -e . --no-build-isolation
cd ..

The launch scripts under sat/ also add the vendored SwissArmyTransformer directory to PYTHONPATH, so inference can run from a clean checkout after the Python dependencies are installed.
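As a quick sanity check of the environment, a short script like the following can confirm the interpreter, PyTorch build, and CUDA runtime match the versions above (a minimal sketch; check_env.py is not a script shipped with the repository):

# check_env.py -- hypothetical helper: verify the Python, PyTorch, and CUDA
# versions match the install commands above.
import sys

import torch

print(f"Python  : {sys.version.split()[0]}")      # expect 3.10.x
print(f"PyTorch : {torch.__version__}")           # expect 2.4.0+cu124
print(f"CUDA OK : {torch.cuda.is_available()}")   # expect True on a GPU node
if torch.cuda.is_available():
    print(f"CUDA rt : {torch.version.cuda}")      # expect 12.4
    print(f"GPUs    : {torch.cuda.device_count()}")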
ReSim uses SAT-format CogVideoX video diffusion checkpoints plus the text encoder and VAE components required by the configs. Download the public asset bundle from Hugging Face:
pip install -U huggingface_hub
huggingface-cli download OpenDriveLab-org/ReSim_Assets \
--repo-type model \
--local-dir checkpoints/CogVideoX-2b-sat

The deprecated Tsinghua mirror download for vae.zip and transformer.zip is no longer required. After downloading, the component directory should contain:
checkpoints/CogVideoX-2b-sat/
|-- transformer/
| |-- latest
| `-- <iteration>/mp_rank_00_model_states.pt
|-- vae/3d-vae.pt
`-- t5-v1_1-xxl/
|-- config.json
|-- model-00001-of-00002.safetensors
|-- model-00002-of-00002.safetensors
|-- model.safetensors.index.json
|-- spiece.model
`-- tokenizer_config.json
The transformer directory stores the SAT checkpoint, vae/3d-vae.pt is used
by the video autoencoder, and t5-v1_1-xxl provides the frozen text encoder and
tokenizer files. If you place the assets elsewhere, keep the same internal
directory structure and update the config paths accordingly.
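To catch path mistakes before editing any configs, a small check like the following can verify the layout (an illustrative sketch based on the directory tree above; check_assets.py is not part of the repository):

# check_assets.py -- hypothetical helper: confirm the downloaded
# CogVideoX-2b-sat bundle has the layout the ReSim configs expect.
from pathlib import Path

ROOT = Path("checkpoints/CogVideoX-2b-sat")  # adjust if assets live elsewhere
EXPECTED = [
    "transformer/latest",
    "vae/3d-vae.pt",
    "t5-v1_1-xxl/config.json",
    "t5-v1_1-xxl/spiece.model",
    "t5-v1_1-xxl/tokenizer_config.json",
]

missing = [rel for rel in EXPECTED if not (ROOT / rel).exists()]
if missing:
    raise SystemExit(f"missing assets under {ROOT}: {missing}")
print("asset layout looks complete")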
Before running training or inference, copy an example config and update:
- args.load: ReSim or base transformer checkpoint directory.
- model.conditioner_config...FrozenT5Embedder.params.model_dir: T5 directory.
- model.first_stage_config.params.ckpt_path: VAE checkpoint path.
- args.train_data and args.valid_data: dataset annotation files.
For example, the checkpoint-related fields should point to the downloaded asset directory:
args:
  load: "checkpoints/CogVideoX-2b-sat/transformer"

model:
  conditioner_config:
    params:
      emb_models:
        - params:
            model_dir: "checkpoints/CogVideoX-2b-sat/t5-v1_1-xxl"
  first_stage_config:
    params:
      ckpt_path: "checkpoints/CogVideoX-2b-sat/vae/3d-vae.pt"

The ReSim loaders are JSON-driven. Real driving and simulator datasets use the shared schema consumed by sat/data_share.py:
{
  "meta": {
    "data_root": "/path/to/image/root"
  },
  "clips": [
    {
      "img_seq": ["scene/frame_000.jpg", "scene/frame_001.jpg"],
      "cmd": "Moving_Forward",
      "traj_fut": [[0.0, 0.0, 0.0], [1.0, 0.1, 0.0]],
      "lidar_pc_token": "sample-token"
    }
  ]
}

Important fields:
- img_seq is a list of frame paths relative to meta.data_root. The loader also supports img_seq_his plus img_seq_fut.
- cmd can be a string such as Moving_Forward, Turning_Left, or Turning_Right, or an integer mapped by sat/data_utils.py.
- traj_fut stores future trajectory points as [x, y, heading]. The default configs use 8 future points.
- lidar_pc_token or token is used to name generated outputs.
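An annotation file in this schema can be generated with a few lines of Python. The sketch below writes a minimal example matching the fields above (make_annotations.py, the frame paths, and the token value are illustrative placeholders, not repository code):

# make_annotations.py -- emit a minimal annotation JSON in the shared schema
# read by sat/data_share.py. All concrete values here are placeholders.
import json

annotations = {
    "meta": {"data_root": "/path/to/image/root"},
    "clips": [
        {
            # frame paths relative to meta.data_root
            "img_seq": [f"scene/frame_{i:03d}.jpg" for i in range(10)],
            "cmd": "Moving_Forward",
            # 8 future [x, y, heading] points, matching the default configs
            "traj_fut": [[1.0 * i, 0.0, 0.0] for i in range(1, 9)],
            "lidar_pc_token": "sample-token",
        }
    ],
}

with open("train_annotations.json", "w") as f:
    json.dump(annotations, f, indent=2)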
For web-driving data, sat/data_youtube.py expects clips with folder_name,
first_frame, end_frame, and flow_direction.
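A web-driving clip entry might look like the sketch below; only the four field names come from the description above, so the nesting and value formats shown here are assumptions:

# Hypothetical clip entry for sat/data_youtube.py. Only the field names
# (folder_name, first_frame, end_frame, flow_direction) are documented;
# the values and their types are illustrative guesses.
youtube_clip = {
    "folder_name": "youtube/channel_abc/video_123",  # assumed path layout
    "first_frame": 0,
    "end_frame": 200,
    "flow_direction": "forward",  # assumed value format
}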
Run ReSim training through sat/train_video.py and the provided launcher:
cd sat
# CFG, GPUS, NNODES, optional SEED
bash finetune_multi_gpus_custom.sh configs/train.yaml 8 1 42

For single-GPU debugging:
cd sat
bash finetune_single_gpu_custom.sh configs/train.yaml

Before launching a real run, check the copied config:
- args.mode: finetune
- data.target, for example data_multi.MultiSourceDataset or data_waymo.WaymoDataset
- data.params.video_size, fps, max_num_frames, and crop mode
- DeepSpeed batch size, gradient accumulation, precision, and save interval
- train_data_weights when mixing heterogeneous data sources (illustrated in the sketch after this list)
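To illustrate what per-source weights mean in practice, here is a minimal sketch of weighted sampling across data sources; it is a conceptual stand-in, not the repository's data_multi.MultiSourceDataset implementation:

# Conceptual illustration of train_data_weights: each training example is
# drawn from a source with probability proportional to its weight. This is
# not the actual data_multi.MultiSourceDataset code.
import random

sources = {
    "opendv": ["opendv_clip_0", "opendv_clip_1"],  # web driving videos
    "navsim": ["navsim_clip_0"],                   # labeled real driving
    "carla": ["carla_clip_0"],                     # simulated non-expert data
}
train_data_weights = {"opendv": 0.5, "navsim": 0.3, "carla": 0.2}

def sample_clip(rng: random.Random) -> str:
    names = list(sources)
    weights = [train_data_weights[n] for n in names]
    source = rng.choices(names, weights=weights, k=1)[0]
    return rng.choice(sources[source])

rng = random.Random(42)
print([sample_clip(rng) for _ in range(5)])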
Training writes checkpoints under args.save and stores the merged training
config with the run.
Run ReSim sampling through sat/sample_video.py and the provided launcher:
cd sat
bash inference_custom.sh configs/infer_nus.yaml

The example inference config uses input_type: dataset; it loads validation
clips, conditions on the first frames, optionally applies fut_traj, and writes
MP4 samples.
Common inference options are config-driven:
- args.sampling_video_size: output frame size, for example [512, 896].
- args.sampling_num_frames: latent-frame count, commonly 13, 11, or 9 (see the note after this list).
- args.n_prediction_round: autoregressive rollout rounds.
- args.apply_traj: whether to condition on fut_traj.
- args.save_gt and args.concat_gt_for_demo: whether to save ground-truth clips and side-by-side demo videos.
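For intuition on sampling_num_frames: the CogVideoX 3D VAE compresses time by a factor of 4 (keeping the first frame uncompressed), so a latent-frame count n typically decodes to 4*(n-1)+1 video frames. The helper below is a sketch under that assumption:

# Sketch: decoded video length for a given latent-frame count, assuming the
# CogVideoX 3D VAE's 4x temporal compression with the first frame uncompressed.
def decoded_frames(sampling_num_frames: int, temporal_factor: int = 4) -> int:
    return temporal_factor * (sampling_num_frames - 1) + 1

for n in (13, 11, 9):
    print(n, "latent frames ->", decoded_frames(n), "video frames")
# 13 -> 49, 11 -> 41, 9 -> 33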
- ModuleNotFoundError: No module named 'sat': install the vendored SAT package with pip install -e SwissArmyTransformer.
- Paths in example configs must be replaced with paths on your machine before running.
This implementation builds on the SAT training stack from CogVideoX, SwissArmyTransformer, and other open-source video diffusion components. We thank all maintainers for their open-source contributions.
If this project is useful for your research, please cite:
@inproceedings{yang2025resim,
  title={ReSim: Reliable World Simulation for Autonomous Driving},
  author={Jiazhi Yang and Kashyap Chitta and Shenyuan Gao and Long Chen and Yuqian Shao and Xiaosong Jia and Hongyang Li and Andreas Geiger and Xiangyu Yue and Li Chen},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}

The repository includes an Apache-2.0 LICENSE file. Model weights may be governed by separate terms in MODEL_LICENSE. Check the licenses of SAT, CARLA, nuScenes, Waymo, nuPlan, OpenDV, and any redistributed annotations before public release or commercial use.

