SenseTime-FVG
diff --git a/README.md b/README.md
Lines changed: 51 additions & 11 deletions
diff --git a/README_intro_zh.md b/README_intro_zh.md
Lines changed: 5 additions & 3 deletions
diff --git a/configs/README.md b/configs/README.md
Lines changed: 73 additions & 2 deletions
@@ -10,11 +10,13 @@ The driving world models generate multi-view images or videos of autonomous driv
 
 The highlights are as follows:
 
-1. **Significant improvement in the environmental diversity.** Through the use of multiple datasets, the model's generalization ability has been enhanced like never before. Take the example of a generation task controlled by layout conditions, such as a snowy city street or a lakeside highway with distant snow mountains, these scenarios are impossible tasks for generative models trained with a single dataset.
+1. **Transparent and reproducible training.** We provide complete training code and configurations, allowing everyone to reproduce experiments, fine-tune on their own data, and customize features as needed.
 
-2. **Greatly improved generation quality.** Support for popular model architectures (SD 2.1, 3.5) enables more convenient utilization of the advanced pre-training generation capabilities within the community. Various training techniques, including multitasking and self-supervision, allow the model to utilize the information in autonomous driving video data more effectively.
+2. **Significant improvement in environmental diversity.** Training on multiple datasets substantially improves the model's generalization. For example, in generation controlled by layout conditions, scenes such as a snowy city street or a lakeside highway with distant snow mountains are out of reach for generative models trained on a single dataset.
 
-3. **Convenient evaluation.** Evaluation follows the popular framework `torchmetrics`, which is easy to configure, develop, and integrate into the pipeline. Public configurations (such as FID, FVD on the nuScenes validation set) are provided to align other research works.
+3. **Greatly improved generation quality.** Support for popular model architectures (SD 2.1, 3.5) makes it easier to build on the community's advanced pretrained generation capabilities. Various training techniques, including multitasking and self-supervision, let the model use the information in autonomous driving video data more effectively.
+
+4. **Convenient evaluation.** Evaluation follows the popular `torchmetrics` framework, which is easy to configure, extend, and integrate into the pipeline. Public configurations (such as FID and FVD on the nuScenes validation set) are provided for alignment with other research work.
 
 Furthermore, our code modules are designed with high reusability in mind, for easy application in other projects.
 
@@ -30,6 +32,7 @@ Currently, the project has implemented the following papers:
 
 ## News
 
+* [2025/4/23] Update the [LiDAR VQVAE (including KITTI-360), LiDAR generation models](#lidar-models), and release the [DFoT on CTSD 3.5 model](#video-models).
 * [2025/3/17] Experimental release the [Interactive Generation with Carla](docs/InteractiveGeneration.md)
 * [2025/3/7] Release the [LiDAR Generation](#lidar-models)
 * [2025/3/4] Release the [CTSD 3.5 with layout condition](#video-models)
@@ -72,16 +75,22 @@ Our cross-view temporal SD (CTSD) pipeline support loading the pretrained SD 2.1
 | [SD 2.1](https://huggingface.co/stabilityai/stable-diffusion-2-1) | [Config](configs/ctsd/multi_datasets/ctsd_21_tirda_nwao.json), [Download](http://103.237.29.236:10030/ctsd_21_tirda_nwao_30k.pth) | [Config](configs/ctsd/multi_datasets/ctsd_21_tirda_bm_nwa.json), [Download](http://103.237.29.236:10030/ctsd_21_tirda_bm_nwa_30k.pth) |
 | [SD 3.0](https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers) | | [UniMLVG Config](configs/ctsd/unimlvg/ctsd_unimlvg_stage3_tirda_bm_nwa.json), [Download](http://103.237.29.236:10030/ctsd_unimlvg_tirda_bm_nwa_60k.pth) |
 | [SD 3.5](https://huggingface.co/stabilityai/stable-diffusion-3.5-medium) | [Config](configs/ctsd/multi_datasets/ctsd_35_tirda_nwao.json), [Download](http://103.237.29.236:10030/ctsd_35_tirda_nwao_20k.pth) | [Config](configs/ctsd/multi_datasets/ctsd_35_tirda_bm_nwao.json), [Download](http://103.237.29.236:10030/ctsd_35_tirda_bm_nwao_40k.pth) |
+| [DFoT](https://arxiv.org/abs/2502.06764) on [SD 3.5](https://huggingface.co/stabilityai/stable-diffusion-3.5-medium) | | [Config](configs/ctsd/multi_datasets/ctsd_35_df16_tirda_bm_nwao.json), [Download](http://103.237.29.236:10030/ctsd_35_df16_tirda_bm_nwao_40k.pth) |
+
+The FVD evaluation results for all downloadable models can be found at the bottom of the corresponding configuration files.
 
 ### LiDAR Models
 
 You can download our pre-trained tokenizer and generation models from the links below.
 
-| Model Architecture | Configs | Checkpoint Download |
-| :-: | :-: | :-: |
-| VQVAE | [Config](configs/lidar/lidar_vqvae_nwa.json) | [checkpoint](http://103.237.29.236:10030/lidar_vqvae_nwa_60k.pth), [blank code ](http://103.237.29.236:10030/lidar_vqvae_nwa_60k_blank_code.pkl) |
-| MaskGIT | [Config](configs/lidar/lidar_maskgit_layout_ns.json)| [checkpoint](http://103.237.29.236:10030/lidar_maskgit_nusc_150k.pth) |
-| Temporal MaskGIT |  |  |
+| Model Architecture | Dataset | Configs | Checkpoint Download |
+| :-: | :-: | :-: | :-: |
+| VQVAE | nuScenes, Waymo, Argoverse | [Config](configs/lidar/lidar_vqvae_nwa.json) | [checkpoint](http://103.237.29.236:10030/lidar_vqvae_nwa_60k.pth), [blank code](http://103.237.29.236:10030/lidar_vqvae_nwa_60k_blank_code.pkl) |
+| | nuScenes, Waymo, Argoverse, KITTI-360 | [Config](configs/lidar/lidar_vqvae_nwak.json) | [checkpoint](http://103.237.29.236:10030/lidar_vqvae_nwak_80k.pth), [blank code](http://103.237.29.236:10030/lidar_vqvae_nwak_80k_blank_code.pkl) |
+| MaskGIT | nuScenes | [Config](configs/lidar/lidar_maskgit_layout_ns.json) | [ckpt_with_vqvae_nwa](http://103.237.29.236:10030/lidar_maskgit_nusc_150k.pth) <br> [ckpt_with_vqvae_nwak](http://103.237.29.236:10030/lidar_maskgit_vq80k_layout_ns_120k.pth) |
+| | KITTI-360 | [Config](configs/lidar/lidar_maskgit_vq80k_layout_kt.json) | [checkpoint](http://103.237.29.236:10030/lidar_maskgit_vq80k_layout_kt_120k.pth) |
+| Temporal MaskGIT | nuScenes | [Config](configs/lidar/lidar_maskgit_temporal_vq80k_layout_ns.json) | checkpoint (TODO) |
+| | KITTI-360 | [Config](configs/lidar/lidar_maskgit_temporal_vq80k_layout_kt.json) | checkpoint (TODO) |
 ## Examples
 
 ### T2I, T2V generation with CTSD pipeline
@@ -106,13 +115,20 @@ PYTHONPATH=src python src/dwm/preview.py -c examples/ctsd_35_6views_video_genera
 
 1. Download LiDAR VQVAE and LiDAR MaskGIT generation model checkpoint.
 2. Prepare the dataset ( [nuscenes_scene-0627_lidar_package.zip](http://103.237.29.236:10030/nuscenes_scene-0627_lidar_package.zip) ).
-3. Modify the values of `json_file`, `vq_point_cloud_ckpt_path`, `vq_blank_code_path` and `model_ckpt_path` to the paths of your dataset and checkpoints in the json file `examples/lidar_maskgit_preview.json` .
-4. Run the following command to visualize the LiDAR of the validation set and save the generated point cloud as `.bin` file.
+3. In the JSON file `examples/lidar_maskgit_preview.json` or `examples/lidar_maskgit_temporal_preview.json`, set `json_file`, `vq_point_cloud_ckpt_path`, `vq_blank_code_path` and `model_ckpt_path` to the paths of your dataset and checkpoints.
+4. For single-frame LiDAR generation, run the following command to visualize the LiDAR of the validation set and save the generated point clouds as `.bin` files.
 
 ```bash
-PYTHONPATH=src python src/dwm/preview.py -c examples/lidar_maskgit_preview.json -o output/test
+PYTHONPATH=src python src/dwm/preview.py -c examples/lidar_maskgit_preview.json -o output/single_frame_maskgit
 ```
 
+5. For LiDAR sequence generation, the `enable_autoregressive_inference` flag is enabled in the config file to support autoregressive generation. To use ground truth data as reference frames, set `use_ground_truth_as_reference` to `true`; set it to `false` to generate from the layout condition only. After setting up the config file, run the following command:
+
+```bash
+PYTHONPATH=src python3 -m torch.distributed.run --nnodes 1 --nproc-per-node 2 --node-rank 0 --master-addr 127.0.0.1 --master-port 29000 src/dwm/preview.py -c examples/lidar_maskgit_temporal_preview.json -o output/temporal_maskgit
+```
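
Note on step 3: the path edits can also be scripted. The following is only a minimal sketch, assuming the four keys may sit at any nesting depth of the preview config (the real layout of `examples/lidar_maskgit_preview.json` is not shown in this diff). The helper `override_keys`, the stand-in config, and every path are hypothetical placeholders.

```python
import json


def override_keys(node, overrides):
    """Return a copy of a JSON tree with every matching key replaced."""
    if isinstance(node, dict):
        return {
            key: (overrides[key] if key in overrides
                  else override_keys(value, overrides))
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [override_keys(item, overrides) for item in node]
    return node


# Stand-in for the parsed preview config; the nesting here is an assumption.
config = json.loads("""
{
    "pipeline": {
        "vq_point_cloud_ckpt_path": "TODO",
        "vq_blank_code_path": "TODO",
        "model_ckpt_path": "TODO"
    },
    "validation_dataset": {"json_file": "TODO"}
}
""")

# Placeholder paths; replace them with the locations of your downloads.
patched = override_keys(config, {
    "json_file": "/path/to/dataset.json",
    "vq_point_cloud_ckpt_path": "/path/to/lidar_vqvae.pth",
    "vq_blank_code_path": "/path/to/blank_code.pkl",
    "model_ckpt_path": "/path/to/lidar_maskgit.pth",
})
print(json.dumps(patched, indent=4))
```

In practice the stand-in would be replaced by `json.load()` on the example file, and `patched` written back with `json.dump()` before running the preview command.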
+
+
 ## Train
 
 Preparation:
@@ -165,3 +181,27 @@ Or distributed evaluation by `torch.distributed.run`, similar to the distributed
   * `tools` provides dataset and file processing scripts for faster initialization and reading.
 
 Introduction about the [file system](src/dwm/fs/README.md), and [dataset](src/dwm/datasets/README.md).
+
+## Citation
+If you find our OpenDWM useful in your research or refer to the provided baseline results, please star :star: this repository and consider citing our repo or papers :pencil::
+```
+@misc{opendwm,
+  Year = {2025},
+  Note = {https://github.com/SenseTime-FVG/OpenDWM},
+  Title = {OpenDWM: Open Driving World Models}
+}
+
+@article{chen2024unimlvg,
+  title={UniMLVG: Unified Framework for Multi-view Long Video Generation with Comprehensive Control Capabilities for Autonomous Driving},
+  author={Chen, Rui and Wu, Zehuan and Liu, Yichen and Guo, Yuxin and Ni, Jingcheng and Xia, Haifeng and Xia, Siyu},
+  journal={arXiv preprint arXiv:2412.04842},
+  year={2024}
+}
+
+@article{ni2025maskgwm,
+  title={MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction},
+  author={Ni, Jingcheng and Guo, Yuxin and Liu, Yichen and Chen, Rui and Lu, Lewei and Wu, Zehuan},
+  journal={arXiv preprint arXiv:2502.11663},
+  year={2025}
+}
+```
@@ -12,11 +12,13 @@ https://github.com/user-attachments/assets/649d3b81-3b1f-44f9-9f51-4d1ed7756476
 
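 The highlights are as follows:
 
-1. **Significant improvement in environmental diversity.** By using multiple datasets, the model's generalization ability is improved to an unprecedented degree. Taking layout-conditioned generation as an example, a snowy city street or a lakeside highway with distant snow mountains are impossible tasks for generative models trained on only a single dataset.
+1. **Transparent and reproducible training.** We provide complete training code and configurations, so that everyone can reproduce experiments, fine-tune on their own data, and customize features as needed.
 
-2. **Greatly improved generation quality.** Support for popular model architectures (SD 2.1, 3.5) makes it more convenient to use the advanced pretrained generation capabilities within the community. Various training techniques, including multitasking and self-supervision, let the model use the information in video data more effectively.
+2. **Significant improvement in environmental diversity.** By using multiple datasets, the model's generalization ability is improved to an unprecedented degree. Taking layout-conditioned generation as an example, a snowy city street or a lakeside highway with distant snow mountains are impossible tasks for generative models trained on only a single dataset.
 
-3. **Convenient evaluation.** Evaluation follows the popular framework `torchmetrics`, which is easy to configure, develop, and integrate into existing pipelines. Some public configurations (such as FID, FVD on the nuScenes validation set) are used to align with other research work.
+3. **Greatly improved generation quality.** Support for popular model architectures (SD 2.1, 3.5) makes it more convenient to use the advanced pretrained generation capabilities within the community. Various training techniques, including multitasking and self-supervision, let the model use the information in video data more effectively.
+
+4. **Convenient evaluation.** Evaluation follows the popular framework `torchmetrics`, which is easy to configure, develop, and integrate into existing pipelines. Some public configurations (such as FID, FVD on the nuScenes validation set) are used to align with other research work.
 
 In addition, the code modules we designed take a considerable degree of reusability into account, for easy application in other projects.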
 
 
@@ -2,9 +2,80 @@
 
 The configuration files are in the JSON format. They include settings for the models, datasets, pipelines, or any arguments for the program.
 
+## Introduction
+
+In our code, we mainly use JSON objects in three ways:
+
+1. As a dictionary
+2. As a function's parameter list
+3. As a constructor and parameter for objects
+
+### As a dictionary
+
+This is the most common usage in the configs, for example:
+
+```JSON
+{
+    "guidance_scale": 4,
+    "inference_steps": 40,
+    "preview_image_size": [
+        448,
+        252
+    ]
+}
+```
+
+The pipeline looks up each value in the dictionary by its key, and the values determine its behavior at runtime.
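
Note: a minimal sketch of this dictionary usage, using only the standard `json` module. The variable names are illustrative and not taken from the pipeline code.

```python
import json

# The same JSON object as in the example, embedded as text for illustration.
config_text = """
{
    "guidance_scale": 4,
    "inference_steps": 40,
    "preview_image_size": [448, 252]
}
"""

config = json.loads(config_text)

# Values are read by key; JSON arrays deserialize to Python lists.
guidance_scale = config["guidance_scale"]
inference_steps = config["inference_steps"]
preview_image_size = tuple(config["preview_image_size"])

print(guidance_scale, inference_steps, preview_image_size)
```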
+
+### As a function's parameter list
+
+The content of a JSON object is passed into a function, for example:
+
+```JSON
+{
+    "num_workers": 3,
+    "prefetch_factor": 3,
+    "persistent_workers": true
+}
+```
+
+The PyTorch data loader accepts all the arguments by unpacking the deserialized object:
+
+```Python
+data_loader = torch.utils.data.DataLoader(
+    dataset, **deserialized_json_object)
+```
+
+In this case, you can fill in the required parameters according to the reference documentation of the function (such as the [data loader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) here).
+
+### As a constructor and parameter for objects
+
+The JSON object declares the name of the object to be created, as well as the parameters, for example:
+
+```JSON
+{
+    "_class_name": "torch.optim.AdamW",
+    "lr": 6e-5,
+    "betas": [
+        0.9,
+        0.975
+    ]
+}
+```
+
+The "_class_name" is in the format of `{name_space}.{class_or_function_name}`, and other key-value pairs are used as parameters for the class constructor (e.g. [AdamW](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html#torch.optim.AdamW) here) or the function.
+
+In the code, this type of object is parsed by the `dwm.common.create_instance_from_config()` function.
+
+With this design, the configuration, framework, and components are **loosely coupled**. For example, a user can easily switch to a third-party optimizer such as `bitsandbytes.optim.Adam8bit` without editing the code. Developers can provide any component class (e.g. dataset, data transforms) without having to register it with a specific framework.
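
Note: the idea behind `_class_name` can be sketched with `importlib`. This is an illustrative resolver written under my own assumptions, not the actual implementation of `dwm.common.create_instance_from_config()`. The function name `create_instance` is hypothetical, and `fractions.Fraction` stands in for `torch.optim.AdamW` so that the sketch runs with the standard library only.

```python
import importlib
import json


def create_instance(config: dict):
    """Hypothetical resolver: import `_class_name`, pass the rest as kwargs."""
    kwargs = dict(config)  # copy so the caller's config is not modified
    module_name, _, attr_name = kwargs.pop("_class_name").rpartition(".")
    target = getattr(importlib.import_module(module_name), attr_name)
    return target(**kwargs)


# "fractions.Fraction" follows the {name_space}.{class_or_function_name} form.
config = json.loads(
    '{"_class_name": "fractions.Fraction", "numerator": 3, "denominator": 4}')

value = create_instance(config)
print(value)  # 3/4
```

Because the target is resolved by its import path at runtime, any importable class or function can be named in the config without a registration step, which is the loose coupling described above.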
+
+## Development
+
+### Naming convention
+
 The configs in this folder mainly describe the pipelines and are consumed by `src/dwm/train.py`, so they are named in the format `{pipeline_name}_{model_config}_{condition_config}_{data_config}.json`.
 
 * Pipeline name: the Python script name in `src/dwm/pipelines`.
-* Model config: the most discriminative model arguments, such as `image`, `lidar`, `joint` for the holodrive models, or `spatial`, `crossview`, `temporal` for the SD models.
+* Model config: the most discriminative model arguments, such as `spatial`, `crossview`, `temporal` for the SD models.
 * Condition config: the additional input for the model, such as `ts` for the "text description per scene", `ti` for the "text description per image", `b` for the box condition, `m` for the map condition.
-* Data config: `mini` for the debug purpose. Or combination of `nuscenes`, `argoverse`, `waymo`, `opendv`, for the data components. For some dataset, use `k` for "key frames", `a` for "all frames".
+* Data config: `mini` for debugging, or a combination of `nuscenes`, `argoverse`, `waymo`, `opendv` (or their initial letters) for the data components.