Benchmarking Multi-View BEV Object Detection with Mixed Pinhole and Fisheye Cameras
Xiangzhong Liu, Hao Shen
fortiss GmbH, Munich, Germany
ICRA 2026
Modern autonomous driving systems increasingly rely on mixed camera configurations with pinhole and fisheye cameras for full-view perception. However, Bird's-Eye View (BEV) 3D object detection models are predominantly designed for pinhole cameras, leading to performance degradation under fisheye distortion. To bridge this gap, we introduce a multi-view BEV detection benchmark with mixed cameras by converting KITTI-360 into nuScenes format. Our study encompasses three adaptations: rectification for zero-shot evaluation and fine-tuning of nuScenes-trained models, distortion-aware view transformation modules (VTMs) via the MEI camera model, and polar coordinate representations to better align with radial distortion. We systematically evaluate three representative BEV architectures—BEVFormer, BEVDet and PETR—across these strategies. We demonstrate that projection-free architectures are inherently more robust and effective against fisheye distortion. This work establishes the first real-data 3D detection benchmark with fisheye and pinhole images and provides systematic adaptation and practical guidelines for designing robust and cost-effective 3D perception systems.
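The MEI (unified) camera model referenced above projects a 3D point onto a unit sphere before the perspective divide, which is what lets it describe both pinhole and fisheye lenses. A minimal sketch with made-up intrinsics, omitting the model's radial/tangential distortion terms (function and parameter names are ours, not the repo's):

```python
import math

def mei_project(X, Y, Z, xi, fx, fy, cx, cy):
    """Project a camera-frame 3D point with the MEI unified camera model.

    The point is normalized onto the unit sphere, the projection centre is
    shifted by the mirror parameter xi along the optical axis, and the result
    is mapped through a standard pinhole matrix (distortion terms omitted).
    """
    d = math.sqrt(X * X + Y * Y + Z * Z)  # distance used for the sphere step
    denom = Z + xi * d                    # shifted perspective divide
    u = fx * (X / denom) + cx
    v = fy * (Y / denom) + cy
    return u, v
```

With xi = 0 the model reduces to an ordinary pinhole projection, which is why a single distortion-aware VTM can in principle serve both camera types.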
FishBEVOD/
├── fishBEVDet/ # BEVDet adapted with MEI distortion-aware VTM
├── fishbevformer/ # BEVFormer adapted with MEI distortion-aware VTM
├── fishPETR/ # PETR adapted with MEI distortion-aware 3D PE
├── kitti360_nusc_converter/ # KITTI-360 → nuScenes format conversion pipeline
└── nuscenes-devkit/ # Customized nuScenes devkit (local, for visualization and evaluation)
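The polar BEV representation evaluated in the paper can be illustrated with a small coordinate-binning helper; bin counts and range below are placeholder values, not the repo's actual grid configuration:

```python
import math

def cartesian_to_polar_cell(x, y, r_max=51.2, n_r=64, n_theta=64):
    """Map a BEV point (x, y) in metres to a (radius_bin, angle_bin) cell.

    Illustrative only. Radial/angular bins align naturally with fisheye
    distortion, which grows with the angle away from the optical axis.
    """
    r = math.hypot(x, y)
    theta = math.atan2(y, x)  # in [-pi, pi]
    r_bin = min(int(r / r_max * n_r), n_r - 1)
    t_bin = min(int((theta + math.pi) / (2 * math.pi) * n_theta), n_theta - 1)
    return r_bin, t_bin
```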
The following environment setup has only been tested for PETR and BEVFormer. For BEVDet, please refer to the original environment instructions in fishBEVDet/, since it uses a different version of mmdetection3d.
- Linux (tested on Ubuntu 20.04)
- Python 3.8
- CUDA 11.1
- NVIDIA GPU (experiments run on NVIDIA A5000)
conda create -n fishbevod python=3.8 -y
conda activate fishbevod
pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 \
    -f https://download.pytorch.org/whl/torch_stable.html
pip install mmcv-full==1.4.0 \
    -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.9.0/index.html
pip install mmdet==2.14.0
pip install mmsegmentation==0.14.1
git clone https://github.com/open-mmlab/mmdetection3d.git
cd mmdetection3d
git checkout v0.17.1
python setup.py install
cd ..
pip install einops fvcore seaborn iopath==0.1.9 timm==0.6.13 \
    typing-extensions==4.5.0 pylint ipython==8.12 numpy==1.19.5 \
    matplotlib==3.5.2 numba==0.48.0 pandas==1.4.4 \
    scikit-image==0.19.3 setuptools==59.5.0
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'

This repository ships a customized nuScenes devkit with support for fisheye visualization and mixed-camera configurations. Install it locally instead of the PyPI version:
cd nuscenes-devkit/setup
pip install -r requirements.txt
cd ..
pip install -e .
cd ..

Download the KITTI-360 dataset from the official website and place it under data/kitti360/. Since the dataset is converted to nuScenes format, you may also name the directory nuscenes and the vkitti360-trainval split v1.0-trainval if you prefer.
Expected layout:
data/kitti360/
├── calibration/
├── data_2d_raw/
├── data_3d_bboxes/
└── data_poses/
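A quick sanity check for the layout above can save a failed conversion run; this helper is written for this README and is not part of the converter:

```python
import os

# Top-level KITTI-360 directories the conversion pipeline expects.
EXPECTED_DIRS = ["calibration", "data_2d_raw", "data_3d_bboxes", "data_poses"]

def check_kitti360_layout(root):
    """Return the list of expected KITTI-360 subdirectories missing under root."""
    return [d for d in EXPECTED_DIRS if not os.path.isdir(os.path.join(root, d))]
```

An empty return value means the layout matches; otherwise the missing directories are listed.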
cd kitti360_nusc_converter
# Full trainval conversion
python convert_kitti360_to_nuscenes.py \
--kitti360_root ../data/kitti360 \
--output_dir ../data/kitti360
# (Optional) Create a mini split for debugging
python create_mini.py \
--input_dir ../data/kitti360 \
--output_dir ../data/vkitti360-mini
cd ..

The current converter produces a vkitti360-trainval split (~300 scenes, ~68K samples) compatible with the nuScenes devkit.
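In the nuScenes schema that the converter targets, samples within a scene are doubly linked by 32-character hex tokens, with empty strings marking scene boundaries. An illustrative sketch of that linking (not the converter's actual code):

```python
import uuid

def link_samples(timestamps):
    """Build a nuScenes-style doubly linked list of sample records.

    Each record carries a 32-character hex token plus prev/next tokens;
    empty strings mark the first and last sample of a scene, as in the
    real devkit schema.
    """
    samples = [{"token": uuid.uuid4().hex, "timestamp": t, "prev": "", "next": ""}
               for t in timestamps]
    for a, b in zip(samples, samples[1:]):
        a["next"] = b["token"]
        b["prev"] = a["token"]
    return samples
```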
To generate the rectified 6-camera baseline:
cd kitti360_nusc_converter
python rectify_fisheye_images.py \
--kitti360_root ../data/kitti360 \
--output_dir ../data/kitti360
cd ..

cd fishBEVDet
python tools/create_data_bevdet.py \
--root-path ../data/kitti360 \
--out-dir ../data/kitti360
cd ..

All models are trained for 24 or 20 epochs (depending on the model) with a batch size of 2, using AdamW with cosine annealing. Distributed training across multiple NVIDIA GPUs is recommended.
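The cosine-annealed schedule has a simple closed form; a sketch with placeholder learning-rate bounds (the configs define the actual values):

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max=2e-4, lr_min=1e-7):
    """Learning rate at a given step under cosine annealing.

    Decays smoothly from lr_max at step 0 to lr_min at total_steps,
    matching the shape of torch.optim.lr_scheduler.CosineAnnealingLR.
    """
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / total_steps))
```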
cd fishBEVDet
bash tools/dist_train.sh configs/bevdet/bevdet-r50-xxx.py 2 --validate
cd ..

cd fishbevformer
bash tools/dist_train.sh projects/configs/bevformer/bevformer_small_s_xxx.py 2
cd ..

cd fishPETR
bash tools/dist_train.sh projects/configs/petr/petr_vovnet_gridmask_p4_xxx.py 2
cd ..

# BEVDet
cd fishBEVDet
bash tools/dist_test.sh <config> <checkpoint> 2 --eval bbox
cd ..
# BEVFormer
cd fishbevformer
bash tools/dist_test.sh <config> <checkpoint> 2 --eval bbox
cd ..
# PETR
cd fishPETR
bash tools/dist_test.sh <config> <checkpoint> 2 --eval bbox
cd ..

This project is released under the MIT License.
If you find this work useful, please cite:
@article{liu2026fishbevod,
title = {Benchmarking Multi-View BEV Object Detection with Mixed Pinhole and Fisheye Cameras},
author = {Liu, Xiangzhong and Shen, Hao},
journal = {arXiv preprint arXiv:2603.27818},
year = {2026}
}

This work builds upon the following projects, to which we are grateful:
- KITTI-360 — Liao et al., KITTI-360: A Novel Dataset and Benchmarks for Urban Scene Understanding in 2D and 3D, TPAMI 2022. The dataset and its rich sensor suite (stereo + fisheye cameras with LiDAR annotations) form the foundation of our benchmark.
- nuscenes-devkit — Caesar et al., nuScenes: A multimodal dataset for autonomous driving, CVPR 2020. We extend the devkit to support fisheye visualization and mixed-camera evaluation.
- PETR — Liu et al., PETR: Position Embedding Transformation for Multi-View 3D Object Detection, ECCV 2022. The projection-free BEV architecture that we adapt with MEI distortion-aware 3D positional encoding and polar coordinates.
- BEVFormer — Li et al., BEVFormer: Learning Bird's-Eye-View Representation from LiDAR-Camera via Spatiotemporal Transformers, TPAMI 2024. The backward-projection BEV architecture adapted with MEI-based spatial cross-attention.
- BEVDet — Huang et al., BEVDet: High-performance Multi-Camera 3D Object Detection in Bird-Eye View, arXiv 2021. The forward-projection BEV architecture adapted with MEI-based depth lifting.