| ObVLN: Breaking Through Obstacles: Vision-and-Language Navigation in Obstructed Environments |
https://dl.acm.org/doi/pdf/10.1145/3664647.3681640 |
https://github.com/honghd16/ObstructedVLN |
| Behavioral Analysis of Vision-and-Language Navigation Agents |
https://openaccess.thecvf.com/content/CVPR2023/papers/Yang_Behavioral_Analysis_of_Vision-and-Language_Navigation_Agents_CVPR_2023_paper.pdf |
https://github.com/Yoark/vln-behave |
| LLM as Copilot for Coarse-grained Vision-and-Language Navigation |
https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/00833.pdf |
None |
| NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation |
https://arxiv.org/pdf/2402.15852 |
https://pku-epic.github.io/NaVid/ |
| AerialVLN: Vision-and-Language Navigation for UAVs |
https://openaccess.thecvf.com/content/ICCV2023/papers/Liu_AerialVLN_Vision-and-Language_Navigation_for_UAVs_ICCV_2023_paper.pdf |
https://github.com/AirVLN/AirVLN |
| Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method |
https://arxiv.org/pdf/2412.09082 |
https://hcplab-sysu.github.io/LH-VLN/ |
| doScenes: An Autonomous Driving Dataset with Natural Language Instruction for Human Interaction and Vision-Language Navigation |
https://arxiv.org/pdf/2412.05893 |
https://www.github.com/rossgreer/doScenes |
| MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation |
https://aclanthology.org/2024.acl-long.529.pdf |
https://github.com/chen-judge/MapGPT/ |
| NL-SLAM for OC-VLN: Natural Language Grounded SLAM for Object-Centric VLN |
https://arxiv.org/pdf/2411.07848 |
https://sonia-raychaudhuri.github.io/nlslam/ |
| Aerial Vision-and-Language Navigation via Semantic-Topo-Metric Representation Guided LLM Reasoning |
https://arxiv.org/pdf/2410.08500 |
None |
| VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation |
https://openreview.net/pdf?id=gdw1zUTABk |
https://github.com/bdaiinstitute/vlfm |
| CityNav: Language-Goal Aerial Navigation Dataset with Geographic Information |
https://arxiv.org/pdf/2406.14240 |
https://water-cookie.github.io/city-nav-proj/ |
| NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models |
https://link.springer.com/chapter/10.1007/978-3-031-72667-5_15 |
https://github.com/GengzeZhou/NavGPT-2 |
| NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models |
https://ojs.aaai.org/index.php/AAAI/article/download/28597/29161 |
https://github.com/GengzeZhou/NavGPT |
| Navigation Instruction Generation with BEV Perception and Large Language Models |
https://link.springer.com/chapter/10.1007/978-3-031-72670-5_21 |
https://github.com/FanScy/BEVInstructor |
| Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel |
https://arxiv.org/pdf/2412.08467 |
https://github.com/wz0919/VLN-SRDF |
| Towards Learning a Generalist Model for Embodied Navigation |
https://openaccess.thecvf.com/content/CVPR2024/papers/Zheng_Towards_Learning_a_Generalist_Model_for_Embodied_Navigation_CVPR_2024_paper.pdf |
https://openaccess.thecvf.com/content/CVPR2024/papers/Zheng_Towards_Learning_a_Generalist_Model_for_Embodied_Navigation_CVPR_2024_paper.pdf |
| CorNav: Autonomous Agent with Self-Corrected Planning for Zero-Shot Vision-and-Language Navigation |
https://aclanthology.org/2024.findings-acl.745.pdf |
https://github.com/liangcici/MO-VLN |
| NavAgent: Multi-scale Urban Street View Fusion For UAV Embodied Vision-and-Language Navigation |
https://arxiv.org/pdf/2411.08579 |
None |
| Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching |
https://arxiv.org/pdf/2311.12751 |
https://github.com/MultimodalGeo/GeoText-1652 |
| SayNav: Grounding Large Language Models for Dynamic Planning to Navigation in New Environments |
https://ojs.aaai.org/index.php/ICAPS/article/download/31506/33666 |
https://www.sri.com/ics/computer-vision/saynav |
| LeLaN: Learning A Language-Conditioned Navigation Policy from In-the-Wild Videos |
https://arxiv.org/pdf/2410.03603 |
https://github.com/NHirose/learning-language-navigation |
| Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks |
https://arxiv.org/pdf/2412.06224 |
https://pku-epic.github.io/Uni-NaVid/ |
| WebVLN: Vision-and-Language Navigation on Websites |
https://ojs.aaai.org/index.php/AAAI/article/view/27878 |
https://github.com/WebVLN/WebVLN |
| SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts |
https://arxiv.org/pdf/2412.05552 |
https://github.com/GengzeZhou/SAME |
| InstructNav: Zero-shot System for Generic Instruction Navigation in Unexplored Environment |
https://openreview.net/pdf?id=fCDOfpTCzZ |
https://github.com/LYX0501/InstructNav |
| VLA-3D: A Dataset for 3D Semantic Scene Understanding and Navigation |
https://semrob.github.io/docs/rss_semrob2024_cr_paper12.pdf |
https://github.com/HaochenZ11/VLA-3D |
| Human-centered In-building Embodied Delivery Benchmark |
https://arxiv.org/pdf/2406.17898 |
https://github.com/PRS-Organization/prs-delivery |
| AeroVerse: UAV-Agent Benchmark Suite for Simulating, Pre-training, Finetuning, and Evaluating Aerospace Embodied World Models |
https://arxiv.org/pdf/2408.15511 |
None |
| NaVILA: Legged Robot Vision-Language-Action Model for Navigation |
https://arxiv.org/pdf/2412.04453 |
https://github.com/yang-zj1026/legged-loco |
| Navigation World Models |
https://arxiv.org/abs/2412.03572 |
None |
| Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology |
https://arxiv.org/pdf/2410.07087 |
None |
| Recent Advances in Robot Navigation via Large Language Models: A Review |
https://www.researchgate.net/profile/Xian-Wei-3/publication/384537380_Recent_Advances_in_Robot_Navigation_via_Large_Language_Models_A_Review/links/66fd2d9b906bca2ac3de2f73/Recent-Advances-in-Robot-Navigation-via-Large-Language-Models-A-Review.pdf |
None |
| VLM-GroNav: Robot Navigation Using Physically Grounded Vision-Language Models in Outdoor Environments |
https://arxiv.org/pdf/2409.20445 |
https://gamma.umd.edu/researchdirections/crowdmultiagent/vlm-gronav/ |
| PRET: Planning with Directed Fidelity Trajectory for Vision and Language Navigation |
https://link.springer.com/chapter/10.1007/978-3-031-72848-8_5 |
https://github.com/iSEE-Laboratory/VLN-PRET |
| Find What You Want: Learning Demand-conditioned Object Attribute Space for Demand-driven Navigation |
https://arxiv.org/pdf/2309.08138 |
https://github.com/whcpumpkin/Demand-driven-navigation |
| Navigation as Attackers Wish? Towards Building Robust Embodied Agents under Federated Learning |
https://aclanthology.org/2024.naacl-long.57.pdf |
None |
| VoroNav: Voronoi-based Zero-shot Object Navigation with Large Language Model |
https://openreview.net/pdf?id=Va7mhTVy5s |
None |
| InfiniteWorld: A Unified Scalable Simulation Framework for General Visual-Language Robot Interaction |
https://arxiv.org/pdf/2412.05789 |
https://github.com/pzhren/InfiniteWorld |
| RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation |
https://arxiv.org/pdf/2412.08591 |
https://github.com/roomtour3d/roomtour3d-NaviLLM |
| UAVs Meet LLMs: Overviews and Perspectives Toward Agentic Low-Altitude Mobility |
https://arxiv.org/pdf/2501.02341 |
https://github.com/Hub-Tian/UAVs_Meet_LLMs |
| Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions |
https://arxiv.org/pdf/2406.19236v1 |
https://github.com/lpercc/HA3D_simulator |
| Why Only Text: Empowering Vision-and-Language Navigation with Multi-modal Prompts |
https://arxiv.org/pdf/2406.02208 |
https://github.com/honghd16/VLN-MP |
| Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation |
https://openaccess.thecvf.com/content/CVPR2022/papers/Chen_Think_Global_Act_Local_Dual-Scale_Graph_Transformer_for_Vision-and-Language_Navigation_CVPR_2022_paper.pdf |
https://github.com/cshizhe/VLN-DUET |
| All Robots in One: A New Standard and Unified Dataset for Versatile, General-Purpose Embodied Agents |
https://arxiv.org/pdf/2408.10899 |
https://openi.pcl.ac.cn/ARIO/ARIO_Dataset |
| Humanoid Locomotion and Manipulation: Current Progress and Challenges in Control, Planning, and Learning |
https://arxiv.org/pdf/2501.02116 |
None |
| ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments |
https://arxiv.org/pdf/2304.03047 |
https://github.com/MarSaKi/ETPNav |
| UEVAVD: A Dataset for Developing UAV’s Eye View Active Object Detection |
https://arxiv.org/pdf/2411.04348 |
https://github.com/Leo000ooo/UEVAVD_dataset |
| FLAME: Learning to Navigate with Multimodal LLM in Urban Environments |
https://arxiv.org/pdf/2408.11051 |
https://github.com/xyz9911/FLAME |
| TopV-Nav: Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation |
https://arxiv.org/pdf/2411.16425 |
None |
| EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models |
https://aclanthology.org/2024.acl-short.33.pdf |
https://github.com/mengfeidu/EmbSpatial-Bench |
| GRUtopia: Dream General Robots in a City at Scale |
https://arxiv.org/pdf/2407.10943v1 |
https://github.com/OpenRobotLab/GRUtopia |
| EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents |
https://arxiv.org/pdf/2501.11858 |
https://github.com/thunlp/EmbodiedEval |
| RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics |
https://arxiv.org/pdf/2411.16537 |
None |
| ReMEmbR: Building and Reasoning Over Long-Horizon Spatio-Temporal Memory for Robot Navigation |
https://arxiv.org/abs/2409.13682 |
https://github.com/NVIDIA-AI-IOT/remembr |
| The Tong Test: Evaluating Artificial General Intelligence Through Dynamic Embodied Physical and Social Interactions |
https://www.sciencedirect.com/science/article/pii/S209580992300293X |
None |
| Mind the Error! Detection and Localization of Instruction Errors in Vision-and-Language Navigation |
https://arxiv.org/pdf/2403.10700 |
https://github.com/intelligolabs/R2RIE-CE |
| SpatialCoT: Advancing Spatial Reasoning through Coordinate Alignment and Chain-of-Thought for Embodied Task Planning |
https://arxiv.org/pdf/2501.10074 |
https://spatialcot.github.io/ |
| ET-Plan-Bench: Embodied Task-level Planning Benchmark Towards Spatial-Temporal Cognition with Foundation Models |
https://arxiv.org/pdf/2410.14682 |
None |
| EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models |
https://openaccess.thecvf.com/content/CVPR2024/papers/Cheng_EgoThink_Evaluating_First-Person_Perspective_Thinking_Capability_of_Vision-Language_Models_CVPR_2024_paper.pdf |
https://github.com/AdaCheng/EgoThink/ |
| HumanoidBench: Simulated Humanoid Benchmark for Whole-Body Locomotion and Manipulation |
https://arxiv.org/pdf/2403.10506 |
https://github.com/carlosferrazza/humanoid-bench |
| Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making |
https://arxiv.org/abs/2410.07166 |
https://github.com/embodied-agent-interface/embodied-agent-interface |
| The One RING: a Robotic Indoor Navigation Generalist |
https://arxiv.org/pdf/2412.14401v1 |
None |
| SARO: Space-Aware Robot System for Terrain Crossing via Vision-Language Model |
https://saro-vlm.github.io/resources/saro_paper.pdf |
None |
| Robust Robot Walker: Learning Agile Locomotion over Tiny Traps |
https://robust-robot-walker.github.io/resources/rrw_paper.pdf |
None |
| RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation |
https://arxiv.org/abs/2311.01455 |
https://github.com/Genesis-Embodied-AI/RoboGen |
| See and Think: Embodied Agent in Virtual Environment (Affiliations: Zhejiang University, University of Washington, Hong Kong University of Science and Technology) |
https://arxiv.org/pdf/2311.15209 |
None |
| EvoAgent: Agent Autonomous Evolution with Continual World Model for Long-Horizon Tasks |
https://arxiv.org/pdf/2502.05907 |
None |
| OpenBench: A New Benchmark and Baseline for Semantic Navigation in Smart Logistics |
https://arxiv.org/pdf/2502.09238 |
https://github.com/EI-Nav/light-map-navigation |
| CityEQA: A Hierarchical LLM Agent on Embodied Question Answering Benchmark in City Space (Affiliations: National University of Defense Technology, Tsinghua University) |
https://arxiv.org/pdf/2502.12532 |
None |
| General Scene Adaptation for Vision-and-Language Navigation |
https://arxiv.org/pdf/2501.17403 |
https://github.com/honghd16/GSA-VLN |
| SpatialRGPT |
https://www.anjiecheng.me/assets/SpatialRGPT/Spatial_RGPT.pdf |
https://github.com/AnjieCheng/SpatialRGPT |
| GTA-UAV |
https://arxiv.org/pdf/2409.16925 |
https://github.com/Yux1angJi/GTA-UAV |
| Vision-and-Language Navigation via Causal Learning |
https://openaccess.thecvf.com/content/CVPR2024/papers/Wang_Vision-and-Language_Navigation_via_Causal_Learning_CVPR_2024_paper.pdf |
https://github.com/CrystalSixone/VLN-GOAT |
| UAV-VLA: Vision-Language-Action System for Large Scale Aerial Mission Generation |
https://arxiv.org/pdf/2501.05014 |
https://github.com/sautenich/uav-vla |
| Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation |
https://arxiv.org/pdf/2409.18313 |
None |
| Embodied Intelligence: A Synergy of Morphology, Action, Perception and Learning (Publication: ACM Computing Surveys, 2025) |
https://dl.acm.org/doi/pdf/10.1145/3717059 |
None |
| UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios |
https://arxiv.org/pdf/2408.17267 |
https://github.com/opendatalab/UrBench |
| OpenFly: A Versatile Toolchain and Large-scale Benchmark for Aerial Vision-Language Navigation |
https://arxiv.org/pdf/2502.18041 |
None |
| Do Visual Imaginations Improve Vision-and-Language Navigation Agents? |
https://arxiv.org/abs/2503.16394 |
None |
| Magma: A Foundation Model for Multimodal AI Agents |
https://www.arxiv.org/pdf/2502.13130 |
https://github.com/microsoft/Magma |
| NavRAG: Generating User Demand Instructions for Embodied Navigation through Retrieval-Augmented LLM |
https://arxiv.org/pdf/2502.11142 |
https://github.com/MrZihan/NavRAG |
| Mem2Ego: Empowering Vision-Language Models with Global-to-Ego Memory for Long-Horizon Embodied Navigation |
https://arxiv.org/pdf/2502.14254 |
None |
| LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding |
https://arxiv.org/pdf/2501.08282 |
https://github.com/appletea233/LLaVA-ST |
| Virtual Community: A Generative Social World for Embodied AI |
https://openreview.net/pdf?id=aRxLDcxFcL |
None |
| MapNav: A Novel Memory Representation via Annotated Semantic Maps for VLM-based Vision-and-Language Navigation |
https://arxiv.org/pdf/2502.13451 |
None |
| Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models (Publication: Transactions on Machine Learning Research, 2024) |
https://openreview.net/pdf?id=yiqeh2ZYUh |
None |
| NAVCON: A Cognitively Inspired and Linguistically Grounded Corpus for Vision and Language Navigation |
https://arxiv.org/pdf/2412.13026 |
None |
| EmbodiedRAG: Dynamic 3D Scene Graph Retrieval for Efficient and Scalable Robot Task Planning |
https://arxiv.org/pdf/2410.23968 |
None |
| MageBench: Bridging Large Multimodal Models to Agents |
https://arxiv.org/pdf/2412.04531 |
https://github.com/microsoft/MageBench |
| Exploring Embodied Multimodal Large Models: Development, Datasets, and Future Directions |
http://arxiv.org/abs/2502.15336 |
None |
| UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces |
https://arxiv.org/pdf/2503.06157 |
https://github.com/EmbodiedCity/UrbanVideo-Bench.code |
| Advancements in Humanoid Robots: A Comprehensive Review and Future Prospects (Publication: IEEE/CAA Journal of Automatica Sinica, Vol. 11, No. 2, February 2024) |
https://ieeexplore.ieee.org/abstract/document/10415857 |
None |
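| Long-Horizon Vision-Language Navigation Platform and Dataset: Toward Intelligent Robots in Complex Environments (Authors: Xinshuai Song, Weixing Chen, Yang Liu, Weikai Chen, Guanbin Li, Liang Lin; Affiliations: Sun Yat-sen University, Independent Researcher, Peng Cheng Laboratory) |
https://arxiv.org/pdf/2412.09082 |
None |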
| Dual-BEV Nav: Dual-layer BEV-based Heuristic Path Planning for Robotic Navigation in Unstructured Outdoor Environments |
https://arxiv.org/pdf/2501.18351 |
None |
| Towards Robust and Secure Embodied AI: A Survey on Vulnerabilities and Attacks |
https://arxiv.org/pdf/2502.13175 |
None |
| Agent Journey Beyond RGB: Unveiling Hybrid Semantic-Spatial Environmental Representations for Vision-and-Language Navigation |
None |
https://github.com/HCI-LMC/VLN-SUSA (coming soon) |
| NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning |
https://arxiv.org/abs/2403.07376 |
https://github.com/expectorlin/NavCoT |
| Lifelong Learning of Large Language Model based Agents: A Roadmap |
https://arxiv.org/pdf/2501.07278 |
None |
| An LLM-based vision and language cobot navigation approach for Human-centric Smart Manufacturing (Publication: Journal of Manufacturing Systems 75 (2024) 299–305) |
https://www.sciencedirect.com/science/article/pii/S0278612524000864 |
None |
| Open3DVQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space |
https://www.arxiv.org/pdf/2503.11094 |
https://github.com/WeichenZh/Open3DVQA |
| Perceive, Reflect, and Plan: Designing LLM Agent for Goal-Directed City Navigation without Instructions |
https://arxiv.org/pdf/2408.04168 |
https://anonymous.4open.science/r/PReP-13B5 |
| EARBench: Towards Evaluating Physical Risk Awareness for Task Planning of Foundation Model-based Embodied AI Agents |
https://arxiv.org/pdf/2408.04449 |
https://github.com/zihao-ai/EARBench |
| FlexVLN: Flexible Adaptation for Diverse Vision-and-Language Navigation Tasks |
https://arxiv.org/pdf/2503.13966 |
None |
| Ground-level Viewpoint Vision-and-Language Navigation in Continuous Environments |
https://arxiv.org/pdf/2502.19024 |
None |
| WMNav: Integrating Visual-Language Models into World Models for Object Goal Navigation |
https://arxiv.org/abs/2503.02247 |
None |
| UniGoal: Towards Universal Zero-shot Goal-oriented Navigation |
https://arxiv.org/abs/2503.10630 |
https://github.com/bagh2178/UniGoal |
| PanoGen++: Domain-adapted text-guided panoramic environment generation for vision-and-language navigation (Publication: Neural Networks 187 (2025) 107320) |
None |
None |
| UNMuTe: Unifying Navigation and Multimodal Dialogue-like Text Generation |
https://arxiv.org/pdf/2408.04423 |
None |
| Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation |
https://arxiv.org/pdf/2503.18065 |
https://github.com/SaDil13/VLN-RAM |
| VertiFormer: A Data-Efficient Multi-Task Transformer for Off-Road Robot Mobility |
https://arxiv.org/pdf/2502.00543 |
https://github.com/mhnazeri/VertiFormer |
| Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning |
https://arxiv.org/pdf/2504.12680 |
https://github.com/EmbodiedCity/Embodied-R.code |
| CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos |
https://arxiv.org/abs/2411.17820 |
https://github.com/ai4ce/CityWalker |
| GeoNav: Empowering MLLMs with Explicit Geospatial Reasoning Abilities for Language-Goal Aerial Navigation |
https://arxiv.org/pdf/2504.09587 |
None |
| World-Consistent Data Generation for Vision-and-Language Navigation |
https://arxiv.org/pdf/2412.06413 |
None |
| TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation |
https://arxiv.org/pdf/2502.07306 |
None |
| Zero-Shot Object Navigation with Vision-Language Models Reasoning |
https://link.springer.com/chapter/10.1007/978-3-031-78456-9_25 |
None |
| Multimodal Pretrained Knowledge for Real-world Object Navigation (Publication: Machine Intelligence Research (2025)) |
https://www.mi-research.net/en/article/pdf/preview/10.1007/s11633-024-1537-x.pdf |
None |
| Instruction-guided path planning with 3D semantic maps for vision-language navigation (Publication: Neurocomputing 625 (2025) 129457) |
https://www.sciencedirect.com/science/article/abs/pii/S0925231225001298?via%3Dihub |
None |
| COSMO: Combination of Selective Memorization for Low-cost Vision-and-Language Navigation |
https://arxiv.org/pdf/2503.24065 |
None |
| ESceme: Vision-and-Language Navigation with Episodic Scene Memory |
https://link.springer.com/article/10.1007/s11263-024-02159-8 |
https://github.com/qizhust/esceme |
| Vi-LAD: Vision-Language Attention Distillation for Socially-Aware Robot Navigation in Dynamic Environments |
https://doi.org/10.48550/arXiv.2503.09820 |
None |
| Advances in Embodied Navigation Using Large Language Models: A Survey |
https://arxiv.org/pdf/2311.00530 |
None |
| NavigateDiff: Visual Predictors are Zero-Shot Navigation Assistants |
https://arxiv.org/pdf/2502.13894 |
None |
| ClipRover: Zero-shot Vision-Language Exploration and Target Discovery by Mobile Robots |
https://arxiv.org/pdf/2502.08791 |
None |
| Aerial Vision-and-Language Navigation with Grid-based View Selection and Map Construction |
https://arxiv.org/pdf/2503.11091 |
None |
| UnrealZoo Gym (no explicit paper title given; project name used instead) |
https://arxiv.org/abs/2412.20977 |
https://github.com/UnrealZoo/unrealzoo-gym |
| WMNavigation (no explicit paper title given; project name used instead) |
https://arxiv.org/pdf/2503.02247 |
https://github.com/B0B8K1ng/WMNavigation |
| KiteRunner: Language-Driven Cooperative Local-Global Navigation Policy with UAV Mapping in Outdoor Environments |
https://arxiv.org/pdf/2503.08330 |
None |
| SmartWay: Enhanced Waypoint Prediction and Backtracking for Zero-Shot Vision-and-Language Navigation |
https://arxiv.org/pdf/2503.10069 |
None |
| Vision-language model-based human-robot collaboration for smart manufacturing: A state-of-the-art survey (Publication: Frontiers of Engineering Management 2025, 12 (1): 177–200) |
https://link.springer.com/content/pdf/10.1007/s42524-025-4136-9.pdf |
None |
| Look Around Before Locating: Considering Content and Structure Information for Visual Grounding |
https://ojs.aaai.org/index.php/AAAI/article/view/32158/34313 |
https://github.com/VILAN-Lab/SSRVG |
| Self-Prompting Analogical Reasoning for UAV Object Detection |
https://ojs.aaai.org/index.php/AAAI/article/view/34026/36181 |
None |
| Learning Fine-Grained Alignment for Aerial Vision-Dialog Navigation |
https://ojs.aaai.org/index.php/AAAI/article/view/32758/34913 |
https://github.com/yifeisu/FELA |
| LogisticsVLN: Vision-Language Navigation For Low-Altitude Terminal Delivery Based on Agentic UAVs |
https://arxiv.org/pdf/2505.03460 |
None |
| Embodied Intelligence for 3D Understanding: A Survey on 3D Scene Question Answering |
https://arxiv.org/pdf/2502.00342 |
None |
| RefDrone: A Challenging Benchmark for Drone Scene Referring Expression Comprehension |
https://arxiv.org/pdf/2502.00392 |
https://github.com/sunzc-sunny/refdrone |
| Framework for Autonomous UAV Navigation and Target Detection in Global-Navigation-Satellite-System-Denied and Visually Degraded Environments |
https://www.mdpi.com/2072-4292/16/3/471/pdf?version=1706191544 |
None |
| Towards Autonomous UAV Visual Object Search in City Space: Benchmark and Agentic Methodology |
https://arxiv.org/abs/2505.08765 |
https://anonymous.4open.science/r/CityAVOS-3DF8 |
| AirVista-II: An Agentic System for Embodied UAVs Toward Dynamic Scene Semantic Understanding |
https://arxiv.org/pdf/2504.09583 |
None |
| UAV Visual Navigation in the Large-Scale Outdoor Environment: A Semantic Map-Based Cognitive Escape Reinforcement Learning Method |
https://ieeexplore.ieee.org/abstract/document/10847926 |
None |
| A Survey of Large Language Model-Powered Spatial Intelligence Across Scales: Advances in Embodied Agents, Smart Cities, and Earth Science |
https://arxiv.org/pdf/2504.09848 |
None |
| General-Purpose Aerial Intelligent Agents Empowered by Large Language Models |
https://arxiv.org/pdf/2503.08302 |
None |
| Beyond the Destination: A Novel Benchmark for Exploration-Aware Embodied Question Answering |
https://arxiv.org/pdf/2503.11117 |
https://github.com/HCPLab-SYSU/EXPRESS-Bench |
| CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory |
https://www.arxiv.org/pdf/2505.05622 |
https://github.com/VinceOuti/CityNavAgent (coming soon) |
| Observation-Graph Interaction and Key-Detail Guidance for Vision and Language Navigation |
https://www.arxiv.org/pdf/2503.11006 |
None |
| UAV-VLN: End-to-End Vision Language guided Navigation for UAVs |
https://arxiv.org/pdf/2504.21432 |
None |
| FlightGPT: Towards Generalizable and Interpretable UAV Vision-and-Language Navigation with Vision-Language Models |
https://arxiv.org/pdf/2505.12835 |
https://github.com/Pendulumclock/FlightGPT |
| Visuomotor Navigation for Embodied Robots With Spatial Memory and Semantic Reasoning Cognition |
https://ieeexplore.ieee.org/abstract/document/10682097 |
None |
| UAV-FlowColosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning |
https://arxiv.org/pdf/2505.15725 |
None |
| UAV-CodeAgents: Scalable UAV Mission Planning via Multi-Agent ReAct and Vision-Language Reasoning |
https://arxiv.org/pdf/2505.07236 |
None |
| Task-Oriented Communications for Visual Navigation with Edge-Aerial Collaboration in Low Altitude Economy |
https://arxiv.org/abs/2504.18317v3 |
https://github.com/fangzr/TOC-Edge-Aerial |
| AirSpatialBot: A Spatially-Aware Aerial Agent for Fine-Grained Vehicle Attribute Recognization and Retrieval |
https://ieeexplore.ieee.org/document/11006099 |
https://github.com/VisionXLab/AirSpatialBot |
| RaceVLA: VLA-based Racing Drone Navigation with Human-like Behaviour |
https://arxiv.org/pdf/2503.02572 |
https://github.com/SerValera/RaceVLA |
| Personalised Explainable Robots Using LLMs |
https://ieeexplore.ieee.org/abstract/document/10974125 |
None |
| CoordField: Coordination Field for Agentic UAV Task Allocation In Low-altitude Urban Scenarios |
https://arxiv.org/pdf/2505.00091 |
None |
| DualMap: Online Open-Vocabulary Semantic Mapping for Natural Language Navigation in Dynamic Changing Scenes |
https://arxiv.org/pdf/2506.01950 |
https://github.com/Eku127/DualMap |
| CognitiveDrone: A VLA Model and Evaluation Benchmark for Real-Time Cognitive Task Solving and Reasoning in UAVs |
https://arxiv.org/pdf/2503.01378 |
https://cognitivedrone.github.io/ |
| Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence |
https://arxiv.org/pdf/2505.23747 |
https://github.com/diankun-wu/Spatial-MLLM |
| BEDI: A Comprehensive Benchmark for Evaluating Embodied Agents on UAVs |
https://arxiv.org/pdf/2505.18229 |
https://github.com/lostwolves/BEDI |
| Loc4Plan: Locating Before Planning for Outdoor Vision and Language Navigation |
https://dl.acm.org/doi/abs/10.1145/3664647.3681518 |
None |
| P3Nav: A Unified Framework for Embodied Navigation Integrating Perception, Planning, and Prediction |
https://arxiv.org/pdf/2503.18525 |
None |
| DOPE: Dual Object Perception-Enhancement Network for Vision-and-Language Navigation |
https://arxiv.org/pdf/2505.00743 |
None |
| AutoSpatial: Visual-Language Reasoning for Social Robot Navigation through Efficient Spatial Reasoning Learning |
https://arxiv.org/pdf/2503.07557 |
https://github.com/Yanko96/AutoSpatial (coming soon) |
| OctoNav: Towards Generalist Embodied Navigation |
https://arxiv.org/pdf/2506.09839 |
https://github.com/buaa-colalab/OctoNav-R1 |
| A Survey of Robotic Navigation and Manipulation with Physics Simulators in the Era of Embodied AI |
https://arxiv.org/pdf/2505.01458 |
None |
| Toward Embodied AGI: A Review of Embodied AI and the Road Ahead |
https://arxiv.org/pdf/2505.14235 |
None |
| USTBench: Benchmarking and Dissecting Spatiotemporal Reasoning of LLMs as Urban Agents |
https://arxiv.org/pdf/2505.17572 |
https://github.com/usail-hkust/USTBench |
| CoNav: Collaborative Cross-Modal Reasoning for Embodied Navigation |
https://arxiv.org/pdf/2505.16663 |
https://github.com/oceanhao/CoNav |
| GaussNav: Gaussian Splatting for Visual Navigation |
https://ieeexplore.ieee.org/abstract/document/10870413/ |
https://github.com/XiaohanLei/GaussNav |
| HOZ++: Versatile Hierarchical Object-to-Zone Graph for Object Navigation |
https://ieeexplore.ieee.org/abstract/document/10933537/ |
https://github.com/sx-zhang/HOZplus |
| EvolveNav: Self-Improving Embodied Reasoning for LLM-Based Vision-Language Navigation |
https://arxiv.org/pdf/2506.01551 |
https://github.com/expectorlin/EvolveNav |
| Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks |
https://arxiv.org/pdf/2503.21696 |
https://github.com/zwq2018/embodied_reasoner |
| EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents |
https://arxiv.org/pdf/2502.09560 |
https://github.com/EmbodiedBench/EmbodiedBench |
| A Survey of Object Goal Navigation for Embodied Artificial Intelligence |
https://www.jos.org.cn/jos/article/abstract/7250 |
None |
| Multimodal Perception for Goal-oriented Navigation: A Survey |
https://arxiv.org/pdf/2504.15643 |
None |
| RoboSense: Large-scale Dataset and Benchmark for Egocentric Robot Perception and Navigation in Crowded and Unstructured Environments |
https://arxiv.org/pdf/2408.15503 |
https://github.com/suhaisheng/RoboSense (coming soon) |
| Vid2Sim: Realistic and Interactive Simulation from Video for Urban Navigation |
https://openaccess.thecvf.com/content/CVPR2025/papers/Xie_Vid2Sim_Realistic_and_Interactive_Simulation_from_Video_for_Urban_Navigation_CVPR_2025_paper.pdf |
https://github.com/Vid2Sim/Vid2Sim (coming soon) |
| Dynam3D: Dynamic Layered 3D Tokens Empower VLM for Vision-and-Language Navigation |
https://arxiv.org/pdf/2505.11383 |
https://github.com/MrZihan/Dynam3D (coming soon) |
| DyNaVLM: Zero-Shot Vision-Language Navigation System with Dynamic Viewpoints and Self-Refining Graph Memory |
https://arxiv.org/pdf/2506.15096 |
None |
| UAVs Meet Agentic AI: A Multidomain Survey of Autonomous Aerial Intelligence and Agentic UAVs |
https://arxiv.org/pdf/2506.08045 |
None |
| A Comprehensive Survey on Underwater Acoustic Target Positioning and Tracking: Progress, Challenges, and Perspectives |
https://arxiv.org/pdf/2506.14165 |
None |
| Prior Does Matter: Visual Navigation via Denoising Diffusion Bridge Models |
https://openaccess.thecvf.com/content/CVPR2025/papers/Ren_Prior_Does_Matter_Visual_Navigation_via_Denoising_Diffusion_Bridge_Models_CVPR_2025_paper.pdf |
https://github.com/hren20/NaiviBridger |
| VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning |
https://arxiv.org/abs/2506.17221 |
https://github.com/Qi-Zhangyang/GPT4Scene-and-VLN-R1 |
| VR-Robo: A Real-to-Sim-to-Real Framework for Visual Robot Navigation and Locomotion |
https://arxiv.org/pdf/2502.01536 |
https://github.com/zst1406217/VR-Robo |
| Embodying Compliant Touch on Drones for Aerial Tactile Navigation |
https://ieeexplore.ieee.org/abstract/document/10806756 |
https://github.com/BioMorphic-Intelligence-Lab/tactile-drone |
| Controllable 3D Outdoor Scene Generation via Scene Graphs |
https://arxiv.org/abs/2503.07152 |
https://github.com/yuhengliu02/control-3d-scene (coming soon) |
| A Survey: Learning Embodied Intelligence from Physical Simulators and World Models |
https://arxiv.org/pdf/2507.00917 |
None (project page: https://github.com/NJU3DV-LoongGroup/Embodied-World-Models-Survey) |
| Aim My Robot: Precision Local Navigation to Any Object |
https://ieeexplore.ieee.org/abstract/document/10887329/ |
None |
| ApexNav: An Adaptive Exploration Strategy for Zero-Shot Object Navigation with Target-centric Semantic Fusion |
https://arxiv.org/pdf/2504.14478 |
https://github.com/Robotics-STAR-Lab/ApexNav (coming soon) |
| FindingDory: A Benchmark to Evaluate Memory in Embodied Agents |
https://arxiv.org/pdf/2506.15635 |
None |
| SkyVLN: Vision-and-Language Navigation and NMPC Control for UAVs in Urban Environments |
https://arxiv.org/pdf/2507.06564 |
无 |
| REGNav: Room Expert Guided Image-Goal Navigation |
https://ojs.aaai.org/index.php/AAAI/article/download/32514/34669 |
https://github.com/leeBooMla/REGNav |
| Embodied navigation |
https://link.springer.com/article/10.1007/s11432-024-4303-8 |
无 |
| FlySearch: Exploring how vision-language models explore |
https://arxiv.org/pdf/2506.02896 |
https://github.com/gmum/FlySearch |
| Towards Audio-visual Navigation in Noisy Environments: A Large-scale Benchmark Dataset and An Architecture Considering Multiple Sound-Sources |
https://ojs.aaai.org/index.php/AAAI/article/view/33608 |
https://github.com/ZhanboShiAI/ENMuS |
| MMCNav: MLLM-empowered Multi-agent Collaboration for Outdoor Visual Language Navigation |
https://dl.acm.org/doi/abs/10.1145/3731715.3733393 |
https://github.com/zzhaesc/MMCNav |
| MoMa-Kitchen: A 100K+ Benchmark for Affordance-Grounded Last-Mile Navigation in Mobile Manipulation |
https://arxiv.org/pdf/2503.11081 |
https://github.com/MoMaKitchen/MoMaKitchen (coming soon) |
| FindAnything: Open-Vocabulary and Object-Centric Mapping for Robot Exploration in Any Environment |
https://arxiv.org/pdf/2504.08603 |
无 |
| Grounded Vision-Language Navigation for UAVs with Open-Vocabulary Goal Understanding |
https://www.arxiv.org/pdf/2506.10756 |
https://github.com/zzzzzyh111/Vision-Language-Fly (coming soon) |
| 基于视觉的无人机定位与导航方法研究综述 |
https://www.ejournal.org.cn/CN/10.12263/DZXB.20240699 |
无 |
| VLM-Nav: Mapless UAV-Navigation Using Monocular Vision Driven by Vision-Language Model |
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5231854 |
https://github.com/gcsarker/vlm_nav |
| Mathematical Reasoning for Unmanned Aerial Vehicles: A RAG-Based Approach for Complex Arithmetic Reasoning |
https://arxiv.org/pdf/2506.04998 |
https://github.com/Mehdiazarafza/UAV-RAG |
| Mem4Nav: Boosting Vision-and-Language Navigation in Urban Environments with a Hierarchical Spatial-Cognition Long-Short Memory System |
https://arxiv.org/pdf/2506.19433v1 |
https://github.com/tsinghua-fib-lab/Mem4Nav (coming soon) |
| StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling |
https://arxiv.org/pdf/2507.05240 |
https://github.com/OpenRobotLab/StreamVLN |
| MorphoNavi: Aerial-Ground Robot Navigation with Object Oriented Mapping in Digital Twin |
https://arxiv.org/pdf/2504.16914 |
无 |
| A Survey of Multi-sensor Fusion Perception for Embodied AI: Background, Methods, Challenges and Prospects |
https://arxiv.org/pdf/2506.19769v1 |
无 |
| MossVLN: Memory-Observation Synergistic System for Continuous Vision-Language Navigation |
https://ieeexplore.ieee.org/abstract/document/11071855 |
https://github.com/OpenMICG/MossVLN |
| Semantic Mapping in Indoor Embodied AI – A Survey on Advances, Challenges, and Future Directions |
https://arxiv.org/pdf/2501.05750 |
无 |
| X-Nav: Learning End-to-End Cross-Embodiment Navigation for Mobile Robots |
https://www.arxiv.org/pdf/2507.14731 |
无 |
| Point2Graph: An End-to-end Point Cloud-based 3D Open-Vocabulary Scene Graph for Robot Navigation |
https://www.arxiv.org/pdf/2409.10350 |
https://github.com/zimingluo/Point2Graph |
| Mobile Robot Navigation Using Hand-Drawn Maps: A Vision Language Model Approach |
https://ieeexplore.ieee.org/abstract/document/11027320/ |
无 |
| Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision |
https://arxiv.org/pdf/2504.02477 |
https://github.com/Xiaofeng-Han-Res/MF-RV |
| AirScape: An Aerial Generative World Model with Motion Controllability |
https://arxiv.org/pdf/2507.08885 |
无 |
| TrackVLA: Embodied Visual Tracking in the Wild |
https://arxiv.org/abs/2505.23189 |
https://github.com/wsakobe/TrackVLA |
| SE-VLN: A Self-Evolving Vision-Language Navigation Framework Based on Multimodal Large Language Models |
https://arxiv.org/abs/2507.13152v1 |
无 |
| SA-GCS: Semantic-Aware Gaussian Curriculum Scheduling for UAV Vision-Language Navigation |
https://arxiv.org/pdf/2508.00390 |
无 |
| Foundation Model Driven Robotics: A Comprehensive Review |
https://arxiv.org/pdf/2507.10087v1 |
无 |
| Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities |
https://arxiv.org/pdf/2507.13019 |
https://github.com/InternRobotics/InternNav |
| 大模型驱动的具身智能:发展与挑战 |
https://www.sciengine.com/SSI/doi/10.1360/SSI-2024-0076 |
无 |
| CogDDN: A Cognitive Demand-Driven Navigation with Decision Optimization and Dual-Process Thinking |
https://arxiv.org/pdf/2507.11334 |
https://github.com/yuehaohuang/CogDDN |
| Recursive Visual Imagination and Adaptive Linguistic Grounding for Vision Language Navigation |
https://arxiv.org/pdf/2507.21450 |
无 |
| MonoDream: Monocular Vision-Language Navigation with Panoramic Dreaming |
https://arxiv.org/pdf/2508.02549v1 |
无 |
| FloNa: Floor Plan Guided Embodied Visual Navigation |
https://ojs.aaai.org/index.php/AAAI/article/download/33601/35756 |
无 |
| Hi AirStar, Guide Me to the Badminton Court |
https://www.arxiv.org/pdf/2507.04430 |
无 |
| Aux-Think: Exploring Reasoning Strategies for Data-Efficient Vision-Language Navigation |
https://arxiv.org/pdf/2505.11886 |
https://github.com/HorizonRobotics/robo_orchard_lab/tree/master/projects/aux_think |
| Open-Nav: Exploring Zero-Shot Vision-and-Language Navigation in Continuous Environment with Open-Source LLMs |
https://arxiv.org/abs/2409.18794 |
https://github.com/YanyuanQiao/Open-Nav |
| osmAG-LLM: Zero-Shot Open-Vocabulary Object Navigation via Semantic Maps and Large Language Models Reasoning |
https://arxiv.org/pdf/2507.12753v1 |
https://anonymous.4open.science/r/osmAG-LLM |
| 基于大模型的具身智能系统综述 |
http://www.aas.net.cn/cn/article/doi/10.16383/j.aas.c240542 |
无 |
| Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps |
https://arxiv.org/pdf/2505.18675 |
https://github.com/fscdc/ReasonMap |
| NaviMaster: Learning a Unified Policy for GUI and Embodied Navigation Tasks |
https://arxiv.org/pdf/2508.02046v1 |
无 |
| UAV-ON: A Benchmark for Open-World Object Goal Navigation with Aerial Agents |
https://arxiv.org/pdf/2508.00288 |
https://github.com/Kyaren/UAV_ON |
| AgriVLN: Vision-and-Language Navigation for Agricultural Robots |
https://arxiv.org/pdf/2508.07406v1 |
https://github.com/AlexTraveling/AgriVLN |
| i2Nav-Robot: A Large-Scale Indoor-Outdoor Robot Dataset for Multi-Sensor Fusion Navigation and Mapping |
https://arxiv.org/pdf/2508.11485 |
https://github.com/i2Nav-WHU/i2Nav-Robot |
| NavBench: Probing Multimodal Large Language Models for Embodied Navigation |
https://arxiv.org/abs/2506.01031 |
https://github.com/NavBench/Evaluation_Code (coming soon) |
| ODYSSEY: Open-World Quadrupeds Exploration and Manipulation for Long-Horizon Tasks |
https://arxiv.org/pdf/2508.08240 |
无 |
| AerialGo: Walking-through City View Generation from Aerial Perspectives |
https://arxiv.org/pdf/2412.00157 |
无 |
| AeroDuo: Aerial Duo for UAV-based Vision and Language Navigation |
https://arxiv.org/abs/2508.15232v1 |
无 |
| Autonomous Exploration with Terrestrial-Aerial Bimodal Vehicles |
https://arxiv.org/pdf/2507.21338v1 |
无 |
| Vision-based Navigation of Unmanned Aerial Vehicles in Orchards: An Imitation Learning Approach |
https://arxiv.org/abs/2508.02617v1 |
无 |
| HA-VLN: A Benchmark for Human-Aware Navigation in Discrete–Continuous Environments with Dynamic Multi-Human Interactions, Real-World Validation, and an Open Leaderboard |
https://arxiv.org/pdf/2503.14229 |
https://github.com/F1y1113/HA-VLN |
| LOVON: Legged Open-Vocabulary Object Navigator |
https://arxiv.org/pdf/2507.06747 |
https://github.com/DaojiePENG/LOVON |
| Sensing, Social, and Motion Intelligence in Embodied Navigation: A Comprehensive Survey |
https://arxiv.org/pdf/2508.15354v1 |
无 |
| Quantifying and Alleviating Co-Adaptation in Sparse-View 3D Gaussian Splatting |
https://arxiv.org/pdf/2508.12720 |
https://github.com/chenkangjie1123/Co-Adaptation-of-3DGS/ |
| RoboNurse-VLA: Robotic Scrub Nurse System based on Vision-Language-Action Model |
https://arxiv.org/pdf/2409.19590 |
无 |
| Actra: Optimized Transformer Architecture for Vision-Language-Action Models in Robot Learning |
https://arxiv.org/pdf/2408.01147 |
无 |
| QUAR-VLA: Vision-Language-Action Model for Quadruped Robots |
https://arxiv.org/pdf/2312.14457 |
无 |
| DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution |
https://arxiv.org/pdf/2411.02359 |
无 |
| General Flow as Foundation Affordance for Scalable Robot Learning |
https://arxiv.org/pdf/2401.11439 |
无 |
| Learning Universal Policies via Text-Guided Video Generation |
https://arxiv.org/pdf/2302.00111 |
无 |
| Learning to Act from Actionless Videos through Dense Correspondences |
https://arxiv.org/pdf/2310.08576 |
无 |
| Compositional Foundation Models for Hierarchical Planning |
https://arxiv.org/pdf/2309.08587 |
无 |
| VIMA: General Robot Manipulation with Multimodal Prompts |
https://vimalabs.github.io/assets/vima_paper.pdf |
无 |
| Prompt a Robot to Walk with Large Language Models |
https://arxiv.org/pdf/2309.09969 |
无 |
| Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning |
https://arxiv.org/pdf/2311.17842 |
无 |
| π0 |
https://arxiv.org/pdf/2410.24164 |
https://github.com/Physical-Intelligence/openpi |
| CogACT |
https://arxiv.org/pdf/2411.19650 |
https://github.com/microsoft/CogACT.git |
| Diffusion-VLA |
https://arxiv.org/abs/2412.03293 |
无 |
| 3D-VLA |
https://arxiv.org/pdf/2403.09631 |
https://github.com/UMass-Foundation-Model/3D-VLA/tree/main |
| TinyVLA |
https://arxiv.org/abs/2409.12514 |
无 |
| TraceVLA |
https://arxiv.org/pdf/2412.10345 |
https://github.com/umd-huang-lab/tracevla |
| Octo |
https://arxiv.org/pdf/2405.12213 |
无 |
| SpatialVLA |
https://arxiv.org/pdf/2501.15830 |
https://github.com/SpatialVLA/SpatialVLA |
| Gemini Robotics |
https://arxiv.org/pdf/2503.20020 |
无 |
| SafeVLA |
https://arxiv.org/abs/2503.03480 |
https://github.com/PKU-Alignment/SafeVLA |
| HybridVLA |
https://arxiv.org/pdf/2503.10631 |
https://github.com/PKU-HMI-Lab/Hybrid-VLA |
| DexVLA |
https://arxiv.org/pdf/2502.05855 |
https://github.com/juruobenruo/DexVLA |
| DexGraspVLA |
https://arxiv.org/abs/2502.20900 |
https://github.com/Psi-Robot/DexGraspVLA |
| UP-VLA |
https://arxiv.org/pdf/2501.18867 |
无 |
| CoT-VLA |
https://arxiv.org/pdf/2503.22020 |
无 |
| UniAct |
https://arxiv.org/abs/2501.10105 |
https://github.com/2toinf/UniAct |
| Dynamic Path Navigation for Motion Agents with LLM Reasoning |
https://arxiv.org/pdf/2503.07323 |
无 |
| SmartWay: Enhanced Waypoint Prediction and Backtracking for Zero-Shot Vision-and-Language Navigation |
https://arxiv.org/pdf/2503.10069 |
无 |
| Vi-LAD: Vision-Language Attention Distillation for Socially-Aware Robot Navigation in Dynamic Environments |
https://arxiv.org/pdf/2503.09820 |
无 |
| UniGoal: Towards Universal Zero-shot Goal-oriented Navigation |
https://arxiv.org/pdf/2503.10630 |
无 |
| PanoGen++: Domain-Adapted Text-Guided Panoramic Environment Generation for Vision-and-Language Navigation |
https://arxiv.org/pdf/2503.09938 |
无 |
| Do Visual Imaginations Improve Vision-and-Language Navigation Agents? |
https://arxiv.org/pdf/2503.16394 |
无 |
| Learned Perceptive Forward Dynamics Model for Safe and Platform-aware Robotic Navigation |
https://arxiv.org/pdf/2504.19322 |
https://github.com/leggedrobotics/fdm |
| Semantic Mapping in Indoor Embodied AI - A Comprehensive Survey and Future Directions |
https://arxiv.org/pdf/2501.05750 |
无 |
| VL-Nav: Real-time Vision-Language Navigation with Spatial Reasoning |
https://arxiv.org/pdf/2502.00931 |
无 |
| TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation |
https://arxiv.org/pdf/2502.07306 |
无 |
| VR-Robo: A Real-to-Sim-to-Real Framework for Visual Robot Navigation and Locomotion |
https://arxiv.org/pdf/2502.01536 |
无 |
| NavigateDiff: Visual Predictors are Zero-Shot Navigation Assistants |
https://arxiv.org/pdf/2502.13894 |
无 |
| Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation |
https://arxiv.org/pdf/2503.18065 |
https://github.com/SaDil13/VLN-RAM |
| COSMO: Combination of Selective Memorization for Low-cost Vision-and-Language Navigation |
https://arxiv.org/pdf/2503.24065 |
无 |
| ForesightNav: Learning Scene Imagination for Efficient Exploration |
https://arxiv.org/pdf/2504.16062 |
https://github.com/uzh-rpg/foresight-nav |
| CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory |
https://arxiv.org/pdf/2505.05622 |
https://github.com/VinceOuti/CityNavAgent |
| NavDP: Learning Sim-to-Real Navigation Diffusion Policy with Privileged Information Guidance |
https://arxiv.org/pdf/2505.08712 |
无 |
| VISTA: Generative Visual Imagination for Vision-and-Language Navigation |
https://arxiv.org/pdf/2505.07868 |
无 |
| Dynam3D: Dynamic Layered 3D Tokens Empower VLM for Vision-and-Language Navigation |
https://arxiv.org/pdf/2505.11383 |
https://github.com/MrZihan/Dynam3D |
| Aux-Think: Exploring Reasoning Strategies for Data-Efficient Vision-Language Navigation |
https://arxiv.org/pdf/2505.11886 |
无 |
| Active Test-time Vision-Language Navigation |
https://arxiv.org/pdf/2506.06630 |
无 |
| BeliefMapNav: 3D Voxel-Based Belief Map for Zero-Shot Object Navigation |
https://arxiv.org/pdf/2506.06487 |
https://github.com/ZiboKNOW/BeliefMapNav |
| TopV-Nav: Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation |
https://arxiv.org/pdf/2411.16425 |
无 |
| CorrectNav: Self-Correction Flywheel Empowers Vision-Language-Action Navigation Model |
https://arxiv.org/pdf/2508.10416 |
无 |
| GC-VLN: Instruction as Graph Constraints for Training-free Vision-and-Language Navigation |
https://arxiv.org/pdf/2509.10454 |
https://github.com/bagh2178/GC-VLN |
| Embodied Navigation Foundation Model |
https://arxiv.org/pdf/2509.12129 |
无 |
| Search-TTA: A Multimodal Test-Time Adaptation Framework for Visual Search in the Wild |
https://arxiv.org/pdf/2505.11350 |
https://github.com/marmotlab/Search-TTA-VLN |
| JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation |
https://arxiv.org/pdf/2509.22548 |
https://github.com/MIV-XJTU/JanusVLN |
| TrackVLA++: Unleashing Reasoning and Memory Capabilities in VLA Models for Embodied Visual Tracking |
https://arxiv.org/pdf/2510.07134 |
无 |
| CANVAS: Commonsense-Aware Navigation System for Intuitive Human-Robot Interaction |
https://arxiv.org/abs/2410.01273 |
https://github.com/worv-ai/canvas |
| NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation |
https://arxiv.org/pdf/2402.15852 |
无 |
| The One RING: a Robotic Indoor Navigation Generalist |
https://arxiv.org/pdf/2412.14401 |
无 |
| Mobility VLA: Multimodal Instruction Navigation with Long-Context VLMs and Topological Graphs |
https://arxiv.org/pdf/2407.07775 |
无 |
| GO-1 |
无 |
无 |
| pi-0.5 |
https://arxiv.org/abs/2504.16054 |
https://github.com/Physical-Intelligence/openpi |
| Hi Robot |
https://arxiv.org/abs/2502.19417 |
https://github.com/Physical-Intelligence/openpi |
| TRACE: A Self-Improving Framework for Robot Behavior Forecasting with Vision-Language Models |
https://arxiv.org/pdf/2503.00761 |
无 |
| AffordGrasp: In-Context Affordance Reasoning for Open-Vocabulary Task-Oriented Grasping in Clutter |
https://arxiv.org/pdf/2503.00778 |
无 |
| Code-as-Symbolic-Planner: Foundation Model-Based Robot Planning via Symbolic Code Generation |
https://arxiv.org/pdf/2503.01700 |
无 |
| Large Language Models as Natural Selector for Embodied Soft Robot Design |
https://arxiv.org/pdf/2503.02249 |
https://github.com/AisenGinn/evogym_data_generation |
| OVAMOS: A Framework for Open-Vocabulary Multi-Object Search in Unknown Environments |
https://arxiv.org/pdf/2503.02106 |
无 |
| RoboDesign1M: A Large-scale Dataset for Robot Design Understanding |
https://arxiv.org/pdf/2503.06796 |
无 |
| STAR: A Foundation Model-driven Framework for Robust Task Planning and Failure Recovery in Robotic Systems |
https://arxiv.org/pdf/2503.06060 |
无 |
| MatchMaker: Automated Asset Generation for Robotic Assembly |
https://arxiv.org/pdf/2503.05887 |
无 |
| Object-Centric World Model for Language-Guided Manipulation |
https://arxiv.org/pdf/2503.06170 |
无 |
| KUDA: Keypoints to Unify Dynamics Learning and Visual Prompting for Open-Vocabulary Robotic Manipulation |
https://arxiv.org/pdf/2503.10546 |
无 |
| IMPACT: Intelligent Motion Planning with Acceptable Contact Trajectories via Vision-Language Models |
https://arxiv.org/pdf/2503.10110 |
无 |
| Building Cooperative Embodied Agents Modularly with Large Language Models |
https://arxiv.org/pdf/2307.02485 |
无 |
| AIC MLLM: Autonomous Interactive Correction MLLM for Robust Robotic Manipulation |
https://arxiv.org/pdf/2406.11548 |
无 |
| OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints |
https://arxiv.org/pdf/2501.03841 |
无 |
| Grasp What You Want: Embodied Dexterous Grasping System Driven by Your Voice |
https://arxiv.org/pdf/2412.10694 |
无 |
| Towards Open-World Grasping with Large Vision-Language Models |
https://arxiv.org/pdf/2406.18722v4 |
无 |
| ThinkGrasp: A Vision-Language System for Strategic Part Grasping in Clutter |
https://arxiv.org/pdf/2407.11298v1 |
无 |
| MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World |
https://arxiv.org/pdf/2401.08577 |
无 |
| NVP-HRI: Zero Shot Natural Voice and Posture-based Human-Robot Interaction via Large Language Model |
https://arxiv.org/pdf/2503.09335 |
https://github.com/laiyuzhi/NVP-HRI.git |
| MindEye-OmniAssist: A Gaze-Driven LLM-Enhanced Assistive Robot System for Implicit Intention Recognition and Task Execution |
https://arxiv.org/pdf/2503.13250 |
无 |
| HybridGen: VLM-Guided Hybrid Planning for Scalable Data Generation of Imitation Learning |
https://arxiv.org/pdf/2503.13171 |
无 |
| Free-form language-based robotic reasoning and grasping |
https://arxiv.org/pdf/2503.13082 |
无 |
| Mitigating Cross-Modal Distraction and Ensuring Geometric Feasibility via Affordance-Guided, Self-Consistent MLLMs for Food Preparation Task Planning |
https://arxiv.org/pdf/2503.13055 |
无 |
| Multi-Agent LLM Actor-Critic Framework for Social Robot Navigation |
https://arxiv.org/pdf/2503.09758 |
无 |
| PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability |
https://arxiv.org/pdf/2503.08481 |
无 |
| MetaFold: Language-Guided Multi-Category Garment Folding Framework via Trajectory Generation and Foundation Model |
https://arxiv.org/pdf/2503.08372 |
无 |
| UAD: Unsupervised Affordance Distillation for Generalization in Robotic Manipulation |
https://unsup-affordance.github.io/uad.pdf |
无 |
| Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning |
https://arxiv.org/pdf/2503.15558 |
https://github.com/nvidia-cosmos/cosmos-reason1 |
| EmbodiedVSR: Dynamic Scene Graph-Guided Chain-of-Thought Reasoning for Visual Spatial Tasks |
https://arxiv.org/pdf/2503.11089 |
无 |
| RoboDexVLM: Visual Language Model-Enabled Task Planning and Motion Control for Dexterous Robot Manipulation |
https://arxiv.org/pdf/2503.01616 |
无 |
| Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation |
https://arxiv.org/pdf/2502.16707v1 |
无 |
| Learning Generalizable Language-Conditioned Cloth Manipulation from Long Demonstrations |
https://arxiv.org/pdf/2503.04557 |
无 |
| Look Before You Leap: Using Serialized State Machine for Language Conditioned Robotic Manipulation |
https://arxiv.org/pdf/2503.05114 |
无 |
| Perceiving, Reasoning, Adapting: A Dual-Layer Framework for VLM-Guided Precision Robotic Manipulation |
https://arxiv.org/pdf/2503.05064 |
无 |
| AutoSpatial: Visual-Language Reasoning for Social Robot Navigation through Efficient Spatial Reasoning Learning |
https://arxiv.org/pdf/2503.07557 |
https://github.com/Yanko96/AutoSpatial |
| AffordDexGrasp: Open-set Language-guided Dexterous Grasp with Generalizable-Instructive Affordance |
https://arxiv.org/pdf/2503.07360 |
无 |
| LLM+MAP: Bimanual Robot Task Planning using Large Language Models and Planning Domain Definition Language |
https://arxiv.org/pdf/2503.17309 |
https://github.com/Kchu/LLM-MAP |
| Leveraging Language Models for Out-of-Distribution Recovery in Reinforcement Learning |
https://arxiv.org/pdf/2503.17125 |
无 |
| RoboEngine: Plug-and-Play Robot Data Augmentation with Semantic Robot Segmentation and Background Generation |
https://arxiv.org/pdf/2503.18738 |
无 |
| IRef-VLA: A Benchmark for Interactive Referential Grounding with Imperfect Language in 3D Scenes |
https://arxiv.org/pdf/2503.17406 |
无 |
| Cooking Task Planning using LLM and Verified by Graph Network |
https://arxiv.org/pdf/2503.21564 |
无 |