Paper title	Paper link	Code link
ObVLN: Breaking Through Obstacles, Vision-and-Language Navigation in Obstructed Environments	https://dl.acm.org/doi/pdf/10.1145/3664647.3681640	https://github.com/honghd16/ObstructedVLN
Behavioral Analysis of Vision-and-Language Navigation Agents	https://openaccess.thecvf.com/content/CVPR2023/papers/Yang_Behavioral_Analysis_of_Vision-and-Language_Navigation_Agents_CVPR_2023_paper.pdf	https://github.com/Yoark/vln-behave
LLM as Copilot for Coarse-grained Vision-and-Language Navigation	https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/00833.pdf	None
NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation	https://arxiv.org/pdf/2402.15852	https://pku-epic.github.io/NaVid/
AerialVLN: Vision-and-Language Navigation for UAVs	https://openaccess.thecvf.com/content/ICCV2023/papers/Liu_AerialVLN_Vision-and-Language_Navigation_for_UAVs_ICCV_2023_paper.pdf	https://github.com/AirVLN/AirVLN
Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method	https://arxiv.org/pdf/2412.09082	https://hcplab-sysu.github.io/LH-VLN/
doScenes: An Autonomous Driving Dataset with Natural Language Instruction for Human Interaction and Vision-Language Navigation	https://arxiv.org/pdf/2412.05893	https://www.github.com/rossgreer/doScenes
MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation	https://aclanthology.org/2024.acl-long.529.pdf	https://github.com/chen-judge/MapGPT/
NL-SLAM for OC-VLN: Natural Language Grounded SLAM for Object-Centric VLN	https://arxiv.org/pdf/2411.07848	https://sonia-raychaudhuri.github.io/nlslam/
Aerial Vision-and-Language Navigation via Semantic-Topo-Metric Representation Guided LLM Reasoning	https://arxiv.org/pdf/2410.08500	None
VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation	https://openreview.net/pdf?id=gdw1zUTABk	https://github.com/bdaiinstitute/vlfm
CityNav: Language-Goal Aerial Navigation Dataset with Geographic Information	https://arxiv.org/pdf/2406.14240	https://water-cookie.github.io/city-nav-proj/
NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models	https://link.springer.com/chapter/10.1007/978-3-031-72667-5_15	https://github.com/GengzeZhou/NavGPT-2
NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models	https://ojs.aaai.org/index.php/AAAI/article/download/28597/29161	https://github.com/GengzeZhou/NavGPT
Navigation Instruction Generation with BEV Perception and Large Language Models	https://link.springer.com/chapter/10.1007/978-3-031-72670-5_21	https://github.com/FanScy/BEVInstructor
Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel	https://arxiv.org/pdf/2412.08467	https://github.com/wz0919/VLN-SRDF
Towards Learning a Generalist Model for Embodied Navigation	https://openaccess.thecvf.com/content/CVPR2024/papers/Zheng_Towards_Learning_a_Generalist_Model_for_Embodied_Navigation_CVPR_2024_paper.pdf	https://openaccess.thecvf.com/content/CVPR2024/papers/Zheng_Towards_Learning_a_Generalist_Model_for_Embodied_Navigation_CVPR_2024_paper.pdf
CorNav: Autonomous Agent with Self-Corrected Planning for Zero-Shot Vision-and-Language Navigation	https://aclanthology.org/2024.findings-acl.745.pdf	https://github.com/liangcici/MO-VLN
NavAgent: Multi-scale Urban Street View Fusion For UAV Embodied Vision-and-Language Navigation	https://arxiv.org/pdf/2411.08579	None
Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching	https://arxiv.org/pdf/2311.12751	https://github.com/MultimodalGeo/GeoText-1652
SayNav: Grounding Large Language Models for Dynamic Planning to Navigation in New Environments	https://ojs.aaai.org/index.php/ICAPS/article/download/31506/33666	https://www.sri.com/ics/computer-vision/saynav
LeLaN: Learning A Language-Conditioned Navigation Policy from In-the-Wild Videos	https://arxiv.org/pdf/2410.03603	https://github.com/NHirose/learning-language-navigation
Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks	https://arxiv.org/pdf/2412.06224	https://pku-epic.github.io/Uni-NaVid/
WebVLN: Vision-and-Language Navigation on Websites	https://ojs.aaai.org/index.php/AAAI/article/view/27878	https://github.com/WebVLN/WebVLN
SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts	https://arxiv.org/pdf/2412.05552	https://github.com/GengzeZhou/SAME
InstructNav: Zero-shot System for Generic Instruction Navigation in Unexplored Environment	https://openreview.net/pdf?id=fCDOfpTCzZ	https://github.com/LYX0501/InstructNav
VLA-3D: A Dataset for 3D Semantic Scene Understanding and Navigation	https://semrob.github.io/docs/rss_semrob2024_cr_paper12.pdf	https://github.com/HaochenZ11/VLA-3D
Human-centered In-building Embodied Delivery Benchmark	https://arxiv.org/pdf/2406.17898	https://github.com/PRS-Organization/prs-delivery
AeroVerse: UAV-Agent Benchmark Suite for Simulating, Pre-training, Finetuning, and Evaluating Aerospace Embodied World Models	https://arxiv.org/pdf/2408.15511	None
NaVILA: Legged Robot Vision-Language Action Model for Navigation	https://arxiv.org/pdf/2412.04453	https://github.com/yang-zj1026/legged-loco
Navigation World Models	https://arxiv.org/abs/2412.03572	None
Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology	https://arxiv.org/pdf/2410.07087	None
Recent Advances in Robot Navigation via Large Language Models: A Review	https://www.researchgate.net/profile/Xian-Wei-3/publication/384537380_Recent_Advances_in_Robot_Navigation_via_Large_Language_Models_A_Review/links/66fd2d9b906bca2ac3de2f73/Recent-Advances-in-Robot-Navigation-via-Large-Language-Models-A-Review.pdf	None
VLM-GroNav: Robot Navigation Using Physically Grounded Vision-Language Models in Outdoor Environments	https://arxiv.org/pdf/2409.20445	https://gamma.umd.edu/researchdirections/crowdmultiagent/vlm-gronav/
PRET: Planning with Directed Fidelity Trajectory for Vision and Language Navigation	https://link.springer.com/chapter/10.1007/978-3-031-72848-8_5	https://github.com/iSEE-Laboratory/VLN-PRET
Find What You Want: Learning Demand-conditioned Object Attribute Space for Demand-driven Navigation	https://arxiv.org/pdf/2309.08138	https://github.com/whcpumpkin/Demand-driven-navigation
Navigation as Attackers Wish? Towards Building Robust Embodied Agents under Federated Learning	https://aclanthology.org/2024.naacl-long.57.pdf	None
VoroNav: Voronoi-based Zero-shot Object Navigation with Large Language Model	https://openreview.net/pdf?id=Va7mhTVy5s	None
InfiniteWorld: A Unified Scalable Simulation Framework for General Visual-Language Robot Interaction	https://arxiv.org/pdf/2412.05789	https://github.com/pzhren/InfiniteWorld
RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation	https://arxiv.org/pdf/2412.08591	https://github.com/roomtour3d/roomtour3d-NaviLLM
UAVs Meet LLMs: Overviews and Perspectives Toward Agentic Low-Altitude Mobility	https://arxiv.org/pdf/2501.02341	https://github.com/Hub-Tian/UAVs_Meet_LLMs
Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions	https://arxiv.org/pdf/2406.19236v1	https://github.com/lpercc/HA3D_simulator
Why Only Text: Empowering Vision-and-Language Navigation with Multi-modal Prompts	https://arxiv.org/pdf/2406.02208	https://github.com/honghd16/VLN-MP
Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation	https://openaccess.thecvf.com/content/CVPR2022/papers/Chen_Think_Global_Act_Local_Dual-Scale_Graph_Transformer_for_Vision-and-Language_Navigation_CVPR_2022_paper.pdf	https://github.com/cshizhe/VLN-DUET
All Robots in One: A New Standard and Unified Dataset for Versatile, General-Purpose Embodied Agents	https://arxiv.org/pdf/2408.10899	https://openi.pcl.ac.cn/ARIO/ARIO_Dataset
Humanoid Locomotion and Manipulation: Current Progress and Challenges in Control, Planning, and Learning	https://arxiv.org/pdf/2501.02116	None
ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments	https://arxiv.org/pdf/2304.03047	https://github.com/MarSaKi/ETPNav
UEVAVD: A Dataset for Developing UAV’s Eye View Active Object Detection	https://arxiv.org/pdf/2411.04348	https://github.com/Leo000ooo/UEVAVD_dataset
FLAME: Learning to Navigate with Multimodal LLM in Urban Environments	https://arxiv.org/pdf/2408.11051	https://github.com/xyz9911/FLAME
TopV-Nav: Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation	https://arxiv.org/pdf/2411.16425	None
EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models	https://aclanthology.org/2024.acl-short.33.pdf	https://github.com/mengfeidu/EmbSpatial-Bench
GRUtopia: Dream General Robots in a City at Scale	https://arxiv.org/pdf/2407.10943v1	https://github.com/OpenRobotLab/GRUtopia
EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents	https://arxiv.org/pdf/2501.11858	https://github.com/thunlp/EmbodiedEval
ROBOSPATIAL: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics	https://arxiv.org/pdf/2411.16537	None
ReMEmbR: Building and Reasoning Over Long-Horizon Spatio-Temporal Memory for Robot Navigation	https://arxiv.org/abs/2409.13682	https://github.com/NVIDIA-AI-IOT/remembr
The Tong Test: Evaluating Artificial General Intelligence Through Dynamic Embodied Physical and Social Interactions	https://www.sciencedirect.com/science/article/pii/S209580992300293X	None
Mind the Error! Detection and Localization of Instruction Errors in Vision-and-Language Navigation	https://arxiv.org/pdf/2403.10700	https://github.com/intelligolabs/R2RIE-CE
SpatialCoT: Advancing Spatial Reasoning through Coordinate Alignment and Chain-of-Thought for Embodied Task Planning	https://arxiv.org/pdf/2501.10074	https://spatialcot.github.io/
ET-Plan-Bench: Embodied Task-level Planning Benchmark Towards Spatial-Temporal Cognition with Foundation Models	https://arxiv.org/pdf/2410.14682	None
EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models	https://openaccess.thecvf.com/content/CVPR2024/papers/Cheng_EgoThink_Evaluating_First-Person_Perspective_Thinking_Capability_of_Vision-Language_Models_CVPR_2024_paper.pdf	https://github.com/AdaCheng/EgoThink/
HumanoidBench: Simulated Humanoid Benchmark for Whole-Body Locomotion and Manipulation	https://arxiv.org/pdf/2403.10506	https://github.com/carlosferrazza/humanoid-bench
Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making	https://arxiv.org/abs/2410.07166	https://github.com/embodied-agent-interface/embodied-agent-interface
The One RING: a Robotic Indoor Navigation Generalist	https://arxiv.org/pdf/2412.14401v1	None
SARO: Space-Aware Robot System for Terrain Crossing via Vision-Language Model	https://saro-vlm.github.io/resources/saro_paper.pdf	None
Robust Robot Walker: Learning Agile Locomotion over Tiny Traps	https://robust-robot-walker.github.io/resources/rrw_paper.pdf	None
RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation	https://arxiv.org/abs/2311.01455	https://github.com/Genesis-Embodied-AI/RoboGen
See and Think: Embodied Agent in Virtual Environment (Affiliations: Zhejiang University, University of Washington, Hong Kong University of Science and Technology)	https://arxiv.org/pdf/2311.15209	None
EvoAgent: Agent Autonomous Evolution with Continual World Model for Long-Horizon Tasks	https://arxiv.org/pdf/2502.05907	None
OpenBench: A New Benchmark and Baseline for Semantic Navigation in Smart Logistics	https://arxiv.org/pdf/2502.09238	https://github.com/EI-Nav/light-map-navigation
CityEQA: A Hierarchical LLM Agent on Embodied Question Answering Benchmark in City Space (Affiliations: National University of Defense Technology, Tsinghua University)	https://arxiv.org/pdf/2502.12532	None
General Scene Adaptation for Vision-and-Language Navigation	https://arxiv.org/pdf/2501.17403	https://github.com/honghd16/GSA-VLN
SpatialRGPT	https://www.anjiecheng.me/assets/SpatialRGPT/Spatial_RGPT.pdf	https://github.com/AnjieCheng/SpatialRGPT
GTA-UAV	https://arxiv.org/pdf/2409.16925	https://github.com/Yux1angJi/GTA-UAV
Vision-and-Language Navigation via Causal Learning	https://openaccess.thecvf.com/content/CVPR2024/papers/Wang_Vision-and-Language_Navigation_via_Causal_Learning_CVPR_2024_paper.pdf	https://github.com/CrystalSixone/VLN-GOAT
UAV-VLA: Vision-Language-Action System for Large Scale Aerial Mission Generation	https://arxiv.org/pdf/2501.05014	https://github.com/sautenich/uav-vla
Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation	https://arxiv.org/pdf/2409.18313	None
Embodied Intelligence: A Synergy of Morphology, Action, Perception and Learning (Publication: ACM Computing Surveys, 2025)	https://dl.acm.org/doi/pdf/10.1145/3717059	None
UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios	https://arxiv.org/pdf/2408.17267	https://github.com/opendatalab/UrBench
OpenFly: A Versatile Toolchain and Large-scale Benchmark for Aerial Vision-Language Navigation	https://arxiv.org/pdf/2502.18041	None
Do Visual Imaginations Improve Vision-and-Language Navigation Agents?	https://arxiv.org/abs/2503.16394	None
Magma: A Foundation Model for Multimodal AI Agents	https://www.arxiv.org/pdf/2502.13130	https://github.com/microsoft/Magma
NavRAG: Generating User Demand Instructions for Embodied Navigation through Retrieval-Augmented LLM	https://arxiv.org/pdf/2502.11142	https://github.com/MrZihan/NavRAG
Mem2Ego: Empowering Vision-Language Models with Global-to-Ego Memory for Long-Horizon Embodied Navigation	https://arxiv.org/pdf/2502.14254	None
LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding	https://arxiv.org/pdf/2501.08282	https://github.com/appletea233/LLaVA-ST
Virtual Community: A Generative Social World for Embodied AI	https://openreview.net/pdf?id=aRxLDcxFcL	None
MapNav: A Novel Memory Representation via Annotated Semantic Maps for VLM-based Vision-and-Language Navigation	https://arxiv.org/pdf/2502.13451	None
Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models (Publication: Transactions on Machine Learning Research, 2024)	https://openreview.net/pdf?id=yiqeh2ZYUh	None
NAVCON: A Cognitively Inspired and Linguistically Grounded Corpus for Vision and Language Navigation	https://arxiv.org/pdf/2412.13026	None
EmbodiedRAG: Dynamic 3D Scene Graph Retrieval for Efficient and Scalable Robot Task Planning	https://arxiv.org/pdf/2410.23968	None
MageBench: Bridging Large Multimodal Models to Agents	https://arxiv.org/pdf/2412.04531	https://github.com/microsoft/MageBench
Exploring Embodied Multimodal Large Models: Development, Datasets, and Future Directions	http://arxiv.org/abs/2502.15336	None
UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces	https://arxiv.org/pdf/2503.06157	https://github.com/EmbodiedCity/UrbanVideo-Bench.code
Advancements in Humanoid Robots: A Comprehensive Review and Future Prospects (Publication: IEEE/CAA Journal of Automatica Sinica, vol. 11, no. 2, February 2024)	https://ieeexplore.ieee.org/abstract/document/10415857	None
Long-Horizon Vision-Language Navigation Platform and Dataset: Towards Intelligent Robots in Complex Environments (Authors: Xinshuai Song, Weixing Chen, Yang Liu, Weikai Chen, Guanbin Li, Liang Lin; Affiliations: Sun Yat-sen University, Independent Researcher, Peng Cheng Laboratory)	https://arxiv.org/pdf/2412.09082	None
Dual-BEV Nav: Dual-layer BEV-based Heuristic Path Planning for Robotic Navigation in Unstructured Outdoor Environments	https://arxiv.org/pdf/2501.18351	None
Towards Robust and Secure Embodied AI: A Survey on Vulnerabilities and Attacks	https://arxiv.org/pdf/2502.13175	None
Agent Journey Beyond RGB: Unveiling Hybrid Semantic-Spatial Environmental Representations for Vision-and-Language Navigation	None	https://github.com/HCI-LMC/VLN-SUSA (coming soon)
NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning	https://arxiv.org/abs/2403.07376	https://github.com/expectorlin/NavCoT
Lifelong Learning of Large Language Model based Agents: A Roadmap	https://arxiv.org/pdf/2501.07278	None
An LLM-based vision and language cobot navigation approach for Human-centric Smart Manufacturing (Publication: Journal of Manufacturing Systems 75 (2024) 299–305)	https://www.sciencedirect.com/science/article/pii/S0278612524000864	None
Open3DVQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space	https://www.arxiv.org/pdf/2503.11094	https://github.com/WeichenZh/Open3DVQA
Perceive, Reflect, and Plan: Designing LLM Agent for Goal-Directed City Navigation without Instructions	https://arxiv.org/pdf/2408.04168	https://anonymous.4open.science/r/PReP-13B5
EARBench: Towards Evaluating Physical Risk Awareness for Task Planning of Foundation Model-based Embodied AI Agents	https://arxiv.org/pdf/2408.04449	https://github.com/zihao-ai/EARBench
FlexVLN: Flexible Adaptation for Diverse Vision-and-Language Navigation Tasks	https://arxiv.org/pdf/2503.13966	None
Ground-level Viewpoint Vision-and-Language Navigation in Continuous Environments	https://arxiv.org/pdf/2502.19024	None
WMNav: Integrating Visual-Language Models into World Models for Object Goal Navigation	https://arxiv.org/abs/2503.02247	None
UniGoal: Towards Universal Zero-shot Goal-oriented Navigation	https://arxiv.org/abs/2503.10630	https://github.com/bagh2178/UniGoal
PanoGen++: Domain-adapted text-guided panoramic environment generation for vision-and-language navigation (Publication: Neural Networks 187 (2025) 107320)	None	None
UNMuTe: Unifying Navigation and Multimodal Dialogue-like Text Generation	https://arxiv.org/pdf/2408.04423	None
Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation	https://arxiv.org/pdf/2503.18065	https://github.com/SaDil13/VLN-RAM
VertiFormer: A Data-Efficient Multi-Task Transformer for Off-Road Robot Mobility	https://arxiv.org/pdf/2502.00543	https://github.com/mhnazeri/VertiFormer
Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning	https://arxiv.org/pdf/2504.12680	https://github.com/EmbodiedCity/Embodied-R.code
CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos	https://arxiv.org/abs/2411.17820	https://github.com/ai4ce/CityWalker
GeoNav: Empowering MLLMs with Explicit Geospatial Reasoning Abilities for Language-Goal Aerial Navigation	https://arxiv.org/pdf/2504.09587	None
World-Consistent Data Generation for Vision-and-Language Navigation	https://arxiv.org/pdf/2412.06413	None
TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation	https://arxiv.org/pdf/2502.07306	None
Zero-Shot Object Navigation with Vision-Language Models Reasoning	https://link.springer.com/chapter/10.1007/978-3-031-78456-9_25	None
Multimodal Pretrained Knowledge for Real-world Object Navigation (Publication: Machine Intelligence Research, 2025)	https://www.mi-research.net/en/article/pdf/preview/10.1007/s11633-024-1537-x.pdf	None
Instruction-guided path planning with 3D semantic maps for vision-language navigation (Publication: Neurocomputing 625 (2025) 129457)	https://www.sciencedirect.com/science/article/abs/pii/S0925231225001298?via%3Dihub	None
COSMO: Combination of Selective Memorization for Low-cost Vision-and-Language Navigation	https://arxiv.org/pdf/2503.24065	None
ESceme: Vision-and-Language Navigation with Episodic Scene Memory	https://link.springer.com/article/10.1007/s11263-024-02159-8	https://github.com/qizhust/esceme
Vi-LAD: Vision-Language Attention Distillation for Socially-Aware Robot Navigation in Dynamic Environments	https://doi.org/10.48550/arXiv.2503.09820	None
Advances in Embodied Navigation Using Large Language Models: A Survey	https://arxiv.org/pdf/2311.00530	None
NavigateDiff: Visual Predictors are Zero-Shot Navigation Assistants	https://arxiv.org/pdf/2502.13894	None
ClipRover: Zero-shot Vision-Language Exploration and Target Discovery by Mobile Robots	https://arxiv.org/pdf/2502.08791	None
Aerial Vision-and-Language Navigation with Grid-based View Selection and Map Construction	https://arxiv.org/pdf/2503.11091	None
UnrealZoo Gym (no explicit paper title given; project name used instead)	https://arxiv.org/abs/2412.20977	https://github.com/UnrealZoo/unrealzoo-gym
WMNavigation (no explicit paper title given; project name used instead)	https://arxiv.org/pdf/2503.02247	https://github.com/B0B8K1ng/WMNavigation
KiteRunner: Language-Driven Cooperative Local-Global Navigation Policy with UAV Mapping in Outdoor Environments	https://arxiv.org/pdf/2503.08330	None
SmartWay: Enhanced Waypoint Prediction and Backtracking for Zero-Shot Vision-and-Language Navigation	https://arxiv.org/pdf/2503.10069	None
Vision-language model-based human-robot collaboration for smart manufacturing: A state-of-the-art survey (Publication: Frontiers of Engineering Management 2025, 12 (1): 177–200)	https://link.springer.com/content/pdf/10.1007/s42524-025-4136-9.pdf	None
Look Around Before Locating: Considering Content and Structure Information for Visual Grounding	https://ojs.aaai.org/index.php/AAAI/article/view/32158/34313	https://github.com/VILAN-Lab/SSRVG
Self-Prompting Analogical Reasoning for UAV Object Detection	https://ojs.aaai.org/index.php/AAAI/article/view/34026/36181	None
Learning Fine-Grained Alignment for Aerial Vision-Dialog Navigation	https://ojs.aaai.org/index.php/AAAI/article/view/32758/34913	https://github.com/yifeisu/FELA
LogisticsVLN: Vision-Language Navigation For Low-Altitude Terminal Delivery Based on Agentic UAVs	https://arxiv.org/pdf/2505.03460	None
Embodied Intelligence for 3D Understanding: A Survey on 3D Scene Question Answering	https://arxiv.org/pdf/2502.00342	None
RefDrone: A Challenging Benchmark for Drone Scene Referring Expression Comprehension	https://arxiv.org/pdf/2502.00392	https://github.com/sunzc-sunny/refdrone
Framework for Autonomous UAV Navigation and Target Detection in Global-Navigation-Satellite-System-Denied and Visually Degraded Environments	https://www.mdpi.com/2072-4292/16/3/471/pdf?version=1706191544	None
Towards Autonomous UAV Visual Object Search in City Space: Benchmark and Agentic Methodology	https://arxiv.org/abs/2505.08765	https://anonymous.4open.science/r/CityAVOS-3DF8
AirVista-II: An Agentic System for Embodied UAVs Toward Dynamic Scene Semantic Understanding	https://arxiv.org/pdf/2504.09583	None
UAV Visual Navigation in the Large-Scale Outdoor Environment: A Semantic Map-Based Cognitive Escape Reinforcement Learning Method	https://ieeexplore.ieee.org/abstract/document/10847926	None
A Survey of Large Language Model-Powered Spatial Intelligence Across Scales: Advances in Embodied Agents, Smart Cities, and Earth Science	https://arxiv.org/pdf/2504.09848	None
General-Purpose Aerial Intelligent Agents Empowered by Large Language Models	https://arxiv.org/pdf/2503.08302	None
Beyond the Destination: A Novel Benchmark for Exploration-Aware Embodied Question Answering	https://arxiv.org/pdf/2503.11117	https://github.com/HCPLab-SYSU/EXPRESS-Bench
CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory	https://www.arxiv.org/pdf/2505.05622	https://github.com/VinceOuti/CityNavAgent (coming soon)
Observation-Graph Interaction and Key-Detail Guidance for Vision and Language Navigation	https://www.arxiv.org/pdf/2503.11006	None
UAV-VLN: End-to-End Vision Language guided Navigation for UAVs	https://arxiv.org/pdf/2504.21432	None
FlightGPT: Towards Generalizable and Interpretable UAV Vision-and-Language Navigation with Vision-Language Models	https://arxiv.org/pdf/2505.12835	https://github.com/Pendulumclock/FlightGPT
Visuomotor Navigation for Embodied Robots With Spatial Memory and Semantic Reasoning Cognition	https://ieeexplore.ieee.org/abstract/document/10682097	None
UAV-FlowColosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning	https://arxiv.org/pdf/2505.15725	None
UAV-CodeAgents: Scalable UAV Mission Planning via Multi-Agent ReAct and Vision-Language Reasoning	https://arxiv.org/pdf/2505.07236	None
Task-Oriented Communications for Visual Navigation with Edge-Aerial Collaboration in Low Altitude Economy	https://arxiv.org/abs/2504.18317v3	https://github.com/fangzr/TOC-Edge-Aerial
AirSpatialBot: A Spatially-Aware Aerial Agent for Fine-Grained Vehicle Attribute Recognization and Retrieval	https://ieeexplore.ieee.org/document/11006099	https://github.com/VisionXLab/AirSpatialBot
RaceVLA: VLA-based Racing Drone Navigation with Human-like Behaviour	https://arxiv.org/pdf/2503.02572	https://github.com/SerValera/RaceVLA
Personalised Explainable Robots Using LLMs	https://ieeexplore.ieee.org/abstract/document/10974125	None
CoordField: Coordination Field for Agentic UAV Task Allocation In Low-altitude Urban Scenarios	https://arxiv.org/pdf/2505.00091	None
DualMap: Online Open-Vocabulary Semantic Mapping for Natural Language Navigation in Dynamic Changing Scenes	https://arxiv.org/pdf/2506.01950	https://github.com/Eku127/DualMap
CognitiveDrone: A VLA Model and Evaluation Benchmark for Real-Time Cognitive Task Solving and Reasoning in UAVs	https://arxiv.org/pdf/2503.01378	https://cognitivedrone.github.io/
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence	https://arxiv.org/pdf/2505.23747	https://github.com/diankun-wu/Spatial-MLLM
BEDI: A Comprehensive Benchmark for Evaluating Embodied Agents on UAVs	https://arxiv.org/pdf/2505.18229	https://github.com/lostwolves/BEDI
Loc4Plan: Locating Before Planning for Outdoor Vision and Language Navigation	https://dl.acm.org/doi/abs/10.1145/3664647.3681518	None
P3Nav: A Unified Framework for Embodied Navigation Integrating Perception, Planning, and Prediction	https://arxiv.org/pdf/2503.18525	None
DOPE: Dual Object Perception-Enhancement Network for Vision-and-Language Navigation	https://arxiv.org/pdf/2505.00743	None
AutoSpatial: Visual-Language Reasoning for Social Robot Navigation through Efficient Spatial Reasoning Learning	https://arxiv.org/pdf/2503.07557	https://github.com/Yanko96/AutoSpatial (coming soon)
OctoNav: Towards Generalist Embodied Navigation	https://arxiv.org/pdf/2506.09839	https://github.com/buaa-colalab/OctoNav-R1
A Survey of Robotic Navigation and Manipulation with Physics Simulators in the Era of Embodied AI	https://arxiv.org/pdf/2505.01458	None
Toward Embodied AGI: A Review of Embodied AI and the Road Ahead	https://arxiv.org/pdf/2505.14235	None
USTBench: Benchmarking and Dissecting Spatiotemporal Reasoning of LLMs as Urban Agents	https://arxiv.org/pdf/2505.17572	https://github.com/usail-hkust/USTBench
CoNav: Collaborative Cross-Modal Reasoning for Embodied Navigation	https://arxiv.org/pdf/2505.16663	https://github.com/oceanhao/CoNav
GaussNav: Gaussian Splatting for Visual Navigation	https://ieeexplore.ieee.org/abstract/document/10870413/	https://github.com/XiaohanLei/GaussNav
HOZ++: Versatile Hierarchical Object-to-Zone Graph for Object Navigation	https://ieeexplore.ieee.org/abstract/document/10933537/	https://github.com/sx-zhang/HOZplus
EvolveNav: Self-Improving Embodied Reasoning for LLM-Based Vision-Language Navigation	https://arxiv.org/pdf/2506.01551	https://github.com/expectorlin/EvolveNav
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks	https://arxiv.org/pdf/2503.21696	https://github.com/zwq2018/embodied_reasoner
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents	https://arxiv.org/pdf/2502.09560	https://github.com/EmbodiedBench/EmbodiedBench
A Survey of Object-Goal Navigation for Embodied Artificial Intelligence	https://www.jos.org.cn/jos/article/abstract/7250	None
Multimodal Perception for Goal-oriented Navigation: A Survey	https://arxiv.org/pdf/2504.15643	None
RoboSense: Large-scale Dataset and Benchmark for Egocentric Robot Perception and Navigation in Crowded and Unstructured Environments	https://arxiv.org/pdf/2408.15503	https://github.com/suhaisheng/RoboSense (coming soon)
Vid2Sim: Realistic and Interactive Simulation from Video for Urban Navigation	https://openaccess.thecvf.com/content/CVPR2025/papers/Xie_Vid2Sim_Realistic_and_Interactive_Simulation_from_Video_for_Urban_Navigation_CVPR_2025_paper.pdf	https://github.com/Vid2Sim/Vid2Sim (coming soon)
Dynam3D: Dynamic Layered 3D Tokens Empower VLM for Vision-and-Language Navigation	https://arxiv.org/pdf/2505.11383	https://github.com/MrZihan/Dynam3D (coming soon)
DyNaVLM: Zero-Shot Vision-Language Navigation System with Dynamic Viewpoints and Self-Refining Graph Memory	https://arxiv.org/pdf/2506.15096	None
UAVs Meet Agentic AI: A Multidomain Survey of Autonomous Aerial Intelligence and Agentic UAVs	https://arxiv.org/pdf/2506.08045	None
A Comprehensive Survey on Underwater Acoustic Target Positioning and Tracking: Progress, Challenges, and Perspectives	https://arxiv.org/pdf/2506.14165	None
Observation-Graph Interaction and Key-Detail Guidance for Vision and Language Navigation	https://doi.org/10.48550/arXiv.2503.11006	None
Prior Does Matter: Visual Navigation via Denoising Diffusion Bridge Models	https://openaccess.thecvf.com/content/CVPR2025/papers/Ren_Prior_Does_Matter_Visual_Navigation_via_Denoising_Diffusion_Bridge_Models_CVPR_2025_paper.pdf	https://github.com/hren20/NaiviBridger
VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning	https://arxiv.org/abs/2506.17221	https://github.com/Qi-Zhangyang/GPT4Scene-and-VLN-R1
VR-Robo: A Real-to-Sim-to-Real Framework for Visual Robot Navigation and Locomotion	https://arxiv.org/pdf/2502.01536	https://github.com/zst1406217/VR-Robo
Embodying Compliant Touch on Drones for Aerial Tactile Navigation	https://ieeexplore.ieee.org/abstract/document/10806756	https://github.com/BioMorphic-Intelligence-Lab/tactile-drone
Controllable 3D Outdoor Scene Generation via Scene Graphs	https://arxiv.org/abs/2503.07152	https://github.com/yuhengliu02/control-3d-scene (coming soon)
A Survey: Learning Embodied Intelligence from Physical Simulators and World Models	https://arxiv.org/pdf/2507.00917	None (project page: https://github.com/NJU3DV-LoongGroup/Embodied-World-Models-Survey)
Aim My Robot: Precision Local Navigation to Any Object	https://ieeexplore.ieee.org/abstract/document/10887329/	None
ApexNav: An Adaptive Exploration Strategy for Zero-Shot Object Navigation with Target-centric Semantic Fusion	https://arxiv.org/pdf/2504.14478	https://github.com/Robotics-STAR-Lab/ApexNav (coming soon)
FindingDory: A Benchmark to Evaluate Memory in Embodied Agents	https://arxiv.org/pdf/2506.15635	None
SkyVLN: Vision-and-Language Navigation and NMPC Control for UAVs in Urban Environments	https://arxiv.org/pdf/2507.06564	None
REGNav: Room Expert Guided Image-Goal Navigation	https://ojs.aaai.org/index.php/AAAI/article/download/32514/34669	https://github.com/leeBooMla/REGNav
Embodied navigation	https://link.springer.com/article/10.1007/s11432-024-4303-8	None
FlySearch: Exploring how vision-language models explore	https://arxiv.org/pdf/2506.02896	https://github.com/gmum/FlySearch
Towards Audio-visual Navigation in Noisy Environments: A Large-scale Benchmark Dataset and An Architecture Considering Multiple Sound-Sources	https://ojs.aaai.org/index.php/AAAI/article/view/33608	https://github.com/ZhanboShiAI/ENMuS
MMCNav: MLLM-empowered Multi-agent Collaboration for Outdoor Visual Language Navigation	https://dl.acm.org/doi/abs/10.1145/3731715.3733393	https://github.com/zzhaesc/MMCNav
MoMa-Kitchen: A 100K+ Benchmark for Affordance-Grounded Last-Mile Navigation in Mobile Manipulation	https://arxiv.org/pdf/2503.11081	https://github.com/MoMaKitchen/MoMaKitchen (coming soon)
FindAnything: Open-Vocabulary and Object-Centric Mapping for Robot Exploration in Any Environment	https://arxiv.org/pdf/2504.08603	None
Grounded Vision-Language Navigation for UAVs with Open-Vocabulary Goal Understanding	https://www.arxiv.org/pdf/2506.10756	https://github.com/zzzzzyh111/Vision-Language-Fly (coming soon)
A Survey of Vision-Based UAV Localization and Navigation Methods	https://www.ejournal.org.cn/CN/10.12263/DZXB.20240699	None
VLM-Nav: Mapless UAV-Navigation Using Monocular Vision Driven by Vision-Language Model	https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5231854	https://github.com/gcsarker/vlm_nav
Mathematical Reasoning for Unmanned Aerial Vehicles: A RAG-Based Approach for Complex Arithmetic Reasoning	https://arxiv.org/pdf/2506.04998	https://github.com/Mehdiazarafza/UAV-RAG
Mem4Nav: Boosting Vision-and-Language Navigation in Urban Environments with a Hierarchical Spatial-Cognition Long-Short Memory System	https://arxiv.org/pdf/2506.19433v1	https://github.com/tsinghua-fib-lab/Mem4Nav (coming soon)
StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling	https://arxiv.org/pdf/2507.05240	https://github.com/OpenRobotLab/StreamVLN
MorphoNavi: Aerial-Ground Robot Navigation with Object Oriented Mapping in Digital Twin	https://arxiv.org/pdf/2504.16914	无
A Survey of Multi-sensor Fusion Perception for Embodied AI: Background, Methods, Challenges and Prospects	https://arxiv.org/pdf/2506.19769v1	无
MossVLN: Memory-Observation Synergistic System for Continuous Vision-Language Navigation	https://ieeexplore.ieee.org/abstract/document/11071855	https://github.com/OpenMICG/MossVLN
Semantic Mapping in Indoor Embodied AI – A Survey on Advances, Challenges, and Future Directions	https://arxiv.org/pdf/2501.05750	无
X-Nav: Learning End-to-End Cross-Embodiment Navigation for Mobile Robots	https://www.arxiv.org/pdf/2507.14731	无
Point2Graph: An End-to-end Point Cloud-based 3D Open-Vocabulary Scene Graph for Robot Navigation	https://www.arxiv.org/pdf/2409.10350	https://github.com/zimingluo/Point2Graph
Mobile Robot Navigation Using Hand-Drawn Maps: A Vision Language Model Approach	https://ieeexplore.ieee.org/abstract/document/11027320/	无
Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision	https://arxiv.org/pdf/2504.02477	https://github.com/Xiaofeng-Han-Res/MF-RV
AirScape: An Aerial Generative World Model with Motion Controllability	https://arxiv.org/pdf/2507.08885	无
TrackVLA: Embodied Visual Tracking in the Wild	https://arxiv.org/abs/2505.23189	https://github.com/wsakobe/TrackVLA
SE-VLN: A Self-Evolving Vision-Language Navigation Framework Based on Multimodal Large Language Models	https://arxiv.org/abs/2507.13152v1	无
SA-GCS: Semantic-Aware Gaussian Curriculum Scheduling for UAV Vision-Language Navigation	https://arxiv.org/pdf/2508.00390	无
Foundation Model Driven Robotics: A Comprehensive Review	https://arxiv.org/pdf/2507.10087v1	无
Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities	https://arxiv.org/pdf/2507.13019	https://github.com/InternRobotics/InternNav
大模型驱动的具身智能：发展与挑战 (Embodied Intelligence Driven by Large Models: Development and Challenges)	https://www.sciengine.com/SSI/doi/10.1360/SSI-2024-0076	无
CogDDN: A Cognitive Demand-Driven Navigation with Decision Optimization and Dual-Process Thinking	https://arxiv.org/pdf/2507.11334	https://github.com/yuehaohuang/CogDDN
Recursive Visual Imagination and Adaptive Linguistic Grounding for Vision Language Navigation	https://arxiv.org/pdf/2507.21450	无
MonoDream: Monocular Vision-Language Navigation with Panoramic Dreaming	https://arxiv.org/pdf/2508.02549v1	无
FloNa: Floor Plan Guided Embodied Visual Navigation	https://ojs.aaai.org/index.php/AAAI/article/download/33601/35756	无
Hi AirStar, Guide Me to the Badminton Court	https://www.arxiv.org/pdf/2507.04430	无
Aux-Think: Exploring Reasoning Strategies for Data-Efficient Vision-Language Navigation	https://arxiv.org/pdf/2505.11886	https://github.com/HorizonRobotics/robo_orchard_lab/tree/master/projects/aux_think
Open-Nav: Exploring Zero-Shot Vision-and-Language Navigation in Continuous Environment with Open-Source LLMs	https://arxiv.org/abs/2409.18794	https://github.com/YanyuanQiao/Open-Nav
osmAG-LLM: Zero-Shot Open-Vocabulary Object Navigation via Semantic Maps and Large Language Models Reasoning	https://arxiv.org/pdf/2507.12753v1	https://anonymous.4open.science/r/osmAG-LLM
基于大模型的具身智能系统综述 (A Survey of Embodied Intelligence Systems Based on Large Models)	http://www.aas.net.cn/cn/article/doi/10.16383/j.aas.c240542	无
Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps	https://arxiv.org/pdf/2505.18675	https://github.com/fscdc/ReasonMap
NaviMaster: Learning a Unified Policy for GUI and Embodied Navigation Tasks	https://arxiv.org/pdf/2508.02046v1	无
UAV-ON: A Benchmark for Open-World Object Goal Navigation with Aerial Agents	https://arxiv.org/pdf/2508.00288	https://github.com/Kyaren/UAV_ON
AgriVLN: Vision-and-Language Navigation for Agricultural Robots	https://arxiv.org/pdf/2508.07406v1	https://github.com/AlexTraveling/AgriVLN
i2Nav-Robot: A Large-Scale Indoor-Outdoor Robot Dataset for Multi-Sensor Fusion Navigation and Mapping	https://arxiv.org/pdf/2508.11485	https://github.com/i2Nav-WHU/i2Nav-Robot
NavBench: Probing Multimodal Large Language Models for Embodied Navigation	https://arxiv.org/abs/2506.01031	https://github.com/NavBench/Evaluation_Code (coming soon)
ODYSSEY: Open-World Quadrupeds Exploration and Manipulation for Long-Horizon Tasks	https://arxiv.org/pdf/2508.08240	无
AerialGo: Walking-through City View Generation from Aerial Perspectives	https://arxiv.org/pdf/2412.00157	无
AeroDuo: Aerial Duo for UAV-based Vision and Language Navigation	https://arxiv.org/abs/2508.15232v1	无
Autonomous Exploration with Terrestrial-Aerial Bimodal Vehicles	https://arxiv.org/pdf/2507.21338v1	无
Vision-based Navigation of Unmanned Aerial Vehicles in Orchards: An Imitation Learning Approach	https://arxiv.org/abs/2508.02617v1	无
HA-VLN: A Benchmark for Human-Aware Navigation in Discrete–Continuous Environments with Dynamic Multi-Human Interactions, Real-World Validation, and an Open Leaderboard	https://arxiv.org/pdf/2503.14229	https://github.com/F1y1113/HA-VLN
LOVON: Legged Open-Vocabulary Object Navigator	https://arxiv.org/pdf/2507.06747	https://github.com/DaojiePENG/LOVON
Sensing, Social, and Motion Intelligence in Embodied Navigation: A Comprehensive Survey	https://arxiv.org/pdf/2508.15354v1	无
Quantifying and Alleviating Co-Adaptation in Sparse-View 3D Gaussian Splatting	https://arxiv.org/pdf/2508.12720	https://github.com/chenkangjie1123/Co-Adaptation-of-3DGS/
RoboNurse-VLA: Robotic Scrub Nurse System based on Vision-Language-Action Model	https://arxiv.org/pdf/2409.19590	无
Actra: Optimized Transformer Architecture for Vision-Language-Action Models in Robot Learning	https://arxiv.org/pdf/2408.01147	无
QUAR-VLA: Vision-Language-Action Model for Quadruped Robots	https://arxiv.org/pdf/2312.14457	无
DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution	https://arxiv.org/pdf/2411.02359	无
General Flow as Foundation Affordance for Scalable Robot Learning	https://arxiv.org/pdf/2401.11439	无
Learning Universal Policies via Text-Guided Video Generation	https://arxiv.org/pdf/2302.00111	无
Learning to Act from Actionless Videos through Dense Correspondences	https://arxiv.org/pdf/2310.08576	无
Compositional Foundation Models for Hierarchical Planning	https://arxiv.org/pdf/2309.08587	无
VIMA: General Robot Manipulation with Multimodal Prompts	https://vimalabs.github.io/assets/vima_paper.pdf	无
Prompt a Robot to Walk with Large Language Models	https://arxiv.org/pdf/2309.09969	无
Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning	https://arxiv.org/pdf/2311.17842	无
π0	https://arxiv.org/pdf/2410.24164	https://github.com/Physical-Intelligence/openpi
CogACT	https://arxiv.org/pdf/2411.19650	https://github.com/microsoft/CogACT.git
Diffusion-VLA	https://arxiv.org/abs/2412.03293	无
3D-VLA	https://arxiv.org/pdf/2403.09631	https://github.com/UMass-Foundation-Model/3D-VLA/tree/main
TinyVLA	https://arxiv.org/abs/2409.12514	无
TraceVLA	https://arxiv.org/pdf/2412.10345	https://github.com/umd-huang-lab/tracevla
Octo	https://arxiv.org/pdf/2405.12213	无
SpatialVLA	https://arxiv.org/pdf/2501.15830	https://github.com/SpatialVLA/SpatialVLA
Gemini Robotics	https://arxiv.org/pdf/2503.20020	无
SafeVLA	https://arxiv.org/abs/2503.03480	https://github.com/PKU-Alignment/SafeVLA
HybridVLA	https://arxiv.org/pdf/2503.10631	https://github.com/PKU-HMI-Lab/Hybrid-VLA
DexVLA	https://arxiv.org/pdf/2502.05855	https://github.com/juruobenruo/DexVLA
DexGraspVLA	https://arxiv.org/abs/2502.20900	https://github.com/Psi-Robot/DexGraspVLA
UP-VLA	https://arxiv.org/pdf/2501.18867	无
CoT-VLA	https://arxiv.org/pdf/2503.22020	无
UniAct	https://arxiv.org/abs/2501.10105	https://github.com/2toinf/UniAct
Dynamic Path Navigation for Motion Agents with LLM Reasoning	https://arxiv.org/pdf/2503.07323	无
SmartWay: Enhanced Waypoint Prediction and Backtracking for Zero-Shot Vision-and-Language Navigation	https://arxiv.org/pdf/2503.10069	无
Vi-LAD: Vision-Language Attention Distillation for Socially-Aware Robot Navigation in Dynamic Environments	https://arxiv.org/pdf/2503.09820	无
UniGoal: Towards Universal Zero-shot Goal-oriented Navigation	https://arxiv.org/pdf/2503.10630	无
PanoGen++: Domain-Adapted Text-Guided Panoramic Environment Generation for Vision-and-Language Navigation	https://arxiv.org/pdf/2503.09938	无
Do Visual Imaginations Improve Vision-and-Language Navigation Agents?	https://arxiv.org/pdf/2503.16394	无
Learned Perceptive Forward Dynamics Model for Safe and Platform-aware Robotic Navigation	https://arxiv.org/pdf/2504.19322	https://github.com/leggedrobotics/fdm
Semantic Mapping in Indoor Embodied AI - A Comprehensive Survey and Future Directions	https://arxiv.org/pdf/2501.05750	无
VL-Nav: Real-time Vision-Language Navigation with Spatial Reasoning	https://arxiv.org/pdf/2502.00931	无
TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation	https://arxiv.org/pdf/2502.07306	无
VR-Robo: A Real-to-Sim-to-Real Framework for Visual Robot Navigation and Locomotion	https://arxiv.org/pdf/2502.01536	无
NavigateDiff: Visual Predictors are Zero-Shot Navigation Assistants	https://arxiv.org/pdf/2502.13894	无
Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation	https://arxiv.org/pdf/2503.18065	https://github.com/SaDil13/VLN-RAM
COSMO: Combination of Selective Memorization for Low-cost Vision-and-Language Navigation	https://arxiv.org/pdf/2503.24065	无
ForesightNav: Learning Scene Imagination for Efficient Exploration	https://arxiv.org/pdf/2504.16062	https://github.com/uzh-rpg/foresight-nav
CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory	https://arxiv.org/pdf/2505.05622	https://github.com/VinceOuti/CityNavAgent
NavDP: Learning Sim-to-Real Navigation Diffusion Policy with Privileged Information Guidance	https://arxiv.org/pdf/2505.08712	无
VISTA: Generative Visual Imagination for Vision-and-Language Navigation	https://arxiv.org/pdf/2505.07868	无
Dynam3D: Dynamic Layered 3D Tokens Empower VLM for Vision-and-Language Navigation	https://arxiv.org/pdf/2505.11383	https://github.com/MrZihan/Dynam3D
Active Test-time Vision-Language Navigation	https://arxiv.org/pdf/2506.06630	无
BeliefMapNav: 3D Voxel-Based Belief Map for Zero-Shot Object Navigation	https://arxiv.org/pdf/2506.06487	https://github.com/ZiboKNOW/BeliefMapNav
TopV-Nav: Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation	https://arxiv.org/pdf/2411.16425	无
CorrectNav: Self-Correction Flywheel Empowers Vision-Language-Action Navigation Model	https://arxiv.org/pdf/2508.10416	无
GC-VLN: Instruction as Graph Constraints for Training-free Vision-and-Language Navigation	https://arxiv.org/pdf/2509.10454	https://github.com/bagh2178/GC-VLN
Embodied Navigation Foundation Model	https://arxiv.org/pdf/2509.12129	无
Search-TTA: A Multimodal Test-Time Adaptation Framework for Visual Search in the Wild	https://arxiv.org/pdf/2505.11350	https://github.com/marmotlab/Search-TTA-VLN
JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation	https://arxiv.org/pdf/2509.22548	https://github.com/MIV-XJTU/JanusVLN
TrackVLA++: Unleashing Reasoning and Memory Capabilities in VLA Models for Embodied Visual Tracking	https://arxiv.org/pdf/2510.07134	无
CANVAS: Commonsense-Aware Navigation System for Intuitive Human-Robot Interaction	https://arxiv.org/abs/2410.01273	https://github.com/worv-ai/canvas
NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation	https://arxiv.org/pdf/2402.15852	https://pku-epic.github.io/NaVid/
The One RING: a Robotic Indoor Navigation Generalist	https://arxiv.org/pdf/2412.14401	无
Mobility VLA: Multimodal Instruction Navigation with Long-Context VLMs and Topological Graphs	https://arxiv.org/pdf/2407.07775	无
GO-1	无	无
π0.5	https://arxiv.org/abs/2504.16054	https://github.com/Physical-Intelligence/openpi
Hi Robot	https://arxiv.org/abs/2502.19417	https://github.com/Physical-Intelligence/openpi
TRACE: A Self-Improving Framework for Robot Behavior Forecasting with Vision-Language Models	https://arxiv.org/pdf/2503.00761	无
AffordGrasp: In-Context Affordance Reasoning for Open-Vocabulary Task-Oriented Grasping in Clutter	https://arxiv.org/pdf/2503.00778	无
Code-as-Symbolic-Planner: Foundation Model-Based Robot Planning via Symbolic Code Generation	https://arxiv.org/pdf/2503.01700	无
Large Language Models as Natural Selector for Embodied Soft Robot Design	https://arxiv.org/pdf/2503.02249	https://github.com/AisenGinn/evogym_data_generation
OVAMOS: A Framework for Open-Vocabulary Multi-Object Search in Unknown Environments	https://arxiv.org/pdf/2503.02106	无
RoboDesign1M: A Large-scale Dataset for Robot Design Understanding	https://arxiv.org/pdf/2503.06796	无
STAR: A Foundation Model-driven Framework for Robust Task Planning and Failure Recovery in Robotic Systems	https://arxiv.org/pdf/2503.06060	无
MatchMaker: Automated Asset Generation for Robotic Assembly	https://arxiv.org/pdf/2503.05887	无
Object-Centric World Model for Language-Guided Manipulation	https://arxiv.org/pdf/2503.06170	无
KUDA: Keypoints to Unify Dynamics Learning and Visual Prompting for Open-Vocabulary Robotic Manipulation	https://arxiv.org/pdf/2503.10546	无
IMPACT: Intelligent Motion Planning with Acceptable Contact Trajectories via Vision-Language Models	https://arxiv.org/pdf/2503.10110	无
Building Cooperative Embodied Agents Modularly with Large Language Models	https://arxiv.org/pdf/2307.02485	无
AIC MLLM: Autonomous Interactive Correction MLLM for Robust Robotic Manipulation	https://arxiv.org/pdf/2406.11548	无
OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints	https://arxiv.org/pdf/2501.03841	无
Grasp What You Want: Embodied Dexterous Grasping System Driven by Your Voice	https://arxiv.org/pdf/2412.10694	无
Towards Open-World Grasping with Large Vision-Language Models	https://arxiv.org/pdf/2406.18722v4	无
ThinkGrasp: A Vision-Language System for Strategic Part Grasping in Clutter	https://arxiv.org/pdf/2407.11298v1	无
MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World	https://arxiv.org/pdf/2401.08577	无
NVP-HRI: Zero Shot Natural Voice and Posture-based Human-Robot Interaction via Large Language Model	https://arxiv.org/pdf/2503.09335	https://github.com/laiyuzhi/NVP-HRI.git
MindEye-OmniAssist: A Gaze-Driven LLM-Enhanced Assistive Robot System for Implicit Intention Recognition and Task Execution	https://arxiv.org/pdf/2503.13250	无
HybridGen: VLM-Guided Hybrid Planning for Scalable Data Generation of Imitation Learning	https://arxiv.org/pdf/2503.13171	无
Free-form language-based robotic reasoning and grasping	https://arxiv.org/pdf/2503.13082	无
Mitigating Cross-Modal Distraction and Ensuring Geometric Feasibility via Affordance-Guided, Self-Consistent MLLMs for Food Preparation Task Planning	https://arxiv.org/pdf/2503.13055	无
Multi-Agent LLM Actor-Critic Framework for Social Robot Navigation	https://arxiv.org/pdf/2503.09758	无
PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability	https://arxiv.org/pdf/2503.08481	无
MetaFold: Language-Guided Multi-Category Garment Folding Framework via Trajectory Generation and Foundation Model	https://arxiv.org/pdf/2503.08372	无
UAD: Unsupervised Affordance Distillation for Generalization in Robotic Manipulation	https://unsup-affordance.github.io/uad.pdf	无
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning	https://arxiv.org/pdf/2503.15558	https://github.com/nvidia-cosmos/cosmos-reason1
EmbodiedVSR: Dynamic Scene Graph-Guided Chain-of-Thought Reasoning for Visual Spatial Tasks	https://arxiv.org/pdf/2503.11089	无
RoboDexVLM: Visual Language Model-Enabled Task Planning and Motion Control for Dexterous Robot Manipulation	https://arxiv.org/pdf/2503.01616	无
Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation	https://arxiv.org/pdf/2502.16707v1	无
Learning Generalizable Language-Conditioned Cloth Manipulation from Long Demonstrations	https://arxiv.org/pdf/2503.04557	无
Look Before You Leap: Using Serialized State Machine for Language Conditioned Robotic Manipulation	https://arxiv.org/pdf/2503.05114	无
Perceiving, Reasoning, Adapting: A Dual-Layer Framework for VLM-Guided Precision Robotic Manipulation	https://arxiv.org/pdf/2503.05064	无
AutoSpatial: Visual-Language Reasoning for Social Robot Navigation through Efficient Spatial Reasoning Learning	https://arxiv.org/pdf/2503.07557	https://github.com/Yanko96/AutoSpatial
AffordDexGrasp: Open-set Language-guided Dexterous Grasp with Generalizable-Instructive Affordance	https://arxiv.org/pdf/2503.07360	无
LLM+MAP: Bimanual Robot Task Planning using Large Language Models and Planning Domain Definition Language	https://arxiv.org/pdf/2503.17309	https://github.com/Kchu/LLM-MAP
Leveraging Language Models for Out-of-Distribution Recovery in Reinforcement Learning	https://arxiv.org/pdf/2503.17125	无
RoboEngine: Plug-and-Play Robot Data Augmentation with Semantic Robot Segmentation and Background Generation	https://arxiv.org/pdf/2503.18738	无
IRef-VLA: A Benchmark for Interactive Referential Grounding with Imperfect Language in 3D Scenes	https://arxiv.org/pdf/2503.17406	无
Cooking Task Planning using LLM and Verified by Graph Network	https://arxiv.org/pdf/2503.21564	无
