Title Paper-link Code-link
ObVLN: Breaking Through Obstacles, Vision-and-Language Navigation in Obstructed Environments https://dl.acm.org/doi/pdf/10.1145/3664647.3681640 https://github.com/honghd16/ObstructedVLN
Behavioral Analysis of Vision-and-Language Navigation Agents https://openaccess.thecvf.com/content/CVPR2023/papers/Yang_Behavioral_Analysis_of_Vision-and-Language_Navigation_Agents_CVPR_2023_paper.pdf https://github.com/Yoark/vln-behave
LLM as Copilot for Coarse-grained Vision-and-Language Navigation https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/00833.pdf
NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation https://arxiv.org/pdf/2402.15852 https://pku-epic.github.io/NaVid/
AerialVLN: Vision-and-Language Navigation for UAVs https://openaccess.thecvf.com/content/ICCV2023/papers/Liu_AerialVLN_Vision-and-Language_Navigation_for_UAVs_ICCV_2023_paper.pdf https://github.com/AirVLN/AirVLN
Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method https://arxiv.org/pdf/2412.09082 https://hcplab-sysu.github.io/LH-VLN/
doScenes: An Autonomous Driving Dataset with Natural Language Instruction for Human Interaction and Vision-Language Navigation https://arxiv.org/pdf/2412.05893 https://www.github.com/rossgreer/doScenes
MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation https://aclanthology.org/2024.acl-long.529.pdf https://github.com/chen-judge/MapGPT/
NL-SLAM for OC-VLN: Natural Language Grounded SLAM for Object-Centric VLN https://arxiv.org/pdf/2411.07848 https://sonia-raychaudhuri.github.io/nlslam/
Aerial Vision-and-Language Navigation via Semantic-Topo-Metric Representation Guided LLM Reasoning https://arxiv.org/pdf/2410.08500
VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation https://openreview.net/pdf?id=gdw1zUTABk https://github.com/bdaiinstitute/vlfm
CITYNAV: LANGUAGE-GOAL AERIAL NAVIGATION DATASET WITH GEOGRAPHIC INFORMATION https://arxiv.org/pdf/2406.14240 https://water-cookie.github.io/city-nav-proj/
NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models https://link.springer.com/chapter/10.1007/978-3-031-72667-5_15 https://github.com/GengzeZhou/NavGPT-2
NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models https://ojs.aaai.org/index.php/AAAI/article/download/28597/29161 https://github.com/GengzeZhou/NavGPT
Navigation Instruction Generation with BEV Perception and Large Language Models https://link.springer.com/chapter/10.1007/978-3-031-72670-5_21 https://github.com/FanScy/BEVInstructor
BOOTSTRAPPING LANGUAGE-GUIDED NAVIGATION LEARNING WITH SELF-REFINING DATA FLYWHEEL https://arxiv.org/pdf/2412.08467 https://github.com/wz0919/VLN-SRDF
Towards Learning a Generalist Model for Embodied Navigation https://openaccess.thecvf.com/content/CVPR2024/papers/Zheng_Towards_Learning_a_Generalist_Model_for_Embodied_Navigation_CVPR_2024_paper.pdf https://openaccess.thecvf.com/content/CVPR2024/papers/Zheng_Towards_Learning_a_Generalist_Model_for_Embodied_Navigation_CVPR_2024_paper.pdf
CorNav: Autonomous Agent with Self-Corrected Planning for Zero-Shot Vision-and-Language Navigation https://aclanthology.org/2024.findings-acl.745.pdf https://github.com/liangcici/MO-VLN
NavAgent: Multi-scale Urban Street View Fusion For UAV Embodied Vision-and-Language Navigation https://arxiv.org/pdf/2411.08579
Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching https://arxiv.org/pdf/2311.12751 https://github.com/MultimodalGeo/GeoText-1652
SayNav: Grounding Large Language Models for Dynamic Planning to Navigation in New Environments https://ojs.aaai.org/index.php/ICAPS/article/download/31506/33666 https://www.sri.com/ics/computer-vision/saynav
LeLaN: Learning A Language-Conditioned Navigation Policy from In-the-Wild Videos https://arxiv.org/pdf/2410.03603 https://github.com/NHirose/learning-language-navigation
Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks https://arxiv.org/pdf/2412.06224 https://pku-epic.github.io/Uni-NaVid/
WebVLN: Vision-and-Language Navigation on Websites https://ojs.aaai.org/index.php/AAAI/article/view/27878 https://github.com/WebVLN/WebVLN
SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts https://arxiv.org/pdf/2412.05552 https://github.com/GengzeZhou/SAME
InstructNav: Zero-shot System for Generic Instruction Navigation in Unexplored Environment https://openreview.net/pdf?id=fCDOfpTCzZ https://github.com/LYX0501/InstructNav
VLA-3D: A Dataset for 3D Semantic Scene Understanding and Navigation https://semrob.github.io/docs/rss_semrob2024_cr_paper12.pdf https://github.com/HaochenZ11/VLA-3D
Human-centered In-building Embodied Delivery Benchmark https://arxiv.org/pdf/2406.17898 https://github.com/PRS-Organization/prs-delivery
AeroVerse: UAV-Agent Benchmark Suite for Simulating, Pre-training, Finetuning, and Evaluating Aerospace Embodied World Models https://arxiv.org/pdf/2408.15511
NAVILA: LEGGED ROBOT VISION-LANGUAGE ACTION MODEL FOR NAVIGATION https://arxiv.org/pdf/2412.04453 https://github.com/yang-zj1026/legged-loco
Navigation World Models https://arxiv.org/abs/2412.03572
TOWARDS REALISTIC UAV VISION-LANGUAGE NAVIGATION: PLATFORM, BENCHMARK, AND METHODOLOGY https://arxiv.org/pdf/2410.07087
Recent Advances in Robot Navigation via Large Language Models: A Review https://www.researchgate.net/profile/Xian-Wei-3/publication/384537380_Recent_Advances_in_Robot_Navigation_via_Large_Language_Models_A_Review/links/66fd2d9b906bca2ac3de2f73/Recent-Advances-in-Robot-Navigation-via-Large-Language-Models-A-Review.pdf
VLM-GroNav: Robot Navigation Using Physically Grounded Vision-Language Models in Outdoor Environments https://arxiv.org/pdf/2409.20445 https://gamma.umd.edu/researchdirections/crowdmultiagent/vlm-gronav/
PRET: Planning with Directed Fidelity Trajectory for Vision and Language Navigation https://link.springer.com/chapter/10.1007/978-3-031-72848-8_5 https://github.com/iSEE-Laboratory/VLN-PRET
Find What You Want: Learning Demand-conditioned Object Attribute Space for Demand-driven Navigation https://arxiv.org/pdf/2309.08138 https://github.com/whcpumpkin/Demand-driven-navigation
Navigation as Attackers Wish? Towards Building Robust Embodied Agents under Federated Learning https://aclanthology.org/2024.naacl-long.57.pdf
VoroNav: Voronoi-based Zero-shot Object Navigation with Large Language Model https://openreview.net/pdf?id=Va7mhTVy5s
InfiniteWorld: A Unified Scalable Simulation Framework for General Visual-Language Robot Interaction https://arxiv.org/pdf/2412.05789 https://github.com/pzhren/InfiniteWorld
RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation https://arxiv.org/pdf/2412.08591 https://github.com/roomtour3d/roomtour3d-NaviLLM
UAVs Meet LLMs: Overviews and Perspectives Toward Agentic Low-Altitude Mobility https://arxiv.org/pdf/2501.02341 https://github.com/Hub-Tian/UAVs_Meet_LLMs
Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions https://arxiv.org/pdf/2406.19236v1 https://github.com/lpercc/HA3D_simulator
Why Only Text: Empowering Vision-and-Language Navigation with Multi-modal Prompts https://arxiv.org/pdf/2406.02208 https://github.com/honghd16/VLN-MP
Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation https://openaccess.thecvf.com/content/CVPR2022/papers/Chen_Think_Global_Act_Local_Dual-Scale_Graph_Transformer_for_Vision-and-Language_Navigation_CVPR_2022_paper.pdf https://github.com/cshizhe/VLN-DUET
All Robots in One: A New Standard and Unified Dataset for Versatile, General-Purpose Embodied Agents https://arxiv.org/pdf/2408.10899 https://openi.pcl.ac.cn/ARIO/ARIO_Dataset
Humanoid Locomotion and Manipulation: Current Progress and Challenges in Control, Planning, and Learning https://arxiv.org/pdf/2501.02116
ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments https://arxiv.org/pdf/2304.03047 https://github.com/MarSaKi/ETPNav
UEVAVD: A Dataset for Developing UAV’s Eye View Active Object Detection https://arxiv.org/pdf/2411.04348 https://github.com/Leo000ooo/UEVAVD_dataset
FLAME: Learning to Navigate with Multimodal LLM in Urban Environments https://arxiv.org/pdf/2408.11051 https://github.com/xyz9911/FLAME
TopV-Nav: Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation https://arxiv.org/pdf/2411.16425
EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models https://aclanthology.org/2024.acl-short.33.pdf https://github.com/mengfeidu/EmbSpatial-Bench
GRUtopia: Dream General Robots in a City at Scale https://arxiv.org/pdf/2407.10943v1 https://github.com/OpenRobotLab/GRUtopia
EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents https://arxiv.org/pdf/2501.11858 https://github.com/thunlp/EmbodiedEval
ROBOSPATIAL: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics https://arxiv.org/pdf/2411.16537
ReMEmbR: Building and Reasoning Over Long-Horizon Spatio-Temporal Memory for Robot Navigation https://arxiv.org/abs/2409.13682 https://github.com/NVIDIA-AI-IOT/remembr
The Tong Test: Evaluating Artificial General Intelligence Through Dynamic Embodied Physical and Social Interactions https://www.sciencedirect.com/science/article/pii/S209580992300293X
Mind the Error! Detection and Localization of Instruction Errors in Vision-and-Language Navigation https://arxiv.org/pdf/2403.10700 https://github.com/intelligolabs/R2RIE-CE
SpatialCoT: Advancing Spatial Reasoning through Coordinate Alignment and Chain-of-Thought for Embodied Task Planning https://arxiv.org/pdf/2501.10074 https://spatialcot.github.io/
ET-Plan-Bench: Embodied Task-level Planning Benchmark Towards Spatial-Temporal Cognition with Foundation Models https://arxiv.org/pdf/2410.14682
EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models https://openaccess.thecvf.com/content/CVPR2024/papers/Cheng_EgoThink_Evaluating_First-Person_Perspective_Thinking_Capability_of_Vision-Language_Models_CVPR_2024_paper.pdf https://github.com/AdaCheng/EgoThink/
HumanoidBench: Simulated Humanoid Benchmark for Whole-Body Locomotion and Manipulation https://arxiv.org/pdf/2403.10506 https://github.com/carlosferrazza/humanoid-bench
Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making https://arxiv.org/abs/2410.07166 https://github.com/embodied-agent-interface/embodied-agent-interface
The One RING: a Robotic Indoor Navigation Generalist https://arxiv.org/pdf/2412.14401v1
SARO: Space-Aware Robot System for Terrain Crossing via Vision-Language Model https://saro-vlm.github.io/resources/saro_paper.pdf
Robust Robot Walker: Learning Agile Locomotion over Tiny Traps https://robust-robot-walker.github.io/resources/rrw_paper.pdf
RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation https://arxiv.org/abs/2311.01455 https://github.com/Genesis-Embodied-AI/RoboGen
See and Think: Embodied Agent in Virtual Environment (Affiliations: Zhejiang University, University of Washington, Hong Kong University of Science and Technology) https://arxiv.org/pdf/2311.15209
EvoAgent: Agent Autonomous Evolution with Continual World Model for Long-Horizon Tasks https://arxiv.org/pdf/2502.05907
OpenBench: A New Benchmark and Baseline for Semantic Navigation in Smart Logistics https://arxiv.org/pdf/2502.09238 https://github.com/EI-Nav/light-map-navigation
CityEQA: A Hierarchical LLM Agent on Embodied Question Answering Benchmark in City Space (Affiliations: National University of Defense Technology, Tsinghua University) https://arxiv.org/pdf/2502.12532
GENERAL SCENE ADAPTATION FOR VISION-AND-LANGUAGE NAVIGATION https://arxiv.org/pdf/2501.17403 https://github.com/honghd16/GSA-VLN
SpatialRGPT https://www.anjiecheng.me/assets/SpatialRGPT/Spatial_RGPT.pdf https://github.com/AnjieCheng/SpatialRGPT
GTA-UAV https://arxiv.org/pdf/2409.16925 https://github.com/Yux1angJi/GTA-UAV
Vision-and-Language Navigation via Causal Learning https://openaccess.thecvf.com/content/CVPR2024/papers/Wang_Vision-and-Language_Navigation_via_Causal_Learning_CVPR_2024_paper.pdf https://github.com/CrystalSixone/VLN-GOAT
UAV-VLA: Vision-Language-Action System for Large Scale Aerial Mission Generation https://arxiv.org/pdf/2501.05014 https://github.com/sautenich/uav-vla
Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation https://arxiv.org/pdf/2409.18313
Embodied Intelligence: A Synergy of Morphology, Action, Perception and Learning (Publication: ACM Computing Surveys, 2025) https://dl.acm.org/doi/pdf/10.1145/3717059
UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios https://arxiv.org/pdf/2408.17267 https://github.com/opendatalab/UrBench
OpenFly: A Versatile Toolchain and Large-scale Benchmark for Aerial Vision-Language Navigation https://arxiv.org/pdf/2502.18041
Do Visual Imaginations Improve Vision-and-Language Navigation Agents? https://arxiv.org/abs/2503.16394
Magma: A Foundation Model for Multimodal AI Agents https://www.arxiv.org/pdf/2502.13130 https://github.com/microsoft/Magma
NavRAG: Generating User Demand Instructions for Embodied Navigation through Retrieval-Augmented LLM https://arxiv.org/pdf/2502.11142 https://github.com/MrZihan/NavRAG
Mem2Ego: Empowering Vision-Language Models with Global-to-Ego Memory for Long-Horizon Embodied Navigation https://arxiv.org/pdf/2502.14254
LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding https://arxiv.org/pdf/2501.08282 https://github.com/appletea233/LLaVA-ST
VIRTUAL COMMUNITY: A GENERATIVE SOCIAL WORLD FOR EMBODIED AI https://openreview.net/pdf?id=aRxLDcxFcL
MapNav: A Novel Memory Representation via Annotated Semantic Maps for VLM-based Vision-and-Language Navigation https://arxiv.org/pdf/2502.13451
Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models (Publication: Transactions on Machine Learning Research, 2024) https://openreview.net/pdf?id=yiqeh2ZYUh
NAVCON: A Cognitively Inspired and Linguistically Grounded Corpus for Vision and Language Navigation https://arxiv.org/pdf/2412.13026
EmbodiedRAG: Dynamic 3D Scene Graph Retrieval for Efficient and Scalable Robot Task Planning https://arxiv.org/pdf/2410.23968
MageBench: Bridging Large Multimodal Models to Agents https://arxiv.org/pdf/2412.04531 https://github.com/microsoft/MageBench
Exploring Embodied Multimodal Large Models: Development, Datasets, and Future Directions http://arxiv.org/abs/2502.15336
UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces https://arxiv.org/pdf/2503.06157 https://github.com/EmbodiedCity/UrbanVideo-Bench.code
Advancements in Humanoid Robots: A Comprehensive Review and Future Prospects (Publication: IEEE/CAA Journal of Automatica Sinica, vol. 11, no. 2, February 2024) https://ieeexplore.ieee.org/abstract/document/10415857
Long-Horizon Vision-Language Navigation Platform and Dataset: Towards Intelligent Robots in Complex Environments (Authors: Xinshuai Song, Weixing Chen, Yang Liu, Weikai Chen, Guanbin Li, Liang Lin; Affiliations: Sun Yat-sen University, Independent Researcher, Peng Cheng Laboratory) https://arxiv.org/pdf/2412.09082
Dual-BEV Nav: Dual-layer BEV-based Heuristic Path Planning for Robotic Navigation in Unstructured Outdoor Environments https://arxiv.org/pdf/2501.18351
Towards Robust and Secure Embodied AI: A Survey on Vulnerabilities and Attacks https://arxiv.org/pdf/2502.13175
Agent Journey Beyond RGB: Unveiling Hybrid Semantic-Spatial Environmental Representations for Vision-and-Language Navigation https://github.com/HCI-LMC/VLN-SUSA (coming soon)
NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning https://arxiv.org/abs/2403.07376 https://github.com/expectorlin/NavCoT
Lifelong Learning of Large Language Model based Agents: A Roadmap https://arxiv.org/pdf/2501.07278
An LLM-based vision and language cobot navigation approach for Human-centric Smart Manufacturing (Publication: Journal of Manufacturing Systems 75 (2024) 299–305) https://www.sciencedirect.com/science/article/pii/S0278612524000864
Open3DVQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space https://www.arxiv.org/pdf/2503.11094 https://github.com/WeichenZh/Open3DVQA
Perceive, Reflect, and Plan: Designing LLM Agent for Goal-Directed City Navigation without Instructions https://arxiv.org/pdf/2408.04168 https://anonymous.4open.science/r/PReP-13B5
EARBench: Towards Evaluating Physical Risk Awareness for Task Planning of Foundation Model-based Embodied AI Agents https://arxiv.org/pdf/2408.04449 https://github.com/zihao-ai/EARBench
FlexVLN: Flexible Adaptation for Diverse Vision-and-Language Navigation Tasks https://arxiv.org/pdf/2503.13966
Ground-level Viewpoint Vision-and-Language Navigation in Continuous Environments https://arxiv.org/pdf/2502.19024
WMNav: Integrating Visual-Language Models into World Models for Object Goal Navigation https://arxiv.org/abs/2503.02247
UniGoal: Towards Universal Zero-shot Goal-oriented Navigation https://arxiv.org/abs/2503.10630 https://github.com/bagh2178/UniGoal
PanoGen++: Domain-adapted text-guided panoramic environment generation for vision-and-language navigation (Publication: Neural Networks 187 (2025) 107320)
UNMuTe: Unifying Navigation and Multimodal Dialogue-like Text Generation https://arxiv.org/pdf/2408.04423
Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation https://arxiv.org/pdf/2503.18065 https://github.com/SaDil13/VLN-RAM
VERTIFORMER: A Data-Efficient Multi-Task Transformer for Off-Road Robot Mobility https://arxiv.org/pdf/2502.00543 https://github.com/mhnazeri/VertiFormer
Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning https://arxiv.org/pdf/2504.12680 https://github.com/EmbodiedCity/Embodied-R.code
CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos https://arxiv.org/abs/2411.17820 https://github.com/ai4ce/CityWalker
GeoNav: Empowering MLLMs with Explicit Geospatial Reasoning Abilities for Language-Goal Aerial Navigation https://arxiv.org/pdf/2504.09587
World-Consistent Data Generation for Vision-and-Language Navigation https://arxiv.org/pdf/2412.06413
TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation https://arxiv.org/pdf/2502.07306
Zero-Shot Object Navigation with Vision-Language Models Reasoning https://link.springer.com/chapter/10.1007/978-3-031-78456-9_25
Multimodal Pretrained Knowledge for Real-world Object Navigation (Publication: Machine Intelligence Research (2025)) https://www.mi-research.net/en/article/pdf/preview/10.1007/s11633-024-1537-x.pdf
Instruction-guided path planning with 3D semantic maps for vision-language navigation (Publication: Neurocomputing 625 (2025) 129457) https://www.sciencedirect.com/science/article/abs/pii/S0925231225001298?via%3Dihub
COSMO: Combination of Selective Memorization for Low-cost Vision-and-Language Navigation https://arxiv.org/pdf/2503.24065
ESceme: Vision-and-Language Navigation with Episodic Scene Memory https://link.springer.com/article/10.1007/s11263-024-02159-8 https://github.com/qizhust/esceme
Vi-LAD: Vision-Language Attention Distillation for Socially-Aware Robot Navigation in Dynamic Environments https://doi.org/10.48550/arXiv.2503.09820
Advances in Embodied Navigation Using Large Language Models: A Survey https://arxiv.org/pdf/2311.00530
NavigateDiff: Visual Predictors are Zero-Shot Navigation Assistants https://arxiv.org/pdf/2502.13894
ClipRover: Zero-shot Vision-Language Exploration and Target Discovery by Mobile Robots https://arxiv.org/pdf/2502.08791
Aerial Vision-and-Language Navigation with Grid-based View Selection and Map Construction https://arxiv.org/pdf/2503.11091
UnrealZoo Gym (no explicit paper title; project name used instead) https://arxiv.org/abs/2412.20977 https://github.com/UnrealZoo/unrealzoo-gym
WMNavigation (no explicit paper title; project name used instead) https://arxiv.org/pdf/2503.02247 https://github.com/B0B8K1ng/WMNavigation
KiteRunner: Language-Driven Cooperative Local-Global Navigation Policy with UAV Mapping in Outdoor Environments https://arxiv.org/pdf/2503.08330
SmartWay: Enhanced Waypoint Prediction and Backtracking for Zero-Shot Vision-and-Language Navigation https://arxiv.org/pdf/2503.10069
Vision-language model-based human-robot collaboration for smart manufacturing: A state-of-the-art survey (Publication: Frontiers of Engineering Management 2025, 12 (1): 177–200) https://link.springer.com/content/pdf/10.1007/s42524-025-4136-9.pdf
Look Around Before Locating: Considering Content and Structure Information for Visual Grounding https://ojs.aaai.org/index.php/AAAI/article/view/32158/34313 https://github.com/VILAN-Lab/SSRVG
Self-Prompting Analogical Reasoning for UAV Object Detection https://ojs.aaai.org/index.php/AAAI/article/view/34026/36181
Learning Fine-Grained Alignment for Aerial Vision-Dialog Navigation https://ojs.aaai.org/index.php/AAAI/article/view/32758/34913 https://github.com/yifeisu/FELA
LogisticsVLN: Vision-Language Navigation For Low-Altitude Terminal Delivery Based on Agentic UAVs https://arxiv.org/pdf/2505.03460
Embodied Intelligence for 3D Understanding: A Survey on 3D Scene Question Answering https://arxiv.org/pdf/2502.00342
RefDrone: A Challenging Benchmark for Drone Scene Referring Expression Comprehension https://arxiv.org/pdf/2502.00392 https://github.com/sunzc-sunny/refdrone
Framework for Autonomous UAV Navigation and Target Detection in Global-Navigation-Satellite-System-Denied and Visually Degraded Environments https://www.mdpi.com/2072-4292/16/3/471/pdf?version=1706191544
Towards Autonomous UAV Visual Object Search in City Space: Benchmark and Agentic Methodology https://arxiv.org/abs/2505.08765 https://anonymous.4open.science/r/CityAVOS-3DF8
AirVista-II: An Agentic System for Embodied UAVs Toward Dynamic Scene Semantic Understanding https://arxiv.org/pdf/2504.09583
UAV Visual Navigation in the Large-Scale Outdoor Environment: A Semantic Map-Based Cognitive Escape Reinforcement Learning Method https://ieeexplore.ieee.org/abstract/document/10847926
A Survey of Large Language Model-Powered Spatial Intelligence Across Scales: Advances in Embodied Agents, Smart Cities, and Earth Science https://arxiv.org/pdf/2504.09848
General-Purpose Aerial Intelligent Agents Empowered by Large Language Models https://arxiv.org/pdf/2503.08302
Beyond the Destination: A Novel Benchmark for Exploration-Aware Embodied Question Answering https://arxiv.org/pdf/2503.11117 https://github.com/HCPLab-SYSU/EXPRESS-Bench
CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory https://www.arxiv.org/pdf/2505.05622 https://github.com/VinceOuti/CityNavAgent (coming soon)
Observation-Graph Interaction and Key-Detail Guidance for Vision and Language Navigation https://www.arxiv.org/pdf/2503.11006
UAV-VLN: End-to-End Vision Language guided Navigation for UAVs https://arxiv.org/pdf/2504.21432
FlightGPT: Towards Generalizable and Interpretable UAV Vision-and-Language Navigation with Vision-Language Models https://arxiv.org/pdf/2505.12835 https://github.com/Pendulumclock/FlightGPT
Visuomotor Navigation for Embodied Robots With Spatial Memory and Semantic Reasoning Cognition https://ieeexplore.ieee.org/abstract/document/10682097
UAV-FlowColosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning https://arxiv.org/pdf/2505.15725
UAV-CodeAgents: Scalable UAV Mission Planning via Multi-Agent ReAct and Vision-Language Reasoning https://arxiv.org/pdf/2505.07236
Task-Oriented Communications for Visual Navigation with Edge-Aerial Collaboration in Low Altitude Economy https://arxiv.org/abs/2504.18317v3 https://github.com/fangzr/TOC-Edge-Aerial
AirSpatialBot: A Spatially-Aware Aerial Agent for Fine-Grained Vehicle Attribute Recognization and Retrieval https://ieeexplore.ieee.org/document/11006099 https://github.com/VisionXLab/AirSpatialBot
RaceVLA: VLA-based Racing Drone Navigation with Human-like Behaviour https://arxiv.org/pdf/2503.02572 https://github.com/SerValera/RaceVLA
Personalised Explainable Robots Using LLMs https://ieeexplore.ieee.org/abstract/document/10974125
CoordField: Coordination Field for Agentic UAV Task Allocation In Low-altitude Urban Scenarios https://arxiv.org/pdf/2505.00091
DualMap: Online Open-Vocabulary Semantic Mapping for Natural Language Navigation in Dynamic Changing Scenes https://arxiv.org/pdf/2506.01950 https://github.com/Eku127/DualMap
CognitiveDrone: A VLA Model and Evaluation Benchmark for Real-Time Cognitive Task Solving and Reasoning in UAVs https://arxiv.org/pdf/2503.01378 https://cognitivedrone.github.io/
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence https://arxiv.org/pdf/2505.23747 https://github.com/diankun-wu/Spatial-MLLM
BEDI: A Comprehensive Benchmark for Evaluating Embodied Agents on UAVs https://arxiv.org/pdf/2505.18229 https://github.com/lostwolves/BEDI
Loc4Plan: Locating Before Planning for Outdoor Vision and Language Navigation https://dl.acm.org/doi/abs/10.1145/3664647.3681518
P3Nav: A Unified Framework for Embodied Navigation Integrating Perception, Planning, and Prediction https://arxiv.org/pdf/2503.18525
DOPE: Dual Object Perception-Enhancement Network for Vision-and-Language Navigation https://arxiv.org/pdf/2505.00743
AutoSpatial: Visual-Language Reasoning for Social Robot Navigation through Efficient Spatial Reasoning Learning https://arxiv.org/pdf/2503.07557 https://github.com/Yanko96/AutoSpatial (coming soon)
OctoNav: Towards Generalist Embodied Navigation https://arxiv.org/pdf/2506.09839 https://github.com/buaa-colalab/OctoNav-R1
A Survey of Robotic Navigation and Manipulation with Physics Simulators in the Era of Embodied AI https://arxiv.org/pdf/2505.01458
Toward Embodied AGI: A Review of Embodied AI and the Road Ahead https://arxiv.org/pdf/2505.14235
USTBench: Benchmarking and Dissecting Spatiotemporal Reasoning of LLMs as Urban Agents https://arxiv.org/pdf/2505.17572 https://github.com/usail-hkust/USTBench
CoNav: Collaborative Cross-Modal Reasoning for Embodied Navigation https://arxiv.org/pdf/2505.16663 https://github.com/oceanhao/CoNav
GaussNav: Gaussian Splatting for Visual Navigation https://ieeexplore.ieee.org/abstract/document/10870413/ https://github.com/XiaohanLei/GaussNav
HOZ++: Versatile Hierarchical Object-to-Zone Graph for Object Navigation https://ieeexplore.ieee.org/abstract/document/10933537/ https://github.com/sx-zhang/HOZplus
EvolveNav: Self-Improving Embodied Reasoning for LLM-Based Vision-Language Navigation https://arxiv.org/pdf/2506.01551 https://github.com/expectorlin/EvolveNav
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks https://arxiv.org/pdf/2503.21696 https://github.com/zwq2018/embodied_reasoner
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents https://arxiv.org/pdf/2502.09560 https://github.com/EmbodiedBench/EmbodiedBench
A Survey of Object Goal Navigation for Embodied Artificial Intelligence https://www.jos.org.cn/jos/article/abstract/7250
Multimodal Perception for Goal-oriented Navigation: A Survey https://arxiv.org/pdf/2504.15643
RoboSense: Large-scale Dataset and Benchmark for Egocentric Robot Perception and Navigation in Crowded and Unstructured Environments https://arxiv.org/pdf/2408.15503 https://github.com/suhaisheng/RoboSense (coming soon)
Vid2Sim: Realistic and Interactive Simulation from Video for Urban Navigation https://openaccess.thecvf.com/content/CVPR2025/papers/Xie_Vid2Sim_Realistic_and_Interactive_Simulation_from_Video_for_Urban_Navigation_CVPR_2025_paper.pdf https://github.com/Vid2Sim/Vid2Sim (coming soon)
Dynam3D: Dynamic Layered 3D Tokens Empower VLM for Vision-and-Language Navigation https://arxiv.org/pdf/2505.11383 https://github.com/MrZihan/Dynam3D (coming soon)
DyNaVLM: Zero-Shot Vision-Language Navigation System with Dynamic Viewpoints and Self-Refining Graph Memory https://arxiv.org/pdf/2506.15096
UAVs Meet Agentic AI: A Multidomain Survey of Autonomous Aerial Intelligence and Agentic UAVs https://arxiv.org/pdf/2506.08045
A Comprehensive Survey on Underwater Acoustic Target Positioning and Tracking: Progress, Challenges, and Perspectives https://arxiv.org/pdf/2506.14165
Observation-Graph Interaction and Key-Detail Guidance for Vision and Language Navigation https://doi.org/10.48550/arXiv.2503.11006
Prior Does Matter: Visual Navigation via Denoising Diffusion Bridge Models https://openaccess.thecvf.com/content/CVPR2025/papers/Ren_Prior_Does_Matter_Visual_Navigation_via_Denoising_Diffusion_Bridge_Models_CVPR_2025_paper.pdf https://github.com/hren20/NaiviBridger
VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning https://arxiv.org/abs/2506.17221 https://github.com/Qi-Zhangyang/GPT4Scene-and-VLN-R1
VR-Robo: A Real-to-Sim-to-Real Framework for Visual Robot Navigation and Locomotion https://arxiv.org/pdf/2502.01536 https://github.com/zst1406217/VR-Robo
Embodying Compliant Touch on Drones for Aerial Tactile Navigation https://ieeexplore.ieee.org/abstract/document/10806756 https://github.com/BioMorphic-Intelligence-Lab/tactile-drone
Controllable 3D Outdoor Scene Generation via Scene Graphs https://arxiv.org/abs/2503.07152 https://github.com/yuhengliu02/control-3d-scene (coming soon)
A Survey: Learning Embodied Intelligence from Physical Simulators and World Models https://arxiv.org/pdf/2507.00917 None (project page: https://github.com/NJU3DV-LoongGroup/Embodied-World-Models-Survey)
Aim My Robot: Precision Local Navigation to Any Object https://ieeexplore.ieee.org/abstract/document/10887329/
ApexNav: An Adaptive Exploration Strategy for Zero-Shot Object Navigation with Target-centric Semantic Fusion https://arxiv.org/pdf/2504.14478 https://github.com/Robotics-STAR-Lab/ApexNav (coming soon)
FindingDory: A Benchmark to Evaluate Memory in Embodied Agents https://arxiv.org/pdf/2506.15635
SkyVLN: Vision-and-Language Navigation and NMPC Control for UAVs in Urban Environments https://arxiv.org/pdf/2507.06564
REGNav: Room Expert Guided Image-Goal Navigation https://ojs.aaai.org/index.php/AAAI/article/download/32514/34669 https://github.com/leeBooMla/REGNav
Embodied navigation https://link.springer.com/article/10.1007/s11432-024-4303-8
FlySearch: Exploring how vision-language models explore https://arxiv.org/pdf/2506.02896 https://github.com/gmum/FlySearch
Towards Audio-visual Navigation in Noisy Environments: A Large-scale Benchmark Dataset and An Architecture Considering Multiple Sound-Sources https://ojs.aaai.org/index.php/AAAI/article/view/33608 https://github.com/ZhanboShiAI/ENMuS
MMCNav: MLLM-empowered Multi-agent Collaboration for Outdoor Visual Language Navigation https://dl.acm.org/doi/abs/10.1145/3731715.3733393 https://github.com/zzhaesc/MMCNav
MoMa-Kitchen: A 100K+ Benchmark for Affordance-Grounded Last-Mile Navigation in Mobile Manipulation https://arxiv.org/pdf/2503.11081 https://github.com/MoMaKitchen/MoMaKitchen (coming soon)
FindAnything: Open-Vocabulary and Object-Centric Mapping for Robot Exploration in Any Environment https://arxiv.org/pdf/2504.08603
Grounded Vision-Language Navigation for UAVs with Open-Vocabulary Goal Understanding https://www.arxiv.org/pdf/2506.10756 https://github.com/zzzzzyh111/Vision-Language-Fly (coming soon)
A Survey of Vision-Based UAV Localization and Navigation Methods https://www.ejournal.org.cn/CN/10.12263/DZXB.20240699
VLM-Nav: Mapless UAV-Navigation Using Monocular Vision Driven by Vision-Language Model https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5231854 https://github.com/gcsarker/vlm_nav
Mathematical Reasoning for Unmanned Aerial Vehicles: A RAG-Based Approach for Complex Arithmetic Reasoning https://arxiv.org/pdf/2506.04998 https://github.com/Mehdiazarafza/UAV-RAG
Mem4Nav: Boosting Vision-and-Language Navigation in Urban Environments with a Hierarchical Spatial-Cognition Long-Short Memory System https://arxiv.org/pdf/2506.19433v1 https://github.com/tsinghua-fib-lab/Mem4Nav (coming soon)
StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling https://arxiv.org/pdf/2507.05240 https://github.com/OpenRobotLab/StreamVLN
MorphoNavi: Aerial-Ground Robot Navigation with Object Oriented Mapping in Digital Twin https://arxiv.org/pdf/2504.16914
A Survey of Multi-sensor Fusion Perception for Embodied AI: Background, Methods, Challenges and Prospects https://arxiv.org/pdf/2506.19769v1
MossVLN: Memory-Observation Synergistic System for Continuous Vision-Language Navigation https://ieeexplore.ieee.org/abstract/document/11071855 https://github.com/OpenMICG/MossVLN
Semantic Mapping in Indoor Embodied AI – A Survey on Advances, Challenges, and Future Directions https://arxiv.org/pdf/2501.05750
X-Nav: Learning End-to-End Cross-Embodiment Navigation for Mobile Robots https://www.arxiv.org/pdf/2507.14731
Point2Graph: An End-to-end Point Cloud-based 3D Open-Vocabulary Scene Graph for Robot Navigation https://www.arxiv.org/pdf/2409.10350 https://github.com/zimingluo/Point2Graph
Mobile Robot Navigation Using Hand-Drawn Maps: A Vision Language Model Approach https://ieeexplore.ieee.org/abstract/document/11027320/
Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision https://arxiv.org/pdf/2504.02477 https://github.com/Xiaofeng-Han-Res/MF-RV
AirScape: An Aerial Generative World Model with Motion Controllability https://arxiv.org/pdf/2507.08885
TrackVLA: Embodied Visual Tracking in the Wild http://arxiv.org/abs/2505.23189 https://github.com/wsakobe/TrackVLA
SE-VLN: A Self-Evolving Vision-Language Navigation Framework Based on Multimodal Large Language Models https://arxiv.org/abs/2507.13152v1
SA-GCS: Semantic-Aware Gaussian Curriculum Scheduling for UAV Vision-Language Navigation https://arxiv.org/pdf/2508.00390
Foundation Model Driven Robotics: A Comprehensive Review https://arxiv.org/pdf/2507.10087v1
Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities https://arxiv.org/pdf/2507.13019 https://github.com/InternRobotics/InternNav
Large Model-Driven Embodied Intelligence: Developments and Challenges https://www.sciengine.com/SSI/doi/10.1360/SSI-2024-0076
CogDDN: A Cognitive Demand-Driven Navigation with Decision Optimization and Dual-Process Thinking https://arxiv.org/pdf/2507.11334 https://github.com/yuehaohuang/CogDDN
Recursive Visual Imagination and Adaptive Linguistic Grounding for Vision Language Navigation https://arxiv.org/pdf/2507.21450
MonoDream: Monocular Vision-Language Navigation with Panoramic Dreaming https://arxiv.org/pdf/2508.02549v1
FloNa: Floor Plan Guided Embodied Visual Navigation https://ojs.aaai.org/index.php/AAAI/article/download/33601/35756
Hi AirStar, Guide Me to the Badminton Court https://www.arxiv.org/pdf/2507.04430
Aux-Think: Exploring Reasoning Strategies for Data-Efficient Vision-Language Navigation https://arxiv.org/pdf/2505.11886 https://github.com/HorizonRobotics/robo_orchard_lab/tree/master/projects/aux_think
Open-Nav: Exploring Zero-Shot Vision-and-Language Navigation in Continuous Environment with Open-Source LLMs https://arxiv.org/abs/2409.18794 https://github.com/YanyuanQiao/Open-Nav
osmAG-LLM: Zero-Shot Open-Vocabulary Object Navigation via Semantic Maps and Large Language Models Reasoning https://arxiv.org/pdf/2507.12753v1 https://anonymous.4open.science/r/osmAG-LLM
A Survey of Embodied Intelligence Systems Based on Large Models http://www.aas.net.cn/cn/article/doi/10.16383/j.aas.c240542
Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps https://arxiv.org/pdf/2505.18675 https://github.com/fscdc/ReasonMap
NaviMaster: Learning a Unified Policy for GUI and Embodied Navigation Tasks https://arxiv.org/pdf/2508.02046v1
UAV-ON: A Benchmark for Open-World Object Goal Navigation with Aerial Agents https://arxiv.org/pdf/2508.00288 https://github.com/Kyaren/UAV_ON
AgriVLN: Vision-and-Language Navigation for Agricultural Robots https://arxiv.org/pdf/2508.07406v1 https://github.com/AlexTraveling/AgriVLN
i2Nav-Robot: A Large-Scale Indoor-Outdoor Robot Dataset for Multi-Sensor Fusion Navigation and Mapping https://arxiv.org/pdf/2508.11485 https://github.com/i2Nav-WHU/i2Nav-Robot
NavBench: Probing Multimodal Large Language Models for Embodied Navigation https://arxiv.org/abs/2506.01031 https://github.com/NavBench/Evaluation_Code (coming soon)
ODYSSEY: Open-World Quadrupeds Exploration and Manipulation for Long-Horizon Tasks https://arxiv.org/pdf/2508.08240
AerialGo: Walking-through City View Generation from Aerial Perspectives https://arxiv.org/pdf/2412.00157
AeroDuo: Aerial Duo for UAV-based Vision and Language Navigation https://arxiv.org/abs/2508.15232v1
Autonomous Exploration with Terrestrial-Aerial Bimodal Vehicles https://arxiv.org/pdf/2507.21338v1
Vision-based Navigation of Unmanned Aerial Vehicles in Orchards: An Imitation Learning Approach https://arxiv.org/abs/2508.02617v1
HA-VLN: A Benchmark for Human-Aware Navigation in Discrete–Continuous Environments with Dynamic Multi-Human Interactions, Real-World Validation, and an Open Leaderboard https://arxiv.org/pdf/2503.14229 https://github.com/F1y1113/HA-VLN
LOVON: Legged Open-Vocabulary Object Navigator https://arxiv.org/pdf/2507.06747 https://github.com/DaojiePENG/LOVON
Sensing, Social, and Motion Intelligence in Embodied Navigation: A Comprehensive Survey https://arxiv.org/pdf/2508.15354v1
Quantifying and Alleviating Co-Adaptation in Sparse-View 3D Gaussian Splatting https://arxiv.org/pdf/2508.12720 https://github.com/chenkangjie1123/Co-Adaptation-of-3DGS/
RoboNurse-VLA: Robotic Scrub Nurse System based on Vision-Language-Action Model https://arxiv.org/pdf/2409.19590
Actra: Optimized Transformer Architecture for Vision-Language-Action Models in Robot Learning https://arxiv.org/pdf/2408.01147
QUAR-VLA: Vision-Language-Action Model for Quadruped Robots https://arxiv.org/pdf/2312.14457
DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution https://arxiv.org/pdf/2411.02359
General Flow as Foundation Affordance for Scalable Robot Learning https://arxiv.org/pdf/2401.11439
Learning Universal Policies via Text-Guided Video Generation https://arxiv.org/pdf/2302.00111
Learning to Act from Actionless Videos through Dense Correspondences https://arxiv.org/pdf/2310.08576
Compositional Foundation Models for Hierarchical Planning https://arxiv.org/pdf/2309.08587
VIMA: General Robot Manipulation with Multimodal Prompts https://vimalabs.github.io./assets/vima_paper.pdf
Prompt a Robot to Walk with Large Language Models https://arxiv.org/pdf/2309.09969
Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning https://arxiv.org/pdf/2311.17842
π0 https://arxiv.org/pdf/2410.24164 https://github.com/Physical-Intelligence/openpi
CogACT https://arxiv.org/pdf/2411.19650 https://github.com/microsoft/CogACT.git
Diffusion-VLA https://arxiv.org/abs/2412.03293
3D-VLA https://arxiv.org/pdf/2403.09631 https://github.com/UMass-Foundation-Model/3D-VLA/tree/main
TinyVLA https://arxiv.org/abs/2409.12514
TraceVLA https://arxiv.org/pdf/2412.10345 https://github.com/umd-huang-lab/tracevla
Octo https://arxiv.org/pdf/2405.12213
SpatialVLA https://arxiv.org/pdf/2501.15830 https://github.com/SpatialVLA/SpatialVLA
Gemini Robotics https://arxiv.org/pdf/2503.20020
SafeVLA https://arxiv.org/abs/2503.03480 https://github.com/PKU-Alignment/SafeVLA
HybridVLA https://arxiv.org/pdf/2503.10631 https://github.com/PKU-HMI-Lab/Hybrid-VLA
DexVLA https://arxiv.org/pdf/2502.05855 https://github.com/juruobenruo/DexVLA
DexGraspVLA https://arxiv.org/abs/2502.20900 https://github.com/Psi-Robot/DexGraspVLA
UP-VLA https://arxiv.org/pdf/2501.18867
CoT-VLA https://arxiv.org/pdf/2503.22020
UniAct https://arxiv.org/abs/2501.10105 https://github.com/2toinf/UniAct
Dynamic Path Navigation for Motion Agents with LLM Reasoning https://arxiv.org/pdf/2503.07323
SmartWay: Enhanced Waypoint Prediction and Backtracking for Zero-Shot Vision-and-Language Navigation https://arxiv.org/pdf/2503.10069
Vi-LAD: Vision-Language Attention Distillation for Socially-Aware Robot Navigation in Dynamic Environments https://arxiv.org/pdf/2503.09820
UniGoal: Towards Universal Zero-shot Goal-oriented Navigation https://arxiv.org/pdf/2503.10630
PanoGen++: Domain-Adapted Text-Guided Panoramic Environment Generation for Vision-and-Language Navigation https://arxiv.org/pdf/2503.09938
Do Visual Imaginations Improve Vision-and-Language Navigation Agents? https://arxiv.org/pdf/2503.16394
Learned Perceptive Forward Dynamics Model for Safe and Platform-aware Robotic Navigation https://arxiv.org/pdf/2504.19322 https://github.com/leggedrobotics/fdm
VL-Nav: Real-time Vision-Language Navigation with Spatial Reasoning https://arxiv.org/pdf/2502.00931
TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation https://arxiv.org/pdf/2502.07306
VR-Robo: A Real-to-Sim-to-Real Framework for Visual Robot Navigation and Locomotion https://arxiv.org/pdf/2502.01536
NavigateDiff: Visual Predictors are Zero-Shot Navigation Assistants https://arxiv.org/pdf/2502.13894
Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation https://arxiv.org/pdf/2503.18065 https://github.com/SaDil13/VLN-RAM
COSMO: Combination of Selective Memorization for Low-cost Vision-and-Language Navigation https://arxiv.org/pdf/2503.24065
ForesightNav: Learning Scene Imagination for Efficient Exploration https://arxiv.org/pdf/2504.16062 https://github.com/uzh-rpg/foresight-nav
CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory https://arxiv.org/pdf/2505.05622 https://github.com/VinceOuti/CityNavAgent
NavDP: Learning Sim-to-Real Navigation Diffusion Policy with Privileged Information Guidance https://arxiv.org/pdf/2505.08712
VISTA: Generative Visual Imagination for Vision-and-Language Navigation https://arxiv.org/pdf/2505.07868
Dynam3D: Dynamic Layered 3D Tokens Empower VLM for Vision-and-Language Navigation https://arxiv.org/pdf/2505.11383 https://github.com/MrZihan/Dynam3D
Active Test-time Vision-Language Navigation https://arxiv.org/pdf/2506.06630
BeliefMapNav: 3D Voxel-Based Belief Map for Zero-Shot Object Navigation https://arxiv.org/pdf/2506.06487 https://github.com/ZiboKNOW/BeliefMapNav
TopV-Nav: Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation https://arxiv.org/pdf/2411.16425
CorrectNav: Self-Correction Flywheel Empowers Vision-Language-Action Navigation Model https://arxiv.org/pdf/2508.10416
GC-VLN: Instruction as Graph Constraints for Training-free Vision-and-Language Navigation https://arxiv.org/pdf/2509.10454 https://github.com/bagh2178/GC-VLN
Embodied Navigation Foundation Model https://arxiv.org/pdf/2509.12129
Search-TTA: A Multimodal Test-Time Adaptation Framework for Visual Search in the Wild https://arxiv.org/pdf/2505.11350 https://github.com/marmotlab/Search-TTA-VLN
JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation https://arxiv.org/pdf/2509.22548 https://github.com/MIV-XJTU/JanusVLN
TrackVLA++: Unleashing Reasoning and Memory Capabilities in VLA Models for Embodied Visual Tracking https://arxiv.org/pdf/2510.07134
CANVAS: Commonsense-Aware Navigation System for Intuitive Human-Robot Interaction https://arxiv.org/abs/2410.01273 https://github.com/worv-ai/canvas
NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation https://arxiv.org/pdf/2402.15852
The One RING: a Robotic Indoor Navigation Generalist https://arxiv.org/pdf/2412.14401
Mobility VLA: Multimodal Instruction Navigation with Long-Context VLMs and Topological Graphs https://arxiv.org/pdf/2407.07775
GO-1
pi-0.5 https://arxiv.org/abs/2504.16054 https://github.com/Physical-Intelligence/openpi
Hi Robot https://arxiv.org/abs/2502.19417 https://github.com/Physical-Intelligence/openpi
TRACE: A Self-Improving Framework for Robot Behavior Forecasting with Vision-Language Models https://arxiv.org/pdf/2503.00761
AffordGrasp: In-Context Affordance Reasoning for Open-Vocabulary Task-Oriented Grasping in Clutter https://arxiv.org/pdf/2503.00778
Code-as-Symbolic-Planner: Foundation Model-Based Robot Planning via Symbolic Code Generation https://arxiv.org/pdf/2503.01700
Large Language Models as Natural Selector for Embodied Soft Robot Design https://arxiv.org/pdf/2503.02249 https://github.com/AisenGinn/evogym_data_generation
OVAMOS: A Framework for Open-Vocabulary Multi-Object Search in Unknown Environments https://arxiv.org/pdf/2503.02106
RoboDesign1M: A Large-scale Dataset for Robot Design Understanding https://arxiv.org/pdf/2503.06796
STAR: A Foundation Model-driven Framework for Robust Task Planning and Failure Recovery in Robotic Systems https://arxiv.org/pdf/2503.06060
MatchMaker: Automated Asset Generation for Robotic Assembly https://arxiv.org/pdf/2503.05887
Object-Centric World Model for Language-Guided Manipulation https://arxiv.org/pdf/2503.06170
KUDA: Keypoints to Unify Dynamics Learning and Visual Prompting for Open-Vocabulary Robotic Manipulation https://arxiv.org/pdf/2503.10546
IMPACT: Intelligent Motion Planning with Acceptable Contact Trajectories via Vision-Language Models https://arxiv.org/pdf/2503.10110
Building Cooperative Embodied Agents Modularly with Large Language Models https://arxiv.org/pdf/2307.02485
AIC MLLM: Autonomous Interactive Correction MLLM for Robust Robotic Manipulation https://arxiv.org/pdf/2406.11548
OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints https://arxiv.org/pdf/2501.03841
Grasp What You Want: Embodied Dexterous Grasping System Driven by Your Voice https://arxiv.org/pdf/2412.10694
Towards Open-World Grasping with Large Vision-Language Models https://arxiv.org/pdf/2406.18722v4
ThinkGrasp: A Vision-Language System for Strategic Part Grasping in Clutter https://arxiv.org/pdf/2407.11298v1
MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World https://arxiv.org/pdf/2401.08577
NVP-HRI: Zero Shot Natural Voice and Posture-based Human-Robot Interaction via Large Language Model https://arxiv.org/pdf/2503.09335 https://github.com/laiyuzhi/NVP-HRI.git
MindEye-OmniAssist: A Gaze-Driven LLM-Enhanced Assistive Robot System for Implicit Intention Recognition and Task Execution https://arxiv.org/pdf/2503.13250
HybridGen: VLM-Guided Hybrid Planning for Scalable Data Generation of Imitation Learning https://arxiv.org/pdf/2503.13171
Free-form language-based robotic reasoning and grasping https://arxiv.org/pdf/2503.13082
Mitigating Cross-Modal Distraction and Ensuring Geometric Feasibility via Affordance-Guided, Self-Consistent MLLMs for Food Preparation Task Planning https://arxiv.org/pdf/2503.13055
Multi-Agent LLM Actor-Critic Framework for Social Robot Navigation https://arxiv.org/pdf/2503.09758
PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability https://arxiv.org/pdf/2503.08481
MetaFold: Language-Guided Multi-Category Garment Folding Framework via Trajectory Generation and Foundation Model https://arxiv.org/pdf/2503.08372
UAD: Unsupervised Affordance Distillation for Generalization in Robotic Manipulation https://unsup-affordance.github.io/uad.pdf
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning https://arxiv.org/pdf/2503.15558 https://github.com/nvidia-cosmos/cosmos-reason1
EmbodiedVSR: Dynamic Scene Graph-Guided Chain-of-Thought Reasoning for Visual Spatial Tasks https://arxiv.org/pdf/2503.11089
RoboDexVLM: Visual Language Model-Enabled Task Planning and Motion Control for Dexterous Robot Manipulation https://arxiv.org/pdf/2503.01616
Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation https://arxiv.org/pdf/2502.16707v1
Learning Generalizable Language-Conditioned Cloth Manipulation from Long Demonstrations https://arxiv.org/pdf/2503.04557
Look Before You Leap: Using Serialized State Machine for Language Conditioned Robotic Manipulation https://arxiv.org/pdf/2503.05114
Perceiving, Reasoning, Adapting: A Dual-Layer Framework for VLM-Guided Precision Robotic Manipulation https://arxiv.org/pdf/2503.05064
AutoSpatial: Visual-Language Reasoning for Social Robot Navigation through Efficient Spatial Reasoning Learning https://arxiv.org/pdf/2503.07557 https://github.com/Yanko96/AutoSpatial
AffordDexGrasp: Open-set Language-guided Dexterous Grasp with Generalizable-Instructive Affordance https://arxiv.org/pdf/2503.07360
LLM+MAP: Bimanual Robot Task Planning using Large Language Models and Planning Domain Definition Language https://arxiv.org/pdf/2503.17309 https://github.com/Kchu/LLM-MAP
Leveraging Language Models for Out-of-Distribution Recovery in Reinforcement Learning https://arxiv.org/pdf/2503.17125
RoboEngine: Plug-and-Play Robot Data Augmentation with Semantic Robot Segmentation and Background Generation https://arxiv.org/pdf/2503.18738
IRef-VLA: A Benchmark for Interactive Referential Grounding with Imperfect Language in 3D Scenes https://arxiv.org/pdf/2503.17406
Cooking Task Planning using LLM and Verified by Graph Network https://arxiv.org/pdf/2503.21564