AnywhereVLA: Language-Conditioned Exploration and Mobile Manipulation

A modular pipeline for mobile manipulation in large-scale, novel, and dynamic indoor environments

Konstantin Gubernatorov*, Artem Voronov*, Roman Voronov*, Sergei Pasynkov*, Stepan Perminov, Ziang Guo, and Dzmitry Tsetserukou
*Equal contribution
Skolkovo Institute of Science and Technology

Abstract

We introduce AnywhereVLA, a modular pipeline for mobile manipulation in large-scale, novel, and dynamic indoor environments driven by a single natural language command. Our system combines Vision-Language-Action (VLA) manipulation with active environment exploration, enabling a robot to navigate and manipulate objects in previously unseen environments. By leveraging a purpose-built pick-and-place dataset and 3D point cloud semantic mapping, AnywhereVLA generalizes robustly across diverse indoor scenarios. The modular architecture allows exploration and manipulation to be integrated seamlessly, making it suitable for real-world applications in dynamic environments.

Method

AnywhereVLA takes a natural language command as input and interleaves environment exploration with mobile manipulation. The system integrates four key components: (1) a Command Interpreter that parses complex tasks into simpler actions, (2) an Object Perception module that uses YOLOv12m for object detection and 2D-to-3D projection for semantic mapping, (3) an Active Exploration system that guides navigation in unknown environments, and (4) a VLA Manipulation module for precise object interaction. A 3D point cloud semantic map maintains spatial understanding throughout the task and supports robust generalization across diverse indoor scenarios.
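The control flow implied by these four components can be summarized in a short orchestration sketch. This is a minimal illustration only: the class and method names (CommandInterpreter, Explorer, SemanticMap, VLAManipulator, etc.) are hypothetical placeholders, not the released interfaces.

# Minimal orchestration sketch of the four-stage pipeline described above.
# All class and method names are hypothetical placeholders for illustration.

from dataclasses import dataclass

@dataclass
class SubTask:
    action: str          # e.g. "pick" or "place"
    target_class: str    # e.g. "bottle"

class AnywhereVLAPipeline:
    def __init__(self, interpreter, perception, explorer, manipulator, semantic_map):
        self.interpreter = interpreter      # command parser (tasks -> sub-tasks)
        self.perception = perception        # YOLOv12m detector + 2D-to-3D projection
        self.explorer = explorer            # active SLAM / exploration and navigation
        self.manipulator = manipulator      # VLA policy for object interaction
        self.map = semantic_map             # 3D point cloud semantic map

    def run(self, command: str) -> None:
        # (1) Parse the free-form command into an ordered list of simpler actions.
        subtasks: list[SubTask] = self.interpreter.parse(command)
        for task in subtasks:
            # (2)+(3) Explore until the target class appears in the semantic map,
            # inserting projected detections as the robot moves.
            while not self.map.contains(task.target_class):
                goal = self.explorer.next_frontier(self.map)
                self.explorer.navigate_to(goal)
                for detection in self.perception.detect_and_project():
                    self.map.insert(detection)
            # (4) Navigate to the localized object and run the VLA policy.
            object_pose = self.map.lookup(task.target_class)
            self.explorer.navigate_to(object_pose)
            self.manipulator.execute(task.action, task.target_class)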

System Architecture

AnywhereVLA Architecture
AnywhereVLA is a modular architecture comprising VLA manipulation and active environment exploration. Given a task, AnywhereVLA parses it into simpler actions, which in turn condition Active Environment Exploration. Exploration and navigation in large-scale indoor environments are performed within a 3D point cloud semantic map. By leveraging a purpose-built pick-and-place dataset, AnywhereVLA exhibits robust generalization.
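One way the parsed target can condition exploration is by biasing frontier selection toward map regions where the target class has tentatively been observed. The sketch below is an assumption for illustration; the scoring rule and interfaces are not taken from the paper.

# Hypothetical illustration of language-conditioned frontier selection:
# frontiers near tentative detections of the target class are preferred.

import numpy as np

def select_frontier(frontiers: np.ndarray,
                    robot_xy: np.ndarray,
                    hint_points: np.ndarray | None,
                    distance_weight: float = 1.0,
                    hint_weight: float = 2.0) -> np.ndarray:
    """Pick the best frontier from an N x 2 array of map coordinates.

    hint_points: map coordinates of low-confidence detections of the target
    class, or None if the target has not been seen yet.
    """
    # Penalize frontiers that are far from the robot.
    travel_cost = np.linalg.norm(frontiers - robot_xy, axis=1)
    score = -distance_weight * travel_cost
    if hint_points is not None and len(hint_points) > 0:
        # Reward frontiers close to any tentative detection of the target.
        dists_to_hints = np.linalg.norm(
            frontiers[:, None, :] - hint_points[None, :, :], axis=2)
        score += hint_weight * (1.0 / (1.0 + dists_to_hints.min(axis=1)))
    return frontiers[int(np.argmax(score))]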

Active SLAM & Autonomous Exploration

System Demonstrations

VLA Grab the Bottle
The VLA manipulation module grasps and manipulates objects in response to language commands.
Active SLAM Real Robot
The real robot performs autonomous exploration and mapping in a dynamic indoor environment.
Camera to Map Projection
Camera-to-map projection: 2D object detections are projected onto the 3D semantic map.
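The camera-to-map projection shown above follows standard pinhole back-projection. Below is a minimal sketch, assuming known intrinsics K, a camera-to-map transform T_map_cam, and a depth reading at the bounding-box center; the variable names are illustrative, not from the released code.

# Sketch of projecting a 2D detection into the 3D map frame (pinhole model).

import numpy as np

def detection_to_map_point(bbox_center_px: tuple[float, float],
                           depth_m: float,
                           K: np.ndarray,          # 3x3 camera intrinsics
                           T_map_cam: np.ndarray   # 4x4 camera-to-map transform
                           ) -> np.ndarray:
    """Back-project the bounding-box center at the given depth into map coordinates."""
    u, v = bbox_center_px
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Back-project the pixel into a 3D point in the camera frame.
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    p_cam = np.array([x, y, depth_m, 1.0])
    # Transform into the map frame; the point can then be inserted into the
    # 3D point cloud semantic map together with its class label.
    return (T_map_cam @ p_cam)[:3]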

Applications

Applications
Use cases demonstrating AnywhereVLA's capabilities in indoor manipulation scenarios, including object retrieval, cleaning tasks, and delivery operations.

Citation

BibTeX
@article{gubernatorov2025anywherevla,
  title={AnywhereVLA: Language-Conditioned Exploration and Mobile Manipulation},
  author={Gubernatorov, Konstantin and Voronov, Artem and Voronov, Roman and Pasynkov, Sergei and Perminov, Stepan and Guo, Ziang and Tsetserukou, Dzmitry},
  journal={arXiv preprint arXiv:2509.21006},
  year={2025}
}