AnywhereVLA: Language-Conditioned Exploration and Mobile Manipulation

A modular pipeline for mobile manipulation in large-scale, novel, and dynamic indoor environments

Konstantin Gubernatorov*, Artem Voronov*, Roman Voronov*, Sergei Pasynkov*, Stepan Perminov, Ziang Guo, and Dzmitry Tsetserukou
*Equal contribution
Skolkovo Institute of Science and Technology

Abstract

We introduce AnywhereVLA, a modular pipeline for mobile manipulation in large-scale, novel, and dynamic indoor environments driven by a single natural language command. Our system combines Vision-Language-Action (VLA) manipulation with active environment exploration, enabling a robot to navigate and manipulate objects in previously unseen environments. By leveraging a purpose-built pick-and-place dataset and 3D point cloud semantic mapping, AnywhereVLA generalizes robustly across diverse indoor scenarios. The modular architecture allows exploration and manipulation to be integrated seamlessly, making it suitable for real-world applications in dynamic environments.

Method

AnywhereVLA takes a natural language command as input and interleaves environment exploration with mobile manipulation. The system integrates four key components: (1) a Command Interpreter that parses complex tasks into simpler actions, (2) an Object Perception module that uses YOLOv12m for object detection and 2D-to-3D projection for semantic mapping, (3) an Active Exploration system that guides navigation in unknown environments, and (4) a VLA Manipulation module for precise object interaction. A 3D point cloud semantic map maintains spatial understanding throughout the task and supports robust generalization across diverse indoor scenarios.
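The control flow implied by these four components can be summarized in a short orchestration sketch. This is a minimal illustration only: the class and method names (CommandInterpreter, Explorer, SemanticMap, VLAManipulator, etc.) are hypothetical placeholders, not the released interfaces.

# Minimal orchestration sketch of the four-stage pipeline described above.
# All class and method names are hypothetical placeholders for illustration.

from dataclasses import dataclass

@dataclass
class SubTask:
    action: str          # e.g. "pick" or "place"
    target_class: str    # e.g. "bottle"

class AnywhereVLAPipeline:
    def __init__(self, interpreter, perception, explorer, manipulator, semantic_map):
        self.interpreter = interpreter      # command parser (tasks -> sub-tasks)
        self.perception = perception        # YOLOv12m detector + 2D-to-3D projection
        self.explorer = explorer            # active SLAM / exploration and navigation
        self.manipulator = manipulator      # VLA policy for object interaction
        self.map = semantic_map             # 3D point cloud semantic map

    def run(self, command: str) -> None:
        # (1) Parse the free-form command into an ordered list of simpler actions.
        subtasks: list[SubTask] = self.interpreter.parse(command)
        for task in subtasks:
            # (2)+(3) Explore until the target class appears in the semantic map,
            # inserting projected detections as the robot moves.
            while not self.map.contains(task.target_class):
                goal = self.explorer.next_frontier(self.map)
                self.explorer.navigate_to(goal)
                for detection in self.perception.detect_and_project():
                    self.map.insert(detection)
            # (4) Navigate to the localized object and run the VLA policy.
            object_pose = self.map.lookup(task.target_class)
            self.explorer.navigate_to(object_pose)
            self.manipulator.execute(task.action, task.target_class)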

System Architecture

AnywhereVLA Architecture
AnywhereVLA is a modular architecture comprising VLA manipulation and active environment exploration. Given a task, AnywhereVLA parses it into simpler actions, which in turn condition Active Environment Exploration. Exploration and navigation in large-scale indoor environments are performed within a 3D point cloud semantic map. By leveraging a purpose-built pick-and-place dataset, AnywhereVLA exhibits robust generalization.
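One way the parsed target can condition exploration is by biasing frontier selection toward map regions where the target class has tentatively been observed. The sketch below is an assumption for illustration; the scoring rule and interfaces are not taken from the paper.

# Hypothetical illustration of language-conditioned frontier selection:
# frontiers near tentative detections of the target class are preferred.

import numpy as np

def select_frontier(frontiers: np.ndarray,
                    robot_xy: np.ndarray,
                    hint_points: np.ndarray | None,
                    distance_weight: float = 1.0,
                    hint_weight: float = 2.0) -> np.ndarray:
    """Pick the best frontier from an N x 2 array of map coordinates.

    hint_points: map coordinates of low-confidence detections of the target
    class, or None if the target has not been seen yet.
    """
    # Penalize frontiers that are far from the robot.
    travel_cost = np.linalg.norm(frontiers - robot_xy, axis=1)
    score = -distance_weight * travel_cost
    if hint_points is not None and len(hint_points) > 0:
        # Reward frontiers close to any tentative detection of the target.
        dists_to_hints = np.linalg.norm(
            frontiers[:, None, :] - hint_points[None, :, :], axis=2)
        score += hint_weight * (1.0 / (1.0 + dists_to_hints.min(axis=1)))
    return frontiers[int(np.argmax(score))]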

Active SLAM & Autonomous Exploration

System Demonstrations

VLA Grab the Bottle
The VLA manipulation module grasps and manipulates objects in response to language commands.
Active SLAM Real Robot
The real robot performs autonomous exploration and mapping in a dynamic indoor environment.
Camera to Map Projection
Camera-to-map projection: 2D object detections are projected onto the 3D semantic map.
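The camera-to-map projection shown above follows standard pinhole back-projection. Below is a minimal sketch, assuming known intrinsics K, a camera-to-map transform T_map_cam, and a depth reading at the bounding-box center; the variable names are illustrative, not from the released code.

# Sketch of projecting a 2D detection into the 3D map frame (pinhole model).

import numpy as np

def detection_to_map_point(bbox_center_px: tuple[float, float],
                           depth_m: float,
                           K: np.ndarray,          # 3x3 camera intrinsics
                           T_map_cam: np.ndarray   # 4x4 camera-to-map transform
                           ) -> np.ndarray:
    """Back-project the bounding-box center at the given depth into map coordinates."""
    u, v = bbox_center_px
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Back-project the pixel into a 3D point in the camera frame.
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    p_cam = np.array([x, y, depth_m, 1.0])
    # Transform into the map frame; the point can then be inserted into the
    # 3D point cloud semantic map together with its class label.
    return (T_map_cam @ p_cam)[:3]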

Applications

Applications
Use cases demonstrating AnywhereVLA's capabilities in indoor manipulation scenarios, including object retrieval, cleaning tasks, and delivery operations.

Citation

BibTeX
@article{gubernatorov2025anywherevla,
  title={AnywhereVLA: Language-Conditioned Exploration and Mobile Manipulation},
  author={Gubernatorov, Konstantin and Voronov, Artem and Voronov, Roman and Pasynkov, Sergei and Perminov, Stepan and Guo, Ziang and Tsetserukou, Dzmitry},
  journal={arXiv preprint arXiv:2509.21006},
  year={2025}
}