Agentic Robotics: Multimodal Embodied Reasoning and Planning with Small Language Models
Recent advances in robot learning have increasingly relied on LLM-based task planning and reasoning, exploiting these models' ability to bridge natural language instructions with effective, executable actions. To this end, generalist robot models have begun adopting a novel architecture known as Vision-Language-Action (VLA) models. These models consist of a Vision-Language Model (VLM) backbone, which processes image observations along with natural language instructions, coupled with an action de-tokenizer that predicts executable robot control actions. Notable examples of VLAs are Gemini Robotics (Google Team), Pi0 (Black et al., Physical Intelligence) and OpenVLA (Kim et al.). Although these models show tremendous potential, their massive computational requirements hinder direct deployment on a real robot, and their training demands thousands of hours of curated video. TinyVLA (Wen et al.) partially addresses these limitations, offering fast inference and lower computational requirements, yet its reduced size comes at the cost of weaker generalization. There is therefore a need for an open-source system that retains the advantages of VLAs while being practically deployable on robots and less dependent on detailed user textual prompts.

To address these challenges, during my PhD I aim to develop an agentic AI system that leverages lightweight Multimodal Large Language Models (MLLMs) for robotics applications. The system will combine existing state-of-the-art models, dividing the main task into sub-tasks and assigning each to a specialized AI agent. It comprises three specialized agents: a language agent that converts the user's natural language description into an executable format such as code or behavior trees (BTs); a vision agent responsible for spatial understanding; and an action agent that carries out tasks in an end-to-end manner. Particular attention will be devoted to preparing each agent for its task, relying on Small Language Models (SLMs) and compact VLMs to enable efficient deployment.

Building on "Mixture of LoRA Experts" (Wu et al.), my goal is to build the system around a single model and equip it with a specialized LoRA adapter for each agent. With an appropriate gating function, the system can then dynamically select the appropriate adapter at runtime based on the task requirements (see the sketch below). This approach offers several potential advantages. First, using a single SLM complemented by lightweight LoRA adapters, whose overhead is negligible, drastically reduces the computational requirements in terms of training time and resources. Second, selecting only the required agent at runtime could notably improve inference speed. Furthermore, unlike VLAs that require explicit instructions, such an agentic system could accept more open-ended goals, opening the possibility of autonomous goal selection in which the agents proactively propose goals to the user.

Finally, additional effort will be placed on enhancing the planning and reasoning capabilities of the agents, as well as on exploring complementary methods to support the overall system, such as memory-based techniques, approaches for improving out-of-distribution generalization, agent orchestration strategies, and modular architectures.
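To illustrate the gated-adapter idea, the following is a minimal sketch, assuming PyTorch. The class names (LoRALinear, AgentGate), the rank and scaling values, and the simple argmax routing are illustrative assumptions for exposition, not the actual design of the proposed system or of the cited "Mixture of LoRA Experts" work.

```python
# Minimal sketch: a frozen shared layer with per-agent LoRA adapters and a gate
# that picks one adapter at runtime. Names and hyperparameters are illustrative.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen base linear layer augmented with one LoRA adapter per agent."""

    def __init__(self, base: nn.Linear, num_adapters: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # shared SLM weights stay frozen
        in_f, out_f = base.in_features, base.out_features
        # One low-rank pair (A_k, B_k) per agent; only these are trained.
        self.A = nn.Parameter(torch.randn(num_adapters, rank, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_adapters, out_f, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor, adapter_idx: int) -> torch.Tensor:
        # y = W x + scale * B_k A_k x, with k chosen by the gating function
        delta = x @ self.A[adapter_idx].T @ self.B[adapter_idx].T
        return self.base(x) + self.scale * delta


class AgentGate(nn.Module):
    """Routes a pooled task embedding to one agent (e.g. language / vision / action)."""

    def __init__(self, embed_dim: int, num_agents: int = 3):
        super().__init__()
        self.router = nn.Linear(embed_dim, num_agents)

    def forward(self, task_embedding: torch.Tensor) -> int:
        logits = self.router(task_embedding)
        return int(torch.argmax(logits, dim=-1))


# Toy usage: one shared layer, three adapters, gate selects the active one at runtime.
layer = LoRALinear(nn.Linear(512, 512), num_adapters=3)
gate = AgentGate(embed_dim=512, num_agents=3)
x = torch.randn(1, 512)              # stand-in for a pooled task embedding
active_agent = gate(x.squeeze(0))    # 0, 1, or 2
out = layer(x, adapter_idx=active_agent)
```

Because the base weights are shared and frozen, only the small A and B matrices and the router are trained per agent, which is what keeps the adapter overhead negligible relative to the backbone.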