TraVI: Trajectory from Vision and Instruction - VLM-Based Robot Trajectory Planner
Mobile Robotics at Northeastern University
Abstract
Designed and implemented a navigation pipeline that lets a robot carry out a plan specified in natural language, using a vision language model (VLM), an RGB camera, and a time-of-flight depth camera. The system converts a text command and a first-person image into 2D pixel coordinates, which are then projected into the 3D world by ray casting. These 3D waypoints are converted into twist commands that the robot can execute. Three VLMs were compared; Gemini 3 Flash Preview generated the most reliable and reasonable trajectories.
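The pixel-to-3D and waypoint-to-twist steps can be illustrated with a minimal sketch. It is not the project's implementation: it assumes a pinhole camera model, a depth value measured along the optical axis, a camera that is level and aligned with the robot base, and a simple proportional controller. The function names, intrinsics (`fx`, `fy`, `cx`, `cy`), gains, and speed limits are all hypothetical.

```python
import math


def pixel_to_camera_point(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) into the camera frame (x right, y down, z forward).

    Assumes a pinhole model and that depth_m is measured along the optical axis.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def waypoint_to_twist(forward_m, left_m, k_lin=0.5, k_ang=1.0,
                      max_lin=0.3, max_ang=1.0):
    """Proportional controller (an assumption, not the paper's method).

    Takes a waypoint in the robot base frame (x forward, y left) and returns
    (linear_x in m/s, angular_z in rad/s).
    """
    distance = math.hypot(forward_m, left_m)
    heading = math.atan2(left_m, forward_m)
    # Slow down when the waypoint is off to the side; stop if it is behind.
    linear = min(k_lin * distance, max_lin) * max(0.0, math.cos(heading))
    # Turn toward the waypoint, clamped to the angular speed limit.
    angular = max(-max_ang, min(max_ang, k_ang * heading))
    return (linear, angular)


# Hypothetical intrinsics for a 640x480 image.
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0

# A waypoint pixel proposed by the VLM, with its measured depth in metres.
x_cam, y_cam, z_cam = pixel_to_camera_point(380, 240, 2.0, FX, FY, CX, CY)

# Camera frame to base frame under the alignment assumption:
# forward = camera z, left = -camera x.
linear_x, angular_z = waypoint_to_twist(z_cam, -x_cam)
```

In a ROS-style stack, `linear_x` and `angular_z` would typically populate the corresponding fields of a twist message; a real system would also need the calibrated camera-to-base transform rather than the alignment assumption used here.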
Resources
Full Paper