TraVI: Trajectory from Vision and Instruction - VLM-Based Robot Trajectory Planner

Mobile Robotics at Northeastern University

Abstract

Designed and implemented a navigation pipeline that lets a robot carry out a plan specified in natural language, using a vision language model (VLM), an RGB camera, and a time-of-flight depth camera. The system takes a text command and a first-person image and converts them into 2D pixel coordinates, which are then projected into the 3D world by ray casting. The resulting 3D waypoints are converted into twist commands the robot can execute. Three VLMs were compared; Gemini 3 Flash Preview generated the most reliable and reasonable trajectories.
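
The pixel-to-waypoint-to-twist steps described above can be illustrated with a minimal sketch. This is not the project's implementation: it assumes a pinhole camera model, a depth reading measured along the optical axis, a camera mounted at the robot base with no tilt, and a simple proportional controller. The names `Intrinsics`, `pixel_to_camera_point`, `camera_to_base`, and `waypoint_to_twist`, as well as all gains and limits, are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Intrinsics:
    """Pinhole camera intrinsics in pixels (assumed values, not from the paper)."""
    fx: float
    fy: float
    cx: float
    cy: float


def pixel_to_camera_point(u, v, depth_m, K):
    """Back-project pixel (u, v) along its camera ray to the measured depth.

    depth_m is assumed to be the distance along the optical (z) axis.
    Returns (x, y, z) in the optical frame: x right, y down, z forward.
    """
    x = (u - K.cx) / K.fx * depth_m
    y = (v - K.cy) / K.fy * depth_m
    return (x, y, depth_m)


def camera_to_base(point_cam):
    """Optical frame -> robot base frame (x forward, y left, z up).

    Assumes the camera sits at the base origin with no tilt; a real robot
    would apply its calibrated extrinsic transform here instead.
    """
    x, y, z = point_cam
    return (z, -x, -y)


def waypoint_to_twist(target_xy, k_lin=0.5, k_ang=1.5, max_lin=0.3, max_ang=1.0):
    """Proportional controller: returns (linear m/s, angular rad/s)."""
    x, y = target_xy
    distance = math.hypot(x, y)
    heading_error = math.atan2(y, x)
    # Cap the forward speed, then scale it down when the waypoint is off-axis.
    linear = min(k_lin * distance, max_lin) * max(0.0, math.cos(heading_error))
    # Turn toward the waypoint, clamped to the angular velocity limit.
    angular = max(-max_ang, min(k_ang * heading_error, max_ang))
    return linear, angular


if __name__ == "__main__":
    K = Intrinsics(fx=600.0, fy=600.0, cx=320.0, cy=240.0)
    point = camera_to_base(pixel_to_camera_point(400, 240, 1.5, K))
    print(waypoint_to_twist(point[:2]))
```

In this sketch, a pixel to the right of the image center yields a waypoint with negative `y` in the base frame, so the controller commands a negative (clockwise) angular velocity while reducing forward speed in proportion to the heading error.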
