Interactive Object Pushing Using VLMs for Scene Understanding

Thesis

Thesis Degree:

Master

Thesis Advisors:

Nils Dengler

Thesis Description:

In complex scenarios where typical pick-and-place techniques are insufficient, non-prehensile manipulation can often ensure that a robot is still able to complete its task. More broadly, non-prehensile manipulation refers to moving or controlling objects without grasping them, using techniques such as pushing, rolling, or sliding.

Current research focuses on the precise pushing of objects in cluttered environments [1], but assumes complete task knowledge, i.e., the start position, the target position, and the positions of obstacles. In real-world scenarios, however, this information is not always fully available and must instead be approximated, e.g., from visual input. The figure above shows a scenario in which the robot should push a cake towards the human so that they can reach it. Here, the human should simply be able to ask "can you reach me the cake", and the robot has to understand the task modalities and locate all task-relevant objects in the environment. To realize this, Vision Language Models (VLMs) can be used to give the robot scene-understanding capabilities.

The task of this thesis is to generate such a scene understanding using VLMs and to use the resulting output to perform tasks such as "reach me this object" or "push object A close to object B" while avoiding other obstacles in the scene.
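As an illustration of the kind of pipeline this implies, a VLM's answer for a command could be parsed into a structured task specification (push target, goal, obstacles) that a pushing planner like [1] expects. The JSON schema, field names, and the `PushTask` class below are purely hypothetical assumptions for this sketch, not part of the thesis description:

```python
import json
from dataclasses import dataclass, field

@dataclass
class PushTask:
    """Hypothetical structured task extracted from a VLM's scene analysis."""
    object_to_push: str              # e.g. "cake"
    target: str                      # e.g. "human" or "object B"
    obstacles: list = field(default_factory=list)  # objects to avoid while pushing

def parse_vlm_response(raw: str) -> PushTask:
    # Assumes the VLM was prompted to answer in a fixed JSON format;
    # keys "push", "target", and "obstacles" are illustrative only.
    data = json.loads(raw)
    return PushTask(
        object_to_push=data["push"],
        target=data["target"],
        obstacles=data.get("obstacles", []),
    )

# Example of an answer a VLM might return for "can you reach me the cake"
# given a camera image of the table:
response = '{"push": "cake", "target": "human", "obstacles": ["mug", "plate"]}'
task = parse_vlm_response(response)
```

Grounding the free-form command into such a structure is what would bridge the gap between the language/vision input and the geometric task knowledge (start, goal, obstacles) assumed by existing pushing planners.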

Thesis Start Date:

March 15, 2025

Thesis Requirements:

Required:

  • Enrolled in a computer science or similar MSc program in or around Bonn/Cologne
  • Programming experience with C++, Python
  • Enthusiasm for real-world robot deployment and scientific publishing of results
  • Familiarity with LLMs and VLMs for scene understanding
  • Basic knowledge in scene and action graphs
  • Programming experience with ROS (Robot Operating System)

Thesis Related Work:

[1] Dengler et al., arXiv preprint, 2025: https://arxiv.org/pdf/2403.17667