A minimal example of controlling a tabletop robotic manipulation task using a Vision-Language Model (VLM) through Visual Question Answering (VQA).
The robot observes the scene, the VLM answers structured questions about it, and those answers are converted into executable actions.
- Capture an image of the scene.
- Ask the VLM task-relevant questions.
- Convert the answer into a structured state.
- Plan an action.
- Execute with the robot controller.
- Repeat until the task is completed.
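
The "convert the answer into a structured state" and "plan an action" steps can be sketched as below. This is only an illustration under stated assumptions, not the project's implementation: the `SceneState` type, the rule of reading the first word as yes/no, and the action names `"done"`, `"pick_and_place"`, and `"reobserve"` are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SceneState:
    """Hypothetical structured state for the red-block question."""
    block_in_bowl: Optional[bool]  # None when the answer is not a clear yes/no


def parse(answer: str) -> SceneState:
    """Map a free-text VLM answer to a SceneState using its first word."""
    words = re.findall(r"[a-z]+", answer.lower())
    first = words[0] if words else ""
    if first == "yes":
        return SceneState(block_in_bowl=True)
    if first == "no":
        return SceneState(block_in_bowl=False)
    return SceneState(block_in_bowl=None)


def policy(state: SceneState) -> str:
    """Pick a symbolic action name (assumed names) from the state."""
    if state.block_in_bowl is True:
        return "done"            # task complete, stop the loop
    if state.block_in_bowl is False:
        return "pick_and_place"  # move the block into the bowl
    return "reobserve"           # ambiguous answer: capture a new image and ask again
```

Treating an unclear answer as a third state, rather than defaulting to yes or no, keeps the robot from acting on a response it could not interpret.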

One iteration of the loop, in pseudocode:

    img = camera.capture()
    question = "Is the red block inside the bowl?"
    answer = vlm.ask(img, question)
    state = parse(answer)
    action = policy(state)
    robot.execute(action)

Use cases:

- Tabletop manipulation
- Pick-and-place tasks
- Spatial reasoning experiments
- VLM-based robot planning

Implements differential inverse kinematics with null space control on a UR5e arm in MuJoCo. The end-effector tracks a waypoint trajectory while avoiding a spherical obstacle, with joint velocities solved via a damped pseudoinverse Jacobian.
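
The update rule described above can be illustrated on a toy problem. The sketch below is not the UR5e/MuJoCo code: it assumes a planar 3-link arm with made-up link lengths, a 2D position task, and a rest-posture pull as the secondary objective (the source does not say how the obstacle term enters, so obstacle avoidance is left out). It uses the common form `dq = J_pinv v + (I - J_pinv J) dq0` with `J_pinv = J^T (J J^T + damping^2 I)^-1`. With nonzero damping the projector is only approximately a null-space projector.

```python
import math

LINKS = (0.4, 0.3, 0.2)  # hypothetical link lengths in metres (not the UR5e)


def forward_kinematics(q):
    """Return the end-effector (x, y) for joint angles q (radians)."""
    x = y = angle = 0.0
    for length, qi in zip(LINKS, q):
        angle += qi
        x += length * math.cos(angle)
        y += length * math.sin(angle)
    return x, y


def jacobian(q):
    """Return the 2 x n position Jacobian as a list of two rows."""
    n = len(q)
    angles, total = [], 0.0
    for qi in q:
        total += qi
        angles.append(total)
    J = [[0.0] * n for _ in range(2)]
    for j in range(n):  # joint j moves every link from j outward
        for i in range(j, n):
            J[0][j] -= LINKS[i] * math.sin(angles[i])
            J[1][j] += LINKS[i] * math.cos(angles[i])
    return J


def damped_pinv(J, damping):
    """Return the n x 2 damped pseudoinverse J^T (J J^T + damping^2 I)^-1."""
    a = sum(v * v for v in J[0]) + damping ** 2
    b = sum(u * v for u, v in zip(J[0], J[1]))
    d = sum(v * v for v in J[1]) + damping ** 2
    det = a * d - b * b  # positive whenever damping > 0
    inv = ((d / det, -b / det), (-b / det, a / det))
    return [[J[0][j] * inv[0][k] + J[1][j] * inv[1][k] for k in range(2)]
            for j in range(len(J[0]))]


def null_space_velocity(J, J_pinv, dq_secondary):
    """Apply (I - J_pinv J) so the secondary motion barely moves the tool."""
    n = len(dq_secondary)
    task = [sum(J[k][j] * dq_secondary[j] for j in range(n)) for k in range(2)]
    return [dq_secondary[j] - (J_pinv[j][0] * task[0] + J_pinv[j][1] * task[1])
            for j in range(n)]


def ik_step(q, target, q_rest, dt=0.02, gain=5.0, posture_gain=1.0, damping=0.02):
    """Advance the joint angles by one differential IK step."""
    x, y = forward_kinematics(q)
    v = (gain * (target[0] - x), gain * (target[1] - y))  # task-space velocity
    J = jacobian(q)
    J_pinv = damped_pinv(J, damping)
    dq_task = [row[0] * v[0] + row[1] * v[1] for row in J_pinv]
    dq_rest = [posture_gain * (r - qi) for r, qi in zip(q_rest, q)]
    dq_null = null_space_velocity(J, J_pinv, dq_rest)
    return [qi + dt * (t + s) for qi, t, s in zip(q, dq_task, dq_null)]


def tracking_error(q, target):
    """Euclidean distance between the end-effector and the target."""
    x, y = forward_kinematics(q)
    return math.hypot(target[0] - x, target[1] - y)


if __name__ == "__main__":
    q, target = [0.3, 0.4, 0.5], (0.5, 0.4)
    for _ in range(300):
        q = ik_step(q, target, q_rest=(0.2, 0.6, 0.6))
    print("final tracking error:", tracking_error(q, target))
```

The damping term bounds the joint velocities near singular configurations, at the cost of slower convergence along poorly conditioned directions and a small leak of the secondary motion into the task.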

