GRASP on Robotics: Vincent Sitzmann, Massachusetts Institute of Technology, “Self-supervised Scene Representation Learning for Robotics”
April 15, 2022 at 10:30 AM - 11:45 AM
Given only a single picture, people are capable of inferring a mental representation that encodes rich information about the underlying 3D scene. We acquire this skill not through massive labeled datasets of 3D scenes, but through self-supervised observation and interaction. Building machines that can infer similarly rich neural scene representations is critical if they are to one day parallel people’s ability to understand, navigate, and interact with their surroundings. In this talk, I will demonstrate how we can equip neural networks with inductive biases that enable them to learn 3D geometry, appearance, and even semantic information, self-supervised only from posed images. I will show how this approach unlocks the learning of priors, enabling 3D reconstruction from only a single posed 2D image. I will then talk about a recent application of self-supervised scene representation learning in robotic manipulation, where it enables us to learn to manipulate classes of objects in unseen poses from only a handful of human demonstrations, as well as the application of neural rendering to learn latent spaces amenable to control. I will then discuss recent work on learning the neural rendering operator to make rendering and training fast, and how this speed-up enables us to learn object-centric neural scene representations, learning to decompose 3D scenes into objects, given only images. Finally, I will discuss how neural scene representations may offer a new angle to tackle challenges in robotics.
Massachusetts Institute of Technology
Vincent is an incoming Assistant Professor at MIT EECS, where he will be leading the Scene Representation Group. Currently, he is a postdoc at MIT's CSAIL with Josh Tenenbaum, Bill Freeman, and Fredo Durand. Previously, he finished a Ph.D. at Stanford University. His research interest lies in neural scene representations: the way neural networks learn to represent information about our world. His goal is to allow independent agents to reason about our world given visual observations, such as inferring a complete model of a scene, with information on geometry, material, lighting, etc., from only a few observations, a task that is simple for humans but currently impossible for AI.