R3M: A Universal Visual Representation for Robot Manipulation


Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, Abhinav Gupta

Meta AI | Stanford University


Paper / Code

We study whether visual representations pre-trained on diverse human videos can enable data-efficient robotic manipulation. We pre-train a single representation, R3M, using an objective that combines time-contrastive learning, video-language alignment, and a sparsity penalty.
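To make the objective concrete, here is a minimal sketch written from the high-level description above, not the released training code. The score network `G`, the InfoNCE formulation over L2 distances, and all dimensions and weights are illustrative assumptions.

```python
# Sketch of the combined R3M-style pre-training objective (assumed form).
import torch
import torch.nn as nn
import torch.nn.functional as F

def time_contrastive_loss(z_a, z_pos, z_neg, temp=0.1):
    """InfoNCE over negative L2 distances: frames close in time (z_pos) should
    embed nearer to the anchor than distant or other-video frames (z_neg)."""
    pos = -torch.norm(z_a - z_pos, dim=-1) / temp
    neg = -torch.norm(z_a - z_neg, dim=-1) / temp
    logits = torch.stack([pos, neg], dim=1)               # (B, 2)
    labels = torch.zeros(z_a.shape[0], dtype=torch.long)  # positive at index 0
    return F.cross_entropy(logits, labels)

def video_language_loss(G, z0, zt, l_true, l_fake):
    """A learned score G(z0, zt, l) should rate the video's true language
    annotation higher than a mismatched one."""
    pos = G(torch.cat([z0, zt, l_true], dim=-1)).squeeze(-1)  # (B,)
    neg = G(torch.cat([z0, zt, l_fake], dim=-1)).squeeze(-1)
    logits = torch.stack([pos, neg], dim=1)
    labels = torch.zeros(z0.shape[0], dtype=torch.long)
    return F.cross_entropy(logits, labels)

# Toy usage with random embeddings; D and L are assumed dimensions.
B, D, L = 8, 512, 768
G = nn.Sequential(nn.Linear(2 * D + L, 256), nn.ReLU(), nn.Linear(256, 1))
z0, z_near, z_far = torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)
l_true, l_fake = torch.randn(B, L), torch.randn(B, L)

loss = (time_contrastive_loss(z0, z_near, z_far)
        + video_language_loss(G, z0, z_near, l_true, l_fake)
        + 1e-3 * z0.abs().mean())  # sparsity penalty on the embedding
loss.backward()
```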

Overview

Results

Given just 20 demonstrations (<10 minutes of human supervision), we use R3M to learn manipulation tasks in the real world.

We also demonstrate that the pre-trained R3M representation enables data-efficient imitation learning in comprehensive simulation evaluations across three different benchmarks.
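As a rough illustration of how a frozen R3M encoder plugs into data-efficient imitation learning, here is a hypothetical behavior-cloning step: the representation stays frozen and only a small policy head is trained on demonstration tuples. The policy architecture, dimensions, and loss below are illustrative, not the benchmark configurations.

```python
# Behavior cloning on top of frozen R3M features (illustrative sketch).
import torch
import torch.nn as nn

embed_dim, proprio_dim, act_dim = 2048, 9, 7  # assumed dimensions
policy = nn.Sequential(
    nn.Linear(embed_dim + proprio_dim, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, act_dim),
)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def bc_step(frozen_encoder, images, proprio, actions):
    with torch.no_grad():            # the pre-trained representation is frozen
        feats = frozen_encoder(images)
    pred = policy(torch.cat([feats, proprio], dim=-1))
    loss = ((pred - actions) ** 2).mean()  # MSE behavior-cloning loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```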

Try it yourself

Try out the pre-trained models at https://github.com/facebookresearch/r3m. Using R3M is as simple as:
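The snippet below follows the repo's `load_r3m` interface and its convention of 224x224 inputs with pixel values in [0, 255]; it is a sketch, so check the repo README for the current interface.

```python
import torch
from r3m import load_r3m

device = "cuda" if torch.cuda.is_available() else "cpu"
r3m = load_r3m("resnet50")  # resnet18 and resnet34 variants also available
r3m.eval()
r3m.to(device)

# R3M expects (B, 3, 224, 224) image tensors with values in [0, 255].
image = torch.rand(1, 3, 224, 224, device=device) * 255.0
with torch.no_grad():
    embedding = r3m(image)  # (1, 2048) for the ResNet50 model
```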