
Robots That Learn from YouTube: Video Pre-Training for Manipulation

Feb 5, 2026 7 min read

New research shows that robots pre-trained on internet video can learn manipulation skills from roughly 10x fewer demonstrations than previous methods.

A breakthrough from UC Berkeley and Google DeepMind is changing how robots acquire manipulation skills. By pre-training vision-language-action models on millions of hours of internet video—including YouTube tutorials, cooking shows, and repair guides—robots can learn new physical tasks with dramatically less real-world training data.

The approach, called Video Pre-training for Robotics (VPR), works by having the model learn a general understanding of how objects behave in the physical world from video, then fine-tuning on a small number of real-world robot demonstrations for specific tasks. In experiments, VPR-trained robots learned to fold shirts, sort recycling, and assemble IKEA furniture with just 50 demonstrations per task—compared to the 500+ demonstrations required by previous methods.
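The two-stage recipe described above (pre-train a general model, then fine-tune on a few dozen demonstrations) can be illustrated with a deliberately tiny sketch. This is not the VPR implementation: the frozen `pretrained_encoder`, the made-up `expert_action` demonstrator, the linear action head, and the choice of imitation by plain gradient descent are all assumptions chosen for illustration. Only the figure of 50 demonstrations comes from the article.

```python
import random


def pretrained_encoder(observation):
    """Hypothetical stand-in for a frozen, video-pretrained backbone.

    A real backbone would map camera frames to learned features; this toy
    version just expands a 2-D observation into four fixed features.
    """
    x, y = observation
    return [x, y, x * y, 1.0]


def expert_action(observation):
    """Made-up demonstrator the robot should imitate (assumption)."""
    x, y = observation
    return 0.5 * x - 0.25 * y + 0.1 * x * y


def make_demos(n, seed=0):
    """Generate n (observation, action) pairs from the toy demonstrator."""
    rng = random.Random(seed)
    demos = []
    for _ in range(n):
        obs = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
        demos.append((obs, expert_action(obs)))
    return demos


def fit_action_head(demos, lr=0.05, epochs=200):
    """Fine-tuning stage: train only a small linear head on frozen features."""
    weights = [0.0, 0.0, 0.0, 0.0]
    for _ in range(epochs):
        for obs, action in demos:
            feats = pretrained_encoder(obs)
            error = sum(w * f for w, f in zip(weights, feats)) - action
            # Gradient step on the squared imitation error for this demo.
            weights = [w - lr * error * f for w, f in zip(weights, feats)]
    return weights


def policy(weights, observation):
    """Predict an action from an observation using the fitted head."""
    feats = pretrained_encoder(observation)
    return sum(w * f for w, f in zip(weights, feats))


def mean_squared_error(weights, demos):
    """Average squared gap between predicted and demonstrated actions."""
    return sum((policy(weights, o) - a) ** 2 for o, a in demos) / len(demos)


train_demos = make_demos(50, seed=0)   # 50 demonstrations, as in the article
held_out = make_demos(200, seed=1)     # unseen observations for evaluation
head = fit_action_head(train_demos)
print(f"held-out MSE: {mean_squared_error(head, held_out):.6f}")
```

The point of the sketch is the division of labor: when the encoder already supplies useful features, only a handful of parameters have to be learned from robot data, which is why a small demonstration set can be enough. Whether VPR freezes its backbone or updates it during fine-tuning is not stated in the article.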

The key insight is that internet video contains an enormous amount of implicit physics knowledge. When a YouTube creator demonstrates how to chop vegetables, fold origami, or repair a circuit board, the video encodes information about object properties, force requirements, and sequential task structure that transfers to robotic manipulation.

The limitations are real but surmountable. VPR models struggle with tasks that require precise force control (like inserting a USB cable) or that involve materials not well-represented in internet video (like flexible surgical tissue). Researchers are addressing these gaps with sim-to-real transfer and tactile sensing integration.

For robotics teams, Vincony's Model Playground supports testing vision-language models on image and video inputs—helpful for evaluating which VLM backbone to use for your robotics application.
