NVIDIA’s New AI Can Predict The Future

Two Minute Papers| 00:09:08|Apr 11, 2026

NVIDIA’s new AI model learns from 2D video to predict future frames, runs faster with distillation, and brings practical, steerable robot intelligence closer to home.

Summary

Two Minute Papers’ Dr. Károly Zsolnai-Fehér explains a breakthrough from NVIDIA: a video-based, self-supervised model that learns to predict how the world changes by watching 2D video frames. The method addresses a long-standing gap between lab-made simulations and the messy real world by focusing on relative actions rather than absolute coordinates. The paper introduces four guiding ideas: let the AI generate its own action labels, leverage an enormous 4-billion-frame dataset, transform inputs into relative coordinates, and force the model to learn cause and effect by feeding it actions in blocks of four so it cannot cheat by peeking at future frames.

The results surpass prior methods by producing realistic hand-object interactions, such as a lid that actually moves and paper that crumples instead of clipping through the hand. While the baseline is slow, requiring about 35 heavy denoising steps to generate a single prediction, the team combats this with distillation, training a faster student model that reaches roughly 10 frames per second while producing very similar predictions.

The combination of 2D world modeling and distillation yields a practical, interactive platform that could underpin future home robots, teleoperation for surgery, or personal assistants. Dr. Zsolnai-Fehér notes that the approach builds on NeRD (Neural Robot Dynamics) but operates in 2D, enabling scalable learning across thousands of everyday objects. In short, NVIDIA’s technique moves us from experimental demos toward usable, open AI tools and pre-trained models you can run on consumer hardware. The tone is optimistic: a future with smarter robots that you can actually own, not just subscribe to, may be closer than we think.
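The block-of-four idea can be sketched in a few lines. This is a toy illustration with made-up names and a 1-D "world", not the paper's actual model: the point is only that the predictor must roll out a whole block of frames from its own previous outputs, so it never sees a ground-truth future frame mid-block.

```python
# Toy sketch of block-wise action conditioning (illustrative, not the paper's code).
# The model gets the current frame plus a block of 4 actions and must predict
# 4 future frames open-loop, feeding back its own outputs, so it cannot
# "cheat" by peeking at any ground-truth future frame inside the block.

BLOCK = 4

def rollout(frame, actions, dynamics):
    """Predict one frame per action, reusing the model's own predictions."""
    preds = []
    for a in actions:
        frame = dynamics(frame, a)
        preds.append(frame)
    return preds

# 1-D "world": a frame is a position, an action is a velocity command
true_dynamics = lambda x, a: x + a
learned_dynamics = lambda x, a: x + 0.9 * a   # a slightly wrong model

actions = [1.0, -0.5, 2.0, 0.0]
assert len(actions) == BLOCK
truth = rollout(0.0, actions, true_dynamics)
preds = rollout(0.0, actions, learned_dynamics)

# the error is scored only after the whole block has been predicted
block_error = sum(abs(p - t) for p, t in zip(preds, truth))
print(round(block_error, 6))
```

Because the model's own predictions are fed back in, small dynamics errors compound across the block, which is exactly what forces it to learn genuine cause and effect rather than copy the next frame.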

Key Takeaways

  • A 2D, frame-predictive model trains on a 4-billion-frame video dataset to learn action understanding without explicit labels.
  • Transforming inputs to relative coordinates prevents brittle reliance on exact global positions, improving generalization to small object shifts.
  • Feeding actions in blocks of four prevents the model from cheating by peeking at future frames, enforcing genuine cause-and-effect learning.
  • The new method outperforms previous approaches in realistic hand-object interactions, such as crumpling paper and moving a lid.
  • Distillation compresses the slow, high-quality teacher into a fast student model that reaches ~10 frames per second with very similar predictions.
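The relative-coordinate takeaway can be illustrated with a minimal sketch (hypothetical function names, not code from the paper): an observation built from the gripper-to-object offset is unchanged when the whole scene shifts, which is exactly why such a policy generalizes when the cup moves a few inches.

```python
# Minimal sketch of relative vs. absolute observations (hypothetical names,
# not the paper's code). The observation encodes where the object is
# relative to the gripper, so shifting the entire scene changes nothing.

def relative_obs(gripper_xy, object_xy):
    """Object position expressed in the gripper's frame of reference."""
    return (object_xy[0] - gripper_xy[0], object_xy[1] - gripper_xy[1])

gripper, cup = (0.25, 0.5), (0.75, 0.5)
obs_original = relative_obs(gripper, cup)

# move the whole scene a few units to the left: every absolute coordinate changes
shift = lambda p: (p[0] - 3.0, p[1])
obs_shifted = relative_obs(shift(gripper), shift(cup))

# ...but the relative observation is identical, so the learned skill transfers
print(obs_original == obs_shifted)
```

A policy trained on absolute positions would see the shifted scene as a completely new state; one trained on the relative observation sees the same state it has always seen.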

Who Is This For?

Essential viewing for researchers and developers in robotics and computer vision who want to understand how to train real-world-capable agents from large-scale video data without hand labeling. It also speaks to hobbyists curious about open, reusable AI models you can run on consumer hardware.

Notable Quotes

"Look at that! The paper finally crumples beautifully! And with previous methods, the clipping gets even worse."
Demonstrates the improved realism of predictions vs. earlier approaches.
"Distillation is a training phase where a fast student model is used to learn the predictions of the slower, high-quality teacher model."
Explains how the speed-accuracy gap is mitigated.
"The kicker is that they also predict very similar outcomes."
Highlights the effectiveness of the fast student in matching teacher results.
"This finally gives us smarter AI robots, and robots that we can all own ourselves."
Dr. Zsolnai-Fehér emphasizes practical, accessible robotics.
"A free brain that you can upload to your own devices and use it however you want."
Underscores the open, subscription-free nature of the work.

Questions This Video Answers

  • How does NVIDIA’s new AI predict future frames from video without action labels?
  • What is NeRD and how does it relate to 2D video-based robot learning?
  • Why is distillation critical for achieving interactive frame rates in high-quality video prediction?
  • What makes transforming inputs to relative coordinates more robust for robotic manipulation?
  • Can a 2D video model generalize to thousands of real-world objects and tasks?
Tags: NVIDIA, AI video prediction, Neural Robot Dynamics (NeRD), DreamDojo, distillation, frame prediction, 2D video robotics, robotic manipulation, self-supervised learning, generalization in robotics
Full Transcript
Dear Fellow Scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér. We’ve been trying out doing the videos with a camera. I really enjoyed it, and your feedback was also absolutely incredible. I’ve never seen anything like this. So many comments, thank you so much for the kind words everyone. So, we will try to do more of this. But note that this one is a classic voice paper that we’ve always done here. It was done before we did the camera thing, I thought I would record this little intro now so you don’t get surprised. And then next video I’ll be back. And for now, please enjoy this super fun paper.

How do robots learn how to be a good robot? Well, surely not like this. Haha. Not by just running around in the real world. Of course! I mean, imagine a real robot doing this for years and years. It would be dangerous to others and itself. So here is a better question: how do we teach a robot to be a helpful, good robot safely? Well, we put it inside of a video game.

Start learning there first! In the game, we simulate physics, and let it fail. A lot. Then, over time, get better.

Now, I’ve been to a bunch of AI and robotics labs around the world, and let me briefly summarize what I saw: things work fantastically well in a simulation, and then, when you put them into the real world, huge disappointment. Something that worked really well suddenly does not work well or at all. Why?

Well, the main reason is that simulations are often just not good enough. They often mimic reality, but they are not a substitute for reality.

So what do we do? Well, of course, try to use reality. In this work, DreamDojo, scientists said okay, let’s feed the AI 44 thousand hours of videos of humans doing stuff.

That sounds great, except the fact that it is completely useless. Why? Well, humans and robots have entirely different physical bodies, hands, and joints. Also, the video does not contain action information.
It’s just a soup of data that doesn’t say what joints are exerting forces and how. Nothing.

So why do this? Does this even make sense?

Well, they propose 4 genius ideas, and I hope that will make this work, because it would be a miracle.

One, if the video does not have labels on what actions are taking place, well, then let the AI try to understand it and make up its own story of what is happening. If you see someone waving at a bus that is pulling away, you don’t need a text label to know that someone has just missed their ride.

Two, this dataset is stupendously large. It has more than 4 billion frames, and probably more than 1 quadrillion pixels. Okay, that is too much information. It is almost impossible to handle. So the AI has to learn what is important and what isn’t. How? Well, it is forced to compress information. A musician does not need to know every song in the universe. They have to know that there are 12 notes in a scale, and every song is just built as a combination of these fundamental notes. This forces the AI to look at only the most critical information.

But guess what, it is still not enough to just dump videos into the robot and make it work. Why?

Well, three, if you train a robot to pick up a cup at a global position, it learns to reach for that exact spot in the world. That’s no good. Why? Well, if you move the cup a few inches to the left, the global coordinates change entirely, and the robot has no idea what to do.

So, what scientists said, instead of using absolute robot joint poses, let’s transform the inputs into relative actions. If you are cooking, sometimes you don’t need absolute coordinates. Here, the knife only needs to know where it is relative to the carrot’s spot.

And believe it or not, this is still not working. We need something more. What do we need?

Well, four, the goal is that the AI learns cause and effect. Jelly bunny hits the wall, and something happens.
Try to learn that by predicting the next frame. The problem is that the AI is cheating. Like a student, it just looks at the solution at the end, and says, oh yeah, I was gonna say exactly that. So how did they prevent that?

Well, they fed it actions in small blocks of 4 at a time, so it cannot cheat by peeking at the future to guess what happens right now.

Okay, this was a lot of genius stuff, so it better give us something amazing.

Let’s see what we got. Previous method. Can’t predict the future… oh my, look, that hand clips through the piece of paper. Now hold on to your papers Fellow Scholars for the new method and… oh my! Look at that! The paper finally crumples beautifully!

And with previous methods, the clipping gets even worse. Look. That’s not predicting reality, that’s just guessing. New technique - now we’re talking! Looking good!

Also, previous technique, hand moves the lid, and the lid refuses to move. No good. New technique, the lid moves! Woo-hoo! Yes, this is the corner of the internet where we get unreasonably happy about a moving lid.

And these are not some cherry-picked results, the new technique is so much better than previous methods. This is a huge leap forward!

Now, this gets even better. So it finally understands the world better than previous techniques. So what do we pay for this? How much slower is this than previous methods?

Well, it is pretty slow because it requires 35 heavy denoising steps just to generate one prediction. But wait, don’t despair!

We can use distillation here. Distillation is a training phase where a fast student model is used to learn the predictions of the slower, high-quality teacher model. The goal is that the student would be nearly as good as the teacher model, but much faster.

Well, let’s test that! Oh my, now the student is a heck of a lot faster. It seems that it is 4 times faster than the teacher that was used to train it.
It runs at about 10 frames per second. Understanding the world and predicting how it will change at a speed that is interactive. That is absolutely insane. Well done! And the kicker is that they also predict very similar outcomes. This is an absolute slam dunk paper. Wow.

Now for you wise Fellow Scholars out there, I’ll note that we talked about a technique called NeRD, Neural Robot Dynamics. That was a robot AI that trained in its own imagination. So how does this relate to that? Now NeRD was building a perfect 3D environment. This one thinks in 2D. It just sees the world as a bunch of 2D video pixels on a flat TV screen. Thus, this one is able to learn about thousands of everyday objects. So cool!

This finally gives us smarter AI robots, and robots that we can all own ourselves. In a world full of subscriptions, it is so refreshing that we get all of this for free. A ton of code and pre-trained models are available for free for all of us. No silly subscriptions and proprietary code.

A free brain that you can upload to your own devices and use it however you want. Love it.

So this finally puts us one step closer to having a robot fold our laundry, or cook a healthy meal. Or help a specialist doctor perform surgery from the other side of the planet via teleoperation. What a time to be alive!
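The distillation step described in the transcript can be sketched with a toy numeric example. Everything here is illustrative (a scalar stand-in for a "denoiser", a made-up learning rate), not NVIDIA's actual setup: a teacher refines its estimate over 35 steps, and a student is trained only on the teacher's outputs, trading a tiny bit of accuracy for a large speedup.

```python
# Toy distillation sketch (scalar stand-in, not NVIDIA's actual models).
# Teacher: 35 iterative refinement ("denoising") steps per prediction.
# Student: a single linear step, trained only to match the teacher's outputs.
import random

def teacher(noisy, target, steps=35):
    """Slow but accurate: nudge the estimate toward the answer 35 times."""
    x = noisy
    for _ in range(steps):
        x += 0.2 * (target - x)
    return x

random.seed(0)
data = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]

# distillation: fit student(noisy, target) = a*noisy + b*target to the
# teacher's outputs with plain stochastic gradient descent
a = b = 0.0
for _ in range(2000):
    n, t = random.choice(data)
    err = (a * n + b * t) - teacher(n, t)
    a -= 0.1 * err * n
    b -= 0.1 * err * t

# the student now takes one step instead of 35, yet stays close to the teacher
gap = max(abs(a * n + b * t - teacher(n, t)) for n, t in data)
print(gap < 1e-3)
```

In the paper the student is reported as roughly 4 times faster; compressing all the way down to a single step here is just to keep the toy example small. The key property is the same: the student never sees ground truth, only the teacher's predictions, and ends up nearly indistinguishable from it at a fraction of the cost.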
