NVIDIA’s New AI Changed Robotics Forever

Two Minute Papers | 00:10:04 | Apr 25, 2026

NVIDIA's Sonic is a teleoperated, multimodal robot controller released as open research, and it proves such a system can run on a phone, not just on powerful GPUs.

Summary

Two Minute Papers dives into NVIDIA's Sonic project, a teleoperation-focused robot controller that maps human motions into 3D joint commands. Dr. Károly Zsolnai-Fehér emphasizes that the system understands whole-body movements and can follow a wide range of inputs, from videos and voice to music and text. The breakthrough isn’t just the robot’s stability or the motion capture; it’s the compact model—about 42 million parameters—that can run on a phone or even a toaster. The team uses a universal-token approach, turning diverse inputs into transferable representations and then decoding them into motor commands. A key engineering challenge is safely translating user intent to robot motion, addressed by a root trajectory spring model that dampens abrupt commands. Training demands were enormous—128 GPUs over three days—but the end product is designed for lightweight, on-device use. Open, free access to these models aims to accelerate research and real-world robotics, with NVIDIA’s Jim Fan and professor Zhu leading the effort. The video closes with a hopeful note that this could eventually automate tasks like laundry and cooking, marking a real shift toward open, practical AI robotics.

Key Takeaways

  • Sonic compresses human motion knowledge into a 42 million-parameter neural network small enough to run on consumer devices such as smartphones (the video jokes it could even run on a toaster).
  • The system uses universal tokens to convert multimodal inputs (video, audio, text) into consistent motor commands.
  • A root trajectory spring model dampens aggressive user commands to prevent robot injury and ensure smooth settling to a target pose.
  • Training required 128 GPUs for 3 days, yet the final model is lightweight enough for on-device inference without heavy hardware.
  • The project is led by professor Zhu and Jim Fan at NVIDIA, and the work is released as open research and models for free use.
  • The approach avoids hand-labeling by learning from 100 million frames of human motion without explicit action labels.
  • The architecture cascades through a motion generator, a motion encoder, a quantizer for universal tokens, and a decoder for motor commands.
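The cascade in the last bullet can be sketched as a tiny pipeline. This is a minimal illustration, not NVIDIA's actual code: every function body here is an invented stub, and the three-entry codebook is hypothetical. The one step shown faithfully in spirit is the quantizer, which snaps each latent value to its nearest codebook entry, producing the discrete "universal tokens" the decoder consumes.

```python
def motion_generator(user_input):
    # Stub: pretend any multimodal input maps to a short motion sequence.
    return [float(len(str(user_input)))] * 4

def motion_encoder(motion):
    # Stub: "encode" the motion into a latent vector (here, just rescale).
    return [x / 10.0 for x in motion]

def quantizer(latent, codebook=(0.0, 0.5, 1.0)):
    # Nearest-codebook-entry quantization: each continuous latent value
    # becomes a discrete token index -- the essence of "universal tokens".
    return [min(range(len(codebook)), key=lambda i: abs(codebook[i] - x))
            for x in latent]

def decoder(tokens, codebook=(0.0, 0.5, 1.0)):
    # Stub decoder: map discrete tokens back to motor command values.
    return [codebook[t] for t in tokens]

def control_pipeline(user_input):
    """Sketch of the cascade: generator -> encoder -> quantizer -> decoder."""
    motion = motion_generator(user_input)   # multimodal input -> human motion
    latent = motion_encoder(motion)         # motion -> latent vector
    tokens = quantizer(latent)              # latent -> discrete universal tokens
    return decoder(tokens)                  # tokens -> motor commands
```

The point of the intermediate token stage is that any input modality, once tokenized, looks the same to the decoder, so one decoder serves video, voice, music, and text alike.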

Who Is This For?

Robotics researchers, AI developers, and engineers curious about open, on-device autonomous control; ideal for those exploring multimodal human-robot interaction and safe, lightweight deployment.

Notable Quotes

"This runs with about 42 million parameters. That is a neural network so simple, it can run so easily on your phone it barely notices it."
Highlights the model size and on-device feasibility.
"This is a multimodal system. Meaning that the input can be almost anything."
Emphasizes the flexibility of inputs the system can handle.
"Open research for the benefit of humanity. Love it, thank you so much."
Closes the video on a public-spirited note about the open release of the models.

Questions This Video Answers

  • How can a 42 million-parameter model run on a phone for robotics?
  • What is the root trajectory spring model and how does it keep robots from injuring themselves?
  • What are universal tokens in NVIDIA Sonic and why are they important for multimodal robotics?
  • Can multimodal robot controllers learn without human action labels?
  • Who are professor Zhu and Jim Fan, and what are their roles in NVIDIA's robotics work?
Tags: NVIDIA Sonic, teleoperated robot controller, multimodal robotics, universal tokens, root trajectory spring model, on-device AI, open research, Jim Fan, professor Zhu, robot safety
Full Transcript
Let’s see what is going on here. This is me around 9am. A bit wobbly, steps are unsure, yup, that checks out. Now then, give me my fake badge. Thank you sir. Hehehe, no one noticed. Now let’s proceed to the next step of my mastermind plans. Let’s eat all their food. Wait, they noticed. Proceed to the next step. What was that? Oh yes, run!

Now, jokes aside, look at that. Sign up for this one baby. Oh yes, please mow my lawn. That is excellent. Rake the leaves! Perfect. Hey, don’t slack off, that’s my job!

Okay, so what is going on here. Let’s start with the good news: this is a new teleoperated robot controller and more. They call it Sonic. Now the work here is not the robot, but the software controlling it. At least in this footage; watch until the end and you might get surprised. This means there is a human performing these movements, and the robot is able to understand these motions and then translate them to a bunch of joint positions in 3D space. It’s kind of insane that this is possible. But it will just get better and better as we continue the video.

So, before you ask, yes it can do kung fu. Provided that you can do kung fu. It understands whole body movement, so you can get it to crawl into some space you don’t want to go to. And that is super useful, people are already using robots for that. Why? Well, chiefly, for exploring underexplored and dangerous areas. This means tons of useful applications; for instance, a variant of this could help save humans stuck under rubble, or perhaps later, even explore other planets without putting humans at risk.

But that’s still nothing. Because this is a multimodal system. Meaning that the input can be almost anything. So, you say that I don’t have to pretend to mow the lawn to actually mow the lawn, because where is the fun in that? Well, just tell it to do that. Can you?
Well, currently, for simpler tasks, like moving around or behaving like a monkey, yes you can! Absolutely incredible.

And I love how expressive it is. You can ask it to walk happily, stealthily, or like an injured person.

And you know, just the fact that it is stable and does not fall is remarkable. Previously, even for simple characters in simulated worlds, you needed thousands and thousands of tries to teach them to just be able to walk without falling. And now, this is a huge leap forward. Wow.

But it gets better, we said multimodal. Yup, that means that the input can also be music. I’ll show you the dancing, but not the music because of YouTube reasons, but I put a link in the description where you can check it out.

And we haven’t even talked about the most insane part of the whole thing. Now hold on to your papers Fellow Scholars, because this runs with about 42 million parameters. That is a neural network so small, it can run so easily on your phone that the phone barely notices it. It may even run on your toaster these days. That size is absolutely nothing. This is an incredible achievement.

Okay, but how? How is that even possible? Dear Fellow Scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér. Well, first, it looked at 100 million frames of human motion to understand what we do and how we do it. The incredible thing is that this system does not require human-made action labels, so we don’t have to explain our movements. It just watches the raw motions and figures out how to transition between tasks without any unnatural pauses!

So then, your multimodal input goes in: a video of you, your voice, music, or just text. A motion generator turns these into human motion, the motion encoder processes it into a latent space, and then a quantizer converts it to universal tokens. Once again, universal tokens, that is key, you’ll see a bit later. Then, the decoder translates these tokens into motor commands.
But there is a big problem. Learning to convert one to the other is super hard. First of all, robots do not work like humans; that is one of the fundamental challenges.

So if the user commands you to turn around, it should be turning around. Okay, sure. But how fast exactly? You don’t want to try to turn 180 degrees too quickly, because you would fall apart.

To solve this, in their research paper, they propose what they call a root trajectory spring model. This dampens sudden, quick user commands so the robot does not get injured. Yes, robots can get injured too, which is kind of hilarious.

Now, there is an exponential term as a function of time. What is that? That is a physical brake. As time increases, this term rapidly shrinks to 0, which forces the whole mathematical expression to decay smoothly. This serves two goals: one, the robot does not injure itself, and two, it will settle at a target position without oscillating back and forth forever. Nice.

Now, apply too much damping, and of course, you’ll get a little slug that can’t get anything done, so it’s really tough to do well. Well done folks.

Now, all this took 128 GPUs and 3 days to train. That is expensive. But here’s the key: after the training is done, the final product is so lightweight, we don’t need this kind of hardware to run it at all. In fact, all of the models showcased in these videos will be given to all of us for free, forever. They run on your phone, easy-peasy. That is incredible. Open research for the benefit of humanity. Love it, thank you so much.

This project is led by professor Zhu and Jim Fan, who I love dearly. Jim started the humanoid robots lab at NVIDIA just 2 years ago, and they are raining research papers on us, breakthrough after breakthrough. Insanity.

And to compress all this human movement knowledge down into a tiny little AI controller that can be used by any of us is simply a stunning achievement.
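The exponential brake described above can be sketched in a few lines. To be clear, the paper's exact formulation is not shown in the video; this is a minimal illustration of the general idea, with the function name and the decay rate `k` chosen by us: the gap between the current root position and the commanded target shrinks as exp(-k·t), so the robot approaches the target smoothly and settles without oscillating.

```python
import math

def damped_root_trajectory(current, target, k, t):
    """Exponentially damped approach to a commanded target position.

    At t = 0 the output equals `current`; as t grows, the exponential
    term shrinks to 0 and the output converges to `target` without
    overshooting -- the "physical brake" behavior from the video.
    """
    return target + (current - target) * math.exp(-k * t)

# Sample the smoothed trajectory for an abrupt "move from 0 to 1" command.
trajectory = [damped_root_trajectory(0.0, 1.0, k=2.0, t=0.25 * i)
              for i in range(10)]
```

Too large a `k` barely dampens anything (the robot lurches); too small a `k` gives the "little slug" the video jokes about, which is why tuning this is hard.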
It turns out, training a good AI requires coding good thinking into a machine. But, surprisingly, we ourselves can also learn a lot of good life advice from this kind of thinking too.

For instance, the model compresses a messy, diverse soup of inputs into a kind of pure, abstract token. You know, in life, when asking other people for advice, you will inevitably hear everything, and its opposite too. That is also a big soup of inputs. But try to look at all of them, side by side, and you’ll find that they often share an underlying truth. This works, as is showcased by this incredible project too.

And note that this work is not the end of anything, this is just a start. An early work in a nascent area. Two more papers down the line, and I really hope this is going to start folding my laundry and cooking my lunch. That would be amazing. What a time to be alive! And this is not some proprietary nonsense, this is open knowledge and open models for free for all of us. Thanks Jim! And that’s just one of the many amazing papers they just dropped. If you are interested in hearing more, hopefully soon, subscribe and hit the bell.
