NVIDIA’s New AI Changed Robotics Forever
A human performs movements that the robot interprets and translates into joint positions, enabling real-time control.
NVIDIA's Sonic teleoperated robot controller shows that a multimodal, openly released controller can run on a phone, not just on powerful GPUs.
Summary
Two Minute Papers dives into NVIDIA's Sonic project, a teleoperation-focused robot controller that maps human motions into 3D joint commands. Dr. Károly Zsolnai-Fehér emphasizes that the system understands whole-body movements and can follow a wide range of inputs, from videos and voice to music and text. The breakthrough isn't just the robot's stability or the motion capture; it's the compact model, about 42 million parameters, that can run on a phone (or, as the video jokes, even a toaster). The team uses a universal-token approach, turning diverse inputs into transferable representations and then decoding them into motor commands. A key engineering challenge is safely translating user intent into robot motion, addressed by a root trajectory spring model that dampens abrupt commands. Training demands were enormous (128 GPUs over three days), but the end product is designed for lightweight, on-device use. Open, free access to these models aims to accelerate research and real-world robotics, with NVIDIA's Jim Fan and Professor Zhu leading the effort. The video closes with a hopeful note that this could eventually automate tasks like laundry and cooking, marking a real shift toward open, practical AI robotics.
Key Takeaways
- Sonic compresses human motion into a 42 million-parameter neural network that can run on consumer devices such as smartphones or toasters.
- The system uses universal tokens to convert multimodal inputs (video, audio, text) into consistent motor commands.
- A root trajectory spring model dampens aggressive user commands to prevent robot injury and ensure smooth settling to a target pose.
- Training required 128 GPUs for 3 days, yet the final model is lightweight enough for on-device inference without heavy hardware.
- The project is led by Professor Zhu and Jim Fan at NVIDIA, and the work is released as open research with models free to use.
- The approach avoids hand-labeling by learning from 100 million frames of human motion without explicit action labels.
- The architecture cascades through a motion generator, a motion encoder, a quantizer that produces universal tokens, and a decoder that emits motor commands (see the sketch below).
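To make that cascade concrete, here is a minimal, illustrative sketch that assumes a vector-quantization step for the universal tokens; the array shapes, function names, and codebook size are placeholders for illustration, not details taken from the Sonic paper.

```python
# Illustrative sketch of the generator -> encoder -> quantizer -> decoder cascade.
# All weights are random stand-ins for trained networks; shapes are arbitrary.
import numpy as np

rng = np.random.default_rng(0)

W_gen = rng.standard_normal((64, 32))      # motion generator: input features -> human motion
W_enc = rng.standard_normal((32, 16))      # motion encoder: human motion -> latent vector
codebook = rng.standard_normal((256, 16))  # quantizer codebook: 256 "universal tokens" (assumed size)
W_dec = rng.standard_normal((16, 23))      # decoder: token embedding -> 23 joint commands (assumed count)

def motion_generator(features):
    """Turn multimodal input features (video, voice, music, text) into human motion."""
    return np.tanh(features @ W_gen)

def motion_encoder(human_motion):
    """Compress human motion into a latent vector."""
    return np.tanh(human_motion @ W_enc)

def quantize(latent):
    """Snap the latent onto its nearest codebook entry -> one discrete universal token."""
    token_id = int(np.argmin(np.linalg.norm(codebook - latent, axis=1)))
    return token_id, codebook[token_id]

def decode(token_embedding):
    """Translate a universal-token embedding into motor (joint) commands."""
    return token_embedding @ W_dec

features = rng.standard_normal(64)          # one frame of multimodal input features
token_id, embedding = quantize(motion_encoder(motion_generator(features)))
joint_commands = decode(embedding)
print(f"universal token #{token_id} -> {joint_commands.shape[0]} joint commands")
```

The point of the discrete token step is that very different inputs end up in the same shared vocabulary, so a single decoder can turn any of them into motor commands.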
Who Is This For?
Robotics researchers, AI developers, and engineers curious about open, on-device autonomous control; ideal for those exploring multimodal human-robot interaction and safe, lightweight deployment.
Notable Quotes
"This runs with about 42 million parameters. That is a neural network so simple, it can run so easily on your phone it barely notices it."
—Highlights the model size and on-device feasibility.
"This is a multimodal system. Meaning that the input can be almost anything."
—Emphasizes the flexibility of inputs the system can handle.
"Open research for the benefit of humanity. Love it, thank you so much."
—Closes the video on a public-spirited note about the freely released models.
Questions This Video Answers
- How can a 42 million-parameter model run on a phone for robotics?
- What is the root trajectory spring model and how does it keep robots from injuring themselves?
- What are universal tokens in NVIDIA Sonic and why are they important for multimodal robotics?
- Can multimodal robot controllers learn without human action labels?
- Who are Professor Zhu and Jim Fan, and what are their roles in NVIDIA's robotics work?
NVIDIA Sonic, teleoperated robot controller, multimodal robotics, universal tokens, root trajectory spring model, on-device AI, open research, Jim Fan, Professor Zhu, robot safety
Full Transcript
Let’s see what is going on here. This is me around 9am. A bit wobbly, steps are unsure, yup, that checks out. Now then, give me my fake badge. Thank you sir. Hehehe, no one noticed. Now let’s proceed to the next step of my mastermind plans. Let’s eat all their food. Wait, they noticed. Proceed to the next step. What was that? Oh yes, run! Now, jokes aside, look at that. Sign up for this one baby. Oh yes, please mow my lawn. That is excellent. Rake the leaves! Perfect. Hey, don’t slack off, that’s my job! Okay, so what is going on here.
Let’s start with the good news, this is a new teleoperated robot controller and more. They call it Sonic. Now the work here is not the robot, but the software controlling it. At least in this footage, watch until the end and you might get surprised. This means there is a human performing these movements, and the robot is able to understand these motions, and then translate them to a bunch of joint positions in 3D space. It’s kind of insane that this is possible. But it will just get better and better as we continue the video. So, before you ask, yes it can do kung fu.
Provided that you can do kung fu. It understands whole body movement, so you can get it to crawl into some space you don’t want to go to. And that is super useful, people are already using robots for that. Why? Well, chiefly, for exploring underexplored and dangerous areas. This means tons of useful applications, for instance, a variant of this could help save humans stuck under rubble, or perhaps later, even explore other planets without putting humans at risk. But that’s still nothing. Because this is a multimodal system. Meaning that the input can be almost anything. So, you say that I don’t have to pretend to mow the lawn to actually mow the lawn, because where is the fun in that?
Well, just tell it to do that. Can you? Well, currently, for simpler tasks, like moving around or behaving like a monkey, yes you can! Absolutely incredible. And I love how expressive it is. You can ask it to walk happily, stealthily, or like an injured person. And you know, just the fact that it is stable and does not fall is remarkable. Previously, even in simple characters in simulated worlds, you needed thousands and thousands of tries to teach them to just be able to walk without falling. And now, this, is a huge leap forward. Wow. But it gets better, we said multimodal. Yup, that means that the input can also be music.
I’ll show you the dancing, but not the music because of YouTube reasons, but I put a link in the description where you can check it out. And we haven’t even talked about the most insane part of the whole thing. Now hold on to your papers Fellow Scholars, because this runs with about 42 million parameters. That is a neural network so simple, it can run so easily on your phone it barely notices it. It may even run on your toaster these days. That size is absolutely nothing. This is an incredible achievement. Okay, but how? How is that even possible?
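As a quick sanity check on that claim (my own back-of-envelope arithmetic, not a figure from the video), 42 million parameters take up very little memory at common precisions:

```python
# Rough weight-memory footprint of a 42M-parameter model at common precisions.
params = 42_000_000
for name, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"{name}: ~{params * bytes_per_param / 1e6:.0f} MB of weights")
# fp32: ~168 MB, fp16: ~84 MB, int8: ~42 MB -- tiny next to the multi-gigabyte
# models that normally need datacenter GPUs.
```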
Dear Fellow Scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér. Well, first, it looked at 100 million frames of human motion to understand what we do and how we do it. The incredible thing is that this system does not require human-made action labels, so we don’t have to explain our movements. It just watches the raw motions and figures out how to transition between tasks without any unnatural pauses! So then, your multimodal input goes in, a video of you, your voice, music, or just text. A motion generator turns these into human motion, a motion encoder processes it into a latent space, and then a quantizer converts it to universal tokens.
Once again, universal tokens, that is key, you’ll see a bit later. Then, the decoder translates these tokens into motor commands. But there is a big problem. Learning to convert one to the other is super hard. First of all, robots do not work like humans, that is one of the fundamental challenges. So if the user commands you to turn around, it should be turning around. Okay, sure. But how fast exactly? You don’t want to try to turn 180 degrees too quickly, because you would fall apart. To solve this, in their research paper, they propose what they call a root trajectory spring model.
This dampens sudden, quick user commands so the robot does not get injured. Yes, robots can get injured too, which is kind of hilarious. Now there is an exponential term as a function of time. What is that? That is a physical brake. As time increases, this term rapidly shrinks to 0, which forces the whole mathematical expression to decay smoothly. This serves two goals: one, the robot does not injure itself and two, it will settle at a target position without oscillating back and forth forever. Nice. Now, do the dampening too much, and of course, you’ll get a little slug that can’t get anything done, so it’s really tough to do well.
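Here is a minimal sketch of that braking behaviour, written as a critically damped spring that pulls the root toward the commanded target; this is my own illustration of the exponential-decay idea, not the exact formulation from the Sonic paper:

```python
# Critically damped spring toward a target: the gap shrinks as (1 + k*t) * exp(-k*t),
# so it eases in smoothly and never overshoots or oscillates.
import math

def damped_root_position(start, target, t, stiffness=4.0):
    """Root position at time t under a critically damped pull toward the target.
    `stiffness` is an illustrative tuning knob: too low and the robot becomes a
    sluggish little slug, too high and the brake barely does anything."""
    gap = start - target
    return target + gap * (1.0 + stiffness * t) * math.exp(-stiffness * t)

# An abrupt command: turn 180 degrees (pi radians) right now.
start_yaw, target_yaw = 0.0, math.pi
for t in [0.0, 0.25, 0.5, 1.0, 2.0]:
    print(f"t={t:.2f}s  yaw={damped_root_position(start_yaw, target_yaw, t):.3f} rad")
# The yaw eases toward pi instead of snapping there: large commands still get
# executed, just not violently.
```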
Well done folks. Now, all this took 128 GPUs and 3 days to train. That is expensive. But here’s the key: after the training is done, the final product is so lightweight, we don’t need this kind of hardware to run it at all. In fact, all of the models showcased in these videos will be given to all of us for free, forever. They run on your phone, easy-peasy. That is incredible. Open research for the benefit of humanity. Love it, thank you so much. This project is led by Professor Zhu and Jim Fan, who I love dearly.
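For a rough sense of scale (my own estimate; the video names neither the GPU model nor a price), 128 GPUs for 3 days works out to roughly 9,200 GPU-hours:

```python
# Back-of-envelope training cost. The $/GPU-hour rate is an assumption for
# illustration only; it is not stated in the video or the paper.
gpus, days = 128, 3
gpu_hours = gpus * days * 24                # 9,216 GPU-hours
assumed_rate = 2.0                          # assumed cloud price, $ per GPU-hour
print(f"{gpu_hours:,} GPU-hours -> ~${gpu_hours * assumed_rate:,.0f} at ${assumed_rate:.0f}/GPU-hour")
# Inference needs none of this: the trained 42M-parameter model runs on a phone.
```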
Jim started the humanoid robots lab at NVIDIA just 2 years ago, and they are raining research papers on us, breakthrough after breakthrough. Insanity. And to compress all this human movement knowledge down into a tiny little AI controller that can be used by any of us is simply a stunning achievement. It turns out, training a good AI requires coding good thinking into a machine. But, surprisingly, we ourselves can also learn a lot of good life advice from this kind of thinking too. For instance, the model compresses a messy, diverse soup of inputs into a kind of pure, abstract token.
You know, in life, when asking other people for advice, you will inevitably hear everything, and its opposite too. That is also a big soup of inputs. But try to look at all of them, side by side, and you’ll find that they often share an underlying truth. This works, as is showcased by this incredible project too. And note that this work is not the end of anything, this is just a start. An early work in a nascent area. Two more papers down the line, and I really hope this is going to start folding my laundry and cooking my lunch. That would be amazing.
What a time to be alive! And this is not some proprietary nonsense, this is open knowledge and open models for free for all of us. Thanks Jim! And that’s just one of the many amazing papers they just dropped. If you are interested in hearing more hopefully soon, subscribe and hit the bell.