NVIDIA’s New AI Just Changed Everything
Introduces the issue with proprietary AI and highlights the unveiling of a free open system and its accompanying paper.
NVIDIA’s Nemotron 3 Super unlocks open, auditable AI with 100% transparent training data and a 120B-parameter model, delivering major speed wins through four key architectural innovations.
Summary
Two Minute Papers’ Dr. Károly Zsolnai-Fehér dives into NVIDIA’s Nemotron 3 Super, a jaw-dropping open AI project that comes with a full 51-page research paper and training data details. He emphasizes that this is a rare shift away from closed, subscription-based models toward a freely accessible AI assistant for everyone. The system boasts 25 trillion tokens of training data and a 120-billion-parameter architecture, placing it on par with some of the best open models but still shy of the top closed frontier models from about a year and a half ago. Crucially, the NVFP4 variant runs up to 3.5x faster than NVIDIA’s previous model and up to 7x faster than other similarly capable open models, all with little loss in accuracy. Zsolnai-Fehér highlights four key “secrets” from the paper: multi-token prediction, a compressed number format (NVFP4) for speed, mamba layers for efficient memory, and stochastic rounding to keep accuracy while reducing computation. He explains how multi-token prediction lets the model generate several future tokens at once and verify them together, a major efficiency leap. The discussion also covers the inevitable trade-offs: how rounding can introduce errors that compound over many steps, and how stochastic rounding mitigates this with noise that averages to zero. The video ends with a strong call for open, transparent AI and hints at NVIDIA’s broader push toward fully open systems, inviting viewers to request more deep dives on this groundbreaking work.
Key Takeaways
- Nemotron 3 Super makes a freely available AI assistant with a 120B parameter model trained on 25T tokens, paired with a 51-page paper detailing the process and data.
- NVFP4 delivers up to 3.5x speed improvements over NVIDIA’s prior model and up to 7x faster than many open models, with minimal accuracy loss.
- Four core techniques drive the speedups: multi-token prediction (several future tokens at once), mamba layers for compact memory, targeted rounding (NVFP4) to reduce math without hurting results, and stochastic rounding to control error propagation.
- The approach intentionally preserves accuracy by protecting sensitive calculations while aggressively compressing the rest, enabling faster inference without catastrophic quality loss.
- Despite speed gains, the model still generates steps sequentially in many cases; stochastic rounding helps stabilize long chains of computation and reduces cumulative error.
- The broader takeaway is a shift toward open AI ecosystems, with NVIDIA signaling heavy investment in open, auditable systems rather than purely secretive frontiers.
Who Is This For?
Researchers and developers curious about open AI models, speed-optimized inference techniques, and the implications of transparent training data. This video is essential for anyone tracking the shift from closed, expensive AI to openly accessible, auditable systems.
Notable Quotes
"They absolutely knocked it out of the park. They spilled all the secrets."
—Zsolnai-Fehér highlighting the openness and transparency of the Nemotron 3 Super release.
"Now, we just get this kind of stuff for free. That is mind blowing."
—Emphasizing the free access and transparency of the model and paper.
"The NVFP4 version is about 3.5 times faster than their other model, and it is up to 7 times faster than similarly smart open models."
—Key performance comparison illustrating speed benefits.
"Whenever you round, you can lose accuracy, but here they left sensitive calculations alone."
—Explaining the NVFP4 rounding strategy and its impact on accuracy.
"Memory is precious. Read the book only once, take highly compressed notes."
—Describing the mamba layers concept for efficient memory usage.
Questions This Video Answers
- How does NVFP4 achieve such speedups without sacrificing accuracy?
- What is stochastic rounding and why does it matter for large language models?
- Why is NVIDIA’s Nemotron 3 Super considered a turning point for open AI access?
Tags: NVIDIA Nemotron 3 Super, NVFP4, multi-token prediction, mamba layers, stochastic rounding, open AI models, AI accelerator research, open dataset transparency
Full Transcript
Remember that most AI systems are proprietary, we have to pay a subscription for them, and no one knows how they work or what data they were trained on? Well, now hold on to your papers Fellow Scholars and check out this incredible work, and when I first saw it, my jaw hit the floor. They absolutely knocked it out of the park. They spilled all the secrets. This is an AI assistant that is free for all of us forever, but not just the model itself. They also gave us a 51-page research paper which might be the holy bible of creating such a system for now.
Why is that? Well, they show us every step of how it was done, and the dataset it was trained on as well. That is extraordinary. Usually something is always missing. Not here. They call it Nemotron 3 Super and we are going to find out whether it is indeed super or not. Okay, so in goes 25 trillion tokens as training data, and out comes a 120 billion parameter AI assistant that is how smart exactly? It roughly matches the best closed frontier models from about a year and a half ago. Note that those models cost billions of dollars to train and every detail about them was kept secret.
And now, we just get this kind of stuff for free. That is mind blowing. This is amazing for us, consumers and Fellow Scholars. So as you see, it is really smart. Up with some of the best open models out there in most tests, but note that it’s still a bit behind in some areas. Here’s something that surprised me: in this result, they showcase two versions of the new model, BF16 and NVFP4. They perform roughly the same in terms of accuracy, so why the big fuss about this? Well, look at this. Holy mother of papers. Wow. Well, the NVFP4 version is about 3.5 times faster than their other model, and it is up to 7 times faster than similarly smart open models.
So the story is not just the similarly smart part, the story is that it is 7 times faster while it is similarly smart. Goodness. Okay, so how on Earth did they do that? So here are 4 secrets they gave us from the paper, in very simple words. Dear Fellow Scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér. Okay, NVFP4. What is that? This is a way for speeding up the AI to run a great deal faster by essentially compressing the mathematics it uses. Seeing a long number and rounding off a few digits. You get a smaller format. Less work!
What’s wrong with that? Well, everything. Normally, if you do that, you lose too much accuracy and the system will output nonsense. However, here, scientists did it the smart way: they left the most sensitive calculations alone, and did this rounding for the rest, where it does not cause trouble. The result is that it runs up to 7 times faster than many other techniques. And we saw that it gives us no meaningful loss in accuracy. Magic. But there is more magic. When other AI techniques write their answer, they write it token by token. Let’s simplify by saying word by word.
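The “round where it doesn’t hurt” idea can be sketched in a few lines. This is a toy simulation, not NVIDIA’s actual NVFP4 kernel: `quantize_fp4_sim` is a hypothetical helper that snaps each value to one of the eight magnitudes an FP4-style (E2M1) format can represent, with one shared scale per block of numbers.

```python
import numpy as np

# Toy simulation of 4-bit float quantization (not the real NVFP4 kernel):
# snap each value to the nearest representable FP4 (E2M1) magnitude,
# sharing one scale factor across the whole block of numbers.
FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_sim(x):
    scale = np.max(np.abs(x)) / FP4_LEVELS[-1]   # per-block scale
    scaled = np.abs(x) / scale                   # map into FP4 range
    idx = np.argmin(np.abs(scaled[:, None] - FP4_LEVELS[None, :]), axis=1)
    return np.sign(x) * FP4_LEVELS[idx] * scale  # back to original range

x = np.array([0.03, -1.7, 2.4, -0.002])
xq = quantize_fp4_sim(x)  # large values survive well; tiny ones flush to 0
```

Notice how the large values come through almost untouched while the tiny ones vanish, which is exactly why the sensitive calculations are left in higher precision.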
Writing one word at a time. But not this one. This one calculates several future words at once. A whole sentence! Almost. Specifically, 7 tokens. And then the system verifies the 7 tokens in one go. Another massive speed up. They call it multi-token prediction. But why stop there? Let’s add even more magic! They showcased these weird things they call the mamba layers. What do these do? Well, traditional AI systems have a bit of a memory problem. They work like a student who constantly re-reads the textbook over and over again when they are given a question. Scientists at NVIDIA say, that’s not the way to go.
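The propose-then-verify loop behind multi-token prediction can be sketched with two stand-in models. This is a generic speculative-decoding-style toy, not Nemotron’s exact multi-token head: `draft_next` and `target_next` are hypothetical functions that each map a token list to the next token.

```python
# Toy propose-then-verify loop (a generic sketch, not Nemotron's exact
# multi-token prediction head). A cheap draft proposes k tokens; the
# target model checks them and keeps the longest agreeing prefix.
def speculative_step(draft_next, target_next, context, k=7):
    proposal, ctx = [], list(context)
    for _ in range(k):                  # draft k future tokens cheaply
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)
    accepted, ctx = [], list(context)
    for t in proposal:                  # verify (in parallel in practice)
        want = target_next(ctx)
        if want == t:
            accepted.append(t)
            ctx.append(t)
        else:                           # first mismatch: take the target's
            accepted.append(want)       # own token and stop this step
            break
    return accepted

nxt = lambda ctx: len(ctx) % 5          # hypothetical toy "model"
tokens = speculative_step(nxt, nxt, [1, 2], k=7)  # draft agrees: 7 accepted
```

When the draft agrees with the target, all 7 tokens are accepted in one verification pass; on a mismatch, the target’s own token still comes out, so a step never loses ground.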
Memory is precious. So instead, read the book only once, and take highly compressed notes. So this kind of memory remembers important details about the conversation. However, it is smart enough to throw away the filler words. Thus, this system can process massive amounts of data efficiently. All this sounds glorious, but this still does not give us a working system. Why is that? Well, this is why. You see that there is a lot of addition here? That is the problem. The AI generates your answer step by step, and because we rounded off the numbers, there is a little error.
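The “re-read the book” versus “compressed notes” contrast can be sketched with a running mean standing in for a Mamba-style recurrent state. Both helpers below (`reread_summary`, `notes_summary`) are hypothetical toys, not the paper’s architecture.

```python
# Toy contrast between the two memory strategies described above,
# using a running mean as a stand-in for a learned recurrent state.
def reread_summary(tokens):
    # "Re-read the textbook": every step revisits the whole history,
    # so the work grows with the length of the conversation.
    return [sum(tokens[:i + 1]) / (i + 1) for i in range(len(tokens))]

def notes_summary(tokens):
    # "Highly compressed notes": one fixed-size state, updated per token,
    # so each step costs the same no matter how long the history is.
    state, out = 0.0, []
    for i, t in enumerate(tokens):
        state += (t - state) / (i + 1)  # fold the new token into the notes
        out.append(state)
    return out
```

Both produce the same summary at every step, but the second never looks back, which is what lets this kind of layer chew through massive inputs efficiently.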
That’s not a problem. Here’s the problem. There are many steps, and the error is magnified through each step. Imagine trying to walk to your car, which is 100 steps away, but you feel a bit sluggish today and every single one of your steps is a bit smaller than it was before. What’s the result? Well, of course, after 100 steps, you are still really far away from your car! So what is the solution? Well, scientists solved this by adding back some random noise in the system. But wait, this noise is carefully crafted in a way that it averages to zero. So your new steps are sometimes smaller, and sometimes bigger than they used to be, but if you average them out, over 100 steps, you will be exactly at your car. So good!
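The zero-mean noise trick can be sketched directly. `stochastic_round` below is a hypothetical toy, not NVIDIA’s hardware implementation: it rounds up with probability equal to the fractional part, so the expected rounding error is zero.

```python
import random

def stochastic_round(x):
    # Round up with probability equal to the fractional part, so the
    # rounding error averages to zero over many steps.
    lo = int(x // 1)
    return lo + (1 if random.random() < x - lo else 0)

random.seed(0)
n, step = 1000, 0.7
always_down = sum(int(step) for _ in range(n))            # truncation: 0
stochastic = sum(stochastic_round(step) for _ in range(n))
# always_down misses the true total of 0.7 * 1000 = 700 entirely;
# stochastic lands close to 700
```

Deterministic truncation loses the whole distance; the stochastic version’s errors cancel out, just like the sometimes-smaller, sometimes-bigger steps in the walk to the car.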
They call this stochastic rounding and it is a genius idea. Now, not even this technique is perfect. For instance, when I give it my favorite question about assembling robotic cows, with lots of math, I like this guy a lot, but it thinks for almost an hour to get me an answer for that one. That’s a lot. So if I have workloads like that, I like to run it on a much faster Lambda instance. But still I think the AI game has suddenly changed. Closed systems used to dominate. Now, not anymore. It seems to me that Jensen at NVIDIA is not playing games here.
It’s in the news that they are going to invest tens of billions of dollars into fully open systems like this. I am not a money person, I don’t know how that works exactly, but if we get to own more amazing free AI systems. Well, sign me up for this one! What a time to be alive! And there is just so much more in the paper, I would definitely love to come back for at least another video on it. Let me know in the comments if you would like that, and if you enjoyed this, subscribe, and hit the bell.