NVIDIA’s New AI Just Changed Everything

Two Minute Papers | 00:08:10 | Apr 9, 2026

NVIDIA’s Nemotron 3 Super unlocks open, auditable AI with 100% transparent training data and a 120B-parameter model, delivering speed wins through four key technical innovations.

Summary

Two Minute Papers’ Dr. Károly Zsolnai-Fehér dives into NVIDIA’s Nemotron 3 Super, a jaw-dropping open AI project that comes with a full 51-page research paper and training data details. He emphasizes that this is a rare shift away from closed, subscription-based models toward a freely accessible AI assistant for everyone. The system boasts 25 trillion tokens of training data and a 120-billion-parameter architecture, placing it on par with some of the best open models and roughly matching the best closed frontier models from about a year and a half ago. Crucially, the NVFP4 variant runs up to 3.5x faster than NVIDIA’s previous model and up to 7x faster than other similarly capable open models, all with little loss in accuracy. Zsolnai-Fehér highlights four key “secrets” from the paper: multi-token prediction, compressed low-precision math (NVFP4) for speed, Mamba layers for efficient memory, and stochastic rounding to keep accuracy while reducing computation. He explains how multi-token prediction lets the model generate several future tokens at once and verify them together, a major efficiency leap. The discussion also covers the inevitable trade-offs: how rounding introduces small errors that compound over many steps, and how stochastic rounding mitigates this with noise that averages to zero. The video ends with a strong call for open, transparent AI and hints at NVIDIA’s broader push toward fully open systems, inviting viewers to request more deep dives on this groundbreaking work.

Key Takeaways

  • Nemotron 3 Super makes a freely available AI assistant with a 120B parameter model trained on 25T tokens, paired with a 51-page paper detailing the process and data.
  • NVFP4 delivers up to 3.5x speed improvements over NVIDIA’s prior model and up to 7x faster than many open models, with minimal accuracy loss.
  • Four core techniques drive the speedups: multi-token prediction (drafting and verifying several future tokens at once), low-precision NVFP4 math (rounding only where it does not hurt results), Mamba layers for compact memory, and stochastic rounding to control error accumulation.
  • The approach intentionally preserves accuracy by protecting sensitive calculations while aggressively compressing the rest, enabling faster inference without catastrophic quality loss.
  • Despite speed gains, the model still generates steps sequentially in many cases; stochastic rounding helps stabilize long chains of computation and reduces cumulative error.
  • The broader takeaway is a shift toward open AI ecosystems, with NVIDIA signaling heavy investment in open, auditable systems rather than purely secretive frontiers.
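
The "protect the sensitive math, compress the rest" idea from the takeaways above can be sketched in a few lines of Python. This is a toy illustration only: real NVFP4 is a 4-bit floating-point element format with per-block scale factors inside the GPU, while here we fake the compression with block-scaled 4-bit integers, and the rule for which values count as "sensitive" (the largest-magnitude ones) is a hand-picked stand-in.

```python
import numpy as np

def mixed_precision_compress(x, block=16, keep=4):
    """Toy 'compress most values, protect the sensitive ones' scheme.

    Most entries are stored as block-scaled 4-bit integers in [-7, 7]
    (a stand-in for a 4-bit format with per-block scales); the `keep`
    largest-magnitude entries stay in full precision, because rounding
    them would hurt accuracy the most.
    """
    keep_idx = np.argsort(np.abs(x))[-keep:]       # the "sensitive" values
    kept = x[keep_idx].copy()
    y = x.copy()
    y[keep_idx] = 0.0                              # compress everything else
    blocks = y.reshape(-1, block)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0.0, 1.0, scale)     # avoid division by zero
    q = np.clip(np.round(blocks / scale), -7, 7)   # the 4-bit payload
    out = (q * scale).reshape(-1)                  # dequantize
    out[keep_idx] = kept                           # restore sensitive values
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal(64)
x[5] = 40.0                                        # one huge, sensitive outlier
approx = mixed_precision_compress(x)
assert approx[5] == 40.0                           # the outlier survives exactly
assert np.abs(approx - x).max() < 0.3              # the rest is only mildly rounded
```

The point of the sketch is the trade: most numbers get cheap 4-bit storage, while the few values whose rounding would do real damage are left untouched.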

Who Is This For?

Researchers and developers curious about open AI models, speed-optimized inference techniques, and the implications of transparent training data. This video is essential for anyone tracking the shift from closed, expensive AI to openly accessible, auditable systems.

Notable Quotes

"They absolutely knocked it out of the park. They spilled all the secrets."
Zsolnai-Fehér highlighting the openness and transparency of the Nemotron 3 Super release.
"Now, we just get this kind of stuff for free. That is mind blowing."
Emphasizing the free access and transparency of the model and paper.
"The NVFP4 version is about 3.5 times faster than their other model, and it is up to 7 times faster than similarly smart open models."
Key performance comparison illustrating speed benefits.
"Whenever you round, you can lose accuracy, but here they left sensitive calculations alone."
Explaining the NVFP4 rounding strategy and its impact on accuracy.
"Memory is precious. Read the book only once, take highly compressed notes."
Describing the Mamba layers concept for efficient memory usage.
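
The "read the book only once, take highly compressed notes" idea behind the Mamba layers can be made concrete with a toy fixed-size memory. This is not how Mamba layers are actually implemented (they use learned selective state-space updates); the filler-word set, random embeddings, and decay rule below are invented purely for illustration.

```python
import numpy as np

FILLER = {"the", "a", "an", "and", "um"}           # invented filler-word list

def take_compressed_notes(tokens, d_state=8, seed=0):
    """Toy fixed-size memory: read the input exactly once and fold each
    token into a small state vector, skipping filler words. The state
    never grows, no matter how long the conversation gets."""
    rng = np.random.default_rng(seed)
    embed = {}                                     # hypothetical token embeddings
    state = np.zeros(d_state)
    for tok in tokens:
        if tok in FILLER:
            continue                               # throw away the filler words
        if tok not in embed:
            embed[tok] = rng.standard_normal(d_state)
        state = 0.9 * state + 0.1 * embed[tok]     # one cheap update per token
    return state

notes_short = take_compressed_notes("the cat sat on the mat".split())
notes_long = take_compressed_notes("the cat sat on the mat".split() * 10_000)
assert notes_short.shape == notes_long.shape == (8,)   # memory cost is constant
```

Contrast this with attention, which keeps every past token around and re-reads it on each step: here the memory cost stays constant however long the input gets.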

Questions This Video Answers

  • How does NVFP4 achieve such speedups without sacrificing accuracy?
  • What is stochastic rounding and why does it matter for large language models?
  • Why is NVIDIA’s Nemotron 3 Super considered a turning point for open AI access?
Tags: NVIDIA Nemotron 3 Super, NVFP4, multi-token prediction, Mamba layers, stochastic rounding, open AI models, AI accelerator research, open dataset transparency
Full Transcript
Remember that most AI systems are proprietary, we have to pay a subscription for them, and no one knows how they work or what data they were trained on? Well, now hold on to your papers Fellow Scholars and check out this incredible work, and when I first saw it, my jaw hit the floor. They absolutely knocked it out of the park. They spilled all the secrets. This is an AI assistant that is free for all of us forever, but not just the model itself. They also gave us a 51-page research paper which might be the holy bible of creating such a system for now. Why is that?

Well, they show us every step of the way of how it was done, and the dataset it was trained on as well. That is extraordinary. Usually something is always missing. Not here.

They call it Nemotron 3 Super and we are going to find out whether it is indeed super or not.

Okay, so in goes 25 trillion tokens as training data, and out comes a 120 billion parameter AI assistant that is how smart exactly? It roughly matches the best closed frontier models from about a year and a half ago. Note that those models cost billions of dollars to train and every detail about them was kept in secret. And now, we just get this kind of stuff for free. That is mind blowing. This is amazing for us, consumers and Fellow Scholars.

So as you see, it is really smart. Up with some of the best open models out there in most tests, but note that it is still a bit behind in some areas. Here is something that surprised me: in this result, they showcase two versions of the new model, BF16 and NVFP4. They perform roughly the same in terms of accuracy, so why the big fuss about this?

Well, look at this. Holy mother of papers. Wow. Well, the NVFP4 version is about 3.5 times faster than their other model, and it is up to 7 times faster than similarly smart open models. So the story is not just the similarly smart part, the story is that it is 7 times faster while it is similarly smart. Goodness.
Okay, so how on Earth did they do that? So here are 4 secrets they gave us from the paper, in very simple words.

Dear Fellow Scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér.

Okay, NVFP4. What is that? This is a way for speeding up the AI to run a great deal faster by essentially compressing the mathematics it uses. Seeing a long number and rounding off a few digits. You get a smaller format. Less work! What's wrong with that? Well, everything. Normally, if you do that, you lose too much accuracy and the system will output nonsense. However, here, scientists did it the smart way: they left the most sensitive calculations alone, and did this rounding for the rest, where it does not cause trouble. The result is that it runs up to 7 times faster than many other techniques. And we saw that it gives us no meaningful loss in accuracy. Magic.

But there is more magic. When other AI techniques write their answer, they write it token by token. Let's simplify by saying word by word. Writing one word at a time. But not this one. This one calculates several future words at once. A whole sentence! Almost. Specifically, 7 tokens. And then the system verifies the 7 tokens in one go. Another massive speed up. They call it multi-token prediction.

But why stop there? Let's add even more magic! They showcased these weird things they call the Mamba layers. What do these do? Well, traditional AI systems have a bit of a memory problem. They work like a student who constantly re-reads the textbook over and over again when they are given a question.

Scientists at NVIDIA say, that's not the way to go. Memory is precious. So instead, read the book only once, and take highly compressed notes. So this kind of memory remembers important details about the conversation. However, it is smart enough to throw away the filler words. Thus, this system can process massive amounts of data efficiently.
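
The draft-then-verify pattern behind multi-token prediction described above can be sketched as a toy loop. The paper's exact mechanism may differ; the two "models" below are stand-in functions over integer tokens, and acceptance is an exact-match check rather than anything probabilistic, but the shape is the same: propose 7 future tokens cheaply, check them in one pass, keep the agreeing prefix.

```python
def speculative_generate(target_next, draft_next, prompt, n_new=12, k=7):
    """Toy draft-then-verify loop: a cheap drafter proposes k future
    tokens, the real model checks them in one pass and keeps the longest
    prefix it agrees with, then contributes one token of its own."""
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        # Draft k future tokens cheaply, one after another.
        ctx = list(out)
        draft = []
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # Verify the whole draft against the target model.
        ctx = list(out)
        for t in draft:
            if target_next(ctx) == t:
                ctx.append(t)               # accepted: matches the target
            else:
                break                       # rejected: stop at first mismatch
        ctx.append(target_next(ctx))        # target always adds one token
        out = ctx
    return out[:len(prompt) + n_new]

# Stand-in "models": the target counts mod 10, and the drafter usually
# agrees but makes a mistake whenever it sees a 4.
target_next = lambda seq: (seq[-1] + 1) % 10
draft_next = lambda seq: 0 if seq[-1] == 4 else (seq[-1] + 1) % 10

out = speculative_generate(target_next, draft_next, [0])
assert out == [i % 10 for i in range(13)]   # identical to plain decoding
```

Note the guarantee: because only tokens the target agrees with are ever kept, the output is exactly what one-token-at-a-time decoding would produce; the drafter only changes how many sequential steps that takes.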
All this sounds glorious, but this still does not give us a working system. Why is that?

Well, this is why. You see that there is a lot of addition here? That is the problem. The AI generates your answer step by step, and because we rounded off the numbers, there is a little error. That's not a problem. Here's the problem. There are many steps, and the error is magnified through each step. Imagine trying to walk to your car, which is 100 steps away, but you feel a bit sluggish today and every single one of your steps is a bit smaller than it was before. What's the result? Well, of course, after 100 steps, you are still really far away from your car!

So what is the solution? Well, scientists solved this by adding back some random noise in the system. But wait, this noise is carefully crafted in a way that it averages to zero. So your new steps are sometimes smaller, and sometimes bigger than they used to be, but if you average them out, over 100 steps, you will be exactly at your car. So good! They call this stochastic rounding and it is a genius idea.

Now, not even this technique is perfect. For instance, when I give it my favorite question about assembling robotic cows, with lots of math, I like this guy a lot, but it thinks for almost an hour to get me an answer for that one. That's a lot. So if I have workloads like that, I like to run it on a much faster Lambda instance.

But still I think the AI game has suddenly changed. Closed systems used to dominate. Now, not anymore. It seems to me that Jensen at NVIDIA is not playing games here. It's in the news that they are going to invest tens of billions of dollars into fully open systems like this. I am not a money person, I don't know how that works exactly, but if we get to own more amazing free AI systems, well, sign me up for this one! What a time to be alive!
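
The walking-to-your-car analogy can be reproduced numerically. Below, a toy accumulator that always rounds down gets stuck, while one that rounds stochastically (up with probability equal to the leftover fraction, so the error averages to zero) lands near the true sum. The grid size and increment are made-up numbers for illustration.

```python
import numpy as np

def stochastic_round(x, step, rng):
    """Round x to a multiple of `step`, rounding up with probability equal
    to the leftover fraction, so the rounding error averages to zero."""
    lo = np.floor(x / step) * step
    frac = (x - lo) / step
    return lo + step * (rng.random() < frac)

rng = np.random.default_rng(0)
step, increment, n = 0.25, 0.1, 1000
exact = increment * n                        # the true total, about 100

# Always rounding down: every step falls short, the errors all point the
# same way, and here the walk never even gets started.
trunc = 0.0
for _ in range(n):
    trunc = np.floor((trunc + increment) / step) * step
assert trunc == 0.0                          # stuck at zero forever

# Stochastic rounding: steps are sometimes long, sometimes short, but
# unbiased, so after many steps we land very close to the target.
stoch = 0.0
for _ in range(n):
    stoch = stochastic_round(stoch + increment, step, rng)
assert abs(stoch - exact) < 20
```

The deterministic accumulator illustrates the extreme case of one-sided error: 0.1 always rounds down to 0 on a 0.25 grid, so nothing ever accumulates; the stochastic one drifts randomly around the true value instead of away from it.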
And there is just so much more in the paper, I would definitely love to come back for at least another video on it. Let me know in the comments if you would like that, and if you enjoyed this, subscribe, and hit the bell.
