NVIDIA’s New AI Is Fast For A Strange Reason

Two Minute Papers | 00:05:42 | May 13, 2026

NVIDIA’s new open model nails throughput and cost efficiency, processing multimodal data faster and cheaper than rivals, with a unique five-into-one design.

Summary

Two Minute Papers’ Dr. Károly Zsolnai-Fehér breaks down a striking open AI model with 30 billion parameters and multimodal inputs (images, video, and audio). The model excels in throughput and cost efficiency, claiming nearly 10 hours of video processed per real hour and up to seven times faster document processing than competitors like Qwen 3 Omni. It requires substantial hardware: about 25 GB of video memory for local runs, though it can also run in the cloud on Lambda’s GPUs. Dr. Zsolnai-Fehér highlights five core strengths: layers whose cost scales linearly with context length, a novel audio-token pathway that preserves emotion without a full Whisper-style speech-recognition model, 3D convolutions that process blocks of frames rather than individual frames, three specialized vision models (matching, fine detail, and segmentation) distilled into a single small encoder, and an efficient video sampling scheme that drops duplicate frames. He also notes a distinctive licensing situation (a custom license rather than Apache 2.0) and cautions that pure text reasoning and coding are not its strongest suits, but its multimodal speed and economy make it compelling for real-world, large-scale processing. The takeaway is clear: open, deployable multimodal models are becoming practical, affordable tools you can own and run yourself, shifting the landscape for research and industry alike. Finally, Dr. Zsolnai-Fehér demonstrates a real-world deployment by running the DeepSeek AI model on Lambda GPUs, underscoring the practical potential of these tools for developers and researchers.

Key Takeaways

  • The 30B-parameter open model processes almost 10 hours of video per real hour, roughly 10× real-time throughput for multimodal workloads.
  • It preserves audio emotion and tone by tokenizing raw audio directly instead of relying on a separate Whisper-style speech-recognition model, reducing cost and latency (first sketch after this list).
  • 3D convolutions process blocks of frames, not individual frames, dramatically reducing computation while maintaining quality (second sketch below).
  • A single encoder combines three specialized capabilities (image-text matching, fine-detail recognition, and object segmentation) without needing multiple standalone CLIP-style models (third sketch below).
  • Efficient video sampling discards near-duplicate frames to cut data size, lowering both compute and memory usage (fourth sketch below).
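The video explains what the audio pathway achieves (tokens straight from the waveform, emotion intact) but not how it is built. Here is a minimal NumPy sketch of the general idea only; the patch size and the random projection standing in for a learned one are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def tokenize_waveform(wave, patch_size=320, embed_dim=256):
    # Slice the raw waveform into fixed-size patches and project each
    # patch to an embedding. The token stream keeps prosody, tone, and
    # emotion that a Whisper-style speech-to-text front end would strip
    # away. Patch size and projection are illustrative guesses.
    n_patches = len(wave) // patch_size
    patches = wave[: n_patches * patch_size].reshape(n_patches, patch_size)
    projection = rng.normal(size=(patch_size, embed_dim)) * patch_size ** -0.5
    return patches @ projection  # shape: (n_patches, embed_dim)

wave = rng.normal(size=16_000)   # one second of 16 kHz audio as a stand-in
tokens = tokenize_waveform(wave)
print(tokens.shape)              # (50, 256): fifty audio tokens
```

Because the language model consumes these tokens directly, no separate, expensive recognition model sits in the loop, which is where the cost and latency savings come from.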
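For the 3D convolution point, a short PyTorch sketch shows why block-of-frames processing is cheaper: one Conv3d kernel spans several frames at once and downsamples time and space in a single pass. The kernel and stride sizes below are made up for illustration, not taken from the model:

```python
import torch
import torch.nn as nn

video = torch.randn(1, 3, 32, 224, 224)  # (batch, RGB, frames, height, width)

# One kernel covers a 4-frame x 14 x 14 pixel block; non-overlapping
# strides mean each block is encoded exactly once.
patchify = nn.Conv3d(
    in_channels=3, out_channels=768,
    kernel_size=(4, 14, 14),
    stride=(4, 14, 14),
)

tokens = patchify(video)
print(tokens.shape)  # torch.Size([1, 768, 8, 16, 16])
# 32 frames collapse into 8 temporal steps, so the model sees 4x fewer
# token positions than frame-by-frame 2D patching would produce.
```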
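The encoder distillation is described only at a high level in the video: three teachers (image-text matching, fine detail, segmentation) compressed into one small student. A toy PyTorch sketch of multi-teacher feature distillation, with every module size invented:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One shared student encoder plus a projection head per teacher space.
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 512))
heads = nn.ModuleDict({
    "matching": nn.Linear(512, 512),  # CLIP-style image-text teacher
    "detail":   nn.Linear(512, 768),  # fine-detail teacher
    "segment":  nn.Linear(512, 256),  # segmentation teacher
})

def distill_step(images, teacher_feats, optimizer):
    # Pull each projected student feature toward the matching frozen
    # teacher's feature; the student ends up serving all three roles.
    shared = student(images)
    loss = sum(
        F.mse_loss(heads[name](shared), feats)
        for name, feats in teacher_feats.items()
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

images = torch.randn(8, 3, 64, 64)
teacher_feats = {  # random stand-ins for frozen teacher outputs
    "matching": torch.randn(8, 512),
    "detail":   torch.randn(8, 768),
    "segment":  torch.randn(8, 256),
}
opt = torch.optim.AdamW(
    list(student.parameters()) + list(heads.parameters()), lr=1e-4
)
print(distill_step(images, teacher_feats, opt))
```

At inference time only the single student encoder runs, which is why this is so much cheaper than keeping three standalone vision models around.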
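Finally, the duplicate-frame dropping can be illustrated in a few lines of NumPy. The video does not say which similarity measure the model uses; mean absolute pixel difference below is a simple stand-in:

```python
import numpy as np

def drop_duplicate_frames(frames, threshold=0.02):
    # Keep a frame only if it differs enough from the last kept frame.
    kept = [frames[0]]
    for frame in frames[1:]:
        if np.abs(frame - kept[-1]).mean() > threshold:
            kept.append(frame)
    return np.stack(kept)

rng = np.random.default_rng(0)
scene = rng.random((64, 64, 3))
# 300 frames of a mostly static scene, with a change every 30 frames:
frames = np.stack([
    scene + (0.2 * rng.random(scene.shape) if t % 30 == 0 else 0.0)
    for t in range(300)
])
print(len(frames), "->", len(drop_duplicate_frames(frames)))  # 300 -> 20
```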

Who Is This For?

Researchers and developers working with multimodal AI, real-time video processing, or scalable AI deployments who want high throughput at lower cost, with the option of running large models on their own or cloud hardware.

Notable Quotes

"Hmm, 30 billion parameters in a new open free AI model where images, video, and audio all work."
Introductory claim about the model's scale and multimodal capability.
"Now, hold on to your papers, fellow scholars, because it processes almost 10 hours of video per hour."
Highlights the astonishing throughput claim.
"Not one standalone CLIP model. Nope, this one distills down three models... into one small encoder neural network."
Explains the CLIP distillation approach and efficiency.
"If you're doing pure text reasoning or pure coding, I would probably look elsewhere."
Balanced view on strengths/limits, guiding use cases.
"Here you see me running the full DeepSeek AI model through Lambda GPU Cloud..."
Demonstrates real-world deployment via Lambda GPUs.

Questions This Video Answers

  • How does linear context-length scaling differ from quadratic scaling in large multimodal models? (See the sketch after this list.)
  • What makes 3D convolutions more efficient for video processing than frame-by-frame methods?
  • Why would an open model use a distilled trio of CLIP-like capabilities instead of a single standalone model?
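To make the first question concrete, here is a minimal NumPy sketch (not the model's actual architecture) contrasting the two scaling behaviors: full self-attention materializes an n × n score matrix, while a recurrent or state-space style layer folds each token into a fixed-size state:

```python
import numpy as np

def quadratic_attention(x):
    # Vanilla self-attention: the (n, n) score matrix makes time and
    # memory grow quadratically with the number of tokens n.
    scores = x @ x.T
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def linear_recurrence(x, decay=0.9):
    # A toy recurrent layer: each token updates one fixed-size state,
    # so time grows linearly with n and memory stays constant per step.
    state = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t, token in enumerate(x):
        state = decay * state + token
        out[t] = state
    return out

x = np.random.default_rng(0).normal(size=(1_000, 64))
_ = quadratic_attention(x)   # stores a 1,000 x 1,000 score matrix
_ = linear_recurrence(x)     # never stores more than one 64-dim state
```

Double the context and the attention path does four times the work, while the recurrent path does only twice the work; that gap is exactly why linear-scaling layers pull ahead on long documents, audio, and video.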
NVIDIA, open AI model, multimodal AI, throughput, cost efficiency, 3D convolutions, CLIP model distillation, video sampling, Lambda GPU Cloud, Qwen 3 Omni
Full Transcript
Hmm, 30 billion parameters in a new open free AI model where images, video, and audio all work. Hmm, [clears throat] why? There are a bunch of other free systems around in this area like the amazing Gemma 4. So, what does this do better than those? Two words, throughput and cost efficiency. Okay, what does that mean in practice? Now, hold on to your papers, fellow scholars, because it processes almost 10 hours of video per hour. Whoo, that is nearly 10 times real time. That is insanely quick. Wow, almost three times faster than Qwen 3 Omni. And when processing documents, it gets up to seven times faster. To run it locally, you'll want something like this or a beefy desktop GPU. We're talking about 25 gigs of video memory, not something you run on your phone. And to run it in the cloud, I use Lambda. Okay, so how did they do that? Where's the magic sauce? Well, it does five things really well and one thing not so well. Dear fellow scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér.

Well, one, Mamba layers scale linearly with context length instead of quadratically. What does that mean? Well, it means you throw everything you got at it. The more documents you have, the longer video or audio you have, the bigger the advantage this one has. So, if you're running something online that processes those on a mass scale, this is going to be incredible.

Two, when audio comes in, this side converts raw audio waves into tokens, but differently than elsewhere. Normally, you have a speech recognition model here. Those are often huge and expensive and strip away all emotion and tone from the input. But this one keeps all this data and still does the job well. So much cheaper than running a whole separate model like Whisper on top.

Three, when you give it an image or video, many previous generation techniques smash it into a different aspect ratio. This one keeps it. Then, oh, look at this. Convolutions in 3D. Now we're talking. Many other techniques look at the video frame by frame. It takes tons and tons of computation to finish these videos. Here, the 3D convolution looks at blocks of frames. It looks at a package of frames at the same time, and thus it can compress it a great deal. Faster, cheaper.

Four, now that's really interesting, somewhat unexpected. You would expect a huge standalone CLIP model here. These essentially predict what text would match the image well. You need that here, too. But, here's the trick. Not one standalone CLIP model. Nope, this one distills down three models. One for matching images to text, one for fine details, and one for object segmentation. Now, all three of these are smashed down into one small encoder neural network. Once again, super efficient.

Five, efficient video sampling. This is a good one. At this point, we have thrown, let's say, a video with 300 images into the neural network. That's still a lot of data, but it turns out not all frames are completely unique. Many of them share the same background, for instance. And this one finally throws away this duplicate information. And it makes it, you guessed it right, even cheaper and more efficient.

Okay, scholarly question. So, what is the license attached to it? What I would love to see is Apache 2.0, which is highly permissive, and I don't see it here. It has its own license. That's usually not great news, but in this case, it's better than I thought. Derivative works and commercial use are fine. On the other hand, it needs a bit of attribution and is a little stricter on patent grants. If Apache 2.0 were a 10 out of 10, this is a seven out of 10, in my opinion.

And we don't shy away from talking about limitations here. So, anything else? Oh, yes. If you're doing pure text reasoning or pure coding, I would probably look elsewhere. It is not the number one smartest open model. No. But, if you need multimodal input, like audio or video, processed super fast and super cheap, this is the one.

So, we now have free and open AI models that we can own and run ourselves, which is only going to get more and more important in the future. And since we have so many models, they are starting to specialize. They are becoming good in different directions. So, better models and more value for us fellow scholars, for free. Sign me up for that. Hugely appreciated. What a time to be alive.

Here you see me running the full DeepSeek AI model through Lambda GPU Cloud. 671 billion parameters, running super fast and super reliably. This is insane. I love it and I use it on a regular basis. Lambda provides you with powerful Nvidia GPUs to run your own chatbots and experiments. Seriously, try it out now at lambda.ai/papers or click the link in the description.
