DeepSeek V4 AI Beats Billion Dollar Systems…For Free
Introduction to DeepSeek V4 and its significance, including its 58-page research paper and large context window.
DeepSeek V4 unlocks massive context and compression wins for free, rivaling billion-dollar models at a fraction of the cost.
Summary
Two Minute Papers’ Dr. Károly Zsolnai-Fehér is thrilled about DeepSeek V4, a remarkable open and free AI model described in a dense 58-page paper. He highlights a 1 million token context window in open weights and notes that the Pro variant roughly matches many-billion-dollar frontier models from just a few months ago, while the much smaller Flash variant stays surprisingly competitive. The three magic pillars are compression, structure, and index, enabling dramatic memory savings in the KV-cache (about a 90% reduction) without sacrificing essential information. The talk-through includes 128-to-1 compression via Heavily Compressed Attention, table-of-contents-style summaries, and a top-5-page index for fast fact-finding. Real-world results show Pro recalling hidden facts better than Google Gemini 3.1 Pro, though limits appear as the context window is pushed. Coding is a strong suit for DeepSeek, and the system is unimodal (no images or audio) but can run code within the UI. Zsolnai-Fehér also covers caveats: techniques that even the creators do not fully understand, potential drift as the context window fills, and the need to temper expectations. He closes with a forest metaphor linking local detail to global context and teases an Engram approach that speeds up fact recall. Overall, the video balances hype with transparency and offers practical takeaways for researchers and builders alike.
Key Takeaways
- DeepSeek V4 introduces a 1 million token context window in open weights, enabling unprecedented long-context usage for free.
- Compression, structure, and index work together to cut KV-cache memory by about 90% (through Heavily Compressed Attention and related techniques).
- 128-to-1 compression pairs with table-of-contents summaries and an index to deliver global context with minimal local detail loss.
- The Pro version reportedly recalls hidden facts better than Google Gemini 3.1 Pro, though recall degrades as the context window limit is approached.
- The system is unimodal (no images or audio), and even its creators cannot fully explain why two of its training-stabilization techniques work.
- DeepSeek shines at coding tasks and can run code in-browser, though it struggles with more advanced algorithms.
- Pricing is dramatically cheaper than Claude: roughly 8–20x cheaper at list price and up to 30x cheaper with discounts, underscoring a broader shift in open AI economics.
Who Is This For?
Researchers and developers curious about scalable, open AI architectures; engineers evaluating long-context capabilities and memory-efficient inference. Also valuable for data scientists exploring new compression and indexing techniques in large language models.
Notable Quotes
"Finally, DeepSeek 4 is here, and it is described in a 58-page research paper."
—Opening line sets the scale and invites excitement about the new model and its official paper.
"A 1 million token context window? In open weights AI?"
—Spotlights the unprecedented context length available for free in open weights.
"Compression for the KV cache... They call it token-level compression."
—Defines the core technique that reduces memory for prompts and documents.
"They call it Heavily Compressed Attention."
—Describes the central compression approach for global context.
"The three layers of compression: summaries, structure, index."
—Outlines the three-part strategy to achieve substantial efficiency gains.
Questions This Video Answers
- How does DeepSeek's 1,000,000-token context window actually work in practice?
- What is Heavily Compressed Attention and how does it reduce memory usage?
- Can DeepSeek Pro outperform Google Gemini 3.1 Pro in real tasks?
- What are the limitations of DeepSeek V4 regarding multimodal input and reliability?
- How does Engram help DeepSeek recall facts faster without re-computing everything?
Tags: DeepSeek, KV-cache compression, Heavily Compressed Attention, Compressed Sparse Attention, Engram, 1,000,000-token context window, Google Gemini 3.1 Pro, DeepSeek Pro, open weights, coding capabilities
Full Transcript
Finally, DeepSeek 4 is here, and it is described in a 58-page research paper. And finally, nothing is held back here. I’ll be honest, I am feeling a little shy today, so I will do classic Two Minute Papers: microphone, but no camera. This is one of the biggest open and free AI models that we can use and…excuse me? Do you see that? What? A 1 million token context window? In open weights AI? If you ask it to inhale about 1,500 pages of dense documentation, it will do it. But that was the main feature in Google’s Gemini not so long ago.
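As a quick sanity check on that 1,500-page figure, here is a hedged back-of-the-envelope conversion; both conversion factors below are common rules of thumb, not numbers from the paper:

```python
# Rough sanity check on the "1,500 pages" figure for a 1,000,000-token
# context. Both conversion factors are rules of thumb, not from the paper.
tokens = 1_000_000
words = tokens * 0.75        # ~0.75 English words per token (rule of thumb)
pages = words / 500          # ~500 words per dense page (rule of thumb)
print(f"{pages:.0f} pages")  # -> 1500 pages
```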
I remember flipping out about it 2 years ago. And now, this for free? This sounds absurd! And when I look at the Pro model, you’ve got to be kidding me. Its results roughly match the many billion dollar frontier models from just a few months ago. Now it is gifted to us mortals. I am trying to emphasize the kind of gift that we are getting here and my words fail me. Is this heaven? What a time to be alive! And…wait. There is a Flash model that is much smaller, and is somewhat competitive with the Pro? I mean, what is happening?
And it doesn’t end there. This is just the start! As it keeps outputting more and more text, the new Pro model requires about 3 times less computing power than the previous one, and the lighter, Flash model requires about 10 times less computing power. What am I even reading? How is that even possible? Dear Fellow Scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér! Well, it does three things that are absolutely magical. One: Compression. Namely, compression for the KV cache - this is a scratch pad where you write your prompts and add your documents. Imagine reading a book.
You can find answers so much quicker if you compress each paragraph down into one sentence. You keep the book. But now you can search it faster. They call it token-level compression. But even these little summaries add up. What do we do? Well, two. We want to know the overall plot of the James Bond book? See if it’s one that we read already? Well, of course, we look at the table of contents. If each chapter has a short name, we can grasp the whole story from that tiny piece of information. The paper describes it as a 128-to-1 compression. They call it Heavily Compressed Attention. Now, the AI sees the whole story at a glance.
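To make the book analogy concrete, here is a minimal sketch of block-wise KV compression in the spirit of that 128-to-1 idea; the mean-pooling scheme and all names are illustrative assumptions, not DeepSeek’s published Heavily Compressed Attention:

```python
import numpy as np

def compress_kv_blockwise(keys, values, ratio=128):
    """Pool every `ratio` cached key/value vectors into one summary
    vector (mean pooling), like turning 128 tokens into a single
    table-of-contents entry. Illustrative sketch only."""
    n, d = keys.shape
    n_blocks = -(-n // ratio)  # ceiling division
    ck = np.zeros((n_blocks, d), dtype=keys.dtype)
    cv = np.zeros((n_blocks, d), dtype=values.dtype)
    for b in range(n_blocks):
        blk = slice(b * ratio, min((b + 1) * ratio, n))
        ck[b] = keys[blk].mean(axis=0)
        cv[b] = values[blk].mean(axis=0)
    return ck, cv

# 4,096 cached tokens shrink to 32 summary slots at 128-to-1.
keys = np.random.randn(4096, 64).astype(np.float32)
values = np.random.randn(4096, 64).astype(np.float32)
ck, cv = compress_kv_blockwise(keys, values)
print(keys.shape, "->", ck.shape)  # (4096, 64) -> (32, 64)
```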
But scientists at DeepSeek say this is still not enough compression. We need more! Three. Imagine that we want to search for a fight in the book. Table of contents helps a bit, but may not tell us exactly where the fights are. So, we look at an index. A list of words and phrases and their locations. Okay, so looking for a fight, and bingo! The index gives us the top 5 pages that have fights in them. This is genius, and they call it Compressed Sparse Attention. So, three layers of compression: summaries, structure, index. And suddenly, the three pieces click together.
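And the index step, sketched the same way: score each compressed block summary against the query and only attend inside the top 5 matching blocks. Again, a toy illustration under assumed names, not the published Compressed Sparse Attention:

```python
import numpy as np

def top_k_blocks(query, block_keys, k=5):
    """Toy index lookup: one dot-product score per compressed block
    summary, then return the k best-matching block indices. Full
    attention would then run only inside those blocks."""
    scores = block_keys @ query           # (n_blocks,) relevance scores
    return np.argsort(scores)[-k:][::-1]  # best matches first

query = np.random.randn(64).astype(np.float32)
block_keys = np.random.randn(32, 64).astype(np.float32)
print(top_k_blocks(query, block_keys))  # e.g. [17  3 28  9 12]
```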
These three reduce memory needs for the KV-cache by about 90%. I had to look twice. Down about 90%. Squashing down 100 words into the storage space of 10? And you are saying that we are not losing basically every piece of information? Yes. That is exactly what they are saying. But we are Fellow Scholars here, we look at proofs and experiments. Now just to make sure, this is KV-cache compression. You still need to load the whole model. So it does not mean that you can load the full DeepSeek Pro AI onto a toaster. Just want to make sure you know that, because media headlines and hype…you know.
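To see what a 90% reduction means at this scale, here is a hedged back-of-the-envelope calculation; the model dimensions below are invented for illustration, and only the ~90% figure comes from the paper:

```python
# Illustrative KV-cache memory math. The layer count and widths are
# made up; only the ~90% reduction figure is from the paper.
tokens, d_model, layers, bytes_fp16 = 1_000_000, 4096, 60, 2
full_kv = tokens * d_model * layers * 2 * bytes_fp16  # keys + values
compressed = full_kv * 0.10                           # ~90% smaller
print(f"{full_kv / 1e9:.0f} GB -> {compressed / 1e9:.0f} GB")
# -> 983 GB -> 98 GB
```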
And now…hold on to your papers Fellow Scholars, because this one delivers. They tested it by hiding 8 facts inside increasingly long contexts. So how good is it? Well, they report that the Pro version recalls them better than Gemini 3.1 Pro. That is Google’s flagship product. Wow. That is unbelievable. But note that like many other systems, it starts to degrade as you approach the limits of the context window. Then, models forget. Drift. Hallucinate. More text means less truth. Also, let’s look at its accuracy versus the previous DeepSeek, especially since this new version is heavily compressing things.
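That hide-the-facts test can be sketched like this; the prompt construction and scoring below are assumptions, not the paper’s actual benchmark setup:

```python
import random

def build_haystack(facts, filler_sentences, n_filler=10_000):
    """Scatter each 'needle' fact at a random position inside a long
    filler document, as in the recall test described above."""
    doc = random.choices(filler_sentences, k=n_filler)
    for fact in facts:
        doc.insert(random.randrange(len(doc) + 1), fact)
    return " ".join(doc)

def recall_score(model_output, facts):
    """Fraction of hidden facts the model reproduced verbatim."""
    return sum(fact in model_output for fact in facts) / len(facts)

facts = [f"Secret code #{i} is {random.randint(0, 9999)}." for i in range(8)]
haystack = build_haystack(facts, ["The forest was quiet that morning."])
# A real test would send `haystack` plus a recall question to the model
# and pass its answer to recall_score(answer, facts).
```

Now, back to that accuracy comparison.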
Ha. Look at that. This is crazy. It is also fantastic at coding. If you are a coder, great. If you are not a coder, well, you are now. It is so easy to ask it to generate JavaScript code that you can paste into a website and run, and in some cases, you can even run programs in the DeepSeek window with one click. I am a light transport researcher by trade, that is ray tracing if you will, so I had to try a little coding task related to that and…this is fantastic. It still failed to properly implement more advanced algorithms, so I am excited to see what the next version brings.
It is crushing benchmarks…and the competition. At the low-low price of… free. You can self-host it if you like, though the hardware is pricey; they also provide online access, and it is so cheap that I feel like numbers are losing their meaning. Soon, intelligence will get too cheap to meter. Depending on whether there is a discount or not, you can easily get pricing that is 30 times cheaper than Anthropic’s Claude. Even with no discount, things can get 8 to 20 times cheaper. Crazy. Now, let’s temper expectations a bit. Limitations. That’s what is missing from the media headlines. One, you can almost hear the 1,500 pages fluttering as it churns through them.
But wait. I did not also say 10 hours of audio, or a full feature-length movie. There is a reason for that. This system is unimodal. Not multimodal. No images or audio. It is blind and deaf, if you will. Two, this system is not fully understood, not even by its creators. They report two techniques that magically stabilize training, and they say that they are not quite sure why. I’ll note that this is something that happens to every researcher, and I have nothing but respect for the transparency. And three, we noted that if you are pushing against the limits of the context window, things break down a bit.
Be careful. Just want to make sure that you don’t get oversold on what is going on here, this still has limitations. Not small ones. But, overall…this is not a small step in open and free AI systems. Congratulations to the team and thank you so much. Now here’s what I think. I think this is a great release and a great paper and great life advice too. Why? Well, you can adapt so many of these ideas to your thinking. Imagine walking in the forest. You want to look at the amazing views in front of you. But then, you trip. Or you look mainly in front of your feet so you don’t trip.
You watch your step… or you enjoy the view. Not both. So what is the solution? You do both. Scan near, glance far. Step and look. Local detail, global context. It is the same as what DeepSeek does. Try it out next time you are on a walk, it’s weird. You’ll see. Let me know in the context, I mean comments how it went. They also use a technique called Engram - normally, an AI recalculates nearly every fact from scratch every time. Engram lets it just recall those facts instead. It’s not as easy as it sounds, we have a separate video on it, link in description.
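The recall-instead-of-recompute idea can be sketched as simple memoization; the real Engram mechanism lives inside the network and is covered in the linked video, so treat this dict-based version as an analogy only:

```python
class EngramLikeCache:
    """Analogy for Engram-style recall: compute a fact once, then
    answer from storage instead of recomputing it every time."""
    def __init__(self):
        self._facts = {}

    def recall_or_compute(self, key, compute_fn):
        if key not in self._facts:         # miss: pay the compute cost once
            self._facts[key] = compute_fn()
        return self._facts[key]            # hit: instant recall

cache = EngramLikeCache()
capital = cache.recall_or_compute("capital_of_france", lambda: "Paris")
capital = cache.recall_or_compute("capital_of_france", lambda: "Paris")  # cached
print(capital)  # Paris
```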
And we are still just scratching the surface here. Now this is a really advanced research paper, with all the good and the bad, not just the hype. Also, this video was not super fast, I rewrote this over and over again. Why is that? Because distilling complex ideas into simple explanations takes time. You get fewer views than others who publish something as quickly as possible. But that’s what I try to do here, and it is an honor to do this for such an incredibly smart and receptive audience like you Fellow Scholars. And thank you so much for appreciating it - this one really made my day.
Subscribe and hit the bell if you enjoyed this.