DeepSeek V4 AI Beats Billion Dollar Systems…For Free
Introduction to DeepSeek V4 and its significance, including its 58-page research paper and large context window.
DeepSeek V4 unlocks massive context and compression wins for free, rivaling billion-dollar models at a fraction of the cost.
Summary
Two Minute Papers’ Dr. Károly Zsolnai-Fehér is thrilled about DeepSeek V4, a remarkable open and free AI model described in a dense 58-page paper. He highlights a 1 million token context window in open weights and notes that the Pro variant roughly matches many-billion-dollar frontier models from just a few months ago, while the much smaller Flash variant stays surprisingly competitive. The three magic pillars are compression, structure, and index, enabling dramatic memory savings in the KV-cache (about a 90% reduction) without sacrificing essential information. The talk-through includes 128-to-1 compression via Heavily Compressed Attention, table-of-contents-style summaries, and a top-5-page index for fast fact-finding. Real-world results show Pro recalling hidden facts better than Google Gemini 3.1 Pro, though limits appear as the context window is pushed. Coding is a strong suit for DeepSeek, and the system is unimodal (no images or audio) but can run code within the UI. Zsolnai-Fehér also covers caveats: techniques that even the creators do not fully understand, potential drift as the context window fills, and the need to temper expectations. He closes with a forest metaphor linking local detail to global context and teases an Engram approach that speeds up fact recall. Overall, the video balances hype with transparency and offers practical takeaways for researchers and builders alike.
Key Takeaways
- DeepSeek V4 introduces a 1 million token context window in open weights, enabling unprecedented long-context usage for free.
- Compression, structure, and index work together to cut KV-cache memory by about 90% (through Heavily Compressed Attention and related techniques).
- 128-to-1 compression pairs with table-of-contents summaries and an index to deliver global context with minimal local detail loss.
- The Pro version reportedly recalls hidden facts better than Google Gemini 3.1 Pro, though recall degrades as the context window limit is approached.
- The system is unimodal (no images or audio), and even its creators cannot fully explain why two of its training-stabilization techniques work.
- DeepSeek shines at coding tasks and can run code in-browser, though it struggles with more advanced algorithms.
- Pricing is dramatically cheaper than Claude: roughly 8–20x cheaper at list price and up to 30x cheaper with discounts, underscoring a broader shift in open AI economics.
Who Is This For?
Researchers and developers curious about scalable, open AI architectures; engineers evaluating long-context capabilities and memory-efficient inference. Also valuable for data scientists exploring new compression and indexing techniques in large language models.
Notable Quotes
"Finally, DeepSeek 4 is here, and it is described in a 58-page research paper."
—Opening line sets the scale and invites excitement about the new model and its official paper.
"A 1 million token context window? In open weights AI?"
—Spotlights the unprecedented context length available for free in open weights.
"Compression for the KV cache... They call it token-level compression."
—Defines the core technique that reduces memory for prompts and documents.
"They call it Heavily Compressed Attention."
—Describes the central compression approach for global context.
"The three layers of compression: summaries, structure, index."
—Outlines the three-part strategy to achieve substantial efficiency gains.
Questions This Video Answers
- How does DeepSeek's 1,000,000-token context window actually work in practice?
- What is Heavily Compressed Attention and how does it reduce memory usage?
- Can DeepSeek Pro outperform Google Gemini 3.1 Pro in real tasks?
- What are the limitations of DeepSeek V4 regarding multimodal input and reliability?
- How does Engram help DeepSeek recall facts faster without re-computing everything?
Tags: DeepSeek, KV-cache compression, Heavily Compressed Attention, Compressed Sparse Attention, Engram, 1,000,000-token context window, Google Gemini 3.1 Pro, DeepSeek Pro, open weights, coding capabilities
Full Transcript
Finally, DeepSeek 4 is here, and it is described in a 58-page research paper. And finally, nothing is held back here. I’ll be honest, I am feeling a little shy today, so I will do classic Two Minute Papers: microphone, but no camera. This is one of the biggest open and free AI models that we can use and…excuse me? Do you see that? What? A 1 million token context window? In open weights AI? If you ask it to inhale about 1,500 pages of dense documentation, it will do it. But that was the main feature in Google’s Gemini not so long ago.
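As a quick sanity check on that 1,500-page figure, here is a hedged back-of-the-envelope conversion; both conversion factors below are common rules of thumb, not numbers from the paper:

```python
# Rough sanity check on the "1,500 pages" figure for a 1,000,000-token
# context. Both conversion factors are rules of thumb, not from the paper.
tokens = 1_000_000
words = tokens * 0.75        # ~0.75 English words per token (rule of thumb)
pages = words / 500          # ~500 words per dense page (rule of thumb)
print(f"{pages:.0f} pages")  # -> 1500 pages
```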
I remember flipping out about it 2 years ago. And now, this for free? This sounds absurd! And when I look at the Pro model, you’ve got to be kidding me. Its results roughly match the many billion dollar frontier models from just a few months ago. Now it is gifted to us mortals. I am trying to emphasize the kind of gift that we are getting here and my words fail me. Is this heaven? What a time to be alive! And…wait. There is a Flash model that is much smaller, and is somewhat competitive with the Pro? I mean, what is happening?
And it doesn’t end there. This is just the start! As it keeps outputting more and more text, the new Pro model requires about 3 times less computing power than the previous one, and the lighter, Flash model requires about 10 times less computing power. What am I even reading? How is that even possible? Dear Fellow Scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér! Well, it does three things that are absolutely magical. One: Compression. Namely, compression for the KV cache - this is a scratch pad where you write your prompts and add your documents. Imagine reading a book.
You can find answers so much quicker if you compress each paragraph down into one sentence. You keep the book. But now you can search it faster. They call it token-level compression. But even these little summaries add up. What do we do? Well, two. We want to know the overall plot of the James Bond book? See if it’s one that we read already? Well, of course, we look at the table of contents. If each chapter has a short name, we can grasp the whole story from that tiny piece of information. The paper describes it as a 128-to-1 compression. They call it Heavily Compressed Attention. Now, the AI sees the whole story at a glance.
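To make the book analogy concrete, here is a minimal sketch of block-wise KV compression in the spirit of that 128-to-1 idea; the mean-pooling scheme and all names are illustrative assumptions, not DeepSeek’s published Heavily Compressed Attention:

```python
import numpy as np

def compress_kv_blockwise(keys, values, ratio=128):
    """Pool every `ratio` cached key/value vectors into one summary
    vector (mean pooling), like turning 128 tokens into a single
    table-of-contents entry. Illustrative sketch only."""
    n, d = keys.shape
    n_blocks = -(-n // ratio)  # ceiling division
    ck = np.zeros((n_blocks, d), dtype=keys.dtype)
    cv = np.zeros((n_blocks, d), dtype=values.dtype)
    for b in range(n_blocks):
        blk = slice(b * ratio, min((b + 1) * ratio, n))
        ck[b] = keys[blk].mean(axis=0)
        cv[b] = values[blk].mean(axis=0)
    return ck, cv

# 4,096 cached tokens shrink to 32 summary slots at 128-to-1.
keys = np.random.randn(4096, 64).astype(np.float32)
values = np.random.randn(4096, 64).astype(np.float32)
ck, cv = compress_kv_blockwise(keys, values)
print(keys.shape, "->", ck.shape)  # (4096, 64) -> (32, 64)
```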
But scientists at DeepSeek say this is still not enough compression. We need more! Three. Imagine that we want to search for a fight in the book. Table of contents helps a bit, but may not tell us exactly where the fights are. So, we look at an index. A list of words and phrases and their locations. Okay, so looking for a fight, and bingo! The index gives us the top 5 pages that have fights in them. This is genius, and they call it Compressed Sparse Attention. So, three layers of compression: summaries, structure, index. And suddenly, the three pieces click together.
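And the index step, sketched the same way: score each compressed block summary against the query and only attend inside the top 5 matching blocks. Again, a toy illustration under assumed names, not the published Compressed Sparse Attention:

```python
import numpy as np

def top_k_blocks(query, block_keys, k=5):
    """Toy index lookup: one dot-product score per compressed block
    summary, then return the k best-matching block indices. Full
    attention would then run only inside those blocks."""
    scores = block_keys @ query           # (n_blocks,) relevance scores
    return np.argsort(scores)[-k:][::-1]  # best matches first

query = np.random.randn(64).astype(np.float32)
block_keys = np.random.randn(32, 64).astype(np.float32)
print(top_k_blocks(query, block_keys))  # e.g. [17  3 28  9 12]
```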
These three reduce memory needs for the KV-cache by about 90%. I had to look twice. Down about 90%. Squashing down 100 words into the storage space of 10? And you are saying that we are not losing basically every piece of information? Yes. That is exactly what they are saying. But we are Fellow Scholars here, we look at proofs and experiments. Now just to make sure, this is KV-cache compression. You still need to load the whole model. So it does not mean that you can load the full DeepSeek Pro AI onto a toaster. Just want to make sure you know that, because media headlines and hype…you know.
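To see what a 90% reduction means at this scale, here is a hedged back-of-the-envelope calculation; the model dimensions below are invented for illustration, and only the ~90% figure comes from the paper:

```python
# Illustrative KV-cache memory math. The layer count and widths are
# made up; only the ~90% reduction figure is from the paper.
tokens, d_model, layers, bytes_fp16 = 1_000_000, 4096, 60, 2
full_kv = tokens * d_model * layers * 2 * bytes_fp16  # keys + values
compressed = full_kv * 0.10                           # ~90% smaller
print(f"{full_kv / 1e9:.0f} GB -> {compressed / 1e9:.0f} GB")
# -> 983 GB -> 98 GB
```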
And now…hold on to your papers Fellow Scholars, because this one delivers. They tested it by hiding 8 facts inside increasingly long contexts. So how good is it? Well, they report that the Pro version recalls them better than Gemini 3.1 Pro. That is Google’s flagship product. Wow. That is unbelievable. But note that like many other systems, it starts to degrade as you approach the limits of the context window. Then, models forget. Drift. Hallucinate. More text means less truth. Also, let’s look at its accuracy versus the previous DeepSeek, especially since this new version is heavily compressing things.
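That hide-the-facts test can be sketched like this; the prompt construction and scoring below are assumptions, not the paper’s actual benchmark setup:

```python
import random

def build_haystack(facts, filler_sentences, n_filler=10_000):
    """Scatter each 'needle' fact at a random position inside a long
    filler document, as in the recall test described above."""
    doc = random.choices(filler_sentences, k=n_filler)
    for fact in facts:
        doc.insert(random.randrange(len(doc) + 1), fact)
    return " ".join(doc)

def recall_score(model_output, facts):
    """Fraction of hidden facts the model reproduced verbatim."""
    return sum(fact in model_output for fact in facts) / len(facts)

facts = [f"Secret code #{i} is {random.randint(0, 9999)}." for i in range(8)]
haystack = build_haystack(facts, ["The forest was quiet that morning."])
# A real test would send `haystack` plus a recall question to the model
# and pass its answer to recall_score(answer, facts).
```

Now, back to that accuracy comparison.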
Ha. Look at that. This is crazy. It is also fantastic at coding. If you are a coder, great. If you are not a coder, well, you are now. It is so easy to ask it to generate JavaScript code that you can paste into a website and run, and in some cases, you can even run programs in the DeepSeek window with one click. I am a light transport researcher by trade, that is ray tracing if you will, so I had to try a little coding task related to that and…this is fantastic. It still failed to properly implement more advanced algorithms, so I am excited to see what the next version brings.
It is crushing benchmarks…and the competition. At the low-low price of… free. You can self-host it if you like, though the hardware is pricey; they also provide online access, and it is so cheap that I feel like numbers are losing their meaning. Soon, intelligence will get too cheap to meter. Depending on whether there is a discount or not, you can easily get pricing that is 30 times cheaper than Anthropic’s Claude. Even with no discount, things can get 8 to 20 times cheaper. Crazy. Now, let’s temper expectations a bit. Limitations. That’s what is missing from the media headlines. One, you can almost hear the 1,500 pages fluttering as it churns through them.
But wait. I did not also say 10 hours of audio, or a full feature-length movie. There is a reason for that. This system is unimodal. Not multimodal. No images or audio. It is blind and deaf, if you will. Two, this system is not fully understood, not even by its creators. They report two techniques that magically stabilize training, and they say that they are not quite sure why. I’ll note that this is something that happens to every researcher, and I have nothing but respect for the transparency. And three, we noted that if you are pushing against the limits of the context window, things break down a bit.
Be careful. Just want to make sure that you don’t get oversold on what is going on here, this still has limitations. Not small ones. But, overall…this is not a small step in open and free AI systems. Congratulations to the team and thank you so much. Now here’s what I think. I think this is a great release and a great paper and great life advice too. Why? Well, you can adapt so many of these ideas to your thinking. Imagine walking in the forest. You want to look at the amazing views in front of you. But then, you trip. Or you look mainly in front of your feet so you don’t trip.
You watch your step… or you enjoy the view. Not both. So what is the solution? You do both. Scan near, glance far. Step and look. Local detail, global context. It is the same as what DeepSeek does. Try it out next time you are on a walk, it’s weird. You’ll see. Let me know in the context, I mean comments how it went. They also use a technique called Engram - normally, an AI recalculates nearly every fact from scratch every time. Engram lets it just recall those facts instead. It’s not as easy as it sounds, we have a separate video on it, link in description.
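The recall-instead-of-recompute idea can be sketched as simple memoization; the real Engram mechanism lives inside the network and is covered in the linked video, so treat this dict-based version as an analogy only:

```python
class EngramLikeCache:
    """Analogy for Engram-style recall: compute a fact once, then
    answer from storage instead of recomputing it every time."""
    def __init__(self):
        self._facts = {}

    def recall_or_compute(self, key, compute_fn):
        if key not in self._facts:         # miss: pay the compute cost once
            self._facts[key] = compute_fn()
        return self._facts[key]            # hit: instant recall

cache = EngramLikeCache()
capital = cache.recall_or_compute("capital_of_france", lambda: "Paris")
capital = cache.recall_or_compute("capital_of_france", lambda: "Paris")  # cached
print(capital)  # Paris
```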
And we are still just scratching the surface here. Now this is a really advanced research paper, with all the good and the bad, not just the hype. Also, this video was not super fast, I rewrote this over and over again. Why is that? Because distilling complex ideas into simple explanations takes time. You get fewer views than others who publish something as quickly as possible. But that’s what I try to do here, and it is an honor to do this for such an incredibly smart and receptive audience like you Fellow Scholars. And thank you so much for appreciating it - this one really made my day.
Subscribe and hit the bell if you enjoyed this.