DeepSeek’s New AI Is A Game Changer

Two Minute Papers | 00:07:43 | May 22, 2026


An introduction to the idea of thinking with visual primitives and why pointing at objects is valuable for AI reasoning.

DeepSeek cuts visual tokens by 90% yet matches or beats frontier models, making AI vision faster, cheaper, and more interpretable.

Summary

Two Minute Papers’ Dr. Károly Zsolnai-Fehér dives into DeepSeek’s new approach, which adds human-like visual grounding to the DeepSeek AI system without inflating token budgets. The core idea is thinking with visual primitives: the model points at parts of an image while it reasons, rather than counting or describing the scene in words. This makes reasoning faster and cheaper and opens the door to topological thinking, such as tracing a path through a maze or identifying that a crown connects to an octopus. Zsolnai-Fehér underscores that the method reaches strong accuracy with about 90% fewer visual tokens and matches or beats almost all of the billion-dollar frontier models it is compared against. The research is free and open, so the technique could potentially be added to existing models, including free ones. He also explains the policy distillation framework: a student model learns from multiple expert teachers, absorbing their diverse capabilities into one versatile model. The video acknowledges limitations: a word cue is needed to trigger point-based thinking, very fine structures (e.g., blades of grass or strands of hair) suffer from the lack of high resolution, and topological reasoning generalizes imperfectly. Finally, in the sponsor segment, he shows the full 671-billion-parameter DeepSeek model running on Lambda GPU Cloud and emphasizes the value of free, open-weights models for owning your own AI systems. The hopeful takeaway is that more open, well-documented research like this can push toward systems we can actually understand and trust.
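To make the "point, don't describe" idea concrete, here is a minimal sketch in Python. It is not DeepSeek's implementation: the `Point` type, the `count_by_pointing` helper, the `min_distance` threshold, and the coordinates are all hypothetical, invented only to show how a count can fall out of a list of pointed-at locations that also serves as a visible trace of the reasoning.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """Hypothetical pointing primitive: pixel coordinates plus a label."""
    x: int
    y: int
    label: str


def count_by_pointing(points, label, min_distance=10):
    """Count objects by counting distinct points instead of parsing prose.

    Points with the requested label that lie closer than min_distance
    pixels to an already kept point are treated as the same object.
    Returns the count and the kept points, which act as visual evidence.
    """
    kept = []
    for p in points:
        if p.label != label:
            continue
        is_duplicate = any(
            (p.x - q.x) ** 2 + (p.y - q.y) ** 2 < min_distance ** 2
            for q in kept
        )
        if not is_duplicate:
            kept.append(p)
    return len(kept), kept


# Made-up points standing in for what a pointing model might emit.
trace = [
    Point(40, 60, "person"),
    Point(120, 62, "person"),
    Point(121, 63, "person"),  # near-duplicate of the previous point
    Point(200, 150, "ball"),
]

count, evidence = count_by_pointing(trace, "person")
print(count)     # 2
print(evidence)  # the two kept points, which can be drawn on the image
```

The design point is that the answer and its justification are the same object: every counted item corresponds to a coordinate that can be drawn on the image and checked.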

Key Takeaways

DeepSeek requires about 90% fewer visual tokens than typical frontier models, while maintaining or improving performance.
The approach uses visual primitives, pointing at regions of an image, to enable faster and more accurate reasoning than describing scenes in words.
Policy distillation combines multiple expert teachers into a single student model, enabling versatile visual thinking across boxes, mazes, and relational queries.
Benchmark results are averaged over seven external benchmarks with in-house tests excluded, reducing the risk of overfitting to the researchers’ own tests.
The method enables topological reasoning, such as tracing a path in a maze and identifying connections between objects, with traceable intermediate thoughts.
The research is free and open, and the technique can potentially be added to existing models, including free ones, rather than being tied to a single proprietary architecture.
Limitations include reliance on word cues to trigger “point-based” thinking, difficulty with tasks that need very high resolution, and weaker generalization to wholly new scenarios.
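The teacher-student idea in the takeaways can be sketched with a toy example. This is an illustrative assumption, not the paper's actual objective: the two teachers, their three made-up "actions", the per-task logit table standing in for the student, and the KL-divergence loss are all hypothetical choices made to show one student being pulled toward several experts.

```python
import math


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher, student):
    """KL(teacher || student) for two discrete distributions."""
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)


# Hypothetical teachers: each is an expert on one task and returns a
# distribution over three made-up actions for that task.
TEACHERS = {
    "boxes": [0.8, 0.1, 0.1],
    "mazes": [0.1, 0.1, 0.8],
}


def distill(steps=200, lr=0.5):
    """Train one student against every teacher in turn.

    The student is a toy table of logits indexed by task, standing in
    for a single network that receives the task as input.
    """
    student = {task: [0.0, 0.0, 0.0] for task in TEACHERS}
    for _ in range(steps):
        for task, teacher_probs in TEACHERS.items():
            student_probs = softmax(student[task])
            # Gradient of KL(teacher || softmax(z)) w.r.t. z is (student - teacher).
            student[task] = [
                z - lr * (s - t)
                for z, s, t in zip(student[task], student_probs, teacher_probs)
            ]
    return student


student = distill()
losses = {
    task: kl_divergence(TEACHERS[task], softmax(student[task]))
    for task in TEACHERS
}
print(losses)  # both values end up close to zero
```

After training, the single student reproduces the behavior of both experts on their respective tasks, which is the property the video describes as wanting "one AI that can do all of these things".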

Who Is This For?

AI researchers and developers interested in open, interpretable vision models and distillation methods; enthusiasts following the latest in open-weight AI and benchmark integrity.

Notable Quotes

"This makes it more accurate and it also makes it faster. In a world where hardware and tokens cost a fortune, it is fantastic to have something that gives us results faster and cheaper."

—Highlights the practical impact of reducing visual tokens and improving efficiency.

"What we want is one AI that can do all of these things. And that is where this comes into play."

—Explains the policy distillation idea of combining expert skills into a single student.

"This is free and open research. So, this technique can potentially be added to many existing models, including free ones."

—Emphasizes openness and broad applicability beyond a single project.

"DeepSeek just cut down those visual tokens by 90% and still beat frontier models. Less is more."

—Encapsulates the core efficiency claim of the work.

"You can trace back the whole thought process visually."

—Illustrates the interpretable reasoning enabled by visual prompting.

Questions This Video Answers

How does DeepSeek reduce visual tokens and still match frontier model performance?
What is policy distillation and how does it differ from traditional knowledge distillation?
Can visual prompting provide interpretable AI reasoning, and what are its limitations?
Are there open-weight implementations of DeepSeek or similar architectures available now?
What benchmarks were used to evaluate DeepSeek, and why is external benchmarking important?
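On the last question, the effect of excluding in-house benchmarks from an average can be shown with a small sketch. The benchmark names and scores below are invented placeholders, not results from the paper or the video, and `external_average` is a hypothetical helper.

```python
from statistics import mean

# Hypothetical benchmark rows, invented for illustration only.
results = [
    {"name": "external_a", "in_house": False, "score": 71.0},
    {"name": "external_b", "in_house": False, "score": 64.0},
    {"name": "in_house_x", "in_house": True, "score": 95.0},
]


def external_average(rows):
    """Average only the benchmarks the authors did not design themselves."""
    external = [r["score"] for r in rows if not r["in_house"]]
    if not external:
        raise ValueError("no external benchmarks to average")
    return mean(external)


avg_all = mean(r["score"] for r in results)
avg_external = external_average(results)

print(round(avg_all, 2))  # 76.67, inflated by the in-house benchmark
print(avg_external)       # 67.5, external benchmarks only
```

In this made-up case a single self-designed benchmark lifts the headline average by about nine points, which is why reporting the external-only average is the more conservative choice.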

DeepSeek, visual tokens, policy distillation, teacher-student learning, topological reasoning, open weights, visual prompting, intermediate thoughts, benchmark integrity, Lambda GPU Cloud

Full Transcript

Hmm, why does this DeepSeek work exist? I mean, it adds vision capabilities to the DeepSeek AI system, but that's not new. A lot of other AI systems have vision capabilities. You just drop an image here and it works. Even video and even for open models. So, why do we need this paper? Well, they did something incredible here and it is an absolute game changer. Why? You see, if you ask a previous technique to count the number of people in this photo, it will think something like this. Okay, there are people on the upper left and a bunch of stripy guys in two rows. That is kind of three rows. Some of them are standing, some of them are sitting. Ah, it's just so confusing to just count them up using only words. Two problems with this one. One, this is prone to error. Two, you have to think a lot. Just describing stuff. Why? What would we, humans, do? Of course, we would use our finger and would point at the image. One, two, three, and so on. Done. Don't describe images like a poet. Point like a human. Now, that is exactly what this new technique does. It allows an AI system to point at things while thinking and it is absolutely brilliant. This makes it more accurate and it also makes it faster. In a world where hardware and tokens cost a fortune, it is fantastic to have something that gives us results faster and cheaper. But, it turns out thinking with visual primitives has even more advantages. It can also do topological reasoning. For instance, if you give it a maze with a start and end point, you not only get a correct answer to your questions, but you can also trace back the whole thought process visually. I love that. Also, here you can ask where the crown connects and look. To the octopus. Yeah, it answers correctly, but you can also see how it came to that conclusion. Now, make no mistake. These are simple examples. I'll show you in a moment if it is as good as these billion-dollar frontier models. 
Also, if something goes wrong, this will make it easier to find mistakes and fix them to create an even better model. This puts us one step closer to AI systems we can actually understand that do not just give us a soup of numbers. So good. So, how good is it? Well, hold on to your papers, fellow scholars, and I dropped my papers here. Look, it needs about 90% fewer visual tokens than most frontier models. Now, wait, wait, wait. It doesn't matter how little you think if you just say three as an answer without thinking. Thinking time doesn't matter if it is incorrect. So, how accurate is it? Are you kidding me? This free system matches or beats almost everything. And once again, we are talking about this, which is free, going up against billion-dollar systems here. Wow. Now, we are fellow scholars here, so at this point we ask, are these results real? You know, benchmarks are being gamed left and right. Now, here is what many people missed. Average over seven benchmarks, but in-house benchmarks excluded. That is the key. They did not rig their own benchmarks. You know why? Well, everyone loves it because it's one of the oldest tricks in the book. If you are not performing well, just create a new benchmark that fits you. Let's make a you-ness benchmark. You will always be world first in being you. And this is not the case here. Amazing. This is free and open research. So, this technique can potentially be added to many existing models, including free ones. This paper does not have a model attached that I know of. It describes the concept of how to do it in detail. It's a blueprint, if you will. More intelligence for all of us for free. The world needs more papers like this. Love it. But, this all sounds like magic. How did they do this? Well, look, this is their own policy distillation objective. We need exactly this. You see, normally, we have a bunch of expert AI models. Now, at the risk of simplifying things, imagine that one of these guys is great at boxes. 
Nobody does boxes better than this guy. The other one is great at tracing mazes with points. But, that's not what we want. What we want is one AI that can do all of these things. And that is where this comes into play. We train a student model that learns from all of these teachers. It says what it would try to do, then the teachers say, "Okay, here's what I would have done." Do this enough and the student will be pretty good at all of these different kinds of visual thinking. This is why they used the name distilling the knowledge of a bunch of expert teachers into a student. So, where does this put us? Okay, so here's what I think. Dear fellow scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér. You know, we always thought that we would make AI systems smarter by giving it higher resolution images to train on. More pixels, more smarts. It turns out not true. Sometimes, that's not what we need at all. DeepSeek just cut down those visual tokens by 90% and still beat frontier models. Less is more. Now, is this perfect? All problems solved? No. Limitations. One, the AI does not automatically do this kind of pointy thinking. It needs a word as a cue for this kind of thinking. Two, bounding boxes are nice for people, but if you are counting blades of grass or strands of hair, now, in this case, not having those in very high resolution is a problem. [laughter] Yep, once again, the Two Minute Papers special, thin structures. Every time, man. It's so painful. And three, this kind of topological reasoning does not generalize as well as we'd like. It might not be as robust when you show it something completely new. So, careful with the misleading media headlines, careful with the hype everywhere. There is still plenty to improve here. But, I feel that this might be a breakthrough. And that makes it maybe the third one this month in AI research. What a time to be alive. 
Also, with large AI companies going to IPO, they are about to become ventures that look to maximize their profits. More money needed every quarter. So, it's going to become more and more crucial to own your own AI systems with free open weights models. And this one makes them better. Love it. Here you see me running the full DeepSeek AI model through Lambda GPU Cloud. 671 billion parameters running super fast and super reliably. This is insane. I love it and I use it on a regular basis. Lambda provides you with powerful Nvidia GPUs to run your own chatbots and experiments. Seriously, try it out now at lambda.ai/papers or click the link in the description.