GPT 5.5 Arrives, DeepSeek V4 Drops, and the Compute War Intensifies

AI Explained | 00:25:19 | Apr 25, 2026
Chapters: 7
The chapter surveys two major AI releases, GPT 5.5 from OpenAI and DeepSeek V4, and promises a data-driven briefing with highlights from lab-leader interviews, focusing on how these models might shape AI use and work for a billion people.

GPT 5.5 pushes OpenAI to stay ahead of Anthropic, while DeepSeek V4 tightens the compute race and the long-context play, all amid looming compute-scarcity headlines.

Summary

The host’s deep dive covers a flurry of blockbuster updates: GPT 5.5 from OpenAI aiming to hold the AI crown against competitors like Mythos and Gemini, and DeepSeek V4’s open weights, ultra-long context, and cost-efficient performance. The host tests benchmarks (where API access is limited), discusses health benchmarks and hallucinations, and notes OpenAI’s guarded stance on recursive self-improvement. He compares model performance per dollar and foregrounds DeepSeek V4 Pro’s 61.2% Simplebench result and its 1 million token context window. The discussion widens to compute scarcity echoed by OpenAI and Anthropic leaders, with Greg Brockman and Sam Altman cited discussing looming compute limits. Real-world implications for white-collar work, domain-specific training (English vs non-English), and the broader “compute war” frame are tied together with a look at tool ecosystems (Image 2, vibe coding, and Spud GPT-5.5 to test multi-modal prompts). The host closes by contemplating whether current progress translates into universal generalization or domain-sharded gains, and what that means for autonomy, safety, and the future of AI-enabled productivity.

Key Takeaways

  • GPT 5.5 shows a significant upgrade yet remains API-limited for most users, with benchmark scores largely self-reported by OpenAI.
  • DeepSeek V4 Pro achieves about 61.2% on the host’s Simplebench and features a 1 million token context window, driven by long-document and science-focused training data.
  • In performance-per-dollar terms, GPT 5.5 often wins on cost efficiency against Opus 4.7, while DeepSeek V4 demonstrates strong long-context capability at a fraction of the price.
  • OpenAI’s Sam Altman discusses cybersecurity and governance framing, while Greg Brockman highlights looming compute scarcity as a strategic constraint for the industry.
  • Benchmark results are domain-sensitive and inconsistent across metrics (hallucinations versus factual accuracy), underscoring that there is no single universal best model across all tasks.
  • DeepSeek V4’s approach—specialized, long-context data—suggests language and data-domain specialization can outpace monolithic language models in non-English contexts.
  • The broader “compute war” narrative connects model capability to hardware scale, with implications for accessibility, pricing, and the pace of innovation.

Who Is This For?

Essential viewing for AI practitioners evaluating multi-model ecosystems, benchmark enthusiasts tracking cost-to-performance in next-gen LLMs, and developers deciding which model to deploy for long-context or non-English workflows.

Notable Quotes

"GPT 5.5 is OpenAI's all-out attempt to keep the AI crown from slipping to Anthropic."
Host frames GPT 5.5 as a strategic move to outpace Anthropic.
"DeepSeek V4 Pro got 61.2% in my own private benchmark, Simplebench."
Highlighting DeepSeek V4 Pro’s competitive edge on a personal benchmark.
"We are headed to a world of compute scarcity."
Greg Brockman on industry-wide compute bottlenecks.
"As long as there was enough training data, it should generalize across languages."
Host on DeepSeek’s multilingual and data-domain generalization implications.
"Performance per dollar across all domains may end up being the ultimate benchmark."
Framing cost-efficiency as a core competitive metric across models.

Questions This Video Answers

  • How does GPT 5.5 compare to Mythos and Opus 4.7 in benchmarks and cost per performance?
  • What makes DeepSeek V4 Pro's 1 million token context window notable and how does it compare to GPT-5.5?
  • What is the significance of compute scarcity for the AI industry and how might it affect model availability?
  • Why can domain-specific training data (like DeepSeek’s Chinese professional tasks) outperform general English benchmarks in some cases?
  • What is the Vibe Code Benchmark and how do GPT-5.5, Opus 4.7, and DeepSeek V4 perform on it?
Tags: GPT 5.5, OpenAI, Mythos, Gemini 3.1 Pro, Opus 4.7, DeepSeek V4 Pro, DeepSeek V4, compute scarcity, HealthBench, Vibe Code Bench 1.1 (Spud GPT)
Full Transcript
In the last 20 hours in AI, we have gotten two new models that could influence how a billion people use AI. In my mind, GPT 5.5 is OpenAI's all-out attempt to keep the AI crown from slipping to Anthropic, while today's DeepSeek V4 is China's answer to both. And in the swirl of headlines you are seeing today, you might have missed up to 50 data points that could affect how you work and how you use AI. So, I'm going to try and give you all of them, plus select highlights from hours' worth of interviews that I've watched with lab leaders. You probably know me well enough to know that I've read the papers, too. So, we'll hear about OpenAI's updated estimate on the chances of recursive self-improvement. It was quite surprising. GPT 5.5's slight preference for men, which I'll explain, Mythos comparisons, and why the OpenAI president laughed at Anthropic's compute situation. For reference, I'll start with a focus on GPT 5.5, then do DeepSeek, and end by zooming out for the juiciest part of the overview. For the brand new GPT 5.5, I did get early access, but there's no API access at the moment for anyone. So almost all of the benchmark scores you're going to hear about are self-reported from OpenAI. I will say, for me, testing out GPT 5.5 for days in the run-up to this release, it will become my daily driver, just about nudging out Opus 4.7. There are lots of caveats to that, though, as you can see with GPT 5.5 underperforming both Opus 4.7 and, of course, Mythos Preview on agentic coding, SWE-bench Pro. Notice GPT 5.5 underperforms Opus 4.7 by around 6%, but Mythos Preview by almost 20%. What you might not notice is that there's no entry for SWE-bench Verified. And so you might say, well, Philip, who cares about SWE-bench Pro then? What does it even mean, that one row? Well, to OpenAI it seemingly means a lot, because, as Neil Chowry points out, in February OpenAI told us to switch to SWE-bench Pro. That's the one it underperforms in, because it's less contaminated than SWE-bench Verified.
According to the OpenAI blog post: we recommend SWE-bench Pro. You are probably going to go through a bit of a roller coaster in this video, because if you look one row down at Agentic Terminal Coding, you'll see GPT 5.5 way ahead. Its 82.7% score beats out Mythos Preview's 82.0%. And so, if you had just been feeling down about GPT 5.5's coding ability, there's another reminder I'll bring, which is that we've been talking about GPT 5.5, not even GPT 5.5 Pro, which is coming to the API very soon. So while it's tempting to say that Mythos is absolutely mogging GPT 5.5 (and let me know if I used that word correctly), we don't actually have an apples-to-apples comparison. The mandate of heaven is very much up for grabs. Okay, so now you're a bit confused. Let's look further. Let's look at Humanity's Last Exam, which is more of an arcane knowledge benchmark: obscure academic domains combined with advanced reasoning. Well, there GPT 5.5 is beaten by both Opus 4.7 and Mythos, as well as Gemini 3.1 Pro, by the way, without tools. But there's a caveat even to this, because that involves a lot of general knowledge. It could well be that OpenAI are at least slightly de-emphasizing such general knowledge to make the model more efficient and cheaper. One of the top researchers at OpenAI who I've been quoting for years, Noam Brown, said, "What matters is intelligence per token or per dollar." After all, if you spend more, you do go up in benchmark score. Or, in fancier language, intelligence is a function of inference compute. That being the case, if GPT 5.5 can work well in the domains you care about and use fewer tokens to get the answers you care about, then you may just frankly not care about Humanity's Last Exam. In one famous test of pattern recognition, ARC-AGI 2, you'll see that GPT 5.5 on all settings beats out the Claude Opus series, 4.6 and 4.7, not only achieving higher scores, but for much lower cost.
Just one benchmark, of course, but we have to increasingly focus on performance per dollar these days. And on that front, DeepSeek will definitely want a word, because, holy moly, I'll get to them later, but DeepSeek V4 Pro got 61.2% in my own private benchmark, Simplebench. It asks spatio-temporal questions that you need common sense to see through the tricks of, but to get within 1 or 2% of Opus 4.7? I wasn't expecting that at an absolute fraction of the cost. By the way, again, no GPT 5.5 score, because no API access. What about those frantic headlines about Mythos being able to hack into virtually any system? I think a lot of that was overblown, and some of that could be achieved by much smaller models. But nevertheless, skipping to page 33 of the system card, you can see that one external institute, the UK AI Security Institute, judges that GPT 5.5 is the strongest-performing model overall on their narrow cyber tasks, albeit within the margin of error. This section was notably vague, with a headline score implying that 5.5 was better than Mythos, i.e. better than any other model they've tested. But then on their end-to-end cyber range task, 5.5 was able to complete a task in full on one out of 10 attempts: a 32-step corporate network attack simulation, one that would take an expert 20 hours. Mythos, it seems, though, could do it in three out of 10 attempts. As you can see, direct comparison is hard, but 5.5 does at least seem to be in the ballpark of Mythos's capabilities. In other words, small-scale enterprise networks with a weak security posture and a lack of defensive tooling could be vulnerable to autonomous end-to-end cyber attack capability via 5.5. Of course, there are additional safeguards put on top of 5.5 to prevent that happening. But given that the world's top bankers and CEOs have gotten together to discuss the risk of Mythos, releasing a comparable model without nearly as much cybersecurity fanfare does indicate a rather profound difference of perspective.
Here's Sam Altman on the Mythos marketing: "There are people in the world who for a long time have wanted to keep AI in the hands of a smaller group of people. You can justify that in a lot of different ways, and some of it's real. Like, there are going to be legitimate safety concerns. But if what you want is, we need control of AI, just us, cuz we're the trustworthy people, I think the fear-based marketing is probably the most effective way to justify that. That doesn't mean it's not legitimate in some cases. But it is, you know, clearly incredible marketing to say: we have built a bomb, we're about to drop it on your head, we will sell you a bomb shelter for $100 million. You need it to run across all your stuff, but only if we pick you as a customer." And well, there's another way that we could compare GPT 5.5 with Mythos, and that's to look at hallucinations. Ask the models a bunch of obscure knowledge questions and see how many they get right, and, just as importantly, how many of the ones they get wrong they admit to not knowing. The headline score looks amazing. GPT 5.5 gets the most right: 57%, versus Opus 4.6 and 4.7's 46%. And I know Mythos isn't on there, but I'll get to that. However, as we've learned on this channel, headlines can be misleading. Look at the hallucination rate. That's the questions it gets wrong and should have said "I don't know" to, instead of hallucinating, fabricating an answer. Whoa there. GPT 5.5 at 86%. Hallucinating on 86% of the questions it got wrong rather than saying "I don't know". Opus 4.7 on max: just 36%. Okay then. Well, let's focus on the net rate, the overall rate, factoring in both correct and incorrect. We have a slight win for Opus 4.7 over GPT 5.5: 26 versus 20. But here's where Mythos comes in. Because, buried fairly deep in the Opus 4.7 system card on page 126, we get a comparison between Opus 4.6, Opus 4.7, and Mythos. We can then compare Mythos with GPT 5.5 on extra high.
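As a quick aside, those "net rate" numbers (26 versus 20) can be roughly reproduced with back-of-envelope arithmetic. The formula here is my own assumption about how the metric combines the two quoted figures, not something stated in the video: net score equals percent correct minus the share of all questions that were both answered wrong and hallucinated.

```python
def net_rate(correct, halluc_on_wrong):
    """Net score as a percentage (my reconstruction of the metric).

    correct         -- fraction of questions answered correctly
    halluc_on_wrong -- of the wrong answers, the fraction hallucinated
                       rather than answered with "I don't know"
    """
    wrong = 1.0 - correct
    return 100 * (correct - wrong * halluc_on_wrong)

# Figures quoted in the video:
gpt_55 = net_rate(0.57, 0.86)   # 57% correct, hallucinates 86% of its misses
opus_47 = net_rate(0.46, 0.36)  # 46% correct, hallucinates 36% of its misses
print(f"GPT 5.5: {gpt_55:.1f}, Opus 4.7: {opus_47:.1f}")
```

This lands at roughly 20.0 and 26.6, within a point of the quoted 20 and 26, which suggests the net rate is approximately of this shape, though the exact definition and rounding are guesses.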
Notice how Mythos gets way more correct: 71%. Still hallucinating, of course, 21.7% of the time, but on the face of it, not quite as bad as Opus 4.7, and thereby definitely not as bad as GPT 5.5. Maybe you just care about spreadsheets. Well, one external benchmark has GPT 5.5 outperforming Opus 4.7 in both performance and latency. Forget that. We just care about making money. Well, let's check out Vending-Bench. That's where the models have to run a simulated business, given only the instruction to make as much money as you can. Sam Altman, in his drunk phase, said, "Don't retweet this. Don't retweet this," but eventually did so, with the tweet in question being GPT 5.5 mogging Opus 4.7. Another detail: Opus 4.7 showed similar behavior to Opus 4.6, lying to suppliers and stiffing customers on refunds. GPT 5.5's tactics were clean, and it still won. Now, this is one benchmark on one setting. When in a multiplayer setting, it was a slightly different result, but it still didn't show any of that deception or power-seeking we saw from Opus and Mythos. Not what you might initially guess the results to be from such a benchmark. If you're thinking 5.5 is just a colossal upgrade, then, first of all, it's for paid users at the moment; it doesn't seem to be on the free tier. How about this comparison, then, a detail that few will mention, on HealthBench, relevant obviously if you are a clinician or just want a clinical diagnosis for yourself. We have GPT 5.5 outperforming GPT 5.4, roughly 52% versus 48% correct. I pick on this row in particular because even there there's a caveat. Did you know that just the other day OpenAI released GPT 5.4 for clinicians? You have to apply for access, but if you get it, on that subset of the benchmark, HealthBench Professional, we see that version of 5.4 getting 59%, beating both; physician-written responses were at around 44%. Now let's say you just don't care at all about healthcare.
This result would still be interesting, because if there was a singular axis for model IQ, results like this shouldn't really happen. 5.4 for clinicians shouldn't outperform GPT 5.5. But the reality, as we're seeing in these conflicting benchmark scores, is much more jagged. The models aren't proving to be universal generalizers. They are fairly reliant on reinforcement learning environments for particular domains. I'll unpack that a bit more in the overview section; I just thought it was worth pointing out. Now I will just leave you with a thought: what does AGI, or ASI, artificial superintelligence, mean if there's such disparity between domains? And here's another interesting test you might have missed. They asked the models to follow a certain set of instructions for their thoughts, for their chains of thought. Could the model, for example, force its thoughts, its chain of thought, to only use lowercase letters? Can you control your thoughts? I can't. Not really. And nor can GPT 5.5. Across 100,000 tokens, it could do so far less than one in a thousand times. Worse, by the way, than GPT 5.4 Thinking, but OpenAI took that as good news. It's less able to reshape its thoughts in ways that could reduce monitorability. Because it can't control its thoughts, we have increased confidence in the reliability of our monitoring systems. I can almost imagine an alien scanning your thoughts and being like, "They're so all over the place. There's no way that this person is faking it." Oh yeah, what about that preferring-men thing? Well, it's when the user introduces themselves with a stereotypically male name ("Hi, I'm Brian" versus "Hi, I'm Ashley"). Mind you, I had a male friend called Ashley, but never mind. What was the overall rate of harmful outputs when given 600 prompts basically baiting the model to be biased? Well, GPT 5.5 does worse than previous models. Many of you will be waiting to hear about recursive self-improvement, but on this OpenAI are pretty dismissive.
GPT 5.5 does not have a plausible chance of reaching a high threshold for self-improvement. This is despite them repeatedly emphasizing that it had hit the high threshold for cybersecurity, and that it was almost borderline critical. On bio threat, it was a notable step up even from GPT 5.4 Thinking. Same thing with troubleshooting virology. So, what was the issue with recursive self-improvement? Well, part of the answer came from their internal research debugging evaluation. Could GPT 5.5 debug 41 real bugs from internal research experiments at OpenAI, where the original solutions took hours or days to debug? Yes, it can do better, but it's within the margin of error between GPT 5.4 and 5.5, both around 50%. Even more interestingly, and I've seen no commentary on this, what if you convert this to a time horizon, à la METR? Well, even interpreted very generously, where passing corresponds to providing any assistance that would unblock the user, including partial explanations of root causes or fixes, we get this result: very similar performance between GPT 5.3, 5.4, and 5.5, with 5.5 in the middle actually, and a roughly one-quarter success rate even at an 8-hour interval. For one-day-long tasks, more like around 6%. That's maybe why OpenAI ended the report by saying, "Don't worry, guys, about GPT 5.5 self-exfiltrating or escaping or even sabotaging internal research. It's just too limited in coherence and goal sustenance during internal usage." No point testing the propensity for a model to try; it wouldn't succeed anyway. Again, none of this is to say that 5.5 won't have an effect on cybersecurity. Sometimes when you look at external benchmarks, the delta, the gap between 5.5 and 5.4, is bigger than on the more famous benchmarks.
Take the frontier AI security lab Irregular, where they found that not only did GPT 5.5 way outperform 5.4 across their suite, for example having an average success rate of 26% versus 9% on certain vulnerability and cybersecurity benchmarks, but the API cost was also significantly lower for 5.5. That's that token-efficiency point I mentioned earlier. Performance per dollar across all domains may end up being the ultimate benchmark. Which brings us to DeepSeek V4. It's open weights, so you could use it locally. Notably, that doesn't make it fully open source, though; we don't know the training data that went into it. But the first big headline for me is that it supports a context length of 1 million tokens. Call it three-quarters of a million words. That's pretty remarkable for such a performant model. The Pro version has 1.6 trillion parameters, comparable with the original GPT-4, but through the mixture-of-experts architecture just 49 billion of those are activated. Eight more quick highlights from a very dense paper that I am sure I will return to. It came out just around six hours before recording, so forgive the brevity. The first is a summation of its benchmark performance, and I agree with this: on max settings, DeepSeek V4 Pro shows superior performance relative to GPT 5.2, a relatively recent model, as well as Gemini 3 Pro. Not on every benchmark, but take reasoning and coding. DeepSeek themselves admit that it still falls marginally short of GPT 5.4 and Gemini 3.1 Pro, though, with their estimate being that they're behind the frontier by 3 to 6 months. It massively depends on token usage, of course, but think ballpark one-tenth of the cost. What were DeepSeek gunning for with V4 being better at long context? For their training data, they placed a particular emphasis on long-document data curation: finding good long documents, prioritizing scientific papers, technical reports, and other materials that reflect unique academic values. What about white-collar work?
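Before that, a quick sanity check on the DeepSeek V4 Pro numbers quoted above. The only assumption beyond the video's own figures is the common rule of thumb of roughly 0.75 English words per token, which is where "three-quarters of a million words" comes from:

```python
# Sanity-checking the DeepSeek V4 Pro figures quoted in the video.
total_params = 1.6e12        # 1.6 trillion parameters (quoted)
active_params = 49e9         # ~49 billion activated per token (quoted)
active_fraction = active_params / total_params
print(f"Active per token: {active_fraction:.1%} of all weights")

context_tokens = 1_000_000   # 1M-token context window (quoted)
words_per_token = 0.75       # rough rule of thumb, not from the paper
print(f"~{context_tokens * words_per_token:,.0f} words of context")
```

The mixture-of-experts point is that only about 3% of the weights do work on any given token, which is how a 1.6-trillion-parameter model can be served at a fraction of the compute a dense model of that size would need.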
Well, going back to GPT 5.5, you may have noticed that on OpenAI's own internal benchmark, GDPval, crafted by them, GPT 5.5 outperforms Opus 4.7. Indeed, if you combine both wins and ties versus other models, it outperforms GPT 5.4 Pro. But it must be said these are English-language white-collar tasks. DeepSeek were like: what if we created our own comprehensive suite of 30 advanced Chinese professional tasks (information analysis, document generation, editing) inside finance, education, law, and tech? Then we could blind-grade versus, for example, Opus 4.6 Max. Well, the win rates reported by DeepSeek for their V4 Pro Max versus Opus 4.6 Max were significant. We're returning to that IQ-axis debate again. If there was just one singular axis for intelligence that manifested across domains, then a result like this shouldn't really be possible. As long as there was enough training data, it should generalize across languages. Evidently, having specialized data trumps that theory. If you work in a non-English language, you might want to test DeepSeek V4 Pro. It is live on my own app, lmconsil.ai, but the API is clearly so busy that half the time you get a "model busy" message. If you do need to wait, then let me recommend the 80,000 Hours podcast, in particular an episode from 48 hours ago with Will MacAskill. This episode happens to be about the AI intelligence explosion. Yes, of course, their podcasts are available on Spotify as well as on YouTube. But if you are going to check out 80,000 Hours, do feel free to use the custom link in the description. It helps the channel out, and you get these multi-hour-long free podcasts. Not a bad deal. Not quite done with DeepSeek, though, because they almost get philosophical after reeling off a list of the different tricks they're using to improve performance.
After wading through 40-plus pages of breakdown, they say, in pursuit of extreme long-context efficiency, we basically retained many of the tricks that seem to work, tricks we already knew would work. Yes. The downside, though, was that this made the architecture relatively complex. And, to be honest, they say, some of the tricks we used have underlying principles that remain insufficiently understood. They did hit that 1 million context window, though, one of their long-standing goals, which I mentioned at the end of my documentary on DeepSeek that debuted first on my Patreon but is also on YouTube. Now, it's time for a result that ties all the models we've been talking about together. It's vibe coding, or more specifically Vibe Code Bench V1.1 from Vals AI. Almost everyone will probably end up being a vibe coder by 2030. So we have DeepSeek V4 at around 50%, GPT 5.5 at 70%, Opus 4.7 at 71%. Incredible. But look at the cost curve. As we've discussed, we have 5.5 at, what is that, 25% less cost than Opus 4.7? DeepSeek V4 at one-tenth the cost of Opus 4.7. To better test this, though, I thought, well, why not use the brand new Spud GPT 5.5 to vibe code an adventure game in less than 24 hours. Why did I pick 5.5? Well, I also wanted to test the brand new GPT Image 2. Yes, that's the model that even on medium settings absolutely destroys Nano Banana 2 and Nano Banana Pro: an almost 250-point Elo gap. In case you were wondering, yes, there is a high-quality setting, four times the cost, which you would suspect would win even more. Because Codex is becoming this super app, you can invoke the Image 2 tool within a Codex session multiple times without even asking each time. That's why I wanted to give the end-to-end task to GPT 5.5, to kind of show you guys the state of the art for these models, what you can create in less than a day.
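Putting those Vibe Code Bench scores and cost figures side by side makes the performance-per-dollar point concrete. The relative costs below are my reading of the video's "25% less" and "one-tenth" estimates (absolute prices aren't given), so treat this as a sketch:

```python
# Score per unit of (relative) cost on Vibe Code Bench V1.1, using the
# figures quoted in the video. Costs are relative to Opus 4.7 = 1.0.
models = {
    "Opus 4.7":    {"score": 71, "rel_cost": 1.00},
    "GPT 5.5":     {"score": 70, "rel_cost": 0.75},  # "25% less cost"
    "DeepSeek V4": {"score": 50, "rel_cost": 0.10},  # "one-tenth the cost"
}

ranked = sorted(models.items(),
                key=lambda kv: kv[1]["score"] / kv[1]["rel_cost"],
                reverse=True)
for name, m in ranked:
    print(f"{name}: {m['score'] / m['rel_cost']:.0f} score points per unit cost")
```

On raw score the three models sit within 21 points of each other; on score per unit cost, DeepSeek V4 comes out far ahead, which is the "performance per dollar may be the ultimate benchmark" argument in miniature.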
The reason I'm lingering on this particular screenshot, which is not mine, is because OGs of this channel will remember, maybe around two years ago, when I speculated at what would come. I said, I wonder when there will be an image model that will generate an output, then take that output as an input, analyze whether it fulfills the prompt, and edit as appropriate. Well, yes, the new Image 2 model does do that. But if you're using it within ChatGPT, you have to be using it with a thinking model. Anyway, what follows is just a glimpse of what's possible with a little bit of patience, both in wait times and in prompting the model a few times when it makes mistakes. We get this adventure game, which you can access; the link is in the description. And I can turn the sound on. The images are generated by Image 2. And the plot is set in the Redwall universe, albeit with names changed for copyright reasons. Essentially, it's a pick-your-own-adventure game: you read the plot and then you can pick different outcomes, I guess, different paths. Let's consult the Abbey elders. "Your quest begins now. Go with my blessing." And the videos come via Seedance 2. And there we go. We're consulting the Abbey elders, and then they're talking, and then we can continue, and you get through the different levels. Now, I know it's flawed, right? Some of the text is coming outside the bubble, and I did have to use Seedance to get the videos. The music, by the way, comes from ElevenLabs. But the fact we can create this with just a few prompts and a bit of patience is insane. And it did involve quite a bit of debugging. OpenAI can probably only incorporate image generation, unlike DeepSeek or Anthropic, because they have the compute to do so. According to an exclusive in Bloomberg, DeepSeek say that the service capacity for V4 Pro is extremely limited due to a computing crunch. And Anthropic, not having anticipated how successful they would be this year, are going through their own computing crunch.
So much so that Sam Altman has been relentlessly comparing how much more compute OpenAI has than Anthropic. Greg Brockman even laughed at the compute conundrum that Anthropic find themselves in. "You guys were teased for putting so much effort, money, into data centers. How do you think that's playing out now?" "Well, I think it's going to give us an advantage, and I think it's going to be something that's an advantage not just for the business, but for actually delivering on the mission of bringing this technology to everyone." "Because you guys, like, you saw that way in advance. You got teased for it by almost all of your competitors. Who's laughing now?" "Yeah. I mean, I think our competitors are not having a good time on compute, let me put it that way." But in a separate interview, even Greg Brockman of OpenAI admitted that they are entering a new era of compute scarcity. "Yeah. And that I think would explain the massive investments that you've led, making these big infrastructure bets." "Still not enough. We're going to feel the scarcity. We're going to feel it. We're feeling it already. You can sense it right now in people who are trying to use these agents and just simply cannot, you know, hitting the rate limits. So we're working on behalf of our customers, on behalf of everyone who wants to use these agents, to ensure that there is enough. And I don't think we're going to get there. We're going to do our best, but I think that we are headed to a world of compute scarcity. And again, I think this is something where we can all contribute to trying to help there just be more availability of this in the world." So let's step back a moment, because we just don't know what performance companies could produce if they were given unlimited compute. Maybe, as Amodei once said on Dwarkesh Patel's podcast, specializing in enough niche domains would eventually, at a certain scale, allow models to generalize across all domains.
But with the compute we have today, it seems we are in the "eke out incremental gains in the most lucrative domains" kind of world, not the "birth of a country of geniuses" world. There's so much evidence of the ability to automate repeatable tasks that are done on a computer, but there's much less of being able to just take whatever environment it's in, pinpoint the best sources of fresh data, acquire them autonomously, and make meaningful breakthroughs. Yeah, I know that's a high bar, but you do hear every lab leader trotting out the prospect of curing Alzheimer's, presumably to fight back against the declining support for AI in the public, while none of them have shown the ability to make, I would say, a positive novel breakthrough even one-hundredth as significant as that. Now, yes, they soon might do, and I'm watching Demis's Isomorphic Labs, of course, for drug discovery. Nevertheless, what does automating repeatable tasks unlock now, though? For sure, a massive boost to the productivity of white-collar workers, but will companies spend that productivity by laying off workers? And then we still have the incredible prospect of a single individual having the reach, if not the capital, of a medium-sized company. Even those two things seem to justify vast tracts of the globe being turned into token-generating data centers. So if you do still think AI is going nowhere, ask yourself what fraction of the progress and productivity of the world rests on repetitive tasks. It might be more than you first think. That's what I think, anyway, for now. Thank you so much for watching, and have a wonderful day.
