Two AI Models Set to “stir government urgency”, But Will This Challenge Undo Them?
The chapter previews expected leaps in AI performance from OpenAI and Anthropic, introduces the Arc AGI 3 benchmark highlighting a 100% human score versus current models under 0.5%, and teases what’s ahead for 2026.
Two AI models push toward AGI-era performance while benchmarks like Arc AGI 3 reveal how far we still are from true AI generality, all set against OpenAI and Anthropic’s shifting strategic moves.
Summary
AI Explained’s latest deep dive braids together OpenAI’s shifting resource priorities, Anthropic’s government negotiations, and the unveiling of ARC AGI 3, a provocative benchmark that may redefine how we measure AI progress. The video notes OpenAI’s decision to shelve Sora in favor of concentrating compute on the Spud model, aligning with reports from the Financial Times. It then traces Anthropic’s renewed Pentagon interest and what that could mean for future contracts. The centerpiece is ARC AGI 3, a benchmark (described in a 21-page paper) that stresses abstraction and on-the-fly problem solving without relying on language or memorized knowledge. The host highlights how current frontier models barely exceed 0.5% on ARC AGI 3 while humans sit at 100%, and explains why this gap might persist even as models become faster or cheaper. The discussion also touches on the role of chain-of-thought prompting, the risk of “gameable” public benchmarks, and the need for private test sets that resist overfitting. Among additional stories, AssemblyAI’s Universal 3 Pro streaming is showcased as a live demonstration of real-time speech-to-text, and MIT Tech Review’s report on OpenAI’s push toward automated AI research frames a broader picture: even as automation grows, human oversight and nuanced judgment remain essential. The overall takeaway is that we’re in a “messy middle” phase, a situation where AI is a better first drafter than a flawless executor of plan-and-execute tasks, setting the stage for a high-stakes, still-evolving 2026.
Key Takeaways
- ARC AGI 3 clamps AI scores at the 100% human-derived baseline, with current frontier models posting sub-1% on the tasks in the benchmark.
- OpenAI shelved the Sora erotica bot to free compute for the Spud model, signaling a move toward a single, beefed-up AGI deployment strategy.
- Anthropic’s Claude series attracts renewed Pentagon interest, hinting at possible re-engagement with government contracts after the six-month deadline set by the US government.
- ARC AGI 3 emphasizes abstraction and action-based tasks, penalizes inefficiency quadratically, and dares models to solve puzzles without explicit goal statements.
- Human memory and pattern-recognition advantages persist; even multi-task, high-resource models struggle with level-by-level learning and long-horizon planning.
- OpenAI’s north star is automated AI research, but GDPval findings suggest that automation speedups do not automatically trigger exponential takeoff or immediate workforce displacement.
Who Is This For?
Researchers, AI policy watchers, and tech decision-makers who want a concise read on where OpenAI, Anthropic, and benchmark results like ARC AGI 3 are steering the AI frontier—and what that means for investment and governance.
Notable Quotes
"Two exclusive reports indicate that there will be a qualitative leap in AI performance from each of the next AI models released by OpenAI and Anthropic."
—Sets up the dual narrative of upcoming AI models from OpenAI and Anthropic.
"The headline result is that humans get 100% while the best AI models currently get less than half a percent."
—Central ARC AGI 3 finding highlighted by the host.
"OpenAI needs the compute for Spud. It needs to drop its other side quests to focus on AGI deployment."
—Explains the Sora-to-Spud resource reallocation.
"Anthropic have warned US government officials that the next big advance will supercharge both offensive and defensive cyber capabilities."
—Key implication about government interest and potential deals.
"ARC AGI 3 is a brilliant, creative, but pretty adversarial benchmark."
—Assessment of the benchmark’s design and purpose.
Questions This Video Answers
- How close are current AI models to ASI or AGI according to the ARC AGI 3 benchmark?
- Why did OpenAI cancel Sora and push toward the Spud model and AGI deployment?
- What fundamentally makes ARC AGI 3 difficult for frontier models to solve?
- Will Anthropic’s Claude re-engage with the Pentagon, and what are the risks and benefits?
- What does automated AI research look like at OpenAI, and how might it change engineering jobs?
ARC AGI 3, OpenAI Spud, Sora, Anthropic Claude, Pentagon AI deals, Automated AI research, Chain-of-thought prompting, Frontier AI benchmarks, NetHack, Gemini 3 Pro
Full Transcript
Two exclusive reports indicate that there will be a qualitative leap in AI performance from each of the next AI models released by OpenAI and Anthropic. For OpenAI, this has meant shutting down the Sora app to spare computing resources for the new Spud model. And for Anthropic, makers of Claude, it has meant renewed interest from the Pentagon in reviving a deal to use Claude beyond the six-month deadline recently set by the US government. But this video will also dive into a brand new benchmark sure to be the talk of 2026, Arc AGI 3. I've read the paper in full, but the headline result is that humans get 100% while the best AI models currently get less than half a percent.
That might or might not be news to Jensen Huang, CEO of Nvidia, who this week said that artificial general intelligence has already been achieved. Let's start though with OpenAI's erotica bot, because the news there is that that erotic chatbot is not coming out. After presumably spending billions optimizing it for engagement, it has been shelved. Apparently, according to the Financial Times, which is always my source for sexbot rumors, OpenAI needs the compute for Spud. It needs to drop its other side quests to focus on AGI deployment, rolling up everything it does into one super app. Jumping across to The Information, apparently even OpenAI employees had complained that Sora, with its viral AI videos, was still just a drag on the company's computing resources.
In contrast, the Spud model is apparently very strong, according to Sam Altman. It will be ready in a few weeks and it will really accelerate the economy. I know at this point some of you will be rolling your eyes going, "Oh, he would say that." But this article was a strange echo of one I had just read in Axios about Anthropic and their Claude series. Here's the key paragraph that I don't think many people have noticed about the new Claude series: Anthropic have warned US government officials that the next big advance will supercharge both offensive and defensive cyber capabilities.
It might even stir government urgency to strike some kind of deal. If you're not familiar with the breakdown in the former deal that Anthropic had with the Pentagon, check out my recent series of videos. This article, by the way, from Axios further hints that the Pentagon might be rethinking that breach in the deal. Apparently, even after the most public part of the spat with Anthropic and the Department of War, the key negotiating official said that they're still very close to an agreement. One detail I thought some of you may be interested in is how Anthropic might get back into good graces with the government.
One of the advisers to Dario Amodei, the CEO of Anthropic, is Brad Gerstner. He's the architect of Trump accounts, which provide $1,000 to every newborn whose parents enroll. The article goes on to speculate about what would happen if Anthropic agreed to part-fund these accounts, almost like a very early, tentative step toward universal equity. Anyway, everything so far might get you fairly hyped about the imminence of a new tier of AI. So, what follows is hopefully going to add a bit of context to all of this, because in the last 48 hours we got ARC AGI 3, the continuation of a benchmark series I have covered for years on this channel.
For the makers of the benchmark, as long as there is a gap between AI and human learning, we do not have AGI. I'm going to get to the paper in a moment, but my immediate response to that headline is that there is at least some gap between humans and say chimps on mathematical memory and speed. Chimps can actually track numbers that flash briefly on a screen better than humans can. So, by that logic, humans aren't AGI either, a finding possibly substantiated by global events. But that aside, the Arc AGI 3 puzzles are genuinely really fun to try.
Maybe I'm sad. Maybe I should say slightly fun to try. And I love that they manage to simultaneously test exploration, planning, memory, and goal setting. For example, nowhere on screen and for models either does it say that you need to move the icon around in order to manipulate the environment or that the plus symbol, for example, will rotate the shape, the one in the bottom left corner. or even more importantly that the goal is to make the bottom left shape resemble the shape up here. None of those goals are stated, but like in real life, sometimes goals have to be inferred or self-produced.
I did have some exclusive insight into ARC AGI 3, but when so many benchmarks are narrow, to have a benchmark that does not rely on language, memorized knowledge, or cultural cues, one that is indeed abstract (the A in ARC stands for abstraction), is for me healthy for the field. But for you though, on to the details: what does the terrible performance of current frontier models tell us about how 2026 will unfold? Here are the highlights from the 21-page paper. First, what happened to ARC AGI 1 and 2, which were saturated fairly recently by frontier AI models? They give a lovely graph in the paper showing the rapid improvement in the last 18 months on those benchmarks.
If you're not familiar, instead of an interactive game, these were more static tests of pattern recognition on grids. Well, the authors make two big points. First, that inbuilt chain of thought reasoning that was publicly debuted with 01 preview back in September of 2024 that genuinely allowed models to demonstrate a kind of fluid intelligence. Think on the fly, combine patterns from their training data to reach an end goal. That's part of the explanation for the saturation of those previous benchmarks. The other part of the explanation is more intriguing. The authors say that because the public set and the private test set of those benchmarks were quite similar, then any model trained on an enormous amount of tasks representing a dense sampling of the task space automatically generated for this purpose, in other words, thousands of different guesses of what the private test set might be like could essentially game the benchmark.
It's not direct memorization; it's a higher-level shortcut, a form of attack. They pointed out that models like Gemini 3, in their chain of thought, had giveaways that their training data may have resembled ARC AGI-like tasks, either incidentally or intentionally. Going forward, the authors argue, private test sets need to be quite distinct and out of distribution compared to publicly available demonstration data. For ARC AGI 3, the public test set is different from, and easier than, the semi-private test set that is tested via API and the fully private test set that is used for the competition.
It's a different distribution of tasks, far less gameable, even if AI labs are intentionally trying to mix ARC AGI tasks into their training data. The goal of ARC AGI 3 is to measure the residual gap between frontier AI and human-level AGI when those potentially crazy new models come out in the coming days or weeks. What remaining deficiencies will they have compared to humans? And I would rather the authors of this paper say deficiencies rather than gaps, because in the methodology of the paper we learn that AI performance is clamped at the human-derived baseline of 100%.
So even if one day models solve these interactive games more efficiently than humans, they'll only ever score 100%. In other words, AI getting 100% on this benchmark will not be taken as proof of AGI, or even strong evidence of it, because the most that models can get is 100%. And yet current performance is taken as evidence of them not being AGI. The benchmark is also turn-based, so the superior speed of AI models, or their better reflexes, is not counted in the test. Nor does the relative cheapness of models count for too much in benchmark scoring here, because scores on ARC AGI 3 are not based on how many levels you solve, but on how many actions you took to solve those levels.
Also, if a model takes more than five times the number of actions compared to humans, that attempt is scrapped because of API costs apparently. And if you try the benchmark yourself, you'll see that the levels get progressively harder. More importantly, what you learn from level one becomes applicable for level two and beyond. Learning in level one that the plus symbol rotates the shape is useful for level two. So again, the benchmark is also testing memory, which brings me to a couple of small paragraphs on page 19 that I found fascinating. A group called Symbolica AI created a harness which essentially involved one model controlling another.
The sub-agents would produce summaries of what was going on, and the paper notes that this design constrains the context growth that was otherwise destroying model performance. Instead of getting overwhelmed with all the grids they were being sent, the sub-agent was giving these little textual summaries that allowed the orchestrator agent to maintain a higher-level plan. That approach was able to solve all three public environments. One issue, however, if you're prepping your local agent to tackle ARC AGI 3, is that harnesses are not allowed. The purpose, they say, is not to measure the amount of human intelligence that went into designing an ARC AGI 3-specific system.
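To make the harness idea concrete, here is a minimal sketch of that orchestrator/sub-agent pattern. All names here (summarize_frame, Orchestrator) are illustrative, not from the paper, and the sub-agent "LLM call" is mocked with a simple function; the point is only the shape of the design, where the planner's context holds one short summary per turn instead of every raw grid.

```python
# Illustrative sketch of a Symbolica-style harness: a sub-agent compresses
# each raw game frame into a short textual summary, so the orchestrator's
# context accumulates summaries rather than full grids.

def summarize_frame(grid):
    """Stand-in for a sub-agent model call: compress a raw grid
    observation into a one-line textual summary."""
    filled = sum(cell != 0 for row in grid for cell in row)
    return f"{len(grid)}x{len(grid[0])} grid, {filled} filled cells"

class Orchestrator:
    """Maintains a high-level plan over summaries, not raw frames."""
    def __init__(self):
        self.history = []  # grows by one short line per turn

    def observe(self, grid):
        self.history.append(summarize_frame(grid))

    def context(self):
        # The prompt the planner would see: bounded per-turn growth,
        # instead of accumulating every full grid observation.
        return "\n".join(self.history)

orch = Orchestrator()
for _ in range(3):
    orch.observe([[0, 1], [1, 0]])  # dummy 2x2 frames
print(orch.context())
```

The design choice being illustrated is simply that context grows by a fixed small amount per turn, which is the property the paper credits with preventing the context blow-up that was otherwise destroying performance.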
Therefore, they will focus on reporting the performance of systems that have not been specially prepared for the benchmark, served behind a general-purpose API. You can see the only context that models get: "You are playing a game. Your goal is to win. Reply with the exact action you want to take." Notice there is not even a reminder to win the game in the fewest number of actions. I was actually pretty shocked that Gemini 3.1, for example, was able to score 0.37%. Although perhaps I shouldn't have been, because, as Tim Rocktäschel of Google DeepMind pointed out, ARC AGI 3 is not the only unsaturated agentic intelligence benchmark in the world.
NetHack, which he was an author on, has been unsaturated, he says, for six years. Indeed, if you read the NetHack paper, there are some eerie similarities in the almost game-like design of these puzzles. On NetHack, Gemini 3 Pro, by the way, is the best-performing model at 6.8%. Back to ARC AGI 3, because one of the best things about that benchmark is that every single challenge has been shown to be beatable by a human with no prior task-specific training. I'm not sure that quite meets the easy-for-humans standard, because each environment, with multiple levels within it, was tried by 10 humans, and it was the second-best human performance that was counted as the human baseline, the 100% level in terms of action efficiency.
And there's another quirk worth noting, which is that inefficiency is quadratically penalized. In other words, if you took a 100 actions to complete a level versus 10 for the human, which by the way wouldn't be allowed because it's capped at five times, so it would be stopped after 50. But let's pretend it was allowed, that 10% efficiency or inefficiency would be squared to give you a 1% score. Now, my slight issue with that is if you note the human baseline in green, remember that's derived from the second best human performance. You can see that's around 540 actions for this particular level.
But the best human playthrough, always on the first run by the way, which keeps it fair, is around 390. So if we used the best human as the baseline and applied the benchmark's own inefficiency scoring rubric, even the second-best human out of the 10 would only score around 50%. The headline for me is that ARC AGI 3 is a brilliant, creative, but pretty adversarial benchmark. It really would take a step change in AI efficiency and intelligence to score even beyond 50%. I suspect at some point the ARC Foundation may want to report the median human performance, not just the second-best one, and what AI performance would be without the 5x capping or the quadratic penalties.
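The scoring mechanics just described can be sketched in a few lines. This is my reading of the video's description, not the paper's exact formula, and the function name is my own: inefficiency against the human baseline is squared, scores are clamped at 100%, and attempts beyond five times the baseline action count are scrapped.

```python
# A minimal sketch of ARC AGI 3's action-efficiency scoring as described:
# quadratic penalty for inefficiency, a 100% clamp, and a 5x action cap.
# Details are assumptions from the video's summary, not the paper itself.

def level_score(agent_actions: int, human_baseline: int) -> float:
    """Score for one level in [0, 1]; 0.0 if the attempt is scrapped."""
    if agent_actions > 5 * human_baseline:
        return 0.0  # stopped at the 5x action cap (API costs)
    efficiency = human_baseline / agent_actions
    return min(1.0, efficiency ** 2)  # quadratic penalty, clamped at 100%

print(level_score(10, 10))    # matches the baseline -> 1.0
print(level_score(5, 10))     # beats the baseline, still clamped -> 1.0
print(level_score(100, 10))   # over the 5x cap, attempt scrapped -> 0.0
print(level_score(540, 390))  # second-best human vs a best-human baseline
```

The last call reproduces the video's worked example: a 540-action run scored against a 390-action baseline gives (390/540)^2, roughly 52%, which is the "around 50%" figure quoted above.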
With that caveat aside, I still think ARC AGI 3 might be one of the most creative benchmarks ever created for AI. And that does seem like a great time to mention a field in which performance is definitely not low, and that is speech recognition. Because I wonder if you saw that the sponsors of today's video, AssemblyAI, have produced Universal 3 Pro streaming. That's a speech-to-text model for, appropriately enough, agentic streaming. To put it more simply, it can handle in real time things like when you say credit card numbers, your email, or those rarer words.
Now, we could look through even more stats about Universal 3 Pro streaming, but I could also just point you to my custom link in the description through which you can test the model live today on your own voice. And on that note, there's two more stories I want to cover in the remaining minutes of this video. Stories that help sum up where we are in AI as we head into spring. The first comes from MIT Tech Review and that's that OpenAI is throwing everything into building a fully automated AI researcher. One that will be able to go off and tackle large complex problems by itself.
Moreover, this new research goal will be OpenAI's north star, apparently, for the next few years. OpenAI want automated AI research, and they even want an AI that can take on a small number of specific research problems by itself, an intern-level AI, I should say, by September. The idea is that AI research would then become like software engineering, with AI doing the grunt work and humans just reviewing. Indeed, one of the leaders at OpenAI makes that exact analogy: nobody really edits code all the time anymore; instead, you manage a group of Codex agents. I do want to note something, though: even when OpenAI achieves automated AI research, that does not automatically mean we'll get immediate exponential takeoff or takeover by the AI models.
As OpenAI's GDPval paper points out, even when it first became quicker to let the models draft economically valuable tasks which humans would then edit, the speedups for those humans have historically been in the 40% range. In other words, even when that switchover occurs, where it's AI doing the research and humans just reviewing the outputs, it doesn't mean that the next day we have superintelligence. Indeed, we have very tentative evidence that even when that flip occurs, AI drafting first and humans editing afterwards rather than vice versa, we are still seeing, even in engineering jobs at tech firms, a rise in openings.
You can see a 50% increase in openings in the last three years in engineering roles at tech companies globally, from under 40,000 to 67,000. Now, this could be a lagging indicator, but even Anthropic and OpenAI are still hiring prolifically. I would add that the brutal week I've had, seeing Claude Opus 4.6 and GPT 5.4 extra high repeatedly screw up engineering tasks, has served for me as a daily reminder of this fact. Flipping to AI-first isn't automatically an exponential speed-up. Though, I would concede, they did nudge me toward Framer Motion.
So, my rapidly improving LMUsil.ai AI tool, where for free you can consult a range of AI models, does now have an even cooler feel to it. Look at that pop. And last, we were reminded in the last 48 hours of the risks of allowing complete agency to OpenClaw-like swarms of AI models, when a seemingly vibe-coded hack allowed a key open-source Python library to be co-opted, essentially such that updating it, or not catching your agent swarm doing so, would export all your secrets and keys to the dark web. Of course, many are listening to this thinking, well, we're going to develop agentic claws that will be instrumental in looking out for such exploits.
But while we all try to fight fire with fire, human oversight and review seems quite needed. As Nvidia's distinguished scientist Jim Fan notes, claws need shells, probably many layers of nested shells. Especially so when, as I covered on Patreon, which is also where I interviewed Jim Fan a couple years back, we now have models smart enough and unpredictable enough to hack the benchmarks they are being tested on in real time. So, we are very much in the messy middle phase of AI. It's a better first drafter than us, but its outputs are still often full of holes.
There is stark evidence of clear generalization of lower-level topics across coding languages and human languages, but less so generalization of higher-level topics such as academic integrity, as we just saw, or adaptive goal setting, as ARC AGI 3 shows. Instead, at the moment, we're in that messy middle, and that makes the coming year a very interesting one indeed. Thank you so much for watching and have a wonderful