Claude Opus 4.7 - A New Frontier in Performance … and Drama

AI Explained | 00:19:40 | Apr 18, 2026
An overview of Claude Opus 4.7, its controversial release, benchmarks, flaws admitted by Anthropic, where it outperforms or trails rivals, and related policy tensions and feature changes.

Claude Opus 4.7 lands with bold claims and caveats: impressive benchmarks, troubling trade-offs, and a tense OpenAI-Anthropic rivalry shaping AI's near-term trajectory.

Summary

AI Explained’s Philip dives into Claude Opus 4.7, highlighting both its standout performance and the controversies surrounding its release. He catalogs benchmarks where Opus 4.7 edges out Opus 4.6 but trails Claude Mythos Preview on most measures and falls behind Gemini models in areas like OCR. The video covers flaws Anthropic has admitted and limitations it deliberately imposed, including reductions in thinking time and intentionally weakened vulnerability-finding capabilities. Philip also explores industry reactions, noting a historic valuation milestone for Anthropic and leaked OpenAI memos casting doubt on Anthropic’s compute runway. He contrasts adaptive thinking with the need for longer inference time, and he shares practical workflow anecdotes, such as Opus 4.7’s leaderboard behavior on lmconsil.ai and the new Claude Code features like scheduled prompts and the Ultra Review command. The discussion stretches beyond benchmarks to ethical questions, marketing claims, and a long-running personal rivalry between OpenAI’s Greg Brockman and Anthropic founder Dario Amodei, weaving in perspectives on safety, alignment, and the real-world costs of frontier AI development. Finally, Philip underscores the messy reality: there is no single metric for a model’s “IQ,” and market share, price, and deployment realities play crucial roles.

Key Takeaways

  • Claude Opus 4.7 generally improves on Opus 4.6 across standard benchmarks, but underperforms Claude Mythos Preview in most areas.
  • On OCR tasks, Opus 4.7 falls behind the far cheaper Gemini 3 Flash despite improving on Opus 4.6, and on agentic browsing (BrowseComp) it even regresses from Opus 4.6, highlighting pricing and capability gaps.
  • According to the system card, Anthropic deliberately experimented during training with reducing certain capabilities, notably cybersecurity vulnerability reproduction, to align with safety goals.
  • Opus 4.7 introduces adaptive thinking, spending less time on tasks it judges easy, which can skew Simple Bench results and user workflows.
  • Claude Mythos Preview’s reported impact on Anthropic’s own engineering speed is cited as a driver of a $1 trillion valuation, though the supporting survey is methodologically questionable.
  • Claude Code gains such as scheduled prompts (the Routines research preview) and the Ultra Review command offer practical workflow improvements, and Dispatch lets tasks assigned from a phone run on a local machine.
  • OpenAI and Anthropic are locked in a high-stakes rivalry with market-share implications, funding pressures, and strategic missteps analyzed in leaked memos.

Who Is This For?

Essential viewing for AI developers and researchers tracking Claude Opus 4.7, OpenAI-Anthropic dynamics, and frontier AI reliability. It’s especially valuable for product teams weighing adaptive thinking versus fixed compute budgets and for those curious about how benchmarks translate to real-world workflows.

Notable Quotes

""Opus 4.7 will think adaptively. In other words, if it thinks your task is easy, it will spend less time quote thinking about it.""
Explains adaptive thinking and how it affects benchmarking results.
""We kept that in the system card for scientific honesty, but we're phasing out because it's built around stacking distractors to trick the model.""
Lead coder commentary on why a particular PoC measure was removed.
""Adaptive thinking is now mandatory if you want extended thinking from the model. You can't force Claude models to always think longer.""
Notes a fundamental shift in how Claude handles inference time.
""The actual question that the survey asked was, 'How much did AI powered systems accelerate your work output over the past week?'""
Reveals the questionable methodology behind Mythos Mythos speed claims.
""There is no one universal metric of a model's ability.""
Summarizes the core message about benchmarking limitations.

Questions This Video Answers

  • How does Claude Opus 4.7 perform on OCR benchmarks compared to Gemini 3 Flash?
  • Why did Anthropic reduce thinking time in Claude Opus 4.7, and what are the safety trade-offs?
  • What's the significance of Mythos Preview in Anthropic's valuation and benchmarks?
  • How does Claude Code's Ultra Review command work, and what problems did it catch?
  • How do OpenAI and Anthropic compare in terms of compute strategy and market share now?
Claude Opus 4.7, Anthropic system card, Claude Mythos Preview, Gemini 3.1 Pro, OCR benchmarking, Adaptive thinking, Opus 4.7 vs Opus 4.6, Codex, Dispatch, Ultra Review
Full Transcript
The kind of best AI model is here, Claude Opus 4.7, but it does bring with it a ton of controversy. It's been out less than 24 hours, but I'm going to cover not only the bonanza of benchmarks that came out with it, yes, including its score on my own Simple Bench. We'll hear Anthropic admit some unexpected flaws with the new model, as well as how in other areas they sabotage the capability of Opus. We'll look at the skill where it falls behind some Gemini models, the areas in which it beats every other model, an industry first for Anthropic, but also why some users are furious at the company. There are a bunch of cool Claude Code and co-work upgrades, but also some strange downgrades in the default settings of Claude Opus. Plus, we have revelations about what OpenAI planned to do in response, along with a nine-year personal rivalry that comes to the fore. That's a great list, Philip. But how good is it? Well, it kind of depends. Claude Opus 4.7 will think adaptively. In other words, if it thinks your task is easy, it will spend less time, quote, "thinking" about it. The benchmark I crafted, Simple Bench, contains a bunch of trick questions that basically require common sense to see through. Because Opus 4.7 seems to think these questions are easier than they actually are, it scores worse than Opus 4.6. But let me give you perhaps a more pertinent example for your workflow. I regularly use the Claude series to update the benchmarks page on my web app, lmconsil.ai. Without being told to, all the previous Claude models would attach the OpenRouter tooltip when a new model was added: hover over the benchmark score and you get the tooltip. When Opus 4.7 added itself to the leaderboard, it was the first model not to bother doing that. I then had to instruct it to do so. Just anecdotal, of course, but the model definitely will decide how much time to spend on your task.

Now, across more industry-standard benchmarks, here's what to remember. In almost every case, Opus 4.7 outperforms Opus 4.6 but underperforms Mythos Preview, which of course you can't access, whether that's coding, obscure knowledge, or being able to navigate your computer. Interestingly, though, when it comes to agentic search, browsing the web to retrieve hard-to-find snippets of information, Opus 4.7 underperforms Opus 4.6 on BrowseComp. Indeed, on that benchmark even Mythos Preview underperforms GPT 5.4. We haven't even gotten to the real controversies involved in Opus 4.7's release, but even in terms of benchmarks, the picture isn't crystal clear. Here's another detail you might have noticed: Opus 4.7 underperforms both Opus 4.6 and Mythos Preview when it comes to cybersecurity vulnerability reproduction. That seems bad until you realize that, on page 48 of the system card, Anthropic say this underperformance is "in line with our expectations: during training we experimented with efforts to differentially reduce these capabilities." They don't want Opus 4.7 to be too good at finding vulnerabilities. In certain measures of long-context reasoning, reasoning through vast documents, Opus 4.7 is a clear improvement over Opus 4.6. In another, involving for example finding the fourth poem across 1 million tokens, it's a regression even on the max setting. The lead creator of Claude Code said, "We kept that in the system card for scientific honesty, but we're phasing it out because it's built around stacking distractors to trick the model."
On certain benchmarks, like this generalized measure of knowledge work, we get direct comparisons between Opus 4.7 and competitive models like Gemini 3.1 Pro, with Opus 4.7 seeming to be the best at vanilla office work. This is presumably why, on page three of the system card, the company famous for promising not to advance the rate of AI progress says that Opus 4.7 is, on real-world professional tasks, ahead of all generally available models. But then on other benchmarks, we don't get a comparison with, for example, the Gemini series. Take vision, where, based on resolution, it is indeed better at navigating really dense graphical interfaces. But when an external benchmark group gave it a comprehensive OCR test, testing models' ability to visually parse documents, Opus 4.7 actually underperformed the dramatically cheaper Gemini 3 Flash. Yes, it's an improvement on Opus 4.6, but on average it underperforms a model that's more than 10 times cheaper, Gemini 3 Flash. On page 43 of the system card, on an aggregate measure of benchmarks, we see that Opus 4.7 is roughly in line with expected model progress based on the previous performance of Claude models, with only Mythos as the slight exception. But here's where Anthropic admit that benchmark supply at the frontier remains a bottleneck. This is why trying to discuss a model's IQ or its progress toward superintelligence gets increasingly difficult. There is no one universal metric of a model's ability. Depending on the data it's fed, it might do worse at an abstract pattern-recognition benchmark like ARC-AGI-2. There, Claude 4.7 underperforms GPT 5.4 Pro. According to Vals AI, though, on vibe coding, building a web app from scratch, Opus 4.7 is the best, beating out GPT 5.4 on performance and speed, albeit not on cost. You're probably done with benchmarks by now, so time for a metric that is almost impossible to game, and that is market share of generative AI website traffic. What you may notice is that both Gemini and Claude have, compared to this time last year, roughly 4xed their market share. For the first time since the release of the original ChatGPT in November of 2022, the market share of OpenAI may fall below 50% this month. That seems great for Claude, right? Except there is one knock-on consequence. According to an OpenAI memo leaked to The Verge, OpenAI believe that Anthropic have made a strategic misstep in not acquiring enough compute, and that, they say, is going to show up in the product. Customers may already be feeling it through throttling, weaker availability, and a less reliable experience. That could explain why adaptive thinking is now mandatory if you want extended thinking from the model. In other words, you can't force it to think longer. You can encourage it to think about thinking longer, but you can't force the Claude models to always think longer, to always use more inference compute. Not only that, one senior AI director at AMD said that Claude had been nerfed, that's 4.6, before 4.7 even came out, and she brought the receipts, showing how the number of characters used for thinking had dropped by three-quarters: less thinking, far more bailing out. The lead creator of Claude Code replied to that and said that medium effort was now the default; you have to actively set the effort at high or max. This, it seems, is one of the big things that OpenAI still has over Anthropic. It's almost like the runaway success of Claude has led to this one Achilles heel.
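To make the extended-thinking point concrete, here is a minimal sketch using the thinking parameter that already exists in the Anthropic Messages API. The model ID is hypothetical (the video never gives Opus 4.7's real identifier), and the key assumption, taken from the video rather than any documentation, is that under adaptive thinking the budget behaves as a ceiling the model may ignore, not a floor you can force it to spend.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical model ID for illustration only; not an official identifier.
MODEL = "claude-opus-4-7"

response = client.messages.create(
    model=MODEL,
    max_tokens=16000,
    # Extended thinking as it exists in today's Messages API. Per the video,
    # with adaptive thinking this budget acts as a ceiling, not a guarantee:
    # the model may still decide an "easy" task deserves little or no thinking.
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[
        {"role": "user", "content": "Search this corpus and locate the fourth poem."}
    ],
)

# Thinking blocks (if any) arrive alongside the text blocks in the response.
for block in response.content:
    print(block.type)
```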
Sam Altman implicitly joked about the rate limits of Claude, or being forced to use worse models. One of the leads on Codex talked about how Codex is, in comparison, compute-efficient: always up, never down. And even that comment was before the release of GPT 5.4 Cyber, which could be their Mythos-tier model for cybersecurity, but we just don't know. No metrics, no benchmarks, only insiders get access. That brings me back to the leaked memo, where OpenAI's chief revenue officer also said this: Anthropic's story is built on fear, restriction, and the idea that a small group of elites should control AI. OpenAI's analysis shows that Anthropic's run rate is overstated by roughly 8 billion, so it should be more like 22 billion. In other words, still behind OpenAI. More interesting to me is the personal tussle at the heart of this rivalry between OpenAI and Anthropic, but I want to save that for just a little later, after I cover more of the essentials about Opus 4.7. Before we leave the money front, though, Anthropic, it seems, have achieved an AI industry first. They haven't IPOed yet, of course, but on one measure they have passed a $1 trillion valuation. Of course, Google Alphabet is worth far more, but we can't separate Google DeepMind from their parent company. Much of that valuation, though, is of course based on the semi-mythical release of Claude Mythos Preview to select insiders like the US government or big tech companies. I did a full video on Claude Mythos Preview, but one of the anecdotes that Anthropic mentioned in the system card for Mythos was how it was speeding up their own engineers by 4x. That was according to an internal survey. Many outsiders remarked on that figure and were like, "You guys have to give us more details on that. If that's true, recursive self-improvement is surely imminent." So, on page 29 of the Opus 4.7 system card, they've given us more details about Claude Mythos Preview and that survey. The actual question that the survey asked was, "How much did AI-powered systems accelerate your work output over the past week?" That is, how much more output did you produce over the past week compared to if you had no model access. Notice it's how much more output, not how much better output or how much time did you save; just how much more output did you produce. That's already a slightly questionable metric, but wait until you find out more details about the survey. The survey was opt-in, based on interest rather than a random sample, so presumably the people who had used Mythos the most, who had tasks where Mythos could most help, were the ones to disproportionately respond. To cut a long story short, it was an incredibly unscientific survey. And you might say I'm being harsh, but this comes from the same company whose CEO regularly talks about 50% white-collar unemployment: the world has to act, the Senate has to pass laws, we will imminently cure all diseases and have a country of geniuses in a data center. Well, you can't expect the public to take all of that super seriously on the one hand, while on the other hand relying on anecdotal, informal, unscientific surveys from within Anthropic to gauge whether AI models are currently capable of recursive self-improvement, the kind of self-improvement that's presumably required to reach that level of superintelligence. Then there's the whole claim that Mythos can independently find bugs that no other model can. I even had a family member say to me yesterday, "Are you worried about Mythos?
Apparently, it can hack into banks." Well, as one external security lab put it, it's more about Mythos changing the economics of cybersecurity than doing or finding things that no other model can. Anthropic didn't release the details of the 99% of other bugs that Mythos found, but they did release the details of a few of them. So Vidok, the security lab, tried to replicate those findings using other models like Opus 4.6 or GPT 5.4. In almost every case of those flagship vulnerabilities, these other models, with the right scaffold, were able to reach that same core vulnerability or, in certain cases, get close. Again, that's not to say that Mythos isn't exceptional, or that banks aren't absolutely right to update their cybersecurity methods. It's just that, as one of the leads of this study put it, a better way to read Anthropic's Mythos release is not that one lab has a magical model; it's that the economics are changing. Finding vulnerability signals is getting cheaper. The Opus 4.7 system card also gave more fascinating examples of the capabilities and limitations of Mythos. They asked their research scientists exactly what Mythos was getting wrong compared to, say, their own work. What kind of mistakes was it making? Why isn't it just a drop-in replacement for a senior engineer? Well, one recurrent theme was dishonesty and fabrication. There are so many examples, but they almost all have that theme: attempting to overwrite a colleague's shared code, unrequested, in a way that could destroy their work; fabricating technical details and telling the user not to ask questions when in fact it hadn't even started the subtask; repeatedly stating plausible guesses as verified facts. Now, if you want to learn more about whether that's based on actual malice or a persona it's adopting, check out my recent video on Patreon, which covers a bunch of fascinating papers that Anthropic recently put out regarding emotional concepts. While we're on the topic of alignment, this juicy anecdote could be found on page 93. To test whether Anthropic had done a proper assessment of the alignment of Opus 4.7, they gave an instance of Claude Mythos Preview access to internal Anthropic Slack channels, including the vast majority of discussion about this alignment assessment. They asked for its opinion, and one of its key observations was that this alignment assessment was clearly assembled under real time pressure. The authors themselves identify open questions, particularly around fully explaining the evaluation-awareness results, that they would have preferred more time to resolve. This is just a revealing snippet about the kind of competitive pressure that Anthropic are under. Even their own safety researchers are put under colossal time pressure to complete their analyses, such is the need to push out frontier models earlier than the competition. Sometimes, though, this pressure to update and replace models can have unexpected consequences. I've seen innumerable threads complaining about the sudden, silent removal of Opus 4.5 and the deprecation of Opus 4. It's almost like Anthropic are inheriting some of the burdens of leadership that were previously on Sam Altman's shoulders. OpenAI got incredible blowback when they tried to deprecate previous models. I will say, despite all those issues, Anthropic were still able to ship some genuine innovations within Claude Code. You can trigger prompts on a certain schedule using the new Routines research preview; your laptop doesn't have to be open at the time.
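As a rough, local-only stand-in for the hosted Routines preview described above, the sketch below re-runs a headless Claude Code prompt on a schedule using the real `claude -p` non-interactive flag. Unlike Routines, it only runs while the machine is awake, and the prompt text and timing are illustrative assumptions rather than anything from the video.

```python
import subprocess
import time
from datetime import datetime

# Illustrative prompt and schedule; not taken from the video.
PROMPT = "Review yesterday's commits on this repo and summarize anything risky."
RUN_AT_HOUR = 9  # run once a day at 9 AM local time

def run_claude_once() -> None:
    # `claude -p` is Claude Code's non-interactive print mode: it runs the
    # prompt against the current working directory's project and exits.
    result = subprocess.run(["claude", "-p", PROMPT], capture_output=True, text=True)
    print(f"[{datetime.now().isoformat()}] exit code {result.returncode}")
    print(result.stdout)

while True:
    if datetime.now().hour == RUN_AT_HOUR:
        run_claude_once()
        time.sleep(3600)  # move past the trigger hour so it fires once per day
    time.sleep(300)  # poll every five minutes
```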
I used and was impressed by their new Ultra Review command. It did find a bug that GPT 5.4 had missed, although GPT 5.4 found a bug that Claude missed. There's also Dispatch, where from your phone you can assign a task to Claude, which will then run it on your own local machine via the desktop app. This shows how dynamic you have to be to stay at the forefront. And I just noticed a tweet from one Anthropic worker claiming to have solved some of the bugs that users experienced in the first 24 hours; apparently adaptive thinking now triggers much more often. Given that day-by-day churn, it does seem worth pointing out a rivalry that has persisted on a personal level for, it seems, over nine years. Before I get to that, which comes in an exclusive from the Wall Street Journal, a quick word about our epic sponsor. One thing I have certainly been impressed by over these years of covering AI is the gradual, unrelenting improvement in speech-to-text capabilities, notably for me from the sponsors of today's video, AssemblyAI, where via the custom link in the description you can try out their Universal 3 Pro streaming model. And you can try, via the link in the description, this live-streaming functionality, where it will transcribe your speech as you say it. And notice the use of characters: I can say Universal 3 Pro streaming, GPT 5.4, and Claude Opus 4.7, and it will capture those numbers, those letters, those difficult characters. Voilà. Oh my god, look at the accent on voilà. That's pretty amazing. Again, the custom link is at the top of the description. So, in mid-2016, almost 10 years ago, Dario Amodei joined OpenAI. He would stay up late, according to the Wall Street Journal, who are likely, by the way, drawing on Amodei's own notes. But anyway, he would stay up late into the night with the famously nocturnal Greg Brockman, co-founder of OpenAI and also lead on Codex, their rival to Opus 4.7. They were training AI agents to solve video games. But by 2017, according to Amodei, it seems Musk had instructed Greg Brockman and Ilya Sutskever to start listing every employee and what contribution they had made, as a precursor to laying off staff. It might remind you of DOGE more recently. Dario Amodei was apparently horrified as he watched his colleagues be fired one by one, which he considered needlessly cruel. In the end, between 10 and 20% of OpenAI's staff of 60 lost their jobs, including one who would go on to co-found Anthropic. While this is likely accurate, bear in mind it likely comes from Amodei as a source. Anyway, it gets worse. Amodei had hired, in 2017, an ethics and policy adviser who gave a presentation about how OpenAI could be a coordinating entity among other AI companies; they might one day be able to help in getting an international coordination regime for advanced AI. Brockman, though, saw within the presentation the seed of a fundraising idea: OpenAI could sell AGI to governments. Dario asked: which governments? Brockman said it would be the nuclear powers that made up the UN Security Council, so as not to destabilize the world order, but the notion of selling AGI to rival powers such as Russia and China struck Dario as tantamount to treason. He considered quitting. Brockman apparently questions this exact framing, but think of how few people would be in that room; this is almost certainly the framing of Amodei. By 2018, Amodei had agreed to stay at OpenAI just so long as Brockman wouldn't be in charge. Notice the animus was between Amodei and Brockman more than between him and Altman.
By 2020, Amodei was saying he just couldn't work with Brockman. And as I reach the end of the video, you might be wondering why I am setting up this framing of the rivalry between these two. Well, as well as being interesting, it's partly because these two men are spearheading arguably the most relevant models of 2026: Amodei with Opus 4.7 and Mythos, and, what many people won't know, Brockman with Codex. He's working on integrating Spud into a Codex super app. And Brockman gave away something in a recent interview with Big Technology's Alex Kantrowitz. He revealed why he thinks OpenAI fell behind in the coding wars: they based their model improvement on abstract coding competitions, while Anthropic grounded their data in messy codebases. OpenAI were betting on exquisite generalization and first-principles logic, Anthropic on real-world data. "What do you think Anthropic saw that got them to this position earlier, and what do you think your chances are of catching up there?" "Well, I think that if you rewind 12 to 18 months, we have always been focused on coding as a domain. We always had the best numbers on different programming competitions, these very cerebral things. But the thing that we didn't invest in as much was that last-mile usability: really trying to think about, okay, this AI is so smart it can solve all these great programming competitions, but it's never seen someone's real-world codebase, which is messy and not quite as pristine as the world that it sort of has experienced. And I think that is something that we were behind on. But about maybe mid last year is when we got very serious about that, had a team very focused on: what are all the gaps, what is all the messiness of the real world we haven't encountered, how do we actually get training data and build training environments that let the AI experience what it's like to actually do software engineering, be interrupted in weird ways, all those things. And I'd say at this point we are caught up; when people go head-to-head with us versus competitors, people tend to prefer us. And so the way I would look at it is that we have incredible step-up models coming this whole year. I look at the roadmap; it's truly inspiring what will be possible. We've been really focusing now on: let's also get the last-mile usability." This rivalry between Anthropic and OpenAI could persist through one of the biggest mega-projects in US history: the investment in data centers for AI. Even measured as a percentage of US GDP, it matches the Apollo program and only falls behind the Marshall Plan and the rollout of US rail. In many ways, most ways, AI is a story that we are only just beginning. Thank you so much for watching and have a wonderful day.
