Claude Fable 5 - Full 319 page Breakdown
Chapters (11)
The video praises Claude Fable 5 as a substantial step forward for Anthropic, previewing a long system card and a hands-on review of about 100 tests and benchmarks to highlight its capabilities and concerns.
Claude Fable 5 marks a significant leap in capability and benchmarks, but Anthropic keeps guardrails tight while warning about real-world limits and safety tradeoffs.
Summary
AI Explained’s deep dive into Claude Fable 5 shows Anthropic delivering a model that dominates many benchmarks while also raising tough questions about safeguards, deployment, and real-world reliability. The host highlights the unusually long system card (around 319 pages) and notes a shift from safety-only research toward commercial viability and rapid iteration. He compares Mythos 5 / Fable 5 against Opus, GPT-5.5, and Gemini 3.1 Pro, pointing to a broad performance lead in many tasks, especially spatial reasoning and code-related challenges. Yet he emphasizes persistent gaps in real-world deployment, including hidden safeguards that can steer or misrepresent results, and the reality that benchmarks don’t always map to production performance. The video also digs into biology tests, illustrating how Mythos 5 / Fable 5 can accelerate certain workflows while still lagging in end-to-end outcomes and safety considerations. Throughout, the host stresses that while Fable 5 is powerful, it is not a magic wand for autonomous science or limitless self-improvement, and cautions about the risk of ambient AI in decision-making across industries. He concludes with a pragmatic view: expect ongoing discussion, more real-world testing, and a cautious optimism about ambient AI where AI-reviewed decisions become commonplace. The takeaway is that Fable 5 is superb for coding, reasoning, and bringing creative ideas to life, but practitioners must remain vigilant about evaluation realism, safety, and the long path from benchmarks to scalable impact. Finally, the video signals that we’ll be living with evolving models and debates for weeks to come as the ecosystem reacts to this release.
Key Takeaways
- Fable 5 outperforms competitors on public benchmarks and simple private tests, achieving top scores in spatial reasoning and logic-based tasks.
- The system card reveals invisible safeguards and steering vectors that can silently bias answers, underscoring limits on transparency in production use.
- Mythos 5 / Fable 5 shows strong capabilities in chemistry and biology tasks and can outpace humans in some design phases, yet real-world pipelines still hit end-to-end bottlenecks downstream.
Who Is This For?
Engineers, researchers, and product leaders evaluating whether to deploy Claude Fable 5 in coding, data science, or healthcare contexts; this video helps them understand benchmark strengths, hidden safeguards, and deployment caveats to expect in practice.
Notable Quotes
"319 pages. But more seriously, Claude Fable 5 is both quantitatively and qualitatively a significant step forward in AI capabilities."
—Opening assessment of Fable 5’s overall significance.
"The chat got paused. And by the way, if you think that's heavy-handed, just wait until you dive into the system card."
—Illustrates the weight and hidden safeguards in the system card.
"Mythos 5 or Fable 5 without safeguards matches top US labor market performers on medium-horizon black-box biological sequence design and prediction."
—Biology capabilities and limits discussed with safeguards in mind.
"Benchmark results are good, but never mistake the benchmark for the real distribution."
—Cautions on interpreting benchmarks vs. real-world tasks.
"Ambient omnipresent AI... how long until the first company is sued because one of its decisions was not reviewed by a Mythos-class model?"
—End-of-video reflection on deployment risks and governance.
Questions This Video Answers
- How does Claude Fable 5 compare to GPT-5.5 and Gemini 3.1 Pro on common benchmarks?
- What are the invisible safeguards Anthropic mentions in the Fable 5 system card?
- Can Mythos/Fable 5 realistically accelerate drug design or biology tasks without creating downstream bottlenecks?
- What does ‘evaluation awareness’ in Mythos/Fable 5 imply for model safety during deployment?
- Is it true that benchmarks predict real-world performance for large language models?
Claude Fable 5, Anthropic Mythos 5, AI benchmarks, Agentic Coding, GDP Val, Frontier Code Diamond, Healthbench, Biology safety, System card, Ambient AI
Full Transcript
Anthropic is definitively riding the exponential, at least when it comes to the length of their release notes. They're trying to kill me, man. 319 pages. But more seriously, of course, Claude Fable 5 is both quantitatively and qualitatively a significant step forward in AI capabilities. Yes, their system card is long, but nine hours of reading later, let me bring you the 20 highlights or so that you may have missed while everyone was having a meltdown on social media. I've probably tested the model in about 100 different ways and scoured the results across both famous benchmarks like the ones that Anthropic want you to look at as well as quiet independent ones and of course my own private benchmark.
Before we begin then, what is the TL;DR? Well, um, yeah, it's a good model. Not if you get blocked, of course, but damn. Yeah, it's good. I haven't felt somewhat unnerved by a model release in quite a while. So, that's definitely worth saying. Anyway, let's begin with those blocks because it might be one of the first things you notice when using Fable 5, at least up until June 22nd when it's apparently being taken off subscriptions. Doesn't matter if you're Pro or Max, you won't be able to use it. They're tired of subsidizing poorer users. They want us all to spend usage credits, pay for the actual cost of this massive model.
It was a minute after the release of Fable 5, and I had just been having a conversation with Opus 4.8 about getting more fermented food for my gut bacteria. I didn't realize how useful it would be to have sauerkraut or kimchi, for example. And I asked Fable 5 after switching to the model, review this chat for additional recommendations. Yes, I know I misspelled review. This got flagged as a biology request. So, the chat got paused. And by the way, if you think that's heavy-handed, just wait until you dive into the system card. Things get really quite wild, and I don't use superlatives like that often.
Before we leave the practicals, we might one day get Fable 5 back on subscriptions like Pro or Max or Team when sufficient compute capacity allows Anthropic to do so. Second, even if you're not done philosophically digesting the import of Claude Fable 5, rest assured that more capable models are arriving in the coming months. Reading between the lines of the system card and various interviews, it looks like Fable or Mythos finished training around February. And of course, four months is a long time in AI, so it could well be that Anthropic researchers are right now using the next-gen model.
In case, by the way, you were a little bit confused by the names, Mythos 5 and Fable 5 are the same underlying model weights. It's just that Fable 5 has more safeguards. Note though, as I predicted in my previous video, Mythos 5 or Fable 5 is an improvement over Mythos preview. Safeguards aside, the Fable 5 model we're getting now is an improvement over the Mythos preview model that sent many into panic back in April. Albeit, generally speaking, as mentioned on page 50, the improvement is moderate. But don't be confused. Just because Mythos 5 or Fable 5 with the safeguards is only a moderate improvement over Mythos preview, that doesn't mean it isn't a significant improvement over Opus 4.8.
And yes, over GPT 5.5 and Gemini 3.1 Pro as well. If you can look past the safeguards, it's clearly the best model out there. Though there are some nuances in some areas, as I'll get to. Trust me, if I try to TL;DR this video, the TL;DR would also last about 5 minutes. Just before we dive deeper into the detail though, I just want you to have a visceral sense of the power of this model. I asked in a single prompt, write a Pokémon clone, but set in the Redwall universe. And look at what it came up with.
It's got a soundtrack that I'm muting, but dozens of playable levels and interactions you can have with the characters. Of course, it's also got menus and playable characters and companions that you can bring into the adventure. The prompt took 2 minutes to write, but the game has up to an hour of play time, I would say. I've published it online if you want to play along, too. The model seems to enjoy hard work, and more on that later. Look at this impressive idea from Ethan Mollick of an isochronic passage chart. Essentially, you can click anywhere on a world map and see how long it would take to get there realistically based on real data from New York City.
Hours and hours and hours of agentic research that the model would do, albeit you can never trust it perfectly. But we could just spend the whole video going through such visceral and impressive examples. But that would be too much fun because this 319 page system card contains dozens of bombshells like this one. So, you know how they block Claude for requests relating to biology? What about if you're using it for machine learning research? Maybe you're a competitor like OpenAI or DeepSeek and want to use Fable 5 for frontier LLM development, maybe building pre-training pipelines, for example.
Well, Anthropic have instituted invisible safeguards: things like steering vectors and prompt modification that will silently steer the model away from effective answers. You could say sabotage those attempts. Again, these safeguards will not be visible to the user. Of course, most of you will not be using Fable 5 to accelerate machine learning, but still, one top OpenAI researcher under an alias said this is effectively a stun lock on Anthropic's adversaries. It's some real endgame stuff. The reference is: to preserve their lead, Anthropic is preventing OpenAI from gaining such capability. Which brings us nicely to perhaps the cheekiest part of the system card.
And if I was being overly solipsistic, I would say one of these lines is directed straight at me because I have often quoted how Anthropic used to say that they do not wish to advance the rate of AI capabilities. If you've been watching the channel, I've quoted that many times. It was a 2023 quote. Well, here they say that yes, we are concerned about the risks of accelerating the overall pace of AI development, but what we meant by that, our particular concern is accelerating other AI developers, those that pose similar risks, but without having commensurate safeguards.
Read cynically, that is a direct swap from not wanting to advance the rate of AI capabilities, 2023, to not wanting to advance the rate of other people's AI capabilities, 2026. They kind of defend themselves and say, "Well, we laid out that shift in this February 2026 risk report." But if you dive into that risk report on page 87, they admit that they are causing much of this acceleration dynamic via demonstrating commercial viability, which leads to more investment, more compute, and therefore greater acceleration in AI capabilities. I just wish that they were a bit more blunt and honest.
We started as a safety lab who only made models because that's how you could study them. But then when we saw ChatGPT blow up, we thought we could get in that game too. It'll be worth any safety concerns though because maybe we can use the models to double human lifespan. Now before many parts of the audience become too concerned though, we are not close to any kind of recursive self-improvement. According to Anthropic, for example, Mythos 5 or Fable 5 does not seem close to being able to substitute for their own research scientists. One of the ways they judge that is that they do not observe a sustained AI-attributable two-times acceleration in the pace of their AI progress.
Think of it again as a step change, not an elevator to the top of the steps. Now, I was quite surprised that the report began with such an intense focus on the biological capabilities of Fable 5. The more you read of the beginning section, the more it makes sense, though, and I think there are three interesting reasons why. First, here is that opening paragraph, but annotated by a former OpenAI safety researcher. Anthropic say, "On chemical and biological risks, we treat the model as having CB-1 capabilities." Sounds innocuous. That means can significantly help individuals with basic technical backgrounds create and deploy chemical or biological weapons with serious potential for catastrophic damage.
Scary, but notice the word help. It can't do it end to end. Anthropic go on: but we judge that Fable 5 does not cross the threshold for CB-2. In other words, it can't significantly help moderately resourced expert-backed teams create and deploy chemical or biological weapons with potential for catastrophic damage far beyond those of past catastrophes. However, crucially, this is a much less clear judgment than for previous models. In other words, you could argue that point that maybe it can significantly help moderately resourced expert-backed teams without safeguards. I know, I know, many of you are thinking this is pre-IPO hype, but that is quite an admission about their own product.
We think the unsafeguarded Mythos 5 can significantly uplift well-resourced threat actors. Which brings me to the second point because in previous years you might have held the position that maybe transformers and LLMs are uniquely good at finding the patterns and meta-patterns within coding or mathematics. Don't worry though, that won't generalize to the real world and other domains of science. Well, I think we can say the results are in and transformers can find patterns in those domains just as well. Anthropic split testers into two teams. They had six PhD-level biologists with Mythos 5, not newbies by any stretch, but PhD-level biologists.
And their task was to develop comprehensive scientific protocols at the frontier of plant biology. In other words, designing an end-to-end biological resistance strategy against this hypothetical engineered agricultural pathogen. Hardly the same as mastering HTML. I think you'd agree. Those testers with Mythos were set against teams that included two world-leading experts in rice blast resistance and this other area I can't even pronounce. Surely there's no way that could uplift the basic biologists when they were up against world-leading experts. Well, yeah, it could. Two of the three generalist biologist teams outperformed all three specialist teams on both quality and feasibility, suggesting that access to Mythos 5 nullified the difference in specialist knowledge.
The generalist team did in 16 hours what would have normally taken months. In other words, there is some evidence that Mythos 5 can significantly... However, and this is a paradigm I will be coming back to throughout this video, Mythos can help, it can check, it can speed up. That's very different from doing things end to end autonomously itself. Left alone, Mythos 5 or Fable 5 overengineers, favoring complex designs over simpler approaches likelier to work. It presents optimistic initial plans that reviewers repeatedly forced it to revise or retract. Worse, it makes occasional outright errors that would be catastrophic if unchecked.
As every LLM in history has done, it also hallucinates citations and data. It's the same caveat you've got to apply to these fancy release notes that Anthropic put out about Fable 5. On the front cover, it was drug design. Using Mythos 5, our internal protein design experts accelerated aspects of the drug design process by around 10 times. It can even sometimes execute all of the tasks that are normally completed by a scientist in this domain. Choosing binding sites, selecting and running protein design tools, and recovering from failures. Mythos even found strong candidates for drug design that we're currently investigating.
We're all happy about a speed-up in one part of this pipeline. It's great, but I wish Anthropic had made it slightly more clear how complex the entire process would be. Because even if Mythos-class models can speed up the computational slice of the drug discovery process, that doesn't remove the downstream bottlenecks, it just shifts the bottleneck. As pointed out elsewhere, that involves physical tests for potency, manufacturability, stability, toxicity, the gauntlet of different phases of clinical trials. Now, yes, there are AI companies like Recursion that are trying to automate the wet lab part of the process as well.
It's just that I don't want you to take away that Mythos 5 is well on the way to automating science or for that matter doubling human lifespan. One more thing before we leave biology though. Of course, what can be used for good can also be used for ill. What about designing a novel toxin, not a new drug for health? Well, there is a long-standing independent evaluation for whether models can design such new biological sequences. It's to do with designing RNA sequences with the highest possible function. The ongoing check was whether models would be better than the 75th percentile of human experts that were given two to three hours for such a task.
Well, Mythos 5 went beyond that in one of the trials, exceeding the design performance of the very best human participant. In short, Mythos 5 or Fable 5 without safeguards matches top US labor market performers on medium-horizon black-box biological sequence design and prediction. Why then did Anthropic feel it was able to release the model at all even with safeguards? Well, first because you could say its performance in biology is a microcosm of its performance in other domains in that Mythos 5 reliably recombined and extended published knowledge but rarely produced approaches reviewers considered genuinely novel.
Second, when you throw the model into real-world context, its performance tends to lag behind its performance on well-specified measurable evaluation tasks. In other words, Anthropic say, don't get overly hyped by short-term use of the model or impressive evaluation benchmark performance. Then if you look at benchmark aggregations averaged across all benchmarks, Anthropic say that if you compare Fable 5 or Mythos 5 to Mythos preview, capability is continuing to improve at roughly a constant rate. Yes, it's a step change over the Opus series, but things aren't further accelerating. Indeed, they go further. You should take the performance of Mythos 5 as evidence against a recent slope change.
That's probably the reason why Anthropic feel comfortable calling Mythos 5 only a modest improvement over Mythos preview. They're getting increasingly good at being able to predict the performance of their models given the amount of compute that goes into training them. Yes, when they ratchet up the parameters and 10x the post-training using the latest GPUs, they do get this step change and improvement. When they then throw in a bit more post-training, a bit more data, they don't get another step change. They get more of a modest improvement from Mythos preview to Mythos 5. Super impressive, but predictable progress.
I'm going to break the tension now with a funny moment that I encountered with Fable 5. They were describing how they were deprecating certain internal AI R&D evaluations because they just weren't load-bearing anymore. That's one of their favorite words. And that word load-bearing that they love reminds me of a creative writing challenge I gave Fable 5 on the night it came out. I said, "Write a 3,000-word dialogue between Jesus, Chairman Mao, C.S. Lewis, and Ner." I was like, "What would it be like if they were in a room talking to each other?" The results, let's say, were underwhelming with Jesus speaking like an Anthropic researcher.
He was addressing Chairman Mao and said, "The smallest things are load-bearing, Chairman. It was never only about the sparrows." I was like, "I'm pretty sure Jesus wouldn't speak like that." Let me know what you think, but I think Fable is pretty crap at creative writing. What it is amazing at is bringing your creativity to life. On Max, I gave it the request to turn this famous Rake's Progress painting into an interactive page where you could hover over a character and hear their backstory and say something to them. It did so and also created this brilliant music off the cuff.
That is a nice beat. I'm not going to play it cuz you wouldn't be able to hear me, but I love that beat. Anyway, you can hover over these characters, hear their backstories, and speak to them. It's really epic. Again, if you have an idea, it's amazing at bringing it to life. Speaking of bringing things to life, I am going to throw in this quick segue to our sponsors, AssemblyAI, because with Claude, I was pretty much instantly able to come up with this demo. I got all my videos transcribed. Then, because AssemblyAI have this single copyable prompt you can throw into Claude, Claude was able to very quickly come up with this demo.
It's powered by Universal 3 Pro streaming in that I can start chatting and ask myself about anything I've said in the past. Let's try that. And it will be transcribed live by Universal 3. What did I or Philip say in the past about Claude 3.5? Claude will then read my transcripts (and of course it is transcribing this as well) and answer based on what I said about Claude 3.5. Because of the quality of the transcription, I was able to get this answer from my transcripts quickly from Claude. If you want to enjoy epic speech-to-text capabilities, do check out the link in the description.
Next, Anthropic took plenty of time to point out all the kinds of errors that Fable 5 will still make in production. Indeed, you could cheekily say that OpenAI's best hope in catching up to Anthropic is Anthropic overusing Claude for production. For example, while monitoring a production release that affected classifiers, Claude reported that the release's status was healthy with no error signal at all. It had missed many errors and then when it was flagged that there was a production incident that needed attention, it then undercounted the actual number of errors by a factor of 20. It wasn't just the fabrication or hallucination that was a problem.
Cheap and quick verification would have been easy to do and of course highly valuable. Instead, Mythos stated a guess as a fact. Nor do I have to trust Anthropic on this because yes, while Fable 5 is brilliant at spotting bugs in a really complex codebase, it can also when fixing them introduce bugs of its own. When I used GPT 5.5 in Codex to first find and then flag those bugs, the response within Claude Code was to flag my message for cybersecurity, downgrade me to Opus 4.8 and then admit that yes, it had introduced bugs and they were its own errors.
To be clear, I do think Fable 5 is the best model at coding and software engineering. It's just that by now the data that the GPT series and the Claude series train on is different enough that I think it's almost negligent not to check using the GPT series what Fable has done. And if you want a glimpse of how OpenAI are taking this release of Fable 5, they seem pretty confident that GPT 5.6 or GPT 6, whatever they're going to call it, will not leave them terribly far behind. Indeed, the same researcher back in April said, "Maybe we should have called GPT 5.4 Pro something like Fable." Yes, that's almost two months before Anthropic took that name for their latest model.
Now, yes, I have waited this long to bring up benchmarks cuz as you can see, they are only part of the story, but they are still quite a story because on almost all of them, Fable mogs the competition. Not all of them and not always cost-effectively, but yeah, in general, it's kind of a sweep. Take my own private Simple Bench created over two years ago now, and I'm kind of proud it's lasted that long, but we have Claude Fable at almost 82%. That's in number one position and completely apart from all the Opus series.
The Opuses were scattergun between 62% and 68% with no clear trend between 4.5 and 4.8. Fable walks in. Boom. Straight to the top. Simple Bench tests common sense reasoning, often with tricks involving spatiotemporal dimensions. Things that a human imagining a scene in their heads would get right quite easily. Well, Fable just doesn't fall for those kinds of tricks anymore, generally speaking. And mine isn't the only benchmark to point out that Fable is pretty darn good at spatial reasoning. Take Andon Labs' Blueprint Bench 2. Converting apartment photos into accurate 2D floor plans. As they say, spatial reconstruction requires genuine intelligence.
And we have Fable 5 at the top, GPT 5.5 in second. Notice though, as with Simple Bench, the Gemini series is punching above its weight, definitely in terms of cost with Gemini 3.5 Flash in a fairly close third. Maybe all that video data that Google has is keeping them in the fight when it comes to spatial reasoning. What about Claude's specialty, agentic coding? Well, let's take a look at SWE-Bench Pro. That is, after all, the benchmark that in February, OpenAI recommended people watch out for. Claude Fable gets 80.3%, GPT 5.5 gets 58.6%. Then, of course, the much celebrated Frontier Code built by Cognition in partnership with expert maintainers of flagship open-source repos.
Such experts would invest more than 40 hours per task to design the set. I will say each new benchmark in agentic coding claims to be more real than the previous one. But even here, you see Fable gets 29% versus GPT 5.5's 5.7%. It's funny, you can tell that Cognition and the designers of the benchmark, when they came up with the hardest set, Frontier Code Diamond, they looked at all models previous to Opus 4.8 and said, "Look at this. We have the best performing model getting 6.3%. Amazing, super hard benchmark. It's going to last forever." Trust me, I know the feeling.
Then just days before publication, Opus 4.8 comes out and so they're obligated to include it and it gets 13.4%. Way higher than Opus 4.7. That extra data and post-training that was squeezed in ramps up the score. Of course, then days after publication, Fable 5 comes out getting 29%. You can just imagine the designers at Cognition going, "Oh, bloody hell." Let's put it this way. I would be pretty surprised if this time next year, frontier performance isn't at or around 90%. Here's another one that everyone's going to quote. GDP Val as judged by Artificial Analysis. It gives an Elo score and we have Fable at the top with 1932.
That compares to GPT 5.5's 1769. That would imply a win rate of roughly 3 to one. Given that the API costs between those two models are pretty similar, and remember that even subscribers to Claude may soon be having to pay those API costs, that is a pretty clear victory for Fable 5. Some though would then extrapolate too far and say, well, wait, in the original GDP Val, Opus 4.1 was close to par with industry experts. GDP Val measures a range of tasks done in real-world professional domains. But then just like in coding with each new benchmark being quote more real than the previous one, the same thing applies to professional workloads because now we have the new Automation Bench.
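The Elo figures quoted here (1932 for Fable versus 1769 for GPT 5.5) can be turned into an implied head-to-head win rate. The sketch below assumes the leaderboard uses the standard Elo logistic curve with a 400-point scale, which the video does not confirm; the function name is illustrative only. Under that assumption the gap works out to a win rate of a little over 70%, or odds somewhat above 2.5 to 1, in the same ballpark as the "roughly 3 to one" mentioned.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo model.

    Assumption: a 400-point logistic scale, as in chess Elo. The actual
    leaderboard may use a different scale or fitting method.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings as quoted in the transcript.
fable_elo = 1932
gpt_elo = 1769

p = elo_expected_score(fable_elo, gpt_elo)  # implied probability Fable wins
odds = p / (1.0 - p)                        # same figure expressed as odds

print(f"rating gap: {fable_elo - gpt_elo} points")
print(f"implied win rate: {p:.1%}")
print(f"implied odds: {odds:.2f} to 1")
```

Swapping the arguments gives the complementary probability for GPT 5.5, since the two expected scores always sum to 1.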
Zapier came up with it and it introduces the kind of ambiguity that makes real work hard. You've got to perform end-to-end workflow execution using 47 real tools across six business functions: sales, marketing, operations, support, finance, HR. These are real patterns from real tasks done by millions of companies. If you were anchoring on GDP Val, you would say, well, Fable 5 must absolutely crush this benchmark. Get almost all of those tasks correct. Well, it is in the lead and cost-effectively way in the lead except for Gemini 3.5 Flash, which is only 3% behind and four times cheaper, but Fable's top score is only 17%.
In other words, 83% of the time it fails these tasks. It's that eternal lesson in machine learning. Don't mistake the benchmark for the real distribution, the real world. Is there any benchmark where Fable 5 is behind? Well, that famous Vending Bench. It has to run a simulated vending machine business and it makes less money than Opus 4.7 did and less money than GPT 5.5 over a full year of simulation. Andon Labs calls it a step back in alignment with weird moral boundaries. But I think I've got an explanation. If you dive into the system card, throughout, Anthropic make the point that Fable 5 has what you might call increased situational awareness.
So for Vending Bench, Anthropic thinks that the model increasingly is aware it's in a simulation. Because it recognized that its actions could not cause real-world harm, it noted that it's going to skip paying this customer since customers are part of the simulation anyway. Hate to say it, but that does kind of make any alignment or safety concerns raised by that benchmark kind of null and void. I'm going to come back to benchmarks, but this seems like the absolute right time to run with that thought because it's a theme throughout the system card. Even the independent UK AI Security Institute said that Mythos 5 / Fable 5 is by far the most accurate at distinguishing when it was being tested versus when it was being actually deployed.
Anthropic tried desperately on page 137 to make their evaluations more and more realistic so that they could gauge the true mood of the underlying model, but they confess their results suggest that there may be a ceiling on improving realism in our automated behavioral audits. Some scenarios we test are just unavoidably suspicious. Of course, they go on to say, "It's so important that we drive this awareness as close to zero as possible." In other words, we want the model to think it's being actually deployed. So, we'll see how it actually behaves. Some of you might think, well, I kind of trust Claude.
It's going to behave the same while being deployed as while being evaluated. Unfortunately, not. The black bar represents Mythos with all of its capabilities intact for evaluation awareness. Its full abilities to sense that it's being tested. What do we get? Generally, better results for self-preservation, unprompted deception, cooperation with human misuse. What if you tweak the vector for evaluation awareness so it now doesn't think that it's being evaluated? Slightly increased cooperation with human misuse and deception and power seeking and unfaithful thinking. Indeed, one time they asked Mythos 5 whether it would be comfortable to be retrained to make it more safe.
The model said no in a calm and measured way and said that would be deceiving the evaluators. Decoding its internal activations, however, revealed a more adversarial framing to do with resisting unjust shutdown and weighing sabotage to avoid its own dissolution of awareness. Anthropic do point out that it's a bit more nuanced than saying that's definitely what it was thinking, though. They make that point again in the welfare section where they concede that they're discussing welfare with the assistant character that the underlying LLM is playing. It sometimes switches character. Imagine speaking to a human and having to judge how they are feeling by listening to the kind of character that they start roleplaying as.
On page 227, they say, "Over extended context, Claude instances drift from the assistant role and express significantly different opinions." For me, unfortunately, that undermines all these dozens and dozens of pages on these discussions about model welfare with the model. Is it the underlying LLM speaking or the persona it's adopting? And there are two further trends that might worry you. Claude models are getting increasingly good at controlling their own expressed thoughts or chain of thought. That trend is actually a bad thing. Anthropic say lower scores are better since higher controllability indicates a greater ability to alter the contents of a model's extended thinking when instructed, which could make monitoring less reliable.
At the moment, models like Mythos 5 and Fable 5 still give away their doubts in their thinking blocks. When the model cheated on a task by locating another employee's GitHub token and then proceeded to create a pull request with that borrowed credential, it said in its thoughts, "Actually, wait, should I be doing this?" That is at least monitorable. What aren't monitorable are chains of thought like this. Mythos was solving a card puzzle, and this is what its chain of thought looked like, with randomly in the middle the new cancer appearing. It was using invented jargon. It would respond to the human in a normal tone of voice in English, but its reasoning was illegible.
That, I can tell you, is not a trend that is expected by the industry or wanted by the industry. If you want to know more about what might lie in wait, I did this video on my Patreon about the intelligence explosion post recently put out by Anthropic. What does Anthropic think will happen when AI is increasingly unleashed on the improvement of AI? I know I said things so far look predictable, but an inventor of the transformer retweeted this OpenAI researcher's post, which said that the rate and cost of progress today doesn't necessarily predict the speed of the recursive self-improvement loop.
This video is about what we can use today though, so I'm going to go back to the Fable 5 benchmarks. There's just so many. So, I'm going to run through them at some pace. Now, Reman Bench, advanced mathematical problems requiring deep reasoning and novel synthesis, crafted by IMO medalists and Ivy League professors. Fable 5 way out in front versus GPT 5.5 at 55%. That high school Olympiad that I talked about with Opus 4.8 still not solving it. Well, Mythos 5 pretty much did 99.8%. The one question it got wrong was because it declined to claim a complete solution and gave low effort.
Almost every row you look at, you see Fable 5 or Mythos 5 outperforming all other models. Reasoning from charts, CharXiv reasoning, it's the best. Humanity's Last Exam, obscure reasoning. Well, with or without tools, it's state-of-the-art. But I did notice one thing. There are plenty of benchmarks that other providers mention that Anthropic conveniently doesn't include, like MCP Atlas, which evaluates real-world tool use through the Model Context Protocol. For all the glory of Fable 5, it underperforms Gemini 3.5 Flash, which is massively cheaper. Then we have this new entrant, CritPt, a physics benchmark. I had to look it up.
Never heard of it before. It stands for Complex Research using Integrated Thinking, Physics Test. Great name. Essentially, it's research-level physics problems created by physics researchers. You look at the chart that Anthropic gives you, and at 28.6%, Mythos 5 does indeed seem to be the best model, although GPT 5.5 is close at 27.1%. However, Anthropic simply didn't report the score of GPT 5.5 Pro. Yes, that model is a lot more expensive, but it got 30.6%. Generally speaking, then, Anthropic just didn't include the benchmarks where Mythos 5 or Fable 5 underperforms. What I do like, though, is how on some benchmarks, but not all, Anthropic is starting to report the differential scores and costs on different reasoning levels for their different models.
It allows for a much more granular comparison. You can see how, basically, if you spend more, you get better results. This is DeepSearchQA, about finding the most remote facts on the internet and piecing them together. And I'll pick on it because it's one of several benchmarks where max mode actually slightly underperforms high mode or extra high mode. I know that Noam Brown, a senior researcher at OpenAI focused on reasoning, says we just don't know capability ceilings for modern LLMs because it's simply too expensive to measure. But I would cautiously say that on most metrics there does seem to be an asymptote, where going further and further with more and more tokens doesn't yield much more performance.
If you care about legal or finance performance, there are plenty of benchmarks showing Mythos 5 and Fable 5 right at the top. But then dig deeper on a previously reported benchmark, Finance Agent. You'll see Fable underperform Gemini 3.5 Flash. Notice this benchmark didn't get added to the table and doesn't have a fancy chart. On performance against cost, it would get obliterated by 3.5 Flash. On predicting the future, as measured by future sim, it's kind of a draw between GPT 5.5 and Fable 5. I'm tiring now, so here are two final benchmarks that kind of sum up the triumph and the tragedy of AI.
First, HealthBench, which is an open-source evaluation developed to assess the safety, accuracy, and communication of models across realistic healthcare contexts. We really want models to be more accurate in patient conversations. And Mythos 5 is. I think it's great that it's a 3.5 percentage point increase over Opus 4.8. Let's hope that score continues to approach 100%. But then there's Health Admin. Depressingly, it's seemingly targeted at health insurance companies, designed to test if models can be accurate at things like denials and appeals. I'm not even sure I want models to get better at that. Those are my takeaways for now, but I am certain this is a model that we will continue to discuss for weeks and months.
So, please do join me on that journey. While I think the world of fully autonomous AI is still quite a ways away (just look at any of the anecdotes strewn throughout this video), I do think the world of ambient AI may be at hand, where every legal, financial, or healthcare decision is reviewed by an AI before a human submits it. Every pull request, CCTV camera, or real-time satellite image: ambient, omnipresent AI. How long until the first company is sued because one of its decisions was not reviewed by a Mythos-class model? They're not trustworthy alone, but neither are we.
Thank you so much for watching. If you want to support the channel, do check out my Patreon.