Fable is Mythos, and it is really good.
Introduction to Mythos 5 and the newer Fable 5 with safeguards, and the performative gap between access and expectations.
Fable 5 delivers Mythos-level coding chops with strong reasoning and real-world workflow gains, but comes with hefty usage costs and strict data retention you’ll want to plan around.
Summary
Theo’s deep dive into Mythos through the new Fable 5 model is as practical as it is enthusiastic. He explains that Mythos 5 isn’t broadly accessible yet, so what testers got is Fable 5, a safeguarded variant that still feels like Mythos at full throttle. Over 24 hours, Theo and his team burned through substantial inference budgets, pushing from terminal benchmarks to fully functional multiplayer 3D racing demos and even a 15,000-line code-change modernization. He teases hard benchmarks, noting Fable’s performance on knowledge tasks and its superior coding instincts, while also flagging refusals and safety constraints that affect some tests. Beyond raw speed, the video highlights UI/UX improvements, better templates, and a surprising knack for high-quality code output, including a Rust terminal app and a Minecraft-like project built with prompts. Theo also covers cost dynamics, including $10 per million input tokens and $50 per million output tokens, 5-hour session limits, and new 30-day data retention for Fable 5. He peppers in real-world usage anecdotes—debugging Lakebed PRs, porting apps to modern stacks, and even creating a robust racing game with multiplayer. The sponsor section plugs Blacksmith for faster CI on MacOS and Linux, underscoring the broader shift AI tooling is driving in dev workflows. Overall, Theo stresses that Mythos/Fable represents a meaningful leap in what AI can code, test, and design—worth experimenting with now while planning for pricing and policy caveats down the line.
Key Takeaways
- Fable 5 is a guarded but highly capable Mythos variant that delivers strong coding and debugging results despite safety constraints.
- Benchmarks show Fable outperforming Opus on many tasks and approaching GPT-5.5 performance in code-related scenarios, with fewer tokens spent.
- Cost and usage policies are aggressive ($10 per million input tokens and $50 per million output tokens; 5-hour session limits; 30-day data retention) and can impact long-running workflows.
- Real-world tests include modernizing a large codebase (a 15,000-line change) that worked after a few rounds of fixes, and a more ambitious port from Prisma and tRPC to TanStack Start, Convex, and Clerk that got further than expected but broke core functionality.
- UI/UX improvements and better design sensibilities in code generation help produce more production-ready code outputs.
- Full AI-driven workflows can substantially cut debugging time and enable parallel exploration of big problems, but require careful cost management and policy awareness.
- Public benchmarks (Frontier Code, Skatebench, etc.) reveal both strengths and potential biases in model evaluations, especially around reasoning consistency and refusals.
Who Is This For?
Software engineers and engineering leaders evaluating Mythos/Fable for coding, debugging, and rapid prototyping. This video is essential for teams weighing the economics of aggressive AI usage against the payoff in code quality and velocity.
Notable Quotes
"Mythos really just feels like more Opus, like they turned it up to 12 or 13 somehow."
—Theo’s first impression comparing Mythos to Opus with a noticeable uplift in capability.
"It’s a damn good model."
—A blunt assessment of Fable’s overall quality and utility.
"The economics of software development have changed on a fundamental level as a result of this drop."
—Pricing and budget impact central to adopting these models.
"Go play with this. Push it to its limits. Let me know what you’re able to build with it."
—Call to action to explore capabilities and share results.
Questions This Video Answers
- What makes Fable 5 different from Mythos 5 and Opus in coding tasks?
- How do the 5-hour session limits and 30-day retention policy affect long-running AI-driven projects?
- Are the UI/UX improvements in Fable/Mythos worth upgrading for production code generation?
- What are the cost implications of using Fable 5 for large teams vs individual developers?
- How reliable are AI-generated PRs and how should teams integrate AI-reviewed changes into their workflow?
Mythos, Fable 5, Mythos 5, Code generation, Frontier Code, Opus, Skatebench, UI/UX in AI, Rust terminal app, Minecraft clone project AI
Full Transcript
Sorry this video is a little later than usual. I wanted to do my due diligence when testing this new model series because it is a big big deal. We've been hearing about Mythos forever and to finally have it in our hands is unbelievable. It's so unbelievable that we don't actually have it in our hands because Mythos 5 is not the model that we all have access to now. What we did end up getting is a new model called Fable 5, which is still Mythos, but it has a bunch of safeguards in front. And while those guards definitely cause it to not do what we want in a lot of different scenarios, that hasn't stopped me from using the hell out of this model.
In the 24 hours since it dropped, I've done over $600 in inference. Oh, wait. That's my Mac Mini. Yeah, that looks a little more right. I've done about $2,000 of inference on this model already. I burned through the 5-hour session limits on two $200 accounts at the same time, which I never thought I would be able to do. I've been pushing so much [ __ ] with this model, and it's not just me either. I got my whole team upgraded to the $200 a month plan and they have been cooking crazy stuff, too, from terminal-based 2.5D adventures to fully functional multiplayer 3D racing games.
Yeah, my team's been cooking. I can't wait to show off all the cool things we've been able to do with this model, as well as the problems you might encounter as you build with it. But since I just spent like $3,000 on subscriptions for my whole team and we'll probably have to spend a lot more on inference for this model, I hope you understand why we're doing a quick sponsor break. I know making fun of GitHub is all of the rage and I've been doing it a lot, but there's one thing on GitHub that is genuinely just unacceptable to still be using.
It's the CI. If you're still on GitHub CI, please pay attention. You've already heard me talk about Blacksmith, but you need to actually try it. It's so easy to set up. You go to the YAML file, you swap Ubuntu latest to Blacksmith 4 vCPU. You sign in with GitHub in order to link it. And now you can have builds that are up to four times faster and also half the price. They say 60% here, and this is what a lot of people see, but we've seen much further. I've seen most of my build times get cut in half.
And when you combine that with the much cheaper price, the result is as low as a fourth of the cost. That's why everyone from Mercury to Expensify to Descript and Exa to Clerk and also now T3 Chat and T3 Code are all building with Blacksmith. They also just introduced runners for macOS, which has been a game changer for us. Our Mac builds used to take upwards of like 16 to 20 minutes and now they're consistently under 10, which is unbelievable. It's made us way more excited to do new releases for T3 Code. We even introduced nightlies because of how much Blacksmith sped up our builds.
If this was just the faster builds, that would be worth the money. And honestly, I'm down to pay for it just for that. But the better logs, monitors, and history systems are way, way more powerful than they have any right to be. I had no idea how much I was just flying blind as I was using more and more of GitHub Actions. And now that I can actually see what's going on, I can set up custom monitors. I can see where the error rates are happening, what actions are failing, and why. Debugging things is suddenly so much easier when you're trying to figure out why your CI is still slow.
Having a UI like this that shows you clearly which steps take how much time and how many lines of output they generated? Oh, game-changing. Your team and your agents deserve better CI. Get it now at soyv.link/blacksmith. So, let's talk about these models. I have a lot of layers to go into here. From how it feels to build with them to all the benchmarks that we got to a handful of benchmarks that aren't even public yet that I was able to get early access data for so I can show you guys first. So, you should be excited.
I know I am. I took a lot of time to get this done, to do it right. I'll do my best to not waste time and just cover the facts. So, we'll start with the first section here, which is the scores that the models got. Notice that they say Mythos 5 / Fable 5 here. The reason for that is because a lot of these benches had questions that Fable 5 outright refused to answer, which would plummet the scores. Even something simple like Terminal Bench saw a 20-point drop compared to when the same run was done with Mythos because it just couldn't run a bunch of the problems.
But when you take a look at what its capabilities are, as in what it's able to do unrestricted, it crushes SWE-bench Pro. It got an 80% on it, where GPT 5.5 only got a 58.6. Remember though, we don't like that bench. I did a detailed breakdown of why SWE-bench Pro is kind of garbage now, and this model unsurprisingly kills it, because most of the bench is existing pull requests and handing the model the description of this PR that merged in an old commit hash, and it just seems like Mythos 5 has so much data in its training that it has the details to recreate those PRs.
The Frontier Code bench, however, is much more interesting and we'll dive into that more in the future. For now though, it got a 30% when Opus 4.8 only got a 13 and GPT 5.5 only got a 5.7. On knowledge work, it did very well. Its vision capabilities are a meaningful upgrade from the Opus line. It's finally leading GPT in vision. Apparently, it's beating out Gemini 3.1 Pro. I'm not sure about that, cuz Gemini vision stuff tends to be really good. It's really good at spatial reasoning. It's the first time they've taken this lead from OpenAI, which is cool to see.
Like, I have a lot of fun things to test with spatial reasoning. That said, it didn't do as well on my spatial reasoning bench, Skatebench. This is a bench where I measure every model's capability of naming skateboard tricks based on a description of the trick in three-dimensional space. Google's maintained their absurd lead here with 3.1 Pro at 98%, which is insane. And almost every other model is just not coming close. There's a couple problems in my private version of Skatebench that very few labs get. For some reason, Google does and the rest don't. And we don't see much progress here with Fable 5 at a 79%.
A lot better than the 72 they got previously with 4.6, but 4.8 was actually a pretty big regression. I don't know if there's something wrong with the thinking on that, but it struggled a lot and only got a 47. Fable did better, but it also doesn't want to reason on this no matter what I've done to try and force it. On a previous run where I threw it on max, I was able to get up to an 83% from Fable, but that's still nothing compared to Google's scores, which are as high as a 98% pretty often.
Not the best bench in the world. Just an interesting one to compare things against. And as I mentioned before, it scored well on Terminal Bench, but the model we get scored much more poorly because it blocked a handful of the requests in the Terminal Bench tests. We should talk about the Frontier Code numbers a little bit because they are nuts. This is the second of their three tiers. They have their general tier, which is 150 problems, their harder tier, which is 100, and then their diamond tier, which is just the 50 hardest. The numbers we were just looking at were for that 50.
This is the number for the 100, where it's a lot closer but still almost double GPT 5.5's score. I have suspicions about this bench. I've been talking about them a bit with others. When I dug through Frontier Code, I found some things that make me skeptical of it. In particular, this chart here, which shows that Opus 4.8 did better on minimal reasoning than low by like 30-plus percent. It went right up to where it was before on medium, and then for some reason high suddenly does really well at like 13-something percent. And then it drops again when you go to max.
This looks like a random number generator to me. If I'm being real with you guys, the amount of hard problems it can solve should go up as reasoning goes up, not randomly go up and down. I'm lucky to have data curve sharing the deep SWE numbers with me early. If you haven't already watched the deep SWE video, highly recommend it. Most of these code benchmarks suck. This is one that I think makes actual sense. At the very least, it lines up with my experience. And what it shows here is that Fable on XHigh performs comparably to GPT 5.5.
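The "random number generator" complaint about the Frontier Code chart can be turned into a quick sanity check: list the reasoning levels in order and flag every step where the score drops. This is only a rough sketch. The level names come from the video, but the percentages below are placeholders shaped like the chart as described, not the real Frontier Code data, and `non_monotonic_steps` is a made-up helper name.

```python
from typing import Sequence

# Reasoning-effort levels in increasing order, as named in the video.
LEVELS = ("minimal", "low", "medium", "high", "max")


def non_monotonic_steps(
    scores: Sequence[float], levels: Sequence[str] = LEVELS
) -> list[tuple[str, str]]:
    """Return every (lower_level, higher_level) pair where the score dropped."""
    if len(scores) != len(levels):
        raise ValueError("need exactly one score per reasoning level")
    drops = []
    for i in range(1, len(scores)):
        if scores[i] < scores[i - 1]:
            drops.append((levels[i - 1], levels[i]))
    return drops


# Placeholder percentages shaped like the chart described above (NOT real data):
# minimal beats low, medium recovers, high spikes to ~13, max falls again.
hypothetical_scores = [9.0, 6.0, 9.0, 13.0, 8.0]
print(non_monotonic_steps(hypothetical_scores))

# A curve that only rises with more reasoning reports no drops at all.
print(non_monotonic_steps([2.0, 4.0, 6.0, 9.0, 13.0]))
```

Two drops across five levels is the kind of shape that makes the bench look noisy. A single small dip would be less alarming, so in practice you would probably want a tolerance before calling a step a real regression.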
Remember though, Fable is somewhat nerfed simply by the safety restrictions. I have no reports of them hitting those restrictions or even mentioning that it dropped down to Opus or whatever automatically, which is what a lot of benches do. They just drop down to Opus when Fable refuses, to give it a reasonable experience similar to what you would have with Claude Code, because that's how it works there. They got numbers very close to GPT 5.5. But more importantly, they crushed all of Opus' scores with fewer dollars spent. So, even though this model is much more expensive, it ends up being a better deal overall compared to most Opus models because it uses way fewer tokens.
This is a great change and I'm thankful to see Anthropic token maxing a little less hard, at the very least with the capabilities of this model. It's a very good thing because the price is double Opus' price per token. Specifically, Fable and Mythos are offered at $10 per million input tokens and $50 per million output tokens. This also means they burn through your limits way faster. And more importantly, Claude Code subs only have access to Mythos in Claude Code from today until June 22nd. Fable's included for now, but that will change on June 23rd. If capacity allows, though, they will extend the window.
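To see how a model at double the per-token price can still be the better deal, here is a rough sketch of the math. The $10 and $50 prices are the ones quoted above. The Opus prices are simply derived from the "double" remark rather than quoted, and the token counts for the example task are invented purely for illustration.

```python
# Prices quoted in the video for Fable/Mythos, in dollars per million tokens.
FABLE_INPUT_PER_M = 10.0
FABLE_OUTPUT_PER_M = 50.0

# Assumption: "double Opus' price per token" means Opus is half of each.
OPUS_INPUT_PER_M = FABLE_INPUT_PER_M / 2
OPUS_OUTPUT_PER_M = FABLE_OUTPUT_PER_M / 2


def run_cost(
    input_tokens: int, output_tokens: int, input_per_m: float, output_per_m: float
) -> float:
    """Dollar cost of one run at the given per-million-token prices."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000


# Hypothetical task: the token counts are made up, not measured.
opus_cost = run_cost(4_000_000, 400_000, OPUS_INPUT_PER_M, OPUS_OUTPUT_PER_M)
fable_cost = run_cost(1_500_000, 150_000, FABLE_INPUT_PER_M, FABLE_OUTPUT_PER_M)

print(f"Opus:  ${opus_cost:.2f}")
print(f"Fable: ${fable_cost:.2f}")
```

With the same mix of input and output, the break-even point is exactly half the tokens: any task Fable finishes in under half of what Opus burns comes out cheaper despite the higher sticker price.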
A big part of why they were finally able to put this out is because they have GPUs. Thank you, Elon. So, those are the benchmarks. That's the like core info on it coming out. How does it feel to actually use this model? Well, as I mentioned, I've been using it a lot. I've been juggling between accounts. I have been token maxing using ultra code and tons of workflows for the last 24 hours. It's a damn good model. I'd go as far as to say it's the best coding model ever released by quite a bit even.
It feels very different from the jump from something like GPT 5.4 to 5.5, where the simplest I can put it is that GPT 5.5 felt like they went to 5.4, they trimmed a bunch of stuff out and then they rebuilt it to be better. Mythos really just feels like more Opus, like they turned it up to 12 or 13 somehow. It's more thorough. It's more willing to cheat. It's harder working. It is smarter. It is unbelievably capable. And it gets [ __ ] done, usually at least. I try one of my favorite tasks, which is taking my old ping.gg codebase for the video streaming app I made in 2021 that I haven't touched much since 2022, so it's very out of date, and I try to let the model go and modernize it.
It used ultra code. It burned quite a bit of tokens. And while the final build it created didn't work, about four turns later of me screenshotting errors and pasting them in, it got a fully working version, which was really impressive. This codebase is massive. This is a 15,000-line code change and it worked. It didn't one-shot it, to be very clear. Like I had to put in a good amount of time going back and forth with it, but it did succeed and that's really cool to see.
Only a few models have even come close and Fable is one that is capable of doing this type of work. So I tried to go a bit further and have it modernize the project with the things I like to use now, instead of using the old MySQL database through Prisma, the old hacky build of tRPC, and custom self-rolled auth that was breaking all over the place. I asked it to move over to TanStack Start instead of Next.js, to Convex for all of the data stuff, and to Clerk for the auth layer. I actually don't think I asked it to use Clerk.
I think it just picked based on my system prompt. Yeah, I do have something in my global that mentioned that. So, that's probably where that came from. We look here. It is doing yet another workflow where it's going through trying to diagnose weird auth issues I was having because, as good as Clerk is, it managed to hit some edges. It made some silly mistakes, like with the layouts on this page here. Amazed that it screwed up something like this with the port. It did an okay job. The Google one will get us in trouble though, sadly.
It maintained my browser support, which is good. This is progress. It didn't work for this before. Oh god, what's this menu? It does some questionable things, to put it lightly. Why do I have to request to join my own room? I own this room. Okay, so this type of port is still beyond what this model is capable of. It broke almost all the core functionality. It has a ton of like random UI regressions, but it got further than I would have expected. Honestly, it's good to see it able to get even close to this, until you learn how much it cost to do.
And this is where one of the big problems with this model starts to show. I was running this workflow before bed last night at around 11 p.m. or so and it started getting pretty far, but then right as it was finishing the last step in one of the like bigger workflows it was doing, I hit my usage limits and I wanted to just see it get to the end so I could film my video and go to bed. So I decided to switch over to usage based billing so I could just let it do some work and finish up what it was working on.
That ended up being a mistake because it spent $100. How long do you think it took to spend that $100? An hour? Five hours? Maybe like late into the night? It was about 8 minutes. Eight [ __ ] minutes to do $100. And I realized that my usage limits were much better than that with the sub. So, it's probably worth just getting another. So, I did. And then I maxed that sub out in another 2 hours. To be clear, only the 5-hour limit, the session limit. But god damn, the fact that I could in one run with one workflow with one prompt max out two subs and then have to go to bed without the task being done because I didn't want to spend a thousand additional dollars on it.
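For a sense of scale, here is that burn rate worked out as a small sketch. The $100, the 8 minutes, and the $1,000 come from the story above, and the $10/$50 per-million prices are the ones quoted earlier. The token range is only a bound: it assumes plain list prices and ignores any caching or discounting, which the video doesn't go into.

```python
def burn_rate_per_hour(dollars: float, minutes: float) -> float:
    """Spend rate in dollars per hour."""
    return dollars / minutes * 60


def token_bounds(
    dollars: float, input_per_m: float = 10.0, output_per_m: float = 50.0
) -> tuple[int, int]:
    """(fewest, most) tokens the spend could cover: all output vs. all input."""
    fewest = int(dollars * 1_000_000 / output_per_m)
    most = int(dollars * 1_000_000 / input_per_m)
    return fewest, most


def minutes_for_budget(budget_dollars: float, rate_per_hour: float) -> float:
    """How long a budget lasts at a steady hourly spend rate."""
    return budget_dollars * 60 / rate_per_hour


rate = burn_rate_per_hour(100, 8)
low, high = token_bounds(100)
print(f"Burn rate: ${rate:.0f}/hour")
print(f"Tokens behind $100: between {low:,} and {high:,}")
print(f"$1,000 more would last about {minutes_for_budget(1000, rate):.0f} minutes")
```

At that pace the extra thousand dollars would have bought a bit over an hour of the same workflow, which is why the subscription limits look so generous by comparison.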
It's rough. It is. And I'm very excited for this level of intelligence to be cheaper eventually. Since I was on a fresh account, I did get some useful info. From my rough math, it looks like one of those 5-hour sessions that you get is worth roughly 25% of your weekly limit. So when you max out your 5-hour, you can keep going when the timer is up, but you can only do that to the end four times in a week. And I could see myself easily hitting that. This means I also suspect they'll be putting out a higher price tier in the future that has even more subsidization because their margins are good enough.
They can probably afford it now, especially with all the new GPUs. But it's also clear they're not sure what the usage patterns will look like and how it will affect their compute. That's why they said here, "If capacity allows, we'll extend the included window," because they don't know yet. That's not a question of whether they get enough GPUs. That is a question of whether they can keep up with the demand. I did have a few moments with this model that made me very frustrated though. It knows so much that it sometimes confuses itself. One example I have is when I was working on Lakebed.
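The weekly math there is simple enough to write down. A tiny sketch, assuming the rough 25% estimate holds (it's an eyeballed number from one fresh account, not a published limit):

```python
SESSION_HOURS = 5
SESSION_SHARE_PERCENT = 25  # rough estimate from one fresh account


def full_sessions_per_week(session_share_percent: int) -> int:
    """How many fully maxed sessions fit inside 100% of the weekly limit."""
    return 100 // session_share_percent


sessions = full_sessions_per_week(SESSION_SHARE_PERCENT)
maxed_hours = sessions * SESSION_HOURS
print(f"{sessions} maxed sessions per week, covering {maxed_hours} of the week's 168 hours")
```

So a heavy user who maxes every session is out of weekly quota after four of them, roughly 20 hours of wall-clock session time.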
I had it do an audit of my cloud project that I've been working on for a bit. Hopefully we'll ship a real version of this very soon. And in this project, I have two environments. I have production, which is linked to the prod branch on GitHub. But I separately have staging, which is linked to the main branch on GitHub. So when I merge changes, they autodeploy on staging. A new package comes out for the staging environment. And once I've verified everything is good, I then promote it to prod. When I had it auditing this project, it insisted that the current versions of the package were broken because they required some fields to deploy that only existed on the staging environment and on main.
So, it audited the main branch. It audited the npm packages and said, "These are incompatible. Everything is broken. What are you doing?" And like wrote a whole report about how I need to fix it ASAP. And I'm like, "No, that's not how any of this works. The package deploys to the production environment. The main branch is mapped to staging until we promote it." And when I told it that, it rewrote the report with that information, still insisting it needs to be fixed. I'm like, "No, that doesn't need to be fixed. That's just how the environments work.
Chill out, bro." And that was one of my first impressions, and it wasn't great. So, I didn't have that high of hopes when I started asking it to do other big work and overhauls and crazy tasks that I didn't think it'd be capable of. And it always got pretty damn close, if not making it there. I've seen a lot of other people mentioning they're getting tons of refusals with the model. That has not been my experience, at the very least while using Claude Code, almost at all. I think I got one refusal total.
On the website, however, I've gotten a lot of them. When I was testing out the claude.ai site for Fable, I asked some questions about some like SEO stuff for agents, cuz I noticed certain things weren't coming up when I asked about it in Claude. And I was like, "Okay, how do I make it so these results are more likely to be noticed and surfaced by agents?" It thought I was trying to manipulate data for my competition or something and routed me to Opus 4.8, and I had to correct it and say, "No, I am Theo. I am t3.gg.
I own this website. I'm just asking if it's a good place to put this information for SEO and agent search reasons." And then it let me go back to Fable and I was like, "Okay, yeah, I get it now." Also, you have to negotiate with the model to convince it that you should be allowed to use Fable for the task. And it's not always clear about whether or not you're being routed to Fable. When you hit on sensitive topics like cyber security, biology, and chemistry, it will route you to Opus and be pretty transparent about it.
But with certain things, like in particular, if you're working on frontier LLM development, like you're trying to make your own models, it won't fall back to a different model. Instead, it will limit the effectiveness through methods like prompt modification, steering vectors, and parameter efficient fine-tuning. The interventions will not affect the majority of coding work. We estimate they'll impact 0.03% of traffic and fewer than 0.1% of organizations. That is still scary though. This means that they are intentionally making the model dumber when you try to use it for certain things and they don't tell you when that happens.
So, you're being billed full price for a model that is dumber and don't even know when that happens. This is bad enough that I plan to do a whole video on it in the near future, but I don't want it to be the focus of this one because there's enough good to talk about here that I do really want to focus on that. The Artificial Analysis bench did very well, as expected. It is now the smartest model ever according to their benchmark. And let's be real, it's the smartest model we have access to by a lot right now.
But there are some interesting details in here. It's five points ahead of 5.5, which is huge. It's one of the biggest leaps we've seen in a while. But more interesting is the experience they had testing it. Anthropic states the fallback to Opus 4.8 occurs in fewer than 5% of sessions on average, but Artificial Analysis recorded fallback routing in 8% of the tasks across the intelligence index, mostly in scientific questions from evaluations like GPQA, AA-Omniscience, and Humanity's Last Exam. It apparently bombs HLE, like nearing a zero, because it can't answer so many of the questions.
But Mythos gets a state-of-the-art score. Just interesting to see that their attempts to make the model safe are now actually making it measure dumber. These numbers scare me in particular. This benchmark might not be the biggest, most important one ever. It's called [ __ ] bench for a reason, but it grays out refusals and it refused to answer at all on 33% of the questions in the bench, which is kind of insane. I will again emphasize that I've only really had refusals when asking about like cryptography problems or SEO stuff in the Claude app. Claude Code has been nowhere near as aggressive with refusals for me, but I know others have experienced much worse, and I again plan to cover that all in detail in the near future.
I want to talk a bit more about code though because there's other pieces that are genuinely super impressive. It's way better at UI. I honestly feel like Opus got away with a lot because the templates it felt like it had built in were good enough to look pretty good a lot of the time, but those templates got old fast. I can't tell yet if Mythos and Fable have a newer, better set of templates or if they actually now have some design instinct, but I can tell you confidently a lot of these look better. Some look cringe as [ __ ] still, but a lot of these look nice.
Like the little tape it added there, the structure of the page, it's solid. It's not as aggressively AI generated. And I could see a lot of these designs being good starting points. This one almost feels like a Gemini design, actually. Now that I look at it, it's good. Like these are workable. It's kind of crazy how quick models got decent enough at design type work. We got to have a bit more fun though. Let's play with some Rust. This is the game I mentioned Maria made before. It's a terminal-based 2.5D game where you can give instructions like a classic text-based adventure.
They also click which works better than I would have expected. They keep me indoors for defensive purposes. I found 23,019 bugs and they grounded me for it. A friend at SpaceX owes me a favor. Say, "Wait." A dot appears in the sky. It grows. 230,000 GPUs slightly used. Rock stickers are still on it. Quick type pipe connect. Oh boy. Some issues with the text, but the fact that this all works at all, and that I have this environment in the terminal, is just unbelievable. For a model to have been able to build something like this is truly insane.
She also ported T3 Code to a terminal app in Rust. She already had T1 code in TypeScript, but she successfully ported it to Rust and now has a full terminal UI where you can like click around, make new threads, get responses. Wild [ __ ]. Addie made a full Minecraft clone. I did not realize this was this legit. I saw her posting some screenshots and clips, but it's a story game. The copy's good. I don't know how much of this she wrote. Like, I know it's silly, but having a proper Minecraft clone where it made all the assets, it made all of everything.
This is unbelievable. I'm very impressed, cuz remember they don't even have image gen, so it had to like make all these textures itself, probably programmatically somehow. Wild. Proud of my team. I teased the racing game earlier, but what I didn't mention properly is that it has fully functioning multiplayer as well as a spectator mode. So my whole team was just playing this with each other all night last night, which was crazy. Again, this is a piece of software that didn't exist before Mythos dropped, and they whipped it together with a few prompts in like an hour or two.
This level of capability, this level of 3D understanding, this level of one-shot ability is unbelievable. It has been a slow burn to get this far, but seeing it in action, it just hits different. It's wild to see the tools getting this far so much faster than I would have guessed. I'm not as creative as my team, so my use cases are not as cool as a lot of theirs. But one of the things I did do is I had it analyze the results on Skatebench, so I could get like a breakdown on interesting things.
And it found some really cool info that I didn't have before. One piece it found in particular that's been super helpful to know is that Skatebench is, quote, "really two benchmarks." I still don't like the way it writes copy, but it points out that two of the problems here are nearly unsolvable for most models. The tricks are a Caballerial, or a full Cab, which is a fakey 360. And since I describe it as switch and off the nose, it calls it a switch nollie backside 360 (this occurs over 163 times in my bench), or a switch nollie backside big spin for the fakey big spin.
If you're not a skater, these things don't mean anything to you. What it means is that the other models are following the description of the trick too literally and not compressing the name properly. Stupid analogy here, but hear me out. If I'm building a computer and I need a fan, that is a computer fan. If I'm using a computer and I need a keyboard, that is a computer keyboard. But nobody calls it a computer keyboard. They just call it a keyboard. Is it correct to call it a computer keyboard? Yes. But people would look at you funny when you say it.
And that's what all of the models did when asked to name these tricks, other than Gemini 3.1 Pro. And it was actually really cool to see Mythos, even though it also got the question wrong, able to identify this and describe the results so well. Here it also identified some very interesting trends around reasoning levels and how certain models do or don't really do anything with that. As a sanity check, I ran the same prompt with the same data against GPT 5.5 and it didn't get any interesting trends at all really. It found that the 360 heel flip question does a good job of separating the bad models from the good ones, but it didn't identify the specific interesting points that we talked about before.
Just an interesting thing to ask two models to do. And yeah, I'm impressed with Mythos's ability to take large amounts of data and do things with it. It also recreated the whole bench. Something I haven't done yet though is look at the web version, because it also overhauled the visualizer. Oh no, it put all the questions public. I can't do that. I don't want people distilling on it. Oh, speaking of distilling on your inputs, it is very likely a big part of why this model is so good is because they are distilling on our Claude Code histories.
And alongside that, they also added a new data retention policy. So all usage for Fable 5 will require 30-day retention for all traffic. So even if you have the trusted setup with Anthropic where they don't save your data, that cannot be the case with Fable 5. If you are using this model, Anthropic is getting your data. They claim that they won't use this to train new models or for any non-safety related purposes. And they also have new privacy protections for logging and whatnot, but they are now storing your data, which means you literally cannot use these models for a ton of real world things where the data retention policy would make it against the company policy or even against the law in some cases.
Simon also didn't get early access. So this is based on his first few hours of use, but he called out specifically that while he doesn't care how much a model knows, he definitely can tell that this model is big. It has the quote big model smell. He recently built a micro Python WASM binding so he could use WASOM for Python code inside of a web container and he decided to try and get it to be full Python instead of just this subset. After a little bit of encouragement, it spit out this 14 megabyte WAM binding that apparently just works.
He updated his Datasette agent, which is an app he built for managing and looking into SQLite databases, with a bunch of huge new features, written entirely by Fable, that he ended up actually releasing. He said he was impressed with the quality of the API design, tests, code, and documentation that Fable put together. And I agree here. That's one of the things I've liked about Fable: the code feels more like code I want to hit merge on. It's not measurably better in the sense that it performs way better or whatever. It just feels smarter, like it's writing better code.
It feels like a better employee or a better person with more years of experience than previous models. All models will put in enough effort to solve most problems now, at least the high-end good ones. But Mythos and Fable now write higher quality code, and at the absolute least, I plan to keep using them to go review changes I made and find ways to simplify and make the code better. It just has more taste. It's still weird as [ __ ]. It still feels like a Claude model. It still has its random errors with tool calls and their [ __ ] harness and all of that, but it feels better.
The outputs are much more usable than anything I've gotten out of an Anthropic model before and better than I'm even seeing from OpenAI models now. It's good. I want to talk now a bit about how we should use this model and how it should affect our day-to-day work. I like this post from Walden as a starting point. Your organization will likely not scale with the exponential curve of AI. This should be a wake-up call for engineering teams. Set up your cloud software factories now. Models can fix impossible bugs. UI test the hardest flows. Write extremely good code, etc.
I haven't opened Datadog manually as far as I can remember. AI should be the first line of defense for bugs and feedback. Humans should only look at PRs after an AI has already reviewed them. AI should generate screen recordings of any PR before a human eye has even reached it. The agent should just prompt itself most of the time. I have still not fully come around to agents prompting themselves. I find agents write [ __ ] prompts, but with things like the workflow and ultra code features that I've been abusing recently in Claude Code, I'm starting to come around a bit.
It feels like I'm lighting tokens on fire, but it can make working results often enough that I find myself pushing way further than I previously did. In the past, I would rotate between being super parallel and being super locked in on the loop, especially once I started using 5.5 on fast and low and medium. I would just go back and forth a whole bunch and churn out work quickly. But I wasn't experimenting as much because I was in the loop. I wasn't really able to do multiple things at the same time and still get the quality I wanted out.
Mythos is capable of going off and exploring and trying bigger things with much vaguer instructions. You can give it something as vague as "look into other options to make this more performant" and it will synthesize good ideas, test them, validate them, and then come back to you with results. Another just unbelievable thing it did is it started writing fuzzers to check for potential issues inside of Lakebed and found a handful in a huge overhaul I was doing of the database architecture there. It's able to come up with solutions, and that feels different. I'm not saying old models couldn't do this before.
I'm saying I didn't trust them to do this before, and if one did come up with a solution, I would audit it deeply and use other models to come in and give feedback on it before letting it proceed. Mythos is smart enough, it has enough knowledge baked in, it has enough taste baked in that I can trust it more to go out and figure out these things and then come back with results. Does it do this perfectly 100% of the time? Absolutely not. But you should be running into those problems more because you should be pushing the limits of what it can do more.
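For anyone who hasn't written one, the fuzzers mentioned above can be pictured with a minimal round-trip sketch like the one below. This is an illustration only, not what Mythos actually generated and not Lakebed's real API: `encode_row`, `decode_row`, `random_row`, and `fuzz_round_trip` are all hypothetical names invented for this example. The idea is to generate lots of random inputs, push each one through the code under test, and collect every input that breaks an invariant (here: decoding an encoded row gives back the same row).

```python
import random
from typing import Callable

Row = dict[str, int]

# Hypothetical stand-ins for the code under test. Lakebed's real internals
# are not shown in the video; these only give the fuzzer something to call.
def encode_row(row: Row) -> str:
    return ";".join(f"{key}={value}" for key, value in sorted(row.items()))

def decode_row(text: str) -> Row:
    if not text:
        return {}
    pairs = (item.split("=", 1) for item in text.split(";"))
    return {key: int(value) for key, value in pairs}

def random_row(rng: random.Random) -> Row:
    """Build a small random row with short letter keys and integer values."""
    row: Row = {}
    for _ in range(rng.randint(0, 5)):
        key = "".join(rng.choices("abcxyz", k=rng.randint(1, 4)))
        row[key] = rng.randint(-1000, 1000)
    return row

def fuzz_round_trip(
    encode: Callable[[Row], str],
    decode: Callable[[str], Row],
    iterations: int = 1000,
    seed: int = 0,
) -> list[Row]:
    """Return every generated row that does not survive encode -> decode."""
    rng = random.Random(seed)  # fixed seed so any failure is reproducible
    failures = []
    for _ in range(iterations):
        row = random_row(rng)
        if decode(encode(row)) != row:
            failures.append(row)
    return failures

if __name__ == "__main__":
    failures = fuzz_round_trip(encode_row, decode_row)
    print(f"{len(failures)} failing inputs")
```

Passing the encoder and decoder in as arguments is a deliberate choice in this sketch: the same loop can be pointed at an old and a new implementation during a rewrite, which is the kind of situation described above.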
It's hard to put into words how often I've been genuinely impressed with the outputs I got from this model. And not just in code either. It has enough taste to be helpful with things like finding good content to make videos about. I've had a bunch of YouTubers hit me up saying it's the first model that can generate titles in a way that isn't cringe and terrible. It is smarter and it has taste and because of that you can trust it with more. It's like going from a really cracked junior engineer to a kind of laid-back senior one where it knows enough to make good decisions by itself.
And I've never felt more like I'm along for the ride it's taking me on rather than I'm the one steering the ship. And I think we should embrace that a bit. Am I saying you should fully let go, stop reading code, and just vibe it out? Absolutely not. What I'm saying is the ceiling for this model is meaningfully higher and we should be pushing to hit it rather than just being excited when a problem 5% harder is solvable. We should go for problems that are five times harder, be surprised when it fails, and like figure out where the line is between the two and what we can give the model to make it better at unblocking itself.
Things like giving it computer use capability so it can debug in the browser itself. Things like letting it write fuzzers or test beds or whatever it needs to be more confident in the code that it writes. I had this model go through all of my stale PRs in Lakebed, make work trees for each of them, analyze them deeply, and give me its thoughts on whether or not it should be updated, rewritten, or thrown out entirely. I cleaned up a shitload of PRs in an hour or two using this model. I rewrote a bunch of data layers with this model in a few more hours.
I ported an app from a 5-year-old codebase nobody wants to touch to a modern stack, which admittedly has its bugs, but is absolutely fixable because I let it go and do its thing. It's time to think a little more boldly about how we use these things, especially during this short window where we have thousands of dollars of usage for just 200 bucks a month. I know I just got a lot of people to cancel their Claude Code subs, and I stand behind that, especially with the business practices Anthropic was engaging in at the time.
But given a combination of their realization that they need to be nice if they want to survive and this model being just so far ahead of anything else currently on the market, I think you should try to squeeze as much usage out of those plans as you can right now to get a good feel for the model, to see where it succeeds and where it fails. So instead of spending $1,000 on a PR that you can't merge, you spend $200 on 10 PRs, three of which you can merge. The economics of software development have changed on a fundamental level as a result of this drop. And I want to make sure you guys know that, even despite me being an Anthropic hater, this is worth paying attention to. Things are changing.
They are accelerating. And the market's going to catch up in crazy ways. I expect pricing for models this smart to go down as competition ramps up. I expect Google to do jack [ __ ] [ __ ] as they always do. Things are changing, and I've never felt more like my video about the ceiling was just entirely wrong. It's crazy that this just exists now and that I can open up my terminal and rewrite any piece of software in any language with a pretty high success rate. Go play with this. Push it to its limits. Let me know if you think I'm insane.
Let me know what you're able to build with it. And most importantly, try to stay excited. I know this is a huge change for our field and the thing that we've grown up doing for years if not decades for many of us. But in many ways it's more exciting because you can do things that just didn't make sense before. I'm having a ton of fun with that and I hope you can have some fun with it too. So until next time, peace nerds.