Anthropic fights back
Chapters8
Examines Opus 48s benchmark performance, its competitive standings, and how it compares to previous Claude versions across SWE, Terminal Bench, and other tests.
Anthropic’s Opus 4.8 (Claude Opus 48) leans into honesty and workflow power, delivering better questions and code results, but at steep token costs and with mixed real-world reliability—plus new Cloud Code features to test.
Summary
Theo dives into Claude Opus 48, Anthropic’s latest release, noting it as a meaningful, if not world-changing, upgrade over Opus 47. He highlights strong benchmark performance (SWE Bench Pro) and improved question quality in Claude Code, while also weighing the tradeoffs of ultra-heavy token usage, especially with Ultra Code and dynamic workflows. The video delves into practical testing of Slot Slop, a playful harness demo, and a Rust port from Zig to Rust using Claude Code. Theo spends significant time critiquing Cloud Code’s cost structure, sub-entity token burning, and the fast mode pricing, while acknowledging honest improvements in honesty and laziness reduction. He also covers recent debates around benchmarks (Deep SWE, mini SWE agent) and hardware-like usability issues (bash calls, CLI quirks) that still haunt the experience. Beyond the tech, he mentions Code Rabbit as a sponsor and extols its code-review strengths. The recap closes with reflections on Mythos as a future competitor and a candid take on whether dynamic workflows really replace traditional single-threaded tooling. Overall, Opus 48 feels like a solid step forward for Claude—stronger in dialogue and code interaction, but with a cost curve and reliability questions that makers will want to watch closely.
Key Takeaways
- Claude Opus 48 scored highly on SWE Bench Pro, confirming strong real-world coding performance, though early numbers suggest it isn’t beating 54/55 in all tests.
- Opus 48 offers new Cloud Code features, including Ultra Code and dynamic workflows that split work across many sub agents, at a notably higher token cost and with reliability tradeoffs.
- Fast mode pricing is now more reasonable ($10 per million in, $50 per million out) but cannot be used with Cloud Code sub under the current structure, pushing heavy users toward standard mode.
- The model shows improved honesty and lower dishonesty (about 3.7% vs Mythos’ higher rates in tests), though user experiences (like CLI hallucinations) still reveal gaps.
- Opus 48’s cost per task dramatically drops at the high-end harness, but low-end token efficiency may worsen, indicating more aggressive task sizing strategies.
- Theo tests Slot Slop and Rust port workflows to illustrate Claude Code’s capabilities and shortcomings when faced with UI polish, network prompts, and sub-agent orchestration.
- Anthropic hints Mythos could be coming soon, positioning Opus 48 as a bridge to higher-intelligence models at a lower price point.
Who Is This For?
Engineers and AI practitioners evaluating Claude vs. OpenAI models for real-world coding, code reviews, and large-scale projects. Also useful for product teams considering Claude Code vs. CodeEx or Mythos for secure, high-volume deployments.
Notable Quotes
"Looks like I have to dust off my Claude hat because there's a new model in town and it seems to be the best coding model ever made."
—Intro hype about Opus 48 as the best coding model.
"Claude Opus 48 scored slightly lower than Opus 47 when using the Quad Code harness. It was meaningfully cheaper though, as well as meaningfully faster because it was using less tokens."
—Early bench verdict and cost efficiency note.
"Ultra Code is a combination of XI and the new workflows feature where the model will break up hundreds of sub agents to tackle a project in bulk with a lot of tokens being burned."
—Explanation of Ultra Code and how it leverages dynamic workflows.
"One prompt, $100 a month, locked out for 4 and a half hours. I had to upgrade in order to be able to get this video out in time."
—Reality of sub usage limits and cost pressures with Claude Code Ultra Mode.
"Mythos coming soon, guys. Two last things on this one: Mythos class models coming to all customers in the coming weeks."
—Anthropic tease about Mythos as a future competitor and safeguard focus.
Questions This Video Answers
- How does Claude Opus 48 compare to Claude Opus 47 in bench tests like SWE Bench Pro?
- Is Ultra Code with Claude Code worth the token cost for large-scale workflows?
- Can Anthropic's dynamic workflows realistically replace traditional multi-agent coding pipelines?
- What is Mythos and when will Mythos-class models be released to customers?
Anthropic Opus 48Claude Opus 48Claude CodeCloud CodeUltra CodeDynamic WorkflowsSWE Bench ProDeep SWEMini SWE agentCodex vs Claude vs Cursor comparison (CodeEx/Cursor)",
“Mythos class models
Full Transcript
Looks like I have to dust off my Claude hat because there's a new model in town and it seems to be the best coding model ever made. Anthropic just dropped Opus 4.8 and of course they did it on the day my Claude code sub expires. I'm not even joking. I had to go resub today just to test it. I'm annoyed. But the model is really good somewhat. We have a lot of layers to dive into here. As you expect, it's slaughtering benchmarks. It's the highest score any public model's ever gotten on SWE. It's killing it on terminal bench.
Multi-disiplinary learning with HLE. all the things you would expect, but they also put out new features alongside it in Cloud Code, which is why I was stuck resubing. I've been using this model all day. I've done over $1,000 of tokens through it already. And I have thoughts. In many ways, it's better than I expected, but in other ways, it is definitely still a Claude model. I want to break down what I mean by that, as well as all the cool new features that were added in the most recent update to Claude Code, which is honestly the bigger story here.
In my opinion, one of the things that makes Opus 48 so different is its honesty. And I do think that's important. Honesty and transparency are essential to everything that we do, especially all these informationheavy things. But I want to make sure you guys understand that I'm not being paid in any way, shape, or form to talk about one thing over another or be nice to one company and mean to another. My thoughts on Anthropic are genuine. There is nobody incentivizing me to talk about them one way or another. The only people paying me are today's sponsor.
If you like the software I ship, you owe today's sponsor a thank you because Code Rabbit has prevented me from shipping so many bugs. It turns out AI is great at reviewing meticulous code, trying to find small things that might be wrong. And I'm not exaggerating when I tell you hundreds of bugs that I might have shipped were stopped by Code Rabbit. I like their code reviews so much that I found myself changing my workflows to take more advantage of them. Even on personal projects, I found myself making more PRs just to let the agent review my code.
It felt a little silly, though. The frequency at which I was committing and pushing things that weren't done yet just cuz I wanted Code Route to take a quick look is silly. But that shows just how good the reviews are. And that's why I love their new CLI so much. The Code Rabbit Smart CLI is great when you want to review the code that's on your local machine. Even uncommitted code can be reviewed with the CLI. But that's not what's great about it. If you and I wanted to run the CLI, cool, that's awesome. But letting your agents run it is where it gets really magical.
The craziest part is that the CLI reviews are currently free. When you introduce the Code Rabbit CLI, you'll push code with fewer issues. And if you're also using it in your PR flow, you'll ship way fewer bugs. They estimate as few as 95% fewer. And the craziest part is how much faster code merges. Four times faster PRs. Time to merge is an increasing problem in a world of more and more AI filed PRs. And Code Rabbit will help you clean up the noise and ship faster. Ship fewer bugs and better software at soyv.link/codrabbit. I got to lose this hat.
I'm sorry. I'm just not a hat person anymore. And also I have feelings about cloud code. We will get to that. Don't worry. But before we talk about all the cool new Cloud Code stuff, I want to talk about the numbers for this model. As I shared earlier, it killed a lot of benchmarks, in particular, code benchmarks like SWE Bench Pro. They did lose Terminal Bench 21 still by quite a bit actually. GPD 555 is at over 78% and they're under 75, but SW Bench Pro was a state-of-the-art score. There's a problem with that though, and it's very inconvenient because my video about that problem comes out tomorrow because this video just trumped it.
I just did a massive deep dive on S.Bench because a new benchmark called Deepsw SWE came out. The numbers are very different. Check out that video tomorrow. I I promise that one's worth it. It might not seem like a benchmarking video is that interesting, but this one is. I'm going to spoil one detail from it, though, which is the prompting styles used for the SWE bench tests. These benches run in a custom minimal harness called mini SWE agent, and this is the prompt they use. You're a helpful assistant that can interact with the computer to solve tasks.
I've uploaded a code repository in the directory. Consider the following PR description. Can you help me implement the necessary changes to the repository so the requirements specified in the description are met? I've already taken care of all the changes to any of the test files described in the PR description. This means you don't have to modify the testing logic or any of the tests in any way. Exclamation point. This is really bad steering if you're not familiar. If you haven't been writing prompts and analyzing the effectiveness of them at a professional level, silly that that's a thing.
This is a bad prompt. And then having a list of how to diagnose these things and build the feature is really bad. One, as a first step, it might be a good idea to find and read code relevant to the description. Two, create a script to reproduce the error and execute it using the bash tool to confirm the error. Three, edit the source code of the repo to resolve the issue. This is awful. Yeah, terrible prompt. The actual prompts for this specific problems are even worse somehow. So yeah, uh SW bench is junk. That's separate from the fact that it's also contaminated and also has been discovered that many models including Opus models cheat aggressively on it because they'll check the git history for these solved problems because these are from real PRs.
They'll find the actual correct answer from the real world and then use that instead. Yeah, as many as 20% of the passing runs are cheating. We just can't trust this bench anymore. The computer use bench is a little bit more reliable and I am excited to see more improvement there. I've been leaning into computer use more and more lately. It's such a complex combination of vision, precision, spatial awareness, and other things, but the more I use it, especially with codecs, the more impressed I've been getting. So, that's cool to see. Did not have a chance to play with that with this release, though.
So, I don't have much to say there. I'm trying so hard to not just dive into my own usage of it with cloud code cuz it's so tempting. But, I do need to go over benches a little bit more. As I mentioned, deep SWE is a new bench that I'm quite excited about video tomorrow where I go really deep on it. in GPT 55 and 54 slaughtered this bench. 55 up to a 70% whereas Claude Opus 47 was only at 54. The most important chart in this benchmark is this one that shows the difference between SWBench Pro scores in deep SWE scores.
If you legitimately believe that GPT54 Mini and GPT54 are a 4% difference from each other, you're not using agents properly. Thankfully, DBSE actually can measure differences in models like that where 54 mini only gets a 24% and 54 normal gets a 56%. More than double, way bigger gap, which shows this bench actually measures realworld capabilities for these models. So, how did Opus 48 do? Sadly, they're still crunching to get numbers, but I did get the team to share some early metrics with me. Claude Opus 48 scored slightly lower than Opus 47 when using the Quad Code harness.
It was meaningfully cheaper though, as well as meaningfully faster because it was using less tokens. Still not as smart as 54, much less 55, and also much slower than 54 and 55. There's a lot of layers to why, whether it's the amount of tokens that they're generating or the end toend latency cuz they don't have a websocket primitive similar to what they have in Codex yet. Regardless, not great performance there. But they did just send me an update right before I started filming where they did another run, not in cloud code. this one in the mini SWE agent with a much more minimal system prompt.
And it performed way better, beating out not just 54, but 55 on high. It's not as good as 55X high. It's a 63% versus the 70 that 55 got, but it does come out to be slightly cheaper, which I did not expect. This makes the model look much more competitive than other things were suggesting. One of those things is Cursor Bench, which does show that Opus 48 is meaningfully cheaper per task, where 47 was $11 per task, and this model is only $759, but it also scored slightly worse. It is within the margin of error, but that's a regression.
All three of these scores are so close that I honestly don't think there's that big of a gap between these models. And Composer 25 also being so close makes me sus of this. Like, Composer 25 is a great model. It's not that great, though. It's definitely not soda and it's certainly not better than 55 high. So take that with a grain of salt. The coolest part of this chart though is the cost per task reduction. You can see here that max with Opus 48 is way cheaper than max was with 47 going from $11 to $759.
That said, low is now more expensive than it used to be at 293 versus 187. So the utilization of tokens at the highest and lowest end has kind of been condensed towards the middle. Not necessarily a bad thing. This might just mean the model's better at right sizing tasks based on reasoning levels, but I am a little concerned to see the low end getting more tokenheavy. Enough benching. Let's talk about actually using the model. I started some new projects. I analyzed some existing ones that are pretty big before releasing them to the public. I did some ports of old projects to modern technologies.
I even took the time to rewrite a project from TypeScript to Rust as well as a JavaScript project to TypeScript. Did some comparisons between PRs, lots of different things. And overall, it performed pretty good. It had a lot of clotisms. Don't worry, we'll talk about those. But I did notice improvements that made it feel better to use. It asked much better questions, and I felt like it involved me in the loop in a way that was nicer than I expect from Claude models. And in that way, it still is a little better than the GPT models.
They've made massive improvements on the OpenAI side here. But I do find Claude 48 asks me the best and simplest questions with really good formatting and options that clearly state what I'm looking for. And it also handles well when you add your own additional notes, which I had to do for a handful of those things because sometimes the options they gave were a tiny bit too prescriptive. In the spirit of quad code, I did have to make a fun gambling app. And I'll show you what I built. I created slot slop. Simplest way to tell you what it is is to hit enter.
Uh strobe lights warning, by the way. This is slot slop. Press enter to stop a spinner and it will choose your harness for you. Looks like we're getting cursor this time. Hit enter again and it will pick a model for you. 54 could be worse. Hit enter one more time and we are landing on medium. Cool. We're safe. Then you can hit enter and it will run in that harness. Is this a stupid silly unnecessary project? Yeah. Did it take me way more time than it probably should have in like 20 back and forth prompts?
Also, yeah. Did GPT models do any better on this? Not really. Especially when I was trying to get the custom UI here with all the fancy gradients and color spinning and stuff. I found that the Codeex models are just not quite as good at two yet. They'll make a minimal working version faster, but getting a fancy rainbow vomit one like this. Yeah, that's a Claude special. See if we can land a Claude roll. Cursor CLI again. Damn. Five. Let's do one more run. Let's see. Anti-gravity. Oh no. Oh no. Three. Five flash. Bad luck. Oh well.
Fun project though. I quite enjoyed it. But I clearly wasn't pushing the limits of Opus when I built this. So I did a follow-up run where I asked it to port the project to Rust cuz as you guys know Cloud's really good at Rust ports. And of course it works as expected. It did have to rewrite a lot of things because it didn't have Open Tuy which is the library I used for the terminal side, but it got it. Works as expected. Funny enough, it does actually feel a little laggier, but it it works. I want to talk a bit about how I did this in Claude code, though, because I didn't just tell it to go rewrite and rust.
Make no mistakes. I actually try one of the new features, a feature that inspired me to make slot slop. Now, when you run the effort selector, you still have the usual low, medium, high options, but you can also go to X high, which they've had for a bit, max, which I think they've had for a bit now. But most importantly, and again a strobe light warning, Ultra Code, which infects your screen with this awful purple ASI gradient. Ultra Code is a combination of XI and the new workflows feature where the model will break up up to hundreds of sub aents to go and tackle a project in bulk in mass with a lot of tokens being burned.
I recently put out a video where I compare Claude Code, Codeex, and Cursor. And in that video, I talk about this token maxing gambling thing. A couple people said I was exaggerating. Almost all of those people have hit me up today saying that Ultra Code kind of showed just how right that was. Both the super unnecessary extra Twitter screenshottable UI they built for it, but also the token maxing nature. And when I say token maxing, I mean it. Since I barely use Claude Code nowadays, I assumed that the $100 a month tier would be fine.
So that's what I tried. and I hit the cap for the five hour window in under 30 minutes. Want to guess how many prompts that was? It was one. One prompt, $100 a month, locked out for 4 and a half hours. I had to upgrade in order to be able to get this video out in time. So, for those saying that I caved and went right back to where I was, you're not wrong. But if I did it, you guys wouldn't have this video now. So, pick your battles. I still can't believe how quick I hit this limit.
I also can't believe how brutally it failed to resume once I upgraded. I had to reoff to get it all working again, which was obnoxious, but did try its best to summarize the answers by going to the files that it wrote them in. I also use CC usage to measure how expensive this was to run. And the number was kind of crazy. Remember, I didn't have cloud code at the start of the day. So, this is a fresh sub with one prompt. I did 661,000 output tokens, 102,000 input, a shitload of cash values, and it cost about $168 of raw token utilization.
One prompt, remember, because it just spins up so many agents. Don't worry, though, it's super well optimized to have multiple agents editing files at the same time. That's why this agent made five bad edit attempts in a row to the same file with the same information. Yeah, thanks for wasting my usage there, Claude. I really appreciate that. These are the things that drive me mad. These like parallelization, workflow, massive sub aent tasks sound really cool and powerful, but I've just found it makes the failure rate of my runs way higher. I don't think my team's ever merged one of the PRs that we've generated that are like thousands of lines long with all of these sub agents hacking on things together.
It's just too much and things end up stepping on top of each other, burning tokens when the tool calls don't work properly and the result just ends up feeling like a waste of time and money and PRs. So, not my favorite thing. Ran into that a decent bit here. And I'm far from the only one who's noticed this. Just straight up weird calls it makes with bash just trying to figure stuff out. It does some weird tool calls. Apparently, the team's working on a fix for the ones that Matt reported here. Regardless, not great. Sorry for the negative dump, but I wanted to highlight the problems that I' had been having because the good parts are pretty good.
The model's better at asking questions. It writes code slightly better than it used to. It handles long tasks better than it used to, which isn't necessarily a great thing. I find that I like being in the loop more and more lately, especially with i5. And I've had to go back to the old school style of prompting where I write a lot more upfront. But it handles those types of tasks well. I'd actually just find the thread with the usage limit hit where I was on the $100 tier and ran one prompt. It only ran for 23 minutes before I hit that limit.
And then I tried fixing it. Got a login interruption. Had to tell it to continue. And then it got decent results. This was me asking it to audit a project that I've been working on hard, which is the new lake bed cloud thing. We'll have a lot more to share about that soon, don't worry. I as to do a thorough audit trying to make sure everything is pretty solid before release. Found a couple small things that I already knew about and actually have PRs up trying to fix, but it gave good feedback here. all that I found to be worth reading.
It seemed like it actually understood the project. It did a good job auditing the entirety of the codebase. And nothing here is really that red herringish. I'm I'm impressed. It's not bad. I also had to break up this old JS project with a ton of giant god files that were just like 8,000 lines of JS. Totally not like bad. It did a pretty good job of this, too. I did have to tell it manually to read the agents MD because Anthropic still insists on being a special snowflake and ignoring the agents MD standard that everyone else uses in favor of Claude MD because if you don't mention Claude in the root of your codebase, Anthropic doesn't like you very much.
They love their free marketing. But after that, it did a pretty dang good job. I was impressed with the work that it did. It doesn't write TypeScript like Python the way that GPT55 does. I've had to like take a stick and beat the hell out of 55 to get it to write TypeScript properly and not just check types for everything everywhere when it doesn't have to. Claude writes TypeScript better. It just does. You can make 55 write TypeScript very well. And it's not like once you do that, it's worse in some way. It's just that Claude doesn't need quite as many reminders that TypeScript is indeed real and can be trusted.
You don't have to check if something's a function every time you access it when it's already bound as one. I do want to talk a bit about costs though because the numbers I was seeing were crazy. About halfway through the day, I was like 10 prompts in probably. I had a lot of beatings today, too. So, it was tough squeezing that in. I got up as high as $518 of usage halfway through my day. I then continued to use it heavily cuz it's my job. It's what I was trying to do today. And I got my usage all the way up to $220.
You can clearly see here I ran this afterwards. And despite running it after the number went down. I suspect this is some pruning that it does when you're running the sub aents where when the sub aents in the ultra code mode complete it concatenates and condenses all of the JSON from it which results in less accurate numbers here which is annoying cuz I was trusting these numbers and using them for a lot of things. It is what it is. I just wish Anthropic wasn't trying so hard to hide the level of subsidization that they're doing.
But god damn, this model burns tokens. I want to talk a bit more about the measurements and some fun things Anthropic confirmed in the release notes. A big part of why this model feels so much smarter than a lot of the recent releases is the honesty fixes and the laziness fixes. Anthropic's been measuring the laziness of models in terms of how thorough are they in their investigation before giving an answer. 48 never had this problem, which means it actually outperforms Mythos even in terms of its likeliness to give a correct or incorrect answer depending on how thorough it is with this investigation.
That's really cool. This is also probably why the low reasoning effort still uses quite a bit of tokens because it's been trained to not give up until it knows for certain what the answer is. It's also dishonest much less often. Even Mythos had a dishonesty rate from their measurements as high as 27.6%. Opus 48 is down to 3.7%. Sadly, this does not really reflect my own usage. I had a lot of problems here. This is a thing I've never seen before. I'm restoring the old session to show you guys and it's telling me resuming the full session will consume a substantial portion of your usage limits.
We recommend resuming from a summary. They know they're burning tokens. We're not doing that though cuz I got to read this whole history. When I was working on slot slop, I had to integrate all of the different harnesses and specifically their CLI arguments in order to make sure it would spit out the right command and run it properly. I understand why the model might not be great at CLIs that are newer or less well doumented, things like PI or Open Code or especially stuff like the new anti-gravity CLI because Google doesn't even know how to use that one.
But I was really surprised when it got cla wrong. First, it insisted there's no way to pass effort levels to the claude code CLI. I asked is are you sure about that? Not even an environment variable. And it quickly told me I was wrong. Cloud code does have a real effort flag. I did end up crashing out a little bit at the model when I was just outright failing to even use the claude code CLI. It kept getting the wrong flags. It was using dash m, which isn't a thing in cloud code. You have to do d-model.
Somehow it just hallucinated that. So, I'm not seeing the thing that everyone else is here where it's more honest and more thorough and less lazy because it just hallucinated about its own CLI. Like, what? That was really surprising and disappointing, especially when I told it multiple times throughout the history to check the docs about the things we're integrating and it just didn't bother. On the other hand, though, in favor of it not being lazy, it tried really hard to test the changes it was making by running an interactive terminal in the background with an agent with sleep timers that would trigger the enter key presses.
and it tried really hard to get that working and it kept breaking as a result because that's just not a good way to test a full screen takeover terminal experience. And I had to interrupt it and say, "No, don't worry. I'm testing it. It works." Because it was just spinning forever and ever on that. I will say, and I know you guys are going to call me crazy, all of these types of problems are things I just don't experience with the GPT5 line, especially 55. The problem I have with 55 is that it will overindex on the things in the context.
As soon as something's mentioned in the history, it fixates on it and won't forget it. If you tell 55, hey, commit these changes before we start the next ones and you don't remember to start a thread after that, every additional change it makes will get a commit going forward. Although Opus 48 did the same thing for me today, so I don't know what's real anymore. Yeah, it's a pretty good model. If this seems like a little bit of an underwhelming release, you're not the only ones who think that anthropic feels the same. Users will find Opus 48 to be a modest but tangible improvement on its predecessor.
There's still more to be done. We're working on developing and releasing models that provide many of the same capabilities as Opus at a lower cost. Interesting. It seems like they're finally waking up to the expense problem and they're going to work on making Sonnet or something like it way more intelligent for the price. Not only that, but we plan to release a new class of model with even higher intelligence than Opus. As part of Glasswing, a small number of organizations are currently using Mythos for cyber security work. Models of this capability level require strong cyber safeguards before they can be generally released.
We're making swift progress on developing these safeguards and expect to be able to bring Mythos class models to all of our customers in the coming weeks. You heard it here first. Mythos coming soon, guys. Two last things on this one. First, I want to talk briefly about the fast mode because their fast mode was massively overpriced before. It used to be like five times more expensive, like brutally so regular speed usage is still the same, $5 per mill in and 25 per mill out, but the fast mode is now only double at 10 per mill in and 50 per mill out.
That said, you also can't use it as part of your cloud code sub, which is very annoying because you can use fast mode as part of your codec sub when you're using OpenAI models. I almost exclusively use X high fast lately and occasionally switch to low when it's like UI changes or small things. Generally, I'm using X high fast nowadays and I can still barely put a dent in my codec sub. There is no way to use fast on this model without paying cash for it. You're paying API prices. And as you saw from my numbers, nearing $1,000 in tokens today, just one day of experimentation.
If I was to use fast mode, that would have been two grand for a day of work. And while I was pretty productive today, I don't think any of the work I did was worth that much. Okay, maybe other than slot slop. This project's worth thousands of dollars. Clearly, we should definitely raise some money on it, huh? But last, I need to talk a little bit about the new dynamic workflow stuff. A lot of people seem quite hyped on this. The idea is that Claude can tackle more challenging work endto end where you tell it roughly what you want to do.
It will analyze the project, design an architecture of agents and sub aents to go do the work for these types of big complex legacy code bases. It's clear this was inspired by the bun rewrite from Zigg to Rust. Like they're trying to make it easier to do those types of giant workloads. It does burn tokens heavily as a result. Again, this is the token burning company as much as it is the Flickr company. The easiest way to get a workflow to Claude is to ask it to create a workflow. They even did their usual thing where if you mention workflow, the word lights up because they love taking words that are totally not used for other reasons and hijacking them to trigger special modes.
Oanthropic. I feel bad for anybody who uses cloud code on a project that has a concept of workflows. You're in for a treat. Oh boy, good luck. You're going to accidentally burn a lot of tokens. They give examples of what these types of workflows are good for. things like codebasewide bug hunts, profiler guided optimization audits, as well as security audits, large migrations and modernization efforts. I've seen it be pretty good for that. Appreciate the effort there. They also say it's good for critical work that you need to check twice as it spins up a lot of agents to test with.
Apparently Jared used dynamic workflows in his port from Zigg to Rust. That's cool. So clearly these things are tied together pretty directly. I love that even they are calling out that Jared will write more about this in the future because Jared's taking more time to write the blog post than he did to do the port from Zig to Rust, which is hilarious and very Jared if you know him. When a workflow kicks off, Claude plans dynamically based on your prompt, breaking it into subtasks and fans the workout across sub agents running in parallel. Results are checked before they're folded in, and you come back to a single coordinated answer.
Agents address the problem from independent angles. Other agents try to refute what they found and the run keeps iterating until the answers converge, which is how a workflow reaches results a single pass can't. Cat from the Claude Code team shared this diagram to show how it works where Claude will write prompts, decide about agents, and kick off these subtasks, each of which might kick off even more subtasks. Kind of absurd to think Claude Claude Claude versus Claude spinning up implementer clouds which then spin up subverifier clouds which then spin up fixer clouds before giving it back to Claude at the end.
Is this what it's like when you accidentally hire three people named Jon on your team? I've been saying this is a big philosophical difference between the labs and I hope you even better understand what I mean now because this is not a thing that OpenAI is really doing. OpenI models can spin up sub aents and they're good for things like investigations, but when Codeex is working, Codeex is just doing the work in a single thread. And I have found it still ends up being faster and more reliable for a lot of these big tasks. I did a huge port of some old code with both of these things.
Codex finished much faster and had code that was just as good as what I got when I did the same thing with workflows. So yeah, I'm sure this is really useful for a lot of different things, but to me it just kind of feels like we're seeing if we can solve slightly harder problems with significantly more tokens. Like if there's a problem that's 10% too complex for the existing models, you can choose to spend 100 times more tokens to slightly increase your chances to solve it. Eh, not my thing. I see myself potentially coming around to this in the future, but now is not that time.
Cat gave the example of removing a bunch of feature flags that were already rolled out at 100% to clean up the code and deprecate the stale ones. Instead of waiting for Claude code to investigate each sequentially, dynamic workflows allowed Claude to process all of them in parallel. Cool. Me, I get it, but I don't think it's that big a deal. So, if you're looking for the smartest model ever, according to artificial analysis, you now have it. It uses fewer tokens, costs a little bit less as a result, and is meaningfully smarter than 47. This is definitely not another one of those bad barely a difference sometimes worse launches like 46 and 47 were.
Even if 48 did measure worse in some places that does not line up with my experience. I find it to be a meaningful improvement. Is it going to replace 55 for me? Probably not. But I got two last things before we finally wrap. First, there was this tweet from Tyler that I loved. Opus 48 is insane, guys. It oneshotted my session usage limit. As I mentioned before, I had to upgrade from the $100 tier to the $200 tier. That all said, as much as the gap between Codeex and Claude has closed as a result of this launch, we still really need Mythos before the gap is fully closed and maybe goes in favor of Anthropic once again.
And since Anthropic seems to really like dropping things on the day that my subscription ends, I'm going to do all of us a favor and cancel once more. After spending yet another $200, my subscription is cancelled and it ends on June 28th. Hopefully, we'll have Mythos before then. And if not, we should have it on that exact day. I think that's all I have to here. It's a pretty good model. I would definitely recommend it if you're bought in heavily to the Claude ecosystem. But if you haven't tried 55 yet, you definitely should. Now, let's hope there aren't any more big model drops coming soon because I I'm already too busy.
I I'm out. Have fun, nerds.
More from Theo - t3․gg
Get daily recaps from
Theo - t3․gg
AI-powered summaries delivered to your inbox. Save hours every week while staying fully informed.









