Opus 4.8 Just Dropped. Here's How To Actually Use It.
Chapters
Introduces Claude Opus 4.8, notes benchmark improvements over 4.7 and GPT-5.5, and previews how 4.8 may behave differently, which 4.7 issues it targets, and what adjustments users may need to make.
Claude Opus 4.8 arrives with sharper judgment, longer autonomy, and clearer prompts—but choose your effort level to actually see the gains.
Summary
Nate Herk dives into Claude Opus 4.8, explaining how it builds on Opus 4.7 with sharper judgment, more honesty about progress, and longer sustained autonomy. He notes the price per input/output token stays the same as 4.7, but Claude Code rate limits have increased to handle higher token usage. The video highlights new controls, including an adjustable effort setting on claude.ai and dynamic workflows in Claude Code, plus practical prompting tips to maximize performance. Nate compares benchmarks critically, stressing that real-world use often diverges from official scores and that each use case should guide model choice. He shares concrete takeaways from the accompanying Claude blog and API docs, including how to supply background, give the why behind instructions, and decide when to rely on in-model reasoning versus sub-agents. Several takeaways emphasize tuning effort, giving the model context, and how the model calibrates response length to task complexity. Finally, Nate cautions viewers to test 4.8 against their own workflows and recommends a token-tracking tool to monitor consumption. He promises more coverage of the new dynamic workflows feature in a future video and points out that user feedback shapes how these models improve.
Key Takeaways
- Effort level is the number one lever in Opus 4.8; switching from high to low/medium or to ultra can dramatically change speed, cost, and accuracy depending on the task.
- Prompting should tell the model what to do, including explicit backgrounds and the why behind instructions, rather than relying on negative prompts or vague requests.
- Opus 4.8 defaults to high effort in Claude Code but offers low/medium and even an Ultra Code (xhigh plus workflows) option for large-scale problems, with outputs that scale in effort and token cost.
- Reasoning often precedes tool usage, but you should balance internal reasoning with when to fetch external context or spawn sub-agents based on the task complexity.
- Response length is calibrated automatically by the model based on task complexity, delivering shorter results for simple lookups and longer analyses for open-ended tasks.
- Benchmarks look strong, but real-world results depend on your use case; the video stresses testing 4.8 against your workflows rather than chasing numbers alone.
- 4.8 aims for more honesty and self-correction, sustained autonomy on long tasks, a warmer collaborative vibe, better tool calling, and improved token efficiency.
Who Is This For?
Essential viewing for developers and AI practitioners who use Claude Opus in Claude Code, especially those moving from 4.7 to 4.8 and exploring dynamic workflows. It helps you adjust prompts, effort levels, and workflow strategies to avoid common pitfalls.
Notable Quotes
"Opus 4.8 is finally here and as always the benchmarks look amazing."
- Opening claim about 4.8 benchmarks vs 4.7 and competitors.
"Effort is the number one lever now."
- Central takeaway on how to control model behavior and cost.
"Give the why behind an instruction."
- Prompts should justify actions to improve adherence.
"response length calibrates its own length"
- Automatic length scaling based on task complexity.
"it is priced the exact same as Opus 4.7 on input and output tokens"
- Pricing detail that users care about in migration decisions.
Questions This Video Answers
- How should I adjust the effort level in Claude Opus 4.8 for different tasks?
- What are dynamic workflows in Claude Opus 4.8 and how do they work in Claude Code?
- Does Opus 4.8 really beat GPT-5.5 on benchmarks, and how should I compare models for my use case?
- What prompting strategies maximize honesty and self-correction in Opus 4.8?
- How can I verify my token usage and efficiency when upgrading from Opus 4.7 to 4.8?
Claude Opus 4.8, Claude Code, Opus 4.7 vs 4.8, Dynamic workflows, Effort levels, Token efficiency, Prompt engineering, Anthropic Mythos, GPT-5.5 comparisons
Full Transcript
So Claude Opus 4.8 is finally here, and as always the benchmarks look amazing. In a lot of the major categories 4.8 is better than 4.7 and even better than GPT-5.5 as well. But the question is, is it really a better model? So today I want to talk about what is new in Claude Code because of Opus 4.8. I want to talk about some of the issues and struggles that you guys have been having with 4.7 and how 4.8 is supposedly going to address them. I'm going to go over some key takeaways, because it seems like this model is going to behave a little bit differently than 4.7 and you're going to have to change the way that you work with it a little bit.
So, let's not waste any time and just get straight into this one. Okay, so it is May 28th, 2026 and Opus 4.8 has dropped, and it is apparently built on top of Opus 4.7 with sharper judgment, more honesty about its own progress, and the ability to work independently for longer than its predecessors. And important to note, it is priced the exact same as Opus 4.7 on input and output tokens. But what's interesting right here is that they have increased rate limits in Claude Code to accommodate the higher token usage of the higher effort levels. So that's rate limits.
That is not your, you know, 5-hour rolling window or your weekly session limits. Those remain untouched, but rate limits, if you're using Claude Code via API, have been increased. All right. So this is the blog post. I'll link this in the description, but I'm just going to go over a few of the key findings here. Okay, so Opus 4.8 launches alongside several new features. Users on claude.ai now have control over the amount of effort Claude puts into tasks. And in Claude Code, we have a new feature called dynamic workflows that allows it to tackle very large-scale problems.
So I'm not going to dive into dynamic workflows today, but this is a new feature that I will be making a video about shortly. But now in Claude Code, obviously, you can see Opus 4.8 is here. It defaults to high effort. You can also, of course, switch the effort. But in here you can type in workflows, and that's how you could start using that dynamic workflows feature. But what I want to show you guys in here, which is pretty cool, is in the terminal or the CLI version: if you do effort, you can see we have the slider.
Like I said, it's going to default to using high, but you can also do low or medium. And you can come up here: xhigh, max, or Ultra Code, which is xhigh plus workflows. So, it's very, very smart over here, but of course, it's going to cost more from a token perspective. And the further left you move on this slider, the faster your outputs will be. Of course, we can dig into the actual benchmarks, which I like to look at. But the thing about the benchmarks is every single time you see a new model, the benchmarks are amazing, right?
It's always better than the other ones, and you've always got these other comparisons. So obviously that's what they have to do from a marketing perspective. So it's really important for you to understand what model is actually the best for your use case. Like maybe the case is, yeah, Opus 4.8 really is better at agentic coding than something like Codex with GPT-5.5, but maybe for your very specific use case Codex is just performing way better, even if the explicit benchmarks don't say that it should. Like right here, for example, I think that Codex with GPT-5.5 is much, much better at agentic computer use than Opus 4.7 and Opus 4.8, even though, according to the benchmarks, these two Opus models would objectively be better at agentic computer use than Codex.
So always take these benchmarks with a grain of salt. Anthropic took a whole section of this blog to call out that one of the most prominent improvements is Opus 4.8's honesty, which I think is interesting, because that's definitely something that I noticed was a problem with Opus 4.7, as we're going to dig into over here as far as problems that people have reported with Opus 4.7. But they took time to call out the honesty here. We train all our models to be honest, to avoid making claims they can't support, like saying, "Hey, this is going to take me 4 hours," and then it takes 20 minutes. Or saying, "Hey, I finished. I pushed all 50 into blah blah blah," when it only actually pushed 15.
So, if you guys have ever felt that, you're not alone. And apparently, Opus 4.8 is much better at that. And so they actually have evaluations to test this sort of stuff, which is about misaligned behavior. And you can see here that in this case a lower score is actually better. So right here we've got, like, Mythos Preview coming in pretty low. Opus 4.8 comes in at almost half of what Opus 4.7 and Sonnet 4.6 come in at. But take a look at this.
Users will find Opus 4.8 to be a modest but tangible improvement on its predecessor, 4.7. Obviously there's still more work to be done. But what they say here is that they plan to release a new class of model with even higher intelligence than Opus, which is Mythos. You can see a small number of organizations are currently using it for cybersecurity work, but models of this capability require stronger cyber safeguards before they can be generally released to the public, because we don't want some random kid in their basement hacking into your bank account. But anyways, Opus 4.8 is available everywhere today.
So wherever you're using Claude Code, you should be able to access it. Open up a new terminal, open up a new extension tab, whatever it is. And you can see right here we have Opus 4.8. And you will notice right away that we still have our 1 million context window with Opus. I could come in here and I could type in, and I just did that twice in a row, /model. And we can choose between default set or our Opus 4.8, which is most capable for most work. But anyways, Opus 4.7 was released April 16th.
So basically about a month and a half ago. They're moving really, really quick here. And when Opus 4.7 came out, they added the xhigh effort level, which has now been dwarfed by Max and Ultra Code. But what's interesting is a lot of people actually weren't happy about that model release, because they felt like it was actually worse than Opus 4.6. So some of the main problems were that it felt lazy. It was just basically giving up on the goal, on the task, too early. So, you know, Codex had /goal, and now a bunch of other different AI tools have /goal.
Claude Code has /goal, and that was kind of like a band-aid fix to put on top of the model to help it work a little bit longer towards some sort of specified goal, but now that is just a core, fundamental piece of the model. Not exactly the /goal command, but just the idea that it's going to be less lazy and it's going to be better at working for a longer time. It was also said to be overly rigid, with safety overreach. There was a ton of community feedback on the token burn and how much more expensive this model seemed to be.
And the one that I think is the funniest is people saying that it had an attitude, which honestly is true if you've ever heard it sort of get a little sassy with you or push back on your own ideas. It's good to have that sort of brainstorming thought partner, but I have noticed that sometimes it did sort of come off as very short and almost stubborn. So those were some of the main problems that I felt, but also that the community had felt, with 4.7. Now, there is a big difference here between the model having problems and you using the model wrong.
It's not always a model problem. Sometimes it truly is a skills problem, and the answer isn't just, "Oh well, 4.7 can't do this, let me wait for 4.8." Sometimes it is a user error thing. So I just did want to call that out as well. So anyways, 4.8 obviously comes out today, and it was built to fix this stuff, right? It was built and said to have more honesty and self-correction, more sustained autonomy on long-running tasks, a warmer and more collaborative vibe, and just efficiency and quality of life, meaning better tool calling, better reasoning, better question asking, better token efficiency, stuff like that.
And so what I did is I read through a lot of the stuff the community was talking about. I tested out Opus 4.8 a little bit. Obviously, it came out an hour ago, so I haven't been able to deep dive into it yet, but I've tested it out. I also read through this documentation here about prompting best practices, which is a pretty long article from the Claude API docs, which I will also link in the description if you want to check it out. But after reading through all this stuff, there's a few takeaways that I wrote down and that I wanted to share with you guys.
The first one is that effort is the number one lever now. And when I look back at one of these problems right here, like maybe the laziness or the safety overreach, maybe that was an effort issue, because basically if you were doing something that takes a lot of effort, but you've got the model set to low or, you know, medium or even just high, sometimes you just need more effort. And on the other side of the spectrum, if you're doing something that's really simple and you have that on high or extra high, then maybe you're dedicating more resources than you need, and the model is going to over-reason and over-engineer.
And that's where you're like, "Okay, this is so easy. Why can't it do it?" It's simple, but maybe you just needed to turn down the effort. And so, it really is a balance here between, you know, Claude's intelligence and the token spend and the speed and all that kind of stuff you're looking at. But the point I'm trying to make here is if you are one of those people that opens up Claude Code and just starts typing and doing your work and building and you never tweak the model, start trying, because the difference between Opus 4.8 on low and Opus 4.8 on extra high is significant.
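One way to make the effort takeaway concrete is to write down a rule of thumb for when to move off the default. The sketch below is only an illustration: the level names follow the ones mentioned in the video, while the `Task` fields, the thresholds, and the `pick_effort` helper are hypothetical and not part of Claude Code or any Anthropic API.

```python
from dataclasses import dataclass

# Effort names as mentioned in the video; the thresholds below are
# illustrative assumptions, not documented guidance.
EFFORT_LEVELS = ["low", "medium", "high", "xhigh", "max"]


@dataclass
class Task:
    description: str
    files_touched: int    # rough size of the change
    needs_planning: bool  # multi-step or open-ended work


def pick_effort(task: Task) -> str:
    """Hypothetical heuristic: stay on the default unless the task is clearly small or large."""
    if task.files_touched <= 1 and not task.needs_planning:
        return "low"
    if task.files_touched <= 3 and not task.needs_planning:
        return "medium"
    if task.files_touched > 10 and task.needs_planning:
        return "xhigh"
    return "high"  # the default mentioned in the video


print(pick_effort(Task("rename a variable", 1, False)))
print(pick_effort(Task("refactor the auth module", 12, True)))
```

The numbers matter less than the habit: decide the effort level per task instead of leaving it on whatever it was last set to.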
It's almost to the point where it feels like a different version, like an Opus 4.9. So, it's definitely worth starting to pull on that lever a little bit if you never have. So, the next one I've got is tell it what to do, not what not to do. And really, the way I got there is if you go through this documentation, it always shows you good example prompts, right? Ones that you could copy for specific scenarios. And when I looked through a lot of these example prompts, I realized that they weren't really saying a lot of what not to do.
Or I mean, that's horrible timing, because right here it says do not do this. But it always tells the model more explicitly what to do. And what I thought was cool is it gives background. It gives context, almost as if the model is sort of curious and it's going to say, "Hey, you told me not to do X, Y, and Z, but why?" And the more that you can contextualize that stuff, the better it's going to be able to follow those instructions. And that leads me to the next one right here, which is give the why behind an instruction.
So rather than saying, "Don't use em dashes," say something like, "I want this to come off like I'm really writing it, and this is my writing style, and I never use em dashes, so make sure you're following my writing style." And that is going to give you a little bit better feeling of Opus actually following your instructions. If you guys have seen my comparisons between Opus and GPT-5.5, one of the things that I've said is that I love how creative Opus is, but sometimes I want it to just do the thing, and I want it done exactly how I want it.
And so maybe that's an issue of effort and also of giving it too many negative prompts. You know what I mean? So, it's a mix of the model, but also take accountability on yourself a little bit and think, how can I actually use the model the way that the engineers of the model have actually told me to? Anyways, a few other ones. It's going to default to reasoning before calling tools. So, it's going to try to figure out, you know, the questions to ask and the approach to take on its own with what it has right now before it looks to spawn a sub-agent, for example, or to go read that database or to go do this.
And sometimes that's really good, right? Sometimes you want that reasoning before it starts doing things, but sometimes you want that extra context to be pulled in before the reasoning starts. So that's why it's really important to play with your prompting, obviously, to play with your effort level, and to be careful, especially when you're switching all these workflows over from 4.7 to 4.8. You don't just switch over and say go and blindly trust that everything's going to, you know, stay the same. You kind of want to watch it a little bit to get a feel for how the model behaves.
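The two prompting takeaways above (say what to do, and give the why behind it) can be sketched as a small prompt builder. This is a minimal sketch, not a format taken from the Claude docs: the `Instruction` class, the `build_prompt` helper, and the example task are all made up for illustration. The second instruction shows one way to ask for context to be gathered before the reasoning starts.

```python
from dataclasses import dataclass


@dataclass
class Instruction:
    what: str  # the positive instruction: what to do
    why: str   # the reason behind it


def build_prompt(task: str, background: str, instructions: list[Instruction]) -> str:
    """Assemble a plain-text prompt that pairs every instruction with its reason."""
    lines = [f"Task: {task}", "", f"Background: {background}", "", "Instructions:"]
    for item in instructions:
        lines.append(f"- {item.what} (Why: {item.why})")
    return "\n".join(lines)


prompt = build_prompt(
    task="Draft a short post announcing our new onboarding flow.",
    background="The post goes out under my name to existing customers.",
    instructions=[
        Instruction(
            what="Match my writing style: short sentences, commas and periods only.",
            why="I want this to read like I wrote it myself, and I never use em dashes.",
        ),
        Instruction(
            what="Read the changelog file before you start planning the post.",
            why="The details there are newer than anything in this conversation.",
        ),
    ],
)
print(prompt)
```

Compared with a bare "don't use em dashes", each line tells the model what the desired outcome is and why, which is the pattern the example prompts in the docs reportedly follow.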
And then when you look at things like response length and verbosity, you can see here that I said it calibrates its own length. And basically what I meant by that is that it's going to judge how much it should do and how it should respond based on the complexity of the task, rather than defaulting to some sort of fixed verbosity. So this usually means shorter answers on simple lookups, and you'll get longer ones on more of an open-ended analysis that takes more reasoning. So anyways, those are some of my main takeaways. Obviously, like I said, I've only played around with this model for about half an hour.
I wanted to get this video out quick, but as I find more stuff out, I will continue to update you guys. So, last two pieces here. What are people saying right now? So, obviously, there's a lot of different feelings, right? A lot of positive and excited stuff, right? People are saying, "Oh, this already one-shotted my GPT-5.5 right here. Strongest coding model yet. I'm hooked. This is super warm, super collaborative, big jumps in, you know, benchmarks." But once again, a lot of these people have an incentive to say stuff like this because, I don't know, they want engagement or they're marketing something.
And so obviously it's great to look at the full spectrum, which is why I also pulled in some mixed and cautious reports, like some early reports of bugs already in Opus 4.8. Maybe that's just because of the rollout, whatever it is; people are still testing it. So there's a lot of stuff to still be cautious about. But the overall vibe, which I think is pretty cool, is that it's almost like we have these, you know, four or five main bullets of 4.7 problems, and most of the improvements that we're reading about from 4.8 are directly hitting those 4.7 problems and pitfalls.
So, at least we're getting that sense of Claude using that data to make it better. And if you really think about it, think about the way that you use Claude Code, right? You ask something, Claude Code responds, you correct it, and you have this back and forth of like, I don't like that, do this better, blah blah blah, I'm your master. And then what happens is, because obviously Anthropic can read those logs and, you know, train their models on that data, it's able to say, okay, what are people not happy with Opus 4.7 about?
What are they constantly saying? And let's just bake that into the model. So, really, it would concern me if a lot of these key problems weren't being addressed head-on. Anyways, the key thing that I want you guys to always think about is that benchmarks look great, and they always will, and someone else's use case is not your use case. So figure out right now, in your workflow, in your Opus 4.7 workflows, what are your problems? What do you typically get frustrated by? And maybe Opus 4.8 fixes those problems, but maybe it won't.
Even though the model is better, that doesn't mean it's better for that specific problem. So just always be thinking about how you can work in different models or different context strategies or different effort levels to directly address the actual constraints and pain points that you're having right now. So look for things like the vibe upgrade. Look for things like how often you're correcting this thing and giving it the same instruction over and over. Obviously you should be working in things like memory and different skill files to address that repetition, but still. And then of course there's the whole token and workflow efficiency feeling that you typically get a sense of when you're getting near the end of your session limit and you need to pull back a little bit.
Apparently, based on the documentation, this model is more efficient with tokens, but we don't actually know yet. And one great way to test that kind of stuff out is you can use my token tracker, my token dashboard tracker, which is completely free. It's just an open-source GitHub repo. I will leave that in my free Skool community, linked in the description. Just give Claude Code the GitHub repo, tell it to set it up, and it will pull in all of your historical data with Claude Code, and you can see where your tokens are actually going.
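To show the kind of comparison a token tracker makes possible, here is a minimal sketch that totals usage per model. Everything in it is an assumption for illustration: the record fields, the model labels, and the numbers are invented sample data, and this is not the format used by the tracker from the video or by Claude Code itself.

```python
from collections import defaultdict

# Hypothetical usage records with invented numbers; real logs or the tracker
# from the video may use different fields and model names.
records = [
    {"model": "opus-4.7", "input_tokens": 1200, "output_tokens": 800},
    {"model": "opus-4.8", "input_tokens": 1100, "output_tokens": 600},
    {"model": "opus-4.8", "input_tokens": 900, "output_tokens": 500},
]


def totals_by_model(rows):
    """Sum input tokens, output tokens, and request counts for each model label."""
    totals = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0, "requests": 0})
    for row in rows:
        bucket = totals[row["model"]]
        bucket["input_tokens"] += row["input_tokens"]
        bucket["output_tokens"] += row["output_tokens"]
        bucket["requests"] += 1
    return dict(totals)


for model, t in sorted(totals_by_model(records).items()):
    average = (t["input_tokens"] + t["output_tokens"]) / t["requests"]
    print(f"{model}: {t['requests']} requests, {average:.0f} tokens per request on average")
```

Running the same set of tasks on both models and comparing tokens per request is a more direct check of the efficiency claim than the feeling of hitting a session limit sooner or later.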
But anyways, that is going to do it for today. I hope you guys enjoyed this one or learned something new. And if you did, please give it a like. It helps me out a ton. And as always, I appreciate you guys making it to the end of the video, and I'll see you on the next one. Thanks guys.