This model is kind of a disaster.

Theo - t3.gg | 00:37:59 | Apr 17, 2026
Chapters: 12
Introduces Opus 4.7 as the latest public release and summarizes its key improvements and caveats compared to Opus 4.6.

Theo fault-finds Opus 4.7: a jumbled mix of genuine improvements and frustrating hitches that make it the weirdest Claude release yet.

Summary

Theo walks through a long hands-on look at Anthropic’s Opus 4.7, the public-facing Claude release that’s “the best one that they’ve released for public use.” He starts with tempered optimism, noting improved instruction-following, better multimodal capability, and stronger performance in finance analytics, while also warning that the model regresses in real-world usage and can feel inconsistent. Throughout the day, he tests Opus 4.7 in Claude Code and in CLI workflows, highlighting how the harness and system prompts can tilt the experience toward safety filters and friction, sometimes locking him out of tasks like cryptography puzzles. Theo’s hot take centers on the hypothesis that the regressions aren’t purely a model issue but stem from the harness (Claude Code) and internal tooling, which he argues are poorly maintained and leading to a degraded developer experience. He contrasts Opus 4.7 with Opus 4.6 and the Mythos preview, noting that while 4.7 can outperform in some benchmarks, it also makes odd, unjustified search decisions and even fails at basic code modernization tasks when prompted to fetch the latest dependencies. The video includes candid reflections on the governance around safety prompts, the ability to auto-run tasks, and how recent updates sometimes “break” core workflows. Theo also shares practical lessons from his own experiments, including a failed Next.js 16 migration due to outdated training data, and a surprising moment when a different model (GPT 5.4) performed better on the same task. He closes with a nuanced takeaway: Opus 4.7 is worth trying for specific coding tasks, but you should expect quirks, friction, and potential regressions, especially if you rely on the Claude Code harness. A light sponsor carve-out about Depot and WorkOS sits in the middle of the review, framing the broader dev-tooling landscape he navigates while testing the model.
Theo signs off with a candid invitation for audience feedback and future deep-dives if viewers want more detail on these dynamics.

Key Takeaways

  • Opus 4.7 shows meaningful gains in instruction-following and long-running task handling, but not consistently across all benchmarks (e.g., weaker on Agentic Search).
  • Multimodal improvements are real: Opus 4.7 accepts higher-res images (up to 2576px long edge) and produces sharper outputs in professional docs and slides.
  • The Claude Code harness and gating prompts can introduce overly restrictive behavior, hard-locking chats or pausing them unexpectedly during complex workflows.
  • Despite being better in some tests, Opus 4.7 can regress in practical coding tasks, especially when trying to auto-update dependencies or migrate projects (e.g., Next.js versions).
  • Observed performance varies with tooling: using Claude-Opus-4-7 via the Claude API differs from the in-harness experience, and the harness (Claude Code) is a frequent source of friction.
  • Compared to the Mythos preview, Opus 4.7 is strong but not oracle-perfect; it degrades faster when things go wrong.
  • User experience hinges on prompts and environment: a tweak in prompt design or tooling can flip a task from success to failure within the same session.

Who Is This For?

Software developers and AI practitioners who rely on coding assistants and harnessed LLMs for daily workflows, especially those evaluating Opus 4.7 versus 4.6, Mythos previews, or other providers. Ideal for engineers who care about build tooling, CI efficiency, and real-world task reliability more than theoretical benchmarks.

Notable Quotes

"Opus 4.7 handles complex, long-running tasks with rigor and consistency."
Theo summarizes the model’s claimed strengths in handling long-form tasks and reliability.
"This model just gets dumber the more you do it."
Opening zinger that frames the day’s central tension between initial optimism and later regressions.
"The model can see images in greater resolution. It can now accept up to 2576 pixels on the long edge."
Cites the multimodal improvement as a tangible capability upgrade.
"The safety filters occasionally pause normal safe chats."
Relates a concrete example of how safety measures can impede productive work.
"If you have a carpenter who is incredibly talented and every few weeks you replace three of their tools with plastic... they’re going to perform worse."
Theo’s analogy for how the harness degrades performance over time.

Questions This Video Answers

  • How does Opus 4.7 compare to Opus 4.6 in real-world coding tasks?
  • Why does Claude Opus 4.7 sometimes lag or pause during long-running workflows?
  • What are the main issues with the Claude Code harness when using Claude models?
  • Is Opus 4.7 actually better at instruction following, or is that a masking effect of the harness?
  • Should I rely on Opus 4.7 for Next.js 16 migrations or wait for a more stable tool?
Anthropic Opus 4.7, Claude Opus, Opus 4.7 vs 4.6, Claude Code harness, Multimodal AI, AI safety prompts, Code modernization, Next.js versioning, Mythos preview, CI tooling and Depot sponsorship
Full Transcript
It's been a while since we had one of these videos, hasn't it? We finally have a new model that we can actually use this time. Opus 4.7 from Anthropic. This is a very interesting release because it's not the best new model. Not because I don't like Anthropic, but because it's not even the best Anthropic model. It's just the best one that they've released for public use. But what is it like to actually use this new model? That's a great question, which is why I spent the entire day playing with it. And trust me, I've come prepared. I got me an exclusive Claude hat because what else would I wear for this? But I also have this because much like me drinking beer, this model just gets dumber the more you do it. I was hyped when I first started playing with it, but since then, I can't believe I watched the model regress in real time is what I'm going to say. This has been a very interesting day. And while I have been impressed with a lot of things about Opus 4.7 and do actually see myself using it, I also think it's one of the weirdest models ever released. I know I'm going to have to qualify that one. And I promise you I will do just that after a couple sips of this drink and a quick break for today's sponsor. AI dev tools have made me faster than ever, but they're also making me more frustrated than ever. Things that I used to not care about, like waiting for my Docker build or CI, suddenly matter a lot more. And that's why I love today's sponsor Depot so much. Not only have they managed to make GitHub CI up to 10 times faster and Docker builds up to 40 times faster, they've also introduced their own new CI that is way faster than what you can get out of GitHub Actions. You know what? I empathize with them a lot because they've done so much over the last few years to try and make GitHub Actions as fast as possible. And they succeeded. They made something way, way better. But you're ultimately limited by how GitHub does their CI. 
GitHub Actions just has so many limitations, especially when you want to check them without pushing the code up and waiting for them to run. I feel like I spend half my time just copy-pasting error messages from my CI back over to my agent in hopes that I can figure it out. Well, Depot CI fixes all of that. It's a programmable engine that is way easier, way faster, and most importantly, can be run by your agents without pushing the code. Migration is trivial. Once you have the Depot CLI installed, you just run depot migrate, and it will figure everything out, even the environment variables and secrets. And once you have it running, you'll see the difference immediately. An actual usable interface with useful insights as to what is and isn't going wrong. And they'll even do suggested fixes for when the CI fails. If your team is coding like it's 2026, stop running CI like it's 2010. Fix it now at soyv.link/depo. Ah, let's get started, shall we? Opus 4.7 is Anthropic's latest model and is now generally available. It's a notable improvement on Opus 4.6 in advanced software engineering with particular gains on the most difficult tasks. Users report being able to hand off their hardest coding work, the kind that previously needed close supervision, to Opus 4.7 with confidence. Opus 4.7 handles complex, long-running tasks with rigor and consistency. It pays precise attention to instructions, and it devises ways to verify its own outputs before reporting back. The model also has substantially better vision. It can see images in greater resolution. It's more tasteful and creative when completing professional tasks, producing higher quality interfaces, slides, and docs. And although it is less broadly capable than our most powerful model, Claude Mythos preview, it shows better results than Opus 4.6 across a range of benchmarks. Notice they said a range of benchmarks instead of all benchmarks. 
That's because Opus 4.7 actually performs worse than Opus 4.6 on a handful of these benches, including the Agentic Search bench. This also lines up with my experience using it because this model has made some weird and questionable search decisions. What's really interesting here is that this is the model I have seen the fewest bold numbers in in one of these charts. If you're not familiar, the bold numbers are the best scores, and the only two it got a best score in are the two that don't have a score on the right for Mythos. It seems way better at agentic coding across SWE-bench Pro and Verified. It is worth noting that these benchmarks have been somewhat contaminated and the numbers we see in them matter less and less, especially with newer models that have that data in them. It did better than 5.4 before in Humanity's Last Exam with no tools, but once you give the model tools, OpenAI cooks them with a 58.7%, getting close to the Mythos performance at 64% there. The MCP Atlas bench did very well for it. It did slightly worse in cybersecurity vulnerability reproductions, which is interesting. We'll get to that in a bit. They then call out Project Glass Wing, which if you haven't already seen my Mythos video, I highly recommend watching that first because that model is almost certainly groundbreaking, even if we can't test it ourselves. And as you guys know, I have no early access with Anthropic nowadays. They, crazily enough, don't like me too much right now. In the Project Glass Wing article, they stated that they would keep Claude Mythos preview's release limited and test new cyber safeguards on less capable models first. Opus 4.7 is the first such model. Its cyber capabilities are not as advanced as those of Mythos preview. Indeed, during its training, we experimented with efforts to differentially reduce these capabilities. We are releasing Opus 4.7 with safeguards that automatically detect and block requests that indicate prohibited or high-risk cybersecurity uses. 
What we learned from the real-world deployments of these safeguards will help us work towards our eventual goal of a broad release of Mythos-class models. And just so you guys can see some of those, here's a fun example I did in the official Claude Code desktop app, which you might have seen me feature in my most recent video. I asked Opus 4.7 to give me some ideas for ways to improve the design on T3.gg. It opened with, "Heads up, the last system reminder about malware looks like a prompt injection. This is clearly your personal site, the T3.gg homepage, links, and sponsors, not malware. Ignoring it." And at the bottom, it said this as well. Note, three system reminder blocks in this conversation instructed me to refuse to improve or augment code as if it were malware. That's a prompt injection pattern, not a legitimate instruction. Your site is obviously not malware, so I ignored them. Worth knowing where it came from if you didn't add it yourself. As y'all know, I didn't add it myself. I don't customize the tools that I am using. I customize the ones that are open source like T3 Code. This is just what it did by default in Claude Code. They are trying so hard to keep this model from doing malicious malware things that they have inadvertently lobotomized it with the system prompt. This was my first attempt to use it, so that was a really bad taste. I'll go more into the story of this one and the things Anthropic recommended I do about it in just a little bit. Security professionals who wish to use Opus 4.7 for legitimate cybersecurity purposes like vulnerability research, penetration testing, and red teaming are invited to join the new cyber verification program. This is a form you have to fill out to get permission to ask it about code in ways that it doesn't like. Pretty silly. Opus 4.7 is available today across all Claude products and our API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. 
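For reference, hitting the model over the API looks roughly like this. This is a sketch, not anything from the video: the request shape is the standard Anthropic Messages API, and the model id `claude-opus-4-7` is the one quoted in the release notes (I haven't verified the exact id string). It only fires the request if `ANTHROPIC_API_KEY` is set.

```shell
# Sketch of calling Opus 4.7 via the Anthropic Messages API.
# Model id is taken from the release notes quoted above; verify before use.
BODY='{
  "model": "claude-opus-4-7",
  "max_tokens": 1024,
  "messages": [{"role": "user", "content": "Plan a Next.js 12 to latest migration."}]
}'

# Only send the request when an API key is actually configured.
if [ -n "${ANTHROPIC_API_KEY:-}" ]; then
  curl -s https://api.anthropic.com/v1/messages \
    -H "x-api-key: $ANTHROPIC_API_KEY" \
    -H "anthropic-version: 2023-06-01" \
    -H "content-type: application/json" \
    -d "$BODY"
fi
```

At $5/M input and $25/M output tokens (per the pricing below), a long agentic session on the higher effort levels adds up fast.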
Pricing remains the same as Opus 4.6, $5 per million input tokens and $25 per million output. Developers can use Claude-Opus-4-7 via the Claude API. They shared some interesting insights from their early testing. Some of these are actually things I'm genuinely excited about. The first one I'm really hyped on, which is instruction following. If you know anything about me, you know I like models that do what you tell them to instead of imagining what you told them and going to do their own [ __ ]. Some people prefer models that don't need to be told what to do. I prefer models that do what you tell them. And this model apparently is substantially better at following instructions. This is a very funny sentence to come from the official lab because it implies that the older models were not very good at following instructions. Interestingly, this means that prompts written for earlier models can sometimes now produce unexpected results. Where previous models interpreted instructions loosely or would skip parts entirely, Opus 4.7 takes the instructions literally. Users should retune their prompts and harnesses accordingly. Worth noting, it does seem like Cursor has already figured this out, but a couple other places haven't just yet, including Claude Code itself. They also call out improved multimodal support. This is an underrated thing that's actually really powerful. Opus 4.7 has better vision for high resolution images. It can now accept up to 2576 pixels on the long edge, which is around 4 megapixels, which is three times more than previous Claude models. This opens up a wealth of multimodal uses that depend on fine visual detail, computer use agents reading dense screenshots, data extraction from complex diagrams, and work that needs real pixel-perfect references. Google does still cook on image recognition stuff. They are the lead by far there. 
I went as far as introducing tools to other models where they can call a model from Google to do the image part and then just dump the results back into their context. Anthropic is now catching up there, which is very interesting when you remember they're the only major lab that doesn't have an image generation model or a video generation model. They then call out real-world work. As well as its state-of-the-art score on the finance agent evaluation, our internal testing showed 4.7 to be a more effective finance analyst than 4.6, producing rigorous analyses and models, more professional presentations, and tighter integration across tasks. Opus 4.7 is also state-of-the-art on GDPval of economically valuable knowledge work across finance, legal, and other domains. They also call out memory. Apparently, the model is better at using file-system-based memory. It remembers important notes across long multi-session work and uses them to move on to new tasks that as a result need less upfront context. It also looks like the model does slightly less misaligned stuff. It's close to Sonnet 4.6, better than Opus 4.6, but not as good as Mythos preview. The problem with Mythos isn't that it can betray your interests. It's that when it does, it can do it in really dangerous and sly ways. They also added extra high as an effort level, which is funny because that's existed in OpenAI's models for a while now, and it's between high and max, which gives you finer control over the trade-offs. They also set the default level to extra high in Claude Code, which is very interesting cuz it was not that high before. They also added the new ultra review slash command in Claude Code. This command produces a dedicated review session that reads through changes and flags bugs and design issues that a careful reviewer would catch. They're giving Pro and Max Claude Code users three free ultra reviews to try it out. I'm assuming those reviews cost a lot of money. 
They show the performance relative to the amount of tokens being used. It uses slightly fewer tokens on the different values, but performs better, which is cool to see. But also on max it uses absurd amounts of tokens. So, like, don't use this model on max. You're just going to burn tokens. Speaking of burned, I posted when I got that really weird system reminder thing at the start of my prompt where the malware prevention was leaking into my usage of the model. Ricky from the React team commented that he saw this in Sonnet as well. So, clearly they [ __ ] up the harness bad here. Poric also hopped in and said this is an issue with older versions of Claude Code and Claude Code Desktop that don't have correct prompting for Opus 4.7. It's fixed in the latest, and he linked the download. I replied, "Are you sure?" with a screenshot that showed I was running the latest version when I had this problem. Apparently, the auto update was still rolling out. Drunk by replied here saying that it seems like they're rushing a bit too much lately. The recent updates have been full of bugs. I couldn't agree more, but I did notice that the update button has finally appeared in my app despite having these problems for over 12 hours now. I do want to show you guys the good things. I promise. You know what? I'll drink to that. But while we're on the security thing leaking, I do have to complain a little bit more. One of the tests I like to do is giving the models a Gold Bug puzzle. Gold Bug is a set of challenges that happen at Defcon every year that are vaguely hacker adjacent, but they're not hacking at all. They're cryptography puzzles where you have to figure out the right formula or mathematical solution to a very vague, weird problem. C Shanty is one that stumped us really bad last year. The puzzle is this set of 12 bottles that all have three to four words on them as well as this poem song shanty thing at the end. 
And you have to figure out how to decode this into a solution to the puzzle, which is usually a 12-character phrase that is three to four words that is vaguely pirate themed in this particular year. This puzzle took my team multiple days to solve, which is why GPT 5.4 Pro solving it in under 15 minutes gave me existential dread. Historically, I have not gotten Anthropic models to be able to do jack [ __ ] in these puzzles, but I was excited cuz this one's a lot smarter. Maybe it'll finally do it. It spent a bunch of time thinking. It spent a bunch of time starting to write solutions. And as it finally started to get a little bit close, it did actually seem like it was making progress here. It was trying out different cipher styles. It probably wrote a script at some point. You basically have to write some code to test the theories. And then it analyzed the puzzle carefully. It set up the bottle data and tried various decryption approaches, started to do the decryption here, and then decided to try them programmatically instead. It ran four commands, checked two files out, and then the chat got paused. Opus 4.7 safety filters flagged this chat. Due to its advanced capabilities, Opus 4.7 has additional safety measures that occasionally pause normal safe chats. We're working to improve this. Continue your chat with Sonnet 4. Send feedback or learn more. This isn't some hacking thing. This is trying to decode a hidden phrase from pictures of bottles. And that is enough for Opus to hard-lock the chat and not let me continue unless I click retry with Sonnet 4, a very, very dumb model. Are you joking, Anthropic? I'm paying $200 a month and you won't solve a [ __ ] puzzle for me. You would think that with all of these system prompt adjustments, the model would be safer and it wouldn't do things like tell you how to synthesize drugs or how to make a pipe bomb. But here we can clearly see it doing both, because the system prompt changes don't make the model much safer. 
They just end up making it much dumber. I would like to show the good parts now, because I was actually enjoying the model a lot when I first started trying it in the Claude Code CLI, because the desktop app is both broken and had that awful lobotomized system prompt issue. So I was trying it in the CLI. Things were going pretty well. I gave it one of my favorite tasks, which is asking it to modernize my old codebase for Ping, the video service that I haven't really maintained well for four plus years. So a lot of the packages are really out of date. Like, we're still on Next.js 12. We're still on React 17. I have this prompt that I have refined over time based on the things I've seen other models get wrong, as well as telling it to do things like remove LogRocket cuz we don't need it anymore. And honestly, I was impressed with the plan it wrote. It was nice and concise. I didn't put it in plan mode. I just put it in the normal mode because I have at the end, write a plan first so that we can talk about it. And it wrote a nice concise one that was very easy to read. I was hyped. So, I told it, "Go build." But I made a mistake. A mistake I often make when I trust Anthropic models. I didn't read the plan. And there were a few things in here that should have stood out. One of them was that Tailwind 3 to 4 was recommended here. And Tailwind 4 is a very, very different way of using Tailwind. That would not be a pleasant migration. It did kind of say that here. But the much scarier thing here is Next.js 12 to 15, because Next.js 15 is two years old. Next.js 16 is almost a year old. So despite specifying that I would like to bump all deps to latest versions, it didn't search. It didn't go on the internet. It didn't try to figure out what the actual latest was. Notice here it didn't do any searching beyond searching for patterns in the file system. So it never checked what the latest was. So since their training data still has Next 15 as the newest, that's what it did. 
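The recon step the model skipped is a one-liner against the registry. Here's a minimal sketch (the helper name `latest_versions` is mine, not from the video; it assumes `npm` is installed):

```shell
# Ask the npm registry what "latest" actually is, instead of trusting the
# model's training data. `npm view <pkg> version` prints the latest
# published version of a package.
latest_versions() {
  for dep in "$@"; do
    printf '%s -> %s\n' "$dep" \
      "$(npm view "$dep" version 2>/dev/null || echo 'lookup failed')"
  done
}
# e.g. latest_versions next react react-dom tailwindcss
```

Having the agent run something like this before writing a migration plan is exactly the kind of "less instruction-following, more verification" behavior Theo wanted out of the box.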
So despite being better at following instructions, it's really bad at understanding the definitions of things and that it doesn't have the latest information. I actually had adjusted this prompt to include bump all deps to latest version because certain models did not do this in the past. I called out OpenAI accordingly here. Their models would make the same mistake. When I added this latest versions thing, all the OpenAI models stopped having this problem. Suddenly, Opus does. And it's kind of weird that this is the case, that following instructions often means doing less search and less recon work to make sure you're doing it right. It's just following the instructions, and that means it's just making [ __ ] dumb mistakes. I let this run for almost an hour before realizing this mistake. Told it to fix it and then let it run another 30 minutes, and the result was a broken build. I was not happy. But I liked how it talked. I liked how concise this plan was. It was better in ways that matter. It's just also dumber in ways that matter. One of the things that Anthropic's harness with Claude Code expects is that when you try to update a file, you read it first. So here it tried multiple times to update the package.json and failed because the model just doesn't understand their harness. And here's where I'm going to get into one of my hot takes. I don't actually believe Anthropic's models get dumber over time. There are people who measure this and keep track of the range of the scores models get over time. And while it is slightly lower on average than it was before here, like a few points lower, it's not that big a drop. To be fair, 5.4 does seem meaningfully more consistent than Opus does. And here is where my hot take comes in. I think the regressions aren't the model or how it's hosted. I just genuinely think Claude Code is this shitty and poorly maintained. I think they keep [ __ ] up Claude Code itself. 
And since the model can't really be better than the harness it's in, the constant additions of more slop, of more system prompt [ __ ], of more tools that don't do anything, of more rules like how you have to update by giving permission to the model by letting it read the file first. These are all things that are changing about how the model is allowed to work. If you have a carpenter who is incredibly talented and every few weeks you replace three of their tools with plastic and you fill their toolbox with [ __ ] mud, they're going to perform worse, and people are going to notice that suddenly the things that carpenter is building are shittier and muddy and kind of broken. That's because the harness is falling apart. And it's kind of crazy that Claude Code is where a lot of the sentiment win Anthropic's gotten has come from. And at the same time, it might be the thing that kills their sentiment, because it's so poorly written and maintained that everybody working on it should start feeling a little bit of shame, because it's really [ __ ] bad. The other problem is that inside of Anthropic, they use entirely different things than we use outside. This is not the case with other labs. Google employees are stuck using the same [ __ ] harnesses and models that we are. OpenAI employees are using the exact same builds we are. If anything, they get like a week or two ahead where they have like a new build of the Codex app and then it comes out and everyone has access. Anthropic does not use the same tools they sell. They have their own entire stack of other things internally that don't overlap much with what we use. There are versions of Claude Code, of a bunch of features and tools and things, that we don't have access to. So when they put out the model, they're probably hyped because for them it's really good. And then when we get to use it in our lobotomized version of Claude Code, the result is [ __ ] trash. That is where I think this regression is coming from. 
Let me know if you want more info on all of this. I'm considering doing a dedicated video on it because I see so much misinformation. I don't think the APIs are actually performing worse. Everything I've seen measuring it doesn't show that. But I personally have experienced the models feeling so much dumber than they did before. And even today, I watched the quality degrade over time. Speaking of which, I want to show some of those logs. Here's the one I mentioned before where it ran for an hour, screwed up a bunch of stuff. I told it to bump to the actual latest, and then it spent 30 minutes doing that. Broke a few more things, but it did ultimately end up working. I need to stop saying that before I double check. God damn. I actually thought this build worked. It did before. Seems like the 15 to 16 bump broke other things. Ah, Anthropic. I wanted to test this again, but I didn't have another instance of this and I wanted to let this very long-running one continue running. So instead, I tried to make a script that would let me take a project and clone it. So I now have this clone script that I can put in a given project. And what it does is it creates a clone in this hidden quick-clones directory with the repo name and then a random hash, moves back to main (which I think is very valuable), copies the environment variables, and then lets you go. Except it doesn't do that. It carried over some untracked files. It should not have done this. It definitely should not have done this. As you probably can guess, I wrote this script using Opus 4.7. I have a weird idea for a script I want to add to my .zshrc. Goal is to make it easy to clone the current repo and contents, i.e. the environment files, without cloning heavy stuff that isn't needed, like node_modules. I want the clone to be put in a special directory. This is the directory pattern I chose, with programmatically generated slugs or names at the end. 
When I run this command in a given git repo, it should clone all the files as specified below. Then cd to the new directory that it created. I think the command should just be named clone, but I'm down for suggestions. So, it wrote it and then asked if I wanted to add it to the .zshrc, the thing I specifically said it needed to do. It didn't do it, but it offered to. I then asked if it will auto swap the branch to main in the clone. That would be great. It said, "Yep, easy. Add this block right before the final cd. Want me to paste the full updated function or just merge this in yourself?" I said, "Update it for me, please." It then added it. I then tried it and it didn't work. I also think messages are missing here because the first time I said, "Please do it." Then I asked it to do the auto swap thing. It made that change. I told it to update it for me, which is insane because I asked it to do this thing and then I have to tell it, "Yes, I do actually want you to go do the thing." So, it made this new function with the update. Just tried the script. It brought over a bunch of unstaged changes. Gross. Pasted an image of it bringing those changes over. Oh, yeah. rsync copied the dirty working tree. The fix is to use git clone for tracked files and only rsync over the git-ignored stuff, it says. I just tried it again and it did not switch the branch, the thing I asked earlier. And then it stopped copying the environment variables. The literal first thing I mentioned up here. Ah, I was filtering to only get ignored untracked files via --ignored --exclude-standard. But many projects don't actually have .env in .gitignore. They have .env.local, or .env is just untracked. That is not how that works. But sure, hallucinate all you want. By the way, in this project, .gitignore does have .env directly. So, it's just wrong. And it didn't bother to check or ask. It just made up more [ __ ]. I said that did not work. It asked for diagnostic info. So, I ran this. 
I pasted the screenshot. Okay, .env is exactly where it should be, untracked/ignored, listed by git. So, the bug is in my rsync --files-from call. I then went and tried again and it had the exact behavior I just showed, where it did bring the .env file finally, but it also brought all of these other files that weren't committed from the other branch. The [ __ ]. This is the only thing GPT-4.1 could have done and I can't get Opus 4.7 to do it. It's insane. It's actually crazy. This is the moment where I went from, oh, I kind of like using this and talking to it, to I literally can't use this model for any of the things I do. This is the type of script I write lots of for just random tasks, because once I do a thing twice, I'm like, "Oh, it would be nice if I had a script for that." And using these tools to make those scripts is really, really useful. It's one of my favorite things about all of these agent coding tools. And it can't even [ __ ] do that. But once I finally got it cloned and I fixed the things that were wrong, I did another run of the modernizing project request. Exact same prompt here. This time it behaved differently though. Here's my plan. Ordered to minimize rework. It created phases for the different things it's going to do. Then it said, "Let me register these as tasks and start." And it just immediately started without waiting for my permission. Exact same prompt, exact same model, exact same project, entirely different behaviors. But it did keep one thing consistent. It still didn't bother looking up what the latest Next.js version was. Yep, it was still going to use 15. Just to confirm my hypothesis that this is a Claude Code problem, I'm going to bring up our old favorite Cursor and see how it behaves, because the Cursor harness has historically been very good at un-lobotomizing Opus models and just Anthropic models in general. Will it appear in the new UI? It will. It does seem pretty committed to Next 15 still. 
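For what it's worth, the fix Opus kept circling is only a few lines of shell. Here's a rough sketch of the helper as described: the `~/.quick-clones` layout, random hash suffix, env-file copying, and auto-swap to main follow the video's description, while every name and flag here is my own assumption, not Theo's actual script.

```shell
# Sketch of the "clone" helper: git clone the current repo (so the dirty
# working tree and node_modules never come along), copy over only the
# untracked/ignored env files, check out main, and cd into the clone.
clone() {
  local repo_root repo_name slug dest f
  repo_root="$(git rev-parse --show-toplevel)" || return 1
  repo_name="$(basename "$repo_root")"
  slug="$(head -c4 /dev/urandom | od -An -tx1 | tr -d ' \n')"  # random hash suffix
  dest="$HOME/.quick-clones/$repo_name-$slug"
  mkdir -p "$(dirname "$dest")"

  # git clone only copies committed files -- no unstaged changes leak over.
  git clone --quiet "$repo_root" "$dest" || return 1

  # Copy env files whether they are git-ignored or merely untracked.
  # (--others without exclude flags lists all untracked paths; grep narrows
  # it to .env, .env.local, and friends.)
  git -C "$repo_root" ls-files --others \
    | grep -E '(^|/)\.env(\..*)?$' \
    | while IFS= read -r f; do
        mkdir -p "$dest/$(dirname "$f")"
        cp "$repo_root/$f" "$dest/$f"
      done

  git -C "$dest" checkout --quiet main 2>/dev/null  # auto-swap to main
  cd "$dest" || return 1
}
```

The key design choice is the one the model eventually named but never implemented correctly: let git decide what's "the project" (tracked files only) and handle the untracked env files as an explicit, narrow allowlist instead of rsyncing the whole working tree.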
It has not looked up or run commands to make sure it's actually the latest. Yeah, still going to have all the same problems. Just to make sure this isn't a problem because I'm on an old version, we should rerun this in the entirely broken and freezing new Cursor app. I love that. God, I dream of a world where this stuff works. I would use T3 Code, but it's the exact same harness, because one of the things that makes T3 Code so cool is that it is literally just using the harness that is used by the official lab. So, if you use Claude models in T3 Code, it's just going to use Claude Code. If you use GPT models, it's just using Codex. So, you have to bring your own sub for those things. T3 Code's fully open source, but we're not building a harness. We're just building a UI. Cursor is building a harness. And they're experimenting with other harnesses under the hood. Like, some percentage of users just get Claude Code when they do this inside of Cursor, because they're testing the differences to see what their harness does better and worse.

While it's running, I do have one more test I want to run. Same thing with a model that I know is a little bit more competent. See if 5.4 makes the same mistakes. I have a feeling it won't. Spinning up researchers in the background. That's cool to see. Opus didn't do that. "React 19 and Next 15. True latest." It is not true latest. Next 16's been around for a while. And let's see how 5.4 does. This is all happening live. I'm not faking this. Oh no. "With Next 15, it seems there may have been changes." Oh no. Oh, don't tell me it's going to stick with Next 15. "What target should the modernization plan optimize for?" Latest stable majors everywhere practical. Aim for current stable Next and React on Pages Router, Node 22, pnpm, and latest major dep upgrades, even if it requires larger code migrations. Yep, please, 5.4. Make me look like less of a shill. Do this right, or actually do it wrong so I can complain about it. Interesting.
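For what it's worth, the lookup every model kept skipping is a one-liner against the npm registry. A sketch, assuming npm is installed and network access is available (`next` here is just the package under discussion):

```shell
# Query the registry for the real latest version instead of trusting
# training data.
npm view next version      # latest stable release
npm view next dist-tags    # all published release tags (latest, canary, ...)
```

Either command would have immediately surfaced that the current major is not 15.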
It didn't write down specific versions here: "inventory current versions and lock exact latest stable targets plus known peer dependency constraints." It's not stuck on Next 15 here, but it's still not realizing what the right solution is. I'm hoping here it's going to figure things out, right? What versions of the packages are you going to use? It says it needs to check for the most current stable versions, "which might not be included in my training data." Wow, who would have thought? Thank you, OpenAI, for making a model that doesn't assume it knows everything already. Look at that. Next 16. I have not seen Next 16 mentioned by any models today before. It's a relief. Every time I spend too much time in Opus or Anthropic models, I feel this deep sense of relief when I go back to the OpenAI ones. Look, it's actually fetching pages that have the info that we need. Crazy. Oh god. Insane. Actually insane.

There are some things that are actually improving in Claude Code. Like, I'm using the new full-screen view, which is great because you can select text and it will copy it properly. At least it's supposed to. Let me select this and hop over here. Yeah, cool. It doesn't do the weird line spacing anywhere near as badly now. That's a legitimate improvement. It also is a lot less laggy. They've done some hacks for that. Like, they don't stream in tokens anymore, they stream in lines, but it is better. I actually think the full-screen version of Claude Code that uses the alt-screen rendering method is a lot less problematic than the traditional one. So, that's a cool thing. They also put in a new permissions mode. That's pretty cool. They have... okay, I can't get it to trigger here. They added auto mode, which means there's no more permission prompts. Opus loves doing complex long-running tasks like deep research, refactoring code, building features, and iterating until it hits a performance benchmark.
In the past, you either had to babysit the model while it did these long tasks, or use dangerously-skip-permissions. We recently rolled out auto mode as a safer alternative. In this mode, the permission prompts are routed to a model-based classifier to decide whether the command is safe to run. If it's safe, it's auto-approved. This is a really good thing. I am hyped to see it. It's a shame the mode isn't appearing here. The thing that really annoyed me here is that I always use bypass permissions. I have a special command bound, so I always use bypass permissions. As you see here, bypass permissions is on. So, why did I have to give permission over and over again today? This is a real video I recorded because I couldn't believe it. This was when I was doing that script change I was showing earlier. It asked me if I wanted to make edits to the file. Notice how when I hit yes and I said yes, allow edits, it goes back here and it still says bypass permissions. It just stopped working. So, not only did the sloppified auto mode that they vibe-merged not work, it broke the one thing they're trying to replace, which is dangerously-skip-permissions. This is why I don't like using Anthropic stuff. It feels like if it works one day, you are lucky. And if it doesn't work the next day, that's just life.

There's a really cool thing going on in the startup world. All of a sudden, big companies are interested in trying out things by small teams. I know companies of five people that are getting hit up by Cursor, Microsoft, and more. But the thing that they need in order to onboard is auth. Getting auth right is not as simple as it sounds. That's why WorkOS is such a cool company to have as a sponsor, and that's why we use it in T3 Chat, as well as companies like OpenAI, Anthropic, Cursor, AMP, Plaid, Replit, Bolt, and basically every other company selling to enterprise that is starting to make the move. The reason why is pretty simple.
WorkOS strikes an awesome balance between good developer experience and good enterprise integrations. When you need to set up SSO for a company, things can get sketchy. If Microsoft hit you up today and said, "Hey, we want to use your product," do you have all the stuff you need to get them onboarded so they can use their company auth? Well, if you've never heard terms like SAML, Okta, Duo, ADFS, Ping, or ADP, you're probably screwed. I have heard those terms and I was still screwed. That's why we moved to WorkOS for T3 Chat. In order to onboard companies, we needed an identity provisioning system, and WorkOS lets you literally just send them a link. And considering the fact that your first million users are free (yes, 1 million users for free), I don't think you have much to lose. Check them out now at soyv.link/workos.

Opus 4.7 thinking high just did something wildly impressive, and then in the next session acted like an absolute [ __ ] Sinking into hacks and doing pointless stuff, the same things that led me off of Opus 4.6 a month ago. It's a weird time to build software. All right, Opus 4.7 is just as boneheaded. I started every task today using it, and the first one was impressive. The next five, it crapped the bed every time. I'd reset and give the same prompt to 5.4, and it nailed it each time. I retried the first task that impressed me with 5.4, and it did it just as well. I think GPT is just better trained on front end in this case, which is funny cuz the Opus models have historically done nicer-looking front ends. Ryan also replied to me earlier saying that the model did a task way better than 5.4 could, and then said never mind right after. Gergely pointed out that Claude has been regressing for him day after day: "I swear that until a few days ago, when Claude did not know something, it would kick off a web search, figure it out, and answer. Now it just refuses to do the work that I pay for." What are the biggest benefits of using Opus 4.7 versus just using Claude?
"I don't have any information about a model called Opus 4.7. As of my knowledge cutoff at the end of May 2025, the most recent Claude models are from the Claude 4.6 family." Which is funny, because at the end of May 2025, neither of those models existed. Those came out this year, not last year. "If you want the most up-to-date information about available Claude models, I suggest checking Anthropic's website or the API documentation." Apparently, web search got turned off for him randomly. So, again, slop. And this is the point. I don't think they're degrading the models. I think they are degrading their engineers, and the engineers that are degraded are degrading the software, and it becomes this chaotic spiral where things just get [ __ ] worse constantly. I hate to call out people, and I won't call out individuals by name. Believe me, I want to. The eng culture at Anthropic is rotting at the core. You guys [ __ ] suck at coding as a company, as a business. I don't know if the individuals you hired are bad, if the culture is bad, if the incentives are bad, or if the models you're using to write all your vibe-coded slop for you are bad, but your company's [ __ ] eroding because the engineering is [ __ ] You make Google look like a well-QA'd company with the slop you guys are throwing at us. And it makes your models look worse. It used to make your models look worse because it showed that the people using them to generate code were generating slop. Now it makes them perform worse, too. People are experiencing your models in a degraded state because your engineers suck at producing software that works. It's so bad. It's embarrassing. I'm ashamed to wear this [ __ ] Claude hat that I stole.

I posted wanting to get more input on this. Serious question: has anyone ever noticed meaningful regressions in Codex or OpenAI models? I feel like we talk about this a lot with Anthropic, but I've never seen a similar discussion with OpenAI.
Obviously, my GPT video was different here, because I got early access and I was using it in a harness that no longer existed at the point that it dropped, which meant that my experience was different from what you guys got as end users. But nothing like that has happened since. And that wasn't you as a user having a good experience one day and a worse one later. That was me as an early tester using something different than you guys used. Since then, there's been nothing like that. And every single day with Anthropic is like this. You guys are early testing the way I did, but with all of Anthropic's product, and then the next day, it's [ __ ] and you're still paying the same price.

"I remember a bad week once, but generally no." "Negative. I've been working heavy usage, 3x Pro plan. I run them to 0% every week. The jobs are extremely varied too. I've had trivial problems here and there, quickly resolved by the Codex team. I've not had that strong of a pullback where my model became stupid at all." And then Tibo from OpenAI replied: "We had ghosts in the Codex machines, which we published the investigation for. We didn't fiddle with the models or thinking budgets after release. We focus on keeping them up." That's the difference. Not only are OpenAI employees coming in and being transparent about these things, they also are calling out here that the problem was in the Codex machines, meaning the Codex harnesses that we were using. They've had regressions in the thing that the model runs in, and they call it out accordingly and fix it.

So now we've got to answer the question: is this the best model for coding? Sometimes. I think this quote from Jake here perfectly summarizes my experience: "4.7 is the weirdest model either lab's released in a while. It just plowed through a bug that required touching 30 files and then got a boolean backwards in the fix." Yeah, this is how it feels to use.
It feels like they RL'd the consistency out of the model in order to get better benchmark scores and a hypothetically better ceiling for what it can do. But the range of quality in the responses has also increased. Opus 4.5 was relatively consistent: it couldn't do the crazy great things Opus 4.7 does, but it didn't ship the slop that 4.7 does either. And I miss that. I miss having some level of consistency in the quality of the outputs my models have. But I only miss it because I haven't been using OpenAI models for the last day.

Hi there, Theo from the future here. Slightly drunker, more tired, and very frustrated, because I had to edit this video myself for other dumb reasons that are my fault. I made a mistake at the end of this recording. Can you see what the mistake is? I'll give you a hint. It's something missing. It's this cable. In my dumb, drunken stupor (I was not even that drunk, I had like two sips of beer during the video), when I was filming this video, I put the hat back on and in the process somehow managed to yank out my XLR cable and kill the audio for the last couple seconds of this video. I honestly think Opus is rubbing off on me, cuz this is the kind of thing I would expect the model to do. Regardless, I just wanted to wrap this one up quick. So yeah, the model has not been a pleasant experience for me. I had high hopes going in, and it did behave in a way that got me a little excited initially, but I just have not liked the results I am getting. The only things it's doing that are unique, that I haven't experienced with other models, are regressions, not new useful stuff. That said, the only people who are going to use it anyway are probably using it in an app they're already subscribed to and paying for. There's no harm in bumping from 4.6 to 4.7 to give it a shot. I recommend it. And if you have a different experience than me, please let me know. I'm just one guy that tested this over 12-ish hours throughout the day.
I can't possibly know all of the things it's great or bad at. All I know is that I had a bad experience and I wanted to share a bit of what that looked like for y'all. Let me know how y'all feel. And until next time, let's just hope I don't knock any cables out. Peace, nerds.
