Claude Code vs Codex vs Cursor (an honest comparison)

Theo - t3․gg | 00:37:56 | May 26, 2026

The creator sets out to dissect the real differences between Claude Code, Codex, and Cursor, focusing on philosophy and use cases rather than UI polish.

Theo argues that Claude Code, Codex, and Cursor differ more in philosophy and workflow than in raw capability, and urges readers to pick tools that fit their work style without chasing flashiness.

Summary

Theo’s honest comparison digs into how Claude Code, Codex, and Cursor are built with different priorities beyond model accuracy. He traces Claude Code’s rise as a terminal-first experience designed to feel productive, even at high token costs, and contrasts this with Codex’s more conservative, reliable, app-centric approach. Cursor, with its cloud-backed agents and graphical testing, represents a forward-looking vision that prioritizes collaboration and proof-of-work across devices. He emphasizes that model capability isn’t the sole differentiator; the teams’ philosophies about tokens, integration, and environment matter just as much. Through examples like Opus 4.5 and Mythos, Theo explains why model improvements don’t translate linearly into better developer experiences across all tools. He compares UX choices, such as Claude Code’s flashy UI versus Codex’s pragmatic interface and the browser-based Cursor workflow, and discusses how these shape daily work. Theo also critiques how each tool handles cloud environments and interoperability, and what each one incentivizes users to do (and spend). He closes with practical guidance: try all three in your own workflow to see which maps best to your needs, and remember that the best choice balances productivity, reliability, and future-proofing, not just Twitter-level demos.

Key Takeaways

Claude Code emphasizes a high-signal, token-heavy workflow designed to feel extremely productive, often prioritizing UI flair and sub-agent demos that show big numbers.
Codex prioritizes reliability and practical integration with local environments and app-centric workflows, offering less flashy but more predictable performance and useful features like computer use for verification.
Cursor showcases a future-focused approach: cloud agents that spin up a full graphical Linux instance, test real apps with computer use, and can be triggered from other devices and tools such as Slack.
Opus 4.5 and Mythos illustrate how model iterations have dramatically shifted tool adoption, highlighting that better models don’t automatically fix UI/UX or cloud verification issues.
OpenAI’s Codex leans toward token efficiency and practical verification that changes actually work, in contrast to Anthropic’s willingness to burn more tokens to solve a problem.
The ultimate recommendation is pragmatic: test all three tools in your real projects to determine which aligns with your team’s workflow, trust, and long-term goals.
Cursor’s cloud-first, Slack-friendly approach can be enterprise-ready, while Claude Code remains strongest in generating rapid, demo-worthy productivity gains.
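
The token-burn point in the takeaways above can be made concrete with some back-of-the-envelope arithmetic. The sketch below is only an illustration: the sub-agent count, tokens per sub-agent, price per million tokens, and pull-request volume are all made-up assumptions, not figures from the video or from any vendor's pricing, and the function names are hypothetical.

```python
# Rough cost sketch for a prompt that fans out to several sub-agents.
# Every number used here is a hypothetical assumption for illustration only.

def fanout_cost_usd(sub_agents: int, tokens_per_agent: int,
                    usd_per_million_tokens: float) -> float:
    """Estimate the spend for one run that spins up `sub_agents` sub-agents."""
    total_tokens = sub_agents * tokens_per_agent
    return total_tokens / 1_000_000 * usd_per_million_tokens


def monthly_ci_cost_usd(runs_per_month: int, cost_per_run_usd: float) -> float:
    """Estimate the monthly spend if the same run fires on every pull request."""
    return runs_per_month * cost_per_run_usd


# Assumed: 8 sub-agents, 500k tokens each, a blended $10 per million tokens.
single_run = fanout_cost_usd(sub_agents=8, tokens_per_agent=500_000,
                             usd_per_million_tokens=10.0)

# Assumed: the same audit wired into CI for 200 pull requests a month.
monthly = monthly_ci_cost_usd(runs_per_month=200, cost_per_run_usd=single_run)

print(f"one interactive run: ${single_run:.2f}")
print(f"same run on every PR: ${monthly:.2f} per month")
```

Under these assumed inputs, a run that feels cheap when started by hand multiplies quickly once it is triggered automatically, which is the scenario the video warns about when sub-agent audits are wired into CI.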

Who Is This For?

Developers and engineering leaders evaluating AI-assisted coding tools who want to understand the philosophical differences and pick the tool that best fits their team’s workflow and productivity goals.

Notable Quotes

"This video is about breaking down the philosophy differences between these solutions, not just what tasks they’re capable of."

—Theo frames the video’s goal as a philosophical comparison rather than a pure capability showdown.

"Claude Code is as much a marketing tool as it is a developer tool."

—Theo argues Anthropic optimizes for perceived productivity and social proof.

"Codex is for skeptical devs that want to use these tools in a way that is productive and leans into good workflows."

—Theo contrasts Codex’s pragmatism with other tools’ flashiness.

"Cursor is betting way out in the future where we’re not running these things on our computer at all."

—Cursor’s cloud-first, device-agnostic approach is highlighted as a long-term bet.

"If you hate coding and you want it to feel more fun as you code, Claude Code’s actually very good at that."

—Theo acknowledges Claude Code’s strengths in making coding feel engaging.

Questions This Video Answers

How do Claude Code, Codex, and Cursor differ in philosophy and workflow?
What are Opus 4.5 and Mythos, and why do they matter for AI coding tools?
Should I use Claude Code, Codex, or Cursor for enterprise team automation?
Can I rely on cloud environments with these tools, and how do they verify results?
Is Cursor’s cloud-first approach realistic for everyday development teams?

Claude Code, Codex, Cursor, Opus 4.5, Mythos, token burn, AI coding tools, CLI vs GUI environments, OpenAI Codex app, cloud environments

Full Transcript

This is not going to be my usual video. I want to try and break down the real differences between things like Claude Code, Codex, and Cursor. Not in the "how smart are they" sense or the "what tasks are they capable of" sense. I want to break down the philosophy differences between these solutions because I find that they are way more different than people seem to think. I know that we've all used Claude Code and Codex, and maybe you've even tried out the Cursor CLI, maybe you use their IDE, maybe you use their cloud environment. Whatever it is, many of you guys have tried all of these tools. Many of you haven't tried some of them for a while. Many of you have one that you really like and you don't bother with the others. And if that's the case, I get it. If you're already using Claude Code and you're happy, I'm not here to try and change your mind. I want to try and explain philosophically what all of these things are trying to do and how they differ. My main reason for doing this is my own philosophy has changed around doing agent development since I moved from Claude over to Codex. Obviously, it also changed when I moved from Cursor to Claude Code, but it's changed in a way that I like a lot more this time. And I find that when people are talking about Claude Code versus Codex, they're often just talking about the model or how the UI looks and not about how different the core of these things are and how differently the teams building them are thinking about how we make software. To be more direct, this video is happening because this reply on Twitter got way more attention than I expected as I broke down some of the philosophical differences between how Anthropic and OpenAI are thinking about their coding tools. But I wanted to go way further than I did here. I want to really break down how these tools differ, how they can be applied effectively in your work, and where I think they might be going long term.
I'm going to do my absolute best to present as unbiased as possible because again, my goal here isn't to try and convince you to switch to a different tool. It's to try and help you understand what the different tools are meant to be built for and who they might be best used by. I don't want to sell you anything, okay? I do want to sell you one thing. Today's sponsor. Today's sponsor is an AI code review app. Wait, no, don't go. Trust me, this one's actually different. Macroscope's been blowing me and my team away. It's quickly become Julius's favorite tool that I've added to our code bases. It's what we run on T3 Code and Julius loves it there. But that's not my favorite part. My favorite part is all the cool status stuff they do. Keeping track of what my team is doing is hard as small as the team is because Julius just ships so much. Not only does it estimate how much time people are spending coding, which gets broken by Julius's absurd output. 413 hours sounds insane, but when you see his commits, you understand what it means there. The thing that's way cooler that it gives me is the status updates that show what my team's been up to. Over the last week, Julius has added multi-provider source control for T3 Code, PI extensions, and VS Code themes, T3 chat, model routing, billing, and UX, desktop effect framework rewrite. All of these things that just I wouldn't know about because it's hard keeping track of everything moving on my team that it gives me as the lead. It's super useful. And being fully transparent, I've made hard decisions about team structure stuff around the results I got from this cuz it gave me so much better insight on what's going on. The review side's even cooler, though. The run agents are genuinely awesome. You can programmatically set up specific like lint rule style things you want in your codebase as a directory and then it just runs with macroscope. 
Just create the file in Macroscope, describe what you want it to do and what files you want it to check, and it will now run this on every single code review going forward. Ben's been using this a ton on his projects. In particular, he's been using it for automatic PR labeling, surfacing reviews, correctness checks, and all the usual stuff that you would expect. The PR labeler in particular is really useful. I hadn't seen this till now. Just showed it to me and I think I'm going to go add this to T3 Code. It would be super helpful over there. It's cool when I figure out my sponsors are even better than I thought they were when I'm filming the ad because they paid me to talk about them. They didn't pay me to do this. See why I like them so much at soy.cope. First, I want to start with Claude Code because it's the one that people seem the most familiar with nowadays. It's what people think of now when they think about coding with AI. It has pretty much fully taken the mental position in the space that Cursor used to have. Like, I remember in the YC batches where 90% of the batch was using Cursor. Now it's like 70% is using Claude Code. The shift happened very aggressively. So how should we be thinking about Claude Code? The goal of Claude Code was initially Boris's attempt to think into the future. What could code look like if the models were way smarter? How would I want to use these things? In a world where people are going to claude.ai, pasting some code, asking the model for thoughts, getting some changes, then copy-pasting them back into the codebase, it was obvious that there was room to improve. And Cursor had made some improvements there, especially with the chat interface in Cursor where you could talk to the model about the code and watch it go make those changes. Claude Code was different for a handful of reasons. The biggest by far being that it was built to work in your terminal.
Instead of forcing you to adopt a whole new workflow, instead of forcing you to switch over your IDE to some other thing, instead of making you install a different app or write code in a different way or move everything to the cloud, Claude Code met developers where they were, their terminals, and they wanted to make a great experience editing code with AI in your terminal. This means they have a lot of inherent compromises they have to deal with. Things like pasting images will never quite be elegant. You can't see the thing you pasted. Things like selecting text and all of the weird spacing that things have because they kind of have to. I'm using the new full screen mode to prevent flicker because the flicker kills me. But the new full screen mode is doing a whole different way of rendering to the terminal, which means they have to fake all of the UI elements themselves instead of using the traditional scroll buffer behaviors. It has its benefits, it has its negatives, and it's what I prefer when I use Claude Code. They're hoping to make this the default in the future, but yeah, it's meant to just be used in your terminal. I really want to emphasize the adoption story that Claude Code is pushing. They want it to be as easy as possible to get started using Claude Code. You paste the command to install the CLI and now you're using it. You do have to go auth quick with your browser or paste an API key or something along those lines. Initially, it was API key only, which is still crazy to think about, but the whole idea is that it's easy to set up and use without having to change any of your other tools. You would spin up Claude Code, make changes, go look at them in VS Code, commit them, do whatever you want. It would just be another tab in your terminal. But over time, as you can guess, people wanted more. People wanted things like automatic git commits and automatic pushes to GitHub and noticing when changes happen in your PRs.
People wanted to connect Claude Code to other sources of information through things like MCP. People wanted Claude Code to be able to be triggered remotely in order to control it from their phone or use it in a better UI and all these types of things. People wanted Claude Code to be able to do more work at once and work on things in parallel. Anthropic tried to not overstep in this way initially, but at some point, I don't really know what changed, they started to embrace this stuff much more. They had things that would let you build into Claude Code, like their hook system, that kind of stopped being iterated on. Some of my fun demos I built with Claude Code back when I was using it more, like the app that would only let you use your browser and things like Twitter if you had something running, and whenever your agent finished it would lock you out of Twitter until you went and responded to get it running again. A lot of those things had rough edges because the hook system wasn't fully implemented. I would say the moment of no return, like the oh [ __ ] moment for Claude Code, was definitely when Opus 4.5 came out at the end of last year. That jump in model capability made it much more compelling to give a model access to everything through the terminal. The idea that it could grep through your codebase to find things, that it could run commands to deploy things, that it could use MCP to get data and do so much more made it way more pleasant to use. And I found myself leaning into Claude Code way more heavily because I could trust the code more. So I didn't need a UI where I was reading it actively. I did want a terminal because I wanted it to have access to all the things you can do in a terminal. And the biggest thing you might notice at the bottom here: bypass permissions on. The model got smart enough that I could trust it to not run malicious commands very often, if at all.
And now to this day, I run almost all of these tools in bypass mode or yolo mode or something along those lines. Opus 4.5 was good enough to cause a huge industry shift where lots of people started leaning into Claude Code more because the model caught up with this style of tool, and this was a huge moment for Anthropic, especially in places like Twitter. They also noticed what types of things would perform well on Twitter. People love terminals. People love doing cool things with agents. People love when their tools get better without doing anything. And they love this feeling of the stuff they rely on and love improving. And I am going to drop a somewhat controversial take here that I've said many a time. Claude Code is as much a marketing tool as it is a developer tool. Anthropic is largely using Claude Code to push the idea that Anthropic is the best thing to use for building with AI. A lot of the features and the cool stuff they add to Claude Code are really optimized for the Twitter screenshot. Things like the pet mode that they added were definitely so people would screenshot their numbers and share them on Twitter. Things like the sub-agent mode burn your tokens super hard, but it looks and, most importantly, feels super productive. I remember the first time I spun up a bunch of sub-agents in a team. It felt incredible to have so much stuff happening with so little work on my part. Claude Code is really optimized for that feeling of productivity. It really feels almost gamified. Little things like the cute guy in the corner here. The way stuff shows in the UI. The way that the teams spin up and check things out. I'll test an example here. Spin up a sub-agent for each example project here and audit it for opportunities to streamline the Claude Code experience. Even though I'm in the mode that is supposed to hide my email address, it still doxxes my email there. Thanks, Anthropic. We'll try again. See how much is happening in the UI here.
You see the tokens going up as it goes. It has these little updates. It has the loading states. It has the dots flickering. It feels like it's doing a lot. And now it's going to launch the audit sub-agents that it will spin up for these different things. And we can see at the bottom here, we can switch to those sub-agents and see what they're doing. It feels really cool and really productive. There's a lot going on. That's kind of the point. Claude Code wants to feel really productive. And to be frank, they are trying to make it better and they're trying to make it more capable, but generally speaking, Anthropic's philosophy around how to make the tool do smarter things is to let it burn more tokens. And things like all the sub-agent stuff, and everybody's already saying this in chat: five minutes later, you're down $200. Token, token, token. You spend two hours trying to fix this. Yeah, you guys get the idea. It's very willing to burn tokens in the interest of doing things in a way that looks and feels really productive and, to be frank, often can be. A lot of the improvements they're trying to make to Claude Code are finding new ways to burn tokens to potentially solve work and do more in cooler ways, but also, let's be frank, to look really cool for screenshots and demos on Twitter. This is also why, in my opinion, Anthropic is much less interested in letting you integrate Claude Code: it's token-maxing so hard that anything spinning it up programmatically risks absurd compute usage that they don't want. Imagine setting up a tool similar to what I just did there, where I ask it to spin up sub-agents to check things, but you implement that in your CI process whenever a pull request is formed on your project. Then you're going to burn a shitload of tokens you might not mean to. And that's a thing Anthropic doesn't want you to do. If you're burning tokens with Claude Code, they want it to look and feel productive because that's what they're trying to sell.
They are eating a bunch of token spend to trade off for the feeling of being really, really productive. And if you're not looking at the Anthropic Claude Code UI when that happens, it feels much less productive. They're not getting what they're trying to get here. This is also why things like their desktop app are heavily underinvested in. This is also why things like T3 Code aren't as interesting to them and why they're not letting us programmatically call the Claude Code CLI anymore without charging you way more money. They want you in their CLI using their features, having their experience, and they also want you building integrations into Claude Code itself. Because if a tool like T3 Code provided other skills or other resources, MCPs, etc., and I can switch over to Codex and not have those things get stuck in Claude Code, because they actually come over when I switch between providers, then Claude isn't getting the lock-in that Anthropic wants it to get. So they really optimize for that. Claude Code is meant to be the world's greatest showcase of what you can do with Anthropic models, at a lot of cost to Anthropic, in order to win a bunch of sentiment so people feel, and arguably are, more productive using Claude models in Anthropic's tools. And you can bet your ass they went hard modifying the UI to make it feel this productive. A lot of the performance issues that used to exist in Claude Code were because they were doing all these fancy loading spinners and things, which resulted in the performance being degraded, which was a hit they were willing to take because they were optimizing so hard for the feeling. Hey, to those in chat saying I'm being too nice: I'm trying to be, because I'm not trying to shame people who like Claude Code and are using Claude Code. They engineered it to be like a slot machine. They did a great job at it. It got me addicted to doing agentic coding.
I wouldn't have gone as hard as I did at the end of last year and the start of this year if it wasn't for Claude Code optimizing itself for this type of thing. Let's contrast this with Codex. I have it on medium fast, which is way faster than Claude Code, so I'll actually turn off fast mode. And now I'm going to paste that same prompt. Spin up a sub-agent for each example project here and audit it for opportunities to streamline. There is so much less going on here. You have a timer. You have the working text that's slightly animated and nothing else. It did spin up these two subprocesses, but I don't see anything going on in them. It's very minimal. This is way less addicting, is the way I would put it. And we sit and wait and wait. It's not trying to feel like a slot machine. It's not trying to feel crazy. It's just trying to get work done. And when you look at the types of features that are being added across these tools, you will really quickly see this difference. Just as an example of one of the recent things they shipped in Claude Code, they added /radio, which opens a 24/7 lo-fi Claude music stream. Like, I'm not going to sit here and say this is cringe or cool or whatever. I just think it shows the philosophy of what and how they are building. Like, they really are optimizing for the feeling of productivity and the killer thing to post on Twitter. Like, you know that in the feature review process at the company, they are thinking about what the tweet looks like before they build the feature. Meanwhile, let's compare that to the official OpenAI Developers Twitter, which is mostly just Codex updates. Now, this tweet did way fewer numbers. 8.2K likes. Codex anywhere and everywhere all the time. Now, your Mac doesn't need to be unlocked for Codex to use your computer. This feature is not very flashy. This feature is inside of the settings computer use section in Codex. You have to go hit this switch in there and then nothing cool happens.
You literally can't screenshot this feature because it happens when your Mac is locked. But it makes you way more actually productive. Let's keep scrolling. Ship paper cut. New diff marker settings in appearance. If you prefer the classic plus minus, you can turn that on. Otherwise, it'll just be color-coded. This is not flashy at all. This got 900 likes, but it's real and practical. They added a feature where Codex can bind a double command hotkey to copy what you're currently looking at in your app and throw it over to Codex to populate the context. I love this. I'm a big fan of using screenshots as context. And having a built-in thing like this is great. Like, it's practical stuff that doesn't demo great because the point of these things isn't to be cool demos on Twitter. It's, to be frank, to solve real problems that they have as they are building software using these things. Even computer use itself being such a core part of Codex. It didn't make that much sense to me. It seemed like it was one of those things that sounded really good but just wasn't that practical. But similar to how Claude Code wasn't useful until the models got good enough, computer use really wasn't useful until the models got good enough as well. And now that 5.5 is really smart at these things and I also have a dedicated Mac Mini in the other room that I use for a lot of my code tasks, Codex being able to use computer use to go verify its changes is awesome. And it's way more likely that by the time Codex tells me it's done with a task that it actually did the task correctly. This is also where another big philosophical difference comes in: where they want to use these tools. The vast majority of people at Anthropic using Claude Code are using it in the CLI. The vast majority of people using Codex at OpenAI are using it through the app. Funny enough, a lot of my friends at OpenAI are non-technical. They're not devs. They're on marketing and whatnot.
They all use Codex, the app, and they use it for all sorts of cool things because this interface is a big part of how they work. And it still is optimized to just be a good experience, not to be a flashy experience. Like, again, I'll do the same demo I did before. I'm just going to tell it to spin up a sub-agent in this project to audit all of the examples. There is very little going on in this UI. There is just the thinking here. Now, it's going to do a search. It's going to go check things. It's not trying to be flashy. It's trying to be productive. Let me update to the latest Claude app so it's less likely to break. Oh, great. I have to sign in again. Wonderful. Who would have thought? Why is it doing it through [ __ ] Safari? I [ __ ] hate Anthropic, man. This is so bad. I have to manually enter all this [ __ ] cuz I don't have 1Password on the, like, in-app Safari and it doesn't work with my passkey through 1Password. This is horrible. Holy [ __ ] You can tell they don't use it. Like, they just straight up do not use the desktop app. Even better, I'm now in. But it says at the top right, failed to log in. It may have been cancelled. They don't use this app. I don't know how else to say it. They clearly do not use this app. I'll update to the latest and hopefully the problems will go away, but it's probably just going to reveal my email again. There we go. We have Chat, Cowork, Code. Also, the first thing I noticed is that all the threads I did in the CLI aren't here. It doesn't sync across. The Codex one does. Just an annoying thing. How do I even add a new project in here? I don't know how to add a new project, guys. Do I need to, like, select here? Why does it start in the root, not where I already was? So much more is happening. You have the fancy animations for the text here. You have the token number bouncing up all the time. The counter, the super flashy different shaped star there that stays when it's done, by the way, too.
So, it's hard to know when it's actually completed. It's just a lot. And it took so much effort to even, like, get this project added. They didn't build this app for you to use it. They built this app so that it would exist and people would stop asking for it. Personally, I'm not that into the slot machine feel that Claude Code seems to be going for. I much prefer simple, reliable solutions you can really trust, like today's sponsor. Got to be so real with you guys. Hiring sucks right now. The quality of the résumés coming in looks like it's gone up, but it's all just slop. And the volume of résumés coming in is absurd. Like 10 to 100 times more than I've ever seen. Finding good engineers is really hard. And then convincing them to join is even harder. That's why G2I has become such an easy recommendation for me. These guys get hiring. They already have a network of engineers that they know and trust, and they know them so well. That's why companies like Batteround have moved to G2I for their hiring. They've already hired nine senior engineers and they're never going back. The biggest problem they had was that the leadership at the company had to spend so much time hiring, because finding good people requires leadership, not a bunch of money spent on recruiters that don't know what they're doing. As they said, Batteraround didn't need more résumés. They needed fewer, better candidates that they could evaluate quickly and confidently. If this quote doesn't win you over, then I don't think G2I is for you. I think the best part is the network, the vetting process, rather than just the peddling people, so to speak. It's the biggest difference for me. Before they used G2I, about 50% of their hires were success cases, according to them. But since moving to G2I, they'd say the numbers are around 90%. Stop wasting time and start hiring quality at soy. G2I. This isn't just about how it looks and feels to use the thing.
There's also a very practical "how they're thinking about the models and where they're going" aspect here. A lot of the bets in the direction that Anthropic is going in are that as the models get smarter, they should have the ability to do more. Like, if the agent is smart enough, it can write its own prompts for sub-agents and just spin up a bunch and let more tokens solve the problem. Generally speaking, Anthropic's strategy is: if more tokens solve the problem, use more tokens. And they are very quick to look for ways to solve problems with more tokens. Claude Code has memory leaks on Windows? Go rewrite Bun in Rust with a shitload of tokens. Like, it's insane how far they push that idea that more tokens can solve the problem. OpenAI does not go that way at all. Okay. Also, do you see how the way that this updates is, like, line by line, word by word, in a way that feels very visually engaging? That's by design. They do all of this to feel as exciting as possible. But again, to the philosophy thing here about the token usage, Anthropic will gladly burn tokens. Whether it's yours or theirs or anything else, they are more than happy to come up with solutions that require you burning way more tokens to solve the problem instead of finding better ways to solve and verify that the problem is solved. OpenAI's solution is much different in this regard. You can tell just from how many tokens they use for things like the Artificial Analysis bench. They are trying really, really hard to make their models more efficient. 5.4 mini, not the case. 5.4 mini burns tokens. Sonnet 4.6 burns a similar amount to mini. GPT-5.4 xhigh and Opus 4.7 burned a similar amount too, but GPT-5.5 used half as many tokens to get a better score. OpenAI is trying way harder to be token efficient, but they don't want to compromise on the accuracy of the solution. So, they're trying to find ways to know if the code worked or not using fewer tokens.
And that's where stuff like computer use is really powerful. If the model can make a change and then look at what it changed and verify if it succeeded or failed, it ends up being a lot less work than spinning up 15 agents to go triple check every line of code. This also means you need to have an environment that works well. This is where one of the biggest differentiators is. Anthropic very excitedly sets up tools to let you transport your thread from your machine to the cloud to finish it up there. But good luck getting the cloud environment for Claude Code to work properly. I tried and gave up. And the advice I got from my friends at Anthropic was, you don't really need the agent to run the code. Just throw the tokens at it. It'll probably get things done just fine. But when it doesn't, it sucks. And I like using the cloud runners for big tasks that I don't want to run on my machine. So the fact that it is not really able to verify the results because it doesn't have the environment configured correctly just kind of sucks. Codex was trying harder to get a good cloud environment, but it seems like it's become much lower priority. The easiest example to see this is in the Codex section of the ChatGPT mobile app. They hid the cloud option under like three menus. They're really not optimizing for cloud because they know how hard it is to get most projects running properly on the cloud, but you probably have it running fine on your machine. They know this because that's their own experience building. They struggled to get everyone at the company's projects working properly in Codex on the cloud. So they just made it so you can control your own machine with your phone instead. Anthropic's bet is the models will get smart enough they don't even need to run the code. They'll just write it correctly. OpenAI's bet is it's so annoying to configure things properly that we should probably just use the configuration you already have on your machine.
And now we need to talk about the third option, because their play in the cloud is very interesting. It's kind of crazy how quickly Cursor went from far and away first place to third. And I honestly think a lot of that's because people don't know where Cursor's real power is. You might notice my Cursor looks different from yours. It's because my Cursor is currently in the browser, not the IDE. That's because Cursor's cloud stuff is so far ahead of any of the competition, other than maybe Devin. Cursor cloud agents aren't just a crappy headless Linux sandbox that the agent can use to try and execute your code. The Cursor sandbox spins up a full graphical Linux instance that it can use to run your full app the way you would on a real computer, and then it uses computer use to test the changes. So here it ran the T3 Code codebase and then launched the browser version, because it's the exact same as the version that you would run in the Electron shell, in order to test its changes as it made them. "But Theo, I already have a computer. Why would I want to do that?" I understand. When I'm on my phone and I didn't set up a computer to control remotely, it's very nice to be able to go spin this off. If I want to work on two different things that need computer use at once, it's incredible to go spin this off. There are so many use cases where one computer is limiting, because one computer can only do one thing when you're computer-using it. And if it's not on, or not connected to the right network, or any of those other things, you can't use it at all. You're just kind of stuck. And there's something legitimately magical about being in Slack when someone brings up a problem with the product. You @ the Cursor bot and say, "Hey, can you go spin up an agent to fix this?" and it responds in the thread with a video of the fix it applied, showing you that it worked. This is so far ahead of anything that Claude Code or Codex can do, it's insane.
Another way of thinking about this is that Codex is betting really hard on where things are today. What is it like using agents to code right now? How do you make it more likely they succeed? Anthropic is betting on where the models will be in a few months. Also, on where the models they're using already are, because they're all using Mythos. They are betting on the models getting so smart that these types of things matter less. Cursor is betting way out in the future, where we're not running these things on our computers at all. We are just triggering them through whatever other tool we want and then getting back proof that the thing worked. That does bring up another really important difference, and probably the most painful philosophical difference between all of these things. I mentioned this before, but I really want to emphasize it. OpenAI uses the Codex app all of the time. There are thousands of employees at OpenAI that use this app every single day. The Claude desktop app does not get anywhere near as much love, largely because they're all using the CLI instead. But there's a problem here. The version and state of things that Anthropic employees are using is very different from what we are getting as end users. The model that a lot of those employees are using is Mythos. We don't have that. They have a custom build of Claude Code with a bunch of cool new hidden features in it that we don't have access to. This is why we end up with really embarrassing things like system prompt stuff leaking, because the system prompt that they use internally is different from the one that we get in the apps that we use. That's why I ran into problems like this when I tried out Opus 4.7 in the new Claude Code desktop app, where I asked it, "What are some design improvements I could make on this site?" and it opens with, "Heads up: the system reminder about malware looks like a prompt injection. This is clearly your personal site, the t3.gg homepage, not malware. Ignoring it."
We're using something entirely different from what they're using. And I don't know how to properly emphasize this, because people just don't seem to get that. You do not get the tool Anthropic employees get. You get what they feel generous enough to give you, and they are not testing it reliably enough. This leaks through the whole Claude experience. It is clear that this tool works great for them, and the portions of the thing they're using that they cut off and throw out to us are not tested well enough. And the result is that we get weird [ __ ] like this. I guarantee you that at least one of these two things, if not both, is true. Either this system prompt is not the one they were using internally, which is very likely the case, or this problem that I have here just did not happen with Mythos, which is what they were all using internally. The model we're using, the system prompt we're using, the harness we're using, the UIs we're using, all of those things are not the same thing that they are using at Anthropic. Meanwhile, the Codex app that we're using, the Codex models that we're using, and the Codex plugins and features and all of that that we're using are the exact same things that they're using at OpenAI. I can't cite my sources here, but you need to believe me on this. I have friends at all of these companies. It is very apparently the case, and you can tell externally that this is also the case. It's obvious. And to their credit, Cursor is very similar here. They use the [ __ ] out of the things that they are building. They obsessively test every detail before taking the exact thing they're using internally and throwing it at us externally. They even do crazy stuff like disabling every model other than the new Composer model and not telling people that they did that. So everyone thinks they're using Opus or GPT-5.5 or whatever, but they are dogfooding their own thing to see what issues people run into.
The effort that Cursor puts into making sure things are good before they put them out is insane. OpenAI is close to that same level. They just have issues. If anything, the thing I've experienced, which is funny, is that they have [ __ ] versions internally at OpenAI. Like, I ran into problems with GPT-5.5 during testing because the researchers didn't have the search tool on, because they don't want things leaking, which I totally understand. The point being, with OpenAI we get more than they have. Cursor gives us exactly what they have. Anthropic gives you less than they have. That said, Cursor is not really that focused on integrating with other things. They did just throw out the Cursor agents SDK, which is cool, and it took them long enough to do it. Anthropic is actively trying to push us away from using our subs outside of their UIs. OpenAI can't stop inventing awesome standards we can build on top of. To be really frank, my agentic coding app T3 Code that I built with Julius, mostly Julius, would not exist if it wasn't for the Codex app server, which is part of the open-source Codex CLI. They give us so much stuff to build on top of and around. We are building on top of the Codex CLI the exact same way the Codex app team is. I am admittedly disappointed that the Codex app itself isn't open source. I wish it was. It's another big part of why we built T3 Code. It was a good open-source option, but we built on top of the Codex app server, which was awesome to build with. And the Codex employees will just openly post on Twitter, like, "If you want to use Codex for something different, tell your agent to go to this repo and look at this folder and it'll figure it out." Cursor seems interested in building these types of things, but historically hasn't done that great of a job.
When we wanted to add Cursor to T3 Code, we had a lot of problems because the recommended path was through their Agent Client Protocol bindings that they had set up for the CLI, but the Cursor CLI was super behind and the ACP stuff even more so. You couldn't even pick modern models. We tried setting this up a few months ago, not even. And the most modern Composer model was 1, when 2 had already been out for a while. They didn't have Opus 4.6 or 4.7. They only had 4.5. They didn't have any of the new Sonnets. They didn't have any of the new GPT models. It was super limited, and they were all hardcoded in, so we couldn't pick newer, better models. We had to go back and forth with the team for a while to get them to fix that. They did end up doing it, and they ended up also putting out the SDK. So it seems like they want to make more of these integration things, but it's not a priority or focus for them. OpenAI is heavily prioritizing and focusing on this. Claude would really rather you don't do this. Anthropic is trying to keep you from building integrations with Claude Code unless they're deep inside of Claude Code itself. OpenAI wants you to do whatever the [ __ ] you want. They just want to see people build the future using whatever, especially if they're using their stuff. Cursor wants that, but isn't prioritizing it properly. One last piece that I think is important to understand: model improvements. Anthropic has to add a bunch of features because their public models haven't improved since December. I still feel like 4.6 and especially 4.7 are more so regressions than they are improvements. A lot of why Codex has improved is because the models got way better. 5.2 to 5.3 to 5.4 to 5.5 were massive improvements. In the time we went from 5.2 to 5.5, we went from Opus 4.5 to 4.7. So Opus got worse twice and OpenAI's models got better three times. That compounds massively.
If you took the same old version of the Codex CLI that was around at the time and switched it over to 5.5, it would feel incredible. If you took the old version of Claude Code from the end of last year and tried to use Opus 4.7 in it, it would probably crash, because Claude Code was getting so buggy, especially then with the performance issues. But if you could get it working, it would probably still feel worse, because the model is dumber and less willing to do the types of work I want to do. So Anthropic has to make up for this somehow. And they're doing it by shipping a shitload of cool-looking features and trying to make moments happen with Claude Code. They're trying really hard to seem like they're making a lot of progress. And to be frank, they're getting away with it. It's working. People say things like this on Twitter: "The Anthropic team is still innovating the most in the harness layer and it's not even close." The only reason you think this is because they do better on Twitter and they keep adding things, because they kind of have to. But since their actual better model is Mythos, and they're not releasing it for various different reasons that we can conspire about all day, their tool has to look better, because it's not able to improve just from models getting better. Separately, Codex is meaningfully improving constantly. 5.5 itself is enough of a reason to make the jump. This is definitely very different from other videos comparing these tools. I know most of them just send the same prompt and compare how good the answers are from all of them. Whatever. If you like the model, you like the model. If you like the tool, you like the tool. The things I care about more are: where is the team going with this? What problems are they trying to solve with this? And can I trust this thing to work as well tomorrow as it does today? I hope this helps give you a good answer to those things.
If you're still primarily coding using a sidebar in VS Code or Copilot or Cursor, please go try any of these, especially the desktop apps. They've gotten a lot better, and it's a much nicer way to code using AI. I'm so blown away with how much I like things like the Codex app or the T3 Code experience. It's just really good. The models are just one piece of this puzzle. And as great as the models from Anthropic might be, all these other things matter, too. So give these tools a shot. See which ones fit your desires and your workflows the best and go with that. If you hate coding and you want it to feel more fun as you do it, you want to feel good and like you're productive the whole time, Claude Code's actually very good at that. If you need help staying motivated, go hop in Claude Code. If coding is still scary to you, go try out Claude Code. It'll probably do a great job at that. If you've been an engineer for a long time and you want a tool that mostly stays out of your way and just gets [ __ ] done, Codex is phenomenal. It really feels like it's by and for engineers, and I hope it stays that way. Believe me, if Codex regresses, you'll hear it first from me. And finally, with Cursor, I recommend using it through the cloud side. The Cursor cloud is incredible. It's so powerful. It does things I never would have predicted that Cursor could do, or that AI could do as a whole. It's also one of the best solutions to set up for your team so non-technical people can kick things off in Slack. It's really the enterprise-ready, end-to-end solution. Highly recommend trying it if you haven't. The cloud stuff is so cool. So, to simplify: Claude Code for unmotivated devs or bad devs that want to feel like they're productive. Codex for skeptical devs that want to use these tools in a way that is productive and leans into good workflows that they're probably already doing.
Cursor for people that want to set their teams up for success with cloud runners that are able to solve real problems in simple ways using tools inside of their Slack or whatever. And to answer the final question from chat, what about Antigravity? It's a great way to secure your future by convincing your boss that AI is not capable of doing anything at all. And it is not useful for anything beyond that. And if you're asking about Pi, you don't need this video in the first place. Go play with the cool, fun things. Hopefully, this helps you understand the benefits of all of these different solutions, the philosophy of how they are building these different things, and the reasons why you might pick one thing over another. But in the end, you've got to just try all of them and see what maps to your workflow the best, and let the tool shape your flow a bit too. When I tried to treat Codex like Claude Code, I didn't like it as much. When I let it do its thing more, I ended up really liking it. Hopefully, this has been helpful. This is a very strange video for me to make, but I hope that it is as useful as I wanted it to be. Let me know what you all think. And until next time, peace nerds.