Cursor just crushed Claude Code
Overview of Cursor's small, fast coding model and why it stands out despite not being publicly benchmarked externally.
Composer 2.5 by Cursor is a game-changing, ultra-cheap code-focused model, and Cursor’s ecosystem makes it a compelling enterprise bet despite trade-offs in exposure and benchmarking.
Summary
Theo introduces Cursor’s Composer 2.5, highlighting how it distills open-weight coding prowess into a fast, cost-effective model usable mainly inside Cursor’s platform. He contrasts it with earlier Composer versions and rivals like Gemini, noting Composer 2.5’s impressive cost-to-performance on Cursor Bench. The host also pitches Blacksmith as a superior CI option, setting the scene for how model economics intersect with developer workflows. He dives into nuanced pricing factors (per-token costs, tokens generated per problem, and subsidization dynamics from labs like Anthropic and OpenAI) before explaining Cursor’s data leverage and the unique access model for Composer through Cursor’s tools (ACP, Cursor Cloud, and the SDK). The discussion covers training innovations (targeted textual RL, synthetic task generation, and dynamic task difficulty), potential reward hacking, and practical implications for how teams collaborate with AI agents on real codebases. Theo closes with practical reflections on when a fast, intelligent model shines (heavy, back-and-forth coding workflows) versus when heavy parallelism becomes too noisy, and he weighs Cursor’s long-term bets in the enterprise AI race, including SpaceX AI collaboration and potential future leadership in code models. There are candid critiques of Glass and notes on design aesthetics and UI prototyping experiments, all framed within a broader prediction: Cursor may leapfrog state-of-the-art code models in coming months given compute, data, and new training methods.
Key Takeaways
- Composer 2.5 costs $0.50 per million input tokens and $2.50 per million output tokens, while outperforming Composer 2 and other mid-range models on Cursor’s cost-per-task bench.
- Cursor Bench ranks Composer 2.5 between GPT 5.5 high and GPT 5.5 xhigh in quality, yet at a fraction of the cost of Opus and GPT-5 variants, enabling cheaper, longer coding sessions.
- Composer 2.5 uses targeted textual feedback and 25x more synthetic tasks than Composer 2, including feature deletion/re‑implementation challenges tied to real tests for verifiable rewards.
- The model benefits from a heavy RL setup with dynamic task difficulty, synthetic data, and on-policy distillation that steers the student model toward a teacher’s guidance in real-time contexts.
- SpaceX AI collaboration and a potential Cursor-SpaceX IPO-linked deal underpin the strategic value of Cursor as a data- and compute-rich platform for enterprise-grade coding workflows.
- Composer 2.5 is built on Moonshot’s Kimi K2.5 checkpoint and reportedly doubles the base model’s score on Cursor’s internal bench, while maintaining low cost.
Who Is This For?
Software engineers and engineering managers evaluating AI-assisted coding tools for large codebases. If you want fast, affordable, and enterprise-friendly coding copilots that fit real-world editor workflows, Theo’s take on Composer 2.5 and Cursor’s ecosystem is essential listening.
Notable Quotes
"Composer 2.5 is now available in cursor. It's a substantial improvement in intelligence and behavior over composer 2. It's better at sustained work on long running tasks..."
— Theo citing Cursor's description of Composer 2.5’s improvements and its practical benefits for long, complex tasks.
"A lot of these labs are up charging as much as like 90% for how much it costs to run the model... they are pricing based on how much they think they can get away with charging for the thing cuz they need to make massive profits"
— Theo explaining the economics of model pricing and subsidization dynamics.
"During the Composer 2.5 run, we applied this method to a variety of model behaviors..."
— Describes how broadly the targeted textual feedback approach was applied during training.
"The problem isn't that it's too expensive or too slow or too dumb... it's over here on Cursor Bench... we don't have anything else to measure against"
— Theo on benchmarking limitations and the need for external evals.
"Cursor has data... that is data they can use to train the models"
— Highlighting Cursor’s data advantage in training and model improvement.
Questions This Video Answers
- How does Composer 2.5 compare to GPT-5 on code tasks, cost per token, and task efficiency?
- Can Cursor’s enterprise model access be benchmarked externally, or is it limited to Cursor's ecosystem?
- What is targeted RL with textual feedback and how does it reduce tool-call errors in coding tasks?
- What are synthetic tasks in ML training for code models, and how do they prevent reward hacking?
Cursor Composer 2.5, Composer 2 vs 2.5, Cursor AI platform, Cursor Bench, AI in CI/CD, Blacksmith CI, SpaceX AI collaboration, RLHF and targeted textual feedback, Synthetic task generation, Tool usage and tool call reliability
Full Transcript
Last week there was a really cool release I'm afraid people are sleeping on. It came from a company with a lot of compute and it was focused on being a small model that is very effective at solving code tasks. And no, it wasn't Gemini 3.5 Flash, even though it probably should have been. It was Composer 2.5 by Cursor. This is a crazy drop. The rate at which Cursor has caught up to state of the art is pretty absurd, but there are catches here that we need to talk about as well. If you're not familiar with the Composer line of models from Cursor, this has been their attempt to distill good open weight models into a great coding model for usage within Cursor.
That said, it's only really for use within Cursor. So, it's hard for us to bench it externally. That means we kind of got to vibe our way through it, but it also means Cursor does to an extent as well. They have published some awesome numbers including their own benchmark, which sadly is not particularly public, Cursor Bench. But the results seem to be incredible, and I can tell you from my own use of it, this isn't just hype. While I don't think this is the greatest model ever and you should use it for everything, I do think it's worth looking at and understanding both because it's a good model release from a company that understands what we need as devs, but also because it shows that the strength that the major labs have might be starting to crack a little bit under pressure.
While I am an early investor in Cursor, the acquisition stuff happening with SpaceX kind of destroys any incentive I might have had to shill it super hard. So, I'm not sharing this cuz I'm biased. I'm sharing this cuz it's actually really interesting to me. And don't worry, I will [ __ ] all over a lot of other Cursor things throughout this, too. All of that said, want to make sure it's very clear Cursor is not paying me at all for this coverage. The only people paying me today are the sponsor. As powerful as continuous integration is, it's full of compromises.
I like to call this the CI compromise triangle. The different things you have to think about when you're configuring your continuous integration. You start with convenience, which is how easy is it to implement this? You also have performance. How long does it take to run? Be nice if the CI runners were just a little bit faster. And of course, you have cost. Obviously, there's lots of different solutions for CI that have different compromises across this triangle because it would be impossible for anything to just be better at all of these, right? Like there's no way something could be more convenient, faster, and cheaper, right?
Oh, is it a Blacksmith ad? Yeah. Blacksmith is 2x faster and 60% cheaper for most people when they move from GitHub Actions. I thought this was too good to be true, as did many of their customers like Mercury, Supabase, Descript, Exa, and more. But then you use it and you realize you can't live without it. If it was just cheaper, it would be worth it. If it was just faster, it would be worth it. And if it was just more reliable, it would still be worth it. But then you discover the console and how much better it is than anything GitHub has shipped in years.
You get actual metrics about the workflows you're running, which ones have high error rates, what the errors might be, as well as logs that give you really useful details on what is going on across all of your CI. Every time I forget to set up Blacksmith on a new project, I regret it immensely and then I move over and watch my build times go from 3+ minutes to under one. Your devs and your agents deserve fast CI. Get it today at solidity.link/blacksmith. So, let's talk about this new, intelligent, fast, cheap model that Cursor put out.
First, I want to talk about price. A lot of people don't understand how models are priced. I'm going to do my best to explain it simply here. You have the input and the output costs and these are in million token groups. A token is just a chunk of text, usually three to five characters, kind of like a word, that is how models interpret text and then generate text as well. Input tokens tend to be a lot cheaper than output tokens because the inputs are the things you're passing the model and it's just using that to set up what its guesses for next token prediction are going to be.
The outputs are the actual work it's doing, so those tend to be much more. The base price for a model like Claude 4 Sonnet has historically been $3 per million tokens in and $15 per million tokens out. On the 1 million token context versions, they would sometimes be more expensive. Anthropic has since stopped doing that, OpenAI has not, and the Opus models tend to be much more expensive. Like 4.5 Opus is $5 per mil in and 25 per mil out. GPT 5 to 5.1 were actually very generously priced in my opinion at $1.25 per mil in and 10 per mil out.
5.2 was a bit of a jump to $1.75 in and 14 out. 5.4 was another jump to $2.50 in and 15 out, and then 5.5 was a massive jump to $5 in and 30 out, doubling what it was in 5.4. It seems like the prices are going up across the industry, but they aren't necessarily. There's two additional layers to the pricing that are important to consider. The first is the amount of tokens that are actually being generated. Flashbang warning for Artificial Analysis. Artificial Analysis measures roughly how intelligent models are on their own benchmarks internally that they publish to the world, but they also measure other pieces.
In particular, how fast the models solve the problem, how many tokens they generate, and how much it costs to run. Output tokens are one of my favorite things to look at here, and you'll notice some interesting stuff. The most output tokens didn't come from the smartest models. They came from DeepSeek V4 Flash, GPT 5.4 Mini, and Claude Sonnet 4.6. Smaller models often generate way more tokens than they need to to solve problems. It's also worth noting that the lab trying the hardest to reduce this overuse of tokens very much appears to be OpenAI, where models like 5.5 only did 75 mil tokens for the same bench that Sonnet did 200 mil for.
Almost a third the number of tokens and a higher score as well. And when you add in models like 5.5 low and medium, which still score really well in the benches, like 5.5 medium scores as well as Opus 4.7 roughly, and the low version is worse, but it's still much higher than 5.4 Mini and Flash and roughly around where something like Sonnet 4.6 is. It is expensive per token, but remember, it's putting out comically less: 7 mil tokens. That is a tenth as many as 5.5 high and a 30th or less as many as Sonnet 4.6.
And 5.5 medium is only 22 mil, so very token efficient models. So that's the second factor of pricing. Factor one is per token cost. Factor two is number of tokens to solve a problem. Factor three is your ability to negotiate better deals. Some of these labs are upcharging as much as like 90% for how much it costs to run the model because they have lots of other costs like the people who trained the model, the compute they had to buy to be able to run the model, the compute they had to spend to train it as well.
They have to factor that into their prices, but a lot of these companies could hypothetically speaking give you a deal for half off or more per token and it wouldn't cost them money. It would just be opportunity cost. So they are pricing based on how much they think they can get away with charging for the thing cuz they need to make massive profits to justify the absurd amount of money they spend training and doing all that. That's also why they can subsidize so hard doing things like, you know, giving you $4,000 of usage per month on the $200 plans like Claude does for example.
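To make factors one and two concrete, here is a rough sketch of how per-token price and token count combine into what a benchmark run costs. The prices and token counts are the ones quoted in this video; treating Sonnet 4.6 as priced at the historical Sonnet rate of $15 per million output tokens is an assumption, and input-token cost is ignored because no input counts were given.

```python
# Output-only cost of a benchmark run: price per million tokens x millions of tokens.
# Figures are the ones quoted in the transcript. Pricing Sonnet 4.6 at the
# historical Sonnet output rate is an assumption, and input tokens are ignored.

def output_cost_usd(price_per_million_out: float, output_tokens_millions: float) -> float:
    """Dollars spent on output tokens alone."""
    return price_per_million_out * output_tokens_millions

runs = {
    "GPT 5.5": {"price_out": 30.0, "tokens_m": 75.0},
    "Sonnet 4.6": {"price_out": 15.0, "tokens_m": 200.0},
}

costs = {
    name: output_cost_usd(run["price_out"], run["tokens_m"])
    for name, run in runs.items()
}

for name, cost in costs.items():
    print(f"{name}: ${cost:,.0f} in output tokens")
```

Under these assumptions the model with double the per-token price still ends up cheaper for the run, because it emits far fewer tokens.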
Those plans subsidize so hard that it's affecting cursor because if you're subscribed to Codex or Claude code, you are now using a competitor to cursor. And Anthropic and OpenAI are going to subsidize your usage in meaningful ways because they want to get you locked into their ecosystems. That's just the nature of how this all works. So if you spend $200 on Claude code, you can get 4,000 plus dollars of usage. You can probably get as much if not more doing the same with Codex. But if you're using the APIs directly, you don't get that level of subsidization at all.
You would be lucky to negotiate 30% off. So $4,000 to a Claude code sub costs $200. $4,000 of Anthropic models to Cursor probably costs $3,000 at best. And this means a handful of things for Cursor. It means that these models getting more expensive on the books for the API prices massively hurts the amount of usage you can get as a user of Cursor. It also means they are competing in a subsidization war that they don't have a player in. Because if Anthropic or OpenAI decide they want to crush Cursor, which seems like they kind of have, they have the compute and the models and the wedge to increase how much Cursor has to pay and decrease how much users have to pay to use something else, that puts them in a really hard spot.
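The subsidy gap described here is simple arithmetic. A small sketch using the numbers from the video ($200 plan, roughly $4,000 of usage, a 30% negotiated API discount); the variable names are only illustrative, and the real negotiated rates are not public.

```python
# Compare what the same $4,000 of list-price usage costs through a subsidized
# subscription versus through the API with a negotiated discount.
# All figures come from the transcript; actual contract terms are unknown.

USAGE_VALUE_USD = 4000.0      # usage measured at API list prices
SUBSCRIPTION_PRICE_USD = 200.0
API_DISCOUNT = 0.30           # "lucky to negotiate 30% off"

api_cost = USAGE_VALUE_USD * (1 - API_DISCOUNT)
cost_ratio = api_cost / SUBSCRIPTION_PRICE_USD

print(f"Subscription: ${SUBSCRIPTION_PRICE_USD:,.0f}")
print(f"API at 30% off: ${api_cost:,.0f}")
print(f"API reseller pays about {cost_ratio:.0f}x more for the same usage")
```

A 30% discount works out to about $2,800, which is in line with the "$3,000 at best" estimate given above.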
But Cursor does have one thing that almost no one else does, including the labs to an extent. Cursor has data. Because Cursor's the tool that's been used by so many developers for so long building with AI, and many of them didn't turn off the share my data switch in the settings, that is data they can use to train the models. Not just like they read your code base or something. So, people get way too interested in that part. That's not that interesting. What's much more interesting is how you work with the agent, where you work on a plan with it, and then you give it feedback as it generates things.
You tell it, "Hey, you missed this thing. Go change that, too." Those chat histories are so valuable. And Cursor kind of gets to double dip here because they can use the models at admittedly way higher a price than other people get them for to resell to the user building with AI in their tool, but they also get to keep the data if the right switches are hit, so they can use that to train a smarter model, not as smart as the ones that they are distilling from, but close enough that they can collect a lot of the margins that they are losing to there.
Because if you choose to use Cursor for OpenAI models instead of using Codex for OpenAI models, OpenAI makes a lot more money as a result of that. And this is a necessary thing for Cursor to survive because if the prices keep going up and the subsidization continues, the incentive to use Cursor is just a massive financial mess, which is why when Composer first dropped with Composer 1 and it was priced at $1.25 per mil in and 10 per mil out and it was super token hungry, I was confused. That is too expensive for a model that wasn't very good.
And then 1.5, which was meaningfully better but still not great, was more expensive than Sonnet. And Sonnet's already overpriced. $3.50 per mil in and 17.50 per mil out. This is the point where I kind of wrote off Composer for a bit. I just didn't see the light. That was my mistake though. I was betting against Jacob and you should never bet against Jacob. If you're not familiar, Jacob's the creator of Supermaven as well as Tabnine. Supermaven got acquired by Cursor and he made the tab complete what it is. He wanted bigger challenges, which is why he started doing the model training side.
And then Composer 2 hit. It was way cheaper. It was seven times cheaper than 1.5 was and way smarter too. And now we have 2.5, which is still that same cheap price of 50 cents per mil in and 2.50 per mil out, but the numbers are where it shines. On Cursor Bench, and this is cost per task by the way, so left is more expensive, right is cheaper. Composer 2.5 is scoring between 5.5 high and X high, but costs way less than either. That's a really big deal. Composer 2 was already pretty competitive in this sense where it was crushing things like 3.5 Flash and Kimi K2.6, which is the model it's based on, which is really interesting.
It was getting up there with like Opus on medium for way cheaper. But 2.5 has gotten slightly more token efficient, significantly smarter, and stayed absurdly fast. Here are the numbers of all of the models Cursor currently publishes runs for with what is admittedly their benchmark, Cursor Bench. This benchmark doesn't have too many details public. It's real problems that they've encountered that they collected is my understanding but they have not published it. There's a lot of good reasons for that, like SWE-bench Pro is contaminated. It just is. If you make the benchmark open source, the labs will get the data on how to do it better and then that benchmark stops being useful.
It's sad that that's the case but that is the case. Apparently both Composer 2 and 2.5 are still based on Kimi K2.5. They didn't switch to K2.6 as the base, which makes their numbers even crazier. They have effectively doubled the score of their base model, Kimi K2.5, with Composer. Now at a 63% on their bench. GPT 5.5 got a 64. Opus 4.7 got a little closer with 65 at 20x the cost though. 5.5 was only roughly 10x the cost for one percentage point, but Composer 2.5 being that good for that cheap is absurd.
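One way to read those three numbers is to ask how much extra cost each additional benchmark point requires. This is only a sketch: the scores and cost multiples are the rough figures quoted above, and the costs are multiples of Composer 2.5's cost per task, not dollar amounts.

```python
# Scores and relative costs as quoted in the transcript. Costs are multiples
# of Composer 2.5's cost per task, not dollars.
results = {
    "Composer 2.5": {"score": 63, "relative_cost": 1},
    "GPT 5.5": {"score": 64, "relative_cost": 10},
    "Opus 4.7": {"score": 65, "relative_cost": 20},
}

baseline = results["Composer 2.5"]

def extra_cost_per_point(name: str) -> float:
    """Additional cost multiples paid per benchmark point gained over Composer 2.5."""
    r = results[name]
    return (r["relative_cost"] - baseline["relative_cost"]) / (r["score"] - baseline["score"])

for name in ("GPT 5.5", "Opus 4.7"):
    print(f"{name}: {extra_cost_per_point(name):.1f}x Composer's cost per extra point")
```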
Showing that you can distill an open weight model to be close to that level of intelligence in coding task specifically is both super cool to see but also kind of shows that all of the work for making models good at code is post training in RL largely. Like they took a model that was okay at code and just ran a [ __ ] load of compute and a [ __ ] load of post training on it to try and force it through RL to be better at coding and they succeeded. So let's read what they have to say about it.
2.5 is now available in cursor. It's a substantial improvement in intelligence and behavior over composer 2. It's better at sustained work on long running tasks following complex instructions more reliably and it's more pleasant to collaborate with. They improved composer by scaling training generating more complex RL environments and introducing new learning methods. In addition to training composer for 2.5 on more difficult tasks, they've improved behavioral aspects of the model like communication style and effort calibration. These dimensions are not well captured by existing benchmarks but we find that they matter for real world usefulness. I agree. A lot of these things are just hard to measure.
Composer 2.5 is built on the same open-source checkpoint as Composer 2, which was Moonshot's Kimi K2.5. I like this chart. It's a little crazy, but shows how this got so good. This is an estimated measurement of how much compute was used for each stage of training. If you take all the compute that went into Composer 2.5 and break it up based on Kimi K2, which was the original base that was trained by Moonshot, then compare that with Kimi K2.5 where they did another roughly the same to refine it into a better, smarter model. Composer did another 8x-ish compared to that.
It was 10x more total compute in the end. And they were able to do this because of their collab with SpaceX AI, which is still a very annoying, weird name collab thing. If you're not familiar with that, Cursor currently has a pending deal with SpaceX where either they will be paid $10 billion for their collab, which is mostly getting their data, and also Cursor gets access to their compute to use for things, or at the end of the year SpaceX can purchase Cursor for 60 bill. The current rumor is that that's going to happen when they IPO, which is currently in process.
The S-1 was just filed. So, it's not unlikely this is going to happen and Cursor will become part of SpaceX, but that compute seems to have been huge for them. For them to have so soon after that collab started already put out a model that's in third place on their benches for code is nuts. They gave some fun details on how they were doing the training here, too. The first piece is what they call targeted RL with textual feedback. Credit assignment during RL is becoming an increasingly difficult challenge as rollouts can span hundreds of thousands of tokens.
When a reward is computed over an entire rollout, it may be hard for the model to tell which specific decision helped or hurt the outcome. That's especially limiting when we want to discourage a localized behavior like bad tool calls, confusing explanations, or style violations. That is really hard. If you're measuring with RL that the output works, but it did three bad things during it, how do you tell it that it failed without having it fail the actual output? The final reward can tell us something went wrong, but it's a noisy signal for where it went wrong.
So, they trained 2.5 with targeted textual feedback. The idea is to provide feedback directly at the point in the trajectory where the model could have behaved better. For a target model message, we construct a short hint describing the desired improvement, insert that hint into the local context, and use the resulting model distribution as a teacher. We use the policy with the original context as the student and add an on-policy distillation KL loss that moves the student token probabilities towards the teacher's. Interesting. So, effectively, there is another model that has been steered the right way by being given this additional context and information, and then it tries to direct the thing being trained to behave similarly without having that same context.
So, you tell one model, "Hey, this is the thing we want you to do." And it tries to steer the other model into going that way without actually having that additional instruction. Because in the end, we shouldn't have to tell our models every step along the way how we want them to behave. We should tell them what we want, and then they behave. So, this makes sense to me why this would work well. It's really cool to hear. It's awesome they're being so transparent about it, too. A good example here is when you have tool errors.
For example, if the model has a bad tool call or tries calling a tool that's unavailable, the model would receive a tool not found error, but the model would then continue to make additional valid tool calls. This screams Gemini. I hope my friends over at DeepMind and Google are reading this because you need to copy everything they're talking about here. No model I use regularly has as many malformed tool calls as Gemini models do. That's because they're RL for the end result, almost certainly. So, if it has five bad calls and eight good ones, it doesn't care cuz the result came out correct.
They're trying to fix those behaviors. The fact that it hit one error in the process of hundreds of tool calls would have a minimal impact on the final reward. Yes, but how? So, how do you fix that? With text feedback, we can target the specific mistake by inserting a hint in the context of the problematic turn, such as "Reminder: available tools are" with a list of available tools. And remember, this is all just probabilities, where based on the things in the context and how that's affected the model, it is more likely to answer in certain ways.
So, if they have a teacher that has different context, its probabilities are better aligned with what we want, and it will steer the student model that is being RL-trained in this process to behave more like the teacher that has this information. So, by giving it that additional context, it lowers the likelihood that you have the wrong tool call probabilistically, and it increases the likelihood that you have a valid tool call instead. And for this specific turn, we update the student weights towards the new probabilities. During the Composer 2.5 run, we applied this method to a variety of model behaviors, from coding styles to model communication.
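The teacher and student idea can be illustrated with a toy next-token distribution over tool names. This is a minimal sketch, not Cursor's training code: the logits are made up, the "teacher" is just a second hand-written logit vector standing in for the same policy with the hint in context, and the loss direction KL(teacher || student) is an assumption since the post only says a KL loss moves the student towards the teacher.

```python
import math

# Toy vocabulary of tool names; "read_lints" is the tool that does not exist.
TOOLS = ["read", "write", "shell", "string_replace", "read_lints"]

def softmax(logits):
    """Convert logits to probabilities, shifting by the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits for the next tool name at the problematic turn.
student_logits = [1.0, 0.5, 0.5, 0.2, 2.0]   # original context: favors the missing tool
teacher_logits = [2.0, 0.5, 0.5, 0.2, -3.0]  # stand-in for the policy with the hint inserted

teacher = softmax(teacher_logits)

def distill_step(logits, teacher_probs, lr=0.5):
    """One gradient step on KL(teacher || student) with respect to student logits.
    For a softmax student this gradient is (student_probs - teacher_probs)."""
    student_probs = softmax(logits)
    return [z - lr * (s - t) for z, s, t in zip(logits, student_probs, teacher_probs)]

kl_before = kl(teacher, softmax(student_logits))
updated_logits = student_logits
for _ in range(50):
    updated_logits = distill_step(updated_logits, teacher)
kl_after = kl(teacher, softmax(updated_logits))

bad = TOOLS.index("read_lints")
p_bad_before = softmax(student_logits)[bad]
p_bad_after = softmax(updated_logits)[bad]
print(f"KL to teacher: {kl_before:.3f} -> {kl_after:.3f}")
print(f"P(read_lints): {p_bad_before:.3f} -> {p_bad_after:.3f}")
```

In a real run the update would go into the model weights rather than a logit vector, and only for the tokens of the targeted turn, but the direction of the change is the same: probability mass moves away from the call the hint rules out.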
Here's an example. It tried calling a tool that doesn't exist, read lints. Reminder, available tools are read, write, shell, and string replace. And now the token probabilities that violate the hint go down, and the model weights are updated to avoid that type of error. The next call they have is synthetic data. Apparently, they're not just using the data they collect from us anymore. Composer's coding abilities improved substantially to the point where it began to get most training problems correct. To continue increasing intelligence, we both select for and create harder tasks dynamically throughout the run. Composer 2.5 is trained with 25x more synthetic tasks than Composer 2 was.
We use a range of approaches for creating synthetic tasks that are grounded in real code bases. For example, one synthetic approach is feature deletion. For these tasks, the agent is given a code base with a large set of tests and asked to delete code and files in such a way that the code base remains functional while specific testable features are removed. The synthetic task is to re-implement the feature, and the tests are used as verifiable rewards. That's a clever way to generate this data. A lot of the work for making models smarter is these types of clever things.
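The feature deletion setup can be sketched in a few lines. This is a toy under stated assumptions: the "code base" is a dict of functions, `slugify` is a made-up feature, and the reward is simply the fraction of held-back tests that pass. None of these names come from Cursor's post.

```python
from typing import Callable, Dict, List, Tuple

Codebase = Dict[str, Callable]
TestCase = Tuple[tuple, object]  # (arguments, expected result)

def slugify(title: str) -> str:
    """The hypothetical feature that will be deleted and re-implemented."""
    return "-".join(title.lower().split())

original: Codebase = {"slugify": slugify, "shout": lambda s: s.upper()}

# Tests covering the feature; held back and used to score the re-implementation.
feature_tests: List[TestCase] = [
    (("Hello World",), "hello-world"),
    (("  A  b ",), "a-b"),
]

def delete_feature(codebase: Codebase, name: str) -> Codebase:
    """Create the task: the same code base with one feature removed."""
    return {k: v for k, v in codebase.items() if k != name}

def reward(codebase: Codebase, name: str, tests: List[TestCase]) -> float:
    """Verifiable reward: fraction of the feature's tests that pass."""
    fn = codebase.get(name)
    if fn is None:
        return 0.0
    passed = 0
    for args, expected in tests:
        try:
            if fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing implementation simply fails that test
    return passed / len(tests)

task_repo = delete_feature(original, "slugify")

def attempt(title: str) -> str:
    """Stand-in for an agent's imperfect re-implementation."""
    return title.strip().lower().replace(" ", "-")

candidate = {**task_repo, "slugify": attempt}

print(reward(task_repo, "slugify", feature_tests))  # feature missing
print(reward(candidate, "slugify", feature_tests))  # partial credit
print(reward(original, "slugify", feature_tests))   # reference implementation
```

The appeal of this design is that the reward needs no human grader: the tests that existed before the deletion decide whether the feature is back.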
I talked about them a lot in other videos, but this is a smart one. Tell the model to delete a feature and then tell it to re-add the feature using the old tests. One downstream consequence of large-scale synthetic task creation is that it can cause unexpected reward hacking. As the model got more adept, Composer 2.5 was able to find increasingly sophisticated workarounds to solve the task at hand. In one example, the model found a leftover Python type checking cache and reverse-engineered the format to find the deleted function signature. In another, it was able to find and decompile Java bytecode to reconstruct a third-party API.
We were able to find and diagnose these problems using agentic monitoring tools, but they demonstrate the increasing care necessary for large-scale RL. Yeah. I've even seen things like this in my own stupid tests. Like earlier, I was trying to demonstrate a new framework I'm working on and I made a bunch of demo projects for this. And in here, you will see that the model found a to-do app with this already implemented on another project that had a similar name nearby, cuz I did a restart, I did a Mario demo, I did multiple demos, and it kept finding other projects nearby that it could use to cheat.
I'm checking nearby Lakebed Capsules for updated syntax rather than guessing the database API. Yeah, it found other folders nearby that had examples in it. The models are trained to succeed, not to avoid cheating and trying to RL without letting the models cheat is a very difficult thing and it's cool they're sharing examples of the model hacking that. Enough of this though, I want to play. But here's also where I have to crash out a little bit because Glass frankly still kind of sucks. This is their attempt at doing a Conductor Codex app T3 code type thing and it is slow, clunky, and obnoxious.
We will use it here cuz it is the best option they provide. I do want to call out in T3 code that we support Cursor through their ACP bindings for the CLI so you can use Cursor models here, but to make sure we're using everything properly, I am going to do this pass using Cursor Glass. It's annoying. I just opened cursor dot for this directory and it brought me to a new thread view for a different project. This is like the command I use the most, like cursor here. Okay, the second time it opened it right.
Again, like the I don't know how to put into words how far the quality control has fallen. I was really hoping glass would be the reset we needed. It hasn't been. This has become one of my favorite tests to do where I take a game that I built already that is kind of functioning but needs a lot of work and I tell the model to re-implement it from scratch in this directory. I'd like to re-implement my game fish slop from scratch. I want better assets and more reliable core engine and a path to making it a real quality web game.
The new directory for it is here, use that as a reference and build the game in its entirety in this directory. Do not stop until you're done. Let's see what it does. One of the coolest things about it is that it's really fast. It's already spinning up parallel agents to do different things at the same time, which is really cool. One of the crazy things they did, I haven't seen a company try something this crazy before, is that they did an internal test last week with the model where all cursor chats for employees were redirected to composer 2.5 for a few days.
This employee didn't even notice, which is a testament to how good the model's gotten. That is super cool to hear. And you can see it flying through this stuff right now. It is it is churning. It did just do like five edits in a row here to like the same thing. -1 +1 -1 +1 -1 +1 -1 +7 +1 -7. Interesting. It's actually checking the resulting code, which when I tried the same thing with 3.5 flash yesterday, it did not do. On that note, 3.5 flash, which also came out the same time as composer 2.5, got a way lower score than 2.5, lower than 2.0, and cost $1.94.
I just I don't see why anyone would use 3.5 flash for almost anything. It's just not a good option. It's done? Oof. Page failed to load. Here's the result. So, it is one of the two models that failed to make the actual page render, which is sad but not super surprising. It is a Chinese open weight model that was RL'd to all hell. Yeah, let's see how this works now. I don't have enough shells to unlock those. Well, unlike the one that Gemini made, it actually functions and has the core mechanics behaving. That is genuinely insane.
I I don't love the asset it generated here. I'll give it an annoying task, actually, to fix that. Pull over all the assets from the original game. Let's see if it's able to swap out the assets for the ones from the original game and how quickly it's able to do that. PNG assets exist in Fish Game Public, but the glob tool failed to detect them. Interesting. This is a big move, changing the entire render pipeline for your game live. You can see the game flashing in the background as it does all the changes. Apparently, it's done.
That took under 30 seconds. And yeah, I switched over to my original assets. It made a lot of things too small. Let's tell it to fix that. Scale is really off and UI is hard to read, low contrast. On the topic of UI, I do want to talk about the UI it generates cuz it's actually kind of impressive. Friend of the channel Dara has made an awesome tool that lets you compare what models design and how they handle design different ways. The site itself was designed by Composer 2, funny enough. He runs different models with the same prompt to generate different web pages to see how good they are at design.
And here are the designs that we got out of Composer 2.5. This one's actually kind of nice and chill. I I like the way it used the font. I like the little thing it made here. Tasteful grain. Yeah, this is fine. This is clearly using the front-end design skill. Yeah, cuz these are all the the designs you get out of the front-end design skill. It screwed up the spacing on these buttons a bit, like that should be uh aligned to the bottom, I think. And this doesn't look great either, but like it it honored the goal of this design, okay.
This one I just don't like. I don't like this style of like preppy [ __ ] I hate the retro ones. And this is fine. So, here's how it goes without the front-end design skill from Anthropic. This looks fine. This looks fine, but way too similar. This is different. It's different in a way that's kind of cute. I don't love it, but it is cute. This is awful. And this is like an old Tailwind slash like Bootstrappy site. I've seen a lot of sites that look like this. This feels like an old era design that you would get in the purple pink gradient classic.
Not bad. Like getting this type of design out of a model this cheap is really cool to see. Interesting bug here. We'll resume when background work finishes. What is the background work that it is finishing? Cool, they called me. Let's see how this looks now. Oh. Yeah, it screwed up the scaling a bunch for this, sadly. It got like the core mechanics right. These buttons on the bottom don't work for clicking things, though. Like I have to press the hot keys. It's far from perfect, but I could see how somebody who doesn't have a lot of money, does have a Cursor sub or like a student discount or something, could with enough massaging and back and forth make something actually decent out of this.
Like I see the path there. It's not a super happy path, but there is one. Okay, it actually does have a lot of things wrong with the feeding mechanic in the game. It's sadly not the worst solution I've seen for this, but it is not great. This is an admittedly very complex task, but yeah, I was hoping for a little more there. For my like day-to-day random TypeScript end-to-end [ __ ] I found it to be relatively pleasant, but there is a big problem. The problem isn't that I don't like glass, even though I don't. The problem isn't that it's too expensive or too slow or too dumb or any of those things.
The problem's actually over here on Cursor Bench. And the problem isn't Cursor Bench itself. It's not that it's a dishonest test or anything like that. It's that we don't have anything [ __ ] else to measure against. Because unlike every other model that we talk about on this channel, there is no API we can hit to test this model ourselves. The only way to try out Composer, regardless of how much money you're willing to spend, is to use it inside of Cursor. Unless you spend $60 billion to buy them, then you can probably use it for whatever.
Point I'm trying to make is that we don't have a real measurement of this outside of what Cursor is putting out. The Rob did come out and say that they are reporting Terminal-Bench and SWE-bench Multilingual, which sadly is a polluted test. Enough info on it has leaked that models have it in the training data and it's kind of compromised now. It is what it is. Allegedly they are working with Artificial Analysis and others for more external evals, which is very good to hear. But I understand on one hand why they're not going hard on this because they made this model just for coding and nothing else.
It'll probably perform really poorly on stuff like SkateBench, for example, because it isn't built for that use case. It's built just for coding. And a lot of these benches, like if you were to throw it on the Artificial Analysis Intelligence Index, the core bench they have here, it would perform very poorly, possibly as bad as Kimi K2.5 or worse, because a lot of the benches test things that aren't code. So I get why they're not prioritizing this, but it does suck to have a model that seems really useful for a bunch of different [ __ ] that I can only use through Cursor's app or through the ACP implementation on their CLI that we had to like push really hard for them to fix.
For a while you couldn't even use Composer models through the ACP bindings because they had a limited hard-coded set of models that didn't include the new Composer models. Obnoxious. They've since fixed that after a lot of back and forth with the team. They also just put out the Cursor SDK, which allows you to build agents using Cursor's harness and infrastructure and everything, which hypothetically means you can write code like this, where you create an agent that uses Composer 2 and has access to a specific directory, and then you can programmatically run things against it. This is cool.
Especially cuz you can point it at cloud repos as well. There's a lot of promise here. And as I've mentioned many times, I do genuinely really, really like the Cursor Cloud stuff. cursor.com/agents has consistently been my favorite way to use Cursor as of late because it has a full Linux box in the cloud that it can use to spin up whatever your project is, as long as it's not like an iOS app, and actually test against it with computer use. Super cool when combined with fast models like Composer 2 and 2.5 and the ability to trigger these from my phone or from Slack.
I like where they're going with all of this. I just hate when I see a new model with numbers that are this good, and I can't hit it. They're the only lab, in quotes, worth looking at that doesn't expose their stuff over API at all. This is very different from companies like Anthropic, who subsidize usage through their tool, but charge you full price to hit it directly. The only thing going on right now that's close is Gemini 3.5 Flash, where you can pay API prices, but if you want the full speed version, the one that runs at 1,000 TPS instead of 200, you have to use the Antigravity IDE or CLI to get the full speeds.
There are very few instances nowadays, thankfully, of labs restricting access to their models on a fundamental level. Usually, they just subsidize. But now we have Google restricting us from using the fast version of their model over the API. We can only use the slow one. Previously, we had Mistral do the same thing, where they had a version of the model that could do 1,000 TPS that was working through their partnership with Cerebras at the time. But if you hit it over the official Mistral API, you'd be around 50 TPS. And now we have Composer, which is literally only hittable via Cursor surfaces.
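To put those speed tiers in perspective, here is a rough back-of-the-envelope sketch. The TPS figures are the approximate ones quoted above; the 4,000-token response size is an assumption picked purely for illustration, and real latency also depends on time to first token, tool calls, and network overhead.

```typescript
// Rough wall-clock time to stream a response at a given decode speed.
// Ignores time to first token, tool calls, and network overhead.
function secondsToGenerate(outputTokens: number, tokensPerSecond: number): number {
  if (tokensPerSecond <= 0) {
    throw new Error("tokensPerSecond must be positive");
  }
  return outputTokens / tokensPerSecond;
}

// Hypothetical size of a single agent response (an assumption, not a measurement).
const outputTokens = 4000;

// Approximate speeds quoted in the video.
const speedTiers: Array<[string, number]> = [
  ["fast tier (~1,000 TPS)", 1000],
  ["standard Gemini API tier (~200 TPS)", 200],
  ["official Mistral API (~50 TPS)", 50],
];

for (const [label, tps] of speedTiers) {
  console.log(`${label}: ${secondsToGenerate(outputTokens, tps).toFixed(1)}s`);
}
```

Under those assumptions the same response takes about 4 seconds on a fast tier, 20 seconds at 200 TPS, and 80 seconds at 50 TPS, which is the gap between staying in the loop and tabbing away.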
You cannot do it over API at all. Apparently, people have it working in tools like Pi by using the Agent SDK that Cursor put out. That's cool to see, and I don't think they're going to be as egregious with limiting how people use the SDK as companies like you know, the one big A has been about that. But it does suck that I feel like I can't meaningfully benchmark this or integrate it in other tools. Like if I just wanted to swap out the API endpoint in something like Snitch Bench or inside of something like Pi to use what they have here, I can't.
I have to rebuild the application to use their SDK. And for what it is worth, I have not heard the best things about the SDK. I have admittedly not had a chance to try it myself, but I have heard that it is missing things and is not the most reliable stuff yet. That said, if enough of you guys build cool things with the Cursor SDK, I will have the leverage I need to force the team to push it further. And that would be really cool. So I do still recommend you go build cool things with this.
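One way to build on an SDK like this without getting locked in is to hide it behind a small interface of your own. The sketch below is a minimal illustration of that idea only: every name in it (`CodingAgent`, `AgentRequest`, `EchoAgent`, `fixFailingTests`) is hypothetical, none of it is taken from the Cursor SDK, and the stand-in backend just echoes its input so the example runs without credentials.

```typescript
// Hypothetical seam between application code and whichever agent backend is in use.
// None of these names come from the Cursor SDK; they are illustrative only.
interface AgentRequest {
  prompt: string;
  workingDirectory: string;
}

interface AgentResult {
  backend: string;
  output: string;
}

interface CodingAgent {
  readonly name: string;
  run(request: AgentRequest): Promise<AgentResult>;
}

// Stand-in backend so the sketch runs without network access or credentials.
// A real adapter would wrap an SDK or HTTP client here and map its response
// into AgentResult.
class EchoAgent implements CodingAgent {
  constructor(public readonly name: string) {}

  async run(request: AgentRequest): Promise<AgentResult> {
    return {
      backend: this.name,
      output: `[${this.name}] would run "${request.prompt}" in ${request.workingDirectory}`,
    };
  }
}

// Application code depends only on the CodingAgent interface.
async function fixFailingTests(agent: CodingAgent, repoPath: string): Promise<AgentResult> {
  return agent.run({ prompt: "Fix the failing tests", workingDirectory: repoPath });
}

// Swapping backends is a one-line change at the call site.
const sdkBacked: CodingAgent = new EchoAgent("sdk-backed-agent");
const apiBacked: CodingAgent = new EchoAgent("api-backed-agent");

fixFailingTests(sdkBacked, "./my-repo").then((result) => console.log(result.output));
fixFailingTests(apiBacked, "./my-repo").then((result) => console.log(result.output));
```

The design choice is that only the adapter class knows about a specific vendor, so replacing an SDK-backed agent with an API-backed one later means writing one new class rather than rebuilding the application.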
Just make sure you do it in a way where if you do have to swap out, it's not that hard to do. So what are my final thoughts on all of this? Well, if you're the type of developer that is still staring at the agent as it runs and are doing very collaborative work with it, where you ask it to research a thing, it researches it, it gives you a result, you ask it to then go make a change, and then it does that, you look at the result, where you're having a back and forth that you're engaged with the model, having a model this fast and smart is really, really powerful.
That's why I was excited for 3.5 Flash and ultimately was disappointed because it wasn't good enough at coding to have a good experience doing that with. Composer 2.5 is a much better balance here. It might not be as capable of unblocking itself in solving complex tasks as things like the latest GPT models or the latest Opus models, but it is really fast and close enough that you can have this back and forth with it in a way that reminds me of the old days doing agentic coding with the difference being it's way smarter and way more capable.
So, the people who are working at big companies on giant repos that just want to like sit in the code base with a real IDE and talk back and forth with the model about the changes they want to make and not have to wait 20 minutes for it to go explore [ __ ] This is an incredible model. But, if you're the type of person that's spinning up 20 agents in parallel on the same thing all doing different [ __ ] and by the time you spin up agent 17, agents 1 through 6 are done, this will not be as useful to you.
That said, I find myself steering away from the super parallelization stuff lately cuz it's just too much context in your head to keep and if you're doing anything else, it's just impossible to maintain. I can't tell you how many cool projects are probably sitting on random work trees on this computer that I've never had a chance to even look at yet. So, if you're doing this type of engaged back and forth with the model, a fast model is really powerful. I'm saying this confidently cuz I've experienced it myself recently cuz I've been using GPT-4-5 on the fast mode for like 2 weeks now because I got way too much free usage from OpenAI cuz I went to the 4-5 event and everybody who applied to go got I think a 10x bump.
So, despite building a literal cloud from scratch over the last few days using these models, yeah, I'm building a cloud. We'll talk about that another time, don't worry. I'm struggling to even take down 1% of my 5-hour limit much less my weekly usage limits. I'm trying so hard to burn through this and I literally can't because they gave me too high of a limit as a result of that event. Insane. It is what it is, but as such, I've gotten addicted to fast because it's 2x more expensive and I'm still barely able to burn the [ __ ] tokens.
I think Composer 2.5 fast is a much more realistic option for companies where they can't have devs doing the subsidized token usage that also don't have that free 10x bump temporarily. For companies with big code bases that have developers actually in their editors reading the files and looking at the changes, like hand-in-hand with the model. It's a good workflow, and having a model this fast, this cheap, this reliable for that type of real-world work fits the archetype of Cursor's customer very well. Because Cursor is killing it in enterprise. When I talk to my friends at big companies that aren't Amazon, where they built their own shitty IDE, almost everyone else I talk to is using Cursor.
So much of the Fortune 500 is on Cursor, it's insane. This is a really good win for them that will keep them from taking a potential deal with OpenAI or with Anthropic that I'm sure they're in negotiations with all of those companies. This is a massive play. And unlike those other labs, if you go to Anthropic, for example, with a deal, you can only use the latest Opus and Sonnet models. If OpenAI does something cool, you're [ __ ] If Cursor does something cool, you're [ __ ] Same deal if you go with OpenAI. But if you go with Cursor, you at the very least have their awesome models with Composer.
But if one of the other labs leapfrogs at any point, you still get access to that, too. So, Cursor has continued to be a really compelling bet for enterprises, and I think these developments help that a ton. By having a really good default that they can subsidize themselves and make really compelling value propositions for their users with, they are able to stay a player in this race. And as silly as it sounds, I have more faith in Cursor to sustain the AGI bubble or whatever happens than I do a company like Microsoft. Because at the very least, Cursor's moved far beyond the VS Code fork we made fun of them for into a real lab focused on real developer workflows at a level that Microsoft cannot even comprehend.
If I was to guess where we'll be in 2 years, I legitimately think it's more likely that we're using xAI Cursor Composer 7 than anything by Gemini for our day-to-day dev work. Even if it thinks things are going on in the background when they're not, bugs will happen. I have one last thing I want to end on here. Cursor posted in the thread where they announced the new model, "Together with SpaceX AI, we are training a significantly larger model from scratch using 10x more total compute." That means it's going to be 100x more than Kimi was trained on. "With Colossus 2's million H100 equivalents and our combined data and training techniques, we expect this to be a major leap in model capability."
It is legitimately possible that Cursor doesn't just catch up to state of the art for code. They might leapfrog. There's a real chance that in just a few months, Cursor will have the best model for code. And when you look at the trajectory, when you look at how fast they have scaled up where they are, and combine that with the insane amount of compute they have and data and all of the new training methods they've been figuring out, I think there's a real chance they do it. At least I hope so so my investment becomes something, but we'll see where that all goes.
I got nothing else on this one. If you're still on Cursor, you should definitely try it. And if you're not on Cursor, I'm curious what you're using instead. Let me know in the comments and until next time, peace nerds.