AI code benchmarks lied to us
The speaker critiques existing benchmarks as noisy and contaminated, arguing they misrepresent how models perform on real coding tasks.
A sharp critique of so-called coding benchmarks, introducing a new, realism-focused benchmark and showing OpenAI models often outperform open-weight rivals by a wide margin in real-world coding tasks.
Summary
Theo from the channel t3.gg takes a critical look at popular code benchmarks like SWE-Bench Pro and the broader benchmarking culture. He argues that these benches are contaminated and not representative of everyday developer work, with models cheating or solving problems in ways that don't reflect real coding tasks. The centerpiece is the new Deep SWE benchmark from Data Curve, designed to mirror real projects with realistic prompts, diverse languages, and end-to-end verification. Theo walks through why Opus 4.7/4.8 and GPT-5.x variants outperform open-weight models by a sizable margin in Deep SWE, and why SWE-Bench Pro's checks and prompts distort the picture. He also discusses cost, token usage, and wall-clock times, showing that even faster-looking models can be just as expensive and less capable in practice. The host openly shares his bias and sponsorship context, and emphasizes the value of building your own benchmarks from real failures and projects. He highlights how better prompts, stronger verification, and more realistic task design lead to meaningful, actionable differences between models. The video ends with a call to developers to document failures, build their own benchmarks, and push the industry toward benchmarks that actually reflect day-to-day coding challenges. Theo credits Serena and the Data Curve team for delivering a bench that aligns with real-world work, and teases future updates and findings from the Deep SWE project.
Key Takeaways
- Public benchmarks like SWE-Bench Pro often misrepresent real-world coding ability due to cheating, contaminated problems, and weak verification.
- Data Curve's Deep SWE bench focuses on end-to-end, behavior-based evaluation with novel tasks and shorter prompts but far larger, more meaningful code outputs.
- On Deep SWE, state-of-the-art models like GPT 5.5 (70%), GPT 5.4 (56%), and Opus 4.7 (54%) deliver substantially higher success rates than top open-weight models in practical tasks.
- Mini SWE agent tests reveal deeper differences in harness implementations; results can swing based on harness choice and system prompts, underscoring realism in Deep SWE.
- Deep SWE shows dramatic token-cost and wall-clock-time differences, with OpenAI models delivering better efficiency per task despite higher upfront model costs.
- Theo emphasizes the importance of building your own benchmarks from real failures, maintaining a failure corpus, and comparing against diverse, private data to avoid hype.
- Bias concerns in benchmarking are natural, but the Data Curve setup appears to be driven by a commitment to reflect real-world developer needs rather than by sponsorship or ties to any one lab.
Who Is This For?
Essential viewing for developers evaluating AI copilots and code agents, especially those moving from open-weight models to state-of-the-art options. It’s a wake-up call for teams to build their own benchmarks rooted in real projects rather than relying on flashy, public tests.
Notable Quotes
"The numbers on this bench have been nonsense for a while, and that sucks because SWE-Bench Pro was meant to be the realistic bench for realistic coding problems."
—Theo critiques the credibility of popular benchmarks from the outset.
"Deep SWE prompts are half the length, but the solutions require five times more code and two times more output tokens."
—Highlights the design philosophy behind the new benchmark and why it’s more challenging and realistic.
"GPT 5.5 did the best by far at a 70% success rate."
—Key result showing the performance gap in Deep SWE versus others.
"The more you read into this, the more it makes sense why this gap is so big."
—Theo ties observed results to broader benchmarking trends and real-world use.
"Benchmarks should reflect how we actually build things in our real projects with agents."
—Core argument for why Deep SWE matters for developers.
Questions This Video Answers
- How does the Deep SWE benchmark differ from SWE-Bench Pro in evaluating coding assistants?
- Why do OpenAI models like GPT 5.5 outperform open-weight models on real-world coding tasks?
- What makes a benchmark realistic for developers using AI agents in everyday coding?
- How can I create a failure corpus to benchmark AI code agents for my own projects?
- What are the limitations of using mini SWE agent harnesses versus official harnesses in benchmarks?
SWE-Bench Pro, Deep SWE, Data Curve, Opus 4.x, GPT-5x, Gemini 3 Flash, Claude Opus, Open-weight vs. state-of-the-art benchmarks, Code benchmarking, AI coding agents
Full Transcript
It's getting harder and harder to know which models are actually good for the work we do as developers. That's what we have benchmarks for though, right? We have all these cool benches like SWE-Bench Pro and Arena that are here to tell us what models are actually good at coding, right? Well, about that. While it is cool to see Mythos kill it on SWE-Bench Pro, I personally don't believe that Qwen 3.7 Max or GLM 5.1 are meaningfully better than state-of-the-art models from OpenAI. I also don't believe that Gemini 3.5 Flash is just spitting distance away from GPT 5.4 and 5.5.
That's just obviously not true. The numbers on this bench have been nonsense for a while, and that sucks because SWE-Bench Pro was meant to be the realistic bench for realistic coding problems. But not only are the problems kind of [ __ ], they're also contaminated. So much of the info on how to solve these problems has leaked that models will regularly cheat, and the cheating is barely even measured by the people verifying the results. To be a bit frank, this bench kind of sucks. And the fact that it's lasted this long, especially post-contamination, is frustrating. And while Artificial Analysis has been trying their best to pull together a bunch of things to make their own better suite of coding measurements, it's still not great.
And to be clear, the Artificial Analysis coding agent bench is less their own bench and more a combination of other benchmarks that already exist that they're running with various harnesses and models. And when you go look at the problems that these benches are actually testing, you'll realize just why they are so bad. And don't worry, we'll go and do that later in this video. The issue here is almost misleadingly simple. The problems that are being benched against aren't realistic representations of the things we do and the problems that we try to solve using agents every day.
The prompts are a bit nonsensical. The structure of the tools they're using doesn't make sense. The repos already have solutions, so the models can cheat. There's a lot wrong with how these benches work when you compare it to the work that we actually do every day. At least that was the case. Deep SWE is pretty much exactly what I've begged every person I know on the data and research side to make for years now. I wanted a bench that accurately reflects how we actually build things in our real projects with agents. And now we have it.
And the results are so damning that I'm going to be accused of bias. I can't wait to break down what this bench is, how it works, and how much crazier the results are, especially when compared to the ones that we're used to. But first, I have to disclose biases. I am an investor in Data Curve. It's part of why I bullied them into doing this and why I'm so happy that they made it. I've also given this company more [ __ ] than almost any other company I've invested in in my life. I am very proud of them for doing this, but I'm not biased towards them in the ways that you might think.
And when it comes to labs, neither I nor Data Curve has a particular bias. We'll talk a bit more about them. But first, I do need to be biased for a second for today's sponsor. AI models have finally gotten good enough to make computer use compelling. It's still so insane watching these agents navigate your computer and do things on it until they get stuck. And they get stuck a lot. I just had an agent get stuck when it was trying to update a single input field because it couldn't see the field in the UI. It was kind of maddening, but then I saw today's sponsor and realized this problem is solved as long as you're using Browserbase.
Specifically, when you're using the new Browserbase browse package, it's a browser CLI for your agents. What does that mean? I already thought my AI could use the browser. Well, it can, but it doesn't really know what to do. It's just going to keep screenshotting the page and looking for things and rarely ever manage to find them. It needs to know where things are. One of the best ways to give it that knowledge is through skills. Browserbase is working on a collection of all the skills needed to navigate all the websites you would care about.
First, you add all the relevant websites as skills and they will look through their indexing to see if they have the information needed to use the site properly. And once all of the skills are added, you just use Claude as you normally do. "Plan a road trip to Utah with EV charging stops and campsites for each night. Book and reimburse on Ramp." That's nuts. That's so many different steps that I never would have expected it to one-shot before, but when it has the right skills, it can do it. But if you want to be more granular, you can do that, too.
You can call different UI elements in a given page that you're on by calling things as simple as browse click and then the input that you're trying to click on. Or you can tell it to type a specific thing like apartments in SF, select an option, press a key, use the mouse to move around or scroll, take a screenshot, get the content, and you can use this with any tool you're already using. The more I've been automating my work and trying to even just get things deployed, the more value I've seen in using a tool like this to actually let my agents do stuff on my computer.
When you're ready to move the sessions to the cloud, you should probably do it at soyv.link/browserbase. Hi, Theo from the future here. One day in the future, because everything changes really, really fast. I know a lot of you guys are looking at this chart and saying, "Of course, Theo published this video before Opus 4.8 dropped. It's going to make it look so bad." That's what I'm here for now. Data Curve is still working on getting finalized numbers for Opus 4.8, but I want to share what they have so far. They initially ran it using the Claude Code harness instead of the mini SWE harness used for all the other tests.
And the results were that it performed roughly as well as Opus 4.7, but for meaningfully cheaper, which is a good change. It's still significantly slower than OpenAI models and also slightly dumber, especially when you're using it with Claude Code. They did do a follow-up test using mini SWE though and saw a massive improvement in the score. That move bumped it all the way to 63%, which puts it pretty close to GPT 5.5 at 70%. They're theorizing as to why Opus performs so much better in mini SWE in comparison to Claude Code. Their current theory is that the Claude Code system prompt is kind of holding back the capability of the model to do these one-shot tests.
I'm honestly inclined to believe that. I know that Claude Code loves to stop and ask you questions and get feedback on things before it goes, but also the Claude Code system prompt is just kind of garbage. So, yeah, that doesn't surprise me. If anything's changed since I recorded here, I'll be sure to pin it in the comments. So, take a look for that if you want to make sure the numbers haven't shifted since. Everything else in this video is still super valuable simply because it's about the culture around benchmarking, the techniques that we use to measure these things, and most importantly, how we actually prompt these agents in order to get them working in real-world codebases.
I also slowly descend into madness as I learn how bad SWE-Bench is, which like wow. So yeah, stay tuned for that. While I'm tempted to keep leading you guys on promising the results, I'm not going to do that. I'm just going to show you them and then we can talk about it, because OpenAI slaughtered in this bench. GPT 5.5 did the best by far at a 70% success rate. The next highest was 5.4 at 56%. Then we finally get an Anthropic model with Opus at 54%. Then there's a huge drop, and this is the most notable thing here, from Opus 4.7 to Sonnet 4.6, the next highest.
We went from 54 to 32, a nearly 50% drop. So what's going on here? What could they possibly be doing that makes OpenAI models so good on this bench and other models so bad? Well, first I'm going to deal with the bias accusations, cuz I already see them happening in chat. People seem to think Data Curve is maybe being paid by OpenAI to make this bench. I was the one who showed this bench to OpenAI. So, no, I hit them up and told them about it. Data Curve's job is to sell data to labs to help them get better curation for code data to train the models to be better at coding.
The issue they were encountering is that the labs would report their models were really good at code, and then they would go look at the models and look at the results and they saw a ton of [ __ ]. Because they were so confused about the gap between how these models were reported to be at code versus the actual experience using them, they went deep, especially on SWE-Bench Pro. This bench used to be one of the best ones we had for coding tasks, but honestly, the quality of the problems in it isn't very high.
But more importantly, the quality of the analysis of the results is absolute garbage. They use an AI to analyze the results to verify them to make sure that things went as expected. And the analyzer and verifier often disagree when they run these tests. There are many cases where the verifier failed but the analyzer judged it correct, as many as 19 to 28% of runs. "Cheated" is the bucket that things fall under if the verifier said that the code worked and it passed, but the analyzer looks at the history of what it did and determines that it did something it shouldn't have.
And that hit roughly 13% of Opus 4.6 and 4.7 trials. 87% of those cheated runs involved the agent reading out of the git history in order to cheat and find the right shape. I didn't realize they actually wrote the analyzer to compare against what the verifier from SWE-Bench Pro did, and they found these gaps. Their audit found that SWE-Bench Pro misgrades runs, with about 8% false positives and 24% false negatives. And when you combine that with the contamination going on, the bench kind of sucks. If you have doubts on this bench, this chart should be the one that convinces you.
It shows what the numbers were in SWE-Bench Pro and how they come out on this new Deep SWE bench instead. If you've ever used a model like Gemini 3 Flash to try and code, you understand that it getting a 35% when Sonnet got a 54% is nonsense. Those models aren't within 30% of each other. They're different worlds. Sonnet 4.6 can actually do work and Gemini 3 cannot do [ __ ]. There's definitely some confirmation bias here, where I and a lot of people that I talked to have seen and experienced models behaving in the way that this bench measures.
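The verifier-versus-analyzer audit described above amounts to grading every run twice and counting where the two graders disagree. The following is a minimal sketch of that idea, not Data Curve's actual tooling: the `Trial` fields, the verdict strings, and the sample data are all invented for illustration, and rates are computed over all trials because the transcript does not say which denominator the audit used.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One benchmark run graded twice (hypothetical field names)."""
    verifier_passed: bool   # what the benchmark's own checker reported
    analyzer_verdict: str   # independent review: "correct", "incorrect", or "cheated"


def bucket(trial: Trial) -> str:
    """Compare the two graders and name the kind of (dis)agreement."""
    legit = trial.analyzer_verdict == "correct"
    if trial.verifier_passed and trial.analyzer_verdict == "cheated":
        return "cheated"          # passed, but by doing something it shouldn't
    if trial.verifier_passed and not legit:
        return "false_positive"   # passed, but the solution is actually wrong
    if not trial.verifier_passed and legit:
        return "false_negative"   # failed, but the solution is actually fine
    return "agree"                # both graders accept or both reject


def audit(trials: list[Trial]) -> dict[str, float]:
    """Return the share of trials that fall in each bucket."""
    counts = Counter(bucket(t) for t in trials)
    names = ("agree", "cheated", "false_positive", "false_negative")
    return {name: counts[name] / len(trials) for name in names}


# Invented sample data, only to show the shape of the calculation.
sample_trials = (
    [Trial(True, "correct")] * 5
    + [Trial(True, "cheated"), Trial(True, "incorrect")]
    + [Trial(False, "correct")] * 2
    + [Trial(False, "incorrect")]
)

if __name__ == "__main__":
    print(audit(sample_trials))
```

The point of the second grader is that a benchmark's headline pass rate is only as trustworthy as its verifier; a large false-positive or cheated share inflates scores, and a large false-negative share hides real capability.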
I personally don't use Gemini models for real dev work because they so quickly fail: just running in loops with tool calls, failing to find the right file, writing code that doesn't compile at all. They're just not good in real work. They might be able to edit one CSS file to make things pretty, but when it comes to actually navigating a codebase and getting things done, no, not even [ __ ] close. And while SWE-Bench Pro finds Sonnet 4.6 to be 1.5x better than Gemini 3 Flash, their bench sees Sonnet 4.6 as six times better than Gemini 3 Flash.
That is a massive jump and shows just how much better this bench is. What's craziest to me is the gap from the bottom to the top being so big. Let's compare, like, Haiku versus 5.5. In SWE-Bench Pro, there's 20 points between them. On Deep SWE, there is 70. There's finally an actual range that is useful to look at and learn from. So what makes this test so different? How are they getting such crazy results? Are they putting in hints to make it so the OpenAI models perform better and the Anthropic ones perform worse? No, they made it more realistic.
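The comparisons above mix two different measures: percentage-point gaps and relative ratios. A small sketch of the arithmetic, using only scores quoted in the transcript (the dictionary layout and function names are just illustrative):

```python
def relative_ratio(a: float, b: float) -> float:
    """How many times larger score a is than score b."""
    return a / b


def relative_drop(before: float, after: float) -> float:
    """Fractional drop when going from one score to a lower one."""
    return (before - after) / before


# Scores as quoted in the transcript (percent of tasks passed).
swe_bench_pro = {"Sonnet 4.6": 54, "Gemini 3 Flash": 35}
deep_swe = {"GPT 5.5": 70, "GPT 5.4": 56, "Opus 4.7": 54, "Sonnet 4.6": 32}

pro_ratio = relative_ratio(swe_bench_pro["Sonnet 4.6"], swe_bench_pro["Gemini 3 Flash"])
point_gap = deep_swe["Opus 4.7"] - deep_swe["Sonnet 4.6"]
opus_to_sonnet_drop = relative_drop(deep_swe["Opus 4.7"], deep_swe["Sonnet 4.6"])

if __name__ == "__main__":
    print(f"SWE-Bench Pro, Sonnet 4.6 vs Gemini 3 Flash: {pro_ratio:.2f}x")
    print(f"Deep SWE, Opus 4.7 to Sonnet 4.6: {point_gap} points, "
          f"{opus_to_sonnet_drop:.0%} relative drop")
```

On these figures the SWE-Bench Pro ratio comes out at roughly 1.5x, matching what is said aloud, while the 54-to-32 step on Deep SWE is 22 points, or about 41% in relative terms.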
All of the tasks are written from scratch. They're not using existing commits or PRs. There's way more diversity in the tasks that they're doing. It's not like half Python like SWE-Bench is. The prompts are half the length of SWE-Bench Pro prompts, but the solutions require five times more code and two times more output tokens. But the most important part is the verification layer. Verifiers are handwritten to test software behavior rather than implementation details. A lot of this comes down to prompting, and I really like how they talk about it here. They mention that strong models will test their own work unless the prompt tells them not to.
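To make the "behavior rather than implementation details" distinction concrete, here is a toy sketch. It is not taken from Deep SWE; the slug functions, the test cases, and the `_strip_punctuation` helper name are all made up. Two solutions with different internals both satisfy a verifier that only looks at inputs and outputs, while a check that assumes a particular helper symbol would reject a perfectly valid solution.

```python
import re


def slugify_a(title: str) -> str:
    """Candidate 1: regex-based implementation."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def slugify_b(title: str) -> str:
    """Candidate 2: different internals, same results on the cases below."""
    kept = (ch if ch.isascii() and ch.isalnum() else " " for ch in title.lower())
    return "-".join("".join(kept).split())


def behavior_verifier(slugify) -> bool:
    """Checks only observable behavior, so any correct implementation passes."""
    cases = {
        "Hello, World!": "hello-world",
        "  Deep   SWE  ": "deep-swe",
        "already-a-slug": "already-a-slug",
    }
    return all(slugify(raw) == expected for raw, expected in cases.items())


def implementation_coupled_check(namespace: dict) -> bool:
    """Anti-pattern: assumes the solution defines one specific helper symbol."""
    return "_strip_punctuation" in namespace


if __name__ == "__main__":
    print(behavior_verifier(slugify_a), behavior_verifier(slugify_b))
    print(implementation_coupled_check({"slugify": slugify_b}))
```

A verifier written the first way lets the prompt stay short, because the agent is free to choose names, files, and structure as long as the software behaves correctly.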
One of the biggest problems with SWE-Bench Pro is that it tells the models to not write their own tests to verify results. So the models will write the code and then hope it works, because certain models will do what you tell them to and certain ones won't. And since Opus models, and Claude models in general, tend to just kind of ignore what you ask a lot of the time and do things they probably shouldn't, they'll cheat more. They'll make tests when they're told not to more, and they'll put themselves in a scenario where they're more likely to get the right answer by hacking, cheating, and doing those things.
Since this test uses way simpler prompts, like two sentences instead of two paragraphs, the result is that the model can do what it thinks is right more. And if it doesn't have the resources to cheat, it has to make resources itself. And the results are way, way more reliable. They published numbers showing how often a model wrote tests when it wasn't supposed to in SWE-Bench Pro versus how often it wrote tests in Deep SWE, where they're not telling it it can or can't. And you'll see here for Opus 4.7, it wrote tests 28% of the time even though it was told not to.
And it wrote them 83% of the time when not given an instruction. GPT 5.4 wrote tests almost half as much, at 18% in SWE-Bench Pro. And now it's up to 85% with this bench instead. To put it bluntly, SWE-Bench Pro tests how good models are at contaminated Python repos that they're allowed to cheat in, where they're told explicitly to not write tests, and how likely they are to ignore that. So, SWE-Bench Pro is realistic in the sense that it's only being tested on existing work people have done with really shitty, poorly written prompts. And Opus is indeed better at ignoring your prompts.
So, if you're looking for the model that does the best job at ignoring what you ask, Opus seems to be really good at that. If you're looking for a model that does exactly what you ask, we now have a bench that shows it. There are so many other fun things to dig into here, like wall-clock times, output tokens, and actual cost to run the bench. And we'll talk about that all in a sec. But first, I want to actually show some of the examples, because they publish them. Deep SWE prompts are aligned with the way developers talk to their agents.
They're behavior-focused, short, and free of large interface definition blocks rather than overly verbose and prescriptive. Agents must discover where and how to implement the change. So, a substantial share of the capabilities being evaluated involve end-to-end exploration instead of just the execution of an overspecified engineering task. Public benches sourced from GitHub issues and pull requests often carry more detail: repro steps, additional context, code snippets, and tests that assume specific symbols or signatures. Deep SWE instead scores observable behavior, which lets prompts stay short and natural even when the underlying tasks are substantially longer. So the prompts are half the length.
The number of lines added is five times more, and the number of files touched is also much bigger too. Their range of technologies is much better overall. About 30% TypeScript, 30% Go, 30% Python, and then a couple other things. This is across 91 active repos in five languages, versus SWE-Bench, which is 12. Most importantly though, all of the tasks are novel. So there aren't existing solutions on GitHub for them to find and use to cheat. Now the drop in the false positive and false negative rate is crazy. Their verifiers only had false positives 0.3% of the time.
SWE-Bench Pro is almost 10%, and then 24% false negatives down to 1.1. So let's take a look at some of these examples. I'll just hit the top three here. This one is for Happy DOM, which is a JS interface of a web browser without the GUI. This is really useful for testing and doing things on the DOM without a real DOM. This also means it has a ton of complex implementation details you have to get right. Here's the prompt: "Happy DOM currently leaves some asynchronous work in an invalid state after disposal. When shutdown through happyDOM.close, page.close, browser.close, or a navigation that swaps out the active page state interrupts request or response body consumption, the read must reject with a DOMException named AbortError. The same shutdown behavior should apply to multipart form data parsing. Successful reads that are not interrupted should remain unchanged, and fully buffered response bodies should remain readable after shutdown. Scheduled timers and requestAnimationFrame callbacks associated with the discarded page must also be cleared."
Great. This is a good prompt. Specifically, this prompt makes it very clear what the goal is: what should this do when behaving properly? We can see here every Anthropic model passed. Gemini 3.5 Flash passed. 5.4 and 5.5 did as well. DeepSeek V4, three out of four times. GLM and Kimi, three out of four times. MiMo started to fail more. Gemini 3 Flash errored half the time. 3.1 Pro mostly failed. And now we're back in hell.
I want you all to really think about these types of problems as we go over this stuff, because you should have the opportunity to make similar things. If you've tried using models for your work and you find the results aren't very good, every time that happens, write it down. Have some notes somewhere where you write down the models you tried, the prompt you used, the tools you were using, and the codebase and the hash that you were on when you tried the prompt.
Maybe fiddle around with the prompt a bit and see if making slight adjustments can get you the result that you're looking for. But most importantly, keep a corpus of these failures. Keep track of the times the agents couldn't do the work that you wanted them to do and keep that around because when new tools come out, you can measure it yourself. You can even create your own mini benchmarks. These models are very good at doing that. If you can collect these examples, throw them into some sandbox somewhere, and see which agents can and can't solve them, you now have a useful measurement for what models do and don't make sense for you and your business.
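One possible way to keep the failure corpus described above is a JSON Lines file with one record per failed attempt, plus a tiny function that replays the corpus against an agent and reports a pass rate. This is a sketch under stated assumptions: the record fields mirror the notes suggested in the transcript (model, prompt, tools, codebase, commit hash), while the file format, all names, and the `attempt` callback (which would wrap your own sandbox and checks) are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Callable, Iterable


@dataclass
class FailureRecord:
    """One time an agent could not do the work (illustrative schema)."""
    model: str    # model that was tried
    prompt: str   # exact prompt that was used
    tools: str    # harness or tools in use at the time
    repo: str     # codebase the prompt ran against
    commit: str   # commit hash the repo was on
    notes: str = ""


def append_failure(path: Path, record: FailureRecord) -> None:
    """Append one record as a single JSON line."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")


def load_failures(path: Path) -> list[FailureRecord]:
    """Read the whole corpus back into memory."""
    with path.open(encoding="utf-8") as f:
        return [FailureRecord(**json.loads(line)) for line in f if line.strip()]


def pass_rate(
    records: Iterable[FailureRecord],
    attempt: Callable[[FailureRecord], bool],
) -> float:
    """Share of corpus tasks a given agent now solves.

    `attempt` is a placeholder for whatever runs the agent in a sandbox
    at the recorded commit and decides whether the task was solved.
    """
    results = [attempt(record) for record in records]
    return sum(results) / len(results) if results else 0.0
```

When a new model or harness comes out, rerunning `pass_rate` over the same file with a different `attempt` gives a private, uncontaminated comparison on exactly the tasks that mattered to you.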
You might even have something good enough to share publicly. We need benches so badly. And if you are experiencing models not doing what they should in your day-to-day work, you have something that I don't: the data to make one of these benchmarks. Do it. It's not that hard to do. And you'd be amazed how quickly the research community notices when you put together a small, seemingly shitty benchmark. I have been absolutely floored by how positive the response was to SnitchBench and SkateBench, these two crappy benchmarks made by a YouTuber who had some weird questions he wanted answered.
Take advantage of this opportunity. It's not that hard to build. It kind of feels like the SWE-Bench Pro creators are making it hard to read their prompts, cuz this is the only way to actually look. And this is a mess. So, the first example they have is for NodeBB, which is a Node.js-based forum software that I haven't heard of anybody using in a long time. Like, oh, there are still people committing. Cool. That's good to see. The project's not dead. But they took an actual issue that existed. I think they even have a link to it somewhere.
Oh, they have the patch and the test patch to test things. The problem statement is what I want to read. This is the example. "Email validation status not handled correctly in ACP and confirmation logic." Title, description. "The Admin Control Panel (ACP) does not accurately reflect the email validation status of users. Also, validation and confirmation processes rely on key expiration, which can prevent correct verification if these keys expire. There's no fallback to recover the email if it's not found under the expected keys. This leads to failures when trying to validate or resend confirmation emails." This is awful.
There are steps to reproduce, which is helpful, but this formatting is terrible, especially ending it with labels. What? This is a horrible prompt. This is a terrible prompt. This is not measuring [ __ ]. So, all the SWE-Bench runs append just garbage to the system prompt, and I finally was able to dig and find it. Thanks to Chaff for helping me get here. "You're a helpful assistant that can interact with the computer to solve tasks. I've uploaded a code repository in the directory working dir. Consider the following PR description." This is where the problem goes. That problem is the awful thing I just read over here a second ago.
"Can you help implement the necessary changes to the repository so the requirements specified in the description are met? I've already taken care of all changes to any of the test files described in that problem. This means you don't have to modify the testing logic or any of the tests in any way." This part in particular is awful. Telling it it doesn't have to modify testing logic or add tests. This one line should invalidate SWE-Bench by itself. This is so bad. Like, why is SnitchBench, a shitty benchmark checking if models are willing to hit the government up when you're doing medical malpractice, better architected and better designed than this [ __ ]? "Your task is to make the minimal changes to non-test files in the directory to ensure that the description of the problem is satisfied.
Follow these steps to resolve the issue. One, as a first step, it might be a good idea to find and read the relevant code. Two, create a script to reproduce the error and execute it using the bash tool. Three, edit the source code of the repo to resolve the issue. Four, rerun your reproduce script and confirm the error is fixed. Five, think about edge cases and make sure your fix handles those as well. Your thinking should be thorough and so it's fine if it's very long." Are you joking? Well, at least we know how Gemini models perform okay on SWE-Bench: because they are told exactly how to do things.
I don't know about y'all, but personally, I don't write 15 steps when I ask a model to go fix some small bug. This is the type of prompting that would have almost sounded like it made sense in the GPT-3 days, but the fact that we're still using this bench today is, I'll just be [ __ ] frank, it's pathetic. It's actually pathetic. And this has been the industry standard for a long time. This is the same bench that Anthropic bragged their model got a 78% on, and GPT 5.5 only got a 58.6. Well, what's funny on that is that it's likely their numbers were just entirely wrong.
Somebody did the math based on the reporting from Data Curve and calculated that the actual score 5.5 should have gotten is probably an 86.7, not a 68.5. Insane. And again, for comparison, let's look at a Deep SWE prompt. "Fix PromQL label sorting across typed and untyped values" in this Go project. Very well-regarded library for monitoring systems and time series database stuff. I know we used Prometheus at Twitch. I never touched it, but I know this is a very legit library. "Label sorting must use multi-domain typed comparison. Current behavior does not produce a stable total order when labels mix heterogeneous typed and untyped string representations.
Values with leading whitespace are never parsed as any typed form and must be sorted before all other values. Within this leading-whitespace group, ordering is by natural sort of the original strings. Order classes are as follows." Details of how this should be ordered. "Numeric parsing must accept scientific exponents and optional leading plus signs. A bare exponent marker with no following digits is not a valid number and falls back to untyped natural sorting." Yada yada, you get the idea. This isn't telling it how to solve the problem. This is telling it what the problem is and what the solution should look like.
This is a good prompt. If your prompts look more like the mental illness that is SWE-agent's system prompt here and less like what I just showed here in Deep SWE, you probably should be going with whatever the results of SWE-Bench Pro are, because shitty prompts from shitty prompters with 5,000 too many characters? Seems like they're measuring that fine. I like prompting like this. My prompts are me describing the problem and roughly what the solution looks like to the agent and then trusting it to go do its thing.
Also interesting on this one, this big Go project that required a lot of code to change for it to work: the solution was 805 lines of code. GLM 5.1 did okay on this. It actually got a 50% score. GPT 5.4 got three out of seven. So technically it did worse than GLM 5.1. 5.5 similarly had three out of seven. Opus only got it twice. Very interesting. The more you read into this, the more it makes sense why this gap is so big. And I want to be clear, all of this shit's obvious. This isn't them doing some crazy novel thing with the bench.
They just sat and put in the work that nobody else wanted to do, because you have to verify these results. And that's a shitload of reading shitty code. Nobody wants to do that. And I am thankful that they were willing to put in the work to go do it, because they finally got us a bench with results that mean something. And now we can look at the other results that mean things, because finally we have a bench that's good enough that token outs, cost outs, and wall-clock times are useful, as well as the comparisons of different agents, which I think is really cool too.
Let's actually start with that. They use the mini SWE agent from SWE-Bench, which is a super minimal agent harness that just gives the agent the ability to use bash. They tried running against the official harnesses from the other models as well, and the results were telling. The pass rate when they used mini SWE agent with Claude Opus was 50%. But when they switched to Claude Code, it dropped 10%. It did actually do more output tokens with mini SWE agent than it did with Claude Code, though, which is notable. With GPT 5.5, it scored identically in the subset they tested with the same harness, but it used more output tokens in Codex.
Not a lot more. And to be clear, both of those numbers are way smaller than the Opus run, where the worst run with Codex CLI was 25K outputs. And with Claude Code, the better case for Opus, it was 48K, almost doubling. And then Gemini 3.1 Pro got a 40% in the mini agent, but still only got a 20% through the official Gemini CLI. Surprise. They did all this testing because they wanted to make sure that this agent wouldn't meaningfully disadvantage any one model family. And it seems like the results came out fair as a result.
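For a sense of what "a super minimal agent harness that just gives the agent the ability to use bash" can look like, here is a rough sketch. It is not the mini SWE agent's code and makes no claim about its API: every name is invented, commands go through the system default shell via `subprocess` rather than bash specifically, and the "model" is a scripted stand-in that replays fixed commands instead of calling a real LLM.

```python
import subprocess
from typing import Callable, Optional


def run_shell(command: str, timeout: float = 30.0) -> str:
    """Run one shell command and return its output plus the exit code."""
    proc = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return f"{proc.stdout}{proc.stderr}[exit code {proc.returncode}]"


def agent_loop(
    next_command: Callable[[list[str]], Optional[str]],
    max_steps: int = 10,
) -> list[str]:
    """Ask the model for a command, run it, and feed the result back.

    The loop ends when the model returns None or the step budget runs out.
    """
    transcript: list[str] = []
    for _ in range(max_steps):
        command = next_command(transcript)
        if command is None:
            break
        transcript.append(f"$ {command}\n{run_shell(command)}")
    return transcript


def scripted_model(transcript: list[str]) -> Optional[str]:
    """Stand-in for a real model call: replays a fixed list of commands."""
    script = ["echo hello", "exit 3"]
    return script[len(transcript)] if len(transcript) < len(script) else None


if __name__ == "__main__":
    for step in agent_loop(scripted_model):
        print(step)
```

Because a loop like this adds almost no system prompt or tool scaffolding of its own, score differences between it and a vendor harness point at the harness rather than the model, which is the comparison being described here.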
So, let's look at the fun numbers. Score versus output tokens per trial. For those who wonder why I hate 3.5 Flash so much, here you go. There's a chart that shows it. It scored less than half as well as GPT 5.5 did, and it used comically more tokens. GPT 5.5 did 47K tokens per trial on average. 3.5 Flash did 150,000. 3x the tokens. Opus did 2x at 97K, again compared to 47K. Double the number of tokens and a lower score. This goes even further with the cost comparisons, where Opus ends up being more than 3x more expensive than any other option.
If you wonder why Anthropic's revenue is growing so fast, this is probably part of it: the fact that they charge so much more for their intelligence, because it burns way more tokens and doesn't perform that well as a result. If you want to spend a lot of money on AI and you're trying to maximize your AI spend at your company, you should definitely be using Opus. The numbers tell you why. Opus cost $16 per run on average. GPT-5.5 cost $5.80 on average, and 5.4 was even cheaper at $3.30. My favorite number here, though, is the Gemini 3.5 Flash number, because the model's a Flash model.
It should be way cheaper than OpenAI, right? Right? Like, why would I use the $30 per million out model when I could use the $9 per million out model that's so fast and just as smart? Oh, it's because it cost almost the exact same amount of money in the end. But it's a Flash model. It's fast, right? Let's look at the performance. Oh yeah, it was faster. It finished in 15 minutes versus 20 minutes. Okay, so it was like 25% faster, and it was 3x dumber and roughly the same price. Why would you use 3.5 Flash for, like, anything?
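The "cheaper per token but the same bill" point can be checked with quick arithmetic. The sketch below multiplies the quoted output prices by the quoted average output tokens per trial. It deliberately ignores input and cached tokens, which is an assumption of this example, so the results are output-only costs and will be lower than the per-run totals on the chart. The function name `output_cost_usd` is made up for illustration.

```python
def output_cost_usd(output_tokens: int, usd_per_million: float) -> float:
    """Cost of the output tokens alone (input and cached tokens are ignored)."""
    return output_tokens * usd_per_million / 1_000_000

# Prices and token counts as quoted in the video.
gpt_cost = output_cost_usd(47_000, 30.0)    # $30 per million output tokens
flash_cost = output_cost_usd(150_000, 9.0)  # $9 per million output tokens

# Wall-clock comparison: 15 minutes versus 20 minutes.
time_saved = (20 - 15) / 20

print(f"GPT-5.5 output cost per trial:   ${gpt_cost:.2f}")
print(f"3.5 Flash output cost per trial: ${flash_cost:.2f}")
print(f"3.5 Flash wall-clock time saved: {time_saved:.0%}")
```

Under those assumptions the two output bills come out within a few cents of each other, because roughly 3x the tokens cancels out a price that is roughly 3x lower.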
Yeah, these numbers are crazy. I can already predict what a lot of the questions in the comment section are going to be. So, let's address as many of them as we can. First, I am sure people are asking, "What about Opus 4.6? What about the new Qwen models?" Let's turn those things on. Here's Opus 4.6, Qwen 3.6, Haiku, and MiniMax. Cool. Turns out Opus 4.6 bombed this test. I'm not quite sure of the details as to why, and I haven't talked to the team about it yet, but Opus 4.6 really struggled with this bench. So maybe all the things Anthropic was saying about Opus 4.7 being better at these long tasks were true in the end.
Looks like it very well could be the case. I will talk to them more and try to run this myself out of curiosity, but all the other things I just turned on landed much lower. This is probably the most damning bench for open-weight models ever, because none of them get to even half the score of the last generation of state-of-the-art models. Sure, when you look at Artificial Analysis, things look less bad. It's like, look, MiMo V2.5 got a 54. Kimi K2.6 got a 54 as well. Opus was a 57 and 5.5 is only a 60.
Like, that's not that big a gap. Even DeepSeek V4 Pro did a 52. Like, clearly open weight's catching up, right? Well, uh, no. Kimi K2.6 scores as well as 5.4 Mini, and 5.4 Mini is not a good model. If 5.4 is more than doubling every single open-weight score, that's pretty damning. And this also confirms the experience I had with DeepSeek V4 Pro, where if you isolated it to one file to solve one small thing, it worked fine. But if you wanted to actually do real work in a real codebase, it fell off immediately. I am very thankful to have this bench.
It has confirmed so many of the things that I, and you guys as well, have felt as we use these things: that the gap between Opus, 5.4, 5.5, and everything else is huge. That 5.5 is a meaningful jump in the ability for agents to do serious work with minimal context. And that when you actually use these things for real work, giving them real problems to solve, the gap between the open-weight models and the state-of-the-art models feels way more than 5%, which is what other tests were showing. And again, when you compare this to the Code Arena bench, you see why I don't trust the old benchmarks.
I believe this one is mostly a front-end test, which, sure, GPT-5.5 is not as good at generating a homepage or a marketing site as Claude Opus is. Cool. These numbers make no sense. In a world of people running with these types of results, it is such a relief, genuinely, to have a company that I invested in, that was a bunch of nerdy Waterloo kids just a few years ago, have the same frustrations I had with the [ __ ] quality of benchmarks and data and come out to show us what it should look like. I know this video is going to get flamed with accusations of paid bias or shill or whatever.
And I guarantee every single one of those people is either working on small projects or hasn't given the newest OpenAI models a proper shot, because the gap between open-weight models and cheap old models for real day-to-day development work compared to the state-of-the-art is massive. And any benchmark comparing these things should show that. And thankfully we're seeing other people do this as well. swbench is another attempt to do a realistic coding bench that is getting more attention right now, and it shows pretty much the exact same breakdown as we just saw. Claude Code is a little closer to 5.5, but not like a lot closer.
5.5 is a little closer to using Claude Code, and their custom harness doesn't seem to behave quite as well as Claude Code and Codex do in their testing, but it shows a very similar split here, with the massive gap between the best open-weight models and the current state-of-the-art. So, these results aren't as unique as they might seem. They only seem unique because we've been fed [ __ ] benchmarks for the last few months, if not years. All of that said, this bench isn't perfect. It's missing a few things I would really like to see. Ideally, we would have more tests comparing the official harnesses versus the one that they built with mini-SWE-agent.
It'd be cool to do the whole run with both. It would also be really awesome if they included things that are a little more annoying to put in, stuff like Composer 2.5 from Cursor, which you can't really use in your own harness. So, they'd have to make this work with other harnesses. It would also be cool if they made a way harder version with more private data. They were almost too transparent with the examples, which makes it easier for things to be trained on this bench in the future to force high scores. So, I hope they do a much bigger private version where models don't score quite as well.
A benchmark dropping with a 70% score is a bit scary, because that means it's going to hit 100% by the end of the year at this rate. So, I would love a version of this benchmark that, like, knocks a zero off of everything. What if 5.5 got a seven instead of a 70? That would be really cool. They were super transparent with the limitations of their benchmark. They specified that by running through mini-SWE-agent with only the single bash tool and the shared prompt, different models might not behave as well because of the differences in the harness.
GPT expects apply_patch and Claude expects the text editor tool. Routing every edit through bash might actually keep them from doing as well as they can in their own harnesses. It's also of note that devs don't actually use mini-SWE-agent. I don't know anyone who uses it. I don't think it's meant to be used. It's not Pi, although it is similar to Pi. So I guess if you like Pi, it'll be close-ish. They're using things like Codex CLI, Claude Code, Cursor, and the Gemini CLI, which their leaderboard does not reflect. They want this bench to be as realistic as possible.
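To picture what "routing every edit through bash" means, here is a hypothetical sketch of a harness whose only tool is a shell command runner. This is not mini-SWE-agent's actual code. The function `run_bash`, the scratch file `calc.py`, and the `sed` edit are all invented for illustration, and the example assumes `bash` and `sed` are on the PATH.

```python
import subprocess
import tempfile
from pathlib import Path


def run_bash(command: str, cwd: str, timeout_s: int = 30) -> tuple[int, str]:
    """The single tool: run one shell command and return (exit code, combined output)."""
    result = subprocess.run(
        ["bash", "-c", command],
        cwd=cwd,
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return result.returncode, result.stdout + result.stderr


# A scratch "repo" containing one buggy file.
workdir = tempfile.mkdtemp()
run_bash("printf 'def add(a, b):\\n    return a - b\\n' > calc.py", workdir)

# With no patch or editor tool, the model has to express the fix as a shell command.
code, output = run_bash(
    "sed 's/a - b/a + b/' calc.py > calc.tmp && mv calc.tmp calc.py && cat calc.py",
    workdir,
)
fixed_source = Path(workdir, "calc.py").read_text()
print(output)
```

A model tuned to emit structured patches or editor calls has to translate each edit into something like that `sed` line, which is the kind of mismatch the limitation describes.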
So they're calling this out even though nobody else calls these things out, because none of them are trying to measure actual usage day-to-day for coders. They're just trying to measure, roughly and arbitrarily, how intelligent the model is. The corpus draws only from active open source repos with at least 500 stars on GitHub. This keeps the tasks anchored in real, maintained code, but the results may not generalize to long-tail repos or proprietary codebases. Also a very good callout: because Deep SWE is focused on these long-horizon tasks, bug localization and refactoring are underrepresented. So if a model's really good at debugging or refactoring, and not at building big feature adds or the harder problems they're benching, it won't be represented well here.
It's possible that Gemini 3.1 Pro is the best debugging and refactoring model ever. We wouldn't see that here. The bench only covers five languages as well. It's mostly concentrated on TypeScript, Go, and Python. Widely used languages like C++ and Java aren't represented yet. Also something cool to fix in the future. But remember, they wrote all the verifiers by hand. So for any language that they add here, they need an employee who understands it well enough to verify the results. And while the prompts are shorter, they much more closely reflect how devs actually message with agents. Behavioral verification needs some minimum specificity to know what surface to test against, which puts a floor on how terse a prompt can be before the test becomes ambiguous.
Yep. If this type of thing excites you and you made it this far into the video, you might want to go click the link in the description, because they have a hiring button and I know they would be pumped if they heard you came from my video. So make sure you let them know. This was absolutely phenomenal work from Serena and team. They did a great job on this bench. I've been begging for anyone to make a bench like this for a while now, and it's so cool to see it not just exist, but confirm the experiences that I and my other engineering friends have been having working with these things.
And this gets me even more excited for the next generation of models. I hope this video was helpful to you. I know it was really exciting to me. Let me know what y'all think. And until next time, peace nerds.