I don’t really like GPT-5.5…

Theo - t3.gg | 00:27:08 | Apr 24, 2026
Host introduces the GPT 5.5 release from OpenAI and previews mixed feelings, acknowledging strong potential but overall disappointment in some areas.

GPT-5.5 is unusually powerful yet frustratingly imperfect, with steep pricing and UX quirks that demand new workflows and prompts.

Summary

Theo (t3.gg) dives into OpenAI’s GPT-5.5 release, weighing its impressive capabilities against a host of edge cases and usability frictions. He notes a significant price hike—$5 per million tokens in and $30 per million tokens out—yet acknowledges token efficiency gains that can justify the cost for heavy tasks. The video covers technical benchmarks (GB200 NVL72-based training, Nvidia partnerships, and several internal test metrics) and discusses the absence of an official API at launch. Theo tests front-end demos, 3D game feasibility, and real-world coding tasks, sharing mixed results: noticeable code quality improvements and remarkable long-form problem solving, tempered by issues with context persistence, thread management, and “GPTisms.” He highlights the Pro model as the standout performer, capable of solving complex puzzles and cryptographic challenges, but again warns that crafting precise prompts and managing threads remains crucial. Throughout, Theo contrasts 5.5 with Claude Opus 4.x and Google models, calling out strengths in coding and tool use while criticizing the user experience and its tendency to over-edit or cling to stale context. He concludes that GPT-5.5 marks a meaningful leap, but it requires rethinking prompts, workflows, and expectations, and he suspects a future wave of even smarter models will supersede this release. The sponsor section and personal project notes (a 2D/3D Fish Slop game) anchor the discussion in practical, hands-on experimentation.

Key Takeaways

  • GPT-5.5 introduces a substantial price increase ($5/1M tokens in, $30/1M tokens out) despite improved token efficiency and benchmarking gains.
  • The GB200 NVL72-based training and Nvidia partnership underpin GPT-5.5's performance, though there is no official API at launch.
  • Pro mode delivers strong problem-solving and code-writing capabilities, solving long-running puzzles that previous models struggled with.
  • Long-running threads and context management remain pain points; users must frequently start new threads to avoid regressive behavior or stale context influencing decisions.

Who Is This For?

Essential viewing for developers and AI practitioners who are evaluating whether GPT-5.5 fits demanding coding, tooling, or long-running task workflows. It’s especially relevant for teams rethinking prompts, threading, and cost optimization in AI-assisted development.

Notable Quotes

"This is far from my favorite release OpenAI has done."
Theo's opening impressions of GPT-5.5.
"They are shipping this on the GB200 NVL72 systems, which are the latest state-of-the-art from Nvidia."
Hardware foundation for GPT-5.5.
"Once something's in the context, you can't prompt it out."
Challenge with long-running threads and context retention.
"The model is the smartest model ever made and writes the best code I've ever seen from an AI."
Pro model capabilities and caveats.
"If you use your old prompts, your old harnesses, your old skills... you're not going to have a great time with this model."
Guidance on adopting 5.5.

Questions This Video Answers

  • How much does GPT-5.5 cost per token, and is it worth it compared to GPT-5.4?
  • Can you use GPT-5.5 with an API yet, and what workarounds exist?
  • What makes the GPT-5.5 Pro model different from the standard version?
  • How does GPT-5.5 handle long-running tasks and context management?
  • What are practical tips for getting better results from GPT-5.5 in code projects?
GPT-5.5 · OpenAI · GPT-5.5 pricing · Nvidia GB200 NVL72 · Pro model · Code generation · AI tooling · Context management · Long-running threads · API availability
Full Transcript
Sup nerds. You might notice a couple things different about my setup because I'm currently in Miami at an event. But this is an important enough release I wanted to cover it, and I think my coverage might surprise you guys a bit. If you couldn't tell from the title, this is a video about GPT 5.5, the latest model released by OpenAI. And I'm sure you guys expect me to be really, really hyped about it. And in some ways, I am. But honestly, overall, I'm a little bit disappointed. This is far from my favorite release OpenAI has done. And there are a lot of weird edges that I want to really talk about here, because as powerful and smart as this model is, I've not loved working with it for a lot of the things that I do. It also comes with a pretty massive price hike at $5 per million tokens in and $30 per million tokens out. This puts it at 2x the price of GPT 5.4 and around 20% higher than Opus 4.7. While it is much more token efficient than either of those models, this hike is still huge. So, how do they justify it? Well, hopefully better than I'm about to justify it by switching to today's sponsor. When I was younger, I used to love building apps for my phone. But nowadays, I just haven't enjoyed it as much. It's too much work to get everything set up, to get Xcode working properly, to pick up things like React Native, and then as soon as you want a single native API, good luck and have fun. I've just not been able to enjoy building mobile apps. And yet, here's my phone plugged into my computer with me building a mobile app. That's because I'm so blown away with today's sponsor, ROR. I've never had this good of an experience building for iOS, and I've been doing it for 10-plus years. Getting your environment right for iOS is nearly impossible. These guys made it literally three clicks. Once I signed into my Apple account, I was pretty much done. Not only that, but you can run the apps on your real device, which I happen to be doing right now. Seriously though, how crazy is it that I could vibe code, with a single prompt, an app that lets me change different silly hats using the camera API, the AR API, and all of the other things you need to build good iOS apps, in Swift. They've effectively built a Claude Code instance in the cloud for you to use that can sync to your local device without having to set up an Apple developer account at all. Not only that, they made it two clicks to publish to the App Store. I've never seen a flow this good for getting an app from idea to reality to your friend's phone. It seems like they're thinking about iOS dev more than Apple is nowadays, and the results speak for themselves. If you want to build real apps that use real APIs on your real phone right now, look no further than soy. work. Okay, let's dive into GPT 5.5. See what OpenAI has to say about it, what everyone else has to say about it, and then, most importantly, what I think about it and how I think you can get the most value out of this model. Starting with their blog post, they say it's the smartest and most intuitive model yet. Saying "our smartest and most intuitive model," though, could be read to imply it's not the best in the world. Might be that Mythos is better. Might just be them being nice and careful with their wording. Don't want to overread those things. So, we'll keep scrolling to the parts that matter. We are releasing 5.5 with our strongest set of safeguards to date, designed to reduce misuse while preserving access for beneficial work.
We evaluated the model across our full suite of safety and preparedness frameworks, worked with internal and external red teamers, added targeted testing for advanced cybersecurity and biology capabilities, and collected feedback on real use cases from nearly 200 trusted early access partners before release. Yeah, they wanted to make sure it was secure. They also call out the fact that this model is larger and as such would normally be harder to serve at speed, but they've made a bunch of changes to how they're doing things to fix that. A lot of this appears to have come from a partnership with Nvidia, and they're using the latest Nvidia stuff now more so than they were before. They trained and shipped this on the GB200, I believe. Yeah, the GB200 NVL72 systems, which are the latest state-of-the-art from Nvidia. That's what they shipped this all on. And Nvidia has been very happy with the new models. Sadly, API support does not exist yet. We will talk about that and the workarounds in the near future. They will support it soon, but for now, I can't throw this in the API, so I can't do a lot of the things I normally would like to do. Although, I was lucky enough to have some early access to test things there and, uh, have some interesting numbers to share later. First though, I want to cover the numbers that they provided. We'll start with the Terminal-Bench numbers, where it crushed with an 82.7% compared to the 75.1 they had before. They've been leading there for a bit now. They don't evaluate Pro on these things cuz Pro is not really meant to be run in harnesses like that or for benchmarks in the traditional sense. I have very different thoughts about the new Pro model, which we'll get to soon. Uh, that one kind of cooked. Their expert SWE-bench, which is an internal bench they run, got a 73% instead of the 68.5 it got before. If you're wondering why they didn't test this on Opus, it's because Anthropic has banned all OpenAI usage of Anthropic models, which makes it hard for them to run things on it. If you're wondering why they didn't test 3.1 Pro on it, I don't know what the reason for that is, but I can speculate it's because 3.1 Pro is ass. Not my words. Okay, those weren't my words. Then we have GDPval, where the wins or ties were better here at 84.9 versus 83.0. I would go as far as to say that's a little dishonest though, because when you look at the actual numbers here, you'll see that it actually won less often than 5.4 did. It just tied enough to balance out the difference. GDPval is a lame benchmark. It's delayed more releases than it has actually brought insights to, so I don't care too much about that one. OSWorld Verified did pretty well here as well, beating out Opus 4.7, but still not by much, just 7%. Toolathon, which is a new one I haven't heard about before, seems to have done well there. Opus has not been benched on it. And Google models, as always, kind of suck at tool calls. That gap's smaller than I would have expected. Interesting. Might have to look into that bench. BrowseComp: it did pretty well, but it did lose to 5.5 Pro. And 5.5 Pro also crushed Opus at 79.3, but it is worth calling out that this is the Pro model's numbers. And when you compare that to the non-Pro, which only got 84.4, it actually ends up coming out worse than Google does with 3.1 Pro. While I have been impressed with the browsing capabilities of the models and how they've improved, I've been trying out the new computer use stuff in Codex and it is really cool.
It is clear that Google's vision and recognition, graphical, all that stuff, is still far ahead of everybody else. I would guess that the difference here, where the Pro model is better than the Google model in this BrowseComp test, comes down to being better at tool calls and coherency over long runs rather than being better at visual recognition tasks, because again, Google still cooks there. I still use Google models for recognition tasks all the time. It crushed FrontierMath. Somehow Google did very poorly there. The Pro model in particular seems to be very good at this stuff. And then CyberGym. No idea what that is, but cool. It did well. Enough benchmarks. I want to talk about the actual capabilities. That's a lie. I want to do one more bench. Flashbang warning before this next one, by the way. The Artificial Analysis Intelligence Index, while not the best benchmark, is a pretty good cohesive, covers-a-lot-of-different-things bench, and it is now the state-of-the-art there. What I find much more interesting though is when you turn on the 5.5 versions that aren't xhigh, like high and medium, even low, you'll see some very interesting numbers. 5.5 medium performs nearly identically to 5.4 xhigh while also being much more token efficient. While the price per token is higher, the token efficiency gain here makes this level of intelligence roughly the same price. But if you want more intelligence, you do have to spend more money. And the standard high version is only one point lower while also being significantly cheaper. If you hop down to the tokens-used-to-run-the-benchmark section, you'll see some very interesting stuff. GPT 5.5 xhigh, like the highest possible run version, used 75 million tokens, which is a bit over half of what 5.4 did, and it is quite a bit under half of what Claude Opus 4.6 did. Opus 4.7 is not quite double, but still quite a lot more tokens overall. In order to see what GPT 5.5 high did, or even medium, you have to go really far down the list here to see that GPT 5.5 high only used 45 million tokens and 5.5 medium used only 22 million tokens. That is actually very impressive. To do that run with so few tokens is, yeah, I'm excited for the cost numbers to come up. I could do the math here, but I'm lazy. So instead, I'm going to go to the intelligence versus output tokens chart, where if I remove a lot of these less relevant things, like Muse, which you can't even use yet, 3 Flash, Gemma 4 Lite... yeah, we can delete a lot of these. Now I've removed a bunch of stuff from this, the chart's a little bit clearer, and you can see that the 5.5 family has pretty tight dominance on the intelligence versus output tokens chart. The x-axis is how many tokens were used. The y-axis is the intelligence on the Artificial Analysis Intelligence Index. And these models perform very well for the token cost compared to other things. It is actually quite impressive to see 5.5 medium here so far down the token utilization scale while also performing that intelligently, and 5.5 low absolutely crushing with such a small amount of tokens used. Obviously, these tokens cost more per token, and I wish I could show you guys the cost comparison, but it does not seem like this chart has been updated yet. OpenAI visualizes this in a very Apple-y chart that's hard to read, where you can see that this level of intelligence, comparable to like Opus 4.7, was done in roughly half as many tokens. Cool to see, but still, meh, it's expensive, and they're trying to justify it by showing off the token utilization.
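To sanity-check that "roughly the same price" claim, here's a quick back-of-the-napkin sketch in TypeScript. It counts only output tokens, uses the $30/M output price quoted earlier (and $15/M for 5.4, derived from 5.5 being "2x the price"), and the ~140 million token total for 5.4 xhigh is an assumption inferred from 75 million being "a bit over half" of it:

```ts
// Back-of-the-napkin cost of the Artificial Analysis benchmark runs.
// Output prices come from the video ($30/M for 5.5; $15/M for 5.4, since
// 5.5 is "2x the price"). The ~140M total for 5.4 xhigh is an assumption,
// inferred from 75M being "a bit over half" of it.
type Model = "gpt-5.5-xhigh" | "gpt-5.5-high" | "gpt-5.5-medium" | "gpt-5.4-xhigh";

const pricePerMillionOut: Record<Model, number> = {
  "gpt-5.5-xhigh": 30,
  "gpt-5.5-high": 30,
  "gpt-5.5-medium": 30,
  "gpt-5.4-xhigh": 15,
};

const tokensUsedMillions: Record<Model, number> = {
  "gpt-5.5-xhigh": 75,   // read off the chart
  "gpt-5.5-high": 45,    // read off the chart
  "gpt-5.5-medium": 22,  // read off the chart
  "gpt-5.4-xhigh": 140,  // assumed: 75M is "a bit over half" of this
};

for (const model of Object.keys(pricePerMillionOut) as Model[]) {
  const cost = pricePerMillionOut[model] * tokensUsedMillions[model];
  console.log(`${model}: ~$${cost}`);
}
// gpt-5.5-xhigh ≈ $2,250 vs gpt-5.4-xhigh ≈ $2,100: double the per-token
// price times roughly half the tokens nets out to about the same bill.
```

Under these assumptions, the savings only materialize if the lower reasoning levels (high at ~$1,350, medium at ~$660 for these runs) are smart enough for your task.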
I am pumped that they have put so much effort into token utilization. It ends up making the model feel way faster because it takes half as many tokens to do a task so often. And this is also why they recommend so heavily that you use it on the lower reasoning levels. This is the first time I've seen them recommend not using high and extra high unless you really need it, and to just stick with low and medium instead. And now we get to the front-end capabilities. They showed some cool demos here, but you might notice something about these demos. They're still full of cards, like, everywhere. Pretty sure almost all of these demos have cards in them somewhere. But there is something also quite impressive about this: it understands 3D. When I saw how well these models supposedly did 3D game stuff, I had to explore it myself, even if I hated all these cards in the corners, cuz it certainly loves to do that. One of the first things I started working on when I was getting really excited about these new models, back in the Opus 4.5 days, was an idea for a game I had forever ago called Fish Slop. It's a game based on one of my old favorite internet games, Insaniquarium. And I was mostly testing the capabilities of the models to understand game engines and 2D space, and trying to just flesh out an idea. I ended up getting way further than expected and was hoping to finish it, but then everything exploded. It's been non-stop chaos for a while now, as I'm sure you all have noticed. So, I decided after seeing the capabilities of these new models that it might be a good time to bring it back. But I also know that that code base is, to put it frankly, slop. So I took the existing source code, pointed models at it, and told it, "Modernize this, clean it up, make a good, minimal starting point with a cleaner, more reliable engine for us to build on top of." These are the results that I got out of Opus. For some reason, it did not want to touch the assets that I had before, and it also changed the input to, like, use different buttons for the controls, but it worked. I will say the movement actually feels slightly better in this version than it did in the prior one. And it's impressive that it was able to generate these assets with, like, code. For some reason, it didn't want to use Phaser, though. It actually suggested that instead of using a browser game engine it just go raw canvas, and it did. It also seems like it really messed up the timings for, like, when the fish get hungry and which food they chase. Yeah, I thought this was better than it was. Now that I'm actually seeing it, I'm realizing, oh yeah, this is kind of ass. And now I have the GPT 5.4 version, which has some weird bug that causes the screen to resize constantly. I don't know what it is, but it is bad about this. It didn't screw up the feeding logic quite as bad, but it does cause the fish to die very quickly from the start if you don't feed them instantaneously. There's a lot of little things I don't love about this version, but, like, it's fine. And now we have the version that we got out of 5.5. Immediately, it plays significantly better and it looks significantly better. It did use the assets from the original for the fish and the, like, submarine here. So, that part looks better for that reason. Same with the food pellets. But everything else it generated, it did a great job on. Little things like the coral on the bottom, while not perfect, are cute and a nice touch. I like the scan line slowly going across. I like the wavy lines here.
I don't love it, but it's okay. It's got something I would call almost taste, but also something that's utter garbage: these cards on the top and bottom that these models cannot resist adding. So, I ended up just telling it, "This is slop. Get rid of that stuff, make it better." And to its credit, it did it. It made this a lot better after a couple follow-up prompts to make it look a little closer to what I had in mind. And, yeah, I could see myself using this as the starting point for future versions of this game. But I realized quickly I'm not pushing the model anywhere near hard enough by giving it, like, a sloppy 2D game to build. So, I pushed a little harder. I made it go 3D, and it did it. I did have to do a little bit of back and forth because it just didn't do what I had in mind with 3D initially. It just made, like, a 2.5D game where you had 3D assets in a 2D plane. And when I told it to change that, it did a very minimal, just-barely-honoring-the-intent version of what I asked. And this is where we get into the problem that I have with this model. It will do almost anything you tell it, but just barely. In some ways, this is great. It's not going to write all of the obnoxious defensive code that the model was writing before that made the GPT models a little annoying to use in real code bases. I can't tell you how many times Julius has complained about using even 5.4 to work on T3 code. When he tells it to remove a feature, it writes regression tests to make sure the feature was actually removed. That's insane. It just overwrites constantly. It edits way more than it needs to. This model still kind of does that, but nowhere near as badly; at the same time, though, it doesn't go quite far enough. When I told it that I wanted to make this a more 3D game, I wanted to make this a 3D experience. It just replaced the assets with 3D assets and made a new 3D renderer, which is super complex stuff. I'm not unimpressed with the work that it did. I am annoyed that it didn't honor the intent of what I asked, which was to make it a 3D game. It made the game use 3D assets and a 3D renderer, but it kept the 2D experience that I had before. And I kept trying to steer it to make the game more 3D, and it just failed to. So, what I ended up having to do, and I found myself doing this a lot with this model, is I made a new thread where I was much stricter about what I had in mind, and it did a better job. We'll get a bit more into the context stuff in a second. I want to show a few more front-end examples here, though. I asked it to redesign my sponsors page because, as much as I like it, it's very much an Opus design. I think it's cool, a lot of cool companies here if you haven't checked it out, but people were saying it was kind of slop. So, I decided to put some time in to try and make it look a little nicer. And by time, I mean a prompt. This is what was generated. It's fine, but it has a lot of those classic GPTisms. It has this "four platinum partners" thing on the top and these pills that are entirely unnecessary. And when I click them, it brings me there, and it uses the URL hash, like #gold, to get there, but it doesn't change the treatment on the page. So, this just always says "four platinum partners" and this just always highlights "platinum." There's no state on this page. It just didn't figure that out, and it put a bunch of sloppy UI in here instead. Not great. It also made it so this text doesn't fit well on one line. It just breaks weird. Don't love that. It didn't impress me here.
It made a slightly different structure for the cards; take it or leave it. You might like it, you might not, but I'm not super impressed. I'm sure if I had, like, a vision in mind and I told the model to steer towards that vision, I could get there eventually. But some of the other testers are going as far as doing their mocks in the new GPT image model and then bringing those to 5.5 to actually do the implementation, because they've gotten more creative outputs from the image model than they've been able to get from this. And honestly, that checks out for me. I want to go in on my less positive thoughts, but first I want to just share other people saying more positive things, because I might be the outlier here. I don't know yet. I've had mixed opinions from a lot of the people that I've talked to that have had early access or been playing with it today. For example, Ben really likes it, as he discusses in depth in our podcast, but Julius less so. So, here's what other companies have said. Michael Truell from Cursor said that 5.5 is noticeably smarter and more persistent than 5.4, with stronger coding performance and more reliable tool use. It stays on task for significantly longer without stopping early, which matters most for the complex long-running work that their users are delegating to Cursor. Lovable said that GPT 5.5 breaks through the walls people usually hit on more complex tasks, like auth flows and real-time syncing, in far fewer turns. The model really shines when the work gets hard, handling tough tasks with far less back and forth. Cognition said it set a new bar for what's possible with Devin. It surfaces bugs that no other model can catch and also investigates and fixes production issues end to end. You get the idea. A lot of these companies are saying the model is very impressive, and it is. It is capable of doing things I've not seen any other model do. It writes code better than any other model has written. So, what is my problem with the model? Is it the increased pricing of $5 in and $30 out, or the egregious pricing of the Pro version at $30 per million in and $180 per million out? No. It's annoying, but it's not what I'm complaining about. It's also not the front-end capabilities, although I was hoping for more improvement there, to be real. My issue is that it feels lazy. What I mean by that is hard to put into words. It just doesn't feel like it honors the intent of the things that I'm asking very often. It feels like the hacker on your team that's just trying to, like, get the task closed in Jira and not really go all the way in. Whether I'm using the medium or the high version, I found this to be the case. It is a little too quick to stop and say, "Look, I did the thing." It needs more encouragement to keep going. But there's a bigger problem here. If you give it a vague task and it goes and does its research to try and figure out what it needs to know and use in order to solve the problem, it might just not find the right thing, which is fine. You just tell it what it should have found instead, for most models. But this model has a problem. Once the incorrect information ends up in its context window, it will keep falling back to that. A really practical example of this that Ben ran into a bunch was when he was working with the model and it made some changes he liked, and he asked it to commit those changes. He kept working in the same thread, and it kept committing after every single change from that point on.
Even if he told it to stop committing, it would still do it, because you can't cancel out its context, is how I would describe it. Once something's in the context, you can't prompt it out. You need to just start a new thread (there's a rough sketch of that workflow below). And I've had to kill more threads with this model than I ever have in my life. Every time it starts going down the wrong path, I sigh, shrug, and kill it and make a new thread. When I have to hit compaction, I sigh, shrug, kill it, and make a new thread. I have not had a great experience with long-running threads in this model at all. On one hand, it's impressive how much it can get done within its 400k token window. On the other hand, I am very disappointed that I can't do these really long-running threads anymore. I have to break out so often. Thankfully, it's easy in most tools because you can just press, like, Command+Shift+O to start a new thread. But when you're using the terminal tools, like the Codex CLI or Claude Code or anything like that, I find it much less pleasant. A few quick things before we get to the Pro version, because I'm very excited to share more there. I think the Pro model is where a lot of this stuff really shines. First, we have the Pelican bench, which, it is worth noting, didn't use the official API, because there is no official API. But thankfully, the back door of using the Codex endpoints still seems to be relatively blessed. It's been officially confirmed by Peter Steinberger, the creator of OpenClaw, that it is okay to use the Codex endpoints and the Codex auth in things like OpenClaw as well as in other tools. They've officially OK'd things like JetBrains, Xcode, OpenCode, Pi, and now Claude Code too, which is pretty funny. But yeah, you get the idea. The goal here is to let people use their Codex access in other things. Even if the API isn't officially out, you can use this workaround to test it via API. And you see here some pretty crazy results. Remember, this isn't just a picture being generated. This is an SVG. This is actual code being written to make this pelican. And it did a great job on xhigh. The medium version, less so, but, uh, xhigh absolutely slaughtered the Pelican bench. But now I need to talk about the Pro model. As you can see on the left here in my ChatGPT app, I have been doing a lot of puzzle solving using the Pro model. Sadly, the only way I was able to test this model was through the ChatGPT site, because they didn't expose Pro over the API at all during testing, and it wasn't in Codex either. So, I had to use this for it. Uh, yeah, this model is really good. I don't know what the safety guardrails are going to look like when it comes out, but when I was using it, as well as, honestly, using 5.4 Pro, I was very impressed with the things it was able to pwn. If you know anything about security people, like the researchers, the DEF CON attendees, those types of people, you know that very few of them were down for me to share their work and the things that I pwned through this. But I will say, and hope you guys trust me, that I pwned three unsolved puzzles from DEF CON that have been out there with people trying to solve them for between 5 and 10 years, using the new model. I will say that one of those was able to be pwned by 5.4 Pro in, like, 20% more time. So this isn't necessarily just unique things that 5.5 Pro can do, but god damn was it able to solve almost anything I threw at it.
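Before going further, here's the promised sketch of that kill-the-thread-and-carry-your-notes workflow. This is a minimal illustration assuming a generic chat-completion function; the `ChatFn` type and `runInFreshThread` helper are hypothetical stand-ins, not any real SDK:

```ts
// Hypothetical sketch of the "don't fight the context, replace it" workflow.
// `ChatFn` is a made-up stand-in for whatever chat SDK you actually use.
type Msg = { role: "system" | "user" | "assistant"; content: string };
type ChatFn = (messages: Msg[]) => Promise<string>;

// Instead of appending "stop doing X" to a poisoned thread (which the model
// keeps falling back to), start a brand-new thread seeded with only the
// distilled, verified notes from the old one.
async function runInFreshThread(
  chat: ChatFn,
  verifiedNotes: string,
  task: string,
): Promise<string> {
  return chat([
    { role: "system", content: `Verified notes carried over from a prior research thread:\n${verifiedNotes}` },
    { role: "user", content: task },
  ]);
}
```

The point is that only the distilled notes travel forward; the poisoned message history does not.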
I was impressed to the point where I was throwing huge, like, public cipher challenges at it, things like the Noa iMessage challenge, just seeing if it could make some progress. And while it couldn't solve it, I think that there's actually just some information that isn't accessible yet for this puzzle; whether the creators of the game haven't put it out yet or it's in the game and hasn't been found yet, it was able to make absurd amounts of progress on it, to the point where I sent its results to my friends who are much deeper into this puzzle series, and they were like, "Wow, I have no idea how it figured this out. This is actually helpful for us." I will say that using the ChatGPT site for all of this testing has given me an even deeper hatred of it. I cannot tell you how many times I was stuck on pages like this, where it's just entirely frozen and won't load, and I have to refresh a couple times and maybe it'll come through. Emphasis on the maybe. Since these runs go so long, they end up responding with a ton of data from the API, and as a result, they get quite slow to use. Here's a run that I did that took 163 minutes, where I gave it this cipher puzzle that I wrote. Spoiler warning: if you want to try this, I'm going to spoil the whole puzzle and how it works. It's on my Twitter if you want to do it yourself; skip this section. It's a two-part puzzle where the bottom line is a hint for the top line. The bottom line is a pretty basic ROT47 (a minimal decoder is sketched after this section) that decodes to "a dog on the moon once said" plus a hash. This SHA-1 hash is not meant to be decoded. It is meant to be a thing that kind of tricks you into thinking you're supposed to decode it. It's actually a commit hash for a git commit on my GitHub repo for my old game, Dogecoin Simulator. I ended up modifying the git history to insert an old commit that had the phrase you have to use to decode this section here. And this section is a little tricky thing that I did. Since this is a challenge by T3, I inserted a hint with this. The hint being T3: you know, T is 20 in the alphabet, plus 3 is 23. This is a base-23 character cipher. It starts at A instead of zero. Arguable if it should or not. I don't care. I thought it was clever. You decode this to JSON that has the AES encryption payload that you then have to decrypt using the phrase that you find in my git repo. This took the model 163 minutes to solve, but it did it. But I also tried running it locally, giving it, I guess, an accidental hint, because I gave it a link to the gist that I had posted instead of just giving it the plain text. And once it had the gist and it saw the GitHub link, it knew to check GitHub a bit more, and did, and was able to pwn this locally in under five minutes. Since then, I have given it a significantly harder challenge that it is nowhere near close to solving, even with, like, direct hints. But neither has my community. I actually currently have a bounty up for it. This is the new challenge that nobody has solved yet or even come close on. And I have a $1,000 bounty up for it. It might be solved by the time this video is live, but I do doubt it. So, I've been beyond impressed with Pro and its ability to just grind on super hard things. I've been surprised at how good the code is despite the small amount of tokens being used, especially on low or medium. But I've also been very annoyed that I have to put more time into crafting my prompts, steering the model in the right direction, and just making sure it's doing what I want it to do.
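As referenced above, ROT47 is a standard rotation over the 94 printable ASCII characters, so a minimal TypeScript decoder is only a few lines. This covers just the ROT47 layer; the base-23 cipher and AES payload are specific to Theo's puzzle and aren't reproduced here, and the sample string is illustrative, not the actual puzzle text:

```ts
// ROT47 rotates the 94 printable ASCII characters ('!' = 33 through '~' = 126)
// by 47 positions. Since 47 is half of 94, the function is its own inverse:
// applying it twice round-trips back to the original text.
function rot47(text: string): string {
  return text.replace(/[!-~]/g, (ch) =>
    String.fromCharCode(33 + ((ch.charCodeAt(0) - 33 + 47) % 94)),
  );
}

console.log(rot47("w6==@[ H@C=5P"));   // -> "Hello, world!"
console.log(rot47(rot47("any text"))); // -> "any text" (self-inverse)
```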
While it can do more than ever, I feel like I have to be more involved than ever at the same time. And while it can complete crazy things in one shot, those crazy things require a lot of upfront context to be given. Whether it is a direct set of instructions, clear ways to verify its results as being correct, a super clear and direct description of exactly what you want the output to be, or just links to the right resources so it doesn't go down the wrong paths, you've got to do that work yourself. The model still will hallucinate. It will still get things wrong. It will still do all the annoying stuff that other models have done, but getting it to stop once that bad data is in the thread feels harder than ever. So, I do highly recommend, when you use this model, that you instruct it more directly up front. That you spend a bit more time writing a prompt with more detail than you normally would. That you do a bit more researching, maybe even in a separate thread, and bring the right resources and information to a new thread. And of course, most importantly, that you make a lot more new threads than usual. This model seems worse at compaction, worse at coherency over time, because it just gets things stuck in its context and can't really get rid of them, even if you tell it to stop doing something that way. But it also is the smartest model ever made and writes the best code I've ever seen from an AI. So take it as you will. It's not my favorite model release. I've let OpenAI know in detail the things I don't like about this. But it is also likely the first model on this new pre-training data. So there's a good chance in the near future there will be other models that are even better, smarter, and more powerful than this one, built on top of the foundation laid here. Is that the plan? I don't get that inside info. They never tell us any of those things. I do think this is the right new pre-training, and I am very excited to see how they get the model to handle these edges a bit better in the future. But for now, it feels really different. I don't think this should have been called 5.5. This should have been given a different name entirely. Maybe call it GPT 6. But this is different enough that you really need to go in rethinking the way you normally do things. If you use your old prompts, your old harnesses, your old skills, your old stuff, you're not going to have a great time with this model. I think that's all I have to say. Until next time, peace nerds.
