Your prompts are tech debt.

Theo - t3.gg | 00:20:29 | Jun 3, 2026

Chapters

Introduces the ongoing problem of technical debt in software and how prompts contribute to it.

Prompts are a hidden, silent form of technical debt, and smart prompt hygiene is essential to keep AI tooling reliable and maintainable.

Summary

Theo from t3.gg dives deep into a concept he’s been mentioning across videos: prompt technical debt. He argues that prompts, like code, decay as models evolve, and that upgrading models can suddenly break well-tuned prompts. Sean Goedecke’s article is cited as the foundation for understanding prompts as debt, with practical examples like AGENTS.md files and system prompts that balloon and bloat context. Theo shares personal experiences, from preferring a minimal, uncustomized Pi setup to criticizing over-engineered prompt ecosystems (MCP servers, plugins, and every tool ever installed). He emphasizes that keeping prompts lean, auditing Markdown prompt files, and sticking to third-party AI coding tools by default can reduce risk and maintenance. The episode also highlights the difference between tangible code debt and the subtler, silent decay of prompts, which can undermine performance after a model upgrade. Finally, Theo urges listeners to write prompts themselves, delete generated prompts, and treat prompt management as a first-class engineering concern. The message is clear: manage prompt debt actively or watch your AI systems drift and degrade over time.

Key Takeaways

Prompts can decay silently with each model upgrade, turning a working setup into a non-functional one without obvious failures.
System prompts and tool descriptions greatly influence model behavior; tweaking them per model release can yield 10–30% quality improvements in some benchmarks.
Minimal, Unix-like setups (e.g., Pi project) often outperform heavily customized stacks by avoiding context bloat and unnecessary prompts.
Overbuilt AGENTS.md files, MCP servers, and excessive plugins inflate context and reduce maintainability; default to lean, well-audited configurations.
Regular audits of codebase prompts and agent prompts (READMEs, AGENTS.md files, system prompts) prevent stale assumptions from hindering future model updates.
Rely on third-party AI coding tools (Codex, Cursor, Claude Code, Pi) to piggyback on their engineers’ prompt optimizations rather than building bespoke, brittle harnesses.
Prompts are as critical as code quality; write them yourself, delete generated prompts, and keep prompts up-to-date with model evolution.

Who Is This For?

This is essential watching for AI/ML engineers, DevOps teams integrating coding assistants, and startups relying on prompt-driven tooling. If you manage AI-powered features, you’ll benefit from knowing how to keep prompts lean and future-proof.

Notable Quotes

"Prompts are important. Minor tweaks to an LLM's prompt can unlock significant performance improvements."

—Highlights how small prompt changes can have big effects on model output.

"Prompts decay silently, even when code stays relatively stable."

—Emphasizes the subtle, ongoing risk of prompt degradation.

"Write your prompts yourself and delete them whenever you get a chance."

—Advocates for prompt hygiene and avoiding over-reliance on generated prompts.

"If you install every MCP you see, you’ll end up with half the context taken up and nobody knows why the model is slow or dumb."

— warns about prompt and context bloat from unnecessary tools.

"A set of prompts that you carefully crafted in January may be out of date by February."

—Illustrates rapid model evolution and the need for continual prompt refresh.

Questions This Video Answers

How do I audit and prune my AI agent prompts in a codebase?
What is prompt technical debt and how can I mitigate it?
Which AI coding tools are best to rely on and why should I avoid overcustomization?
How often should prompts be updated when a new model version is released?
What is AGENTS.md and how does it influence prompt engineering in T3 projects?

Prompt engineering, Technical debt, AGENTS.md, System prompts, MCP servers, T3 Chat, Cursor, Codex, Pi project, Model upgrades

Full Transcript

Technical debt's always been a massive problem for our industry. It's one that I've encountered more times than not across all of the different roles I've had, even building my own stuff. It's so easy for technical crust to build up. Thankfully, AI is here to save us, right? Well, sure. For many places, the technical debt is building up more, as more people who don't know how to dev right are just throwing in prompts and begging to have their stuff merged. But on the other hand, I've actually had a lot of luck using AI to trim technical debt from projects where it wouldn't have been worth it before. I've talked about this a bunch randomly throughout other videos, but I want to go a bit deeper here. Not just on technical debt, but a new type: prompt technical debt. Crazy enough, in many ways, prompts can be technical debt, too. Sean Goedecke's already written multiple awesome articles that we've covered in these videos, and I have a feeling I'm going to love this one even more. Technical debt's one of those things that hopefully, fingers crossed, we'll be kept around to fix. It might not be the most fun work. We might be going from builders to janitors, but it is absolutely a thing we have to be considerate of. And I'm excited to see how Sean is thinking about it, in particular around the prompts themselves and the debt they represent. I already am seeing things in here I'm going to like, like the AGENTS.md as debt. I know for a fact that the AGENTS.md in the T3 Code repo is at best out of date, at worst causing problems. Before we can get there, I am hoping to retire before our jobs become just cleaning up slop, and at the very least to pay my team. So, we're going to take a quick break for today's sponsor. I need to be real with y'all. I'm getting scared about the future of safety and security in software. I've talked about this a bunch in other videos, but I've never really talked about how to get the security right. That's because it's different for everyone.
But if I know one thing is true, you need to have the sources in your code. If it's not in your codebase, you can't trust it. And that's why I'm really hyped for today's sponsor, Arcjet. These guys have the primitives to make more secure software inside of your code directly. The easiest way to set up is to copy their prompt and paste it in your codebase. Then it will figure it out from there. But if you look at the code, you'll be just as impressed because it's really good. It was so good, I ended up investing in the company because I was genuinely that hyped on what they're doing. They provide a set of packages that handle all of the identification systems that you need to know who a user is and if they're real or not. They have a rules section for setting things up like bot detection, and you can even allow certain types of bots through if you want. They have token buckets, which are super useful for tracking how much a user's done a thing. You want to allow users to send five messages, or maybe they can go to 20 pages before you start rate limiting them. Don't put that in your database. Put that on a platform built for it. You're not doing this by whitelisting at the front before they hit your code. You're doing this in the actual route file in the Next.js project here. I love this example because you still just do the things you need to for all users. These things are free, though. You're just concatenating some strings. But once you actually want to decide if we should do the call or not, you call aj.protect with the request and the other data you care about. And then you get back a decision on whether or not you should actually process this. And the decision has everything, including the reason it happened, which makes it easy to know: oh, is this a bot, or are they out of messages, or are they getting limited because of sensitive info or some other thing? You return if they were denied, otherwise you let them go.
In a world where everything's getting pwned, you should be confident in your code and your security. Get those things right at soydev.link/arcjet. Let's dive into "Prompts are technical debt too." It's common and correct to say that all code is technical debt. Yep, all code is a problem that you have to maintain. I agree. Adding code is a necessary evil for developing new features. You almost always have to do it, but each line of code adds to the complexity and maintenance burden of the system. All future changes to the system have to work with the existing code or at least avoid breaking it. Once systems accumulate enough code, they become impossible for a single person to understand. Instead of reading the code and understanding what it does, you must rely on guesses, theories, and heuristics. Sensible engineers write as little code as possible. Yep, I have said this many times. I was really proud at Twitch that I think I ended with more code deleted than added, and I was really trying for that. I love deleting code. It's my favorite thing. It's also funny that we're talking here about how no one person can keep all of the code in their head. I still remember the era where we thought AI would be able to do this. I'm going to make a callout that feels a little bad, but hear me out. I'm sure you guys know of Michael Truell, the CEO of Cursor. He's an incredible dude, very smart, runs this company great. He's a huge part of why Cursor's been so successful. He didn't have much industry experience before starting Cursor. I'd argue he had effectively none, alongside his co-founders as well. They didn't have much experience building things at scale. I still vividly remember a conversation that we had, as well as some interviews he has done about this point, because he seemed to think not only were there engineers who understood the whole codebase, but there was also the possibility of AI doing it too.
He firmly believed context windows would get bigger and bigger and the role of Cursor would be to have the best system to load the whole codebase into context. Then Claude Code came up and was like, "Yo, what if I just grep for [ __ ] and did a way better job?" We even thought AI was going to do this better and then realized it isn't. And now AI works kind of the same way we do. It has short-term memory loss and it needs help finding things in your codebase. It's gotten good at finding those things through an insane amount of compute being used to them. But neither humans nor AI, as far as we understand, can keep track of a million-line codebase confidently. They need to understand how the parts go together, but not the details of every line of code across that type of massive codebase. And I find this is one of the hardest jumps for earlier-career devs to make. And even for startup founders that are quite successful, they're not used to the idea that the codebase is bigger than their own understanding. And even the most successful people have been guilty of this some amount. Back to the article specifically: sensible engineers write as little code as possible. This is a thing I worked really hard to do and I was very proud of. Many large projects now have a set of codebase-specific prompt files: AGENTS.md, CLAUDE.md, the same files but in subdirectories, as well as skills. If you're building a program that uses AI, you have separate prompts for capabilities and for each tool, as well as a whole set of system prompts.
Even better, we have dynamically constructed system prompts in tools like T3 Chat, where the system prompt is adjusted based on certain parameters you select: what tools you turn on and off, things like search, as well as which model and model provider you're using, because certain models (cough, Gemini, cough, cough) need a lot of help not being [ __ ]. So, we have some very complex, annoyingly so, code that is effectively just a string concatenation system to generate the right system prompt to steer the model roughly where we want it to go. Obnoxious, but like borderline necessary, and now that I'm thinking about it, there's a lot of technical debt there. Prompts are important. Minor tweaks to an LLM's prompt can unlock significant performance improvements. If the same model feels different across Codex, Cursor, OpenCode, and Copilot, it's almost certainly due to subtle differences in prompting. I know this is one of the things that Cursor's put a lot of time into. The Cursor harness meaningfully improves the performance of something like Opus when compared to using it inside of Claude Code. Some benches measure it as high as like a 10 to 30% quality or performance improvement when you use Opus in Cursor instead. A lot of that is just the system prompt. And I know that Cursor does crazy things like A/B testing system prompt changes, and crazy benchmarks they build internally where they slightly adjust the prompt for different models to see how it behaves. I know, for example, when Gemini 3 Pro dropped, using that inside of official Google stuff sucked and Cursor was weirdly good with it. And I learned that one engineer had stayed up really late the night before launch testing everything he could to try and force Gemini 3 Pro to behave. And I still remember the day I tried it, seeing it in the thinking traces, where the model had like five steps of thinking where it was just talking itself out of using tools unnecessarily.
That was because they had to add a specific blurb to un-Gemini the Gemini models. On one hand, system prompts aren't this thing we have to carefully protect like we pretended they were not long ago. I can't tell you how many reports we've gotten from users of T3 Chat that are like, "Look, I stole your system prompt by asking this model this question." Cool. I don't [ __ ] care. Maybe give me some feedback so we can do a better job at it. There is still a lot of work in the engineering of these system prompts, though, because a difference between two can make a meaningful difference in the performance that you get. The author calls out that AI companies do spend a lot of time testing and tweaking their prompts. So, it makes sense why engineers spend a lot of time tweaking their AGENTS.md files as well. I'd even call switching tools or workflows a form of prompting. Yeah, which tool you're using definitely affects these things. If I start wrapping my agents in a Ralph loop, put in a new skill file, or install an MCP server, that's still a change to my prompts, even though I'm not the one who wrote it. Yep. And this is something I find people don't understand when they're using tools like Codex. In Codex, they have a plugin system where you can go and install plugins. When you've installed a plugin, that plugin is now one of the things in the system prompt that the model knows about. I could even ask it to prove it: what plugins and skills do you have access to? Here are all of the plugins that I have installed, and it knows about them because they're all in the system prompt. It also has this big pile of available skills. The frontend design skill still finds its way in even though I have deleted it many different times. I don't know what is causing it to keep reappearing, but it is actually a good audit. A lot of these things I don't want. So, I'm going to go clean these all up later, actually. Yeah, this is a real problem to consider.
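As a rough illustration of the string-concatenation approach described above for T3 Chat, and of how installed plugins end up inside the system prompt, here is a minimal Python sketch. It is not T3 Chat's or Codex's actual code: `PromptConfig`, `build_system_prompt`, the base prompt, and the wording of the per-model blurb are all hypothetical names and text invented for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PromptConfig:
    """Hypothetical inputs that decide which prompt fragments are included."""
    model: str
    search_enabled: bool = False
    plugins: dict = field(default_factory=dict)  # plugin name -> description


BASE_PROMPT = "You are a helpful assistant."

# Hypothetical per-model corrections; the wording is invented for illustration.
MODEL_BLURBS = {
    "gemini": "Only call a tool when the answer requires it.",
}


def build_system_prompt(config: PromptConfig) -> str:
    """Concatenate prompt fragments based on the selected model and tools."""
    parts = [BASE_PROMPT]
    # Model-specific steering: added only for matching model families.
    for prefix, blurb in MODEL_BLURBS.items():
        if config.model.lower().startswith(prefix):
            parts.append(blurb)
    if config.search_enabled:
        parts.append("You can search the web with the `search` tool.")
    # Every installed plugin costs context, whether or not it is ever used.
    if config.plugins:
        lines = [f"- {name}: {desc}" for name, desc in sorted(config.plugins.items())]
        parts.append("Installed plugins:\n" + "\n".join(lines))
    return "\n\n".join(parts)


minimal = build_system_prompt(PromptConfig(model="some-model"))
loaded = build_system_prompt(
    PromptConfig(
        model="gemini-3-pro",
        search_enabled=True,
        plugins={"frontend-design": "Guidance for UI work."},
    )
)
print(len(minimal), len(loaded))
```

Even in this toy version, every branch is a prompt fragment that was presumably tuned against some model at some point, which is exactly the kind of thing that can quietly go stale.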
And many models will just use tools 'cause they're there, even if you don't want them to. MCP servers in particular are really guilty of this. I've seen people who just went and blindly clicked install on every MCP that sounded cool, and now whenever they start a new agent, half the context is already taken up and they're really confused and upset and think AI is not very smart. "Look, I installed everything and the AI is still dumb." Good luck. Just doesn't work that way. Sean calls out that he thinks it's a bad idea to spend a ton of time tweaking a bespoke agent coding setup. Interesting. Shots fired at our friend Ben and his crazy Pi setup that he keeps iterating on constantly. I fall in this camp. I like using things as close to stock as possible, and I'm just thinking about where they're even running. So why does he feel this way, given that prompt adjustments can deliver so much value? That's because prompt adjustments are model-specific. Earlier he said that AI companies spend a lot of time tweaking their prompts. In fact, they spend that amount of time for each new model release. I experienced this personally. There were bad things happening when I used Codex with 5.5 during the early testing that were largely because they hadn't adjusted the tool descriptions and system prompts properly enough for the new model, which resulted in it doing things like searching way more than it should have been and polluting its context with nonsense once it found bad results. They adjusted the descriptions to fix this. And as crazy as it sounds that a prompt that worked great for 5.4 won't necessarily work as well for 5.5, it absolutely is true. And I've experienced this a ton. That's kind of why I thought 5.4 to 5.5 should have been a different name or number to indicate how big a jump it was. Not because it's so much better necessarily, but because it's so different. Like 5.3 to 5.4, I didn't have to change how I prompt much. 5.4 to 5.5, I've had to rethink how I use the model entirely.
I like the phrasing that you have to learn how to hold the model every time. I absolutely agree. Even between 5.5 low and xhigh, it feels entirely different. In other words, a set of prompts that you carefully crafted in January this year might be out of date or actively harmful by February. Worse still, you might not even notice. Model capabilities are already so hard to pin down unless you're running every problem through various different models and tools. And even weak AI systems are surprisingly good at some problems. You might just think, "Huh, the new Anthropic model isn't as impressive as the hype." Or, "Wow, Claude Code got worse recently." Yeah, absolutely. I've experienced this, too, and the labs have experienced it as well. Like I just said, this new model in all of their measurements was killing it, and in their custom tools internally was killing it. When I tried it in something a little older and more customized to my uses, it bombed. Funny enough, this is actually one of the reasons I really like Pi: because I didn't customize it. I use Pi in as boring a way as possible because I want to see how the models work with nothing provided. Given the most minimal possible setup, what does it do well and poorly? And then I add things to solve specific problems I have when I have them. And I think this minimalist way of building, the Unix philosophy, so to speak, is one of the best ways to avoid these types of problems. When you get a new model, start it on the smallest possible tool with nothing included and then add things as you need them. I've learned that whenever I mention Pi, people don't know what I'm talking about half the time, so I will point you at it. It's the Pi project. It's now hosted on Arendelle Works on GitHub. It's by Badlogic, aka Mario. It is a set of tools, but primarily a coding agent.
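One way to notice the silent regression described above is to run the same small set of tasks against each prompt variant whenever the model changes, including a bare prompt as a baseline. The sketch below is a hedged illustration, not anyone's real eval harness: `pass_rate`, `compare_prompts`, and the two stand-in "models" are hypothetical, and the stand-ins are plain functions rather than calls to any real API.

```python
from typing import Callable

# A "model" here is any callable mapping (system_prompt, task) -> answer.
Model = Callable[[str, str], str]


def pass_rate(model: Model, system_prompt: str, cases: list) -> float:
    """Fraction of (task, expected) cases whose answer contains `expected`."""
    if not cases:
        raise ValueError("need at least one case")
    passed = sum(1 for task, expected in cases if expected in model(system_prompt, task))
    return passed / len(cases)


def compare_prompts(model: Model, prompts: dict, cases: list) -> dict:
    """Score every named prompt variant against the same cases."""
    return {name: pass_rate(model, prompt, cases) for name, prompt in prompts.items()}


def old_model(system_prompt: str, task: str) -> str:
    # Stand-in: only answers correctly when nudged to search.
    return "42" if "search" in system_prompt else "unsure"


def new_model(system_prompt: str, task: str) -> str:
    # Stand-in: fine by default, but over-searches when nudged.
    return "unsure (context full of search results)" if "search" in system_prompt else "42"


cases = [("What is 6 * 7?", "42")]
prompts = {"tuned": "Always search before answering.", "bare": ""}

print(compare_prompts(old_model, prompts, cases))  # {'tuned': 1.0, 'bare': 0.0}
print(compare_prompts(new_model, prompts, cases))  # {'tuned': 0.0, 'bare': 1.0}
```

The point of the bare baseline is that a tuned prompt scoring below it on a new model is a signal to delete or rewrite the tuning, not to blame the model.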
The magic of the Pi CLI is that it is super customizable, and you can tell it to add features and make changes and it will, but it also starts in a very minimal place. Unlike GitHub, which has been loading for the entire time I'm talking about this. Great work, Microsoft. Pi starts up with under a thousand tokens of context. That's crazy. Some harnesses like Claude Code start with like 10,000. With all the tools and features built into Claude Code, the system prompt is now 65,000 tokens. With everything disabled, it's only 12K. But that's insane. That's before you even start doing things. Pi is really cool in that it lets you build and customize it in these ways. As I said before, I like it for not customizing it, because it stays insanely minimal. The sinister nature of this debt is what makes it so scary, though, because tech debt's pretty apparent whenever you go to add a new feature or make a change to your codebase. You feel the tech debt every second. But the subtlety of these regressions is much more painful. In this sense, prompts are a worse form of technical debt than code. When technical debt blows up, it usually causes errors or a tangible slowdown as you try to understand the code. Prompts will decay silently. Yeah, very big deal. Also, even janky code tends to be relatively stable when untouched, but every single model upgrade could turn a functional prompt into a non-functional one. Just for example, let's look at the T3 Code repo, which has an AGENTS.md that has not been updated for 2 months. It still says, "This repo is a very early WIP. Proposing sweeping changes that improve long-term maintainability is encouraged." This line probably causes models to do things they shouldn't. We also call out that T3 Code is currently Codex-first. That hasn't been the case for a while. This is an outdated file that needs to be updated.
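To put the startup numbers quoted above in perspective, the short Python sketch below expresses each one as a share of a context window. The token counts are the ones stated in the transcript; the 200,000-token window is an assumption made purely for the arithmetic, and `overhead_share` is a hypothetical helper.

```python
def overhead_share(startup_tokens: int, context_window: int) -> float:
    """Fraction of the context window used before any work starts."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    if startup_tokens < 0:
        raise ValueError("startup_tokens must not be negative")
    return startup_tokens / context_window


# Figures as quoted in the transcript.
STARTUP_TOKENS = {
    "Pi (upper bound)": 1_000,
    "Claude Code, everything disabled": 12_000,
    "Claude Code, all tools and features": 65_000,
}

# Assumed window size for illustration only; real limits vary by model.
ASSUMED_WINDOW = 200_000

for name, tokens in STARTUP_TOKENS.items():
    print(f"{name}: {overhead_share(tokens, ASSUMED_WINDOW):.1%}")
# Pi (upper bound): 0.5%
# Claude Code, everything disabled: 6.0%
# Claude Code, all tools and features: 32.5%
```

Under that assumed window, the fully loaded setup spends roughly a third of its context before the first user message, and that is before any MCP servers or plugins are counted.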
The issue is that we haven't had much reason to touch it because we're busy shipping features, and changing this correctly would ideally involve us testing the changes to see if the model behaves better or not. I did get a question from chat, which is: isn't the sweeping changes line an intentional lie? It kind of was initially. I leave things like this in often, specifically to try and get the models to be willing to push back more and suggest big overhauls that you might not expect a model to suggest. This framing of it is not necessary anymore and is quite possibly damaging. The other project I hinted at earlier that I'm working on is LakeBed. It's a new way of doing full-stack app development. And the AGENTS.md for this project is less a traditional AGENTS.md where I tell it where things are and more an essay, like a letter to the model about how I want it to behave and how to think about the project. So it doesn't need as much context on what it is every time. It just understands. But this is different from the README, because the README is for users, not for agents. The way you have to think about these things is crazy. It's a different type of engineering. It is still engineering in the sense that you're trying to find the right way to get these pieces together and to get this technology to behave, but it's different in the sense that it's nowhere near as deterministic and the failures could be very quiet and hard to notice. I also love this callout from Joel here: SOUL.md is the best thing that came out of OpenClaw. Absolutely agree. The idea of describing the why and how, and not just the what and where, to the model is a very good idea, and I totally agree. I can't wait till we get to patch MD, though. That's a plan I've been cooking for a while that I've not had time to do. And while prompts decay silently, even janky code will tend to be relatively stable if it's left untouched.
But every model upgrade could turn a functional prompt into something entirely non-functional. Could you simply decide to not upgrade models? Some people are trying this, but the pace of improvement is fast enough that it isn't really practical. A delicately prompted agentic harness built around GPT-4.1 is always going to underperform a barebones harness built around 4.7. This might seem like a sensible strategy at some point in the future when the rate of model improvement slows down. At this point, I don't believe that'll ever happen. I've been proven wrong enough times. There's also the point that it could just be that models are so capable that we don't need the extra intelligence for normal edge tasks. But I don't believe it's a good strategy today. I agree with the author here. In Sean's view, most people should just be picking an AI coding tool maintained by a third-party company, like Claude Code, Codex, Cursor, Copilot, T3 Code, etc., and leave it as unconfigured as possible so they can piggyback on the work of the teams of engineers who are evaluating and tweaking prompts with each new model. This is the same reason I built T3 Code the way I did. Rather than make our own harness, our own system prompts, our own everything, we wanted to take advantage of the fact that companies like Anthropic and OpenAI and Cursor are putting a lot of time into massaging their tools to get them right. And we just wanted a better UI on top. This also means you should do things like avoid MCPs and skills unless absolutely necessary and keep them off by default. At least this way, if one of the teams gets something badly wrong, users will notice eventually and start complaining. Yep. How many times does Claude Code regress because they put something stupid in the system prompt? It's hilarious how often that happens. I've demonstrated that in many videos.
When you write AGENTS.md files, try to avoid behavior steering, like the now outdated, "Think step by step, you're a skilled engineer," or, "If you get a task right, I'll tip you $200." Don't forget, "make no mistakes." Keep them limited to specific, concrete facts about the project. Don't let models fill your AGENTS.md with pages of barely reviewed text, for the same reason that you wouldn't let them fill your codebase with pages of barely reviewed code. Yeah. And I would also agree, even more so: bloating your AGENTS.md is bad because it makes all future code bad and it makes it harder to clean things up. He ends with a great sentence here: write your prompts yourself and delete them whenever you get a chance. Absolutely agree. If you're using AI to generate all of your prompts, if you're using AI to generate markdown files that you use as prompts that are sitting in your codebase hoping that they provide value, if you use, like, the Claude Code /init command, which I talked a lot of [ __ ] on in my AGENTS.md and CLAUDE.md video, you are making your models dumber. And if you leave that in when a model upgrade happens, you're leaving something stupid from a previous generation holding back the new one. I see a comment from chat here, which is, "I just want the stupid thing to read my mind. The less I have to explain myself, the better. This is true vibe coding." Well, I have good news. That is basically what was proposed here. Most people should just be picking an AI coding tool maintained by a third-party company and leave it as unconfigured as possible so they can piggyback off the work of the engineers who are evaluating and tweaking prompts with each new model. Do that. That's what I do. I spend as little time as possible customizing these things.
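The advice above about avoiding behavior steering can be turned into a quick check. The following Python sketch scans prompt text for the phrases quoted in the transcript; the pattern list and the `find_steering` helper are hypothetical and deliberately small, and a real AGENTS.md review would still need a human to read the file.

```python
import re

# Patterns based on the examples quoted in the transcript; extend as needed.
STEERING_PATTERNS = [
    r"think step by step",
    r"you(?:'re| are) an? (?:skilled|expert|senior) \w+",
    r"i(?:'ll| will) tip you",
    r"make no mistakes",
]


def find_steering(text: str) -> list:
    """Return (line_number, matched_text) for each behavior-steering phrase."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern in STEERING_PATTERNS:
            match = re.search(pattern, line, flags=re.IGNORECASE)
            if match:
                hits.append((lineno, match.group(0)))
    return hits


sample = (
    "Use pnpm for installs.\n"
    "Think step by step. You're a skilled engineer.\n"
    "Make no mistakes.\n"
)
for lineno, phrase in find_steering(sample):
    print(f"line {lineno}: {phrase}")
# line 2: Think step by step
# line 2: You're a skilled engineer
# line 3: Make no mistakes
```

Concrete project facts such as the first line of the sample pass through untouched, which matches the guidance to keep those and drop the rest.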
I have a, like, three-line global AGENTS.md that is just the tech I prefer building with, because I don't want to have to tell the model every time I'm initing a new project what I like, and even that gets in the way sometimes and I'm considering getting rid of it. Most of my projects don't have an AGENTS.md or CLAUDE.md for quite a while, unless I'm doing something big enough and bold enough that I'm tired of steering it and I just want to keep it going in a specific direction. Generally speaking, you should just be prompting. And if it doesn't do what you want, you should prompt it to do what you want. And you'll start to build the muscle across all these different models, across all these different tools, on what this tool needs to be told to do the thing you want it to do. And if you take anything from this, you should go do an audit of your markdown files that are being used for things like this. Do an audit of your system prompts and the other things that get passed to agents to do work in your codebases, and see if you made the mistake we made here, where your AGENTS.md hasn't been touched for literally 2 months in a codebase that has been entirely rewritten multiple times in that same two-month window. If you're watching a video like this one, you probably already have meaningful technical debt across your prompts. It's time to go deal with it. In efforts of reducing context bloat, I'm going to kill it now. Thank you as always. Peace, nerds.
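For the audit suggested above, a first pass can be as simple as listing prompt files that have not been touched in a while. This Python sketch is a rough starting point under stated assumptions: it only looks for files named AGENTS.md or CLAUDE.md, `stale_prompt_files` is a hypothetical helper, and file modification time is a crude proxy (a fresh checkout resets it, so commit history would be a better signal in practice).

```python
import time
from pathlib import Path
from typing import Optional

# File names assumed for illustration; add skill or system prompt files as needed.
PROMPT_FILENAMES = {"AGENTS.md", "CLAUDE.md"}


def stale_prompt_files(root: Path, max_age_days: float, now: Optional[float] = None) -> list:
    """Return (path, age_in_days) for prompt files older than max_age_days."""
    now = time.time() if now is None else now
    results = []
    # Note: rglob also walks dependency folders; filter those in a real audit.
    for path in root.rglob("*.md"):
        if path.name in PROMPT_FILENAMES and path.is_file():
            age_days = (now - path.stat().st_mtime) / 86_400
            if age_days > max_age_days:
                results.append((path, age_days))
    # Oldest first, since those are the most likely to be out of date.
    return sorted(results, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for path, age in stale_prompt_files(Path("."), max_age_days=30):
        print(f"{path}: untouched for about {age:.0f} days")
```

A file showing up in this list is not automatically wrong, but it is a prompt that predates at least one model release and is worth rereading or deleting.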