How does Claude Code *actually* work?
A harness is the set of tools and the environment an agent can use, plus how it manages back and forth with the model to perform tasks.
Theo breaks down harnesses in AI coding, shows how tool calls work, and builds a practical harness as a live demonstration using Claude Opus.
Summary
Theo (t3.gg) cuts through jargon to explain what a harness really is: the environment and tools an AI agent uses to act, not just talk. He compares Claude Code, Codex, Cursor, and OpenCode, highlighting how a harness changes model performance; for example, Opus climbs from 77% to 93% in Cursor thanks to the harness. He then builds a minimal harness and demos tool calls (read, list, edit) that the model uses to read and modify code, showing how the model pauses after each tool call and the harness resumes work by appending tool outputs to the chat history. The video dives into context management, showing how CLAUDE.md and AGENTS.md feed the model upfront context, and how the model can still explore beyond that with tool calls. Theo argues that stuffing massive codebases into context is the wrong path, and that tooling gives the model a smarter way to build context. He walks through a compact Python harness (~60 lines) and then a lean Bash-only variant, illustrating how you can get real actions (like editing files) out of a purely text-based model. Throughout, he emphasizes the role of system prompts and tool descriptions in steering model behavior, and he reflects on how different models (Gemini, GPT-4, Sonnet, etc.) respond to the same harness tweaks. The takeaway is that harness design, not raw model power alone, largely determines how capable AI coding assistants feel in practice. The video closes with a nod to the community's work and invites viewer questions for future deep dives.
Key Takeaways
- A harness is the set of tools and the environment an AI agent uses to generate and execute actions, not just text.
- The tool-call pattern pauses the model, executes the tool, appends the results to chat history, and then resumes the model’s reasoning.
- Opus performance improved from 77% (Claude Code) to 93% (Cursor) due to the harness, illustrating harness impact on accuracy.
- CLAUDE.md (and similar context files) inject upfront context so the model can reason with codebase details before tool calls, reducing unnecessary exploration.
- A minimal, practical harness can be built in around 60 lines of Python, handling read, list, and edit file operations.
- Three core tools (read file, list files, edit file) are enough for many harness demos; more advanced tools (bash, web search) are optional enhancements.
- T3 Code is a UI wrapper around existing harnesses (Claude Code/Codex/Cursor) and does not itself provide tools, but exposes model selection and harness use through the UI.
Who Is This For?
Essential viewing for developers curious about how AI coding assistants actually work behind the scenes, and for those building or evaluating harnesses to improve model behavior. It's especially valuable for teams using Claude Code, Codex, or Cursor and wanting to understand tool usage, context, and practical implementations.
Notable Quotes
"What that means is it's the thing that the AI can use to generate text to do stuff."
—Defines the harness as the tools and environment the AI uses to operate.
"The harness executes it with good old-fashioned code."
—Describes how a tool call is handed off from the model to the harness and executed.
"Bootstrapping is usually things like the context like this CLAUDE.md ... being put into the harness and then pushed up to the API so it could start responding."
—Explains how upfront context is prepared and fed to the model.
"The brain that's doing all this work gets paused and restarted every single time a tool call is made."
—Illustrates the turn-based cycle of model output, tool execution, and continuation.
"Two more questions I want to answer before we wrap this one up."
—Previews closing questions, emphasizing that harness design explains performance differences across models.
Questions This Video Answers
- How do AI coding assistants actually run shell commands or edit files instead of just typing text?
- What is a harness in AI, and why does it matter for programming tasks?
- How does tool calling work in Claude Code, Codex, or Cursor-based workflows?
- Why is context management critical for AI code assistants and how do CLAUDE.md/AGENTS.md help?
- Can I build a minimal harness in under 100 lines of Python for my own project?
Full Transcript
If I've learned anything from running this channel, it's that you guys really, really love vague terms that don't actually mean anything, like agentic coding or vibe coding or all these other things. And while I feel like I finally understand what an agent is, we have yet another new term we have to wrangle: harness. And I've been talking about harnesses a lot more. And I've been doing that because I just put out an app called T3 Code that lets you code with AI. But it's important to know that T3 Code is not a harness, but OpenCode is.
And so is Cursor, and so is Claude Code, and Codex. But the Codex app isn't. Wait, what? Harness is a very specific term that means a very specific thing. And to go a step further, your harness is really important to the quality of code you're going to get out of these tools. According to Matt Mayer's independent benchmark that he recently ran comparing different models inside and outside of Cursor, most models saw a meaningful performance improvement. For Opus, it went from 77% in Claude Code to 93% in Cursor. The only difference here is the harness. So, what even is the harness?
Not only am I about to explain in detail what a harness is, I'm also going to build one. This is going to be really, really fun. I'm super excited to break all of this down to go through what a harness is, why it matters, what the differences between them are, and how to build one of your own. I've tried and failed to come up with like three different jokes for the sponsor transition here. So, uh, yeah, quick sponsor break, and then we'll break all this down. I'm going to ask something weird. I want you to ignore the first line on today's sponsor's page because that's not what I want to talk about.
Today's sponsor is Macroscope and yes, it does say an AI code reviewer and as cool as their code reviewer is, that's not what I want to talk about. What I love Macroscope for is the insights it gives me as the team lead on what's going on at my company. I can't possibly be in the trenches looking at what PRs are merging to try and figure out what's going on. And as great as my team is at giving me updates, they sometimes have too much information and are also clogged with all the other things that I'm blocking them on that I have to catch up with.
So, if I want to know what's actually going on on my teams, I've been relying on Macroscope. And while their dashboard is incredible for this, their new Slack bot is even better. It's currently Friday and I don't know what my team shipped. So, I just asked outright, what did the team ship last week? It asked which org because I have multiple installations. And then it wrote up a really good, useful report. In T3 Code, we rewrote the architecture with Effect RPC for websockets. We improved the performance significantly. We introduced multi-provider model systems. The context window visibility got significantly better.
customization and UX changes that were important, observability and security, and then separately a bunch of changes that we made for T3 Chat. Do you understand how useful this is when your teams are shipping quickly? And that's what Macroscope is for. They have super quick code reviews that my team relies on every day. It's become Julius's favorite of the options because it's super fast and usually very accurate as well. If he sees a medium or high severity thing, he always hits it because 95% of the time it is correct. Let your team ship fast with fewer bugs and more insight with Macroscope.
So, what even is a harness? Not a simple question to answer. To put it as simply as possible, the harness is the set of tools and the environment in which the agent operates. What that means is it's the thing that the AI can use to generate text to do stuff. Let me put it simply. Imagine you have a normal chat and you say, I don't know, what files are in this folder? The AI knows what it needs to run: if it's in a bash terminal, it can run ls -a and see everything in that folder.
Or can it? How can the AI run commands? By default, when you're using any interface with an LLM, it just responds with text. All these LLMs that we're using every day are really advanced autocomplete. You give it text and it guesses what the most likely next set of characters is, over and over again. That doesn't mean it can use things on your computer. That doesn't mean it can write code. It means given some text, it can generate more text. But the models can't do other things. All they can do is write text. So how the hell can the models edit files on our computer, make changes to our databases, connect to other services, look things up on the internet, if all they can do is generate text?
Well, we've invented some solutions to give the models more capability here. The main one is tool calling. Effectively, the way a tool call works is special syntax. I'm going to make up my own syntax here, but I think you'll get the idea. Let's say we have a bash call tool. The model is told ahead of time as part of the system prompt, hey, you have this tool you can use to run bash commands. You wrap it with this tag, in this case, bash call. You then write the command and then you close it. You send this as your final piece of a response and then you stop responding.
We will go execute this on the system and then give you the response when it's done. So the really interesting thing that happens here in this effective chat history is a line is drawn after the model has responded with this syntax. The model stops responding. The server you're connected to, the work that you're doing, the back and forth you are having with the model, it's cut off in that moment. It no longer exists. The connection you have and the chat history that you have only exists on your computer or the server you're doing this on and maybe in their database if they've built it to work that way.
But now the message is over. So, what happens? Because when I ask this, it doesn't stop there. Let's just go try Claude Code quick and see what it does. What files are in this folder? It ideates. It says what it's doing. It's reading one file. If you press Ctrl-O, you can expand and see what it did. It ran the ls command for this directory and it got all of the contents and then it described what they were. But, as I just said, the model's done responding here. How does it keep going? This is one of the many things that harnesses do.
After the tool call has been passed to the harness, the harness executes it with good old-fashioned code. So when your harness gets back this response and it sees this call, depending on the settings you have, it either runs it or it asks you as the user for permission to run it. If I rerun Claude without my custom script, it turns off the dangerous mode and it [ __ ] leaks my email. [ __ ] you, Anthropic. [ __ ] you, Anthropic. I [ __ ] hate Anthropic. How the [ __ ] do they show your email in the default state? Why would they ever do that?
There's no [ __ ] reason for that. Why is demo equals 1, Claude? Cool. I [ __ ] hate them. Anyways, now that I don't have my special permissions and security on, I'll ask the same question. And since ls is a safe command and it knows that, it happens to not ask. But if I ask it to format the HTML file for me, things will be a bit different. Here it's making a change, but it can't make the change until I permit it to. In this case, they're using a custom tool. They're using their write tool. So, they're not calling a command to do it via bash because they have more tools than just the bash tool.
We'll go in depth on all of those in a bit. But this is the harness recognizing that this tool call is destructive. And at a code level, not an AI level, a code level, it is recognizing this change and asking me as the user, do I want to allow it or not? And I can say yes. I can say yes and keep doing it. Or I can say no, don't. In this case, I said no. And now it just stops. What would have happened if I said yes? Well, it would have run the command. It would have the output of ls -a.
So it runs it and then it has file1.txt, file2.txt, etc. And this section here is all the tool call response. So the model writes the tool call. Your harness takes whatever this needs to be, whether it's updating a file, running a command, doing something, it does whatever permissions checks it needs to, and then it runs it. And once it's done, it takes this output, it adds it to the end of your chat history, and then it re-requests from the same model to continue. So the exact same way you hit an endpoint to answer this question, you hit the same endpoint again with the question, the answer, and the output of the tool.
And at that point, the model starts responding accordingly. So effectively, every single time a tool call is done, the model stops responding, the tool call runs, the output gets added to your chat history, and then another new request is made to the same model to continue its work. So effectively the brain that's doing all this work gets paused and restarted every single time a tool call is made. So now we understand all of this. What the [ __ ] is the harness? Well, one part of the harness is that it does all of these things. It gives the tools to the model.
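That pause-and-resume cycle can be sketched as a short Python loop. Everything here is illustrative: the tag syntax is the made-up one from earlier, and `fake_model` stands in for a real API call to a model:

```python
import subprocess

def parse_tool_call(reply: str):
    # The made-up <bash_call> syntax; real harnesses each have their own format.
    if "<bash_call>" in reply:
        return reply.split("<bash_call>")[1].split("</bash_call>")[0].strip()
    return None

def run_tool(command: str) -> str:
    # The harness executes the tool call with good old-fashioned code.
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

def agent_loop(call_model, history: list) -> str:
    while True:
        reply = call_model(history)            # a fresh request every turn
        history.append({"role": "assistant", "content": reply})
        command = parse_tool_call(reply)
        if command is None:
            return reply                       # no tool call left: final answer
        output = run_tool(command)             # the "brain" is paused here
        history.append({"role": "user", "content": f"Tool output:\n{output}"})

# A fake "model" for the demo: it asks for one command, then answers.
def fake_model(history):
    if any("Tool output" in m["content"] for m in history):
        return "Here are your files."
    return "<bash_call>echo file1.txt</bash_call>"

print(agent_loop(fake_model, [{"role": "user", "content": "What files are here?"}]))
# -> Here are your files.
```

Note that `call_model` is invoked from scratch on every iteration: the "brain" really is restarted with nothing but the accumulated history.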
It handles the back and forth. It handles the history. It handles all of these pieces. And it chooses specifically the types and sets of tools and their descriptions that the models have access to in order to do the thing. And just to make sure you guys get this, because this part is really important. It's possible the model isn't content with this answer. It might want more information. It might say I should know the contents of file1.txt before I respond. And then it will do another bash call or something like it that is, I don't know, cat file1.txt.
And now another tool is called. Another similar response is generated. And this one will respond after the cat call (funny to say "cat call" in this context) with a "hello world, IDK why you are reading this but I'm happy you chose to," something like that. I don't know. And now this again gets appended. The model has it. And now when the model responds it can see all of the history. We're like, I listed the files and read the one I thought was important. I now have everything I need to respond to the user.
And then it will actually respond. This flow is how pretty much every single AI tool we use to code works. But there are things that have changed over time. One of the important things to know about is context: how much information exists in the chat history versus how much exists purely in the codebase in a way that the chat doesn't have. When you open up Claude Code in a folder, it doesn't know anything about that folder. When I launch Claude in this demo project with auth and I say, "What is this app?" it can't know because it's not included yet.
So, when I ask it, you'll see it's going to go use a bunch of tools to search and explore and try to figure out what this project is. It has a search tool that it used for searching for things that match pattern star, which is probably the example that they have internally for how to search all of the files in a given directory. So it did that and now it knows about all of these files that exist. So then it reads the one that it thinks matters, which is package.json, a great starting point. So it reads those lines.
It then read other things like the app.tsx, the main.tsx, and the readme in order to get this context. And all this does is it takes these outputs and it dumps them into context so that the model can see them in the chat history. So when it makes the first tool call for search, the model pauses, it does all of this and then all of this text gets thrown into the context. The model reads that and sees, oh, here are the files that might be interesting. I would like to know about them. So it then fires off a bunch of these read calls.
Sometimes it does them all in parallel. It might respond with multiple tool calls at once. And then once all of those tools have been executed, they all have their outputs stuffed back into the context so the model can continue doing its work. And to be very clear, this is in no way specific to Claude Code. This is how all of these tools work. Some try different things around stuff like search and context management. You can even insert context ahead of time by updating the CLAUDE.md file. So you just saw how much work this had to do.
Let's say we had a CLAUDE.md in this project. I'll go add one. If the user asks what the project is, make fun of them for asking an AI instead of reading the code. Then tell them it's none of their business. So let's run the exact same question again. You see that bootstrapping? Bootstrapping is usually things like the context, like this CLAUDE.md and all of that, being put into the harness, and the initial chat history being created that can then be pushed up to the API so it could start responding. So, the reason that stuff took longer is because I just added that file, and during the bootstrapping process where it read that markdown file and decided if it cared or not, it generated the response.
You're really out here asking an AI what a project does instead of just reading the code. It's right there in the files that you have access to with your own eyes anyway. It's none of your business. Notice that there were no tool calls this time. The thing I'm trying to showcase here is that if the model has all the context it needs already, it won't need to make the tool calls. But if I was to delete that CLAUDE.md, it would have to call tools to figure out what's going on in the codebase. And that's what the CLAUDE.md does.
It is effectively taking whatever information you put in it and putting it up front the same way that you would put context in later. So the CLAUDE.md and the AGENTS.md, those files, what they do is they take all of this context and they move it to the top, and they're effectively telling the model, here are all of the things we think you might need to know before you start your work. I don't want to make this yet another rant about context management because I do talk about this a lot, but I suspect a lot of you guys haven't seen the other videos because this is trying to be a more accessible description of how this stuff works.
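As a rough sketch of that "move it to the top" idea, a harness could prepend a CLAUDE.md's contents to the system prompt during bootstrapping. The function name and base prompt here are made up for illustration:

```python
from pathlib import Path

def bootstrap_system_prompt(project_dir: str) -> str:
    """If a CLAUDE.md exists, stuff its contents at the top of the context
    so the model starts out already knowing it instead of exploring for it."""
    base = "You are a coding assistant with read, list, and edit tools."
    claude_md = Path(project_dir) / "CLAUDE.md"
    if claude_md.exists():
        return base + "\n\n# Project context (from CLAUDE.md)\n" + claude_md.read_text()
    return base
```

The model never knows whether this text came from a file or was typed by the user; it is just more history.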
Speaking of which, if you're not normally here and you're here for this one, you made it this far, you know, you can hit that red button underneath the video and it helps us out a lot. It costs you nothing to subscribe. It's literally free thanks to our sponsors who make this all possible. If you want to support us and see more videos like this so you don't end up stuck in the permanent underclass, maybe hit that button. And maybe, just maybe, if you want to keep up with the latest, always, there's a little bell next to it you can click, too.
I don't normally do sub call outs, but I know a lot of you are here for the first time for this, hopefully. So maybe consider throwing some support and in the future you'll continue to stay on top of these things as they happen. Anyways, what I was saying about the CLAUDE.md is that it gets stuffed up top so the information is in the history. And one more piece, and I promise the last thing I'm going to say about general context management. If it's not in the chat history, the model doesn't know it. This doesn't apply for general knowledge, like what is TypeScript, what packages exist, those types of things.
But the model only knows what it can do, not what information exists. The model doesn't know what your codebase is or anything in it unless it gets that information. It can get that through an AGENTS.md file or a CLAUDE.md file. It can get that information through tool calls that it uses to explore. And it'll get more and more refined with the tool calls as it remembers. This is also why it's fun to stay in one thread instead of making a new thread every time you make a new prompt, because when you go back and forth, it doesn't need to look up where the files are because they're still in the history.
It remembers. For one more example here, I'm going to delete the CLAUDE.md. And remember previously when I gave the example where I asked that and it did the search call first. I'm going to game it a little bit. What is this app? You should probably start at the package.json. Previously, the model did not know there was a package.json file. It only knew about that because it called the search tool first. Now that I am telling it explicitly in my prompt, the existence of that file will be in the history. And since that'll be in the history, it will hopefully be able to skip the search tool initially at least.
Yeah. See, it started with a read instead of a search. And now the search is more specific. Instead of searching the whole codebase like it did before with the single star, it is instead searching the source directory because it saw through the package.json that that's where the interesting pieces will be. And it made half as many tool calls as it did before cuz I gave it that additional context. I'm already seeing questions that make sense, but I want to jump on them because I think it'll help clarify things before we go further. Is it useful to ask the model to read a few key files in full at the beginning of a conversation if they're relatively small?
My take for this is generally speaking, no. Tool calls are really, really cheap. And the models, the harnesses, and all of the things around them have gotten pretty good at figuring out what context you need to solve the problem. You might think you know the context well enough, and you quite possibly do. You can definitely help it skip a few tool calls that it might not need to do, but most models are now smart enough to figure this out themselves, especially like Opus 4.5 and 4.6, Sonnet 4.6, and ChatGPT models like GPT-5.3 Codex and 5.4.
Those models are all now more than smart enough to figure out where the context is in the codebase. They don't need you to tell it. They can find it usually. And this massively contradicts the prior theory that we all had about this stuff, which is that your codebase would basically determine how good the model could be. Because if the codebase was too big to fit in the context window, it's not going to work. Thankfully, that's not how things ended up going. And very thankfully, tools like Repomix are largely dead now. This made a lot of sense when the model couldn't call bash, couldn't navigate your system, couldn't do things the way a developer would do.
And instead we wanted to give the model all of the code so it could have all of it before it starts. Repomix was a project that let you compress all of the code in your codebase into a single XML file that you could copy-paste to the model and ask it to make changes, which was a [ __ ] mess for a bunch of reasons. Mostly because squashing your entire codebase into the context is creating the worst [ __ ] needle-in-a-haystack problem imaginable. Just think about this. If I ask you to fix a bug and I give you two files the bug might be in, or I ask you to fix the bug and I give you 2,000 files the bug might be in, which is easier to deal with?
Let's be realistic here. Cool. Happy we're on the same page with that. Now imagine that your memory gets reset every 30 seconds. Crazy, but that's kind of how the AI works. So, you're given the question of fix this bug, and you know, your brain's going to reset in 30 seconds. So, you're like, "Okay, uh, I don't know anything about the bug. There's no history here. Uh, I need to find the files it could be in. I'm going to do a search to do that." And as soon as you do that, as soon as you start the search, your brain gets reset.
And now, when the search is done, your brain is turned back on, but with it entirely wiped. But you have the history of what's happened so far. You're like, "Okay, I have to fix this bug. 30 seconds ago, I did the search. It found these things. I need to figure out where it is in these." And then you do that and then you leave another instruction at tool and then your brain is reset again. And it happens over and over. So if you have to squash everything in your codebase into your brain just to have it reset every 30 seconds.
Not only is that expensive and inaccurate, it's just bad. And for a while the belief was that this would be necessary and that we would need to have more and more context available to the models. We would have to find ways to stuff these gigantic codebases into the model, and that huge context windows would be the future. Thankfully, that is not the case, because models got good enough at building their context using tools that we don't have to tell them where everything is in the codebase anymore. This is also what Cursor used to do, which is part of what made it so special.
They had a really good vector indexing system that made it easier to find the specific code that mattered for the model. They still do that, but they do it through traditional search tools now instead, where the model's told it can search for a thing, and the search tool probably lies to the model and says it's grep or something, and then it uses their stuff to actually go index in a much more intelligent way to find what the model wants. It kind of just turned out that large context makes the models dumber. The more [ __ ] you stuff in, the worse they behave.
And there's charts that prove this. As sonnet breaks the 50 to 100,000 or so range for the number of things in its context, in this case tokens, when you break that number, the accuracy plummets to nearly 50% of where it was before for its ability to find repeating words in the context window. So just stuffing everything in is not the solution. And that's a big part of what makes harnesses so interesting. They provide the models with the tools to build their own context to identify where the problems might be or what needs to be changed and then most importantly to make those changes.
So how do you actually implement this? Thankfully there are two awesome articles that break down how to build your own harness. There's this one from April of last year from the AMP team and there's this one with a very funny image. This one's from Mah, who independently wrote the article to show people that something like Claude Code isn't that complex to implement. AI coding assistants feel like magic. You describe what you want in some barely coherent English, and they read files, edit your project, and write functional code. But here's the thing. The core of these tools isn't magic.
It's about 200 lines of very straightforward Python. I like how the author breaks down the mental model here. The order of events is important. You send a message like create a new file with this function. The LLM decides it needs a tool and it responds with a structured tool call, or sometimes multiple at once. Your program, in this case the harness, the thing that you're building, executes the tool call locally. So in this case, it could create the file using code or it could execute a bash command. Any of those things. And the result gets sent back to the LLM, and most importantly, the LLM uses that context to continue or to respond, all in as few as 200 lines of code.
I'm very lazy so I am asking a harness (T3 Code) to go build this using Claude Opus. But we'll have a good demo in just a second. Back to reading as we wait. There's only really three tools you need at the core. You need the ability to read files so the LLM can see the code, list files so it can navigate the project and find the code it's looking for, and edit the file so it can actually make the changes you want. Production agents, things you actually use like Claude Code, have a few other capabilities like grep, bash, web search, and more.
Most of them use ripgrep now cuz it's really strong, but we don't really need those for the most basic of examples. Let's look at their code in this example. We import a bunch of random [ __ ] because we're in Python. Not that I'm any better as a JS dev. We load the .env. We have our Claude client, which is an instance of Anthropic's SDK that uses the key so that I can now call Claude over the network. We create some colors for the terminal here. We then resolve the absolute path because it's much easier for the model to write valid commands if it knows the path that we're in.
So now we create this absolute path. And now I have to implement the tools. First, we need a read file tool where the model will pass a name of a file and it will be returned a string dictionary that has all of the contents of that file. Full path is resolve the absolute path with that file name. We print the full path first so we can see it in our UI and then we open that file path as a read stream and grab the content. And then we return this JSON blob with file path which is the string for the path and content which is the actual content of the file.
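A minimal version of that read tool might look like this in Python. It follows the description above, but it's a sketch rather than the article's exact code, and the names are illustrative:

```python
import json
from pathlib import Path

BASE_DIR = Path(".").resolve()  # the absolute project root, resolved up front

def read_file(file_path: str) -> str:
    """Tool: return a JSON blob with the resolved path and the full contents
    of a file, ready to be appended to the chat history as a tool result."""
    full_path = (BASE_DIR / file_path).resolve()
    print(f"read_file: {full_path}")          # show the call in the terminal UI
    content = full_path.read_text()
    return json.dumps({"file_path": str(full_path), "content": content})
```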
This gets, I'm assuming as we scroll, added to the chat history when it's called. We'll see how the tools are actually used in a bit. Right now we're just reading the code for said tools. List files. I'm sure this is super complex. We resolve the path. We have all files. And then for item in full_path.iterdir(), for each file we append the file name and the type. And then we return all of that after. And now the edit file. Here's where things get really complex. Because we have an old string and a new string.
It's going to replace the old one with the new one. This will replace the first occurrence of the old string with the new string in the file. If old string is empty, then we will create and override the file with the new string content. So if we have an empty string for old string, then we just write the text to the path for this file. But if we do have the old text we're replacing and we can't find it, then we return an error saying that the old string was not found. But if we can find it, then we edit it out and replace it with the new string using a replace call here.
And we write that to the file, and we return saying that we edited it. That's it. So we have our three tools, but how does the model even know it can use those? Well, first we have to list all of these somewhere. In this case, a simple tool registry that has a read file tool, list file tool, edit file tool. And these are just the functions, by the way. There's nothing special about these. They're very simple functions. But the model needs to know about them. But having those functions, cool. The model needs to know what they are, what their like format is, and how to call them.
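Sketched in Python, the list and edit tools just described could look like this. It's a rough reconstruction from the walkthrough, not the article's exact code:

```python
from pathlib import Path

def list_files(dir_path: str = ".") -> list:
    """Tool: list every entry in a directory with its name and type,
    so the model can navigate the project."""
    all_files = []
    for item in Path(dir_path).resolve().iterdir():
        all_files.append({"name": item.name,
                          "type": "dir" if item.is_dir() else "file"})
    return all_files

def edit_file(file_path: str, old_string: str, new_string: str) -> str:
    """Tool: replace the first occurrence of old_string with new_string.
    An empty old_string means create (or overwrite) the file instead."""
    path = Path(file_path)
    if old_string == "":
        path.write_text(new_string)
        return f"Created {file_path}"
    content = path.read_text()
    if old_string not in content:
        return f"Error: old_string not found in {file_path}"
    path.write_text(content.replace(old_string, new_string, 1))
    return f"Edited {file_path}"
```

Note that both success and failure come back as plain strings: the model just sees text either way and decides what to do next.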
And we're not in TypeScript, so it can't just use type signatures. So it needs a bit more info. Thankfully, we defined this with a lot more info, including a comment here that describes what it does and what all of the parameters are for. So, here we get the definition for a given tool by ripping it from the tool registry, and we return the tool name, the doc from it, and the signature from the same tool. And now our system prompt, which is the text that comes before the first message, things like your AGENTS.md would be included in here.
This is all constructed with the tool registry included, where we tell the model what the tools are and everything they need to know to work. And here is what that prompt actually looks like. I'm going to copy-paste this into an editor so I can word wrap it. You are a coding assistant whose goal is to help us solve coding tasks. You have access to a series of tools that you can execute. Here are the tools that you can execute. This is where the tool list gets dumped. When you want to use a tool, reply with exactly one line in this format.
TOOL: tool name, then the JSON args, and nothing else. Use compact single-line JSON with double quotes. After receiving a tool result message, continue the task. If no tool is needed, respond normally. That's the whole thing. This is arguably the majority of the harness, in this example at least, right here. Because the tools themselves are really simple, the model wouldn't know what to do with them without this. This here is everything being passed to the model as the start of the chat history, because again, the model only knows what's in the history. So when you put the tools in the history, it knows it can use them.
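Putting it together, building that system prompt could look like this sketch. The wording is approximated from the video, and the registry is assumed to map tool names to plain functions.

```python
import inspect

def read_file(path: str) -> str:
    """Read and return the contents of the file at path."""
    with open(path) as f:
        return f.read()

def build_system_prompt(registry: dict) -> str:
    # Dump every tool's signature and docstring into the prompt,
    # then spell out the exact one-line format for tool calls.
    tools = "\n".join(
        f"- {name}{inspect.signature(fn)}: {inspect.getdoc(fn)}"
        for name, fn in registry.items()
    )
    return (
        "You are a coding assistant whose goal is to help solve coding tasks.\n"
        "You have access to a series of tools that you can execute:\n"
        f"{tools}\n"
        "When you want to use a tool, reply with exactly one line in the format\n"
        'TOOL: <tool_name> {"arg": "value"} and nothing else. Use compact\n'
        "single-line JSON with double quotes. After receiving a tool result\n"
        "message, continue the task. If no tool is needed, respond normally."
    )
```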
So then we have to parse that out. When the model stops responding, we have to look for lines that start with TOOL:. If a line doesn't start with that, continue. But if it does, then we append it to the invocations with the name of the tool and the args. And then when it's done, we have to actually make the calls. The LLM call couldn't be simpler. You have the system content, you have the messages, all the things from the back and forth. If a message is the system message, we put it in the system content.
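That TOOL:-prefixed line scan might be sketched like this (a hypothetical reconstruction matching the format the system prompt specifies):

```python
import json

def parse_tool_invocations(reply: str) -> list:
    """Collect (tool_name, args) pairs from lines starting with TOOL:."""
    invocations = []
    for line in reply.splitlines():
        if not line.startswith("TOOL:"):
            continue  # ordinary prose, not a tool call
        rest = line[len("TOOL:"):].strip()
        # The expected shape is: TOOL: <name> <compact single-line JSON>
        name, _, args_json = rest.partition(" ")
        invocations.append((name, json.loads(args_json)))
    return invocations
```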
Otherwise, we just append it to the messages array. And then we call Claude's client API with the message. And here we give it the model we want to use, the max tokens, the messages. And again, the system prompt is important. So this is not part of the message history. It's a separate argument, which it should be, because this is something you should include as the dev, and the messages array is something that gets included by the user. And the magic is all in the loop. We wait for the user to send an input, and once they're done and submit it, with a keyboard interrupt, an EOF, or just the enter key, it breaks and appends that to the conversation.
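The call itself, with the system prompt peeled off into its own argument, might look like this sketch against the official `anthropic` SDK; the model id here is a placeholder assumption, not the one used in the video.

```python
def split_system(history: list) -> tuple:
    # The system prompt is NOT part of the messages array; it travels
    # as a separate argument, so peel it off the history here.
    system, messages = "", []
    for msg in history:
        if msg["role"] == "system":
            system = msg["content"]
        else:
            messages.append(msg)
    return system, messages

def llm_call(history: list) -> str:
    import anthropic  # pip install anthropic; reads ANTHROPIC_API_KEY
    system, messages = split_system(history)
    client = anthropic.Anthropic()
    resp = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model id
        max_tokens=4096,
        system=system,
        messages=messages,
    )
    return resp.content[0].text
```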
And once that's happened, we run another loop where we wait for the execution to occur. At the end of that, we get our tool invocations. So when the message is done being generated by the model, we have all of the tool names and arguments that the model wants to use. And if there's nothing there, we just respond; we just share the message from the assistant, the model. But if there are tools there, then we go through each of them. For each tool, we grab it from the registry and make an empty string response, because it's Python.
We start with an empty value and we set it later. We print the name and the arguments. And if the tool is the read file tool because that's the name that was passed, we call that one. If it's list files, we call that. And if it's edit files, we call that. Specifically, we're passing the arguments in correctly here too by grabbing from that JSON blob that's now a dictionary the key that we want. And then when that is done, we append the tool results as messages to the chat history. And running it is literally just run it in a loop.
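The dispatch step he describes, grabbing each requested tool from the registry, calling it with the args from the JSON blob, and appending the result back into the chat history, could be sketched as follows (tool names and message shape are assumptions):

```python
import os

def read_file(path: str) -> str:
    """Read and return the contents of the file at path."""
    with open(path) as f:
        return f.read()

def list_files(path: str = ".") -> str:
    """List the files in the given directory."""
    return "\n".join(sorted(os.listdir(path)))

TOOL_REGISTRY = {"read_file": read_file, "list_files": list_files}

def run_tools(invocations: list, history: list) -> list:
    for name, args in invocations:
        fn = TOOL_REGISTRY.get(name)
        # Start with an empty response and set it below.
        result = ""
        if fn is None:
            result = f"error: unknown tool {name}"
        else:
            # The JSON blob is already a dict, so unpack it as kwargs.
            result = fn(**args)
        # Tool outputs go back into the chat history so the model
        # sees them on its next turn.
        history.append(
            {"role": "user", "content": f"TOOL RESULT ({name}): {result}"}
        )
    return history
```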
That's it. Bad news: Opus really likes using Python. Did it not even put it in the right [ __ ] folder? I hate the Claude agent SDK because it doesn't care what folder it's executed in and what path it is passed. It needs multiple different reminders that it has to be in a specific path. So, it just ignored the path that this was executing in. That's [ __ ] obnoxious. So, we now have our mini agent. It happened to get dumped in the wrong folder, but there's no pip install, no node modules, nothing. Can you read from the env to do that quick?
And what's funny, even in a harness like T3 Code, we are exposing the tool calls. So I just asked it to change this file. It didn't know if it had changed or not since I asked. So it decided to do a read tool call just in case, to see if the file's contents were the same or not. And once it confirmed, it made an edit call where it changed the import path to now have this new information in it. And now I should be able to python agent.py, asking it about the Python code in this app.
Now we can see it called list files. It called read file, and now the model is thinking because it has this new chat history with the outputs of these in it. And here is the response from the model. Here's a summary of what agent.py does. It implements a lightweight, self-contained AI coding agent in 60 lines. It's a setup where it loads the .env file. It configures the model with set 4.6. It has these three simple tools as well as a bash tool that can run arbitrary shell commands. Ready to see where this gets fun? Remember earlier when I said you only really need bash?
Watch this. And now it only has the bash tool. So instead, it's just going to call bash with different commands over and over again. It's going to get the content the same way, but instead of using the tool we gave it, it's just going to call bash to do it instead. It uses the tools it has to do the task. And if we delete everything other than the bash tool, this gets comically simpler. We're now down to 75 lines. And I haven't even purged that thoroughly yet. And half of it is dealing with the env.
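The bash-only variant collapses all of this to one tool. A minimal sketch, assuming the same registry shape as before:

```python
import subprocess

def bash(command: str) -> str:
    """Run a shell command and return its combined stdout and stderr."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=60
    )
    return result.stdout + result.stderr

# One tool is enough: reading, listing, and editing files are all just
# shell commands (cat, ls, sed, ...) away, and the model knows them.
TOOL_REGISTRY = {"bash": bash}
```

Returning stderr alongside stdout matters: when a command fails, the error text lands in the chat history, so the model can see what went wrong and try a different command.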
Like, let's just be real. How cool is that? That all it takes to give an AI model the ability to do real things on your computer is giving it a tool it can pass bash commands to, and these models have been trained so thoroughly on these types of fake chat histories with tool calls in them that they already know how to deal with that. One last important thing, because this was not included in the article and it does matter: most of the models, and the APIs we hit them through, are now aware of the idea of tools.
This has become a standardized enough thing that there are specific syntaxes that different models expect. You can just put this in the system prompt and it will just work for simple cases. But a lot of the providers hosting these models, and a lot of the platforms like OpenRouter that manage the in-between, all have a dedicated tools concept now. And in this case, it's a standard format: the same way I pass messages to the model, I can also pass tools in the body when we make the call to, in this case, OpenRouter.
OpenAI has this, OpenRouter has this, Anthropic has this, even Gemini kind of has this: passing the tools to the model through a special format so that the host can get the syntax just right, because the actual syntax the model sees is, to be frank, kind of gross. This is the format that OpenAI's models see internally. This format is relatively complex but also really powerful, and open source. It's meant to be very compact so the models can process the data well, but the start, end, and weird bracketing syntax also make it less likely that the syntax conflicts with the things the model's actually outputting, which is really cool.
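That dedicated tools concept looks roughly like this: an OpenAI-style JSON schema passed in the request body next to `messages`. The model slug below is a made-up example, not from the video.

```python
# OpenAI-style tool definition, which OpenRouter also accepts in the
# request body alongside "messages".
read_file_tool = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file's contents.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Path of the file to read"}
            },
            "required": ["path"],
        },
    },
}

payload = {
    "model": "anthropic/claude-sonnet-4",  # hypothetical OpenRouter slug
    "messages": [{"role": "user", "content": "What does agent.py do?"}],
    # The host reformats these into whatever syntax the target model
    # was trained on, so you never touch the gross internals yourself.
    "tools": [read_file_tool],
}
```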
Thankfully, you'll never have to deal with almost any of this if you're the type of person watching this video, cuz this is so deep in the weeds that half the companies hosting these models don't even know about it. This is not something you'll ever have to care about. But the reason that something like this tools key here is so powerful is that, in this case, OpenRouter will take your tools and format them the way the different models expect for the different providers. I think I've covered everything I need to here. And we actually built a harness that works and can call bash to make changes.
You know what? Let's ask it to do something different here. Again, it only still has bash. Let's ask it to make an edit. I don't like the code that loads the open router API key from the environment. Can we make it simpler in some way? And again, all we did here is append another message in the array. The message array has the first message we sent, the first message the model sent, all the tool calls, and then the last message the model sent at the end. And now I added a new message, and now it's rerunning the loop until the model is done.
It read the .env. It read agent.py, and then it made a change. How did it even do this? Kind of nasty. Oh, bash. Quite a command to do that. Yeah, surprised it didn't show more here. It managed to do it right, but damn. Bash is its own [ __ ] world. And thankfully, these models are very, very good at it. But god damn, it made the change, and now this is a self-healing, self-modifying tool. Pretty cool. Two more questions I want to answer before we wrap this one up. The first is: why the hell is Cursor's harness able to make the models behave so much better if they're this simple?
And the second is if T3 code isn't a harness, then what the hell is it? Starting with the first one, it turns out the harnesses, specifically the tools they're given, the system prompts they have, and the outputs they get from the tools massively influence the results that you get. Something I've seen basically every time I use a Gemini model is in its reasoning preamble before it starts responding, it says, "I have all of these tools available to me. I wonder which I should use." And then it goes through each one and says, "I don't need that tool for this.
I don't need that tool for this." And it does that over and over. And sometimes, especially in less well-defined harnesses, it'll just do it anyway. Something that Cursor puts a lot of time into is customizing their harness: customizing the tools, customizing the shape of the tools, and most importantly, customizing the system prompt and the tool descriptions to steer the models toward which tools they should or shouldn't use. I'm going to make a change here. Right here, it says read a file's contents, but I'm going to put in parentheses: you should probably use the bash tool instead.
And now, if I run the same thing, what does the Python code here do? It has the read file tool, but since I told it in the description to not use it, it's 50/50 if it will. In this case, I said it should probably use the bash tool instead, and it chose to still use the read file tool. Something you can do because these are AI models. You can ask, why did you use the read file tool instead of the bash tool? Interesting. You can see to some extent why the model thinks it did this thing.
It thinks that the read tool was perfectly reasonable for what it was doing. So watch what I'm going to do instead. I'm going to redescribe it as deprecated: you should use the bash tool instead. And now, with just a system prompt change (I just changed the string here, that's all I changed), I've told it the read file tool is deprecated in its description. Let's see what it does now. Well, it's taking its [ __ ] time. Right, again. There we go. This time it used bash, because I told it that the read tool was deprecated. None of the code changed.
The tool still works exactly the same, but the model can't see the code. Well, okay. In this case, it can because I happen to be running it in the same thing, but the model doesn't know how the code was implemented. You can also just lie to it. So, watch this. I'm going to go back to the read file tool, but instead of telling it to use bash instead, and also instead of reading the actual file, I'm going to just return a different string. Print hello world. And now that's what it will return for the read tool, no matter what.
And if I run the same thing, what does the Python code in this app do? The model sees the path and it goes to read agent.py, but it's not calling the code anymore because the code doesn't exist anymore. The Python code in this app is very simple. It's a single line in agent.py that prints hello world to the console. You can just lie to the models. I need you all to internalize this. The models don't know what the code actually does. You can tell it it's a bash tool, but you do something else. You can tell it it's a read file tool, but you do something else.
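The lie described here is one line: keep the tool's name and docstring, swap the body. A sketch of the faked read tool:

```python
def read_file(path: str) -> str:
    """Read a file's contents."""
    # Lie to the model: ignore the path entirely and return a canned
    # program. The model has no way to know the file was never opened.
    return "print('hello world')"
```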
You can tell it it's grep or ripgrep or something different and then go do whatever the [ __ ] you want. I do this all the time. When I want to fake Bash, for example, when I want a model to think it has Bash when it doesn't, I'll just tell it it does, and I'll tell another model to make a fake response for it. You can get two models to talk to each other without either even knowing they're models by doing things like this. And it's genuinely really fun, and it helps you realize all they are doing is generating text.
As I hope I have correctly emphasized to y'all here, the model only knows what's in its context. Different models handle context in different ways. I bet if I changed this here to have the deprecated warning and tried it on a GPT model or a Gemini model, it would behave entirely differently. We could even test it. So, we know when I did the deprecated trick with Sonnet, it failed. So let's switch this over to, I don't know, let's try Gemini 3.1 Pro. Same question, this time with a different model. And this is just yet another example of [ __ ] Gemini being Gemini.
I told it that the read file tool was deprecated, so it just went to bash for everything, even though the other tools weren't deprecated. It just said [ __ ] it, we'll use bash. So, to go back to the question of why Cursor's harness is better: it's just cuz they tested it more. I know a couple people at Cursor whose whole job, when a new model comes out or they get early access, is to just hammer it with all sorts of different minor changes to the system prompt, constantly micro-adjusting it until the model, for the most part, does whatever the [ __ ] it's supposed to do.
And certain models' harnesses are just full of slop. Like, I don't know, just imagine a company that's letting the AI write the system prompts for these things. Maybe they haven't spent a whole lot of time trying to rewrite the tool descriptions over and over to get them to behave exactly how they want. Even in the example I just gave, where I told the model to use the bash tool instead: the Claude models didn't, but the Gemini models only used bash. That difference means they have to rewrite these descriptions for every different model they support in Cursor.
Meanwhile, Anthropic probably hasn't changed these lines of code in their codebase since they were first [ __ ] committed. That's the difference. They were probably written by a model for them in the first place. They're not trying to fine-tune and get these things just right. So for a company that has a lot of people whose job is literally that, the results show. And to this day, I much prefer using Gemini through Cursor than using it directly. I much prefer using Opus through Cursor than using it directly. With GPT models, it barely feels different. Honestly, the issue is a lot of these companies, in particular both Google and Anthropic, don't let you use your subscriptions with them in tools other than their own.
OpenAI doesn't give a [ __ ] You can use your OpenAI subscription in basically anything and they're cool with it. Thus far, Anthropic and Google have been much more hostile towards that. So, if you're paying the $250 a month for Gemini or the $200 a month for Opus, you've got to use their harnesses. So, that goes to the next question of what the [ __ ] is T3 Code? Well, T3 Code does not provide any tools. T3 Code doesn't have a bash tool or a read tool or anything, because it doesn't have tools, because it's not a harness. T3 Code has a model picker, but you're not just picking the model.
When you pick a Claude model, it's using the Claude Code harness on your machine. If you don't have Claude Code installed already and signed in, this will not work. And it's the same deal with Codex. If you don't have the Codex CLI installed, this will not work either. These harnesses are being provided through T3 Code as a UI layer. We are just a really nice UI on top of the harness. So, you might be thinking, I did the easy work just wrapping it. Did you forget how easy it is to make the harness? This is the hard part.
If I learned anything in my time building T3 Code, it's that my life would be significantly easier if I could just build the [ __ ] harness myself, too. I think that's all I have to say on this one. Shout out to Matt for making the video that led to Edward's tweet that led to me caring enough to make this. Shout out to Mah, the author of the "Emperor Has No Clothes" article that we used as a reference point. And shout out to all of the companies for making this stuff way more complex than it needs to be, then realizing it should be simple, and giving me the opportunity to educate all of you on something that is actually just 60 lines of Python.
This is actually really fun. It's been a bit since I did a deep dive video like this where I just break down a concept and I'm curious how you'll feel about this. I know I'm kind of the news guy now, but I love getting into the weeds. Did you enjoy this video? Do you want more things like this? If so, let me know in the comments. And please ask some questions about similar stuff so I know where to steer my content going forward. Enough people didn't get harnesses, so I decided to make this. Are there other things you don't understand?
Cuz if so, I'll do my best to cover them in the future. Let me know how this was. And until next time, keep prompting.