Don't build more AI agents until you watch this
Chapters
The chapter argues that agents improve not by piling on tools, but by pruning them and focusing a strong, well-maintained harness around a repeatable workflow. It emphasizes maintaining the harness as models and work evolve, and outlines four durable principles for 2026 about keeping agents healthy through deliberate maintenance.
Vercel proves fewer tools, smarter maintenance, and a strong workbench beat ever-expanding agent toolkits for reliable AI automation.
Summary
Nate B. Jones uses the Vercel case to flip the usual script: more tools don’t necessarily make an AI agent better. By studying a top sales rep, Vercel built a repeatable workflow and then trimmed tools to keep the harness healthy as the model evolved. Jones stresses that maintenance matters more than raw capability, arguing that models improve while harnesses must evolve in tandem. He warns that a stale wiki, outdated SOPs, and drifting dashboards become dangerous when agents act proactively. The talk leans on Stewart Brand’s Maintenance of Everything to frame agents as sailboats that require ongoing tuning, not a one-off build. Codex and Claude Code appear as exemplars of harness-centric design, where the toolset, memory, and approvals create an operating surface for work. The core message: you’re choosing not just a model, but how much harness maintenance you’re willing to own. Finally, Jones offers a practical five-part checklist to keep any serious agent healthy: current sources, reachable permissions, a clear job, traceable proof, and measurable value. Read the book he promotes and start rebuilding your own harness with that mindset.
Key Takeaways
- Vercel improved its sales automation not by adding tools, but by studying a top performer’s workflow and building an agent around it.
- A better model can render an old harness dangerous; you must prune and redesign as models evolve to avoid drift.
- Harness maintenance is central: the same setup that worked yesterday may become obsolete tomorrow as the model and work change.
- Codex and Claude Code exemplify harness design—memory, file access, approvals, and logs that create a stable work surface as models improve.
Who Is This For?
Product leaders, AI engineers, and revenue/sales ops who want durable agent systems. If you’re building or maintaining task automation, this video helps you rethink harness design, maintenance, and upskilling for future model changes.
Notable Quotes
"The agent got better when the workbench got cleaned up."
—Core metaphor: maintenance and pruning trump constant tool addition.
"Agents inherit all of the crud of the systems around them."
—Drift in documents and processes sabotages even a powerful agent.
"The bigger question isn’t can you build an agent, but can you keep the setup healthy as the work changes."
—Maintenance mindset over one-time build.
"Codex is strong not just because the model is strong, but because OpenAI keeps maintaining the harness around the model."
—Harness as operating surface, not just a brain.
"You’re choosing how much harness maintenance you are outsourcing versus owning."
—Strategic decision about setup vs. outsourcing upkeep.
Questions This Video Answers
- How do you design a maintenance-focused AI agent harness for evolving models?
- What is the sailboat analogy in AI maintenance and why does it matter?
- Why does trimming tools sometimes improve agent performance?
- What role do Codex and Claude Code play in harness design for agents?
- What should be included in a practical five-part agent health check?
Vercel, AI Agents, Harness, Codex, Claude Code, Maintenance of Everything, Prompts drift, Workbench, Sailboat analogy, Maturity of AI systems
Full Transcript
Vercel made its agent better by deleting 80% of its tools. You heard that right. And that sentence can sound wrong if you've been following a lot of the hype around new tools and new skills for agents. So, I want to set the record straight. The usual story we hear is that agents get better as you give them more stuff, right? More context, more memory, more tools, more integrations, more access, more autonomy. Let the agent touch the CRM, let it use Slack, let it browse the web, let it update the record. Vercel's example is a really healthy counterexample in that process.
And no, it's not just about context window, which is the usual reason people dump out tools. Messages came in. Some were real leads, some were spam, some were support questions dressed up as sales questions, some came from little companies, some came from accounts that might matter to Vercel, some deserved a really quick reply, and some needed research, and some needed routing. It's the usual messy inbox. So, Vercel studied one of its best reps. They watched the workflow closely enough to turn pieces of it into an agent. I love that part. You have to study what people are already doing.
What did the rep ignore? What did they answer? What made a lead real? What research happened before the reply? When was a message actually a support issue? Where did the human still need to make a judgment call? Then, they built the agent around the actual observed workflow, not the paper workflow. Again, love that. The agent filtered inbound messages, it qualified leads, it researched companies, it drafted responses, it routed support questions away from sales. A human still reviewed the work because the goal was not to let a bot roam around the company, right? The goal was to take a repeatable workflow from a strong employee and make that repeatable bit run fast.
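The workflow described above (filter, route, draft, then human review) can be sketched in a few lines. This is only an illustrative sketch, not Vercel's implementation: the keyword lists, the `Draft` type, and the `triage` and `send` functions are all hypothetical names, and a real system would use the signals observed from the rep rather than simple keyword matching.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    SPAM = "spam"
    SUPPORT = "support"
    SALES = "sales"


@dataclass
class Draft:
    route: Route
    reply: str
    approved: bool = False  # a human flips this after reviewing the draft


# Hypothetical hints, standing in for what the top rep actually looked for.
SPAM_HINTS = ("seo services", "guest post")
SUPPORT_HINTS = ("error", "bug", "outage")


def triage(message: str) -> Draft:
    """Classify one inbound message and draft a reply for human review."""
    text = message.lower()
    if any(hint in text for hint in SPAM_HINTS):
        return Draft(Route.SPAM, "")
    if any(hint in text for hint in SUPPORT_HINTS):
        return Draft(Route.SUPPORT, "Routing you to our support team.")
    return Draft(Route.SALES, "Thanks for reaching out. A rep will follow up.")


def send(draft: Draft) -> str:
    """Refuse to send anything a human has not approved."""
    if not draft.approved:
        raise PermissionError("draft has not been reviewed by a human")
    return draft.reply
```

The point of the `approved` flag is the one made in the talk: the agent speeds up the repeatable part, and a person still signs off before anything leaves the building.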
And that's already a great story. But, the more important lesson is what happened after the agent existed. The agent did not get better when the team kept piling on tools. It got better when they took away tools. And this is something that I think that a lot of folks who are excited about agents need to sit with more. And this goes for skills, too. If you've got a pile of skills in your Codex or Claude, pay attention. Because most of us are building agents the opposite way in practice. We're so enthused about building, right? We start with one task and add a tool and add another tool and add a memory file and add a Slack integration, add a browser, and add a CRM action, add another exception.
And after a while, the agent will look super powerful and muscled up, but it's going to become harder to trust. The beginner instinct is to add. The maintenance instinct is to ask what should be removed. That is the real agent story of 2026. Not can you build an agent? Look, I've got a video on that. There's dozens of videos out there on that. Of course, you can build an agent. The harder question is whether you can keep the setup around the agent healthy as the work changes and the model evolves. People call that setup a harness.
If that word feels super technical, you can call it a workbench. It's kind of the same thing. The agent is the worker, the harness is the workbench. It's what the agent reads, it's what it remembers, it's what tools it can touch, it's what it's allowed to change, it's what proof it has to bring back, it's what stops it when the work gets risky. Vercel's sales agent had a workbench or harness. It had a documented workflow from a top performer. It had tools, it had handoffs, it had human review, it had feedback, and then the team learned that part of maintaining that workbench or harness is pruning.
And that is a much more important lesson than AI replaced the sales process, which is what all the headlines were about. The real lesson is that useful agents desperately need good maintenance. And I think there are four first principles here that I want to lay out that are going to be durable for 2026. The first is that agents themselves are moving. The model underneath the agent is not stable. It's getting better, it's getting better at tool use, it's better at reasoning across steps, it's better at understanding messy instructions, it's better at reading files, it's better at remembering what matters, it's better at moving through work without needing every step spelled out.
So, agents get better, and that sounds purely good. And mostly it is, but it also means yesterday's harness can become very wrong very quickly at the price of an update. A tool that helped a weaker model can confuse a stronger one. A rule that protected you from an unreliable model's mistakes can trap a better model. A workflow that forced structure around a clumsy agent can become a drag when the model can handle a lot more of the work itself. These are all real examples. We are used to software breaking when it gets worse. That's our mental model.
Agents can also break when the model gets better and that is a different and new thing. It's a strange new maintenance problem. Imagine the first version of an agent is not very reliable. It overreaches. It invents patterns. It treats one example like a trend. So you build a really careful harness around it. You give it strict tools and narrow the prompt and say only use these sources. Don't infer. Don't create records. Don't recommend a next step. Just summarize what you see. And that may be exactly right for that model. Then the model improves. Again, real examples here.
Now it can compare sources better. It can understand the workflow better. It can tell the difference between a weak signal and a real pattern. It can draft a useful next step. I am describing November to March of this past 6 or 8 months. But your harness still treats it like the old model. So the agent is underused or the opposite happens, right? The old model was clumsy so you give it broad access because you knew a human would catch everything. Then the model gets better. Now it can take 20 plausible actions in a few minutes.
Now they look real. They look organized. They create work that a human has to unwind. So the model improved, the harness did not and that is a massive driver of agent breakage in 2026. Normal systems drift. Prompts drift. Wikis get stale. Dashboards break. Automations keep running long after the process changes. SOPs describe how the company worked months ago. Slack channels become junk drawers. Templates survive long after the reason for the template disappeared. None of that started with AI, right? Every company already has this problem. The product wiki, it's a little or a lot wrong. The CRM field means something slightly different than it used to.
The dashboard, it still says activation, but the team changed what activation means. The support tags have evolved. The road map moved, the owner changed, the process changed, the docs didn't. With normal software, this is vaguely annoying and you sometimes get messages saying, "Please update your wiki." With agents, it's very dangerous because agents don't sit. They produce work. They're proactive. That's their job. They summarize, they recommend, they draft, they route, they update, and sometimes, of course, they act. That's the value. So, a stale wiki that is annoying to you is incredibly dangerous to an agent because it doesn't know that and it just keeps on working.
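One small way to act on the stale-source problem is to record when each source was last reviewed and flag anything past a cutoff before the agent reads it. The sketch below assumes such review dates exist somewhere; the `stale_sources` function, the source names, and the 90-day default are illustrative choices, not anything stated in the talk.

```python
from datetime import date, timedelta


def stale_sources(
    last_reviewed: dict[str, date],
    today: date,
    max_age_days: int = 90,  # assumed threshold; pick one that fits the work
) -> list[str]:
    """Return the names of sources not reviewed within max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, seen in last_reviewed.items() if seen < cutoff)


# Hypothetical inventory of what an agent reads.
inventory = {
    "product-wiki": date(2025, 1, 10),
    "routing-rules": date(2025, 6, 1),
    "refund-policy": date(2024, 11, 3),
}
flagged = stale_sources(inventory, today=date(2025, 6, 15))
```

A date check cannot tell whether a page is actually wrong, only that nobody has looked recently, so it is a prompt for human review rather than a verdict.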
And this is the second principle I want to communicate. Agents inherit all of the crud of the systems around them. If your wiki is stale, your agent reads and ingests stale truth. If your process changed, your agent will follow old process unless you update your docs. If your prompt is written for last quarter's company and model, that agent may keep serving last quarter's company and not realize everything's changed. If your dashboard definition is incorrect now, the agent will make the wrong number feel very convincing. This is not a model failure in the simple sense, right?
The agent did its job. It's the old maintenance problem with a machine that now can produce work from that mess that is sometimes very convincing. And this is why Stewart Brand's Maintenance of Everything, I think it's the right frame for agents. Brand is writing about sailboats and vehicles and weapons and manuals and corrosion and the work that keeps important systems alive after the launch moment is over. Agents are a lot less like apps and more like sailboats. I love this book. This is like one of my favorite books of the year. You don't just launch agents and walk away.
The weather changes, the lines loosen, salt gets into everything, and yes, this is all from that book. The same setup that worked yesterday can be wrong tomorrow. A sailboat is not maintained because it was badly designed, it is maintained because it lives in motion. Agents live in motion, too. The model changes inside them. The world changes around them. In that sense, they are much more like traditional vehicle maintenance than anything else we've seen in software in a long time. The harness has to keep up with the model changes and the world changes. And so few of us really have a good system for that.
Now, the third principle I want to call out is that the biggest AI companies already know this. A lot of the implicit bet from the frontier labs and platform companies is not just that their models will get better. It is that they can use those better models to ship and evolve the harness faster. And I think that's one reason why it's really important to talk about Codex in the strategic context of OpenAI's long-term strategy. And I think that's one reason Codex matters so much. Codex is strong not just because the model is strong. Codex is strong because OpenAI keeps maintaining the harness around the model so it feels intuitive and native as the model and the world evolve around it.
It has become closer to an operating surface for work as it's evolved. So, it has a terminal and a desktop app and an IDE and a browser and computer use and files and plugins and memory and automations and approvals and sandboxing and network controls and keychain storage and managed configs and logs. This is way beyond a chat box with a smarter brain. It's a very carefully maintained workbench around machine work. And the Claude Code team is doing the same thing, right? They're investing heavily in their harness. I'm really excited to do the "Claude Code is as amazing as Codex" review, guys.
So, please give me something that cool. And to go back to the workbench analogy, every tool in that workbench is carefully chosen with Codex, right? The terminal matters because real work lives in commands and repos and files and tests and local tools. This was Claude Code's original insight, by the way. The browser matters because real work happens on interfaces that humans see. And both Anthropic and OpenAI are building that way. Computer use matters because not every tool has a clean API. Plugins matter because work lives in a bunch of other systems: GitHub, Google Drive, Jira, Slack, etc.
Memory matters because preferences and corrections should not have to be rebuilt every day, right? Approvals and sandboxing matter because a capable agent still needs boundaries. Logs matter because when an agent does something weird, someone needs to know what happened. This whole surface together is the harness. It's an art to build a good harness. And there are really two teams in the world building good harnesses: the Anthropic team and the OpenAI team right now. And this is where the hyperscaler and frontier platform bet gets super interesting. If the model can help you ship the harness and test the harness and refactor the harness and observe the harness and train the harness, then capability gain is going to start to compound real fast.
Because better agents can help build more effective harnesses, better harnesses can make the agents more useful, and then better agents can help rebuild that harness once more. That is why the Vercel story is not just a quirky sales automation story. It's a pattern we all need to learn from. The companies that win are not the ones that build the perfect wrapper once. They're the ones that keep rebuilding the wrapper as the model and the work change. They rebuild that workshop. They rebuild that harness. And this is why the direction of Codex over time feels really significant to me.
If Codex keeps getting more capable, and that's an if, and the Codex harness keeps getting closer to the operating system of work, another if, then OpenAI is not only selling intelligence, they are selling the environment in which intelligence becomes useful. And that's the same bet Anthropic is making with Claude Code and Claude Cowork. The harness evolves, the model evolves, the harness lets the model touch more real work over time. More real work creates more pressure to improve the harness, and that loop is ignited like a flywheel. And that loop matters, and it raises the bar for all of the rest of us.
Because if you're building your own agent setup, you are now not just choosing a model, you're choosing how much harness maintenance you are choosing to own versus how much harness maintenance you're outsourcing. A light custom harness might be a clean set of instructions and memory and source folders and repeatable methods around Codex or Claude. That can be enough. Here are the sources. Here's the job. Here's what you can't touch. Here's the proof I need. Here's when a human decides. A deeper custom harness is a very different thing. Because now you have a data feed, a review screen, permission levels, logs, model choice, escalation paths, approval rules, and a plan for what happens when the model changes.
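The five items of the light harness above (sources, job, what you can't touch, required proof, when a human decides) fit in one small record. The sketch below is one possible shape under that assumption; `LightHarness`, its field names, and the example paths and actions are hypothetical and do not correspond to any Codex or Claude configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LightHarness:
    job: str                        # here's the job
    sources: tuple[str, ...]        # here are the sources
    off_limits: tuple[str, ...]     # here's what you can't touch
    required_proof: str             # here's the proof I need
    human_decides: tuple[str, ...]  # here's when a human decides

    def may_touch(self, path: str) -> bool:
        """True unless the path sits under an off-limits prefix."""
        return not any(path.startswith(prefix) for prefix in self.off_limits)

    def needs_human(self, action: str) -> bool:
        """True when the action is reserved for a human decision."""
        return action in self.human_decides


# Hypothetical example values.
harness = LightHarness(
    job="Summarize weekly customer feedback",
    sources=("notes/calls/", "tickets/export.csv"),
    off_limits=("finance/", "hr/"),
    required_proof="link every claim to a ticket or call note",
    human_decides=("publish", "update_crm"),
)
```

Keeping the record this small is deliberate: every field added is one more thing to revisit when the model or the work changes.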
And that can be very worth it to invest in, but now you're not just building an agent. You are investing in the long-term maintenance of an agent and harness system. You are taking responsibility for evolving the system around the agent over time. And the more custom the harness, the more you own the upkeep. And this is not abstract for me. So now I'm thinking about my delegation model differently, and part of it is just the ordinary mess of work, right? Folders move, drafts change, source packets get updated, memory gets stale, and the way I want the agent to use local context changes as the agent gets better.
So the thing I maintain is a lot more than a prompt. It's the whole way the agent meets my files. Where should it look first? Which folders are a source of truth? What should it ignore? What should it ask about before touching? What should it remember? What should it forget? When it searches memory, is that right? When does it actually go read the file? That is a harness question for me. And that's a tiny personal harness question, right? I'm not even talking about team harnesses here. And it has changed because the agents have changed, because the models have updated.
And this brings me to the fourth principle, and it's the one that I think matters the most. You need to ask, I think all of us need to ask, what is my harness? What is my workshop? Not in a sort of technical way that makes it feel scary, but in a very practical way. If you use ChatGPT or Claude or Codex or any other agentic tool, your harness is the setup that makes that model useful for your real work. Maybe it's project folders, maybe it's your memory, your prompts, your source docs, your approval habits, your browser access, your file rules, your tools, your verification loop, your way of asking for proof, your habit of making the agent read the actual source instead of guessing from context.
If you're a product leader, your harness might be the sources your agent reads before planning, right? If you're in sales, your harness might be the CRM fields or the call notes or the routing rules and the human approval steps. Maybe you're in support, right? Your harness might be the policy store and escalation paths and refund rules. If you're a writer, your harness might be your drafts and your transcripts and your voice notes and your editorial rules and your archive and the instruction that the agent has to show where an idea came from. If you're an engineer, of course, your harness would be a repo, tests, terminal, permissions, work trees, logs, review rules.
That in many ways is the most mature example of a harness we have today. But that's the real question for all of us, whether we're engineers or not. What is your harness? What are you doing to ship it? What are you doing to rebuild it? Do you know when to rebuild it? Do you know how to rebuild it? What are you doing to evolve it as the models improve? The more useful question is can I maintain this sailboat over a voyage? What harness does this agent need? Those are the same question. And then the really mature question is what part of this harness will I need to delete later?
And Brand gets into this, he talks about simplicity as a key to maintenance, and I love that. This is the Vercel lesson. The agent got better when the workbench got cleaned up. And once you start to realize that your maintenance question is not just a modification question, but potentially a deletion question, the whole agent conversation becomes a lot more complex. You stop treating the harness as a one-time wrapper, you treat it as a living system where you do have to add sometimes, and you do have to take away. You have to think about the system's health overall.
So, for any serious agent, I would check these five things. First, what's it eating, right? What's it reading? Are the sources current? Did the workflow move? Did a new source become important? Did an old source become misleading? Second, I would test its reach. What can it touch? Can it only read? Can it draft? Can it create tickets? Can it post in Slack? Can it update records? Can it spend money? Can it publish? A permission that was harmless for a weaker model may be too broad for a strong one. A restriction that made sense for an unreliable model may hold back a better one.
Third, I would check its job. Is this still a summary agent? Is that useful? Is it becoming a planning agent because agents are getting better and it can be? Is it supposed to find themes? Is it supposed to recommend tradeoffs? Is it supposed to route work? Do not let the job change silently. Change the job on purpose if you're going to do it at all. Fourth, check the proof. The agent shouldn't just say, "Customers are frustrated with onboarding." It needs to link to the tickets. It should link to sales notes. It should quote customer language and have a source.
And it should say which sources it checked and where, and which ones it could not access. So, the proof is not just the agent saying it. The proof is a linkable trail a human can inspect. Fifth and last, check the agent's value. I don't think this gets done enough. It's like asking if the sailboat got there, right? Does anyone read the output? Does it change the work? Does it save time after review? Does it create another pile of work? Is it duplicating a report? Has the model improved enough that the agent ought to be rebuilt?
Has the business changed enough that the agent should be retired? That can happen. Agents, unlike almost anything else, break in two directions. They break because the world around them drifts, and they break because the model inside them improves. And maintenance is the work of keeping the harness fit between those two moving things. That delicate art, that sailing art, is the future of agents. It's not just more capability, it's better maintained capability. The work changes around the agent, the model changes inside it. And if you ignore either of those, the agent does not have to fail very loudly to become quite dangerous.
All the agent has to do is to keep working, and it will start to haunt your business. And before I forget, yes, I'm going to say it again, read the Maintenance of Everything. If I could recommend one book on agents that isn't about AI, I would recommend this one. It's a phenomenal book. It's out of Stripe Press. I love what Stripe Press is doing. Thank you, guys. It's by Stewart Brand. Go get it and read it. It will teach you a lot about how to think about the maintenance of technical systems. Have fun. I'll see you next time.
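The five-part check from the transcript (what it reads, what it can touch, its job, its proof, its value) can be kept as a small record that gets filled in on a schedule. This is a minimal sketch with hypothetical names; the answers still come from a person inspecting the agent, and the code only keeps track of which checks failed.

```python
from dataclasses import dataclass


@dataclass
class HealthCheck:
    sources_current: bool    # 1. what it reads is still up to date
    reach_fits_model: bool   # 2. permissions are neither too broad nor too tight
    job_is_deliberate: bool  # 3. the job has not changed silently
    proof_is_linkable: bool  # 4. claims link to sources a human can inspect
    output_is_used: bool     # 5. someone reads it and it saves time after review

    def failures(self) -> list[str]:
        """Names of the checks that did not pass, in checklist order."""
        return [name for name, ok in vars(self).items() if not ok]


# Hypothetical review of one agent.
review = HealthCheck(
    sources_current=True,
    reach_fits_model=False,
    job_is_deliberate=True,
    proof_is_linkable=True,
    output_is_used=False,
)
```

A failed `output_is_used` check is the one that points toward rebuilding or retiring the agent rather than patching it.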