Anthropic Just Revealed The Truth About Agent Harnesses

AI LABS | 00:14:02 | Mar 31, 2026
The video discusses how modern AI coding frameworks have become bloated, and how Opus 4.6 makes it possible to cut away unnecessary components that encode outdated assumptions about model limits.

Anthropic’s Opus 4.6 reshapes agent harnesses: focus on planning, generation, and evaluation while stripping dead weight from older frameworks.

Summary

AI Labs’ breakdown of Anthropic’s latest harness work reveals a turning point: many framework components have become outdated as Opus 4.6 raises the model’s own capabilities. The host explains that Anthropic’s experiments removed each harness component to measure its impact, concluding that only planning, generation, and evaluation truly matter for effective agents. They contrast Opus 4.6 with older approaches like BMAD, SpecKit, GSD, and Superpowers, arguing that context isolation and micro-task sharding are less necessary when models can plan at a higher level and execute with autonomy. The video highlights how Anthropic’s planner can generate plans with a product-level focus, while Claude’s native planning features still struggle to stay at the product level rather than the implementation level. A key point is that the evaluator must be separate from the generator to avoid self-assessment bias, and graded evaluation metrics are used to quantify UI design quality, originality, craft, and functionality. The host also covers the shift away from context resets and detailed task breakdowns as Opus 4.6 enables continuous, long-horizon operation. Finally, practical setup guidance is offered: you can emulate Anthropic’s framework with agent teams (generator + evaluator) and consider GSD for a ready-made planner-evaluator loop, while integrating the best ideas across frameworks. The video also plugs AI Labs Pro resources and Miniax sponsor features, including MaxClaw for easy AI agent deployment.

Key Takeaways

  • Anthropic’s experiments remove components one by one and find that most framework parts are dead weight; Opus 4.6’s capabilities render many old assumptions obsolete.
  • Only three harness roles are essential: planning, generation, and evaluation; other steps can be stripped back as models become more capable.
  • Planning has evolved from detailed upfront specs to high-level framing; micro-technical planning up front often causes errors to cascade.
  • The generator builds features feature-by-feature, coordinating with an evaluator that uses tools like Playwright to simulate user interactions and verify outcomes.
  • Evaluation must be separate from the code writer; a graded rubric (design quality, originality, craft, functionality) guides what “done” looks like.
  • Context resets and long-context management are less critical with Opus 4.6; models can run continuously across a session, making the old pattern of external task breakdowns that persist across resets largely unnecessary.
  • You can combine Anthropic’s ideas with GSD or other frameworks by using agent teams (generator + evaluator) to maintain coordination and avoid cross-agent overhead.

Who Is This For?

This is essential viewing for AI/ML engineers and product teams experimenting with agent-based systems, especially those upgrading to Opus 4.6 or integrating planner-generator-evaluator loops. It clarifies what to prune and what to preserve for long-horizon agents.

Notable Quotes

"Over the past few months, we have covered many AI coding frameworks including BMAD, GSD, SpecKit, and Superpowers. And a lot of you actually started using them."
Intro referencing the breadth of frameworks evaluated and the audience’s adoption.
"Anthropic actually ran experiments testing out different aspects of the harness, removing each one and measuring its impact."
Core claim about the empirical approach to pruning harness components.
"From their findings, they concluded that all an agent harness actually needs is agents for planning, generation, and evaluation."
Summary of the minimal viable harness.
"The core idea is to keep the project at the product level, not the implementation level."
Product-focused planning contrasted with implementation-heavy planning.
"The evaluator keeps verifying the work and communicates the issues with the generator and they work in coordination to implement the whole app that matches your standards."
Describes the generator-evaluator collaboration.

Questions This Video Answers

  • How does Opus 4.6 change planning for AI agents compared to BMAD or SpecKit?
  • Why is an independent evaluator crucial in agent harnesses and how are grading criteria used?
  • What is the role of Playwright in testing AI-generated apps and how does it integrate with a generator?
  • Can you implement Anthropic's harness using agent teams, and what are the practical steps to set it up?
Tags: Anthropic Opus 4.6, Agent harnesses, Planner-generator-evaluator loop, BMAD, SpecKit, GSD, Superpowers, Claude, Playwright, Context windows, Context resets, Long-horizon agents
Full Transcript
Over the past few months, we have covered many AI coding frameworks including BMAD, GSD, SpecKit, and Superpowers. And a lot of you actually started using them. But Anthropic just ran experiments on their own harness, removing components one by one and measuring what actually mattered. Their conclusion was that most of it is now dead weight. Every component in a framework encodes an assumption about what the model cannot do on its own. And with Opus 4.6, those assumptions have gone stale. We went through the whole thing and mapped out what still matters, what you can strip out, and what your setup should actually look like. Now, agent harnesses play an important role in making agents work substantially better over long horizons. Anthropic has already released an agent harness, which we covered in detail in a previous video explaining how to set it up and use it. We have also covered other frameworks in that same context, and while their implementations differ, they are all trying to do the same thing. But when these frameworks were released, the models were not as capable as Opus 4.6 is now. For example, frameworks like GSD are focused on context isolation, but that is not a problem with Opus 4.6, not only because of the million-token context window, but for another reason that we will talk about in a bit. Therefore, a lot of previously implemented frameworks are now an overhead for new model capabilities. Anthropic actually ran experiments testing out different aspects of the harness, removing each one and measuring its impact. From their findings, they concluded that all an agent harness actually needs is agents for planning, generation, and evaluation. The rest are just ways of doing things that become dead weight given how capable the models are. Now the core theory is that every component in an agent harness, no matter which one you are using, relies on the same principle. Each component encodes an assumption about what the model can do on its own.
These assumptions should be stress-tested because they may be incorrect, and they will go stale as the model improves; that is what Anthropic did throughout the article. Therefore, with the evolution of the models, your harness should also evolve. And if you are working with the same principles laid out a few months back, you are not keeping up. Planning is the first step that remains unchanged across every framework. But the way you plan has to change for more capable models. Anthropic's previous long-running harnesses required the user to provide a detailed spec upfront. Frameworks like BMAD and SpecKit literally shard the task into smaller fragments and microtasks that help the AI agent implement it with ease. And these weren't just small tasks. They were literally detailed steps that agents just had to follow without thinking. This is because at that time the models were not capable enough and needed to be micro-guided so that they could perform the way you wanted. But with Opus 4.5 and 4.6, this has changed. When Anthropic tested this, they found that if the planner tried to specify micro-technical details up front, a single error would cascade through every level of implementation, making it hard for the agent to deviate and fix issues on its own. It all relied on how well written the plan was. Therefore, planning has now become high level rather than a detailed technical implementation. Agents are much smarter on their own now, and you just have to tell them what deliverables are needed. They can figure out the path toward that on their own. With this shift, planning approaches like those in BMAD and SpecKit are no longer as relevant. You can limit BMAD to the planning phase up to PRD generation, with no need to go into the technical sharding process. As we have mentioned before, PRD generation with BMAD is effective because it has specialized agents for understanding product requirements better than Claude would have done on its own.
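To make the product-level planning idea concrete, here is a minimal sketch in plain Python. Everything here is hypothetical (the data classes, field names, and the habit-tracker example are not from Anthropic's harness); the point is only the shape of the plan: deliverables and user stories, with done-criteria for the evaluator, and no implementation steps for the agent to follow blindly.

```python
from dataclasses import dataclass, field

@dataclass
class Feature:
    """One product-level deliverable: what it does, not how to build it."""
    name: str
    user_story: str           # the outcome from the user's perspective
    done_criteria: list[str]  # what the evaluator will check, not code steps

@dataclass
class ProductPlan:
    app_idea: str
    features: list[Feature] = field(default_factory=list)

# A high-level plan names deliverables and leaves the path to the agent.
plan = ProductPlan(
    app_idea="habit tracker",
    features=[
        Feature(
            name="daily check-in",
            user_story="As a user, I mark a habit done and see my streak update.",
            done_criteria=["streak increments on check-in",
                           "state persists on reload"],
        ),
    ],
)
print(plan.features[0].name)  # daily check-in
```

Contrast this with the old sharded approach, where each feature would be broken into ordered micro-tasks; here the agent decides the implementation path itself.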
This is because BMAD's agents have the external context for specific tasks added in by the author. Alternatively, you can use the questioning session from Superpowers, since it was actually intended to identify edge cases, which can be more effective than multi-level task documentation. But the core problem with overly detailed planning is that it locks the agent down and does not leave room for the AI to make discoveries and figure things out on its own. Anthropic has also given an example plan that was generated by the planner agent, which you can use to set up your own planner agent. It clearly outlines that the plan should go big on scope and push the boundaries of whatever app idea you provide. The core idea is to keep the project at the product level, not the implementation level. This matters because if it tries to plan out the implementation within the project plan, it becomes too focused on technical details and may fail to deliver what is actually needed for a complete product. Now, you might think that Claude's own plan mode already does similar planning by asking questions and providing a detailed plan. But here is the difference. Even though Claude has a planning agent, it still focuses heavily on implementation details and does not truly operate at the product level, which goes against Anthropic's findings. Therefore, once you have this in place, you can simply ask Claude to use the agent you created to plan your app, and it will generate a complete plan and document it in your folder as it progresses. This plan includes a full feature breakdown at the product level, and with each phase, it includes user stories that show what the user's perspective looks like. This helps Claude implement the correct workflows that users actually expect. But before we move forward, let's have a word from our sponsor, Miniax. Setting up AI agents is a nightmare.
API keys, server configs, Docker setups, and after all that, your assistant forgets everything the moment you close the tab. The solution is MaxClaw, a cloud-powered AI at your fingertips. No setup, no headaches. You can deploy your own Open Claw. Just click deploy, and you're live in under 10 seconds. It builds websites, writes code, runs research, and automates your busy work, all from simple text prompts. MaxClaw connects directly to Telegram, Slack, Discord, and more, letting you automate workflows, browse the web, and even generate images or videos, all from a simple chat. It is part of Miniax Agent, an AI-native workspace where everyone becomes an agent designer. It works on Mac and Windows, powered by M2.7, which matches Claude Opus 4.6 on SWE-bench. Stop wrestling with complex setups. Let MaxClaw handle it, and click the link in the pinned comment to get started. The agent that writes the code should not be the one evaluating it. This is the second most common problem, and it is not usually discussed much. Self-evaluation is problematic because if you use the same agent that wrote the code to evaluate it, it tends to respond very confidently and praise its own work even when the quality is clearly subpar. This might be easier to manage for tasks that have quantitative metrics, like whether the APIs that were implemented are actually working. But this problem becomes much more pronounced for tasks that do not have clearly verifiable outcomes. The biggest example of this is the UI. What constitutes a good UI is subjective, and AI might not fully grasp your intentions. It may consider its own implementation well done even if it does not meet your standards. This issue was already recognized by the creators of multiple frameworks, and they implemented their own evaluation mechanisms to address it. All of the frameworks we have covered, like GSD, BMAD, and Superpowers, ensure that the same agent that wrote the code does not get to evaluate its quality.
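As a rough illustration of that generator/evaluator separation, here is a minimal feedback-loop sketch in plain Python. The two functions are stand-ins for real model calls, and nothing here is Anthropic's actual code; the point is only the shape of the cycle, with the critic kept distinct from the writer: generate, evaluate adversarially, feed issues back, and repeat until the evaluator approves.

```python
def generate(feature: str, feedback: list[str]) -> dict:
    """Stand-in generator: 'implements' the feature, addressing prior feedback."""
    return {"feature": feature, "fixed": set(feedback)}

def evaluate(build: dict, known_bugs: list[str]) -> list[str]:
    """Stand-in evaluator: approaches the build assuming bugs exist."""
    return [bug for bug in known_bugs if bug not in build["fixed"]]

def build_feature(feature: str, known_bugs: list[str], max_rounds: int = 5) -> int:
    """Run the generate/evaluate loop; return how many rounds approval took."""
    feedback: list[str] = []
    for round_no in range(1, max_rounds + 1):
        build = generate(feature, feedback)
        issues = evaluate(build, known_bugs)
        if not issues:
            return round_no           # evaluator approves; feature is done
        feedback = feedback + issues  # hand the issues back to the generator
    raise RuntimeError("evaluator never approved the implementation")

rounds = build_feature("login form", known_bugs=["empty password accepted"])
print(rounds)  # 2: round one flags the bug, round two fixes it
```

In a real setup, `generate` would be a coding agent and `evaluate` a separate agent driving tests (for instance via Playwright), but the control flow stays this simple.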
This approach significantly improves the accuracy and reliability of the agent's evaluations. Therefore, whether you are using an existing framework or building your own, you need to ensure that the evaluator is completely separate from the implementer. Before implementation begins, both the generator and evaluator agents negotiate a contract agreeing on what done looks like for the work. This helps because both agents clearly know what to achieve and what to verify. With high-level planning, there still need to be actionable, implementable steps. But during testing with the harness, they tried removing the sprint contract. They found that Opus 4.5 was less efficient in this scenario because the evaluator still had to step in to catch issues. But with Opus 4.6, the model's capabilities had improved so much that the contract was not necessary. The generative agent was capable enough to handle most of the work on its own. Therefore, for smaller models like Sonnet or Haiku, you still need to document tasks, break them down properly into sprint structures, and have each agent agree on what complete looks like. But with more capable models, you can rely on Opus to execute the high-level plan without these additional steps. Now, we said that there is a reason why context isolation matters. This is because smaller models experience context anxiety, a phenomenon where models start losing coherence on lengthy tasks as their context window fills up. When this happens, they wrap up work prematurely and claim they have implemented tasks correctly, even when they have not. The solution that helped was a context reset, clearing their context windows before starting implementation. Since the context was cleared, they could rely on a task breakdown documented externally, which persisted across context resets. But the models exhibited so much context anxiety that compaction alone was not enough. They needed additional measures to prevent problems on longer tasks. Starting with Opus 4.5, however, models no longer exhibit this behavior. These agents can run continuously across an entire session, and the way Claude handles compaction is sufficient for their functioning. Therefore, context resets are no longer necessary, and detailed task breakdowns like those in BMAD and SpecKit are not needed either, with high-level guidance alone being enough. The generator agent is the main implementer that actually builds the app feature by feature. It takes the specs from the plan and continuously implements them while integrating with Git for version control. The generator works in coordination with the evaluator agent. After building a feature, it hands it over for testing and receives feedback to improve its implementation. Its workflow is organized into several steps: understanding the task, implementing it, and refining the implementation. Even within the implementation phase, work is divided into four subphases covering different aspects. It follows the design direction, verifies its work, and then hands it to the evaluator. This creates a structured, step-by-step pattern enabling the agent to implement an entire app independently and systematically. The evaluator agent acts as the adversary to the generator. Its job is to ensure the app is implemented correctly, not by doing a generic find-bugs pass, but by approaching it critically from the perspective that bugs exist. It can use tools like Playwright to test the app by simulating user interactions, identify bugs based on predefined criteria, and send feedback back to the generator. By reading the plan, the evaluator gains a clear understanding of what done should look like and checks everything thoroughly before approving it. Each framework has its own validator, but the approaches differ significantly. BMAD uses specialized code review and QA agents that generate and run tests, evaluating the code from multiple angles.
GSD uses a verifier sub-agent that checks the implementation against the existing plan and produces a documentation report. Superpowers relies on fresh sub-agents and enforces strict TDD, where no code can be written before the test cases. If the agent tries to bypass this, it is blocked. SpecKit treats specs as the source of truth and allows the agent to verify code against the documentation. But none of these frameworks provide a scoring mechanism with the level of rigor Anthropic was aiming for. Therefore, the evaluator in Anthropic's harness is the closest to Ralph Loop's strict implementation enforcement for Claude, ensuring the agent actually delivers what is needed with a proper graded evaluation mechanism. Also, if you are enjoying our content, consider pressing the hype button, because it helps us create more content like this and reach more people. The agent has no means to know what the right output looks like for you, especially in cases where the implementation is not quantifiable. Therefore, you use graded evaluation mechanisms so that it knows what the right output looks like to you. When Anthropic gave an example of the evaluation metrics for the front end, they mentioned that the AI tends to converge on similar outputs most of the time. They set four grading criteria for both the generator and evaluator agents. The first is quality of the design, instructing it to check whether the feel is coherent or just separate components strung together. Then originality, which is one of the main ones, because AI tends to default to the same purple-and-white gradient pattern for most UIs. This goes against how humans design, because for a human, each design choice is deliberate, and this makes it easily identifiable when the website does not look good. The third is craft: the minor details like typography, spacing, consistency, and color harmony, where AI tends to settle for a technically balanced contrast ratio rather than a more creative look.
And the last is functionality, because in terms of UI, each component plays a visual role in enhancing the user experience. Claude already scores well on craft and functionality, but the rest are the most common struggles, and the prompts need to push it to its best capability by emphasizing that the best design comes from quality. Therefore, when you are building your app, you can set up similar criteria for as many features as you want, like code architecture, the front end, UX, user flows, and more. Have each part mentioned in the criteria carry a dedicated score so that the model can identify its importance based on how well it performs. These files are referenced in the evaluator agent because the evaluator's job is to score, so it knows what rubric it should be following. Given everything we have covered, you might wonder what you should actually do now. If you want a framework so that your setup is easier, go for GSD, because GSD inherently uses the planner-generator-evaluator loop by default. But its evaluator just matches the code against the existing plans and relies on user acceptance testing. It uses a pass-and-fail mechanism, not a scored implementation. Therefore, you can take the best parts of the Anthropic framework and combine them with GSD. For example, changing the evaluator agent and combining it with the criteria so that the agent knows what the right implementation is. But if you want to use Anthropic's framework and set it up on your own, you can implement it by creating agents based on their respective roles and have them work together using agent teams. You can use one agent team member as a generator and another as an evaluator. The reason for using agent teams is that they can communicate with each other, while sub-agents cannot and would have to write to a document, creating overhead.
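Before wiring that up, the graded rubric described above can be sketched as simple weighted criteria. The four criterion names come from the video; the weights and example scores below are illustrative assumptions, not Anthropic's actual numbers.

```python
# Illustrative rubric: criteria from the video, weights are assumptions.
RUBRIC = {
    "design_quality": 0.3,  # coherent feel vs components strung together
    "originality":    0.3,  # avoids the default purple-gradient look
    "craft":          0.2,  # typography, spacing, color harmony
    "functionality":  0.2,  # each component serves the user experience
}

def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-criterion scores (0-10) into one graded result."""
    assert set(scores) == set(RUBRIC), "evaluator must score every criterion"
    return sum(RUBRIC[c] * scores[c] for c in RUBRIC)

score = weighted_score(
    {"design_quality": 8, "originality": 5, "craft": 9, "functionality": 9}
)
print(round(score, 1))  # 7.5
```

A scored result like this, rather than a bare pass/fail, is what lets the evaluator tell the generator how far short a feature falls and on which dimension.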
Therefore, Claude creates the tasks from the high-level plan and creates both agents at the same time, where one is implementing while the other is running tests using the Playwright MCP, with the browser waiting for updates from the generator so that it can start the testing process. The evaluator keeps verifying the work and communicates the issues with the generator and they work in coordination to implement the whole app that matches your standards. Now, all the agents used here, along with all the resources, are available in AI Labs Pro for this video and for all our previous videos, from where you can download and use them for your own projects. If you found value in what we do and want to support the channel, this is the best way to do it. The link is in the description. That brings us to the end of this video. If you'd like to support the channel and help us keep making videos like this, you can do so by using the Super Thanks button below. As always, thank you for watching and I'll see you in the next one.
