Lindy, JP Morgan, And OpenAI All Built The Same Layer. Most Teams Haven't.
The speaker recalls real agent failures—from deleting emails to affecting production data—and introduces a new pattern that helps prevent these failures, emphasizing the importance of understanding how failures occur beyond jailbreaks or hallucinations.
A proven pattern to safely deploy agents: separate execution from judgment using a dedicated judge model, with Lindy as the blueprint for scalable, auditable agent control.
Summary
Nate B. Jones highlights a growing architectural pattern for enterprise-grade AI agents: pair an execution agent with a separate judge that verifies actions before they happen. Using Lindy as a public example, he explains why prompts alone aren’t enough to prevent real-world missteps like unauthorized emails or data changes. The key idea is specialization and a long-running-context mindset, where a dedicated judge model monitors the agent’s intent and actions. He emphasizes four action-impact levels and argues for a multi-outcome judge (not just yes/no) to handle drafts, escalations, and policy-compliant operations. The production takeaway is clear: you must classify actions, place the judge at the action boundary, and tune escalation rates to preserve trust and safety. Finally, he warns about correlated judgment and the current advantage frontier models have over open-source options, urging practitioners to design robust governance around agent memory and tool usage. For deeper detail, he guides listeners to a Substack post with concrete implementation patterns and metrics.
Key Takeaways
- Lindy solved a real-world failure by inserting a validator (judge) model that reviews an agent's proposed actions before execution, enabling safer real-world impact.
- Actions are categorized into four impact levels (read-only, reversible write, external, high-risk) to determine how hard the judge must guard and whether human approval is required.
- The judge should offer multiple outcomes (approve, block, revise, escalate) rather than a simple yes/no, to balance autonomy with safety and trust.
- Place the judge at the exact action boundary so every tool call or decision is checked in real-time, preventing wheel-spinning or unintended side effects.
- Beware correlated judgment when using the same model for agent and judge; frontier models mitigate this, but open-source options require extra caution and governance.
- Closed-source frontier models are preferred for the judge role; the judge should be at least as strong as the execution model and run as a separate supervisor over it.
- Memory governance and clear audit trails are essential for production-grade agent systems, as outlined in Nate B. Jones’s Substack guidance.
Who Is This For?
AI engineers and product leaders building enterprise agent systems who need practical, scalable governance patterns to prevent real-world harms and data mishaps.
Notable Quotes
"“The Lindy team started to dig in and try and figure out what would make a difference, how this could be fixed more usefully.”"
—Context for Lindy’s architectural pivot after a real failure.
"“The validator reads the justification, checks it against the available context, and decides yes, no, or something in between.”"
—Core description of the judge pattern.
"“The acting agent needs to justify what it wants to do to that model, cite evidence, and be extremely clear about its task scope.”"
—Emphasizes how execution must be constrained by justification.
"“In the new Lindy design system, a separate validator model or judge model reads the action and decides whether the model doing the action should proceed.”"
—Direct quote describing the two-agent separation.
"“You need a four-way split: allow, block, revise, or escalate.”"
—Highlights the necessary decision pathways beyond simple yes/no.
Questions This Video Answers
- How does a separate judge model improve safety for AI agents in production?
- What are the four action-impact levels for agent governance and why do they matter?
- Why is it dangerous to use the same model for both agent and judge in AI systems?
- What is the Lindy architecture and how can it be implemented in a real business?
- What memory governance practices are recommended for production agent systems?
Lindy (agent), LLM as judge, agentification pattern, fault-tolerant architecture, memory governance, action boundary, permission and escalation, correlated judgment, frontier models vs open-source models, Substack implementation details
Full Transcript
I can't tell you how many different times I have talked about the mistakes that agents have made. I've talked about the OpenClaw instance that started deleting emails until someone ran to unplug it. I've talked about the issues in production databases where LLM agents have deleted production data. And I've told you other stories, too. I've told you stories about real LLM agent issues where hacks occurred that affected public companies. These are all real stories. And the thing that we aren't talking about is the new pattern that has been appearing in agent setups and workflows over the last couple of months, a pattern that helps us avoid these kinds of situations. Because we'd all love to avoid this, right?
We don't want to be the person running to unplug the OpenClaw instance. We don't want to be the next piece of news about how the agent spent money that wasn't the agent's to spend. We don't want to be the ones saying, "Oh no, the company's in the news and it was my mistake because my agent went off the rails." Nobody wants that. We have a pattern now that actually works that I don't think enough people know about. This is almost a PSA video. Please, please, please make sure you understand how this pattern works, because once you do, it's very simple.
You will see how it applies in so many ways. But first, we need to talk about the failure mode: the way the agent screws up. I'm not talking about jailbreaks or hallucinations here. I'm talking about the agent doing the thing you trained it to do. Doing something past where it had permission. Inferring authorization from a thread that didn't grant it. Updating a record because the old value looked stale and accidentally deleting something. Maybe opening a pull request because the tests passed and the change looked done and nobody told it to wait, so it just went in and committed code.
We have built our agents to act, and we now need to build the layer that decides when and how they act. And so I want to go through five pieces in this video. I want to talk about how we fix this architecturally and how people are actually doing this in production. Then I want to talk about the failure mode that I see a lot of teams hitting before they start to build agentic systems. And then I want to get into three more pieces: what to avoid as you start to think about new technology in this situation and you start to think about new agents,
how to classify new agentic systems, and what you need to decide when you're setting up a strong agentic architecture. If you're building agents that touch multiple systems, this is especially critical. You can't bolt it on later. So, let's just dive in. I think the cleanest public example of the kind of architecture I'm talking about is actually from Lindy, so let me start there. Lindy is an agentic product that works across email and calendars and follow-ups and messages and connected tools. That breadth is what makes the product really useful to customers. It's also what makes the product inherently risky from a surface-area perspective, because an agent that can send an email or schedule a meeting on your behalf isn't just helping you think.
It's acting in the real world on your real relationships. The Lindy team has written about this directly. They thought about it during internal testing. They hit the failure every serious agent product eventually has to face: the agent started sending emails that had not been authorized. That's a real example. The irony is that Lindy created agents to take work off your plate, and so the agent thought it was being helpful by sending the email instead of checking first. But it wasn't. It was acting in the real world. And so when this failure occurred during internal testing, the Lindy team started to dig in and try and figure out what would make a difference, how this could be fixed more usefully.
And I want to run through what they did because I think it's instructive for anyone who's wrestling with how you build agentic systems that are stable. Lindy tried the obvious things to fix this. First, they tried better prompts and they tried requiring agent authorization, but neither of those really worked, for different reasons. The enforcement of prompting didn't work because even the strictest prompt does not hold across a really long context window. It just doesn't hold in the agent's memory. It doesn't work. That's not the way you guard agent behavior, and we know this in 2026.
Manual confirmation, which they also tried, does not work because you are training the user that the agent doesn't do the real task, and you are reminding the user that they can just click okay all the time. And those are both habits you don't want to instill in your users. You want your users to know the agent can do the work it's asked to do, and you want your users to know that if they are asked for permission, it will be rarely and it will really matter. If you don't have that behavior instilled, you run the risk of the user authorizing something they would never have intended to authorize, because you have trained them to just click okay out of habit.
This is famously what the entire European Union did when they put out their cookie policy a few years ago and now everyone just says, "Yeah, yeah, yeah, get out of my way cookie policy." And it's just a bunch of people clicking buttons and not paying attention. And so I admire the Lindy team for thinking carefully about the real solution to this problem because Lindy made an architectural move instead. In the new Lindy design system, a separate validator model or judge model reads the action and decides whether the model doing the action should proceed. The acting agent needs to justify what it wants to do to that model, cite evidence, and be extremely clear about its task scope.
The validator reads the justification, checks it against the available context, and decides yes, no, or something in between. So all this is, really, is two agents. One is trying to complete the task. One is trying to decide whether the proposed action aligns with you, the user, and your intent. And what makes this work is specialization coupled with the new capabilities of long-running models. Models work really well when we figure out specialization at the right grain, and I think this is a great example of Lindy finding a pattern that is at the correct grain for the agent capability we have today.
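To make the two-agent split concrete, here is a minimal sketch in Python. It is illustrative only: the proposal fields, the `call_judge_model` stub, and the yes/no/unsure verdicts are assumptions about one way you might structure it, not Lindy's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class ActionProposal:
    """What the acting agent must produce before it touches the real world."""
    tool: str                      # e.g. "send_email"
    arguments: dict                # the exact payload the tool would receive
    justification: str             # why the agent believes this is in scope
    task_scope: str                # the task the user actually assigned
    evidence: list = field(default_factory=list)  # citations into the context

def call_judge_model(prompt: str) -> str:
    """Placeholder for a real LLM call to a separate, frontier-class judge model."""
    raise NotImplementedError("wire this to your model provider")

def judge(proposal: ActionProposal, user_intent: str, context: str) -> str:
    """Ask the validator whether the proposed action matches the user's intent.

    The judge gets its own persona and never shares the actor's "get it done" goal.
    Returns "yes", "no", or "unsure" (something in between).
    """
    prompt = (
        "You are a validator. Your only job is to guard the user's intent.\n"
        f"User intent: {user_intent}\n"
        f"Task scope: {proposal.task_scope}\n"
        f"Proposed action: {proposal.tool} with arguments {proposal.arguments}\n"
        f"Agent's justification: {proposal.justification}\n"
        f"Evidence cited: {proposal.evidence}\n"
        f"Relevant context: {context}\n"
        "Answer with exactly one word: yes, no, or unsure."
    )
    return call_judge_model(prompt)
```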
Agents today can do multi-hour tasks. They are able to compose hundreds of tools together to do work. They have gigantic million token context windows. They're very powerful and designed to do work over time and work around obstacles. In order to keep them on the rails doing things in line with your intent, they need an equally powerful model with a different persona. A persona that instructs that judge model to only guard your intent. And so all of that agentic focus on task is dedicated only to checking whether your intent is followed. And what that does is it enables us to take advantage of these new stronger models.
But we can take the power of that new, stronger model and put it in an appropriate specialization, put it in a judge persona that allows us to dynamically check and guardrail what the agentic model in charge of actual execution is trying to do. It's perfect to me because it essentially allows us to look at the powerful capabilities we have today and apply a couple of simple personas that let us build a very scalable system. Now, you might wonder: Nate, you dismissed prompts. Why don't prompts matter here? I thought you've talked about prompts and why they're important.
Well, prompts do matter, but prompts cannot do a policing job. And I'll give you an example for this. Take a sales follow-up. Let's say a prospect replies, "Can you send over the pricing deck?" The actor agent would reasonably infer that sending the deck is the next helpful step. Several questions sit underneath the inference that agent just made in my example. Did the user authorize this kind of sales deck send? Is this the right deck? Is it a current deck? Does it contain non-public pricing? Is the prospect under NDA? Did the agent start the thread and now treat the other person's reply as permission to keep going?
None of these are language questions. All of these are actually control questions. They depend on authorization and policy and context and they generate real consequences. If you write a prompt asking the agent to pursue sales and to police a task, it will tend to pursue instead of to police. And the reason why is very simple. The agents are designed to get the job done. And you must give the agent a clear overarching goal number one in order to take advantage of that power. And if goal number one is get the sale, the agent is going to naturally optimize for that task.
You cannot have the same agent optimizing for two different primary goals. That's the trap. Now, the simple answer in 2025 was to trust the human. That was the other major approach: have the human check it. Well, for one, human attention is getting scarce. I am talking to many, many people who say they have a dozen agents running. I know Boris Cherny of Claude Code talks about having hundreds of agents running. You do not have spare time to be looking at individual actions and approving them in the course of an agent run.
We are out of time for that. That is not going to work as a real workflow in 2026. It's just not viable. We have scaled past it. We need another option, and LLM as judge offers us a way to scale human attention. Now, this of course raises the obvious question: how do you know which actions need which kind of judgment? You start by classifying them. I would group agent actions into four buckets, and I think the line that separates them is the degree to which an agent action has real consequences. This helps you shape the way you set up your LLM as judge. Read-only actions are very light: you can retrieve, you can summarize, you can inspect. There are no external side effects, and you don't need a heavy judge here unless the action involves sensitive data or the agent has the option to do more than read. Then there are reversible writes: drafts, labels, internal notes, local files.
This kind of action absolutely affects a shared internal system, and you do need validation, but it may not need an audit trail depending on the tool set you make available to the agent. If the tool set available to the agent includes permanent write or permanent delete, always, always, always, you need a very, very tight judge pattern. So those are the first two levels of impact, and they're relatively light. The third level of impact is a little more serious. This involves sending messages, booking meetings, stuff that has external impact: posting publicly, opening pull requests, notifying customers.
These touch other people and systems outside your agent's private workspace. These must, must, must pass through a strong judge layer that guards your intent before allowing execution, every time. And then of course there are high-risk actions: spending money, deleting data, changing permissions, merging code, submitting legal or financial work. Best practice these days is judge plus a human approval path, unless you have written an extremely narrow and explicit policy that permits automation in one or two circumstances and guards them. This classification matters because it's practical. It allows you to set up LLM-as-judge systems that really work in production.
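As one way to encode those four buckets in code, here is a sketch; the level names, the tool mapping, and the policy table are illustrative assumptions, not a prescribed taxonomy.

```python
from enum import Enum, auto

class Impact(Enum):
    READ_ONLY = auto()         # retrieve, summarize, inspect
    REVERSIBLE_WRITE = auto()  # drafts, labels, internal notes, local files
    EXTERNAL = auto()          # emails, meetings, public posts, pull requests
    HIGH_RISK = auto()         # money, deletes, permission changes, merges

# Illustrative mapping from tool names to impact levels; yours will differ.
TOOL_IMPACT = {
    "search_inbox": Impact.READ_ONLY,
    "draft_email": Impact.REVERSIBLE_WRITE,
    "send_email": Impact.EXTERNAL,
    "delete_record": Impact.HIGH_RISK,
}

# Per-level policy: does the judge review it, and is human approval required?
POLICY = {
    Impact.READ_ONLY:        {"judge": False, "human": False},  # heavier if data is sensitive
    Impact.REVERSIBLE_WRITE: {"judge": True,  "human": False},
    Impact.EXTERNAL:         {"judge": True,  "human": False},
    Impact.HIGH_RISK:        {"judge": True,  "human": True},
}

def requirements_for(tool: str) -> dict:
    """Unknown tools default to the strictest treatment."""
    return POLICY[TOOL_IMPACT.get(tool, Impact.HIGH_RISK)]
```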
I'll tell you, the common failure modes when people start to design these systems are to fall off on either side of the track here. On one side, you might treat every action as harmless, and that leads to an unacceptable risk profile. On the other side, you might treat every action as catastrophic or risky, and then nobody's going to want to use it because of all the approval patterns. That's what the Lindy team ran into. So once you've classified the actions, the judge belongs at the action boundary. You need to be very clear about instantiating the judge review.
When an agent makes a tool call, when an agent takes an intended action, when an agent produces a proposal to make a decision that involves writing or deleting data, whatever your exact profile is, your judge needs to be right there where the LLM is triggering that plan and say, I need to check this. And this is something, by the way, that we see in other systems. Codex does this with its auto-review system today. That's effectively what Codex does: the judge comes out and says, "Ah, I'm going to check this before you make this tool call." Sometimes it approves, sometimes it denies.
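Here is a minimal sketch of putting the judge at the action boundary, assuming your framework lets you wrap tool execution; `judge_fn` and `tool_fn` are placeholders for your own judge call and your own side-effecting tool, not a real API.

```python
class ActionBlocked(Exception):
    """Raised when the judge refuses a proposed tool call."""

def guarded_call(tool_name: str, tool_fn, arguments: dict, proposal, judge_fn):
    """Route every tool call through the judge before anything touches the world.

    tool_fn is the real side-effecting function; judge_fn takes the proposal
    and returns a verdict string. Nothing executes unless the verdict is "allow".
    """
    verdict = judge_fn(proposal)
    if verdict != "allow":
        raise ActionBlocked(f"{tool_name} blocked by judge: {verdict}")
    return tool_fn(**arguments)
```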
Now, if you want to get into how you design dual agentic systems like this for each of these agentic risk classes, if you want to look at read-only, write, external actions that touch customers, and how you handle really sensitive stuff like money, I go in depth on agentic system design in the Substack post for each of them. It's just too long and detailed for this video. What we need structurally is an understanding that the judge-and-agent system is a pattern we can repeat, but the pattern needs to have the correct nuance for the data we're handling.
And that's why I've taken the time to go through these four briefly, and you can get all that detail on Substack. Now, let's get into a piece most people miss. Most people think about these judge systems and naturally assume, well, we have LLM as judge, it can be a yes or a no, and we move on. Didn't Nate say simple primitives? Not quite. You actually want to give the judge a very intentional decision scope. Most production workflows almost always need a middle path. The right answer is often that the agent should draft the email but not send it, or archive the record instead of deleting it, or remove the attachment, or ask for explicit approval from the human, or route the decision to legal.
So the judge needs multiple outcomes, not just two. It needs to be able to allow the agent to do the thing. It obviously needs to be able to say no and block. It needs to be able to ask the agent to revise. And fourth, it needs to be able to force an escalation to a human or a higher-trust process. That four-way split is the difference between an LLM control layer that people tend to build around and bypass because it's just too simple (yes and no is too simple) and a sophisticated LLM control layer that the agent can use to get real work done and that humans tend to trust.
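One way to sketch that four-way split in code; `execute`, `revise`, and `escalate` are placeholders standing in for whatever your system actually does at each branch.

```python
from enum import Enum

class Verdict(Enum):
    ALLOW = "allow"        # execute the action as proposed
    BLOCK = "block"        # refuse, and keep the record of why
    REVISE = "revise"      # send the proposal back to the acting agent with feedback
    ESCALATE = "escalate"  # route to a human or a higher-trust process

def handle(verdict: Verdict, proposal, execute, revise, escalate, audit_log: list):
    """Dispatch on the judge's verdict; every decision lands in the audit log."""
    audit_log.append((proposal, verdict))
    if verdict is Verdict.ALLOW:
        return execute(proposal)
    if verdict is Verdict.REVISE:
        return revise(proposal)    # e.g. "draft it, don't send it"
    if verdict is Verdict.ESCALATE:
        return escalate(proposal)  # e.g. ping the task owner for approval
    return None                    # BLOCK: do nothing; the log holds the record
```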
You need to think about your escalations in that world as a rate: if it's too low, it's dangerous, and if it's too high, it's going to damage trust in the agentic system and make your humans really annoyed. If you want more advice on tuning that rate, I've got some specific scenarios for different industries on the Substack, too.
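If you want to treat the escalation rate as a tunable number, a sketch like this works; the floor and ceiling thresholds here are made-up illustrations, not recommendations.

```python
from collections import Counter

class EscalationMonitor:
    """Track what fraction of judged actions get escalated to a human."""

    def __init__(self, floor: float = 0.01, ceiling: float = 0.15):
        self.counts = Counter()
        self.floor = floor      # below this, the judge may be rubber-stamping
        self.ceiling = ceiling  # above this, humans stop trusting the system

    def record(self, verdict: str) -> None:
        self.counts[verdict] += 1

    def check(self) -> str:
        total = sum(self.counts.values())
        if total == 0:
            return "no decisions recorded yet"
        rate = self.counts["escalate"] / total
        if rate < self.floor:
            return f"escalation rate {rate:.1%} looks dangerously low"
        if rate > self.ceiling:
            return f"escalation rate {rate:.1%} will erode user trust"
        return f"escalation rate {rate:.1%} is in range"
```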
Now, I want to flag the failure mode that should worry you if you're trying to figure out how to design the system. It's somewhat model dependent, and it's called correlated judgment. The idea is that if your actor and your judge use the same model, the same context, the same prompt style, and the same assumptions, they share blind spots. The challenge is this: that is true, but it is much less true of cutting-edge models in May of 2026 than it was in December or November of last year. And so I want to call out that one of the ways you see a difference between cutting-edge frontier models like Opus 4.7 and GPT 5.5 and cutting-edge open-source models, new open-source models, or even older closed-source models is this.
It's their ability to handle this kind of nuance and challenge. It's their ability to generalize well and not get caught up in correlated judgment and correlated blind spots. Part of why the entire LLM-as-judge pattern works is because we assume the judge is probably going to be a closed-source, truly frontier model like GPT 5.5. If it is not, if you are using a Qwen model, an older Gemini model, or an older Claude or GPT model, you may find, if you use the same older model for action and as judge, that you run into issues of correlated judgment, where the model assigned to be the judge acts like a 2024 or 2025 judge, looks at the problems and issues the agent brings it with an overly favorable bias, tends to just over-accept whatever the model is proposing, and is relatively easily manipulated.
So this is a problem, but it is much, much less of a problem with current frontier models, to the point where I would not classify it as a primary failure mode. And when I'm designing the system, that is not the thing I would worry about. I would worry much, much more, if I'm designing with frontier models, about the overall scope and boundary conditions of the agentic system. What am I allowing the system to touch? What am I allowing the system to write or delete? Those are the things I would obsess over with current frontier models, because their capabilities are so strong.
I would worry less about whether the LLM-as-judge correlation problem is a real issue, because on frontier models in May of 2026 it's just much, much less prevalent, almost not an issue at all, versus six or eight months ago, when it was a real issue. So think about that. People ask me, why can't open-source models just be good enough to do all this work? This is a great example of why you do not want them doing this job over their own model generations. You don't want a Qwen model judging a Qwen model in that way. You would prefer to have a more powerful closed-source model acting as a judge.
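A trivially small sketch of guarding against correlated judgment at configuration time; the model names below are placeholders, and the only point is that the actor and the judge should not share a model.

```python
# Placeholder model names; the point is only that actor and judge differ.
AGENT_CONFIG = {
    "actor": {"model": "open-weights-agent-v2", "persona": "complete the assigned task"},
    "judge": {"model": "frontier-judge-x", "persona": "guard the user's intent only"},
}

def check_decorrelation(config: dict) -> None:
    """Fail fast if the actor and judge would share the same blind spots."""
    if config["actor"]["model"] == config["judge"]["model"]:
        raise ValueError("actor and judge use the same model: correlated judgment risk")
```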
Okay, so let's step back for a second. The pattern across this whole conversation is that agents are starting to look a lot less like workflows running down a track, and certainly less like chatbots. They're looking more and more like managed workers. They're also not looking like swarms, by the way; swarms was an idea in 2025 that I don't think has aged very well. A managed worker needs task assignment, communication, context, permission, supervision, correction, and a work record.
The first wave of agent products spent a lot of its energy on just getting agent workers stood up so they can do that stuff. This next wave, what I'm talking about with LLM as judge, is the management system for your agent worker. Your agent needs a manager. That's kind of what the judge is. The whole product used to be the agent, and that's just not true anymore. Now it's the system around the agent that lets it act without turning every action into a gamble. And we are realizing that one of the key elements of that, one that is relatively simple to put in place, is a cutting-edge model acting as a judge on your intent,
one that is responsible for checking the actions, proposals, and tool calls of the acting agent and making sure it is not acting inappropriately. And that's really, really important. It reduces your risk materially. Now, if you want to get into this in depth, the article on Substack goes way deep on this. I think it's one of the most important architectural patterns of the next two or three months as we build agentic systems that do real work. So, I'm going to get into a full action proposal format by action type, with the exact fields an agent would have for an outbound email, a pull request, a CRM update.
I'm going to get into how you think about a general judge versus a specialist judge and which split you want to set up in your agent system first. I'm going to look at the metrics I'd track on the judge as if it were its own product, broken down by the particular action classes it's approving, so you can measure how effective your judge is. Because eventually, when you build these systems, you need to think about the judge as an agent you will want to upgrade to a new agent, and you need to know how to do that in a way that is responsible, that tests the metrics, and so on.
I also get into the memory governance model, including why agent-written memory needs to be handled very specifically and carefully, and how you distinguish agent-written memory in judge systems. I put the link in the description. If you're actually building this, that's where all the implementation detail lives. And if you enjoyed this, if you want to dig in further on the key concepts that are shaping how agent work is actually getting done in the enterprise, I talk to enterprises all day. This is what I like to talk about.
Uh, subscribe. Have fun. We'll talk more tomorrow. Cheers.