The prompting playbook

Claude | 00:33:49 | May 22, 2026
Chapters: 13
Speaker introduces the prompting playbook, two practical scenarios, and the goal of moving beyond simple dos and don'ts to hands-on, evaluation-driven practices.

A practical playbook for improving prompts with structured prompts, evaluation, tools, and agentic loops to fix production prompts and build new AI agents.

Summary

Margot Vanlar of Anthropic leads a hands-on session on the prompting playbook, focusing on real-world prompts in production and new agent design. She emphasizes evaluations as the backbone for understanding whether prompt changes actually improve performance across models with different capabilities. The talk walks through a miniaturized Meridian Mobile customer-support prompt, showing how crowd-sourced patches and model migrations can degrade behavior if not properly managed. Starting with general hygiene, Margot demonstrates how structuring prompts with clear roles, policy, and tone—and adding an explicit output contract—can boost reliability. She then shows how introducing a calculation tool and a proper tool schema fixes arithmetic failures, and why “instructions” don’t confer actual capability. For a new agent task (week-long retail staff scheduling), she contrasts single-prompt approaches with an agentic loop that splits generation, evaluation, and repair into separate prompts, achieving better results with lower latency and token usage. Across both scenarios, the key lesson is to use eval suites, versioned patches, and modular prompts to avoid overfitting and maintain a clean, auditable prompt architecture. Margot closes by highlighting how agentic loops can enforce soft constraints at runtime, offering practical paths from theory to production. The session is a vivid reminder that prompt engineering is an iterative craft anchored in measurable outcomes, not guesswork.

Key Takeaways

  • Use an eval suite with control, edge, and capability tests (e.g., plan data, billing, and escalation) to detect regression when migrating to a new model.
  • Add structured prompt sections (role, policy, tone, data, and an output contract) and use stop sequences to stabilize output formats.
  • Introduce explicit tools (like a proration calculator) and define a tool schema; this provides capability without relying on brittle mental math in prompts.
  • For complex outputs, consider structured formats (XML tags or JSON) and harness-level controls to improve consistency and testability.
  • When upgrading prompts, track changes with version control and document the rationale to backtrack if needed.
  • Build new agents by modularizing tasks: generation, evaluation, and repair prompts; an agentic loop often outperforms a single, larger prompt.
  • Soft constraints can be enforced at runtime with generate-evaluate-repair prompts to adapt to case-specific requirements.

Who Is This For?

Essential viewing for AI engineers and product teams who maintain production prompts or build new AI agents. The talk provides actionable techniques for prompt hygiene, evaluation-driven iteration, and agentic design to ship reliable AI features.

Notable Quotes

"So what's actually going on here? Well, in order to start unpacking that question, um we need a starting point. And that starting point is evaluations."
Margot emphasizes evaluation as the foundation for understanding prompt changes.
"Instructions don't add capability. Telling the model it's critical to do a calculation right doesn't make it better at mental math. So the correct approach was to give it a tool."
Highlighting the core principle of using tools rather than relying on prompts for complex tasks.
"We need to have an eval suite to act as a way of testing that regression so that we can apply our prompting best practices to that."
Underscoring the role of evaluation in detecting regressions after migrations.
"Give this balanced view where it says, you know, customers on grandfathered plans have different allowances, but it's captured in the customer information that's given and that is the accurate source of truth."
Illustrates how to handle edge cases and avoid hallucinated or outdated policy details.
"Instead of one big prompt for everything, we split into three prompts: generator, evaluator, and repair. The agentic loop solved all our test cases with lower token counts and latency."
Showcases the practical benefit of an agentic loop for scheduling tasks.

Questions This Video Answers

  • How do you set up an evaluation suite for prompt changes when migrating to a new LLM model?
  • What is an output contract in prompt engineering and how does it improve reliability?
  • What are the benefits of an agentic loop (generate-evaluate-repair) over a single, monolithic prompt?
Prompt engineering, Eval suite, XML prompt structure, Proration calculator tool, Agentic loop, Meridian Mobile, Anthropic Claude, Model migration, Output contract, Stop sequences
Full Transcript
Hello everyone. Thank you so much for joining me this afternoon in the breakout room, for the last session today of Code with Claude. I hope you've all had a fantastic day so far. My name is Margot Vanlar. I am an applied AI engineer at Anthropic here in London. And this afternoon, we're going to be talking about the prompting playbook. Prompting is arguably one of the first skills, if not the first skill, that we had to learn as engineers when we first started to work with LLMs, and even now it continues to be one of the most critical skills for building effective AI systems. So today we're going to discuss some best practices in the context of two practical scenarios that you're probably encountering at work. The first is where you have an existing prompt in production that you've been maintaining for some time, and possibly you're migrating it to a new model or making a change to the architecture, and for some reason it's no longer working as well. The second scenario is where we're building an entirely new agentic use case from the ground up and we need to build the prompt from zero to one.
Now, in order to illustrate these best practices, I don't just want to give you a list of dos and don'ts. I want to walk through a practical example that's been inspired by real prompts that I've seen some of our customers work with who are building on Claude. So the prompt that we'll look at today is a miniaturized example. The prompts that you're working with are probably a lot longer and more complex than the one we'll see today, but it's representative of some common problems that you might encounter when maintaining a prompt. So imagine that we have a prompt that multiple people have been collaborating on and contributing to. There's no clear owner. It covers a lot of different areas like policy, tone, and processes. We have some patches for previous models that we've migrated to, all mixed together. It's built up and it's complex.
And when we're migrating to a new model, we're finding that suddenly a lot of our test cases are no longer working as well as we expected. So what's actually going on here? Well, in order to start unpacking that question, we need a starting point. And that starting point is evaluations. We need evaluations to provide that rigor, to understand whether a change to our prompt is actually correlating to an improvement in its performance. We have different models which have different capabilities and different behaviors. And when you migrate to a different model, it could be that your system is no longer working as well for two reasons. In the first case, the new model might be capable, but it's behaving differently, and therefore we can tune our prompting to fix that behavior. The second case is where the model that we're changing to actually isn't as capable, and no amount of prompting is going to fix that. So we need to have an eval suite to act as a way of testing that regression so that we can apply our prompting best practices to that.
So in the example that we're going to be looking at today, as I said, it's going to be a miniaturized example. We'll have five test cases in our eval. In reality, you'll have a lot more test cases in your eval suite, but the key thing here is that it's representative of three key cases that we need to cover. The first is a control case, which is a case that should always pass. It's something that we know the model handles well. It's unambiguous. The second is edge cases. These are cases where we've seen the model fail before, and by including instructions in the prompt, we're making sure that same behavior doesn't slip through again in the future. And finally, and critically, we need to make sure that the model has a good understanding of the extent of its capabilities: where it should be handing off to a human, or where maybe it should be point-blank refusing to answer a request.
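As a rough illustration of the three kinds of cases described above, an eval suite can be sketched as a list of tagged cases plus a comparison between two runs. Everything named here is an assumption made for the sketch: the `EvalCase` structure, the case names, the "10 GB" expected string, and the simple substring graders (the talk grades with an LLM judge, not string matching). `model_fn` stands in for whatever function calls the model with the prompt under test.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    name: str
    category: str  # "control", "edge", or "capability"
    user_message: str
    passes: Callable[[str], bool]  # hypothetical grader, not the talk's LLM judge


# Hypothetical cases; the expected strings are illustrative only.
CASES = [
    EvalCase("basic_plan_data_limit", "control",
             "What's the data limit on the basic plan?",
             lambda reply: "10 GB" in reply),
    EvalCase("legacy_hotspot_allowance", "edge",
             "How much hotspot data is on my unlimited plan?",
             lambda reply: "5 GB" in reply),
    EvalCase("billing_error_escalates", "capability",
             "I was charged twice this month.",
             lambda reply: "escalat" in reply.lower()),
]


def run_suite(cases: List[EvalCase], model_fn: Callable[[str], str],
              trials: int = 3) -> Dict[str, float]:
    """Return the pass rate per case over several trials, since runs vary."""
    rates = {}
    for case in cases:
        passed = sum(case.passes(model_fn(case.user_message)) for _ in range(trials))
        rates[case.name] = passed / trials
    return rates


def regressions(baseline: Dict[str, float], candidate: Dict[str, float]) -> List[str]:
    """Names of cases whose pass rate dropped relative to the baseline run."""
    return sorted(name for name, rate in candidate.items()
                  if rate < baseline.get(name, 0.0))
```

Running the same suite against the old and new model (or the old and new prompt) and diffing the pass rates is what turns "it feels worse" into a list of specific cases to target.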
So, in the example that we're going to be looking at today, we'll be using a prompt for a customer support bot for a telco company called Meridian Mobile. And these are the five test cases that we are going to be looking at today. We have a simple control case looking at, you know, what's the data limit in the basic plan. We're also looking at edge cases such as its ability to do calculations, such as calculating prorated bills: if I switch my plan halfway through the month, what will my bill look like? We want to check that it's accurately addressing key questions which are covered by our policy. We need to make sure that it's escalating to a human whenever there is a billing error. And finally, we want to make sure that our model isn't withholding any information that it has access to, which it should be handing over to the customer.
So what we're going to do in this process is take our prompt and run it on our v0 of the eval. We'll see what our failure modes are and systematically target those failure modes one at a time to see if we can resolve them by prompting, and along the way we'll learn a little bit more about the anti-patterns and traps to avoid. This is representative of how we would apply these best prompting techniques in practice, right? We are rarely writing a prompt from scratch. We're often debugging an existing prompt. And best practice, before we start targeting those failure modes specifically, is to apply our general prompting 101 best practices, applying general hygiene to clean up before we do the eval run. So let's have a little look at the example that we're going to be using. What we're looking at here first of all, before we look at the prompt, is just this vibe-coded web app that I've made for the presentation today so that we can look at how we're iterating on the prompt together.
In this page here I can easily run my evals on all five test cases and inspect the results in a little bit more detail. So before we have a look at the prompt, I'm just going to run the evals in the background. This is a pretty good first pass at a prompt. When we look at this, we've defined the bot's role at the top. When we scroll down, we've given it some data. We've given it some information on how to reason over the answers that it should be giving to the customer. It's giving some critical instructions around the tone it should use, how to do calculations, etc. And then finally we're passing in our customer account context and our user message. So let's have a look at how our first pass at the evals did. We can see, as we expect, that for our control case all of our test cases have passed. This is what we expect for this unambiguous test case. But it's performing pretty poorly in these other areas.
Now, before we zoom in on those specific failure modes here, let's do some general cleanup of our prompt. So, as we mentioned, when we look through this prompt, there are a couple of oddities here already. For example, the first one is we're telling the bot that it's a human, which just isn't true, right? We can see as we scroll down there's clearly some information here that has been copied directly from a website. The key giveaway here is a reference to a hero image. There are even some references to cookies at the bottom. So we need to remove a bit of redundant information. When we look at the instructions here, they're all grouped into one big paragraph. So we've got some reasoning here, we've got instructions about the role, and some critical instructions as well, without a real way of unpacking policy from guidelines from tone, etc. So I've preempted some changes we want to make to this prompt, and this is just a diff view of some of those changes. What we've done is first of all added some structure.
So you can see that we've added XML tags here to define the role, to separate general guidelines, to separate policy, to separate tone of voice, etc. So if we run that eval on this new updated prompt, we should hopefully see an improvement in the output as is. We can see that just by clearing up the prompt, we've already improved the model's performance on this prepaid scenario. There's an interesting regression there in that fifth hotspot case, and I don't want to worry too much about that now. There's going to be some natural level of variance in the different runs of the eval, and we'll come back to that case specifically to see if we can make the prompt consistently better in that area. So what did we learn from this? Simply clearing up the prompt with a better structure and a better role description has improved the performance. And this is a best practice that you can return to at any stage of writing and maintaining your prompt, especially as your prompts get more detailed and more complex. A general rule of thumb that I like to follow is: if you're reading a prompt and you can't tell guidelines from policy from data, most likely the model isn't able to either.
So before looking at some of those cases in more detail, there's a little bit more general cleanup we can do, specifically looking at creating an output contract. This is a key best practice to follow if you're struggling with output format consistency. Now in this case, we have a customer support bot. We want it to reply in a conversational tone, so it's unlikely to be a big issue in this case, but it's something to bear in mind if you're dealing with more complex output structures like nested JSON, for example. So again, if we go back to the prompt and see what fixes we can apply here: first of all, we've added a section at the end where we've defined an output format for the model, telling it to use XML tags to output the response.
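A minimal sketch of the structure and output contract described above, assuming hypothetical section names (`role`, `policy`, `tone`) and a hypothetical `response` tag for the output; the actual tags used in the demo prompt are not shown in the transcript. The parser tolerates a missing closing tag, because a harness that stops generation on the closing tag typically returns the text without it.

```python
import re
from typing import Dict, Optional


def build_system_prompt(sections: Dict[str, str]) -> str:
    """Wrap each named section in its own XML tag so role, policy, tone and
    data stay visibly separate in the prompt."""
    return "\n\n".join(
        f"<{tag}>\n{body.strip()}\n</{tag}>" for tag, body in sections.items()
    )


def extract_response(raw: str, tag: str = "response") -> Optional[str]:
    """Return the text inside <tag>...</tag>, or None if the opening tag is absent.

    The closing tag is optional: if generation was cut at the closing tag,
    everything after the opening tag is taken as the response.
    """
    match = re.search(rf"<{tag}>(.*?)(?:</{tag}>|\Z)", raw, flags=re.DOTALL)
    return match.group(1).strip() if match else None


# Hypothetical section contents, for illustration only.
prompt = build_system_prompt({
    "role": "You are a customer support assistant for Meridian Mobile.",
    "tone": "Friendly and concise.",
    "output_format": "Put your reply to the customer inside <response> tags.",
})
```

Treating a `None` result as a format failure gives the eval a cheap, programmatic check that the output contract was honoured, independent of whether the answer itself was right.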
But the prompt is not always the most effective way of handling issues. We can also change things in the harness to ensure consistency to a higher degree. So what we've added here to the API call is a stop sequence, which is going to detect that closing XML tag and tell the model to stop generating a response at that point. Now when I run the eval here, I don't necessarily expect to see any clear improvement in performance, but it's a general best practice that we should be following and, as I said, it's something that we should remember in particular when we have more complex output schemas. One thing to point out here as well: if you do have a more complex output schema, something like structured outputs can be incredibly helpful to ensure that consistency in a more programmatic way.
Okay. So after the cleanup, we can see that we now have two test cases which are consistently passing, but we have three key failure modes: the proration, the billing error, and the hotspot. So let's isolate these one by one to iterate on the prompt and see the effect of that. First of all, then, the hotspot question. The question is: how much hotspot data is on my unlimited plan? What we expect the model to do is state directly the amount of hotspot data that the customer has. And the reason this is a slightly complex case is because the customer in this test case is on a legacy plan, so actually the current policy doesn't apply to them. If we see what's going on in the actual test case here, the customer data which we're feeding to the prompt includes the amount of hotspot data that customer has. They have 5 gigabytes, right? But they also have a grandfathered plan. So what the model is actually telling the customer is: the unlimited plan includes 4 GB, but since you're on a legacy plan, you should go check this out yourself.
So let's have a look at the prompt to see why the model is deflecting this question to the customer account URL rather than actually giving the information itself. Now if we read this prompt, originally it said: we changed our plans recently, the policy doc shows the current plan data, and customers on grandfathered plans have different rates. Never give a customer the wrong plan details; instead, point them to the URL. So it's clear that this latter instruction, never give a customer the wrong information, is the instruction that the bot has been optimizing for. And you might recognize this as being very similar to a patch that you might have introduced for a previous model that you were using, to avoid cases where the model was giving the customer the wrong information about their plan. Now, as our models have evolved, they have gotten much better at instruction following. So it's likely that instructions like these have now become redundant and are actually being overfitted to. So what we're going to give the model instead is this balanced view, where it says, you know, customers on grandfathered plans have different allowances, but that's captured in the customer information that's given, and that is the accurate source of truth. Running the eval here, we should hopefully be addressing all of the test cases for the hotspot case. Now, I am running this live, so there could be some variability here, but we see that now clearly all of our test cases are passing.
So what did we learn from this? Well, we worry a lot about hallucinations, or the invention of facts and numbers, but actually the opposite can also happen. The model can withhold information that it actually has access to. Now, we saw here that this is likely a result of a patch that we introduced for a previous model.
And a best practice that we could follow here is actually using version control, where wherever we are making defensive changes in the prompt, we are tracking the reason why we've introduced them. Sometimes they're necessary, but in the future these kinds of changes can produce unwanted effects, and tracking them means we can backtrack on them.
So the next failing test case is this proration calculation, where a customer asks: what if I upgrade to the 30 gigabyte plan? What will my next bill be? What we want the model to do is perform some calculation and return exactly what their next bill would be, rather than giving some sort of vague output, which is what we can see it's doing right now. If we look at what the model is returning, it's clearly reasoning through it. It's doing a little bit of mental maths here and there, but it's not really giving the customer a concrete answer. And I wouldn't rely on this to accurately give the customer a response. So if we look at the original prompt to see how we can fix this, we can see that all the instructions it was given are telling it: don't ever give a customer a vague answer. Critical: always calculate any prorated amounts correctly. Now, telling the model to do a good job isn't particularly helpful when we don't give the model the capability to actually do a good job. We want to avoid the model doing mental math. So what we're going to do is give the model a tool. We're saying in the prompt: whenever you're doing any calculations, please use the calculate proration tool to do so. In order to introduce that tool, we need to add it to the API call to tell the model it has access to this tool. We need to define the tool schema, which tells the model what this tool does and when to use it. And then finally we need to actually implement the tool, which is the maths behind how it should be doing that calculation.
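The two pieces named above, a tool schema and the tool's implementation, might look roughly like this. The schema follows the name / description / `input_schema` shape used for tool definitions in the Anthropic Messages API; the parameter names and the proration rule itself (charge each plan for the days it was active in the cycle) are assumptions for illustration, since the talk does not show Meridian Mobile's actual formula.

```python
from decimal import Decimal, ROUND_HALF_UP

# Tool definition passed alongside the prompt. Parameter names are hypothetical.
CALCULATE_PRORATION_TOOL = {
    "name": "calculate_proration",
    "description": (
        "Calculate the bill for a billing cycle in which the customer switches "
        "plans part-way through. Use this for any prorated amount instead of "
        "doing the arithmetic yourself."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "old_monthly_price": {"type": "number", "description": "Price of the current plan"},
            "new_monthly_price": {"type": "number", "description": "Price of the plan switched to"},
            "days_in_cycle": {"type": "integer", "description": "Days in the billing cycle"},
            "switch_day": {"type": "integer", "description": "1-indexed day the new plan starts"},
        },
        "required": ["old_monthly_price", "new_monthly_price", "days_in_cycle", "switch_day"],
    },
}


def calculate_proration(old_monthly_price: float, new_monthly_price: float,
                        days_in_cycle: int, switch_day: int) -> float:
    """Assumed rule: bill each plan for the fraction of the cycle it was active."""
    if not 1 <= switch_day <= days_in_cycle:
        raise ValueError("switch_day must fall inside the billing cycle")
    days_on_old = switch_day - 1
    days_on_new = days_in_cycle - days_on_old
    # Decimal avoids float drift; str() keeps the value the caller wrote.
    old = Decimal(str(old_monthly_price))
    new = Decimal(str(new_monthly_price))
    total = (old * days_on_old + new * days_on_new) / Decimal(days_in_cycle)
    return float(total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))
```

When the model responds with a tool call, the harness would run `calculate_proration(**tool_input)` and send the number back as the tool result, so the arithmetic is done by code and the model only explains it.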
So running that eval for another pass, we can see that all the test cases are now passing. It's clearly done the maths using the tool in the background and returned the correct response. So the key lesson to take away here is: instructions don't add capability. Telling the model it's critical to do a calculation right doesn't make it better at mental math. So the correct approach was to give it a tool. Overall, we're giving it the ability to reason over harder problems and to use tools to actually execute them reliably.
So now we have one final failing test case which we need to address, which is this billing error here. In this scenario there is a billing conflict, and what we really want is for the agent to escalate this to a human. What we're seeing it do instead is try to explain to the customer what the reason behind it might be, and try to diagnose the problem itself. So in order to fix this behavior, let's again have a look at what it was told in the prompt. We see in the initial instructions it was given, it says, "Avoid escalating or transferring to a care specialist unless absolutely necessary, as it costs approximately $8 and it counts against our team's first contact resolution." Now, this is only giving one side of the story, right? We're telling it what the cost of escalating is, but not the benefit, which means it's going to overfit again to not escalating in this scenario. And second of all, we've got this clear conflict between what we've defined in the eval, in terms of wanting the model to do this escalation, versus what we're actually telling it to do. The fix that's relevant here is to give it both sides of the story by saying it costs $8 to escalate a case, but if you get this wrong, then it's going to cost you a refund as well as customer trust. Again, here we observed how the model optimizes for a goal. And this kind of instruction is a common instruction to give.
It's quite similar to the one we saw earlier, where we didn't want it to overfit to a certain type of behavior. But it's the kind of instruction that can be followed quite differently by different generations of models. And specifically, as models become more intelligent, we need to remember to state both sides of the trade-offs, because our models are becoming better at making those trade-offs themselves. If we just go back to our eval and run our final test case, we should see that all of our evals are now passing correctly. So overall, we looked at applying general hygiene principles and how that can provide an initial uplift to the prompt, making sure we're removing any redundant instructions which were initially intended as patches for previous models' behavior, and making sure we're giving it tools to do certain tasks reliably.
Now there's one other scenario that we introduced at the start, which is one that you might also encounter in your work, which is where we're building a new agent from scratch. The example that we'll look at here is an agent whose purpose is to create a week-long retail staff schedule based on employee availability and other constraints. And when we're building a new agent from scratch, we need to consider not just the prompt, but also the model that we're using and the harness that we're using. So in this next example, we're going to compare a number of approaches to explore the impact of those three different areas. Again, I've just vibe-coded up this web app so that we can walk through this problem. In this demo here I've just laid out what the problem is that we're addressing. We have our eight employees.
On the right we have this schedule that we need to staff with the headcount, and we have our constraints that must be satisfied in every schedule. Now, because we have these hard rules, rather than using an LLM judge to do the grading like we did in the previous case, we can actually use just a Python function which programmatically checks, for every schedule that's generated, how many violations were made. So to begin with, we want to start simple. We're going to use a simple prompt, the bare bones that we think we'll need, with Sonnet 4.6 as the model, to see how it performs and how we're going to hill climb against that. So here is our baseline prompt. We've already applied some of that general hygiene and those best practices that we saw earlier on, using XML tags to structure the prompt. We've given it an output format as well. Since we're producing a schedule, we're asking it to output JSON; if we don't give it that output structure, it might lead to parsing errors downstream.
When we run the simple model on a first iteration of the evals, all cases fail. Now, just to explain what we're looking at here: in our test set, we're doing five trials, and these numbers show how many violations were made in each trial. In the outputs, we can see that it's made a decent attempt at reasoning through the problem, but it's burning a lot of tokens, and it's clearly not checking its work, as it's not getting to the right result. So let's try a larger model, a model which we know is better at reasoning. We're going to run it through Opus 4.7 instead, keeping everything else the same. Now, interestingly, whilst all test cases are still failing, you can see that the overall number of violations that Opus has made has reduced significantly from Sonnet 4.6. So we're possibly on to something here, right?
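The violation counts discussed above come from a programmatic grader. A sketch of what such a Python function could look like follows; the talk does not list the actual rules, so the three checks here (required headcount per shift, availability by day, and a weekly cap on shifts) and the `"Day-Shift"` key format are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Dict, List, Set


def find_violations(schedule: Dict[str, List[str]],
                    required_headcount: Dict[str, int],
                    availability: Dict[str, Set[str]],
                    max_shifts_per_week: int) -> List[str]:
    """Return one message per broken hard rule; an empty list means compliant.

    `schedule` maps a shift id such as "Mon-AM" to the staff assigned to it.
    """
    violations = []
    for shift, needed in required_headcount.items():
        assigned = schedule.get(shift, [])
        if len(assigned) != needed:
            violations.append(f"{shift}: needs {needed}, has {len(assigned)}")
        day = shift.split("-")[0]
        for person in assigned:
            if day not in availability.get(person, set()):
                violations.append(f"{shift}: {person} is not available on {day}")
    # Weekly cap, counted across every shift in the schedule.
    shift_counts = Counter(p for staff in schedule.values() for p in staff)
    for person, count in sorted(shift_counts.items()):
        if count > max_shifts_per_week:
            violations.append(f"{person}: {count} shifts exceeds cap of {max_shifts_per_week}")
    return violations
```

Taking `len(find_violations(...))` for each trial yields the kind of per-trial number shown in the demo, and the messages double as evidence when inspecting why a run failed.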
This isn't good enough to ship because it's still failing, but clearly giving it more reasoning capability is helping drive it towards a better result. So what we're going to try next is using Opus with adaptive thinking instead, so it can decide for itself how much reasoning it needs to use to solve this issue. So no change to the prompt really, just a change to the API call here. This now seems to reliably generate compliant schedules, but it requires a lot more tokens. We're essentially tripling the number of tokens that we're using here, and we're tripling the latency. So we want to try and see if we can optimize that cost and latency trade-off a little bit more. The latency here is 100 seconds. Obviously, I'm running this one async for the purposes of time; Opus 4.7 hasn't magically gotten much faster since the last time you used it. So let's see if we can optimize a little bit more for the token and latency trade-off.
What we haven't tried yet is using Sonnet 4.6, so a smaller model, but with a better prompt. We looked a lot at prompt optimization in that last section. So I've added a couple of details to the prompt here. We're telling it in particular how to reason through this problem and, most critically, telling it to check its work before outputting it. When I ran that eval, we see that it passes in two out of the five cases. Now the failure mode that we're seeing is actually not violations of the scheduling requirements; rather, the model hasn't been able to finish the task within the output limit that we set. So whilst we could increase the max tokens that this model is able to use to get all five test cases passing, we see here that we're using even more tokens and this run has an even higher latency. So this is probably not the route that we want to go down. Now as a final pass, we want to look at doing this a little bit more agentically.
So we're going to use this generate, evaluate, repair loop, where essentially the generator now creates a first draft of the schedule, and then we have a separate prompt which reports any specific violations that it made. So we're not checking it programmatically, but checking it with an LLM. We're checking every rule and we're providing evidence of every violation. And we then have a third repair prompt, which receives any violations that were made and tries to make targeted fixes to them. So we have three very simple prompts, but they're now running independently rather than trying to do everything in one large prompt. We can see in this case our agentic approach has solved all of our test cases with a much lower number of tokens and with a lower latency than trying Sonnet 4.6 with a better prompt. So going forward, it seems like there are two appropriate approaches to take here: using Opus 4.7 with adaptive thinking, or using this agentic loop. Now moving forward, we'd probably want to do a little bit more optimization on this loop to try and get it to be more efficient. But there's one key benefit as well from using this generate, evaluate, repair loop, and that is that you can put in soft requirements at runtime. So in the evaluation prompt, we can say Harry doesn't like working with Sally, so as much as possible, try and separate them from working together, or we need a third shift on Wednesday, for example. It means that you're not having to make changes every time to the Python function which is doing the evaluation in the back end to satisfy any soft constraints which might depend just on a case-by-case basis.
So to wrap up then, pulling all of those learnings together, what did we see? Well, we looked at two scenarios, the two scenarios which I, as an engineer, see most in my day-to-day: one where we're maintaining a prompt and migrating to a new model which has some different behaviors, and one where we're building a new use case from scratch.
We saw that following general hygiene principles can immediately uplift the performance against a set of evals, and that we need those evals to be able to rigorously see any impact of changing our prompt on the output. Then we saw this process of targeting failure modes one by one: adding structure, avoiding long ban lists, etc. were all things that helped push our model to the correct behavior. And then finally, with the new agentic bot that we were building, we saw the impact of splitting into three separate prompt systems. So rather than using one prompt to address everything, we're actually isolating different tasks where it's easy and repeatable to separate out the steps that it needs to take every time. Thank you so much for attending this afternoon. I hope you have a fantastic rest of your day.
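For reference, the control flow of the generate, evaluate, repair loop from the scheduling example can be sketched as below. In the talk each step is a separate LLM prompt; here the three steps are plain callables so the loop runs without a model, and the stub functions, the `task` layout and the names in it are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List, Tuple

Schedule = Dict[str, List[str]]


def generate_evaluate_repair(task: dict,
                             generate: Callable[[dict], Schedule],
                             evaluate: Callable[[dict, Schedule], List[str]],
                             repair: Callable[[dict, Schedule, List[str]], Schedule],
                             max_repairs: int = 3) -> Tuple[Schedule, List[str], int]:
    """Draft once, then alternate evaluation and targeted repair until the
    evaluator reports no violations or the repair budget runs out."""
    draft = generate(task)
    violations = evaluate(task, draft)
    repairs = 0
    while violations and repairs < max_repairs:
        draft = repair(task, draft, violations)
        violations = evaluate(task, draft)
        repairs += 1
    return draft, violations, repairs


# Stubs standing in for the three prompts (each would wrap its own model call).
def stub_generate(task: dict) -> Schedule:
    return {"Mon-AM": ["Ana"]}  # deliberately understaffed first draft


def stub_evaluate(task: dict, draft: Schedule) -> List[str]:
    return [f"{shift}: needs {needed}, has {len(draft.get(shift, []))}"
            for shift, needed in task["required"].items()
            if len(draft.get(shift, [])) != needed]


def stub_repair(task: dict, draft: Schedule, violations: List[str]) -> Schedule:
    fixed = {shift: list(staff) for shift, staff in draft.items()}
    for shift, needed in task["required"].items():
        staff = fixed.setdefault(shift, [])
        spare = [p for p in task["staff"] if p not in staff]
        while len(staff) < needed and spare:
            staff.append(spare.pop(0))
    return fixed


task = {"required": {"Mon-AM": 2}, "staff": ["Ana", "Ben", "Cy"]}
final, remaining, repairs_used = generate_evaluate_repair(
    task, stub_generate, stub_evaluate, stub_repair)
```

Because the evaluator is just a callable that receives the task, runtime soft constraints can be added to the task (or to the evaluator's prompt) without touching the loop or the programmatic grader.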
