Evals for taste: Hill-climbing a slide-generation agent

Claude | 00:39:16 | May 23, 2026
Chapters: 10
Define evals as systematic tests that measure how an AI system performs on a specific domain or use case and identify strengths and areas for improvement.

Evals unlock actionable feedback for slide-generation agents by turning vibes into measurable benchmarks you can iterate on relentlessly.

Summary

Claude walks through the why and how of evals in AI, using a hands-on example with a slide-generation agent. He explains that evals are systematic tests that reveal what an AI system does well and where it falters, and argues they should be the backbone of building better agents. The talk emphasizes turning vague feedback into concrete benchmarks, and shows how to mix code-based, rubric-based, pairwise, and human graders to score outputs. You’ll see a live loop: design graders, configure an agent with a system prompt, run evaluations, inspect results, and iterate. Claude also demonstrates a QA loop where a second agent critiques the first one's output to drive improvements. The session includes practical setup: a repo with an agent definition, a set of graders (emoji count, layout quality, color contrast, etc.), and a workflow for scoring each run and iterating on the agent. He cautions that evals should evolve over time and warns against treating any single score as gospel. The takeaway is clear: with well-designed evals, you gain clarity, enable faster model migrations, and surface issues before launch.

Key Takeaways

  • Evals convert subjective vibes into actionable signals by explicitly defining success criteria and scoring rules for AI outputs.
  • Code-based graders (regex, word-count) are fast and deterministic but brittle; model-based and human graders add nuance and context, at higher cost.
  • A living benchmark card accompanies each model release to compare performance across evals and competitors.
  • A QA loop (creation agent plus critic agent) reliably surfaces and fixes errors, driving iterative improvements.
  • Calibrate graders and prompts continuously; evaluators evolve with the model and use case to avoid metric saturation.
  • Switching to a stronger model (e.g., Opus 4.7) can yield cleaner outputs and higher scores, but you must still validate with your own evals.
  • Clearly separate reasons for a score (pros/cons) from the final verdict to avoid regression errors in model judgments.

Who Is This For?

Essential viewing for ML practitioners building AI agents, especially those deploying tool-use or presentation-creation agents. It’s a practical guide to designing, calibrating, and iterating evals to ensure meaningful improvements.

Notable Quotes

"Evals are systematic tests that measure how well an AI system performs on a specific domain or use case"
Definition of evals and their basic purpose.
"We want to have something that's actionable"
Why evals matter beyond vibes and general impressions.
"A grader is basically a way how we can judge the output"
Explanation of graders and their roles in the evaluation framework.
"Calibrate between zero and five and as we see the scores... I think are quite high"
Demonstrating how scores look in practice and the need for calibration.
"Add a QA loop... assess, criticize, fix, and verify"
Describing the transversal enhancement pattern for improvements.

Questions This Video Answers

  • How do I design effective evals for a slide-generation AI assistant?
  • What are the trade-offs between code-based and model-based graders in AI evaluation?
  • How can a QA loop improve AI generation outputs before deployment?
  • What makes evals a living artifact rather than a one-off benchmark?
  • Which prompts or system messages help steer an AI agent to produce higher-quality slides?
Evals, AI benchmarking, Agent design, Code-based graders, Model-based graders, Human graders, QA loop, Prompt engineering
Full Transcript
Hello, hello, hello. Good afternoon, everyone. I hope you all had a wonderful lunch. There's so many of you as well. I'm actually kind of surprised by this. Happy to see that there's that much interest in talking about evals. I personally am a big fan of anything evals related, but I know that's not everyone's cup of tea, right? So very happy to see this many of you.
So today's session is really going to be about evals. My goal is for you all to leave inspired to build evals, to be like, okay, evals are actually really useful, and to know how you can act on them. We're going to be building evals. I want you to get a better sense of: how should I be thinking about building evals? What are useful types of evals? And how can we take these evals and use them to make better agents? That's the main goal of this session. The way we're going to do this is by building a slide generation agent and then finding out: what are some good evals? What do we want to measure? And how can we build better agents based on the feedback we're getting from our evals?
The first thing we need to set the stage on is: what are evals? Evals are systematic tests that measure how well an AI system performs on a specific domain or use case. They give you information about the quality of the results: what did it do well, what was it not good at, how can we improve? Evals are made up of tasks that define certain scenarios and encode certain expectations through the grading logic. One way we think about evals: say you are building an AI system, an AI agent, and you want to make sure the output adheres to a certain level of quality, or you need to make sure that something is always present.
Evals are a way to encode this behavior, so that afterwards, if your evals fail, you know: okay, my agent is not behaving the way it is intended. That's how we can use these evals. Evals are also the bridge between things like "it seems to work" and "we know it works." Or maybe you saw something like, "ah, it kind of feels a little bit worse today for some reason." It's always very hard to act on these types of vibes. I think vibes definitely have their own place. They're useful to get a general sense check of how people are feeling, but they're not very actionable. And that's what we want to get out of evals. We want to have something that's actionable.
Once we release a model, we always have this accompanying benchmark card. We list a bunch of evals: this is what our models achieve. We compare them to other models, we compare them to competitor models. And there's always a few usual suspects. For example, SWE-bench is a very famous one, which measures agentic coding abilities. Terminal-Bench is one that's also quite popular. But we also have other types of evals. We have tool use and agents, for example tau-bench, BrowseComp, and OSWorld, which measure different things. And then we also have reasoning and knowledge, like ARC-AGI-2.
Now, this is all fine and dandy. Every time a new model releases, it's "top of the benchmark" for these and these evals. They give us a general sense of how good the model is and how much we improved upon previous versions. But for you guys, if you're building something, if you're building an agentic system, this usually doesn't say much. We don't measure the very specific use case that you are building. We measure generic benchmarks that cover a lot of capabilities, but they might not be applicable to your specific use case. That's why we always say: build your own evals, benchmark the different models, benchmark your AI agent, and make sure that you get the most out of the models and that you're using the right model for the job.
So why are evals specifically important? This is my pitch to you to start using evals. Suppose you don't have evals. I think we've all been in the scenario where you have an agent and it's working fine, and then you get feedback from a customer saying, "it's not really up to par since this new model switch. Something is off." It's very hard to do anything with that information, right?
It's just like, okay, do you have some logs that we can take a look at, some specific instances? And then you try to debug it manually. But in a way you're still flying blind. You're always in a reactive loop: you wait for the feedback and then you're like, okay, let's see what we can do about this. So you basically only catch issues in production. You can fix one issue, which then might create multiple more down the line, for example by making a prompt tweak that suddenly degrades the capabilities on other tasks that you haven't even considered. It's also quite annoying to distinguish genuine feedback from noise. You don't want to act on every single thing that you see, because people also have biases in the way they perceive these things. And finally, which I think is the most important one, there's no way to verify improvements or regressions on anything that you've built. You need a way to make sure that the changes you are making to your agent are actually impacting the quality and that you improve upon your previous versions.
This is what evals give you. If you add evals, you have clarity. You need to define what success looks like. If you don't have evals, and you're not even able to articulate "this is how the agent should behave, this is what a successful end product would look like," then how can you make sure that your agent is actually behaving properly? You can't even vocalize it to yourself. So building these evals forces you to formalize what you expect your agent to do. It also allows you, as I said, to iterate on optimal agent configs.
You can also adopt new models faster. Instead of saying, "oh, we might test out this new model and see if it's okay," you now have some clarity to say: this is better on this, this is not better on that, and this is why we should or should not migrate to a new model. Which I think is especially relevant with the pace of new models coming out. It also takes this load off your back of constantly having to find out what the new frontier is. And finally, making problems visible before launch. If there's a few cases that you always do well on, or that you trust to provide a lot of insight, that's where you get the most value out of evals.
So how do evals really fit in? Originally, when we were thinking about prompt engineering, we had this flow for how you should optimize your prompts. First you develop your test cases, which are the evals in the end. Then you write a prompt, you test the prompt against the tasks, you refine the prompt a little bit, and then it goes back. You run the prompt again. You refine it until you're like, okay, I'm doing great on my evals, I'm confident that my system is working properly. And then finally you can ship the polished prompt.
Over time, systems have gotten a little... oh, can we go back one slide please? Thank you. So over time it has gotten a little bit more complex, with agents now coming into the loop with tool calls, skills, all the different ways to optimize your context, and all that stuff. So over time these systems get more and more complex.
There are also way more levers that you can pull to make changes to your agents, which makes it once again more important to have evals that force you to have a concrete way of identifying: these are the things that we can change, and these are the things that impact the system in a positive way. So with agents it's the same flow, except now we test way more, and way more complex, things.
When you create evals, there are basically a few types of graders. A grader is what we consider basically a way how we can judge the output. One of those is a code-based grader, which is pretty similar to a unit test as you might know it from software engineering. It can be a string match, a regex, maybe a fuzzy match. It can also be static analysis or tool call checks. The advantages are that it's fast, cheap, and deterministic, but it has a big drawback, which is that it's brittle and it lacks nuance. Brittle is quite an interesting one in my opinion, because these deterministic checks force a certain deterministic behavior. Sometimes this is absolutely the way we want an agent to behave. For example, say you have an agent that creates a slide deck. You want to make sure that in the end there is a slide deck present: deterministic check. But if you want to check the quality of this slide deck, that is way more nuanced. You cannot easily encode this in some deterministic checks. That's why we also have this second type of grader, the model-based graders. And this is rubric-based reasoning.
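The code-based graders described above (string match, regex, simple counts) can be sketched in Python roughly as follows. This is a minimal illustration, not the graders from the talk's repo: the `GradeResult` type, the function names, and the `output/deck.pptx` path are all assumptions made for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class GradeResult:
    """Hypothetical result record for one deterministic check."""
    name: str
    passed: bool
    detail: str


def grade_exact_match(output: str, expected: str) -> GradeResult:
    """Strict string match after trimming surrounding whitespace."""
    ok = output.strip() == expected.strip()
    return GradeResult("exact_match", ok, f"expected {expected!r}")


def grade_regex(output: str, pattern: str) -> GradeResult:
    """Pass if the pattern occurs anywhere in the output."""
    ok = re.search(pattern, output) is not None
    return GradeResult("regex", ok, f"pattern {pattern!r}")


def grade_max_words(output: str, limit: int) -> GradeResult:
    """Pass if the output has at most `limit` whitespace-separated words."""
    count = len(output.split())
    return GradeResult("max_words", count <= limit, f"{count} words, limit {limit}")


if __name__ == "__main__":
    # Made-up agent reply; the path is illustrative only.
    reply = "Saved the deck to output/deck.pptx"
    for result in (
        grade_regex(reply, r"\.pptx\b"),
        grade_max_words(reply, 10),
        grade_exact_match(reply, "Saved the deck to output/deck.pptx"),
    ):
        print(result)
```

Checks like these behave like unit tests: cheap to run on every change, but a reply phrased slightly differently can fail `grade_exact_match` even when the agent did the right thing, which is the brittleness mentioned above.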
So you for example ask, "is this slide high quality?" Very generic, but that might be a rubric. Or, "is this text coherent?" That's also a way to get some intel on how well your agent is performing. You can do some interesting things with this as well. Pairwise comparison is in my opinion quite underrated. Say you have two outputs, and then you ask the model: which one of the two do you prefer, and why? That's quite interesting to get information out of, especially for scenarios where you don't really have a clear way of defining what makes one better.
Another one is multi-judge consensus. For example, you take best of three: three judges score independently and the majority wins. This is interesting because it allows you to introduce some more determinism. We know that an LLM is nondeterministic, and the same holds for these model-based graders. If you run them 100 times, a few times it might say "oh, this is great" and a few other times "ah, it's not that great." With multi-judge consensus, you're basically saying: let's put more compute into this and see what the consensus of the majority of our graders is.
This unlocks a lot of things. It is flexible, scalable, and nuanced. But as I said, it's nondeterministic, it costs more money, and it requires some calibration, which we will see is not easy at all.
And then finally, the most expensive ones are the human graders. These are probably the graders that you will be using the least when building these agentic systems, because they're incredibly expensive. You have a subject matter expert who will do a whole review of the system. It's expensive.
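The multi-judge consensus idea described above (three judges score independently, majority wins) can be sketched as a small voting function. This is only a sketch under stated assumptions: each judge is modeled as a plain Python callable returning a label, and the stub judges below stand in for real model calls, which the talk does not show.

```python
from collections import Counter
from typing import Callable, Sequence

# Assumption: a judge takes the agent output and returns a verdict label.
Judge = Callable[[str], str]


def majority_verdict(output: str, judges: Sequence[Judge]) -> tuple[str, float]:
    """Run every judge independently; return (winning label, agreement ratio)."""
    if not judges:
        raise ValueError("need at least one judge")
    votes = Counter(judge(output) for judge in judges)
    label, count = votes.most_common(1)[0]
    return label, count / len(judges)


if __name__ == "__main__":
    # Stub judges standing in for three independent model-based graders.
    stub_judges = [lambda o: "pass", lambda o: "fail", lambda o: "pass"]
    print(majority_verdict("slide deck text", stub_judges))
```

Using an odd number of judges with two possible labels avoids ties. The agreement ratio is worth logging too: a verdict that wins two to one is a weaker signal than a unanimous one.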
It's slow as well, but it's the highest quality. It is very nuanced. And it's really good for some A/B testing and some spot checking.
So, I'm not sure how many of you were able to clone the repo beforehand and have this all set up. I actually wanted to do this session a little bit differently, but given the number of people, I will probably do a little bit more myself instead of letting you think about all of the things. But I'll quickly give you an overview of what's in the repo. Let me make this a little bit bigger. I have some pre-made slides that I will show you in a bit. The resources directory is the main thing you would be working in. Let me actually close this session for now.
So you have the agent YAML file, and this is basically where you would define your agent. I think we did a session on this before. We're going to use a managed agent, so for the people who attended the session before lunch, it's basically the same thing. We define an agent here, and we have given it a system prompt: basically, you are a slide generation agent, and when the user gives you a topic, create a PowerPoint file at this location. We also tell it: you have a shell with python-pptx pre-installed. That's all we give it for now. We also have an environment, which we've defined with a few packages and whatever it needs to complete this session. And that's basically it. We also have some other things defined, but I will get to that.
Maybe the first question that I have for the audience: today we want to make a slide generation agent. What do you guys think is a good eval? What are you trying to measure?
What would be some good information that you want to get out of evals? Sorry? Number of words on slides is a really useful thing to track. Anyone else with some ideas? Yeah, absolutely valid. Absolutely valid. Yeah. And I actually like these two examples, because they immediately give you a different sense of how you can use the types of graders. The number of words on a slide is quantifiable. It's easy: you can count the number of words with a deterministic grader, a code grader. The other one, whether text is overlapping or spilling over, is harder to encode in code. For this one you might use a model grader. And that's exactly what we did. We have already defined a few graders for you beforehand, in two directories: code and judge. The code graders, as I said, are quite deterministic. For example, emoji count is one we have defined, where we just count the number of emojis present in the slide deck, because we noticed that it's quite prevalent.
For example, if I open the slide deck, let me go with environment one in this case. These are the slides from the agent running. It's done beforehand, just because it can take quite a while to get the agents running. This is the result of the initial agent. This is slide number one. Slide number two. Slide number three, with some weird things on the bottom left. Slide four and slide five. Now, I think we can all agree this is not the best slide deck you have ever seen. But it's a good start. At least there's a slide deck. There's five slides, which I think is exactly what the prompt asked for. So we have a few slides. There's some content on here.
There's a few boxes. You know, it's a slide deck. Given these slides, is there anything else you are seeing where you think, this is something we would never want in our slide deck? What was that? No teal. If you absolutely want to avoid teal, that's absolutely right. I think it doesn't do that for every single slide. Let me see for the career one. Oh, okay, maybe it does always use teal, actually. But for example, in this one we see this overlap of words and this horizontal... What else do we have? Some weird coloring. Yeah, there's a few weird things happening generally.
So based on this, you take a look at what the results are and you're like, hm, what type of graders do I want to define for this specifically? We did that, and we noticed, for example, that emojis are quite prevalent, so emoji count checks how many times we see an emoji popping up. Another one is cluttered slides: how many shapes do we see on these slides? If there are just too many things, it becomes cluttered. Counting the number of slides: we always ask for five slides, so we make sure that you have five slides. Do we have slides with images, small fonts, text-heavy slides?
Now, in this case it's quite arbitrarily chosen. These were just things that we thought were quite representative of what a slide deck might need for graders. I really want to stress this: what makes a good grader really depends on the use case.
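The deck-level code graders listed above (emoji count, cluttered slides, slide count, slides with images, small fonts, text-heavy slides) could look roughly like the sketch below. It is an illustration under assumptions, not the repo's implementation: the `Slide` dataclass is a hypothetical, already-parsed view of a deck, the thresholds are invented defaults, and the emoji pattern only approximates the Unicode emoji ranges.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Slide:
    """Hypothetical, simplified view of one slide after parsing the deck."""
    texts: list[str] = field(default_factory=list)           # one entry per text box
    font_sizes_pt: list[float] = field(default_factory=list)
    shape_count: int = 0
    has_image: bool = False


# Approximate emoji ranges (pictographs, misc symbols, dingbats); not exhaustive.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def grade_deck(
    slides: list[Slide],
    max_words: int = 60,        # assumed threshold for "text heavy"
    min_font_pt: float = 14.0,  # assumed threshold for "small font"
    max_shapes: int = 12,       # assumed threshold for "cluttered"
) -> dict[str, int]:
    """Return simple counts, one per code grader."""
    scores = {
        "slide_count": len(slides),
        "slides_with_image": 0,
        "text_heavy_slides": 0,
        "cluttered_slides": 0,
        "small_font_slides": 0,
        "emoji_count": 0,
    }
    for slide in slides:
        text = " ".join(slide.texts)
        scores["slides_with_image"] += int(slide.has_image)
        scores["text_heavy_slides"] += int(len(text.split()) > max_words)
        scores["cluttered_slides"] += int(slide.shape_count > max_shapes)
        scores["small_font_slides"] += int(
            any(size < min_font_pt for size in slide.font_sizes_pt)
        )
        scores["emoji_count"] += len(EMOJI_RE.findall(text))
    return scores


if __name__ == "__main__":
    deck = [Slide(texts=["Negotiate your salary 🚀"], font_sizes_pt=[10.0], shape_count=20)]
    print(grade_deck(deck))
```

Every threshold here encodes a judgment call, which is why the numbers need to be revisited whenever a human looks at a flagged slide and disagrees with the grader.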
Generally, the way I think about this is: if you have a grader that you get no useful information out of, then it should not be part of your eval. For every single scenario that you're testing, you should be able to say: this is the information I want to get out of this, this is the part of the system that I'm testing, and this is how I can act on it if it's degrading.
So those were a few code ones, and then we also have a few judge ones. For example, the color judge, which judges the color contrast and gives a score from zero to five. Same with image, layout, and text. And this is the prompt that we give. Let me close this one real quick. Oh, actually, let me keep it like this. So this is the system prompt that we give it. It's saying, "Please evaluate the slide based on each of the following criteria. Text, the title should be simple and clear to indicate the main points. For main content, avoid too many text and keep words concise. Use a consistent and readable font size, style, and color." And it goes on and on. For each of the different things that you want to measure, we give a little information on what the judge should be focusing on.
Okay, cool. So we have these evals. Let's say you have now created a slide deck and you want to see what the results are and how you can act on them. In this repo we have also created a nice little script that will automatically score your slide deck for you. At the top here we have it all listed out: the slide count, the number of slides with images, text-heavy slides, cluttered slides, small-font slides, and so on.
We also have our judges over here, which give a score from zero to five based on how good the text, the image, the layout, and the color are. You can immediately note that these scores are quite high. As we said, we calibrated between zero and five, and the scores it has been giving here are between 2.8 and 4, which honestly I think are quite high given the slide deck that we have seen. So that's a part of the calibration that needs to happen as well.
One thing I want to stress: it's not because you have set up your evals once that they are now the ground truth. Evals can evolve over time. They need to be a living artifact. It's not something you make once, forget about, and then use to make all of your future decisions. As I go through the different examples, we will see that there needs to be a way to make sure that the evals we create are still measuring something useful for us. If you ever hear people talk about saturation of evals, that's basically what they mean: the eval is no longer giving relevant information that we can act on, for several reasons.
Cool. So we see this, and I guess the first thing we want to do in this case is make an agent that is a little bit more polished. For this we just update our system prompt.
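One way to make the calibration concern above concrete is to compare the judge's scores with a handful of human scores for the same slides. The sketch below assumes you have such paired scores on the same zero-to-five scale; the function name and the sample numbers are hypothetical and are not results from the talk.

```python
from statistics import mean


def calibration_report(
    judge_scores: list[float], human_scores: list[float]
) -> dict[str, float]:
    """Compare model-judge scores with human scores for the same items."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [j - h for j, h in zip(judge_scores, human_scores)]
    return {
        # Positive bias means the judge is more generous than the humans.
        "mean_bias": mean(diffs),
        "mean_abs_error": mean(abs(d) for d in diffs),
    }


if __name__ == "__main__":
    # Illustrative numbers only: judge vs. human scores for four slides.
    judge = [3.5, 4.0, 3.0, 4.0]
    human = [2.0, 3.0, 2.5, 2.5]
    print(calibration_report(judge, human))
```

A consistently positive `mean_bias` suggests the judge prompt or rubric is too lenient and needs tightening before its scores are used to compare agent versions.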
Instead of just having "you are a slide generation agent, make a slide deck," we now give it a little more information about our expectations in terms of typography. As we noticed, the font is too small, there are too many words on there, it's not readable, or it's too big. So we say: slide titles should be this size, section headers this size, body this size, captions this size. We also give it some information on layout and density: here are the things we expect. For example, we say keep the body text concise, leave breathing room, and left-align paragraphs.
And then, I think everyone, I mean I at least, gets a little ticked off when reading something that's clearly AI written. I'm always a little skeptical about whether I can completely trust the content, and whether the person sending me the text has read it themselves and is standing behind it. So we also say: avoid these AI-generated tells. Never use thin accent lines in the titles, and don't pepper slides with emojis as decorative icons.
This is based on the things that we have seen in our evals. Let's go back a little bit. We looked at the slide deck and we were like, oh, this is not properly done. These fonts are a little bit off. There's some emoji used in here. It's a little bit all over the place.
And then based on the scores we were like, okay, these are the things it was clearly failing at. We have emoji count four in this case, small-font slides four as well, cluttered slides too, and text-heavy slides. So based on the information from the eval that we ran, we made these changes to our new agent.
Let me now pull up the result of the new agent that we have created. This is slide one, which I think is immediately way more enjoyable to look at. There's no overlapping stuff. There's no dollar sign. It's just generally cleaner. This one, I think, still has quite small text, but at least we're getting a little more consistent with the coloring. The whole slide deck is more consistent. The third slide, the fourth slide, and the fifth slide. And this is just by identifying a few failure modes of our original one, and making changes in the system prompt based on the things that we found. Now we run it back, and we can do the same thing again. We're now in this loop of finding what's wrong, iterating, running it again, and making improvements over time.
So now we can take a look at what we find over here. Oh, and this is actually way worse. Suddenly we see emoji count 20. I'm wondering where they are. I haven't actually seen them. Hm, I wonder if this is a mistake. But generally we see: small-font slides, we've seen that, but we've improved upon the cluttering. And let's see, text heavy. Is that still the case? I think that's fine. Those are a little bit text heavy, but I think it's acceptable. So this once again shows the value of human review as well, right?
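The comparison step in this loop (run the new agent, then check which scores moved) can be automated with a small diff over two score reports. The sketch below is an assumption about how one might do it, not the repo's script; the metric names and most numbers are illustrative, with only the emoji count moving from 4 to 20 taken from the run described above.

```python
def diff_scores(
    before: dict[str, float],
    after: dict[str, float],
    lower_is_better: set[str],
) -> dict[str, str]:
    """Label each metric present in both reports as improved, regressed, or unchanged."""
    report = {}
    for name in sorted(before.keys() & after.keys()):
        delta = after[name] - before[name]
        if delta == 0:
            report[name] = "unchanged"
        elif (delta < 0) == (name in lower_is_better):
            # Went down on a count metric, or went up on a judge score.
            report[name] = "improved"
        else:
            report[name] = "regressed"
    return report


if __name__ == "__main__":
    # emoji_count 4 -> 20 mirrors the talk; the other values are made up.
    before = {"emoji_count": 4, "cluttered_slides": 2, "layout": 3.0}
    after = {"emoji_count": 20, "cluttered_slides": 0, "layout": 3.4}
    counts = {"emoji_count", "cluttered_slides"}
    print(diff_scores(before, after, lower_is_better=counts))
```

A "regressed" label is a prompt to look at the actual slides, not a verdict: as the emoji example shows, the regression may be in the grader rather than in the agent.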
Because now we see that the things we have defined in our evals are maybe not as well defined as we hoped. I'm here arguing that this is not as text heavy as I expected it to be. That means something is wrong with the way we're grading. So we would go back to our grader, update it, and make sure that it better reflects the actual thing that we want to measure. This is not to be underestimated: calibrating how your agent should behave and how your judges should judge a specific output is really something very fickle. You should spend proper time on finding out how to make this happen.
With this one, I mean, it's fun, I think it's nice, but it's still quite text heavy and it's only text. Let's say that one of our requirements is that we have an agent that always includes diagrams. Once again, we go back to our system prompt, update it, and now say: every slide must include at least one generated diagram or chart, inserted as an actual image. So once again we update the system prompt, or any part of the agent that you can tune, and then we go again and check what we get.
Okay, so this one is quite interesting. Personally I'm not a fan of having an image on the opening slide, but it is what we defined that it should do, so I'm going to let that slide. But it's a nice graph. It's showing no negotiation versus active negotiation: it's arguing that if you actively negotiate your salary, over time the gap widens between no negotiation and negotiation. Some extra benchmarks.
I think this looks immediately way better, just in the way that it's kind of grounded in some actual facts now, instead of just waffling its way through the slide deck. This one I'm not a big fan of. I feel like it's a little bit stretched, but that might also just be the screenshot. And this one is also not the best one either. Let's see what the score.json now says. Okay. No emojis. Great. No cluttered slides. Still quite text-heavy slides, surprisingly. Still small-font slides. I think that's fine. With images, I think we accept that these types of things are fine. So this once again raises some questions about the graders that we have set up.
But now we can also take a look at the judges. We now have images, so we can consider what the image judge thinks of this, and it says 3.8 out of five. That doesn't give us a lot to go off. It just gives us a number. What does it mean? How can we improve upon this? But that's fine for now.
Now, one thing that we always see working quite well, and that is transversal over every single use case, is adding a QA loop. For coding, this is quite intuitive. You create an agent that is actually writing the code, and then you add a second agent that looks at the code that has been written and just criticizes it. It's basically saying: this is bad, this is bad, this introduced a bug, this is not according to standards, whatever. It criticizes the thing that has been created, and then you give that feedback back to your original agent, the creation agent.
The creation agent goes off again, does the creation, does the fine-tuning, makes the changes that were informed by the criticism, and once that is done, it goes back to the criticizing agent. That loop goes on and on until both sides are like, okay, this is fine, we can ship this.
And that's basically what we now do in this next step. We basically say: required QA loop. Assume there are problems; your job is to find them. Approach QA as a bug hunt, not a confirmation step. This is quite interesting because we're actively instructing the agent to behave adversarially, right? We're saying there are issues, you need to find them. It's not "there might be something you might be interested in finding." No, it's actively saying: there are issues, go find them. And then we instruct: after writing the deck, convert it to images. Inspect every slide image yourself. Fix issues, rerender, reinspect, and do not stop until you've completed at least one fix-and-verify cycle.
As I said, for coding this is quite intuitive, but I think it's also quite intuitive if you look at the slides that we have created, because that's basically what we did. We looked at the slides and said, "Ah, this is not good. This is not good. Let's take that feedback, update our graders, update our system prompt, and run it again."
So let's now see if this is actually showing some improvements. I think this is immediately a lot better. The image is way bigger now. I think it's way more readable, even from further away. The slides are still small, but, for example, it's sourced now. There's a source over here as well, which is quite good. I think this is also way better.
It is more cleanly structured. I think the image is a little bit better as well. A quite interesting graph in this case: your value profile versus team average. This one is still a little off in my opinion. We also now have these weird ticks introduced. And this one is a little bit better, I would say, but I think the image capture is kind of messing with the slide here.
By now we know the drill. We take a look at the score. Has it improved? Why do we still see gaps? And now we see that for all of the judges we have created, the score is higher than before. We're now all in the 4.2 to 4.4 range. So we're on a good track. You can keep on doing this, and you will always make these little changes.
But sometimes, and this is where it gets more nuanced, you can also just go to a smarter model. Right now you're defining: this is what a good slide should look like, this is what it should do, this is what it should not do. But with these models getting smarter and better over time, you kind of expect them to be able to figure that out on their own. That would at least be nice. So that's what we tried out as well.
In the last one we just changed our model to Opus 4.7 instead of the Sonnet model we have used up to this point. And we have given it a simple prompt again: you are a slide generation agent; when the user gives you a topic, create a PowerPoint file at whatever; and you have a shell. So it's basically just the initial prompt that we gave to our Sonnet model in the beginning. Let's now take a look at the results, and remember this is just a base prompt, right?
You can immediately see it's significantly better than the Sonnet one. I think there are still clear issues that we can iron out, but generally it's way more structured.
And then we can take a look at the score as well, which I think is quite interesting and quite telling. For example, Opus just does not use any emojis. It kind of knows that if you want to make a slide deck about a salary increase, emojis probably don't belong there. It also has fewer small-font slides, because it has this innate knowledge of: okay, it should be readable, this is how a slide deck should function, this is what people expect out of a slide deck.
And then we get to these judge graders. We see a 4.4. We see a five for the image. Do we even have an image in this one? I don't think we do, actually. No, we don't. Okay. But once again, we got a five on this one. Layout judge 4.2, color judge 4.8, and title-body coherence 4.4. So this is immediately giving extremely high scores as well, which I think is quite interesting, because it once again shows that we might not be measuring the right thing. And this is not too unexpected for these judge graders.
The code graders, I think, are quite straightforward. I think most people in the room will have understood by now how they work and what we can do with them. Emoji count, for example, is quite simple: just count the number of emojis and that's it. But what we have done here with the judging is actually quite problematic. We basically say: give a score from 0 to 5, and for text, the title should be simple and clear to indicate the main point; for the main content, avoid too much text and keywords. But the judge has nothing to anchor on, right?
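The emoji-count code grader mentioned above could look something like this minimal Python sketch. The function name `emoji_count` and the code point ranges are assumptions for illustration; the ranges are approximate rather than a complete definition of emoji, and the talk does not show the actual grader.

```python
import re
from typing import Iterable

# Approximate emoji ranges (an assumption, not the full Unicode emoji set).
# The second range also covers dingbats such as check marks.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]"
)


def emoji_count(slide_texts: Iterable[str]) -> int:
    """Count emoji code points across the text of all slides.

    Multi-code-point emoji (for example ZWJ sequences) are counted once
    per matching code point, so this can over-count.
    """
    return sum(len(EMOJI_PATTERN.findall(text)) for text in slide_texts)


slides = ["Raise \U0001F680", "No emoji here", "Great \U0001F389\U0001F389"]
print(emoji_count(slides))
```

A grader like this is fast and deterministic, which is the appeal of code graders, but it only measures what the pattern happens to cover.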
The judge doesn't really know what good looks like in this case. It doesn't know what bad looks like. So there's still this trade-off between what a model actually knows and what we need to give it more information on to make sure it can properly judge what we have produced. In this case, what could help is to say: this is a bad example. Let's say you have a zero, where everything is just awful: these are some telltale signs that you're dealing with an extremely badly formatted slide deck. And then you can express the same for the different score ranges.
Once again, that still doesn't mean it will be able to give a good answer, because we now have these results. We have this number that our LLM decided to output for some reason. In this case, the image judge just put out a five. What do we do with that number now? We just said there was not a single image in that slide deck. So how can we interpret this five? One way of doing this is to always ask your judge graders to give the reasons why they came to that conclusion.
One thing you should be very cautious about is the ordering. I've had it happen where I was setting this up and did this exact thing: I asked for the number and then said, "Okay, also give me the reasons why you did that." And so it said, "It's a four, and the reasons for this are this, this, and this." But we know that an LLM works autoregressively, right? So if it is anchored on this four, it will do anything it can to argue why it should be a four. Anything.
Even if the output is extremely bad, even if it should be a one, the judge will still say, "Oh, it is good for these and these reasons," because it needs to justify the four that it put out. So how you do it is you actually turn it around. First you say, "Give me a bunch of reasons. Give me pros, give me cons, give me reasons why it should be high, give me reasons why it should be low." And then, based on all of those reasons together, it needs to make a final decision on the score.
That also goes back to the QA loop. Once again you can get a little bit tricky here and have multiple agents doing the verification part, where one agent is finding all of the issues and another one is refuting them.
One example that I can give, which I think is quite interesting: let's say you want to produce a document that requires some analysis, where you first need to get a lot of context from the internet, for example a legal document. You ask a model to make a summary of a certain case: what was decided, and what legal implications does this have for other cases? You need to be very careful with all of these things, because legal cases are generally quite tricky, and an agent would love to jump to conclusions: this is the reason, and that's it. Then the grader might say: this is unclear, this is maybe untrue, this is maybe glossing over the actual facts, all of those types of things. But then once again you can apply multiple techniques. You can have multiple graders evaluating those and see which issues mostly pop up, because a grader might still hallucinate things as well, right?
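One way to sketch the reasons-first ordering and the multiple-grader idea in Python: ask each judge for pros and cons before the score, reject replies that lead with the number, and take the median across judges. The prompt wording, the JSON shape, and the names `JUDGE_INSTRUCTIONS`, `parse_verdict`, and `aggregate` are all assumptions for illustration, not the setup used in the talk, and the median is only one possible way to combine judges.

```python
import json
from statistics import median
from typing import Dict, List

# Hypothetical instruction text: reasons are requested before the score so
# the model writes its arguments first and commits to a number last.
JUDGE_INSTRUCTIONS = (
    "List reasons the score should be high under 'pros' and reasons it "
    "should be low under 'cons'. Only then give an integer 'score' from "
    "0 to 5. Reply with a JSON object with the keys in exactly that order."
)

EXPECTED_KEYS = ["pros", "cons", "score"]


def parse_verdict(raw: str) -> Dict:
    """Parse one judge reply and enforce the reasons-before-score order."""
    verdict = json.loads(raw)  # dicts keep the key order of the JSON text
    if list(verdict) != EXPECTED_KEYS:
        raise ValueError(f"expected keys {EXPECTED_KEYS}, got {list(verdict)}")
    score = verdict["score"]
    if not isinstance(score, int) or not 0 <= score <= 5:
        raise ValueError(f"score must be an integer from 0 to 5, got {score!r}")
    return verdict


def aggregate(raw_verdicts: List[str]) -> float:
    """Combine several judges by taking the median of their scores."""
    return median(parse_verdict(raw)["score"] for raw in raw_verdicts)


replies = [
    '{"pros": ["clear titles"], "cons": ["text heavy"], "score": 3}',
    '{"pros": ["good chart"], "cons": [], "score": 4}',
    '{"pros": [], "cons": ["tiny font", "stretched image"], "score": 2}',
]
print(aggregate(replies))
```

Keeping the pros and cons alongside the number also gives you something to read when a score looks wrong, such as a five for images on a deck that has none.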
Graders can hallucinate especially in these very nuanced scenarios. So there are different ways to work with these judges to make sure that you actually get good, consistent output that is actionable.
What I've shown you here today is just a small introduction to how evals can help you, but it's definitely not the end. I think 45 minutes for a session on evals is quite short, because it can get really deep. I started this session off by talking about benchmarks, which are in the end just evals. Why would every single model provider care so much about benchmarks, so much about evals, if they weren't one of the most important things when we are building new models? Exactly: we need to find the things that we are failing at. What are we good at, what are we bad at, and how can we make the model better in future generations?
And that's the same thing when building applications that use AI agents. It's just finding what works, finding what doesn't, iterating, and making sure that you're informed on the decisions you are making and that the changes you make actually have a positive influence on your final output.
Okay, thank you guys so much. This is all the time that I have. Thank you guys.
