The thinking lever

Claude | 00:21:17 | May 21, 2026
Introduces the thinking lever and explains how test-time compute at inference can make more effective use of tokens when solving hard problems.

Claude shows how spending more test-time tokens can boost reasoning quality, with adaptive thinking and budgets to balance speed, cost, and intelligence.

Summary

Alexander Bricken of Anthropic's Applied AI team introduces the thinking lever (test-time compute), which lets Claude spend more tokens to think through problems before answering. He connects this approach to model scaling at train time and shows that performance improves both with larger models (across the Haiku, Sonnet, and Opus families) and as more tokens are used. Through live prompts, Bricken compares low, high, and max effort settings, noting that more tokens generally yield richer simulations, smarter drivers in traffic scenarios, and more nuanced outputs, at the cost of time and money. He walks through three core capabilities: thinking space (scratchpad reasoning), tool calls to interact with the outside world, and text (the final answer). Interleaved thinking and adaptive thinking let Claude decide when to think, when to call tools, and when to generate text, emulating human problem solving. Bricken also explains budgets and how to balance train-time compute (model size) with test-time compute (token usage) to reach human-level task performance. Practical tips include evaluating with tough test sets, recognizing diminishing returns at very high effort, and defaulting to extra-high as a strong baseline for many workloads. He ends by advocating a world where Claude autonomously allocates compute under user-defined task budgets, improving field performance over time.
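
The three flows described in the summary (think-then-act, interleaved, adaptive) can be pictured as a toy trace of step kinds. This is a minimal sketch, not Anthropic's implementation: every function name here is hypothetical, no API is called, and the `should_think` callback merely stands in for a decision that, in the talk, the model makes for itself.

```python
from typing import Callable, List, Tuple

# A step is (kind, payload); kind is "thinking", "tool_call", or "text".
Step = Tuple[str, str]


def think_then_act(question: str, tools: List[str]) -> List[Step]:
    """Original flow: one thinking block up front, then all tool calls, then text."""
    trace: List[Step] = [("thinking", f"plan for: {question}")]
    trace += [("tool_call", name) for name in tools]
    trace.append(("text", "final answer"))
    return trace


def interleaved(question: str, tools: List[str]) -> List[Step]:
    """Interleaved flow: a thinking step follows every tool call."""
    trace: List[Step] = [("thinking", f"plan for: {question}")]
    for name in tools:
        trace.append(("tool_call", name))
        trace.append(("thinking", f"reflect on result of {name}"))
    trace.append(("text", "final answer"))
    return trace


def adaptive(question: str, tools: List[str],
             should_think: Callable[[str], bool]) -> List[Step]:
    """Adaptive flow: thinking is optional at every point (hypothetical policy)."""
    trace: List[Step] = []
    if should_think(question):
        trace.append(("thinking", f"plan for: {question}"))
    for name in tools:
        trace.append(("tool_call", name))
        if should_think(name):
            trace.append(("thinking", f"reflect on result of {name}"))
    trace.append(("text", "final answer"))
    return trace


def kinds(trace: List[Step]) -> List[str]:
    """Return only the step kinds, which is what distinguishes the flows."""
    return [kind for kind, _ in trace]
```

For a trivial question such as "10 + 10" with no tools and a policy that never thinks, the adaptive trace collapses to a single text step, which mirrors the point that Claude can skip thinking entirely when the question does not need it.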

Key Takeaways

  • Larger models and higher test-time compute both improve benchmark results; for Claude, token expenditure correlates with better problem-solving across domains such as Darch QA and Humanity's Last Exam.
  • Three core test-time capabilities exist: thinking space (scratchpad), tool calls (external interfaces), and text (final output), enabling iterative reasoning and action.
  • Interleaved and adaptive thinking enable Claude to decide when to think and which tools to use, mirroring human problem-solving patterns rather than a rigid think-then-act cycle.
  • Raising the effort level is an easy way to improve performance, up to the point of diminishing returns; max adds tokens but may offer only marginal gains over extra-high for many tasks, so extra-high is often a practical default.
  • Budgets and task-based constraints let users control how long Claude can think and how many tokens it may spend, aligning compute with project timelines (e.g., day-to-week horizons).
  • If intelligence is required, use larger models (Opus-family) even at low effort; for simple tasks, smaller models (Haiku) can suffice with appropriate effort.
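
The advice about evaluating at several effort levels and watching for diminishing returns can be sketched as a small helper that compares score gained per extra token. This is only an illustration under stated assumptions: `SweepResult`, `marginal_gains`, `pick_effort`, the threshold, and the effort labels are hypothetical, and the numbers in `example` are placeholders, not measurements from the talk.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SweepResult:
    effort: str          # label for the effort level, e.g. "low" or "max"
    score: float         # accuracy on your own hard test set, between 0 and 1
    output_tokens: int   # mean output tokens per task at this effort level


def marginal_gains(results: List[SweepResult]) -> List[Tuple[str, str, float]]:
    """Score gained per 1,000 extra tokens between consecutive effort levels."""
    ordered = sorted(results, key=lambda r: r.output_tokens)
    gains = []
    for prev, cur in zip(ordered, ordered[1:]):
        extra = cur.output_tokens - prev.output_tokens
        gain = (cur.score - prev.score) / extra * 1000 if extra > 0 else 0.0
        gains.append((prev.effort, cur.effort, gain))
    return gains


def pick_effort(results: List[SweepResult], min_gain_per_1k: float = 0.005) -> str:
    """Walk up from the cheapest level; stop once the next step gains too little."""
    if not results:
        raise ValueError("need at least one sweep result")
    ordered = sorted(results, key=lambda r: r.output_tokens)
    chosen = ordered[0]
    for cur in ordered[1:]:
        extra = cur.output_tokens - chosen.output_tokens
        if extra <= 0 or (cur.score - chosen.score) / extra * 1000 < min_gain_per_1k:
            break
        chosen = cur
    return chosen.effort


# Placeholder numbers for illustration only (not results from the talk).
example = [
    SweepResult("low", 0.60, 5_000),
    SweepResult("high", 0.70, 10_000),
    SweepResult("extra_high", 0.76, 20_000),
    SweepResult("max", 0.77, 40_000),
]
```

With these placeholder values, the step from extra-high to max doubles the tokens for a one-point gain, so `pick_effort(example)` stops at the extra-high level. The threshold is a cost judgment that depends on the workload; a harder task may justify a lower one.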

Who Is This For?

Engineers and ML practitioners evaluating how to deploy Claude in production, especially those balancing latency, cost, and cognitive load for long-running tasks.

Notable Quotes

"Good to see you all today. I'm happy to have so many Claude lovers in one room."
Opening remarks and tone-setting by Alexander Bricken.
"Specifically, we're going to talk about how Claude leverages compute at runtime, at inference time, typically called test time compute to make more effective use of tokens in solving some of the hardest problems that Claude has in front of it."
Definition of the thinking lever and its purpose.
"the more tokens it spends, the more time it takes to run this sim and the more detailed the simulation becomes."
Demonstration of low vs. high vs. max effort on a traffic-simulation prompt.
"Adaptive thinking is basically an evolution of interleaving thinking where you give even more control to the model as to when it thinks and why it thinks."
Explains adaptive thinking and why it matters.
"The ideal state is you ask Claude a question and it knows how much effort it should put into it."
Describes the goal of automatic effort allocation and human-in-the-loop preferences.

Questions This Video Answers

  • How does test-time compute affect the accuracy of Claude's answers in practice?
  • What are adaptive and interleaved thinking in Claude, and when should you use them?
  • When should I choose Opus versus Haiku for a given task with Claude?
  • What are budgets and task budgets in Claude's API and how do they influence token usage?
  • What’s the recommended default effort level for typical Claude workloads?
Claude, test-time compute, adaptive thinking, interleaved thinking, thinking space, tool calling, text output, Opus, Haiku, Sonnet
Full Transcript
Good to see you all today. I'm happy to have so many Claude lovers in one room. My name is Alexander Bricken and I'm on the Applied AI research team here at Anthropic. Today we're going to be talking about the thinking lever. Specifically, we're going to talk about how Claude leverages compute at runtime, at inference time, typically called test-time compute, to make more effective use of tokens in solving some of the hardest problems that Claude has in front of it. I'm going to share some best practices for using the different levers and spending tokens to solve those problems better, and hopefully you'll learn something as well.
So, looking back a couple of years, one of the key developments in large language models has been this idea of reasoning models, which use test-time compute to spend more tokens so that a model becomes more efficient at answering a question. Similar to how we can scale model performance at training time, with train-time compute, test-time compute also results in higher intelligence. As you can see here on the left, we have our different models in our typical range, from Haiku to Sonnet to Opus, and as you increase the size of the model, or the number of parameters, performance increases to just under 80% on an internal agentic coding benchmark that we run. Equally, on the right-hand side we have a logarithmic x-axis, and you can see that as Claude spends more tokens, performance increases as well. Both of these, the max on the right and the result you're seeing on the left, are actually the same score.
This is actually true of every knowledge-work domain. Take a reasoning problem in darch QA, which is a benchmark; computer use through OSWorld; or Humanity's Last Exam, which is a PhD-level series of test cases.
In all of those results, we see that the model becomes better at producing outcomes when it uses more tokens to think through the problem before answering.
As a tangible example, and even though we love looking at graphs at Anthropic, I wanted to show this in real time. I have this prompt in front of us here, which is going to be run at three different levels of effort for Claude: low, high, and max. I'll show you how performance increases as Claude spends more tokens. So this is our prompt: create a realistic simulation of cars on a one-way street at a traffic light. Note the one-way street and the traffic light.
Okay, so our first one: low effort on Opus 4.7. It took roughly 50 seconds and there were roughly 4,600 output tokens. As you can see, it's actually quite a good simulation. We have a one-way road. The cars are in two lanes. They're pulling up to the traffic light, stopping at a regular cadence, and we have a few adjustments we can make in this simulation to change the spawn rate of cars or how often the light turns red or green. As you can see, the cars just move through and then stop when the light turns yellow. Cool. It's quite simplistic, though.
Now, let's move over to high effort. Let's turn that effort level up. You can see on the left that Claude is spending roughly double the time to create this simulation and roughly double the tokens. I would say that this simulation is a little more detailed. We have the same one-way road, with different types of vehicles showing up. There are a few lorries in there for the Brits out in the crowd. There's also a traffic light, but the traffic light isn't in the middle of the road. If we flip back to the previous example, you'll see that the traffic light was positioned in the middle of the road, which makes absolutely no sense.
Whereas now Claude thought to itself, okay, I should probably position it sort of overhanging the road, but it's sort of upside down, which I don't love. However, I would say it's a more complex simulation. One of the things Opus said when we ran this prompt in this version is, hey, I've actually worked a little on making the drivers a bit more intelligent. So depending on how a car moves, the cars around it also react, which is a more intelligent simulation than the previous version.
And finally, we ran Claude Opus 4.7 on max effort. This uses roughly 10x the number of tokens and time to create this simulation. As you can see, it's much more detailed. Arguably, this is the best traffic light we've seen. I like that it's hanging upside down, following the laws of physics. We also have this beautiful skyscape in the back, and the cars also reflect this more intelligent motion of vehicles.
So, what does this mean? Well, arguably, the more tokens it spends, the more time it takes. And so over time we might see Claude go from seconds, minutes, or hours of work to even days, weeks, or months of work. This is the METR benchmark. It shows that over generations of models, through a combination of both train-time compute (larger models) and better test-time compute (spending more tokens on our higher-reasoning models), Claude is able to work more autonomously and cover human-level tasks spanning more hours. Mythos, which is one of our latest models, handles roughly 16 hours of human work at a 50% level of accuracy.
Now, test-time compute can be any form of spending tokens. This is typically at inference time, and there are three ways in which we like to break this down. The first way is thinking space for reasoning.
It's basically a scratchpad where Claude considers the question that was asked of it, uses whatever data it has available in the prompt, and thinks about the next steps it should take to solve a problem. Equally, there's tool calling right after. This is Claude's interface with the outside world. In this example, we're asking Claude to do a web search to learn more about the Anthropic Developer Conference. Funnily enough, we're all here right now. Tools can really be anything, though. There are a million types of tools. Interact with your Salesforce, call an MCP server, even write to a file system. All of those things are tool calls. And finally, there's text. This is the output Claude produces whenever you ask it a question and it responds with something. It might be a summary of all the work it did. It might be a question up front in its response to gather more information from the user.
Test-time compute has direct costs in the form of token count and the time it takes. So naturally your mind might be coming to the conclusion of, hey, wait a minute, as a user I want more control over what Claude actually does on a day-to-day basis. There are essentially two ways in which users can change the amount of test-time compute that Claude uses. First, there's effort, which ranges from low to max; depending on the effort you assign the model, it will work for a longer amount of time and spend more tokens. So often you're asking yourself the question of, okay, do I trade intelligence off for speed? Secondly, we have budgets. Budgets are basically a way of assigning stricter constraints to the way Claude works. That might be through max-token constraints or what we call task budgets, which is another feature in our API. I'm going to spend the rest of the session elaborating a little more on effort.
Now, the ideal state is you ask Claude a question and it knows how much effort it should put into it.
But humans always want to have that additional lever they can pull to change the effort over a period of time. When we first released reasoning models, the initial state was that you would ask Claude a question, it would think for a period of time depending on how many tokens you allocated to thinking, and then it would execute a series of tool calls, reading each one, until the output was formulated and you got the response. Now, if you think of the analogy of how humans work, typically we don't do that. When someone asks me a question, I don't stand there processing it and then suddenly go execute a bunch of steps and come back with the answer, right? Instead, and this is what led us to develop interleaved thinking, you do something, you think about it, you do another thing, you think about it, and then you come back with a result. That's exactly what interleaved thinking does. It allows Claude to have thinking steps after every tool call it makes.
Now let's take it a step further. Adaptive thinking is basically an evolution of interleaved thinking where you give even more control to the model as to when it thinks and why it thinks. Depending on the question at hand, Claude will choose to make a tool call, output some text, like that question I was referring to earlier, or think, in whatever order it likes. Looking back at the analogy I was talking about earlier, that reflects even more of how humans work, right? Sometimes I might do three things in a row: if I'm playing tennis, I hit the tennis ball, then run back to the baseline. I'm not thinking in that instance. If I'm doing an academic problem set, though, I am probably thinking at every step of the process. Now, Claude can actually choose not to think at all if it doesn't want to. So in this example, we could have no thinking block. That really just relates to the question you ask of it, right?
Even with humans, if I asked you, what is 10 + 10? You'd immediately respond with 20. Whereas if I asked you to work through a really difficult PhD-level problem set, you'd probably think a lot, but different members of the audience here might think for more or less time. So it really depends on the problem, the model you're using, and how much context you've provided. Adaptive thinking isn't a model router. We're not classifying the request that comes through the door. Instead, it's telling Claude, "Hey, you have this thinking tool." Before, Claude had to think at some point, whereas now Claude doesn't have to think at all if it doesn't need to. We run all of the benchmarks that you're seeing on screen with adaptive thinking, and we've found that it's Pareto efficient relative to interleaved thinking, the former way in which we served our models.
Historically, users have thought about thinking as an effort dial: you turn on thinking for a better answer. That's a reasonable instinct, but in reality a thinking toggle is a poor proxy for the amount of effort that a model should put in. You're not expressing how hard you want Claude to think when you turn a thinking toggle on or off. You're actually just turning off a core capability. As I mentioned, there are these three capabilities: thinking, tool calling, and text. When you turned extended thinking off, you just removed that capability from Claude. That's not an ideal outcome, right? Instead, I'd rather have Claude know that it can think, and work every time with those thinking tools available to it. As an analogy with tool use, we don't tell Claude to either never search or always search the web. We just give Claude a search tool and allow it to reason about when it should search. In a similar way, when we work with our teammates, I don't tell you, hey, don't use your inner monologue. Don't think about this problem set. Just come up with an answer for me.
I'll tell you the constraints of the problem, some sort of knowledge-worker task. You'll go off and execute on it, depending on who you are and what context you have, and then you'll come back to me with an answer, and we collaborate as teammates. Ideally, we want Claude to work in the same way.
So I want to dig a little more into some of the best practices around using effort. This graph shows, as we saw before, effort levels increasing and, alongside that, the capabilities of models. The first thing to think about is that you want to evaluate model performance. Having a really good test set for the problem you're trying to get Claude to tackle, one that poses really difficult challenges to the model, and evaluating how well the model performs at different effort levels, is one of the best ways to figure out what effort level you should start with. Now, one of the things you might notice here is that there are diminishing marginal returns to higher effort levels, depending on the task. In this example we even see that max uses roughly double the tokens of extra high for only a marginal bump in performance. Low effort levels, I would say, can accomplish a lot of things, but you're often trading off intelligence for speed. So you might want to think about low-effort tasks as ones that aren't intelligence-bound: things you can simply accomplish quickly without the model thinking that much.
As a quick tip, I'm going to give an example of low effort actually coming up with a really cool way of working through a problem. You all might be familiar with Claude Plays Pokémon. It's probably my favorite eval.
In this eval, we put Claude into Pokémon Red, gave it access to tools to trigger buttons as on a Game Boy, gave it vision over the game, and had it try to beat the Elite Four, which is the objective of Pokémon if you're not familiar. When we put Claude on low effort, we actually found that it almost tried to speedrun the game. It would use all sorts of mechanisms: using Repels, using Potions to avoid having to go back to the Pokémon Center, using Escape Ropes to get out of caves quickly, and running away from Pokémon battles whenever it encountered one in the shrubs. This was really interesting because on low effort, Claude came up with this unique solution to navigate the game and almost completed it faster than it otherwise would have. So there are some benefits to low effort, because you're explicitly constraining how much the model thinks through the problem set, and it may end up in really unique attractor states.
Now, while evals are always ideal, I understand they're quite hard. I speak to customers a lot and often they don't have the perfect eval. So I want to give you a few quick rules of thumb for thinking about the different effort levels. We're going to start on the right, looking at max. As I mentioned, max effort can typically deliver gains on the hardest tasks, but it might show diminishing marginal returns. So I wouldn't recommend starting here unless you absolutely know that your use case needs that intelligence: you know the problem set is hard, and you know Claude is going to have to think for a long time. Extra high is where we've settled as the default for Claude Code and claude.ai. We would argue that this is one of the best trade-offs between intelligence, speed, and number of tokens. As you move down to high, this is still a good balance of token usage and intelligence.
And I would argue that if your use case requires any intelligence, you should probably land here. It will be faster, of course, than the other two I just referred to. Medium and low are ways to toggle down the number of tokens used. As a result, at low, as I mentioned, you're really looking at latency-sensitive use cases, such as classification, summarization, or data extraction.
Now, as I mentioned before, we have these two levers that we think about at Anthropic as we develop our models. One is train-time compute and one is test-time compute. So you might be thinking to yourself, okay, how do I know whether I should use a really small model like Haiku and make the effort level really high, versus using a really big model and making the effort level really low? What are the differences there? I want to give you a few guidelines for thinking through that. On the left you're seeing the Opus simulation that we created before, and on the right you're seeing the simulation from Haiku 4.5. As you can see, Haiku spent roughly half the time developing the simulation but the same number of tokens, and I would say the result is not nearly as good. I don't even know if those are cars, to be honest. So the conclusion we come to here is that, arguably, if the question you're asking of Claude needs any intelligence at all, you're often better off using the larger model. Specifically, you're better off using it even if you have the effort at low. The way you should think about using smaller models, though, is for low-intelligence use cases where you don't care as much about the outcome because it is so simple that you know Claude can handle it. Again, use evals to figure out what the best way to do things is. If you can run Haiku, Sonnet, and Opus across the range of all effort levels, that's the perfect world. Now, obviously, you won't have all of those resources every time.
And so, I'll try to give you some rules of thumb in a second. You should give Claude that space to reason. Give it the scratchpad so it knows that it can use that thinking tool when it needs to. You can control that length through the effort levels that I described. Evals are often the best way to find your ideal balance. Finding the hardest evaluations is always best, so that you know they are actually representative of the things you want Claude to do in the field. And then finally, when in doubt, go with extra high. It's the default that we've set for our products, and I would argue that it's a great Pareto-efficient outcome across latency, number of tokens, and intelligence.
The ideal world is that you set the bar and the budget for the way Claude works. Let's say I want Claude to work on some really long-horizon task, a day, maybe a week. I want to be able to say to Claude, "Hey, I'm only going to spend this amount on whatever you do," or "Hey, only take this long, a week, to do it," and then eventually Claude just knows how to allocate that compute appropriately. It knows how many tokens it should spend, and then it goes and executes, returning the result given the constraints you provided. Over time, what we want to do is improve Claude's performance at recognizing which tasks are actually important, and allow Claude to allocate its resources appropriately to solve them.
Thanks so much for listening. I'm going to be around the conference if anyone has any questions, and I hope you enjoy the rest of the conference.
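
The closing idea of a user-set budget for long-horizon work can be sketched as a small tracker that gates each step on both tokens and wall-clock time. This is a hypothetical illustration only: `TaskBudget` and `run_steps` are invented names, the per-step costs are assumed to be known in advance, and nothing here reflects how the task budgets feature mentioned in the talk is exposed in the API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TaskBudget:
    """User-defined limits for one task (hypothetical structure)."""
    max_tokens: int
    max_seconds: float
    tokens_used: int = 0
    seconds_used: float = 0.0

    def can_afford(self, tokens: int, seconds: float) -> bool:
        """True if a step with this cost fits inside both limits."""
        return (self.tokens_used + tokens <= self.max_tokens
                and self.seconds_used + seconds <= self.max_seconds)

    def charge(self, tokens: int, seconds: float) -> None:
        """Record the cost of a completed step, refusing to overspend."""
        if not self.can_afford(tokens, seconds):
            raise RuntimeError("step would exceed the task budget")
        self.tokens_used += tokens
        self.seconds_used += seconds


def run_steps(budget: TaskBudget,
              steps: List[Tuple[str, int, float]]) -> List[str]:
    """Run (name, tokens, seconds) steps in order; stop before the first that does not fit."""
    done: List[str] = []
    for name, tokens, seconds in steps:
        if not budget.can_afford(tokens, seconds):
            break
        budget.charge(tokens, seconds)
        done.append(name)
    return done
```

For example, with a budget of 10,000 tokens and 60 seconds, steps costing 2,000, 3,000, and 6,000 tokens would run the first two and stop before the third. In the world the talk describes, the model itself would decide how to split the budget across steps rather than follow a fixed list.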
