The One Habit That Doubles Your Claude Code Session Limit

Nate Herk | AI Automation | 00:10:43 | May 21, 2026

Chapters: 9

The speaker celebrates saving millions of tokens through prompt caching with Claude, explains that caching happens automatically, and introduces a token dashboard to help track session efficiency.

Prompt caching in Claude Code can massively cut your session costs, and Nate Herk shows how to monitor, manage, and sustain long coding sessions with practical tips and a free token dashboard.

Summary

Nate Herk of AI Automation breaks down prompt caching in Claude and Claude Code, showing real-world gains like 91 million cached tokens in a day and 300+ million in a week. He emphasizes that caching is automatic in Claude Code and Claude, but understanding the basics helps you keep session limits under control. Nate explains the difference between cache create (a one-time cost for writing content into the cache) and cache read (tokens reused at a 10x cheaper rate), and clarifies how the 1-hour TTL on Claude subscriptions contrasts with the default 5-minute TTL on API and sub-agent usage. He shares practical habits to avoid waste: don’t let a session sit idle for more than an hour, use a session handoff with a summary to resume later, and prefer projects for large document handling in Claude chat. The video covers how model switches, such as the opusplan setting that uses Opus for planning and Sonnet for execution, reset the cache, and why edits to CLAUDE.md don’t break the cache mid-session: they only apply after a restart. Nate offers a free token dashboard and a session handoff skill through his free Skool community, and links an article by Thariq of Anthropic for deeper context. The takeaway is the 80/20 of prompt caching: know enough to save tokens without getting overwhelmed by every edge case. It’s a fast, actionable guide for developers aiming to stretch their session limits and reduce costs.

Key Takeaways

Cached tokens cost only 10% of normal input, so heavy cache use translates into substantial savings (e.g., 91M cached tokens cost about the same as 9M normal input tokens).
The Claude Code cache expires after 1 hour of inactivity on a subscription; API and sub-agent usage defaults to a 5-minute TTL, which can drive higher costs if you’re not mindful.
Cache create vs. cache read: first-time writes into the cache are a one-time cost, while subsequent reads reuse cached tokens at a fraction of the price.
Session handoff is Nate’s recommended alternative to /compact: generate a summary, clear the session, and resume with context preserved, often quicker than a full compact.
Switching models breaks the cache because each model has its own. The opusplan setting uses Opus in plan mode and Sonnet for execution, so each plan toggle starts a fresh cache.
Projects vs. chat: for large document intake, projects likely cache documents more effectively than adding them ad hoc in Claude chat, although this is not documented.
A free token dashboard and a session handoff skill are available via Nate’s free Skool community. The dashboard tracks tokens per local device, so each machine shows its own totals.
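The 10% figure in the first takeaway can be checked with a few lines of arithmetic. This is a minimal sketch that assumes cache reads are billed at a flat 10% of the normal input rate, as stated in the video. The function names are made up for illustration, and the one-time cache-create cost is ignored.

```python
# Assumption from the video: a cached (cache read) token costs about 10%
# of a normal input token. Integer math avoids floating-point rounding.
CACHE_READ_PERCENT = 10


def input_equivalent_tokens(cache_read_tokens: int) -> int:
    """Number of normal input tokens that would cost the same as the cached ones."""
    return cache_read_tokens * CACHE_READ_PERCENT // 100


def tokens_saved(cache_read_tokens: int) -> int:
    """Input-equivalent tokens avoided because the reads came from the cache."""
    return cache_read_tokens - input_equivalent_tokens(cache_read_tokens)


if __name__ == "__main__":
    day = 91_000_000  # the single-day cache read total quoted in the video
    print(input_equivalent_tokens(day))  # 9100000
    print(tokens_saved(day))             # 81900000
```

With these assumptions, 91 million cached tokens are billed like roughly 9 million fresh input tokens, which matches the number Nate quotes.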

Who Is This For?

This is essential viewing for developers using Claude or Claude Code who want to maximize session longevity and minimize token costs, especially those who juggle multiple sessions, documents, or API usage.

Notable Quotes

"On this day I saved 91 million tokens because of cache read."

- Nate starts with a striking example of caching impact and token savings.

"Cached tokens only cost you 10% of normal input."

- Explains the core economic benefit of caching.

"If you wait an hour or longer, you’re going to pay more for it."

- Describes the TTL effect on Claude subscriptions.

"Each model has its own cache. Switching with /model means the next request reads the entire conversation history with no cache hits."

- Highlights how model switches reset the cache and why it matters.

"This token dashboard... will pull in all of your past sessions so you can see your tokens."

- Promo for the free token-tracking tool and community resources.

Questions This Video Answers

How does Claude Code caching actually affect token usage in practice?
How can I use session handoff to preserve context without losing cached data?
What's the difference between cache create and cache read in Claude's pricing model?
Why does switching models or using Opus plan reset the cache, and how can I work around it?
Where can I find Nate Herk's free token dashboard and session handoff skills for Claude?

Claude Code, prompt caching, token dashboard, cache read vs cache create, TTL (time to live), session handoff, /compact, /clear, Thariq/Anthropic, Opus plan vs Sonnet execution

Full Transcript

So, look at this. On this day, I saved 91 million tokens because of cache read. And in the past week, I've saved over 300 million tokens because of it. Now, don't freak out. This isn't anything that you have to go change. This is happening automatically if you are using Claude Code or Claude. And I know that the concept of prompt caching might seem a little bit overwhelming, but today I'm going to make it as simple as possible and only really tell you what you need to know in order to make sure that you are saving your session limits and saving tokens. I'll also give you guys this entire token dashboard for free, so you can actually start tracking your tokens a little bit better. Anyway, so let's talk about prompt caching, why your sessions burn out, and how to stop it.

So, what does caching actually cost you? Well, cached tokens only cost you 10% of normal input. So, all the tokens that are getting cached are saving you a ton of money. So, if we go back to this example, on this day when I had 91 million tokens cached, that cost me only as if I was processing about 9 million of those tokens.

The cache window on a Claude subscription is an hour, meaning if you're working with Claude Code and you don't touch it for an hour, and then you send another message, everything in that session gets un-cached. So, if you leave a session sitting for an hour or longer, then you're going to pay more for it. And if you're using Claude via API or sub-agents, then the TTL, or the time to live, is only 5 minutes. You can change that, but it's just a little bit more expensive. You could bump it up to an hour if you want. But, for Claude Code inside of your terminal or your extension, whatever it is, that's an hour.

And now, here's a quote from Thariq from Anthropic. He said that we actually run alerts on our prompt cache hit rate and declare SEVs if they're too low.
So, basically them saying we take this stuff really, really seriously, and if we see that the cache hit rate isn't very high for Claude Code users, then we do something about it immediately. And that's very nice of them, but also, of course, it benefits themselves because with a high cache hit rate, Claude Code feels faster, their serving cost is lower, subscription limits feel more generous, you know, because you're using less, and long coding sessions stay practical. And then, if you have a low cache hit rate, this is what happens. And obviously, it's just a lose-lose for everybody.

And that's why I said prompt caching can get very, very complex. And if you want to check out more, then I'll link this article right here, where Thariq really goes into some depth, but if you read this, at least when I did, I was like, okay, this is a little bit overwhelming. I have a feeling I don't actually need to know all of this, but I do need to know at least a little bit, at least, you know, the 80/20 of prompt caching so that I can get the most out of my session limits. And that's what I'm going to break down today.

So, let's take a look at an example of how this actually grows. So, by default, when you shoot off a message to Claude, there's going to be some information that needs to be cached right away. And actually, let me just switch back to one of Thariq's graphics real quick. You can see here that the base system instructions get globally cached. We have tools like read, write, bash, grep, glob globally cached. We have per-project things like CLAUDE.md and memory, and that gets cached per project. We've got session state, and then we have user messages, which grow each turn.

So, now that we take this into context, and we flip back over here, this is what it looks like. This is an example where we have four turns. So, on turn one, there's no cache. Basically, we're matching on a prefix.
So, you don't really have to worry about what that means. But, I might mention that later. So, anyways, on turn one, there's nothing, right? We're opening up a fresh session. We load in the system prompt, the project context, and we shoot off our first message. And all of this is kind of in this brown highlight border, which means that this is new, and it has to be fully processed, and it's being written to the cache here.

So, before I continue down this graphic, in this dashboard, you can see that we have the difference between cache create and cache read. So, on these days, you can see what are my input tokens, my output tokens, and my cache create. And then over here, you can see my daily cache reads. And just a quick explanation, a cache create is writing something into cache for the first time. It's a one-time cost, and it pays off the next turn, unless, of course, everything gets un-cached. And the cache read is tokens that Claude reused from a cache, like your CLAUDE.md or some of the files or some of the global system instructions. And these are the things that are 10 times cheaper than fresh input.

So, anyways, on turn two, given that we're within that 1-hour TTL window, everything here is already in context, so it's cached. And then all that Claude actually has to process for the first time is reply one and message two, and it caches that. So, then down here in turn three, all of that's cached, and we are bumping up a reply and a message, and those are the only things that get processed fresh each time. But, if we waited an hour and then we sent another message, or if we changed the system prompt, then everything from the very beginning has to get fully recached. So, imagine if you were on message, you know, 16, and you're way, way, way over here on the right, and you change the system prompt, or you wait an hour, then everything getting recached is going to be a pretty expensive move that you just made.
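The four-turn walkthrough above can be sketched as a toy simulation. This is not how Claude Code accounts for tokens internally; it assumes a fixed-size system and project prefix, a constant number of new tokens per turn, and a hypothetical `cold_turn` parameter that marks a turn sent after the cache expired or the system prompt changed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TurnUsage:
    turn: int
    cache_read: int    # tokens reused from the cache on this turn
    cache_create: int  # tokens written to the cache on this turn


def simulate_session(prefix_tokens: int, tokens_per_turn: int, turns: int,
                     cold_turn: Optional[int] = None) -> List[TurnUsage]:
    """Toy model: each turn re-sends the whole context; only the new tail is a cache create."""
    usage = []
    cached = 0  # how many leading tokens are already in the cache
    for turn in range(1, turns + 1):
        context = prefix_tokens + tokens_per_turn * turn
        if turn == cold_turn:
            cached = 0  # TTL expired or prefix changed: nothing can be reused
        usage.append(TurnUsage(turn, cache_read=cached, cache_create=context - cached))
        cached = context  # after the turn, the full context is cached
    return usage


if __name__ == "__main__":
    for row in simulate_session(20_000, 2_000, turns=4):
        print(row)
    # Same session, but turn 4 arrives after the cache went cold.
    print(simulate_session(20_000, 2_000, turns=4, cold_turn=4)[-1])
```

In the warm run, every turn after the first only creates the 2,000 new tokens. In the cold run, turn 4 has to re-create all 28,000 tokens, which is the expensive move described in the transcript.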
So, anyways, once again, we have the system layer, the project layer, and the conversation layer. The system layer has instructions, tool definitions, output style, and here's where it might break. The project layer has CLAUDE.md, memory, and rules, and then here's when that might break. And then we have, of course, the conversation, which is just the replies and the messages, which gets recached every time, but that's how it should be.

So, here's where there's been some confusion among the community. So, how long does the cache snapshot live, which is called the TTL, the time to live. So, on your Claude subscription, you have an hour by default because it uses your subscription. But, if you go over that weekly limit and you are now playing in your extra usage territory, where you are paying per token at API rates, then by default that will be 5 minutes, which is very dangerous if you're managing multiple sessions and you're constantly recaching everything because 5 minutes is passing. You've got to be careful about that.

And people were kind of suspicious. I don't know if you remember a month or so ago when everyone was complaining about their Claude subscriptions, how quickly they were eating them up. People thought maybe that they switched the cache TTL from an hour to 5 minutes without saying anything to anybody. It turns out they didn't, so it is an hour, but, you know, there was a lot of confusion around that. And I get why, because honestly it's not super clear. Like, if you're on API, you have 5 minutes by default, but you can pay more and you can do an hour, and then your sub-agents on any plan are going to be 5 minutes. And for some reason, all of this is documented for Claude Code and the API, which are two very different things. But Claude.ai, like on the web, we don't know exactly how that works. At least I haven't found documentation on that exactly.
I'm assuming it's the same as your subscription, but I don't know that for a 100% fact.

Anyways, three habits that cover 95% of people. Don't pause too long. So, if you've gone over an hour on a session, just hand it off to a new session. Obviously, start fresh when you switch tasks. So, do /compact, which will break the cache, or do a /clear. Or you can also use my session handoff skill, which I will include as well for free. So, both the token dashboard GitHub repo and the skill will be in my free Skool community. The link for that's down in the description.

But, basically what that means is, let's say right here, I've got this project which helped me build this HTML file you guys are looking at. It's got 205,000 tokens in here. And if I come in here and just do a session handoff, this basically summarizes everything we've done, all the important files that we've built, all of the open decisions, exactly where to pick back up. And then, I basically am able to just copy that summary, do a /clear, and then keep going. And it feels like I haven't actually lost anything. So, that has been basically my replacement for doing /compact. I've just enjoyed doing this better. And sometimes the compact takes a long time. This typically doesn't take over a minute. There you go. So, that is my session handoff. I do a /copy, and then I just go ahead and clear that, paste it in, hit enter, and now I'm basically right back where I was.

And then, this last one is for if you're using Claude chat specifically. If you're going to be pasting big documents in there, you're probably better off doing a project, because like I said, I don't know exactly how the caching works in Claude chat, but we do have some confidence in saying that with projects, those files are cached a little bit differently, and probably more optimized for storing a bunch of documents compared to just dropping them into your Claude chat. So, keep it alive, keep it focused, and start fresh when you switch.
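The TTL rules and the "don't pause too long" habit can be written down as a small decision helper. This is a sketch only: the two TTL values are the ones quoted in the video, while `cache_is_warm` and `next_step` are hypothetical names, and the real cache is managed on Anthropic's side rather than by anything you run locally.

```python
# TTL values as described in the video (assumptions, not read from any API).
SUBSCRIPTION_TTL_SECONDS = 60 * 60  # Claude Code on a subscription: 1 hour
API_DEFAULT_TTL_SECONDS = 5 * 60    # API and sub-agent usage by default: 5 minutes


def cache_is_warm(idle_seconds: float, ttl_seconds: int) -> bool:
    """True if the last message was sent less than one TTL ago."""
    return idle_seconds < ttl_seconds


def next_step(idle_seconds: float, ttl_seconds: int, switching_tasks: bool = False) -> str:
    """Map the three habits onto a suggestion for the current session."""
    if switching_tasks:
        return "start fresh"
    if cache_is_warm(idle_seconds, ttl_seconds):
        return "keep going"
    return "hand off to a new session"


if __name__ == "__main__":
    ten_minutes = 10 * 60
    print(next_step(ten_minutes, SUBSCRIPTION_TTL_SECONDS))  # keep going
    print(next_step(ten_minutes, API_DEFAULT_TTL_SECONDS))   # hand off to a new session
```

The same ten-minute pause is harmless on a subscription but already past the default window on API billing, which is why the 5-minute TTL matters when you juggle several sessions.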
Now, there's a few other things that were a little bit confusing to me as far as what breaks the cache. So, the first one is if you switch the model. So, you know, if you're in here and you're talking to Claude, hello, hello, hello, and then you go in here and you do a /model and you actually switch the model, that's going to recache everything, because if you remember, earlier I said it's prefix matching, which I'm not going to dive into right now, but if you switch the model, then you are essentially switching the prefix and it can't match on that same cache. So, if you switch the model, you are recaching everything.

Now, I do want to apologize for something here, because if you do /model opusplan, which is something that I've shown before in token hacks videos, this basically means it uses Opus for plan mode and then it switches to Sonnet for the execution, but if you do that, just keep in mind that's actually going to break the cache because you're switching models halfway through. So, right here you can see, "Each model has its own cache. Switching with /model means the next request reads the entire conversation history with no cache hits even though the content is identical. The opusplan model setting resolves to Opus during plan mode and Sonnet during execution. So, each plan toggle is a model switch and starts a fresh cache." So, it's very interesting, because typically the point of that is to save your session limit, and I think ultimately in the long run it does, but it is important to understand that doing that does reset the cache.

Now, what you can do is you can edit your CLAUDE.md, and you can do that mid-session, because the edit actually doesn't apply until you restart that session, so the cache stays safe. And then of course, the Claude.ai projects caching. It's not exactly documented, but I'm pretty confident that it does help to drop docs in projects rather than in the chat.
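Why a model switch or a changed system prompt wipes out every cache hit is easier to see with a toy prefix cache. The sketch below is an illustration of prefix matching in general, not Anthropic's implementation: `ToyPrefixCache` and `prefix_key` are hypothetical, and the only assumption taken from the transcript is that the model is part of what the cache is keyed on.

```python
import hashlib
from typing import Sequence


def prefix_key(model: str, blocks: Sequence[str]) -> str:
    """Hash the model name plus an ordered run of context blocks."""
    digest = hashlib.sha256(model.encode())
    for block in blocks:
        digest.update(b"\x00")  # separator so block boundaries matter
        digest.update(block.encode())
    return digest.hexdigest()


class ToyPrefixCache:
    """Remembers every prefix it has seen, keyed by model and content."""

    def __init__(self) -> None:
        self._seen = set()

    def request(self, model: str, blocks: Sequence[str]) -> int:
        """Return how many leading blocks were cache hits, then cache all prefixes."""
        hits = 0
        for length in range(len(blocks), 0, -1):  # longest matching prefix wins
            if prefix_key(model, blocks[:length]) in self._seen:
                hits = length
                break
        for length in range(1, len(blocks) + 1):
            self._seen.add(prefix_key(model, blocks[:length]))
        return hits


if __name__ == "__main__":
    cache = ToyPrefixCache()
    turn1 = ["system prompt", "project context", "message 1"]
    turn2 = turn1 + ["reply 1", "message 2"]
    print(cache.request("opus", turn1))    # 0: fresh session, nothing cached yet
    print(cache.request("opus", turn2))    # 3: the turn-1 prefix is reused
    print(cache.request("sonnet", turn2))  # 0: same content, different model
```

Because the key covers everything from the first block onward, changing anything early in the context, including the model, leaves no matching prefix, while appending new messages at the end keeps all earlier blocks reusable.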
But anyways, this token dashboard, like I said, is very helpful to just be able to get a little bit more visibility into your tokens. This does track your tokens on a local device. So, if you switch over to a laptop, then your dashboard is going to look different than on your main PC or whatever you use. It's very, very simple. It is a GitHub repo. You will go to my free Skool community, the link is in the description, you'll click on classroom, you'll click on all YouTube resources, and then you'll be able to find it right in there. And once you get that GitHub repo, all you have to do is give the link to Claude Code and say, "Hey, this is a token dashboard. Set this up on a local host." Boom, you've got it open. And it will pull in all of your past sessions. So, it's not like you're going to start fresh as soon as you link this repo. It will read your past files and it will pull in your tokens. And then of course, I will also include that session handoff skill that I just mentioned to you guys.

So, I know this one was super quick. Hopefully this one was helpful though. It's just important, like I said, when I hear about stuff like this, I love to understand it to the point where I know how to use it and I know what's going on under the hood. But truthfully, if I looked at some of these other articles, how in-depth they go and how much nuance there is, most of the stuff right now, I just don't need to know because I'm not using the API in this way super heavily. So, the reason I wanted to throw that out there is because it's important to stay updated and follow things, but just understand what you really need to know at its core.

So, if you guys enjoyed the video or learned something new, please give a like. It helps me out a ton. And as always, I appreciate you guys making it to the end of the video, and I'll see you on the next one. Thanks, guys.
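For readers curious what a dashboard like this computes, here is a minimal sketch of the daily roll-up. It is not Nate's dashboard code. The four usage field names match the `usage` object returned by the Anthropic Messages API, but the record shape (a `day` string plus a `usage` dict), the sample numbers, and the hit-rate definition are assumptions made for illustration.

```python
from collections import defaultdict

USAGE_FIELDS = (
    "input_tokens",
    "output_tokens",
    "cache_creation_input_tokens",  # cache create
    "cache_read_input_tokens",      # cache read
)


def daily_totals(records):
    """Sum the usage fields per day. Each record is {'day': str, 'usage': dict} (assumed shape)."""
    totals = defaultdict(lambda: dict.fromkeys(USAGE_FIELDS, 0))
    for record in records:
        day_totals = totals[record["day"]]
        for field in USAGE_FIELDS:
            day_totals[field] += record["usage"].get(field, 0)
    return dict(totals)


def cache_hit_rate(usage):
    """Share of input-side tokens served from the cache (one possible definition)."""
    read = usage["cache_read_input_tokens"]
    total_input = read + usage["cache_creation_input_tokens"] + usage["input_tokens"]
    return read / total_input if total_input else 0.0


if __name__ == "__main__":
    # Made-up sample data: two turns of one session on the same day.
    records = [
        {"day": "2026-05-20", "usage": {"input_tokens": 100, "output_tokens": 500,
                                        "cache_creation_input_tokens": 22_000,
                                        "cache_read_input_tokens": 0}},
        {"day": "2026-05-20", "usage": {"input_tokens": 100, "output_tokens": 400,
                                        "cache_creation_input_tokens": 2_000,
                                        "cache_read_input_tokens": 22_000}},
    ]
    totals = daily_totals(records)
    print(totals["2026-05-20"])
    print(f"{cache_hit_rate(totals['2026-05-20']):.1%}")
```

A long, uninterrupted session pushes the hit rate up because cache reads grow every turn while cache creates stay small; an expired or reset cache shows up as a spike in cache create.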