I Burned 500 Million Tokens Last Week. Do You Know Yours?
The speaker highlights Microsoft’s huge capex plan of $190B and argues that capacity constraints extend beyond GPUs to the broader chip and packaging supply chain, shaping AI vendor strategies and procurement discussions.
AI capacity constraints sit deeper in the supply chain than logic chips alone: memory, packaging, power, and data-center readiness are the bottlenecks, and your contracts must reflect that reality.
Summary
Nate B. Jones breaks down why AI’s real bottlenecks sit below GPUs, in a complex factory-like supply chain. He uses Microsoft’s $190B capex as a lens to show how hyperscalers treat AI infrastructure as physical asset deployment, not just software licensing. The talk shifts from glossy AI demos to the tangible steel, silicon, and power that actually enable tokens to be produced at scale. Jones highlights high-bandwidth memory, packaging, substrates, optics, power, and cooling as the choke points, with memory and packaging dominating capacity constraints. He explains that AI contracts now ride on the same rails as supply agreements, including allocation, capacity terms, and fallback, since vendors depend on hyperscaler capacity. The piece emphasizes that executives must design procurement and routing strategies, forecast token usage per workflow, and keep engineers at the table to verify usable allocations. He also points to real-world numbers: Meta’s $125–145B spend target, Microsoft’s 40% quarter-over-quarter gain in Copilot inference throughput, and Nvidia’s NVLink-connected GB200 NVL72 rack as the architectural building block. Finally, Jones offers practical questions for leadership: reserved capacity vs best efforts, routing to cheaper models, and detecting hidden human supervision in production workloads. The overarching message is clear: AI strategy today is a production line, a factory of tokens, where planning, power, and capacity are as critical as the models themselves.
Key Takeaways
- Capacity constraints in AI sit in the full manufacturing stack—memory, packaging, interposers, optics, power, and data-center readiness, not just GPUs.
- High-bandwidth memory is the single most constrained input; without fast data movement, compute sits idle even with GPUs available.
- Per the Epoch AI estimate cited in the talk, the four largest AI chip designers consumed about 90% of global packaging capacity and HBM supply in 2025 but only about 12% of advanced logic die production, so the bottleneck is packaging and memory rather than logic.
- Token spend can be enormous and is a leading indicator of scale: the speaker cites his own gross token spend of almost 500 million tokens in one week.
- Data-center timelines have stretched from the traditional 12–18 months to multi-year builds (up into the four-year range for 500 MW-plus AI campuses) because of construction complexity and power transmission and interconnection.
- AI contracts should be treated as supply contracts—allocation, capacity terms, fallback, and routing need explicit language and measurable SLAs.
- Forecasting should go beyond seats and licenses to tokens per workflow, context length, concurrency, latency tiers, and failure/retry dynamics.
Who Is This For?
C-suite and procurement leads at AI-first companies, CTOs, CFOs, and engineering leaders who must translate AI ambitions into concrete procurement and capacity plans. Essential viewing for teams negotiating AI infra, capex, or vendor agreements who want to avoid capacity shortfalls and wasted spend.
Notable Quotes
"Capacity constrained. The most valuable software company on the planet with $190 billion to spend cannot get enough capacity to meet its own demand."
—Jones underscores the scale of the mismatch between capex and actual capacity.
"Every answer from a model is the output of a production chip system. ... Chips run the math. High bandwidth memory feeds the chips."
—Links model outputs directly to physical hardware and memory architecture.
"The real constraint is firm power at the right location on the right schedule."
—Power availability and timing are pivotal to data-center readiness.
"My gross token spend over the last week was almost 500 million tokens in one week."
—Illustrates the scale and daily relevance of token-based workloads.
"Production traffic isn’t a demo scenario. If you cannot see the human supervision, you’re going to have trouble pricing it, scaling it, or removing it."
—Warns about hidden dependencies that affect procurement and operation.
Questions This Video Answers
- What parts of the AI hardware stack are actually limiting throughput besides GPUs?
- How do I negotiate guaranteed capacity or allocation terms in AI vendor contracts?
- Why is memory a bigger bottleneck than processing power for AI inference today?
- How should we forecast AI capacity needs per workflow rather than per seat?
- What should an AI procurement program include to avoid supply chain bottlenecks in 2026?
Tags: AI infrastructure, AI supply chain, High-bandwidth memory, Chip packaging, Data center buildouts, Hyperscalers, Token economics, AI vendor contracts, Inference efficiency, Industrial AI economics
Full Transcript
On Microsoft's Q3 earnings call on April 29th, Satya Nadella told investors the company will spend $190 billion on capex in this calendar year and still expects to be capacity constrained through year end. Capacity constrained. The most valuable software company on the planet with $190 billion to spend cannot get enough capacity to meet its own demand. Let that sink in for a second. When Satya Nadella says capacity constrained, he does not mean Microsoft ran out of GPUs. I want to be precise. The supply problem is a layer below the GPU. It's about whether you can actually manufacture enough chips packaged with the memory they need to keep up with the kind of workloads modern AI models actually run.
Remember when we were in a bubble about a year ago? Yeah, that didn't last very long, did it? The constraint here isn't logic chips. It's the part of the supply chain almost nobody in the boardroom is really fluent in yet. And that's the gap I want to walk you through today because it's about to land on every leader's AI vendor conversation. And I know what some of you are thinking: "Nate, calling AI industrial, that's not really new. Like, I've heard that before." Mary Meeker did 340 slides on this last year. On Nvidia earnings calls, Jensen hammers this every quarter.
Yes, that's true. But we are not fully tracking what the capex spend Satya announced means for all of us, all of our AI vendors, the entire AI supply chain, because six months ago, an AI vendor contract was structured a lot like a software contract. Now with the hyperscalers spending at this scale and still rationing really heavily, your AI vendor contract is effectively tied into the hyperscalers. It is a supply contract in everything but name. It has allocation. It should have capacity terms. It should have fallback. It should have line items that we didn't have to think about a few months ago because we are so capacity constrained.
Now, there are three things I want you to walk away with here. First, when we talk about capacity constrained, what do we mean underneath the headlines? Second, where is the bottleneck in the AI supply chain really? And what are the numbers behind that? And third, what are the questions that guide your next AI investment that should fall out of that conversation and that understanding? So that's what we're going to get to today. Now, why do I care so much about investments and software and the AI supply chain? Because fundamentally, I think that a lot of our experience with AI is tied down by our assumptions about how software has always been sold.
I think that shapes our assumptions about who gets to be at the table for a procurement decision. Namely, I think developers are underrepresented. I think it also shapes our ability to understand what should be in a good agreement for an AI native software company. Because if we're doing traditional software agreements, we can reasonably assume that the vendor working with us, the supplier working with us has the ability to control their own destiny enough that they can write that agreement. Now, it's a much more complicated conversation with AI native tools because you have to ask yourself, one, do I need this tool at all?
And I've had a whole video that I made on that. You can find that one. And two, if I do need this tool, I determined that I need it. In that situation, is this vendor appropriately accounting for the fact that they need to get hyperscaler capacity to allocate tokens to me as a part of the service? Because they're no longer serving deterministic software. They are serving inference. They are serving intelligence. I am buying intelligence from them. I need to have contractual terms that protect me and help me to understand that. And if you think, "Wow, that's boring.
That's something only the CFO should think about." You've got another thing coming. If we don't get those terms right, you can't roll it out to developers correctly. You can't roll it out in your AI operations correctly. You're going to run out of capacity when you really need it. It affects the entire business. It is absolutely an entire leadership conversation. And it's also a conversation where you need to have engineers at the table who can then speak to whether what is being allocated is actually usable. Have you looked at your engineers' token spend recently? You should because it may surprise you.
My gross token spend over the last week was almost 500 million tokens in one week. And if you have engineers like that on your team, you'd better make sure that you are equipping them with tools that are equipped to handle that. Okay, let's jump into it. Stop thinking of AI as a software product with a fancy backend. The visible product may look like software. You have ChatGPT, Copilot, Gemini, Claude. These look like applications that you open, and non-technical employees may think of them that way, but the constraint underneath them is actually all the way down to the metal in the chip.
Every answer from a model is the output of a production chip system. So, chips, memory, packaging, networking, power, cooling, land, data center construction, and operations talent. A user will see a paragraph generated on a screen, but every word in that paragraph came out of a factory. And when I say an AI factory, I am not talking about a building full of GPUs only. I mean the whole production system that turns demand into served tokens. I mean chips run the math. High bandwidth memory feeds the chips. Packaging connects them together. Networking moves data across the cluster.
Power keeps the racks alive. Cooling keeps them at temperature. Operations keeps the whole thing utilized. The most valuable software companies in the world are spending hundreds of billions of dollars a year because they have to operate factories. Now, intelligence at scale has a very physical bill of materials that software never did. Microsoft's not alone. Everybody's figured this out. Meta said in April it's going to spend $125 to $145 billion this year. They had to raise that guidance because component prices are up and they need more data centers. They can't compromise on that. By the way, Meta is losing the AI race and they still need to spend that much.
There is no way out of this. Amazon landed more than 2.1 million AI chips in the last 12 months. More than half of those were their own Trainium silicon. On top of that, they've got multi-gigawatt commitments from Anthropic and OpenAI, plus more than a million Nvidia GPUs they're going to deploy through 2027. Google did $185 billion in spend last year, and I covered that one back in February. The pattern is bigger than any single company. The pattern here underscores how big a shift we're seeing in traditional software companies. We need to stop assuming that these companies are software companies and treat them as if they're physical infrastructure companies.
That is how their unit economics work now. It is a totally different world. And their decision to move into that world and invest heavily is shaping the intelligence that the rest of us are getting either directly from them or from vendors who build on top of their stacks. It shapes everything right now. If a vendor tells you they're investing in AI infra, they typically mean the very thin layer on the top of this whole factory system. I want to take you on a little bit of a deeper tour because vendors can't control what I'm about to tell you about by and large.
Not unless they're hyperscalers and even then you have to be scaled enough to have the conversation. Let's start by looking at the physical unit of infrastructure that drives this entire factory world. I'm not talking about a GPU or a TPU. That's the engine. I'm talking about the module. So Nvidia's GB200 NVL72 is a great example of what an industrial module that drives this world actually looks like. It's a liquid-cooled rack-scale system connecting 72 Blackwell GPUs and 36 Grace CPUs into a single what's called NVLink domain. It comes with 13.5 terabytes of HBM3e memory, 576 terabytes per second of memory bandwidth, and Nvidia talks about it as the infrastructure for real-time trillion-parameter inference.
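To make those rack-level figures concrete, here is a small arithmetic sketch that spreads the quoted totals evenly across the 72 GPUs. The even split and the decimal TB-to-GB conversion are simplifying assumptions, and the function name is illustrative only.

```python
# Rack-level figures quoted in the talk for one GB200 NVL72.
GPUS_PER_RACK = 72
HBM_TOTAL_TB = 13.5          # total high-bandwidth memory in the rack
HBM_BANDWIDTH_TBPS = 576     # aggregate memory bandwidth in TB/s


def per_gpu_share(rack_total: float, gpus: int = GPUS_PER_RACK) -> float:
    """Divide a rack-level figure evenly across the GPUs (an assumption)."""
    return rack_total / gpus


# Decimal units assumed: 1 TB = 1000 GB.
hbm_per_gpu_gb = per_gpu_share(HBM_TOTAL_TB) * 1000
bandwidth_per_gpu_tbps = per_gpu_share(HBM_BANDWIDTH_TBPS)

print(f"HBM per GPU: {hbm_per_gpu_gb:.1f} GB")
print(f"Memory bandwidth per GPU: {bandwidth_per_gpu_tbps:.1f} TB/s")
```

Seen this way, each GPU in the module arrives with a large block of memory and bandwidth attached, which is why the module, not the bare chip, is the unit that has to be manufactured.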
It's the core building block of an AI factory. It's a whole lot more than a graphics card because a chip alone doesn't produce intelligence at scale. It needs the memory close to it. It needs that packaging. It needs that networking. It needs a place to run. And memory is part of why this whole thing is getting complicated right now. Right? High bandwidth memory, as I've talked about, is the single most constrained input in the whole supply chain. If you are not able to move data fast enough, all of your compute will sit idle. And moving data fast requires a good memory stack.
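The claim that compute sits idle without fast data movement can be illustrated with a simplified roofline-style estimate. This is a sketch under stated assumptions, not a benchmark: it assumes every decode step reads all model weights from memory once, ignores KV-cache traffic and any overlap tricks, and uses hypothetical model and hardware numbers (only the 8 TB/s per-GPU bandwidth is derived from the rack figures quoted earlier).

```python
def decode_step_times(weight_bytes, flops_per_token, bandwidth_bytes_per_s,
                      peak_flops_per_s, batch_size=1):
    """Return (memory_seconds, compute_seconds) for one decode step."""
    # Assumption: the full set of weights is streamed from HBM once per step.
    memory_s = weight_bytes / bandwidth_bytes_per_s
    # Compute grows with the number of sequences processed together.
    compute_s = batch_size * flops_per_token / peak_flops_per_s
    return memory_s, compute_s


def compute_utilization(memory_s, compute_s):
    """Fraction of the step the math units are busy if the slower side gates it."""
    return compute_s / max(memory_s, compute_s)


# Hypothetical inputs, for illustration only.
WEIGHT_BYTES = 140e9       # e.g. a 70B-parameter model at 2 bytes per weight
FLOPS_PER_TOKEN = 140e9    # roughly 2 FLOPs per parameter per token
BANDWIDTH = 8e12           # 8 TB/s, i.e. 576 TB/s spread over 72 GPUs
PEAK_FLOPS = 2e15          # assumed peak throughput, not a vendor spec

for batch in (1, 64):
    mem_s, comp_s = decode_step_times(
        WEIGHT_BYTES, FLOPS_PER_TOKEN, BANDWIDTH, PEAK_FLOPS, batch
    )
    util = compute_utilization(mem_s, comp_s)
    print(f"batch={batch}: memory {mem_s * 1e3:.2f} ms, "
          f"compute {comp_s * 1e3:.2f} ms, utilization {util:.1%}")
```

Under these assumptions the memory side dominates each step, so halving the bandwidth roughly halves how busy the compute is, which is the sense in which memory feeds the chips.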
A company can have plenty of GPUs on paper and still not be able to ship usable AI accelerators because they cannot get enough high bandwidth memory. Packaging gets even more physical. You have to integrate the logic dies and the HBM stacks into a single working chip package. TSMC's CoWoS is what connects compute and memory at the bandwidth AI workloads need to operate. And then underneath the packaging there are substrates, there are interposers, the pieces that carry those signals and hold the components in alignment. If substrate yield were to drop, the production line would slow down even if the chip design were excellent.
Optics and optical compute matter now because large AI clusters are communication machines as much as they are compute machines. The GPUs need to move an enormous amount of data back and forth between one another. Copper has limits at scale, right? It has limits around heat, around distance, around signal integrity. At hundreds of thousands of GPUs, the network has to be optical. Nvidia's Spectrum-X Photonics announcement is what that shift looks like when you're actually getting to a shipped product scale. Let's go back even further. Let's look at the power side. You are actually thinking about power when you are thinking about where the vendor you're talking about is getting their capacity from.
You have to say like, do you have guaranteed capacity? And that comes down to does the data center have power. All of this is turning the traditional 2010s-era "cloud is invisible" conversation into a very visible, very real AI factory conversation that most software companies are not remotely ready to handle. They just aren't ready for a world where the software you buy is intelligent and that intelligence requires understanding electricity and power for a factory. Power is sucking up a lot of the dollars Satya talked about, right? The grid argument is all over the place and most of it isn't getting at the level of detail that I think is useful.
The IEA is projecting global data center electricity consumption roughly doubling to about 945 terawatt hours by 2030. So that's a headline. But 945 terawatt hours is not what anybody's actually dealing with. The real constraint is firm power at the right location on the right schedule. The country might have plenty of power on paper, but a specific site may not get the power it needs to stand up a data center in time. And that's where a lot of the reporting has come in around local communities and their pushback on data centers and power, etc. Cooling is another big piece that is more local, more physical than people realize.
Dense AI racks generate heat at levels old data center designs were not built to handle. Liquid cooling is part of production capacity today. If the cooling can't handle the rack density, the hardware doesn't actually run at full power. And so then you need to go and look and say, do we have the electricity? Do we have the chips? Do we have the power? And can we put it together? Can we physically assemble all of that into a data center that works? And again, software companies are not used to having to manage construction timelines.
That is a new thing for them. Even Google, which is good at data centers and has been good at data centers for a while, isn't used to this level of scale. CBRE notes that traditional 12- to 18-month data center timelines are really no longer useful as back-of-the-envelope calculations. They just don't apply to 500 megawatt plus AI campuses because the cost of construction, the challenge of construction is so high, and because even transmission and interconnection for power can stretch that schedule well past 18 months up into the four-year range. Meta's Hyperion campus in Louisiana, which is a joint venture with Blue Owl Capital, is already a multi-year construction project.
And I could pick out other ones. So, we've talked about a few layers here. We have chips, memory, packaging, substrates, optics, power, cooling, construction. Every one of these layers has different supply chain players and different timelines. Any one of them can be the bottleneck or constraint for a given data center that determines whether your AI strategy delivers. And all of this physical infrastructure is becoming much more of a concern than just the cost of designing chips or getting the chips exactly right. So Epoch AI estimates that in 2025, the four largest AI chip designers consumed about 90% of global chip packaging capacity.
Remember we talked about packaging, networking, linking, etc. And 90% of HBM memory supply. We talked about memory. But the same four designers consumed only 12% of advanced logic die production. In other words, 12% of the world's advanced logic die production was used to support a 90% utilization of packaging and memory to bring us AI. So the bottleneck was never our ability to design better chips. It's not even GPUs. It's the ability to turn all of this into an integrated compute supply that enables real tokens to be served at scale. And that's what Satya means when he talks about capacity constraint and how he has to spend his way out of it.
The useful executive question isn't who benefits from AI capex. That turns into stock picking really fast. The better question is where in the supply chain does a delay stop you from shipping AI? And so this is where you need to think really carefully about the contracts you're signing. Every AI vendor contract does sit on top of that supply chain. Most of the time, nobody wants to talk about that. The buyer feels embarrassed to ask because it feels like digging in the closet. And the vendor doesn't want to talk about it because they may not have a full answer on allocation, on capacity, on delivery, and on fallback.
You want to be in a position where you can just be honest and have that conversation and acknowledge the uncertainty and get into it. Now, every agreement is a financial agreement, and I want you to understand a little bit of the capital cycle in detail so you can understand how all of this is put together. Incidentally, this is also going to help you understand not only where there are software vulnerabilities that can translate into bottom line issues, but also where and how larger hyperscalers are managing their risk to avoid getting into an overleveraged, blow-up-the-bubble scenario.
And this whole discussion of finance is part of why I don't think we're in an AI bubble. So software finance traditionally focused on stuff like revenue growth, gross margin, sales efficiency, retention, free cash flow. Those are still important, but AI adds a much tougher capital cycle underneath that. GPU depreciation runs between three and five years. Data center shells last a whole lot longer. So the model and the serving stack need to be refreshed. The asset lives don't match. In some cases, the data center shells are not in a position where they can be reused with the newest racks.
And so the question CFOs need to learn to ask is, can we earn enough from this capacity before the next hardware generation changes the cost curve? And I realize that if you are not a hyperscaler CFO, you may think you're exempt from that conversation. But as I've shown you, everything goes back to the factory. We have to be able to have these conversations to apply intelligence in our firms. Utilization of tokens is a central operating metric. Now, an AI factory with low utilization is dangerous because the depreciation clock is going to run whether the tokens are served or not.
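The point about the depreciation clock can be put into a small worked formula. This is a minimal sketch assuming straight-line depreciation and a fixed full-load token capacity; the dollar figure, the capacity, and the function name are hypothetical, and only the three-to-five-year asset life comes from the talk.

```python
def cost_per_million_tokens(capex_usd, useful_life_years,
                            tokens_per_year_at_full_load, utilization):
    """Depreciation cost per million served tokens at a given utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # Straight-line depreciation: the same charge every year, used or not.
    annual_depreciation = capex_usd / useful_life_years
    served_tokens = tokens_per_year_at_full_load * utilization
    return annual_depreciation / served_tokens * 1_000_000


# Hypothetical rack: $3M of hardware, 4-year life, 1 trillion tokens/year at full load.
for utilization in (1.0, 0.5, 0.25):
    cost = cost_per_million_tokens(3_000_000, 4, 1e12, utilization)
    print(f"utilization {utilization:.0%}: ${cost:.2f} per million tokens")
```

Because the annual charge is fixed, halving utilization doubles the hardware cost carried by every served token, which is why utilization becomes a central operating metric.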
And this gets into the idea of token allocation and how you're protected. When demand exceeds supply upstream of the vendor, do you have guaranteed capacity? Do you have a best efforts promise? Do you have a queue position? So, the hyperscaler in this sense is also your competitor for the same compute, right? Microsoft needs GPUs for Copilot and for Azure customers. Google needs them for Gemini and Cloud and Search and Workspace. Amazon needs them for AWS and Bedrock and Trainium commitments. Meta has maybe fewer customer allocation conflicts because most of the demand is internal, but that means Meta has to fund the whole thing itself.
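The difference between guaranteed capacity, best efforts, and queue position can be sketched as a toy allocator. This is an illustration under assumptions, not how any vendor actually schedules capacity: the tier names, the `Request` type, and the rule that reserved customers are served first and best-effort customers in arrival order are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    customer: str
    tier: str      # "reserved" or "best_effort" (hypothetical tier names)
    tokens: int    # tokens requested in this period


def allocate(requests, available_tokens):
    """Grant reserved requests first, then best-effort ones in queue order."""
    # sorted() is stable, so arrival order is preserved inside each tier.
    ordered = sorted(requests, key=lambda r: 0 if r.tier == "reserved" else 1)
    granted = {}
    for request in ordered:
        grant = min(request.tokens, available_tokens)
        granted[request.customer] = grant
        available_tokens -= grant
    return granted


queue = [
    Request("A", "best_effort", 60),
    Request("B", "reserved", 50),
    Request("C", "best_effort", 40),
]
print(allocate(queue, available_tokens=100))
```

In this toy run the reserved customer is made whole, the first best-effort customer is cut back, and the last one in the queue gets nothing, which is the outcome the contract language is meant to make explicit.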
I want you to dig in deeper on how cloud providers are also potentially competitors, not just from a product perspective, but from a chip and allocation perspective. If you want that full procurement breakdown, you can grab it at the Substack. I go real deep there. For now, we're going to stay at that level and move on to forecasting demand, which is another big piece that we need to understand better if we're going to make good agreements around investing in AI. Forecasting AI usage in seats, users, licenses, and projects is something I see pretty frequently. You need to go farther.
You need to forecast tokens per workflow, context length, model calls per task, agent loops, concurrency, latency tiers, and failure and retry rates. A customer support chatbot and an autonomous claims processing agent do not consume capacity in remotely the same way. And that's just one example. A coding assistant that answers occasional questions is very different from an agent that reads your repository, writes code, runs tests, and loops for hours or days. If you forecast adoption, you will underbudget capacity. If you forecast budget, you will overpay for the wrong layer. Now, I want to be honest about the other side of all of this.
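The per-workflow forecasting described above can be sketched as a small model. Everything here is an assumption for illustration: the `Workflow` fields, the two example workflows and their numbers, and the simplification that each retried call is re-sent once in full. Concurrency and latency tiers are left out to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    name: str
    tasks_per_day: int
    model_calls_per_task: int      # includes agent-loop iterations
    input_tokens_per_call: int     # driven mostly by context length
    output_tokens_per_call: int
    retry_rate: float              # fraction of calls that are retried once


def daily_tokens(workflow: Workflow) -> float:
    """Estimate tokens consumed per day by one workflow."""
    calls = (workflow.tasks_per_day
             * workflow.model_calls_per_task
             * (1 + workflow.retry_rate))
    tokens_per_call = (workflow.input_tokens_per_call
                       + workflow.output_tokens_per_call)
    return calls * tokens_per_call


# Hypothetical workloads, not measurements.
chatbot = Workflow("support chatbot", 2000, 3, 1500, 300, 0.05)
claims_agent = Workflow("claims agent", 200, 40, 20_000, 1_000, 0.10)

for wf in (chatbot, claims_agent):
    print(f"{wf.name}: {daily_tokens(wf) / 1e6:.1f}M tokens/day")
```

With these made-up inputs the agent handles a tenth as many tasks as the chatbot and still consumes far more tokens, which is why a forecast in seats or users misses the real capacity demand.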
I've talked mostly about all of the costs and all of the complexity of taking traditional massive software businesses and turning them into AI factories that underpin the intelligence economy. Let's look at the good news. Serving costs have been falling quickly. Epoch AI found that prices for the same performance level have fallen at very different speeds across tasks, but they're all falling, in some cases by orders of magnitude per year. Smaller models, distillation, caching, batching, quantization, speculative decoding, better routing, all of these increase the work you can serve for the same capacity. If you are excited about a model like Opus 4.7 or ChatGPT 5.5 today in May of 2026, you will probably have the same intelligence on an open weights model for free that you can run on your stack in December.
I have a specific example on efficiency that I think tells the story here. Microsoft said Copilot inference throughput, the ability to push tokens through, went up 40% in the last quarter just from the team there optimizing software and hardware. Efficiency gains like that are equivalent to building more factory without breaking ground. So the cost improvements are real, but they don't change the fact that this is still an industrial business. Cheaper tokens can also create, and do also create, more token demand. It's Jevons paradox, right? Longer contexts, more agents, more retries. Part of why we're token constrained in May is because we got better agents in January.
And so if efficiency gains outrun demand growth, the whole idea of the industrial base that we're building starts to become more of a bear case. It starts to soften. If cheaper intelligence causes usage to explode faster than capacity arrives, which is what we see so far, then we are in a bind where we have to build and build and build and build and build and the capex is a bet that demand continues outrunning efficiency. And that's what we have so far. And that's something that I want to dig into in more detail.
So, I wrote a whole piece on the unit economic signals and what to watch for in 2026. Those help you understand things like capacity proof, and they give you a practical bubble test. By that test, we are not in a bubble yet, but you can run it yourself. It's over on the Substack. If your CFO is asking whether AI capex is sustainable, that is the briefing to show them. The link is below in the comments. So, let me close with three questions to bring to your next AI investment review. If you remember nothing else from this video, write these down.
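The race between efficiency and demand can be sketched as simple compounding. This is an illustration under assumptions: the quarterly growth rates are hypothetical, the 40% efficiency gain is the one-quarter Copilot figure quoted in the talk and is assumed here to repeat every quarter, and supply is modeled as efficiency gains multiplied by new physical capacity.

```python
def demand_to_supply_ratio(quarters, demand_growth, efficiency_gain,
                           capacity_growth, demand0=1.0, supply0=1.0):
    """Return the demand/supply ratio after each quarter.

    A ratio above 1.0 means demand is outrunning what can be served.
    """
    demand, supply = demand0, supply0
    ratios = []
    for _ in range(quarters):
        demand *= 1 + demand_growth
        # Served capacity grows from software/hardware efficiency and from new builds.
        supply *= (1 + efficiency_gain) * (1 + capacity_growth)
        ratios.append(demand / supply)
    return ratios


# Hypothetical scenarios: usage exploding vs. usage growing more slowly.
constrained = demand_to_supply_ratio(4, demand_growth=0.60,
                                     efficiency_gain=0.40, capacity_growth=0.10)
softening = demand_to_supply_ratio(4, demand_growth=0.30,
                                   efficiency_gain=0.40, capacity_growth=0.10)

print("demand outruns efficiency:", [round(r, 3) for r in constrained])
print("efficiency outruns demand:", [round(r, 3) for r in softening])
```

In the first scenario the ratio climbs every quarter and the buildout keeps looking necessary; in the second it falls below 1.0 and the bear case starts to apply.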
Number one, what share of your AI vendor spend is reserved capacity versus best efforts allocation? And what is the concrete plan if your default provider becomes supply constrained for a period of time, maybe up to a month or two? "We have a great relationship with the vendor" is not a plan, by the way. You have to have an allocation tier and it needs to be written down. Question two, what is your specific routing plan for sending to cheaper models? And how will you measure the savings without degrading the user experience? Companies are running expensive models against tasks they don't need to because nobody has built a routing layer.
And that's margin that's just sitting on the floor. Three, in your top three AI workflows, where is hidden human supervision masking product failure that is relevant from a purchase perspective? And how would you know if that supervision disappeared? So many of these vendor demos run really clean because a human is in the loop somewhere and you wave your hands through it when you have the demo, right? Production traffic isn't a demo scenario. If you cannot see the human supervision, you're going to have trouble pricing it, scaling it, or removing it. So these questions are very much operational questions.
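For question two, a routing layer and its savings measurement can be sketched in a few lines. This is a minimal sketch under assumptions: the two model tiers, their prices, the per-task `difficulty` score, and the threshold are all hypothetical, and a real deployment would also need a quality evaluation to confirm the user experience is not degraded.

```python
# Hypothetical prices in USD per million tokens.
PRICE_PER_MILLION = {"small": 0.50, "large": 10.00}


def choose_model(task, difficulty_threshold=0.5):
    """Send easy tasks to the cheaper model; 'difficulty' is an assumed score in [0, 1]."""
    return "small" if task["difficulty"] <= difficulty_threshold else "large"


def cost_usd(tokens, model):
    return tokens / 1_000_000 * PRICE_PER_MILLION[model]


def routing_report(tasks, difficulty_threshold=0.5):
    """Compare sending everything to the large model against routed traffic."""
    baseline = sum(cost_usd(t["tokens"], "large") for t in tasks)
    routed = sum(cost_usd(t["tokens"], choose_model(t, difficulty_threshold))
                 for t in tasks)
    return {"baseline_usd": baseline,
            "routed_usd": routed,
            "savings_usd": baseline - routed}


tasks = [
    {"tokens": 1_000_000, "difficulty": 0.2},   # e.g. simple classification
    {"tokens": 1_000_000, "difficulty": 0.9},   # e.g. multi-step reasoning
]
print(routing_report(tasks))
```

The report gives the savings figure to track; pairing it with a quality metric per routed task is what answers the second half of the question.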
These are the questions I would be asking if I were being asked to think about AI as software and whether we buy it and where we buy it and who we buy it from. I get into a fuller version of these questions on the Substack, plus a bill of materials, a procurement framework, token forecasting model, all of that. If you want to walk into the next budget meeting, go grab that, right? You'll be all set. But let's step back for a minute. In the cloud era, the winning abstraction was elastic compute. Developers built as if infrastructure was always available.
The physical world kind of receded away. That abstraction is broken now. Intelligence isn't infinitely elastic. It's constrained by an industrial factory. Microsoft's $190 billion capex matters because the world's most valuable software companies have to think like industrial operators now. They have to have supply assurance, throughput optimization, capacity scheduling, utilization management, depreciation discipline. Those are words your CFO is about to start using. Your AI strategy is a production line. It's in a factory, right? You have to be thinking and aware of where that power and those tokens are coming from. Are they coming from a particular data center?
Do you know the data center? Do you know what's going to be there? Do you know if there's more capacity there? If that vendor signs on 20 more customers, you should know these things. The executive job is changing. That is one of the big themes for me in 2026. And one of your jobs is due diligence. I've talked about due diligence across technical considerations and workflows internally. This video is about owning the decisions across every layer of the factory. When you make a purchase, you're effectively buying a share in an industrial factory and you're buying a share of those tokens.
Are you ready for that? Are you thinking about that? That's a whole lot different from what an MBA degree will tell you about buying software. It's not software anymore. The factory is what makes intelligence possible. Tokens are what it manufactures. We are in the intelligence economy and you got to get ready. So, sign up, get excited, and I'll see you next time. This has never been. One of the things I love about this whole journey toward a physical infrastructure connected intelligence economy is that it makes the problem spaces so much more interesting because for so long you could be in a world in the 2010s where software tasted like chicken.
It was all the same. And whatever you did as an exec with an MBA, you could do the same thing at the next company. Not anymore. It's way more interesting than that. This is an example of how deep you have to dive. Jump in. Get excited.