A staggering amount goes into every AI response you barely think about. The math. The hardware. The supply chain. The economics. The premise of this piece is simple: the money only makes sense once you’ve seen the math.
So I’m going to walk the whole chain, from the thing you actually get out of AI back to the molten-tin plasma that produces the light used to etch the chips, and show you why AI costs what it costs. By the end you should be able to picture what a GPU is doing when you give an AI a task — and that picture is the argument. It explains the prices, the margins, and the buildout.
The chain: the thing people want is agentic workflows. To produce them you do a staggering amount of arithmetic. To do that arithmetic you need GPUs. To get GPUs you need one of the most complex and fragile supply chains ever built. That’s why the buildout is so expensive — and whether it’s worth it comes down to one question: are agentic workflows worth it? I think they clearly are, and the mechanism is what makes me confident.
Start at the top, with what you’re actually asking the machine to do.
1. What an agent is actually doing
A chatbot answers once: you type, it types back, done. An agent doesn’t. It works in a loop, and the gap between those two things is the whole product.
Walk through a real one. You tell an email assistant: "Find the latest email from Jordan and draft a reply." To you that’s one request. To the model it’s several turns — and it helps to see exactly what the software around the model, the harness, hands it.
Before your message even arrives, the harness loads three things into the model’s context:
- A system prompt — the standing instructions ("you are an email assistant, here is how to behave, here are your rules"). Call it ~2,500 tokens, but this can balloon higher when adding custom always-loaded instructions like CLAUDE.md.
- Tool definitions — what lets the model actually do things. On its own the model only emits text; tools are what turn that text into action — searching an inbox, sending a message. They’re written as JSON (like the block below), and there’s no hidden magic to them; they’re simple. Two tools here:
[
  {
    "name": "search_emails",
    "description": "Search the user's inbox; returns matching message snippets.",
    "input_schema": {
      "type": "object",
      "properties": {
        "query": {"type": "string"},
        "max_results": {"type": "integer"}
      },
      "required": ["query"]
    }
  },
  {
    "name": "send_email",
    "description": "Send an email on the user's behalf.",
    "input_schema": {
      "type": "object",
      "properties": {
        "to": {"type": "string"},
        "subject": {"type": "string"},
        "body": {"type": "string"}
      },
      "required": ["to", "subject", "body"]
    }
  }
]
- A skills index — a list of capabilities the model can pull in on demand, each shown as a one-line snippet:
write-email — Draft replies in the user's voice: tone, sign-off, formatting.
schedule-meeting — Propose times from the user's calendar and send invites.
summarize-thread — Condense a long thread into bullets.
... + 100 more
That "+ 100 more" is where this is going. You don’t dump a hundred full skill documents into context — you show the model the one-liners and let it read the full thing only when it needs it. That’s progressive disclosure. The write-email snippet is a single line; the real file behind it is long — writing style, examples, edge cases, easily a few hundred lines. The same trick increasingly applies to the tools themselves: rather than loading all hundred tool definitions upfront, the harness can show the model a short index and let it pull in a tool’s full JSON schema only when it reaches for that tool. If you’re wondering "Why would there ever be 100+ skills and tools?", think about all the tools it would take to connect all the apps you regularly work with to an AI agent. The example above is simply illustrative, but you would never have an agent that’s exclusively an email agent. Here’s a short list of possible tools:
- Email (5 tools: search, compose/send, delete, move, mark read/unread)
- Google Drive (5 tools: search, read file, upload file, delete file, edit document)
- Every other app and data source you want the agent to reach. Think of the software you use for work and every function you would need to expose to the AI: easily 70+ tools.
- Internet search tools for various data sources (easily 20+ tools on their own):
  - Google search
  - YouTube search, download, and transcript download
  - Images
  - Financial data: Yahoo Finance, FRED
  - Corporation registries; globally there are tons of these
  - Many more
Many tools come packaged with skills, e.g. an email drafting skill, a document creation skill…
There’s a trade-off in this idea of progressive disclosure of tools and skills, and it’s worth seeing now because it comes back later. Progressive disclosure keeps the context lean — the model isn’t wading through a hundred skill files and tool schemas it doesn’t need, which makes it both faster and sharper on the task in front of it. But it pays for that with extra turns: the model has to stop and fetch the skill, or the tool’s schema, before it can act on it, and each deferred load is one more round-trip to the model. Better performance, more turns — and more turns, as we’ll see, is exactly what drives the bill.
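The trade-off can be made concrete with a minimal sketch. Everything here is an assumption for illustration: the `SKILLS` store, the skill bodies, and the `read_skill` helper are hypothetical stand-ins, not any real harness's API, and the token estimate is the crude 0.75-words-per-token rule of thumb.

```python
# Hypothetical skill store: names and bodies are illustrative, not a real harness API.
SKILLS = {
    "write-email": {
        "summary": "Draft replies in the user's voice: tone, sign-off, formatting.",
        "body": "Match the sender's formality. Lead with the answer. " * 50,
    },
    "summarize-thread": {
        "summary": "Condense a long thread into bullets.",
        "body": "Group messages by topic. Quote decisions verbatim. " * 50,
    },
}


def rough_tokens(text: str) -> int:
    """Crude estimate: a token is about 0.75 words, so tokens ~ words / 0.75."""
    return round(len(text.split()) / 0.75)


def skills_index() -> str:
    """The one-liners that are always in context."""
    return "\n".join(f"{name}: {s['summary']}" for name, s in SKILLS.items())


def read_skill(name: str) -> str:
    """The full document, loaded only when the model asks for it (one extra turn)."""
    return SKILLS[name]["body"]


index_cost = rough_tokens(skills_index())                            # always paid
eager_cost = sum(rough_tokens(s["body"]) for s in SKILLS.values())   # load everything upfront
lazy_cost = index_cost + rough_tokens(read_skill("write-email"))     # index + one skill on demand
```

With only two skills the saving is modest; with a hundred, `eager_cost` grows with every skill while `lazy_cost` grows only with the one-line index, at the price of the extra `read_skill` round-trip.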
Now the loop. Watch it run as an actual back-and-forth. Each Agent block is a separate call to the model — text in, text out — and every one re-sends the whole conversation above it. Each Tool block is the harness doing what the model asked and handing the result back.
User: Find the latest email from Jordan and draft a reply.
Agent: Let me search for Jordan’s most recent email.
{ "tool": "search_emails", "input": { "query": "from:Jordan", "max_results": 5 } }
Tool: search_emails returns:
[
  { "from": "Jordan Lee", "date": "2026-06-13",
    "subject": "Re: Q3 pricing",
    "snippet": "Can you send the updated deck before Friday's call?" }
  (And 4 more blocks here)
]
Agent: Found it. Before I draft, let me pull the write-email skill. (Reading a skill is a built-in action, exposed to the model as an additional tool.)
{ "tool": "read_skill", "input": { "name": "write-email" } }
Tool: read_skill returns the full write-email skill (~1,500 tokens, abbreviated): "Match the sender’s formality. Lead with the answer. Keep it under 120 words…" plus hundreds more lines of rules and examples.
Agent: Here’s a draft: "Hi Jordan, yes, I’ll get the updated deck to you Thursday, ahead of Friday’s call. Anything you want highlighted? Best," Want me to send it?
That’s one of your turns and three of the model’s, with three full re-sends of the conversation, each longer than the last. And this is the trivial case. A real task — debug this failure, reconcile these accounts, plan this trip — runs dozens of turns, and the transcript only grows.
Two things to take from this, and both come back as money. First, your one request became several model turns, and that ratio only climbs as tasks get real — a coding agent can take dozens of turns for a single instruction. Second, every turn re-sends the entire conversation so far — system prompt, tools, skills, your message, and every previous tool call and result. The model itself is a fixed function, with nothing about your conversation stored inside it. What stands in for memory is the transcript, which the harness hands back in full on every turn. (There’s an efficiency layer — caching — that keeps re-sending that transcript from being as wasteful as it sounds; that’s the next section.)
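A minimal sketch of the re-send arithmetic for the email example. Only the ~2,500-token system prompt and the ~1,500-token skill come from the walkthrough; the other token counts (tool definitions, skills index, message and reply sizes) are assumed values chosen purely for illustration.

```python
# Illustrative token counts; only the ~2,500-token system prompt and the
# ~1,500-token skill come from the walkthrough, the rest are assumptions.
FIXED_PREFIX = 2_500 + 400 + 300   # system prompt + tool definitions + skills index
USER_MESSAGE = 20

# (tokens the model writes this turn, tokens the tool result adds afterwards)
TURNS = [
    (40, 600),    # calls search_emails; five snippets come back
    (30, 1_500),  # calls read_skill; the full write-email skill comes back
    (80, 0),      # writes the draft; no tool call
]


def input_tokens_per_turn(turns):
    """Each model call receives the whole transcript so far as input."""
    transcript = FIXED_PREFIX + USER_MESSAGE
    sent = []
    for output_tokens, tool_result_tokens in turns:
        sent.append(transcript)
        transcript += output_tokens + tool_result_tokens
    return sent


per_turn = input_tokens_per_turn(TURNS)   # input size of each of the three calls
total_input = sum(per_turn)               # what gets billed as input overall
```

Under these assumptions the three calls receive 3,220, 3,860, and 5,390 input tokens, so a request of about twenty tokens turns into more than twelve thousand tokens of input across the task.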
2. Agentic work means long context
Stack those steps up and the context balloons. A chatbot question is a few hundred tokens. An agent reading files, calling tools, and pulling in skills is at hundreds of thousands of tokens within minutes — and frontier models now accept windows around a million tokens, roughly 750,000 words, fed in fresh on every step.
That’s the first load-bearing idea, and the whole cost story grows out of it: agentic work means long context. Re-processing several novels’ worth of text from scratch on every turn would be hopeless: slow, and ruinously expensive. So the system doesn’t. It processes the context once, caches the processed form, and reuses it on the next turn instead of rebuilding it.
While the agent is actively working — generating tokens and looping through tool calls a second or two apart — that cache stays hot in the GPU’s fast memory, holding space that can’t be handed to anyone else. The part to sit with: a chatbot needs a small amount of memory for the second or two it’s typing; an agent orchestrating a task needs a much larger chunk of memory, and holds it for minutes on end.
When you stop to read or type, though, the system won’t pin your cache in that scarce, expensive memory while you sit idle. It offloads it down to cheaper memory and pulls it back on your next message — fast enough to feel instant, as long as you return within a time-to-live window (Anthropic’s default is five minutes, with a one-hour option).
3. The math underneath
In this section, I’ll show you the shape of the computation, quantitatively, just far enough that the prices later feel inevitable instead of arbitrary. I won’t teach you to build a transformer model, and I’ll make some simplifications along the way, but a working grasp of the math gives you a concrete picture of what GPUs are actually doing, and why agentic workflows need more of them. The one-line version: it’s not magic, it’s multiplication, at a scale that’s hard to believe.
3.1 Words become numbers
The AI model never sees words. It sees numbers, only numbers, because the only thing it does is arithmetic.
Text is chopped into tokens: a word or a piece of one ("tokenizing" → "token" + "izing"). For the purposes of understanding, you can think of tokens as words, keeping in mind that tokens include numbers and symbols too. Each token is looked up in a table and replaced by a vector: a column of numbers, 12,288 of them in a model the size of GPT-3, each stored at 16-bit precision. So the token "happy" enters the model as something like

$$x_{\text{happy}} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_{12288} \end{bmatrix}$$

That column is the entirety of what the model knows about "happy." The vector assigned to each token is learned during the model’s training, such that tokens with similar meanings end up with similar vectors, as measured by cosine similarity.
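As a minimal sketch of what "similar vectors" means, here is cosine similarity on toy data. The four-number vectors and the choice of words are made up for illustration; a real model would use thousands of learned values per token.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 4-number vectors; a real model would use thousands of learned values.
EMBEDDINGS = {
    "happy": [0.9, 0.8, 0.1, 0.0],
    "glad": [0.8, 0.9, 0.2, 0.1],
    "tin": [0.0, 0.1, 0.9, 0.8],
}

sim_close = cosine_similarity(EMBEDDINGS["happy"], EMBEDDINGS["glad"])  # near 1
sim_far = cosine_similarity(EMBEDDINGS["happy"], EMBEDDINGS["tin"])     # near 0
```

A score near 1 means the two vectors point the same way; a score near 0 means they have little in common.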
3.2 A sequence is a matrix
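Lay one vector per token, as rows, and your prompt becomes a matrix with a row for each token, ~12,288 columns wide:

$$X = \begin{bmatrix} x_{\text{The},1} & \cdots & x_{\text{The},12288} \\ x_{\text{child},1} & \cdots & x_{\text{child},12288} \\ \vdots & & \vdots \\ x_{\text{happy},1} & \cdots & x_{\text{happy},12288} \end{bmatrix}$$

Each row is one token; the horizontal dots stand in for the thousands of numbers in each vector, the vertical dots for the tokens in between (here, just "is"). Everything from here is operations on this grid: numbers in, numbers out.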
3.3 Attention
The meaning of a word isn’t fixed — it shifts with the words around it, and a token’s starting vector knows none of that context yet. Take "The child is happy." The token "happy" arrives generic, identical to the "happy" in any other sentence; what we want is for it to become this sentence’s happy — the child’s happiness — by reaching back and absorbing meaning from "child." Attention is the mechanism that does the reaching, and it only ever reaches backward: the model is causal, so a token attends to earlier tokens, never later ones.
Side note: this should guide how you prompt AI and the order in which you give it information. If you want sentence B to attend to sentence A, B should come after A. This is why you generally give context first, then the instruction: you want the instruction to attend to the context.
Each token takes its own vector and multiplies it by three separate learned matrices, $W_Q$, $W_K$, and $W_V$, producing three new vectors:
- the query: a description of what this token is looking for. "happy" is an adjective, so its query roughly encodes "I describe something — which earlier word is the thing I apply to?"
- the key: a description of what a token is, written in the same language as queries so the two can be compared. "child"’s key roughly encodes "I’m a noun, a person", which is exactly what "happy"’s query is hunting for.
- the value: the token’s actual content, the meaning it will hand over to whoever attends to it. "child"’s value roughly captures "person, small, young" and other attributes of children.
Then the match. "happy" compares its query against every earlier token’s key with a dot product: multiply the two vectors element by element and sum the result, giving one number that’s large when they point the same way. "happy"’s query lands hard on "child"’s key (adjective seeking noun, noun on offer) and only weakly on "The" and "is." Those scores become weightings, and "happy" pulls in a blend of the earlier tokens’ values weighted by them (mostly "child"’s content, a little of the rest) and adds it into its own vector. The generic "happy" is now a child-shaped happy, specific to this sentence.
Two structural points pay off later. First: $W_Q$, $W_K$, and $W_V$ are weights (fixed numbers, part of the model’s ~trillion parameters, shared across every token and every user), while the keys and values they produce are per-token, and it is those that get cached per user. Second: after the blend, a fourth weight matrix, $W_O$, projects the vector back to full width, and the token runs through a feed-forward network (FFN): two large matrices, the first expanding it about 4× wider, followed by a non-linear activation function, and the second contracting it back. The FFN matrices $W_1$ and $W_2$ are the biggest in the layer: each holds about 4× the data of one attention matrix, so together they make up most of the model’s weights.
Notice what attention just produced: a key and a value for every token. Those are exactly what the cache from §2 stores — which is why it’s the KV cache. And because each layer runs its own attention, each layer caches its own K and V for every token.
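A minimal sketch of one attention step for the token "happy", in plain Python. The two-number queries, keys, and values are made up for illustration. Two details are assumptions not spelled out in the text: the scores are scaled by the square root of the vector width and turned into weightings with a softmax, which is the standard transformer choice, and a token is allowed to attend to itself as well as to earlier tokens.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Blend `values` by how well `query` matches each key (scaled dot product)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    width = len(values[0])
    blend = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(width)]
    return weights, blend


# Toy 2-number keys and values for "The", "child", "is", "happy" (made up).
TOKENS = ["The", "child", "is", "happy"]
KEYS = [[0.1, 0.0], [3.0, 0.2], [0.0, 0.3], [0.5, 0.5]]
VALUES = [[0.0, 0.1], [1.0, 0.9], [0.1, 0.0], [0.2, 0.2]]
HAPPY_QUERY = [2.0, 0.1]

# Causal: "happy" is the last token, so it may look at every position up to
# and including itself. An earlier token would only see a shorter prefix.
position = TOKENS.index("happy")
weights, blend = attend(HAPPY_QUERY, KEYS[: position + 1], VALUES[: position + 1])
most_attended = TOKENS[weights.index(max(weights))]
```

The `KEYS` and `VALUES` lists are exactly what a KV cache would hold for this sentence: once computed, they are reused by every later token instead of being rebuilt.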
Then it repeats — attention, feed-forward — through about 100 stacked layers, each refining the numbers a little more.
3.4 What "a trillion parameters" means
The parameters are the numbers inside those matrices ($W_Q$, $W_K$, $W_V$, $W_O$, $W_1$, $W_2$), one set per layer. "A trillion parameters" means a trillion numbers sitting in those grids, and it adds up cleanly. Call the model’s width $d$. The four attention matrices are each $d \times d$, so $4d^2$. The two FFN matrices are wider: $W_1$ is $d \times 4d$ and $W_2$ is $4d \times d$, which is $4d^2$ each, $8d^2$ together. Per layer:

$$4d^2 + 8d^2 = 12d^2$$

GPT-3 had $d = 12{,}288$ and 96 layers. It’s old and weak by today’s standards, but I use it because it yields clean math for illustrative purposes. Per layer: $12 \times 12{,}288^2 \approx 1.81$ billion. Times 96 layers ≈ 174 billion, which is the "175B" GPT-3 was sold as.
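The per-layer count can be checked in a few lines. This sketch counts only the four attention matrices and the two FFN matrices described above; it leaves out the token embedding table, biases, and normalization parameters, which is presumably where the small gap to the advertised 175B comes from.

```python
def params_per_layer(d: int) -> int:
    """Four d x d attention matrices plus two FFN matrices of 4*d*d each."""
    attention = 4 * d * d
    feed_forward = 2 * (d * 4 * d)
    return attention + feed_forward


D_MODEL = 12_288   # GPT-3 width
N_LAYERS = 96      # GPT-3 depth

per_layer = params_per_layer(D_MODEL)   # 1,811,939,328
total = per_layer * N_LAYERS            # 173,946,175,488
```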
That formula is for a model like GPT-3, with one FFN per layer. Today’s big models hold many FFNs in each layer instead: a separate $W_1$ and $W_2$ for each of a few hundred experts (Kimi K2.6, an open-source model, has 384 of them per layer). Those extra expert FFN matrices are where the parameter count balloons; it’s how you get from GPT-3’s 175 billion to a trillion. I use GPT-3 here because it cleanly shows the architecture without the complex features frontier models add on, but frontier models follow the same general architecture.
Two facts turn this into money:
Each parameter costs ~2 bytes to store, and 2 arithmetic operations to use — but only for the parameters a token actually passes through.
The ~2 bytes is because weights are usually held at 16-bit precision (the exact serving precision of closed models isn’t disclosed, so some modern models may be compressed to ~1 byte per parameter). The "parameters a token passes through" clause is where storage and compute split apart:
- Storage scales with the total count. Every parameter has to physically sit in memory whether or not a given token uses it. GPT-3’s 174B was ≈ 350 GB; Kimi K2.6’s 1 trillion parameters come to ≈ 2 TB of weights — already too big for any single GPU, which is why the unit of compute becomes a whole rack of GPUs.
- Compute scales with the active count. A dense model like GPT-3 runs every token through every parameter. Modern frontier models are Mixture-of-Experts: each token is routed through only a slice — Kimi K2.6 runs just 32 billion of its trillion active per token. So per token that’s ≈ 2 × 32B ≈ 64 billion operations, and a 500-token reply is ≈ 32 trillion operations. Dense — every parameter for every token — it would be ~30× that, which is the entire point of MoE.
Thirty-odd trillion arithmetic operations to write a paragraph. Not thinking: arithmetic, at a scale you can’t picture. MoE shrinks the math by lighting up fewer parameters, but does nothing for the memory, because every expert’s parameters still have to sit in the GPU’s high-bandwidth memory (HBM) waiting their turn. The binding constraint is memory, not math.
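The storage and compute split is simple enough to verify directly. This is a back-of-the-envelope sketch using only the figures quoted above; it counts the weight arithmetic alone and ignores the extra work attention does over the context.

```python
BYTES_PER_PARAM = 2   # 16-bit weights, per the text
OPS_PER_PARAM = 2     # one multiply and one add per parameter a token touches

TOTAL_PARAMS = 1_000_000_000_000   # ~1 trillion (Kimi K2.6, per the text)
ACTIVE_PARAMS = 32_000_000_000     # ~32 billion active per token
REPLY_TOKENS = 500

weight_bytes = TOTAL_PARAMS * BYTES_PER_PARAM    # storage follows the total count
ops_per_token = ACTIVE_PARAMS * OPS_PER_PARAM    # compute follows the active count
ops_per_reply = ops_per_token * REPLY_TOKENS
dense_ratio = TOTAL_PARAMS / ACTIVE_PARAMS       # how much more a dense model would do
```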
3.5 Prefill and decode
Generation runs in two phases.
Prefill reads the prompt: one pass over all the input tokens, computing and caching the K and V for every token at every layer. It has to run the entire input through all of the layers, because each layer’s K and V are computed from the full output of the layer before it, not just from that layer’s K and V. So you need to fully compute attention at each layer and do the FFN pass. Computing attention means comparing every token to every earlier token, so this compute grows quadratically with context length. This is your "time to first token."
Decode writes the answer, one token at a time. For each new token, at every layer: form its $q$, $k$, and $v$; append its $k$ and $v$ to the cache; score its query against all cached keys; blend all cached values; add the result back; run the FFN. After ~100 layers, the final vector encodes what should come next: it yields a probability distribution for the next token, across every token in the vocabulary (~100k of them). Then you take the top-k most probable, sample with a bit of randomness, and that’s the next token. Feed it back in, repeat.
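The last step, picking from the top-k most probable tokens with some randomness, can be sketched as follows. The six-token vocabulary and its scores are made up, and "logits" (the raw per-token scores before they are normalized into probabilities) is a term introduced here, not one the text uses.

```python
import math
import random


def sample_top_k(logits, k, rng):
    """Keep the k highest-scoring tokens, reweight them, and draw one at random."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = logits[ranked[0]]
    weights = [math.exp(logits[i] - peak) for i in ranked]
    return rng.choices(ranked, weights=weights, k=1)[0]


# Toy vocabulary of 6 tokens with made-up scores; real models have ~100k.
VOCAB = ["Hi", "Hello", "Dear", "Yo", "the", "42"]
LOGITS = [3.1, 2.9, 1.5, 0.2, -1.0, -4.0]

rng = random.Random(0)  # fixed seed so the run is repeatable
picked = [VOCAB[sample_top_k(LOGITS, k=3, rng=rng)] for _ in range(20)]
```

Every draw lands on one of the three highest-scoring tokens, with the likelier ones chosen more often; tokens outside the top k can never be picked.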
The split is the whole point. Prefill builds the cache (quadratic, expensive); decode reads it (cheap per token) instead of rebuilding the past every step. Each decoded token adds one slice of K and V per layer, so the KV cache grows linearly with context length. And if a later turn re-sends a context you’ve already prefilled, you skip rebuilding that part entirely — that’s the cache hit, for which AI providers usually charge only 10% of the input price (as opposed to the full input price for context that hasn’t been cached).
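The quadratic-versus-linear contrast can be seen by counting, per layer, how many query-to-key comparisons each phase makes. This is a counting sketch only: it assumes each token is scored against itself and every earlier token, and it does not model the FFN work or the actual numerics.

```python
def prefill_comparisons(n_prompt: int) -> int:
    """Token i is scored against tokens 1..i, so the total is 1 + 2 + ... + n."""
    return n_prompt * (n_prompt + 1) // 2


def decode_comparisons(n_cached: int) -> int:
    """One new token is scored against the cached keys plus its own key."""
    return n_cached + 1


def kv_cache_entries(n_tokens: int, n_layers: int) -> int:
    """One key and one value per token per layer."""
    return 2 * n_tokens * n_layers


LAYERS = 100                      # "about 100 stacked layers", per the text
SHORT, LONG = 1_000, 10_000       # two illustrative context lengths

prefill_growth = prefill_comparisons(LONG) / prefill_comparisons(SHORT)          # ~100x
decode_growth = decode_comparisons(LONG) / decode_comparisons(SHORT)             # ~10x
cache_growth = kv_cache_entries(LONG, LAYERS) / kv_cache_entries(SHORT, LAYERS)  # 10x
```

A context ten times longer costs about a hundred times more attention work to prefill, but only about ten times more per decoded token and exactly ten times more cache.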
3.6 Weights vs. cache
Two things live in the GPU’s memory, and they behave completely differently:
- The weights (the trillion parameters, $W_Q$ through $W_2$ across all layers) are fixed, and shared by every user. Load once; everyone runs against the same copy.
- The KV cache (the keys and values your specific context produced) is per-user, and grows linearly with your context length.
Both compete for the same scarce, fast memory. The weights are a large fixed cost. The cache is the variable cost — and agentic work, long context held hot while the agent runs, makes it big, per person. That tension is the entire economics. The next section turns it into dollars.
4. How all this runs on a GPU
We have the math. Now the machine that does it — because the hardware is what turns the math into a bill.
4.1 What a GPU is, and why the rack is the unit
A GPU isn’t one chip. It’s a package with two kinds of silicon sitting side by side: a compute die — thousands of cores that do the multiplying — and several stacks of HBM, high-bandwidth memory, towers of DRAM mounted right next to the compute die. The compute cores have almost no memory of their own, a few megabytes. They cannot hold the model. The weights — that ~2 TB — live in the HBM.
So every matrix multiply is a streaming operation: pull a weight matrix out of HBM into the compute die, multiply, write the result back. HBM is fast (an H100 moves about 3.35 TB/s), but "fast" is relative when you have terabytes of weights to stream for every token. The compute cores spend most of their time waiting for data to arrive, not multiplying. That gap, with cores idle and starved for weights, is the memory wall, and nearly every cost lever in AI is built on it.
And 2 TB doesn’t fit in one GPU. An H100 holds 80 GB, so a frontier model’s weights are split across many GPUs. The detail that matters is what stays put versus what moves. Each GPU holds its slab of the weights, and the slice of KV cache that goes with them, in its own local HBM, and keeps it there: weights don’t get shuffled between GPUs, and a given user’s cache for a given set of layers lives on one GPU, not smeared across the rack. The only thing that travels GPU-to-GPU is the comparatively tiny activation vectors, handed along as a token flows through the model. The reason to architect it that way is speed: a GPU’s own HBM is the fast tier, so you keep the heavy data (weights, cache) pinned to it and move only the light stuff. The GPUs are wired for those handoffs with NVLink, ~1.8 TB/s per GPU on the current generation.
A rack is not one giant GPU. It’s a team of GPUs, each working on its own resident piece of the model and passing small activations to the others, because reaching another GPU over NVLink, while quick, is still slower than local HBM, and reaching another rack is slower again. The rack is "the unit of compute" only in the sense that the model runs as a coordinated set of GPUs: a Hopper-generation server tied 8 GPUs together; NVIDIA’s current NVL72 ties 72 into one NVLink domain with roughly 13.5 TB of HBM, and the domains keep growing. When you read about a multi-million-dollar rack, that’s the team, not one big pool of uniform memory.
Keeping three speeds straight is most of understanding the hardware. Each is slower than the one before it: NVLink runs at roughly a quarter to a half of local HBM’s speed, depending on the generation, and then the drop to rack-to-rack networking is the real cliff:
- inside the package, compute die ↔ its own HBM: ~3.4 TB/s (H100) to ~8 TB/s (B200) (the memory wall);
- inside the rack, GPU ↔ GPU over NVLink: ~1.8 TB/s;
- between racks, over InfiniBand or Ethernet: ~50–100 GB/s per link.
A common mix-up worth noting: NVLink is the intra-rack link at terabytes per second; InfiniBand is the rack-to-rack link, more than an order of magnitude slower.
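To get a feel for the three tiers, here is the ideal time to move the same payload across each one. The bandwidths are the approximate figures quoted above; the 14 GB payload is a hypothetical size chosen only for illustration, and real transfers would be slower once latency and protocol overhead are counted.

```python
# Approximate bandwidths from the text, in bytes per second.
TIERS = {
    "HBM (H100)": 3.35e12,
    "NVLink": 1.8e12,
    "rack-to-rack link (low end)": 50e9,
}

PAYLOAD_BYTES = 14e9  # hypothetical 14 GB payload, chosen only for illustration


def transfer_ms(payload_bytes: float, bytes_per_second: float) -> float:
    """Ideal transfer time in milliseconds, ignoring latency and protocol overhead."""
    return payload_bytes / bytes_per_second * 1_000


times = {name: transfer_ms(PAYLOAD_BYTES, bw) for name, bw in TIERS.items()}
```

Under these assumptions the move takes a few milliseconds inside the package or the rack, and a few hundred milliseconds once it has to cross between racks.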
And the KV cache — the part that, unlike the weights, can’t be loaded once and shared across users — lives in this same HBM, competing with the weights for room. While your agent is mid-loop, your cache sits in HBM, the fastest and scarcest tier, so the next token is instant. When you go idle, the system pushes it down the hierarchy — HBM → CPU memory → SSD, each tier cheaper, bigger, slower — and pulls it back up when you return. That hierarchy exists for one reason: HBM is too precious to spend on a cache nobody is using this second.
4.2 Batching — and why long context breaks it
Here’s how a provider makes money against the memory wall: batching. Since the cores sit idle waiting for each weight matrix to stream in, you don’t multiply that matrix by one user’s vector — you multiply it by a thousand users’ vectors at once. The matrix streams in once and you get a thousand answers for roughly the cost of one stream. That’s the whole efficiency play, and it’s why providers want as many simultaneous requests on a rack as possible: the more you batch, the more you amortize the model weights, and the cheaper each user gets. This is the only way that you actually keep the compute cores active so they don’t spend all their time waiting for data to come their way.
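The amortization can be shown with a toy matrix multiply that counts how often each weight row is read. This is a plain-Python sketch of the idea, not how a GPU kernel is written; the tiny matrix and the three user vectors are made up.

```python
def matvec_batch(weight_rows, vectors):
    """Multiply one weight matrix by every vector in the batch.

    Each weight row is read once and reused for all vectors, which is the
    software analogue of streaming the matrix from HBM a single time.
    """
    rows_read = 0
    outputs = [[] for _ in vectors]
    for row in weight_rows:
        rows_read += 1
        for out, vec in zip(outputs, vectors):
            out.append(sum(w * x for w, x in zip(row, vec)))
    return outputs, rows_read


W = [[1, 0, 2], [0, 1, -1]]                  # toy 2x3 weight matrix
BATCH = [[1, 2, 3], [0, 1, 0], [2, 2, 2]]    # three users' vectors

outputs, rows_read = matvec_batch(W, BATCH)
reads_if_unbatched = len(W) * len(BATCH)     # each user streaming the matrix alone
```

Three users are served with two row reads instead of six; with a thousand users in the batch the same two reads would serve all of them.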
But batching has a hard ceiling — the rack’s memory. The weights take their fixed cut of HBM; everything left over is for KV caches. And how much each user takes out of that is set by how long their context is — the KV cache grows linearly with context length; one more slice of keys and values per layer for every token you add.
Take Kimi K2.6, a real, open-weight frontier-ish model whose attention stores about 70 KB of cache per token. A chatbot turn — a system prompt and a few exchanges, call it ~30k tokens — is only about 2 GB of KV cache per user. An agentic session carrying ~200k tokens is about 14 GB — per user. Same hardware, same weights; the only thing that changed is the context length, and the cache grew right along with it.
And that ratio is the cost. Whether it’s Kimi or any other model, the math is the same — KV cache grows linearly with context — so a ~200k agent against a ~30k chatbot eats roughly seven times the memory per user, and you fit about one-seventh as many of them on the same rack. That’s the factor: almost an order of magnitude, on identical hardware. And it only widens as you push the context toward a million tokens, the kind of long-horizon workflow the frontier is racing toward.
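A rough capacity estimate using the numbers quoted so far. This is an upper-bound sketch under a loud assumption: every byte of an NVL72's HBM left over after one copy of the weights is treated as usable for KV cache, which ignores activations, fragmentation, and any headroom a real serving system keeps.

```python
KV_BYTES_PER_TOKEN = 70_000      # ~70 KB per token (Kimi K2.6, per the text)
RACK_HBM_BYTES = 13.5e12         # ~13.5 TB in an NVL72, per the text
WEIGHT_BYTES = 2e12              # ~2 TB of weights, per the text


def kv_cache_bytes(context_tokens: int) -> int:
    """KV cache grows linearly: a fixed number of bytes per token of context."""
    return context_tokens * KV_BYTES_PER_TOKEN


def max_users(context_tokens: int) -> int:
    """Upper bound on concurrent users: leftover HBM divided by cache size.

    Assumes all memory left after one copy of the weights can hold KV cache,
    which overstates real capacity.
    """
    leftover = RACK_HBM_BYTES - WEIGHT_BYTES
    return int(leftover // kv_cache_bytes(context_tokens))


chat_users = max_users(30_000)     # ~30k-token chatbot sessions
agent_users = max_users(200_000)   # ~200k-token agentic sessions
```

Under these assumptions the same rack holds roughly 5,500 chatbot-sized sessions or roughly 800 agent-sized ones, the same factor of about seven.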
Here is why that rockets the cost rather than nudging it. The bottleneck was never the compute; the cores are fast and largely idle, waiting on data. The bottleneck is bandwidth, and batching is how you handle it: stream the weights in once, serve the whole batch off that one stream. The rate at which the rack can stream weights is roughly fixed, and each stream serves however many users are in the batch, so each user gets similar throughput, but the number of users is the lever. Cut the batch roughly sevenfold and the rack’s salable output falls by nearly the same factor: the same expensive box, the same power bill, a fraction of the tokens to sell. And charging each user for their long context doesn’t rescue you: the ~7× as many short-context users that the same memory could hold would have brought in roughly the same input-token revenue between them, on top of far more output tokens. To come out even on the long-context users, you would have to charge more per token.
That is the economic engine of this whole piece, and it falls straight out of the mechanics: agentic work means long context, long context means a giant per-user KV cache, and a giant per-user cache means you can’t batch as well. Serving an agent costs dramatically more than serving a chatbot on the exact same hardware: same GPUs, a fraction of the users, many times the cost each. And that’s not just because agents use more output tokens. Because of the batching mechanics, each token costs more resources to produce, and you’re using more of them.
4.3 What NVIDIA is building, and why it has to
NVIDIA’s roadmap is a direct answer to that math: bigger NVLink domains (72 → 144 → 576 GPUs in a single fabric over the next generations) and more HBM per GPU (80 → 141 → ~192 → 288 GB across recent generations).
You keep one copy of the model per NVLink domain. One set of weights, split across the domain’s GPUs, serving the entire batch — you don’t replicate the model onto every GPU (only at the very largest domains, the 576-GPU class, might you load more than one copy). The weights are a fixed cost, paid once per copy. The game is diluting that fixed cost across as many simultaneous users as the leftover memory can hold. And it’s not that you’re diluting the memory those weights take up; what you’re really diluting is the bandwidth used to shuffle those fixed model weights into the compute cores. Batching more users allows you to multiply many vectors on a single shuffle of the weight matrix into the compute cores. So more HBM per GPU and bigger domains do exactly one thing that matters — they leave more room, after the single model copy, for more users’ KV caches. More users per copy, fixed cost split finer, cheaper per user.
There’s a software side too: today an agent’s tools often run on your laptop, a network round-trip away, while its cache ties up GPU memory the whole time; the next move is to run that tool execution on a CPU sitting inches from the GPU in the data center, so the GPU isn’t held idle waiting on a round-trip. NVIDIA is actively working on this.
The bottom line of the hardware story: agentic workflows on frontier models don’t run well on yesterday’s GPUs — there isn’t enough memory to hold the model and a meaningful batch of long-context users at once. You have to build the bigger racks.
5. What it costs — and why $650 billion in 2026 isn’t crazy
Models are priced per million tokens (a token ≈ 0.75 words), and the price splits three ways: input is cheaper than output, and cached input is cheaper still. As a baseline, Claude Sonnet 4.6 runs about $3 per million input / $15 output; Opus 4.8 about $5 / $25; Anthropic’s Fable 5 about $10 / $50. A cache hit — reusing context you already paid to process — is billed 10% of the input rate.
That 10% sounds generous until you run an actual task. Let’s walk through it. Say you’re on Opus 4.8 with a 200k-token context. 200k is a fifth of a million, so a full pass of that input is a fifth of $5 — $1. But it’s cached, so you pay 10%: 10 cents per tool call. An agent doesn’t make one tool call, though — a real task easily runs 50, each re-sending that whole context. Fifty × 10 cents = $5, just in cached input. Then output: the model can easily think and write 100k tokens across the task, at $25 per million — a tenth of a million is $2.50. Add it up: ~$7.50 for one mid-size task.
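The same walk-through as a small calculation. It follows the simplifications in the text: the context is held at 200k tokens for every tool call, every re-send is a cache hit, and the one-time cost of processing the context the first time is left out.

```python
INPUT_PRICE_PER_M = 5.00     # $ per million input tokens (Opus 4.8 figure in the text)
OUTPUT_PRICE_PER_M = 25.00   # $ per million output tokens
CACHE_HIT_FRACTION = 0.10    # cache reads billed at 10% of the input rate

CONTEXT_TOKENS = 200_000
TOOL_CALLS = 50
OUTPUT_TOKENS = 100_000


def dollars(tokens: int, price_per_million: float) -> float:
    """Cost of `tokens` at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million


cached_read = dollars(CONTEXT_TOKENS, INPUT_PRICE_PER_M) * CACHE_HIT_FRACTION  # per tool call
cached_input_total = cached_read * TOOL_CALLS
output_total = dollars(OUTPUT_TOKENS, OUTPUT_PRICE_PER_M)
task_total = cached_input_total + output_total
```

Changing `TOOL_CALLS` or `CONTEXT_TOKENS` shows how quickly the cached-input line comes to dominate the bill.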
$7.50 is a lot of money for one task, and it’s why companies that wired AI into everything are now staring at their bills — especially the ones who set up "tokens used" leaderboards and ended up with employees burning through tokens to top them. But look at where the money goes: $5 of that $7.50 is cached input — context the provider processed once and is now mostly just holding between your turns, not recomputing. Inference-layer gross margins — revenue against the bare compute to serve a request — are estimated around 70%, even though the labs still lose money overall once R&D is counted in. The price is currently set more by the value of the answer than the cost of producing it.
The cache-read charge is really where the billing stops tracking the hardware. On a cache hit in a tight loop your context is almost certainly still sitting in HBM (there’s no sense evicting it for a one-second gap), so the provider isn’t redoing the expensive part, the prefill; it’s charging a tenth of the input rate again just to hold your cache in memory for a second. The pricing is inconsistent relative to the hardware costs: holding that same context while the model generates a long output ties up the memory just as long, yet costs you nothing beyond the flat per-token output rate. Continue the example from above: each tool call costs 10 cents in cache-hit input charges. In hardware terms, you’re paying 10 cents just to hold your KV cache in HBM for 1–2 seconds. If instead your agent were actively generating tokens, it would produce fewer than 100 tokens per second, and that HBM would be tied up just the same. The billable value of 100 output tokens on Opus 4.8 is 100 × $25 / 1,000,000 = $0.0025: a quarter of a cent, or one-fortieth of what you pay just to hold the cache for that same second.
The cost-based version would follow the compute — charge prefill on the way in, then price each output token by the context length behind it, since a later token ties up more memory and burns more attention as the cache grows. Here it’s enough to notice that the flat rates are a proxy, the proxy runs loosest on exactly the agentic workloads growing fastest, and a loose proxy is margin. This disconnect between API pricing and actual compute costs also explains how AI labs are able to offer consumer subscriptions for agentic workflows like Claude Code without losing a ton of money. Enterprises are balking at their soaring AI costs because they’re on API pricing, paying the full per-token prices. More on this later — I think tokens will commoditize in the coming years.
I don’t think the buildouts are overhyped. Most people have never used an agent. They use ChatGPT or Gemini: chatbots with a little tool-calling bolted on, deliberately limited, not because better isn’t possible but because you genuinely cannot give a billion people a real agent right now. There isn’t enough compute. The good version (the AI agent that reads your files, does extensive searches, runs scripts, does the work in fifty steps) is gated to developers and power users through tools like Claude Code precisely because each session ties up as much memory as the walkthrough above showed.
That’s a supply constraint, not a demand ceiling. Once people get a taste of an agent that actually does the work, they’ll want it for everything, and the demand will be enormous — I’d bet agentic usage grows 10x or more over the next few years. I personally pay $100 a month for Claude Max and I think it’s worth every penny. That demand cannot be served on today’s fleet. Agents need more memory per GPU, more GPUs per rack, and more racks — the exact buildout being planned. The spend is chasing a product that works and that nobody can yet supply at scale.
Let me be careful about what I’m claiming, though, because there are two separate questions with different answers. Are the buildouts needed — is the compute real? Yes, unambiguously: if you want to put a genuine agent in everyone’s hands, you need these racks, and what they cost is reasonable for what they do. The question I’d actually hedge is whether demand will pay enough to recoup the CapEx at today’s prices. And here my answer runs through the next section, so I’ll only plant it: I think the CapEx gets recouped — but not necessarily by the frontier labs at $25 a million tokens. Even if enterprises take one look at a top lab’s bill and walk away, the workflows don’t go away, because the cost of actually running a model is high but not astronomical. So the worst case isn’t idle GPUs; it’s GPUs rented by the hour to run cheaper models instead of premium ones. Lower-margin than the labs’ dream — but still demand, still revenue, still the CapEx earned back.
And this is where the whole technical chain has been heading. Everything — the fifty-trillion-operation paragraph response, the KV cache eating HBM, the long context that breaks batching — lands here: the buildout is supply racing to catch a real, mechanically-constrained demand.
6. Where the money goes — the supply chain
So if we’re spending $650 billion, where does it physically go, and who builds the hardware? The chain is longer and more fragile than almost anyone outside the industry realizes — and the fragility is half the story.
It starts in the Netherlands, with ASML — the only company on Earth that makes EUV lithography machines, the machines that pattern the most advanced chips. The "EUV" is extreme ultraviolet light at a 13.5-nanometer wavelength, and producing it is one of the more absurd feats in modern industry: you fire a high-power laser at a droplet of molten tin fifty thousand times a second, blasting each droplet into a plasma that radiates 13.5 nm light, which is then gathered and focused by some of the flattest mirrors ever made — so flat that, scaled up to the size of Germany, the largest bump would be about a tenth of a millimeter tall. The tin doesn’t become the chip — it’s just how you produce light of the right wavelength to print circuitry that fine.
ASML doesn’t even do all of that itself. The laser comes from Trumpf in Germany; the mirrors from Zeiss, also Germany — and that’s before the thousands of other suppliers feeding in parts from around the world. This is a genuinely global chain, and a brittle one: several of these links are close to single points of failure for the entire frontier, scattered across continents before anything even reaches Taiwan. A single EUV machine runs about $180–220 million; the next-gen High-NA versions, around $380 million.
Those machines ship to Taiwan, to TSMC, which actually manufactures the chips. NVIDIA makes nothing physical — it designs the GPU (the army of engineers laying out the transistor pathways) and hands the design to TSMC to fabricate. That split is where the profit lives: NVIDIA captures most of the margin on the design; TSMC takes a thinner spread for the (staggeringly difficult) manufacturing.
And the GPU that comes out isn’t even one part. It’s three separately-sourced pieces fused at a single TSMC step — CoWoS, chip-on-wafer-on-substrate packaging: the logic die (needs EUV), the HBM stacks (made by just three companies, and export-controlled), and the silicon interposer they sit on. At every link this chain is not only expensive but brittle, a sequence of near-monopolies where no single step can be quickly replaced or scaled. That brittleness is exactly why compute is scarce, and why this boom is harder to satisfy than past ones: you can’t will more EUV machines or CoWoS lines into existence on a quarter’s notice.
An aside: Google’s TPUs — its in-house alternative to NVIDIA’s GPUs — are co-designed with Broadcom and also manufactured by TSMC.
Every link that makes this chain fragile also makes it a geopolitical lever — and the US has used it. The reason why America gets to set the export rules for companies that aren’t even American is intellectual property. The entire leading edge is built on US-invented, US-patented technology, and you can’t reach the frontier without it: NVIDIA is American outright; the chip-design software every fab and memory maker runs on (the EDA tools from Cadence and Synopsys) is American; ASML’s machines lean on US-origin technology and components, the light source tracing back to decades of US national-lab research. So even when the maker is Dutch (ASML), Taiwanese (TSMC), or Korean (Samsung, SK Hynix), the product is soaked in American IP — and that hands Washington the legal hook to block its sale abroad.
And where it bites lines up exactly with the three things China can’t make for itself: it can’t buy NVIDIA’s frontier GPUs (the H100 and up are blocked); it can’t buy EUV machines (ASML is barred from selling them into China — the deepest cut, because without EUV you can’t pattern leading-edge chips); and it can’t freely buy HBM.
The EUV cut is the one that compounds because EUV does two things: it packs transistors tighter, and it switches them with less energy relative to the older DUV technology. A smaller gate is a smaller switch, and a smaller switch takes less power to flip. Without EUV, China’s chipmakers — Huawei, SMIC — can still make genuinely capable chips a generation or two back; SMIC has even reached the previous generation without EUV, though at lower yield and higher cost per wafer because DUV isn’t built for it. The point isn’t that they can’t build chips — it’s what those chips cost to run. A bigger transistor is a bigger switch, and at data-center scale that shows up twice over: the chips run slower, and — the one that really bites — they run less power-efficiently, burning more energy per operation, every operation, forever. China’s compute is more expensive to run, and nothing closes that gap until they can build their own EUV, which is a moat measured in many years, not quarters.
So the supply chain is not only where the money goes but also where the US–China competition actually plays out. Two questions fall out of it: who captures all the money a structurally-scarce chain throws off (§7), and what China does once it’s shut out of the leading edge (§8).
7. Who actually profits
My read: the part of this most likely to pay off isn’t the models — it’s the buildout behind them. The picks and shovels get bought no matter whose model wins; the labs (§8) are walking into a brutal race. Going down the chain:
- NVIDIA — the clearest winner. It designs the GPUs, captures most of the margin, and sells them more or less regardless of which lab’s model runs on them. Whoever wins the model war is buying NVIDIA to fight it. Despite competition from custom silicon like Google’s TPUs, NVIDIA plays in a rapidly growing market and will most likely remain the leader for years.
- TSMC — indispensable, a genuine chokepoint, on thinner margins than NVIDIA. No frontier chip exists without it.
- ASML (+ Zeiss, Trumpf) — the lithography monopoly. No EUV, no leading-edge chips, full stop.
- The memory makers — Micron, Samsung, SK Hynix. The memory wall makes HBM the binding constraint on the entire system, which hands the three HBM suppliers real pricing power.
- Interconnect — NVIDIA again (NVLink and InfiniBand), plus the Ethernet-for-AI camp (Broadcom, Arista) and the silicon-photonics / optical-I/O and high-end cabling players. As models sprawl across more racks, moving data between them becomes its own industry.
- Cooling and power — the physical bottlenecks. Direct-to-chip liquid cooling is its own fast-growing industry — and power is increasingly "bring-your-own," with data centers building their own gas generation on-site today and betting on modular-nuclear (SMR) startups for the 2030s (SMRs aren’t yet running commercially). Utilities mostly profit only where the data centers actually draw from the grid — and a lot of new capacity deliberately bypasses it.
- Data-center builders and construction firms — unglamorous, but sitting on a resource in high demand.
- Google — the interesting hybrid: it designs its own silicon (TPUs), runs its own models on it, and rents it out on its cloud — so it can profit whether its models win or someone else’s do.
- The customers — companies that get cheaper, better output from all this and have enough pricing power to keep the gains instead of competing them away. (Pure software businesses may not be so lucky.)
Every stakeholder above gets paid as long as the buildout happens. Their fortunes don’t ride on which lab has the best model this quarter.
8. The longer game — AGI, and the money drying up
Part of the recent gains has come from scaling — more parameters, more data. The rough rule: every 10× in parameters and training data shaves off around half the model’s remaining error, subject to an unbeatable floor (note this "error" is about accuracy predicting the next token, not about how often the model gets things wrong, though they do track each other). This is the trap: the returns diminish fast, and because training compute grows roughly with parameters times data, a 10× step in both costs you a factor of 100 in compute. Going from ten billion parameters to a hundred billion was cheap; ten trillion to a hundred trillion is a different universe of compute — and the hardware won’t support it for years. A 30-trillion-parameter model doesn’t fit on state-of-the-art racks. So pure scaling is hitting a physical wall.
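The rough rule above can be tabulated directly. This is a sketch of the rule of thumb as stated, not a fitted scaling law: the halving factor comes from the text, and the factor of 100 assumes training compute scales with parameters times data. The error is normalized so that 1.0 means "whatever sits above the floor today."

```python
ERROR_FACTOR_PER_STEP = 0.5    # "around half" the remaining error per step
COMPUTE_FACTOR_PER_STEP = 100  # 10x parameters times 10x data (assumed compute ~ params * data)


def remaining_error(steps: int, start: float = 1.0) -> float:
    """Error above the floor after `steps` tenfold scale-ups, relative to `start`."""
    return start * ERROR_FACTOR_PER_STEP ** steps


def compute_multiplier(steps: int) -> int:
    """Total training compute after `steps` scale-ups, relative to the starting run."""
    return COMPUTE_FACTOR_PER_STEP ** steps


for steps in range(5):
    print(f"{steps} step(s): error above floor x{remaining_error(steps):.4f}, "
          f"compute x{compute_multiplier(steps):,}")
```

Three steps buy an eighth of the original remaining error for a million times the compute, which is the diminishing-returns trap in two columns.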
My own view is that scaling alone doesn’t get you to AGI, and the reason is context (though it’s always possible that contexts grow and models learn to reason more coherently over long ones). Where the line sits today: a model loses coherence somewhere past 200–300k tokens even though its context window is technically up to 1M; it can’t really hold a big project in mind the way you hold one in mind. Think how much you know about something you’ve worked on for months — far more than 200k words could capture — and you can see why a model that reasons over only that much can’t independently carry something genuinely hard. It’s still an astonishing tool. It’s just not a mind — not today, anyway.
But you don’t need AGI for most of what people want. You can get something that looks a lot like it by engineering around the context limit: good orchestration, agents that take notes, prune their own context, and hand sub-tasks to other agents. "Good enough" is mostly an information-engineering problem now, not a better-model problem.
Here’s the uncomfortable question for the labs. A huge share of today’s compute demand is training demand — labs burning capital to push the frontier, funded right now by enormous hype and IPO money. It’s also why you see the circular deals: Microsoft invests in OpenAI, OpenAI commits to spend on Microsoft’s cloud. These kinds of deals are expected — it doesn’t make sense for Microsoft to build its own AI when it can get it from OpenAI through a partnership — but it means a lot of the demand is the industry funding itself. What happens when the world decides the models are good enough?
I think that day comes, and when it does, the training money is what dries up first. Enterprise demand plateaus, the frontier stops feeling worth the premium, and a generation-old model that’s good enough for most real work caps what anyone can charge for the best one. The fact that makes this work links all the way back to where we started: the cost of actually running a model — the inference — is high but not astronomical. The astronomical part of today’s token prices isn’t the compute; it’s the development baked into them. That’s the part enterprises are now balking at. They’re not really saying agentic workflows aren’t worth it; they’re saying funding a frontier lab’s R&D isn’t worth it. The workflows themselves, at the true cost of inference, very much are.
Remember where §6 left China: frozen out of the hardware — no EUV, no top NVIDIA chips, no free HBM — and stuck on slower, power-hungrier compute. That corner is exactly what forced them to squeeze more out of less — distilling the American frontier models and compressing the weights — and then they give the result away, publishing the open weights for anyone to download and run.
The threat is not DeepSeek’s cheap API. No US enterprise is going to pipe its data to a Chinese server, and the steep discounts there exist partly because you’re paying in data. That’s a non-threat. The real threat is that the weights are open: an American company can download a Chinese model that’s maybe six to nine months behind the frontier, host it on American hardware with the data kept fully private, and run it at the true cost of inference — with none of the development surcharge, because it didn’t pay to develop it. And it’s cheap even hosted in the US — though not as cheap as DeepSeek’s own API. DeepSeek’s open V4 Pro runs about $1.75–$2.10 per million input tokens and $3.50–$4.40 per million output on American platforms like Fireworks or Together — roughly a third of a frontier model’s input price and a sixth of its output. That’s the number that matters for a US enterprise, not the ~$0.40-in / ~$0.90-out quoted on DeepSeek’s own API. Good enough, private, and a real discount — on American hardware. For most enterprise work, that combination wins.
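Those ratios can be checked against the prices quoted here. The sketch below uses only figures from this piece: the hosted and first-party DeepSeek prices from the paragraph above, and the $25-per-million frontier output rate used earlier. The frontier input price is not restated in this section, so only the output side is compared against the frontier. The names are illustrative.

```python
# All prices in dollars per million tokens, as quoted in the text.
FRONTIER_OUTPUT = 25.00
HOSTED_INPUT = (1.75, 2.10)    # open weights on US platforms, low and high quotes
HOSTED_OUTPUT = (3.50, 4.40)
DEEPSEEK_API_INPUT = 0.40      # approximate first-party API prices
DEEPSEEK_API_OUTPUT = 0.90


def ratio_range(price_range: tuple[float, float], reference: float) -> tuple[float, float]:
    """Express a (low, high) price range as multiples of a reference price."""
    low, high = price_range
    return low / reference, high / reference


output_vs_frontier = ratio_range(HOSTED_OUTPUT, FRONTIER_OUTPUT)
input_markup = ratio_range(HOSTED_INPUT, DEEPSEEK_API_INPUT)
output_markup = ratio_range(HOSTED_OUTPUT, DEEPSEEK_API_OUTPUT)

print(f"hosted output vs frontier: {output_vs_frontier[0]:.0%} to {output_vs_frontier[1]:.0%}")
print(f"US hosting markup on input:  {input_markup[0]:.1f}x to {input_markup[1]:.1f}x")
print(f"US hosting markup on output: {output_markup[0]:.1f}x to {output_markup[1]:.1f}x")
```

The hosted output price brackets one-sixth of the frontier rate, and hosting in the US costs roughly four to five times the first-party API, which is the privacy premium the paragraph describes.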
The frontier labs end up in a race to the bottom: selling a marginally-better model against cheap, good-enough open weights, and the instant they try to price in their development costs, the open-weight option undercuts them. The durable profit isn’t in the model — it’s in running the models, and in the buildout that supports it. The GPU owners don’t care whether they’re serving a premium model at a markup or an open one at cost; either way they’re rented by the hour, and either way they’re full (although hyperscalers do benefit from consistent and predictable inference and training demand from the AI labs).
Where the money goes in running open-weight models: it’s most likely running on American hardware (NVIDIA got paid), in a data center owned by an American company (the host got paid), for an American business that’s more productive for using it (the customer got paid). Open weights push the value down the stack — toward whoever serves the model and whoever uses it — rather than to whoever trained it.
So is there any lasting profit for a US frontier lab? Absent AGI — which they’re betting on and I’m not — I think there’s exactly one path, and it isn’t being the most capable; it’s being the most efficient. If a lab can build a model only marginally better than the open-weight option but meaningfully cheaper to run, it can sell at roughly the open-weight price and keep a real spread — more performance per unit of compute, pocketed as margin. That’s a genuine, defensible edge. It’s also a punishing one to live on, because the next lab is chasing the same efficiency gains and will undercut you the moment it catches up. Nonetheless, it’s the normal arc of a maturing technology: the first phase competes on developing the best product, and then — once the product is good enough and nobody is paying for more — the focus shifts to cost leadership.
If you do believe real ultra-powerful AI is coming, the thesis flips. Then capital itself becomes the scarce, compounding resource — you throw compute at growth, let the AI orchestrate the work, and capital reinvests itself faster than any human-run company could. The risk in that world is demand: if the AI does the jobs, who’s left with the income to buy the output? But that’s its own topic.
Either way, whether the trillions of dollars that will go towards AI buildouts are worth it isn’t really a question about chips or today’s token prices. Some of the industry is flinching at those prices and scaling back its agentic AI. But those aren’t the prices that stick, because they’re carrying the cost of an R&D race that doesn’t run forever. Strip that out — self-hosted, or open weights on American servers, at the true cost of inference — and the question gets simple: are agentic workflows worth it at what they actually cost to run? I’ve shown you what they are, what they cost, and why. At the real cost, I think they’re worth it — and that, not the frontier-lab sticker price, is the bet the buildout is actually making.